CNL 2012, Zurich, 2012-08-31

Speech-based user interfaces

  • command the computer by human speech (instead of mouse, keyboard, etc.)
  • effective and natural in many environments and for many tasks
    • hands/eyes free (car, operating room)
    • small mobile devices (personal assistants)
    • assumes: no background noise, no privacy concerns
  • increasingly popular on smart phones
    • Apple's Siri, Google Now
  • dominated by closed-source systems
  • not available for smaller languages, e.g. those with fewer than 1 million speakers
    • Google Voice Search supports 42 languages, mostly dictation-only (as of 2012-08-17)
    • Siri supports 9 languages
source: Wikipedia

Our goals

  • demonstrate the state of Estonian automatic speech recognition (ASR)
    • make widely available as a web service and on smart phones
  • build usable (90%+ precision) and useful applications, assumes:
    • single speaker
    • smaller vocabulary
    • simpler syntax
  • provide a platform for building speech-enabled applications
    • open source
    • modular
    • easy to use for a general programmer (with no knowledge of speech and language engineering)
  • build domain-specific grammars for Estonian

CNL definition

  • clearly defined syntax
    • explained to the user via construction rules and example sentences
  • clearly defined semantics
    • explained to the user via interpretation rules and example sentences
    • machine executable
  • ambiguity is controlled
    • possibly no ambiguity

Properties of speech-oriented CNLs

  • not studied explicitly (e.g. at CNL 2009, CNL 2010)
  • suitable for simpler domains (calculator, unit conversion, alarm clock)
  • cannot assume flexible editing, e.g. backtracking, look-ahead
  • units must be pronounceable (no punctuation, layout, ...)
  • acoustic cues and ambiguity
    • background noise
    • homophones, e.g. cite, site, sight (less of a problem in Estonian)
    • oronyms, e.g. I scream, ice cream
    • sometimes speech carries more cues than the standard written form, i.e. decoding into orthographic text is not always desirable

CNL grammars

for (Estonian) speech-based UIs

Requirements for the grammar formalism

  • mapping between human and machine language
  • can handle the complexities of natural language
  • developer-friendly syntax / editing environments
  • support for standard software engineering practices, e.g.
    • reusable modules
    • unit and regression testing
    • compatibility with modern programming languages
  • compatibility with open-source ASR toolkits (CMU Sphinx)
  • considered two formalisms: JSGF and GF

Java Speech Grammar Format (JSGF)

  • standard speech recognition grammar format supported by e.g. CMU Sphinx
  • simple BNF format
  • no special support for natural languages
  • little support for input normalization into a machine language
#JSGF V1.0;
grammar commands; // illustrative grammar name, not from the original slide
public <command> = <action> | (<action> and <command>);
<action> = stop {STOP} | start | pause | resume | finish {STOP};
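
The two JSGF rules can be read as "one or more actions joined by and", where only stop and finish carry a tag. The following is a minimal hand-written sketch of that reading in Java, not how CMU Sphinx processes grammars; the class name and the choice to return tags as a list are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hand-written recognizer for the two JSGF rules (illustrative only). */
final class CommandGrammarSketch {

    private static final List<String> ACTIONS =
            Arrays.asList("stop", "start", "pause", "resume", "finish");

    /**
     * Returns the tags collected while matching <command>,
     * or null if the utterance is not in the language.
     */
    static List<String> parse(String utterance) {
        String trimmed = utterance.trim();
        if (trimmed.isEmpty()) {
            return null;
        }
        String[] tokens = trimmed.split("\\s+");
        // <command> = <action> | (<action> and <command>):
        // actions at even positions, "and" at odd positions, ending on an action.
        if (tokens.length % 2 == 0) {
            return null;
        }
        List<String> tags = new ArrayList<String>();
        for (int i = 0; i < tokens.length; i++) {
            if (i % 2 == 0) {
                if (!ACTIONS.contains(tokens[i])) {
                    return null;
                }
                // Only "stop" and "finish" are tagged {STOP} in the grammar.
                if (tokens[i].equals("stop") || tokens[i].equals("finish")) {
                    tags.add("STOP");
                }
            } else if (!tokens[i].equals("and")) {
                return null;
            }
        }
        return tags;
    }
}
```

The sketch illustrates the limited normalization: "stop" and "finish" both map to STOP, while the remaining actions yield no machine-readable output at all.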

Grammatical Framework (GF)

  • functional programming language for grammar engineering
  • expressivity beyond context-free
  • parsing and generation (linearizing)
  • focus on multilinguality
    • multiple concrete grammars
    • common single abstract grammar
    • translate = parse to abstract tree + linearize into a concrete language
  • special support for natural language features
    • long distance dependencies
    • word form generation
    • resource grammar library (RGL)
  • support for speech recognition formats (incl. JSGF) via mappings

GF grammar example

-- Unitconv/Unit.gf (abstract grammar)
speed : LengthUnit -> TimeUnit -> SpeedUnit ;
time_unit : Time -> TimeUnit ;
second, minute, hour, day, week, month, year, decade, century : Time ;
-- Unitconv/UnitEst.gf (concrete grammar for Estonian)
speed = mk_meter_per_second "";
minute = mkUnit "minutit";
hour = mkUnit "tundi" "tunnis" "tundides";
-- Unitconv/UnitApp.gf (concrete grammar for a machine language)
speed = infixSS_glue "/"; -- e.g. "km/h"
minute = ss "min";
hour = ss "h";
-- lib/*.gf (resources)
ss : Str -> SS = \s -> {s = s} ;
infixSS_glue : Str -> SS -> SS -> SS = \f,x,y -> ss (glue x.s f y.s) ;

Our speech application grammars

  • implemented in GF, compilable to JSGF
  • target small domain languages: calculator, map query, alarm clock, ...
  • linguistically quite simple
    • fixed word order
    • little morphological variation
  • simplicity motivated by
    • need for reliable ASR
    • simple domains
  • available on GitHub
    • in source format for collaborative development
    • in GF's portable grammar format (PGF) to be used in applications

Two types of concrete languages

described by every grammar

  • type 1 (currently only Estonian)
    • CNL
    • spoken, i.e. each token corresponds to a sequence of phonemes
    • possibly (syntactically) ambiguous
    • input to GF parsing
    • respective grammar compiled to JSGF for ASR
  • type 2
    • provides semantics for the CNL
    • machine-executable (Google Search, Wolfram Alpha, ...)
    • output of GF linearization

GF-based translation

Human says /one plus two/, machine responds by displaying 3

Translation
  • ASR/GF: transcribe speech input using a JSGF grammar automatically derived from a GF grammar
  • GF: parse transcription into an abstract tree
  • GF: linearize tree into application format(s)
  • app: evaluate application format(s)
  • app: communicate evaluation result(s) to the user
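
The parse, linearize and evaluate steps can be sketched in plain Java for the "one plus two" example. This is a toy stand-in under stated assumptions, not GF or its runtime API: the class, the mini-lexicon (English words, for readability) and the fixed "number operator number" shape are all hypothetical, and the ASR step is assumed to have already produced the transcription.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for the parse/linearize/evaluate steps; not GF itself. */
final class TranslationPipelineSketch {

    /** Abstract tree for "<number> <operator> <number>". */
    static final class Tree {
        final int left;
        final String op;
        final int right;

        Tree(int left, String op, int right) {
            this.left = left;
            this.op = op;
            this.right = right;
        }
    }

    // Hypothetical mini-lexicon.
    private static final Map<String, Integer> NUMBERS = new HashMap<String, Integer>();
    private static final Map<String, String> OPERATORS = new HashMap<String, String>();
    static {
        NUMBERS.put("one", 1);
        NUMBERS.put("two", 2);
        NUMBERS.put("three", 3);
        OPERATORS.put("plus", "+");
        OPERATORS.put("minus", "-");
        OPERATORS.put("times", "*");
    }

    /** Parse a transcription into a tree; null if outside the language. */
    static Tree parse(String transcription) {
        String[] t = transcription.trim().split("\\s+");
        if (t.length != 3 || !NUMBERS.containsKey(t[0])
                || !OPERATORS.containsKey(t[1]) || !NUMBERS.containsKey(t[2])) {
            return null;
        }
        return new Tree(NUMBERS.get(t[0]), OPERATORS.get(t[1]), NUMBERS.get(t[2]));
    }

    /** Linearize the tree into the application format. */
    static String linearize(Tree tree) {
        return "(" + tree.left + " " + tree.op + " " + tree.right + ")";
    }

    /** Evaluate; a real app would evaluate the linearized string instead. */
    static int evaluate(Tree tree) {
        switch (tree.op) {
            case "+": return tree.left + tree.right;
            case "-": return tree.left - tree.right;
            default:  return tree.left * tree.right;
        }
    }
}
```

Keeping the tree separate from both the spoken form and the application format is what lets one grammar serve several output formats.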

Grammar module hierarchy

Grammar modules

Main grammars

  • Action: covers everything, i.e. Alarm + Direction + Calc
  • Alarm: simple 24h clock, e.g. "alarm 06:05", "alarm in 20 minutes"
  • Calc: Expr + Unitconv; optional prefix "kui palju on" (how much is) to help ASR
  • Direction: FROM-TO queries over Estonian place names and Tallinn's street addresses
  • Expr: arithmetical expressions, e.g. (((1 + -2.3) * PI) ^ 5)
  • Numeral: integers from -10^12 to 10^12; imported by most other grammars
  • Symbols, Estvrp: sequences of digits and letters
  • Unitconv: unit and currency conversion expressions, e.g. convert 12.34 km^2 to ft^2

Direction

Est: algus sõpruse puiestee kakssada neliteist lõpp lossi plats kaks
     (gloss: start sõpruse puiestee two hundred fourteen end lossi plats two)
App: FROM Sõpruse puiestee 214, Tallinn TO Lossi plats 2, Tallinn
  • syntactically simple (from street A 123 to street B 321)
  • good coverage (although could be better)
    • 4330 names of Estonian populated places (source: GeoNames)
    • 1500 names of Tallinn's streets (source: Estonian Language Institute)
  • shortcomings:
    • does not model naming variation:
      • (August|A) Weizenbergi (tänav) 39
      • Estonian/Swedish parallel names (very few)
    • does not handle ambiguity, e.g. villages with the same name but different location
    • only nominative case
    • over-generates with house numbers

Unitconv

Est: sada koma üks miili ruut tunnis meetrites ruut sekundis
     (gloss: hundred point one miles per square hour in meters per square second)
App: convert 100.1 mi*h^-2 to m*s^-2
  • type-aware unit conversion expressions
    • syntax error: "convert 10 USD to km/h"
  • covers most important units and the main currencies
  • automatically generates required morphological forms
  • supports some syntactic sugar
    • USD can be expressed by "dollar", "ameerika dollar" or "ameerika raha"
  • supports some ambiguity
    • "two euros in large currency" is ambiguous between ~5 readings
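
The idea that a mistyped conversion is rejected by the grammar itself, rather than by a later runtime check, can be imitated in Java with phantom type parameters. This is only an analogy to GF's typed abstract syntax, not the actual Unitconv grammar; all class, interface and constant names below are hypothetical.

```java
/** Sketch: unit dimensions as type parameters, so mismatches fail to compile. */
final class UnitconvTypeSketch {

    // Phantom types standing in for grammar categories (hypothetical names).
    interface Length {}
    interface Speed {}
    interface Currency {}

    /** A unit symbol tagged with its dimension D. */
    static final class Unit<D> {
        final String symbol;

        Unit(String symbol) {
            this.symbol = symbol;
        }
    }

    static final Unit<Length> KM = new Unit<Length>("km");
    static final Unit<Length> MI = new Unit<Length>("mi");
    static final Unit<Speed> KM_H = new Unit<Speed>("km/h");
    static final Unit<Currency> USD = new Unit<Currency>("USD");
    static final Unit<Currency> EUR = new Unit<Currency>("EUR");

    /** Both units must share the same dimension D. */
    static <D> String convert(String amount, Unit<D> from, Unit<D> to) {
        return "convert " + amount + " " + from.symbol + " to " + to.symbol;
    }

    // convert("10", USD, KM_H) does not compile: no single D fits both
    // Unit<Currency> and Unit<Speed>.
}
```

For example, `UnitconvTypeSketch.convert("10", UnitconvTypeSketch.USD, UnitconvTypeSketch.EUR)` yields "convert 10 USD to EUR", while the USD to km/h request is rejected before any code runs.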

Expr

Est: Pii pluss üks miinus kaks korda kolm jagatud neli astmel viis
     (gloss: pi plus one minus two times three divided by four to the power of five)
App: (((((PI + 1) - 2) * 3) / 4) ^ 5)
  • arithmetical expressions with the 5 main operations
  • tiny vocabulary
  • infinitely many expressions
  • left-associative interpretation
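
The left-associative interpretation amounts to folding the operand/operator sequence from left to right, ignoring conventional operator precedence. A minimal Java sketch of that fold, assuming the spoken words have already been mapped to symbols (the class and method names are hypothetical):

```java
/** Sketch: left-to-right folding of an alternating operand/operator sequence. */
final class LeftAssocSketch {

    /**
     * Tokens alternate operand, operator, operand, ...
     * Each step wraps everything read so far in one pair of parentheses.
     */
    static String leftAssociate(String... tokens) {
        if (tokens.length % 2 == 0) {
            throw new IllegalArgumentException("expected operand (operator operand)*");
        }
        String expr = tokens[0];
        for (int i = 1; i < tokens.length; i += 2) {
            expr = "(" + expr + " " + tokens[i] + " " + tokens[i + 1] + ")";
        }
        return expr;
    }
}
```

Applied to PI + 1 - 2 * 3 / 4 ^ 5, the fold reproduces the App line above. Note the consequence for users: "1 + 2 * 3" is read as ((1 + 2) * 3), which matches the order in which the words are spoken but not the usual precedence rules.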

Platform

for developing speech-based UIs

Open source stack

  • cloud service
    • real-time ASR (optionally grammar-based) of streaming audio
    • HTTP/REST/JSON
    • users can upload their GF and JSGF grammars
  • ASR system
    • CMU Sphinx (Pocketsphinx) decoder
    • supports (JSGF) grammars and n-gram language models
    • Estonian acoustic models
  • grammar development
    • GF
    • existing modules in a GitHub repository
  • app development
    • Android
    • extended RecognizerIntent-API supported by Kõnele

System architecture

  • CNL grammars: PGF files accessible over HTTP
  • ASR server: transcribes speech; uses grammars
  • Kõnele (speak!)
    • maps apps to grammars
    • records speech and transcribes it using the server
  • Arvutaja (the one who computes)
    • maps voice commands to actions (possibly carried out by other apps)
    • transcribes speech using Kõnele with the Action-grammar

Setting an app to use grammar-based ASR

via Kõnele's configuration panel (as an end-user)

Programming your app to use Kõnele's API

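import android.content.ComponentName;
import android.content.Intent;
import android.speech.RecognizerIntent;

// Set of non-standard extras that K6nele supports
public static final String EXTRA_GRAMMAR_URL =
	"ee.ioc.phon.android.extra.GRAMMAR_URL";
public static final String EXTRA_GRAMMAR_TARGET_LANG =
	"ee.ioc.phon.android.extra.GRAMMAR_TARGET_LANG";
// ...
// Address the standard recognition intent explicitly to Kõnele's activity
Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.setComponent(new ComponentName(
	"ee.ioc.phon.android.speak",
	"ee.ioc.phon.android.speak.RecognizerIntentActivity"));
// PGF file of the grammar to be used for recognition
intent.putExtra(EXTRA_GRAMMAR_URL,
	"http://kaljurand.github.com/Grammars/grammars/pgf/Action.pgf");
// Concrete language ("App") in which the result should be returned
intent.putExtra(EXTRA_GRAMMAR_TARGET_LANG, "App");
// ...
// The result is delivered to onActivityResult() with this request code
startActivityForResult(intent, 1234);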

Arvutaja

front-end to the Action-grammar

[Screenshots: Arvutaja; Arvutaja listening; Arvutaja with WolframAlpha]

Results

  • grammars
    • enable useful applications with high precision (90%+) speech recognition
    • scalable at least to ~6000 terminals
  • development platform
    • easy to use
    • flexible
    • scalable to multiple natural/formal languages (thanks to GF)
  • successfully demonstrated/demonstrates Estonian ASR
    • Kõnele: 10,000+ downloads
    • Arvutaja: 4,500+ downloads
    • award: Estonian Language Deed 2011
    • however, daily usage is still small :(

Future work

  • support multiple natural languages based on GF's resource grammar library
  • mixing grammar-based and n-gram based models
    • "email Bob at work I'm running late"
  • dialog (grammar-based)
    • error recovery
    • awareness of past input/output
    • text-to-speech
  • developer tools optimized for speech-oriented CNLs
    • search for potential ambiguity (resulting from e.g. homophones)
    • propose changes to the grammars based on query log analysis
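
One way a homophone search could work is to group the grammar's vocabulary by pronunciation and report every group with more than one word. The Java sketch below assumes a pronunciation dictionary is available as a word-to-phonemes map; the class name and the phoneme strings in the usage note are hypothetical, and this is not a description of an existing tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch: find words in a vocabulary that share one pronunciation. */
final class HomophoneSearchSketch {

    /**
     * Input: word -> phoneme string (assumed to come from a pronunciation
     * dictionary). Output: phoneme string -> words, only for groups of 2+.
     */
    static Map<String, List<String>> homophones(Map<String, String> pronunciations) {
        Map<String, List<String>> byPhonemes = new TreeMap<String, List<String>>();
        for (Map.Entry<String, String> entry : pronunciations.entrySet()) {
            List<String> words = byPhonemes.get(entry.getValue());
            if (words == null) {
                words = new ArrayList<String>();
                byPhonemes.put(entry.getValue(), words);
            }
            words.add(entry.getKey());
        }
        Map<String, List<String>> result = new TreeMap<String, List<String>>();
        for (Map.Entry<String, List<String>> entry : byPhonemes.entrySet()) {
            if (entry.getValue().size() > 1) {
                result.put(entry.getKey(), entry.getValue());
            }
        }
        return result;
    }
}
```

Given a toy dictionary in which cite, site and sight all map to the same phoneme string, the method returns a single group containing those three words; a grammar developer could then check whether any of them can occur in the same syntactic position.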

Summary

  • speech-based UIs
  • CNLs for such UIs
  • GF as an effective formalism/tool for implementing such CNLs
  • open/extendable platform for building such UIs (also for small languages)
  • voice actions smart phone app for Estonian

Links