Controlled natural language in speech recognition based user interfaces

CNL 2012, Zurich, 2012-08-31

Speech-based user interfaces

command the computer by human speech (instead of mouse, keyboard, etc.)
effective and natural in many environments and for many tasks
- hands/eyes free (car, operating room)
- small mobile devices (personal assistants)
- assumes: no background noise, no privacy concerns
increasingly popular on smart phones
- Apple's Siri, Google Now
dominated by closed-source systems
not available for smaller languages, e.g. those with fewer than 1 million speakers
- Google Voice Search supports 42 languages, mostly dictation-only (2012-08-17)
- Siri supports 9 languages

Our goals

demonstrate the state of Estonian automatic speech recognition (ASR)
- make widely available as a web service and on smart phones
build usable (90%+ precision) and useful applications, assuming:
- single speaker
- smaller vocabulary
- simpler syntax
provide a platform for building speech-enabled applications
- open source
- modular
- easy to use for a general programmer (with no knowledge of speech and language engineering)
build domain-specific grammars for Estonian

CNL definition

clearly defined syntax
- explained to the user via construction rules and example sentences
clearly defined semantics
- explained to the user via interpretation rules and example sentences
- machine executable
ambiguity is controlled
- possibly no ambiguity

Properties of speech-oriented CNLs

not studied explicitly (e.g. at CNL2009, CNL2010)
suitable for simpler domains (calculator, unit conversion, alarm clock)
cannot assume flexible editing, e.g. backtracking, look-ahead
units must be pronounceable (no punctuation, layout, ...)
acoustic cues and ambiguity
- background noise
- homophones, e.g. cite, site, sight (less of a problem in Estonian)
- oronyms, e.g. I scream, ice cream
- speech sometimes gives more cues than the standard written form, so it can be better not to decode into orthographic text

CNL grammars

for (Estonian) speech-based UIs

Requirements for the grammar formalism

mapping between human and machine language
- "quarter to ten in the morning" vs 09:45
- "lecture hall 2.A.01" vs 47.414736,8.548753
can handle the complexities of natural language
developer-friendly syntax / editing environments
support for standard software engineering practices, e.g.
- reusable modules
- unit and regression testing
- compatibility with modern programming languages
compatibility with open-source ASR toolkits (CMU Sphinx)
considered two formalisms: JSGF and GF
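
The first requirement, mapping between human and machine language, can be illustrated with a minimal Java sketch. It is not how the grammars in this talk are implemented (they use GF); the class name SpokenTime, the English vocabulary and the fixed six-word phrase pattern are all assumptions made for illustration.

```java
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SpokenTime {
    // Hypothetical mini-vocabulary; index + 1 is the hour on a 12-hour clock.
    private static final List<String> HOURS = List.of(
        "one", "two", "three", "four", "five", "six",
        "seven", "eight", "nine", "ten", "eleven");
    private static final Map<String, Integer> OFFSETS = Map.of("quarter", 15, "half", 30);

    /** Maps "<quarter|half> <to|past> <hour> in the <morning|evening>" to HH:MM. */
    static String toClock(String phrase) {
        String[] w = phrase.trim().toLowerCase(Locale.ROOT).split("\\s+");
        boolean wellFormed = w.length == 6
            && OFFSETS.containsKey(w[0])
            && (w[1].equals("to") || w[1].equals("past"))
            && HOURS.contains(w[2])
            && w[3].equals("in") && w[4].equals("the")
            && (w[5].equals("morning") || w[5].equals("evening"));
        if (!wellFormed) {
            throw new IllegalArgumentException("not in the language: " + phrase);
        }
        // Work in minutes since midnight so that "to" can borrow from the hour.
        int hour = HOURS.indexOf(w[2]) + 1 + (w[5].equals("evening") ? 12 : 0);
        int offset = OFFSETS.get(w[0]);
        int minutes = hour * 60 + (w[1].equals("to") ? -offset : offset);
        return String.format(Locale.ROOT, "%02d:%02d", minutes / 60, minutes % 60);
    }
}
```

SpokenTime.toClock("quarter to ten in the morning") returns 09:45. Anything outside the pattern is rejected rather than guessed, which is the CNL behaviour one would want from a speech front-end.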

Java Speech Grammar Format (JSGF)

standard speech recognition grammar format supported by e.g. CMU Sphinx
simple BNF format
no special support for natural languages
little support for input normalization into a machine language

#JSGF V1.0;
grammar commands; // hypothetical grammar name
public <command> = <action> | (<action> and <command>);
<action> = stop {STOP} | start | pause | resume | finish {STOP};
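
To make the role of the {STOP} tags concrete, here is a hand-written Java recognizer for exactly this two-rule grammar. It is only a sketch: a real system would let the ASR decoder interpret the JSGF file, and the policy "emit the tag if there is one, otherwise the word itself" is an assumption, since JSGF leaves tag interpretation to the application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Optional;

final class CommandGrammar {
    // <action> alternatives: word -> tag, or the word itself when the rule has no tag.
    private static final Map<String, String> ACTIONS = Map.of(
        "stop", "STOP",
        "start", "start",
        "pause", "pause",
        "resume", "resume",
        "finish", "STOP");

    /** Returns the normalized actions, or empty if the utterance is not a <command>. */
    static Optional<List<String>> normalize(String utterance) {
        String[] tokens = utterance.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            if (i % 2 == 0) {
                // Even positions must be an <action>.
                String tag = ACTIONS.get(tokens[i]);
                if (tag == null) return Optional.empty();
                out.add(tag);
            } else if (!tokens[i].equals("and")) {
                // Odd positions must be the literal "and".
                return Optional.empty();
            }
        }
        // A <command> always ends with an <action>, so the token count is odd.
        return tokens.length % 2 == 1 ? Optional.of(out) : Optional.empty();
    }
}
```

For "stop and start and finish" this yields [STOP, start, STOP]: two different spoken words collapse to one machine symbol, which is about as much input normalization as JSGF tags offer.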

Grammatical Framework (GF)

functional programming language for grammar engineering
expressivity beyond context-free
parsing and generation (linearizing)
focus on multilinguality
- multiple concrete grammars
- common single abstract grammar
- translate = parse to abstract tree + linearize into a concrete language
special support for natural language features
- long distance dependencies
- word form generation
- resource grammar library (RGL)
support for speech recognition formats (incl. JSGF) via mappings

GF grammar example

-- Unitconv/Unit.gf (abstract grammar)
speed : LengthUnit -> TimeUnit -> SpeedUnit ;
time_unit : Time -> TimeUnit ;
second, minute, hour, day, week, month, year, decade, century : Time ;

-- Unitconv/UnitEst.gf (concrete grammar for Estonian)
speed = mk_meter_per_second "";
minute = mkUnit "minutit";
hour = mkUnit "tundi" "tunnis" "tundides";

-- Unitconv/UnitApp.gf (concrete grammar for a machine language)
speed = infixSS_glue "/"; -- e.g. "km/h"
minute = ss "min";
hour = ss "h";

-- lib/*.gf (resources)
ss : Str -> SS = \s -> {s = s} ;
infixSS_glue : Str -> SS -> SS -> SS = \f,x,y -> ss (glue x.s f y.s) ;
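
For readers who do not know GF, the split between one abstract grammar and several concrete grammars can be approximated in Java. This is a rough analogy, not GF semantics: the interface Concrete, the unit kilometer and the English wording are assumptions, and the time_unit coercion from the excerpt is left out for brevity.

```java
import java.util.function.Function;

final class UnitLinearization {
    /** One method per abstract function; every concrete grammar implements all of them. */
    interface Concrete {
        String hour();
        String minute();
        String kilometer();                       // hypothetical LengthUnit, not in the excerpt
        String speed(String length, String time); // speed : LengthUnit -> TimeUnit -> SpeedUnit
    }

    /** Machine language, mirroring UnitApp.gf: glue the parts around "/". */
    static final Concrete APP = new Concrete() {
        public String hour() { return "h"; }
        public String minute() { return "min"; }
        public String kilometer() { return "km"; }
        public String speed(String length, String time) { return length + "/" + time; }
    };

    /** Hypothetical English concrete grammar, present only to show a second language. */
    static final Concrete ENG = new Concrete() {
        public String hour() { return "hour"; }
        public String minute() { return "minute"; }
        public String kilometer() { return "kilometers"; }
        public String speed(String length, String time) { return length + " per " + time; }
    };

    /** The abstract tree speed(kilometer, hour), independent of any concrete language. */
    static final Function<Concrete, String> KM_PER_HOUR =
        g -> g.speed(g.kilometer(), g.hour());
}
```

Applying KM_PER_HOUR to APP gives km/h and applying it to ENG gives "kilometers per hour". The tree is written once, and each new language only has to supply its own implementation of the interface.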

Our speech application grammars

implemented in GF, compilable to JSGF
target small domain languages: calculator, map query, alarm clock, ...
linguistically quite simple
- fixed word order
- little morphological variation
simplicity motivated by
- need for reliable ASR
- simple domains
available on GitHub
- in source format for collaborative development
- in GF's portable grammar format (PGF) to be used in applications

Two types of concrete languages

described by every grammar

type 1 (currently only Estonian)
- CNL
- spoken, i.e. each token corresponds to a sequence of phonemes
- possibly (syntactically) ambiguous
- input to GF parsing
- respective grammar compiled to JSGF for ASR
type 2
- provides semantics for the CNL
- machine-executable (Google Search, Wolfram Alpha, ...)
- output of GF linearization

GF-based translation

Human says /one plus two/, machine responds by displaying 3

ASR/GF: transcribe speech input using a JSGF grammar automatically derived from a GF grammar
GF: parse transcription into an abstract tree
GF: linearize tree into application format(s)
app: evaluate application format(s)
app: communicate evaluation result(s) to the user
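
The steps after transcription can be sketched end to end in Java. This is a toy stand-in for the GF runtime, under several assumptions: an English vocabulary of ten digit words and two operators, a parenthesized application format, and evaluation performed on the tree rather than on the linearized string.

```java
import java.util.List;
import java.util.Map;

final class VoiceCalcPipeline {
    /** Abstract tree: either a number leaf (op == null) or a binary operation. */
    static final class Tree {
        final String op; final Tree left; final Tree right; final int value;
        Tree(int value) { this.op = null; this.left = null; this.right = null; this.value = value; }
        Tree(String op, Tree left, Tree right) { this.op = op; this.left = left; this.right = right; this.value = 0; }
    }

    private static final List<String> DIGITS = List.of(
        "zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine");
    private static final Map<String, String> OPS = Map.of("plus", "+", "minus", "-");

    /** Parses a transcription such as "one plus two" into a left-nested tree. */
    static Tree parse(String transcription) {
        String[] w = transcription.trim().split("\\s+");
        if (w.length % 2 == 0) throw new IllegalArgumentException("dangling operator");
        Tree tree = number(w[0]);
        for (int i = 1; i < w.length; i += 2) {
            String op = OPS.get(w[i]);
            if (op == null) throw new IllegalArgumentException("unknown operator: " + w[i]);
            tree = new Tree(op, tree, number(w[i + 1]));
        }
        return tree;
    }

    private static Tree number(String word) {
        int v = DIGITS.indexOf(word);
        if (v < 0) throw new IllegalArgumentException("unknown number: " + word);
        return new Tree(v);
    }

    /** Linearizes the tree into a (hypothetical) application format. */
    static String linearize(Tree t) {
        return t.op == null ? Integer.toString(t.value)
            : "(" + linearize(t.left) + " " + t.op + " " + linearize(t.right) + ")";
    }

    /** Evaluates the tree; a real app would evaluate the linearized string instead. */
    static int evaluate(Tree t) {
        if (t.op == null) return t.value;
        int l = evaluate(t.left);
        int r = evaluate(t.right);
        return t.op.equals("+") ? l + r : l - r;
    }
}
```

With the transcription "one plus two", parse produces a tree that linearizes to (1 + 2) and evaluates to 3, which is what the user would see on the screen.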

Grammar module hierarchy

Main grammars

Action	covers everything, i.e. Alarm + Direction + Calc
Alarm	simple 24h-clock, e.g. "alarm 06:05", "alarm in 20 minutes"
Calc	Expr + Unitconv; optional prefix "kui palju on" (how much is) to help ASR
Direction	FROM-TO queries over Estonian place names and Tallinn's street addresses
Expr	arithmetical expressions, e.g. (((1 + -2.3) * PI) ^ 5)
Numeral	integers -10^12–10^12; imported by most other grammars
Symbols, Estvrp	sequence of digits and letters
Unitconv	unit and currency conversion expressions, e.g. convert 12.34 km^2 to ft^2

Direction

Est: algus sõpruse puiestee kakssada neliteist lõpp lossi plats kaks
App: FROM Sõpruse puiestee 214, Tallinn TO Lossi plats 2, Tallinn

syntactically simple (from street A 123 to street B 321)
good coverage (although could be better)
- 4330 names of Estonian populated places (source: GeoNames)
- 1500 names of Tallinn's streets (source: Estonian Language Institute)
shortcomings:
- does not model naming variation:
  - (August|A) Weizenbergi (tänav) 39
  - Estonian/Swedish parallel names (very few)
- does not handle ambiguity, e.g. villages with the same name but different location
- only nominative case
- over-generates with house numbers

Unitconv

Est: sada koma üks miili ruut tunnis meetrites ruut sekundis
App: convert 100.1 mi*h^-2 to m*s^-2

type-aware unit conversion expressions
- syntax error: "convert 10 USD to km/h"
covers most important units and the main currencies
automatically generates required morphological forms
supports some syntactic sugar
- USD can be expressed by "dollar", "ameerika dollar" or "ameerika raha"
supports some ambiguity
- "two euros in large currency" is ambiguous between ~5 readings

Expr

Est: Pii pluss üks miinus kaks korda kolm jagatud neli astmel viis
App: (((((PI + 1) - 2) * 3) / 4) ^ 5)

arithmetical expressions with the 5 main operations
tiny vocabulary
infinitely many expressions
left-associative interpretation
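
The left-associative reading can be reproduced with a simple left fold over the token sequence. The following Java sketch works on already-normalized tokens (PI, numbers, operator symbols) and assumes a well-formed alternation of operands and operators; the class name LeftAssoc is hypothetical.

```java
import java.util.List;

final class LeftAssoc {
    /** Folds "a op b op c ..." from the left into a fully parenthesized string. */
    static String parenthesize(List<String> tokens) {
        String acc = tokens.get(0);
        for (int i = 1; i + 1 < tokens.size(); i += 2) {
            acc = "(" + acc + " " + tokens.get(i) + " " + tokens.get(i + 1) + ")";
        }
        return acc;
    }

    /** Evaluates the same fold numerically, ignoring the usual operator precedence. */
    static double evaluate(List<String> tokens) {
        double acc = value(tokens.get(0));
        for (int i = 1; i + 1 < tokens.size(); i += 2) {
            acc = apply(tokens.get(i), acc, value(tokens.get(i + 1)));
        }
        return acc;
    }

    private static double value(String token) {
        return token.equals("PI") ? Math.PI : Double.parseDouble(token);
    }

    private static double apply(String op, double x, double y) {
        switch (op) {
            case "+": return x + y;
            case "-": return x - y;
            case "*": return x * y;
            case "/": return x / y;
            case "^": return Math.pow(x, y);
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }
}
```

For the tokens PI + 1 - 2 * 3 / 4 ^ 5, parenthesize returns the App string shown above. The consequence for users is that "1 + 2 * 3" is read as ((1 + 2) * 3) and evaluates to 9, not 7.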

Platform

for developing speech-based UIs

Open source stack

cloud service
- real-time ASR (optionally grammar-based) of streaming audio
- HTTP/REST/JSON
- users can upload their GF and JSGF grammars
ASR system
- CMU Sphinx (Pocketsphinx) decoder
- supports (JSGF) grammars and n-gram language models
- Estonian acoustic models
grammar development
- GF
- existing modules in a GitHub repository
app development
- Android
- extended RecognizerIntent-API supported by Kõnele

System architecture

CNL grammars: PGF files accessible over HTTP
ASR server: transcribes speech; uses grammars
Kõnele (speak!)
- maps apps to grammars
- records speech and transcribes it using the server
Arvutaja (the one who computes)
- maps voice commands to actions (possibly carried out by other apps)
- transcribes speech using Kõnele with the Action-grammar

Setting an app to use grammar-based ASR

via Kõnele's configuration panel (as an end-user)

Programming your app to use Kõnele's API

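import android.content.ComponentName;
import android.content.Intent;
import android.speech.RecognizerIntent;

// Set of non-standard extras that K6nele supports
public static final String EXTRA_GRAMMAR_URL =
	"ee.ioc.phon.android.extra.GRAMMAR_URL";
public static final String EXTRA_GRAMMAR_TARGET_LANG =
	"ee.ioc.phon.android.extra.GRAMMAR_TARGET_LANG";
// ...
// Standard Android speech recognition intent, addressed explicitly to the recognizer activity
Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.setComponent(new ComponentName(
	"ee.ioc.phon.android.speak",
	"ee.ioc.phon.android.speak.RecognizerIntentActivity"));
// PGF grammar that constrains the recognition
intent.putExtra(EXTRA_GRAMMAR_URL,
	"http://kaljurand.github.com/Grammars/grammars/pgf/Action.pgf");
// Concrete language into which the transcription is linearized
intent.putExtra(EXTRA_GRAMMAR_TARGET_LANG, "App");
// ...
// The result is delivered to onActivityResult() with this request code
startActivityForResult(intent, 1234);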

Arvutaja

front-end to the Action-grammar

Results

grammars
- enable useful applications with high precision (90%+) speech recognition
- scalable at least to ~6000 terminals
development platform
- easy to use
- flexible
- scalable to multiple natural/formal languages (thanks to GF)
successfully demonstrates Estonian ASR
- Kõnele: 10,000+ downloads
- Arvutaja: 4,500+ downloads
- award: Estonian Language Deed 2011
- however, daily usage is still small :(

Future work

support multiple natural languages based on GF's resource grammar library
mixing grammar-based and n-gram based models
- "email Bob at work I'm running late"
dialog (grammar-based)
- error recovery
- awareness of past input/output
- text-to-speech
developer tools optimized for speech-oriented CNLs
- search for potential ambiguity (resulting from e.g. homophones)
- propose changes to the grammars based on query log analysis
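
A first version of the ambiguity search could simply group grammar phrases by their pronunciation. The Java sketch below assumes a pronunciation lexicon is available; the small ARPAbet-style DEMO_LEXICON is a hand-written illustrative fragment, not data from the project, and comparing concatenated phoneme strings ignores prosody and other acoustic cues.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class HomophoneFinder {
    // Hypothetical lexicon fragment: word -> space-separated phonemes.
    static final Map<String, String> DEMO_LEXICON = Map.of(
        "cite", "S AY T", "site", "S AY T", "sight", "S AY T",
        "i", "AY", "scream", "S K R IY M",
        "ice", "AY S", "cream", "K R IY M",
        "start", "S T AA R T");

    /** Concatenates the phonemes of each word, so that word boundaries disappear. */
    static String pronounce(String phrase, Map<String, String> lexicon) {
        StringBuilder sb = new StringBuilder();
        for (String word : phrase.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            String phones = lexicon.get(word);
            if (phones == null) throw new IllegalArgumentException("not in lexicon: " + word);
            if (sb.length() > 0) sb.append(' ');
            sb.append(phones);
        }
        return sb.toString();
    }

    /** Groups phrases by pronunciation and keeps only groups with two or more members. */
    static Map<String, List<String>> ambiguous(List<String> phrases, Map<String, String> lexicon) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String phrase : phrases) {
            groups.computeIfAbsent(pronounce(phrase, lexicon), k -> new ArrayList<>()).add(phrase);
        }
        groups.values().removeIf(group -> group.size() < 2);
        return groups;
    }
}
```

Run over the phrases cite, site, sight, "i scream", "ice cream" and start, this reports two groups: the three homophones and the oronym pair. A grammar developer could then rename one member of each group or make sure both map to the same meaning.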

Summary

speech-based UIs
CNLs for such UIs
GF as an effective formalism/tool for implementing such CNLs
open/extendable platform for building such UIs (also for small languages)
voice actions smart phone app for Estonian
