Building Speech-based User Interfaces with GF

Kaarel Kaljurand

Third GF Summer School 2013, Frauenchiemsee, Bavaria

2013-08-28

Overview

speech-based user interfaces
controlled natural language for such UIs
implementing CNL grammars for speech-based UIs in GF
platform for building speech-based UIs, based on
- GF grammars
- speech recognition server
- Android
- Android apps Kõnele and Arvutaja
results and future work

Speech-based user interfaces

command the computer by human speech (instead of mouse, keyboard, etc.)
effective and natural in many environments and for many tasks
- hands/eyes free (car, operating room)
- small mobile devices (personal assistants)
- assumes: no background noise, no privacy concerns
increasingly popular on smart phones
- Apple's Siri, Google Now, Windows Phone TellMe
- lots of intelligent assistant apps on Android using APIs from Google, Nuance, AT&T Watson
dominated by closed-source systems
natural language specific
- not available for smaller languages, e.g. those with fewer than 1 million speakers
- Google Voice Search (i.e. dictation) supports ~45 languages (2013-07-25) ...
- but Google Now voice actions support fewer languages
- Siri supports 9 languages

Components of speech recognition

acoustic model = mapping of sound to phonetic symbols
- derived from transcribed audio corpora
grapheme-to-phoneme converter
- "grammatical" -> "G R AH M AE T AH K AH L"
- "grammatical(2)" -> "G R AH M AE T IH K AH L"
- "framework" -> "F R EY M W ER K"
language model = valid sequences of orthographic words (in the given domain)
- n-gram model derived from text corpora
- explicit grammar
interpretation = mapping the word sequence to some application format
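
The simplest grapheme-to-phoneme converter is a lookup dictionary. The sketch below is only an illustration: it assumes whitespace-separated lines of the form "word phonemes", with pronunciation variants marked as "word(2)" as in the examples above. The class name PronDict and its methods are hypothetical, not part of the platform.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class PronDict {
    // word -> pronunciations, each a space-separated phoneme string
    private final Map<String, List<String>> entries = new LinkedHashMap<>();

    /** Adds one line such as "grammatical(2) G R AH M AE T IH K AH L". */
    void addLine(String line) {
        String[] parts = line.trim().split("\\s+", 2);
        if (parts.length < 2) {
            throw new IllegalArgumentException("expected WORD PHONEMES, got: " + line);
        }
        // Strip a variant marker like "(2)" so that all variants share one key.
        String word = parts[0].replaceAll("\\(\\d+\\)$", "");
        entries.computeIfAbsent(word, k -> new ArrayList<>()).add(parts[1]);
    }

    /** Returns all known pronunciations of the word (empty if unknown). */
    List<String> lookup(String word) {
        return entries.getOrDefault(word, Collections.<String>emptyList());
    }
}
```

A word that is missing from the dictionary yields an empty list; a real converter would fall back to letter-to-sound rules at that point.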

Original goals of the project

build usable (90%+ precision) and useful applications, assuming:
- single speaker
- smaller vocabulary
- simpler syntax
provide a platform for building speech-enabled applications
- open source
- modular
- easy to use for a programmer with no knowledge of speech and language engineering
- available as a web service and on smart phones
focus on Estonian
- demonstrate the state of Estonian automatic speech recognition (ASR)
- build domain-specific grammars for Estonian

History of the project

started mid 2011 (around the time of GF Summer School 2011), as part of a speech recognition project at Tallinn Technical University (Tanel Alumäe)
end of 2011: first version of grammars (for Estonian) and platform
late 2012: added support for English
2013: minor improvements to the grammars and apps

CNL for speech-based UIs

CNL definition

user-friendly language for human-machine communication
clearly defined syntax (fragment of some natural language)
- explained to the user via construction rules and example sentences
clearly defined semantics
- explained to the user via interpretation rules and example sentences
machine executable
ambiguity is controlled
- possibly no ambiguity
- offer unambiguous constructs as variants for ambiguous constructs
examples
- Attempto Controlled English (ACE)
- GF-implemented languages
- for more see: Tobias Kuhn. A Survey and Classification of Controlled Natural Languages

Properties of speech-oriented CNLs

not studied explicitly
cannot assume flexible editing, e.g. backtracking, look-ahead
tokens must be pronounceable (no punctuation, layout, ...)
sources of ambiguity differ from those of written CNLs
- homophones: cite, site, sight
- oronyms: I scream, ice cream
- similar words (given e.g. some background noise): thirty, thirteen
suitable for simpler domains, e.g. voice actions for calculator, unit conversion, alarm clock
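
Homophones can be found mechanically once every token has a phonetic transcription: two tokens with the same phoneme string cannot be told apart by the recognizer. The following is a minimal sketch of that check. The class name HomophoneFinder is hypothetical, and the pronunciations in the usage note are illustrative rather than taken from the project's dictionaries.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class HomophoneFinder {
    /** Groups words by pronunciation and keeps only groups of two or more words. */
    static Map<String, List<String>> find(Map<String, String> wordToPron) {
        Map<String, List<String>> byPron = new TreeMap<>();
        for (Map.Entry<String, String> e : wordToPron.entrySet()) {
            byPron.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // A pronunciation shared by a single word is not a source of ambiguity.
        byPron.values().removeIf(words -> words.size() < 2);
        return byPron;
    }
}
```

Given cite, site and sight all mapped to "S AY T", the result contains a single group with those three words. Near-homophones such as thirty/thirteen would need a phoneme edit distance instead of exact equality.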

Requirements

for a speech-oriented CNL grammar formalism

mapping between human and machine language
- "quarter to ten in the morning" vs 09:45
- "Frauenchiemsee" vs 47.872072,12.425323
can handle the complexities of natural language (word forms, agreement, ...)
developer-friendly syntax / editing environments
support for standard software engineering practices, e.g.
- reusable modules
- unit and regression testing
compatibility with modern programming languages
compatibility with open-source ASR toolkits (CMU Sphinx)

Java Speech Grammar Format (JSGF)

<command> = <action> | (<action> and <command>);
<action> = stop {STOP} | start | pause | resume | finish {STOP};

standard speech recognition grammar format supported by e.g. CMU Sphinx
simple BNF format
not optimized for natural languages
little support for input normalization into a machine language

Does not really fulfill the requirements...

GF

fulfills most requirements
also: support for multilinguality

Three types of concrete languages

spoken input ("set the alarm for ten oh clock")
- based on a natural language
- user needs to learn to use it (CNL)
- spoken, i.e. each token corresponds to a sequence of phonemes
- possibly (syntactically) ambiguous
- possibly lots of variation
- imperative / interrogative
spoken output ("setting the alarm for ten pee em")
- based on a natural language
- optimized for text-to-speech (e.g. palatalization and quantity markers in Estonian)
- not ambiguous (provides confirmation to users)
- declarative
machine-executable ("alarm 22:00")
- input format for some machine (alarm clock, calculator, webservice URL)
- non ambiguous (provides semantics for the CNL)

GF-based speech interaction

Human says /one plus two/, machine responds with /one plus two is three/

transcribe speech input using a grammar
parse transcription into an abstract tree
linearize tree into application format(s), and TTS format(s)
evaluate application format(s)
communicate evaluation result(s) to the user, combining TTS format and the evaluation result

Grammar module hierarchy

[Figure: Grammar modules]

Numeral

App: 123456789
Est: sada kaks kümmend kolm miljonit ja neli sada viis kümmend kuus tuhat
    seitse sada kaheksa kümmend üheksa
Eng: one hundred and twenty three million four hundred and fifty six thousand
    seven hundred and eighty nine

natural numbers up to 10^12
based on the GF Numeral grammar for ~80 languages
extended by the Fraction grammar to negative integers and decimals
imported by most other modules

ArithExpr

App: PI + 1.2 - (-3) * 400 / 5 ^ 6
Est: pii pluss üks koma kaks miinus miinus kolm korda neli sada
    jagatud viis astmel kuus
Eng: pie plus one point two minus minus three times four hundred
    divided by five to the power of six

arithmetical expressions with the 5 main operations
tiny vocabulary
infinitely long expressions
operator associativity and precedence are left to the evaluator
no parentheses
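
Since the grammar leaves associativity and precedence to the evaluator, the application has to parse the App string itself. The following is a minimal sketch of such an evaluator, not Arvutaja's actual implementation. It assumes the conventional rules: ^ binds tightest and is right-associative, * and / come next, + and - bind loosest, and a unary minus binds looser than ^. The class name ArithEval is hypothetical.

```java
final class ArithEval {
    private final String src;
    private int pos = 0;

    private ArithEval(String src) { this.src = src.replace(" ", ""); }

    /** Evaluates an App-style expression such as "PI + 1.2 - (-3) * 400 / 5 ^ 6". */
    static double eval(String expr) {
        ArithEval p = new ArithEval(expr);
        double v = p.sum();
        if (p.pos != p.src.length()) {
            throw new IllegalArgumentException("unexpected input at " + p.pos);
        }
        return v;
    }

    // sum := product (('+' | '-') product)*   (left-associative)
    private double sum() {
        double v = product();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            double r = product();
            v = (op == '+') ? v + r : v - r;
        }
        return v;
    }

    // product := power (('*' | '/') power)*   (left-associative)
    private double product() {
        double v = power();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            double r = power();
            v = (op == '*') ? v * r : v / r;
        }
        return v;
    }

    // power := atom ('^' power)?   (right-associative)
    private double power() {
        double base = atom();
        if (peek() == '^') {
            pos++;
            return Math.pow(base, power());
        }
        return base;
    }

    // atom := number | 'PI' | '-' power | '(' sum ')'
    private double atom() {
        if (peek() == '-') { pos++; return -power(); }
        if (peek() == '(') {
            pos++;
            double v = sum();
            expect(')');
            return v;
        }
        if (src.startsWith("PI", pos)) { pos += 2; return Math.PI; }
        int start = pos;
        while (Character.isDigit(peek()) || peek() == '.') pos++;
        if (start == pos) throw new IllegalArgumentException("number expected at " + pos);
        return Double.parseDouble(src.substring(start, pos));
    }

    // Returns the current character, or '\0' at the end of input.
    private char peek() { return pos < src.length() ? src.charAt(pos) : '\0'; }

    private void expect(char c) {
        if (peek() != c) throw new IllegalArgumentException("'" + c + "' expected at " + pos);
        pos++;
    }
}
```

With these rules the spoken "two plus three times four" evaluates as 2 + (3 * 4). A user who expects left-to-right evaluation has no way to override this, because the spoken language has no parentheses.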

Unitconv

App: convert 100.1 mi*h^-2 to m*s^-2
Est: sada koma üks miili ruut tunnis meetrites ruut sekundis
Eng: one hundred point one miles per hours squared to meters per seconds squared

type-aware unit conversion expressions
syntax error: "convert 10 USD to km/h"
covers most important units and the main currencies
supports some syntactic sugar
- USD can be expressed by "dollar", "US dollar" or "American dollar"
supports some ambiguity
- "two euros in large currency" is ambiguous between ~5 readings
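
In the grammar, type-awareness is a matter of syntax: an ill-typed conversion simply does not parse. The underlying idea can be sketched outside GF as a comparison of dimension vectors. The code below is an illustration only. It reuses the * and ^ notation of the App example above, and both the unit table and the class name UnitDimensions are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

final class UnitDimensions {
    // Hypothetical table: unit symbol -> base dimension
    private static final Map<String, String> BASE = new HashMap<>();
    static {
        BASE.put("m", "length");
        BASE.put("mi", "length");
        BASE.put("km", "length");
        BASE.put("s", "time");
        BASE.put("h", "time");
        BASE.put("USD", "currency");
        BASE.put("EUR", "currency");
    }

    /** Maps e.g. "mi*h^-2" to {length=1, time=-2}. */
    static Map<String, Integer> dimensions(String unitExpr) {
        Map<String, Integer> dims = new HashMap<>();
        for (String factor : unitExpr.split("\\*")) {
            String[] parts = factor.split("\\^");
            String dim = BASE.get(parts[0]);
            if (dim == null) {
                throw new IllegalArgumentException("unknown unit: " + parts[0]);
            }
            int exp = parts.length > 1 ? Integer.parseInt(parts[1]) : 1;
            dims.merge(dim, exp, Integer::sum);
        }
        // Drop dimensions that cancel out, e.g. in m*m^-1.
        dims.values().removeIf(e -> e == 0);
        return dims;
    }

    /** Two unit expressions are convertible if their dimension vectors agree. */
    static boolean convertible(String from, String to) {
        return dimensions(from).equals(dimensions(to));
    }
}
```

Here "mi*h^-2" and "m*s^-2" are convertible (both are length per time squared), while "USD" and "km*h^-1" are not.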

Alarm

App: alarm 07 : 05
Est: ärata mind kell seitse null viis
Eng: wake me up at seven oh five

simple 24h-clock, e.g. "alarm 06:05"
also, "alarm in 2 hours and 20 minutes"
some variation
- wake me up at, set the timer for, please wake me up, ...
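
On the application side, the App linearization still has to be turned into a time value. A minimal sketch, assuming the "alarm HH : MM" shape shown above with optional spaces around the colon; the class name AlarmCommand is hypothetical, and the relative form ("alarm in 2 hours and 20 minutes") is not covered.

```java
import java.time.LocalTime;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class AlarmCommand {
    // "alarm", then hours and minutes separated by a colon with optional spaces
    private static final Pattern ALARM =
        Pattern.compile("alarm\\s+(\\d{1,2})\\s*:\\s*(\\d{2})");

    /** Parses e.g. "alarm 07 : 05" or "alarm 06:05" into a time of day. */
    static LocalTime parse(String app) {
        Matcher m = ALARM.matcher(app.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("not an alarm command: " + app);
        }
        // LocalTime.of throws a DateTimeException for out-of-range values such as 25:00.
        return LocalTime.of(Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)));
    }
}
```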

Direction

App: FROM Vanemuise 46, Tartu TO Lossi plats 2, Tallinn
Est: algus vanemuise nelikümend kuus tartu lõpp lossi plats kaks
Eng: (missing)

syntactically simple (from street A 123 to street B 321)
good coverage (based on the official Estonian place names registry)
- all names of Estonian populated places (4416)
- all street names of larger towns (Tallinn and Tartu) (~2000)

Shortcomings

does not model naming variation:
- (August) Weizenbergi (street) 39
- Estonian/Swedish parallel names (very few)
- only nominative case
does not encode ambiguity, e.g. villages with the same name but different location
- i.e. geocoding is left to the application (e.g. Google Maps)
over-generates with house numbers

Action

union of Alarm, ArithExpr, Unitconv, Direction
resolves ambiguities that result from the merge
- optional prefix "how much is" to distinguish an arithmetical expression from an alarm clock expression
not all grammars are available in all the languages
- Direction only in Estonian

Availability/documentation

available on GitHub: http://kaljurand.github.io/Grammars/
- easy to view, fork, etc.
both the source and PGF
documentation
- automatically generated example sentences with high coverage
- some documentation part of the Arvutaja app

Platform for building speech-based UIs

Open source stack

cloud service
- real-time ASR (optionally grammar-based) of streaming audio
- HTTP/REST/JSON
- users can upload their GF and JSGF grammars
ASR system
- CMU Sphinx (Pocketsphinx) decoder
- supports FSG grammars (Estonian, English) and n-gram language models (Estonian)
- Estonian and English acoustic models
grammar development
- GF
- existing modules in a GitHub repository
app development
- Android
- extended RecognizerIntent API supported by Kõnele
- Arvutaja

System architecture

CNL grammars: PGF files accessible over HTTP
ASR server: transcribes speech; uses grammars
Kõnele (speak!)
- maps Android apps to grammars
- records speech and transcribes it using the server
Arvutaja (the one that computes)
- maps voice commands to actions (possibly carried out by other Android apps)
- transcribes speech using Kõnele (with the Action-grammar by default)

Integrating GF with ASR engines

Solution 1: integrate via JSGF or FSG

convert PGF to JSGF/FSG, run standard grammar-based recognition, then parse/translate the output with PGF
benefits: not specific to any ASR engine
shortcomings:
- possibly slow: translation can start only after recognition
- low expressivity (not context-sensitive)
- grammars must avoid tables and records, which cause overgeneration
- compile time errors hard to explain to the user (left-recursion etc.)

Solution 2: direct integration (future work)

benefits: can handle any GF grammar

Conversion of PGF to FSG

# Convert PGF to JSGF (using GF Haskell runtime)
gf --make --output-format=jsgf $pgf

# Convert JSGF to FSG (using Sphinx)
sphinx_jsgf2fsg -jsgf $jsgf -fsm $fsm -symtab $sym

# Optimize the FSG (using OpenFst)
fstcompile --arc_type=log --acceptor --isymbols=$sym --keep_isymbols $fsm |\
fstdeterminize | fstminimize | fstrmepsilon | fstprint > $fsg

Server

Compile time

input: grammar URL
processing:
- convert PGF to FSG
- extract tokens and map them to their phonetic transcription
output: FSG + dict

Runtime

input: audio stream + grammar URL + params (n-best, langs, etc.)
processing:
1. recognize to obtain the possible strings (Sphinx)
2. parse the strings (GF)
3. linearize the trees (GF)
output: list (hypotheses) of lists (linearizations in the given languages)
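
The three runtime steps and the nested shape of the output can be sketched as follows. This is an illustration only: the recognizer, parser and linearizer are passed in as plain functions, abstract trees are represented as strings, and the class name RecognitionPipeline is hypothetical. The real server calls Sphinx and the GF runtime at these points.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;
import java.util.function.Function;

final class RecognitionPipeline {
    /**
     * hypotheses: n-best transcriptions from the recognizer (step 1)
     * parse:      transcription -> abstract trees (step 2), several if ambiguous
     * linearize:  (tree, language) -> string (step 3)
     * Returns one list of linearizations per hypothesis.
     */
    static List<List<String>> run(List<String> hypotheses,
                                  Function<String, List<String>> parse,
                                  BiFunction<String, String, String> linearize,
                                  List<String> langs) {
        List<List<String>> result = new ArrayList<>();
        for (String hyp : hypotheses) {
            List<String> lins = new ArrayList<>();
            for (String tree : parse.apply(hyp)) {
                for (String lang : langs) {
                    lins.add(linearize.apply(tree, lang));
                }
            }
            result.add(lins);
        }
        return result;
    }
}
```

A hypothesis that the grammar cannot parse contributes an empty inner list, so the outer list keeps one entry per hypothesis.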

Speech recognition API on Android

apps can interact with each other via services and intents, e.g.
- TODO-list app calls the speech-to-text app to enable voice notes
- TODO-list app calls a keyboard app which calls a speech-to-text app
loose coupling
RecognizerIntent constants
- EXTRA_LANGUAGE (String)
- EXTRA_RESULTS (ArrayList<String>)
- ...
SpeechRecognizer callbacks
- onBeginningOfSpeech
- onRmsChanged
- ...
is this possible with iOS or Windows Phone?

Kõnele

speech recognition service for Android
- implements RecognizerIntent and SpeechRecognizer
- basic UI for voice search
similar to Google Voice Search, but:
- open source
- adds EXTRAs for grammar-based recognition
- user-configurable, e.g. settings UI for mapping apps to grammars
- requires only the audio recording and internet permissions
- slower networking, worse endpointer, uglier UI, etc.

Using Kõnele

Speaking to WolframAlpha via Kõnele + the Action-grammar (e.g. in Estonian)

Setting an app to use a grammar

via Kõnele's configuration panel (as an end-user)

Using Kõnele's API

Set the required EXTRAs

// Set of non-standard extras that K6nele supports
public static final String EXTRA_GRAMMAR_URL =
  "ee.ioc.phon.android.extra.GRAMMAR_URL";
public static final String EXTRA_GRAMMAR_TARGET_LANG =
  "ee.ioc.phon.android.extra.GRAMMAR_TARGET_LANG";
// ...
Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.setComponent(new ComponentName(
  "ee.ioc.phon.android.speak",
  "ee.ioc.phon.android.speak.RecognizerIntentActivity"));
intent.putExtra(EXTRA_GRAMMAR_URL,
  "http://kaljurand.github.io/Grammars/grammars/pgf/Action.pgf");
intent.putExtra(EXTRA_GRAMMAR_TARGET_LANG, "App");

Start the activity or service, and handle output extras.

// ...
startActivityForResult(intent, 1234);
// ...

Arvutaja

interprets a voice command in Est or Eng as an expression in App
evaluates some expressions itself
- arithmetical expressions
- unit conversion expressions
executes other expressions by launching the corresponding intent, e.g.
- ACTION_VIEW in the case of a URI
- ACTION_VIEW with Uri = "http://maps.google.com?q=App" in the case of Direction
rephrases the command as a declarative sentence and speaks this TTS linearization with Android's default text-to-speech app
shows the history of previous commands
shows the readings of ambiguous commands

Arvutaja

Front-end to the Action-grammar

Using Arvutaja as a developer

Setup

write a grammar where the App-language is a URI
upload this grammar to a public URL and register it with the server
in Kõnele settings, assign the grammar URL to Arvutaja (overriding the default Action.pgf)
implement a simple Android app that responds to the ACTION_VIEW intent of this URI
- in the case of an "http://" URI, just use an existing browser app and implement a webservice

Runtime

if the App string parses as a Java URI, then Arvutaja launches the ACTION_VIEW intent on this URI
Android locates and launches the app that can handle this URI
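
The URI test in the first runtime step can be sketched with java.net.URI. This is not Arvutaja's actual code: the class name AppUri is hypothetical, and the sketch additionally assumes that only absolute URIs (those with a scheme) are worth an ACTION_VIEW, because java.net.URI also accepts relative references such as a bare "12".

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;

final class AppUri {
    /** Returns the URI if the App linearization is an absolute URI, otherwise empty. */
    static Optional<URI> asViewableUri(String app) {
        try {
            URI uri = new URI(app);
            // Assumption: a scheme is required, so relative references are rejected.
            return uri.isAbsolute() ? Optional.of(uri) : Optional.<URI>empty();
        } catch (URISyntaxException e) {
            // Strings with spaces, such as "1 + 2", are not valid URIs.
            return Optional.empty();
        }
    }
}
```

Arithmetic, unit conversion and alarm expressions contain spaces and therefore fall through to Arvutaja's other handlers, while a grammar whose App language produces "http://..." strings triggers the intent.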

Results

Positive results

grammars
- enable useful applications with high precision (90%+) speech recognition
- scalable at least to ~6000 terminals
- scalable to multiple natural/formal languages (thanks to GF)
framework/platform
- allows adding speech-based UIs to 3rd party apps
- flexible and easy to use (?)
world's best voice search for Estonian place names
successfully demonstrated/demonstrates the state of Estonian ASR
- launched in December 2011
- Kõnele: 13,000+ downloads (on Google Play)
- Arvutaja: 5000+ downloads (on Google Play)
- TV/radio/newspaper coverage
- award: Estonian Language Deed 2011

Positive results

Android Market (viewed from Estonia), for one week in Dec 2011

Negative results

number of "active users" (in Google Play) is shrinking
- Kõnele: 4300 (peak) -> 2500 (now)
- Arvutaja: 1800 (peak) -> 700 (now)
current daily usage is low
- Kõnele: ~100 queries (maybe because most keyboards now access speech recognition via proprietary APIs)
- Arvutaja: < 10 queries
few users outside of Estonia (~8%)
no outside contributions to the open-source app and grammar projects
nobody has uploaded their own grammars

Future work

support more languages (based on GF's resource grammar library)
mixing grammar-based and n-gram models
- "email Bob at work I'm running late"
dialog (grammar-based)
- error recovery
- awareness of past input/output
- text-to-speech
visual user interface (on devices like Google Glass)
- is the input so far unambiguous?
- which tokens can come next?
developer tools optimized for speech-oriented CNLs
- search for potential ambiguity (resulting from e.g. homophones)
- propose changes to the grammars based on query log analysis

Summary

speech-based UIs
CNLs for such UIs
GF as an effective formalism/tool for implementing such CNLs
open/extendable platform for building such UIs
