Building Speech-based User Interfaces with GF
Kaarel Kaljurand
Third GF Summer School 2013, Frauenchiemsee, Bavaria
2013-08-28

Overview

  • speech-based user interfaces
  • controlled natural language for such UIs
  • implementing CNL grammars for speech-based UIs in GF
  • platform for building speech-based UIs, based on
    • GF grammars
    • speech recognition server
    • Android
    • Android apps Kõnele and Arvutaja
  • results and future work

Speech-based user interfaces

  • command the computer by human speech (instead of mouse, keyboard, etc.)
  • effective and natural in many environments and for many tasks
    • hands/eyes free (car, operating room)
    • small mobile devices (personal assistants)
    • assumes: little background noise, no privacy concerns
  • increasingly popular on smart phones
    • Apple's Siri, Google Now, Windows Phone TellMe
    • lots of intelligent assistant apps on Android using APIs from Google, Nuance, AT&T Watson
  • dominated by closed-source systems
  • natural language specific
    • not available for smaller languages, e.g. those with fewer than 1 million speakers
    • Google Voice Search (i.e. dictation) supports ~45 languages (2013-07-25) ...
    • but Google Now voice actions support fewer languages
    • Siri supports 9 languages

Components of speech recognition

  • acoustic model = mapping of sound to phonetic symbols
    • derived from transcribed audio corpora
  • grapheme-to-phoneme converter
    • "grammatical" -> "G R AH M AE T AH K AH L"
    • "grammatical(2)" -> "G R AH M AE T IH K AH L"
    • "framework" -> "F R EY M W ER K"
  • language model = valid sequences of orthographic words (in the given domain)
    • n-gram model derived from text corpora
    • explicit grammar
  • interpretation = mapping the word sequence to some application format

Original goals of the project

  • build usable (90%+ precision) and useful applications, assumes:
    • single speaker
    • smaller vocabulary
    • simpler syntax
  • provide a platform for building speech-enabled applications
    • open source
    • modular
    • easy to use for a programmer with no knowledge of speech and language engineering
    • available as a web service and on smart phones
  • focus on Estonian
    • demonstrate the state of Estonian automatic speech recognition (ASR)
    • build domain-specific grammars for Estonian

History of the project

  • started mid 2011 (around the time of GF Summer School 2011), as part of a speech recognition project at Tallinn Technical University (Tanel Alumäe)
  • end of 2011: first version of grammars (for Estonian) and platform
  • late 2012: added support for English
  • 2013: minor improvements to the grammars and apps

CNL for speech-based UIs

CNL definition

  • user-friendly language for human-machine communication
  • clearly defined syntax (fragment of some natural language)
    • explained to the user via construction rules and example sentences
  • clearly defined semantics
    • explained to the user via interpretation rules and example sentences
  • machine executable
  • ambiguity is controlled
    • possibly no ambiguity
    • offer unambiguous constructs as variants for ambiguous constructs
  • examples
    • Attempto Controlled English (ACE)
    • GF-implemented languages
    • for more see: Tobias Kuhn. A Survey and Classification of Controlled Natural Languages

Properties of speech-oriented CNLs

  • not studied explicitly
  • cannot assume flexible editing, e.g. backtracking, look-ahead
  • tokens must be pronounceable (no punctuation, layout, ...)
  • sources of ambiguity differ from those of written CNLs
    • homophones: cite, site, sight
    • oronyms: I scream, ice cream
    • similar words (given e.g. some background noise): thirty, thirteen
  • suitable for simpler domains, e.g. voice actions for calculator, unit conversion, alarm clock
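
A minimal sketch of how homophones such as "cite, site, sight" could be detected: group words by pronunciation and report every group with more than one member. The function name `homophone_groups` and the CMU-style pronunciations in the usage note are illustrative assumptions, not part of the project.

```python
from collections import defaultdict

def homophone_groups(lexicon):
    """Group words that share a pronunciation.

    `lexicon` maps each word to one phoneme string (hypothetical format).
    Returns a list of sorted word groups with at least two members.
    """
    groups = defaultdict(list)
    for word, pronunciation in lexicon.items():
        groups[pronunciation].append(word)
    return [sorted(words) for words in groups.values() if len(words) > 1]
```

For example, a lexicon that maps "cite", "site" and "sight" to "S AY T" and "thirty" to "TH ER T IY" yields the single group ["cite", "sight", "site"].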

Requirements

for a speech-oriented CNL grammar formalism

  • mapping between human and machine language
  • can handle the complexities of natural language (word forms, agreement, ...)
  • developer-friendly syntax / editing environments
  • support for standard software engineering practices, e.g.
    • reusable modules
    • unit and regression testing
  • compatibility with modern programming languages
  • compatibility with open-source ASR toolkits (CMU Sphinx)

Java Speech Grammar Format (JSGF)

<command> = <action> | (<action> and <command>);
 <action> = stop {STOP} | start | pause | resume | finish {STOP};
  • standard speech recognition grammar format supported by e.g. CMU Sphinx
  • simple BNF format
  • not optimized for natural languages
  • little support for input normalization into a machine language

Does not really fulfill the requirements...

GF

  • fulfills most requirements
  • also: support for multilinguality

Three types of concrete languages

  • spoken input ("set the alarm for ten oh clock")
    • based on a natural language
    • user needs to learn to use it (CNL)
    • spoken, i.e. each token corresponds to a sequence of phonemes
    • possibly (syntactically) ambiguous
    • possibly lots of variation
    • imperative / interrogative
  • spoken output ("setting the alarm for ten pee em")
    • based on a natural language
    • optimized for text-to-speech (e.g. palatalization and quantity markers in Estonian)
    • not ambiguous (provides confirmation to users)
    • declarative
  • machine-executable ("alarm 22:00")
    • input format for some machine (alarm clock, calculator, webservice URL)
    • not ambiguous (provides semantics for the CNL)

GF-based speech interaction

Human says /one plus two/, machine responds with /one plus two is three/

Translation
  • transcribe speech input using a grammar
  • parse transcription into an abstract tree
  • linearize tree into application format(s), and TTS format(s)
  • evaluate application format(s)
  • communicate evaluation result(s) to the user, combining TTS format and the evaluation result
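
The steps above can be sketched end to end with toy stand-ins. This is not the GF runtime: `parse`, `linearize_app`, `linearize_tts`, `evaluate` and `respond` are hypothetical names, and the "grammar" only covers "<number> plus <number>" for three number words.

```python
NUMBERS = {"one": 1, "two": 2, "three": 3}
NAMES = {value: word for word, value in NUMBERS.items()}

def parse(transcription):
    """Parse '<num> plus <num>' into a tiny abstract tree."""
    left, op, right = transcription.split()
    if op != "plus":
        raise ValueError("unsupported operator: " + op)
    return ("Plus", NUMBERS[left], NUMBERS[right])

def linearize_app(tree):
    """Linearize the tree into the application format."""
    _, a, b = tree
    return f"{a} + {b}"

def linearize_tts(tree):
    """Linearize the tree into the text-to-speech format."""
    _, a, b = tree
    return f"{NAMES[a]} plus {NAMES[b]}"

def evaluate(app_expr):
    """Evaluate the application format (here: a single addition)."""
    a, _, b = app_expr.split()
    return int(a) + int(b)

def respond(transcription):
    """Combine the TTS linearization with the evaluation result."""
    tree = parse(transcription)
    result = evaluate(linearize_app(tree))
    return f"{linearize_tts(tree)} is {NAMES.get(result, result)}"
```

Calling `respond("one plus two")` returns "one plus two is three", matching the exchange at the top of this slide.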

Grammar module hierarchy

[Figure: grammar module hierarchy]

Grammar modules

Numeral

App: 123456789
Est: sada kaks kümmend kolm miljonit ja neli sada viis kümmend kuus tuhat
    seitse sada kaheksa kümmend üheksa
Eng: one hundred and twenty three million four hundred and fifty six thousand
    seven hundred and eighty nine
  • natural numbers up to 10^12
  • based on the GF Numeral grammar for ~80 languages
  • extended by the Fraction grammar to negative integers and decimals
  • imported by most other modules
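
A rough sketch of what the English linearization of a numeral involves, written as plain Python rather than GF. It reproduces the wording of the Eng example above ("and" after "hundred", no hyphens). The names `to_words` and `below_thousand` are hypothetical, and the use of "billion" for 10^9 is an assumption about the grammar's wording.

```python
ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
SCALES = [(10**9, "billion"), (10**6, "million"), (10**3, "thousand")]

def below_thousand(n):
    """Spell out 1..999, inserting 'and' after the hundreds."""
    parts = []
    hundreds, rest = divmod(n, 100)
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
        if rest:
            parts.append("and")
    if rest >= 20:
        tens, ones = divmod(rest, 10)
        parts.append(TENS[tens])
        if ones:
            parts.append(ONES[ones])
    elif rest:
        parts.append(ONES[rest])
    return " ".join(parts)

def to_words(n):
    """Spell out a natural number below 10^12."""
    if not 0 <= n < 10**12:
        raise ValueError("number out of range")
    if n == 0:
        return "zero"
    parts = []
    for value, name in SCALES:
        count, n = divmod(n, value)
        if count:
            parts.append(below_thousand(count) + " " + name)
    if n:
        parts.append(below_thousand(n))
    return " ".join(parts)
```

In GF the same mapping is bidirectional (parsing and linearization share one grammar), which this one-way function does not capture.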

ArithExpr

App: PI + 1.2 - (-3) * 400 / 5 ^ 6
Est: pii pluss üks koma kaks miinus miinus kolm korda neli sada
    jagatud viis astmel kuus
Eng: pie plus one point two minus minus three times four hundred
    divided by five to the power of six
  • arithmetical expressions with the 5 main operations
  • tiny vocabulary
  • infinitely long expressions
  • operator associativity and precedence are left to the evaluator
  • no parentheses
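
A small sketch of the division of labour described above: the spoken form is mapped token by token to the App format, and precedence is decided only when the App string is evaluated. The vocabulary is cut down to a few digits, `to_app` and `evaluate` are hypothetical names, and Python's own operator rules stand in for whatever evaluator the application uses.

```python
import ast
import operator

OPERATORS = {
    ("plus",): "+", ("minus",): "-", ("times",): "*",
    ("divided", "by"): "/", ("to", "the", "power", "of"): "^",
}
DIGITS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}

def to_app(spoken):
    """Map a spoken expression to the App format, token by token."""
    words = spoken.split()
    out, i = [], 0
    while i < len(words):
        for phrase, symbol in OPERATORS.items():
            if tuple(words[i:i + len(phrase)]) == phrase:
                out.append(symbol)
                i += len(phrase)
                break
        else:
            out.append(DIGITS[words[i]])  # KeyError for unknown tokens
            i += 1
    return " ".join(out)

BINARY = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
          ast.Div: operator.truediv, ast.Pow: operator.pow}

def evaluate(app_expr):
    """Evaluate an App expression; precedence comes from Python's parser."""
    tree = ast.parse(app_expr.replace("^", "**"), mode="eval")

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in BINARY:
            return BINARY[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")

    return walk(tree.body)
```

Here "one plus two times three" becomes "1 + 2 * 3" and evaluates to 7, because the evaluator, not the grammar, gives multiplication the higher precedence.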

Unitconv

App: convert 100.1 mi*h^-2 to m*s^-2
Est: sada koma üks miili ruut tunnis meetrites ruut sekundis
Eng: one hundred point one miles per hours squared to meters per seconds squared
  • type-aware unit conversion expressions
    • a type mismatch is a syntax error, e.g. "convert 10 USD to km/h"
  • covers most important units and the main currencies
  • supports some syntactic sugar
    • USD can be expressed by "dollar", "US dollar" or "American dollar"
  • supports some ambiguity
    • "two euros in large currency" is ambiguous between ~5 readings
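
To illustrate what "type-aware" means here, the following sketch tags each unit with a dimension and refuses conversions across dimensions. In the grammar this check happens at parse time through the abstract syntax; the run-time check below only stands in for it. The unit table and the function name `convert` are illustrative assumptions.

```python
# unit -> (dimension, factor to the base unit of that dimension)
UNITS = {
    "m": ("length", 1.0),
    "km": ("length", 1000.0),
    "mi": ("length", 1609.344),
    "s": ("time", 1.0),
    "h": ("time", 3600.0),
}

def convert(value, source, target):
    """Convert between two units of the same dimension."""
    src_dim, src_factor = UNITS[source]
    dst_dim, dst_factor = UNITS[target]
    if src_dim != dst_dim:
        # Corresponds to an expression the grammar would not accept at all.
        raise ValueError(f"cannot convert {src_dim} to {dst_dim}")
    return value * src_factor / dst_factor
```

With this table, `convert(2, "km", "m")` gives 2000.0, while `convert(1, "km", "h")` raises a `ValueError`.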

Alarm

App: alarm 07 : 05
Est: ärata mind kell seitse null viis
Eng: wake me up at seven oh five
  • simple 24h-clock, e.g. "alarm 06:05"
  • also, "alarm in 2 hours and 20 minutes"
  • some variation
    • wake me up at, set the timer for, please wake me up, ...
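
A sketch of the two alarm forms above, under the assumption that a relative alarm is resolved against the current time before being handed to the alarm clock. The helper names `alarm_at` and `alarm_in` are hypothetical, and how Arvutaja actually handles the relative form is not specified here.

```python
from datetime import datetime, timedelta

def alarm_at(hour, minute):
    """Format an absolute 24h alarm command, e.g. 'alarm 07:05'."""
    if not (0 <= hour < 24 and 0 <= minute < 60):
        raise ValueError("invalid time")
    return f"alarm {hour:02d}:{minute:02d}"

def alarm_in(now, hours=0, minutes=0):
    """Resolve a relative alarm against `now`, wrapping past midnight."""
    target = now + timedelta(hours=hours, minutes=minutes)
    return alarm_at(target.hour, target.minute)
```

For instance, `alarm_in(datetime(2013, 8, 28, 23, 50), hours=2, minutes=20)` gives "alarm 02:10".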

Direction

App: FROM Vanemuise 46, Tartu TO Lossi plats 2, Tallinn
Est: algus vanemuise nelikümend kuus tartu lõpp lossi plats kaks
Eng: (not available)
  • syntactically simple (from street A 123 to street B 321)
  • good coverage (based on the official Estonian place names registry)
    • all names of Estonian populated places (4416)
    • all street names of larger towns (Tallinn and Tartu) (~2000)

Shortcomings

  • does not model naming variation:
    • (August) Weizenbergi (street) 39
    • Estonian/Swedish parallel names (very few)
    • only nominative case
  • does not encode ambiguity, e.g. villages with the same name but different location
    • i.e. geocoding is left to the application (e.g. Google Maps)
  • over-generates with house numbers

Action

  • union of Alarm, ArithExpr, Unitconv, Direction
  • resolves ambiguities that result from the merge
    • optional prefix "how much is" to distinguish an arithmetical expression from an alarm clock expression
  • not all grammars are available in all the languages
    • Direction only in Estonian

Availability/documentation

  • available on GitHub: http://kaljurand.github.io/Grammars/
    • easy to view, fork, etc.
  • both the source and PGF
  • documentation
    • automatically generated example sentences with high coverage
    • some documentation is part of the Arvutaja app

Platform for building speech-based UIs

Open source stack

  • cloud service
    • real-time ASR (optionally grammar-based) of streaming audio
    • HTTP/REST/JSON
    • users can upload their GF and JSGF grammars
  • ASR system
    • CMU Sphinx (Pocketsphinx) decoder
    • supports FSG grammars (Estonian, English) and n-gram language models (Estonian)
    • Estonian and English acoustic models
  • grammar development
    • GF
    • existing modules in a GitHub repository
  • app development
    • Android
    • extended RecognizerIntent API supported by Kõnele
    • Arvutaja

System architecture

[Figure: system architecture]
  • CNL grammars: PGF files accessible over HTTP
  • ASR server: transcribes speech; uses grammars
  • Kõnele (speak!)
    • maps Android apps to grammars
    • records speech and transcribes it using the server
  • Arvutaja (the one that computes)
    • maps voice commands to actions (possibly carried out by other Android apps)
    • transcribes speech using Kõnele (with the Action-grammar by default)

Integrating GF with ASR engines

Solution 1: integrate via JSGF or FSG

  • convert PGF to JSGF/FSG, do standard grammar-based recognition, then parse/translate the output with PGF
  • benefits: not specific to any ASR engine
  • shortcomings:
    • possibly slow: translation can start only after recognition
    • low expressivity (not context-sensitive)
    • grammars must avoid tables and records, which cause overgeneration
    • compile time errors hard to explain to the user (left-recursion etc.)

Solution 2: direct integration (future work)

  • benefits: can handle any GF grammar

Conversion of PGF to FSG

# Convert PGF to JSGF (using GF Haskell runtime)
gf --make --output-format=jsgf "$pgf"

# Convert JSGF to FSG (using Sphinx)
sphinx_jsgf2fsg -jsgf "$jsgf" -fsm "$fsm" -symtab "$sym"

# Optimize the FSG (using OpenFst)
fstcompile --arc_type=log --acceptor --isymbols="$sym" --keep_isymbols "$fsm" |
  fstdeterminize | fstminimize | fstrmepsilon | fstprint > "$fsg"

Server

Compile time

  • input: grammar URL
  • processing:
    • convert PGF to FSG
    • extract tokens and map them to their phonetic transcription
  • output: FSG + dict
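
The token-to-transcription step could look roughly like this. The sketch takes the grammar's tokens and a pronunciation lexicon and emits dictionary lines in the `word`, `word(2)` style shown earlier for "grammatical". The function name `build_dict` and the lexicon format are assumptions; the real server may instead generate pronunciations with a grapheme-to-phoneme converter.

```python
def build_dict(tokens, lexicon):
    """Return (dict_lines, missing) for the given grammar tokens.

    `lexicon` maps a word to a list of phoneme strings (hypothetical format).
    Tokens without a pronunciation are reported in `missing`.
    """
    lines, missing = [], []
    for token in sorted(set(tokens)):
        pronunciations = lexicon.get(token)
        if not pronunciations:
            missing.append(token)
            continue
        for i, phones in enumerate(pronunciations, start=1):
            # Alternative pronunciations are marked as word(2), word(3), ...
            entry = token if i == 1 else f"{token}({i})"
            lines.append(f"{entry} {phones}")
    return lines, missing
```

Reporting missing tokens separately matters in practice, since a token without a pronunciation can never be recognized.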

Runtime

  • input: audio stream + grammar URL + params (n-best, langs, etc.)
  • processing:
    1. recognize to obtain the possible strings (Sphinx)
    2. parse the strings (GF)
    3. linearize the trees (GF)
  • output: list (hypotheses) of lists (linearizations in the given languages)

Speech recognition API on Android

  • apps can interact with each other via services and intents, e.g.
    • TODO-list app calls the speech-to-text app to enable voice notes
    • TODO-list app calls a keyboard app which calls a speech-to-text app
  • loose coupling
  • RecognizerIntent constants
    • EXTRA_LANGUAGE (String)
    • EXTRA_RESULTS (ArrayList<String>)
    • ...
  • SpeechRecognizer callbacks
    • onBeginningOfSpeech
    • onRmsChanged
    • ...
  • is this possible with iOS or Windows Phone?

Kõnele

  • speech recognition service for Android
    • implements RecognizerIntent and SpeechRecognizer
    • basic UI for voice search
  • similar to Google Voice Search, but:
    • open source
    • adds EXTRAs for grammar-based recognition
    • user-configurable, e.g. settings UI for mapping apps to grammars
    • requires only the audio recording and internet permissions
    • slower networking, worse endpointer, uglier UI, etc.

Using Kõnele

Speaking to WolframAlpha via Kõnele + the Action-grammar (e.g. in Estonian)

Setting an app to use a grammar

via Kõnele's configuration panel (as an end-user)

Using Kõnele's API

Set the required EXTRAs

// Set of non-standard extras that K6nele supports
public static final String EXTRA_GRAMMAR_URL =
  "ee.ioc.phon.android.extra.GRAMMAR_URL";
public static final String EXTRA_GRAMMAR_TARGET_LANG =
  "ee.ioc.phon.android.extra.GRAMMAR_TARGET_LANG";
// ...
Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
intent.setComponent(new ComponentName(
  "ee.ioc.phon.android.speak",
  "ee.ioc.phon.android.speak.RecognizerIntentActivity"));
intent.putExtra(EXTRA_GRAMMAR_URL,
  "http://kaljurand.github.io/Grammars/grammars/pgf/Action.pgf");
intent.putExtra(EXTRA_GRAMMAR_TARGET_LANG, "App");

Start the activity or service, and handle output extras.

// ...
startActivityForResult(intent, 1234);
// ...

Arvutaja

  • interprets a voice command in Est or Eng as an expression in App
  • evaluates some expressions itself
    • arithmetical expressions
    • unit conversion expressions
  • executes other expressions by launching the corresponding intent, e.g.
    • ACTION_VIEW in case of a URI
    • ACTION_VIEW with Uri = "http://maps.google.com?q=App" in case of Direction
  • rephrases the command as a declarative sentence, speaking the TTS linearization with Android's default text-to-speech app
  • shows the history of previous commands
  • shows the readings of ambiguous commands

Arvutaja

Front-end to the Action-grammar

[Screenshots: Arvutaja]

Using Arvutaja as a developer

Setup

  • write a grammar where the App-language is a URI
  • upload this grammar to a public URL and register it with the server
  • in Kõnele settings, assign the grammar URL to Arvutaja (overriding the default Action.pgf)
  • implement a simple Android app that responds to the ACTION_VIEW intent of this URI
    • in the case of an "http://" URI, just use an existing browser app and implement a webservice

Runtime

  • if the App string parses as a Java URI, then Arvutaja launches the ACTION_VIEW intent on this URI
  • Android locates and launches the app that can handle this URI
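
The dispatch rule can be approximated outside Android as follows. This Python sketch uses `urllib.parse` in place of Java's URI class, so edge cases differ; requiring an explicit scheme is an assumption made here, and `as_view_uri` is a hypothetical name.

```python
from urllib.parse import urlparse

def as_view_uri(app_string):
    """Return the string if it looks like an absolute URI, else None."""
    parsed = urlparse(app_string)
    if parsed.scheme and (parsed.netloc or parsed.path):
        return app_string
    return None
```

An App linearization such as "http://example.org/lights?state=on" (a made-up URL) would be handed to ACTION_VIEW, whereas "alarm 07:05" or "1 + 2" would fall through to Arvutaja's own handling.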

Results

Positive results

  • grammars
    • enable useful applications with high precision (90%+) speech recognition
    • scalable at least to ~6000 terminals
    • scalable to multiple natural/formal languages (thanks to GF)
  • framework/platform
    • allows adding speech-based UIs to 3rd party apps
    • flexible and easy to use (?)
  • world's best voice search for Estonian place names
  • successfully demonstrated/demonstrates the state of Estonian ASR
    • launched in December 2011
    • Kõnele: 13,000+ downloads (on Google Play)
    • Arvutaja: 5,000+ downloads (on Google Play)
    • TV/radio/newspaper coverage
    • award: Estonian Language Deed 2011

Positive results

Android Market (viewed from Estonia), for one week in Dec 2011

Negative results

  • number of "active users" (in Google Play) is shrinking
    • Kõnele: 4300 (peak) -> 2500 (now)
    • Arvutaja: 1800 (peak) -> 700 (now)
  • current daily usage is low
    • Kõnele: ~100 queries (maybe because most keyboards now access speech recognition via proprietary APIs)
    • Arvutaja: < 10 queries
  • few users outside of Estonia (~8%)
  • no contributions to the open-source app and grammar projects
  • nobody has uploaded their own grammars

Future work

  • support more languages (based on GF's resource grammar library)
  • mixing grammar-based and n-gram models
    • "email Bob at work I'm running late"
  • dialog (grammar-based)
    • error recovery
    • awareness of past input/output
    • text-to-speech
  • visual user interface (on devices like Google Glass)
    • is the input so far unambiguous?
    • which tokens can come next?
  • developer tools optimized for speech-oriented CNLs
    • search for potential ambiguity (resulting from e.g. homophones)
    • propose changes to the grammars based on query log analysis

Summary

  • speech-based UIs
  • CNLs for such UIs
  • GF as an effective formalism/tool for implementing such CNLs
  • open/extendable platform for building such UIs
