voideX : Voice Driven Desktop
1. Abstract
voideX is a dialog system for controlling Linux X applications through natural-language instructions, with unrestricted grammar and unlimited application domains. It voice-enables existing, common desktop applications without any change to their source code. For lack of space, this paper covers only the features and a brief design; the complete design document is at [2].
2. Terms used
- Knowledge-File : An XML document conforming to the schema knowledge.xsd [5]. It captures the knowledge node of a particular application/process/command/behavior.
- Bubble-Semantics : A concept introduced by Biermann et al. in [3]. The semantics give the sequence of steps to be taken to reach the focus domain.
- Input events : A mouse movement, mouse click or keystroke.
- NP : Noun Phrase.
2. Demonstration
The following session with voideX illustrates its current features.
user: compose a mail
action: a terminal opens, pine is entered on the command line, and C is pressed to put pine into compose mode.
user: dictation mode "Hello this is a test mail" command mode
action: dictation mode is turned on and later off.
user: newsgroups
action: pine first exits compose mode, then switches to newsgroups mode.
user: some music now
action: xmms starts running
user: increase the volume
action: mouse cursor travels to volume bar and drags it a little further.
user: a c++ program
action: opens emacs, sends ALT+cpp-mode
user: close this window
action: emacs is gone
user: hide the play-list
action: the play-list button on main-window in xmms is clicked with mouse.
user: no more music
action: xmms is closed
user: close all windows
action: first pine is quit, terminal is closed and the editor is also closed.
voideX can currently do all of this and much more. Performance depends heavily on the knowledge-files, here those for Xmms and Pine. The first task demonstrates its power of understanding. The focus mechanism is visible in the third task, where the system knows that the current focus is on pine; "close this window" and "close all windows" show it more clearly still. The sequence also demonstrates the ability to handle sub-dialogs.
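The focus behaviour seen in this session can be pictured as a stack of open applications, with the most recently started one on top. The following is a minimal sketch under that assumption, not the voideX implementation; the class FocusStack and all of its method names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical focus tracker: the most recently started application is on top.
class FocusStack {
    private final Deque<String> open = new ArrayDeque<>();

    // Starting an application gives it the focus.
    void start(String app) {
        open.push(app);
    }

    // The application that currently has the focus, or null if none is open.
    String focused() {
        return open.peek();
    }

    // "close this window": close only the focused application.
    String closeFocused() {
        return open.isEmpty() ? null : open.pop();
    }

    // Close a named application wherever it sits (e.g. "no more music").
    boolean close(String app) {
        return open.remove(app);
    }

    // "close all windows": close everything, focused application first.
    List<String> closeAll() {
        List<String> closed = new ArrayList<>(open);
        open.clear();
        return closed;
    }
}
```

With pine, xmms and emacs started in that order, closeFocused() removes emacs, close("xmms") removes xmms from the middle of the stack, and closeAll() then closes what is left.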
3. Design
voideX consists of the following core modules. For lack of space, only their functionality is described.
3.1 Speech Recognition Module
This module performs the task of converting speech to text.
3.2 NLP Module
The natural-language command needs parsing and tagging. Using a probabilistic parser [4] permits a largely unrestricted grammar. The command is converted to bubble semantics using the mechanism described in [3] and Section 4.1. WordNet [9] is used for morphological analysis.
3.3 Knowledge Module
This module holds all knowledge-files and answers queries about any application/command/process. The knowledge-file for an application specifies all its child processes, action events, keywords and status.
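To make the contents of a knowledge-file more concrete, here is one possible in-memory form of a knowledge node holding the four kinds of information listed above. It is only an illustration: the class KnowledgeNode, its field names and the keyword lookup are assumptions and are not taken from knowledge.xsd [5] or from the voideX source.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory form of one knowledge node; names are illustrative
// and are not taken from knowledge.xsd.
class KnowledgeNode {
    final String name;
    final List<String> keywords = new ArrayList<>();
    // Action name -> input events that perform it (event strings are made up).
    final Map<String, List<String>> actions = new LinkedHashMap<>();
    // Child processes or modes of this application.
    final List<KnowledgeNode> children = new ArrayList<>();
    // Status: whether the application is currently running.
    boolean running;

    KnowledgeNode(String name, String... keywords) {
        this.name = name;
        for (String k : keywords) {
            this.keywords.add(k.toLowerCase());
        }
    }

    KnowledgeNode addChild(KnowledgeNode child) {
        children.add(child);
        return this;
    }

    // Depth-first search for the first node that lists the given keyword.
    KnowledgeNode find(String keyword) {
        if (keywords.contains(keyword.toLowerCase())) {
            return this;
        }
        for (KnowledgeNode child : children) {
            KnowledgeNode hit = child.find(keyword);
            if (hit != null) {
                return hit;
            }
        }
        return null;
    }
}
```

A node for xmms might list the keywords "xmms" and "music" and carry a child node for the play-list window, so that a word from the spoken command can be resolved to the node it refers to.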
3.4 Action Module
With the support of the Knowledge Module, the Action Module transforms the bubble semantics from the NLP Module into a sequence of Input Events. To do so, it must maintain the desktop state and update it frequently.
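The role of the desktop state can be sketched as follows: given the sequence of steps that leads to the focus domain, only the steps that are not yet active need input events. This is a simplified illustration under stated assumptions, not the voideX Action Module; the class ActionPlanner, the step names and the event strings are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical planner: turns a sequence of steps into input events,
// skipping steps that the desktop state says are already active.
class ActionPlanner {
    // Stand-in for the Knowledge Module: step name -> events that activate it.
    private final Map<String, List<String>> eventsFor = new HashMap<>();
    // Desktop state: steps that are currently active.
    private final Set<String> active = new LinkedHashSet<>();

    void learn(String step, List<String> events) {
        eventsFor.put(step, events);
    }

    List<String> plan(List<String> steps) {
        List<String> out = new ArrayList<>();
        for (String step : steps) {
            if (active.contains(step)) {
                continue; // already reached, nothing to send
            }
            List<String> events = eventsFor.get(step);
            if (events == null) {
                throw new IllegalArgumentException("no knowledge for step: " + step);
            }
            out.addAll(events);
            active.add(step); // keep the desktop state up to date
        }
        return out;
    }
}
```

Planning the steps terminal, pine, compose on an empty desktop yields the events for all three; planning the same steps again yields no events, because the state already records them as active.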
3.5 X Module
This module finally executes the Input Events generated by the Action Module. It can create, destroy, move, hide, focus and resize windows, and it can send fake input events to applications. It is built using the Escher Java Xlib library [14].
4. Challenges and Solutions
4.1 Understanding Natural Language
The natural-language command is first tagged by the parser [4] and then converted to its bubble-semantic form, e.g. "close all terminals in desktop one" =>parser=> "" => "
4.2 Application Independence
Writing a completely new application with built-in voice control is easier than adding voice control to an external application through that application's API. voideX cannot rely on an application's API, as that would require knowing and modifying its source code. Hence voideX performs all tasks by faking Input Events, which are specified in the application's knowledge-file.
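Faking input events can be illustrated with the standard java.awt.Robot class. Note that this is a substitute chosen for the example: voideX itself uses Escher [14], whose API is not shown here. The interface InputSink and the classes RobotSink and Gestures are hypothetical names; separating gestures from the sink lets the event sequence be inspected without a display.

```java
import java.awt.AWTException;
import java.awt.Robot;
import java.awt.event.InputEvent;

// Hypothetical abstraction over whatever delivers the fake events.
interface InputSink {
    void moveTo(int x, int y);
    void button(boolean down);
    void key(int keyCode, boolean down);
}

// Delivers events through java.awt.Robot (needs a graphical display).
class RobotSink implements InputSink {
    private final Robot robot;

    RobotSink() throws AWTException {
        robot = new Robot();
    }

    public void moveTo(int x, int y) {
        robot.mouseMove(x, y);
    }

    public void button(boolean down) {
        if (down) {
            robot.mousePress(InputEvent.BUTTON1_DOWN_MASK);
        } else {
            robot.mouseRelease(InputEvent.BUTTON1_DOWN_MASK);
        }
    }

    public void key(int keyCode, boolean down) {
        if (down) {
            robot.keyPress(keyCode);
        } else {
            robot.keyRelease(keyCode);
        }
    }
}

// Composite gestures built from the primitive events.
class Gestures {
    // Click at a screen position, e.g. a button in a main window.
    static void click(InputSink sink, int x, int y) {
        sink.moveTo(x, y);
        sink.button(true);
        sink.button(false);
    }

    // Press at one position and release at another, e.g. to move a slider.
    static void drag(InputSink sink, int x1, int y1, int x2, int y2) {
        sink.moveTo(x1, y1);
        sink.button(true);
        sink.moveTo(x2, y2);
        sink.button(false);
    }

    // Press and release a single key, given a java.awt.event.KeyEvent code.
    static void tap(InputSink sink, int keyCode) {
        sink.key(keyCode, true);
        sink.key(keyCode, false);
    }
}
```

On a machine with a display, Gestures.tap(new RobotSink(), java.awt.event.KeyEvent.VK_C) would send a "C" keystroke to whichever window has the keyboard focus. The target application cannot tell these events from real user input, which is why no change to its source code is needed.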
4.3 Speech Recognition
After the departure of ViaVoice, no proper public speech recognition module is available [10]. Commercial products [8] have achieved more than 95% accuracy. The voideX speech recognition module is built on Sphinx4 [7], which is still in beta, so recognition results are not yet good.
5. Technical Details
- Projected Homepage : http://voidex.sourceforge.net
- CVS : cvs.sourceforge.net:/cvsroot/voidex [Anonymous]
- Programming Language : Java 1.4.2
External Dependencies [All Open source]
- Parser : Stanford Probabilistic Parser 1.4[4]
- Wordnet : JWNL 1.3[16]
- Xlib support for Java : Escher[14]
- XML Parser : NanoXML Parser[17]
- Speech Recognizer : Sphinx 4[7]
6. Code
The entire source code is available through anonymous CVS at cvs.sourceforge.net:/cvsroot/voidex. The code is professionally written, well documented and accompanied by test cases. All configuration is XML based.
7. Conclusion and Results
"Carrying on a full conversation with a computer as one might with another person is well beyond the state of art today" [3]. A system that can be operated by voice is convenient for all users, and especially for those with disabilities or with little computer experience. Linux is far behind in this technology. voideX addresses this not only by building a natural-language interface for Linux but also by providing a method to voice-enable any Linux application. The demonstration shows what voideX can do. A desktop implementation is ready for use; it has been tested with support for Xmms, Console, Pine and Mozilla on IceWM, and the results are impressive. With the project in the public domain [1], voideX can be expected to revolutionize desktop computing on Linux.
Here are a few constraints that remain to be worked on:
- voideX doesn't understand commands like "increase xmms's volume"; the command has to be "Increase volume in xmms". Similarly, "close third desktop windows" will not work; it should be "close windows in desktop 3".
- The knowledge-files are static, so if the location of the volume bar changes, voideX will fail to find it.
- voideX doesn't tolerate manual user intervention. The user must not move the mouse while voideX is running, because voideX controls the mouse and keyboard itself.
- Knowledge-files are written by hand. A UI for generating them is planned first; automatic generation is to follow.
8. References
[1] voideX homepage. http://voidex.sourceforge.net
[2] Complete voideX design. http://voidex.sourceforge.net/docs/design.pdf
[3] Biermann et al. An imperative sentence processor for voice interactive office applications. ACM TOIS 1985.
[4] Stanford parser. http://www-nlp.stanford.edu/software/lex-parser.shtml
[5] Knowledge File Schema. http://voidex.sourceforge.net/docs/knowledge.xsd
[6] Sphinx. http://cmusphinx.sourceforge.net
[7] Sphinx 4. http://cmusphinx.sourceforge.net/sphinx4/
[8] Commercial Speech Recognition Engines. http://www.scansoft.com
[9] Wordnet. http://www.cogsci.princeton.edu/~wn/index.shtml/
[10] http://asia.cnet.com/builder/program/unix/0,39009368,39195316,00.htm
[11] Human Computer Interaction: Input Devices. CSUR 1996.
[12] Xvoice. http://xvoice.sourceforge.net
[13] Sample XMMS knowledge file. http://voidex.sourceforge.net/knowledge-files/xmms.xml
[14] Escher. http://escher.sourceforge.net/
[15] Eclipse. http://www.eclipse.org/
[16] JWNL. http://jwordnet.sourceforge.net/
[17] NanoXML parser. http://nanoxml.sourceforge.net
[18] A comparison of voice controlled and mouse controlled web browsing. ACM SIGCAPH 2000.