Speech Part 2 - How to Add Simple Dictation speech recognition to your Delphi Ap
Delphi 5.x
Author: Alec Bergamini

I'd like to be able to speak into my computer's microphone and have what I say 
translated into text that is entered into my application. How can I do this?



This article shows how you can add simple dictation speech recognition capabilities 
to an application. First some technical considerations are discussed, followed 
by the creation of a small application that lets the user add words to a 
TMemo by speaking into a microphone attached to the computer's sound card. 

If you did not read Speech Part 1 you should go read it now. It tells you how to 
get the Microsoft Speech API 5.1 (SAPI) and install it on your system and in 
Delphi. 

SAPI version 5.1 supports two distinct types of speech recognition: dictation and 
command and control. It’s important to understand the differences between the two 
in order to make correct decisions in the design of your speech enabled 
applications. 

Dictation Speech Recognition 

Dictation refers to a type of speech recognition where the machine listens to what 
you say and attempts to translate it into text. This all happens inside the speech 
engine and you don’t need to worry about it although a little theory may be 
helpful. Most modern dictation engines use a scheme where they listen to what’s 
said and break what they hear down into a series of word hypotheses. Each word 
hypothesis may actually contain a list of possible words with each word given some 
probability of correctness. So, for example, if I say “The quick red fox” the 
computer will likely break this down into 4 separate word hypotheses. The “fox” 
hypothesis may contain the possibilities of “fax”, “box”, “fix”, etc. These 
individual word hypotheses are then “put in context”. That is, each word is 
considered in relation to the words that came before and after. Based on the rules 
of context the speech engine comes to a final “best” decision about what was spoken 
and returns it to the application. In dictation, context is the name of the game. 
For this reason, dictation engines are considered to be contextual. (My apologies 
to any ASR scientists reading this for this minimalist explanation.) 
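
To make the idea concrete, here is a toy console program that picks one word from a list of candidates by combining a score for how the word sounded with a weight for how well it fits the previous word. Everything in it is invented for illustration: the candidate scores, the ContextWeight function and its numbers are assumptions, and a real engine's models are far more elaborate. It does not use SAPI at all.

```pascal
program HypothesisDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

type
  // One candidate word plus a made-up "how it sounded" score.
  TCandidate = record
    Spelling: string;
    Score: Double;
  end;

const
  // Hypothetical candidates for the last word of "The quick red fox".
  Candidates: array[0..3] of TCandidate = (
    (Spelling: 'fax'; Score: 0.30),
    (Spelling: 'fox'; Score: 0.28),
    (Spelling: 'box'; Score: 0.22),
    (Spelling: 'fix'; Score: 0.20));

// Invented context weight: how plausible is Candidate after PrevWord?
function ContextWeight(const PrevWord, Candidate: string): Double;
begin
  if (PrevWord = 'red') and (Candidate = 'fox') then
    Result := 0.60
  else if (PrevWord = 'red') and (Candidate = 'box') then
    Result := 0.30
  else
    Result := 0.05;
end;

var
  I, Best: Integer;
  Combined, BestScore: Double;
begin
  Best := 0;
  BestScore := -1.0;
  for I := Low(Candidates) to High(Candidates) do
  begin
    // Combine the sound score with the context weight.
    Combined := Candidates[I].Score *
      ContextWeight('red', Candidates[I].Spelling);
    WriteLn(Format('%-4s sound=%.2f combined=%.3f',
      [Candidates[I].Spelling, Candidates[I].Score, Combined]));
    if Combined > BestScore then
    begin
      BestScore := Combined;
      Best := I;
    end;
  end;
  WriteLn('Best guess in context: ', Candidates[Best].Spelling);
end.
```

Judged by sound alone, “fax” has the highest score in this made-up data. Once the previous word “red” is taken into account, “fox” wins, which is the kind of correction context makes possible.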

As you may imagine, the accuracy of dictation ties directly to the CPU's speed and 
the system's available memory. The more resources there are, the more context can 
be considered in a reasonable amount of time, and the more likely it is that the 
resulting recognition will be accurate. The truth is that the basic principles of 
speech recognition have not changed in over 20 years. What has changed is the 
power of the PC, and it’s the processing power of modern PCs that makes speech 
recognition finally usable. 

Also important to accurate dictation recognition is the engine having some 
understanding of the individual speaker’s voice. First, speech engines are specific 
to language and possibly even region. This is why we see English engines and French 
engines and Chinese engines, etc. Beyond languages though, there are differences 
(sometimes extreme) within a language. A 5 year old girl sounds very different to 
the computer than a 47 year old man. This is why most current dictation engines 
require voice training. 

If you have SAPI 5.1 installed, go to your system’s Control Panel and click the 
Speech icon. On the speech recognition tab you will find a button called “Train 
Profile...” that brings up the voice training wizard. If you haven’t already done 
so, you should take the time to complete at least one session. The more sessions 
you complete, the more accurate you can expect the dictation recognition to be. By 
the way, you have access to this wizard from the SDK and you can even provide the 
text for your own personal training sessions. In fact, taking a fairly long 
document that you’ve written in your own particular style and using that to train 
the engine can dramatically improve your own personal dictations. 

Command and Control Speech Recognition 

While dictation recognition is used primarily for recording what a user says and 
translating it into text, Command and Control (CnC) speech recognition is used for 
controlling applications. In the same way that you click your mouse on the browser 
icon on your desktop to access the internet you could speak “Computer, run 
browser”, or even better “Computer, go to” to accomplish the same thing. 
Currently you are used to controlling your computer with your mouse and keyboard. 
CnC recognition adds a third input device, your voice. 

CnC speech recognition is fundamentally different from dictation recognition in 
that it is recognition without regard to context. That is, there are no CPU cycles 
spent trying to determine if a word is correct by looking at the words that come 
before or after. For this reason CnC recognition is often also known as context 
free recognition. 

Instead of using context, CnC recognition uses pre-defined grammars. These grammars 
contain rules, and each rule can then have a programmed response. So, in developing 
an application that uses CnC recognition the programmer defines both the grammar 
and the rules as well as the response to the recognition of each rule. 

If grammars and rules are managed properly, CnC recognition can be much more 
accurate than dictation recognition. This is because the number of words that need 
to be recognized for CnC is only a subset of the universe of words needed for 
dictation. With CnC the engine only needs to worry about the words in the active 
grammar, not all the words in the dictionary. Fewer possibilities mean better 
accuracy. 

It turns out that dictation recognition is much more difficult for speech engine 
developers than CnC recognition. But for us as application developers it is much 
easier to implement simple dictation in an application than CnC, because with 
dictation we don’t need to worry about writing a grammar. For this reason I’m 
starting with dictation instead of CnC. I’ll probably do CnC and grammar 
development in the next article. 

A Simple Dictation Application 

All right then, let’s build an application that takes dictation. 

[ Delphi 6 users – In the process of testing this sample in Delphi 6 I ran into a 
known problem with event sinks generated from type library imports. See Article 
number 2590 for more information and some workarounds. ] 

Start up Delphi (5 or 6; 4 might work too but I didn’t try it). On the SAPI5 palette 
(see Speech Part 1 if you don’t have one) find the TSpSharedRecoContext component 
and drop it on your form along with a TMemo component. 

Add the ActiveX unit to your uses clause and add a private field to the form called 
fMyGrammar of type ISpeechRecoGrammar. 

Create an OnCreate event for the form, plus OnRecognition and OnHypothesis events 
for the SpSharedRecoContext component. Your complete unit should look something 
like this: 

unit Unit1;

interface

uses
  // Variants exists only in Delphi 6 and later; remove it under Delphi 5.
  Windows, Messages, SysUtils, Variants, Classes, Graphics, Controls, Forms,
  Dialogs, OleServer, SpeechLib_TLB, StdCtrls, ActiveX, ComCtrls;

type
  TForm1 = class(TForm)
    SpSharedRecoContext1: TSpSharedRecoContext;
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
    procedure SpSharedRecoContext1Recognition(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      RecognitionType: TOleEnum; var Result: OleVariant);
    procedure SpSharedRecoContext1Hypothesis(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      var Result: OleVariant);
  private
    { Private declarations }
    fMyGrammar: ISpeechRecoGrammar;
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.FormCreate(Sender: TObject);
begin
  // Create a grammar on the shared context and switch dictation on.
  fMyGrammar := SpSharedRecoContext1.CreateGrammar(0);
  fMyGrammar.DictationSetState(SGDSActive);
end;

procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
begin
  // Final result: show the text the engine settled on.
  Memo1.Text := Result.PhraseInfo.GetText;
end;

procedure TForm1.SpSharedRecoContext1Hypothesis(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  var Result: OleVariant);
begin
  // Interim guess: show what the engine thinks so far.
  Memo1.Text := Result.PhraseInfo.GetText;
end;

end.

Compile and run this. Speak something. Your words should appear in the memo field. 
If they do not, shut the application down and:

Make sure your microphone is not muted. 
Use the Speech control panel applet to make sure that your microphone and the 
recognition engine are working properly. 

Now try it again. 


The SAPI 5.1 automation objects support both dictation and CnC speech recognition. 
Of the 19 components installed the following 4 are central to speech recognition. 

The SpInprocRecognizer represents a speech recognition engine that is instantiated 
in the same process as the application. 

The SpSharedRecognizer represents an instance of a speech recognition engine that 
is shared by many applications. 

The SpInprocRecoContext is a recognition context that uses a SpInprocRecognizer. 

The SpSharedRecoContext is a recognition context that uses a SpSharedRecognizer. 

Shared vs. Inprocess 

An application can use either an inprocess instance of a speech engine 
(SpInprocRecognizer) or an instance that is shared with other applications 
(SpSharedRecognizer).  The inprocess recognizer claims resources for the 
application, so, for example, once an inprocess recognizer claims the system’s 
microphone, no other application can use it. 

A shared recognizer runs in a separate process from the application and, as a 
result, it can be shared with other applications. This allows multiple applications 
to share system resources (like the microphone). 

In our sample we are using a shared engine. In most desktop applications shared is 
the way to go. Using a shared recognizer allows your application to play nicely 
with other speech enabled applications on your system. If your application is 
targeted for some dedicated machine like one running a telephone voice response 
application then the inprocess approach would be appropriate. Inprocess recognition 
is somewhat more efficient than shared recognition. 

Recognition Contexts 

A recognition context is an object that manages the relationship between the 
recognition engine object (the recognizer) and the application. Do not confuse the 
use of the word “context“ as used here with its usage in “context free grammar”.   

A single recognizer can be used by many contexts. For example, a speech enabled 
application with 3 forms will likely have a single engine instance with a separate 
context for each form. When one form gets the focus its context becomes active and 
the other two forms’ contexts are disabled. In this way, only the commands relevant 
to the one form are recognized by the engine. Another example is seen in Microsoft 
Word XP, where there is one context for dictation and another context for issuing 
menu commands. 
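
As a rough sketch of the per-form idea, using only the calls already in the sample unit, a form can switch its dictation grammar off when it loses the focus and back on when it regains it. This approximates enabling and disabling a context rather than doing exactly that. FormActivate and FormDeactivate are assumed to be declared in TForm1 and hooked to the form's OnActivate and OnDeactivate events; SGDSInactive is the counterpart of the SGDSActive constant used in FormCreate.

```pascal
// Assumed to be wired to TForm1.OnActivate in the Object Inspector.
procedure TForm1.FormActivate(Sender: TObject);
begin
  // Resume dictation when this form gets the focus.
  if Assigned(fMyGrammar) then
    fMyGrammar.DictationSetState(SGDSActive);
end;

// Assumed to be wired to TForm1.OnDeactivate in the Object Inspector.
procedure TForm1.FormDeactivate(Sender: TObject);
begin
  // Stop dictating into this form while another form has the focus.
  if Assigned(fMyGrammar) then
    fMyGrammar.DictationSetState(SGDSInactive);
end;
```

Note that OnDeactivate fires when focus moves to another form in the same application, so this sketch does not react to the user switching to a different program.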

The recognition context is the primary means by which an application interacts with 
SAPI. It is the object you use to start and stop recognition and it is the object 
that receives the event notifications when something is recognized. Further, the 
recognition context controls which words (grammars and/or dictation) are 
recognized. By setting recognition contexts, applications limit or expand the scope 
of the words needed for a particular aspect of the application. This granularity 
for speech recognition improves the quality of recognition by removing words not 
needed at that moment. Conversely, the granularity also allows words to be added to 
the application if needed. 

In our example above we do the simple thing (at least programmatically) and just 
load dictation. This means that the engine will attempt to recognize all words. The 
other possibility is to load one or more specific grammars. Grammars are a big 
subject and will be covered in a later article. 

There’s a lot more on the subjects of recognition contexts and inprocess vs. shared 
recognizers in the SAPI 5.1 documentation but for now that’s enough to talk about 
the sample code. 

What the sample code does 

First, here is the form’s OnCreate event. 

procedure TForm1.FormCreate(Sender: TObject);
begin
  fMyGrammar := SpSharedRecoContext1.CreateGrammar(0);
  fMyGrammar.DictationSetState(SGDSActive);
end;

Just two lines of code to set the whole recognition process in motion. First we 
need to create a grammar object (CreateGrammar) for the engine and then we instruct 
this grammar that it is to attempt to recognize all words by calling 
DictationSetState(SGDSActive). 

Notice that neither on the form nor in the code do we ever instantiate a 
SpSharedRecognizer. This is because SAPI is smart enough to create the shared 
recognizer object for us automatically when the SpSharedRecoContext is created. 

Next we need some way for the application to be informed by the engine when it 
recognizes something. This is done through the OnRecognition event. 

procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
begin
  Memo1.Text := Result.PhraseInfo.GetText;
end;

Of the various parameters passed in the OnRecognition event, the Result parameter is 
the key. Although declared as an OleVariant for interprocess communications it’s 
really an object with an ISpeechRecoResult interface. This interface lets you get 
all sorts of information about what was said and what the recognizer understood. 
Some of the information available through this interface includes: the words 
recognized, a rating of the engine’s confidence in the recognition, when the 
recognition happened and how long it took. You can even play back the audio for 
what was said. Much of the information returned is only useful for context free 
grammars and doesn’t apply to dictation. 

In the sample we just call the GetText method to return the text of what the engine 
recognized. 
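
As a hedged sketch of digging a little deeper into the result, the handler below lists each recognized word with its confidence rating. It replaces the OnRecognition handler in the sample unit and uses late binding through the OleVariant, as the sample does. The member names Elements, Count, Item, DisplayText and ActualConfidence are taken from the SAPI 5.1 automation interfaces as I understand them; check them against the SDK documentation before relying on this.

```pascal
procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
var
  Elements: OleVariant;
  I, Count, Confidence: Integer;
  Phrase, WordText: string;
begin
  // The whole phrase, exactly as in the original sample.
  Phrase := Result.PhraseInfo.GetText;
  Memo1.Lines.Add(Phrase);

  // One element per recognized word (late-bound calls on the variant).
  Elements := Result.PhraseInfo.Elements;
  Count := Elements.Count;
  for I := 0 to Count - 1 do
  begin
    WordText := Elements.Item(I).DisplayText;
    // Assumed rating scale: -1 = low, 0 = normal, 1 = high confidence.
    Confidence := Elements.Item(I).ActualConfidence;
    Memo1.Lines.Add(Format('  %s (confidence %d)', [WordText, Confidence]));
  end;
end;
```

Copying each late-bound value into a typed local variable first keeps the variant-to-string and variant-to-integer conversions explicit and avoids passing variants straight into Format.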

The OnRecognition event only fires when the engine is satisfied that the user has 
uttered a complete phrase and that it has made its best guess about what the user 
said. You could run the sample application with only this event defined and it 
would work. 

I added the OnHypothesis event so you could get a feel for how the engine, working 
in dictation mode, uses all the words together (in context) to create hypotheses, 
make corrections, and, finally, come to a decision about what was said. 

That’s enough for now 

Speech recognition is a very big subject. I’ve scratched the surface of dictation 
speech recognition but there is much more. To write a really usable dictation 
application the user will need ways to correct mistakes and give the speech 
recognition engines commands like “Bold the last 3 words”. While possible with SAPI 
this level of discussion is beyond the scope of this introduction. I urge you to 
study the documentation that comes with the SAPI SDK. 
I haven’t given much more than passing mention of CnC context free grammars. CnC 
recognition and grammars will be the next article. 

OK not quite enough 

I couldn’t leave the sample application alone. Here’s a slightly modified version 
that is a bit more satisfying in that it lets you keep multiple utterances. 

type
  TForm1 = class(TForm)
    SpSharedRecoContext1: TSpSharedRecoContext;
    Memo1: TMemo;
    procedure FormCreate(Sender: TObject);
    procedure SpSharedRecoContext1Recognition(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      RecognitionType: TOleEnum; var Result: OleVariant);
    procedure SpSharedRecoContext1Hypothesis(Sender: TObject;
      StreamNumber: Integer; StreamPosition: OleVariant;
      var Result: OleVariant);
  private
    { Private declarations }
    fMyGrammar: ISpeechRecoGrammar;
    CurrentText: string;  // everything recognized so far
  public
    { Public declarations }
  end;

var
  Form1: TForm1;

implementation

{$R *.dfm}

procedure TForm1.FormCreate(Sender: TObject);
begin
  fMyGrammar := SpSharedRecoContext1.CreateGrammar(0);
  fMyGrammar.DictationSetState(SGDSActive);
end;

procedure TForm1.SpSharedRecoContext1Recognition(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  RecognitionType: TOleEnum; var Result: OleVariant);
begin
  // Append the final phrase to what we already have and remember it.
  Memo1.Text := CurrentText + Result.PhraseInfo.GetText;
  // The trailing space keeps the next utterance from running into this one.
  CurrentText := Memo1.Text + ' ';
end;

procedure TForm1.SpSharedRecoContext1Hypothesis(Sender: TObject;
  StreamNumber: Integer; StreamPosition: OleVariant;
  var Result: OleVariant);
begin
  // Show the interim guess after the saved text without storing it.
  Memo1.Text := CurrentText + Result.PhraseInfo.GetText;
end;

end. //really


Copyright © Mendozi Enterprises LLC