Speech To Text

Windows Speech Recognition (WSR), introduced with Windows Vista, required users to write their own macros, so it was mostly used by very advanced users.

After Microsoft bought the "voice portal" company TellMe in 2007, Microsoft put voice command technology in Windows Phone 7 and 8, but not on desktops.

Intel worked with Nuance to bring the technology behind Nuance's Dragon NaturallySpeaking and the Sync system in Ford cars to the new Dragon Assistant app for Windows Ultrabooks.

Poor Man's Speech to Text

Many laptops come with a microphone built in.

The computer can be made to recognize your voice if you are willing to jump among 3 programs: Sound Recorder, Windows Explorer, and a Run command window.

The Run command window would run a custom batch file.

Create a sound file. Windows desktop users can use SoundRecorder.exe. Its default format is .wma. To save .wav files instead, rather than going to Windows Start > All Programs > Accessories, go to Windows Start and type in the Search box:
```
soundrecorder /file outputfile.wav
```
The Sound Recorder GUI says there is a maximum length of 60 seconds. But this video shows a workaround: after the initial 60 seconds is reached, select Edit > Copy from the menu, then File > New. Click No in the pop-up. Then select Edit > Paste Insert to add an extra 60 seconds to the Length.
After recording is stopped, by default files are named outputfile.wav and saved to folder C:\Users\%USERNAME%\Documents.
Ideally, the program would chunk the sound rather than sending a long file.
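
The chunking idea can be sketched in Python with the standard wave module. This is only an illustration under assumptions: the function name split_wav and the 10-second default are hypothetical, and nothing here comes from AT&T's SDK. Each chunk is written as a complete .wav file in memory so it could be sent as its own request.

```python
import io
import wave


def split_wav(data: bytes, chunk_seconds: float = 10.0) -> list[bytes]:
    """Split the bytes of one WAV file into several shorter, complete WAV files.

    split_wav and chunk_seconds are illustrative names, not part of any SDK.
    """
    chunks = []
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frames_per_chunk = max(1, int(src.getframerate() * chunk_seconds))
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Copy channels, sample width and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            chunks.append(buf.getvalue())
    return chunks
```

Splitting on a fixed time boundary can cut a word in half, so a real program would more likely split on silence.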

Run a command to obtain text from the speech file.

Format the response to strip out metadata so only the speech text shows.

AT&T Speech to Text

AT&T's speech capabilities have been improving for decades. Available from the Developer Portal is the SDK ATTSpeechKit.zip for iOS and Android, which:

Captures audio from the microphone. Once audio capture starts, it continues until the end of the phrase spoken by the end-user is detected or the maximum recording time is reached.
Compresses audio data (into AMR files).
Streams data to AT&T's Speech API servers.
Waits for the response, allowing the user to cancel.
Displays response feedback and progress.

AT&T's Watson℠ speech engine differentiates itself from competitors by offering a robust library of speech contexts optimized for specific applications.

Generic - full sentences; automatically detects and transcribes English and Spanish.
WebSearch - short speech query phrases common in Web searches.
BusinessSearch - short speech phrases including geographic locations.
QuestionandAnswer - short speech sentences that include questions.
Voicemail - long-form speech, such as voice mail input.
Sms - Transcribes spoken audio into text output for text messaging.
UverseEpg - (EPG being electronic programming guide) tuned to TV show names, actors, and networks (as in the IMDB & Netflix apps).

One of these is specified in requests as the value for header X-SpeechContext.

BLAH: Unfortunately, one cannot tweak the voice recognition algorithms to improve accuracy for individual users.

OAuth 2.0 Authentication

AT&T's Speech to Text web service works over all networks and uses OAuth for access authentication, as demonstrated by a batch command file that invokes cURL.
TOOL: cURL.exe (the SSL-enabled build for https) makes HTTP requests, as a web browser would, from within a Run command window.

curl "https://api.att.com/oauth/token" \
    --insecure \
    --header "Accept: application/x-www-form-urlencoded" \
    --header "Content-Type: application/x-www-form-urlencoded" \
    --data "client_id=YOUR_APP_KEY&client_secret=YOUR_APP_SECRET&grant_type=client_credentials&scope=SPEECH" \
    --proxy "https://proxy.if.you.use.one.com:8080"

--insecure (short form -k) tells cURL to skip verification of the server's SSL certificate.

TIP: I've found that -k is needed to make the request work.

Even though certificate verification is skipped, each request is tied to a developer: replace YOUR_APP_KEY and YOUR_APP_SECRET with the values issued when a registered developer registers the app.

The response includes a refresh token for use when the original access token expires. Since the expiry_on field is 0, the token won't expire. Nevertheless, it is still wise to add logic to handle the expiry case, in case the access token expiry policy changes in the future (AT&T would just change it without advance notice...probably during a major release). Using the refresh token to get a new access token is very similar to the original request. The payload would look like this:

client_id=ABCDEF0123456789ABCDEF0123456789&client_secret=ABCDEF0123456789&grant_type=refresh_token&refresh_token=ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789
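
As a rough sketch, both payloads (the initial client_credentials request and the refresh_token request) can be built with Python's urllib.parse.urlencode, which also escapes any special characters in the values. The function name token_payload is hypothetical; the field names are the ones shown in the payloads above.

```python
from urllib.parse import urlencode


def token_payload(client_id, client_secret, refresh_token=None):
    """Build the form-encoded body for the token endpoint.

    token_payload is an illustrative helper, not part of AT&T's SDK.
    Without a refresh token, it asks for a new access token; with one,
    it asks to renew an expired access token.
    """
    fields = {"client_id": client_id, "client_secret": client_secret}
    if refresh_token is None:
        fields["grant_type"] = "client_credentials"
        fields["scope"] = "SPEECH"
    else:
        fields["grant_type"] = "refresh_token"
        fields["refresh_token"] = refresh_token
    return urlencode(fields)


# Placeholder values, as in the curl example.
print(token_payload("YOUR_APP_KEY", "YOUR_APP_SECRET"))
```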

Requests

The Access_Token returned during authentication is inserted into every subsequent client request that sends an audio file to the server to get text back, as demo'd by this batch command:

curl "https://api.att.com/rest/2/SpeechToText" \
    --insecure \
    --request POST \
    --header "Authorization: Bearer <Access_Token>" \
    --header "Accept: application/json" \
    --header "Content-Length: 3444" \
    --header "X-SpeechContext: Generic" \
    --header "Content-type: audio/wav" \
    --data-binary "@<audio_file>" \
    --proxy "https://proxy.if.you.use.one.com:8080"

The content length (3444 in the example above) is the size of the audio file in bytes, which the Windows dir command reports.

Alternatively, "Transfer-Encoding: chunked" specifies using the standard HTTP 1.1 streaming mechanism in 512 bit chunks.

--data-binary specifies the HTTP body, which contains only audio data.

The Speech API does not support audio data nested in MIME multipart documents. MIME is the format used by file uploads in HTML forms.
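
For comparison, here is a minimal Python sketch of the same request using the standard urllib.request module. The URL and header names are taken from the curl example and the context names from the list earlier on this page; the function name build_speech_request is hypothetical. The body is the raw audio bytes, not a multipart form, and the content length is simply the number of bytes. The sketch only builds the request and does not send it.

```python
import urllib.request

SPEECH_URL = "https://api.att.com/rest/2/SpeechToText"  # from the curl example

# Speech contexts listed earlier on this page.
CONTEXTS = {"Generic", "WebSearch", "BusinessSearch", "QuestionandAnswer",
            "Voicemail", "Sms", "UverseEpg"}


def build_speech_request(audio, access_token, context="Generic",
                         content_type="audio/wav"):
    """Build (but do not send) a POST request carrying raw audio bytes.

    build_speech_request is an illustrative helper, not an AT&T API.
    """
    if context not in CONTEXTS:
        raise ValueError("unknown speech context: " + context)
    headers = {
        "Authorization": "Bearer " + access_token,
        "Accept": "application/json",
        "Content-Type": content_type,
        "Content-Length": str(len(audio)),  # size of the body in bytes
        "X-SpeechContext": context,
    }
    return urllib.request.Request(SPEECH_URL, data=audio,
                                  headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen would send it; that step is left out so the sketch runs without network access or a valid token.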

Multiple Requests

Several cURL requests can be issued from within a batch command file by looping over a set of files (at the command prompt, use %f instead of %%f):

for %%f in (*.xml) do curl -F importDataFile=@%%f http://...ImportServlet 1>%%f.out

Individual commands are expanded thus for "some.xml":

curl -F importDataFile=@some.xml http://...ImportServlet 1>some.xml.out

Rather than .wav, AT&T prefers the .amr (Adaptive Multi-Rate) codec, developed by Ericsson, which uses the Algebraic Code Excited Linear Prediction (ACELP) algorithm to efficiently compress human speech recordings on 3G cell phones for MMS (Multimedia Messaging Service).

More specifically, AMR narrowband, 12.2 kbit/sec, 8 kHz sampling.

Get from Github the sample C# RESTful web program as run from here.

The response looks like this:

{"Recognition":{"Status":"OK","ResponseId":"71b9410cd5259ec81e49abf892eabd44","NBest":[{"WordScores":[0.58,0.05],"Confidence":0.315,"Grade":"confirm","ResultText":"You ...","Words":["You","..."],"LanguageId":"en-US","Hypothesis":"you um"}]}}

To strip out meta data and only present the ResultText:
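
One possible approach, sketched in Python with the standard json module. It assumes the response keys are spelled "Recognition", "Status", "NBest" and "ResultText" (without stray spaces), and the function name result_texts is hypothetical.

```python
import json


def result_texts(response_body):
    """Return only the recognized text of each hypothesis in a response.

    Assumes the key names Recognition, Status, NBest and ResultText.
    Returns an empty list when the status is anything other than OK.
    """
    recognition = json.loads(response_body).get("Recognition", {})
    if recognition.get("Status") != "OK":
        return []
    return [h.get("ResultText", "") for h in recognition.get("NBest", [])]


# A shortened response with made-up values, for illustration only.
sample = ('{"Recognition":{"Status":"OK","NBest":[{"Confidence":0.9,'
          '"ResultText":"Hello world.","Hypothesis":"hello world"}]}}')
print(result_texts(sample))
```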

Reformatting Speech Files

The ATT Speech API recognizes several file formats: { ".amr", "audio/amr" }, { ".wav", "audio/wav" }, {".awb", "audio/amr-wb"}, {".spx", "audio/x-speex"}

WARNING: The ATT Speech API only processes wav (Microsoft Waveform) files containing 16-bit linear PCM, not 8-bit. Otherwise it returns errors such as:

"Text": "audiostream-wav: only pcm linear 16 supported (found pcm linear 8)"
"Text": "RIFF/WAVE coding [85] is not ALAW, ULAW or PCM"

A .wav file can optionally contain a RIFF header in addition to raw PCM (Pulse Code Modulation) audio bits with sampling (Project) rate of 8kHz (8000 Hz) or 16kHz.

WARNING: The ATT API requires the format to be mono (not stereo).
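
Before uploading, a .wav file can be checked against these warnings with Python's standard wave module. This is a sketch under the constraints stated on this page (mono, 16-bit PCM, 8 kHz or 16 kHz); the function name wav_problems is hypothetical. Note that wave.open itself raises wave.Error for codings it cannot read, such as ALAW or ULAW.

```python
import io
import wave


def wav_problems(data):
    """List the ways the bytes of a WAV file break the constraints above.

    wav_problems is an illustrative helper; an empty list means the
    file is mono, 16-bit PCM, sampled at 8 kHz or 16 kHz.
    """
    problems = []
    with wave.open(io.BytesIO(data), "rb") as w:
        if w.getnchannels() != 1:
            problems.append("not mono (%d channels)" % w.getnchannels())
        if w.getsampwidth() != 2:  # sample width is reported in bytes
            problems.append("not 16-bit PCM (%d-bit)" % (8 * w.getsampwidth()))
        if w.getframerate() not in (8000, 16000):
            problems.append("sampling rate %d Hz" % w.getframerate())
    return problems
```

To check a file on disk, read it in binary mode and pass the bytes to the function.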

TOOL: sox from SourceForge can reformat to mono.

TOOL: Cool Edit from www.syntrillium.com can change both non-audio data and PCM bits in a .wav file.

NOTE: AMR files used by Nokia and Ericsson phones contain a "#!AMR\n" header. There is also a 3gpp standard AMR format. http://www.connactivity.com/~eaw/amrwork/ published a python script to convert between them.

Conversion to .wav is necessary to edit an .amr file:

ffmpeg -i josie-ring.amr josie-ring.wav

After editing, convert the wave file to a full size music mp3:

ffmpeg -i josie-ring.wav -ab 128k -ac 2 -ar 44100 josie-ring.mp3

AT&T Speech to Text

AT&T's speech engine is powered by Watson.

In-line hints are pre-pended to voice data.

In-line grammar sends in whole grammar set.

PLS

SRGS

Demo from ivee (talking alarm clocks) Jonathan David Ger

Brent
