Speech To Text
Here is a report on various ways to make your computer talk.
Apple set the bar when it bought Siri and included it in the iPhone 4S launch in 2011. As its promotional videos show, Siri makes the iPhone a personal assistant by combining voice recognition with task automation.
Using voice alone, Siri enables users to compose and send text messages and emails, schedule meetings, ask for directions, set reminders, and so on.
Additionally, Siri added semantic technology so it understands search requests spoken in plain English, like, "What is the largest city in Texas?".
Google has voice recognition built-in.
The Windows Speech Recognition (WSR) introduced with Windows Vista required users to write their own macros, so it was mostly used by very advanced users.
After Microsoft bought the "voice portal" company TellMe in 2007, Microsoft put voice command technology in Windows Phone 7 and 8, but not on desktops.
Intel worked with Nuance to bring the speech technology behind Dragon NaturallySpeaking and the Sync system in Ford cars to the new Dragon Assistant app for Windows Ultrabooks.
Many laptops come with a microphone built in.
The computer can be made to recognize your voice if you are willing to jump among 3 programs: Sound Recorder, Windows Explorer, and the Run command window.
The Run command window would run a custom batch file containing:
soundrecorder /file outputfile.wav
The Sound Recorder GUI says there is a maximum length of 60 seconds. But this video shows a workaround: after the initial 60 seconds is reached, select Edit > Copy from the menu, then File > New, and click No in the pop-up. Then select Edit > Paste Insert; each further Edit > Paste Insert adds an extra 60 seconds to the Length.
After recording is stopped, by default files are named outputfile.wav and saved to folder C:\Users\%USERNAME%\Documents.
Ideally, the program would chunk the sound rather than sending a long file.
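As an illustration of that chunking idea, here is a minimal Python sketch that splits a recording into shorter .wav files before any upload. It assumes the recording is an uncompressed PCM .wav; the function name split_wav, the 10-second default, and the output naming are hypothetical choices, not anything Sound Recorder or AT&T prescribes.

```python
import wave


def split_wav(path, chunk_seconds=10, prefix="chunk"):
    """Split a PCM .wav file into pieces of at most chunk_seconds each.

    Returns the list of file names written (prefix_000.wav, prefix_001.wav, ...).
    """
    out_paths = []
    with wave.open(path, "rb") as src:
        params = src.getparams()
        frames_per_chunk = int(src.getframerate() * chunk_seconds)
        index = 0
        while True:
            frames = src.readframes(frames_per_chunk)
            if not frames:
                break  # end of the recording
            out_path = "%s_%03d.wav" % (prefix, index)
            with wave.open(out_path, "wb") as dst:
                # Same channels, sample width and rate as the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            out_paths.append(out_path)
            index += 1
    return out_paths
```

Each piece keeps the source's sample rate and width, so it can be submitted the same way as the full file.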
AT&T's speech capabilities have been improving for decades. Available from the Developer Portal is the SDK ATTSpeechKit.zip for iOS and Android.
AT&T's Watson℠ speech engine differentiates itself from competitors by offering a robust library of speech contexts optimized for specific applications.
One of these is specified in requests as the value for header X-SpeechContext.
BLAH: Unfortunately, one cannot tweak the voice recognition algorithms to improve accuracy for individual users.
curl "https://api.att.com/oauth/token" \
  --insecure \
  --header "Accept: application/x-www-form-urlencoded" \
  --header "Content-Type: application/x-www-form-urlencoded" \
  --data "client_id=YOUR_APP_KEY&client_secret=YOUR_APP_SECRET&grant_type=client_credentials&scope=SPEECH" \
  --proxy "https://proxy.if.you.use.one.com:8080"
--insecure (short form -k) allows HTTPS connections without verifying the server's SSL certificate.
TIP: I've found that -k is needed to make the request work.
Even though certificate checking is skipped, requests are still authenticated: substitute YOUR_APP_KEY and YOUR_APP_SECRET with the values issued when a developer registers the app.
The response includes a refresh token for use when the original access token expires. But since the expiry_on field is 0, the token won't expire. Nevertheless, it is still wise to add logic to handle the expiry case, in case the access token expiry policy changes in the future (AT&T would just change it without advance notice...probably during a major release). Requesting a new access token with the refresh token is very similar to requesting the original access token. The payload would look like this:
client_id=ABCDEF0123456789ABCDEF0123456789&client_secret=ABCDEF0123456789&grant_type=refresh_token&refresh_token=ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789
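To show how alike the two token requests are, here is a small Python sketch that builds both form-encoded payloads. The field names come from the curl example and the payload above; the function names are hypothetical, and nothing is sent to AT&T here.

```python
from urllib.parse import urlencode


def access_token_payload(app_key, app_secret, scope="SPEECH"):
    """Body for the initial client_credentials token request."""
    return urlencode({
        "client_id": app_key,
        "client_secret": app_secret,
        "grant_type": "client_credentials",
        "scope": scope,
    })


def refresh_token_payload(app_key, app_secret, refresh_token):
    """Body for trading a refresh token for a new access token."""
    return urlencode({
        "client_id": app_key,
        "client_secret": app_secret,
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
    })
```

Either string can be passed to curl's --data option. Only the grant_type and the last field differ, so the expiry-handling logic can reuse the same request code.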
curl "https://api.att.com/rest/2/SpeechToText" \
  --insecure \
  --request POST \
  --header "Authorization: Bearer <Access_Token>" \
  --header "Accept: application/json" \
  --header "Content-Length: 3444" \
  --header "X-SpeechContext: Generic" \
  --header "Content-type: audio/wav" \
  --data-binary "@<audio_file>" \
  --proxy "https://proxy.if.you.use.one.com:8080"
The content length (3444 in the example above) can be determined on Windows using ???.
Alternately, "Transfer-Encoding: Chunked" specifies using the standard HTTP 1.1 streaming mechanism in 512 bit chunks.
--data-binary specifies the HTTP body, which contains only audio data.
The Speech API does not support audio data nested in MIME multipart documents. MIME is the format used by file uploads in HTML forms.
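The same request can be sketched in Python, which also gives one way to obtain the content length: it is simply the size of the audio file in bytes. This is a minimal sketch that only builds the request object and does not send it. The URL and header names are taken from the curl example above; the function name build_speech_request is hypothetical.

```python
import urllib.request

# Endpoint as shown in the curl example above.
SPEECH_URL = "https://api.att.com/rest/2/SpeechToText"


def build_speech_request(audio_path, access_token, context="Generic"):
    """Build (but do not send) a POST whose body is the raw audio bytes."""
    with open(audio_path, "rb") as f:
        body = f.read()  # no MIME multipart wrapper, just the audio data
    headers = {
        "Authorization": "Bearer " + access_token,
        "Accept": "application/json",
        "Content-Type": "audio/wav",
        "Content-Length": str(len(body)),  # size of the file in bytes
        "X-SpeechContext": context,
    }
    return urllib.request.Request(
        SPEECH_URL, data=body, headers=headers, method="POST")
```

Passing the returned object to urllib.request.urlopen would send it. Unlike curl with --insecure, urlopen verifies the server certificate by default.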
Although .wav is accepted, AT&T prefers the .amr (Adaptive Multi-Rate) codec developed by Ericsson. It uses the Algebraic Code Excited Linear Prediction (ACELP) algorithm, designed to efficiently compress recordings of human speech on 3G cell phones for MMS (Multimedia Messaging).
More specifically, AMR narrowband, 12.2 kbit/sec, 8 kHz sampling.
Get the sample C# RESTful web program from GitHub, as run from here.
The response looks like this:
{"Recognition":{"Status":"OK","ResponseId":"71b9410cd5259ec81e49abf892eabd44","NBest":[{"WordScores":[0.58,0.05],"Confidence":0.315,"Grade":"confirm","ResultText":"You ...","Words":["You","..."],"LanguageId":"en-US","Hypothesis":"you um"}]}}
To strip out meta data and only present the ResultText:
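One way to do this is sketched below in Python with the standard json module. The sample string is the response shown above with the line-wrap spaces removed from the key names, so "NBest" and "ResultText" are assumed to be the real keys; the function name result_texts is hypothetical.

```python
import json

# Sample response from above, with wrapped key names joined (an assumption).
SAMPLE = (
    '{"Recognition":{"Status":"OK",'
    '"ResponseId":"71b9410cd5259ec81e49abf892eabd44",'
    '"NBest":[{"WordScores":[0.58,0.05],"Confidence":0.315,'
    '"Grade":"confirm","ResultText":"You ...","Words":["You","..."],'
    '"LanguageId":"en-US","Hypothesis":"you um"}]}}'
)


def result_texts(response_body):
    """Return the ResultText of every hypothesis, or [] if Status is not OK."""
    recognition = json.loads(response_body).get("Recognition", {})
    if recognition.get("Status") != "OK":
        return []
    return [hyp.get("ResultText", "") for hyp in recognition.get("NBest", [])]


print(result_texts(SAMPLE))  # prints ['You ...']
```

The Confidence and Grade fields sit next to ResultText in each hypothesis, so the same loop could filter out low-confidence results.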
The ATT Speech API recognizes several file formats: { ".amr", "audio/amr" }, { ".wav", "audio/wav" }, {".awb", "audio/amr-wb"}, {".spx", "audio/x-speex"}
WARNING: The ATT Speech API only processes wav (Microsoft Waveform) files containing 16-bit linear PCM, not 8-bit. Other encodings are rejected with errors such as:
"Text": "audiostream-wav: only pcm linear 16 supported (found pcm linear 8)",
"Text": "RIFF/WAVE coding [85] is not ALAW, ULAW or PCM",
A .wav file can optionally contain a RIFF header in addition to raw PCM (Pulse Code Modulation) audio bits with sampling (Project) rate of 8kHz (8000 Hz) or 16kHz.
WARNING: The ATT API requires the audio to be mono (not stereo).
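These constraints can be checked locally before uploading. The following Python sketch uses the standard wave module to test for 16-bit samples, a single channel, and an 8 kHz or 16 kHz rate. The function name check_wav is hypothetical. Note that Python's wave module only reads PCM, so A-law or µ-law files are reported as problems here even though the error text above suggests the service accepts them.

```python
import wave


def check_wav(path):
    """Return a list of reasons the file may be rejected (empty if none)."""
    problems = []
    try:
        with wave.open(path, "rb") as w:
            if w.getsampwidth() != 2:
                problems.append(
                    "sample width is %d-bit, expected 16-bit"
                    % (8 * w.getsampwidth()))
            if w.getnchannels() != 1:
                problems.append(
                    "%d channels, expected mono" % w.getnchannels())
            if w.getframerate() not in (8000, 16000):
                problems.append(
                    "sample rate %d Hz, expected 8000 or 16000"
                    % w.getframerate())
    except (wave.Error, EOFError) as exc:
        # Raised for non-RIFF files and for non-PCM codings.
        problems.append("not a readable PCM wav file: %s" % exc)
    return problems
```

An empty list means the header matches the requirements described above; it does not prove the service will accept the file.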
TOOL: sox from SourceForge can reformat to mono.
TOOL: Cool Edit from www.syntrillium.com can change both non-audio data and PCM bits in a .wav file.
NOTE: AMR files used by Nokia and Ericsson phones contain a "#!AMR\n" header. There is also a 3GPP standard AMR format. http://www.connactivity.com/~eaw/amrwork/ published a Python script to convert between them.
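A file can be tested for that header by reading its first few bytes. The Python sketch below does so; the function name amr_kind is hypothetical. The wideband magic string "#!AMR-WB\n" is not mentioned in the notes above; it is added here from the AMR file storage format (RFC 4867).

```python
AMR_NB_MAGIC = b"#!AMR\n"     # narrowband header described above
AMR_WB_MAGIC = b"#!AMR-WB\n"  # wideband header (from RFC 4867, an addition)


def amr_kind(path):
    """Return 'amr-nb', 'amr-wb', or None based on the file's magic bytes."""
    with open(path, "rb") as f:
        head = f.read(len(AMR_WB_MAGIC))
    if head.startswith(AMR_WB_MAGIC):
        return "amr-wb"
    if head.startswith(AMR_NB_MAGIC):
        return "amr-nb"
    return None
```

A result of None means the file has neither header, for example because it uses the 3GPP container instead.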
Conversion to .wav is necessary to edit an .amr file:
ffmpeg -i josie-ring.amr josie-ring.wav
After editing, convert the wave file to a full-size music mp3 (128 kbit/s, stereo, 44.1 kHz):
ffmpeg -i josie-ring.wav -ab 128k -ac 2 -ar 44100 josie-ring.mp3
Several cURL requests can be issued from within a batch command file that loops over a set of files (use %f instead of %%f when typing the loop directly at the command prompt):
for %%f in (*.xml) do curl -F importDataFile=@%%f http://...ImportServlet 1>%%f.out
Individual commands are expanded thus for "some.xml":
curl -F importDataFile=@some.xml http://...ImportServlet 1>some.xml.out
AT&T's speech engine is powered by Watson.
In-line hints are prepended to voice data.
An in-line grammar sends in the whole grammar set.
PLS (Pronunciation Lexicon Specification)
SRGS (Speech Recognition Grammar Specification)
Demo from ivee (talking alarm clocks) Jonathan David Ger
Brent