Automated switchboard using voice recognition and 46elks calls API
Voice recognition (also called speech recognition or speech-to-text) is gradually becoming a reliable way to get user input, thanks to cloud services such as Azure, Bluemix and Google Cloud. Language support is improving too, and now extends to dialects such as the Swedish spoken in Skåne (Scania), which even some Swedes find hard to understand. I built a simple use case with our Voice Call API, and the results were pretty good despite the limited sound quality of the cellphone network.
Call flow
Here is an overview of how the systems are connected.
- Call arrives from someone.
- A welcome-sound file is played to the caller.
- A recording is then started.
- When 3 seconds of silence is detected, or when a digit is pressed on the phone, the recording stops.
- Once the recording is complete, a request is sent to the configured endpoint to say that the recording is available for download. At this point the transcription of the recording can be started.
- In parallel with the transcription, it is possible to perform actions in the call, e.g. playing a sound that says “wait a second”.
- The next step is to check the text content and take an appropriate action.
- In this example I connect the call to the correct number depending on the name in the recording.
Endpoints
To handle this I created two endpoints.
/recording
This endpoint is notified when a new sound file is available for download from the 46elks API, so that the file can be sent on to the speech recognition API.
/whatnow
This endpoint is notified when the caller is ready to be forwarded to the requested telephone. It needs to wait for the recording to be transcribed.
Sound files
I created 4 sound files:
- welcome.wav : "Welcome, how can I help you?"
- ok_wait.wav : "Ok I’ll see what I can do."
- nosound.wav : "Sorry I did not hear what you said, can you please repeat?"
- bussy.wav : "The telephone was busy, what do you want me to do?"
Parts of the process
The first step is to add a start JSON to voice_start on the number that receives the incoming call.
Initial voice start
The call starts with welcome.wav. After that the caller is recorded, then ok_wait.wav is played, and finally the request to the /whatnow endpoint is made.
voice_start on number:
{
  "play": "https://yourserver.com/sounds/welcome.wav",
  "next": {
    "record": "https://yourserver.com/api/reccording",
    "next": {
      "play": "https://yourserver.com/sounds/ok_wait.wav",
      "next": "https://yourserver.com/api/whatnow"
    }
  }
}
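The same play, record, play, next structure comes back several times in this article, with only the first sound file changing. As a rough sketch, it could be generated by a small helper instead of being written out each time. The function name `buildPromptFlow` and its parameters are hypothetical, and the URLs are the placeholder ones used in this article.

```php
<?php
// Hypothetical helper (not part of the 46elks API): builds the
// play -> record -> play -> next flow as a JSON string.
function buildPromptFlow(string $promptSound, string $baseUrl = 'https://yourserver.com'): string
{
    $flow = array(
        // Prompt played to the caller first.
        'play' => $baseUrl . '/sounds/' . $promptSound,
        'next' => array(
            // Record the caller; the recording is reported to this URL.
            'record' => $baseUrl . '/api/reccording',
            'next' => array(
                // Ask the caller to wait, then ask /whatnow what to do.
                'play' => $baseUrl . '/sounds/ok_wait.wav',
                'next' => $baseUrl . '/api/whatnow',
            ),
        ),
    );

    return json_encode($flow, JSON_UNESCAPED_SLASHES | JSON_PRETTY_PRINT);
}
```

An endpoint that needs to ask the caller again could then simply do `print buildPromptFlow('nosound.wav');`.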
Handle recording
The recording is handled by downloading the sound file and then converting it into a BASE64 string, as required by the voice recognition API.
First part of recording endpoint
// Set auth header.
$opts = array(
    'http' => array(
        'method'  => 'GET',
        'header'  => "Authorization: Basic ".
            base64_encode('<api-username>:<api-password>')."\r\n",
        'timeout' => 180
    )
);
$context = stream_context_create($opts);
// Download sound file content.
$sound = file_get_contents($_POST['wav'], false, $context);
// Encode it as BASE64 for the speech API.
$sound = base64_encode($sound);
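Both endpoints in this article need the same authenticated stream context, so one option is to wrap it in a function. This is a minimal sketch, not something the 46elks API requires; the name `elksAuthContext` is made up for illustration, and the credentials are whatever API username and password you use.

```php
<?php
// Hypothetical helper: returns a stream context that sends HTTP Basic
// authentication, for use with file_get_contents().
function elksAuthContext(string $username, string $password, int $timeout = 180)
{
    $options = array(
        'http' => array(
            'method'  => 'GET',
            'header'  => 'Authorization: Basic '
                . base64_encode($username . ':' . $password) . "\r\n",
            'timeout' => $timeout,
        ),
    );

    return stream_context_create($options);
}
```

The download then becomes `file_get_contents($_POST['wav'], false, elksAuthContext('<api-username>', '<api-password>'))`.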
Ask Speech API for text
Once the BASE64 string is available, the request to the Google Speech API is made. I use the Google Speech API in this example because it supports the sound file format delivered by the 46elks API, and because it supports Swedish along with many other languages.
Second part of recording endpoint
$apirequest = array(
    "config" => array(
        "encoding" => "LINEAR16",
        "sampleRate" => 8000,
        "languageCode" => "sv-SE",
        // Hint the words we expect to hear.
        "speechContext" => array(
            "phrases" => array("call","martin","johannes")
        )
    ),
    "audio" => array(
        "content" => $sound
    )
);
$data_string = json_encode($apirequest);
// Send the request to the Speech API as a JSON POST.
$ch = curl_init('https://speech.googleapis.com/v1/speech:syncrecognize?key=<api-key>');
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_POSTFIELDS, $data_string);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_HTTPHEADER, array(
    'Content-Type: application/json',
    'Content-Length: ' . strlen($data_string))
);
$text = curl_exec($ch);
curl_close($ch);
Handle the text from the API
The API replies with a list of alternatives for what the recording could mean as text. A simple solution is to select the one with the highest confidence and then store that text in a local file, database, queue, etc.
Third part of recording endpoint
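$texts = json_decode($text, true);
$accuracy = 0;
$besttext = '';
// Keep the alternative with the highest confidence.
foreach(($texts['results'][0]['alternatives'] ?? array()) as $alternative){
    if($accuracy < $alternative['confidence']){
        $accuracy = $alternative['confidence'];
        $besttext = $alternative['transcript'];
    }
}
// Store the transcript under the call id so /whatnow can pick it up.
file_put_contents('calls/'.$_POST['callid'],$besttext);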
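To make the selection step easier to test on its own, it can be pulled out into a function. The sketch below assumes the response has the `results[0].alternatives[]` shape used above, with `transcript` and `confidence` fields; the function name `pickBestTranscript` is hypothetical, and the sample data in the usage note is invented purely for illustration.

```php
<?php
// Hypothetical helper: returns the transcript with the highest confidence
// from a decoded speech response, or '' when nothing was recognised.
function pickBestTranscript(array $response): string
{
    $bestText = '';
    $bestConfidence = -1.0;

    // Missing keys (e.g. a silent recording) yield an empty list.
    $alternatives = $response['results'][0]['alternatives'] ?? array();

    foreach ($alternatives as $alternative) {
        if (!isset($alternative['transcript'])) {
            continue;
        }
        // Treat a missing confidence value as 0.
        $confidence = (float) ($alternative['confidence'] ?? 0.0);
        if ($confidence > $bestConfidence) {
            $bestConfidence = $confidence;
            $bestText = $alternative['transcript'];
        }
    }

    return $bestText;
}
```

With a decoded response in `$texts`, the endpoint would store `pickBestTranscript($texts)` instead of looping inline. Returning an empty string for an empty response also avoids a warning when the API recognises nothing.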
Check that the recording is not silent
If the recording was completely quiet there will not be any recording data. In this case it is useful to ask the user to repeat the request.
First part of /whatnow endpoint
$opts = array(
    'http' => array(
        'method'  => 'GET',
        'header'  => "Authorization: Basic ".
            base64_encode('<api-username>:<api-password>'). "\r\n",
        'timeout' => 180
    )
);
$context = stream_context_create($opts);
// Get call information (decoded as an associative array):
$calldata = file_get_contents('https://api.46elks.com/a1/calls/'.$_POST['callid'], false, $context);
$calldata = json_decode($calldata, true);
// Find the result of the most recent record action.
// Assumes each record action is listed with a 'record' key.
$latestreccordresult = "";
foreach($calldata['actions'] as $action){
    if(isset($action['record'])){
        $latestreccordresult = $action['result'];
    }
}
// No usable recording: ask the caller to repeat.
if($latestreccordresult !== "ok"){
    print <<<END
{
  "play": "https://yourserver.com/sounds/nosound.wav",
  "next": {
    "record": "https://yourserver.com/api/reccording",
    "next": {
      "play": "https://yourserver.com/sounds/ok_wait.wav",
      "next": "https://yourserver.com/api/whatnow"
    }
  }
}
END;
    die();
}
Wait for recording to be transcribed then handle text.
The transcription may not be finished yet, so the endpoint may need to wait for the file to be created before taking action. In this example I wait up to 13 seconds before concluding that the transcription failed.
Second part of /whatnow endpoint
// Poll once per second, for up to 13 seconds, for the transcript file.
for($i = 0; $i < 13; $i++){
    sleep(1);
    if(file_exists('calls/'.$_POST['callid'])){
        $text = file_get_contents('calls/'.$_POST['callid']);
        $to = false;
        if(stristr($text,'martin')){
            $to = "+4672317500";
        }
        elseif(stristr($text,'johannes')){
            $to = "+46766861004";
        }
        if($to){
            print <<<END
{
  "connect": "{$to}",
  "busy": {
    "play": "https://yourserver.com/sounds/bussy.wav",
    "next": {
      "record": "https://yourserver.com/api/reccording",
      "next": {
        "play": "https://yourserver.com/sounds/ok_wait.wav",
        "next": "https://yourserver.com/api/whatnow"
      }
    }
  }
}
END;
            die();
        }
        // Transcript is ready but names nobody we know: stop waiting.
        break;
    }
}
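With more than a couple of names, the if/elseif chain gets long. One possible alternative is a lookup table that maps a name to a number. The sketch below is only an illustration: `findNumberForTranscript` and `$directory` are hypothetical names, and the numbers are the example numbers from this article.

```php
<?php
// Hypothetical helper: returns the number for the first name found in the
// transcript (case-insensitive), or null when no name matches.
function findNumberForTranscript(string $transcript, array $directory): ?string
{
    foreach ($directory as $name => $number) {
        if (stripos($transcript, (string) $name) !== false) {
            return $number;
        }
    }

    return null;
}

// Name => number table; extend it to add more people.
$directory = array(
    'martin'   => '+4672317500',
    'johannes' => '+46766861004',
);
```

Inside the polling loop, `$to = findNumberForTranscript($text, $directory);` would then replace the chain. The keys of the same table could also be reused as the phrase hints sent to the speech API.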
If all else fails, say sorry and try again
If everything else fails, ask the user for input again: play the sound file to the user and make a new recording request.
Third part of /whatnow endpoint
print <<<END
{
  "play": "https://yourserver.com/sounds/sorry.wav",
  "next": {
    "record": "https://yourserver.com/api/reccording",
    "next": {
      "play": "https://yourserver.com/sounds/ok_wait.wav",
      "next": "https://yourserver.com/api/whatnow"
    }
  }
}
END;
Written 2017-06-30 by Martin Harari Thuresson