“The most important single ingredient in the formula of success is knowing how to get along with people,” Theodore Roosevelt used to say. This formula becomes crucial when it comes to robots. Without the ability to communicate, a robot is just a piece of high-tech machinery. Valery Titov and Vladislav Sazonov, engineers at Promobot, explain what it takes to communicate with a robot. As it turns out, the simple “microphone-speaker” pairing became obsolete back in the last century, and it is almost impossible to find a suitable “head”.
People take everyday communication for granted, because a human is able to hear and respond. With robots, it becomes a challenge: to build a robot that can hear and understand the way a human does. A conversation with a robot can take place under difficult conditions, for example, in noise or with several people talking at the same time. Traditional microphones may also pick up the robot’s own speech. Technically, the communication process is very complex. To teach Promobot robots to communicate, we had to develop an appropriate “mouth” and “ears” and make them work properly.
“Honda” for a Robot
For perfect hearing, the robot needs two main things: the “right” ears to hear with and a capable head able to process the incoming information. A hardware and software system based on a microphone array can be a good solution.
The software is more complex than the microphone set, which can be placed wherever required. At a minimum, the head must be able to:
• Recognize the robot’s own speech;
• Clean noise out of the sound;
• Pick out speech from all the incoming audio;
• Identify the source of the speech;
• Form a beam to amplify the original sound signal from that source;
• Finally, recognize the speech in the audio track.
As a rule, microphone array solutions do not cover the last item.
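To make the beam-forming item in the list above more concrete, here is a minimal delay-and-sum sketch in plain Python. It is only an illustration, not code from Promobot or from any particular library: the function names `delayed` and `delay_and_sum` are hypothetical, the delays are assumed to be whole numbers of samples, and the per-microphone delays are assumed to be known already.

```python
def delayed(signal, d):
    """Return `signal` shifted right by d samples, zero-padded to the same length."""
    return [0.0] * d + list(signal[:len(signal) - d])


def delay_and_sum(channels, delays):
    """Align each channel by its arrival delay (in samples) and average them.

    channels: list of equal-length sample lists, one per microphone.
    delays:   hypothetical per-microphone arrival delays in samples.
    """
    n = len(channels[0])
    out = [0.0] * n
    for samples, d in zip(channels, delays):
        for i in range(n):
            j = i + d  # what the source emitted at time i reaches this mic at i + d
            if 0 <= j < n:
                out[i] += samples[j]
    return [v / len(channels) for v in out]


if __name__ == "__main__":
    source = [0.0] * 20
    source[5] = 1.0                      # a single click from the speaker
    delays = [0, 2, 4]                   # assumed arrival delays for three mics
    mics = [delayed(source, d) for d in delays]
    print(max(delay_and_sum(mics, delays)))      # steered at the source
    print(max(delay_and_sum(mics, [0, 0, 0])))   # not steered
```

When the delays match the true source, the click adds up coherently across microphones; with the wrong delays it is smeared over several samples and its peak drops.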
There were not many finished solutions on the world market, and only a few of them suited Promobot. We considered HARK (Honda Research Institute Japan Audition for Robots with Kyoto University) to be the most promising one.
HARK was originally designed so that robots could hear and understand commands from a human. This open-source software could be linked to the Robot Operating System (ROS), and the microphone sound processing was easy to set up. An additional advantage was the stated ability to detect several sound sources simultaneously. It seemed to be exactly what we needed. We were so fascinated by the Japanese development that we stopped looking for other options.
Fatal sound millimeters
In 2014, we used a RASP LC microphone array with 8 microphones for the first tests. Four microphones were located on the central part of the robot’s chest, around the screen, three on the upper part of the chest, closer to the neck, and one at the back of the neck. After the first test, we identified two crucial problems: vibration and computational complexity.
The first is that the robot itself is a mechanism with a huge number of moving parts. Each movement creates background noise that requires constant analysis.
The second problem was the processing of the received audio data. The HARK developers took two fundamentally different approaches to audio stream analysis. The first, called geometric, relies on an accurate (to tenths of a millimeter) description of the microphones’ positions in space, taking their orientation into account. The second, which has no special name, is based on a calibration model. It is created by repeatedly recording the same sound through the array from different points in space.
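As a rough illustration of what a geometric model has to compute, the sketch below derives per-microphone arrival delays from microphone coordinates for a distant source. It is a simplified far-field (plane wave) calculation and not HARK’s own format or code: the function name `steering_delays`, the coordinate convention, the example positions, and the 343 m/s speed of sound are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s, assumed value for air at roughly 20 °C


def steering_delays(mic_positions, azimuth_deg, elevation_deg=0.0):
    """Arrival delay (seconds) at each microphone relative to the array origin.

    mic_positions: list of (x, y, z) coordinates in meters (hypothetical layout).
    The source is assumed to be far away, in the given direction.
    """
    az = math.radians(azimuth_deg)
    el = math.radians(elevation_deg)
    # Unit vector pointing from the array origin toward the source.
    u = (math.cos(el) * math.cos(az), math.cos(el) * math.sin(az), math.sin(el))
    # A microphone closer to the source (larger projection on u) hears it earlier.
    return [-(x * u[0] + y * u[1] + z * u[2]) / SPEED_OF_SOUND
            for (x, y, z) in mic_positions]


if __name__ == "__main__":
    mics = [(0.10, 0.0, 0.0), (-0.10, 0.0, 0.0)]  # two mics 20 cm apart
    for t in steering_delays(mics, azimuth_deg=0.0):
        print(f"{t * 1e6:.1f} microseconds")
```

Because every delay is computed directly from the coordinates, any error in the measured microphone positions goes straight into the delays the rest of the processing relies on.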
According to its creators, the geometric model should have performed well in most cases. This did not happen with the first Promobot prototype. The accuracy of microphone placement turned out to be the main problem. Just as seven years ago, many stages of assembly are still done by hand, so it is not always possible to reproduce a unit exactly. A discrepancy of a millimeter, invisible to the human eye, turned out to be fatal for such precise algorithms. The result was very discouraging: because of the inaccurate location model, the program was calculating the correlation between 8 audio fragments of 10 ms each from 8 different microphones.
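The correlation step mentioned above can be pictured with a small sketch that compares two 10 ms frames and finds the lag at which they match best. This is a generic brute-force cross-correlation, not the algorithm HARK actually runs: the names `best_lag` and `noise_frame` are hypothetical, and the 16 kHz sample rate (160 samples per 10 ms frame) is an assumption.

```python
import random


def best_lag(ref, other, max_lag):
    """Lag in samples (positive if `other` trails `ref`) with the highest correlation."""
    n = len(ref)
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += ref[i] * other[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def noise_frame(n, seed=0):
    """Synthetic stand-in for one frame of microphone audio."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


if __name__ == "__main__":
    frame = noise_frame(160)            # 10 ms at an assumed 16 kHz
    late = [0.0] * 3 + frame[:-3]       # same sound, arriving 3 samples later
    print(best_lag(frame, late, max_lag=10))
```

With 8 microphones this comparison has to be repeated for many microphone pairs and for every new frame, which gives a feel for the computational load described earlier.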
The robot’s acoustic echo
The second prototype had its own board with a microphone array. It was equipped with a powerful general-purpose microcontroller for synchronizing the collected data. The microphones were located on the robot’s chest, around the screen. The model itself had an important feature: the robot picked up its own noise and sound. We decided to apply acoustic echo cancellation (AEC) to remove the robot’s speech from the microphone array data.
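The text does not say which AEC algorithm was used. As one common textbook approach, the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter: it learns how the robot’s own speech (the reference signal sent to the speaker) shows up in a microphone and subtracts that estimate. The names `nlms_echo_cancel` and `simulate_echo`, the filter length, and the step size are assumptions chosen for the example.

```python
import random


def nlms_echo_cancel(mic, reference, taps=8, mu=0.5, eps=1e-8):
    """Return (residual, weights) after removing the echo of `reference` from `mic`."""
    w = [0.0] * taps      # running estimate of the speaker-to-microphone echo path
    buf = [0.0] * taps    # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, reference):
        buf = [x] + buf[:-1]
        echo_estimate = sum(wi * bi for wi, bi in zip(w, buf))
        e = d - echo_estimate                      # what is left after cancellation
        norm = sum(b * b for b in buf) + eps       # normalize by reference energy
        w = [wi + mu * e * bi / norm for wi, bi in zip(w, buf)]
        residual.append(e)
    return residual, w


def simulate_echo(reference, path):
    """Synthetic microphone signal: the reference filtered by a made-up echo path."""
    return [sum(path[k] * reference[n - k] for k in range(len(path)) if n - k >= 0)
            for n in range(len(reference))]


if __name__ == "__main__":
    rng = random.Random(1)
    speech = [rng.uniform(-1.0, 1.0) for _ in range(2000)]  # stand-in for robot speech
    mic = simulate_echo(speech, [0.0, 0.6, 0.3])
    residual, w = nlms_echo_cancel(mic, speech)
    print(sum(e * e for e in residual[-200:]), sum(m * m for m in mic[-200:]))
```

In this synthetic case there is no human voice in the microphone signal, so the residual shrinks toward zero; in a real conversation the residual is what remains for speech detection and recognition.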