As connected devices increasingly incorporate voice commands and speech-to-text, solution developers need to be sure they are using best-in-class audio pick-up and voice processing to ensure a good user experience. One company that has been advancing this area is Knowles, which recently introduced the IA8508, the first audio processor with context-aware machine learning for complex voice interactions.
Knowles is widely known as a leading provider of acoustic pickups based on microelectromechanical systems (MEMS). MEMS found fame as accelerometers in the first Nintendo Wii hand controllers and are used in fitness devices to track movement. Many of the same principles can be applied to picking up acoustic wave vibrations, to the point that MEMS-based microphones have quickly replaced the classic electret condenser microphone. The success of MEMS acoustic pickups is indicated by Apple using them in its iPhones and by Amazon adding them to its Echo Dot voice-activated platform for voice communications with the Alexa personal assistant.
The advantage of MEMS-based acoustic pickups is that they’re small, low power and low cost. They are also robust and reliable, thanks to improvements in the manufacturing process, as well as good digital compensation techniques. This latter point is critical, as it’s the combination of MEMS with digital compensation and processing techniques that has unleashed a new wave of voice-activated control and human-machine interfaces (Figure 1).
Figure 1. The combination of MEMS audio sensors (microphones) and advanced processing architectures has made voice-enabled human-machine interfaces practical. (Image source: Yole Developpement)
The power, size and cost advantages of MEMS have allowed multiple sensors to be placed on a single board in a device with only marginal loss of space and power. With multiple acoustic sensors in a single device, a number of interesting capabilities become viable, such as beamforming, directionality, voice recognition, voice tracking and filtering in a crowd, and many other features that make voice more useful and interesting as an interface.
All these features work by combining the sensor inputs from multiple sensors that have different responses based on the location of the voice source. Add in more context awareness and the latest advances in machine-learning algorithms, and the usefulness of voice increases rapidly.
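The core idea behind combining multiple sensor inputs can be sketched as a simple delay-and-sum beamformer: delay each microphone’s signal so that sound arriving from a chosen direction lines up across the array, then average. The function names, linear array geometry and sample values below are illustrative assumptions, not a description of Knowles’ proprietary algorithms.

```python
# Minimal delay-and-sum beamformer sketch (illustrative only).
import math

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature


def steering_delays(mic_positions_m, angle_deg, sample_rate_hz):
    """Per-mic delays (in whole samples) for a far-field source at angle_deg.

    Mics are assumed to lie on a line; positions are metres along the
    array axis, and angle_deg is measured from that axis (0 = endfire).
    """
    theta = math.radians(angle_deg)
    delays_s = [x * math.cos(theta) / SPEED_OF_SOUND for x in mic_positions_m]
    ref = min(delays_s)  # shift so all delays are non-negative
    return [round((d - ref) * sample_rate_hz) for d in delays_s]


def delay_and_sum(channels, delays):
    """Align each channel by its delay in samples, then average the overlap."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[i + d] for ch, d in zip(channels, delays)) / len(channels)
        for i in range(n)
    ]
```

Signals arriving from the steered direction add coherently while off-axis sounds partially cancel, which is what gives the array its directionality.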
Context-aware Machine Learning for Voice-based Interaction
The new algorithms for deep learning and context awareness would ordinarily require sending the audio signal to the cloud and back, or running higher-performance processors with extensive memory on the local device, thereby consuming more power. However, Knowles has introduced a new processor platform, the IA8508, which is optimized for near- and far-field applications. This makes it useful for hearing aids, IoT devices, smart TVs, smart speakers and digital assistants.
The premise is that by incorporating dedicated hardware for deep neural network (DNN) and machine-learning acceleration, the IA8508 can minimize power consumption, while at the same time offering a real-time response and avoiding the latencies of a cloud-based solution. It does this using power-efficient heterogeneous audio cores, an optimized architecture, 5.7 Mbytes of memory and a proprietary instruction set (Figure 2).
Figure 2. Knowles’ IA8508 audio processor uses a proprietary heterogeneous architecture to enable advanced acoustic processing on low-power IoT edge devices. (Image source: Knowles Corp.)
The combination achieves a 10x improvement in efficiency over competitors, according to Knowles. That competition mainly comprises Bosch’s Akustica, STMicroelectronics, InvenSense, X-Fab, Infineon and Sony.
The IA8508 supports an array of up to eight microphones. Once the signal is acquired by the processing cores, the platform performs acoustic echo cancellation, dynamic beamforming and steering, sound classification, and noise suppression. This is where the Delta-Max core, with its specialized instructions, comes into play. It is augmented by a single-sample processor core that lowers latency when handling noise cancellation and mixing of the various audio signals.
As voice-enabled devices spend much of their time in “listening” mode, the system needs to be architected such that it’s mostly shut down until a known voice command or key phrase is given, such as “OK Google” or “Alexa.” The processor must be able to recognize these specific commands before waking the full system. For this feature, Knowles uses a Hemi-Delta core that provides low-power dynamic processing. The low-power standby mode is aided by a proprietary memory architecture and dynamic voltage scaling. For overall control, a general-purpose Arm Cortex-M4 core is used.
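The listening-mode split can be sketched as a two-stage gate: a cheap always-on check (here, simple frame energy) decides when to invoke a heavier keyword detector, so the expensive stage runs only rarely. The frame size, threshold and function names below are illustrative assumptions, not the Hemi-Delta core’s actual design.

```python
# Two-stage "always listening" sketch: a cheap front stage gates the
# expensive recognizer (illustrative assumptions throughout).

FRAME = 160             # 10 ms of audio at 16 kHz
ENERGY_THRESHOLD = 0.01  # arbitrary gate level for this sketch


def frame_energy(frame):
    """Mean squared amplitude of one frame (the cheap always-on stage)."""
    return sum(s * s for s in frame) / len(frame)


def listen(samples, keyword_detector):
    """Run the energy gate; call the costly detector only on loud frames.

    Returns (frames_seen, detector_calls, triggered) so the savings
    from gating are visible.
    """
    calls = 0
    frames = 0
    for start in range(0, len(samples) - FRAME + 1, FRAME):
        frames += 1
        frame = samples[start:start + FRAME]
        if frame_energy(frame) < ENERGY_THRESHOLD:
            continue                  # stay in low-power standby
        calls += 1                    # wake the heavier stage
        if keyword_detector(frame):
            return frames, calls, True
    return frames, calls, False
```

During silence the heavy detector is never invoked at all, which is the essence of keeping average power low while remaining always responsive.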
It Takes a Village to Raise Voice Control
Not much can happen in a vacuum, especially with voice-enabled control. With that in mind, IoT solution providers should be aware of Knowles’ extensive digital signal processing partnership program. This program brings together leaders in the intelligent voice ecosystem to improve audio performance across a range of applications and use cases. If the IA8508 isn’t an ideal fit for a solution you have in mind and you need to augment it, third-party silicon intellectual property can be added for additional features and capabilities.
When the Amazon Echo Dot recognized my heavily accented voice across a room of kids, it was clear that voice-enabled control had arrived. The Dot uses seven Knowles MEMS sensors. While much of that processing is still done in the cloud, adding more low-power processing capability at the edge, along with machine learning and DNNs, will help response times keep pace with customers’ falling tolerance for command-response latencies.
In doing so, voice-enabled devices will move from novelties to truly interactive, useful solutions.