Multi-channel Acoustic Signal Processing on Edge Devices
Microphone arrays are useful for determining the space-time structure of an acoustic field, and they are widely employed in popular acoustic signal processing applications including speech enhancement, speech separation, sound source localization, and sound source tracking. This dissertation introduces a set of practical and efficient multi-channel acoustic signal processing algorithms targeted at improving users' spatial awareness of, and hearing toward, sources of interest on edge devices equipped with a microphone array, such as smartphones, smart glasses, and hearing aids. As proof of feasibility, the majority of the developed algorithms have in fact been deployed as mobile applications for smartphones.

First, the Directional Signal Extraction Network (DSENet) is proposed: a real-time, computation- and memory-efficient neural network that extracts a source located within a predefined directional region of interest. Experimental results show that DSENet outperforms oracle beamformers and state-of-the-art (SOTA) networks in low-latency causal speech separation while incurring a system latency of only 4 ms. Second, a complete method for highly accurate and efficient real-time estimation of the two-dimensional direction of arrival (2D-DOA) using a nonlinear 3-microphone array is presented. The proposed method can estimate and track the azimuth and elevation angles of one or more acoustic sources in real time. Third, to increase the number of microphones available for improved acoustic processing, a distributed, real-time, low-latency audio input/output (I/O) framework for mobile devices is proposed. The framework emulates an irregular, flexible microphone array by wirelessly synchronizing multiple mobile devices and processing their multi-channel audio input into real-time output.
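To illustrate the far-field geometry behind 2D-DOA estimation with a planar, non-collinear 3-microphone array, the sketch below recovers azimuth and elevation from the two time differences of arrival (TDOAs) measured against a reference microphone. The array layout, the plane-wave model, and all function names here are illustrative assumptions, not the dissertation's actual method.

```python
import numpy as np

C = 343.0  # speed of sound (m/s)

# Hypothetical planar 3-microphone layout (meters); any non-collinear
# arrangement works. Mic 0 is the reference at the origin.
MICS = np.array([[0.00, 0.00],
                 [0.10, 0.00],
                 [0.00, 0.10]])

def doa_2d(tdoas):
    """Estimate (azimuth, elevation) in radians from the TDOAs of mics
    1 and 2 relative to mic 0, under a far-field plane-wave model.

    For a plane wave arriving from unit direction u, the delay at a mic
    at position p (relative to the origin) is -p.u/c.  The in-plane
    components of u therefore follow from a 2x2 linear solve; the
    out-of-plane component comes from |u| = 1 (its sign is ambiguous
    for a planar array).
    """
    A = MICS[1:]                                  # mic positions rel. to mic 0
    uxy = np.linalg.solve(A, -C * np.asarray(tdoas))
    uz = np.sqrt(max(0.0, 1.0 - uxy @ uxy))       # unit-norm constraint
    return np.arctan2(uxy[1], uxy[0]), np.arcsin(uz)

def tdoas_from_angles(az, el):
    """Synthesize far-field TDOAs for a source at (azimuth, elevation)."""
    u = np.array([np.cos(el) * np.cos(az), np.cos(el) * np.sin(az)])
    return -(MICS[1:] @ u) / C
```

In practice the TDOAs would come from cross-correlating microphone pairs (e.g. GCC-PHAT), and tracking would smooth the per-frame angle estimates over time.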
Fourth, a method for jointly calibrating and synchronizing two arrays of microphones and loudspeakers is described. The method estimates the clock offset between two devices, each featuring an array of microphones and loudspeakers, as well as their exact relative positions. Fifth, the Delay-Filter-and-Sum Network (DFSNet) is proposed: a steerable neural beamformer, invariant to microphone count and array geometry, for real-time, low-latency speech enhancement. Beyond low latency, DFSNet is designed for controllable distortion and low memory and computational complexity, making it especially suitable for hearing aid applications. Comparisons with SOTA methods show performance approaching that of noncausal approaches.