Audio Signal Processing for ML [WIP]
Introduction
This post explores fundamental concepts in audio signal processing that form the backbone of machine learning applications for sound and speech. The content is based on the Audio Signal Processing for Machine Learning video series, with code examples available in the author’s GitHub repository.
Prerequisites
For understanding Fourier analysis concepts:
- Fourier series introduction
- Fourier Transform - Decomposing complex sounds into their constituent frequencies
Sound Fundamentals
Sound and Waveforms
- Sound: Produced by the vibration of an object. The vibration makes air molecules oscillate, and the resulting changes in air pressure propagate as a wave.
- Key Properties:
- Mechanical Wave
- Frequency (Hz)
- Amplitude
- Phase
- Higher frequency produces higher-pitched sound, while higher amplitude results in louder sound
- Pitch: Perception of frequency. In Western tuning, one octave spans 12 semitones, and frequency doubles with each octave.
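The octave relationship can be sketched in a few lines of Python. This is a minimal illustration that assumes 12-tone equal temperament (each of the 12 steps multiplies frequency by the same ratio) and a 440 Hz reference for A4; neither assumption comes from the video series, and the function name `semitone_shift` is made up for this example.

```python
A4_HZ = 440.0  # assumed reference tuning (concert pitch), not from the source


def semitone_shift(freq_hz: float, semitones: int) -> float:
    """Frequency after moving `semitones` steps, assuming 12-tone equal temperament."""
    # Each semitone multiplies the frequency by 2**(1/12), so 12 steps double it.
    return freq_hz * 2.0 ** (semitones / 12.0)


octave_up = semitone_shift(A4_HZ, 12)     # one octave above A4
octave_down = semitone_shift(A4_HZ, -12)  # one octave below A4
one_step_up = semitone_shift(A4_HZ, 1)    # the next note up
print(octave_up, octave_down, one_step_up)
```

Because the step is a ratio rather than a fixed number of hertz, notes are spaced further apart in Hz as pitch rises, which is one reason log-frequency scales are common in audio features.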
Intensity and Loudness
- Sound Power: Measured in watts (W). Energy per unit of time emitted by a sound source in all directions.
- Sound Intensity: Sound power per unit area, measured in W/m^2. The intensity level is usually expressed in decibels (dB) relative to the threshold of hearing.
- Threshold of Hearing (TOH): 10^(-12) W/m^2
- Threshold of Pain (TOP): 10 W/m^2
- Decibel Formula: dB(I) = 10 * log10(I / I_TOH), where I_TOH is the threshold of hearing
- Common Reference Points:
- Normal conversation: 60 dB
- Jet take-off: 140 dB
- Loudness (phons): Subjective perception of sound intensity, dependent on duration and frequency of sound
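To make the decibel formula concrete, here is a small sketch that converts between intensity in W/m^2 and level in dB, using the threshold of hearing listed above as the reference. The function names are illustrative only.

```python
import math

I_TOH = 1e-12  # threshold of hearing in W/m^2, as listed above


def intensity_to_db(intensity_w_m2: float) -> float:
    """Intensity level in dB relative to the threshold of hearing."""
    return 10.0 * math.log10(intensity_w_m2 / I_TOH)


def db_to_intensity(level_db: float) -> float:
    """Inverse of intensity_to_db: level in dB back to W/m^2."""
    return I_TOH * 10.0 ** (level_db / 10.0)


toh_db = intensity_to_db(I_TOH)  # the reference itself sits at 0 dB
top_db = intensity_to_db(10.0)   # threshold of pain from the list above
conversation = db_to_intensity(60.0)
print(toh_db, top_db, conversation)
```

The scale is logarithmic: multiplying intensity by 10 adds 10 dB, and doubling it adds roughly 3 dB.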
Timbre
- The quality that distinguishes sounds of the same intensity, frequency, and duration
- Key Characteristics:
- Multidimensional in nature
- Sound envelope: Attack-Decay-Sustain-Release model
- Harmonic content:
- Superposition of sinusoids
- A partial is one of the sinusoidal components that make up a sound
- The lowest partial is called the fundamental frequency
- A harmonic partial is a partial whose frequency is an integer multiple of the fundamental frequency
- Inharmonicity describes how far a partial deviates from the nearest harmonic
- Modulation:
- Amplitude modulation → Tremolo
- Frequency modulation → Vibrato
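The harmonic-content and modulation ideas above can be sketched with plain Python. This is an illustrative example rather than code from the video series: `harmonic_tone` is a made-up helper that sums harmonic partials of a fundamental and optionally applies tremolo (amplitude modulation). The sample rate, amplitudes, and tremolo settings are arbitrary choices.

```python
import math


def harmonic_tone(f0, amplitudes, sr=8000, duration=0.5,
                  tremolo_hz=0.0, tremolo_depth=0.0):
    """Sum harmonic partials of f0 and optionally apply tremolo.

    amplitudes[k] is the amplitude of the partial at (k + 1) * f0.
    tremolo_depth in [0, 1] sets how far the gain dips below 1.
    """
    n = int(sr * duration)
    samples = []
    for i in range(n):
        t = i / sr
        # Superposition of sinusoids at integer multiples of the fundamental.
        value = sum(a * math.sin(2 * math.pi * (k + 1) * f0 * t)
                    for k, a in enumerate(amplitudes))
        # Amplitude modulation: gain oscillates between 1 - depth and 1.
        gain = 1.0 - tremolo_depth * 0.5 * (1.0 + math.sin(2 * math.pi * tremolo_hz * t))
        samples.append(gain * value)
    return samples


# Three partials at 220, 440 and 660 Hz with a 5 Hz tremolo.
tone = harmonic_tone(220.0, [1.0, 0.5, 0.25], tremolo_hz=5.0, tremolo_depth=0.3)
print(len(tone), max(abs(s) for s in tone))
```

Changing the relative amplitudes alters the timbre while the perceived pitch stays at the fundamental. Vibrato would modulate the frequency instead of the gain.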
Digital Audio Processing
Analog to Digital Conversion
- Analog Signal: Continuous in both time and amplitude
- Digital Signal: Sequence of values that are discrete in both time and amplitude
- Pulse Code Modulation process:
- Sampling:
- Sampling rate (sr): Number of samples per second
- Nyquist frequency: sr/2, the highest frequency a digital signal can represent without artifacts
- Artifact above the Nyquist frequency: Aliasing, where a component above sr/2 shows up as a lower frequency
- Quantization:
- Resolution: Number of bits per sample (bit depth)
- Dynamic range: Difference between the largest and smallest signal a system can record
- Signal-to-quantization-noise ratio (SQNR): Ratio between maximum signal strength and quantization error, usually given in dB
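Both halves of the conversion can be illustrated with a short, standard-library-only sketch. The helper names are hypothetical, and the quantizer is one possible choice (a uniform mid-rise quantizer on the range [-1, 1]); real converters may differ. The first function shows where a sinusoid lands after sampling, and the second measures SQNR for a full-scale sine.

```python
import math


def alias_frequency(f_hz: float, sr: float) -> float:
    """Apparent frequency of a real sinusoid at f_hz after sampling at sr."""
    f_mod = f_hz % sr
    # Frequencies fold back around the Nyquist frequency sr / 2.
    return min(f_mod, sr - f_mod)


def quantize(x: float, bits: int) -> float:
    """Uniform mid-rise quantizer for x in [-1, 1] (an assumed design)."""
    levels = 2 ** bits
    step = 2.0 / levels
    index = math.floor(x / step)
    # Clamp so that x == 1.0 maps to the top level instead of overflowing.
    index = max(-levels // 2, min(levels // 2 - 1, index))
    return (index + 0.5) * step


def measured_sqnr_db(bits: int, freq_hz: float = 997.0,
                     sr: int = 44100, n: int = 44100) -> float:
    """Measure SQNR in dB for a full-scale sine quantized to `bits` bits."""
    signal_power = 0.0
    noise_power = 0.0
    for i in range(n):
        x = math.sin(2 * math.pi * freq_hz * i / sr)
        err = x - quantize(x, bits)
        signal_power += x * x
        noise_power += err * err
    return 10.0 * math.log10(signal_power / noise_power)


print(alias_frequency(5000.0, 8000.0))  # 5 kHz sampled at 8 kHz folds to 3 kHz
print(measured_sqnr_db(8), measured_sqnr_db(16))
```

For a full-scale sine, the measured values should sit close to the common rule of thumb SQNR ≈ 6.02 * bits + 1.76 dB, so each extra bit of depth buys roughly 6 dB.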
Mel Spectrograms and Audio Features for Machine Learning
Audio features provide meaningful representations of sound for machine learning algorithms to process.