Course 6 • Lesson 1

Audio Processing & Whisper

Master the mathematical foundations of digital audio processing and explore the architecture of OpenAI's Whisper model for high-fidelity multi-task transcription, translation, and Voice Activity Detection (VAD).

Sampling & FFT | Mel Spectrograms | Transformer Encoder-Decoder | Interactive Playground

1. Digital Audio Representation & Physics

Sound is a continuous mechanical wave of pressure and displacement that travels through a medium. In order for computer architectures to ingest sound, it must undergo Analog-to-Digital Conversion (ADC). This conversion is governed by fundamental principles of physics and digital signal processing (DSP).

The Nyquist-Shannon Theorem

To capture an analog audio signal without aliasing (unwanted frequency distortions), the continuous signal must be sampled at a rate greater than twice its highest frequency component.

f_s > 2 • f_max

Human hearing extends to roughly 20 kHz, so high-fidelity audio uses sampling rates of 44.1 kHz or 48 kHz. Speech recognition systems, including Whisper, instead standardize on 16 kHz. That rate represents frequencies up to 8 kHz, which is enough to keep speech intelligible, while cutting the number of samples to roughly a third of a 44.1 kHz or 48 kHz recording.
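To make the theorem concrete, here is a minimal NumPy sketch of what happens when the condition is violated. The helper names `sampled_tone` and `dominant_frequency` are illustrative, not part of any library. A 3 kHz tone sampled at 16 kHz is recovered correctly, while a 9 kHz tone (above the 8 kHz Nyquist limit) folds back and shows up at 16 kHz - 9 kHz = 7 kHz.

```python
import numpy as np

def sampled_tone(freq_hz: float, sample_rate: int, duration_s: float = 1.0) -> np.ndarray:
    """Sample an ideal sine tone at the given rate (no anti-aliasing filter)."""
    t = np.arange(int(sample_rate * duration_s)) / sample_rate
    return np.sin(2 * np.pi * freq_hz * t)

def dominant_frequency(signal: np.ndarray, sample_rate: int) -> float:
    """Return the frequency (Hz) of the strongest bin in the magnitude spectrum."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])

sample_rate = 16_000  # Nyquist limit is sample_rate / 2 = 8 kHz

# Below the Nyquist limit: the tone is captured at its true frequency.
ok = dominant_frequency(sampled_tone(3_000, sample_rate), sample_rate)

# Above the Nyquist limit: the tone aliases to sample_rate - 9000 = 7000 Hz.
aliased = dominant_frequency(sampled_tone(9_000, sample_rate), sample_rate)

print(ok, aliased)
```

In practice, an anti-aliasing low-pass filter is applied before sampling or resampling so that content above the Nyquist limit is removed rather than folded into the audible band.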

Bit Depth & PCM

Pulse Code Modulation (PCM) represents continuous waveforms as discrete numerical amplitudes. The resolution of this quantization is governed by Bit Depth:

16-bit depth: Yields 65,536 potential amplitude values, offering a theoretical dynamic range of ~96 dB.
32-bit float: Stores amplitudes as floating-point values, the format neural network inputs expect. It avoids integer clipping during processing and keeps precision for very quiet signals.

Raw audio is typically represented as a 1D vector of float values between [-1.0, 1.0].
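The figures above can be checked with a short sketch. It assumes signed 16-bit PCM and the common convention of dividing by 32768 to map samples into [-1.0, 1.0); the function names are illustrative.

```python
import numpy as np

def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range of linear PCM: 20 * log10(number of levels)."""
    return float(20.0 * np.log10(2.0 ** bit_depth))

def pcm16_to_float32(samples: np.ndarray) -> np.ndarray:
    """Map signed 16-bit PCM samples to float32 values in [-1.0, 1.0)."""
    return samples.astype(np.float32) / 32768.0

pcm = np.array([-32768, -16384, 0, 16384, 32767], dtype=np.int16)
as_float = pcm16_to_float32(pcm)

print(round(dynamic_range_db(16), 2))  # about 96.33 dB
print(as_float)
```

Note that the largest positive 16-bit value (32767) maps to just under 1.0, because the signed integer range is asymmetric.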

2. The Mel-Spectrogram: From Waveform to Image

Raw audio waveforms are incredibly high-dimensional (a single 10-second clip at 16 kHz contains 160,000 samples). Furthermore, temporal representation does not explicitly reveal frequency components. We solve this by computing the Short-Time Fourier Transform (STFT).

The STFT slides a short window across the audio, computes the Fast Fourier Transform (FFT) of each windowed chunk, and stacks the resulting spectra to show how frequency content changes over time. Human hearing is highly sensitive to small changes at low frequencies but much less so at high frequencies. To match this, we rescale the frequency axis onto the Mel scale, which is roughly linear below 1 kHz and logarithmic above:

The Mel-Frequency Conversion Formula

m = 2595 • log_10(1 + f / 700)

Where f is the raw physical frequency in Hertz, and m is the perceptual frequency in Mel units.
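The formula can be sketched directly in Python, together with its inverse. The function names are illustrative. The example compares how many Mel units a 100 Hz step spans at low and at high frequencies, which is the perceptual compression the scale is meant to capture.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert Hertz to Mel using m = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Invert the formula above: f = 700 * (10^(m / 2595) - 1)."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 100 Hz step covers far more Mel units at low frequencies.
low_step = hz_to_mel(300.0) - hz_to_mel(200.0)
high_step = hz_to_mel(7100.0) - hz_to_mel(7000.0)

print(round(hz_to_mel(1000.0)))  # by construction, 1000 Hz is close to 1000 Mel
print(low_step > high_step)
```

This is the variant of the Mel formula often labeled HTK. Some libraries default to a different variant (librosa's `librosa.filters.mel` uses the Slaney formulation unless `htk=True` is passed), so filterbanks built with different tools may not match exactly.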

A Log-Mel Spectrogram is created by taking the log-amplitude of the Mel-filtered STFT output. It converts a 1D waveform into a 2D matrix of time vs. perceptual frequency. Because this matrix can be treated like a single-channel image, it is a natural input for convolutional or vision-transformer layers.
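The windowing step can be sketched with NumPy alone. This is a simplified illustration, not a drop-in replacement for `librosa.stft`: it uses no padding (librosa centers and pads frames by default, so its frame count differs), and `np.hanning` is a symmetric window. The parameters `n_fft=400` and `hop_length=160` correspond to 25 ms windows and 10 ms hops at 16 kHz.

```python
import numpy as np

def stft_power(waveform: np.ndarray, n_fft: int = 400, hop_length: int = 160) -> np.ndarray:
    """Power spectrogram of shape (n_fft // 2 + 1, n_frames), without padding."""
    window = np.hanning(n_fft)
    # Every length-n_fft slice of the signal, keeping one slice per hop.
    frames = np.lib.stride_tricks.sliding_window_view(waveform, n_fft)[::hop_length]
    spectrum = np.fft.rfft(frames * window, axis=1)
    return (np.abs(spectrum) ** 2).T

sample_rate = 16_000
t = np.arange(10 * sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1_000 * t)  # 10-second, 1 kHz test tone

power = stft_power(tone)
peak_bin = int(np.argmax(power.mean(axis=1)))
peak_hz = peak_bin * sample_rate / 400  # each bin spans 16000 / 400 = 40 Hz

print(power.shape)  # (201, 998): 160,000 samples become about 1,000 short frames
print(peak_hz)      # 1000.0
```

Multiplying this power matrix by an 80-row Mel filterbank and taking the logarithm would then give the (80, T) log-Mel matrix described above.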

3. OpenAI's Whisper Architecture

Whisper is an encoder-decoder Transformer trained on 680,000 hours of multilingual, multi-task labeled audio. Instead of mapping audio phoneme-by-phoneme, it processes entire 30-second blocks end-to-end.

The Neural Processing Chain

1. Audio Encoder

Takes an 80-channel log-Mel spectrogram as input. Two 1D convolution layers with a filter width of 3 (the second with a stride of 2) halve the time resolution. Sinusoidal positional encodings are then added before the sequence enters a standard Transformer encoder.
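The sequence lengths through the encoder front end can be worked out with a few lines of arithmetic. The strides (1, then 2) and padding of 1 used here are assumptions taken from the published Whisper design rather than from this lesson, and the helper name is illustrative.

```python
def conv1d_output_length(length: int, kernel_size: int, stride: int, padding: int) -> int:
    """Standard output-length formula for a 1D convolution (dilation of 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1

SAMPLE_RATE = 16_000   # Hz
CHUNK_SECONDS = 30     # Whisper processes fixed 30-second blocks
HOP_LENGTH = 160       # 10 ms hop between spectrogram frames

n_samples = SAMPLE_RATE * CHUNK_SECONDS   # 480,000 samples
n_frames = n_samples // HOP_LENGTH        # 3,000 spectrogram frames

# Assumed layer settings: kernel 3, padding 1; stride 1 then stride 2.
after_conv1 = conv1d_output_length(n_frames, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_output_length(after_conv1, kernel_size=3, stride=2, padding=1)

print(n_samples, n_frames, after_conv1, after_conv2)  # 480000 3000 3000 1500
```

Under these assumptions, each of the 1,500 encoder positions covers about 20 ms of audio.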

2. Cross-Attention Decoder

An autoregressive Transformer decoder predicts the target text one token at a time, attending to the encoder's representations through cross-attention. It emits subword (byte-pair encoded) text tokens alongside special control tokens.
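Autoregressive decoding can be illustrated with a toy greedy loop. Everything here is hypothetical: `toy_next_token_logits` is a scripted stand-in for a real decoder forward pass (which would also condition on the encoder output through cross-attention), and the three-word vocabulary exists only for the demo.

```python
import numpy as np

VOCAB = ["<eot>", "hello", "world"]  # toy vocabulary; index 0 ends the sequence

def toy_next_token_logits(prefix: list) -> np.ndarray:
    """Hypothetical decoder stub: scores the next token given the tokens so far."""
    script = [1, 2, 0]  # "hello", "world", then end-of-text
    logits = np.zeros(len(VOCAB))
    logits[script[min(len(prefix), len(script) - 1)]] = 5.0
    return logits

def greedy_decode(next_token_logits, eot_id: int = 0, max_len: int = 10) -> list:
    """Repeatedly pick the highest-scoring token and feed it back as context."""
    tokens = []
    for _ in range(max_len):
        next_id = int(np.argmax(next_token_logits(tokens)))
        if next_id == eot_id:
            break
        tokens.append(next_id)
    return tokens

decoded = [VOCAB[i] for i in greedy_decode(toy_next_token_logits)]
print(decoded)  # ['hello', 'world']
```

Each step depends on all previously generated tokens, which is why generation cost grows with output length. Real systems often replace the `argmax` with beam search or sampling.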

3. Special Tokens & Tasks
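As a minimal sketch of how task steering works, the decoder is primed with a short sequence of control tokens before any text is generated. The token spellings below follow Whisper's published prompt format; the helper `build_decoder_prompt` is illustrative and works on strings, whereas a real tokenizer maps each token to an integer ID.

```python
def build_decoder_prompt(language_code: str, task: str, timestamps: bool = False) -> list:
    """Assemble the control-token prefix that selects language and task."""
    if task not in ("transcribe", "translate"):
        raise ValueError(f"unsupported task: {task!r}")
    prompt = ["<|startoftranscript|>", f"<|{language_code}|>", f"<|{task}|>"]
    if not timestamps:
        # Tells the decoder to emit plain text without timestamp tokens.
        prompt.append("<|notimestamps|>")
    return prompt

# French audio, translated into English text, without timestamps.
print(build_decoder_prompt("fr", "translate"))
```

Changing only the task token from `<|transcribe|>` to `<|translate|>` switches the same model from same-language transcription to English translation.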

Python Walkthrough: Custom Audio Pipeline & Whisper Inference

Below is a Python implementation that converts audio into a log-Mel spectrogram, performs simple Voice Activity Detection (VAD) from frame-level RMS energy, and runs standard inference with Hugging Face's transformers library.

import numpy as np
import librosa
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

def load_and_preprocess_audio(file_path: str, target_sr=16000) -> np.ndarray:
    """
    Loads audio, resamples it to target sample rate, and normalizes amplitudes.
    """
    # Load audio using librosa (handles MP3, WAV, etc.)
    waveform, sr = librosa.load(file_path, sr=target_sr, mono=True)
    
    # Normalize waveform amplitude to standard float representation [-1.0, 1.0]
    waveform = librosa.util.normalize(waveform)
    return waveform

def compute_log_mel_spectrogram(waveform: np.ndarray, sr=16000, n_mels=80) -> np.ndarray:
    """
    Manually extracts the Log-Mel Spectrogram features.
    """
    # 1. Short-Time Fourier Transform (STFT)
    stft = librosa.stft(waveform, n_fft=400, hop_length=160, win_length=400)
    
    # 2. Compute power magnitude
    power = np.abs(stft) ** 2
    
    # 3. Create Mel filterbank
    mel_filters = librosa.filters.mel(sr=sr, n_fft=400, n_mels=n_mels)
    
    # 4. Map frequency power to Mel scale and compute Log amplitude
    mel_spectrogram = np.dot(mel_filters, power)
    log_mel = np.log10(np.clip(mel_spectrogram, a_min=1e-10, a_max=None))
    
    return log_mel

def simple_voice_activity_detection(waveform: np.ndarray, frame_length=2048, hop_length=512, threshold=0.02) -> np.ndarray:
    """
    Energy-based VAD: returns a boolean mask with one entry per frame,
    True where the frame's RMS energy exceeds the threshold.
    """
    rms = librosa.feature.rms(y=waveform, frame_length=frame_length, hop_length=hop_length)[0]
    voiced_frames = rms > threshold
    return voiced_frames

def run_whisper_transcription(audio_path: str, task: str = "transcribe", language: str = "french"):
    """
    Orchestrates HuggingFace pipeline to load models and compute outputs.
    """
    print("Loading processor and Whisper-tiny model...")
    processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
    model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
    
    # Preprocess raw wave file
    waveform = load_and_preprocess_audio(audio_path)
    
    # Standard Whisper inputs expect 30-sec feature representations
    input_features = processor(waveform, sampling_rate=16000, return_tensors="pt").input_features
    
    # Steer generation with the language and task arguments (supported by
    # recent transformers releases). Avoid combining get_decoder_prompt_ids()
    # with return_timestamps=True: by default it forces <|notimestamps|>,
    # which conflicts with timestamp prediction.
    print("Generating tokens auto-regressively...")
    predicted_ids = model.generate(
        input_features,
        language=language,
        task=task,
        return_timestamps=True,
    )
    
    # Decode index integers back to language text
    transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
    return transcription[0]

# Execution trace:
# text = run_whisper_transcription("audio_source.wav", task="transcribe", language="french")
# print("Whisper Output:", text)
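The boolean mask returned by `simple_voice_activity_detection` is easier to use once it is converted into time ranges. The helper below is a hypothetical addition, not part of the walkthrough above; it assumes frame i starts at roughly `i * hop_length` samples, which is only approximate when the RMS frames are centered.

```python
import numpy as np

def mask_to_segments(voiced: np.ndarray, hop_length: int = 512, sample_rate: int = 16_000) -> list:
    """Convert a per-frame boolean mask into (start_s, end_s) voiced segments."""
    # Pad with False so segments touching either edge still produce a boundary.
    padded = np.concatenate(([False], voiced, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    starts, ends = changes[::2], changes[1::2]  # boundaries alternate on/off
    seconds_per_frame = hop_length / sample_rate
    return [(float(s * seconds_per_frame), float(e * seconds_per_frame))
            for s, e in zip(starts, ends)]

# Frames 1-2 and frame 5 are voiced in this hand-made mask.
mask = np.array([False, True, True, False, False, True, False])
segments = mask_to_segments(mask)
print(segments)
```

The resulting segments can be used to slice the waveform so that only voiced regions are passed to the model.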

Interactive DSP Challenges

Complete these tasks in your own workspace, and check each one off once it runs successfully in your development environment.

Task 1: Log-Mel Spectrogram Vector Extraction
Implement a function that converts any 1D raw waveform array into an 80-channel log-Mel spectrogram matrix. Verify it against Librosa or TorchAudio and assert that the output shape is (80, T).
Task 2: Energy-Based VAD Pre-processing
Implement a basic Root Mean Square (RMS) energy analysis function over 2048-sample windows. Drop frames whose RMS falls below a chosen threshold (e.g., RMS < 0.02) to crop out silence and low-level noise before model input.
Task 3: Run Multi-Task Whisper Generation
Instantiate openai/whisper-tiny using Hugging Face transformers. Run autoregressive decoding with timestamp tokens enabled alongside the standard text tokens.
Task 4: Cross-Lingual Translation Pipeline
Set Whisper's decoder language token to Spanish or French, but set the task token to <|translate|> so that the model generates the English translation directly.
