
Text-to-Audio-Models

My Journey into Text-to-Audio Models

I am studying text-to-audio models, with a particular focus on music generation models.

Text-To-Music Models

1. Mustango: Toward Controllable Text-to-Music Generation

Mustango is a diffusion-based text-to-music model that enables structured control over chords, beats, tempo, and key directly from natural-language prompts.

MusicBench — Dataset Pipeline

  1. Seed corpus: 5,521 MusicCaps clips (10 s audio + captions).
  2. Control sentences: append 0–4 beat/chord/key/tempo description sentences to each caption.
  3. Paraphrase: rephrase the captions with ChatGPT.
  4. Filter: drop “poor‑quality/low‑fidelity” captions.
  5. 11× augmentation: ±1–3 semitone pitch shifts, ±5–25 % speed changes, crescendo/decrescendo volume (sketched below) → ≈37 k new samples.
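To make the augmentation step concrete, here is a minimal sketch of that kind of pitch/speed/volume perturbation. It assumes librosa for the signal processing and is only an illustration, not the MusicBench authors' actual pipeline.

```python
# Hypothetical sketch of MusicBench-style augmentation (not the authors' code).
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Apply one random pitch/speed/volume perturbation to a 10 s clip."""
    kind = rng.choice(["pitch", "speed", "volume"])
    if kind == "pitch":
        # +/-1..3 semitone pitch shift, as described in the dataset pipeline.
        n_steps = int(rng.choice([-3, -2, -1, 1, 2, 3]))
        return librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps)
    if kind == "speed":
        # +/-5..25 % speed change via time stretching.
        rate = 1.0 + rng.uniform(0.05, 0.25) * rng.choice([-1, 1])
        return librosa.effects.time_stretch(y, rate=rate)
    # Crescendo or decrescendo: a linear volume ramp over the clip.
    ramp = np.linspace(0.3, 1.0, num=len(y))
    if rng.random() < 0.5:
        ramp = ramp[::-1]
    return y * ramp

# Toy usage on a synthetic 10 s sine tone.
sr = 16_000
y = np.sin(2 * np.pi * 220 * np.arange(10 * sr) / sr).astype(np.float32)
augmented = augment(y, sr, np.random.default_rng(0))
```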

Mustango Model

  1. Latent space: AudioLDM VAE → latent z.
  2. MuNet denoiser: UNet + hierarchical cross‑attention.
    • Inputs: FLAN‑T5 text embeddings, plus beat and chord encodings produced by dedicated beat and chord encoders.
  3. Inference helpers:
    • DeBERTa beat predictor (meter + intervals).
    • FLAN‑T5 chord predictor (time‑stamped chords).
  4. Output: a 10 s waveform that follows the requested tempo, key, chords, and beats when they are provided, with a graceful fallback when they are not.
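Putting the pieces together, the inference flow might be sketched as below. Every function here is a stand-in for the real component (FLAN‑T5 encoder, DeBERTa beat predictor, FLAN‑T5 chord predictor, MuNet) and does not reflect the released API.

```python
# Conceptual sketch of Mustango's inference flow (stand-in functions, not the real models).
import numpy as np

def predict_beats(prompt: str) -> dict:
    """Stand-in for the DeBERTa beat predictor (meter + beat intervals)."""
    return {"meter": 4, "intervals": [0.5] * 20}  # dummy values

def predict_chords(prompt: str) -> list:
    """Stand-in for the FLAN-T5 chord predictor (time-stamped chords)."""
    return [(0.0, "Am"), (2.5, "F"), (5.0, "C"), (7.5, "G")]  # dummy values

def encode_text(prompt: str) -> np.ndarray:
    """Stand-in for FLAN-T5 text embeddings."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal((32, 1024)).astype(np.float32)

def munet_denoise_step(z_t, t, text_emb, beats, chords):
    """Stand-in for one MuNet step: a UNet with hierarchical cross-attention over
    text, beat, and chord conditions. Here it just nudges the latent toward zero."""
    return z_t - 0.02 * z_t

prompt = "Upbeat folk tune in C major at 120 bpm with acoustic guitar"
text_emb = encode_text(prompt)
beats, chords = predict_beats(prompt), predict_chords(prompt)

z = np.random.default_rng(0).standard_normal((8, 16, 256)).astype(np.float32)  # latent noise
for t in reversed(range(200)):            # reverse diffusion in AudioLDM's latent space
    z = munet_denoise_step(z, t, text_emb, beats, chords)
# z would then be decoded to a 10 s waveform by the AudioLDM VAE decoder + vocoder.
```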

2. Noise2Music: Text-conditioned Music Generation with Diffusion Models

Generate a 30-second, 24 kHz stereo music clip from a plain-language prompt.

Training‑Data Pipeline

  1. Raw audio pool: 6.8 M full‑length tracks → chopped into 30 s clips (~340 k h).
  2. Caption vocabularies (built offline)
    • LaMDA‑LF – 4M rich sentences (LLM‑generated).
    • Rater‑LF / SF – 35k long + 24k short human sentences/tags from MusicCaps.
  3. Embedding space scoring: Encode every clip (MuLan‑audio) & every caption (MuLan‑text).
  4. Pseudo‑labelling: For each clip pick the top‑10 captions by cosine similarity → sample 3 low‑frequency ones from each vocab (bias toward rarer labels; see the sketch after this list).
  5. Extra metadata: Append title, artist, genre, year, instrument tags.
  6. Quality anchor: Inject ~300 h curated, attribution‑free tracks with rich manual metadata.
  7. Dual‑rate storage: Keep 24 kHz (for super‑res stage) + 16 kHz copies (for the rest).
  8. Final payload: every 30 s clip carries 10+ text descriptors spanning objective tags → subjective vibes.
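The pseudo-labelling in step 4 boils down to cosine similarity in MuLan's joint embedding space followed by frequency-biased sampling. A minimal sketch, with random vectors standing in for the MuLan embeddings and an assumed caption-frequency table:

```python
# Illustrative pseudo-labelling step (assumed details; MuLan embeddings replaced by random vectors).
import numpy as np

rng = np.random.default_rng(0)
clip_emb = rng.standard_normal(128)               # MuLan-audio embedding of one 30 s clip
caption_embs = rng.standard_normal((4000, 128))   # MuLan-text embeddings of a caption vocabulary
caption_freq = rng.integers(1, 1000, size=4000)   # how often each caption occurs in the corpus

def normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# 1. Cosine similarity between the clip and every caption.
sims = normalize(caption_embs) @ normalize(clip_emb)

# 2. Keep the top-10 most similar captions.
top10 = np.argsort(sims)[-10:]

# 3. Sample 3 of them, biased toward low-frequency (rarer) captions.
weights = 1.0 / caption_freq[top10]
weights /= weights.sum()
chosen = rng.choice(top10, size=3, replace=False, p=weights)
print("pseudo-label caption indices:", chosen)
```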

Model Stack (three‑stage diffusion cascade)

  1. Waveform Generator — text → 3.2 kHz audio. Role: sketch the global structure. Key details: 1‑D Efficient‑U‑Net; text fed via cross‑attention; CFG during sampling.
  2. Waveform Cascader — text + 3.2 kHz → 16 kHz audio. Role: upsample & refine. Key details: receives up‑sampled low‑fi audio + the prompt; blur/noise augmentation during training.
  3. Super‑Res Cascader — 16 kHz → 24 kHz audio. Role: restore full bandwidth. Key details: no text conditioning; lightweight U‑Net.

Spectrogram path (alt): parallel generator + vocoder pair that works in log‑mel space; cheaper but less interpretable.
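As a rough illustration of how the three waveform stages chain together at inference time, here is a sketch with stand-in models and naive upsampling; the function names and shapes are assumptions for illustration, not the paper's code.

```python
# Rough illustration of the three-stage cascade at inference time (stand-in models).
import numpy as np

def waveform_generator(prompt: str, seconds: int = 30) -> np.ndarray:
    """Stage 1 stand-in: text -> 3.2 kHz 'sketch' of the clip."""
    return np.zeros(seconds * 3200, dtype=np.float32)

def waveform_cascader(prompt: str, low_fi: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: text + upsampled 3.2 kHz audio -> 16 kHz audio."""
    upsampled = np.repeat(low_fi, 5)    # naive 3.2 kHz -> 16 kHz upsampling
    return upsampled                    # a real cascader would denoise/refine this given the prompt

def super_res_cascader(mid_fi: np.ndarray) -> np.ndarray:
    """Stage 3 stand-in: 16 kHz -> 24 kHz, no text conditioning."""
    return np.interp(
        np.linspace(0, len(mid_fi) - 1, int(len(mid_fi) * 24 / 16)),
        np.arange(len(mid_fi)),
        mid_fi,
    ).astype(np.float32)

prompt = "lofi hip hop beat with vinyl crackle"
audio_24k = super_res_cascader(waveform_cascader(prompt, waveform_generator(prompt)))
print(audio_24k.shape)  # ~30 s of audio at 24 kHz
```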

3. Stable Audio: Fast Timing-Conditioned Latent Audio Diffusion

  1. A convolutional VAE that efficiently compresses and reconstructs long stereo audio.
  2. A latent diffusion model that generates in the VAE's compressed latent space.
  3. Timing embeddings that give precise control over the length of the generated audio.

Dataset Construction

  1. Collect 806,284 stereo tracks (≈19,500 h) from AudioSparx.
  2. Pre‑process audio
    • Resample to 44.1 kHz, stereo.
    • Slice / pad each file to a fixed 95.1 s window (4 194 304 samples).
  3. Build text prompts from metadata on‑the‑fly (sketched after this list)
    • Randomly sample descriptors (genre, mood, BPM, instruments).
    • Emit either free‑form or structured text strings.
  4. Final use
    • The same corpus is used to train the VAE, the CLAP text encoder, and the latent diffusion U‑Net.
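A hypothetical sketch of the on-the-fly prompt construction in step 3 follows; the metadata field names and sampling probabilities are assumptions chosen for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of building a text prompt from track metadata (field names assumed).
import random

metadata = {
    "genre": ["ambient", "electronic"],
    "mood": ["calm", "dreamy"],
    "bpm": 90,
    "instruments": ["synth pad", "piano"],
}

def make_prompt(meta: dict, rng: random.Random) -> str:
    # Randomly keep a subset of descriptor fields each time the clip is sampled.
    fields = [k for k in meta if rng.random() < 0.8]
    if rng.random() < 0.5:
        # Free-form style: join the descriptors into one comma-separated string.
        parts = []
        for k in fields:
            v = meta[k]
            parts.append(", ".join(map(str, v)) if isinstance(v, list) else str(v))
        return ", ".join(parts)
    # Structured style: explicit "key: value" pairs.
    return " | ".join(
        f"{k}: {', '.join(map(str, meta[k])) if isinstance(meta[k], list) else meta[k]}"
        for k in fields
    )

print(make_prompt(metadata, random.Random(0)))
```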

Model Pipeline

  1. VAE — 32× compression of the input audio into latents.
  2. Text encoder — CLAP, trained from scratch on the same corpus.
  3. Timing embeddings — seconds_start and seconds_total, concatenated with the text features (see the sketch below).
  4. Latent diffusion U‑Net — 907 M parameters.
  5. Inference — DPM‑Solver++ sampler.

Outcome: 44.1 kHz stereo audio, up to 95 s, fast (latent) diffusion with precise duration control via timing conditioning.
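To illustrate the timing conditioning, here is a minimal sketch of turning seconds_start / seconds_total into features and concatenating them with the text-encoder output. The raw 2-D feature vector and the placeholder CLAP shapes are stand-ins for the learned embeddings the model actually uses.

```python
# Minimal sketch of timing conditioning (stand-in features, not Stability AI's implementation).
import numpy as np

SR = 44_100
WINDOW = 4_194_304                  # 2**22 samples; WINDOW / SR ≈ 95.1 s, the fixed training window

def timing_features(chunk_start_sample: int, total_samples: int) -> np.ndarray:
    """Turn seconds_start / seconds_total into a small conditioning vector.
    (The real model uses learned embeddings; a raw 2-D feature stands in here.)"""
    seconds_start = chunk_start_sample / SR
    seconds_total = total_samples / SR
    return np.array([seconds_start, seconds_total], dtype=np.float32)

def build_conditioning(text_features: np.ndarray, timing: np.ndarray) -> np.ndarray:
    """Concatenate (broadcast) the timing features onto the text-encoder output."""
    timing_tokens = np.tile(timing, (text_features.shape[0], 1))
    return np.concatenate([text_features, timing_tokens], axis=-1)

print(round(WINDOW / SR, 1))                              # 95.1 s window, matching the dataset slicing
text_features = np.zeros((77, 512), dtype=np.float32)     # placeholder for CLAP text features
timing = timing_features(0, 30 * SR)                      # e.g. a request for a 30 s clip
cond = build_conditioning(text_features, timing)
print(cond.shape)                                         # (77, 514)
```

At inference time, seconds_start is typically 0 and seconds_total is the requested duration, which tells the model how much of the fixed 95.1 s window to fill with audio.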


More blog posts coming soon as I continue my learning journey…