Text-to-Audio-Models
My Journey into Text-to-Audio Models
I am studying text-to-audio models, with a particular focus on music generation.
Text-To-Music Models
1. Mustango: Toward Controllable Text-to-Music Generation
Mustango is a diffusion-based text-to-music model that enables structured control over chords, beats, tempo, and key directly from natural-language prompts.
MusicBench — Dataset Pipeline
- Seed corpus: 5,521 MusicCaps clips (10 s audio + captions).
- Control sentences: append 0–4 sentences describing the clip's beats, chords, key, and tempo.
- Paraphrase: rephrase the captions with ChatGPT for variety.
- Filter: drop samples whose captions flag the audio as “poor‑quality/low‑fidelity”.
- 11× augmentation: pitch shift ±1–3 semitones, speed change ±5–25 %, crescendo/decrescendo volume ramps → ≈37 k new samples (see the sketch below).
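The augmentation step can be pictured in a few lines of librosa. This is a minimal sketch of the three transform families, assuming mono audio; the exact parameter grid and the split into 11 variants per clip follow the paper, not this code.

```python
# Sketch of MusicBench-style augmentation (illustrative, not Mustango's code).
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int) -> list[np.ndarray]:
    """Return pitch-, tempo-, and volume-augmented variants of one clip."""
    variants = []
    # Pitch: shift by +/- 1..3 semitones.
    for n_steps in (-3, -2, -1, 1, 2, 3):
        variants.append(librosa.effects.pitch_shift(y, sr=sr, n_steps=n_steps))
    # Tempo: change speed within roughly +/- 5-25 % (rate > 1 is faster).
    for rate in (0.75, 0.95, 1.05, 1.25):
        variants.append(librosa.effects.time_stretch(y, rate=rate))
    # Volume: linear crescendo / decrescendo ramps.
    ramp = np.linspace(0.1, 1.0, num=len(y))
    variants.append(y * ramp)          # crescendo
    variants.append(y * ramp[::-1])    # decrescendo
    return variants
```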
Mustango Model
- Latent space: AudioLDM VAE → latent z.
- MuNet denoiser: UNet with hierarchical cross‑attention (sketched below).
- Inputs: FLAN‑T5 text embeddings plus beat and chord features from dedicated beat and chord encoders.
- Inference helpers (predict the controls when the prompt omits them):
  - DeBERTa beat predictor (meter + beat intervals).
  - FLAN‑T5 chord predictor (time‑stamped chords).
- Output: a 10‑s waveform that follows the given tempo, key, chords, and beats when provided, with graceful fallback when not.
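To make the conditioning concrete, here is a minimal PyTorch sketch of hierarchical cross‑attention in the spirit of MuNet: the noisy latent attends to text, then beat, then chord embeddings in turn. Module names, dimensions, and the residual wiring are illustrative assumptions (all three conditioning streams are assumed pre‑projected to the same width), not Mustango's actual code.

```python
# Hierarchical cross-attention sketch (illustrative dimensions and wiring).
import torch
import torch.nn as nn

class HierarchicalCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.beat_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.chord_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, z, text_emb, beat_emb, chord_emb):
        # z: (B, T, dim) noisy latent tokens from a UNet block;
        # the conditioning tensors are (B, L, dim) sequences.
        z = z + self.text_attn(z, text_emb, text_emb)[0]      # FLAN-T5 text first
        z = z + self.beat_attn(z, beat_emb, beat_emb)[0]      # then beat features
        z = z + self.chord_attn(z, chord_emb, chord_emb)[0]   # then chord features
        return z
```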
2. Noise2Music: Text-conditioned Music Generation with Diffusion Models
Goal: generate a 30‑second, 24 kHz stereo music clip from a plain‑language prompt.
Training‑Data Pipeline
- Raw audio pool: 6.8 M full‑length tracks → chopped into 30 s clips (~340 k h).
- Caption vocabularies (built offline):
  - LaMDA‑LF: 4 M rich sentences (LLM‑generated).
  - Rater‑LF / Rater‑SF: 35 k long + 24 k short human‑written sentences/tags from MusicCaps.
- Embedding space scoring: Encode every clip (MuLan‑audio) & every caption (MuLan‑text).
- Pseudo‑labelling: for each clip, pick the top‑10 captions by cosine similarity, then sample 3 low‑frequency ones from each vocabulary, biasing toward rarer labels (see the sketch after this list).
- Extra metadata: Append title, artist, genre, year, instrument tags.
- Quality anchor: Inject ~300 h curated, attribution‑free tracks with rich manual metadata.
- Dual‑rate storage: Keep 24 kHz (for super‑res stage) + 16 kHz copies (for the rest).
- Final payload: every 30 s clip carries 10+ text descriptors spanning objective tags → subjective vibes.
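The pseudo‑labelling step boils down to a nearest‑neighbour search in MuLan embedding space plus rarity‑biased sampling. A sketch, assuming precomputed L2‑normalised embeddings and a `caption_freqs` count array (both hypothetical names); the paper's exact frequency weighting may differ.

```python
# Pseudo-labelling sketch: top-k by cosine similarity, then rarity-biased pick.
import numpy as np

def pseudo_label(clip_emb, caption_embs, caption_freqs, k=10, n_pick=3, rng=None):
    """Pick n_pick caption indices for one clip from its top-k cosine matches."""
    rng = rng or np.random.default_rng()
    sims = caption_embs @ clip_emb            # cosine sim (unit-norm vectors)
    top_k = np.argsort(sims)[::-1][:k]        # top-10 candidate captions
    weights = 1.0 / caption_freqs[top_k]      # bias toward rarer captions
    weights /= weights.sum()
    return rng.choice(top_k, size=n_pick, replace=False, p=weights)
```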
Model Stack (three‑stage diffusion cascade)
| Stage | I/O | Role | Key details |
|---|---|---|---|
| Waveform Generator | Text → 3.2 kHz audio | Sketch global structure. | 1‑D Efficient U‑Net; text fed via cross‑attention; classifier‑free guidance (CFG) at sampling time (sketched below). |
| Waveform Cascader | Text + 3.2 kHz → 16 kHz audio | Upsample & refine. | Receives up‑sampled low‑fidelity audio + the prompt; blur/noise augmentation during training. |
| Super‑Res Cascader | 16 kHz → 24 kHz audio | Restore full bandwidth. | No text conditioning; lightweight U‑Net. |
Alternative spectrogram path: a parallel generator + vocoder pair that works in log‑mel space; cheaper, but less interpretable.
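The CFG mentioned in the table blends a conditional and an unconditional prediction at each denoising step. A minimal sketch; the `model` interface and the guidance weight are placeholder assumptions, not Noise2Music's actual API.

```python
# Classifier-free guidance at one denoising step (illustrative interface).
import torch

def cfg_denoise(model, x_t, t, text_emb, guidance_weight: float = 3.0):
    """Blend conditional and unconditional noise predictions for one step."""
    eps_cond = model(x_t, t, cond=text_emb)   # text-conditioned prediction
    eps_uncond = model(x_t, t, cond=None)     # unconditional prediction
    # Push the sample toward the text-conditioned direction.
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```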
3. Stable Audio - Fast Timing-Conditioned Latent Audio Diffusion
- A convolutional VAE that efficiently compresses and reconstructs long stereo audio.
- Diffusion runs in the VAE's latent space rather than on raw waveforms, which keeps generation fast.
- Timing embeddings condition the model on the intended start offset and total length, giving precise control over output duration.
Dataset Construction
- Collect 806,284 stereo tracks (≈19,500 h) from the AudioSparx library.
- Pre‑process audio:
  - Resample to 44.1 kHz stereo.
  - Slice / pad each file to a fixed 95.1 s window (4,194,304 samples).
- Build text prompts from metadata on the fly:
  - Randomly sample descriptors (genre, mood, BPM, instruments).
  - Emit either free‑form or structured text strings (see the sketch below).
- Usage: the same corpus trains the VAE, the CLAP text encoder, and the latent‑diffusion U‑Net.
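The on‑the‑fly prompt construction can be sketched as random descriptor sampling plus one of two templates. The field names and probabilities here are assumptions for illustration; Stable Audio's actual metadata schema differs in detail.

```python
# Metadata-to-prompt sketch: sample descriptors, emit free-form or structured text.
import random

def build_prompt(meta: dict) -> str:
    """Randomly emit a free-form or structured prompt from track metadata."""
    fields = ["genre", "mood", "bpm", "instruments"]
    picked = {f: meta[f] for f in fields if f in meta and random.random() < 0.8}
    if random.random() < 0.5:
        # Free-form: shuffled descriptor values joined into one string.
        parts = [str(v) for v in picked.values()]
        random.shuffle(parts)
        return ", ".join(parts)
    # Structured: "key: value" pairs in a fixed order.
    return " | ".join(f"{k}: {v}" for k, v in picked.items())
```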
Model Pipeline
| Stage | Key points |
|---|---|
| 1. VAE | 32× compression of the stereo waveform into latents |
| 2. Text encoder | CLAP, trained from scratch on the same dataset |
| 3. Timing embeddings | seconds_start, seconds_total; concatenated with text features (sketched below) |
| 4. Latent U‑Net diffusion | 907 M parameters |
| 5. Inference | DPM‑Solver++ sampler |
Outcome: 44.1 kHz stereo audio up to ≈95 s, generated quickly thanks to latent diffusion, with precise duration control via the timing conditioning.
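The timing conditioning amounts to two learned embeddings, one for the start offset and one for the total length, appended to the text features before cross‑attention. A minimal PyTorch sketch under that assumption; it is not the official stable-audio-tools implementation.

```python
# Timing-conditioning sketch: append start/total "timing tokens" to text features.
import torch
import torch.nn as nn

class TimingConditioner(nn.Module):
    def __init__(self, dim: int = 768, max_seconds: int = 512):
        super().__init__()
        self.start_emb = nn.Embedding(max_seconds, dim)
        self.total_emb = nn.Embedding(max_seconds, dim)

    def forward(self, text_feats, seconds_start, seconds_total):
        # text_feats: (B, T, dim) text features;
        # seconds_start / seconds_total: (B,) integer tensors.
        start = self.start_emb(seconds_start).unsqueeze(1)  # (B, 1, dim)
        total = self.total_emb(seconds_total).unsqueeze(1)  # (B, 1, dim)
        # Concatenate timing tokens with the text sequence for cross-attention.
        return torch.cat([text_feats, start, total], dim=1)
```

At inference, asking for a `seconds_total` shorter than the 95 s training window leads the model to fill the remainder with silence, which can then be trimmed, which is how the fixed-window model produces variable-length output.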
More blog posts coming soon as I continue my learning journey…