MOJO-AUDIO

High-performance audio DSP library that beats librosa

Overview

mojo-audio is a high-performance audio DSP library written from scratch in Mojo. Built specifically for Whisper speech-to-text preprocessing, it generates mel spectrograms faster than librosa across all audio lengths.

We started 31.7x slower than librosa. After 10 optimization stages, we beat it everywhere—pure Mojo outperforming NumPy/SciPy’s optimized C backends.

Active Development

This library is under active development as part of the Mojo Voice project. The API may change; use it in production at your own risk.

Performance

Whisper audio preprocessing (random audio, fair comparison):

              1 second    10 seconds    30 seconds
librosa:      2-3ms        10ms         26-37ms
mojo-audio:    ~1ms        ~7ms         ~20ms

Result: Mojo wins across all durations
        2-3x faster on short audio
        20-27% faster on long audio
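The timings above depend on how the measurement is taken. The harness behind these numbers is not described here, so the following is only a minimal sketch of one reasonable approach: generate random audio, run one warm-up call, then report the median of several timed runs. The name `bench_ms` and its parameters are hypothetical, and `fn` stands for whichever preprocessing function is being measured.

```python
import random
import statistics
import time


def bench_ms(fn, seconds, sample_rate=16000, repeats=5):
    """Median wall-clock time of fn(audio) in milliseconds (hypothetical helper)."""
    # Random audio avoids flattering either library with an easy input.
    audio = [random.uniform(-1.0, 1.0) for _ in range(int(seconds * sample_rate))]
    fn(audio)  # warm-up call, excluded from the timings
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(audio)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Running the same helper over both implementations with identical durations (1, 10, and 30 seconds) keeps the comparison like for like.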

🎮 Try the Interactive Demo

Run live benchmarks in your browser with configurable parameters. Compare mojo-audio vs librosa across different audio durations, FFT sizes, and BLAS backends.

Key Features

  • Drop-in Whisper preprocessing (16kHz, 400-sample FFT, 80 mel bands)
  • No external dependencies—full control over memory layout
  • SIMD-optimized FFT and filterbank operations
  • C-compatible API for Rust/Python/Go integration
  • Validates against Whisper’s expected output

Why We Built This

We’re building Mojo Voice—a developer-focused voice-to-text app. We needed mel spectrogram preprocessing for our Whisper pipeline but ran into issues with existing implementations.

Instead of debugging someone else’s abstractions, we built our own from scratch. The result: full control over correctness and performance, and a genuine technical differentiator.

Architecture

The pipeline matches OpenAI Whisper’s expected input:

  1. Audio Input → 16kHz sample rate
  2. Window Function → 400-sample Hann window (25ms)
  3. STFT → Short-time Fourier transform with 160-sample hop (10ms)
  4. Mel Filterbank → 80 mel bands
  5. Log + Normalize → Final mel spectrogram output
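The five steps above can be sketched as a slow, dependency-free Python reference. This is not the mojo-audio implementation: it uses a naive O(n^2) DFT, omits center padding, and builds a simplified triangular filterbank on the HTK mel scale (an assumption; the filter weights mojo-audio uses are not stated here). The log scaling mirrors the reference Whisper preprocessing as we understand it (log10, clamp to 8 below the peak, then (x + 4) / 4). All function names are hypothetical.

```python
import cmath
import math

SAMPLE_RATE, N_FFT, HOP, N_MELS = 16000, 400, 160, 80
N_BINS = N_FFT // 2 + 1  # 201 non-redundant bins for real input

# Computed once and reused for every frame.
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * i / N_FFT) for i in range(N_FFT)]
TWIDDLES = [cmath.exp(-2j * math.pi * j / N_FFT) for j in range(N_FFT)]


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT of one windowed frame; returns |X[k]|^2 for k = 0..N_FFT/2."""
    return [
        abs(sum(x * TWIDDLES[(k * t) % N_FFT] for t, x in enumerate(frame))) ** 2
        for k in range(N_BINS)
    ]


def mel_filterbank():
    """N_MELS triangular filters, evenly spaced in mel (simplified)."""
    top = hz_to_mel(SAMPLE_RATE / 2)
    edges = [mel_to_hz(top * i / (N_MELS + 1)) for i in range(N_MELS + 2)]
    bin_hz = [k * SAMPLE_RATE / N_FFT for k in range(N_BINS)]
    bank = []
    for m in range(N_MELS):
        lo, mid, hi = edges[m], edges[m + 1], edges[m + 2]
        bank.append([
            max(0.0, min((f - lo) / (mid - lo), (hi - f) / (hi - mid)))
            for f in bin_hz
        ])
    return bank


def log_mel(samples):
    """Return one list of N_MELS normalized log-mel values per frame."""
    bank = mel_filterbank()
    n_frames = 1 + (len(samples) - N_FFT) // HOP  # no center padding
    mel = []
    for i in range(n_frames):
        chunk = samples[i * HOP : i * HOP + N_FFT]
        power = power_spectrum([s * w for s, w in zip(chunk, WINDOW)])
        mel.append([sum(w * p for w, p in zip(row, power)) for row in bank])
    logs = [[math.log10(max(v, 1e-10)) for v in frame] for frame in mel]
    floor = max(max(frame) for frame in logs) - 8.0
    return [[(max(v, floor) + 4.0) / 4.0 for v in frame] for frame in logs]
```

For an 800-sample input this yields 3 frames of 80 values each, since 1 + (800 - 400) // 160 = 3. A sketch like this is mainly useful as a correctness oracle for small inputs, not for speed.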

Key design decisions:

  • SoA layout for SIMD efficiency (real/imaginary stored separately)
  • 64-byte alignment matching cache line size
  • Handle-based FFI for clean cross-language integration
  • Pre-computed twiddle factors reused across frames
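Two of these decisions, the SoA layout and the reusable twiddle table, can be illustrated with a small Python sketch. It is an iterative radix-2 Cooley-Tukey FFT that keeps real and imaginary parts in separate lists and reads its sines and cosines from tables built once. It handles power-of-two sizes only; how mojo-audio handles the 400-sample Whisper frame is not described here, and the names `make_twiddles` and `fft_soa` are hypothetical.

```python
import math


def make_twiddles(n):
    """Tables for W_n^k = exp(-2*pi*i*k/n), k < n/2, as separate cos/sin lists."""
    cos_t = [math.cos(2 * math.pi * k / n) for k in range(n // 2)]
    sin_t = [-math.sin(2 * math.pi * k / n) for k in range(n // 2)]
    return cos_t, sin_t


def fft_soa(re, im, cos_t, sin_t):
    """In-place iterative radix-2 FFT on separate real/imaginary lists."""
    n = len(re)
    # Bit-reversal permutation puts inputs in the order the butterflies expect.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            re[i], re[j] = re[j], re[i]
            im[i], im[j] = im[j], im[i]
    # Butterfly stages: sub-transform size doubles each pass.
    size = 2
    while size <= n:
        half = size // 2
        step = n // size  # stride into the shared twiddle tables
        for start in range(0, n, size):
            for k in range(half):
                wr, wi = cos_t[k * step], sin_t[k * step]
                a, b = start + k, start + k + half
                tr = re[b] * wr - im[b] * wi
                ti = re[b] * wi + im[b] * wr
                re[b], im[b] = re[a] - tr, im[a] - ti
                re[a], im[a] = re[a] + tr, im[a] + ti
        size *= 2
```

Because the tables depend only on the transform size, one call to `make_twiddles` serves every frame of the same length. Keeping `re` and `im` as separate contiguous arrays is what lets a SIMD implementation load several real parts (or several imaginary parts) in a single vector instruction.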

The Optimization Journey

Stage  Technique          Speedup  What We Did
0      Naive              -        Recursive FFT, allocations everywhere
1      Iterative FFT      3.0x     Cooley-Tukey, cache-friendly
2      Twiddle caching    1.3x     Pre-compute sine/cosine
3      Memory pooling     1.2x     Eliminate per-frame allocations
4      SoA layout         1.3x     SIMD-friendly data structure
5      Vectorized FFT     1.6x     SIMD butterfly operations
6      Filterbank fusion  1.5x     Single-pass mel computation
7      Alignment          1.1x     64-byte cache alignment
8      Parallelization    1.5x     Multi-core frame processing
9      Final tuning       1.1x     Branch hints, unrolling

Total: 24x internal speedup (476ms → ~20ms for 30s audio)
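Per-stage speedups compound multiplicatively, so the total can be sanity-checked with a few lines of arithmetic. The sketch below multiplies the nine listed factors and compares the result with the ratio of the quoted timings. The two figures are close but not identical, which is plausibly down to rounding in the per-stage factors and the approximate final timing.

```python
import math

# Speedup factors for stages 1 through 9, as listed above.
stage_speedups = [3.0, 1.3, 1.2, 1.3, 1.6, 1.5, 1.1, 1.5, 1.1]

compounded = math.prod(stage_speedups)  # about 26.5
measured = 476 / 20                     # 23.8, from 476ms -> ~20ms

print(f"compounded: {compounded:.1f}x, measured: {measured:.1f}x")
```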

Read the Deep Dive

For the full technical breakdown—including what failed and why—read the blog post:

Building a Fast Mel Spectrogram Library in Mojo

Installation

# Clone the repository
git clone https://github.com/itsdevcoffee/mojo-audio

# Build (requires Mojo SDK)
mojo build -o mojo-audio src/main.mojo

Usage

from mojo_audio import MelSpectrogram

# Create processor (matches Whisper defaults)
processor = MelSpectrogram(
    sample_rate=16000,
    n_fft=400,
    hop_length=160,
    n_mels=80
)

# Process audio (audio_samples holds samples at the 16kHz rate configured above)
mel = processor.compute(audio_samples)

FFI Integration

mojo-audio exposes a C-compatible API for integration with Rust, Python, or Go:

// C header
MojoAudioHandle* mojo_audio_create(int sample_rate, int n_fft, int hop_length, int n_mels);
float* mojo_audio_compute(MojoAudioHandle* handle, float* samples, int num_samples);
void mojo_audio_destroy(MojoAudioHandle* handle);
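From Python, the three functions in this header could be bound with `ctypes` roughly as follows. This is a sketch under assumptions: the shared library file name is hypothetical, and the header does not say how many floats `mojo_audio_compute` returns or who owns the returned buffer, so the sketch stops at declaring the signatures.

```python
import ctypes


class MojoAudioHandle(ctypes.Structure):
    """Opaque handle type; its fields are never accessed from Python."""


def bind(lib):
    """Attach the argument and return types from the C header to a loaded library."""
    handle_p = ctypes.POINTER(MojoAudioHandle)
    float_p = ctypes.POINTER(ctypes.c_float)

    lib.mojo_audio_create.argtypes = [ctypes.c_int] * 4
    lib.mojo_audio_create.restype = handle_p

    lib.mojo_audio_compute.argtypes = [handle_p, float_p, ctypes.c_int]
    lib.mojo_audio_compute.restype = float_p

    lib.mojo_audio_destroy.argtypes = [handle_p]
    lib.mojo_audio_destroy.restype = None
    return lib


# Hypothetical usage; the library path is an assumption, not a documented name:
#   lib = bind(ctypes.CDLL("./libmojo_audio.so"))
#   handle = lib.mojo_audio_create(16000, 400, 160, 80)
#   buf = (ctypes.c_float * len(samples))(*samples)
#   out = lib.mojo_audio_compute(handle, buf, len(samples))
#   ...copy what is needed from out before releasing the handle...
#   lib.mojo_audio_destroy(handle)
```

Declaring `argtypes` and `restype` explicitly matters here: without them `ctypes` assumes an `int` return value, which can truncate pointers on 64-bit platforms.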

Contributing

This project is part of the Mojo Voice ecosystem. Contributions welcome—especially:

  • Additional audio processing functions
  • Performance improvements
  • Language bindings
  • Documentation

Check the GitHub repository to get started.
