Audio transcription API with word-level timestamps and speaker diarization.
Upload an audio file and get back a full transcript with timestamps and speaker labels. The default endpoint returns a job ID for async polling; /transcribe/sync blocks until the transcript is ready.
| Parameter | Default | Description |
|---|---|---|
| file | required | Audio file to transcribe |
| model | large-v3 | Whisper model: tiny, base, small, medium, large-v2, large-v3 |
| language | auto | ISO 639-1 code (en, de, …) or leave empty to auto-detect |
| speaker_count | auto | Expected number of speakers (1–20); helps diarization accuracy |
| enable_diarization | true | Identify and label individual speakers |
| response_format | verbose_json | verbose_json with segments, or json for plain text |
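These parameters combine freely. As a sketch, a synchronous request for a German two-speaker recording that only needs plain text back might look like the following (the file name is illustrative):

```bash
# Illustrative request combining several table parameters:
# German audio, two expected speakers, plain-text JSON response.
curl -X POST "https://transcription-api-v2.lmparsing.cloud/transcribe/sync" \
  -H "X-Api-Key: YOUR_KEY" \
  -F "file=@podcast.mp3" \
  -F "model=medium" \
  -F "language=de" \
  -F "speaker_count=2" \
  -F "response_format=json"
```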
```bash
# async — returns job ID to poll
curl -X POST "https://transcription-api-v2.lmparsing.cloud/transcribe" \
  -H "X-Api-Key: YOUR_KEY" \
  -F "file=@interview.mp3"

# synchronous — blocks until done
curl -X POST "https://transcription-api-v2.lmparsing.cloud/transcribe/sync" \
  -H "X-Api-Key: YOUR_KEY" \
  -F "file=@meeting.wav" \
  -F "model=large-v3"

# poll job status
curl "https://transcription-api-v2.lmparsing.cloud/jobs/{job_id}" \
  -H "X-Api-Key: YOUR_KEY"
```
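A typical async workflow submits the file, then polls the jobs endpoint until the transcript is ready. The sketch below reads the key from an API_KEY environment variable and assumes the submit response carries a job_id field and that status moves through values like completed or failed; those field names are guesses about the response shape, not a documented contract. Requires jq.

```bash
#!/usr/bin/env bash
set -euo pipefail

API="https://transcription-api-v2.lmparsing.cloud"

# Submit the async job and capture the job ID
# (assumes a response body like {"job_id": "..."}).
JOB_ID=$(curl -s -X POST "$API/transcribe" \
  -H "X-Api-Key: $API_KEY" \
  -F "file=@interview.mp3" | jq -r '.job_id')

# Poll every 5 seconds until a terminal status
# (assumed status values: completed / failed).
while true; do
  STATUS=$(curl -s "$API/jobs/$JOB_ID" -H "X-Api-Key: $API_KEY" | jq -r '.status')
  case "$STATUS" in
    completed) break ;;
    failed)    echo "transcription failed" >&2; exit 1 ;;
    *)         sleep 5 ;;
  esac
done

# Fetch the finished job; the transcript is assumed to be embedded in the response.
curl -s "$API/jobs/$JOB_ID" -H "X-Api-Key: $API_KEY" | jq '.'
```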
- **GPU batching:** batched inference on NVIDIA GPUs with VRAM-aware scheduling.
- **Speaker diarization:** identifies who said what using pyannote speaker segmentation.
- **Word-level timestamps:** precise start and end times for every word in the transcript (see the extraction sketch after this list).
- **Model selection:** from tiny (39M parameters) for speed to large-v3 (1.5B) for accuracy.
- **Async and sync modes:** submit and poll, or block for the result, with backpressure and VRAM tracking built in.
- **Language detection:** detects the spoken language automatically, or pin it with a hint.
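With diarization enabled and response_format=verbose_json, the segments can be flattened into a speaker-labeled timeline. The field names below (segments, speaker, start, end, text) are assumptions inferred from the feature list above, not a documented schema:

```bash
# Print "speaker  start-end  text" for each segment of a saved response.
# Field names are assumed; adjust to the actual verbose_json schema.
jq -r '.segments[] | "\(.speaker)\t\(.start)s-\(.end)s\t\(.text)"' transcript.json
```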