Audio transcription API with word-level timestamps and speaker diarization.
Upload an audio file and get back a full transcript with timestamps and speaker labels. The default endpoint returns a job ID for async polling; /transcribe/sync blocks until the transcript is ready.
| Parameter | Default | Description |
|---|---|---|
| file | required | Audio file to transcribe |
| model | large-v3 | Whisper model: tiny, base, small, medium, large-v2, large-v3 |
| language | auto | ISO 639-1 code (en, de, …) or leave empty to auto-detect |
| speaker_count | auto | Expected number of speakers (1–20); helps diarization accuracy |
| enable_diarization | true | Identify and label individual speakers |
| response_format | verbose_json | verbose_json with segments, or json for plain text |
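These parameters combine freely. As a sketch, a synchronous request for a German two-speaker recording that only needs plain text back might look like the following (the file name is illustrative):

```bash
# Illustrative request combining several table parameters:
# German audio, two expected speakers, plain-text JSON response.
curl -X POST "https://transcription-api-v2.lmparsing.cloud/transcribe/sync" \
  -H "X-Api-Key: YOUR_KEY" \
  -F "file=@podcast.mp3" \
  -F "model=medium" \
  -F "language=de" \
  -F "speaker_count=2" \
  -F "response_format=json"
```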
```bash
# async — returns job ID to poll
curl -X POST "https://transcription-api-v2.lmparsing.cloud/transcribe" \
  -H "X-Api-Key: YOUR_KEY" \
  -F "file=@interview.mp3"

# synchronous — blocks until done
curl -X POST "https://transcription-api-v2.lmparsing.cloud/transcribe/sync" \
  -H "X-Api-Key: YOUR_KEY" \
  -F "file=@meeting.wav" \
  -F "model=large-v3"

# poll job status
curl "https://transcription-api-v2.lmparsing.cloud/jobs/{job_id}" \
  -H "X-Api-Key: YOUR_KEY"
```
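A typical async workflow submits the file, then polls the jobs endpoint until the transcript is ready. The sketch below reads the key from an API_KEY environment variable and assumes the submit response carries a job_id field and that status moves through values like completed or failed; those field names are guesses about the response shape, not a documented contract. Requires jq.

```bash
#!/usr/bin/env bash
set -euo pipefail

API="https://transcription-api-v2.lmparsing.cloud"

# Submit the async job and capture the job ID
# (assumes a response body like {"job_id": "..."}).
JOB_ID=$(curl -s -X POST "$API/transcribe" \
  -H "X-Api-Key: $API_KEY" \
  -F "file=@interview.mp3" | jq -r '.job_id')

# Poll every 5 seconds until a terminal status
# (assumed status values: completed / failed).
while true; do
  STATUS=$(curl -s "$API/jobs/$JOB_ID" -H "X-Api-Key: $API_KEY" | jq -r '.status')
  case "$STATUS" in
    completed) break ;;
    failed)    echo "transcription failed" >&2; exit 1 ;;
    *)         sleep 5 ;;
  esac
done

# Fetch the finished job; the transcript is assumed to be embedded in the response.
curl -s "$API/jobs/$JOB_ID" -H "X-Api-Key: $API_KEY" | jq '.'
```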
- **GPU batching:** batched inference on NVIDIA GPUs with VRAM-aware scheduling.
- **Speaker diarization:** identifies who said what using pyannote speaker segmentation.
- **Word-level timestamps:** precise start and end times for every word in the transcript (see the extraction sketch after this list).
- **Model selection:** from tiny (39M parameters) for speed to large-v3 (1.5B) for accuracy.
- **Async and sync modes:** submit and poll, or block for the result, with backpressure and VRAM tracking built in.
- **Language detection:** detects the spoken language automatically, or pin it with a hint.
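With diarization enabled and response_format=verbose_json, the segments can be flattened into a speaker-labeled timeline. The field names below (segments, speaker, start, end, text) are assumptions inferred from the feature list above, not a documented schema:

```bash
# Print "speaker  start-end  text" for each segment of a saved response.
# Field names are assumed; adjust to the actual verbose_json schema.
jq -r '.segments[] | "\(.speaker)\t\(.start)s-\(.end)s\t\(.text)"' transcript.json
```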