Upload Audio for Speech Recognition
(max. 100 MB, supported: .mp3, .wav, .ogg, etc.)
About Our Speech Recognition Service
How It Works
On this page, you can upload an audio file (MP3, WAV, OGG, M4A, etc.) and quickly get a text transcription. Simply drag your file into the area above or click it to select from your computer. Maximum file size is 100 MB.
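If you want to check a file before uploading it, the limits above can be sketched as a small Python function. This is only an illustration, not the service's own validation code: the function name `check_upload` is hypothetical, the extension list contains only the formats named on this page (the "etc." suggests others are accepted too), and 1 MB is assumed to mean 1024 × 1024 bytes.

```python
from pathlib import Path

# Assumption: "100 MB" is read as 100 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024

# Only the formats named on this page; the real list may be longer.
KNOWN_EXTENSIONS = {".mp3", ".wav", ".ogg", ".m4a"}


def check_upload(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in KNOWN_EXTENSIONS:
        problems.append(f"format {extension or '(none)'} is not one of the listed formats")
    if size_bytes > MAX_SIZE_BYTES:
        problems.append("file is larger than 100 MB")
    return problems


if __name__ == "__main__":
    print(check_upload("interview.MP3", 42 * 1024 * 1024))   # no problems expected
    print(check_upload("notes.txt", 150 * 1024 * 1024))      # two problems expected
```

A file that fails the extension check here may still be accepted by the service, since the list above is deliberately conservative.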
Why Converting.cloud
- We use OpenAI Whisper, a state-of-the-art speech recognition model with up to 98% accuracy even in noisy environments.
- Support for over 30 languages: English, Russian, Spanish, French, German, Chinese, Japanese, Italian, Portuguese, Arabic, and many more.
- Automatic language detection: if you're unsure which language your recording is in, choose "Auto" and the system will detect it for you.
Available Models
We offer several Whisper models with different capabilities:
| Model | Accuracy | Processing Speed | Best For |
|---|---|---|---|
| tiny | Basic | ~12x faster than real-time | Quick drafts, clear speech |
| base | Good | ~8x faster than real-time | Simple audio, minimal background noise |
| small | Very Good | ~4x faster than real-time | Standard transcription needs (recommended) |
| medium | Excellent | ~2x faster than real-time | Complex audio, multiple speakers |
| large | Superior | ~1x (about real-time) | Challenging audio, accents, background noise |
Language Selection
- Auto mode analyzes the first 30 seconds of audio to detect the language; detection is highly accurate.
- Choosing a language yourself bypasses detection and speeds up transcription slightly.
Processing Time
The processing time depends on the model size and audio duration:
- tiny model: ~5 seconds for 1 minute of audio
- base model: ~8 seconds for 1 minute of audio
- small model: ~15 seconds for 1 minute of audio
- medium model: ~30 seconds for 1 minute of audio
- large model: ~60 seconds for 1 minute of audio
For example, a 10-minute recording using the small model would take approximately 2.5 minutes to process (10 × 15 seconds = 150 seconds).
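The same arithmetic can be written as a short Python sketch. The per-minute figures are the approximate ones listed above; the function name `estimate_processing_seconds` is made up for this illustration, and the estimate covers processing only, not any time spent waiting in the queue.

```python
# Approximate seconds of processing per minute of audio, as listed above.
SECONDS_PER_AUDIO_MINUTE = {
    "tiny": 5,
    "base": 8,
    "small": 15,
    "medium": 30,
    "large": 60,
}


def estimate_processing_seconds(model: str, audio_minutes: float) -> float:
    """Rough processing time in seconds, excluding any queue wait."""
    if model not in SECONDS_PER_AUDIO_MINUTE:
        raise ValueError(f"unknown model: {model}")
    return SECONDS_PER_AUDIO_MINUTE[model] * audio_minutes


if __name__ == "__main__":
    seconds = estimate_processing_seconds("small", 10)
    print(f"{seconds / 60} minutes")  # prints "2.5 minutes"
```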
Processing Queue
All files are processed in a queue system. The prioritization works as follows:
- Files are first prioritized by model size (tiny → base → small → medium → large)
- Within each model group, shorter audio files are processed before longer ones
As a result, choosing a smaller model gets your file processed sooner for two reasons: the model itself runs faster, and the file is placed higher in the processing queue.
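The two prioritization rules can be pictured as a sort with a two-part key. The Python sketch below only illustrates the ordering described above and is not the service's actual scheduler; the `Job` class and `processing_order` function are hypothetical names, and the handling of ties (submission order is kept) is an assumption the page does not state.

```python
from dataclasses import dataclass

# Smaller rank means earlier in the queue (tiny first, large last).
MODEL_RANK = {"tiny": 0, "base": 1, "small": 2, "medium": 3, "large": 4}


@dataclass
class Job:
    name: str
    model: str
    duration_seconds: float


def processing_order(jobs: list[Job]) -> list[Job]:
    """Order jobs by model size first, then by audio duration.

    sorted() is stable, so jobs with the same model and duration keep
    their submission order (an assumption, not documented behavior).
    """
    return sorted(jobs, key=lambda job: (MODEL_RANK[job.model], job.duration_seconds))


if __name__ == "__main__":
    submitted = [
        Job("lecture.mp3", "large", 3600),
        Job("memo.wav", "tiny", 45),
        Job("call.ogg", "small", 600),
        Job("note.m4a", "small", 120),
        Job("intro.mp3", "tiny", 30),
    ]
    for job in processing_order(submitted):
        print(job.name, job.model, job.duration_seconds)
```

In this example the two tiny jobs run first (shortest first), then the two small jobs, and the large job runs last even though it was submitted first.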
How to Use the Service
Once your file is uploaded, you'll see its name, size, and duration. Select a Whisper model size (from tiny to large); larger models yield higher accuracy but take longer to process. Click Transcribe, and in a few minutes a download link for your TXT file will appear in the last column of the file table.