
Deepspeech basics
What is Deepspeech
From Mozilla's github repo for deepspeech:
"DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper. Project DeepSpeech uses Google's TensorFlow to make the implementation easier."
Virtual environment
First let's create a virtual environment for deepspeech
conda create -n ds python=3.8
conda activate deepspeechInstall deepspeech
The only required package is deepspeech
pip install deepspeechDownload Model
A pre-trained english model is available for download
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/deepspeech-0.6.1-models.tar.gz
tar xvf deepspeech-0.6.1-models.tar.gzDownload audio files
You can download some example audio files
curl -LO https://github.com/mozilla/DeepSpeech/releases/download/v0.6.1/audio-0.6.1.tar.gz
tar xvf audio-0.6.1.tar.gzRun inference
We can now transcribe the audio file
deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --audio audio/2830-3980-0043.wavIf you ran the above command you should see something like "experience proofsless" if you are using the same model as me
So not perfect, but we can try it out on our own voice as well
Record a wav file
For deepspeech to run inference correctly you will need to record your voice with some specific parameters.
- Sampling rate: 16 kHz
- Channel: 1
- Bit rate: 256 kb/s
We can achieve this using the sox package
If you're on Ubuntu:
sudo apt install soxArch Linux:
yay -S soxMac:
brew install soxAfter installing sox you should have access to the rec command, we will use this to record our voice
To begin recording you voice enter the following command
rec -r 16k -c 1 my_recording.wavTo make sure you have recorded the audio in the proper format we can install another package called mediainfo and run it like so:
mediainfo my_recording.wavYou should see an output similar to the following:
General
Complete name : my_recording.wav
Format : Wave
File size : 64.0 KiB
Duration : 2 s 48 ms
Overall bit rate mode : Constant
Overall bit rate : 256 kb/s
Audio
Format : PCM
Format settings : Little / Signed
Codec ID : 1
Duration : 2 s 48 ms
Bit rate mode : Constant
Bit rate : 256 kb/s
Channel(s) : 1 channel
Sampling rate : 16.0 kHz
Bit depth : 16 bits
Stream size : 64.0 KiB (100%)Run inference
Now we can run inference on our own voice data
deepspeech --model deepspeech-0.6.1-models/output_graph.pbmm --audio my_recording.wavWrapping up
In the next article I'll go over running inference on a GPU