Speech Command detection in audio file

Technical Source
Dec 29, 2021


I need to adapt the “Speech Command Recognition Using Deep Learning” example so that it reads audio from a WAV file and returns the time intervals in which a command appears. I don’t know how to change the real-time microphone analysis in this example into analysis of a static file.

Thank you for your help.

%% Detect Commands Using Streaming Audio from Microphone
% Test your newly trained command detection network on streaming audio from
% your microphone. If you have not trained a network, then type
% |load('commandNet.mat')| at the command line to load a pretrained network
% and the parameters required to classify live, streaming audio. Try
% saying one of the commands, for example, _yes_, _no_, or _stop_.
% Then, try saying one of the unknown words such as _Marvin_, _Sheila_, _bed_,
% _house_, _cat_, _bird_, or any number from zero to nine.
% Specify the audio sampling rate and classification rate in Hz and create
% an audio device reader that can read audio from your microphone.
fs = 16e3;
classificationRate = 20;
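audioIn = audioDeviceReader('SampleRate',fs, ...
'SamplesPerFrame',floor(fs/classificationRate)); % assumed: this line is cut off in the original post; one frame per classification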
% Specify parameters for the streaming spectrogram computations and
% initialize a buffer for the audio. Extract the classification labels of
% the network. Initialize buffers of half a second for the labels and
% classification probabilities of the streaming audio. Use these buffers to
% compare the classification results over a longer period of time and by
% that build 'agreement' over when a command is detected.
frameLength = frameDuration*fs;
hopLength = hopDuration*fs;
waveBuffer = zeros([fs,1]);
labels = trainedNet.Layers(end).Classes;
YBuffer(1:classificationRate/2) = categorical("background");
probBuffer = zeros([numel(labels),classificationRate/2]);
framesNumber = audio/frameLength;
%%
% Create a figure and detect commands as long as the created figure exists.
% To stop the live detection, simply close the figure.
h = figure('Units','normalized','Position',[0.2 0.1 0.6 0.8]);
while ishandle(h)

% Extract audio samples from the audio device and add the samples to
% the buffer.
x = audioIn();
waveBuffer(1:end-numel(x)) = waveBuffer(numel(x)+1:end);
waveBuffer(end-numel(x)+1:end) = x;

% Compute the spectrogram of the latest audio samples.
spec = auditorySpectrogram(waveBuffer,fs, ...
'WindowLength',frameLength, ...
'OverlapLength',frameLength-hopLength, ...
'NumBands',numBands, ...
'Range',[50,7000], ...
'WindowType','Hann', ...
'WarpType','Bark'); % the call is cut off here in the original post; any further name-value pairs are unknown
spec = log10(spec + epsil);

% Classify the current spectrogram, save the label to the label buffer,
% and save the predicted probabilities to the probability buffer.
[YPredicted,probs] = classify(trainedNet,spec,'ExecutionEnvironment','cpu');
YBuffer(1:end-1)= YBuffer(2:end);
YBuffer(end) = YPredicted;
probBuffer(:,1:end-1) = probBuffer(:,2:end);
probBuffer(:,end) = probs';

% Plot the current waveform and spectrogram.
subplot(2,1,1)
plot(waveBuffer)
axis tight

subplot(2,1,2)
pcolor(spec)
caxis([specMin+2 specMax])
shading flat

% Now do the actual command detection by performing a very simple
% thresholding operation. Declare a detection and display it in the
% figure title if all of the following hold:
% 1) The most common label is not |background|.
% 2) At least |countThreshold| of the latest frame labels agree.
% 3) The maximum predicted probability of the predicted label is at
% least |probThreshold|. Otherwise, do not declare a detection.
[YMode,count] = mode(YBuffer);
countThreshold = ceil(classificationRate*0.2);
maxProb = max(probBuffer(labels == YMode,:));
probThreshold = 0.5;

if YMode == "background" || count < countThreshold || maxProb < probThreshold
title(" ")
else
title(string(YMode))
end
drawnow
end



Here is a sketch of how you can achieve this:

1) Read the contents of the audio file

[audioIn,fs] = audioread('filename.wav');
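The question's code fixes the sampling rate at 16 kHz (fs = 16e3) and works on a single channel, so a file recorded differently probably needs to be converted first. A minimal sketch under that assumption follows; the names wavFile and targetFs are illustrative, the test tone only stands in for a real recording, and resample requires Signal Processing Toolbox.

```matlab
% Write a short stereo test file so the sketch runs on its own; replace
% wavFile with the path to your own recording.
wavFile = [tempname '.wav'];
fsFile = 44100;
t = (0:2*fsFile-1)'/fsFile;
audiowrite(wavFile,0.1*[sin(2*pi*440*t), sin(2*pi*660*t)],fsFile);

targetFs = 16e3;                       % rate used in the question's code
[audioIn,fs] = audioread(wavFile);     % numSamples-by-numChannels matrix
audioIn = mean(audioIn,2);             % mix down to mono
if fs ~= targetFs
    audioIn = resample(audioIn,targetFs,fs);   % Signal Processing Toolbox
    fs = targetFs;
end
delete(wavFile);                       % remove the temporary test file
```

After this step, audioIn is a column vector at 16 kHz, so a chunk of fs samples is exactly one second long.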

2) Split the signal into 1-second chunks, with overlap between consecutive chunks. The higher the overlap, the finer your time resolution (i.e., how precisely you can locate where the keyword occurred).

y = buffer(audioIn, fs, round(3*fs/4));
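One detail worth knowing when mapping chunks back to time: with a nonzero overlap, buffer by default starts the first column with overlap-many zeros, so the first real sample sits partway down column 1. Passing 'nodelay' makes column 1 start at the first sample, which keeps the bookkeeping simple. The sketch below uses sample indices as a stand-in signal so the alignment is easy to inspect; hop and startTimes are illustrative names.

```matlab
fs = 16e3;
x = (1:3*fs)';                        % stand-in signal: sample indices, 3 s long
overlap = round(3*fs/4);
hop = fs - overlap;                   % 4000 samples, i.e. 250 ms

y = buffer(x,fs,overlap,'nodelay');   % column k starts at sample (k-1)*hop + 1
startTimes = (0:size(y,2)-1)*hop/fs;  % start time of each chunk in seconds
```

Here column 2 begins at sample 4001, so startTimes(2) is 0.25 s. If the signal length does not fit a whole number of hops, buffer zero-pads the last column.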

3) Convert each chunk (each column of y) to a spectrogram, using the same settings as in the streaming code, and classify it using the network. From the network's results you get an estimate of where the word occurred in your WAV file. Each chunk advances by 1 s minus the overlap, so in this example the step is about 250 ms.
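The three steps can be put together roughly as follows. This is a sketch, not the example's own code: the classifier inside the loop is a placeholder energy threshold so that the script runs without the trained network, and the comment marks where the spectrogram and classify calls from the question's loop would go. The synthetic burst, the 0.05 energy threshold, and the detections table are all assumptions made for illustration; probThreshold is taken from the question's code.

```matlab
% Stand-in audio so the sketch runs without a file: 3 s of silence with a
% 440 Hz burst from 1.0 s to 1.5 s playing the role of the spoken command.
fs = 16e3;
audioIn = zeros(3*fs,1);
audioIn(fs+1:1.5*fs) = 0.5*sin(2*pi*440*(0:0.5*fs-1)'/fs);

overlap = round(3*fs/4);
hop = fs - overlap;                          % 250 ms step between chunks
y = buffer(audioIn,fs,overlap,'nodelay');    % Signal Processing Toolbox
numChunks = size(y,2);

probThreshold = 0.5;                         % value from the question's code
labels = repmat("background",numChunks,1);
scores = zeros(numChunks,1);
for k = 1:numChunks
    chunk = y(:,k);
    % PLACEHOLDER classifier (assumption): an energy threshold. With the
    % real network, compute spec from chunk exactly as the streaming loop
    % does from waveBuffer, then call
    %   [label,probs] = classify(trainedNet,spec,'ExecutionEnvironment','cpu');
    % and store string(label) and max(probs) instead.
    energy = mean(chunk.^2);
    scores(k) = min(1,energy/0.0625);
    if energy > 0.05
        labels(k) = "yes";
    end
end

% Treat low-confidence chunks as background, then merge runs of equal labels.
labels(scores < probThreshold) = "background";
isRunStart = [true; labels(2:end) ~= labels(1:end-1)];
runStart = find(isRunStart);
runEnd = [runStart(2:end)-1; numChunks];
isCommand = labels(runStart) ~= "background";
runStart = runStart(isCommand);
runEnd = runEnd(isCommand);

% A run covers the union of its chunks: start of the first, end of the last.
startSec = (runStart-1)*hop/fs;
endSec = min(((runEnd-1)*hop + fs)/fs, numel(audioIn)/fs);
detections = table(labels(runStart),startSec,endSec, ...
    'VariableNames',{'Command','StartSec','EndSec'});
disp(detections)
```

For the stand-in signal this reports one interval, from 0.5 s to 2.0 s. That is wider than the burst itself because every 1-second chunk that contains the whole word counts as a hit, so the interval should be read as an outer bound on where the word lies. Each chunk has fs samples, the same length as waveBuffer in the streaming loop, so the spectrogram passed to the network keeps the size it was trained on.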



