Singing Voice Detection with Deep Recurrent Neural Networks
Simon Leglaive, Romain Hennequin and Roland Badeau
April 24, 2015
40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, April 19-24 2015, Brisbane, Australia
Introduction Recurrent Neural Networks and Long Short-Term Memory System Overview Results Conclusion
Introduction
[Figure: processing chain: pre-processing → short-term analysis → feature extraction → classification + temporal smoothing → decision: voice / no voice]
Audionamix / Institut Mines-Télécom
Our method:
- Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Network (RNN) → long past and future temporal context
- With several hidden layers → extract simple useful information from low-level features
Outline
- Recurrent Neural Networks and Long Short-Term Memory
  - Artificial Neural Network
  - Long Short-Term Memory
  - Bidirectional Recurrent Neural Networks
- System Overview
  - Double HPSS
  - Global system
  - Building the Network
- Results
  - Dataset
  - Network functioning
  - Results
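Formal Neuron
- Activation: $a = \sum_{i=1}^{I} w_i x_i$
- Output: $s = f(a)$ with $f$ the nonlinear activation function (e.g. step function, sigmoid, hyperbolic tangent, ...)
[Figure: formal neuron $j$ with inputs $x_1, x_2, \dots, x_I$, weights $w_{1j}, w_{2j}, \dots, w_{Ij}$, activation $a_j$ and output $s_j$]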
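The two formulas above can be sketched in a few lines of Python. This is only an illustration: the function names (`neuron_output`, `sigmoid`, `step`) are made up for the example, and the neuron has no bias term because the slide's activation does not include one.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, one of the activation functions named on the slide."""
    return 1.0 / (1.0 + math.exp(-a))


def step(a):
    """Step function: 1 for a non-negative activation, 0 otherwise."""
    return 1.0 if a >= 0.0 else 0.0


def neuron_output(x, w, f=math.tanh):
    """Formal neuron: a = sum_i w_i * x_i, then s = f(a)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    a = sum(w_i * x_i for w_i, x_i in zip(w, x))  # activation
    return f(a)  # output


s = neuron_output([1.0, 2.0], [0.5, -0.25], f=sigmoid)
```

Swapping `f` is enough to move between the step, sigmoid and hyperbolic tangent variants.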
Multi-Layer Perceptron
- Feedforward Artificial Neural Network
- Maps inputs to outputs by propagating data through the layers
- Training: gradient descent using backpropagation
[Figure: multi-layer perceptron with inputs $x_1, \dots, x_I$, layers $1, \dots, Q$ of $H_1, \dots, H_Q$ units (activations $a^{(q)}_j$, outputs $s^{(q)}_j$, weights such as $w^{(1)}_{11}$ and $w^{(2)}_{11}$) and outputs $y_1, \dots, y_C$]
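As a rough illustration of "propagating data through the layers", the forward pass can be written as repeated application of the formal neuron. The sketch below makes assumptions the slide does not state: there are no biases, the same activation `f` is used in every layer, and the names `layer_forward` and `mlp_forward` are illustrative. Training by backpropagation is not shown.

```python
import math


def layer_forward(s_prev, weights, f):
    """One layer: a_j = sum_i w_ij * s_i and s_j = f(a_j).

    weights[j][i] is the weight from input i to unit j.
    """
    return [f(sum(w * s for w, s in zip(row, s_prev))) for row in weights]


def mlp_forward(x, layers, f=math.tanh):
    """Propagate the input vector x through a list of weight matrices."""
    s = list(x)
    for weights in layers:
        s = layer_forward(s, weights, f)
    return s


# Two inputs, one layer of two units, one output unit.
layers = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
y = mlp_forward([0.2, -0.1], layers)
```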
Recurrent Neural Network
[Figure: Recurrent Neural Network unfolded in time; horizontal axis: time, $t = 1, 2, \dots, T$; vertical axis: layers, $n = 0, 1, \dots, N, N+1$]
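Unfolding in time can be read as a loop in which each hidden state depends on the current input and on the previous hidden state. The sketch below assumes a single recurrent layer of the simplest kind, $h_t = f(W_{in} x_t + W_{rec} h_{t-1})$ with $h_0 = 0$ and no biases. The slide does not give this equation, and `rnn_forward` is an illustrative name.

```python
import math


def rnn_forward(xs, w_in, w_rec, f=math.tanh):
    """Run one recurrent layer over a sequence of input vectors.

    w_in[j][i]:  weight from input i to hidden unit j.
    w_rec[j][k]: weight from hidden unit k at time t-1 to unit j at time t.
    Returns the list of hidden state vectors, one per time step.
    """
    h = [0.0] * len(w_in)  # initial state h_0
    states = []
    for x in xs:
        # The comprehension reads the previous h before h is rebound.
        h = [
            f(sum(w * v for w, v in zip(w_in[j], x))
              + sum(w * v for w, v in zip(w_rec[j], h)))
            for j in range(len(w_in))
        ]
        states.append(h)
    return states


states = rnn_forward([[1.0], [2.0], [3.0]], w_in=[[1.0]], w_rec=[[0.5]])
```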
Long Short-Term Memory
- Memory cell
- Input Gate, Output Gate, Forget Gate = Write, Read and Reset
[Figure: LSTM block with block input, input gate $i$, forget gate $f$, output gate $o$, memory cell $c$ with a self-connection of weight 1.0, and block output; legend: multiplicative elements, σ: logistic sigmoid function, tanh, $c$: memory cell]
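The write/read/reset roles of the three gates are easier to see in code. The slide shows only the block diagram, so the sketch below assumes the common LSTM formulation with a forget gate and no peephole connections, reduced to a single block with scalar input and state. The function name `lstm_step` and the parameter layout `p` are illustrative.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single LSTM block (scalar input and state).

    p maps each of "i", "f", "o", "g" to a tuple (w_x, w_h, b).
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))    # input gate: write
    f = sigmoid(pre("f"))    # forget gate: reset
    o = sigmoid(pre("o"))    # output gate: read
    g = math.tanh(pre("g"))  # block input
    c = f * c_prev + i * g   # memory cell update
    h = o * math.tanh(c)     # block output
    return h, c


# Input gate closed, forget and output gates open: the cell keeps its content.
keep = {"i": (0.0, 0.0, -50.0), "f": (0.0, 0.0, 50.0),
        "o": (0.0, 0.0, 50.0), "g": (1.0, 0.0, 0.0)}
h, c = lstm_step(1.0, 0.0, 0.7, keep)
```

With the forget gate open and the input gate closed, the cell value is carried over almost unchanged from one step to the next. This is what lets the block retain information over long time spans.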
Bidirectional Recurrent Neural Network
[Figure: Bidirectional Recurrent Neural Network unfolded in time]
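A bidirectional layer can be sketched as two independent recurrent passes, one reading the sequence forward and one reading it backward, so that every frame sees both past and future context. The sketch assumes plain scalar recurrent units rather than LSTM blocks, to stay short. The names `recurrent_pass` and `bidirectional_pass` are illustrative.

```python
import math


def recurrent_pass(xs, w_x, w_h, f=math.tanh):
    """Scalar recurrent layer: h_t = f(w_x * x_t + w_h * h_{t-1}), h_0 = 0."""
    h, out = 0.0, []
    for x in xs:
        h = f(w_x * x + w_h * h)
        out.append(h)
    return out


def bidirectional_pass(xs, fwd, bwd, f=math.tanh):
    """fwd and bwd are (w_x, w_h) pairs, one per direction.

    Returns one (forward state, backward state) pair per frame.
    """
    forward = recurrent_pass(xs, *fwd, f=f)
    # Process the reversed sequence, then restore the original time order.
    backward = recurrent_pass(xs[::-1], *bwd, f=f)[::-1]
    return list(zip(forward, backward))


pairs = bidirectional_pass([0.1, 0.2, 0.3], fwd=(1.0, 0.5), bwd=(1.0, 0.5))
```

At frame `t`, the forward state summarizes frames up to `t` and the backward state summarizes frames from `t` to the end. A following layer would receive both.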
Double Harmonic/Percussive Source Separation
- HPSS with long analysis window
- HPSS with short analysis window
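The chaining of the two stages might look like the sketch below. It rests on assumptions that the slide does not spell out: the second HPSS is applied to the percussive output of the first, the vocals end up in the harmonic output of the second stage, and the separation itself is done by a caller-supplied function `hpss(signal, window) -> (harmonic, percussive)`. That function is hypothetical and is not an API of any particular library.

```python
def double_hpss(signal, hpss, long_window, short_window):
    """Two chained HPSS stages (routing is an assumption, see the lead-in).

    hpss(signal, window) must return a (harmonic, percussive) pair.
    """
    # Stage 1, long window: sustained instruments are assumed to fall into
    # the harmonic output, vocals and drums into the percussive output.
    _sustained, fluctuating = hpss(signal, long_window)
    # Stage 2, short window: vocals are now assumed to look harmonic.
    vocal, percussive = hpss(fluctuating, short_window)
    return vocal, percussive
```

Passing `hpss` as an argument keeps the sketch independent of the actual separation algorithm and of the window lengths, neither of which is given here.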
Global System
- input signal → Double HPSS → enhanced vocal components + percussive components
- each component → Spectrogram → Mel Filter Bank → 2 x 40 coefficients
- Data conditioning → BLSTM-RNN 80-30-20-40-1
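The "2 x 40 coefficients" feeding an 80-unit input layer suggest that the two 40-band mel vectors are concatenated frame by frame. A minimal sketch of that step follows, with the data conditioning left out since it is not detailed here. The names `stack_features` and `LAYER_SIZES` are illustrative, and reading "80-30-20-40-1" as input, three hidden layers and output is an assumption.

```python
# "80-30-20-40-1" read as: 80 inputs, hidden layers of 30, 20 and 40 units,
# 1 output (assumed interpretation of the architecture string).
LAYER_SIZES = [80, 30, 20, 40, 1]


def stack_features(vocal_mel, percussive_mel, n_mel=40):
    """Concatenate, frame by frame, the mel coefficients of both components.

    Each argument is a list of frames; each frame is a list of n_mel values.
    """
    if len(vocal_mel) != len(percussive_mel):
        raise ValueError("both components must have the same number of frames")
    frames = []
    for v, p in zip(vocal_mel, percussive_mel):
        if len(v) != n_mel or len(p) != n_mel:
            raise ValueError("expected %d mel coefficients per frame" % n_mel)
        frames.append(list(v) + list(p))
    return frames


frames = stack_features([[0.0] * 40] * 3, [[1.0] * 40] * 3)
```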
Building the Network
- No theoretical guidance on the choice of architecture → empirical approach, rarely discussed in papers
- Incremental procedure: depth increased by progressively adding hidden layers
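One possible reading of the incremental procedure is a greedy search: add one hidden layer at a time, keep the size that scores best on validation data, and stop once an extra layer no longer helps. The candidate sizes, the stopping rule and the `evaluate` callback (which would train a network and return a validation score, higher being better) are all assumptions made for this sketch and are not taken from the slides.

```python
def grow_network(evaluate, candidate_sizes, max_hidden_layers):
    """Greedy depth search over hidden-layer sizes.

    evaluate(hidden_sizes) is a hypothetical callback returning a validation
    score (higher is better) for a network with those hidden layers.
    """
    hidden, best = [], float("-inf")
    for _ in range(max_hidden_layers):
        # Try each candidate size for the next layer and keep the best one.
        score, size = max((evaluate(hidden + [s]), s) for s in candidate_sizes)
        if score <= best:  # an extra layer no longer improves the score
            break
        hidden, best = hidden + [size], score
    return hidden, best
```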
Jamendo: A Common Benchmark Dataset
- Publicly available dataset
- Singing voice activity annotations
- Training set: 61 files
- Validation and Test sets: 16 files each
- Common database → fair comparison of our approach