Singing Voice Detection with Deep Recurrent Neural Networks

Institut Mines-Télécom, Télécom ParisTech, CNRS LTCI

Simon Leglaive, Romain Hennequin and Roland Badeau
April 24, 2015
40th IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2015, April 19-24 2015, Brisbane, Australia

Introduction | Recurrent Neural Networks and Long Short-Term Memory | System Overview | Results | Conclusion

Introduction

[Figure: segments labeled "singing voice presence" and "singing voice absence"]

singing voice detection, vocal melody estimation, singing voice separation, singer identification, ...

Audionamix / Institut Mines-Télécom


Introduction

Processing chain: pre-processing −→ short-term analysis −→ feature extraction −→ classification + temporal smoothing −→ decision: voice / no voice
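
The slide does not say which temporal smoothing follows the classifier. As an illustration only, the sketch below assumes a sliding median filter over frame-wise binary decisions, which is one common choice; the function name `median_smooth` and the `width` parameter are hypothetical.

```python
from statistics import median


def median_smooth(decisions, width=5):
    """Smooth a sequence of 0/1 frame decisions with a sliding median.

    `width` is the (odd) window length in frames; the window is simply
    truncated at the edges of the sequence. Both are illustrative choices.
    """
    if width % 2 == 0:
        raise ValueError("width must be odd")
    half = width // 2
    smoothed = []
    for t in range(len(decisions)):
        window = decisions[max(0, t - half): t + half + 1]
        # Median of 0/1 values; a tie in an even-length edge window counts as voice.
        smoothed.append(1 if median(window) >= 0.5 else 0)
    return smoothed


# Isolated "voice" frames are removed, short gaps inside a voiced region are filled.
frames = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0]
print(median_smooth(frames, width=3))
```

A classifier that already takes past and future context into account may make such a post-processing step unnecessary.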


Introduction

Our method:
- Bidirectional Long Short-Term Memory (BLSTM) Recurrent Neural Network (RNN) −→ long past and future temporal context
- With several hidden layers −→ extract simple useful information from low-level features


Outline

Recurrent Neural Networks and Long Short-Term Memory
    Artificial Neural Network
    Long Short-Term Memory
    Bidirectional Recurrent Neural Networks
System Overview
    Double HPSS
    Global system
    Building the Network
Results
    Dataset
    Network functioning
    Results


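Formal Neuron

- Activation: $a = \sum_{i=1}^{I} w_i x_i$
- Output: $s = f(a)$, with $f$ the nonlinear activation function (e.g. step function, sigmoid, hyperbolic tangent, ...)

[Figure: neuron $j$ with inputs $x_1, x_2, \dots, x_I$, weights $w_{1j}, w_{2j}, \dots, w_{Ij}$, activation $a_j$ and output $s_j$]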
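
The two formulas of the formal neuron can be written out directly. The sketch below is a minimal illustration; the function names are hypothetical, and no bias term is included because the slide's formula has none.

```python
import math


def sigmoid(a):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-a))


def step(a):
    """Step activation: 1 for a >= 0, else 0."""
    return 1.0 if a >= 0 else 0.0


def neuron_output(x, w, f=math.tanh):
    """Return s = f(a) with activation a = sum_i w_i * x_i."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    a = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return f(a)


# Same inputs and weights, three different activation functions.
x, w = [1.0, 2.0], [0.5, -0.25]   # a = 0.5 - 0.5 = 0
print(neuron_output(x, w, f=step), neuron_output(x, w, f=sigmoid), neuron_output(x, w))
```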


Multi-Layer Perceptron

- Feedforward Artificial Neural Network
- Maps inputs to outputs by propagating data through the layers
- Training: gradient descent using backpropagation

[Figure: multi-layer perceptron with inputs $x_1, \dots, x_I$, $Q$ hidden layers of $H_1, \dots, H_Q$ units (activations $a^{(q)}_j$, outputs $s^{(q)}_j$, weights $w^{(q)}_{ij}$) and outputs $y_1, \dots, y_C$]
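
Propagating data through the layers amounts to applying the formal-neuron formula layer after layer. The sketch below shows only this forward pass (no training), under the assumptions that every layer uses the same activation and has no bias; the names `layer_forward` and `mlp_forward` are hypothetical.

```python
import math


def layer_forward(inputs, weights, f=math.tanh):
    """Compute one layer's outputs.

    weights[j][i] is the weight from input i to unit j (w_ij on the slide).
    """
    return [f(sum(w_ji * x_i for w_ji, x_i in zip(row, inputs))) for row in weights]


def mlp_forward(x, layers, f=math.tanh):
    """Propagate input vector x through a list of weight matrices."""
    s = list(x)
    for weights in layers:
        s = layer_forward(s, weights, f)
    return s


# Two inputs -> two hidden units -> one output.
layers = [
    [[1.0, 2.0], [0.0, 1.0]],   # hidden layer
    [[1.0, -1.0]],              # output layer
]
print(mlp_forward([1.0, 1.0], layers))
```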


Recurrent Neural Network

[Figure: recurrent neural network unfolded in time; horizontal axis: time ($t = 1, 2, \dots, T$); vertical axis: layers ($n = 0, 1, \dots, N, N+1$)]
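
Unfolding in time means that the same weights are reused at every time step, with the hidden state at $t-1$ fed back as an extra input at $t$. The slide gives no equations, so the sketch below assumes a plain single recurrent layer with a zero initial state; `rnn_forward`, `w_in` and `w_rec` are hypothetical names.

```python
import math


def rnn_forward(xs, w_in, w_rec, f=math.tanh):
    """Run one recurrent layer over a sequence of input vectors.

    w_in[j][i]  : weight from input i to hidden unit j
    w_rec[j][k] : weight from hidden unit k at t-1 to hidden unit j at t
    Returns the list of hidden-state vectors, one per time step.
    """
    h = [0.0] * len(w_in)  # initial state assumed to be zero
    states = []
    for x in xs:
        h = [
            f(
                sum(w * x_i for w, x_i in zip(w_in[j], x))
                + sum(w * h_k for w, h_k in zip(w_rec[j], h))
            )
            for j in range(len(w_in))
        ]
        states.append(h)
    return states


# One input, one hidden unit: the state accumulates past inputs.
print(rnn_forward([[1.0], [1.0], [1.0]], w_in=[[1.0]], w_rec=[[0.5]]))
```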


Long Short-Term Memory

- Memory cell
- Input Gate, Output Gate, Forget Gate = Write, Read and Reset

[Figure: LSTM block with block input, input gate i, forget gate f, memory cell c, output gate o and block output; other labels: tanh, 1.0; legend: multiplicative element, σ: logistic sigmoid function, c: memory cell]
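
The write / read / reset roles of the three gates can be seen in one time step of a single cell. The sketch below assumes the commonly used LSTM formulation without peephole connections, with scalar input and state; the slide does not specify the exact variant, and the parameter layout `p` is hypothetical.

```python
import math


def sigmoid(a):
    """Logistic sigmoid, used by the three gates."""
    return 1.0 / (1.0 + math.exp(-a))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single LSTM cell (scalar input and state).

    p maps each of 'i', 'f', 'o', 'z' to a tuple (w_x, w_h, bias).
    Returns the block output h and the new memory cell content c.
    """
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h_prev + b

    i = sigmoid(pre("i"))      # input gate: write
    f = sigmoid(pre("f"))      # forget gate: reset
    o = sigmoid(pre("o"))      # output gate: read
    z = math.tanh(pre("z"))    # block input
    c = f * c_prev + i * z     # memory cell update
    h = o * math.tanh(c)       # block output
    return h, c


# Input gate closed, forget gate open: the cell keeps its previous content.
hold = {"i": (0.0, 0.0, -100.0), "f": (0.0, 0.0, 100.0),
        "o": (0.0, 0.0, 100.0), "z": (1.0, 0.0, 0.0)}
print(lstm_step(0.5, 0.0, 3.0, hold))
```

With the forget gate open and the input gate closed, the cell content is carried unchanged from one step to the next, which is what lets the network retain information over long time spans.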


Bidirectional Recurrent Neural Network

[Figure: bidirectional recurrent neural network unfolded in time]
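
A bidirectional layer runs one recurrent pass forward in time and another backward, so that each time step sees both past and future context. The sketch below illustrates only this wiring; a running mean stands in for a real recurrent layer, and all names are hypothetical.

```python
def running_mean(seq):
    """Toy causal pass: the value at t summarizes seq[0..t] only."""
    out, total = [], 0.0
    for t, x in enumerate(seq, start=1):
        total += x
        out.append(total / t)
    return out


def bidirectional(seq, recurrent_pass=running_mean):
    """Pair, at each time step, a forward-in-time and a backward-in-time state."""
    forward = recurrent_pass(seq)               # depends on the past and present
    backward = recurrent_pass(seq[::-1])[::-1]  # depends on the future and present
    return list(zip(forward, backward))


print(bidirectional([1.0, 3.0, 5.0]))
```

The backward pass needs the whole sequence before it can start, so this kind of network suits offline processing of complete recordings.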


Double Harmonic/Percussive Source Separation

- HPSS with long analysis window
- HPSS with short analysis window


Global System

[Figure: block diagram]
- input signal −→ Double HPSS −→ enhanced vocal components, percussive components
- enhanced vocal components −→ Spectrogram −→ Mel Filter Bank
- percussive components −→ Spectrogram −→ Mel Filter Bank
- 2 x 40 coefficients −→ Data conditioning −→ BLSTM-RNN 80-30-20-40-1
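
The figures in the diagram fit together as follows: two streams of 40 mel coefficients give an 80-dimensional input vector per frame, which matches the first number of the "80-30-20-40-1" network. The sketch below assumes that this string reads as input size, three hidden layer sizes, then output size; the helper names are hypothetical and the feature values are dummies.

```python
def parse_topology(spec):
    """Split a topology string such as '80-30-20-40-1' into layer sizes."""
    return [int(n) for n in spec.split("-")]


def stack_features(vocal_mel, percussive_mel):
    """Concatenate, frame by frame, the mel coefficients of the two streams."""
    if len(vocal_mel) != len(percussive_mel):
        raise ValueError("both streams must have the same number of frames")
    return [list(v) + list(p) for v, p in zip(vocal_mel, percussive_mel)]


sizes = parse_topology("80-30-20-40-1")
n_frames = 3
vocal = [[0.0] * 40 for _ in range(n_frames)]        # dummy values
percussive = [[1.0] * 40 for _ in range(n_frames)]   # dummy values
frames = stack_features(vocal, percussive)

# Every frame vector must match the network's input size.
assert all(len(frame) == sizes[0] for frame in frames)
print(sizes[0], sizes[1:-1], sizes[-1])
```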


Building the Network

- No theoretical rule for choosing the architecture −→ empirical approach, rarely discussed in papers
- Incremental procedure: depth increased by progressively adding hidden layers
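
One way to read the incremental procedure is as a greedy search: add one hidden layer at a time, keeping the size that scores best on validation data. The slide does not give the selection or stopping rule, so both are assumptions here, and `evaluate` is a hypothetical callback that would train a network with the given hidden layer sizes and return its validation score.

```python
def grow_network(candidate_sizes, evaluate, max_hidden_layers=3):
    """Greedily add hidden layers while the validation score improves.

    evaluate(hidden_sizes) -> score, higher is better (hypothetical callback).
    Returns the selected hidden layer sizes and their score.
    """
    best_layers, best_score = [], float("-inf")
    for _ in range(max_hidden_layers):
        # Try every candidate size for the next layer, earlier layers kept fixed.
        trials = [(evaluate(best_layers + [size]), size) for size in candidate_sizes]
        score, size = max(trials)
        if score <= best_score:
            break  # one more layer no longer helps
        best_layers, best_score = best_layers + [size], score
    return best_layers, best_score


def toy_score(hidden_sizes):
    """Stand-in for training plus validation: rewards closeness to [30, 20, 40]."""
    target = [30, 20, 40]
    mismatch = sum(abs(a - b) for a, b in zip(hidden_sizes, target))
    return -mismatch - 100 * (len(target) - len(hidden_sizes))


print(grow_network([10, 20, 30, 40], toy_score))
```

In practice each call to `evaluate` is a full training run, so the number of candidate sizes per layer has to stay small.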


Jamendo: A Common Benchmark Dataset

- Publicly available dataset
- Singing voice activity annotations
- Training set: 61 files
- Validation and Test sets: 16 files each
- Common database −→ fair comparison of our approach


Internal Network Functioning

[Figure: network activity over time; rows: Truth, Decision, Output, Hidden layer 3, Hidden layer 2, Hidden layer 1, Inputs; horizontal axis: time; color scale between -1 (white) and 1 (black)]


Consideration of a Temporal Context

[Figure: network activity over time; rows: Truth, Decision, Output, Hidden layer 3, Hidden layer 2, Hidden layer 1, Inputs; horizontal axis: time]


Results on Jamendo Dataset


Conclusion

- New approach for singing voice detection
- We do not focus on defining a complex feature set −→ may be suboptimal
- We make use of neural networks to extract a simple representation, fitted to our task
- A past and future temporal context is considered by the classifier −→ no need for temporal smoothing
- The results we obtain encourage further work with BLSTM-RNN in MIR for sequence classification tasks, e.g. melody estimation


Thank you
