Music Retrieval Techniques

CSC475 / SENG480B Music Retrieval Techniques

Assignment 5 Fall 2020

J. Shier

November 18, 2020

1 Introduction

In this assignment we will explore similarity and self-similarity matrices, dynamic time warping, as well as sound source separation and transcription. There are three levels of engagement: Minimum, Expected, Advanced. Minimum is the least amount of work you can do to keep up with the course. If you are not able to complete the minimum work suggested then this course is probably not for you. Expected is what a typical student of the course should work on – the goal is to provide a solid foundation and understanding but not go deeper into more interesting questions and research topics. Advanced is what students who are particularly interested in the topic of MIR, students who want a high grade, students interested in graduate school, and graduate students should work on. Each level of engagement includes the previous one, so a student who is very interested in the topic should complete all three levels of engagement (Minimum, Expected, Advanced).

The assignment is worth 10% of the final grade. There is some variance in how much time each question will likely require. Please provide your answers as a single Jupyter Python notebook (.ipynb) uploaded through the BrightSpace website for the course. Also please answer the questions in order and explicitly mention if you have decided to skip a question. For the questions that do not require programming, use the ability to write Markdown in Jupyter notebooks for your answer. You are also welcome to do the assignment in any other programming language/framework. If you do so, please include a PDF file with plots and figures for the questions that require them, as well as answers for the questions that don't require coding. Also include the code itself with instructions on how to run it in a README file.

Note: Please upload your completed notebook as a zip file together with the audio files used in your assignment. To keep the upload size as small as possible, include only the audio files you actually used (i.e., do not upload all of GTZAN or any other dataset).

2 Problems

The coding problems mention specific Python libraries/frameworks that can help you with the implementation. However, you should be able to find corresponding libraries or code for most programming languages if you want to try using a different programming language or environment.

1. For this question use a 30 second clip of audio that you like. You can either use your own or use any of the clips in the GTZAN dataset. Try to select a clip that has a repeating rhythm, phrase, or harmonic structure that you can identify in a self-similarity matrix – you may need to experiment with a few different audio files to find one that works well. Illustrative code sketches for parts a) to e) are given after the list and its footnotes. You can download GTZAN here: https://drive.google.com/file/d/16qBM8tYXn2z5LOl82mjnRAxQyZsJ0xdP/view?usp=sharing

• a) Repeat the 30 second recording to create a one minute recording. Plot the cross-similarity matrix [1] between this new one minute long clip and the original 30 second recording. Describe how the repetition can be seen in a plot of the cross-similarity matrix. Use MFCC features and the affinity mode. (Minimum: 1pt)

• b) Plot the self-similarity matrix for your original 30 second clip (i.e., the cross-similarity of the clip with itself). Visually identify a repeating structure (it could be a bar, a phrase, a segment) on the self-similarity matrix, describe it, and generate two audio fragments that demonstrate this repetition. Hint: repetition shows up as a block structure; you will need to map the dimensions of the repeating block to time to select the audio fragments. (Minimum: 1pt)

• c) Use time stretching [2] on your audio recording to create the following modified signal: the first 10 seconds should be slowed down (rate 0.75), the middle 10 seconds should remain the same, and the last 10 seconds should be sped up (rate 1.25). Plot the cross-similarity matrix between the original and modified recording (use MFCC features and the affinity mode) and describe how the time-stretching can be observed visually.

• d) Use harmonic/percussive sound source separation [3] to generate a percussive track and a harmonic track from your 30 second example. Plot the self-similarity matrices using affinity for the percussive and harmonic versions using MFCCs as well as Chroma (use chroma_cqt [4] or chroma_stft [5]). Based on the resulting four plots, discuss which feature set works better for each configuration (harmonic/percussive). (Expected: 1pt)

• e) Apply Dynamic Time Warping [6] to the original and the modified (time-stretched) recording you created in the previous subquestion. Plot the cost matrix and the associated optimal path and describe how the optimal path reflects the time stretching. Show how you can estimate the time-stretching rates from the optimal path. You can assume that you know the rate is going to change every 10 seconds, but you don't know what the corresponding rates are. Test your procedure with a set of different time stretching rates. (Expected: 1pt)

Footnotes:
[1] https://librosa.org/doc/latest/generated/librosa.segment.cross_similarity.html
[2] https://librosa.org/doc/latest/generated/librosa.effects.time_stretch.html
[3] https://librosa.org/doc/latest/generated/librosa.effects.hpss.html
[4] https://librosa.org/doc/latest/generated/librosa.feature.chroma_cqt.html
[5] https://librosa.org/doc/latest/generated/librosa.feature.chroma_stft.html
[6] https://librosa.org/doc/latest/generated/librosa.sequence.dtw.html
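A minimal sketch for part a), assuming librosa and a 30 second excerpt saved as clip30.wav (the filename is a placeholder for whatever clip you choose):

```python
# Part 1a (sketch): cross-similarity between a repeated (60 s) clip and the original (30 s).
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("clip30.wav", duration=30)   # placeholder filename
y_rep = np.concatenate([y, y])                    # repeat to roughly one minute

# MFCC features for both versions
mfcc_orig = librosa.feature.mfcc(y=y, sr=sr)
mfcc_rep = librosa.feature.mfcc(y=y_rep, sr=sr)

# Cross-similarity in affinity mode (rows: original, columns: repeated clip)
xsim = librosa.segment.cross_similarity(mfcc_rep, mfcc_orig, mode='affinity')

plt.figure(figsize=(8, 4))
librosa.display.specshow(xsim, x_axis='time', y_axis='time', sr=sr)
plt.title('Cross-similarity (MFCC, affinity)')
plt.tight_layout()
plt.show()
```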
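For part b), one possible way to plot the self-similarity matrix and cut out two repeating fragments; it reuses y, sr and mfcc_orig from the part a) sketch, and the fragment start times and duration are placeholders you would read off your own plot:

```python
# Part 1b (sketch): self-similarity (recurrence) matrix and two audio fragments.
import soundfile as sf
import librosa
import librosa.display
import matplotlib.pyplot as plt

ssim = librosa.segment.recurrence_matrix(mfcc_orig, mode='affinity', sym=True)

plt.figure(figsize=(5, 5))
librosa.display.specshow(ssim, x_axis='time', y_axis='time', sr=sr)
plt.title('Self-similarity (MFCC, affinity)')
plt.show()

# Map the repeating block you see in the plot back to sample indices
t1, t2, dur = 4.0, 19.0, 4.0     # hypothetical start times and length (seconds)
frag1 = y[int(t1 * sr):int((t1 + dur) * sr)]
frag2 = y[int(t2 * sr):int((t2 + dur) * sr)]
sf.write('fragment1.wav', frag1, sr)
sf.write('fragment2.wav', frag2, sr)
```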
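For part c), a sketch of the piecewise time stretching, again reusing y, sr and mfcc_orig from the part a) sketch:

```python
# Part 1c (sketch): piecewise time stretching (0.75x / 1.0x / 1.25x in 10 s chunks).
import numpy as np
import librosa

def stretch_in_sections(y, sr, rates=(0.75, 1.0, 1.25), section=10.0):
    """Split y into 10-second sections and time-stretch each with its own rate."""
    n = int(section * sr)
    parts = []
    for i, rate in enumerate(rates):
        chunk = y[i * n:(i + 1) * n]
        parts.append(librosa.effects.time_stretch(chunk, rate=rate))
    return np.concatenate(parts)

y_mod = stretch_in_sections(y, sr)

# Cross-similarity between original and modified (MFCC, affinity);
# plot it with librosa.display.specshow as in the part a) sketch.
mfcc_mod = librosa.feature.mfcc(y=y_mod, sr=sr)
xsim_mod = librosa.segment.cross_similarity(mfcc_mod, mfcc_orig, mode='affinity')
```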
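For part d), a possible layout of the four self-similarity plots after harmonic/percussive separation (chroma_stft could be substituted for chroma_cqt):

```python
# Part 1d (sketch): HPSS, then self-similarity with MFCC and chroma features.
import librosa
import librosa.display
import matplotlib.pyplot as plt

y_harm, y_perc = librosa.effects.hpss(y)

feature_sets = {
    'MFCC': lambda sig: librosa.feature.mfcc(y=sig, sr=sr),
    'Chroma (CQT)': lambda sig: librosa.feature.chroma_cqt(y=sig, sr=sr),
}
signals = {'harmonic': y_harm, 'percussive': y_perc}

fig, axes = plt.subplots(2, 2, figsize=(9, 9))
for row, (feat_name, feat_fn) in enumerate(feature_sets.items()):
    for col, (sig_name, sig) in enumerate(signals.items()):
        ssm = librosa.segment.recurrence_matrix(feat_fn(sig), mode='affinity', sym=True)
        librosa.display.specshow(ssm, x_axis='time', y_axis='time', sr=sr, ax=axes[row, col])
        axes[row, col].set_title(f'{sig_name} / {feat_name}')
plt.tight_layout()
plt.show()
```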
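For part e), a sketch of the DTW plot and a rough rate estimate taken from the slope of the warping path in each 10 second block, assuming the default MFCC hop length of 512 samples and the mfcc_orig / mfcc_mod features from the previous sketches:

```python
# Part 1e (sketch): DTW between original and time-stretched versions, plus rate estimates.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

D, wp = librosa.sequence.dtw(X=mfcc_orig, Y=mfcc_mod, metric='euclidean')
wp = wp[::-1]                      # the path is returned end-to-start; reverse it

fig, ax = plt.subplots(figsize=(6, 6))
librosa.display.specshow(D, x_axis='frames', y_axis='frames', ax=ax)
ax.plot(wp[:, 1], wp[:, 0], color='r')   # optimal path over the cost matrix
ax.set(title='DTW cost matrix and optimal path',
       xlabel='Modified (frames)', ylabel='Original (frames)')
plt.show()

# Estimate the rate in each 10 s block of the original recording:
# rate ~= (original frames covered) / (modified frames covered) in that block.
hop = 512
frames_per_10s = int(10 * sr / hop)
for i in range(3):
    lo, hi = i * frames_per_10s, (i + 1) * frames_per_10s
    seg = wp[(wp[:, 0] >= lo) & (wp[:, 0] < hi)]
    if len(seg) < 2:
        continue
    rate = (seg[-1, 0] - seg[0, 0]) / max(seg[-1, 1] - seg[0, 1], 1)
    print(f'section {i}: estimated rate ~ {rate:.2f}')
```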

2. In this question you will be using a state-of-the-art deep learning sound source separation library, https://github.com/deezer/spleeter, to explore sound source separation and transcription. You can either select a track that you like or use one of the tracks from musdb (a music sound source separation dataset on which spleeter performs quite well): https://sigsep.github.io/datasets/musdb.html. Illustrative code sketches for parts a) to e) are given after the list.

• a) Run sound source separation with the 4-stem model on your example audio recording. Listen to and plot the time-domain waveforms of the 4 individual stems. Comment on what you hear. (Minimum: 1pt)

• b) Pitch shift the vocal stem by a major third (4 semitones) using librosa.effects.pitch_shift and mix the pitch shifted audio with the other stems. Listen to the result and comment on it. (Minimum: 1pt)

• c) Run monophonic pitch detection (either your own implementation or using some library like librosa) on the vocal stem, sonify it using a sinusoid (you can use the sonification code provided in the template) and mix with the drum track. Listen to the result and comment on whether you still recognize the song. (Expected: 1pt)

• d) Inspect the magnitude spectrogram of the drum track and try to identify features that characterize the kick sound and the snare sound. Based on your visual inspection, write a simple kick and snare detection method that outputs the times where a kick or snare drum occurs. Your method only needs to work for the specific example you are examining and does not need to be perfect. Create a new track by placing the kick.wav and snare.wav samples from the resources folder at the detected locations. Mix your synthetic toy drum track with the sinusoid sonification from the previous question and listen to the result. Can you still identify the song? (Advanced: 1pt)

• e) Run beat tracking on the drum track (the result from spleeter) and then use the resulting beat onsets combined with the extracted pitch from the vocal melody to transcribe the vocal melody to music notation using music21. For each beat, map the median MIDI pitch of the vocals to a tiny notation string. Listen to the generated MIDI melody using music21. Can you still identify the melody? (Advanced: 1pt)
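A sketch for part a) using spleeter's Python API; the input filename song.wav and the stems/song/ output layout are assumptions based on spleeter's default behaviour (you can equally run the spleeter command-line tool described in its README):

```python
# Part 2a (sketch): run spleeter's 4-stem model and plot the stem waveforms.
import numpy as np
import librosa
import matplotlib.pyplot as plt
from spleeter.separator import Separator

separator = Separator('spleeter:4stems')
separator.separate_to_file('song.wav', 'stems')   # expected output: stems/song/<stem>.wav

stems = ['vocals', 'drums', 'bass', 'other']
fig, axes = plt.subplots(len(stems), 1, figsize=(10, 8), sharex=True)
for ax, stem in zip(axes, stems):
    y, sr = librosa.load(f'stems/song/{stem}.wav', sr=None)
    ax.plot(np.arange(len(y)) / sr, y)            # time-domain waveform
    ax.set_title(stem)
axes[-1].set_xlabel('Time (s)')
plt.tight_layout()
plt.show()
```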
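For part b), a sketch of the pitch shift and remix, assuming the stem files written by the previous sketch:

```python
# Part 2b (sketch): shift the vocal stem up a major third (4 semitones) and remix.
import numpy as np
import librosa
import soundfile as sf

def load_stem(name, sr=None):
    return librosa.load(f'stems/song/{name}.wav', sr=sr)   # assumed spleeter output path

vocals, sr = load_stem('vocals')
shifted = librosa.effects.pitch_shift(vocals, sr=sr, n_steps=4)   # +4 semitones

mix = shifted.copy()
for name in ['drums', 'bass', 'other']:
    y, _ = load_stem(name, sr=sr)
    n = min(len(mix), len(y))
    mix = mix[:n] + y[:n]

mix /= np.max(np.abs(mix))          # normalize to avoid clipping
sf.write('mix_shifted_vocals.wav', mix, sr)
```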
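For part c), a sketch using librosa's pYIN pitch tracker and a simple sinusoidal sonification; the sonification below is only a stand-in for the code provided in the course template:

```python
# Part 2c (sketch): pitch detection on the vocal stem, sinusoid sonification, mix with drums.
import numpy as np
import librosa
import soundfile as sf

vocals, sr = librosa.load('stems/song/vocals.wav', sr=None)
drums, _ = librosa.load('stems/song/drums.wav', sr=sr)

hop = 512
f0, voiced, _ = librosa.pyin(vocals, fmin=librosa.note_to_hz('C2'),
                             fmax=librosa.note_to_hz('C6'), sr=sr, hop_length=hop)
f0 = np.nan_to_num(f0)              # unvoiced frames -> 0 Hz

# Sonify: one sinusoid whose frequency follows f0, silent in unvoiced frames
f0_samples = np.repeat(f0, hop)[:len(vocals)]
phase = 2 * np.pi * np.cumsum(f0_samples) / sr
sonified = 0.5 * np.sin(phase) * (f0_samples > 0)

n = min(len(sonified), len(drums))
mix = sonified[:n] + drums[:n]
mix /= np.max(np.abs(mix))
sf.write('melody_plus_drums.wav', mix, sr)
```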
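For part d), one crude heuristic: detect onsets separately in a low band (kick) and a mid/high band (snare) of the drum spectrogram; the band edges and the resources/ sample paths are guesses that you would adjust after inspecting your own spectrogram:

```python
# Part 2d (sketch): band-energy onset detection for kick/snare, then a toy drum render.
import numpy as np
import librosa
import soundfile as sf

drums, sr = librosa.load('stems/song/drums.wav', sr=None)
hop = 512
S = np.abs(librosa.stft(drums, hop_length=hop))
freqs = librosa.fft_frequencies(sr=sr)

def band_onsets(fmin, fmax):
    """Onset times computed from the energy envelope of one frequency band."""
    band = S[(freqs >= fmin) & (freqs < fmax)].sum(axis=0)
    env = np.maximum(0.0, np.diff(band, prepend=band[0]))      # half-wave rectified flux
    return librosa.onset.onset_detect(onset_envelope=env, sr=sr,
                                      hop_length=hop, units='time')

kick_times = band_onsets(20, 120)        # guess: kicks show as bursts below ~120 Hz
snare_times = band_onsets(1500, 5000)    # guess: snares show as broadband mid/high energy

# Place the provided one-shot samples at the detected times (assumed resources/ paths)
kick, _ = librosa.load('resources/kick.wav', sr=sr)
snare, _ = librosa.load('resources/snare.wav', sr=sr)
toy = np.zeros(len(drums) + sr)          # a little tail room
for times, sample in [(kick_times, kick), (snare_times, snare)]:
    for t in times:
        i = int(t * sr)
        toy[i:i + len(sample)] += sample[:len(toy) - i]
sf.write('toy_drums.wav', toy / np.max(np.abs(toy)), sr)
```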
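For part e), a sketch that beat-tracks the drum stem and maps the median vocal MIDI pitch in each beat to one quarter note; it builds a music21 stream directly rather than going through a tinyNotation string, and reuses f0, hop and sr from the part c) sketch:

```python
# Part 2e (sketch): beat-synchronous median vocal pitch rendered as a music21 melody.
import numpy as np
import librosa
from music21 import stream, note

drums, _ = librosa.load('stems/song/drums.wav', sr=sr)
tempo, beat_frames = librosa.beat.beat_track(y=drums, sr=sr, hop_length=hop)

midi = librosa.hz_to_midi(np.where(f0 > 0, f0, np.nan))    # NaN for unvoiced frames

melody = stream.Stream()
for start, end in zip(beat_frames[:-1], beat_frames[1:]):
    beat_pitches = midi[start:end]
    beat_pitches = beat_pitches[~np.isnan(beat_pitches)]
    if len(beat_pitches) == 0:
        r = note.Rest()
        r.quarterLength = 1
        melody.append(r)
        continue
    n = note.Note()
    n.pitch.midi = int(np.round(np.median(beat_pitches)))   # median pitch for this beat
    n.quarterLength = 1
    melody.append(n)

melody.show('midi')      # render/play the MIDI version of the transcription
```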
