
Engineering Innovation

The City College of New York


Engineering Innovation

Software for the classification of music


Saifeldin Fathelbab

Writing for engineering

Professor Nicholas Otte

December 15, 2020


Engineering Innovation: Software for the classification of music

Abstract

This engineering innovation lab report analyzes the study “Automatic recognition of music mood using Russell’s two-dimensional valence-arousal space from audio and lyric data classified using SVM and Naïve Bayes” (K. R. Tan et al., 2019). Beyond presenting software for recognizing the positivity or negativity of music, the report examines how music functions in general, how daily life revolves around it, and how it can influence behavior, learning, and one’s perspective on life.

Introduction

Many young people turn to music as a way of exploring and managing their moods and emotions. The literature is replete with studies that correlate music preferences and mental health, as well as a small but growing interest in using music to promote well-being (Mehr et al., 2019). Nevertheless, while the potential for musical behavior is a characteristic of all human beings, its realization is shaped by the environment and the experiences of individuals, often within groups (North and Hargreaves, 2008).

Considering, then, that human beings are influenced by music, which together with their experiences and character shapes behavior, there is a need to explore how an element so deeply rooted in society can be used to evoke particular sensations and encourage particular outcomes. While listening to a specific type of music will not, by itself, lead to a precise action, since many other factors are involved, there is some probability that people who consume certain kinds of music will show changes in their behavior.

This raises the need to analyze the music itself: what composes it and what that composition produces in the listener. Beyond the lyrics, each tone, timbre, and tempo has some degree of influence on the individual. Studies have even shown that music can be a differentiating factor in students’ performance on university campuses.

Tzanetakis and Cook (2002) addressed this problem by creating an automatic genre classifier. Their method of using features extracted from audio files has since served as a standard procedure for subsequent research in this field. By 2007, the new field had added the automatic classification of emotions and moods.

Feature extraction is a process that reduces raw audio data to a smaller set of characteristics such as pitch, timbre, and tempo. This allows researchers to work with the data without having to classify an entire song. Extracting these characteristics requires dedicated tools (Giannakopoulos, 2015; McFee et al., 2015), and finding the optimal characteristics to use in a study requires further research (Grekow, 2018).

Research in this new field has used machine-learning algorithms to detect mood automatically, including decision trees (Patra et al., 2013), neural networks, SVM, and naïve Bayes classifiers. Other methods have also been applied, such as weighted graphs (Yu et al., 2015) and linear regression on word ratings derived from word embeddings (Li et al., 2016).


  • Music and the human being

Sumare and Bhalke (2015) establish that most people enjoy music in their leisure time. Today there is more and more music on personal computers, in music libraries, and on the Internet. Music is considered the best form of expression of emotions, and the music people listen to is governed by the mood they are in.

The characteristics of music such as rhythm, melody, harmony, pitch, and timbre play a significant role in human physiological and psychological functions, thus altering mood. For example, when individuals come back home from work, they may want to listen to some relaxing light music, while at the gymnasium they may choose exciting music with a strong beat and fast tempo. Music is not merely a form of entertainment but also the easiest way of communicating among people, a medium for sharing emotions, and a place to keep emotions and memories (pp. 83-87).

  • How the mood of music is classified

The moods of songs are divided according to the traditional mood model of psychologist Robert Thayer. The model divides songs along the dimensions of stress and energy, from happy to sad and from calm to energetic, respectively (Bhat et al., 2014, p. 359). The eight categories created by Thayer’s model include the ends of the two axes as well as each of the possible intersections of the axes (e.g., happy-energetic or sad-calm) (Nuzzolo, 2015).

Table 1: Moods classified according to musical components.

Source: Derived from Bhat et al. (2014).

Thayer’s eight mood categories are shown with the relative degree of the different musical components found in each, from very low to very high. Higher-energy moods, such as happy, exuberant, and energetic, generally have higher levels of intensity, timbre, pitch, and rhythm than lower-energy moods such as calm, contentment, and depression (Nuzzolo, 2015).

  • Positive effects of music

There is a healing discipline, music therapy, that documents the effects of music on individuals and organizes them to improve a person’s well-being.

Powell (2020), in a Science Focus article, states that the body contains its own ‘pharmacy’ that dispenses a variety of chemicals to help you respond to different situations: calming you down when you need to sleep, or putting you on alert if you are in danger.

If this internal pharmacy is working properly, the right chemicals will be dispensed at the right times. If you lead a busy and stressful life, you may become depressed or physically exhausted because your inner pharmacist is constantly dispensing adrenaline and cortisol, even in non-threatening situations.

Listening to relaxing music has been shown to lower adrenaline and cortisol levels in the bloodstream and therefore reduce stress. Researchers at the University of Toronto have even shown this to be true for distressed babies.

In addition, the fact that the music is pleasant tells the in-house pharmacist to start administering chemicals such as dopamine and serotonin, which will improve your mood and help relieve stress and depression.

Music has also been shown to help with insomnia. In a 2007 study of young adults with insomnia in Budapest, more than 80 percent of participants slept better after three weeks of listening to classical music before bed.

The world of cinema offers obvious examples of how music manipulates our emotions. If the action on the screen is emotionally neutral (a woman walking down the street), music can alert us that something scary or happy is about to happen.

If the director wants to make you jump with fear, a sudden loud noise (or musical sound) is very effective at triggering your fight-or-flight response, which will flood your system with adrenaline and cortisol.

The possible link between music and concentration has been the subject of much research; it is of interest to everyone from call-center managers to students trying to finish an essay. This research has shown that music can help when the alternative is distracting noise. If you are trying to finish a report in a busy café, music through your headphones will help you stay focused. If, on the other hand, you are working in a quiet environment, the music itself will be a distraction: part of your intellectual capacity will be devoted to processing the music, leaving less capacity for the work you are trying to do. Music with lyrics is especially distracting.

The situation is a little different if you are doing a simple task such as ironing or washing. In this case, you will have a lot of mental capacity available and the music will help you stay in a good mood and avoid boredom, probably improving your performance on the task at hand.

  • Studies showing the positive social and academic effects of music

In terms of childhood and adolescence, for example, Putkinen et al. (2019) demonstrate how musical training is likely to foster enhanced sound encoding in 9- to 15-year-olds and thus be related to reading skills. A separate Finnish study by Saarikallio et al. (2020) provides evidence of how music listening influences adolescents’ perceived sense of agency and emotional well-being, whilst demonstrating how this impact is particularly nuanced by context and individuality.

A related study by Loui et al. (2019) in the USA offers insights into the neurological impact of sustained musical instrument practice. Eight-year-old children who played one or more musical instruments for at least half an hour per week had higher scores on verbal and intellectual ability, and these correlated with greater measurable connections between particular regions of the brain related to both auditory-motor and bi-hemispheric connectivity.

These studies reflect how engaging with music, whether by listening or by creating it with instruments, affects human beings in ways that can show up in behavior, in the management of situations, in school performance, in the acquisition of skills, and in addressing weaknesses.


  • Importance

Taking the above considerations into account, the question arises as to how music can be analyzed in advance in order to predict what effect it might have on individuals.

Psychologists have done a great deal of work in this area and have proposed several models of human emotion.

  • Hevner’s experiment

In music psychology, the earliest and best-known systematic attempt at creating a music mood taxonomy was by Kate Hevner. Hevner examined the affective value of six musical features (tempo, mode, rhythm, pitch, harmony, and melody) and studied how they relate to mood. Based on the study, 67 adjectives were categorized into eight groups of similar emotions.

  • Russell’s model

Both Ekman’s and Hevner’s models belong to the categorical approach, because their mood spaces consist of a set of discrete mood categories. In contrast, James Russell proposed a circumplex model of emotion that arranges 28 adjectives in a circle on a two-dimensional bipolar space (arousal – valence). This model helps to separate opposite emotions and keep them apart.

  • Thayer’s model

Another well-known dimensional model was proposed by Thayer. It describes mood with two factors, a stress dimension (happy/anxious) and an energy dimension (calm/energetic), and divides music mood into four clusters according to the four quadrants of the two-dimensional space: Contentment, Depression, Exuberance, and Anxious (Frantic) (Sumare & Bhalke, 2015).
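As an illustrative sketch, not part of the cited studies, the function below shows how such a dimensional model can be operationalized: a (valence, arousal) pair, assumed here to be normalized to the range [-1, 1] with zero as the threshold on each axis, is mapped onto the four clusters named above.

```python
def thayer_quadrant(valence: float, arousal: float) -> str:
    """Map a (valence, arousal) pair, assumed normalized to [-1, 1],
    onto the four clusters of Thayer's two-dimensional model.
    The zero thresholds are an illustrative assumption."""
    if valence >= 0 and arousal >= 0:
        return "Exuberance"         # pleasant and energetic
    if valence < 0 and arousal >= 0:
        return "Anxious (Frantic)"  # unpleasant and energetic
    if valence < 0 and arousal < 0:
        return "Depression"         # unpleasant and calm
    return "Contentment"            # pleasant and calm


print(thayer_quadrant(0.7, 0.6))    # Exuberance
print(thayer_quadrant(-0.4, -0.2))  # Depression
```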

Materials and methods

This section analyzes the results of the study carried out by K. R. Tan et al. (2019), which aimed to create a system able to detect the mood of a song based on the four quadrants of Russell’s model. The researchers used two classification algorithms, SVM and Naïve Bayes, to train separate classification models for valence and arousal, using selected audio features for SVM and lyrical features for Naïve Bayes. This process returns four trained valence and arousal models, two for each algorithm. Valence describes how positive or negative (pleasant or unpleasant) a piece is, while arousal describes how exciting or calming it is.

The data set used for this study is the one used by Panda et al. (2019). It contains valence and arousal annotations and the quadrant to which each song belongs. The data set consisted of 180 songs with annotated lyrics, while 162 songs had audio annotations. The number of songs containing both lyric and audio annotations amounted to 133.

Audio files: The data set contained 158 mp3 files with clips of 30 seconds each. Each file was converted to single-channel wav format at a sampling rate of 44,100 Hz for this study. Only 126 songs were used, since some annotated songs did not have their audio data saved while others had low audio quality. The number of songs used in each quadrant was: Quadrant 1 – 38, Quadrant 2 – 36, Quadrant 3 – 27, and Quadrant 4 – 26. The data set was divided into two parts: 66.66% for training and 33.33% for testing. The test data contained two parts: 26 songs (20.5%) with audio ground-truth values only and 16 songs (12.5%) with both audio and lyric ground-truth data.

Lyrics: The data set contained 180 songs with links to their available lyrics. The lyrics were stored in text files and pre-processed to meet the criteria of the dictionary used. The number of songs in each quadrant was: Quadrant 1 – 44, Quadrant 2 – 41, Quadrant 3 – 51, and Quadrant 4 – 44. Of the 180 lyrics, 66.67% were selected for training and 33.33% for testing.
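The exact splitting procedure is not detailed in the study; the sketch below shows one plausible way to reproduce a roughly two-thirds/one-third split with scikit-learn, stratified by quadrant. The file names and the use of stratification are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Per-quadrant counts from the audio subset described above.
quadrants = [1] * 38 + [2] * 36 + [3] * 27 + [4] * 26
song_files = [f"song_{i:03d}.wav" for i in range(len(quadrants))]  # placeholder file names

# Roughly 66.66% training / 33.33% testing, keeping quadrant proportions similar.
train_files, test_files, train_q, test_q = train_test_split(
    song_files, quadrants, test_size=1 / 3, stratify=quadrants, random_state=0
)

print(len(train_files), len(test_files))  # about two thirds for training, one third for testing
```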

  • Feature extraction

Features were extracted from both the audio and the lyrics. Both simple features (e.g., the zero-crossing rate) and multidimensional features (e.g., Mel-frequency cepstral coefficients, MFCC) were used.

Audio extraction: Features were extracted using pyAudioAnalysis (Giannakopoulos, 2015), and an additional tonal feature called tonnetz was extracted with Librosa (McFee et al., 2015).

The selected features were not the same for valence and arousal; instead, they were chosen based on Grekow’s (2018) research. The following features were used for arousal detection: energy, energy entropy, spectral energy, spectral flux, spectral roll-off, beats per minute, and their standard deviations. The audio features used for valence detection were the zero-crossing rate (ZCR), energy, energy entropy, the three spectral features, MFCC, the chroma vector, chroma deviation, and their standard deviations.
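As a minimal sketch of this extraction step, the code below pulls short-term features with pyAudioAnalysis and the tonnetz and tempo with Librosa from a single 30-second clip. The file name, the 50 ms window with a 25 ms hop, and the mean/standard-deviation summary are assumptions for illustration rather than the study’s exact settings.

```python
import numpy as np
import librosa
from pyAudioAnalysis import audioBasicIO, ShortTermFeatures

CLIP = "clip_q1_001.wav"  # placeholder file name for one 30-second, 44100 Hz clip

# Short-term features from pyAudioAnalysis: ZCR, energy, energy entropy,
# spectral features, MFCCs, chroma vector, etc. (one row per feature).
fs, signal = audioBasicIO.read_audio_file(CLIP)
signal = audioBasicIO.stereo_to_mono(signal)
window, step = int(0.050 * fs), int(0.025 * fs)  # assumed 50 ms window, 25 ms hop
feats, names = ShortTermFeatures.feature_extraction(signal, fs, window, step)

# Summarize each feature over time with its mean and standard deviation.
summary = {name: (float(row.mean()), float(row.std())) for name, row in zip(names, feats)}

# Tonal centroid (tonnetz) and an estimated tempo in beats per minute via Librosa.
y, sr = librosa.load(CLIP, sr=44100, mono=True)
tonnetz = librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr)
tempo, _ = librosa.beat.beat_track(y=y, sr=sr)

print(names[:5])                    # first few pyAudioAnalysis feature names
print(tonnetz.mean(axis=1), tempo)  # summarized tonal features and tempo estimate
```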

Lyrics extraction: The data set provided links that redirected to each song’s lyrics. The lyrics were collected from these links and stored in text files. Some of the links were unavailable or contained incorrect lyrics, so the researchers located the lyrics from other sources and corrected them. The resulting lyrics were then saved to text files and pre-processed to fit the criteria of the dictionary that was used, with NLTK used to improve the pre-processing. The lyrics were first stripped of stop words and punctuation and finally lemmatized to their root words. Words with negative prefixes were preserved, and words ending in “in’” were corrected as well. The results were saved in text files for the training and testing phases.
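A minimal sketch of this kind of pre-processing with NLTK is shown below. It lowercases the text, expands “-in’” endings, drops punctuation and stop words, and lemmatizes each token; the exact dictionary matching and the handling of negative prefixes described above are not reproduced here.

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (skipped quietly if already present).
for resource in ("punkt", "punkt_tab", "stopwords", "wordnet"):
    nltk.download(resource, quiet=True)

STOP_WORDS = set(stopwords.words("english"))
LEMMATIZER = WordNetLemmatizer()

def preprocess_lyrics(text):
    """Clean one song's lyrics: expand "-in'" endings, drop punctuation
    and stop words, and lemmatize the remaining tokens."""
    text = text.lower()
    text = re.sub(r"in'\b", "ing", text)  # e.g. "runnin'" -> "running"
    tokens = nltk.word_tokenize(text)
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [LEMMATIZER.lemmatize(t) for t in tokens]

print(preprocess_lyrics("I've been runnin' down these streets, feelin' so alive"))
```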

Machine-learning algorithms: The selected machine-learning algorithms, support-vector machine and Naïve Bayes, were used for audio and lyrics classification, respectively.

Support-vector machine classifier: The classifier available in Python’s sklearn was used with the following parameters: for arousal, the C parameter was set to 150, while for valence it was set to 10^5. The training produces two models, one for valence and another for arousal.
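The sketch below shows how two such SVM models might be set up with scikit-learn using the C values just mentioned. The random feature matrix, the binary labels, and the feature-standardization step are placeholders and assumptions, standing in for the summarized audio features and quadrant-derived labels.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Placeholder data: 126 songs x 34 summarized audio features (the real values
# would come from the extraction step sketched earlier).
X = rng.normal(size=(126, 34))
y_arousal = rng.integers(0, 2, size=126)  # 1 = high arousal, 0 = low (placeholder labels)
y_valence = rng.integers(0, 2, size=126)  # 1 = positive valence, 0 = negative

X_tr, X_te, ya_tr, ya_te, yv_tr, yv_te = train_test_split(
    X, y_arousal, y_valence, test_size=1 / 3, random_state=0
)

# Two separate SVM models with the C values reported in the study;
# scaling the features first is an added assumption, not from the source.
arousal_model = make_pipeline(StandardScaler(), SVC(C=150)).fit(X_tr, ya_tr)
valence_model = make_pipeline(StandardScaler(), SVC(C=10**5)).fit(X_tr, yv_tr)

print("arousal accuracy:", arousal_model.score(X_te, ya_te))
print("valence accuracy:", valence_model.score(X_te, yv_te))
```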

Naïve Bayes classifier: A modified Naïve Bayes classifier was used; it relied on Warriner et al.’s (2013) word norms and the NLTK library in Python to extract and use the pre-processed lyrics. Training and testing data were split 66.67% and 33.33%.
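The study’s specific modification around Warriner et al.’s norms is not detailed here, so the sketch below only illustrates the general shape of a lyrics-based classifier using NLTK’s built-in Naïve Bayes with bag-of-words features; the token lists and labels are invented placeholders.

```python
from nltk.classify import NaiveBayesClassifier, accuracy

# Invented placeholder data: (pre-processed token list, valence label).
# In the study, tokens would come from the pre-processing step sketched
# earlier and labels from the annotated dataset.
train_lyrics = [
    (["love", "dance", "shine", "happy"], "positive"),
    (["alone", "cry", "rain", "goodbye"], "negative"),
    (["sun", "smile", "together", "sing"], "positive"),
    (["dark", "lost", "cold", "tear"], "negative"),
]
test_lyrics = [
    (["happy", "sing", "love"], "positive"),
    (["cry", "cold", "alone"], "negative"),
]

def bag_of_words(tokens):
    """NLTK-style feature dict marking which words are present."""
    return {word: True for word in tokens}

train_set = [(bag_of_words(tokens), label) for tokens, label in train_lyrics]
test_set = [(bag_of_words(tokens), label) for tokens, label in test_lyrics]

classifier = NaiveBayesClassifier.train(train_set)
print("valence accuracy:", accuracy(classifier, test_set))
```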

Results

The SVM classifier was tested with the results shown in Table 2. Sixteen songs (12.5%) of the 126 were used to test the case where arousal is predicted from audio features and valence from lyrics, while 26 songs (20.5%) were used for audio-only detection. The remaining 84 songs (66.66%) were used for training. Table 3 shows the precision, recall, and F1-score for Naïve Bayes. Tables 4 and 5 show the tenfold cross-validation scores for valence and arousal on the audio training data. The accuracy results for the 16 songs used to test both classification algorithms are shown in Table 6.

Table 2: Results of SVM classifier on the training and testing data.

Source: K. R. Tan et al. (2019).

Table 3: Results of Naïve Bayes classifier on the testing data.

Source: K. R. Tan et al. (2019).

Table 4: Valence Tenfold Cross-Validation Score (%).

Source: K. R. Tan et al. (2019).

Table 5: Arousal Tenfold Cross-Validation Score (%).

Source: K. R. Tan et al. (2019).

Audio detection: In extracting the audio features for training and testing, the 84 training songs yielded 0.88 precision, 0.89 recall, and a 0.88 F1-score for valence, while arousal reached 0.90 for precision, recall, and F1-score. Tenfold cross-validation performed on the training data showed poor results for valence but high accuracy for arousal. The 26 songs used to test audio-only detection showed 0.58 precision, 0.58 recall, and a 0.57 F1-score for valence, while arousal achieved 1.00 for precision, recall, and F1-score. The 16 songs used for combined testing resulted in 0.45 precision, 0.44 recall, and a 0.44 F1-score for valence, while arousal reached 0.94 for precision, recall, and F1-score. The cause of the low accuracy in detecting valence is discussed in the work of Yang, Dong, and Li (2017): arousal can be distinguished relatively easily as exciting or calm, with tempo being the key factor in this study, whereas valence is difficult to distinguish because it is rated as either positive or negative (pleasant or unpleasant) and people have different opinions about a song’s pleasantness.

Lyrics detection: Training data consisted of 120 songs, and 60 songs were used for testing. The classifier achieved 85% accuracy for valence (51 songs) and 75% accuracy for arousal (45 songs). The confusion matrix in Table 3 shows high accuracy for valence, but arousal was predicted poorly this time. The trained model was then tested on the 16 selected songs and achieved high accuracy for valence detection, in contrast to the low accuracy obtained from audio classification with SVM.
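For reference, the sketch below shows how figures like those in Tables 2-5 (precision, recall, F1-score, and tenfold cross-validation scores) can be computed with scikit-learn. The feature matrix, labels, and predictions are random placeholders; only the evaluation calls reflect the procedure described above.

```python
import numpy as np
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)

# Placeholder data standing in for the 84 training songs' summarized audio features.
X_train = rng.normal(size=(84, 34))
y_train = rng.integers(0, 2, size=84)  # placeholder arousal labels

# Tenfold cross-validation accuracy on the training data, as in Tables 4 and 5.
model = make_pipeline(StandardScaler(), SVC(C=150))
cv_scores = cross_val_score(model, X_train, y_train, cv=10)
print("10-fold accuracies:", np.round(cv_scores, 2))

# Precision, recall, and F1-score on a held-out test set, as in Tables 2 and 3.
y_true = rng.integers(0, 2, size=26)   # placeholder ground-truth labels (26 test songs)
y_pred = rng.integers(0, 2, size=26)   # placeholder model predictions
print(classification_report(y_true, y_pred, digits=2))
```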

Table 6: SVM arousal and NB valence have higher accuracies.

Source: K. R. Tan et al. (2019).

Arousal detection is highly accurate when audio features are used, while valence detection is highly accurate when lyrics are used. Arousal is easy to distinguish by listening, since it simply ranges from high to low; this study relied mainly on tempo for arousal detection from the extracted audio features. Valence detection using lyrics with Naïve Bayes resulted in higher accuracy than using audio because the positivity or negativity of a song is difficult to distinguish and analyze from the tune alone (K. R. Tan et al., 2019).

From the previous considerations and the analysis of K. R. Tan et al. (2019), it becomes evident that analyzing music through the extraction of low-level features is a viable way to classify it, or at least rank it, according to its degree of positivity or negativity. However, this topic still requires much research, since the lyrics are clearly a determining factor in the classification.

I consider music a cultural good that is present in the lives of all human beings, that influences us in different ways, and that can be a tool for well-being or an enabler of inappropriate behavior. As the studies reviewed here show, we can be manipulated by music, but we can also educate our brains, and the substances the brain releases in response to music can create sensations that lead to action. This is a wide field from which many benefits can be obtained, starting with education: from a very early age, human beings can use music to support their learning.


References

Bhat, A. S., V. S., A., S. Prasad, N., & Mohan D., M. (2014). An efficient classification algorithm for music mood detection in western and hindi music using audio feature extraction.  2014 Fifth International Conference on Signal and Image Processing, pp. 359-364. DOI: 10.1109/ICSIP.2014.63

Giannakopoulos T 2015 pyAudioAnalysis: an open-source Python library for audio signal analysis PLoS ONE 10(12) doi:10.1371/journal.pone.0144610

Grekow J 2018 Audio features dedicated to the detection and tracking of arousal and valence in musical compositions J. of Inform. and Telecom. 1(12) doi:10.1080/24751839.2018.1463749

Sumare, S., & Bhalke, D. G. (2015). IOSR Journal of Electronics and Communication Engineering (IOSR-JECE), e-ISSN: 2278-2834, p-ISSN: 2278-8735, pp. 83-87.

Li M, Long Y and Lu Q 2016 A regression approach to valence-arousal ratings of words from word embedding 2016 Int. Conf. on Asian Language Processing (IALP) pp 120-3 doi:10.1109/IALP.2016.7875949

Loui P, Raine LB, Chaddock-Heyman L, Kramer AF and Hillman CH (2019) Musical Instrument Practice Predicts White Matter Microstructure and Cognitive Abilities in Childhood. Front. Psychol. 10:1198. doi: 10.3389/fpsyg.2019.01198

Malheiro R, Panda R, Gomes P and Paiva R P 2016 Bi-modal music emotion recognition: novel lyrical features and dataset 9th Int. Work. on Music and Machine Learning – MML’2016 – in conjunction with the European Conf. on Machine Learning and Principles and Practice of Knowledge Discovery in Databases – ECML/PKDD 2016 Riva del Garda Italy

McFee B, Raffel C, Liang D, Ellis D P W, McVicar M, Battenberg E and Nieto O 2015 Librosa: v0.4.0. Zenodo 2015 doi:10.5281/zenodo.18369

Mehr, A., Singh, M., Knox, D., Ketter, D. M., Pickens-Jones, D., Atwood, S., et al. (2019). Universality and diversity in human song. Science 366:eaax0868. doi: 10.1126/science.aax0868

North, A. C., and Hargreaves, D. J. (2008). The Social and Applied Psychology of Music. New York, NY: Oxford University Press. doi: 10.1093/acprof:oso/9780198567424.001.0001

Nuzzolo, M. (2015). Music mood classification. Retrieved 2020, from The Electrical and Computer Engineering Design Handbook. https://sites.tufts.edu/eeseniordesignhandbook/2015/music-mood-classification/

Patra B G, Das D and Bandyopadhyay J 2013 Automatic music mood classification of Hindi songs 3rd Work. on Sentiment Analysis where AI meets Psychology (SAAIP 2013)

Putkinen V, Huotilainen M and Tervaniemi M (2019) Neural Encoding of Pitch Direction Is Enhanced in Musically Trained Children and Is Related to Reading Skills. Front. Psychol. 10:1475. doi: 10.3389/fpsyg.2019.01475

Saarikallio SH, Randall WM and Baltazar M (2020) Music Listening for Supporting Adolescents’ Sense of Agency in Daily Life. Front. Psychol. 10:2911. doi: 10.3389/fpsyg.2019.02911

Tzanetakis G. and Cook P. (2002). Musical genre classification of audio signals IEEE Trans. on Speech and Audio Process. 10(5) pp 293-302 doi:10.1109/tsa.2002.800560

Warriner A B, Kuperman V and Brysbaert M 2013 Norms of valence, arousal, and dominance for 13,915 English lemmas Behavior Res. Methods 45 pp 1191-207

Yang X, Dong Y and Li J 2017 Review of data features-based music emotion recognition methods Multimedia Systems pp 1-25 doi:10.1007/s00530-017-0559-4

Yu L, Wang J, Lai K R and Zhang X 2015 Predicting valence-arousal ratings of words using a weighted graph method Proc. of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th Int. Joint Conf. on Natural Language Processing pp 788-93 retrieved from http://www.aclweb.org/anthology/p15-2129