November 21, 2024 TOPICS

[Ambitious Graduate Students] Automatically recognizing the emotions in your voice: Using information processing technology to analyze the emotions conveyed by the human voice

Ryotaro Nagase (3rd-year doctoral student, Graduate School of Information Science and Engineering)

Even the same word can have a completely different meaning depending on how it is spoken. This is a phenomenon that all of us have surely experienced. The same word can give different impressions in different settings due to the emotion in a speaker’s voice.
Speech contains both linguistic and acoustic information. Speech emotion recognition is a technology that combines these two kinds of information to automatically estimate the emotion conveyed by a person’s voice. Ryotaro Nagase, a third-year doctoral student in the Graduate School of Information Science and Engineering, has been working on research that recognizes emotions in speech by actively using information such as sentence meaning and phoneme differences, and he has published a series of findings that go well beyond previous studies.

How can you quantify speech that conveys sadness?

The process of ascertaining a speaker’s emotions from their voice is called speech emotion recognition, and it is something we all do unconsciously in everyday conversation. For example, if someone says “You idiot” in a strong, decisive tone, the emotion conveyed is completely different from when the same words are uttered in a soft, affectionate tone. Conversely, if we pay attention to both the meaning of words and the way they are spoken, we can more accurately understand the emotions of others. Nagase recalls that he has been interested in how people use speech to express emotions ever since he began taking courses in the teacher training program.

“When I took classes on school counseling and educational psychology, my teachers stressed the importance of listening, especially in counseling. You must try to imagine how the other person is feeling right now and try to make them understand that you are receptive to what they are saying. This is when I realized that the way someone speaks makes a difference in the emotions that you feel.”

For Nagase, who is a member of a choir, choral instruction also affords him an opportunity to consider emotional expression.

“One time, my instructor told me to imagine I was on a station platform on my way home from working overtime. He asked me to sing like I was a person on their way home from work who felt a hint of despair at the fact that I had to go back to work tomorrow. I could more or less understand the nuance of what he was trying to say, but I was frustrated that he could not give me more precise instructions. I started to think that the instructions would be easier to understand if they could be quantified using some kind of indicator. As I was pursuing this line of thought, I began to wonder if I could somehow quantify the emotions expressed in the voice.”

Since the voice is sound, and sound travels as a wave, it can be quantified using frequency analysis. The next step for Nagase, then, was to use deep learning and other methods to analyze the combination of what is spoken and how it is spoken. After taking a class in the second year of his undergraduate program taught by his current advisor, Professor Yamashita of the Spoken Language Laboratory, Nagase discovered the path he should pursue.

Analyzing voice frequencies and matching them with linguistic information

Human speech can be broken down into acoustic information such as pitch, resonance, and voice quality, while the spoken content itself is transmitted to the listener as linguistic information. In speech emotion recognition, acoustic and linguistic (i.e., semantic) information are combined to let the emotion recognition device (a computer) estimate the emotional state of the speaker. You understand what someone is saying by combining the meaning expressed in words with the sounds transmitted by speech, a process we all perform unconsciously to facilitate communication in everyday dialogue. If this process can be analyzed by computer in some form, the results could be applied to robots that interact with humans. Nagase explains:

“Speech contains three types of information: linguistic, paralinguistic, and nonlinguistic. Linguistic information, as the name implies, indicates what was spoken; paralinguistic information indicates nuances in how it was spoken; and nonlinguistic information refers to attributes like gender, age, and the like. A good deal of this information can be analyzed in terms of acoustic characteristics. Typical features include spectrograms obtained by analyzing the frequencies of speech with the Fourier transform. This combination of acoustic and linguistic information is then recognized by the emotion recognition device.”
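As a concrete illustration of the acoustic side Nagase describes, the short sketch below computes a log-magnitude spectrogram with the short-time Fourier transform, the kind of frequency analysis mentioned in the quote. It is a minimal sketch only: the file name, sampling rate, and frame settings are illustrative assumptions, not details from his experiments.

```python
# Minimal sketch: a log-magnitude spectrogram as an acoustic feature.
# "utterance.wav", the 16 kHz sampling rate, and the frame settings are
# illustrative assumptions, not details from Nagase's experiments.
import numpy as np
import librosa

# Load a speech waveform (mono) at 16 kHz.
waveform, sr = librosa.load("utterance.wav", sr=16000)

# Short-time Fourier transform: frequency content of each ~25 ms frame,
# hopping every 10 ms.
stft = librosa.stft(waveform, n_fft=400, hop_length=160)

# Log-magnitude spectrogram, a typical input feature for emotion recognition models.
log_spectrogram = np.log(np.abs(stft) + 1e-8)

print(log_spectrogram.shape)  # (frequency bins, time frames)
```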

Nagase began this research in earnest in 2021, and the first topic he tackled was speech emotion recognition using both acoustic and linguistic information. So, how can you make the emotion recognition device accurately understand speech that is easily misrecognized based on acoustic information alone? For example, when emotions are classified into the four categories of joy, sadness, anger, and calmness, acoustic information alone can easily lead to the misrecognition of joyful and angry emotions.

“The reason is that, when you focus only on acoustics, joy and anger both show similar changes in the intensity of the voice and other aspects of speech. Nevertheless, you can increase the likelihood of correct recognition if you let the device also judge what is being spoken in each utterance, that is, if you combine the acoustic information with the linguistic information. So, I combined the acoustic and linguistic information obtained from speech and then performed machine learning on it.”
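The sketch below is a generic illustration, not Nagase’s actual model, of what combining acoustic and linguistic information for machine learning can look like: a recurrent encoder summarizes the acoustic frames, a sentence-level text embedding carries the linguistic information, and the two are concatenated before a classifier over the four emotion categories. All layer choices, dimensions, and names are assumptions made for the example.

```python
# Generic late-fusion sketch (an assumption, not Nagase's published architecture):
# acoustic frames + a text embedding -> one of four emotion classes.
import torch
import torch.nn as nn

class FusionEmotionClassifier(nn.Module):
    def __init__(self, acoustic_dim=80, linguistic_dim=768, hidden=256, n_emotions=4):
        super().__init__()
        # Summarize the sequence of acoustic frames (e.g., spectrogram frames).
        self.acoustic_encoder = nn.GRU(acoustic_dim, hidden, batch_first=True)
        # Project a sentence-level text embedding (e.g., from a pretrained language model).
        self.linguistic_proj = nn.Linear(linguistic_dim, hidden)
        # Joint classifier over {joy, sadness, anger, calmness}.
        self.classifier = nn.Sequential(
            nn.Linear(hidden * 2, hidden), nn.ReLU(), nn.Linear(hidden, n_emotions)
        )

    def forward(self, acoustic_frames, text_embedding):
        _, h = self.acoustic_encoder(acoustic_frames)  # h: (1, batch, hidden)
        fused = torch.cat([h[-1], self.linguistic_proj(text_embedding)], dim=-1)
        return self.classifier(fused)                  # logits over the four emotions

# Dummy batch: 2 utterances of 120 acoustic frames (80 features) and 768-dim text embeddings.
logits = FusionEmotionClassifier()(torch.randn(2, 120, 80), torch.randn(2, 768))
print(logits.shape)  # torch.Size([2, 4])
```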

Just as a human listener would first make sure they had heard an utterance correctly before judging it, the speech that the emotion recognition device is asked to estimate must be free of errors. To achieve this, Nagase compared methods for the integrated processing of acoustic and linguistic information, which had not yet been exhaustively verified, under the same conditions to identify the best combination. This was an original approach, and Nagase’s findings were accepted for publication in Acoustical Science and Technology, the journal of the Acoustical Society of Japan.

A demonstration of speech emotion recognition

Reconsidering issues from wholly new perspectives

The next topic that Nagase addressed was speech emotion recognition by estimating emotion label sequences, that is, the recognition of detailed expressions of emotion that change subtly over the course of an utterance. For example, take the utterance, “I’ll let it slide this time, but I won’t let you off the hook next time.” In conventional speech emotion recognition, generally only one emotion label can be attached to one utterance. Therefore, if the preceding utterance were classified into one of the four categories of joy, sadness, anger, or calmness, it would be tagged as anger.

“However, if we look closely at an utterance, the length and type of emotion can be expected to undergo subtle changes during the course of speech. In the example utterance, if you divide it into ‘I'll let it slide this time, but’ and ‘I won't let you off the hook next time,’ you can see that the emotion level has changed slightly. So, how can you get the emotion recognition device to recognize these kinds of subtle changes? To answer this question, I decided to focus on the acoustic differences contained in an utterance.”

First off, the human voice is divided into voiced and unvoiced sounds according to the vibration of the vocal cords when pronouncing sounds. The Japanese vowels a, i, u, e, and o are basically voiced sounds that entail vocal cord vibrations. Consonants, on the other hand, are divided into voiced and unvoiced sounds. However, previous studies on speech emotion recognition focused only on voiced sounds, and even then, they did not consider the difference between vowels and consonants.

"Even though they are the both voiced sounds, I thought that the acoustic patterns of vowels and voiced consonants must be different. By focusing on unvoiced consonants, I found that a single utterance can be analyzed more precisely by increasing the number delimitations between vowels, voiced consonants, and unvoiced consonants. I was able to redefine the emotion label sequences based the classification of these attributes. For example, when someone gets angry at you in Japanese, it is clearly more frightening when they say ‘gora!’ instead of ‘kora!’ I think the fundamental reason for this difference in impression is that k is an unvoiced consonant while g is voiced. By taking these differences into account when considering the emotion labels, I was able to improve the recognition rate.”

Nagase’s research paper was accepted for presentation at Interspeech, the world’s largest conference in the field of spoken language processing, and he also won the Outstanding Research Award from the Graduate School of Information Science and Engineering at Ritsumeikan University and the IEEE SPS Tokyo Joint Chapter Student Award. The next theme that Nagase tackled was speech emotion recognition that transcribes descriptions of the emotions conveyed by speech, an innovative method that literally translates a speaker’s emotions into text.

“For example, suppose someone wins a game and they say, ‘Yay, I did it!’ When this is analyzed with a conventional emotion recognition device, it is classified as joy from among the four emotion categories, and the predictive value of ‘pleasant’ is greater when emotions are recognized along the dimensions of activation↔deactivation and pleasant↔unpleasant. However, a more accurate assessment of the speaker’s emotion is that they are not merely joyful or pleasantly surprised; they are extremely excited and happy. So, how can you more accurately recognize a speaker’s inherent emotions? After giving this question some thought, I came up with a method of transcribing descriptions of a speaker’s emotions as is.”

Specifically, for the utterance, “Yay, I did it!”, the device writes the following description of the emotion: “I feel excited and satisfied that I won.” In pursuing this idea, Nagase developed a new emotion recognition device that predicts emotion captions from speech by adding descriptions of emotions to existing speech emotion data and performing deep learning on them. This method transforms the very way emotions are predicted by transcribing them into text rather than relying on conventional forms of evaluation. Nagase presented the results of this research at the Spring Meeting of the Acoustical Society of Japan in March 2024.
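The sketch below shows, in generic form, the kind of encoder-decoder setup an emotion-captioning recognizer could use: speech features go in, and a caption is generated token by token, trained with ordinary cross-entropy against the reference description. The architecture, dimensions, and vocabulary size are assumptions made for illustration, not Nagase’s presented model.

```python
# Generic encoder-decoder sketch (an assumption, not Nagase's presented model):
# acoustic frames in, emotion-caption token logits out.
import torch
import torch.nn as nn

class SpeechEmotionCaptioner(nn.Module):
    def __init__(self, acoustic_dim=80, hidden=256, vocab_size=8000):
        super().__init__()
        self.encoder = nn.GRU(acoustic_dim, hidden, batch_first=True)  # encode speech
        self.embed = nn.Embedding(vocab_size, hidden)                  # caption tokens
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)        # generate caption
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, acoustic_frames, caption_tokens):
        # Use the encoder's final hidden state to initialize the caption decoder.
        _, h = self.encoder(acoustic_frames)
        dec_out, _ = self.decoder(self.embed(caption_tokens), h)
        return self.out(dec_out)  # per-step logits over the caption vocabulary

# Dummy batch: 2 utterances (120 frames x 80 features), captions of 12 token ids.
model = SpeechEmotionCaptioner()
logits = model(torch.randn(2, 120, 80), torch.randint(0, 8000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 8000])
```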

Nagase delivers a presentation at Interspeech 2023

A future where robots can talk to people with emotion

Nagase’s research is built on novel ideas, like his focus on the differences between vowels, voiced consonants, and unvoiced consonants, something that had never been considered in conventional speech emotion recognition studies, and his method of expressing emotions in writing. So, how did he come to take these kinds of perspectives? The key appears to lie in his attitude of placing value on error analysis.

“When you're doing speech emotion recognition, you basically tend to focus on the percentage of correct answers. What I think is important in conducting this kind of research, however, is to look at the errors that make up the other side of the equation. If the percentage of correct answers is 70%, that means 30% are incorrect. This is why I think you should take the stance of trying to find something that can be improved in that 30% error rate.”

If you simply view correct answers and errors on the same level, certain perspectives go unnoticed. New insights emerge only when you step back and look at your research from a higher vantage point, where things that have been overlooked come into view. This is what led Nagase to divide sounds into more granular categories and what gave him the idea to put emotions into writing rather than just labeling them. In the future, Nagase’s research findings might be used to facilitate smooth communication between humans and machines in conversation robots and automatic response systems.

“If it becomes possible to analyze people's emotional expressions from their voice, in other words, to quantify their emotional expressions, it might also become possible for machines to quantify and recognize these emotions in real time when talking to someone. These are the kinds of scenarios that pop into my mind, so it makes the research that much more interesting. I don’t know what path I will pursue in the future, but I hope that I can continue doing research for as long as I can.”

Sometime in the future, after Nagase's research has been further developed, perhaps people will be able to understand each other better and the conflicts that arise from minor differences in communication will be a thing of the past.
