The ICML Expressive Vocalizations Workshop & Competition 2022 

Recognize, Generate, and Personalize Vocal Bursts

Alice Baird, Panagiotis Tzirakis, Kory W. Mathewson, Gauthier Gidel, Eilif B. Muller,
Björn Schuller, Erik Cambria, Dacher Keltner, Alan Cowen

Cosponsored by Hume AI, Mila, and the National Film Board of Canada, the ICML Expressive Vocalizations (ExVo) Workshop & Competition is organized by leading researchers in emotion science and machine learning and focuses on understanding and generating vocal bursts – expressive non-verbal vocalizations.

Participants of ExVo will be presented with three tasks that utilize a single dataset. The dataset and tasks draw attention to recent innovations in emotion science and capture 10 dimensions of emotion perceived in distinct vocal bursts: Awe, Excitement, Amusement, Awkwardness, Fear, Horror, Distress, Triumph, Sadness, and Surprise.

These tasks highlight the need for advanced machine learning techniques for the recognition, generation, and personalization of non-verbal communication. Because studies of vocal emotional expression have often relied on datasets too small to apply the latest machine learning innovations, the ExVo competition and workshop will provide an unprecedented platform for developing novel strategies for understanding vocal bursts and will enable unique forms of collaboration among leading researchers from diverse disciplines.

Links: arXiv | Proceedings | White Paper | GitHub

Workshop Schedule

Saturday 23rd July, Room 301-303, the Baltimore Convention Center

Time | Type | Mode | Session
09:15 - 09:25 | Paper | In-Person | The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts
09:30 - 09:40 | Paper | Virtual | Exploring speaker enrolment for few-shot personalisation in emotional vocalisation prediction
09:45 - 09:55 | Paper | Virtual | Redundancy Reduction Twins Network: A Training framework for Multi-output Emotion Regression
10:00 - 10:30 | Break | -- | Coffee Break
10:30 - 11:15 | Keynote | Virtual | Dr. Yutian Chen, "Using WaveNet to reunite speech-impaired users with their original voices"
11:30 - 11:40 | Paper | Virtual | Synthesizing Personalized Non-speech Vocalization from Discrete Speech Representations
11:45 - 11:55 | Paper | In-Person | Generating Diverse Vocal Bursts with StyleGAN2 and MEL-Spectrograms
12:00 - 13:30 | Break | -- | Lunch Break
13:30 - 14:15 | Keynote | In-Person | Dr. Alan Cowen, "Fundamental advances in understanding nonverbal behavior"
14:15 - 14:25 | Paper | In-Person | Dynamic Restrained Uncertainty Weighting Loss for Multitask Learning of Vocal Expression
14:30 - 14:40 | Paper | In-Person | Multitask vocal burst modeling with ResNets and pre-trained paralinguistic Conformers
14:45 - 14:55 | Paper | Virtual | Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations
15:00 - 15:30 | Break | -- | Coffee Break
15:30 - 15:50 | Keynote | Virtual | Dr. Erik Cambria, "Neurosymbolic AI for Sentiment Analysis"
15:50 - 16:00 | Paper | In-Person | Self-supervision and Learnable STRFs for Age, Emotion and Country Prediction
16:05 - 16:15 | Paper | Virtual | Comparing supervised and self-supervised embedding for ExVo Multi-Task learning track
16:20 - 16:30 | Paper | Virtual | Burst2Vec: An Adversarial Multi-Task Approach for Predicting Emotion, Age, and Origin from Vocal Bursts
16:35 - 17:00 | Organisers | In-Person | Winner Announcements and Closing Remarks

Keynote Speakers

Dr. Yutian Chen. DeepMind, London, UK. “Using WaveNet to reunite speech-impaired users with their original voices”.

Dr Yutian Chen is a staff research scientist at DeepMind. He obtained his PhD in machine learning at the University of California, Irvine, and later worked at the University of Cambridge as a research associate (postdoc) before joining DeepMind. Yutian took part in the AlphaGo and AlphaGo Zero projects, developing Go-playing AI programs that defeated the world champions. The AlphaGo project was ranked among New Scientist magazine's top 10 discoveries of the 2010s. Yutian has conducted research in multiple machine learning areas including Bayesian methods, deep learning, reinforcement learning, generative models, and meta-learning, with applications in gaming AI, computer vision, and text-to-speech. Yutian also serves as a reviewer and area chair for multiple academic conferences and journals.

Dr. Erik Cambria. SenticNet, Singapore. “Neurosymbolic AI for Sentiment Analysis”.

Dr. Erik Cambria is the Founder of SenticNet, a Singapore-based company offering B2B sentiment analysis services, and an Associate Professor at NTU, where he also holds the appointment of Provost Chair in Computer Science and Engineering. Prior to joining NTU, he worked at Microsoft Research Asia (Beijing) and HP Labs India (Bangalore) and earned his PhD through a joint programme between the University of Stirling and MIT Media Lab. His research focuses on neurosymbolic AI for explainable natural language processing in domains like sentiment analysis, dialogue systems, and financial forecasting. He is the recipient of several awards, e.g., the IEEE Outstanding Career Award, was listed among the AI's 10 to Watch, and was featured in Forbes as one of the 5 People Building Our AI Future. He is an IEEE Fellow, Associate Editor of many top-tier AI journals, e.g., INFFUS and IEEE TAFFC, and is involved in various international conferences as program chair and SPC member.

Dr. Alan Cowen. Hume AI, New York, USA. “Fundamental advances in understanding nonverbal behavior”.

Dr. Alan Cowen is an applied mathematician and computational emotion scientist developing new data-driven methods to study human experience and expression. He was previously a researcher at the University of California and visiting scientist at Google, where he helped establish affective computing research efforts. His discoveries have been featured in leading journals such as Nature, PNAS, Science Advances, and Nature Human Behavior and covered in press outlets ranging from CNN to Scientific American. His research applies new computational tools to address how emotional behaviors can be evoked, conceptualized, predicted, and annotated, how they influence our social interactions, and how they bring meaning to our everyday lives.

Competition Tasks and Rules

The Multi-task High-Dimensional Emotion, Age & Country Task (ExVo Multi-Task).

In the ExVo Multi-Task track, participants are challenged to predict, in a multi-task setting, the average intensity of each of the 10 emotions perceived in vocal bursts, as well as the speaker's age and native country. Emotion and age are treated as regression tasks, and native country as a classification task. Participants will report the Concordance Correlation Coefficient (CCC) for the emotion regression task, Mean Absolute Error (MAE) for age (in years), and Unweighted Average Recall (UAR) for the country classification task. The baseline for this challenge is based on a combined score (S_MTL), computed as the harmonic mean of CCC, (inverted) MAE, and UAR.
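
To make the ranking metric concrete, here is a minimal Python sketch (assuming NumPy) of the three per-task metrics and the combined harmonic-mean score. The exact definition of the "inverted" MAE term is given in the white paper, so the reciprocal 1/MAE used below is only an illustrative assumption.

import numpy as np

def ccc(y_true, y_pred):
    # Concordance Correlation Coefficient for one emotion dimension
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = np.mean((y_true - mean_t) * (y_pred - mean_p))
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

def uar(y_true, y_pred):
    # Unweighted Average Recall: mean per-class recall over country labels
    classes = np.unique(y_true)
    return np.mean([np.mean(y_pred[y_true == c] == c) for c in classes])

def s_mtl(mean_ccc, mae_age, uar_country):
    # Harmonic mean of CCC, inverted MAE (assumed here to be 1/MAE), and UAR
    scores = np.array([mean_ccc, 1.0 / mae_age, uar_country])
    return len(scores) / np.sum(1.0 / scores)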


The Generative Emotional Vocal Burst Task (ExVo Generate).

In the ExVo Generate task, participants will be tasked with applying generative modelling approaches to produce vocal bursts that are associated with 10 distinct emotions. Each team will submit 100 machine-generated vocalizations that differentially convey either a selection of or all 10 emotions ("awe," "fear," etc.) with maximal intensity and fidelity. The ExVo organization team will provide a method for scoring the generated samples with a Fréchet Inception Distance (FID), using a baseline predictive model based on the training set. The final evaluation will incorporate human ratings, gathered by Hume AI, of a random subset of 5/100 samples per targeted emotion. These ratings of the generated vocal bursts will be gathered using the same methodology and from the same participant population used to collect the training data, with each vocal burst judged in terms of the perceived intensity of each target emotion. Generated samples will be evaluated based on the correlation between normed (0-1) average intensity ratings for the 10 classes and the identity matrix consisting of dummy variables for each class. The baseline for the challenge is based on a combined score from the provided FID of a given class and the human evaluation.
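
The sketch below illustrates the standard Fréchet distance between two sets of embeddings, plus one possible reading of the human-evaluation correlation described above. The organisers provide the actual FID scoring method and baseline embedding model, so the feature arrays, the (10, 10) rating-matrix layout, and the function names here are assumptions for illustration only.

import numpy as np
from scipy.linalg import sqrtm

def fid(real_feats, fake_feats):
    # Frechet distance between Gaussians fitted to two (n_samples, dim) embedding sets
    mu_r, mu_f = real_feats.mean(axis=0), fake_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_f = np.cov(fake_feats, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # drop tiny imaginary parts from numerical noise
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))

def human_eval_score(rating_matrix):
    # rating_matrix: (10, 10) normed average intensities, with rows = targeted emotion
    # and columns = rated emotion; correlated against the identity matrix
    target = np.eye(rating_matrix.shape[0])
    return float(np.corrcoef(rating_matrix.ravel(), target.ravel())[0, 1])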

Additionally, winning teams from the ExVo Generate task have the opportunity to demo their approach at the ICML Machine Learning for Audio Synthesis Workshop.


The Few-Shot Emotion Recognition task (ExVo Few-Shot).

In the ExVo Few-Shot task, participants will predict the same 10 emotions as a multi-output regression task, using one or more models. Participants will be provided with at least two samples per speaker in all splits (train, validation, test) and will be tasked with performing two-shot personalized emotion recognition. The speaker IDs and corresponding emotion labels for two samples per speaker in the test set will be withheld until one week before the deadline, for the final evaluation of ExVo Few-Shot models on the test data. Participants will report the Concordance Correlation Coefficient (CCC) across all 10 emotion outputs as the evaluation metric.
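
As an illustration of the task format only (this is not the official baseline), one simple way to use the two enrolment samples available for each test speaker is a per-speaker feature offset applied before regression; the variable names and feature layout below are assumptions.

import numpy as np

def personalise(features, speaker_ids, enrolment):
    # features:    (n_samples, dim) acoustic features for the split being predicted
    # speaker_ids: (n_samples,) speaker ID for each sample
    # enrolment:   dict mapping speaker ID -> (2, dim) features of that speaker's enrolment samples
    adapted = features.copy()
    for spk, enrol_feats in enrolment.items():
        mask = speaker_ids == spk
        adapted[mask] -= enrol_feats.mean(axis=0)  # subtract the speaker-specific offset
    return adapted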

Baselines and Team Ranking

The preliminary ExVo white paper details the baseline results. Only results that surpassed the organisers' baseline are listed. Each result is subject to removal until the peer-review process is complete; after this time, affiliations will be included.

ExVo MultiTask
Team  Test S_MTL
Organisers Baseline  0.335
NLPros  0.435
CMU_MLSP  0.412
0xAC  0.407
EIHW-MM*  0.394
IdiapTeam  0.379
TeamAtmaja  0.378

ExVo Generate (best across all emotions; see the white paper for single-emotion results). For ExVo Generate, we accept samples for any of the 10 classes and will announce winners both for best across all emotions and for best on a single emotion.
Team  Test S_GEN
Organisers Baseline  0.094
StyleMelMila*  0.4079
Resemble  0.1190

ExVo FewShot
Team  Test CCC
Organisers Baseline  0.444
SaruLab-UTokyo  0.739
EIHW-MM*  0.650

* A member of the team is an ExVo organiser, therefore this result is excluded from the official rankings.

Data and Team Registration


Figure 1. t-SNE representation of the emotional space of the Hume-VB dataset (training set only).

This package includes the raw data for a subset of The Hume Vocal Burst Database (H-VB), including all train, validation, and test recordings and corresponding emotion ratings for the train and validation recordings.

This dataset contains 59,201 audio recordings of vocal bursts from 1,702 speakers, from four cultures (the U.S., South Africa, China, and Venezuela), ranging in age from 20 to 39.5 years. This version of H-VB contains 36 hours of audio in total (mean sample duration: 2.23 seconds). The emotion ratings correspond to the ten emotion concepts listed below, given as intensities averaged across raters on a 0-100 scale, with each sample having been rated by an average of 85.2 raters.

Emotion Labels: Awe, Excitement, Amusement, Awkwardness, Fear, Horror, Distress, Triumph, Sadness, Surprise


Train Validation Test
HH:MM:SS 12:19:06 12:05:45 12:22:12
Samples 19,990 19,396 19,815
Speakers 571 568 563
F:M 305:266 324:244 --
USA 206 206 --
China 79 76 --
South Africa 244 244 --
Venezuela 42 42 --

An overview of the data can be found at this Zenodo repository. To gain access, register your team by emailing competitions@hume.ai with the following information:

Team Name, Researcher Name, Affiliation, and Research Goals

Restricted Access: After registering your team, you will receive an End User License Agreement (EULA) for signature. Please note that this dataset is provided only for competition use. Requests for use of the data beyond the competition should be directed to Hume AI (hello@hume.ai).

Important Dates (AoE)

  • Challenge Opening (data available): April 1, 2022 ✓

  • Baselines and paper released: April 8, 2022 ✓

  • ExVo MultiTask track submission deadline: May 26, 2022

  • ExVo Few-Shot (test labels): May 27, 2022

  • ExVo Few-Shot and ExVo Generate track submission deadline: June 1, 2022

  • Workshop paper submission: June 6, 2022 (extended from June 3, 2022)

  • Notification of Acceptance/Rejection: June 13, 2022 ✓

  • Workshop Video Upload Deadline: July 1, 2022 ✓

  • Workshop: Saturday, July 23, 2022 (301-303 T450)

Results Submission

For each task, participants should submit their test set results as a zip file to competitions@hume.ai, following these guidelines:

  • ExVo Multi-Task:

    • Predictions should be submitted as a comma-delimited CSV with the following naming convention: multitask_[team name]_[submission no].csv

    • The CSV should contain only one prediction per test set file (an illustrative layout is sketched after this list).

  • ExVo Generate:

    • Up to 1000 machine-generated audio samples should be provided by each team, with 100 samples corresponding to each targeted class. The audio samples should be provided within a folder with the following naming convention: generate_[team name]_[submission no]

    • The audio files should each be provided in either .mp3 or .wav format using the following naming convention: [team name]_[submission no]_[emotion]_[sample no (0-99)].[mp3 | wav]

    • Note that after initial evaluation, the top 5 teams will be asked to either (1) deliver their trained generative models to the workshop organizers, who will reproduce the results by generating new test samples for each targeted class, or (2) deliver 1000 samples per targeted class. New ratings will be collected for 10 randomly selected samples per targeted class, and final standings will be determined by the results of this final evaluation.

  • ExVo Few-Shot:

    • Predictions should be submitted as a comma-delimited CSV with the following naming convention: fewshot_[team name]_[submission no].csv

    • The CSV should contain only one prediction per test set file.
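
For illustration, the snippet below writes a Multi-Task prediction file with one row per test-set recording. The File_ID, emotion, Age, and Country column names (and the dummy values) are assumptions, so the authoritative column layout should be taken from the baseline GitHub repository; a Few-Shot submission would follow the same pattern with the fewshot_ prefix and emotion columns only.

import numpy as np
import pandas as pd

emotions = ["Awe", "Excitement", "Amusement", "Awkwardness", "Fear",
            "Horror", "Distress", "Triumph", "Sadness", "Surprise"]

rng = np.random.default_rng(0)
file_ids = ["test_00001.wav", "test_00002.wav"]   # hypothetical test-set file IDs
table = pd.DataFrame(rng.uniform(0, 1, size=(len(file_ids), len(emotions))),
                     columns=emotions)
table.insert(0, "File_ID", file_ids)
table["Age"] = [24.3, 31.7]                        # Multi-Task only
table["Country"] = ["United States", "China"]      # Multi-Task only

# one prediction (row) per test-set file, named per the convention above
table.to_csv("multitask_myteam_1.csv", index=False)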

Paper Submission

As well as the test set results, all participants in the ExVo Competition should submit a paper describing their approach and results. Paper contributions within the scope of the workshop are also welcome from authors who do not wish to participate in the competition. Manuscripts should be up to 4 pages excluding references and follow the guidelines set by ICML. The review process will be double-blind. To create a double-blind manuscript, uncomment \usepackage{icml2022} in the LaTeX template.

Submission will be via CMT, with track options for ‘Competition’ and ‘Other Topics’. Anyone participating in the competition should ensure they submit their manuscript to the ‘Competition’ track. Please let us know if you have questions about this procedure.

The baseline white paper provides a more extensive description of the data as well as baseline results. Competition papers should include the following citation for the data repository:

@article{Cowen2022HumeVB,
     title={The Hume Vocal Burst Competition Dataset (H-VB) | Raw Data [ExVo: updated 02.28.22] [Data set]},
     author={Cowen, Alan and Baird, Alice and Tzirakis, Panagiotis and Opara, Michael and
             Kim, Lauren and Brooks, Jeff and Metrick, Jacob},
     journal={Zenodo},
     doi = {10.5281/zenodo.6308780},
     year={2022}}

@misc{BairdExVo2022,
    author = {Baird, Alice and Tzirakis, Panagiotis and Gidel, Gauthier and 
    Jiralerspong, Marco and Muller, Eilif B. and Mathewson, Kory and Schuller, Björn and Cambria, Erik and Keltner, Dacher and Cowen, Alan},
    title = {The ICML 2022 Expressive Vocalizations Workshop and Competition: Recognizing, Generating, and Personalizing Vocal Bursts},
    publisher = {arXiv},
    doi = {10.48550/ARXIV.2205.01780},
    year = {2022}}

Other Topics

For those interested in submitting research to the ExVo workshop outside of the competition, we encourage contributions covering the following topics:

  • Detecting and Understanding Nonverbal Vocalizations

  • Multi-Task Learning in Affective Computing

  • Generating Nonverbal Vocalizations or Speech Prosody

  • Personalized Machine Learning for Affective Computing

  • Other topics related to Affective Verbal and Nonverbal Vocalization

Submissions will be via CMT; please ensure the submission is made under the ‘Other Topics’ track.

Technical Program Committee

In addition to the organizing committee, submissions to ExVo 2022 will be reviewed by multidisciplinary researchers in the fields of emotion science, auditory and affective machine learning, and generative models.

Lukas Stappen, Recoro, Germany
Gil Keren, Meta, USA
Jeffrey Brooks, Hume AI, USA
Manuel Milling, Augsburg University, Germany
Maximilian Schmitt, audEERING, Germany
Emilia Parada-Cabaleiro, JKU Linz, Austria
Esther Rituerto-González, University Carlos III of Madrid, Spain
Zhao Ren, L3S Research Center, University of Hannover, Germany
Christopher Gagne, Hume AI, USA, and Max Planck Institute for Biological Cybernetics, Germany
Georgios Rizos, Imperial College London, UK

Organizers

Alice Baird. Hume AI, New York, USA. Alice Baird is an audio researcher with interdisciplinary expertise in machine learning, computational paralinguistics, stress, and emotional well-being. She completed her PhD at the University of Augsburg’s Chair of Embedded Intelligence for Health Care and Wellbeing in 2021, where she was supervised by Dr Björn Schuller. Her work on emotion understanding from speech, physiological, and multimodal data has been published extensively in leading journals and conferences including INTERSPEECH, ICASSP, IEEE Intelligent Systems, and the IEEE Journal of Biomedical and Health Informatics (i10-index: 29). Alice has extensive experience with competition organization, having served as data chair for both the INTERSPEECH Computational Paralinguistics Challenge (ComParE) and the ACM MM Multimodal Sentiment Analysis Challenge (MuSe). She recently joined Hume AI as an AI research scientist.

Panagiotis Tzirakis. Hume AI, New York, USA. Dr Tzirakis is a computer scientist and AI researcher with expertise in deep learning and emotion recognition across modalities. He earned his Ph.D. with the Intelligent Behaviour Understanding Group (iBUG) at Imperial College London, where he advanced multimodal emotion recognition efforts. He has published in top outlets including Information Fusion, International Journal of Computer Vision, and several IEEE conference proceedings (e.g. ICASSP, INTERSPEECH) on topics including 3D facial motion synthesis, multi-channel speech enhancement, the detection of Gibbon calls, and emotion recognition from audio and video (i10-index: 16). He recently joined Hume AI as an AI research scientist. 

Kory W. Mathewson. DeepMind, Montreal QC. Kory Mathewson is a Research Scientist with DeepMind and a Lab Scientist with the Creative Destruction Lab. Kory holds a Ph.D. in Computing Science from the University of Alberta with the Alberta Machine Intelligence Institute. His research studies the interaction between humans and machines, with a focus on conversational dialog systems and human-robot interfaces. Kory’s work has been featured in Time magazine, the Wall Street Journal, and the New York Times. His recent work explores new algorithms for laughter generation. Kory is also an accomplished performance artist with more than 15 years of experience in improvisation with Rapid Fire Theatre and Improbotics. He fuses his interests by developing artificial intelligences to perform alongside him. Kory is a co-organizer of the WordPlay: When Language Meets Games workshop at NeurIPS and previously organized the Future of Interactive Machine Learning workshop, also at NeurIPS. For more, see: https://korymathewson.com/ and Google Scholar.

Gauthier Gidel. Mila, Université de Montréal, Montreal QC. Gauthier Gidel is an assistant professor at the University of Montreal in the department of computer science and operational research and a core member of Mila. Gauthier did his Ph.D. at the University of Montreal, during which he was awarded a Borealis AI fellowship and two DIRO excellence grants. He is currently a recipient of a Canada CIFAR AI Chair. His research revolves around optimization, game theory, and machine learning. He has published several top-tier conference papers on Generative Adversarial Networks (GANs) and has recently worked specifically on laughter generation using GANs as part of an artistic collaboration with the National Film Board of Canada.

Eilif B. Muller. Mila, Université de Montréal, CHU Ste-Justine Research Center, Montréal QC. Dr. Muller is a Canada CIFAR AI Chair at Mila, an IVADO Assistant Research Professor in the Department of Neuroscience at the Université de Montréal, a Fonds de Recherche du Québec en Santé (FRQS) Research Scholar, and principal investigator of the Architectures of Biological Learning Lab (ABL-Lab) at the CHU Sainte-Justine Research Center. His research at the intersection of neuroscience and AI employs simulation and modeling to understand learning function and dysfunction in the mammalian brain, and has resulted in publications in top journals such as Cell, Nature Neuroscience, Nature Communications, and Cerebral Cortex (h-index: 25; i10-index: 38), and diverse media coverage including a feature-length documentary film, In Silico. He was previously deputy director of simulation neuroscience at the Blue Brain Project, EPFL, Switzerland, and an Applied Research Scientist at Element AI, Montreal QC, and is part of an artistic collaboration with the National Film Board of Canada on laughter generation with GANs.

Björn Schuller. Imperial College London, United Kingdom. Björn W. Schuller received his diploma, doctoral degree, habilitation, and Adjunct Teaching Professorship in Machine Intelligence and Signal Processing, all in EE/IT, from TUM in Munich, Germany. He is Full Professor of Artificial Intelligence and Head of GLAM – the Group on Language, Audio, & Music – at Imperial College London, UK, Full Professor and Chair of Embedded Intelligence for Health Care and Wellbeing at the University of Augsburg, Germany, and co-founding CEO and current CSO of audEERING – an audio intelligence company based near Munich and in Berlin, Germany – amongst other professorships and affiliations. He (co-)authored 1000+ publications (43k+ citations, i10-index: 563), is Field Chief Editor of Frontiers in Digital Health, and was Editor-in-Chief of the IEEE Transactions on Affective Computing, amongst manifold further commitments and service to the community, including serving as Technical Chair of INTERSPEECH 2019 and organizing more than 25 research challenges.

Erik Cambria. Nanyang Technological University Singapore. Erik Cambria is the Founder of SenticNet, a Singapore-based company offering B2B sentiment analysis services, and an Associate Professor at NTU, where he also holds the appointment of Provost Chair in Computer Science and Engineering. Prior to joining NTU, he worked at Microsoft Research Asia (Beijing) and HP Labs India (Bangalore) and earned his PhD through a joint program between the University of Stirling and MIT Media Lab. His research focuses on neurosymbolic AI for explainable natural language processing in domains like sentiment analysis, dialogue systems, and financial forecasting. He is the recipient of several awards, e.g., IEEE Outstanding Career Award, was listed among the AI's 10 to Watch, and was featured in Forbes as one of the 5 People Building Our AI Future. He is an IEEE Fellow, Associate Editor of many top-tier AI journals, e.g., INFFUS and IEEE TAFFC, and is involved in various international conferences as program chair and SPC member.

Dacher Keltner. The University of California, Berkeley, California, U.S.A. Dr. Keltner is one of the world’s foremost emotion scientists. He is a professor of psychology at UC Berkeley and the director of the Greater Good Science Center. He has over 200 scientific publications (i10-index: 222) and six books, including Born to Be Good, The Compassionate Instinct, and The Power Paradox. He has written for many popular outlets, from The New York Times to Slate. He was also the scientific advisor behind Pixar’s Inside Out, is involved with the education of health care providers and judges, and has consulted extensively for Google, Facebook, Apple, and Pinterest, on issues related to emotion and well-being.

Alan Cowen. Hume AI, New York, U.S.A. Dr. Cowen is an applied mathematician and computational emotion scientist developing new data-driven methods to study human experience and expression. He was previously a researcher at the University of California and visiting scientist at Google, where he helped establish affective computing research efforts. His discoveries have been featured in top journals such as Nature, PNAS, Science Advances, and Nature Human Behavior (i10-index: 16) and covered in press outlets ranging from CNN to Scientific American. His research applies new computational tools to address how emotional behaviors can be evoked, conceptualized, predicted, and annotated, how they influence our social interactions, and how they bring meaning to our everyday lives.