Busso-Recabarren, Carlos A.

Permanent URI for this collection: https://hdl.handle.net/10735.1/6770

Carlos Busso-Recabarren is a Professor of Electrical Engineering and Principal Investigator of the MSP (Multimodal Signal Processing) Laboratory. His research interests include:

  • Modeling and synthesis of human behavior
  • Affective state recognition
  • Multimodal interfaces
  • Sensing participant interaction
  • Digital signal processing
  • Speech and video processing


Recent Submissions

  • Item
    Lexical Dependent Emotion Detection Using Synthetic Speech Reference
    (IEEE-Inst Electrical Electronics Engineers Inc, 2019-02-08) Lotfian, Reza (ORCID 0000-0003-1891-0267); Busso, Carlos A. (ORCID 0000-0002-4075-4072)
    This paper aims to create neutral reference models from synthetic speech to contrast the emotional content of a speech signal. Modeling emotional behaviors is a challenging task due to the variability in perceiving and describing emotions. Previous studies have indicated that relative assessments are more reliable than absolute assessments, suggesting that a reference signal with known emotional content (e.g., neutral emotion) against which to compare a target sentence may produce more reliable metrics for identifying emotional segments. Ideally, we would like an emotionally neutral sentence with the same lexical content as the target sentence, with the two time-aligned. In this idealized scenario, we could identify localized emotional cues by contrasting, frame by frame, the acoustic features of the target and reference sentences. This paper explores the idea of building these reference sentences by leveraging advances in speech synthesis: it builds a synthetic speech signal that conveys the same lexical information as, and is time-aligned with, the target sentence in the database. Since a single synthetic voice is not expected to capture the full range of variability observed in neutral speech, we build multiple synthetic sentences using various voices and text-to-speech approaches. The paper analyzes whether the synthesized signals provide valid template references for describing neutral speech, using feature analysis and perceptual evaluation. Finally, we demonstrate how this framework can be used in emotion recognition, achieving improvements over classifiers trained with state-of-the-art features in detecting low versus high levels of arousal and valence. (A minimal code sketch of the frame-by-frame contrast idea appears after this list.)
  • Item
    Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks
    (Institute of Electrical and Electronics Engineers Inc., 2019-05-07) Sadoughi, Najmeh; Busso, Carlos
    Articulation, emotion, and personality play strong roles in orofacial movements. To improve the naturalness and expressiveness of virtual agents (VAs), it is important to carefully model the complex interplay between these factors. This paper proposes a conditional generative adversarial network, called conditional sequential GAN (CSG), which learns the relationship between emotion, lexical content, and lip movements in a principled manner. The model uses a set of spectral and emotional speech features extracted directly from the speech signal as conditioning inputs, generating realistic movements. A key feature of the approach is that it is a speech-driven framework that does not require transcripts. Our experiments show the superiority of this model over three state-of-the-art baselines in both objective and subjective evaluations. When the target emotion is known, we propose to create emotion-dependent models by either adapting the base model with the target emotional data (CSG-Emo-Adapted) or adding emotional conditions as inputs to the model (CSG-Emo-Aware). Objective evaluations of these models show improvements for CSG-Emo-Adapted compared with the CSG model, as the generated trajectories are closer to the original sequences. Subjective evaluations show significantly better results for this model compared with the CSG model when the target emotion is happiness. (See the conditional GAN sketch after this list.)
  • Item
    Speech-Driven Animation with Meaningful Behaviors
    (Elsevier B.V., 2019-04-05) Sadoughi, Najmeh; Busso, Carlos
    Conversational agents (CAs) play an important role in human-computer interaction (HCI). Creating believable movements for CAs is challenging, since the movements have to be meaningful and natural, reflecting the coupling between gestures and speech. Past studies have mainly relied on rule-based or data-driven approaches. Rule-based methods focus on creating meaningful behaviors that convey the underlying message, but the gestures cannot be easily synchronized with speech. Data-driven approaches, especially speech-driven models, can capture the relationship between speech and gestures, but they create behaviors that disregard the meaning of the message. This study proposes to bridge the gap between these two approaches, overcoming their limitations. The approach builds a dynamic Bayesian network (DBN) in which a discrete variable is added to condition the behaviors on an underlying constraint. The study implements and evaluates the approach with two constraints: discourse functions and prototypical behaviors. By constraining on discourse functions (e.g., questions), the model learns the characteristic behaviors associated with a given discourse class, deriving the rules from the data. By constraining on prototypical behaviors (e.g., head nods), the approach can be embedded in a rule-based system as a behavior realizer, creating trajectories that are synchronized in time with speech. The study proposes a DBN structure and a training approach that (1) model the cause-effect relationship between the constraint and the gestures, and (2) capture the differences in behaviors across constraints by enforcing sparse transitions between shared and constraint-exclusive states. Objective and subjective evaluations demonstrate the benefits of the proposed approach over an unconstrained baseline model. (A toy sketch of the constrained-state idea appears after this list.) © 2019 Elsevier B.V.
  • Item
    Expressive Speech-Driven Lip Movements with Multitask Learning
    (Institute of Electrical and Electronics Engineers Inc.) Sadoughi, Najmeh; Busso, Carlos A.
    The orofacial area conveys a range of information, including speech articulation and emotions. These two factors add constraints to facial movements, creating non-trivial integrations and interplays. To generate more expressive and naturalistic movements for conversational agents (CAs), the relationship between these factors should be carefully modeled. Data-driven models are more appropriate for this task than rule-based systems. This paper presents two speech-driven deep learning architectures that integrate speech articulation and emotional cues. The proposed approaches rely on multitask learning (MTL) strategies, where related secondary tasks are solved jointly while synthesizing orofacial movements. In particular, we evaluate emotion recognition and viseme recognition as secondary tasks. The approach creates shared representations that generate behaviors that are not only closer to the original orofacial movements but also perceived as more natural than the results from single-task learning. (A minimal multitask-learning sketch appears after this list.)
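
Illustrative code sketches

The Python sketches below are illustrative only: they are not the authors' implementations, and all feature dimensions, layer sizes, names, and weights are hypothetical placeholders.

For "Lexical Dependent Emotion Detection Using Synthetic Speech Reference", a minimal sketch of the frame-by-frame contrast idea, assuming frame-level acoustic features have already been extracted for the target sentence and for several time-aligned synthetic neutral references:

import numpy as np

def frame_deviation(target_feats, reference_feats_list):
    # target_feats: (T, D) frame-level features of the target sentence.
    # reference_feats_list: list of (T, D) arrays from synthetic neutral voices,
    # assumed to be already time-aligned with the target (hypothetical inputs).
    refs = np.stack(reference_feats_list)                      # (R, T, D)
    # Normalize each feature dimension with reference statistics so distances
    # are comparable across features with different ranges.
    mu = refs.mean(axis=(0, 1), keepdims=True)
    sigma = refs.std(axis=(0, 1), keepdims=True) + 1e-8
    refs_n = (refs - mu) / sigma
    target_n = (target_feats - mu[0]) / sigma[0]
    # Distance from each target frame to the same frame in every reference;
    # take the minimum so any neutral rendition can "explain" the frame.
    dists = np.linalg.norm(refs_n - target_n[None], axis=-1)   # (R, T)
    return dists.min(axis=0)                                   # per-frame deviation score

# Toy usage with random features: T=100 frames, D=13 dims, R=3 synthetic voices.
rng = np.random.default_rng(0)
target = rng.normal(size=(100, 13))
references = [rng.normal(size=(100, 13)) for _ in range(3)]
scores = frame_deviation(target, references)
print("most deviant frames:", np.argsort(scores)[-5:])

Frames with large scores deviate from every neutral rendition and are candidate emotional segments; the paper itself goes further, validating the synthetic references with feature analysis and perceptual evaluation before using them for arousal and valence detection.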
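
For "Speech-Driven Expressive Talking Lips with Conditional Sequential Generative Adversarial Networks", a toy sketch of a speech-conditioned sequential GAN, assuming PyTorch: a recurrent generator maps speech features plus noise to a lip trajectory, and a discriminator judges (speech, lips) sequence pairs. The training loop, the emotion-adapted variant (CSG-Emo-Adapted), and the emotion-aware conditioning (CSG-Emo-Aware) are omitted.

import torch
import torch.nn as nn

FEAT_DIM, NOISE_DIM, LIP_DIM, HID = 36, 16, 20, 128    # hypothetical sizes

class Generator(nn.Module):
    # Maps a speech-feature sequence (plus noise) to a lip-landmark sequence.
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM + NOISE_DIM, HID, batch_first=True)
        self.out = nn.Linear(HID, LIP_DIM)

    def forward(self, speech, noise):
        h, _ = self.rnn(torch.cat([speech, noise], dim=-1))
        return self.out(h)                              # (B, T, LIP_DIM)

class Discriminator(nn.Module):
    # Scores whether a (speech, lips) sequence pair looks real or generated.
    def __init__(self):
        super().__init__()
        self.rnn = nn.GRU(FEAT_DIM + LIP_DIM, HID, batch_first=True)
        self.out = nn.Linear(HID, 1)

    def forward(self, speech, lips):
        _, h = self.rnn(torch.cat([speech, lips], dim=-1))
        return self.out(h[-1])                          # (B, 1) real/fake logit

# Forward pass on random tensors: batch of 4 sequences, 50 frames each.
speech = torch.randn(4, 50, FEAT_DIM)
noise = torch.randn(4, 50, NOISE_DIM)
G, D = Generator(), Discriminator()
fake_lips = G(speech, noise)
print(fake_lips.shape, D(speech, fake_lips).shape)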
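
For "Speech-Driven Animation with Meaningful Behaviors", a toy numpy sketch of the constraint idea: hidden states are split into shared and constraint-exclusive groups, the inactive constraint's states are unreachable, and transitions between shared and exclusive states are down-weighted so they stay rare. This is only a caricature of the paper's DBN; state counts, constraint labels, and probabilities are invented.

import numpy as np

rng = np.random.default_rng(1)

# Toy state space: 4 shared states plus 3 exclusive states per constraint.
SHARED = list(range(4))
EXCLUSIVE = {"question": [4, 5, 6], "head_nod": [7, 8, 9]}
N_STATES = 10

def constrained_transitions(constraint, cross_weight=0.05):
    # Build a transition matrix where only the active constraint's exclusive
    # states are reachable and cross-group transitions get small weights.
    allowed = SHARED + EXCLUSIVE[constraint]
    A = np.zeros((N_STATES, N_STATES))
    for i in allowed:
        for j in allowed:
            same_group = (i in SHARED) == (j in SHARED)
            A[i, j] = rng.random() if same_group else cross_weight * rng.random()
        A[i] /= A[i].sum()                     # normalize each reachable row
    return A

def sample_trajectory(constraint, length=20):
    A = constrained_transitions(constraint)
    state = rng.choice(SHARED)                 # start in a shared state
    traj = [state]
    for _ in range(length - 1):
        state = rng.choice(N_STATES, p=A[state])
        traj.append(int(state))
    return traj

print("question:", sample_trajectory("question"))
print("head_nod:", sample_trajectory("head_nod"))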
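
For "Expressive Speech-Driven Lip Movements with Multitask Learning", a minimal multitask-learning sketch, again assuming PyTorch: a shared recurrent encoder feeds a lip-movement regression head plus auxiliary viseme- and emotion-recognition heads, and the auxiliary losses are added to the main loss with small weights. The architecture and loss weights are assumptions, not the paper's configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

FEAT_DIM, HID, LIP_DIM, N_EMOTIONS, N_VISEMES = 36, 128, 20, 4, 14   # hypothetical sizes

class MTLLipModel(nn.Module):
    # Shared speech encoder with a lip-movement head and two auxiliary heads.
    def __init__(self):
        super().__init__()
        self.encoder = nn.GRU(FEAT_DIM, HID, batch_first=True)
        self.lip_head = nn.Linear(HID, LIP_DIM)         # frame-level regression (main task)
        self.viseme_head = nn.Linear(HID, N_VISEMES)    # frame-level classification (auxiliary)
        self.emotion_head = nn.Linear(HID, N_EMOTIONS)  # utterance-level classification (auxiliary)

    def forward(self, speech):
        h, _ = self.encoder(speech)                     # (B, T, HID) shared representation
        return self.lip_head(h), self.viseme_head(h), self.emotion_head(h.mean(dim=1))

# Forward pass and a combined multitask loss on random data (B=8, T=50 frames).
model = MTLLipModel()
speech = torch.randn(8, 50, FEAT_DIM)
lips, visemes, emotions = model(speech)

lip_target = torch.randn(8, 50, LIP_DIM)
viseme_target = torch.randint(0, N_VISEMES, (8, 50))
emotion_target = torch.randint(0, N_EMOTIONS, (8,))
loss = (F.mse_loss(lips, lip_target)
        + 0.1 * F.cross_entropy(visemes.reshape(-1, N_VISEMES), viseme_target.reshape(-1))
        + 0.1 * F.cross_entropy(emotions, emotion_target))
loss.backward()
print(float(loss))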

Works in Treasures @ UT Dallas are made available exclusively for educational purposes such as research or instruction. Literary rights, including copyright for published works held by the creator(s) or their heirs, or other third parties may apply. All rights are reserved unless otherwise indicated by the copyright owner(s).