CEP – Conversational Engagement Prediction

The aim of this task is to estimate the level of engagement of participants in group video-mediated communication (conversing, responding, following, no interest). The dataset for this task consists of several audio and video recordings as could be captured by a potential home teleconference system (see [1,2]). Each recording captures interaction between a group of co-located participants and one remote participant, involved in activities ranging from casual conversation to simple social games. The audio and video recordings are accompanied by gaze recordings of the remote participant, manually annotated head positions, and voice activity annotations. The experiments will be performed for the remote participant, for whom gaze data is available.

Evaluation metric

A ground truth annotation for the training part of the dataset will be made available to the participants. Ground truth for the testing part of the dataset will be released after the challenge. Participants are expected to provide a short description of their system, together with its outputs for short intervals of the testing data in a defined simple format for evaluation. The official metric used in the evaluation will be a weighted classification cost reflecting the similarities between the different levels of engagement. The weights will be made public together with the training data. Additionally, DET (Detection Error Tradeoff) curves and confusion matrices will be generated. Participants may choose to ignore the provided voice activity annotations; such submissions will be evaluated separately from submissions that use them.
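As an illustration only, the following Python sketch shows how such a weighted classification cost could be computed from per-interval labels. The level ordering, the 4×4 weight matrix W, and the toy labels are placeholders chosen to mimic "confusing similar levels costs less"; the official weights are those released with the training data.

```python
import numpy as np

# Engagement levels as defined by the task; the index order here is illustrative.
LEVELS = ["conversing", "responding", "following", "no interest"]

# Hypothetical cost matrix W[true, predicted]: confusing neighbouring levels
# is assumed to be cheaper than confusing distant ones. Placeholder values only.
W = np.array([
    [0.0, 0.5, 1.0, 1.0],
    [0.5, 0.0, 0.5, 1.0],
    [1.0, 0.5, 0.0, 0.5],
    [1.0, 1.0, 0.5, 0.0],
])

def weighted_classification_cost(y_true, y_pred, weights):
    """Mean of weights[true, predicted] over all evaluated intervals."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return weights[y_true, y_pred].mean()

def confusion_matrix(y_true, y_pred, n_classes):
    """Counts of (true, predicted) label pairs over all evaluated intervals."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Toy example: ground truth vs. system output for six short intervals.
y_true = [0, 0, 1, 2, 3, 3]
y_pred = [0, 1, 1, 3, 3, 2]
print(weighted_classification_cost(y_true, y_pred, W))   # 0.25 with the weights above
print(confusion_matrix(y_true, y_pred, len(LEVELS)))
```

With a uniform off-diagonal weight matrix this cost reduces to the ordinary classification error rate; the task-specific weights simply penalise some confusions more than others.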

[1] http://medusa.fit.vutbr.cz/TA2/TA2/
[2] Michal Hradiš, Shahram Eivazi, and Roman Bednařík. Voice activity detection in video mediated communication from gaze. In Proceedings of ETRA '12, 2012.