AVSR – Audio Visual Speaker Recognition

The aim of the task is to detect, localize and track speakers in audio-visual sequences. The data used are selected scenarios from the “interaction” part of the RAVEL data set [1,2]. See the data set web site for more information on which sequences to use.

Evaluation metric

To evaluate the results, the Euclidean distance matrix between the detected speakers and the ground-truth speakers should be computed. Each ground-truth speaker should be associated with at most one detected speaker. The assignment procedure is as follows. For each detected speaker, its closest ground-truth speaker is found. If that ground-truth speaker is not closer than a threshold τloc, the detected speaker is marked as a false positive; otherwise, the detected speaker is assigned to that ground-truth speaker. Then, for each ground-truth speaker, the number of detected speakers assigned to it is checked. If there are none, the ground-truth speaker is marked as a missed detection. Otherwise, the closest detected speaker becomes the true positive and the remaining ones become false positives. Recall, precision and accuracy values should be reported in tables (and occasionally also in figures) for values of τloc in the range 1 cm to 50 cm.
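
The assignment procedure can be sketched in a few lines of Python. This is only an illustrative sketch, not a reference implementation: the function names `evaluate_frame` and `precision_recall` are hypothetical, positions are assumed to be given per frame as coordinate rows in metres (so τloc = 0.1 means 10 cm), and a zero denominator is mapped to 0.0 by convention. Accuracy is left out because its exact definition is not given in this description.

```python
import numpy as np

def evaluate_frame(detected, ground_truth, tau_loc):
    """Count (true positives, false positives, missed detections) for one frame.

    detected, ground_truth: sequences of positions, one row per speaker
    (assumed to be in metres). tau_loc: localization threshold, same unit.
    """
    det = np.asarray(detected, dtype=float)
    gt = np.asarray(ground_truth, dtype=float)
    if len(det) == 0:
        return 0, 0, len(gt)   # nothing detected: every speaker is missed
    if len(gt) == 0:
        return 0, len(det), 0  # nobody present: every detection is false

    # Euclidean distance matrix, shape (n_detected, n_ground_truth).
    dist = np.linalg.norm(det[:, None, :] - gt[None, :, :], axis=2)

    # Closest ground-truth speaker for each detection.
    nearest = dist.argmin(axis=1)
    nearest_dist = dist[np.arange(len(det)), nearest]

    # Detections that are not closer than tau_loc are false positives.
    assigned = nearest_dist < tau_loc
    fp = int((~assigned).sum())

    tp = fn = 0
    for g in range(len(gt)):
        n_assigned = int((assigned & (nearest == g)).sum())
        if n_assigned == 0:
            fn += 1               # missed detection
        else:
            tp += 1               # the closest assigned detection
            fp += n_assigned - 1  # the remaining assigned detections
    return tp, fp, fn

def precision_recall(tp, fp, fn):
    """Precision and recall from accumulated counts (0.0 if undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Two detections near the first speaker, one far from both speakers.
counts = evaluate_frame([[0, 0, 0], [0.05, 0, 0], [2, 0, 0]],
                        [[0, 0, 0], [1, 0, 0]], tau_loc=0.1)
print(counts, precision_recall(*counts))
```

In this toy frame the two nearby detections yield one true positive and one false positive, the distant detection is a second false positive, and the second ground-truth speaker is missed. To produce the requested tables, the counts would be accumulated over all frames and the computation repeated for each τloc between 0.01 and 0.50.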

[1] The RAVEL data set. http://ravel.humavips.eu/.
[2] Xavier Alameda-Pineda, Jordi Sanchez-Riera, Vojtech Franc, Johannes Wienke, Jan Cech, Kaustubh Kulkarni, Antoine Deleforge, and Radu P. Horaud. The RAVEL data set. In IEEE/ACM ICMI 2011 Workshop on Multimodal Corpora, Alicante, Spain, November 2011.