
MLC-SLM Workshop Program

Date & Location: August 22nd, Dock 14 – Rotterdam Ahoy Convention Centre

Time Slot      Activity
8:30-9:00      Badge Pickup
9:00-10:00     Keynote 1: Shinji Watanabe – Scaling Multilingual Speech Recognition: From a Handful to Thousands of Languages
10:00-10:30    Coffee Break
10:30-11:00    Challenge Summary + Awards Ceremony
11:00-12:00    Oral Session:
               1. Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models, Speaker: Bo Li
               2. Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge, Speaker: Xiaoxiao Li
               3. Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge, Speaker: Miaomiao Gao
               4. The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge, Speaker: Hongfei Xue
12:00-13:00    Lunch Break
13:00-14:00    Keynote 2: Hung-yi Lee – Advancements in Spoken Language Models
14:00-14:30    Oral Session:
               1. ILT: Iterative LORA Training through Focus-Feedback-Fix for Multilingual Speech Recognition, Speaker: Qingliang Meng
               2. BUT System for the MLC-SLM Challenge, Speaker: Alexander Polok
14:30-15:00    Coffee Break
15:00-15:30    Invited talk 1: Ming Li – Sequence-to-Sequence Neural Diarization under Online and Multi-modal Scenarios
15:30-16:00    Invited talk 2: Shuai Wang – One Embedding Doesn't Fit All: Rethinking Speaker Modeling for Various Speech Applications
16:00-16:30    Invited talk 3: Pan Pan – Beyond Data Scarcity: Engineering Quality-First Data Pipelines in Different Training Stage
16:30-17:30    Posters
Workshop Registration Channels:
Official registration via Interspeech (please select "Workshop on Multilingual Conversational Speech Language Model" during your registration): Click the link
On-site registration channel: Click the link
Registration fee: €50. Registered participants will receive coffee breaks and one lunch on the day of the workshop.
Note: Participants registering via the on-site channel must pay in cash at the venue.
Keynote 1
Shinji Watanabe, Associate Professor, Carnegie Mellon University
Scaling Multilingual Speech Recognition: From a Handful to Thousands of Languages
Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA USA from 2012 to 2017. Before Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 500 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from ISCA Interspeech in 2024. He is a Senior Area Editor of the IEEE Transactions on Audio Speech and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and Machine Learning for Signal Processing Technical Committee (MLSP). He is an IEEE and ISCA Fellow.
Keynote 2
Hung-yi Lee, Professor, National Taiwan University
Advancements in Spoken Language Models
Hung-yi Lee is a professor in the Department of Electrical Engineering at National Taiwan University (NTU), with a joint appointment in the university's Department of Computer Science & Information Engineering. His recent research focuses on developing technology that reduces the requirement for annotated data in speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering). He won the Salesforce Research Deep Learning Grant in 2019, the AWS ML Research Award in 2020, the Outstanding Young Engineer Award from the Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from the Foundation for the Advancement of Outstanding Scholarship in 2019, the Ta-You Wu Memorial Award from the Ministry of Science and Technology of Taiwan in 2019, and the 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan. He runs a YouTube channel teaching deep learning technology in Mandarin, which has more than 300,000 subscribers.
Invited talk 1
Ming Li, Professor, Duke Kunshan University
Sequence-to-Sequence Neural Diarization under Online and Multi-modal Scenarios
Ming Li received his Ph.D. in Electrical Engineering from the University of Southern California in 2013. He is currently a Professor of Electrical and Computer Engineering in the Division of Natural and Applied Sciences and a Principal Research Scientist at the Digital Innovation Research Center at Duke Kunshan University. He is also an Adjunct Professor at the School of Computer Science of Wuhan University. His research interests are in the areas of audio, speech, and language processing as well as multimodal behavior signal analysis and interpretation. He has published more than 200 papers and served as a member of the IEEE Speech and Language Technical Committee and the APSIPA Speech and Language Processing Technical Committee. He was an area chair at Interspeech 2016, Interspeech 2018, Interspeech 2020, SLT 2022, Interspeech 2024, Interspeech 2025, and ASRU 2025, and the technical program co-chair of Odyssey 2022 and ASRU 2023. He is an editorial board member of IEEE Transactions on Audio, Speech and Language Processing, Computer Speech and Language, and APSIPA Transactions on Signal and Information Processing. Works co-authored with his colleagues have won first-prize awards at the Interspeech Computational Paralinguistics Challenges in 2011, 2012, and 2019, the ASRU 2019 MGB-5 ADI Challenge, the Interspeech 2020 and 2021 Fearless Steps Challenges, the VoxSRC 2021, 2022, and 2023 Challenges, the ICASSP 2022 M2MeT Challenge, the IJCAI 2023 ADD Challenge, the ICME 2024 ChatCLR Challenge, and the Interspeech 2024 AVSE Challenge. As a co-author, he won the best paper award at DCOSS 2009 and ISCSLP 2014 and was shortlisted for the best paper award at Interspeech 2024. He received the IBM Faculty Award in 2016, the ISCA Computer Speech and Language 5-year Best Journal Paper Award in 2018, and the Youth Achievement Award for Outstanding Scientific Research Achievements in Chinese Higher Education in 2020. He is a senior member of IEEE.
Invited talk 2
Shuai Wang, Associate Professor, Nanjing University
One Embedding Doesn’t Fit All: Rethinking Speaker Modeling for Various Speech Applications
Shuai Wang is a tenure-track Associate Professor at Nanjing University and an adjunct faculty member at the Chinese University of Hong Kong, Shenzhen (CUHK-SZ). He received his Ph.D. from Shanghai Jiao Tong University in 2020 and his B.Sc. from Northwestern Polytechnical University in 2014. Dr. Wang has published over 60 papers on speaker modeling and has received several honors, including the IEEE Ramaswamy Grant at ICASSP 2018, and first place in both VoxSRC 2019 and DIHARD 2019. He is the initiator of the open-source projects WeSpeaker and WeSep, which are widely adopted by both academia and industry.
Invited talk 3
Pan Pan, Director of AI Business, Nexdata
Beyond Data Scarcity: Engineering Quality-First Data Pipelines in Different Training Stage
Visionary leader and operational architect at Nexdata, Pan leverages over a decade of AI data expertise to lead elite teams in delivering end-to-end solutions for LLM, GenAI, and traditional AI models. She has successfully executed 1000+ projects by integrating global-scale multi-sensor data collection, AI-powered annotation, and a unified platform that streamlines the entire training data pipeline.

Motivation

Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of downstream tasks, serving as powerful foundation models for language understanding and generation. Recently, there has been significant interest in applying LLMs to speech and audio processing tasks, including Automatic Speech Recognition (ASR), Audio Captioning, and emerging areas such as Spoken Dialogue Models.

However, the development of robust LLM-based Spoken Dialogue Models relies heavily on real-world conversational speech data, which encapsulates the complexity of human communication, including natural pauses, interruptions, speaker overlaps, and diverse conversational styles. The scarcity of such data, especially in multilingual contexts, poses a significant challenge to advancing the field.

The importance of real-world conversational speech extends beyond technological advancement—it is essential for building AI systems that can understand and respond naturally in multilingual, dynamic, and context-rich environments. This is especially crucial for next-generation human-AI interaction systems, where spoken dialogue serves as a primary mode of communication.

Thus, this challenge and workshop aim to bridge the gap by hosting the challenge of building multilingual conversational speech language models (MLC-SLM) and releasing a real-world multilingual conversational speech dataset.

Task Setting and Evaluation

The challenge consists of two tasks, both of which require participants to explore the development of speech language models (SLMs):

Task I: Multilingual Conversational Speech Recognition

Objective: Develop a multilingual LLM-based ASR model.

Participants will be provided with oracle segmentation and speaker labels for each conversation.

This task focuses on optimizing recognition accuracy in a multilingual conversation setting.
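As a rough illustration of one common design for such a model (a minimal sketch, not the official baseline: features from a speech encoder are projected into the LLM embedding space, and the LLM then decodes the transcript; the module name and the dimensions below are hypothetical):

```python
# Minimal sketch (assumption, not the official baseline): an LLM-based ASR system
# typically projects speech-encoder features into the LLM embedding space and lets
# the LLM decode the transcript autoregressively.
import torch
import torch.nn as nn

class SpeechToLLMAdapter(nn.Module):
    """Maps frame-level speech features to the LLM's embedding dimension."""
    def __init__(self, speech_dim: int, llm_dim: int, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, speech_dim); stack neighbouring frames to
        # shorten the sequence before projecting into the LLM space.
        b, t, d = feats.shape
        t = t - t % self.downsample
        feats = feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(feats)

# Usage: prepend the projected speech embeddings to the embedded text prompt
# and feed the combined sequence to a decoder-only LLM.
adapter = SpeechToLLMAdapter(speech_dim=1280, llm_dim=4096)
speech_feats = torch.randn(2, 300, 1280)   # e.g. Whisper-style encoder output
speech_embeds = adapter(speech_feats)      # (2, 75, 4096), ready for the LLM
print(speech_embeds.shape)
```

Whether the encoder and LLM are frozen or fine-tuned is left to participants; the rules permit freely accessible pre-trained models.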

Task II: Multilingual Conversational Speech Diarization and Recognition

Objective: Develop a system for both speaker diarization (identifying who is speaking and when) and recognition (transcribing speech to text).

No prior or oracle information will be provided during evaluation (e.g., no pre-segmented utterances or speaker labels).

Both pipeline-based and end-to-end systems are encouraged, providing flexibility in system design and implementation.
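One way to organize a pipeline-style system is sketched below (an illustration only; `diarize` and `transcribe` are hypothetical placeholders for whatever diarization and ASR components a team chooses, and end-to-end designs are equally valid):

```python
# Minimal sketch (assumption, not the official baseline): a pipeline-style Task II
# system that first diarizes a long recording and then transcribes each speaker
# turn. `diarize` and `transcribe` are placeholders; the challenge does not
# prescribe specific models.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float

def run_pipeline(
    wav: Sequence[float],
    sample_rate: int,
    diarize: Callable[[Sequence[float], int], List[Turn]],
    transcribe: Callable[[Sequence[float], int], str],
) -> List[Tuple[str, float, float, str]]:
    """Return (speaker, start, end, text) tuples for every detected turn."""
    output = []
    for turn in diarize(wav, sample_rate):
        chunk = wav[int(turn.start * sample_rate): int(turn.end * sample_rate)]
        output.append((turn.speaker, turn.start, turn.end, transcribe(chunk, sample_rate)))
    return output
```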

For Task I, system performance will be evaluated using Word Error Rate (WER) or Character Error Rate (CER) across different languages.
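As a reminder of the metric (the standard definition, not anything challenge-specific): WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of reference words; CER is computed the same way over characters. For example, a 100-word reference recognized with 3 substitutions, 1 deletion, and 2 insertions gives WER = (3 + 1 + 2) / 100 = 6%.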

For Task II, performance will be assessed based on the Diarization Error Rate (DER) and the concatenated minimum-permutation WER or CER, referred to as tcpWER or tcpCER. The DER is employed to determine the best speaker-ID permutation between the oracle annotation and the diarization results. Then, the recognition results and references belonging to the same speaker within a recording are concatenated to calculate the tcpWER or tcpCER. All submissions will be ranked according to the tcpWER or tcpCER.
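To make the concatenation step concrete, the sketch below groups segments by speaker, concatenates them in time order, and computes a plain per-speaker error rate (an illustration only, not the official scoring: the actual tcpWER/tcpCER additionally applies time constraints and searches over speaker permutations, for example via the meeteval toolkit; the segment format here is hypothetical):

```python
# Minimal sketch (assumption): concatenate per-speaker hypotheses and references
# in time order, then compute a plain WER per speaker. The official tcpWER adds
# time constraints and a speaker-permutation search.
from collections import defaultdict

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over word lists.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def per_speaker_wer(ref_segs, hyp_segs):
    # Each segment: (speaker_id, start_time, text). Assumes hypothesis speaker
    # labels already correspond to reference speakers (the permutation step).
    def concat(segs):
        by_spk = defaultdict(list)
        for spk, start, text in sorted(segs, key=lambda s: s[1]):
            by_spk[spk].extend(text.split())
        return by_spk

    ref, hyp = concat(ref_segs), concat(hyp_segs)
    errors = sum(edit_distance(ref[s], hyp.get(s, [])) for s in ref)
    total = sum(len(words) for words in ref.values())
    return errors / total

ref = [("A", 0.0, "hello there"), ("B", 1.2, "hi"), ("A", 2.5, "how are you")]
hyp = [("A", 0.1, "hello there"), ("B", 1.3, "hi"), ("A", 2.6, "how are u")]
print(per_speaker_wer(ref, hyp))  # ~0.167 (1 error out of 6 reference words)
```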

Important Dates (AOE Time)

    March 10, 2025: Registration opens

    March 15, 2025: Training data release

    April 1, 2025: Development set and baseline system release

    May 15, 2025: Evaluation set release and leaderboard open

    May 30, 2025: Leaderboard freeze and paper submission portal opens (CMT system)

    June 15, 2025: Paper submission deadline

    July 1, 2025: Notification of acceptance

    August 22, 2025: Workshop date

Dataset Description

Training set

The training set (Train) covers 11 languages: English (en), French (fr), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (jp), Korean (ko), Russian (ru), Thai (th), and Vietnamese (vi).

    Each recording consists of two-speaker conversational speech on randomly assigned topics.

    Conversations are natural and fluent, with speakers engaging in meaningful dialogues on each topic.

    Recorded in quiet indoor environments using devices such as iPhones.

    Oracle segmentation and speaker labels are provided for each recording to support the development of speech recognition and speaker diarization systems.

    Both Task I and Task II share the same training set.

    The English dataset comprises approximately 500 hours of recordings from various regions, including British, American, Australian, Indian, and Philippine English. Other languages contribute around 100 hours each, resulting in a total of approximately 1500 hours of multilingual conversational speech data.

This dataset is designed to provide a rich resource for training and evaluating multilingual conversational speech language models (MLC-SLM), addressing the challenges of linguistic diversity, speaker variability, and contextual understanding.

Language | Data Volume (h) | Sampling Rate | Description
English | 500 in total: American, British, Filipino, Australian, and Indian English (100 h each) | 16 kHz | Covers 5 different accents of English, with speakers from the United States, the United Kingdom, the Philippines, Australia, and India; diverse genders and ages; natural conversation style. Word error rate lower than 2%.
French | 100 | 16 kHz | Word error rate lower than 2%.
German | 100 | 16 kHz | Word error rate lower than 2%.
Italian | 100 | 16 kHz | Word error rate lower than 2%.
Japanese | 100 | 16 kHz | Sentence error rate lower than 5%.
Korean | 100 | 16 kHz | Sentence error rate lower than 5%.
Portuguese (Europe) | 100 | 16 kHz | Word error rate lower than 2%.
Russian | 100 | 16 kHz | Word error rate lower than 2%.
Spanish (Spain) | 100 | 16 kHz | Word error rate lower than 2%.
Thai | 100 | 16 kHz | Word error rate lower than 3%.
Vietnamese | 100 | 16 kHz | Word error rate lower than 2%.

For every language other than English, the recordings were made on mobile phones: each recorder selected several familiar topics and recorded a smooth, natural conversation on each, and the speakers cover a range of genders and ages.

Development set

The development set (Dev) has the same setting as the training set but contains approximately 4 hours of recordings for each language. Both Task I and Task II share the same development set.

Evaluation set

Different evaluation sets are employed for each task, designated as Eval_1 and Eval_2. Specifically, Eval_1 includes oracle timestamps and speaker labels, which are evaluated using WER/CER. Eval_2 does not provide timestamps or speaker labels, necessitating a speaker diarization (SD) system to segment the longer recordings before recognition.
Participants can access the dataset by signing the Data Use Agreement and submitting the registration form. After submission, the data download link will be sent to your email.

Rules

All participants must adhere to the following rules to be eligible for the challenge.

Use of External Resources: For both Task I and Task II, the use of external datasets and pre-trained models (including speech foundation models and LLMs) is permitted. All external resources utilized must be freely accessible to all research groups and must be clearly indicated in the final system report.
Data Augmentation: Data augmentation is allowed on the released training set and may include, but is not limited to, the addition of noise or reverberation, speed perturbation, and tone modification (a brief waveform-level sketch follows these rules).
Prohibition of Evaluation Set Usage: The use of the evaluation sets in any non-compliant form is strictly prohibited. This includes, but is not limited to, using the evaluation sets for fine-tuning or training the model.
Multi-System Fusion: Participants are NOT allowed to employ system fusion in either Task I or Task II. Submitted results must be derived from a single model rather than from result fusion.
Submission Requirement: All participants are required to submit their systems. The submission may include final results, models, a Docker image that can directly perform inference to obtain the final results, etc. Detailed submission instructions will be provided following the release of the baseline implementation. Please note that we will publicly disclose the names of teams and their affiliated institutions that confirmed participation but did not submit any files.
Organizer's Interpretation: The organizers reserve the right to make the final interpretation of these rules. In special circumstances, the organizers will coordinate the interpretation as needed.
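As referenced in the data augmentation rule above, here is a minimal waveform-level sketch of two allowed augmentations, additive noise at a target SNR and speed perturbation by resampling (an illustration under the assumption of 16 kHz mono input; toolkits such as torchaudio offer equivalent, more careful transforms):

```python
# Minimal sketch (assumption): simple waveform-level augmentations of the kind
# allowed by the rules -- additive noise at a target SNR and speed perturbation
# by resampling.
import torch
import torch.nn.functional as F

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    # Assumes the noise clip is at least as long as the speech.
    noise = noise[: speech.numel()].clone()
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def speed_perturb(speech: torch.Tensor, factor: float) -> torch.Tensor:
    # Change playback speed by resampling the waveform (also shifts pitch).
    new_len = int(speech.numel() / factor)
    return F.interpolate(speech.view(1, 1, -1), size=new_len, mode="linear",
                         align_corners=False).view(-1)

wav = torch.randn(16000)                 # 1 s of dummy 16 kHz audio
noisy = add_noise(wav, torch.randn(16000), snr_db=10.0)
faster = speed_perturb(wav, factor=1.1)  # ~0.91 s after perturbation
print(noisy.shape, faster.shape)
```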

Other Topics

In addition to challenge system descriptions, participants are encouraged to submit research papers that showcase innovative findings, practical case studies, and forward-looking ideas. Topics of interest include, but are not limited to:

Novel Architectures and Algorithms: Development of new architectures and algorithms for training SLMs.
Audio Data Processing Pipelines: Innovative pipelines for processing raw audio data that facilitate the collection of diverse internet data for training SLMs.
Natural and Emotionally Rich Speech Generation: Algorithms designed to generate more natural and emotionally expressive conversational speech for dialogue systems.
Leveraging Multi-Turn Conversational History: Approaches that utilize multi-turn conversational history to enhance recognition and diarization results.
Evaluation Techniques and Benchmarks: Innovative evaluation techniques or benchmarks specifically tailored for assessing SLMs.
New Datasets: Creation of new datasets, both real and synthetic, for training speech and audio language models.

Data Access and Usage

Registered participants will be given access to the training and testing datasets. They must sign a data use agreement (see below), agree to confidentiality and comply with the data protection agreement. The datasets will only be used for the purpose of the workshop challenge, and redistribution or any other use is strictly prohibited. It is the responsibility of the participant to protect the data from unauthorized access.

Data License Agreement
Data use agreement- nexdata

Registration

To participate, registration is required. Please upload the signed Data Use Agreement and complete the registration form. The challenge begins on March 10, 2025.

For any other information about registration, please send an email to: mlc-slmw@nexdata.ai

Baseline System

Github/MLC-SLM-Baseline

Paper Submission Guidelines

1. Challenge papers:

a. Participants must submit ONE short technical description paper (even if the team participated in both tasks).

b. Length: 2-4 pages of content + 1 page for references.

c. Content Requirements:
  i. Clear system descriptions to assess submission correctness and rule compliance.
  ii. Reproducibility details, including the open-source datasets and models used, data augmentation strategies, model architectures, training configurations, etc.
  iii. Ablation studies demonstrating the method's effectiveness.

d. All challenge participants are expected to present a talk or a poster at the workshop.

2. Non-challenge papers:

a. Length: 4 pages of content + 1 page for references.

b. Topics: include, but are not limited to, the topics listed on the challenge website.

3. Author Kit:

Please use the provided Interspeech 2022 LaTeX author kit (https://www.interspeech2022.org/files/IS2022_paper_kit.zip) for all submissions. Note that we are using the 2022 Interspeech author kit to keep reviewing single-blind.

4. Submission Portal

a. Submit your paper via the CMT conference system.

b. The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.

Prizes

Total prize fund: $20,000, sponsored by Huawei Technologies.

Prizes for top-ranking teams in this competition (each task):

1st Place: $5,000
2nd Place: $3,000
3rd Place: $2,000

Competition Results

MLC-SLM Task I

Username | WER/CER | No. | Team Name | Institution
tenp1 | 9.6 | 1 | TENP | Tencent Ethereal Audio Lab
sixteen-years | 9.67 | 2 | sixteen-years | Chinese Academy of Sciences
t-asr | 9.83 | 3 | T-ASR | SHENZHEN TRANSSION HOLDINGS CO., LTD.
megaais | 10.08 | 4 | MegaAIS | Megatronix (Beijing) Technology Co., Ltd.
maxiaoai | 10.56 | 5 | MaXiaoAI | Mashang Consumer Finance Co., Ltd. (MSCF)
ntu_speechlab | 10.58 | 6 | NTU-Speechlab | Nanyang Technological University
cheryfsai | 11.27 | 7 | Cheryfs-AI | Chery HuiYin Motor Finance Service Co., Ltd.
seewo | 11.57 | 8 | seewo | Guangzhou Shirui Electronics Co., Ltd.
daominhtri | 11.71 | 9 | Cake By VPBank | Cake By VPBank
maybe | 11.76 | 10 | May | Shanghai Normal University

MLC-SLM Task II

Username | tcpWER/tcpCER | No. | Team Name | Institution
megaais | 16.53 | 1 | MegaAIS | Megatronix (Beijing) Technology Co., Ltd.
tenp1 | 17.49 | 2 | TENP | Tencent Ethereal Audio Lab
seewo | 17.67 | 3 | seewo | Guangzhou Shirui Electronics Co., Ltd.
duke_kunshan | 18.08 | 4 | DKU | Duke Kunshan University
sixteen-years | 19.27 | 5 | sixteen-years | Chinese Academy of Sciences
cheryfsai | 26.3 | 6 | Cheryfs-AI | Chery HuiYin Motor Finance Service Co., Ltd.
saengthong | 27.25 | 7 | ST-ShinozakiLab | Institute of Science Tokyo
fosafer | 31.68 | 8 | FOSAFER_RESEARCH | Beijing Fosafer Information Technology Co., Ltd.
voicecode | 55.96 | 9 | VoiceCode | VOICECODE TECHNOLOGY PTE. LTD.
517517 | 59.4 | 10 | INFX | Zhejiang University

Note: Only the top 10 entries for each task are listed. For any inquiries regarding team results, please contact the organizing committee.

Venue

Dock 14 at Rotterdam Ahoy Convention Centre, Rotterdam, Netherlands

Registration fee for attending the workshop

Registration Fee: €50

Organizers

    Shinji Watanabe, Associate Professor, Carnegie Mellon University (USA)

    Eng Siong Chng, Professor, Nanyang Technological University (Singapore)

    Junlan Feng, IEEE Fellow & Chief Scientist, China Mobile (China)

    Shuai Wang, Research Scientist, Nanjing University (China)

    Longshuai Xiao, Huawei Technologies (China)

    Khalid Choukri, Secretary General, European Language Resources Association (France)

    Qiangze Feng, Co-founder & Data Scientist, Nexdata (USA)

    Daliang Wang, Data Scientist, Nexdata (USA)

    Hexin Liu, Postdoctoral Researcher, Nanyang Technological University (Singapore)

    Pengcheng Guo, PhD Student, Northwestern Polytechnical University (China)

    Bingshen Mu, PhD Student, Northwestern Polytechnical University (China)

    Zhaokai Sun, Master Student, Northwestern Polytechnical University (China)

Sponsors

Media Partners
