
MLC-SLM Workshop Program

Date & Location: August 22nd, Dock 14 – Rotterdam Ahoy Convention Centre

Time Slot      Activity
8:30-9:00      Badge Pickup
9:00-10:00     Keynote 1: Shinji Watanabe – Scaling Multilingual Speech Recognition: From a Handful to Thousands of Languages
10:00-10:30    Coffee Break
10:30-11:00    Challenge Summary + Awards Ceremony
11:00-12:00    Oral Session:
               1. Seewo's Submission to MLC-SLM: Lessons learned from Speech Reasoning Language Models, Speaker: Bo Li
               2. Transsion Multilingual Speech Recognition System for MLC-SLM 2025 Challenge, Speaker: Xiaoxiao Li
               3. Triple X: A LLM-Based Multilingual Speech Recognition System for the INTERSPEECH2025 MLC-SLM Challenge, Speaker: Miaomiao Gao
               4. The TEA-ASLP System for Multilingual Conversational Speech Recognition and Speech Diarization in MLC-SLM 2025 Challenge, Speaker: Hongfei Xue
12:00-13:00    Lunch Break
13:00-14:00    Keynote 2: Hung-yi Lee – Advancements in Spoken Language Models
14:00-14:30    Oral Session:
               1. ILT: Iterative LORA Training through Focus-Feedback-Fix for Multilingual Speech Recognition, Speaker: Qingliang Meng
               2. BUT System for the MLC-SLM Challenge, Speaker: Alexander Polok
14:30-15:00    Coffee Break
15:00-15:30    Invited talk 1: Ming Li – Sequence-to-Sequence Neural Diarization under Online and Multi-modal Scenarios
15:30-16:00    Invited talk 2: Shuai Wang – One Embedding Doesn't Fit All: Rethinking Speaker Modeling for Various Speech Applications
16:00-16:30    Invited talk 3: Pan Pan – Beyond Data Scarcity: Engineering Quality-First Data Pipelines in Different Training Stage
16:30-17:30    Posters
Workshop Registration Channels:
Official registration via Interspeech (please select "Workshop on Multilingual Conversational Speech Language Model" during your registration): Click the link
On-site registration channel: Click the link
Registration fee: €50. Registered participants will receive coffee breaks and one lunch on the day of the workshop.
Note: Participants registering via the on-site channel must pay in cash at the venue.
Keynote 1
Shinji Watanabe, Associate Professor, Carnegie Mellon University
Scaling Multilingual Speech Recognition: From a Handful to Thousands of Languages
Shinji Watanabe is an Associate Professor at Carnegie Mellon University, Pittsburgh, PA. He received his B.S., M.S., and Ph.D. (Dr. Eng.) degrees from Waseda University, Tokyo, Japan. He was a research scientist at NTT Communication Science Laboratories, Kyoto, Japan, from 2001 to 2011, a visiting scholar at Georgia Institute of Technology, Atlanta, GA, in 2009, and a senior principal research scientist at Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA USA from 2012 to 2017. Before Carnegie Mellon University, he was an associate research professor at Johns Hopkins University, Baltimore, MD, USA, from 2017 to 2020. His research interests include automatic speech recognition, speech enhancement, spoken language understanding, and machine learning for speech and language processing. He has published over 500 papers in peer-reviewed journals and conferences and received several awards, including the best paper award from ISCA Interspeech in 2024. He is a Senior Area Editor of the IEEE Transactions on Audio Speech and Language Processing. He was/has been a member of several technical committees, including the APSIPA Speech, Language, and Audio Technical Committee (SLA), IEEE Signal Processing Society Speech and Language Technical Committee (SLTC), and Machine Learning for Signal Processing Technical Committee (MLSP). He is an IEEE and ISCA Fellow.
Keynote 2
Hung-yi Lee, Professor, National Taiwan University
Advancements in Spoken Language Models
Hung-yi Lee is a professor in the Department of Electrical Engineering at National Taiwan University (NTU), with a joint appointment in the university's Department of Computer Science & Information Engineering. His recent research focuses on developing technology that reduces the requirement for annotated data in speech processing (including voice conversion and speech recognition) and natural language processing (including abstractive summarization and question answering). He won the Salesforce Research Deep Learning Grant in 2019, the AWS ML Research Award in 2020, the Outstanding Young Engineer Award from the Chinese Institute of Electrical Engineering in 2018, the Young Scholar Innovation Award from the Foundation for the Advancement of Outstanding Scholarship in 2019, the Ta-You Wu Memorial Award from the Ministry of Science and Technology of Taiwan in 2019, and the 59th Ten Outstanding Young Person Award in Science and Technology Research & Development of Taiwan. He runs a YouTube channel teaching deep learning technology in Mandarin, which has more than 300,000 subscribers.
Invited talk 1
Ming Li, Professor, Duke Kunshan University
Sequence-to-Sequence Neural Diarization under Online and Multi-modal Scenarios
Ming Li received his Ph.D. in Electrical Engineering from the University of Southern California in 2013. He is currently a Professor of Electrical and Computer Engineering in the Division of Natural and Applied Sciences and a Principal Research Scientist at the Digital Innovation Research Center at Duke Kunshan University. He is also an Adjunct Professor at the School of Computer Science of Wuhan University. His research interests are in the areas of audio, speech, and language processing as well as multimodal behavior signal analysis and interpretation. He has published more than 200 papers and served as a member of the IEEE Speech and Language Technical Committee and the APSIPA Speech and Language Processing Technical Committee. He was an area chair at Interspeech 2016, Interspeech 2018, Interspeech 2020, SLT 2022, Interspeech 2024, Interspeech 2025, and ASRU 2025, and the technical program co-chair of Odyssey 2022 and ASRU 2023. He is an editorial board member of IEEE Transactions on Audio, Speech and Language Processing, Computer Speech and Language, and APSIPA Transactions on Signal and Information Processing. Works co-authored with his colleagues have won first-prize awards at the Interspeech Computational Paralinguistics Challenges in 2011, 2012, and 2019, the ASRU 2019 MGB-5 ADI Challenge, the Interspeech 2020 and 2021 Fearless Steps Challenges, the VoxSRC 2021, 2022, and 2023 Challenges, the ICASSP 2022 M2MeT Challenge, the IJCAI 2023 ADD Challenge, the ICME 2024 ChatCLR Challenge, and the Interspeech 2024 AVSE Challenge. As a co-author, he won the best paper award at DCOSS 2009 and ISCSLP 2014 and was shortlisted for the best paper award at Interspeech 2024. He received the IBM Faculty Award in 2016, the ISCA Computer Speech and Language 5-year Best Journal Paper Award in 2018, and the Youth Achievement Award for Outstanding Scientific Research Achievements in Chinese Higher Education in 2020. He is a senior member of IEEE.
Invited talk 2
Shuai Wang, Associate Professor, Nanjing University
One Embedding Doesn’t Fit All: Rethinking Speaker Modeling for Various Speech Applications
Shuai Wang is a tenure-track Associate Professor at Nanjing University and an adjunct faculty member at the Chinese University of Hong Kong, Shenzhen (CUHK-SZ). He received his Ph.D. from Shanghai Jiao Tong University in 2020 and his B.Sc. from Northwestern Polytechnical University in 2014. Dr. Wang has published over 60 papers on speaker modeling and has received several honors, including the IEEE Ramaswamy Grant at ICASSP 2018, and first place in both VoxSRC 2019 and DIHARD 2019. He is the initiator of the open-source projects WeSpeaker and WeSep, which are widely adopted by both academia and industry.
Invited talk 3
Pan Pan, Director of AI Business, Nexdata
Beyond Data Scarcity: Engineering Quality-First Data Pipelines in Different Training Stage
Visionary leader and operational architect at Nexdata, Pan leverages over a decade of AI data expertise to lead elite teams in delivering end-to-end solutions for LLM, GenAI, and traditional AI models. She has successfully executed 1000+ projects by integrating global-scale multi-sensor data collection, AI-powered annotation, and a unified platform that streamlines the entire training data pipeline.

Motivation

Large Language Models (LLMs) have demonstrated remarkable capabilities across a variety of downstream tasks, serving as powerful foundation models for language understanding and generation. Recently, there has been significant interest in applying LLMs to speech and audio processing tasks, including Automatic Speech Recognition (ASR), Audio Captioning, and emerging areas such as Spoken Dialogue Models.

However, the development of robust LLM-based Spoken Dialogue Models relies heavily on real-world conversational speech data, which encapsulates the complexity of human communication, including natural pauses, interruptions, speaker overlaps, and diverse conversational styles. The scarcity of such data, especially in multilingual contexts, poses a significant challenge to advancing the field.

The importance of real-world conversational speech extends beyond technological advancement—it is essential for building AI systems that can understand and respond naturally in multilingual, dynamic, and context-rich environments. This is especially crucial for next-generation human-AI interaction systems, where spoken dialogue serves as a primary mode of communication.

Thus, this challenge and workshop aim to bridge the gap by hosting the challenge of building multilingual conversational speech language models (MLC-SLM) and releasing a real-world multilingual conversational speech dataset.

Task Setting and Evaluation

The challenge consists of two tasks, both of which require participants to explore the development of speech language models (SLMs):

Task I: Multilingual Conversational Speech Recognition

Objective: Develop a multilingual LLM-based ASR model.

Participants will be provided with oracle segmentation and speaker labels for each conversation.

This task focuses on optimizing recognition accuracy in a multilingual conversation setting.
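As a rough illustration of one common design for such a model (a minimal sketch, not the official baseline: features from a speech encoder are projected into the LLM embedding space, and the LLM then decodes the transcript; the module name and the dimensions below are hypothetical):

```python
# Minimal sketch (assumption, not the official baseline): an LLM-based ASR system
# typically projects speech-encoder features into the LLM embedding space and lets
# the LLM decode the transcript autoregressively.
import torch
import torch.nn as nn

class SpeechToLLMAdapter(nn.Module):
    """Maps frame-level speech features to the LLM's embedding dimension."""
    def __init__(self, speech_dim: int, llm_dim: int, downsample: int = 4):
        super().__init__()
        self.downsample = downsample
        self.proj = nn.Sequential(
            nn.Linear(speech_dim * downsample, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, frames, speech_dim); stack neighbouring frames to
        # shorten the sequence before projecting into the LLM space.
        b, t, d = feats.shape
        t = t - t % self.downsample
        feats = feats[:, :t].reshape(b, t // self.downsample, d * self.downsample)
        return self.proj(feats)

# Usage: prepend the projected speech embeddings to the embedded text prompt
# and feed the combined sequence to a decoder-only LLM.
adapter = SpeechToLLMAdapter(speech_dim=1280, llm_dim=4096)
speech_feats = torch.randn(2, 300, 1280)   # e.g. Whisper-style encoder output
speech_embeds = adapter(speech_feats)      # (2, 75, 4096), ready for the LLM
print(speech_embeds.shape)
```

Whether the encoder and LLM are frozen or fine-tuned is left to participants; the rules permit freely accessible pre-trained models.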

Task II: Multilingual Conversational Speech Diarization and Recognition

Objective: Develop a system for both speaker diarization (identifying who is speaking and when) and recognition (transcribing speech to text).

No prior or oracle information will be provided during evaluation (e.g., no pre-segmented utterances or speaker labels).

Both pipeline-based and end-to-end systems are encouraged, providing flexibility in system design and implementation.
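One way to organize a pipeline-style system is sketched below (an illustration only; `diarize` and `transcribe` are hypothetical placeholders for whatever diarization and ASR components a team chooses, and end-to-end designs are equally valid):

```python
# Minimal sketch (assumption, not the official baseline): a pipeline-style Task II
# system that first diarizes a long recording and then transcribes each speaker
# turn. `diarize` and `transcribe` are placeholders; the challenge does not
# prescribe specific models.
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

@dataclass
class Turn:
    speaker: str
    start: float  # seconds
    end: float

def run_pipeline(
    wav: Sequence[float],
    sample_rate: int,
    diarize: Callable[[Sequence[float], int], List[Turn]],
    transcribe: Callable[[Sequence[float], int], str],
) -> List[Tuple[str, float, float, str]]:
    """Return (speaker, start, end, text) tuples for every detected turn."""
    output = []
    for turn in diarize(wav, sample_rate):
        chunk = wav[int(turn.start * sample_rate): int(turn.end * sample_rate)]
        output.append((turn.speaker, turn.start, turn.end, transcribe(chunk, sample_rate)))
    return output
```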

For Task I, system performance will be evaluated using Word Error Rate (WER) or Character Error Rate (CER) across different languages.
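As a reminder of the metric (the standard definition, not anything challenge-specific): WER = (S + D + I) / N, where S, D, and I are the numbers of substituted, deleted, and inserted words and N is the number of reference words; CER is computed the same way over characters. For example, a 100-word reference recognized with 3 substitutions, 1 deletion, and 2 insertions gives WER = (3 + 1 + 2) / 100 = 6%.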

For Task II, performance will be assessed based on the Diarization Error Rate (DER) and the concatenated minimum-permutation WER or CER, referred to as tcpWER or tcpCER. The DER is employed to determine the best speaker-ID permutation between the oracle annotation and the diarization results. Then, the recognition results and references belonging to the same speaker within a recording are concatenated to calculate the tcpWER or tcpCER. All submissions will be ranked according to the tcpWER or tcpCER.
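To make the concatenation step concrete, the sketch below groups segments by speaker, concatenates them in time order, and computes a plain per-speaker error rate (an illustration only, not the official scoring: the actual tcpWER/tcpCER additionally applies time constraints and searches over speaker permutations, for example via the meeteval toolkit; the segment format here is hypothetical):

```python
# Minimal sketch (assumption): concatenate per-speaker hypotheses and references
# in time order, then compute a plain WER per speaker. The official tcpWER adds
# time constraints and a speaker-permutation search.
from collections import defaultdict

def edit_distance(ref, hyp):
    # Standard Levenshtein distance over word lists.
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def per_speaker_wer(ref_segs, hyp_segs):
    # Each segment: (speaker_id, start_time, text). Assumes hypothesis speaker
    # labels already correspond to reference speakers (the permutation step).
    def concat(segs):
        by_spk = defaultdict(list)
        for spk, start, text in sorted(segs, key=lambda s: s[1]):
            by_spk[spk].extend(text.split())
        return by_spk

    ref, hyp = concat(ref_segs), concat(hyp_segs)
    errors = sum(edit_distance(ref[s], hyp.get(s, [])) for s in ref)
    total = sum(len(words) for words in ref.values())
    return errors / total

ref = [("A", 0.0, "hello there"), ("B", 1.2, "hi"), ("A", 2.5, "how are you")]
hyp = [("A", 0.1, "hello there"), ("B", 1.3, "hi"), ("A", 2.6, "how are u")]
print(per_speaker_wer(ref, hyp))  # ~0.167 (1 error out of 6 reference words)
```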

Important Dates (AOE Time)

    March 10, 2025: Registration opens

    March 15, 2025: Training data release

    April 1, 2025: Development set and baseline system release

    May 15, 2025: Evaluation set release and leaderboard open

    May 30, 2025: Leaderboard freeze and paper submission portal opens (CMT system)

    June 15, 2025: Paper submission deadline

    July 1, 2025: Notification of acceptance

    August 22, 2025: Workshop date

Dataset Description

Training set

The training set (Train) covers 11 languages: English (en), French (fr), German (de), Italian (it), Portuguese (pt), Spanish (es), Japanese (jp), Korean (ko), Russian (ru), Thai (th), and Vietnamese (vi).

    Each recording consists of two-speaker conversational speech on randomly assigned topics.

    Conversations are natural and fluent, with speakers engaging in meaningful dialogues on each topic.

    Recorded in quiet indoor environments using devices such as iPhones.

    Oracle segmentation and speaker labels are provided for each recording to support the development of speech recognition and speaker diarization systems.

    Both Task I and Task II share the same training set.

    The English dataset comprises approximately 500 hours of recordings from various regions, including British, American, Australian, Indian, and Philippine English. Other languages contribute around 100 hours each, resulting in a total of approximately 1500 hours of multilingual conversational speech data.

This dataset is designed to provide a rich resource for training and evaluating multilingual conversational speech language models (MLC-SLM), addressing the challenges of linguistic diversity, speaker variability, and contextual understanding.

Language | Data Volume (h) | Sampling Rate | Description
English | 500 in total: American, British, Filipino, Australian, and Indian English (100 h each) | 16 kHz | Covers 5 different accents of English, with speakers from the United States, the United Kingdom, the Philippines, Australia, and India; diverse genders and ages; natural conversation style. Word error rate lower than 2%.
French | 100 | 16 kHz | Word error rate lower than 2%.
German | 100 | 16 kHz | Word error rate lower than 2%.
Italian | 100 | 16 kHz | Word error rate lower than 2%.
Japanese | 100 | 16 kHz | Sentence error rate lower than 5%.
Korean | 100 | 16 kHz | Sentence error rate lower than 5%.
Portuguese (Europe) | 100 | 16 kHz | Word error rate lower than 2%.
Russian | 100 | 16 kHz | Word error rate lower than 2%.
Spanish (Spain) | 100 | 16 kHz | Word error rate lower than 2%.
Thai | 100 | 16 kHz | Word error rate lower than 3%.
Vietnamese | 100 | 16 kHz | Word error rate lower than 2%.

For every language other than English, the recordings were made on mobile phones: each recorder selected several familiar topics and recorded a smooth, natural conversation on each, and the speakers cover a range of genders and ages.

Development set

The development set (Dev) has the same setting as the training set but contains approximately 4 hours of recordings for each language. Both Task I and Task II share the same development set.

Evaluation set

Different evaluation sets are employed for each task, designated as Eval_1 and Eval_2. Specifically, Eval_1 includes oracle timestamps and speaker labels, which are evaluated using WER/CER. Eval_2 does not provide timestamps or speaker labels, necessitating a speaker diarization (SD) system to segment the longer recordings before recognition.
Participants can access the dataset by signing the Data Use Agreement and submitting the registration form. After submission, the data download link will be sent to your email.

Rules

All participants must adhere to the following rules to be eligible for the challenge.

Use of External Resources: For both Task I and Task II, the use of external datasets and pre-trained models (including speech foundation models and LLMs) is permitted. All external resources utilized must be freely accessible to all research groups and must be clearly indicated in the final system report.
Data Augmentation: Data augmentation is allowed on the released training set and may include, but is not limited to, the addition of noise or reverberation, speed perturbation, and tone modification (a brief waveform-level sketch follows these rules).
Prohibition of Evaluation Set Usage: The use of the evaluation sets in any non-compliant form is strictly prohibited. This includes, but is not limited to, using the evaluation sets for fine-tuning or training the model.
Multi-System Fusion: Participants are NOT allowed to employ system fusion in either Task I or Task II. Submitted results must be derived from a single model rather than from result fusion.
Submission Requirement: All participants are required to submit their systems. The submission may include final results, models, a Docker image that can directly perform inference to obtain the final results, etc. Detailed submission instructions will be provided following the release of the baseline implementation. Please note that we will publicly disclose the names of teams and their affiliated institutions that confirmed participation but did not submit any files.
Organizer's Interpretation: The organizers reserve the right to make the final interpretation of these rules. In special circumstances, the organizers will coordinate the interpretation as needed.
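As referenced in the data augmentation rule above, here is a minimal waveform-level sketch of two allowed augmentations, additive noise at a target SNR and speed perturbation by resampling (an illustration under the assumption of 16 kHz mono input; toolkits such as torchaudio offer equivalent, more careful transforms):

```python
# Minimal sketch (assumption): simple waveform-level augmentations of the kind
# allowed by the rules -- additive noise at a target SNR and speed perturbation
# by resampling.
import torch
import torch.nn.functional as F

def add_noise(speech: torch.Tensor, noise: torch.Tensor, snr_db: float) -> torch.Tensor:
    # Scale the noise so the mixture has the requested signal-to-noise ratio.
    # Assumes the noise clip is at least as long as the speech.
    noise = noise[: speech.numel()].clone()
    speech_power = speech.pow(2).mean()
    noise_power = noise.pow(2).mean().clamp(min=1e-10)
    scale = torch.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def speed_perturb(speech: torch.Tensor, factor: float) -> torch.Tensor:
    # Change playback speed by resampling the waveform (also shifts pitch).
    new_len = int(speech.numel() / factor)
    return F.interpolate(speech.view(1, 1, -1), size=new_len, mode="linear",
                         align_corners=False).view(-1)

wav = torch.randn(16000)                 # 1 s of dummy 16 kHz audio
noisy = add_noise(wav, torch.randn(16000), snr_db=10.0)
faster = speed_perturb(wav, factor=1.1)  # ~0.91 s after perturbation
print(noisy.shape, faster.shape)
```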

Other Topics

In addition to challenge system descriptions, participants are encouraged to submit research papers that showcase innovative findings, practical case studies, and forward-looking ideas. Topics of interest include, but are not limited to:

Novel Architectures and Algorithms: Development of new architectures and algorithms for training SLMs.
Audio Data Processing Pipelines: Innovative pipelines for processing raw audio data that facilitate the collection of diverse internet data for training SLMs.
Natural and Emotionally Rich Speech Generation: Algorithms designed to generate more natural and emotionally expressive conversational speech for dialogue systems.
Leveraging Multi-Turn Conversational History: Approaches that utilize multi-turn conversational history to enhance recognition and diarization results.
Evaluation Techniques and Benchmarks: Innovative evaluation techniques or benchmarks specifically tailored for assessing SLMs.
New Datasets: Creation of new datasets, both real and synthetic, for training speech and audio language models.

Data Access and Usage

Registered participants will be given access to the training and testing datasets. They must sign a data use agreement (see below), agree to confidentiality and comply with the data protection agreement. The datasets will only be used for the purpose of the workshop challenge, and redistribution or any other use is strictly prohibited. It is the responsibility of the participant to protect the data from unauthorized access.

Data License Agreement
Data use agreement- nexdata

Registration

To participate, registration is required. Please upload the signed Data Use Agreement and complete the registration form. The challenge begins on March 10, 2025.

For any other information about registration, please send an email to: mlc-slmw@nexdata.ai

Baseline System

Github/MLC-SLM-Baseline

Paper Submission Guidelines

1. Challenge papers:

a. Participants must submit ONE short technical description paper (even if the team participated in both tasks).

b. Length: 2-4 pages of content + 1 page for references.

c. Content Requirements:
  i. Clear system descriptions to assess submission correctness and rule compliance.
  ii. Reproducibility details, including the open-source datasets and models used, data augmentation strategies, model architectures, training configurations, etc.
  iii. Ablation studies demonstrating the method's effectiveness.

d. All challenge participants are expected to present a talk or a poster at the workshop.

2. Non-challenge papers:

a. Length: 4 pages of content + 1 page for references.

b. Topics: include, but are not limited to, the topics listed on the challenge website.

3. Author Kit:

Please use the provided Interspeech 2022 LaTeX author kit (https://www.interspeech2022.org/files/IS2022_paper_kit.zip) for all submissions. Note that we are using the 2022 Interspeech author kit to keep reviewing single-blind.

4. Submission Portal

a. Submit your paper via the CMT conference system.

b. The Microsoft CMT service was used for managing the peer-reviewing process for this conference. This service was provided for free by Microsoft and they bore all expenses, including costs for Azure cloud services as well as for software development and support.

Prizes

Total prize fund: $20,000, sponsored by Huawei Technologies.

Prizes for top-ranking teams in this competition (each task):

1st Place: $5,000
2nd Place: $3,000
3rd Place: $2,000

Competition Results

MLC-SLM Task I

Username | WER/CER | No. | Team Name | Institution
tenp1 | 9.6 | 1 | TENP | Tencent Ethereal Audio Lab
sixteen-years | 9.67 | 2 | sixteen-years | Chinese Academy of Sciences
t-asr | 9.83 | 3 | T-ASR | SHENZHEN TRANSSION HOLDINGS CO., LTD.
megaais | 10.08 | 4 | MegaAIS | Megatronix (Beijing) Technology Co., Ltd.
maxiaoai | 10.56 | 5 | MaXiaoAI | Mashang Consumer Finance Co., Ltd. (MSCF)
ntu_speechlab | 10.58 | 6 | NTU-Speechlab | Nanyang Technological University
cheryfsai | 11.27 | 7 | Cheryfs-AI | Chery HuiYin Motor Finance Service Co., Ltd.
seewo | 11.57 | 8 | seewo | Guangzhou Shirui Electronics Co., Ltd.
daominhtri | 11.71 | 9 | Cake By VPBank | Cake By VPBank
maybe | 11.76 | 10 | May | Shanghai Normal University

MLC-SLM Task II

Username | tcpWER/tcpCER | No. | Team Name | Institution
megaais | 16.53 | 1 | MegaAIS | Megatronix (Beijing) Technology Co., Ltd.
tenp1 | 17.49 | 2 | TENP | Tencent Ethereal Audio Lab
seewo | 17.67 | 3 | seewo | Guangzhou Shirui Electronics Co., Ltd.
duke_kunshan | 18.08 | 4 | DKU | Duke Kunshan University
sixteen-years | 19.27 | 5 | sixteen-years | Chinese Academy of Sciences
cheryfsai | 26.3 | 6 | Cheryfs-AI | Chery HuiYin Motor Finance Service Co., Ltd.
saengthong | 27.25 | 7 | ST-ShinozakiLab | Institute of Science Tokyo
fosafer | 31.68 | 8 | FOSAFER_RESEARCH | Beijing Fosafer Information Technology Co., Ltd.
voicecode | 55.96 | 9 | VoiceCode | VOICECODE TECHNOLOGY PTE. LTD.
517517 | 59.4 | 10 | INFX | Zhejiang University

Note: Only the top 10 entries for each task are listed. For any inquiries regarding team results, please contact the organizing committee.

Venue

Dock 14 at Rotterdam Ahoy Convention Centre, Rotterdam, Netherlands

Registration fee for attending the workshop

Registration Fee: €50

Organizers

    Shinji Watanabe, Associate Professor, Carnegie Mellon University (USA)

    Eng Siong Chng, Professor, Nanyang Technological University (Singapore)

    Junlan Feng, IEEE Fellow & Chief Scientist, China Mobile (China)

    Shuai Wang, Research Scientist, Nanjing University (China)

    Longshuai Xiao, Huawei Technologies (China)

    Khalid Choukri, Secretary General, European Language Resources Association (France)

    Qiangze Feng, Co-founder & Data Scientist, Nexdata (USA)

    Daliang Wang, Data Scientist, Nexdata (USA)

    Hexin Liu, Postdoctoral Researcher, Nanyang Technological University (Singapore)

    Pengcheng Guo, PhD Student, Northwestern Polytechnical University (China)

    Bingshen Mu, PhD Student, Northwestern Polytechnical University (China)

    Zhaokai Sun, Master Student, Northwestern Polytechnical University (China)

Sponsors

Media Partners
