| Paper ID |
Paper Title |
Authors |
| 9 |
Transduce and Speak: Neural Transducer for Text-To-Speech with Semantic Token Prediction |
Minchan Kim (Seoul National University)*; Myeonghun Jeong (Seoul National University); Byoung Jin Choi (Seoul National University); Dongjune Lee (Seoul National University); Nam Soo Kim (Seoul National University) |
| 11 |
Leveraging Multilingual Self-Supervised Pretrained Models for Sequence-To-Sequence End-To-End Spoken Language Understanding |
Pavel Denisov (University of Stuttgart)*; Ngoc Thang Vu (University of Stuttgart) |
| 12 |
LC4SV: A Denoising Framework Learning to Compensate for Unseen Speaker Verification Models |
Chi-Chang Lee (Academia Sinica)*; Hong Wei Chen (National Taiwan University); Chu-Song Chen (National Taiwan University); Hsin-Min Wang (Academia Sinica); Tsung-Te Liu (National Taiwan University); Yu Tsao (Academia Sinica) |
| 22 |
Variational Gaussian Process Data Uncertainty |
Jeremy H. M. Wong (Institute for Infocomm Research)*; Huayun Zhang (ASTAR ); Nancy Chen (Institute for Infocomm Research) |
| 26 |
Low-rank Adaptation of Neural Language Model Rescoring for Speech Recognition |
Yu Yu (Stevens Institute of Technology); Chao-Han Huck Yang (Amazon)*; Jari T Kolehmainen (Amazon); Prashanth Gurunath Shivakumar (Amazon); Yile Gu (Amazon); Sungho Ryu (Amazon); Roger Ren (Amazon); Qi Luo (Amazon.com Inc.); Aditya Gourav (Amazon); I-Fan Chen (Amazon Inc.); Yi Chieh Liu (Amazon); Tuan Dinh (Amazon); Denis Filimonov (Amazon); Ankur Gandhe (Amazon Alexa); Andreas Stolcke (Amazon); Ariya Rastrow (Amazon Alexa); Ivan Bulyko (Amazon) |
| 27 |
CrossSinger: A Cross-Lingual Multi-Singer High-Fidelity Singing Voice Synthesizer Trained on Monolingual Singers |
Xintong Wang (XiaoIce ); Chang Zeng (National Institute of Informatics)*; Jun Chen (Tsinghua University); wang chun hui (XiaoIce) |
| 30 |
Active Learning Based Fine-Tuning Framework for Speech Emotion Recognition |
Dongyuan Li (Tokyo Institute of Technology)*; Yusong WANG (Tokyo Institute of Technology); Kotaro Funakoshi (Tokyo Institute of Technology); Manabu Okumura (Tokyo Institute of Technology) |
| 32 |
Identifying People with Mild Cognitive Impairment At Risk of Developing Dementia Using Speech Analysis |
Bahman Mirheidari (University of Sheffield)*; Ronan O’Malley (University of Sheffield); Daniel Blackburn (University of Sheffield); Heidi Christensen (University of Sheffield) |
| 36 |
BiSinger: Bilingual Singing Voice Synthesis |
Huali Zhou (Wuhan University); Yueqian Lin (Duke Kunshan University); Yao Shi (Duke Kunshan University); Peng Sun (Duke Kunshan University); Ming Li (Duke Kunshan University)* |
| 44 |
Robust Recognition of Speaker Emotion with Difference Feature Extraction Using a Few Enrollment Utterances |
Daichi Hayakawa (Toshiba Corporation Corporate R&D Center)*; Takehiko Kagoshima (Toshiba Corporation Corporate R&D Center); Kenji Iwata (Toshiba Corporation Corporate R&D Center); Rama S Doddipatla (Toshiba Europe LTD); Norbert Braunschweiler (Toshiba Europe Limited) |
| 47 |
Exploring the Viability of Synthetic Audio Data for Audio-Based Dialogue State Tracking |
Jihyun Lee (Pohang University of Science and Technology)*; Yejin Jeon (POSTECH); Wonjun Lee (POSTECH); Yunsu Kim (POSTECH); Gary Geunbae Lee (Postech) |
| 48 |
Prompting Large Language Models for Zero-Shot Domain Adaptation in Speech Recognition |
Yuang Li (University of Cambridge)*; Yu Wu (Microsoft Research Asia); Jinyu Li (Microsoft); Shujie Liu (Microsoft Research Asia) |
| 50 |
MBTFNet: Multi-Band Temporal-Frequency Neural Network for Singing Voice Enhancement |
Weiming Xu (Northwest Polytechnic University)*; Xuanzhou Chen (Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Zhili Tan (Tencent); Shubo Lv (Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University); Runduo Han (Northwestern Polytechnical University); Wenjiang Zhou ( Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Weifeng Zhao ( Lyra Lab, Tencent Music Entertainment, Shenzhen, China); Lei Xie (NWPU) |
| 51 |
Can We Use Speaker Embeddings on Spontaneous Speech Obtained From Medical Conversations to Predict Intelligibility? |
Sebastião Quintas (IRIT, Université de Toulouse, CNRS, Toulouse, France)*; Mathieu Balaguer (IRIT); Julie Mauclair (IRIT); Virginie Woisard (Hospitals of Toulouse); Julien Pinquier (IRIT) |
| 53 |
End-to-End Training of a Neural HMM with Label and Transition Probabilities |
Daniel Mann (RWTH Aachen University)*; Tina Raissi (RWTH Aachen University); Wilfried Michel (AppTek); Ralf Schlüter (RWTH Aachen University); Hermann Ney ( RWTH Aachen University) |
| 54 |
Wiki-En-ASR-Adapt: Large-Scale Synthetic Dataset for English ASR Customization |
Alexandra A Antonova (Moscow Institute of Physics and Technology)* |
| 58 |
The Role of Feature Correlation on Quantized Neural Networks |
David Qiu (Google)*; Shaojin Ding (Google); Yanzhang He (Google) |
| 59 |
LV-CTC: Non-autoregressive ASR With CTC and Latent Variable Models |
Yuya Fujita (Yahoo Japan Corporation)*; Shinji Watanabe (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Takashi Maekaku (Yahoo Japan Corporation) |
| 64 |
The Singing Voice Conversion Challenge 2023 |
Wen-Chin Huang (Nagoya University)*; Lester Phillip G Violeta (Nagoya University); Songxiang Liu (Tencent); Jiatong Shi (Carnegie Mellon University); Tomoki Toda (Nagoya University) |
| 65 |
Improving Multilingual and Code-switching ASR using Large Language Model Generated Text |
Ke Hu (Google)*; Tara Sainath (Google); Bo Li (Google); Yu Zhang (Google); Yong Cheng (Google); Tao Wang (Google Inc.); Yujing Zhang (Google); Frederick Liu (Google Inc.) |
| 66 |
Pareto Efficiency of Learning-Forgetting Trade-Off in Neural Language Model Adaptation |
Jerome R Bellegarda (Apple)* |
| 72 |
Improved Multi-modal Emotion Recognition using Squeeze-and-Excitation Block in Cross-Modal Attention |
Junchen Liu (The University of Auckland)*; Jesin James (The University of Auckland); Karan Nathwani (Indian Institute of Technology, Jammu) |
| 73 |
Improving Large-scale Deep Biasing with Phoneme Features and Text-only Data in Streaming Transducer |
Jin Qiu (ByteDance); Lu Huang (ByteDance)*; Boyu Li (ByteDance); Jun Zhang (Bytedance); Lu Lu (Bytedance); Zejun Ma (Bytedance) |
| 76 |
Locality Enhanced Dynamic Biasing and Sampling Strategies for Contextual ASR |
Md Asif Jalal (Samsung Research UK)*; Pablo Peso Parada (Samsung Research UK); George Pavlidis (Information Technologies Institute, Centre for Research and Technology); Vasileios Moschopoulos (Information Technologies Institute, Centre for Research and Technology - Hellas, Thessaloniki, Greece); KARTHIKEYAN SARAVANAN (Samsung Research, UK); Chrysovalantis G Kontoulis (Pragma-IoT); Jisi Zhang (Samsung Research UK); Anastasios Drosou (Information Technologies Institute, Centre for Research and Technology - Hellas, Thessaloniki, Greece); Jung In Lee (Samsung Electronics); Gil Ho Lee (Samsung Electronics); Seokyeong Jung (Samsung Electronics) |
| 77 |
Robust End-To-End Diarization with Domain Adaptive Training and Multi-Task Learning |
Ivan Fung (Fano Labs)*; Lahiru T Samarakoon (Fano Labs, Hong Kong); Samuel J Broughton (Fano Labs) |
| 78 |
Whisper-SLU: Extending a Pretrained Speech-to-Text Transformer for Low Resource Spoken Language Understanding |
Quentin Meeus (KU Leuven)*; Sien Moens (KU Leuven); Hugo Van hamme (KU Leuven) |
| 80 |
Detecting Speech Abnormalities with a Perceiver-based Sequence Classifier that leverages a Universal Speech Model |
Hagen Soltau (Google)*; Izhak Shafran (Google AI); Alex Ottenwess (Google); Joseph R. JR Duffy (Mayo Clinic); Rene L Utianski (Mayo Clinic); Leland R. Barnard (Mayo); John L. Stricker (Mayo Clinic); Daniela Wiepert (Mayo Clinic); David T. Jones (Mayo Clinic); Hugo Botha (Mayo Clinic) |
| 82 |
Contextual Spelling Correction With Large Language Models |
Gan Song (Google)*; Zelin Wu (Google LLC); Golan Pundak (Google); Angad Chandorkar (Google); Xavier Velez (Google); Diamantino Caseiro (Google); Ben Haynor (Google); Weiran Wang (Google); Nikhil Siddhartha (Google); Kandarp Joshi (Google); Pat Rondon (Google); Khe C Sim (Google Inc.) |
| 83 |
Not All Errors Are Created Equal: Evaluating The Impact of Model And Speaker Factors on ASR Outcomes in Clinical Populations |
Daniela Wiepert (Mayo Clinic)*; Rene L Utianski (Mayo Clinic); Joseph Duffy (Mayo Clinic); John Stricker (Mayo Clinic); Leland Barnard (Mayo Clinic); Keith Josephs (Mayo Clinic Rochester); Jennifer Whitwell (Mayo Clinic Rochester); David Jones (Mayo Clinic); Hugo Botha (Mayo Clinic) |
| 85 |
The Gift of Feedback: Improving ASR Model Quality by Learning from User Corrections through Federated Learning |
Lillian Zhou (Google)*; Yuxin Ding (Google); Mingqing Chen (Google Inc.); Harry Zhang (Google); Rohit Prabhavalkar (Google); Dhruv Guliani (Google); Giovanni Motta (Google, Inc.); Rajiv Mathews (Google) |
| 90 |
FAT-HuBERT: Front-end Adaptive Training of Hidden-unit BERT for Distortion-Invariant Robust Speech Recognition |
Dongning Yang (Shanghai Jiao Tong University)*; wei wang (Shanghai Jiao Tong University); Yanmin Qian (Shanghai Jiao Tong University) |
| 96 |
Learning from Flawed Data: Weakly Supervised Automatic Speech Recognition |
Dongji Gao (Johns Hopkins University)*; Hainan Xu (NVIDIA); Desh Raj (Johns Hopkins University); Paola Garcia (Johns Hopkins University); Daniel Povey (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University) |
| 97 |
Acoustics-Text Dual-Modal Joint Representation Learning for Cover Song Identification |
Yanmei Gu (AntGroup)*; Li Jing (AntGroup); Zhou Jiayi (AntGroup); Wang Zhiming (AntGroup); Zhu Huijia (AntGroup) |
| 99 |
Towards Matching Phones and Speech Representations |
Gene-Ping Yang (The University of Edinburgh)*; Hao Tang (The University of Edinburgh) |
| 101 |
RescueSpeech: A German Corpus for Speech Recognition in Search and Rescue Domain |
Sangeet Sagar (Saarland University )*; Mirco Ravanelli (Université de Montréal); Bernd Kiefer (DFKI); Ivana Kruijff (DFKI); Josef van Genabith (Saarland University) |
| 103 |
Mask-Conformer: Augmenting Conformer with Mask-Predict Decoder |
Yosuke Higuchi (Waseda University)*; Andrew Rosenberg (Google LLC); Yuan Wang (Google); Murali Karthick Baskar (Google Inc); Bhuvana Ramabhadran (Google) |
| 105 |
ED-CEC: Improving Rare Word Recognition Using ASR Post-Processing Based on Error Detection and Context-Aware Error Correction |
Jiajun He (Nagoya University)*; Zekun Yang (Nagoya University); Tomoki Toda (Nagoya University) |
| 108 |
Fast-HuBERT: An Efficient Training Framework for Self-Supervised Speech Representation Learning |
guanrou yang (Shanghai Jiao Tong University)*; Xie Chen (Shanghai Jiaotong University); Ziyang Ma (Shanghai Jiao Tong University); Zhisheng Zheng (Shanghai Jiao Tong University ); Yakun Song (Shanghai Jiao Tong University); Zhikang Niu (Xidian University) |
| 109 |
Can Unpaired Textual Data Replace Synthetic Speech In ASR Model Adaptation? |
Pasquale D'Alterio (Amazon)*; Christian Hensel (Amazon); Bashar Awwad Shiekh Hasan (Amazon) |
| 110 |
Preserving Phonemic Distinctions For Ordinal Regression: A Novel Loss Function For Automatic Pronunciation Assessment |
Bi-Cheng Yan (National Taiwan Normal University )*; Hsin-Wei Wang (NTNU); Yi-Cheng Wang (National Taiwan Normal University); Jiun-Ting Li (National Taiwan Normal University); Chi-Han Lin (E.SUN Financial Holding Co., Ltd.); Berlin Chen (National Taiwan Normal University) |
| 111 |
Exploring Effective Distillation of Self-Supervised Speech Models for Automatic Speech Recognition |
Yujin Wang (Tsinghua University); Changli Tang (Tsinghua University)*; Ziyang Ma (Shanghai Jiao Tong University); Zhisheng Zheng (Shanghai Jiao Tong University ); Xie Chen (Shanghai Jiaotong University); Wei-Qiang Zhang (Tsinghua University) |
| 112 |
Efficient Cascaded Streaming ASR System via Frame Rate Reduction |
Xingyu Cai (Google)*; David Qiu (Google); Shaojin Ding (Google); Dongseong Hwang (Google); Weiran Wang (Google); Antoine Bruguier (Google); Rohit Prabhavalkar (Google); Tara Sainath (Google); Yanzhang He (Google) |
| 119 |
VSANet: Real-time Speech Enhancement Based on Voice Activity Detection and Causal Spatial Attention |
Yuewei Zhang (Shanghai Jiao Tong University)*; Huanbin Zou (Tencent); jie zhu (Shanghai Jiao Tong University) |
| 123 |
Maximizing Data Efficiency for Cross-Lingual TTS Adaptation by Self-Supervised Representation Mixing and Embedding Initialization |
Wei-Ping Huang (National Taiwan University)*; Sung-Feng Huang (National Taiwan University); Hung-yi Lee (National Taiwan University) |
| 124 |
Meta-Learning Framework for End-To-End Imposter Identification in Unseen Speaker Recognition |
Ashutosh Chaubey (LG Ad Solutions); Sparsh Sinha (LG Ad Solutions)*; Susmita Ghose (LG Ad Solutions) |
| 127 |
Using Joint Training Speaker Encoder with Consistency Loss to Achieve Cross-Lingual Voice Conversion and Expressive Voice Conversion |
Houjian Guo (Osaka University, Riken Guardian Robot Group); Chaoran Liu (Riken)*; Carlos T Ishi (RIKEN); Hiroshi Ishiguro (Osaka University) |
| 128 |
QuickVC: A Lightweight VITS-Based Any-To-Many Voice Conversion Model Using iSTFT for Faster Conversion |
Houjian Guo (Osaka University, Riken Guardian Robot Group); Chaoran Liu (Riken)*; Carlos T Ishi (RIKEN); Hiroshi Ishiguro (Osaka University) |
| 130 |
Multi Transcription-Style Speech Transcription Using Attention-Based Encoder-Decoder Model |
Yan Huang (Microsoft Research)*; Piyush Behre (Microsoft); Guoli Ye (Microsoft); Shawn Chang (); Yifan Gong (Microsoft) |
| 134 |
NeuralKalman: A Learnable Kalman Filter for Acoustic Echo Cancellation |
Yixuan Zhang (The Ohio State University)*; Meng Yu (Tencent); Hao Zhang (Tencent AI Lab); Dong Yu (Tencent AI Lab); DeLiang Wang (Ohio State University) |
| 135 |
Thai-Dialect: Low Resource Thai Dialectal Speech to Text Corpora |
Artit Suwanbandit (Chulalongkorn University)*; Jaturong Chitiyaphol (KhonKaen University); Sutthinan Chuenchom (Chiang Mai Rajabhat University); Kanyarat Kwiecien (Khon Kaen University); Husen Sawal (Prince of Songkla University); Ruslan Uthai (Prince of Songkla University); Orathai Sangpetch (CMKL University ); Ekapol Chuangsuwanich (Chulalongkorn University) |
| 140 |
Deep Learning for Joint Acoustic Echo and Acoustic Howling Suppression in Hybrid Meetings |
Hao Zhang (Tencent AI Lab)*; Meng Yu (Tencent); Dong Yu (Tencent AI Lab) |
| 142 |
On Time Domain Conformer Models for Monaural Speech Separation in Noisy Reverberant Acoustic Environments |
William Ravenscroft (The University of Sheffield)*; Stefan Goetze (University of Sheffield); Thomas Hain (University of Sheffield) |
| 143 |
NeuralEcho: Hybrid of Full-Band and Sub-Band Recurrent Neural Network for Acoustic Echo Cancellation and Speech Enhancement |
Meng Yu (Tencent)*; Yong Xu (Tencent); Chunlei zhang (Tencent AI Lab); Shixiong Zhang (Tencent AI Lab); Dong Yu (Tencent AI Lab) |
| 144 |
Combining relative and absolute learning formulations to predict emotional attributes from speech |
Abinay Reddy Naini (The University of Texas at Dallas); Shruthi Subramanium (The University of Texas at Dallas); Seong-Gyun Leem (University of Texas at Dallas); Carlos Busso (University of Texas at Dallas)* |
| 145 |
ESPNet-SUMM: Introducing a novel large dataset, toolkit, and a cross-corpora evaluation of speech summarization systems |
Roshan S Sharma (Carnegie Mellon University)*; William Chen (Carnegie Mellon University); Takatomo Kano (NTT Corporation); Ruchira S Sharma (University of Massachusetts, Amherst); Atsunori Ogawa (NTT Corporation); Siddhant Arora (Carnegie Mellon University); Marc Delcroix (NTT); Rita Singh (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Bhiksha Raj (Carnegie Mellon University) |
| 146 |
Reproducing Whisper-Style Training Using an Open-Source Toolkit and Publicly Available Data |
Yifan Peng (Carnegie Mellon University)*; Jinchuan Tian (Carnegie Mellon University); Brian Yan (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Xinjian Li (Carnegie Mellon University); Jiatong Shi (Carnegie Mellon University); Siddhant Arora (Carnegie Mellon University); William Chen (Carnegie Mellon University); Roshan S Sharma (Carnegie Mellon University); Wangyou Zhang (Shanghai Jiao Tong University); Yui Sudo (Honda Research Institute Japan); Muhammad Mr. Shakeel (Honda Research Institute Japan); Jee-weon Jung (Carnegie Mellon University); Soumi Maiti (CMU); Shinji Watanabe (Carnegie Mellon University) |
| 155 |
Segment-Level Vectorized Beam Search Based on Partially Autoregressive Inference |
Masao Someki (IBM)*; Nicholas Eng (The University of Auckland); Yosuke Higuchi (Waseda University); Shinji Watanabe (Carnegie Mellon University) |
| 156 |
Joint Prediction and Denoising for Large-Scale Multilingual Self-Supervised Learning |
William Chen (Carnegie Mellon University)*; Jiatong Shi (Carnegie Mellon University); Brian Yan (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Wangyou Zhang (Shanghai Jiao Tong University); Yifan Peng (Carnegie Mellon University); Xuankai Chang (Carnegie Mellon University); Soumi Maiti (CMU); Shinji Watanabe (Carnegie Mellon University) |
| 163 |
Findings of the 2023 ML-SUPERB Challenge: Pre-Training and Evaluation over More Languages and Beyond |
Jiatong Shi (Carnegie Mellon University)*; William Chen (Carnegie Mellon University); Dan Berrebbi (Carnegie Mellon University); Hsiu-Hsuan Wang (National Taiwan University ); Wei Ping Huang (National Taiwan University); En Pei Hu (National Taiwan University); ho lam Chung (National Taiwan University); Xuankai Chang (Carnegie Mellon University); Yuxun Tang (Renmin University of China); Shang-Wen Li (Meta AI); Abdelrahman Mohamed (Rembrand Inc); Hung-yi Lee (National Taiwan University); Shinji Watanabe (Carnegie Mellon University) |
| 165 |
Diffusion-Based Mel-Spectrogram Enhancement For Personalized Speech Synthesis With Found Data |
Yusheng Tian (The Chinese University of Hong Kong)*; Wei Liu (The Chinese University of Hong Kong); Tan Lee (The Chinese University of Hong Kong) |
| 166 |
SQAT-LD: Speech Quality Assessment Transformer Utilizing Listener Dependent Modeling For Zero-Shot Out-Of-Domain MOS Prediction |
Kailai Shen (Ningbo University); Diqun Yan (Ningbo University)*; Li Dong (Ningbo University); Ren Ying (Ningbo University); Xiaoxun Wu (Ningbo University); Jing Hu (Ningbo University) |
| 170 |
Scenario-Aware Audio-Visual TF-GridNet For Target Speech Extraction |
Zexu Pan (National University of Singapore)*; Gordon Wichern (Mitsubishi Electric Research Laboratories (MERL)); Yoshiki Masuyama (Tokyo Metropolitan University); François G Germain (Mitsubishi Electric Research Laboratories (MERL)); Sameer Khurana (Mitsubishi Electric Research Lab); Chiori Hori (Mitsubishi Electric Research Laboratories (MERL)); Jonathan LeRoux (Mitsubishi Electric Research Laboratories (MERL)) |
| 171 |
Generative ASR Error Correction With Large Language Models |
Chao-Han Huck Yang (Amazon)*; Yile Gu (Amazon.com, USA); Yi-Chieh Liu (Georgia Institute of Technology ); Shalini Ghosh (Amazon Alexa AI); Ivan Bulyko (Amazon); Andreas Stolcke (Amazon) |
| 172 |
Enhancing Task-Oriented Dialogues With Chitchat: A Comparative Study Based On Lexical Diversity And Divergence |
Armand Stricker (LISN, CNRS)*; Patrick Paroubek (LISN) |
| 174 |
Token-Level Serialized Output Training For Joint Streaming ASR And ST Leveraging Textual Alignments |
Sara Papi (FBK)*; Peidong Wang (Microsoft); Junkun Chen (Microsoft); JIAN XUE (Microsoft Corporation); Jinyu Li (Microsoft); Yashesh Gaur (Microsoft) |
| 175 |
LAE-ST-MoE: Boosted Language-Aware Encoder Using Speech Translation Auxiliary Task For E2E Code-Switching ASR |
Guodong Ma (NetEase Yidun AI Lab)*; Wenxuan Wang (NetEase Yidun AI Lab); Yuke Li (NetEase Yidun AI Lab); Yuting Yang (NetEase Yidun AI Lab); Binbin Du (NetEase Yidun AI Lab); Haoran Fu (Department of Civil Engineering, Zhejiang University) |
| 177 |
A Token-Wise Beam Search Algorithm For RNN-T |
Gil Keren (Facebook)* |
| 181 |
Joint Federated Learning And Personalization For On-Device ASR |
Junteng Jia (Meta AI)*; Ke Li (Johns Hopkins University); Mani Malek (Meta); Kshitiz Malik (Meta); Jay Mahadeokar (Meta AI); Ozlem Kalinli (Meta); Frank Seide (Meta AI) |
| 183 |
MelHuBERT: A Simplified HuBERT On Mel Spectrograms |
Tzu-Quan Lin (National Taiwan University)*; Hung-yi Lee (National Taiwan University); Hao Tang (The University of Edinburgh) |
| 185 |
Exploring Data Augmentation In Bias Mitigation Against Non-Native-Accented Speech |
Yuanyuan Zhang (Technische Universiteit Delft)*; Aaricia Herygers (-); Tanvina Patel (Multimedia computing, Delft University of Technology ); Zhengjun Yue (Technische Universiteit Delft); Odette Scharenborg (Multimedia Computing Group, Delft University of Technology) |
| 190 |
AWMC: Online Test-Time Adaptation Without Mode Collapse For Continual Adaptation |
Jae-Hong Lee (Hanyang University)*; Dohee Kim (Hanyang University); Joon-Hyuk Chang (Hanyang University) |
| 192 |
LE-SSL-MOS: Self-Supervised Learning MOS Prediction With Listener Enhancement |
Zili Qi (Hithink RoyalFlush AI Research Institute)*; Xinhui Hu (Hithink RoyalFlush AI Research Institute); Wangjin Zhou (Kyoto University); Sheng Li (National Institute of Information & Communications Technology (NICT)); Hao Wu (Hithink RoyalFlush AI Research Institute); Jian Lu (Hithink RoyalFlush AI Research Institute); Xinkang Xu (Hithink RoyalFlush AI Research Institute) |
| 198 |
Transcribing And Aligning Conversational Speech: A Hybrid Pipeline Applied To French Conversations |
Hiroyoshi Yamasaki (Aix-Marseille University); Jérôme Louradour (Linagora); Julie Hunter (LINAGORA); Laurent Prevot (Aix Marseille Université & CNRS)* |
| 201 |
FedCPC: An Effective Federated Contrastive Learning Method For Privacy Preserving Early-Stage Alzheimer’s Speech Detection |
wenqing wei (Japan Advanced Institute of Science and Technology); Zhengdong Yang (Kyoto University); Gao Yuan (Japan Advanced Institute of Science and Technology); Jiyi Li (University of Yamanashi); Chenhui Chu (Kyoto University); Shogo Okada (Japan Advanced Institute of Science and Technology); Sheng Li (National Institute of Information & Communications Technology (NICT))* |
| 204 |
Toward General-Purpose Text-Instruction-Guided Voice Conversion |
Chun-Yi Kuan (National Taiwan University)*; Chen An Li (National Taiwan University); Tsu-Yuan Hsu (National Taiwan University); Tse-Yang Lin (National Taiwan University); ho lam Chung (National Taiwan University); Kai-Wei Chang (National Taiwan University); Shuo-yiin Chang (Google); Hung-yi Lee (National Taiwan University) |
| 206 |
AV-data2vec: Self-Supervised Learning Of Audio-Visual Speech Representations With Contextualized Target Representations |
Jiachen Lian (University of California Berkeley)*; Alexei Baevski (Facebook AI Research); Wei-Ning Hsu (Meta); Michael Auli (Meta) |
| 207 |
Improving Stability In Simultaneous Speech Translation: A Revision-Controllable Decoding Approach |
Junkun Chen (Microsoft)*; JIAN XUE (Microsoft Corporation); Peidong Wang (Microsoft); Jing Pan (Microsoft); Jinyu Li (Microsoft) |
| 211 |
Haha-Pod: An Attempt For Laughter-Based Non-Verbal Speaker Verification |
Yuke Lin (Wuhan University); Xiaoyi Qin (Duke Kunshan University); Ning Jiang (Mashang Consumer Finance Co., Ltd.); Guoqing Zhao (Mashang Consumer Finance Co., Ltd); Ming Li (Duke Kunshan University)* |
| 220 |
PP-MeT: A Real-World Personalized Prompt Based Meeting Transcription System |
xiang lyu (ximalaya)*; Yuhang Cao (ximalaya); qing wang (ximalaya); Jingjing Yin (Ximalaya); Yuguang Yang (Ximalaya Inc., ShangHai, China); pengpeng zou (ximalaya); xuecheng hu (ximalaya); yanni hu (ximalaya); heng lu (ximalaya) |
| 223 |
Brouhaha: Multi-Task Training For Voice Activity Detection, Speech-To-Noise Ratio, And C50 Room Acoustics Estimation |
Marvin Lavechin (ENS, Meta AI)*; Marianne Metais (ENS); Hadrien Titeux (ENS); Alodie Boissonnet (Meta AI); Jade Copet (Meta AI); Morgane Riviere (Meta AI); Elika Bergelson (Duke University); Alejandrina Cristia (Exelang, CNRS, LSCP); Emmanuel Dupoux (EHESS, ENS, PSL University, CNRS, INRIA, META); Hervé Bredin (CNRS) |
| 224 |
Magnitude-And-Phase-Aware Speech Enhancement With Parallel Sequence Modeling |
Yuewei Zhang (Shanghai Jiao Tong University)*; Huanbin Zou (Tencent); jie zhu (Shanghai Jiao Tong University) |
| 228 |
Speaker Adaptation For End-To-End Speech Recognition Systems In Noisy Environments |
Dominik Wagner (Technische Hochschule Nuernberg Georg Simon Ohm)*; Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm); Sebastian P Bayerl (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Tobias Bocklet (TH Nürnberg ) |
| 233 |
Improving Severity Preservation of Healthy-To-Pathological Voice Conversion With Global Style Tokens |
Bence Halpern (Netherlands Cancer Institute)*; Wen-Chin Huang (Nagoya University); Lester Phillip G Violeta (Nagoya University); Rob van Son (Netherlands Cancer Institute); Tomoki Toda (Nagoya University) |
| 235 |
End-to-end Multichannel Speaker-Attributed ASR: Speaker Guided Decoder and Input Feature Analysis |
Can Cui (Inria)*; Imran Sheikh (Vivoka); Mostafa Sadeghi (INRIA); Emmanuel Vincent (Inria) |
| 238 |
GPU-Accelerated WFST Beam Search Decoder for CTC-based Speech Recognition |
Daniel Galvez (NVIDIA)*; Tim Kaldewey (NVIDIA) |
| 239 |
Audio-AdapterFusion: A Task-ID-free Approach for Efficient and Non-Destructive Multi-task Speech Recognition |
Hillary Ngai (Google)*; Rohan Agrawal (Google); Parisa Haghani (Google); Pedro J Moreno (Google); W. Ronny Huang (Google); Neeraj Gaur (Google) |
| 243 |
CAMSAT: Augmentation Mix and Self-Augmented Training Clustering for Self-Supervised Speaker Recognition |
Abderrahim Fathan (Computer Research Institute of Montreal (CRIM), Montreal, Quebec, Canada)*; Jahangir Alam (Computer Research Institute of Montreal (CRIM), Montreal (Quebec) Canada) |
| 244 |
Toward Universal Speech Enhancement For Diverse Input Conditions |
Wangyou Zhang (Shanghai Jiao Tong University)*; Kohei Saijo (Waseda University); Zhong-Qiu Wang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Yanmin Qian (Shanghai Jiao Tong University) |
| 245 |
Adversarial Augmentation for Adapter Learning |
Jen-Tzung Chien (National Yang Ming Chiao Tung University)*; Wei-Yu Sun (National Yang Ming Chiao Tung University) |
| 246 |
Optimizing Two-Pass Cross-Lingual Transfer Learning: Phoneme Recognition And Phoneme To Grapheme Translation |
Wonjun Lee (POSTECH)*; Yunsu Kim (POSTECH); Gary Geunbae Lee (Postech) |
| 248 |
CTC Blank Triggered Dynamic Layer-Skipping For Efficient CTC-Based Speech Recognition |
Junfeng Hou (Netease)*; Peiyao Wang (Netease); Jincheng Zhang (Netease); Meng Yang (Netease); Minwei Feng (Netease); Jingcheng Yin (Netease) |
| 252 |
Prompt Pool Based Class-Incremental Continual Learning for Dialog State Tracking |
Hong Liu (Tsinghua University)*; Yucheng Cai (tsinghua university); Yuan Zhou (None); Zhijian Ou (Tsinghua University); Yi Huang (China Mobile Research); Junlan Feng (China Mobile Research) |
| 253 |
Model-based Fairness Metric for Speaker Verification |
Maliha Jahan (Johns Hopkins University)*; Laureano Moro-Velazquez (Johns Hopkins University); Thomas Thebaud (Johns Hopkins University); Najim Dehak (Johns Hopkins University); Jesus Antonio Villalba (Johns Hopkins University) |
| 258 |
The VoiceMOS Challenge 2023: Zero-shot Subjective Speech Quality Prediction for Multiple Domains |
Erica Cooper (National Institute of Informatics)*; Wen-Chin Huang (Nagoya University); Yu Tsao (Academia Sinica); Hsin-Min Wang (Academia Sinica); Tomoki Toda (Nagoya University); Junichi Yamagishi (National Institute of Informatics) |
| 263 |
Cross-Modal Alignment with Optimal Transport for CTC-Based ASR |
Xugang Lu (NICT)*; Peng Shen (NICT); Yu Tsao (Academia Sinica); Hisashi Kawai (NICT) |
| 264 |
Study on the Correlation between Objective Evaluations and Subjective Speech Quality and Intelligibility |
Hsin-Tien Chiang (Academia Sinica); Kuo-Hsuan Hung (Academia Sinica); Szu-Wei Fu (NVIDIA); Heng-Cheng Kuo (Academia Sinica); Ming-Hsueh Tsai (National Academy for Educational Research ); Yu Tsao (Academia Sinica)* |
| 265 |
Prompting and Adapter Tuning for Self-supervised Encoder-Decoder Speech Model |
Kai-Wei Chang (National Taiwan University)*; Ming-Hsin Chen (National Taiwan University); Yun-Ping Lin (National Taiwan University); Jing Neng Hsu (National Taiwan University); Paul KM Huang (NTU); Chien-yu Huang (National Taiwan University); Shang-Wen Li (FAIR); Hung-yi Lee (National Taiwan University) |
| 266 |
VoiceExtender: Short-utterance Text-independent Speaker Verification with Guided Diffusion Model |
Yayun He (Ping An Technology (Shenzhen) Co., Ltd)*; Zuheng Kang (Ping An Technology (Shenzhen) Co., Ltd); Jianzong Wang (Ping An Technology (Shenzhen) Co., Ltd); Junqing Peng (Ping An Technology (Shenzhen) Co., Ltd); Jing Xiao (Ping An Insurance (Group) Company of China) |
| 268 |
Zero-Shot Singing Voice Synthesis From Musical Score |
Jun-You Wang (National Taiwan University)*; Hung-yi Lee (National Taiwan University); Roger Jang (); Li Su (Academia Sinica) |
| 271 |
PerMod: Perceptually Grounded Voice Modification with Latent Diffusion Models |
Robin Netzorg (UC Berkeley)*; Ajil Jalal (UC Berkeley ); Luna McNulty (Brown University); Gopala Krishna Anumanchipalli (UC Berkeley) |
| 272 |
Boosting Modality Representation with Pre-trained Models and Multi-task Training for Multimodal Sentiment Analysis |
Jiarui Hai (Johns Hopkins University)*; Yu-Jeh Liu (Johns Hopkins University); Mounya Elhilali (Johns Hopkins University) |
| 274 |
Efficient Text-Only Domain Adaptation for CTC-based ASR |
Chang Chen (Shanghai Jiao Tong University); Xun Gong (Shanghai Jiaotong University)*; Yanmin Qian (Shanghai Jiao Tong University) |
| 275 |
Adapting Pretrained Speech Model For Mandarin Lyrics Transcription And Alignment |
Jun-You Wang (National Taiwan University)*; Chon In Leong (National Taiwan University); Yu-Chen Lin (National Taiwan University); Li Su (Academia Sinica); Roger Jang () |
| 276 |
Partial Rank Similarity Minimization Method for Quality MOS Prediction of Unseen Speech Synthesis Systems in Zero-Shot and Semi-Supervised Setting |
Hemant Yadav (IIIT Delhi)*; Erica Cooper (National Institute of Informatics); Junichi Yamagishi (National Institute of Informatics); Sunayana Sitaram (Microsoft Research); Rajiv Ratn Shah (IIIT Delhi) |
| 282 |
Coco-Nut: Corpus Of Japanese Utterance And Voice Characteristics Description For Prompt-Based Control |
Aya Watanabe (The University of Tokyo)*; Shinnosuke Takamichi (The University of Tokyo); Yuki Saito (The University of Tokyo, Japan); Wataru Nakata (The University of Tokyo); Detai Xin (The University of Tokyo); Hiroshi Saruwatari (The University of Tokyo) |
| 284 |
Generative Linguistic Representation For Spoken Language Identification |
Peng Shen (NICT)*; Xugang Lu (NICT); Hisashi Kawai (NICT) |
| 287 |
Spike-Triggered Contextual Biasing For End-To-End Mandarin Speech Recognition |
Kaixun Huang (NWPU)*; Ao Zhang (Northwestern Polytechnical University); Binbin Zhang (Horizon Robotics); Tianyi Xu (NWPU); Xingchen Song (Tsinghua University); Lei Xie (NWPU) |
| 290 |
Towards Robust Packet Loss Concealment System With ASR-Guided Representations |
Da-Hee Yang (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
| 294 |
U2-KWS: Unified Two-Pass Open-Vocabulary Keyword Spotting With Keyword Bias |
Ao Zhang (Northwestern Polytechnical University)*; Pan Zhou (Li Auto Inc.); Kaixun Huang (NWPU); Yong Zou (Li Auto Inc. ); Ming Liu (Li Auto Inc.); Lei Xie (NWPU) |
| 295 |
Consistency Based Unsupervised Self-Training For ASR Personalisation |
Jisi Zhang (Samsung Research UK)*; Vandana Rajan (Samsung Research UK); Haaris Mehmood (Samsung Research UK); David Tuckey (Samsung Research UK); Pablo Peso Parada (Samsung Research UK); Md Asif Jalal (Samsung Research UK); Karthikeyan Saravanan (Samsung Research, UK); Gil Ho Lee (Samsung Electronics); Jung In Lee (Samsung Electronics); Seokyeong Jung (Samsung Electronics) |
| 299 |
Towards A Unified End-To-End Language Understanding System For Speech And Text Inputs |
Mohan Li (Toshiba Europe Ltd)*; Catalin Zorila (Toshiba Cambridge Research Lab); Cong-Thanh Do (Toshiba Research Europe Ltd.); Rama S Doddipatla (Toshiba Europe LTD) |
| 301 |
On Decoder-only Architecture for Speech-to-text and Large Language Model Integration |
Jian Wu (Microsoft)*; Yashesh Gaur (Microsoft); Zhuo Chen (Microsoft); Long Zhou (Microsoft Research Asia); Yimeng Zhu (Microsoft China); Tianrui Wang (Microsoft Research Asia); Jinyu Li (Microsoft); Shujie Liu (Microsoft Research Asia); Bo Ren (Microsoft); Linquan Liu (Microsoft China); Yu Wu (Microsoft Research Asia) |
| 302 |
Paraconsistent Feature Analysis For the Competency Evaluation of Voice Impersonation |
Rajeev Rajan (Government Engineering College, Barton Hill, Trivandrum)*; Noumida A (College Of Engineering Trivandrum); Sreelakshmi S (Government Engineering College, Barton Hill) |
| 303 |
Knowledge Distillation from Offline to Streaming Transducer: Toward Accurate and Fast Streaming Model by Matching Alignments |
Ji-Hwan Mo (Hanyang University); Jae-Jin Jeon (Kakao Enterprise Corporation); Munhak Lee (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
| 304 |
Transformer Attractors for Robust and Efficient End-to-end Neural Diarization |
Lahiru T Samarakoon (Fano Labs, Hong Kong)*; Samuel J Broughton (Fano Labs); Marc Härkönen (Fano Labs); Ivan Fung (Fano Labs) |
| 308 |
Detection of Vowel Errors in Children's Speech Using Synthetic Phonetic Transcripts |
Ilja Baumann (Technische Hochschule Nürnberg Georg Simon Ohm)*; Dominik Wagner (Technische Hochschule Nürnberg Georg Simon Ohm); Korbinian Riedhammer (Technische Hochschule Nürnberg Georg Simon Ohm); Elmar Noeth (Friedrich-Alexander-Universität Erlangen-Nürnberg); Tobias Bocklet (TH Nürnberg) |
| 312 |
Invert-Classify: Recovering Discrete Prosody Inputs for Text-to-Speech |
Nicholas J Sanders (University of Edinburgh)*; Korin Richmond (University of Edinburgh) |
| 313 |
KAQ: A Non-Intrusive Stacking Framework for Mean Opinion Score Prediction with Multi-Task Learning |
Chenglin Xu (Kuaishou Technology)*; Xiguang Zheng (Beijing Dajia Internet Information Technology Co., Ltd.); Chen Zhang (Beijing Dajia Internet Information Technology Co., Ltd.); Chao Zhou (Kuaishou Technology); Qi Huang (Kuaishou Technology); Bing Yu (Kuaishou Technology) |
| 316 |
SA-Paraformer: Non-autoregressive End-to-End Speaker-Attributed ASR |
Yangze Li (Northwestern Polytechnical University)*; Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yuhao Liang (Northwestern Polytechnical University); Pengcheng Guo (Northwestern Polytechnical University); Mohan Shi (University of Science and Technology of China); Zhihao Du (Speech Lab of DAMO Academy, Alibaba Group); Shiliang Zhang (Alibaba Group); Lei Xie (Northwestern Polytechnical University) |
| 318 |
Simulation of Teacher-Learner Interaction in English Language Pronunciation Learning |
Elaf Islam (The University of Sheffield)*; Thomas Hain (University of Sheffield); Protima Nomo Sudro (University of Sheffield) |
| 320 |
Ending The Blind Flight: Analyzing The Impact of Acoustic And Lexical Factors on Wav2Vec 2.0 in Air-Traffic Control |
Alexander Blatt (Saarland University)*; Badr Abdullah (Saarland University); Dietrich Klakow (Saarland University) |
| 323 |
Cross-modal learning for CTC-based ASR: Leveraging CTC-BERTScore and sequence-level training |
Munhak Lee (Hanyang University); Sang-Eon Lee (Hanyang University); Jieun Choi (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
| 324 |
Clustering Unsupervised Representations As Defense Against Poisoning Attacks on Speech Commands Classification System |
Thomas Thebaud (Johns Hopkins University)*; Sonal Joshi (Johns Hopkins University); Henry Li Xinyuan (Johns Hopkins University); Martin Sustek (Johns Hopkins University); Jesús Antonio Villalba López (Johns Hopkins University (JHU)); Sanjeev Khudanpur (Johns Hopkins University); Najim Dehak (Johns Hopkins University) |
| 326 |
ECAPA2: A Hybrid Neural Network Architecture and Training Strategy for Robust Speaker Embeddings |
Jenthe Thienpondt (IDLab, Ghent University)*; Kris Demuynck (Ghent University) |
| 327 |
Multitask Learning Model with Text And Speech Representation for Fine-Grained Speech Scoring |
Seongjin Park (Educational Testing Service)*; Rutuja Ubale (Educational Testing Service Research) |
| 330 |
LibriSpeech-PC: Benchmark For Evaluation Of Punctuation And Capitalization Capabilities Of End-To-End ASR Models |
Aleksandr Meister (NVIDIA)*; Matvei Novikov (NVIDIA); Nikolay Karpov (NVIDIA); Evelina Bakhturina (NVIDIA); Vitaly Lavrukhin (NVIDIA); Boris Ginsburg (NVIDIA) |
| 331 |
Evaluating Self-Supervised Speech Models on A Taiwanese Hokkien Corpus |
Yi-Hui Chou (Carnegie Mellon University)*; Kalvin Chang (Carnegie Mellon University); Meng-Ju Wu (N/A); Winston Ou (Scripps College); Alice Wen-Hsin Bi (University of Maryland); Carol Yang (N/A); Bryan Y. Chen (Swarthmore College); Rong-Wei Pai (National Taiwan Normal University); Po-Yen Yeh (China Medical University, Taiwan); Jo-Peng Chiang (National Taiwan University); Iu-Tshiann Phoann (N/A); Winnie Chang (Carnegie Mellon University); Chenxuan Cui (Carnegie Mellon University); Noel Chen (Carnegie Mellon University); Jiatong Shi (Carnegie Mellon University) |
| 332 |
Fast Conformer With Linearly Scalable Attention For Efficient Speech Recognition |
Dima Rekesh (NVIDIA)*; Nithin Rao Koluguri (NVIDIA); Samuel Kriman (NVIDIA); Somshubra Majumdar (NVIDIA); Vahid Noroozi (NVIDIA); He Huang (NVIDIA); Oleksii Hrinchuk (NVIDIA); Krishna C Puvvada (NVIDIA); Ankur Kumar (UCLA); Jagadeesh Balam (NVIDIA); Boris Ginsburg (NVIDIA) |
| 333 |
Parameter-Efficient Cross-Language Transfer Learning For A Language-Modular Audiovisual Speech Recognition |
Zhengyang Li (Technische Universität Carolo-Wilhelmina Braunschweig)*; Thomas Graave (Technische Universität Carolo-Wilhelmina Braunschweig); Jing Liu (Amazon.com); Timo Lohrenz (Technische Universität Carolo-Wilhelmina Braunschweig); Siegfried Kunzmann (Amazon.com); Tim Fingscheidt (Technische Universität Braunschweig) |
| 336 |
Generalized Zero-Shot Audio-to-Intent Classification |
Veera Raghavendra Elluru (AWS AI Labs)*; Devang Kulshreshtha (Amazon); Rohit Paturi (AWS AI Labs); Sravan Babu Bodapati (Amazon); Srikanth Ronanki (Amazon) |
| 337 |
Investigating the Effect of Language Models in Sequence Discriminative Training for Neural Transducers |
Zijian Yang (Lehrstuhl fuer Informatik 6, RWTH Aachen)*; Wei Zhou (Chair of Computer Science 6, RWTH Aachen University); Ralf Schlüter (RWTH Aachen University); Hermann Ney (RWTH Aachen University) |
| 339 |
TorchAudio 2.1: Advancing Speech Recognition, Self-Supervised Learning, and Audio Processing Components for PyTorch |
Jeff Hwang (Meta)*; Moto Hira (Meta); Caroline Chen (Meta); Xiaohui Zhang (Meta); Zhaoheng Ni (Meta AI); Guangzhi Sun (University of Cambridge Department of Engineering); Pingchuan Ma (Meta); Ruizhe Huang (Johns Hopkins University); Vineel Pratap (Facebook); Yuekai Zhang (NVIDIA); Anurag Kumar (Facebook Reality Labs); Chin-Yun Yu (Queen Mary University of London); Chuang Zhu (NVIDIA); Chunxi Liu (Two Sigma); Jacob D Kahn (Facebook AI Research); Mirco Ravanelli (Université de Montréal); Peng Sun (NVIDIA); Shinji Watanabe (Carnegie Mellon University); Yangyang Shi (Facebook); Yumeng Tao (Meta) |
| 340 |
Deriving Translational Acoustic Sub-Word Embeddings |
Amit Meghanani (University of Sheffield)*; Thomas Hain (University of Sheffield) |
| 342 |
A Weakly-Supervised Streaming Multilingual Speech Model with Truly Zero-Shot Capability |
Jian Xue (Microsoft Corporation)*; Peidong Wang (Microsoft); Jinyu Li (Microsoft); Eric Sun (Microsoft) |
| 343 |
Transferring Speech-Generic and Depression-Specific Knowledge for Alzheimer's Disease Detection |
Ziyun Cui (Tsinghua University)*; Wen Wu (University of Cambridge); Chao Zhang (Tsinghua University); Wei-Qiang Zhang (Tsinghua University); Ji Wu (Tsinghua University) |
| 344 |
Robust Logarithmic Champernowne Algorithm for Feedback Cancellation in Hearing Aids |
Vanitha Devi R (National Institute of Technology Warangal)*; Vasundhara (NIT Warangal) |
| 347 |
Hierarchical Attention-Based Contextual Biasing for Personalized Speech Recognition Using Neural Transducers |
Sibo Tong (Amazon)*; Philip Harding (Amazon Alexa); Simon Wiesler (Amazon) |
| 352 |
E3 TTS: Easy End-To-End Diffusion-Based Text To Speech |
Yuan Gao (Google)*; Nobuyuki Morioka (Google); Yu Zhang (Google); Nanxin Chen (Google) |
| 354 |
Building High-Accuracy Multilingual ASR with Gated Language Experts And Curriculum Training |
Eric Sun (Microsoft)*; Jinyu Li (Microsoft); Yuxuan Hu (Microsoft); Yimeng Zhu (Microsoft); Long Zhou (Microsoft Research Asia); Jian Xue (Microsoft Corporation); Peidong Wang (Microsoft); Linquan Liu (Microsoft); Shujie Liu (Microsoft Research Asia); Ed C Lin (Microsoft); Yifan Gong (Microsoft) |
| 360 |
FLAP: Fast Language-Audio Pre-Training |
Ching-Feng Yeh (Facebook AI Research)*; Po-Yao Huang (Facebook AI Research); Vasu Sharma (Facebook AI Research); Shang-Wen Li (FAIR); Gargi Ghosh (Facebook AI Research) |
| 361 |
On The Relevance Of Phoneme Duration Variability Of Synthesized Training Data For Automatic Speech Recognition |
Nick Rossenbach (RWTH Aachen University / AppTek GmbH)*; Benedikt Hilmes (HLT); Ralf Schlüter (RWTH Aachen University) |
| 362 |
Enabling Noisy Label Usage for Out-Of-Airspace Data in Read-Back Error Detection |
Lakshmi Rajendram Bashyam (ZBW - Leibniz-Informationszentrum Wirtschaft); Alexander Blatt (Saarland University)*; Dietrich Klakow (Saarland University) |
| 367 |
Enhancing Expressivity Transfer in Textless Speech-to-Speech Translation |
Jarod Duret (LIA)*; Benjamin O'Brien (LIA - Avignon University); Yannick Estève (LIA - Avignon University); Titouan Parcollet (Samsung AI Research) |
| 368 |
Dialect Adaptation and Data Augmentation for Low-Resource ASR: Team XYZ Systems for the MADASR 2023 Challenge |
Tanel Alumae (Tallinn University of Technology)*; Jiaming Kong (Tallinn University of Technology); Daniil Robnikov (Tallinn University of Technology) |
| 370 |
Reducing the Cost of Spoof Detection Labeling Using Mixed-Strategy Active Learning and Pretrained Models |
Mark R Lindsey (Carnegie Mellon University)*; Nathaniel R Robinson (Carnegie Mellon University); Francis Kubala (Probity, Inc.); Richard M Stern (Carnegie Mellon University) |
| 373 |
A Single Speech Enhancement Model Unifying Dereverberation, Denoising, Speaker Counting, Separation, and Extraction |
Kohei Saijo (Waseda University)*; Wangyou Zhang (Shanghai Jiao Tong University); Zhong-Qiu Wang (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University); Tetsunori Kobayashi (Waseda University); Tetsuji Ogawa (Waseda University) |
| 375 |
Two-Pass Endpoint Detection for Speech Recognition |
Anirudh Raju (Amazon Alexa); Di He (Amazon); Aparna Khare (Amazon)*; Ilya Sklyar (Amazon); Long Chen (Amazon); Viet Anh Tranh (Amazon); Zhe Zhang (Amazon); Colin Vaz (Amazon); Sam Alptekin (Amazon); Venkatesh Ravichandran (Amazon); Roland Maas (Amazon Inc.); Ariya Rastrow (Amazon Alexa) |
| 379 |
Improved Long-Form Speech Recognition by Jointly Modeling The Primary and Non-Primary Speakers |
Guru Prakash Arumugam (Google LLC)*; Shuo-yiin Chang (Google); Tara Sainath (Google); Rohit Prabhavalkar (Google); Quan Wang (Google); Shaan Bijwadia (Google) |
| 380 |
Exploring Time-Frequency Domain Target Speaker Extraction For Causal and Non-Causal Processing |
Wangyou Zhang (Shanghai Jiao Tong University)*; Lei Yang (Samsung Research China – Beijing); Yanmin Qian (Shanghai Jiao Tong University) |
| 381 |
Joint Energy-Based Model for Robust Speech Classification System against Dirty-Label Backdoor Poisoning Attacks |
Martin Sustek (Brno University of Technology)*; Sonal Joshi (Johns Hopkins University); Henry Li Xinyuan (Johns Hopkins University); Thomas Thebaud (Johns Hopkins University); Jesus Antonio Villalba (Johns Hopkins University); Sanjeev Khudanpur (Johns Hopkins University); Najim Dehak (Johns Hopkins University) |
| 382 |
Importance of Smoothness Induced by Optimizers in FL4ASR: Towards Understanding Federated Learning for End-to-End ASR |
Sheikh Shams Azam (Apple)*; Tatiana Likhomanenko (Apple); Martin Pelikan (Apple); Jan Silovsky (Apple) |
| 383 |
Joint Audio and Speech Understanding |
Yuan Gong (Massachusetts Institute of Technology)*; Alexander H Liu (MIT); Hongyin Luo (MIT); Leonid Karlinsky (IBM-Research); James Glass (Massachusetts Institute of Technology) |
| 385 |
No Pitch Left Behind: Addressing Gender Unbalance in Automatic Speech Recognition Through Pitch Manipulation |
Dennis Fucci (Fondazione Bruno Kessler)*; Marco Gaido (Fondazione Bruno Kessler); Matteo Negri (Fondazione Bruno Kessler); Mauro Cettolo (Fondazione Bruno Kessler); Luisa Bentivogli (Fondazione Bruno Kessler) |
| 387 |
Improving Audiovisual Active Speaker Detection in Egocentric Recordings with the Data-efficient Image Transformer |
Jason Clarke (University of Sheffield)*; Yoshihiko Gotoh (University of Sheffield); Stefan Goetze (University of Sheffield) |
| 391 |
YODAS: YouTube-Oriented Dataset for Audio and Speech |
Xinjian Li (Carnegie Mellon University)*; Shinnosuke Takamichi (The University of Tokyo); Takaaki Saeki (The University of Tokyo); William Chen (Carnegie Mellon University); Sayaka Shiota (Tokyo Metropolitan University); Shinji Watanabe (Carnegie Mellon University) |
| 392 |
Discriminative Speech Recognition Rescoring With Pre-trained Language Models |
Prashanth Gurunath Shivakumar (Amazon)*; Jari T Kolehmainen (Amazon); Yile Gu (Amazon.com, USA); Ankur Gandhe (Amazon Alexa); Ariya Rastrow (Amazon Alexa); Ivan Bulyko (Amazon) |
| 394 |
Unconstrained Dysfluency Modeling for Dysfluent Speech Transcription and Detection |
Jiachen Lian (University of California Berkeley)*; Carly Z Feng (University of California, Berkeley); Naasir S Farooqi (UC Berkeley); Steve Li (Berkeley Speech Group); Anshul P Kashyap (UC Berkeley); Cheol Jun Cho (UC Berkeley); Peter Wu (UC Berkeley); Robin Netzorg (UC Berkeley); Tingle Li (UC Berkeley); Gopala Krishna Anumanchipalli (UC Berkeley) |
| 395 |
MiniSUPERB: Lightweight Benchmark for Self-Supervised Speech Models |
Yu-Hsiang Wang (National Taiwan University)*; Huang-Yu Chen (National Taiwan University); Kai-Wei Chang (National Taiwan University); Winston H. Hsu (National Taiwan University); Hung-yi Lee (National Taiwan University) |
| 399 |
MASR: Multi-Label Aware Speech Representation Learning |
Anjali Raj (Google); Shikhar Bharadwaj (Google); Sriram Ganapathy (Google); Min Ma (Google Research); Shikhar Vashishth (Google)* |
| 403 |
A Comparative Study of Voice Conversion Models with Large-Scale Speech and Singing Data: The T13 Systems for the Singing Voice Conversion Challenge 2023 |
Ryuichi Yamamoto (LINE Corp.)*; Reo Yoneyama (Nagoya University); Lester Phillip G Violeta (Nagoya University); Wen-Chin Huang (Nagoya University); Tomoki Toda (Nagoya University) |
| 404 |
Extending Self-distilled Self-supervised Learning for Semi-supervised Speaker Verification |
Jeong-Hwan Choi (Hanyang University); Jehyun Kyung (Hanyang University); Ju-seok Seong (Hanyang University); Ye-Rin Jeoung (Hanyang University); Joon-Hyuk Chang (Hanyang University)* |
| 406 |
Pseudo-label based Supervised Contrastive Loss for Robust Speech Representations |
Varun Krishna PS Krishna (Indian Institute of Science)*; Sriram Ganapathy (Indian Institute of Science, Bangalore, India, 560012) |
| 409 |
Audio-Visual Neural Syntax Acquisition |
Cheng-I Lai (MIT)*; Haoyue Shi (Toyota Technological Institute at Chicago); Puyuan Peng (The University of Texas at Austin); Yoon Kim (MIT); Kevin Gimpel (Toyota Technological Institute at Chicago); Shiyu Chang (UCSB); Yung-Sung Chuang (MIT); Saurabhchand Bhati (Johns Hopkins University); David Cox (MIT-IBM Watson AI Lab); David Harwath (The University of Texas at Austin); Yang Zhang (IBM T. J. Watson Research); Karen Livescu (TTI-Chicago); James Glass (Massachusetts Institute of Technology) |
| 411 |
Improving Speech Enhancement Using Audio Tagging Knowledge From Pre-Trained Representations And Multi-Task Learning |
Shaoxiong Lin (Shanghai Jiao Tong University); Chao Zhang (Tsinghua University); Yanmin Qian (Shanghai Jiao Tong University)* |
| 412 |
BA-MoE: Boundary-Aware Mixture-Of-Experts Adapter For Code-Switching Speech Recognition |
Peikun Chen (Northwestern Polytechnical University)*; Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yuhao Liang (Northwestern Polytechnical University); Hongfei Xue (NWPU); Xuchen Wan (Huawei Technologies Co., Ltd.); Naijun Zheng (Huawei Technologies Co., Ltd.); Zhou Huan (AARC, Huawei Technologies Co., Ltd.); Lei Xie (Northwestern Polytechnical University) |
| 413 |
Zero-shot Domain-sensitive Speech Recognition with Prompt-conditioning Fine-tuning |
Yung-Chieh Chan (MediaTek Research); Feng-Ting Liao (MediaTek Research)*; Chan-Jan Hsu (MediaTek Research); Yi-Chang Chen (MediaTek Research); Da-shan Shiu (MediaTek Research) |
| 414 |
Few-Shot Spoken Language Understanding via Joint Speech-Text Models |
Chung-Ming Chien (Toyota Technological Institute at Chicago)*; Mingjiamei Zhang (University of Chicago); Ju-Chieh Chou (TTIC); Karen Livescu (TTI-Chicago) |
| 415 |
Summarize while Translating: Universal Model with Parallel Decoding for Summarization and Translation |
Takatomo Kano (NTT Corporation)*; Atsunori Ogawa (NTT Corporation); Marc Delcroix (NTT); Kohei Matsuura (NTT); Takanori Ashihara (NTT Corp.); William Chen (Carnegie Mellon University); Shinji Watanabe (Carnegie Mellon University) |
| 417 |
Acoustic Model Fusion for End-To-End Speech Recognition |
Zhihong Lei (Apple); Mingbin Xu (Apple Inc.)*; Shiyi Han (Apple); Leo Liu (Apple); Zhen Huang (Apple); Tim Ng (Apple); Yuanyuan Zhang (Apple); Ernest Pusateri (Apple Inc.); Mirko Hannemann (Apple); Yaqiao Deng (Apple); Man-Hung Siu (Apple) |
| 420 |
Domain Adaptation by Data Distribution Matching via Submodularity for Speech Recognition |
Yusuke Shinohara (Yahoo Japan Corporation)*; Shinji Watanabe (Carnegie Mellon University) |
| 422 |
The Second Multi-Channel Multi-Party Meeting Transcription Challenge (M2Met 2.0): A Benchmark for Speaker-Attributed ASR |
Yuhao Liang (Northwestern Polytechnical University)*; Mohan Shi (University of Science and Technology of China); Fan Yu (Speech Lab of DAMO Academy, Alibaba Group); Yangze Li (Northwestern Polytechnical University); Shiliang Zhang (Alibaba Group); Zhihao Du (Speech Lab of DAMO Academy, Alibaba Group); Lei Xie (Northwestern Polytechnical University); Yanmin Qian (Shanghai Jiao Tong University); Jian Wu (Microsoft); Zhuo Chen (Microsoft); Kong Aik Lee (ICT Cluster, Singapore Institute of Technology); Zhijie Yan (Alibaba Inc.); Hui Bu (AISHELL) |
| 423 |
SLM: Bridging The Thin Gap Between Speech and Text Foundational Models |
Mingqiu Wang (Google Inc)*; Wei Han (Google); Izhak Shafran (Google AI); Zelin Wu (Google LLC); Chung-Cheng Chiu (Google); Yuan Cao (Google Brain); Nanxin Chen (Google); Yu Zhang (Google); Hagen Soltau (Google); Paul Rubenstein (Google); Lucas Zilka (Google); Dian Yu (Google); Golan Pundak (Google); Nikhil Siddhartha (Google.com); Johan Schalkwyk (Google); Yonghui Wu (Google) |
| 424 |
An Exploration of Task-decoupling on Two-stage Neural Post Filter for Real-time Personalized Acoustic Echo Cancellation |
Zihan Zhang (Northwestern Polytechnical University)*; Jiayao Sun (Northwestern Polytechnical University); Xianjun Xia (RTC Lab, ByteDance); Ziqian Wang (Northwestern Polytechnical University); Xiaopeng Yan (Northwestern Polytechnical University); Yijian Xiao (ByteDance); Lei Xie (Northwestern Polytechnical University) |
| 425 |
Improving Whispered Speech Recognition Performance using Pseudo-whispered based Data Augmentation |
Zhaofeng Lin (Multimedia Computing Group, Delft University of Technology); Tanvina Patel (Multimedia computing, Delft University of Technology); Odette Scharenborg (Multimedia Computing Group, Delft University of Technology)* |
| 426 |
Leveraging The Multilingual Indonesian Ethnic Languages Dataset in Self-Supervised Model for Low-Resource ASR Task |
Sakriani Sakti (Japan Advanced Institute of Science and Technology)*; Benita Angela Titalim (JAIST) |
| 429 |
PromptSpeaker: Speaker Generation Based on Text Descriptions |
Yongmao Zhang (Audio, Speech and Language Processing Group (ASLP@NPU), School of Computer Science, Northwestern Polytechnical University, Xi’an, China)*; Guanghou Liu (Northwestern Polytechnical University); Yi Lei (Northwestern Polytechnical University); Yunlin Chen (Mobvoi); Hao Yin (Mobvoi); Lei Xie (NWPU); Zhifei Li (Mobvoi) |
| 430 |
HiGNN-TTS: Hierarchical Prosody Modeling with Graph Neural Networks for Expressive Long-form TTS |
Dake Guo (Northwestern Polytechnical University)*; Xinfa Zhu (Northwestern Polytechnical University); Liumeng Xue (The Chinese University of Hong Kong, Shenzhen); Tao Li (School of Computer Science, Northwestern Polytechnical University, Xi’an); Yuanjun Lv (Northwestern Polytechnical University); Yuepeng Jiang (Northwestern Polytechnical University); Lei Xie (NWPU) |
| 433 |
Zero-Shot Emotion Transfer for Cross-Lingual Speech Synthesis |
Yuke Li (Audio, Speech and Language Processing Group (ASLP@NPU))*; Xinfa Zhu (Northwestern Polytechnical University); Yi Lei (Northwestern Polytechnical University); Hai Li (iQIYI Inc); Junhui Liu (iQIYI Inc); Danming Xie (iQIYI); Lei Xie (NWPU) |
| 435 |
VITS-Based Singing Voice Conversion Leveraging Whisper and multi-scale F0 Modeling |
Ziqian Ning (Northwestern Polytechnical University)*; Yuepeng Jiang (Northwestern Polytechnical University); Bin Zhang (Tencent Music Entertainment Group (TME)); Lei Xie (NWPU); Zhichao Wang (Northwestern Polytechnical University) |
| 436 |
SALT: Distinguishable Speaker Anonymization Through Latent Space Transformation |
Yuanjun Lv (Northwestern Polytechnical University)*; Jixun Yao (Northwestern Polytechnical University); Peikun Chen (Northwestern Polytechnical University); Hongbin Zhou (Ximalaya Inc.); Heng Lu (Ximalaya Inc.); Lei Xie (Northwestern Polytechnical University) |
| 440 |
MUST: A Multilingual Student-Teacher Learning Approach for Low-Resource Speech Recognition |
Muhammad Umar Farooq (University of Sheffield)*; Rehan Ahmad (University of Sheffield); Thomas Hain (University of Sheffield) |
| 441 |
WaveNeXt: ConvNeXt-based fast neural vocoder without iSTFT layer |
Takuma Okamoto (National Institute of Information and Communications Technology)*; Haruki Yamashita (Kobe University); Yamato Ohtani (National Institute of Information and Communications Technology); Tomoki Toda (Nagoya University); Hisashi Kawai (NICT) |
| 445 |
Spectral Tilt May Have a Smaller Impact on The Intelligibility of Speech in Noise |
Yoshiki Sato (University of Aizu)*; Julián Villegas (University of Aizu) |
| 447 |
H_Eval: A New Hybrid Evaluation Metric For Automatic Speech Recognition Tasks |
Zitha Sasindran (Indian Institute of Science)*; Harsha Yelchuri (Information Science Engineering RV College of Engineering Bengaluru, India); Prabhakar Venkata Tamma (Electronics Systems Engg); Supreeth Rao (Indian Institute of Science) |
| 453 |
Towards Developing State-of-the-Art TTS Synthesisers for 13 Indian Languages with Signal Processing aided Alignments |
Anusha Prakash (Indian Institute of Technology Madras)*; S Umesh (IIT Chennai); Hema A Murthy (IIT Madras) |
| 456 |
Parameter-Efficient Tuning with Adaptive Bottlenecks for Automatic Speech Recognition |
Geoffroy Vanderreydt (IDLab)*; Amrutha Prasad (Idiap Research Institute); Srikanth Madikeri (Idiap); Driss Khalil (Idiap Research Institute); Kris Demuynck (Ghent University); Petr Motlicek (Idiap) |
| 468 |
Semi-Supervised Multi-Channel Speaker Diarization with Cross-Channel Attention |
Shilong Wu (University of Science and Technology of China)*; Jun Du (University of Science and Technology of China); Mao-Kui He (University of Science and Technology of China); Shutong Niu (University of Science and Technology of China); Hang Chen (USTC); Haitao Tang (iFLYTEK Research); Chin-hui Lee (Georgia Institute of Technology) |
| 473 |
Gated Multi Encoders and Multitask Objectives For Dialectal Speech Recognition in Indian Languages |
Sathvik Udupa (Indian Institute of Science)*; Jesuraj Bandekar (IISc); Deekshitha G (IISc); Saurabh Kumar (IISc Bengaluru); Prasanta Ghosh (); Sandhya Badiger (IISc Bangalore); Abhayjeet Singh (Indian Institute of Sciences, Bangalore, India); Savitha S Murthy (IISc); Priyanka Pai (Navana Tech, Mumbai); Srinivasa Raghavan (Navana Tech, Mumbai); Rohan Saxena (Navana Tech, Mumbai) |
| 478 |
VITS-Based Singing Voice Conversion System with DSPGAN Post-Processing for SVCC2023 |
Yiquan Zhou (XJTU)*; Chen Meng (TME); Yi Lei (Northwestern Polytechnical University); Jihua Zhu (Xi'an Jiaotong University); Weifeng Zhao (Tencent) |