This dataset accompanies the research paper Towards Real-World Streaming Speech Translation for Code-Switched Speech.
In the paper, we investigate the translation of code-switched speech into a third language (i.e., a language not present in the source). To this end, we extend the Fisher and Miami test and validation datasets, which contain English-Spanish code-switched speech, with new translation targets in monolingual Spanish and German.
This dataset extends a code-switching-focused dataset split that accompanies an earlier paper, End-to-End Speech Translation for Code Switched Speech, which can be found here.
- Please follow the instructions found here.
- The file names in this dataset indicate which parallel data they belong to. Please note that a small portion of the translations are marked as `<removed>`; these should not be included in evaluation.
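The exclusion of removed translations can be sketched as follows. This is a minimal illustration, not part of the dataset tooling: it assumes that system outputs and reference translations are available as two line-aligned lists of strings, and that a removed reference consists of the marker alone on its line. The function name `drop_removed` is hypothetical.

```python
# Minimal sketch: drop segments whose reference is marked as removed.
# Assumption: hypotheses and references are line-aligned lists of strings,
# and a removed reference is exactly the marker (ignoring whitespace).

REMOVED_MARKER = "<removed>"


def drop_removed(hypotheses, references, marker=REMOVED_MARKER):
    """Return (hypotheses, references) without segments marked as removed."""
    if len(hypotheses) != len(references):
        raise ValueError("hypotheses and references must be line-aligned")
    kept = [
        (hyp, ref)
        for hyp, ref in zip(hypotheses, references)
        if ref.strip() != marker
    ]
    kept_hyps = [hyp for hyp, _ in kept]
    kept_refs = [ref for _, ref in kept]
    return kept_hyps, kept_refs


if __name__ == "__main__":
    hyps = ["hallo welt", "guten morgen", "bis bald"]
    refs = ["hallo welt", "<removed>", "bis später"]
    filtered_hyps, filtered_refs = drop_removed(hyps, refs)
    print(filtered_hyps)
    print(filtered_refs)
```

The filtered lists can then be passed to whichever evaluation metric is in use, so that removed segments affect neither the hypothesis side nor the reference side.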
The Fisher and Miami datasets are licensed differently; please refer to the LICENSE files in the respective subdirectories.
If you use this dataset, please cite our paper as follows:
Belen Alastruey, Matthias Sperber, Christian Gollan, Dominic Telaar, Tim Ng, Aashish Agargwal (2023). Towards Real-World Streaming Speech Translation for Code-Switched Speech. EMNLP 2023 Workshop Computational Approaches to Linguistic Code-Switching (CALCS).