Dataset size
28,853 hours of audio with text transcripts; 2.9 TB in .wav format; recording languages: Russian and English. The large volume of data supports effective training of neural networks. The dataset is distributed under the CC-BY 4.0 license: the data can be copied, redistributed, used for commercial purposes, and used to create derivative works.
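As a rough sanity check on the figures above, the total size and duration imply an average data rate per second of audio. The sketch below is only illustrative: it assumes that "TB" means 10^12 bytes and that the 2.9 TB figure covers audio only. The helper name `average_bytes_per_second` is hypothetical, and the 16 kHz, 16-bit mono comparison is an assumption, since the description does not state the sample rate or bit depth.

```python
# Figures taken from the dataset description.
HOURS = 28_853
SIZE_TB = 2.9


def average_bytes_per_second(hours, size_tb, bytes_per_tb=10**12):
    """Average storage used per second of audio.

    Assumes decimal terabytes (10**12 bytes) unless bytes_per_tb is overridden.
    """
    return size_tb * bytes_per_tb / (hours * 3600)


rate = average_bytes_per_second(HOURS, SIZE_TB)

# Assumed reference point (not stated in the description):
# uncompressed 16-bit mono PCM at 16 kHz uses 16000 * 2 bytes per second.
reference_rate = 16_000 * 2

print(f"implied average: {rate:,.0f} bytes/s ({rate * 8 / 1000:.0f} kbit/s)")
print(f"16 kHz 16-bit mono PCM: {reference_rate:,} bytes/s")
```

If the size were instead reported in binary units (2^40 bytes per TB), passing `bytes_per_tb=2**40` gives a somewhat higher rate, so the result should be read as an order-of-magnitude estimate only.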
Recording quality
Consistent quality across all data: all recordings are converted to standard parameters, and the texts match the audio recordings. The audio recordings are annotated manually, and the annotation is thoroughly checked. To build the dataset, we work with professional speakers and with partner sources (YouTube channels, audiobook publishers, news agencies, etc.) under license agreements.
The current composition of the dataset
EngAudiobooksOriginal
English audiobooks, recorded on professional equipment, annotated by forced alignment
EngAudiobooksNoisy
Noisy English audiobooks augmented to simulate phone calls, recorded on professional equipment, annotated by forced alignment
RuAudiobooksDevices
Russian audiobooks, recorded on non-professional equipment, annotated manually
RuDevices
Russian live speech, recorded on mobile devices and other non-professional equipment, annotated manually
RuYoutube
Russian-language video, recorded on non-professional equipment, annotated using ASR
Almost 30,000 hours
The dataset currently contains 28,853 hours of audio recordings and is regularly updated.