SOVA Dataset

An open set of labeled audio data for training speech recognition and synthesis models. One of the largest Russian open datasets.

Dataset size

28,853 hours of audio with transcripts, 2.9 TB in .wav format. Recording languages: Russian and English. The large volume of data supports effective training of neural networks. The dataset is distributed under the CC-BY 4.0 license: the data can be copied, distributed, used for commercial purposes, and used to create derivative works.
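
As a rough sanity check on the figures above, the stated duration and size imply an average amount of storage per second of audio. The sketch below assumes that "2.9 TB" means decimal terabytes (10^12 bytes) and that the whole volume is audio payload; the source confirms neither, and the function name is purely illustrative.

```python
def average_bytes_per_second(total_hours: float, total_bytes: float) -> float:
    """Return the mean number of bytes stored per second of audio."""
    total_seconds = total_hours * 3600
    return total_bytes / total_seconds


HOURS = 28_853        # duration stated in the dataset description
SIZE_BYTES = 2.9e12   # assumption: "2.9 TB" read as decimal terabytes

rate = average_bytes_per_second(HOURS, SIZE_BYTES)
print(f"{rate:,.0f} bytes per second of audio")

# For comparison only: uncompressed 16 kHz, 16-bit mono PCM needs
# 16_000 samples/s * 2 bytes = 32_000 bytes/s. The source does not
# state the actual sampling parameters of the recordings.
```

An estimate like this helps when planning disk space or download time for a subset of the data.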

Recording quality

Uniform quality across all data: all recordings are converted to standard parameters, and the texts match the audio recordings. Audio markup is done manually and thoroughly checked. To build the dataset, we work with professional speakers and with partner sources (YouTube channels, audiobook publishers, news agencies, etc.) under license agreements.

The current composition of the dataset

EngAudiobooksOriginal

English audiobooks, recorded on professional equipment, forced-alignment markup

EngAudiobooksNoisy

Noisy English audiobooks with phone-call augmentation, recorded on professional equipment, forced-alignment markup

RuAudiobooksDevices

Russian audiobooks, recorded on non-professional equipment, manual markup

RuDevices

Russian live speech, recorded on mobile devices and other non-professional equipment, manual markup

RuYoutube

Video in Russian, recorded on non-professional equipment, markup produced with ASR
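
The subsets listed above differ in language, recording equipment, and markup method, which matters when choosing data for a particular task. The following is a minimal sketch of that catalog as a Python data structure with a simple filter. The `Subset` class, the field names, the language codes, and the `select` helper are all hypothetical and are not part of any SOVA tooling; only the subset names and their attributes come from the list above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Subset:
    """Hypothetical record describing one subset of the dataset."""
    name: str
    language: str   # "en" or "ru" (illustrative codes)
    equipment: str  # "professional" or "non-professional"
    markup: str     # "forced alignment", "manual", or "ASR"


SUBSETS = [
    Subset("EngAudiobooksOriginal", "en", "professional", "forced alignment"),
    Subset("EngAudiobooksNoisy", "en", "professional", "forced alignment"),
    Subset("RuAudiobooksDevices", "ru", "non-professional", "manual"),
    Subset("RuDevices", "ru", "non-professional", "manual"),
    Subset("RuYoutube", "ru", "non-professional", "ASR"),
]


def select(language: Optional[str] = None,
           markup: Optional[str] = None) -> List[Subset]:
    """Return subsets matching every filter that is not None."""
    return [
        s for s in SUBSETS
        if (language is None or s.language == language)
        and (markup is None or s.markup == markup)
    ]


# Example: Russian subsets with manually produced markup.
manual_ru = select(language="ru", markup="manual")
print([s.name for s in manual_ru])
```

A filter like this could be used, for instance, to keep manually labeled subsets for evaluation and automatically labeled ones for training.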

Almost 30,000 hours

The dataset currently contains 28,853 hours of audio recordings and is regularly updated.

Now and for free

Use our open-source SOVA Dataset to train speech recognition and speech synthesis models.