Indian Language Speech Corpus

Microsoft India has announced the availability of the Microsoft Indian Language Speech Corpus, which offers speech training and test data for Telugu, Tamil and Gujarati. This is the largest publicly available Indian language speech dataset that includes audio and corresponding transcripts. It is intended to help researchers and academia build Indian language speech recognition for any application that uses speech.

The corpus is provided through the Microsoft Research Open Data initiative, a collection of free datasets from Microsoft Research meant to advance state-of-the-art research in areas such as natural language processing, computer vision, and domain-specific sciences.

The corpus was tested at the recently concluded Interspeech 2018 in Hyderabad, as part of a Low Resource Speech Recognition Challenge in which a few participants used data from Microsoft’s newly launched corpus to build Automatic Speech Recognition (ASR) systems.

In parallel, a consortium led by the Union Government has been working for over a decade to enable translation between Indian languages. The initiative, titled Sampark, is aimed at facilitating machine translation among Indian languages. The consortium includes the International Institute of Information Technology (Hyderabad), the University of Hyderabad, C-DAC, Anna University KBC Chennai, and a few IITs and IIITs.
