Speech Corpora - Links

1. LDC - Linguistic Data Consortium
The Linguistic Data Consortium supports language-related education, research and technology development by creating and sharing linguistic resources: data, tools and standards.

http://www.ldc.upenn.edu/

2. European Corpus Initiative Multilingual Corpus I (ECI/MCI)
The European Corpus Initiative (ECI) was founded to oversee the acquisition and preparation of a large multilingual corpus (ECI/MCI) to be made available in digital form for scientific research at as low a cost as possible. The corpus has been available on CD-ROM since 1994 and is distributed by ELSNET.

http://www.elsnet.org/resources/eciCorpus.html


3. Center for Spoken Language Understanding at OGI (CSLU)

http://cslu.cse.ogi.edu/


4. CoSIH - The Corpus of Spoken Hebrew

http://www.tau.ac.il/humanities/semitic/cosih.html


5. Stanford University
Transcribed and phonetically annotated speech databases

http://www.stanford.edu/~tiflo/focus/l_speech_corpora.html

6. The W3C Corpus Linguistics, University of Essex

http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/introduction.html

7. Definitive Resource of Organizations, Programs and Centers in Linguistics

http://www.voices.com/articles/languages-accents-and-dialects/linguist-list.html

8. Lexicon and Corpora for Speech to Speech Translation (LC-STAR)

http://www.k-star.com


9. University of Hawaii, Department of Linguistics Databases & Corpora

http://www.ling.hawaii.edu/corpora.htm


10. ACL SIGLEX Resource Links

http://www.clres.com/phonetic.html


11. OrienTel
Multilingual access to interactive communication services for the Mediterranean and the Middle East

http://www.orientel.org

12. SALA (SpeechDat across Latin America) speech databases

http://www.sala2.org/

13. SpeechDat
Databases for the Creation of Voice Driven Teleservices

http://www.speechdat.org/SpeechDat.html

14. SPEECON
Speech-driven Interfaces for Consumer Devices

http://www.speechdat.org/speecon/index.html

15. LILA
A collection of a large number of spoken-language databases for training automatic speech recognition systems in the Asia-Pacific area.

http://www.lilaproject.org/


16. ATLAS - Architecture and Tools for Linguistic Analysis System
ATLAS develops, produces and distributes language resources. The ATLAS catalogue includes oral databases for speech recognition and text-to-speech conversion. Currently, ATLAS produces databases in Spain and America.

http://www.atlas-cti.com

17. Australian National Database of Spoken Language (ANDOSL)

http://andosl.anu.edu.au/andosl/


18. MESSC International Scientific Research Program:
Joint Research on "Spoken Language Databases and Prosodic Labeling"
MESSC: Japanese Ministry of Education, Science, Sports and Culture

19. International Committee for Co-ordination and Standardisation of Speech Databases (COCOSDA)

http://www.cocosda.org/


20. Oriental COCOSDA

http://isw3.aist-nara.ac.jp/IS/Shikano-lab/database/internet-resource/speech_db/itahashi-1.html


21. Signal Processing Information Base (SPIB)
The Signal Processing Information Base (SPIB) is a project sponsored by the Signal Processing Society and the National Science Foundation. SPIB contains information repositories of data, newsgroups, bibliographies, links to other repositories, and addresses, all of which are relevant to signal processing research and development.

http://spib.rice.edu/spib/signal.html

22. Noise sources

http://spib.rice.edu/spib/select_noise.html


23. The ShATR Multiple Simultaneous Speaker Corpus
ShATR is a corpus of overlapped speech collected by the University of Sheffield Speech and Hearing Research Group in collaboration with ATR in order to support research into computational auditory scene analysis. The task involved four participants working in pairs to solve two crosswords. A fifth participant acted as a hint-giver. Eight channels of audio data were recorded from the following sensors: one close microphone per speaker, one omnidirectional microphone, and the two channels of a binaurally wired manikin. Around 41% of the corpus contains overlapped speakers. In addition, a variety of other audio data was collected from each participant. The entire corpus, which has a duration of around 37 minutes, has been segmented and transcribed at 5 levels, from subtasks down to phones. In addition, all nonspeech sounds have been marked.

http://dcs.shef.ac.uk/spandh/projects/shatrweb/
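
As a rough illustration of how an overlap figure like the 41% quoted above might be derived from per-speaker segmentations, here is a minimal Python sketch. It assumes that "overlap" means the share of the total recording time during which two or more speakers are active. That is one plausible reading and not necessarily the definition used by the ShATR authors. The function name, the (start, end) segment format, and the example times are hypothetical and are not taken from the corpus.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds), hypothetical format


def overlap_fraction(speakers: Iterable[List[Segment]], total_duration: float) -> float:
    """Fraction of total_duration during which at least two speakers are active.

    Assumes each speaker's own segments do not overlap one another.
    """
    if total_duration <= 0:
        raise ValueError("total_duration must be positive")

    # Turn every segment into a +1 event at its start and a -1 event at its end.
    events = []
    for segments in speakers:
        for start, end in segments:
            if end <= start:
                continue  # skip empty or malformed segments
            events.append((start, 1))
            events.append((end, -1))

    # Sorting tuples puts an end (-1) before a start (+1) at the same time,
    # so segments that merely touch are not counted as overlapping.
    events.sort()

    active = 0          # number of speakers talking right now
    overlapped = 0.0    # accumulated time with two or more active speakers
    previous_time = 0.0
    for time, delta in events:
        if active >= 2:
            overlapped += time - previous_time
        active += delta
        previous_time = time

    return overlapped / total_duration


# Made-up example: two speakers in a 10-second recording.
speaker_a = [(0.0, 4.0), (6.0, 8.0)]
speaker_b = [(3.0, 7.0)]
print(overlap_fraction([speaker_a, speaker_b], 10.0))
```

The sweep over sorted start and end events keeps the computation at O(n log n) in the number of segments, which matters little for a 37-minute corpus but scales to larger collections.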


24. Intelligent Electronic Systems (IES)

IES is well known for its highly collaborative, team-oriented environment. This is a rather unusual group compared to most university research environments, since the team concept influences just about every aspect of our work (including lunch). Faculty, staff, and students must undergo rigorous testing to assess their suitability for our group!

http://www.isip.msstate.edu/projects/nsf_nonlinear/


25. The EMU Speech Database System
EMU is a collection of software tools for the creation, manipulation and analysis of speech databases. At the core of EMU is a database search engine which allows the researcher to find various speech segments based on the sequential and hierarchical structure of the utterances in which they occur. EMU includes an interactive labeller which can display spectrograms and other speech waveforms, and which allows the creation of hierarchical, as well as sequential, labels for a speech utterance.

http://emu.sourceforge.net/


26. Corpora@Stanford
This site contains information for corpus users at the linguistics department and CSLI. The "Getting Started" section is focused on corpora newbies. Thus more experienced users (and users familiar with the local setup) should probably head straight to the "Available Resources" section of this site. While this site definitely also contains information that is relevant to experienced users (of corpora), it focuses on giving support for "beginners".
That is, if you want to know how to gather lots of fascinating examples for your research, search for certain syntactic patterns, or browse speech recordings, but you do not know how or even where to begin, these pages are an attempt to provide some guidance for the first steps in the big world of corpora. The site can also be used as a reference.

http://www.stanford.edu/dept/linguistics/corpora/

27. University of Ghent, Belgium, Department of Electronics and Information Systems (ELIS)

http://www.elis.rug.ac.be/ELISgroups/speech/cost249/database/

28. The ICP (Institute of Spoken Communication)
The ICP was founded in 1983 to focus on speech as an object of research. Drawing on knowledge, methodologies, experimental tools and their particular scientific and technological challenges, the ICP's researchers cover all the main research areas in the field, including signals, language, and cognitive motor and perception systems.

http://www.icp.inpg.fr/ICP/_ressources.en.html

29. Arcadia Computing Innovation
Large scale general purpose telephone speech databases for Japanese speech recognition systems

http://www.arcadia.co.jp/


30. Speech at Carnegie Mellon University (CMU)

http://www.speech.cs.cmu.edu/


31. The CHRISTINE Project: SUSANNE meets spoken English
Sponsored by the Economic & Social Research Council (UK), the CHRISTINE project set out to extend the SUSANNE analytic scheme and Corpus to cover spoken English, and particularly spontaneous, informal spoken English.

http://www.grsampson.net/RChristine.html

32. Speech Processing, Joe Campbell, Johns Hopkins University, Whiting School of Engineering

http://www.apl.jhu.edu/Classes/525747/index.html