
Speech Corpora - Links
1. LDC - Linguistic Data Consortium
The Linguistic Data Consortium supports language-related education, research
and technology development by creating and sharing linguistic resources: data,
tools and standards.
http://www.ldc.upenn.edu/
2. European Corpus Initiative Multilingual Corpus I (ECI/MCI)
The European Corpus Initiative (ECI) was founded to oversee the acquisition
and preparation of a large multilingual corpus (ECI/MCI) to be made available
in digital form for scientific research at as low a cost as possible. The corpus
has been available on CD-ROM since 1994, and is being distributed by ELSNET.
http://www.elsnet.org/resources/eciCorpus.html
3. Center for Spoken Language Understanding at OGI (CSLU)
4. CoSIH - The Corpus of Spoken Hebrew
http://www.tau.ac.il/humanities/semitic/cosih.html
5. Stanford University
Transcribed and phonetically annotated speech databases
http://www.stanford.edu/~tiflo/focus/l_speech_corpora.html
6. The W3C Corpus Linguistics, University of Essex
http://www.essex.ac.uk/linguistics/clmt/w3c/corpus_ling/content/introduction.html
7. Definitive Resource of Organizations, Programs and Centers in Linguistics
http://www.voices.com/articles/languages-accents-and-dialects/linguist-list.html
8. Lexicon and Corpora for Speech to Speech Translation (LC-STAR)
9. University of Hawaii, Department of Linguistics Databases & Corpora
http://www.ling.hawaii.edu/corpora.htm
10. ACL SIGLEX Resource Links
http://www.clres.com/phonetic.html
11. OrienTel
Multilingual access to interactive communication services for the Mediterranean
and the Middle East
12. SALA (SpeechDat across Latin America) speech databases
13. SpeechDat
Databases for the Creation of Voice Driven Teleservices
http://www.speechdat.org/SpeechDat.html
14. SPEECON
Speech-driven Interfaces for Consumer Devices
http://www.speechdat.org/speecon/index.html
15. LILA
LILA is the collection of a large number of spoken databases for training Automatic
Speech Recognition Systems in the Asia-Pacific area.
16. ATLAS -
Architecture and Tools for Linguistic Analysis System. ATLAS develops, produces
and distributes Language Resources. The ATLAS catalogue includes oral databases
for speech recognition and text-to-speech conversion. Currently, ATLAS produces
databases in Spain and America.
17. Australian National Database of Spoken Language (ANDOSL)
http://andosl.anu.edu.au/andosl/
18. MESSC International Scientific Research Program:
Joint Research on "Spoken Language Databases and Prosodic Labeling"
MESSC: Japanese Ministry of Education, Science, Sports and Culture
19. International Committee for Co-ordination
and Standardisation of Speech Databases (COCOSDA)
20. Oriental COCOSDA
http://isw3.aist-nara.ac.jp/IS/Shikano-lab/database/internet-resource/speech_db/itahashi-1.html
21. Signal Processing Information Base (SPIB)
The Signal Processing Information Base (SPIB) is a project sponsored by the
Signal Processing Society and the National Science Foundation. SPIB contains
information repositories of data, newsgroups, bibliographies, links to other
repositories, and addresses, all of which are relevant to signal processing
research and development.
http://spib.rice.edu/spib/signal.html
22. Noise sources
http://spib.rice.edu/spib/select_noise.html
23. The ShATR Multiple Simultaneous Speaker Corpus
ShATR is a corpus of overlapped speech collected by the University of Sheffield
Speech and Hearing Research Group in collaboration with ATR in order to support
research into computational auditory scene analysis. The task involved four
participants working in pairs to solve two crosswords. A fifth participant acted
as a hint-giver. Eight channels of audio data were recorded from the following
sensors: one close microphone per speaker, one omnidirectional microphone, and
the two channels of a binaurally-wired mannequin. Around 41% of the corpus contains
overlapped speakers. In addition, a variety of other audio data was collected
from each participant. The entire corpus, which has a duration of around 37
minutes, has been segmented and transcribed at 5 levels, from subtasks down
to phones. In addition, all nonspeech sounds have been marked.
http://dcs.shef.ac.uk/spandh/projects/shatrweb/
24. Intelligent Electronic Systems (IES)
IES is well known for its highly collaborative, team-oriented environment. This
is a rather unique group compared to most university research environments,
since the team concept influences just about every aspect of our work (including
lunch :). Faculty, staff, and students must undergo rigorous testing to assess
their suitability for our group!
http://www.isip.msstate.edu/projects/nsf_nonlinear/
25. The EMU Speech Database System
EMU is a collection of software tools for the creation, manipulation and analysis
of speech databases. At the core of EMU is a database search engine which allows
the researcher to find various speech segments based on the sequential and hierarchical
structure of the utterances in which they occur. EMU includes an interactive
labeller which can display spectrograms and other speech waveforms, and which
allows the creation of hierarchical, as well as sequential, labels for a speech
utterance.
26. Corpora@Stanford
This site contains information for corpus users at the linguistics department
and CSLI. The "Getting Started" section is focused on corpora newbies.
Thus more experienced users (and users familiar with the local setup) should
probably head straight to the "Available Resources" section of this
site. While this site definitely also contains information that is relevant
to experienced users (of corpora), it focuses on giving support for "beginners".
That is, if you want to know how to gather lots of fascinating examples for your
research, do searches for certain syntactic patterns, or browse speech recordings,
etc., but you do not know how to do what or even where, these pages are an
attempt to provide some guidance for the first steps in the big world of corpora.
Furthermore, this site can be used as a reference for all of the following topics
http://www.stanford.edu/dept/linguistics/corpora/
27. University of Ghent, Belgium,
Department of Electronics and Information Systems (ELIS)
http://www.elis.rug.ac.be/ELISgroups/speech/cost249/database/
28. The ICP (Institute of Spoken Communication)
was founded in 1983 to focus on speech as an object of research. It is now one
knowledge, methodologies, experimental tools and their particular scientific
and technological challenges, the ICP's researchers cover all the main research
areas in the field, including signals, language and cognitive motor and perception
systems.
http://www.icp.inpg.fr/ICP/_ressources.en.html
29. Arcadia Computing Innovation
Large scale general purpose telephone speech databases for Japanese speech recognition
systems
30. Speech at Carnegie Mellon University (CMU)
31. The CHRISTINE Project: SUSANNE meets spoken English
Sponsored by the Economic & Social Research Council (UK), the CHRISTINE
project set out to extend the SUSANNE analytic scheme and Corpus to cover spoken
English, and particularly spontaneous, informal spoken English.
http://www.grsampson.net/RChristine.html
32. Speech Processing, Joe Campbell, Johns Hopkins University
Whiting School of Engineering
http://www.apl.jhu.edu/Classes/525747/index.html