Manuel Barbera, Corpus based computational linguistic resources. General: Corpora (§ 2.1).

(1) – Corpora and Corpus Linguistics.

In this section I will present only sites and institutions devoted to corpus linguistics in the narrowest sense, i.e. those mainly involved in corpus production and distribution. For CL & NLP tools, developers and software houses, please refer to section 2.5 "Tools". For institutions more concerned with developing standards for corpus linguistics or with other NLP activities, and for tutorials, link pages, journals, personal pages and other miscellaneous corpus linguistics resources, please refer to section 2.4 "References, Standards & Educational Resources".

Athelstan Online: http://www.athel.com/

A well-known commercial site offering books, software, corpora and CD-ROMs for linguists and language teachers. It offers the concordance program MonoConc, the COBUILD Collins dictionary and utilities, the CPSA corpus and many interesting educational tools. [last checked 2001 April 23].

CALLHOME Project: http://morph.ldc.upenn.edu/ldc/about/callhome.html

The target of the project is the creation of a multi-lingual speech corpus that will support the development of Large Vocabulary Conversational Speech Recognition (LVCSR) technology. The corpus is being created in phases. The first phase includes Spanish, Japanese and Mandarin. The second phase concentrates on American English, German and Egyptian Arabic. For each language, a minimum of 200 conversations are collected and later transcribed. Participants are offered a free 30-minute phone call to another native speaker of the same language anywhere outside the U.S. or Canada. Possible participants are recruited via World Wide Web postings, newspaper advertisements, on-site presentations, and telephone solicitation. During the registration process (either via telephone or email), the subject is asked for specific demographic information about himself/herself: gender, age, years of completed education, country of birth, city in which he/she was raised, and how long the subject has been in the United States if applicable. Once the registration process is complete, a participant has 30 days in which to place the call. For each language, a 1-800 line was set up with recorded prompts in the appropriate language.

CELEX (Dutch Centre for Lexical Information): http://www.kun.nl/celex

CELEX, the Dutch Centre for Lexical Information, has three separate databases, all of which are open to external users. The Dutch database, version N3.1, was released in March 1990 and contains information on 381,292 present-day Dutch wordforms, corresponding to 124,136 lemmata. The latest release of the English database (E2.5), completed in June 1993, contains 52,446 lemmata representing 160,594 wordforms. The German database (D2.5), made accessible in February 1995, currently holds 51,728 lemmata with 365,530 corresponding wordforms. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For Dutch and English lemma homographs, frequencies have been disambiguated on the basis of the 42.4 m. Dutch INL and the 17.9 m. English Collins/COBUILD text corpora. Furthermore, information has been collected on syntactic and semantic subcategorisations for Dutch. The CELEX database is open free of charge (at least until 2001) to all academic researchers and people associated with other not-for-profit research institutes. Users will only be charged a one-off Dfl. 100,= for the CELEX User Guide. The CELEX staff would, however, like to restrict access mainly to Dutch research groups, to minimize workload and use of disk space on their host computer. In order to log in to CELEX, a personal account should be obtained from Richard Piepenbrock, project manager, who will grant you access by means of a system username and password and a separate FLEX username and password.

CHILDES Database: http://childes.psy.cmu.edu/

Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages: Cantonese, Catalan, Danish, Dutch, Estonian, French, German, Greek, Hebrew, Hungarian, Irish, Italian, Japanese, Mambila [Bantu], Mandarin, Polish, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish, Welsh. The bulk of the collection is, however, English.

CLR (Consortium for Lexical Research): http://crl.nmsu.edu/clr/CLR.html

Email: lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their material by anonymous ftp from ftp://clr.nmsu.edu/CLR. Their catalog is available as a PostScript file.

Dublin-Essex Treebank project: http://www.compapp.dcu.ie/~away/Treebank/treebank.html

The Dublin-Essex Project aims at deriving linguistic resources from treebanks. Some results are freely downloadable (a parsed English sample, etc.), and the project promises to make future research products, such as an English lexicon and a probabilistic grammar, freely available as well.
Probabilistic Unification Grammars (e.g. LFG-DOP: Bod and Kaplan, 1998) require large, high quality training corpora. These corpora have to provide tree structures with feature structure annotations. Such corpora are expensive to construct and hard to come by. The traditional procedure for constructing such corpora is to use a large-scale unification grammar (in the real world, this often means writing one yourself!) and parse text. Typically, for each string in the input text the grammar will produce hundreds or thousands of candidate tree-feature structure pairs, from which a highly trained linguist has to pick the best analysis for inclusion in the training corpus. This is time consuming and error prone. The Dublin-Essex Project has developed an alternative method. The basic idea is extremely simple. A treebank is required as input; from this the CF-PSG is automatically compiled following the method of [Charniak, 96]. Then the CF-PSG is manually annotated with f-structure equations, and macros are provided for the lexical categories. Then (and this is the trick) the treebank entries (not the strings) are "reparsed" simply by following the annotations put there by the original human annotators, solving along the way the f-equations on the rules encountered in that process. This results in an f-structure induced by the best-fitting tree for the example at hand. If the f-structure annotations are deterministic, then the whole process is, and there is no need to choose from hundreds or thousands of alternatives.
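The first step described above, compiling a CF-PSG from a treebank, can be illustrated with a minimal sketch. It is not the project's own code: it assumes trees in a simple bracketed notation, reads one rule off every local tree, and estimates rule probabilities by relative frequency, which is the usual way of building a "treebank grammar". The function names and the two-tree toy treebank are hypothetical, and the later steps (f-structure annotation and equation solving) are not covered.

```python
from collections import Counter, defaultdict

def parse_tree(text):
    """Parse one bracketed tree, e.g. '(S (NP (N dogs)) (VP (V bark)))'.
    Returns nested tuples (label, child, ...); a leaf child is a plain string."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip the opening bracket
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip the closing bracket
            return (label, *children)
        word = tokens[pos]                # a terminal (word)
        pos += 1
        return word

    return read()

def rules(tree):
    """Yield one (lhs, rhs) pair for every local tree, top-down."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)

def treebank_grammar(trees):
    """Map each observed rule to its relative frequency among rules with the same lhs."""
    counts = Counter(r for t in trees for r in rules(parse_tree(t)))
    lhs_totals = defaultdict(int)
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}

# Hypothetical toy treebank, for illustration only.
TOY_TREEBANK = [
    "(S (NP (D the) (N dog)) (VP (V barks)))",
    "(S (NP (N dogs)) (VP (V bark)))",
]
GRAMMAR = treebank_grammar(TOY_TREEBANK)

for (lhs, rhs), p in sorted(GRAMMAR.items()):
    print(f"{lhs} -> {' '.join(rhs)}  [{p:.2f}]")
```

Because the rules are read directly off trees that human annotators have already disambiguated, no choice among competing parses is needed at this stage; the sketch does not distinguish terminals from nonterminals on the right-hand side, which a real implementation would have to do.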

ELRA (European Language Resources Association): http://www.icp.grenet.fr/ELRA

Established in Luxembourg in February 1995 with the goal of promoting the creation, verification, and distribution of language resources in Europe, ELRA is a non-profit organization. It will collect, market, distribute, and license European language resources. ELRA will help users and developers of language resources, government agencies, and other interested parties exploit language resources for a wide variety of uses. Eventually, ELRA will serve as the European repository for EU-funded language resources and interact with similar bodies in other parts of the world. Subscription is available only to institutions (and it is expensive: 750 euro for non-profit-making organisations). The RELATOR project was the first attempt, and its homepage still exists, but it is largely moribund, and you should go straight to ELRA.

ELSNET (European Network of Excellence in Human Language Technologies): http://elsnet.let.uu.nl/

The European Network of Excellence in Human Language Technologies, founded by the European Communities' HLT Programme. ELSNET's objective is to bring together the key players in language and speech technology, both in industry and in academia. To encourage interdisciplinary co-operation, ELSNET organises a variety of events and services for the language and speech community.

ECI (European Corpus Initiative): http://www.elsnet.org/resources/eciCorpus.html

ECI, the European Corpus Initiative, was founded to oversee the acquisition and preparation of a large multilingual corpus, and supports existing and projected national and international efforts to carefully design, collect and publish large-scale multilingual written and spoken corpora. ECI has produced ECI/MCI I (ECI/Multilingual Corpus I).

Edinburgh Tools: http://www.ltg.ed.ac.uk/~chrisbr/edintools.html

This is the list of the computational linguistics resources at the Edinburgh Language Technology Group (LTG). Some of the links are accessible only from machines in the School of Cognitive Science, but at least they provide a lot of information about a huge number of tools. Some items can be downloaded and some must be licensed. In any case, a useful list of resources.

ELAN (European Language Activity Network): http://solaris3.ids-mannheim.de/elan/

The European Language Activity Network plans (a) to reinforce or, where necessary, create international standards by designing a common query language (ELAN-CQL) and by providing standardised resources for the following languages: Belgian French, Bulgarian, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, French, German, Greek, Hungarian, Italian, Latvian, Lithuanian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovene, Swedish, Turkish and Ukrainian; (b) to operate a user community network with active awareness-raising measures, a clear copyright policy, user support, e-mail user groups, etc. As of September 2000, nothing had yet been completed and made available at the site. Now these pages seem to be down [checked 2001 July 3, thanks Alf!].

ICAME (International Computer Archive of Modern English): http://www.hd.uib.no/icame.html

ICAME is an international organization of linguists and information scientists working with English machine-readable texts. The aim of the organization is to collect and distribute information on English language material available for computer processing and on linguistic research completed or in progress on the material, to compile an archive of English text corpora in machine-readable form, and to make material available to research institutions. The archive mentioned in the name resides at the NCCH (Norwegian Computing Centre for the Humanities) in Bergen, Norway. This acts as a distribution centre for computerized English-language corpora and corpus-related software: that is to say, they sell various English corpora (including Brown and London-Lund) on CD-ROMs, cf. the list at this page (or this other one). Information on corpora can be obtained on the web, by e-mail, by ftp, and by snail mail (ICAME, Norwegian Computing Centre for the Humanities, Harald Hårfagres gate 31, N-5007 Bergen, Norway). Manuals for these corpora are also freely available at this address. ICAME also publishes a journal, the ICAME Journal.

ILC (Istituto di Linguistica Computazionale - CNR Pisa): http://www.ilc.pi.cnr.it/

The Pisa ILC hosts many important computational projects, such as EAGLES, ISLE and PAROLE. The well-known PiSystem and DBT also came out of ILC. [Last check 2001 April 26].

ILK (Induction of Linguistic Knowledge): http://ilk.kub.nl/

ILK, Induction of Linguistic Knowledge, is a corpus-based research programme in which inductive learning algorithms are developed and employed in solving natural language problems; work has been done on Dutch, English, Spanish, Swedish, Slovene and German. The research programme, co-directed by Walter Daelemans (homepage) and Antal van den Bosch (homepage), is part of the CLS (Center for Language Studies, Faculty of Arts, Tilburg University). There are some demos online, cf. especially the MBT one.

IMS (Institut für Maschinelle Sprachverarbeitung): http://www.ims.uni-stuttgart.de/index.html.en

The Institute for Natural Language Processing (IMS) carries out basic and applied research and trains students to create tools for automated processing of spoken and written language. Over the years, they have collected text corpora totalling several hundred million tokens. At the IMS, most of these corpora can be explored via the IMS Corpus Workbench, but they are generally not available to outside users. Some tools (cf. Corpus WorkBench, Tree Tagger) are licensed free of charge for research purposes.

ItalNet: http://ovisun199.csovi.fi.cnr.it/italnet/ (in Italian) or http://ovisun199.csovi.fi.cnr.it/italnet/index_en.html (in English)

ItalNet is an international consortium whose mission is to make available scholarly Internet resources of literary and historical materials relating to Italian studies. It provides the Internet publication of the OVI (Opera del Vocabolario Italiano) database of Early Italian Texts and is associated with the Chicago ARTFL Project.

LDC (Linguistic Data Consortium): http://www.ldc.upenn.edu/

The Linguistic Data Consortium at Penn (i.e. the University of Pennsylvania) provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to very expensive. CDs can be purchased individually; institutions can become members and receive discounts on CDs: follow this link. Their catalog and some other information are available at the following ftp (log in giving "anonymous" as user name and "your_email" as password) or on the web at this address. There is an LDC Online service (try this page or this other one) for queries over the web (mainly intended for members, but there are some samplers available). Email: ldc@ldc.upenn.edu.

Moby Project: http://www.dcs.shef.ac.uk/research/ilash/Moby/

The Moby lexicon project by Grady Ward (e-mail) has been placed into the public domain. There is a downloadable 25 MB tar-gzipped complete distribution, or each sub-project can be downloaded individually, viz.: Moby Hyphenator: 185,000 entries fully hyphenated; Moby Language: word lists in five of the world's great languages; Moby Part-of-Speech: 230,000 entries fully described by part(s) of speech, listed in priority order; Moby Pronunciator: 175,000 entries fully International Phonetic Alphabet coded; Moby Shakespeare: the complete unabridged works of Shakespeare; Moby Thesaurus: 30,000 root words, 2.5 million synonyms and related words; Moby Words: 610,000+ words and phrases. The Moby Shakespeare in particular is widely distributed in e-text libraries, since it is the only complete free electronic edition of Shakespeare. [2001 April 23].

Online Korpusanalyse mit Hilfe von TactWeb: http://www.uni-duisburg.de/FB3/ROMANISTIK/PERSONAL/Burr/humcomp/analysis.htm

Some short but sometimes very useful Romance-language corpora prepared, marked up and made freely queryable online via TACTweb by Elisabeth Burr. They are chiefly parts of the wider project "Romanische Zeitungssprachen". [2001 April 23].
+ Computergestützte Linguistik des Italienischen - Online Korpusanalyse: cf. LIP Corpus Online and Korpusprojekt "italienische Zeitungssprache" under the Italian section.
+ Computergestützte Linguistik des Französischen - Online Korpusanalyse: cf. Korpus romanischer Zeitungssprachen "Le Monde, 15.06.1994" under the French section.
+ Computergestützte Linguistik des Spanischen - Online Korpusanalyse: cf. Korpusprojekt romanischer Zeitungssprachen "La Vanguardia, 15.06.1994" under the Spanish section.

OTA (Oxford Text Archive): http://ota.ahds.ac.uk/ (lots of frames and Java!).

A large catalogue of electronic texts, mainly of literary, philological and scholarly genres. English predominates, but not exclusively. They also offer some linguistic corpora for free after a disclaimer statement has been sent (e.g. Lampeter Corpus, Northern Ireland Speech Corpus, SUSANNE Corpus): query their catalogue with search author=corpora. For more information cf. under the e-texts section.

PAROLE: http://www.ilc.pi.cnr.it/parole/parole.html

The homepage of this important project is still under construction (last check 21-01-01) at the ILC site.

RALI (Le Laboratoire de Recherche Appliquée en Linguistique Informatique - Université de Montréal): http://www-rali.iro.umontreal.ca/

RALI provides tools and computational resources for the linguistic analysis mainly of written French (taggers, parsers, electronic grammars, parallel corpora). Some demos are working online, such as Réacc (automatically restores accents in a French plain text), Morpholyse (a morphological analyzer), TRIAL (Trilingual Text Aligner), etc. There is also a freely downloadable suite of COAL Tools. A particularly useful service maintained by RALI is TransSearch, an online queryable version of the Hansard bilingual corpus.

REAL Centre (Research in English and Applied Linguistics – Technische Universität Chemnitz): http://www.tu-chemnitz.de/phil/english/real/

Although REAL is an acronym for Research in English and Applied Linguistics, it also indicates the Centre's research specialization: "real language", i.e. the REAL Centre concentrates on the collection and description of natural language from a wide range of historical, regional, social and stylistic contexts.

TDT Phase 2 (Topic Detection and Tracking): http://morph.ldc.upenn.edu/Projects/TDT2/

Documentation of a project aimed at exploring techniques for detecting the appearance of new and unexpected topics and for tracking their reappearance and evolution. A special training corpus was constructed for this job (cf. TDT Pilot Study Corpus), and some corpora were tagged accordingly, cf.: TDT2 Multilanguage Text Corpus, TDT2 Mandarin Text Corpus and TDT2 English Text Corpus Version 2.

TELRI (Trans-European Language Resources Infrastructure): http://www.telri.de/

The Concerted Action Trans-European Language Resources Infrastructure, which has by now reached its second phase, TELRI II, is a pan-European alliance of currently 28 focal national language (technology) institutions with the emphasis on Central and Eastern European and NIS countries. It is planned to extend this alliance during the course of the Concerted Action with at least 3 new nodes in CEE/NIS. TELRI II's primary objectives are to strengthen the pan-European infrastructure for the multilingual language research and development community, and to collect, promote, and make available monolingual and multilingual language resources and tools for the extraction of language data and linguistic knowledge. TELRI maintains the important TRACTOR Archive.

TRACTOR (TELRI Research Archive of Computational Tools and Resources): http://www.tractor.de/

The TELRI Research Archive of Computational Tools and Resources distributes multilingual resources for the Human Language Technology community. It features monolingual and multilingual corpora and lexicons in a wide variety of languages (see the Catalogue), currently including: Bulgarian, Czech, Dutch, English, Estonian, French, German, Hungarian, Italian, Latvian, Lithuanian, Romanian, Russian, Serbian, Slovak, Slovenian, Swedish, Turkish, Ukrainian, Uzbek. The TRACTOR Catalogue of Tools is currently under development [2001 June].
In order to download resources from the archive, you need to become a member of the TRACTOR User Community (TUC). Private users are accepted as well as institutional ones. For those who deposit resources with TRACTOR, access to the entire collection is free. For others there is a small fee to establish membership formally, so that the use of the archive can be controlled and monitored in order to protect the intellectual property rights of the resource providers. The nominal annual fee for individuals and academic, public, or industrial organizations is 50 EURO (Western European members) or 20 EURO (rest of Europe). Comparable conditions apply for members from outside Europe. Members can download and use the resources for research purposes without further charges. Users wishing to exploit the resources commercially will be referred to the owners of the resources in order to negotiate permission with them. For all enquiries regarding accessing and depositing resources, email the TRACTOR Helpdesk.

UCREL (University Centre for Computer Corpus Research on Language): http://www.comp.lancs.ac.uk/computing/research/ucrel/

Lancaster University's UCREL, a leader in computer corpus construction and analysis for over twenty years, has a wide variety of machine-readable corpora held in file storage or on CD-ROM. Some corpora are held only as plain orthographic text, whilst others are held with several kinds of annotation (UCREL's well-known tagging software is CLAWS). See the list at this page. Note, however, that UCREL does not distribute corpora: it is ICAME that acts as the distribution centre. [Rev. 2001 April 28].

VISL (Visual Interactive Syntax Learning): http://visl.hum.ou.dk (a lot of frames!)

The primary focus of the VISL project (Department of Language and Communication, University of Southern Denmark - Odense) is to develop interactive computer software based on the Constraint Grammar model and on theoretical materials developed in the area of syntax. The software is being designed for self-paced learning by students of different languages. The original VISL languages were English, French and German. However, additional languages (Danish, Spanish, Italian, Japanese, Portuguese, Esperanto, Arabic) have joined the project. Danish, German, English and Spanish pure-text corpora and a Portuguese tagged corpus are queryable online (at the following address), for members only.

W3-Corpora Project: http://clwww.essex.ac.uk/w3c/corpus_ling/about.html

The World Wide Web Access to Corpora Project (W3-Corpora), run at the Department of Language and Linguistics of the University of Essex, enables and promotes the use of corpus resources by allowing simple and straightforward access, via the WWW, to linguistic corpora. The user only needs access to the WWW to be able to perform corpus searches using a web browser (such as Netscape, Internet Explorer, etc.). The project's aim is to provide free access to existing linguistic corpora via the World Wide Web (WWW) to students and researchers in Linguistics and related disciplines. The project has resulted in a WWW site with:
+ W3-Corpora Interface: http://clwww.essex.ac.uk/cgi-bin/w3c/w3c
The W3-Corpora interface is a search engine designed for use on large collections of text of various kinds (corpora). The default collection used as a "corpus" is usually the Gutenberg Archive e-texts collection, but you can also define other corpora. The tool will search for the word or phrase you enter in the corpus you choose and display the examples (hits) on the screen. Irrespective of what kind of search you want to make, the procedure is the same. The W3-Corpora search engine is designed to handle corpora of different kinds. The tool will only search the corpora and display the results on the screen. All subsequent analyses have to be made by you.
+ Tutorial for the W3-Corpora Interface.
+ W3C Corpus Linguistics Pages: a general introduction to Corpus Linguistics.
+ W3C List of Corpora: a good list of the corpora available on the web. Last updated January 19, 1999.

Web Concordancer: http://vlc.polyu.edu.hk/scripts/concordance/WWWconcapp.htm

The Web Concordancer site, by the Virtual Language Centre of the Polytechnic University of Hong Kong, presents a few indexed corpora (English, French, Chinese, Japanese) that can be freely browsed with the ConcApp program. Corpora available include the Brown Corpus, Sherlock Holmes stories, South China Morning Post, etc. [2002 February 17].

WebCorp (The Web as Corpus): http://webcorp.org.uk/

WebCorp is a suite of tools, created by the Research and Development Unit for English Studies (RDUES, University of Liverpool), which allows free access to the World Wide Web as a corpus - a large collection of texts from which facts about the language can be extracted. WebCorp was launched on May 2, 2000, and it is surely one of the most exciting novelties of recent months! Although WebCorp is designed for linguistic data search, many users have found its results format (with relevant sections of text from multiple web pages collated on one page) useful for information retrieval of the type for which standard search engines are usually used. The WebCorp interface is similar to the interfaces provided by standard search engines. You enter a word or phrase, choose options from the menus provided and then press the "Submit" button. WebCorp works "on top of" the search engine of your choice, taking the list of URLs returned by that search engine and extracting concordance lines from each of those pages. All of the concordance lines are presented on a single results page, with links to the sites from which they came. Search engines, such as Google and AltaVista, are designed to retrieve information from the World Wide Web. They use complex techniques to index the Web and return the documents from their indices which are most relevant for the user's request. WebCorp is designed to retrieve linguistic data from the Web: concordance lines showing the context in which the user's search term occurs. In response to a user query, standard search engines return a list of URLs (page addresses), along with a description of or some text from each page to help the user decide which pages are most useful. To view the pages, the user must click on each of the links individually. WebCorp actually visits each one of these pages, extracting concordance lines from them.
The current version of WebCorp is for demonstration purposes and the speed at which results are returned will increase as the tool is developed further. The reason that WebCorp is slower than search engines is that, although WebCorp has a search engine-like interface, its aims and the way it works are very different. Contact. [2001 April 29].
+ WebCorp Lite has fewer customisation options and works only with single word search terms but returns results more quickly than the standard WebCorp tool. If you are searching for a single word and are having speed problems with WebCorp then you should try WebCorp Lite.
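The core operation described for WebCorp, extracting concordance lines for a search term from a set of pages and collating them on one page, can be sketched as a keyword-in-context (KWIC) routine. This is only an illustrative sketch under stated assumptions, not WebCorp's implementation: the page texts are assumed to have been fetched and stripped of markup already, and the function name, the `width` parameter and the sample pages are hypothetical.

```python
import re

def concordance(texts, term, width=30):
    """Return (source, line) pairs, one per whole-word, case-insensitive hit of `term`.
    `texts` maps a source identifier (e.g. a URL) to already-fetched plain text."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    lines = []
    for source, text in texts.items():
        flat = " ".join(text.split())     # collapse line breaks and runs of spaces
        for m in pattern.finditer(flat):
            left = flat[max(0, m.start() - width):m.start()]
            right = flat[m.end():m.end() + width]
            # Pad both contexts so that the keyword is aligned in one column.
            lines.append((source, f"{left:>{width}}[{m.group(0)}]{right:<{width}}"))
    return lines

# Hypothetical sample "pages", standing in for text retrieved from the web.
PAGES = {
    "page-1": "A corpus is a large collection of texts. Every corpus has a design.",
    "page-2": "Corpus linguistics studies language through corpora.",
}
LINES = concordance(PAGES, "corpus")

for source, line in LINES:
    print(f"{line}  ({source})")
```

Keeping the source identifier with each line mirrors the results page described above, where every concordance line links back to the site it came from. Matching is on whole words only, so "corpora" is not returned for the query "corpus".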

General Resources.