General Resources.
(1) - Corpora and Corpus Linguistics.
(2) - Multilingual and Parallel Corpora.
(3) - Electronic Literary Text Archives.
(4) - References, Standards & Educational Resources.
(5) - Tools.
http://www-rali.iro.umontreal.ca/arc-a2/BAF/
The BAF Corpus is a corpus of French-English bi-texts, i.e. pairs of French and English texts that are mutual translations and whose sentences have been aligned. The corpus was built by the CITI computer-assisted translation group (TAO). Most of the texts are institutional in genre (Canadian Hansard, UN reports, etc.), but a few scientific papers and a literary work are also included. The whole corpus has about 400,000 words for each language. BAF Version 1.1 is available and can be freely downloaded in UNIX GZ format, as a ZIP, or file by file in TXT and CES formats. A description, the alignment conventions, encoding documentation, and a COAL Tools suite are also freely available on the site. [2001 April 23].
The University of Maryland Parallel Corpus Project is acquiring and annotating texts in order to create multilingual corpora for linguistic research, particularly computational linguistics. Religious texts such as the Bible are widely available, carefully translated, and appear in a huge variety of languages. The MPCP provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PostScript papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. [2001 May 1].
http://www.cilta.unibo.it/SITOBOLC_ITA.htm (English also)
The BOnonia Legal Corpus (BOLC) is an ongoing cross-disciplinary research project aimed at the construction and analysis of a multilingual comparable legal corpus. It has been under development since 1997 at CILTA (Centro Interfacoltà di Linguistica Teorica e Applicata 'L. Heilmann') at Bologna University. The project is coordinated by Rema Rossini Favretti; Fabio Tamburini has taken care of computer programming and technical problems, and John Sinclair played a crucial role as consultant. For the moment the corpus is formed of two subcorpora, one English and one Italian, but it could be expanded at a later stage. A detailed description is given on the site, but future availability is not known. For more information, however, you can e-mail Cristiana Desantis. [2001 April 23].
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-04.htm
Texts from Deutsche Bundesregierung (German Federal Government), Bonn and Berlin, Germany: German Resource File in HTML and Grundgesetz in French and English as Word documents. From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video.
The CHILDES database (cf. this link) is a large group of Children's Spoken and Written Language Corpora, all freely available for PC or Mac. It includes a vast amount of transcript data collected from children and adults who are learning languages.
All of the data are transcribed in the CHAT format, which makes them easy to analyze with the CLAN programs. A 2.5 MB PDF manual of the CHILDES corpora is freely available (at this address), as are the CLAN concordancer for accessing the data and its manual.
CHILDES corpora cover 23 European and non-European languages: Cantonese, Catalan, Danish, Dutch, Estonian, French, German, Greek, Hebrew, Hungarian, Irish, Italian, Japanese, Mambila [Bantu], Mandarin, Polish, Portuguese, Russian, Spanish, Swedish, Tamil, Turkish, Welsh. The bulk of the collection is, however, English (see under the English section).
There is also a remarkable Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish), made from narratives elicited using Mercer Mayer's "frog story" picture book.
All materials are freely available directly from the site; the texts can also be downloaded by FTP from ftp://poppy.psy.cmu.edu/. Contact: CHILDES Project, Department of Psychology, Carnegie Mellon University, Pittsburgh, PA 15213, USA; e-mail to brian@andrew.cmu.edu.
http://solaris3.ids-mannheim.de/tractor/telri/LJU1/lju1.htm
It contains a 500,000-word English-Slovene and Slovene-English corpus covering various domains, besides other Slovene resources, the Multext East Corpus, and a Newspaper Corpus. From Language and Speech Group, Intelligent Systems Dept, Jozef Stefan Institute, Ljubljana, Slovenia.
Available under subscription to TRACTOR.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan). The primary focus in this effort is on textual material of all kinds, including transcriptions of spoken material. ECI/MCI has 46 subcorpora in 27 (mainly European) languages. The total size of these is roughly 92 million (lexical) words. The corpora are marked up using TEI P2 conformant SGML (to varying levels of detail), with easy access to the source text without markup. Twelve of the component corpora are multilingual parallel corpora with from two to nine sub-corpora. All the alphabetic corpora (there is some Japanese and Chinese) are encoded in the ISO LATIN family of 8-bit character sets (ISO 8859-1, -5 and -7). A complete list of the contents is available following this link.
Unusually cheap: the ECI/MCI is available directly from ECI at a price of 95 DFl (for payments made by credit card or Eurocheque); 110 DFl (for payments by bank transfer); or 120 DFl (for payments by cheques other than Eurocheques). You need only sign a license agreement, available (PostScript or LaTeX version) at this address or this other one.
It is also available for $35 (or through membership) from the LDC on a CD-ROM in High Sierra format (ISO 9660), readable on UNIX, MSDOS and Apple systems at least: cf. this page.
The EMILLE Project is a 3 year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. The project will establish an LE architecture within which minority LE may take place. EMILLE will extend GATE (the General Architecture for Text Engineering) to be fully UNICODE compliant so that it may act as a framework within which the corpora of EMILLE can be both developed and exploited. GATE will be extended at Sheffield, in close liaison with the Lancaster team, to meet the needs of EMILLE. GATE was first released in 1996 and has since had a wide take-up in language processing laboratories around the world (Cunningham, Gaizauskas, Humphreys, and Wilks, 1999).
EMILLE will generate written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu. These are the Indic languages indicated as being those most wanted by the LE community in the Baker & McEnery (1999) survey. For those languages with a UK community large enough to sustain spoken corpus collection (Bengali, Gujarati, Hindi, Panjabi and Urdu) EMILLE will also produce spoken corpora of at least 500,000 words per language. The written corpus data will contain at least 200,000 words of parallel text. The remainder will be monolingual corpus data. The monolingual written corpus will attempt to shadow the composition of the BNC (British National Corpus) in terms of genres as far as possible. The spoken corpus data will be gathered from communities across the UK on minidisks. The digitised sound wave of the minidisks will be stored and released as part of the final project deliverables. Note that the use of digital media to collect the data will ease the transfer of the data to computer. The data will also be transcribed.
EMILLE will publish the corpora on a web site for downloading, one of the favoured distribution formats reported by the Baker et al (1998) review of corpus validation. The Department of Linguistics at Lancaster University has undertaken to maintain the web site beyond the life of the EMILLE project. ELRA (an EMILLE partner) has agreed to organise distribution of the project resources on CD. The corpus will be accompanied by a handbook, analogous to the BNC user reference guide, which will give details of the sources individual corpus texts were gathered from etc. [2001 June 17].
The ET10-63 corpus is a bilingual parallel corpus of English and French, containing EC official documents on telecommunications. The corpus is part-of-speech tagged and also lemmatized. It has approximately 1,250,000 words of each language.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-08.htm
Texts from the European Free Trade Association (EFTA), Geneva, Switzerland, in English and German (HTML and Word formats). From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
http://morph.ldc.upenn.edu/Catalog/LDC95T11.html
The European Language Newspaper Text corpus is also known as the French Language News Corpus. This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked using SGML. The text is taken from the following sources: [1] - Approximately 60 million words of text in French and German have been made available from the Associated Press (AP) World Stream. AP World Stream is a compilation of AP news reports produced in 86 bureaus in 68 countries. The Associated Press Worldstream newswire service provides articles in six languages, interleaved on a single data stream. The data is collected via an Associated Press installed telephone line at the LDC. [2] - Approximately 110 million words of text in French, German and Portuguese have been made available from Agence France Presse. Each language was supplied in separate data streams collected via a Dateno MKII satellite receiver and associated equipment at the LDC. [3] - Approximately 20 million words of text in German have been made available from Deutsche Presse Agentur. The text is collected via an AP Datafeatures telephone line installed at the Linguistic Data Consortium. [4] - A smaller part of the corpus comes from Le Monde newspaper. The Le Monde data covers about 65 million words of French. It is quite distinct from the AP and AFP materials in its markup approach, because it has been prepared in compliance with the conventions of the Text Encoding Initiative (TEI), rather than having been based on the model of the TIPSTER collections, which were originally developed prior to the establishment of the TEI conventions.
Available only by LDC membership, cf. this link.
Parallel text in all EU languages.
The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament. While the content is therefore limited to legislative discourse, it spans a broad assortment of topics and the stylistic range includes spontaneous discussion and written correspondence along with legislative propositions and prepared speeches. There are several different versions.
+ Hansard Treebank (The Canadian Hansard Treebank). A skeleton-parsed parallel corpus (English-French) of proceedings in the Canadian Parliament. 750,000 words.
+ Hansard LDC Parallel Corpus (The LDC Canadian Hansard Treebank). The collection presented here has been assembled by the LDC by way of archives from two distinct secondary sources. Material from one time period of parliamentary proceedings was acquired through the IBM T. J. Watson Research Center, while material from another period was acquired through Bell Communications Research Inc. (Bellcore). The combined collection covers a time span from the mid-1970's through 1988, with no apparent duplication between the two data sources. Aside from covering different time periods, the two archives have different organization and have undergone different amounts and kinds of processing in being prepared as a parallel language resource. In addition, the Bellcore set itself comprises two distinct types of data -- one appears to be the main parliamentary proceedings (similar in nature to the IBM set), while the other consists of transcripts from committee hearings. The three sets have been kept distinct in this publication and each is described in greater detail in separate documentation files on the CD-ROM.
Available only from the LDC, through membership or for $5,000; cf. this link.
+ TransSearch Hansard (texts 1986-1993). In this free, untagged online version, developed by RALI, you can specify a word or an expression in English or in French: TransSearch will look for contexts where this expression appears, and show you the corresponding context in the other language.
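The kind of lookup TransSearch offers can be pictured with a small sketch. It assumes the bi-text is already available as a list of sentence-aligned (English, French) pairs; the function name `bilingual_concordance` and the sample sentences are invented for illustration, and nothing here reflects how RALI actually implements the service.

```python
import re

def bilingual_concordance(pairs, query, lang="en"):
    """Return the aligned pairs whose `lang` side contains `query` as a whole word or phrase."""
    side = 0 if lang == "en" else 1
    # \b keeps "member" from matching inside "remembered"; matching ignores case.
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return [pair for pair in pairs if pattern.search(pair[side])]

# Hypothetical sentence-aligned sample, not taken from the Hansard itself.
pairs = [
    ("The House met at 2 p.m.", "La Chambre se réunit à 14 heures."),
    ("I thank the honourable member.", "Je remercie l'honorable député."),
    ("The honourable member has the floor.", "L'honorable député a la parole."),
]

for en, fr in bilingual_concordance(pairs, "honourable member"):
    print(en, "|", fr)
```

Because each hit is returned together with its aligned counterpart, the same function serves both directions: pass `lang="fr"` to search the French side and read off the English context.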
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-11.htm
Intellectual Property and Copyright magazine in French and English versions (DOC Word format). From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
http://www.comp.lancs.ac.uk/computing/research/ucrel/corpora.html#crater
A 1,000,000-word trilingual corpus of Spanish, French and English, aligned at the sentence level. The corpus is made up of texts from the telecommunications domain. It has been part-of-speech tagged in all three languages. The corpus can be accessed online (free access).
This corpus is made up of a set of extracts from the Official Journal of the European Community (JOC) and is CES (Corpus Encoding Standard) conformant. It is available at three levels of treatment: paragraph annotated (CESDOC), POS-tagged (CESANA) and parallel text aligned (CESALIGN).
Availability unknown: only a few samples can be downloaded from this site.
This paper in Spanish by Joseba Abaitua (e-mail) of the Universidad de Deusto is the text of a seminar on "La ingeniería lingüística en la sociedad de la información" held at Soria (Fundación Duques de Soria), 17-21 July 2000. It is a rich and detailed reference on bilingual parallel and comparable corpora, with a large bibliography that makes this page even more useful. [2002 February 23].
A freely queryable online database of English-French aligned texts, processed by the same software (by Knut Hofland) used for the Oslo ENPC project.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also an easy interface that allows the comparison of any two texts, thus creating some sort of parallel texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].
Multext encompasses a series of projects whose goals are to develop standards and specifications for the encoding and processing of linguistic corpora, and to develop tools, corpora and linguistic resources embodying these standards. Multext is developing tools, corpora, and linguistic resources for a wide variety of languages, including Bambara, Bulgarian, Catalan, Czech, Dutch, English, Estonian, French, German, Hungarian, Italian, Kikongo, Occitan, Romanian, Slovenian, Spanish, Swedish and Swahili. All Multext results are made freely and publicly available for non-commercial, non-military purposes. At least this is what they say: so far the only corpus material you can download is a few samples from the JOC-CES multilingual corpus and a detailed description of the DI93 Swedish Multext Comparable Corpus. Cf. also the MULTEXT Tools file.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-06.htm
HTML texts from North Atlantic Treaty Organization (NATO), Brussels, Belgium, in English, French and German. From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/BUC/buc-01.htm
English and Romanian versions of Orwell's 1984, with hand-validated alignment, in HTML format (MULTEXT-EAST resource). From Center for Advanced Research in Machine Learning, NLP, and Cognitive Modelling, Academy of Sciences, Bucharest, Romania.
Available under subscription to TRACTOR.
The English-Norwegian Parallel Corpus (ENPC) of the University of Oslo consists of original texts and their translations (English to Norwegian and Norwegian to English). It is intended as a general research tool, available beyond the present project for applied and theoretical linguistic research. It started out as a research project at the Department of British and American Studies, University of Oslo, in 1994. The corpus is now completed. The focus has been on novels and fairly general non-fictional books. In order to include material by a range of authors and translators, the texts of the corpus are limited to text extracts (chunks of 10,000-15,000 words). The fiction part of the corpus contains 30 original text extracts in each language and their translations, whereas the non-fiction part contains 20 in each direction. The parallel corpus is planned as an open text bank and will be expanded as allowed by the resources available. The process of compiling the corpus has taken four years. A lot of work has gone into the development of software and into the preparation of the texts. The coding system used to mark up the ENPC follows the suggestions made by the Text Encoding Initiative (TEI) as presented in Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen & Burnard, 1994). The English part of the ENPC has been tagged for part-of-speech (POS). The tagging was done automatically by using the English Constraint Grammar parser (cf. EngCG Parser) developed by Atro Voutilainen, Juha Heikkilä, Arto Anttila and Pasi Tapanainen according to the Constraint Grammar framework originally proposed by Fred Karlsson (cf. Constraint Grammars). The Norwegian part of the corpus will not be tagged, for lack of a Norwegian tagger.
Access to the corpus is to date restricted to researchers and students at the University of Oslo: cf. this link. Only the manual is freely available online.
http://web.bham.ac.uk/lxw715/English/corporaprojects.html
A good reference page for parallel corpora maintained at Birmingham by Wang Lixun (homepage). Each link is provided with some short comment. [2001 April 23].
http://www.tu-chemnitz.de/phil/english/real/transcorpus/index.htm
The corpus consists of English and German texts, ranging from contemporary British/American literature to scientific textbooks. It aims at creating a machine-readable and aligned corpus which will make it possible to discover and categorise translation equivalents for a number of linguistic items, such as prepositions, function verbs, deictic elements, metaphors or culture-specific structures. Apart from theoretical insights into contrastive language structures as well as cognitive aspects of the translation process, research results could, for instance, be applied to bilingual lexicography or other language learning and translation aids. One example of such a learning tool is the corpus-based contrastive Chemnitz Internet Grammar. Parts of the English-German translation corpus are now freely available online for research purposes only: go to the Chemnitz Internet Grammar and check the access to the online corpus following this link.
RELATOR was a European-wide consortium of researchers who, with the support of the European Commission, were striving to establish a European repository of linguistic resources. Linguistic resources comprise a variety of spoken and written language materials, including lexicons, grammars, corpora, and spoken language databases. This project is now moribund and the site doomed to disappear: its content will soon enrich the ELRA project.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-09.htm
Documents relating to the reform of the federal constitution (all HTML; RTF and PDF version also available). From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
http://morph.ldc.upenn.edu/Catalog/LDC99T39.html
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking). This CD-ROM release consists of the English and Mandarin text components of the TDT2 corpus. The data were collected daily over a period of six months (January-June of 1998) from the following sources: American Broadcasting Company (ABC); Associated Press; Cable News Network, Inc. (CNN); New York Times; Public Radio International (PRI); Voice of America (VOA); Xinhua News Agency; ZaoBao News. The two subcorpora were also released separately, cf. TDT2 English Text corpus Version 2 and TDT2 Mandarin Text Corpus.
Available through LDC membership or for $2,500: see this page.
The Trans European Language Resources Infrastructure, whose main archive is the well-known TRACTOR. It features parallel and other texts in Central and Eastern European languages, but you are not even allowed to know what they are unless you become a member. Sad.
A small Tetun (East Timorese) - English parallel corpus, manually sentence-aligned. It was used by the Statistical Machine Translation Team of Dan Melamed (cf. Melamed's Tools file) and others exploiting the EGYPT statistical machine translation toolkit. No information on its availability.
The Thai On-Line Library of bilingual texts, maintained by the TIE Project (Thai Internet Educational), is a tool for Thai students of English, and for foreign students of Thai. TOLL includes a built-in Thai-English/English-Thai dictionary – look up words by clicking, typing, or cut-and-paste. For the benefit of foreign (and younger Thai) readers, TOLL is able to insert spaces between Thai words. TOLL serves several purposes. It is: (a) a test-bed for innovative Internet software development, (b) a workshop for research in new approaches to language education, (c) a low/no-cost delivery system for high-quality educational resources, (d) a starting point in the long research struggle to build sophisticated Thai/English translation software. English education in Southeast Asia is greatly hampered by the high cost of publishing, and the general lack of high-quality, parallel translations: TOLL lets us prepare these at almost no cost. For foreigners, learning to read Thai, Lao, Burmese, Khmer, and other nonsegmented languages can be extremely difficult. These writing systems do not put spaces between words; often, one feels one has to be able to speak fluently before learning to read! TOLL lets you experiment with new ways of indicating – and gradually eliminating – word breaks. Foreigners can learn exactly the same way that Thai schoolchildren do. Finally, TOLL lets you test software for automated text segmentation and parallel alignment of translations. These are extremely difficult research problems, but solving them is a necessary first step toward the development of a variety of commercial and academic software.
+ So far, however, only a free demo is available online (click on the link to The Wonderful Wizard of Oz; beware only that TOLL uses JavaScript but not Java), and it is very good. Let us hope to have more!
+ There is also an interesting free TOLL Toolkit, but there are still no download instructions.
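To give an idea of what inserting spaces between the words of an unsegmented script involves, here is a minimal sketch of greedy longest-match segmentation against a word list. This is a common baseline technique, not TOLL's own method (which the site does not describe); the function name `segment` and the three-word lexicon are assumptions made for the example.

```python
def segment(text, lexicon):
    """Greedy longest-match segmentation of unspaced text against a word list."""
    max_len = max(map(len, lexicon), default=1)
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                break
        else:
            size = 1  # no lexicon entry starts here: emit a single character
        words.append(text[i:i + size])
        i += size
    return words

# Toy lexicon (hypothetical): "I", "love", "you" in Thai.
toy_words = ["ฉัน", "รัก", "คุณ"]
unspaced = "".join(toy_words)
print(" ".join(segment(unspaced, set(toy_words))))
```

Greedy matching fails whenever the longest match at some position is the wrong one, which is one reason the text above calls segmentation a difficult research problem.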
TRIPTIC is a trilingual corpus developed for the analysis of prepositions in English, French and Dutch. There is no TRIPTIC page on the web (at least I did not find one), and all the information given here is taken from Michael Barlow's Parallel Corpora Page.
The corpus forms part of the empirical data used for research on the contrastive analysis of prepositions (PhD thesis). The object of the study, which assumes the cognitive linguistic framework, is to examine in which way languages converge and diverge in the semantic structure of so-called function words. The corpus consists of 2,000,000 words, one half fiction, the other half non-fiction material. All paragraphs are aligned, allowing automatic selection of the n-th paragraph in the 3 languages. The original text files are now being converted into a database structure (4th Dimension on Macintosh), in order to facilitate the description of the prepositions under study.
For further information, contact Hans Paulussen.
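Paragraph-level alignment of the kind described for TRIPTIC means that one index retrieves the same passage in every language. A minimal sketch, assuming each version is held as a list of paragraphs in document order; the function name `nth_paragraph` and the sample texts are invented, and the real corpus is stored in a 4th Dimension database rather than in Python lists.

```python
def nth_paragraph(texts, n):
    """Return the n-th paragraph (counting from 1) of every language version."""
    lengths = {lang: len(paras) for lang, paras in texts.items()}
    # Selection by index is only meaningful if all versions have the same paragraph count.
    if len(set(lengths.values())) > 1:
        raise ValueError(f"versions are not paragraph-aligned: {lengths}")
    if any(not 1 <= n <= size for size in lengths.values()):
        raise IndexError(f"paragraph {n} is out of range")
    return {lang: paras[n - 1] for lang, paras in texts.items()}

# Hypothetical three-language sample.
texts = {
    "en": ["First paragraph.", "Second paragraph."],
    "fr": ["Premier paragraphe.", "Deuxième paragraphe."],
    "nl": ["Eerste alinea.", "Tweede alinea."],
}
print(nth_paragraph(texts, 2))
```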
http://morph.ldc.upenn.edu/Catalog/LDC94T4A.html
This set of three compact discs contains documents provided to the LDC by the United Nations, for use in research on machine translation technology. The documents come from the Office of Conference Services at the UN in New York and are drawn from archives that span the period between 1988 and 1993. This publication contains the English, French and Spanish archives, with data from each language stored on a separate disc in the set. Care has been taken to arrange the document files in a parallel directory structure for each language, so that corresponding translations of a document are found directly by means of the directory paths and file names. The text is prepared with SGML encodings and the 8-bit ISO 8859-1 Latin1 character set, in which accented letters and some other non-ASCII characters occupy the upper 128 entries of the character table. English, French and Spanish subcorpora CDs are also available separately.
Available from the LDC through membership or for $2,500 ($1,000 for each separate subcorpus), following this link.
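A parallel directory structure like the one described for the UN discs lets translations be paired purely by path. The sketch below assumes that a document has exactly the same relative path under each language root and that files are plain ISO 8859-1 bytes; the actual naming scheme on the discs may differ, and the directory names, file names and helper functions are all invented for the demonstration.

```python
import tempfile
from pathlib import Path

def parallel_documents(roots):
    """Map each relative path present under every language root to its per-language files."""
    relative = [
        {p.relative_to(root) for p in Path(root).rglob("*") if p.is_file()}
        for root in roots.values()
    ]
    shared = set.intersection(*relative) if relative else set()
    return {
        rel: {lang: Path(root) / rel for lang, root in roots.items()}
        for rel in sorted(shared)
    }

def read_latin1(path):
    """Decode a file as ISO 8859-1, so bytes above 127 become accented letters."""
    return Path(path).read_text(encoding="iso-8859-1")

# Demo on a throwaway tree standing in for the three discs (hypothetical layout).
with tempfile.TemporaryDirectory() as tmp:
    for lang, text in {"en": "session", "fr": "séance", "es": "sesión"}.items():
        doc = Path(tmp) / lang / "1990" / "doc001.sgm"
        doc.parent.mkdir(parents=True)
        doc.write_bytes(text.encode("iso-8859-1"))
    roots = {lang: Path(tmp) / lang for lang in ("en", "fr", "es")}
    docs = parallel_documents(roots)
    demo = {lang: read_latin1(p) for lang, p in docs[Path("1990/doc001.sgm")].items()}
print(demo)
```

Files present under only some of the roots are silently dropped by the intersection, which is usually what is wanted when building aligned triples.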
http://web.bham.ac.uk/lxw715/English/EnglishChineseCorpus.html
The corpus, developed at Birmingham by Wang Lixun (homepage), consists of complete texts. Each individual 'text-based' file is one complete novel, essay, or other kind of article. All the files are classified by their names. The corpus is carefully balanced: half the texts are of English and half of Chinese origin, and the genres of texts are also properly balanced in both language sources. The texts are classified by genre + topic + language. The corpus will be dynamic, i.e., one can keep adding more texts to it. The work is still in progress, and the present aim is to create a 10-million-word parallel corpus (half English, half Chinese) for research and language-teaching purposes. After this aim has been achieved, the corpus will be further expanded and will ultimately reach 100 million words. [2001 April 23].
The texts of the main works of the famous French philosopher Gilles Deleuze are freely available directly on his site. Besides the French originals, English and Spanish translations are available as well, so you can construct at least a three-language parallel text (if not a true parallel corpus).
http://www-pll.who.ch/programmes/pll/cat/cat_resources.html
Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page: http://www.who.ch/copyright.htm).