Localized Resources .1.

(A-D) Afrikaans - Albanian - Albanian (Caucasic) - Arabic - Armenian - Australian lgs. - Awabakal (Yuin-Kuric) - Azerbaijani - Barbadian (Creole English) - Basque - Bengali - Berbice (Creole Dutch) - Bulgarian - Catalan - Chinese (incl. Cantonese) - Chiricahua (Apache) - Commonwealth Antillean Creole French - Commonwealth Winward Islands Creole English - Czech - Danish - Dutch
(E) English (Modern) - English (Old & Middle) - Esperanto - Estonian
(F-I) Farsi - Finnish - French - French Antillean Creole French - Frisian - Gaelic - Georgian - German - Gothic - Greek (Classic and Modern) - Gujarati - Gulf of Guinea Creole Portuguese - Guyana Creole English - Guyanais (Creole French) - Haitian (Creole French) - Hebrew - Hindi - Hungarian - Icelandic (incl. Old Norse) - Indoeuropean - Indonesian - Irish (incl. Ogamic, Old & Middle Irish) - Italian
(J-R) Jamaican Creole English - Japanese - Karelian - Korean - Krio (Sierra Leone Creole English) - Kru (Liberian Pidgin English) - Latin - Latvian - Leeward Islands Creole English - Lithuanian - Livonian - Louisiana Creole French - Macaísta (Macau Creole Portuguese) - Malay - Maltese - Mambila - Manx - Mari (Eastern Meadow) - Mauritian Creole (Isle de France CF) - Mescalero (Apache) - Miskito Creole English - Mitchif (French-Cree mixed language) - Nahuatl - Neapolitan - Negerhollands (Creole Dutch) - Norwegian - Occitan - Palenquero (Creole Spanish) - Panjabi - Polish - Portuguese (incl. Brazilian & Galego-Portuguese) - Romanian - Russian
(S-Z) Sardinian - Saxon (Old) - Scots - Serbo-Croate - Singhalese - Slavonian (Old Church Slavonian) - Slovak - Slovene - Spanish - Sumerian - Swahili - Swedish - Tagalog - Taino - Tamil - Tetun (East Timorese) - Thai - Tibetan - Tocharian (A & B) - Tok Pisin (Creole English) - Turkish - Ukrainian - Upper Guinea Creole Portuguese - Urdu - Uzbek - Veps - Vietnamese - Virgin Islands Creole English - Welsh - West African Pidgin English.

I provide here language-specific links to corpora, e-texts and NLP resources in general. Resources already presented in the previous sections are also repeated here whenever relevant.

(A-D)

Afrikaans.

Lord's Prayer in the Germanic Languages: http://www.georgetown.edu/cball/oe/pater_noster_germanic.html

A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface for comparing any two texts. This page was prepared for use in classes in linguistics, history of the English language, and Old English. [2001 July 13].

Albanian.

ECI/MCI 1 Corpus (European Corpus Initiative Multilingual Corpus):

http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.

IntraText Library: http://www.eulogos.it/default.htm

A small library of interactive hypertexts for free reading and search maintained by Èulogos. All the texts are literary, many of them religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (among them Albanian, German, English, Spanish, French, Italian, Latin, Finnish).

Albanian (Caucasic).

Albanian mss (TITUS): http://titus.fkidg1.uni-frankfurt.de/armazi/armazi03.htm#eProjekt

The "Digitization of the Albanian palimpsest manuscripts from Mt. Sinai" is a part of the Armazi project by TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien). The two manuscripts in question which were discovered in 1975 in St. Catherine's monastery of Mt. Sinai, raise eminent interest by the fact that the text represented in their lower script can be claimed to be the only authentic manuscript remnants written in the language of the so-called "Albanians" of the Caucasus (so far, only a few inscriptions and some words reported by Armenian authors have been known of their language which is believed to be an ancestor of present-day Udi). The two palimpsests will be photographed in situ, digitized and prepared for deciphering; the results of the project which will consist in a preliminary edition will be made accessible to the scholarly community via the World Wide Web. Attention: these pages are encoded using Unicode / UTF8. The special characters as contained in them can only be displayed and printed by installing a font that covers Unicode such as the freely downloadable TITUS font TITUS Cyberbit Basic.

Arabic.

Al-Hayat 1995 Corpus (CD for the Mac): http://www.ltg.ed.ac.uk/helpdesk/faq/Texts-html/0050.html

It is said to be the largest Arabic corpus available. It has some 140MB of data (about 23M words) in about 44,000 files, all in Arabic Mac encoding (a superset of ISO 8859-6). Not on the web, but available from Dr. Imad Bachir / Al-Hayat Publishing Company / Kensington Centre / 66 Hammersmith Road / LONDON W14 8YT +44 (0) 171 602 9988 (Tel); +44 (0) 171 602 4963 (Fax).

CALLHOME Egyptian Arabic Transcripts: http://morph.ldc.upenn.edu/Catalog/LDC97T19.html

The text component of the package includes transcripts and documentation files. The transcripts cover a contiguous 5 or 10 minute segment taken from 120 unscripted telephone conversations between native speakers of Egyptian Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect of ECA represented in this corpus is Cairene Arabic. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Complete auditing information on the speakers represented in the transcripts (including gender, channel quality and so on) is also included. Available by FTP from the LDC, either through membership or for US$500; cf. this page.

Xerox Research Center Europe - MultiLingual Theory and Technology:

http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and of the relations between them. Free web demos of some of their tools are available, including for Arabic. For more details cf. the Tools section.

Armenian.

Lieder and Songs Texts Page: http://www.recmusic.org/lieder/

This page contains the texts of 1842 poets and 1217 composers in 22 different languages: one text is in Armenian. Better than nothing ... For a more detailed description, see the E-Texts section.

Australian languages.

Australian Aboriginal Languages: http://www.dnathan.com/VL/austLang.htm

This site points to web resources for more than thirty languages: a lot of linguistic papers, but also a few native language e-Texts.

Awabakal (Yuin-Kuric).

Luke in Awabakal & English & Lexicon: http://www.lakemac.infohunt.nsw.gov.au/communit/bible/bible.htm

The Awabakal are a long-extinct tribe (since the second half of the XIX century; cf. Tindale's Catalogue of Australian Aboriginal Tribes) of Lake Macquarie (not Port Macquarie), south of Newcastle, New South Wales. Their language was part of a cluster of dialects spanning the New South Wales coast, usually grouped in the Yuin-Kuric family. I learned from Daryn McKenny of the Yamuloong Resource Centre that there is hope for a revival of the language, and that people in the former Awabakal region are about to start teaching the language once again. Awabakal has sometimes been grouped with the Wanarua language (also extinct, since the second half of the XX century) in a single language unit (cf. Voegelin & Voegelin 1977, p. 355), but contemporary local people strongly object to this grouping.
A translation of Luke into Awabakal was made between 1827 and 1831 by the missionary Lancelot Threlkeld and Biraban (McGill), who lived in the Newcastle/Lake Macquarie area. Threlkeld actually lived with the Awabakal people and was a friend of one of their leaders, Biraban (otherwise known as McGill). He also wrote what was probably one of the largest attempts of his time to understand any of the Aboriginal languages of Australia, viz. the book An Australian Language as spoken by the Awabakal, the people of Awaba or Lake Macquarie being an account of their language, traditions and customs, which was published after Threlkeld's death by Dr Fraser and printed by the New South Wales Government Printer in 1892, with an appendix by John Fraser, BA, LL.D. Over that same period, the Awabakal population declined to such an extent that only a few families could be found. For this reason, much to his regret, Threlkeld was unable to publish the gospel or make any attempt to teach the Awabakal people to read it. The original manuscript of Threlkeld's fourth revision is still in the Sir George Grey collection at the Auckland Public Library, New Zealand. The current edition (1997, published on the occasion of the bicentenary of white settlement in Newcastle, NSW) however makes use of the larger 1892 publication, An Australian Language. Both the full text and a smaller sample are freely available in PDF format. [2001 May 18; Rev. 2001 November 28].

Azerbaijani.

Lieder and Songs Texts Page: http://www.recmusic.org/lieder/

This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: one is in Azerbaijani (Berio's well-known folk song). For a more detailed description, see the E-Texts section.

Barbadian (Creole English).

Barbadian Creole Fragments: http://www.ling.su.se/Creole/Text_Collection.shtml

A few freely downloadable old documents of Barbados Caribbean English Creole (1652, 1692, 1859) from the Creolist Archives Text Collections. [2001 August 8].

Basque.

EEBS (Systematic Compilation of Modern Basque): [homepage missing]

EEBS is a 3 million word Basque corpus being compiled by the IXA Group of the Computer Science Faculty of the University of the Basque Country along with UZEI, an association that works on Basque terminology and lexicography. This scant information is taken from the IXA pages; I know nothing else about it. [2001 April 30].

EUSLEM Basque Lemmatizer/Tagger: http://ixa.si.ehu.es/ingeles/dokument/EUSLEM.html

The EUSLEM automatic lemmatizer/tagger, developed by the Lengoaia Naturalaren Prozesamendurako IXA Taldea, is a basic tool for applications such as automatic indexing, document databases, syntactic and semantic analysis, analysis of text corpora, etc. Its job is to give the correct lemma of a text-word, as well as its grammatical category. The lemmatizer/tagger is proving a great help in the second phase of the EEBS project (Systematic Compilation of Modern Basque). A tagset has also been developed for Basque: it is a three-level system which the user can parametrise when using the programme. The first level includes seventeen general categories (noun, adjective, verb, etc.). In the second, each category tag is further refined by subcategory tags. The last level includes other interesting morphological information (case, number, etc.). Information on availability is lacking. [2001 April 30].
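
To make the idea of a user-selectable three-level tag concrete, here is a minimal Python sketch. It is not taken from EUSLEM: the class name `Tag`, the `render` helper and the tag strings are hypothetical, chosen only to show how a tag could be truncated to the level the user asks for.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tag:
    """A hypothetical three-level tag: category, subcategory, morphology."""
    category: str            # level 1, e.g. "noun"
    subcategory: str = ""    # level 2, e.g. "common"
    morphology: tuple = ()   # level 3, e.g. ("absolutive", "singular")

    def render(self, level: int = 3) -> str:
        """Return the tag truncated to the requested level (1, 2 or 3)."""
        if level not in (1, 2, 3):
            raise ValueError("level must be 1, 2 or 3")
        parts = [self.category]
        if level >= 2 and self.subcategory:
            parts.append(self.subcategory)
        if level >= 3:
            parts.extend(self.morphology)
        return "+".join(parts)


# Illustrative values only; the real EUSLEM tagset is not documented here.
tag = Tag("noun", "common", ("absolutive", "singular"))
level1 = tag.render(1)
level2 = tag.render(2)
level3 = tag.render(3)
```

With this layout a coarse analysis simply asks for level 1, while a finer one asks for level 3, without re-tagging the text.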

IXA Group for Natural Language processing: http://ixa.si.ehu.es/ingeles/main.html

Lengoaia Naturalaren Prozesamendurako IXA Taldea has been working for more than ten years on Natural Language Processing, and all the outcomes it has achieved are related to Basque. The site provides some information on NLP projects involving the Basque language and reports some of the most important results of the Group, such as: MORFEUS, a Basque morphological analyzer; EDBL (The Lexical DataBase for Basque), a database of about 70,000 entries; EUSLEM, a Basque lemmatizer/tagger; and XUXEN, a speller for Basque. Only the last is commercial software (distributed by HIZKIA Informatika, Atrium - le Forum, F-64100 Baiona, e-mail). Information on the availability of all the other products is lacking. You can however make inquiries to the group’s e-mail. [2001 April 30].

MORFEUS Basque Morphological Analyzer: http://ixa.si.ehu.es/ingeles/dokument/MORFEUS.html

MORFEUS, developed by the Lengoaia Naturalaren Prozesamendurako IXA Taldea, performs a basic task in the automatic processing of Basque. It assigns to each token in a text its lemma as well as all its possible morphological analyses. The remaining modules make use of that output to accomplish disambiguation and identify lexical units. The output is given in text format, but they are currently working to provide it in SGML format. Information on availability is lacking. [2001 April 30].

Pear-Chaplin Basque Corpus: http://www.lrc.salemstate.edu/aske/basquecorpus/

This is a spoken corpus (MP3 files + transcripts) of Basque made by Jon Aske. The corpus consists of 42 recordings of narratives by Basque speakers, made in the fall of 1993. In these narratives the speakers recounted a silent movie they had just watched to a friend who had not watched it. Two silent movies were used for this purpose: The Pear Movie (see Chafe 1981) and a short collage of scenes from Charlie Chaplin’s Modern Times. Half of the narratives (21) are about each of these two short films. In addition, a few other recordings were made during the recording sessions, such as those of some of the speakers describing pictures taken from children’s books (the Ramona pictures) and of some speakers telling jokes or other stories in between recording sessions. The author used the transcripts of these narratives in his doctoral dissertation (Department of Linguistics, UC Berkeley, 1997) and released them on March 7, 2000 so that other Basque linguists may take advantage of all the work that went into making and transcribing these recordings. If you use them, the only request is that you acknowledge their source. All materials are freely available: you can download the transcripts, MP3s (check also the detailed list), pictures, and movies individually. The page says that a CD version is also available. Contact the author, Jon Aske, Department of Foreign Languages, Salem State College, 352 Lafayette Street, Salem, MA 01970; e-mail. [2001 May 1].

Bengali.

EMILLE Project: http://www.emille.lancs.ac.uk/

The EMILLE Project is a three-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. EMILLE will generate written language corpora of at least 9,000,000 words each for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
For more information and details cf. the Multilingual corpora section. [2001 June 17].

Berbice (Creole Dutch).

Berbice Dutch Fragments: http://www.ling.su.se/Creole/Text_Collection.shtml

Two small 1827 and 1881 freely downloadable fragments in Berbice (Caribbean) Dutch Creole from the Creolist Archives Text Collections. [2001 August 8].

Bulgarian.

Aleksova's Corpus of Spoken Bulgarian: http://www.hf.uio.no/east/bulg/mat/Aleksova/

This corpus, hosted on the Oslo Department for East European and Oriental Studies BRSM page, consists of transcribed conversations in family contexts and was collected by Krasimira Aleksova of the Faculty of Slavic Studies, Sofia University, for her dissertation Ezikovi procesi v semejstvoto (varchu material ot stolicata) (1994).
You are free to download the entire corpus to your own computer, as long as it is to be used for research purposes only. Your browser should be configured to read Cyrillic text in the encoding corresponding to Code Page 1251 for Windows. If you use a Macintosh, please note that if your web browser is set to show Cyrillic with Mac (Apple Cyrillic) fonts, your saved pages are automatically converted to Apple Cyrillic. If you want to retain the CP 1251 encoding on your Mac, you could download the source for each page and strip the HTML code from it.
+ The Avtoreferat of Aleksova's Dissertation is available at this link.
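
For readers who save the page source as suggested above, the following Python sketch shows one way to decode Code Page 1251 bytes, strip the HTML and re-encode the result as UTF-8. It is only an illustration: the function name `cp1251_html_to_text` and the sample string are my own, and the regular-expression tag stripping is a rough shortcut rather than a full HTML parser.

```python
import re
from html import unescape


def cp1251_html_to_text(raw: bytes) -> str:
    """Decode CP1251 bytes and remove HTML markup, returning plain text."""
    text = raw.decode("cp1251")
    # Drop script/style blocks first, then any remaining tags.
    text = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", text)
    text = re.sub(r"(?s)<[^>]+>", " ", text)
    # Resolve character entities such as &amp; and normalise whitespace.
    text = unescape(text)
    return re.sub(r"\s+", " ", text).strip()


# Made-up sample standing in for a downloaded page source.
sample = "<html><body><p>Добър ден</p></body></html>".encode("cp1251")
plain = cp1251_html_to_text(sample)
utf8_bytes = plain.encode("utf-8")
```

In practice `raw` would come from reading the saved file in binary mode, so that no automatic conversion to another Cyrillic encoding takes place.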

Annotated Bulgarian Texts: http://www.hf.uio.no/east/bulg/mat/annot.html

So far there are only three texts, all freely available. Beware that by "annotation" they mean only the (mainly lexical) notes. [2001 April 26].

Bulgarian Research and Study Materials: http://www.hf.uio.no/east/bulg/mat/

A small page from the Department for East European and Oriental Studies of the University of Oslo. It contains, besides some other minor resources, the Aleksova, Nikolova and Mavrodieva corpora of Bulgarian. [Last checked 2001 April 26].

Bulgarian Poetry Archive: http://www.cs.columbia.edu/~radev/faq/poetry/

This free Electronic Literary Text Archive contains 301 poems in HTML format, which can be freely viewed and copied (beware of the Bulgarian font encoding: they don't say what it is).

Bulgarian Text Corpus (TRACTOR): http://solaris3.ids-mannheim.de/tractor/telri/SOF2/sof2-01.htm

Corpus of Bulgarian Texts (news, prose, poetry, legal, etc), CES encoded, from Linguistic Modelling Laboratory, Bulgarian Academy of Sciences, Sofia, Bulgaria.
Available under subscription to TRACTOR.

ECI/MCI 1 Corpus (European Corpus Initiative Multilingual Corpus):

http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.

Mavrodieva’s Transcripts of Bulgarian Parliament Debates Corpus: http://www.hf.uio.no/east/bulg/mat/Parliament/

This corpus, hosted on the Oslo Department for East European and Oriental Studies BRSM page, consists of transcripts made by Ivanka Mavrodieva from recordings of the debates of the 7th Great National Assembly on 31 October, 1990. Recordings of the broadcasts were made at the Sociolinguistics Laboratory at Sofia University with the assistance of Angel Angelo. The texts amount to a total of approx. 20,000 words. You may freely view the three parts separately online. [2001 April 26].

Myriobiblos: http://www.myriobiblos.gr/

Myriobiblos, The E-text Library of the Church of Greece, provides many free HTML e-texts (you can browse and save them) from Classical to modern Greek; there are also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. the E-Texts section. [2001 May 1].

Nikolova's Corpus of Spoken Bulgarian: http://www.hf.uio.no/east/bulg/mat/Nikolova/

The e-texts available here, amounting to approx. 50,000 word tokens, represent one half of the corpus that served as the base for Cvetanka Nikolova: Chestoten rechnik na balgarskata razgovorna rech (A Frequency Dictionary of Colloquial Bulgarian), Nauka i izkustvo, Sofia 1987. The texts, hosted on the Oslo Department for East European and Oriental Studies BRSM page, are made available with the kind permission of Cvetanka Nikolova and through the assistance of Tzvetomira Venkova, who did computer entry from the original index cards.
The original recordings were made with a hidden portable tape recorder in randomly selected places (shops, streetcars, offices, homes) during the years 1975 to 1977. Most informants are from Sofia, while 3 recordings were made in Samokov and two in Plovdiv. None of the informants were aware of being recorded at the time. As the purpose of the original corpus was to investigate lexical variation in spoken Bulgarian, phonetic variants were not taken into account when the dictionary Chestoten rechnik na balgarskata razgovorna rech was made. However, forms like "k'vo ot t'va" and "nema" are preserved in these e-texts. The files contain only the sentences uttered by the informants, without indication of speakers' identities and turn changes. They are therefore best suited to investigations of phenomena that can be described within the realm of the sentence. For investigations of discourse phenomena, Aleksova's Corpus of Spoken Bulgarian will provide better material.
You are free to download the entire corpus to your own computer, as long as it is to be used for research purposes only. Your browser should be configured to read Cyrillic text in the encoding corresponding to Code Page 1251 for Windows. If you use a Macintosh, please note that if your web browser is set to show Cyrillic with Mac (Apple Cyrillic) fonts, your saved pages are automatically converted to Apple Cyrillic. If you want to retain the CP 1251 encoding on your Mac, you could download the source for each page and strip the HTML code from it.
+ The Avtoreferat of Nikolova's Dissertation is available at this link.
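
Since these texts were the basis of a frequency dictionary and keep colloquial forms such as "k'vo", a word-frequency count is an obvious first experiment. The Python sketch below is only an assumption about how one might do it: the tokenisation rule (letters, optionally joined by apostrophes) and the two sample sentences are mine, built from the forms quoted above, and are not Nikolova's procedure.

```python
import re
from collections import Counter

# Letters of any script, optionally joined by apostrophes (k'vo, t'va).
TOKEN = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")


def token_frequencies(sentences):
    """Count lower-cased word tokens over an iterable of sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(tok.lower() for tok in TOKEN.findall(sentence))
    return counts


# Made-up sample sentences using the transliterated forms quoted above.
sample = ["K'vo ot t'va?", "Nema k'vo."]
freq = token_frequencies(sample)
most_common = freq.most_common(1)
```

Because the pattern relies on Unicode letter classes, the same function should also work on the Cyrillic files once they have been decoded to text.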

Catalan.

Biblioteca Virtual Joan Lluís Vives: http://lluisvives.com/

Besides a lot of other library services and resources, this site offers a rich collection of Catalan e-texts of every genre, all freely browsable (and downloadable) in HTML format with frames. The site is wholly in Catalan. [2002 February 18].

CHILDES Database: http://childes.psy.cmu.edu/

Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Catalan CHILDES corpus is downloadable from this link.
See the Multilingual and Parallel Corpora section for a fuller entry.

CLiC (Centre de Llenguatge i Computació): http://clic.fil.ub.es (English version)

The Centre de Llenguatge i Computació (Universitat de Barcelona), formerly LaReLC (Laboratori de Recerca en Lingüística Computacional), works mainly on Hispanic NLP and Lexical Acquisition (AQUILEX project). In collaboration with DLSI-UPC it has contributed to the development of NLP tools and to the maintenance of the DLSI-UPC/CLiC-UB Tools Demo, which can be queried online. The old site of LaReLC-UB is still working, but it is better to refer to the new CLiC one. [2001 April 30; rev. 2001 October 28].

Corpus Textual Informatitzat de la Llengua Catalana: http://pdl.iec.es/home/index.asp

This corpus was made by the IEC (Institut d'Estudis Catalans) as the main source for the Diccionari de la llengua catalana (DIEC). The default free online search access for unregistered users allows only very simple queries (and the same holds true for the dictionary access). Registration is however free of charge. The corpus is supposed to be quite large, but its composition is not clearly stated anywhere on the site. [2002 February 23].

DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics):

http://www.lsi.upc.es/~nlp/
The main research fields of the Departament de Llenguatges i Sistemes Informàtics (Universitat Politècnica de Catalunya) are related to the use of multilingual lexical resources, information extraction from documents, design of NL interfaces, basic NLP techniques (tagging, parsing, sense disambiguation), NL understanding and Knowledge Representation. The group has been working as a multidisciplinary group since 1986, together with linguists from the CLiC (Universitat de Barcelona). This collaboration has produced several projects, among them a suite of NLP tools, viz. MACO+ (morphological analyzer corpus-oriented), EWN (Top-ontology semantic analyzer), Relax (POS Tagger), TreeTagger (POS Tagger), TACAT (parser). A Demo of the full suite can be freely queried online. Availability is otherwise unknown: contact Núria Castell i Ariño. [2001 April 30].

DLSI-UPC/CLiC-UB Tools Demo: http://nipadio.lsi.upc.es/cgi-bin/demo/demo.pl

This Demonstration page of Morphosyntactic analysis, tagging and parsing of unrestricted text allows you to freely submit some sentences in Spanish, Catalan or English to the full suite of tools developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). The components of the suite are MACO+ (morphological analyzer corpus-oriented), EWN (Top-ontology semantic analyzer), Relax (POS Tagger), TreeTagger (POS Tagger), TACAT (parser). [2001 April 30; last checked 2001 October 28].

MACO+ (Morphological Analyzer Corpus-Oriented): http://www.lsi.upc.es/~nlp/descr-eines.html

The MACO+ Morphological Analyzer Corpus-Oriented accepts unrestricted text as input. The tool tokenizes the text and produces as output all the possible morphological interpretations of each token. It is able to recognize and deal with numbers, proper nouns, punctuation, dates, abbreviations, multiwords, etc. Spanish, Catalan and English versions are available. MACO+ is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ MACO+ can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
+ A MACO+ only online tagging service is freely provided by UNED.
+ A Maco+ & Relax online tagging service is also freely provided by UNED.
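
To show what "all possible morphological interpretations for each token" means in practice, here is a toy Python sketch. It is in no way MACO+ itself: the three-word lexicon, the tag names and the functions `tokenize` and `analyze` are hypothetical, and a real analyzer would of course use a full lexicon and morphological rules.

```python
import re

# Hypothetical toy lexicon: surface form -> list of (lemma, tag) readings.
LEXICON = {
    "the": [("the", "DET")],
    "saw": [("saw", "NOUN"), ("see", "VERB")],
    "dog": [("dog", "NOUN")],
}

NUMBER = r"\d+(?:[.,]\d+)*"


def tokenize(text):
    """Split text into number, word and punctuation tokens."""
    return re.findall(NUMBER + r"|\w+|[^\w\s]", text)


def analyze(text):
    """Return, for every token, all the readings the toy lexicon allows."""
    result = []
    for token in tokenize(text):
        if re.fullmatch(NUMBER, token):
            readings = [(token, "NUM")]
        elif re.fullmatch(r"[^\w\s]", token):
            readings = [(token, "PUNCT")]
        else:
            # Unknown words keep a single placeholder reading.
            readings = LEXICON.get(token.lower(), [(token.lower(), "UNKNOWN")])
        result.append((token, readings))
    return result


analysis = analyze("The dog saw 2 cats.")
```

Note that the ambiguous form "saw" keeps both of its readings: choosing between them is left to a later tagging step.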

EWN: http://www.lsi.upc.es/~nlp/descr-eines.html

The EWN top-ontology semantic analyzer accepts as input morphologically analyzed text (the output of MACO+) and adds to each lemma the nodes in the EWN top-ontology that subsume it. EWN is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].

Relax: http://www.lsi.upc.es/~nlp/descr-eines.html

The Relax POS tagger takes as input the output of the morphological analyzer MACO+, and selects the right POS and lemma for each word in the given context. Currently, it produces an output with over 97% precision. The language model may be easily improved with the addition of new context constraints expressed in CG formalism, either hand-written or statistically acquired. Spanish, Catalan and English versions are available. Relax is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ Relax can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
+ A Maco+ & Relax online tagging service is also freely provided by UNED.
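
The general idea of context constraints pruning the readings left by a morphological analyzer can be sketched as follows. This is a deliberately simplified rule filter of my own, not Relax's actual algorithm nor real CG syntax: the function `disambiguate`, the constraint `no_verb_after_det` and the tag names are all hypothetical.

```python
def disambiguate(sentence, constraints):
    """Pick one reading per token, discarding readings a constraint rejects.

    sentence: list of (token, [(lemma, tag), ...]) pairs.
    constraints: functions (prev_tag, candidate_tag) -> bool, where True
    means the candidate reading is rejected in that left context.
    """
    chosen = []
    prev_tag = None
    for token, readings in sentence:
        survivors = [
            reading for reading in readings
            if not any(rule(prev_tag, reading[1]) for rule in constraints)
        ]
        # Never discard every reading: fall back to the original list.
        lemma, tag = (survivors or readings)[0]
        chosen.append((token, lemma, tag))
        prev_tag = tag
    return chosen


# Hypothetical constraint: a verb reading cannot directly follow a determiner.
def no_verb_after_det(prev_tag, tag):
    return prev_tag == "DET" and tag == "VERB"


ambiguous = [
    ("the", [("the", "DET")]),
    ("saw", [("see", "VERB"), ("saw", "NOUN")]),
]
tagged = disambiguate(ambiguous, [no_verb_after_det])
```

Adding a new constraint is just a matter of appending another function to the list, which loosely mirrors the point made above about improving the language model with new constraints.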

Selecció de Poesia Catalana: http://www.intercom.es/folch/poesia/

It's only an anthology of poems, but it covers 179 Catalan poets, ranging from the classical (e.g. Ausias March) to modern unknown and unpublished authors. All texts are freely readable as HTML pages (one for each author), and you can save them in this format. Some biographical information is provided as well.

TACAT: http://www.lsi.upc.es/~nlp/descr-eines.html

TACAT is a parser that takes as input the output of the morphological analyzer MACO+, or the output of any tagger, and produces a syntactic analysis. The tool is a chart-based parser, with some extensions for flexibility. It uses CFG grammars, which can produce either a complete sentence analysis or just partial parsing and chunk recognition. Spanish and Catalan versions are available. TACAT is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informàtics - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ TACAT can be queried freely online (Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
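
As an illustration of chart parsing with a CFG over tagger output, here is a small CKY chart filler in Python. CKY is only one well-known chart algorithm and I do not know whether TACAT uses it; the grammar `BINARY_RULES`, the tag names and the function `cky_chart` are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form over POS tags.
BINARY_RULES = {
    ("DET", "NOUN"): {"NP"},
    ("VERB", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}


def cky_chart(tags):
    """Fill a CKY chart: chart[(i, j)] holds the labels spanning tags[i:j]."""
    n = len(tags)
    chart = defaultdict(set)
    for i, tag in enumerate(tags):
        chart[(i, i + 1)].add(tag)
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for mid in range(start + 1, end):
                for left in chart[(start, mid)]:
                    for right in chart[(mid, end)]:
                        chart[(start, end)] |= BINARY_RULES.get((left, right), set())
    return chart


tags = ["DET", "NOUN", "VERB", "DET", "NOUN"]
chart = cky_chart(tags)
is_sentence = "S" in chart[(0, len(tags))]
```

Even when no label covers the whole input, the smaller chart cells (here the two NP spans) remain available, which is the sense in which a chart can support partial parsing and chunk recognition.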

WESS Web (Western European Specialists Section): http://www.lib.virginia.edu/wess/etexts.html

The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.

Chinese (Mandarin & Cantonese).

Bible of University of Maryland Parallel Corpus Project: http://benjamin.umd.edu/parallel/bible.html

The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. the Parallel Corpora section. [2001 May 1].

CALLHOME Mandarin Chinese Transcripts: http://morph.ldc.upenn.edu/Catalog/LDC96T16.html

The text component of the package includes transcripts and documentation files. The transcripts cover a contiguous 5 or 10 minute segment taken from 120 unscripted telephone conversations between native speakers of Mandarin Chinese. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Auditing and demographic information on the speakers represented in the transcripts (including gender, channel quality and so on) is also included. Available by FTP from the LDC, either through membership or for US$500.

CHILDES Database: http://childes.psy.cmu.edu/

Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
+ The Mandarin CHILDES corpus is available on the web.
+ The Cantonese CHILDES corpus is available on the web.
See the Multilingual and Parallel Corpora section for a fuller entry.

Chinese Forest - The Chinese Treebank Project: http://umiacs.umd.edu/labs/CLIP/forest.html

The home page of the Chinese Treebank Project by Mary Ellen Okurowski, Ron Dolan, and John Kovarik provides guidelines for segmentation and the tagset. The project is maintained by the Computational Linguistics and Information Processing Laboratory (CLIP) at the University of Maryland Institute for Advanced Computer Studies (UMIACS). Very interesting stuff, but information on the corpus they use is lacking, and there isn't anything to download.

Chinese Language Processing: http://gamma.is.tokushima-u.ac.jp/member/kita/NLP/Chinese

A part of the Speech and Language Web Resources, the big reference archive by Kenji Kita, Tokushima University. It is smaller than the Japanese one but is in English. [2001 April 28].

Chinese Philosophical E-text Archive: http://sangle.web.wesleyan.edu/etext/index.html

A precious repository of free Chinese e-texts ranging from the classical pre-Qin and Song to the Qing and modern ones. It holds electronic versions of Chinese philosophical texts created by the Wesleyan Confucian E-text Project; electronic versions of Chinese philosophical texts from other sources, some with minor improvements; and information on, and links to further information about, the preparation and use of these texts. [2001 May 1].

Chinese Treebank: http://morph.ldc.upenn.edu/ctb/

The Chinese Treebank is a project, started in Summer 1998 at Penn (i.e. the University of Pennsylvania), whose aims were to build a 100-thousand-word Chinese Treebank and to work towards a community consensus on guidelines that would include the input of influential researchers from Taiwan, Singapore, Hong Kong, mainland China and the States. In the process, two workshops and a number of meetings were held in the USA and abroad between July 1998 and October 2000. The final release of the treebank was in December 2000. Authors of the PennChT are Martha Palmer, Mitch Marcus, Tony Kroch, Fei Xia, Nianwen Xue and Fu-Dong Chiou. [2001 April 27].
+ The Chinese Treebank Final Release is available from the LDC under catalogue number LDC2000T48. It consists of about 100K words, 4185 sentences, and 325 data files from 325 articles of Xinhua newswire published between 1994 and 1998, all in GB encoding. The format is the same as the English Penn Treebank, except that original file information such as "DOCNO" and "DATE" was kept in the data files. It costs 100 US$ for non-members of the LDC.
+ The Penn Guidelines for Chinese Treebank are freely available as PS or PDF files, namely: Segmentation Guidelines (or PDF), POS Tagging Guideline (or PDF), Bracketing Guideline (or PDF). There is also a freely downloadable PS paper dealing with Developing Guidelines and Ensuring Consistency for Chinese Text Annotation, from the Proceedings of the second International Conference on Language Resources and Evaluation (LREC-2000), Athens, Greece, 2000.
+ A few PS or PDF sample files are freely downloadable, viz.: Sample1 (or PDF) and Sample2 (or PDF).
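
Since the release is described as using the same bracketed format as the English Penn Treebank, a reader can get a feel for the data with a few lines of code. The following is only a minimal sketch of reading one labelled bracketing and listing its words; the sample sentence and its tags are made up for illustration and are not taken from the corpus, and the "DOCNO" and "DATE" header lines mentioned above are not handled.

```python
from typing import List, Union

# A tree is either a plain token (str) or a list: [label, child, child, ...].
Tree = Union[str, List["Tree"]]


def tokenize(text: str) -> List[str]:
    """Split a bracketed string into '(' , ')' and plain tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens: List[str]) -> Tree:
    """Parse one bracketed constituent, consuming tokens from the front of the list."""
    token = tokens.pop(0)
    if token != "(":
        return token
    node: List[Tree] = []
    while tokens[0] != ")":
        node.append(parse(tokens))
    tokens.pop(0)  # discard the closing ")"
    return node


def leaves(tree: Tree) -> List[str]:
    """Collect the words; in a (TAG word) pair the word is the second element."""
    if isinstance(tree, str):
        return []
    if len(tree) == 2 and all(isinstance(item, str) for item in tree):
        return [tree[1]]
    words: List[str] = []
    for child in tree[1:]:  # tree[0] is assumed to be the constituent label
        words.extend(leaves(child))
    return words


if __name__ == "__main__":
    # Hypothetical sample, invented for this sketch.
    sample = "(IP (NP (NR 张三)) (VP (VV 来)))"
    tree = parse(tokenize(sample))
    print(tree)
    print(leaves(tree))
```

The sketch assumes every bracket carries a label; files that wrap each sentence in an extra unlabelled pair of brackets would need one more case in `leaves`.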

Christian Classics - Ethereal Library: http://www.ccel.org/

A collection of patristic texts, mainly in English translation, but with a few Latin originals, and some translations in other languages as well (mainly Russian and Chinese). All texts are freely downloadable with theological markup (ThML) or HTML, plain or zipped.

CMA - Chinese Morphological Analyzer: http://www.basistech.com/products/Chinese-analyzer.html

The Chinese Morphological Analyzer (CMA) from Basis Technology is a portable engine that incorporates comprehensive Chinese dictionaries for segmenting Chinese texts in both Traditional Chinese and Simplified Chinese scripts. CMA can segment Chinese text into words, index and search large collections of Chinese documents (or text fields in databases), generate lists of words from free-running Chinese text, and identify parts of speech and word-formation processes. The price is not stated, and to request an evaluation version you have to send them an e-mail. There is, however, also an online demo for both Simplified and Traditional Chinese. [2002 February 17].

ECI/MCI 1 Corpus (European Corpus Initiative Multilingual Corpus):

http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.

Guo Jin's Chinese PH corpus (Segmented Version): ftp://ftp.cogsci.ed.ac.uk/pub/chinese/

This is a cleaned-up segmented version of Guo Jin's Chinese PH corpus made by Chris Brew and Julia Hockenmaier (cf. the readme). The source is news text from the P.R. of China's Xinhua news agency, written between January 1990 and March 1991. The corpus is in GB code. It contains 2,447,719 words and 3,753,291 characters, 492,875 of which are paragraph delimiters. The average number of characters per word is 1.533. The cleaning up mostly concerns punctuation marks and the recognition of proper names. For instance, full stops followed by double quotes are separated in this version. Some segmentation inconsistencies have been removed as well. Segments are separated by newlines. The corpus is freely downloadable by FTP as a single file. [Last check 2001 August 5].
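
Because the segments are separated by newlines, a figure such as the average number of characters per word is easy to recompute. The sketch below assumes one segment per line and decodes the file with Python's `gb2312` codec; the codec choice, the function names and the in-memory sample are assumptions made for illustration, and blank lines are simply skipped, which may not match how the readme counts paragraph delimiters.

```python
from typing import Iterable, Iterator, Tuple


def segment_stats(segments: Iterable[str]) -> Tuple[int, int, float]:
    """Return (word_count, char_count, mean_chars_per_word) for one-segment-per-line input."""
    words = 0
    chars = 0
    for line in segments:
        segment = line.strip()
        if not segment:  # skip blank lines
            continue
        words += 1
        chars += len(segment)
    mean = chars / words if words else 0.0
    return words, chars, mean


def read_segments(path: str, encoding: str = "gb2312") -> Iterator[str]:
    """Yield the lines of a GB-encoded file; the encoding name is an assumption."""
    with open(path, encoding=encoding, errors="replace") as handle:
        yield from handle


if __name__ == "__main__":
    # Tiny invented sample standing in for the real file.
    sample = ["新华社\n", "北京\n", "电\n", "。\n"]
    print(segment_stats(sample))
```

With the downloaded file at hand, `segment_stats(read_segments(path))` would give the corresponding figures for the whole corpus.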

Hongyin Tao's Seminar in Corpus Linguistics: http://www.bol.ucla.edu/~ht37/teach/222/222_info.html

This page is only a schematic description of a stimulating EALC 222 Winter 2002 course held at UCLA by Hongyin Tao (homepage), but it also provides some good references, especially on CJK computational analysis. [2002 February 17].

Hub-4NE - Mandarin Broadcast News Transcripts 1997:

http://morph.ldc.upenn.edu/Catalog/LDC98T24.html
This collection consists of 30 hours of recorded broadcasts and transcripts that have been drawn from the following sources: [1] Voice of America (VOA): United States Information Agency Radio; [2] People's Republic of China Television (CCTV); [3] Commercial radio based in Los Angeles, CA. (KAZN-AM). Of these three sources, the first two comprise the bulk of the collection and are represented in roughly equal amounts; only a relatively small sample of KAZN-AM recordings are included, owing to the relatively high proportion of unusable material (commercials, local traffic reports loaded with California place names, etc). The transcripts were created by native speakers of Mandarin working at the LDC; they are in GB-encoded form, with SGML tagging to identify story boundaries, speaker turn boundaries and phrasal pauses; these tags include time stamps to align the text with the speech data. Word segmentation (white-space between words) is included. A working DTD is provided, and the markup is consistent with that of the 1997 English and Spanish Hub-4 collections. Available only by membership to the LDC.

Hub-5 Mandarin Transcripts: http://morph.ldc.upenn.edu/Catalog/LDC98T26.html

This release of Hub-5 Mandarin training data consists of 42 calls derived from the CallFriend Mandarin Chinese Mainland Dialect (language-ID) collection. The transcribed data is intended as additional training data in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), also sponsored by the U.S. Department of Defense. The transcripts cover a contiguous 5-30 minute segment taken from a recorded conversation lasting up to 30 minutes. Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Each caller was allowed to place only one telephone call. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America.
Hub-5 Mandarin speech and transcript data may be obtained ($1000) by email; cf. also this link.

Mandarin Chinese News Text Corpus: http://morph.ldc.upenn.edu/Catalog/LDC95T13.html

This newswire text corpus includes about 250 million GB-encoded characters. The Mandarin News Corpus includes text from various journalistic sources: newspaper text from Renmin Ribao (People's Daily); radio scripts from China Radio International; and newswire text from the Xinhua newswire service. The format of this corpus uses a labelled bracketing, expressed in the style of SGML (Standard Generalized Markup Language). The header fields provided by the sources, which give information such as topic, date and article ID, have been retained. The articles cover a variety of topics, including international and domestic news, sports and culture. Available only through LDC membership or for $200.

Sources of Chinese-Language Text Files: http://www.webcom.com/~bamboo/chinese/text.html

A gold mine of links to downloadable Chinese e-texts on the Web, ranging from the Classics to online newspapers and usenet archives. Beware, however, that many of the links are often down.

TDT2 Mandarin Text Corpus (Topic Detection & Tracking Corpus):

http://morph.ldc.upenn.edu/Catalog/LDC99T38.html
Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: find topically homogeneous sections (segmentation), detect the occurrence of new events (detection), and track the reoccurrence of old or new events (tracking). This CD-ROM release consists of the TDT2 Text Corpus, and contains only the Mandarin TDT2 text data. It was created by removing the English data from the TDT2 Multilanguage Text Corpus. It includes a number of revisions to the data and annotations, represents the most complete and correct version available as of 6 December 1999, and supersedes all previous releases of TDT2 Mandarin text data. The data were collected daily over a period of six months (January-June 1998) from Xinhua News Agency and Zaobao News. Available only through LDC membership or for $1000.

TDT2 Multilanguage Text Corpus (Topic Detection & Tracking Corpus):

http://morph.ldc.upenn.edu/Catalog/LDC99T39.html
The Topic Detection and Tracking (TDT) English and Mandarin Corpus.
See under Multilingual and Parallel Corpora section. The two subcorpora were also released separately, cf. TDT2 English Text Corpus Version 2 and TDT2 Mandarin Text Corpus.

Wang Lixun Chinese - English Parallel Corpus:

http://web.bham.ac.uk/lxw715/English/EnglishChineseCorpus.html
This corpus, developed at Birmingham by Wang Lixun (homepage), is still in progress. The present aim is to create a 10-million-word parallel corpus (half English, half Chinese) for research and language-teaching purposes. For more details, cf. under the Parallel Corpora section. [2001 April 23].

Web Concordancer: http://vlc.polyu.edu.hk/scripts/concordance/WWWconcapp.htm

The Web Concordancer site, by the Virtual Language Centre of the Polytechnic University of Hong Kong, presents a few indexed corpora (English, French, Chinese, Japanese) that can be freely browsed with the ConcApp program. Corpora available include the Brown Corpus, Sherlock Holmes stories, the South China Morning Post, etc. [2002 February 17].

Chiricahua (Apache).

Chiricahua and Mescalero Apache Texts: http://etext.lib.virginia.edu/apache/

This is a web edition, created at Virginia Library ETC (cf. the Electronic Text Center at the University of Virginia file), of the Chiricahua and Mescalero Apache Texts by Harry Hoijer, originally published by University of Chicago Press, 1938. There are 46 Chiricahua and 9 Mescalero texts, all free, but not very download-friendly, because they are displayed in frames: you can e.g. display a bilingual Apache - English version (either with or without notes), or Apache only, or notes only, or the English version with ethnological notes, etc. Correct display requires a special Apache - Navajo font (a Times New Roman supplement), developed at the San Juan School District and freely downloadable from this page. [2001 July 21; rev. 2002 January 25].

Commonwealth Antillean Creole French.

Commonwealth Antillean CF Historical Texts: http://www.ling.su.se/Creole/Text_Collection.shtml

A scanty collection of old relics of the Commonwealth Antillean Creole French, with a fragment from St. Lucia (1900), and a few very small old fragments of the now disappearing Grenada Creole French (1650s). All texts are freely downloadable and come from the Creolist Archives Text Collections. [2001 August 13].

Commonwealth Windward Islands Creole English.

Windward Islands CE Historical Texts: http://www.ling.su.se/Creole/Text_Collection.shtml

A few freely downloadable old witnesses and texts of Commonwealth Windward Islands Creole English from St. Vincent (1791, 1817, 1828, 1834); from the Creolist Archives Text Collections. [2001 August 8].

Czech.

CNC (Czech National Corpus): http://ucnk.ff.cuni.cz/english/index.html (Czech also)

The Czech National Corpus (Český Národní Korpus) is a large repository of computer-based texts which is being built at the Faculty of Arts, Charles University in Prague. Its present (1999) size of over 100 million words, which is constantly growing, makes it the foremost and largest resource of information on and about the language and, through it, about most things reflected in the language. The CNC, which is accessible to the broad academic public at home and abroad, is run under a series of programmes which allow the user to search for linguistic units, be they words, word forms, parts of words or collocations, and for their frequency, grammatical and other characteristics. In its balanced, fairly representative shape, the CNC will be released by the turn of 1999/2000, but its provisional use has been offered to anyone since 1996. Search results are returned in a concordance format, which lets the user study the real contextual use of words and the like. The concordances thus obtained can be further processed, sorted, classified etc. This makes work with language more of a fast play than the old-time drudgery. Next to the contemporary CNC (100 million words and more, later on), two small corpora, those of Old Czech and Spoken Czech, are being built at the same time.
+ CNC online. A small part of the CNC (some 20 million words) is open online on the web. Of course, the internet access has several limits besides the corpus size: you cannot search for a phrase or a combination of words, only for a single word; the context around the search results is limited to 60 characters; and the corpus is not morphologically annotated.
+ For access to the full CNC, you have to address the administrator and ask for special permission, which is granted to anyone for non-commercial purposes.
+ For a quick reference page in English see here. [Last Rev. 2001 April 28].
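
To make the concordance format concrete, here is a minimal keyword-in-context sketch in the spirit of the online interface described above: a single-word search with a fixed window of context. It is not the CNC's own software; the function name, the sample sentence and the reading of "60 characters" as a limit per side are all assumptions made for illustration.

```python
import re
from typing import List, Tuple


def kwic(text: str, word: str, width: int = 60) -> List[Tuple[str, str, str]]:
    """Return (left, match, right) for each whole-word hit.

    Up to `width` characters of context are kept on each side of the match.
    """
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    hits = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        hits.append((left, match.group(0), right))
    return hits


if __name__ == "__main__":
    # Invented sample text, not drawn from the corpus.
    sample = "Praha je hlavní město. Stará Praha je krásná."
    for left, hit, right in kwic(sample, "praha", width=10):
        # Right-align the left context so the keywords line up in a column.
        print(f"{left:>10} [{hit}] {right}")
```

Sorting the returned tuples by their right or left context is the usual next step when inspecting collocations.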

Czech National Library Database: http://digit.nkp.cz/search.htm

The CzNL collection of digitized documents contains mostly manuscripts from this country and other European countries (some Latin as well). For the time being, there are titles important for Czech history and culture, related to Hussitism, the pre-Hussite and post-Hussite periods, the activity of the Jesuits on this territory, and other important events and personalities of our culture. There are also documents coming from abroad or informing about foreign cultures. Digitization of Oriental manuscripts is also planned. Free online search through the corpus is the main service provided. In some cases, the search results point directly to entire digital copies of documents, so that you can access them on the Internet. For this purpose, the graphic information is in Internet quality. The main distribution medium remains the CD-ROM.

Czech Newspaper Corpus (TRACTOR):

http://solaris3.ids-mannheim.de/tractor/telri/PRA1/pra1-01.htm
Two 5m word newspaper corpora of Czech and other miscellaneous Czech corpus files, from Computational Fund of the Czech Language, Charles University, Prague, Czech Republic.
Available under subscription to TRACTOR.

ECI/MCI 1 Corpus (European Corpus Initiative Multilingual Corpus):

http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
Go to ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.

Lieder and Songs Texts Page: http://www.recmusic.org/lieder/

This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Czech has 97 texts (incl. a few Slovak and Moravian). For a more detailed description, see in the E-Texts section.

Prague Dependency Treebank: http://ufal.mff.cuni.cz/pdt/pdt_05.html

The Prague Dependency Treebank (PDT) is a morphologically and syntactically annotated corpus of Czech. The Prague Dependency Treebank is - to a certain extent - modelled after the Penn Treebank but it uses the dependency syntax representation of sentences. It has three layers: (1) morphological (uses word forms, tags, lemmas); (2) analytical, or surface syntax (uses dependencies and analytical functions of dependencies); (3) tectogrammatical, which captures linguistic meaning (contains tectogrammatical functions such as Actor, Patient, Addressee, etc.). The text material contains samples from the following sources: (1) Lidové noviny (daily newspaper), 1991, 1994, 1995; (2) Mladá fronta Dnes (daily newspaper), 1992; (3) Českomoravský Profit (business weekly), 1994; (4) Vesmír (scientific magazine), Academia Publishers, 1992, 1993. The internal format of the files is based on SGML, cf. the Document Type Definition (DTD). The current version of PDT (0.5) contains 456705 tokens (words and punctuation) in 26610 sentences and 576 files annotated on the morphological and analytical levels. In order to keep the results of NLP applications comparable, the data has been divided into a training set (19126 sentences), a development test set (3697 sentences) and a (cross-)evaluation test set (3787 sentences). [2001 April 28].
+ The PDT Version 0.5 is freely available for research purposes, provided you fill in and submit the License Agreement.
+ There are also many freely available papers and much documentation on the site.
+ PDT version 1.0 is forthcoming. The CD version was announced for April 2001 (but when I checked on April 28th it was not there yet). Please come back for more detail on the distribution and availability.
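
The sentence counts quoted above for PDT 0.5 can be checked with a little arithmetic: the three sets should add up to the stated total, and their relative sizes show how the data was apportioned. The sketch below uses only the figures given in this entry; the variable and function names are arbitrary.

```python
from typing import Dict

# Figures as quoted in the PDT 0.5 description above.
PDT_05_SPLITS: Dict[str, int] = {
    "training": 19126,
    "development test": 3697,
    "evaluation test": 3787,
}
TOTAL_SENTENCES = 26610
TOTAL_TOKENS = 456705


def split_shares(splits: Dict[str, int], total: int) -> Dict[str, float]:
    """Return each set's share of the total, after checking the sizes add up."""
    if sum(splits.values()) != total:
        raise ValueError("split sizes do not add up to the stated total")
    return {name: size / total for name, size in splits.items()}


if __name__ == "__main__":
    for name, share in split_shares(PDT_05_SPLITS, TOTAL_SENTENCES).items():
        print(f"{name}: {share:.1%}")
    print(f"mean tokens per sentence: {TOTAL_TOKENS / TOTAL_SENTENCES:.2f}")
```

The three sets do sum to 26610 sentences, so the figures in the description are internally consistent.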

Xerox Research Center Europe - MultiLingual Theory and Technology:

http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. There are free web demos of some of their tools on the web also for Czech. For more details cf. the Tools section.

Danish.

Bergenholz Corpus: [homepage missing]

The Bergenholz Corpus is a 4 million word corpus of Danish newspaper articles, magazines, and books from 1987-1990. Availability, homepage and other information are unknown. A few samples are available from Daniel Hardt's first lesson on Tools for Corpus Linguistics. [2001 May 1].

Bible of University of Maryland Parallel Corpus Project: http://benjamin.umd.edu/parallel/bible.html

The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].

CHILDES Database: http://childes.psy.cmu.edu/

Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Danish CHILDES corpus is freely downloadable in zipped format.
See under Multilingual and Parallel Corpora section for a fuller file.

DNA (Dansk Nationallitterært Arkiv): http://www.kb.dk/elib/lit/dan/

The Danish National Literary Archive, maintained by the Kongelige Bibliotek, holds a few Danish texts in html format, freely readable online. There are no download-friendly versions of the texts.

ECI/MCI 1 Corpus (European Corpus Initiative Multilingual Corpus):

http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.

Korpus2000: http://www.korpus2000.dk

Korpus 2000 is a major Danish corpus which consists of approximately 28 million words from texts written from 1998 to 2002. The aim of the Korpus 2000 project is to document the use of the Danish language around the year 2000 in the form of a text corpus in which one can look up words and phrases via the web. The corpus has been POS-tagged. You can freely query for words, then refine your query by POS, and get a list of concordances (one simple sentence of context each); for each occurrence you get, data about the text source can subsequently be queried.
+ Korpus 90 is another Danish corpus available on this site. It was compiled of text excerpts written in the period 1988-1992. This corpus is quite similar to the Korpus 2000 in its composition and size and hence serves as an older comparative corpus for the Korpus 2000. [2002 October 16].

Lieder and Songs Texts Page: http://www.recmusic.org/lieder/

This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Danish has 59 texts. For a more detailed description, see in the E-Texts section.

Lord's Prayer in the Germanic Languages: http://www.georgetown.edu/cball/oe/pater_noster_germanic.html

A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].

Opera e-Libretto: http://www.geocities.com/voyerju/libretti.html

Opera e-Libretto (Collection Ulric Voyer) is a collection of 220 free e-texts of opera libretti. Displayed libretti are in Danish (Nielsen), Italian, French, English, German and Russian. All texts are in html, usually split into several files according to act divisions. For a more detailed file cf. under the E-Texts section. [2001 June].

VISL on-line Danish Corpus (Visual Interactive Syntax Learning, Department of Language and Communication, University of Southern Denmark - Odense):

http://visl.hum.ou.dk
VISL (cf. under Corpora and Corpus Linguistics) provides queries online to pure-text Corpora in Danish, German, English and Spanish and to the Portuguese tagged one. The service is for members only (see at this page).

WESS Web (Western European Specialists Section): http://www.lib.virginia.edu/wess/etexts.html

The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.

Dutch.

CELEX Dutch Database: http://www.kun.nl/celex.

CELEX, the Dutch Centre for Lexical Information, has three separate databases, Dutch, English and German, all of which are open to external users. The Dutch database, version N3.1, was released in March 1990 and contains information on 381,292 present-day Dutch wordforms, corresponding to 124,136 lemmata. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. For Dutch and English lemma homographs, frequencies have been disambiguated on the basis of the 42.4 m. Dutch INL and the 17.9 m. English Collins/COBUILD text corpora. Furthermore, information has been collected on syntactic and semantic subcategorisations for Dutch. The CELEX database is open free of charge (at least until 2001) to all academic researchers and people associated with other not-for-profit research institutes. Users will only be charged a one-off Dfl. 100,= for the CELEX User Guide. In order to log in to CELEX, a personal account should be obtained from Richard Piepenbrock, project manager: see at this page.

CHILDES Database: http://childes.psy.cmu.edu/

Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Dutch CHILDES corpus is downloadable.
See under Multilingual and Parallel Corpora section for a fuller file.

Dutch Data Corpus (TRACTOR): http://solaris3.ids-mannheim.de/tractor/telri/LEI/lei-01.htm

300,000 words of corpus data in Dutch (including 20,000 words POS tagged in PAROLE format), from Institute for Dutch Lexicology, Leiden, The Netherlands.
Available under subscription to TRACTOR.

ECI/MCI 1 Corpus (European Corpus Initiative Multilingual Corpus):

http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
Look at ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.

EdoE Dutch Poetry Anthology: http://pmwww.cs.vu.nl/home/edoe/

The EdoE site maintains a free anthology of Dutch poetry arranged in two sets, before 1880 and after 1880. You can save the two pages in HTML.

Lord's Prayer in the Germanic Languages: http://www.georgetown.edu/cball/oe/pater_noster_germanic.html

A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].

Project Laurens Janszoon Coster (Nederlandstalige klassieke literatuur in elektronische edities):

http://www.dds.nl/~ljcoster/english.html
Laurens Jz. Coster Ontwerp is currently trying to set up a comprehensive collection of Dutch literary masterpieces on the World Wide Web. Although most of the pages are (and will continue to be) in Dutch only, the home page is in English. It offers a lot of Dutch texts in html format, freely readable online. There are no download-friendly versions of the texts.

TRIPTIC (En-Fr-Du) (TRIlingual Parallel Text Information Corpus):

http://www.ruf.rice.edu/~barlow/para.html
TRIPTIC is a trilingual corpus developed for the analysis of prepositions in English, French and Dutch. There is no TRIPTIC page on the web (at least I did not find one), and all the information I give is taken from Michael Barlow's Parallel Corpora Page. For further information see under Multilingual and Parallel Corpora section. Contact: Hans Paulussen

Xerox Research Center Europe - MultiLingual Theory and Technology:

http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. There are free web demos of some of their tools on the web also for Dutch. For more details cf. the Tools section.

WESS Web (Western European Specialists Section): http://www.lib.virginia.edu/wess/etexts.html

The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.