5. Localized Resources
I provide here language-specific links to corpora, e-texts and NLP resources in general. Resources already presented in the previous sections are also repeated here whenever relevant.
The Sardinian Text Database of the University of Cologne project Limba e cultura de sa Sardigna is a repository of freely downloadable HTML texts in a fair number of Sardinian varieties, ranging from Campidanesu, Logudoresu, Nugoresu and Gadduresu to minor ones.
The Archive of this Newspaper is online and can be freely queried for simple strings of text and/or article authors. Although of Sardinian interest, the language is Italian.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for use in classes on linguistics, the history of the English language, and Old English. [2001 July 13].
The Helsinki Corpus of Older Scots has been compiled at the University of Helsinki as a supplement to the diachronic part of the Helsinki Corpus of English Texts: Diachronic and Dialectal. It consists of 830,000 untagged words from texts dated 1450-1700; the texts represent fifteen different prose genres: acts of Parliament, burgh records, trial proceedings, histories, biographies, travel, dialogues etc.
+ A version on CD-ROM is available from ICAME; a small sample is available on the web, a list of the sources as well. Contact: Department of English, University of Helsinki, Portania 311, 00100 Helsinki, Finland.
+ A 4.9 MB version is freely available from the OTA (Oxford Text Archive) site, for non-commercial use only, after you have sent them a statement.
A good collection of medieval Germanic poetry texts, ranging from Old English to Middle English, Old Norse, Old (High and Low) German, etc. There are also some 15th century Scots texts, such as William Dunbar's Tretis, Robert Henryson's Practysis of Medecyne, Ralph the Collier and The Knightly Tale of Gologras and Gawain. All texts are freely downloadable (but often split across several files). [2001 August 27].
A 173 KB corpus of biblical texts in Scots, based on "A history of the Scots Bible" by G. Tulloch. Freely available from the OTA (Oxford Text Archive) site, for non-commercial use only, after you have sent them a statement.
http://solaris3.ids-mannheim.de/tractor/telri/BEL/bel-03.htm
Literature: Aurora corpus, 15 texts, 14 authors, 46k words; 2 texts, 39k words; proverbs 56k words (total 140k words). From Faculty of Mathematics, Belgrade University, Yugoslavia.
Available under subscription to TRACTOR.
The Croatian E-text project (electronic text) is an effort to bring printed Croatian books to the Web. Following the Project Gutenberg philosophy, the Croatian E-text Project strives to make as much information available as possible in Croatian and in other languages (there is also a conspicuous group of texts in Spanish), translated from Croatian or about Croatia. There is a lot of information on Croatian e-writing. The texts appear in the form of links to other sites (many to Croatia Net in the USA) where they are deposited, and their quality is very variable, ranging from pages mainly devoted to literary criticism to true downloadable texts.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
http://solaris3.ids-mannheim.de/tractor/telri/BEL/bel-07.htm
Electronic Morphological Dictionary from Faculty of Mathematics, Belgrade University, Yugoslavia.
Available under subscription to TRACTOR.
The Croatian National Corpus, directed by Marko Tadić (e-mail), aims to (1) compile and process a multimillion-word corpus of contemporary Croatian; (2) bring selected texts and dictionaries of older Croatian authors, as well as translations of capital works such as the Bible, Talmud and Qur'an, into the form of a corpus; (3) make a supplement to the orthography dictionary on the basis of corpus processing; (4) investigate, on the basis of relevant corpora and in diachronic and synchronic perspective, the procedures for forming neologisms in the creation of Croatian terminologies; (5) publish the corpora as well as the results of their processing on magnetic (CD-ROM) and electronic media (CARNET).
+ 30m, a first 30-million-word corpus of contemporary Croatian (Win codepage 1250), is freely queryable online (forms only). [2001 May 18]
A single legal text (6k words) from Faculty of Mathematics, Belgrade University, Yugoslavia. Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/BEL/bel-02.htm
Newspapers: Short news, 9k words; Vukova Danica (Culture), 90k words. From Faculty of Mathematics, Belgrade University, Yugoslavia.
Available under subscription to TRACTOR.
The Oslo Corpus of Bosnian Texts (Korpus bosanskih tekstova na Univerzitetu u Oslu) is a corpus of approximately 1.5 million words, encoded with the CWB (the IMS Corpus Workbench) developed at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart, to which a suitable interface was added at the Text Laboratory (Tekstlaboratoriet ved Historisk-filosofisk fakultet, Universitetet i Oslo). It comprises several different genres: fiction (novels and short stories), essays, children's stories, folklore, Islamic texts, legal texts, and newspapers and journals. The texts, written by authors from Bosnia and Herzegovina, were for the most part published in the 1990s. The corpus provides a new and different basis for research into the language of Bosnia and Herzegovina. The project has been supervised by assistant professor Janne Bondi Johannessen, while professor Svein Mønnesland was responsible for the selection and compilation of the texts. Gordana Vranic and Kemila Basic made the texts electronically available (by scanning and adaptation) as simple text files. Diana Santos built the corpus from those files in the format required by the corpus tools used (see below for more information), and also wrote the Web interface.
The Oslo Corpus of Bosnian Texts can be queried online and is available to anybody who wants to use it for non-commercial academic research. To obtain permission, and to be given a username and password, please send an e-mail to tekstlab@ilf.uio.no with the following information: NAME / ADDRESS / AFFILIATION / suggested USERNAME for the corpus / suggested PASSWORD to use the corpus / STATEMENT 1 ("I Promise to use the Oslo Corpus of Tagged Norwegian Texts for academic, non-commercial purposes only") / STATEMENT 2 ("I promise not to distribute my password to any person or institution") / STATEMENT 3 ("In any published or unpublished material that has benefitted from use of the corpus, I will make sure that a proper reference to the corpus by its name and Internet-address is included"). Alternatively, fill in the form online at this page.
http://solaris3.ids-mannheim.de/tractor/telri/BEL/bel-01.htm
News Corpus: TANJUG Agency news, 32 days (Sept-Nov 1995, May-Jun 1996), 1.2m words. From Faculty of Mathematics, Belgrade University, Yugoslavia.
Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/BEL/bel-05.htm
Textbooks: 16 texts, various subjects and levels, 263k words. From Faculty of Mathematics, Belgrade University, Yugoslavia.
Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/BEL/bel-04.htm
Translations: 10 texts (incl Orwell's 1984 and Plato's Republic), 322k words. From Faculty of Mathematics, Belgrade University, Yugoslavia.
Available under subscription to TRACTOR.
The EMILLE Project is a three-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. EMILLE will generate written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
For more information and details, see the Multilingual corpora section. [2001 June 17].
The Green Library (or Cactus Library), a project of the Saint-Petersburg School of Religion and Philosophy (SRPh), has by now a few Greek, Russian and French texts. More titles are announced as forthcoming, also in Latin, French and Old Slavonian (Bible). All titles are freely downloadable in PDF format. For more cf. under the E-Texts section. [2001 May 1].
Slovak free Electronic Literary Text Archive. About 60 Slovak authors are represented. Documents are stored in HTML format. All pages are in Slovak.
30 Raw Text Files in Slovak, one per letter of the Slovak Alphabet. Encoded in PC Latin 2 (Code Page 852). From Computational Linguistics Laboratory, Comenius University, Bratislava, Slovakia.
Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/LJU1/lju1.htm
A 500,000-word English-Slovene and Slovene-English corpus covering various domains; the Multext East Corpus; and a Newspaper Corpus. From Language and Speech Group, Intelligent Systems Dept, Jozef Stefan Institute, Ljubljana, Slovenia.
Available under subscription to TRACTOR.
Kosmac Corpus: 18 works from 1952-72 by the late (1910-1981) Slovenian writer Ciril Kosmac. From Institute for Slovene Language "Fran Ramovs", Slovene Academy for Sciences and Arts, Ljubljana, Slovenia.
Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/LJU2/lju2-03.htm
Extracts from the Slovenian daily, DELO, 6th May to 17th June 1997, POS tagged, 111k words, 923kb. From Institute for Slovene Language "Fran Ramovs", Slovene Academy for Sciences and Arts, Ljubljana, Slovenia.
Available under subscription to TRACTOR.
FIDA, the Corpus of Slovene Language, represents a reference corpus of the Slovene language and was compiled within the framework of a joint project involving four partners; two from the academic/research sphere and two commercial ones. Corpus compilation started in spring 1997 and was concluded by the end of 2000. The FIDA Corpus of Slovene Language contains just over 100 million words of contemporary Slovene texts, encompassing a broad range of Slovene language variants and registers as found in the Slovene press, complemented by some texts from the Internet and speech transcripts. For a detailed description see also the overview of corpus texts. The FIDA Corpus of Slovene can be accessed with standard web browsers (e.g. IE 4.0 or higher, Netscape 4.0 or higher) but it is not freely available and can be accessed only with a valid username and password. Fees for corpus access are not stated, and you have to submit the form first in order to know their price policy.
+ A limited version, the Sample search, is freely available. It returns at most 10 concordance lines; all other functions of the software work. When the identification dialogue box opens, enter username figost and password fida. [2001 May 18].
1 M words, free to download + online concordances.
Slovenian free Electronic Literary Text Archive. A rich collection of Slovenian authors, and some translations as well. Documents are stored in HTML format. All the pages are in Slovenian.
This Anthology, a part of the "Repertorio Ibero e Iberoamericano de Ensayistas y Filósofos", provides a fair amount of Spanish-language text in philosophical, critical and scholarly genres. All texts (HTML format only) are free, but many are only extracts of larger works.
Lexicons and morphological analysis for Spanish: the ARIES Natural Language Tools make up a lexical platform for the Spanish language. They include a large Spanish lexicon, lexical maintenance and access tools, and a morphological analyser/generator. There is a free demo for single words, or you can submit a text by e-mail (see this page for more information) for morphological analysis and spell checking, but the real lexicons and the C/C++ access tools cost money.
This site provides a fair amount of Spanish-language text of Christian religious interest. The texts are free, but suitable only for online reading and not for downloading (longer texts are split across several HTML pages, so you have to painstakingly reassemble them).
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more, see the Parallel Corpora section. [2001 May 1].
The Biblioteca Virtual Miguel de Cervantes of the Universidad de Alicante is an initiative that plans to digitize some 30,000 works in Spanish (with a strong focus on Latin America) and give free public access to them through the web. The texts are suitable for reading online but not for downloading (longer texts are split across several HTML pages, so you have to painstakingly reassemble them). Searching inside texts is not supported.
The text component of the package includes transcripts and documentation files. The transcripts cover a contiguous 5 or 10 minute segment taken from 120 unscripted telephone conversations between native speakers of Spanish. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Auditing and demographic information on the speakers represented in the transcripts (including gender, channel quality and so on) are also included.
Available as an FTP download from the LDC, through membership or for $800.
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Spanish CHILDES corpus is available on the web.
There is also a Spanish component in the Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish) made from narratives elicited using Mercer Mayer's "frog story" picture book.
See under Multilingual and Parallel Corpora section for a fuller file.
The Centre de Llenguatge i Computació (Universitat de Barcelona), formerly LaReLC (Laboratori de Recerca en Lingüística Computacional), works mainly on Hispanic NLP and Lexical Acquisition (AQUILEX project). In collaboration with DLSI-UPC it has contributed to the development of NLP tools and to the maintenance of the DLSI-UPC/CLiC-UB Tools Demo, which can be queried online. The old LaReLC-UB site is still working, but it is better to refer to the new CLiC one. [2001 April 30; rev. 2001 October 28].
This site provides free texts in Spanish Language of comedies from the Siglo de Oro. The texts provided are good, standard texts, prepared by a number of specialists in the field of Golden Age theater studies. They have been edited from their original form for ease of electronic circulation and may not appear in a format identical to that submitted for inclusion on the homepage. Most of the texts were edited from an early imprint or imprints and, as such, are "diplomatically" edited in their present form. These texts are, therefore, not to be considered as "critical" or definitive editions but should prove adequate for most purposes to which they will be put. All are freely readable in HTM and you can save them in this format.
Contact Professor Aquilino Sánchez: asanchez@fcu.um.es.
http://www.lsi.upc.es/~nlp/
The main research fields of the Departament de Llenguatges i Sistemes Informatica (Universitat Politècnica de Catalunya) are related to the use of multilingual lexical resources, information extraction from documents, design of NL interfaces, basic NLP techniques (tagging, parsing, sense disambiguation), NL understanding and Knowledge Representation. The group has been working as a multidisciplinary group since 1986, together with linguists from the CLiC (Universitat de Barcelona). This collaboration has produced several projects, among them a suite of NLP tools, viz. MACO+ (morphological analyzer corpus-oriented), EWN (Top-ontology semantic analyzer), Relax (POS Tagger), TreeTagger (POS Tagger), TACAT (parser). A Demo of the full suite can be queried freely online. Availability is otherwise unknown: contact Núria Castell i Ariño. [2001 April 30].
This Demonstration page of Morphosyntactic analysis, tagging and parsing of unrestricted text allows you to freely submit some sentences in Spanish, Catalan or English to the full suite of tools developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). The components of the suite are MACO+ (morphological analyzer corpus-oriented), EWN (Top-ontology semantic analyzer), Relax (POS Tagger), TreeTagger (POS Tagger), TACAT (parser). [2001 April 30; last checked 2001 October 28].
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
The e-library of Elaleph Com so far provides 413 Spanish literary texts (besides 11 English ones). The texts, all in PDF format, are allegedly free, but the key to unlock each file is e-mailed to you only after you have given them your personal data.
The EWN top-ontology semantic analyzer accepts as input morphologically analyzed text (the output of MACO+) and adds to each lemma the nodes in the EWN top-ontology that subsume it. EWN is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño.
This elegant site by Pedro Benito Somalo offers all the works of Gonzalo de Berceo, fully provided with vocabulary, biography, critical documentation, accessory papers, etc. All texts (in flowery HTML, often divided by chapters) are freely browsable and downloadable. The site has now moved from Geocities and is even easier to browse. [2002 February 18; rev. 2006 October 17].
http://morph.ldc.upenn.edu/Catalog/LDC98T29.html
This corpus contains a portion of the acoustic data designated as the training set for the 1997 DARPA HUB-4 Spanish Benchmark. It contains speech and transcripts of 30 hours of broadcast news from the following sources: Televisa, Univision and VOA. All acoustic files are in NIST SPHERE format, without compression. The sample data are 16-bit linear PCM, 16 kHz sample frequency, single channel. Most files contain 30 minutes of recorded material, and some contain 60 or 120 minutes (approximately); the sampling format requires roughly 2 megabytes (MB) per minute of recording, so the file sizes are typically around 60 MB, with some files ranging up to 120 or 240 MB. The transcripts are in SGML format, using the same markup conventions that have been applied to the other 1997 Broadcast News speech corpora (in English and Mandarin), and are transmitted by FTP, not on the CD-ROMs with the speech data.
Available only through LDC membership.
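The storage figures quoted for this corpus follow directly from the sample format. Here is a minimal sketch of the arithmetic, assuming uncompressed PCM and ignoring the small NIST SPHERE header; the function name is illustrative and not part of any LDC tool.

```python
def pcm_megabytes(minutes, sample_rate_hz=16_000, bits_per_sample=16, channels=1):
    """Approximate size in MB (10**6 bytes) of uncompressed linear PCM audio."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return bytes_per_second * 60 * minutes / 1_000_000


# File lengths mentioned in the corpus description, plus one minute for reference.
for minutes in (1, 30, 60, 120):
    print(f"{minutes:>3} min: about {pcm_megabytes(minutes):.1f} MB")
```

One minute comes to 1.92 MB, which the description rounds to 2 MB, so a 30-minute file needs a little under 60 MB.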
This release of Hub-5 Spanish training data consists of 42 calls derived from the CallFriend Spanish (language-ID) collection. The transcripts cover a contiguous 10-30 minute segment taken from a recorded conversation lasting up to 30 minutes. These calls were originally collected by the LDC in support of the project on Language Recognition, sponsored by the U.S. Department of Defense. All these calls are being designated as additional training data for the project on Large Vocabulary Conversational Speech Recognition (LVCSR) in Spanish. Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements) and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Recruits were given no guidelines concerning what they should talk about. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America, Puerto Rico or the Dominican Republic. The participants were made aware that their telephone call would be recorded, as were the call recipients.
Hub-5 Spanish speech and transcript data may be obtained ($1500) by emailing ldc@ldc.upenn.edu; see also this link.
A small library of interactive hypertexts for free reading and searching, maintained by Èulogos. All are literary texts, many of them religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
The European (English, French, and Spanish) Language Newspaper Text tagged corpus, freely queryable online.
See this site under Multilingual and Parallel Corpora section.
http://www.lpl.univ-aix.fr/projects/multext/MUL4.html
This corpus is made up of a set of pieces from the Official Journal of the European Community (JOC) and is CES (Corpus Encoding Standard) conformant. It is available at three levels of treatment: paragraph annotated (CESDOC), POS-tagged (CESANA) and parallel text aligned (CESALIGN).
Availability unknown: only a few samples can be downloaded.
http://tact.uni-duisburg.de/tactweb/spanish.htm
A few Spanish newspaper texts marked up and made freely queryable online via TACTweb by Elisabeth Burr. Some POS tagging of verbs was done by Carola Wahlers, Norbert Neff and Angela Wächter-Freudenberg. This corpus is a subsection of the wider project "Romanische Zeitungssprachen", and is part of E. Burr's Online Korpusanalyse mit Hilfe von TactWeb pages, which offer some small but useful Italian, French and Spanish corpora queryable online via TACTweb. [2001 April 23].
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Spanish has 124 texts. For a more detailed description, see in the E-Texts section.
An oral corpus of Spanish plus some written corpora of South American Spanish (Argentinian and Chilean) from Grupo EUROTRA, Universidad Autonoma de Madrid. It is said to be available by this FTP, but it doesn't work. Reference web page unknown. [2000 October].
This address contains the longitudinal corpus of Maria, a Spanish child from Madrid, Spain. She was studied by Susana Lopez Ornat of Madrid's Complutense University from 1988 to 1991. Maria is an only child who was videotaped every fortnight from ages 1;07 to 4;00, in sessions of about 30 minutes. These took place at home during bath, play or feeding interactions with her parents, who belong to a middle-class professional family. Coding from orthographic transcriptions was done from 1991 to 1993. The corpus has been divided into 662 samples, each of which constitutes a pragmatic unit of linguistic interaction. A sample is called a TRAMO, and tramos are numbered from 01 to 662. Every TRAMO is analyzed for linguistic and (when present) psycholinguistic information. These analyses are bundled with the TRAMO under a .ALL file name. For example, TRAMO221.ALL contains: (a) TRAMO221.TXT, the full natural interaction; (b) TRAMO221.LIN, the linguistic analysis of the child's utterances in the .TXT file; (c) TRAMO221.PSI, the psycholinguistic analysis, if the .TXT file contained any such information. All files are freely available by FTP from the site. The videotapes are available on request from: Dra. S. Lopez Ornat; Dpto. Psicologia Cognitiva; Universidad Complutense de Madrid; Madrid 28223; Spain. Tf: 34-(1)-3943115, e-mail: pscog09@sis.ucm.es. Publications that make use of this corpus should cite S. Lopez Ornat (1994), La adquisición de la lengua española, Madrid, Siglo XXI. [2001 August 4].
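The file naming scheme described for this corpus can be sketched as follows. This is only an illustration: the function name is hypothetical, and the zero-padding width is an assumption drawn from the numbering "01 to 662", so check the names against the actual FTP listing.

```python
def tramo_files(number, pad=2):
    """Return the file names expected for one TRAMO sample.

    The padding width (pad) is an assumption based on the numbering "01 to 662".
    """
    if not 1 <= number <= 662:
        raise ValueError("TRAMO numbers run from 1 to 662")
    stem = f"TRAMO{number:0{pad}d}"
    return {
        "bundle": f"{stem}.ALL",            # container for the three files below
        "interaction": f"{stem}.TXT",       # full natural interaction
        "linguistic": f"{stem}.LIN",        # analysis of the child's utterances
        "psycholinguistic": f"{stem}.PSI",  # present only when the sample has such data
    }


print(tramo_files(221))
```

Because the .PSI file exists only for some samples, code that walks the corpus should treat a missing .PSI as normal rather than as an error.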
The MACO+ Morphological Analyzer Corpus-Oriented accepts unrestricted text as input. The tool tokenizes the text and produces as output all possible morphological interpretations for each token. It is able to recognize and deal with numbers, proper nouns, punctuation, dates, abbreviations, multiwords, etc. Spanish, Catalan and English versions are available. MACO+ is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ MACO+ can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
+ A MACO+ only online tagging service is freely provided by UNED.
+ A Maco+ & Relax online tagging service is also freely provided by UNED.
http://www.ilstu.edu/~mdavies/texts.htm
It's not clear what their availability is.
http://parnaseo.uv.es/Lemir.htm
The LEMIR site provides a lot of good Medieval and Renaissance Spanish texts. Some of them are critical editions made directly from manuscripts. All are freely available in TXT format (and optionally also in PDF). See the Index.
This page, maintained by Lyle Neff (see his homepage), is a rich database of online sources of opera libretti. A lot of e-texts (HTML format) are freely available directly from the site; others are only linked to. Besides libretti, secular songs and sacred vocal music are also covered. The languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other, less specific musical and linguistic resources. [2001 June 20].
The Relax POS tagger takes as input the output of the morphological analyzer MACO+, and selects the right POS and lemma for each word in the given context. Currently, it produces output with over 97% precision. The language model may be easily improved by adding new context constraints expressed in the CG formalism, either hand-written or statistically acquired. Spanish, Catalan and English versions are available. Relax is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ MACO+ can be queried freely online (English, Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo.
+ A Maco+ & Relax online tagging service is also freely provided by UNED.
SIGNUM, a company founded in 1988 in Quito, Ecuador, provides Microsoft with the proofing tools that support the Spanish language in Office 2000. The spell checker, hyphenator and Thesaurus that MS has included in this product were rated by PC World magazine (April 1999) as an important advantage of the latest Office version. SIGNUM provides a range of consulting and Spanish language engineering services, including specialized tagging and customized spell checkers, as well as marketing and distribution of language-related products. Among the Spanish tools they sell are a linguistic reviser (Revisor), an orthographical checker (Ortógrafo), an online verb conjugator, a Thesaurus, an online hyphenator, etc. There are no true corpus linguistics tools, but it is still an interesting site.
The Spanish News Corpus consists of journalistic text data from one newspaper (El Norte, Mexico) and from the Spanish-language services of three newswire sources: Agence France Presse, Associated Press Worldstream, and Reuters. (The Reuters collection comprises two distinct services: Reuters Spanish Language News Service and Reuters Latin American Business Report). All text data are stored on one CD-ROM, in a standard compressed form. The four sets of newswire data (AFP, APWS and two Reuters services) are each organized as one data file per day of collection. The period covered by these collections extends from December 1993 (for APWS and Reuters) or May 1994 (AFP) through December 1995. The presentation of text data in these collections is modeled on the TIPSTER corpus. Within each data file, SGML tagging is used (1) to mark article boundaries, (2) to delimit the text portion within each article and (3) to label various pieces of information about the article that are external to the text content (e.g. headlines, bylines and so on).
Available only through LDC membership.
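The three uses of SGML tagging described for this corpus (article boundaries, text portion, external information) can be illustrated with a short sketch. The sample below is invented, and the tag names DOC, DOCNO, HEADLINE and TEXT are assumptions in the general TIPSTER style; the actual corpus may use different tags, so consult its documentation before relying on them.

```python
import re

# Hypothetical sample in TIPSTER style; the real corpus may use different tag names.
SAMPLE = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<HEADLINE> Titular de ejemplo </HEADLINE>
<TEXT>
Primera frase del artículo.
</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>
Otro artículo sin titular.
</TEXT>
</DOC>
"""

# (1) Article boundaries: everything between <DOC> and </DOC>.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)


def field(doc, tag):
    """Return the stripped content of the first <tag>...</tag> in doc, or None."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", doc, re.DOTALL)
    return match.group(1).strip() if match else None


def parse_articles(data):
    """Split one data file into articles, separating metadata from body text."""
    return [
        {
            "docno": field(doc, "DOCNO"),        # (3) information external to the text
            "headline": field(doc, "HEADLINE"),  # (3) optional, may be absent
            "text": field(doc, "TEXT"),          # (2) the text portion
        }
        for doc in DOC_RE.findall(data)
    ]


for article in parse_articles(SAMPLE):
    print(article["docno"], "|", article["headline"], "|", article["text"])
```

Regular expressions are enough here only because the markup is flat and regular; nested or irregular SGML would call for a real parser.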
This release of Spanish newswire contains data from the following sources: Agence France Presse (13 January 1996 – 31 December 1998), Associated Press Worldstream (1 December 1995 – 31 August 1998), El Norte (1 January 1997 – 31 December 1998). The consistent format chosen for release consists of SGML tagging and the ISO-8859-1 (Latin1) 8-bit character set.
Available only from the LDC, through membership or for $1000.
TACAT is a parser that takes as input the output of the morphological analyzer MACO+, or the output of any tagger, and produces a syntactic analysis. The tool is a chart-based parser, with some extensions for flexibility. It uses CFG grammars, which can produce either a complete sentence analysis or just partial parsing and chunk recognition. Spanish and Catalan versions are available. TACAT is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona). Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ TACAT can be queried freely online (Spanish and Catalan) as part of the DLSI-UPC/CLiC-UB Tools Demo. [2001 April 30].
This Part-of-Speech tagger takes as input the output of the morphological analyzer MACO+, and selects the right POS and lemma for each word in the given context. Currently, it produces output with over 97% precision. The language model is based on decision trees acquired from tagged corpora. Spanish and English versions are available. This TreeTagger is developed at the DLSI-UPC (Departament de Llenguatges i Sistemes Informatica - Universitat Politècnica de Catalunya) in collaboration with the CLiC (Centre de Llenguatge i Computació - Universitat de Barcelona), and is not to be confused with the better-known TreeTagger developed at IMS Stuttgart. Availability is unknown: contact Núria Castell i Ariño. [2001 April 30].
+ TreeTagger can be queried freely online (English, Spanish) as part of the DLSI-UPC/CLiC-UB Tools Demo.
http://morph.ldc.upenn.edu/Catalog/LDC94T4A.html
It contains texts in English, French and Spanish from the Office of Conference Services at the UN in New York between 1988 and 1993.
See under Multilingual and Parallel Corpora section.
This is the (English-language) homepage of Felisa Verdejo's UNED Group in Natural Language Processing. This page has information on, and links to, the activities of the group; this other one has links to some useful free services related to Spanish, such as an online version of the MACO morphological analyzer for Spanish, alone or in combination with the Relax POS tagger; an automatic Spanish-to-English online translation system; etc. For more details cf. under the Reference section. [2001 April 30].
VISL (Visual Interactive Syntax Learning, Department of Language and Communication, University of Southern Denmark - Odense) provides online queries to pure-text corpora in Danish, German, English and Spanish and to the tagged Portuguese one. The service is for members only (see at this page). For more information cf. under Corpora and Corpus Linguistics.
The texts of the main works of the famous French philosopher Gilles Deleuze are freely available directly on his site. Besides the French originals, English and Spanish translations are available as well, so you can construct at least a three-language parallel text (if not a true parallel corpus).
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. There are free web demos of some of their tools on the web also for Spanish. For more details cf. the Tools section.
The Electronic Text Corpus of Sumerian Literature is in preparation at the University of Oxford. Its aim is to make accessible, via the World Wide Web, over 400 literary works composed in the Sumerian language in ancient Mesopotamia during the late third and early second millennia BC. At this site you will find a catalogue of these works, together with a Sumerian text, English prose translation and bibliographical information for each composition: all are freely downloadable. New material, and new user facilities, are added to the site regularly. Although minor corrections will be made, no major changes are planned for the editions presented here until the end of the first phase of the project in late 2000. If you wish to use or cite the corpus, please use the following form of citation: J.A. Black - G. Cunningham - E. Robson - G. Zólyomi, The Electronic Text Corpus of Sumerian Literature (http://www-etcsl.orient.ox.ac.uk/), Oxford 1998. [2001 July 10].
A collection of references possibly relevant to the design of encoding / markup for ANE texts, made by Robin Cover. He calls it "very provisional" (it has been online since October 24, 2000), but it is also very useful. [2001 July 14].
The Sumerian Text Archive offers a growing collection of transliterated Sumerian texts. These texts have been transliterated using only characters from the ASCII alphabet so that the text files can be used on every type of computer. As a result, however, the transliterations deviate in a number of ways from what is common practice in Sumerology (cf. the List of Conventions). All texts (administrative UR III, Old Sumerian and Old Akkadian; Royal Inscriptions) are freely downloadable. [2001 May 1].
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
An East African language and culture resource page, providing newspaper and information sources for Kenya, Uganda and Tanzania. There are also some links to Swahili e-Texts (and some advertised pages are all in Swahili).
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
Mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Swedish CHILDES corpus is available on the web.
There is also a Swedish component in the Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish) made from narratives elicited using Mercer Mayer's "frog story" picture book.
See under Multilingual and Parallel Corpora section for a fuller file.
http://www.lpl.univ-aix.fr/projects/multext/CORP/DAG.html
DI93, by Eva Ejerhed (University of Umeå), is a tagged corpus made up of 14,180 articles from the 1993 issues of the Swedish daily financial newspaper Dagens Industri (which has a daily circulation of 95,200 copies). Since the MULTEXT project provides no parallel Swedish corpus aligned with English, mainly because there are no Swedish translations of the European Parliamentary debates used by the other MULTEXT partners, they have produced this "comparable" corpus to mitigate the inconvenience.
Unfortunately, a detailed description, the tagset and a small sample are all that is available ...
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Swedish has 122 texts. For a more detailed description, see in the E-Texts section.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].
Nordic free Electronic Literary Text Archive. Project Runeberg has published Nordic literature on the Internet since 1992: this means free electronic editions of old books from Sweden and the Nordic countries. The PR catalogue lists more than 200 titles, most of which are in Swedish. All texts are freely downloadable, mainly as HTML files.
Department of Swedish, Göteborg University.
It has various searchable part-of-speech tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.
SUC is a tagged one million word corpus of written Swedish, composed along the lines of the Brown corpus and other balanced corpora (that's to say that it consists of a predefined set of texts from different text types). Altogether, there are 500 texts of 2 000 words each. The collected texts are all printed between 1990 and 1993, with a few texts from 1994 (cf. bibliographic information). The texts have been collected, prepared and analyzed in cooperation between the Department of Linguistics at Stockholm University and the Department of Linguistics at Umeå University.
+ The corpus is free, but may be used only for non-commercial purposes and after submitting a signed user's agreement form (at this page). It is available both as a CD distribution (with the Windows SUCSEE search tool and many add-ons) and as a raw version downloadable by FTP; neither service is activated yet (as of 4.26.2000). See the following link.
+ Tagset is available, training and test data can be freely downloaded.
Swedish Constraint Grammar (SweCG) is a system for part-of-speech disambiguation and shallow syntactic analysis of running Swedish text, developed within the Constraint Grammar (CG) framework. It is commercial software: for availability, contact info@lingsoft.fi.
http://solaris3.ids-mannheim.de/tractor/telri/GOT/got-01.htm
One million word Swedish newspaper corpus. Encoded to Parole standards, from Dept of Swedish, Gothenburg University, Sweden.
Available under subscription to TRACTOR.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
A short collection of texts from the isLa, a band popular in the Philippines. Three texts are in Tagalog, one in Tagalog with an English version, and one in English. [2002 February 22].
A small page of cooking recipes (Tagalog only) from the University of Pennsylvania Tagalog website.
+ A second page, with other recipes. [2002 February 22].
This little page offers the text of the Lord's Prayer in Taino (with glosses either in English or Spanish) from Dr. Cayetano Coll y Toste's Prehistoria de Puerto Rico, 1493. It's a part of the Official Jatibonicu Taino Tribal Government Web Site. For the reviving of the Taino language, cf. also their Dictionary of the Spoken Taino language. [2001 July 21].
The EMILLE Project is a 3 year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. EMILLE will generate written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
For more information and details cf. under the Multilingual corpora section. [2001 June 17].
You can freely query from this page a small tagged corpus of modern Tamil prose. Currently it has only text from two short novels, Yuka Santhi written by Jeyakanthan and Kangkaa Snaanam written by Akilan. This engine may be used to search various sentence patterns, like 'sentences with specific tense markers, participle forms, modal forms, etc.'. A list of options is provided to make the selections easier. [2001 May 1].
+ The recent version of the DOS-based Tamil tagger written by Vasu Renganathan is said to be downloadable from this page, but the last time I checked the link did not work [2001 May 1].
The Institute of Indology and Tamil Studies at Cologne in Germany has undertaken a project named 'Pongal 2000' to digitise and computerise Tamil literature on a fairly large scale. This is being done with a view to constructing a Tamil national corpus to encompass all major Tamil text categories, classical works starting from the Sangam period as well as prose selected from the earliest Portuguese Tamil prints to the latest contemporary works. Sangam and post-Sangam literature, Silappadikaram, Periyapuranam, Thiruvachakam and Kamparamayanam are said to have been already made available in transliterated form on the Internet, and so should be an online Tamil-English dictionary based on the Madras Tamil lexicon consisting of nearly 130,000 entries. I cannot however find links to these resources [2001 May 1].
A small Tetun-English parallel corpus, manually sentence-aligned. It was used by the Statistical Machine Translation Team of Dan Melamed (cf. the Melamed's Tools file) and others exploiting the EGYPT statistical machine translation toolkit. No information on its availability.
The CRCL (Center for Research in Computational Linguistics, Bangkok) pages produced by Doug Cooper for the SEASRC (South East Asian Computing And Linguistics Center) lie at the intersection of computing and linguistics in Southeast Asia. SEASRC publicizes and encourages cooperative research activity in and around Thailand, and provides data, tools, and contacts to scholars around the world. There are many valuable and usually free resources (especially for Thai) on this site (cf. the index), spanning from the TIE project (with the TOLL bilingual texts) to fonts and related tools. A great site!
The POS-tagged Orchid Corpus aims to build a Thai text corpus with syntactic word-class annotation. The part-of-speech tagged corpus is not the final goal, but only the first step toward making Thai text resources available. Though there is no consensus on many issues in Thai syntax (such as word or sentence construction, word or sentence classification, etc.), Orchid proposes an initial standard structure. Orchid's word classification and its word and sentence breaking have been verified to some extent in a machine translation system. They do not claim to capture Thai syntax fully, but are expected to be verified together with the corpus and improved through thorough use on general text. The corpus can be freely downloaded in either TIS-620 or UTF format. An online POS tagging service will be available in the future. [2001 August 4].
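Since the corpus is distributed in two encodings, a reader holding the TIS-620 files may want to convert them for Unicode-aware tools. The following is only a minimal sketch in Python: it assumes that the "UTF" format mentioned above means UTF-8, and the function name is a hypothetical one chosen for illustration. It relies on the tis_620 codec shipped with the Python standard library and knows nothing about the corpus's own markup.

```python
def tis620_to_utf8(data: bytes) -> bytes:
    """Re-encode TIS-620 bytes as UTF-8 (assumed target encoding)."""
    # "tis_620" is a standard-library codec; undecodable bytes raise an error.
    text = data.decode("tis_620")
    return text.encode("utf-8")


if __name__ == "__main__":
    sample = b"\xe4\xb7\xc2"  # three Thai letters in TIS-620
    print(tis620_to_utf8(sample).decode("utf-8"))
```

The conversion is strict on purpose: a byte that is not valid TIS-620 raises an error instead of being replaced silently, which is usually preferable when preparing corpus data.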
The lack of publicly accessible, machine-readable data is a major impediment to research in Southeast Asia. These archives, maintained by the SEASRC (South East Asian Computing And Linguistics Center), serve as a repository for raw, tagged or otherwise prepared texts, word lists, dictionaries, and the like. Given interest and availability, we will also archive sound files. So far only a few Thai texts (mainly onomastics and toponomastics) are available.
The Thai Internet Education Project (TIE) develops and distributes innovative, on-line resources for education, including tools for Thai students of English, and overseas students of Thai. Resources developed and distributed by TIE include on-line guides to pronunciation and transcription, the TOLL (Thai On-Line Library) of parallel English/Thai translations with its TOLL Toolkit, on-line dictionary tools, downloadable "Reader's Reference" cards, etc. TIE focuses on developing "enabling technology" -- software and systems that can be adapted and extended by teachers anywhere in the world. Systems developed for Thai can be applied to Lao, Khmer, and Burmese as well. All TIE Project resources may be freely downloaded and incorporated into lessons by teachers, or used by students directly.
The Thai On-Line Library of bilingual texts, maintained by the TIE Project (Thai Internet Education Project), is a tool for Thai students of English, and for foreign students of Thai. TOLL includes a built-in Thai-English/English-Thai dictionary – look up words by clicking, typing, or cut-and-paste. For the benefit of foreign (and younger Thai) readers, TOLL is able to insert spaces between Thai words. TOLL serves several purposes. It is: (a) a test-bed for innovative Internet software development, (b) a workshop for research in new approaches to language education, (c) a low/no-cost delivery system for high-quality educational resources, (d) a starting point in the long research struggle to build sophisticated Thai/English translation software. English education in Southeast Asia is greatly hampered by the high cost of publishing, and the general lack of high-quality, parallel translations: TOLL lets us prepare these at almost no cost. For foreigners, learning to read Thai, Lao, Burmese, Khmer, and other nonsegmented languages can be extremely difficult. These writing systems do not put spaces between words; often, one feels one has to be able to speak fluently before learning to read! TOLL lets you experiment with new ways of indicating – and gradually eliminating – word breaks. Foreigners can learn exactly the same way that Thai schoolchildren do. Finally, TOLL lets you test software for automated text segmentation and parallel alignment of translations. These are extremely difficult research problems, but solving them is a necessary first step toward the development of a variety of commercial and academic software.
+ So far, however, only a free demo is available online (beware that TOLL uses JavaScript but not Java), and it is very good. Let us hope to have more!
+ There is also an interesting free TOLL Toolkit (cf. Tool section), but there are still no download instructions.
TOLL Toolkit is a system devised to manage the TOLL bilingual Thai - English texts. Very little special preparation of the texts is required: they should be available in both English and Thai; the original files can be plain ASCII or TIS-620 text. Alignment points, preferably at the paragraph level, can be marked in any regular fashion, e.g. <#1>; if possible, Thai text should be presegmented – one space between each word or compound, and a carriage return at the end of each line. The TOLL Toolkit is a set of Perl programs that takes these base files, and generates a collection of HTML pages and forms (your system must be able to handle JavaScript!). The text ends up in every practical form – books can be viewed either singly, or in parallel, with or without spaces between Thai words. In addition, each Thai word (spaced or not) is treated as a link. As a result, all the reader has to do is to click on any word in order to look it up.
All TOLL software is said to be freely available, but there are still no download instructions.
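To make the preparation steps above more concrete, here is a minimal sketch of the general idea. It is written in Python and is not the Toolkit's own Perl code, which is not available. It assumes alignment markers of the form <#1>, as in the example above, and presegmented Thai text; the function names and the dictionary address "lookup?word=" are hypothetical placeholders.

```python
import html
import re
from urllib.parse import quote

# Alignment markers such as <#1>, <#2>, ... (the form given in the description).
MARKER = re.compile(r"<#(\d+)>")


def split_aligned(text):
    """Map each marker number to the chunk of text that follows it."""
    parts = MARKER.split(text)
    # parts = [text before first marker, "1", chunk 1, "2", chunk 2, ...]
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts), 2)}


def link_words(segmented, lookup_url="lookup?word=", spaced=True):
    """Turn every space-separated word into a link to a (hypothetical) dictionary."""
    links = [
        '<a href="%s%s">%s</a>' % (lookup_url, quote(word), html.escape(word))
        for word in segmented.split()
    ]
    # Words can be shown with or without spaces between them.
    return (" " if spaced else "").join(links)


def parallel_rows(english, thai, spaced=True):
    """Build one HTML table row per alignment point found in both texts."""
    en, th = split_aligned(english), split_aligned(thai)
    return [
        "<tr><td>%s</td><td>%s</td></tr>"
        % (html.escape(en[n]), link_words(th[n], spaced=spaced))
        for n in sorted(en.keys() & th.keys())
    ]
```

Calling parallel_rows(english_text, thai_text, spaced=False) would give the parallel view without spaces between Thai words; each word remains a separate link because the links are built before the words are joined.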
"A Thousand Books of Wisdom" is the fourth major release of Tibetan data by the Asian Classics Input Project. The core of the entire Asian Classics Input Project consists of a dedicated group of Tibetan refugees in south Asia who are accomplishing the great majority of the Project’s work: the input of tens of thousands of pages of Tibetan woodblock prints, in the hope of saving the disappearing Tibetan books. First they search the globe for the remaining collections of books and record their location and contents in catalog form (i.e. the St. Petersburg Catalog); next they copy the books in e-text format and send these copies to be input onto computer media at data entry centers. Over the past ten years ACIP has released tens of thousands of pages of great books, on tens of thousands of computer disks and through the World Wide Web, completely free. Nearly all texts, in RTF format (in Tibetan script and romanization), can be freely accessed, either by requesting them on disk (order information) or by downloading them from the site. Beware that a small number of the items in the ACIP database are restricted and do not appear in the public releases (in respect of the centuries-old tradition of the Buddhist lineages of Tibet and India, and in particular out of respect for the current holders of these lineages who work closely with ACIP in locating and preserving these materials, ACIP has a policy of not releasing to the general public those texts which are by tradition considered secret; users who have received the necessary initiations to study these materials may however submit a request). Please note that the Sambhota fonts with the ACIP encoding for Tibetan are freely downloadable from the Nitartha site. [2001 May 1].
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
The "Tocharian Ms from Berlin Turfan Collection" page, a part of the TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) project, is a preliminary internet edition of the Tocharian (A and B) manuscripts that are preserved in the Staatsbibliothek zu Berlin - Preußischer Kulturbesitz as part of the Turfan collection of the Berlin-Brandenburgische Akademie der Wissenschaften. The data available from the TITUS server consist of both digitized images and texts (transcriptions were prepared by Christiane Schaefer, transliterations by Tatsushi Tamai, digitizing and transliterations by Katharina Kupfer). For the time being, only parts of the collection can be made available. The material is free for non-commercial use, as usual in the TITUS project (cf. TITUS conditions). However, for using the manuscript images as sources in any kind of publication please contact the Oriental Department of the Staatsbibliothek zu Berlin. Please also beware that these pages, as usual in TITUS, are encoded using Unicode / UTF8. The special characters they contain can only be displayed and printed by installing a font that covers Unicode, such as the freely downloadable TITUS font TITUS Cyberbit Basic. [2001 July 14; checked 2001 August 30].
This Tok Pisin freely downloadable story of "The woman who became stone" was collected in October 1997 in Bimun village (Kuot-speaking, West Coast Central New Ireland, Papua New Guinea) by Eva Lindström (e-mail) from Veronica Galeng, c. 55-60 years of age, born near Konos in East Coast Madak-speaking area, but married in WC Kuot area since age about 18(?) (fluent also in Kuot). The text is provided with an English translation, and a few linguistic and anthropological notes, by the collector herself. This page is part of the Creolist Archives Text Collections. [2001 August 8].
Dicks Raeparanga Thomas was awarded in 1996 an MA in Linguistics at the University of Papua New Guinea. His thesis, Sotpela Grama bilong Tokpisin (A Short Grammar of Tokpisin), was written in Tokpisin (or Tok Pisin, the PNG dialect of Melanesian Pidgin). Two external examiners of the thesis also wrote their comments and recommendations in Tok Pisin. (This is probably a first for a pidgin or creole!). Here is a short abstract of the thesis, in Tok Pisin and English, taken from the PACE (Pidgins and Creoles in Education) Newsletter no. 8 (1997). This freely downloadable page comes from the Creolist Archives Text Collections. [2001 August 8].
This corpus is made up of academic/technical/conference papers (on spelling correction, corpus tagger, ATN grammar, lexical functional grammar, spelling checker, morphological specification), a PhD thesis proposal and PhD theses, a project plan, etc. From Bilkent University, Ankara, Turkey.
Available under subscription to TRACTOR.
The Center aims to promote and perform basic research in language and speech processing, and develop language engineering applications involving Turkish and related Turkic languages. The site offers various resources and links related to the Turkish language, including, besides a word list of palindromes and a few downloadable utilities:
+ a dictionary: http://www.nlp.cs.bilkent.edu.tr/Sozluk/,
+ a Turkish Morphological Analyzer Demo,
+ a small Turkish tagged corpus,
+ some parallel Turkish-English texts,
+ cf. also the Academic Papers Corpus (TRACTOR).
Mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Turkish CHILDES corpus is available on the web.
There is also a Turkish component in the Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish) made from narratives elicited using Mercer Mayer's "frog story" picture book.
See under Multilingual and Parallel Corpora section for a fuller file.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
This is the Laboratory for the Computational Studies of Language page of the Middle East Technical University Computational Linguistics. It is more focused on language processing tools and lexicons than on corpora; a few tools (cf. TUWPA above) are free.
Turkish texts from Samarkand State Institute for Foreign Languages, Samarkand, Uzbekistan. Available under subscription to TRACTOR.
TELL, developed in the Linguistics Department of the University of California at Berkeley, is a database of 30,000 Turkish words representing both print dictionaries and actual speaker knowledge. TELL was constructed from a master list composed of the following: 17,000 headwords from the 2d edition of the Oxford Turkish-English dictionary, 20,000 headwords from the 3d edition of the Oxford Turkish-English dictionary, 175 place names from an atlas of Istanbul, and 5,000 place names from a telephone area code directory of Turkey. Once duplicates were removed, the resulting list contained some 30,000 lexemes. These were elicited, in various morphological contexts, from a native speaker. The resulting database contains orthographic representations of these 30,000 headwords as well as phonemic transcriptions of all elicited forms. The native speaker knew, and supplied pronunciations for, some 17,500 of the elicited lexemes. In addition, TELL supplies morphological roots for approximately 17,000 lexical items; etymologies are supplied for about 11,000. The Search page is free.
A Tool for Learning Turkish Morphology (analyzing and producing Turkish word forms) queryable on the Web.
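The four sources listed above add up to 17,000 + 20,000 + 175 + 5,000 = 42,175 entries, which duplicate removal reduced to some 30,000. How the TELL team actually merged their lists is not documented here; the Python sketch below only illustrates the general technique, assuming that duplicates are entries with exactly the same spelling. The function name and the sample words are hypothetical.

```python
def merge_headwords(*sources):
    """Concatenate word lists and drop exact duplicates, keeping first-seen order."""
    merged = {}  # dicts preserve insertion order, so this doubles as an ordered set
    for source in sources:
        for headword in source:
            merged.setdefault(headword.strip(), None)
    return list(merged)


if __name__ == "__main__":
    second_edition = ["ev", "göz"]    # made-up sample entries
    third_edition = ["göz", "kitap"]
    place_names = ["Ankara"]
    print(merge_headwords(second_edition, third_edition, place_names))
```

Exact matching is deliberately conservative: case folding is avoided because Turkish distinguishes dotted and dotless i, so lowercasing with default rules could merge or corrupt distinct headwords.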
A brand new English-Turkish and Turkish-English online dictionary. [2002 February 18].
Ukrainian corpus information and documentation in HTML files and access to the corpus and DTD, both online and downloadable. From the Macbride Trading Corporation, Severodonetsk, Ukraine.
Available under subscription to TRACTOR.
A single freely downloadable small and old (1696) witness from the Guinea-Bissau Creole. From the Creolist Archives Text Collections. [2001 August 14].
The EMILLE Project is a 3 year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. EMILLE will generate written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
For more information and details cf. under the Multilingual corpora section. [2001 June 17].
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
http://solaris3.ids-mannheim.de/tractor/telri/SAM/sam-01.htm
Uzbek texts: several chapters of The Constitution of the Tamerlan State. From Samarkand State Institute for Foreign Languages, Samarkand, Uzbekistan.
Available under subscription to TRACTOR.
This page, from the Department of Asian and Pacific Linguistics - Institute of Cross-Cultural Studies - Tokyo University, maintained by Kazuto Matsumura, provides a single and short free Vepse e-Text: Nina Zaiceva, Maria Mullonen, Ic^emoi lugemišt, Vepsän kelen lugendkirj 3.-4. klassale, Petroskoi "Karjala", 1994. HTML format.
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
Two small freely downloadable documents of old St. Croix (1878) and Tortola (1830) Creoles, from the Creolist Archives Text Collections. [2001 August 8].
Mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Welsh CHILDES corpus is available on the web.
See under Multilingual and Parallel Corpora section for a fuller file.
A few mainly old freely downloadable witnesses and texts of West African PE from Cameroon (the Creation story told by a Crooboy in 1933), Ghana (1795, 1686, 1721), and Nigeria (1793, 1804, 1807, 1825, 1890-1930, 1926, 1963, 1965, 1995); from the Creolist Archives Text Collections. [2001 August 8].