Localized Resources .3.
I provide here language-specific links to corpora, e-texts and NLP resources in general. Resources already presented in the previous sections are also repeated here whenever relevant.
http://www.cogsci.ed.ac.uk/~apl2c/
The Association for Persian Language, Linguistics and Computing, administered from the Edinburgh University Centre for Cognitive Science, provides the best Farsi Speech and Natural Language Processing (NLP) page available on the web. It keeps many papers online.
A site offering links to Persian (and some Arabic) resources on the Web: some traditional and e-text publishers and many miscellaneous resources. Nothing truly computational, but of some interest for electronic publishing.
This short page points to a couple of Farsi speech databases and a few other resources.
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
This page, from the Department of Asian and Pacific Linguistics - Institute of Cross-Cultural Studies - Tokyo University, maintained by Kazuto Matsumura, provides a few free Finnish e-texts: two in standard Finnish and one historical (Kieliasetus 1873). All HTML.
A small library of interactive hypertexts for free reading and searching, maintained by Èulogos. All literary texts, many religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Finnish has 137 texts. For a more detailed description, see the E-Texts section.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
The Association des Bibliophiles Universels, or ABU (pronounced "abou"), was founded in April 1993 in order to give free access on the Web to francophone texts in the public domain (i.e. by authors deceased at least 70 years ago). It is, so to speak, the French equivalent of Project Gutenberg. All texts are stored in HTML and in plain TXT format (with line breaks and the ISO Latin character set) and are freely downloadable. Simple string searches can be done online.
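Plain TXT files with hard line breaks in an ISO Latin character set need a small amount of preparation before modern tools can use them. The following is only a minimal sketch, assuming the files are ISO-8859-1 and that paragraphs are separated by blank lines; neither assumption is confirmed by the site, and the function name is hypothetical.

```python
import re

def read_latin1_paragraphs(raw):
    """Decode ISO-8859-1 bytes and rejoin hard-wrapped lines into paragraphs.

    Assumption (not confirmed by the source): a blank line marks a
    paragraph boundary, and every other line break is just wrapping.
    """
    text = raw.decode("iso-8859-1")
    # Normalize DOS and old Mac line endings to "\n".
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    paragraphs = re.split(r"\n\s*\n", text)
    # Collapse the remaining line breaks and runs of spaces inside a paragraph.
    return [" ".join(p.split()) for p in paragraphs if p.strip()]

# Invented sample bytes, used only to show the call.
sample = "Le caf\xe9 est\r\nchaud.\r\n\r\nFin.".encode("iso-8859-1")
for paragraph in read_latin1_paragraphs(sample):
    print(paragraph)
```

The result is a list of Unicode strings, which can then be written out as UTF-8 or passed to a tokenizer.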
http://humanities.uchicago.edu/ARTFL/ARTFL.html
The American and French Research on the Treasury of the French Language (ARTFL) is a cooperative project established in 1981 by the Centre National de la Recherche Scientifique and the University of Chicago. Its function is to administer the 150-million-word corpus created since 1957 by the French government as the basis for the new dictionary of the French language, the Trésor de la Langue Française. Its nearly 2000 texts represent a broad range of written French -- from novels and poetry to biology and mathematics -- stretching from the seventeenth to the twentieth centuries. A Provençal database that includes 38 texts in their original spellings has recently been added. The ARTFL system supports a number of searches, which can be performed on texts selected by users working with WWW or Philologic. A user may search for a single word, a word root, prefix, suffix or a user-created list of words. Users can generate single-line KWIC concordances, multi-line concordances, indices, and bibliographies. Search results on the WWW can be saved directly to the directory of your choice.
It's a shame that access should be so heavily restricted. Namely, subscriptions to the ARTFL databases (at an annual fee of $250-500) are available only to universities and other research institutions within the United States and Canada (private individual European scholars can forget it!). Access to the ARTFL database outside North America should be provided by the INALF (Institut National de la Langue Française), but that is even harder to obtain ...
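For readers without access, the kinds of search described above (single word, prefix, suffix, word list) and a single-line KWIC concordance can be approximated on any plain text you do have. The sketch below is not ARTFL's or Philologic's implementation; it is a minimal illustration using regular expressions, and all function names and the sample sentence are invented for the example.

```python
import re

def kwic(text, pattern, width=30):
    """Return one keyword-in-context line per match of the regex `pattern`."""
    flat = " ".join(text.split())  # collapse line breaks: one line of context
    lines = []
    for m in re.finditer(pattern, flat, flags=re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so that the keywords line up.
        lines.append(f"{left:>{width}} [{m.group(0)}] {right}")
    return lines

# Hypothetical helpers that build the regex for each kind of query.
def word(w):
    return rf"\b{re.escape(w)}\b"

def prefix(p):
    return rf"\b{re.escape(p)}\w*"

def suffix(s):
    return rf"\w*{re.escape(s)}\b"

def word_list(words):
    return r"\b(?:" + "|".join(re.escape(w) for w in words) + r")\b"

# Invented sample text, used only to show the calls.
sample = "le chat dort, la chatte chante et les chats chantaient"
for line in kwic(sample, prefix("chant"), width=20):
    print(line)
```

A word-root search can be treated as a prefix search in this simplified setting; a real system would rely on lemmatization or a morphological index rather than string matching.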
The French Texts Archive of the Université de Genève. The number of texts hosted or (more often) linked to is very high (see the list at http://un2sg4.unige.ch/athena/html/fran_fr.html), but the availability of texts varies from truly free and downloadable (e.g. the texts hosted directly on the site) to very poor (e.g. the Gallica texts). There is also a special list of texts by French authors written in or translated into other languages.
http://www-rali.iro.umontreal.ca/arc-a2/BAF/
The BAF Corpus is a corpus of French-English bi-texts, i.e. pairs of French and English texts that are mutual translations and whose sentences have been aligned. The corpus was built by the CITI computer-assisted translation group (TAO). Most of the texts belong to the institutional genre (Canadian Hansard, UN reports, etc.), but a few scientific papers and one literary work are also included. The whole corpus has about 400,000 words for each language. BAF Version 1.1 is already available and can be freely downloaded in UNIX GZ format, in ZIP, and file by file in TXT and CES formats. A description, the alignment conventions, the encoding documentation and a COAL Tools suite are also freely available on the site. [2001 April 23].
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-04.htm
Texts from the Deutsche Bundesregierung. Cf. this site under the Multilingual and Parallel Corpora section.
Available under subscription to TRACTOR.
This Chateau has a room called "Chef d'Oeuvre de la Littérature Française", where you can freely and easily download (in ZIP format) some French texts, ranging from Molière's Complete Works to Pascal's Pensées, Baudelaire's Fleurs du mal, and Proust's Du côté de chez Swann.
http://inalf.ivry.cnrs.fr/ccrti/
The INaLF (Institut National de la Langue Française - CNRS) Catalogue provides a selection, made on scholarly grounds (both philological and computational), of the French literary text resources available on the Web. Once you have digested all their disclaimers and evaluation grids, however, the links to truly freely downloadable texts are relatively few.
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The French CHILDES corpus is downloadable in a zipped format.
See the Multilingual and Parallel Corpora section for a fuller entry.
http://www.swarthmore.edu/Humanities/clicnet/litterature/litterature.html
A few free French texts (most of them incomplete) and a lot of links that often don't take you directly to the text you want to download.
This site, by the Secrétariat à la politique linguistique du gouvernement du Québec, provides a unified interface and free online search facility for twelve French corpora made available by five Québec universities. The corpora range from small lexical databases to larger textual corpora, including Québétext and the Témiscouata Corpus. [2002 February 17].
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
Early Canadiana Online (ECO) is a full-text online collection of more than 3,000 books and pamphlets (in English and French) documenting Canadian history from the first European contact to the late 19th century. You can make simple queries online, but unfortunately you can download texts only one page at a time. Cf. the English version and the French one.
The ET10-63 corpus is a bilingual parallel corpus of English and French, containing EC official documents on telecommunications. The corpus is part-of-speech tagged and also lemmatized. Approximately 1,250,000 words of each language.
http://morph.ldc.upenn.edu/Catalog/LDC95T11.html
The European Language Newspaper Text corpus is also known as the French Language News Corpus. This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked up using SGML.
See under Multilingual and Parallel Corpora section.
This French literature "petit bibliothèque portatif" is hosted by the French Foreign Office and is maintained by Olivier D. J. Tableau. It offers a good number of texts in PDF / RTF / ClarisWorks format, all free and easy to download. The texts range from Montaigne to Diderot, from Balzac to Leroux, and from Baudelaire to Mallarmé; there are also some French translations of foreign authors (e.g. Shakespeare, Kafka and Goethe). Surely a good site.
The site of the Bibliothèque Nationale Française, online since 1997, promises 70000 digitized documents, both in image format (BNF manuscripts etc.) and in text format (from the INALF Frantext database, in cooperation with the publishing houses Acamédia, Bibliopolis and Honoré Champion). Unfortunately, many items are not available (and the notice "Ce document est protégé au titre de la propriété littéraire et artistique. Pour le consulter, vous devez vous rendre à la Bibliothèque nationale de France" also appears for texts that surely could be put in the public domain), and even the (few) free texts can be read online but downloaded only partially, in PDF format. If you want texts and not legalese, this is not the right place for you. Some simple string searches can be done online.
Texts from the Centre d'Information et de Documentation de l'Ambassade de la République Fédérale d'Allemagne, Paris, France, in German and French. From Institut für Deutsche Sprache, Mannheim, Germany.
Available under subscription to TRACTOR.
The Green Library (or Cactus Library), a project of the Saint-Petersburg School of Religion and Philosophy (SRPh), has by now a few Greek, Russian and French (Turgenev) texts. More titles are announced as forthcoming, also in Latin, French (Casanova) and Old Slavonian. All titles are freely downloadable in PDF format. For more cf. under the E-Texts section. [2001 May 1].
The Hansard Corpus consists of parallel texts in English and Canadian French, drawn from official records of the proceedings of the Canadian Parliament.
See under Multilingual and Parallel Corpora section.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-11.htm
Cf. Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
A small library of interactive hypertexts for free reading and searching, maintained by Èulogos. All literary texts, many religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
The European (English, French, and Spanish) Language Newspaper Text tagged corpus, freely queryable online.
See this site under Multilingual and Parallel Corpora section.
http://www.lpl.univ-aix.fr/projects/multext/MUL4.html
This corpus consists of a set of excerpts from the Official Journal of the European Community (JOC) and conforms to the CES (Corpus Encoding Standard). It is available at three levels of processing: paragraph-annotated (CESDOC), POS-tagged (CESANA) and parallel-text aligned (CESALIGN).
Availability unknown: only a few samples downloadable.
The Textlist page of the Kirchenmusik online site (a good and well-known resource for music lovers) by Joachim Vogelsänger offers a huge, free collection of texts of oratorios, cantatas, sacred hymns and the like. Most are in German and a few are in Latin, but there are also a very few texts in French (Fauré's Cantique de Jean Racine and Saint-Saens' Weihnachtsoratorium). All the texts are freely downloadable in simple HTML format. For more details cf. the full entry in the E-Texts section. [2001 August 27].
An online, freely queryable database of English-French aligned texts, processed by the same software (by Knut Hofland) used for the Oslo ENPC project.
A few French newspaper texts marked up and made freely queryable online via TACTweb by Elisabeth Burr. Some POS tagging of verbs was done by Caroline Hummel, Stephanie Demandt and Ivo Andreatta. This corpus is a subsection of the wider project "Romanische Zeitungssprachen", and is part of E. Burr's Online Korpusanalyse mit Hilfe von TactWeb pages, which offer some small but useful Italian, French and Spanish corpora queryable online via TACTweb. [2001 April 23].
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: French has 1135 texts. For a more detailed description, see the E-Texts section.
Myriobiblos, the E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to Modern Greek; there are also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. under the E-Texts section. [2001 May 1].
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-06.htm
HTML texts from NATO. Cf. Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
Opera e-Libretto (Collection Ulric Voyer) is a collection of 220 free e-texts of opera libretti. Displayed libretti are in French (Charpentier, Rameau, Campra, Rousseau, Gluck, Méhul, Gretry, Berlioz, Auber, Bizet, Boieldieu, Offenbach, Dukas, Massenet, Lalo, Chabrier, Saint-Saëns, Thomas, Gounod, Massé, Reyer, Delibes, Debussy, Cui, Halévy, Bruneau, Roussel, Pierné, Laparra, Voyer - for Cendrillon, Jongleur de Notre-Dame and Sapho there are also English versions, and for Werther also Italian and English ones), Italian, English, German, Russian and Danish. All texts are in HTML, usually split into several files according to act divisions. For a more detailed entry cf. under the E-Texts section. [2001 June].
This page, maintained by Lyle Neff (cf. homepage), is a rich database of online sources of opera libretti. Many e-texts (HTML format) are freely available directly from the site; others are only linked to. Besides libretti, secular songs and sacred vocal music are also covered. Languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other, less specific musical and linguistic resources. [2001 June 20].
Québétext is a textual database built from French literary texts (now about a hundred) from Québec, dating from 1837 to the present. The corpus of all texts whose copyright has expired can be freely searched online, directly from the site or as part of the Corpus lexicaux québécois. The full Québétext corpus (including copyright-protected texts) can be consulted only locally at the CIRAL section of the Trésor de la langue française au Québec, at the Université Laval (Sainte-Foy, Québec). [2002 February 17].
http://www-rali.iro.umontreal.ca/
RALI provides tools and computational resources for the linguistic analysis mainly of written French (taggers, parsers, electronic grammars, parallel corpora). Some demos work online, such as Réacc (which automatically restores accents in a plain French text), Morpholyse (a morphological analyzer), TRIAL (Trilingual Text Aligner), etc.
SENSEVAL is a project concerned with evaluating Word Sense Disambiguation systems. The first SENSEVAL took place in the summer of 1998, for English, French and Italian. The second is planned for Pisa, Spring 2001. For more cf. the References, Standards etc. section.
http://www.loria.fr/projets/Silfide/Index.html
SILFIDE, a project of CNRS and AUPELF-UREF, is an interactive server (hosted by LORIA) for the study and dissemination of the French language, aimed at the academic community (linguistics, teaching, computer science). Some TEI-XML texts can be searched online.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-09.htm
Cf. under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
This historical Québec corpus, properly "Des lettres des pays d'en bas - Un corpus du Témiscouata (1930-1936)", is an interesting project of the Université du Québec à Rimouski. It is built from letters written between 1930 and 1936 by settlers (often unlettered) to the missionary "abbé" Léo-Pierre Bernier, from whose archival fonds they are taken. The corpus can be freely accessed directly from the site in SATO or NOMINO mode, or as part of the Corpus lexicaux québécois.
TRIPTIC is a trilingual corpus developed for the analysis of prepositions in English, French and Dutch. There is no TRIPTIC page on the web, and all the information is taken from Michael Barlow's Parallel Corpora Page. For further information see the Multilingual and Parallel Corpora section. Contact: Hans Paulussen.
http://morph.ldc.upenn.edu/Catalog/LDC94T4A.html
It contains texts in English, French and Spanish from the Office of Conference Services at the UN in New York between 1988 and 1993.
See under Multilingual and Parallel Corpora section.
The Web Concordancer site, by the Virtual Language Centre of the Polytechnic University of Hong Kong, presents a few indexed corpora (English, French, Chinese, Japanese) that can be freely browsed with the ConcApp program. Corpora available include the Brown Corpus, Sherlock Holmes stories, the South China Morning Post, etc. [2002 February 17].
The texts of the main works of the famous French philosopher Gilles Deleuze, freely available directly on his site. All texts are in HTML format and are split into several pages, which makes for easier reading online (but harder downloading ...). Besides the French originals, English and Spanish translations are available as well, so you can build three parallel texts (if not a true parallel corpus).
http://www.lib.virginia.edu/wess/etexts.html
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. There are free web demos of some of their tools also for French. For more details cf. the Tools section.
A collection of mainly old, freely downloadable texts in French Antillean Creole, from several varieties. St. Thomas has a single text from 1884. Guadeloupe has more texts, ranging from those of 1664 and 1886 to the more modern Un Conte Créole extrait de Maîtresses-Femmes Créoles de Maxette Févrin-Olsson, 1997, and "Madjoumbé" (since 1997; formerly "Kimafoutiésa"), a journal of the Caribbean diaspora in France. It contains texts in English, Creole, Spanish and Portuguese on the history, geography, cultures and languages of the peoples of the Caribbean and their diasporas throughout the world. "Madjoumbé", edited by Henry "S'maw" Tayfon, is published by INKAm (Initiative Caribéenne multimédia; BP 18 / 92403 Courbevoie cedex). Martinique has only old witnesses, viz. 1671, 1695, 1698, 1790 and 1793. From the Creolist Archives Text Collections. [2001 August 13].
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].
Several links to free Gaelic e-texts (HTML), often with an English translation side by side. They range from poems to crosswords (e.g.) to Gaelic-language Bulletin Board messages on several topics (do you want a bilingual recipe for "Seabhdar Cóilise le Duileasc / Cauliflower Chowder with Dulse"? Here it is). Beware that in all other respects the site is strictly Gaelic-language only. [2001 May 1; checked 2001 August 30].
Armazi is the TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) page on "Fundamentals of an Electronic Documentation of Caucasian Languages and Cultures". It deals mainly with developing encoding standards (cf. the Encoding standards for the languages of the Caucasus project) and computational tools (cf. the Computer models for Caucasian languages project) for Georgian and other Caucasian languages. It also hosts important Georgian projects, such as the Digitization of Old Georgian texts from the Gelati school Project from the Gelati Academy of Sciences, and the Digitization of the Albanian palimpsest manuscripts from Mt. Sinai project. There are also some links to e-text resources from the TITUS server, both for Georgian and for Laz, Svan and Mingrelian. Note that these pages are encoded in Unicode / UTF-8: the special characters they contain can be displayed and printed only after installing a font that covers Unicode, such as the freely downloadable TITUS font TITUS Cyberbit Basic. [2001 May 18; Rev. 2001 August 30].
The site is about Gia Shervashidze and his publishing house's activities. Besides many linguistic resources (dictionaries, spellers etc.) and computing resources (OCR system, fonts, Windows and Linux localizations, etc.) for the Georgian language, "various old and modern Georgian corpora" are also advertised. Descriptions and availability information, however, are still lacking. And beware that the links are full of commercial popups. [2001 May 18]
This project for a Georgian, Russian, English, German Multilingual Valency Lexicon for Natural Language Processing (GREEG) is being carried out by the Georgian Academy of Sciences and Tbilisi State University (which contribute the in-depth knowledge of Georgian, in itself and in contrast to the other languages), the University of Stuttgart (as a world leader in corpus processing for lexicographical purposes), and the University of Brighton (which provides the formalisms for lexical information). Currently there are no lexicons suitable for language engineering ('computational lexicons') for Georgian, and the other languages have been selected because of their salience to Georgia: English, for its international role; German, for the longstanding special relationship between Germany and Georgia; and Russian, because Russia is Georgia's largest trading partner and Russian the most widely spoken foreign language. Coordinator Egbert Lehmann (e-mail). [2001 May 18]
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-04.htm
Texts from the Deutsche Bundesregierung. Cf. this site under the Multilingual and Parallel Corpora section.
Available under subscription to TRACTOR.
The text component of the package includes transcripts and documentation files. The transcripts cover a contiguous 5 or 10 minute segment taken from 100 unscripted telephone conversations between native speakers of German. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Complete auditing information on the speakers represented in the transcripts (including gender, channel quality and so on) is also included. Available as an FTP file from the LDC through membership or for $500.
CELEX, the Dutch Centre for Lexical Information, has three separate databases, Dutch, English and German, all of which are open to external users. The German database (D2.5), made accessible in February 1995, currently holds 51,728 lemmata with 365,530 corresponding wordforms. Apart from orthographic features, the CELEX database comprises representations of the phonological, morphological, syntactic and frequency properties of lemmata. The CELEX database is open free of charge to all academic researchers and people associated with other not-for-profit research institutes (at least until 2001). Users will only be charged a one-off Dfl. 100,= for the CELEX User Guide. In order to log in to CELEX, a personal account should be obtained from Richard Piepenbrock: see this page.
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The German CHILDES corpus is downloadable in a zipped format.
There is also a German component in the Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish) made from narratives elicited using Mercer Mayer's "frog story" picture book.
See the Multilingual and Parallel Corpora section for a fuller entry.
The COSMAS Corpora of the Mannheim IDS (Institut für deutsche Sprache) range from newspapers such as the Mannheimer Morgen to literature, such as the Goethe Korpus or the Grimm Korpus: see the complete list with links to individual corpora descriptions. The corpora include 1736 million words, 26 million of them morphologically tagged, with stemming, concordancing and collocation analysis. Online search through these corpora is free but with some limitations: due to publishers' copyright restrictions, the corpora available to the general public are limited to approximately 1083 million running words; moreover, anonymous COSMAS-I sessions are limited to 60 minutes. A description of the query syntax is available. Information about fuller access or purchasing corpora is shown on this page, along with information on the planned evolution into COSMAS-II. [Rev. 2002 January 22].
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-05.htm
Proceedings of debates in the Deutscher Bundestag, Bonn, Germany. From Institut für Deutsche Sprache, Mannheim, Germany. Available under subscription to TRACTOR.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-08.htm
Cf. under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
http://morph.ldc.upenn.edu/Catalog/LDC95T11.html
The European Language Newspaper Text corpus is also known as the French Language News Corpus. This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked up using SGML.
See under Multilingual and Parallel Corpora section.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-12.htm
German Texts from the Gutenberg Archive. From Institut für Deutsche Sprache, Mannheim, Germany. Available under subscription to TRACTOR.
A small library of interactive hypertexts for free reading and searching, maintained by Èulogos. All literary texts, many religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
http://www.lpl.univ-aix.fr/projects/multext/MUL4.html
This corpus consists of a set of excerpts from the Official Journal of the European Community (JOC) and conforms to the CES (Corpus Encoding Standard). It is available at three levels of processing: paragraph-annotated (CESDOC), POS-tagged (CESANA) and parallel-text aligned (CESALIGN).
Availability unknown: only a few samples downloadable.
The Textlist page of the Kirchenmusik online site (a good and well-known resource for music lovers) by Joachim Vogelsänger offers a huge, free collection of texts of oratorios, cantatas, sacred hymns and the like. Most are in German, ranging from Schütz's Musikalische Exequien to Graun's Der Tod Jesu, J. S. Bach's Matthäuspassion, Mendelssohn's Paulus, Brahms' Vier ernste Gesänge and Webern's Kantate op. 31. All the texts are freely downloadable in simple HTML format. For more details cf. the full entry in the E-Texts section. [2001 August 27].
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: of course German is prevalent, with 3449 texts. A treasure for music lovers. Fuller description in the E-Texts section.
A site devoted to contemporary Austrian literary texts. Original poetry is prevalent, but there is also some translation (e.g. Lull, Shakespeare). All texts are HTML and are freely downloadable.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].
A good collection of medieval Germanic (mainly English) poetry texts, ranging from Old English to Middle English and Scots, with a little Old Norse and Old (High and Low) German. All texts are freely downloadable (but often split into several files). [Last checked 2001 August 27].
Morphy, by Wolfgang Lezius, presents a German morphology and tagger in one package. Morphy 1.1 runs under Win 95-NT. The morphology comprises 50,000 lemmas for 350,000 forms; the tagger works either with a small tagset (51 tags) or with a large tagset (456 tags), reaching respectively 96% and 85% of correctly tagged words.
Morphy 1.1 is freely downloadable following this link.
Myriobiblos, the E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to Modern Greek; there are also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. under the E-Texts section. [2001 May 1].
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-06.htm
HTML texts from NATO. Cf. under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
Saarland University Syntactically Annotated Corpus of German Newspaper Texts. The NEGRA corpus consists of approximately 176,000 tokens (10,000 sentences) of German newspaper text, taken from the Frankfurter Rundschau as contained in the CD "Multilingual Corpus 1" of the European Corpus Initiative. It is based on approx. 60,000 tokens that were tagged for part-of-speech at the Institut für maschinelle Sprachverarbeitung, Stuttgart. This corpus was extended, tagged with part-of-speech and completely annotated with syntactic structures. There is a freely downloadable demo, also in Penn Treebank format. The corpus can be obtained free of charge after signing a Licence Agreement.
This frequently updated site provides references to German Literature resources. Although its interest is exclusively literary, it sometimes also provides useful links to German e-texts.
http://www.geocities.com/~aristipp/litlinks/litlinks.htm
There are so far 19,748 links to German Literary Texts, but many of them lead to subscription sites or to texts unsuitable for downloading.
Opera e-Libretto (Collection Ulric Voyer) is a collection of 220 free e-texts of opera libretti. The libretti on display are in German (Mozart, Weber, Wagner, Johann Strauss, Richard Strauss, Freiherr von Franckenstein), Italian, French, English, Russian and Danish. All texts are in HTML, usually split into several files according to act divisions. For a more detailed entry cf. under the E-Texts section. [2001 June].
Corpus of modern written German, TEI markup, 20 million words. From Institut für Deutsche Sprache, Mannheim, Germany. Available under subscription to TRACTOR.
Nothing to do with Project Gutenberg ... Projekt Gutenberg - DE has texts in German by more than 300 authors, from Aesopus to Zola (all in German). All the texts are freely readable online but are not intended for download (longer works are divided into several chapters). You can instead order a CD ROM with the whole corpus of Gutenberg-DE for only 39.80 DM following this link.
This page maintained by Lyle Neff (cf. homepage) is a rich database of the online sources of opera libretti. Many e-texts (HTML format) are freely available directly from the site, while others are only linked to. Besides libretti, secular songs and sacred vocal music are also dealt with. The languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other less specific musical and linguistic resources. [2001 June 20].
http://www.tu-chemnitz.de/phil/english/real/transcorpus/index.htm
Parts of the German-English Translation Corpus are now available online.
See under Multilingual and Parallel Corpora section.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-07.htm
Texts from Rheinischer Merkur (German Weekly Newspaper). From Institut für Deutsche Sprache, Mannheim, Germany. Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-09.htm
Cf. under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
VISL (Visual Interactive Syntax Learning, Department of Language and Communication, University of Southern Denmark, Odense) provides online queries on plain-text corpora in Danish, German, English and Spanish, and on a tagged Portuguese corpus. The service is for members only (see at this page). For more information cf. the full entry in the Corpora and Corpus Linguistics section.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relations between them. Free web demos of some of their tools are also available for German. For more details cf. the Tools section.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows any two texts to be compared. This page was prepared for use in classes in linguistics, history of the English language, and Old English. [2001 July 13].
http://www.georgetown.edu/labyrinth/library/latin/latin-lib.html
A rich collection of Latin e-Texts, together with a collection of Classical Greek texts that influenced the Latin tradition. Cf. under the E-Texts section for more information.
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable papers in PS format related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Greek CHILDES corpus is downloadable in zipped format.
See under the Multilingual and Parallel Corpora section for a fuller entry.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
The Green Library (or Cactus Library), a project of the Saint-Petersburg School of Religion and Philosophy (SRPh), so far offers a few Greek (Aristotle’s De Anima, Plato’s De Republica), Russian and French texts. More titles are announced as forthcoming, also in Latin, French and Old Slavonian. All titles are freely downloadable in PDF format. For more cf. under the E-Texts section. [2001 May 1].
The Hellenic National Corpus is a corpus of written Modern Greek texts, available over the Internet, for research use only. It is based on the General Language corpus developed by the ILSP (Institute of Language and Speech Processing) and is going to be fully available on the Internet in October 2000. It currently contains about 20,000,000 words of written texts from several media (books, periodicals, newspapers etc.), which belong to different genres (articles, essays, literary works, reports, biographies etc.) and cover various topics (economy, medicine, leisure, art, human sciences etc.). HNC users can make the following queries concerning the lexicon, morphology, syntax and usage of Modern Greek: (1) specific words (e.g. child), (2) lemmas (e.g. child as a lemma produces every inflected form of the word), (3) parts of speech and (4) combinations of up to three of the above, in which users can specify the distance between lexical items (e.g. word + word, lemma + word, lemma + word + word, lemma + part of speech). Users can define their own sub-corpus within the HNC. This sub-corpus may cover one or more media, genres and/or topics and may also be saved for further reference by the users. Query results are presented as whole sentences, within which the query objects are highlighted. Alternatively, concordances of query results are presented, where the query object is centred on the page. Finally, HNC users can make queries concerning word, lemma and/or part-of-speech frequencies within the HNC texts. Statistical information about the 100 and 1,000 most frequent words and lemmata in these texts is also available. [2001 May 1].
+ Information is available in English.
+ EThEG (Ethnikós Thesaurós Elle:niké:s Gló:ssas). It is the online search engine of the HNC. This page is Greek only (beware of encoding problems).
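The combined queries described for the HNC (for example lemma + part of speech within a given distance, with hits highlighted inside the whole sentence) can be pictured with a small sketch. This is only an illustration under stated assumptions: the `Token` structure, the function names and the English toy sentence are all hypothetical, and nothing here reproduces the actual HNC search engine or its data format.

```python
from typing import NamedTuple


class Token(NamedTuple):
    """Hypothetical annotated token: surface word, lemma and part of speech."""
    word: str
    lemma: str
    pos: str


def find_pairs(sentence, first, second, max_distance):
    """Return (i, j) index pairs where token i matches `first`, a later
    token j matches `second`, and j - i is at most `max_distance`.

    `first` and `second` are (field, value) pairs, e.g. ("lemma", "child").
    """
    hits = []
    for i, tok in enumerate(sentence):
        if getattr(tok, first[0]) != first[1]:
            continue
        last = min(i + max_distance, len(sentence) - 1)
        for j in range(i + 1, last + 1):
            if getattr(sentence[j], second[0]) == second[1]:
                hits.append((i, j))
    return hits


def highlight(sentence, indices):
    """Render the whole sentence with the matched tokens in brackets."""
    return " ".join(
        f"[{tok.word}]" if k in indices else tok.word
        for k, tok in enumerate(sentence)
    )


# Toy, invented data (English is used only for readability).
sentence = [
    Token("the", "the", "DET"),
    Token("children", "child", "NOUN"),
    Token("were", "be", "VERB"),
    Token("playing", "play", "VERB"),
]

# Query: lemma "child" followed by any verb at a distance of at most 2 tokens.
hits = find_pairs(sentence, ("lemma", "child"), ("pos", "VERB"), max_distance=2)
for hit in hits:
    print(highlight(sentence, hit))
```

Searching by lemma rather than by word is what lets "children" match a query for "child"; a three-item query would extend the same idea with one more nested match.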
http://www.ilsp.gr/info_eng.html (Greek also)
The Institute for Language and Speech Processing - Institóutos Epexergasías tóu Lógou was founded in Athens, Greece, with the aim of supporting the development of Language Technology. Among the activities of ILSP is the development of Language Technologies for Greek. Specifically, ILSP develops environments for translating from and into the Greek language, as well as computational tools and products which assist the translation task; develops CD-ROMs for computer assisted Greek language learning; creates electronic dictionaries (monolingual and multilingual), computational lexica and electronic dictionaries for children; develops prototypes for speech recognition, synthesis and compression; creates text correction tools. Cf. also the HNC (Hellenic National Corpus). [2001 May 1].
Keimena is a large collection of Modern Greek texts. The pages are all in Greek (you must have a Greek font installed) and are full of annoying Tripod pop-ups. The texts are freely readable, but are displayed in heavy (and slow) HTML pages unsuitable for download.
The Little Sailing library offers some Classical Greek (Aeschines, Aeschylus, Apollodorus, Aristotle, Aristophanes, Euripides, Herodotus, Hesiod, Lucian, Xenophon, Homer, Pausanias, Plato, Plutarch, Sophocles, Thucydides) and a few Modern Greek (Skaribas, Doumenis) texts in Unicode encoding. All texts can be freely downloaded or browsed online, often with side-by-side translation. The downloadable texts are compressed, and each file contains a full text in MS Word 97 format (the original text only - no translation). A Unicode font that supports the Greek Extended range is all you need in order to view the texts. [2001 May 1].
Myriobiblos, The E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to Modern Greek; there is also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. Texts are also categorized by subject: Bible, Liturgical texts, Saints, Catechism, Orthodox spirituality, Patristic texts, Patrology, Theology, Church history, Church dialogue, Church art, Christian art, Christian archaeology, Byzantine history, Modern Greek history, Modern Greek Literature, Philosophy, Debates, and Bibliographies. Cf. also under the E-Texts section. [2001 May 1].
OpenText.org is a resource network and data repository for the study of Hellenistic Greek, including the Greek of the New Testament, and the linguistic and literary study of these texts. OpenText.org is also developing a series of XML annotation specifications for the encoding of papyrus texts. OpenText.org advocates the adoption of an 'open source' approach to text annotation and distribution, but currently there isn’t any e-text library available on the site. There are instead an interesting demo of a Papyr XML edition and a few discussions. However, OpenText.org is still an expanding site, so let us hope for more in the future. [2001 April 29].
This gold mine of resources for Biblical studies and Semitic philology also provides some links to Greek e-texts of biblical interest (Septuagint Bible and Old Greek versions, etc.).
The Thesaurus Linguae Graecae (TLG) is a research center at the University of California at Irvine. Founded in 1972, the TLG has already collected and digitized all ancient texts from Homer to 600 A.D. and most historiographical, lexicographical and scholiastic texts from the period between 600 and 1453 A.D. Its goal is to create a digital library which will include the entire corpus of Greek literature from Homer (8th century B.C.) to the present day. [2001 May 1].
+ Conditions. Institutions and individuals may obtain a TLG CD ROM disk from the TLG Project on a five-year subscription basis (renewable); the terms are very restrictive and expensive. The physical medium (i.e., the disk) will remain the property of the Regents of the University of California; specified use of TLG-generated materials on the disk will be licensed to the CD ROM recipient; the disk is to be returned to the TLG when the license expires. (a) Institutional License (Single-user): $850.00. Institutional CD ROM recipients may provide access to the TLG CD ROM and to the TLG data recorded thereon to all regular constituents, as well as to temporary patrons of the institution while on the institutional premises. (b) Individual License (Single-user): $300.00. The license covers use of the TLG CD ROM and the data recorded thereon by the designated licensee only. (c) Site License (multiple user access): price not stated (contact). This license permits institutions to download the contents of the TLG CD ROM to a hard disk and to maintain the TLG's materials on such a hard disk for multiple access by members of the institution, and/or permits multiple user access to the TLG's CD ROM on a file server owned and maintained by the institution. Site licensees can also access and search the TLG data online.
+ TLG E, the most recent CD edition of the TLG, was released in February 2000 and contains 76 million words of text (6,625 works/work collections from 1,823 authors). This is a significant expansion compared to the 57 million words (from 831 authors and 4,305 works) included in CD ROM D. In an effort to make the TLG data as widely utilized as possible, the cost of the individual five-year subscription was lowered to $300.
+ Dedicated software for the TLG is available for PC and Mac, at various prices, but is not supplied with TLG packages.
+ Online full access. Only institutions with a site license can now search all TLG texts and Canon materials online.
+ The Canon of Greek Authors and Works is a bibliographic guide to the authors and works included in the TLG Digital Library. Originally compiled as an electronic aid to the TLG staff, the Canon became an invaluable resource, the first truly comprehensive list of all known extant texts in Greek. Today it contains more than 3300 authors and in excess of 11,000 works, providing information about the names, dates, geographical origins and descriptive epithets for each author together with detailed bibliographical information about existing text editions for each work. A printed version of the Canon has been published by Oxford University Press (Luci Berkowitz and Karl A. Squitier, Thesaurus Linguae Graecae Canon of Greek Authors and Works, third edition, Oxford University Press, 1990). An electronic version has been included in all TLG CD ROMs. Each Author and Work in the Canon is accompanied by categories of information that may be useful to users interested in history, literary history, and prosopography. These categories form the basis of a relational database and can be searched with the TLG Search Engine, and are freely available from the TLG Online Demo.
+ TLG Online Demo site provides free access to the complete Canon of Greek Authors and Works and a scanty ("representative", they say) selection of TLG texts (see the list). In terms of search capabilities, it offers all the features and functionality of the full Online TLG. Unlike the full version which is currently open only to institutions with a site license, the demo site is open to the public.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relations between them. Free web demos of some of their tools are also available for Greek. For more details cf. the Tools section.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
The EMILLE Project is a three-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. EMILLE will generate written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
For more information and details cf. under the Multilingual corpora section. [2001 June 17].
A single freely downloadable small and old (1766) text in São Tomé Creole. From the Creolist Archives Text Collections. [2001 August 14].
A few freely downloadable, mainly old, documents of Guyana Caribbean English Creole (namely: 1796, 1797, 1835, 1836, 1859, 1990s), from the Creolist Archives Text Collections. [2001 August 8].
A few freely downloadable old attestations and texts of Guyanais (French Guiana Creole French), ranging from 1743, 1744, 1789, 1790s, 1799, to 1872; from the Creolist Archives Text Collections. [2001 August 8].
Besides a few older relics (1786, 1791, 1797, 1802), there is a larger collection (from 1997) of Nan Peyi Dayiti, i.e. the abridged e-mail edition of the newsweekly "Haïti Progrès" (for the full-length paper version, please contact the publisher), and a link to a Haitian Bible that does not seem to be working. All materials are freely downloadable and come from the Creolist Archives Text Collections. [2001 August 9].
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Hebrew CHILDES corpus is downloadable as a zipped file.
There is also a Hebrew component in the Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish) made from narratives elicited using Mercer Mayer's "frog story" picture book.
See under the Multilingual and Parallel Corpora section for a fuller entry.
CSIH is currently in planning: "We have launched this web site - they say - in order to let know of our plans, and mainly to serve as a means for interested people in Israel and around the world to share their views with us as regards our preparations. This site will be - at least at the beginning - under construction and constant change. All comments are welcome". Contact.
This page maintained by Lyle Neff (cf. homepage) is a rich database of the online sources of opera libretti. Many e-texts (HTML format) are freely available directly from the site, while others are only linked to. Besides libretti, secular songs and sacred vocal music are also dealt with. The languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other less specific musical and linguistic resources. [2001 June 20].
A gold mine of resources for Biblical studies and Semitic philology, which also includes some links to e-texts (Bible in Hebrew, Septuagint and Old Greek versions, Latin Vulgata, Dead Sea Scrolls, etc.). [Last checked 2001 August 27].
The EMILLE Project is a three-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. EMILLE will generate written language corpora of at least 9,000,000 words for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
For more information and details cf. under the Multilingual corpora section. [2001 June 17].
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Hungarian CHILDES corpus is downloadable in zipped format.
See under the Multilingual and Parallel Corpora section for a fuller entry.
In the Corvinus Virtual Library you will find freely readable and downloadable transcriptions of a good number of books on Hungarian history, published in the United States of America, in English or translated from Hungarian. Texts are usually in DOC format (with a few in PDF).
http://solaris3.ids-mannheim.de/tractor/telri/BUD/bud-02.htm
Early 19th century poetry, incl. the works of Arany János, Petőfi Sándor, and Kölcsey Ferenc. From the Research Institute for Linguistics of the Hungarian Academy of Sciences (Budapest).
Available under subscription to TRACTOR.
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Hungarian has 77 texts. For a more detailed description, see the E-Texts section.
Over 250 files of spoken data from sociolinguistic interviews recorded since 1987, involving teachers, students, and others from Budapest as well as 1000 national samples. From Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary.
Available under subscription to TRACTOR.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relations between them. Free web demos of some of their tools are also available for Hungarian. For more details cf. the Tools section.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows any two texts to be compared. This page was prepared for use in classes in linguistics, history of the English language, and Old English. [2001 July 13].
A good collection of medieval Germanic poetry texts, ranging from Old English to Middle English, Old Norse, and Old (High and Low) German. All texts are freely downloadable (but often split across several files).
The WES Section of the University of Virginia Library provides a list of resources for 17 European literatures, some of which may be useful for accessing text archives. The languages displayed are: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
Isidore Dyen, Joseph Kruskal, and Paul Black have given the Linguistic Data Consortium permission to distribute the Comparative Indoeuropean Data Corpus at no cost. This corpus was collected by Isidore Dyen with contributions by Black and Kruskal. The corpus includes: (1) 200-item lexicostatistical lists for 95 Indoeuropean speech varieties, (2) cognation judgments between the lists, (3) lexicostatistical percentages, (4) individual replacement rates for 200 meanings, and (5) time separations based on these rates. It also includes an annotated bibliography of lexicostatistics by Paul Black. As of April 1997, the corpus consists of 3 explanatory web pages and 7 plain text files (in ASCII) containing both explanatory material and data. The only restrictions on use are given in the copyright notices contained in all the files. There is no longer a homepage for this corpus, but you can freely download it directly by FTP. [Rev. 2002 October 16].
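As a rough illustration of what a lexicostatistical percentage is, the sketch below computes the share of meanings for which two speech varieties have been judged cognate. It assumes one common definition (cognate meanings divided by the meanings attested in both lists); the dictionary layout, the function name and the cognate-class labels are hypothetical and do not reproduce the corpus files or the exact procedure used by Dyen, Kruskal and Black.

```python
def lexicostat_percentage(list_a, list_b):
    """Percentage of meanings whose words fall in the same cognate class.

    Each argument maps a meaning to a cognate-class label, or to None when
    the variety has no usable word for that meaning. Only meanings attested
    in both varieties are compared.
    """
    compared = [
        meaning
        for meaning in list_a
        if list_a[meaning] is not None and list_b.get(meaning) is not None
    ]
    if not compared:
        raise ValueError("no meanings attested in both lists")
    cognate = sum(1 for m in compared if list_a[m] == list_b[m])
    return 100.0 * cognate / len(compared)


# Invented toy data: five meanings instead of the corpus's 200,
# with made-up cognate-class labels.
variety_a = {"hand": 1, "water": 1, "two": 1, "fire": 2, "sun": 1}
variety_b = {"hand": 1, "water": 2, "two": 1, "fire": None, "sun": 1}

percentage = lexicostat_percentage(variety_a, variety_b)
print(percentage)  # 3 cognate meanings out of 4 compared
```

Here "fire" is skipped because one variety lacks a word for it, so three cognate meanings out of four compared give 75 percent.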
http://titus.fkidg1.uni-frankfurt.de/
The TITUS server provides text materials from languages that are relevant for Indo-European studies. Many languages (including non-Indo-European ones) are represented, either by text editions prepared by TITUS itself or by external e-texts and projects that are simply linked to: see the general Index, but beware that it is a very heavy page. TITUS also hosts many projects and initiatives, such as the Armazi, Ogamica and Tocharian Projects. Texts are mostly in HTML format, but often also in WordCruncher format, and sometimes in TXT. Usually the HTML (UTF8) and WordCruncher versions are publicly available (that is, they can be downloaded and used freely for scholarly purposes, provided that they are quoted as sources and that the name(s) of the editor(s) and the date of last changes are indicated in publications), whilst the TXT versions are restricted to TITUS members. Most of the texts are in fact accessible on the TITUS WordCruncher Server for investigations of many kinds. [2001 July 14; Rev. 2001 August 30].
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable papers in PS format related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. under the Parallel Corpora section. [2001 May 1].
CELT is the main online resource for contemporary and historical Irish documents in literature, history and politics. Its mission statement is to bring the wealth of Irish literary and historical culture (in Irish, Latin, Old Norse, Anglo-Norman French, and English) to the Internet in a rigorously scholarly project that is, at the same time, user-friendly for the widest possible range of readers and researchers - academic scholars, teachers, students (at all levels), and the general public, in Ireland and internationally. The published texts are freely available in various formats: HTML (for browsing), TEI-compliant SGML (download from FTP), plain text and PS for printing. The number of texts still in progress is huge, and now [2001 June 19] they have approx. 3,800,000 words online. Simple searches (for a word or part of a word) through the CELT database can be made both for texts and markup. Definitely a great site. [Last Rev. 2001 August 27].
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Irish CHILDES corpus is downloadable in zipped format.
See under the Multilingual and Parallel Corpora section for a fuller entry.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
This Ogamic database by Jost Gippert, a part of the TITUS (Thesaurus Indogermanischer Text- und Sprachmaterialien) project, so far contains about eighty inscriptions, with romanization, transcription, transliteration, picture, bibliography, description and historical and philological notes. Data are freely downloadable, but cannot be republished in any form without prior permission from Jost Gippert (e-mail). Please be aware that these pages are encoded using Unicode / UTF8. The special characters they contain can only be displayed and printed by installing a font that covers Unicode, such as the freely downloadable TITUS font TITUS Cyberbit Basic. [2001 July 14; checked 2001 August 30].
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
Riccardo Scateni's page contains about forty Italian Literary Texts, mainly from Progetto Manuzio, free in HTML format.
It is a collection of commercial CD-ROMs of Italian Literary Texts (Lyric Poetry from Petrarca to Marino; Petrarca's Complete Edition; Tasso Complete Edition; Leopardi Complete Edition; XIV-XVI centuries Comments on Dante) in electronic editions, which can also be queried with Eugenio Picchi's DBT. Now available at discount prices also from Libreria Chiari. For more information see under the E-Texts section.
AVIP (Archivio Varietà Italiano Parlato, directed by P.M.Bertinetto) Spoken Italian Corpus is now freely available by ftp. The corpus comprises mainly (1) map task dialogues from Bari, Napoli and Pisa, and (2) mixed readings of words uttered by the same speakers of the dialogues. [2001 April 29].
This corpus of Learner Italian consists of transcriptions of about 120 hours of speech by twenty speakers with eight different native languages. The data, gathered between 1997 and 2000, are available only as HTML on a CD-ROM that can be ordered by e-mail. Signing a legal disclaimer is required, and the price is low (15,000 L.). [2002 January 23].
This Library (maintained by the Tabula Fati Publishing of Chieti) includes only a few literary Italian texts (originals and translations). All are free in HTML format, though sometimes not download friendly (larger texts are split over several pages).
The CiBIT (Centro interuniversitario Biblioteca Italiana Telematica) project, directed by Mirko Tavoni and maintained at Pisa University, aims to collect Italian Literary Texts (Latin and dialectal as well) and make them freely available for reading, download (not always so free ...) and simple queries. It has been operating only since March 2000 and is still under construction. There is also a small POS-tagged and lemmatized corpus, cf. CiBIT Lemmatized Corpus.
This free Italian Electronic Literary Text Archive is the library of Cyberia by Nem0, and also has a mirror at Tiscali. The catalogue of classics is not as large as Progetto's, but there are also some new modern texts published by the authors directly on the web. Most texts are in HTML format (but there are also a few unusable JPG facsimiles) and can be freely viewed and copied. The most outstanding feature of the site is, however, the availability of speech synthesis: the plug-in MyVoice Net, freely available from the Cyberia site, both for Win and Mac, interfaces your browser with a speech synthesis system based on the Eloquens 2000 technology by CSELT. [Rev. 2001 December 31].
The BOnonia Legal Corpus (BOLC), developed in CILTA (Centro Interfacoltà di Linguistica Teorica e Applicata 'L. Heilmann') at Bologna University since 1997 by Rema Rossini Favretti and Fabio Tamburini, currently consists of two subcorpora, one English and one Italian, but it could be expanded at a later stage. Future availability is not known. For more details cf. under the Parallel Corpora section.
A rich collection of links of general Italian interest - a few on corpus linguistics as well. From Cesáreo Calvo Rigual (cf. homepage) of the Universitat de València. [2002 February 18].
Mirror sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Italian CHILDES corpus is downloadable in a zipped format.
There is also an Italian component in the Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish) made from narratives elicited using Mercer Mayer's "frog story" picture book.
See under Multilingual and Parallel Corpora section for a fuller file.
(follow the link to Opere di Dante con marcatori grammaticali)
The CiBIT Corpus, a spin-off project of the Biblioteca Italiana Telematica, is said to have Commedia, Fiore, and Milione, but so far only Purgatorio is working (last check 29-08-2000). It can be queried only online (with the web version of DBT). The searches allowed are very rudimentary, but it is hoped that this will improve (DBT can certainly do a lot more!). [Last check August 2000].
Television Italian Corpus (CiT) is a collection of transcripts from television broadcasting. Its aim is to analyze the lexical, grammatical and syntactic peculiarities of the Italian broadcast on television. The project began in 1998; when finished the CiT will have about 500.000 words, in order to make it quantitatively comparable to the other Written (LIF) and Spoken (LIP) Italian corpora. Completion of the transcripts is expected by the end of 2001. In the meantime a first sample of the corpus, viz. the CiT Demo of about 125.000 words, is being used for testing markup (TEI), lemmatization and tagging. Contact: Stefania Spina. [2001 April 30].
+ CiT Demo. A sampler of CiT (125.000 words, 25% of total corpus) was finished in October 1999. A TEI markup specifically suited to television transcripts is being tested. The CiT Demo is also being processed with the Tree Tagger from IMS Stuttgart.
http://www.cilta.unibo.it/SITOCORIS_ITA.htm (English also)
CORIS is a corpus of written Italian being developed in CILTA (Centro Interfacoltà di Linguistica Teorica e Applicata 'L. Heilmann') at Bologna University; it is due to be completed and made available online and on CD-ROM by the end of 2001. The project, designed and co-ordinated by R. Rossini Favretti, was started in 1998 with the purpose of creating a representative and sizeable general reference corpus of written Italian which would be easily accessible and user-friendly. CORIS contains 80 million words and will be updated every two years by means of a built-in monitor corpus. It consists of a collection of authentic and commonly occurring texts in electronic format, chosen for their representativeness of modern Italian (a detailed description is available on the site). When released it will surely be a great reference work for modern written Italian. For more information e-mail Cristiana Desantis. [2001 April 23].
http://www.cribecu.sns.it/analisi_testuale/settore_informatico/_en_index.html
The CRIBeCu (Centro di Ricerca per i BEni CUlturali), besides some tools, provides a few SGML Italian literary texts that can be queried online, namely Vasari's Vite and the Vocabolario della Crusca.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
Èulogos is a commercial Italian site for Language Engineering. It maintains some free services: (1) an Italian online Morphological Dictionary, based on the SLI (Sistema Lessicale Integrato) technology, (2) the nine-language IntraText library, and (3) the Italian Censor service for GULPEASE readability and basic vocabulary tests (e-mail submission).
http://www.fausernet.novara.it/fauser/biblio/index.htm
Giuseppe Bonghi's Italian Classics Library provides 118 texts by 36 Italian authors (the source editions are often not the best available). All texts are HTML and access to them is free; since many texts are split into several smaller pages, downloading them is not an easy job.
The Ezio Galiano Foundation for Italian blind people, besides other services, provides about 2,500 literary works in Italian, ranging from classical masterpieces to leisure fiction: see the Catalogue. Free downloads can be made by typing "http://www.galiano.it/biblio/autori?/filenoun.zip", where "autori?" must be replaced by "autoriA" if the author's name begins with A, "autoriB" if it begins with B, and so on.
The E-Text section of the Biblioteca magistrale "Freinet" (Tangram, in via Portici 204 - Merano) includes 70 texts of classics of Western literature in Italian. All are in either TXT or ZIP format, free and ready for download.
A small library of interactive hypertexts for free reading and search maintained by Èulogos. All literary texts, many religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
Texts in Italian. 23 ASCII text files. Some Italian originals, some translations. From LORIA, Nancy, France.
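The download rule above can be sketched as a small helper. The function name is hypothetical, and I assume that the relevant initial is the first letter of the author's name as listed in the Catalogue; the URL pattern is taken from the description above and has not been checked against the live site.

```python
def galiano_url(author_name, filename):
    """Build a download URL following the 'autori?' rule described above.

    author_name: the author's name as listed in the catalogue (assumption:
                 its first letter selects the directory).
    filename:    the zip file name as given in the catalogue.
    """
    name = author_name.strip()
    if not name or not name[0].isalpha():
        raise ValueError("author name must start with a letter")
    initial = name[0].upper()
    return f"http://www.galiano.it/biblio/autori{initial}/{filename}"

# "filenoun.zip" is the placeholder used in the description, not a real file.
print(galiano_url("Alighieri", "filenoun.zip"))
```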
Available under subscription to TRACTOR.
ItalNet is an international consortium whose mission is to make available scholarly Internet resources of literary and historical materials relating to Italian studies. It provides the Internet publication of the OVI (Opera del Vocabolario Italiano) database of Early Italian Texts and is associated with the Chicago ARTFL Project.
http://www.lpl.univ-aix.fr/projects/multext/MUL4.html
This corpus consists of a set of excerpts from the Official Journal of the European Community (JOC) and is CES (Corpus Encoding Standard) conformant. It is available at three levels of annotation: paragraph annotated (CESDOC), POS-tagged (CESANA) and parallel text aligned (CESALIGN).
Availability unknown: only a few samples can be downloaded.
The Textlist page of the Kirchenmusik online site (a good and well-known resource for music lovers) by Joachim Vogelsänger offers a huge and free collection of texts of Oratorios, Cantatas, Sacred Hymns and the like. Most are in German, and only very few texts are in Italian (Monteverdi's Combattimento di Tancredi e Clorinda and Lamento d'Arianna). All the texts are freely downloadable in simple HTML format. For more details cf. the full file in the E-Texts section. [2001 August 27].
A few Italian newspaper texts (mainly "Corriere della sera" 15-06-94 and 21-10-89 and "Il mattino" 21-10-89) marked up and made freely queryable online via TACTweb by Elisabeth Burr. This corpus is a subsection of the wider project "Romanische Zeitungssprachen", and is part of E. Burr's Online Korpusanalyse mit Hilfe von TactWeb pages, which offer some small but useful Italian, French and Spanish corpora queryable online via TACTweb. [2001 April 23].
Since 1973 LABLITA (Linguistic Laboratory of the Department of Italian; contacts: Massimo Moneglia and Emanuela Cresti) has been collecting corpora of spontaneous spoken language in order to build a database for studying its linguistic properties. Corpora are collected during lessons at Florence University and are transcribed and stored in CHAT format. Access is severely restricted (it is allowed only within formal research projects, and an explicit agreement on quotation norms and conditions of use is required). [2001 April 30].
+ Spontaneous Italian Corpus. This corpus of spontaneous adult spoken Italian contains more than 130 texts of variable length (from 5 minutes up to 2 hours), divided into 5 corpora for a total of about 62 hours. The audio archive is on DAT cassettes in chronological order of acquisition. Access is nearly impossible (consultation may be permitted only locally, by prior agreement with LABLITA). For more details cf. also this PDF paper by Massimo Moneglia.
+ A small sample (about 6 hours of speech) edited by E. Cresti was published (a bit anachronistically) in book form by Accademia della Crusca in 2000. This two-volume printed sample can now also be ordered via the Web from the site of the bookshop Leggere Per at the cost of 100.000 IL.
+ The Corpus of early acquisition of Italian from 12 months to 36 months consists of three subcorpora. (1) A "Corpus Ferrara" with 20 longitudinal collections in nursery school, 181 texts for about 52 hours. (2) A "Corpus Firenze", with 10 longitudinal collections in family settings, 102 texts for about 33 hours. (3) "Samples", i.e. a cross-sectional open collection, with 20 texts for about 10 hours. All texts are in CHAT format.
+ A collection of media corpora, with three distinct corpora of cinematographic transcriptions (respectively of 12, 6 and 6 films), and a corpus of radio and TV broadcasting (about 20 hours).
On this site, maintained by Massimo Moneglia, all the papers by people at the Linguistic Laboratory of the Department of Italian (LABLITA) of Firenze University are collected as freely downloadable PDF files (you can also download them from the list page). [2001 April 30].
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Italian has 576 texts. For a more detailed description, see in the E-Texts section.
This is the corpus used in 1971 for the IBM Italian Frequency Lexicon edited by Bortolini et al. This 500.000-word corpus was based on five textual groups (theatre, novels, movies, magazines and sussidiari), from which the 500.000 occurrences of the sample were extracted (100.000 for each genre). Notwithstanding its age, it is sometimes taken as the written counterpart of the more modern LIP (Spoken Italian Frequency Lexicon) corpus. [2001 April 30].
+ The LIF corpus is not available from the Web. You can find some information on LIF and other Italian frequency lexica in this paper by Nicola Mastidoro and Maurizio Amizzoni.
A useful but still incomplete online version, freely queryable via TACTweb, of the corpus used by Tullio de Mauro / Federico Mancini / Massimo Vedovelli / Miriam Voghera for their well-known "Lessico di frequenza dell’Italiano Parlato" (Spoken Italian Lexicon - Milano, Etaslibri, 1993, with 2 floppy disks package), made available by Elisabeth Burr with COCOA markup by herself and Bettina Möller. It is part of E. Burr's Online Korpusanalyse mit Hilfe von TactWeb pages, which offer some small but useful Italian, French and Spanish corpora queryable online via TACTweb. [2001 April 23, Rev. 2001 May 20].
The LIZ is by far the largest and most popular commercial queryable corpus of Italian literature. The query and management software is Picchi's renowned DBT (Data Base Testuale). [Rev. 2001 August 31].
+ Ver. 3.1, the last version (1998) referred to on the site, contains 770 literary texts (over 31 million words) from the Italian canon, dating from the thirteenth century through to the twentieth century. The search interface allows a selected corpus to be constructed by author, genre, century and other specifications. Searches can be for words, parts of words, and multiple words using Boolean connectors. Results can be displayed by keyword in context or downloaded to disk. The search interface also allows for the creation of word indices by alphabet, frequency, incipits, and explicits. Whole texts cannot be extracted.
+ LIZ Compact Disc, edited by Pasquale Stoppelli and Eugenio Picchi, is published by Zanichelli Editore, Bologna, Italia, and is sold by booksellers. Catalogue price: L. 280,000.
+ In the meantime, a new Ver. 4.0 (2001) has been released, but it is still not advertised on the site. The most noteworthy improvements are the larger number of texts (raised to about 1,000), the support for lemma queries, and the lower price (fixed at L. 198,000).
This site provides some unconventional Italian texts, ranging from translations of the classics (Herodas's Mimiambi, Theophrastus's Characters and Poggio's Facetiae) to the anonymous Manganello and Visconti Venosta's Prode Anselmo. All texts are freely downloadable HTML.
Myriobiblos, The E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to modern Greek; there are also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. under the E-Texts section. [2001 May 1].
This Electronic Data Base, maintained by the Libera Associazione Nuovo Rinascimento (which was born in the Italian Studies Department of Firenze University), contains some Italian literature texts (mainly but not exclusively Renaissance texts) and some modern scholarly works as well. All are freely downloadable in several formats.
Opera e-Libretto (Collection Ulric Voyer) is a collection of 220 free e-texts of opera libretti. Displayed libretti are in Italian (Monteverdi, Provenzale, Haendel, Vivaldi, Piccinini, Pergolesi, Cimarosa, Mozart, Salieri, Jommelli, Spontini, Botnjanskij, Fioravanti, Rossini, Bellini, Donizetti, Soliva, Verdi, Boito, Ricci, Mancinelli, Anfossi, Giordano, Catalani, Mascagni, Leoncavallo, Zandonai, Puccini), French, English, German, Russian, and Danish. All texts are in HTML, usually split into several files according to act divisions. For a more detailed file cf. under the E-Texts section. [2001 June].
The OVI, Opera del Vocabolario Italiano, formerly Vocabolario della Crusca, is run by the ItalNet international consortium. The OVI Early Italian Database contains 1410 vernacular texts (16.8 million words, 442,000 unique forms, 116 MB of text) dated prior to 1375, the year of Boccaccio's death. The verse and prose works include early masters of Italian literature like Dante, Petrarch, and Boccaccio, as well as lesser-known and obscure texts by poets, merchants, and medieval chroniclers. The OVI database was created to aid in the compilation of a historical dictionary of the Italian language, the Tesoro della lingua italiana delle origini (portions of which are now available online; see address supra). The fully searchable ItalNet implementation of the OVI database presented here has been produced in order to enable scholars around the world to benefit from this rich textual resource. The OVI search form is restricted to registered users; membership is easy and so far free (follow this link), but in the future a small fee will be required.
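For readers who have never seen a "keyword in context" display like the one the LIZ interface offers, here is a minimal sketch of the idea. It is in no way DBT's implementation: the function name, the word-based window and the bracket notation are all my own assumptions.

```python
import re

def kwic(text, keyword, width=3):
    """Return one line per hit, with `width` words of context on each side."""
    tokens = re.findall(r"\w+", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():          # case-insensitive match
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

sample = "Nel mezzo del cammin di nostra vita mi ritrovai per una selva oscura"
for line in kwic(sample, "vita"):
    print(line)
```

A real concordancer would also align the keyword in a fixed column and report the source text and position of each hit.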
"Filosofia in Italia" Page of Philosophical Texts (translated) in Italian Language. A lot of items (but not all) are only links to Progetto Manuzio; most texts are freely downloadable in zip format.
Progetto Duecento is a database covering most of thirteenth-century Italian poetry, made by Francesco Bonomi. Online you can make only very simple queries (and direct access to the texts is not allowed), unless you buy the offline program at the small cost of $40.20.
Liber Liber has the largest Italian library of electronic literary texts. All are freely downloadable in zip format.
This page, maintained by Lyle Neff (cf. homepage), is a rich database of the online sources of opera libretti. A lot of e-texts (HTML format) are freely available directly from the site; others are only linked to. Besides libretti, secular songs and sacred vocal music are also covered. Languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other less specific musical and linguistic resources. [2001 June 20].
SENSEVAL is a project concerned with evaluating Word Sense Disambiguation systems. The first SENSEVAL took place in the summer of 1998, for English, French and Italian. The second is planned for Pisa, Spring 2001. For more cf. the References, Standards etc. section.
http://solaris3.ids-mannheim.de/tractor/telri/MAN/man-09.htm
Cf. under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
The Turin University Treebank (TUT) is a project by Cristina Bosco of the Natural Language Processing Group (Turin University, Computer Science Department) for the development of a collection of morphologically and syntactically annotated sentences; it includes the definition of a representation format that reflects the peculiarities of Italian, and its application to a newspaper corpus by means of tagging and parsing tools. The TUT adopts a representation format based on the dependency paradigm, centred upon the notion of predicate-argument structure and characterized by a rich system of grammatical relations. It is motivated by the peculiarities of the Italian language and by the advantages that come from a rich annotation. In this format the morphological level (Part Of Speech tagging), where data related to single words are coded, is kept separate from the syntactic one, which represents the relations between elementary units of the sentence in tree structures. To represent some phenomena, such as discontinuity and null subjects, the format has been enriched with a trace-filler notation. TUT development includes fully automated POS tagging, performed with a tagger implemented at the Department of Computer Science of the University of Turin, and semi-automated syntactic annotation performed with an interactive tool implemented for this project. For further inquiries contact Cristina Bosco.
+ A small treebank of 500 annotated sentences is already freely downloadable in a 198 Kb .zip file, along with some linguistic notes in Italian (treebank-based description of major linguistic phenomena in a 20 Kb compressed .ps file) and data about the treebank (the data collected about the first 500 annotated sentences in a 5 Kb compressed .xls file). [2002 February 20].
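To make the idea of keeping the morphological and the syntactic level apart a little more concrete, here is a minimal sketch of a dependency-style annotation with a trace for a null subject. It is not the actual TUT file format: the class names, tag names, relation labels and the choice of heads are all hypothetical and only illustrate the principle.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:                 # morphological level: data about single words
    index: int
    form: Optional[str]      # None marks a trace (no surface word)
    pos: str

@dataclass
class Relation:              # syntactic level: links between units
    dependent: int
    head: Optional[int]      # None marks the root of the tree
    label: str

# "Legge il giornale" ("(He/she) reads the newspaper"): the subject is unexpressed.
tokens = [
    Token(1, "Legge", "VERB"),
    Token(2, "il", "DET"),
    Token(3, "giornale", "NOUN"),
    Token(4, None, "TRACE"),         # stands in for the null subject
]
relations = [
    Relation(1, None, "TOP"),
    Relation(2, 3, "DET"),
    Relation(3, 1, "OBJ"),
    Relation(4, 1, "SUBJ"),
]

def dependents(relations, head):
    """List (token index, relation label) for every unit governed by `head`."""
    return [(r.dependent, r.label) for r in relations if r.head == head]

print(dependents(relations, 1))
```

Keeping the two lists separate means the POS tagging can be redone or corrected without touching the tree, which is the point of the separation described above.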
The Archive of this Newspaper is online and can be freely queried for simple strings of text and/or article authors.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relations between them. Free web demos of some of their tools are available, also for Italian. For more details cf. the Tools section.