Localized Resources .4. |
I provide here language-specific links to corpora, e-texts and NLP resources in general. Resources already presented in the previous sections are also repeated here whenever relevant.
The Corpus of Written British Creole was compiled at Lancaster University with financial support from the British Academy. "British Creole" is a cover term for local varieties of Jamaican Creole used in Britain by second- and third-generation Caribbeans. Most of the searching for texts, permission clearance and inputting work was carried out by Sally Kedge in 1995. In 1998, further work, including additional tagging and error checking, was done by Susan Dray. The Corpus of Written British Creole is very small in corpus-linguistic terms (around 12,000 words), but was designed, selected and tagged (with CLAWS4) with great care. The page linked here is actually the corpus manual, which provides a lot of information; you can also check the announcement of the corpus. You can obtain a copy of the corpus free of charge for research purposes by contacting: Dr. Mark Sebba (Lecturer in Linguistics, Lancaster University: homepage). Department of Linguistics and Modern English Language, Lancaster University, Lancaster LA1 4YT (Great Britain) Tel: 01524 592453 (from outside Britain: +44 1524 592453), E-mail. [2001 April 29].
The Corpus of Written Jamaican Creole is currently under construction by Susan Dray (Lancaster University), and there isn’t yet any information available on the Web. [2001 April 29].
A collection of freely downloadable old Jamaican Creole English texts (namely: 1733, 1780, 1788, 1789, 1803, 1806, 1817, 1826, 1830s, 1834, 1839, 1841, 1859), from the Creolist Archives Text Collections. [2001 August 8].
The text component of the package includes transcripts and documentation files. The transcripts cover a contiguous 5- or 10-minute segment taken from 120 unscripted telephone conversations between native speakers of Japanese. The transcripts are timestamped by speaker turn for alignment with the speech signal and are provided in standard orthography. In addition to transcript files, this corpus contains full documentation on the transcription conventions and format. Auditing and demographic information on the speakers represented in the transcripts (including gender, channel quality and so on) is also included. Available as an FTP download from the LDC, either through membership or for $500.
ChaSen is a free Japanese morphological analyser by the Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). Version 1.0 was officially released on 19 February 1997; the latest release is 2.5.1 (30 January 2002). It grew out of the development of JUMAN version 2.0 and brings a significant improvement in system performance. The tool was christened with the Japanese name for the tea whisk because it was developed at NAIST, situated at Takayama (Nara), which is famous for producing the tea whisk used in the traditional Japanese tea ceremony. ChaSen can be freely downloaded from the site in UNIX and Linux versions; the dictionary it works with (IPADIC) is also available for Windows. [2002 February 17].
mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Japanese CHILDES corpus is downloadable.
See under Multilingual and Parallel Corpora section for a fuller file.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
This page is only the schematic description of a stimulating EALC 222 Winter 2002 course held at UCLA by Hongyin Tao (homepage), but it also provides some good references, especially on CJK computational analysis. [2002 February 17].
The spoken corpus that the Hypermedia Corpus Project has been developing so far at the Fukuoka Institute of Technology is radically different from existing large-scale language databases such as London-Lund and the BNC (British National Corpus) in that it is bundled with digital video and audio data (movies) as well as full texts. The unique features of their hypermedia corpus will be as follows: (1) The intonation, pauses and some other information can be represented in their original forms by digital sounds, thereby making it unnecessary to assign specific linguistic notations or 'esoteric' symbols. (2) Non-verbal information is available in the form of multi-angle digital movies, providing the speaker's and listener's facial expressions, nods, and so on. (3) Any movie and sound data is random-accessible by its code number and/or one or more key words, in addition to the traditional text search, thanks to new digital video-on-demand (VOD) technology and hypertext description languages such as HTML.
Only a sample is available on the site. For further information, contact Ryuichi Uemura, Department of Liberal Education, Faculty of Engineering, Fukuoka Institute of Technology.
This Japanese language text (EUC character encoding) corpus is composed of business and financial news from two sources: [1] - Approximately 30 million words of text have been made available from the morning edition of Nihon Keizai Shimbun, the largest Japanese daily financial newspaper; the release this year covers all text that was published during 1994. [2] - A smaller part of the corpus comes from Dow Jones Telerate, which markets its Japanese Language Service. This is a financial newswire produced by Kyodo News Service; its recipients are primarily managers of Japanese-owned corporations, or Japanese employees working in North American brokerage houses, banking, etc.
Available only by LDC membership.
This corpus consists of newswire text (EUC character encoding) from Nihon Keizai Shimbun, Inc. (NIKKEI), the largest daily Japanese financial newspaper, and Telerate, Inc. (formerly known as Dow Jones/Kyodo News Service), published primarily for managers of Japanese owned corporations or Japanese employees working in North American financial institutions. The Telerate portion constitutes all newswire text collected by the LDC between December 1994 and September 1998. The Telerate data collected from June 1995 to September 1998 serves as a supplement to the original publication. All NIKKEI data was collected from December 1993 to November 1994 and is also available on the 1995 release of the Japanese Business News Text. LDC added SGML tags.
Available only by LDC membership.
The Japanese Morphological Analyzer (JMA) from Basis Technology is a portable segmentation engine for Japanese text combined with Japanese dictionaries. It can index and search large collections of Japanese documents (also text fields in databases), generate word lists and verify consistency between kanji and yomi forms. Input texts must be in Unicode UCS-2 format. The price is not stated, and to request an evaluation version you have to send them an e-mail. There is, however, also an online demo. [2002 February 17].
NAIST-NLT (Nara Institute of Science and Technology Natural Language Tools) provides a flexible natural language processing environment. The system consists of JUMAN (a morphological analyser for Japanese and English), SAX (a compiler of a DCG to a bottom-up chart parser), VisIPS (a visual interface for showing the partial results of the parsing process), and supporting programs for implementing natural language grammars. For a full description see the Tools Section. [2001 April 28].
Ngram computes N-gram statistics for a text file. It accepts input text encoded in either EUC-JP or ISO-2022-JP and can detect the encoding of the input file automatically, which makes it well suited to Japanese. English documentation for the Ngram source package is not available yet, but the current release, ngram-0.6.1.tar.gz, is freely downloadable. For more details cf. the Tools section. [2001 April 29].
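To make the idea concrete, here is a minimal Python sketch of character N-gram counting over Japanese text whose encoding is either ISO-2022-JP or EUC-JP. It is not the Ngram tool's own code or interface: the function names, the trial-decoding strategy and the choice to ignore whitespace are all assumptions made for illustration.

```python
from collections import Counter

# The two encodings named in the entry. Trying ISO-2022-JP first is an
# assumption of this sketch: it is a 7-bit encoding, so EUC-JP bytes
# (which have the high bit set) make its decoder fail and fall through.
CANDIDATE_ENCODINGS = ("iso2022_jp", "euc_jp")


def decode_japanese(raw: bytes) -> str:
    """Decode raw bytes with the first candidate encoding that succeeds."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    raise ValueError("input is neither ISO-2022-JP nor EUC-JP")


def char_ngrams(text: str, n: int) -> Counter:
    """Count overlapping character n-grams, ignoring all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def ngram_statistics(raw: bytes, n: int = 2) -> Counter:
    """Decode a byte string and return its character n-gram counts."""
    return char_ngrams(decode_japanese(raw), n)
```

Character N-grams are used here because Japanese text has no spaces between words; counting word N-grams would first require a segmenter such as a morphological analyser.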
The Web Concordancer site, by the Virtual Language Centre of the Polytechnic University of Hong Kong, presents a few indexed corpora (English, French, Chinese, Japanese) that can be freely browsed with the ConcApp program. Corpora available include the Brown Corpus, Sherlock Holmes stories, the South China Morning Post, etc. [2002 February 17].
A part of the Speech and Language Web Resources, the big reference archive by Kenji Kita, Tokushima University. It is a rich and very useful page, but beware that it is in Japanese only. [2001 April 28].
This page, from the Department of Asian and Pacific Linguistics - Institute of Cross-Cultural Studies - Tokyo University (http://130.69.201.129/), maintained by Kazuto Matsumura, provides a single and short free Karelian e-Text: Pekka Zaikov, Luvemma vienankarjalaksi. 3.-4. luokka, Petroskoi "Karjala", 1995. HTML format.
The "21 century age-length project" is an important national project to compile various kinds of Korean corpora and electronic dictionary resources. Unfortunately, it is now only in Korean, but an English version is in the making and will soon be provided. Contact the webmaster by e-mail. [2001 July 8]
http://ikc.korea.ac.kr/~bmkang/corpus.htm (also without frames)
Beom-mo Kang is professor of Linguistics at Korea University in Seoul; his research also covers corpus and computational linguistics and the "computers in the Humanities" field. This page is a rich repository of links to Computational Corpus Linguistics resources of general interest and, most notably, of specifically Korean content. A very useful page, but sadly so far only available in Korean; an English version of Beom-mo Kang's personal page does, however, exist. Contact. [2001 April 23. Last checked 2002 February 17].
This page, from the Department of Asian and Pacific Linguistics - Institute of Cross-Cultural Studies - Tokyo University, maintained by Kazuto Matsumura, provides a single free Korean e-Text: the 1449 'uer'incengaqjigog saq romanized text with SJIS kanjis.
This page is only the schematic description of a stimulating EALC 222 Winter 2002 course held at UCLA by Hongyin Tao (homepage), but it also provides some good references, especially on CJK computational analysis. [2002 February 17].
The KAIST site is a database for the Korean language. It holds data such as corpora collected from various documents (Raw Corpus, POS Tagged Corpus, Syntactic Tree Tagged Corpus, Categorized Corpus, Sentence Pattern Collection), phonetic data from various speakers, and character data (on KS-C 5601 code Hangul, 1,000 pairs of 2,350 characters) for recognizing Korean off-line handwriting. On this page there are several tools for processing Korean language information, ranging from a dictionary development and management environment to POS and Syntactic Tree taggers and a Korean-English alignment workbench. There are also dictionaries (terminology, English bilingual and morphological). For corpus inquiries e-mail this address; for language tools e-mail this other one. [2001 April 26].
The Korean Morphological Analyzer (KMA) from Basis Technology is a portable linguistic segmentation engine for Korean text. The Korean language presents challenges for morphological analysis, and recognition of word boundaries is often difficult. KMA analyzes and extracts keywords from Korean text based on linguistic characteristics and an optimized dictionary of essential modern Korean words. KMA performs morphological analysis on Korean words (Eojeol), including: segregation of morphemes according to POS, grammar or relational function of each morpheme; examination of likelihood of combination between morphemes; stemming (reducing to root form) of irregular verbs/adverbs/adjectives; presumption of compound nouns; recognition of unknown/unregistered words; support for a user-defined dictionary; support for multiple reference dictionaries; decomposition of compound nouns; generation of a list of words from Korean texts; identification of the root form and part-of-speech (POS) information for each morpheme that constitutes Eojeol; and recognition of patterns for morphological structures of Eojeol. Price is not stated, and to request an evaluation version you have to send them an e-mail. There is however also a demo online. [2002 February 17].
(Korean also)
This is the page of the Dept. of Computer Science and Engineering, Korea University (1, 5-ka, Anam-dong, SEOUL, 136-701, KOREA), established in 1991. They have developed various working systems for natural language processing (morphological analysis, part-of-speech tagging, word sense disambiguation, etc.) and for language-dependent applications (information retrieval, spelling correction, linguistic knowledge acquisition, etc.). Recently, their research interests have concentrated on syntactic analysis and multilingual applications such as multilingual information retrieval and machine translation. They also have some online demos of their work, such as a Korean morphological analyser, a Korean POS tagger, a Korean-English cross-language information retrieval system, etc. [2001 April 26].
The Korean morphological/lexical analyzer by Seung-Shik Kang (see his homepage in English or in Korean) from Kookmin University is a part of the Hangul Analysis Module (HAM). It has now reached Version 5.0.0a and works on Win 98, 2000 and NT, Linux or Solaris. Freely downloadable directly from the site. An online demo (Korean page only!) is also available. [2002 February 17]
This page on Korean NLP at Penn (i.e. the University of Pennsylvania; cf. the Penn Tools file) introduces three main projects on Korean NLP currently being conducted at Penn: Korean XTAG, Korean Treebank, and Korean/English Machine Translation. It also provides some links to NLP in Korea and to some interesting freely downloadable PS papers by the Penn people. [2001 April 27].
A good Korean reference site on NLP - at least if you are interested in the "Korean" point of view on NLP and, of course, if you know a bit of the language, because the site, except for the home page and navigation frame, is strictly in Korean ... [2001 April 26].
People at Penn (i.e. the University of Pennsylvania; cf. the Korean NLP at Penn file) are developing a Treebank for Korean with linguistically well motivated and theory-neutral POS tagging and syntactic bracketing guidelines, using for syntactic bracketing a phrase structure annotation similar to the one used in the other Penn Treebanks, viz. English Penn Treebank, Chinese Treebank and Middle English Treebank. The corpus for the Korean Treebank project consists of texts from military language training manuals. These texts contain information about various aspects of the military, such as troop movement, intelligence gathering, and equipment supplies, among others. The texts in the manuals were originally in printed form and, in order to be used for the Treebank, were then converted into a machine-readable form. This corpus contains 48,802 words and 6249 sentences. The linguistic information in the Korean Treebank will provide a standard framework in which to train and evaluate tools such as POS taggers and stochastic parsers. The Treebank will also be used to extract lexicalized grammars, e.g. a Korean Tree Adjoining Grammar, which can be used for other applications, such as natural language generation. There are already tools developed at Penn that train parsers and extract Tree Adjoining Grammars from a phrase-structure based Treebank (Xia 1999), which will be equally applicable to the Korean Treebank. The end of the project is not yet foreseen, but the POS Tagging and Bracketing Guidelines are announced as forthcoming. [2001 April 27].
Korean XTAG is an ongoing project at Penn (i.e. the University of Pennsylvania; cf. the Korean NLP at Penn file) to develop a wide-coverage grammar for Korean using the Feature-Based Lexicalized Tree Adjoining Grammar (LTAG) formalism. As its grammar development system, it uses the XTAG system (cf. the XTag Project and Tools), originally created for English and now customized for Korean TAG development. The XTAG system consists of a parser, an X-windows grammar development interface and a POS tagger. The original XTAG system has been modified to incorporate a Korean morphological analyzer, in order to handle the rich inflectional morphology of Korean and to facilitate lexicon development and parsing. More information on the Korean XTAG system can be found in some freely downloadable PS papers.
PosPar is a Korean Syntactic Analyzer using Korean Combinatory Categorial Grammar formalism (including POSTAG) by PosTech Laboratory (cf. also the related paper file in Korean). The N-Best POSPAR99 beta-0.9 demo version (0.01) (binary) (Dec/01/1999) is freely downloadable for non commercial use. There is also an online demo. [2002 February 20].
PosTag is a Korean Morphological Analyzer and POS tagger with generalized unknown morpheme handler by PosTech Laboratory (cf. also the README file in Korean). The AG99 beta-1 demo version (binary) (including 100,000 full vocabulary) (Dec/06/1999) is freely downloadable for non commercial use. There is also an online demo. [2002 February 20].
A POS tagged corpus of about 100,000 morphemes by PosTech Laboratory with POSTAG format. This corpus is freely downloadable for non commercial use. [2002 February 20].
This Korean NLP group (located at San 31, Hyoja-Dong, Pohang, 790-784, Korea) mainly researches Korean TTS using prosody and phonetic analysis, NLP for speech recognition, and Korean morphological analysis and POS tagging. Some of the resources and tools developed at PosTech are available as free or open source for research communities (non-commercial use), such as:
+ POSTAG Korean Corpus, a POS tagged corpus of about 100,000 morphemes;
+ POSTAG, a Morphological Analyzer / POS tagger with generalized unknown morpheme handler;
+ POSPAR, a Syntactic Analyzer using Korean Combinatory Categorial Grammar formalism;
+ POSNIR, a Korean Natural Language Information Retrieval System (search engine, compound noun indexer, NL query processing, including POSTAG and POSPAR); cf. also the README file in Korean and the online demo.
+ POSTTS, a Korean text to speech system which converts general Korean text sentences into their corresponding phoneme sequences (cf. the README file in Korean). [2002 February 20].
http://lexeme.yonsei.ac.kr/lex/resource/lex_research/main.htm
Korean texts: there are eight collections of papers dealing with Korean lexicography. Each paper is downloadable as a PDF file. Obviously all the site is strictly in Korean ... Beware also that downloading may be a nightmare (at least it was so last time I tried - 2001 Apr. 25).
A few freely downloadable old witnesses and texts of the Krio Creole English, mainly from Sierra Leone (1787, 1791, 1815, 1820, 1822, 1830s, 1840s, 1843, 1846); there is also a more recent (1990s) small collection of proverbs both in Guyana CE and in Sierra Leone CE, and an isolated documentation of Krio from Nova Scotia (from 1791); from the Creolist Archives Text Collections. [2001 August 8].
A few freely downloadable scanty old witnesses of Kru PE from Liberia (1819, 1821, 1832); from the Creolist Archives Text Collections. [2001 August 8].
http://www.georgetown.edu/labyrinth/library/latin/latin-lib.html
A rich collection of Latin e-Texts, ranging from Biblical and Liturgical texts (from the Vulgata - English version as well - to Tridentine Latin Mass, with links to Bible browsers, Database of Gregorian Chants, etc.) to Classical and Late Latin literary texts. There is also a collection of Greek Classical texts that influenced the Latin tradition, of Medieval Latin Texts and Translations (c.400-c.1500), and even of Grammatical Texts (Donatus). Good links to many Latin and Medieval resources and a lot of freely downloadable material (though that depends on the links ...): in any case a great site.
The MPCP (University of Maryland Parallel Corpus Project) provides versions of the Bible consistently annotated according to the CES. There are also some freely downloadable PS papers related to this project, mainly by Philip Resnik. Versions already freely available are the Chinese, Danish, English, Finnish, French, Greek, Indonesian, Latin, Spanish, Swahili, Swedish and Vietnamese ones. For more cf. the Parallel Corpora section. [2001 May 1].
An Archive of Literary texts all written in Latin (the Archive pages, not the texts: how nice!). The Bibliotheca Latina (http://www.fh-augsburg.de/~harsch/augusta.html#la) is by far the largest archive, and also has a lot of medieval, humanistic and modern Latin texts. There are also smaller Greek, German, English and French Archives.
All texts are freely readable online but are not intended for download.
This is one of those things you don't know how to classify, but it is a good educational tool indeed. Essentially it is a Latin hypertext reader for De Bello Gallico, I. It is freeware (C 1999 by Michael Cummings) for Windows and PC DOS operating systems that you can download and install on your PC. These programs are designed for the student of Latin who knows some grammar, but who lacks vocabulary. Looking up all the unfamiliar words in the glossary at the back of the book is extremely time-consuming. The Caesar Machine permits the user to access this information without leaving the place in the text where the information is needed. The user can proceed through the text many times more quickly than is possible when stopping to flip pages and search through an alphabetical listing. This efficiency greatly facilitates the comprehension of the text, and with it, the acquisition of vocabulary. Many users will find that vocabulary learned in association with a text which is being rapidly acquired is much more likely to be retained than vocabulary learned by drills and flashcards.
+ Version 2.0 for Windows. This Visual Basic program lets the user scroll through the Latin text of Caesar's Gallic War, Book I (about 8000 wds.). When the user comes to an unrecognized vocabulary word, clicking on the word with the mouse will open or refresh at the bottom of the screen a small window with the dictionary entry for that word (in English). Or at any time the user may want to search the dictionary entries by clicking on "DictEntry" in the "Search" pulldown menu and then typing in a search string. The body of the dictionary and the body of the text may be searched similarly.
+ Original Version 1.2 for DOS. This software puts the Latin text of Caesar's Gallic War, Book I (about 8000 wds.) on the screen, and lets the user move the cursor through the text with the arrow keys. When the user comes to an unrecognized vocabulary word, pressing the 'd' key will open at the bottom of the screen a small window with the dictionary entry for that word (in English). Or at any time the user may want to search the dictionary by pressing the 's' key to open the search window and typing in a search string. Or the user may want to open the dictionary window by pressing the 'd' key and then keep it open for each successive word in the text by repeatedly pressing 'ENTER'.
A collection of patristic texts, mainly in English translation, but with a few Latin originals, and some translations into other languages as well (mainly Russian and Chinese). All texts are freely downloadable with theological markup (ThML) or HTML, plain or zipped.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
The Green Library (or Cactus Library), a project of the Saint-Petersburg School of Religion and Philosophy (SRPh), has by now a few Greek, Russian and French texts. More titles are announced as forthcoming, also in Latin (Abelard), French and Old Slavonian. All titles are freely downloadable in PDF format. For more cf. under the E-Texts section. [2001 May 1].
A small library of interactive hypertexts for free reading and search maintained by Èulogos. All are literary texts, many of them religious (the BRI, Bibliotheca Religiosa). Nine languages are supported so far (Albanian, German, English, Spanish, French, Italian, Latin, Finnish).
The Textlist page of the Kirchenmusik online site (a good and well known resource for music lovers) by Joachim Vogelsänger offers a huge and free collection of texts of Oratorios, Cantatas, Sacred Hymns and the like. Most are in German, and some in English. But there are of course also a few texts in Latin, such as Ave Maris Stella, Stabat Mater, Vexilla Regis, Lamentationes Hieremiae, and so on. All the texts are freely downloadable in simple HTML format. For more details cf. the full file in the E-Texts section. [2001 August 27].
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: there are also 31 Latin texts. For a more detailed description, see in the E-Texts section.
The Patrologia Latina Database is an electronic version of the first edition of Jacques-Paul Migne's Patrologia Latina, published between 1844 and 1855, and the four volumes of indexes published between 1862 and 1865. The text is encoded in SGML and the search interface permits searching by the SGML tags. The database can be searched by single words, truncated terms or phrases, or by using a combination of Boolean operators. The Patrologia is available as a collection of CD-ROMs or as an online database, accessed through the Internet on payment of an annual subscription fee. Conditions must be discussed with a Chadwyck-Healey representative (see this page). For more see the E-Texts section.
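The search modes mentioned above (single words, truncated terms, phrases, Boolean combinations) can be illustrated with a small Python sketch over an in-memory list of passages. This is not the Patrologia Latina Database interface: the function names, the trailing `*` as truncation sign and the `all_of` / `any_of` / `none_of` parameters are hypothetical conventions chosen for the example.

```python
import re


def term_matches(term: str, text: str) -> bool:
    """Match a single word, a truncated term ending in '*', or a phrase."""
    parts = []
    for word in term.lower().split():
        if word.endswith("*"):
            # Truncation: the stem followed by any further word characters.
            parts.append(re.escape(word[:-1]) + r"\w*")
        else:
            parts.append(re.escape(word))
    # A phrase is a sequence of words separated by whitespace.
    pattern = r"\b" + r"\s+".join(parts) + r"\b"
    return re.search(pattern, text.lower()) is not None


def search(passages, all_of=(), any_of=(), none_of=()):
    """Boolean search: AND over all_of, OR over any_of, NOT over none_of."""
    hits = []
    for passage in passages:
        if not all(term_matches(t, passage) for t in all_of):
            continue
        if any_of and not any(term_matches(t, passage) for t in any_of):
            continue
        if any(term_matches(t, passage) for t in none_of):
            continue
        hits.append(passage)
    return hits
```

For instance, `search(passages, all_of=["de*"], none_of=["caritas"])` would keep passages containing a word beginning with "de" but not the word "caritas".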
mirrors: UK (http://perseus.csad.ox.ac.uk/); De (http://perseus.mpiwg-berlin.mpg.de/)
Perseus, a Tufts University project, is a database of Classical Greek and Latin markup-tagged texts you can query online. Texts, images, and maps in the Perseus Digital Library are all interconnected, making it easy for readers to look something up across several texts using a single Lookup Tool. Unfortunately, you can neither download nor read continuously any of their texts.
An austere ftp site from Washington with an archive of Classical Latin text in TeX format (Apuleius, Ausonius, Caesar, Catullus, Cicero, Horatius, Livius, Nepos, Ovidius, Propertius, Prudentius, Sallustius, Tibullus, Vergilius) with separated commentary texts. For more see the E-Texts Section.
This page maintained by Lyle Neff (cf. homepage) is a rich database of online sources of opera libretti. A lot of e-texts (HTML format) are freely available directly from the site; others are only linked to. Besides libretti, secular songs and sacred vocal music are also covered. Languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other less specific musical and linguistic resources. [2001 June 20].
This gold mine of resources for Biblical studies and Semitic philology also provides some links to Latin e-texts of biblical and Hebraic interest (Latin Vulgata, Flavius Josephus, etc.).
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
http://solaris3.ids-mannheim.de/tractor/telri/RIG/rig-01.htm
Latvian texts from the newspaper Rigas Balss in MS Write files. From Artificial Intelligence Laboratory, University of Latvia, Riga, Latvia.
Available under subscription to TRACTOR.
A collection of small freely downloadable historical fragments and witnesses of Caribbean (Leeward Islands) English creole of Antigua (1788, 1810, 1825, 1834, 1850), of Nevis (1802, 1825), of Montserrat (1825) and of St. Kitts (1708, 1718, 1802, 1834), from the Creolist Archives Text Collections. [2001 August 8].
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
http://solaris3.ids-mannheim.de/tractor/telri/KAU/kau-01.htm
Sample texts are taken from a 56 million word Lithuanian Corpus. The corpus is stored in a database, where each file is identified by special fields containing its bibliographic details. This structure enables the user to extract files according to the relevant features. (Publishers' texts have been converted into plain ASCII DOS text). From Centre of Computational Linguistics, University Vytauti Magni, Kaunas, Lithuania.
Available under subscription to TRACTOR.
1.5 million word corpus of written Lithuanian, some translated from other languages, marked up with SGML conformant with the PAROLE guidelines. From Centre of Computational Linguistics, University Vytauti Magni, Kaunas, Lithuania.
Available under subscription to TRACTOR.
These texts (poetry, folklore and papers) are taken from "Līvõd Tekstõd", Rīga, 1991, edited by Valda Šuvcāne, and were transcribed into HTML by Uldis Balodis (email; other address). This anthology was intended to make reading material available to learners of the Livonian language, mainly because written Livonian texts are so scarce. The texts (HTML pages only) are short and sometimes have transcription problems, but are free. [2001 May 7].
A collection of freely downloadable old relics of the Louisiana Creole French, ranging from 1720s, 1731, 1748, 1750s, 1773, 1800s, 1881, to 1902; from the Creolist Archives Text Collections. [2001 August 9].
Two freely downloadable relatively small but recent texts in Macaísta (the Macau Creole Portuguese), both from 1997: Jorge Remedios on Macaísta, Unga Lobo co Unga Cordêro (Unga 'Stória di'Sopo). From the Creolist Archives Text Collections. [2001 August 13].
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
You can freely look for words and contexts occurring in a selection of classical Malay texts - at present over 1.4 million words of Malay text. Cf. the available manual. The project is designed to allow a search target in a standard form to find variant spellings in the texts. The standard spelling adopted is that prescribed by the Dewan Bahasa dan Pustaka Malaysia, as found in its Kamus Dewan (2nd edition, Kuala Lumpur, Dewan Bahasa dan Pustaka, 1984). The only condition of use is the acknowledgement of the project in the footnotes and bibliography of anything you write (if you are using a particular text, also mention its contributor, or its printed edition). [2001 August 4].
+ The paper: I. Proudfoot, Concordances and Classical Malay, "Bijdragen tot de Taal-, Land- en Volkenkunde" vol. 147 (1991), pp.74-95, is freely available as well in HTML form.
This is a paper presenting a Maltese Corpus Linguistics Research Proposal by John Caruana (homepage). The aims of the project are: 1) collection, encoding and mark-up of a Maltese language corpus, 2) morpho-syntactic tagging of the corpus contents, and 3) first steps in the linguistic exploitation of the corpus. The first foreseen corpus ought to be a machine-readable 10 million word electronic text corpus of current Maltese, which will consist of texts of recent origin, drawn from both written sources and transcribed speech. Written and spoken texts will constitute separate sub-corpora, with the written sub-corpus making up the bulk of the whole corpus. There are, however, no links to resources already available. [2001 August 4].
mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Mambila CHILDES corpus is downloadable.
See under Multilingual and Parallel Corpora section for a fuller file.
The largest linguistic page on Manx. There are a dictionary, a grammar, lessons and other resources, all freely downloadable. Contact: Phil Kelly.
This page, from the Department of Asian and Pacific Linguistics - Institute of Cross-Cultural Studies - Tokyo University (http://130.69.201.129/), maintained by Kazuto Matsumura, provides a few free standard Eastern Meadow Mari e-Texts: three 1994 newspaper articles. HTML format.
A single freely downloadable old witness of Mauritian, i.e. Isle de France CF, from 1850; from the Creolist Archives Text Collections. [2001 August 8].
This is a web edition, created at Virginia Library ETC (cf. the Electronic Text Center at the University of Virginia file), of the Chiricahua and Mescalero Apache Texts by Harry Hoijer, originally published by University of Chicago Press, 1938. There are 46 Chiricahua and 9 Mescalero texts, all free but not very download-friendly, because they are displayed in frames: you can, for example, display a bilingual Apache - English version (with or without notes), Apache only, notes only, or the English version with ethnological notes. Correct display requires a special Apache - Navajo font (a Times New Roman supplement), developed at the San Juan School District and freely downloadable from this page. [2001 July 21; rev. 2002 January 25].
Only a few freely downloadable scanty and old witnesses (1707, 1827, 1847, 1872, 1899) of the Miskito Coast English-based creole of Honduras and Nicaragua, from the Creolist Archives Text Collections. [2001 August 8].
A freely downloadable short text in the Cree-French mixed language spoken in North Dakota and Canada, with English translation. It's a story told to Rich Rhodes and Bob Papen at the University of North Dakota, Grand Forks, in the summer of 1986. From the Creolist Archives Text Collections. [2001 August 9].
There are five short narratives in Guerrero Nahuatl with Spanish version side-by-side. Free HTML pages. [2001 May 9].
This page, by SorrentoRadio, collects over 500 texts of Neapolitan songs, from the classics to the lesser known ones. All texts are freely browsable and downloadable in plain HTML format. There is also a Neapolitan Proverbs page (with glosses in Italian) and a short Old Neapolitan - Italian Glossary. [2002 February 23].
A freely downloadable anthology of Negerhollands (that's to say Virgin Islands Caribbean Dutch Creole) stories collected in 1923 (taken from de Josselin de Jong, 1926) from the Creolist Archives Text Collections. [2001 August 8].
"Den elektroniske bokhylla" is a nice page of Nynorsk texts. It is very simple and easy to read and download (in HTML). Contact: Jon Grepstad
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Norwegian has 93 texts. For a more detailed description, see the E-Texts section.
A simple page with versions of the Pater Noster in many Germanic languages (Afrikaans, Alsatian, Bavarian, English, Danish, Dutch, Frisian, German, Gothic, Icelandic, Norn, Norwegian, Old Saxon, Pennsylvania Dutch, Plattdeutsch, Swedish) by Catherine Ball (see her homepage), the webmistress of the Old English Pages. There is also a simple interface that allows the comparison of any two texts. This page was prepared for the use of classes in linguistics, history of the English language, and Old English. [2001 July 13].
The bokmål part of the Oslo Corpus contains about 18.5 million words, while the nynorsk part contains about 3.8 million words. The corpora have been encoded with the CWB (the IMS Corpus Workbench) developed at the Institut für Maschinelle Sprachverarbeitung at the University of Stuttgart. The corpus consists of the texts that were available at the Text Laboratory (Tekstlaboratoriet ved Historisk-filosofisk fakultet, Universitetet i Oslo) in January 1999. It is composed of texts from three genres: fiction (bokmål: 1.7 mill. words; nynorsk: 2.1 mill.), newspapers/magazines (bokmål: 9.6 mill.; nynorsk: 1 mill.), and factual prose (bokmål: 7.1 mill.; nynorsk: 700,000). All fiction comes from ECI (European Corpus Initiative) and Norsk Tekstarkiv (Norwegian Text Archive), Bergen. The texts from newspapers and magazines have been collected by the Text Laboratory with kind permission from the various editorial offices. The factual prose consists mainly of NOU reports (Norwegian Official Reports) and Norwegian laws and regulations. A detailed survey of the texts, with source annotation codes, is given at this page.
The Oslo Corpus of Tagged Norwegian Texts can be queried online and is available to anybody who wants to use it for non-commercial academic research. To obtain permission, and to be given a username and password, send an e-mail to the Text Laboratory with the following information: NAME / ADDRESS / AFFILIATION / suggested USERNAME for the corpus / suggested PASSWORD for the corpus / STATEMENT 1 ("I promise to use the Oslo Corpus of Tagged Norwegian Texts for academic, non-commercial purposes only") / STATEMENT 2 ("I promise not to distribute my password to any person or institution") / STATEMENT 3 ("In any published or unpublished material that has benefitted from use of the corpus, I will make sure that a proper reference to the corpus by its name and Internet-address is included").
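The genre figures quoted for the Oslo Corpus can be cross-checked against the stated totals with a few lines of arithmetic. The sketch below is only an illustration: the dictionary layout and the helper name total_millions are made up for the example, and the numbers are simply the rounded values given in the description above.

```python
# Genre sizes as quoted in the corpus description, in thousands of words.
# The dictionary layout is an assumption made for this illustration.
genre_sizes = {
    "bokmål": {"fiction": 1700, "newspapers/magazines": 9600, "factual prose": 7100},
    "nynorsk": {"fiction": 2100, "newspapers/magazines": 1000, "factual prose": 700},
}


def total_millions(variety: str) -> float:
    """Sum the genre sizes of one written variety and return millions of words."""
    return sum(genre_sizes[variety].values()) / 1000


for variety in genre_sizes:
    print(f"{variety}: {total_millions(variety):.1f} million words")
```

The bokmål genres add up to 18.4 million words, which is consistent with the "about 18.5 million" stated for that part once rounding of the individual figures is allowed for; the nynorsk genres add up to the stated 3.8 million.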
The English-Norwegian Parallel Corpus (ENPC) of the University of Oslo consists of original texts and their translations (English to Norwegian and Norwegian to English). The focus has been on novels and fairly general non-fictional books. In order to include material by a range of authors and translators, the texts of the corpus are limited to text extracts (chunks of 10,000-15,000 words). The coding system used to mark up the ENPC follows the suggestions made by the Text Encoding Initiative (TEI) as presented in Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen & Burnard, 1994). The English part of the ENPC has been tagged for part-of-speech (POS). The tagging was done automatically with the English Constraint Grammar parser (cf. EngCG Parser) developed by Atro Voutilainen et al. The Norwegian part of the corpus will not be tagged, for lack of a Norwegian tagger.
Access to the corpus is to date restricted to researchers and students at the University of Oslo: cf. this page. Only the manual is freely available online.
See under Multilingual and Parallel Corpora section for more information.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
The American and French Research on the Treasury of the French Language (ARTFL) now also includes a Provençal database with 38 texts in the original. Access is heavily restricted. Cf. more under the French page.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
This long freely downloadable text (466 turns of conversation between two women and an occasional male third participant) was transcribed by Armin Schwegler from his own fieldwork tapes recorded in the spring/summer of 1988. The sample recording can be considered representative of informal Palenquero speech, especially that of Palenqueros who have full command of both Spanish and "lengua" (the name given by Palenqueros to their local creole speech). Today (April 1998), the editor says, many younger Palenqueros have no or only very limited speaking knowledge of the creole, though their passive knowledge (comprehension) is sometimes fairly good. Besides the transcription (trilinear, with Spanish translation provided), there are also freely downloadable Real Audio files, streaming or format, and MPEG files of the recording. Definitely a very fine and free resource, hosted by the Creolist Archives Text Collections. Readers interested in obtaining further information on this and similar recordings may wish to contact the author, Prof. Armin Schwegler, directly at Dept. of Spanish and Portuguese, Univ. of California - Irvine, Irvine, CA 92697-5275 (U.S.A.), by fax (949/824-6901) or e-mail. [2001 August 14].
The EMILLE Project is a 3-year EPSRC project at Lancaster University and Sheffield University, designed to build a 63 million word electronic corpus of South Asian languages, especially those spoken in the UK. EMILLE will generate written language corpora of at least 9,000,000 words each for Bengali, Gujarati, Hindi, Panjabi, Singhalese, Tamil and Urdu.
For more information and details cf. under the Multilingual corpora section. [2001 June 17].
Polish Literary Texts in HTML format. You can browse them through pop-up windows. All pages are in Polish.
mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Polish CHILDES corpus is available on the web.
See under Multilingual and Parallel Corpora section for a fuller file.
The Archive of this Newspaper is online. The pages are all in Polish.
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Polish has 133 texts. For a more detailed description, see the E-Texts section.
A 2 million word corpus of modern written Polish newspaper texts in the following formats: SGML; gzipped SGML; plain text, but with SGML entity references for non-Roman characters. From PELCRA, Department of English Language, Lodz University, Poland.
Available under subscription to TRACTOR.
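Plain text that keeps SGML entity references for non-Roman characters has to be decoded before the Polish letters can be searched or counted directly. The following is a minimal sketch, assuming that the entity names in the files coincide with the standard ISO/HTML character entity names known to Python's html module; the entity set actually used in the corpus is not documented here, and the sample string is invented for the illustration.

```python
import html


def decode_entities(text: str) -> str:
    """Replace character entity references such as &lstrok; with the characters they name.

    Assumption: the references use the standard ISO/HTML entity names,
    which html.unescape from the Python standard library understands.
    """
    return html.unescape(text)


# Invented sample, not taken from the corpus: the city name Lodz written
# with entity references for its three non-ASCII letters.
sample = "&Lstrok;&oacute;d&zacute;"
print(decode_entities(sample))
```

If the corpus used a private entity set instead, the same function could be replaced by a lookup in a table built from the corpus documentation.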
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. Free web demos of some of their tools are available, also for Polish. For more details cf. the Tools section.
mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Portuguese CHILDES corpus is available on the web.
See under Multilingual and Parallel Corpora section for a fuller file.
http://cgi.portugues.mct.pt/acesso/
The project Processamento computacional do português has put online some Portuguese corpora that can be freely searched. This service was launched on 23 September 1999 and is still experimental; as of 23 March 2000 there were ten corpora, some of them very small (see the accurate description of each corpus). The corpora were organized and encoded with the Corpus Workbench of the Stuttgart IMS. The Web interface was developed locally by the Processamento project.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
The European Language Newspaper Text corpus is also known as the French Language News Corpus. This corpus includes roughly 100 million words of French, 90 million words of German and 15 million words of Portuguese and has been marked up using SGML.
See under Multilingual and Parallel Corpora section.
4 annual CDROMs with full text.
http://www.cce.ufsc.br/~nupill/literatura/literat.html
A hypertext (HTML) Library of Brazilian Literature Texts. Several Brazilian classic literary works are already available in HTML format for browsing.
It's not clear what their availability is.
http://www.cirp.es/WXN/wxn/frames/meddb.html
MedDb of the "Centro Ramón Piñeiro para a Investigación en Humanidades" is a database providing the complete corpus of Medieval Galego-Portuguese lyric. Online searching is enabled only after registering, but registration is easy and free.
cf. also http://www.cce.ufsc.br/~nupill/literatura/literat.html
NALAMA is a theoretical project developed in association with other universities (University of São Paulo, USP; State University of Campinas, UNICAMP; Federal University of Porto Alegre, UFRGS; Catholic University of Porto Alegre, PUC/SP; Federal University of Florianópolis, UFSC). It aims at the development of systems for natural language processing, seeking alternatives to serial processing architectures. At the moment, a prototype of a syntactic parser for some structures of Brazilian Portuguese is being developed, using the dependency grammar formalism and a multi-agent architecture in which each word of a structure is an independent agent. This prototype is under the responsibility of Raul Waslawick (UFSC). The participants in this project come from different backgrounds: Computer Science, Mathematics, and Linguistics.
These corpora from the Universidade do Minho are small and almost raw, mainly newspaper transcriptions, but they are absolutely free!
+ Calão e idiomáticas - some idiomatic and slang expressions (? lines)
+ DiárioDoMinho1 - use DiarioDoMinho2 instead - 12 days of issues of the Diário do Minho (98/12/22 .. 99/01/05) (24k lines)
+ DiárioDoMinho2 - 74 days of issues of the Diário do Minho (Jan, Feb, Mar 99) (166k lines)
+ NaturaVilela - example sentences used in the Mário Vilela dictionary (17k lines)
+ NaturaPublico91 - two paragraphs from each article of the newspaper PUBLICO, 1991 (64k lines)
+ NaturaPublico92 - two paragraphs from each article of the newspaper PUBLICO, 1992 (188k lines)
+ NaturaPublico93 - two paragraphs from each article of the newspaper PUBLICO, 1993 (175k lines)
+ NaturaPublico94 - two paragraphs from each article of the newspaper PUBLICO, 1994 (181k lines)
+ Proverbios - list of proverbs (700 lines)
Lists corpora, dictionaries, terminological databases, tools and other possible pointers of interest.
This corpus builds on the Portuguese data published previously in the European Language Newspaper Text, and contains the previously published material, as well as more recent material. The data in this corpus come from Agence France Presse from May 13, 1994 through December 31, 1998 (Jun. 27, 1996 - Dec. 31, 1998 previously unpublished by the LDC). The data have been tagged using SGML to identify article boundaries.
Available only from the LDC, through membership or for $400.
(Not free; e-mail or Web trials available).
This project is a first result of an initiative taken by the Portuguese Ministry of Science and Technology to improve the area of computational processing of the Portuguese language. The project is part of the Ministry's aim to grant native speakers of Portuguese easy access to the ever-increasing information society. This site provides a lot of useful information on Portuguese language processing and also online access to some Portuguese Corpora, cf. Corpora do Processamento computacional do português.
Projecto Vercial has the largest Portuguese library of electronic literary texts. All are freely downloadable.
http://www.ime.usp.br/~tycho/corpus/index.html
The Tycho Brahe Parsed Corpus of Historical Portuguese, led by Charlotte Galves of the University of Campinas, is a syntactically annotated corpus which consists of texts written by Portuguese authors born between 1550 and 1850. The annotation work has been done under the direction of Charlotte Galves at the University of Campinas following the annotation scheme designed by Anthony Kroch and Ann Taylor for PPCME2 (the Penn-Helsinki Parsed Corpus of Middle English). Contact: Charlotte Galves.
+ The TBCHP is available to scholars without fee for educational and research purposes via anonymous ftp. It is, however, not in the public domain. The documentation and utilities files are freely accessible, but the texts themselves can only be downloaded after the user sends a filled-out request form to this address. This form can be sent directly via the web. An access password will then be provided by email. You will need the username and password to download the corpus files.
+ The TBCHP is a product of the Rhythmic Patterns Parameter Setting & Language Change project, whose primary goal is to model the relationship between prosody and syntax in the process of language change which led from Classical Portuguese to Modern European Portuguese. Beyond the specific results of the linguistic and mathematical research which will be developed within this project, it will also produce two corpora, the aforesaid TBCHP and
+ The Comparative Tagged Corpus of Spoken Modern European Portuguese and Brazilian Portuguese, consisting of categorized recorded registers from speakers of both dialects. This corpus isn't yet released.
Brazilian free Electronic Literary Text Archive. The catalogue lists a lot of texts, all freely downloadable, mainly as RTF or PDF files.
VISL (Visual Interactive Syntax Learning, Department of Language and Communication, University of Southern Denmark - Odense) provides online queries to pure-text corpora in Danish, German, English and Spanish and to tagged Portuguese material. The service is for members only (see at this page). For more details cf. the full file in the Corpora and Corpus Linguistics section.
The WES Section of the University of Virginia Library provides a list of text resources for 17 European literatures: Catalan, Danish, Dutch, Finnish, French, Galego-Portuguese, German, Greek, Irish, Italian, Latin, Norwegian, Old Norse & Icelandic, Occitan, Portuguese, Spanish and Swedish.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. Free web demos of some of their tools are available, also for Portuguese. For more details cf. the Tools section.
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Romanian has 20 texts. For a more detailed description, see the E-Texts section.
Myriobiblos, The E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to modern Greek; there are also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. under the E-Texts section. [2001 May 1].
http://solaris3.ids-mannheim.de/tractor/telri/BUC/buc-01.htm
Cf. under Multilingual and Parallel Corpora.
Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/BUC/buc-02.htm
Plato's Republic (CES level 3 encoded). From Center for Advanced Research in Machine Learning, NLP, and Cognitive Modelling, Academy of Sciences, Bucharest, Romania.
Available under subscription to TRACTOR.
mirroring sites at Antwerp (Belgium) and Chukyo (Japan).
The CHILDES system provides free tools for studying child language data and conversational interactions. These tools include a database of transcripts, programs for computer analysis of transcripts, methods for linguistic coding, and systems for linking transcripts to digitized audio and video. CHILDES corpora cover 23 European and non-European languages, although the bulk of the collection is English.
The Russian CHILDES corpus is available on the web.
There is also an Italian component in the Multilingual Collection (English UK and USA, German, Hebrew, Italian, Spanish, Swedish and Turkish) made from narratives elicited using Mercer Mayer's "frog story" picture book.
See under Multilingual and Parallel Corpora section for a fuller file.
A collection of patristic texts, mainly in English translation, but with a few Latin originals and some translations into other languages as well (mainly Russian and Chinese). All texts are freely downloadable with theological markup (ThML) or HTML, plain or zipped.
http://www.elsnet.org/resources/eciCorpus.html
A 98 million word corpus, covering most of the major European languages, as well as many others (viz. Albanian, Bulgarian, Chinese, Czech, Dutch, English, Estonian, French, Gaelic, German, Greek, Italian, Japanese, Latin, Lithuanian, Malay, Spanish, Danish, Uzbek, Norwegian, Portuguese, Russian, Serbian, Swedish, Turkish, Tibetan).
See under ECI/MCI 1 Corpus in the Multilingual and Parallel Corpora section.
This Russian-only site by Evgenij Peskin is a good source for free Russian e-texts (mostly Pushkin, Chekhov, Blok, Dostoevskij, Gogol’) in sober HTML format. Besides texts there are also pages dealing with Russian authors. [2001 June 20].
http://solaris3.ids-mannheim.de/tractor/telri/MIN/min-03.htm
German-Russian dictionary of computers; 43,500 words/combinations; from Minsk Linguistic University, Belarus.
Available under subscription to TRACTOR.
http://solaris3.ids-mannheim.de/tractor/telri/MIN/min-02.htm
German-Russian dictionary of computers; 40,200 words/combinations; from Minsk Linguistic University, Belarus.
Available under subscription to TRACTOR.
ftp://infomeister.osc.edu/pub/central_eastern_europe/russian/corpora
George Fowler's FTP site should be a source for several corpora of literary Russian texts, according to Nancy Smith: cf. this page. However, I have never succeeded in logging in ...
The Green Library (or Cactus Library), a project of the Saint-Petersburg School of Religion and Philosophy (SRPh), has by now a few Greek (Aristotle’s De Anima, Plato’s De Republica), Russian (Dostoevskij, Leskov, Turgenev, Derzhavin) and French (Turgenev) texts. More titles are announced as forthcoming, also in Latin (Abelard), French (Casanova) and Old Slavonian (Bible). All titles are freely downloadable in PDF format. For more cf. under the E-Texts section. [2001 May 1].
A rich library of Russian e-texts (1,680 MB of data as of 28 July 2000), online since 1994. Texts range from literary to technical ones (e.g. Unix materials), and are all freely downloadable as TXT/HTML in Cyrillic encodings (koi8-ru, windows-1251). A great site! This library, which is maintained by Maksim Eugenievich Moshkow (cf. this home page), also has a lot of mirror sites, listed in http://lib.ru/~moshkow/.
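Because Russian e-texts of this period circulate in more than one 8-bit Cyrillic encoding, a downloaded file often has to be recoded before it displays correctly. The lines below are a minimal sketch rather than a description of how any of the sites listed here work. The function name recode and the sample string are invented for the illustration, and KOI8-R is used as a stand-in because Python's standard library ships a koi8_r codec but no koi8-ru codec.

```python
def recode(data: bytes, source: str = "koi8_r", target: str = "cp1251") -> bytes:
    """Convert Cyrillic text bytes from one 8-bit encoding to another.

    The text is decoded to Unicode first and then encoded again, so any
    character present in both code pages survives the conversion.
    """
    return data.decode(source).encode(target)


# Invented sample: a surname encoded as KOI8-R, as it might arrive from a download.
sample = "Пушкин"
koi8_bytes = sample.encode("koi8_r")

# The same text as Windows-1251 bytes: different byte values, same characters.
cp1251_bytes = recode(koi8_bytes)
print(cp1251_bytes.decode("cp1251"))
```

Decoding with the wrong codec does not raise an error for most byte values; it silently yields the wrong letters, so it is worth checking a known word after conversion.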
Various literary works.
This page contains the texts (all freely downloadable HTML) of 1842 poets and 1217 composers in 22 different languages: Russian has 818 texts (incl. a few Ukrainian). For a more detailed description, see the E-Texts section.
Myriobiblos, The E-text Library of the Church of Greece, provides a lot of free HTML e-texts (you can browse and save them) from Classical to modern Greek; there are also a smaller number of texts (mainly translations) in Bulgarian, English, French, German, Italian, Romanian and Russian. For more cf. under the E-Texts section. [2001 May 1].
Opera e-Libretto (Collection Ulric Voyer) is a collection of 220 free e-texts of opera libretti. Displayed libretti are in Russian (Rimsky-Korsakov, Mussorgskij), mainly in English-style transcription (only for Bojarynja Vera Šeloga and May Night is a KOI-8 version available), Italian, French, English, German and Danish. All texts are in HTML, usually split into several files according to act divisions. For a more detailed file cf. under the E-Texts section. [2001 June].
This page maintained by Lyle Neff (cf. homepage) is a rich database of the online sources of opera libretti. A lot of e-texts (HTML format) are freely available directly from the site; others are only linked to. Besides libretti, secular songs and sacred vocal music are also dealt with. Languages covered are Italian, French, English, German, Russian, Spanish (zarzuelas), Latin (sacred vocal music) and Jewish (songs). There are also links to other less specific musical and linguistic resources. [2001 June 20].
Russian Corpora in Tübingen, a part of project B1 in SFB 441, aim to provide access to Russian text corpora for on-line search. Now they have already made available the Uppsala Corpus of modern Russian texts and the continually growing Russian Interview Corpus. In the future, they intend to include further corpora with annotated texts. [2001 April 27].
+ Online queries can be freely made in Cyrillic KOI8-R or Windows 1251, or even in Latin transliteration.
This growing corpus of Russian interview texts is collected and annotated by staff from project B1 (Anja Gattnar, Sebastian Bücking and Jennifer Haberhauer). The interviews are taken from the following free online published Russian newspapers: Argumenty i Fakty, Argumenty i Fakty Vladivostok, Art Peterburga, Ogonek, Otdyxaj, Psixologicheskaja Gazeta, Pjat' Uglov, Ptchela, Segodnja, Strannik, Vasha Gazeta, Vedomosti, Vestnik. The corpus of interview texts includes interviews from 1996 until now. The topics covered by the texts are 'politics and society', economy, music, literature, lifestyle and sports. [2001 April 27].
+ An online queryable version is freely available at the Russian Corpora in Tübingen page.
http://solaris3.ids-mannheim.de/tractor/telri/MIN/min-01.htm
Database: Scientific Russian texts on Linguistics (MS Word files). From Minsk Linguistic University, Belarus.
Available under subscription to TRACTOR.
The Uppsala Corpus consists of some 600 Russian texts with a total of one million running words (word tokens), equally divided between informative and literary prose. The informative texts are from between 1985 and 1989, while the literary texts, whose vocabulary does not date as quickly, cover a longer period, 1960-88. The corpus does not include poetry or drama. Within the given framework, considerable effort has been made to ensure as representative and varied a corpus as possible. The informative texts are drawn from 25 different subject areas. The literary half of the corpus comprises work by 40 authors. [2001 April 27].
+ The following book was based on this corpus: Lönngren, Lennart (ed.), Chastotnyj slovar' sovremennogo russkogo jazyka (A Frequency Dictionary of Modern Russian. With a Summary in English.), Uppsala, 1993, Acta Universitatis Upsaliensis, Studia Slavica Upsaliensia 32, 188 pp. (ISBN 91-554-3134-8). Cf. the English abstract online.
+ An online queryable version is freely available at the Russian Corpora in Tübingen page.
http://www.rxrc.xerox.com/research/mltt/
The XRCE (home page) MLTT team creates basic tools for linguistic analysis, e.g. morphological analysers, parsing and generation platforms and corpus analysis tools. These tools are used to develop descriptions of various languages and the relation between them. Free web demos of some of their tools are available, also for Russian. For more details cf. the Tools section.