Tasks
- EU parallel data (Europarl, EUR-Lex, TMX, ...)
- Machine translation
- http://www.mendelssohninscotland.com/why-scotland Check the article and add missing info
List of languages by processing properties and tools
- CAP: language distinguishes lower and upper case
- SEG: segmentation tool available
- LEM: lemmatization tool available
- MOR: morphological analysis tool available
- SYL: syllabification tool available
- POS: Part-of-Speech tagger available
- UD: Universal Dependencies (POS and dependency) annotation available
- ACC: language has accents
- NACC: a form without accents is in use
- Script: script used to write the language
- DIR: left-to-right or right-to-left writing system
- Country: where language is used as an official / unofficial language
Language | CAP | SEG | LEM | MOR | SYL | POS | UD | ACC | NACC | Script | DIR | Country |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Czech | yes | yes | yes | yes | yes[1] | yes | yes | yes | yes | Latin | LTR | Czech Republic |
English | yes | yes | yes | yes | yes | yes | no | yes | yes | Latin | LTR | many |
Norwegian | yes | yes[2] | yes[2] | yes[2] | yes[1] | yes | no | yes | yes | Latin | LTR | Norway |
Amharic | no | yes[3] | yes[4] | yes[5] | ? | yes | yes | no | - | Ethiopic | LTR | Ethiopia |
Oromo | yes | yes[3] | yes[6] | yes[5] | ? | yes[7] | no | no | - | Latin | LTR | Kenya, Ethiopia |
Somali | yes | ? | yes[8] | yes[8] | ? | yes | no | no | - | Latin | LTR | Djibouti, Ethiopia, Kenya, Somalia |
Tigrinya | no | yes[3] | yes[9] | yes[5] | ? | yes | no | no | - | Ethiopic | LTR | Eritrea, Ethiopia |
- [1] http://www.ushuaia.pl/hyphen/
- [2] http://www.lrec-conf.org/proceedings/lrec2014/pdf/801_Paper.pdf
- [3] https://devadorner.northwestern.edu/maserver/wordtokenizer.html
- [4] http://www.aclweb.org/anthology/W07-0814
- [5] http://www.cs.indiana.edu/~gasser/Research/software.html
- [6] http://www.cscjournals.org/manuscript/Journals/IJCL/volume1/Issue2/IJCL-6.pdf
- [7] https://thesai.org/Downloads/SpecialIssueNo3/Paper%201-Parts%20of%20Speech%20Tagging%20for%20Afaan%20Oromo.pdf
- [8] https://ryantxanson.com/blog/somali-status
- [9] http://www.aclweb.org/anthology/C12-3043
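The property table above can also be queried programmatically. The following is a minimal sketch, assuming the table is kept as pipe-separated text; the helper names `parse_table` and `languages_with` are hypothetical and exist only for this illustration. Footnote markers such as `[1]` are stripped before comparison.

```python
import re

# Rows copied from the table above (trailing pipes omitted).
TABLE = """\
Language | CAP | SEG | LEM | MOR | SYL | POS | UD | ACC | NACC | Script | DIR | Country
Czech | yes | yes | yes | yes | yes[1] | yes | yes | yes | yes | Latin | LTR | Czech Republic
English | yes | yes | yes | yes | yes | yes | no | yes | yes | Latin | LTR | many
Norwegian | yes | yes[2] | yes[2] | yes[2] | yes[1] | yes | no | yes | yes | Latin | LTR | Norway
Amharic | no | yes[3] | yes[4] | yes[5] | ? | yes | yes | no | - | Ethiopic | LTR | Ethiopia
Oromo | yes | yes[3] | yes[6] | yes[5] | ? | yes[7] | no | no | - | Latin | LTR | Kenya, Ethiopia
Somali | yes | ? | yes[8] | yes[8] | ? | yes | no | no | - | Latin | LTR | Djibouti, Ethiopia, Kenya, Somalia
Tigrinya | no | yes[3] | yes[9] | yes[5] | ? | yes | no | no | - | Ethiopic | LTR | Eritrea, Ethiopia
"""


def parse_table(text):
    """Turn the pipe-separated table into a list of dicts keyed by column name."""
    lines = [ln.strip().rstrip("|") for ln in text.strip().splitlines() if ln.strip()]
    header = [cell.strip() for cell in lines[0].split("|")]
    rows = []
    for line in lines[1:]:
        # Drop footnote markers like [3] so that "yes[3]" compares equal to "yes".
        cells = [re.sub(r"\[\d+\]", "", cell).strip() for cell in line.split("|")]
        rows.append(dict(zip(header, cells)))
    return rows


def languages_with(rows, **wanted):
    """Return names of languages whose columns equal all the requested values."""
    return [r["Language"] for r in rows
            if all(r[col] == value for col, value in wanted.items())]


rows = parse_table(TABLE)
print(languages_with(rows, UD="yes"))
print(languages_with(rows, Script="Ethiopic", POS="yes"))
```

Cells marked `?` or `-` are kept as literal strings, so unknown values can be selected explicitly, for example `languages_with(rows, SEG="?")`.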
Natural language processing Centre
The NLP Centre is a research centre at Masaryk University in Brno, Czech Republic. It focuses on several topics in natural language processing, computational linguistics and corpus linguistics, namely
- computational lexicography: developing lexicographical software, building lexicons,
- corpus linguistics: developing the corpus manager Sketch Engine, building very large web corpora in many languages,
- syntactic analysis: developing various parsers for Czech,
- semantic and logic analysis: TIL, WordNets.
The director of the Centre is Karel Pala.
Tools and resources developed in the centre
- Sketch Engine - corpus manager tool
- VerbaLex - valency lexicon of Czech verbs
- Czech WordNet - semantic lexicon of Czech
- SET - parser of Czech
- Synt - chart parser of Czech
- DEB - dictionary editing and browsing platform
- Corpus Pattern Analysis - a corpus-based method for discovering senses of English verbs
- Jazyková příručka
Collaboration
The centre participates in various international projects together with other research institutions and departments, mainly with
- University of Wolverhampton (Patrick Hanks),
- Faculty of Arts, Masaryk University (James Thomas, Dana Hlaváčková),
- Khokhlova St. Petersburg
Important publications
Achievements, prizes
Projects
- BalkaNet, VerbaLex
Comparison of corpus managers
This is a comparison of corpus managers that process large corpora. The comparison was carried out on the British National Corpus (a 100-million-word text corpus of written and spoken English).[1]
Corpus managers
- BNCweb[2] – a web-based interface to the British National Corpus
- BYU-BNC[3] – a website, created at Brigham Young University, for searching the British National Corpus and other corpora
- Sketch Engine[4][5] – text corpus management and analysis software
Features of corpus managers
Feature | BYU[note 1] | Sketch Engine | CQP/BNCweb |
---|---|---|---|
Basic queries | |||
word | Yes | Yes | Yes |
phrase | Yes | Yes | Yes |
wildcard | Yes | Yes | Yes |
lemma | Yes | Yes | Yes |
part of speech | Yes | Yes | Yes |
combine any of above | Yes | Yes | Yes |
Visualization | |||
frequency of each matching string | Yes | Yes | Yes |
frequency of each matching string, in each of several sections | Yes | No | No |
overall frequency for all matching forms, in different sections | Yes | No | No |
Collocates | |||
basic collocates search | Yes | Yes | Yes |
sort by Mutual Information | Yes | Yes | Yes |
limit collocate by part of speech | Yes | Yes | Yes |
find specific collocate(s) near node word(s) | Yes | Yes | Yes |
Word comparisons | |||
basic (e.g. collocates of small vs. little, or men and women) | Yes | Yes | No |
Integrated synonyms | |||
basic: search by synonyms | Yes | No | No |
advanced: include synonyms as part of another query | Yes | No | No |
see frequency of synonyms in different sections (e.g. by genre or over time) | Yes | No | No |
compare frequency of synonyms in different sections | Yes | No | No |
see all collocates for a much larger list of words (e.g. all synonyms of large) | Yes | No | No |
"synonym chains": explore web of related words (click on [S] in the entries) | Yes | No | No |
Customized / personalized lists | |||
create lists of words and re-use them as part of query syntax | Yes | No | No |
Limiting by sections of corpus | |||
basic (e.g. collocates of strong in academic journals) | Yes | Yes | Yes |
compare frequencies in different sections (e.g. ADJ in ACAD-Medicine vs ACAD) | Yes | No | No |
compare collocates in different sections (e.g. chair in spoken vs. academic) | Yes | No | No |
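The "sort by Mutual Information" feature in the table ranks collocates by how much more often they co-occur with the node word than chance would predict. The source does not say which exact formula each tool uses, so the sketch below assumes the standard pointwise mutual information score, MI = log2(f(x,y) · N / (f(x) · f(y))). The function names and the counts in the usage example are hypothetical.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) of a node word x and a collocate y.

    f_xy: co-occurrence count, f_x and f_y: individual counts, n: corpus size.
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(f_xy * n / (f_x * f_y))


def rank_collocates(node_freq, candidates, corpus_size):
    """Sort collocates by MI, highest first.

    candidates maps each collocate to a (co-occurrence count, own count) pair.
    """
    scored = [
        (word, mutual_information(f_xy, node_freq, f_y, corpus_size))
        for word, (f_xy, f_y) in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical counts for a node word in a 100-million-word corpus.
candidates = {"the": (500, 6_000_000), "tea": (30, 300)}
for word, score in rank_collocates(1000, candidates, 100_000_000):
    print(f"{word}\t{score:.2f}")
```

The example shows why the sort is useful: a very frequent word such as "the" co-occurs with almost everything and scores low, while a rarer word that appears mostly near the node word scores high.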
Speed of corpus managers
The three architectures are equally fast for single word forms, lemmas and collocates: most queries of this type take a second or less. The differences appear when searching for strings of words.
String of text | BYU | Sketch Engine | CQP/BNCweb[note 2] |
---|---|---|---|
the [adj] thing (e.g. the best thing) | 0.9 | 1.8 + 0.9[note 3] | 10.6 + 3.1 |
[pron] had better [verb] [pron] (e.g. she had better warn him) | 1.1 | 7.5 + 0.8[note 4] | 15.0 + 0.8 |
[conj] [pron] [be] like , (e.g. and she was like ,) | 1.3 | 5.7 + 0.8[note 5] | 19.0 + 0.8 |
- [note 1] Times for Sketch Engine and BNCweb were reduced by about 20% to account for network latency.
- [note 2] The CQL query in BNCweb is the same as [x]–[xx] below, with [tag=...] changed to [pos=...].
- [note 3] search by CQL: [word="the"] [tag="AJ."] [word="thing"]
- [note 4] search by CQL: [tag="PN."] [word="had"] [word="better"] [tag="V.."] [tag="PN."]
- [note 5] search by CQL: [tag="C.."] [tag="PN."] [lemma="be"] [word="like"] [word=","]
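To make the CQL queries in the notes concrete, here is a minimal sketch of how such a query could be matched against a tagged token sequence. It is not the CQP or Sketch Engine implementation: it assumes a tiny subset of CQL (one `attribute="regex"` test per token, no repetition or alternation), and the names `parse_cql` and `find_matches` as well as the tagged sample sentence are hypothetical.

```python
import re

# One bracketed token constraint, e.g. [tag="AJ."]
TOKEN_RE = re.compile(r'\[(\w+)="([^"]*)"\]')


def parse_cql(query):
    """Parse a query into a list of (attribute, compiled regex) pairs."""
    constraints = [(attr, re.compile(pattern))
                   for attr, pattern in TOKEN_RE.findall(query)]
    if not constraints:
        raise ValueError("no token constraints found in query")
    return constraints


def find_matches(tokens, query):
    """Return start positions where consecutive tokens satisfy the whole query.

    Each regex must match the complete attribute value, so "AJ." matches
    "AJS" or "AJ0" but not "AJ" or "AJ0-NN1".
    """
    constraints = parse_cql(query)
    width = len(constraints)
    hits = []
    for start in range(len(tokens) - width + 1):
        window = tokens[start:start + width]
        if all(rx.fullmatch(tok.get(attr, ""))
               for (attr, rx), tok in zip(constraints, window)):
            hits.append(start)
    return hits


# Hypothetical tagged sentence; the tags are for illustration only.
sentence = [
    {"word": "It", "lemma": "it", "tag": "PNP"},
    {"word": "was", "lemma": "be", "tag": "VBD"},
    {"word": "the", "lemma": "the", "tag": "AT0"},
    {"word": "best", "lemma": "good", "tag": "AJS"},
    {"word": "thing", "lemma": "thing", "tag": "NN1"},
]

print(find_matches(sentence, '[word="the"] [tag="AJ."] [word="thing"]'))
```

A linear scan like this is only meant to show what a query means. Corpus managers answer such queries from indexes rather than by scanning every token, which is where the timing differences in the table come from.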
References
- [1] Comparison on the page corpus.byu.edu (at Brigham Young University)
- [2] Interface to the British National Corpus; more about the British National Corpus
- [3] BYU-BNC: British National Corpus interface
- [4] The Sketch Engine homepage
- [5] Concordancers, Search Engines, Text-analysis Tools, a list on the University of Wollongong website
External links
- Corpora from BYU (Brigham Young University)
- Sketch Engine website
- UCREL Corpus Research Seminar, Lancaster University