Tasks

EU parallel data (Europarl, EUR-Lex, TMX, ...)
Machine translation
http://www.mendelssohninscotland.com/why-scotland Check the article and add missing info

List of languages by processing properties and tools

CAP: language distinguishes lower and upper case
SEG: segmentation tool available
LEM: lemmatization tool available
MOR: morphological analysis tool available
SYL: syllabification tool available
POS: part-of-speech tagger available
UD: Universal Dependencies (POS tags and dependencies) available
ACC: language has accents
NACC: form without accents exists
Script: script the language is written in
DIR: writing direction, left-to-right (LTR) or right-to-left (RTL)
Country: where the language is used as an official or unofficial language

Language	CAP	SEG	LEM	MOR	SYL	POS	UD	ACC	NACC	Script	DIR	Country
Czech	yes	yes	yes	yes	yes^[1]	yes	yes	yes	yes	Latin	LTR	Czech Republic
English	yes	yes	yes	yes	yes	yes	no	yes	yes	Latin	LTR	many
Norwegian	yes	yes^[2]	yes^[2]	yes^[2]	yes^[1]	yes	no	yes	yes	Latin	LTR	Norway
Amharic	no	yes^[3]	yes^[4]	yes^[5]	?	yes	yes	no	-	Ethiopic	LTR	Ethiopia
Oromo	yes	yes^[3]	yes^[6]	yes^[5]	?	yes^[7]	no	no	-	Latin	LTR	Kenya, Ethiopia
Somali	yes	?	yes^[8]	yes^[8]	?	yes	no	no	-	Latin	LTR	Djibouti, Ethiopia, Kenya, Somalia
Tigrinya	no	yes^[3]	yes^[9]	yes^[5]	?	yes	no	no	-	Ethiopic	LTR	Eritrea, Ethiopia
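
The table above can be queried programmatically. The following is a minimal sketch, assuming the rows are kept as tab-separated text exactly as shown; the helper names `load_rows` and `languages_with` are hypothetical and only illustrate the idea. Footnote markers such as ^[3] are stripped before values are compared.

```python
import csv
import io
import re

# Three rows copied from the table above (tab-separated, footnote markers kept).
TABLE = (
    "Language\tCAP\tSEG\tLEM\tMOR\tSYL\tPOS\tUD\tACC\tNACC\tScript\tDIR\tCountry\n"
    "Czech\tyes\tyes\tyes\tyes\tyes^[1]\tyes\tyes\tyes\tyes\tLatin\tLTR\tCzech Republic\n"
    "Amharic\tno\tyes^[3]\tyes^[4]\tyes^[5]\t?\tyes\tyes\tno\t-\tEthiopic\tLTR\tEthiopia\n"
    "Somali\tyes\t?\tyes^[8]\tyes^[8]\t?\tyes\tno\tno\t-\tLatin\tLTR\t"
    "Djibouti, Ethiopia, Kenya, Somalia\n"
)

# Matches footnote markers such as ^[3].
FOOTNOTE = re.compile(r"\^\[\d+\]")


def load_rows(text):
    """Parse the tab-separated table into dicts, dropping footnote markers."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [{key: FOOTNOTE.sub("", value) for key, value in row.items()}
            for row in reader]


def languages_with(rows, *properties):
    """Return languages whose value is 'yes' for every listed property."""
    return [row["Language"] for row in rows
            if all(row[prop] == "yes" for prop in properties)]


rows = load_rows(TABLE)
print(languages_with(rows, "SEG", "LEM"))
```

Unknown values ("?") and not-applicable values ("-") are treated as "not yes" here, which is a design choice of this sketch rather than something the table itself prescribes.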

Natural language processing Centre

The NLP Centre is a research centre at Masaryk University in Brno, Czech Republic. It focuses on several topics in natural language processing, computational linguistics and corpus linguistics, namely

computational lexicography: development of lexicographical software, building lexicons,
corpus linguistics: developing the corpus manager Sketch Engine, building very large web corpora in many languages,
syntactic analysis: developing various parsers for Czech,
semantic and logic analysis: TIL, WordNets.

The director of the Centre is Karel Pala.

Tools and resources developed in the centre

Sketch Engine - corpus manager tool
VerbaLex - valency lexicon of Czech verbs
Czech WordNet - semantic lexicon of Czech
SET - parser of Czech
Synt - chart parser of Czech
DEB - dictionary editing and browsing platform
Corpus Pattern Analysis - a corpus-based method for discovering senses of English verbs
Jazyková příručka

Collaboration

The centre participates in various international projects together with other research institutions and departments, mainly with

University of Wolverhampton (Patrick Hanks),
Faculty of Arts, Masaryk University (James Thomas, Dana Hlaváčková),
Khokhlova St. Petersburg

Important publications

Achievements, prizes

Projects

BalkaNet, VerbaLex

Comparison of corpus managers

This is a comparison of corpus managers that process large corpora. The comparison was carried out on the British National Corpus (a 100-million-word text corpus of written and spoken English).^[1]

Corpus managers

BNCweb^[2] – a web-based interface to the British National Corpus
BYU-BNC^[3] – a website, created at Brigham Young University, for searching the British National Corpus and other corpora
Sketch Engine^[4]^[5] – a text corpus management and analysis software

Features of corpus managers

Feature	BYU^{[note 1]}	Sketch Engine	CQP/BNCweb
Basic queries
word	Yes	Yes	Yes
phrase	Yes	Yes	Yes
wildcard	Yes	Yes	Yes
lemma	Yes	Yes	Yes
part of speech	Yes	Yes	Yes
combine any of above	Yes	Yes	Yes
Visualization
frequency of each matching string	Yes	Yes	Yes
frequency of each matching string, in each of several sections	Yes	No	No
overall frequency for all matching forms, in different sections	Yes	No	No
Collocates
basic collocates search	Yes	Yes	Yes
sort by Mutual Information	Yes	Yes	Yes
limit collocate by part of speech	Yes	Yes	Yes
find specific collocate(s) near node word(s)	Yes	Yes	Yes
Word comparisons
basic (e.g. collocates of small vs. little, or men and women)	Yes	Yes	No
Integrated synonyms
basic: search by synonyms	Yes	No	No
advanced: include synonyms as part of another query	Yes	No	No
see frequency of synonyms in different sections (e.g. by genre or over time)	Yes	No	No
compare frequency of synonyms in different sections	Yes	No	No
see all collocates for a much larger list of words (e.g. all synonyms of large)	Yes	No	No
"synonym chains": explore web of related words (click on [S] in the entries)	Yes	No	No
Customized / personalized lists
create lists of words and re-use them as part of query syntax	Yes	No	No
Limiting by sections of corpus
basic (e.g. collocates of strong in academic journals)	Yes	Yes	Yes
compare frequencies in different sections (e.g. ADJ in ACAD-Medicine vs ACAD)	Yes	No	No
compare collocates in different sections (e.g. chair in spoken vs. academic)	Yes	No	No

Speed of corpus managers

The three architectures are equally fast for single word forms, lemmas and collocates: most queries of this type take a second or less. Differences appear when searching for strings of words.

String of text	BYU	Sketch Engine	CQP/BNCweb^{[note 2]}
the [adj] thing (e.g. the best thing)	0.9	1.8 + 0.9^{[note 3]}	10.6 + 3.1
[pron] had better [verb] [pron] (e.g. she had better warn him)	1.1	7.5 + 0.8^{[note 4]}	15.0 + 0.8
[conj] [pron] [be] like , (e.g. and she was like ,)	1.3	5.7 + 0.8^{[note 5]}	19.0 + 0.8
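
To compare the systems on one scale, the two-part figures can be collapsed into a single number per query. The sketch below rests on two assumptions that the table does not state: that the values are seconds, and that the two components of a figure such as "1.8 + 0.9" can simply be added. The names `TIMINGS`, `total` and `slowdown` are hypothetical.

```python
# Timings copied from the table above; units are assumed to be seconds.
# Two-part figures are stored as tuples; adding the parts is an assumption.
TIMINGS = {
    "the [adj] thing": {
        "BYU": (0.9,),
        "Sketch Engine": (1.8, 0.9),
        "CQP/BNCweb": (10.6, 3.1),
    },
    "[pron] had better [verb] [pron]": {
        "BYU": (1.1,),
        "Sketch Engine": (7.5, 0.8),
        "CQP/BNCweb": (15.0, 0.8),
    },
}


def total(parts):
    """Sum the components of one timing, rounded to one decimal place."""
    return round(sum(parts), 1)


def slowdown(query, system, baseline="BYU"):
    """Ratio of a system's total time to the baseline's total time."""
    row = TIMINGS[query]
    return round(total(row[system]) / total(row[baseline]), 1)


for query in TIMINGS:
    for system in ("Sketch Engine", "CQP/BNCweb"):
        print(query, system, slowdown(query, system))
```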

^ Times for Sketch Engine and BNCweb were reduced by about 20% to account for network latency.
^ The CQL query in BNCweb is the same as in the three notes below, with [tag=...] replaced by [pos=...]
^ search by CQL: [word="the"] [tag="AJ."] [word="thing"]
^ search by CQL: [tag="PN."] [tag="PN."] [word="had"] [word="better"] [tag="V.."] [tag="PN."]
^ search by CQL: [tag="C.."] [tag="PN."] [lemma="be"] [word="like"] [word=","]
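
The CQL queries in the notes above are sequences of `[attribute="regex"]` conditions, one per token. The following is a minimal sketch of how such a query could be matched against a tagged sentence, and of the tag-to-pos renaming mentioned for BNCweb. It is not how Sketch Engine or CQP evaluate queries and it covers only this simple sequence form; the function names and the sample sentence with its tags are illustrative assumptions.

```python
import re

# One CQL token condition of the simple form [attr="regex"].
CONDITION = re.compile(r'\[(\w+)="([^"]*)"\]')


def parse_cql(query):
    """Turn '[word="the"] [tag="AJ."]' into [('word', 'the'), ('tag', 'AJ.')]."""
    return CONDITION.findall(query)


def rename_attribute(query, old="tag", new="pos"):
    """Rewrite e.g. [tag=...] to [pos=...], leaving the values untouched."""
    return re.sub(r"\[" + re.escape(old) + "=", "[" + new + "=", query)


def find_matches(query, tokens):
    """Return every token span in which each condition matches in order."""
    pattern = parse_cql(query)
    if not pattern:
        return []
    hits = []
    for start in range(len(tokens) - len(pattern) + 1):
        window = tokens[start:start + len(pattern)]
        # Each value is a regex that must match the whole attribute value.
        if all(re.fullmatch(value, token.get(attr, ""))
               for (attr, value), token in zip(pattern, window)):
            hits.append(" ".join(token["word"] for token in window))
    return hits


# Illustrative sentence; the tags are assumed for the example.
sentence = [
    {"word": "the", "tag": "AT0"},
    {"word": "best", "tag": "AJS"},
    {"word": "thing", "tag": "NN1"},
    {"word": "is", "tag": "VBZ"},
]
query = '[word="the"] [tag="AJ."] [word="thing"]'
print(find_matches(query, sentence))
print(rename_attribute(query))
```

The dot in "AJ." is an ordinary regular-expression wildcard, so one condition covers every three-character tag that starts with AJ.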

References

^ Comparison on the page corpus.byu.edu (at Brigham Young University)
^ Interface to the British National Corpus; more about the British National Corpus
^ BYU-BNC: BRITISH NATIONAL CORPUS interface
^ The Sketch Engine homepage
^ Concordancers, Search Engines, Text-analysis Tools: a list on the University of Wollongong website

External links

Category:Software comparisons Category:Applied linguistics Category:Linguistic research Category:Corpus linguistics Category:Database management systems Category:Lexicography

[ushuaia-1] http://www.ushuaia.pl/hyphen/

[lrec_2014-2] http://www.lrec-conf.org/proceedings/lrec2014/pdf/801_Paper.pdf

[wordtokenizer-3] https://devadorner.northwestern.edu/maserver/wordtokenizer.html

[4] http://www.aclweb.org/anthology/W07-0814

[gasser-5] http://www.cs.indiana.edu/~gasser/Research/software.html

[6] http://www.cscjournals.org/manuscript/Journals/IJCL/volume1/Issue2/IJCL-6.pdf

[7] https://thesai.org/Downloads/SpecialIssueNo3/Paper%201-Parts%20of%20Speech%20Tagging%20for%20Afaan%20Oromo.pdf

[somalistatus-8] https://ryantxanson.com/blog/somali-status

[9] http://www.aclweb.org/anthology/C12-3043

User:Vít Baisa/sandbox
