Tasks


List of languages by processing properties and tools

  • CAP: language distinguishes lower and upper-case
  • SEG: segmentation tool available
  • LEM: lemmatization tool available
  • MOR: morphology analysis tool
  • SYL: syllabification tool
  • POS: Part-of-Speech tagger available
  • UD: Universal POS dependency
  • ACC: language has accents
  • NACC: no accents form
  • Script: script in which the language is written
  • DIR: writing direction, left-to-right (LTR) or right-to-left (RTL)
  • Country: countries where the language is used as an official or unofficial language
| Language | CAP | SEG | LEM | MOR | SYL | POS | UD | ACC | NACC | Script | DIR | Country |
| Czech | yes | yes | yes | yes | yes[1] | yes | yes | yes | yes | Latin | LTR | Czech Republic |
| English | yes | yes | yes | yes | yes | yes | no | yes | yes | Latin | LTR | many |
| Norwegian | yes | yes[2] | yes[2] | yes[2] | yes[1] | yes | no | yes | yes | Latin | LTR | Norway |
| Amharic | no | yes[3] | yes[4] | yes[5] | ? | yes | yes | no | - | Ethiopic | LTR | Ethiopia |
| Oromo | yes | yes[3] | yes[6] | yes[5] | ? | yes[7] | no | no | - | Latin | LTR | Kenya, Ethiopia |
| Somali | yes | ? | yes[8] | yes[8] | ? | yes | no | no | - | Latin | LTR | Djibouti, Ethiopia, Kenya, Somalia |
| Tigrinya | no | yes[3] | yes[9] | yes[5] | ? | yes | no | no | - | Ethiopic | LTR | Eritrea, Ethiopia |
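
The property table above can also be held as a small lookup structure, which makes it easy to ask which languages have a given tool. The following is only a minimal sketch: the names `COLUMNS`, `ROWS`, `parse_rows` and `languages_with` are hypothetical, the cell values are copied from the table with footnote markers dropped, and the Script, DIR and Country columns are left out for brevity.

```python
# Yes/no property columns, in the order used by the table above.
COLUMNS = ["CAP", "SEG", "LEM", "MOR", "SYL", "POS", "UD", "ACC", "NACC"]

# Cell values copied from the table; "?" (unknown) and "-" are kept unchanged.
ROWS = {
    "Czech":     "yes yes yes yes yes yes yes yes yes",
    "English":   "yes yes yes yes yes yes no yes yes",
    "Norwegian": "yes yes yes yes yes yes no yes yes",
    "Amharic":   "no yes yes yes ? yes yes no -",
    "Oromo":     "yes yes yes yes ? yes no no -",
    "Somali":    "yes ? yes yes ? yes no no -",
    "Tigrinya":  "no yes yes yes ? yes no no -",
}


def parse_rows(rows):
    """Turn each space-separated row into a {column: value} dictionary."""
    return {lang: dict(zip(COLUMNS, cells.split())) for lang, cells in rows.items()}


def languages_with(table, **required):
    """Return the languages whose properties equal every required value."""
    return [
        lang
        for lang, props in table.items()
        if all(props[col] == value for col, value in required.items())
    ]


TABLE = parse_rows(ROWS)

if __name__ == "__main__":
    print(languages_with(TABLE, UD="yes"))
    print(languages_with(TABLE, SYL="?"))
```

Keeping "?" as a distinct value, rather than mapping it to "no", preserves the difference between a tool that is known to be missing and one whose availability has not been checked.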

Natural Language Processing Centre


The NLP Centre is a research centre at Masaryk University in Brno, Czech Republic. It focuses on several topics in natural language processing, computational linguistics and corpus linguistics, namely

  • computational lexicography: developing lexicographical software, building lexicons,
  • corpus linguistics: developing the Sketch Engine corpus manager, building very large web corpora in many languages,
  • syntactic analysis: developing various parsers for the Czech language,
  • semantic and logic analysis: TIL (Transparent Intensional Logic), WordNets.

The director of the Centre is Karel Pala.

Tools and resources developed in the centre

  • Sketch Engine - corpus manager tool
  • VerbaLex - valency lexicon of Czech verbs
  • Czech WordNet - semantic lexicon of Czech
  • SET - parser of Czech
  • Synt - chart parser of Czech
  • DEB - dictionary editing and browsing platform
  • Corpus Pattern Analysis - a corpus-based method for discovering senses of English verbs
  • Jazyková příručka

Collaboration


The centre participates in various international projects together with other research institutions and departments, mainly with

  • University of Wolverhampton (Patrick Hanks),
  • Faculty of Arts, Masaryk University (James Thomas, Dana Hlaváčková),
  • St. Petersburg (Khokhlova).

Important publications


Achievements, prizes


Projects

  • BalkaNet, VerbaLex

Comparison of corpus managers


This is a comparison of corpus managers that process large corpora. The comparison was carried out on the British National Corpus (a 100-million-word text corpus of written and spoken English).[1]

Corpus managers


Features of corpus managers

| Feature | BYU[note 1] | Sketch Engine | CQP/BNCweb |
| Basic queries | | | |
| word | Yes | Yes | Yes |
| phrase | Yes | Yes | Yes |
| wildcard | Yes | Yes | Yes |
| lemma | Yes | Yes | Yes |
| part of speech | Yes | Yes | Yes |
| combine any of above | Yes | Yes | Yes |
| Visualization | | | |
| frequency of each matching string | Yes | Yes | Yes |
| frequency of each matching string, in each of several sections | Yes | No | No |
| overall frequency for all matching forms, in different sections | Yes | No | No |
| Collocates | | | |
| basic collocates search | Yes | Yes | Yes |
| sort by Mutual Information | Yes | Yes | Yes |
| limit collocate by part of speech | Yes | Yes | Yes |
| find specific collocate(s) near node word(s) | Yes | Yes | Yes |
| Word comparisons | | | |
| basic (e.g. collocates of small vs. little, or men and women) | Yes | Yes | No |
| Integrated synonyms | | | |
| basic: search by synonyms | Yes | No | No |
| advanced: include synonyms as part of another query | Yes | No | No |
| see frequency of synonyms in different sections (e.g. by genre or over time) | Yes | No | No |
| compare frequency of synonyms in different sections | Yes | No | No |
| see all collocates for a much larger list of words (e.g. all synonyms of large) | Yes | No | No |
| "synonym chains": explore web of related words (click on [S] in the entries) | Yes | No | No |
| Customized / personalized lists | | | |
| create lists of words and re-use them as part of query syntax | Yes | No | No |
| Limiting by sections of corpus | | | |
| basic (e.g. collocates of strong in academic journals) | Yes | Yes | Yes |
| compare frequencies in different sections (e.g. ADJ in ACAD-Medicine vs ACAD) | Yes | No | No |
| compare collocates in different sections (e.g. chair in spoken vs. academic) | Yes | No | No |
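
The "sort by Mutual Information" feature listed above can be illustrated with the standard pointwise mutual information score, log2(f_xy * N / (f_x * f_y)). This is a minimal sketch only: the source does not say which variant of the score each corpus manager uses, the function names `mi_score` and `rank_collocates` are hypothetical, and all frequencies in the example are invented for illustration.

```python
import math


def mi_score(f_xy, f_x, f_y, corpus_size):
    """Pointwise mutual information of a node word x and a collocate y.

    f_xy: co-occurrence count, f_x and f_y: individual counts,
    corpus_size: total number of tokens N.
    """
    if min(f_xy, f_x, f_y, corpus_size) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2(f_xy * corpus_size / (f_x * f_y))


def rank_collocates(node_freq, candidates, corpus_size):
    """Sort collocates by MI score, highest first.

    candidates maps each collocate to (collocate_freq, cooccurrence_freq).
    """
    scored = [
        (word, mi_score(co_freq, node_freq, freq, corpus_size))
        for word, (freq, co_freq) in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Invented counts for a node word in a 100-million-token corpus.
    candidates = {"the": (6_000_000, 3_000), "coffee": (5_000, 200)}
    for word, score in rank_collocates(10_000, candidates, 100_000_000):
        print(f"{word}\t{score:.2f}")
```

With these invented counts the rarer word ranks above the very frequent one even though it co-occurs less often in absolute terms, which is the usual reason for sorting collocates by an association score rather than by raw frequency.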


Speed of corpus managers


The three architectures are equally fast for single word forms, lemmas and collocates: most queries of this type take a second or less. Differences appear when searching for strings of words.


| String of text | BYU | Sketch Engine | CQP/BNCweb[note 2] |
| the [adj] thing (e.g. the best thing) | 0.9 | 1.8 + 0.9[note 3] | 10.6 + 3.1 |
| [pron] had better [verb] [pron] (e.g. she had better warn him) | 1.1 | 7.5 + 0.8[note 4] | 15.0 + 0.8 |
| ' ( and she was like , ) | 1.3 | 5.7 + 0.8[note 5] | 19.0 + 0.8 |


  1. ^ The times for Sketch Engine and BNCweb were reduced by about 20% to account for network latency.
  2. ^ The CQL query in BNCweb is the same as [x]–[xx] below, with the change of [tag=...] > [pos=...]
  3. ^ search by CQL: [word="the"] [tag="AJ."] [word="thing"]
  4. ^ search by CQL: [tag="PN."] [tag="PN."] [word="had"] [word="better"] [tag="V.."] [tag="PN."]
  5. ^ search by CQL: [tag="C.."] [tag="PN."] [lemma="be"] [word="like"] [word=","]
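
The CQL queries in the notes above describe a sequence of token slots, each constraining one attribute (word, tag or lemma) with a regular expression. The sketch below shows how such a query could be evaluated against a tagged sentence. It is an illustration under stated assumptions, not the implementation used by any of the compared tools: it handles only the simple `[attr="regex"]` form seen in the notes, the names `parse_cql` and `find_matches` are hypothetical, and the tagged sentence is an invented example.

```python
import re

# One token slot of the simple form [attribute="regex"].
SLOT_RE = re.compile(r'\[(\w+)="([^"]*)"\]')


def parse_cql(query):
    """Return one (attribute, compiled regex) pair per token slot."""
    return [(attr, re.compile(pattern)) for attr, pattern in SLOT_RE.findall(query)]


def find_matches(query, tokens):
    """Return the start index of every token window that satisfies the query.

    tokens is a list of dictionaries such as {"word": ..., "tag": ..., "lemma": ...}.
    Each regex must match the whole attribute value.
    """
    slots = parse_cql(query)
    if not slots:
        return []
    hits = []
    for start in range(len(tokens) - len(slots) + 1):
        window = tokens[start:start + len(slots)]
        if all(rx.fullmatch(tok.get(attr, "")) for (attr, rx), tok in zip(slots, window)):
            hits.append(start)
    return hits


# Invented example sentence; the tags are illustrative only.
SENTENCE = [
    {"word": "it", "tag": "PNP", "lemma": "it"},
    {"word": "was", "tag": "VBD", "lemma": "be"},
    {"word": "the", "tag": "AT0", "lemma": "the"},
    {"word": "best", "tag": "AJS", "lemma": "good"},
    {"word": "thing", "tag": "NN1", "lemma": "thing"},
]

if __name__ == "__main__":
    print(find_matches('[word="the"] [tag="AJ."] [word="thing"]', SENTENCE))
```

This linear scan checks every window of the sentence, so it only shows what a query means; it says nothing about how an indexed corpus manager reaches the response times reported in the table.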


References

