About the NPCMC
The NPCMC, the Nijmegen Parsed Corpus of Modern Chechen, is a preliminary attempt
at creating a growing corpus of syntactically annotated Chechen texts.
The annotation takes the parsed corpora of historical English, such as the
PPCMBE, as a starting point, and attempts to follow their annotation guidelines
as closely as possible.
Chechen is a major representative of the North-East Caucasian language family,
and it is agglutinative. It has over sixteen grammatical cases, more than half of which
express locational and directional distinctions for which other languages would use prepositions
and descriptive phrases.
Annotation efforts
Efforts are under way to annotate a number of texts from different sources.
The first source is a set of newspaper and journal articles collected by
New Mexico State University around 2005-2006. The second source consists of
texts gathered from various books and other places. Each text states the source it was taken from.
The following steps are taken in the annotation process:
- Breaking up into sentences (own software)
- Tokenization (own software)
- Part-of-speech estimation (uses extended Maciev dictionary combined with MBT)
- Part-of-speech correction (manual process within Cesax)
- Dependency parsing (uses Maltparser)
- Dependency-to-constituency conversion (done within Cesax)
- Constituent parse correction (manual process within Cesax)
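
As a rough illustration of the first two steps only, a naive sentence breaker and tokenizer might look like the Python sketch below. It is not the project's own software, which is not described here. The function names and regular expressions are assumptions chosen for illustration, and real text would need extra handling for abbreviations, quotation marks, and similar cases.

```python
import re

# Hypothetical helpers for illustration; not the project's actual software.
# Split at whitespace that follows sentence-final punctuation.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")
# A token is a word (hyphenated words stay whole) or one punctuation mark.
# In Python 3, \w matches Unicode letters, so Cyrillic text is covered.
_TOKEN = re.compile(r"\w+(?:-\w+)*|[^\w\s]")


def split_sentences(text):
    """Break text into sentences after '.', '!' or '?' plus whitespace."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def tokenize(sentence):
    """Return the word and punctuation tokens of one sentence, in order."""
    return _TOKEN.findall(sentence)


def preprocess(text):
    """Run both steps: a list of sentences, each a list of tokens."""
    return [tokenize(s) for s in split_sentences(text)]
```

The output of `preprocess` is a list of token lists, one per sentence, which is the kind of input a part-of-speech tagger would typically take in the next step.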