Описание оригинальной библиотеки с aot.ru (может оказаться полезным)¶
The following text is a description of the original aot.ru POS tagger implementation. Pymorphy was based on aot.ru research and it uses aot.ru dictionaries, but the implementation and further development is independent. The following text is provided only for reference. It does not describe how pymorphy works.
This is a program of morphological analysis (Russian, German, and English languages).
This program is distributed under the Library GNU Public Licence, which is in the file COPYING.
This program was written by Andrey Putrin, Alexey Sokirko. The project started in Moscow in Dialing Company (Russian and English language). The German part was created at Berlin-Brandenburg Academy of Sciences and Humanities in Berlin (the project DWDS).
The Russian lexicon is based upon Zaliznyak’s Dictionary . The German lexicon is based upon Morphy system (http://www-psycho.uni-paderborn.de/lezius/). The English lexicon is based upon Wordnet.
The project uses a regular expression library “PCRE” (Perl Compatible Regular Expressions). We test compilation only with version 6.4. Other versions were not tested. One should download this version from the official site and install it to the default place. If you do not want to install it or you do not have enough rights to do it, then you should create two environment variables:
1. RML_PCRE_LIB, that points to PCRE library directory, where libpcre.a and libpcrecpp.a should be located, for example:
2. RML_PCRE_INCLUDE, that points to PCRE include catalog, where “pcrecpp.h” is located, for example:
The system has been developed under Windows 2000 (MS VS 6.0), but has also been compiled and run under Linux(GCC). It should work with minor changes on other systems.
Website of DDC: www.aot.ru, https://sf.net/projects/morph-lexicon/
I compiled all sources with gcc 3.2. Lower versions are not supported.
Contents of the this source archive
- The main morphological library (Source/LemmatizerLib).
- Library for grammatical codes (Source/AgrgamtabLib).
- Test morphological program (Source/TestLem)..
- Library for working with text version of the dictionaries (Source/MorphWizardLib).
- Generator of morphological prediction base (Source/GenPredIdx).
- Generator of binary format of the dictionaries (Source/MorphGen).
Create a catalog and register a system variable RML, which points to this catalog:
mkdir /home/sokirko/RML export RML=/home/sokirko/RML
Put “lemmatizer.tar.gz”, ”???-src-morph.tar.gz” to this catalog, ”???” can be “rus”, “ger” or “eng” according to what you have downloaded. Unpack it:
tar xfz lemmatizer.tar.gz tar xfz ???-src-morph.tar.gz
- Do not forget to set RML_PCRE (see above)
- cd $RML
- ./compile_morph.sh This step should create all libraries and a test program $RMLBinTestLem.
Building Morphological Dictionary¶
./generate_morph_bin.sh <lang> where <lang> can be Russian, German according to the dictionary yo have downloaded.
The script should terminate with message “Everything is OK”. You can test the morphology:$RML\Bin\TestLem <lang>
If something goes wrong, write me to email@example.com.
This section describes the format of a mrd-file. Mrd-file is a text file which contains one morphological dictionary for one natural language. MRD is an abbreviation of “morphological dictionary”.
The usual place for this file is:
where xxx can be “Eng”, “Rus” or “Ger” depending on the language.
The encoding of the file depends also upon the language:
- Russian - Windows 1251
- German - Windows 1252
- English - ASCII
A mrd-file refers to a gramtab-file, which is language-dependent and which contains all possible full morphological patterns for the words. One line in a gramtab-file looks like as follows:
<ancode> <unused_number> <part_of_speech> <grammems>
An ancode is an ID, which consists of two letters and which uniquely identifies a morphological pattern. A morphological pattern consists of <part_of_speech> and <grammems>. For example, here is a line from the English gramtab:
te 1 VBE prsa,pl
Here “te” is an ancode, “VBE” is a part of speech, “prsa,pl” are grammems, “1” is the obsolete unused number.
In mrd-files we use ancodes to refer to a morphological pattern.
Here is the list of all gramtab-files:
- Russian - $Rml/Dicts/Morph/rgramtab.tab
- German - $Rml/Dicts/Morph/ggramtab.tab
- English - $Rml/Dicts/Morph/egramtab.tab
All words in a mrd-file are written in uppercase.
One mrd-file consists of the following sections:
- Section of flexion and prefix models;
- Section of accentual models;
- Section of user sessions;
- Section of prefix sets;
- Section of lemmas.
Each section is a set of records, one per line. The number of all records of the section is written in the very beginning of the section at a separate line. For example, here is a possible variant of the section of user sessions:
1 alex;17:10, 13 October 2003;17:12, 13 October 2003
“1” means that this section contains only one record, which is written on the next line, thus this section contains only two lines.
Section of possible flexion and prefix models¶
Each record of this section is a list of items. Each item describes how one word form in a paradigm should be built. The whole list describes the whole paradigm (a set of word forms with morphological patterns).
The format of one item is the following:
- <flexion> is a flexion (a string, which should be added to right of the word base) <prefix> is a prefix (a string, which should be added to left of the word base) <ancode> is an ancode.
Let us consider an example of an English flexion and prefix model:
Here we have two items:
1. <flexion> = F; <ancode> = na 2. <flexion> = VES; <ancode> = nb
In order to decipher ancodes we should go the English gramtab-file. There we can find the following lines:
na NOUN narr,sg nb NOUN narr,pl
If base “lea” would be ascribed to this model, then its paradigm would be the following:
leaf NOUN narr,sg leaves NOUN narr,pl
It is important, that each word of a morphological dictionary should contain a reference to a line in this section.
Section of possible accentual models¶
Each record of this section is a comma-delimited list of numbers, where each number is an index of a stressed vowel of a word form(counting from the end). The whole list contains a position for each word form in the paradigm.
If an item of an accentual model of word is equal to 255, then it is undefined, and it means that this word form is unstressed.
Each word in the dictionary should have a reference to an accentual model, even though this model can consist only of empty items.
For one word, the number and the order of items in the accentual model should be equal to the number and the order of items in the flexion and prefix model. For example we can ascribe to word “leaf” with the paradigm:
leaf NOUN narr,sg leaves NOUN narr,pl
the following accentual model:
It produces the following accented paradigm:
le'af NOUN narr,sg le'aves NOUN narr,pl
Section of user section¶
This is a system section, which contains information about user edit sessions.
Section of prefix sets¶
Each record of this section is a comma-delimited list of strings, where each string is a prefix, which can be prefixed to the whole word. If a prefix set is ascribed to a word, it means, that the words with these prefixes can also exist in the language. For example, if “leaf” has the prefix set “anti,contra”, it follows the existence of words “antileaf”, “contraleaf”.
A flexion and prefix model can contain also a reference to a prefix, but this prefix is for one separate word form, while a prefix set is ascribed to the whole word paradigm.
Section of lemmas¶
A record of this section is a space-separated tuple of the following format:
<base> <flex_model_no> <accent_model_no> <session_no> <type_ancode> <prefix_set_no>
<base> is a base (a constant part of a word in its paradigm)
<flex_model_no> is an index of a flexion and prefix model
<accent_model_no> is an index of an accentual model
<session_no> is an index of the session, by which the last user edited this word
<type_ancode> is ancode, which is ascribed to the whole word (intended: the common part of grammems in the paradigm) “-” if it is undefined
<prefix_set_no> is an index of a prefix set, or “-” if it is undefined