DMLex transformation backend demo

Description of the mapping (in JSON):

Dictionary to be transformed (in XML):



Results:

You can download the above examples here.

Syntax of the JSON mapping descriptions

The mapping description as a whole should be an element mapping description (on which see below) for the root of the output DMLex tree (which probably means for the lexicographicResource element).

Element mapping descriptions

An element mapping description (EMD) is a JSON object that describes how certain elements in the input XML document are to be mapped to certain elements of the output DMLex tree. This JSON object should contain the following members:

Text mapping descriptions

A text mapping description (TMD) is a JSON object that describes how certain content from the input XML element is to be used to generate the value of a text attribute of an element in the output DMLex tree. A TMD is always defined in the context of an EMD (within its textVals member), which may therefore be regarded as the parent of that TMD. The JSON object representing a TMD should contain the following members:

Marker mapping descriptions

A marker mapping description (MMD) is a JSON object that specifies how certain subranges of a text value (as obtained by applying a TMD) are to be identified and stored as a list of {"startIndex": ..., "endIndex": ...} pairs in the output DMLex tree. An MMD is always defined in the context of a TMD (within its markers member), which may therefore be regarded as the parent of that MMD. The JSON object representing an MMD should contain the following members:

An annotated example

This is an example mapping description interlaced with comments that explain what is going on. The root of our input XML document will be a <Dictionary> element, which we will map into the lexicographicResource element of the output DMLex tree:

{
    "inSelector": "/Dictionary",
    "outElement": "lexicographicResource",

This lexicographicResource element will have a text value called language, whose contents will be extracted from the @sourceLanguage attribute of the <Dictionary> element of the input XML document:

    "textVals": [
        {
            "outElement": "language",
            "attribute": "sourceLanguage"
        }
    ],

The <DictionaryEntry> descendants of the <Dictionary> element will be transformed into entry elements of the output DMLex tree, and will be stored in a list called entries of the lexicographicResource element:

    "children": [
        {
            "inSelector": "DictionaryEntry",
            "outElement": "entry", "jsonPlural": "entries",

Each entry element of the output DMLex tree has an autogenerated id attribute, as well as a headword attribute which is extracted from the inner text of the <headword> descendant of the <DictionaryEntry> element:

            "textVals": [
                {
                    "outElement": "id",
                    "attribute": "{http://elex.is/wp1/teiLex0Mapper/meta}autogenerated"
                },
                {
                    "outElement": "headword",
                    "inSelector": ".//headword",
                    "attribute": "{http://elex.is/wp1/teiLex0Mapper/meta}innerText"
                },

The entry element also has a text value partOfSpeech, obtained by taking the @value attribute of the <PartOfSpeech> descendant of the <DictionaryEntry> element and passing it through the mapping given by xlat, so that, for example, the string "adjective" becomes "adj":

                {
                    "outElement": "partOfSpeech",
                    "inSelector": ".//PartOfSpeech",
                    "attribute": "value",
                    "xlat": {
                      "adjective": "adj",
                      "adverb": "adv",
                      "adverbpostposition": "x",
                      "cardinal": "num",
                      "conjunction": "cconj",
                      "interjection": "intj",
                      "noun": "noun",
                      "ordinal": "num",
                      "particle": "part",
                      "postposition": "x",
                      "postpositionpreposition": "adp",
                      "preposition": "adp",
                      "pronoun": "pron",
                      "verb": "verb"
                    }
                }
            ],
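
As a rough illustration of what the xlat lookup amounts to, here is a minimal Python sketch. It is not the backend's code: the function name translate is hypothetical, only a few entries of the table are repeated, and the pass-through behaviour for strings missing from the table is an assumption, since the text above does not say what happens to them.

```python
# Hypothetical subset of the xlat table from the mapping above.
XLAT = {
    "adjective": "adj",
    "cardinal": "num",
    "ordinal": "num",
    "preposition": "adp",
}

def translate(raw, table):
    """Return the mapped value for raw.

    Assumption (not stated in the documentation): a value that has no
    entry in the table is passed through unchanged.
    """
    return table.get(raw, raw)

print(translate("adjective", XLAT))  # adj
print(translate("ordinal", XLAT))    # num
```

Note that several input strings may map to the same output, as "cardinal" and "ordinal" both do here.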

This is the end of the textVals list for the entry element. An entry element can also have sense elements as children. The inSelector here uses a slightly more advanced XPath expression: if the current <DictionaryEntry> element has any <SenseGrp> descendants, each of them is transformed into a sense element in the output DMLex tree; otherwise, the <DictionaryEntry> element itself is mapped into a single sense element, which becomes the sole child of the parent entry element. In the latter case, the same <DictionaryEntry> of the input XML document gives rise to both an entry element and a sense element in the output DMLex tree.

            "children": [
                {
                    "outElement": "sense",
                    "inSelector": "(.//SenseGrp)|(self::*[not(descendant::SenseGrp)])",

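The effect of this selector can be emulated with Python's standard library as a sanity check. This is only a sketch: xml.etree.ElementTree does not support the full XPath expression used above, so the function below (select_sense_sources, a hypothetical name) reproduces the same fallback logic by hand, and the two input fragments are made up for illustration.

```python
import xml.etree.ElementTree as ET

def select_sense_sources(entry):
    """Emulate "(.//SenseGrp)|(self::*[not(descendant::SenseGrp)])".

    Returns all SenseGrp descendants of entry if there are any,
    otherwise a one-element list holding the entry itself.
    """
    groups = entry.findall(".//SenseGrp")
    return groups if groups else [entry]

# Hypothetical input fragments, one with sense groups and one without.
with_groups = ET.fromstring(
    "<DictionaryEntry><SenseGrp/><SenseGrp/></DictionaryEntry>"
)
without_groups = ET.fromstring(
    "<DictionaryEntry><Definition>x</Definition></DictionaryEntry>"
)

print([e.tag for e in select_sense_sources(with_groups)])
# ['SenseGrp', 'SenseGrp']
print([e.tag for e in select_sense_sources(without_groups)])
# ['DictionaryEntry']
```
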
Now we are in the element mapping description for sense. We will give each sense an autogenerated id text value:

                    "textVals": [
                        {
                            "outElement": "id",
                            "attribute": "{http://elex.is/wp1/teiLex0Mapper/meta}autogenerated"
                        }
                    ],

Each <Definition> element that is a descendant of the current <SenseGrp> (or <DictionaryEntry>) will be mapped into a definition child of the current sense element in the output DMLex tree. A definition element has a text attribute called value, obtained as the inner text of the input <Definition> element and its descendants.


                    "children": [
                        {
                            "outElement": "definition",
                            "inSelector": ".//Definition",
                            "textVals": [
                                {
                                    "outElement": "value",
                                    "attribute": "{http://elex.is/wp1/teiLex0Mapper/meta}innerTextRec"
                                }
                            ]
                        },

A sense element of the output DMLex tree will also have example elements as children, obtained from the <Example> descendant elements in the input XML document. Each example element has a text value called text, obtained as the inner text of the <Fragment> descendant of the <Example> element.

                        {
                            "outElement":  "example",
                            "inSelector": ".//Example",
                            "textVals": [
                                {
                                    "outElement": "text",
                                    "inSelector": ".//Fragment",
                                    "attribute": "{http://elex.is/wp1/teiLex0Mapper/meta}innerText",

In addition to text, the example element will have a list of markers called vowelPairs, indicating all subranges within text where two adjacent letters are equal vowels. We use a regular expression to identify them; (?i) specifies case-insensitive matching. (Note that vowelPairs is not actually a part of DMLex, and is only provided here for illustrative purposes.)

                                    "markers": [
                                        {
                                            "outElement": "vowelPair",
                                            "regex": "(?i)([aeiou])\\1"
                                        }
                                    ]
                                },
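
To make the marker idea concrete, the following Python sketch applies the same regular expression to a sample string and collects the matches as startIndex/endIndex pairs. The function name find_markers and the sample text are hypothetical, and the index convention used here (zero-based start, end pointing just past the last matched character) is an assumption rather than something confirmed for the backend.

```python
import re

def find_markers(text, pattern):
    """Return one {"startIndex", "endIndex"} pair per regex match.

    Assumption: startIndex is zero-based and endIndex is exclusive,
    which is what Python's match.start() and match.end() provide.
    """
    return [
        {"startIndex": m.start(), "endIndex": m.end()}
        for m in re.finditer(pattern, text)
    ]

# In the JSON mapping the backslash is escaped ("\\1");
# the regular expression itself is (?i)([aeiou])\1.
VOWEL_PAIR = r"(?i)([aeiou])\1"

print(find_markers("Look at the moon", VOWEL_PAIR))
# [{'startIndex': 1, 'endIndex': 3}, {'startIndex': 13, 'endIndex': 15}]
```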

Lastly, the example element will have a text value called xr, obtained from the @href attribute of the <Ptr> child of the <SeeAlso> descendant of <Example>. However, we do not use the whole value of the attribute, but only the first group of one or more consecutive lowercase letters. (Note that xr is not actually a part of DMLex, and is only provided here for illustrative purposes.)

                                {
                                    "outElement": "xr",
                                    "inSelector": ".//SeeAlso/Ptr",
                                    "attribute": "{http://www.w3.org/TR/xlink}href",
                                    "regex": "[a-z]+"
                                }
                            ]
                        }
                    ]
                }
            ]
        }
    ],
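
The xr extraction described above can be approximated in a few lines of Python. This is a sketch under assumptions, not the backend's implementation: the function name extract_xr and the sample <Example> fragment (including the href value "#kala_1") are invented for illustration. It relies on the fact that ElementTree exposes namespaced attributes in the same {namespace}name notation that the mapping uses.

```python
import re
import xml.etree.ElementTree as ET

XLINK_HREF = "{http://www.w3.org/TR/xlink}href"

# Hypothetical input fragment; the href value is made up.
example = ET.fromstring(
    '<Example xmlns:xlink="http://www.w3.org/TR/xlink">'
    '<SeeAlso><Ptr xlink:href="#kala_1"/></SeeAlso>'
    '</Example>'
)

def extract_xr(example_elt):
    """Return the first run of lowercase letters in the Ptr's href.

    Returns None if the element, the attribute, or a match is missing.
    """
    ptr = example_elt.find(".//SeeAlso/Ptr")
    if ptr is None:
        return None
    value = ptr.get(XLINK_HREF)
    if value is None:
        return None
    match = re.search(r"[a-z]+", value)
    return match.group(0) if match else None

print(extract_xr(example))  # kala
```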

The EMD for the lexicographicResource element ends with a copyToOutElt member containing two arrays, definitionTypeTags and inflectedFormTags, which will be copied unchanged into the lexicographicResource element of the output DMLex tree.

    "copyToOutElt": {
        "definitionTypeTags": [
            { "tag": "foo", "description": "..." },
            { "tag": "bar", "description": "..." } ],
        "inflectedFormTags": [
            { "tag": "baz", "description": "..." },
            { "tag": "quux", "description": "..." } ]
    }
}

This concludes the example. You can select “KKS (Karelian)” at the top of this page to see it in action.


Janez Brank