TalkBank MOR and UD Grammars

Processing with UD

We are currently transitioning TalkBank's morphosyntactic analysis from the MOR/POST/MEGRASP system to the UD (Universal Dependencies) system, which is described in detail here. We apply UD taggers to TalkBank files using Stanford's Stanza system, which has been built into the Batchalign2 program created by Houjun Liu, as described in this report under review.

Processing with Batchalign

Creating the new UD analysis requires Batchalign2, which can be downloaded and installed from here. Users who are not familiar with the type of installation required by Batchalign are welcome to send their transcripts to macw@cmu.edu for tagging. It takes us only minutes to tag the files and send back the result. However, it is important that the transcripts have already been validated by the CHECK program inside CLAN.

The great advantage of UD over MOR is that it is available for many more languages. It also seems to perform better than MOR for computing dependency relations on the %gra line. However, its control of morphological analysis on the %mor line is not yet as analytic as MOR. So, for English and Spanish, we will retain use of MOR. For English only, the UD tiers are called %umod and %ugra, leaving the names %mor and %gra for the tiers created by MOR.
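As a rough illustration of what the dependency relations on a %gra-style tier contain, the sketch below splits tier content into triples. It assumes the usual CHAT layout in which each item has the form index|head|relation, with head 0 marking the root. The names parse_gra and GraRelation and the sample tier content are hypothetical and invented for this example; they are not part of CLAN or Batchalign.

```python
from typing import List, NamedTuple


class GraRelation(NamedTuple):
    index: int      # 1-based position of the word in the utterance
    head: int       # index of the head word; 0 marks the root
    relation: str   # dependency label, e.g. NSUBJ or ROOT


def parse_gra(tier_text: str) -> List[GraRelation]:
    """Split dependency-tier content into (index, head, relation) triples.

    Assumes whitespace-separated items of the form index|head|relation.
    """
    relations = []
    for item in tier_text.split():
        index, head, relation = item.split("|")
        relations.append(GraRelation(int(index), int(head), relation))
    return relations


# Hypothetical tier content, invented for illustration only.
sample = "1|2|DET 2|3|NSUBJ 3|0|ROOT 4|3|PUNCT"
parsed = parse_gra(sample)

# Find the item whose head is 0, i.e. the root of the utterance.
root = next(r for r in parsed if r.head == 0)
print(root.index, root.relation)
```

The same function could be applied to the content of either a %gra or a %ugra tier, since the sketch only looks at the text that follows the tier name.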

As of March 2024, we have tagged these languages in CHILDES using UD: Afrikaans, Cantonese, Catalan, Croatian, Czech, Danish, Dutch, English, Estonian, French, German, Icelandic, Irish, Italian, Japanese, Korean, Mandarin, Norwegian, Polish, Portuguese, Serbian, Slovenian, Spanish, Swedish, Turkish, and Welsh. Once UD grammars become available, we hope to apply UD through Batchalign to languages such as Sesotho or Nungon. Currently, application to Arabic, Bulgarian, Farsi, Greek, Hebrew, Russian, and Tamil is blocked by the fact that the transcripts were done in a non-standard romanization not supported by UD. Application to Danish and Hungarian will require extensive cleanup of the transcripts. Users may still wish to rely on the MOR grammars for English, Spanish, and Hebrew and on the word segmenter for Chinese.

Within CLAN, it is possible to use the Download MOR grammars option to install MOR grammars for English, Cantonese, Chinese, Danish, Dutch, French, German, Hebrew, Italian, and Japanese. However, we recommend using only the English, Hebrew, and Spanish MOR grammars, because processing with the UD grammars is better for the other languages.