When Grammar Meets Programmer

by Keith Hautala

A collaboration between a linguist and a computer scientist at the University of Kentucky has resulted in the publication of a groundbreaking text that affords researchers a new means of assessing the complexity of languages using computer-assisted analysis.

UK linguistics Professor Gregory Stump co-authored "Morphological Typology: From Word to Paradigm" with computer science Professor Raphael Finkel. It is being published by Cambridge University Press as No. 138 in its distinguished "Cambridge Studies in Linguistics" series.

In linguistics, typology means classifying languages according to structural features, rather than by language family history. Morphology has to do with the way that words are formed. So, morphological typology, in its most basic sense, classifies languages by the ways in which they form words. For example, many languages use prefixes and suffixes to create a cluster of related terms all based upon the same stem. But morphology gets far more complicated than that.

Morphology also looks at inflection, which is the modification of a word to express grammatical categories such as tense, number, person, mood and gender. English doesn't have a lot of inflected forms. Verbs are commonly conjugated into three tenses, nouns don't have gender, and most plurals are formed according to fairly predictable rules. Inflection isn't nearly so simple in many other languages.

Some languages use a different form of a noun depending on how it is situated relative to the rest of the sentence, such as whether it is the subject, a direct object, or an indirect object. Some have not just singular and plural forms, but also a dual form, for two of an object. Nouns can be masculine, feminine or neuter. And that's just scratching the surface.

To make matters even more maddeningly complicated (or delightfully so, if you happen to be a linguist), not all of the words in any given language follow the same set of rules. The different rule-sets for inflecting groups of words are called inflection classes. The inflection classes of nouns are referred to as declensions, and those of verbs, as conjugations. Languages can have any number of inflection classes. The comprehensive set of forms defined by a word’s inflection-class membership is that word’s paradigm.

The sheer number of inflected forms in a word’s paradigm is not the best way to gauge the complexity of a language. ("Where languages lack complexity in one place, they tend to compensate for it with greater complexity elsewhere," Stump notes.) Stump and Finkel define the complexity of an inflection-class system as "the extent to which it inhibits motivated inferences about its paradigms’ word forms." Some languages may have paradigms that contain many inflected forms, but these may follow a predictable pattern that makes each paradigm easy to decode. Other paradigms may be more compact, but harder to crack.

"The thing is, nobody is ever given the whole paradigm at once," Stump says. "Inflections are learned intuitively by hearing them many times, and by trial and error, during the acquisition of language." 

Stump and Finkel devised a system for mapping out morphological paradigms into a two-dimensional matrix, called a "plat," with a full inflectional pattern elaborated for each inflection class. In general, it is possible to predict a word’s whole paradigm if certain "principal parts" of the paradigm are known. Stump and Finkel liken this prediction of wholes from parts to solving a Sudoku puzzle. 

"In Sudoku, you are given a grid with some of the numbers already filled in," Stump said. "The challenge is to fill in the rest of the grid with just a small part of the puzzle known."

Finkel knows more than a little about puzzles. He has written computer programs both to generate and to solve puzzles, including one that produces a "Sudoku Puzzle of the Day," accessible from his UK web page.

Finkel also has a strong interest in languages, as other links on his web page attest. These include a collection of online Yiddish resources, as well as a collaborative project to translate the Suda, a 10th-century Byzantine encyclopedia, for which he is a managing editor. Finkel is also recognized in the annals of computer science as the first to compile the "Jargon File," a glossary of hacker dialect, in 1975 while he was at Stanford.

Finkel has set up a website called "Computer-Assisted Technology Service ― Computational Linguists Automated Workbench," or CATS CLAW, to help linguists with their research; this website provides direct access to a range of computational linguistic tools. Among these is the program that Stump and Finkel use to analyze a language’s plat; this program calculates a variety of complexity measures for a language’s system of inflection classes. The program can, for example, identify the principal parts of a paradigm, those key elements — like the pre-filled numbers on a Sudoku grid — that enable the decoding of the entire paradigm.
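The idea of principal parts can be illustrated with a small Python sketch. This is not the CATS CLAW program or the authors' algorithm; it is a simplified brute-force search over an invented toy plat, and it assumes that a single set of cells must serve as principal parts for every inflection class. The names PLAT, CELLS, principal_part_cells and matching_classes are hypothetical, as are the suffixes in the table.

```python
from itertools import combinations

# Hypothetical toy plat: each row is an inflection class, each column is a
# paradigm cell, and each entry is a suffix. The data is invented for
# illustration and does not come from any real language.
CELLS = ["nom.sg", "gen.sg", "nom.pl", "gen.pl"]
PLAT = {
    "I":   ["-a", "-i", "-e",  "-um"],
    "II":  ["-a", "-o", "-e",  "-um"],
    "III": ["-u", "-i", "-e",  "-um"],
    "IV":  ["-u", "-o", "-es", "-um"],
}


def principal_part_cells(plat, cells):
    """Return a smallest set of cells whose forms tell every class apart.

    Tries cell combinations of increasing size and returns the first one
    for which no two inflection classes share the same forms.
    """
    n = len(cells)
    for size in range(1, n + 1):
        for combo in combinations(range(n), size):
            signatures = {tuple(row[i] for i in combo) for row in plat.values()}
            if len(signatures) == len(plat):
                return [cells[i] for i in combo]
    return None  # at least two classes have identical rows


def matching_classes(plat, cells, known):
    """Return the classes consistent with the known cell -> form pairs."""
    index = {cell: i for i, cell in enumerate(cells)}
    return [
        name
        for name, row in plat.items()
        if all(row[index[cell]] == form for cell, form in known.items())
    ]


if __name__ == "__main__":
    print(principal_part_cells(PLAT, CELLS))
    print(matching_classes(PLAT, CELLS, {"nom.sg": "-u", "gen.sg": "-o"}))
```

In this toy table no single cell separates all four classes, but two cells together do, so knowing a word's forms in those two cells is enough to fill in the rest of its paradigm, much like the pre-filled numbers on a Sudoku grid. A form shared by every class, such as the "gen.pl" suffix here, carries no information about class membership.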

Drawing on evidence from a diverse range of languages (including Chinantec, Dakota, French, Fur, Icelandic, Ngiti and Sanskrit), Stump and Finkel propose 10 explicit measures of an inflection-class system’s complexity. Some of these measures involve principal parts, while others are sensitive to the full network of implicative relations uniting a paradigm’s cells. The authors have made the complete data sets used for the book available online, as well as the CATS CLAW computational tool.