About LOWLANDS: Parsing Low-Resource Languages and Domains – University of Copenhagen



There are noticeable asymmetries in natural language processing (NLP). We can adequately summarize English newspapers (Berg-Kirkpatrick et al., 2011) and translate them into Korean (Xu et al., 2009), but we cannot translate Korean newspaper articles into English, and summarizing micro-blogs is much more difficult than summarizing newspaper articles. This is a problem for modern societies, their development and democracy, as well as perhaps the most important research problem in NLP right now.

Most NLP technologies rely on highly accurate syntactic parsing. This holds for machine translation (Xu et al., 2009), opinion mining (Joshi and Penstein-Rosé, 2009), question answering (Poon and Domingos, 2009; Liang et al., 2011), search engines (Park and Croft, 2010), and summarization (Berg-Kirkpatrick et al., 2011). The parsing models used in these technologies are induced from large collections of manually annotated data. The majority of annotated data, however, is sampled from newswire and consists of newspaper articles or telegrams. As a result, parsing performance drops significantly when data departs from newswire conventions. Moreover, sufficient annotated data is only available for a handful of languages, such as Chinese, English, and German. So while we can extract information from, translate, and summarize newspaper articles in major languages with some success, it is much less feasible to process other valuable sources of information – say micro-blogs, chat, telephone conversations, or literature – in a robust way. The main reasons for the drop in accuracy are previously unseen words, unseen syntactic constructions, and differences in the marginal distribution of the data (McClosky et al., 2008; Søgaard and Haulrich, 2011). The drop is often so dramatic that natural language parsing becomes practically useless, which in turn has dramatic consequences for the NLP technologies that employ it.
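The first of these effects, previously unseen words, is easy to quantify as an out-of-vocabulary (OOV) rate. A minimal sketch, with toy corpora and a hypothetical `oov_rate` helper that are purely illustrative:

```python
# Sketch: measuring the out-of-vocabulary (OOV) rate when moving from
# a newswire training corpus to a new target domain. The two toy
# corpora below are illustrative placeholders, not project data.

def oov_rate(train_tokens, target_tokens):
    """Fraction of target-domain tokens unseen in the training data."""
    vocab = set(train_tokens)
    unseen = [t for t in target_tokens if t not in vocab]
    return len(unseen) / len(target_tokens)

newswire = "the minister said the talks would resume on monday".split()
microblog = "omg talks r off again smh #politics".split()

# Only "talks" is shared, so the OOV rate is very high (6 of 7 tokens).
print(round(oov_rate(newswire, microblog), 2))
```

Even this toy example shows why lexicalized models trained on newswire degrade so sharply on social media text.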

The move from one domain to another, say from newspaper articles to telephone conversations, results in a sample selection bias: our training data is biased because it is sampled from a related, but nevertheless different, distribution. Other sample selection biases can result from dialectal, thematic, or stylistic differences, or from recency effects (language change). In this project, we will also think of cross-language adaptation, i.e. using resources for one language to parse another, as a sample selection bias problem.
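One standard way to treat such a bias is instance weighting under a covariate shift assumption. The sketch below is not the project's method; it is a toy illustration with hypothetical corpora and a hypothetical `weight` helper, estimating density-ratio weights from smoothed unigram counts (real systems typically use a domain classifier instead):

```python
# Sketch: correcting sample selection bias with instance weighting,
# assuming covariate shift (p_source(y|x) == p_target(y|x)).
# Weights w(x) = p_target(x) / p_source(x) are estimated here from
# toy unigram counts with add-k smoothing; everything is illustrative.

from collections import Counter

source = "the market rose sharply the index fell".split()
target = "lol the match was wild the crowd went nuts".split()

src_counts = Counter(source)
tgt_counts = Counter(target)

def weight(word, k=0.5):
    """Density-ratio weight for one training token (add-k smoothed)."""
    p_src = (src_counts[word] + k) / (len(source) + k * 2)
    p_tgt = (tgt_counts[word] + k) / (len(target) + k * 2)
    return p_tgt / p_src

# Words relatively less frequent in the target domain are down-weighted,
# so the reweighted training sample better resembles the target.
for w in ["the", "market"]:
    print(w, round(weight(w), 2))
```

Training instances would then contribute to the parser's objective in proportion to these weights.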

We will generally consider the problem of:

Learning how to parse natural language for which no unbiased labeled data exists

Why is it an important problem?

Languages, dialects, domains and styles become important when the people who use them become important. Low-resource languages spoken in less developed regions become important to European societies, too, when there is a need for a strong international presence in those regions, or when we want to access their cultural heritage. Dialects become important in blogs and chats. Twitter and Facebook data become important when companies' customers use social media to evaluate products. Generally, since political agendas change rapidly and new media constantly emerge, it is important that reliable parsing models can be induced for any language and domain for which a sufficient amount of text is produced. The bias towards English newswire is a fundamental problem for modern societies, their development and democracy.

Why is it a difficult problem?

Parsing low-resource languages and domains is one of the most important problems in NLP, but also one of the most difficult. It has remained an open problem for at least a decade, has been the topic of several ACL-sponsored workshops, and was recently the main theme of the 12th International Conference on Parsing Technologies (Dublin, October 2011). Consider, for example, the CoNLL 2007 Shared Task on domain adaptation of dependency parsers. The shared task organizers concluded that domain adaptation of dependency parsers was an extremely difficult task, and the winning system achieved only a very small improvement over an un-adapted baseline system. More recent work on these data sets has not changed the picture substantially (Kawahara and Uchimoto, 2008). Unsupervised parsing has seen rapid progress since the seminal paper of Klein and Manning (2004); on the other hand, much better results can still be obtained for any target language by applying delexicalized parsing models, trained on other source languages, to target-language sentences with projected part-of-speech tags (McDonald et al., 2011). Generally, correcting sample selection bias is extremely difficult because training and test data reflect related, but different, distributions, and we do not know in advance how the distributions are related. There may be differences in vocabulary across languages and domains, differences in syntactic constructions, and differences in how likely data points are to occur; in some cases the differences are small, in some cases very one-sided, say a difference in vocabulary only, and in other cases again the differences are dramatic. It is only possible to estimate the distance between the distributions under strong assumptions about the bias (Kifer et al., 2004).
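The delexicalization step behind such cross-lingual transfer (in the spirit of McDonald et al., 2011) can be illustrated in a few lines. The `delexicalize` helper and the toy tagged sentences below are hypothetical, meant only to show how word forms are discarded in favour of shared part-of-speech tags:

```python
# Sketch: delexicalization for cross-lingual transfer parsing.
# Word forms are dropped and only POS tags kept, so a parser trained
# on a source language can be applied to any target language whose
# sentences carry (projected) POS tags. Examples are toy data.

def delexicalize(tagged_sentence):
    """Keep only the POS tag of each (word, tag) pair."""
    return [tag for _word, tag in tagged_sentence]

english = [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")]
danish = [("hunden", "NOUN"), ("sover", "VERB")]

# Both sentences now live in the same language-independent tag space.
print(delexicalize(english))
print(delexicalize(danish))
```

A parser trained on the delexicalized English trees can then score the Danish tag sequence directly, which is why projected POS tags alone already yield competitive transfer results.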