Shared Task on Grammar Induction


This shared task aims to foster continuing research in grammar induction and part-of-speech induction, while also opening the problem up to more ambitious settings: a wider variety of languages, the removal of the reliance on gold-standard parts of speech and, critically, a thorough evaluation that includes a task-based component.

The shared task will evaluate dependency grammar induction algorithms, assessing the quality of the structures they induce from natural language text. In contrast with the de facto standard experimental setup, which starts from gold-standard part-of-speech tags, we will encourage competitors to submit systems that are completely unsupervised. The evaluation will consider the standard dependency-tree measures (directed and undirected edge accuracy, bracketing accuracy, etc.) as well as measures over the predicted parts of speech. Our aim is to allow a wide range of different approaches, and for this reason we will accept submissions which predict just the dependency trees over gold PoS, just the PoS, or both jointly.
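As a concrete illustration, the tree-based measures reduce to simple counts over predicted head indices. The sketch below is our own illustrative code, not the official scorer; it assumes heads are given as 0-indexed positions with -1 marking the root attachment.

```python
# Illustrative sketch (not the official scorer) of directed and undirected
# attachment accuracy. Convention (an assumption): each sentence is a list
# of head indices, one per token, with -1 marking the root attachment.

def attachment_accuracy(gold_heads, pred_heads):
    """Return (directed, undirected) accuracy over a corpus of sentences."""
    directed = undirected = total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        for i, (g, p) in enumerate(zip(gold, pred)):
            total += 1
            if g == p:
                directed += 1
                undirected += 1
            # Undirected credit: the edge also matches if the gold tree
            # has the attachment in the opposite direction (p's head is i).
            elif 0 <= p < len(gold) and gold[p] == i:
                undirected += 1
    return directed / total, undirected / total
```

Bracketing accuracy can be computed analogously, by comparing the spans each tree induces.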

While our focus is on unsupervised approaches, we recognise that there has been considerable related research using semi-supervised learning, domain adaptation, cross-lingual projection and other partially supervised methods for building syntactic models. We will support these kinds of systems, but require participants to declare which external resources they have used. When presenting the results, we will split them into two sets: purely unsupervised approaches and those with some form of external supervision.


Tracks


Data

The data that we will provide will be collated from existing treebanks in a variety of different languages, domains and linguistic formalisms. Specifically, we will be using

The first three corpora listed above are licensed through the LDC, which has agreed to allow competitors access to the data for the purpose of the competition. Participants will need to sign the special license agreement and send it to the LDC in order to gain access to these corpora (English PTB, Czech PDT, Arabic PADT). The remaining corpora all have open licence agreements for research purposes, and can be freely downloaded.

Note that some of these corpora have been used in previous evaluations, namely the shared tasks at CoNLL-X and CoNLL 2007. In most cases our data is not identical, as we have updated these corpora to include larger amounts of data and to reflect changes made to the treebanks since the CoNLL competitions. In addition, our data format is slightly different, in order to include universal PoS tags.

Our multi-lingual setup is designed to allow competitors to develop cross-lingual approaches for transferring syntactic knowledge between languages. To support these techniques, we will evaluate competing systems against the fine tag-set, the coarse tag-set and their reduction to Petrov et al.'s universal tag-set. Clustering approaches will be supported using the standard metrics for evaluating cluster identifiers, e.g., many-to-one accuracy, one-to-one accuracy, variation of information (VI), etc.
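For concreteness, two of these clustering metrics can be sketched as follows. This is illustrative code under our own conventions (gold tags and induced cluster identifiers as parallel token-level sequences), not the official evaluation script.

```python
# Illustrative sketch of two standard metrics for comparing induced
# clusters against gold tags: many-to-one accuracy and variation of
# information (VI). Inputs are parallel token-level sequences.
from collections import Counter
from math import log

def many_to_one(gold, pred):
    """Map each induced cluster to its most frequent gold tag, then score."""
    pairs = Counter(zip(pred, gold))
    best = {}
    for (c, t), n in pairs.items():
        if n > best.get(c, (None, 0))[1]:
            best[c] = (t, n)
    correct = sum(n for (c, t), n in pairs.items() if best[c][0] == t)
    return correct / len(gold)

def variation_of_information(gold, pred):
    """VI = H(gold) + H(pred) - 2 * I(gold; pred), in nats; 0 is perfect."""
    n = len(gold)
    pg, pp, pj = Counter(gold), Counter(pred), Counter(zip(gold, pred))
    hg = -sum(c / n * log(c / n) for c in pg.values())
    hp = -sum(c / n * log(c / n) for c in pp.values())
    mi = sum(c / n * log((c / n) / ((pg[g] / n) * (pp[p] / n)))
             for (g, p), c in pj.items())
    return hg + hp - 2 * mi
```

One-to-one accuracy is computed like many-to-one, except that each gold tag may be claimed by at most one cluster (a maximum bipartite matching).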

For the English PTB, we intend to compile multiple annotations for the same sentences such that the effect of the choice of linguistic formalism or annotation procedure can be offset in the evaluation. This is a long-standing issue in parsing where many researchers evaluate only against the Penn Treebank, a setting which does not reflect many modern advances in linguistic theory from the last two decades. Overall this test set will form a significant resource for the evaluation of parsers and grammar induction algorithms, and help to reduce the field's continuing reliance on the Penn Treebank.


Baselines

We provide baselines for both Part-of-Speech Induction and Dependency Grammar Induction that can be used as starting points for the shared task.


Part-of-Speech Induction

For Part-of-Speech Induction we suggest two systems. The first is the Brown clustering algorithm [], which assigns each word type to a single cluster. The second is an HMM-based system trained using the EM algorithm, which can assign tokens of the same word type to different clusters according to their context. Note that the Brown clustering system provides the stronger baseline.
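The contrast between the two baselines can be seen with a toy Viterbi decoder: a type-level clustering must give every occurrence of a word the same label, while an HMM picks the best tag per token given its context. All parameters below are invented purely for illustration.

```python
# Toy illustration (invented parameters) of context-dependent HMM tagging.
# "flies" is ambiguous between N and V, and the decoder resolves it
# differently depending on the preceding word.

TRANS = {  # p(tag | previous tag), with "<s>" as the start state
    "<s>": {"DET": 0.6, "N": 0.2, "V": 0.2},
    "DET": {"DET": 0.0, "N": 0.9, "V": 0.1},
    "N":   {"DET": 0.1, "N": 0.1, "V": 0.8},
    "V":   {"DET": 0.4, "N": 0.4, "V": 0.2},
}
EMIT = {   # p(word | tag)
    "DET": {"the": 1.0},
    "N":   {"flies": 0.5, "man": 0.5},
    "V":   {"flies": 1.0},
}

def viterbi(words):
    """Return the most probable tag sequence under the toy HMM."""
    tags = list(EMIT)
    best = {t: (TRANS["<s>"][t] * EMIT[t].get(words[0], 0.0), [t])
            for t in tags}
    for w in words[1:]:
        best = {t: max(((p * TRANS[s][t] * EMIT[t].get(w, 0.0), path + [t])
                        for s, (p, path) in best.items()),
                       key=lambda x: x[0])
                for t in tags}
    return max(best.values(), key=lambda x: x[0])[1]
```

In the real baseline these parameters are not hand-set but re-estimated from raw text by EM.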

Brown Clustering

Download the code from http://cs.stanford.edu/~pliang/software/brown-cluster-1.2.zip and unzip the file. This will create a directory called brown-cluster-1.2.

HMM-Based Tagger

Generating Dependency Files

Generate CoNLL-format data files from the induced clusters, for use with the dependency grammar induction system.
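A minimal sketch of such a conversion is shown below, assuming the 10-column CoNLL-X layout with the induced cluster identifier written into the CPOSTAG and POSTAG columns; the exact columns expected by the baseline toolkit may differ.

```python
# Hypothetical sketch: write one sentence in 10-column CoNLL-X format,
# substituting induced cluster ids for the part-of-speech columns.
# Columns: ID FORM LEMMA CPOSTAG POSTAG FEATS HEAD DEPREL PHEAD PDEPREL.

def to_conll(sentence, clusters):
    """sentence: list of word forms; clusters: parallel list of cluster ids."""
    lines = []
    for i, (word, cluster) in enumerate(zip(sentence, clusters), start=1):
        cols = [str(i), word, "_", str(cluster), str(cluster),
                "_", "_", "_", "_", "_"]   # HEAD/DEPREL left unset for induction
        lines.append("\t".join(cols))
    return "\n".join(lines) + "\n\n"       # blank line separates sentences
```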


Dependency Grammar Induction

We provide a baseline system that can be used as a starting point for the shared task. The baseline is an implementation of the Dependency Model with Valence (DMV) [Klein and Manning, ACL 2004], as described in the paper "Posterior Sparsity in Dependency Grammar Induction".
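For intuition, the harmonic initialization used below weights each potential attachment by the inverse of the distance between head and child before EM begins, so nearby heads start out more likely than distant ones. The sketch underneath is our own rough rendering of the idea, not the toolkit's code; in particular, the handling of the root score is an assumption.

```python
# Rough sketch (not the toolkit's implementation) of the "harmonic"
# initializer of Klein and Manning (2004): each child's attachment to a
# head is weighted proportionally to 1 / distance, then normalized.

def harmonic_init(n, root_weight=1.0):
    """Return p[child][head] over heads 0..n-1 plus the root (index n)."""
    probs = []
    for c in range(n):
        scores = [0.0 if h == c else 1.0 / abs(h - c) for h in range(n)]
        scores.append(root_weight / n)   # assumed treatment of the root score
        z = sum(scores)
        probs.append([s / z for s in scores])
    return probs
```

EM then re-estimates the DMV parameters starting from these distance-biased attachment probabilities.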

Installation

The code requires Java to be installed on your machine. Download the archive from http://code.google.com/p/pr-toolkit/downloads/detail?name=pr-dep-parsing.2010.11.tgz&can=2&q=#makechanges into a directory and unpack it.

Preparing the data

Create a corpus params file as ....

Running

To run the DMV model, starting from the Klein and Manning harmonic initialization and running for 100 EM iterations, use the following command:

java -jar dist/dep-parsing-2010.11.jar -corpus-params /path/to/corpus/params/file/<LANG's params file> -model-init K_AND_M -stats-file /path/to/stats/file/<stats file> -output-prefix <prefix for output files> -num-em-iters 100 -trainingType 0


Evaluation

The evaluation will have two parts: (1) a linguistic evaluation against a treebank, and (2) a task-based evaluation, in which we incorporate the predictions of grammar/PoS induction models into a machine translation system. The exact form of the task-based evaluation is still to be decided, but may involve features for reranking MT outputs or features for source-side reordering.

References