PASCAL Challenge on Grammar Induction (Shared Task)


This shared task aims to foster continuing research in grammar induction and part-of-speech induction, while also opening up the problem to more ambitious settings, including a wider variety of languages, removing the reliance on gold standard parts-of-speech and, critically, providing a thorough evaluation including a task-based evaluation.

The shared task will evaluate dependency grammar induction algorithms, assessing the quality of structures induced from natural language text. In contrast with the de facto standard experimental setup, which starts with gold standard part-of-speech tags, we encourage competitors to submit systems which are completely unsupervised. The evaluation will consider the standard dependency tree based measures (directed and undirected edge accuracy, bracketing accuracy, etc.) as well as measures over the predicted parts of speech. Our aim is to allow a wide range of different approaches, and for this reason we will accept submissions which predict just the dependency trees for gold PoS, just the PoS, or both jointly.

While our focus is on unsupervised approaches, we recognise that there has been considerable related research using semi-supervised learning, domain adaptation, cross-lingual projection and other partially supervised methods for building syntactic models. We will support these kinds of systems, but require participants to declare which external resources they have used. When presenting the results, we will split them into two sets: purely unsupervised approaches and those that have some form of external supervision.


Tracks

  • Gold Part-of-Speech tags - Dependency structures should be predicted using the gold tags; the evaluation will consider the fine, coarse and universal tag-sets.

  • Induced Part-of-Speech tags - Predicting dependency structures and/or Part-of-Speech tags directly from the text. The number of induced tags can be chosen by each participant.

  • Open resources - Other resources can be used, such as parallel data.

If you plan to participate, please join the Google group. We'll post announcements and any questions about the data, evaluation, etc. to this list. You must join the Google group to get access to the LDC data (as well as filling out a licence form and faxing it to the LDC).


Data

The data that we are providing was collated from existing treebanks in a variety of different languages, domains and linguistic formalisms. Specifically, we are using the following:

  • English - Penn Treebank (PTB)
  • Czech - Prague Dependency Treebank (PDT)
  • Arabic - Prague Arabic Dependency Treebank (PADT)
  • English CHILDES - http://staffwww.dcs.shef.ac.uk/people/T.Cohn/wils/english_CHILDES.tar.gz
  • Basque 3LB-cast - http://staffwww.dcs.shef.ac.uk/people/T.Cohn/wils/basque.tar.gz
  • Danish Copenhagen Dependency Treebank - http://staffwww.dcs.shef.ac.uk/people/T.Cohn/wils/danish.tar.gz
  • Dutch Alpino Treebank - http://staffwww.dcs.shef.ac.uk/people/T.Cohn/wils/dutch.tar.gz
  • Portuguese Floresta Sintá(c)tica - http://staffwww.dcs.shef.ac.uk/people/T.Cohn/wils/portuguese.tar.gz
  • Slovene jos500k - http://staffwww.dcs.shef.ac.uk/people/T.Cohn/wils/slovene.tar.gz
  • Swedish Talbanken05 - http://staffwww.dcs.shef.ac.uk/people/T.Cohn/wils/swedish.tar.gz

Please download the last seven corpora using the links above. These corpora have open licence agreements for research purposes, and can be freely downloaded. Please see the README file for each corpus, which includes details of the licensing.

The first three corpora listed above are licensed under the LDC, who have agreed to allow competitors access to the data for the purpose of the competition. Participants will need to sign the special license agreement and send this to the LDC in order to gain access to these corpora (English PTB, Czech PDT, Arabic PADT).

Note that some of these corpora have been used in previous evaluations, namely the shared tasks at CoNLL-X and CoNLL 2007. In most cases our data is not identical, as we have updated these corpora to include larger amounts of data and changes to the treebanks that have occurred since the CoNLL competitions. In addition, our data format is slightly different in order to include universal PoS tags.

Our multi-lingual setup is designed to allow competitors to develop cross-lingual approaches for transferring syntactic knowledge between languages. To support these techniques, we will evaluate competing systems against the fine tag-set, the coarse tag-set and their reduction into Petrov et al.'s universal tag-set.
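
For illustration, the reduction to the universal tag-set is a many-to-one mapping from each treebank's fine tags. The sketch below shows a fragment of the English PTB portion of Petrov et al.'s mapping; only a sample of entries is listed, and participants should use the released mapping files rather than this sketch:

```python
# Fragment of the PTB -> universal tag mapping (Petrov et al.).
# The full mapping covers all PTB tags; this is an illustrative sample.
PTB_TO_UNIVERSAL = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB", "VBN": "VERB",
    "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "IN": "ADP", "CD": "NUM", "CC": "CONJ",
    "PRP": "PRON", "PRP$": "PRON",
}

def to_universal(fine_tag, mapping=PTB_TO_UNIVERSAL):
    """Reduce a fine tag to its universal tag; tags missing from this
    sample fall back to the catch-all X category."""
    return mapping.get(fine_tag, "X")
```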

For the English PTB, we will compile multiple annotations for the same sentences such that the effect of the choice of linguistic formalism or annotation procedure can be offset in the evaluation. This is a long-standing issue in parsing where many researchers evaluate only against the Penn Treebank, a setting which does not reflect competing widely supported linguistic theories. Overall this test set will form a significant resource for the evaluation of parsers and grammar induction algorithms, and help to reduce the field's continuing reliance on the Penn Treebank.

Data Format

All data files are encoded in UTF-8, and largely follow the file format from the CoNLL-X/2007 shared tasks. Each sentence is represented as a series of lines, with one token per line, and sentences are separated by blank lines. Each token is represented as a tab-separated list of fields:

  1. word number, starting from 1
  2. token
  3. lemma
  4. coarse part-of-speech (optional)
  5. fine part-of-speech
  6. universal part-of-speech
  7. lexical features (optional)
  8. index of head word
  9. type of dependency relation linking head to current word

Missing values are denoted by an underscore (_), and not all corpora include values for the optional fields. Each different corpus uses different annotation methods for tokenization, lemmatisation, lexical features, part-of-speech and dependency edge labels. Please see the README file in each corpus for descriptions of these annotations. The universal parts-of-speech, however, use the same tag-set across the different corpora.
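
The per-token layout above can be read with a few lines of Python. The following is a minimal sketch, not an official tool; the field names in the list are our own labels for the nine numbered fields:

```python
def read_conll(path):
    """Parse a CoNLL-style file into sentences of token dicts.

    Each non-blank line holds the 9 tab-separated fields listed above;
    an underscore marks a missing value and is read back as None.
    """
    fields = ["id", "form", "lemma", "cpos", "pos", "upos",
              "feats", "head", "deprel"]
    sentences, tokens = [], []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                 # blank line ends a sentence
                if tokens:
                    sentences.append(tokens)
                    tokens = []
                continue
            values = line.split("\t")
            token = {k: (None if v == "_" else v)
                     for k, v in zip(fields, values)}
            tokens.append(token)
    if tokens:                           # file may lack a trailing blank line
        sentences.append(tokens)
    return sentences
```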

For each corpus, we will distribute a large training set, a small development set with all fields and a test set. For the Gold Part-of-Speech stream of the competition, fields 8-9 will be omitted (replaced with an underscore) from the training and test sets, and only provided for the test set at the end of the competition. For the Induced Part-of-Speech stream, fields 4-9 will be omitted for the training and test sets. Note that as our task is induction, participants are encouraged to pool all the data together for the purpose of training their unsupervised models (i.e., use the union of training, development and testing sets for training their models).

Tools

Scripts will be provided to

  • strip away punctuation
  • filter sentences based on length
  • etc
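
Pending the official scripts, filters of the kind listed above are straightforward to sketch. The version below operates on sentences represented as lists of word forms (an assumption of ours, not the official script interface), and treats a token as punctuation when all of its characters are Unicode punctuation:

```python
import unicodedata

def is_punct(form):
    """True if every character of the token is Unicode punctuation."""
    return all(unicodedata.category(c).startswith("P") for c in form)

def filter_sentences(sentences, max_len=10, keep_punct=False):
    """Keep sentences with at most max_len tokens, by default
    ignoring punctuation tokens when counting length."""
    kept = []
    for sent in sentences:
        tokens = sent if keep_punct else [t for t in sent if not is_punct(t)]
        if len(tokens) <= max_len:
            kept.append(sent)
    return kept
```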

Evaluation

We will provide a script for evaluating

  • dependencies against a gold standard. This will measure the directed and undirected unlabelled attachment score, both for all sentences and sentences shorter than 10 words.

  • parts-of-speech against the gold standard, using each of the fine tags, coarse tags and universal tags. Clustering based approaches will be supported using the standard metrics for evaluating cluster identifiers, e.g., many-to-1, 1-1, VI etc.
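
The two families of measures above can be sketched in a few lines. This is an illustrative implementation of our own, not the official evaluation script: attachment scores over gold and predicted head indices (0 denoting the root), and many-to-one accuracy, which maps each induced cluster to its most frequent gold tag:

```python
from collections import Counter

def attachment_scores(gold_heads, pred_heads):
    """Directed and undirected unlabelled attachment scores.

    Each argument is a list of sentences; each sentence is a list of
    1-based head indices (0 = root), one per token.
    """
    directed = undirected = total = 0
    for gold, pred in zip(gold_heads, pred_heads):
        for i, (g, p) in enumerate(zip(gold, pred)):
            total += 1
            if g == p:
                directed += 1
            # undirected: also credit a predicted edge whose direction
            # is reversed relative to the gold edge
            if g == p or (p > 0 and gold[p - 1] == i + 1):
                undirected += 1
    return directed / total, undirected / total

def many_to_one(gold_tags, pred_clusters):
    """Map each induced cluster to its most frequent gold tag and
    report the accuracy of the resulting labelling."""
    pairs = Counter(zip(pred_clusters, gold_tags))
    best = Counter()
    for (cluster, tag), n in pairs.items():
        best[cluster] = max(best[cluster], n)
    return sum(best.values()) / len(gold_tags)
```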

Suggestions are welcome for other evaluation methods.


Baselines

We provide a baseline system that can be used as a starting point for the shared task. The baseline is an implementation of the Dependency Model with Valence (DMV) [Klein, 2004], as described in [Gillenwater, 2010].

You can download the code from http://code.google.com/p/pr-toolkit/downloads/detail?name=pr-dep-parsing.2010.11.tgz and follow the instructions in the README file to run the model.

References

  • [Klein, 2004] - D. Klein and C. Manning. Corpus-based induction of syntactic structure: Models of dependency and constituency. In Proc. ACL, 2004.
  • [Gillenwater, 2010] - J. Gillenwater, K. Ganchev, J. Graça, F. Pereira, and B. Taskar. Posterior Sparsity in Dependency Grammar Induction. Journal of Machine Learning Research (JMLR).

None: SharedTask (last edited 2012-11-13 13:58:13 by TrevorCohn)