This assignment is due Wednesday, October 23 at 11PM.

  0. Goals
  1. Inducing a Probabilistic Context-Free Grammar
  2. Converting from CKY to Probabilistic CKY
  3. Evaluating the PCKY Parser
  4. Improving the Parser
  5. Combining it All
  6. Files
  7. Handing In Your Work

0. Goals

Through this assignment you will:

NOTE: You may work in teams of two (2) on this assignment. If you do so:

Background

Please review the class slides and readings in the textbook on the probabilistic Cocke-Kasami-Younger algorithm, optimization, and evaluation.

1. Inducing a Probabilistic Context-Free Grammar

Based on the material in the lectures and text, implement a procedure that takes a set of context-free grammar parses of sentences (a small treebank) and induces a probabilistic context-free grammar from them.

Your algorithm must create a grammar of the form:

A -> B C [0.38725]

All productions must have an associated probability.
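One common way to obtain these probabilities is the maximum likelihood estimate: count how often each production occurs in the treebank and divide by the total count of its left-hand side. The following is a minimal sketch of that idea, not the required implementation. The helper names (`parse_tree`, `count_rules`, `induce_pcfg`, `format_rule`) are hypothetical, and the sketch assumes one bracketed parse per string, with trees already in the form you want in the grammar. How terminals are written in the output file is not specified here; the sketch leaves them unquoted.

```python
from collections import Counter


def parse_tree(text):
    """Read one bracketed parse, e.g. "(S (NP (DT the) (NN dog)) (VP (VB barks)))".

    Returns nested (label, children) tuples; words are plain strings.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] == "(":
            pos += 1                      # skip "("
            label = tokens[pos]
            pos += 1
            children = []
            while tokens[pos] != ")":
                children.append(read())
            pos += 1                      # skip ")"
            return (label, children)
        word = tokens[pos]
        pos += 1
        return word

    return read()


def count_rules(tree, counts):
    """Add one count for every production used in the tree."""
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    counts[(label, rhs)] += 1
    for child in children:
        if not isinstance(child, str):
            count_rules(child, counts)


def induce_pcfg(tree_strings):
    """Return {(lhs, rhs): probability} with P = count(lhs -> rhs) / count(lhs)."""
    counts = Counter()
    for text in tree_strings:
        count_rules(parse_tree(text), counts)
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


def format_rule(lhs, rhs, prob):
    """Render a production in the 'A -> B C [0.38725]' form shown above."""
    return f"{lhs} -> {' '.join(rhs)} [{prob}]"
```

By construction, the probabilities of all productions that share a left-hand side sum to 1, which is a useful sanity check on the induced grammar.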

Specifically, the program should:

1.1 Programming

Create a program named hw4_topcfg.sh to perform PCFG induction, invoked as:

hw4_topcfg.sh <treebank_filename> <output_PCFG_file>

where:

2. Converting from CKY to Probabilistic CKY

Implement a probabilistic version of the CKY parsing algorithm. Given a probabilistic context-free grammar and an input string, the algorithm should return the highest probability parse tree for that input string.

You should follow the approach outlined in the textbook and course notes. You may adapt the CKY implementation that you created for HW#3. You may use any language that you like, in keeping with the course policies.
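As a rough illustration of the change from CKY to PCKY, the sketch below keeps, for each span and nonterminal, only the highest-scoring analysis and a backpointer to rebuild it. It is a sketch under stated assumptions, not the required design: the grammar is assumed to be in Chomsky Normal Form, the function name `pcky` and the two dictionary layouts are hypothetical, and the start symbol `S` is an assumption that may differ in your treebank. Log probabilities are used to avoid numerical underflow on long sentences.

```python
import math


def pcky(words, lexical, binary, start="S"):
    """Return (best_parse_string, probability), or (None, None) if no parse exists.

    lexical: {word: [(A, prob), ...]}    for rules A -> word
    binary:  {(B, C): [(A, prob), ...]}  for rules A -> B C
    """
    n = len(words)
    # best[i][j] maps a nonterminal to (log probability, backpointer)
    # for the span words[i:j].
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]

    # Width-1 spans come from lexical rules.
    for i, word in enumerate(words):
        for A, p in lexical.get(word, []):
            best[i][i + 1][A] = (math.log(p), word)

    # Wider spans combine two adjacent sub-spans, keeping the maximum.
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            j = i + width
            cell = best[i][j]
            for k in range(i + 1, j):
                for B, (left_score, _) in best[i][k].items():
                    for C, (right_score, _) in best[k][j].items():
                        for A, p in binary.get((B, C), []):
                            score = math.log(p) + left_score + right_score
                            if A not in cell or score > cell[A][0]:
                                cell[A] = (score, (k, B, C))

    if start not in best[0][n]:
        return None, None

    def build(i, j, A):
        """Follow backpointers to produce a bracketed parse string."""
        _, back = best[i][j][A]
        if isinstance(back, str):
            return f"({A} {back})"
        k, B, C = back
        return f"({A} {build(i, k, B)} {build(k, j, C)})"

    return build(0, n, start), math.exp(best[0][n][start][0])
```

In this sketch a sentence containing a word with no lexical rule yields `(None, None)`, since no constituent can cover that word.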

Specifically, your program should:

2.1 Programming

Create a program named hw4_parser.sh to perform PCKY parsing, invoked as:

hw4_parser.sh <input_PCFG_file> <test_sentence_filename> <output_parse_filename>

where:

Note: The test sentences may include words not seen in training; this happens in real life. In a baseline system, these may fail to parse.

3. Evaluating the PCKY Parser

Use the evalb program to evaluate your parser.

The executable may be found in /dropbox/19-20/571/hw4/tools/ along with the required parameter file. It should be run as:

$dir/evalb -p $dir/COLLINS.prm <gold_standard_parse_file> <hypothesis_parse_file>

where:

4. Improving the Parser

You will also need to improve your baseline parser. You can improve the parser either by:

You will either:

Create a second script, either:

  1. hw4_improved_parser.sh — if you are modifying the parsing algorithm.
  2. hw4_improved_induction.sh — if you are modifying the induction algorithm.

hw4_improved_parser.sh <input_PCFG_file> <test_sentence_filename> <output_parse_filename>

hw4_improved_induction.sh <treebank_filename> <output_PCFG_file>

Re-run the evaluation script on your new parses to demonstrate your improvement (re-parsing using your new PCFG file if necessary).

5. Combining it All

Finally, write a script hw4_run.sh that will call all the components of the system:

  1. Grammar Induction
  2. PCKY Parsing
  3. Evaluation of Baseline System
  4. Improved PCKY Parsing
  5. Evaluation of Improved System

Calling specification:

hw4_run.sh <treebank_filename> <output_PCFG_file> \
           <test_sentence_filename> <baseline_parse_output_filename> \
           <input_PCFG_file> \
           <improved_parse_output_filename> \
           <baseline_eval> <improved_eval>

where <input_PCFG_file> is interpreted as follows:

  1. If you have not modified the induction process:
    • You should take this argument in your script, but may ignore it and re-use the original induced PCFG output.
  2. If you have modified the induction process:
    • This argument should specify the output PCFG of the modified induction process.
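To check how the eight arguments map onto the five steps, it can help to write the plan down before scripting it. The sketch below only builds the command lines; it does not run anything and is not a substitute for hw4_run.sh. It covers the case where the parser, not the induction, was modified, so <input_PCFG_file> is accepted but ignored. The function name `plan_run`, the `gold_file` parameter, and the idea that each evaluation file captures evalb's standard output are assumptions for illustration.

```python
def plan_run(treebank, pcfg, test_sents, baseline_out,
             improved_pcfg, improved_out, baseline_eval, improved_eval,
             tools_dir, gold_file):
    """Return the five steps as (argv, stdout_file) pairs.

    stdout_file is None when the step writes its own output file.
    improved_pcfg is accepted but unused: this variant assumes only the
    parser was modified, so the original induced PCFG is re-used.
    gold_file is a hypothetical path to the gold-standard parses.
    """
    evalb = [f"{tools_dir}/evalb", "-p", f"{tools_dir}/COLLINS.prm", gold_file]
    return [
        # 1. Grammar induction
        (["hw4_topcfg.sh", treebank, pcfg], None),
        # 2. Baseline PCKY parsing
        (["hw4_parser.sh", pcfg, test_sents, baseline_out], None),
        # 3. Evaluation of the baseline system
        (evalb + [baseline_out], baseline_eval),
        # 4. Improved PCKY parsing
        (["hw4_improved_parser.sh", pcfg, test_sents, improved_out], None),
        # 5. Evaluation of the improved system
        (evalb + [improved_out], improved_eval),
    ]
```

If you modified the induction instead, step 4 would first produce the new grammar at <input_PCFG_file> and then parse with it.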

6. Files

Training, Test, Evaluation, Example Data

You will use the following files, derived from the Air Travel Information System (ATIS) subset of the Penn Treebank as described in class. All files can be found on patas in /dropbox/19-20/571/hw4/data/, unless otherwise mentioned:

Submission Files

7. Handing In Your Work
