This assignment is due Wednesday, October 4 at 11:59PM.
1. Goals
Through this assignment you will:
- Explore the basics of automatic parsing.
- Begin to gain some familiarity with the Natural Language Toolkit (NLTK).
- Gain some experience with the cluster and Condor.
2. Background
Please review the class slides and readings in the textbook on context-free grammars. Also, see Section 8.3 of the NLTK Book for examples of grammars and configuration of the included parsers. We’ll get to the later parts of that chapter soon.
3. Parsing
Create a program to parse the test sentences based on the provided grammar and analyze the results. Specifically, your program should:
- Load the grammar
- Build a parser for the grammar using nltk.parse.EarleyChartParser
- Read in the example sentences
- For each example sentence, output to a file:
  - The sentence itself,
  - The simple bracketed structure parse(s), and
  - The number of parses for that sentence.
- Finally, print the average number of parses per sentence obtained by the grammar.
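The steps above can be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the function names build_parser and write_parses are hypothetical, sentences are tokenized by splitting on whitespace (your data may need a real tokenizer), a sentence whose words are not covered by the grammar is counted as having zero parses, and the "Number of parses" and "Average parses per sentence" labels are placeholders. The format shown in toy_output.txt is authoritative.

```python
def build_parser(grammar_path):
    """Load an NLTK .cfg grammar and wrap it in an Earley chart parser."""
    # Imported lazily so the reporting helper below does not itself need NLTK.
    import nltk

    with open(grammar_path, encoding="utf-8") as handle:
        grammar = nltk.CFG.fromstring(handle.read())
    return nltk.parse.EarleyChartParser(grammar)


def write_parses(parser, sentences, out):
    """Write each sentence, its parses, and its parse count; return the average."""
    total = 0
    count = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip blank lines
        tokens = sentence.split()  # assumption: whitespace tokenization suffices
        try:
            trees = list(parser.parse(tokens))
        except ValueError:
            # NLTK raises ValueError when a token is not covered by the grammar;
            # treating that as zero parses is an assumption of this sketch.
            trees = []
        out.write(sentence + "\n")
        for tree in trees:
            # Collapse the pretty-printed tree onto a single bracketed line.
            out.write(" ".join(str(tree).split()) + "\n")
        # Placeholder label: match toy_output.txt in your own program.
        out.write("Number of parses: %d\n\n" % len(trees))
        total += len(trees)
        count += 1
    average = total / count if count else 0.0
    out.write("Average parses per sentence: %.3f\n" % average)
    return average
```

Keeping parser construction separate from report writing means the counting and averaging logic can be checked with any object that offers a parse(tokens) method, before the real grammar is loaded.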
4. Programming
Create a program named hw1_parse.sh to perform the parsing as described above, invoked as:
hw1_parse.sh <grammar_file> <test_sentence_file> <output_file>
where
- <grammar_file> is the name of the file holding the grammar rules in the NLTK .cfg format
- <test_sentence_file> is the name of the file holding the set of sentences to parse, one sentence per line
- <output_file> is the name of the output file for your system
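If hw1_parse.sh simply forwards its three arguments to a Python program, the argument handling on the Python side might look like the sketch below. The script name hw1_parse.py and the function parse_args are hypothetical; only the three positional arguments and their order come from the invocation above.

```python
import argparse


def parse_args(argv=None):
    """Read the three positional arguments in the order hw1_parse.sh receives them."""
    parser = argparse.ArgumentParser(
        prog="hw1_parse.py",  # hypothetical name for the script the wrapper calls
        description="Parse test sentences with a CFG and report parse counts.",
    )
    parser.add_argument("grammar_file", help="grammar rules in NLTK .cfg format")
    parser.add_argument("test_sentence_file", help="sentences to parse, one per line")
    parser.add_argument("output_file", help="file to write the parses and counts to")
    return parser.parse_args(argv)


# Example: the arguments for the final run on the provided test data.
args = parse_args(["grammar.cfg", "sentences.txt", "hw1_parse.out"])
print(args.grammar_file, args.test_sentence_file, args.output_file)
```

Passing argv=None makes argparse read the real command line, so the same function serves both the wrapper script and quick checks like the one above.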
5. Files
In the dropbox:
You will find the following files in the dropbox folder /mnt/dropbox/23-24/571/hw1:
- toy.cfg
- toy_sentences.txt
- toy_output.txt
These files contain a toy grammar, some toy sentences, and the expected output format described above.
You will also find:
- sentences.txt
- grammar.cfg
These two files are the test data: run your parser on them to generate the hw1_parse.out file.
Files to Submit:
- hw1.tar.gz, containing:
  - hw1_parse.sh
    - The shell script described above.
  - Your source code/binaries invoked by the shell script.
  - hw1_parse.out
    - The output file described above, as run using grammar.cfg and sentences.txt.