This assignment is due Wednesday, November 13 at 11:59PM.

  1. Goals
  2. Background
  3. Creating and Evaluating Count-based Models of Distributional Semantic Similarity
  4. Programming
  5. Comparison to CBOW
  6. Files

1. Goals

Through this assignment you will:

  • Implement and evaluate a count-based model of distributional semantic similarity.
  • Build and evaluate a predictive CBOW model of word similarity using word2vec.
  • Compare both models against human judgments of word-pair similarity using Spearman correlation.


2. Background

Please review the class slides and readings in the textbook on distributional semantics and models. The count-based and word2vec models are to be implemented separately, so that you may do the more extensive coding required for the count-based distributional model in your preferred programming language and then use the Python-based gensim package for the word2vec implementation.


3. Creating and Evaluating Count-based Models of Distributional Semantic Similarity

Implement a program to create and evaluate a distributional model of word similarity based on local context term cooccurrence. Your program should (a minimal end-to-end sketch follows this list):

  1. Read in a corpus that will form the basis of the distributional model and perform basic preprocessing.
    • All words should be lowercase.
    • Punctuation should be removed, both from individual words and from the corpus as a whole (e.g. "," should not appear as a token).
      Only alphanumeric characters (a-z, 0-9) should remain.
      Note: in many regex packages, including Python's, \w matches a single word character (alphanumeric or underscore), while \W matches a single non-word character.
  2. For each word in the corpus:
    • Create a vector representation based on word cooccurrence in a specified window around the word.
    • Each element in the vector should receive a weight according to the specified weighting scheme.
  3. Read in a file of human judgments of similarity between pairs of words. (See Files Section)
  4. For each word pair in the file:
    • For each word in the word pair:
      • Print the word and its ten (10) highest weighted features (words) and their weights, in the form:
        • word feature1:weight1 feature2:weight2 ….
    • Compute the similarity between the two words, based on cosine similarity (e.g. using scipy.spatial.distance.cosine; note that this returns a distance, not a similarity, so read its documentation carefully).
    • Print out the similarity as: wd1,wd2:similarity
  5. Lastly, compute and print the Spearman correlation between the similarity scores you have computed and the human-generated similarity scores in the provided file as:
    correlation:computed_correlation.
    You may use any available software for computing the correlation. In Python, you can use spearmanr from scipy.stats.
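
To make the steps above concrete, here is a minimal end-to-end sketch in Python. The helper names, the raw-frequency weighting, and the comma-separated judgment-file format are assumptions for illustration only; your implementation must use the weighting scheme and file formats actually specified for the assignment.

import re
from collections import Counter, defaultdict
from scipy.spatial.distance import cosine
from scipy.stats import spearmanr

# Step 1: lowercase each token and strip non-alphanumeric characters.
def preprocess(tokens):
    cleaned = [re.sub(r'[^a-z0-9]', '', t.lower()) for t in tokens]
    return [t for t in cleaned if t]  # drop tokens that were pure punctuation

# Step 2: cooccurrence vectors over a +/- window, with raw-frequency weights
# (assumed here; substitute the weighting scheme given on the command line).
def build_vectors(sentences, window):
    vectors = defaultdict(Counter)
    for sent in sentences:
        for i, word in enumerate(sent):
            for j in range(max(0, i - window), min(len(sent), i + window + 1)):
                if j != i:
                    vectors[word][sent[j]] += 1
    return vectors

# Cosine similarity between two sparse Counter vectors.
def cosine_sim(v1, v2):
    keys = sorted(set(v1) | set(v2))
    a = [v1.get(k, 0) for k in keys]
    b = [v2.get(k, 0) for k in keys]
    return 1 - cosine(a, b)  # scipy's cosine() is a distance, not a similarity

# Steps 3-5: read judgments, print top features and similarities, correlate.
def evaluate(vectors, judgment_filename):
    human, computed = [], []
    with open(judgment_filename) as f:
        for line in f:
            w1, w2, score = line.strip().split(',')  # assumed comma-separated
            for w in (w1, w2):
                top10 = vectors[w].most_common(10)
                print(w, ' '.join(f'{feat}:{wt}' for feat, wt in top10))
            sim = cosine_sim(vectors[w1], vectors[w2])
            print(f'{w1},{w2}:{sim}')
            human.append(float(score))
            computed.append(sim)
    rho, _ = spearmanr(human, computed)
    print(f'correlation:{rho}')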


4. Programming

Create a program hw7_dist_similarity.sh that implements the creation and evaluation of the distributional similarity model as described above and is invoked as:
hw7_dist_similarity.sh <window> <weighting> <judgment_filename> <output_filename>, where:

  • <window>: an integer specifying the size of the context window used for cooccurrence counting.
  • <weighting>: the weighting scheme to apply to the elements of the cooccurrence vectors.
  • <judgment_filename>: the name of the input file of human judgments of word-pair similarity.
  • <output_filename>: the name of the file to which the output should be written.

In this assignment, you should use the Brown corpus provided with NLTK in /corpora/nltk/nltk-data/corpora/brown/ as the source of cooccurrence information. The corpus files are whitespace-tokenized, but all tokens are of the form “word/POS”.

If you choose to use NLTK, you may use the Brown corpus reader as in:
brown_words = nltk.corpus.brown.words()
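
Note that the plain NLTK readers (brown.words(), brown.sents()) already return tokens with the POS tags removed; brown.tagged_words() is the variant that keeps them.

If you read the raw Brown files directly instead, you will need to strip the POS tags yourself. A hedged sketch, assuming the standard Brown file naming (ca01, cb02, ...) and treating each nonblank line as a sentence:

import glob

# Read the raw Brown files and strip the trailing /POS tag from each token;
# rsplit handles the rare tokens that contain an internal slash.
sentences = []
for path in sorted(glob.glob('/corpora/nltk/nltk-data/corpora/brown/c[a-z][0-9][0-9]')):
    with open(path) as f:
        for line in f:
            tokens = [t.rsplit('/', 1)[0] for t in line.split()]
            if tokens:
                sentences.append(tokens)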


5. Comparison to Continuous Bag of Words (CBOW) using Word2Vec

Implement a program to evaluate a predictive CBOW distributional model of word similarity using Word2Vec. Your program should (a sketch follows this list):

  1. Read in a corpus that will form the basis of the predictive CBOW distributional model and perform basic preprocessing.
    • All words should be lowercase.
    • Punctuation should be removed.
  2. Build a continuous bag of words model using a standard implementation package, such as gensim’s word2vec.
    An example call would be: model = gensim.models.word2vec.Word2Vec(sents, vector_size=100, window=2, min_count=1, workers=1), where sents is a list of lists of tokens. See the gensim word2vec documentation for more details.
  3. Read in a file of human judgments of similarity between pairs of words.
  4. For each word pair in the file:
    • Compute the similarity between the two words using the word2vec model.
    • Print out the similarity as: wd1,wd2:similarity
  5. Lastly, compute and print the Spearman correlation between the similarity scores you have computed and the human-generated similarity scores in the provided file as:
    • correlation:computed_correlation.
    • You may use any available software for computing the correlation. In Python, you can use spearmanr from scipy.stats.
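
A minimal sketch of steps 2-5, assuming sents already holds the preprocessed corpus as a list of token lists, judgment_filename comes from the command line, and the same comma-separated judgment format as above:

import gensim
from scipy.stats import spearmanr

# Train a CBOW model; sg=0 (the default) selects CBOW rather than skip-gram.
model = gensim.models.word2vec.Word2Vec(
    sents, vector_size=100, window=2, min_count=1, workers=1)

human, computed = [], []
with open(judgment_filename) as f:
    for line in f:
        w1, w2, score = line.strip().split(',')  # assumed comma-separated
        sim = model.wv.similarity(w1, w2)  # cosine similarity of embeddings
        print(f'{w1},{w2}:{sim}')
        human.append(float(score))
        computed.append(float(sim))

rho, _ = spearmanr(human, computed)
print(f'correlation:{rho}')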

NB: If you want to play with pre-trained embeddings before doing your own training, try:

import gensim.downloader as api
wv = api.load('word2vec-google-news-300')

and have a look at the gensim tutorial.
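
For example, the loaded KeyedVectors object supports similarity queries directly (the word choices here are arbitrary):

sim = wv.similarity('dog', 'cat')           # cosine similarity between pretrained vectors
neighbors = wv.most_similar('dog', topn=5)  # five nearest neighbors by cosine similarity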


Programming #2

Create a program hw7_cbow_similarity.sh that implements the creation and evaluation of the Continuous Bag-of-Words similarity model as described above and is invoked as:
hw7_cbow_similarity.sh <window> <judgment_filename> <output_filename>, where:

  • <window>: an integer specifying the context window size passed to word2vec.
  • <judgment_filename>: the name of the input file of human judgments of word-pair similarity.
  • <output_filename>: the name of the file to which the output should be written.


6. Files

Test and Example Data Files

Aside from the Brown corpus, all files related to this assignment may be found on patas in /mnt/dropbox/24-25/571/hw7/.

Submission Files
