Assignment 2: Word Vectors Solution


1. Written (30 pts)

(a) Softmax (5pts) Prove that softmax is invariant to a constant offset in the input, that is, for any input vector x and any constant c,

softmax(x) = softmax(x + c)

where x + c means adding the constant c to every dimension of x. Remember that

softmax(x)_i = \frac{e^{x_i}}{\sum_j e^{x_j}}

Note: In practice, we make use of this property and choose c = -\max_i x_i when computing softmax probabilities for numerical stability (i.e., subtracting the maximum element from all elements of x).
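The invariance can also be checked numerically. A minimal sketch (NumPy assumed; not part of the assignment files):

```python
import numpy as np

def softmax(x):
    # direct transcription of the definition above
    e = np.exp(x)
    return e / e.sum()

x = np.array([0.5, 1.5, 3.0])
# shifting every component by the same constant leaves the output unchanged
assert np.allclose(softmax(x), softmax(x + 100.0))
```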

(b) Sigmoid (5pts) Derive the gradient of the sigmoid function and show that it can be rewritten as a function of the function value (i.e., in some expression where only \sigma(x), but not x, is present). Assume that the input x is a scalar for this question. Recall, the sigmoid function is:

\sigma(x) = \frac{1}{1 + e^{-x}}
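The identity you should arrive at, \sigma'(x) = \sigma(x)(1 - \sigma(x)), can be sanity-checked against a finite difference. A small sketch (NumPy assumed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(s):
    # the derivative written purely in terms of s = sigmoid(x)
    return s * (1.0 - s)

# central-difference check at an arbitrary point
x, h = 0.7, 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2.0 * h)
assert abs(numeric - sigmoid_grad(sigmoid(x))) < 1e-8
```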

(c) Word2vec

i. (5pts) Assume you are given a predicted word vector v_c corresponding to the center word c for skip-gram, and the word prediction is made with the softmax function

\hat{y}_o = p(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w=1}^{W} \exp(u_w^\top v_c)}

where o is the expected word, w denotes the w-th word, and u_w (w = 1, ..., W) are the "output" word vectors for all words in the vocabulary. The cross entropy function is defined as:

J_{CE}(o, v_c, U) = CE(y, \hat{y}) = -\sum_i y_i \log(\hat{y}_i)

where the gold vector y is a one-hot vector, the softmax prediction vector \hat{y} is a probability distribution over the output space, and U = [u_1, u_2, ..., u_W] is the matrix of all the output vectors. Assuming this cross entropy cost is applied to the prediction, derive the gradient with respect to v_c.
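Whatever closed form you derive can be verified against finite differences. A minimal sketch (NumPy assumed; here U stores the output vectors u_w as rows, and the analytic gradient U^\top(\hat{y} - y) is the standard result for this cost):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def ce_cost(vc, U, o):
    # U: (W, d) with u_w as rows; o: index of the expected word
    return -np.log(softmax(U @ vc)[o])

def ce_grad_vc(vc, U, o):
    # standard result: dJ/dv_c = U^T (y_hat - y)
    y_hat = softmax(U @ vc)
    y = np.zeros_like(y_hat)
    y[o] = 1.0
    return U.T @ (y_hat - y)

# finite-difference check of the analytic gradient
rng = np.random.default_rng(0)
U, vc, o = rng.normal(size=(5, 3)), rng.normal(size=3), 2
numeric = np.array([
    (ce_cost(vc + h, U, o) - ce_cost(vc - h, U, o)) / 2e-6
    for h in 1e-6 * np.eye(3)
])
assert np.allclose(numeric, ce_grad_vc(vc, U, o), atol=1e-5)
```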

ii. (5pts) As in the previous part, derive the gradients for the "output" word vectors u_w (including u_o).

iii. (5pts) Repeat parts i and ii assuming we are using the negative sampling loss for the predicted vector v_c. Assume that K negative samples (words) are drawn, indexed 1, ..., K respectively. For simplicity of notation, assume o \notin \{1, ..., K\}. Again, for a given word o, use u_o to denote its output vector. The negative sampling loss function in this case is:

J_{\text{neg-sample}}(o, v_c, U) = -\log(\sigma(u_o^\top v_c)) - \sum_{k=1}^{K} \log(\sigma(-u_k^\top v_c))
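A direct transcription of this loss, useful for checking your gradients later (NumPy assumed; U_neg holds the K sampled vectors u_k as rows):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_cost(vc, uo, U_neg):
    # -log sigma(u_o . v_c) - sum_k log sigma(-u_k . v_c)
    return -np.log(sigmoid(uo @ vc)) - np.sum(np.log(sigmoid(-U_neg @ vc)))
```

With all-zero vectors every sigmoid evaluates to 0.5, so the cost is (K + 1) log 2, which makes a quick sanity check.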

iv. (5pts) Derive gradients for all of the word vectors for skip-gram given the previous parts and a set of context words [word_{c-m}, ..., word_c, ..., word_{c+m}], where m is the context size. Denote the "input" and "output" word vectors for word k as v_k and u_k respectively.

Hint: feel free to use F(o, v_c) (where o is the expected word) as a placeholder for the J_{CE}(o, v_c, ...) or J_{\text{neg-sample}}(o, v_c, ...) cost functions in this part; you'll see that this is a useful abstraction for the coding part. That is, your solution may contain terms of the form \frac{\partial F(o, v_c)}{\partial ...}. Recall that for skip-gram, the cost for a context centered around c is:

\sum_{-m \le j \le m,\; j \ne 0} F(w_{c+j}, v_c)
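Because this cost is a plain sum, its gradient with respect to any word vector is simply the sum of the per-term gradients supplied by F; for example:

```latex
\frac{\partial}{\partial v_c} \sum_{-m \le j \le m,\; j \ne 0} F(w_{c+j}, v_c)
  = \sum_{-m \le j \le m,\; j \ne 0} \frac{\partial F(w_{c+j}, v_c)}{\partial v_c}
```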

2. Coding (70 points)

(a) (5pts) Given an input matrix of N rows and D columns, compute the softmax prediction for each row using the optimization in 1.(a). Write your implementation in utils.py.
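One possible shape for this function, a sketch that may not match the starter code's exact signature in utils.py:

```python
import numpy as np

def softmax(X):
    # Row-wise softmax with the c = max_i x_i shift from part 1.(a);
    # keepdims=True lets broadcasting apply the shift per row.
    shifted = X - np.max(X, axis=1, keepdims=True)
    exps = np.exp(shifted)
    return exps / np.sum(exps, axis=1, keepdims=True)
```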

(b) (5pts) Implement the sigmoid function in word2vec.py and test your code.

(c) (45pts) Implement the word2vec models with stochastic gradient descent (SGD).

i. (15pts) Write a helper function normalizeRows in utils.py to normalize rows of a matrix in word2vec.py. In the same file, fill in the implementation for the softmax and negative sampling cost and gradient functions. Then, fill in the implementation of the cost and gradient functions for the skip-gram model. When you are done, test your implementation by running python word2vec.py.
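A sketch of what normalizeRows might look like, assuming L2 normalization (which is the usual choice for word vectors here):

```python
import numpy as np

def normalizeRows(X):
    # Scale each row of X to unit L2 norm
    return X / np.linalg.norm(X, axis=1, keepdims=True)
```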

ii. (15pts) Complete the implementation for your SGD optimizer in sgd.py. Test your implementation by running python sgd.py.
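The core of the optimizer is a very small loop. A minimal sketch (the real sgd.py will have extra arguments such as checkpointing or learning-rate annealing; the names here are illustrative):

```python
def sgd(f, x0, step, iterations):
    """f maps x to (cost, gradient); returns the final iterate."""
    x = x0
    for _ in range(iterations):
        cost, grad = f(x)
        x = x - step * grad
    return x
```

For example, minimizing f(x) = x^2 (gradient 2x) from x0 = 5.0 with step 0.1 drives x toward 0.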

(d) (15pts) Implement the k-nearest neighbors algorithm, which will be used for analysis. The algorithm receives a vector, a matrix, and an integer k, and returns the k indices of the matrix's rows that are closest to the vector. Use cosine similarity as the distance metric (https://en.wikipedia.org/wiki/Cosine_similarity). Implement the algorithm in knn.py. Print out 10 examples: each example is one word and its neighbors.
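A sketch of the requested function (NumPy assumed; note that cosine similarity is a similarity, so the closest rows are the k largest values):

```python
import numpy as np

def knn(vector, matrix, k):
    # Cosine similarity between the query vector and every row of matrix,
    # then the indices of the k most similar rows.
    sims = (matrix @ vector) / (
        np.linalg.norm(matrix, axis=1) * np.linalg.norm(vector))
    return np.argsort(-sims)[:k]
```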