## Description

Q0 (0 pts for a correct answer, -1,000 pts for an incorrect answer; (0, -1,000) pts): A correct answer to the following questions is worth 0 pts. An incorrect answer is worth -1,000 pts, which carries over to other homeworks and exams and can result in an F grade in the course.

- Student interaction with other students / individuals:

- I have copied part of my homework from another student or another person (plagiarism).

- Yes, I discussed the homework with another person but came up with my own answers. Their name(s) is (are)

- No, I did not discuss the homework with anyone

- On using online resources:

- I have copied one of my answers directly from a website (plagiarism).

- I have used online resources to help me answer this question, but I came up with my own answers (you are allowed to use online resources as long as the answer is your own). Here is a list of the websites I have used in this homework:

- I have not used any online resources except the ones provided in the course website.


Homework 5

Q1: Theoretical Questions (2pts)

- (1pt) Detail how we can replace the equivariant representation of the Transformer architecture with an equivariant representation using Janossy pooling and an LSTM architecture as f̃ (see f̃ in Equation (1) in our set representation lecture). Note that Janossy pooling originally gives invariant representations.

Specifically, replace the representation of the i-th word of the m-th self-attention head, z_{i}^{(m)}, of our Transformer lecture with the corresponding LSTM output using Janossy pooling.

Hint: Make sure your representation is equivariant. You will probably want to use the h_{t} representation of the LSTM for your task (see the Backpropagation-through-time lecture, where we define the LSTM variables).

- (1pt) (a) Explain one of the main shortcomings of RNNs and how Bi-directional RNNs help ameliorate this shortcoming. (b) Are Bi-directional LSTMs able to represent well the dependencies between the first and the last elements of a sequence? Why or why not?

Q2: Word Embeddings using 2nd order Markov Chains (2pts)

In class, we have seen that a word2vec model will define word embeddings using the SKIPGRAM model with window K (assumed to be an odd number) as a 1st order Markov chain. The negative log-likelihood (NLL) is (not accounting for the corner cases at the end and beginning of a sentence):

$$
\mathcal{L}_{\text{word2vec}} = -\sum_{t} \sum_{k=1}^{(K-1)/2} \Big( \log p(x_{t-k} \mid x_t; U, V) + \log p(x_{t+k} \mid x_t; U, V) \Big),
$$

and the probability function is |

$$
p(x \mid x_t; U, V) = \frac{1}{Z(x_t; U, V)} \exp\big(\langle U x, \, V x_t \rangle\big),
$$

with appropriately-defined one-hot encodings of x_{i}, i = 1, ..., n, where Z(x_{t}; U, V) is a normalization function.
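As a concrete illustration, here is a minimal NumPy sketch of this probability function. The helper name `skipgram_prob` and the layout of U and V as (d, n) matrices whose columns are word embeddings are assumptions for the example, not part of the assignment's starter code:

```python
import numpy as np

def skipgram_prob(U, V, t_idx):
    """p(x | x_t; U, V): softmax of <U x, V x_t> over all n vocabulary words.

    U, V: (d, n) embedding matrices; for a one-hot x, U @ x picks out a column.
    t_idx: vocabulary index of the center word x_t.
    """
    scores = U.T @ V[:, t_idx]   # <U x, V x_t> for every candidate word x
    scores -= scores.max()       # subtract max for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()     # the denominator plays the role of Z(x_t; U, V)

rng = np.random.default_rng(0)
d, n = 8, 20                     # toy embedding dimension and vocabulary size
U, V = rng.normal(size=(d, n)), rng.normal(size=(d, n))
p = skipgram_prob(U, V, t_idx=3)  # a length-20 probability vector summing to 1
```

The normalization Z(x_t; U, V) never needs to be formed explicitly here: it is exactly the sum in the softmax denominator.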

Task (2pts): Write down the NLL loss function and the probability function of a word embedding model that, again using a window of size K similar to the original SKIPGRAM approach, can embed words based on a 2nd order Markov chain.

Hint 1: We must change the original SKIPGRAM model since the dataset it creates is for a 1st order Markov chain.

Hint 2: There are multiple ways to define the model p. Note that you will need more parameters. Also make sure that your 2nd order Markov chain cannot be described by a 1st order Markov chain.


Q3: Domain Adaptation with Maximum Mean Discrepancy (MMD) (6pts)

In this homework, you are faced with the problem of unsupervised domain adaptation. You are provided a data set of labeled images (CIFAR-10), which we’ll call the source domain. Your task is to predict the labels for a target domain, in which the images have their color shifted. See an example in Figure 1.


Figure 1: Example of images. Top row: source domain, CIFAR-10. Bottom row: target domain, shifted hue.

To aid you in this task, you also have access to unlabeled images from the target domain to use during training. In a more formal description:

Training data: $D^{(tr)} = \{(x^{(source)}_{i}, y^{(source)}_{i})\}_{i=1}^{m} \cup \{x^{(target)}_{i}\}_{i=1}^{n}$, where:

- $(x^{(source)}_{i}, y^{(source)}_{i})$ are sampled iid from $p^{(source)}(x)\, p(y \mid x)$
- $x^{(target)}_{i}$ are sampled iid from $p^{(target)}(x)$

Test data: $D^{(te)} = \{(x^{(target)}_{i}, y^{(target)}_{i})\}_{i=1}^{n^{(te)}}$, where:

- $(x^{(target)}_{i}, y^{(target)}_{i})$ are sampled iid from $p^{(target)}(x)\, p(y \mid x)$.

For this homework, we will use the Maximum Mean Discrepancy (MMD) divergence to help us learn representations which are useful in both domains. Denote by $f(\phi(x; W_\phi); W_f)$ our neural network classifier, where $\phi(x; W_\phi)$ is the part of the neural network which is tasked with learning the representations and $f(\cdot; W_f)$ is the part of the neural network which is tasked with classifying an image given its representation.

We modify our loss function to incorporate the MMD as a regularization term:

$$
L(D^{(tr)}; W_f, W_\phi) = L_{CE}(D^{(tr)}; W_f, W_\phi) + \lambda \, \mathrm{MMD}^2(D^{(tr)}; W_\phi),
$$

where $L_{CE}$ is the usual cross-entropy loss and $\lambda$ is a hyperparameter.

We will be implementing MMD with the Gaussian kernel:

$$
k(z, z') = \exp\!\left(-\frac{\|z - z'\|_2^2}{2\sigma^2}\right),
$$

where $\sigma$ is a hyperparameter of the kernel. We will apply the MMD to the representations computed by $\phi(\cdot; W_\phi)$ for both domains, with the goal of learning a representation that is invariant across the two domains. To compute the MMD term, we will use the unbiased estimator of __Gretton et al. [2012]__ (Eq. 3, Lemma 6).
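To build intuition for this estimator before you adapt it in mmd.py, here is a generic NumPy sketch of the Gaussian kernel and the unbiased two-sample MMD² on raw arrays. The helper names `gaussian_kernel` and `mmd2_unbiased` are hypothetical, and this version operates on plain vectors rather than on the network representations, so it is an illustration of the estimator's structure, not the graded implementation:

```python
import numpy as np

def gaussian_kernel(z1, z2, sigma):
    """k(z, z') = exp(-||z - z'||^2 / (2 sigma^2)) for every pair of rows."""
    sq = ((z1[:, None, :] - z2[None, :, :]) ** 2).sum(-1)  # pairwise squared distances
    return np.exp(-sq / (2.0 * sigma ** 2))

def mmd2_unbiased(X, Z, sigma):
    """Unbiased MMD^2 estimate between samples X (m rows) and Z (n rows),
    following the structure of Gretton et al. [2012], Eq. 3 / Lemma 6."""
    m, n = len(X), len(Z)
    Kxx = gaussian_kernel(X, X, sigma)
    Kzz = gaussian_kernel(Z, Z, sigma)
    Kxz = gaussian_kernel(X, Z, sigma)
    # within-sample terms use off-diagonal entries only (j != i)
    term_x = (Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
    term_z = (Kzz.sum() - np.trace(Kzz)) / (n * (n - 1))
    return term_x + term_z - 2.0 * Kxz.mean()

rng = np.random.default_rng(0)
# identical distributions: estimate should be close to 0 (it can even be negative)
same = mmd2_unbiased(rng.normal(size=(200, 5)), rng.normal(size=(200, 5)), sigma=1.0)
# mean-shifted distribution: estimate should be clearly positive
shifted = mmd2_unbiased(rng.normal(size=(200, 5)),
                        rng.normal(3.0, 1.0, size=(200, 5)), sigma=1.0)
```

Because the estimator is unbiased, the same-distribution value fluctuates around zero and may dip below it; the cross term `Kxz` keeps all m·n pairs, while the two within-sample terms drop the diagonal.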


Action Items:

In your report, include the answers to the following questions:

- (1 pt) Is this problem an example of covariate shift or label shift? Why? (describe through equations)

- (1 pt) In your report, write the MMD equation below, replacing the question marks with the appropriate variables. Assume a batch of m examples from the source domain and a batch of n examples from the target domain. Your answer must use the variables $x^{(source)}$, $x^{(target)}$, $m$, $n$.

$$
\begin{aligned}
\widehat{\mathrm{MMD}}^2(D^{(tr)}; W_\phi)
&= \frac{1}{?(?-1)} \sum_{i=1}^{?} \sum_{j \neq i} k\big(\phi(?_i; W_\phi), \phi(?_j; W_\phi)\big) \\
&\quad + \frac{1}{?(?-1)} \sum_{i=1}^{?} \sum_{j \neq i} k\big(\phi(?_i; W_\phi), \phi(?_j; W_\phi)\big) \\
&\quad - \frac{2}{?\,?} \sum_{i=1}^{?} \sum_{j=1}^{?} k\big(\phi(?_i; W_\phi), \phi(?_j; W_\phi)\big).
\end{aligned}
$$

- (4 pts) Implement the MMD estimator in the file mmd.py and:

- (2 pts) Run the code without using MMD, through the command python hw5.py --reg_str 0. Include in your report:

- Curve of training loss for each epoch.

- Curve of validation accuracy for each epoch, for both source and target domain.

- Final test accuracy of both source and target domain, for the best model.

- (2 pts) Run the code with MMD, through the command python hw5.py --reg_str 10 --kernel_sigma 20. Include in your report:

- Curve of training loss for each epoch.

- Curve of validation accuracy for each epoch, for both source and target domain.

- Final test accuracy of both source and target domain, for the best model.

Hint: If you have GPUs available, pass the option --gpu 0 to use GPU 0. The code will run much faster on a GPU. The scholar cluster has GPUs; see the Piazza post on how to use it.

References

- A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012. URL __http://www.jmlr.org/papers/v13/gretton12a.html__.


Submission Instructions

Please read the instructions carefully. Failing to follow the instructions might lead to loss of points.

Naming convention: [your_purdue_login]_hw-5

All your submission files, including a ReadMe and code, should be included in one folder. The folder should be named with the above naming convention. For example, if my Purdue account is "jsmith123", then for Homework 5 I should name my folder "jsmith123_hw-5".

Remove any unnecessary files in your folder, such as training datasets (the Data folder). Make sure your folder is structured as the tree shown in the Overview section.

Submit: TURNIN INSTRUCTIONS

Please submit your homework files on data.cs.purdue.edu using the turnin command, e.g.:

turnin -c cs690-dpl -p hw-5 jsmith123_hw-5

Please make sure you didn’t use any library/source explicitly forbidden to use. If any such library/source code is used, you will get 0 pts for the coding part of the assignment. If your code doesn’t run on scholar.rcac.purdue.edu, then even if it compiles on another computer, your code will still be considered not-running and the respective part of the assignment will receive 0 pts.
