CS224N Natural Language Processing with Deep Learning Assignment 4
Course homepage: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/
Lecture videos: https://www.bilibili.com/video/av46216519?from=search&seid=13229282510647565239
1. Neural Machine Translation with RNNs
(a)
def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
    @param sents (list[list[str]]): list of sentences, where each sentence
        is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    """
    sents_padded = []

    ### YOUR CODE HERE (~6 Lines)
    max_l = max(len(sent) for sent in sents)
    for sent in sents:
        # Copy the sentence and extend it with pad tokens up to the max length.
        sents_padded.append(sent + [pad_token] * (max_l - len(sent)))
    ### END YOUR CODE

    return sents_padded
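A quick check of the behavior (a minimal example of my own, not part of the assignment's test suite):

sents = [["hello", "world", "!"], ["hi"]]
print(pad_sents(sents, "<pad>"))
# [['hello', 'world', '!'], ['hi', '<pad>', '<pad>']]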
(b)
def __init__(self, embed_size, vocab):
    """
    Init the Embedding layers.
    @param embed_size (int): Embedding size (dimensionality)
    @param vocab (Vocab): Vocabulary object containing src and tgt languages
        See vocab.py for documentation.
    """
    super(ModelEmbeddings, self).__init__()
    self.embed_size = embed_size

    # default values
    self.source = None
    self.target = None

    src_pad_token_idx = vocab.src['<pad>']
    tgt_pad_token_idx = vocab.tgt['<pad>']

    ### YOUR CODE HERE (~2 Lines)
    ### TODO - Initialize the following variables:
    ###     self.source (Embedding Layer for source language)
    ###     self.target (Embedding Layer for target language)
    ###
    ### Note:
    ###     1. `vocab` object contains two vocabularies:
    ###            `vocab.src` for source
    ###            `vocab.tgt` for target
    ###     2. You can get the length of a specific vocabulary by running:
    ###             `len(vocab.<specific_vocabulary>)`
    ###     3. Remember to include the padding token for the specific vocabulary
    ###        when creating your Embedding.
    ###
    ### Use the following docs to properly initialize these variables:
    ###     Embedding Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
    self.source = nn.Embedding(len(vocab.src), self.embed_size, padding_idx=src_pad_token_idx)
    self.target = nn.Embedding(len(vocab.tgt), self.embed_size, padding_idx=tgt_pad_token_idx)
    ### END YOUR CODE
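Incidentally, `padding_idx` pins the padding token's row of the embedding matrix to the zero vector, and its gradient stays zero during training. A tiny standalone illustration (toy sizes of my own choosing):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, padding_idx=0)
print(emb.weight[0])                  # all zeros
print(emb(torch.tensor([0, 3]))[0])   # the <pad> row stays all zeros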
(c)
def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
    """ Init NMT Model.
    @param embed_size (int): Embedding size (dimensionality)
    @param hidden_size (int): Hidden Size (dimensionality)
    @param vocab (Vocab): Vocabulary object containing src and tgt languages
        See vocab.py for documentation.
    @param dropout_rate (float): Dropout probability, for attention
    """
    super(NMT, self).__init__()
    self.model_embeddings = ModelEmbeddings(embed_size, vocab)
    self.hidden_size = hidden_size
    self.dropout_rate = dropout_rate
    self.vocab = vocab

    # default values
    self.encoder = None
    self.decoder = None
    self.h_projection = None
    self.c_projection = None
    self.att_projection = None
    self.combined_output_projection = None
    self.target_vocab_projection = None
    self.dropout = None

    ### YOUR CODE HERE (~8 Lines)
    ### TODO - Initialize the following variables:
    ###     self.encoder (Bidirectional LSTM with bias)
    ###     self.decoder (LSTM Cell with bias)
    ###     self.h_projection (Linear Layer with no bias), called W_{h} in the PDF.
    ###     self.c_projection (Linear Layer with no bias), called W_{c} in the PDF.
    ###     self.att_projection (Linear Layer with no bias), called W_{attProj} in the PDF.
    ###     self.combined_output_projection (Linear Layer with no bias), called W_{u} in the PDF.
    ###     self.target_vocab_projection (Linear Layer with no bias), called W_{vocab} in the PDF.
    ###     self.dropout (Dropout Layer)
    ###
    ### Use the following docs to properly initialize these variables:
    ###     LSTM:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
    ###     LSTM Cell:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
    ###     Linear Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
    ###     Dropout Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout
    self.encoder = nn.LSTM(embed_size, hidden_size, bias=True, bidirectional=True)
    self.decoder = nn.LSTMCell(embed_size + hidden_size, hidden_size, bias=True)
    self.h_projection = nn.Linear(2 * hidden_size, hidden_size, bias=False)
    self.c_projection = nn.Linear(2 * hidden_size, hidden_size, bias=False)
    self.att_projection = nn.Linear(2 * hidden_size, hidden_size, bias=False)
    self.combined_output_projection = nn.Linear(3 * hidden_size, hidden_size, bias=False)
    self.target_vocab_projection = nn.Linear(hidden_size, len(self.vocab.tgt), bias=False)
    self.dropout = nn.Dropout(self.dropout_rate)
    ### END YOUR CODE
Note that the decoder input at each step is $\overline{\mathbf{y}_t} = [\mathbf{y}_t; \mathbf{o}_{t-1}]$, the concatenation of the target word embedding and the previous combined output, so the decoder's input_size is embed_size + hidden_size.
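A minimal shape check of that concatenation (toy sizes; assume embed_size = 3 and hidden_size = 5):

import torch

y_t = torch.zeros(4, 3)              # (batch, embed_size)
o_prev = torch.zeros(4, 5)           # (batch, hidden_size)
Ybar_t = torch.cat((y_t, o_prev), dim=1)
print(Ybar_t.shape)                  # torch.Size([4, 8]) = (batch, embed_size + hidden_size)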
(d)
def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
    """ Apply the encoder to source sentences to obtain encoder hidden states.
        Additionally, take the final states of the encoder and project them to obtain initial states for decoder.
    @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b), where
        b = batch_size, src_len = maximum source sentence length. Note that
        these have already been sorted in order of longest to shortest sentence.
    @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
    @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
        b = batch size, src_len = maximum source sentence length, h = hidden size.
    @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
        hidden state and cell.
    """
    enc_hiddens, dec_init_state = None, None

    ### YOUR CODE HERE (~ 8 Lines)
    ### TODO:
    ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
    ###         src_len = maximum source sentence length, b = batch size, e = embedding size. Note
    ###         that there is no initial hidden state or cell for the decoder.
    ###     2. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
    ###         - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
    ###         - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
    ###         - Note that the shape of the tensor returned by the encoder is (src_len, b, h*2) and we want to
    ###           return a tensor of shape (b, src_len, h*2) as `enc_hiddens`.
    ###     3. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
    ###         - `init_decoder_hidden`:
    ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
    ###             This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size
    ###         - `init_decoder_cell`:
    ###             `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    ###             Apply the c_projection layer to this in order to compute init_decoder_cell.
    ###             This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size
    ###
    ### See the following docs, as you may need to use some of the following functions in your implementation:
    ###     Pack the padded sequence X before passing to the encoder:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence
    ###     Pad the packed sequence, enc_hiddens, returned by the encoder:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_packed_sequence
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tensor Permute:
    ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute
    X = self.model_embeddings.source(source_padded)
    X_pack = nn.utils.rnn.pack_padded_sequence(X, source_lengths)
    enc_hiddens, (last_hidden, last_cell) = self.encoder(X_pack)
    enc_hiddens = nn.utils.rnn.pad_packed_sequence(enc_hiddens, batch_first=True)[0]
    # Concatenate the forward and backward final states, then project from 2h down to h.
    hdec = self.h_projection(torch.cat((last_hidden[0], last_hidden[1]), 1))
    cdec = self.c_projection(torch.cat((last_cell[0], last_cell[1]), 1))
    dec_init_state = (hdec, cdec)
    ### END YOUR CODE

    return enc_hiddens, dec_init_state
The corresponding formulas from the assignment PDF:
$$\mathbf{h}_i^{enc} = [\overleftarrow{\mathbf{h}_i^{enc}}; \overrightarrow{\mathbf{h}_i^{enc}}], \quad 1 \le i \le m$$
$$\mathbf{h}_0^{dec} = \mathbf{W}_h[\overleftarrow{\mathbf{h}_1^{enc}}; \overrightarrow{\mathbf{h}_m^{enc}}]$$
$$\mathbf{c}_0^{dec} = \mathbf{W}_c[\overleftarrow{\mathbf{c}_1^{enc}}; \overrightarrow{\mathbf{c}_m^{enc}}]$$
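To build intuition for the pack/pad round trip used in `encode`, here is a minimal self-contained example (toy shapes of my own choosing):

import torch
import torch.nn as nn

# Two sequences of lengths 3 and 2, sorted longest-to-shortest,
# shaped (src_len, b, e) = (3, 2, 4).
X = torch.randn(3, 2, 4)
lengths = [3, 2]

encoder = nn.LSTM(4, 5, bidirectional=True)
packed = nn.utils.rnn.pack_padded_sequence(X, lengths)
out_packed, (last_hidden, last_cell) = encoder(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)

print(out.shape)          # torch.Size([2, 3, 10]) = (b, src_len, 2*h)
print(last_hidden.shape)  # torch.Size([2, 2, 5])  = (directions, b, h)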
Test with the following command:
python sanity_check.py 1d
This produces:
Running Sanity Check for Question 1d: Encode
--------------------------------------------------------------------------------
torch.Size([5, 20, 6]) torch.Size([5, 20, 6])
enc_hiddens Sanity Checks Passed!
dec_init_state[0] Sanity Checks Passed!
dec_init_state[1] Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1d: Encode!
--------------------------------------------------------------------------------
(e)
Note that $\overline{\mathbf{y}_t}$ is the concatenation of $\mathbf{y}_t$ and $\mathbf{o}_{t-1}$, and that $\mathbf{o}_0$ is initialized to the zero vector (see `o_prev` below). The code:
def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
           dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
    """Compute combined output vectors for a batch.
    @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
        b = batch size, src_len = maximum source sentence length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
        b = batch size, src_len = maximum source sentence length.
    @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
    @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
        tgt_len = maximum target sentence length, b = batch size.
    @returns combined_outputs (Tensor): combined output tensor (tgt_len, b, h), where
        tgt_len = maximum target sentence length, b = batch_size, h = hidden size
    """
    # Chop off the <END> token for max length sentences.
    target_padded = target_padded[:-1]

    # Initialize the decoder state (hidden and cell)
    dec_state = dec_init_state

    # Initialize previous combined output vector o_{t-1} as zero
    batch_size = enc_hiddens.size(0)
    o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

    # Initialize a list we will use to collect the combined output o_t on each step
    combined_outputs = []

    ### YOUR CODE HERE (~9 Lines)
    ### TODO:
    ###     1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
    ###         which should be shape (b, src_len, h),
    ###         where b = batch size, src_len = maximum source length, h = hidden size.
    ###         This is applying W_{attProj} to h^enc, as described in the PDF.
    ###     2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
    ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
    ###     3. Use the torch.split function to iterate over the time dimension of Y.
    ###         Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
    ###             - Squeeze Y_t into a tensor of dimension (b, e).
    ###             - Construct Ybar_t by concatenating Y_t with o_prev.
    ###             - Use the step function to compute the Decoder's next (cell, state) values
    ###               as well as the new combined output o_t.
    ###             - Append o_t to combined_outputs
    ###             - Update o_prev to the new o_t.
    ###     4. Use torch.stack to convert combined_outputs from a list length tgt_len of
    ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
    ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
    ###
    ### Note:
    ###    - When using the squeeze() function make sure to specify the dimension you want to squeeze
    ###      over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    ###
    ### Use the following docs to implement this functionality:
    ###     Zeros Tensor:
    ###         https://pytorch.org/docs/stable/torch.html#torch.zeros
    ###     Tensor Splitting (iteration):
    ###         https://pytorch.org/docs/stable/torch.html#torch.split
    ###     Tensor Dimension Squeezing:
    ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tensor Stacking:
    ###         https://pytorch.org/docs/stable/torch.html#torch.stack
    enc_hiddens_proj = self.att_projection(enc_hiddens)
    Y = self.model_embeddings.target(target_padded)
    for Y_t in torch.split(Y, 1):
        # Specify dim=0 so the batch dimension survives when batch_size = 1.
        y_t = torch.squeeze(Y_t, dim=0)
        Ybar_t = torch.cat((y_t, o_prev), dim=-1)
        dec_state, combined_output, e_t = self.step(Ybar_t, dec_state, enc_hiddens, enc_hiddens_proj, enc_masks)
        combined_outputs.append(combined_output)
        o_prev = combined_output
    combined_outputs = torch.stack(combined_outputs)
    ### END YOUR CODE

    return combined_outputs
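For reference, the `torch.split` iteration pattern on a toy tensor, including the dim-aware squeeze:

import torch

Y = torch.randn(4, 2, 3)              # (tgt_len, b, e)
for Y_t in torch.split(Y, 1):         # each Y_t has shape (1, b, e)
    y_t = torch.squeeze(Y_t, dim=0)   # (b, e); dim=0 keeps the batch dim even when b = 1
    print(Y_t.shape, y_t.shape)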
Test with the following command:
python sanity_check.py 1e
This produces:
--------------------------------------------------------------------------------
Running Sanity Check for Question 1e: Decode
--------------------------------------------------------------------------------
combined_outputs Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1e: Decode!
--------------------------------------------------------------------------------
(f)
The formulas for this part, from the assignment PDF.

First, the decoder step and attention scores:
$$\mathbf{h}_t^{dec}, \mathbf{c}_t^{dec} = \text{Decoder}(\overline{\mathbf{y}_t}, \mathbf{h}_{t-1}^{dec}, \mathbf{c}_{t-1}^{dec})$$
$$\mathbf{e}_{t,i} = (\mathbf{h}_t^{dec})^T \mathbf{W}_{attProj} \mathbf{h}_i^{enc}$$

Second, the attention output and combined output:
$$\alpha_t = \text{softmax}(\mathbf{e}_t)$$
$$\mathbf{a}_t = \sum_{i=1}^{m} \alpha_{t,i} \mathbf{h}_i^{enc}$$
$$\mathbf{u}_t = [\mathbf{a}_t; \mathbf{h}_t^{dec}]$$
$$\mathbf{v}_t = \mathbf{W}_u \mathbf{u}_t$$
$$\mathbf{o}_t = \text{dropout}(\tanh(\mathbf{v}_t))$$

The code:
def step(self, Ybar_t: torch.Tensor,
         dec_state: Tuple[torch.Tensor, torch.Tensor],
         enc_hiddens: torch.Tensor,
         enc_hiddens_proj: torch.Tensor,
         enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
    """ Compute one forward step of the LSTM decoder, including the attention computation.
    @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
        where b = batch size, e = embedding size, h = hidden size.
    @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
    @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
        src_len = maximum source length, h = hidden size.
    @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
        where b = batch size, src_len = maximum source length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
        where b = batch size, src_len is maximum source length.
    @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's new hidden state, second tensor is decoder's new cell.
    @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
    @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
        Note: You will not use this outside of this function.
              We are simply returning this value so that we can sanity check
              your implementation.
    """
    combined_output = None

    ### YOUR CODE HERE (~3 Lines)
    ### TODO:
    ###     1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
    ###     2. Split dec_state into its two parts (dec_hidden, dec_cell)
    ###     3. Compute the attention scores e_t, a Tensor shape (b, src_len).
    ###        Note: b = batch_size, src_len = maximum source length, h = hidden size.
    ###
    ###     Hints:
    ###       - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
    ###       - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
    ###       - Use batched matrix multiplication (torch.bmm) to compute e_t.
    ###       - To get the tensors into the right shapes for bmm, you will need to do some squeezing and unsqueezing.
    ###       - When using the squeeze() function make sure to specify the dimension you want to squeeze
    ###         over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    ###
    ### Use the following docs to implement this functionality:
    ###     Batch Multiplication:
    ###         https://pytorch.org/docs/stable/torch.html#torch.bmm
    ###     Tensor Unsqueeze:
    ###         https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
    ###     Tensor Squeeze:
    ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
    dec_state = self.decoder(Ybar_t, dec_state)
    dec_hidden, dec_cell = dec_state
    # (b, src_len, h) x (b, h, 1) -> (b, src_len, 1) -> (b, src_len)
    e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(dim=-1)).squeeze(-1)
    ### END YOUR CODE

    # Set e_t to -inf where enc_masks has 1
    if enc_masks is not None:
        # Older PyTorch versions used enc_masks.byte(); .bool() is the current API.
        e_t.data.masked_fill_(enc_masks.bool(), -float('inf'))

    ### YOUR CODE HERE (~6 Lines)
    ### TODO:
    ###     1. Apply softmax to e_t to yield alpha_t
    ###     2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
    ###         attention output vector, a_t.
    ###     Hints:
    ###       - alpha_t is shape (b, src_len)
    ###       - enc_hiddens is shape (b, src_len, 2h)
    ###       - a_t should be shape (b, 2h)
    ###       - You will need to do some squeezing and unsqueezing.
    ###     Note: b = batch size, src_len = maximum source length, h = hidden size.
    ###
    ###     3. Concatenate dec_hidden with a_t to compute tensor U_t
    ###     4. Apply the combined output projection layer to U_t to compute tensor V_t
    ###     5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
    ###
    ### Use the following docs to implement this functionality:
    ###     Softmax:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.functional.softmax
    ###     Batch Multiplication:
    ###         https://pytorch.org/docs/stable/torch.html#torch.bmm
    ###     Tensor View:
    ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tanh:
    ###         https://pytorch.org/docs/stable/torch.html#torch.tanh
    alpha_t = F.softmax(e_t, dim=-1)
    # (b, 1, src_len) x (b, src_len, 2h) -> (b, 1, 2h) -> (b, 2h)
    a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)
    U_t = torch.cat((a_t, dec_hidden), dim=1)
    V_t = self.combined_output_projection(U_t)
    O_t = self.dropout(torch.tanh(V_t))
    ### END YOUR CODE

    combined_output = O_t
    return dec_state, combined_output, e_t
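The bmm shapes in the two attention steps, checked on toy tensors (sizes are my own):

import torch

b, src_len, h = 2, 7, 5
enc_hiddens_proj = torch.randn(b, src_len, h)
dec_hidden = torch.randn(b, h)

e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(-1)).squeeze(-1)
print(e_t.shape)   # torch.Size([2, 7])  = (b, src_len)

alpha_t = torch.softmax(e_t, dim=-1)
enc_hiddens = torch.randn(b, src_len, 2 * h)
a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)
print(a_t.shape)   # torch.Size([2, 10]) = (b, 2h)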
Test with the following command:
python sanity_check.py 1f
This produces:
--------------------------------------------------------------------------------
Running Sanity Check for Question 1f: Step
--------------------------------------------------------------------------------
dec_state[0] Sanity Checks Passed!
dec_state[1] Sanity Checks Passed!
combined_output Sanity Checks Passed!
e_t Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1f: Step!
--------------------------------------------------------------------------------
(g)
The mask records which positions are padding. By setting the attention scores $e_t$ at padded positions to $-\infty$, their attention weights $\alpha$ become effectively zero after the softmax, so padding cannot influence the attention output. We simply set every position at or beyond the original sentence length to $1$:
def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_lengths: List[int]) -> torch.Tensor:
    """ Generate sentence masks for encoder hidden states.
    @param enc_hiddens (Tensor): encodings of shape (b, src_len, 2*h), where b = batch size,
        src_len = max source length, h = hidden size.
    @param source_lengths (List[int]): List of actual lengths for each of the sentences in the batch.
    @returns enc_masks (Tensor): Tensor of sentence masks of shape (b, src_len),
        where b = batch size, src_len = max source length.
    """
    enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
    for e_id, src_len in enumerate(source_lengths):
        enc_masks[e_id, src_len:] = 1
    return enc_masks.to(self.device)
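A standalone version of the same logic on a toy batch, to show what the mask looks like:

import torch

source_lengths = [3, 2]
enc_masks = torch.zeros(2, 4)          # (b, src_len)
for e_id, src_len in enumerate(source_lengths):
    enc_masks[e_id, src_len:] = 1
print(enc_masks)
# tensor([[0., 0., 0., 1.],
#         [0., 0., 1., 1.]])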
(h)
The sh command does not seem to be available on Windows, so train with the following command instead:
python run.py train --train-src=./en_es_data/train.es --train-tgt=./en_es_data/train.en --dev-src=./en_es_data/dev.es --dev-tgt=./en_es_data/dev.en --vocab=vocab.json
(i)
Decode the test set with the following command (with --cuda on a GPU, or without it on CPU):
python run.py decode model.bin ./en_es_data/test.es ./en_es_data/test.en outputs/test_outputs.txt --cuda
python run.py decode model.bin ./en_es_data/test.es ./en_es_data/test.en outputs/test_outputs.txt
The result:
Corpus BLEU: 22.732161209949027
(j)
Dot-product attention has no parameters and is the fastest to compute, but it may not be accurate enough. The other two variants (multiplicative and additive attention) introduce learned weight parameters, which costs extra computation but lets the model learn how to align decoder and encoder states, so they tend to be more accurate.
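For concreteness, a sketch of the three score functions on a single decoder state $\mathbf{s}$ and encoder state $\mathbf{h}_i$ (my own minimal, unbatched illustration; the variable names are made up):

import torch
import torch.nn as nn

d = 5                       # shared hidden size for this toy example
s = torch.randn(d)          # decoder state
h_i = torch.randn(d)        # one encoder hidden state

# Dot-product attention: no parameters, fastest, requires matching dimensions.
score_dot = s @ h_i

# Multiplicative attention: one learned weight matrix.
W = nn.Linear(d, d, bias=False)
score_mult = s @ W(h_i)

# Additive attention: two weight matrices and a vector, the most parameters.
W1 = nn.Linear(d, d, bias=False)
W2 = nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
score_add = v(torch.tanh(W1(s) + W2(h_i))).squeeze()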
2. Analyzing NMT Systems
(a)
(i) "one of" is mistranslated as "another". This is an error on a specific linguistic construct; a fix is to add more training examples with similar constructions.
(ii) The clause segmentation is wrong. Also a linguistic-construct error; a fix is to add more training examples containing many comma-separated clauses.
(iii) A rare word is left unhandled. This is a model limitation; the model should add a mechanism for handling rare words (for example, subword or character-level modeling).
(iv) "an" is rendered as "a". A model limitation; the model should handle the a/an distinction based on the following word.
(v) The (female) teacher is translated as the default "teacher". A model limitation inherited from biased training data; the model's gender bias should be reduced.
(vi) The quantity is translated incorrectly. A model limitation; conversion between measurement units should be handled.
(b)
This part is omitted.
(c)
(i)
Using the assignment's references $r_1$ = "love can always find a way" and $r_2$ = "love makes anything possible", with $\lambda_1 = \lambda_2 = 0.5$:

For $c_1$ = "the love can always do": $p_1 = 3/5$, $p_2 = 2/4$; $len(c_1) = 5$ equals the closest reference length ($r_2$, length 5), so $BP = 1$ and
$$BLEU(c_1) = \exp(0.5 \ln 0.6 + 0.5 \ln 0.5) \approx 0.548$$
For $c_2$ = "love can make anything possible": $p_1 = 4/5$, $p_2 = 2/4$; again $BP = 1$, so
$$BLEU(c_2) = \exp(0.5 \ln 0.8 + 0.5 \ln 0.5) \approx 0.632$$
According to BLEU, $c_2$ is better than $c_1$, and I agree.
(ii)
Now suppose only $r_1$ is available.

For $c_1$: $p_1 = 3/5$, $p_2 = 2/4$; $len(r_1) = 6 > len(c_1) = 5$, so $BP = \exp(1 - 6/5) \approx 0.819$ and
$$BLEU(c_1) \approx 0.819 \times \exp(0.5 \ln 0.6 + 0.5 \ln 0.5) \approx 0.448$$
For $c_2$: $p_1 = 2/5$, $p_2 = 1/4$; $BP = \exp(1 - 6/5) \approx 0.819$, so
$$BLEU(c_2) \approx 0.819 \times \exp(0.5 \ln 0.4 + 0.5 \ln 0.25) \approx 0.259$$
According to BLEU, $c_1$ is now better than $c_2$; I disagree, since $c_2$ is clearly the better translation.
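These numbers can be reproduced with a short script (my own sketch of the assignment's simplified BLEU, with $\lambda_1 = \lambda_2 = 0.5$ and n-grams up to $k = 2$):

import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def simple_bleu(cand, refs, k=2):
    c = cand.split()
    rs = [r.split() for r in refs]
    log_p = 0.0
    for n in range(1, k + 1):
        cand_counts = Counter(ngrams(c, n))
        # Clip each n-gram count by its maximum count over all references.
        max_ref = Counter()
        for r in rs:
            for g, cnt in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], cnt)
        matched = sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
        log_p += 0.5 * math.log(matched / len(ngrams(c, n)))
    # Brevity penalty against the reference length closest to len(c).
    r_star = min((len(r) for r in rs), key=lambda l: abs(l - len(c)))
    bp = 1.0 if len(c) >= r_star else math.exp(1 - r_star / len(c))
    return bp * math.exp(log_p)

r1 = "love can always find a way"
r2 = "love makes anything possible"
c1 = "the love can always do"
c2 = "love can make anything possible"
print(simple_bleu(c1, [r1, r2]), simple_bleu(c2, [r1, r2]))  # ~0.548, ~0.632
print(simple_bleu(c1, [r1]), simple_bleu(c2, [r1]))          # ~0.448, ~0.259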
(iii)
A single reference translation may not cover all valid phrasings; with multiple references, a good candidate has more chances to match reference n-grams, so the BLEU score is more reliable.
(iv)
Advantages: it is quantitative and automatic, so translation quality can be evaluated in a reasonably accurate and reproducible way.
Disadvantages: it requires humans to provide (ideally multiple) reference translations, and because it only matches surface n-grams, a good translation phrased differently from the references can still score poorly.