CS224N Natural Language Processing with Deep Learning Assignment 4
Course homepage: https://web.stanford.edu/class/archive/cs/cs224n/cs224n.1194/
Lecture videos: https://www.bilibili.com/video/av46216519?from=search&seid=13229282510647565239
1. Neural Machine Translation with RNNs
(a)
def pad_sents(sents, pad_token):
    """ Pad list of sentences according to the longest sentence in the batch.
    @param sents (list[list[str]]): list of sentences, where each sentence
        is represented as a list of words
    @param pad_token (str): padding token
    @returns sents_padded (list[list[str]]): list of sentences where sentences shorter
        than the max length sentence are padded out with the pad_token, such that
        each sentence in the batch now has equal length.
    """
    sents_padded = []

    ### YOUR CODE HERE (~6 Lines)
    max_l = max(len(sent) for sent in sents)
    for sent in sents:
        # Copy the sentence and extend it with pad tokens up to the max length.
        sents_padded.append(sent + [pad_token] * (max_l - len(sent)))
    ### END YOUR CODE

    return sents_padded
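A quick check of the behavior (a minimal example of my own, not part of the assignment's test suite):

sents = [["hello", "world", "!"], ["hi"]]
print(pad_sents(sents, "<pad>"))
# [['hello', 'world', '!'], ['hi', '<pad>', '<pad>']]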
(b)
def __init__(self, embed_size, vocab):
    """
    Init the Embedding layers.
    @param embed_size (int): Embedding size (dimensionality)
    @param vocab (Vocab): Vocabulary object containing src and tgt languages
        See vocab.py for documentation.
    """
    super(ModelEmbeddings, self).__init__()
    self.embed_size = embed_size

    # default values
    self.source = None
    self.target = None

    src_pad_token_idx = vocab.src['<pad>']
    tgt_pad_token_idx = vocab.tgt['<pad>']

    ### YOUR CODE HERE (~2 Lines)
    ### TODO - Initialize the following variables:
    ###     self.source (Embedding Layer for source language)
    ###     self.target (Embedding Layer for target language)
    ###
    ### Note:
    ###     1. `vocab` object contains two vocabularies:
    ###            `vocab.src` for source
    ###            `vocab.tgt` for target
    ###     2. You can get the length of a specific vocabulary by running:
    ###             `len(vocab.<specific_vocabulary>)`
    ###     3. Remember to include the padding token for the specific vocabulary
    ###        when creating your Embedding.
    ###
    ### Use the following docs to properly initialize these variables:
    ###     Embedding Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding
    self.source = nn.Embedding(len(vocab.src), self.embed_size, padding_idx=src_pad_token_idx)
    self.target = nn.Embedding(len(vocab.tgt), self.embed_size, padding_idx=tgt_pad_token_idx)
    ### END YOUR CODE
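Incidentally, `padding_idx` pins the padding token's row of the embedding matrix to the zero vector, and its gradient stays zero during training. A tiny standalone illustration (toy sizes of my own choosing):

import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, padding_idx=0)
print(emb.weight[0])                  # all zeros
print(emb(torch.tensor([0, 3]))[0])   # the <pad> row stays all zeros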
(c)
def __init__(self, embed_size, hidden_size, vocab, dropout_rate=0.2):
    """ Init NMT Model.
    @param embed_size (int): Embedding size (dimensionality)
    @param hidden_size (int): Hidden Size (dimensionality)
    @param vocab (Vocab): Vocabulary object containing src and tgt languages
        See vocab.py for documentation.
    @param dropout_rate (float): Dropout probability, for attention
    """
    super(NMT, self).__init__()
    self.model_embeddings = ModelEmbeddings(embed_size, vocab)
    self.hidden_size = hidden_size
    self.dropout_rate = dropout_rate
    self.vocab = vocab

    # default values
    self.encoder = None
    self.decoder = None
    self.h_projection = None
    self.c_projection = None
    self.att_projection = None
    self.combined_output_projection = None
    self.target_vocab_projection = None
    self.dropout = None

    ### YOUR CODE HERE (~8 Lines)
    ### TODO - Initialize the following variables:
    ###     self.encoder (Bidirectional LSTM with bias)
    ###     self.decoder (LSTM Cell with bias)
    ###     self.h_projection (Linear Layer with no bias), called W_{h} in the PDF.
    ###     self.c_projection (Linear Layer with no bias), called W_{c} in the PDF.
    ###     self.att_projection (Linear Layer with no bias), called W_{attProj} in the PDF.
    ###     self.combined_output_projection (Linear Layer with no bias), called W_{u} in the PDF.
    ###     self.target_vocab_projection (Linear Layer with no bias), called W_{vocab} in the PDF.
    ###     self.dropout (Dropout Layer)
    ###
    ### Use the following docs to properly initialize these variables:
    ###     LSTM:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTM
    ###     LSTM Cell:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.LSTMCell
    ###     Linear Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Linear
    ###     Dropout Layer:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.Dropout
    self.encoder = nn.LSTM(embed_size, hidden_size, bias=True, bidirectional=True)
    self.decoder = nn.LSTMCell(embed_size + hidden_size, hidden_size, bias=True)
    self.h_projection = nn.Linear(2 * hidden_size, hidden_size, bias=False)
    self.c_projection = nn.Linear(2 * hidden_size, hidden_size, bias=False)
    self.att_projection = nn.Linear(2 * hidden_size, hidden_size, bias=False)
    self.combined_output_projection = nn.Linear(3 * hidden_size, hidden_size, bias=False)
    self.target_vocab_projection = nn.Linear(hidden_size, len(self.vocab.tgt), bias=False)
    self.dropout = nn.Dropout(self.dropout_rate)
    ### END YOUR CODE
Note that the decoder input at each step is $\overline{\mathbf{y}_t} = [\mathbf{y}_t; \mathbf{o}_{t-1}]$, the concatenation of the target word embedding and the previous combined output, so the decoder's input_size is embed_size + hidden_size.
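A minimal shape check of that concatenation (toy sizes; assume embed_size = 3 and hidden_size = 5):

import torch

y_t = torch.zeros(4, 3)              # (batch, embed_size)
o_prev = torch.zeros(4, 5)           # (batch, hidden_size)
Ybar_t = torch.cat((y_t, o_prev), dim=1)
print(Ybar_t.shape)                  # torch.Size([4, 8]) = (batch, embed_size + hidden_size)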
(d)
def encode(self, source_padded: torch.Tensor, source_lengths: List[int]) -> Tuple[torch.Tensor, Tuple[torch.Tensor, torch.Tensor]]:
    """ Apply the encoder to source sentences to obtain encoder hidden states.
        Additionally, take the final states of the encoder and project them to obtain initial states for decoder.
    @param source_padded (Tensor): Tensor of padded source sentences with shape (src_len, b), where
        b = batch_size, src_len = maximum source sentence length. Note that
        these have already been sorted in order of longest to shortest sentence.
    @param source_lengths (List[int]): List of actual lengths for each of the source sentences in the batch
    @returns enc_hiddens (Tensor): Tensor of hidden units with shape (b, src_len, h*2), where
        b = batch size, src_len = maximum source sentence length, h = hidden size.
    @returns dec_init_state (tuple(Tensor, Tensor)): Tuple of tensors representing the decoder's initial
        hidden state and cell.
    """
    enc_hiddens, dec_init_state = None, None

    ### YOUR CODE HERE (~ 8 Lines)
    ### TODO:
    ###     1. Construct Tensor `X` of source sentences with shape (src_len, b, e) using the source model embeddings.
    ###         src_len = maximum source sentence length, b = batch size, e = embedding size. Note
    ###         that there is no initial hidden state or cell for the decoder.
    ###     2. Compute `enc_hiddens`, `last_hidden`, `last_cell` by applying the encoder to `X`.
    ###         - Before you can apply the encoder, you need to apply the `pack_padded_sequence` function to X.
    ###         - After you apply the encoder, you need to apply the `pad_packed_sequence` function to enc_hiddens.
    ###         - Note that the shape of the tensor returned by the encoder is (src_len, b, h*2) and we want to
    ###           return a tensor of shape (b, src_len, h*2) as `enc_hiddens`.
    ###     3. Compute `dec_init_state` = (init_decoder_hidden, init_decoder_cell):
    ###         - `init_decoder_hidden`:
    ###             `last_hidden` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    ###             Apply the h_projection layer to this in order to compute init_decoder_hidden.
    ###             This is h_0^{dec} in the PDF. Here b = batch size, h = hidden size
    ###         - `init_decoder_cell`:
    ###             `last_cell` is a tensor shape (2, b, h). The first dimension corresponds to forwards and backwards.
    ###             Concatenate the forwards and backwards tensors to obtain a tensor shape (b, 2*h).
    ###             Apply the c_projection layer to this in order to compute init_decoder_cell.
    ###             This is c_0^{dec} in the PDF. Here b = batch size, h = hidden size
    ###
    ### See the following docs, as you may need to use some of the following functions in your implementation:
    ###     Pack the padded sequence X before passing to the encoder:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pack_padded_sequence
    ###     Pad the packed sequence, enc_hiddens, returned by the encoder:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.utils.rnn.pad_packed_sequence
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tensor Permute:
    ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.permute
    X = self.model_embeddings.source(source_padded)
    X_pack = nn.utils.rnn.pack_padded_sequence(X, source_lengths)
    enc_hiddens, (last_hidden, last_cell) = self.encoder(X_pack)
    enc_hiddens = nn.utils.rnn.pad_packed_sequence(enc_hiddens, batch_first=True)[0]
    # Concatenate the forward and backward final states, then project from 2h down to h.
    hdec = self.h_projection(torch.cat((last_hidden[0], last_hidden[1]), 1))
    cdec = self.c_projection(torch.cat((last_cell[0], last_cell[1]), 1))
    dec_init_state = (hdec, cdec)
    ### END YOUR CODE

    return enc_hiddens, dec_init_state
The corresponding formulas from the assignment PDF:
$$\mathbf{h}_i^{enc} = [\overleftarrow{\mathbf{h}_i^{enc}}; \overrightarrow{\mathbf{h}_i^{enc}}], \quad 1 \le i \le m$$
$$\mathbf{h}_0^{dec} = \mathbf{W}_h[\overleftarrow{\mathbf{h}_1^{enc}}; \overrightarrow{\mathbf{h}_m^{enc}}]$$
$$\mathbf{c}_0^{dec} = \mathbf{W}_c[\overleftarrow{\mathbf{c}_1^{enc}}; \overrightarrow{\mathbf{c}_m^{enc}}]$$
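To build intuition for the pack/pad round trip used in `encode`, here is a minimal self-contained example (toy shapes of my own choosing):

import torch
import torch.nn as nn

# Two sequences of lengths 3 and 2, sorted longest-to-shortest,
# shaped (src_len, b, e) = (3, 2, 4).
X = torch.randn(3, 2, 4)
lengths = [3, 2]

encoder = nn.LSTM(4, 5, bidirectional=True)
packed = nn.utils.rnn.pack_padded_sequence(X, lengths)
out_packed, (last_hidden, last_cell) = encoder(packed)
out, _ = nn.utils.rnn.pad_packed_sequence(out_packed, batch_first=True)

print(out.shape)          # torch.Size([2, 3, 10]) = (b, src_len, 2*h)
print(last_hidden.shape)  # torch.Size([2, 2, 5])  = (directions, b, h)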
Test with the following command:
python sanity_check.py 1d
This produces:
Running Sanity Check for Question 1d: Encode
--------------------------------------------------------------------------------
torch.Size([5, 20, 6]) torch.Size([5, 20, 6])
enc_hiddens Sanity Checks Passed!
dec_init_state[0] Sanity Checks Passed!
dec_init_state[1] Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1d: Encode!
--------------------------------------------------------------------------------
(e)
Note that $\overline{\mathbf{y}_t}$ is the concatenation of $\mathbf{y}_t$ and $\mathbf{o}_{t-1}$, and that $\mathbf{o}_0$ is initialized to the zero vector (see `o_prev` below). The code:
def decode(self, enc_hiddens: torch.Tensor, enc_masks: torch.Tensor,
           dec_init_state: Tuple[torch.Tensor, torch.Tensor], target_padded: torch.Tensor) -> torch.Tensor:
    """Compute combined output vectors for a batch.
    @param enc_hiddens (Tensor): Hidden states (b, src_len, h*2), where
        b = batch size, src_len = maximum source sentence length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks (b, src_len), where
        b = batch size, src_len = maximum source sentence length.
    @param dec_init_state (tuple(Tensor, Tensor)): Initial state and cell for decoder
    @param target_padded (Tensor): Gold-standard padded target sentences (tgt_len, b), where
        tgt_len = maximum target sentence length, b = batch size.
    @returns combined_outputs (Tensor): combined output tensor (tgt_len, b, h), where
        tgt_len = maximum target sentence length, b = batch_size, h = hidden size
    """
    # Chop off the <END> token for max length sentences.
    target_padded = target_padded[:-1]

    # Initialize the decoder state (hidden and cell)
    dec_state = dec_init_state

    # Initialize previous combined output vector o_{t-1} as zero
    batch_size = enc_hiddens.size(0)
    o_prev = torch.zeros(batch_size, self.hidden_size, device=self.device)

    # Initialize a list we will use to collect the combined output o_t on each step
    combined_outputs = []

    ### YOUR CODE HERE (~9 Lines)
    ### TODO:
    ###     1. Apply the attention projection layer to `enc_hiddens` to obtain `enc_hiddens_proj`,
    ###         which should be shape (b, src_len, h),
    ###         where b = batch size, src_len = maximum source length, h = hidden size.
    ###         This is applying W_{attProj} to h^enc, as described in the PDF.
    ###     2. Construct tensor `Y` of target sentences with shape (tgt_len, b, e) using the target model embeddings.
    ###         where tgt_len = maximum target sentence length, b = batch size, e = embedding size.
    ###     3. Use the torch.split function to iterate over the time dimension of Y.
    ###         Within the loop, this will give you Y_t of shape (1, b, e) where b = batch size, e = embedding size.
    ###             - Squeeze Y_t into a tensor of dimension (b, e).
    ###             - Construct Ybar_t by concatenating Y_t with o_prev.
    ###             - Use the step function to compute the Decoder's next (cell, state) values
    ###               as well as the new combined output o_t.
    ###             - Append o_t to combined_outputs
    ###             - Update o_prev to the new o_t.
    ###     4. Use torch.stack to convert combined_outputs from a list length tgt_len of
    ###         tensors shape (b, h), to a single tensor shape (tgt_len, b, h)
    ###         where tgt_len = maximum target sentence length, b = batch size, h = hidden size.
    ###
    ### Note:
    ###    - When using the squeeze() function make sure to specify the dimension you want to squeeze
    ###      over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    ###
    ### Use the following docs to implement this functionality:
    ###     Zeros Tensor:
    ###         https://pytorch.org/docs/stable/torch.html#torch.zeros
    ###     Tensor Splitting (iteration):
    ###         https://pytorch.org/docs/stable/torch.html#torch.split
    ###     Tensor Dimension Squeezing:
    ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tensor Stacking:
    ###         https://pytorch.org/docs/stable/torch.html#torch.stack
    enc_hiddens_proj = self.att_projection(enc_hiddens)
    Y = self.model_embeddings.target(target_padded)
    for Y_t in torch.split(Y, 1):
        # Specify dim=0 so the batch dimension survives when batch_size = 1.
        y_t = torch.squeeze(Y_t, dim=0)
        Ybar_t = torch.cat((y_t, o_prev), dim=-1)
        dec_state, combined_output, e_t = self.step(Ybar_t, dec_state, enc_hiddens, enc_hiddens_proj, enc_masks)
        combined_outputs.append(combined_output)
        o_prev = combined_output
    combined_outputs = torch.stack(combined_outputs)
    ### END YOUR CODE

    return combined_outputs
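For reference, the `torch.split` iteration pattern on a toy tensor, including the dim-aware squeeze:

import torch

Y = torch.randn(4, 2, 3)              # (tgt_len, b, e)
for Y_t in torch.split(Y, 1):         # each Y_t has shape (1, b, e)
    y_t = torch.squeeze(Y_t, dim=0)   # (b, e); dim=0 keeps the batch dim even when b = 1
    print(Y_t.shape, y_t.shape)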
Test with the following command:
python sanity_check.py 1e
This produces:
--------------------------------------------------------------------------------
Running Sanity Check for Question 1e: Decode
--------------------------------------------------------------------------------
combined_outputs Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1e: Decode!
--------------------------------------------------------------------------------
(f)
The formulas for this part, from the assignment PDF.

First, the decoder step and attention scores:
$$\mathbf{h}_t^{dec}, \mathbf{c}_t^{dec} = \text{Decoder}(\overline{\mathbf{y}_t}, \mathbf{h}_{t-1}^{dec}, \mathbf{c}_{t-1}^{dec})$$
$$\mathbf{e}_{t,i} = (\mathbf{h}_t^{dec})^T \mathbf{W}_{attProj} \mathbf{h}_i^{enc}$$

Second, the attention output and combined output:
$$\alpha_t = \text{softmax}(\mathbf{e}_t)$$
$$\mathbf{a}_t = \sum_{i=1}^{m} \alpha_{t,i} \mathbf{h}_i^{enc}$$
$$\mathbf{u}_t = [\mathbf{a}_t; \mathbf{h}_t^{dec}]$$
$$\mathbf{v}_t = \mathbf{W}_u \mathbf{u}_t$$
$$\mathbf{o}_t = \text{dropout}(\tanh(\mathbf{v}_t))$$

The code:
def step(self, Ybar_t: torch.Tensor,
         dec_state: Tuple[torch.Tensor, torch.Tensor],
         enc_hiddens: torch.Tensor,
         enc_hiddens_proj: torch.Tensor,
         enc_masks: torch.Tensor) -> Tuple[Tuple, torch.Tensor, torch.Tensor]:
    """ Compute one forward step of the LSTM decoder, including the attention computation.
    @param Ybar_t (Tensor): Concatenated Tensor of [Y_t o_prev], with shape (b, e + h). The input for the decoder,
        where b = batch size, e = embedding size, h = hidden size.
    @param dec_state (tuple(Tensor, Tensor)): Tuple of tensors both with shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's prev hidden state, second tensor is decoder's prev cell.
    @param enc_hiddens (Tensor): Encoder hidden states Tensor, with shape (b, src_len, h * 2), where b = batch size,
        src_len = maximum source length, h = hidden size.
    @param enc_hiddens_proj (Tensor): Encoder hidden states Tensor, projected from (h * 2) to h. Tensor is with shape (b, src_len, h),
        where b = batch size, src_len = maximum source length, h = hidden size.
    @param enc_masks (Tensor): Tensor of sentence masks shape (b, src_len),
        where b = batch size, src_len is maximum source length.
    @returns dec_state (tuple (Tensor, Tensor)): Tuple of tensors both shape (b, h), where b = batch size, h = hidden size.
        First tensor is decoder's new hidden state, second tensor is decoder's new cell.
    @returns combined_output (Tensor): Combined output Tensor at timestep t, shape (b, h), where b = batch size, h = hidden size.
    @returns e_t (Tensor): Tensor of shape (b, src_len). It is attention scores distribution.
        Note: You will not use this outside of this function.
              We are simply returning this value so that we can sanity check
              your implementation.
    """
    combined_output = None

    ### YOUR CODE HERE (~3 Lines)
    ### TODO:
    ###     1. Apply the decoder to `Ybar_t` and `dec_state` to obtain the new dec_state.
    ###     2. Split dec_state into its two parts (dec_hidden, dec_cell)
    ###     3. Compute the attention scores e_t, a Tensor shape (b, src_len).
    ###        Note: b = batch_size, src_len = maximum source length, h = hidden size.
    ###
    ###     Hints:
    ###       - dec_hidden is shape (b, h) and corresponds to h^dec_t in the PDF (batched)
    ###       - enc_hiddens_proj is shape (b, src_len, h) and corresponds to W_{attProj} h^enc (batched).
    ###       - Use batched matrix multiplication (torch.bmm) to compute e_t.
    ###       - To get the tensors into the right shapes for bmm, you will need to do some squeezing and unsqueezing.
    ###       - When using the squeeze() function make sure to specify the dimension you want to squeeze
    ###         over. Otherwise, you will remove the batch dimension accidentally, if batch_size = 1.
    ###
    ### Use the following docs to implement this functionality:
    ###     Batch Multiplication:
    ###         https://pytorch.org/docs/stable/torch.html#torch.bmm
    ###     Tensor Unsqueeze:
    ###         https://pytorch.org/docs/stable/torch.html#torch.unsqueeze
    ###     Tensor Squeeze:
    ###         https://pytorch.org/docs/stable/torch.html#torch.squeeze
    dec_state = self.decoder(Ybar_t, dec_state)
    dec_hidden, dec_cell = dec_state
    # (b, src_len, h) x (b, h, 1) -> (b, src_len, 1) -> (b, src_len)
    e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(dim=-1)).squeeze(-1)
    ### END YOUR CODE

    # Set e_t to -inf where enc_masks has 1
    if enc_masks is not None:
        # Older PyTorch versions used enc_masks.byte(); .bool() is the current API.
        e_t.data.masked_fill_(enc_masks.bool(), -float('inf'))

    ### YOUR CODE HERE (~6 Lines)
    ### TODO:
    ###     1. Apply softmax to e_t to yield alpha_t
    ###     2. Use batched matrix multiplication between alpha_t and enc_hiddens to obtain the
    ###         attention output vector, a_t.
    ###     Hints:
    ###       - alpha_t is shape (b, src_len)
    ###       - enc_hiddens is shape (b, src_len, 2h)
    ###       - a_t should be shape (b, 2h)
    ###       - You will need to do some squeezing and unsqueezing.
    ###     Note: b = batch size, src_len = maximum source length, h = hidden size.
    ###
    ###     3. Concatenate dec_hidden with a_t to compute tensor U_t
    ###     4. Apply the combined output projection layer to U_t to compute tensor V_t
    ###     5. Compute tensor O_t by first applying the Tanh function and then the dropout layer.
    ###
    ### Use the following docs to implement this functionality:
    ###     Softmax:
    ###         https://pytorch.org/docs/stable/nn.html#torch.nn.functional.softmax
    ###     Batch Multiplication:
    ###         https://pytorch.org/docs/stable/torch.html#torch.bmm
    ###     Tensor View:
    ###         https://pytorch.org/docs/stable/tensors.html#torch.Tensor.view
    ###     Tensor Concatenation:
    ###         https://pytorch.org/docs/stable/torch.html#torch.cat
    ###     Tanh:
    ###         https://pytorch.org/docs/stable/torch.html#torch.tanh
    alpha_t = F.softmax(e_t, dim=-1)
    # (b, 1, src_len) x (b, src_len, 2h) -> (b, 1, 2h) -> (b, 2h)
    a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)
    U_t = torch.cat((a_t, dec_hidden), dim=1)
    V_t = self.combined_output_projection(U_t)
    O_t = self.dropout(torch.tanh(V_t))
    ### END YOUR CODE

    combined_output = O_t
    return dec_state, combined_output, e_t
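The bmm shapes in the two attention steps, checked on toy tensors (sizes are my own):

import torch

b, src_len, h = 2, 7, 5
enc_hiddens_proj = torch.randn(b, src_len, h)
dec_hidden = torch.randn(b, h)

e_t = torch.bmm(enc_hiddens_proj, dec_hidden.unsqueeze(-1)).squeeze(-1)
print(e_t.shape)   # torch.Size([2, 7])  = (b, src_len)

alpha_t = torch.softmax(e_t, dim=-1)
enc_hiddens = torch.randn(b, src_len, 2 * h)
a_t = torch.bmm(alpha_t.unsqueeze(1), enc_hiddens).squeeze(1)
print(a_t.shape)   # torch.Size([2, 10]) = (b, 2h)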
Test with the following command:
python sanity_check.py 1f
This produces:
--------------------------------------------------------------------------------
Running Sanity Check for Question 1f: Step
--------------------------------------------------------------------------------
dec_state[0] Sanity Checks Passed!
dec_state[1] Sanity Checks Passed!
combined_output Sanity Checks Passed!
e_t Sanity Checks Passed!
--------------------------------------------------------------------------------
All Sanity Checks Passed for Question 1f: Step!
--------------------------------------------------------------------------------
(g)
The mask records which positions are padding. By setting the attention scores $e_t$ at padded positions to $-\infty$, their attention weights $\alpha$ become effectively zero after the softmax, so padding cannot influence the attention output. We simply set every position at or beyond the original sentence length to $1$:
def generate_sent_masks(self, enc_hiddens: torch.Tensor, source_lengths: List[int]) -> torch.Tensor:
    """ Generate sentence masks for encoder hidden states.
    @param enc_hiddens (Tensor): encodings of shape (b, src_len, 2*h), where b = batch size,
        src_len = max source length, h = hidden size.
    @param source_lengths (List[int]): List of actual lengths for each of the sentences in the batch.
    @returns enc_masks (Tensor): Tensor of sentence masks of shape (b, src_len),
        where b = batch size, src_len = max source length.
    """
    enc_masks = torch.zeros(enc_hiddens.size(0), enc_hiddens.size(1), dtype=torch.float)
    for e_id, src_len in enumerate(source_lengths):
        enc_masks[e_id, src_len:] = 1
    return enc_masks.to(self.device)
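A standalone version of the same logic on a toy batch, to show what the mask looks like:

import torch

source_lengths = [3, 2]
enc_masks = torch.zeros(2, 4)          # (b, src_len)
for e_id, src_len in enumerate(source_lengths):
    enc_masks[e_id, src_len:] = 1
print(enc_masks)
# tensor([[0., 0., 0., 1.],
#         [0., 0., 1., 1.]])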
(h)
The sh command does not seem to be available on Windows, so train with the following command instead:
python run.py train --train-src=./en_es_data/train.es --train-tgt=./en_es_data/train.en --dev-src=./en_es_data/dev.es --dev-tgt=./en_es_data/dev.en --vocab=vocab.json
(i)
Decode the test set with the following command (with --cuda on a GPU, or without it on CPU):
python run.py decode model.bin ./en_es_data/test.es ./en_es_data/test.en outputs/test_outputs.txt --cuda
python run.py decode model.bin ./en_es_data/test.es ./en_es_data/test.en outputs/test_outputs.txt
The result:
Corpus BLEU: 22.732161209949027
(j)
Dot-product attention has no parameters and is the fastest to compute, but it may not be accurate enough. The other two variants (multiplicative and additive attention) introduce learned weight parameters, which costs extra computation but lets the model learn how to align decoder and encoder states, so they tend to be more accurate.
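For concreteness, a sketch of the three score functions on a single decoder state $\mathbf{s}$ and encoder state $\mathbf{h}_i$ (my own minimal, unbatched illustration; the variable names are made up):

import torch
import torch.nn as nn

d = 5                       # shared hidden size for this toy example
s = torch.randn(d)          # decoder state
h_i = torch.randn(d)        # one encoder hidden state

# Dot-product attention: no parameters, fastest, requires matching dimensions.
score_dot = s @ h_i

# Multiplicative attention: one learned weight matrix.
W = nn.Linear(d, d, bias=False)
score_mult = s @ W(h_i)

# Additive attention: two weight matrices and a vector, the most parameters.
W1 = nn.Linear(d, d, bias=False)
W2 = nn.Linear(d, d, bias=False)
v = nn.Linear(d, 1, bias=False)
score_add = v(torch.tanh(W1(s) + W2(h_i))).squeeze()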
2. Analyzing NMT Systems
(a)
(i) "one of" is mistranslated as "another". This is an error on a specific linguistic construct; a fix is to add more training examples with similar constructions.
(ii) The clause segmentation is wrong. Also a linguistic-construct error; a fix is to add more training examples containing many comma-separated clauses.
(iii) A rare word is left unhandled. This is a model limitation; the model should add a mechanism for handling rare words (for example, subword or character-level modeling).
(iv) "an" is rendered as "a". A model limitation; the model should handle the a/an distinction based on the following word.
(v) The (female) teacher is translated as the default "teacher". A model limitation inherited from biased training data; the model's gender bias should be reduced.
(vi) The quantity is translated incorrectly. A model limitation; conversion between measurement units should be handled.
(b)
This part is omitted.
(c)
(i)
Using the assignment's references $r_1$ = "love can always find a way" and $r_2$ = "love makes anything possible", with $\lambda_1 = \lambda_2 = 0.5$:

For $c_1$ = "the love can always do": $p_1 = 3/5$, $p_2 = 2/4$; $len(c_1) = 5$ equals the closest reference length ($r_2$, length 5), so $BP = 1$ and
$$BLEU(c_1) = \exp(0.5 \ln 0.6 + 0.5 \ln 0.5) \approx 0.548$$
For $c_2$ = "love can make anything possible": $p_1 = 4/5$, $p_2 = 2/4$; again $BP = 1$, so
$$BLEU(c_2) = \exp(0.5 \ln 0.8 + 0.5 \ln 0.5) \approx 0.632$$
According to BLEU, $c_2$ is better than $c_1$, and I agree.
(ii)
Now suppose only $r_1$ is available.

For $c_1$: $p_1 = 3/5$, $p_2 = 2/4$; $len(r_1) = 6 > len(c_1) = 5$, so $BP = \exp(1 - 6/5) \approx 0.819$ and
$$BLEU(c_1) \approx 0.819 \times \exp(0.5 \ln 0.6 + 0.5 \ln 0.5) \approx 0.448$$
For $c_2$: $p_1 = 2/5$, $p_2 = 1/4$; $BP = \exp(1 - 6/5) \approx 0.819$, so
$$BLEU(c_2) \approx 0.819 \times \exp(0.5 \ln 0.4 + 0.5 \ln 0.25) \approx 0.259$$
According to BLEU, $c_1$ is now better than $c_2$; I disagree, since $c_2$ is clearly the better translation.
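These numbers can be reproduced with a short script (my own sketch of the assignment's simplified BLEU, with $\lambda_1 = \lambda_2 = 0.5$ and n-grams up to $k = 2$):

import math
from collections import Counter

def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def simple_bleu(cand, refs, k=2):
    c = cand.split()
    rs = [r.split() for r in refs]
    log_p = 0.0
    for n in range(1, k + 1):
        cand_counts = Counter(ngrams(c, n))
        # Clip each n-gram count by its maximum count over all references.
        max_ref = Counter()
        for r in rs:
            for g, cnt in Counter(ngrams(r, n)).items():
                max_ref[g] = max(max_ref[g], cnt)
        matched = sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
        log_p += 0.5 * math.log(matched / len(ngrams(c, n)))
    # Brevity penalty against the reference length closest to len(c).
    r_star = min((len(r) for r in rs), key=lambda l: abs(l - len(c)))
    bp = 1.0 if len(c) >= r_star else math.exp(1 - r_star / len(c))
    return bp * math.exp(log_p)

r1 = "love can always find a way"
r2 = "love makes anything possible"
c1 = "the love can always do"
c2 = "love can make anything possible"
print(simple_bleu(c1, [r1, r2]), simple_bleu(c2, [r1, r2]))  # ~0.548, ~0.632
print(simple_bleu(c1, [r1]), simple_bleu(c2, [r1]))          # ~0.448, ~0.259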
(iii)
A single reference translation may not cover all valid phrasings; with multiple references, a good candidate has more chances to match reference n-grams, so the BLEU score is more reliable.
(iv)
Advantages: it is quantitative and automatic, so translation quality can be evaluated in a reasonably accurate and reproducible way.
Disadvantages: it requires humans to provide (ideally multiple) reference translations, and because it only matches surface n-grams, a good translation phrased differently from the references can still score poorly.