Last updated on：2 years ago

Sequences are a common type of data in deep learning. Let’s see how to feed them into our RNN models.

Sequence to sequence model

Basic model

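The basic model has two parts: an encoder network that reads the input sequence, and a decoder network that generates the output sequence. Together they model: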

$$P(y^{< 1 >}, …, y^{< T_y >} | x^{< 1 >}, …, x^{< T_x >})$$

This model is a “conditional language model” in the sense that the decoder portion models the probability of the output sentence $y$ conditioned on the input sentence $x$, which the encoder portion summarizes.
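Written out with the chain rule, this conditional probability factorizes one output word at a time. Here $x$ abbreviates the whole input $x^{< 1 >}, …, x^{< T_x >}$; the per-step terms are the same ones that appear in the length-normalized objective further down:

$$P(y^{< 1 >}, …, y^{< T_y >} | x) = \prod^{T_y}_{t=1} P(y^{< t >}|x, y^{< 1 >}, …, y^{< t-1 >})$$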

Image captioning

Pick the most likely output sentence:

$$\arg\max_{y^{< 1 >}, …, y^{< T_y >}} P(y^{< 1 >}, …, y^{< T_y >} | x)$$

Beam search

Beam width, e.g. B = 3

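In the first step, compute $P(y^{< 1 >} | x)$ and keep the B most likely first words.

At each later step, use B copies of the network, one for each candidate that was kept.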

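When B = 1, beam search reduces to greedy search.

Greedy search picks the single most likely first word, then the most likely next word given it, and so on. This does not work well, because a sequence built from locally best words is often not the most likely sentence overall.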
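To make the difference between greedy search and a wider beam concrete, here is a minimal sketch on a toy problem. The lookup table `TOY_MODEL` and the helper `next_log_probs` are hypothetical stand-ins for the decoder RNN, and the probabilities are invented purely for illustration.

```python
import math

# Hypothetical toy "decoder": maps a prefix (tuple of tokens) to a
# distribution over the next token. A real system would run the RNN here.
TOY_MODEL = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.55, "y": 0.45},
    ("b",): {"x": 0.9, "y": 0.1},
}


def next_log_probs(prefix):
    """Log-probability of each possible next token given the prefix."""
    return {tok: math.log(p) for tok, p in TOY_MODEL[prefix].items()}


def beam_search(beam_width, length):
    """Return the kept (sequence, log-probability) pairs, best first."""
    beams = [((), 0.0)]  # start from the empty prefix with log P = 0
    for _ in range(length):
        candidates = []
        for prefix, score in beams:
            for tok, log_p in next_log_probs(prefix).items():
                candidates.append((prefix + (tok,), score + log_p))
        # Keep only the beam_width highest-scoring partial sequences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams


greedy_seq, greedy_score = beam_search(beam_width=1, length=2)[0]
beam_seq, beam_score = beam_search(beam_width=2, length=2)[0]
print(greedy_seq, math.exp(greedy_score))  # greedy commits to "a" first
print(beam_seq, math.exp(beam_score))      # B = 2 keeps "b" alive
```

In this toy table, greedy search commits to the locally best first token `a` and ends with probability 0.6 × 0.55 = 0.33, while a beam of width 2 also keeps `b` and finds the sequence with probability 0.4 × 0.9 = 0.36.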

Characteristics:

Large B: better results, but slower. Small B: worse results, but faster.

Beam widths of around 10 to 100 are typical for production systems, and about 1000 to 3000 for research.

Unlike exact search algorithms such as BFS (breadth-first search) or DFS (depth-first search), beam search runs faster but is not guaranteed to find the exact maximum of $P(y|x)$.

Increasing the beam width B has the following effects:

- [x] Beam search will generally find better solutions (i.e. do a better job maximizing $P(y \mid x)$).

- [x] Beam search will use up more memory.

- [x] Beam search will run more slowly.

Length normalization

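Multiplying many probabilities, each less than 1, gives a product that is too small for the floating-point representation in your computer to store accurately, so in practice we maximize the sum of log probabilities instead. In machine translation, the algorithm will also tend to output overly short translations if we carry out beam search without length normalization, because every extra word adds another negative log term.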

$$\frac{1}{T_y^{\alpha}} \sum^{T_y}_{t=1} \log P(y^{< t >}|x, y^{< 1 >}, …, y^{< t-1 >})$$
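Both points can be checked with a few lines of code. The sketch below is illustrative only: the per-word probabilities are made up, the function name `normalized_score` is hypothetical, and the default `alpha=0.7` is an assumed heuristic value rather than something fixed by the formula.

```python
import math


def normalized_score(token_probs, alpha=0.7):
    """Length-normalized log-likelihood of one candidate sentence.

    token_probs holds P(y<t> | x, y<1>, ..., y<t-1>) for t = 1..T_y.
    alpha = 0 disables normalization; alpha = 1 divides by the full length.
    """
    t_y = len(token_probs)
    log_likelihood = sum(math.log(p) for p in token_probs)
    return log_likelihood / (t_y ** alpha)


# Numerical underflow: a raw product of many small probabilities
# collapses to 0.0, while the sum of logs stays representable.
product = 1.0
for p in [0.1] * 400:
    product *= p
log_sum = sum(math.log(p) for p in [0.1] * 400)
print(product, log_sum)

# Made-up candidates: a short one and a longer one with likelier words.
short_probs = [0.5, 0.5]
long_probs = [0.7, 0.7, 0.7, 0.7]

# Without normalization (alpha = 0) the short candidate scores higher.
print(normalized_score(short_probs, alpha=0.0),
      normalized_score(long_probs, alpha=0.0))
# With normalization the longer candidate scores higher.
print(normalized_score(short_probs), normalized_score(long_probs))
```

Setting `alpha` between 0 and 1 gives a softer correction than dividing by the full length $T_y$.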

Error analysis

Error analysis helps figure out what fraction of errors is “due to” beam search versus the RNN model. Let $y^*$ be the human reference translation and $\hat{y}$ the output produced by beam search. For each error, compare $P(y^*|x)$ with $P(\hat{y}|x)$ as computed by the RNN model, and ascribe the error either to the search algorithm or to the RNN model, which defines the objective function that beam search is supposed to maximize. Through this, you can figure out which of the two components is responsible for more errors. Only if beam search turns out to be responsible for a lot of the mistakes is it worth working hard to increase the beam width. Note that $P$ here is a probability.

Case 1: $P(y^*|x) > P(\hat{y}|x)$

Conclusion: beam search is at fault

Case 2: $P(y^*|x) \le P(\hat{y}|x)$

Conclusion: RNN model is at fault
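The two cases can be tallied over a set of mistranslated examples to estimate the fraction of errors each component causes. The following is a minimal sketch: the function name `attribute_error` is hypothetical and the probability pairs are invented, standing in for values you would obtain by scoring $y^*$ and $\hat{y}$ with your own model.

```python
from collections import Counter


def attribute_error(p_reference, p_beam_output):
    """Blame one error on beam search or on the RNN model.

    p_reference   -- P(y*|x), the model's probability of the human reference
    p_beam_output -- P(y_hat|x), the model's probability of beam search's output
    """
    if p_reference > p_beam_output:
        # Case 1: the model prefers y*, but the search failed to find it.
        return "beam search"
    # Case 2: the model itself rates the worse output at least as highly.
    return "RNN model"


# Invented (P(y*|x), P(y_hat|x)) pairs for five dev-set errors.
examples = [
    (2e-10, 1e-10),
    (1e-10, 3e-10),
    (5e-9, 5e-9),
    (4e-8, 1e-8),
    (1e-12, 2e-11),
]

counts = Counter(attribute_error(p_ref, p_hat) for p_ref, p_hat in examples)
fractions = {cause: n / len(examples) for cause, n in counts.items()}
print(fractions)
```

With these invented numbers, beam search is blamed for 2 of the 5 errors and the RNN model for 3, which would suggest that improving the model matters more than increasing the beam width.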