MACHINE TRANSLATION
CSC872 Pattern Analysis and Machine Intelligence Final Report
DECEMBER 2020
LINSI LIN
Table of Contents
Introduction
Statistical Machine Translation
Neural Machine Translation
Attention Mechanism
Further Assessment
Conclusion
References
Introduction
The field of machine translation is almost as old as the modern digital computer. It is a subfield
of computational linguistics that aims to translate text from one language to another
automatically using a computing device. In 1949, Warren Weaver proposed the idea of
approaching this problem statistically, but it wasn't until Brown et al. (1988) that the approach
was outlined in more detail. They took the view that every sentence in one language is a possible
translation of any sentence in another language. The statistical machine translation model
searches over candidate sentences and tries to find the one that maximizes the product of the
language model and the translation model probabilities. There are mainly two types of statistical
machine translation: word-based translation and phrase-based translation. Word-based translation
translates every word in a single fixed way and fails to take into account case, gender, and
homonymy. Koehn et al. (2003) proposed phrase-based statistical machine translation, in which
the restriction to word-to-word translation is removed; instead, the lexical unit is a sequence of
words of any length.
The wave of neural models eventually reached the field of machine translation.
Neural machine translation is a recently proposed approach to machine translation, which uses a
single neural network trained jointly to maximize translation performance. The whole model
is jointly trained to maximize the conditional log-probability of the correct translation given a
source sentence with respect to the model parameters. Before 2015, neural machine translation
was criticized for being too computationally expensive and resource demanding to compete with
state-of-the-art phrase-based statistical machine translation. But in 2015, major progress in neural
machine translation was made with the recurrent neural network encoder-decoder architecture
together with the long short-term memory. For the first time, neural machine translation overtook
a variety of phrase-based statistical machine translation approaches by a large margin on a
difficult language pair such as English-German. Bahdanau et al. (2014) proposed an attention
mechanism to deal with the problem that the performance of a basic recurrent neural network
encoder-decoder architecture deteriorates rapidly as the length of an input sentence increases.
Compared to statistical machine translation, neural machine translation has several
advantages. First, the whole system is jointly tuned to maximize translation performance,
unlike a phrase-based system, which consists of many feature functions that are tuned
separately. Second, the memory footprint of a neural machine translation model is often much
smaller than that of existing systems, which rely on maintaining large tables of phrase pairs.
Despite this great progress, neural machine translation still has to overcome various
challenges. First, neural machine translation has difficulty translating very rare words. If the
translation of a source sentence requires many words that are not in the vocabulary, the model's
performance degrades dramatically. Second, the out-of-domain performance of neural machine
translation needs to be improved. Koehn and Knowles (2017) compared the two translation
paradigms across different domains and found that while in-domain performance is similar
(neural machine translation is better for IT and Subtitles, statistical machine translation is better
for Law, Medical, and Koran), the out-of-domain performance of the neural machine translation
systems is worse in almost all cases, sometimes dramatically so. A possible solution for this
particular problem is combining the large-vocabulary approach of Jean et al. (2014) with the
assignment of alignment scores based on domains, which is something I can explore further later.
Statistical Machine Translation
In 1949, Warren Weaver proposed the idea of using statistical methods to tackle machine
translation. However, it wasn't until Brown et al. (1988) that the approach was outlined in more
detail. By that time, computers could process large corpora thanks to faster processors and larger
storage, and statistical methods had already proven their value in fields such as automatic speech
recognition, lexicography, and natural language processing (Brown et al. 1990). In their paper,
they took the view that every sentence in one language is a possible translation of any sentence
in another language. If we define S as the source-language sentence and T as the target-language
sentence, then we can assign to each (S, T) sentence pair a probability P(T|S), the probability that
a translator will produce T in the target language when presented with S in the source language.
Because P(T|S) alone can be very small or very large depending on the content of the pair, they
reframed the machine translation problem as searching among possible source sentences S for
the one that gives the greatest value of P(S)P(T|S). Written mathematically:

$$\hat{S} = \arg\max_{S} P(S \mid T) = \arg\max_{S} P(S)\, P(T \mid S)$$

Here P(S) denotes the language model and P(T|S) denotes the translation model. The argmax,
which operates over all candidate sentences, represents the search problem and is referred to as
the decoder.
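To make the decoding objective concrete, the following minimal Python sketch scores a handful of candidate sentences with toy language-model and translation-model probabilities; the probability tables, candidates, and sentences are invented purely for illustration.

```python
# Minimal sketch of noisy-channel decoding: pick the candidate S that
# maximizes P(S) * P(T|S). The probability tables below are toy values.

def decode(target_sentence, candidates, lm_prob, tm_prob):
    """Return the candidate source sentence with the highest P(S) * P(T|S)."""
    best, best_score = None, float("-inf")
    for s in candidates:
        score = lm_prob.get(s, 0.0) * tm_prob.get((target_sentence, s), 0.0)
        if score > best_score:
            best, best_score = s, score
    return best, best_score

# Toy example: decode the French sentence "le chien" into English.
lm_prob = {"the dog": 0.02, "dog the": 0.0001}           # language model P(S)
tm_prob = {("le chien", "the dog"): 0.5,                  # translation model P(T|S)
           ("le chien", "dog the"): 0.5}

print(decode("le chien", ["the dog", "dog the"], lm_prob, tm_prob))
# ('the dog', 0.01) -- the language model breaks the tie in favour of fluent output
```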
Brown et al. (1988) proposed the idea that machine translation ought to be based on a
complex glossary of correspondences of fixed locutions, such as [word = mot] or [ate = a mangé],
a glossary mapping between English and French. The task can be decomposed into three steps.
First, partition the source text into a set of fixed locutions. Second, use the glossary plus
contextual information to select the corresponding set of fixed locutions in the target language.
Third, arrange the words of the target fixed locutions into a sequence that forms the target
sentence. A probabilistic method is then required to identify the corresponding words in the
target and source sentences. However, word-based translation translates every single word in a
single fixed way and fails to take into account case, gender, and homonymy. Koehn et al.
(2003) proposed phrase-based statistical machine translation, in which the restriction to word-
to-word translation is removed. Instead, the lexical unit is a sequence of words of any length.
The format of a lexical entry is (one or more foreign-language words, one or more source-
language words, a 'score' for the lexical entry). The phrase-based translation model, where e
denotes English and f denotes the foreign language, is as follows:
$$\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(f \mid e)\, p_{LM}(e)$$

where

$$p(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1})$$

Here f is segmented into a sequence of I phrases \bar{f}_1^I, and each phrase \bar{f}_i is translated
into an English phrase \bar{e}_i. Phrase translation is modeled by a probability distribution
\phi(\bar{f}_i \mid \bar{e}_i), and reordering of the English output phrases is modeled by a relative
distortion probability distribution d(a_i - b_{i-1}), where a_i denotes the start position of the
foreign phrase that was translated into the i-th English phrase and b_{i-1} denotes the end
position of the foreign phrase translated into the (i-1)-th English phrase. They also argue that
the establishment of lexical correspondences does not have to be at the word level but can also
be at the phrase level.
To learn this particular model, a heuristics-based learning algorithm was used. Specifically,
they used a French-English parallel corpus of 100,000 sentence pairs to train the model. Their
experimental results show that phrase-based translation outperforms traditional word-based translation.
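As a rough illustration of the model above, the sketch below scores one segmentation of a foreign sentence using a toy phrase table for φ(f̄ | ē) and an exponential distortion penalty for d(a_i − b_{i−1}); the phrase-table entries and the penalty base ALPHA are assumptions made only for this example, not values from Koehn et al.

```python
# Sketch of phrase-based scoring: product over phrases of
# phi(f_phrase | e_phrase) * d(a_i - b_{i-1}), with d modeled here as an
# exponential distortion penalty ALPHA ** |a_i - b_{i-1} - 1|.
# Phrase-table values and ALPHA are toy assumptions.

ALPHA = 0.5  # assumed distortion penalty base

phrase_table = {  # phi(f | e): toy probabilities
    ("le chien", "the dog"): 0.6,
    ("dort", "sleeps"): 0.7,
}

def distortion(start_i, end_prev):
    """Relative distortion d(a_i - b_{i-1}); monotone translation costs nothing."""
    return ALPHA ** abs(start_i - end_prev - 1)

def score_segmentation(phrase_pairs):
    """phrase_pairs: list of (f_phrase, e_phrase, a_i, b_i) in English output order."""
    score, prev_end = 1.0, 0
    for f, e, start, end in phrase_pairs:
        score *= phrase_table.get((f, e), 1e-9) * distortion(start, prev_end)
        prev_end = end
    return score

# "le chien dort" -> "the dog sleeps": the phrases are translated in order,
# so both distortion terms equal 1 and the score is 0.6 * 0.7 = 0.42.
pairs = [("le chien", "the dog", 1, 2), ("dort", "sleeps", 3, 3)]
print(score_segmentation(pairs))
```

A reordering segmentation would visit the foreign positions out of order, and the distortion term would then multiply in a penalty of ALPHA raised to the size of the jump.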
Neural Machine Translation
Deep neural networks have made promising progress in computer vision (Krizhevsky et al.
2012) and speech recognition (Hinton et al. 2012); since then, they have also been applied to
many natural language processing tasks such as word embedding learning (Mikolov et
al. 2013). Kalchbrenner and Blunsom (2013) introduced the Recurrent Continuous Translation
Model, a class of probabilistic continuous translation models based purely on continuous
representations of words, phrases, and sentences that do not rely on alignments or phrasal
translation units. This was the first work to use a neural network for machine translation without
including a statistical machine translation system.
The major difference between statistical machine translation and neural machine
translation is that in statistical machine translation we consider either the joint probability of co-
occurrence of source- and target-language phrases or the conditional probability of generating a
target-language phrase given a source-language phrase; in essence, phrases are treated as
distinct units. They may share many significant similarities, linguistic or otherwise, but they do
not share statistical weight during translation prediction. This leads to sparse or skewed
estimates for the large number of rare or unseen phrase pairs, which makes adapting the model
to other, similar domains challenging. In contrast, neural machine translation models continuous
representations of linguistic units. The background common to almost all neural machine
translation systems is the recurrent neural network (RNN) and the encoder-decoder architecture.
Cho et al. (2014) first proposed the RNN encoder-decoder architecture, though their aim was to
use it within a statistical machine translation system. Sutskever et al. (2014) gave a more formal
introduction to the sequence-to-sequence RNN encoder-decoder architecture.
An RNN is a natural generalization of feedforward neural networks to sequences. Given a
sequence of inputs (x_1, ..., x_T), an RNN computes a sequence of outputs (y_1, ..., y_T) by
iterating a hidden state h; at each step t, the hidden state h_t is computed by

$$h_t = f(h_{t-1}, x_t)$$

where f is a non-linear activation function. It can be as simple as the sigmoid function or as
complex as a Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber 1997). The
reason for using an LSTM is that, although an RNN is provided with all the relevant information,
it is difficult to train because of the resulting long-term dependencies (Hochreiter et al. 2001).
During the gradient computation, if many of the multiplied values are less than 1, the gradients
for time steps further back become smaller and smaller; these so-called vanishing gradients bias
the parameters toward capturing short-term dependencies and forgetting long-term ones. With a
vocabulary of size V, an RNN can be trained to predict the distribution over x_t given the word
history x_1, ..., x_{t-1} at each time step t. Thus we obtain the Recurrent Language Model
(RLM), which computes the probability of the sequence x as

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1)$$
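The following numpy sketch makes the recurrence h_t = f(h_{t-1}, x_t) and the recurrent-language-model factorization concrete; the tanh activation, the dimensions, and the random weights are arbitrary choices for illustration rather than a trained model.

```python
import numpy as np

# Sketch of a vanilla RNN language model: h_t = tanh(W_h h_{t-1} + W_x x_t + b),
# with a softmax over the vocabulary at every step. It only illustrates the
# recurrence and the factorization p(x) = prod_t p(x_t | x_1..x_{t-1}).

rng = np.random.default_rng(0)
V, H = 10, 8                                   # vocabulary size, hidden size
W_x = rng.normal(scale=0.1, size=(H, V))       # input-to-hidden (one-hot inputs)
W_h = rng.normal(scale=0.1, size=(H, H))       # hidden-to-hidden
W_o = rng.normal(scale=0.1, size=(V, H))       # hidden-to-output
b_h, b_o = np.zeros(H), np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_log_prob(tokens):
    """log p(x) = sum_t log p(x_t | x_1..x_{t-1}) under the toy RNN language model."""
    h = np.zeros(H)
    log_p = 0.0
    prev = 0                                   # assume index 0 is a start symbol
    for t in tokens:
        x = np.eye(V)[prev]                    # one-hot encoding of the previous token
        h = np.tanh(W_h @ h + W_x @ x + b_h)   # recurrence h_t = f(h_{t-1}, x_t)
        p = softmax(W_o @ h + b_o)             # distribution over the next token
        log_p += np.log(p[t])
        prev = t
    return log_p

print(sequence_log_prob([3, 1, 4]))            # log-probability of a toy token sequence
```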
The encoder-decoder architecture works as follows. The encoder is an RNN that reads
each symbol of an input sequence x sequentially; once the whole input has been read, its hidden
state provides a summary c of the input sequence. The decoder, another RNN, is then trained to
generate the output sequence by predicting the next symbol y_t given its hidden state h_t, the
previous symbol y_{t-1}, and the summary c of the input sequence. Thus, the hidden state of the
decoder at time t is computed by

$$h_t = f(h_{t-1}, y_{t-1}, c)$$

and the conditional distribution of the next symbol is

$$P(y_t \mid y_{t-1}, \ldots, y_1, c) = g(h_t, y_{t-1}, c)$$

where f is an activation function and g can be, for example, a softmax, so that valid probabilities
are produced. The RNN encoder and decoder are jointly trained to maximize the conditional
log-likelihood

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$$
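A minimal PyTorch sketch of such an RNN encoder-decoder follows: the encoder's final hidden state serves as the summary c, and the decoder is trained to maximize the conditional log-likelihood through a cross-entropy loss. The GRU cell, the layer sizes, and the random toy batch are assumptions made only to show the shape of the computation.

```python
import torch
import torch.nn as nn

# Minimal RNN encoder-decoder sketch. The encoder's last hidden state is the
# summary c of the source sentence and initializes the decoder. Sizes and the
# toy batch below are arbitrary illustration values.

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        _, c = self.encoder(self.src_emb(src))           # c: summary of the source
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), c)
        return self.out(dec_states)                      # logits over the target vocab

# Toy training step: maximizing log p(y | x) amounts to minimizing cross-entropy.
model = EncoderDecoder(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (4, 7))         # batch of 4 source sentences, length 7
tgt = torch.randint(0, 120, (4, 6))         # corresponding target sentences, length 6
logits = model(src, tgt[:, :-1])            # teacher forcing: feed y_{<t}
loss = nn.functional.cross_entropy(logits.reshape(-1, 120), tgt[:, 1:].reshape(-1))
loss.backward()                             # gradients for joint end-to-end training
```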
Attention Mechanism
The "attention" concept has gained popularity recently in training neural networks,
allowing models to learn alignments between different modalities. For example, Mnih et
al. (2014) used the attention concept between image objects and agent actions in a dynamic
control problem, and Xu et al. (2015) used it between the visual features of a picture and its text
description in the image caption generation task. The standard RNN encoder-decoder
architecture encodes a source sentence into a fixed-length vector from which a decoder generates
a translation. The potential problem with this architecture is that the neural network needs to
compress all the necessary information of a source sentence into a fixed-length vector, which
makes it difficult to handle long sentences. Cho et al. (2014) showed that the performance of a
basic RNN encoder-decoder architecture deteriorates rapidly as the length of an input sentence
increases. Bahdanau et al. (2015) proposed an attention mechanism that selectively focuses on
parts of the source sentence during translation. More specifically, it does not aim to encode a
whole input into a single fixed-length vector; instead, it encodes the input sentence into a
sequence of vectors and chooses a subset of these vectors adaptively, paying attention to
specific vectors of the input sequence according to the attention weights while decoding the
translation. This new method can learn to align and translate jointly, based on the context
vectors associated with these source positions and all the previously generated target words. In
this way, it frees a neural translation model from having to squash all the information of a source
sentence into a fixed-length vector, regardless of its length.
In the new architecture, the conditional probability is defined as

$$p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$$

where s_i is an RNN hidden state for time i, computed by

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an
encoder maps the input sentence. The context vector c_i is then computed as a weighted sum of
these annotations h_j:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

The weight α_ij of each annotation h_j is computed by

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$

where a is an alignment model which scores how well the inputs around position j and the output
at position i match. Their experimental results show that the proposed approach of jointly
learning to align and translate achieves significantly better translation performance than the
basic encoder-decoder approach, and the improvement is more apparent with longer sentences.
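The numpy sketch below reproduces this computation for one decoder step: an additive alignment model a(s_{i-1}, h_j) produces scores e_ij, a softmax turns them into weights α_ij, and the context vector c_i is their weighted sum over the annotations. The weight matrices and dimensions are arbitrary illustration values, not trained parameters.

```python
import numpy as np

# Sketch of Bahdanau-style attention for one decoder step i:
#   e_ij     = v^T tanh(W_a s_{i-1} + U_a h_j)    (additive alignment model)
#   alpha_ij = softmax_j(e_ij)
#   c_i      = sum_j alpha_ij * h_j
# All weights and sizes below are arbitrary assumptions for illustration.

rng = np.random.default_rng(1)
T_x, enc_dim, dec_dim, attn_dim = 5, 6, 6, 4
H = rng.normal(size=(T_x, enc_dim))      # annotations h_1..h_Tx from the encoder
s_prev = rng.normal(size=dec_dim)        # previous decoder state s_{i-1}
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                     # attention weights alpha_ij, summing to 1
c_i = alpha @ H                          # context vector: weighted sum of annotations

print(alpha.round(3), c_i.shape)         # weights over source positions, (enc_dim,)
```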
Luong et al. (2015) proposed two effective classes of attentional mechanism: a global
approach and a local approach. The global approach always attends to all source words. Its
drawback is that it is computationally expensive, which can make it impractical for translating
longer sequences such as paragraphs or documents. The local approach is therefore proposed to
look at only a subset of source words at a time. It is a middle ground between the hard and soft
attention models proposed by Xu et al. (2015): compared to the global approach it is less
computationally expensive, and compared to hard attention it is differentiable almost
everywhere, making it easier to implement and train. The global approach is similar to the
architecture of Bahdanau et al. (2015) but simpler. More specifically, Bahdanau et al. (2015) use
the concatenation of the forward and backward hidden states of a bi-directional encoder and the
previous target hidden states in their non-stacking unidirectional decoder, whereas Luong et
al. (2015) use the hidden states at the top LSTM layers in both the encoder and the decoder.
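To contrast the two, the sketch below shows only the core difference in where attention is computed: global attention scores every source state, while local attention restricts the softmax to a window of width 2D around an aligned position p_t. The dot-product scoring function, the window half-width D, and the toy tensors are assumptions for illustration, and the Gaussian weighting and predictive alignment of the full local model are omitted.

```python
import numpy as np

# Sketch contrasting global vs. local attention. Scoring uses a simple dot
# product; D and all toy vectors are illustrative assumptions.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(h_t, source_states):
    """Attend to all source states."""
    scores = source_states @ h_t
    return softmax(scores) @ source_states

def local_attention(h_t, source_states, p_t, D=2):
    """Attend only to a window [p_t - D, p_t + D] around the aligned position."""
    lo, hi = max(0, p_t - D), min(len(source_states), p_t + D + 1)
    window = source_states[lo:hi]
    scores = window @ h_t
    return softmax(scores) @ window

rng = np.random.default_rng(2)
src = rng.normal(size=(10, 8))                  # 10 encoder states of dimension 8
h_t = rng.normal(size=8)                        # current decoder hidden state
print(global_attention(h_t, src).shape)         # (8,) context from all 10 positions
print(local_attention(h_t, src, p_t=6).shape)   # (8,) context from positions 4..8 only
```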
Further Assessment
If we carefully examine the statistical translation model proposed by Brown et al. (1988),
there are some unrealistic assumptions. Given a source string S of length n consisting of words
s_1, s_2, ..., s_n, the language model P(S) in the definition of statistical machine translation can
be expressed mathematically as

$$P(S) = \prod_{k=1}^{n} P(s_k \mid s_1, \ldots, s_{k-1})$$

Computing the language model therefore requires probabilities for a word given its entire
history, which is not realistic because there are simply too many possible histories for a word.
The requirement is relaxed by assuming that the current word depends only on a fixed subset of
its history, for example, only on the previous two words. But consider the sentence "I grew up in
China, I speak fluent __": if the model depends only on the previous two words, it is difficult to
make an accurate prediction, and thus the translation becomes less accurate.
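The fixed-history approximation discussed above can be written down in a few lines; the counts here come from a tiny invented corpus and there is no smoothing, so this is only a sketch of the trigram assumption and its limits.

```python
from collections import Counter

# Sketch of the trigram approximation P(w_k | w_1..w_{k-1}) ~ P(w_k | w_{k-2}, w_{k-1}).
# The "corpus" is a tiny invented example and no smoothing is applied.

corpus = "i grew up in china and i speak fluent chinese".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# With only two words of history ("speak fluent"), the model cannot make use of
# the earlier mention of "china", illustrating the limitation described above.
print(trigram_prob("speak", "fluent", "chinese"))   # 1.0 in this toy corpus
```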
Wang and Ward (1999) mention three subproblems of statistical machine
translation. First, the modeling problem: it is not clear how the process of generating a sentence
in the source language should be depicted, or what process the channel uses to generate a target
sentence upon receiving a source sentence. Second, the learning problem: given P(S) and
P(T|S), it is still unclear how the parameters of these models can be estimated from a bilingual
corpus of parallel sentences. Third, the decoding problem: the argmax, which searches over all
candidate sentences, is referred to as the decoder, but how do we identify the best decoder?
These questions all remain open.
The phrase-based translation system removed the restriction of translating the source
sentence into the target sentence word by word, which constitutes a significant departure from
the word-based IBM models. However, both the word-based and the phrase-based models fail to
take the structural or syntactic aspects of language into consideration. The experiments were run
on English and French, a structurally similar language pair, and we would suspect that a
language pair with very different word order, such as English-Japanese or English-German,
would not be modeled well by these models. There are works that try to capture the external
structure, such as Wang and Ward (1999) and Och et al. (1999). Yamada and Knight (2001)
introduced a channel model that aims to capture the internal structure. The model accepts a
parse tree as input, which means the input sentence is preprocessed by a syntactic parser. The
channel then performs operations on each node of the parse tree: reordering child nodes,
inserting extra words at each node, and translating leaf words.
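A rough Python sketch of these three channel operations applied recursively to a toy parse tree is shown below; the reordering rule, the inserted particle, and the word-to-word dictionary are invented purely to illustrate the shape of the computation, whereas the real model of Yamada and Knight learns probability distributions over all of these choices.

```python
# Sketch of the three syntax-based channel operations: reorder child nodes,
# insert extra words, translate leaf words. The rules and the lexicon below
# are toy assumptions, not the learned tables of the actual model.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                  # set only on leaf nodes

DICTIONARY = {"he": "kare", "adores": "daisuki", "music": "ongaku"}   # toy lexicon

def transform(node: Node) -> Node:
    if node.word is not None:                           # 3. translate leaf words
        return Node(node.label, word=DICTIONARY.get(node.word, node.word))
    children = [transform(c) for c in node.children]
    if node.label == "VP":                              # 1. reorder child nodes
        children = list(reversed(children))             #    (toy SVO -> SOV rule)
    if node.label == "S":                               # 2. insert extra words
        children.insert(1, Node("PART", word="wa"))     #    (toy topic particle)
    return Node(node.label, children)

def yield_words(node: Node) -> List[str]:
    return [node.word] if node.word is not None else sum(
        (yield_words(c) for c in node.children), [])

tree = Node("S", [Node("NP", word="he"),
                  Node("VP", [Node("V", word="adores"), Node("NP", word="music")])])
print(" ".join(yield_words(transform(tree))))   # "kare wa ongaku daisuki"
```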
Neural machine translation with the RNN encoder-decoder architecture, helped by the LSTM,
has shown promising results compared to phrase-based statistical machine translation. It builds a
single neural network that reads a source sentence and generates its translation. But there are
still some limitations.
First, compared to the phrase-based approach, neural machine translation supports only a
very limited number of target words, mainly because the complexity of training and using a
neural machine translation model increases as the number of target words increases. Moreover,
the memory requirement grows linearly with the number of target words. The resulting problems
are, first, that neural machine translation has difficulty translating very rare words, and second,
that if the translation of a source sentence requires many words that are not in the vocabulary,
the model's performance degrades dramatically, especially for languages with a rich set of word
forms such as German or other highly inflected languages such as Latin, Polish, and Finnish. A
reasonable solution would therefore be to include a large vocabulary in neural machine
translation without increasing the computational complexity and memory requirements. Jean et
al. (2014) propose an approximate training algorithm based on (biased) importance sampling that
is able to train a neural machine translation model with a much larger target vocabulary; the
computational complexity during training stays at the level of using only a small subset of the
full vocabulary. Once the model with a very large target vocabulary is trained, one can choose to
use either all the target words or only a subset of them. Experimental results from Koehn and
Knowles (2017) show that the quality of neural machine translation starts much lower, overtakes
statistical machine translation at about 15 million words of training data, and even beats a
statistical machine translation system with a big 2-billion-word in-domain language model under
high-resource conditions. This is a very promising sign for the further progress of neural
machine translation. Luong et al. (2014) target the problem that neural machine translation has
difficulty translating rare words by training a neural machine translation system on data that is
augmented by the output of a word alignment algorithm, allowing the system to emit, for each
out-of-vocabulary word in the target sentence, the position of its corresponding word in the
source sentence.
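That post-processing idea can be sketched as follows: whenever the system emits an out-of-vocabulary placeholder, a pointer to the aligned source position lets the unknown word be copied or looked up in an external dictionary. The placeholder format, the alignment, and the (empty) dictionary here are invented for the example and only illustrate the spirit of the approach.

```python
# Sketch of rare-word post-processing in the spirit of Luong et al. (2014):
# each <unk> token carries a pointer to its aligned source word, which is then
# looked up in an external dictionary or copied verbatim.
# Token format, alignment, and dictionary are toy assumptions.

def replace_unks(source_tokens, output_tokens, alignment, dictionary):
    """alignment[i] gives the source position aligned to target position i."""
    result = []
    for i, tok in enumerate(output_tokens):
        if tok == "<unk>":
            src_word = source_tokens[alignment[i]]
            result.append(dictionary.get(src_word, src_word))   # translate or copy
        else:
            result.append(tok)
    return result

source = ["Ich", "besuche", "Heidelberg"]
output = ["I", "visit", "<unk>"]        # the rare name fell outside the vocabulary
alignment = {2: 2}                      # target position 2 aligned to source position 2
print(replace_unks(source, output, alignment, dictionary={}))
# ['I', 'visit', 'Heidelberg'] -- the unknown word is copied from the source
```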
Second, words from different domains have different meanings, translations, and styles.
Both statistical machine translation and neural machine translation face the difficulty of
achieving robust performance across domains; that is, they are not stable when confronted with
conditions that differ significantly from the training conditions. Koehn and Knowles (2017)
compared the two translation paradigms across different domains and found that while the in-
domain performance of both systems is similar (neural machine translation is better for IT and
Subtitles, statistical machine translation is better for Law, Medical, and Koran), the out-of-
domain performance of the neural machine translation systems is worse in almost all cases,
sometimes dramatically so. A possible solution could be combining the large-vocabulary
approach of Jean et al. (2014) with the assignment of alignment scores based on domains, which
is something I can explore further later.
Conclusion
Machine translation has been a popular research area within the artificial intelligence
field for many years. Statistical machine translation is a machine translation paradigm in which
translations are generated on the basis of statistical models. Based on the assumption that every
source sentence has a possible translation in the target language, the statistical machine
translation model searches over candidate sentences to find the one that maximizes the product
of the language model and the translation model probabilities. There are mainly two types of
statistical machine translation: word-based translation and phrase-based translation. Phrase-based
translation overcomes the problem faced by word-based translation, namely that translating
every single word in a single fixed way fails to take into account case, gender, and homonymy.
Neural machine translation is a recently proposed approach to machine translation, which uses a
single neural network trained jointly to maximize translation performance. It is often
implemented as the RNN encoder-decoder architecture. The whole model is jointly trained to
maximize the conditional log-probability of the correct translation given a source sentence with
respect to the model parameters. The attention mechanism was introduced to cope with the
problem that the performance of a basic RNN encoder-decoder architecture deteriorates rapidly
as the length of an input sentence increases. However, despite its great progress, neural machine
translation still has to overcome various challenges. First, neural machine translation has
difficulty translating very rare words; if the translation of a source sentence requires many words
that are not in the vocabulary, the model's performance degrades dramatically. Second, the
out-of-domain performance of neural machine translation needs to be improved. A possible
solution for this particular problem is combining the large-vocabulary approach of Jean et
al. (2014) with the assignment of alignment scores based on domains, which is something I can
explore further later.
References
Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to
Align and Translate. ArXiv:1409.0473 [Cs, Stat]. http://arxiv.org/abs/1409.0473
Bentivogli, L., Bisazza, A., Cettolo, M., & Federico, M. (2016). Neural versus Phrase-Based
Machine Translation Quality: A Case Study. Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, 257-267. https://doi.org/10.18653/v1/D16-1025
Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D., Jelinek, F., Mercer, R., & Roossin, P. (1988). A
statistical approach to language translation. Proceedings of the 12th Conference on
Computational Linguistics, 1, 71-76. https://doi.org/10.3115/991635.991651
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R.
L., & Roossin, P. S. (1990). A statistical approach to machine translation. Computational
Linguistics, 16(2), 79-85.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical
Machine Translation. ArXiv:1406.1078 [Cs, Stat]. http://arxiv.org/abs/1406.1078
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent
nets: The difficulty of learning long-term dependencies. In J. F. Kolen & S. C. Kremer (Eds.),
A Field Guide to Dynamical Recurrent Networks. IEEE. https://doi.org/10.1109/9780470544037.ch14
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8),
1735-1780.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Vanhoucke, V., Nguyen, P.,
Sainath, T., & Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling in Speech
Recognition. IEEE Signal Processing Magazine, 29(6), 82-97.
Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2015). On Using Very Large Target Vocabulary
for Neural Machine Translation. ArXiv:1412.2007 [Cs]. http://arxiv.org/abs/1412.2007
Kalchbrenner, N., & Blunsom, P. (2013). Recurrent continuous translation models. Proceedings
of the 2013 Conference on Empirical Methods in Natural Language Processing.
Koehn, P., & Knowles, R. (2017). Six Challenges for Neural Machine Translation. Proceedings
of the First Workshop on Neural Machine Translation, 28-39. https://doi.org/10.18653/v1/W17-3204
Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. Proceedings of
HLT-NAACL 2003, Main Papers, 48-54. Edmonton, May-June 2003.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84-90.
https://doi.org/10.1145/3065386
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based
Neural Machine Translation. ArXiv:1508.04025 [Cs]. http://arxiv.org/abs/1508.04025
Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., & Zaremba, W. (2015). Addressing the Rare
Word Problem in Neural Machine Translation. ArXiv:1410.8206 [Cs].
http://arxiv.org/abs/1410.8206
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. Advances in Neural
Information Processing Systems 26.
Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. Advances in
Neural Information Processing Systems 27.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural
networks. Advances in Neural Information Processing Systems 27.
Wang, Y.-Y. (n.d.). Grammar inference and statistical machine translation.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y.
(2016). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
ArXiv:1502.03044 [Cs]. http://arxiv.org/abs/1502.03044
Yamada, K., & Knight, K. (2001). A syntax-based statistical translation model. Proceedings of
the 39th Annual Meeting on Association for Computational Linguistics - ACL ’01, 523–530.
https://doi.org/10.3115/1073012.1073079