MACHINE TRANSLATION
CSC872 Pattern Analysis and Machine Intelligence Final Report
DECEMBER 2020
LINSI LIN
Table of Contents
Introduction
Statistical Machine Translation
Neural Machine Translation
Attention Mechanism
Further Assessment
Conclusion
References
Introduction
The field of machine translation is almost as old as the modern digital computer. It is a subfield
of computational linguistics that aims to translate text from one language to another
automatically using a computing device. In 1949, Warren Weaver proposed the idea of
approaching this problem statistically, but it wasn't until Brown et al. (1988) that the approach
was outlined in more detail. They took the view that every sentence in one language is a possible
translation of any sentence in another language. The statistical machine translation model
searches over candidate sentences and tries to find the one that maximizes the product of the
language model and the translation model probabilities. There are mainly two types of statistical
machine translation: word-based translation and phrase-based translation. Word-based translation
translates every word in a single fixed way and fails to take into account case, gender, and
homonymy. Koehn et al. (2003) proposed phrase-based statistical machine translation, in which
the restriction to word-to-word translation is removed; instead, the lexical unit is a sequence of
words of any length.
The wave of neural models eventually reached the field of machine translation.
Neural machine translation is a recently proposed approach to machine translation, which uses a
single neural network trained jointly to maximize translation performance. The whole model
is jointly trained to maximize the conditional log-probability of the correct translation given a
source sentence with respect to the model parameters. Before 2015, neural machine translation
was criticized for being too computationally expensive and resource demanding to compete with
state-of-the-art phrase-based statistical machine translation. But in 2015, major progress in neural
machine translation was made with the recurrent neural network encoder-decoder architecture
together with the long short-term memory. For the first time, neural machine translation overtook
a variety of phrase-based statistical machine translation approaches by a large margin on a
difficult language pair such as English-German. Bahdanau et al. (2014) proposed an attention
mechanism to deal with the problem that the performance of a basic recurrent neural network
encoder-decoder architecture deteriorates rapidly as the length of an input sentence increases.
Compared to statistical machine translation, neural machine translation has several
advantages. First, the whole system is jointly tuned to maximize translation performance,
unlike a phrase-based system, which consists of many feature functions that are tuned
separately. Second, the memory footprint of a neural machine translation model is often much
smaller than that of existing systems, which rely on maintaining large tables of phrase pairs.
Despite this great progress, neural machine translation still has to overcome various
challenges. First, neural machine translation has difficulty translating very rare words. If the
translation of a source sentence requires many words that are not in the vocabulary, the model's
performance degrades dramatically. Second, the out-of-domain performance of neural machine
translation needs to be improved. Koehn and Knowles (2017) compared the two translation
paradigms across different domains and found that while in-domain performance is similar
(neural machine translation is better for IT and Subtitles, statistical machine translation is better
for Law, Medical, and Koran), the out-of-domain performance of the neural machine translation
systems is worse in almost all cases, sometimes dramatically so. A possible solution for this
particular problem is combining the large-vocabulary approach of Jean et al. (2014) with the
assignment of alignment scores based on domains, which is something I can explore further later.
Statistical Machine Translation
In 1949, Warren Weaver proposed the idea of using statistical methods to tackle machine
translation. However, it wasn't until Brown et al. (1988) that the approach was outlined in more
detail. By that time, computers could process large corpora thanks to faster processors and larger
storage, and statistical methods had already proven their value in fields such as automatic speech
recognition, lexicography, and natural language processing (Brown et al. 1990). In their paper,
they took the view that every sentence in one language is a possible translation of any sentence
in another language. If we define S as the source-language sentence and T as the target-language
sentence, then we can assign to each (S, T) sentence pair a probability P(T|S), the probability that
a translator will produce T in the target language when presented with S in the source language.
Because P(T|S) alone can be very small or very large depending on the content of the pair, they
reframed the machine translation problem as searching among possible source sentences S for
the one that gives the greatest value of P(S)P(T|S). Written mathematically:

$$\hat{S} = \arg\max_{S} P(S \mid T) = \arg\max_{S} P(S)\, P(T \mid S)$$

Here P(S) denotes the language model and P(T|S) denotes the translation model. The argmax,
which operates over all candidate sentences, represents the search problem and is referred to as
the decoder.
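To make the decoding objective concrete, the following minimal Python sketch scores a handful of candidate sentences with toy language-model and translation-model probabilities; the probability tables, candidates, and sentences are invented purely for illustration.

```python
# Minimal sketch of noisy-channel decoding: pick the candidate S that
# maximizes P(S) * P(T|S). The probability tables below are toy values.

def decode(target_sentence, candidates, lm_prob, tm_prob):
    """Return the candidate source sentence with the highest P(S) * P(T|S)."""
    best, best_score = None, float("-inf")
    for s in candidates:
        score = lm_prob.get(s, 0.0) * tm_prob.get((target_sentence, s), 0.0)
        if score > best_score:
            best, best_score = s, score
    return best, best_score

# Toy example: decode the French sentence "le chien" into English.
lm_prob = {"the dog": 0.02, "dog the": 0.0001}           # language model P(S)
tm_prob = {("le chien", "the dog"): 0.5,                  # translation model P(T|S)
           ("le chien", "dog the"): 0.5}

print(decode("le chien", ["the dog", "dog the"], lm_prob, tm_prob))
# ('the dog', 0.01) -- the language model breaks the tie in favour of fluent output
```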
Brown et al. (1988) proposed the idea that machine translation ought to be based on a
complex glossary of correspondences of fixed locutions, such as [word = mot] or [ate = a mangé],
a glossary mapping between English and French. The task can be decomposed into three steps.
First, partition the source text into a set of fixed locutions. Second, use the glossary plus
contextual information to select the corresponding set of fixed locutions in the target language.
Third, arrange the words of the target fixed locutions into a sequence that forms the target
sentence. A probabilistic method is then required to identify the corresponding words in the
target and source sentences. However, word-based translation translates every single word in a
single fixed way and fails to take into account case, gender, and homonymy. Koehn et al.
(2003) proposed phrase-based statistical machine translation, in which the restriction to word-
to-word translation is removed. Instead, the lexical unit is a sequence of words of any length.
The format of a lexical entry is (one or more foreign-language words, one or more source-
language words, a 'score' for the lexical entry). The phrase-based translation model, where e
denotes English and f denotes the foreign language, is as follows:
$$\hat{e} = \arg\max_{e} p(e \mid f) = \arg\max_{e} p(f \mid e)\, p_{LM}(e)$$

where

$$p(\bar{f}_1^I \mid \bar{e}_1^I) = \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)\, d(a_i - b_{i-1})$$

Here f is segmented into a sequence of I phrases \bar{f}_1^I, and each phrase \bar{f}_i is translated
into an English phrase \bar{e}_i. Phrase translation is modeled by a probability distribution
\phi(\bar{f}_i \mid \bar{e}_i), and reordering of the English output phrases is modeled by a relative
distortion probability distribution d(a_i - b_{i-1}), where a_i denotes the start position of the
foreign phrase that was translated into the i-th English phrase and b_{i-1} denotes the end
position of the foreign phrase translated into the (i-1)-th English phrase. They also argue that
the establishment of lexical correspondences does not have to be at the word level but can also
be at the phrase level.
To learn this particular model, a heuristics-based learning algorithm was used. Specifically,
they used a French-English parallel corpus of 100,000 sentence pairs to train the model. Their
experimental results show that phrase-based translation outperforms traditional word-based translation.
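As a rough illustration of the model above, the sketch below scores one segmentation of a foreign sentence using a toy phrase table for φ(f̄ | ē) and an exponential distortion penalty for d(a_i − b_{i−1}); the phrase-table entries and the penalty base ALPHA are assumptions made only for this example, not values from Koehn et al.

```python
# Sketch of phrase-based scoring: product over phrases of
# phi(f_phrase | e_phrase) * d(a_i - b_{i-1}), with d modeled here as an
# exponential distortion penalty ALPHA ** |a_i - b_{i-1} - 1|.
# Phrase-table values and ALPHA are toy assumptions.

ALPHA = 0.5  # assumed distortion penalty base

phrase_table = {  # phi(f | e): toy probabilities
    ("le chien", "the dog"): 0.6,
    ("dort", "sleeps"): 0.7,
}

def distortion(start_i, end_prev):
    """Relative distortion d(a_i - b_{i-1}); monotone translation costs nothing."""
    return ALPHA ** abs(start_i - end_prev - 1)

def score_segmentation(phrase_pairs):
    """phrase_pairs: list of (f_phrase, e_phrase, a_i, b_i) in English output order."""
    score, prev_end = 1.0, 0
    for f, e, start, end in phrase_pairs:
        score *= phrase_table.get((f, e), 1e-9) * distortion(start, prev_end)
        prev_end = end
    return score

# "le chien dort" -> "the dog sleeps": the phrases are translated in order,
# so both distortion terms equal 1 and the score is 0.6 * 0.7 = 0.42.
pairs = [("le chien", "the dog", 1, 2), ("dort", "sleeps", 3, 3)]
print(score_segmentation(pairs))
```

A reordering segmentation would visit the foreign positions out of order, and the distortion term would then multiply in a penalty of ALPHA raised to the size of the jump.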
Neural Machine Translation
Deep neural networks have made promising progress in computer vision (Krizhevsky et al.
2012) and speech recognition (Hinton et al. 2012); since then, they have also been applied to
many natural language processing tasks such as word embedding learning (Mikolov et
al. 2013). Kalchbrenner and Blunsom (2013) introduced the Recurrent Continuous Translation
Model, a class of probabilistic continuous translation models based purely on continuous
representations of words, phrases, and sentences that do not rely on alignments or phrasal
translation units. This was the first work to use a neural network for machine translation without
including a statistical machine translation system.
The major difference between statistical machine translation and neural machine
translation is that in statistical machine translation we consider either the joint probability of co-
occurrence of source- and target-language phrases or the conditional probability of generating a
target-language phrase given a source-language phrase; in essence, phrases are treated as
distinct units. They may share many significant similarities, linguistic or otherwise, but they do
not share statistical weight during translation prediction. This leads to sparse or skewed
estimates for the large number of rare or unseen phrase pairs, which makes adapting the model
to other, similar domains challenging. In contrast, neural machine translation models continuous
representations of linguistic units. The background common to almost all neural machine
translation systems is the recurrent neural network (RNN) and the encoder-decoder architecture.
Cho et al. (2014) first proposed the RNN encoder-decoder architecture, though their aim was to
use it within a statistical machine translation system. Sutskever et al. (2014) gave a more formal
introduction to the sequence-to-sequence RNN encoder-decoder architecture.
An RNN is a natural generalization of feedforward neural networks to sequences. Given a
sequence of inputs (x_1, ..., x_T), an RNN computes a sequence of outputs (y_1, ..., y_T) by
iterating a hidden state h; at each step t, the hidden state h_t is computed by

$$h_t = f(h_{t-1}, x_t)$$

where f is a non-linear activation function. It can be as simple as the sigmoid function or as
complex as a Long Short-Term Memory (LSTM) unit (Hochreiter and Schmidhuber 1997). The
reason for using an LSTM is that, although an RNN is provided with all the relevant information,
it is difficult to train because of the resulting long-term dependencies (Hochreiter et al. 2001).
During the gradient computation, if many of the multiplied values are less than 1, the gradients
for time steps further back become smaller and smaller; these so-called vanishing gradients bias
the parameters toward capturing short-term dependencies and forgetting long-term ones. With a
vocabulary of size V, an RNN can be trained to predict the distribution over x_t given the word
history x_1, ..., x_{t-1} at each time step t. Thus we obtain the Recurrent Language Model
(RLM), which computes the probability of the sequence x as

$$p(x) = \prod_{t=1}^{T} p(x_t \mid x_{t-1}, \ldots, x_1)$$
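The following numpy sketch makes the recurrence h_t = f(h_{t-1}, x_t) and the recurrent-language-model factorization concrete; the tanh activation, the dimensions, and the random weights are arbitrary choices for illustration rather than a trained model.

```python
import numpy as np

# Sketch of a vanilla RNN language model: h_t = tanh(W_h h_{t-1} + W_x x_t + b),
# with a softmax over the vocabulary at every step. It only illustrates the
# recurrence and the factorization p(x) = prod_t p(x_t | x_1..x_{t-1}).

rng = np.random.default_rng(0)
V, H = 10, 8                                   # vocabulary size, hidden size
W_x = rng.normal(scale=0.1, size=(H, V))       # input-to-hidden (one-hot inputs)
W_h = rng.normal(scale=0.1, size=(H, H))       # hidden-to-hidden
W_o = rng.normal(scale=0.1, size=(V, H))       # hidden-to-output
b_h, b_o = np.zeros(H), np.zeros(V)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sequence_log_prob(tokens):
    """log p(x) = sum_t log p(x_t | x_1..x_{t-1}) under the toy RNN language model."""
    h = np.zeros(H)
    log_p = 0.0
    prev = 0                                   # assume index 0 is a start symbol
    for t in tokens:
        x = np.eye(V)[prev]                    # one-hot encoding of the previous token
        h = np.tanh(W_h @ h + W_x @ x + b_h)   # recurrence h_t = f(h_{t-1}, x_t)
        p = softmax(W_o @ h + b_o)             # distribution over the next token
        log_p += np.log(p[t])
        prev = t
    return log_p

print(sequence_log_prob([3, 1, 4]))            # log-probability of a toy token sequence
```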
The encoder-decoder architecture works as follows. The encoder is an RNN that reads
each symbol of an input sequence x sequentially; once the whole input has been read, its hidden
state provides a summary c of the input sequence. The decoder, another RNN, is then trained to
generate the output sequence by predicting the next symbol y_t given its hidden state h_t, the
previous symbol y_{t-1}, and the summary c of the input sequence. Thus, the hidden state of the
decoder at time t is computed by

$$h_t = f(h_{t-1}, y_{t-1}, c)$$

and the conditional distribution of the next symbol is

$$P(y_t \mid y_{t-1}, \ldots, y_1, c) = g(h_t, y_{t-1}, c)$$

where f is an activation function and g can be, for example, a softmax, so that valid probabilities
are produced. The RNN encoder and decoder are jointly trained to maximize the conditional
log-likelihood

$$\max_{\theta} \frac{1}{N} \sum_{n=1}^{N} \log p_{\theta}(y_n \mid x_n)$$
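A minimal PyTorch sketch of such an RNN encoder-decoder follows: the encoder's final hidden state serves as the summary c, and the decoder is trained to maximize the conditional log-likelihood through a cross-entropy loss. The GRU cell, the layer sizes, and the random toy batch are assumptions made only to show the shape of the computation.

```python
import torch
import torch.nn as nn

# Minimal RNN encoder-decoder sketch. The encoder's last hidden state is the
# summary c of the source sentence and initializes the decoder. Sizes and the
# toy batch below are arbitrary illustration values.

class EncoderDecoder(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=32, hidden=64):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt_in):
        _, c = self.encoder(self.src_emb(src))           # c: summary of the source
        dec_states, _ = self.decoder(self.tgt_emb(tgt_in), c)
        return self.out(dec_states)                      # logits over the target vocab

# Toy training step: maximizing log p(y | x) amounts to minimizing cross-entropy.
model = EncoderDecoder(src_vocab=100, tgt_vocab=120)
src = torch.randint(0, 100, (4, 7))         # batch of 4 source sentences, length 7
tgt = torch.randint(0, 120, (4, 6))         # corresponding target sentences, length 6
logits = model(src, tgt[:, :-1])            # teacher forcing: feed y_{<t}
loss = nn.functional.cross_entropy(logits.reshape(-1, 120), tgt[:, 1:].reshape(-1))
loss.backward()                             # gradients for joint end-to-end training
```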
Attention Mechanism
The "attention" concept has gained popularity recently in training neural networks,
allowing models to learn alignments between different modalities. For example, Mnih et
al. (2014) used the attention concept between image objects and agent actions in a dynamic
control problem, and Xu et al. (2015) used it between the visual features of a picture and its text
description in the image caption generation task. The standard RNN encoder-decoder
architecture encodes a source sentence into a fixed-length vector from which a decoder generates
a translation. The potential problem with this architecture is that the neural network needs to
compress all the necessary information of a source sentence into a fixed-length vector, which
makes it difficult to handle long sentences. Cho et al. (2014) showed that the performance of a
basic RNN encoder-decoder architecture deteriorates rapidly as the length of an input sentence
increases. Bahdanau et al. (2015) proposed an attention mechanism that selectively focuses on
parts of the source sentence during translation. More specifically, it does not aim to encode a
whole input into a single fixed-length vector; instead, it encodes the input sentence into a
sequence of vectors and chooses a subset of these vectors adaptively, paying attention to
specific vectors of the input sequence according to the attention weights while decoding the
translation. This new method can learn to align and translate jointly, based on the context
vectors associated with these source positions and all the previously generated target words. In
this way, it frees a neural translation model from having to squash all the information of a source
sentence into a fixed-length vector, regardless of its length.
In the new architecture, the conditional probability is defined as

$$p(y_i \mid y_1, \ldots, y_{i-1}, x) = g(y_{i-1}, s_i, c_i)$$

where s_i is an RNN hidden state for time i, computed by

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$

The context vector c_i depends on a sequence of annotations (h_1, ..., h_{T_x}) to which an
encoder maps the input sentence. The context vector c_i is then computed as a weighted sum of
these annotations h_j:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

The weight α_ij of each annotation h_j is computed by

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$

where a is an alignment model which scores how well the inputs around position j and the output
at position i match. Their experimental results show that the proposed approach of jointly
learning to align and translate achieves significantly better translation performance than the
basic encoder-decoder approach, and the improvement is more apparent with longer sentences.
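The numpy sketch below reproduces this computation for one decoder step: an additive alignment model a(s_{i-1}, h_j) produces scores e_ij, a softmax turns them into weights α_ij, and the context vector c_i is their weighted sum over the annotations. The weight matrices and dimensions are arbitrary illustration values, not trained parameters.

```python
import numpy as np

# Sketch of Bahdanau-style attention for one decoder step i:
#   e_ij     = v^T tanh(W_a s_{i-1} + U_a h_j)    (additive alignment model)
#   alpha_ij = softmax_j(e_ij)
#   c_i      = sum_j alpha_ij * h_j
# All weights and sizes below are arbitrary assumptions for illustration.

rng = np.random.default_rng(1)
T_x, enc_dim, dec_dim, attn_dim = 5, 6, 6, 4
H = rng.normal(size=(T_x, enc_dim))      # annotations h_1..h_Tx from the encoder
s_prev = rng.normal(size=dec_dim)        # previous decoder state s_{i-1}
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=attn_dim)

scores = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in H])  # e_ij
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                     # attention weights alpha_ij, summing to 1
c_i = alpha @ H                          # context vector: weighted sum of annotations

print(alpha.round(3), c_i.shape)         # weights over source positions, (enc_dim,)
```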
Luong et al. (2015) proposed two effective classes of attentional mechanism: a global
approach and a local approach. The global approach always attends to all source words. Its
drawback is that it is computationally expensive, which can make it impractical for translating
longer sequences such as paragraphs or documents. The local approach is therefore proposed to
look at only a subset of source words at a time. It is a middle ground between the hard and soft
attention models proposed by Xu et al. (2015): compared to the global approach it is less
computationally expensive, and compared to hard attention it is differentiable almost
everywhere, making it easier to implement and train. The global approach is similar to the
architecture of Bahdanau et al. (2015) but simpler. More specifically, Bahdanau et al. (2015) use
the concatenation of the forward and backward hidden states of a bi-directional encoder and the
previous target hidden states in their non-stacking unidirectional decoder, whereas Luong et
al. (2015) use the hidden states at the top LSTM layers in both the encoder and the decoder.
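To contrast the two, the sketch below shows only the core difference in where attention is computed: global attention scores every source state, while local attention restricts the softmax to a window of width 2D around an aligned position p_t. The dot-product scoring function, the window half-width D, and the toy tensors are assumptions for illustration, and the Gaussian weighting and predictive alignment of the full local model are omitted.

```python
import numpy as np

# Sketch contrasting global vs. local attention. Scoring uses a simple dot
# product; D and all toy vectors are illustrative assumptions.

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def global_attention(h_t, source_states):
    """Attend to all source states."""
    scores = source_states @ h_t
    return softmax(scores) @ source_states

def local_attention(h_t, source_states, p_t, D=2):
    """Attend only to a window [p_t - D, p_t + D] around the aligned position."""
    lo, hi = max(0, p_t - D), min(len(source_states), p_t + D + 1)
    window = source_states[lo:hi]
    scores = window @ h_t
    return softmax(scores) @ window

rng = np.random.default_rng(2)
src = rng.normal(size=(10, 8))                  # 10 encoder states of dimension 8
h_t = rng.normal(size=8)                        # current decoder hidden state
print(global_attention(h_t, src).shape)         # (8,) context from all 10 positions
print(local_attention(h_t, src, p_t=6).shape)   # (8,) context from positions 4..8 only
```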
Further Assessment
If we carefully examine the statistical translation model proposed by Brown et al. (1988),
there are some unrealistic assumptions. Given a source string S of length n consisting of words
s_1, s_2, ..., s_n, the language model P(S) in the definition of statistical machine translation can
be expressed mathematically as

$$P(S) = \prod_{k=1}^{n} P(s_k \mid s_1, \ldots, s_{k-1})$$

Computing the language model therefore requires probabilities for a word given its entire
history, which is not realistic because there are simply too many possible histories for a word.
The requirement is relaxed by assuming that the current word depends only on a fixed subset of
its history, for example, only on the previous two words. But consider the sentence "I grew up in
China, I speak fluent __": if the model depends only on the previous two words, it is difficult to
make an accurate prediction, and thus the translation becomes less accurate.
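The fixed-history approximation discussed above can be written down in a few lines; the counts here come from a tiny invented corpus and there is no smoothing, so this is only a sketch of the trigram assumption and its limits.

```python
from collections import Counter

# Sketch of the trigram approximation P(w_k | w_1..w_{k-1}) ~ P(w_k | w_{k-2}, w_{k-1}).
# The "corpus" is a tiny invented example and no smoothing is applied.

corpus = "i grew up in china and i speak fluent chinese".split()

trigrams = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigrams = Counter(zip(corpus, corpus[1:]))

def trigram_prob(w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2)."""
    if bigrams[(w1, w2)] == 0:
        return 0.0
    return trigrams[(w1, w2, w3)] / bigrams[(w1, w2)]

# With only two words of history ("speak fluent"), the model cannot make use of
# the earlier mention of "china", illustrating the limitation described above.
print(trigram_prob("speak", "fluent", "chinese"))   # 1.0 in this toy corpus
```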
Wang and Ward (1999) mention three subproblems of statistical machine
translation. First, the modeling problem: it is not clear how the process of generating a sentence
in the source language should be depicted, or what process the channel uses to generate a target
sentence upon receiving a source sentence. Second, the learning problem: given P(S) and
P(T|S), it is still unclear how the parameters of these models can be estimated from a bilingual
corpus of parallel sentences. Third, the decoding problem: the argmax, which searches over all
candidate sentences, is referred to as the decoder, but how do we identify the best decoder?
These questions all remain open.
The phrase-based translation system removed the restriction of translating the source
sentence into the target sentence word by word, which constitutes a significant departure from
the word-based IBM models. However, both the word-based and the phrase-based models fail to
take the structural or syntactic aspects of language into consideration. The experiments were run
on English and French, a structurally similar language pair, and we would suspect that a
language pair with very different word order, such as English-Japanese or English-German,
would not be modeled well by these models. There are works that try to capture the external
structure, such as Wang and Ward (1999) and Och et al. (1999). Yamada and Knight (2001)
introduced a channel model that aims to capture the internal structure. The model accepts a
parse tree as input, which means the input sentence is preprocessed by a syntactic parser. The
channel then performs operations on each node of the parse tree: reordering child nodes,
inserting extra words at each node, and translating leaf words.
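A rough Python sketch of these three channel operations applied recursively to a toy parse tree is shown below; the reordering rule, the inserted particle, and the word-to-word dictionary are invented purely to illustrate the shape of the computation, whereas the real model of Yamada and Knight learns probability distributions over all of these choices.

```python
# Sketch of the three syntax-based channel operations: reorder child nodes,
# insert extra words, translate leaf words. The rules and the lexicon below
# are toy assumptions, not the learned tables of the actual model.

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    word: Optional[str] = None                  # set only on leaf nodes

DICTIONARY = {"he": "kare", "adores": "daisuki", "music": "ongaku"}   # toy lexicon

def transform(node: Node) -> Node:
    if node.word is not None:                           # 3. translate leaf words
        return Node(node.label, word=DICTIONARY.get(node.word, node.word))
    children = [transform(c) for c in node.children]
    if node.label == "VP":                              # 1. reorder child nodes
        children = list(reversed(children))             #    (toy SVO -> SOV rule)
    if node.label == "S":                               # 2. insert extra words
        children.insert(1, Node("PART", word="wa"))     #    (toy topic particle)
    return Node(node.label, children)

def yield_words(node: Node) -> List[str]:
    return [node.word] if node.word is not None else sum(
        (yield_words(c) for c in node.children), [])

tree = Node("S", [Node("NP", word="he"),
                  Node("VP", [Node("V", word="adores"), Node("NP", word="music")])])
print(" ".join(yield_words(transform(tree))))   # "kare wa ongaku daisuki"
```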
Neural machine translation with the RNN encoder-decoder architecture, helped by the LSTM,
has shown promising results compared to phrase-based statistical machine translation. It builds a
single neural network that reads a source sentence and generates its translation. But there are
still some limitations.
First, compared to the phrase-based approach, neural machine translation supports only a
very limited number of target words, mainly because the complexity of training and using a
neural machine translation model increases as the number of target words increases. Moreover,
the memory requirement grows linearly with the number of target words. The resulting problems
are, first, that neural machine translation has difficulty translating very rare words, and second,
that if the translation of a source sentence requires many words that are not in the vocabulary,
the model's performance degrades dramatically, especially for languages with a rich set of word
forms such as German or other highly inflected languages such as Latin, Polish, and Finnish. A
reasonable solution would therefore be to include a large vocabulary in neural machine
translation without increasing the computational complexity and memory requirements. Jean et
al. (2014) propose an approximate training algorithm based on (biased) importance sampling that
is able to train a neural machine translation model with a much larger target vocabulary; the
computational complexity during training stays at the level of using only a small subset of the
full vocabulary. Once the model with a very large target vocabulary is trained, one can choose to
use either all the target words or only a subset of them. Experimental results from Koehn and
Knowles (2017) show that the quality of neural machine translation starts much lower, overtakes
statistical machine translation at about 15 million words of training data, and even beats a
statistical machine translation system with a big 2-billion-word in-domain language model under
high-resource conditions. This is a very promising sign for the further progress of neural
machine translation. Luong et al. (2014) target the problem that neural machine translation has
difficulty translating rare words by training a neural machine translation system on data that is
augmented by the output of a word alignment algorithm, allowing the system to emit, for each
out-of-vocabulary word in the target sentence, the position of its corresponding word in the
source sentence.
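That post-processing idea can be sketched as follows: whenever the system emits an out-of-vocabulary placeholder, a pointer to the aligned source position lets the unknown word be copied or looked up in an external dictionary. The placeholder format, the alignment, and the (empty) dictionary here are invented for the example and only illustrate the spirit of the approach.

```python
# Sketch of rare-word post-processing in the spirit of Luong et al. (2014):
# each <unk> token carries a pointer to its aligned source word, which is then
# looked up in an external dictionary or copied verbatim.
# Token format, alignment, and dictionary are toy assumptions.

def replace_unks(source_tokens, output_tokens, alignment, dictionary):
    """alignment[i] gives the source position aligned to target position i."""
    result = []
    for i, tok in enumerate(output_tokens):
        if tok == "<unk>":
            src_word = source_tokens[alignment[i]]
            result.append(dictionary.get(src_word, src_word))   # translate or copy
        else:
            result.append(tok)
    return result

source = ["Ich", "besuche", "Heidelberg"]
output = ["I", "visit", "<unk>"]        # the rare name fell outside the vocabulary
alignment = {2: 2}                      # target position 2 aligned to source position 2
print(replace_unks(source, output, alignment, dictionary={}))
# ['I', 'visit', 'Heidelberg'] -- the unknown word is copied from the source
```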
Second, words from different domains have different meanings, translations, and styles.
Both statistical machine translation and neural machine translation face the difficulty of
achieving robust performance across domains; that is, they are not stable when confronted with
conditions that differ significantly from the training conditions. Koehn and Knowles (2017)
compared the two translation paradigms across different domains and found that while the in-
domain performance of both systems is similar (neural machine translation is better for IT and
Subtitles, statistical machine translation is better for Law, Medical, and Koran), the out-of-
domain performance of the neural machine translation systems is worse in almost all cases,
sometimes dramatically so. A possible solution could be combining the large-vocabulary
approach of Jean et al. (2014) with the assignment of alignment scores based on domains, which
is something I can explore further later.
Conclusion
Machine translation has been a popular research area within the artificial intelligence
field for many years. Statistical machine translation is a machine translation paradigm in which
translations are generated on the basis of statistical models. Based on the assumption that every
source sentence has a possible translation in the target language, the statistical machine
translation model searches over candidate sentences to find the one that maximizes the product
of the language model and the translation model probabilities. There are mainly two types of
statistical machine translation: word-based translation and phrase-based translation. Phrase-based
translation overcomes the problem faced by word-based translation, namely that translating
every single word in a single fixed way fails to take into account case, gender, and homonymy.
Neural machine translation is a recently proposed approach to machine translation, which uses a
single neural network trained jointly to maximize translation performance. It is often
implemented as the RNN encoder-decoder architecture. The whole model is jointly trained to
maximize the conditional log-probability of the correct translation given a source sentence with
respect to the model parameters. The attention mechanism was introduced to cope with the
problem that the performance of a basic RNN encoder-decoder architecture deteriorates rapidly
as the length of an input sentence increases. However, despite its great progress, neural machine
translation still has to overcome various challenges. First, neural machine translation has
difficulty translating very rare words; if the translation of a source sentence requires many words
that are not in the vocabulary, the model's performance degrades dramatically. Second, the
out-of-domain performance of neural machine translation needs to be improved. A possible
solution for this particular problem is combining the large-vocabulary approach of Jean et
al. (2014) with the assignment of alignment scores based on domains, which is something I can
explore further later.
References
Bahdanau, D., Cho, K., & Bengio, Y. (2016). Neural Machine Translation by Jointly Learning to
Align and Translate. ArXiv:1409.0473 [Cs, Stat]. http://arxiv.org/abs/1409.0473
Bentivogli, L., Bisazza, A., Cettolo, M., & Federico, M. (2016). Neural versus Phrase-Based
Machine Translation Quality: A Case Study. Proceedings of the 2016 Conference on Empirical
Methods in Natural Language Processing, 257-267. https://doi.org/10.18653/v1/D16-1025
Brown, P., Cocke, J., Pietra, S. D., Pietra, V. D., Jelinek, F., Mercer, R., & Roossin, P. (1988). A
statistical approach to language translation. Proceedings of the 12th Conference on
Computational Linguistics, 1, 71-76. https://doi.org/10.3115/991635.991651
Brown, P. F., Cocke, J., Pietra, S. A. D., Pietra, V. J. D., Jelinek, F., Lafferty, J. D., Mercer, R.
L., & Roossin, P. S. (1990). A statistical approach to machine translation. Computational
Linguistics, 16(2), 79-85.
Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., &
Bengio, Y. (2014). Learning Phrase Representations using RNN Encoder-Decoder for Statistical
Machine Translation. ArXiv:1406.1078 [Cs, Stat]. http://arxiv.org/abs/1406.1078
Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent
nets: The difficulty of learning long-term dependencies. In J. F. Kolen & S. C. Kremer (Eds.),
A Field Guide to Dynamical Recurrent Networks. IEEE. https://doi.org/10.1109/9780470544037.ch14
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8),
1735-1780.
Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A., Jaitly, N., Vanhoucke, V., Nguyen, P.,
Sainath, T., & Kingsbury, B. (2012). Deep Neural Networks for Acoustic Modeling in Speech
Recognition. IEEE Signal Processing Magazine, 29(6), 82-97.
Jean, S., Cho, K., Memisevic, R., & Bengio, Y. (2015). On Using Very Large Target Vocabulary
for Neural Machine Translation. ArXiv:1412.2007 [Cs]. http://arxiv.org/abs/1412.2007
Kalchbrenner, N., & Blunsom, P. (2013). Recurrent continuous translation models. Proceedings
of the 2013 Conference on Empirical Methods in Natural Language Processing.
Koehn, P., & Knowles, R. (2017). Six Challenges for Neural Machine Translation. Proceedings
of the First Workshop on Neural Machine Translation, 28-39. https://doi.org/10.18653/v1/W17-3204
Koehn, P., Och, F. J., & Marcu, D. (2003). Statistical phrase-based translation. Proceedings of
HLT-NAACL 2003, Main Papers, 48-54. Edmonton, May-June 2003.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2017). ImageNet classification with deep
convolutional neural networks. Communications of the ACM, 60(6), 84-90.
https://doi.org/10.1145/3065386
Luong, M.-T., Pham, H., & Manning, C. D. (2015). Effective Approaches to Attention-based
Neural Machine Translation. ArXiv:1508.04025 [Cs]. http://arxiv.org/abs/1508.04025
Luong, M.-T., Sutskever, I., Le, Q. V., Vinyals, O., & Zaremba, W. (2015). Addressing the Rare
Word Problem in Neural Machine Translation. ArXiv:1410.8206 [Cs].
http://arxiv.org/abs/1410.8206
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed
representations of words and phrases and their compositionality. Advances in Neural
Information Processing Systems 26.
Mnih, V., Heess, N., & Graves, A. (2014). Recurrent models of visual attention. Advances in
Neural Information Processing Systems 27.
Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural
networks. Advances in Neural Information Processing Systems 27.
Wang, Y.-Y. (n.d.). Grammar inference and statistical machine translation.
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., & Bengio, Y.
(2016). Show, Attend and Tell: Neural Image Caption Generation with Visual Attention.
ArXiv:1502.03044 [Cs]. http://arxiv.org/abs/1502.03044
Yamada, K., & Knight, K. (2001). A syntax-based statistical translation model. Proceedings of
the 39th Annual Meeting on Association for Computational Linguistics - ACL ’01, 523–530.
https://doi.org/10.3115/1073012.1073079