Encoder Decoder Sequences: How Long Is Too Long?


In machine learning we often deal with problems where the input is a sequence and the output is also a sequence. We call such a problem a sequence-to-sequence problem.

A popular architecture for dealing with sequence-to-sequence problems is the encoder-decoder architecture. The encoder converts the variable-length input sequence into a fixed-length context vector, and the decoder converts that fixed-length context vector into a variable-length output sequence. I have read enough literature (for an example, take a look at this article) that says the disadvantage of the encoder-decoder architecture is that the context vector length is fixed irrespective of the length of the input sequence, because of which the system cannot remember longer sequences. Often, it has forgotten the earlier parts of the sequence by the time it reaches the end of the sequence.
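To make that concrete, here is a minimal sketch of one common Keras formulation of the encoder-decoder (the referenced post may use a different variant, for example one with explicit decoder inputs and teacher forcing); the sizes mirror the numbers used later in this article.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

n_features = 51       # one-hot size: values 1..50, with 0 reserved
n_in, n_out = 6, 3    # input and output sequence lengths
n_units = 128         # LSTM units, as in the referenced post

model = Sequential([
    # Encoder: compresses the whole input sequence into one fixed-length vector
    LSTM(n_units, input_shape=(n_in, n_features)),
    # The same context vector is fed to every decoder time step
    RepeatVector(n_out),
    # Decoder: unrolls the context vector into the output sequence
    LSTM(n_units, return_sequences=True),
    TimeDistributed(Dense(n_features, activation='softmax')),
])
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
```

The output of the encoder LSTM is the fixed-length context vector the criticism talks about: it has n_units values whether the input contains 6 numbers or 60.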

That criticism is fair. But exactly how long is a ‘longer’ sequence? After what length does encoder-decoder performance start to deteriorate? I conducted a few experiments with various sequence lengths, and that is what this article is about. You may find my numbers useful for understanding the limitations of your own networks when it comes to sequence length. My experiments are based on a brilliant post on sequence prediction. I suggest you go through that post before proceeding with the rest of this article.

To explain the sequence prediction problem we intend to solve, I have reproduced the above image from the post I just referred to. The image has 10 rows that serve as examples of the problem. Take any one row. ‘X’ is a sequence of six randomly generated numbers, each ranging from 1 to 50. The sequence we need to predict is indicated by ‘y’: it is the first three numbers from the sequence ‘X’, reversed. The sequence predicted by our encoder-decoder model is indicated by ‘yhat’. You can see that for this simple problem we have 100% accuracy – ‘y’ matches ‘yhat’ exactly in all cases. When I ran this experiment several times that was not always the case, but the lowest accuracy I got over several runs was 98. By ‘simple’ problem I mean the input sequence length is as short as 6.
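For reference, this is roughly how such example pairs can be generated (a sketch based on my reading of the problem; the referenced post has its own generator and encoding details):

```python
import numpy as np

def generate_pair(n_in=6, n_out=3, n_features=51):
    """One example: X is a random sequence, y is its first n_out values reversed."""
    seq_in = np.random.randint(1, 51, size=n_in)   # values from 1 to 50 inclusive
    seq_out = seq_in[:n_out][::-1]                 # first n_out values, reversed
    X = np.eye(n_features)[seq_in]                 # one-hot encoded, shape (n_in, n_features)
    y = np.eye(n_features)[seq_out]                # one-hot encoded, shape (n_out, n_features)
    return X, y

X, y = generate_pair()
print(X.argmax(axis=1), y.argmax(axis=1))          # e.g. [21  1 48  8 41 49] [48  1 21]
```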

The post used 128 LSTM units and a single epoch. I started changing the length of the input sequence (from its original value of 6), the number of LSTM units (originally 128) and the number of epochs (originally 1), and documented the impact of those changes on accuracy. A rough sketch of the sweep is below, followed by what I found.
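The loop below shows the shape of that sweep. It is a sketch only: build_model and generate_dataset are thin wrappers around the earlier sketches, and the dataset sizes and configuration rows are illustrative, simply echoing the values discussed in this article.

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, RepeatVector, TimeDistributed, Dense

def build_model(n_in, n_out=3, n_units=128, n_features=51):
    """Parameterised version of the encoder-decoder sketch above."""
    model = Sequential([
        LSTM(n_units, input_shape=(n_in, n_features)),
        RepeatVector(n_out),
        LSTM(n_units, return_sequences=True),
        TimeDistributed(Dense(n_features, activation='softmax')),
    ])
    model.compile(loss='categorical_crossentropy', optimizer='adam')
    return model

def generate_dataset(n_samples, n_in):
    """Batch version of the generate_pair sketch above."""
    pairs = [generate_pair(n_in=n_in) for _ in range(n_samples)]
    return np.stack([p[0] for p in pairs]), np.stack([p[1] for p in pairs])

# (input_length, lstm_units, epochs) -- illustrative values echoing the text
configs = [(6, 128, 1), (6, 1, 100), (30, 128, 1), (30, 128, 30), (60, 128, 50), (60, 256, 70)]

for n_in, n_units, n_epochs in configs:
    model = build_model(n_in=n_in, n_units=n_units)
    X_train, y_train = generate_dataset(50_000, n_in)       # training-set size is illustrative
    model.fit(X_train, y_train, epochs=n_epochs, verbose=0)

    # Exact-match accuracy: the whole 3-number output must be predicted correctly
    X_test, y_test = generate_dataset(100, n_in)
    y_hat = model.predict(X_test, verbose=0).argmax(axis=-1)
    accuracy = (y_hat == y_test.argmax(axis=-1)).all(axis=1).mean()
    print(f"len={n_in} units={n_units} epochs={n_epochs} accuracy={accuracy:.2f}")
```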

The first row is the data that is already present in the post – I have documented my lower accuracy of 98 instead of 100. When you have only one LSTM unit, the number of trainable parameters in the network is too small to express the sequence, and hence accuracy is zero, even when you increase the epochs to 100. When I increased the input length to 30 (the output length remained 3 throughout the rows in this table), accuracy dropped from 98 to 0 for a single epoch. That meant that for ‘long’ sequences the encoder-decoder architecture really does start forgetting the sequence. Only when I increased the epochs to 30 did I start getting a decent accuracy of 85.

Then I increased the input length to a sequence of 60 numbers. Accuracy stayed at zero when I ran 50 epochs, and even when I increased the number of LSTM units to 256 and the epochs to 70 it remained the same – zero accuracy. I could have increased the LSTM units and epochs further, but was discouraged: it is not as if my accuracy was at 45 or 55 – it was stuck at zero with a network configuration matching what we find in most literature. For comparison, here are the 10 sample rows from the last row of the above table.

X = [21, 1, 48, 8, 41, 49, 27, 40, 17, 15, 12, 31, 38, 31, 8, 43, 45, 6, 40, 3, 44, 4, 27, 27, 46, 45, 43, 11, 29, 23, 30, 25, 3, 30, 15, 20, 29, 21, 13, 2, 43, 15, 8, 16, 33, 47, 9, 13, 8, 29, 28, 40, 26, 43, 3, 21, 29, 31, 44, 2] y = [48, 1, 21], yhat = [2, 48, 13]

X = [15, 27, 30, 10, 8, 12, 47, 41, 31, 42, 34, 5, 35, 26, 46, 5, 24, 4, 9, 19, 5, 2, 21, 29, 22, 42, 27, 38, 34, 42, 19, 7, 45, 31, 15, 22, 18, 10, 20, 36, 17, 9, 40, 49, 16, 16, 12, 32, 37, 36, 49, 42, 21, 37, 31, 48, 13, 33, 33, 45] y = [30, 27, 15], yhat = [6, 41, 40]

X = [14, 42, 28, 5, 28, 42, 33, 3, 24, 49, 10, 23, 32, 38, 34, 32, 29, 24, 37, 43, 40, 48, 44, 30, 3, 6, 18, 48, 44, 43, 4, 25, 39, 46, 28, 4, 43, 25, 22, 42, 7, 12, 49, 26, 17, 28, 6, 30, 2, 17, 17, 33, 20, 9, 23, 48, 44, 46, 13, 40] y = [28, 42, 14], yhat = [16, 25, 41]

X = [37, 2, 28, 38, 41, 25, 17, 36, 28, 8, 30, 38, 43, 44, 21, 30, 23, 46, 13, 29, 35, 37, 20, 42, 33, 42, 7, 29, 10, 50, 45, 20, 29, 38, 19, 50, 24, 24, 38, 40, 36, 36, 35, 49, 20, 33, 42, 8, 21, 23, 22, 35, 44, 43, 23, 25, 26, 30, 42, 41] y = [28, 2, 37], yhat = [11, 7, 3]

X = [38, 1, 27, 37, 2, 12, 20, 41, 24, 6, 31, 49, 26, 46, 11, 15, 17, 37, 10, 15, 18, 37, 8, 39, 21, 1, 3, 50, 35, 12, 42, 36, 6, 40, 6, 6, 34, 44, 23, 14, 50, 45, 15, 3, 47, 9, 23, 34, 11, 32, 25, 3, 32, 36, 37, 32, 1, 24, 31, 34] y = [27, 1, 38], yhat = [37, 12, 42]

X = [6, 46, 11, 20, 1, 48, 17, 6, 39, 18, 49, 33, 46, 6, 18, 4, 39, 19, 35, 49, 10, 39, 29, 14, 42, 35, 30, 16, 35, 36, 39, 18, 27, 4, 24, 31, 11, 39, 8, 11, 12, 47, 24, 42, 21, 25, 36, 50, 11, 25, 20, 47, 4, 27, 41, 9, 29, 5, 11, 35] y = [11, 46, 6], yhat = [20, 10, 10]

X = [11, 28, 33, 14, 38, 8, 27, 45, 1, 30, 7, 10, 28, 6, 2, 38, 4, 22, 47, 18, 36, 39, 39, 7, 23, 31, 6, 20, 42, 33, 16, 26, 21, 7, 24, 43, 49, 47, 20, 14, 15, 15, 13, 3, 9, 25, 29, 13, 46, 27, 7, 50, 47, 8, 47, 17, 45, 15, 9, 36] y = [33, 28, 11], yhat = [4, 42, 5]

X = [17, 14, 50, 16, 49, 11, 23, 31, 41, 2, 19, 40, 23, 10, 34, 4, 10, 49, 8, 23, 8, 46, 10, 48, 41, 20, 11, 43, 22, 10, 24, 42, 28, 15, 24, 17, 35, 45, 18, 11, 10, 17, 37, 39, 44, 18, 17, 3, 44, 42, 5, 28, 31, 50, 25, 6, 9, 48, 11, 43] y = [50, 14, 17], yhat = [40, 40, 23]

X = [43, 37, 48, 36, 6, 8, 49, 18, 25, 38, 49, 14, 49, 45, 12, 28, 12, 13, 49, 7, 28, 25, 13, 26, 20, 11, 10, 7, 32, 18, 20, 46, 10, 29, 30, 34, 44, 12, 3, 10, 3, 12, 4, 3, 34, 5, 17, 3, 6, 48, 46, 19, 3, 45, 37, 21, 12, 30, 2, 48] y = [48, 37, 43], yhat = [46, 35, 42]

X = [36, 28, 17, 42, 29, 10, 23, 22, 38, 37, 25, 37, 43, 7, 50, 42, 8, 43, 39, 11, 26, 14, 36, 17, 2, 38, 47, 25, 37, 20, 5, 22, 13, 43, 26, 35, 26, 30, 20, 5, 25, 17, 20, 10, 27, 32, 10, 23, 36, 22, 1, 15, 49, 5, 11, 29, 32, 40, 19, 32] y = [17, 28, 36], yhat = [23, 6, 10]

It is obvious that when the sequence length increases, the encoder-decoder network is not able to remember the first three elements of the sequence. Is it able to remember the last three elements? I carried out a series of experiments to find that out: I changed the problem in the referenced post to predict the last three numbers of the sequence, reversed (the only change to the data generation is the target slice, sketched below). Here is the table for these experiments.
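A minimal sketch of that change, assuming the generate_pair sketch from earlier:

```python
# The only change to the earlier data-generation sketch: take the last
# three input values instead of the first three, still reversed.
def target_last_reversed(seq_in, n_out=3):
    return seq_in[-n_out:][::-1]   # e.g. [..., 18, 47, 16] -> [16, 47, 18]
```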

Here are 10 sample rows for the experiment indicated by the last row in the above table.

X = [14, 27, 10, 28, 17, 26, 42, 31, 43, 4, 22, 7, 4, 20, 26, 43, 4, 35, 26, 3, 39, 13, 46, 23, 36, 4, 25, 37, 24, 27, 46, 5, 34, 10, 31, 50, 8, 23, 23, 50, 50, 33, 49, 25, 37, 5, 4, 28, 37, 9, 49, 13, 9, 37, 35, 18, 13, 18, 47, 16] y = [16, 47, 18], yhat = [16, 47, 18]

X = [26, 2, 26, 7, 22, 46, 11, 14, 11, 12, 9, 28, 42, 3, 37, 36, 2, 50, 22, 15, 28, 30, 10, 13, 4, 12, 46, 18, 16, 9, 4, 6, 16, 45, 24, 43, 40, 13, 28, 6, 17, 35, 17, 5, 43, 30, 11, 39, 8, 10, 32, 8, 45, 21, 40, 29, 42, 15, 5, 12] y = [12, 5, 15], yhat = [12, 5, 15]

X = [2, 10, 19, 19, 11, 25, 6, 50, 37, 8, 34, 49, 30, 39, 17, 30, 6, 30, 1, 16, 9, 22, 35, 7, 8, 39, 42, 3, 11, 36, 28, 48, 24, 36, 32, 19, 38, 8, 16, 31, 50, 26, 23, 49, 28, 1, 2, 26, 49, 35, 39, 24, 39, 30, 39, 47, 43, 1, 46, 38] y = [38, 46, 1], yhat = [38, 46, 1]

X = [14, 5, 2, 6, 2, 5, 9, 37, 14, 18, 32, 20, 41, 48, 6, 5, 47, 13, 27, 15, 36, 3, 28, 15, 47, 1, 32, 10, 31, 16, 3, 25, 41, 25, 23, 41, 40, 43, 41, 49, 27, 45, 29, 27, 45, 31, 18, 27, 44, 21, 12, 2, 5, 34, 44, 32, 20, 3, 31, 4] y = [4, 31, 3], yhat = [4, 31, 3]

X = [22, 34, 23, 17, 28, 22, 2, 15, 50, 26, 40, 6, 22, 46, 25, 38, 36, 21, 13, 18, 40, 46, 15, 44, 30, 32, 10, 21, 41, 2, 46, 39, 16, 34, 45, 13, 48, 16, 10, 2, 1, 44, 28, 24, 4, 42, 36, 9, 9, 44, 36, 25, 35, 32, 24, 20, 23, 45, 15, 33] y = [33, 15, 45], yhat = [33, 15, 45]

X = [30, 28, 47, 2, 18, 27, 6, 39, 32, 24, 15, 37, 50, 37, 10, 1, 41, 48, 15, 22, 17, 47, 7, 46, 1, 45, 6, 43, 47, 15, 31, 32, 22, 34, 28, 9, 46, 13, 45, 25, 27, 42, 44, 29, 40, 18, 7, 43, 17, 16, 15, 28, 45, 50, 41, 50, 48, 41, 3, 30] y = [30, 3, 41], yhat = [30, 3, 41]

X = [35, 20, 4, 33, 15, 32, 21, 4, 12, 27, 38, 6, 22, 46, 25, 27, 1, 45, 26, 35, 22, 46, 37, 48, 29, 28, 39, 38, 10, 28, 6, 33, 16, 38, 16, 39, 31, 49, 21, 6, 3, 32, 32, 1, 20, 35, 36, 23, 5, 19, 34, 1, 39, 39, 23, 2, 32, 34, 44, 20] y = [20, 44, 34], yhat = [20, 44, 34]

X = [17, 41, 10, 19, 7, 2, 50, 23, 41, 23, 22, 41, 36, 16, 2, 23, 24, 4, 18, 25, 17, 12, 45, 27, 18, 2, 5, 35, 3, 33, 49, 8, 21, 20, 14, 8, 20, 2, 22, 15, 23, 14, 13, 42, 48, 46, 41, 29, 26, 37, 34, 39, 35, 38, 35, 39, 29, 24, 39, 11] y = [11, 39, 24], yhat = [11, 39, 24]

X = [7, 46, 7, 10, 9, 9, 40, 20, 41, 45, 14, 28, 12, 28, 36, 46, 38, 46, 12, 7, 37, 32, 17, 1, 14, 37, 42, 18, 9, 36, 15, 42, 39, 4, 27, 11, 2, 41, 38, 28, 26, 21, 32, 36, 9, 12, 34, 15, 11, 46, 20, 42, 6, 2, 33, 14, 34, 7, 33, 18] y = [18, 33, 7], yhat = [18, 33, 7]

X = [13, 35, 46, 48, 15, 49, 32, 46, 22, 35, 39, 36, 48, 9, 3, 9, 40, 47, 21, 6, 12, 12, 31, 32, 42, 50, 22, 37, 8, 22, 19, 4, 41, 5, 18, 26, 13, 47, 42, 28, 47, 11, 32, 29, 31, 16, 14, 12, 46, 42, 24, 16, 11, 27, 37, 1, 32, 2, 40, 45] y = [45, 40, 2], yhat = [45, 40, 2]

The experiments were conducted using Google Colab notebooks. Here are my conclusions from the experiments so far.

  1. A simple encoder-decoder network with only one LSTM unit will not be able to remember even the simplest sequences.
  2. Somewhere between sequence lengths of 30 and 60, the simple encoder-decoder network starts forgetting the first few elements of the sequence. It does not look like a combination of more LSTM units and more epochs can rectify the situation: within the limits of reasonable hardware, the accuracy does not move beyond zero.
  3. As the network processes the sequence element by element, the fixed-length context vector finds it easy to retain the last three numbers of the sequence but finds it difficult to retain the first three.

The experiments here are not based on NLP. If you are using character encoding in NLP, the results should be in a comparable range, as you would one-hot encode somewhere around 50 or more characters. If you are using word encoding with any of the popular word-embedding schemes, the system has the advantage of advanced embedding techniques but will also have to figure out language semantics. I stuck to numbers because, unlike NLP, there is no subjectivity in the results. For your NLP use cases, I hope the tables presented here give rough starting points, and I would encourage you to carry out your own experiments for your specific use cases.
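A quick back-of-the-envelope check on that character-encoding remark (the exact character set is of course my assumption):

```python
# Rough illustration only: a plain character-level vocabulary is already of
# the same order as the 50 integer values used in these experiments, so the
# one-hot vector sizes are comparable.
import string

vocab = sorted(set(string.ascii_letters + string.digits + " .,!?'"))
print(len(vocab))   # 68 symbols to one-hot encode
```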

Amaresh Patnaik is a researcher in the areas of Machine Learning and Natural Language Processing. Additionally, he works in the area of learning and development for professionals in the IT services industry. He uses Python and TensorFlow for his research work. In these columns he shares his experiences with the intention of helping the reader understand concepts and solve problems. He is based out of Mumbai, India.
