
To explain the sequence prediction problem we intend to solve, I have reproduced the image above from the post I just referred to. The image has 10 rows, each an example of the problem. Take any one row. ‘X’ is a sequence of six randomly generated numbers ranging from 1 to 50. ‘y’ is the sequence we need to predict: the first three numbers of ‘X’, reversed. ‘yhat’ is the sequence predicted by our encoder-decoder model. You can see that for this simple problem we have 100% accuracy – ‘y’ matches ‘yhat’ exactly in every case. When I ran the experiment several times, that was not always the case, but the lowest accuracy I got over several runs was 98%. By ‘simple’ problem I mean the input sequence length is as short as 6.
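The data for this toy problem can be reproduced in a few lines of plain Python. This is a minimal sketch of my understanding of the setup; `generate_pair` is a hypothetical helper name, not code from the referenced post.

```python
import random

def generate_pair(n_in=6, n_out=3, lo=1, hi=50):
    """Generate one (X, y) example: X is a sequence of n_in random
    integers in [lo, hi]; y is the first n_out elements of X, reversed."""
    X = [random.randint(lo, hi) for _ in range(n_in)]
    y = X[:n_out][::-1]
    return X, y

X, y = generate_pair()
print("X =", X, "y =", y)
```

In the referenced post, such integer sequences are then one-hot encoded before being fed to the network.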
The post used 128 LSTM units and a single epoch. I then varied the length of the input sequence (from the original 6), the number of LSTM units (from 128) and the number of epochs (from 1), and documented the impact of each change on accuracy. Here is what I found.
The first row is the configuration already present in the post – I have documented my lower accuracy of 98% instead of 100%. With only one LSTM unit, the number of trainable parameters in the network is too small to express the sequence, so accuracy is zero even when the epochs are increased to 100. When I increased the input length to 30 (the output length remained 3 throughout the rows in this table), accuracy dropped from 98% to 0 for a single epoch. In other words, for ‘long’ sequences the encoder-decoder architecture actually starts forgetting the sequence. Only when I increased the epochs to 30 did I start getting a decent accuracy of 85%.
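The "too few parameters" point can be made concrete. A standard LSTM layer has four gates, each with input weights, recurrent weights and a bias, giving 4 × (d·u + u² + u) trainable parameters for input dimension d and u units. Assuming the integers are one-hot encoded into d = 51 features (an assumption about the referenced post's encoding, covering values 0–50), one unit yields only a couple of hundred parameters versus tens of thousands for 128 units:

```python
def lstm_params(input_dim, units):
    """Trainable parameters in one LSTM layer: 4 gates, each with
    input weights, recurrent weights and a bias vector."""
    return 4 * (input_dim * units + units * units + units)

print(lstm_params(51, 1))    # 1 unit   -> 212 parameters
print(lstm_params(51, 128))  # 128 units -> 92160 parameters
```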
Then I increased the input length to a sequence of 60 numbers. Accuracy stayed at zero when I ran 50 epochs, and even when I increased the number of LSTM units to 256 and the epochs to 70 it remained at zero. I could have increased the LSTM units and epochs further, but I was discouraged: it is not as if my present accuracy were 45% or 55% – accuracy was stuck at zero with a network configuration matching what we find in most of the literature. For comparison, here are 10 sample rows from the last row of the above table.
X = [21, 1, 48, 8, 41, 49, 27, 40, 17, 15, 12, 31, 38, 31, 8, 43, 45, 6, 40, 3, 44, 4, 27, 27, 46, 45, 43, 11, 29, 23, 30, 25, 3, 30, 15, 20, 29, 21, 13, 2, 43, 15, 8, 16, 33, 47, 9, 13, 8, 29, 28, 40, 26, 43, 3, 21, 29, 31, 44, 2] y = [48, 1, 21], yhat = [2, 48, 13]
X = [15, 27, 30, 10, 8, 12, 47, 41, 31, 42, 34, 5, 35, 26, 46, 5, 24, 4, 9, 19, 5, 2, 21, 29, 22, 42, 27, 38, 34, 42, 19, 7, 45, 31, 15, 22, 18, 10, 20, 36, 17, 9, 40, 49, 16, 16, 12, 32, 37, 36, 49, 42, 21, 37, 31, 48, 13, 33, 33, 45] y = [30, 27, 15], yhat = [6, 41, 40]
X = [14, 42, 28, 5, 28, 42, 33, 3, 24, 49, 10, 23, 32, 38, 34, 32, 29, 24, 37, 43, 40, 48, 44, 30, 3, 6, 18, 48, 44, 43, 4, 25, 39, 46, 28, 4, 43, 25, 22, 42, 7, 12, 49, 26, 17, 28, 6, 30, 2, 17, 17, 33, 20, 9, 23, 48, 44, 46, 13, 40] y = [28, 42, 14], yhat = [16, 25, 41]
X = [37, 2, 28, 38, 41, 25, 17, 36, 28, 8, 30, 38, 43, 44, 21, 30, 23, 46, 13, 29, 35, 37, 20, 42, 33, 42, 7, 29, 10, 50, 45, 20, 29, 38, 19, 50, 24, 24, 38, 40, 36, 36, 35, 49, 20, 33, 42, 8, 21, 23, 22, 35, 44, 43, 23, 25, 26, 30, 42, 41] y = [28, 2, 37], yhat = [11, 7, 3]
X = [38, 1, 27, 37, 2, 12, 20, 41, 24, 6, 31, 49, 26, 46, 11, 15, 17, 37, 10, 15, 18, 37, 8, 39, 21, 1, 3, 50, 35, 12, 42, 36, 6, 40, 6, 6, 34, 44, 23, 14, 50, 45, 15, 3, 47, 9, 23, 34, 11, 32, 25, 3, 32, 36, 37, 32, 1, 24, 31, 34] y = [27, 1, 38], yhat = [37, 12, 42]
X = [6, 46, 11, 20, 1, 48, 17, 6, 39, 18, 49, 33, 46, 6, 18, 4, 39, 19, 35, 49, 10, 39, 29, 14, 42, 35, 30, 16, 35, 36, 39, 18, 27, 4, 24, 31, 11, 39, 8, 11, 12, 47, 24, 42, 21, 25, 36, 50, 11, 25, 20, 47, 4, 27, 41, 9, 29, 5, 11, 35] y = [11, 46, 6], yhat = [20, 10, 10]
X = [11, 28, 33, 14, 38, 8, 27, 45, 1, 30, 7, 10, 28, 6, 2, 38, 4, 22, 47, 18, 36, 39, 39, 7, 23, 31, 6, 20, 42, 33, 16, 26, 21, 7, 24, 43, 49, 47, 20, 14, 15, 15, 13, 3, 9, 25, 29, 13, 46, 27, 7, 50, 47, 8, 47, 17, 45, 15, 9, 36] y = [33, 28, 11], yhat = [4, 42, 5]
X = [17, 14, 50, 16, 49, 11, 23, 31, 41, 2, 19, 40, 23, 10, 34, 4, 10, 49, 8, 23, 8, 46, 10, 48, 41, 20, 11, 43, 22, 10, 24, 42, 28, 15, 24, 17, 35, 45, 18, 11, 10, 17, 37, 39, 44, 18, 17, 3, 44, 42, 5, 28, 31, 50, 25, 6, 9, 48, 11, 43] y = [50, 14, 17], yhat = [40, 40, 23]
X = [43, 37, 48, 36, 6, 8, 49, 18, 25, 38, 49, 14, 49, 45, 12, 28, 12, 13, 49, 7, 28, 25, 13, 26, 20, 11, 10, 7, 32, 18, 20, 46, 10, 29, 30, 34, 44, 12, 3, 10, 3, 12, 4, 3, 34, 5, 17, 3, 6, 48, 46, 19, 3, 45, 37, 21, 12, 30, 2, 48] y = [48, 37, 43], yhat = [46, 35, 42]
X = [36, 28, 17, 42, 29, 10, 23, 22, 38, 37, 25, 37, 43, 7, 50, 42, 8, 43, 39, 11, 26, 14, 36, 17, 2, 38, 47, 25, 37, 20, 5, 22, 13, 43, 26, 35, 26, 30, 20, 5, 25, 17, 20, 10, 27, 32, 10, 23, 36, 22, 1, 15, 49, 5, 11, 29, 32, 40, 19, 32] y = [17, 28, 36], yhat = [23, 6, 10]
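"Zero accuracy" here means that no predicted sequence matches its target exactly. A minimal checker (hypothetical helper, shown on two of the (y, yhat) pairs from the sample rows above) makes this concrete:

```python
def exact_match_accuracy(pairs):
    """Fraction of (y, yhat) pairs that match element for element."""
    hits = sum(1 for y, yhat in pairs if y == yhat)
    return hits / len(pairs)

# Two (y, yhat) pairs from the samples above: neither matches.
samples = [
    ([48, 1, 21], [2, 48, 13]),
    ([30, 27, 15], [6, 41, 40]),
]
print(exact_match_accuracy(samples))  # 0.0
```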
It is obvious that as the sequence length increases, the encoder-decoder network is not able to remember the first three elements of the sequence. Is it able to remember the last three elements? I carried out a series of experiments to find out: I changed the problem in the referenced post to predict the last three numbers of the sequence, reversed. Here is the table for that experiment.
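The changed target amounts to a one-line tweak in the data generation. This is a hypothetical helper mirroring the problem description, not code from the referenced post:

```python
import random

def generate_pair_last(n_in=60, n_out=3, lo=1, hi=50):
    """X is a sequence of n_in random integers in [lo, hi];
    y is the LAST n_out elements of X, reversed."""
    X = [random.randint(lo, hi) for _ in range(n_in)]
    y = X[-n_out:][::-1]
    return X, y

X, y = generate_pair_last()
print("y =", y, "is the reversed tail of X")
```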
Here are 10 sample rows for the experiment indicated by the last row in the above table.
X = [14, 27, 10, 28, 17, 26, 42, 31, 43, 4, 22, 7, 4, 20, 26, 43, 4, 35, 26, 3, 39, 13, 46, 23, 36, 4, 25, 37, 24, 27, 46, 5, 34, 10, 31, 50, 8, 23, 23, 50, 50, 33, 49, 25, 37, 5, 4, 28, 37, 9, 49, 13, 9, 37, 35, 18, 13, 18, 47, 16] y = [16, 47, 18], yhat = [16, 47, 18]
X = [26, 2, 26, 7, 22, 46, 11, 14, 11, 12, 9, 28, 42, 3, 37, 36, 2, 50, 22, 15, 28, 30, 10, 13, 4, 12, 46, 18, 16, 9, 4, 6, 16, 45, 24, 43, 40, 13, 28, 6, 17, 35, 17, 5, 43, 30, 11, 39, 8, 10, 32, 8, 45, 21, 40, 29, 42, 15, 5, 12] y = [12, 5, 15], yhat = [12, 5, 15]
X = [2, 10, 19, 19, 11, 25, 6, 50, 37, 8, 34, 49, 30, 39, 17, 30, 6, 30, 1, 16, 9, 22, 35, 7, 8, 39, 42, 3, 11, 36, 28, 48, 24, 36, 32, 19, 38, 8, 16, 31, 50, 26, 23, 49, 28, 1, 2, 26, 49, 35, 39, 24, 39, 30, 39, 47, 43, 1, 46, 38] y = [38, 46, 1], yhat = [38, 46, 1]
X = [14, 5, 2, 6, 2, 5, 9, 37, 14, 18, 32, 20, 41, 48, 6, 5, 47, 13, 27, 15, 36, 3, 28, 15, 47, 1, 32, 10, 31, 16, 3, 25, 41, 25, 23, 41, 40, 43, 41, 49, 27, 45, 29, 27, 45, 31, 18, 27, 44, 21, 12, 2, 5, 34, 44, 32, 20, 3, 31, 4] y = [4, 31, 3], yhat = [4, 31, 3]
X = [22, 34, 23, 17, 28, 22, 2, 15, 50, 26, 40, 6, 22, 46, 25, 38, 36, 21, 13, 18, 40, 46, 15, 44, 30, 32, 10, 21, 41, 2, 46, 39, 16, 34, 45, 13, 48, 16, 10, 2, 1, 44, 28, 24, 4, 42, 36, 9, 9, 44, 36, 25, 35, 32, 24, 20, 23, 45, 15, 33] y = [33, 15, 45], yhat = [33, 15, 45]
X = [30, 28, 47, 2, 18, 27, 6, 39, 32, 24, 15, 37, 50, 37, 10, 1, 41, 48, 15, 22, 17, 47, 7, 46, 1, 45, 6, 43, 47, 15, 31, 32, 22, 34, 28, 9, 46, 13, 45, 25, 27, 42, 44, 29, 40, 18, 7, 43, 17, 16, 15, 28, 45, 50, 41, 50, 48, 41, 3, 30] y = [30, 3, 41], yhat = [30, 3, 41]
X = [35, 20, 4, 33, 15, 32, 21, 4, 12, 27, 38, 6, 22, 46, 25, 27, 1, 45, 26, 35, 22, 46, 37, 48, 29, 28, 39, 38, 10, 28, 6, 33, 16, 38, 16, 39, 31, 49, 21, 6, 3, 32, 32, 1, 20, 35, 36, 23, 5, 19, 34, 1, 39, 39, 23, 2, 32, 34, 44, 20] y = [20, 44, 34], yhat = [20, 44, 34]
X = [17, 41, 10, 19, 7, 2, 50, 23, 41, 23, 22, 41, 36, 16, 2, 23, 24, 4, 18, 25, 17, 12, 45, 27, 18, 2, 5, 35, 3, 33, 49, 8, 21, 20, 14, 8, 20, 2, 22, 15, 23, 14, 13, 42, 48, 46, 41, 29, 26, 37, 34, 39, 35, 38, 35, 39, 29, 24, 39, 11] y = [11, 39, 24], yhat = [11, 39, 24]
X = [7, 46, 7, 10, 9, 9, 40, 20, 41, 45, 14, 28, 12, 28, 36, 46, 38, 46, 12, 7, 37, 32, 17, 1, 14, 37, 42, 18, 9, 36, 15, 42, 39, 4, 27, 11, 2, 41, 38, 28, 26, 21, 32, 36, 9, 12, 34, 15, 11, 46, 20, 42, 6, 2, 33, 14, 34, 7, 33, 18] y = [18, 33, 7], yhat = [18, 33, 7]
X = [13, 35, 46, 48, 15, 49, 32, 46, 22, 35, 39, 36, 48, 9, 3, 9, 40, 47, 21, 6, 12, 12, 31, 32, 42, 50, 22, 37, 8, 22, 19, 4, 41, 5, 18, 26, 13, 47, 42, 28, 47, 11, 32, 29, 31, 16, 14, 12, 46, 42, 24, 16, 11, 27, 37, 1, 32, 2, 40, 45] y = [45, 40, 2], yhat = [45, 40, 2]
The experiments were conducted using Google Colab notebooks. Here are my conclusions from the experiments so far.
Amaresh is a researcher in the areas of Machine Learning, Generative AI and Natural Language Processing. Additionally, he works in learning and development for professionals in the IT services industry. He uses Python, TensorFlow and Hugging Face for his research work. In these columns he shares his experiences with the intention of helping the reader understand concepts and solve problems. He is based out of Mumbai, India.