Introduction

In the early days of Netflix, the company offered a million-dollar prize to whoever could improve its recommendation system's accuracy by 10%. Over the next three years, understanding algorithms and implementing them properly both became key. We saw broad improvements, such as the use of ensemble methods, but other developments were tailored to the specific data; in the case of the Netflix Prize, that data was a highly sparse rating matrix.

The prize was eventually won, but more importantly, it shone a light on how algorithms alone aren't enough. Even teams with a good algorithm that fit the task well (and many at the time had one) didn't see immediate success. Often, real and substantive improvements come from boring adjustments: hyperparameter tuning, data handling, and adapting existing algorithms to the task at hand.

Matrix factorization was a widely used technique, and while original methods such as Funk MF were popular, SVD++ improved on them by incorporating signals beyond explicit ratings (likes, purchases, timelines, etc.). Making the most of the data you are given to solve the problem to its fullest potential is rarely addressed in the research community, but for any practicing MLE it is far more important. Even if we have an algorithm that fits the need well, how can we optimize its performance?
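To make the idea concrete, here is a minimal sketch of an SVD++-style prediction, where the user's factor vector is augmented with the factors of items the user has interacted with implicitly. The shapes, names, and toy data below are illustrative assumptions, not the original implementation.

```python
# Minimal sketch of an SVD++-style score: the user factor is augmented with the
# factors of items the user has implicitly interacted with.
# All names, shapes, and values here are illustrative.
import numpy as np

rng = np.random.default_rng(0)

n_users, n_items, k = 5, 10, 4                 # tiny toy dimensions
P = rng.normal(scale=0.1, size=(n_users, k))   # explicit user factors
Q = rng.normal(scale=0.1, size=(n_items, k))   # item factors
Y = rng.normal(scale=0.1, size=(n_items, k))   # implicit-feedback item factors
b_u = np.zeros(n_users)                        # user biases
b_i = np.zeros(n_items)                        # item biases
mu = 3.5                                       # global mean rating

# items each user interacted with implicitly (clicks, purchases, etc.)
implicit_items = {0: [1, 2, 7], 1: [3], 2: [], 3: [4, 5], 4: [9]}

def predict(u, i):
    """Biases plus the item factor dotted with the user factor
    augmented by normalized implicit-item factors."""
    N_u = implicit_items.get(u, [])
    implicit = Y[N_u].sum(axis=0) / np.sqrt(len(N_u)) if N_u else 0.0
    return mu + b_u[u] + b_i[i] + Q[i] @ (P[u] + implicit)

print(predict(0, 3))
```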

Three examples illustrate this discrepancy between research and real-world implementation perfectly. The first is a paper from Google researchers, “On the Difficulty of Evaluating Baselines” by Rendle et al., which examines another recommendation task. Progress on the ML10M dataset had come as a string of small improvements credited to better algorithms rather than better design. At the time, the best reported RMSE was 0.7644, achieved with mathematically sophisticated techniques. The researchers demonstrated that simpler Bayesian matrix factorization, with a few bells and whistles and careful setup, gave an RMSE as low as 0.7485: a supposedly weaker technique, but with better setup and proper implementation. The full name of the method, “Bayesian timeSVD++ flipped,” should tell you how many little improvements were layered on to bolster the results and make the most of the data provided.

The second example comes from the paper “A ConvNet for the 2020s” by Liu et al. Transformers had been getting most of the attention, but an older family of models performed better when given a proper setup. As transformers grew in popularity, the optimization of existing technology fell by the wayside. The best transformer at the time, Swin-L, achieved a top-1 accuracy of 87.3%, while the modernized ConvNeXt-XL model achieved 87.8%. Not only was the ConvNeXt model more accurate, it was also simpler and faster. The authors changed the stage compute ratio, changed the stem to a “patchify” layer, separated convolutional filters into groups (depthwise convolution), inverted the bottleneck, used larger kernel sizes, used GELU instead of ReLU, used fewer activation functions and normalization layers, substituted batch normalization with layer normalization, and used separate downsampling layers. Knowing how to use an algorithm, and understanding all the small things that help or harm the model, matters, and it takes time to learn. You can have a shiny new car, but if you haven’t driven it much, that old Ford Custom will speed right past you.
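To give a feel for how these changes fit together, here is a rough sketch of a ConvNeXt-style block showing several of them: a large-kernel depthwise convolution, layer normalization instead of batch normalization, an inverted bottleneck, and a single GELU activation. It is a simplified illustration (omitting details such as layer scale and stochastic depth), not the authors' exact code.

```python
# Sketch of a ConvNeXt-style block showing several of the changes listed above.
# Dimensions and details are illustrative, not the paper's exact implementation.
import torch
import torch.nn as nn

class ConvNeXtStyleBlock(nn.Module):
    def __init__(self, dim: int, expansion: int = 4):
        super().__init__()
        # 7x7 depthwise conv: one filter group per channel, larger kernel size
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)
        self.norm = nn.LayerNorm(dim)                   # layer norm, not batch norm
        self.pwconv1 = nn.Linear(dim, expansion * dim)  # inverted bottleneck: expand
        self.act = nn.GELU()                            # GELU instead of ReLU, used once
        self.pwconv2 = nn.Linear(expansion * dim, dim)  # project back down

    def forward(self, x):                               # x: (N, C, H, W)
        residual = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)                       # (N, H, W, C) for LayerNorm/Linear
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)                       # back to (N, C, H, W)
        return residual + x                             # residual connection

block = ConvNeXtStyleBlock(dim=96)
print(block(torch.randn(1, 96, 56, 56)).shape)          # torch.Size([1, 96, 56, 56])
```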

Lastly, my own experience demonstrated this point when I tried to implement the paper “Neural Collaborative Filtering.” Differences in data handling and hyperparameters drastically altered the final result. I didn’t distinguish between newer and older ratings, I didn’t normalize my layers, and I may have conducted an improper grid search that didn’t effectively optimize hyperparameters such as the learning rate, batch size, number of epochs, layer sizes, and a loss constant.
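For reference, here is the kind of simple grid search I had in mind. The `train_and_eval` function is a hypothetical stand-in for training an NCF model and returning its validation HR@10, and the grid values are illustrative, not the paper's.

```python
# Sketch of a plain grid search over the hyperparameters mentioned above.
# `train_and_eval` is a hypothetical placeholder, not real training code.
from itertools import product

grid = {
    "lr": [1e-4, 5e-4, 1e-3],
    "batch_size": [128, 256, 512],
    "epochs": [10, 20],
    "layers": [[64, 32, 16], [128, 64, 32]],
}

def train_and_eval(lr, batch_size, epochs, layers):
    # placeholder: train the model with these settings and return validation HR@10
    return 0.0

best_score, best_cfg = -1.0, None
for lr, bs, ep, layers in product(*grid.values()):
    score = train_and_eval(lr, bs, ep, layers)
    if score > best_score:
        best_score = score
        best_cfg = {"lr": lr, "batch_size": bs, "epochs": ep, "layers": layers}

print(best_cfg, best_score)
```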

The paper reported that the method would achieve an HR@10 of 68%, but my version only reached 17%! Reading the paper more closely to understand what went wrong, I found a single line describing how the authors boosted their results by shrinking the pool of candidate items, making the problem much easier to solve, and justified the practice by citing two other papers: “Since it is too time consuming to rank all items for every user during evaluation, we followed the common strategy [6, 21] that randomly samples 100 items that are not interacted by the user, ranking the test item among the 100 items.” After adding this change to my own setup, my 17% jumped to 53%! That still wasn’t the 68% claimed, as my data handling and hyperparameters still differed from the authors’, but a single evaluation detail had closed most of the gap.
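For anyone reproducing this, the sampled protocol the quote describes looks roughly like the sketch below: rank the held-out test item against 100 randomly sampled items the user never interacted with, and count a hit if it lands in the top 10. The `score` function is a stand-in for whatever trained model you are evaluating; the toy data is illustrative.

```python
# Sketch of the sampled HR@10 evaluation described in the quoted line.
# `score` is a placeholder for a trained model's predicted preference.
import random

def hit_at_10(user, test_item, interacted, n_items, score, n_negatives=100):
    candidates = [i for i in range(n_items) if i not in interacted and i != test_item]
    negatives = random.sample(candidates, n_negatives)
    ranked = sorted(negatives + [test_item], key=lambda i: score(user, i), reverse=True)
    return test_item in ranked[:10]

# toy usage with a random scorer standing in for a real model
random.seed(0)
hits = [hit_at_10(u, test_item=0, interacted={1, 2, 3}, n_items=500,
                  score=lambda u, i: random.random()) for u in range(100)]
print(sum(hits) / len(hits))   # HR@10 over the toy users
```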

This last example also highlights another problem the Google researchers addressed in their paper: improper protocol. When we test various models, the models should be the only thing being tested. But as in my case, the handling of data differed, and in many other cases we’re not setting up an exact one-to-one comparison of models. We’re testing algorithms and techniques: not just a single decision maker, but a whole set of instructions for getting the best result. And when so much of the result can change from setup to setup, it isn’t always the fancy new model that produces the better results.

So what needs changing?

  1. For any given dataset, an established protocol for handling the data and assessing performance is needed. This also touches on p-value hacking: there have been reported cases of researchers rerunning tests on different seeds just to edge out performance (see the sketch after this list).

  2. Understand how any new improvement generalizes beyond a single benchmark. Some techniques are only useful for specific datasets; matrix factorization shone on the Netflix Prize’s sparse rating matrix, not on every kind of data. Because of this, we risk overfitting to established benchmark datasets that may not be representative of real-world environments.

  3. We should be more critical of hyperparameter optimization. The long list of adjustments in the ConvNeXt paper demonstrates that proper tuning is required to get good results. Knowing which hyperparameter ranges to search, how to handle the data properly, and which algorithmic tweaks fit the problem at hand can improve performance far more than simply reaching for a better algorithm.
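As a small illustration of the first point, one honest protocol is to fix the set of seeds up front and report the mean and spread across them rather than quietly keeping the best run. The `train_and_eval` function below is a hypothetical placeholder for training a model and returning its test metric.

```python
# Minimal sketch of a seed-honest protocol: decide the seeds before running
# any experiment, then report the mean and spread instead of the best run.
import statistics

def train_and_eval(seed: int) -> float:
    # placeholder: train with this seed and return the test metric
    return 0.70 + (seed % 5) * 0.001

seeds = [0, 1, 2, 3, 4]                     # fixed before any experiments
scores = [train_and_eval(s) for s in seeds]
print(f"mean={statistics.mean(scores):.4f}  stdev={statistics.stdev(scores):.4f}")
```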

With so many moving parts in any given ML task, the most important skill is knowing how to take advantage of each one to optimize performance. In research we take time to investigate new tools, but unless there’s a juicy million-dollar prize, we rarely take the time to truly push them to their fullest potential!