TECHNIQUES TO EVALUATE & ADAPT MODELS 

COMPARING TWO STATISTICAL TECHNIQUES THAT CAN BE USED TO TEST & ADAPT MODELS. 

Model Insights | 30-10-2020

TRAINING MODELS BASED ON THEORETICAL RESULTS CAN DIFFER FROM TRAINING MODELS BASED ON REAL-LIFE RESULTS. IN THIS ARTICLE, TWO METHODS ARE PROPOSED TO TEST & ADAPT MODELS IN PRODUCTION.

For AI/ML models that are running in production environments, clients always ask the same questions:

  • Is the model learning over time?
  • Is this the best model that is available? Or are there better models available?
  • Does the model work just as well for all my verticals?

Some of these questions can be answered by looking at metrics like MAE, bias, and MAPE. Sometimes, however, two models need to be compared in a production environment to get real-life, non-theoretical results. Testing models in a production environment always comes at a cost: to run the test, the non-optimal model (or settings) also has to be tried on real traffic.
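As a point of reference, the three metrics mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for any particular model: the function names, the sign convention for bias (predicted minus actual) and the sample numbers are assumptions made for this example.

```python
from statistics import mean


def mae(actual, predicted):
    """Mean absolute error: average size of the error, ignoring its direction."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def bias(actual, predicted):
    """Mean signed error. Assumed convention: positive means over-prediction."""
    return mean(p - a for a, p in zip(actual, predicted))


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.

    Observations with an actual value of 0 are skipped, because the
    percentage error is undefined for them.
    """
    return 100 * mean(
        abs(a - p) / abs(a) for a, p in zip(actual, predicted) if a != 0
    )


# Hypothetical sample data, for illustration only.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 420.0]

print(f"MAE:  {mae(actual, predicted):.2f}")
print(f"bias: {bias(actual, predicted):.2f}")
print(f"MAPE: {mape(actual, predicted):.2f}%")
```

These numbers describe how a single model performs on known outcomes; they do not say how two models compare on live traffic, which is what the rest of this article is about.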

The most common and popular technique for testing two or more models is A/B testing. Its biggest drawback is that no optimization takes place while the test is running: the non-optimal model runs for the entire test period, which makes the A/B test costly.

There is an alternative to A/B testing that does optimize during the test: the Multi-Armed Bandit (MAB). Let’s compare MAB with A/B testing and look at when to use which methodology.

Explore/exploit

Both methodologies are based on the “explore and exploit” principle. In the explore phase, the candidate models are tried out to learn how well each one performs; in A/B testing, this is the period in which the A and B groups run side by side. In the exploit phase, the best-performing model is put to use. In A/B testing, the two phases are sequential: every model, including the non-optimal one, runs all the way to the end of the test period, and only then is the winner exploited. That makes the test more costly. In MAB, the two phases are intertwined: non-optimal models are largely set aside early in the test period, which reduces the cost of the test. This is sometimes called “earn while you learn”.

In the example above, model C is the model with the best results. In the A/B test, model C is used in 100% of the cases only after week 6. In the MAB test, the share of quotes that use model C already increases from week two onwards, and the less accurate models A and B are used less and less during the test period. Selecting the best-performing model more often early in the test period decreases the cost of the test significantly.
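The cost difference can be made concrete with a small calculation. The sketch below is illustrative only: the conversion rates, the weekly quote volume and the MAB traffic shares are made-up numbers chosen to mimic the pattern described above, not the figures from the example.

```python
# All numbers below are hypothetical and only serve to illustrate the idea.
RATES = {"A": 0.10, "B": 0.12, "C": 0.15}  # assumed conversion rate per model
QUOTES_PER_WEEK = 1000
WEEKS = 6


def expected_conversions(weekly_shares):
    """Expected number of conversions for a schedule of weekly traffic shares."""
    total = 0.0
    for shares in weekly_shares:
        for model, share in shares.items():
            total += QUOTES_PER_WEEK * share * RATES[model]
    return total


# A/B(/C) test: traffic is split equally for the whole test period.
ab_shares = [{"A": 1 / 3, "B": 1 / 3, "C": 1 / 3}] * WEEKS

# MAB: model C receives a growing share from week two onwards (assumed shares).
mab_shares = [
    {"A": 1 / 3, "B": 1 / 3, "C": 1 / 3},
    {"A": 0.30, "B": 0.30, "C": 0.40},
    {"A": 0.20, "B": 0.25, "C": 0.55},
    {"A": 0.10, "B": 0.20, "C": 0.70},
    {"A": 0.05, "B": 0.10, "C": 0.85},
    {"A": 0.02, "B": 0.05, "C": 0.93},
]

# Cost of the test: conversions missed compared with using the best model
# for every quote from day one.
best_case = QUOTES_PER_WEEK * WEEKS * max(RATES.values())
ab_cost = best_case - expected_conversions(ab_shares)
mab_cost = best_case - expected_conversions(mab_shares)

print(f"Missed conversions, A/B test: {ab_cost:.0f}")
print(f"Missed conversions, MAB:      {mab_cost:.0f}")
```

With any schedule that shifts traffic towards the best model during the test, the MAB cost ends up below the A/B cost; how large the gap is depends entirely on the assumed rates and shares.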

How does MAB work?

The MAB originates in a thought experiment in which a gambler plays multiple slot machines and has to determine which one has the highest payout. The gambler can only try one slot machine at a time: pulling the arm of one machine means not pulling the arms of the others. The goal is to find out, in as few tries as possible, which slot machine pays out the most. A MAB algorithm does this by keeping the opportunity cost low, that is, by spending as few attempts as possible on the weaker machines, which minimizes the regret. Regret is defined as the difference between the earnings that would have been received if the optimal slot machine had always been used and the actual earnings.
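The slot machine experiment can be simulated in a few lines. The sketch below uses epsilon-greedy, one of the simplest bandit strategies; the article does not state which strategy is used in practice, so this is only an assumption for illustration (Thompson sampling and UCB are common alternatives). The payout rates, the value of `epsilon` and the function name are likewise hypothetical.

```python
import random


def epsilon_greedy(true_rates, rounds=10_000, epsilon=0.1, seed=0):
    """Simulate an epsilon-greedy bandit on arms with fixed payout probabilities.

    true_rates is hidden from the algorithm; it is only used to draw rewards
    and to compute the regret afterwards.
    """
    rng = random.Random(seed)
    n_arms = len(true_rates)
    pulls = [0] * n_arms  # how often each arm was tried
    wins = [0] * n_arms   # how often each arm paid out

    for _ in range(rounds):
        if rng.random() < epsilon or 0 in pulls:
            # Explore: try a random arm (always, until every arm was tried once).
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pull the arm with the best observed payout rate so far.
            arm = max(range(n_arms), key=lambda i: wins[i] / pulls[i])
        reward = 1 if rng.random() < true_rates[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward

    # Expected regret: payout missed compared with always pulling the best arm.
    best = max(true_rates)
    regret = sum(n * (best - rate) for n, rate in zip(pulls, true_rates))
    return pulls, wins, regret


# Hypothetical payout probabilities for three slot machines.
rates = [0.05, 0.10, 0.30]
pulls, wins, regret = epsilon_greedy(rates)
uniform_regret = 10_000 * (max(rates) - sum(rates) / len(rates))

print("Pulls per arm:", pulls)
print(f"Regret, epsilon-greedy:      {regret:.0f}")
print(f"Regret, equal split (A/B):   {uniform_regret:.0f}")
```

The `epsilon` parameter controls the trade-off: a higher value means more exploration (and more regret from random pulls), while a lower value means the algorithm commits faster but takes longer to recover from an unlucky start.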

Is MAB always better than A/B testing?

No, MAB and A/B testing are two different methodologies with different use cases.

When to use which method?

Multi-Armed Bandit:

  • Continuous improvement of running models; MAB can run continuously to adapt the feature importance to the latest trends in the data.
  • Low-traffic models; if a model has low traffic, it takes too long to get significant results with A/B testing.
  • High-value products; for products with a high value, like trucks, each lost deal costs a considerable amount of money. MAB abandons the less optimal models at an early stage, so the overall loss is smaller than with A/B testing, where the less optimal models also run for the complete duration of the test.
  • Hands-off optimization; the MAB can be used as “set and forget”: the optimization will continue to pick the optimal model over time.

A/B testing:

  • Test on history; if an actual experiment in production is not feasible, but data variations from the past are available, A/B testing can be used to deduce the best model in retrospect.
  • Optimizing other variables next to conversion; MAB doesn’t work for optimizing more than one parameter. In most cases, this parameter is the conversion rate. If a second parameter, like sales price, is added, A/B testing is the better option.
  • Proof; if statistical evidence is needed, A/B testing is the faster methodology. In MAB, the learning points are not equally distributed over the models, so the models that receive little traffic collect data slowly and it takes longer to reach statistical proof.
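The last point can be illustrated with a standard two-proportion z-test, a common way to check whether two conversion rates differ significantly. This is a simplified sketch with made-up numbers: the function name and the counts are assumptions, and the test is applied naively, ignoring the additional complications that adaptive traffic allocation introduces for statistical inference.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 2,000 quotes in total, observed rates of 10% versus 15%.
# Equal split, as in an A/B test: 1,000 quotes per model.
z_equal, p_equal = two_proportion_z_test(100, 1000, 150, 1000)

# Unequal split, as a bandit might produce: 200 versus 1,800 quotes.
z_unequal, p_unequal = two_proportion_z_test(20, 200, 270, 1800)

print(f"Equal split:   z = {z_equal:.2f}, p = {p_equal:.4f}")
print(f"Unequal split: z = {z_unequal:.2f}, p = {p_unequal:.4f}")
```

With the same total traffic and the same observed rates, the equal split gives a clearly smaller p-value than the unequal split, because the model that received only a small share of the traffic contributes a much noisier estimate.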