Tester’s Guide for Testing Machine Learning Models

Mukund Billa
Oct 12, 2019 · 7 min read

Machine learning is the study of applying algorithms and statistics so that a computer learns from data without being explicitly programmed. The computer relies on an algorithm that builds a mathematical model. This model uses a dataset, known as the “training dataset”, to learn and then to predict the desired outcome. Many learning algorithms can be used to solve a given problem, but the concept remains the same. Most of these algorithms fall into two broad categories: supervised learning and unsupervised learning.

Let’s look more closely at supervised learning, as it is more widely researched and is used in applications such as user profiling and product recommendations. Supervised learning models are classified into two types based on the values they output. One is categorical (classification model), where the value comes from a finite set (male or female; t-shirt, shirt, or innerwear; etc.). The other is numerical (regression model), where the value is a real-valued scalar (income level, product rating, etc.). These algorithms are trained on a dataset and then used to predict outputs.

Please note that a machine learning model generally does not produce an exact output; it provides an approximation or a probability of an outcome.

As a tester, have you ever wondered how to test an application that learns by itself and corrects its old mistakes? Don’t worry! Hold on and read this article.

Without further ado, let’s find out what testing approach to take for such learning algorithms.

Testing approach: The answer lies in the dataset. To test a machine learning algorithm, the tester defines three different datasets: a training dataset, a validation dataset, and a test dataset. All three are drawn from the same source data but must not overlap.

Please keep in mind that the process is iterative, and it is better to refresh the validation and test datasets on every iteration.

Below is the basic approach a tester can follow to test the developed learning algorithm:

1. The tester first defines three datasets: a training dataset (65%), a validation dataset (20%), and a test dataset (15%). Randomize the data before splitting, and do not include any validation or test data in the training dataset.
Figure: Partition of the dataset
Figure: Different datasets fed to the ML models

2. Once the datasets are defined, the tester trains the models on the training dataset. Once training is done, the tester evaluates the models on the validation dataset. This step is iterative: any tweaks or changes a model needs based on the results can be made and the model re-evaluated. This ensures that the test dataset remains unused and is available to test the model that has been evaluated.

Figure: Phases of ML model evaluation: train, validate, and evaluate. An iterative process to evaluate the best machine learning model.

3. Once all the models have been evaluated, the team picks the best model, the one it feels most confident about based on the lowest error rate and the closest predictions, and tests it with the test dataset to ensure that the model still performs well and that its results match those from the validation dataset. If the model’s accuracy looks unusually high, check that the test and validation sets have not leaked into the training dataset.

Figure: An iterative workflow of training, evaluating and testing of ML models
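
The split described in step 1 can be sketched in a few lines of plain Python. This is only a minimal illustration, not a definitive implementation: the function names `split_dataset` and `has_leakage`, the fixed seed, and the placeholder row IDs are assumptions made for the example.

```python
import random

def split_dataset(rows, train_frac=0.65, val_frac=0.20, seed=42):
    """Shuffle the rows, then cut them into disjoint train/validation/test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # randomize before splitting
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # whatever remains (about 15%)
    return train, validation, test

def has_leakage(train, validation, test):
    """Return True if any row appears in more than one partition."""
    t, v, s = set(train), set(validation), set(test)
    return bool((t & v) or (t & s) or (v & s))

rows = list(range(100))                     # placeholder row IDs, not real data
train, validation, test = split_dataset(rows)
print(len(train), len(validation), len(test))  # 65 20 15
print(has_leakage(train, validation, test))    # False
```

A check like `has_leakage` is worth running whenever the datasets are refreshed, since an unexpectedly high score on the test dataset is often a sign that rows were shared between partitions.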

What if we train the models with incorrect data? If a model is trained with an incorrect dataset, its error rate increases; when such data is introduced deliberately, this is known as data poisoning. The system should also be exercised with an adversarial dataset so that it is capable of sanitizing the data before it is used to train the models.
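
One way to picture the sanitizing step is a filter that rejects samples before they reach training. The sketch below is purely illustrative: the allowed label set, the rule that features must be finite non-negative numbers, and the function names are all assumptions, and a real system would need rules specific to its own data.

```python
ALLOWED_LABELS = {"rectangle", "circle", "square"}  # hypothetical label set

def is_valid_sample(sample):
    """Accept a sample only if its label is known and its features look plausible."""
    features, label = sample
    if label not in ALLOWED_LABELS:
        return False
    # Hypothetical rule: every feature must be a finite, non-negative number.
    return all(
        isinstance(x, (int, float)) and 0 <= x < float("inf")
        for x in features
    )

def sanitize(samples):
    """Split incoming samples into accepted and rejected lists."""
    accepted = [s for s in samples if is_valid_sample(s)]
    rejected = [s for s in samples if not is_valid_sample(s)]
    return accepted, rejected

incoming = [
    ((4.0, 2.0), "rectangle"),
    ((3.0, 3.0), "square"),
    ((-1.0, 2.0), "circle"),    # negative dimension, rejected
    ((2.0, 2.0), "triangle"),   # unknown label, rejected
]
accepted, rejected = sanitize(incoming)
print(len(accepted), len(rejected))  # 2 2
```

A tester can feed such a filter deliberately bad samples and verify that none of them end up in the accepted list.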

With the above information, let’s understand an important concept called “Cross-Validation” that helps us to evaluate the model's average performance.

Cross-Validation

Cross-validation is a technique in which the dataset is split into multiple subsets, and the learning models are trained and evaluated on these subsets. One widely used technique is k-fold cross-validation. Here the dataset is divided into k subsets (folds), which are used for training and validation over k iterations. In each iteration, one fold serves as the validation dataset and the remaining k-1 folds as the training dataset, so each fold is used exactly once for validation. Once all the iterations are complete, one can calculate the average prediction rate for each model.

Let’s understand this with the diagram below:

Figure: Each sub-sample is used exactly once as a validation dataset across all iterations.
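
The fold rotation can be sketched as follows. This is a minimal illustration in plain Python that works on sample indices only; `k_fold_indices`, `cross_validate`, and the `score_fn` callback are hypothetical names introduced for the example, and the data is assumed to have been shuffled beforehand.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread any remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

def cross_validate(n_samples, k, score_fn):
    """Average a score over k folds; score_fn(train, validation) returns a float."""
    scores = [score_fn(train, val) for train, val in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)

folds = list(k_fold_indices(10, 5))
for train, validation in folds:
    print("validate on", validation, "train on", len(train), "samples")
```

In practice `score_fn` would train a model on the training indices and return its score on the validation indices; the average over the folds is the figure used to compare models.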

Now that we know the testing approach, the main part is how to evaluate the learning models with the validation and test datasets. Let’s dig in and learn the most common evaluation techniques that a tester must be aware of.

There are certain terms that we need to understand before diving into the evaluation techniques. So let’s first look at what they are.

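The terms used in the rest of this article are: True Positive (TP), a positive case correctly predicted as positive; True Negative (TN), a negative case correctly predicted as negative; False Positive (FP), a negative case incorrectly predicted as positive; and False Negative (FN), a positive case incorrectly predicted as negative.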

With these basic terms in place, let’s dive into the techniques:

1. Classification Accuracy: This is the most basic way of evaluating a learning model. It is the ratio of correct predictions (TP + TN) to the total number of predictions. If the ratio is high, the model has a high prediction rate. The formula for accuracy is:
Accuracy = (TP + TN) / (TP + TN + FP + FN)

However, accuracy alone is not a good way to evaluate a model. For example, out of 100 samples of shapes, the model might correctly predict most of the True Negative cases (shapes that are not rectangles) yet have a low success rate on the True Positive ones (actual rectangles). The ratio may therefore look high even though the model fails to identify the rectangular shapes correctly.
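
To make this concrete, here is a small sketch with made-up counts (they are assumptions for illustration, not figures from any real model): 100 shapes, of which only 10 are rectangles, and a model that finds just 2 of them.

```python
def accuracy(tp, tn, fp, fn):
    """Ratio of correct predictions to all predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts for the "rectangle" class out of 100 shapes.
tp, fn = 2, 8     # 10 actual rectangles, only 2 detected
tn, fp = 90, 0    # 90 other shapes, all correctly rejected

print(accuracy(tp, tn, fp, fn))  # 0.92
print(tp / (tp + fn))            # 0.2, share of rectangles actually found
```

An accuracy of 92% looks healthy, yet the model misses 8 out of 10 rectangles, which is why the metrics that follow are needed.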

2. Confusion Matrix: This is a square N*N table, where N is the number of classes that the model needs to classify. It is best used for classification models, which categorize an outcome into a finite set of values. These values are known as labels. One axis is the label that the model predicted and the other is the actual label. To understand this better, let’s categorize the shapes into 3 labels [Rectangle, Circle, and Square]. As there are 3 labels, we draw a 3*3 table (confusion matrix) in which one axis is the actual label and the other is the predicted label.

Figure: Confusion matrix as a 3 [Actual] * 3 [Predicted] table. [Note: The Remarks column is for understanding purposes only.]
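
Building such a table from actual and predicted labels is mostly a counting exercise. The sketch below uses a handful of invented samples (not the values from the figure), and the function name `confusion_matrix` is just a name chosen for this example.

```python
LABELS = ["rectangle", "circle", "square"]

def confusion_matrix(actual, predicted, labels):
    """Return matrix[i][j] = number of samples with actual labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# Invented samples for illustration only.
actual    = ["rectangle", "rectangle", "circle", "circle", "square", "square"]
predicted = ["rectangle", "square",    "circle", "circle", "square", "rectangle"]

matrix = confusion_matrix(actual, predicted, LABELS)
for label, row in zip(LABELS, matrix):
    print(label, row)
```

Here rows are the actual labels and columns are the predicted labels, so the diagonal holds the correct predictions and everything off the diagonal is a misclassification.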

With the above matrix, we can calculate the two important metrics to identify the positive prediction rate.

Precision: Precision identifies how often the model was correct when it predicted the positive class. In other words, of all the samples the model labelled as a given class, it is the fraction that actually belong to that class. Let’s calculate the precision of each label/class using the above matrix.

Precision = True Positive / (True Positive + False Positive)
Figure: Precision calculations for each label/class

Based on the above calculations, the model is correct 76% of the time when it predicts the rectangle shape. Likewise, it is correct 72% and 42% of the time when it predicts the circle and square shapes, respectively.
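
Per-class precision can be read off a confusion matrix column by column. The matrix below is a hypothetical one chosen for easy arithmetic, so its results will not match the 76%, 72% and 42% quoted above; rows are assumed to be actual labels and columns predicted labels.

```python
LABELS = ["rectangle", "circle", "square"]
# Hypothetical counts: rows are actual labels, columns are predicted labels.
MATRIX = [
    [6, 2, 2],
    [1, 8, 1],
    [1, 0, 9],
]

def precision_per_class(matrix, labels):
    """Precision = TP / (TP + FP), computed from each column of the matrix."""
    result = {}
    for j, label in enumerate(labels):
        true_positive = matrix[j][j]
        predicted_as_label = sum(row[j] for row in matrix)  # TP + FP
        result[label] = true_positive / predicted_as_label if predicted_as_label else 0.0
    return result

precision = precision_per_class(MATRIX, LABELS)
print(precision)  # {'rectangle': 0.75, 'circle': 0.8, 'square': 0.75}
```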

Recall: This metric answers the following question: out of all the actual positive labels, how many did the model correctly identify? That is, it is the percentage of the actual positive class that was correctly identified. In other words, recall is the number of correct positive predictions divided by the number of results that should have been predicted as positive: True Positive / (True Positive + False Negative).

Figure: Recall calculation for each label/class

This simply means that the model correctly identified 66%, 53% and 60% of the actual rectangles, circles, and squares, respectively.
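
Recall is computed row by row rather than column by column. As a rough sketch, again with a hypothetical matrix (rows assumed to be actual labels, columns predicted labels) whose results will not match the 66%, 53% and 60% above:

```python
LABELS = ["rectangle", "circle", "square"]
# Hypothetical counts: rows are actual labels, columns are predicted labels.
MATRIX = [
    [6, 2, 2],
    [1, 8, 1],
    [1, 0, 9],
]

def recall_per_class(matrix, labels):
    """Recall = TP / (TP + FN), computed from each row of the matrix."""
    result = {}
    for i, label in enumerate(labels):
        true_positive = matrix[i][i]
        actual_count = sum(matrix[i])  # TP + FN
        result[label] = true_positive / actual_count if actual_count else 0.0
    return result

recall = recall_per_class(MATRIX, LABELS)
print(recall)  # {'rectangle': 0.6, 'circle': 0.8, 'square': 0.9}
```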

What if the threshold value is increased? The model then predicts the positive class less often, so fewer actual positives are identified and recall drops, while precision usually improves. If the threshold is lowered, more actual positives are identified, which increases recall, but more samples are also incorrectly predicted as the positive class, which lowers precision. To balance the two in a single metric, we can use the F1 measure, defined below. It gives a score between 0 and 1, where 1 means the model is perfect and 0 means it is useless. A good score tells us that the model has few false positives [other shapes predicted as rectangles] and few false negatives [rectangles not predicted as rectangles].

F1 measure formula: F1 = 2 * (Precision * Recall) / (Precision + Recall)
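
As a quick worked example, the F1 measure can be computed from the rounded precision and recall percentages quoted earlier for the three shapes. The function name `f1_score` is simply a name chosen for this sketch, and because the inputs are rounded, the results are approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall) per class, taken from the rounded percentages in the text.
reported = {
    "rectangle": (0.76, 0.66),
    "circle": (0.72, 0.53),
    "square": (0.42, 0.60),
}

for label, (p, r) in reported.items():
    print(label, round(f1_score(p, r), 2))
# rectangle 0.71
# circle 0.61
# square 0.49
```

The square class scores lowest because its precision is poor, even though its recall is comparable to the other classes.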

There is another evaluation technique based on the ROC [receiver operating characteristic] curve and AUC [area under the ROC curve], which requires plotting a graph of two parameters, the True Positive Rate (TPR, or recall) and the False Positive Rate (FPR), at various thresholds. However, we will cover this evaluation technique in a later article.
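
Until then, the idea of computing TPR and FPR at various thresholds can be sketched briefly. The scores and labels below are invented for illustration, and `roc_points` is a hypothetical helper name; a sample is treated as predicted positive when its score is at or above the threshold.

```python
def roc_points(scores, labels, thresholds):
    """Return (threshold, TPR, FPR) tuples; labels are 1 (positive) or 0 (negative)."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((t, tp / positives, fp / negatives))
    return points

# Invented model scores and true labels (1 = rectangle, 0 = not a rectangle).
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
labels = [1, 1, 0, 1, 0, 0]

points = roc_points(scores, labels, thresholds=[0.75, 0.5, 0.2])
for threshold, tpr, fpr in points:
    print(f"threshold={threshold} TPR={tpr:.2f} FPR={fpr:.2f}")
```

Lowering the threshold moves both rates upward; plotting the resulting (FPR, TPR) pairs gives the ROC curve.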

What is described above is a basic testing approach and set of evaluation techniques for a system with embedded learning capabilities.


Written by Mukund Billa, a test automation enthusiast

Analytics Vidhya

Analytics Vidhya is a community of Analytics and Data Science professionals. We are building the next-gen data science ecosystem https://www.analyticsvidhya.com
