Techniques to handle class imbalance

Read Time: 6 min

In this post, I explain the class imbalance problem in classification models and techniques to handle it using Python.

Table of contents

  • What is class imbalance?
  • Why is class imbalance an issue?
  • How to handle class imbalance (methods)
  • Use case study with fraud detection data

First, we need to understand what class imbalance is and, most importantly, why it is considered a problem.

Let me explain with an example:

Suppose we have customer data for fraud detection. Based on the customer’s transactions and other details, we need to figure out whether fraud occurred or not. This problem can be framed as binary classification.

For this post, I have used a credit card fraud detection dataset as the use case. You can find the dataset here:

https://github.com/Appiiee/Techniques-to-handle-class-imbalance

As you can observe, we have 2600 cases where no fraud happened, while there are almost 400 cases where fraud occurred, which is very few compared to the “non fraud” cases.
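Here is a quick way to check the class distribution yourself; a minimal sketch assuming the dataset has been downloaded from the repository above (the file name below is an assumption, adjust the path to your copy):

import pandas as pd

# Load the dataset (file name is an assumption; adjust the path to your copy)
df_new = pd.read_csv('credit_card_fraud.csv')

# Count samples per class: 0 -> non fraud, 1 -> fraud
print(df_new['isFradulent'].value_counts())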

In this scenario, if I train a model, the output will be “non fraud” most of the time, i.e. the model is biased.

Why biased output?

The reason is that fraud cases are very few, so the machine learning model cannot learn the features of the fraud class well, while “non fraud” cases are plentiful, so the model learns mostly from them and predicts the majority class most of the time.

Problems with imbalanced class data

If the data is imbalanced, there is a high chance that the model will predict the dominant class. In that case, the accuracy score will be high, but it won’t be a correct measure of performance.

That’s why it is very important for the model to learn all the features equally; for that, we need an equal number of data points for both classes. This is true for multi-class classification as well.

Okay, we understand the problem. One more question: if we check accuracy through accuracy_score and other similar metrics, will they be able to reveal the class imbalance in the result?

NO!!!!

Why??

The reason is that accuracy is calculated over all classes together. It does not consider class-wise performance, so the result will be biased towards the majority class.
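To see per-class performance instead of a single overall number, class-wise metrics such as precision, recall, and the confusion matrix can be checked. A minimal sketch with scikit-learn, assuming y_test and y_pred come from a model trained earlier:

from sklearn.metrics import classification_report, confusion_matrix

# Precision, recall and F1 reported separately for each class
print(classification_report(y_test, y_pred))

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))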

Techniques to handle class imbalance:

  • Resampling methods
    • Random up sampling
    • Random down sampling
  • SMOTE Algorithm
  • Cluster based sampling
  • Cost Sensitive learning
  • Other methods
    • Frame the problem as anomaly detection
    • Ensemble methods
    • Combine minority classes

Note:

Whenever any class imbalance handling method is applied to the data, it has to be applied to the training data only. The reason is that if a sampling method is applied to the whole dataset (both train and test), the test data may contain synthetic samples rather than actual ones, so the measured accuracy would be misleading and the results may point towards overfitting or under-fitting.
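A minimal sketch of this workflow: split the data first, then resample only the training part (the split parameters here are illustrative):

from sklearn.model_selection import train_test_split

# Separate features and target
X = df_new.drop('isFradulent', axis=1)
y = df_new['isFradulent']

# Split BEFORE any resampling so the test set keeps the original class distribution
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

# Apply up sampling / down sampling / SMOTE on X_train, y_train only,
# and evaluate the model on the untouched X_test, y_test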

Resampling Methods

If we have imbalanced data, where one class has far fewer data points than the other, we can re-sample the data points, either by increasing the minority class data points or by decreasing the majority class data points.

There are multiple methods of resampling as follows:

  • Up sampling
  • Down sampling

Random Up sampling

In random up sampling, copies of minority class samples are created randomly with replacement. We can generate any number of samples of the minority class.

n_samples is the number of samples we want to generate from the minority class. In the example given below, I have up sampled the minority class so that it matches the majority class count.

Here is the example code:

import pandas as pd
from sklearn.utils import resample

# Separate majority and minority classes from the training sample
df_majority = df_new[df_new['isFradulent'] == 0]
df_minority = df_new[df_new['isFradulent'] == 1]

# Upsample the minority class (with replacement) up to the majority class count
df_minority_upsampled = resample(df_minority, replace=True, n_samples=2115, random_state=123)

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled['isFradulent'].value_counts()

As you can observe, the number of “fraud” cases has increased.

Pros:

  • Easy to implement.

Cons:

  • Sometimes the samples created for the minority class may still be wrongly classified as the majority class, since the samples are generated randomly.
  • It may increase the chances of overfitting.

Random down sampling/under sampling

In the down sampling (under sampling) technique, we randomly remove data instances from the majority class.

Here is the example code for random under-sampling:

import pandas as pd
from sklearn.utils import resample

# Separate majority and minority classes
df_majority = df_new[df_new['isFradulent'] == 0]
df_minority = df_new[df_new['isFradulent'] == 1]

# Downsample the majority class (without replacement) to the minority class count
df_majority_downsampled = resample(df_majority, replace=False, n_samples=345, random_state=123)

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Display new class counts
df_downsampled['isFradulent'].value_counts()

As you can see, the number of samples in the “non fraud” category has been reduced.

Pros

  • Takes less time to compute
  • Consumes less memory at run time as training samples are reduced

Cons

  • As samples are removed from the majority class, it is highly possible that important information is lost.
  • It can lead to underfitting.

SMOTE (Synthetic Minority Over-Sampling Technique)

In the SMOTE algorithm, samples are generated within the region where minority class samples are already present.

How does SMOTE work?

Let’s understand it via an example.

Suppose we have a few samples as shown below, where the red dots represent the minority class and the blue ones the majority class.

In SMOTE, we create synthetic minority instances within the range of the minority class. To create a synthetic sample, the line between neighbouring minority class samples is taken and new samples are placed along that line.

All synthetic samples are created along these connecting lines, as shown by the dark red dots.

from imblearn.over_sampling import SMOTE

# X, y here should be the training split only (see the note above)
sm = SMOTE(sampling_strategy='auto', random_state=None, k_neighbors=3)
X_train_res, y_train_res = sm.fit_resample(X, y)

print("After applying SMOTE, counts '1': {}".format(sum(y_train_res == 1)))
print("After applying SMOTE, counts '0': {}".format(sum(y_train_res == 0)))
Output:
After applying SMOTE, counts '1': 2115
After applying SMOTE, counts '0': 2115

Pros:

  • Reduces the chances of overfitting, since the synthetic samples are not simple random copies of existing points.
  • More effective, as no useful information is lost.

Cons:

  • While generating synthetic samples in the minority class region, a majority class sample may lie in that region, which can lead to the generation of noisy or wrong samples.
  • If the data is high dimensional, the algorithm is not very effective.

Cluster based Sampling

For cluster based sampling, the K-Means algorithm is used. Using K-Means, clusters are formed separately for the minority and majority classes.

The majority class points are then replaced with the centroids of their clusters, so after applying this under sampling technique the majority class is represented only by its cluster centroids.

from imblearn.under_sampling import ClusterCentroids

# Under-sample the majority class (label 0) down to 300 points placed at K-Means centroids
cluster_centroid = ClusterCentroids(sampling_strategy={0: 300})
X_cluster, y_cluster = cluster_centroid.fit_resample(X, y)

print("After applying cluster centroid algorithm, counts '1': {}".format(sum(y_cluster == 1)))
print("After applying cluster centroid algorithm, counts '0': {}".format(sum(y_cluster == 0)))
Output: 
After applying cluster centroid algorithm, counts '1': 345 
After applying cluster centroid algorithm, counts '0': 300

Tomek Links

Tomek links are a kind of under sampling (down sampling). Using this approach, we first find Tomek links between data points. A Tomek link is a pair of two data points, one from the majority class and one from the minority class, that are nearest neighbours of each other.

These pairs are problematic because such points have a high chance of being wrongly classified.

As a solution, we remove the data points belonging to the majority class from such links and balance the dataset. This approach also increases the separation between the two classes.

from imblearn.under_sampling import TomekLinks

# Remove majority class points that form Tomek links with minority class points
tomeklinks = TomekLinks()
X_tomeklinks, y_tomeklinks = tomeklinks.fit_resample(X, y)

print("Using tomek links, counts '1': {}".format(sum(y_tomeklinks == 1)))
print("Using tomek links, counts '0': {}".format(sum(y_tomeklinks == 0)))
Output: 
Using tomek links, counts '1': 345 
Using tomek links, counts '0': 1938

Other Methods:

Cost Sensitive Learning

Another approach to deal with class imbalance is to modify the cost function in such a way that the penalty for misclassifying minority instances is higher.

In the sklearn library, many estimators have a class_weight argument. Using this argument, we can penalize errors on the minority class according to how under-represented it is.
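A minimal sketch using scikit-learn’s LogisticRegression (any estimator with a class_weight parameter works the same way; the explicit weights mentioned in the comment are illustrative):

from sklearn.linear_model import LogisticRegression

# class_weight='balanced' weights classes inversely proportional to their frequencies;
# an explicit dict such as {0: 1, 1: 10} would penalize mistakes on class 1 ten times more
clf = LogisticRegression(class_weight='balanced', max_iter=1000)
clf.fit(X_train, y_train)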

Reframe the problem as anomaly detection

If the minority class data points are very few in number, the problem can be framed as anomaly detection rather than classification. Methods for detecting anomalies can then be applied to the data points.
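One possible choice is an unsupervised detector such as IsolationForest; this is only a sketch, and the contamination value (the expected fraction of frauds) is an assumption that would need tuning for the actual data:

from sklearn.ensemble import IsolationForest

# contamination = assumed fraction of anomalies (frauds) in the data
iso = IsolationForest(contamination=0.13, random_state=42)
iso.fit(X_train)

# predict() returns -1 for anomalies (potential frauds) and 1 for normal points
predictions = iso.predict(X_test)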

Combining minority classes into fewer ones

If multiple classes are present and each has very few data points, the minority classes can be combined into a single class. Whether this is feasible, however, depends entirely on the use case.
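For illustration, with a hypothetical multi-class target column, the rare classes could be mapped into a single combined class before training (the column and class names below are made up):

# Hypothetical rare classes combined into one 'fraud_other' class
rare_classes = ['fraud_type_b', 'fraud_type_c']
df_new['target_combined'] = df_new['target'].replace(rare_classes, 'fraud_other')

print(df_new['target_combined'].value_counts())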

Ensemble Methods

Tree based methods, bagging, and boosting are also used to handle class imbalance. These will be discussed in a separate blog post.

Here is the link to the GitHub repository, where you can find the full code for the techniques to handle class imbalance.

https://github.com/Appiiee/Techniques-to-handle-class-imbalance

Resources and further reading: