In this article, we will cover the following:
1. What is CNN?
2. What are imbalanced datasets?
3. Why are imbalanced datasets a problem?
4. What does CNN do?
5. Working of CNN
6. Observations
So, “CNN” may sound like a fancy term to begin with. Most of us are familiar with KNN (K-Nearest Neighbors), and CNN is closely related to it.
CNN stands for Condensed Nearest Neighbors (not Convolutional Neural Network, in this context). It is a majority under-sampling technique used to handle imbalanced datasets, so that we can then apply KNN without the algorithm faltering.
What is an imbalanced dataset?
A dataset is said to be imbalanced if the instances of one class are far fewer or far more numerous than the instances of the other classes.
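For instance, here is a minimal sketch (using scikit-learn’s make_classification purely for illustration; it is not part of this article’s walkthrough) that builds a synthetic two-class dataset with a roughly 95:5 class ratio:

```python
from collections import Counter
from sklearn.datasets import make_classification

# Synthetic binary dataset: ~95% of samples in class 0, ~5% in class 1.
X, y = make_classification(
    n_samples=1000,
    weights=[0.95, 0.05],  # requested class proportions
    random_state=42,
)
print(Counter(y))  # e.g. Counter({0: 947, 1: 53}) -- heavily imbalanced
```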
Why is this a problem?
Well, many machine learning algorithms assume that the instances of all the classes are (almost) equally distributed. When this is not true, the algorithm becomes biased towards the majority class and its predictions deteriorate.
Further, many algorithms improve accuracy by reducing the overall error, and simply falter when the dataset is imbalanced: they can report a high accuracy while barely ever predicting the minority class, as the sketch below shows.
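Here is a hedged illustration of that accuracy trap, again on a synthetic 95:5 dataset (the dataset and baseline classifier are illustrative, not from the article’s running example): a model that always predicts the majority class scores about 95% accuracy while never catching a single minority instance.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

# A baseline that ignores the features and always predicts the majority class.
majority = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
y_pred = majority.predict(X_test)

print("accuracy:       ", accuracy_score(y_test, y_pred))  # ~0.95
print("minority recall:", recall_score(y_test, y_pred))    # 0.0 -- useless in practice
```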
Ensemble techniques are less affected by the imbalanced nature of the dataset, but it’s always better to cure the disease than to merely treat the symptoms.
What does CNN do?
CNN removes majority class instances in such a manner that as little information as possible is lost and the dataset becomes balanced.
The aim of CNN is to produce a subset of the dataset that can correctly classify all the data points in the original dataset.
Simply removing majority class data points at random would lead to information loss and hence a further deterioration of the algorithm’s output.
Working of CNN
Let us understand the working of CNN.
Suppose that we have a dataset D consisting of two classes of points: red squares and blue circles.
Step 1: Choose ‘k’. CNN works with a value of ‘k’, and the value we choose affects the outcome: changing ‘k’ changes the store we end up with. Let us assume k = 3 for our case.
Step 2: Start by randomly choosing any k (here 3) points to keep in the store S.
For simplicity, consider that we have squares and circles as shown below, and that we select these 3 points as the initial points in the store S.
(PS: The diagram below is just an illustration; it is clearly quite balanced. Imbalance refers to a majority-to-minority ratio of around 70:30 or more extreme.)
Step 3: We check whether the store S is ‘training set consistent’ or not. If it is, we stop; else we add a point to the store so as to improve it and make it training set consistent.
Training set consistency: A set is said to be training set consistent if running KNN with only the points in the store as the training set gives the same classification for every point as running KNN on the entire dataset.
i.e. for every point x in D: kNN_S(x) = kNN_D(x), where kNN_S and kNN_D denote the k-NN classifier trained on the store S and on the full dataset D respectively.
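A minimal sketch of such a consistency check is below; the helper name is_training_set_consistent is hypothetical, not a library function:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def is_training_set_consistent(X_store, y_store, X_full, y_full, k=3):
    """True iff k-NN trained on the store agrees with k-NN trained on the
    full dataset for every point of the full dataset."""
    knn_S = KNeighborsClassifier(n_neighbors=k).fit(X_store, y_store)
    knn_D = KNeighborsClassifier(n_neighbors=k).fit(X_full, y_full)
    return np.array_equal(knn_S.predict(X_full), knn_D.predict(X_full))
```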
Now, we check if our store S with the 3 selected points is consistent or not.
Consider the dark blue point.
According to the store, it must be blue (since out of the 3 points in store, 2 are blue).
So, kNN_S(dark blue point) = blue.
However, according to the complete dataset, that point must be red, because out of its 3 nearest neighbors in D, 2 are red.
i.e. kNN_D(dark blue point) = red.
Clearly, we have a training set inconsistency.
Step 4: We select a random point from the dataset to add to the store such that the inconsistency in the classification of the dark blue point is resolved, keeping the prediction on the full dataset as the gold standard.
i.e. select a random point xᵢ from the dataset such that, after adding xᵢ to S, kNN_S(dark blue point) = kNN_D(dark blue point) = red.
So we add the dark red point (in square) to the store S.
Step 5: Repeat until the store S is training set consistent.
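Putting Steps 1–5 together, here is a from-scratch sketch of the whole loop. The function name condensed_nearest_neighbors is illustrative, and the rule used for Step 4 (adding a random not-yet-stored point on which the store disagrees with the gold standard) is one reasonable reading of the step, not the only possible one:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def condensed_nearest_neighbors(X, y, k=3, seed=42):
    """Return the indices of a store S that is training set consistent."""
    rng = np.random.default_rng(seed)

    # Gold standard: what k-NN trained on the entire dataset predicts for each point.
    gold = KNeighborsClassifier(n_neighbors=k).fit(X, y).predict(X)

    # Step 2: seed the store with k randomly chosen points.
    store = list(rng.choice(len(X), size=k, replace=False))

    while True:
        # Step 3: check training set consistency against the gold standard.
        knn_S = KNeighborsClassifier(n_neighbors=k).fit(X[store], y[store])
        disagreements = np.flatnonzero(knn_S.predict(X) != gold)
        if disagreements.size == 0:
            break  # Step 5: stop -- the store is training set consistent.

        # Step 4: add a random point the store currently gets wrong. Each
        # iteration grows the store, so the loop ends after at most n additions.
        candidates = np.setdiff1d(disagreements, store)
        if candidates.size == 0:
            break  # safeguard: every disagreeing point is already stored
        store.append(int(rng.choice(candidates)))

    return np.asarray(store)
```

Calling condensed_nearest_neighbors(X, y) on NumPy arrays X (features) and y (labels) returns the indices to keep; X[idx], y[idx] is then the condensed, more balanced dataset.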
Some Observations:
- When the store is training set inconsistent, a point will definitely exist that will improve the store S.
- The process will terminate in at most n iterations, where n is the number of points in the dataset D.
- Upon termination, S will always be Training Set Consistent
You can use CNN directly via the imbalanced-learn (imblearn) library, which is scikit-learn compatible but a separate package, as shown below.
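A minimal usage sketch (the synthetic dataset is illustrative; imblearn’s CondensedNearestNeighbour keeps all minority samples and condenses only the majority class):

```python
# pip install imbalanced-learn
from collections import Counter
from imblearn.under_sampling import CondensedNearestNeighbour
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

cnn = CondensedNearestNeighbour(n_neighbors=3, random_state=42)
X_res, y_res = cnn.fit_resample(X, y)
print("after: ", Counter(y_res))
```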
Conclusion
Condensed Nearest Neighbors is a powerful technique for balancing an imbalanced dataset while aiming to minimize the information loss caused by under-sampling. It still leads to some information loss, since instances are removed from the dataset. However, a more powerful variant, Advanced-CNN, can be used to overcome this problem in a much more efficient way.