Cross Validation and its Types
As we all know, a data science pipeline consists of data gathering, data preprocessing, feature engineering, feature selection, model creation, and model deployment.
We also know that before model creation we split our data into train and test sets. Let's say you have 1,000 records and you perform a 70–30 split; the split is done randomly, controlled by the parameter “random_state”.
The first time, you take random_state = 0 and perform the 70–30 split; let's say the accuracy of the model is 85%.
The second time, you take random_state = 100; the train and test data get shuffled differently, and the accuracy of the model is 87%.
So if you keep changing the random_state value, the accuracy keeps changing, and you cannot tell your stakeholders what accuracy your model actually achieves. To prevent this, we have a concept called cross-validation.
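Here is a minimal sketch of the problem, assuming a placeholder dataset and model (make_classification and LogisticRegression stand in for your own data and classifier):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# placeholder data: 1000 records, binary target
X, y = make_classification(n_samples=1000, random_state=42)

for seed in [0, 100]:
    # 70-30 split; random_state controls how the rows are shuffled
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print(seed, accuracy_score(y_test, model.predict(X_test)))

Run it and the two seeds will typically report different accuracies for the very same model.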
What it is
Cross-validation is the process of separating your total training set into two subsets, a training set and a validation set, and evaluating your model (for example, to choose its hyperparameters). You do this iteratively, selecting a different training and validation set each time, in order to reduce the bias you would get from selecting only one validation set.
Types most commonly used in industry
- K fold Cross Validation
- Stratified K fold Cross Validation
- Time Series Cross Validation
K fold Cross Validation
It is very simple: we split the training dataset into K folds and perform K experiments.
Let's take an example: we have 1,000 records and K = 5. We divide the 1,000 records into 5 folds, each consisting of 200 records, and perform K experiments.
Number of records in each fold = (total number of records)/K
In split 1, the first 200 records (fold 1) are taken as the test data and the remaining 800 records (folds 2, 3, 4, 5) as the training data, and we get Accuracy1 (metric 1).
In split 2, the second 200 records (fold 2) are taken as the test data and the remaining 800 records (folds 1, 3, 4, 5) as the training data, and we get Accuracy2 (metric 2).
Like this, we perform all K experiments.
At the end of each experiment you get an accuracy; finally, you take the mean of all the experiments' accuracies as the final accuracy of the model.
From these experiments you also learn the minimum and maximum accuracy the model can achieve, so you can tell your stakeholders: “My model's accuracy is at least this much and at most this much.”
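A rough sketch of those K experiments done by hand, continuing with the placeholder X, y and LogisticRegression from the earlier sketch:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

kf = KFold(n_splits=5)  # K = 5, so 5 experiments with 200 test records each
accuracies = []
for train_idx, test_idx in kf.split(X):
    # one fold is the test set, the remaining four folds are the training set
    model = LogisticRegression(max_iter=1000).fit(X[train_idx], y[train_idx])
    accuracies.append(model.score(X[test_idx], y[test_idx]))

print("mean:", np.mean(accuracies))
print("min:", np.min(accuracies), "max:", np.max(accuracies))

In practice you rarely write this loop yourself, because scikit-learn wraps it in a single call: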
from sklearn.model_selection import cross_val_score
score = cross_val_score(classifier, X, y, cv=10)
cv=10 means we perform 10 experiments, i.e. K = 10; the call returns the 10 accuracies as an array, so score.mean() gives the final accuracy.
Disadvantage or drawback:
Let's say we have an imbalanced dataset for a binary classification problem with classes 1 and 0, where, for example, the first 200 records all belong to the same class. In that case some folds may contain only one class, and the accuracy you get can be poor or misleading, as the sketch below shows.
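A small demonstration, assuming a hypothetical label array where the first 200 records all belong to class 1 (the features do not matter for the split itself):

import numpy as np
from sklearn.model_selection import KFold

y_sorted = np.array([1] * 200 + [0] * 800)  # hypothetical imbalanced labels
X_dummy = np.zeros((1000, 1))               # dummy features, only the indices matter

kf = KFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(kf.split(X_dummy), start=1):
    print(f"fold {fold}: classes in test set = {np.unique(y_sorted[test_idx])}")

Fold 1 is tested only on class 1 while its training folds contain only class 0, so the model is evaluated on a class it never saw. To overcome this problem we have another technique: Stratified K fold Cross Validation.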
Stratified K fold Cross Validation
Everything is the same as in K fold CV, but this technique makes sure that in each experiment the train and test sets contain roughly the same proportion of each class as the full dataset.
In simple words, the train dataset will have a good proportion of records from both classes (1 and 0), and the test dataset will also contain records of both classes.
from sklearn.model_selection import StratifiedKFold
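Repeating the check above with StratifiedKFold (still using the hypothetical y_sorted and X_dummy) shows every test fold keeping the 80:20 class ratio of the full data:

skf = StratifiedKFold(n_splits=5)
for fold, (train_idx, test_idx) in enumerate(skf.split(X_dummy, y_sorted), start=1):
    # bincount -> [count of class 0, count of class 1]; roughly [160, 40] per fold
    print(f"fold {fold}: class counts in test set = {np.bincount(y_sorted[test_idx])}")

In fact, when you pass a classifier and an integer cv to cross_val_score, scikit-learn already uses stratified folds under the hood.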
Time Series Cross Validation
This type of CV works for a completely different kind of problem: time series.
Let's say you have a stock price prediction problem. You cannot do a random train_test_split and measure accuracy, because the data is ordered in time and a random split would let the model train on the future.
In that case we use time series CV. The splits respect the time order: the model always trains on past data, and the most recent records in each split are taken as the test set.
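scikit-learn implements this as TimeSeriesSplit. A minimal sketch with placeholder data (ten values standing in for prices ordered by date):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

prices = np.arange(10)  # hypothetical series in time order

tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(prices):
    # the training window always precedes the test window
    print("train:", train_idx, "test:", test_idx)

Each successive split trains on a longer stretch of the past and tests on the period that immediately follows it, so the model never peeks ahead.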