You might have heard “Genetic algorithm” in biology, but ever wondered how this algorithm is applied to data science? For a large dataset (where the number of features is huge), it’s really difficult to select features only through the filter, wrapper or embedded methods as these are not efficient for handling large features alone.
So, to overcome that issue, we use a genetic algorithm for feature selection.
Table of content
- What is genetic algorithm
- Steps followed for implementation
- Use case study (House Price prediction)
- Advantages and Disadvantages
Firstly, Let’s understand what is genetic algorithm
In the first place, we need to understand that the genetic algorithm is a search algorithm based on Darwin’s biological evolution theory. This theory focuses on evolution through natural selection. How genes are selected from two species, how chromosomes are formed by different combinations of genes, and which is the best combination that inherits into new species i.e. their offspring. Darwin’s theory explains well all such questions.
So the question comes, how is this theory applicable for feature selection?
Different stages of genetic algorithm
To demonstrate how genetic algorithm helps in feature selection, let’s check the house price prediction problem on Kaggle. For more details, here is the link for the competition:
This dataset contains 80 features, so to visualize it through different plots directly and select features on the basis of that is really cumbersome. So we are gonna deal with them using a genetic algorithm.
The initial population contains an initial set of features or independent variables or individuals from which we have to select the best features for prediction.
Each variable will be considered like ‘gene’ and the final set of the variables (that will be selected after applying genetic algorithm) will be ‘chromosome’ that will be the best combination of genes. The length of the chromosome will be decided by the total number of features.
At first, we need to create a different combination of these variables (the combination may contain a subset of variables or whole subset) that are initialized randomly.
Once different combinations of variables are generated then we need to train the data using selected variables, one combination at a time.
For example, if 5 variables are selected for one individual then we need to train the data using these variables and evaluate the error. This error is known as a selection error. As selection error will be high, less fit the combination will be. According to the selection error, the rank will be given and a fitness score will be assigned.
Again, it’s based on Darwin’s theory “Survival of the fittest”. Two feature set/combination which has the lowest selection error or highest fitness score will be selected to generate the next offspring, which means those features will be selected for the model training. These individuals pass their genes to next-generation (a final subset of features)
Crossover is the point where the selected feature sets recombine to produce offspring or a new feature set. Here each ‘gene’ or variable of one selected set combines with another set. So final offspring will have features of both ancestors.
New feature subset that is generated through crossover, features will be similar to their parents. So there will be less diversity.
Therefore, some features are changed randomly to create more diversity. This process is known as mutation.
Termination of process
This process converges when the stopping criterion is satisfied. The stopping criterion maybe when the mean of the error gets converged.
As I discussed before, this dataset has 80 features, it is important to realize that it’s very difficult to select features manually or by other feature selection techniques. Therefore, after removing missing values from the dataset, we will try to select features using the genetic algorithm.
- estimator: fitting method/algorithm (e.g. random forest, linear regressor)
- cv: Cross validation
- verbose: Data log
- Scoring: Evaluation method ( R square in this case)
- max_features: Maximum number of features selected
- n_population: Intital population
- crossover_proba: Probability of crossover for the genetic algorithm
- mutation_proba: Probability of mutation for the genetic algorithm
- crossover_indpendent_proba: Independent probability for each attribute to be exchanged
- mutation_independent_proba: Independent probability for each attribute to be mutated
- tournament_size: tournament size for genetic algorithm.
- n_gen_no_change: If set to a number, it will terminate optimization when best individual is not changing in all of the previous `n_gen_no_change` number of generations.
- caching: If True, scores of the genetic algorithm are cached.
Here features that have results as ‘True’ are selected features using the genetic algorithm.
- More efficient than other feature algorithms
- Can handle large datasets
- Can run in parallel for multiple features
- No need to have domain knowledge in depth
- Since each individual uses the predictive model for evaluation, as a result of this it can be computationally expensive.
To sum up, the feature selection process using the genetic algorithm contains 5 steps as initialization, fitness calculation, feature selection, crossover, mutation, and then conditional termination.
Here is the GitHub link for detailed code: https://github.com/letthedataconfess/Feature-Selection-using-genetic-algorithm