So, here we are in Part 3 of Data Analysis: Feature Engineering.
First, we need to understand: what does feature engineering mean?
Feature engineering is the part of data analysis where, using domain knowledge of the data, features are transformed, generated, or extracted to improve model performance.
Let’s dive in deeper!
Feature engineering can be divided into three parts:
- Feature Transformation
- Feature Generation/Extraction
- Feature Selection
If new features are constructed from the existing data using descriptive statistics, it is called feature transformation.
For example, creating features using the mean, variance, or quantile range (this was already explained in Data Analysis Part 2).
If new features are created from the given data by applying a mathematical rule or some other method, it is known as feature extraction.
For example, calculating the Euclidean distance between two data points and adding it as a new feature, or PCA (Principal Component Analysis).
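As a small sketch of the Euclidean-distance idea, here is one way it might look with pandas and NumPy. The column names and reference point are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 2-D coordinates for each sample
df = pd.DataFrame({"x": [0.0, 3.0, 6.0], "y": [0.0, 4.0, 8.0]})
ref = np.array([0.0, 0.0])  # a fixed reference point (assumption)

# New feature: Euclidean distance of every point from the reference
df["dist"] = np.sqrt((df["x"] - ref[0]) ** 2 + (df["y"] - ref[1]) ** 2)
print(df["dist"].tolist())  # [0.0, 5.0, 10.0]
```

The same pattern works for the distance between any two feature columns; the new `dist` column is then just another feature for the model.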
Selecting, from the given features, the subset that is relevant for prediction or classification, on the basis of domain knowledge or some algorithm, is known as feature selection.
You might be thinking, why do we need feature selection at all?
This thought may come to your mind: if I use all the features at once, the model will learn better because it has more information to capture, and if I remove some features, it might lose some information.
Well, you are thinking wrong!!
Firstly, not all features play a crucial role in learning, and we are going to remove only irrelevant features, not important ones. Keeping irrelevant features just adds model complexity, leads to longer training time, and might even cause overfitting.
In short, it’s necessary to select features wisely.
Feature selection methods can be categorized into three parts:
- Filter method
- Wrapper Method
- Embedded Method
Let’s understand one by one.
Filter Method
Filter methods directly filter out some features without applying any learning algorithm. Usually, this is considered part of the pre-processing step of data analysis.
All of the following are considered filter methods of feature selection:
- Removing constant/quasi-constant features or duplicate features if any
- Features removal based on correlation
- Feature removal on the basis of Fisher score, ANOVA test, VIF, or the ROC curve
- Feature removal using the Chi-squared test or information gain
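To make the correlation-based removal from the list above concrete, here is a minimal sketch using a toy DataFrame. The 0.9 cutoff is an assumption; in practice you would pick it from domain knowledge:

```python
import numpy as np
import pandas as pd

# Toy data: "b" is almost a copy of "a", while "c" is unrelated
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.1, 2.0, 3.1, 4.0, 5.1],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = df.corr().abs()
# Keep only the upper triangle so each pair is checked exactly once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(to_drop)  # ['b'] — highly correlated with 'a'
```

Dropping `to_drop` with `df.drop(columns=to_drop)` leaves one representative of each correlated pair.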
Quasi-constant features
Features that exhibit the same value for the majority of samples are called quasi-constant features. To identify such features we can use the VarianceThreshold class from the sklearn library.
By default, it removes all features that have the same value in every sample (constant features).
Check the library function for more details:
```python
from sklearn.feature_selection import VarianceThreshold

# threshold=0.1 flags features whose variance is below 0.1 — for a
# binary feature, roughly one showing the same value in about 90%
# of the samples
quasidf = VarianceThreshold(threshold=0.1)
quasidf.fit(train)

quasidf.get_support()
# True:  not a quasi-constant feature (kept)
# False: quasi-constant feature (removed)
```
According to this output, we can remove the feature columns for which get_support returns False.
Value of threshold is decided by domain knowledge and problem statement.
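Here is a self-contained sketch of the whole step, end to end. The `train` DataFrame and its columns are invented for illustration:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Toy frame: "flag" takes the same value in 9 of 10 rows (quasi-constant)
train = pd.DataFrame({
    "flag": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    "age":  [22, 35, 41, 19, 27, 52, 33, 45, 38, 29],
})

selector = VarianceThreshold(threshold=0.1)
selector.fit(train)

# Keep only the columns where get_support() is True
kept = train.columns[selector.get_support()]
print(list(kept))  # ['age'] — 'flag' falls below the variance threshold
```

`train[kept]` then gives the reduced DataFrame with the quasi-constant column removed.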
Feature removal based on correlation was already explained in part 2. Please check out the following link:
I will explain feature removal on the basis of Fisher score, ANOVA test, VIF, the ROC curve, the Chi-squared test, and information gain in upcoming articles, where we will need these tests the most.
Pros of filter methods:
- Computation is fast
- Quickly removes irrelevant features
Cons of filter methods:
- May keep irrelevant features, as it does not consider interaction with the classifier
- Not as effective as wrapper or embedded methods
In the wrapper method, we first select a subset of features and train the model; the model is then trained on each candidate subset in turn, and the final subset is chosen as the one with the minimum error.
Though this method is computationally expensive, it provides an optimal feature set.
Types of wrapper method
- Forward Feature Selection
- Backward Feature Elimination
- Exhaustive Feature Selection
Forward Feature Selection
Using an iterative method, we keep adding the features that improve our model. This process continues until adding features yields no improvement, or only a very minimal one.
```python
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.ensemble import RandomForestRegressor

model = sfs(RandomForestRegressor(), k_features=5, forward=True,
            verbose=5, cv=5, n_jobs=-1, scoring='r2')
model.fit(x_train, y_train)
```
I have used the Random Forest regression algorithm as the estimator; any regression algorithm can be used instead.
- k_features=5 (it will keep the top 5 features best suited for prediction)
- forward=True (forward feature selection mode)
- cv=5 (k-fold cross-validation)
- n_jobs=-1 (number of CPU cores used for execution; -1 means all cores. If n_jobs is not given, it may show some warnings.)
- verbose=5 (shows the log of the process)
- scoring='r2' (R-squared is a statistical measure of how close the data are to the fitted regression line. Changing the scoring metric may change which features get selected.)
```python
# Get the column names of the selected features
model.k_feature_names_
```
From the above output, we can see that among all 10 features ('PassengerId', 'Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'C', 'Q', 'S') of the train data, only 5 features remain.
In backward elimination, we start by considering all features and then remove whichever feature is least significant for improving the model. This process continues until we reach an optimal set of features and no further improvement is possible.
The parameters are kept the same, except forward=False, as we are going backward now.
```python
import numpy as np
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.ensemble import RandomForestRegressor

backwardModel = sfs(RandomForestRegressor(), k_features=5, forward=False,
                    verbose=5, cv=5, n_jobs=-1, scoring='r2')
backwardModel.fit(np.array(x_train), y_train)
```
```python
# Get the column names of the selected features
x_train.columns[list(backwardModel.k_feature_idx_)]
```
We can observe that the five best features are selected from the 10 features using the backward elimination method.
Exhaustive Feature Selection
This method is based on permutations and combinations of features: we evaluate all possible combinations of features and keep the one that gives the best result.
- min_features=1 (minimum number of features in a subset)
- max_features=5 (maximum number of features in a subset)
```python
from mlxtend.feature_selection import ExhaustiveFeatureSelector as efs
from sklearn.ensemble import RandomForestRegressor

emodel = efs(RandomForestRegressor(), min_features=1, max_features=5,
             scoring='r2', n_jobs=-1)
emodel.fit(x_train, y_train)
```
You can read more about feature selection technique using mlxtend library through the following link:
Pros of wrapper method:
- The goal is to find the best possible feature subset
- Very effective for smaller datasets
- Includes interaction with the classifier
Cons of wrapper method:
- Computationally expensive
- Not useful for large datasets, as it consumes a lot of time
- Classifier dependent
- Higher risk of overfitting
In the embedded method, feature selection is part of model construction: the algorithm itself penalizes features that contribute little to the prediction.
Examples of Embedded Method are:
- Lasso Regression(L1 regularization)
- Ridge Regression(L2 regularization)
- Elastic Net Regression
- Decision Tree
- Weighted Naive Bayes
- Using the weight vector of an SVM
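As a quick taste of the embedded idea, here is a sketch using a tree-based model's built-in feature importances via sklearn's SelectFromModel (the data and column names are invented; by default SelectFromModel keeps features whose importance exceeds the mean importance):

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X = pd.DataFrame({
    "signal": rng.rand(200),
    "noise1": rng.rand(200),
    "noise2": rng.rand(200),
})
y = 3 * X["signal"]  # only "signal" drives the target

# The forest computes importances while it trains — selection happens
# as a by-product of model construction, which is the embedded idea
selector = SelectFromModel(RandomForestRegressor(n_estimators=50,
                                                 random_state=0))
selector.fit(X, y)
print(list(X.columns[selector.get_support()]))  # ['signal']
```

The noise columns get negligible importance, so only the informative feature survives.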
Regularization in regression is a whole concept in itself that needs a detailed explanation; it is not possible to cover everything in this article, so I will explain it in detail later and give only a short description here. The same goes for decision trees, weighted Naive Bayes, and weighted SVM.
Lasso Regression (L1 regularization):
Regularization is a method of adding a penalty to the model's coefficients to reduce overfitting. Through this penalty, the influence of a feature can be reduced or eliminated.
L1 regularization adds a penalty equal to the absolute value of the coefficient magnitudes. The L1 penalty can shrink some coefficients all the way to zero, which effectively removes the corresponding features.
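This zeroing effect is easy to see on synthetic data; in the sketch below (toy data, and the alpha value is just an illustrative choice), only the first column actually drives the target:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=100)  # only column 0 matters

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)
print(lasso.coef_)  # coefficients of the irrelevant columns go to 0.0
```

The non-zero entries of `lasso.coef_` mark the selected features, which is exactly why Lasso counts as an embedded selection method.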
Ridge Regression (L2 Regularization)
L2 regularization adds a penalty equal to the square of the coefficient magnitudes. Because it penalizes the squared values, it shrinks coefficients towards zero but cannot make them exactly zero.
Elastic Net Regression
Elastic Net regression is a combination of L1 and L2 regularization.
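In sklearn, the mix between the two penalties is controlled by l1_ratio. A minimal sketch on the same style of toy data (alpha and l1_ratio values are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = 5 * X[:, 0] + rng.normal(scale=0.1, size=100)

# l1_ratio mixes the penalties: 1.0 is pure L1 (Lasso), 0.0 is pure L2 (Ridge)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_)
```

With an L1 component present, Elastic Net can still drive irrelevant coefficients to zero, while the L2 component stabilizes the solution when features are correlated.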
Pros of Embedded Method:
- The algorithm uses its own variable selection process, so it includes interaction with the classification model, like the wrapper method
- Less computationally expensive compared to the wrapper method
Cons of Embedded Method:
- Dependent on classifier
Here is the GitHub link for the detailed code:
For more details, refer to the following paper:
Let me know if you have any doubts or want to discuss anything related to feature selection. You can comment below.