This post focuses on questions commonly asked in data science interviews.
The primary purpose of this post is to help you understand and learn the concepts through questions.
So let’s prepare together!
1. Why do we need Normalization? Is it always necessary to normalize the data?
Answer: Normalization brings all features onto the same range. During training, a model tries to reach an optimal state using an optimization algorithm such as “Gradient Descent”. If the features are on very different ranges, the model takes more time to converge (to reach the optimal state).
For example, suppose I have to predict house prices from features like “number of bedrooms” and “area in sq ft”. The number of bedrooms is a single-digit value while the area runs into the hundreds or thousands, so the two values differ a lot in scale. The price may depend on both together (for instance, on having a sensible number of bedrooms for the area), so the model should learn from both features rather than be dominated by the one with the larger range. That is why it is better to normalize the data before training.
If the features are already on comparable ranges, normalization is not needed. Whether it is necessary also depends on the problem domain.
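As a rough illustration of the house example, here is a minimal sketch of min-max normalization in plain Python. The function name `min_max_normalize` and the sample values are made up for this post; they are assumptions, not taken from a real dataset.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature has no spread to rescale; map it to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical house data: the two features live on very different scales.
bedrooms = [1, 2, 3, 4, 5]
area_sq_ft = [500, 900, 1400, 2000, 2500]

bedrooms_scaled = min_max_normalize(bedrooms)
area_scaled = min_max_normalize(area_sq_ft)

print(bedrooms_scaled)
print(area_scaled)
```

After scaling, both features run from 0 to 1, so neither one dominates simply because of its units.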
2. How to visualize the outliers?
Answer: A common way to visualize outliers is to draw a “Box Plot”. Outliers can also be identified with statistical methods such as the Z-score.
You can check “how to draw box plot” here: https://www.letthedataconfess.com/data-analysis-part-2/
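To see what a box plot actually marks as an outlier, the sketch below computes the numbers behind the picture: the quartiles, the whisker fences, and the points outside them. It assumes the conventional 1.5 × IQR rule for the fences; the function name `box_plot_summary` and the data are hypothetical.

```python
import statistics


def box_plot_summary(values):
    """Return the quartiles, fences, and outliers that a box plot would show."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    # Assumption: the usual 1.5 * IQR convention for the whisker limits.
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "lower_fence": lower_fence,
        "upper_fence": upper_fence,
        "outliers": outliers,
    }


# Made-up sample with one obviously extreme value.
data = [10, 12, 12, 13, 14, 15, 16, 18, 95]
summary = box_plot_summary(data)
print(summary)
```

Any value listed under `outliers` is what a box plot would draw as an isolated point beyond the whiskers.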
3. What is normal distribution?
One of the most confusing questions! Normal distribution and Gaussian distribution are different names for the same distribution. The normal distribution is a symmetric, bell-shaped probability density curve in which observations cluster around the central peak (the mean).
What is so special about the normal distribution?
Many machine learning methods assume that the data distribution is Gaussian, because many real-life measurements are approximately normally distributed.
For more details about Normal distribution, check the following link: https://statisticsbyjim.com/basics/normal-distribution/
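The "clustered around the central peak" idea can be checked numerically. This small sketch uses Python's built-in `statistics.NormalDist` to compute how much of a normal distribution lies within 1, 2, and 3 standard deviations of the mean. The helper name `share_within` is just an illustrative choice.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {share_within(k):.4f}")
```

The results are roughly 68%, 95%, and 99.7%, which is why this is often called the 68-95-99.7 rule.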
4. How to solve the problems faced if the data is showing skewness?
Answer: First, let us understand what skewness is.
Skewness is the asymmetry of a data distribution. It tells us the extent to which the distribution departs from a symmetric one such as the normal distribution.
If the data shows skewness (positive or negative), the mean and median generally differ. As many algorithms assume normally distributed data, skewness may lead to inaccurate results.
To overcome the data skewness problem, we need to transform the data using methods like “log transformation”, “power transformation”, etc.
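Here is a minimal sketch of how a log transformation reduces positive skew. The `skewness` helper computes the population skewness (third central moment divided by the cubed standard deviation), and the `incomes` list is invented for illustration. Note that the log transformation only works for strictly positive values.

```python
import math


def skewness(values):
    """Population skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Made-up, right-skewed data: a few very large values stretch the right tail.
incomes = [20, 22, 25, 27, 30, 35, 40, 60, 120, 400]

# Log transformation compresses large values more than small ones.
log_incomes = [math.log(v) for v in incomes]

print("skewness before:", skewness(incomes))
print("skewness after: ", skewness(log_incomes))
```

The skewness stays positive on this sample but is noticeably smaller after the transformation, which is the effect we are after.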
5. Why is data cleansing necessary before building a model?
Answer: Data cleansing involves updating, correcting, and consolidating the data. It improves the quality of the data so that the model can learn more efficiently.
6. How to decide which type of analysis (univariate, bivariate, multivariate) should be done for a given problem?
Answer: Univariate analysis is done when a single feature needs to be explored on its own. It helps with checking the frequency distribution, detecting anomalies, and computing summary statistics.
Bivariate analysis helps in finding the relationship between two variables, and multivariate analysis extends this to more than two variables.
So if you want to remove outliers, you need to check each feature separately (univariate). If the use case requires checking how much the dependent variable varies with an independent variable, you need bivariate analysis.
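The difference can be sketched in a few lines of Python: a univariate summary looks at one feature alone, while a bivariate check measures how two variables move together. The Pearson correlation used here is one possible bivariate measure, and the `area_sq_ft` and `price` values are hypothetical.

```python
import statistics


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long numeric lists."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data: one independent variable and one dependent variable.
area_sq_ft = [500, 900, 1400, 2000, 2500]
price = [50, 85, 130, 210, 240]

# Univariate: describe a single feature on its own.
area_summary = {
    "mean": statistics.fmean(area_sq_ft),
    "median": statistics.median(area_sq_ft),
    "stdev": statistics.stdev(area_sq_ft),
}

# Bivariate: how strongly does price vary along with area?
r = pearson_correlation(area_sq_ft, price)

print(area_summary)
print("correlation between area and price:", r)
```

A correlation close to 1 suggests the price rises almost linearly with the area in this toy sample.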
7. How are outliers removed?
Answer: Outliers can be detected using several methods like clustering, Z-score, and interquartile range (IQR). These methods only flag the outliers; to remove them, we set a threshold on the method's output and drop the points that fall beyond it.
Check the following link for more details: https://haridas.in/outlier-removal-clustering.html
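As a minimal sketch of threshold-based removal, the function below drops every point whose Z-score exceeds a chosen threshold. The name `remove_outliers_zscore`, the data, and the threshold of 2.0 are assumptions for illustration. A threshold of 3 is a common choice on larger datasets, but on a sample this small no point can reach a Z-score of 3, so a lower value is used.

```python
import statistics


def remove_outliers_zscore(values, threshold=3.0):
    """Keep only the values whose absolute Z-score is within the threshold."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # All values are identical, so nothing can be an outlier.
        return list(values)
    return [v for v in values if abs((v - mean) / sd) <= threshold]


# Made-up sample with one extreme value.
data = [10, 12, 12, 13, 14, 15, 16, 18, 95]

# Assumption: threshold lowered to 2.0 because the sample is tiny.
cleaned = remove_outliers_zscore(data, threshold=2.0)
print(cleaned)
```

Changing the threshold changes how aggressive the removal is, so it is worth checking how many rows you lose before committing to a value.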
8. How to handle missing values?
Answer: Missing values can be handled in various ways, either by removing them or by replacing them with other values.
Suppose in a use case we have a total of 1000 rows of data and 800 values are missing for a column. Then it is good practice to remove that column.
Suppose a dataset has a “salary” column in which around 600 of the 1000 values are missing. Then the missing values can be replaced by the average salary.
In conclusion, it depends on the use case and data given.
Please check the following link to learn how to do it: https://www.letthedataconfess.com/data-analysispart-i/
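The two options above (drop or replace) can be sketched in plain Python, with `None` standing in for a missing value. The function names and the `drop_threshold` of 0.8 are assumptions that mirror the 800-out-of-1000 example; the right cut-off depends on your data.

```python
import statistics


def missing_fraction(column):
    """Share of entries in the column that are missing (None)."""
    return sum(v is None for v in column) / len(column)


def impute_with_mean(column):
    """Replace each missing entry with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in column]


def handle_missing(column, drop_threshold=0.8):
    """Return None to signal 'drop this column', otherwise the mean-imputed column."""
    # Assumption: 0.8 mirrors the 800-of-1000 example; tune it for your use case.
    if missing_fraction(column) >= drop_threshold:
        return None
    return impute_with_mean(column)


# Hypothetical salary column with two missing entries.
salary = [50000, None, 62000, None, 58000]
filled = handle_missing(salary)
print(filled)
```

If the salaries are heavily skewed, the median may be a safer replacement value than the mean.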
9. What is the difference between normalization and standardization?
Normalization: A method to rescale the data to a common range, typically 0 to 1.
Standardization: It transforms the data to have a mean of zero and a standard deviation of 1.
Please refer to the link for more details: https://www.statisticshowto.datasciencecentral.com/normalized/
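To make the standardization definition concrete, here is a minimal sketch that subtracts the mean and divides by the standard deviation. The function name `standardize` and the sample values are made up. Unlike normalization, the result is not confined to a fixed range: values below the mean come out negative.

```python
import statistics


def standardize(values):
    """Z-score standardization: subtract the mean, divide by the standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        # A constant feature has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical feature values.
area_sq_ft = [500, 900, 1400, 2000, 2500]
area_standardized = standardize(area_sq_ft)

print(area_standardized)
print("mean:", statistics.fmean(area_standardized))
print("standard deviation:", statistics.pstdev(area_standardized))
```

The printed mean is zero (up to floating-point rounding) and the standard deviation is 1, which is exactly what the definition promises.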
10. What is the difference between feature extraction, feature generation and feature selection?
Please refer to the following post for detailed answer: https://www.letthedataconfess.com/data-analysis-part-2/
11. What are the challenges for feature engineering?
Answer: The biggest challenge is handling high-dimensional data. Though a lot of methods and APIs are available for handling large amounts of data, issues remain: if the data is too large and complex, processing it takes a long time. On the other hand, if the dataset is small and has very few features, the model cannot learn efficiently from it.
12. What is dimensionality of data?
Answer: Dimensionality refers to the number of features in a dataset. As the number of features increases, dimensionality increases.
13. What is the difference between data wrangling, data crunching and data profiling?
Data Wrangling: Data wrangling is the process of transforming raw data into a usable form.
Data Crunching: Storing, arranging, and structuring data, for example in an Excel sheet.
Data Profiling: Summarizing data using statistical methods.
Here I have added all the questions based on my experience, interviews, and doubts I had while studying. If you have more questions in mind, please feel free to ask/share.
I will keep adding more as I come across them. Thank you!