This post focuses on questions commonly asked in data science interviews.
The primary purpose of this post is to help you understand and learn the underlying concepts through questions.
So let’s prepare together!
1. Why is normalization needed? Is it always necessary to normalize the data?
Answer: Normalization brings all features into the same range. During training, a model tries to reach an optimized state using an algorithm such as gradient descent. If the features are on very different scales, the model takes longer to converge (to reach that optimal state).
For example, suppose I have to predict house prices using features like “number of bedrooms” and “area in sq ft”. The two values differ by orders of magnitude. The price might be high for houses that have an optimal number of bedrooms relative to the area. To learn from these features efficiently, it is better to normalize the data first.
If the features are already on comparable scales, normalization is not needed. It also depends on the problem domain.
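To make the house-price example concrete, here is a minimal min-max normalization sketch in NumPy (all values are made up for illustration):

```python
import numpy as np

# Hypothetical house data: columns are [number of bedrooms, area in sq ft]
X = np.array([[2.0,  850.0],
              [3.0, 1200.0],
              [4.0, 2000.0],
              [5.0, 3500.0]])

# Min-max normalization: rescale each feature column to the [0, 1] range
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

print(X_norm.min(axis=0))  # each column now starts at 0
print(X_norm.max(axis=0))  # and ends at 1
```

After this step, a gradient-based optimizer no longer sees one feature dominating the loss surface simply because its raw values are thousands of times larger.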
2. How can outliers be visualized?
Answer: The best way to visualize outliers is to draw a box plot. There are also statistical methods, such as the Z-score.
You can check “how to draw box plot” here: https://www.letthedataconfess.com/data-analysis-part-2/
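As a quick sketch, here is how a box plot can be drawn with matplotlib and the flagged outliers read back from it (the salary figures are made up):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

# Toy salary data (in thousands) with one obvious outlier
salaries = np.array([40, 42, 45, 47, 50, 52, 55, 58, 60, 250])

fig, ax = plt.subplots()
result = ax.boxplot(salaries)  # whiskers default to 1.5 * IQR
outliers = result["fliers"][0].get_ydata()  # points drawn beyond the whiskers
print(outliers)
plt.savefig("salary_boxplot.png")
```

The points plotted beyond the whiskers are exactly the observations a box plot treats as outliers.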
3. What is a normal distribution?
This is one of those questions where people often get confused. “Normal distribution” and “Gaussian distribution” are two names for the same distribution. It is a probability density curve in which observations cluster around a central peak.
What is so special about the normal distribution?
Many machine learning methods assume that the data distribution is Gaussian, because in real life many kinds of data are approximately normally distributed.
For more details about Normal distribution, check the following link: https://statisticsbyjim.com/basics/normal-distribution/
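As a quick numerical check of how observations cluster around the peak, here is a NumPy sketch of the empirical “68–95” rule (the mean and standard deviation are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(42)
samples = rng.normal(loc=100, scale=15, size=100_000)  # mean 100, std dev 15

# Share of observations within 1 and 2 standard deviations of the mean
within_1sd = np.mean(np.abs(samples - 100) < 15)
within_2sd = np.mean(np.abs(samples - 100) < 30)
print(round(within_1sd, 2))  # close to 0.68
print(round(within_2sd, 2))  # close to 0.95
```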
4. If data is skewed, what problems might occur, and how can they be solved?
Answer: First, let us understand what skewness is.
Skewness is asymmetry in a distribution. It measures the extent to which a data distribution differs from a normal one.
If data is skewed (positively or negatively), the mean and the median are not equal. Since most algorithms assume a normal data distribution, skewed data may lead to inaccurate results.
To overcome the skewness problem, we can transform the data using methods like the log transformation, power transformations, etc.
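A short sketch of the log transformation in action, using synthetic right-skewed (log-normal) data and the sample skewness as the third standardized moment:

```python
import numpy as np

rng = np.random.default_rng(0)
incomes = rng.lognormal(mean=10, sigma=1.0, size=10_000)  # heavily right-skewed

def skewness(x):
    # Sample skewness: mean of the cubed z-scores (third standardized moment)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

print(round(skewness(incomes), 2))           # large positive value
print(round(skewness(np.log(incomes)), 2))   # near 0 after the log transform
```

The transformed values are far closer to symmetric, which is what normality-assuming models prefer.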
5. Why is data cleansing necessary before building a model?
Answer: Data cleansing involves updating, correcting, and consolidating the data. It improves the quality of the data so that the model can learn more efficiently. If data cleaning is not done properly, it can lead to inaccurate results or poor performance.
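As a minimal sketch of what “updating, correcting, and consolidating” can look like in pandas (the records and column names are hypothetical):

```python
import pandas as pd

# Hypothetical messy records: duplicates, inconsistent casing, an invalid value
raw = pd.DataFrame({
    "city":  ["Delhi", "delhi ", "Mumbai", "Mumbai"],
    "price": ["100", "100", "250", "bad"],
})

clean = raw.copy()
clean["city"] = clean["city"].str.strip().str.title()            # fix whitespace/casing
clean["price"] = pd.to_numeric(clean["price"], errors="coerce")  # invalid text -> NaN
clean = clean.drop_duplicates().dropna()                         # drop dupes and bad rows

print(clean)
```

Each step here targets one kind of defect; real pipelines chain many such steps, driven by what the data actually contains.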
6. How to decide which type of analysis (univariate, bivariate, multivariate) should be done for a given problem?
Answer: Univariate analysis is done when a single feature needs to be explored. It is used to check the frequency distribution, detect anomalies, and run statistical summaries.
Bivariate analysis examines the relationship between two variables; multivariate analysis extends this to more than two.
So if you want to remove outliers, you need to check each feature separately (univariate). If the use case requires checking how much the dependent variable varies with an independent variable, you need bivariate analysis.
7. How can outliers be removed?
Answer: Outliers can be detected using several methods, such as clustering, the Z-score, and the interquartile range (IQR). These methods detect outliers; we can then set a threshold based on them to remove the outliers.
Check the following link for more details: https://haridas.in/outlier-removal-clustering.html
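As a sketch of the IQR method mentioned above, using the common 1.5 × IQR fences (the data is made up):

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 95, 11, 10, 14, -40])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # standard 1.5 * IQR fences

filtered = values[(values >= lower) & (values <= upper)]
print(filtered)  # the extreme values 95 and -40 are removed
```

The 1.5 multiplier is the conventional threshold; widening it to 3.0 keeps more borderline points, which is one way to tune how aggressive the removal is.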
8. How to handle missing values?
Answer: Missing values can be handled in various ways: either by removing them or by replacing them with other values. When to remove them and when to replace them is decided by the use case.
Suppose a dataset has 1,000 rows in total and 800 values are missing in a column; then it is good practice to drop that column.
Suppose instead the dataset has a “salary” column with around 600 of 1,000 values missing. Then the missing values can be replaced by the average salary.
In conclusion, it depends on the use case and the data given.
Please check the following link to learn how to do it: https://www.letthedataconfess.com/data-analysispart-i/
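Both strategies above can be sketched in pandas (the columns and values are hypothetical):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "salary":         [50_000, np.nan, 65_000, np.nan, 80_000],
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0, np.nan],
})

# Column with too many missing values: drop it entirely
df = df.drop(columns=["mostly_missing"])

# Numeric column with some gaps: impute with the mean
# (the median is a more robust choice when outliers are present)
df["salary"] = df["salary"].fillna(df["salary"].mean())
print(df["salary"].tolist())
```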
9. What is the difference between normalization and standardization?
Normalization: a method to bring the data onto the same scale, commonly into the range [0, 1].
Standardization: it transforms the data to have a mean of zero and a standard deviation of 1.
However, the two terms are often used interchangeably.
Please refer to the link for more details: https://www.statisticshowto.datasciencecentral.com/normalized/
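The two transformations side by side, as a small NumPy sketch:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Normalization (min-max): rescales values into [0, 1]
x_norm = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): shifts to mean 0, scales to std dev 1
x_std = (x - x.mean()) / x.std()

print(x_norm)                       # 0.0 ... 1.0
print(x_std.mean(), x_std.std())    # ~0.0 and 1.0
```

Note that standardized values are not bounded to a fixed range, which is the practical difference interviewers usually probe for.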
10. What is the difference between feature extraction, feature generation and feature selection?
Please refer to the following post for a detailed answer: https://www.letthedataconfess.com/data-analysis-part-2/
11. What are the challenges of feature engineering?
Answer: The biggest challenge is handling high-dimensional data. Although many methods and APIs are available for handling large amounts of data, issues remain: if the data is too complex and large, processing takes much longer. Conversely, if the dataset is small with very few features, the model may fail to learn efficiently.
12. What is meant by the dimensionality of data?
Answer: Dimensionality refers to the number of features in a dataset; as the number of features increases, the dimensionality increases.
13. What is the difference between data wrangling, data crunching and data profiling?
Data Wrangling: Data wrangling is the process of transforming raw data into a usable form.
Data Crunching: Another term for data analysis. In data crunching, data is typically sorted, arranged, and structured, often in a spreadsheet.
Data Profiling: Summarizing data using statistical methods is known as data profiling.
Here I have added questions based on my experience, interviews, and doubts I had while studying. If you have more questions in mind, please feel free to ask or share them.
I will keep adding more as I come across them. Thank you!