Through data wrangling, we have seen basic pre-processing methods that put the data into a usable form. So what should our next step be?
Exploratory Data Analysis
In simple terms, it is any process that gives insight into the data.
Now the question is, do I really need that?
Exploratory data analysis is essential. It helps in choosing a model to train on the dataset, in identifying which features can be used for prediction and which of them are more significant, and in spotting any patterns or anomalies in the data. If the data analysis is done right, half of the problem is already solved.
“A problem well stated is a problem half solved.” - Charles Kettering
Exploratory data analysis consists of three parts:
- Data Visualization: It helps to visualize the relationship between independent and dependent variables, to identify patterns and to detect anomalies, if any.
- Descriptive Statistics: Mean, median, mode, variance and distribution (e.g. Gaussian) come under this category.
- Statistical tests: These include tests that check whether the data is valid for a particular model.
Here I have used the Iris dataset for data exploration. You can download the dataset from the link below:
Descriptive statistics summarize data quantitatively using measures such as mean, median, mode, variance, spread and distribution.
Measures used in descriptive statistics:
- Mean: The average of the data values.
- Median: The middle value of the sorted data points.
- Mode: The most frequently occurring value.
- Variance: It tells how far the values of a feature are spread around their mean.
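As a minimal sketch, these measures can be computed with pandas. The values below are a small hand-typed sample of petal lengths used only for illustration (an assumption, not the full dataset), and `petal_length` is a name introduced here.

```python
import pandas as pd

# Hand-typed sample of petal lengths (cm), used only for illustration;
# it is an assumption, not the full Iris dataset.
petal_length = pd.Series(
    [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9], name="PetalLengthCm"
)

mean = petal_length.mean()           # arithmetic average
median = petal_length.median()       # middle value after sorting
mode = petal_length.mode().tolist()  # most frequent value(s); can be more than one
variance = petal_length.var()        # sample variance (divides by n - 1)

print(f"mean={mean:.2f}, median={median}, mode={mode}, variance={variance:.2f}")
```

On a full DataFrame, `df.describe()` gives several of these summaries for every numeric column in one call.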
Correlation measures how strongly two variables or features move together.
Note: Correlation only tells us that two variables are connected; it does not imply that one causes the other.
So how do we check correlation using Python? Let’s check:
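A minimal sketch of one way to do this is pandas' `DataFrame.corr()`. The column names follow the ones used elsewhere in this post; the nine rows are a small hand-typed sample for illustration (an assumption), so the numbers will differ from those of the full dataset.

```python
import pandas as pd

# Small illustrative sample shaped like the Iris CSV (assumption: in practice
# df would be loaded from the downloaded file, e.g. with pd.read_csv).
df = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "SepalWidthCm":  [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "PetalLengthCm": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "PetalWidthCm":  [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# Correlation is only defined for numeric columns, so select them explicitly.
numeric_cols = ["SepalLengthCm", "SepalWidthCm", "PetalLengthCm", "PetalWidthCm"]
corr = df[numeric_cols].corr()  # Pearson correlation by default

print(corr.round(2))
```

Each cell of the result is a coefficient between -1 and 1, and the diagonal is always 1 because every feature is perfectly correlated with itself.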
Here we can see that some correlation coefficients are negative. This means that when one variable increases, the other tends to decrease, and vice versa.
If two features are highly correlated with each other, we can remove one of them. You might be wondering why. If more features help my model learn better, why am I reducing them?
Here is the answer:
First, if the correlation between two variables is high, they carry almost the same information. It’s like using duplicates, so there is no point in making the model more complex.
Second, if both features produce similar results, it becomes very difficult to work out the true relationship between the independent variables and the dependent variable, because we cannot tell which feature contributed to a particular result.
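The idea of dropping one feature from each highly correlated pair can be sketched as follows. This is one possible approach, not the only one: the 0.9 cutoff is a hypothetical threshold chosen for illustration, and the rows are a small hand-typed sample (an assumption).

```python
import numpy as np
import pandas as pd

# Small illustrative sample of the numeric features (assumption, not the full data).
features = pd.DataFrame({
    "SepalLengthCm": [5.1, 4.9, 4.7, 7.0, 6.4, 6.9, 6.3, 5.8, 7.1],
    "SepalWidthCm":  [3.5, 3.0, 3.2, 3.2, 3.2, 3.1, 3.3, 2.7, 3.0],
    "PetalLengthCm": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "PetalWidthCm":  [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
})

threshold = 0.9  # hypothetical cutoff; tune it for your own problem

# Absolute correlations, since strong negative correlation is redundant too.
corr = features.corr().abs()

# Keep only the upper triangle so that each pair is inspected once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop a column if it is highly correlated with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
reduced = features.drop(columns=to_drop)

print("dropped:", to_drop)
print("kept:", list(reduced.columns))
```

Which member of a pair gets dropped depends only on column order here; in a real project, domain knowledge is usually a better guide.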
Data visualization is the graphical representation of data so that meaningful insights can be drawn from it. Python has many libraries for data visualization, for example Matplotlib, Plotly, Seaborn, Bokeh, Pygal and Altair.
Here I will cover the Seaborn library, as it is one of the most frequently used. It is built on top of the “matplotlib” library, so it includes the features of matplotlib and some additional ones too.
So let’s explore all plots one by one:
Before visualizing the data, we can group the plots in three ways, according to how many variables they show. For a single variable:
- Histogram: Good for interval data.
- Box Plot: Good for statistical analysis of data.
- Bar Plot: Good for small categorical data.
- Count plot: Good for categorical data.
A histogram is a good way to visualize the frequency distribution of a feature. It helps to detect outliers and to check the shape of the distribution, including its “skewness”.
Point to be noted:
Histograms are based on the area of the bars, not on their height. The area of a bar indicates the frequency of occurrences in that bin, so the height of a bin does not necessarily represent the number of occurrences. It is the height multiplied by the width of the bin that gives the frequency within that bin.
Usually, however, we take bins of equal width, and in that case the height of the bin does reflect the frequency.
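The area point can be checked numerically. The sketch below uses NumPy rather than Seaborn, a hand-typed sample of petal lengths, and deliberately unequal bin edges; all three are assumptions made only for illustration.

```python
import numpy as np

# Hand-typed sample of petal lengths (cm), for illustration only.
values = np.array([1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9])

# Deliberately unequal bin widths: 1, 3 and 2.
edges = np.array([1.0, 2.0, 5.0, 7.0])
widths = np.diff(edges)

counts, _ = np.histogram(values, bins=edges)                 # observations per bin
density, _ = np.histogram(values, bins=edges, density=True)  # bar heights

# Height x width gives the fraction of observations that fall in each bin.
area = density * widths

print("counts :", counts)
print("heights:", density.round(3))
print("areas  :", area.round(3))
```

Every bin holds the same number of points, yet the bar heights differ because the widths differ. Only the areas (height times width) stay proportional to the counts.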
Here we can see that the distribution is not exactly normal. Most Petal Length values fall in the ranges 0-2 and 4-7. The line shows the Gaussian density estimate. We can also observe that the graph is not skewed.
Box plot is mainly used:
- to detect outliers
- to check the variability of the data
A box plot displays the distribution of the data in a standardized form using this summary:
- median (Q2/50th Percentile): the middle value of the dataset.
- first quartile (Q1/25th Percentile): the middle value between the smallest number (not the “minimum”) and the median of the dataset.
- third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
- interquartile range (IQR): the range from the 25th to the 75th percentile, i.e. Q3 - Q1.
- “maximum”: Q3 + 1.5*IQR
- “minimum”: Q1 - 1.5*IQR
- whiskers (shown in blue)
- outliers (shown as green circles)
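The summary above can be computed by hand, as in this minimal sketch. It uses NumPy on a hand-typed sample of petal lengths (an assumption for illustration), and `lower_fence`/`upper_fence` are names introduced here for the “minimum” and “maximum”. Quartile values can vary slightly with the interpolation method a library uses.

```python
import numpy as np

# Hand-typed sample of petal lengths (cm), for illustration only.
values = np.array([1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9])

# Quartiles (NumPy's default linear interpolation).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# The box plot "minimum" and "maximum" are fences, not the extreme data values.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Points outside the fences are drawn individually as outliers.
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr:.2f}")
print(f"fences=({lower_fence:.2f}, {upper_fence:.2f}), outliers={outliers}")
```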
Here we can see that the median is not in the middle of the box for petal length, so most Petal Length values lie in the 4-5 range. We can also see that there are no outliers in this feature.
A bar plot is arguably the simplest data visualization technique. It maps categories to numbers.
# Reference: https://seaborn.pydata.org/generated/seaborn.barplot.html
ax = sns.barplot(x="PetalLengthCm", y="PetalWidthCm", hue="Species", data=df)
Here we can observe that Iris-setosa has the shortest petals, Iris-versicolor has petals of medium length and Iris-virginica has the longest petals.
What is the difference between bar plot and histogram?
A histogram is only used to plot the frequency of values in a continuous dataset that has been divided into classes, called bins. A bar chart, on the other hand, can be used for many other types of variables, including categorical ones.
A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable.
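A minimal sketch of a count plot, assuming a small hand-typed `Species` column for illustration. The `value_counts()` call prints the same numbers that the bars display.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small illustrative sample (assumption): three rows per species.
df = pd.DataFrame({
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# The numbers behind the plot: how many rows fall in each category.
counts = df["Species"].value_counts()
print(counts)

# One bar per category, with height equal to the number of rows.
ax = sns.countplot(x="Species", data=df)
ax.set_ylabel("Number of samples")
# Call plt.show() to display the figure when running this as a script.
```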
For two variables:
- Line plot: Good for nominal and ordinal categorical data.
- Scatter plot: Good for interval and some nominal categorical data.
A line plot is used to observe how a numeric variable changes across the values of another variable, which may be categorical.
# Shows the relationship when one variable is categorical
# Reference: https://seaborn.pydata.org/generated/seaborn.lineplot.html
sns.lineplot(x="Species", y="PetalLengthCm", data=df)
Here we can clearly see the relationship between petal length and species.
A scatter plot draws the individual data points so that we can check how one variable changes with another.
We can observe here that Petal Length and Petal Width have an almost linear relationship.
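A scatter plot like this can be sketched with `sns.scatterplot`. The rows below are a small hand-typed sample (an assumption), and colouring the points by `Species` with `hue` is an optional choice made here for readability.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small illustrative sample (assumption, not the full dataset).
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "PetalWidthCm":  [0.2, 0.2, 0.2, 1.4, 1.5, 1.5, 2.5, 1.9, 2.1],
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# One point per row; hue colours the points by species.
ax = sns.scatterplot(x="PetalLengthCm", y="PetalWidthCm", hue="Species", data=df)
ax.set_title("Petal length vs petal width")
# Call plt.show() to display the figure when running this as a script.
```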
For more than two variables:
- Multivariate Scatter Plot/Pair plot
- Grouped Box plot
A pair plot is a grid of scatter plots in which each variable is plotted against every other variable, so all pairwise relationships can be seen at once.
Grouped Box Plot:
A grouped box plot is a type of box plot that shows a feature’s variability within each class of another feature.
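A minimal sketch with `sns.boxplot`, drawing one box of petal length per species. The rows are a small hand-typed sample (an assumption), so the boxes will look different on the full dataset.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Small illustrative sample (assumption, not the full dataset).
df = pd.DataFrame({
    "PetalLengthCm": [1.4, 1.4, 1.3, 4.7, 4.5, 4.9, 6.0, 5.1, 5.9],
    "Species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
})

# x is the grouping feature, y is the feature whose variability is shown.
ax = sns.boxplot(x="Species", y="PetalLengthCm", data=df)
ax.set_title("Petal length by species")
# Call plt.show() to display the figure when running this as a script.
```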
Here is the link for GitHub repository: https://github.com/Appiiee/Data-anaylsis-part-2