Through data wrangling, we have seen basic pre-processing methods that put the data into a usable form. So what should our next step be?
Exploratory Data Analysis
In simple terms, it covers every process that gives insight into the data.
Now the question is, do I really need that?
Exploratory data analysis is an essential step because it guides the choice of model to train on the dataset. It helps identify which features can be used for prediction, which features are more significant, whether there is any pattern in the data, and whether any anomalies are present.
“A problem well stated is a problem half solved.” - Charles Kettering
Exploratory data analysis consists of three parts:
- Data Visualization: It helps to visualize the relationships between independent and dependent variables, to identify patterns, and to detect anomalies, if any.
- Descriptive Statistics: Mean, median, mode, variance, and distribution (e.g. Gaussian) come under this category.
- Statistical tests: These include tests that check whether the data is valid for a particular model.
Here I have used the Iris dataset for data exploration. You can download the dataset from the link below:
Descriptive statistics summarize data quantitatively using measures like mean, median, mode, variance, spread, distribution, etc.
Measures used in descriptive statistics:
- Mean: The average of the data values.
- Median: The middle value of the sorted data points.
- Mode: The most frequently occurring value.
- Variance: It measures how far the values of a feature spread around their mean.
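The four measures above can be sketched with Python's standard `statistics` module. The `petal_length` values here are made up for illustration and are not taken from the Iris dataset.

```python
import statistics

# Hypothetical petal-length values in cm (illustrative, not the real Iris data).
petal_length = [1.4, 1.5, 1.5, 4.5, 4.7, 5.1, 5.9]

mean = statistics.mean(petal_length)          # sum of the values / number of values
median = statistics.median(petal_length)      # middle value of the sorted data
mode = statistics.mode(petal_length)          # most frequently occurring value
variance = statistics.variance(petal_length)  # sample variance (divides by n - 1)

print(f"mean={mean:.2f} median={median} mode={mode} variance={variance:.2f}")
```

Note that the mean sits well below the median in this sample, which already hints that the values are not symmetrically distributed.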
Correlation measures the strength of the connection between two or more features.
Note: Correlation only tells us that two variables are connected; it does not imply a cause-and-effect relationship.
So how do we check correlation using Python? Let’s check:
Here we can see that some correlation coefficients are negative, which means that when one variable increases, the other tends to decrease, and vice versa.
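As a minimal sketch of such a check, the snippet below computes Pearson correlation coefficients by hand using only the standard library. The `pearson` helper and the `features` values are assumptions made for illustration, not the code or data behind the result discussed above.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical measurements in cm (illustrative, not the real Iris data).
features = {
    "sepal_width": [3.5, 3.0, 3.2, 2.8, 2.7, 3.0],
    "petal_length": [1.4, 1.4, 4.7, 4.5, 5.1, 5.9],
    "petal_width": [0.2, 0.2, 1.4, 1.5, 1.9, 2.1],
}

# Correlation matrix as a nested dict: corr[a][b] is the coefficient of a vs b.
corr = {a: {b: pearson(features[a], features[b]) for b in features} for a in features}

for name, row in corr.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

If the data is already in a pandas DataFrame, calling `corr()` on its numeric columns produces the same kind of matrix in one step.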
If two features are highly correlated with each other, we can remove one of them. You might be wondering why: if more features help my model learn better, why am I reducing them?
Here is the answer:
First, if the correlation between two variables is high, they carry almost the same information. It’s like using duplicates, so there is no point in making the model more complex.
Second, if both features produce similar results, it becomes very difficult to work out the true relationship between the independent variables and the dependent variable, that is, which feature contributed to a particular result.
Data visualization is the graphical representation of data so that meaningful insights can be drawn from it. Python has many libraries for data visualization, for example Matplotlib, Plotly, Seaborn, Bokeh, Pygal, Altair, etc.
Here I will cover the Seaborn library, as it is one of the most frequently used.
So let’s explore all plots one by one:
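Before we start, note that data visualization can be split into three kinds of analysis: univariate, bivariate, and multivariate. We begin with plots for a single variable: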
- Histogram: Good for interval data.
- Box Plot: Good for statistical analysis of data
- Bar Plot: Good for small categorical data.
- Count plot: Good for categorical data
A histogram is one of the best ways to visualize the frequency distribution of a feature. It helps in checking how the values of a feature are distributed.
Point to be noted:
Histograms are based on area, not on bar height. The area of a bar indicates the frequency of occurrences in its bin. This means the height of a bar does not necessarily represent the number of occurrences; it is the height multiplied by the width of the bin that gives the frequency of occurrences within that bin.
Usually, though, we take bins of equal width, and in that case the height of each bar does reflect the frequency.
Here we can see that it’s not an exact normal distribution. Most Petal Length values fall in the ranges 0-2 and 4-7. The line shows the Gaussian kernel density estimate, a smoothed version of the distribution.
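To make the area-versus-height point concrete, here is a small sketch that builds a density histogram with bins of unequal width. The function name `density_histogram`, the bin edges, and the values are all assumptions chosen for illustration.

```python
def density_histogram(values, edges):
    """Return (count, width, height) per bin, with height = count / (n * width).

    Bins are half-open [left, right), except the last one, which also
    includes its right edge.
    """
    n = len(values)
    last = len(edges) - 2
    bins = []
    for i in range(len(edges) - 1):
        left, right = edges[i], edges[i + 1]
        count = sum(
            1 for v in values
            if left <= v < right or (i == last and v == right)
        )
        width = right - left
        bins.append((count, width, count / (n * width)))
    return bins

# Hypothetical petal-length values in cm (illustrative, not the real Iris data).
values = [1.3, 1.4, 1.5, 1.7, 4.0, 4.5, 4.7, 5.1, 5.6, 6.0]
edges = [1.0, 2.0, 5.0, 7.0]  # deliberately unequal bin widths: 1, 3 and 2

bins = density_histogram(values, edges)
for count, width, height in bins:
    share = count / len(values)
    print(f"width={width} height={height:.3f} area={height * width:.2f} share={share:.2f}")
```

The second and third bins hold the same number of values, yet their bars have different heights because their widths differ. Only the areas match the share of values in each bin.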
The box plot is mainly used:
- to detect outliers
- to check the variability of the data
A box plot displays the distribution of data in a standardized form using this summary:
- median (Q2/50th Percentile): the middle value of the dataset.
- first quartile (Q1/25th Percentile): the middle value between the smallest value (not the “minimum”) and the median of the dataset.
- third quartile (Q3/75th Percentile): the middle value between the median and the highest value (not the “maximum”) of the dataset.
- interquartile range (IQR): the distance between the 25th and the 75th percentile (Q3 - Q1).
- “maximum”: Q3 + 1.5*IQR
- “minimum”: Q1 - 1.5*IQR
- whiskers (shown in blue)
- outliers (shown as green circles)
Here we can see that the median is not in the middle of the box for petal length, so most Petal Length values lie in the 4-5 range. No outliers are present.
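The summary values listed above can be computed directly, which also shows how outliers are flagged. This is a sketch using the standard library. The helper name `box_plot_summary` and the sample values (including the deliberately extreme 15.0) are assumptions, and other tools may compute quartiles with a slightly different interpolation rule.

```python
from statistics import quantiles

def box_plot_summary(values):
    """Quartiles, IQR, the 1.5*IQR limits, and the values that fall outside them."""
    # method="inclusive" interpolates linearly between the sorted data points.
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_limit = q1 - 1.5 * iqr  # the "minimum" from the list above
    upper_limit = q3 + 1.5 * iqr  # the "maximum" from the list above
    outliers = [v for v in values if v < lower_limit or v > upper_limit]
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "lower_limit": lower_limit,
        "upper_limit": upper_limit,
        "outliers": outliers,
    }

# Hypothetical values in cm; 15.0 is planted as an obvious outlier.
values = [1.4, 1.5, 1.6, 4.0, 4.4, 4.7, 5.1, 5.6, 15.0]

summary = box_plot_summary(values)
for key, value in summary.items():
    print(key, value)
```

When plotted, the whiskers usually stop at the most extreme data points that still lie inside these limits rather than at the limits themselves.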
A bar plot is arguably the simplest data visualization technique. It maps categories to numbers.
Here we can observe that Iris-setosa has a small petal length, while Iris-versicolor has a medium length and Iris-virginica has the highest petal length.
What is the difference between bar plot and histogram?
The difference is that a histogram only plots the frequency of values in a continuous data set that has been divided into classes, called bins. Bar charts, on the other hand, can be used for many other types of variables, including categorical ones.
A count plot can be thought of as a histogram across a categorical, instead of a quantitative, variable.
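A bar plot and a count plot can be drawn side by side with Seaborn, as sketched below. The column names `species` and `petal_length` and the handful of rows are assumptions standing in for the real dataset.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display; not needed in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical rows shaped like the Iris data (illustrative values only).
df = pd.DataFrame({
    "species": ["Iris-setosa"] * 3 + ["Iris-versicolor"] * 3 + ["Iris-virginica"] * 3,
    "petal_length": [1.4, 1.5, 1.3, 4.5, 4.7, 4.2, 5.9, 5.6, 6.1],
})

fig, (ax_bar, ax_count) = plt.subplots(1, 2, figsize=(10, 4))

# Bar plot: one bar per category; by default the height is the mean of y in that category.
sns.barplot(data=df, x="species", y="petal_length", ax=ax_bar)

# Count plot: one bar per category; the height is the number of rows in that category.
sns.countplot(data=df, x="species", ax=ax_count)

fig.tight_layout()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to view or store the figure.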
Bivariate plots:
- Line plot: Good for nominal and ordinal categorical data.
- Scatter plot: Good for interval and some nominal categorical data.
A line plot is used for observing relationships between variables, including categorical ones.
Here we can clearly see the relationship between petal length and species.
A scatter plot helps to plot the data points and check the effect of one variable on another.
We can observe here that Petal Length and Petal Width have an almost linear relationship.
- Multivariate Scatter Plot/Pair plot
- Grouped Box plot
Pair Plot:
A pair plot is a grid of scatter plots in which each variable is plotted against every other variable, so all pairwise relationships can be seen at once.
Grouped Box Plot:
A grouped box plot is a type of box plot that shows a feature’s variability within each class of another feature.
Here is the link for the GitHub repository: https://github.com/letthedataconfess/Data-anaylsis-part-2