Let’s take the first step on the journey to becoming a data scientist.

For any manipulation, we need to understand the data first. Broadly, data is of two types:

Unstructured: Data in which the points are not arranged in any particular order and from which information cannot be extracted directly, such as raw text data or image data (where we need to deal with pixels).

Structured: Structured data is mostly in spreadsheet-like form, containing rows and columns from which it can be processed easily, e.g. the Titanic dataset, which holds its information as rows and columns.

So here I am considering “data analysis” for structured data first; unstructured data requires some extra steps.

Data analysis is the process of preparing data for modeling, i.e. making it ready for a machine learning model. So every step, from importing the data to analyzing its features, comes under data analysis.

Data analysis comprises several stages:

  • Data Wrangling
  • Data Visualization
  • Feature Engineering

I will cover data wrangling in this first part of the tutorial.

So first, let’s check: what is data wrangling all about?

It is the process of converting data into a form that can be readily used in the steps that follow.

Steps followed in data wrangling:

  • Identifying missing values and handling them
  • Checking the data format and correcting it if needed
  • Data Normalization
  • Data Encoding

Here I have used the Titanic dataset for the analysis. (Some datasets need more processing; that depends entirely on the use case.) In the Titanic dataset, we need to figure out whether a passenger survived, given features such as Age, Sex, and Fare.

The Titanic problem can be found on Kaggle via the link below:
https://www.kaggle.com/c/titanic

First, import the dataset into a DataFrame.
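
A minimal sketch with pandas, assuming the training file train.csv from the Kaggle page above is in the working directory:

    import pandas as pd

    # Load the Titanic training data into a DataFrame
    df = pd.read_csv('train.csv')

    # Peek at the first few rows
    df.head()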

Let’s start with identifying missing values.

Missing values can take many forms, such as “null”, “NaN”, a blank (no value present), or “?”. So we need to check as follows:
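
    import numpy as np

    # Placeholders like "?" are not recognised by isnull(),
    # so convert them to real NaN values first
    df = df.replace('?', np.nan)

    # True marks a missing entry, False a present one
    df.isnull()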

This will give you the output as True/False for every cell.

If you want to know how many values are missing from a particular column, use the sum() function as follows:
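
    # Number of missing values in each column
    df.isnull().sum()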

The output will be like this:
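
    PassengerId      0
    Survived         0
    Pclass           0
    Name             0
    Sex              0
    Age            177
    SibSp            0
    Parch            0
    Ticket           0
    Fare             0
    Cabin          687
    Embarked         2
    dtype: int64

(counts shown for the standard Kaggle training file)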

Let’s check how we can deal with these missing values.

Leaving missing values unremoved and unreplaced causes absurd results. Now the question arises: should I remove them or replace them? And if I replace them, what value do I use for the replacement?

Let’s see..

Removal of a column

First, it depends on how much of the data is missing. If a large share of a column consists of missing values, remove the whole column. For example, the Titanic dataset’s “Cabin” column contains 687 missing values out of 891 rows, so it is better to remove that column.
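
In pandas this is a single drop() call; a sketch:

    # Too many missing values to impute sensibly: drop the whole column
    df = df.drop('Cabin', axis=1)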

Now let’s come to replacement..

Again, I would suggest you check the context of the data first; then one of these methods can be applied:

  1. Replace it with the mean
  2. Replace it with the median
  3. Replace it with some other function (e.g. the mode)

The rule of thumb is:

  • Replace missing values with the mean if the variable is normally distributed.
  • Replace missing values with the median if the variable is skewed.

In the case of a categorical variable, imputing missing values with the mode is the better choice. Suppose the gender variable has 500 males and 200 females; replace all missing values with “male”.

Now check the “Age” column: here we can replace the missing values with the mean of the values remaining in that column.
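
One way to do this is with fillna(); replace(np.nan, ...) works just as well:

    # mean() skips NaN values by default
    mean_age = df['Age'].mean()

    # Fill every missing age with the column mean
    df['Age'] = df['Age'].fillna(mean_age)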

The “Embarked” column, however, contains only 2 missing values, so first check the values in this column.

Oh! This column does not hold numeric values; it contains 3 categories: “S”, “Q”, and “C”. So I checked which category occurs most often:
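
    # Frequency of each category in Embarked
    df['Embarked'].value_counts()

On the training file this prints:

    S    644
    C    168
    Q     77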

Okay! So it contains “S” 644 times. So let’s replace the NaN values with the string “S”.

Note: here “nan” is not a string, so to replace it we need to use “np.nan”.
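
So the replacement looks like this:

    import numpy as np

    # NaN is a float, not the string "nan", so match it with np.nan
    df['Embarked'] = df['Embarked'].replace(np.nan, 'S')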

Now let’s check the data type of each column.
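
pandas exposes this through the dtypes attribute:

    # dtype of every column in the frame
    df.dtypes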

We need to check the data types because a column holding a categorical variable should have dtype “object”, while a column of numerical values will have dtype “int64”/“float64”.

Normalization

Normalization is done to bring the data onto a common scale without distorting the differences in the ranges of values. When the data columns share a common, bounded range, the model converges more efficiently. For normalization we generally use the MinMaxScaler function.
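
A minimal sketch with scikit-learn’s MinMaxScaler, applied here, as an example, to the numeric Age and Fare columns:

    from sklearn.preprocessing import MinMaxScaler

    # Rescale the numeric columns to the [0, 1] range
    scaler = MinMaxScaler()
    df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])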

Data Encoding

Before feeding the data into any machine learning model, we need to convert categorical data into numerical form so that data can be processed.

There are 4 common ways to encode the data:

  • One Hot Encoding
  • Label Encoding
  • Creating Dummy Variables
  • Target Encoding

One Hot Encoding:

If a category column has 4 categories, then with one-hot encoding each row gets 4 binary columns: the column matching the row’s category is set to 1, and the other three are set to 0.
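
A sketch with scikit-learn’s OneHotEncoder, using the 3-category Embarked column as the example:

    from sklearn.preprocessing import OneHotEncoder

    # fit_transform returns a sparse matrix; toarray() makes it dense.
    # Each row becomes [1, 0, 0], [0, 1, 0] or [0, 0, 1]:
    # one column per category, with a single 1 marking the row's class.
    encoder = OneHotEncoder()
    embarked_onehot = encoder.fit_transform(df[['Embarked']]).toarray()

    print(encoder.categories_)  # which category each column stands for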

Label Encoding:

Label encoding assigns a number to each category. For example, in this Titanic dataset the “Sex” column has 2 classes: male and female.

So label encoding assigns an integer to each class, e.g. 0 to female and 1 to male. In df[‘Name’], almost all values are different, so label encoding would assign a unique number to each name, which is not useful.
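
A sketch with scikit-learn’s LabelEncoder on the Sex column:

    from sklearn.preprocessing import LabelEncoder

    # Classes are numbered in sorted order: female -> 0, male -> 1
    le = LabelEncoder()
    df['Sex'] = le.fit_transform(df['Sex'])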

Dummy Variables:

Using dummy variables, we create a new column for each class. It works similarly to one-hot encoding.

Here df[‘Embarked’] contains 3 categories: “S”, “Q”, and “C”. For each row, one class’s column is set to one and the others to zero, so 3 new columns are created.
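
In pandas this is the get_dummies() function:

    # One 0/1 column per port: Embarked_C, Embarked_Q, Embarked_S
    dummies = pd.get_dummies(df['Embarked'], prefix='Embarked')

    # Attach the new columns and drop the original one
    df = pd.concat([df, dummies], axis=1).drop('Embarked', axis=1)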

Target Encoding

In target encoding, we calculate the mean of the target variable for each category and replace the categorical value with that mean. In the case of a categorical target variable, each category is replaced by the posterior probability of the target.
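
A sketch of the idea in pandas, shown on the raw Embarked column (before it is dummy-encoded) with Survived as the target:

    # Mean survival rate for each port of embarkation
    target_means = df.groupby('Embarked')['Survived'].mean()

    # Replace each category with the mean of its group
    df['Embarked_target'] = df['Embarked'].map(target_means)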

In the end, I removed the “Name” and “Ticket” columns too, as they were not useful for classification.

You can check the full code with the training dataset on my GitHub repository through the mentioned link:
https://github.com/letthedataconfess/Data-analysis-part-I

So far, I have covered the basic data wrangling techniques. Let me know if you have any queries.
