Cleaning data for data visualisation

This short post covers cleaning data by dealing with missing values in a dataframe.

Data cleaning is the process of ensuring that your data is correct and usable, by identifying errors or missing values in the data and correcting or deleting them.
Cleaning the data is the first and most important step, as it ensures the data is of sufficient quality before it is prepared for visualization.

We will use Python to clean the data in this post.

Reading the data

First, we have to import the libraries needed to read and load the data.

 
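The loading step is not shown above; here is a minimal sketch using pandas. The file name `train.csv` and the sample columns are assumptions; for a self-contained example, a tiny CSV is loaded from a string instead of a file.

```python
import pandas as pd
from io import StringIO

# In practice the data would come from a file, e.g.:
# data = pd.read_csv('train.csv')  # 'train.csv' is an assumed file name
# For a self-contained example we load a small CSV from a string instead.
csv_text = "Age,Embarked\n22,S\n,C\n35,\n"
data = pd.read_csv(StringIO(csv_text))
print(data.shape)  # (3, 2)
```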

After loading the data, our first step is to check the columns of the dataframe.

This can be done with the following command:

data.columns

Handling missing data

In statistics, missing data, or missing values, occur when no data value is stored/provided for the variable in an observation. Missing data is a common occurrence and can have a significant effect on the conclusions that can be drawn from the data.

Let’s check whether the dataframe has any missing data:

data.isnull().sum() #detect missing values 

The goal of cleaning operations is to prevent problems caused by missing data that can arise when training a model.

How to deal with missing data

Before trying to deal with missing values in an analysis, we need to understand which variables contain them, and we need to examine the patterns of missingness. In the example above, the missing values fall into two types: those in numeric columns and those in object (string) columns.
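One way to see which columns contain missing values, and of which type they are, is the following sketch (the dataframe here is made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical sample with one numeric and one object column
data = pd.DataFrame({'Age': [22.0, np.nan, 35.0],
                     'Embarked': ['S', 'C', None]})

# Columns that contain at least one missing value, with their dtypes
missing_cols = data.columns[data.isnull().any()]
print(data[missing_cols].dtypes)
```

Here `Age` shows up as a numeric (float64) column and `Embarked` as an object column, so they will need different filling strategies.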

There are several ways to deal with missing data:

Ignore the data row

We can delete or drop the rows that contain missing (NaN) values. Obviously, we get poor results if the percentage of such rows is high.

data.dropna(inplace = True)

Sometimes we want to drop a row only if all of its values are NaN (note that without inplace=True, dropna returns a new dataframe):

data.dropna(how='all')  

Use fillna() to fill missing values

Instead of deleting rows with NaN values, we can fill the missing values. Missing data can be filled by propagating non-NaN values forward or backward along a Series. Note that a NaN value will remain NaN even after forward or backward filling if no previous or next non-NaN value is available.

data.fillna(method='ffill')

data.fillna(method='bfill')
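A small illustration of the leftover-NaN behaviour on a toy Series (recent pandas versions spell these operations as `ffill()` and `bfill()`, which is what the sketch uses):

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])
# Forward fill: the leading NaN has no previous value, so it stays NaN
print(s.ffill().tolist())  # [nan, 1.0, 1.0, 3.0, 3.0]
# Backward fill: the trailing NaN has no next value, so it stays NaN
print(s.bfill().tolist())  # [1.0, 1.0, 3.0, 3.0, nan]
```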

 

Use a constant value to fill in for missing values

This technique is used when it just doesn’t make sense to try to predict the missing value.

# This fills all the null values with 0.
data.fillna(value=0, inplace=True)

For missing data in an object-type column, we can simply fill with the string ‘missing’, and handle those values in code later if necessary.

data['Embarked'].fillna('missing', inplace=True) 

Replace with the mean, median or mode

The mean is the average of the data set, the median is the middle value, and the mode is the value that occurs most often. The mode can also be used to fill missing data in object-type columns.

# we can use median
data['Age'].fillna(data['Age'].median(), inplace=True)

# we can use mean()
data['Age'].fillna(data['Age'].mean(), inplace=True)

# we can use mode()
data['Embarked'].fillna(data['Embarked'].mode()[0], inplace=True)
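On a toy Age column, the three statistics look like this (a sketch with made-up values; NaN is ignored when computing them):

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, 30.0, 30.0, np.nan])
print(age.mean())     # (20 + 30 + 30) / 3 ≈ 26.67
print(age.median())   # middle of sorted [20, 30, 30] = 30.0
print(age.mode()[0])  # most frequent value = 30.0

filled = age.fillna(age.median())
print(filled.isnull().sum())  # 0 - no missing values remain
```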

Finally, we can check whether all our missing data have been filled or removed by running data.isnull().sum() again.

Hope this post was useful. Cheers!

 

I am a greenhorn data science student with an interest in finding patterns in data. My language of choice is Python, and I am starting to get my hands dirty with R.

I blog on Medium.com [1] and ConfusedCoders.com [2]. I share my code on Github.com [3].

  1. https://medium.com/@nikkisharma536
  2. https://confusedcoders.com/author/nikita
  3. https://github.com/nikkisharma536
