Exploratory Data Analysis (EDA) Techniques for Kaggle Competition Beginners

Exploratory Data Analysis (EDA) is an approach to analysing data sets to summarise their main characteristics, often with visual methods. The steps involved in EDA are:

1. Data Collection
2. Data Cleaning
3. Data Preprocessing
4. Data Visualisation

Data Collection

Data collection is the process of gathering information in an established, systematic way that enables one to test hypotheses and evaluate outcomes easily.

After getting the data, we need to check the data type of each feature.

Features can be of the following types:

  • numeric
  • categorical
  • ordinal
  • datetime
  • coordinates

To see the data type of each feature, run the following command:

train_data.dtypes

or

train_data.info()

Let’s have a look at the statistical summary of our dataset.

train_data.describe()
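The commands above assume that a pandas DataFrame named train_data already exists. As a minimal sketch, the block below builds a tiny stand-in with Titanic-style columns (the values are invented for illustration, not taken from the real dataset) and runs the three inspections on it. In a competition you would normally load the file with pd.read_csv instead; the file name in the comment is an assumption.

```python
import pandas as pd

# Hypothetical miniature stand-in for the Titanic training data;
# the values are invented for illustration only.
train_data = pd.DataFrame({
    "Age": [22.0, 38.0, None, 35.0],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Pclass": [3, 1, 3, 1],
    "Sex": ["male", "female", "female", "female"],
})
# In practice: train_data = pd.read_csv("train.csv")  # assumed file name

# Data type of each column
print(train_data.dtypes)

# Column names, non-null counts and dtypes in a single report
train_data.info()

# count, mean, std, min, quartiles and max for the numeric columns
summary = train_data.describe()
print(summary)
```

Note that describe() only summarises the numeric columns by default, and that the count row already hints at missing values: Age has fewer entries than Fare.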

Data Cleaning

Data cleaning is the process of making sure your data is correct and usable by identifying errors or missing values and then correcting or deleting them. Refer to this link for data cleaning.
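A minimal sketch of the two options mentioned above, correcting or deleting missing data, using pandas. The sample values are invented, and filling with the median or the most frequent value is only one common choice, not the only valid one.

```python
import pandas as pd

# Hypothetical sample with gaps; the values are invented for illustration.
train_data = pd.DataFrame({
    "Age": [22.0, None, 26.0, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
    "Embarked": ["S", "C", None, "S"],
})

# Identify: count the missing values in each column
missing_counts = train_data.isnull().sum()
print(missing_counts)

# Correct: fill numeric gaps with the median and
# categorical gaps with the most frequent value (the mode)
cleaned = train_data.copy()
cleaned["Age"] = cleaned["Age"].fillna(cleaned["Age"].median())
cleaned["Embarked"] = cleaned["Embarked"].fillna(cleaned["Embarked"].mode()[0])

# Delete: alternatively, drop every row that has any missing value
dropped = train_data.dropna()
print(len(train_data), "rows before,", len(dropped), "rows after dropna()")
```

Dropping rows is simpler but can discard a lot of data, as it does here, so filling is often preferred when many rows have only one or two gaps.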

Once the data is clean, we can move on to data preprocessing.

Data Preprocessing

Data preprocessing is a data mining technique that involves transforming raw data into an understandable format. It includes normalisation and standardisation, transformation, feature extraction and selection, etc. The product of data preprocessing is the final training dataset.
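As a rough illustration of three of these steps, the sketch below applies normalisation (min-max scaling), standardisation (z-scores) and a simple transformation (one-hot encoding) with plain pandas. The column names follow the Titanic examples in this post, but the values and the new column names Fare_norm and Fare_std are made up for the example.

```python
import pandas as pd

# Hypothetical fares and sexes; the values are invented for illustration.
df = pd.DataFrame({
    "Fare": [10.0, 20.0, 30.0, 40.0],
    "Sex": ["male", "female", "female", "male"],
})

# Normalisation (min-max scaling): rescale the values to the range [0, 1]
fare_min, fare_max = df["Fare"].min(), df["Fare"].max()
df["Fare_norm"] = (df["Fare"] - fare_min) / (fare_max - fare_min)

# Standardisation (z-score): zero mean and unit standard deviation.
# pandas' std() uses the sample standard deviation (ddof=1) by default.
df["Fare_std"] = (df["Fare"] - df["Fare"].mean()) / df["Fare"].std()

# Transformation: one-hot encode the categorical column
encoded = pd.get_dummies(df, columns=["Sex"])
print(encoded)
```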

Data Visualisation

Data visualisation is the graphical representation of information and data. It uses statistical graphics, plots, information graphics and other tools to communicate information clearly and efficiently.

Here we will focus on commonly used seaborn visualisations. Seaborn is a Python data visualisation library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

The commonly used seaborn visualisations are:

  1. Scatter Plot
  2. Box Plot
  3. Histogram
  4. Cat Plot
  5. Violin Plot
  6. Pair Plot
  7. Joint Plot
  8. Heat Map

# import seaborn library

import seaborn as sns

Scatter Plot

A scatter plot is a set of points plotted against a horizontal and a vertical axis.

The scatter plot below shows the relationship between passenger age and passenger fare, coloured by Pclass (ticket class), using the Titanic dataset:

sns.scatterplot(x="Age", y="Fare", hue="Pclass", data=train_data)

Box Plot

A box plot is a simple way of representing statistical data in which a rectangle spans the first to third quartiles (the middle 50% of the data), with a line inside marking the median. Whiskers extend from either side of the rectangle to the most extreme values that lie within 1.5 times the interquartile range (seaborn's default); points beyond the whiskers are drawn individually as outliers.

The box plot below shows how the passenger fare varies with ticket class.

sns.boxplot(x="Pclass", y="Fare", data=train_data)
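To make the elements of a box plot concrete, the sketch below computes the numbers that a box plot draws for a single group: the quartiles, the interquartile range (IQR) and the whisker limits at 1.5 times the IQR, which is the default in seaborn and matplotlib. The fare values are invented for illustration.

```python
import pandas as pd

# Hypothetical fares for one ticket class; the values are invented.
fares = pd.Series([5.0, 7.0, 8.0, 10.0, 12.0, 13.0, 15.0, 100.0])

# Edges of the box and the line inside it
q1 = fares.quantile(0.25)
median = fares.quantile(0.50)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Limits beyond which a point counts as an outlier
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# The whiskers end at the most extreme data points inside the limits
inside = fares[(fares >= lower_limit) & (fares <= upper_limit)]
whisker_low, whisker_high = inside.min(), inside.max()

# Everything outside the limits is drawn as an individual point
outliers = fares[(fares < lower_limit) | (fares > upper_limit)]

print("box:", q1, "to", q3, "| median:", median)
print("whiskers:", whisker_low, "to", whisker_high)
print("outliers:", outliers.tolist())
```

In this sample the single fare of 100 falls far above the upper limit, so it would appear as a separate point above the upper whisker.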


Histogram

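A histogram represents the distribution of numerical data by grouping the values into bins and counting how many fall into each bin. It gives an estimate of the probability distribution of a variable. The histogram below shows the number of passengers in each ticket class.

# histplot replaces distplot, which is deprecated since seaborn 0.11
sns.histplot(train_data["Pclass"], kde=False)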
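To see what a histogram actually computes, the sketch below uses NumPy's np.histogram to count values per bin and then to rescale the counts into a density. The ages and the bin edges are invented for illustration.

```python
import numpy as np

# Hypothetical passenger ages; the values are invented for illustration.
ages = np.array([2, 8, 15, 22, 24, 27, 31, 35, 38, 54])
bin_edges = [0, 20, 40, 60]

# Number of values that fall into each bin (the height of each bar)
counts, edges = np.histogram(ages, bins=bin_edges)
print(counts)

# With density=True the bar areas sum to 1, which is what makes a
# histogram an estimate of the probability distribution
density, _ = np.histogram(ages, bins=bin_edges, density=True)
bin_widths = np.diff(edges)
print((density * bin_widths).sum())
```

The choice of bin edges changes the shape of the histogram, so it is worth trying a few bin widths before drawing conclusions.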

Cat Plot

Cat plot provides access to several axes-level functions that show the relationship between a numerical variable and one or more categorical variables using one of several visual representations. The kind parameter selects which plot to draw and corresponds to the name of a categorical plotting function. Options include: “point”, “bar”, “strip”, “swarm”, “box”, or “violin”. More details about cat plot are here.

Below we draw a cat plot with the bar kind:

sns.catplot(x="Embarked", y="Survived", hue="Sex", col="Pclass", kind="bar", data=train_data, palette="rainbow")

Let’s look at the same cat plot with the violin kind:

sns.catplot(x="Embarked", y="Survived", hue="Sex", col="Pclass", kind="violin", data=train_data, palette="rainbow")

Violin Plot

A violin plot plays a similar role to a box-and-whisker plot. It shows the distribution of quantitative data across several levels of one (or more) categorical variables so that those distributions can be compared. More details about violin plot are here.

sns.violinplot(x="Sex", y="Survived", data=train_data)

Pair Plot

Pair plot in seaborn only plots numerical columns, although a categorical variable can still be used to colour the points through the hue parameter. More about pair plot is here.

sns.pairplot(train_data, hue="Sex")

 

Joint Plot

Joint plot is specific to the seaborn library and can be used to quickly visualise and analyse the relationship between two variables while showing their individual distributions on the same plot.

More about joint plot is here.

sns.jointplot(x="Age", y="Fare", data=train_data, color="green")

Heat Map

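A heat map is a representation of data in the form of a map or diagram in which data values are represented as colours. The heat map below shows the pairwise correlation between the numeric columns.

# numeric_only=True (pandas >= 1.5) skips text columns such as Sex;
# annot=True prints the values that fmt formats
sns.heatmap(train_data.corr(numeric_only=True), annot=True, fmt=".2f")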
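The heat map is drawn from a correlation matrix, so it helps to look at that matrix directly. The sketch below computes it for three made-up columns (the names a, b and c and their values are hypothetical) chosen so that one pair rises together and another pair moves in opposite directions.

```python
import pandas as pd

# Hypothetical numeric columns; the values are invented for illustration.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],  # rises together with a
    "c": [4.0, 3.0, 2.0, 1.0],  # falls as a rises
})

# Pairwise Pearson correlation between all columns
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

Each cell of the heat map is coloured according to one entry of this matrix: values near 1 mean two columns rise together, values near -1 mean one falls as the other rises, and values near 0 mean there is little linear relationship.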

 

That’s all for this post, hope it was helpful. Cheers!

 

 
