Exploratory Data Analysis

In my last blog, I discussed the path you can follow to get started in the Data Science domain. In this blog, I will discuss Exploratory Data Analysis (EDA) techniques. When we work with a dataset, we first analyze it to understand what the data is about: What features does it have? Which features are independent? Which feature is the dependent one?

Using EDA, we can also check for missing values, duplicate values and outliers. We can find patterns and trends by visualizing the data with the charts and plots available in the Seaborn and Matplotlib packages.

So what are we waiting for? Let's start exploring the data:

Importing Libraries
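
The import step can be sketched as below. The exact set of libraries is an assumption on my part, based on the packages named in this post (Pandas, Seaborn and Matplotlib), plus NumPy, which is commonly imported alongside them.

```python
# Libraries commonly used for EDA (assumed set, based on the packages named in this post)
import numpy as np                # numerical helpers
import pandas as pd               # loading and inspecting tabular data
import matplotlib.pyplot as plt   # base plotting
import seaborn as sns             # statistical plots built on Matplotlib
```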

Reading Dataset

Using the Pandas read_csv() function, you can load a .csv file into a DataFrame and perform operations on it.
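
Here is a minimal sketch of loading CSV data. The column names and values are hypothetical, and the data is read from an in-memory string so the example runs without a file on disk. With a real file you would pass its path to read_csv() instead.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a real file on disk
csv_text = """age,height_cm,city
25,170.2,Pune
32,165.0,Delhi
47,180.5,Pune
"""

# With a real file this would be: pd.read_csv("path/to/your_file.csv")
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```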

Eyeballing Data

Using the head() function, you can display the first 5 records of the dataset. If you want to display more records, e.g. 10 records, you can use head(10). Similarly, you can display the last 5 records using the tail() function.
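
A small sketch of head() and tail() on a hypothetical 12-row DataFrame (the data is made up only so that the slices differ):

```python
import pandas as pd

# Hypothetical data: 12 rows so head() and tail() return different slices
df = pd.DataFrame({
    "row_id": range(1, 13),
    "value": [v * 10 for v in range(1, 13)],
})

first_five = df.head()     # first 5 rows (the default)
first_ten = df.head(10)    # first 10 rows
last_five = df.tail()      # last 5 rows

print(first_five)
print(last_five)
```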

Checking Datatypes

To check the datatype of each feature, you can use the dtypes attribute (it is an attribute, not a function, so it is written without parentheses). Categorical variables stored as text usually show up with the object datatype.
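
As an illustration, here is dtypes on a hypothetical DataFrame with an integer, a float and a text column. Text columns typically appear as object, although newer Pandas versions may report a dedicated string datatype instead.

```python
import pandas as pd

# Hypothetical data with three different kinds of columns
df = pd.DataFrame({
    "age": [25, 32, 47],                 # integers
    "height_cm": [170.2, 165.0, 180.5],  # floats
    "city": ["Pune", "Delhi", "Pune"],   # text (categorical)
})

# dtypes is an attribute, so no parentheses
print(df.dtypes)
```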

Checking Dimension

To check the number of rows and columns, you can use the shape attribute, which returns a (rows, columns) tuple. In this dataset there are 400 rows and 9 columns.
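
A quick sketch on a hypothetical 4-row, 2-column DataFrame, showing how the tuple can be unpacked:

```python
import pandas as pd

# Hypothetical data: 4 rows, 2 columns
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# shape is an attribute that holds (number_of_rows, number_of_columns)
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```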

Checking Features

To display feature names, you can use the columns attribute. It returns the names of all features in the dataset.
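
For example, with hypothetical column names:

```python
import pandas as pd

# Hypothetical data; only the column names matter here
df = pd.DataFrame({
    "age": [25, 32],
    "height_cm": [170.2, 165.0],
    "city": ["Pune", "Delhi"],
})

print(df.columns)                     # an Index object holding the names
feature_names = df.columns.tolist()   # plain Python list, if you prefer
print(feature_names)
```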

Checking Information

To check detailed information about the dataset, you can use the info() function. It displays each feature along with its non-null count and datatype.
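
A minimal sketch on hypothetical data. One value is deliberately missing so that the non-null count differs between columns.

```python
import pandas as pd

# Hypothetical data with one missing value in "height_cm"
df = pd.DataFrame({
    "age": [25, 32, 47],
    "height_cm": [170.2, None, 180.5],
    "city": ["Pune", "Delhi", "Pune"],
})

# Prints the index range, each column with its non-null count and dtype,
# and the approximate memory usage
df.info()
```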

Checking Summary Statistics

To display summary statistics of the dataset, you can use the describe() function. You can transpose it for better readability. For numeric features, its output includes the “Five Number Summary”: minimum, Q1 (first quartile), median (Q2, second quartile), Q3 (third quartile) and maximum. It also reports the count, mean and standard deviation.
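
Here is a small sketch on hypothetical numbers, chosen so that the five values are easy to read off. The 25%, 50% and 75% columns correspond to Q1, the median and Q3.

```python
import pandas as pd

# Hypothetical data: one numeric feature with five evenly spaced values
df = pd.DataFrame({"score": [1, 2, 3, 4, 5]})

# .T transposes the result so each feature becomes one row
summary = df.describe().T
print(summary)

# Pick out just the five-number summary
five_numbers = summary.loc["score", ["min", "25%", "50%", "75%", "max"]]
print(five_numbers)
```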

Checking Outliers

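Outliers are extreme values: observations that lie at an abnormal distance from the other values in a sample. To visualize outliers, you can use a boxplot. By default, a Seaborn boxplot draws points lying more than 1.5 times the IQR beyond the quartiles as individual markers.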
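
The sketch below draws a boxplot for a hypothetical feature and then flags the same points numerically using the common 1.5 times IQR rule. The data and the column name are made up, and the numeric check is my addition to complement the plot.

```python
import pandas as pd
import seaborn as sns

# Hypothetical data: most scores are close together, one is far away
df = pd.DataFrame({"score": [52, 55, 57, 58, 60, 61, 63, 65, 120]})

# Boxplot: points beyond the whiskers are drawn as individual markers.
# In a notebook the figure renders on its own; in a script, call plt.show().
ax = sns.boxplot(x=df["score"])
ax.set_title("Boxplot of score")

# Same idea in numbers: values beyond 1.5 * IQR from the quartiles
q1 = df["score"].quantile(0.25)
q3 = df["score"].quantile(0.75)
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = df[(df["score"] < lower_fence) | (df["score"] > upper_fence)]
print(outliers)
```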

Checking NULL values

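To check whether your dataset contains any NULL values, you can use the isnull() function. It returns a True/False value for every cell, so it is usually combined with sum() to get a count per feature. In the current dataset, there are no NULL values.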
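
As an illustration, here is a hypothetical DataFrame that does contain missing values (unlike the dataset in this post), so the counts are not all zero:

```python
import pandas as pd

# Hypothetical data with two missing values
df = pd.DataFrame({
    "name": ["A", "B", "C"],
    "age": [25, None, 31],
    "city": ["Pune", "Delhi", None],
})

# Number of missing values per feature
null_counts = df.isnull().sum()
print(null_counts)

# Single True/False answer for the whole dataset
has_nulls = bool(df.isnull().values.any())
print(has_nulls)
```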

Checking Duplicate values

To check whether your dataset contains any duplicate rows, you can use the duplicated() function. In the current dataset, there are no duplicate values.
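
A short sketch on hypothetical data with one repeated row. duplicated() marks every repeat of an earlier row as True, and drop_duplicates() is one way to remove them.

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one
df = pd.DataFrame({
    "name": ["A", "B", "A", "C"],
    "age": [25, 30, 25, 41],
})

duplicate_mask = df.duplicated()          # True for rows that repeat an earlier row
n_duplicates = int(duplicate_mask.sum())  # how many such rows there are
print(n_duplicates)

# Keep only the first occurrence of each row
deduplicated = df.drop_duplicates()
print(deduplicated)
```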

Checking Measure of Central Tendency

To check the measures of central tendency, i.e. mean, median and mode, you can simply use the mean(), median() and mode() functions.
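
For example, on a hypothetical feature. Note that mode() returns a Series, because a feature can have more than one most frequent value.

```python
import pandas as pd

# Hypothetical feature values
scores = pd.Series([2, 4, 4, 6, 9], name="score")

mean_value = scores.mean()             # arithmetic average
median_value = scores.median()         # middle value of the sorted data
mode_values = scores.mode().tolist()   # most frequent value(s)

print(mean_value, median_value, mode_values)
```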

Checking Measure of Dispersion

To check the variability, spread and extent of the data, i.e. how much the data is stretched or squeezed, you can use various measures of dispersion: variance, standard deviation and inter-quartile range (IQR).
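
These measures can be sketched as follows on a hypothetical feature. Pandas computes the sample variance and standard deviation by default (ddof=1), and the IQR is simply Q3 minus Q1.

```python
import pandas as pd

# Hypothetical feature values
scores = pd.Series([2, 4, 4, 6, 9], name="score")

variance = scores.var()   # sample variance (ddof=1 by default)
std_dev = scores.std()    # sample standard deviation (ddof=1 by default)

# Inter-quartile range: spread of the middle 50% of the data
q1 = scores.quantile(0.25)
q3 = scores.quantile(0.75)
iqr = q3 - q1

print(variance, std_dev, iqr)
```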

Univariate Analysis

To perform univariate analysis, i.e. analysis of a single feature, you can use various plots such as barplot, histogram, countplot etc. Here I have used a distribution plot to check how the data is distributed, i.e. whether it follows a normal, bi-modal or multi-modal distribution. You can also check whether the data is right skewed or left skewed using distplot. Note that distplot is deprecated in recent Seaborn versions in favor of histplot and displot.

Bivariate Analysis

To visualize the pairwise relationships between features in a 2-dimensional form, you can use a heatmap, typically drawn over the correlation matrix. It provides a colored visual summary of the information.
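
A sketch of a correlation heatmap on hypothetical data. Plotting the output of corr() is an assumption about what the heatmap is drawn over; the feature names and values are made up.

```python
import pandas as pd
import seaborn as sns

# Hypothetical numeric features
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 55, 61, 70, 74],
    "absences": [9, 7, 6, 3, 1],
})

# Pairwise Pearson correlation between numeric features
corr = df.corr()
print(corr)

# annot=True writes each coefficient inside its cell;
# vmin/vmax pin the color scale to the full correlation range.
# In a notebook the figure renders on its own; in a script, call plt.show().
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
```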

In this way we can eyeball our dataset and draw inferences from the data for better understanding. I hope you have found these EDA steps helpful.

Associate Developer @SAP Labs India
