In my last blog, I discussed the path you should follow to get started in the Data Science domain. In this blog, I will discuss Exploratory Data Analysis (EDA) techniques. When we work with a dataset, we first analyze it to understand what the data is about: What features does it contain? Which features are independent? Which feature is the dependent one?
Using EDA, we can also check for missing values, duplicate values, and outliers. We can also find patterns and trends by visualizing the data with the charts and plots available in the Seaborn and Matplotlib packages.
So what are we waiting for? Let's start exploring the data:
Using the Pandas read_csv() function, you can load a .csv file into a DataFrame and perform operations on it.
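As a minimal sketch of loading data, the snippet below reads CSV text from an in-memory buffer so that it runs without any file on disk. The column names and values are made up for illustration and are not the dataset used in this blog.

```python
import io

import pandas as pd

# Hypothetical CSV content standing in for a real file on disk.
csv_text = """student_id,score,grade
1,82.5,B
2,91.0,A
3,67.0,C
"""

# read_csv accepts a file path or a file-like object. With a real file
# you would pass its path instead of the StringIO buffer.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```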
Using the head() function, you can display the first 5 records of the dataset. If you want to display more records, e.g. 10, you can use head(10). Similarly, you can display the last 5 records using the tail() function.
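Here is a small sketch of head() and tail() on a made-up 20-row frame (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical frame with 20 rows.
df = pd.DataFrame({"id": range(1, 21), "value": [i * 10 for i in range(1, 21)]})

first_five = df.head()    # first 5 rows by default
first_ten = df.head(10)   # first 10 rows
last_five = df.tail()     # last 5 rows by default

print(first_five)
print(last_five)
```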
To check the datatype of each feature in the dataset, you can use the dtypes attribute. Categorical (text) variables typically appear with the object datatype.
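A quick sketch of inspecting datatypes on a made-up frame follows. Depending on your pandas version, the text column may be reported as object or as a dedicated string dtype, so treat the printed labels as indicative only.

```python
import pandas as pd

# Hypothetical frame mixing integer, float and text columns.
df = pd.DataFrame({
    "age": [25, 32, 47],
    "income": [48000.0, 61500.5, 83000.0],
    "city": ["Pune", "Delhi", "Mumbai"],
})

# dtypes is an attribute, so it is used without parentheses.
print(df.dtypes)

# select_dtypes helps separate numeric features from the rest.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
other_cols = df.select_dtypes(exclude="number").columns.tolist()
```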
To check the number of rows and columns in the dataset, you can use the shape attribute. This dataset has 400 rows and 9 columns.
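For illustration, shape returns a (rows, columns) tuple that you can unpack directly. The tiny frame below is made up and much smaller than the 400 x 9 dataset discussed here.

```python
import pandas as pd

# Hypothetical frame with 4 rows and 3 columns.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [0.1, 0.2, 0.3, 0.4],
    "c": ["w", "x", "y", "z"],
})

# shape is an attribute that holds (number_of_rows, number_of_columns).
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```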
To display the feature names, you can use the columns attribute. It lists the names of all features in the dataset.
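A short sketch with hypothetical feature names:

```python
import pandas as pd

# Hypothetical feature names, used only for illustration.
df = pd.DataFrame({"hours_studied": [2, 5], "score": [55, 80], "passed": [False, True]})

# columns is an Index object; tolist() turns it into a plain Python list.
print(df.columns)
feature_names = df.columns.tolist()
```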
To see detailed information about the dataset, you can use the info() function. It displays each feature along with its non-null count and datatype.
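As a rough sketch, info() prints its report rather than returning it. In a notebook you would normally just call df.info(); the buffer below is only there so the text can be captured in a variable. The data is made up.

```python
import io

import pandas as pd

# Hypothetical frame with a few missing entries.
df = pd.DataFrame({
    "age": [25, None, 47],
    "city": ["Pune", "Delhi", None],
})

# info() writes to stdout by default; passing buf captures the report instead.
buffer = io.StringIO()
df.info(buf=buffer)
report = buffer.getvalue()
print(report)
```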
Checking Summary Statistics
To display summary statistics of the dataset, you can use the describe() function. You can transpose the result to make it easier to read. Its output includes the “Five Number Summary”: minimum, Q1 (first quartile), median (Q2, second quartile), Q3 (third quartile), and maximum, alongside the count, mean, and standard deviation.
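The sketch below shows describe() with a transpose on a made-up frame and then picks out the five-number-summary columns. In the output, pandas labels the quartiles as 25%, 50%, and 75%.

```python
import pandas as pd

# Hypothetical numeric features.
df = pd.DataFrame({
    "score": [1, 2, 3, 4, 5],
    "hours": [2.0, 4.0, 6.0, 8.0, 10.0],
})

# Transposing puts one feature per row, which is easier to scan.
summary = df.describe().T
print(summary)

# The five-number summary is a subset of the describe() columns.
five_number = summary[["min", "25%", "50%", "75%", "max"]]
print(five_number)
```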
Outliers are extreme values: observations that lie an abnormal distance from the other values in a sample. To visualize outliers, you can use a boxplot.
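Here is a minimal sketch of a Seaborn boxplot on made-up scores. It also lists the points that fall outside 1.5 x IQR from the quartiles, which is the default whisker rule a boxplot uses to mark outliers. The data and the output file name are assumptions for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical scores with one obviously extreme value.
df = pd.DataFrame({"score": [52, 55, 57, 58, 60, 61, 63, 64, 66, 120]})

ax = sns.boxplot(x=df["score"])
ax.set_title("Boxplot of score")
plt.savefig("score_boxplot.png")  # hypothetical output file name
plt.close()

# Flag the same points numerically with the 1.5 * IQR rule.
q1 = df["score"].quantile(0.25)
q3 = df["score"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["score"] < q1 - 1.5 * iqr) | (df["score"] > q3 + 1.5 * iqr)]
print(outliers)
```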
Checking NULL values
To check whether your dataset contains any NULL values, you can use the isnull() function. In the current dataset, there are no NULL values.
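As a small sketch, chaining sum() after isnull() gives a per-feature count of missing values. The frame below is made up and deliberately contains a few gaps.

```python
import pandas as pd

# Hypothetical frame with some missing entries.
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "city": ["Pune", "Delhi", None, None],
})

# isnull() returns a boolean frame; sum() counts the True values per column.
null_counts = df.isnull().sum()
print(null_counts)

# A single yes/no answer for the whole dataset.
has_nulls = bool(df.isnull().values.any())
```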
Checking Duplicate values
To check whether your dataset contains any duplicate rows, you can use the duplicated() function. In the current dataset, there are no duplicate values.
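A quick sketch on made-up data: duplicated() marks every repeat of an earlier row, and drop_duplicates() removes those repeats if you decide you do not need them.

```python
import pandas as pd

# Hypothetical frame where the third row repeats the second.
df = pd.DataFrame({"id": [1, 2, 2, 3], "score": [90, 85, 85, 70]})

# True for each row that is an exact copy of an earlier row.
dup_mask = df.duplicated()
n_duplicates = int(dup_mask.sum())
print(f"duplicate rows: {n_duplicates}")

# Keep only the first occurrence of each row.
deduped = df.drop_duplicates()
```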
Checking Measures of Central Tendency
To check the measures of central tendency, i.e. mean, median, and mode, you can simply use the mean(), median(), and mode() functions.
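For example, on a made-up series of scores (note that mode() returns a Series, because a feature can have more than one mode):

```python
import pandas as pd

# Hypothetical scores.
scores = pd.Series([70, 80, 80, 90, 100])

mean_score = scores.mean()      # arithmetic average
median_score = scores.median()  # middle value of the sorted data
mode_scores = scores.mode()     # most frequent value(s), returned as a Series

print(mean_score, median_score, mode_scores.tolist())
```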
Checking Measures of Dispersion
To check the variability or spread of the data, i.e. how much the data is stretched or squeezed, you can use various measures of dispersion: variance, standard deviation, and interquartile range (IQR).
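The sketch below computes these three measures on made-up values. One detail worth knowing: pandas uses the sample formulas (ddof=1) by default for var() and std(); pass ddof=0 if you want the population versions.

```python
import pandas as pd

# Hypothetical values.
values = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

sample_var = values.var()           # sample variance (ddof=1 by default)
sample_std = values.std()           # sample standard deviation
population_var = values.var(ddof=0)
population_std = values.std(ddof=0)

# IQR is the distance between the third and first quartiles.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

print(sample_var, sample_std, population_var, population_std, iqr)
```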
To perform univariate analysis, i.e. analysis of a single feature, you can use various plots such as a barplot, histogram, or countplot. Here I have used a distribution plot to check how the data is distributed, i.e. whether it follows a normal, bimodal, or multimodal distribution. You can also check whether the data is right-skewed or left-skewed using distplot (recent Seaborn versions deprecate distplot in favor of histplot and displot).
To visualize the relationships among multiple features in a two-dimensional form, you can use a heatmap. It provides a color-coded visual summary of the information.
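One common way to do this, and the assumption behind this sketch, is to plot the correlation matrix of the numeric features. The data, color map, and output file name below are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Hypothetical numeric features.
df = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 71, 80],
    "absences": [9, 7, 6, 3, 1],
})

# Pairwise Pearson correlations between the features.
corr = df.corr()

# annot=True writes each correlation value inside its cell.
ax = sns.heatmap(corr, annot=True, cmap="coolwarm", vmin=-1, vmax=1)
ax.set_title("Correlation heatmap")
plt.savefig("correlation_heatmap.png")  # hypothetical output file name
plt.close()
```

Cells close to 1 or -1 point to strongly related features, while cells near 0 suggest little linear relationship.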
In this way, we can eyeball our dataset and draw inferences that help us understand the data better. I hope you have found these EDA steps helpful.