A Complete Guide on Exploratory Data Analysis

Data is an essential asset of every organization. Data quality plays an important role because it helps organizations make better decisions. Data science uses scientific approaches, procedures, algorithms, and frameworks to extract knowledge and insight from large amounts of data, which can be either structured or unstructured.

Exploratory data analysis (EDA) analyses and summarises datasets to identify patterns, trends, and relationships. It is a crucial step in the data science life cycle because it helps you better understand your data, identify any issues or problems with the data, and formulate hypotheses for further analysis.

When we work with data, EDA is the step in which we gather insights from it and uncover hidden patterns.

Why EDA?

The primary purpose of EDA is to clean the data and find useful insights in it. It is a crucial step and generally the first exercise in data mining.

EDA allows us to visualize data to understand it and to form hypotheses for further analysis. Exploratory analysis centers on creating a summary of the data, or insights, for the next steps in a data mining project.

Exploratory Data Analysis has the following primary advantages:

EDA helps us to clean the dataset.
It helps us to identify relationships between variables.
It helps to detect outliers in the dataset.
EDA helps to select the best algorithms for model building.
EDA helps to select the important features from the dataset.

Components of EDA


Many different techniques can be used as part of EDA, including:

Univariate Analysis: This is an analysis where we examine one variable at a time. It can include techniques such as calculating summary statistics (e.g., mean, median, mode, etc.), plotting histograms or density plots, and identifying outliers.
Bivariate Analysis: This involves visualizing the relationship between two variables. It can include techniques such as scatter plots, correlation, and regression analysis.
Multivariate Analysis: This is generally done between three or more variables. It can include techniques such as principal component analysis, cluster analysis, and multivariate regression.
Data Cleaning and Preparation: This involves identifying and fixing any errors, inconsistencies, or missing values in the data. It is a crucial step in EDA, as these issues can distort your analysis and lead to misleading conclusions.
Data Visualization: This involves creating plots and charts to help you understand and communicate the patterns and trends in your data. Many different plots and charts can be used, including line plots, bar charts, box plots, and heat maps.

How to Perform EDA?

Let’s explore the steps of exploratory data analysis in detail, using a customer churn analysis based on customers' behaviour on a website or app.

We will classify what kind of customers are likely to sign up for the paid subscription of a website. After analysing and classifying the dataset, we can run targeted marketing or make recommendations to the customers likely to sign up for the paid subscription plan.

Import the Libraries


1. The data is stored in CSV format, so we import it using pd.read_csv.
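
The article's file name is not given, so the following is only a minimal sketch of the loading step. The CSV text is a hypothetical stand-in for the real file, and the column names are taken from the features mentioned later in this guide.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real CSV file; the actual file name and
# contents are not given in the article.
csv_text = """age,numscreens,minigame,used_premium_feature,enrolled
23,15,0,0,1
31,22,1,0,0
28,9,0,1,1
"""

# pd.read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

With a real file, the call would take the path instead of the StringIO object.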

2. How many entries (rows) and attributes (columns) are present in the data? What is the shape of the data?

The .shape attribute returns a tuple of (number of rows, number of columns). Our dataset has 50000 rows and 12 columns.
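
As a small illustration on made-up data (not the 50000-row dataset), the shape can be unpacked directly into row and column counts:

```python
import pandas as pd

# Tiny synthetic frame used only for illustration.
df = pd.DataFrame({
    "age": [23, 31, 28, 45],
    "numscreens": [15, 22, 9, 30],
    "enrolled": [1, 0, 1, 0],
})

# .shape is an attribute (no parentheses) holding (rows, columns).
n_rows, n_cols = df.shape
print(f"{n_rows} rows, {n_cols} columns")
```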

3. Display the first 5 entries of the data.


.head() method gives the first 5 rows of the dataset. It is useful for seeing some example values for each variable.
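
A minimal sketch on synthetic data shows the default of five rows and how to ask for a different number:

```python
import pandas as pd

# Synthetic ten-row frame used only for illustration.
df = pd.DataFrame({"age": range(20, 30), "numscreens": range(10, 20)})

first_rows = df.head()    # first 5 rows by default
first_three = df.head(3)  # pass n to change the number of rows
print(first_rows)
```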

4. What are the different features available in the data?

The .columns attribute returns the labels of all the columns in the dataset.
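
The sketch below lists the column labels of a small synthetic frame. It also prints the data types, which is an optional extra that is often useful at this step; the date values are invented.

```python
import pandas as pd

# Synthetic frame; column names follow the features named in this guide.
df = pd.DataFrame({
    "age": [23, 31],
    "enrolled": [1, 0],
    "enrolled_date": ["2013-06-01", None],
})

print(df.columns.tolist())  # column labels as a plain Python list
print(df.dtypes)            # data type of each column
```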


5. Display the distribution of Numerical Variables.

The .describe() method summarizes the count, mean, standard deviation, minimum, quartiles, and maximum of the numeric variables. Comparing the mean with the median (the 50% row) gives a first indication of skewness in the data.
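
The following sketch uses invented values with one extreme entry per column. It shows how a mean well above the median hints at right skew.

```python
import pandas as pd

# Synthetic values with one extreme entry per column.
df = pd.DataFrame({
    "age": [22, 25, 27, 30, 95],
    "numscreens": [10, 15, 18, 20, 200],
})

summary = df.describe()
print(summary)

# A mean far above the median (the "50%" row) suggests right skew.
skew_hint = summary.loc["mean"] - summary.loc["50%"]
print(skew_hint)
```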

Key inference from the summary above: age and numscreens have some outliers. We will treat them in a later section.

6. Distribution of Target Variable

The .value_counts() method returns the count of each unique value of a variable.
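
As a sketch on a made-up target column, the counts can also be expressed as proportions, which makes class imbalance easier to read:

```python
import pandas as pd

# Synthetic target column: 1 = enrolled, 0 = not enrolled.
enrolled = pd.Series([1, 1, 0, 1, 0, 1, 1, 0], name="enrolled")

counts = enrolled.value_counts()                # absolute counts
shares = enrolled.value_counts(normalize=True)  # proportions
print(counts)
print(shares)
```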

7. Missing Value check

We can observe that enrolled_date has missing values. We will drop this column, as it is not significant for our target variable.

8. Distribution of users who enrolled and did not enroll in the app.
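
One common way to run this check is to count nulls per column and then drop the column. The sketch below does so on synthetic data with invented dates.

```python
import pandas as pd

# Synthetic frame in which enrolled_date is missing for some users.
df = pd.DataFrame({
    "age": [23, 31, 28, 45],
    "enrolled": [1, 0, 1, 0],
    "enrolled_date": ["2013-06-01", None, "2013-07-12", None],
})

# Number of missing values in each column.
missing = df.isnull().sum()
print(missing)

# Drop the column with missing values.
df = df.drop(columns=["enrolled_date"])
print(df.columns.tolist())
```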


9. Univariate Analysis

In univariate analysis we consider only one variable at a time and summarize each field in the data. Count plots and box plots are examples of univariate graphs.
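
The plots in this guide were presumably drawn with a plotting library such as seaborn. The sketch below uses plain matplotlib as a stand-in and synthetic data, drawing a count plot for one categorical variable and a box plot for one numeric variable.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Synthetic data used only for illustration.
df = pd.DataFrame({
    "enrolled": [1, 1, 0, 1, 0, 1, 1, 0],
    "age": [22, 25, 27, 30, 31, 33, 35, 80],
})

fig, (ax_count, ax_box) = plt.subplots(1, 2, figsize=(8, 3))

# Count plot: one bar per category of a single variable.
counts = df["enrolled"].value_counts().sort_index()
ax_count.bar(counts.index.astype(str), counts.values)
ax_count.set_title("enrolled")

# Box plot: spread of a single numeric variable.
ax_box.boxplot(df["age"].to_numpy())
ax_box.set_title("age")

fig.tight_layout()
```

In a notebook the figure is displayed inline; in a script, fig.savefig can write it to a file.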

Distribution of Target variable

Key Insights: More users have enrolled than have not.

Distribution of Minigame feature

Key Insights: Few users have used this feature.

Distribution of Used_premium_feature

Key Insights: Few users have used this feature.

Histogram Plot for Numerical Columns


Key Insights:

The "daysofweek" values are fairly evenly distributed, with a slight dip on weekdays, probably because people are busier then.
Customer ages are concentrated in the 20-35 range and taper off after 40, so most people in the data belong to the 20-35 age range.
Far more customers have not used premium features or played the minigame than have.
The "numscreens" variable has its highest density at around 15-25 screens.


10. Correlation of Numerical features with Response Variable


Key Insights:

The "daysofweek" variable shows the highest correlation with the target variable, and "hour" the second highest.
In conclusion, the variables are not normally distributed, but since most of them have a negligible correlation with the target variable, we can overlook their distributions.
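
One way to compute the correlation of each numeric feature with the response variable is DataFrame.corrwith. This is a sketch on synthetic data, so the resulting values do not reproduce the insights above.

```python
import pandas as pd

# Synthetic data; "enrolled" plays the role of the response variable.
df = pd.DataFrame({
    "hour": [1, 5, 9, 13, 17, 21],
    "age": [22, 35, 28, 41, 30, 26],
    "numscreens": [10, 25, 14, 30, 18, 12],
    "enrolled": [0, 1, 0, 1, 1, 0],
})

# Pearson correlation of every feature column with the target.
features = df.drop(columns=["enrolled"])
corr_with_target = features.corrwith(df["enrolled"]).sort_values(ascending=False)
print(corr_with_target)
```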

11. Analyzing the relationship between variables i.e., Correlation Matrix

Correlation analysis is another important step in EDA. Before building any machine learning model, it is necessary to check the correlation between the independent and dependent variables.
A correlation matrix is used to find the correlation between different variables. The correlation coefficient tells us how strongly two variables are related to each other.
Correlation helps us see how a change in one variable is associated with a change in the other.
Values close to +1 or -1 in the correlation matrix indicate the most strongly correlated variables, while values near 0 indicate little linear relationship.
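
The points above can be sketched with DataFrame.corr, which computes pairwise Pearson coefficients by default. The data is synthetic and chosen so that age and numscreens move in opposite directions.

```python
import pandas as pd

# Synthetic data: numscreens falls as age rises.
df = pd.DataFrame({
    "age": [22, 35, 28, 41, 30, 26],
    "numscreens": [40, 20, 30, 12, 25, 33],
    "minigame": [1, 0, 1, 0, 0, 1],
})

# Pairwise Pearson correlation; the diagonal is always 1.
corr = df.corr()
print(corr.round(2))
```

A heat map of this matrix is a common way to present it visually.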


Key Insights:

"hour" is positively correlated with the "daysofweek" variable, as people might be spending more hours in the app as the weekend approaches.
"minigame" is most strongly correlated with "used_premium_feature", which suggests that people who play the minigame tend to purchase the premium features.
"used_premium_feature" is also highly correlated with the "numscreens" variable, possibly because customers explore more screens once they have a premium feature unlocked.
"minigame" and "numscreens" are also positively correlated, as people may get to know more screens while playing the minigame.
"age" is negatively correlated with the "numscreens" variable, which is explainable: as age increases, people become more selective and, instead of exploring more screens, they know what they want and go through fewer screens in the app.

An unexplained negative correlation exists between the "numscreens" and "hour" variables. Ideally, when the number of screens visited is higher, the time spent should also be higher, but it looks like people are just browsing through screens and, perhaps not finding what they want, exiting the app.

12. Detecting Outliers using BoxPlot

Outliers are data points that differ markedly from the other data points. They lie far from the other observations in the dataset, which means they are either very large or very small.

Key Insights: The age and numscreens features have some outliers.

13. Removing Outliers with IQR


The interquartile range (IQR) is used to remove outliers from the dataset. The sorted data is split into four equal parts by three quartiles: Q1, Q2, and Q3.

Q1: the 25th percentile
Q2: the 50th percentile (the median)
Q3: the 75th percentile

IQR = Q3 - Q1

Outliers are the data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
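
The rule above can be sketched as follows on synthetic ages, assuming pandas' default linear interpolation for the quantiles. The value 95 lies above the upper fence and is dropped.

```python
import pandas as pd

# Synthetic ages with one extreme value.
df = pd.DataFrame({"age": [22, 25, 27, 30, 31, 33, 35, 95]})

q1 = df["age"].quantile(0.25)  # 25th percentile
q3 = df["age"].quantile(0.75)  # 75th percentile
iqr = q3 - q1

# Fences: anything outside [lower, upper] is treated as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Keep only the rows whose age lies within the fences.
filtered = df[df["age"].between(lower, upper)]
print(df.shape, "->", filtered.shape)
```

The same filter can be applied to each column that shows outliers, such as numscreens.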

We can observe that after removing the outliers, the shape of the data has changed.

Conclusion

Data is critical for every organization. Finding insights in data is an important task, and exploratory data analysis plays a central role in it. We have seen how EDA makes our work easier by digging into the data to summarise distributions, reveal correlations, and detect and remove outliers.