A Complete Guide on Exploratory Data Analysis
Summary
What is Exploratory Data Analysis (EDA), and what do you understand by it? Did you know that EDA can be applied to customer churn analysis based on previous customers' behaviour? This guide covers the essentials of Exploratory Data Analysis.
Data is an essential asset of every organization. Data quality plays an important role in helping organizations make better decisions. Data Science uses scientific approaches, procedures, algorithms, and frameworks to extract knowledge and insight from vast amounts of data. The data can be either structured or unstructured.
Exploratory data analysis (EDA) is an approach to analyzing and summarizing datasets to identify patterns, trends, and relationships. It is a crucial step in the data science life cycle because it helps you to get a better understanding of your data, identify any issues or problems with the data, and formulate hypotheses for further analysis.
Whenever we work with data, EDA is an important step: it is the process of gathering insights from the data and uncovering hidden patterns in it.
Why EDA?
The primary purpose of EDA is to clean the data and find useful insights from the data. Exploratory data analysis is crucial, and generally the first exercise in data mining.
EDA allows us to visualize data in order to understand it and to generate hypotheses for further analysis. Exploratory analysis centers on creating a summary of the data, or insights that guide the next steps in a data mining project.
Exploratory Data Analysis has the following primary advantages:
- EDA helps us to clean the dataset.
- It helps us to identify relationships between variables.
- It helps to detect outliers in the dataset.
- EDA helps to choose suitable algorithms for model building.
- EDA helps to select the important features from the dataset.
Components of EDA
There are many different techniques that can be used as part of EDA, including:
- Univariate Analysis: This is a type of analysis where we visualize a single variable at a time. It can include techniques such as calculating summary statistics (e.g., mean, median, mode, etc.), plotting histograms or density plots, and identifying any outliers.
- Bivariate Analysis: This involves visualizing the relationship between two variables. It can include techniques such as scatter plots, correlation analysis, and regression analysis.
- Multivariate Analysis: This is generally done between three or more variables. It can include techniques such as principal component analysis, cluster analysis, and multivariate regression.
- Data Cleaning and Preparation: This involves identifying and fixing any errors, inconsistencies, or missing values in the data. It is a crucial step in EDA, as these issues can distort your analysis and lead to misleading conclusions.
- Data Visualization: This involves creating plots and charts to help you understand and communicate the patterns and trends in your data. There are many different types of plots and charts that can be used, including line plots, bar charts, box plots, and heat maps.
How to Perform EDA?
Let's explore the steps of exploratory data analysis in detail, using a customer churn analysis based on customers' behaviour in website or app data.
We will classify which kinds of customers are likely to sign up for the paid subscription of a website. After analysing and classifying the dataset, we will be able to target marketing or recommendations at the customers who are likely to sign up for the paid subscription plan.
Import the Libraries:
1. The data is stored in CSV format, so we import it with pd.read_csv.
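A minimal sketch of this step, assuming pandas is installed. The article does not give the CSV file name, so the example reads a few made-up rows from an in-memory string instead. The column names are features mentioned in this guide, and `enrolled` is an assumed name for the target column.

```python
import io

import pandas as pd

# Made-up rows standing in for the real CSV file (file name not given in the article).
# "enrolled" is an assumed name for the target column.
csv_text = """age,numscreens,minigame,used_premium_feature,enrolled
23,15,0,0,1
31,22,1,1,1
45,8,0,0,0
"""

# With a real file on disk, pass its path instead of the StringIO object.
df = pd.read_csv(io.StringIO(csv_text))
print(df)
```

pd.read_csv accepts a file path or any file-like object, which is why the in-memory string works here.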
2. How many entries (rows) and attributes (columns) are present in the data? What is the shape of the data?
The .shape attribute returns the number of rows and the number of columns in the dataset. In our dataset, we have 50000 rows and 12 columns.
3. Display the first 5 entries of the data.
.head() method gives the first 5 rows of the dataset. It is useful for seeing some example values for each variable.
4. What are the different features available in the data?
The .columns attribute returns the labels of all the columns in the dataset.
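Steps 2 to 4 can be sketched together as follows. The small frame is made up for illustration and is far smaller than the 50000 x 12 dataset described here; `enrolled` is an assumed column name.

```python
import pandas as pd

# Tiny made-up frame standing in for the real dataset.
df = pd.DataFrame({
    "age": [23, 31, 45, 27, 36, 52],
    "numscreens": [15, 22, 8, 30, 12, 5],
    "enrolled": [1, 1, 0, 1, 0, 0],  # assumed name for the target column
})

n_rows, n_cols = df.shape        # an attribute, so no parentheses
first_rows = df.head()           # first 5 rows by default
column_names = list(df.columns)  # column labels as a plain list

print(n_rows, n_cols)
print(first_rows)
print(column_names)
```

Note that .shape and .columns are attributes, while .head() is a method and takes an optional row count, for example df.head(10).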
5. Display the distribution of Numerical Variables.
The .describe() method summarizes the count, mean, standard deviation, minimum, quartiles (25%, 50%, 75%), and maximum of numeric variables. Comparing the mean with the median (the 50% row) helps to gauge skewness in the data.
The key inference from this summary: age and numscreens have some outliers. We will treat them in the outlier sections below.
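As a rough illustration of how such a summary can hint at outliers, here is a sketch on made-up values. Each column contains one deliberately extreme value; these are not the article's data.

```python
import pandas as pd

# Made-up values; the last entry in each column is an intentional outlier.
df = pd.DataFrame({
    "age": [22, 25, 27, 29, 31, 33, 35, 90],
    "numscreens": [10, 14, 15, 18, 20, 22, 25, 200],
})

summary = df.describe()
print(summary)

# A mean well above the median (the "50%" row) suggests right skew or large outliers.
mean_above_median = summary.loc["mean"] > summary.loc["50%"]
print(mean_above_median)
```

A large gap between the 75% row and the max row is another quick signal worth checking before plotting.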
6. Distribution of Target Variable
The .value_counts() method returns the count of each unique value of a variable.
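A small sketch of this check, assuming the target column is called `enrolled` (the article does not state its name) and using made-up labels:

```python
import pandas as pd

# Made-up target labels; "enrolled" is an assumed column name.
df = pd.DataFrame({"enrolled": [1, 1, 0, 1, 0, 1, 1, 0]})

counts = df["enrolled"].value_counts()                # absolute counts per class
shares = df["enrolled"].value_counts(normalize=True)  # class proportions

print(counts)
print(shares)
```

Passing normalize=True is a convenient way to see class imbalance as proportions rather than raw counts.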
7. Missing Value check
We can observe that enrolled_date has missing values. We will drop this column, as it is not significant for our target variable.
8. Distribution of users who enrolled and who did not enroll in the app.
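One common way to run this check is isnull().sum(), followed by drop() for the column. The sketch below uses made-up rows; only the enrolled_date column name comes from the text above, and `enrolled` is an assumed name.

```python
import pandas as pd

# Made-up rows; enrolled_date is missing for users who did not enroll.
df = pd.DataFrame({
    "age": [23, 31, 45, 27],
    "enrolled": [1, 0, 1, 0],  # assumed name for the target column
    "enrolled_date": ["2013-01-05", None, "2013-02-11", None],
})

missing_per_column = df.isnull().sum()  # number of missing values in each column
print(missing_per_column)

df = df.drop(columns=["enrolled_date"])  # returns a new frame without the column
print(df.columns.tolist())
```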
9. Univariate Analysis
In univariate analysis, we consider only one variable at a time and produce summary statistics or a plot for each field in the data. Count plots, box plots, and similar charts are univariate graphs.
- Distribution of Target variable
Key Insights: The number of users who enrolled is higher than the number who have not.
- Distribution of Minigame feature
Key Insights: Relatively few users have used this feature.
- Distribution of Used_premium_feature
Key Insights: Relatively few users have used this feature.
- Histogram Plot for Numerical Columns
Key Insights:
- The "daysofweek" values are roughly evenly distributed, with a slight dip on weekdays, possibly because people are busier then.
- Customer age is concentrated in the 20-35 year range and tails off after 40, so most people in the data belong to the 20-35 age group.
- Customers who have not used premium features or played the minigame far outnumber those who have.
- The "numscreens" variable is most densely distributed at around 15-25 screens.
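The counts behind a histogram can also be computed directly, which is handy for checking what a plot shows. The sketch below bins made-up ages with pd.cut; the bin edges are chosen only for illustration.

```python
import pandas as pd

# Made-up ages; not the article's data.
ages = pd.Series([19, 22, 24, 25, 27, 28, 30, 31, 33, 34, 38, 41, 47, 55], name="age")

# Illustrative bin edges; intervals are right-closed: (15, 20], (20, 35], ...
bins = [15, 20, 35, 40, 60]
binned = pd.cut(ages, bins=bins)

# Count the values in each bin, keeping the bins in ascending order.
hist_counts = binned.value_counts().sort_index()
print(hist_counts)
```

If matplotlib is installed, df.hist() draws these counts for every numeric column of a frame, and seaborn's countplot does the same for categorical columns.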
10. Correlation of Numerical features with Response Variable
Key Insights:
- "daysofweek" shows the highest correlation with the target variable, and "hour" the second highest.
- In conclusion, the variables are not normally distributed, but since most of them have a negligible correlation with the target variable, we can overlook their distributions.
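One way to compute these correlations is DataFrame.corrwith. The sketch below uses made-up values and assumes the response column is named `enrolled`, so the resulting numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up values; "enrolled" is an assumed name for the response variable.
df = pd.DataFrame({
    "hour": [1, 5, 9, 13, 17, 21],
    "age": [40, 35, 30, 28, 25, 22],
    "numscreens": [5, 30, 12, 25, 8, 20],
    "enrolled": [0, 0, 0, 1, 1, 1],
})

features = df.drop(columns=["enrolled"])

# Pearson correlation of each feature with the response, strongest first.
corr_with_target = features.corrwith(df["enrolled"]).sort_values(
    key=lambda s: s.abs(), ascending=False
)
print(corr_with_target)
```

Sorting by absolute value keeps strong negative correlations near the top instead of at the bottom of the list.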
11. Analyzing the relationship between variables i.e., Correlation Matrix
- Correlation is another important step in EDA. Before building any machine learning model, it is necessary to check the correlation between the independent and dependent variables.
- A correlation matrix shows the correlation between each pair of variables. The correlation coefficient tells us how strongly two variables are related.
- Correlation indicates how two variables change together; on its own, it does not show that one causes the other.
- Coefficients close to +1 or -1 in the correlation matrix indicate strongly correlated variables, while values near 0 indicate little linear relationship.
Key Insights:
- "hour" is positively correlated with "daysofweek", possibly because people spend more hours in the app as the weekend approaches.
- "minigame" is most correlated with "used_premium_feature", which suggests that people who play the minigame tend to purchase the premium features.
- "used_premium_feature" is also highly correlated with "numscreens", perhaps because customers explore more screens once a premium feature is unlocked.
- "minigame" and "numscreens" are also positively correlated, perhaps because playing the minigame exposes people to more screens.
- "age" is negatively correlated with "numscreens". A plausible explanation is that older users are more selective: rather than exploring many screens, they know what they want and visit fewer screens in the app.
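A correlation matrix like the one discussed above can be computed with DataFrame.corr. The values below are made up so that age and numscreens move in opposite directions; they are not the article's data.

```python
import pandas as pd

# Made-up values using column names from this guide.
df = pd.DataFrame({
    "age": [22, 25, 30, 35, 40, 50],
    "numscreens": [40, 35, 28, 20, 15, 8],
    "minigame": [1, 1, 0, 1, 0, 0],
    "used_premium_feature": [1, 1, 0, 0, 0, 0],
})

# Pairwise Pearson correlation (the default method) between all numeric columns.
corr_matrix = df.corr()
print(corr_matrix.round(2))
```

If seaborn is installed, the matrix can be passed to its heatmap function to produce the familiar colored grid.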
12. Detecting Outliers using BoxPlot
Outliers are data points that differ markedly from the rest of the data. They lie far from the other observations in the dataset, meaning they are either very large or very small.
Key Insights: the age and numscreens features have some outliers.
13. Removing Outliers with IQR
The interquartile range (IQR) can be used to detect and remove outliers in the dataset. The sorted data is first split into four quarters by three quartiles, Q1, Q2, and Q3.
- Q1: represents the 25th percentile
- Q2: represents the 50th percentile (the median)
- Q3: represents the 75th percentile
IQR = Q3 - Q1
Outliers are the data points that fall below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR.
We can observe that after removing outliers, the shape of the data changes, because the rows containing outliers have been dropped.
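The IQR rule can be sketched as a small helper. The function name remove_outliers_iqr and the sample values are hypothetical and used only for illustration; the 1.5 multiplier is the one given in this section.

```python
import pandas as pd

# Made-up values; 95 is an intentional outlier in "age".
df = pd.DataFrame({
    "age": [22, 24, 25, 27, 28, 30, 31, 33, 35, 95],
    "numscreens": [10, 12, 14, 15, 16, 18, 20, 22, 24, 25],
})

def remove_outliers_iqr(frame, column, k=1.5):
    """Keep rows whose value in `column` lies within [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = frame[column].quantile(0.25)
    q3 = frame[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return frame[frame[column].between(lower, upper)]

print(df.shape)       # shape before removal
cleaned = remove_outliers_iqr(df, "age")
print(cleaned.shape)  # shape after removal: fewer rows, same columns
```

Because the filter drops whole rows, removing outliers from one column also removes those rows' values in every other column.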
Conclusion
Data is critical, and finding insights in it is where exploratory data analysis plays an important role. We have seen how EDA makes our work easier by digging deep into the data and surfacing findings that can guide more accurate results.
Overall, the goal of EDA is to gain a deep understanding of your data and to identify any interesting or unusual patterns that you can investigate further. By performing EDA, you can better understand your data, identify potential problems or biases, and develop hypotheses about what may be driving the patterns you see.
This article walked through exploratory data analysis step by step and explained why it is important.