Top 20 Data Science Interview Questions And Answers
As the world becomes more data-driven, the demand for Data Science professionals continues to increase. Data Science roles exist in large numbers across a wide range of organizations, including corporations, nonprofit organizations, and government agencies.
If you have decided to kick-start your career in this fast-growing field, you are on the right path. Here are the top 20 Data Science interview questions and answers to help you prepare for your next interview.
1. What do you understand by the term Data Science? What is the significance of Data Science in the modern world?
Ans. Data Science is an interdisciplinary domain that makes use of algorithms, machine learning techniques, and scientific processes, to detect patterns in data and to uncover insights using statistical and mathematical analysis.
The following are the basic steps involved in Data Science:
 Collating business requirements
 Collecting relevant data
 Performing the following processes in sequential order: data cleaning, data warehousing, data staging, and data architecture
 Subjecting the cleansed data to various algorithms based on business requirements
 Visualizing the insights in order to communicate to the business
In the modern world, data is referred to as the ‘new oil’. Making sense out of data is tantamount to doing away with uncertainties. Data Science enables organizations to make data-informed strategies and choices based on the insights unearthed from data. From customer experience and retention to healthcare and governance, data is proactively used across sectors to improve the quality of decision-making.
2. What does a low or high p-value mean?
Ans. A p-value is the probability of obtaining results at least as extreme as the ones actually observed, assuming that the null hypothesis is true.
There are three cases, using the conventional significance level of 0.05:
A p-value close to 0.05 is marginal; the evidence could go either way.
A low p-value (≤ 0.05) means the observed data would be unlikely if the null hypothesis were true, so the null hypothesis is rejected.
A high p-value (> 0.05) means the observed data are consistent with the null hypothesis, so we fail to reject it. This does not prove that the null hypothesis is true.
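As a rough illustration of the definition above, the sketch below computes a one-sided p-value for a coin-flip experiment with an exact binomial test. The experiment (16 heads in 20 flips), the function name `binomial_p_value`, and the 0.05 threshold are assumptions chosen for this example only.

```python
from math import comb

def binomial_p_value(successes: int, trials: int, p_null: float = 0.5) -> float:
    """One-sided p-value: P(X >= successes) if the null hypothesis is true."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical experiment: 16 heads in 20 flips; H0 says the coin is fair.
p = binomial_p_value(16, 20)
alpha = 0.05  # conventional significance level (an assumption, not a law)

print(f"p-value = {p:.4f}")
print("reject H0" if p <= alpha else "fail to reject H0")
```

With these numbers the p-value falls well below 0.05, so the fair-coin hypothesis would be rejected; 10 heads in 20 flips would give a high p-value instead.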
3. How is supervised learning different from unsupervised learning?
Ans. Supervised and unsupervised machine learning are two key concepts in Data Science. The majority of tools used in data science to predict outcomes depend on supervised machine learning techniques.
Supervised learning trains a model on labeled data, where each input comes with a known output, so that the model can predict the output for new inputs. Two prominent methods used in supervised learning are classification and regression.
Unsupervised machine learning does not require labeled data. The main objective of unsupervised learning is to find the hidden structure in data. Two examples of unsupervised machine learning are clustering and association.
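The contrast can be sketched on a toy one-dimensional data set. The data, the label names, and the helper functions below are all hypothetical; a nearest-centroid classifier stands in for supervised learning and a simple 2-means loop for unsupervised learning.

```python
from statistics import mean

# Hypothetical 1-D measurements with known labels (invented for illustration).
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]

# Supervised: learn one centroid per label from the labeled examples.
def fit_centroids(xs, ys):
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}

def predict(centroids, x):
    # Assign the label whose centroid is closest to x.
    return min(centroids, key=lambda label: abs(x - centroids[label]))

centroids = fit_centroids(points, labels)
print(predict(centroids, 4.5))

# Unsupervised: 2-means clustering sees only the points, never the labels.
def two_means(xs, iterations=10):
    c0, c1 = min(xs), max(xs)  # simple deterministic starting centroids
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return g0, g1

print(two_means(points))
```

The classifier needs `labels` to learn anything, while `two_means` recovers the same two groups from the raw values alone but cannot name them.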
4. If you have run the association rules algorithm on a dataset, and the rules {pineapple, apple} => {papaya} and {apple, orange} => {papaya} have been found to be relevant, what else must be true?
 {pineapple, apple, papaya, orange} must be a frequent itemset
 {pineapple, apple} => {orange} must be a relevant rule
 {papaya} => {pineapple, apple} must be a relevant rule
 {papaya, apple} must be a frequent itemset
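Ans. {papaya, apple} must be a frequent itemset. A relevant rule can only be generated from a frequent itemset, and every subset of a frequent itemset is itself frequent; {papaya, apple} is a subset of {pineapple, apple, papaya}.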
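The reasoning behind this answer can be checked with a small support calculation. The transactions below are invented for illustration, and `support` is a hypothetical helper, not part of any particular library.

```python
# Hypothetical transactions, invented purely for illustration.
transactions = [
    {"pineapple", "apple", "papaya"},
    {"apple", "orange", "papaya"},
    {"pineapple", "apple", "papaya"},
    {"apple", "orange", "papaya"},
    {"orange", "banana"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

rule_itemset = {"pineapple", "apple", "papaya"}  # items behind the first rule
subset = {"papaya", "apple"}
all_four = {"pineapple", "apple", "papaya", "orange"}

print(support(rule_itemset, transactions))
print(support(subset, transactions))
print(support(all_four, transactions))
```

A subset can never have lower support than the itemset that contains it, which is why {papaya, apple} must be frequent. In this made-up data the four-item set never occurs at all, so the first option does not have to hold.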
5. Define imbalanced data.
Ans. Imbalanced data is data that is distributed unequally across categories, so that some classes have far more examples than others. A model trained on such data tends to favor the majority class, which makes its predictions for the minority class inaccurate.
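A minimal sketch of why this matters, assuming a hypothetical data set with 95 negative and 5 positive examples and a "model" that always predicts the majority class:

```python
# Hypothetical labels: 95 negatives and 5 positives (an imbalanced data set).
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a trivial "model" that always predicts the majority class

# Accuracy: share of all predictions that are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall: share of actual positives that the model found.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}, recall = {recall:.2f}")
```

Accuracy looks high even though the model never finds a single positive case, which is why accuracy alone can be misleading on imbalanced data.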
6. When do you perform resampling?
Ans. Resampling is performed to estimate how accurate a sample statistic is and to quantify the uncertainty in population parameters. It involves repeatedly drawing samples from the available data; training a model on different subsets of a data set shows how stable its quality is.
Resampling is also performed when models need to be validated using random subsets.
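One common resampling technique is the bootstrap, sketched below as an example of quantifying uncertainty in a population mean. The sample values, the number of resamples, and the percentile method used for the interval are assumptions made for this illustration.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples=2000, seed=0):
    """Means of resamples drawn with replacement, sorted ascending."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return sorted(
        mean(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)
    )

# Hypothetical sample of 10 observations.
sample = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]
means = bootstrap_means(sample)

# Percentile interval: the middle 95% of the bootstrap means.
lower = means[int(0.025 * len(means))]
upper = means[int(0.975 * len(means)) - 1]

print(f"sample mean = {mean(sample)}, approx. 95% interval = ({lower}, {upper})")
```

The width of the interval gives a rough sense of how much the sample mean could vary if the study were repeated.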
7. What are confounding variables?
Ans. Confounding variables, or confounders, are extraneous variables that influence both the dependent and the independent variables. They produce a spurious association between variables that are statistically related but not causally related to each other.
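A small simulation can make this concrete. In the hypothetical setup below, temperature drives both ice cream sales and sunburn cases; the variable names, coefficients, and noise levels are all invented for illustration.

```python
import random
from statistics import mean

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(42)
temperature = [rng.uniform(10, 35) for _ in range(1000)]  # the confounder
ice_cream_sales = [2.0 * t + rng.gauss(0, 5) for t in temperature]
sunburn_cases = [1.5 * t + rng.gauss(0, 5) for t in temperature]

# The two outcomes are strongly correlated although neither causes the other.
print(pearson(ice_cream_sales, sunburn_cases))

# Removing the confounder's contribution leaves only independent noise.
resid_sales = [s - 2.0 * t for s, t in zip(ice_cream_sales, temperature)]
resid_burns = [b - 1.5 * t for b, t in zip(sunburn_cases, temperature)]
print(pearson(resid_sales, resid_burns))
```

The first correlation is high and the second is close to zero, showing that the association comes entirely from the shared dependence on temperature.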
8. Differentiate between Data Science and Data Analytics.
Ans. Data Science is used to unearth useful intelligence from the data available; these insights further add value to the business or the organization by enhancing the quality of decision making. It employs algorithms and scientific methods to extract useful information from available data.
Data analytics refers to a set of processes and techniques that are used to make conclusions from raw data through analysis. It makes use of mechanical processes and automated algorithms that detect metrics that could potentially be easily lost in massive amounts of data.
The following are the core differences between Data Science and Data Analytics:
| | Data Science | Data Analytics |
|---|---|---|
| Processes | Procuring, organizing, and storing data | Analysis of raw data to find answers |
| Area of Focus | Learning frameworks for modeling data | Statistics and analytics to extract useful intelligence |
| Purpose | Develops methods to ask questions | Analyses data to answer specific queries |
| Goals | Establishes trends | Detects trends and supports decision-making |
| Subfields | Machine learning, AI, search engine engineering, etc. | Data warehousing, statistical analytics, market-related analysis, etc. |
| Approach | Mathematical approach | Statistical approach |
| Advantages | Prevents redundancy | Detects trends from massive amounts of data without loss |
| Applications | Applied in large and small-scale data-driven organizations | Applied in large and small-scale data-driven organizations for business data analytics |
9. What are the steps in making a decision tree?
Ans. Follow these steps to create a decision tree:
 Treat your entire data as your input.
 Find the entropy of the target variable and the predictor attributes.
 Calculate the information gain of all the attributes.
 Your root node is the attribute with the highest information gain.
 Repeat these steps on every branch until the decision node is finalized for all the branches.
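The entropy and information gain steps above can be sketched as follows. The four-row weather table and the function names are hypothetical, and the sketch only selects the root node rather than building the full tree.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting."""
    base = entropy([row[target] for row in rows])
    remainder = 0.0
    for value in {row[attribute] for row in rows}:
        subset = [row[target] for row in rows if row[attribute] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder

# Hypothetical training data; "play" is the target variable.
rows = [
    {"outlook": "sunny", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "yes"},
]

gains = {a: information_gain(rows, a, "play") for a in ("outlook", "windy")}
root = max(gains, key=gains.get)  # attribute with the highest information gain
print(gains, root)
```

Here `outlook` separates the classes perfectly while `windy` tells us nothing, so `outlook` becomes the root. A full tree would repeat the same calculation on each branch.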
10. Explain the ROC curve.
Ans. The ROC (receiver operating characteristic) curve is a graph used in binary classification. It plots the False Positive Rate (FPR) on the x-axis against the True Positive Rate (TPR) on the y-axis.
FPR is the ratio between False Positives and the total number of negative samples, and TPR is the ratio between True Positives and the total number of positive samples. To plot the ROC curve, the TPR and FPR values are computed at multiple threshold values. The area under the ROC curve (AUC) lies between 0 and 1; a value of 0.5 corresponds to random guessing and 1 to a perfect classifier.
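To make the construction concrete, the sketch below computes the (FPR, TPR) points at each threshold and the area under the curve with the trapezoidal rule. The labels and scores are made up, and the function names are illustrative.

```python
def roc_points(y_true, scores):
    """(FPR, TPR) pairs, one per distinct score threshold, from strict to lax."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(points, points[1:])
    )

# Hypothetical true labels and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(f"AUC = {auc(points):.3f}")
```

For this data the AUC works out to 8/9, meaning that in 8 of the 9 positive-negative pairs the positive example receives the higher score.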
11. Do you think R is the best fit for visualization purposes? If yes, justify.
Ans. R is one of the best tools for data visualization for the following reasons:
 R has numerous built-in functions and several visualization libraries.
 Almost any kind of graph can be created using R.
 R lets you customize your graphics easily.
 R is of great use in Exploratory Data Analysis and feature engineering.
12. What would be your course of action if your data goes missing?
Ans. Raw data sets very often contain missing values. This is one of the most common challenges that data science professionals encounter. Data missing not at random (MNAR) is considered a critical issue because it can bias the results. However, if data is missing at random (MAR) or missing completely at random (MCAR), the bias in the study will not increase much.
The following are some tips to handle missing data (random or nonrandom):
 Meticulously plan your study as well as your data collection.
 Create a manual of operations at the very beginning of the study.
 Make sure all personnel associated with data are properly trained.
 Document your missing data, especially in the instance of eliminating it.
 Employ data analysis methods to handle your missing data.
 Adopt approaches like pairwise, listwise, or casewise deletion.
*Please note that certain methods mentioned above may vary depending on the field of study or the case. Make sure you answer accordingly.
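Two of the simplest analysis-time options can be sketched as follows: listwise (casewise) deletion, which drops any record with a missing value, and mean imputation, which is shown here as one possible data analysis method rather than a recommendation. The records and the `impute_mean` helper are hypothetical.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": None, "income": 61000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
]

# Listwise (casewise) deletion: keep only fully observed records.
complete = [r for r in records if None not in r.values()]

def impute_mean(rows, column):
    """Replace missing values in `column` with the mean of the observed ones."""
    fill = mean(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

imputed = impute_mean(impute_mean(records, "age"), "income")

print(len(records), "records,", len(complete), "after listwise deletion")
print(imputed)
```

Deletion halves this tiny data set, while imputation keeps every record at the cost of understating variability. Which trade-off is acceptable depends on why the values are missing.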
13. What is the difference between data profiling and data mining?
Ans. Data profiling is a process of analyzing raw data in order to characterize the information embedded within a dataset. Data profiling helps to gain insight into the content of the dataset and the qualitative characteristics of its values.
Data mining is done to uncover knowledge in the form of correlations, patterns, and anomalies from large amounts of data. It uses quantitative methods such as clustering, classification, neural networks, etc. to extract knowledge. The results obtained are then used to make predictions.
14. What is the difference between regression and classification?
Ans. Regression and classification are both types of supervised machine learning. Both use training datasets to learn how to predict the outcome for test or new data. The objective of classification is to predict the category of a new observation, such as spam or not spam. Regression predicts a continuous quantity, such as a price.
15. List the types of biases that can occur during sampling.
Ans. The following are the types of biases that can occur during sampling:
 Selection bias
 Survivorship bias
 Undercoverage bias
16. What is survivorship bias?
Ans. Survivorship bias is the logical error of focusing only on the people or things that made it through a selection process and overlooking those that did not, typically because they are less visible. This can lead to skewed conclusions.
17. What is the difference between analytics and analysis?
Ans. These terms are often used interchangeably. However, they are two distinct processes.
Analytics is performed to predict future events and make informed guesses and extensions of trends. It is the application of deduction and computational techniques. For instance, one can analyze customer behavior by referring to past sales data. This can then be used to increase sales in the future.
Analysis is used to find answers to key questions in past events. We examine past events using analysis to find out if there was an increase or decrease in sales during a specific period.
18. What is a confidence interval?
Ans. A confidence interval is a range of values that is likely to contain the population parameter. The confidence level tells us how often intervals constructed in this way would contain the parameter over repeated sampling. The confidence coefficient is denoted by 1 − α, where α is the level of significance.
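A minimal sketch of a confidence interval for a mean, assuming a normal approximation (for small samples a t-distribution would usually be more appropriate). The sample values and the function name are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval for the population mean."""
    alpha = 1 - confidence                    # level of significance
    z = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided critical value
    margin = z * stdev(sample) / len(sample) ** 0.5
    centre = mean(sample)
    return centre - margin, centre + margin

# Hypothetical measurements.
sample = [12, 15, 9, 20, 14, 11, 18, 13, 16, 10]

low95, high95 = mean_confidence_interval(sample, confidence=0.95)
low99, high99 = mean_confidence_interval(sample, confidence=0.99)
print(f"95%: ({low95:.2f}, {high95:.2f})")
print(f"99%: ({low99:.2f}, {high99:.2f})")
```

Raising the confidence coefficient from 0.95 to 0.99 lowers α and widens the interval, which is the trade-off between certainty and precision.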
19. Briefly describe cross-validation.
Ans. Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is used mainly to estimate how well a model will perform on unseen data, and it helps to detect problems like overfitting.
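One widely used variant is k-fold cross-validation, in which the data is split into k parts and each part serves once as the test set. The sketch below only generates the index splits; the function name is illustrative, and fitting and scoring a model on each split is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample appears in exactly one test fold, so averaging the k test scores uses all of the data for evaluation without ever testing on rows the model was trained on. In practice the rows are usually shuffled before splitting.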
20. Which algorithm powers recommendations like ‘people who bought this also bought’?
Ans. The recommendation engine is powered by collaborative filtering. Collaborative filtering predicts what a customer may want from the behavior and purchase history of other customers, based on selections, ratings, etc. Item features are unknown in this algorithm.
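A very simplified, item-to-item flavor of collaborative filtering can be sketched by counting which items are bought together. The customers, items, and function names are invented; real recommendation engines use far larger data and more refined similarity measures.

```python
from collections import Counter
from itertools import permutations

# Hypothetical purchase histories keyed by customer.
purchases = {
    "ana": {"laptop", "mouse", "keyboard"},
    "ben": {"laptop", "mouse"},
    "cy": {"laptop", "mouse", "monitor"},
    "dee": {"monitor", "keyboard"},
}

def co_purchase_counts(baskets):
    """For each item, count how often every other item shares a basket with it."""
    counts = {}
    for items in baskets.values():
        for a, b in permutations(items, 2):
            counts.setdefault(a, Counter())[b] += 1
    return counts

def also_bought(item, baskets, top_n=2):
    """Items most often bought together with `item`."""
    ranked = co_purchase_counts(baskets)[item].most_common(top_n)
    return [other for other, _ in ranked]

print(also_bought("laptop", purchases, top_n=1))
```

Note that the code never looks at what a laptop or a mouse is; the recommendation comes purely from customer behavior, which matches the point that item features are unknown.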
If you are looking for credible hands-on experience and intensive training led by experts to become job-ready, join OdinSchool's Data Science Bootcamp.