Crack your Next Data Science Interview with these Top 22 Questions
Summary
If you're pursuing a career as a data science professional, you need to be ready to impress potential employers with your expertise.
You must be able to ace your upcoming data science interview, and in order to accomplish that, here are the top 22 data science interview questions to expect.
Data science is one of the most in-demand fields today. Businesses in every sector have recognized that they need data science professionals to work with data and improve profitability, and large companies are actively hiring in this discipline. Jobs in data science are expected to grow by 30%.
Easy but Mandatory Steps
Top Data Science Interview Questions
Following are the top 22 data science interview questions you should prepare for if you want to crack your interview.
Q1: What distinguishes data science from conventional application programming?
Data Science involves analyzing and modeling large and complex datasets to extract insights, patterns, and trends, whereas traditional application programming focuses on developing software applications to perform specific tasks or functions.

Emphasis on statistical and machine learning techniques: When analysing data and formulating predictions or recommendations, data science heavily relies on statistical and machine learning techniques, as opposed to traditional application programming, which primarily concentrates on writing code to implement specific functionalities.

Data-centric approach: Data Science revolves around a data-centric approach, where data is the key driver of decision-making and problem-solving. In contrast, traditional application programming may not always have data as the central focus, but rather focuses on implementing functionalities or features.

Exploratory and iterative nature: Data Science often involves exploratory data analysis (EDA) and iterative modeling, where data scientists may need to experiment with different techniques and algorithms to find the best approach, whereas traditional application programming typically follows a more linear and structured development process.

Business and domain knowledge integration: Data Science often requires the integration of business and domain knowledge to understand the context and implications of data analysis and modeling results. Traditional application programming, on the other hand, may not always require deep business or domain knowledge.
It's important to note that while there are differences between Data Science and traditional application programming, they can also overlap in some areas, and the boundaries between the two can sometimes be blurry. The specific roles and responsibilities of a Data Scientist or an application programmer may vary depending on the organization and the project requirements.
Q2: What is bias in Data Science?
Bias in data science refers to the presence of systematic errors in data or models that can lead to inaccurate or unfair results. There are several types of bias that can impact data science:
 Sampling bias: This occurs when the data collected for analysis is not representative of the entire population, leading to a skewed or incomplete view of reality. For example, if a survey on customer satisfaction is conducted only among customers who voluntarily provide feedback, it may not capture the opinions of less satisfied customers who chose not to participate.
 Measurement bias: This occurs when there are errors or inaccuracies in the measurement or recording of data. For example, if a temperature sensor used to collect weather data is not calibrated properly, it may introduce measurement bias, leading to inaccurate temperature readings.
 Labeling bias: This occurs when the labels or categories assigned to data samples are subjective or discriminatory, leading to biased training data for machine learning models. For example, if a resume screening model is trained on biased labeled data that favors male applicants, it may result in gender bias in the hiring process.
 Algorithmic bias: This occurs when machine learning models are trained on biased data, leading to biased predictions or decisions. For example, if a facial recognition system is trained on a dataset with predominantly light-skinned individuals, it may have reduced accuracy for darker-skinned individuals, leading to racial bias in its predictions.
 Confirmation bias: This occurs when data scientists or analysts selectively choose or interpret data that confirms their preconceived notions or beliefs, leading to biased conclusions or recommendations.
Bias in data science can have serious consequences, including perpetuating discrimination, unfair decision-making, and inaccurate insights. Therefore, it is important for data scientists to be aware of and address potential biases in their data, models, and interpretations to ensure that their work is fair, transparent, and reliable. Techniques such as resampling, relabeling, recalibrating, and using diverse datasets can be employed to mitigate bias in data science. Additionally, incorporating ethical considerations and diverse perspectives in the data science process can help minimize bias and promote fairness in data-driven decision-making.
Q3: Why is only Python used for Data Cleaning in DS?
It's not accurate to say that only Python is used for data cleaning in Data Science. Data cleaning, also known as data preprocessing, can be done using various programming languages and tools depending on the preferences, requirements, and expertise of data scientists and practitioners. However, Python is a popular choice for data cleaning in Data Science due to 3 main reasons:
 Rich ecosystem of libraries: Python has a rich ecosystem of open-source libraries specifically designed for data manipulation and cleaning, such as Pandas, NumPy, and SciPy. These libraries provide powerful and efficient functions for handling missing values, filtering, transforming, and aggregating data, making data cleaning tasks easier and more convenient.
 Ease of use: Python is a well-liked choice among data scientists and practitioners because of its versatility and usability. Even for people with little programming knowledge, it is simple to learn and use thanks to its accessible syntax and thorough documentation. Python also integrates easily with other programs and libraries that are frequently used in data science workflows.
 Large community support: Python has a sizable and vibrant development and user community, so data scientists can easily find tutorials, forums, and other resources to assist with data cleaning tasks. Many data scientists favor Python because this community gives them access to a wealth of knowledge and expertise.
Q4: How do you build a random forest model?
Start by preparing your data for model building. This typically involves tasks such as data cleaning, handling missing values, encoding categorical variables, and splitting the data into training and testing sets.

Ensemble of decision trees: Random forest is an ensemble technique that combines many decision trees to produce predictions. Each tree is trained on a bootstrap sample of the training data (rows drawn at random with replacement), and only a random subset of features is considered when splitting. The result is a group of decision trees, each trained on a different subset of the data and features.

Tree building: For each decision tree in the ensemble, recursively split the data into subsets based on feature values that minimize the impurity or maximize the information gain at each split. Continue this process until a stopping criterion, such as a maximum depth or minimum number of samples per leaf, is met.

Voting mechanism: When making predictions, each decision tree in the ensemble contributes its prediction, and the final prediction is determined through a voting mechanism. For classification tasks, the class with the most votes is chosen as the predicted class, and for regression tasks, the average of the predicted values is taken as the final prediction.
Remember to always validate your model on unseen data to ensure its generalization performance and fine-tune the model as needed based on the evaluation results. Random forests are a popular and powerful machine learning technique for classification and regression tasks, known for their ability to handle complex data patterns and reduce overfitting compared to single decision trees. Some implementations can also handle missing values directly.
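The bootstrapping and voting steps above can be sketched in a few lines of plain Python. This is only an illustration of those two steps, not a full random forest: the helper names (`bootstrap_sample`, `majority_vote`, `average_vote`) and the toy votes are assumptions made for the example, and tree building itself is left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows at random with replacement (one bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Classification: return the label predicted by the most trees."""
    return Counter(predictions).most_common(1)[0][0]

def average_vote(predictions):
    """Regression: return the mean of the trees' predicted values."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)              # fixed seed so the sketch is repeatable
rows = list(range(10))              # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)

# Hypothetical predictions from five trees for a single input.
forest_label = majority_vote(["spam", "ham", "spam", "spam", "ham"])
forest_value = average_vote([2.0, 3.0, 4.0])
```

In practice a library implementation would handle both steps, along with tree building and feature subsampling.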
Q5: A data collection containing variables with more than 30% of their values missing is handed to you. How are you going to manage them?
 Assess the impact of missing values: Evaluate the impact of missing values on the analysis or modeling task, considering the type of missingness (MAR, MCAR, NMAR). This will help determine the appropriate handling approach.
 Impute missing values: Use imputation techniques, such as mean or median imputation, mode imputation, regression imputation, k-nearest neighbors imputation, or machine learning-based imputation, to fill in the missing values with estimated values. Choose the imputation method based on the data nature and the underlying assumptions of the analysis or modeling task.
 Consider multiple imputation: Alternatively, consider using multiple imputation, where missing values are imputed multiple times to account for uncertainty. This generates multiple complete datasets with imputed values and combines the results for a final result.
 Create indicator variables: Create separate indicator variables to represent the presence or absence of missing values for each variable. This captures information about missingness and can be included as a separate feature in the analysis or modeling task.
 Document the handling approach: Thoroughly document the approach chosen for handling missing values, including any assumptions made, for transparency and reproducibility in the analysis or modeling task.
Remember that handling missing values should be done carefully, taking into consideration the specific characteristics of the data and the goals of the analysis or modeling task, and consulting with domain experts if possible.
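As a rough illustration of two of the steps above (median imputation and an indicator variable), here is a minimal sketch in plain Python. The function name `impute_with_indicator` and the `ages` values are made up for the example, and `None` is assumed to mark a missing entry.

```python
from statistics import median

def impute_with_indicator(values):
    """Fill None entries with the median of the observed values.

    Also returns a 0/1 flag per entry recording where a value was missing,
    so that the missingness itself can be used as a feature.
    """
    observed = [v for v in values if v is not None]
    fill = median(observed)
    imputed = [fill if v is None else v for v in values]
    was_missing = [1 if v is None else 0 for v in values]
    return imputed, was_missing

ages = [25, None, 40, 31, None, 58]   # hypothetical column with gaps
imputed_ages, age_missing = impute_with_indicator(ages)
```

Median imputation is only one option. Whether it is appropriate depends on the missingness mechanism discussed in the first step.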
Q6: What do you understand about the true-positive rate and false-positive rate?
In binary classification problems, the true-positive rate (TPR), also known as sensitivity, recall, or hit rate, quantifies the percentage of positive cases that are correctly predicted as positive by a model.
TPR = True Positives / (True Positives + False Negatives)
In binary classification problems, the false-positive rate (FPR), also known as fall-out or false alarm rate, quantifies the percentage of negative cases that a model mistakenly predicts as positive.
FPR = False Positives / (False Positives + True Negatives)
A binary classification model's effectiveness is assessed using both the true-positive rate and the false-positive rate. They are frequently plotted on a receiver operating characteristic (ROC) curve, which illustrates the trade-off between TPR and FPR at various classification thresholds. The area under the ROC curve (AUC-ROC) is a commonly used metric to summarize the overall performance of a binary classification model, with higher values denoting better performance.
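The two formulas can be checked on a small example. The sketch below assumes labels are encoded as 1 (positive) and 0 (negative). The function name `tpr_fpr` and the sample labels are illustrative only.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true-positive rate, false-positive rate) for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
tpr, fpr = tpr_fpr(y_true, y_pred)   # TPR = 3/4, FPR = 1/6
```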
Q7: What are Exploding Gradients and Vanishing Gradients?
Exploding Gradients: They occur when the gradients during backpropagation become very large, causing the weights to be updated by excessively large values. This can result in the model's weights being updated in a way that overshoots the optimal values, leading to unstable training and poor convergence.
Exploding gradients are typically caused by deep networks with large weight initialization or activation functions that amplify the inputs to the point of saturation. Techniques to mitigate exploding gradients include gradient clipping, weight regularization methods (e.g., L1 or L2 regularization), and using weight normalization techniques (e.g., batch normalization) during training.
Vanishing Gradients: They occur when the gradients during backpropagation become very small, causing the weights to be updated by excessively small values. This can result in the model's weights being updated too slowly, leading to slow convergence and poor training performance.
Vanishing gradients are typically caused by deep networks with small weight initialization or activation functions that dampen the inputs, leading to gradients that approach zero. Techniques to mitigate vanishing gradients include using activation functions that have better gradient properties (e.g., ReLU, Leaky ReLU), initializing weights carefully (e.g., Xavier or He initialization), and using skip connections or residual connections to help gradients propagate more effectively through deep networks.
Both exploding and vanishing gradients are common challenges in deep learning and can severely impact the performance of neural networks. Proper weight initialization, activation functions, and regularization techniques can be used to mitigate these issues and ensure stable and effective training of deep neural networks.
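Two of the ideas above can be illustrated numerically. The sketch below shows gradient clipping by norm, one of the mitigations mentioned for exploding gradients, and a back-of-the-envelope view of why sigmoid activations encourage vanishing gradients. The function name `clip_by_norm` and the 20-layer depth are assumptions for the example, and the vanishing-gradient figure ignores the weights entirely.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale a gradient vector so its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]

# A gradient with norm 5 is scaled down to norm 1 and keeps its direction.
clipped = clip_by_norm([3.0, 4.0], max_norm=1.0)

# The sigmoid's slope is at most 0.25, so the activation part of the
# backpropagated gradient shrinks by at least that factor at every layer.
vanishing_factor = 0.25 ** 20   # upper bound after 20 sigmoid layers
```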
Q8: The likelihood that you will see a shooting star or a group of them in a period of 15 minutes is 0.2. What percentage of the time, if you are exposed to it for about an hour, will you see at least one star shooting from the sky?
Let p be the likelihood that at least one shooting star will be visible within a 15-minute window. Given that p = 0.2, the likelihood of spotting none at all in a 15-minute window is 1 - 0.2 = 0.8.
The likelihood of not seeing a shooting star in four consecutive 15-minute periods (which total an hour, or 60 minutes) must now be determined. Assuming the intervals are independent of one another, we can multiply the probabilities together.
Probability of not seeing a shooting star in a 60-minute interval = (Probability of not seeing a shooting star in a 15-minute interval) ^ 4
= 0.8 ^ 4
= 0.4096
Therefore, the probability of seeing at least one shooting star in an hour (60-minute interval) is the complement of the probability of not seeing any shooting star, which is:
Probability of seeing at least one shooting star in an hour = 1 - Probability of not seeing any shooting star in an hour
= 1 - 0.4096
= 0.5904
So, there is approximately a 59.04% chance of seeing at least one shooting star in an hour if the probability of seeing a shooting star or a bunch of them in a 15-minute interval is 0.2.
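The same arithmetic in Python, as a quick check. The variable names are chosen for this example, and the calculation relies on the independence assumption stated above.

```python
p_15 = 0.2                       # chance of at least one sighting in 15 minutes
n_intervals = 60 // 15           # four 15-minute windows in an hour

p_none = (1 - p_15) ** n_intervals   # no sighting in any of the four windows
p_at_least_one = 1 - p_none          # complement: at least one sighting

print(f"{p_at_least_one:.4f}")
```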
Q9: Explain the difference between Normalization and Standardization with an example.
Let's say you have a dataset of exam scores for three subjects: math, science, and history. The scores for math range from 60 to 100, the scores for science range from 30 to 90, and the scores for history range from 40 to 80. You want to preprocess the data to ensure that all the scores are on a common scale.
If you choose to normalize the data, you would rescale each subject's scores to a range of 0 to 1, typically with min-max scaling: subtract the subject's minimum score and divide by its range (maximum minus minimum). So, a score of 80 in math would be normalized to (80 - 60) / (100 - 60) = 0.5, a score of 60 in science to (60 - 30) / (90 - 30) = 0.5, and a score of 70 in history to (70 - 40) / (80 - 40) = 0.75.
If you decide to standardize the data, you would compute the mean and standard deviation of the scores for each subject, subtract the mean from each score, and divide the result by the standard deviation. The scores for each subject would then have a mean of 0 and a standard deviation of 1, making comparisons across subjects simple.
In short, normalization rescales data to a defined range, while standardization centers data around zero with unit variance.
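Here is a small sketch of both transformations, using min-max scaling as one common form of normalization. The five math scores are invented for illustration, and the population standard deviation (`pstdev`) is an assumption. Some tools use the sample standard deviation instead.

```python
from statistics import mean, pstdev

def min_max(xs):
    """Normalize: rescale values linearly so they span the range 0 to 1."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def z_score(xs):
    """Standardize: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(xs), pstdev(xs)
    return [(x - mu) / sigma for x in xs]

math_scores = [60, 70, 80, 90, 100]       # hypothetical scores
normalized = min_max(math_scores)         # values between 0 and 1
standardized = z_score(math_scores)       # mean 0, standard deviation 1
```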
Q10: Describe Markov chains
Markov chains are a type of mathematical model that describes the probabilistic transitions of a system between states, where the future state depends only on the present state and not on the past states. A Markov chain is defined by a set of states, the probabilities of transitioning between those states, and an initial state. Markov chains are frequently used to represent and analyze systems that behave randomly or vary over time in a stochastic way in many different disciplines, including statistics, computer science, economics, and biology.
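A tiny example may help make this concrete. The sketch below assumes a made-up two-state weather chain (state 0 is sunny, state 1 is rainy) with invented transition probabilities. The function name `step` is also just for illustration.

```python
def step(dist, P):
    """Advance a probability distribution over states by one transition.

    P[i][j] is the probability of moving from state i to state j.
    """
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

# Hypothetical transition matrix; each row sums to 1.
P = [[0.9, 0.1],    # sunny -> sunny, sunny -> rainy
     [0.5, 0.5]]    # rainy -> sunny, rainy -> rainy

today = [1.0, 0.0]                  # initial state: certainly sunny
tomorrow = step(today, P)           # depends only on today's distribution
day_after = step(tomorrow, P)       # depends only on tomorrow's distribution
```

Each step uses only the current distribution, which is the Markov property in action.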
Q11: Give one example where both false positives and false negatives are equally important.
One example where both false positives and false negatives are equally important is in medical testing for a life-threatening disease, such as cancer. In cancer screening, a false positive occurs when a person is mistakenly identified as having cancer when they do not, while a false negative occurs when a person with cancer is mistakenly identified as not having the disease. Both false positives and false negatives can have significant consequences:
False Positives: If a screening test produces a high rate of false positives, it can lead to unnecessary follow-up tests, treatments, and psychological distress for patients who do not have cancer. This can result in increased healthcare costs, unnecessary interventions, and potential harm from unnecessary treatments.
False Negatives: On the other hand, if a screening test produces a high rate of false negatives, it can result in missed diagnoses and delayed treatment for patients who do have cancer. This can lead to progression of the disease, reduced treatment options, and poorer health outcomes.
In this scenario, both false positives and false negatives are equally important as they can have significant implications for patient care and outcomes. Balancing the trade-off between false positives and false negatives is crucial in designing and evaluating the performance of medical screening tests to ensure accurate and timely detection of the disease while minimizing unnecessary interventions or missed diagnoses.
Q12: How do you know if a coin is biased?

Empirical Testing: This involves physically flipping the coin multiple times and recording the outcomes. If the coin is unbiased, it should produce roughly equal numbers of heads and tails over a large number of flips. However, if one side (heads or tails) consistently occurs more frequently than the other, it could indicate a biased coin.

Statistical Analysis: Statistical tests can be applied to the observed data from coin flips to determine if the coin is biased. For example, the chi-squared test or the binomial test can be used to assess if the observed frequencies of heads and tails deviate significantly from the expected frequencies of an unbiased coin.

Visual Inspection: Plotting the observed frequencies of heads and tails on a graph or a histogram can provide a visual indication of coin bias. If the distribution appears skewed or uneven, it may suggest that the coin is biased.

Comparison with Expected Probabilities: Coins are expected to be unbiased, meaning they have a 50% chance of landing on heads and a 50% chance of landing on tails. Therefore, comparing the observed frequencies of heads and tails with the expected probabilities (50% for each) can help identify if a coin is biased. If the observed frequencies consistently deviate from the expected probabilities, it may suggest a biased coin.
It's important to note that identifying coin bias may require a large number of flips to obtain statistically meaningful results. Additionally, other factors such as the shape, weight, and surface properties of the coin, as well as the flipping technique, can also affect the outcomes and need to be carefully controlled during testing.
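The binomial test mentioned above can be sketched with the standard library alone. This is one simple formulation of an exact two-sided test (summing the probabilities of all outcomes no more likely than the one observed). The function name and the 60-heads-in-100-flips figures are assumptions for illustration.

```python
from math import comb

def binomial_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided p-value for seeing `heads` in `flips` tosses.

    Sums the probabilities of every outcome that is no more likely
    than the observed one under the hypothesis P(heads) = p.
    """
    pmf = [comb(flips, k) * p**k * (1 - p) ** (flips - k)
           for k in range(flips + 1)]
    observed = pmf[heads]
    # Small relative tolerance guards against floating-point ties.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))

p_value = binomial_two_sided_p(heads=60, flips=100)
print(f"p-value: {p_value:.4f}")
```

A small p-value suggests the observed split would be unusual for a fair coin. The threshold for "small" (often 0.05) is a convention, not a rule.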
Q13: You pick one coin at random from a jar of 1000 coins, of which 999 are fair and 1 is double-headed, and toss it 10 times. Assume that you see 10 heads. Estimate the probability of getting a head in the next coin toss.
Based on the information provided, we can use Bayesian probability to estimate the probability of getting a head in the next coin toss.
Let's define the following events:
A: Coin selected from the jar is fair
B: Coin selected from the jar is double-headed
C: 10 heads are observed in 10 tosses
We first need to calculate P(A|C), the probability that the selected coin is fair given that 10 heads are observed in 10 tosses.
According to Bayes' theorem, we have:
P(A|C) = P(C|A) * P(A) / P(C)
where:
P(C|A): Probability of observing 10 heads in 10 tosses given that the coin is fair. Since a fair coin has a 0.5 probability of landing on heads, we have P(C|A) = 0.5^10.
P(A): Probability of selecting a fair coin from the jar. Since there are 999 fair coins out of 1000, we have P(A) = 999/1000.
P(C): Probability of observing 10 heads in 10 tosses, regardless of the type of coin. This can be calculated by summing the probabilities of two scenarios: (1) selecting a fair coin and getting 10 heads, and (2) selecting the double-headed coin and getting 10 heads. We can write this as: P(C) = P(C|A) * P(A) + P(C|B) * P(B), where P(C|B) = 1 (since the double-headed coin always lands on heads) and P(B) = 1/1000 (since there is only 1 double-headed coin out of 1000).
Plugging in the values, we get:
P(A|C) = P(C|A) * P(A) / P(C)
= (0.5^10) * (999/1000) / [(0.5^10) * (999/1000) + 1/1000]
= 999 / 2023 ≈ 0.4938
This is the probability that the coin is fair, so the probability that it is double-headed is P(B|C) = 1 - P(A|C) ≈ 0.5062. The probability of getting a head in the next toss weights each coin's chance of heads by these posterior probabilities: P(head) = 0.5 * P(A|C) + 1 * P(B|C) ≈ 0.5 * 0.4938 + 0.5062 ≈ 0.7531, or about 75%.
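The calculation can be verified with exact fractions. The sketch below follows the events A, B, and C defined above. The variable names are chosen for this example. Note that the posterior P(A|C) on its own is the probability that the coin is fair. The probability of a head on the next toss weights each coin's chance of heads by its posterior.

```python
from fractions import Fraction

prior_fair = Fraction(999, 1000)      # P(A)
prior_double = Fraction(1, 1000)      # P(B)
like_fair = Fraction(1, 2) ** 10      # P(C|A): ten heads from a fair coin
like_double = Fraction(1)             # P(C|B): double-headed always shows heads

evidence = like_fair * prior_fair + like_double * prior_double   # P(C)
post_fair = like_fair * prior_fair / evidence                    # P(A|C)
post_double = 1 - post_fair                                      # P(B|C)

# Next toss: a fair coin gives heads half the time, a double-headed coin always.
p_next_head = post_fair * Fraction(1, 2) + post_double * 1

print(post_fair, float(post_fair))
print(p_next_head, float(p_next_head))
```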
Q14: What do you understand by a kernel trick? Give an example.
In machine learning, the kernel trick is a technique that lets linear algorithms model nonlinear relationships. Instead of explicitly transforming the data into a higher-dimensional feature space, a kernel function computes the inner product of two points in that space directly from the original inputs. It is commonly used in support vector machines (SVMs) and other kernel-based algorithms.
One common example is the radial basis function (RBF) kernel, also known as the Gaussian kernel. The RBF kernel corresponds to an implicit mapping of the input data points into an infinite-dimensional feature space, allowing SVMs to model complex nonlinear decision boundaries. The RBF kernel is defined by the formula:
K(x, x') = exp(-gamma * ||x - x'||^2)
where x and x' are the input data points, gamma is a positive hyperparameter that controls the width of the kernel, and ||x - x'||^2 is the squared Euclidean distance between the two data points.
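The RBF kernel takes only a few lines to compute, using the standard definition exp(-gamma * squared Euclidean distance). In this sketch the function name `rbf_kernel`, the sample points, and the value gamma = 0.5 are arbitrary choices for illustration.

```python
import math

def rbf_kernel(x, y, gamma):
    """RBF (Gaussian) kernel: exp(-gamma * squared Euclidean distance)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical points have similarity 1. Similarity decays toward 0 with distance.
same = rbf_kernel([1.0, 2.0], [1.0, 2.0], gamma=0.5)
near = rbf_kernel([0.0, 0.0], [1.0, 0.0], gamma=0.5)
far = rbf_kernel([0.0, 0.0], [3.0, 4.0], gamma=0.5)
```

The kernel value is computed from the original coordinates alone. The infinite-dimensional feature space is never constructed explicitly.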
Q15: What feature selection techniques are available for choosing the appropriate variables for effective prediction models?

Univariate Feature Selection: This method involves selecting features based on their individual performance in isolation, using statistical tests or scoring methods.

Recursive Feature Elimination (RFE): This method involves recursively fitting the model multiple times, each time eliminating the least important features based on their importance scores or coefficients, until a desired number of features is selected.

Tree-based Methods: This method involves using decision trees or tree-based ensemble methods, such as Random Forests and Gradient Boosting, which automatically perform feature selection by selecting the most important features for splitting the tree nodes based on impurity or information gain measures.

Principal Component Analysis (PCA): Strictly speaking a feature extraction rather than a feature selection technique, PCA transforms the original features into a lower-dimensional space using linear combinations of the original features, known as principal components. The first few principal components can capture most of the variability in the data and can be used in place of the original features.
Q16: Why is data cleaning crucial? Explain with an example.
Data cleaning, also known as data cleansing or data scrubbing, is a critical step in the data preparation process for machine learning and data analysis. It involves identifying and correcting or removing errors, inconsistencies, and inaccuracies in the raw data to ensure that the data is accurate, reliable, and of high quality.
Let's consider a scenario where a company is analyzing customer data to identify customer preferences for a new product launch. During the data cleaning process, it is discovered that some customer records have missing values for the "age" attribute, while others have inconsistent values such as negative ages or ages exceeding 100 years. By cleaning the data and imputing or correcting the missing or inconsistent age values, the company ensures that the customer analysis is based on accurate and reliable data, leading to more accurate insights and informed decisionmaking for the product launch strategy.
Q17: How can you avoid overfitting your model?
Overfitting occurs when a model is too complex and fits the training data too closely, resulting in poor performance on new or unseen data. Here are three ways to avoid overfitting:

Regularization: Regularization is a technique used to prevent overfitting by adding a penalty term to the loss function of the model. This penalty term discourages the model from fitting the training data too closely, thereby reducing overfitting.

Cross-validation: Cross-validation is a technique used to estimate the performance of a model on new data. It involves splitting the data into multiple training and testing sets, training the model on each training set, and evaluating it on the corresponding held-out set. This helps to assess how well the model generalizes to new data and can help to identify overfitting.

Simplify the model: A simple model is less likely to overfit than a complex model. Therefore, it is important to use a model that is appropriate for the complexity of the problem. For example, if the problem is relatively simple, a linear regression model may be sufficient, whereas for a more complex problem, a deep neural network may be necessary.
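To illustrate the cross-validation idea above, here is a minimal sketch of how k-fold splits can be generated in plain Python. The function name `k_fold_indices` is an assumption for the example. In practice the data would usually be shuffled first, and a library routine would typically be used instead.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Every sample appears in exactly one test fold. Fold sizes differ
    by at most one when n_samples is not divisible by k.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train_idx, test_idx in folds:
    # A model would be fit on train_idx and scored on test_idx here.
    print(len(train_idx), len(test_idx))
```

A large gap between training scores and held-out scores across the folds is a common sign of overfitting.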
Q18: What is dimensionality reduction and its benefits?
Dimensionality reduction is the process of reducing the number of variables or features in a dataset by transforming the data into a lower-dimensional space while preserving as much of the original information as possible. There are two main types of dimensionality reduction:
 Feature selection: This involves selecting a subset of the original features to use in the model.
 Feature extraction: This involves creating new features that are combinations of the original features.
The benefits of dimensionality reduction include:
 Improved computational efficiency: With fewer features, it is easier and faster to compute models, especially for large datasets.
 Improved model performance: Dimensionality reduction can help to improve the performance of a model by reducing the noise in the data and focusing on the most important features.
 Visualization: Dimensionality reduction can help to visualize high-dimensional data by projecting it onto a lower-dimensional space, making it easier to interpret and understand the data.
 Reduced overfitting: Dimensionality reduction can reduce overfitting by simplifying the model and reducing the number of parameters.
Overall, dimensionality reduction is a useful technique for dealing with highdimensional datasets, improving computational efficiency, model performance, and interpretability.
Q19: How should you maintain a deployed model?
Maintaining a deployed model is essential to ensure its continued performance and accuracy over time. Here are some key steps to consider when maintaining a deployed model:
 Monitor performance: Regularly monitor the performance of the model to ensure that it is meeting the desired performance metrics. This can be done by tracking key performance indicators (KPIs) such as accuracy, precision, recall, and F1 score.
 Update the model: As new data becomes available, it may be necessary to update the model to improve its accuracy or performance. This can be done by retraining the model on the new data or by fine-tuning the existing model.
 Monitor data quality: The quality of the data used to train and test the model can have a significant impact on its performance. Regularly monitoring the quality of the data can help to identify any issues that may affect the performance of the model.
 Address feedback: Feedback from users can provide valuable insights into the performance of the model and any issues that may need to be addressed. It is important to actively solicit feedback from users and address any issues that are identified.
 Regular maintenance: Regularly maintain the infrastructure and environment in which the model is deployed, including updating dependencies and software versions, to ensure that the model is running smoothly and efficiently.
 Ensure security and privacy: It is important to ensure that the deployed model is secure and that any sensitive data is protected. This can include measures such as data encryption, access controls, and regular security audits.
By following these steps, you can ensure that your deployed model continues to perform effectively and efficiently over time.
Q20: What are recommender systems?
A recommender system is a type of information filtering system that predicts and suggests items that a user might be interested in. The main approaches are:
 Collaborative filtering is based on the idea that people who had similar preferences in the past will have similar preferences in the future. It analyzes user-item interactions, such as ratings or purchases, to find patterns and make recommendations based on those patterns.
 Content-based filtering, on the other hand, focuses on the characteristics of items, such as genre or product features, to recommend items that are similar to those a user has shown interest in.
 A hybrid approach combines both collaborative and content-based filtering to provide more accurate and personalized recommendations.
Recommender systems have several benefits, including increasing user engagement, improving customer satisfaction, and increasing sales or revenue for businesses.
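The core of user-based collaborative filtering, finding users with similar tastes, can be sketched with cosine similarity. The user names and ratings below are invented, and treating an unrated item as 0 is a simplification made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two rating vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical ratings for four items. 0 means "not rated".
ratings = {
    "ana": [5, 4, 0, 1],
    "ben": [4, 5, 0, 1],
    "cy":  [1, 0, 5, 4],
}

target = "ana"
similarities = {
    user: cosine_similarity(ratings[target], vec)
    for user, vec in ratings.items() if user != target
}
most_similar = max(similarities, key=similarities.get)
```

Items that the most similar user rated highly, and that the target user has not yet seen, would then be candidates for recommendation.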
Q21: What is a Confusion Matrix?
A confusion matrix is a table used to evaluate the performance of a classification model by comparing its predictions with the actual labels. It is described here for a binary classification problem.
In a binary classification problem, the predicted results can be one of two classes: positive or negative. True positives (TP) are the number of positive cases that are correctly identified as positive by the model. True negatives (TN) are the number of negative cases that are correctly identified as negative by the model. False positives (FP) are the number of negative cases that are incorrectly identified as positive by the model. False negatives (FN) are the number of positive cases that are incorrectly identified as negative by the model.
A confusion matrix is typically presented in the following format:
|                    | Actual Positive     | Actual Negative     |
| Predicted Positive | True Positive (TP)  | False Positive (FP) |
| Predicted Negative | False Negative (FN) | True Negative (TN)  |
Once we have the confusion matrix, we can calculate several performance metrics such as accuracy, precision, recall, and F1 score.
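As a small worked example, the sketch below counts the four cells and derives the metrics just mentioned. It assumes labels are encoded as 1 (positive) and 0 (negative). The function names and sample labels are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for 0/1 actual and predicted labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1
        elif actual == 0 and predicted == 1:
            fp += 1
        elif actual == 1 and predicted == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Return (accuracy, precision, recall, F1) from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0, 0, 1]   # hypothetical actual labels
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]   # hypothetical model predictions
counts = confusion_counts(y_true, y_pred)
accuracy, precision, recall, f1 = metrics(*counts)
```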
Q22: What is Deep Learning?
Deep learning is a branch of machine learning that uses artificial neural networks with multiple layers to model and solve complex problems. Deep learning algorithms use a hierarchical method, where the output of one layer is used as the input for the following layer, to automatically learn representations of data.
Deep learning models are frequently used for applications including image recognition, speech recognition, natural language processing, and computer vision. These models are highly effective because they can learn and extract features from large and complex datasets, which can be challenging or impossible for humans to do manually.
Closing Notes
These questions are not exhaustive. The goal was to address the top fundamental questions that interviewers ask. Apart from these questions, here are a few key tips to help you through the interview:
 If you don't understand the interviewer's question clearly, politely ask them to rephrase it.
 Explain your thought process with which you arrived at the answer.
 Do not lose composure if you don't know the answer to a question. Be honest and say that you do not know the answer.
 Speak clearly, and at a moderate pace. If you speak too fast, you'll come off as anxious and the interviewer may not catch all the points you've made.
If you're having an online interview, then take a look at this article for interview tips over video conferencing. Join OdinSchool's Data Science Bootcamp to build your Data Science career!