Data Science Interview Questions (2024) from Top Companies


In the ever-evolving landscape of data science, interview success extends beyond problem-solving. Interview questions are windows into a candidate's analytical thinking and communication skills. This questionnaire is a tool for comprehensive interview preparation, and it also emphasizes the importance of asking thoughtful questions. Aspiring data scientists are encouraged to stay curious, hone their skills, and embrace challenges.

In this questionnaire of Data Science Interview Questions and Answers from Top Companies for 2024, we'll delve into some of the most challenging and insightful Data Science interview questions posed by leading companies.

Statistics for categorized interview questions

These interview questions are divided into different categories, and here are the statistics that tell you which category is 'asked the most' in a data science interview. Please remember these statistics, which will help you prepare for the technical questions.


[Figure: Statistics for categorized interview questions (OdinSchool)]


Algorithm-based Questions

Question 1: What do you know about deep learning? (Facebook)

Deep learning is an exciting field within machine learning that harnesses the power of neural networks to tackle intricate problems. Drawing inspiration from the intricate workings of the human brain, neural networks consist of interconnected nodes that form layers. In deep learning, these networks become even more powerful with multiple layers, paving the way for creating "deep neural networks."

Key components and concepts of deep learning include:

  1. Neural Networks: Neural networks are complex systems composed of interconnected nodes, also known as neurons. These networks are designed to receive input data, process it through multiple hidden layers using weighted connections, and generate accurate predictions or classifications through output nodes.

  2. Deep Neural Networks (DNNs): Deep learning relies on deep neural networks, which have multiple hidden layers. These architectures enable the model to learn hierarchical representations of features, allowing it to model intricate patterns and relationships within data.

  3. Activation Functions: Activation functions introduce non-linearity into the neural network, allowing it to learn and represent complex mappings between inputs and outputs. Common activation functions include ReLU (Rectified Linear Unit), Sigmoid, and Tanh.

  4. Training with Backpropagation: Deep learning models are trained using backpropagation, an algorithm that computes the gradient of the error (the disparity between the predicted and actual values) with respect to each weight. An optimizer such as gradient descent or one of its variations then uses these gradients to adjust the weights, allowing the model to improve its performance over successive iterations.

  5. Convolutional Neural Networks (CNNs): CNNs are specialized deep neural networks designed for image and spatial data. They use convolutional layers to automatically learn hierarchical representations of features in images, making them highly effective for tasks like image classification and object detection.

  6. Recurrent Neural Networks (RNNs): Recurrent Neural Networks (RNNs) are crafted explicitly for handling sequential data, such as time series or natural language. By incorporating loops, RNNs enable the seamless flow of information from one step to another, making them exceptionally well-suited for tasks like language modelling, speech recognition, and sequence generation.

  7. Transfer Learning: Transfer learning is a powerful technique that harnesses the knowledge and expertise from pre-trained deep learning models on extensive datasets. By adapting these models to new, similar tasks with smaller datasets, the need for a large amount of labelled data for training can be significantly reduced. This innovative approach opens up exciting possibilities for more efficient and effective machine learning.

  8. Generative Models: Generative models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), can generate new data samples. GANs, for instance, can produce strikingly realistic images by training a generator to create data that a second network, the discriminator, struggles to distinguish from real data.

  9. Applications of Deep Learning: Deep learning has revolutionized various fields, from computer vision to healthcare, delivering exceptional breakthroughs. Its impact can be seen in cutting-edge applications like image and speech recognition, translation systems, recommendation engines, and self-driving cars.
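To make the ideas of layers, weighted connections, and activation functions concrete, here is a minimal sketch of a forward pass through a tiny two-layer network. The weights are chosen by hand for illustration (they are not learned), and the function names `dense` and `xor_network` are hypothetical. The network computes XOR, a function that no single linear layer can represent, which is why the non-linear ReLU step matters.

```python
def relu(z):
    """Rectified Linear Unit: passes positive values, zeroes out the rest."""
    return max(0.0, z)


def identity(z):
    """No-op activation used for the linear output layer."""
    return z


def dense(inputs, weights, biases, activation):
    """One fully connected layer: weighted sum per neuron, then activation."""
    return [
        activation(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]


def xor_network(x1, x2):
    # Hidden layer: two ReLU neurons with hand-picked (not learned) weights.
    hidden = dense(
        [x1, x2],
        weights=[[1.0, 1.0], [1.0, 1.0]],
        biases=[0.0, -1.0],
        activation=relu,
    )
    # Output layer: a linear combination of the hidden activations.
    (output,) = dense(hidden, weights=[[1.0, -2.0]], biases=[0.0], activation=identity)
    return output


for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_network(a, b))
```

In a real deep learning model the weights would be learned from data with backpropagation and gradient descent rather than set by hand.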

Deep learning is integral to Facebook's operations, powering essential applications like content recommendation, image recognition, and natural language understanding. With a strong focus on research and development in artificial intelligence, especially deep learning, Facebook continuously strives to enhance user experience and introduce groundbreaking features.

Question 2: What is logistic regression? (Microsoft, NTT Data)

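Logistic regression is a machine-learning method for classification problems. It models the relationship between independent variables and a binary outcome by passing a weighted sum of the inputs through the sigmoid (logistic) function, which yields a probability between 0 and 1. A typical use is deciding whether an email is spam: the model outputs the probability of spam, and a threshold (commonly 0.5) turns that probability into a yes/no decision.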
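As an illustration, the sketch below fits a one-feature logistic regression with plain gradient descent on the log loss. It relies on the standard formulation, in which a weighted sum of the inputs is squashed by the sigmoid function into a probability. The dataset (hours of study versus pass/fail), the learning rate, and the number of epochs are made-up values chosen only for the example.

```python
import math


def sigmoid(z):
    """Map any real number to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.5, epochs=5000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradient of the mean log loss with respect to w and b.
        grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((sigmoid(w * x + b) - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data: hours of study -> passed (1) or failed (0).
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
passed = [0, 0, 0, 0, 1, 1, 1, 1]
w, b = fit_logistic(hours, passed)


def predict_proba(x):
    """Probability of the positive class for a single input."""
    return sigmoid(w * x + b)


print(round(predict_proba(1.0), 3), round(predict_proba(3.5), 3))
```

In practice a library implementation (for example in scikit-learn) would normally be used, since it adds regularization and more robust optimization.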

Question 3: What is the concept of ensemble learning? (Microsoft, NTT Data)

Ensemble learning is a machine learning technique that combines multiple models, known as base learners, to improve the performance and robustness of the overall system. By combining the predictions of diverse models, an ensemble can often outperform its individual members in accuracy, reduce overfitting, and generalize better to unseen data.

Here are critical concepts related to ensemble learning:

  1. Base Learners: Base learners are individual models, often of the same type or built using different algorithms. These models can be trained on the same dataset using different subsets of data or on different datasets.

  2. Ensemble Methods: Ensemble methods combine the predictions of multiple base learners to make a final prediction. There are various ensemble methods, including bagging, boosting, and stacking.

  3. Bagging (Bootstrap Aggregating): Bagging involves training multiple base learners independently on bootstrap samples, that is, random subsets of the training data drawn with replacement. Each model is given equal weight in the final prediction.

    Example Algorithm: Random Forest is a popular bagging algorithm that builds an ensemble of decision trees.

  4. Boosting: Boosting focuses on training models sequentially, with each model giving more attention to instances the previous models misclassified. This helps correct errors made by earlier models and improves overall accuracy.

    Example Algorithms: AdaBoost, Gradient Boosting Machines (GBM), and XGBoost are popular boosting algorithms.

  5. Stacking (Stacked Generalization): Stacking involves training multiple base learners and then training a meta-model that uses the predictions of the base learners as inputs. The meta-model learns to combine the strengths of the base models.

    Example Algorithm: Stacking can use diverse base models, such as decision trees, support vector machines, and neural networks, and a meta-model like a linear regression or another machine learning algorithm.

  6. Voting: Voting is a simple ensemble technique where the predictions of multiple models are combined, and the final prediction is determined by a majority vote (for classification) or averaging (for regression).

    Example Algorithm: Random Forest uses a voting mechanism to make predictions based on multiple decision trees.

  7. Diversity in Ensembles: Ensemble methods benefit from diverse base learners. Diversity is achieved by training models using different subsets of data, different algorithms, or different hyperparameters.

  8. Reduction of Overfitting: Ensemble learning helps reduce overfitting by combining the strengths of multiple models. Even if individual models overfit certain parts of the data, the ensemble is less likely to suffer from the same issue.
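The bagging and voting ideas above can be sketched in a few lines. This is a toy illustration, not a production algorithm: the base learner is a hypothetical one-feature "stump" that predicts 1 when `x >= threshold`, the data is made up, and the ensemble size and seed are arbitrary.

```python
import random
from collections import Counter


def fit_stump(xs, ys):
    """Pick the threshold t with the fewest training errors for: predict 1 if x >= t."""
    best_t, best_errors = None, None
    for t in sorted(set(xs)):
        errors = sum((x >= t) != bool(y) for x, y in zip(xs, ys))
        if best_errors is None or errors < best_errors:
            best_t, best_errors = t, errors
    return best_t


def bagging_fit(xs, ys, n_models=25, seed=0):
    """Train each stump on its own bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def bagging_predict(thresholds, x):
    """Combine the stumps with an equal-weight majority vote."""
    votes = Counter(int(x >= t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Made-up data: the label is 1 for x >= 5 and 0 otherwise.
xs = list(range(10))
ys = [0] * 5 + [1] * 5
thresholds = bagging_fit(xs, ys)
print(bagging_predict(thresholds, 1), bagging_predict(thresholds, 8))
```

Because each stump sees a different bootstrap sample, the learned thresholds vary slightly, and the vote smooths out the quirks of any single sample.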

In companies like Microsoft and NTT Data, ensemble learning may be applied to various machine learning tasks, such as improving the accuracy of predictive models, enhancing the robustness of decision-making systems, and optimizing the performance of complex algorithms in diverse domains.


Question 4: What is the difference between supervised and unsupervised learning? (Google, Citi Bank, Apple)

Supervised and unsupervised learning are two fundamental paradigms in machine learning, differing in the learning task and the data type used for training models.

Supervised Learning

  • Definition: Supervised learning involves training a model on a labelled dataset, where the input data is paired with corresponding target labels.

  • Learning Task: The goal is to learn a mapping from input features to output labels based on the examples provided in the training set.

  • Examples: Common supervised learning tasks include classification (assigning input data to predefined categories or classes) and regression (predicting a continuous target variable).

Training Process

During training, the model is exposed to input-output pairs and adapts its parameters to minimize the disparity between predicted and actual outputs. This process enables the model to learn the underlying patterns and relationships in the data accurately. Supervised learning finds applications in various domains, including identifying spam emails, classifying images, recognizing speech, and predicting housing prices.

Unsupervised Learning

  • Definition: Unsupervised learning involves training a model on an unlabeled dataset, where the algorithm explores the inherent structure or patterns in the data without explicit target labels.

  • Learning Task: The goal is to identify patterns, group similar data points, or reduce the dimensionality of the data without guidance from predefined output labels.

  • Examples: Common unsupervised learning tasks include clustering (grouping similar data points), dimensionality reduction (representing data in a lower-dimensional space), and density estimation.

Training Process

The model learns the underlying structure of the data by identifying similarities or relationships between data points.
Use Cases: Examples of unsupervised learning applications include customer segmentation, anomaly detection, topic modelling, and dimensionality reduction for visualization.

| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Labelling of Data | Uses labelled training data with known input-output pairs. | Uses unlabeled training data, focusing on discovering patterns or relationships within the data. |
| Learning Goal | Aims to learn a mapping from inputs to predefined outputs. | Aims to discover the data's inherent patterns, structures, or relationships. |
| Tasks | Classification and regression tasks. | Clustering, dimensionality reduction, and density estimation. |
| Examples | Image classification, speech recognition, and predicting stock prices. | Customer segmentation, anomaly detection, and topic modelling. |
| Training Process | The model adjusts parameters to minimize the difference between predicted and actual outputs. | The model identifies patterns or relationships within the data without explicit guidance. |
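The contrast between the two paradigms can be shown on the same small set of numbers. In this sketch the supervised model (a nearest-centroid classifier) is given labels, while the unsupervised one (a two-cluster k-means in one dimension) must find the groups from the values alone. The data, the labels, and the function names are made up for illustration.

```python
def fit_nearest_centroid(xs, labels):
    """Supervised: use the labels to compute one centroid (mean) per class."""
    groups = {}
    for x, label in zip(xs, labels):
        groups.setdefault(label, []).append(x)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


def predict_label(centroids, x):
    """Assign x to the class whose centroid is closest."""
    return min(centroids, key=lambda label: abs(x - centroids[label]))


def kmeans_1d(xs, iterations=10):
    """Unsupervised: find two groups from the values alone (k=2, no labels)."""
    centers = [min(xs), max(xs)]  # simple deterministic initialisation
    for _ in range(iterations):
        clusters = [[], []]
        for x in xs:
            nearest = 0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1
            clusters[nearest].append(x)
        # Move each center to the mean of its cluster (keep it if the cluster is empty).
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return [0 if abs(x - centers[0]) <= abs(x - centers[1]) else 1 for x in xs]


values = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["small", "small", "small", "large", "large", "large"]

centroids = fit_nearest_centroid(values, labels)  # needs labels
cluster_ids = kmeans_1d(values)                   # no labels needed

print(predict_label(centroids, 4.5), cluster_ids)
```

The supervised model returns a named class because it was told the names; the clustering only returns anonymous group numbers, which an analyst then has to interpret.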

Supervised and unsupervised learning techniques may be utilised in companies like Google, Citi Bank, and Apple, depending on the specific tasks and applications.

For instance, Google might use supervised learning to improve search algorithms and unsupervised learning for clustering related topics in news articles.

Citi Bank might apply supervised learning for credit risk assessment and unsupervised learning for detecting unusual patterns in financial transactions.

Apple could leverage supervised learning for Siri's speech recognition and unsupervised learning for user behaviour analysis.

Modelling-based Questions

Question 5: What is Overfitting, and how can you avoid overfitting your model? (Google, NTT Data)

Overfitting is a prevalent issue in machine learning, where a model becomes too familiar with the training data, including its noise and random fluctuations, to the point where it hinders its performance on unseen data. Simply put, an overfit model excels in handling the training data but struggles to adapt effectively to new and unseen data.

The key indicators of overfitting include high accuracy on the training data but poor performance on validation or test data.

Here are some standard techniques to avoid overfitting:

  1. Cross-Validation: Use k-fold cross-validation to evaluate the model across several data subsets. Consistent performance across folds indicates that the model is not overfitting to one particular split of the data.

  2. Regularization: Add a regularization term, such as an L1 or L2 penalty, to the model's optimization objective. The penalty grows with the magnitude of the model parameters, which discourages overly complex models.

  3. Feature Selection: Carefully select relevant features and remove irrelevant ones. Too many features, especially noisy or irrelevant ones, can lead to overfitting. Feature selection methods can help identify and retain only the most informative features.

  4. Data Augmentation: Increase the variety of the training dataset by applying transformations to the existing data, such as rotations, flips, or zooms for images. Data augmentation makes the model more robust and improves its ability to generalize to new examples.

  5. Early Stopping: Continuously track the model's performance on a validation set while it undergoes training. If you notice a decline in performance on the validation set, even as the training performance improves, it is crucial to halt the training process early to avoid overfitting.

  6. Ensemble Methods: Harness the power of ensemble methods, like bagging and boosting, to merge predictions from multiple models. Ensemble methods effectively combat overfitting by tapping into the unique strengths of each model and mitigating their weaknesses.

  7. Pruning (for Decision Trees): If you're working with decision trees, consider pruning the tree to limit its depth. Pruning removes unnecessary branches of the tree, preventing it from becoming too specific to the training data.

  8. Dropout (for Neural Networks): Apply dropout during training in neural networks. Dropout randomly deactivates a proportion of neurons during each forward and backward pass, preventing the network from relying too heavily on specific neurons.

  9. Proper Model Complexity: Choose a model complexity that suits the task. An excessively complex model significantly raises the risk of overfitting, particularly when training data is scarce, while an overly simple one may underfit.

  10. Regular Monitoring and Validation: Regularly monitor the model's performance on validation data and ensure it generalises well to new examples. If the performance degrades, revisit the training process and consider adjusting hyperparameters or other techniques to avoid overfitting.

When applied judiciously, these strategies can help balance model complexity and generalization, reducing the risk of overfitting in machine learning models.
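Early stopping, one of the techniques listed above, is easy to express in code. The sketch below assumes the validation loss has already been recorded once per epoch; the `patience` parameter (how many non-improving epochs to tolerate) and the loss values are illustrative assumptions, not part of the original list.

```python
def early_stopping_epoch(val_losses, patience=2):
    """Return (best_epoch, stop_epoch) for a sequence of per-epoch validation losses.

    Training stops once the validation loss has failed to improve for
    `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    # Never triggered: training ran to the last recorded epoch.
    return best_epoch, len(val_losses) - 1


# Made-up losses: validation loss improves, then starts rising (a sign of overfitting).
losses = [0.90, 0.70, 0.60, 0.65, 0.70, 0.72]
best_epoch, stop_epoch = early_stopping_epoch(losses, patience=2)
print(best_epoch, stop_epoch)
```

In a real training loop the model weights from `best_epoch` would typically be saved and restored after stopping.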

Question 6: What do you know about cross-validation? (NTT Data, Nestle)

Cross-validation is a statistical technique used in machine learning to evaluate a model's performance and its ability to generalize. It divides the dataset into several subsets, trains the model on some of them, and assesses its performance on the remaining data. Because every evaluation uses data the model was not trained on, cross-validation helps detect overfitting and gives a more reliable estimate of the model's true capabilities.

Common types of cross-validation include:

  1. K-fold Cross-Validation: The dataset is partitioned into k folds of equal size. The model undergoes k iterations, each involving training on k-1 folds and validating on the remaining fold. The overall performance metric is determined by averaging the metrics obtained from each fold.

  2. Leave-One-Out Cross-Validation (LOOCV): LOOCV is a variation of k-fold cross-validation in which k equals the number of samples in the dataset. A single data point is held out for validation during each iteration while the model is trained on the remaining data. This process is repeated for every data point, and the results are averaged to give the overall performance.

  3. Stratified k-fold Cross-Validation: This technique proves particularly valuable when handling imbalanced datasets, as it guarantees that every fold possesses a comparable distribution of the target variable to the entire dataset. This approach is crucial in preserving representation from all classes in the training and validation sets.

  4. Time Series Cross-Validation: Traditional cross-validation methods might not be suitable for time-dependent datasets due to the temporal nature of the data. Time series cross-validation involves using past data for training and future data for validation, mimicking the temporal structure of the dataset.
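Time series cross-validation can be sketched as an expanding window: each split trains on everything up to a point in time and validates on the block that follows. The function below is a simplified illustration with a hypothetical name and a simple equal-block scheme; library implementations may size the blocks differently.

```python
def time_series_splits(n_samples, n_splits):
    """Yield (train_indices, validation_indices) with an expanding training window.

    Indices are assumed to be in chronological order, so validation data
    always comes after the data used for training.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for i in range(1, n_splits + 1):
        train_end = i * fold_size
        # The last split absorbs any leftover samples.
        val_end = train_end + fold_size if i < n_splits else n_samples
        yield list(range(train_end)), list(range(train_end, val_end))


for train_idx, val_idx in time_series_splits(n_samples=10, n_splits=3):
    print(train_idx, val_idx)
```

Unlike ordinary k-fold, no split ever trains on observations that come after its validation block, which avoids leaking future information into the model.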

The main advantages of cross-validation

  1. Robust Performance Estimation: Cross-validation provides a more reliable estimate of a model's performance by assessing its ability to generalize to different subsets of the data.

  2. Reduced Dependency on a Single Split: A single train-test split can give a misleading performance estimate if the split is not representative of the entire dataset. Cross-validation mitigates this risk by using multiple splits.

  3. Model Selection and Hyperparameter Tuning: Cross-validation is often used with model selection and tuning to choose the best-performing model and set of hyperparameters.

In the context of NTT Data and Nestle, these companies may use cross-validation as a standard practice when developing machine learning models to ensure robust performance and generalization across various scenarios and datasets. It is a widely accepted methodology in the machine learning community and is considered good practice for model evaluation.

Question 7: What is k-fold cross-validation? (NTT Data)

K-fold cross-validation, a widely adopted technique in machine learning, offers a practical approach to evaluating the performance of predictive models. By dividing the dataset into k equal folds or subsets, this technique enables the model to be trained and evaluated multiple times. Each iteration uses a different fold as the validation set, while the remaining folds serve as the training set.

Here's a step-by-step explanation of k-fold cross-validation:

  1. Dataset Splitting: The original dataset is divided into k equally sized folds. For example, if you choose k=5, the dataset is divided into five folds, each containing approximately 1/5th of the total data.

  2. Model Training and Evaluation: The model is trained k times. Each iteration uses a distinct fold as the validation set, while the remaining k-1 folds serve as the training data. This produces k separately trained models, each evaluated on data it did not see during training.

  3. Performance Metrics: The model's performance is assessed by applying a metric (such as accuracy, precision, or recall) to the validation set during each iteration, resulting in k performance scores.

  4. Average Performance: The k performance scores are averaged to give the final performance metric, which is a more robust evaluation of the model than a single train-test split. K-fold cross-validation addresses concerns regarding data variability and ensures that the model is assessed across diverse subsets of the data. This approach provides a more dependable estimate of the model's performance while highlighting potential issues such as overfitting or underfitting.
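The four steps above can be sketched in plain Python. The helper names, the toy "model" (which always predicts the training mean), and the data are assumptions made for illustration; any model with a fit step and a scoring function could be plugged in instead.

```python
def k_fold_indices(n_samples, k):
    """Split indices 0..n_samples-1 into k contiguous folds of near-equal size."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(xs, ys, k, fit, score):
    """Train k times, each time validating on a different fold; return scores and their mean."""
    folds = k_fold_indices(len(xs), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        scores.append(score(model, [xs[j] for j in val_idx], [ys[j] for j in val_idx]))
    return scores, sum(scores) / len(scores)


def fit_mean(train_x, train_y):
    """Toy model: remember the mean of the training targets."""
    return sum(train_y) / len(train_y)


def mean_abs_error(model, val_x, val_y):
    """Score the toy model: average absolute difference from the remembered mean."""
    return sum(abs(y - model) for y in val_y) / len(val_y)


xs = list(range(10))
ys = [2 * x for x in xs]
scores, mean_score = cross_validate(xs, ys, k=5, fit=fit_mean, score=mean_abs_error)
print(scores, mean_score)
```

This sketch keeps the data in its original order for readability; in practice the data is usually shuffled before being split into folds, unless it is a time series.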

Common choices for the value of k are 5 and 10. The choice of k depends on factors such as the size of the dataset and computational resources. A larger k leaves a smaller validation set in each iteration and requires more training runs, which is computationally more expensive, but each model is trained on a larger share of the data.

In the context of NTT Data, k-fold cross-validation is likely employed when developing and evaluating machine learning models to ensure robust performance across different subsets of the data. It is a standard practice to assess model generalization and reliability, helping to make informed decisions about model deployment and use.


Question 8: What is a validation set and a test set? (LinkedIn)

A validation set and a test set are subsets of data used in machine learning to assess the performance of a trained model. Both sets serve distinct purposes in the model development and evaluation process.

Validation Set

  • Purpose: During the training phase, the validation set is crucial in refining the model's hyperparameters and evaluating its performance on unseen data. Utilizing an independent dataset for evaluation effectively safeguards against overfitting and ensures the model's reliability and generalizability.

  • Usage: After training the model on the training set, it is evaluated on the validation set. The model's performance on the validation set guides the adjustment of hyperparameters, such as learning rates or regularization strengths, to optimize its generalization to new, unseen data.

  • Role: The validation set is critical in the model development phase, helping refine the model until satisfactory performance is achieved.

Test Set

  • Purpose: The test set provides an unbiased evaluation of the final trained model. It represents unseen data the model has not encountered during training or validation. The test set is crucial for assessing how well the model generalizes to new, real-world data.

  • Usage: Once the model has been trained and tuned using the training and validation sets, its final evaluation is performed on the test set. This provides an unbiased estimate of the model's performance and helps gauge its effectiveness in real-world scenarios.

  • Role: The test set serves as a final checkpoint for the model's performance. It indicates how well the model is expected to perform on new, unseen data in production.

In summary, the typical split of a dataset in machine learning involves three main subsets,

  • Training Set: Used to train the model.

  • Validation Set: Used to fine-tune hyperparameters and evaluate the model during training.

  • Test Set: Used to provide an unbiased evaluation of the final trained model.

Using separate training, validation, and test sets helps ensure that the model's performance is accurately assessed and that the evaluation represents its ability to generalize to new, unseen data. This practice is crucial for building robust and reliable machine-learning models. 
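A three-way split like the one summarised above might look as follows. The 60/20/20 proportions, the fixed seed, and the function name are illustrative assumptions; the right proportions depend on how much data is available.

```python
import random


def train_val_test_split(items, val_fraction=0.2, test_fraction=0.2, seed=42):
    """Shuffle once, then cut the data into disjoint train, validation, and test sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_test = int(len(shuffled) * test_fraction)
    n_val = int(len(shuffled) * val_fraction)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```

The test set should be set aside and used only once, after all tuning on the validation set is finished, so that it remains an unbiased check.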

Question 9: What is the difference between cross-validation and train-test split? (Microsoft)

Cross-validation and train-test split are valuable techniques used to evaluate the performance of machine learning models. However, it is essential to note that these approaches have distinct characteristics and nuances that set them apart.

Cross-validation involves dividing the data into folds and training the model on all but one fold at a time. The model is then tested on the held-out fold, and the process is repeated so that each fold serves once as the test fold. The model's generalisation error is then estimated using the average performance across all folds.

The train-test split divides the data once into two sets: training and testing. The model is trained on the training set and then evaluated on the test set. The test set performance is used to estimate the model's generalisation error. It is cheaper to compute than cross-validation, but the estimate depends on a single split.


Technical-based Questions

Question 10: Do you know the steps of a model deployment? (NTT Data, LinkedIn)

The deployment of a machine learning model involves making the trained model available for use in a production environment, where it can make predictions on new, unseen data. Here are the general steps involved in the model deployment process:

  1. Data Preprocessing: Ensure the preprocessing steps applied during training are replicated during deployment. This includes handling missing values, encoding categorical variables, scaling features, and other data transformations.

  2. Model Export: Save the trained model and any associated preprocessing steps to a format easily loaded in the deployment environment. Standard formats include serialized files like Pickle in Python or ONNX (Open Neural Network Exchange).

  3. Containerization: Package the model, its dependencies, and any necessary runtime components into a container (e.g., Docker container). Containerization helps ensure that the model can run consistently across different environments.

  4. Scalability Considerations: Assess the scalability requirements for the deployment environment. Considerations may include the expected number of requests, the need for parallel processing, and resource utilization. Deploy the model in a scalable and efficient manner.

  5. Integration with Application: Integrate the model into the target application or system. This may involve creating APIs (Application Programming Interfaces) or microservices to expose the model's functionality for other software components.

  6. Monitoring and Logging: Implement monitoring and logging mechanisms to track the model's performance and usage in the production environment. This helps identify issues, such as a drop in model accuracy or changes in data patterns, and facilitates debugging.

  7. Security Considerations: Implement security measures to protect the deployed model from threats. This may include authentication mechanisms, access controls, and encryption of sensitive data.

  8. Versioning: Establish a versioning system for the deployed models. This allows for easy rollback in case of issues with a new model version and facilitates tracking model performance over time.

  9. Testing and Quality Assurance: Conduct thorough testing of the deployed model to ensure it behaves as expected in the production environment. This includes testing for edge cases, performance under load, and overall system stability.

  10. Documentation: Provide comprehensive documentation for the deployed model, including information on how to use the API, any model-specific considerations, and troubleshooting guidelines.

  11. User Training and Support: If applicable, provide training and support to end-users or stakeholders interacting with the model. Ensure users understand the model's capabilities, limitations, and proper usage.

  12. Feedback Loop: Establish a feedback loop to monitor the model's performance and gather user feedback continuously. This information can be used to retrain and improve the model over time.
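A few of the steps above (replicating preprocessing, model export, versioning, and exposing a prediction function) can be sketched with the standard library alone. Everything here is a simplified stand-in: the parameter values are hypothetical rather than from a real trained model, and `handle_request` only imitates what an API endpoint would do.

```python
import pickle


def preprocess(raw, params):
    """Scale a raw feature with the statistics saved at training time."""
    return (raw - params["mean"]) / params["std"]


def predict(raw, params):
    """Apply the same preprocessing as in training, then a linear score."""
    return params["weight"] * preprocess(raw, params) + params["bias"]


# Hypothetical values standing in for a trained model and its preprocessing statistics.
trained = {"version": "1.0.0", "mean": 50.0, "std": 10.0, "weight": 2.0, "bias": 1.0}

# Model export: serialise the parameters and preprocessing statistics together.
artifact = pickle.dumps(trained)

# In the serving environment: load the artifact once at start-up.
loaded = pickle.loads(artifact)


def handle_request(payload):
    """Stand-in for an API endpoint: parse input, return prediction plus model version."""
    value = float(payload["feature"])
    return {"prediction": predict(value, loaded), "model_version": loaded["version"]}


print(handle_request({"feature": 70}))
```

Storing the preprocessing statistics in the same artifact as the model keeps training and serving consistent, and returning the version with each prediction makes monitoring and rollback easier. Note that Pickle files should only be loaded from trusted sources, because unpickling can execute arbitrary code.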

In the NTT Data and LinkedIn context, these steps would be part of their model deployment processes when implementing machine learning solutions for various applications. The specific details of the deployment process may vary depending on the nature of the project and the technologies used.

Question 11: What is the difference between data science and data analytics? (Google, NTT Data)

The technique and practice of analysing data to answer questions, extract insights, and discover trends is referred to as data analytics. This is accomplished through various tools, approaches, and frameworks that vary based on the type of analysis being performed.

Business analytics is the application of data analytics techniques and methodologies in a business setting. Business analytics' primary goal is to extract valuable insights from data that an organisation may use to inform its strategy and achieve its goals.

Data analytics focuses on understanding information and gaining insights that may be utilised to take action, while data science focuses on creating, cleaning, and organising datasets. Data scientists develop and apply algorithms, statistical models, and bespoke analysis.

Question 12: Which one would you pick for text analytics between Python and R, and why? (NTT Data)

The choice between Python and R for text analytics depends on various factors, including the project's specific requirements, the team's preferences and expertise, and the broader ecosystem and tools available. Python and R are widely used for text analytics, and each has strengths. Here are some considerations:

Python for Text Analytics

  1. General-Purpose Language: Python is a jack-of-all-trades programming language with a massive and vibrant community. It is highly sought after in data science, web development, automation, and countless other domains.

  2. Rich Ecosystem: Python offers an extensive collection of libraries and frameworks for text analytics, including NLTK (Natural Language Toolkit), spaCy, TextBlob, and scikit-learn. These libraries provide tools for tasks such as tokenization, sentiment analysis, and named entity recognition, giving practitioners what they need to process text data and extract valuable insights.

  3. Machine Learning Integration: Python is well-integrated with popular machine learning frameworks like TensorFlow and PyTorch, allowing seamless integration of text analytics with machine learning models. This is particularly useful for classification, topic modelling, and generation tasks.

  4. Web Scraping and APIs: Python has excellent support for web scraping libraries (e.g., BeautifulSoup) and working with APIs. This is crucial for collecting text data from various sources on the web.

  5. Community and Documentation: Python has a large and active community, resulting in extensive documentation, tutorials, and online resources. This makes it easier for developers to find solutions to problems and get support.
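To give a feel for what text analytics code looks like in Python, here is a minimal sketch of tokenization and a term-document matrix using only the standard library. The regular expression and the two sample sentences are assumptions for illustration; libraries such as NLTK or spaCy provide far more capable tokenizers.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def term_document_matrix(documents):
    """Return (vocabulary, rows) where rows[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    vocabulary = sorted(set().union(*counts))
    return vocabulary, [[c[term] for term in vocabulary] for c in counts]


docs = ["The service was great", "Great food, slow service"]
vocab, matrix = term_document_matrix(docs)
print(vocab)
print(matrix)
```

The same matrix could be built in R with packages such as tm or quanteda, so the choice between the two languages depends on the surrounding workflow rather than on this step.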

R for Text Analytics

  1. Statistical Analysis and Visualization: R is known for its robust statistical analysis and visualization capabilities. R might be a good choice if your text analytics tasks involve in-depth statistical analysis or if you need sophisticated data visualization.

  2. Tidyverse for Data Wrangling: The Tidyverse, a collection of R packages including dplyr and tidyr, is powerful for data wrangling and manipulation. If your text data requires extensive preprocessing and cleaning, R's Tidyverse can be advantageous.

  3. Shiny for Interactive Apps: R's Shiny framework allows the creation of interactive web applications. If your text analytics results need to be presented in an interactive and user-friendly manner, Shiny can be a valuable tool.

  4. Text Mining Packages: R has dedicated packages for text mining, such as tm and quanteda. These packages offer functionalities for text preprocessing, term-document matrices, and various text mining algorithms.

  5. Academic and Research Community: R has a strong presence in academic and research communities, particularly in statistics and the social sciences. If your text analytics tasks relate to these domains, R may offer specialised tools and packages for them.

Considerations

  1. Team Expertise: Consider the expertise and familiarity of your team members with either Python or R. The team's comfort and experience with the language can significantly impact productivity and the quality of the implementation.

  2. Integration with Other Tools: Consider the other tools and systems your project needs to integrate with. Python, as a general-purpose language, often integrates seamlessly with various tools and platforms.

  3. Scalability and Performance: Depending on the scale of your text analytics tasks, the performance and scalability of the chosen language and libraries can be crucial factors.

Ultimately, the selection between Python and R for text analytics hinges on the project's unique demands and the team's preferences. Several factors, including the existing infrastructure, integration requirements, and the larger context within the organization, play a crucial role in making this decision.


Statistics-based Questions

Question 13: What is the ROC curve? (Nestle, Walmart)

The ROC (Receiver Operating Characteristic) curve is a plot that shows the performance of a binary classification model across all classification thresholds. It plots the true positive rate (sensitivity) against the false positive rate (1 − specificity), which makes the trade-off between sensitivity and specificity visible.

In practice, no classifier is perfect. Most ROC curves lie between the ideal curve, which passes through the top-left corner of the plot, and the diagonal line, which represents random guessing. As a classifier improves, its ROC curve moves closer to the ideal curve.

The area under the ROC curve (AUC) is a single value summarising a classifier's performance. The AUC can range between 0 and 1. An AUC of 0.5 indicates that the classifier is no better than chance, whereas an AUC of 1 indicates a perfect classifier.

The ROC curve is a handy tool for assessing the effectiveness of binary classification algorithms. It is a simple way of visualising the trade-off between sensitivity and specificity.
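To make the idea concrete, the following is a small sketch in plain Python that sweeps the decision threshold, collects the (false positive rate, true positive rate) points, and computes the AUC with the trapezoidal rule. The labels and scores are made-up sample data, and the helper names `roc_points` and `auc` are chosen for this example; in practice a library such as scikit-learn would normally be used instead.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    points = [(0.0, 0.0)]
    # Try each distinct score as a threshold, from strictest to loosest.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= thr and y == 0)
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the curve using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels (1 = positive) and model scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

pts = roc_points(y_true, scores)
roc_auc = auc(pts)
print(pts)
print(f"AUC = {roc_auc:.2f}")  # AUC = 0.75
```

An AUC of 0.75 here means that, for a randomly chosen positive and a randomly chosen negative example, the model scores the positive one higher 75% of the time.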

Question 14: What are the basic steps of EDA? (World Bank, LinkedIn)

Exploratory data analysis (EDA) is essential in data science because it helps us understand the data and discover patterns, trends, and anomalies. It entails data summarization, data type identification, missing value detection, outlier detection, and visualisation.

The fundamental steps of EDA:

  1. Load Dataset and Import Libraries: Import the appropriate data manipulation and visualisation packages, such as pandas, NumPy, and Matplotlib. The dataset should next be loaded into a data frame or other appropriate data structure.

  2. Cleaning Data: Check for missing values and treat them correctly by eliminating rows with missing values, imputing them using mean, median, or other methods, or creating a new missing values category.

  3. Data Summarization: Use descriptive statistics such as mean, median, mode, standard deviation, quantiles, and range to summarise the data. This gives an overview of the data's central tendency and dispersion.

  4. Data Types: Determine whether each variable is numerical or categorical and, if categorical, whether it is ordinal or nominal. This aids in the selection of suitable data cleaning and visualisation approaches.

  5. Univariate Analysis: To investigate the distribution of numerical variables, use histograms, box plots, or density plots, and to investigate the distribution of categorical variables, use bar charts or frequency tables. This aids in comprehending the data's structure, dispersion, and central tendency.

  6. Bivariate Analysis: Scatter plots, correlation matrices, or heat maps can be used to examine the relationship between two variables. This aids in identifying potential connections, trends, or outliers between variables.

  7. Multivariate Analysis: Use principal component analysis (PCA) or t-distributed stochastic neighbour embedding (t-SNE) to investigate the relationship between numerous variables. This aids in identifying patterns and groupings in high-dimensional data.

  8. Outlier Detection: Use box plots, z-scores, or other outlier detection approaches to identify outliers in the data. Outliers can substantially impact statistical analysis and model performance.

  9. Data Visualisation: To get insights and communicate findings, visualise the data using relevant charts, graphs, and plots. Select visualisations that accurately depict the data type and distribution.

EDA is an iterative process, and we may need to revisit prior processes as we get deeper insights into the data. The goal is to get a thorough grasp of the data, its properties, and its potential for further analysis and modelling.
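A few of the steps above (missing-value handling, summarization, and outlier detection) can be sketched as follows. This example deliberately uses only Python's built-in `statistics` module so that it runs anywhere; with pandas, the same steps would typically use a data frame instead. The `ages` column is invented sample data, and the 1.5 × IQR rule is one common outlier convention among several.

```python
import statistics

# Hypothetical column with one missing value (None) and one extreme value.
ages = [23, 25, 31, None, 29, 27, 24, 95, 26, 28]

# Cleaning: count missing values, then impute them with the median.
observed = [x for x in ages if x is not None]
n_missing = len(ages) - len(observed)
fill_value = statistics.median(observed)
cleaned = [fill_value if x is None else x for x in ages]

# Summarization: central tendency and dispersion.
summary = {
    "mean": statistics.mean(cleaned),
    "median": statistics.median(cleaned),
    "stdev": statistics.stdev(cleaned),
    "min": min(cleaned),
    "max": max(cleaned),
}

# Outlier detection: flag values beyond 1.5 * IQR from the quartiles.
q1, _, q3 = statistics.quantiles(cleaned, n=4)
iqr = q3 - q1
outliers = [x for x in cleaned if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

print(f"missing values: {n_missing}")
print(summary)
print(f"outliers: {outliers}")
```

Note how the single extreme value pulls the mean well above the median, which is exactly the kind of signal EDA is meant to surface before modelling.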

Question 15: What are type 1 and type 2 errors? (Nestle)

Type 1 and Type 2 errors are concepts in statistical hypothesis testing that describe the errors made when interpreting the results of a hypothesis test.

Type 1 Error (False Positive): A Type 1 error occurs when a null hypothesis is incorrectly rejected when it is true.

Symbol: α (alpha), the probability of making a Type 1 error

Explanation: This error signifies a scenario in which the test erroneously determines the presence of a significant effect or difference when there is no such effect or difference. It is commonly known as a "false positive."

Example: In a medical context, a Type 1 error would occur if a diagnostic test incorrectly indicates the presence of a disease in a healthy individual.

Type 2 Error (False Negative): A Type 2 error occurs when we fail to reject a null hypothesis that is actually false.

Symbol: β (beta), the probability of making a Type 2 error

Explanation: This error arises when the test overlooks a substantial effect or difference despite its existence. It is commonly known as a "false negative", as it fails to detect the presence of the effect or difference correctly.

Example: In a medical context, a Type 2 error would occur if a diagnostic test fails to detect the presence of a disease in an individual affected by the disease.

The interplay between Type 1 and Type 2 errors can be understood through the concept of statistical "power". Power is the probability of correctly rejecting a false null hypothesis, and it equals 1 − β. In simpler terms, power measures a test's ability to detect an effect or difference when it truly exists.

There is a trade-off between Type 1 and Type 2 errors in statistical hypothesis testing. By adjusting the significance level α, we control the probability of a Type 1 error. However, for a fixed sample size, lowering α raises β, the probability of a Type 2 error. The choice of significance level therefore depends on the study's specific context and the potential consequences of committing either a Type 1 or a Type 2 error.
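The trade-off can be illustrated numerically. The sketch below assumes a two-sided one-sample z-test with known standard deviation, which is a simplification chosen for this example; the effect size, sample size, and the function name `power_two_sided_z` are likewise illustrative assumptions. It shows how power (1 − β) shrinks as α is made stricter.

```python
from statistics import NormalDist

def power_two_sided_z(effect, sigma, n, alpha):
    """Power of a two-sided one-sample z-test for a true mean shift `effect`."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # critical value for the chosen alpha
    shift = effect * n ** 0.5 / sigma      # mean of the test statistic under H1
    # Probability that the test statistic lands in either rejection region.
    return (1 - z.cdf(z_crit - shift)) + z.cdf(-z_crit - shift)

# Hypothetical scenario: true shift of 0.5, sigma of 1, sample size of 25.
for alpha in (0.10, 0.05, 0.01):
    power = power_two_sided_z(effect=0.5, sigma=1.0, n=25, alpha=alpha)
    print(f"alpha={alpha:.2f}  power={power:.3f}  beta={1 - power:.3f}")
```

When the true effect is zero, the same function returns α itself, since every rejection is then a Type 1 error.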

Question 16: What is regression analysis? (Walmart)

Regression analysis is a set of statistical techniques for estimating the relationship between a dependent variable and one or more independent variables. It quantifies the strength and direction of these relationships, and the fitted model can then be used to predict the dependent variable for new values of the independent variables.
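As a minimal sketch, simple linear regression with one independent variable can be fitted by ordinary least squares in a few lines of plain Python. The data points and the helper name `fit_line` are invented for this illustration; libraries such as scikit-learn or statsmodels would normally be used for real work.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and the cross-product of x and y deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical observations that roughly follow y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
prediction = slope * 6 + intercept   # predicted y for a new x = 6
print(f"slope={slope:.2f}, intercept={intercept:.2f}, prediction={prediction:.2f}")
```

The slope estimates how much the dependent variable changes for a one-unit change in the independent variable.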

Question 17: Please tell me some of the sampling techniques. (Google, Citi Bank)

Sampling is selecting a subset of data from a broader population. There are several sampling strategies. However, some of the most common are:

  • Random Sampling: Each member of the population has an equal chance of being chosen.

  • Stratified Sampling: The population is divided into strata (subgroups that share a characteristic), and random samples are taken from each stratum.

  • Systematic Sampling: The population is arranged in order, and every kth member is chosen, usually from a random starting point.

  • Cluster Sampling: The population is divided into clusters, some clusters are selected at random, and the members of the selected clusters are sampled.
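The four techniques can be sketched with Python's built-in `random` module. The population, the `region` field, and the helper function names are all made up for this example. Note that cluster sampling here follows the standard definition: whole clusters are drawn at random and every member of a chosen cluster is kept.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population of 30 records: 10 "north" and 20 "south".
population = [
    {"id": i, "region": "north" if i % 3 == 0 else "south"} for i in range(30)
]

# Random sampling: every member has the same chance of selection.
simple = rng.sample(population, 6)

def stratified_sample(pop, key, fraction, rng):
    """Draw the same fraction at random from each stratum."""
    strata = {}
    for item in pop:
        strata.setdefault(item[key], []).append(item)
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

def systematic_sample(pop, k, rng):
    """Take every k-th member, starting from a random offset below k."""
    start = rng.randrange(k)
    return pop[start::k]

def cluster_sample(pop, cluster_size, n_clusters, rng):
    """Split into clusters, pick whole clusters at random, keep all members."""
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [item for cluster in chosen for item in cluster]

stratified = stratified_sample(population, "region", 0.2, rng)
systematic = systematic_sample(population, 5, rng)
clustered = cluster_sample(population, 5, 2, rng)

print(len(simple), len(stratified), len(systematic), len(clustered))
```

Stratified sampling preserves the 1:2 ratio of north to south records in the sample, which simple random sampling does not guarantee.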

Coding-based Questions

Question 18: Tell me the libraries you know in Python. (NTT Data, World Bank)

Commonly used Python libraries for data science include NumPy, SciPy, scikit-learn, Theano, TensorFlow, and Keras.

Question 19: Which Python library handles complex data? (World Bank)

Scikit-learn is a well-known Python toolkit for dealing with complicated data. It is a free and open-source machine-learning library that supports various supervised and unsupervised methods, such as linear regression, classification, and clustering. This library is compatible with NumPy and SciPy.

Question 20: How do you load a pdf file in Python? (World Bank)

To load a PDF file in Python, you can use the PyPDF2 library, which is now maintained under the name pypdf. This library provides various functions for working with PDF files, such as reading, writing, and extracting text.
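A minimal sketch of text extraction is shown below. It assumes the third-party pypdf package (the successor to PyPDF2) is installed, for example with `pip install pypdf`; the function name `extract_pdf_text` and the file name `report.pdf` are hypothetical.

```python
def extract_pdf_text(path):
    """Return the text of every page in the PDF at `path`, joined by newlines."""
    # Imported here so the function can be defined even if pypdf is missing.
    from pypdf import PdfReader

    reader = PdfReader(path)
    # extract_text() can come back empty for image-only pages.
    return "\n".join((page.extract_text() or "") for page in reader.pages)

# Example usage with a hypothetical file:
# text = extract_pdf_text("report.pdf")
# print(text[:500])
```

Text extraction quality depends on how the PDF was produced; scanned documents generally need OCR rather than a PDF parsing library.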


Frequently Asked Interview Questions (FAQ) - Data Science Interview

Are data science interviews hard?
Data Science interviews can pose a significant challenge for candidates, who must showcase diverse skills to secure a job. These skills encompass technical proficiency, problem-solving capabilities, and effective communication.

One of the easiest ways to build these skills is to enroll in a data science course whose curriculum is kept up to date with industry standards.

How do I prepare for a data science interview?

This is the essential checklist for background research when preparing for a data science interview:

  • Review the job posting.

  • Go to their website.

  • Study their competitors.

  • Check the company's values and culture.

  • Find out the company's recent achievements.

  • Research your interviewer.

Is coding asked in data science interviews?

Yes, depending on the data science role you have applied for and its responsibilities. Coding questions primarily revolve around computer science fundamentals and programming languages such as Python, SQL, and R.

How long is a data science interview?

Typically, an interview lasts 20 to 30 minutes, but this depends on the interviewer's requirements. The interviewer might continue to pose questions until they receive the answers they seek.

What are the key skills required for data science?
To gather the right skill set, joining an industry-vetted Data Science Course is always advised. You might also want to go through this top data science skills list.


The world of data science is dynamic and ever-evolving, and a successful interview is not just about solving problems but demonstrating a deep understanding of the field.

Remember, each question is a window into how you approach problem-solving, your analytical thinking, and your ability to communicate complex ideas. Do not ignore behavioural interview questions, as they also carry significant weight in a data science interview.

As you embark on your data science journey, keep honing your skills, stay curious, and embrace the challenges that come your way. Use this questionnaire as a reference for interviews and a tool to deepen your understanding of data science principles. Additionally, an interview is not just about answering the interviewer's questions. Interviewers usually give the candidate a chance to ask questions too, so prepare thoughtful ones in advance.

We wish you the best of luck in your interviews, and may your passion for data science continue to drive your success in this exciting and rapidly evolving field!


About the Author

Meet Nalini, a talented writer who enjoys baking and taking pictures in addition to contributing insightful articles. She contributes a plethora of knowledge to our blog with years of experience and skill.
