Top 50 SQL Interview Questions
Summary
Table of Content
Its obvious, isn't it? Any person who is working on data SHOULD know about data extraction or retrieval.
Hence, regardless of the data science role, having strong SQL skills is a major asset for anyone in the data field. It demonstrates your ability to work with data effectively, a crucial skill in today's data-driven world.
Company size and industry: Larger companies and those in data-driven industries like finance or healthcare will likely place a higher value on SQL skills.
Specific requirements of the job: Always pay attention to the job description. If it mentions SQL prominently, be prepared for a SQL-heavy interview.
Basic SQL interview questions
Following are basic SQL interview questions. A recruiter can consider them as SQL interview questions for freshers.
Q1: What is the difference between SELECT, INSERT, UPDATE, and DELETE statements?
-
SELECT: retrieves data from a table.
-
NSERT: Adds new records to a table.
-
UPDATE: Modifies existing records in a table.
-
DELETE: Removes records from a table.
Q2: How do you sort the results of a query using the ORDER BY clause?
The ORDER BY clause sorts the results of a query based on a specified column, either ascending (ASC) or descending (DESC).
SELECT * FROM Customers ORDER BY City ASC;
Q3: Write a SQL query to find the top 3 countries with the highest average purchase amount per customer, considering only countries where the average purchase amount exceeds $100 and there are at least 50 customers.
SELECT Country, AVG(TotalAmount) AS AvgPurchaseAmount
FROM (
SELECT c.Country, o.CustomerID, SUM(o.ProductPrice * o.Quantity * (1 - o.Discount)) AS TotalAmount
FROM Orders o
JOIN Customers c ON o.CustomerID = c.CustomerID
GROUP BY c.Country, o.CustomerID
) AS CustomerPurchaseAmounts
GROUP BY Country
HAVING AVG(TotalAmount) > 100 AND COUNT(DISTINCT CustomerID) >= 50
ORDER BY AvgPurchaseAmount DESC
LIMIT 3;
Q4: Write a SQL query to find the top 5 products that have been purchased the most frequently in consecutive months by the same customer, considering only those products that have been purchased for at least 6 consecutive months.
WITH MonthlyPurchaseCounts AS (
SELECT CustomerID, ProductID, EXTRACT(YEAR_MONTH FROM OrderDate) AS YearMonth, COUNT(*) AS PurchaseCount
FROM Orders
GROUP BY CustomerID, ProductID, EXTRACT(YEAR_MONTH FROM OrderDate)
),
ConsecutiveMonthGroups AS (
SELECT CustomerID, ProductID, YearMonth,
ROW_NUMBER() OVER (PARTITION BY CustomerID, ProductID ORDER BY YearMonth) -
ROW_NUMBER() OVER (PARTITION BY CustomerID, ProductID, PurchaseCount ORDER BY YearMonth) AS MonthDiff
FROM MonthlyPurchaseCounts
),
ConsecutiveMonths AS (
SELECT CustomerID, ProductID, MIN(YearMonth) AS StartMonth, MAX(YearMonth) AS EndMonth, COUNT(*) AS ConsecutiveMonths
FROM ConsecutiveMonthGroups
GROUP BY CustomerID, ProductID, MonthDiff
HAVING COUNT(*) >= 6
)
SELECT ProductID, SUM(PurchaseCount) AS TotalPurchaseCount
FROM ConsecutiveMonths
GROUP BY ProductID
ORDER BY TotalPurchaseCount DESC
LIMIT 5;
Q5: What is the difference between a GROUP BY clause and a HAVING clause?
GROUP BY: Groups rows based on shared values in one or more columns. HAVING: Filters groups created by GROUP BY based on a condition applied to aggregate functions.
SELECT City, COUNT(*) AS CustomerCount
FROM Customers
GROUP BY City
HAVING CustomerCount > 100;
Q6: Write a SQL query to find the customers who have made purchases in all product categories available in the store.
SELECT CustomerID
FROM (
SELECT CustomerID, COUNT(DISTINCT ProductCategoryID) AS CategoryCount
FROM Orders
GROUP BY CustomerID
) AS CustomerCategoryCounts
WHERE CategoryCount = (SELECT COUNT(DISTINCT ProductCategoryID) FROM ProductCategories);
Q7: What is a subquery, and how can it be used in SQL?
A subquery is a nested query embedded within another query. It allows you to perform more complex data retrieval based on conditions involving multiple tables.
Q8: Explain the concept of NULL values in SQL and how they differ from empty strings.
NULL represents the absence of a value, while an empty string is a character data type with no characters. NULL indicates "unknown" or "not applicable," while an empty string is a specific value (though sometimes unintended).
Q9: How can you combine results from multiple SELECT statements into a single result set? Explain the concept of UNION and UNION ALL.
UNION: Returns distinct rows (duplicates are removed). UNION ALL: Returns all rows, including duplicates.
SELECT CustomerName FROM Customers WHERE City = 'New York'
UNION
SELECT CustomerName FROM Customers WHERE City = 'Los Angeles';
Q10: Describe the concept of data integrity and how constraints like PRIMARY KEY and FOREIGN KEY help maintain it.
Data integrity ensures data accuracy and consistency within a database.
-
PRIMARY KEY: Enforces a unique identifier for each row in a table, preventing duplicate records.
-
FOREIGN KEY: Creates a link between two tables, referencing the PRIMARY KEY of another table, ensuring referential integrity (valid relationships between data points).
Q11: Explain the concept of transactions in SQL and the ACID properties (Atomicity, Consistency, Isolation, Durability).
A transaction is a unit of work involving multiple database operations. ACID properties ensure data integrity during transactions:
-
Atomicity: All operations succeed or fail as a whole.
-
Consistency: The database transitions from one valid state to another.
-
Isolation: Concurrent transactions do not interfere with each other.
-
Durability: Committed transactions persist even in the event of system failures.
Q12: Write a query to find the top N customers based on their total order amount.
ORDER BY: Order customers by total order amount (DESC for descending).
LIMIT: Specify the number of top N results to retrieve.
SELECT CustomerID, CustomerName, SUM(OrderAmount) AS TotalAmount
FROM Orders O
INNER JOIN Customers C ON O.CustomerID = C.CustomerID
GROUP BY CustomerID, CustomerName
ORDER BY TotalAmount DESC
LIMIT 10;
Q13: Describe the difference between a correlated subquery and a non-correlated subquery.
Non-correlated subquery: Evaluated independently for each row in the outer query. Doesn't reference any column from the outer query. Example (Non-correlated): Find orders with an amount exceeding the average order amount:
SELECT * FROM Orders
WHERE OrderAmount > (SELECT AVG(OrderAmount) FROM Orders);
Correlated subquery: References a column from the outer query in its WHERE clause. Evaluated once for each row in the outer query. Example (Correlated): Find employees who earn more than their manager:
SELECT e.EmployeeName
FROM Employees e
WHERE e.Salary > (SELECT ManagerSalary FROM Employees m WHERE m.EmployeeID = e.ManagerID);
Q14: Explain the concept of window functions with a use case.
Window functions operate on a set of rows within a query partition (defined by a window clause). They allow for calculations like ranking, moving averages, or cumulative sums within the result set.
Example: Rank employees within each department by salary:
SELECT EmployeeName, Department, RANK() OVER (PARTITION BY Department ORDER BY Salary DESC) AS Rank
FROM Employees;
Q15: Explain the concept of temporary tables and their advantages/disadvantages compared to permanent tables.
Temporary tables: Exist only for the duration of a session or transaction. Useful for intermediate calculations within complex queries without modifying existing data. Advantages: Improve query performance for complex operations, avoid modifying existing data unintentionally. Disadvantages: Not persistent, data is lost after the session ends.
Q16: Describe how to handle errors and exceptions in SQL.
Most databases offer mechanisms for handling errors within SQL statements. Techniques include:
-
TRY...CATCH block: Similar to programming languages, allows defining code to execute if an error occurs (CATCH block).
-
RAISE statement: Explicitly raise a custom error message to handle specific situations.
Q17: Describe the concept of triggers and their use cases in SQL.
Triggers are stored procedures automatically executed based on specific events (e.g., INSERT, UPDATE, DELETE) on a table. They allow for automated data validation, maintaining referential integrity, or logging changes.
Example: Trigger to enforce a minimum order amount:
CREATE TRIGGER validate_order_amount
BEFORE INSERT ON Orders
FOR EACH ROW
BEGIN
IF NEW.OrderAmount < 100 THEN
SIGNAL SQLSTATE '45000' SET MESSAGE_TEXT = 'Minimum order amount is $100';
END IF;
END;
Q18: Describe some basic techniques for backing up and restoring a database.
Database management system (DBMS) tools: Most DBMS offer built-in utilities for creating backups (full or incremental) at scheduled intervals.
SQL commands: Some databases allow using commands like EXPORT
and IMPORT
to export data to a file and restore it later.
Q19: Describe some best practices for writing efficient and maintainable SQL queries.
-
Use clear and descriptive naming conventions for tables, columns, and aliases.
-
Proper indentation and formatting to enhance readability.
-
Optimize WHERE clause conditions to avoid full table scans.
-
Utilize appropriate data types for columns based on the expected data.
-
Use comments to explain complex logic within the query.
Q20: Explain the concept of ETL (Extract, Transform, Load) and its role in data integration.
ETL: A process for extracting raw data from various sources, transforming it into a suitable format for storage and analysis, and loading the transformed data into a target system.
Role in data integration: Enables seamless data transfer and transformation between different data sources, ensuring data consistency and completeness for analysis.
By understanding these advanced concepts and best practices, you'll showcase your potential for tackling more intricate SQL challenges in the workplace.
Intermediate SQL interview questions
Following are SQL interview questions for less experienced candidates. This listing also includes SQL query interview questions (you should be able to come up with your own hypothetical scenario in such cases if needed).
Q1: Write a query to find all orders with an order amount exceeding the average order amount for that specific customer.
SELECT o.order_id, o.customer_id, o.amount
FROM orders o
WHERE o.amount > (
SELECT AVG(amount)
FROM orders o2
WHERE o2.customer_id = o.customer_id
);
Q2: Write a query to find the top 5 most popular products (based on total quantity ordered) across all orders.
SELECT product_id, SUM(quantity) AS total_quantity
FROM order_products
GROUP BY product_id
ORDER BY total_quantity DESC
LIMIT 5;
Q3: Write a query to transform the data, showing total sales quantity for each product across all dates and stores in separate columns.
SELECT sd.store_id,
p.product_id,
SUM(CASE WHEN op.product_id = p.product_id THEN op.quantity ELSE 0 END) AS product_quantity
FROM sales_data sd
LATERAL JOIN (SELECT DISTINCT product_id FROM sales_data) p ON TRUE -- Virtual table with all products
LEFT JOIN sales_data op ON sd.store_id = op.store_id AND sd.date = op.date
GROUP BY sd.store_id, p.product_id;
Q4: Write a query to find the top 3 actions performed by each user within the last day, ranked by their timestamps (earliest being 1).
SELECT user_id, action_type,
DENSE_RANK() OVER (PARTITION BY user_id ORDER BY timestamp) AS action_rank
FROM user_activity
WHERE timestamp >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
ORDER BY user_id, action_rank;
Q5: You have two tables: customers
(customer_id, name, country) and orders
(order_id, customer_id, order_date, amount). Write a query to find the total number of orders placed by customers from each country in the last quarter, excluding orders with an amount exceeding $1000.
SELECT p.product_id, p.name, p.category
FROM products p
WHERE p.average_rating > (
SELECT AVG(rating)
FROM ratings
);
WITH ProductAverage (product_id, average_rating) AS (
SELECT product_id, AVG(rating) AS average_rating
FROM ratings
GROUP BY product_id
)
UPDATE products p
SET p.average_rating = pa.average_rating
FROM ProductAverage pa
WHERE p.product_id = pa.product_id;
Q6: You have an employees
table with employee_id
, manager_id
, department_id
, and salary
columns. Write a query to find the total payroll cost for each department, considering all employees within the department hierarchy (including managers and their subordinates).
WITH DepartmentPayroll (department_id, employee_id, salary, level) AS (
SELECT e.department_id, e.employee_id, e.salary, 1
FROM employees e
WHERE e.manager_id IS NULL -- Starting point: Department heads
UNION ALL
SELECT dp.department_id, e.employee_id, e.salary, dp.level + 1
FROM DepartmentPayroll dp
INNER JOIN employees e ON dp.department_id = e.department_id
)
SELECT dp.department_id, SUM(salary) AS total_payroll
FROM DepartmentPayroll dp
GROUP BY dp.department_id;
Q7: Write a query to find the 7-day moving average of sales for each store.
SELECT sd.date, sd.store_id,
AVG(sales_amount) OVER (PARTITION BY sd.store_id ORDER BY sd.date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS moving_average
FROM sales_data sd;
Q8: You have three tables: customers
(customer_id, name, country), orders
(order_id, customer_id, order_date, amount), and products
(product_id, name, category). Write a query to find the total number of orders placed and the total revenue generated for each country, separating domestic and international orders (assuming a reference country).
SELECT c.country,
COUNT(*) AS total_orders,
SUM(CASE WHEN o.customer_id = c.customer_id THEN o.amount ELSE 0 END) AS domestic_revenue,
SUM(CASE WHEN o.customer_id != c.customer_id THEN o.amount ELSE 0 END) AS international_revenue
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
INNER JOIN products p ON o.product_id = p.product_id
WHERE c.country = 'Your Reference Country' -- Replace with your reference country
OR c.country != 'Your Reference Country'
GROUP BY c.country;
Q9: Write a query to find all users who have performed a specific action ("login") at least once but haven't performed any other actions ("logout") within the last hour.
SELECT ua.user_id
FROM user_activity ua
WHERE ua.action_type = 'login'
AND NOT EXISTS (
SELECT 1
FROM user_activity ua2
WHERE ua2.user_id = ua.user_id
AND ua2.action_type = 'logout'
AND ua2.timestamp >= DATE_SUB(CURDATE(), INTERVAL 1 HOUR)
);
Q10: Write a SQL query to find the top 5 departments with the highest average salary, but only consider departments where the number of employees is greater than 10, and order the result by the department name alphabetically.
SELECT Department, AVG(Salary) AS AvgSalary
FROM Employees
GROUP BY Department
HAVING COUNT(*) > 10
ORDER BY Department
LIMIT 5;
Q11: Write a SQL query to find the cumulative sum of salaries for each department, but only consider departments where the cumulative sum is greater than the average cumulative sum of all departments.
WITH DeptSalary AS (
SELECT Department, Salary,
SUM(Salary) OVER (PARTITION BY Department ORDER BY Salary) AS CumulativeSalary
FROM Employees
),
AvgCumulative AS (
SELECT AVG(CumulativeSalary) AS AvgCumulativeSalary
FROM DeptSalary
)
SELECT Department, CumulativeSalary
FROM DeptSalary
INNER JOIN AvgCumulative ON DeptSalary.CumulativeSalary > AvgCumulative.AvgCumulativeSalary;]
Q12: Write a SQL query to find the top 3 most common categories that occur together in the "ProductCategories" table
SELECT Category1, Category2, COUNT(*) AS CombinationCount
FROM ProductCategories pc1
INNER JOIN ProductCategories pc2 ON pc1.ProductID = pc2.ProductID AND pc1.CategoryID < pc2.CategoryID
GROUP BY Category1, Category2
ORDER BY CombinationCount DESC
LIMIT 3;
Q13: Write a SQL query to find the total sales amount for each month of the year, including months with zero sales, from the "Sales" table.
SELECT EXTRACT(MONTH FROM SaleDate) AS Month, COALESCE(SUM(SaleAmount), 0) AS TotalSales
FROM generate_series(1, 12) AS months
LEFT JOIN Sales ON EXTRACT(MONTH FROM SaleDate) = months
GROUP BY Month
ORDER BY Month;
Q14: Write a SQL query to find the top 3 pairs of customers who have made purchases together the most frequently. Consider only those pairs who have bought at least one product in common and have made purchases within a 30-minute time frame of each other. Additionally, ensure that both customers in each pair have made purchases on at least 10 different days.
WITH CustomerPairs AS (
SELECT o1.CustomerID AS Customer1, o2.CustomerID AS Customer2, COUNT(*) AS PurchaseCount
FROM Orders o1
JOIN Orders o2 ON o1.ProductID = o2.ProductID AND o1.CustomerID < o2.CustomerID
WHERE ABS(EXTRACT(EPOCH FROM (o1.OrderDateTime - o2.OrderDateTime))) <= 1800 -- 30 minutes in seconds
GROUP BY o1.CustomerID, o2.CustomerID
HAVING COUNT(*) >= 10
)
SELECT Customer1, Customer2, PurchaseCount
FROM CustomerPairs
ORDER BY PurchaseCount DESC
LIMIT 3;
Q15: Write a SQL query to find the top 5 products that have been reordered the most frequently. Consider only those products that have been reordered within 10 days of the initial order and have been reordered at least 3 times
WITH ReorderedProducts AS (
SELECT ProductID, COUNT(*) AS ReorderCount
FROM Orders
WHERE Reorder = TRUE
AND EXTRACT(EPOCH FROM (ReorderDate - OrderDate)) <= 864000 -- 10 days in seconds
GROUP BY ProductID
HAVING COUNT(*) >= 3
)
SELECT ProductID, ReorderCount
FROM ReorderedProducts
ORDER BY ReorderCount DESC
LIMIT 5;
Q16: Write a SQL query to find the top 3 customers who have made the most consecutive monthly purchases, considering only those customers who have made purchases in every month of the year.
WITH MonthlyPurchaseCounts AS (
SELECT CustomerID, EXTRACT(MONTH FROM OrderDateTime) AS Month, COUNT(*) AS PurchaseCount
FROM Orders
GROUP BY CustomerID, EXTRACT(MONTH FROM OrderDateTime)
),
CustomersWithConsecutivePurchases AS (
SELECT CustomerID, COUNT(*) AS ConsecutiveMonths
FROM (
SELECT CustomerID, Month,
ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY Month) -
EXTRACT(MONTH FROM MIN(OrderDateTime)) AS MonthDiff
FROM MonthlyPurchaseCounts
) AS Subquery
GROUP BY CustomerID, MonthDiff
HAVING COUNT(*) = 12 -- 12 consecutive months in a year
)
SELECT CustomerID, ConsecutiveMonths
FROM CustomersWithConsecutivePurchases
ORDER BY ConsecutiveMonths DESC
LIMIT 3;
Scenario-based SQL interview questions
Following are scenario-based SQL interview questions for experienced candidates
Scenario 1: You manage a website and have a database table named "website_traffic" that stores information about user visits. The table includes columns like date
, user_id
, page_visited
, and duration
. The task is to write a query to provide insights into website traffic for the marketing team. They'd like to see:
-
The total number of unique visitors for the past month.
-
The top 5 most visited pages for the past week.
-
The average time spent on the website by new users (users without previous visits) for the past day.
-- Total Unique Visitors (Past Month)
SELECT COUNT(DISTINCT user_id) AS total_unique_visitors
FROM website_traffic
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH);
-- Top 5 Most Visited Pages (Past Week)
SELECT page_visited, COUNT(*) AS visits
FROM website_traffic
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
GROUP BY page_visited
ORDER BY visits DESC
LIMIT 5;
-- Average Time Spent by New Users (Past Day)
SELECT AVG(duration) AS avg_duration_new_users
FROM website_traffic w
WHERE w.date = CURDATE()
AND NOT EXISTS (
SELECT 1 FROM website_traffic u
WHERE u.user_id = w.user_id AND u.date < CURDATE()
);
Scenario 2: You're an engineer for a company that uses a distributed logging system. The system generates logs for various events (e.g., user actions, database access, error messages), stored in a database. Write queries to:
-
Filter logs for a specific user ID and time range.
-
Group logs by event type and calculate the number of occurrences for each event type.
-
Identify the top 10 most frequently occurring errors in the database.
-- Filter Logs for a User and Time Range
SELECT l.timestamp, l.level, l.message
FROM logs l
WHERE l.user_id = 12345 AND l.timestamp >= DATE_SUB(CURDATE(), INTERVAL 24 HOURS)
AND l.timestamp < CURDATE();
-- Group Logs by Event Type
SELECT event_type, COUNT(*) AS occurrences
FROM logs l
GROUP BY l.event_type;
-- Find Top 10 Most Frequent Errors
SELECT error_code, COUNT(*) AS occurrences
FROM errors e
GROUP BY e.error_code
ORDER BY occurrences DESC
LIMIT 10;
Scenario 3: You're a data analyst for a retail company. The database has tables named "products" (product_id, name, price, category) and "sales" (order_id, product_id, quantity, date). The task is to create a report for the sales manager that includes:
-
Total revenue generated for each product category in the last quarter.
-
The top 5 performing products (by total revenue) across all categories for the entire year so far.
-
A list of products with a significant drop (e.g., 20% or more) in sales quantity compared to the previous quarter.
-- Total Revenue per Category (Last Quarter)
SELECT c.category, SUM(s.price * s.quantity) AS total_revenue
FROM sales s
INNER JOIN products p ON s.product_id = p.product_id
INNER JOIN categories c ON p.category = c.category_id -- Assuming a categories table exists
WHERE s.date >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
AND s.date < CURDATE()
GROUP BY c.category;
-- Top 5 Performing Products (Year-to-Date)
SELECT p.name, SUM(s.price * s.quantity) AS total_revenue
FROM sales s
INNER JOIN products p ON s.product_id = p.product_id
WHERE s.date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
AND s.date < CURDATE()
GROUP BY p.product_id, p.name
ORDER BY total_revenue DESC
LIMIT 5;
-- Products with Significant Sales Drop (vs. Previous Quarter)
WITH PreviousQuarterSales AS (
SELECT product_id, SUM(quantity) AS prev_quarter_sales
FROM sales
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 4 MONTH)
AND date < DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
GROUP BY product_id
)
SELECT p.name, s.quantity, (prev_quarter_sales.prev_quarter_sales - s.quantity) AS sales_drop
FROM sales s
INNER JOIN products p ON s.product
Scenario 4: You're a database administrator responsible for ensuring data quality. The database has a table named "customers" with fields for name, email, phone number, and address. You want to write queries to:
-
Validate that a customer's email address is in the format
user@domain.com
. -
Filter out customers with missing phone numbers or invalid addresses.
-
Calculate the average phone number length for all customers.
-- Validate Customer Email Format
SELECT name, email, email_pattern
FROM customers
WHERE email_pattern NOT IN ('[a-z0-9-]+\.[a-z]{2,5}$');
-- Filter Customer by Phone Number
SELECT name, email, phone_number
FROM customers
WHERE phone_number IS NOT NULL;
-- Calculate Average Phone Number Length
SELECT AVG(LENGTH(phone_number)) AS average_phone_length
FROM customers;
Scenario 5: You're a data analyst tasked with updating a customer's name if it's misspelled in the database. The table named "customers" has a name column. Write a query to:
-
Check for misspelled names and update them to the correct spelling.
-
Avoid updating names that are already correct.
UPDATE customers c
SET c.name = REPLACE(c.name, '-', '')
WHERE REPLACE(c.name, '-', '') IN (
SELECT REPLACE(customer_name, '-', '')
FROM customers
WHERE customer_name IS NOT NULL
);
Scenario 6: You're the database administrator for a retail store. The database has tables named "products" (product_id, name, quantity, reorder_level), "suppliers" (supplier_id, name, contact_info), and "inventory_log" (log_id, product_id, quantity_change, date, note). Write queries to achieve the following:
-
Identify products with stock levels below the reorder threshold.
-
Find the total quantity of each product received from all suppliers in the past quarter.
-- Products Below Reorder Threshold
SELECT p.name, p.quantity
FROM products p
WHERE p.quantity < p.reorder_level;
-- Total Quantity Received per Product (Past Quarter)
SELECT p.name, SUM(il.quantity_change) AS total_received
FROM products p
INNER JOIN inventory_log il ON p.product_id = il.product_id
WHERE il.quantity_change > 0 -- Filter for positive quantity changes (incoming stock)
AND il.date >= DATE_SUB(CURDATE(), INTERVAL 3 MONTH)
AND il.date < CURDATE()
GROUP BY p.product_id, p.name;
Scenario 7: You're a marketing analyst for a clothing store. The database has tables named "customers" (customer_id, name, email, purchase_history), and "transactions" (transaction_id, customer_id, amount, date, items_purchased). Write queries to:
Segment customers based on their total purchase amount in the past year (e.g., High Spenders, Medium Spenders, Low Spenders).
Identify customers who haven't made a purchase in the last 6 months.
-- Customer Segmentation by Purchase Amount (Past Year)
WITH CustomerSpend AS (
SELECT customer_id, SUM(amount) AS total_spent
FROM transactions
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
AND date < CURDATE()
GROUP BY customer_id
)
SELECT c.name, cs.total_spent,
CASE WHEN cs.total_spent >= (
SELECT AVG(total_spent) * 1.5 FROM CustomerSpend
) THEN 'High Spender'
WHEN cs.total_spent >= (
SELECT AVG(total_spent) * 0.5
) THEN 'Medium Spender'
ELSE 'Low Spender'
END AS spending_category
FROM customers c
INNER JOIN CustomerSpend cs ON c.customer_id = cs.customer_id;
-- Customers with No Purchases (Last 6 Months)
SELECT name, email
FROM customers c
WHERE customer_id NOT IN (
SELECT customer_id FROM transactions
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 6 MONTH)
AND date < CURDATE()
);
Scenario 8: You're a database developer tasked with optimizing the performance of a critical application. The database has tables for user login attempts ("login_attempts", user_id, timestamp, success_flag) and application errors ("errors", error_code, timestamp, details). Write queries to:
-
Identify users with a high number of failed login attempts (e.g., exceeding 5 attempts) within the last hour.
-
Find the most frequently occurring application errors within the past day.
-- Users with Excessive Failed Login Attempts (Past Hour)
SELECT u.username, COUNT(*) AS failed_attempts
FROM users u
INNER JOIN login_attempts la ON u.user_id = la.user_id
WHERE la.success_flag = 0 -- Filter for failed attempts
AND la.timestamp >= DATE_SUB(CURDATE(), INTERVAL 1 HOUR)
AND la.timestamp < CURDATE()
GROUP BY u.username
HAVING failed_attempts > 5; -- Adjust threshold as needed
-- Most Frequent Application Errors (Past Day)
SELECT error_code, COUNT(*) AS occurrences
FROM errors
WHERE timestamp >= DATE_SUB(CURDATE(), INTERVAL 1 DAY)
AND timestamp < CURDATE()
GROUP BY error_code
ORDER BY occurrences DESC
LIMIT 10;
Scenario 9: You're a database architect designing a trigger to automatically update stock prices whenever a new order is placed. The table named "orders" has a order_id
, customer_id
, and quantity
column. Write a trigger to update the stock_level
for a product in the products
table:
CREATE TRIGGER update_stock_levels
AFTER INSERT ON orders
FOR EACH ROW
BEGIN
UPDATE products p
SET p.stock_level = p.quantity + NEW.quantity
WHERE p.product_id = NEW.product_id;
END;
Scenario 10: You're facing a performance bottleneck in a query that involves filtering millions of records. Analyze the query using tools like SQL Profiler and identify performance bottlenecks. Then, suggest optimization techniques to improve performance, such as using indexes, subqueries, or window functions.
Analyze the query using SQL Profiler to identify the most time-consuming operations. Based on the analysis, suggest optimization techniques such as:
-
Using an index on the
customer_id
column to improve filtering performance. -
Replacing expensive subqueries with JOINs or aggregation functions.
-
Using window functions to analyze data over a specific period or window. By implementing these optimization techniques, you can improve the performance of your SQL query and make it more efficient for large data sets.
Scenario 11: You're a data analyst for a delivery service company. The database has tables named "customers" (customer_id, name, address) and "orders" (order_id, customer_id, timestamp, delivery_address). Write queries to achieve the following:
- Identify the top 5 busiest delivery zones based on the number of orders in the past week.
- Calculate the average delivery time for orders placed within a specific zip code in the past month.
-- Top 5 Busiest Delivery Zones (Past Week)
SELECT zone_name, COUNT(*) AS order_count
FROM (
SELECT c.address, ST_GEOMFROMTEXT(CONCAT('POINT(', REPLACE(delivery_address, ',', ' '), ')')) AS delivery_point
FROM customers c
INNER JOIN orders o ON c.customer_id = o.customer_id
WHERE o.timestamp >= DATE_SUB(CURDATE(), INTERVAL 1 WEEK)
AND o.timestamp < CURDATE()
) AS delivery_data
GROUP BY zone_name -- Assuming a zone_name can be derived from the address
ORDER BY order_count DESC
LIMIT 5;
-- Average Delivery Time for Specific Zip Code (Past Month)
SELECT AVG(TIMEDIFF(o.delivered_at, o.placed_at)) AS avg_delivery_time
FROM orders o
WHERE SUBSTRING_INDEX(delivery_address, ',', -1) = '12345' -- Replace '12345' with the specific zip code
AND o.delivered_at IS NOT NULL -- Filter for delivered orders only
AND o.timestamp >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
AND o.timestamp < CURDATE();
Scenario 12: You're a data scientist working with historical stock prices. The database has a table named "stock_prices" (date, symbol, open_price, close_price) with daily stock price data. Write queries to find:
-
The stock with the highest average daily closing price for the past year.
-
The month with the highest average daily price volatility (absolute difference between opening and closing price) for a specific stock symbol.
SELECT symbol, AVG(close_price) AS avg_closing_price
FROM stock_prices
WHERE date >= DATE_SUB(CURDATE(), INTERVAL 1 YEAR)
AND date < CURDATE()
GROUP BY symbol
ORDER BY avg_closing_price DESC
LIMIT 1;
-- Month with Highest Volatility (Specific Stock)
WITH MonthlyVolatility AS (
SELECT YEAR(date) AS year, MONTH(date) AS month, symbol, AVG(ABS(open_price - close_price)) AS avg_volatility
FROM stock_prices sp
WHERE sp.symbol = 'AAPL' -- Replace 'AAPL' with the stock symbol
GROUP BY YEAR(date), MONTH(date), symbol
)
SELECT month, year
FROM MonthlyVolatility
ORDER BY avg_volatility DESC
LIMIT 1;
Scenario 13: You're a data engineer tasked with implementing data lineage tracking for a data warehouse. You want to track how data from a source table ("source_data") is transformed and loaded into a target table ("target_data"). Write a script or procedure to:
-
Capture information about the transformation process, such as the timestamp of data loading and the user who triggered the process.
This scenario might require a combination of database features and scripting depending on the specific platform.
Create a Lineage Tracking Table:
CREATE TABLE data_lineage (
source_table_name VARCHAR(255) NOT NULL,
target_table_name VARCHAR(255) NOT NULL,
load_timestamp DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
loaded_by VARCHAR(255) DEFAULT CURRENT_USER, -- Capture username if applicable
transformation_script VARCHAR(MAX) DEFAULT NULL -- Optional for storing script details
);
This table stores information about each data loading event, including:
source_table_name
: Name of the source table from which data is extracted.target_table_name
: Name of the target table in the data warehouse where data is loaded.load_timestamp
: Timestamp of the data loading process.loaded_by
: Username of the user who triggered the data load (if user authentication is available).transformation_script
(Optional): This can store details about the data transformation script used (e.g., stored procedure name or script content).
Trigger on Data Load Process:
Here's a pseudocode example of a trigger that inserts a record into the data_lineage
table whenever the data loading process is executed:
Remember, the specific solutions will vary depending on the nature of your specific problem and the database schema you're working with. Be sure to carefully consider all factors before implementing any solutions, and test your queries thoroughly to ensure they meet your requirements.
-- Assuming the data loading process is triggered by a stored procedure
CREATE TRIGGER lineage_tracking_trigger
AFTER UPDATE ON source_data -- Modify table name based on your data loading mechanism
FOR EACH ROW
BEGIN
INSERT INTO data_lineage (source_table_name, target_table_name)
VALUES ('source_data', 'target_data'); -- Update table names as needed
END;
This trigger fires after data is updated in the source table (assuming your data loading process updates the target table). You might need to modify the trigger based on your specific data loading mechanism (e.g., triggers might not be applicable for all ETL tools).
-
Data Lineage Visualization (Optional):
Develop a separate script or tool to periodically query the
data_lineage
table and process the information. You can achieve this using:- SQL Script: Write a script that retrieves data from
data_lineage
and joins it with relevant tables (e.g., user table) to obtain additional details. This script can then format the data into a desired format (e.g., CSV, JSON). - Programming Language: Use a programming language like Python to connect to the database, query the
data_lineage
table, and potentially access transformation scripts if stored elsewhere. Process and visualize the lineage information using libraries likepandas
(Python) for data manipulation and visualization libraries likematplotlib
for creating lineage diagrams.
- SQL Script: Write a script that retrieves data from
By implementing this approach, you can track data transformations and loading events within your data warehouse. The specific implementation details might vary depending on your database platform and data loading tools. Remember to test and refine the solution to ensure it captures the necessary information for your data lineage needs.
Scenario 13: You're a database administrator responsible for maintaining a production database. You want to implement a version control system to track changes made to the database schema and data. Describe the limitations of traditional file system backups for version control and suggest alternative approaches.
Data Sources
-
Relational Database Table:
customer_data
- Columns: customer_id, name, address, purchase_history (date, amount, product_id)
-
Flat File:
product_info.csv
- Columns: product_id, name, description, category
-
JSON Feed: Real-Time Social Media Sentiment
- Attributes: category, sentiment_score, timestamp
1. ETL Process Design
-
Extract:
-
customer_data
: Use SQL to extract customer data. -
product_info.csv
: Read CSV data via Python (e.g., Pandas). -
Social Media Sentiment: Extract JSON data via API.
-
-
Transform:
- Normalize customer names and addresses.
- Standardize product categories.
- Aggregate sentiment data by category.
import pandas as pd
product_info = pd.read_csv('product_info.csv')
product_info['category'] = product_info['category'].str.lower()
Load:
- Load transformed data into the data warehouse using batch processing (e.g., SQL COPY, Sqoop).
COPY customer_purchases FROM '/path/to/transformed/customer_data.csv' WITH CSV HEADER;
2. Unified Dimensional Data Model
-
Fact Tables:
-
Customer Purchases Fact Table:
- Grain: One row per purchase
- Columns: customer_id, product_id, purchase_date, amount, sentiment_score
-
CREATE TABLE customer_purchases (
purchase_id SERIAL PRIMARY KEY,
customer_id INT,
product_id INT,
purchase_date DATE,
amount DECIMAL,
sentiment_score FLOAT
);
Dimension Tables:
- Customer Dimension: customer_id, name, address
- Product Dimension: product_id, name, description, category
CREATE TABLE customer_dimension (
customer_id INT PRIMARY KEY,
name VARCHAR(255),
address VARCHAR(255)
);
CREATE TABLE product_dimension (
product_id INT PRIMARY KEY,
name VARCHAR(255),
description TEXT,
category VARCHAR(255)
);
Handling Data Inconsistencies and Missing Information
- Data Aggregation: Aggregate sentiment scores per category.
- Data Normalization: Standardize customer addresses and product categories.
By designing a consistent ETL process and implementing a unified dimensional data model, comprehensive customer analysis is enabled with minimal data inconsistencies.
Scenario 14: You're a data analyst for a ride-sharing company. The database has tables named:
- "rides" (ride_id, driver_id, passenger_id, pickup_location, dropoff_location, timestamp)
- "drivers" (driver_id, name, average_rating)
- "passengers" (passenger_id, name, number_of_rides)
- "locations" (location_id, address, latitude, longitude)
Challenge:
Develop an analytical query to answer the following question:
- Identify the top 5 busiest city districts for ride requests in the past month, considering both pickup and dropoff locations.
WITH RideLocations AS (
SELECT ride_id,
ST_GEOMFROMTEXT(CONCAT('POINT(', pickup_longitude, ',', pickup_latitude, ')')) AS pickup_point,
ST_GEOMFROMTEXT(CONCAT('POINT(', dropoff_longitude, ',', dropoff_latitude, ')')) AS dropoff_point
FROM rides
INNER JOIN locations p ON p.location_id = rides.pickup_location_id
INNER JOIN locations d ON d.location_id = rides.dropoff_location_id
WHERE rides.timestamp >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
AND rides.timestamp < CURDATE()
),
RideDistricts AS (
SELECT rl.ride_id,
SUBSTRING_INDEX(SUBSTRING_INDEX(p.address, ',', -2), ',', 1) AS pickup_district,
SUBSTRING_INDEX(SUBSTRING_INDEX(d.address, ',', -2), ',', 1) AS dropoff_district
FROM RideLocations rl
INNER JOIN locations p ON ST_Dwithin(rl.pickup_point, p.geom, 50) -- Adjust radius as needed
Scenario 15: You're a data engineer for a social media platform. The platform uses a sharded database architecture to manage a massive amount of user data. The database has several shards, each containing user tables with attributes like user ID, name, location, and a "friends" table storing friend connections between users.
Challenge:
Develop a solution to identify the top 5 most influential users on the platform based on a combination of factors, including:
-
Number of followers (in-degree of the "friends" table for each user)
-
Engagement level (measured by the frequency of user posts and comments)
-
Activity on the platform (number of logins in the past month)
-
Sharded Database Considerations:
- Utilize tools or APIs provided by the sharding platform to access data across various shards.
- Alternatively, pre-process the data and aggregate relevant information (e.g., follower count) within each shard before querying for the final ranking.
-
Calculating Engagement Level (UDF):
- Create a UDF that calculates engagement based on the user ID. This function could:
- Query the posts and comments tables to count the number of posts and comments created by the user.
- Assign weights to posts and comments (e.g., posts might have a higher weight than comments).
- Calculate a total engagement score based on the weighted sum of posts and comments.
- Create a UDF that calculates engagement based on the user ID. This function could:
SELECT user_id,
your_engagement_udf(user_id) AS engagement_score
FROM users u
),
UserActivity AS (
SELECT user_id,
COUNT(*) AS login_count
FROM user_logins ul
WHERE ul.login_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
AND ul.login_date < CURDATE()
GROUP BY user_id
)
SELECT u.user_id, u.name,
f.follower_count, ue.engagement_score, ua.login_count,
(f.follower_count * 0.5) + (ue.engagement_score * 0.3) + (ua.login_count * 0.2) AS influence_score
FROM users u
INNER JOIN (SELECT user_id, COUNT(*) AS follower_count FROM friends GROUP BY user_id) f ON u.user_id = f.user_id
INNER JOIN UserEngagement ue ON ue.user_id = u.user_id
INNER JOIN UserActivity ua ON ua.user_id = u.user_id
ORDER BY influence_score DESC
LIMIT 5;
SQL Interview Preparation
Here are some tips to help you prepare for an SQL interview:
Pre SQL Interview
-
Solidify the fundamentals: Ensure you have a strong grasp of core SQL concepts like data types, DDL and DML statements, JOINs, filtering (WHERE clause), aggregation functions, and basic database design principles (normalization).
-
Practice, practice, practice: Many online resources offer SQL practice problems. Platforms like SQLZoo: https://sqlzoo.net/wiki/SQL_Tutorial and HackerRank offer interactive exercises.
-
Focus on the role: Review the job description and tailor your preparation to the specific requirements. If the role emphasizes data analysis, prioritize questions related to filtering, aggregation, and working with large datasets.
-
Brush up on advanced topics (if applicable): Depending on the role, explore advanced concepts like subqueries, window functions, stored procedures, and database security (preventing SQL injection).
During SQL interview
-
Clarify the problem: Before diving into writing a query, take a moment to understand the scenario presented in the question. Ask clarifying questions if needed.
-
Think out loud: Explain your thought process as you approach the problem. This demonstrates your understanding and helps the interviewer gauge your thought patterns.
-
Start simple, then optimize: Write a basic query that retrieves the desired data. Then, refine it for efficiency by using appropriate JOINs, filtering techniques, and indexes (if applicable).
-
Explain your choices: Briefly explain why you chose a particular approach (e.g., using an INNER JOIN vs. a LEFT JOIN).
-
Test your query: If possible, ask the interviewer if you can test your query with sample data to ensure it produces the correct output. This shows attention to detail and avoids errors.
Common mistakes in a SQL interview
During a SQL interview, even experienced candidates can make mistakes. Here are some common pitfalls to avoid:
Syntax Errors:
-
Typos: Double-check your query for typos in keywords, table names, column names, and punctuation. A single typo can render your query non-functional.
-
Incorrect Brackets/Quotes: Ensure proper use of parentheses, brackets, and quotes around strings and aliases.
Conceptual Errors:
-
Mixing WHERE and HAVING: The WHERE clause filters data before aggregation, while HAVING filters aggregated data. Be clear on when to use each.
-
Incorrect JOIN Types: Understanding INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL JOIN is crucial. Choose the appropriate JOIN based on the desired relationship between tables.
-
UNION vs. UNION ALL: UNION removes duplicates while UNION ALL keeps them. Select the right operator based on whether you want distinct results or all combined rows.
Logic and Efficiency:
-
Not reading the question carefully: Take time to understand the specific data points and manipulations required by the question before writing your query.
-
Forgetting to filter (WHERE clause): Ensure you include a WHERE clause to filter data based on specific conditions.
-
Not using aliases: Assigning aliases to tables can improve readability and avoid confusion, especially when working with multiple tables.
-
Inefficient JOINs: Consider using appropriate JOINs and filtering techniques to optimize your query's performance, especially when dealing with large datasets.
-
Not explaining your thought process: Clearly explain your thought process and why you chose a particular approach. This demonstrates your problem-solving skills.
Other Mistakes:
-
Overlooking ORDER BY: If the question requires results in a specific order, don't forget to include an ORDER BY clause with the appropriate sorting criteria.
-
Forgetting to handle NULL values: NULL values can lead to unexpected results. Consider using functions like ISNULL or COALESCE to handle them appropriately.
-
Not testing your query: If possible, try to test your query with sample data to ensure it produces the correct output.
By being aware of these common mistakes and practicing good habits, you can significantly improve your chances of success in your SQL interview.
Frequently Asked Questions (FAQ)
Can a non-IT person understand SQL?
Yes, a non-IT person can understand SQL. It is a specialized skill that is relatively accessible compared to other programming languages, making it a popular choice for individuals from various professional backgrounds.
Is it necessary to join a data science course to master SQL?
No, it is not necessary to join a data science course to master SQL. Though it can be learnt independently through the abundant online resources, data science course provides a structured guidance from the industry experts especially if a data science job is being targeted.
Why are SQL skills important for data-related jobs?
SQL skills are crucial because they allow you to efficiently create, retrieve, manipulate, and analyze data within databases, which is fundamental for roles in data analysis, data science, business intelligence, and software engineering.
What are some common SQL commands I should know for an interview?
Some fundamental SQL commands include SELECT, INSERT, UPDATE, DELETE, WHERE, JOIN, GROUP BY, and ORDER BY. Understanding these commands and when to use them is essential for any SQL-related position.
Can you provide examples of intermediate SQL concepts that might be covered in an interview?
Intermediate concepts often include understanding different types of JOINs (INNER, LEFT, RIGHT, FULL), using subqueries, implementing aggregation functions like COUNT, SUM, AVG, MAX, and MIN, and using GROUP BY and HAVING clauses.
Are there any specific database systems I should focus on?
While SQL syntax is fairly standardized, some nuances differ between systems like MySQL, PostgreSQL, Oracle, and SQL Server. Familiarity with the specific SQL dialect used by the employer's database system can be advantageous.
How should I explain my SQL solution during an interview?
Be clear and concise in your explanation, discuss your thought process, and justify why you chose a particular approach. Demonstrating your reasoning skills is as important as getting the right answer.