Building Effective Thinking Framework
During the practicing process, I recorded my mistakes and studied how to make my solution more effective. I summarized them all and built up a thinking framework (or flow chart of SQL queries) which include the following essential decision points.
Step 1 Define Selection
- What is the main table that covers the most complete information?
- Where are the desired columns? What tables are we gonna use?
- How to combine data from multiple tables? Join or Union? What types of Join/ Union?
In the beginning, the first decision is picking the right main table. We want to select the main table with the most of the desired columns and complete coverage of information. Use this main table as the backbone, and you can add other branch columns thereafter. Next, observe what columns are desired and what tables are needed to restore these information. Make a decision on how these tables are combined, via Joins or Unions.
Step 2 Handle Duplicates
- Are there duplicates in the table?
- How do the duplicates affect our results?
- Does the selection require DISTINCT? If yes, where should it locate?
Handling duplicates is especially important when it comes to counting number of rows, frequency, or some kind of performance indicators (e.g. acceptance rate). If there is NULL in data, be aware of how they’re gonna affect our results. Although it is not hard to avoid duplicates, people sometimes overlook checking duplicates in queries.
Step 3 Transformation
- What are the suitable functions to transform our data?
- Should we use single-row function, aggregated function or window function?
- If there is no function that directly serve your purpose, any alternative steps?
There are three types of function for data transformation. Single-row function act on each row and return one result per row. This includes transformation like arithmetic calculation, converting data type or manipulating data format. Multiple-row function (also called aggregated function/ group function) manipulate groups of rows to give one result per group of rows. This kind of transformation includes getting aggregated statistics like average, maximum, minimum, sum, count and so on. Also it is required to use group by clause to define which group of rows are used. The last type of function is window function (in MySQL, also called analytical function in Oracle), which manipulate group of rows to give one result per group, and report the same result to all the rows of the group. Window function is extremely helpful when the desired data is at granular level and needs to be filtered by their aggregated statistics. The decision point in transformation step is choosing a suitable and efficient function to serve the goal of your task. Using a right function can save your script and time.
Step 4 Refine Selection
- What rows are selected? What conditional expressions are used to filter data?
- Where should the conditional expression locate? WHERE/ HAVING/ ORDER BY/ JOIN ON?
- How to organize multiple conditional expressions? What subjects & operators are used?
This step is not necessarily the last step to execute, but you can examine your syntax again to refine your selection in an efficient way. When there are multiple conditions, venn diagram is a good way to assist your selection logic and refinement.
Some Examples to show WHY Thinking Framework Matters
[Example 1] Market Analysis I
This example is not hard but it clearly shows the importance of the ‘Step 1 Define Selection’.
Write an SQL query to find for each user, the join date and the number of orders they made as a buyer in 2019. Users table: +---------+------------+----------------+ | user_id | join_date | favorite_brand | +---------+------------+----------------+ | 1 | 2018-01-01 | Lenovo | | 2 | 2018-02-09 | Samsung | | 3 | 2018-01-19 | LG | | 4 | 2018-05-21 | HP | +---------+------------+----------------+ Orders table: +----------+------------+---------+----------+-----------+ | order_id | order_date | item_id | buyer_id | seller_id | +----------+------------+---------+----------+-----------+ | 1 | 2019-08-01 | 4 | 1 | 2 | | 2 | 2018-08-02 | 2 | 1 | 3 | | 3 | 2019-08-03 | 3 | 2 | 3 | | 4 | 2018-08-04 | 1 | 4 | 2 | | 5 | 2018-08-04 | 1 | 3 | 4 | | 6 | 2019-08-05 | 2 | 2 | 4 | +----------+------------+---------+----------+-----------+ Items table: +---------+------------+ | item_id | item_brand | +---------+------------+ | 1 | Samsung | | 2 | Lenovo | | 3 | LG | | 4 | HP | +---------+------------+ Result table: +-----------+------------+----------------+ | buyer_id | join_date | orders_in_2019 | +-----------+------------+----------------+ | 1 | 2018-01-01 | 1 | | 2 | 2018-02-09 | 2 | | 3 | 2018-01-19 | 0 | | 4 | 2018-05-21 | 0 | +-----------+------------+----------------+
The better solution: using Users as the main table as backbone is a better idea because we want results for ‘all users’.
SELECT U.user_id AS 'buyer_id', U.join_date,
SUM(CASE WHEN YEAR(O.order_date) = '2019' THEN 1 ELSE 0 END) AS 'orders_in_2019'
FROM Users U
LEFT JOIN Orders O
ON U.user_id = O.buyer_id
GROUP BY U.user_id
ORDER BY U.user_id;
The ineffective solution: the reason why this solution is not that effective or reliable is that we regard Order as the main table. If there is no record for any users in this table, then there will be NULL and the missing users can’t gather aggregated values.
SELECT O.buyer_id, U.join_date, SUM(CASE WHEN YEAR(order_date) = '2019' THEN 1 ELSE 0 END) AS 'orders_in_2019'
FROM Orders O
LEFT JOIN Users U
ON O.buyer_id = U.user_id
GROUP BY O.buyer_id
ORDER BY O.buyer_id
[Example 2] Find out those buyers that bought S8 but not iPhone
This example explains the importance of ‘Step 4 Refine Select’, showing how to use and locate conditional expression to include/ exclude the rows to obtain the final result.
Write an SQL query that reports the buyers who have bought S8 but not iPhone. Note that S8 and iPhone are products present in theProduct
table. Product table: +------------+--------------+------------+ | product_id | product_name | unit_price | +------------+--------------+------------+ | 1 | S8 | 1000 | | 2 | G4 | 800 | | 3 | iPhone | 1400 | +------------+--------------+------------+Sales
table: +-----------+------------+----------+------------+----------+-------+ | seller_id | product_id | buyer_id | sale_date | quantity | price | +-----------+------------+----------+------------+----------+-------+ | 1 | 1 | 1 | 2019-01-21 | 2 | 2000 | | 1 | 2 | 2 | 2019-02-17 | 1 | 800 | | 2 | 1 | 3 | 2019-06-02 | 1 | 800 | | 3 | 3 | 3 | 2019-05-13 | 2 | 2800 | +-----------+------------+----------+------------+----------+-------+ Result table: +-------------+ | buyer_id | +-------------+ | 1 | +-------------+ The buyer with id 1 bought an S8 but didn't buy an iPhone. The buyer with id 3 bought both.
The solution is first taking all buyers who bought S8, then exclude the buyers who also bought iphone.
SELECT DISTINCT S.buyer_id
FROM Sales S
LEFT JOIN Product P
ON S.product_id = P.product_id
WHERE P.product_name = 'S8'
AND S.buyer_id NOT IN (SELECT DISTINCT S.buyer_id FROM Sales S LEFT JOIN Product P ON S.product_id = P.product_id WHERE P.product_name = 'iPhone')
Below is a wrong solution because the subject of the later clause is not right. Where clause evaluate the conditional expression for the given subject per row. The following syntax only returns those buyers who bought S8 (because of course there is no iPhone buying record in the same row).
SELECT buyer_id
FROM Product as p
JOIN Sales as s ON p.product_id=s.product_id
WHERE p.product_name = 'S8' AND p.product_name != 'iPhone'
Recommendation for Practicing LeetCode
[1] Don’t rush to write code. Make sure you fully understand the question. Think of the framework and solution step by step.
[2] Use MySQL to assist in building your thinking framework and solution.
[3] If you believe there is a dedicated function to solve the question, feel free to google the suitable function. Don’t waste time on figuring out something you never learn.
[4] Take a look at other people’s solutions and seek a more effective solution. Reflect on what you learn.
Additional Reference: The Power of Framework Thinking