SQL Complexity and 8 Different Ways to Join Tables

Be Cool, Write Efficient SQL Queries

Recently while working on Salesforce Marketing Cloud platform, a colleague and I came across an SQL query written by an external vendor in a form that we were rather unfamiliar with. It goes like this:

https://medium.com/media/bd886a79dad5624e74bffa485565887d/href

This query performed very badly on Salesforce Marketing Cloud with very long run time and frequent timeouts, giving us much frustration. Unlike Google Bigquery which we are familiar with, Salesforce Marketing Cloud has a much lower compute and cannot process the load of a complex query.

Notice that two of its sub-queries referenced the main table in the where table:

Sub-queries referencing main query in the where function

This prompted an intellectual curiosity within the team; `How many ways can we write an SQL query and what is the time complexity of each query?`.

We decided to replay an experiment based on a 1988 article by Fabian Pascal to compare the performance of different ways of writing joins in an SQL query. Presenting ‘8 ways to join tables in SQL’ with added in time complexity from Google Bigquery, hope you enjoy this piece of work!

Method 1: Relational Algebra

Relational algebra is the most common way of writing a query and also the most natural way to do so. The code is clean, easy to troubleshoot, and unsurprisingly, it is also the most efficient way to join two tables.

https://medium.com/media/0d69100969f1a9fa35c77412cea1a774/href

Method 2: Uncorrelated Subquery

The uncorrelated subquery method executes the filter function by first creating a subquery list of account_number, followed by an IN function to filter account number in the subquery. Although efficiency is not as good as Relational Algebra, it is one of the more commonly used join method due to the ease of writing.

https://medium.com/media/7cb3500012e8af173d4e010836b37444/href

Method 3: Correlated Subquery

In the Correlated Subquery, the EXISTS function is used to search in an unfiltered subquery `SELECT *`. The filter operation in the subquery requires a `where mp.account_number = t.account_number`. This is one of the join function used in the inefficient query. The performance of this join is agonizing.

https://medium.com/media/4220e6744b45841ac7a84bf5de498ca8/href

Method 4: Scalar Subquery in the WHERE clause

By using a subquery as a filter on the WHERE function, the query is able to filter f_online = 1. Quite a cool way of thinking but unfortunately it doesn’t perform well.

https://medium.com/media/12cb7753302c94cae19450c618794ee7/href

Method 5: Scalar Subquery in the SELECT clause

Another really interesting way of writing a query, this method uses a subquery in the SELECT function to extract the account_number from another table, but as the two tables have a many to many relation, we have to add in a filter to remove the nulls.

https://medium.com/media/a567187f912219cd0577cf5a5e821f22/href

Method 6: Aggregate function to check existence

Similar to Scalar Subquery, this method uses a subquery in the WHERE function. The difference is that this method uses a subquery COUNT(*) with a filter of >1.​

https://medium.com/media/5553b906b52d53df2feba5a0a253fb27/href

Method 7: Correlated Subquery (double negative)

Similar to Correlated Subquery, but uses a double negative. This is also one of the join used in the inefficient query . But interestingly, it did not perform as badly as I expected. This could be simply due to the data structure where there are more exceptions than inclusions.

https://medium.com/media/7eb4a79f573e4a6bafd484d31d9ada1e/href

Method 8: Uncorrelated Subquery (double negative)

Similar to Uncorrelated subquery but uses a double negative.

https://medium.com/media/491d25464d6d2dafdbfbccf6ac613d75/href

In conclusion, the way we write our SQL query does have a big impact on the efficiency of the query. The most efficient query runs for 13.8 seconds, compared to the least efficient query runtime of 52.6 seconds. We re-wrote the inefficient query by the external vendor on Salesforce Marketing Cloud and we were able to reduce the run time from 45 mins coupled with frequent timeouts, to 4 mins. After that, we went for a nice cold beer and celebrated our victory.

Bonus

Queries don’t start from `SELECT` but `FROM`. Now we know why LIMIT doesn’t reduce compute cost. Source

SQL Queries run in this order

Reference
*1988 article by Fabian Pascal “SQL Redundancy and DBMS Performance” in the journal Database Programming & Design — http://www.dbdebunk.com/2013/02/language-redundancy-and-dbms.html
*Day 4: The Twelve Days of SQL: The way you write your query matters — https://iggyfernandez.wordpress.com/2011/12/04/day-4-the-twelve-days-of-sql-there-way-you-write-your-query-matters/

Thanks Daniel for the inspiration behind this article and Shawn for the the bonus material on SQL sequence. I am blessed with cool colleagues like you guys.

And thanks George for sharing Fabian Pascal’s 1988 experiment 😉


SQL Complexity and 8 Different Ways to Join Tables was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s