Saturday, October 11, 2014

SQL Server: Correlated Subqueries

The Question

I had an interview via web-meeting recently. I was given access to SQL Manager and I had to solve a set of TSQL programming problems. The first programming task they throw at me is a question, "What is a correlated subquery?" I said, "I do not know" and moved on. It turns out I knew what a correlates subquery was but I just did not know such queries were called that.

This post presents an overview of correlates subqueries in TSQL.

Subqueries

A subquery is just a portion of a query that is evaluated before main part of a query. To understand this, consider the following (from the same job interview):

SELECT *
FROM [TableA] AS A
  INNER JOIN [TableB] AS B
    ON A.[AID] = B.[AID]
  INNER JOIN [TableC] AS C
    ON A.[AID] = C.[AID]
  INNER JOIN [TableD] AS D
    ON A.[AID] = D.[AID]
  INNER JOIN [TableE] AS E
    ON A.[AID] = E.[AID]
WHERE [AName] = 'Smith'

The above query has no sub-query. The interviewer explained, "This query is suffering from poor performance. What would you do to optimize it?" The answer is obvious -- use a subquery to get only the required values from TableA before joining with TableB, TableC and TableD. The subquery that should get evaluated first is:

SELECT *
FROM [TableA] AS A
WHERE [AName] = 'Smith'

The optimized query, including the subquery in its entirety, is below where the subquery is shown in boldface:

SELECT *
FROM (
  SELECT *
  FROM [TableA] AS A
  WHERE [AName] = 'Smith') AS A
  INNER JOIN [TableB] AS B
    ON A.[AID] = B.[AID]
  INNER JOIN [TableC] AS C
    ON A.[AID] = C.[AID]
  INNER JOIN [TableE] AS E
    ON A.[AID] = E.[AID]
  INNER JOIN [TableE] AS E
    ON A.[AID] = E.[AID]

The subquery is executed by SQL Server's optimizer first. This will reduce the number of data values and will potentially improve performance when the outer query is executed. The previous query is a simple subquery or a non-correlated subquery.

Correlated Subqueries

A correlated subquery is a subquery that depends on values from the outer query. Technet covers correlated subqueries here Correlated Subqueries. To demonstrate correlated queries, consider a database with tables:
  • Player: this table holds a list of soccer players (football players for you non-Americans)
  • Goal: this table holds a list of goals scored by players
To demonstrate correlated subqueries, a query will be developed that shows all players who have not scored any goals:

SELECT P.[PlayerID], P.[PlayerName]
FROM [Player] AS P
WHERE NOT EXISTS (
  SELECT * 
  FROM [Goal] AS G 
  WHERE P.[PlayerID] = G.[PlayerID])

The outer query contains table Player aliased as P. The subquery contains table Goal aliased as G. The subquery references the table Player from the outer query so the subquery is a correlated subquery. The subquery in this case did not aid performance by reducing the results returned to the outer query. Performance may suffer when using a correlated subquery because the query has to be evaluated for ever row of the outer query.

So yes, I knew how to write a correlated subquery. I just did not know the definition. 


Implementing the query without using a Correlated Subquery

A LEFT JOIN in a subquery  can be used as follows to implement the same query -- all players who have not scored a single goal:

SELECT [PlayerID], [PlayerName]
FROM (
  SELECT P.[PlayerID], P.[PlayerName], G.[GoalID]
  FROM [Player] AS P
    LEFT JOIN [Goal] AS G 
      ON P.[PlayerID] = G.[PlayerID]) AS P
WHERE P.[GoalID] IS NULL

Conclusion

Yes, I moved onto the second interview.

1 comment:

  1. Hi Jan, in my opinion the answer to the slow running query is to index the IDs and the Name helping the query optimizer do its job.

    ReplyDelete