HPlogo ALLBASE/SQL Performance and Monitoring Guidelines: HP 9000 Computer Systems > Chapter 3 Guidelines on Query Design

Using Subqueries

» 

Technical documentation

Complete book in PDF
» Feedback

 » Table of Contents

 » Index

You can nest queries within the predicates of other queries. This makes it easier to express complex queries, and it makes it possible to formulate queries in several different ways. Thus you can create different queries which return the same rows but which do so with vastly different performance. In the following discussion, queries 1, 2, and 3 are all equivalent.

Most subqueries can also be expressed in an equivalent join form. For example, assuming that the PartNumber column in PurchDB.Parts is unique, the query:

(1) SELECT * FROM PurchDB.SupplyPrice

    WHERE PartNumber IN (SELECT PartNumber from PurchDB.Parts

                     WHERE PartName = 'Cache Memory Unit')

can also be written:

(2) SELECT PurchDB.SupplyPrice.* 

    FROM PurchDB.SupplyPrice, PurchDB.Parts

    WHERE PurchDB.SupplyPrice.PartNumber = PurchDB.Parts.PartNumber 

    AND PurchDB.Parts.PartName = 'Cache Memory Unit'

Both queries return the same information. In general, while the subquery is more easily understood and easier to formulate, the join actually improves performance because it gives the optimizer more efficient choices of how to execute the query. There are two types of subqueries: correlated and non-correlated. A correlated subquery is one in which the subquery makes reference to one or more columns of an outer query. For example,

(3) SELECT * FROM PurchDB.SupplyPrice

    WHERE EXISTS (SELECT PartNumber 

                  FROM PurchDB.Parts

                  WHERE PartName = 'Cache Memory Unit'

                  AND PartNumber = PurchDB.SupplyPrice.PartNumber)

This is a correlated subquery, since the subquery makes reference to PartNumber from the outer query. In this case, the subquery must be executed for each row returned from the outer query. For the above query, since the SupplyPrice table has 69 rows in it, there must be one scan on the outer query and 69 scans on the inner query for a total of 70 scans. The equivalent join shown in query (2) can be executed using a sort/merge technique, which takes two sorts and two scans.

It should be noted that the predicate WHERE PartNumber IN of query (1) is internally the same as the predicate WHERE EXISTS in query (3). All quantified predicates involving subqueries get transformed into EXISTS predicates internally. Therefore the nested query using the IN predicate also takes 70 scans.

In general, non-correlated subqueries are faster than joins, and correlated subqueries are slower than joins. You should know whether the subquery will return a single row or not so as to take advantage of the speed of non-correlated subqueries. Since we know that each part in the Parts table is unique, there is no reason why we could not express query (1) as a non-correlated subquery by simply replacing the IN with an equal sign (=). Then the subquery does not depend on the outer query and the subquery can be executed only once. In this case, there is one scan for the outer query and one scan for the subquery for a total of two scans. This is even faster than the sort/merge join since there is no sorting involved.

One exception to the rule that correlated subqueries are slower than an equivalent join is when the equivalent join involves an aggregate. For example, consider the following query, which is transformed into a correlated subquery internally by the query processor:

SELECT VendorNumber, PartNumber, DiscountQty 

FROM PurchDB.SupplyPrice

WHERE DiscountQty < ALL (SELECT DiscountQty 

                         FROM PurchDB.SupplyPrice 

                         WHERE VendorNumber = 9010)

becomes

SELECT VendorNumber, PartNumber, DiscountQty 

FROM PurchDB.SupplyPrice sp1 

WHERE NOT EXISTS (SELECT DiscountQty 

                  FROM PurchDB.SupplyPrice sp2

                  WHERE VendorNumber =9010 

                  AND sp2.DiscountQty <= sp1.DiscountQty)

The equivalent join would require the use of an aggregate, and would therefore be slower:

SELECT sp1.VendorNumber, sp1.PartNumber, sp1.DiscountQty

FROM PurchDB.SupplyPrice sp1, PurchDB.SupplyPrice sp2

WHERE sp2.VendorNumber = 9010

GROUP BY sp1.VendorNumber, sp1.PartNumber, sp1.DiscountQty

HAVING sp1.DiscountQty < MIN(sp2.DiscountQty)

In this case, because of the complexity of the query, the subquery is a better choice than the join. The best solution, however, is an equivalent non-correlated subquery:

SELECT VendorNumber, PartNumber, 

       DiscountQty from PurchDB.SupplyPrice

WHERE DiscountQty < (SELECT MIN(DiscountQty) 

                     FROM PurchDB.SupplyPrice 

                     WHERE VendorNumber = 9010)

When Not to Use DISTINCT in Subqueries

In general, you should avoid using the DISTINCT keyword in subqueries. DISTINCT does not change the query result and, in fact, hinders performance.

The following two queries return the same result:

(4) SELECT * FROM T1 WHERE C1 IN (SELECT DISTINCT C2 FROM T2);



(5) SELECT * FROM T1 WHERE C1 IN (SELECT C2 FROM T2);

ALLBASE/SQL transforms IN predicates and also other quantified predicates where the subquery can return multiple values to make them similar to an EXISTS predicate. Therefore, queries (4) and (5) would be transformed to the two following queries:

(6) SELECT * FROM T1 WHERE EXISTS (SELECT DISTINCT C2 FROM T2

				    WHERE T1.C1 = T2.C2);



(7) SELECT * FROM T1 WHERE EXISTS (SELECT C2 FROM T2      

				    WHERE T1.C1 = T2.C2);

These transformed queries are correlated in nature, meaning that the subquery result depends on values of the outer query. This means that the subquery needs to be reevaluated for each row of the outer query. Therefore, the performance of the subquery is critical.

When you use the DISTINCT keyword, the subquery must do a complete scan of the table, sort the values, eliminate duplicates, and then return TRUE if any rows are returned. Without the DISTINCT keyword, the subquery can scan until it finds the first qualifying row, return the row, return TRUE, and then terminate. As you can see, subqueries without the DISTINCT keyword are faster.

Feedback to webmaster