Join Ordering Problem

2021-01-05 by xiaoguang

Intro

Join ordering problem is the core topic of a query optimizer. This problem is considered complex because even 5 or 6 join relations could create a search space whose size exceeds 1000 different query plans. This complexity comes from many factors, including many different join types (inner join, outer join, cross product) to support, many different physical implementations and cost functions, etc.

However, join ordering problem is also a “simple” one because it’s a really specific problem: to find the optimal join orders (join tree) of the query.

Join Shapes

The following image shows several different join shapes. The dot stands for a relation, the line connecting two dots stands for a join predicate between two dots.

Fig 1: different join shapes Fig 1: different join shapes

BTW, relational algebra operators like filter, project etc. are not present in the graph. So the join graph is restricted to join ordering problems, it requires more work to make a general query optimizer.

Literatures show that, only some certain join graph has definit solutions, but generally, join ordering poblem are considered NP-hard.

Join Trees

The result join trees also have different shapes, like left-deep trees, right-deep trees, bushy trees etc.

Solutions

There are many ways to category the solutions. But here we want to list 2 different solution approaches.

Transformation

These are very earlier kind of solutions, they are adopted by Vocalno/Cascade framework. The query optimzer will apply different rules on the query graph, either changing the result join tree shape, or use different join alrightm implementation, and choose the least cost plan after exausting all the rules.

Fig 2: different rules Fig 2: different rules

Hypergraph

Using DPccp algorithm as an example, this kind of solution modeling the join ordering problem using hypergraphs. The hypergrpah can be decomposed into many csg-cmp-pairs, and the number of those pairs are only related to the query itself:

  1. csg stands for connected subgraph
  2. cmp stands for the complement connected subgraph

The paper in futher verifys that, the difference of many algorithms using hypergraph are about how to enumerate those subgraph pairs to find the best solution (least cost one). And among the existing ones, DPccp is the one meets the lower bound algorithm complexity.

Reference

[1] Analysis of Two Existing and One New Dynamic Programming Algorithm for the Generation of Optimal Bushy Join Trees without Cross Products, Guido Moerkotte and Thomas Neumann

[2] Building Query Compilers, Guido Moerkotte

[3] Efficiency in the Columbia Database Query Optimizer, Yongwen Xu