Vectorization benchmarks

Results from benchmarking some test queries (below) on an EC2 instance using the SqlBenchmark class, presented as times and rates. In both cases, the first query, SELECT COUNT(*) FROM foo, is not plotted because it would skew the y axes too much.

This third plot shows the speedup ratio, and it does include the first query.

Analysis

Query 0 speeds up absurdly (15x) but, well, it’s just a COUNT(*). It makes sense to see a huge number here, since the count aggregator doesn’t actually need to read data, so it’s really measuring how much overhead there is in the cursor and query engine. It’s important to note that the query engine isn’t “cheating” for COUNT(*): it could just do return adapter.getNumRows(), but it doesn’t. It goes through the work of building a cursor and walking through it.

Queries 1 and 5 don’t change much. For query 1, most of the work is in manipulating HLL objects, which row batching wouldn’t help with. Query 5 has a high-selectivity filter applied to it (only 0.1% of rows match). I haven’t studied it, but I’d guess that most of the work is in creating the cursor, including the bitmap lookups, not in walking it.

Queries 2, 3, 4, 6, 7, 8, and 9 are various kinds of timeseries and all of them have about a 1.8-3x speedup. Queries 3 and 8 stand out. Query 3 has a granularity component, so its higher speedup might be due to the query engine handling its own granularization. I don’t have a guess about query 8.

Queries 10 and 11 are groupBys on two strings, unfiltered. They look similar to each other. They’re also the slowest queries of the whole set: barely processing 7 million rows per second. I suppose it makes sense, since these would be hash-based aggregations of multiple columns.

Queries 12, 13, and 14 are groupBys on one string, unfiltered, with various aggregator configurations. They’re all grouping on the same string column, dimZipf, which has small enough cardinality to trigger array-based aggregation. They’re fast to start with, and they also get bigger speedup boosts. Two of them manage to break 100 million rows per second.

Queries 15, 16, 17, and 18 are groupBys on a long column with different combinations of filters and aggregators. They would all be using hash-based aggregation, since currently, array-based aggregation is only supported for string columns. They’re all quite a bit slower than the groupBys that use array-based aggregation. Like queries 10 and 11, which are also hash-based, their speedup ratios are also somewhat lower.

Overall, all the test queries that processed a reasonably high number of rows and use relatively simple aggregators had speedups in the 1.3-3x range. The biggest speedups were posted by queries that also have relatively simple query engines (timeseries and groupBy array-based aggregation) vs. queries that used relatively more complex query engines (hash-based aggregation).

Queries

// 0, 1, 2, 3: Timeseries, unfiltered
"SELECT COUNT(*) FROM foo",
"SELECT COUNT(DISTINCT hyper) FROM foo",
"SELECT SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo",
"SELECT FLOOR(__time TO MINUTE), SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo GROUP BY 1",

// 4: Timeseries, low selectivity filter (90% of rows match)
"SELECT SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo WHERE dimSequential NOT LIKE '%3'",

// 5: Timeseries, high selectivity filter (0.1% of rows match)
"SELECT SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo WHERE dimSequential = '311'",

// 6: Timeseries, mixing low selectivity index-capable filter (90% of rows match) + cursor filter
"SELECT SUM(sumLongSequential), SUM(sumFloatNormal) FROM foo\n"
+ "WHERE dimSequential NOT LIKE '%3' AND maxLongUniform > 10",

// 7: Timeseries, low selectivity toplevel filter (90%), high selectivity filtered aggregator (0.1%)
"SELECT\n"
+ "  SUM(sumLongSequential) FILTER(WHERE dimSequential = '311'),\n"
+ "  SUM(sumFloatNormal)\n"
+ "FROM foo\n"
+ "WHERE dimSequential NOT LIKE '%3'",

// 8: Timeseries, no toplevel filter, various filtered aggregators with clauses repeated.
"SELECT\n"
+ "  SUM(sumLongSequential) FILTER(WHERE dimSequential = '311'),\n"
+ "  SUM(sumLongSequential) FILTER(WHERE dimSequential <> '311'),\n"
+ "  SUM(sumLongSequential) FILTER(WHERE dimSequential LIKE '%3'),\n"
+ "  SUM(sumLongSequential) FILTER(WHERE dimSequential NOT LIKE '%3'),\n"
+ "  SUM(sumLongSequential),\n"
+ "  SUM(sumFloatNormal) FILTER(WHERE dimSequential = '311'),\n"
+ "  SUM(sumFloatNormal) FILTER(WHERE dimSequential <> '311'),\n"
+ "  SUM(sumFloatNormal) FILTER(WHERE dimSequential LIKE '%3'),\n"
+ "  SUM(sumFloatNormal) FILTER(WHERE dimSequential NOT LIKE '%3'),\n"
+ "  SUM(sumFloatNormal),\n"
+ "  COUNT(*) FILTER(WHERE dimSequential = '311'),\n"
+ "  COUNT(*) FILTER(WHERE dimSequential <> '311'),\n"
+ "  COUNT(*) FILTER(WHERE dimSequential LIKE '%3'),\n"
+ "  COUNT(*) FILTER(WHERE dimSequential NOT LIKE '%3'),\n"
+ "  COUNT(*)\n"
+ "FROM foo",

// 9: Timeseries, toplevel time filter, time-comparison filtered aggregators
"SELECT\n"
+ "  SUM(sumLongSequential)\n"
+ "    FILTER(WHERE __time >= TIMESTAMP '2000-01-01 00:00:00' AND __time < TIMESTAMP '2000-01-01 12:00:00'),\n"
+ "  SUM(sumLongSequential)\n"
+ "    FILTER(WHERE __time >= TIMESTAMP '2000-01-01 12:00:00' AND __time < TIMESTAMP '2000-01-02 00:00:00')\n"
+ "FROM foo\n"
+ "WHERE __time >= TIMESTAMP '2000-01-01 00:00:00' AND __time < TIMESTAMP '2000-01-02 00:00:00'",

// 10, 11: GroupBy two strings, unfiltered, unordered
"SELECT dimSequential, dimZipf, SUM(sumLongSequential) FROM foo GROUP BY 1, 2",
"SELECT dimSequential, dimZipf, SUM(sumLongSequential), COUNT(*) FROM foo GROUP BY 1, 2",

// 12, 13, 14: GroupBy one string, unfiltered, various aggregator configurations
"SELECT dimZipf FROM foo GROUP BY 1",
"SELECT dimZipf, COUNT(*) FROM foo GROUP BY 1 ORDER BY COUNT(*) DESC",
"SELECT dimZipf, SUM(sumLongSequential), COUNT(*) FROM foo GROUP BY 1 ORDER BY COUNT(*) DESC",

// 15, 16: GroupBy long, unfiltered, unordered; with and without aggregators
"SELECT maxLongUniform FROM foo GROUP BY 1",
"SELECT maxLongUniform, SUM(sumLongSequential), COUNT(*) FROM foo GROUP BY 1",

// 17, 18: GroupBy long, filter by long, unordered; with and without aggregators
"SELECT maxLongUniform FROM foo WHERE maxLongUniform > 10 GROUP BY 1",
"SELECT maxLongUniform, SUM(sumLongSequential), COUNT(*) FROM foo WHERE maxLongUniform > 10 GROUP BY 1"

Vectorization benchmarks

Analysis

Queries

Raw data