[SR-3811] optimize high card count distinct #139

dirtysalt · 2021-09-09T14:12:46Z

Use another way to serialize/deserialize the hash set in the distinct agg state, which gives:

smaller packet size
better cache locality when merging

Testing with the following SQL on the SSB dataset, query time drops from 60s to 20s:

SELECT COUNT (DISTINCT lo_orderkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;

imay · 2021-09-09T14:37:41Z

@dirtysalt
I have one question. If this CL is merged, will it be compatible with the old version?
If not, our program will not be able to do a grayscale upgrade.

dirtysalt · 2021-09-09T15:12:41Z

@imay Unfortunately not yet. But I think we can do something hacky to make it compatible.

Right now the serialized format looks like this:

| size | capacity | control array | slot array |

size is a size_t, and in practice it is always less than (1<<63)-1 (otherwise the hash table would be impossibly large). So we can tell that data is in the old serialized format if the MSB of size is 0. For future versions, we set the MSB of size to 1.
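The MSB trick described above could be sketched roughly as follows. This is only an illustration under stated assumptions: it assumes a 64-bit size field, and the names kNewFormatFlag, encode_size_new_format, is_new_format and decode_size are hypothetical, not taken from be/src/column/hash_set.h.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical flag: the top bit of the 64-bit size field marks the new format.
constexpr uint64_t kNewFormatFlag = uint64_t{1} << 63;

// New-format writer: store the same size value, but with the MSB set to 1.
inline uint64_t encode_size_new_format(uint64_t size) {
    assert((size & kNewFormatFlag) == 0); // real sizes never reach the top bit
    return size | kNewFormatFlag;
}

// Reader: MSB == 0 means the old serialized format, MSB == 1 means the new one.
inline bool is_new_format(uint64_t raw_size) {
    return (raw_size & kNewFormatFlag) != 0;
}

// Reader: recover the element count regardless of which format wrote the field.
inline uint64_t decode_size(uint64_t raw_size) {
    return raw_size & ~kNewFormatFlag;
}
```

A reader built this way can branch on is_new_format(raw_size) before parsing the rest of the payload. It only lets new code recognize old data; a reader that predates the flag would still see the tagged value as a huge size.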

dirtysalt · 2021-09-09T15:19:44Z

Yeah, it's very hard to be backward compatible, because the old version of BE does not recognize the new data format. So basically yes, it cannot be upgraded in a grayscale way.

imay · 2021-09-09T15:36:49Z

If we cannot upgrade in a grayscale way, this would be bad for users.
So we should make it backward compatible.

be/src/exprs/agg/distinct.h

DeepThinker666 · 2021-09-09T16:26:57Z

Could you add some background for this PR?

dirtysalt · 2021-09-10T02:09:46Z

Right. We are doing some benchmark comparisons with CK (ClickHouse), and we found that our performance is not good for (count distinct + group by) queries. There are several optimization points:

a better hash table for inserting and finding
faster serialization and deserialization

dirtysalt · 2021-09-13T06:19:15Z

Benchmark results on ssb_100g, run with set parallel_fragment_exec_instance_num = 16:

Query	SR Master	SR PR
Q01	1110	1086
Q02	1052	1069
Q03	400	360
Q04	176	174
Q05	15840	9526
Q06	2305	2037
Q07	1347	1261
Q08	326	337
Q09	6277	4898
Q10	2449	2189
Q11	6011	4749
Q12	1958	1832
Q13	10079	9475
Q14	6816	6764
Q15	9766	9243
Q16	6003	5844

# 100G
# lo_orderkey = 150,000,000
# lo_partkey = 1,000,000
# lo_suppkey = 200,000
# lo_quantity = 50

# lo_linenumber = 7
# lo_supplycost = 15933
# lo_linenumber, lo_supplycost = 111531

-- Q01
select count (distinct lo_orderkey) from lineorder_flat;
-- Q02
SELECT COUNT (DISTINCT lo_partkey) FROM lineorder_flat;
-- Q03
SELECT COUNT (DISTINCT lo_suppkey) FROM lineorder_flat;
-- Q04
SELECT COUNT (DISTINCT lo_quantity) FROM lineorder_flat;

-- Q05
SELECT COUNT (DISTINCT lo_orderkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;
-- Q06
SELECT COUNT (DISTINCT lo_partkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;
-- Q07
SELECT COUNT (DISTINCT lo_suppkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;
-- Q08
SELECT COUNT (DISTINCT lo_quantity), lo_linenumber FROM lineorder_flat group by lo_linenumber;

-- Q09
SELECT COUNT (DISTINCT lo_orderkey), lo_supplycost FROM lineorder_flat group by lo_supplycost;
-- Q10
SELECT COUNT (DISTINCT lo_partkey), lo_supplycost FROM lineorder_flat group by lo_supplycost;
-- Q11
SELECT COUNT (DISTINCT lo_suppkey), lo_supplycost FROM lineorder_flat group by lo_supplycost;
-- Q12
SELECT COUNT (DISTINCT lo_quantity), lo_supplycost FROM lineorder_flat group by lo_supplycost;

-- Q13
SELECT COUNT (DISTINCT lo_orderkey), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;
-- Q14
SELECT COUNT (DISTINCT lo_partkey), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;
-- Q15
SELECT COUNT (DISTINCT lo_suppkey), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;
-- Q16
SELECT COUNT (DISTINCT lo_quantity), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;

SeaAndHillMe · 2022-01-07T02:02:52Z

good to learn

DeepThinker666 reviewed Sep 9, 2021

be/src/exprs/agg/distinct.h (outdated, resolved)

dirtysalt added 2 commits September 11, 2021 12:08

[SR-3811] optimize high card count distinct

48ac317

use prefetch when doing count/distinct only

62f4a3b

stdpain reviewed Sep 11, 2021

be/src/column/hash_set.h (outdated, resolved)

dirtysalt added 5 commits September 11, 2021 19:57

fix for pr review and add extended version of count_distinct and sum_distinct

4d3f1d7

change multi_distinct_count/sum implemtation

5356af7

add agg_func_set_version field to upgrade agg function in a compatible way

3e7fdaa

support upgraded agg function set

c0b7be2

merge master

c0cdc94

Merge branch 'main' into opt-count-distinct

bebbbf8

kangkaisen reviewed Sep 13, 2021

gensrc/thrift/PlanNodes.thrift (resolved)

kangkaisen reviewed Sep 13, 2021

be/src/column/hash_set.h (outdated, resolved)

kangkaisen reviewed Sep 13, 2021

be/src/column/hash_set.h (outdated, resolved)

dirtysalt added 2 commits September 13, 2021 15:57

fix for pr review

96dfc05

Merge branch 'main' into opt-count-distinct

e469502

trueeyu approved these changes Sep 13, 2021

satanson approved these changes Sep 13, 2021

dirtysalt merged commit f52ddac into StarRocks:main Sep 13, 2021

dirtysalt deleted the opt-count-distinct branch September 15, 2021 22:31

dirtysalt mentioned this pull request Oct 2, 2021

[SR-3811]Optimize Count Distinct using prefetch technique #544

Merged

dirtysalt mentioned this pull request Nov 19, 2021

Improve performance of count(distinct int) in high cardinality cases #252

Closed

caneGuy pushed a commit to caneGuy/starrocks that referenced this pull request Mar 28, 2023

Update Cluster_administration.md (StarRocks#139)

000fba4
