[SR-3811] optimize high card count distinct #139
Conversation
@dirtysalt
@imay Unfortunately not yet. But I think we can do something hacky to make it compatible. Right now the serialized format is like,
and the size is
Yeah, it's very hard to be backward compatible, because an old version of BE does not recognize the new data format. So basically yes, it cannot be upgraded in a grayscale way.
If we cannot upgrade in a grayscale way, this would be bad for users.
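To make the compatibility concern above concrete, here is a minimal sketch of one common "hacky" approach: putting a sentinel and a version byte in front of a new layout so that a new reader can accept both layouts. Both layouts, the `MAGIC` sentinel, and all function names are assumptions made for illustration; they are not the format used in this PR. Note that this only helps new readers. An old BE would still misread the tagged buffer, which is exactly the grayscale-upgrade problem raised in this thread.

```python
import struct

# Hypothetical layouts, for illustration only (not the project's real format):
#   legacy: [count: u32][value: i64] * count
#   tagged: [MAGIC: u32][version: u8][count: u32][value: i64] * count
# Assumed sentinel: only safe if a legacy count can never reach this value.
MAGIC = 0xFFFFFFFF


def serialize_legacy(values):
    """Write the assumed old layout: element count followed by the values."""
    vals = sorted(values)
    return struct.pack(f"<I{len(vals)}q", len(vals), *vals)


def serialize_tagged(values, version=2):
    """Write the assumed new layout: sentinel, version byte, count, values."""
    vals = sorted(values)
    return struct.pack(f"<IBI{len(vals)}q", MAGIC, version, len(vals), *vals)


def deserialize(buf):
    """Read either layout and return (version, set_of_values)."""
    (first,) = struct.unpack_from("<I", buf, 0)
    if first == MAGIC:
        # New layout: version byte and count follow the 4-byte sentinel.
        version, count = struct.unpack_from("<BI", buf, 4)
        offset = 9
    else:
        # Old layout: the first word is the element count itself.
        version, count = 1, first
        offset = 4
    values = struct.unpack_from(f"<{count}q", buf, offset)
    return version, set(values)
```

A reader built this way accepts data from both old and new writers, so it covers the case where new nodes receive old-format data, but not the reverse direction.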
Could you add some background for this PR?
Right, we are doing some benchmark comparisons with CK (ClickHouse), and we found that we don't perform well in the (count/distinct + group by) cases. There are several optimization points:
Running the benchmark on ssb_100g with
Good to learn.
Use another way to serialize/deserialize the hash set in the distinct agg state:
Testing the following SQL on the SSB dataset, query time is reduced from 60s to 20s.
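The actual serialized layout is not shown in this description, so the following is only a sketch of the general idea behind serializing a distinct-aggregation hash set: write the keys as one contiguous, length-prefixed buffer, and rebuild a set from that buffer when partial states are merged. The function names and the layout (a 64-bit count followed by 64-bit keys in native byte order) are assumptions for illustration, not the PR's implementation.

```python
from array import array


def serialize_distinct_state(values):
    """Pack a set of 64-bit integer keys as [count: u64][key: i64] * count."""
    keys = array("q", values)
    header = array("Q", [len(keys)])
    return header.tobytes() + keys.tobytes()


def deserialize_distinct_state(buf):
    """Rebuild the set of keys from a buffer made by serialize_distinct_state."""
    header = array("Q")
    header.frombytes(buf[:8])
    count = header[0]
    keys = array("q")
    keys.frombytes(buf[8:8 + 8 * count])
    return set(keys)


def merge_states(buffers):
    """Merge several serialized partial states into one set of distinct keys."""
    merged = set()
    for buf in buffers:
        merged |= deserialize_distinct_state(buf)
    return merged
```

For example, merging the states for `{1, 2, 3}` and `{3, 4}` yields four distinct keys, which is the value a final `count(distinct ...)` step would report for that group.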