dirtysalt
Contributor

Use another way to serialize/deserialize the hash set in the distinct agg state:

  1. smaller packet size
  2. better cache locality when merging

Tested with the following SQL on the SSB dataset, query time drops from 60s to 20s:

SELECT COUNT (DISTINCT lo_orderkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;
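
The new layout is not spelled out in this description, so the following is only a minimal Python sketch of the general idea, not the PR's actual C++ code. It assumes, purely for illustration, that the old format dumps every slot of the table (the size, capacity, control array, slot array layout mentioned later in this thread) and that a more compact format writes only the live keys as one contiguous array. All function names and the toy 16-slot table are hypothetical.

```python
import struct

EMPTY = None  # marker for an unused slot in this toy table


def serialize_full_table(slots):
    """Dump every slot, used or not: | size | capacity | control | slots |."""
    size = sum(s is not EMPTY for s in slots)
    control = bytes(0 if s is EMPTY else 1 for s in slots)
    payload = b"".join(struct.pack("<Q", 0 if s is EMPTY else s) for s in slots)
    return struct.pack("<QQ", size, len(slots)) + control + payload


def serialize_keys_only(slots):
    """Write only the live keys as one contiguous array: | size | keys |."""
    keys = [s for s in slots if s is not EMPTY]
    return struct.pack("<Q", len(keys)) + struct.pack(f"<{len(keys)}Q", *keys)


def merge_keys_only(dest, blob):
    """Merge a keys-only blob into an existing set with one sequential scan."""
    (n,) = struct.unpack_from("<Q", blob, 0)
    dest.update(struct.unpack_from(f"<{n}Q", blob, 8))
    return dest


# Toy table: 16 slots, 4 of them occupied (placement is key % 16).
slots = [EMPTY] * 16
for key in (3, 17, 42, 100):
    slots[key % 16] = key

full = serialize_full_table(slots)
compact = serialize_keys_only(slots)
merged = merge_keys_only({42, 7}, compact)
```

In this toy case the keys-only blob is a quarter of the size of the full dump, and the merge reads the keys front to back instead of skipping over empty slots, which is the kind of packet-size and locality gain the two points above refer to.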

@imay
Contributor

imay commented Sep 9, 2021

@dirtysalt
I have one question: once this CL is merged, will it be compatible with the old version?
If not, our program cannot be upgraded in a grayscale way.

@dirtysalt
Contributor Author

dirtysalt commented Sep 9, 2021

> @dirtysalt
> I have one question: once this CL is merged, will it be compatible with the old version?
> If not, our program cannot be upgraded in a grayscale way.

@imay Unfortunately not yet. But I think we can do something hacky to make it compatible.

Right now the serialized format looks like this:

| size | capacity | control array | slot array|

Here size is a size_t. In practice it should always be less than (1<<63)-1, since otherwise the hash table would be far too large. So we can tell that data is in the old serialized format if the MSB of size is 0, and for newer versions we set the MSB of size to 1.
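
The MSB trick can be sketched as follows. This is a hedged Python illustration, not the BE implementation: it assumes a 64-bit little-endian size field, and the function names and the version numbers 1 and 2 are made up for the example.

```python
import struct

MSB = 1 << 63          # most significant bit of a 64-bit size field
SIZE_MASK = MSB - 1    # the remaining 63 bits hold the real size


def encode_size_old(size):
    """Old format: the size is written as-is, so its MSB is always 0."""
    if size >= MSB:
        raise ValueError("size does not fit in 63 bits")
    return struct.pack("<Q", size)


def encode_size_new(size):
    """New format: same field, but the MSB is set as a version flag."""
    if size >= MSB:
        raise ValueError("size does not fit in 63 bits")
    return struct.pack("<Q", size | MSB)


def decode_size(blob):
    """Return (version, size) by inspecting the MSB of the leading field."""
    (raw,) = struct.unpack_from("<Q", blob, 0)
    version = 2 if raw & MSB else 1
    return version, raw & SIZE_MASK


old_header = decode_size(encode_size_old(1000))
new_header = decode_size(encode_size_new(1000))
```

Note that this only lets a new reader accept both formats. A reader that predates the flag would still interpret a tagged size as an enormous element count.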

@dirtysalt
Contributor Author

> @dirtysalt
> I have one question: once this CL is merged, will it be compatible with the old version?
> If not, our program cannot be upgraded in a grayscale way.

Yeah, it's very hard to be backward compatible, because the old version of BE does not recognize the new data format. So basically yes, it cannot be upgraded in a grayscale way.

@imay
Contributor

imay commented Sep 9, 2021

> > @dirtysalt
> > I have one question: once this CL is merged, will it be compatible with the old version?
> > If not, our program cannot be upgraded in a grayscale way.
>
> Yeah, it's very hard to be backward compatible, because the old version of BE does not recognize the new data format. So basically yes, it cannot be upgraded in a grayscale way.

If we cannot upgrade in a grayscale way, this would be bad for users,
so we should make it backward compatible.

@DeepThinker666
Contributor

Could you add some background for this PR?

@dirtysalt
Contributor Author

> Could you add some background for this PR?

Right. We are doing some benchmark comparisons with CK, and we found that our performance is not good for (count distinct + group by) queries. There are several optimization points:

  1. better hashtable for inserting and finding
  2. serialization and deserialization.

@dirtysalt
Contributor Author

dirtysalt commented Sep 13, 2021

Running the benchmark on ssb_100g with set parallel_fragment_exec_instance_num = 16:

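| Query | SR Master | SR PR |
|-------|-----------|-------|
| Q01 | 1110 | 1086 |
| Q02 | 1052 | 1069 |
| Q03 | 400 | 360 |
| Q04 | 176 | 174 |
| Q05 | 15840 | 9526 |
| Q06 | 2305 | 2037 |
| Q07 | 1347 | 1261 |
| Q08 | 326 | 337 |
| Q09 | 6277 | 4898 |
| Q10 | 2449 | 2189 |
| Q11 | 6011 | 4749 |
| Q12 | 1958 | 1832 |
| Q13 | 10079 | 9475 |
| Q14 | 6816 | 6764 |
| Q15 | 9766 | 9243 |
| Q16 | 6003 | 5844 |
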
# 100G
# lo_orderkey = 150,000,000
# lo_partkey = 1,000,000
# lo_suppkey = 200,000
# lo_quantity = 50

# lo_linenumber = 7
# lo_supplycost = 15933
# lo_linenumber, lo_supplycost = 111531

-- Q01
SELECT COUNT (DISTINCT lo_orderkey) FROM lineorder_flat;
-- Q02
SELECT COUNT (DISTINCT lo_partkey) FROM lineorder_flat;
-- Q03
SELECT COUNT (DISTINCT lo_suppkey) FROM lineorder_flat;
-- Q04
SELECT COUNT (DISTINCT lo_quantity) FROM lineorder_flat;

-- Q05
SELECT COUNT (DISTINCT lo_orderkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;
-- Q06
SELECT COUNT (DISTINCT lo_partkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;
-- Q07
SELECT COUNT (DISTINCT lo_suppkey), lo_linenumber FROM lineorder_flat group by lo_linenumber;
-- Q08
SELECT COUNT (DISTINCT lo_quantity), lo_linenumber FROM lineorder_flat group by lo_linenumber;

-- Q09
SELECT COUNT (DISTINCT lo_orderkey), lo_supplycost FROM lineorder_flat group by lo_supplycost;
-- Q10
SELECT COUNT (DISTINCT lo_partkey), lo_supplycost FROM lineorder_flat group by lo_supplycost;
-- Q11
SELECT COUNT (DISTINCT lo_suppkey), lo_supplycost FROM lineorder_flat group by lo_supplycost;
-- Q12
SELECT COUNT (DISTINCT lo_quantity), lo_supplycost FROM lineorder_flat group by lo_supplycost;

-- Q13
SELECT COUNT (DISTINCT lo_orderkey), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;
-- Q14
SELECT COUNT (DISTINCT lo_partkey), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;
-- Q15
SELECT COUNT (DISTINCT lo_suppkey), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;
-- Q16
SELECT COUNT (DISTINCT lo_quantity), lo_linenumber, lo_supplycost FROM lineorder_flat group by lo_linenumber, lo_supplycost;
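
To make the benchmark numbers above easier to compare, here is a small Python sketch that computes the master/PR ratio for each query. The timings are copied from the results in this comment. The unit is not stated there, so only ratios are computed, and the variable names are illustrative.

```python
# (master, pr) timings per query, copied from the benchmark results above.
timings = {
    "Q01": (1110, 1086), "Q02": (1052, 1069), "Q03": (400, 360),
    "Q04": (176, 174), "Q05": (15840, 9526), "Q06": (2305, 2037),
    "Q07": (1347, 1261), "Q08": (326, 337), "Q09": (6277, 4898),
    "Q10": (2449, 2189), "Q11": (6011, 4749), "Q12": (1958, 1832),
    "Q13": (10079, 9475), "Q14": (6816, 6764), "Q15": (9766, 9243),
    "Q16": (6003, 5844),
}

# A ratio above 1.0 means the PR build is faster than master.
speedups = {query: master / pr for query, (master, pr) in timings.items()}

best = max(speedups, key=speedups.get)
slower = [query for query, ratio in speedups.items() if ratio < 1.0]

for query, ratio in speedups.items():
    print(f"{query}: {ratio:.2f}x")
```

With these numbers the largest gain is on Q05 (distinct lo_orderkey grouped by lo_linenumber), while Q02 and Q08 come out marginally slower on the PR build.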

@SeaAndHillMe

good to learn

jaogoy pushed a commit that referenced this pull request Jul 14, 2022
caneGuy pushed a commit to caneGuy/starrocks that referenced this pull request Mar 28, 2023