Different optimizations for report #197

smacker · 2019-02-11T13:37:44Z

Please look at each commit message for details.

smacker · 2019-02-11T15:28:48Z

The Python style check is failing, but it is also failing on master (weird, since all PRs passed it; most probably something changed in the linter). So it shouldn't block this PR.

se7entyse7en · 2019-02-11T17:07:10Z

src/main/scala/tech/sourced/gemini/ConnectedComponents.scala

+    var bucket = List[Int]()
+
+    getHashValues(hashTable).foreach { case FileHash(sha1, value) =>
+      val elId = elementIds(sha1)


The assumption here is that every element id found in a later hashtable is also present in the first hashtable, right?

Yes. WMH (Weighted MinHash) produces one long hash per file. We then split it into "bands", so each file has one row in every hashtable, each row holding a partial hash.
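To make the banding concrete, here is a minimal Python sketch. It assumes the hash is a flat sequence of integers whose length is divisible by the number of bands; `split_into_bands`, `build_hashtables`, and the sample values are hypothetical names and data for illustration, not gemini's actual code.

```python
def split_into_bands(wmh, n_bands):
    """Split one long hash into n_bands equal partial hashes (bands)."""
    if len(wmh) % n_bands != 0:
        raise ValueError("hash length must be divisible by the number of bands")
    band_size = len(wmh) // n_bands
    return [tuple(wmh[i * band_size:(i + 1) * band_size]) for i in range(n_bands)]


def build_hashtables(file_hashes, n_bands):
    """Return one hashtable per band, mapping band value -> list of sha1s."""
    hashtables = [{} for _ in range(n_bands)]
    for sha1, wmh in file_hashes.items():
        # Every file contributes exactly one row to each hashtable.
        for table, band in zip(hashtables, split_into_bands(wmh, n_bands)):
            table.setdefault(band, []).append(sha1)
    return hashtables


# Hypothetical sample data: three files with 4-value hashes, 2 bands.
file_hashes = {
    "a": [1, 2, 3, 4],
    "b": [1, 2, 9, 9],
    "c": [7, 7, 3, 4],
}
hashtables = build_hashtables(file_hashes, n_bands=2)
```

Under these assumptions every file shows up in every hashtable, so the set of files seen in the first hashtable is the same as in any later one. That is the property the element id lookup relies on.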

A bucket with only 1 element means that the element isn't connected to anything. Currently such elements are filtered out only when we build the graph, but we can remove them much earlier, which would improve performance a lot.
Signed-off-by: Maxim Sukharev <[email protected]>
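A minimal Python sketch of this early filtering, assuming buckets are plain lists of element ids (the function name and the sample data are hypothetical):

```python
def drop_singleton_buckets(buckets):
    """Keep only buckets that connect at least two elements."""
    return [bucket for bucket in buckets if len(bucket) > 1]


# Hypothetical buckets: elements 2 and 5 each sit alone in a bucket.
buckets = [[0, 1], [2], [1, 3, 4], [5]]
connected = drop_singleton_buckets(buckets)

# Element ids that survive the filtering.
remaining_ids = sorted({el_id for bucket in connected for el_id in bucket})
```

Note that the surviving element ids are no longer a contiguous range (here 2 and 5 are gone), so any code that treats array indexes as element ids has to account for the gaps.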

smacker · 2019-02-13T11:20:33Z

I have reworked the PR a lot. Please take another pass. (sorry Marvin)
//cc @se7entyse7en @carlosms

P.S. On the test dataset, report time went down from 60+ hours (I stopped it on the 3rd day of running) to 15 minutes.

se7entyse7en · 2019-02-14T09:01:36Z

@smacker unfortunately I barely know what gemini does, so, as in the previous reviews, I only reviewed things that are narrower in scope. Maybe it's worth taking some time to look deeper into gemini for future reviews, but I'm afraid that doing so now would take a long time and delay these PRs for the POC.

carlosms

👏 👏 👏

carlosms · 2019-02-14T11:47:57Z

src/main/python/community-detector/community_detector.py

+    for el_id, bucket in id_to_buckets:
+        indices[pos:(pos + len(bucket))] = bucket
+        pos += len(bucket)
+        indptr[el_id + 1:] = pos


Maybe there is a performance gain we can squeeze out here by not always writing up to the end of the array.
Something like this (I did not test this, please double-check):

    prev_el_id = 0
    prev_pos = 0
    for el_id, bucket in id_to_buckets:
        indices[pos:(pos + len(bucket))] = bucket
        pos += len(bucket)
        # fill only the gap since the previous element (assumes indptr starts zeroed)
        indptr[prev_el_id + 1:el_id + 1] = prev_pos
        prev_el_id = el_id
        prev_pos = pos
    # the tail is written once, after the loop
    indptr[prev_el_id + 1:] = prev_pos

Thanks! It's a good idea. I rewrote it a little differently and added one more test to be sure it works correctly for an edge case.
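The idea in this thread can be sketched as follows. This is not the code that was merged; it is a self-contained illustration that assumes `indices` and `indptr` are NumPy arrays, that `id_to_buckets` is sorted by ascending element id, and that `n_elements` is known up front. Both function names are hypothetical.

```python
import numpy as np


def build_csr_naive(id_to_buckets, n_elements):
    """Loop from the diff: rewrites the whole tail of indptr on every step."""
    n_items = sum(len(bucket) for _, bucket in id_to_buckets)
    indices = np.zeros(n_items, dtype=np.int64)
    indptr = np.zeros(n_elements + 1, dtype=np.int64)
    pos = 0
    for el_id, bucket in id_to_buckets:
        indices[pos:pos + len(bucket)] = bucket
        pos += len(bucket)
        indptr[el_id + 1:] = pos
    return indices, indptr


def build_csr_incremental(id_to_buckets, n_elements):
    """Variant that fills only the gap since the previous element id."""
    n_items = sum(len(bucket) for _, bucket in id_to_buckets)
    indices = np.zeros(n_items, dtype=np.int64)
    indptr = np.zeros(n_elements + 1, dtype=np.int64)
    pos = 0
    prev_el_id = -1
    for el_id, bucket in id_to_buckets:
        # Rows skipped since the previous element are empty: they start and
        # end at the current position, and so does the start of this row.
        indptr[prev_el_id + 1:el_id + 1] = pos
        indices[pos:pos + len(bucket)] = bucket
        pos += len(bucket)
        prev_el_id = el_id
    # Everything after the last element ends at the final position.
    indptr[prev_el_id + 1:] = pos
    return indices, indptr


# Hypothetical input with gaps in the element ids (1, 2 and 5 are missing).
id_to_buckets = [(0, [0, 1]), (3, [1]), (4, [0, 2, 3])]
naive = build_csr_naive(id_to_buckets, n_elements=6)
fast = build_csr_incremental(id_to_buckets, n_elements=6)
```

Each position of `indptr` is written once in the incremental variant, instead of once per remaining element, while the resulting arrays stay identical to those of the original loop for this input.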

smacker · 2019-02-14T12:21:28Z

@se7entyse7en thanks anyway for your valuable review. Carlos knows the internals of gemini better and he approved the code, so I think we are good to merge now without you going deep into the details.

smacker requested review from carlosms and se7entyse7en February 11, 2019 15:28

se7entyse7en suggested changes Feb 11, 2019

smacker added 8 commits February 13, 2019 11:23

remove unnecessary transformation in report.py

3ae1c4f

- makes it faster
- allows passing a number of buckets that doesn't match the number of connected components (cc)
Signed-off-by: Maxim Sukharev <[email protected]>

use multiple cores to calculate buckets

9be3079

It repeats a little bit of code for the first hashtable but is more performant, because it loops only once both to build the elementIds map and to generate the buckets.
Signed-off-by: Maxim Sukharev <[email protected]>

update report tests

ff63d39

Signed-off-by: Maxim Sukharev <[email protected]>

fix incorrect order of buckets in parquet file

98d4ef6

The order of keys in a map is random, but the Python code relies on indexes as element ids.
Signed-off-by: Maxim Sukharev <[email protected]>

add support for non-sequential element ids

63989c6

Previous commits introduced filtering of elements that appear in only one bucket, but that breaks the Python logic.
Signed-off-by: Maxim Sukharev <[email protected]>

filter out buckets from communities

1f89b4d

Because Python now receives only the elements that appear in more than one bucket, bucket ids and element ids from Scala can collide.
Signed-off-by: Maxim Sukharev <[email protected]>

Speedup duplicates calculation

9a69a87

Find duplicates in Scala instead of issuing a new query for each hash.
Signed-off-by: Maxim Sukharev <[email protected]>
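The last commit's approach (group once in memory rather than query per hash) might look roughly like the sketch below. The commit itself does this in Scala; Python is used here only to illustrate the idea, and the row shape `(sha1, repo, path)`, the function name, and the sample rows are all assumptions.

```python
from collections import defaultdict


def find_duplicates(rows):
    """Group rows by sha1 in one pass and keep hashes shared by 2+ files."""
    by_hash = defaultdict(list)
    for sha1, repo, path in rows:
        by_hash[sha1].append((repo, path))
    return {sha1: files for sha1, files in by_hash.items() if len(files) > 1}


# Hypothetical rows, as if fetched with a single query.
rows = [
    ("aaa", "repo1", "src/a.py"),
    ("bbb", "repo1", "src/b.py"),
    ("aaa", "repo2", "lib/a.py"),
]
duplicates = find_duplicates(rows)
```

The trade-off is memory for round trips: all rows are read once and grouped locally, instead of issuing one lookup per distinct hash.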

se7entyse7en suggested changes Feb 13, 2019

smacker changed the title ~~Diffrerent optimizations for report~~ Different optimizations for report Feb 13, 2019

carlosms approved these changes Feb 14, 2019

smacker added 2 commits February 14, 2019 13:00

apply review comments

38849af

Signed-off-by: Maxim Sukharev <[email protected]>

apply more code review comments

611444f

Signed-off-by: Maxim Sukharev <[email protected]>

se7entyse7en approved these changes Feb 14, 2019

smacker merged commit d4eb34f into src-d:master Feb 14, 2019
