-
Notifications
You must be signed in to change notification settings - Fork 16
Different optimizations for report #197
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Python stylecheck is failing but it also failing on master (weird, all prs passed it. Most probably something changed in the linter). So it shouldn't block this pr. |
| var bucket = List[Int]() | ||
|
|
||
| getHashValues(hashTable).foreach { case FileHash(sha1, value) => | ||
| val elId = elementIds(sha1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assumption here is that it's not possible that there's an element id in the non-first hashtable that is actually not present in the first hashtable, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes. WMH produces one long hash for a file. Then we split it to "bands" so each file has number of hashtables rows with partial hash in it.
- makes it faster - allows to pass number of buckets that doesn't much number of cc Signed-off-by: Maxim Sukharev <[email protected]>
Bucket with only 1 element means that element isn't connected to anything. Currently such elements are filtered only when we build graph but we can remove it much earlier which would improve performance a lot. Signed-off-by: Maxim Sukharev <[email protected]>
It repeats a little bit of code for the first hashtable but more performant because it loops only once both for building elementsIds map and for buckets generation. Signed-off-by: Maxim Sukharev <[email protected]>
Signed-off-by: Maxim Sukharev <[email protected]>
Order of keys in map is random. But python code relies on indexes as element id. Signed-off-by: Maxim Sukharev <[email protected]>
previous commits introduced filtering of elements that appear only in one bucket. But it breaks python logic. Signed-off-by: Maxim Sukharev <[email protected]>
because now python receives only elements appeared in more than 1 bucket it's possible that bucket id and element ids in scala will collide. Signed-off-by: Maxim Sukharev <[email protected]>
find dups in scala instead of a new query for each hash Signed-off-by: Maxim Sukharev <[email protected]>
|
I have reworked the PR a lot. Please, another pass. (sorry Marvin) P.S. On test dataset report time went down from 60+ hours (I stopped it on 3rd day of running) to 15 minutes. |
|
@smacker unfortunately I barely know what |
carlosms
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👏 👏 👏
| for el_id, bucket in id_to_buckets: | ||
| indices[pos:(pos + len(bucket))] = bucket | ||
| pos += len(bucket) | ||
| indptr[el_id + 1:] = pos |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe there is a performance gain we can squeeze here, avoiding to write always up to the end of the array.
Something like this (did not test this, please double check)
prev_el_id = 0
prev_pos = 0
for el_id, bucket in id_to_buckets:
indices[pos:(pos + len(bucket))] = bucket
pos += len(bucket)
indptr[prev_el_id+2:el_id] = prev_pos
prev_el_id = el_id
prev_pos = post
indptr[prev_el_id+1:] = prev_posThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! It's a good idea. I re-wrote it a little bit different and added one more test to be sure it works correctly for an edge case.
Signed-off-by: Maxim Sukharev <[email protected]>
Signed-off-by: Maxim Sukharev <[email protected]>
|
@se7entyse7en thanks anyway for your valuable review. Carlos knows internal of gemini better and he approved the code so I think we are good to merge now without you going deep into details. |
Please look at each commit message for details.