Skip to content

Conversation

@smacker
Copy link
Contributor

@smacker smacker commented Feb 11, 2019

Please look at each commit message for details.

@smacker
Copy link
Contributor Author

smacker commented Feb 11, 2019

Python stylecheck is failing but it also failing on master (weird, all prs passed it. Most probably something changed in the linter). So it shouldn't block this pr.

var bucket = List[Int]()

getHashValues(hashTable).foreach { case FileHash(sha1, value) =>
val elId = elementIds(sha1)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assumption here is that it's not possible that there's an element id in the non-first hashtable that is actually not present in the first hashtable, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. WMH produces one long hash for a file. Then we split it to "bands" so each file has number of hashtables rows with partial hash in it.

- makes it faster
- allows to pass number of buckets that doesn't much number of cc

Signed-off-by: Maxim Sukharev <[email protected]>
Bucket with only 1 element means that element isn't connected to
anything. Currently such elements are filtered only when we build graph
but we can remove it much earlier which would improve performance a lot.

Signed-off-by: Maxim Sukharev <[email protected]>
It repeats a little bit of code for the first hashtable but more
performant because it loops only once both for building elementsIds map
and for buckets generation.

Signed-off-by: Maxim Sukharev <[email protected]>
Signed-off-by: Maxim Sukharev <[email protected]>
Order of keys in map is random. But python code relies on indexes as
element id.

Signed-off-by: Maxim Sukharev <[email protected]>
previous commits introduced filtering of elements that appear only in
one bucket. But it breaks python logic.

Signed-off-by: Maxim Sukharev <[email protected]>
because now python receives only elements appeared in more than 1 bucket
it's possible that bucket id and element ids in scala will collide.

Signed-off-by: Maxim Sukharev <[email protected]>
find dups in scala instead of a new query for each hash

Signed-off-by: Maxim Sukharev <[email protected]>
@smacker
Copy link
Contributor Author

smacker commented Feb 13, 2019

I have reworked the PR a lot. Please, another pass. (sorry Marvin)
//cc @se7entyse7en @carlosms

P.S. On test dataset report time went down from 60+ hours (I stopped it on 3rd day of running) to 15 minutes.

@smacker smacker changed the title Diffrerent optimizations for report Different optimizations for report Feb 13, 2019
@se7entyse7en
Copy link

@smacker unfortunately I barely know what gemini does, so as in the previous reviews I just reviewed things that are narrower in scope. Maybe it worth for me taking some time to also take a deeper look at gemini for future reviews, but I'm afraid that if I do this now it would take much time and would delay these PRs for the POC.

Copy link
Contributor

@carlosms carlosms left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👏 👏 👏

for el_id, bucket in id_to_buckets:
indices[pos:(pos + len(bucket))] = bucket
pos += len(bucket)
indptr[el_id + 1:] = pos
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe there is a performance gain we can squeeze here, avoiding to write always up to the end of the array.
Something like this (did not test this, please double check)

    prev_el_id = 0
    prev_pos = 0

    for el_id, bucket in id_to_buckets:
        indices[pos:(pos + len(bucket))] = bucket
        pos += len(bucket)
        indptr[prev_el_id+2:el_id] = prev_pos

        prev_el_id = el_id
        prev_pos = post
    
    indptr[prev_el_id+1:] = prev_pos

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks! It's a good idea. I re-wrote it a little bit different and added one more test to be sure it works correctly for an edge case.

Signed-off-by: Maxim Sukharev <[email protected]>
Signed-off-by: Maxim Sukharev <[email protected]>
@smacker
Copy link
Contributor Author

smacker commented Feb 14, 2019

@se7entyse7en thanks anyway for your valuable review. Carlos knows internal of gemini better and he approved the code so I think we are good to merge now without you going deep into details.

@smacker smacker merged commit d4eb34f into src-d:master Feb 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants