@smacker smacker commented Jan 25, 2019

Hashing can't be executed incrementally because it calculates document
frequencies, which require the full input.
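
To illustrate why document frequencies need the full input, here is a minimal sketch. It assumes features are weighted with a plain `log(N / df)` inverse document frequency; the object, function names, and sample data are hypothetical and not taken from Gemini's code.

```scala
// Hypothetical illustration: names and the IDF formula are assumptions, not Gemini's code.
object DocFreqSketch {
  // Document frequency: the number of documents each feature appears in.
  // Every document is a Set, so a feature is counted at most once per document.
  def docFreq(docs: Seq[Set[String]]): Map[String, Int] =
    docs.flatten.groupBy(identity).map { case (feature, hits) => feature -> hits.size }

  // Inverse document frequency of a feature that occurs in at least one document.
  def idf(feature: String, docs: Seq[Set[String]]): Double =
    math.log(docs.size.toDouble / docFreq(docs)(feature))

  // Two batches of documents, as if hashing were run twice on different inputs.
  val firstRun: Seq[Set[String]]  = Seq(Set("foo", "bar"), Set("foo", "baz"))
  val secondRun: Seq[Set[String]] = Seq(Set("bar", "qux"), Set("bar"))
}
```

In this sketch the weight of `bar` is `log(2/1)` after the first batch and `log(4/3)` once the second batch is included, so anything hashed during the first run would have used a weight that no longer matches the combined corpus.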

This commit checks whether the hash tables and docfreq tables are empty;
gemini exits with an error if they are not.

It also introduces a new flag, --replace, which cleans up the db for the
current hashing mode.

There is also a separate commit that changes the type of the cassandra flag to unit.
It allows passing just --cassandra instead of --cassandra=true
(for consistency).
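
The check-then-hash behaviour described above could be sketched roughly as follows. This is an in-memory illustration only: `Store`, `Config`, and `prepare` are hypothetical names, the mode strings used with it are placeholders, and the real change works against Cassandra tables rather than a map.

```scala
// Hypothetical sketch: Store, Config and prepare are illustrative names, not Gemini's API.
import scala.collection.mutable

object HashGuardSketch {
  final case class Config(mode: String, replace: Boolean = false)

  // In-memory stand-in for the hash and docfreq tables, keyed by hashing mode.
  final class Store {
    private val rows = mutable.Map.empty[String, mutable.Buffer[String]]

    def add(mode: String, row: String): Unit = {
      rows.getOrElseUpdate(mode, mutable.Buffer.empty[String]) += row
      ()
    }

    def isEmpty(mode: String): Boolean = rows.get(mode).forall(_.isEmpty)

    def clean(mode: String): Unit = {
      rows.remove(mode)
      ()
    }
  }

  // Right(()) means hashing may proceed; Left carries the error message.
  def prepare(store: Store, config: Config): Either[String, Unit] =
    if (config.replace) {
      // --replace: drop existing data for this mode only, then continue.
      store.clean(config.mode)
      Right(())
    } else if (store.isEmpty(config.mode)) {
      Right(())
    } else {
      Left("Database keyspace is not empty! Hashing may produce wrong results. " +
        "Please choose another keyspace or pass the --replace option")
    }
}
```

A caller would print the `Left` message and exit with a non-zero status, and start hashing only on `Right`.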

Output when db isn't empty:

$ ./hash src/test/resources/siva/duplicate-files/
Using spark-submit from ./spark
Running Hashing as Apache Spark job, master: local[*]
Hashing 2 repositories in: 'src/test/resources/siva/duplicate-files/' ()
	file:/Users/smacker/Work/gemini/src/test/resources/siva/duplicate-files/f281ab6f2e0e38dcc3af05360667d8f530c00103.siva
	file:/Users/smacker/Work/gemini/src/test/resources/siva/duplicate-files/9279be3cf07fb3cca4fc964b27acea57e0af461b.siva
Database keyspace is not empty! Hashing may produce wrong results. Please choose another keyspace or pass --replace argument
smacker in ~/Work/gemini on master [!?$]

}

def isDBEmpty(session: Session, mode: String): Boolean = {
var row = session.execute(s"select count(*) from $keyspace.${tables.docFreq} where id='$mode'").one()
Contributor

Is the performance of select count(*) OK to run before every command? Would it improve with select count(*) ... limit 1?

Contributor Author

Currently it's okay, because this table can contain at most 2 rows.
But it's a good point, and it's better to update it in case that changes. 👍

Contributor

ok, I missed that about the table :D

Contributor Author

applied
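
For readers following the thread, one way an existence check with `limit 1` could look is sketched below. The `FirstRow` trait is a hypothetical stand-in for a CQL session so that the snippet stays self-contained; the selected column and the query shape are assumptions, not the code that was actually merged.

```scala
// Hypothetical sketch: FirstRow abstracts "run a query and return its first row, if any".
object EmptyCheckSketch {
  trait FirstRow {
    def first(cql: String): Option[Map[String, Any]]
  }

  // Asks for at most one matching row instead of counting all of them.
  def isDBEmpty(db: FirstRow, keyspace: String, table: String, mode: String): Boolean =
    db.first(s"select id from $keyspace.$table where id='$mode' limit 1").isEmpty
}
```

With the DataStax Java driver, `session.execute(...).one()` returns `null` when the result set has no rows, so a null check on that row would play the role of `Option.isEmpty` here.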

Contributor

@carlosms carlosms left a comment

👍, left a small suggestion


if (!gemini.isDBEmpty(cassandra, config.mode)) {
println("Database keyspace is not empty! Hashing may produce wrong results. " +
"Please choose another keyspace or pass --replace argument")
Contributor

Suggested change
- "Please choose another keyspace or pass --replace argument")
+ "Please choose another keyspace or pass the --replace option")

Contributor Author

fixed the first commit with new message

Hashing can't be executed incrementally because it calculates document
frequencies, which require the full input.

This commit checks whether the hash tables and docfreq tables are empty;
gemini exits with an error if they are not.

It also introduces a new flag, --replace, which cleans up the db for the
current hashing mode.

Signed-off-by: Maxim Sukharev <[email protected]>

This allows passing just `--cassandra` instead of `--cassandra=true`

Signed-off-by: Maxim Sukharev <[email protected]>
@smacker smacker merged commit 90587da into src-d:master Jan 29, 2019