-
Notifications
You must be signed in to change notification settings - Fork 16
move feature frequencies to separate table #185
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -135,15 +135,14 @@ class FileQuery( | |
| case Gemini.funcSimilarityMode => FeaturesHash.funcParams | ||
| } | ||
|
|
||
| val hashtablesTable = s"${tables.hashtables}_${mode}" | ||
| val cols = tables.hashtablesCols | ||
| val wmh = hashFile(featuresList, docFreq, sampleSize) | ||
|
|
||
| val bands = FeaturesHash.wmhToBands(wmh, htnum, bandSize) | ||
|
|
||
| log.info("Looking for similar items") | ||
| val similar = bands.zipWithIndex.foldLeft(Set[String]()) { case (sim, (band, i)) => | ||
| val cql = s"""SELECT ${cols.sha} FROM $keyspace.${hashtablesTable} | ||
| val cql = s"""SELECT ${cols.sha} FROM $keyspace.${tables.hashtables(mode)} | ||
| WHERE ${cols.hashtable}=$i AND ${cols.value}=0x${MathUtil.bytes2hex(band)}""" | ||
| log.debug(cql) | ||
|
|
||
|
|
@@ -186,18 +185,26 @@ class FileQuery( | |
|
|
||
| protected def readDocFreqFromDB(): Option[OrderedDocFreq] = { | ||
| log.info(s"Reading docFreq from DB") | ||
| val cols = tables.docFreqCols | ||
| val row = conn.execute(s"SELECT * FROM ${tables.docFreq} WHERE ${cols.id} = '${mode}'").one() | ||
| if (row == null) { | ||
| val docsCols = tables.featuresDocsCols | ||
| val freqCols = tables.featuresFreqCols | ||
| val docsRow = conn.execute(s"SELECT * FROM ${tables.featuresDocs} WHERE ${docsCols.id} = '$mode'").one() | ||
| if (docsRow == null) { | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How can I avoid null here? It's java lib returns null. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. By doing as follows:
But maybe is just me that don't like reading
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. hm. I see what you mean now. Imo in this particular case checking Please let me know if you see any clear disadvantages of the current code or clear advantages of the code with fold and I'll update it. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. No, I don't think there's any clear pros/cons in favour/disfavour for one or another. I think it's more a style thing.
In this case I think we should be consistent. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Just for completeness, using pattern matching is even more readable imo 😛
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I hope you agreed to keep the code as it is now. With you last example I don't see why it's better:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sure! I just wrote it down for completeness. Anyway I just found it more idiomatic, but I don't actually have that much experience in Scala to say what is more idiomatic and what is not. 👍 |
||
| log.warn("Document frequency table is empty.") | ||
| None | ||
| } else { | ||
| val df = row | ||
| .getMap("df", classOf[java.lang.String], classOf[java.lang.Integer]) | ||
| var tokens = IndexedSeq[String]() | ||
| val df = conn | ||
| .execute(s"SELECT * FROM ${tables.featuresFreq} WHERE ${freqCols.id} = '$mode' ORDER BY ${freqCols.feature}") | ||
| .asScala | ||
| .mapValues(_.toInt) | ||
| .map { row => | ||
| // tokens have to be sorted, df.keys isn't sorted | ||
|
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How do you want them to be sorted? Alphabetical? Isn't it possible to sort outside the block? I'm thinking about something as:
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Alphabetical. DB already returns it in the correct order. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't know cassandra actually, so correct me if I'm wrong. Isn't relying on the db ordering without actually specifying it at query time a bit dangerous? Let me explain better. I guess that in the table there's some sort of index that is being used as ordering key, and we're using it implicitly here. But if we change something at db level we also need to remember to change the code. Isn't it better to also include the ordering in the query? I guess that it won't affect the execution time given that even without including it, the db still orders them using that ordering. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If then we change something at db level everything should continue to work. It could just perform much slower (in case for example we remove the ordering key).
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. (mode, feature) is a primary key that's why it's sorted. Thanks for pointing that I missed order by in SELECT! I'll add it.
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. fixed |
||
| val name = row.getString(freqCols.feature) | ||
| tokens = tokens :+ name | ||
|
|
||
| Some(OrderedDocFreq(row.getInt(cols.docs), df.keys.toIndexedSeq, df)) | ||
| (name, row.getInt(freqCols.weight)) | ||
| }.toMap | ||
|
|
||
| Some(OrderedDocFreq(docsRow.getInt(docsCols.docs), tokens, df)) | ||
| } | ||
| } | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -229,14 +229,21 @@ class Hash(session: SparkSession, | |
|
|
||
| protected def saveDocFreqToDB(docFreq: OrderedDocFreq, keyspace: String, tables: Tables): Unit = { | ||
| log.warn(s"save document frequencies to DB") | ||
| CassandraConnector(session.sparkContext).withSessionDo { cassandra => | ||
| val cols = tables.docFreqCols | ||
| val javaMap = docFreq.df.asJava | ||
|
|
||
| CassandraConnector(session.sparkContext).withSessionDo { cassandra => | ||
| val docsCols = tables.featuresDocsCols | ||
| cassandra.execute( | ||
|
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Not sure if it's a good idea to do There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Regarding this may better wait for someone with more insights than me on Spark. 😛
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We calculate docFreq on driver. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If it's already in the driver, can't we write in batches from there?
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Mostly because Cassandra does not recommend it
Contributor
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I update code using prepared statement to make it a bit better. |
||
| s"INSERT INTO $keyspace.${tables.docFreq} (${cols.id}, ${cols.docs}, ${cols.df}) VALUES (?, ?, ?)", | ||
| mode, int2Integer(docFreq.docs), javaMap | ||
| s"INSERT INTO $keyspace.${tables.featuresDocs} (${docsCols.id}, ${docsCols.docs}) VALUES (?, ?)", | ||
| mode, int2Integer(docFreq.docs) | ||
| ) | ||
|
|
||
| val freqCols = tables.featuresFreqCols | ||
| val prepared = cassandra.prepare(s"INSERT INTO $keyspace.${tables.featuresFreq}" + | ||
| s"(${freqCols.id}, ${freqCols.feature}, ${freqCols.weight}) VALUES (?, ?, ?)") | ||
|
|
||
| docFreq.df.foreach { case(feature, weight) => | ||
| cassandra.execute(prepared.bind(mode, feature, int2Integer(weight))) | ||
| } | ||
| } | ||
| } | ||
|
|
||
|
|
@@ -267,7 +274,6 @@ class Hash(session: SparkSession, | |
| case Gemini.funcSimilarityMode => FeaturesHash.funcParams | ||
| } | ||
|
|
||
| val hashtablesTable = s"${tables.hashtables}_${mode}" | ||
| val cols = tables.hashtablesCols | ||
| rdd | ||
| .flatMap { case RDDHash(doc, wmh) => | ||
|
|
@@ -276,7 +282,7 @@ class Hash(session: SparkSession, | |
| .toDF(cols.sha, cols.hashtable, cols.value) | ||
| .write | ||
| .mode("append") | ||
| .cassandraFormat(hashtablesTable, keyspace) | ||
| .cassandraFormat(tables.hashtables(mode), keyspace) | ||
| .save() | ||
| } | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd use
${x}or$xand not a mix of them at least not in the same string.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I don't do as above InteliJ IDEA linter complains.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why does it complain? 😕
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can't use
$xbecause "can't resolve symbolhashtables_.if I use
${mode}it complains "the enclosing block is redundant".I try to keep IntelliJ IDEA happy because right now it's the only linter we have.
Some work about style/lint was done before in #87 and #86, but there is no agreement.
If we add some opinionated linter/formatter (like gofmt or prettier) most probably it would be possible to update code to follow the style automatically.