This repository contains the AnyBlox plugin for Spark.
You need Java 11, Maven 3.9, SBT 1.10, and Scala 2.12. We recommend SDKMAN! for managing those.
After that, simply run `sbt package`. The `.jar` file will be produced in `target/scala-2.12`.
The plugin needs to be registered with Spark in `spark-defaults.conf`:
```
spark.plugins org.anyblox.spark.AnyBloxPlugin
```
You will also need the following Arrow jars on the classpath: `arrow-c-data-18.1.0.jar` and `arrow-vector-18.1.0.jar`.
You can then run `spark-shell`, passing the required packages and jars:
```shell
/opt/spark/bin/spark-shell --packages org.scala-lang:toolkit_2.12:0.1.7 --jars "/anyblox/anyblox-spark_2.12-0.1.0-SNAPSHOT.jar,/arrow/arrow-c-data-18.1.0.jar,/arrow/arrow-vector-18.1.0.jar"
```
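Alternatively, instead of passing `--packages` and `--jars` on every invocation, the same dependencies can be set once in `spark-defaults.conf`. A sketch, assuming the jar paths shown above (`spark.jars` and `spark.jars.packages` are standard Spark configuration keys):

```
spark.plugins        org.anyblox.spark.AnyBloxPlugin
spark.jars           /anyblox/anyblox-spark_2.12-0.1.0-SNAPSHOT.jar,/arrow/arrow-c-data-18.1.0.jar,/arrow/arrow-vector-18.1.0.jar
spark.jars.packages  org.scala-lang:toolkit_2.12:0.1.7
```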
Open `.any` files as DataFrames using standard Spark syntax:
```scala
val df = spark.read.format("anyblox").load("/path/to/data.any")
```
You can use the DataFrame like any other Spark DataFrame, e.g. create a temporary view and query it with SQL:
```scala
df.createTempView("myview")
spark.sql("SELECT * FROM myview").show
```
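Beyond SQL, the full DataFrame API is available. A minimal sketch of further use, assuming the loaded data contains a column named `id` (a hypothetical column, for illustration only):

```scala
// Standard column functions from Spark SQL.
import org.apache.spark.sql.functions.col

// "id" is a hypothetical column name; substitute a column from your .any file.
val filtered = df.filter(col("id") > 100).select("id")
filtered.show()

// The same query expressed in SQL against the registered view:
spark.sql("SELECT COUNT(*) FROM myview WHERE id > 100").show()
```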