# DuckDB Extension for PySpark

Since DuckDB supports only a single writer at a time, writing directly from PySpark can lead to locking errors (such as "lock already set") due to Spark's multi-worker write process.

This custom PySpark extension provides a reliable way to write DataFrames to DuckDB, ensuring smooth data transfer without concurrency issues.
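
For reference, the usual workaround without this package is to funnel the data through a single DuckDB connection on the driver, roughly like the sketch below. This assumes the DataFrame fits in driver memory and reuses the file and table names from the Usage example; it is not the extension's actual implementation.

```python
import duckdb
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("single-writer sketch").getOrCreate()
df = spark.read.csv("employe.csv", header=True)

# Collect on the driver so only one process ever touches the DuckDB file.
pdf = df.toPandas()

con = duckdb.connect("./company_database.duckdb")
con.register("pdf", pdf)  # make the pandas DataFrame visible to DuckDB
con.execute("CREATE OR REPLACE TABLE employe_tbl AS SELECT * FROM pdf")
con.close()
```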

## Features

  • Seamlessly write PySpark DataFrames to DuckDB
  • Supports overwrite and append modes
  • Automatically detects and adds new columns when appending data (see the sketch after this list)
  • Simple integration with PySpark's DataFrameWriter API
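
The column detection on append could conceptually work along the lines below. This is only an illustrative sketch; the helper name, the use of pandas, and the VARCHAR default are assumptions, not the package's actual code.

```python
import duckdb
import pandas as pd

def append_with_new_columns(con: duckdb.DuckDBPyConnection,
                            table_name: str,
                            pdf: pd.DataFrame) -> None:
    """Add columns present in the DataFrame but missing from the table, then append."""
    info = con.execute(f"PRAGMA table_info('{table_name}')").fetchall()
    existing = {row[1] for row in info}  # row[1] is the column name
    for col in pdf.columns:
        if col not in existing:
            # Illustrative only: new columns are added as VARCHAR.
            con.execute(f'ALTER TABLE {table_name} ADD COLUMN "{col}" VARCHAR')
    con.register("new_rows", pdf)  # expose the DataFrame to DuckDB
    con.execute(f"INSERT INTO {table_name} BY NAME SELECT * FROM new_rows")
```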

## Installation

You can install the package using pip:

pip install duckdb-spark


## Usage

```python
from pyspark.sql import SparkSession
from duckdb_extension import register_duckdb_extension

spark = SparkSession.builder.appName("DuckDB Example").getOrCreate()

# Register the DuckDB extension
register_duckdb_extension(spark)

df = spark.read.csv("employe.csv", header=True)

# Use the custom extension to write the DataFrame to DuckDB and specify the table name
df.write.duckdb_extension(database="./company_database.duckdb", table_name="employe_tbl", mode="append")
```

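To replace the table instead of appending, pass mode="overwrite". Once a write has finished, the result can be inspected with the standard DuckDB Python client; the file and table names here follow the example above.

```python
import duckdb

con = duckdb.connect("./company_database.duckdb")
con.sql("SELECT * FROM employe_tbl LIMIT 5").show()
con.close()
```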