DuckDB File Tools Extension

A DuckDB extension written in Rust that provides file system operations, metadata extraction, and compression.

Vibe coded, so use at your own discretion. Not thoroughly tested yet.

Features

File Operations

  • Parallel file scanning with multi-threaded hash computation
  • Memory-efficient streaming for large files
  • Cross-platform path handling (Windows, Unix, macOS)

File Metadata

  • File size, timestamps (created, modified, accessed)
  • Permissions, inode numbers, file type detection
  • Path decomposition and analysis

Multi-Algorithm Compression

  • GZIP (balanced), ZSTD (best compression), LZ4 (fastest)
  • Automatic format detection for decompression
  • Optimized for different use cases

Content Analysis

  • SHA256 hash computation with streaming support
  • BLOB manipulation and substring extraction
  • Binary file header analysis

Quick Start

Installation

Option 1: Load Pre-built Extension

-- Load the extension (unsigned mode for development)
LOAD './build/release/file_tools.duckdb_extension';

Option 2: Build from Source

# Clone the repository
git clone https://github.com/nicad/duckdb-file-tools.git
cd duckdb-file-tools

# Configure and build
make configure
make release

# Load in DuckDB
duckdb -unsigned -c "LOAD './build/release/file_tools.duckdb_extension';"

Basic Usage

-- Scan files with metadata
SELECT path, size, modified_time 
FROM glob_stat('*.csv')
WHERE size > 1000000;

-- Get file hashes in parallel (high performance)
SELECT path, hash, size
FROM glob_stat_sha256_parallel('data/**/*')
WHERE is_file = 'true';

-- Analyze file types by extension
SELECT 
    path_parts(path).suffix AS extension,
    count(*) AS file_count,
    sum(cast(size AS BIGINT)) AS total_size
FROM glob_stat('**/*')
WHERE is_file = 'true'
GROUP BY extension
ORDER BY total_size DESC;

-- Compress data with different algorithms
SELECT 
    'GZIP' as algo, 
    octet_length(compress(data)) as size,
    'Balanced performance' as use_case
FROM (SELECT 'Large text data...'::BLOB as data)
UNION ALL
SELECT 
    'ZSTD' as algo,
    octet_length(compress_zstd(data)) as size, 
    'Best compression ratio' as use_case
FROM (SELECT 'Large text data...'::BLOB as data)
UNION ALL
SELECT 
    'LZ4' as algo,
    octet_length(compress_lz4(data)) as size,
    'Fastest compression' as use_case  
FROM (SELECT 'Large text data...'::BLOB as data);
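
Because decompress auto-detects the format, a quick round-trip sanity check covers all three algorithms in one query. This is a minimal sketch; the literal payload is just a placeholder:

-- Round-trip check: decompress should recover the original BLOB for each algorithm
SELECT
    decompress(compress('Large text data...'::BLOB))      = 'Large text data...'::BLOB AS gzip_ok,
    decompress(compress_zstd('Large text data...'::BLOB)) = 'Large text data...'::BLOB AS zstd_ok,
    decompress(compress_lz4('Large text data...'::BLOB))  = 'Large text data...'::BLOB AS lz4_ok;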

Function Reference

Table Functions

  Function                             Purpose                               Performance
  glob_stat(pattern)                   File metadata collection              Standard
  glob_stat_sha256_parallel(pattern)   High-performance parallel hashing     Fast
  glob_stat_sha256_jwalk(pattern)      Alternative parallel implementation   Fast

Warning: the glob_stat_sha256_* functions compute the full checksum of every matched file. Even though hashing runs in parallel, this can take a long time on large files and/or large directories. They should still outperform combining glob with file_sha256, which does not appear to be parallelized (more testing needed).
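
To get a feel for how much work such a run involves, you can pre-scan the same pattern with glob_stat and sum the byte counts first. This is only a rough sketch; the 'data' directory is a placeholder:

-- Estimate how much data a hashing pass would need to read
SELECT count(*) AS files, sum(cast(size AS BIGINT)) AS total_bytes
FROM glob_stat('data/**/*')
WHERE is_file = 'true';

-- Then run the (potentially long) parallel hash
SELECT path, hash
FROM glob_stat_sha256_parallel('data/**/*')
WHERE is_file = 'true';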

Scalar Functions

  Function                           Purpose                      Example
  file_stat(path)                    Single file metadata         file_stat('data.csv').size
  file_sha256(path)                  SHA256 hash of file          file_sha256('document.pdf')
  path_parts(path)                   Path decomposition           path_parts('/a/b/file.tar.gz').suffix
  blob_substr(blob, start, length)   BLOB substring               blob_substr(data, 1, 4)
  compress(data)                     GZIP compression             compress('text'::BLOB)
  compress_zstd(data)                ZSTD compression             compress_zstd(large_data)
  compress_lz4(data)                 LZ4 compression              compress_lz4(stream_data)
  decompress(data)                   Auto-detect decompression    decompress(compressed_blob)
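
Putting a few of these together, a single-file inspection could look like the sketch below. The file name is hypothetical; the struct field access mirrors the examples above, and the final blob_substr call peeks at the GZIP magic bytes (0x1F 0x8B) of a freshly compressed BLOB:

-- Single-file metadata, path decomposition, checksum, and a BLOB header peek
SELECT
    file_stat('data.csv').size                            AS file_size,
    path_parts('data.csv').suffix                         AS extension,
    file_sha256('data.csv')                               AS checksum,
    blob_substr(compress('sample payload'::BLOB), 1, 2)   AS gzip_magic;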

Performance

Compression Estimates

These figures are rough estimates and have not been verified.

  Algorithm   Compression Ratio     Speed          Best Use Case
  LZ4         ~50% size reduction   2-3 GB/s       Real-time, streaming, cache
  GZIP        ~70% size reduction   30-50 MB/s     General purpose, compatibility
  ZSTD        ~75% size reduction   100-400 MB/s   Balanced: recommended

Parallel Performance

The parallel functions (glob_stat_sha256_parallel, glob_stat_sha256_jwalk) provide significant performance improvements:

  • Multi-core utilization: Scales with available CPU cores
  • Large directories: 5-10x faster than sequential processing (see the comparison sketch below)
  • Memory efficient: Streaming hash computation keeps memory usage bounded, even for very large files
  • Debug instrumentation: Set DUCKDB_FILE_TOOLS_DEBUG=1 for detailed timing
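
A simple way to see the difference on your own data is the DuckDB CLI timer. The pattern below is a placeholder, and the first query is the sequential baseline (glob_stat plus the scalar file_sha256):

-- Run inside the DuckDB CLI (started with -unsigned and the extension loaded)
.timer on

-- Sequential baseline: scalar hashing over a glob_stat listing
SELECT path, file_sha256(path) AS hash
FROM glob_stat('data/**/*')
WHERE is_file = 'true';

-- Parallel table function
SELECT path, hash
FROM glob_stat_sha256_parallel('data/**/*')
WHERE is_file = 'true';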

Advanced Usage

Performance Monitoring

# Enable detailed performance instrumentation
export DUCKDB_FILE_TOOLS_DEBUG=1
duckdb -unsigned -c "
  LOAD './build/release/file_tools.duckdb_extension';
  SELECT count(*) FROM glob_stat_sha256_parallel('large_dataset/**/*');
"

File Integrity Workflows

-- Create file integrity manifest
CREATE TABLE file_manifest AS
SELECT 
    path,
    hash,
    cast(size AS BIGINT) AS size,
    modified_time
FROM glob_stat_sha256_parallel('critical_data/**/*')
WHERE is_file = 'true';

-- Later: verify file integrity  
SELECT 
    path,
    hash = file_sha256(path) AS is_valid
FROM file_manifest;
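
The manifest also makes it easy to spot files that appeared after it was created; here is a sketch using a plain anti-join against a fresh glob_stat listing:

-- Detect files added since the manifest was created
SELECT g.path
FROM glob_stat('critical_data/**/*') AS g
LEFT JOIN file_manifest AS m ON g.path = m.path
WHERE g.is_file = 'true' AND m.path IS NULL;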

Data Archival with Compression

-- Archive logs with maximum compression
CREATE TABLE archived_logs AS
SELECT 
    date_trunc('day', timestamp) AS log_date,
    compress_zstd(string_agg(log_entry, chr(10) ORDER BY timestamp)::BLOB) AS compressed_logs,
    count(*) AS entry_count
FROM application_logs 
WHERE timestamp < current_date - INTERVAL '30 days'
GROUP BY date_trunc('day', timestamp);

-- Fast cache with LZ4
CREATE TABLE cached_reports AS
SELECT 
    report_id,
    compress_lz4(report_data::BLOB) AS cached_data
FROM expensive_reports
WHERE access_count > 100;
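
Reading an archive back relies on decompress detecting the ZSTD framing; decode() is DuckDB's built-in BLOB-to-VARCHAR conversion, and the date filter is only illustrative:

-- Restore one day's logs as text
SELECT log_date, decode(decompress(compressed_logs)) AS logs
FROM archived_logs
WHERE log_date = DATE '2024-01-15';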

Development

Building

# Development build with debug symbols
make configure
make debug

# Optimized release build
make release

# Run tests
make test_debug
make test_release

Testing with Different DuckDB Versions

# Clean previous builds
make clean_all

# Configure for specific DuckDB version
DUCKDB_TEST_VERSION=v1.3.1 make configure

# Build and test
make debug
make test_debug

Dependencies

  • Rust 1.70+ (for development)
  • Python 3 with venv support
  • Make
  • Git

Runtime dependencies are statically linked into the extension.

Planned Features

Additional file system utilities and performance improvements are under development.

Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Development Guidelines

  • Follow Rust conventions and cargo fmt
  • Add tests for new functionality
  • Update documentation for new features
  • Run make test_debug before submitting

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

Related Projects

  • DuckDB - An in-process SQL OLAP database management system
