Kreuzberg

Kreuzberg is a polyglot document intelligence framework for extracting text, metadata, images, and structured information from PDFs, Office documents, images, and other files across 91+ formats.

The kreuzberg-surrealdb package connects Kreuzberg's document extraction pipeline to SurrealDB. It handles schema creation, content deduplication, optional chunking and embedding, and index configuration.

How it works

Extract — Kreuzberg parses the source documents and runs OCR where needed.
Connect — The connector receives the extracted output and manages the SurrealDB connection.
Store — Each document is hashed (SHA-256) for deduplication, optionally chunked and embedded, then written to SurrealDB under an auto-generated schema.
Search — Full-text (BM25), vector (HNSW), and hybrid (RRF) search are available immediately after ingestion.
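To make the hybrid step concrete, here is a minimal sketch of reciprocal rank fusion (RRF) in plain Python. It is not the connector's code and does not call SurrealDB. The function name, the sample document IDs, and the smoothing constant k=60 (a commonly used default) are assumptions for illustration only.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked ID lists into one list, best match first.

    Each document scores 1 / (k + rank) per list it appears in (rank starts
    at 1), and the per-list scores are summed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties fall back to the ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, as a full-text and a vector query might return them.
bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]

fused = reciprocal_rank_fusion([bm25_hits, vector_hits])
print(fused)
```

Because RRF uses only rank positions, BM25 scores and vector distances never need to be normalised onto a shared scale before they are combined.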

Key capabilities

Schema management — setup_schema() creates tables, indexes, and analyzers. No manual DDL required.
Deduplication — Deterministic record IDs derived from content hashes prevent duplicate rows across ingestion runs.
Flexible ingestion — Single files, file lists, directories (with glob), or raw bytes.
Extraction control — Pass Kreuzberg's ExtractionConfig to set OCR behavior, output format, and quality processing.
Batch tuning — Adjust insert_batch_size to balance throughput against memory usage.
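The deduplication and batch tuning ideas can be sketched with the standard library alone. This is an illustration under stated assumptions, not the package's implementation: the helper names, the "documents" table name, and the "table:digest" ID layout are hypothetical, and only the insert_batch_size name comes from the text above.

```python
import hashlib
from typing import Iterable, Iterator, List, Tuple


def content_record_id(content: bytes, table: str = "documents") -> str:
    """Derive a deterministic ID from the SHA-256 hash of the content.

    The "table:digest" layout is a hypothetical choice for this sketch.
    """
    return f"{table}:{hashlib.sha256(content).hexdigest()}"


def batched(items: List, batch_size: int) -> Iterator[List]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def plan_inserts(
    documents: Iterable[bytes], insert_batch_size: int = 100
) -> List[List[Tuple[str, bytes]]]:
    """Drop duplicate content, then group the remaining records into batches."""
    unique = {}
    for content in documents:
        # Identical bytes hash to the same ID, so only the first copy is kept.
        unique.setdefault(content_record_id(content), content)
    return list(batched(list(unique.items()), insert_batch_size))


# b"alpha" appears twice but is planned for insertion only once.
docs = [b"alpha", b"beta", b"alpha", b"gamma"]
batches = plan_inserts(docs, insert_batch_size=2)
print([len(batch) for batch in batches])
```

A smaller batch size keeps fewer records in memory per write at the cost of more round trips, while a larger one does the opposite. Because the ID depends only on the content, re-running ingestion over the same files maps each one to the same record.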

Getting started

To get started, see the SurrealDB page in the Kreuzberg Docs. For the complete API reference, embedding model options, chunking configuration, and database schema details, see the kreuzberg-surrealdb README.