safe_store is a Python library that turns your local folders of documents into a powerful, private, and intelligent knowledge base. It does this by combining two complementary AI techniques in a single, seamless tool:
- Deep Semantic Search: It reads and understands the content of your files, allowing you to search by meaning and context, not just keywords.
- AI-Powered Knowledge Graph: It uses a Large Language Model (LLM) to automatically identify key entities (people, companies, concepts) and the relationships between them, building an interconnected web of your knowledge.
All of this happens entirely on your local machine, using a single, portable SQLite file. Your data never leaves your control.
safe_store is designed to grow with your needs. You can start with a simple, powerful RAG system in minutes, and then evolve it into a sophisticated knowledge engine.
## The Foundation: Retrieval-Augmented Generation (RAG)
RAG is the state-of-the-art technique for making Large Language Models (LLMs) answer questions about your private documents. The process is simple:
- Retrieve: Find the most relevant text chunks from your documents related to a user's query.
- Augment: Add those chunks as context to your prompt.
- Generate: Ask the LLM to generate an answer based only on the provided context.
SafeStore is the perfect tool for the "Retrieve" step. It uses vector embeddings to understand the meaning of your text, allowing you to find relevant passages even if they don't contain the exact keywords.
### Example: A Simple RAG Pipeline

```python
import safe_store

# 1. Create a store. This will create a 'my_notes.db' file.
store = safe_store.SafeStore(db_path="my_notes.db", vectorizer_name="st")

# 2. Add your documents. It will scan the folder and process all supported files.
with store:
    store.add_document("path/to/my_notes_and_articles/")

    # 3. Query the store to find context for your RAG prompt.
    user_query = "What were the main arguments about AI consciousness in my research?"
    context_chunks = store.query(user_query, top_k=3)

# 4. Build the prompt and send it to your LLM.
context_text = "\n\n".join([chunk['chunk_text'] for chunk in context_chunks])
prompt = f"""
Based on the following context, please answer the user's question.
Do not use any external knowledge.

Context:
---
{context_text}
---

Question: {user_query}
"""
# result = my_llm_function(prompt)  # Send to your LLM of choice
```

With just this, you have a powerful, private RAG system running on your local files.
## Level 2: Uncover Hidden Connections with a Knowledge Graph

### The Next Dimension: From Passages to a Web of Knowledge
Semantic search is great for finding relevant passages, but it struggles with questions about specific facts and relationships scattered across multiple documents.
GraphStore complements this by building a structured knowledge graph of the key instances (like the person "Geoffrey Hinton") and their relationships (like PIONEERED the concept "Backpropagation"). This allows you to ask precise, factual questions.
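A complete, runnable quickstart appears at the end of this document; the short sketch below shows just the core flow. It assumes `store` is an initialized SafeStore with documents already indexed, and `llm_executor` is any callable that sends a prompt to your LLM and returns its text response.

```python
# A minimal sketch of the GraphStore flow (see the full quickstart at the
# end of this document for a runnable version).
from safe_store import GraphStore

ontology = "Extract People and Companies. A Person can be a CEO_OF a Company."
graph = GraphStore(store=store, llm_executor_callback=llm_executor, ontology=ontology)

with graph:
    # Let the LLM extract entities and relationships from the indexed documents
    graph.build_graph_for_all_documents()
    # Ask a precise, factual question against the graph
    result = graph.query_graph("Who is the CEO of QuantumLeap AI?", output_mode="graph_only")

for rel in result.get("relationships", []):
    print(rel["type"])
```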
## Visualize Your Vector Space

Understanding the structure of your knowledge base can be challenging. safe_store provides a powerful tool to visually explore the semantic relationships within your documents.
The export_point_cloud() method performs a Principal Component Analysis (PCA) on all the vectors in your store to create a 2D "map" of your data. When combined with a simple web interface, this allows you to:
- See Clusters: Identify natural groupings of related content at a glance.
- Explore Relationships: Understand how different documents and topics relate to each other in the vector space.
- Debug and Refine: Visually inspect the results of different chunking strategies or vectorization models to see how they affect the semantic layout of your data.
**Example Visualization:**

*(This UI is generated by the example script below.)*
This entire interactive application, including the web server and the API to fetch chunk text on hover, is available as a complete, runnable example. It's the perfect starting point for building your own custom knowledge exploration tools.
Save the following code as `run_point_cloud_app.py` and execute it with `python run_point_cloud_app.py`.
```python
# examples/point_cloud_and_api.py
import safe_store
from pathlib import Path
import shutil
import json
import webbrowser
from http.server import HTTPServer, SimpleHTTPRequestHandler
import threading
import pipmaster as pm

# Ensure necessary packages for PCA and the example are installed
pm.ensure_packages(["scikit-learn", "pandas"])

# --- Helper Functions ---
def print_header(title):
    print("\n" + "=" * 10 + f" {title} " + "=" * 10)

def setup_environment():
    """Cleans up old files and creates new ones for the example."""
    print_header("Setting Up Example Environment")
    db_file = Path("point_cloud_example.db")
    doc_dir = Path("temp_docs_point_cloud")

    # Clean up the DB and its artifacts
    for p in [db_file, Path(f"{db_file}.lock"), Path(f"{db_file}-wal"), Path(f"{db_file}-shm")]:
        p.unlink(missing_ok=True)

    # Clean up and create the doc directory
    if doc_dir.exists():
        shutil.rmtree(doc_dir)
    doc_dir.mkdir(exist_ok=True)

    # Create sample documents with metadata
    (doc_dir / "animals.txt").write_text(
        "The quick brown fox jumps over the lazy dog. A fast red fox is athletic. The sleepy dog rests."
    )
    (doc_dir / "tech.txt").write_text(
        "Python is a versatile programming language. Many developers use Python for AI. RAG pipelines are a common use case."
    )
    (doc_dir / "space.txt").write_text(
        "The sun is a star at the center of our solar system. The Earth revolves around the sun. Space exploration is fascinating."
    )
    print("- Created sample documents and cleaned up old database.")
    return db_file, doc_dir

# --- Main Logic ---
DB_FILE, DOC_DIR = setup_environment()

print_header("Initializing SafeStore and Indexing Documents")

# Initialize SafeStore
store = safe_store.SafeStore(
    db_path=DB_FILE,
    vectorizer_name="st",
    vectorizer_config={"model": "all-MiniLM-L6-v2"},
    chunk_size=10,  # small chunks for more points
    chunk_overlap=2
)

# Add documents to the store with metadata
with store:
    store.add_document(DOC_DIR / "animals.txt", metadata={"topic": "animals", "source": "fiction"})
    store.add_document(DOC_DIR / "tech.txt", metadata={"topic": "technology", "source": "documentation"})
    store.add_document(DOC_DIR / "space.txt", metadata={"topic": "space", "source": "science"})
print("- Documents indexed successfully.")

# --- Data Export for Visualization ---
print_header("Exporting Point Cloud Data")
with store:
    point_cloud_data = store.export_point_cloud(output_format='dict')

# Save the data to a JSON file for the web page to fetch
web_dir = Path("point_cloud_web_app")
web_dir.mkdir(exist_ok=True)
data_file = web_dir / "data.json"
with open(data_file, "w") as f:
    json.dump(point_cloud_data, f)
print(f"- Point cloud data exported to {data_file}")

# --- Web Server and HTML Page ---
html_content = """
<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>SafeStore | 2D Chunk Visualization</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://cdn.plot.ly/plotly-2.32.0.min.js"></script>
</head>
<body class="bg-slate-50 dark:bg-slate-900 text-slate-800 dark:text-slate-200">
    <main class="container mx-auto p-8">
        <header class="text-center mb-12">
            <h1 class="text-4xl font-bold text-slate-900 dark:text-white">2D Document Chunk Visualization</h1>
            <p class="mt-2 text-lg text-slate-600 dark:text-slate-400">Interactive PCA plot of vectorized chunks. Hover to inspect.</p>
        </header>
        <div class="grid grid-cols-1 lg:grid-cols-5 gap-8">
            <div class="lg:col-span-3 bg-white dark:bg-slate-800 rounded-xl shadow-lg p-6 h-[70vh]">
                <div id="plot" class="w-full h-full"></div>
            </div>
            <div class="lg:col-span-2 bg-white dark:bg-slate-800 rounded-xl shadow-lg p-6">
                <h2 class="text-2xl font-semibold text-slate-900 dark:text-white mb-4">Chunk Inspector</h2>
                <div id="chunk-info-container" class="relative h-[calc(70vh-80px)]"></div>
            </div>
        </div>
    </main>
    <script>
        document.addEventListener('DOMContentLoaded', function() {
            // ... (JavaScript remains the same as in the example file) ...
        });
    </script>
</body>
</html>
"""
# (For brevity, the full JavaScript is in the example file but the structure is shown here)

# Write the HTML file
index_file = web_dir / "index.html"
index_file.write_text(html_content)

# Define a custom request handler to serve files and provide an API
class CustomHandler(SimpleHTTPRequestHandler):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, directory=str(web_dir), **kwargs)

    def do_GET(self):
        if self.path.startswith('/chunk/'):
            try:
                chunk_id = int(self.path.split('/')[-1])
                with store:
                    chunk_data = store.get_chunk_by_id(chunk_id)
                if chunk_data:
                    self.send_response(200)
                    self.send_header('Content-type', 'application/json')
                    self.end_headers()
                    self.wfile.write(json.dumps(chunk_data).encode('utf-8'))
                else:
                    self.send_error(404, "Chunk not found")
            except Exception as e:
                self.send_error(500, str(e))
            return
        super().do_GET()

print(f"- Wrote web application files to '{web_dir.resolve()}'")

# --- Run Server ---
PORT = 8008
server_address = ('', PORT)
httpd = HTTPServer(server_address, CustomHandler)
url = f"http://localhost:{PORT}"

print_header("Starting Web Server")
print(f"Serving visualization at: {url}")
print("Please open the URL in your web browser.")
print("Press Ctrl+C to stop the server.")

# Open the browser shortly after the server starts
threading.Timer(1.5, lambda: webbrowser.open(url)).start()

try:
    httpd.serve_forever()
except KeyboardInterrupt:
    print("\n- Server stopped.")
finally:
    httpd.server_close()
```

## Discover Vectorizers and Models at Runtime

One of safe_store's most powerful features is its ability to self-document. You don't need to guess which vectorizers are available or what parameters they need. You can discover everything at runtime.
This makes it easy to experiment with different embedding models and build interactive tools that guide users through the setup process.
### Listing All Available Vectorizers

The `SafeStore.list_available_vectorizers()` class method scans the library for all built-in and custom vectorizers and returns their complete configuration metadata.
```python
import safe_store
import pprint

# Get a list of all available vectorizer configurations
available_vectorizers = safe_store.SafeStore.list_available_vectorizers()

# Pretty-print the result to see what's available
pprint.pprint(available_vectorizers)
```

This will produce a detailed output like this:

```python
[{'author': 'ParisNeo',
  'class_name': 'CohereVectorizer',
  'creation_date': '2025-10-10',
  'description': "A vectorizer that uses Cohere's API...",
  'input_parameters': [{'default': 'embed-english-v3.0',
                        'description': 'The name of the Cohere embedding model...',
                        'mandatory': True,
                        'name': 'model'},
                       {'default': '',
                        'description': 'Your Cohere API key...',
                        'mandatory': False,
                        'name': 'api_key'},
                       ...],
  'last_update_date': '2025-10-10',
  'name': 'cohere',
  'title': 'Cohere Vectorizer'},
 {'author': 'ParisNeo',
  'class_name': 'OllamaVectorizer',
  'name': 'ollama',
  'title': 'Ollama Vectorizer',
  ...},
 ...
]
```
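Because this metadata is plain Python data, you can also consume it programmatically. A small sketch (field names taken from the output above; it assumes the `cohere` vectorizer is present in the listing):

```python
# Inspect vectorizer metadata programmatically: find the 'cohere' entry
# and list its mandatory parameters.
import safe_store

vectorizers = safe_store.SafeStore.list_available_vectorizers()
cohere = next((v for v in vectorizers if v["name"] == "cohere"), None)
if cohere:
    mandatory = [p["name"] for p in cohere.get("input_parameters", []) if p.get("mandatory")]
    print(f"Mandatory parameters for 'cohere': {mandatory}")
```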
### Listing a Vectorizer's Available Models

Once you know which vectorizer you want to use, you can ask safe_store which specific models it supports. This is especially useful for API-based or local server-based vectorizers like `ollama`, which can have many different models available.
```python
import safe_store

# Example: List all embedding models available from a running Ollama server
try:
    # This requires a running Ollama instance to succeed
    ollama_models = safe_store.SafeStore.list_models("ollama")
    print("Available Ollama embedding models:")
    for model in ollama_models:
        print(f"- {model}")
except Exception as e:
    print(f"Could not list Ollama models. Is the server running? Error: {e}")
```

You can use this metadata to create an interactive setup script, guiding the user to choose and configure their desired vectorizer on the fly.
Full Interactive Example:
Copy and run this script. It will guide you through selecting and configuring a vectorizer, then initialize SafeStore with your choices.
```python
# interactive_setup.py
import safe_store
import pprint

def interactive_vectorizer_setup():
    """
    An interactive CLI to guide the user through selecting and configuring a vectorizer.
    """
    print("--- Welcome to the safe_store Interactive Vectorizer Setup ---")

    # 1. List all available vectorizers
    vectorizers = safe_store.SafeStore.list_available_vectorizers()
    print("\nAvailable Vectorizers:")
    for i, vec in enumerate(vectorizers):
        print(f"  [{i+1}] {vec['name']} - {vec.get('title', 'No Title')}")

    # 2. Get the user's choice
    choice = -1
    while choice < 0 or choice >= len(vectorizers):
        try:
            raw_choice = input(f"\nPlease select a vectorizer (1-{len(vectorizers)}): ")
            choice = int(raw_choice) - 1
            if not (0 <= choice < len(vectorizers)):
                print("Invalid selection. Please try again.")
        except ValueError:
            print("Please enter a number.")

    selected_vectorizer = vectorizers[choice]
    selected_name = selected_vectorizer['name']
    print(f"\nYou have selected: {selected_name}")
    print(f"Description: {selected_vectorizer.get('description', 'N/A').strip()}")

    # 3. Dynamically build the configuration dictionary
    vectorizer_config = {}
    print("\nPlease provide the following configuration values (press Enter to use default):")
    params = selected_vectorizer.get('input_parameters', [])
    if not params:
        print("This vectorizer requires no special configuration.")
    else:
        for param in params:
            param_name = param['name']
            description = param.get('description', 'No description.')
            default_value = param.get('default', None)

            prompt = f"- {param_name} ({description})"
            if default_value is not None:
                prompt += f" [default: {default_value}]: "
            else:
                prompt += ": "

            user_input = input(prompt)
            # Use user input if provided, otherwise use the default
            final_value = user_input if user_input else default_value

            # Simple type conversion for demonstration (can be expanded)
            if final_value is not None:
                if param.get('type') == 'int':
                    vectorizer_config[param_name] = int(final_value)
                elif param.get('type') == 'dict':
                    # For simplicity, we don't parse dicts here, but a real app might use json.loads
                    vectorizer_config[param_name] = final_value
                else:
                    vectorizer_config[param_name] = str(final_value)

    # 4. Initialize SafeStore with the dynamically created configuration
    print("\n--- Configuration Complete ---")
    print(f"Vectorizer Name: '{selected_name}'")
    print("Vectorizer Config:")
    pprint.pprint(vectorizer_config)

    try:
        print("\nInitializing SafeStore with your configuration...")
        store = safe_store.SafeStore(
            db_path=f"{selected_name}_store.db",
            vectorizer_name=selected_name,
            vectorizer_config=vectorizer_config
        )
        print("\n✅ SafeStore initialized successfully!")
        print(f"Database file is at: {selected_name}_store.db")
        store.close()
    except Exception as e:
        print(f"\n❌ Failed to initialize SafeStore: {e}")

if __name__ == "__main__":
    interactive_vectorizer_setup()
```

This script demonstrates how the self-documenting nature of safe_store enables you to build powerful, user-friendly applications on top of it.
## Chunking Strategies: Character vs. Token

safe_store can chunk your documents based on character count (`character` strategy) or token count (`token` strategy). Using the `token` strategy is often more effective, as it aligns more closely with how Large Language Models (LLMs) process text.

When you select `chunking_strategy='token'`, safe_store handles tokenization intelligently:

1. **Vectorizer's native tokenizer:** If the chosen vectorizer (like a local `sentence-transformers` model) has its own tokenizer, safe_store will use it. This is the most accurate method, as the chunking tokens will perfectly match the vectorizer's tokens.
2. **Fallback to `tiktoken`:** Some vectorizers, especially those accessed via an API (like OpenAI or Cohere), do not expose their tokenizer for local use. In these cases, safe_store uses `tiktoken` (specifically the `cl100k_base` encoding) as a reliable fallback. `tiktoken` is the tokenizer used by modern OpenAI models and provides a very close approximation for many other models, ensuring your chunks are sized correctly for optimal performance.

You can also specify a custom tokenizer during SafeStore initialization if you have specific needs.
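As a minimal sketch, token-based chunking only requires switching the strategy at initialization; `chunk_size` and `chunk_overlap` are then counted in tokens rather than characters (the specific sizes below are illustrative, not recommended defaults):

```python
# A minimal sketch: token-based chunking at store creation.
# With a sentence-transformers model, its native tokenizer is used;
# API-only vectorizers fall back to tiktoken (cl100k_base), per the docs above.
import safe_store

store = safe_store.SafeStore(
    db_path="token_chunked.db",
    vectorizer_name="st",
    chunking_strategy="token",  # chunk by tokens instead of characters
    chunk_size=256,             # counted in tokens under this strategy
    chunk_overlap=32,
)
```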
## Enrich Your Documents with Metadata

Metadata is extra information about your documents that provides crucial context. You can attach a dictionary of key-value pairs to any document you add to safe_store.

**How to Add Metadata:**

Simply pass a dictionary to the `metadata` parameter when adding content.
```python
# Example of adding a document with metadata
doc_info = {
    "title": "Quantum Entanglement in Nanostructures",
    "author": "Dr. Alice Smith",
    "year": 2024,
    "topic": "Quantum Physics"
}

with store:
    store.add_document(
        "path/to/research_paper.txt",
        metadata=doc_info
    )
```

**How Metadata is Used in Queries:**
When you perform a query, the document's metadata is returned in two ways for maximum flexibility:

- **As a structured dictionary:** The `document_metadata` field contains the parsed metadata, which your application can use for filtering, logging, or display purposes.
- **Prepended to the `chunk_text`:** A human-readable version of the metadata is automatically added to the beginning of the returned `chunk_text`. This "just-in-time" context injection dramatically improves an LLM's ability to understand the source and relevance of the information, leading to better-quality responses without any extra work on your part.
A query result object looks like this:
```json
[
  {
    "chunk_id": 123,
    "similarity_percent": 95.4,
    "file_path": "/path/to/research_paper.txt",
    "document_metadata": {
      "title": "Quantum Entanglement in Nanostructures",
      "author": "Dr. Alice Smith",
      "year": 2024,
      "topic": "Quantum Physics"
    },
    "chunk_text": "--- Document Context ---\nTitle: Quantum Entanglement in Nanostructures\nAuthor: Dr. Alice Smith\nYear: 2024\nTopic: Quantum Physics\n------------------------\n\n...the actual text from the document chunk begins here..."
  }
]
```

## Reconstructing Document Text

After indexing, you may need to retrieve the full, original text of a document as it was processed by safe_store. The `reconstruct_document_text` method does this by fetching and reassembling all of a document's stored chunks.
```python
# Assuming 'store' is an initialized SafeStore instance
# with "path/to/research_paper.txt" already added.
full_text = store.reconstruct_document_text("path/to/research_paper.txt")

if full_text:
    print("--- Reconstructed Text ---")
    print(full_text[:500] + "...")

# Note: If a chunk_overlap was used during indexing, the reconstructed text
# will contain these repeated, overlapping segments. This method provides a
# raw reassembly of the stored data.
```

## Transform Chunks with a `chunk_processor`

For advanced RAG, you might need to transform the text of a chunk before it is vectorized and stored. The `chunk_processor` is a powerful hook that lets you do exactly that.
It's an optional callable that you can pass to add_document or add_text. The function receives the raw text of each chunk and the document's metadata, and it must return the string that you want to be stored and vectorized instead.
This enables powerful workflows like:
- Summarization: Replace long chunks with concise summaries generated by an LLM.
- Keyword Extraction: Prepend important keywords to each chunk to boost relevance for certain queries.
- Translation: Translate chunks into a different language before indexing.
- Formatting: Clean or reformat text in a specific way for your RAG pipeline.
### Example: Prepending Metadata to Each Chunk

```python
import safe_store

store = safe_store.SafeStore(db_path="processed_store.db")

def prepend_topic_processor(chunk_text: str, metadata: dict) -> str:
    """A processor that adds the 'topic' from metadata to the chunk text."""
    topic = metadata.get("topic", "general")
    return f"[Topic: {topic}] {chunk_text}"

with store:
    store.add_text(
        unique_id="processed_doc_1",
        text="This chunk is about quantum mechanics.",
        metadata={"topic": "Physics"},
        chunk_processor=prepend_topic_processor,
        force_reindex=True
    )

    # When you query this, the stored text will be:
    # "[Topic: Physics] This chunk is about quantum mechanics."
    # This can make the vector more specific to the topic.
    results = store.query("information related to physics", top_k=1)
    if results:
        print(results[0]['chunk_text'])

store.close()
```

This simple hook provides immense flexibility for customizing your data ingestion pipeline.
## Backup and Recovery

Because safe_store is built on a single, portable SQLite database file, ensuring the safety of your knowledge base is straightforward.

**Backup:**

To back up your entire store, simply make a copy of the main database file (e.g., `my_notes.db`). For a complete and safe backup, especially if the database might be in use, it's best to also copy the associated temporary files:

- `my_notes.db` (the main database file)
- `my_notes.db-shm` (the shared-memory file)
- `my_notes.db-wal` (the write-ahead log)
Copying these three files to a secure location (like a separate hard drive or a cloud storage folder) creates a complete snapshot of your store at that moment.
**Recovery:**

To recover from a backup, simply replace the corrupted or lost `.db`, `.db-shm`, and `.db-wal` files with the copies from your backup.
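A minimal sketch of both operations using plain file copies; it assumes the store is closed (or at least that no writer is active) while the files are copied, so the three files form a consistent snapshot:

```python
# A minimal backup/restore sketch using plain file copies.
# Close the SafeStore instance (or pause all writers) before copying.
import shutil
from pathlib import Path

SIDE_SUFFIXES = ["", "-shm", "-wal"]  # the SQLite WAL-mode file set

def backup_store(db_path: str, backup_dir: str) -> None:
    dest = Path(backup_dir)
    dest.mkdir(parents=True, exist_ok=True)
    for suffix in SIDE_SUFFIXES:
        src = Path(f"{db_path}{suffix}")
        if src.exists():  # -shm/-wal may be absent when the store is closed
            shutil.copy2(src, dest / src.name)

def restore_store(backup_dir: str, db_path: str) -> None:
    db_name = Path(db_path).name
    for suffix in SIDE_SUFFIXES:
        src = Path(backup_dir) / f"{db_name}{suffix}"
        if src.exists():
            shutil.copy2(src, Path(f"{db_path}{suffix}"))

backup_store("my_notes.db", "backups/latest")
# ...later, after data loss:
# restore_store("backups/latest", "my_notes.db")
```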
This file-based approach avoids the complexity of database dumps and restores, giving you a simple and robust way to protect your data.
## Quickstart: From Semantic Search to Knowledge Graph

This example shows the end-to-end workflow: indexing a document, then building and querying a knowledge graph of its instances using a simple string-based ontology.
```python
import safe_store
from safe_store import GraphStore, LogLevel
from lollms_client import LollmsClient
from pathlib import Path
import shutil

# --- 0. Configuration & Cleanup ---
DB_FILE = "quickstart.db"
DOC_DIR = Path("temp_docs_qs")
if DOC_DIR.exists():
    shutil.rmtree(DOC_DIR)
DOC_DIR.mkdir()
Path(DB_FILE).unlink(missing_ok=True)

# --- 1. LLM Executor & Sample Document ---
def llm_executor(prompt: str) -> str:
    try:
        client = LollmsClient()
        return client.generate_code(prompt, language="json", temperature=0.1) or ""
    except Exception as e:
        raise ConnectionError(f"LLM call failed: {e}")

doc_path = DOC_DIR / "doc.txt"
doc_path.write_text("Dr. Aris Thorne is the CEO of QuantumLeap AI, a firm in Geneva.")

# --- 2. Level 1: Semantic Search with SafeStore ---
print("--- LEVEL 1: SEMANTIC SEARCH ---")
store = safe_store.SafeStore(db_path=DB_FILE, vectorizer_name="st", log_level=LogLevel.INFO)
with store:
    store.add_document(doc_path)
    results = store.query("who leads the AI firm in Geneva?", top_k=1)
    print(f"Semantic search result: '{results[0]['chunk_text']}'")

# --- 3. Level 2: Knowledge Graph with GraphStore ---
print("\n--- LEVEL 2: KNOWLEDGE GRAPH ---")
ontology = "Extract People and Companies. A Person can be a CEO_OF a Company."
try:
    graph_store = GraphStore(store=store, llm_executor_callback=llm_executor, ontology=ontology)
    with graph_store:
        graph_store.build_graph_for_all_documents()
        graph_result = graph_store.query_graph("Who is the CEO of QuantumLeap AI?", output_mode="graph_only")
    print("Graph query result:")
    for rel in graph_result.get('relationships', []):
        source = rel['source_node']['properties'].get('identifying_value')
        target = rel['target_node']['properties'].get('identifying_value')
        print(f"- Relationship: '{source}' --[{rel['type']}]--> '{target}'")
except ConnectionError as e:
    print(f"[SKIP] GraphStore part failed: {e}")
store.close()
```

## Installation

```bash
pip install safe-store
```

Install optional dependencies for the features you need:

```bash
pip install safe-store[sentence-transformers]
pip install safe-store[openai,ollama,cohere]
pip install safe-store[parsing]
pip install safe-store[encryption]
pip install safe-store[all]
```
---
## π‘ API Highlights
#### `SafeStore` (The Foundation)
* `__init__(db_path, vectorizer_name, ...)`: Creates or loads a database. The vectorizer is locked in at creation.
* `add_document(path, ...)`: Parses, chunks, vectorizes, and stores a document or an entire folder.
* `query(query_text, top_k, ...)`: Performs a semantic search and returns the most relevant text chunks for your RAG pipeline.
* `get_chunk_by_id(chunk_id)`: Retrieves the full text and metadata for a specific chunk by its ID.
* `reconstruct_document_text(file_path)`: Reassembles and returns the full, original text of a document by joining its stored chunks.
* `export_point_cloud()`: Exports all vectors as a 2D point cloud for visualization, using PCA for dimensionality reduction.
#### `GraphStore` (The Intelligence Layer)
* `__init__(store, llm_executor_callback, ontology)`: Creates the graph manager on an existing `SafeStore` instance.
* `build_graph_for_all_documents()`: Scans documents and uses an LLM to build the knowledge graph based on your ontology.
* `query_graph(natural_language_query, ...)`: Translates a question into a graph traversal, returning nodes, relationships, and/or the original source text.
* `add_node(...)`, `add_relationship(...)`: Manually edit the graph to add your own expert knowledge.
---
## π€ Contributing & License
Contributions are highly welcome! Please open an issue to discuss a new feature or submit a pull request on [GitHub](https://github.com/ParisNeo/safe_store).
Licensed under Apache 2.0. See [LICENSE](LICENSE).