✅ Overview

Scimax VS Code includes a powerful database and search system that indexes your org, markdown, and Jupyter notebook files. The database provides:

Full-text search with FTS5 (SQLite's Full-Text Search) and BM25 ranking
Semantic search using vector embeddings for meaning-based queries
Hybrid search combining keyword and semantic approaches
Advanced search with query expansion, weighted RRF, and LLM reranking (SOTA)
Structured queries for headings, TODOs, tags, properties, and links
Agenda views for scheduled items and deadlines
Code block search filtered by programming language

The database is built on SQLite (via @libsql/client) with support for vector similarity search, making it both fast and capable of sophisticated semantic queries.

✅ What Gets Indexed

✅ File Types

The database automatically indexes three types of files:

`.org` files - Org-mode documents
`.md` files - Markdown documents
`.ipynb` files - Jupyter notebooks

✅ Indexed Content

For each file, the database extracts and indexes:

✅ Headings

Heading text and level (*, **, ***, etc.)
TODO states (TODO, DONE, IN-PROGRESS, etc.)
Priority markers ([#A], [#B], [#C])
Tags (both direct and inherited)
Properties (CUSTOM_ID, CATEGORY, etc.)
Scheduling information (SCHEDULED, DEADLINE, CLOSED)
Line numbers for navigation

✅ Source Blocks

Programming language
Complete code content
Header arguments (:results, :exports, etc.)
Line numbers
For notebooks: cell indices

✅ Links

Link type (file, http, https, id, etc.)
Target path or URL
Optional description text
Line number

✅ Hashtags

Inline hashtags (e.g., #research, #todo)
Associated file paths

✅ Full Text

Complete document content for full-text search
Indexed with FTS5 virtual tables
Porter stemming and Unicode normalization


✅ Text Chunks (for Semantic Search)

Document divided into ~2000 character chunks
3-line overlap between chunks for context
Vector embeddings (if embedding service configured)
Line ranges for each chunk
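
The chunking described above can be sketched as follows. This is a minimal illustration, not the extension's actual code: the function name chunkLines and the Chunk shape are hypothetical, and the sketch assumes chunks are built from whole lines, closed once they reach roughly 2000 characters, with the last 3 lines repeated at the start of the next chunk.

```typescript
// Hypothetical shape for one chunk; line numbers are 1-based and inclusive.
interface Chunk {
  content: string;
  lineStart: number;
  lineEnd: number;
}

// Split text into chunks of at most `maxChars` characters (single long lines
// excepted), built from whole lines, with `overlapLines` lines repeated at
// the start of the next chunk.
function chunkLines(text: string, maxChars = 2000, overlapLines = 3): Chunk[] {
  const lines = text.split("\n");
  const chunks: Chunk[] = [];
  let start = 0;
  while (start < lines.length) {
    let end = start;
    let size = lines[start].length;
    // Grow the chunk until adding the next line would exceed the budget.
    while (end + 1 < lines.length && size + 1 + lines[end + 1].length <= maxChars) {
      end += 1;
      size += 1 + lines[end].length;
    }
    chunks.push({
      content: lines.slice(start, end + 1).join("\n"),
      lineStart: start + 1,
      lineEnd: end + 1,
    });
    if (end === lines.length - 1) break;
    // Step back by the overlap, but always advance by at least one line.
    start = Math.max(end + 1 - overlapLines, start + 1);
  }
  return chunks;
}
```

Storing lineStart and lineEnd with each chunk is what lets a semantic search hit jump back to the right place in the file.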

✅ File Watching

The database automatically watches for file changes:

New files are indexed when created
Modified files are re-indexed (debounced with 500ms delay)
Deleted files are removed from the index
Changes are queued and processed sequentially
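
As a rough model of the debouncing above, the sketch below replays a list of change events and reports which re-index jobs would be queued. It assumes a trailing debounce (the timer restarts whenever the same file changes again before it fires); the names ChangeEvent and debounceChanges are hypothetical and the real watcher may behave differently.

```typescript
// One file-change event: which file changed and when (milliseconds).
interface ChangeEvent {
  path: string;
  timeMs: number;
}

// Return the re-index jobs a trailing debounce would queue. A file is indexed
// once its changes have been quiet for `delayMs`. Jobs are returned in the
// order they fire, which mirrors sequential processing of the queue.
function debounceChanges(events: ChangeEvent[], delayMs = 500): ChangeEvent[] {
  const sorted = events.slice().sort((a, b) => a.timeMs - b.timeMs);
  const jobs: ChangeEvent[] = [];
  const latestJob: Record<string, number> = Object.create(null); // path -> index into jobs
  for (const ev of sorted) {
    const idx = latestJob[ev.path];
    if (idx !== undefined && ev.timeMs < jobs[idx].timeMs) {
      // Another change arrived before the timer fired: restart the timer.
      jobs[idx].timeMs = ev.timeMs + delayMs;
    } else {
      latestJob[ev.path] = jobs.length;
      jobs.push({ path: ev.path, timeMs: ev.timeMs + delayMs });
    }
  }
  return jobs.sort((a, b) => a.timeMs - b.timeMs);
}
```

Saving the same file twice within half a second therefore produces one re-index, not two.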

✅ Ignore Patterns

By default, the following patterns are ignored:

**/node_modules/**
**/.git/**
**/dist/**
**/build/**
**/.ipynb_checkpoints/**

Configure additional patterns in settings: scimax.db.exclude

Open Scimax Settings: [[cmd:workbench.action.openSettings2]]

✅ Full-Text Search

✅ Overview

Full-text search uses SQLite's FTS5 (Full-Text Search version 5) with:

BM25 ranking - Industry-standard relevance scoring
Porter stemming - Matches word variations (e.g., "run" matches "running")
Unicode normalization - Handles accented characters correctly
Snippet generation - Shows matching context with highlighting

✅ Usage

✅ Command

Run Scimax: Search All Files (FTS5) or use command scimax.db.search

[[cmd:scimax.db.search]]

✅ Query Syntax

FTS5 supports rich query syntax:

✅ Basic Queries

machine learning          # Match both words (in any order)
"machine learning"        # Match exact phrase
neural OR artificial      # Match either word

Boolean Operators

python AND jupyter        # Must contain both
python NOT tensorflow     # Contains python but not tensorflow
deep OR machine learning  # Contains "deep", or both "machine" and "learning" (quote the phrase to match it exactly)

Prefix Matching

comput*                   # Matches: computer, computing, computation
data*                     # Matches: data, database, dataset

Column Queries

title: introduction       # Search only in titles
content: python           # Search only in content

Proximity Queries

NEAR(neural network, 5)   # Words within 5 tokens of each other

Results

Search results include:

File path and basename
Line number
Preview snippet with <mark> tags highlighting matches
BM25 relevance score
Up to 100 results (configurable)

Example Searches

# Find all references to the phrase "gradient descent"
"gradient descent"

# Find Python-related TODO items
TODO python

# Find documents about data science or machine learning
"data science" OR "machine learning"

# Find mentions of TensorFlow in documents that do not mention Keras
tensorflow NOT keras

# Find all documents with "introduction" in title
title: introduction

✅ Semantic Search

✅ Overview

Semantic search finds content by meaning rather than exact words. It uses vector embeddings to represent text in a high-dimensional space where semantically similar content is close together.

Benefits:

Find content even when using different words
Discover related concepts
More natural query language

Example: Searching for "machine learning algorithms" will also find documents about "neural networks", "deep learning models", and "classification methods" even if they don't contain those exact words.

✅ Requirements

Semantic search requires:

Vector search support in libsql - The database must support vector operations
Embedding provider configured - Ollama must be running locally
Files indexed with embeddings - Run scimax.db.reindex after configuring

Checking Availability

Run Scimax: Show Database Stats (scimax.db.stats) to see:

Whether vector search is supported
Number of embeddings stored
Any error messages

If vector search is unavailable, the stats will show:

Semantic search: Unavailable (vector search not supported)

Fallback to Full-Text Search

If semantic search is unavailable, use full-text search (`scimax.db.search`) instead. FTS5 is always available and very fast for keyword-based queries.

Embedding Provider

Scimax uses Ollama for embeddings:

Ollama

Pros: Free, private, local control, high quality
Cons: Requires Ollama installed and running
Models: nomic-embed-text (the default)
Setup: Install Ollama, then ollama pull nomic-embed-text

✅ Configuration

✅ Interactive Setup

Run Scimax: Configure Embedding Service (scimax.db.configureEmbeddings)

This wizard will:

Let you choose a provider
Select a model
Test the connection
Update your settings
Prompt you to reindex files

✅ Manual Configuration

Add to your VS Code settings (settings.json):

Ollama Configuration

{
  "scimax.db.embeddingProvider": "ollama",
  "scimax.db.ollamaUrl": "http://localhost:11434",
  "scimax.db.ollamaModel": "nomic-embed-text"
}

✅ Usage

✅ Command

Run Scimax: Semantic Search or use command scimax.db.searchSemantic

✅ Query Examples

Semantic search uses natural language:

# Conceptual searches
explain neural network architecture
how to optimize database queries
project management best practices

# Related concept discovery
clustering algorithms    # Finds: k-means, hierarchical, DBSCAN
data visualization       # Finds: matplotlib, charts, plots, graphs

# Question answering
what is gradient descent
how does attention mechanism work

Results

Results include:

File path and line number
Preview (first 200 characters of chunk)
Similarity score (0-100%, higher is more relevant)
Cosine distance (lower is more similar)
Up to 20 results by default
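
The relationship between the two numbers in the results can be sketched as below. Cosine distance is the standard definition; the mapping to a percentage (similarity = 1 - distance, clamped to 0-100%) is an assumption about how the score is displayed, and the function names are hypothetical.

```typescript
// Cosine distance between two equal-length vectors: 0 means same direction,
// 1 means orthogonal, 2 means opposite.
function cosineDistance(a: number[], b: number[]): number {
  if (a.length !== b.length || a.length === 0) {
    throw new Error("vectors must be non-empty and the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) throw new Error("zero vector has no direction");
  return 1 - dot / Math.sqrt(normA * normB);
}

// Assumed display mapping: anything at or beyond orthogonal shows as 0%.
function similarityPercent(distance: number): number {
  const similarity = Math.max(0, Math.min(1, 1 - distance));
  return Math.round(similarity * 100);
}
```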

Reindexing for Semantic Search

After configuring an embedding provider, you must reindex your files:

Run Scimax: Reindex Files (scimax.db.reindex)
Wait for indexing to complete
The database will generate embeddings for all text chunks
Semantic search will now be available

Hybrid Search

Overview

Hybrid search combines full-text (keyword) and semantic (vector) search using Reciprocal Rank Fusion (RRF). This approach:

Gets the best of both worlds
Balances exact matches with conceptual similarity
Provides more robust results

Usage

Command

Run Scimax: Hybrid Search or use command scimax.db.searchHybrid

How It Works

Query runs through both FTS5 and vector search
Results from each are ranked
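RRF algorithm combines rankings: each result receives weight / (k + rank) from every list it appears in, and these contributions are summed (k is the RRF constant, 60 by default)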
Final results sorted by combined score

Weights

Default weights are 50/50, but can be adjusted:

// Internal API (for extension developers)
db.searchHybrid(query, {
  limit: 20,
  ftsWeight: 0.5,      // 50% weight on keywords
  vectorWeight: 0.5    // 50% weight on semantics
});
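
For readers who want to see the fusion step itself, here is a minimal sketch of weighted Reciprocal Rank Fusion. It uses the standard RRF formula with the k = 60 constant mentioned in the settings; the function name weightedRrf and its input shape are hypothetical and are not the extension's internal API.

```typescript
// Fuse ranked lists of result ids with weighted Reciprocal Rank Fusion.
// Each list contributes weight / (k + rank) for every id it contains,
// where rank starts at 1 for the best result.
function weightedRrf(
  lists: { ids: string[]; weight: number }[],
  k = 60
): { id: string; score: number }[] {
  const scores: Record<string, number> = Object.create(null);
  for (const list of lists) {
    list.ids.forEach((id, index) => {
      scores[id] = (scores[id] || 0) + list.weight / (k + index + 1);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

// Example: "b" appears in both lists, so it outranks "a" even though
// "a" is the top keyword hit.
const fused = weightedRrf([
  { ids: ["a", "b", "c"], weight: 0.5 }, // keyword (FTS5) ranking
  { ids: ["b", "d"], weight: 0.5 },      // vector ranking
]);
```

Because only ranks are used, BM25 scores and cosine distances never need to be on the same scale for this step.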

When to Use Hybrid

Default choice for general searches
When you want both exact matches and related concepts
When query has both specific terms and broad concepts
Example: "python sklearn classification accuracy" - finds both sklearn-specific code and general ML accuracy discussions

Results

Results include:

Combined ranking score
Source indicator (Keywords, AI, or both)
File location and preview
Up to 20 results

✅ Advanced Search (SOTA Pipeline)

Overview

Advanced search implements a state-of-the-art (SOTA) search pipeline inspired by modern search engines like qmd. It combines multiple techniques for maximum recall and precision:

Query Expansion - Generates alternative query formulations
Parallel Retrieval - Runs FTS5 and vector search concurrently
Weighted Reciprocal Rank Fusion (RRF) - Combines results intelligently
LLM Reranking - Uses AI to improve final ranking (optional)

Usage

Command

Run Scimax: Advanced Search ([[cmd:scimax.db.searchAdvanced]])

When to Use

Complex research queries
When you want maximum recall
When hybrid search isn't finding relevant content
For important searches where accuracy matters more than speed

Query Expansion

Query expansion improves recall by generating alternative formulations of your query.

Pseudo-Relevance Feedback (PRF)

How it works:

Runs initial search with original query
Extracts key terms from top 5 results
Creates expanded query with additional terms
Searches again with expanded query

Example:

Original: "machine learning"
Expanded: "machine learning neural network training models"
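
A minimal sketch of the PRF steps above, assuming terms are picked by raw frequency in the top results. The stopword list, the minimum term length, and the name expandQueryPrf are illustrative assumptions; the extension's term selection may be more sophisticated.

```typescript
// Words too common to be useful expansion terms (a tiny illustrative list).
const STOPWORDS = ["the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "with"];

// Build an expanded query from the text of the top-ranked results:
// count term frequencies, drop stopwords and terms already in the query,
// then append the `termCount` most frequent terms.
function expandQueryPrf(query: string, topDocs: string[], termCount = 5): string {
  const tokenize = (s: string): string[] => s.toLowerCase().match(/[a-z0-9]+/g) || [];
  const queryTerms = tokenize(query);
  const counts: Record<string, number> = Object.create(null);
  for (const doc of topDocs) {
    for (const term of tokenize(doc)) {
      if (term.length < 3) continue;
      if (STOPWORDS.indexOf(term) >= 0 || queryTerms.indexOf(term) >= 0) continue;
      counts[term] = (counts[term] || 0) + 1;
    }
  }
  const extra = Object.keys(counts)
    .sort((a, b) => counts[b] - counts[a] || a.localeCompare(b)) // ties: alphabetical
    .slice(0, termCount);
  return [query].concat(extra).join(" ");
}
```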

LLM Query Expansion

How it works:

Sends query to LLM (e.g., qwen3:1.7b)
LLM generates 3 alternative phrasings
All variants searched in parallel
Original query gets 2× weight in ranking

Example:

Original: "database optimization"
Variants:
  - "improve SQL query performance"
  - "speed up database queries"
  - "index tuning for databases"

Weighted RRF

Standard RRF assigns scores based on rank position. Advanced search enhances this with:

Position Bonuses

Top-ranked results get bonuses:

Rank 1: +15% bonus
Rank 2: +10% bonus
Rank 3: +5% bonus

Original Query Weight

The original query's results receive 2× weight compared to expanded query results.
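
The two adjustments can be combined into a single per-result contribution, sketched below. How the bonus is applied is an assumption: here it multiplies the standard RRF term, and the names positionBonus and rrfContribution are hypothetical.

```typescript
// Multiplicative bonus for the top three ranks (1-based); other ranks get none.
function positionBonus(rank: number): number {
  if (rank === 1) return 1.15;
  if (rank === 2) return 1.10;
  if (rank === 3) return 1.05;
  return 1.0;
}

// Contribution of one result at `rank` in one ranked list.
// Lists produced by the original query count twice as much as expanded ones.
function rrfContribution(rank: number, fromOriginalQuery: boolean, k = 60): number {
  const queryWeight = fromOriginalQuery ? 2 : 1;
  return (queryWeight * positionBonus(rank)) / (k + rank);
}
```

A result's final score is the sum of these contributions over every list (original and expanded, keyword and vector) in which it appears.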

Score Normalization

Different backends produce different score ranges:

BM25: FTS5 reports negative values, where more negative means a better match (normalized to 0-1)
Vector: Cosine distance (converted to similarity)
All scores normalized before fusion
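
One plausible way to bring BM25 scores onto a 0-1 scale is min-max normalization after flipping the sign, sketched below. The choice of min-max scaling and the name normalizeBm25 are assumptions; the source only states that scores are normalized to 0-1.

```typescript
// Min-max normalize FTS5 BM25 scores to 0-1. FTS5 reports better matches as
// more negative numbers, so the sign is flipped first.
function normalizeBm25(scores: number[]): number[] {
  const flipped = scores.map((s) => -s);
  const min = Math.min(...flipped);
  const max = Math.max(...flipped);
  // If every score is equal there is nothing to spread out.
  if (max === min) return flipped.map(() => 1);
  return flipped.map((s) => (s - min) / (max - min));
}
```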

LLM Reranking

How It Works

Takes top 30 candidates from RRF fusion
LLM scores each document's relevance (0-10)
Scores blended with retrieval scores using position-aware weights

Position-Aware Blending

High-confidence retrieval matches are preserved:

Ranks 1-3: 75% retrieval, 25% reranker
Ranks 4-10: 60% retrieval, 40% reranker
Ranks 11+: 40% retrieval, 60% reranker
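
The blending table translates directly into a small function. This sketch assumes the retrieval score has already been normalized to 0-1 and that the LLM's 0-10 score is divided by 10 before blending; both function names are hypothetical.

```typescript
// Fraction of the final score that comes from retrieval, by 1-based rank.
function retrievalShare(rank: number): number {
  if (rank <= 3) return 0.75;
  if (rank <= 10) return 0.60;
  return 0.40;
}

// Blend a retrieval score (0-1) with an LLM relevance score (0-10).
function blendScore(rank: number, retrievalScore: number, llmScore: number): number {
  const share = retrievalShare(rank);
  return share * retrievalScore + (1 - share) * (llmScore / 10);
}
```

The effect is that the reranker can reorder the long tail freely but needs a strong signal to displace one of the top three retrieval hits.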

Performance Considerations

Reranking adds latency (~1-2 seconds for 30 documents). Disable for fast searches.

Capabilities Check

Run Scimax: Show Search Capabilities (scimax.db.searchCapabilities) to see:

✓ Full-Text Search (FTS5/BM25) - Available
✓ Semantic/Vector Search - Available (Ollama)
✓ Query Expansion (PRF) - Available (no LLM required)
✗ Query Expansion (LLM) - Unavailable - check Ollama
✗ LLM Reranking - Unavailable - pull qwen3:0.6b

Graceful Degradation

Advanced search works without all features:

| If Unavailable   | Fallback                    |
|------------------+-----------------------------|
| Vector search    | FTS-only                    |
| LLM expansion    | PRF-only                    |
| Reranking        | Skip reranking              |
| All LLM features | Equivalent to hybrid search |

Configuration

Search Mode

{
  "scimax.search.defaultMode": "hybrid",  // or "fast", "semantic", "advanced"
  "scimax.search.defaultLimit": 20
}

Query Expansion

{
  "scimax.search.queryExpansion.enabled": true,
  "scimax.search.queryExpansion.method": "prf",  // or "llm", "both"
  "scimax.search.queryExpansion.prfTopK": 5,
  "scimax.search.queryExpansion.prfTermCount": 5,
  "scimax.search.queryExpansion.llmModel": "qwen3:1.7b"
}

Reranking

{
  "scimax.search.reranking.enabled": false,  // Enable for better accuracy
  "scimax.search.reranking.model": "qwen3:0.6b",
  "scimax.search.reranking.topK": 30,
  "scimax.search.reranking.usePositionBlending": true
}

Hybrid Weights

{
  "scimax.search.hybrid.ftsWeight": 0.5,
  "scimax.search.hybrid.vectorWeight": 0.5,
  "scimax.search.hybrid.usePositionBonus": true,
  "scimax.search.hybrid.k": 60  // RRF constant
}

Caching

{
  "scimax.search.caching.enabled": true,
  "scimax.search.caching.ttlSeconds": 900,  // 15 minutes
  "scimax.search.caching.maxEntries": 500
}
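
To make the two caching settings concrete, here is a minimal sketch of a result cache with a time-to-live and a size cap. It illustrates the idea only: the class name QueryCache and the oldest-first eviction policy are assumptions, not the extension's implementation.

```typescript
// A small TTL cache with a size cap. `now` is injectable so the sketch can be
// exercised without waiting for real time to pass.
class QueryCache<T> {
  private entries: { key: string; value: T; expiresAt: number }[] = [];
  private ttlMs: number;
  private maxEntries: number;
  private now: () => number;

  constructor(ttlSeconds = 900, maxEntries = 500, now: () => number = () => Date.now()) {
    this.ttlMs = ttlSeconds * 1000;
    this.maxEntries = maxEntries;
    this.now = now;
  }

  get(key: string): T | undefined {
    const t = this.now();
    // Drop expired entries before looking anything up.
    this.entries = this.entries.filter((e) => e.expiresAt > t);
    const hit = this.entries.filter((e) => e.key === key)[0];
    return hit ? hit.value : undefined;
  }

  set(key: string, value: T): void {
    this.entries = this.entries.filter((e) => e.key !== key);
    this.entries.push({ key, value, expiresAt: this.now() + this.ttlMs });
    // Evict the oldest entries once the cap is exceeded.
    while (this.entries.length > this.maxEntries) this.entries.shift();
  }
}
```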

Setting Up LLM Features

To enable query expansion and reranking with Ollama:

Install Ollama: https://ollama.ai
Start Ollama: ollama serve
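Pull models: ollama pull qwen3:1.7b (query expansion) and ollama pull qwen3:0.6b (reranking)
Enable in settings: set scimax.search.queryExpansion.method to "llm" or "both", and scimax.search.reranking.enabled to true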

Performance Comparison

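| Mode     | Speed  | Recall  | Precision | When to Use        |
|----------+--------+---------+-----------+--------------------|
| Fast     | <50ms  | Low     | High      | Exact matches      |
| Semantic | ~200ms | High    | Medium    | Conceptual queries |
| Hybrid   | ~300ms | High    | High      | General purpose    |
| Advanced | 1-3s   | Highest | Highest   | Important searches |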

Structured Queries

Beyond free-text search, the database supports structured queries for specific content types.

Heading Search

Search specifically in headings, with optional filtering:

Command

Scimax: Search Headings (scimax.db.searchHeadings)

Features

Search by heading text
Filter by TODO state
Filter by tag
Shows heading level, tags, TODO state
Displays deadlines and scheduled dates

Example Use Cases

# Find all headings about Python
python

# Find TODOs with specific tag (use Tag Search)
:work:

# Browse document structure
<empty query>

Tag Search

Search headings by org-mode tags:

Command

Scimax: Search By Tag (scimax.db.searchByTag)

Features

Lists all tags found in indexed files
Shows heading count per tag
Supports both direct and inherited tags
Tags displayed as :tagname:

Property Search

Search headings by property drawer values:

Command

Scimax: Search By Property (scimax.db.searchByProperty)

Common Properties

CUSTOM_ID    # Unique identifiers for linking
ID           # Auto-generated UUIDs
CATEGORY     # Classification
CREATED      # Creation timestamp
MODIFIED     # Last modification time

Examples

# Find all entries with CATEGORY property
Property: CATEGORY
Value: <empty>

# Find entries with specific CATEGORY
Property: CATEGORY
Value: research

# Find entries with CUSTOM_ID
Property: CUSTOM_ID
Value: <empty or specific>

TODO Search

Browse and filter TODO items:

Command

Scimax: Show TODOs (scimax.db.showTodos)

Features

Lists all TODO items across workspace
Filter by state (TODO, DONE, IN-PROGRESS, etc.)
Shows priority, tags, scheduling
Other views (agenda, deadlines) exclude DONE and CANCELLED items by default

Common TODO States

TODO          # Not started
IN-PROGRESS   # Currently working on
NEXT          # Up next
WAIT          # Waiting on something
DONE          # Completed
CANCELLED     # Abandoned

✅ Source Block Search

Search code blocks by language:

✅ Command

Scimax: Search Code Blocks (scimax.db.searchBlocks). [[cmd:scimax.db.searchBlocks]]

✅ Features

Filter by programming language
Optional text search within code
Shows first line of code as preview
Includes both org files and notebook cells

✅ Example Workflow

1. Select language (or "All languages")
2. Optionally enter search text
3. Browse matching code blocks
4. Jump to file location

✅ Hashtag Search

Find files by inline hashtags:

✅ Command

Scimax: Search Hashtags (scimax.db.searchHashtags). [[cmd:scimax.db.searchHashtags]]

✅ Features

Lists all hashtags found (`#tagname`)
Shows file count per hashtag
Displays files containing selected hashtag
Case-insensitive matching

✅ Hashtag Format

# In your documents
This is a #research note about #machinelearning

# Database indexes as
research
machinelearning
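
The extraction shown above can be approximated with a regular expression. This sketch assumes a hashtag must follow whitespace or start a line, must begin with a letter, and is stored in lowercase to support case-insensitive matching; the exact rules used by the indexer may differ, and extractHashtags is a hypothetical name.

```typescript
// Collect unique hashtags from text, lowercased and without the leading "#".
function extractHashtags(text: string): string[] {
  const tags: string[] = [];
  // "#" must follow whitespace or start a line, which skips org priority
  // cookies such as [#A] and URLs with fragments.
  const pattern = /(?:^|\s)#([A-Za-z][\w-]*)/gm;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    const tag = match[1].toLowerCase();
    if (tags.indexOf(tag) < 0) tags.push(tag);
  }
  return tags;
}
```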

✅ File Browser

Browse all indexed files:

✅ Command

Scimax: Browse Indexed Files (scimax.db.browseFiles). [[cmd:scimax.db.browseFiles]]

✅ Features

Lists all files in database
Shows last indexed date
Sorted by most recently indexed
Displays file type (org, md, ipynb)

✅ Agenda and Time Management

✅ Agenda View

The agenda shows scheduled items and deadlines:

✅ Command

Scimax: Show Agenda (scimax.db.agenda). [[cmd:scimax.db.agenda]]

✅ Time Periods

Next 2 weeks (default)
Next month
Next 3 months
All items (no time limit)

✅ Options

Include unscheduled TODOs
Filter by time range
Sorted by urgency (overdue first)

✅ Item Types

✅ Deadline

Items with DEADLINE timestamps:

* TODO Submit report
DEADLINE: <2026-01-20 Tue>

✅ Scheduled

Items with SCHEDULED timestamps:

* TODO Team meeting
SCHEDULED: <2026-01-15 Thu 14:00>

✅ Unscheduled TODOs

Items with TODO state but no scheduling

✅ Deadline View

Show only upcoming deadlines:

Command

Scimax: Show Deadlines (scimax.db.deadlines). [[cmd:scimax.db.deadlines]]

✅ Features

Next 2 weeks of deadlines
Overdue items highlighted
Shows days until deadline
Excludes DONE and CANCELLED

Display Format

⚠️  Overdue: Submit TPS Report (3 days ago)
🔔 Today: Code Review
🔔 Tomorrow: Documentation Update
🔔 In 5 days: Project Demo
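
The relative labels in this display can be derived from the deadline date and the current date. The sketch below is an illustration under the assumption that both are plain YYYY-MM-DD strings; deadlineLabel is a hypothetical helper, not the extension's formatter.

```typescript
// Label a deadline relative to `today`. Both dates are "YYYY-MM-DD" strings;
// parsing them as UTC avoids off-by-one errors around daylight saving changes.
function deadlineLabel(deadline: string, today: string): string {
  const dayMs = 24 * 60 * 60 * 1000;
  const diffMs = Date.parse(deadline + "T00:00:00Z") - Date.parse(today + "T00:00:00Z");
  const days = Math.round(diffMs / dayMs);
  if (days < 0) return `Overdue (${-days} day${days === -1 ? "" : "s"} ago)`;
  if (days === 0) return "Today";
  if (days === 1) return "Tomorrow";
  return `In ${days} days`;
}
```

Sorting items by the same day difference, smallest first, puts overdue items at the top, which matches the urgency ordering described for the agenda.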

Date Formats

Scheduling in Org Files

# Simple date
SCHEDULED: <2026-01-20>

# Date with time
SCHEDULED: <2026-01-20 Tue 14:00>

# Date with time range
SCHEDULED: <2026-01-20 Tue 14:00-16:00>

# Deadline with warning period
DEADLINE: <2026-01-20 Tue -3d>

# Closed timestamp
CLOSED: [2026-01-13 Tue 10:30]

Relative Dates

+2w    # 2 weeks from now
+1m    # 1 month from now
+3d    # 3 days from now
+1y    # 1 year from now
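
A relative date of this form can be resolved against a base date as sketched below. The function name resolveRelativeDate is hypothetical, and the sketch assumes only the four units listed above, each with a leading "+".

```typescript
// Resolve a relative date such as "+2w" against a base date (in UTC).
// Supported units: d (days), w (weeks), m (months), y (years).
function resolveRelativeDate(spec: string, base: Date): Date {
  const match = /^\+(\d+)([dwmy])$/.exec(spec);
  if (!match) throw new Error(`unsupported relative date: ${spec}`);
  const amount = parseInt(match[1], 10);
  const result = new Date(base.getTime()); // copy so the base is not modified
  switch (match[2]) {
    case "d":
      result.setUTCDate(result.getUTCDate() + amount);
      break;
    case "w":
      result.setUTCDate(result.getUTCDate() + 7 * amount);
      break;
    case "m":
      result.setUTCMonth(result.getUTCMonth() + amount);
      break;
    case "y":
      result.setUTCFullYear(result.getUTCFullYear() + amount);
      break;
  }
  return result;
}
```

Note that month arithmetic follows JavaScript's Date rules, so adding one month to January 31 rolls over into March.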

Search Scope

Limit searches to specific directories:

Commands

Scimax: Set Search Scope (scimax.db.setScope)

Scope Types

All Files (Default)

Searches entire indexed database
Includes all workspace folders
Includes additional configured directories

Current Directory

Limits to active file's directory
Includes subdirectories
Useful for project-focused searches

Current Scope Indicator

The current scope is shown when setting scope:

Search scope: all
Search scope: directory (my-project)

Database Management

Reindexing

Full Reindex

Command: Scimax: Reindex Files (scimax.db.reindex). [[cmd:scimax.db.reindex]]

Scans all workspace folders
Checks file modification times
Only reindexes changed files
Shows progress notification
Reports statistics on completion
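
The incremental check can be pictured as a comparison of modification times. This is a simplified sketch: it assumes the decision is based on mtime alone, and filesToReindex with its two path-to-mtime maps is a hypothetical interface.

```typescript
// Compare modification times on disk with those recorded at index time.
// Both maps go from file path to mtime in milliseconds.
function filesToReindex(
  onDisk: Record<string, number>,
  indexed: Record<string, number>
): string[] {
  return Object.keys(onDisk)
    // New files have no record; changed files have a newer mtime on disk.
    .filter((path) => !(path in indexed) || onDisk[path] > indexed[path])
    .sort();
}
```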

✅ Auto-Indexing

{
  "scimax.db.autoIndex": true
}

Warning: Disable for very large workspaces (>10,000 files) to prevent memory issues.

✅ Indexing Sources

By default, the database indexes:

Journal directory (scimax.db.includeJournal: true)
Workspace folders (scimax.db.includeWorkspace: true)
Scimax projects (scimax.db.includeProjects: true)

Add additional directories with scimax.db.include:

{
  "scimax.db.include": [
    "/home/user/research",
    "/home/user/notes",
    "~/Documents/org"
  ]
}

Optimization

✅ Command

Scimax: Optimize Database (scimax.db.optimize). [[cmd:scimax.db.optimize]]

Operations

Removes entries for deleted files
Runs VACUUM to reclaim space
Rebuilds indexes for performance
Should be run periodically (monthly)

Clearing Database

Command

Scimax: Clear Database (scimax.db.clear)

Warning

This is destructive and requires confirmation:

Removes all indexed data
Clears embeddings
Resets statistics
Requires full reindex to restore

When to Clear

Database corruption
Major schema changes
Troubleshooting issues
Fresh start needed

Statistics

Command

Scimax: Show Database Stats (scimax.db.stats)

Information Displayed

Scimax DB: 127 files (98 org, 23 md, 6 ipynb),
1,234 headings, 456 code blocks, 789 links.
Semantic search: Enabled (243 chunks).
Last indexed: 2026-01-13 14:30:00

Stats Include

File count by type
Heading count
Code block count
Link count
Chunk count (for semantic search)
Embedding status
Last index timestamp

Performance Considerations

Indexing Performance

File Size

Small files (<100KB): ~10-50ms
Medium files (100KB-1MB): ~50-200ms
Large files (>1MB): ~200ms-1s

Batch Indexing

100 small files: ~2-5 seconds
1,000 small files: ~20-60 seconds
With embeddings: 2-5x slower

Optimization Tips

Use ignore patterns for large non-content directories
Disable auto-indexing for huge workspaces
Index incrementally (only changed files)
Run optimization monthly

Search Performance

Full-Text Search (FTS5)

Query time: 10-50ms (typical)
Scales well to 10,000+ files
BM25 scoring is highly optimized
Results returned in rank order

Semantic Search

Query time: 50-500ms depending on the embedding model and hardware
Ollama runs locally, so queries stay private and need no network round trip

Hybrid Search

Query time: Combined FTS + vector time
Typically 100-600ms
Runs searches in parallel
RRF fusion adds ~10ms

Database Size

Typical Sizes

100 files:      ~5-10 MB
1,000 files:    ~50-100 MB
10,000 files:   ~500 MB-1 GB

With Embeddings

+50-100% size increase for chunks and vectors
384-dim embeddings: ~1.5 KB per chunk
768-dim embeddings: ~3 KB per chunk
1536-dim embeddings: ~6 KB per chunk
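
The per-chunk figures follow from storing each dimension as a 4-byte float (the F32_BLOB format). A small sketch of the arithmetic, with hypothetical function names, ignoring index and row overhead:

```typescript
// Raw storage for one float32 embedding: 4 bytes per dimension.
function embeddingBytes(dimensions: number): number {
  return dimensions * 4;
}

// Estimated raw vector storage in megabytes for a number of chunks.
function embeddingStorageMb(chunks: number, dimensions: number): number {
  return (chunks * embeddingBytes(dimensions)) / (1024 * 1024);
}
```

For example, 768 dimensions come to 3072 bytes (3 KB) per chunk, so 1,024 chunks need about 3 MB for the vectors alone.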

Memory Usage

Indexing

Base: ~50-100 MB
Peak during large batch: ~200-500 MB
Embedding generation: +100-300 MB

Searching

FTS5: ~10-50 MB
Vector search: ~50-200 MB (loads embeddings)
Minimal memory footprint when idle

Scaling Guidelines

Small Workspace (<100 files)

Enable auto-indexing
Use any embedding provider
Full reindex in seconds

Medium Workspace (100-1,000 files)

Enable auto-indexing
Local or Ollama embeddings recommended
Full reindex in under a minute

Large Workspace (1,000-10,000 files)

Consider disabling auto-indexing
Ollama embeddings recommended
Reindex incrementally

Very Large Workspace (>10,000 files)

Disable auto-indexing (manual reindex)
Use selective directory indexing
Consider multiple smaller databases
Ollama with a fast model recommended

Configuration Reference

Database Settings

`scimax.db.includeJournal`

Type: boolean Default: true

Include journal directory in database indexing.

`scimax.db.includeWorkspace`

Type: boolean Default: true

Include workspace folders in database indexing.

`scimax.db.includeProjects`

Type: boolean Default: true

Include all scimax projects in database indexing.

`scimax.db.include`

Type: string[] Default: []

Additional directories or files to index (supports ~ for home directory).

{
  "scimax.db.include": [
    "/home/user/notes",
    "~/Documents/research"
  ]
}

`scimax.db.exclude`

Type: string[] Default: ["**/node_modules/**", "**/.git/**", "**/dist/**", "**/build/**"]

Patterns or paths to exclude from indexing (globs and absolute paths).

{
  "scimax.db.exclude": [
    "**/node_modules/**",
    "**/.git/**",
    "**/dist/**",
    "**/temp/**",
    "**/*.backup.org",
    "~/notes/scratch.org"
  ]
}

`scimax.db.autoIndex`

Type: boolean Default: false

Automatically index workspace on activation. Disable for large workspaces.

{
  "scimax.db.autoIndex": true
}

Embedding Settings

`scimax.db.embeddingProvider`

Type: enum Values: "none" | "ollama" Default: "ollama"

Embedding provider for semantic search.

{
  "scimax.db.embeddingProvider": "ollama"
}

`scimax.db.ollamaUrl`

Type: string Default: "http://localhost:11434"

Ollama server URL.

{
  "scimax.db.ollamaUrl": "http://localhost:11434"
}

`scimax.db.ollamaModel`

Type: string Default: "nomic-embed-text"

Ollama embedding model name.

{
  "scimax.db.ollamaModel": "nomic-embed-text"
}

Command Reference

Search Commands

| Command                      | Description                     |
|------------------------------+---------------------------------|
| scimax.db.search             | Full-text search (FTS5)         |
| scimax.db.searchSemantic     | Semantic search (vector)        |
| scimax.db.searchHybrid       | Hybrid search (FTS + vector)    |
| scimax.db.searchAdvanced     | Advanced search (full pipeline) |
| scimax.db.searchCapabilities | Show search capabilities        |
| scimax.db.searchHeadings     | Search headings                 |
| scimax.db.searchByTag        | Search by org tag               |
| scimax.db.searchByProperty   | Search by property value        |
| scimax.db.searchBlocks       | Search code blocks              |
| scimax.db.searchHashtags     | Search by hashtag               |

View Commands

| Command               | Description             |
|-----------------------+-------------------------|
| scimax.db.showTodos   | Show TODO items         |
| scimax.db.agenda      | Show agenda             |
| scimax.db.deadlines   | Show upcoming deadlines |
| scimax.db.browseFiles | Browse indexed files    |

Management Commands

| Command                       | Description                 |
|-------------------------------+-----------------------------|
| scimax.db.reindex             | Reindex all files           |
| scimax.db.optimize            | Optimize database           |
| scimax.db.clear               | Clear database              |
| scimax.db.stats               | Show database statistics    |
| scimax.db.setScope            | Set search scope            |
| scimax.db.configureEmbeddings | Configure embedding service |
| scimax.db.backup              | Backup database to file     |
| scimax.db.restore             | Restore database from file  |
| scimax.db.rebuild             | Rebuild database completely |
| scimax.db.verify              | Verify database integrity   |

✅ Database Maintenance

✅ Backup and Restore

The database can be backed up and restored to prevent data loss and enable migration between machines.

✅ Backup

Command: Scimax: Backup Database (scimax.db.backup)

Creates a portable backup file containing:

All indexed file paths (not file contents)
Project information
Database metadata

Backup is stored in JSON format for portability.

# Example backup location
~/.scimax/backup-2026-01-22.json

✅ Restore

Command: Scimax: Restore Database (scimax.db.restore)

Restores database from a backup file:

Imports project list
Queues files for reindexing
Preserves original creation timestamps

Note: Actual file content must still be reindexed after restore.

✅ Database Rebuild

Command: Scimax: Rebuild Database (scimax.db.rebuild)

Completely rebuilds the database from scratch:

Drops and recreates all tables
Re-scans all configured directories
Regenerates all indexes
Regenerates embeddings (if configured)

Use when:

Database appears corrupted
Major schema changes after update
Switching embedding providers
Performance issues after many incremental updates

Options

| Option        | Description                   |
|---------------+-------------------------------|
| Full rebuild  | Complete reindex of all files |
| Projects only | Only rebuild project table    |

✅ Database Verification

Command: Scimax: Verify Database (scimax.db.verify)

Checks database integrity and freshness:

Checks Performed

File existence - Verifies indexed files still exist on disk
Modification time - Detects files modified since indexing
Index integrity - Validates FTS5 and vector indexes
Project validity - Checks project directories exist

Result Format

Database Verification Results:
- Total files: 127
- Missing files: 2
- Stale files: 5
- Projects: 8 (7 valid, 1 missing)
- Status: NEEDS_REINDEX

Status Values

| Status        | Meaning                         |
|---------------+---------------------------------|
| OK            | Database is current and valid   |
| NEEDS_REINDEX | Some files are stale or missing |
| CORRUPTED     | Index integrity check failed    |
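
How the three statuses relate to the individual checks can be sketched as follows. The precedence (integrity failure first, then missing or stale files) is an assumption inferred from the status meanings; VerifyCounts and verifyStatus are hypothetical names.

```typescript
type VerifyStatus = "OK" | "NEEDS_REINDEX" | "CORRUPTED";

// Hypothetical summary of the verification checks.
interface VerifyCounts {
  missingFiles: number;
  staleFiles: number;
  indexIntegrityOk: boolean;
}

// Integrity failures take precedence; otherwise any missing or stale file
// means a reindex is needed.
function verifyStatus(counts: VerifyCounts): VerifyStatus {
  if (!counts.indexIntegrityOk) return "CORRUPTED";
  if (counts.missingFiles > 0 || counts.staleFiles > 0) return "NEEDS_REINDEX";
  return "OK";
}
```

With the example result shown earlier (2 missing files, 5 stale files, intact indexes) this yields NEEDS_REINDEX.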

✅ Project Integration

The database now stores project information, integrated with the Projectile project manager:

Benefits

Projects persist across VS Code restarts
Shared project list between Projectile and Database
Fast project switching using indexed data
Projects can be associated with indexed files

Project Commands

Projects are managed through Projectile commands (C-c p), but the database provides the persistence layer.

See Projectile for project management commands.

Troubleshooting

Semantic Search Not Working

Problem

Semantic search returns no results, shows "unavailable", or displays an error.

First: Check Vector Search Support

Run Scimax: Show Database Stats (scimax.db.stats) to see the semantic search status:

| Status Message                         | Meaning                                 |
|----------------------------------------+-----------------------------------------|
| Semantic search: Enabled (N chunks)    | Working, N chunks with embeddings       |
| Semantic search: Ready (no embeddings) | Supported, but no provider configured   |
| Semantic search: Unavailable (error)   | Vector search not supported by database |

Vector Search Unavailable

If you see "Semantic search: Unavailable", the libsql database doesn't support vector operations. This can happen if:

The libsql version doesn't include vector support
The vector index failed to create

Embedding Provider Issues

If vector search is supported but not working:

Check embedding provider is configured: scimax.db.embeddingProvider
Test connection: Run Scimax: Configure Embedding Service
Ensure files are reindexed after configuring embeddings
Check console for errors (Help: Toggle Developer Tools)

Local Provider Issues

First use downloads model (~30MB), wait for completion
Check extension cache directory has write permissions
Try different model if one fails

Ollama Issues

Ensure Ollama is running: ollama serve
Pull model: ollama pull nomic-embed-text
Check URL is correct in settings
Test connection: curl http://localhost:11434/api/tags (lists the installed models if the server is reachable)

Search Returns No Results

Problem

Searches return empty results despite having files.

Solutions

Run Scimax: Show Database Stats to check file count
If the file count is 0, run Scimax: Reindex Files
Check file extensions (.org, .md, .ipynb)
Verify files aren't in ignored directories
Check search scope is set to "All files"

Slow Indexing

Problem

Indexing takes very long or appears stuck.

Solutions

Check workspace size (number of files)
Add ignore patterns for large non-content directories
Disable embedding generation if not needed
Index directories incrementally
Check disk I/O and available memory

Database Corruption

Problem

Errors mentioning "database is locked" or "disk I/O error".

Solutions

Close other VS Code windows accessing same workspace
Restart VS Code
Run Scimax: Clear Database and reindex
Check disk space is available
Verify database file permissions

High Memory Usage

Problem

VS Code uses excessive memory during indexing or searching.

Solutions

Disable auto-indexing
Reduce number of indexed directories
Use more aggressive ignore patterns
Clear and reindex database
Restart VS Code between large indexing operations

Examples and Workflows

Research Paper Management

# Organize papers with properties
* TODO Read: Attention Is All You Need
:PROPERTIES:
:CUSTOM_ID: vaswani2017attention
:AUTHOR: Vaswani et al.
:YEAR: 2017
:CATEGORY: research
:END:

#transformers #attention #nlp

SCHEDULED: <2026-01-15 Thu>

# Search by property
Property: CATEGORY
Value: research

# Search by hashtag
#nlp

# Semantic search
transformer architecture papers

Project Todo Management

# Use tags for organization
* TODO Implement login feature :work:backend:
DEADLINE: <2026-01-20 Tue>

* TODO Write API documentation :work:docs:
SCHEDULED: <2026-01-18 Sun>

# Search by tag
:work: -> shows all work items
:backend: -> shows backend tasks

# View agenda
Next 2 weeks -> prioritized by deadline

# Search TODOs
Filter by: TODO (not started)

Code Snippet Library

# Store reusable code blocks
* Data Processing Utils

#+BEGIN_SRC python
def normalize_data(df):
    """Normalize numeric columns"""
    return (df - df.mean()) / df.std()
#+END_SRC

# Search code blocks
Language: python
Query: normalize

# Or semantic search
how to standardize dataframe columns

Personal Knowledge Base

# Use hybrid search for discovery
Query: "improve code performance"

Results will include:
- Exact matches: "code performance" articles
- Related concepts: optimization, profiling, caching
- Similar topics: algorithm efficiency, memory management

# Use properties for metadata
:PROPERTIES:
:CREATED: [2026-01-13 Tue 10:00]
:MODIFIED: [2026-01-13 Tue 15:30]
:CATEGORY: programming
:END:

Best Practices

Indexing Strategy

Start selective - Index specific directories first
Use ignore patterns - Exclude build artifacts and dependencies
Index incrementally - Don't reindex everything on changes
Schedule optimization - Run monthly for large databases
Monitor statistics - Check file counts and sizes regularly

Search Strategy

Start broad - Use semantic or hybrid search for exploration
Refine with keywords - Switch to FTS for specific terms
Use structured queries - Filter by tags/properties when possible
Set scope appropriately - Narrow to directories for focused work
Combine approaches - Use multiple search types for thorough research

Organization Tips

Use consistent tags - Establish tag naming conventions
Add properties - Include metadata for filtering
Set schedules - Use SCHEDULED/DEADLINE for time management
Include hashtags - Quick inline categorization
Write descriptive headings - Better search results

Performance Tips

Disable auto-indexing - For large workspaces (manual trigger)
Choose appropriate embeddings - Balance quality vs. speed
Limit result counts - Don't request thousands of results
Use search scope - Narrow searches to relevant directories
Cache frequent queries - Database has 15-minute result cache

Technical Architecture

Database Schema

Files Table

Tracks indexed files with modification tracking:

path, file_type, mtime, hash, size, indexed_at

Headings Table

Org/markdown headings with full metadata:

level, title, todo_state, priority
tags, inherited_tags, properties
scheduled, deadline, closed
line_number, begin_pos

Source Blocks Table

Code blocks with language and content:

language, content, headers
line_number, cell_index

Links Table

All link types (file, http, id, etc.):

link_type, target, description
line_number

Hashtags Table

Inline hashtags:

tag, file_path

Chunks Table

Text chunks for semantic search:

content, line_start, line_end
embedding (F32_BLOB vector)

FTS Content (Virtual Table)

Full-text search index:

file_path, title, content
Porter stemming, Unicode normalization
BM25 ranking support

Indexes

Performance indexes on:

headings.file_id, headings.todo_state
headings.deadline, headings.scheduled
source_blocks.language
hashtags.tag
chunks.embedding (vector index with cosine metric)
files.file_type

Vector Search

Using libsql's vector extension:

Cosine similarity metric
F32_BLOB storage format
Approximate nearest-neighbor index (libSQL's vector index is based on DiskANN)
Efficient nearest neighbor queries

Parsers

Org Mode

UnifiedParserAdapter - Full AST parser compatible with org-element:

Recursive heading parsing with inheritance
Property drawer extraction
Timestamp parsing (scheduled, deadline, closed)
Source block with headers
Link extraction

Markdown

Simplified parser:

ATX heading syntax (`#`)
Fenced code blocks
Inline links

Jupyter Notebooks

ipynbParser - Notebook-specific parser:

Markdown cells → headings and links
Code cells → source blocks
Cell indices tracked for navigation
Hashtags from markdown cells