Search & Tokenizer

Xplorer's search engine tokenizes file contents and builds an inverted index, so full-text queries are answered from the index instead of rescanning files.

[Figure: Smart search bar]

How It Works

[Figure: Search pipeline]
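
As a rough illustration of the tokenize-and-index idea described in the introduction, here is a minimal Python sketch. It is not Xplorer's implementation: the function names `tokenize`, `build_index`, and `search` are hypothetical, and the lowercase alphanumeric tokenization and AND-style matching are assumptions made for the example.

```python
import re
from collections import defaultdict

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric tokens (assumed rule)."""
    return TOKEN_RE.findall(text.lower())


def build_index(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the set of file paths whose content contains it."""
    index: dict[str, set[str]] = defaultdict(set)
    for path, content in files.items():
        for token in tokenize(content):
            index[token].add(path)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Return the files that contain every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    # Look up one posting set per query token, then intersect them.
    postings = [index.get(token, set()) for token in tokens]
    return set.intersection(*postings)
```

A lookup touches only the posting sets of the query tokens, which is why searching the index is much cheaper than reading every file again.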

Natural Language Search

[Figure: Natural language search results]

Xplorer can parse natural language queries and extract structured filters:

| Query | Interpretation |
| ------------------------------ | ----------------------------------------------------- |
| "large videos from last month" | Filter: size > 100MB, type: video, date: last 30 days |
| "python files in documents" | Filter: extension: .py, path contains: documents |
| "photos from 2024" | Filter: type: image, date: year 2024 |
| "small text files" | Filter: size < 1MB, type: text |

The NLP parser (enhanced_natural_language_search) supports:

  • Size filters: "large", "small", "tiny", "huge"
  • Type filters: "images", "videos", "documents", "code", "audio"
  • Date filters: "today", "this week", "last month", "from 2024"
  • Path filters: Directory name keywords
  • Synonym expansion: "pics" → "images", "docs" → "documents"
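
To make the filter extraction above more concrete, here is a small Python sketch of a keyword-based parser. It is an illustration only and does not reflect how `enhanced_natural_language_search` is actually written. The `Filters` class, `parse_query`, and the keyword tables are hypothetical. The size thresholds come from the example table (interpreted here as binary megabytes, which is an assumption). "tiny", "huge", relative dates, and path filters are left out because their exact rules are not stated.

```python
import re
from dataclasses import dataclass
from typing import Optional

MB = 1024 * 1024  # assumption: binary megabytes

# Thresholds from the example table; other size words are omitted on purpose.
SIZE_WORDS = {"large": ("min_size", 100 * MB), "small": ("max_size", 1 * MB)}

# Type keywords and synonyms mapped to a canonical type name (hypothetical table).
TYPE_WORDS = {
    "images": "image", "photos": "image", "pics": "image",
    "videos": "video", "audio": "audio", "code": "code",
    "documents": "document", "docs": "document", "text": "text",
}


@dataclass
class Filters:
    min_size: Optional[int] = None
    max_size: Optional[int] = None
    file_type: Optional[str] = None
    year: Optional[int] = None


def parse_query(query: str) -> Filters:
    """Scan the query word by word and fill in any filters it mentions."""
    filters = Filters()
    words = re.findall(r"[a-z0-9]+", query.lower())
    for i, word in enumerate(words):
        if word in SIZE_WORDS:
            field_name, value = SIZE_WORDS[word]
            setattr(filters, field_name, value)
        elif word in TYPE_WORDS and filters.file_type is None:
            filters.file_type = TYPE_WORDS[word]
        elif re.fullmatch(r"(19|20)\d{2}", word) and i > 0 and words[i - 1] == "from":
            # "from 2024" style date filter
            filters.year = int(word)
    return filters
```

For instance, `parse_query("photos from 2024")` yields a type filter of `image` and a year filter of `2024`, matching the third row of the table.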

AI-Enhanced Indexing

When Ollama is available, Xplorer can generate AI descriptions for files that don't contain searchable text:

Image Indexing

Uses vision models (LLaVA, bakllava, moondream) to:

  • Generate natural language descriptions of images
  • Extract tags and categories

The generated descriptions then become searchable tokens.

Semantic Search

Uses embedding models (nomic-embed-text, all-minilm) to:

  • Create vector embeddings of file content
  • Find semantically similar files (not just keyword matches)

For example, the query "Find files related to machine learning" returns relevant files even when they contain no exact keyword match.
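
The ranking step behind semantic search can be sketched with cosine similarity over embedding vectors. The Python example below is a simplified illustration, not Xplorer's code: `cosine_similarity` and `rank_by_similarity` are hypothetical names, and the three-dimensional vectors in the usage note are made-up stand-ins for the much longer vectors an embedding model would return.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(
    query_vec: list[float],
    file_vecs: dict[str, list[float]],
    top_k: int = 3,
) -> list[str]:
    """Return up to top_k file paths, most similar to the query first."""
    scored = [(cosine_similarity(query_vec, vec), path) for path, vec in file_vecs.items()]
    scored.sort(reverse=True)
    return [path for _, path in scored[:top_k]]
```

With toy vectors such as `{"ml_notes.md": [0.9, 0.1, 0.0], "recipes.txt": [0.0, 0.2, 0.9]}` and a query vector of `[1.0, 0.0, 0.0]`, `ml_notes.md` ranks first because its vector points in nearly the same direction as the query, regardless of which words the file contains.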

Tokenizer Settings

Configure via the Settings page or the TokenizerSettings API:

| Setting | Default | Description |
| ------------------------ | ------------------- | ---------------------------- |
| enabled | true | Enable/disable the tokenizer |
| max_file_size | 10 MB | Skip files larger than this |
| blacklisted_extensions | exe, dll, bin, etc. | Skip these file types |
| indexed_paths | [] | Directories to index |
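
The following Python sketch shows one way these settings could combine into a per-file decision. It is a model of the table, not the actual TokenizerSettings API: the dataclass layout and the `should_index` helper are hypothetical, the blacklist holds only the three extensions named above (the full default list is not given), 10 MB is read as binary megabytes, and the rule that a file must sit under one of `indexed_paths` is an assumption.

```python
from dataclasses import dataclass, field
from pathlib import PurePosixPath


@dataclass
class TokenizerSettings:
    enabled: bool = True
    max_file_size: int = 10 * 1024 * 1024  # 10 MB, assumed binary megabytes
    # Only the extensions named in the table; the real default list is longer.
    blacklisted_extensions: list[str] = field(
        default_factory=lambda: ["exe", "dll", "bin"]
    )
    indexed_paths: list[str] = field(default_factory=list)


def should_index(path: str, size: int, settings: TokenizerSettings) -> bool:
    """Decide whether a file of the given size should be tokenized."""
    if not settings.enabled:
        return False
    if size > settings.max_file_size:
        return False
    file_path = PurePosixPath(path)
    extension = file_path.suffix.lstrip(".").lower()
    if extension in settings.blacklisted_extensions:
        return False
    # Assumption: only files under one of the indexed directories qualify.
    return any(file_path.is_relative_to(root) for root in settings.indexed_paths)
```

Under these assumptions, a 2 KB `notes.txt` inside an indexed directory is accepted, while an `.exe`, a file over 10 MB, or a file outside every indexed directory is skipped.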

Index Persistence

The token index is saved to token_index.json in the app data directory. It is loaded at startup and updated incrementally:

  1. Load the existing index from disk
  2. When rebuild_index is called, reuse cached tokens for unchanged files
  3. Only re-extract content for new or modified files
  4. Save the updated index back to disk

This means the first indexing of a large directory may take time, but subsequent re-indexes are near-instant for unchanged files.
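
The caching steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the real `rebuild_index`: the JSON layout (path mapped to modification time and tokens), the use of modification time to detect changes, and the helper names `extract_tokens` and `load_index` are all hypothetical.

```python
import json
import os
import re


def extract_tokens(path: str) -> list[str]:
    """Read a file and return its unique lowercase alphanumeric tokens."""
    with open(path, encoding="utf-8", errors="ignore") as fh:
        return sorted(set(re.findall(r"[a-z0-9]+", fh.read().lower())))


def load_index(index_path: str) -> dict:
    """Step 1: load the existing index from disk, or start empty."""
    if not os.path.exists(index_path):
        return {}
    with open(index_path, encoding="utf-8") as fh:
        return json.load(fh)


def rebuild_index(paths: list[str], index_path: str) -> list[str]:
    """Rebuild the index and return the paths that had to be re-extracted."""
    old = load_index(index_path)
    new = {}
    reextracted = []
    for path in paths:
        mtime = os.path.getmtime(path)
        cached = old.get(path)
        if cached is not None and cached["mtime"] == mtime:
            # Step 2: unchanged file, reuse the cached tokens.
            new[path] = cached
        else:
            # Step 3: new or modified file, extract its content again.
            new[path] = {"mtime": mtime, "tokens": extract_tokens(path)}
            reextracted.append(path)
    # Step 4: save the updated index back to disk.
    with open(index_path, "w", encoding="utf-8") as fh:
        json.dump(new, fh)
    return reextracted
```

On the first run every file is extracted. On later runs the returned list contains only new or modified files, which is where the speedup for unchanged files comes from.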