Hi everyone,
I've been following the recent discussions around AI features (search, chat, auto-tagging), and one pattern that stood out is that many of these ideas rely on the same underlying components — embedding generation, indexing, and retrieval.
While exploring ideas 1, 3, and 4 in detail, it became clear to me that they are not really independent features, but different consumers of the same retrieval foundation.
Core Idea
My current approach is to focus on building a shared embedding and retrieval infrastructure, instead of each plugin maintaining its own pipeline.
- A single embedding pipeline incorporating chunking, change detection via hashing, and incremental updates
- A shared vector index, likely SQLite-based (after reading through the existing discussions on this), for cross-platform compatibility
- A hybrid retrieval system combining BM25 with vector similarity
- Reranking integrated as a core component of the pipeline, not a stretch goal — especially important for smaller on-device models
The goal is to build the retrieval layer once and let multiple features consume it, rather than producing duplicate, fragmented plugins.
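To make the change-detection-via-hashing part concrete, here is a rough sketch of what I mean (function names and the note shape are illustrative, not an existing Joplin API):

```typescript
import { createHash } from "crypto";

// Map of noteId -> content hash from the previous indexing run.
type HashStore = Map<string, string>;

function contentHash(body: string): string {
  return createHash("sha256").update(body).digest("hex");
}

// Returns only the notes whose content changed since the last run,
// so just those get re-chunked and re-embedded (incremental update).
function notesToReindex(
  notes: { id: string; body: string }[],
  store: HashStore,
): { id: string; body: string }[] {
  const changed = notes.filter((n) => store.get(n.id) !== contentHash(n.body));
  for (const n of changed) store.set(n.id, contentHash(n.body));
  return changed;
}
```

In practice the hash store would live in the provider plugin's persistent storage, but the flow would be the same.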
Design Considerations
A few aspects I'm actively thinking through:
- Chunking strategy: Given the range of Joplin note types — from brief entries to long structured documents — I'm leaning toward an adaptive approach that starts with structure-aware splitting (headings, paragraphs) but allows semantic merging for longer sections, rather than a fixed-size sliding window.
- Model choice: Lightweight, locally runnable embedding models (e.g., `Xenova/bge-small-en-v1.5` or similar) as the default, with the provider abstraction keeping the system open to alternatives.
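As a rough sketch of the structure-aware splitting step (the merging here is size-based for brevity; the semantic variant would compare section embeddings before merging, and the thresholds are placeholders):

```typescript
// Structure-aware chunking sketch: split a markdown note on headings,
// then merge adjacent small sections up to a target chunk size.
function chunkNote(body: string, maxChars = 1200): string[] {
  // Lookahead split keeps each heading line attached to its section.
  const sections = body
    .split(/(?=^#{1,6}\s)/m)
    .filter((s) => s.trim().length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const section of sections) {
    // Flush the current chunk if adding this section would overflow it.
    if (current && current.length + section.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += section;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

A brief note stays as a single chunk, while a long structured document splits along its headings.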
Demo Layer
To validate the infrastructure end-to-end, I plan to build a few minimal consumer features on top:
- Hybrid search with a tunable keyword ↔ semantic balance
- A basic single-turn RAG chat interface (top-k retrieval, streamed response, source citations)
- Auto-tagging suggestions
These act as reference integrations to demonstrate the shared infrastructure working in practice, not as fully standalone products.
Open Questions
I'd really appreciate feedback on a few architectural points:
1. Shared storage vs plugin model
Since plugins are sandboxed to their own `dataDir()`, my current assumption is that the right approach is a single "provider plugin" that owns the vector index and exposes a clean internal API (search, insert, update) for other features to consume — rather than true shared storage across plugins. Is that the correct mental model, or is there a lighter architecture you'd suggest?
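To make the provider-plugin model concrete, this is the kind of internal contract I'm imagining (the names and shapes are purely illustrative, not an existing Joplin interface):

```typescript
// Hypothetical result shape returned by the provider plugin.
interface VectorSearchResult {
  noteId: string;
  chunkId: string;
  score: number; // similarity score, higher is more relevant
}

// Hypothetical contract the provider plugin would expose to consumer
// features (search, chat, auto-tagging) instead of shared storage.
interface RetrievalProvider {
  // Insert or update the chunks for one note in the vector index.
  upsert(noteId: string, chunks: { id: string; text: string }[]): Promise<void>;
  // Remove a deleted note from the index.
  remove(noteId: string): Promise<void>;
  // Retrieve the topK most relevant chunks for a query.
  search(query: string, topK: number): Promise<VectorSearchResult[]>;
}
```

Consumers would never touch the index directly; they would only see this interface, which also keeps the storage backend swappable.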
2. Reusing Joplin's existing FTS index
I can consume Joplin's existing FTS results via the search API for the lexical signal in hybrid retrieval. But is there any way to access raw BM25 scores from the underlying SQLite FTS index, or should I treat the API output as a black-box ranked list and apply RRF on rank positions only? The latter avoids core changes but I want to know if score-level access is possible before committing.
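For concreteness, the rank-position-only fallback I have in mind is standard Reciprocal Rank Fusion; a minimal sketch (`k = 60` is the conventional constant, and the inputs are just ordered ID lists, which is exactly what a black-box search API can give):

```typescript
// Reciprocal Rank Fusion over rank positions only: each result
// contributes 1 / (k + rank) per list it appears in, no raw scores needed.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Sort fused scores descending and return the merged ranking.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

One list would come from Joplin's FTS results and the other from vector similarity; score-level access would only matter if RRF turns out to be too coarse.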
3. Mobile constraints
On mobile (React Native), is WASM execution supported in the plugin sandbox at all, or is the environment too restricted for any local inference? This directly affects whether the embedding pipeline can run on-device for mobile users or whether mobile would always fall back to an external provider. Understanding this early shapes how I design the provider abstraction.
4. Scope validation
My plan treats the shared infrastructure as the primary deliverable, with the three consumer features as demonstrators. Given the ~350-hour scope, does that balance feel right — or would you recommend dropping one consumer feature to give the retrieval pipeline and reranking layer more depth?
5. Agent-based search as an extension
You mentioned agent-based search as a direction nobody has proposed yet — an LLM reasoning over Joplin's existing structured tools (keyword search, tag filters, date ranges) rather than relying on a vector index. Would it make sense to include a basic intent-routing layer within this project, routing structured queries to tool-use search and vague semantic queries through the embedding pipeline? Or is that better left as a separate proposal?
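As a rough illustration of what the routing layer could look like (the filter patterns mirror Joplin's search syntax such as `tag:` and `notebook:`, but the heuristic itself is just a placeholder; a real router might use a small classifier or the LLM itself):

```typescript
type Route = "structured" | "semantic";

// Naive heuristic router: queries containing structured operators go to
// Joplin's existing tool-based search; everything else goes through the
// embedding pipeline.
function routeQuery(query: string): Route {
  const structuredPatterns = [
    /"[^"]+"/, // exact-phrase quotes
    /\btag:\S+/, // tag filter
    /\bnotebook:\S+/, // notebook filter
    /\b(created|updated):\S+/, // date-range filters
  ];
  return structuredPatterns.some((p) => p.test(query))
    ? "structured"
    : "semantic";
}
```

Even this crude split would let the agent-based direction slot in later without reworking the retrieval layer.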
Goal
The intention is to build something that doesn't just power one feature, but becomes a reusable foundation for future AI capabilities in Joplin — so that chat, search, auto-tagging, and whatever comes next don't each reinvent the same retrieval wheel.
I'd love to hear your thoughts, and please flag any errors you spot in my approach, especially around architecture and feasibility within the current plugin system.