Hi everyone,
I've been following the recent discussions around AI features (search, chat, auto-tagging), and one pattern that stood out is that many of these ideas rely on the same underlying components — embedding generation, indexing, and retrieval.
While exploring ideas 1, 3, and 4 in detail, it became clear to me that they are not really independent features, but different consumers of the same retrieval foundation.
Core Idea
My current approach is to focus on building a shared embedding and retrieval infrastructure, instead of each plugin maintaining its own pipeline.
- A single embedding pipeline incorporating chunking, change detection via hashing, and incremental updates
- A shared vector index, likely SQLite-based (after reading through the existing discussions on this), for cross-platform compatibility
- A hybrid retrieval system combining BM25 with vector similarity
- Reranking integrated as a core component of the pipeline, not a stretch goal — especially important for smaller on-device models
The goal is to build the retrieval layer once and let multiple features consume it, rather than producing duplicate, fragmented plugins.
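To make the change-detection-via-hashing part concrete, here is a rough sketch of what I mean (function names and the note shape are illustrative, not an existing Joplin API):

```typescript
import { createHash } from "crypto";

// Map of noteId -> content hash from the previous indexing run.
type HashStore = Map<string, string>;

function contentHash(body: string): string {
  return createHash("sha256").update(body).digest("hex");
}

// Returns only the notes whose content changed since the last run,
// so just those get re-chunked and re-embedded (incremental update).
function notesToReindex(
  notes: { id: string; body: string }[],
  store: HashStore,
): { id: string; body: string }[] {
  const changed = notes.filter((n) => store.get(n.id) !== contentHash(n.body));
  for (const n of changed) store.set(n.id, contentHash(n.body));
  return changed;
}
```

In practice the hash store would live in the provider plugin's persistent storage, but the flow would be the same.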
Design Considerations
A few aspects I'm actively thinking through:
- Chunking strategy: Given the range of Joplin note types — from brief entries to long structured documents — I'm leaning toward an adaptive approach that starts with structure-aware splitting (headings, paragraphs) but allows semantic merging for longer sections, rather than a fixed-size sliding window.
- Model choice: Lightweight, locally runnable embedding models (e.g., `Xenova/bge-small-en-v1.5` or similar) as the default, with the provider abstraction keeping the system open to alternatives.
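As a rough sketch of the structure-aware splitting step (the merging here is size-based for brevity; the semantic variant would compare section embeddings before merging, and the thresholds are placeholders):

```typescript
// Structure-aware chunking sketch: split a markdown note on headings,
// then merge adjacent small sections up to a target chunk size.
function chunkNote(body: string, maxChars = 1200): string[] {
  // Lookahead split keeps each heading line attached to its section.
  const sections = body
    .split(/(?=^#{1,6}\s)/m)
    .filter((s) => s.trim().length > 0);

  const chunks: string[] = [];
  let current = "";
  for (const section of sections) {
    // Flush the current chunk if adding this section would overflow it.
    if (current && current.length + section.length > maxChars) {
      chunks.push(current.trim());
      current = "";
    }
    current += section;
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

A brief note stays as a single chunk, while a long structured document splits along its headings.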
Demo Layer
To validate the infrastructure end-to-end, I plan to build a few minimal consumer features on top:
- Hybrid search with a tunable keyword ↔ semantic balance
- A basic single-turn RAG chat interface (top-k retrieval, streamed response, source citations)
- Auto-tagging suggestions
These act as reference integrations to demonstrate the shared infrastructure working in practice, not as fully standalone products.
Open Questions
I'd really appreciate feedback on a few architectural points:
1. Shared storage vs plugin model
Since plugins are sandboxed to their own `dataDir()`, my current assumption is that the right approach is a single "provider plugin" that owns the vector index and exposes a clean internal API (search, insert, update) for other features to consume — rather than true shared storage across plugins. Is that the correct mental model, or is there a lighter architecture you'd suggest?
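To make the provider-plugin model concrete, this is the kind of internal contract I'm imagining (the names and shapes are purely illustrative, not an existing Joplin interface):

```typescript
// Hypothetical result shape returned by the provider plugin.
interface VectorSearchResult {
  noteId: string;
  chunkId: string;
  score: number; // similarity score, higher is more relevant
}

// Hypothetical contract the provider plugin would expose to consumer
// features (search, chat, auto-tagging) instead of shared storage.
interface RetrievalProvider {
  // Insert or update the chunks for one note in the vector index.
  upsert(noteId: string, chunks: { id: string; text: string }[]): Promise<void>;
  // Remove a deleted note from the index.
  remove(noteId: string): Promise<void>;
  // Retrieve the topK most relevant chunks for a query.
  search(query: string, topK: number): Promise<VectorSearchResult[]>;
}
```

Consumers would never touch the index directly; they would only see this interface, which also keeps the storage backend swappable.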
2. Reusing Joplin's existing FTS index
I can consume Joplin's existing FTS results via the search API for the lexical signal in hybrid retrieval. But is there any way to access raw BM25 scores from the underlying SQLite FTS index, or should I treat the API output as a black-box ranked list and apply RRF on rank positions only? The latter avoids core changes but I want to know if score-level access is possible before committing.
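For concreteness, the rank-position-only fallback I have in mind is standard Reciprocal Rank Fusion; a minimal sketch (`k = 60` is the conventional constant, and the inputs are just ordered ID lists, which is exactly what a black-box search API can give):

```typescript
// Reciprocal Rank Fusion over rank positions only: each result
// contributes 1 / (k + rank) per list it appears in, no raw scores needed.
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  // Sort fused scores descending and return the merged ranking.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

One list would come from Joplin's FTS results and the other from vector similarity; score-level access would only matter if RRF turns out to be too coarse.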
3. Mobile constraints
On mobile (React Native), is WASM execution supported in the plugin sandbox at all, or is the environment too restricted for any local inference? This directly affects whether the embedding pipeline can run on-device for mobile users or whether mobile would always fall back to an external provider. Understanding this early shapes how I design the provider abstraction.
4. Scope validation
My plan treats the shared infrastructure as the primary deliverable, with the three consumer features as demonstrators. Given the ~350-hour scope, does that balance feel right — or would you recommend dropping one consumer feature to give the retrieval pipeline and reranking layer more depth?
5. Agent-based search as an extension
You mentioned agent-based search as a direction nobody has proposed yet — an LLM reasoning over Joplin's existing structured tools (keyword search, tag filters, date ranges) rather than relying on a vector index. Would it make sense to include a basic intent-routing layer within this project, routing structured queries to tool-use search and vague semantic queries through the embedding pipeline? Or is that better left as a separate proposal?
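As a rough illustration of what the routing layer could look like (the filter patterns mirror Joplin's search syntax such as `tag:` and `notebook:`, but the heuristic itself is just a placeholder; a real router might use a small classifier or the LLM itself):

```typescript
type Route = "structured" | "semantic";

// Naive heuristic router: queries containing structured operators go to
// Joplin's existing tool-based search; everything else goes through the
// embedding pipeline.
function routeQuery(query: string): Route {
  const structuredPatterns = [
    /"[^"]+"/, // exact-phrase quotes
    /\btag:\S+/, // tag filter
    /\bnotebook:\S+/, // notebook filter
    /\b(created|updated):\S+/, // date-range filters
  ];
  return structuredPatterns.some((p) => p.test(query))
    ? "structured"
    : "semantic";
}
```

Even this crude split would let the agent-based direction slot in later without reworking the retrieval layer.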
Goal
The intention is to build something that doesn't just power one feature, but becomes a reusable foundation for future AI capabilities in Joplin — so that chat, search, auto-tagging, and whatever comes next don't each reinvent the same retrieval wheel.
I'd love to hear your thoughts, and please flag any errors you spot in my approach, especially around architecture and feasibility within the current plugin system.