Shared Embedding & Retrieval Infrastructure for Joplin AI Features (Questions)

Shared embedding & retrieval infrastructure — combines AI-supported search, AI-based categorisation, and Chat with notes under one unified foundation.

Current PR status:

  • #14391 — Repeating to-do notifications (DB migration, AlarmService, desktop + mobile UI, 26 tests): Ready for PR
  • #14582 — Plugin custom icons (SVG/PNG toolbar registration with FA fallback): In review

What I'm building

Exactly what @shikuz described:

"Infrastructure project: Chunking, embedding, storage, incremental indexing. Retrieval improvements: query decomposition / reranking / hybrid keyword + vector scoring / relevant segment extraction. Demo consumer: search / related notes."

A single provider plugin that indexes all notes, provides hybrid retrieval, and exposes put(note) / query(text) — the interface @shikuz suggested so other projects can independently build their own pipelines now and migrate later.
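The chunking step from the quoted scope ("chunking, embedding, storage, incremental indexing") could look roughly like the sketch below — an overlapping-window splitter. This is a hedged illustration, not the proposal's actual code: it splits on whitespace, whereas a real implementation would count tokens with the embedding model's tokenizer (BGE-small's 512-token window), and the chunk sizes are illustrative.

```typescript
// Hypothetical sketch: fixed-size overlapping chunks over whitespace tokens.
// A real implementation would use the embedding model's tokenizer instead
// of word counts; sizes here are illustrative only.
export interface Chunk {
  noteId: string;
  index: number;
  text: string;
}

export function chunkNote(
  noteId: string,
  body: string,
  chunkSize = 200, // words per chunk (illustrative setting)
  overlap = 40,    // words shared between consecutive chunks
): Chunk[] {
  const words = body.split(/\s+/).filter(w => w.length > 0);
  const chunks: Chunk[] = [];
  const step = chunkSize - overlap;
  for (let start = 0, i = 0; start < words.length; start += step, i++) {
    chunks.push({
      noteId,
      index: i,
      text: words.slice(start, start + chunkSize).join(" "),
    });
    if (start + chunkSize >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from at least one chunk.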

Key decisions:

  • BGE-small-en-v1.5 (512 tokens, ndcg@10: 59.55) over MiniLM (256 tokens, 49.54). @justin212407 empirically validated nomic's similarity range issues.
  • sql.js for vector storage — @AmirthaYazhini confirmed native modules (onnxruntime-node, hnswlib-node) break in Joplin's webpack sandbox. sql.js is pure WASM, zero native deps.
  • Hybrid search via Reciprocal Rank Fusion (Cormack et al., SIGIR 2009) — merges vector + Joplin's existing FTS4 by rank, not score. No normalization needed.
  • All AI features optional with on/off toggles, as @guy-rouillier, @laurent, and @justin212407 agreed.
  • Vector DB = cache (per @adamoutler). Model/config change → blow away → rebuild. User warned about cost.
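The Reciprocal Rank Fusion decision above is small enough to sketch in full. RRF scores each document as the sum of 1/(k + rank) over all rankings it appears in, so the vector results and FTS4 results are merged purely by position — which is why no score normalization is needed. The function below is an illustrative sketch, not the proposal's implementation; k = 60 is the constant from Cormack et al.

```typescript
// Reciprocal Rank Fusion (Cormack et al., SIGIR 2009):
//   score(d) = sum over rankings r of 1 / (k + rank_r(d))
// Merges by rank only, so vector and FTS scores need no normalization.
export function reciprocalRankFusion(
  rankings: string[][], // each list is doc ids, best first
  k = 60,               // standard RRF constant from the paper
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      // rank is zero-based; RRF uses one-based ranks.
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

For example, fusing a vector ranking ["a", "b", "c"] with an FTS ranking ["b", "c", "d"] puts "b" first, since it ranks highly in both lists.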

Full proposal with implementation code is ready. Happy to share.


Two architecture questions

1. put(note) / query(text) — is this interface sufficient?

@shikuz proposed this as the basic interface for compatibility. I've expanded it to:

getNoteEmbedding() serves categorization (k-means over note vectors per @Harsh16gupta's chunk-vs-note distinction). findSimilarNotes() serves note graphs. Is there anything else that downstream features would need?
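For concreteness, the expanded surface could be typed roughly as below. The method names come from this post; the argument and result shapes are my assumptions for illustration, not a settled API.

```typescript
// Sketch of the expanded provider interface. Method names are from the
// proposal; argument/result shapes are illustrative assumptions.
export interface QueryResult {
  noteId: string;
  chunkText: string; // the relevant segment, not the whole note
  score: number;     // fused hybrid (rank-based) score
}

export interface EmbeddingProvider {
  put(note: { id: string; title: string; body: string }): Promise<void>;
  query(text: string, limit?: number): Promise<QueryResult[]>;
  // Note-level vector (aggregated over chunks) for k-means categorization.
  getNoteEmbedding(noteId: string): Promise<Float32Array>;
  // For note graphs / related-notes features.
  findSimilarNotes(
    noteId: string,
    limit?: number,
  ): Promise<{ noteId: string; similarity: number }[]>;
}

// Cosine similarity over 384-dim vectors, as findSimilarNotes would use.
export function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```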

2. Inter-plugin data access

Joplin plugins have no direct IPC. My approach: the provider writes the sql.js index to its dataDir, and consumers load it read-only from that well-known path. API types are published as an npm package for type safety.

Is there a better mechanism? @shikuz mentioned experimenting with userData for cross-plugin sharing — but that introduces sync overhead (~5.5KB per note at 384-dim). For a local-only cache, a shared file seems simpler.
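The consumer side of the shared-file approach could look like the sketch below. The file name and the idea of injecting the sql.js module are my illustrative assumptions; what is accurate is that sql.js operates on an in-memory copy of the file bytes, so a consumer cannot corrupt the provider's on-disk index.

```typescript
import { promises as fs } from "fs";
import * as path from "path";

// Assumed well-known file name inside the provider plugin's dataDir
// ("embeddings.sqlite" is hypothetical).
export function indexPath(providerDataDir: string): string {
  return path.join(providerDataDir, "embeddings.sqlite");
}

// The sql.js module is passed in by the caller (e.g. `await initSqlJs()`
// from the "sql.js" package); taking it as a parameter keeps this sketch
// free of non-stdlib imports. new SQL.Database(bytes) deserializes the
// file into memory, so reads never touch the provider's copy.
export async function openSharedIndex(
  providerDataDir: string,
  SQL: { Database: new (data: Uint8Array) => any },
) {
  const bytes = await fs.readFile(indexPath(providerDataDir));
  return new SQL.Database(new Uint8Array(bytes));
}
```

A consumer would then run plain SQL against the snapshot, e.g. `db.exec("SELECT note_id, vector FROM embeddings")` (table name hypothetical), and reload the file when the provider signals a rebuild.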


Happy to hear thoughts from @shikuz, @personalizedrefrigerator, @adamoutler, or anyone building on top of this.


I think Laurent must make the final call here but IMO the vector DB should be a core portion of all Joplin app platforms.

Requiring one plugin to depend on another would lead to maintenance, security, and reliability issues. If one plugin updates its API and others don't, the dependents all become outdated; in theory, a single plugin change could cause issues across the entire ecosystem. Additionally, if a plugin has a built-in security problem, any plugin depending on it inherits that issue.

Requiring each plugin to maintain its own vector DB would lead to issues with complexity, storage space, and processing cost. Each plugin would need its own settings for models and vectorization on top of its actual plugin settings; the database would essentially be duplicated, leading to size bloat; and each plugin would need its own model, which costs time/money to operate.

With a centralized vector DB, each plugin may need additional APIs but it would be done per-feature and the APIs would be reusable for future plugins. This would reduce the complexity as the vectorization would be centralized.

I’d suggest a centralized vector DB in Joplin Libs with a built-in vectorization model like nomic-embed-text:137m, which will run on most modern embedded computing hardware, with the database exposed and manipulated via plugins. It's a basic embeddings fallback model that a desktop, phone, or refrigerator should be able to handle. Since the vector DB itself would not be shared — it would be part of the Joplin app and exposed via API — individual plugins could provide embedding models along with methods of displaying/accessing the data.

The one tricky thing here: if a plugin that provides an embedding model is removed, the vector DB must be removed and rebuilt with the default fallback model, which could be very inconvenient. However, this does provide extensibility and allows paying users on various platforms to bring their own model and use their data how they see fit.

Let's break this idea down a bit:

  • Search — a simple query of the vector DB. Should the user want better search, they could install a plugin with a new model and rebuild the vector DB. This is the core functionality.
  • Chat — runs a similar search on each user turn and provides the model with the top results as context. Chat additionally needs a reranker for small-model context management.
  • Mindmap of notes — uses search-like functionality in a different way, to find distances between concepts. As an example: generate a word cloud of commonly used words, group notes by those words, and use embedding distances between notes to arrange them on a 2D plane around those words. There are many ways to do this.
  • Image search — for each image, run the image through a vision model (not available on many embedded systems, because it currently requires at least a 2B model like ibm-granite/granite-vision-3.3-2b), then add that text to the image caption text in the note. This is risky because it directly manipulates the actual note data; however, it's a one-time procedure, and with some data marker it could be made reversible/repeatable. Additionally, a 2B model would be insufficient in many cases, as even 7B or 14B models suffer from accuracy issues — proper captions would need cloud models. After the initial captioning, the vector DB is all that's required.
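The Chat bullet above — search on each user turn, rerank, keep only a few chunks for a small context window — can be sketched as below. The `query` and `rerank` callbacks stand in for the provider API and some reranker; all names and the chunk budget are illustrative assumptions.

```typescript
// Per-turn chat retrieval: query the shared index with the user's message,
// rerank the candidates, and keep only a few chunks of context.
// `query`/`rerank` are placeholders for the provider API and a reranker.
export async function buildChatContext(
  userTurn: string,
  query: (
    text: string,
    limit: number,
  ) => Promise<{ noteId: string; chunkText: string; score: number }[]>,
  rerank: (queryText: string, chunks: string[]) => Promise<string[]>,
  maxChunks = 4, // illustrative budget for small-model context management
): Promise<string> {
  const hits = await query(userTurn, 20); // broad first-stage retrieval
  const ordered = await rerank(userTurn, hits.map(h => h.chunkText));
  return ordered
    .slice(0, maxChunks) // the reranker decides what survives the cut
    .map((c, i) => `[${i + 1}] ${c}`)
    .join("\n");
}
```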

So what's required of the vector DB? Each note needs to be run through an embedding model and the results stored; each update to a note needs the same procedure run as an update. The vector DB needs a model selector for plugins, some access methods for search, and some internal APIs for sync. In the end, the vector DB is just a differently processed version of your notes, recording the semantic relationships determined by your model of choice.

On your questions: For inter-plugin data access - have you tried using Joplin's command system? Commands can return structured data (objects, arrays) across plugins, which would avoid the shared-file approach. The gap is load ordering - what happens if the infra plugin hasn't loaded when a consumer calls?
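To make the command suggestion concrete: the infra plugin would register a command whose execute callback returns structured data, and a consumer would invoke it via joplin.commands.execute. The command name below is hypothetical, and the load-ordering gap is papered over with a plain retry loop — a sketch, not a settled design.

```typescript
// Commands as an IPC channel: the infra plugin registers a command that
// returns structured data; consumers call it by name across plugins.
//
// Provider side (in the infra plugin's onStart), name hypothetical:
//   await joplin.commands.register({
//     name: "embeddingProvider.query",
//     execute: async (text: string, limit: number) => index.query(text, limit),
//   });

// Consumer side: retry until the provider's command exists, since a
// consumer plugin may start before the infra plugin has registered.
export async function callWithRetry<T>(
  call: () => Promise<T>,
  attempts = 5,
  delayMs = 500,
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await call();
    } catch (err) {
      if (i >= attempts - 1) throw err; // provider never appeared; give up
      await new Promise(res => setTimeout(res, delayMs));
    }
  }
}

// Usage (inside a consumer plugin):
//   const hits = await callWithRetry(() =>
//     joplin.commands.execute("embeddingProvider.query", text, 10));
```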

On the API - put(note) / query(text) / getNoteEmbedding() / findSimilarNotes() covers the basics. Where do retrieval features like reranking and query decomposition live - in the shared query() or in individual consumers? That also defines the project scope: what's the infra layer, and what's the demo consumer that proves it works?

As @adamoutler noted, there's a question about whether this lives as a plugin or in core. @laurent thoughts?

If you're planning to submit this as a GSoC proposal, the submission template has the required structure.

Yes, that should indeed be part of the core app; other plugins can then use it. I guess as part of such a proposal, a sample plugin should also be included to demonstrate the API. Indeed, having a plugin depend on another plugin is not something we want, or even currently support.
