Shared embedding & retrieval infrastructure — combines AI-supported search, AI-based categorisation, and Chat with notes under one unified foundation.
| PR | Description | Status |
|---|---|---|
| #14391 | Repeating to-do notifications — DB migration, AlarmService, desktop + mobile UI, 26 tests | Ready for PR |
| #14582 | Plugin custom icons — SVG/PNG toolbar registration with FA fallback | In review |
What I'm building
Exactly what @shikuz described:
"Infrastructure project: Chunking, embedding, storage, incremental indexing. Retrieval improvements: query decomposition / reranking / hybrid keyword + vector scoring / relevant segment extraction. Demo consumer: search / related notes."
A single provider plugin that indexes all notes, provides hybrid retrieval, and exposes put(note) / query(text), the interface @shikuz suggested so that other projects can build their own pipelines independently now and migrate later.
Key decisions:
- BGE-small-en-v1.5 (512 tokens, ndcg@10: 59.55) over MiniLM (256 tokens, 49.54). @justin212407 empirically validated nomic's similarity range issues.
- sql.js for vector storage — @AmirthaYazhini confirmed native modules (onnxruntime-node, hnswlib-node) break in Joplin's webpack sandbox. sql.js is pure WASM, zero native deps.
- Hybrid search via Reciprocal Rank Fusion (Cormack et al., SIGIR 2009) — merges vector + Joplin's existing FTS4 by rank, not score. No normalization needed.
- All AI features optional with on/off toggles, as @guy-rouillier, @laurent, and @justin212407 agreed.
- Vector DB = cache (per @adamoutler). A model or config change means blowing the index away and rebuilding it, with the user warned about the cost.
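For anyone unfamiliar with RRF, here is a minimal sketch of the fusion step, not the proposal's actual code. The function name and the note IDs are made up for illustration. The constant k = 60 is the value used in the Cormack et al. paper, and each input list is assumed to contain a note at most once.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over result lists of 1 / (k + rank(d)),
// where rank starts at 1. Only ranks are used, so the raw vector and FTS
// scores never need to be normalized against each other.
function reciprocalRankFusion(
  rankings: string[][],
  k = 60, // constant from the paper; treat it as a tunable default
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores.entries())
    .map(([id, score]) => ({ id, score }))
    // Highest fused score first; ties broken by ID for a stable order.
    .sort((a, b) => b.score - a.score || a.id.localeCompare(b.id));
}

// Hypothetical note IDs, best match first in each list.
const vectorHits = ["noteA", "noteB", "noteC"];
const keywordHits = ["noteB", "noteD", "noteA"];

const fused = reciprocalRankFusion([vectorHits, keywordHits]);
// Order: noteB, noteA, noteD, noteC
```

Notes that appear in both lists (noteA, noteB) rise above notes that appear in only one, which is the behaviour hybrid search is after.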
Full proposal with implementation code is ready. Happy to share.
Two architecture questions
1. put(note) / query(text): is this interface sufficient?
@shikuz proposed this as the basic interface for compatibility. I've expanded it with additional methods, including getNoteEmbedding() and findSimilarNotes().
getNoteEmbedding() serves categorization (k-means over note vectors, per @Harsh16gupta's chunk-vs-note distinction). findSimilarNotes() serves note graphs. Is there anything else that downstream features would need?
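To make the question concrete, here is a rough sketch of what an interface with these four methods could look like. Only the method names come from the post. The parameter lists, return types, the `NoteInput` and `Hit` shapes, the in-memory class, and the keyword-counting `toyEmbed` stand-in are all assumptions for illustration, and the sketch embeds whole notes rather than chunks to keep it short.

```typescript
// Illustrative only: every signature and type below is an assumption.
type Vector = Float32Array;

interface NoteInput { id: string; body: string }
interface Hit { noteId: string; score: number }

interface RetrievalProvider {
  put(note: NoteInput): Promise<void>;
  query(text: string, limit?: number): Promise<Hit[]>;
  getNoteEmbedding(noteId: string): Promise<Vector | undefined>;
  findSimilarNotes(noteId: string, limit?: number): Promise<Hit[]>;
}

function cosine(a: Vector, b: Vector): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Toy in-memory provider: one vector per note, brute-force cosine ranking.
class InMemoryProvider implements RetrievalProvider {
  private vectors = new Map<string, Vector>();
  constructor(private embed: (text: string) => Promise<Vector>) {}

  async put(note: NoteInput): Promise<void> {
    this.vectors.set(note.id, await this.embed(note.body));
  }
  async query(text: string, limit = 5): Promise<Hit[]> {
    return this.rank(await this.embed(text), limit);
  }
  async getNoteEmbedding(noteId: string): Promise<Vector | undefined> {
    return this.vectors.get(noteId);
  }
  async findSimilarNotes(noteId: string, limit = 5): Promise<Hit[]> {
    const vector = this.vectors.get(noteId);
    return vector ? this.rank(vector, limit, noteId) : [];
  }
  private rank(queryVector: Vector, limit: number, exclude?: string): Hit[] {
    const hits: Hit[] = [];
    this.vectors.forEach((vector, noteId) => {
      if (noteId !== exclude) hits.push({ noteId, score: cosine(queryVector, vector) });
    });
    return hits.sort((a, b) => b.score - a.score).slice(0, limit);
  }
}

// Stand-in embedder that counts three keywords; a real provider would call the model.
const vocabulary = ["recipe", "invoice", "meeting"];
const toyEmbed = async (text: string): Promise<Vector> =>
  Float32Array.from(vocabulary.map((word) => text.toLowerCase().split(word).length - 1));

async function demo(): Promise<{ search: Hit[]; related: Hit[] }> {
  const provider: RetrievalProvider = new InMemoryProvider(toyEmbed);
  await provider.put({ id: "n1", body: "Pasta recipe, another recipe" });
  await provider.put({ id: "n2", body: "Invoice for the meeting room" });
  await provider.put({ id: "n3", body: "recipe notes from a meeting" });
  return {
    search: await provider.query("recipe"),
    related: await provider.findSimilarNotes("n3"),
  };
}
```

A consumer that only needs search would depend on `put` and `query` alone, so the two extra methods could be added without breaking the basic contract.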
2. Inter-plugin data access
Joplin plugins have no direct IPC. My approach: the provider writes the sql.js index to dataDir, and consumers load it read-only from a well-known path. The API types are published as an npm package for type safety.
Is there a better mechanism? @shikuz mentioned experimenting with userData for cross-plugin sharing — but that introduces sync overhead (~5.5KB per note at 384-dim). For a local-only cache, a shared file seems simpler.
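One thing a shared-file design probably needs is a way for consumers to tell whether the file on disk matches what they expect, since the index is a cache that gets rebuilt on model or config changes. The sketch below shows one possible check. The manifest shape, field names, and the function are hypothetical and not part of the proposal. The model name and the 384 dimensions are taken from the post, and "all-MiniLM-L6-v2" is only an example value for a stale index.

```typescript
// Hypothetical header a provider could store next to (or inside) the shared index.
interface IndexManifest {
  schemaVersion: number; // layout of the tables inside the index
  modelId: string;       // embedding model that produced the vectors
  dimensions: number;    // vector length
}

type Compatibility = { ok: true } | { ok: false; reason: string };

// Returns the first mismatching field, if any.
function checkCompatibility(found: IndexManifest, expected: IndexManifest): Compatibility {
  const keys: Array<keyof IndexManifest> = ["schemaVersion", "modelId", "dimensions"];
  for (const key of keys) {
    if (found[key] !== expected[key]) {
      return { ok: false, reason: `${key}: found ${found[key]}, expected ${expected[key]}` };
    }
  }
  return { ok: true };
}

const expected: IndexManifest = { schemaVersion: 1, modelId: "BGE-small-en-v1.5", dimensions: 384 };
const stale: IndexManifest = { schemaVersion: 1, modelId: "all-MiniLM-L6-v2", dimensions: 384 };

const current = checkCompatibility(expected, expected); // ok: true
const outdated = checkCompatibility(stale, expected);   // ok: false, reason names modelId
```

On a mismatch the provider would rebuild the index, while a read-only consumer would skip the file and retry later rather than mixing vectors from different models.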
Happy to hear thoughts from @shikuz, @personalizedrefrigerator, @adamoutler, or anyone building on top of this.
