GSoC 2026 Proposal Draft – Idea 4: Chat with your note collection using AI – Divya A

Hi everyone! I’m Divya, currently working on a GSoC 2026 proposal for Idea 4: “Chat with your note collection using AI.”

I’ve been refining the design based on earlier discussions and feedback from @shikuz, and I’d really appreciate any input before I finalize the proposal.

Key Design Decisions

1. Embedding Model
I initially considered all-MiniLM-L6-v2 for efficiency, but I’m now evaluating Xenova/bge-small-en-v1.5 (as suggested) since it offers better retrieval performance at a comparable size.
I’m also reviewing the AI summarisation plugin to better understand how Transformers.js is currently integrated within Joplin.
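
To make this concrete, here is roughly how I expect the embedding step to look with Transformers.js. This is only a sketch: the `embed` helper name is mine, and the mean-pooling plus normalisation options follow what the bge model card recommends.

```ts
import { pipeline } from '@xenova/transformers';

// Load the model once on first use; Transformers.js caches the weights locally.
// Typed as `any` only to keep the sketch short.
let embedder: any = null;

async function embed(text: string): Promise<number[]> {
  if (!embedder) {
    embedder = await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');
  }
  // Mean pooling + normalisation yields one unit-length vector per input,
  // so cosine similarity later reduces to a plain dot product.
  const out = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(out.data as Float32Array);
}
```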

2. Chunking Strategy
Instead of fixed-size chunking, I’m proposing a semantic approach:

  • Start with sentence-level segmentation

  • Merge adjacent sentences based on semantic similarity

  • Dynamically tune a similarity threshold

The goal is to better handle Joplin’s diverse note types (short notes vs. long structured documents). I’m planning to evaluate how different thresholds impact retrieval quality across these variations.
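
To illustrate the merge step, here is a minimal sketch reusing the `embed` helper above. The regex sentence splitter and the 0.75 default threshold are placeholders I made up for illustration, not tested values:

```ts
// Cosine similarity; with normalised vectors this is just the dot product.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

async function semanticChunks(text: string, threshold = 0.75): Promise<string[]> {
  // Naive sentence segmentation; a real implementation would need to handle
  // markdown structure, lists, and code blocks more carefully.
  const sentences = text.split(/(?<=[.!?])\s+/).filter(s => s.trim().length > 0);
  if (sentences.length === 0) return [];

  const chunks: string[] = [];
  let current = sentences[0];
  let currentVec = await embed(current);

  for (let i = 1; i < sentences.length; i++) {
    const vec = await embed(sentences[i]);
    if (cosine(currentVec, vec) >= threshold) {
      // Similar enough: grow the current chunk and re-embed it (simple but
      // quadratic in the worst case; a running centroid would be cheaper).
      current += ' ' + sentences[i];
      currentVec = await embed(current);
    } else {
      chunks.push(current);
      current = sentences[i];
      currentVec = vec;
    }
  }
  chunks.push(current);
  return chunks;
}
```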

Open Questions

  • What similarity threshold range would be reasonable to start experimenting with?

  • Are there any constraints in Joplin’s plugin architecture that could affect embedding or chunking strategies?

  • Is there a preferred trade-off direction between local model efficiency and retrieval accuracy?

  • I’ve drafted my proposal around this design and would appreciate overall feedback on whether the direction and scope make sense before I finalize it. You can find my full draft here: link

Thanks in advance!

Hey @Divya-A10, ChromaDB's JS client connects to a running ChromaDB server - it doesn't embed into the plugin process. How does a user who installs this plugin start that server? What's the experience for someone who doesn't have Python?

Hi @shikuz, thanks for pointing this out; it's an important consideration.

You're right: depending on the ChromaDB JS client would require users to run an external server, which isn't ideal for a Joplin plugin and adds friction (Python dependency, manual setup, background process). I'm now thinking of moving toward an embedded, fully local approach to better align with a plug-and-play experience. I'm looking into a few options:

  • using an in-process vector store (e.g., a lightweight JS-based solution or SQLite-backed storage; see the sketch after the next list)
  • storing embeddings directly in Joplin’s local database or plugin storage
  • using Transformers.js to handle retrieval entirely within the plugin without the need for external services

This would ensure:

  • no Python dependency
  • no separate server setup
  • a smooth installation experience for non-technical users
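
To make the in-process option concrete, this is the kind of store I have in mind: a brute-force exact cosine scan (reusing the `cosine` helper from the chunking sketch above), which should be fast enough for typical collection sizes. Persistence to plugin storage or SQLite is left out of the sketch:

```ts
interface ChunkRecord {
  noteId: string;   // Joplin note the chunk came from
  text: string;     // chunk text to feed into the chat context
  vector: number[]; // normalised embedding
}

// Brute-force in-memory index: exact cosine search over all chunks.
// Records would be persisted (plugin data folder or a SQLite file) so the
// index survives restarts and only changed notes need re-embedding.
class VectorIndex {
  private records: ChunkRecord[] = [];

  add(record: ChunkRecord): void {
    this.records.push(record);
  }

  search(queryVec: number[], topK = 5): ChunkRecord[] {
    return this.records
      .map(r => ({ r, score: cosine(r.vector, queryVec) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK)
      .map(x => x.r);
  }
}
```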

I’ll update the proposal to clearly prioritise a self-contained architecture and reflect this direction. Would you recommend a preferred approach, or any existing Joplin plugin patterns, for managing local indexing and search?
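
For context, the indexing loop I currently picture looks roughly like this. It assumes the helpers sketched above, and the pagination shape (`items`/`has_more`) follows the `joplin.data.get` pattern from the plugin API docs:

```ts
import joplin from 'api';

// Walk all notes page by page and index their chunks.
async function indexAllNotes(index: VectorIndex): Promise<void> {
  let page = 1;
  while (true) {
    const batch = await joplin.data.get(['notes'], {
      fields: ['id', 'title', 'body'],
      limit: 50,
      page,
    });
    for (const note of batch.items) {
      for (const chunk of await semanticChunks(note.body)) {
        index.add({ noteId: note.id, text: chunk, vector: await embed(chunk) });
      }
    }
    if (!batch.has_more) break;
    page++;
  }
}
```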
Thanks again for catching that!