Hi everyone! I’m Divya, currently working on a GSoC 2026 proposal for Idea 4: “Chat with your note collection using AI.”
I’ve been refining the design based on earlier discussions and feedback from @shikuz, and I’d really appreciate any input before I finalize the proposal.
Key Design Decisions
1. Embedding Model
I initially considered all-MiniLM-L6-v2 for efficiency, but I’m now evaluating Xenova/bge-small-en-v1.5 (as suggested) since it offers better retrieval performance at a comparable size.
I’m also reviewing the AI summarisation plugin to better understand how Transformers.js is currently integrated within Joplin.
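For context, here is a minimal sketch of how embeddings could be computed with Transformers.js. It assumes the `@xenova/transformers` package and uses its feature-extraction pipeline with the mean-pooling and normalization options commonly shown for bge-style models; function names like `loadEmbedder` and `embed` are mine, not from any existing plugin.

```javascript
// Sketch: loading Xenova/bge-small-en-v1.5 via Transformers.js.
// Assumes @xenova/transformers is installed; loadEmbedder/embed are
// illustrative names, not part of any existing Joplin plugin.
async function loadEmbedder() {
  const { pipeline } = await import('@xenova/transformers');
  return pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');
}

// Embed one note chunk. Mean pooling + normalization yields a unit
// vector, so cosine similarity reduces to a plain dot product.
async function embed(extractor, text) {
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

// Dot product of two normalized embeddings = cosine similarity.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}
```

Because the vectors come back normalized, comparing chunks at query time is a single dot product per candidate, which keeps retrieval cheap even with a local model.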
2. Chunking Strategy
Instead of fixed-size chunking, I’m proposing a semantic approach:
- Start with sentence-level segmentation
- Merge adjacent sentences based on semantic similarity
- Dynamically tune a similarity threshold
The goal is to better handle Joplin’s diverse note types (short notes vs. long structured documents). I’m planning to evaluate how different thresholds impact retrieval quality across these variations.
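The steps above can be sketched as follows. This is a toy illustration of the merge logic only: the regex-based sentence splitter and the bag-of-words embedding are stand-ins for a real segmenter and the actual model, and the threshold value is arbitrary, not a recommendation.

```javascript
// Naive sentence segmentation on sentence-ending punctuation
// (a stand-in for a proper segmenter).
function splitSentences(text) {
  return text.match(/[^.!?]+[.!?]+/g)?.map(s => s.trim()) ?? [text.trim()];
}

// Placeholder embedding: word-count vector over a shared vocabulary.
// The real design would use model embeddings instead.
function toyEmbed(sentence, vocab) {
  const words = sentence.toLowerCase().match(/\w+/g) ?? [];
  return vocab.map(w => words.filter(x => x === w).length);
}

function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return na && nb ? dot / (na * nb) : 0;
}

// Merge adjacent sentences whose similarity clears the threshold;
// otherwise start a new chunk.
function semanticChunks(text, threshold = 0.3) {
  const sentences = splitSentences(text);
  const vocab = [...new Set(text.toLowerCase().match(/\w+/g) ?? [])];
  const embeddings = sentences.map(s => toyEmbed(s, vocab));
  const chunks = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) >= threshold) {
      chunks[chunks.length - 1].push(sentences[i]); // similar: same chunk
    } else {
      chunks.push([sentences[i]]); // dissimilar: new chunk
    }
  }
  return chunks.map(c => c.join(' '));
}
```

One design consequence worth noting: because only adjacent pairs are compared, a single off-topic sentence always breaks a chunk, which is exactly the behavior the threshold experiments would need to probe.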
Open Questions
- What similarity threshold range would be reasonable to start experimenting with?
- Are there any constraints in Joplin’s plugin architecture that could affect embedding or chunking strategies?
- Is there a preferred trade-off direction between local model efficiency and retrieval accuracy?
I’ve drafted my proposal around this design and would appreciate overall feedback on whether the direction and scope make sense before I finalize it. You can find my full draft here: link
Thanks in advance!