Hi everyone! I’m Divya, currently working on a GSoC 2026 proposal for Idea 4: “Chat with your note collection using AI.”
I’ve been refining the design based on earlier discussions and feedback from @shikuz, and I’d really appreciate any input before I finalize the proposal.
Key Design Decisions
1. Embedding Model
I initially considered all-MiniLM-L6-v2 for efficiency, but I’m now evaluating Xenova/bge-small-en-v1.5 (as suggested) since it offers better retrieval performance at a comparable size.
I’m also reviewing the AI summarisation plugin to better understand how Transformers.js is currently integrated within Joplin.
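For context, here is a minimal sketch of how embeddings could be computed with Transformers.js. It assumes the `@xenova/transformers` package and uses its feature-extraction pipeline with the mean-pooling and normalization options commonly shown for bge-style models; function names like `loadEmbedder` and `embed` are mine, not from any existing plugin.

```javascript
// Sketch: loading Xenova/bge-small-en-v1.5 via Transformers.js.
// Assumes @xenova/transformers is installed; loadEmbedder/embed are
// illustrative names, not part of any existing Joplin plugin.
async function loadEmbedder() {
  const { pipeline } = await import('@xenova/transformers');
  return pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');
}

// Embed one note chunk. Mean pooling + normalization yields a unit
// vector, so cosine similarity reduces to a plain dot product.
async function embed(extractor, text) {
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data);
}

// Dot product of two normalized embeddings = cosine similarity.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}
```

Because the vectors come back normalized, comparing chunks at query time is a single dot product per candidate, which keeps retrieval cheap even with a local model.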
2. Chunking Strategy
Instead of fixed-size chunking, I’m proposing a semantic approach:
- Start with sentence-level segmentation
- Merge adjacent sentences based on semantic similarity
- Dynamically tune a similarity threshold
The goal is to better handle Joplin’s diverse note types (short notes vs. long structured documents). I’m planning to evaluate how different thresholds impact retrieval quality across these variations.
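The steps above can be sketched as follows. This is a toy illustration of the merge logic only: the regex-based sentence splitter and the bag-of-words embedding are stand-ins for a real segmenter and the actual model, and the threshold value is arbitrary, not a recommendation.

```javascript
// Naive sentence segmentation on sentence-ending punctuation
// (a stand-in for a proper segmenter).
function splitSentences(text) {
  return text.match(/[^.!?]+[.!?]+/g)?.map(s => s.trim()) ?? [text.trim()];
}

// Placeholder embedding: word-count vector over a shared vocabulary.
// The real design would use model embeddings instead.
function toyEmbed(sentence, vocab) {
  const words = sentence.toLowerCase().match(/\w+/g) ?? [];
  return vocab.map(w => words.filter(x => x === w).length);
}

function cosine(a, b) {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const na = Math.sqrt(a.reduce((s, x) => s + x * x, 0));
  const nb = Math.sqrt(b.reduce((s, x) => s + x * x, 0));
  return na && nb ? dot / (na * nb) : 0;
}

// Merge adjacent sentences whose similarity clears the threshold;
// otherwise start a new chunk.
function semanticChunks(text, threshold = 0.3) {
  const sentences = splitSentences(text);
  const vocab = [...new Set(text.toLowerCase().match(/\w+/g) ?? [])];
  const embeddings = sentences.map(s => toyEmbed(s, vocab));
  const chunks = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(embeddings[i - 1], embeddings[i]) >= threshold) {
      chunks[chunks.length - 1].push(sentences[i]); // similar: same chunk
    } else {
      chunks.push([sentences[i]]); // dissimilar: new chunk
    }
  }
  return chunks.map(c => c.join(' '));
}
```

One design consequence worth noting: because only adjacent pairs are compared, a single off-topic sentence always breaks a chunk, which is exactly the behavior the threshold experiments would need to probe.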
Open Questions
- What similarity threshold range would be reasonable to start experimenting with?
- Are there any constraints in Joplin’s plugin architecture that could affect embedding or chunking strategies?
- Is there a preferred trade-off direction between local model efficiency and retrieval accuracy?
I’ve drafted my proposal around this design and would appreciate overall feedback on whether the direction and scope make sense before I finalize it. You can find my full draft here: link
Thanks in advance!