I've been thinking about this year's AI-related GSoC ideas, specifically search, chat, and categorisation, and wanted to open a discussion about where the interesting opportunities might be. These are just my thoughts as one of the mentors. I'd really like to hear what other mentors, students, and community members think, especially if you see it differently.
Looking at the proposal drafts, the search, chat, and categorisation proposals all build on the same core idea (unless I missed something): split notes into chunks, generate embeddings, store the vectors, and retrieve by similarity. Whether the goal is search, chat, or auto-tagging, the retrieval layer underneath looks very similar. Existing plugins already do this to some extent too.
Shared infrastructure as an opportunity
Since search, chat, and categorisation may make use of the same embedding index, there's a nice opportunity to build it once and build it well.
I'm not sure what the right architecture for this would be (I experimented with storing embeddings in userData, which could potentially be shared across plugins in the future). The idea is that one process builds the index using up-to-date methods, producing a single set of vectors (perhaps even synced across devices), while features like chat, search, and categorisation are just consumers.
This is also relevant for local model loading. Several proposals want to run embedding models in-process (via Transformers.js or WASM). That works well for a single plugin, but if multiple plugins each load their own model, the memory footprint inside the app adds up quickly. A shared index means one model, one memory budget.
Building a shared embedding index could be a GSoC-sized project on its own, especially combined with the pipeline improvements I describe below. Something like "build a shared, high-quality embedding index for Joplin" could create the foundation that makes all the downstream features possible. With a shared index in place, features like chat, auto-tagging, and related notes become much easier to build on top, which frees up time for the harder problems in each project.
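To make the "one index, many consumers" idea concrete, here is a minimal sketch of what the shared layer could look like. All names here are hypothetical, the storage is just in-memory (a real version might persist to userData as mentioned above), and the brute-force cosine search is only illustrative:

```typescript
// Hypothetical sketch: one process builds the index, downstream
// features (chat, search, auto-tagging) only call search().

interface IndexedChunk {
  noteId: string;
  text: string;
  vector: number[];
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

class SharedEmbeddingIndex {
  private chunks: IndexedChunk[] = [];

  // Called by the single indexing process.
  add(chunk: IndexedChunk): void {
    this.chunks.push(chunk);
  }

  // Called by any consumer: returns the top-k most similar chunks.
  search(queryVector: number[], k: number): IndexedChunk[] {
    return [...this.chunks]
      .sort((x, y) =>
        cosineSimilarity(queryVector, y.vector) -
        cosineSimilarity(queryVector, x.vector))
      .slice(0, k);
  }
}
```

The point is the shape of the boundary, not the implementation: consumers never embed or store anything themselves, so swapping in a better model or a smarter index later only touches one place.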
Where I see the interesting opportunities
Two directions that I think would take Joplin's AI capabilities further than what we have today:
Retrieval pipeline improvements
The basic embed-and-retrieve approach works, but there are well-known techniques that can meaningfully improve quality. I wrote about it recently here. Some ideas:
- Reranking: using a second, more precise model to re-score the top results before passing them to the LLM.
- Hybrid scoring: combining keyword-based scores (like BM25) with vector similarity automatically, so exact term matches get a natural boost.
- Query decomposition: breaking complex questions into sub-queries and retrieving for each one separately.
- Relevant segment extraction (RSE): dynamically combining adjacent relevant chunks into longer segments instead of returning fixed-size blocks.
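Hybrid scoring in particular is cheap to sketch. Assuming we already have a keyword score per chunk (e.g. BM25 from Joplin's existing search engine) and a vector similarity from the embedding index, a simple weighted blend over min-max-normalised scores could look like this (the `alpha` weight and the normalisation are illustrative choices, not a recommendation):

```typescript
// Hypothetical hybrid scoring sketch: blend keyword and vector
// scores so exact term matches get a boost over pure similarity.

interface ScoredChunk {
  id: string;
  keywordScore: number; // e.g. a BM25 score, assumed given
  vectorScore: number;  // e.g. cosine similarity, assumed given
}

// Min-max normalise so the two score scales are comparable.
function normalise(scores: number[]): number[] {
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  const range = max - min || 1; // avoid division by zero
  return scores.map(s => (s - min) / range);
}

function hybridRank(chunks: ScoredChunk[], alpha = 0.5): ScoredChunk[] {
  const kw = normalise(chunks.map(c => c.keywordScore));
  const vec = normalise(chunks.map(c => c.vectorScore));
  return chunks
    .map((c, i) => ({ chunk: c, score: alpha * kw[i] + (1 - alpha) * vec[i] }))
    .sort((x, y) => y.score - x.score)
    .map(x => x.chunk);
}
```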
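RSE is also quite simple at its core. Assuming per-chunk relevance scores in document order, the basic move is to merge runs of adjacent chunks above some threshold into one segment rather than returning isolated fixed-size chunks (the threshold and the binary above/below test are simplifying assumptions; real RSE variants optimise segment boundaries more carefully):

```typescript
// Illustrative RSE sketch: merge adjacent relevant chunks into
// longer segments. Scores are assumed to be in document order.

interface Segment {
  start: number; // index of the first chunk in the run
  end: number;   // index of the last chunk (inclusive)
}

function extractSegments(scores: number[], threshold: number): Segment[] {
  const segments: Segment[] = [];
  let start = -1;
  scores.forEach((score, i) => {
    if (score >= threshold) {
      if (start === -1) start = i;        // open a new run
    } else if (start !== -1) {
      segments.push({ start, end: i - 1 }); // close the current run
      start = -1;
    }
  });
  if (start !== -1) segments.push({ start, end: scores.length - 1 });
  return segments;
}
```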
The nice thing about these is that if we have a shared infrastructure, they benefit every downstream feature at once: chat, search, related notes, and auto-tags all get better together.
Agent-based search
Current proposals use embedding-based retrieval: embed the query, rank by similarity. But search can also work quite differently, independent of the pipeline above. An LLM agent could use Joplin's existing search tools (keyword search, tag filters, date ranges, notebook scoping) and reason about how to combine them. This is similar to what Joplin MCP servers do, but built into the app rather than through an external client.
This would be a genuinely different approach from embedding-based retrieval. It doesn't need a vector index at all, just tool-use capabilities. It could work as a plugin or potentially in core. I haven't seen anyone propose this yet, and I think it's worth exploring.
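The skeleton of such an agent is just a tool-selection step followed by tool calls. In the sketch below a trivial rule stands in for the LLM's reasoning, and the tool bodies are stubs (a real version might wrap Joplin's search API, e.g. `joplin.data.get(['search'], ...)`); everything here is hypothetical:

```typescript
// Hedged sketch of agent-based search: no vector index, only
// tool use. chooseTool() is a placeholder for an LLM decision.

type Tool = (query: string) => string[]; // returns note IDs

const tools: Record<string, Tool> = {
  keywordSearch: q => [], // stub: would call Joplin's keyword search
  tagFilter: q => [],     // stub: would filter by tag
  dateRange: q => [],     // stub: would scope by date range
};

// In a real agent an LLM would reason about the query and may
// combine several tools; this rule only illustrates the shape.
function chooseTool(query: string): string {
  if (query.startsWith('tag:')) return 'tagFilter';
  if (/\d{4}-\d{2}-\d{2}/.test(query)) return 'dateRange';
  return 'keywordSearch';
}

function agentSearch(query: string): string[] {
  return tools[chooseTool(query)](query);
}
```

The interesting part of the project would be everything this sketch omits: multi-step tool chaining, deciding when the agent has enough results, and presenting its reasoning to the user.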
Sharing this to start a conversation. Happy to discuss any of this with students, mentors or users.