GSoC 2026: Opportunities for the AI projects

That’s a fair point, especially with loosely titled notes.

In my approach, I’m not relying only on headings. While building the tree, each node also stores a short LLM-generated summary of its content. So even if a note is titled something like “Monday meeting”, the node would still capture what the meeting was actually about.

This makes the structure semantically aware rather than purely title-based. The idea is to combine structure with lightweight semantic understanding, instead of depending entirely on embeddings.
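To make that concrete, here is a minimal sketch of such a tree. The names `Node`, `build_tree` and `first_sentence` are made up for illustration, and `first_sentence` is only a stand-in for the LLM summarization call so the example runs offline. It assumes notes use `#`-style markdown headings.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    title: str
    level: int                      # 0 = note root, 1..6 = markdown heading depth
    body: str = ""
    summary: str = ""
    children: List["Node"] = field(default_factory=list)


def build_tree(markdown: str, summarize: Callable[[str], str]) -> Node:
    """Parse '#' headings into a tree, then summarize every node bottom-up."""
    root = Node(title="(note)", level=0)
    stack = [root]
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)$", line)
        if match:
            level = len(match.group(1))
            while stack[-1].level >= level:   # climb back up to the parent
                stack.pop()
            node = Node(title=match.group(2).strip(), level=level)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            stack[-1].body += line + "\n"
    _summarize(root, summarize)
    return root


def _summarize(node: Node, summarize: Callable[[str], str]) -> None:
    for child in node.children:
        _summarize(child, summarize)
    # A parent sees its own text plus the summaries of its children.
    text = (node.body + "\n".join(c.summary for c in node.children)).strip()
    node.summary = summarize(text) if text else ""


def first_sentence(text: str) -> str:
    """Stand-in for the LLM call (assumption): keep only the first sentence."""
    return text.split(".")[0].strip()


note = "# Monday meeting\nWe agreed to ship the sync fix. Other stuff.\n## Actions\nAlice writes tests.\n"
tree = build_tree(note, first_sentence)
meeting = tree.children[0]
print(meeting.title, "->", meeting.summary)
```

Retrieval can then match a query against `summary` at each level and only descend into promising branches, so a vaguely titled node is still findable by what it contains.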

That said, I agree it may not fully replace semantic search in all cases, but it could still work well as a complementary or lightweight alternative, especially for structured markdown notes.

If any GSoC contributor is reading this: as noted above, this could be a good opportunity for a project, so don't hesitate to create a proposal if you have some ideas.

@adamoutler thanks for the detailed breakdown, lots of useful thinking in there. A few things I want to pick up on:

Your reranking observation is interesting: that it matters more for smaller on-device models than for larger cloud models. Good to keep in mind for the infrastructure project, since we'd want to support both.

The hybrid search idea (keyword ↔ vector slider) is a nice concrete way to think about the UI for that.
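One way the slider could map to ranking is a simple linear blend of the two scores. This is only a sketch: `hybrid_rank` is a hypothetical helper, and it assumes both the keyword and the vector scores have already been normalized to the 0..1 range.

```python
from typing import Dict, List


def hybrid_rank(keyword: Dict[str, float], vector: Dict[str, float], slider: float) -> List[str]:
    """Rank note ids by a blend of two scores.

    slider = 0.0 means keyword only, slider = 1.0 means vector only.
    Both score dicts are assumed to be normalized to 0..1 (assumption).
    """
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0 and 1")
    ids = set(keyword) | set(vector)
    blended = {
        i: (1.0 - slider) * keyword.get(i, 0.0) + slider * vector.get(i, 0.0)
        for i in ids
    }
    # Highest blended score first; ties broken by id for a stable order.
    return sorted(ids, key=lambda i: (-blended[i], i))


keyword_scores = {"note-a": 1.0, "note-b": 0.2}
vector_scores = {"note-b": 0.9, "note-c": 0.8}

for position in (0.0, 0.5, 1.0):
    print(position, hybrid_rank(keyword_scores, vector_scores, position))
```

A note missing from one side simply scores 0 there, so moving the slider changes the ordering gradually rather than switching between two separate result lists.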

The "negative friction" idea for MCP design makes a lot of sense. Keeping round trips and context usage low matters. There are already a few Joplin MCP servers out there (including one I maintain) that work with the desktop client's Data API. Yours works with Joplin Server, which is cool; I don't think anyone else has explored that space. For the GSoC projects though, which run as plugins / core inside the app, I think we can keep things simpler by describing tools directly in LLM calls rather than going through MCP (as I described in my reply above). Whether Joplin should bundle an MCP server inside the app for external consumers is a different question.
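For illustration, "describing tools directly in LLM calls" could look roughly like the sketch below. Everything here is hypothetical: the `search_notes` tool, the in-memory `NOTES`, and the payload shape, which only loosely follows the JSON-schema style that function-calling APIs tend to use and is not any specific provider's format. No model is actually called.

```python
import json
from typing import Dict, List

# Hypothetical in-memory notes standing in for the app's note store.
NOTES: Dict[str, Dict[str, str]] = {
    "n1": {"title": "Monday meeting", "body": "Agreed to ship the sync fix."},
    "n2": {"title": "Groceries", "body": "Milk, eggs, coffee."},
}


def search_notes(query: str, limit: int = 5) -> List[Dict[str, str]]:
    """Hypothetical tool: case-insensitive substring search over NOTES."""
    q = query.lower()
    hits = [
        {"id": note_id, "title": note["title"]}
        for note_id, note in NOTES.items()
        if q in note["title"].lower() or q in note["body"].lower()
    ]
    return hits[:limit]


# Each tool is described once, next to its handler.
TOOLS = {
    "search_notes": {
        "description": "Search the user's notes by keyword.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string"},
                "limit": {"type": "integer"},
            },
            "required": ["query"],
        },
        "handler": search_notes,
    }
}


def tool_specs() -> List[dict]:
    """The part that would be sent along with the prompt in the LLM request."""
    return [
        {"name": name, "description": t["description"], "parameters": t["parameters"]}
        for name, t in TOOLS.items()
    ]


def dispatch(tool_call_json: str) -> str:
    """Run the tool call the model asked for and return a JSON result string."""
    call = json.loads(tool_call_json)
    handler = TOOLS[call["name"]]["handler"]
    return json.dumps(handler(**call["arguments"]))


reply = dispatch('{"name": "search_notes", "arguments": {"query": "meeting"}}')
print(reply)
```

The plugin owns both the description and the handler, so there is no separate server process or extra round trip between the model and the note data.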

On vector DB choices: the discussion here is mostly about plugin-level projects that need to work inside the app on desktop and mobile. That narrows things down quite a bit. Your overview is very helpful, and I agree that sqlite-vec looks like a natural starting point given its cross-platform support. Making sure whatever we pick actually works on mobile (where FS access is limited) should be part of any infrastructure project. We may also need to include PRs to the mobile app.

@Krishh interesting idea with the hierarchical summary tree. LLM-generated summaries at each node are more robust than headings alone, and a nice complement to embedding-based retrieval. As @adamoutler noted, pure structural approaches can miss semantic relationships, but combining a summary tree with Joplin's search tools (as I described above) could give you the best of both. Might be worth exploring as part of the search project.

This discussion aligns very closely with what I’ve been thinking while working on my proposal.

  • For the past 4 days I've been planning a proposal that combines ideas 1, 3 and 4. After researching them thoroughly, I realized they fall under a single umbrella: they share the same retrieval foundation.

  • I kept coming back to the fact that all three require building the embedding pipeline first, then letting the three features consume it.

  • This approach also removes the duplication problem, where each plugin builds its own pipeline and the same work is repeated several times. A single standalone pipeline, used by all three features (or more in the future), would reduce the cost in memory and architectural complexity.

  • I’m currently thinking of framing the project around this shared infrastructure as the core deliverable, with a few minimal consumer features (search, chat, auto-tagging) implemented mainly to validate and demonstrate the system end-to-end rather than as fully independent products.

  • Would love to hear if this direction aligns with how you’d expect these ideas to be approached, or if there are constraints in Joplin’s current architecture that would push this in a different direction.

  • I’ll be creating a separate discussion post to explore this approach in more detail and clarify a few open questions. I would really appreciate any guidance or feedback there.
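A rough sketch of the "build the pipeline once, let several features consume it" idea from the list above. All names (`SharedIndex`, `search`, `suggest_tags`) are hypothetical, and `toy_embed` is only a stand-in for a real embedding model so the example runs without one.

```python
import hashlib
import math
from typing import Callable, Dict, List

Vector = List[float]
VOCAB = ["cat", "kitty", "meeting", "budget"]


def toy_embed(text: str) -> Vector:
    """Stand-in for a real embedding model (assumption): word counts over VOCAB."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SharedIndex:
    """One embedding store shared by every consumer feature."""

    def __init__(self, embed: Callable[[str], Vector]) -> None:
        self._embed = embed
        self._vectors: Dict[str, Vector] = {}
        self._digests: Dict[str, str] = {}
        self.note_embeddings = 0          # how many times a note was embedded

    def upsert(self, note_id: str, text: str) -> bool:
        """Embed a note only if its content changed; return True if embedded."""
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if self._digests.get(note_id) == digest:
            return False
        self._vectors[note_id] = self._embed(text)
        self._digests[note_id] = digest
        self.note_embeddings += 1
        return True

    def nearest(self, text: str, k: int = 3) -> List[str]:
        query = self._embed(text)
        ranked = sorted(self._vectors, key=lambda i: (-cosine(query, self._vectors[i]), i))
        return ranked[:k]


def search(index: SharedIndex, query: str) -> List[str]:
    """Search consumer (a chat feature could call this to fetch context)."""
    return index.nearest(query, k=2)


def suggest_tags(index: SharedIndex, text: str, tags_by_note: Dict[str, List[str]]) -> List[str]:
    """Auto-tagging consumer: borrow the tags of the most similar note."""
    nearest = index.nearest(text, k=1)
    return tags_by_note.get(nearest[0], []) if nearest else []


index = SharedIndex(toy_embed)
for note_id, body in {"n1": "cat cat kitty", "n2": "meeting budget", "n3": "budget budget"}.items():
    index.upsert(note_id, body)

print(search(index, "kitty cat"))
print(suggest_tags(index, "budget meeting", {"n2": ["work"]}))
print(index.note_embeddings)
```

Both consumers read the same stored vectors, so each note is embedded once no matter how many features are enabled, and unchanged notes are skipped on re-index.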

My Discussion Post: Design Discussion: Shared Embedding & Retrieval Infrastructure for Joplin AI Features

As someone working on the categorisation proposal, I've already empirically validated the compressed similarity range issue with nomic-embed-text during POC development: notes with short bodies required thresholds below 0.60 regardless of topic distance. I'm planning to implement the put(note)/query(text) interface with a swappable backend so it can migrate to shared infrastructure later.
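As a sketch of what a put/query interface with a swappable backend might look like: the names `VectorBackend`, `InMemoryBackend` and `NoteIndex` are hypothetical, `toy_embed` stands in for nomic-embed-text, and `put` takes an id plus text here for simplicity. The similarity threshold is a plain parameter, since the right value would have to be tuned per model and note length.

```python
import math
from typing import Callable, Dict, List, Protocol, Tuple

Vector = List[float]


class VectorBackend(Protocol):
    """Anything with these two methods can be plugged in (sqlite-vec, a service, ...)."""

    def add(self, note_id: str, vector: Vector) -> None: ...
    def top_k(self, vector: Vector, k: int) -> List[Tuple[str, float]]: ...


class InMemoryBackend:
    """Simplest possible backend: brute-force cosine similarity over a dict."""

    def __init__(self) -> None:
        self._rows: Dict[str, Vector] = {}

    def add(self, note_id: str, vector: Vector) -> None:
        self._rows[note_id] = vector

    def top_k(self, vector: Vector, k: int) -> List[Tuple[str, float]]:
        scored = [(note_id, _cosine(vector, row)) for note_id, row in self._rows.items()]
        scored.sort(key=lambda pair: (-pair[1], pair[0]))
        return scored[:k]


def _cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NoteIndex:
    """The put/query facade the plugin codes against."""

    def __init__(self, embed: Callable[[str], Vector], backend: VectorBackend, threshold: float) -> None:
        self._embed = embed
        self._backend = backend
        self.threshold = threshold        # tune per model; not a recommended value

    def put(self, note_id: str, text: str) -> None:
        self._backend.add(note_id, self._embed(text))

    def query(self, text: str, k: int = 5) -> List[Tuple[str, float]]:
        hits = self._backend.top_k(self._embed(text), k)
        return [(note_id, score) for note_id, score in hits if score >= self.threshold]


def toy_embed(text: str) -> Vector:
    """Stand-in for the real embedding model (assumption): counts of two words."""
    words = text.lower().split()
    return [float(words.count("cat")), float(words.count("budget"))]


notes = NoteIndex(toy_embed, InMemoryBackend(), threshold=0.60)
notes.put("n1", "cat cat")
notes.put("n2", "budget")
print(notes.query("cat"))
```

Moving to shared infrastructure later would then mean writing another class with `add` and `top_k`, without touching the categorisation code that calls `put` and `query`.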

Quick question for @shikuz - for the categorisation use case specifically, would you recommend sqlite-vec embedded directly in the plugin, or designing around an external service interface from the start?

I’ve been a paid user of Joplin for several years; I compared many note-taking apps and Joplin was clearly the optimal choice for me. I’m no technophobe; I made my living in software development, now retired. However, I’m a confirmed skeptic when it comes to AI; I’ve turned it off in all the desktop and Android apps where possible. So, I would implore anyone working on AI features to please make them optional, via an easily selectable on/off switch. Thanks.

Yeah, we should definitely have a flag for each AI feature, since I have observed a desire among users for an AI-free experience.

Any such feature would definitely be optional. We have no intention of pushing these as some companies do. We hope, however, that whatever is developed will be useful, and if it is, users can enable it themselves.

When done correctly, AI is a completely transparent net-positive. Semantic search can provide enhanced contextual understanding using on-device, local-only models.

E.g. 1: someone searches for “kitty” but they wrote the word “cat”. The semantic understanding provided by the vector model handles that automatically, while the traditional search continues to provide the most recent exact match.

E.g. 2: they search for “kitty” again and find the note where they talked about “tom” alongside a picture of a cat.

As far as the user knows, it’s not AI. It’s just a really good search function. This sort of AI SHOULD be pushed. Joe Schmoe doesn’t know how it works, or why it works. They just know it is better… as long as it’s done right.

That’s a great insight. Initially I was thinking of adding a semantic search layer, but your approach feels simpler and more practical compared to introducing extra complexity.

I think we can start with Joplin search + structured tree, and optionally use semantic search as a fallback when keyword-based retrieval doesn’t work well. That way it stays simple while still handling harder cases.

I’ll explore this direction further.

From what I’ve explored while working on semantic search, a hybrid approach (lexical + semantic with fallback) seems to strike a good balance: lexical for precision and filters, and semantic primarily for recall when queries are more descriptive or ambiguous.
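A small sketch of the fallback flow, under stated assumptions: `lexical_search` stands in for Joplin's existing keyword search, and `semantic_search` fakes semantic matching with a tiny hand-written synonym table instead of a real embedding model. All names and data are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

NOTES: Dict[str, str] = {
    "n1": "My cat knocked the plant over again.",
    "n2": "Quarterly budget review notes.",
}

# Stand-in for embedding similarity (assumption): a hand-written synonym table.
SYNONYMS: Dict[str, List[str]] = {"kitty": ["cat"], "finances": ["budget"]}


def lexical_search(query: str) -> List[str]:
    """Exact keyword match: precise, but misses paraphrases."""
    q = query.lower()
    return [note_id for note_id, body in NOTES.items() if q in body.lower()]


def semantic_search(query: str) -> List[str]:
    """Toy 'semantic' match: also accept notes containing a related word."""
    terms = [query.lower()] + SYNONYMS.get(query.lower(), [])
    return [
        note_id
        for note_id, body in NOTES.items()
        if any(term in body.lower() for term in terms)
    ]


def hybrid_search(
    query: str,
    lexical: Callable[[str], List[str]] = lexical_search,
    semantic: Callable[[str], List[str]] = semantic_search,
    min_hits: int = 1,
    k: int = 5,
) -> Tuple[List[str], str]:
    """Try lexical first; only fall back to semantic when it finds too little."""
    hits = lexical(query)[:k]
    if len(hits) >= min_hits:
        return hits, "lexical"
    seen = set(hits)
    for note_id in semantic(query):
        if note_id not in seen:
            hits.append(note_id)
            seen.add(note_id)
    return hits[:k], "semantic-fallback"


print(hybrid_search("cat"))
print(hybrid_search("kitty"))
```

The semantic path only runs when the lexical one comes up short, which keeps the common case cheap and leaves exact matches and filters untouched.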

For the storage side, starting with something like sqlite-vec locally makes a lot of sense for simplicity and privacy, especially for a plugin-first approach. Designing the interface to be swappable (as you mentioned) feels important though, so it can evolve later without tight coupling to a single backend.

Curious to hear thoughts from @shikuz and others:
For Joplin specifically, would you lean toward keeping the first iteration strictly local (embedded vector store), or designing early for a pluggable backend even if it adds a bit more complexity upfront?

“Correctly” is carrying a lot of weight here.

I’m with guy-rouillier: I do not want generative AI anywhere near my Joplin experience or data.

Please note: I didn't say generative AI. But I'm certain you'd appreciate some level of semantic search where, e.g., the words “code”, “programming”, and similar are searched when you type the name of a programming language. The semantic search function, which all the main AI projects listed above depend upon, requires an AI model of about the same level as the next-word prediction on your keyboard.

Generative AI gets things wrong and often takes more action than people want. However, keeping it to some level of hybrid, context-aware search is pretty much universally good.