GSoC 2026: Opportunities for the AI projects

I've been thinking about this year's AI-related GSoC ideas, specifically search, chat, and categorisation, and wanted to open a discussion about where the interesting opportunities might be. These are just my thoughts as one of the mentors. I'd really like to hear what other mentors, students, and community members think, especially if you see it differently.

Looking at the proposal drafts, the search, chat, and categorisation proposals all build on the same core idea (unless I missed something): split notes into chunks, generate embeddings, store the vectors, and retrieve by similarity. Whether the goal is search, chat, or auto-tagging, the retrieval layer underneath looks very similar. Existing plugins already do this to some extent too.

Shared infrastructure as an opportunity

Since search, chat, and categorisation may make use of the same embedding index, there's a nice opportunity to build it once and build it well.

I'm not sure what the right architecture for this would be (I experimented with storing embeddings in userData, which can potentially be shared across plugins in the future). But the idea is for one process to build the index (using the latest methods) and maintain one set of vectors (perhaps even synced across devices); other features like chat, search, and categorisation would just be consumers.

This is also relevant for local model loading. Several proposals want to run embedding models in-process (via Transformers.js or WASM). That works well for a single plugin, but if multiple plugins each load their own model, the memory footprint inside the app adds up quickly. A shared index means one model, one memory budget.

Building a shared embedding index could be a GSoC-sized project on its own, especially combined with the pipeline improvements I describe below. Something like "build a shared, high-quality embedding index for Joplin" could create the foundation that makes all the downstream features possible. With a shared index in place, features like chat, auto-tagging, and related notes become much easier to build on top, which frees up time for the harder problems in each project.

Where I see the interesting opportunities

Two directions that I think would take Joplin's AI capabilities further than what we have today:

Retrieval pipeline improvements

The basic embed-and-retrieve approach works, but there are well-known techniques that can meaningfully improve quality. I wrote about it recently here. Some ideas:

  • Reranking: using a second, more precise model to re-score the top results before passing them to the LLM.

  • Hybrid scoring: combining keyword-based scores (like BM25) with vector similarity automatically, so exact term matches get a natural boost.

  • Query decomposition: breaking complex questions into sub-queries and retrieving for each one separately.

  • Relevant segment extraction (RSE): dynamically combining adjacent relevant chunks into longer segments instead of returning fixed-size blocks.

The nice thing about these is that if we have a shared infrastructure, they benefit every downstream feature at once: chat, search, related notes, and auto-tags all get better together.
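
To make the hybrid-scoring idea concrete, here's a minimal TypeScript sketch. The types and function names are illustrative, and the BM25 and cosine scores are assumed to already come from existing keyword and vector indexes:

```typescript
// Hedged sketch: blend a keyword score (e.g. BM25) with vector
// similarity. Both score lists are min-max normalised per query so the
// scales become comparable; `alpha` weights the vector side.
interface ScoredChunk {
  id: string;
  bm25: number;   // keyword score, assumed to come from a BM25 index
  cosine: number; // similarity, assumed to come from the vector index
}

interface RankedChunk extends ScoredChunk {
  score: number;
}

function normalise(values: number[]): number[] {
  const min = Math.min(...values);
  const max = Math.max(...values);
  const range = max - min || 1; // guard against a constant score list
  return values.map(v => (v - min) / range);
}

function hybridRank(chunks: ScoredChunk[], alpha = 0.5): RankedChunk[] {
  const kw = normalise(chunks.map(c => c.bm25));
  const vec = normalise(chunks.map(c => c.cosine));
  return chunks
    .map((c, i) => ({ ...c, score: (1 - alpha) * kw[i] + alpha * vec[i] }))
    .sort((a, b) => b.score - a.score);
}
```

With `alpha = 0` this degenerates to pure keyword ranking and with `alpha = 1` to pure vector ranking, which would also map naturally onto a user-facing setting.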

Agent-based search

Current proposals use embedding-based retrieval: embed the query, rank by similarity. But search can also work differently and become distinct from the pipeline above. An LLM agent could use Joplin's existing search tools (keyword search, tag filters, date ranges, notebook scoping) and reason about how to combine them. This is similar to what Joplin MCP servers do, but built into the app rather than through an external client.

This would be a genuinely different approach from embedding-based retrieval. It doesn't need a vector index at all, just tool-use capabilities. It could work as a plugin or potentially in core. I haven't seen anyone propose this yet, and I think it's worth exploring.

Sharing this to start a conversation. Happy to discuss any of this with students, mentors or users.

10 Likes

This is a very interesting perspective on the AI-related ideas for this year.

While working on my proposal for Idea 5 (Automatically label images using AI), I also noticed that several AI features could end up loading their own models or building their own pipelines independently.

A shared infrastructure for things like model loading or indexing could definitely reduce duplication and memory usage, especially if multiple plugins run local models (for example via Transformers.js or similar approaches).

The idea of agent-based search is also very interesting. Since Joplin already has strong structured search features (tags, notebooks, date filters), an agent that can reason about how to combine these tools could provide a different approach compared to embedding-based retrieval.

Looking forward to seeing how these ideas evolve.

I'd like to offer my opinion on three of your points, one area at a time:

  • Shared Embedding Index via userData

Rather than building a private vector store that only the AI search plugin can access, I will architect the indexing pipeline to store all embeddings and metadata in Joplin's userData directory. This means the index is built once, maintained by a single background process, and shared across any Joplin plugin that needs it.

A plugin for chat, auto-tagging, or related notes would simply read from the same sqlite-vec database file in userData rather than running its own indexing pipeline and loading its own embedding model. This directly addresses the memory footprint problem: instead of three plugins each loading a 500 MB embedding model into memory simultaneously, the shared infrastructure loads one model once and all downstream features consume the same vectors. The indexing pipeline I have already designed (paginated note fetching, SHA-256 change detection, overlapping chunks, and F32BLOB vector storage) remains exactly the same, but its output becomes a shared resource rather than a private one. This makes my plugin the foundation layer that makes every other AI feature in Joplin cheaper and easier to build.

  • Retrieval Pipeline Improvements

The basic embed-and-retrieve approach I described in section 3.5 of my proposal works, but I will extend it with two concrete improvements that meaningfully increase result quality.

The first is hybrid scoring, which combines BM25 keyword-based scores with cosine vector similarity so that exact term matches receive a natural boost alongside semantically relevant results. This removes the need for the user to think about whether their query is keyword style or natural language. The ranking handles both automatically.

The second improvement is reranking, where after the initial top-K results are retrieved by vector similarity, a second lighter model re-scores those results with greater precision before they are displayed in the sidebar. Reranking is computationally cheap because it only processes the small top-K set rather than the entire index, and it consistently improves precision in production RAG systems.
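
As a rough illustration of that reranking step (this sketch is not from the proposal itself), the re-scoring can be kept model-agnostic; `scorePair` below is a hypothetical stand-in for whatever cross-encoder is loaded, e.g. via Transformers.js:

```typescript
// Hedged sketch of reranking: re-score only the small top-K set with
// a more precise pairwise scorer. `PairScorer` is a hypothetical
// stand-in for a loaded cross-encoder model.
type PairScorer = (query: string, text: string) => Promise<number>;

interface Candidate {
  id: string;
  text: string;
}

async function rerank(
  query: string,
  topK: Candidate[],
  scorePair: PairScorer,
): Promise<Array<Candidate & { score: number }>> {
  const scored = await Promise.all(
    topK.map(async c => ({ ...c, score: await scorePair(query, c.text) })),
  );
  // Only top-K candidates are re-scored, so this stays cheap even if
  // the scorer is slower per item than the embedding model.
  return scored.sort((a, b) => b.score - a.score);
}
```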

If time permits within the coding period I will also explore query decomposition, which breaks a complex question into sub-queries and retrieves for each separately before merging the ranked results, making the system more capable on multi-part questions.

  • Agent-Based Search as a Future Extension

The agent-based search direction the mentor describes is architecturally distinct from everything I have proposed and I think it is worth acknowledging explicitly.

My suggestion is that agent-based search be treated as a future extension to the AI search plugin.

Recommended approach for GSoC 2026

Since the GSoC coding period is only 13 weeks, contributors building chat, auto-tagging, and related notes plugins cannot wait for a shared index before starting. They will each build their own private pipeline just to meet their deadlines and complete their plugin projects, which leads to exactly the duplication problem the mentor described.

The cleaner approach is to treat the shared embedding index as a dedicated standalone GSoC project first. Once it is stable and its API is published, every other AI plugin can migrate to it in a follow-up update.

I would suggest to the mentors that the shared embedding index be listed as its own GSoC 2026 idea, with all other AI feature ideas explicitly noting they should consume it rather than rebuild it.

Thanks for sharing this information @shikuz!

From the look of it, it seems like it would make sense to have both a shared structure for embeddings, and also an MCP server? The MCP server would possibly make any interaction with the data more standard? Like if the MCP server could expose both the regular search engine and the embeddings, then the LLM could decide which one to use based on the query?

Would that help, or it's not really necessary?

1 Like

And I agree we should probably list this as a project idea. We need to think about whether it's going to be manageable to have one project be a dependency of several others. A reasonable approach, I guess, would be to agree early on an API, and the dependent projects could build around that, even if the API is not completed yet.

+1 on this,
I have been working with Sugar Labs as a member where we were building various AI-focused activities, and because every activity needed an API key, we developed a common place from which all the AI activities can take the key. This meant the user could submit their API key in one place and then use it across all the activities.

I was literally thinking the 1st and 4th projects sounded similar or related, and the same goes for the 2nd and 3rd ones.

A shared server for this would let users submit their API key once, share a common vector database, and access all the features, rather than enabling everything individually.

1 Like

This is a really valuable breakdown, thank you for sharing your thinking!

While working on my proposal for Idea 4, "Chat with your note collection", I noticed some directions that align closely with what you have described here.

The shared embedding index idea makes a lot of architectural sense to me. I had initially planned a self-contained pipeline for the chat feature, but building the index as a shared layer that chat, search, and categorisation all consume is clearly the better approach. One process, one memory budget, and any retrieval improvements benefit every downstream feature at once. What I find particularly exciting about this direction is what comes after: once the shared index is stable and its API is published, every other AI plugin could migrate to it in a follow-up update rather than maintaining its own separate index. That feels like a meaningful step toward a coherent AI layer for Joplin rather than a collection of isolated features.

For the retrieval pipeline, hybrid BM25 and vector scoring with cross-encoder reranking were already in my plan. Your post pushed me to also look into Relevant Segment Extraction and query decomposition, both of which feel particularly well suited to the chat use case where questions tend to be more conversational and multi-part than a typical search query.

One thing I have been exploring independently is semantic chunking as an alternative to fixed-size splitting. Rather than cutting notes every N characters, the idea is to split at natural sentence and paragraph boundaries using embedding similarity between adjacent sentences, so each chunk stays semantically coherent. Given that Joplin notes vary widely in length and structure, from short to-do lists to long clipped articles, I think the chunking strategy matters more here than in a typical RAG setup and is worth getting right as part of the foundation.
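
A minimal sketch of that semantic-chunking idea, with `Embedder` as a hypothetical stand-in for any sentence embedder:

```typescript
// Hedged sketch of semantic chunking: start a new chunk wherever the
// cosine similarity between adjacent sentence embeddings drops below
// a threshold, so each chunk stays semantically coherent.
type Embedder = (sentence: string) => number[];

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function semanticChunks(
  sentences: string[],
  embed: Embedder,
  threshold = 0.6,
): string[][] {
  const chunks: string[][] = [];
  let current: string[] = [];
  let prev: number[] | null = null;
  for (const s of sentences) {
    const v = embed(s);
    if (prev && cosine(prev, v) < threshold) {
      chunks.push(current); // topic shift: close the current chunk
      current = [];
    }
    current.push(s);
    prev = v;
  }
  if (current.length) chunks.push(current);
  return chunks;
}
```

A real implementation would likely also cap chunk length and keep some overlap, but the core boundary decision is this similarity test.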

The agent-based search direction is the one I find most interesting. Using an LLM to reason over Joplin's existing tools rather than relying purely on a vector index feels genuinely complementary rather than a replacement, and I would love to explore it as a stretch goal within this project.

According to the GSoC guidelines, we should avoid project dependencies:

Don’t select multiple people for the same project idea: If two GSoC contributors are working on the exact same project then they are competing with each other. Likewise, don’t make one person’s project dependent on another person’s project, that essentially makes it a team project which is not allowed or in the best interest of the GSoC contributors.

— Selecting a GSoC contributor | Google Summer of Code Guides (Emphasis added)

Given this, perhaps it could make sense to:

  • Have a "shared infrastructure" project.
  • Design other GSoC projects so that they:
    • Initially don't use the "shared infrastructure" project.
    • Can migrate some functionality to APIs provided by the "shared infrastructure" project in the future (if/when it's completed).
3 Likes

The idea of having a shared embedding index makes sense, but the way each idea uses embeddings is a bit different. The ideas can be grouped into two types:

  1. Chunk based projects
    Idea 1 (AI Search) and Idea 4 (Chat with notes) need chunk level embeddings, since they retrieve specific parts of notes.
  2. Note based projects
    Idea 3 (Categorisation) and Idea 2 (Note graphs) mostly compare whole notes, so a single vector per note is enough. This can be created by averaging the chunk embeddings.
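
The chunk-to-note aggregation in point 2 can be sketched as simple mean pooling (assuming all chunk embeddings share one dimensionality):

```typescript
// Hedged sketch: derive a single note-level vector by averaging that
// note's chunk embeddings.
function meanPool(chunkVectors: number[][]): number[] {
  if (chunkVectors.length === 0) return [];
  const dim = chunkVectors[0].length;
  const sum = new Array(dim).fill(0);
  for (const v of chunkVectors) {
    for (let i = 0; i < dim; i++) sum[i] += v[i];
  }
  return sum.map(x => x / chunkVectors.length);
}
```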

At the same time, the retrieval logic is not the same for every idea. For example, search and chat would use similarity-based retrieval (possibly with reranking and RAG), while categorisation would use clustering.

3 Likes

That makes sense. We can ask students to keep this in mind when designing their project, make the "glue" code between their implementation and the AI swappable.

Edit: In which case I guess we can make it a project anyway, but with no requirement for other projects to consider the API under development.

4 Likes

I second this!

1 Like

Hi everyone, thanks to @laurent for pointing me to this thread!

I completely agree with @shikuz that building a shared embedding index is the smartest way to prevent memory bloat and keep Joplin lightweight. Having a single local database and a unified API key manager for all the AI plugins makes perfect sense.

In my current draft for the AI Chat plugin, I designed a local-first pipeline using sqlite-vec and a background Web Worker to handle embeddings without freezing the app's UI. I would be more than happy to help adapt this into a shared backend for all the plugins to use. Alternatively, if we build the shared index as a separate core project, I am fully prepared to collaborate early on the API boundaries so my chat UI can seamlessly consume it.

I am very flexible and excited to see how we can all collaborate to build a unified AI ecosystem for Joplin this summer!

Thanks for pointing this out! It's a crucial constraint (and makes sense).

I agree. If we can build modular projects, compatible with some predefined basic interface (such as put(note) and query(text)), then post-GSoC we can pick the best retrieval implementation and share it. Projects can build their own simple pipelines to remain independent and unblocked, but everyone targets the same interface.
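
To illustrate, a minimal version of such an interface might look like this in TypeScript (the names and shapes are purely illustrative, not a settled API), together with a toy in-memory implementation that projects could use as a stand-in:

```typescript
// Hedged sketch of the shared retrieval interface suggested above.
interface NoteInput {
  id: string;
  title: string;
  body: string;
}

interface RetrievedChunk {
  noteId: string;
  text: string;
  score: number;
}

interface RetrievalIndex {
  put(note: NoteInput): Promise<void>;   // index or re-index a note
  remove(noteId: string): Promise<void>; // drop a deleted note
  query(text: string, k?: number): Promise<RetrievedChunk[]>;
}

// Toy keyword-overlap implementation, usable as a test double while
// the real shared index is under development.
class InMemoryIndex implements RetrievalIndex {
  private notes = new Map<string, NoteInput>();
  async put(note: NoteInput) { this.notes.set(note.id, note); }
  async remove(noteId: string) { this.notes.delete(noteId); }
  async query(text: string, k = 10): Promise<RetrievedChunk[]> {
    const terms = text.toLowerCase().split(/\s+/);
    return [...this.notes.values()]
      .map(n => ({
        noteId: n.id,
        text: n.body,
        score: terms.filter(t => n.body.toLowerCase().includes(t)).length,
      }))
      .filter(c => c.score > 0)
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }
}
```

A chat or search project could ship against `InMemoryIndex` (or its own simple pipeline behind the same interface) and swap in the shared implementation later without touching its consumers.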

Maybe the following would make for a natural split between projects.

Infrastructure project (the retrieval layer):

  • Chunking, embedding, storage, incremental indexing

  • Retrieval improvements: query decomposition / reranking / hybrid keyword + vector scoring / relevant segment extraction / other

  • Demo consumer: search / related notes (to prove the index works)

Chat project (the conversational layer):

  • Replaceable: Chunking, embedding, storage, incremental indexing

  • Context assembly from retrieved results

  • Token budget management

  • Conversation state and multi-turn coherence

  • Streaming chat UI

  • Prompt engineering and grounding

The chat project builds its own simple pipeline initially (chunk, embed, retrieve by similarity), targeting the same interface. If the infrastructure project delivers something better, migration is straightforward. This means the chat project is about making the conversation actually good: how do you assemble context, manage a multi-turn dialogue, and present results in a way that's useful.

I think MCP makes sense for external clients (like Claude Desktop or other apps talking to Joplin from outside). But for search running inside Joplin (if this is what you referred to), it's simpler than that. You describe Joplin's search capabilities as tools that the LLM can call directly. Something like:

{
  "name": "search_notes",
  "description": "Search notes using Joplin's query syntax. Supports keywords, tag filters (#tag), notebook scoping (notebook:name), date ranges (created:day-1), and type filters (type:todo).",
  "parameters": {
    "query": { "type": "string", "description": "Joplin search query" },
    "limit": { "type": "number", "description": "Max results to return" }
  }
}

The LLM gets this tool, figures out which queries to run (maybe combining keyword search with tag filters), and calls them via the Data API. No embedding index needed, no MCP, just the LLM defining the query. Theoretically the entire Data API can be described this way.

I like the idea of letting the LLM choose between keyword search and embedding-based retrieval depending on the query. You just don't need MCP to do it internally. You can give the model both a search_notes tool and a query_embeddings tool and let it decide.
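
As a sketch of that two-tool setup (tool shapes follow common function-calling conventions; the names and the dispatch helper are illustrative, not an existing Joplin API):

```typescript
// Hedged sketch: expose both retrieval paths as tools and let the
// model pick which to call for a given query.
const tools = [
  {
    name: 'search_notes',
    description: "Keyword search using Joplin's query syntax (tags, notebooks, dates).",
    parameters: { query: 'string', limit: 'number' },
  },
  {
    name: 'query_embeddings',
    description: 'Semantic search over the shared embedding index.',
    parameters: { query: 'string', topK: 'number' },
  },
];

type ToolImpl = (args: Record<string, unknown>) => Promise<unknown>;

// Route a tool call emitted by the model to its implementation.
async function dispatch(
  call: { name: string; args: Record<string, unknown> },
  impls: Record<string, ToolImpl>,
): Promise<unknown> {
  const impl = impls[call.name];
  if (!impl) throw new Error(`Unknown tool: ${call.name}`);
  return impl(call.args);
}
```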

Worth noting that this isn't limited to search. Chat with your notes could theoretically also benefit from tool-based retrieval alongside embeddings. The two approaches complement each other well.

This is just one way to think about it, but the nice thing is that agent-based search can be independent from the embedding-based projects.

Good distinction. I agree that chunk → note aggregation is possible, and that perhaps multiple retrieval options are worth considering.

1 Like

Great conversation @shikuz! Good thinking @developerzohaib786, but maybe cache instead of user data. I’ll get into that in a minute.

I’ve been working on a Joplin Server MCP sidecar container I call joplin-server-vector-memory. I didn’t know about GSoC when I started, but I’ve made a lot of great progress. I’ve implemented Joplin sync, local and Ollama embeddings, search, and quite a bit more. My personal goal with my own project is to prevent vendor lock-in: to use an MCP server with my AIs instead of the vendor’s own platform, allowing me to freely change between services (Gemini, OpenAI, Claude) while retaining my notes.

Laurent recommended this thread after I posted this on Discord earlier.

Far be it from me to try to jump in and take opportunities from students. However, I'd like to point out that each of the items:

  1. AI-supported search for notes
  2. AI-Generated note graphs
  3. AI-based categorisation
  4. Chat with your note collection using AI
  5. Automatically label images using AI

will each require a centralized vector database with sync capabilities, or the users will be throwing a lot of money at the problem from their end.

Examples:

  • Search = vector db results
  • Note graphs = analysis of vector distances
  • Categorization = vector search analysis of topics
  • chat = attach an AI to your vector database
  • label images = analyze images with a vision model and attach to metadata before vector database

My project is exclusively Joplin Server related, and not for all clients (e.g. mobile/desktop); it would absolutely require a Joplin Server instance. For this reason I don't think it's the best option for Joplin Notes. But as a user of Joplin, I am very invested and interested to see these AI projects succeed.

In order to prevent duplication, size explosion, and high cost, it might be a good idea to lay out the vector DB plan or guidelines ahead of time. If these are to be standardized features, they will all need access to all the data on all platforms and should not be worked on independently from the ground up but rather on a tandem, unified platform.

I made some pretty extensible choices in joplin-server-vector-memory, such as sqlite-vec, but it won’t go directly to all platforms. My solution is intended to be an MCP server extension, and requires Joplin Server or Joplin Cloud in order to function properly, and it currently uses a user-marriage pattern where 1 container = 1 user. But let’s talk about patterns that work.

I figure I’ll drop in what I know and you guys can build upon it or ask questions

  1. Components
    1. The Joplin Data
      1. This is required to do vectorization
      2. synced from server or in-app
    2. Vector Database
      1. a set of numbers representing concepts, linked to data
      2. Represents concepts and links data together in various ways across multi-dimensional arrays
    3. Embedding/reranking/chat/vision model - one size doesn’t fit all
      1. Don’t lock the user into a single choice; it may not be available due to:
        1. Corporate policies
        2. Geographical region
        3. Political reasons
        4. Export controls
        5. Import bans
        6. marketplace restrictions
        7. ….
      2. Do allow:
        1. Local - usually slow but free
        2. Ollama - faster but requires hardware
        3. APIs - OpenAI, Gemini, Claude - Usually fastest but cost money
    4. APIs
      1. While this may be an obvious thing, it’s underrated.
      2. A well designed API will be extensible to allow any access required in the future
      3. An MCP is NOT an API, and don’t treat it as such - MCP is the UX for the AI.
  2. Vector Databases
    1. Native mobile
      1. sqlite-vec (and its predecessor sqlite-vss) - All platforms (Embedded, Mobile, Desktop, Server)
        1. My choice because it's probably the most common database for other purposes on mobile. More common = better support in general.
      2. ObjectBox - All platforms (Native Mobile Android/iOS, Edge, IoT, Server)
      3. Couchbase Lite - All platforms (Native Mobile, Edge, Desktop)
      4. Faiss (Meta) - Server, Desktop, Mobile (C++ library, requires custom compilation/NDK for Android and iOS)
    2. Not native mobile
      1. ChromaDB - Cloud or Server - Not native mobile
      2. Milvus - Cloud or Server - Not native mobile (Highly scalable distributed architecture)
      3. Qdrant - Cloud, Server, or Edge - Not native mobile (Rust-based, runs on lightweight edge servers but not embedded on devices)
      4. Weaviate - Cloud or Server - Not native mobile
      5. Pgvector (PostgreSQL extension) - Cloud or Server - Not native mobile
      6. Pinecone - Cloud only (SaaS)
  3. Database treatment
    1. The Vector Database itself should be treated as generated content or cache
      1. Any vector db errors → blow it away and get a fresh one
        1. It’s important the handling is properly unit tested for this reason.
      2. Changes to embeddings model → blow it away and get a fresh one
      3. Updates to embeddings model → blow it away and get a fresh one
      4. Changes to chunk size, or overlap → blow it away and get a fresh one
      5. Before making changes, the user should be aware that the change will require data and time/cost.
    2. Vector DB Generation
      1. Big providers cost money
        1. users should know how many estimated tokens/MB before starting the process
      2. On-mobile costs time (quite a bit)
        1. New devices (Pixel/Galaxy) come with embedding capabilities. They’re basic, but work.
  4. Usage
    1. For AI chat (RAG)
      1. Weigh the context load based on
        1. chunk size and what’s presented to the AI model
        2. number of results
      2. Find a balance of conversational size, expected context length, and number of results.
    2. Search
      1. Methods
        1. Traditional Search - Find the term in the database
        2. Full vector - Query the vector db, and get back results and snippets
        3. Hybrid search - It may be useful to have a hybrid search balance slider where
          1. 0=Keyword only
          2. 1=Vector only
      2. top-k - When used in combination with the chunk size, this can be used to regulate context size or context
      3. Reranking - search and assign relevance within the vector results
        1. On smaller models (e.g. 2-4B on-device models) reranking is important to ensure the model context is managed. Smaller models have no real understanding of what’s important.
        2. On larger models (gemini/chat-gpt), reranking just takes extra time. They will know how to use the results.
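
The hybrid search balance slider mentioned in the search section above reduces to a simple linear blend. A minimal sketch, assuming both scores are already normalised to [0, 1]:

```typescript
// Hedged sketch of the hybrid balance slider: 0 = keyword only,
// 1 = vector only, anything between blends the two.
function blend(keywordScore: number, vectorScore: number, slider: number): number {
  const s = Math.min(1, Math.max(0, slider)); // clamp the slider to [0, 1]
  return (1 - s) * keywordScore + s * vectorScore;
}
```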

I hope this helps

6 Likes

To elaborate on the difference between MCP and APIs a bit; there is a difference between API and MCP calls. The MCP is the primary interface for the AI.

APIs:

We all know what APIs are. They’re programmatic access to an application. You make one method and reuse it so it’s standardized.

  • APIs are designed to be fast, extensible, and universal.
  • The user will never notice the difference between 1 or 2 API calls.

MCP:

An MCP is the AI’s interface to your app. It should be optimized for the “AI Journey” and friction should only be applied where necessary.

  • MCP calls cost tokens, which equates to time or time+money.
  • Input Tokens are cheaper than Output Tokens. The MCP should be weighted toward this, but balanced so it doesn’t kill the context window.
  • Multiple MCP calls should be avoided unless you’re attempting to introduce friction.
  • MCP should be tuned for optimal AI Experience and saving context
    • Don’t list all your APIs and then some - this eats context
    • Make the method name and variables required as descriptive as possible. You get the method signature and a description to convey it. eg
      • search(query, type="minimal")
        Use this method to search; use any string for the query. Values for type are “full”, “minimal”, or “keys-only”.

Reason

When the MCP server is initialized, it appears as an instruction manual for the AI. Large models can handle large MCPs well; small models, not so much. An MCP should be basic, forgiving, and require a very short description.
You don’t want

  • search notes
  • get note metadata
  • get note id
  • get note contents

You want what I call Negative Friction: the AI conducts a search and gets everything it needs. Maybe it has to do a “get note” follow-up.

You have to tune the MCP for the expected actions and journey you want an AI to take.

4 Likes

@shikuz this is a really useful framing. The note graph project would benefit directly from a shared embedding index since the graph construction step (cosine similarity, Louvain, centrality) sits on top of the same embedding layer. If the index exists, the plugin can skip generating embeddings entirely and just consume the vectors, which cuts first-run cost and memory footprint significantly.

1 Like

I've been thinking along exactly the same lines.

I've been experimenting with a vectorless approach inspired by PageIndex. Instead of embedding chunks, it builds a hierarchical tree index from note headings and uses an LLM to reason over that tree — no vector DB, no chunking.

Since Joplin is markdown-first, the structure is already there in the headings. I rebuilt it in TypeScript and early experiments show promising retrieval accuracy on markdown documents.

This feels close to your "agent-based search" idea — reasoning over structure rather than similarity scores. Do you think this direction is worth a GSoC proposal, or would you recommend focusing on the vector pipeline improvements instead?

I suppose that might work if you intentionally write your notes a certain way. Personally I have a bunch of “Monday meeting” and “xxx project meeting” notes. The idea of semantic search is that you can have your AI talk about taking your pet to the vet, and it will be able to use the vector database’s semantic understanding to find similar concepts, e.g.

Pet=cat,dog,animal..

Vet=doctor, veterinarian, office, appointment…

Titles don't capture the semantic nuance. But that does seem like a good fallback thing for non-AI mobile devices when offline.