GSoC 2026: Opportunities for the AI projects

Thanks for pointing this out! It's a crucial constraint (and makes sense).

I agree. If we build modular projects that target a shared minimal interface (say, put(note) and query(text)), then post-GSoC we can pick the best retrieval implementation and share it. Each project can build its own simple pipeline to stay independent and unblocked, but everyone targets the same interface.
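
As a sketch, the shared interface could be as small as the following. The names (put, query, NoteIndex) are illustrative, not an agreed API:

```typescript
// Hypothetical shared retrieval interface; names are illustrative.
interface RetrievedChunk {
  noteId: string;
  text: string;
  score: number; // higher = more relevant
}

interface NoteIndex {
  // Index (or re-index) a note whenever it is created or updated.
  put(note: { id: string; title: string; body: string }): Promise<void>;
  // Return the most relevant chunks for a free-text query.
  query(text: string, limit?: number): Promise<RetrievedChunk[]>;
}
```

Any project can swap in a better implementation later as long as it satisfies this contract.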

Maybe the following would make for a natural split between projects.

Infrastructure project (the retrieval layer):

  • Chunking, embedding, storage, incremental indexing

  • Retrieval improvements: query decomposition / reranking / hybrid keyword + vector scoring / relevant segment extraction / other

  • Demo consumer: search / related notes (to prove the index works)
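
To make the hybrid keyword + vector scoring item concrete, here is a minimal sketch. The blending weight, normalization, and function names are assumptions to be tuned empirically, not a proposed design:

```typescript
// Sketch of hybrid scoring: blend a keyword score with a vector
// similarity score. Weight and normalization are illustrative.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function hybridScore(
  keywordScore: number, // e.g. a normalized BM25-style score
  queryVec: number[],
  chunkVec: number[],
  alpha = 0.5, // balances keyword vs. vector relevance
): number {
  return alpha * keywordScore + (1 - alpha) * cosine(queryVec, chunkVec);
}
```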

Chat project (the conversational layer):

  • A replaceable baseline pipeline: chunking, embedding, storage, incremental indexing

  • Context assembly from retrieved results

  • Token budget management

  • Conversation state and multi-turn coherence

  • Streaming chat UI

  • Prompt engineering and grounding

The chat project builds its own simple pipeline initially (chunk, embed, retrieve by similarity), targeting the same interface. If the infrastructure project delivers something better, migration is straightforward. This makes the chat project about making the conversation actually good: how do you assemble context, manage a multi-turn dialogue, and present results in a way that is useful?
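The baseline pipeline really can be this small. A sketch, where embed() is a stand-in for a real embedding model call and the fixed-size chunking is deliberately naive (a real version would respect note structure):

```typescript
// Minimal baseline pipeline sketch: fixed-size chunking plus
// similarity retrieval. embed() is a stand-in for a real model.
function chunk(body: string, size = 200): string[] {
  const out: string[] = [];
  for (let i = 0; i < body.length; i += size) out.push(body.slice(i, i + size));
  return out;
}

async function retrieve(
  query: string,
  index: { text: string; vec: number[] }[],
  embed: (t: string) => Promise<number[]>,
  limit = 5,
) {
  const qv = await embed(query);
  const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);
  return index
    .map((c) => ({ text: c.text, score: dot(qv, c.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Everything above the retrieval call (context assembly, token budgeting, conversation state) stays the same when a better index arrives.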

I think MCP makes sense for external clients (like Claude Desktop or other apps talking to Joplin from outside). But for search running inside Joplin (if that's what you were referring to), it's simpler than that: you describe Joplin's search capabilities as tools that the LLM can call directly. Something like:

{
  "name": "search_notes",
  "description": "Search notes using Joplin's query syntax. Supports keywords, tag filters (#tag), notebook scoping (notebook:name), date ranges (created:day-1), and type filters (type:todo).",
  "parameters": {
    "query": { "type": "string", "description": "Joplin search query" },
    "limit": { "type": "number", "description": "Max results to return" }
  }
}

The LLM gets this tool, figures out which queries to run (perhaps combining keyword search with tag filters), and calls them via the Data API. No embedding index needed, no MCP; the LLM just composes the queries. In theory, the entire Data API could be described this way.
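Executing such a tool call is a thin wrapper over the Data API's search endpoint. A sketch, where the port and token are placeholders for the locally configured values:

```typescript
// Sketch: turning an LLM tool call into a Joplin Data API request.
// Port 41184 is Joplin's default Web Clipper service port; the token
// comes from the user's local configuration.
function buildSearchUrl(
  args: { query: string; limit?: number },
  token: string,
): string {
  return (
    `http://localhost:41184/search?query=${encodeURIComponent(args.query)}` +
    `&limit=${args.limit ?? 10}&token=${token}`
  );
}

async function runSearchNotes(args: { query: string; limit?: number }): Promise<unknown> {
  const res = await fetch(buildSearchUrl(args, process.env.JOPLIN_TOKEN ?? ""));
  return res.json(); // { items: [...], has_more: boolean }
}
```

The model's JSON arguments map directly onto the query string; nothing else sits between the LLM and Joplin's existing search.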

I like the idea of letting the LLM choose between keyword search and embedding-based retrieval depending on the query; you just don't need MCP to do it internally. Give the model both a search_notes tool and a query_embeddings tool and let it decide.
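Internally that's just a small dispatch table. A sketch, where both tool names and the dispatch shape are illustrative:

```typescript
// Sketch: expose both retrieval paths as tools and let the model pick.
// Tool names and handler bodies are placeholders.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const tools: Record<string, ToolHandler> = {
  // Keyword search via the Data API would go here.
  search_notes: async () => [],
  // Similarity search against the embedding index would go here.
  query_embeddings: async () => [],
};

async function dispatch(call: { name: string; args: Record<string, unknown> }) {
  const handler = tools[call.name];
  if (!handler) throw new Error(`Unknown tool: ${call.name}`);
  return handler(call.args);
}
```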

Worth noting that this isn't limited to search. Chat with your notes could theoretically also benefit from tool-based retrieval alongside embeddings. The two approaches complement each other well.

This is just one way to think about it, but the nice thing is that agent-based search can be independent from the embedding-based projects.

Good distinction. I agree that chunk → note aggregation is possible, and that multiple retrieval options may be worth considering.
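
For the aggregation part, one simple option (among several, e.g. summing or averaging chunk scores) is to keep the best chunk score per note. A sketch:

```typescript
// Sketch of chunk-to-note aggregation: take the max chunk score per
// note so results can be presented at the note level. Max is one
// strategy; sum or mean are alternatives worth comparing.
function aggregateByNote(
  chunks: { noteId: string; score: number }[],
): { noteId: string; score: number }[] {
  const best = new Map<string, number>();
  for (const c of chunks) {
    best.set(c.noteId, Math.max(best.get(c.noteId) ?? -Infinity, c.score));
  }
  return [...best.entries()]
    .map(([noteId, score]) => ({ noteId, score }))
    .sort((a, b) => b.score - a.score);
}
```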