GSoC 2026: Opportunities for the AI projects

Thanks for pointing this out! It's a crucial constraint (and makes sense).

I agree. If we build modular projects that target a shared minimal interface (say, put(note) and query(text)), then post-GSoC we can pick the best retrieval implementation and share it. Each project can build its own simple pipeline to stay independent and unblocked, but everyone targets the same interface.
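
As a sketch, the shared interface could be as small as the following. The names (put, query, NoteIndex) are illustrative, not an agreed API:

```typescript
// Hypothetical shared retrieval interface; names are illustrative.
interface RetrievedChunk {
  noteId: string;
  text: string;
  score: number; // higher = more relevant
}

interface NoteIndex {
  // Index (or re-index) a note whenever it is created or updated.
  put(note: { id: string; title: string; body: string }): Promise<void>;
  // Return the most relevant chunks for a free-text query.
  query(text: string, limit?: number): Promise<RetrievedChunk[]>;
}
```

Any project can swap in a better implementation later as long as it satisfies this contract.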

Maybe the following would make for a natural split between projects.

Infrastructure project (the retrieval layer):

  • Chunking, embedding, storage, incremental indexing

  • Retrieval improvements: query decomposition / reranking / hybrid keyword + vector scoring / relevant segment extraction / other

  • Demo consumer: search / related notes (to prove the index works)
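
To make the hybrid keyword + vector scoring item concrete, here is a minimal sketch. The blending weight, normalization, and function names are assumptions to be tuned empirically, not a proposed design:

```typescript
// Sketch of hybrid scoring: blend a keyword score with a vector
// similarity score. Weight and normalization are illustrative.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

function hybridScore(
  keywordScore: number, // e.g. a normalized BM25-style score
  queryVec: number[],
  chunkVec: number[],
  alpha = 0.5, // balances keyword vs. vector relevance
): number {
  return alpha * keywordScore + (1 - alpha) * cosine(queryVec, chunkVec);
}
```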

Chat project (the conversational layer):

  • A replaceable baseline pipeline: chunking, embedding, storage, incremental indexing

  • Context assembly from retrieved results

  • Token budget management

  • Conversation state and multi-turn coherence

  • Streaming chat UI

  • Prompt engineering and grounding

The chat project builds its own simple pipeline initially (chunk, embed, retrieve by similarity), targeting the same interface. If the infrastructure project delivers something better, migration is straightforward. This makes the chat project about making the conversation actually good: how do you assemble context, manage a multi-turn dialogue, and present results in a way that is useful?
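The baseline pipeline really can be this small. A sketch, where embed() is a stand-in for a real embedding model call and the fixed-size chunking is deliberately naive (a real version would respect note structure):

```typescript
// Minimal baseline pipeline sketch: fixed-size chunking plus
// similarity retrieval. embed() is a stand-in for a real model.
function chunk(body: string, size = 200): string[] {
  const out: string[] = [];
  for (let i = 0; i < body.length; i += size) out.push(body.slice(i, i + size));
  return out;
}

async function retrieve(
  query: string,
  index: { text: string; vec: number[] }[],
  embed: (t: string) => Promise<number[]>,
  limit = 5,
) {
  const qv = await embed(query);
  const dot = (a: number[], b: number[]) => a.reduce((s, x, i) => s + x * b[i], 0);
  return index
    .map((c) => ({ text: c.text, score: dot(qv, c.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}
```

Everything above the retrieval call (context assembly, token budgeting, conversation state) stays the same when a better index arrives.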

I think MCP makes sense for external clients (like Claude Desktop or other apps talking to Joplin from outside). But for search running inside Joplin (if that's what you were referring to), it's simpler than that: you describe Joplin's search capabilities as tools that the LLM can call directly. Something like:

{
  "name": "search_notes",
  "description": "Search notes using Joplin's query syntax. Supports keywords, tag filters (#tag), notebook scoping (notebook:name), date ranges (created:day-1), and type filters (type:todo).",
  "parameters": {
    "query": { "type": "string", "description": "Joplin search query" },
    "limit": { "type": "number", "description": "Max results to return" }
  }
}

The LLM gets this tool, figures out which queries to run (perhaps combining keyword search with tag filters), and calls them via the Data API. No embedding index needed, no MCP; the LLM just composes the queries. In theory, the entire Data API could be described this way.
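Executing such a tool call is a thin wrapper over the Data API's search endpoint. A sketch, where the port and token are placeholders for the locally configured values:

```typescript
// Sketch: turning an LLM tool call into a Joplin Data API request.
// Port 41184 is Joplin's default Web Clipper service port; the token
// comes from the user's local configuration.
function buildSearchUrl(
  args: { query: string; limit?: number },
  token: string,
): string {
  return (
    `http://localhost:41184/search?query=${encodeURIComponent(args.query)}` +
    `&limit=${args.limit ?? 10}&token=${token}`
  );
}

async function runSearchNotes(args: { query: string; limit?: number }): Promise<unknown> {
  const res = await fetch(buildSearchUrl(args, process.env.JOPLIN_TOKEN ?? ""));
  return res.json(); // { items: [...], has_more: boolean }
}
```

The model's JSON arguments map directly onto the query string; nothing else sits between the LLM and Joplin's existing search.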

I like the idea of letting the LLM choose between keyword search and embedding-based retrieval depending on the query; you just don't need MCP to do it internally. Give the model both a search_notes tool and a query_embeddings tool and let it decide.
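Internally that's just a small dispatch table. A sketch, where both tool names and the dispatch shape are illustrative:

```typescript
// Sketch: expose both retrieval paths as tools and let the model pick.
// Tool names and handler bodies are placeholders.
type ToolHandler = (args: Record<string, unknown>) => Promise<unknown>;

const tools: Record<string, ToolHandler> = {
  // Keyword search via the Data API would go here.
  search_notes: async () => [],
  // Similarity search against the embedding index would go here.
  query_embeddings: async () => [],
};

async function dispatch(call: { name: string; args: Record<string, unknown> }) {
  const handler = tools[call.name];
  if (!handler) throw new Error(`Unknown tool: ${call.name}`);
  return handler(call.args);
}
```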

Worth noting that this isn't limited to search. Chat with your notes could theoretically also benefit from tool-based retrieval alongside embeddings. The two approaches complement each other well.

This is just one way to think about it, but the nice thing is that agent-based search can be independent from the embedding-based projects.

Good distinction. I agree that chunk → note aggregation is possible, and that multiple retrieval options may be worth considering.
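
For the aggregation part, one simple option (among several, e.g. summing or averaging chunk scores) is to keep the best chunk score per note. A sketch:

```typescript
// Sketch of chunk-to-note aggregation: take the max chunk score per
// note so results can be presented at the note level. Max is one
// strategy; sum or mean are alternatives worth comparing.
function aggregateByNote(
  chunks: { noteId: string; score: number }[],
): { noteId: string; score: number }[] {
  const best = new Map<string, number>();
  for (const c of chunks) {
    best.set(c.noteId, Math.max(best.get(c.noteId) ?? -Infinity, c.score));
  }
  return [...best.entries()]
    .map(([noteId, score]) => ({ noteId, score }))
    .sort((a, b) => b.score - a.score);
}
```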