GSoC 2026 Proposal – Idea 1: AI-Supported Search for Notes in Joplin
Bysani Vedavyas
Links:
- GitHub: 18ToNyStArK18 (Vedavyas) · GitHub
- Pull Requests: fix: show correct notebook context menu options when server is offline by 18ToNyStArK18 · Pull Request #14937 · laurent22/joplin · GitHub
- Project Idea: gsoc/ideas.md at master · joplin/gsoc · GitHub
- Forum Introduction: Welcome to GSoC 2026 with Joplin! - #136 by 18ToNyStArK18
1. Introduction
My name is Bysani Vedavyas and I am a computer science and engineering student at the International Institute of Information Technology, Hyderabad (IIITH). I am particularly interested in this project because of my keen interest in Information Retrieval. I have previously built projects using RAG (Retrieval-Augmented Generation), so I understand its advantages, disadvantages, implementation, and testing procedures well.
My previous experience in these areas includes the following.
AI-Powered Adaptive Tutor
I built a platform using RAG where I handled document chunking, embedding generation, and vector similarity search to provide context-aware answers.
Open Source Contributor
I am comfortable with large codebases, having worked on the xv6 kernel which helps me navigate the "core application" integration mentioned in the requirements.
2. Project Summary
What problem it solves
The goal of this project is to build an AI-powered search for a note-taking application. Instead of the existing full-text search or keyword-based search, the app should have a search engine that understands the intent behind a user's query. For example, searching for notes from a meeting with a German company in 2019 should be as easy as typing "meeting with a German company in 2019" without the need to remember a word of the note itself.
Why it matters to users
Users often do not remember the exact keywords or title of a note, but they do remember what the note is about. In that case they should be able to find the note by describing its content. Keyword-based search cannot help here, so a separate search technique is needed.
What will be implemented
The implementation adds one more AI-based search technique on top of the existing keyword search, built on PageIndex.
PageIndex is a vectorless, reasoning-based RAG system that generates a "Table of Contents" tree index of documents and performs reasoning-based retrieval through tree search. This approach does not involve a vector database and is thus cheaper and more portable. It does not rely on similarity measures; instead, it uses the meaning and intent within the data to evaluate relevance. It also makes it easy to retrieve the exact relevant portion of a document, which is important in a search engine.
How PageIndex works:
PageIndex works in two phases — indexing and retrieval.
During indexing, each note is broken into logical chunks (paragraphs, media). An LLM generates a one-line summary for each chunk. These summaries are stored in a tree structure called the Table of Contents (TOC), where the root has notes as children, and notes have their chunks as children.
During retrieval, when the user types a natural language query, the query and the TOC are passed to an LLM. The LLM walks the tree top-down: it first decides which notes are relevant, then which chunks inside those notes. It returns only the IDs of the relevant chunks rather than reading every note in full. This makes it fast even for large collections.
This approach is vectorless — it does not use embeddings or a vector database. Instead it relies on the LLM's reasoning ability to judge relevance from the summaries alone, which makes it cheaper, more portable, and easier to run fully on-device with a local model served through Ollama.
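The top-down walk described above can be sketched as a pure function, with the LLM replaced by a pluggable `judge` callback so the control flow is visible without a model. All names here are illustrative, not Joplin's or PageIndex's real API:

```typescript
interface TocNode {
	id: string;
	summary: string;
	children: TocNode[];
}

// Stand-in for the navigator LLM: given a query and candidate
// summaries, it returns the IDs it judges relevant.
type RelevanceJudge = (query: string, candidates: { id: string; summary: string }[]) => string[];

// Walk the TOC top-down: first pick relevant notes, then relevant
// chunks inside only those notes. Returns chunk IDs.
function retrieve(root: TocNode, query: string, judge: RelevanceJudge): string[] {
	const noteIds = judge(query, root.children.map(n => ({ id: n.id, summary: n.summary })));
	const results: string[] = [];
	for (const note of root.children) {
		if (!noteIds.includes(note.id)) continue; // prune irrelevant notes early
		const chunkIds = judge(query, note.children.map(c => ({ id: c.id, summary: c.summary })));
		results.push(...chunkIds);
	}
	return results;
}
```

With a toy keyword-matching judge standing in for the LLM, a query such as "meeting" visits only the chunks of the one matching note, never touching the others.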
Adapting PageIndex for Joplin
PageIndex is designed for single long documents, not a collection of notes. The key adaptation is mapping Joplin's hierarchy — notebooks, sub-notebooks, and notes — to the TOC tree structure. The root node is a dummy node whose children are the user's notebooks. Each notebook's children are its notes and sub-notebooks. Each note's children are its content chunks (paragraphs, list items, media). This means the navigator LLM can prune entire notebooks early in the tree search without having to evaluate every note, which keeps retrieval fast even for large collections.
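The mapping above can be sketched as follows. `Folder` and `Note` are simplified stand-ins for Joplin's real models, though the `parent_id` semantics (empty string for top-level notebooks, containing-notebook ID for notes) mirror the actual schema:

```typescript
interface TreeNode { id: string; label: string; children: TreeNode[]; }
interface Folder { id: string; title: string; parent_id: string; } // '' = top-level notebook
interface Note { id: string; title: string; parent_id: string; }   // id of containing notebook

// Build the dummy-rooted TOC: notebooks under the root (or their
// parent notebook), notes under their containing notebook. Chunk
// nodes would hang off each note in a later indexing pass.
function buildToc(folders: Folder[], notes: Note[]): TreeNode {
	const root: TreeNode = { id: 'root', label: '', children: [] };
	const byId = new Map<string, TreeNode>();
	for (const f of folders) byId.set(f.id, { id: f.id, label: f.title, children: [] });
	// Attach notebooks to their parent notebook, or to the dummy root.
	for (const f of folders) (byId.get(f.parent_id) ?? root).children.push(byId.get(f.id)!);
	// Attach notes under their containing notebook.
	for (const n of notes) (byId.get(n.parent_id) ?? root).children.push({ id: n.id, label: n.title, children: [] });
	return root;
}
```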
Expected outcome
By the end of GSoC, Joplin will have a working AI-powered, context-based search engine integrated into the main application. Users will be able to search their notes by describing them in plain English.
Out of Scope
- Voice based search
- Filters + context based search
3. Technical Approach
Architecture & Components
The implementation adds a fourth search mode (SEARCH_TYPE_CONTEXT) alongside the three existing modes in SearchEngine.ts. It introduces two new classes and one new SQLite table, touching the existing codebase minimally.
New components:
- `AiIndexer.ts` — splits notes into chunks and generates summaries via LLM
- `AiSearcher.ts` — loads the TOC from the database and runs reasoning-based retrieval
- `notes_ai_index` table — stores the multi-node TOC tree in SQLite
Modified files:
- `SearchEngine.ts` — adds the new search type, wires `AiIndexer` into `syncTables_()`, and adds the new branch in `search()`
- `JoplinDatabase.ts` — adds the migration to create `notes_ai_index`
Changes to the Joplin Codebase
Add a helper function isNaturalLanguageQuery_() which is used in determineSearchType_() to determine if the query is a Natural Language Query:
```typescript
private isNaturalLanguageQuery_(query: string): boolean {
	const parsed = filterParser(query);
	const hasExplicitFilters = parsed.some(t => t.name !== 'text');
	if (hasExplicitFilters) return false; // explicit filters mean it is not a context-based search
	const words = query.trim().split(/\s+/);
	const contextWords = new Set(['with', 'from', 'about', 'during', 'at', 'in', 'on', 'a', 'the', 'my', 'our']);
	const contextWordCount = words.filter(w => contextWords.has(w.toLowerCase())).length;
	// At least four words, two of them context words → treat as natural language
	return words.length >= 4 && contextWordCount >= 2;
}
```
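To make the heuristic's behaviour concrete, here is a standalone version with the `filterParser` check omitted so it can be exercised in isolation (a simplification for illustration, not the integrated form):

```typescript
// Standalone sketch of the context-word heuristic.
function looksLikeNaturalLanguage(query: string): boolean {
	const words = query.trim().split(/\s+/);
	const contextWords = new Set(['with', 'from', 'about', 'during', 'at', 'in', 'on', 'a', 'the', 'my', 'our']);
	const count = words.filter(w => contextWords.has(w.toLowerCase())).length;
	return words.length >= 4 && count >= 2;
}
```

On the proposal's motivating query, "meeting with a German company in 2019" has seven words of which three ("with", "a", "in") are context words, so it is routed to the AI path; a short keyword query like "linux kernel notes" fails the four-word threshold and stays on the existing search.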
We create a new table to store the TOC Tree:
```sql
CREATE TABLE IF NOT EXISTS notes_ai_index (
	id TEXT PRIMARY KEY,   -- random UUID, not the note ID
	note_id TEXT NOT NULL, -- foreign key back to notes
	node_type TEXT,        -- 'note' | 'paragraph' | 'media'
	parent_node TEXT,      -- id of parent node
	notebook_path TEXT,
	summary TEXT,
	updated_time INTEGER
);
```
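The table stores the tree flattened into rows linked by `parent_node`, so `AiSearcher` must rebuild the tree in memory before the navigator LLM can walk it. A minimal sketch of that reconstruction (the row shape mirrors the table columns; the function name is illustrative):

```typescript
interface IndexRow { id: string; node_type: string; parent_node: string | null; summary: string; }
interface TreeNode extends IndexRow { children: TreeNode[]; }

// Rebuild the TOC tree from flat notes_ai_index rows. Rows whose
// parent is missing (or null) become roots, so order does not matter.
function rowsToTree(rows: IndexRow[]): TreeNode[] {
	const byId = new Map<string, TreeNode>();
	for (const r of rows) byId.set(r.id, { ...r, children: [] });
	const roots: TreeNode[] = [];
	for (const node of byId.values()) {
		const parent = node.parent_node ? byId.get(node.parent_node) : undefined;
		if (parent) parent.children.push(node);
		else roots.push(node);
	}
	return roots;
}
```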
We add the following sync logic inside syncTables_() in packages/lib/services/search/SearchEngine.ts:
```typescript
for (const change of changes) {
	if (change.type === ItemChange.TYPE_DELETE) {
		queries.push({
			sql: 'DELETE FROM notes_ai_index WHERE note_id = ?',
			params: [change.item_id],
		});
	} else {
		const note = this.noteById_(notes, change.item_id);
		if (note) {
			const nodes = await this.aiIndexer_.indexNote(note);
			const notebookPath = await this.getNotebookPath_(note.parent_id);
			for (const node of nodes) {
				queries.push({
					sql: `INSERT INTO notes_ai_index
						(id, note_id, node_type, parent_node, notebook_path, summary, updated_time)
						VALUES (?, ?, ?, ?, ?, ?, ?)`,
					params: [
						uuid(), note.id, node.node_type, node.parent_node,
						notebookPath, node.summary, Date.now(),
					],
				});
			}
		}
	}
}
```
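The `aiIndexer_.indexNote()` call above is assumed to split the note body into chunks before summarising each one. A minimal sketch of such chunking under simple assumptions — split on blank lines, tag standalone Markdown image lines as media (the real implementation would use Joplin's Markdown parser):

```typescript
interface Chunk { node_type: 'paragraph' | 'media'; text: string; }

// Split a Markdown note body into paragraph and media chunks.
function splitIntoChunks(body: string): Chunk[] {
	return body
		.split(/\n\s*\n/)           // blank lines separate chunks
		.map(s => s.trim())
		.filter(s => s.length > 0)
		.map((text): Chunk => ({
			// A chunk that is exactly one image reference becomes a media node.
			node_type: /^!\[.*\]\(.*\)$/.test(text) ? 'media' : 'paragraph',
			text,
		}));
}
```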
Libraries & Technologies
- Ollama — for local on-device LLM support (privacy-preserving, no API key needed)
- Cloud LLM API — configurable via settings, supporting OpenAI-compatible endpoints
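Whichever backend is configured, the navigator's reply must be parsed into node IDs. A defensive parsing sketch, assuming the prompt asks the model to answer with a JSON array of IDs — a convention of this proposal, not a requirement of Ollama or the OpenAI API:

```typescript
// Extract a JSON array of node IDs from an LLM reply, tolerating
// surrounding prose and malformed output (returns [] on failure).
function parseNodeIds(reply: string): string[] {
	const match = reply.match(/\[[\s\S]*\]/);
	if (!match) return [];
	try {
		const parsed = JSON.parse(match[0]);
		return Array.isArray(parsed) ? parsed.filter((x): x is string => typeof x === 'string') : [];
	} catch {
		return [];
	}
}
```

Falling back to an empty list on malformed output matters here, because an empty AI result triggers the silent FTS fallback listed in the deliverables rather than an error shown to the user.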
Potential Challenges
- Indexing latency — summarising every note chunk via an LLM is slow. Mitigated by running indexing in the background through the existing `scheduleSyncTables()` debounce mechanism, and by batching LLM calls.
- TOC size — for users with thousands of notes, the TOC passed to the navigator LLM may exceed context window limits.
- Accuracy of the heuristic — `isNaturalLanguageQuery_()` may misclassify edge cases.
Testing Strategy
- Unit tests for `isNaturalLanguageQuery_()` covering keyword queries, natural language queries, and queries with explicit filters — added to the existing `SearchEngine.test.ts`
- Unit tests for `AiIndexer.indexNote()` with mock notes covering paragraph and media nodes
- Unit tests for the `notes_ai_index` sync loop — verifying inserts on create/update and deletes on note deletion, following the pattern in existing search engine tests
- Integration test for the end-to-end search path with a mocked LLM client returning a fixed list of node IDs
Documentation Plan
- User documentation — a short guide explaining when AI search activates, how to configure the LLM backend (cloud vs Ollama), and what natural language queries look like vs keyword queries
- Developer documentation — the JSON schema for `notes_ai_index`, the prompt engineering decisions for the indexer and navigator, and instructions for extending the system to support new note content types
4. Implementation Plan
| Weeks | Work | Output |
|---|---|---|
| Week 1–2 (May 1 – May 14) | Architecture & "Architect" Prompt. Set up the dev environment and study the core app's note storage. Design the "Architect" prompt to generate structured TOC JSON. | Architecture finalized, prompt designed. |
| Week 3–4 (May 15 – May 31) | Design and create the notes_ai_index SQLite table. Write getNotebookPath_() helper. Hook sync loop into syncTables_(). | notes_ai_index stays in sync with note create/update/delete events automatically. |
| Week 5–6 (June 1 – June 14) | Build the TOC tree for the note. Markdown parsing, chunk splitting (paragraphs, media), per-chunk LLM summarisation. | Every note in the database gets a populated subtree of nodes in notes_ai_index. |
| Week 7–8 (June 15 – June 30) | Build the reasoning-based retrieval step. Load TOC from notes_ai_index, send query + TOC to LLM, parse returned node list. Implement note ranking and wire into search(). | End-to-end natural language search working. |
| Week 9–10 (July 1 – July 14) | UI Integration & Transparency. Build the search results UI with explainability (show the path taken). Integrate the search bar toggle (Hybrid Search: Lexical + PageIndex). | Users can toggle AI search from the UI. |
| Week 11–12 (July 15 – July 31) | Optimization & Final Polish. Benchmarking, latency profiling (target < 3s end-to-end). Local model mode via Ollama. Developer documentation. | Polished, documented, and production-ready. |
5. Deliverables
- A new AI-powered search mode integrated into the existing `SearchEngine.ts` alongside the three current modes
- `isNaturalLanguageQuery_()` — intent detection that automatically routes natural language queries to the AI path without breaking existing keyword search
- `AiIndexer` — a background indexer that splits notes into logical chunks (paragraphs, media, list items) and generates LLM summaries, hooked into the existing `syncTables_()` pipeline
- `AiSearcher` — a reasoning-based retrieval layer that passes the TOC and user query to an LLM and returns ranked results
- `notes_ai_index` SQLite table with full incremental sync support (create, update, delete)
- Support for both cloud LLM and local Ollama as configurable backends
- Silent fallback to FTS when the AI path returns no results
- UI toggle to enable/disable AI search mode
6. Availability
Weekly availability — 40 hours per week. I have no internships, courses, or other commitments during the GSoC period and will treat this as a full-time project.
Time zone — IST (UTC+5:30)
Other commitments — My semester ends before the GSoC coding period begins, so there are no academic commitments during the programme. I am available for calls and reviews throughout the week including weekends.
AI Disclosure
AI tools (Claude by Anthropic) were used to assist with grammar, wording, and formatting of this proposal. All technical ideas, implementation decisions, and code are my own.