GSoC 2026 Proposal Draft – Idea 4: Chat with your note collection using AI – Krish Jain
Links
- Project Idea: https://joplinapp.org/gsoc2026/ideas/#4-chat-with-your-note-collection-using-ai
- GitHub: krisshJain (Krish Jain)
- Forum Introduction: Welcome to GSoC 2026 with Joplin! - #138 by Krishh
- Pull Requests: no PRs submitted yet; actively engaged in forum discussions with mentors on the approach
1. Introduction
I'm Krish Jain, a third-year B.Tech Information Technology student from Mumbai. I've been building with AI for about a year, not just experimenting but shipping things people actually use. One of those is Desk AI, an AI assistant with a RAG-based company policy chatbot currently used by 5+ companies. That project taught me a lot about what works in production RAG systems and, more importantly, what doesn't.
I've been using Joplin for my own notes for a while. The idea of being able to ask questions across a large personal knowledge base and get answers grounded in your actual notes is something I find genuinely useful. That's what drew me to this project.
2. Project Summary
The Problem
Joplin users who maintain large note collections currently have no good way to query that knowledge. Keyword search breaks down quickly: if you want to ask "what were the key decisions from my last project planning session?", there is no way to do that today.
What I Want to Build
A chat interface over your Joplin note collection. You ask a question, the system finds the relevant notes and sections, and the LLM generates an answer grounded in your actual content. You can follow up and refine, similar to ChatGPT, except the answers come from your own knowledge base.
Expected Outcome
A working Joplin plugin with a chat UI that enables users to query their note collection conversationally. The retrieval system will be benchmarked against a keyword baseline so there's a measurable quality signal.
Out of Scope
Real-time note syncing during chat, voice input, and multi-user knowledge bases are out of scope. The focus is on getting core retrieval and chat working well.
3. Technical Approach
Why Not Standard Vector RAG?
The obvious approach is chunk → embed → vector search. I've built this before and it works, but it has real issues for a note-taking context:
- Chunking breaks note structure: a chunk has no idea it belongs under "# Project Alpha > ## Action Items"
- Similarity search finds related content, not necessarily the right content
- It requires an embedding model in-process; if multiple plugins each load their own model, the memory footprint adds up fast
- There is no clear citation trail back to the specific note and section
The Approach: Structured Retrieval with Semantic Enrichment
I want to build a tree index from note structure and use BM25 over LLM-generated summaries for retrieval, inspired by the PageIndex framework (https://github.com/VectifyAI/PageIndex) and adapted for Joplin's needs.
Concrete example: a user has 2,000 notes and asks "What was my action plan for the Joplin proposal?"
Phase 1 — Index Building (background, once per note)
- Each note is parsed into a tree based on its heading structure
- Flat notes without headings are split into paragraph-sized blocks, each becoming a node
- Weak or vague headings (e.g. "Monday", "Meeting") are strengthened: the LLM derives a more descriptive title from the node's content, so retrieval doesn't rely on poor heading quality
- Each node gets an LLM-generated summary (1-2 sentences); this runs once and re-runs only when the note's updated_time changes
- The tree and summaries are stored in Joplin userData and persisted across sessions (a sketch of the tree builder follows this list)
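To make the heading-to-tree step concrete, here is a minimal sketch in TypeScript (the language of my PoC). `TreeNode` and `buildTree` are illustrative names I'm introducing here, not an existing Joplin or PageIndex API:

```typescript
// Sketch only. A real parser would also skip fenced code blocks so a
// "#" inside code isn't mistaken for a heading.
interface TreeNode {
  title: string;        // heading text, or an LLM-strengthened title
  level: number;        // 0 = note root, 1 = "#", 2 = "##", ...
  content: string;      // body text directly under this heading
  summary?: string;     // filled in later by the summarisation pipeline
  children: TreeNode[];
}

function buildTree(noteTitle: string, body: string): TreeNode {
  const root: TreeNode = { title: noteTitle, level: 0, content: '', children: [] };
  const stack: TreeNode[] = [root];

  for (const line of body.split('\n')) {
    const heading = /^(#{1,6})\s+(.+)$/.exec(line);
    if (heading) {
      const node: TreeNode = {
        title: heading[2], level: heading[1].length, content: '', children: [],
      };
      // Pop until the top of the stack is this node's parent.
      while (stack[stack.length - 1].level >= node.level) stack.pop();
      stack[stack.length - 1].children.push(node);
      stack.push(node);
    } else {
      stack[stack.length - 1].content += line + '\n';
    }
  }
  return root;
}
```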
Phase 2 — Query Time (3 LLM calls)
- BM25 over note summaries narrows the 2,000 notes to ~5 candidates. Because summaries are natural language, "Joplin proposal" matches a note even if its title just says "GSoC 2026"
- LLM call 1: the model sees the 5 candidate note summaries and picks the right note
- LLM call 2: the model sees that note's ~8 section summaries and picks "Action Plan"
- LLM call 3: the full section text plus the question produces an answer with a citation (a sketch of this pipeline follows the list)
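A sketch of the query-time pipeline, assuming the index shape above. `rankBM25`, `selectWithLLM`, and `answerWithLLM` are stand-ins for the BM25 ranker and the three LLM calls, not real library calls:

```typescript
// IndexedNote/Section mirror the stored tree index described in Phase 1.
interface Section { title: string; summary: string; fullText: string; }
interface IndexedNote { id: string; summary: string; sections: Section[]; }
interface Answer { text: string; noteId: string; sectionTitle: string; }

declare function rankBM25(query: string, docs: string[]): number[]; // indices, best first
declare function selectWithLLM<T extends { summary: string }>(q: string, options: T[]): Promise<T>;
declare function answerWithLLM(q: string, context: string): Promise<string>;

async function query(question: string, notes: IndexedNote[]): Promise<Answer> {
  // Stage 0 (no LLM): BM25 over note summaries, keep the top ~5 candidates.
  const top = rankBM25(question, notes.map(n => n.summary)).slice(0, 5);

  // LLM call 1: pick the most relevant note from the candidate summaries.
  const note = await selectWithLLM(question, top.map(i => notes[i]));

  // LLM call 2: pick the most relevant section within that note.
  const section = await selectWithLLM(question, note.sections);

  // LLM call 3: answer from the full section text, with a citation.
  const text = await answerWithLLM(question, section.fullText);
  return { text, noteId: note.id, sectionTitle: section.title };
}
```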
Optimisation: Contextual Neighbourhood Node Caching
After answering the Joplin-proposal question, the system caches not just the retrieved node but also its parent, siblings, and children. If the user follows up with a related question, the search pipeline is skipped entirely and the query drops to a single LLM call. Conversations naturally stay within one topic area, so the cache hit rate should be high in practice.
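A sketch of the cache, reusing the `TreeNode` shape from the tree-builder sketch. The lookup heuristic here (term overlap with cached summaries) is a placeholder; the real relevance check could itself be a cheap LLM call:

```typescript
class NeighbourhoodCache {
  private nodes: TreeNode[] = [];

  store(hit: TreeNode, parent?: TreeNode): void {
    const siblings = parent ? parent.children.filter(s => s !== hit) : [];
    this.nodes = [hit, ...(parent ? [parent] : []), ...siblings, ...hit.children];
  }

  // Returns cached candidate nodes, or null to fall back to the full pipeline.
  lookup(question: string): TreeNode[] | null {
    const terms = new Set(question.toLowerCase().split(/\W+/).filter(Boolean));
    const hits = this.nodes.filter(n =>
      (n.summary ?? '').toLowerCase().split(/\W+/).some(t => terms.has(t)));
    return hits.length > 0 ? hits : null;
  }
}
```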
Handling Flat Notes
For notes that do not contain headings, the content is automatically divided into paragraph-level blocks. Each block is treated as an individual node and enriched with an LLM-generated summary.
This effectively creates meaningful structure from otherwise unorganised content, making notes easier to navigate, search, and understand even when they were originally written in a flat format.
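A minimal sketch of the fallback, producing leaf nodes in the same `TreeNode` shape as the tree builder (the `Block N` titles are placeholders until the LLM derives descriptive ones):

```typescript
// Split on blank lines; each paragraph becomes a leaf node.
function splitFlatNote(body: string): TreeNode[] {
  return body
    .split(/\n\s*\n/)                  // blank lines mark paragraph boundaries
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .map((content, i) => ({
      title: `Block ${i + 1}`,         // replaced by an LLM-derived title
      level: 1,
      content,
      children: [],
    }));
}
```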
LLM Provider Strategy
- Local: Ollama (no API key, fully private), in line with Joplin's offline-first approach
- Cloud: any OpenAI-compatible API endpoint; the user provides their own key (a provider sketch follows this list)
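Both targets can speak the OpenAI chat-completions wire format: OpenAI-compatible clouds natively, and Ollama through its OpenAI-compatible `/v1` endpoint, so a single class can cover both. A sketch, with the request/response shapes to be verified per provider:

```typescript
interface LLMProvider {
  complete(prompt: string): Promise<string>;
}

class OpenAICompatibleProvider implements LLMProvider {
  constructor(
    private baseUrl: string,   // e.g. https://api.openai.com or http://localhost:11434
    private model: string,
    private apiKey?: string,   // omitted for local Ollama
  ) {}

  async complete(prompt: string): Promise<string> {
    const res = await fetch(`${this.baseUrl}/v1/chat/completions`, {
      method: 'POST',
      headers: {
        'Content-Type': 'application/json',
        ...(this.apiKey ? { Authorization: `Bearer ${this.apiKey}` } : {}),
      },
      body: JSON.stringify({
        model: this.model,
        messages: [{ role: 'user', content: prompt }],
      }),
    });
    const data = await res.json();
    return data.choices[0].message.content;
  }
}
```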
Potential Challenges
- Very large collections: lazy indexing (index notes as they are opened, not all at once) keeps initial setup manageable (see the sketch after this list)
- Flat or low-quality notes: the paragraph fallback supplies structure, but summary quality still depends on note quality
- Latency: 3 LLM calls at query time; the contextual cache reduces follow-ups to 1 call
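A sketch of the lazy-indexing hook using the plugin API's note-selection event; `indexNote`, `getIndexedTime`, and `setIndexedTime` are placeholders for this project's own code (the timestamps would live in userData):

```typescript
import joplin from 'api';

declare function indexNote(id: string, body: string): Promise<void>;
declare function getIndexedTime(id: string): Promise<number | undefined>;
declare function setIndexedTime(id: string, t: number): Promise<void>;

joplin.workspace.onNoteSelectionChange(async () => {
  const note = await joplin.workspace.selectedNote();
  if (!note) return;

  const lastIndexed = await getIndexedTime(note.id);
  if (lastIndexed === note.updated_time) return; // index is already current

  await indexNote(note.id, note.body);           // build tree + summaries
  await setIndexedTime(note.id, note.updated_time);
});
```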
Testing Strategy
- Unit tests for the tree builder and heading parser
- Integration tests with a fixed note collection and known Q&A pairs
- Benchmark: retrieval accuracy vs a BM25-only baseline (a harness sketch follows this list)
- Manual testing with real Joplin note collections of varying sizes
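A sketch of the benchmark harness: fixture Q&A pairs map a question to the note that should be retrieved, and hit@1 is compared between the full pipeline and the BM25-only baseline. All names here are placeholders for the test code:

```typescript
interface QAPair { question: string; expectedNoteId: string; }

async function hitAt1(
  pairs: QAPair[],
  retrieve: (q: string) => Promise<string>, // returns the top note id
): Promise<number> {
  let hits = 0;
  for (const { question, expectedNoteId } of pairs) {
    if ((await retrieve(question)) === expectedNoteId) hits++;
  }
  return hits / pairs.length;
}

// Usage:
//   const pipelineScore = await hitAt1(pairs, q => pipelineRetrieve(q));
//   const baselineScore = await hitAt1(pairs, q => bm25OnlyRetrieve(q));
```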
4. Implementation Plan
Community Bonding (before Week 1)
- Deep dive into the Joplin plugin API and data API
- Set up a local Joplin dev environment
- Finalise the storage approach with mentors
| Period | Tasks |
|---|---|
| Week 1-2 | Tree builder: markdown heading parser, paragraph fallback for flat notes. Unit tests. Works standalone. |
| Week 3-4 | Summarisation pipeline: per-node summaries, batched processing, incremental updates via updated_time. Ollama + OpenAI API support. |
| Week 5-6 | BM25 pre-filter over summaries, note-level + section-level LLM selection. End-to-end retrieval working in isolation. |
| Week 7-8 | Joplin plugin scaffolding, connect to data API, store index in userData, background indexing. |
| Week 9-10 | React chat UI in plugin panel, conversation history, cited answers (note + section), follow-up questions. |
| Week 11 | Contextual neighbourhood cache, performance testing, benchmark vs BM25 baseline. |
| Week 12 | Bug fixes, edge cases, polish. |
| Week 13 | Documentation (user + developer), final testing, submission. |
5. Deliverables
Required:
- Working Joplin plugin with a chat interface over the note collection
- Tree-based index builder with semantic summarisation
- BM25 + LLM retrieval pipeline with contextual caching
- Support for local (Ollama) and cloud LLM providers
- Retrieval benchmark vs keyword baseline
- Unit and integration test suite
- User and developer documentation
Nice to have:
- Auto-tagging notes using the same summary index
- Related-notes suggestions
- Streaming responses in the chat UI
6. Availability
- Availability: 40-50 hours per week
- Time zone: IST (UTC+5:30)
- No major conflicts during the GSoC period; minor academic commitments will be communicated to mentors in advance
I've already started a proof of concept, KrrishJain/pageindex-js (https://github.com/KrrishJain/pageindex-js), a TypeScript port of PageIndex (vectorless, reasoning-based RAG for markdown and PDF documents: no vector DB, no chunking, just LLM tree reasoning). It demonstrates the tree builder, summarisation pipeline, and retrieval working end-to-end on markdown files and PDFs, which gives me confidence the approach is implementable within the GSoC timeline.