GSoC 2026 Proposal Draft - Idea 4: Chat with your note collection using AI - Aarushi Tandon
Links
Project idea: joplin/gsoc repository, gsoc/ideas.md (Idea 4)
GitHub profile: github.com/aarushitandon0
Forum introduction post: "Welcome to GSoC 2026 with Joplin!" (post #95)
Pull requests submitted to Joplin:
- Keyboard accessibility fix: Shift+Tab to New Notebook button fails when sidebar is scrolled
Other relevant development experience:
- EarningsIQ: Advanced multi-stage RAG system over NASDAQ earnings call transcripts
1. Introduction
I am Aarushi Tandon, a second-year Computer Engineering student from Pune. I have a strong background in Python, ML, NLP, LLMs, and RAG systems, and I am comfortable with JavaScript, TypeScript, React, and Node.js.
I have practical experience building end-to-end retrieval pipelines and am familiar with advanced techniques including semantic chunking, hybrid BM25 and dense retrieval, cross-encoder reranking, query decomposition, corrective RAG (CRAG), HyDE, and RAGAS evaluation. Before applying, I built EarningsIQ, an advanced multi-stage RAG system over NASDAQ earnings call transcripts, which gave me hands-on experience with the exact retrieval and reranking patterns I plan to apply here.
I explored the Joplin codebase in depth before writing this proposal. Key findings:
- joplin.plugins.dataDir() confirmed as persistent plugin storage path (JoplinPlugins.ts)
- Pagination uses has_more: boolean + integer page (confirmed from Api.test.ts)
- onNoteChange fires only for the currently selected note (confirmed from JoplinWorkspace.ts source -- it filters by selectedNoteIds)
- onSyncComplete fires with { withErrors: boolean } only (confirmed from JoplinWorkspace.ts)
- markup_language field: 1 = Markdown, 2 = HTML/web-clipped (confirmed from routes/notes.ts)
- userDataSet key max 255 chars, timestamp-based sync across devices (confirmed from userData.ts)
- 13 search filter keywords confirmed from filterParser.ts lines 74-75
- Zero embedding or vector code exists anywhere in Joplin today
2. Project Summary
What problem it solves
Over time, many Joplin users build something far more valuable than a note collection. They carefully clip research papers, save book highlights, write reflections, and collect references, accumulating a personal knowledge base that represents years of curated thinking.
The problem is that once this knowledge base grows large enough, it becomes practically inaccessible. You cannot have a conversation with it. You cannot ask "what do I know about sleep and cognition?" or "how does my note on stoicism connect to what I wrote about decision-making?" or "give me everything I have collected about machine learning, synthesised." The knowledge is there, carefully selected, curated, personal, but there is no way to interrogate it.
What will be implemented
This project builds two things in sequence.
Layer 1 is a shared embedding infrastructure: a background indexing service that reads all notes via the Joplin plugin data API, cleans content, applies semantic chunking, generates local embeddings, and stores them in ChromaDB inside joplin.plugins.dataDir(). Once its API is published, any future Joplin AI plugin can consume this index. One process, one model in memory, one set of vectors, not duplicated per plugin.
Layer 2 is a conversational chat interface embedded inside Joplin as a plugin panel. The UI is exactly like ChatGPT except the knowledge comes entirely from the user's own note collection. The user asks a question, the AI answers based on their actual notes with citations, and the user follows up to refine. The retrieval pipeline behind this implements semantic chunking, hybrid BM25 and dense retrieval with Reciprocal Rank Fusion, Relevant Segment Extraction, query decomposition, cross-encoder reranking, and HyDE. All improvements are measured using RAGAS throughout development.
Expected outcome
Submitted as a PR to the joplin/plugins repository, so it is available in Joplin's built-in plugin manager: one-click install, no separate download. Local embeddings (all-MiniLM-L6-v2) are the default, so no API key is needed for indexing and no note content leaves the device unless the user explicitly opts in.
3. Technical Approach
How it plugs into Joplin
This is a standard Joplin plugin scaffolded with yo joplin and submitted to the joplin/plugins repository. It installs from Joplin's built-in plugin manager. No core Joplin files are modified.
On startup the plugin does six things in order:
- Calls joplin.plugins.dataDir() to get a persistent, plugin-scoped storage path. This is where ChromaDB and the BM25 index live.
- Spawns a Python FastAPI subprocess (the AI backend) from the plugin directory.
- Polls GET /health every 500ms until the backend responds. Only then does it show the panel.
- Creates a sidebar panel via joplin.views.panels.create('ai-chat-panel').
- Loads the compiled React bundle into the panel via panels.addScript() and panels.setHtml().
- Registers event listeners for note changes and sync completion.
The React UI talks to the Python backend directly over HTTP and SSE at localhost:8000. The TypeScript plugin bridge only handles things that require the Joplin API: reading notes, listening for changes, and opening notes from citation clicks.
Citation click flow: React UI calls webviewApi.postMessage({ type: 'openNote', noteId }), plugin receives it in panels.onMessage(), calls joplin.commands.execute('openNote', noteId), source note opens in Joplin.
Re-indexing flow: onNoteChange (selected note only, confirmed from JoplinWorkspace.ts source) triggers a fast-path re-embed for the note currently being edited. onSyncComplete({ withErrors: boolean }) (confirmed from source) triggers a full diff. The Python backend compares all note hashes and re-embeds only what changed.
Note ingestion and cleaning
Notes are read via joplin.data.get() with fields id, title, body, markup_language, parent_id, updated_time. No SQLite access needed. Pagination uses has_more + page confirmed from Api.test.ts.
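The pagination loop described above can be sketched in Python. Here `fetch_page` is a hypothetical stand-in for the plugin's actual `joplin.data.get(['notes'], { page, fields })` call, which on the TypeScript side returns an object with `items` and `has_more`:

```python
def fetch_all_notes(fetch_page) -> list[dict]:
    """Drain a paginated notes endpoint: keep requesting pages until
    has_more is false. fetch_page(page) stands in for the plugin's
    joplin.data.get(['notes'], ...) call."""
    notes, page = [], 1
    while True:
        resp = fetch_page(page)
        notes.extend(resp["items"])
        if not resp["has_more"]:
            return notes
        page += 1
```

The real call additionally passes the field list (id, title, body, markup_language, parent_id, updated_time) so only the needed columns are serialised.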
Before anything is embedded, every note goes through a cleaning pipeline. Without it, resource IDs like :/4a3f8b2c... and raw HTML from web-clipped notes pollute the vector space and degrade retrieval quality.
if markup_language == 2 (web-clipped HTML, confirmed from routes/notes.ts):
    convert body to Markdown via markdownify
strip :/[a-f0-9]{32} Joplin resource IDs
strip <[^>]+> residual HTML tags
strip ^---...--- YAML frontmatter
normalise whitespace
prepend "# {title}" so every chunk carries the note title as context
Skip rules: encryption_applied == 1 (body is ciphertext), is_conflict == 1 (the same filter FTS5 uses in JoplinDatabase.ts), and body under 50 chars after cleaning.
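The cleaning steps can be sketched in Python. `clean_note` is a hypothetical helper name; the HTML branch assumes the markdownify package and is imported lazily so the Markdown-only path has no third-party dependency:

```python
import re

def clean_note(title: str, body: str, markup_language: int) -> str:
    """Normalise a note body before chunking and embedding (sketch)."""
    if markup_language == 2:  # web-clipped HTML
        # Lazy import: only needed for HTML notes (assumes markdownify is installed).
        from markdownify import markdownify as to_markdown
        body = to_markdown(body)
    # Strip Joplin resource IDs like :/4a3f8b2c...
    body = re.sub(r":/[a-f0-9]{32}", "", body)
    # Strip residual HTML tags.
    body = re.sub(r"<[^>]+>", "", body)
    # Strip YAML frontmatter at the top of the note.
    body = re.sub(r"\A---\n.*?\n---\n", "", body, flags=re.DOTALL)
    # Normalise whitespace.
    body = re.sub(r"[ \t]+", " ", body)
    body = re.sub(r"\n{3,}", "\n\n", body).strip()
    # Prepend the title so every chunk carries it as context.
    return f"# {title}\n\n{body}"
```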
Handling large collections and large notes
For large collections: notes are fetched ordered by updated_time DESC. Recently modified notes index first. The chat panel becomes usable after 100 notes are indexed so users do not wait an hour before asking their first question. Indexing continues in the background.
For large individual notes: a 50,000-word clipped article becomes roughly 200 chunks. The LLM never sees the full note. It only receives the top-5 most relevant chunks regardless of note size or collection size. Context window limits are addressed at the retrieval stage, not the LLM stage.
Embedding speed (all-MiniLM-L6-v2 on CPU): roughly 2,000 to 5,000 chunks per minute. 10,000 notes at 3 chunks each takes about 6 to 15 minutes on first launch. This estimate is shown to the user before indexing begins.
Content hashes are stored per note via joplin.data.userDataSet() (confirmed from userData.ts: 255 char key limit, timestamp-based cross-device sync). Subsequent launches only re-index notes whose body hash has changed.
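The incremental re-index decision reduces to a hash comparison. A minimal sketch, assuming hypothetical note-dict and stored-hash shapes -- in the real plugin the stored hash lives in joplin.data.userDataSet rather than a local dict:

```python
import hashlib

def content_hash(body: str) -> str:
    """Stable hash of a note body, stored per note to detect changes."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def notes_to_reindex(notes: list[dict], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of notes whose body changed since the last index run."""
    changed = []
    for note in notes:
        if stored_hashes.get(note["id"]) != content_hash(note["body"]):
            changed.append(note["id"])
    return changed
```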
Semantic chunking
Fixed-size splitting breaks sentences mid-thought and produces chunks that embed poorly because neither half is a complete idea. For Joplin specifically, where notes range from 3 lines to 10,000 words, a fixed chunk size either fragments long notes or makes short notes useless.
The approach: tokenise into sentences with NLTK, then measure cosine similarity between adjacent sentence embeddings. When similarity drops below 0.45 or the token budget hits 512, start a new chunk and carry the last two sentences as overlap.
for each adjacent sentence pair (i, i+1):
sim = cosine_similarity(embed(sent_i), embed(sent_i+1))
if sim < 0.45 or current_chunk_tokens + next_sent_tokens > 512:
save current chunk
start new chunk with last 2 sentences as overlap
else:
add sentence to current chunk
All sentences are batch-encoded in one pass per note, efficient even for very large notes.
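The chunking loop above can be made concrete. `embed` is passed in as a callable so the sketch stays model-agnostic (in the plugin it would be all-MiniLM-L6-v2), and token counts are approximated by word counts, which is an assumption of this sketch:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, sim_threshold=0.45, max_tokens=512, overlap=2):
    """Split sentences into chunks at topic boundaries or the token budget."""
    if not sentences:
        return []
    vecs = [embed(s) for s in sentences]  # batch-encoded in one pass in the real pipeline
    chunks, current = [], [sentences[0]]
    tokens = len(sentences[0].split())
    for i in range(1, len(sentences)):
        sent_tokens = len(sentences[i].split())
        topic_break = cosine(vecs[i - 1], vecs[i]) < sim_threshold
        if topic_break or tokens + sent_tokens > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry the last sentences forward as overlap
            tokens = sum(len(s.split()) for s in current)
        current.append(sentences[i])
        tokens += sent_tokens
    chunks.append(" ".join(current))
    return chunks
```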
Retrieval pipeline
Why each step exists:
Hybrid + RRF: Dense search captures meaning when exact words differ. BM25 captures exact matches that dense search misses such as proper names and technical terms. RRF combines both without requiring weight tuning.
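Reciprocal Rank Fusion itself is only a few lines. The k=60 constant below is the value commonly used in the RRF literature, an assumption here rather than a project-confirmed setting:

```python
def rrf_fuse(dense_ranked: list[str], bm25_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk IDs: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, no score normalisation or weight tuning between the dense and sparse retrievers is needed.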
RSE: Without this, the LLM receives disconnected sentence fragments. RSE merges adjacent relevant chunks from the same note into complete passages so the LLM reasons from whole thoughts.
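A minimal form of the RSE merge step, assuming each retrieved chunk carries a note_id and its position within the note (hypothetical shapes for this sketch); adjacent hits from the same note are joined into one passage:

```python
def merge_segments(hits: list[dict]) -> list[dict]:
    """Merge retrieved chunks that are adjacent within the same note.

    Each hit is assumed to look like {"note_id": ..., "pos": int, "text": str}.
    """
    ordered = sorted(hits, key=lambda h: (h["note_id"], h["pos"]))
    merged: list[dict] = []
    for hit in ordered:
        last = merged[-1] if merged else None
        if last and last["note_id"] == hit["note_id"] and hit["pos"] == last["end"] + 1:
            last["text"] += " " + hit["text"]   # extend the running segment
            last["end"] = hit["pos"]
        else:
            merged.append({"note_id": hit["note_id"], "end": hit["pos"], "text": hit["text"]})
    return merged
```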
Cross-encoder reranking: The bi-encoder scores query and chunk independently in vector space, fast but imprecise. The cross-encoder reads them together as a pair and understands their relationship directly. Used for the final top-5 where precision matters most.
HyDE: A vague query like "that thing about sleep" is far from any note chunk in embedding space. A generated hypothetical passage is semantically much closer to real note content. Opt-in, disabled by default since it adds one LLM call per vague query.
Grounding and hallucination
The system cannot eliminate hallucination. The LLM may misread context or fill gaps from prior training. The design makes uncertainty explicit rather than hiding it. The system prompt instructs the LLM to cite the source note title for every claim and to say "I could not find this in your notes" when retrieved chunks are not relevant rather than speculate. RAGAS faithfulness (target above 0.85) measures what fraction of claims are grounded in retrieved context throughout development.
Conversation management
6-turn sliding window. When history exceeds 12 turns, older turns are compressed into a short summary via a cheap LLM call so context limits are never breached. Topic-shift detection using cosine similarity between consecutive query embeddings below 0.40 triggers fresh retrieval rather than reusing stale chunks.
LLM providers and cost
Cost is a real concern. Local embeddings (all-MiniLM-L6-v2) are always the default. Zero cost, zero API calls, zero data leaving the device during indexing. No API key needed to start.
Cloud embedding APIs are opt-in only with a cost estimate shown before the user enables them (around $0.002 per 1,000 notes with OpenAI). LLM calls happen only when the user asks a question. There is no bulk LLM processing at first launch. The settings panel shows a per-question cost estimate so there are no surprises.
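The per-question estimate is simple arithmetic over token counts. The prices in the example below are illustrative assumptions, not current provider rates; in the settings panel they would come from a per-model table:

```python
def estimate_question_cost(context_tokens: int, history_tokens: int, answer_tokens: int,
                           input_price_per_million: float,
                           output_price_per_million: float) -> float:
    """Rough per-question USD estimate: prompt (context + history) in, answer out."""
    input_cost = (context_tokens + history_tokens) / 1_000_000 * input_price_per_million
    output_cost = answer_tokens / 1_000_000 * output_price_per_million
    return round(input_cost + output_cost, 6)
```

For example, 5 chunks of 512 tokens plus 500 tokens of history and a 400-token answer stays well under a cent at small-model rates.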
| Provider | Models | Access | Default? |
|---|---|---|---|
| OpenAI (cloud default) | GPT-4o, GPT-4o-mini | openai SDK -- user API key | Yes |
| Google (cloud alt) | Gemini 1.5 Flash, Pro | google-generativeai SDK -- user API key | Alternative |
| Ollama (local option) | LLaMA 3.1, Mistral 7B, Gemma 2 | ollama SDK -- no key, runs locally | Optional |
Chat UI
The UI is a self-contained React application compiled to a webpack bundle and loaded via panels.addScript(). No CDN resources. Joplin's webview enforces CSP and external URLs are blocked at runtime, so the entire app including React must be bundled locally.
- App
  - IndexingProgressScreen -- first launch: progress bar, ETA, unlock at 100 notes
  - ChatPanel
    - MessageThread
      - AssistantMessage
        - StreamingText -- tokens appended live via EventSource (SSE)
        - CitationCards -- note title + snippet, click opens source note
    - InputBar
  - ConversationSidebar -- past conversations stored locally
  - SettingsPanel -- API key, model, cost estimates per provider
Streaming works via the Fetch Streams API reading SSE from /chat. Citation cards appear below each answer. Clicking a card calls webviewApi.postMessage({ type: 'openNote', noteId }) which the plugin routes to joplin.commands.execute('openNote', noteId).
Confirmed plugin APIs used
| API | Source file | Role |
|---|---|---|
| joplin.plugins.dataDir() | JoplinPlugins.ts line 70 | ChromaDB + BM25 storage |
| panels.create / addScript / setHtml / show | JoplinViewsPanels.ts | Register and display React panel |
| panels.onMessage() | JoplinViewsPanels.ts | Receive citation clicks from React UI |
| onNoteChange({ id, event: 1/2/3 }) | JoplinWorkspace.ts | Fast re-index of selected note (confirmed: filters by selectedNoteIds) |
| onSyncComplete({ withErrors: boolean }) | JoplinWorkspace.ts | Full diff re-index after sync (confirmed) |
| commands.execute('openNote', id) | JoplinCommands.ts | Open source note on citation click |
| data.userDataSet(ModelType.Note, ...) | JoplinData.ts | Store content hashes, syncs across devices |
Changes to the Joplin codebase
None. This is a plugin submitted to the joplin/plugins repository. No core Joplin files are touched.
Libraries and technologies
| Library | Purpose |
|---|---|
| sentence-transformers | all-MiniLM-L6-v2 embeddings, ms-marco cross-encoder reranking |
| chromadb | Vector store in dataDir() / chroma/ |
| rank_bm25 | BM25Okapi sparse retrieval |
| nltk | sent_tokenize for chunking boundaries |
| fastapi + uvicorn | Async HTTP server with SSE streaming |
| markdownify | Convert markup_language=2 HTML notes to Markdown |
| openai / google-generativeai / ollama | LLM provider SDKs |
| ragas | Faithfulness, relevancy, context recall evaluation |
| React + TypeScript + webpack | CSP-compliant panel UI |
| langgraph | Stretch goal: agent-based search |
Potential challenges
| Challenge | Why it matters | Mitigation |
|---|---|---|
| onNoteChange fires only for selected note | Confirmed from source, easy to miss from docs alone | onSyncComplete diff covers all notes after every sync |
| CSP blocks CDN in webview | React and all dependencies must be bundled locally | Full webpack bundle loaded via panels.addScript() |
| Python binary name varies by platform | python vs python3 vs full path on Windows | Try python3 first, then python, show install instructions if neither found |
| First-launch indexing on large collections | 100,000 notes can take over an hour on slow hardware | Lazy indexing (recent first), ETA shown, chat unlocks at 100 notes |
| LLM context overflow on long conversations | 5 chunks x 512 tokens + history + prompt fills fast | 6-turn sliding window + compression of older turns |
| Port 8000 conflict | Another process may already be using it | Configurable port, auto-fallback to 8001/8002 |
Stretch goal
If ahead of schedule in Week 10, a LangGraph agent will use Joplin's existing keyword search filters as LLM tool calls alongside the vector retrieval pipeline. All 13 filter keywords confirmed from filterParser.ts lines 74-75: tag:, notebook:, created:, updated:, type:, iscompleted:, due:, latitude:, longitude:, altitude:, resource:, sourceurl:, id:, any:.
4. Implementation Plan
Week 1 (Community bonding)
- Confirm dataDir sharing strategy with mentors
- Study TOC and post_messages example plugins in full
- Run yo joplin scaffold and understand the generated structure
Week 2
- Note ingestion with has_more + page pagination
- Full cleaning pipeline for all edge cases: encrypted, conflict, HTML, empty
- NLTK sentence tokeniser setup
Week 3
- Semantic chunking with embedding-similarity boundaries
- Content hashes stored via joplin.data.userDataSet
- ChromaDB integration inside dataDir()
- Batch embedding with lazy indexing (recent notes first) and progress tracking
- Milestone 1: notes indexed, semantic search returning relevant results
Week 4
- BM25Okapi index over all chunks
- Hybrid retrieval with Reciprocal Rank Fusion
- Unit tests across exact keyword, semantic, and vague query types
- Baseline RAGAS scores recorded
Week 5
- Relevant Segment Extraction
- Query decomposition via LLM
- Cross-encoder reranking with ms-marco-MiniLM-L-6-v2
- HyDE for vague query detection
- RAGAS comparison vs baseline, target faithfulness above 0.80
Week 6
- LLM adapters for OpenAI, Gemini, and Ollama
- Citation-enforcing system prompt and "not found" fallback response
- Conversation history with sliding window and topic-shift detection
- FastAPI /chat SSE endpoint
- Milestone 2: full pipeline working in terminal with RAGAS scores
Week 7
- All remaining FastAPI endpoints: /ingest, /health, /progress, /settings
- Cost estimation per provider and model
- API key validation and error handling
- Port conflict detection and fallback
Week 8
- React UI: streaming MessageThread, CitationCards, ConversationSidebar, IndexingProgressScreen, SettingsPanel
- Webpack bundle configured for CSP compliance
Week 9
- TypeScript plugin bridge: subprocess spawn with cross-platform Python detection, health poll, panel setup, onNoteChange handler, onSyncComplete diff, citation routing via panels.onMessage, cleanup on close
- Milestone 3: complete system running inside Joplin, citations open source notes
Week 10
- Stretch goal: LangGraph agent with all 13 confirmed search filters as LLM tool calls
- If behind: extended scale testing at 50,000+ notes
Week 11
- Cross-platform testing on Windows, macOS, and Linux
- Performance benchmarks at 1k, 10k, and 50k note scales
- Final RAGAS evaluation run
- CSP compliance verification for the React bundle
Week 12
- User README and installation guide
- Developer architecture document describing the shared index API for future plugin authors
- Demo video recording
- Final PR submitted to the joplin/plugins repository
- GSoC final report with quantitative RAGAS results and scale benchmarks
- Milestone 4: submitted
5. Deliverables
Implemented features:
- Plugin submitted as a PR to github.com/joplin/plugins so it appears in Joplin's built-in plugin manager
- Shared ChromaDB index in dataDir() designed from the start for consumption by future AI plugins
- Note preprocessing pipeline: resource ID stripping, HTML conversion for markup_language=2 notes, frontmatter removal, code block separation
- Semantic chunker using NLTK and embedding-similarity boundaries
- Hybrid BM25 and dense retrieval with Reciprocal Rank Fusion, Relevant Segment Extraction, query decomposition, cross-encoder reranking, and HyDE
- Streaming React chat UI with citation cards that open source notes, conversation history sidebar, indexing progress screen, and settings panel with cost estimates
- OpenAI GPT-4o and Gemini Flash (cloud, user API key) plus Ollama LLaMA 3.1 and Mistral (local, no key required)
- Citation-enforcing system prompt with explicit "not found" fallback
- Incremental indexing via joplin.data.userDataSet content hashes with onNoteChange fast-path and onSyncComplete full diff
- Full FastAPI backend with six endpoints and complete error handling
- TypeScript plugin bridge with cross-platform Python detection
Tests:
- RAGAS evaluation script with 50 question-answer pairs across four query types (exact keyword, semantic, vague, multi-turn), with before and after scores for each retrieval technique added
- Unit tests for the cleaning pipeline, chunker, BM25 index, hybrid retrieval, and reranker
- Scale benchmarks at 1,000, 10,000, and 50,000 notes measuring indexing time and query latency
Documentation:
- User README with installation, API key setup, and usage guide
- Developer architecture document describing the shared index API surface for future plugin authors
- GSoC final report with quantitative RAGAS results and performance benchmarks
Stretch goal:
- LangGraph agent using all 13 confirmed Joplin search filters as LLM tool calls
6. Availability
Weekly availability: 35 hours per week during GSoC.
Time zone: IST (UTC+5:30)
Happy to share the complete detailed proposal with mentors on the GSoC portal. Looking forward to your feedback :)
