GSoC 2026 Proposal Draft - Idea 4: Chat with your note collection using AI - Aarushi Tandon
Links
Project idea: joplin/gsoc repository, gsoc/ideas.md (Idea 4)
GitHub profile: github.com/aarushitandon0
Forum introduction post: "Welcome to GSoC 2026 with Joplin!" (post #95)
Pull requests submitted to Joplin:
- Keyboard accessibility fix: Shift+Tab to New Notebook button fails when sidebar is scrolled
Other relevant development experience:
- EarningsIQ: Advanced multi-stage RAG system over NASDAQ earnings call transcripts
1. Introduction
I am Aarushi Tandon, a second-year Computer Engineering student from Pune. I have a strong background in Python, ML, NLP, LLMs, and RAG systems, and I am comfortable with JavaScript, TypeScript, React, and Node.js.
I have practical experience building end-to-end retrieval pipelines and am familiar with advanced techniques including semantic chunking, hybrid BM25 and dense retrieval, cross-encoder reranking, query decomposition, corrective RAG (CRAG), HyDE, and RAGAS evaluation. Before applying, I built EarningsIQ, an advanced multi-stage RAG system over NASDAQ earnings call transcripts, which gave me hands-on experience with the exact retrieval and reranking patterns I plan to apply here.
I explored the Joplin codebase in depth before writing this proposal. Key findings:
- joplin.plugins.dataDir() confirmed as persistent plugin storage path (JoplinPlugins.ts)
- Pagination uses has_more: boolean + integer page (confirmed from Api.test.ts)
- onNoteChange fires only for the currently selected note (confirmed from JoplinWorkspace.ts source -- it filters by selectedNoteIds)
- onSyncComplete fires with { withErrors: boolean } only (confirmed from JoplinWorkspace.ts)
- markup_language field: 1 = Markdown, 2 = HTML/web-clipped (confirmed from routes/notes.ts)
- userDataSet key max 255 chars, timestamp-based sync across devices (confirmed from userData.ts)
- 13 search filter keywords confirmed from filterParser.ts lines 74-75
- Zero embedding or vector code exists anywhere in Joplin today
2. Project Summary
What problem it solves
Over time, many Joplin users build something far more valuable than a note collection. They carefully clip research papers, save book highlights, write reflections, and collect references, accumulating a personal knowledge base that represents years of curated thinking.
The problem is that once this knowledge base grows large enough, it becomes practically inaccessible. You cannot have a conversation with it. You cannot ask "what do I know about sleep and cognition?" or "how does my note on stoicism connect to what I wrote about decision-making?" or "give me everything I have collected about machine learning, synthesised." The knowledge is there, carefully selected, curated, personal, but there is no way to interrogate it.
What will be implemented
This project builds two things in sequence.
Layer 1 is a shared embedding infrastructure: a background indexing service that reads all notes via the Joplin plugin data API, cleans content, applies semantic chunking, generates local embeddings, and stores them in ChromaDB inside joplin.plugins.dataDir(). Once its API is published, any future Joplin AI plugin can consume this index. One process, one model in memory, one set of vectors, not duplicated per plugin.
Layer 2 is a conversational chat interface embedded inside Joplin as a plugin panel. The UI is exactly like ChatGPT except the knowledge comes entirely from the user's own note collection. The user asks a question, the AI answers based on their actual notes with citations, and the user follows up to refine. The retrieval pipeline behind this implements semantic chunking, hybrid BM25 and dense retrieval with Reciprocal Rank Fusion, Relevant Segment Extraction, query decomposition, cross-encoder reranking, and HyDE. All improvements are measured using RAGAS throughout development.
Expected outcome
Submitted as a PR to the joplin/plugins repository, so it is available in Joplin's built-in plugin manager: one-click install, no separate download. Local embeddings (all-MiniLM-L6-v2) are the default, so no API key is needed for indexing and no note content leaves the device unless the user explicitly opts in.
3. Technical Approach
How it plugs into Joplin
This is a standard Joplin plugin scaffolded with yo joplin and submitted to the joplin/plugins repository. It installs from Joplin's built-in plugin manager. No core Joplin files are modified.
On startup the plugin does six things in order:
- Calls joplin.plugins.dataDir() to get a persistent, plugin-scoped storage path. This is where ChromaDB and the BM25 index live.
- Spawns a Python FastAPI subprocess (the AI backend) from the plugin directory.
- Polls GET /health every 500ms until the backend responds. Only then does it show the panel.
- Creates a sidebar panel via joplin.views.panels.create('ai-chat-panel').
- Loads the compiled React bundle into the panel via panels.addScript() and panels.setHtml().
- Registers event listeners for note changes and sync completion.
The React UI talks to the Python backend directly over HTTP and SSE at localhost:8000. The TypeScript plugin bridge only handles things that require the Joplin API: reading notes, listening for changes, and opening notes from citation clicks.
Citation click flow: React UI calls webviewApi.postMessage({ type: 'openNote', noteId }), plugin receives it in panels.onMessage(), calls joplin.commands.execute('openNote', noteId), source note opens in Joplin.
Re-indexing flow: onNoteChange (selected note only, confirmed from JoplinWorkspace.ts source) triggers a fast-path re-embed for the note currently being edited. onSyncComplete({ withErrors: boolean }) (confirmed from source) triggers a full diff. The Python backend compares all note hashes and re-embeds only what changed.
Note ingestion and cleaning
Notes are read via joplin.data.get() with fields id, title, body, markup_language, parent_id, updated_time. No SQLite access needed. Pagination uses has_more + page confirmed from Api.test.ts.
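The pagination loop described above can be sketched in Python. Here `fetch_page` is a hypothetical stand-in for the plugin's actual `joplin.data.get(['notes'], { page, fields })` call, which on the TypeScript side returns an object with `items` and `has_more`:

```python
def fetch_all_notes(fetch_page) -> list[dict]:
    """Drain a paginated notes endpoint: keep requesting pages until
    has_more is false. fetch_page(page) stands in for the plugin's
    joplin.data.get(['notes'], ...) call."""
    notes, page = [], 1
    while True:
        resp = fetch_page(page)
        notes.extend(resp["items"])
        if not resp["has_more"]:
            return notes
        page += 1
```

The real call additionally passes the field list (id, title, body, markup_language, parent_id, updated_time) so only the needed columns are serialised.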
Before anything is embedded, every note goes through a cleaning pipeline. Without it, resource IDs like :/4a3f8b2c... and raw HTML from web-clipped notes pollute the vector space and degrade retrieval quality.
if markup_language == 2 (web-clipped HTML, confirmed from routes/notes.ts):
    convert body to Markdown via markdownify
strip :/[a-f0-9]{32} Joplin resource IDs
strip <[^>]+> residual HTML tags
strip ^---...--- YAML frontmatter
normalise whitespace
prepend "# {title}" so every chunk carries the note title as context
Skip rules: encryption_applied == 1 (body is ciphertext), is_conflict == 1 (the same filter FTS5 uses in JoplinDatabase.ts), and body under 50 chars after cleaning.
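The cleaning steps can be sketched in Python. `clean_note` is a hypothetical helper name; the HTML branch assumes the markdownify package and is imported lazily so the Markdown-only path has no third-party dependency:

```python
import re

def clean_note(title: str, body: str, markup_language: int) -> str:
    """Normalise a note body before chunking and embedding (sketch)."""
    if markup_language == 2:  # web-clipped HTML
        # Lazy import: only needed for HTML notes (assumes markdownify is installed).
        from markdownify import markdownify as to_markdown
        body = to_markdown(body)
    # Strip Joplin resource IDs like :/4a3f8b2c...
    body = re.sub(r":/[a-f0-9]{32}", "", body)
    # Strip residual HTML tags.
    body = re.sub(r"<[^>]+>", "", body)
    # Strip YAML frontmatter at the top of the note.
    body = re.sub(r"\A---\n.*?\n---\n", "", body, flags=re.DOTALL)
    # Normalise whitespace.
    body = re.sub(r"[ \t]+", " ", body)
    body = re.sub(r"\n{3,}", "\n\n", body).strip()
    # Prepend the title so every chunk carries it as context.
    return f"# {title}\n\n{body}"
```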
Handling large collections and large notes
For large collections: notes are fetched ordered by updated_time DESC. Recently modified notes index first. The chat panel becomes usable after 100 notes are indexed so users do not wait an hour before asking their first question. Indexing continues in the background.
For large individual notes: a 50,000-word clipped article becomes roughly 200 chunks. The LLM never sees the full note. It only receives the top-5 most relevant chunks regardless of note size or collection size. Context window limits are addressed at the retrieval stage, not the LLM stage.
Embedding speed (all-MiniLM-L6-v2 on CPU): roughly 2,000 to 5,000 chunks per minute. 10,000 notes at 3 chunks each takes about 6 to 15 minutes on first launch. This estimate is shown to the user before indexing begins.
Content hashes are stored per note via joplin.data.userDataSet() (confirmed from userData.ts: 255 char key limit, timestamp-based cross-device sync). Subsequent launches only re-index notes whose body hash has changed.
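The incremental re-index decision reduces to a hash comparison. A minimal sketch, assuming hypothetical note-dict and stored-hash shapes -- in the real plugin the stored hash lives in joplin.data.userDataSet rather than a local dict:

```python
import hashlib

def content_hash(body: str) -> str:
    """Stable hash of a note body, stored per note to detect changes."""
    return hashlib.sha256(body.encode("utf-8")).hexdigest()

def notes_to_reindex(notes: list[dict], stored_hashes: dict[str, str]) -> list[str]:
    """Return IDs of notes whose body changed since the last index run."""
    changed = []
    for note in notes:
        if stored_hashes.get(note["id"]) != content_hash(note["body"]):
            changed.append(note["id"])
    return changed
```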
Semantic chunking
Fixed-size splitting breaks sentences mid-thought and produces chunks that embed poorly because neither half is a complete idea. For Joplin specifically, where notes range from 3 lines to 10,000 words, a fixed chunk size either fragments long notes or makes short notes useless.
The approach: tokenise into sentences with NLTK, then measure cosine similarity between adjacent sentence embeddings. When similarity drops below 0.45 or the token budget hits 512, start a new chunk and carry the last two sentences as overlap.
for each adjacent sentence pair (i, i+1):
sim = cosine_similarity(embed(sent_i), embed(sent_i+1))
if sim < 0.45 or current_chunk_tokens + next_sent_tokens > 512:
save current chunk
start new chunk with last 2 sentences as overlap
else:
add sentence to current chunk
All sentences are batch-encoded in one pass per note, efficient even for very large notes.
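The chunking loop above can be made concrete. `embed` is passed in as a callable so the sketch stays model-agnostic (in the plugin it would be all-MiniLM-L6-v2), and token counts are approximated by word counts, which is an assumption of this sketch:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def semantic_chunks(sentences, embed, sim_threshold=0.45, max_tokens=512, overlap=2):
    """Split sentences into chunks at topic boundaries or the token budget."""
    if not sentences:
        return []
    vecs = [embed(s) for s in sentences]  # batch-encoded in one pass in the real pipeline
    chunks, current = [], [sentences[0]]
    tokens = len(sentences[0].split())
    for i in range(1, len(sentences)):
        sent_tokens = len(sentences[i].split())
        topic_break = cosine(vecs[i - 1], vecs[i]) < sim_threshold
        if topic_break or tokens + sent_tokens > max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]  # carry the last sentences forward as overlap
            tokens = sum(len(s.split()) for s in current)
        current.append(sentences[i])
        tokens += sent_tokens
    chunks.append(" ".join(current))
    return chunks
```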
Retrieval pipeline
Why each step exists:
Hybrid + RRF: Dense search captures meaning when exact words differ. BM25 captures exact matches that dense search misses such as proper names and technical terms. RRF combines both without requiring weight tuning.
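Reciprocal Rank Fusion itself is only a few lines. The k=60 constant below is the value commonly used in the RRF literature, an assumption here rather than a project-confirmed setting:

```python
def rrf_fuse(dense_ranked: list[str], bm25_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of chunk IDs: score(id) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in (dense_ranked, bm25_ranked):
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because only ranks are used, no score normalisation or weight tuning between the dense and sparse retrievers is needed.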
RSE: Without this, the LLM receives disconnected sentence fragments. RSE merges adjacent relevant chunks from the same note into complete passages so the LLM reasons from whole thoughts.
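A minimal form of the RSE merge step, assuming each retrieved chunk carries a note_id and its position within the note (hypothetical shapes for this sketch); adjacent hits from the same note are joined into one passage:

```python
def merge_segments(hits: list[dict]) -> list[dict]:
    """Merge retrieved chunks that are adjacent within the same note.

    Each hit is assumed to look like {"note_id": ..., "pos": int, "text": str}.
    """
    ordered = sorted(hits, key=lambda h: (h["note_id"], h["pos"]))
    merged: list[dict] = []
    for hit in ordered:
        last = merged[-1] if merged else None
        if last and last["note_id"] == hit["note_id"] and hit["pos"] == last["end"] + 1:
            last["text"] += " " + hit["text"]   # extend the running segment
            last["end"] = hit["pos"]
        else:
            merged.append({"note_id": hit["note_id"], "end": hit["pos"], "text": hit["text"]})
    return merged
```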
Cross-encoder reranking: The bi-encoder scores query and chunk independently in vector space, fast but imprecise. The cross-encoder reads them together as a pair and understands their relationship directly. Used for the final top-5 where precision matters most.
HyDE: A vague query like "that thing about sleep" is far from any note chunk in embedding space. A generated hypothetical passage is semantically much closer to real note content. Opt-in, disabled by default since it adds one LLM call per vague query.
Grounding and hallucination
The system cannot eliminate hallucination. The LLM may misread context or fill gaps from prior training. The design makes uncertainty explicit rather than hiding it. The system prompt instructs the LLM to cite the source note title for every claim and to say "I could not find this in your notes" when retrieved chunks are not relevant rather than speculate. RAGAS faithfulness (target above 0.85) measures what fraction of claims are grounded in retrieved context throughout development.
Conversation management
6-turn sliding window. When history exceeds 12 turns, older turns are compressed into a short summary via a cheap LLM call so context limits are never breached. Topic-shift detection using cosine similarity between consecutive query embeddings below 0.40 triggers fresh retrieval rather than reusing stale chunks.
LLM providers and cost
Cost is a real concern. Local embeddings (all-MiniLM-L6-v2) are always the default. Zero cost, zero API calls, zero data leaving the device during indexing. No API key needed to start.
Cloud embedding APIs are opt-in only with a cost estimate shown before the user enables them (around $0.002 per 1,000 notes with OpenAI). LLM calls happen only when the user asks a question. There is no bulk LLM processing at first launch. The settings panel shows a per-question cost estimate so there are no surprises.
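The per-question estimate is simple arithmetic over token counts. The prices in the example below are illustrative assumptions, not current provider rates; in the settings panel they would come from a per-model table:

```python
def estimate_question_cost(context_tokens: int, history_tokens: int, answer_tokens: int,
                           input_price_per_million: float,
                           output_price_per_million: float) -> float:
    """Rough per-question USD estimate: prompt (context + history) in, answer out."""
    input_cost = (context_tokens + history_tokens) / 1_000_000 * input_price_per_million
    output_cost = answer_tokens / 1_000_000 * output_price_per_million
    return round(input_cost + output_cost, 6)
```

For example, 5 chunks of 512 tokens plus 500 tokens of history and a 400-token answer stays well under a cent at small-model rates.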
| Provider | Models | Access | Default? |
|---|---|---|---|
| OpenAI (cloud default) | GPT-4o, GPT-4o-mini | openai SDK -- user API key | Yes |
| Google (cloud alt) | Gemini 1.5 Flash, Pro | google-generativeai SDK -- user API key | Alternative |
| Ollama (local option) | LLaMA 3.1, Mistral 7B, Gemma 2 | ollama SDK -- no key, runs locally | Optional |
Chat UI
The UI is a self-contained React application compiled to a webpack bundle and loaded via panels.addScript(). No CDN resources. Joplin's webview enforces CSP and external URLs are blocked at runtime, so the entire app including React must be bundled locally.
- App
  - IndexingProgressScreen -- first launch: progress bar, ETA, unlock at 100 notes
  - ChatPanel
    - MessageThread
      - AssistantMessage
        - StreamingText -- tokens appended live via EventSource (SSE)
        - CitationCards -- note title + snippet, click opens source note
    - InputBar
  - ConversationSidebar -- past conversations stored locally
  - SettingsPanel -- API key, model, cost estimates per provider
Streaming works via the Fetch Streams API reading SSE from /chat. Citation cards appear below each answer. Clicking a card calls webviewApi.postMessage({ type: 'openNote', noteId }) which the plugin routes to joplin.commands.execute('openNote', noteId).
Confirmed plugin APIs used
| API | Source file | Role |
|---|---|---|
| joplin.plugins.dataDir() | JoplinPlugins.ts line 70 | ChromaDB + BM25 storage |
| panels.create / addScript / setHtml / show | JoplinViewsPanels.ts | Register and display React panel |
| panels.onMessage() | JoplinViewsPanels.ts | Receive citation clicks from React UI |
| onNoteChange({ id, event: 1/2/3 }) | JoplinWorkspace.ts | Fast re-index of selected note (confirmed: filters by selectedNoteIds) |
| onSyncComplete({ withErrors: boolean }) | JoplinWorkspace.ts | Full diff re-index after sync (confirmed) |
| commands.execute('openNote', id) | JoplinCommands.ts | Open source note on citation click |
| data.userDataSet(ModelType.Note, ...) | JoplinData.ts | Store content hashes, syncs across devices |
Changes to the Joplin codebase
None. This is a plugin submitted to the joplin/plugins repository. No core Joplin files are touched.
Libraries and technologies
| Library | Purpose |
|---|---|
| sentence-transformers | all-MiniLM-L6-v2 embeddings, ms-marco cross-encoder reranking |
| chromadb | Vector store in dataDir() / chroma/ |
| rank_bm25 | BM25Okapi sparse retrieval |
| nltk | sent_tokenize for chunking boundaries |
| fastapi + uvicorn | Async HTTP server with SSE streaming |
| markdownify | Convert markup_language=2 HTML notes to Markdown |
| openai / google-generativeai / ollama | LLM provider SDKs |
| ragas | Faithfulness, relevancy, context recall evaluation |
| React + TypeScript + webpack | CSP-compliant panel UI |
| langgraph | Stretch goal: agent-based search |
Potential challenges
| Challenge | Why it matters | Mitigation |
|---|---|---|
| onNoteChange fires only for selected note | Confirmed from source, easy to miss from docs alone | onSyncComplete diff covers all notes after every sync |
| CSP blocks CDN in webview | React and all dependencies must be bundled locally | Full webpack bundle loaded via panels.addScript() |
| Python binary name varies by platform | python vs python3 vs full path on Windows | Try python3 first, then python, show install instructions if neither found |
| First-launch indexing on large collections | 100,000 notes can take over an hour on slow hardware | Lazy indexing (recent first), ETA shown, chat unlocks at 100 notes |
| LLM context overflow on long conversations | 5 chunks x 512 tokens + history + prompt fills fast | 6-turn sliding window + compression of older turns |
| Port 8000 conflict | Another process may already be using it | Configurable port, auto-fallback to 8001/8002 |
Stretch goal
If ahead of schedule in Week 10, a LangGraph agent will use Joplin's existing keyword search filters as LLM tool calls alongside the vector retrieval pipeline. All 13 filter keywords confirmed from filterParser.ts lines 74-75: tag:, notebook:, created:, updated:, type:, iscompleted:, due:, latitude:, longitude:, altitude:, resource:, sourceurl:, id:, any:.
4. Implementation Plan
Week 1 (Community bonding)
- Confirm dataDir sharing strategy with mentors
- Study TOC and post_messages example plugins in full
- Run yo joplin scaffold and understand the generated structure
Week 2
- Note ingestion with has_more + page pagination
- Full cleaning pipeline for all edge cases: encrypted, conflict, HTML, empty
- NLTK sentence tokeniser setup
Week 3
- Semantic chunking with embedding-similarity boundaries
- Content hashes stored via joplin.data.userDataSet
- ChromaDB integration inside dataDir()
- Batch embedding with lazy indexing (recent notes first) and progress tracking
- Milestone 1: notes indexed, semantic search returning relevant results
Week 4
- BM25Okapi index over all chunks
- Hybrid retrieval with Reciprocal Rank Fusion
- Unit tests across exact keyword, semantic, and vague query types
- Baseline RAGAS scores recorded
Week 5
- Relevant Segment Extraction
- Query decomposition via LLM
- Cross-encoder reranking with ms-marco-MiniLM-L-6-v2
- HyDE for vague query detection
- RAGAS comparison vs baseline, target faithfulness above 0.80
Week 6
- LLM adapters for OpenAI, Gemini, and Ollama
- Citation-enforcing system prompt and "not found" fallback response
- Conversation history with sliding window and topic-shift detection
- FastAPI /chat SSE endpoint
- Milestone 2: full pipeline working in terminal with RAGAS scores
Week 7
- All remaining FastAPI endpoints: /ingest, /health, /progress, /settings
- Cost estimation per provider and model
- API key validation and error handling
- Port conflict detection and fallback
Week 8
- React UI: streaming MessageThread, CitationCards, ConversationSidebar, IndexingProgressScreen, SettingsPanel
- Webpack bundle configured for CSP compliance
Week 9
- TypeScript plugin bridge: subprocess spawn with cross-platform Python detection, health poll, panel setup, onNoteChange handler, onSyncComplete diff, citation routing via panels.onMessage, cleanup on close
- Milestone 3: complete system running inside Joplin, citations open source notes
Week 10
- Stretch goal: LangGraph agent with all 13 confirmed search filters as LLM tool calls
- If behind: extended scale testing at 50,000+ notes
Week 11
- Cross-platform testing on Windows, macOS, and Linux
- Performance benchmarks at 1k, 10k, and 50k note scales
- Final RAGAS evaluation run
- CSP compliance verification for the React bundle
Week 12
- User README and installation guide
- Developer architecture document describing the shared index API for future plugin authors
- Demo video recording
- Final PR submitted to the joplin/plugins repository
- GSoC final report with quantitative RAGAS results and scale benchmarks
- Milestone 4: submitted
5. Deliverables
Implemented features:
- Plugin submitted as a PR to github.com/joplin/plugins so it appears in Joplin's built-in plugin manager
- Shared ChromaDB index in dataDir() designed from the start for consumption by future AI plugins
- Note preprocessing pipeline: resource ID stripping, HTML conversion for markup_language=2 notes, frontmatter removal, code block separation
- Semantic chunker using NLTK and embedding-similarity boundaries
- Hybrid BM25 and dense retrieval with Reciprocal Rank Fusion, Relevant Segment Extraction, query decomposition, cross-encoder reranking, and HyDE
- Streaming React chat UI with citation cards that open source notes, conversation history sidebar, indexing progress screen, and settings panel with cost estimates
- OpenAI GPT-4o and Gemini Flash (cloud, user API key) plus Ollama LLaMA 3.1 and Mistral (local, no key required)
- Citation-enforcing system prompt with explicit "not found" fallback
- Incremental indexing via joplin.data.userDataSet content hashes with onNoteChange fast-path and onSyncComplete full diff
- Full FastAPI backend with six endpoints and complete error handling
- TypeScript plugin bridge with cross-platform Python detection
Tests:
- RAGAS evaluation script with 50 question-answer pairs across four query types (exact keyword, semantic, vague, multi-turn), with before and after scores for each retrieval technique added
- Unit tests for the cleaning pipeline, chunker, BM25 index, hybrid retrieval, and reranker
- Scale benchmarks at 1,000, 10,000, and 50,000 notes measuring indexing time and query latency
Documentation:
- User README with installation, API key setup, and usage guide
- Developer architecture document describing the shared index API surface for future plugin authors
- GSoC final report with quantitative RAGAS results and performance benchmarks
Stretch goal:
- LangGraph agent using all 13 confirmed Joplin search filters as LLM tool calls
6. Availability
Weekly availability: 35 hours per week during GSoC.
Time zone: IST (UTC+5:30)
Happy to share the complete detailed proposal with mentors on the GSoC portal. Looking forward to your feedback :)
