GSoC 2026 Proposal Draft - AI-Supported Note Search for Joplin - Dipanshu Rawat

dipanshurdev · 28 March 2026 19:02

GSoC 2026 Proposal Draft – Idea 1: AI-Supported Natural Language Search for Joplin

Links

Project idea: https://joplinapp.org/gsoc2026/ideas/
GitHub: dipanshurdev (Dipanshu Rawat) · GitHub
Forum introduction: Welcome to GSoC 2026 with Joplin! - #58 by dipanshurdev
Pull requests to Joplin:
Other relevant experience:
- StudyOS AI – AI study platform I built using HuggingFace for automated note analysis and MCQ generation
- Portfolio

Introduction

I'm Dipanshu Rawat, a software developer from India with two years of professional experience building AI-powered products. At Vojic LLC I built VMeet, a real-time meeting assistant handling transcription and summarisation in production. Before that, at Claritel.ai, I integrated ML pipelines into a FastAPI/TypeScript backend, working daily with HuggingFace models and embedding-based retrieval systems.

I have been contributing to Joplin core since early 2026, working across UI state management, TypeScript utility optimisations, and localisation. Those five merged PRs gave me a working understanding of Joplin's plugin API, React layer, and internal data services, the exact areas this project touches.

For this proposal specifically, I read through filterParser.ts, SearchEngine.ts, and queryBuilder.ts before writing a single line. Every architectural decision below comes directly from that codebase reading, not from generic AI search literature.

Other Relevant Experience:

StudyOS AI — AI study platform built using HuggingFace for automated note analysis and MCQ generation
VMeet (Vojic LLC) — LLM-powered meeting assistant with real-time structured data extraction from free-form speech; the same extraction pattern used in this proposal's query translator

Project Summary

The Problem

Joplin's existing SQLite FTS5 engine is highly effective for exact keyword matches but cannot process conceptual queries or implicit metadata. A user who has spent years building a knowledge base has no way to ask it natural questions.

What the user remembers	Keyword result	Why it fails
"Meeting with a German company in 2019 or 2020"	Zero results	Note says "Berlin client" not "German" or "meeting"
"Poetry lines about the moon"	Zero results	Note contains a haiku, the word "moon" never appears
"Tasks I wrote for the website redesign"	50 results	Finds every note with "website", can't rank by intent
"That article about sleep and productivity"	Partial	Matches "productivity" but not the conceptual pairing
"My notes from the React conference last autumn"	Zero results	Note is titled "JSConf", date metadata isn't searched

The Solution

This project builds a Hybrid Self-Querying Search Engine natively integrated into Joplin core. A single LLM call decomposes any natural language query into two parallel execution paths:

Structured metadata filters (dates, tags, notebooks, types), compiled into Term[] objects and passed into queryBuilder.ts using Joplin's actual filter syntax
Semantic query string, embedded locally and searched against a vector store

Results from both paths are merged using Reciprocal Rank Fusion (RRF) and returned in the existing note list with an explainability badge showing how each note was found.

Input:  "meeting with a German company in 2019 or maybe 2020."
 
Single LLM call → internal TranslatedQuery:
{
  "filters":        { "date_from": "20190101", "date_to": "20201231" },
  "semantic_query": "meeting company client"
}
 
Query compiler step:
  date_from/date_to  → created:20190101 (Joplin YYYYMMDD format)
  semantic_query     → keyword slice passed to FTS + embedded for vector
 
→ FTS5 path:    filterParser terms → queryBuilder SQL  → filtered candidates
→ Vector path:  embed("meeting company client")        → semantic matches
→ RRF merge:    notes scoring in both paths rank highest
→ Result badge: "date: 2019–2020 · semantic: meeting company"

What this is not: a replacement for Joplin's existing search. Every query that works today continues to work identically. The hybrid path is activated only when AI search is enabled and the query is clearly natural language.

Out of scope for v1.0: fine-tuning LLMs, indexing image or PDF attachments, mobile/web support.

Technical Approach

Codebase Grounding

Before designing anything, I traced the full search path in the codebase:

SearchEngine.search(query)
  → filterParser.parseQuery(query)       // tokenises into Term[]
  → queryBuilder.buildQuery(terms)       // emits SQL for notes_fts / notes_normalized
  → SQLite FTS5 query execution
  → SearchEngineUtils ranking / scoring

The relevant files:

packages/lib/services/search/
├── SearchEngine.ts       ← orchestrator; calls parser, executes SQL, ranks results
├── filterParser.ts       ← tokenises query string into Term[] objects
├── queryBuilder.ts       ← converts Term[] into SQLite FTS5 SQL against notes_fts
└── SearchEngineUtils.ts  ← frequency scoring, field weighting helpers

filterParser.ts accepts date values in Joplin's own formats: YYYYMMDD, YYYYMM, YYYY, or relative forms like day-1. It does not accept ISO strings like 2019-01-01. The getUnixMs helper in dateFilter handles this parsing. A date range in 2019–2020 compiles to:

created:20190101 created:20201231
// filterParser interprets:
//   created:20190101   → notes created on or after 2019-01-01 (>=)
//   -created:20201231  → negated upper bound gives < semantics
// or equivalent range approach confirmed with mentor during Community Bonding

This is the exact format the query compiler in this project will emit, not the raw JSON from the LLM.

Architecture Overview

]

Why This Architecture Over Alternatives

Approach	Problem
Pure agent loop (5 LLM tool calls)	Up to 5–15 seconds latency. Unacceptable for interactive search.
Pure embedding search	Blind to dates, tags, notebooks. "2019 or 2020" is invisible to vectors.
Pure LLM → FTS syntax	Misses conceptual queries where exact keywords are absent from the note.
Hybrid self-querying (this proposal)	One LLM call. Parallel execution. Metadata precision + semantic recall.

Query Routing

The AI search path is only entered when:

The user has AI search enabled in settings (opt-in), and
The query does not contain explicit Joplin filter syntax

function shouldUseHybridSearch(query: string, aiEnabled: boolean): boolean {
  if (!aiEnabled) return false;
 
  // Pass through to existing engine if query uses Joplin filter syntax
  const structuredPatterns = [
    /^(any|title|body|tag|notebook|created|updated|type|iscompleted|due):/,
    /^-?(any|title|body|tag|notebook):/,
    /\bAND\b|\bOR\b|\bNOT\b/,
    /^".*"$/,
  ];
  return !structuredPatterns.some(p => p.test(query.trim()));
}

This replaces the "≤2 words = FTS" heuristic. The user can always type body:moon to force classic search regardless. Short conceptual queries like "moon haiku" or "Berlin meeting" correctly enter the hybrid path when AI is enabled.

Integration Point in SearchEngine.ts

Adding the hybrid path is a branch inside SearchEngine.search(), with the existing code path completely unchanged for all other cases:

// In SearchEngine.ts, new branch, old path untouched
async search(query: string, options: SearchOptions): Promise<SearchResult[]> {
  if (shouldUseHybridSearch(query, this.aiSearchEnabled_)) {
    return this.hybridSearch_(query, options);
  }
  // existing path, zero changes below this line
  return this.searchFts_(query, options);
}

The existing searchFts_ method is not modified. No existing behaviour changes.

Component A, LLM Query Translator

A single LLM call converts natural language into an internal TranslatedQuery object. The LLM outputs abstract intent; a query compiler step converts this to Joplin's actual filter syntax before touching queryBuilder.ts.

// Step 1: LLM outputs abstract intent
const TRANSLATOR_PROMPT = `
You are a search query parser for a note-taking app.
Extract metadata filters and semantic intent from the user's query.
Return ONLY valid JSON. No explanation. No markdown.
 
Valid filter fields: date_from, date_to (YYYYMMDD integers), tag, notebook,
type ("note" or "todo"), keyword (optional residual keywords)
 
Example:
Input:  "meeting with German company in 2019 or 2020"
Output: {"filters":{"date_from":20190101,"date_to":20201231},
         "semantic_query":"meeting company client"}
`;
 
interface TranslatedQuery {
  filters: {
    date_from?: number;   // YYYYMMDD integer, Joplin's actual format
    date_to?:   number;   // YYYYMMDD integer
    tag?:       string;   // value for tag: filter
    notebook?:  string;   // value for notebook: filter
    type?:      'note' | 'todo';  // value for type: filter
  };
  semantic_query: string;
}
 
// Step 2: Query compiler, converts TranslatedQuery → Term[] Joplin understands
function compileToJoplinTerms(tq: TranslatedQuery): string {
  const parts: string[] = [];
 
  if (tq.filters.date_from) parts.push(`created:${tq.filters.date_from}`);
  // Upper bound uses negation for < semantics as queryBuilder supports
  if (tq.filters.date_to)   parts.push(`-created:${tq.filters.date_to}`);
  if (tq.filters.tag)       parts.push(`tag:${tq.filters.tag}`);
  if (tq.filters.notebook)  parts.push(`notebook:${tq.filters.notebook}`);
  if (tq.filters.type)      parts.push(`type:${tq.filters.type}`);
 
  // Residual keyword slice for FTS text matching
  if (tq.semantic_query) parts.push(tq.semantic_query);
 
  return parts.join(' ');
  // e.g. → "created:20190101 -created:20201231 meeting company client"
  // This string is valid input to filterParser.parseQuery()
}

LLM Backends, unified LLMClient interface, user-configurable:

Backend	What leaves the device	Speed	Status
Ollama (local)	Nothing — fully local	1–3s	✓ Default
OpenAI API (gpt-4o-mini)	Query text only, over HTTPS	< 1s	Optional
Anthropic API (claude-haiku)	Query text only, over HTTPS	< 1s	Optional
None configured	—	—	✓ Graceful fallback to FTS5

Privacy note: In local mode (Ollama + local embeddings) no note content ever leaves the device. In cloud translation mode, only the user's query text is sent, not note content. Embeddings always run locally regardless of translation backend.

If the LLM returns malformed JSON, the system silently falls back to FTS5-only search, no crash, no broken UI.

Component B, Embedding & Vector Layer

All embedding runs locally regardless of which LLM backend is chosen for translation.

]

Embedding model: @xenova/transformers (Transformers.js) with BGE-small-en-v1.5 (24 MB, ONNX/WASM). This is a plausible path in Joplin's Electron environment, @xenova/transformers appears in Joplin's existing test/worker infrastructure. Final validation will happen in week 1 of Community Bonding with a minimal PoC; if WASM loading proves problematic in the production build, an alternative will be agreed with the mentor.

Vector store: vectra, pure TypeScript, zero native dependencies, file-backed JSON index:

Option	Decision	Reason
vectra	✓ Primary	Pure TypeScript. No native bindings. Cross-platform. Built-in BM25. Final choice subject to mentor review and bundle size assessment.
sqlite-vec	✗ Not for v1	C extension requiring native compilation. Incompatible with cross-platform builds. Could be revisited post-GSoC if maintainers accept native deps.
LanceDB / Chroma	✗ Rejected	Require native binaries or a running server process.

Incremental indexing, SHA-256 hashing skips unchanged notes; background indexing runs off the UI thread:

async function indexNote(note: JoplinNote): Promise<void> {
  const hash = sha256(note.body).slice(0, 32);
  const existing = await indexStateDb.get(note.id);
  if (existing?.hash === hash) return; // unchanged, skip entirely
 
  await vectorStore.deleteWhere({ noteId: note.id });
  const chunks     = chunkNote(note);   // heading-aware, 400-token sliding window
  const embeddings = await embedBatch(chunks.map(c => c.text));
  await vectorStore.insertBatch(chunks, embeddings);
  await indexStateDb.set(note.id, { hash, indexedAt: Date.now() });
}

Change detection: Joplin exposes note change events through its internal ItemChange tracking and SearchEngine.syncTables(). The indexer will hook into the same note-save paths that the existing search index uses, confirmed with mentor during Community Bonding. Periodic polling of updated_time covers bulk imports and sync arrivals.

Component C, Reciprocal Rank Fusion

Results from both paths are merged mathematically. A note ranking high in both paths scores higher than one dominating only one:

// Standard RRF: score = 1 / (rank + k), where k=60 is the accepted constant
function reciprocalRankFusion(
  ftsResults:    ScoredNote[],
  vectorResults: ScoredNote[],
  k = 60
): ScoredNote[] {
  const scores = new Map<string, number>();
 
  ftsResults.forEach((note, rank) => {
    scores.set(note.id, (scores.get(note.id) ?? 0) + 1 / (rank + k));
  });
  vectorResults.forEach((note, rank) => {
    scores.set(note.id, (scores.get(note.id) ?? 0) + 1 / (rank + k));
  });
 
  return [...scores.entries()]
    .sort(([, a], [, b]) => b - a)
    .map(([id]) => findNote(id, ftsResults, vectorResults));
}

Handling filters-only queries: When the LLM extracts only metadata filters with no semantic content (e.g. "notes from 2019"), semantic_query will be an empty string. In this case the vector leg is skipped and only the FTS5 path runs, avoiding a meaningless vector search against an empty embedding. The reverse (no filters, semantic only) runs vector search only, with FTS5 receiving a broad keyword slice of semantic_query as a secondary pass.

Component D, Explainability Badge

Each result in the note list shows how it was found. Built deterministically from the compiled query, no extra LLM call. The badge renders alongside existing search result rows in packages/app-desktop/gui/NoteList/:

function buildExplainBadge(tq: TranslatedQuery, path: 'fts' | 'vector' | 'both'): string {
  const parts: string[] = [];
  if (tq.filters.date_from) {
    const from = String(tq.filters.date_from).slice(0, 4);
    const to   = String(tq.filters.date_to ?? '').slice(0, 4);
    parts.push(`date: ${from}–${to}`);
  }
  if (tq.filters.tag)      parts.push(`tag: ${tq.filters.tag}`);
  if (tq.filters.type)     parts.push(`type: ${tq.filters.type}`);
  if (path !== 'fts')      parts.push(`semantic: "${tq.semantic_query}"`);
  return parts.join(' · ');
  // → "date: 2019–2020 · semantic: meeting company"
}

Worked Examples

"Meeting with a company from Germany in 2020 or maybe 2019"

LLM output:  { filters: { date_from: 20190101, date_to: 20201231 },
               semantic_query: "meeting company client" }
 
Compiler:    "created:20190101 -created:20201231 meeting company client"
             → valid filterParser.parseQuery() input
 
FTS5 path:   date-filtered SQL → candidates in 2019–2020 range
Vector path: embed("meeting company client") → semantic matches
RRF merge:   "Berlin client sync call – Jan 2020" ranks 1 (in both paths)
Badge:       "date: 2019–2020 · semantic: meeting company"

"List of tasks I wrote for the website redesign"

LLM output:  { filters: { type: "todo" },
               semantic_query: "website redesign tasks checklist" }
 
Compiler:    "type:todo website redesign tasks checklist"
 
FTS5 path:   type:todo filter → to-do notes only
Vector path: embed("website redesign tasks") → semantic matches
Badge:       "type: todo · semantic: website redesign"

"Poetry lines about the moon"

LLM output:  { filters: {}, semantic_query: "poetry moon verse night" }
 
Compiler:    no filter terms → semantic_query used as keyword slice for FTS
FTS5 path:   keyword search → 0 results (word "moon" absent from note text)
Vector path: embed("poetry moon verse night") → haiku surfaces via semantic proximity
Badge:       "semantic: poetry moon verse"

Threat Model & Operational Risks

Risk	Detail	Mitigation
Encrypted notebooks	vectra index stored in `joplin.plugins.dataDir()` (or equivalent core path), separate from sync	Index never synced; re-built locally on each device
First-time index cost	10,000 notes × embedding time could be minutes	Background worker with progress bar; search still works via FTS5 during initial build
Background CPU on battery	Continuous embedding worker drains battery	Idle-only scheduling; pause on battery (configurable)
Large vault memory	vectra loads full index (~75 MB for 10k notes)	Configurable chunk limit; mentor review on final threshold
LLM unavailable	Ollama not running; API key missing	Silent fallback to FTS5; one-time settings notice

UI Integration

Query type	Indicator	Engine
AI Search disabled (default off until configured)	None — identical to today	Existing FTS5
AI Search enabled + structured syntax (`tag:work`)	None — passes through	Existing FTS5
AI Search enabled + natural language	Animated `✦ AI` badge while translating	Hybrid search
LLM unavailable	One-time settings notice	FTS5 fallback

New settings under Tools → Options → Search:

Setting	Default	Purpose
AI Search: enabled	false	Opt-in toggle, off until user configures an LLM
AI Search: LLM backend	Ollama	Local or cloud LLM selection
AI Search: endpoint / API key	http://localhost:11434	Ollama URL or cloud API key
AI Search: show explainability badge	true	Show/hide result annotations

Potential Challenges & Mitigations

Challenge	Mitigation
Transformers.js WASM in Electron/Webpack	PoC in Community Bonding week 1. `@xenova/transformers` appears in Joplin's existing worker infrastructure, validate production build path before committing.
Date filter upper bound semantics	`queryBuilder.ts` uses negated `created:` for `<` semantics. Exact range pattern confirmed with mentor / codebase reading before Phase 2.
vectra memory on large vaults	Configurable chunk limit. Idle-only background loading. Final bundle size reviewed with mentor.
LLM translation latency	Loading indicator shown immediately. FTS5 path returns near-instantly in parallel.
Malformed JSON from LLM	Strict schema validation with `zod` or equivalent. On failure: silent FTS5 fallback.
Empty semantic query / filters-only	Skip vector leg; run FTS5 with compiled filter terms only.
Core vs plugin integration	Architecture is identical either way. Confirm preferred integration point with @malekhavasi in Community Bonding week 1.

Implementation Plan

Total: 350 hours across 14 weeks + Community Bonding

Community Bonding (May 4 – May 26)

Confirm integration point (core vs plugin) with @malekhavasi
Read queryBuilder.ts genericFilter + dateFilter end-to-end; confirm exact date range syntax
Validate @xenova/transformers WASM loading inside Joplin's Electron/Webpack build with a minimal PoC
Confirm note change event hook approach with mentor (ItemChange / syncTables path)
Prototype the LLM translator with a 10-query smoke test against real Joplin notes
Build small labelled query set (aim for ~30–50 queries) as initial evaluation baseline

Phase 1 — Embedding & Vector Indexing (May 27 – Jun 24) · ~100 hrs

Week	Tasks
Week 1	Heading-aware chunking, 400-token sliding window fallback, merge tiny chunks; unit tests
Week 2	`@xenova/transformers` BGE-small-en-v1.5 in background worker; WASM loading in production build
Week 3	vectra integration; metadata filters (notebook, tag, type); CRUD pipeline; index stored in core data dir
Week 4	SHA-256 change detection; note-save hook via Joplin's `ItemChange` / `syncTables`; periodic polling for bulk imports

Phase 2 — LLM Query Translator & Compiler (Jun 25 – Jul 15) · ~75 hrs

Week	Tasks
Week 5	`LLMClient` interface; Ollama backend; system prompt engineering; `zod` JSON schema validation + fallback
Week 6	OpenAI and Anthropic API backends; unified swappable interface; settings API key integration
Week 7	`compileToJoplinTerms()` query compiler; `shouldUseHybridSearch()` router; full integration with `SearchEngine.search()`

Phase 3 — Hybrid Execution & RRF (Jul 16 – Aug 5) · ~75 hrs

Week	Tasks
Week 8	Compiler output → `filterParser.parseQuery()` → `queryBuilder.ts` FTS5 path; date range validation
Week 9	Parallel execution of both paths; RRF merger; filters-only / semantic-only edge case handling
Week 10	Explainability badge in note list UI (`packages/app-desktop/gui/NoteList/`); progress indicator in search bar

Phase 4 — Settings, Testing & Docs (Aug 6 – Sep 1) · ~100 hrs

Week	Tasks
Week 11	Settings panel (enable toggle, backend selector, API key/endpoint, badge toggle); opt-in default
Week 12	Evaluation against labelled query set, Hit@3 and MRR vs FTS5 baseline
Week 13	Large vault testing (10,000+ notes); latency profiling; first-time index UX; regression pass
Week 14	User documentation; developer guide for adding LLM backends; PR cleanup; merge

Deliverables

#	Deliverable	Description
01	Query Router	`shouldUseHybridSearch()`, opt-in gate with structured syntax passthrough.
02	LLM Query Translator	Single-call JSON extractor. Ollama, OpenAI, Anthropic backends behind a unified `LLMClient` interface.
03	Query Compiler	`compileToJoplinTerms()`, converts `TranslatedQuery` to valid `filterParser` input using Joplin's actual date/filter syntax (YYYYMMDD).
04	Embedding Indexer	BGE-small-en-v1.5 via `@xenova/transformers` WASM. Heading-aware chunking. SHA-256 incremental change detection. Background worker.
05	vectra Vector Store	Pure-TypeScript local index. No native dependencies. Metadata filtering. Built-in BM25.
06	Hybrid Executor	Parallel FTS5 + vector execution. Handles filters-only and semantic-only edge cases. Result deduplication.
07	RRF Merger	Mathematically merges ranked lists using standard k=60 constant.
08	Explainability Badge	Deterministic result annotation in note list UI. No extra LLM call.
09	Settings UI	Opt-in enable toggle, backend selector, API key/endpoint config, badge toggle.
10	Documentation	User guide; developer guide for adding LLM backends; inline code docs throughout.

Availability


Hours per week	40–45
Timezone	IST (UTC+5:30)
Competing commitments	None, no conflicting internships or jobs during GSoC
College coursework	Will not affect committed GSoC hours
Communication	Weekly progress updates on the Joplin forum; available on Discord throughout

AI Assistance Disclosure

I used Claude (Anthropic) as an AI tool during the preparation of this proposal.

What AI was used for:

Drafting and structuring proposal sections based on my technical research
Improving clarity and wording of explanations I had already reasoned through
Formatting tables, code block layout, and section organisation

What AI was NOT used for:

The codebase research. I read filterParser.ts, SearchEngine.ts, and
queryBuilder.ts directly and verified all technical details (YYYYMMDD date
format, ItemChange/syncTables hooks, vectra vs sqlite-vec constraint)
myself before any drafting began
The architectural decisions. The choice of hybrid self-querying over agent loops,
the shouldUseHybridSearch() routing logic, and the compileToJoplinTerms()
compiler step came from my own analysis of the codebase and the forum discussion

Confirmation:

I have reviewed the entire proposal carefully
I understand every piece of code and every architectural decision in it
I can explain and justify any section in a live discussion

Topic		Replies	Views
GSoC 2026: Opportunities for the AI projects GSoC	32	700	13 April 2026
LangChain / LlamaIndex Joplin integrations for developing AI apps Apps	3	1401	7 May 2025
Welcome to GSoC 2026 with Joplin! GSoC	155	2001	1 April 2026
Proposal: A local-based semantic search engine Integration Features	1	231	6 March 2025
AI powered searching of notes? Features	15	2164	19 November 2024