Links
- Project idea: https://github.com/joplin/gsoc/blob/main/ideas/2026.md#1-ai-supported-search-for-notes
- GitHub profile: https://github.com/Amirtha-yazhini
- Forum introduction post: Welcome to GSoC 2026 with Joplin! - #120 by AmirthaYazhini
- Pull requests submitted to Joplin:
  - https://github.com/laurent22/joplin/pull/14865 — Mobile: Fixes #14835: Upgrade react-native-popup-menu to remove deprecated SafeAreaView warning (open, reviewed by @personalizedrefrigerator)
- Other relevant development experience:
  - Proof-of-concept AI Search plugin (built as part of this proposal): https://github.com/Amirtha-yazhini/joplin-plugin-ai-search — functional plugin with neural embeddings, incremental indexing, cosine similarity search, and a React search panel running in Joplin desktop
  - Custom search engine (personal project): web crawling + TF-IDF + PageRank implemented from scratch in Python
-
1. Introduction
I am Amirtha Yazhini M, a Computer Science and Software Engineering student based in Coimbatore, India. My core skills are TypeScript, React, and applied machine learning, and I have hands-on experience building information retrieval systems — most notably a custom search engine combining web crawling, TF-IDF ranking, and PageRank, which is directly applicable to this project.
Since discovering Joplin's GSoC programme I have made concrete contributions to the codebase. I submitted PR #14865 (Mobile: Fixes #14835 — upgrading react-native-popup-menu from 0.17.0 to 0.19.0 to fix a deprecated SafeAreaView CI warning, migrating the existing accessibility patch, and adding a Jest test suite). The PR was reviewed by @personalizedrefrigerator and I addressed all feedback the same day.
I have also built a working proof-of-concept plugin for this exact GSoC idea — available at https://github.com/Amirtha-yazhini/joplin-plugin-ai-search — which demonstrates a functional search panel, settings registration, incremental note indexing, and semantic search using real neural embeddings. The PoC was built specifically to validate the key technical risks before writing this proposal, not after.
My open-source journey began with this GSoC application and I am committed to growing as a contributor beyond the programme. I am a regular Joplin user and have a personal stake in making note retrieval smarter.
2. Project Summary
Problem: Joplin's existing search engine is keyword-based — it works when the user remembers the exact words in a note, but fails when memory is vague or the query is expressed in natural language. A user might remember "the note about a meeting with a German company around 2019 or 2020" without recalling any searchable keyword. The current engine cannot handle this.
Why it matters: As a note collection grows into the hundreds or thousands, discoverability becomes the primary pain point. A natural-language search engine transforms Joplin from a static archive into an active knowledge retrieval system.
Alignment with mentor discussion: I have read the GSoC 2026 mentor forum thread (Opportunities for the AI projects). @shikuz and @laurent proposed that all AI projects should target the same embedding interface — "build it once and build it well." This proposal is designed from the ground up to be that foundation layer: one embedding index, one memory budget, one incremental update pipeline that other plugins (chat, categorisation, note graphs) can consume via a well-defined put(note)/query(text) interface. The glue code between my implementation and the AI backend will be explicitly swappable as Laurent requested.
What will be implemented:
- Semantic embedding pipeline using all-MiniLM-L6-v2 via Transformers.js running in the panel webview — fully local, no native dependencies, bundled model weights (~22MB), works offline
- Persistent vector store with incremental updates (pure JavaScript cosine similarity, no native dependencies)
- Structure-aware chunking: notes split on Markdown headings (H1/H2/H3) with the note title + heading path prepended to each chunk; sections exceeding the model's 256-token window split further on paragraph boundaries with 64-token overlap; chunks under 20 tokens skipped
- Hybrid ranking using Reciprocal Rank Fusion (RRF) combining semantic similarity with Joplin's existing FTS4 keyword results via the Data API
- Three-source incremental sync: onNoteChange() for the current note, Events API cursor for all note changes, 5-minute polling fallback
- Query classification routing each query to the semantic, keyword, or hybrid engine based on its form
- React search panel with natural-language input, result cards with relevance scores, heading breadcrumbs, and match-signal badges (Semantic / Keyword / Hybrid)
- Settings panel for model configuration, hybrid mode toggle, and index management
- Swappable embedding interface (put/query) designed for future shared-infrastructure compatibility
- Stretch goal: MCP tool schema definitions exposing search_notes and query_embeddings as described by Laurent in the forum
Expected outcome: An installable .jpl plugin distributed through the Joplin plugin repository. The user opens a sidebar, types a natural-language query, and receives ranked result cards with relevance scores. Clicking a card navigates to that note at the matched heading. All embedding inference runs locally — no API key required.
Out of scope:
- Cloud-based or third-party embedding APIs (all inference is local)
- Replacing the existing keyword search engine (the AI search supplements it)
- Mobile platform optimisation (desktop is the primary target)
- Requiring other GSoC projects to depend on this plugin's completion
3. Technical Approach
Architecture — four loosely coupled layers:
- EmbeddingService — loads all-MiniLM-L6-v2 via Transformers.js in the panel webview from bundled local model files. Swappable backend: the interface is embed(text): Promise<number[]>, so a future shared infrastructure project can replace it transparently.
- VectorStore — persistent cosine similarity index stored as a JSON file in joplin.plugins.dataDir(). Pure JavaScript, no native dependencies. Exposes upsert(noteId, vector, metadata) and search(queryVector, topK), matching the shared interface proposed by @shikuz. Hash-based change detection ensures only modified chunks are re-embedded.
- SearchCoordinator — lightweight query classifier (Joplin syntax tokens → keyword only; 1–2 words → hybrid; longer natural language → semantic). Hybrid results are merged via Reciprocal Rank Fusion (RRF, k=60).
- SemanticSearchPanel — React component with debounced search input, result cards with relevance scores, heading breadcrumbs, match-signal badges (Semantic / Keyword / Hybrid), a progress indicator, and an "Index All Notes" button.
Changes to the Joplin codebase: None. The plugin uses only the official Joplin plugin API (joplin.views.panels, joplin.data, joplin.settings, joplin.commands, joplin.workspace). Joplin core is not modified.
Embedding approach (consistent across all sections):
- Model: all-MiniLM-L6-v2 (384 dimensions, ~22MB, 256-token context window)
- Runtime: Transformers.js loaded in the panel's Electron webview from bundled local model weights
- Chunking: notes split on Markdown headings (H1/H2/H3); note title + full heading path prepended to each chunk (e.g. "Linux Server Setup > PostgreSQL: [chunk text]"); sections exceeding the model's 256-token window split further on paragraph boundaries with 64-token overlap; chunks under 20 tokens skipped
- Storage: pure JavaScript cosine similarity over Float32 vectors persisted to the plugin dataDir
Libraries and technologies:
- Transformers.js — local inference in the panel webview, no native dependencies
- Pure JavaScript cosine similarity — avoids plugin sandbox bundling constraints
- Joplin Data API — FTS4 keyword search for hybrid ranking
- Joplin Events API — cursor-based incremental sync across all notes
Potential challenges and mitigations (informed by PoC development):
| Challenge | Mitigation |
|---|---|
| Native modules cannot load in plugin webpack sandbox | Confirmed during PoC: solution is Transformers.js in panel webview with bundled model weights — fully working in PoC |
| onNoteChange() only fires for currently selected note | Use Events API cursor (/events endpoint) for all-note incremental sync, supplemented by onSyncComplete() and 5-minute polling |
| Initial indexing latency for large collections | Background indexing with batch-and-yield pattern (10 notes then yield to event loop); progress indicator; partial progress persists to disk on restart |
| Model size (~22MB bundled) | One-time install cost; works fully offline after installation |
| Note update events during bulk import | Hash-based change detection; only re-embed changed chunks |
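The batch-and-yield mitigation above can be sketched as a small helper. The signature is illustrative; embedNote stands in for the real per-note embedding work:

```typescript
// Index notes in small batches, yielding to the event loop between batches
// so panel rendering and user input are not starved during bulk indexing.
async function indexAll(
  noteIds: string[],
  embedNote: (id: string) => Promise<void>,
  batchSize = 10,
  onProgress?: (done: number, total: number) => void,
): Promise<void> {
  for (let i = 0; i < noteIds.length; i += batchSize) {
    const batch = noteIds.slice(i, i + batchSize);
    for (const id of batch) await embedNote(id);
    onProgress?.(Math.min(i + batchSize, noteIds.length), noteIds.length);
    // Yield: a zero-delay timer lets queued UI work run before the next batch.
    await new Promise((resolve) => setTimeout(resolve, 0));
  }
}
```

In the real plugin the onProgress callback would drive the panel's progress indicator, and completed batches would be flushed to disk so partial progress survives a restart.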
Testing strategy:
- Unit tests (Jest): EmbeddingService chunking edge cases (nested headings, long sections, code blocks, empty notes), VectorStore CRUD, query classifier heuristics, RRF fusion
- Integration tests: end-to-end query against a 50-note synthetic corpus; verify the top-1 result matches the expected note
- Regression snapshots: top-5 results for a fixed query set across versions to catch silent degradation
- Evaluation report: Recall@5, MRR, Precision@5, latency P50/P95, and index build time on a 1,000-note collection
Documentation plan:
- User guide: search panel walkthrough, query examples, privacy guarantee (local-only inference), settings reference
- Developer documentation: architecture diagram, module API reference, guide for swapping the embedding backend
- Inline JSDoc for all exported functions and plugin API call sites
4. Implementation Plan
Community Bonding (May)
- Deep-dive into the Joplin plugin API, Events API, and existing SearchEngine internals
- Finalise embedding backend strategy and chunking parameters based on mentor guidance
- Align on the shared interface API (put/query) and MCP stretch goal scope
- Iterate on the PoC plugin based on mentor feedback
Weeks 1–2
- Production EmbeddingService: Transformers.js in the panel webview, bundled model weights
- Structure-aware Markdown heading chunker with heading-path prepending
- Unit tests for chunking logic and edge cases (nested headings, code blocks, empty notes)
- Benchmark CPU performance on a 500-note test collection
Weeks 3–4
- VectorStore: pure JS cosine similarity, upsert, delete, disk persistence, hash-based change detection
- Three-source incremental sync: onNoteChange() fast path + Events API cursor + polling fallback
- Unit and integration tests for VectorStore CRUD and sync logic
Weeks 5–6
- SearchCoordinator: query classifier, Joplin FTS4 dispatch via the Data API, RRF fusion (k=60)
- Unit tests for classifier heuristics and RRF edge cases
- Manual testing across 20+ diverse query types
Weeks 7–8 (Midterm)
- SemanticSearchPanel React component: result cards with snippets, heading breadcrumbs, match-signal badges, dark/light theme support
- Midterm deliverable: end-to-end demo — a natural-language query returns semantically relevant notes with heading navigation
Weeks 9–10
- Background indexing with batch-and-yield, progress indicator, cancel button, persistent partial progress
- Performance benchmarking on a 1,000-note collection; optimise if needed
- Edge cases: image-only notes, non-English content, very large collections
Weeks 11–12
- Settings panel: provider dropdown, hybrid mode toggle, privacy disclosure
- Regression test suite with fixed query snapshots
- Evaluation report: Recall@5, MRR, latency benchmarks
- Begin MCP stretch goal: tool schema definitions for search_notes and query_embeddings
Weeks 13–14
- User documentation and in-app help text
- Developer documentation: architecture, module API, model swap guide, inline JSDoc
- Code review cycles with mentors; address all feedback
Week 15 (Final)
- Final polish, bug fixes, and code cleanup
- Submit the final work product report
- Record a demo screencast and announce the plugin in the Joplin forum's Plugins category
5. Deliverables
- EmbeddingService with Transformers.js in the panel webview, bundled model weights, and a swappable backend interface
- Structure-aware chunker with Markdown heading splitting, heading-path prepending, and hash-based change detection
- VectorStore with a persistent pure-JS cosine similarity index and incremental update support
- SearchCoordinator with query classification, Joplin FTS4 integration, and RRF hybrid ranking
- SemanticSearchPanel React component with result cards, heading breadcrumbs, match-signal badges, and theme support
- Three-source incremental sync (onNoteChange + Events API cursor + polling)
- Plugin settings panel with model, hybrid mode, and privacy controls
- Full test suite — unit, integration, regression, and an evaluation report (>80% coverage on new modules)
- User and developer documentation including inline JSDoc and an architecture diagram
- Working PoC plugin (already complete): https://github.com/Amirtha-yazhini/joplin-plugin-ai-search
- Stretch goal: MCP tool schema definitions for search_notes and query_embeddings
- Demo screencast and community forum post
6. Availability
- Weekly availability: ~40 hours/week (full-time commitment)
- Working hours: 5 PM – 12 AM IST on weekdays (after university classes); flexible all day on weekends
- Time zone: Indian Standard Time (IST, UTC+5:30)
- Mentor availability: available for daily async updates and weekly video calls at any time convenient for mentors
- Other commitments: none — no internship, part-time work, or travel planned during the GSoC period
- Classes: university classes run 9 AM – 5 PM IST on weekdays; this time is not available for GSoC work
- Contact: amirthayazhini.m@gmail.com
AI Assistance Disclosure
In accordance with Joplin's GSoC AI policy, I disclose the following use of AI assistance:
- I used Claude (Anthropic) to help structure and draft sections of this proposal
- I used Claude to help understand the Yarn patch migration workflow during PR #14865
- I used Claude to help debug webpack bundling issues during PoC plugin development
All code in the PoC plugin and the PR was written, reviewed, understood, and manually verified by me. The key technical insights — native module sandbox constraints, Events API incremental sync, structure-aware chunking — were discovered through my own hands-on PoC development, not AI assistance. Claude was not used to generate any code submitted to the Joplin repository.