GSoC 2026 Proposal Draft – Idea 1: AI-Supported Search for Notes – Amirtha Yazhini M

AmirthaYazhini · 25 March 2026 07:49

Links

Project idea: https://github.com/joplin/gsoc/blob/main/ideas/2026.md#1-ai-supported-search-for-notes
GitHub profile: https://github.com/Amirtha-yazhini
Forum introduction post: Welcome to GSoC 2026 with Joplin! - #120 by AmirthaYazhini
Pull requests submitted to Joplin:
- https://github.com/laurent22/joplin/pull/14865 — Mobile: Fixes #14835: Upgrade react-native-popup-menu to remove deprecated SafeAreaView warning (open, reviewed by @personalizedrefrigerator)
Other relevant development experience:
- Proof-of-concept AI Search plugin (built as part of this proposal): https://github.com/Amirtha-yazhini/joplin-plugin-ai-search — functional plugin with neural embeddings, incremental indexing, cosine similarity search, and React search panel running in Joplin desktop
- Custom search engine (personal project): web crawling + TF-IDF + PageRank implemented from scratch in Python

1. Introduction

I am Amirtha Yazhini M, a Computer Science and Software Engineering student based in Coimbatore, India. My core skills are TypeScript, React, and applied machine learning, and I have hands-on experience building information retrieval systems — most notably a custom search engine combining web crawling, TF-IDF ranking, and PageRank, which is directly applicable to this project.

Since discovering Joplin's GSoC programme I have made concrete contributions to the codebase. I submitted PR #14865 (Mobile: Fixes #14835 — upgrading react-native-popup-menu from 0.17.0 to 0.19.0 to fix a deprecated SafeAreaView CI warning, migrating the existing accessibility patch, and adding a Jest test suite). The PR was reviewed by @personalizedrefrigerator and I addressed all feedback the same day.

I have also built a working proof-of-concept plugin for this exact GSoC idea — available at https://github.com/Amirtha-yazhini/joplin-plugin-ai-search — which demonstrates a functional search panel, settings registration, incremental note indexing, and semantic search using real neural embeddings. The PoC was built specifically to validate the key technical risks before writing this proposal, not after.

My open-source journey began with this GSoC application and I am committed to growing as a contributor beyond the programme. I am a regular Joplin user and have a personal stake in making note retrieval smarter.

2. Project Summary

Problem: Joplin's existing search engine is keyword-based — it works when the user remembers the exact words in a note, but fails when memory is vague or the query is expressed in natural language. A user might remember "the note about a meeting with a German company around 2019 or 2020" without recalling any searchable keyword. The current engine cannot handle this.

Why it matters: As a note collection grows into the hundreds or thousands, discoverability becomes the primary pain point. A natural-language search engine transforms Joplin from a static archive into an active knowledge retrieval system.

Alignment with mentor discussion: I have read the GSoC 2026 mentor forum thread (Opportunities for the AI projects). @shikuz and @laurent proposed that all AI projects should target the same embedding interface — "build it once and build it well." This proposal is designed from the ground up to be that foundation layer: one embedding index, one memory budget, one incremental update pipeline that other plugins (chat, categorisation, note graphs) can consume via a well-defined put(note)/query(text) interface. The glue code between my implementation and the AI backend will be explicitly swappable as Laurent requested.

What will be implemented:

Semantic embedding pipeline using all-MiniLM-L6-v2 via Transformers.js running in the panel webview — fully local, no native dependencies, bundled model weights (~22MB), works offline
Persistent vector store with incremental updates (pure JavaScript cosine similarity, no native dependencies)
Structure-aware chunking: notes split on Markdown headings (H1/H2/H3) with note title + heading path prepended to each chunk; sections exceeding 512 tokens split further on paragraph boundaries with 64-token overlap; chunks under 20 tokens skipped
Hybrid ranking using Reciprocal Rank Fusion (RRF) combining semantic similarity with Joplin's existing FTS4 keyword results via the Data API
Three-source incremental sync: onNoteChange() for the current note, Events API cursor for all note changes, 5-minute polling fallback
Smart query classification routing queries to the correct engine based on query type
React search panel with natural-language input, result cards with relevance scores, heading breadcrumbs, and match-signal badges (Semantic / Keyword / Hybrid)
Settings panel for model configuration, hybrid mode toggle, and index management
Swappable embedding interface (put/query) designed for future shared infrastructure compatibility
Stretch goal: MCP tool schema definitions exposing search_notes and query_embeddings as described by Laurent in the forum

Expected outcome: An installable .jpl plugin distributed through the Joplin plugin repository. The user opens a sidebar, types a natural-language query, and receives ranked result cards with relevance scores. Clicking a card navigates to that note at the matched heading. All embedding inference runs locally — no API key required.

Out of scope:

Cloud-based or third-party embedding APIs (all inference is local)
Replacing the existing keyword search engine (the AI search supplements it)
Mobile platform optimisation (desktop is the primary target)
Requiring other GSoC projects to depend on this plugin's completion

3. Technical Approach

Architecture — four loosely coupled layers:

EmbeddingService — loads all-MiniLM-L6-v2 via Transformers.js in the panel webview from bundled local model files. Swappable backend: the interface is embed(text): Promise<number[]> so a future shared infrastructure project can replace it transparently.
VectorStore — persistent cosine similarity index stored as a JSON file in joplin.plugins.dataDir(). Pure JavaScript, no native dependencies. Exposes upsert(noteId, vector, metadata) and search(queryVector, topK), matching the shared interface proposed by @shikuz. Hash-based change detection ensures only modified chunks are re-embedded.
SearchCoordinator — lightweight query classifier (Joplin syntax tokens → keyword only; 1–2 words → hybrid; longer natural language → semantic). Hybrid results merged via Reciprocal Rank Fusion (RRF, k=60).
SemanticSearchPanel — React component with debounced search input, result cards with relevance scores, heading breadcrumbs, match-signal badges (Semantic / Keyword / Hybrid), progress indicator, and "Index All Notes" button.

Changes to the Joplin codebase: None. The plugin uses only the official Joplin plugin API (joplin.views.panels, joplin.data, joplin.settings, joplin.commands, joplin.workspace). Joplin core is not modified.

Embedding approach (consistent across all sections):

Model: all-MiniLM-L6-v2 (384 dimensions, ~22MB, 256-token context window)
Runtime: Transformers.js loaded in the panel's Electron webview from bundled local model weights
Chunking: notes split on Markdown headings (H1/H2/H3); note title + full heading path prepended to each chunk (e.g. "Linux Server Setup > PostgreSQL: [chunk text]"); sections exceeding 512 tokens split further on paragraph boundaries with 64-token overlap; chunks under 20 tokens skipped
Storage: pure JavaScript cosine similarity over Float32 vectors persisted to plugin dataDir

Libraries and technologies:

Transformers.js — local inference in panel webview, no native dependencies
Pure JavaScript cosine similarity — avoids plugin sandbox bundling constraints
Joplin Data API — FTS4 keyword search for hybrid ranking
Joplin Events API — cursor-based incremental sync across all notes

Potential challenges and mitigations (informed by PoC development):

Challenge	Mitigation
Native modules cannot load in plugin webpack sandbox	Confirmed during PoC: solution is Transformers.js in panel webview with bundled model weights — fully working in PoC
onNoteChange() only fires for currently selected note	Use Events API cursor (/events endpoint) for all-note incremental sync, supplemented by onSyncComplete() and 5-minute polling
Initial indexing latency for large collections	Background indexing with batch-and-yield pattern (10 notes then yield to event loop); progress indicator; partial progress persists to disk on restart
Model size (~22MB bundled)	One-time install cost; works fully offline after installation
Note update events during bulk import	Hash-based change detection; only re-embed changed chunks

Testing strategy:

Unit tests (Jest): EmbeddingService chunking edge cases (nested headings, long sections, code blocks, empty notes), VectorStore CRUD, query classifier heuristics, RRF fusion
Integration tests: end-to-end query against a 50-note synthetic corpus; verify top-1 result matches expected note
Regression snapshots: top-5 results for a fixed query set across versions to catch silent degradation
Evaluation report: Recall@5, MRR, Precision@5, latency P50/P95, index build time on a 1,000-note collection

Documentation plan:

User guide: search panel walkthrough, query examples, privacy guarantee (local-only inference), settings reference
Developer documentation: architecture diagram, module API reference, guide for swapping the embedding backend
Inline JSDoc for all exported functions and plugin API call sites

4. Implementation Plan

Community Bonding (May)

Deep-dive into Joplin plugin API, Events API, and existing SearchEngine internals
Finalise embedding backend strategy and chunking parameters based on mentor guidance
Align on shared interface API (put/query) and MCP stretch goal scope
Iterate on PoC plugin based on mentor feedback

Weeks 1–2

Production EmbeddingService: Transformers.js in panel webview, bundled model weights
Structure-aware Markdown heading chunker with heading-path prepending
Unit tests for chunking logic, edge cases (nested headings, code blocks, empty notes)
Benchmark CPU performance on 500-note test collection

Weeks 3–4

VectorStore: pure JS cosine similarity, upsert, delete, disk persistence, hash-based change detection
Three-source incremental sync: onNoteChange() fast path + Events API cursor + polling fallback
Unit and integration tests for VectorStore CRUD and sync logic

Weeks 5–6

SearchCoordinator: query classifier, Joplin FTS4 dispatch via Data API, RRF fusion (k=60)
Unit tests for classifier heuristics and RRF edge cases
Manual testing across 20+ diverse query types

Weeks 7–8 (Midterm)

SemanticSearchPanel React component: result cards with snippets, heading breadcrumbs, match-signal badges, dark/light theme support
Midterm deliverable: end-to-end demo — natural-language query returns semantically relevant notes with heading navigation

Weeks 9–10

Background indexing with batch-and-yield, progress indicator, cancel button, persistent partial progress
Performance benchmarking on 1,000-note collection; optimise if needed
Edge cases: image-only notes, non-English content, very large collections

Weeks 11–12

Settings panel: provider dropdown, hybrid mode toggle, privacy disclosure
Regression test suite with fixed query snapshots
Evaluation report: Recall@5, MRR, latency benchmarks
Begin MCP stretch goal: tool schema definitions for search_notes and query_embeddings

Weeks 13–14

User documentation and in-app help text
Developer documentation: architecture, module API, model swap guide, inline JSDoc
Code review cycles with mentors; address all feedback

Week 15 (Final)

Final polish, bug fixes, and code cleanup
Submit final work product report
Record demo screencast and announce plugin on Joplin forum Plugins category

5. Deliverables

EmbeddingService with Transformers.js in panel webview, bundled model weights, swappable backend interface
Structure-aware chunker with Markdown heading splitting, heading-path prepending, hash-based change detection
VectorStore with persistent pure-JS cosine similarity index and incremental update support
SearchCoordinator with query classification, Joplin FTS4 integration, and RRF hybrid ranking
SemanticSearchPanel React component with result cards, heading breadcrumbs, match-signal badges, and theme support
Three-source incremental sync (onNoteChange + Events API cursor + polling)
Plugin settings panel with model, hybrid mode, and privacy controls
Full test suite — unit, integration, regression, and evaluation report (>80% coverage on new modules)
User and developer documentation including inline JSDoc and architecture diagram
Working PoC plugin (already complete): https://github.com/Amirtha-yazhini/joplin-plugin-ai-search
Stretch goal: MCP tool schema definitions for search_notes and query_embeddings
Demo screencast and community forum post

6. Availability

Weekly availability: ~40 hours/week (full-time commitment)
Working hours: 5 PM – 12 AM IST on weekdays (after university classes); flexible all-day on weekends
Time zone: Indian Standard Time (IST, UTC+5:30)
Mentor availability: Available for daily async updates and weekly video calls at any time convenient for mentors
Other commitments: None — no internship, part-time work, or travel planned during the GSoC period
Classes: University classes run 9 AM – 5 PM IST on weekdays; this time is not available for GSoC work
Contact: amirthayazhini.m@gmail.com

AI Assistance Disclosure

In accordance with Joplin's GSoC AI policy, I disclose the following use of AI assistance:

I used Claude (Anthropic) to help structure and draft sections of this proposal
I used Claude to help understand the Yarn patch migration workflow during PR #14865
I used Claude to help debug webpack bundling issues during PoC plugin development

All code in the PoC plugin and the PR was written, reviewed, understood, and manually verified by me. The key technical insights — native module sandbox constraints, Events API incremental sync, structure-aware chunking — were discovered through my own hands-on PoC development, not AI assistance. Claude was not used to generate any code submitted to the Joplin repository.

malekhavasi · 30 March 2026 10:00

Thanks for the proposal, nice work. Just a quick feedback: please fix the missing sections, make sure to follow the proposal template here: How to Submit Your Proposal Draft and AI Disclosure (Joplin's GSoC AI policy).

On the technical side, your embedding approach and chunking strategy are inconsistent across sections (e.g., different token/overlap settings), so please pick one clear approach and provide a bit more detail on how you plan to implement it. Also, please remove the comparison to another applicant’s proposal, proposals should be evaluated on their own merit, and referencing others doesn’t add value here

AmirthaYazhini · 31 March 2026 02:22

Thank you for the feedback @malekhavasi . I have updated the proposal — fixed the embedding/chunking inconsistency (now consistently Transformers.js in panel webview, 512-token sections with 64-token overlap, heading-aware splitting), added the AI Disclosure, added the Expected Outcome and Changes to Joplin codebase fields, and removed the comparison to another applicant. Please let me know if anything else needs to be addressed.

Topic		Replies	Views
GSoC 2026: Opportunities for the AI projects GSoC	40	1296	19 June 2026
Welcome to GSoC 2026 with Joplin! GSoC	154	2697	1 April 2026
GSoC 2026 Proposal Draft – Idea 4: Chat with your note collection using AI GSoC	0	29	31 March 2026
Question regarding GSoC 2026 GSoC	7	137	25 March 2026
GSOC Idea # 1 GSoC	7	158	31 March 2026