GSoC 2026 Proposal Draft – Idea 1: AI-Supported Search for Notes – Amirtha Yazhini M

Links


1. Introduction

I am Amirtha Yazhini M, a Computer Science and Software Engineering student based in Coimbatore, India. My core skills are TypeScript, React, and applied machine learning, and I have hands-on experience building information retrieval systems — most notably a custom search engine combining web crawling, TF-IDF ranking, and PageRank, which is directly applicable to this project.

Since discovering Joplin's GSoC programme, I have made concrete contributions to the codebase. I submitted PR #14865 (Mobile: Fixes #14835), which upgrades react-native-popup-menu from 0.17.0 to 0.19.0 to fix a deprecated-SafeAreaView CI warning, migrates the existing accessibility patch, and adds a Jest test suite. The PR was reviewed by @personalizedrefrigerator and I addressed all feedback the same day.

I have also built a working proof-of-concept plugin for this exact GSoC idea — available at https://github.com/Amirtha-yazhini/joplin-plugin-ai-search — which demonstrates a functional search panel, settings registration, incremental note indexing, and semantic search using real neural embeddings. The PoC was built specifically to validate the key technical risks before writing this proposal, not after.

My open-source journey began with this GSoC application and I am committed to growing as a contributor beyond the programme. I am a regular Joplin user and have a personal stake in making note retrieval smarter.


2. Project Summary

Problem: Joplin's existing search engine is keyword-based — it works when the user remembers the exact words in a note, but fails when memory is vague or the query is expressed in natural language. A user might remember "the note about a meeting with a German company around 2019 or 2020" without recalling any searchable keyword. The current engine cannot handle this.

Why it matters: As a note collection grows into the hundreds or thousands, discoverability becomes the primary pain point. A natural-language search engine transforms Joplin from a static archive into an active knowledge retrieval system.

Alignment with mentor discussion: I have read the GSoC 2026 mentor forum thread (Opportunities for the AI projects). @shikuz and @laurent proposed that all AI projects should target the same embedding interface — "build it once and build it well." This proposal is designed from the ground up to be that foundation layer: one embedding index, one memory budget, one incremental update pipeline that other plugins (chat, categorisation, note graphs) can consume via a well-defined put(note)/query(text) interface. The glue code between my implementation and the AI backend will be explicitly swappable as Laurent requested.
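As a concrete illustration, the shared put/query interface could look like the sketch below. The type names and the in-memory reference implementation are this proposal's assumptions, not an API agreed with the mentors; the real backend would be swappable behind the same shape.

```typescript
// Sketch of the proposed shared embedding interface. The names
// (EmbeddingIndex, ScoredNote, InMemoryIndex) are illustrative assumptions.
interface ScoredNote {
  noteId: string;
  score: number; // cosine similarity in [-1, 1]
}

interface EmbeddingIndex {
  /** Embed and index (or re-index) one note. */
  put(noteId: string, text: string): Promise<void>;
  /** Return the topK most similar notes for a natural-language query. */
  query(text: string, topK: number): Promise<ScoredNote[]>;
}

/** Cosine similarity between two equal-length vectors. */
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

/** In-memory reference implementation over any embed() backend. */
class InMemoryIndex implements EmbeddingIndex {
  private vectors = new Map<string, number[]>();
  constructor(private embed: (text: string) => Promise<number[]>) {}

  async put(noteId: string, text: string): Promise<void> {
    this.vectors.set(noteId, await this.embed(text));
  }

  async query(text: string, topK: number): Promise<ScoredNote[]> {
    const q = await this.embed(text);
    return [...this.vectors.entries()]
      .map(([noteId, v]) => ({ noteId, score: cosine(q, v) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topK);
  }
}
```

Any consumer (chat, categorisation, note graphs) would depend only on EmbeddingIndex, so swapping the glue code to a shared AI backend changes no caller.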

What will be implemented:

  • Semantic embedding pipeline using all-MiniLM-L6-v2 via Transformers.js running in the panel webview — fully local, no native dependencies, bundled model weights (~22MB), works offline

  • Persistent vector store with incremental updates (pure JavaScript cosine similarity, no native dependencies)

  • Structure-aware chunking: notes split on Markdown headings (H1/H2/H3) with note title + heading path prepended to each chunk; sections exceeding 512 tokens split further on paragraph boundaries with 64-token overlap; chunks under 20 tokens skipped

  • Hybrid ranking using Reciprocal Rank Fusion (RRF) combining semantic similarity with Joplin's existing FTS4 keyword results via the Data API

  • Three-source incremental sync: onNoteChange() for the current note, Events API cursor for all note changes, 5-minute polling fallback

  • Query classification that routes each query to the appropriate engine (keyword, hybrid, or semantic)

  • React search panel with natural-language input, result cards with relevance scores, heading breadcrumbs, and match-signal badges (Semantic / Keyword / Hybrid)

  • Settings panel for model configuration, hybrid mode toggle, and index management

  • Swappable embedding interface (put/query) designed for future shared infrastructure compatibility

  • Stretch goal: MCP tool schema definitions exposing search_notes and query_embeddings as described by Laurent in the forum

Expected outcome: An installable .jpl plugin distributed through the Joplin plugin repository. The user opens a sidebar, types a natural-language query, and receives ranked result cards with relevance scores. Clicking a card navigates to that note at the matched heading. All embedding inference runs locally — no API key required.

Out of scope:

  • Cloud-based or third-party embedding APIs (all inference is local)

  • Replacing the existing keyword search engine (the AI search supplements it)

  • Mobile platform optimisation (desktop is the primary target)

  • Requiring other GSoC projects to depend on this plugin's completion


3. Technical Approach

Architecture — four loosely coupled layers:

  • EmbeddingService — loads all-MiniLM-L6-v2 via Transformers.js in the panel webview from bundled local model files. Swappable backend: the interface is embed(text): Promise<number[]> so a future shared infrastructure project can replace it transparently.

  • VectorStore — persistent cosine similarity index stored as a JSON file in joplin.plugins.dataDir(). Pure JavaScript, no native dependencies. Exposes upsert(noteId, vector, metadata) and search(queryVector, topK), matching the shared interface proposed by @shikuz. Hash-based change detection ensures only modified chunks are re-embedded.

  • SearchCoordinator — lightweight query classifier (Joplin syntax tokens → keyword only; 1–2 words → hybrid; longer natural language → semantic). Hybrid results merged via Reciprocal Rank Fusion (RRF, k=60).

  • SemanticSearchPanel — React component with debounced search input, result cards with relevance scores, heading breadcrumbs, match-signal badges (Semantic / Keyword / Hybrid), progress indicator, and "Index All Notes" button.
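The SearchCoordinator's routing and fusion can be sketched as follows. The k=60 constant follows the RRF setting above; the operator regex and word-count thresholds are illustrative placeholders for the real heuristics.

```typescript
// Sketch of the SearchCoordinator routing and RRF merge.
// The heuristics below are illustrative; k = 60 follows the proposal text.
type Engine = 'keyword' | 'hybrid' | 'semantic';

function classifyQuery(query: string): Engine {
  // Joplin search operators (e.g. tag:, notebook:), wildcards, or quoted
  // phrases go straight to the existing keyword engine.
  if (/["*]|\b\w+:/.test(query)) return 'keyword';
  const words = query.trim().split(/\s+/).filter(Boolean);
  if (words.length <= 2) return 'hybrid';
  return 'semantic';
}

// Reciprocal Rank Fusion: score(d) = sum over ranked lists of 1 / (k + rank(d)).
function rrfFuse(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankings) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()].sort((a, b) => b[1] - a[1]).map(([id]) => id);
}
```

In hybrid mode, rrfFuse would receive the semantic ranking and Joplin's FTS4 keyword ranking and return a single merged order for the result cards.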

Changes to the Joplin codebase: None. The plugin uses only the official Joplin plugin API (joplin.views.panels, joplin.data, joplin.settings, joplin.commands, joplin.workspace). Joplin core is not modified.

Embedding approach (consistent across all sections):

  • Model: all-MiniLM-L6-v2 (384 dimensions, ~22MB; the underlying transformer accepts inputs up to 512 tokens, matching the 512-token chunk limit used in chunking)

  • Runtime: Transformers.js loaded in the panel's Electron webview from bundled local model weights

  • Chunking: notes split on Markdown headings (H1/H2/H3); note title + full heading path prepended to each chunk (e.g. "Linux Server Setup > PostgreSQL: [chunk text]"); sections exceeding 512 tokens split further on paragraph boundaries with 64-token overlap; chunks under 20 tokens skipped

  • Storage: pure JavaScript cosine similarity over Float32 vectors persisted to plugin dataDir
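A simplified sketch of the heading-aware chunker is below. Whitespace word counts stand in for real tokenizer counts, and the further paragraph-level splitting of oversized sections (with 64-token overlap) is omitted for brevity.

```typescript
// Sketch of the structure-aware chunker. In the real implementation the
// token counts come from the model tokenizer, not whitespace splitting.
interface Chunk {
  headingPath: string[]; // e.g. ['Linux Server Setup', 'PostgreSQL']
  text: string;          // heading path prepended to the section body
}

function chunkNote(title: string, body: string, minTokens = 20): Chunk[] {
  const chunks: Chunk[] = [];
  let path: string[] = [];   // current H1/H2/H3 heading stack
  let buffer: string[] = [];

  const flush = () => {
    const text = buffer.join('\n').trim();
    buffer = [];
    // Skip chunks under the minimum size, as described above.
    if (text.split(/\s+/).filter(Boolean).length < minTokens) return;
    const headingPath = [title, ...path];
    chunks.push({ headingPath, text: `${headingPath.join(' > ')}: ${text}` });
  };

  for (const line of body.split('\n')) {
    const m = /^(#{1,3})\s+(.*)/.exec(line);
    if (m) {
      flush();
      const level = m[1].length; // 1..3
      path = [...path.slice(0, level - 1), m[2].trim()];
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Prepending the full heading path gives each chunk standalone context, so a query like "postgres setup" can match a section even when the note body never repeats the note title.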

Libraries and technologies:

  • Transformers.js — local inference in panel webview, no native dependencies

  • Pure JavaScript cosine similarity — avoids plugin sandbox bundling constraints

  • Joplin Data API — FTS4 keyword search for hybrid ranking

  • Joplin Events API — cursor-based incremental sync across all notes
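The cursor-based sync loop can be sketched as below. The event-page shape ({ items, cursor, has_more }) is an assumption modelled on the Data API's pagination conventions, and fetchPage would wrap the plugin's actual call to the /events endpoint; injecting it keeps the loop unit-testable.

```typescript
// Sketch of cursor-based incremental sync. The page shape is an assumption
// based on Joplin Data API pagination; fetchPage wraps the /events call.
interface NoteEvent { item_id: string; type: number }
interface EventPage { items: NoteEvent[]; cursor: string; has_more: boolean }

async function syncFromCursor(
  fetchPage: (cursor: string) => Promise<EventPage>,
  startCursor: string,
  reindex: (noteId: string) => Promise<void>,
): Promise<string> {
  let cursor = startCursor;
  const seen = new Set<string>(); // re-embed each changed note once per pass
  for (;;) {
    const page = await fetchPage(cursor);
    for (const ev of page.items) {
      if (!seen.has(ev.item_id)) {
        seen.add(ev.item_id);
        await reindex(ev.item_id);
      }
    }
    cursor = page.cursor;
    if (!page.has_more) return cursor; // persisted for the next sync pass
  }
}
```

The returned cursor is stored in the plugin's settings/dataDir so the next pass (triggered by onSyncComplete() or the 5-minute poll) resumes exactly where this one stopped.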

Potential challenges and mitigations (informed by PoC development):

  • Challenge: native modules cannot load in the plugin's webpack sandbox. Mitigation (confirmed during PoC): run Transformers.js in the panel webview with bundled model weights; this is fully working in the PoC.

  • Challenge: onNoteChange() only fires for the currently selected note. Mitigation: use the Events API cursor (/events endpoint) for all-note incremental sync, supplemented by onSyncComplete() and 5-minute polling.

  • Challenge: initial indexing latency on large collections. Mitigation: background indexing with a batch-and-yield pattern (10 notes, then yield to the event loop), a progress indicator, and partial progress persisted to disk across restarts.

  • Challenge: model size (~22MB bundled). Mitigation: a one-time install cost; the plugin works fully offline after installation.

  • Challenge: bursts of note update events during bulk import. Mitigation: hash-based change detection, so only changed chunks are re-embedded.
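The hash-based change detection can be sketched as follows; chunkHash and changedChunks are illustrative names, and the stored hashes would live alongside each chunk's vector in the VectorStore.

```typescript
import { createHash } from 'node:crypto';

// Sketch of hash-based change detection: a chunk is re-embedded only when
// its content hash differs from the one stored alongside its vector.
function chunkHash(text: string): string {
  return createHash('sha256').update(text, 'utf8').digest('hex');
}

/** Return the subset of chunks whose text changed since the last index pass. */
function changedChunks(
  chunks: { id: string; text: string }[],
  storedHashes: Map<string, string>,
): { id: string; text: string }[] {
  return chunks.filter((c) => storedHashes.get(c.id) !== chunkHash(c.text));
}
```

During a bulk import this turns a flood of update events into cheap hash comparisons, with embedding inference paid only for genuinely new or edited content.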

Testing strategy:

  • Unit tests (Jest): EmbeddingService chunking edge cases (nested headings, long sections, code blocks, empty notes), VectorStore CRUD, query classifier heuristics, RRF fusion

  • Integration tests: end-to-end query against a 50-note synthetic corpus; verify top-1 result matches expected note

  • Regression snapshots: top-5 results for a fixed query set across versions to catch silent degradation

  • Evaluation report: Recall@5, MRR, Precision@5, latency P50/P95, index build time on a 1,000-note collection
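The ranking metrics for the evaluation report can be computed as in this sketch, which assumes (for illustration) a single expected note per query.

```typescript
// Sketch of the evaluation metrics. `results` holds the ranked note ids
// returned by the search engine for one query; `expected` is the gold note.
interface EvalCase { expected: string; results: string[] }

/** Recall@k: fraction of queries whose expected note appears in the top k. */
function recallAtK(cases: EvalCase[], k: number): number {
  const hits = cases.filter((c) => c.results.slice(0, k).includes(c.expected));
  return hits.length / cases.length;
}

/** Mean Reciprocal Rank: average of 1/rank of the expected note, 0 if absent. */
function mrr(cases: EvalCase[]): number {
  const total = cases.reduce((sum, c) => {
    const rank = c.results.indexOf(c.expected);
    return sum + (rank >= 0 ? 1 / (rank + 1) : 0);
  }, 0);
  return total / cases.length;
}
```

Running these over the fixed query set after each change gives the regression snapshots a numeric counterpart, so silent ranking degradation shows up as a metric drop.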

Documentation plan:

  • User guide: search panel walkthrough, query examples, privacy guarantee (local-only inference), settings reference

  • Developer documentation: architecture diagram, module API reference, guide for swapping the embedding backend

  • Inline JSDoc for all exported functions and plugin API call sites


4. Implementation Plan

Community Bonding (May)

  • Deep-dive into Joplin plugin API, Events API, and existing SearchEngine internals

  • Finalise embedding backend strategy and chunking parameters based on mentor guidance

  • Align on shared interface API (put/query) and MCP stretch goal scope

  • Iterate on PoC plugin based on mentor feedback

Weeks 1–2

  • Production EmbeddingService: Transformers.js in panel webview, bundled model weights

  • Structure-aware Markdown heading chunker with heading-path prepending

  • Unit tests for chunking logic, edge cases (nested headings, code blocks, empty notes)

  • Benchmark CPU performance on 500-note test collection

Weeks 3–4

  • VectorStore: pure JS cosine similarity, upsert, delete, disk persistence, hash-based change detection

  • Three-source incremental sync: onNoteChange() fast path + Events API cursor + polling fallback

  • Unit and integration tests for VectorStore CRUD and sync logic

Weeks 5–6

  • SearchCoordinator: query classifier, Joplin FTS4 dispatch via Data API, RRF fusion (k=60)

  • Unit tests for classifier heuristics and RRF edge cases

  • Manual testing across 20+ diverse query types

Weeks 7–8 (Midterm)

  • SemanticSearchPanel React component: result cards with snippets, heading breadcrumbs, match-signal badges, dark/light theme support

  • Midterm deliverable: end-to-end demo — natural-language query returns semantically relevant notes with heading navigation

Weeks 9–10

  • Background indexing with batch-and-yield, progress indicator, cancel button, persistent partial progress

  • Performance benchmarking on 1,000-note collection; optimise if needed

  • Edge cases: image-only notes, non-English content, very large collections

Weeks 11–12

  • Settings panel: model selection, hybrid mode toggle, privacy disclosure, and index management

  • Regression test suite with fixed query snapshots

  • Evaluation report: Recall@5, MRR, latency benchmarks

  • Begin MCP stretch goal: tool schema definitions for search_notes and query_embeddings

Weeks 13–14

  • User documentation and in-app help text

  • Developer documentation: architecture, module API, model swap guide, inline JSDoc

  • Code review cycles with mentors; address all feedback

Week 15 (Final)

  • Final polish, bug fixes, and code cleanup

  • Submit final work product report

  • Record demo screencast and announce plugin on Joplin forum Plugins category


5. Deliverables

  • EmbeddingService with Transformers.js in panel webview, bundled model weights, swappable backend interface

  • Structure-aware chunker with Markdown heading splitting, heading-path prepending, hash-based change detection

  • VectorStore with persistent pure-JS cosine similarity index and incremental update support

  • SearchCoordinator with query classification, Joplin FTS4 integration, and RRF hybrid ranking

  • SemanticSearchPanel React component with result cards, heading breadcrumbs, match-signal badges, and theme support

  • Three-source incremental sync (onNoteChange + Events API cursor + polling)

  • Plugin settings panel with model, hybrid mode, and privacy controls

  • Full test suite — unit, integration, regression, and evaluation report (>80% coverage on new modules)

  • User and developer documentation including inline JSDoc and architecture diagram

  • Working PoC plugin (already complete): https://github.com/Amirtha-yazhini/joplin-plugin-ai-search

  • Stretch goal: MCP tool schema definitions for search_notes and query_embeddings

  • Demo screencast and community forum post


6. Availability

  • Weekly availability: ~40 hours/week (full-time commitment)

  • Working hours: 5 PM – 12 AM IST on weekdays (after university classes); flexible all-day on weekends

  • Time zone: Indian Standard Time (IST, UTC+5:30)

  • Availability to mentors: daily async updates and weekly video calls at any time convenient for mentors

  • Other commitments: None — no internship, part-time work, or travel planned during the GSoC period

  • Classes: University classes run 9 AM – 5 PM IST on weekdays; this time is not available for GSoC work

  • Contact: amirthayazhini.m@gmail.com


AI Assistance Disclosure

In accordance with Joplin's GSoC AI policy, I disclose the following use of AI assistance:

  • I used Claude (Anthropic) to help structure and draft sections of this proposal

  • I used Claude to help understand the Yarn patch migration workflow during PR #14865

  • I used Claude to help debug webpack bundling issues during PoC plugin development

All code in the PoC plugin and the PR was written, reviewed, understood, and manually verified by me. The key technical insights — native module sandbox constraints, Events API incremental sync, structure-aware chunking — were discovered through my own hands-on PoC development, not AI assistance. Claude was not used to generate any code submitted to the Joplin repository.


Thanks for the proposal, nice work. Just one quick piece of feedback: please fix the missing sections and make sure to follow the proposal template here: How to Submit Your Proposal Draft and AI Disclosure (Joplin's GSoC AI policy).

On the technical side, your embedding approach and chunking strategy are inconsistent across sections (e.g., different token/overlap settings), so please pick one clear approach and provide a bit more detail on how you plan to implement it. Also, please remove the comparison to another applicant's proposal; proposals should be evaluated on their own merit, and referencing others doesn't add value here.

Thank you for the feedback, @malekhavasi. I have updated the proposal: fixed the embedding/chunking inconsistency (now consistently Transformers.js in the panel webview, 512-token sections with 64-token overlap, heading-aware splitting), added the AI Disclosure, added the Expected Outcome and Changes to the Joplin Codebase fields, and removed the comparison to another applicant. Please let me know if anything else needs to be addressed.