Links
- Project idea: gsoc/ideas.md at master · joplin/gsoc · GitHub
- Github profile: trueharmonyalan (Alan Biju) · GitHub
- Introduction post: Introducing trueharmonyalan
- Pull request submitted to Joplin: Desktop: Fixes #12877: Viewer word count includes CSS-hidden text causing editor/viewer count mismatch by trueharmonyalan · Pull Request #14893 · laurent22/joplin · GitHub
- No other open-source PRs at this time; personal projects available on GitHub profile.
1. Introduction
My name is Alan Biju, and I am a recent Computer Science and Engineering graduate with hands-on experience developing full-stack applications using technologies directly relevant to Joplin, including React, JavaScript, and TypeScript. As a passionate user of open-source software on my primary Debian system, I have chosen Joplin as the project for my first deep open-source contribution.
To prepare for this, I have been proactively studying the Joplin codebase, which led to the detailed architectural analysis that forms the basis of this proposal. I have also submitted a pull request (#14893) to familiarize myself with the contribution workflow. I am eager to apply my skills to a project I admire and to deliver a high-quality semantic search feature to the Joplin community.
2. Project Summary
What problem it solves
This project provides a way to retrieve notes efficiently and contextually. For a long-time Joplin user, notes accumulate across many notebooks, and the existing search becomes less effective because it matches words, not meaning. If the user cannot remember the exact word they wrote, the search either fails them or returns notes that share a keyword but are not what they are looking for. This wastes time and forces the user to put in extra effort to pick out the right note they have written.
Why it matters to users
This feature makes Joplin more intuitive and convenient for users who treat it as their second brain. For example, a student who takes quick notes on a concept may end up with several notes of varying depth, and over time those notes become difficult to locate individually. The student may remember writing about a concept but cannot recall whether they titled it "light reactions", "energy conversion", or "photosynthesis notes", so the keyword search returns nothing useful. A search that understands meaning rather than exact words finds the right note and is a true time saver.
What will be implemented
The implementation adds an AI-powered semantic search capability to Joplin's existing search system, supplementing the current keyword search. When a user stops typing, Joplin saves the note and logs the change to the item_changes table. A background service implemented by this project watches this table and converts the updated note's text into a vector (a list of numbers representing its meaning) using a pre-trained embedding model. These vectors are stored in a new dedicated note_embeddings table, added to the database via a standard migration script in packages/lib/database-migrations/. No existing tables are modified, ensuring full backwards compatibility.
Expected outcome
By the end of the GSoC period, Joplin will have a working AI-powered semantic search engine integrated into the core application. Users will be able to type a natural language query such as "my notes about the meeting with the German client last year" into the existing search bar and receive results ranked by meaning rather than keyword match. The system will run entirely on-device with no cloud dependency, keeping user notes private.
The implementation will include: a background embedding service that continuously indexes notes into a new note_embeddings database table, an in-memory vector cache for fast similarity search at query time, and a routing layer that decides between semantic and existing FTS search depending on the query. All logic lives in packages/lib/, so Desktop and CLI apps inherit it automatically through Joplin's shared architecture.
Out of Scope
- Mobile support (Android/iOS): the semantic engine is gated on mobile due to memory constraints; mobile falls back to existing FTS silently
- Replacing or modifying the existing keyword FTS pipeline in any way
- Cloud-based embedding APIs: all processing is local by default
- Native SQLite vector extensions (no sqlite-vec or C++ dependencies)
- A chat or Q&A interface: this project is search only, not Idea 4
3. Technical Approach
Architecture Overview
The implementation introduces a new SemanticSearchEngine class that runs as a parallel search lane alongside the existing FTS pipeline. The existing SearchEngine, queryBuilder, and notes_fts infrastructure are completely untouched. Both engines share the same entry point SearchEngineUtils.notesForQuery() and produce results in the same format, so the UI requires no changes.
All AI logic lives exclusively in packages/lib/services/search/SemanticSearchEngine.ts. Because the Desktop and CLI apps both boot through BaseApplication.ts and route searches through the shared packages/lib/ layer, implementing it once covers both apps simultaneously.
Components and Codebase Changes
New file: packages/lib/services/search/SemanticSearchEngine.ts
This is the core class. It owns four responsibilities:
- Background embedding sync via syncEmbeddings()
- An in-memory vector cache (Map<noteId, Float32Array>)
- Query embedding via a dedicated Worker thread
- Dot product similarity search against the cache
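A minimal sketch of how this class could be shaped is shown below. Apart from syncEmbeddings(), which is described in this proposal, the method and field names are illustrative assumptions, not existing Joplin APIs:

```typescript
// Illustrative skeleton only; apart from syncEmbeddings(), names are assumptions.
export default class SemanticSearchEngine {
	// noteId -> 384-dimension vector, kept in memory for fast similarity search
	private cache_: Map<string, Float32Array> = new Map();

	// Runs on a 10-second interval: reads item_changes, embeds changed notes in the
	// worker thread, and writes vectors to both note_embeddings and the in-memory cache.
	public async syncEmbeddings(): Promise<void> {
		// ...batch read, embed, persist
	}

	// Sends the cleaned query text to the worker thread and receives its vector back.
	private async embedQuery(query: string): Promise<Float32Array> {
		// ...postMessage to the worker, await the result (placeholder value here)
		return new Float32Array(384);
	}

	// Ranks all cached vectors by similarity to the query vector.
	public async search(query: string): Promise<string[]> {
		const queryVector = await this.embedQuery(query);
		const scored: [string, number][] = [];
		for (const [noteId, vector] of this.cache_) {
			let score = 0;
			for (let i = 0; i < vector.length; i++) score += vector[i] * queryVector[i];
			scored.push([noteId, score]);
		}
		scored.sort((a, b) => b[1] - a[1]);
		return scored.map(entry => entry[0]);
	}
}
```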
Modified: packages/lib/services/search/SearchEngineUtils.ts
The existing determineSearchType_() function is extended to add a Semantic type. Routing logic follows this priority order:
- Query contains filter syntax (tag:, notebook:, -, "exact phrase") → FTS always
- User has the AI toggle active → Semantic always
- Query contains natural language markers (articles, prepositions) with ≥5 words and no filter syntax → Semantic
- Everything else → FTS (safe default)
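A rough sketch of this routing decision follows. The helper names hasFilterSyntax() and looksLikeNaturalLanguage() are hypothetical, and the marker list is a small illustrative subset; the real change would extend the existing determineSearchType_() rather than add a standalone function:

```typescript
// Illustrative routing sketch; helper names and the marker list are assumptions.
function hasFilterSyntax(query: string): boolean {
	return /(^|\s)(tag:|notebook:|-)\S/.test(query) || query.includes('"');
}

function looksLikeNaturalLanguage(query: string): boolean {
	const markers = ['the', 'a', 'an', 'about', 'with', 'from', 'for', 'in', 'on'];
	const words = query.trim().split(/\s+/);
	return words.length >= 5 && words.some(w => markers.includes(w.toLowerCase()));
}

function determineSearchType(query: string, aiToggleEnabled: boolean): 'fts' | 'semantic' {
	if (hasFilterSyntax(query)) return 'fts';                 // Rule 1: filters always use FTS
	if (aiToggleEnabled) return 'semantic';                   // Rule 2: explicit user override
	if (looksLikeNaturalLanguage(query)) return 'semantic';   // Rule 3: natural language heuristic
	return 'fts';                                             // Rule 4: safe default
}
```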
The AI toggle is a boolean stored in settings (ai.searchEnabled), surfaced as a small button in the existing search bar. No new UI components are required beyond this single control.
New migration: packages/lib/database-migrations/
A standard migration script adds one new table:
```sql
CREATE TABLE note_embeddings (
	note_id TEXT PRIMARY KEY,
	embedding BLOB NOT NULL,
	model_version TEXT NOT NULL,
	updated_time INTEGER NOT NULL
);
```
No existing tables (notes, notes_normalized, notes_fts) are modified. The model_version column protects against dimension mismatches: if the user switches models via ai.activeModel in settings, the table is cleared and re-indexed automatically.
Modified: packages/lib/models/Setting.ts
Two new settings are added:
- ai.searchEnabled: boolean toggle for semantic search
- ai.activeModel: model name string, defaults to all-MiniLM-L6-v2
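For illustration, the two entries might be added to the metadata in Setting.ts roughly as follows. The field names mirror existing entries (SettingItemType is Joplin's existing enum), but the exact shape, section, and label wording are assumptions to be confirmed against the current metadata() implementation, and real entries would use Joplin's translation helper for the labels:

```typescript
// Rough sketch of the new setting entries; exact fields to be confirmed against Setting.ts.
const aiSettings = {
	'ai.searchEnabled': {
		value: false,                      // off by default; enabling triggers the model download
		type: SettingItemType.Bool,
		public: true,
		label: () => 'Enable AI-powered semantic search',
	},
	'ai.activeModel': {
		value: 'all-MiniLM-L6-v2',         // can be swapped for a multilingual model
		type: SettingItemType.String,
		public: true,
		advanced: true,
		label: () => 'Embedding model used for semantic search',
	},
};
```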
How Indexing Works
syncEmbeddings() runs on a 10-second interval, directly mirroring the existing SearchEngine.syncTables_(). It reads from the item_changes table (the same queue that already drives FTS indexing), processes notes in batches of 50, and saves vectors to note_embeddings. The last processed item_change_id is saved to settings after each batch so progress survives app restarts.
The embedding input for each note is constructed deliberately: the note title is included twice followed by the plain text body, with all markdown syntax stripped, code blocks replaced with the literal text "code block", and all URLs and image tags removed. Title repetition gives it implicit higher semantic weight without requiring complex weighted pooling.
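A small sketch of how this input could be built; the function name and the exact stripping rules are illustrative rather than final:

```typescript
// Illustrative sketch of embedding-input construction; name and regexes are assumptions.
function buildEmbeddingInput(title: string, body: string): string {
	const plainBody = body
		.replace(/`{3}[\s\S]*?`{3}/g, 'code block')   // fenced code blocks -> literal placeholder
		.replace(/!\[[^\]]*\]\([^)]*\)/g, '')         // image tags removed
		.replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')      // links: keep the text, drop the URL
		.replace(/https?:\/\/\S+/g, '')               // bare URLs removed
		.replace(/[#>*_`~-]+/g, ' ')                  // remaining markdown syntax stripped
		.replace(/\s+/g, ' ')
		.trim();
	// Repeating the title gives it implicit extra weight in the pooled embedding.
	return `${title}. ${title}. ${plainBody}`;
}
```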
All inference runs inside a dedicated Worker thread (a Web Worker on Electron, worker_threads on CLI/Node), so the main thread never blocks during either background indexing or live search queries.
On first enable, the all-MiniLM-L6-v2 model (~80MB) is downloaded on-demand from Hugging Face and stored inside the user's profile directory under ai-models/<active model>/ (derived from Setting.value('profileDir') and the ai.activeModel setting). It is never bundled with the installer. The model loads during the first syncEmbeddings() run after enabling, so it is already in memory before the user types their first semantic query.
For a user enabling AI search with 5,000 existing notes, initial indexing takes approximately 5–10 minutes of silent background processing at ~40ms per note (5,000 × 40ms ≈ 200 seconds), plus some additional overhead for I/O and batch coordination. The user continues using Joplin normally throughout. A progress indicator in settings shows the current status.
How Search Works
When a semantic query is routed to SemanticSearchEngine, it is first passed through cleanQuery(), which removes punctuation and common English stopwords — articles, pronouns, prepositions, and filler verbs such as 'find', 'show', and 'get'. Cleaning is intentionally lightweight, since the model itself handles semantic de-weighting of weak terms during inference. The cleaned query is then passed to the Worker thread which returns a 384-dimension vector. Similarity is then computed against the in-memory vector cache.
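A lightweight version of this cleaning step might look like the following; the stopword list here is a small illustrative subset, not the final list:

```typescript
// Illustrative query cleaning; the stopword list is a small subset for demonstration.
const STOPWORDS = new Set([
	'a', 'an', 'the', 'my', 'our', 'about', 'with', 'from', 'for', 'in', 'on',
	'find', 'show', 'get', 'me', 'that', 'i', 'wrote',
]);

function cleanQuery(query: string): string {
	return query
		.toLowerCase()
		.replace(/[^\p{L}\p{N}\s]/gu, ' ')   // strip punctuation, keep letters and digits
		.split(/\s+/)
		.filter(word => word && !STOPWORDS.has(word))
		.join(' ');
}
```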
To ensure both performance and flexibility, the system is designed to handle different types of embedding models. The default model, all-MiniLM-L6-v2, produces L2-normalized vectors. This enables a significant performance optimization: the standard cosine similarity calculation (A·B) / (||A||*||B||) simplifies to a much faster dot product (A·B), as the vector magnitudes ||A|| and ||B|| are both 1.
To support future user-selected models that may not be normalized, the engine will check model-specific metadata. If the active model's vectors are normalized, it will use the highly optimized dot product. If not, it will fall back to computing the full, mathematically correct cosine similarity. This ensures both speed by default and accuracy for all supported models. This entire computation, even for 100,000 notes, completes in well under 50ms with no native extensions required.
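The similarity step could be sketched as below, where modelIsNormalized is a hypothetical flag read from the model-specific metadata mentioned above:

```typescript
// Illustrative similarity computation; modelIsNormalized is a hypothetical metadata flag.
function dotProduct(a: Float32Array, b: Float32Array): number {
	let sum = 0;
	for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
	return sum;
}

function similarity(a: Float32Array, b: Float32Array, modelIsNormalized: boolean): number {
	const dot = dotProduct(a, b);
	if (modelIsNormalized) return dot; // ||A|| = ||B|| = 1, so cosine reduces to the dot product
	const normA = Math.sqrt(dotProduct(a, a));
	const normB = Math.sqrt(dotProduct(b, b));
	return dot / (normA * normB);
}
```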
The in-memory cache (Map<noteId, Float32Array>) is loaded once from note_embeddings on the first AI search of a session and stays warm for the entire session. Every time syncEmbeddings() writes a new vector to the database, it also updates the cache in the same operation, keeping them always in sync.
The resulting ranked list of note IDs is passed directly back into SearchEngineUtils.notesForQuery(), which calls Note.previews() to load full note data, applies user preferences (such as hiding completed todos), and dispatches NOTE_UPDATE_ALL to Redux. The note list renders identically to an FTS result.
Libraries and Technologies
| Library | Purpose | Why |
|---|---|---|
| @huggingface/transformers (Transformers.js) | Run embedding model locally in JS/Node | No cloud dependency; works natively in Electron and Node.js |
| all-MiniLM-L6-v2 | Embedding model | ~80MB, 384 dimensions, strong English semantic quality, L2-normalized output |
| SQLite BLOB | Vector storage | Already used by Joplin, no new dependencies introduced |
| Web Worker / worker_threads | Inference off main thread | Built-in browser and Node APIs; no external library needed |
No native C++ extensions. No platform-specific binaries. No cloud APIs by default.
Potential Challenges
Initial indexing time on large collections
A user with 10,000+ notes will wait approximately 10–20 minutes for full initial indexing (at ~40ms per note, plus I/O and batch coordination overhead).
Mitigation: indexing is silent and non-blocking, the existing FTS search continues working throughout, and a progress indicator keeps the user informed.
Worker thread data transfer overhead
Passing large vector arrays between threads involves serialization cost.
Mitigation: use Transferable objects to transfer buffer ownership without copying, reducing overhead to near zero.
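As an illustration, a Float32Array's underlying buffer can be handed over rather than copied. The sketch below uses Node's worker_threads API; the surrounding worker wiring is assumed:

```typescript
// Minimal worker_threads sketch showing buffer transfer instead of copying.
import { parentPort } from 'worker_threads';

function sendEmbedding(vector: Float32Array) {
	// The second argument lists transferables: ownership of the ArrayBuffer moves
	// to the receiving thread, so the vector data is not copied.
	parentPort?.postMessage({ embedding: vector }, [vector.buffer as ArrayBuffer]);
}
```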
Model download failure or interruption
If the ~80MB download is interrupted, the model file will be incomplete.
Mitigation: download to a temporary path, verify the file size on completion, then move it to the final location atomically. Retry automatically on next startup.
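A sketch of this hardening using Node's fs/promises; the function name, paths, and expected-size check are illustrative:

```typescript
// Illustrative download-hardening sketch; names and the size check are assumptions.
import { rename, stat, unlink } from 'fs/promises';

async function finalizeModelDownload(tempPath: string, finalPath: string, expectedBytes: number) {
	const info = await stat(tempPath);
	if (info.size !== expectedBytes) {
		await unlink(tempPath); // incomplete download: discard and retry on next startup
		throw new Error('Model download incomplete, will retry on next startup');
	}
	await rename(tempPath, finalPath); // atomic move into the final models directory
}
```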
Heuristic routing and model quality for non-English users
The natural language detector used in determineSearchType_() is optimized for English, so non-English natural language queries may not be auto-detected and will fall through to FTS. Additionally, the default model all-MiniLM-L6-v2 is English-trained, meaning embedding quality for non-English note content will be lower.
Mitigation: The explicit AI toggle in the search bar gives all users a reliable override for the detection issue. For embedding quality, the architecture already supports model substitution via ai.activeModel. Non-English users can switch to a multilingual model such as paraphrase-multilingual-MiniLM-L12-v2 (note: ~470MB download vs the default ~80MB). This requires making the model download path dynamic based on the active model setting — a minor change already planned as part of the ai.activeModel implementation. Both the size difference and the model swap option will be noted clearly in the user documentation.
Memory usage on lower-end machines
The in-memory vector cache for 10,000 notes uses approximately 15MB of RAM. For 100,000 notes this reaches ~150MB, which may be significant on older hardware. This is a known trade-off for the query-time performance gain and will be documented clearly.
Testing Strategy
Unit tests (packages/lib/services/search/SemanticSearchEngine.test.ts):
- syncEmbeddings() correctly reads item_changes and writes to note_embeddings
- Dot product produces correct scores for known vector pairs
- Model version change triggers table clear and re-index
- Mobile gating returns an empty array, not an error
- Graceful FTS fallback triggers correctly when the model fails to load
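For example, the dot product check could be a plain Jest-style test using Joplin's existing Jest setup in packages/lib; the import path and the exported dotProduct helper are illustrative assumptions:

```typescript
// Illustrative Jest-style unit test; the import path and helper name are assumptions.
import { dotProduct } from './SemanticSearchEngine';

describe('SemanticSearchEngine similarity', () => {
	it('computes the dot product of known vector pairs', () => {
		const a = new Float32Array([1, 0, 0.5]);
		const b = new Float32Array([2, 3, 4]);
		// 1*2 + 0*3 + 0.5*4 = 4
		expect(dotProduct(a, b)).toBeCloseTo(4);
	});
});
```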
Integration tests:
- End-to-end: note created → item_changes → syncEmbeddings() → note_embeddings populated
- Semantic query routed correctly vs a keyword query with filter syntax
- Results from SemanticSearchEngine.search() flow through notesForQuery() and render correctly in the note list
Joplin's existing test infrastructure in packages/lib/ will be used throughout; no new test frameworks are required.
Documentation Plan
Developer documentation: A SEMANTIC_SEARCH.md file in packages/lib/services/search/ explaining the architecture, the embedText() interface for future backend additions, the migration pattern, and how to substitute a different embedding model.
User documentation: An addition to the Joplin help site explaining what semantic search is, how to enable it, what to expect during initial indexing, and how to use the AI toggle. Written in plain language with no technical terms.
4. Implementation Plan
This project is structured into two-week milestones to ensure steady progress and allow for adjustments. The 12-week timeline is designed to cover foundation, implementation, testing, and documentation.
Weeks 1–2: Foundation and Indexing Backend
- Goal: Establish the core database structure and background service.
- Tasks:
- Set up the complete development environment for the Joplin desktop application.
- Create and apply the new database migration script for the note_embeddings table.
- Implement the initial SemanticSearchEngine.ts class structure.
- Implement the syncEmbeddings() background process. This service will read from the item_changes table (similar to the existing FTS syncTables_() process) and, for now, save placeholder data to the new note_embeddings table.
Weeks 3–4: Model and Worker Integration
- Goal: Integrate the machine learning model and ensure it runs without blocking the UI.
- Tasks:
- Integrate the @huggingface/transformers library into the project.
- Implement the on-demand model download logic, storing the model in the user's profile directory.
- Set up the dedicated Worker thread (worker_threads for Node/CLI, Web Worker for Electron) for running model inference.
- Connect the syncEmbeddings() process to the worker, allowing it to pass note text and receive a real vector embedding in return.
Weeks 5–6: Search Logic and Routing
- Goal: Implement the core search functionality and routing logic.
- Tasks:
- Implement the in-memory vector cache (Map<noteId, Float32Array>) that loads from the note_embeddings table.
- Implement the similarity search function, including the logic to switch between fast dot-product and full cosine similarity based on model properties.
- Modify the determineSearchType_() function in SearchEngineUtils.ts to add the new routing rules for semantic queries.
- Connect the search function so that a semantic query successfully returns a ranked list of note IDs.
Weeks 7–8: End-to-End Integration and UI
- Goal: Connect the backend to the UI and complete the user-facing features.
- Tasks:
- Add the new settings (ai.searchEnabled, ai.activeModel) to Setting.ts.
- Add the AI toggle button to the search bar UI and connect it to the ai.searchEnabled setting.
- Ensure the full search pipeline works end-to-end: a user types a query, it is routed correctly, results are returned, and the note list updates.
- Implement the progress indicator for the initial indexing process.
Weeks 9–10: Testing and Robustness
- Goal: Ensure the implementation is reliable and bug-free.
- Tasks:
- Write comprehensive unit tests for SemanticSearchEngine.ts, covering indexing, similarity calculations, and model versioning logic.
- Write integration tests to verify the end-to-end flow from note creation to search result rendering.
- Address the "Potential Challenges" identified, such as implementing Transferable objects for worker efficiency and ensuring the model download is robust.
- Conduct thorough manual testing for performance and edge cases.
Weeks 11–12: Documentation and Final Polish
- Goal: Finalize all project deliverables for submission.
- Tasks:
- Write the developer documentation (SEMANTIC_SEARCH.md) explaining the architecture and extension points.
- Write the user-facing documentation for the Joplin help site.
- Perform a final code review, add comments, and ensure the codebase is clean and maintainable.
- Prepare and submit the final pull request.
5. Deliverables
By the end of the GSoC program, the following will be delivered:
- A fully functional semantic search engine integrated into the Joplin Desktop and CLI applications, capable of understanding natural language queries.
- A background indexing service that automatically and continuously converts note content into vector embeddings.
- A new database table (note_embeddings) managed by a standard migration script to store the vector data persistently.
- An intelligent query router that automatically decides between the existing keyword search and the new semantic search based on the query's structure.
- A complete set of unit and integration tests to ensure the new functionality is robust and reliable.
- Comprehensive documentation, including a technical guide for future developers (SEMANTIC_SEARCH.md) and a user-friendly guide for the Joplin help site.
- A final, polished Pull Request submitted to the main Joplin repository containing all the above features and assets.
6. Availability
- Weekly Availability: I am available to work full-time, approximately 40 hours per week, for the entire duration of the GSoC program.
- Time Zone: IST (Indian Standard Time, UTC+5:30).
- Other Commitments: I have no other academic or professional commitments during the GSoC 2026 period and can dedicate my full attention to this project.