Links
- Project idea: gsoc/ideas.md at master · joplin/gsoc · GitHub
- Github profile: trueharmonyalan (Alan Biju) · GitHub
- Introduction post: Introducing trueharmonyalan
- Pull request submitted to Joplin: Desktop: Fixes #12877: Viewer word count includes CSS-hidden text causing editor/viewer count mismatch by trueharmonyalan · Pull Request #14893 · laurent22/joplin · GitHub
- No other open-source PRs at this time; personal projects available on GitHub profile.
1. Introduction
My name is Alan Biju, and I am a recent Computer Science and Engineering graduate with hands-on experience developing full-stack applications using technologies directly relevant to Joplin, including React, JavaScript, and TypeScript. As a passionate user of open-source software on my primary Debian system, I have chosen Joplin as the project for my first deep open-source contribution.
To prepare for this, I have been proactively studying the Joplin codebase, which led to the detailed architectural analysis that forms the basis of this proposal. I have also submitted a pull request (#14893) to familiarize myself with the contribution workflow. I am eager to apply my skills to a project I admire and to deliver a high-quality semantic search feature to the Joplin community.
2. Project Summary
What problem it solves
This project provides a way to retrieve notes efficiently and contextually. For a long-time Joplin user, notes accumulate across many notebooks, and the existing search becomes less effective because it matches words, not meaning. If the user cannot remember the exact word they wrote, the search either fails them or returns notes that share a keyword but are not what they are looking for. This wastes time and forces the user to put in extra effort to pick out the right note they have written.
Why it matters to users
This feature makes Joplin more intuitive and convenient for users who treat it as their second brain. For example, a student who takes quick notes on a concept may end up with several notes of varying depth, and over time those notes become difficult to locate individually. The student may remember writing about a concept but cannot recall whether they titled it "light reactions", "energy conversion", or "photosynthesis notes", so the keyword search returns nothing useful. A search that understands meaning rather than exact words finds the right note and is a true time saver.
What will be implemented
The implementation adds an AI-powered semantic search capability to Joplin's existing search system, supplementing the current keyword search. When a user stops typing, Joplin saves the note and logs the change to the item_changes table. A background service implemented by this project watches this table and converts the updated note's text into a vector (a list of numbers representing its meaning) using a pre-trained embedding model. These vectors are stored in a new dedicated note_embeddings table, added to the database via a standard migration script in packages/lib/database-migrations/. No existing tables are modified, ensuring full backwards compatibility.
Expected outcome
By the end of the GSoC period, Joplin will have a working AI-powered semantic search engine integrated into the core application. Users will be able to type a natural language query such as "my notes about the meeting with the German client last year" into the existing search bar and receive results ranked by meaning rather than keyword match. The system will run entirely on-device with no cloud dependency, keeping user notes private.
The implementation will include: a background embedding service that continuously indexes notes into a new note_embeddings database table, an in-memory vector cache for fast similarity search at query time, and a routing layer that decides between semantic and existing FTS search depending on the query. All logic lives in packages/lib/, so Desktop and CLI apps inherit it automatically through Joplin's shared architecture.
Out of Scope
- Mobile support (Android/iOS): the semantic engine is gated on mobile due to memory constraints; mobile falls back to existing FTS silently
- Replacing or modifying the existing keyword FTS pipeline in any way
- Cloud-based embedding APIs: all processing is local by default
- Native SQLite vector extensions (no sqlite-vec or C++ dependencies)
- A chat or Q&A interface: this project is search only, not Idea 4
3. Technical Approach
Architecture Overview
The implementation introduces a new SemanticSearchEngine class that runs as a parallel search lane alongside the existing FTS pipeline. The existing SearchEngine, queryBuilder, and notes_fts infrastructure are completely untouched. Both engines share the same entry point SearchEngineUtils.notesForQuery() and produce results in the same format, so the UI requires no changes.
All AI logic lives exclusively in packages/lib/services/search/SemanticSearchEngine.ts. Because the Desktop and CLI apps both boot through BaseApplication.ts and route searches through the shared packages/lib/ layer, implementing it once covers both apps simultaneously.
Components and Codebase Changes
New file: packages/lib/services/search/SemanticSearchEngine.ts
This is the core class. It owns four responsibilities:
- Background embedding sync via syncEmbeddings()
- An in-memory vector cache (Map<noteId, Float32Array>)
- Query embedding via a dedicated Worker thread
- Dot product similarity search against the cache
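A minimal sketch of how this class could be shaped is shown below. Apart from syncEmbeddings(), which is described in this proposal, the method and field names are illustrative assumptions, not existing Joplin APIs:

```typescript
// Illustrative skeleton only; apart from syncEmbeddings(), names are assumptions.
export default class SemanticSearchEngine {
	// noteId -> 384-dimension vector, kept in memory for fast similarity search
	private cache_: Map<string, Float32Array> = new Map();

	// Runs on a 10-second interval: reads item_changes, embeds changed notes in the
	// worker thread, and writes vectors to both note_embeddings and the in-memory cache.
	public async syncEmbeddings(): Promise<void> {
		// ...batch read, embed, persist
	}

	// Sends the cleaned query text to the worker thread and receives its vector back.
	private async embedQuery(query: string): Promise<Float32Array> {
		// ...postMessage to the worker, await the result (placeholder value here)
		return new Float32Array(384);
	}

	// Ranks all cached vectors by similarity to the query vector.
	public async search(query: string): Promise<string[]> {
		const queryVector = await this.embedQuery(query);
		const scored: [string, number][] = [];
		for (const [noteId, vector] of this.cache_) {
			let score = 0;
			for (let i = 0; i < vector.length; i++) score += vector[i] * queryVector[i];
			scored.push([noteId, score]);
		}
		scored.sort((a, b) => b[1] - a[1]);
		return scored.map(entry => entry[0]);
	}
}
```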
Modified: packages/lib/services/search/SearchEngineUtils.ts
The existing determineSearchType_() function is extended to add a Semantic type. Routing logic follows this priority order:
- Query contains filter syntax (tag:, notebook:, -, "exact phrase") → FTS always
- User has the AI toggle active → Semantic always
- Query contains natural language markers (articles, prepositions) with ≥5 words and no filter syntax → Semantic
- Everything else → FTS (safe default)
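A rough sketch of this routing decision follows. The helper names hasFilterSyntax() and looksLikeNaturalLanguage() are hypothetical, and the marker list is a small illustrative subset; the real change would extend the existing determineSearchType_() rather than add a standalone function:

```typescript
// Illustrative routing sketch; helper names and the marker list are assumptions.
function hasFilterSyntax(query: string): boolean {
	return /(^|\s)(tag:|notebook:|-)\S/.test(query) || query.includes('"');
}

function looksLikeNaturalLanguage(query: string): boolean {
	const markers = ['the', 'a', 'an', 'about', 'with', 'from', 'for', 'in', 'on'];
	const words = query.trim().split(/\s+/);
	return words.length >= 5 && words.some(w => markers.includes(w.toLowerCase()));
}

function determineSearchType(query: string, aiToggleEnabled: boolean): 'fts' | 'semantic' {
	if (hasFilterSyntax(query)) return 'fts';                 // Rule 1: filters always use FTS
	if (aiToggleEnabled) return 'semantic';                   // Rule 2: explicit user override
	if (looksLikeNaturalLanguage(query)) return 'semantic';   // Rule 3: natural language heuristic
	return 'fts';                                             // Rule 4: safe default
}
```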
The AI toggle is a boolean stored in settings (ai.searchEnabled), surfaced as a small button in the existing search bar. No new UI components are required beyond this single control.
New migration: packages/lib/database-migrations/
A standard migration script adds one new table:
```sql
CREATE TABLE note_embeddings (
	note_id TEXT PRIMARY KEY,
	embedding BLOB NOT NULL,
	model_version TEXT NOT NULL,
	updated_time INTEGER NOT NULL
);
```
No existing tables (notes, notes_normalized, notes_fts) are modified. The model_version column protects against dimension mismatches: if the user switches models via ai.activeModel in settings, the table is cleared and re-indexed automatically.
Modified: packages/lib/models/Setting.ts
Two new settings are added:
- ai.searchEnabled: boolean toggle for semantic search
- ai.activeModel: model name string, defaults to all-MiniLM-L6-v2
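For illustration, the two entries might be added to the metadata in Setting.ts roughly as follows. The field names mirror existing entries (SettingItemType is Joplin's existing enum), but the exact shape, section, and label wording are assumptions to be confirmed against the current metadata() implementation, and real entries would use Joplin's translation helper for the labels:

```typescript
// Rough sketch of the new setting entries; exact fields to be confirmed against Setting.ts.
const aiSettings = {
	'ai.searchEnabled': {
		value: false,                      // off by default; enabling triggers the model download
		type: SettingItemType.Bool,
		public: true,
		label: () => 'Enable AI-powered semantic search',
	},
	'ai.activeModel': {
		value: 'all-MiniLM-L6-v2',         // can be swapped for a multilingual model
		type: SettingItemType.String,
		public: true,
		advanced: true,
		label: () => 'Embedding model used for semantic search',
	},
};
```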
How Indexing Works
syncEmbeddings() runs on a 10-second interval, directly mirroring the existing SearchEngine.syncTables_(). It reads from the item_changes table (the same queue that already drives FTS indexing), processes notes in batches of 50, and saves vectors to note_embeddings. The last processed item_change_id is saved to settings after each batch so progress survives app restarts.
The embedding input for each note is constructed deliberately: the note title is included twice followed by the plain text body, with all markdown syntax stripped, code blocks replaced with the literal text "code block", and all URLs and image tags removed. Title repetition gives it implicit higher semantic weight without requiring complex weighted pooling.
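A small sketch of how this input could be built; the function name and the exact stripping rules are illustrative rather than final:

```typescript
// Illustrative sketch of embedding-input construction; name and regexes are assumptions.
function buildEmbeddingInput(title: string, body: string): string {
	const plainBody = body
		.replace(/`{3}[\s\S]*?`{3}/g, 'code block')   // fenced code blocks -> literal placeholder
		.replace(/!\[[^\]]*\]\([^)]*\)/g, '')         // image tags removed
		.replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')      // links: keep the text, drop the URL
		.replace(/https?:\/\/\S+/g, '')               // bare URLs removed
		.replace(/[#>*_`~-]+/g, ' ')                  // remaining markdown syntax stripped
		.replace(/\s+/g, ' ')
		.trim();
	// Repeating the title gives it implicit extra weight in the pooled embedding.
	return `${title}. ${title}. ${plainBody}`;
}
```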
All inference runs inside a dedicated Worker thread (a Web Worker on Electron, worker_threads on CLI/Node), so the main thread never blocks during either background indexing or live search queries.
On first enable, the all-MiniLM-L6-v2 model (~80MB) is downloaded on-demand from Hugging Face and stored inside the user's profile directory under ai-models/<active model>/ (derived from Setting.value('profileDir') and the ai.activeModel setting). It is never bundled with the installer. The model loads during the first syncEmbeddings() run after enabling, so it is already in memory before the user types their first semantic query.
For a user enabling AI search with 5,000 existing notes, initial indexing takes approximately 5–10 minutes of silent background processing at ~40ms per note (5,000 × 40ms ≈ 200 seconds), plus some additional overhead for I/O and batch coordination. The user continues using Joplin normally throughout. A progress indicator in settings shows the current status.
How Search Works
When a semantic query is routed to SemanticSearchEngine, it is first passed through cleanQuery(), which removes punctuation and common English stopwords — articles, pronouns, prepositions, and filler verbs such as 'find', 'show', and 'get'. Cleaning is intentionally lightweight, since the model itself handles semantic de-weighting of weak terms during inference. The cleaned query is then passed to the Worker thread which returns a 384-dimension vector. Similarity is then computed against the in-memory vector cache.
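A lightweight version of this cleaning step might look like the following; the stopword list here is a small illustrative subset, not the final list:

```typescript
// Illustrative query cleaning; the stopword list is a small subset for demonstration.
const STOPWORDS = new Set([
	'a', 'an', 'the', 'my', 'our', 'about', 'with', 'from', 'for', 'in', 'on',
	'find', 'show', 'get', 'me', 'that', 'i', 'wrote',
]);

function cleanQuery(query: string): string {
	return query
		.toLowerCase()
		.replace(/[^\p{L}\p{N}\s]/gu, ' ')   // strip punctuation, keep letters and digits
		.split(/\s+/)
		.filter(word => word && !STOPWORDS.has(word))
		.join(' ');
}
```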
To ensure both performance and flexibility, the system is designed to handle different types of embedding models. The default model, all-MiniLM-L6-v2, produces L2-normalized vectors. This enables a significant performance optimization: the standard cosine similarity calculation (A·B) / (||A||*||B||) simplifies to a much faster dot product (A·B), as the vector magnitudes ||A|| and ||B|| are both 1.
To support future user-selected models that may not be normalized, the engine will check model-specific metadata. If the active model's vectors are normalized, it will use the highly optimized dot product. If not, it will fall back to computing the full, mathematically correct cosine similarity. This ensures both speed by default and accuracy for all supported models. This entire computation, even for 100,000 notes, completes in well under 50ms with no native extensions required.
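The similarity step could be sketched as below, where modelIsNormalized is a hypothetical flag read from the model-specific metadata mentioned above:

```typescript
// Illustrative similarity computation; modelIsNormalized is a hypothetical metadata flag.
function dotProduct(a: Float32Array, b: Float32Array): number {
	let sum = 0;
	for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
	return sum;
}

function similarity(a: Float32Array, b: Float32Array, modelIsNormalized: boolean): number {
	const dot = dotProduct(a, b);
	if (modelIsNormalized) return dot; // ||A|| = ||B|| = 1, so cosine reduces to the dot product
	const normA = Math.sqrt(dotProduct(a, a));
	const normB = Math.sqrt(dotProduct(b, b));
	return dot / (normA * normB);
}
```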
The in-memory cache (Map<noteId, Float32Array>) is loaded once from note_embeddings on the first AI search of a session and stays warm for the entire session. Every time syncEmbeddings() writes a new vector to the database, it also updates the cache in the same operation, keeping them always in sync.
The resulting ranked list of note IDs is passed directly back into SearchEngineUtils.notesForQuery(), which calls Note.previews() to load full note data, applies user preferences (such as hiding completed todos), and dispatches NOTE_UPDATE_ALL to Redux. The note list renders identically to an FTS result.
Libraries and Technologies
| Library | Purpose | Why |
|---|---|---|
| @huggingface/transformers (Transformers.js) | Run embedding model locally in JS/Node | No cloud dependency; works natively in Electron and Node.js |
| all-MiniLM-L6-v2 | Embedding model | ~80MB, 384 dimensions, strong English semantic quality, L2-normalized output |
| SQLite BLOB | Vector storage | Already used by Joplin, no new dependencies introduced |
| Web Worker / worker_threads | Inference off main thread | Built-in browser and Node APIs; no external library needed |
No native C++ extensions. No platform-specific binaries. No cloud APIs by default.
Potential Challenges
Initial indexing time on large collections
A user with 10,000+ notes will wait approximately 10–20 minutes for full initial indexing (at ~40ms per note, plus I/O and batch coordination overhead).
Mitigation: indexing is silent and non-blocking, the existing FTS search continues working throughout, and a progress indicator keeps the user informed.
Worker thread data transfer overhead
Passing large vector arrays between threads involves serialization cost.
Mitigation: use Transferable objects to transfer buffer ownership without copying, reducing overhead to near zero.
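As an illustration, a Float32Array's underlying buffer can be handed over rather than copied. The sketch below uses Node's worker_threads API; the surrounding worker wiring is assumed:

```typescript
// Minimal worker_threads sketch showing buffer transfer instead of copying.
import { parentPort } from 'worker_threads';

function sendEmbedding(vector: Float32Array) {
	// The second argument lists transferables: ownership of the ArrayBuffer moves
	// to the receiving thread, so the vector data is not copied.
	parentPort?.postMessage({ embedding: vector }, [vector.buffer as ArrayBuffer]);
}
```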
Model download failure or interruption
If the ~80MB download is interrupted, the model file will be incomplete.
Mitigation: download to a temporary path, verify the file size on completion, then move it to the final location atomically. Retry automatically on next startup.
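A sketch of this hardening using Node's fs/promises; the function name, paths, and expected-size check are illustrative:

```typescript
// Illustrative download-hardening sketch; names and the size check are assumptions.
import { rename, stat, unlink } from 'fs/promises';

async function finalizeModelDownload(tempPath: string, finalPath: string, expectedBytes: number) {
	const info = await stat(tempPath);
	if (info.size !== expectedBytes) {
		await unlink(tempPath); // incomplete download: discard and retry on next startup
		throw new Error('Model download incomplete, will retry on next startup');
	}
	await rename(tempPath, finalPath); // atomic move into the final models directory
}
```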
Heuristic routing and model quality for non-English users
The natural language detector used in determineSearchType_() is optimized for English, so non-English natural language queries may not be auto-detected and will fall through to FTS. Additionally, the default model all-MiniLM-L6-v2 is English-trained, meaning embedding quality for non-English note content will be lower.
Mitigation: The explicit AI toggle in the search bar gives all users a reliable override for the detection issue. For embedding quality, the architecture already supports model substitution via ai.activeModel. Non-English users can switch to a multilingual model such as paraphrase-multilingual-MiniLM-L12-v2 (note: ~470MB download vs the default ~80MB). This requires making the model download path dynamic based on the active model setting — a minor change already planned as part of the ai.activeModel implementation. Both the size difference and the model swap option will be noted clearly in the user documentation.
Memory usage on lower-end machines
The in-memory vector cache for 10,000 notes uses approximately 15MB of RAM. For 100,000 notes this reaches ~150MB, which may be significant on older hardware. This is a known trade-off for the query-time performance gain and will be documented clearly.
Testing Strategy
Unit tests (packages/lib/services/search/SemanticSearchEngine.test.ts):
- syncEmbeddings() correctly reads item_changes and writes to note_embeddings
- Dot product produces correct scores for known vector pairs
- Model version change triggers table clear and re-index
- Mobile gating returns an empty array, not an error
- Graceful FTS fallback triggers correctly when the model fails to load
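For example, the dot product check could be a plain Jest-style test using Joplin's existing Jest setup in packages/lib; the import path and the exported dotProduct helper are illustrative assumptions:

```typescript
// Illustrative Jest-style unit test; the import path and helper name are assumptions.
import { dotProduct } from './SemanticSearchEngine';

describe('SemanticSearchEngine similarity', () => {
	it('computes the dot product of known vector pairs', () => {
		const a = new Float32Array([1, 0, 0.5]);
		const b = new Float32Array([2, 3, 4]);
		// 1*2 + 0*3 + 0.5*4 = 4
		expect(dotProduct(a, b)).toBeCloseTo(4);
	});
});
```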
Integration tests:
- End-to-end: note created → item_changes → syncEmbeddings() → note_embeddings populated
- Semantic query routed correctly vs a keyword query with filter syntax
- Results from SemanticSearchEngine.search() flow through notesForQuery() and render correctly in the note list
Joplin's existing test infrastructure in packages/lib/ will be used throughout; no new test frameworks are required.
Documentation Plan
Developer documentation: A SEMANTIC_SEARCH.md file in packages/lib/services/search/ explaining the architecture, the embedText() interface for future backend additions, the migration pattern, and how to substitute a different embedding model.
User documentation: An addition to the Joplin help site explaining what semantic search is, how to enable it, what to expect during initial indexing, and how to use the AI toggle. Written in plain language with no technical terms.
4. Implementation Plan
This project is structured into two-week milestones to ensure steady progress and allow for adjustments. The 12-week timeline is designed to cover foundation, implementation, testing, and documentation.
Weeks 1–2: Foundation and Indexing Backend
- Goal: Establish the core database structure and background service.
- Tasks:
- Set up the complete development environment for the Joplin desktop application.
- Create and apply the new database migration script for the note_embeddings table.
- Implement the initial SemanticSearchEngine.ts class structure.
- Implement the syncEmbeddings() background process. This service will read from the item_changes table (similar to the existing FTS syncTables_() process) and, for now, save placeholder data to the new note_embeddings table.
Weeks 3–4: Model and Worker Integration
- Goal: Integrate the machine learning model and ensure it runs without blocking the UI.
- Tasks:
- Integrate the @huggingface/transformers library into the project.
- Implement the on-demand model download logic, storing the model in the user's profile directory.
- Set up the dedicated Worker thread (worker_threads for Node/CLI, Web Worker for Electron) for running model inference.
- Connect the syncEmbeddings() process to the worker, allowing it to pass note text and receive a real vector embedding in return.
Weeks 5–6: Search Logic and Routing
- Goal: Implement the core search functionality and routing logic.
- Tasks:
- Implement the in-memory vector cache (Map<noteId, Float32Array>) that loads from the note_embeddings table.
- Implement the similarity search function, including the logic to switch between fast dot-product and full cosine similarity based on model properties.
- Modify the determineSearchType_() function in SearchEngineUtils.ts to add the new routing rules for semantic queries.
- Connect the search function so that a semantic query successfully returns a ranked list of note IDs.
Weeks 7–8: End-to-End Integration and UI
- Goal: Connect the backend to the UI and complete the user-facing features.
- Tasks:
- Add the new settings (ai.searchEnabled, ai.activeModel) to Setting.ts.
- Add the AI toggle button to the search bar UI and connect it to the ai.searchEnabled setting.
- Ensure the full search pipeline works end-to-end: a user types a query, it is routed correctly, results are returned, and the note list updates.
- Implement the progress indicator for the initial indexing process.
Weeks 9–10: Testing and Robustness
- Goal: Ensure the implementation is reliable and bug-free.
- Tasks:
- Write comprehensive unit tests for SemanticSearchEngine.ts, covering indexing, similarity calculations, and model versioning logic.
- Write integration tests to verify the end-to-end flow from note creation to search result rendering.
- Address the "Potential Challenges" identified, such as implementing Transferable objects for worker efficiency and ensuring the model download is robust.
- Conduct thorough manual testing for performance and edge cases.
Weeks 11–12: Documentation and Final Polish
- Goal: Finalize all project deliverables for submission.
- Tasks:
- Write the developer documentation (SEMANTIC_SEARCH.md) explaining the architecture and extension points.
- Write the user-facing documentation for the Joplin help site.
- Perform a final code review, add comments, and ensure the codebase is clean and maintainable.
- Prepare and submit the final pull request.
5. Deliverables
By the end of the GSoC program, the following will be delivered:
- A fully functional semantic search engine integrated into the Joplin Desktop and CLI applications, capable of understanding natural language queries.
- A background indexing service that automatically and continuously converts note content into vector embeddings.
- A new database table (note_embeddings) managed by a standard migration script to store the vector data persistently.
- An intelligent query router that automatically decides between the existing keyword search and the new semantic search based on the query's structure.
- A complete set of unit and integration tests to ensure the new functionality is robust and reliable.
- Comprehensive documentation, including a technical guide for future developers (SEMANTIC_SEARCH.md) and a user-friendly guide for the Joplin help site.
- A final, polished Pull Request submitted to the main Joplin repository containing all the above features and assets.
6. Availability
- Weekly Availability: I am available to work full-time, approximately 40 hours per week, for the entire duration of the GSoC program.
- Time Zone: IST (Indian Standard Time, UTC+5:30).
- Other Commitments: I have no other academic or professional commitments during the GSoC 2026 period and can dedicate my full attention to this project.