Links
- GitHub profile - Fardin96
- Introduction post: intro post
- PRs
1. Introduction
I am Farabi Fardin Khan, a software engineer from Dhaka, Bangladesh, with a B.Sc. in Computer Science from Brac University.
My technical background is in full-stack development, primarily with TypeScript and JavaScript (React Native and React on the frontend, Node.js on the backend), which maps well to Joplin's plugin architecture. I have developed, maintained, and shipped applications at DEVxHUB and Swop Technologies, with experience across the entire development lifecycle. I have also recently shipped personal projects such as Travel Cover (OCR-based travel insurance) and an automated blog-publishing pipeline triggered by webhooks.
On the Joplin side, I have already contributed two PRs (#14042, #14944). The project I propose here aims to reduce the compounding mental overhead of organizing notes that accumulate over the years by categorizing them with local-first AI, which suits Joplin's privacy-focused philosophy.
2. Project Summary
2.1 The Problem
Note-taking apps often become disorganized over time: unsorted notes, redundant tags, and outdated notebooks. Existing Joplin plugins (e.g., Jarvis) offer limited help: they work per note, require cloud APIs, and only suggest tags without addressing structure or patterns across notes.
2.2 The Solution
I propose a plugin that analyses all notes to build a semantic understanding, then suggests tags, notebooks, clusters, and archive candidates—all without sending data to the cloud by default. Users review and approve every action, ensuring full control.
2.3 Key Features
- Smart Tagging: Identifies notes with similar content (via embedding similarity) and propagates tags from well-tagged neighbors to under-tagged ones.
- Notebook Auto-Filing: Compares a note’s embedding against the centroid of each notebook and suggests a better one if the match is strong enough.
- Semantic Clustering: Groups the entire collection by topic, helping users discover themes they never explicitly named — useful for creating new notebooks or bulk tags.
- Archive Discovery: Flags notes that have not been accessed or edited for a configurable period (default six months) as candidates for archiving.
2.4 Expected Outcome
By the end of GSoC the plugin (“Alfred”) will be available in the marketplace with a local embedding engine, a review panel, full offline support, optional LLM enhancement, comprehensive tests, and clear documentation.
3. Technical Approach
3.1 Overall Architecture
The plugin consists of four core components: an Embedding Service (model loading and inference), a Categorisation Engine (tagging, notebooks, clustering, archiving), a Review UI Panel, and an apply step that writes approved changes (tags, moves, archives) back through Joplin's data API. An Index Service bridges the Embedding Service and Joplin's events, incrementally updating embeddings only for changed notes and avoiding full re-indexing.
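To make these boundaries concrete, here is a rough sketch of the component contracts I have in mind. All names and shapes are provisional, intended as a starting point for design discussion rather than a final interface:

```typescript
// Provisional component contracts; every name here is illustrative.

interface EmbeddingService {
  // Lazily loads the model, then returns a normalized 384-dim vector.
  embed(text: string): Promise<Float32Array>;
}

interface VectorStore {
  upsert(noteId: string, vector: Float32Array): Promise<void>;
  remove(noteId: string): Promise<void>;
  all(): Promise<Map<string, Float32Array>>;
}

// Everything the engine produces is a suggestion the user must approve.
type Suggestion =
  | { kind: 'tag'; noteId: string; tag: string; confidence: number }
  | { kind: 'move'; noteId: string; notebookId: string; confidence: number }
  | { kind: 'archive'; noteId: string; lastUpdated: number };

interface CategorisationEngine {
  // Runs over the current index and emits pending suggestions for review.
  analyse(): Promise<Suggestion[]>;
}
```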
3.2 Embedding Engine
I plan to use transformers.js (the Xenova port) to run a small sentence-transformer model (e.g., all-MiniLM-L6-v2 or bge-small-en) in WebAssembly. Models are 20–30 MB, produce 384-dim vectors, and run entirely on-device; no API key is needed and no data leaves the machine. Embeddings will be persisted in a custom binary store (Float32Array).
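As a feasibility sketch, the embedding path could follow transformers.js's documented feature-extraction usage, wrapped in a lazy singleton so the model is loaded only on first use (which also ties into the cold-start concern in section 3.6):

```typescript
import { pipeline } from '@xenova/transformers';

// Lazy singleton: the ~25 MB model is fetched and initialised only once,
// the first time an embedding is actually requested.
let extractorPromise: ReturnType<typeof pipeline> | null = null;

function getExtractor() {
  if (!extractorPromise) {
    extractorPromise = pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  }
  return extractorPromise;
}

export async function embed(text: string): Promise<Float32Array> {
  const extractor = await getExtractor();
  // Mean-pool token embeddings and L2-normalize, so cosine similarity
  // downstream reduces to a plain dot product.
  const output = await extractor(text, { pooling: 'mean', normalize: true });
  return output.data as Float32Array;
}
```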
3.3 Categorisation Logic
- Smart Tagging: K‑nearest neighbours (k=5–10) on cosine similarity; tags from neighbours are weighted by their similarity to the target note and suggested above a confidence threshold (see the sketch after this list).
- Notebook Auto‑Filing: Centroid vectors per notebook; a note is suggested for move if its similarity to the nearest centroid exceeds a threshold (≈0.8).
  Open question: handling small notebooks with statistically unreliable centroids.
- Semantic Clustering: K‑means over all embeddings, with auto‑selection of k via silhouette score. Clusters are labelled with TF‑IDF keywords (optional LLM refinement). Triggered manually (“Analyse All”) due to computational cost.
- Archive Discovery: Heuristic based on user_updated_time older than a configurable threshold (default 180 days).
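Here is a minimal sketch of the tagging and centroid maths described above, assuming L2-normalized vectors from the embedding service; the k and threshold values are placeholders pending benchmarking:

```typescript
// Vectors are L2-normalized, so cosine similarity is just the dot product.
export function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0;
  for (let i = 0; i < a.length; i++) dot += a[i] * b[i];
  return dot;
}

// Smart Tagging: similarity-weighted vote over the k nearest neighbours.
export function suggestTags(
  target: Float32Array,
  neighbours: { vector: Float32Array; tags: string[] }[],
  k = 5,
  threshold = 0.5,
): { tag: string; score: number }[] {
  const nearest = neighbours
    .map(n => ({ tags: n.tags, sim: cosine(target, n.vector) }))
    .sort((x, y) => y.sim - x.sim)
    .slice(0, k);

  const votes = new Map<string, number>();
  for (const n of nearest) {
    for (const tag of n.tags) votes.set(tag, (votes.get(tag) ?? 0) + n.sim);
  }
  return [...votes.entries()]
    .map(([tag, score]) => ({ tag, score }))
    .filter(s => s.score >= threshold)
    .sort((x, y) => y.score - x.score);
}

// Notebook Auto-Filing: normalized centroid of a notebook's note vectors.
export function centroid(vectors: Float32Array[]): Float32Array {
  if (vectors.length === 0) throw new Error('empty notebook');
  const c = new Float32Array(vectors[0].length);
  for (const v of vectors) for (let i = 0; i < c.length; i++) c[i] += v[i];
  const norm = Math.sqrt(c.reduce((s, x) => s + x * x, 0)) || 1;
  return c.map(x => x / norm);
}
```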
3.4 Optional LLM Layer
If a user provides a key in settings, the plugin can call an LLM for two specific tasks: labelling semantic clusters (converting TF-IDF keyword sets into human-readable topic names) and validating low-confidence categorisation suggestions. I plan to use function calling / structured JSON output so the LLM’s response maps directly to Joplin API actions (add_tag, move_note, create_notebook, etc.) without any free-text parsing.
Privacy is the important caveat here: the LLM path is strictly opt-in, and the settings panel will clearly explain what is sent. In local (Ollama) mode, data stays on the machine regardless.
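For illustration, one tool definition in OpenAI-style function-calling format could look like this; the name label_cluster and the schema shape are my placeholders, not a settled design:

```typescript
// Hypothetical tool definition for the cluster-labelling task. The model
// must answer with arguments matching this JSON schema, so the plugin can
// map the response to an action without free-text parsing.
const labelClusterTool = {
  type: 'function',
  function: {
    name: 'label_cluster',
    description: 'Assign a short, human-readable topic name to a note cluster',
    parameters: {
      type: 'object',
      properties: {
        clusterId: { type: 'string', description: 'Internal cluster id' },
        label: { type: 'string', description: 'Topic name, at most 4 words' },
      },
      required: ['clusterId', 'label'],
    },
  },
} as const;
```

The same pattern would cover add_tag and move_note, each validated against its schema before anything touches the Joplin data API.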
3.5 Review UI Panel
The sidebar panel is rendered via joplin.views.panels. It lists pending suggestions grouped by type (tags, moves, archives, clusters), each with an Accept / Reject toggle. A single “Apply Approved” button commits all accepted actions through the Joplin data API, wrapped with undo support. I want to invest real time in this panel: it is the part the user interacts with every day, and a clunky UI will make even excellent suggestions feel annoying.
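The wiring would follow the standard plugin-panel pattern, roughly as below; the panel id, script path, and message names are placeholders:

```typescript
import joplin from 'api';

export async function createReviewPanel() {
  const panel = await joplin.views.panels.create('alfredReviewPanel');
  await joplin.views.panels.setHtml(panel, '<div id="root">Loading…</div>');
  await joplin.views.panels.addScript(panel, './panel.js');

  // The webview posts the user's Accept/Reject decisions back to the plugin.
  await joplin.views.panels.onMessage(panel, async (message: any) => {
    if (message.name === 'applyApproved') {
      // Each approved suggestion becomes one data API call, e.g. attaching
      // an existing tag to a note:
      // await joplin.data.post(['tags', tagId, 'notes'], null, { id: noteId });
    }
  });
  return panel;
}
```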
3.6 Known Challenges & Questions
I want to be transparent about the parts of this design that are not yet fully resolved, because I think these are worth discussing with mentors early:
- Plugin sandbox constraints: Joplin plugins can only use whitelisted native modules. I am confident transformers.js/WASM works, but I have not yet verified the exact memory ceiling under the plugin runtime. I plan to prototype this in week 1 and raise any issues immediately.
- KNN performance at scale: Brute-force KNN is O(N) per note and should be fine up to a few thousand notes. For very large collections (10,000+), we might need a more efficient approximate index. I would like to discuss whether this is a realistic concern for the target user base, or whether we should plan for it from the start.
- Embedding model size vs. cold-start time: A 30 MB WASM bundle is non-trivial. I plan to lazy-load it (only when the user opens the panel), but I want mentor feedback on whether there is a preferred model size/accuracy trade-off for Joplin plugins.
- Centroid stability for small notebooks: Notebooks with fewer than ~5 notes will have unreliable centroids. I am thinking of a minimum-size guard, but I am not sure what the right threshold is.
3.7 Testing
- Unit tests (Jest): Deterministic functions (cosine similarity, centroid calculation, KNN, TF‑IDF) tested against synthetic data with known outputs (see the example after this list).
- Integration tests: Full pipeline tested via Joplin’s HTTP API in a headless environment, verifying suggestions against expected results.
- Performance benchmarks: Indexing time, KNN search, and memory measured for 50, 500, and 5,000 notes; any unacceptable degradation addressed early.
- User/mentor testing: After week 6, mentors and community members test with real collections to uncover edge cases.
- LLM mock tests: Simulated function‑calling responses to validate JSON schema handling and the undo mechanism.
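As a concrete example, a Jest unit test for the cosine-similarity helper sketched in section 3.3 could look like this (the module path is hypothetical):

```typescript
import { cosine } from '../src/vectorMath'; // hypothetical module path

describe('cosine similarity', () => {
  test('identical normalized vectors score 1', () => {
    const v = new Float32Array([0.6, 0.8]); // already unit length
    expect(cosine(v, v)).toBeCloseTo(1.0);
  });

  test('orthogonal vectors score 0', () => {
    const a = new Float32Array([1, 0]);
    const b = new Float32Array([0, 1]);
    expect(cosine(a, b)).toBeCloseTo(0.0);
  });
});
```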
3.8 Documentation Plan
User Documentation
- Overview – Purpose of the plugin, how it differs from existing tools.
- Installation – Steps to install from Joplin’s plugin repository or manually.
- Getting Started – First‑run setup, model download, and a quick walkthrough of the review panel.
- Feature Guides – Separate sections for Smart Tagging, Notebook Auto‑Filing, Semantic Clustering, and Archive Discovery, each explaining what it does, how it works (conceptually), and how to use it.
- Settings Reference – Explanation of configurable options (thresholds, archive age, clustering trigger, etc.).
- Review Panel – How to review, approve, reject, and apply suggestions.
- Privacy & Security – Clear statement that all processing is local by default; optional cloud features (if any) are explicit.
- Troubleshooting – Common issues (e.g., model fails to load, no suggestions appear) and solutions.
Developer Documentation
- Architecture Overview – High‑level diagram and description of the main modules (Embedding Service, Index Service, Categorisation Engine, Review UI).
- Setup for Development – Cloning the repo, installing dependencies, running tests, and loading the plugin in a development Joplin instance.
- Core Component Details – How the embedding model is loaded and run (transformers.js, WebAssembly), incremental indexing, and the custom binary store.
- Categorisation Algorithms – Detailed explanation of each algorithm (KNN tagging, centroid filing, K‑means clustering, heuristic archiving), including parameters and how they are tuned.
- Testing – How to run unit, integration, performance, and mock tests; adding new test cases.
- API & Events – Integration points with Joplin (note change events, plugin data dir, etc.).
- Extending – Guidelines for contributors: how to add a new suggestion type, modify thresholds, or replace the embedding model.
- Future Updates – Notes on planned improvements (e.g., moving to Joplin core embedding infrastructure if available).
4. Implementation Plan
| Period | Goals & Milestones |
|---|---|
| Weeks 1–2 | Foundation: Plugin scaffolding, dev environment, transformers.js integration proof-of-concept. |
| Weeks 3–4 | Indexing & Core Logic: Embedding Service + incremental indexer using Joplin Events API. Binary vector store. Initial KNN tagging and centroid notebook classifier (no UI yet). Benchmark on 1,000+ notes. |
| Weeks 5–6 | Review UI & Actions: React sidebar panel with Accept/Reject controls. Hook up tag and move actions via joplin.data API. Progress indicators. Begin testing with real note collections. |
| Weeks 7–8 | Archive & Clustering: Stale-note detection. K-Means clustering with auto-k (silhouette heuristic). TF-IDF cluster labelling. Cluster suggestions in UI. Optimise for larger corpora. |
| Weeks 9–10 | LLM Integration: Function-calling schema (add_tag, move_note, etc.). OpenAI + Ollama providers. JSON schema validation layer. Undo/rollback for all actions. |
| Week 11 | End-to-End Testing: 50 / 500 / 5,000-note test corpora. Performance profiling. Cross-platform checks (Windows, macOS, Linux). Fix any memory or stability issues. |
| Week 12 | Polish & Documentation: User guide, developer architecture doc, privacy notice, demo video/GIF. Finalise plugin manifest for marketplace submission. |
5. Deliverables
- Joplin plugin: Shipped as a .jpl on the marketplace, with complete features.
- Embedding engine: Transformers.js / WASM that runs entirely on-device.
- LLM integration layer: OpenAI/Ollama for cluster labelling and low-confidence refinements — strictly opt-in.
- Full test suite: unit tests for vector maths, KNN, and centroid logic; integration tests using the Joplin data API; performance benchmarks on large corpora.
- Complete documentation: user guide with screenshots, developer architecture doc, privacy statement, and a short demo video.
6. Availability
- Weekly availability: ~40 hours per week during GSoC
- Time zone: GMT+6