AI Based Categorization
Unsupervised Topic Modeling & Vector-Relational Taxonomy for Joplin
─
Muhammad Zohaib Irshad
Mid-Senior Full-Stack TypeScript Developer
[GitHub Account] [LinkedIn Account] [GSOC Idea#3]
1. Pull Requests & Relevant Work
Contributions made in Joplin
My Own Plugins that I Built
| Contribution | Status | Links |
|---|---|---|
| Joplin Word Count, Spell Check & Reading Metrics | Completed | [GitHub Repository] [npm Publish] [Video Demonstration] |
| AI Note Assistant: Chat On Your Notes | Completed | [GitHub Repository] [npm Publish] [Video Demonstration] |
Joplin Main Repository
| Contribution | Status | Links |
|---|---|---|
| Pull Request | Merged | [Pull Request] |
| Pull Request | Opened | [Pull Request] |
| Pull Request | Closed | [Pull Request] |
| Pull Request | Closed | [Pull Request] |
AI Notes Assistant Plugin
| Contribution | Status | Links |
|---|---|---|
| Pull Request | Merged | [Pull Request] |
| Pull Request | Merged | [Pull Request] |
Jarvis Plugin
| Contribution | Status | Links |
|---|---|---|
| Pull Request | Merged | [Pull Request] |
| Pull Request | Closed | [Pull Request] |
| Pull Request | Merged | [Pull Request] |
Contributions in Apache and FOSSASIA Organisations
| Contribution | Status | Links |
|---|---|---|
| Pull Request (Complex Backend Token Storage Issue Solved) | Merged | [Pull Request] |
| Pull Request | Merged | [Pull Request] |
| Pull Request | Opened | [Pull Request] |
| Pull Request | Opened | [Pull Request] |
| Pull Request | Opened | [Pull Request] |
Personal Projects Related to this Plugin
| Project | GitHub Links | Tech Stack |
|---|---|---|
| RAG Pipeline | [Frontend Repo] [Backend Repo] [Live Link] | TypeScript, Node.js, Next.js, Clerk, LangchainJS, Qdrant Vector DB, Cohere |
| AI + Reddit Analysis Based Ecommerce Store | [GitHub Repo] [Live Link] | TypeScript, Node.js, Next.js, Reddit API, Gemini API, MongoDB |
| LeaderBoard Sphere | [Frontend Repo] [Backend Repo] | Next.js, Node.js, Redis, Kafka, Prisma, SocketIO |
2. Introduction
I am Muhammad Zohaib Irshad, a mid-senior full-stack software engineer based in Islamabad, Pakistan. I am currently completing my Bachelor's degree in Software Engineering at Air University Islamabad.
Contact Information
| Field | Details |
|---|---|
| Name | Muhammad Zohaib Irshad |
| Email | zohaibirshad678@gmail.com |
| GitHub | developerzohaib786 |
| LinkedIn | https://linkedin.com/in/developerzohaib |
| Address | Islamabad, Pakistan |
| University | Air University Islamabad |
| Degree | Bachelors in Software Engineering |
Programming Experience
| Area | Technologies |
|---|---|
| Frontend | JavaScript, TypeScript, ReactJS, Next.js, HTML5, CSS3, Tailwindcss, Shadcn/UI |
| Backend | Node.js, ExpressJS, NestJS, Prisma, JWT, Socket.io |
| Generative AI | RAG System, LangchainJS, Qdrant Vector DB, Cohere API, Gemini API, Reddit API |
| Databases | Vector Database (Qdrant), MongoDB, PostgreSQL, NeonDB, MySQL |
| System Design | Redis, Kafka, BullMQ, Rate Limiting, Cache, Server Clustering |
Tech Industry Experience
| Company | Role | Timeline |
|---|---|---|
| SyncaAI | Full Stack TypeScript Intern | Jul 2025 – Sept 2025 |
| Softechnova Enterprises | MERN Stack Intern | Jun 2025 – Jul 2025 |
| SARTE Digital Marketing | SEO Expert and WordPress Content Writer | Oct 2023 – Sept 2024 |
| Clients from Facebook, LinkedIn & WhatsApp | MERN Stack & Next.js Developer | 2023 – Present |
Open Source Experience
I actively contribute to open source projects including Apache Polaris-Tools, Apache Doris, FOSSASIA, and Links-Hub. I understand how to navigate large established codebases, communicate through PRs, and follow project contribution standards. I have mentioned my open source work in the Pull Requests and Relevant Work section.
3. Project Summary
Problem Space:
As a Joplin user's note collection grows, it becomes increasingly disorganised. Notes accumulate without consistent tags, notebooks fill up with unrelated content, and rarely accessed notes get buried alongside frequently used ones. Manually reviewing and reorganising hundreds or thousands of notes is a task most users never complete; it is simply too time-consuming. The result is a knowledge base that reflects when notes were created rather than what they are actually about, making the collection harder to navigate over time.
Implementation Strategy:
The plugin embeds each note using BGE-small-en-v1.5, chosen for its 512-token context window and strong MTEB clustering benchmark scores. Notes are split into overlapping chunks, embedded, and averaged into a single note-level vector. Meaningful titles are blended into the note vector using cosine similarity weighting; generic titles like "Untitled" are filtered out before any weighting is applied.
Rather than clustering on raw 384-dimensional vectors, the plugin first applies UMAP via DruidJS to reduce vectors to 5 dimensions, separating topic clusters in a low-dimensional space where K-Means performs significantly better. The optimal K is selected using silhouette scoring across K values from 2 to √N, which is more reliable than the elbow method, whose bend cannot be detected dependably in code.
Tag names are generated without sending note text to an LLM. TF-IDF first identifies cluster-specific terms, which are re-ranked by cosine similarity to the cluster centroid. Only the top five keywords per cluster are sent to the LLM, keeping the process privacy-preserving even when a cloud provider is used. All vectors are stored in a local SQLite database via joplin.require('sqlite3').
Archive candidates are scored across five signals: last edited date, edit count, content length, backlinks from other notes, and silhouette fit. This makes detection more accurate than checking a single timestamp. All suggestions are presented in a review panel before any change is applied. The plugin writes the pre-change state of every affected note to a categorisation_log table before applying, enabling one-click undo. Incremental sync uses the Joplin Events API cursor to catch changes from all devices.
Expected Outcome
A Joplin plugin that analyses the user's note collection using UMAP-enhanced clustering, discovers natural semantic categories, and presents tag and notebook suggestions in a review panel. If the user approves, the plugin applies those changes automatically: creating new tags, creating new notebooks, and moving notes, while never modifying any note without explicit confirmation. A one-click undo restores the full previous state from the categorisation log. The plugin supplements rather than replaces the user's existing organisational structure and works entirely offline by default, with cloud providers available as opt-in only.
4. Technical Approach
4.1 Architectural Justification: Decoupled Plugin Runtime vs. Core Integration
A plugin keeps Joplin's core lightweight, ships independently of the main release cycle, and touches zero core source code. It can be installed or removed without affecting the main application.
4.2 Comparative Analysis: Evolution Beyond Current LLM Baselines (Jarvis Case Study)
I have worked directly inside the Jarvis codebase, submitting PR #66 (Azure OpenAI support) and resolving Issue #18 (dedicated chatbox) through PR #69, which showed me clearly where Jarvis's boundaries are. In my own joplin-plugin-ai-chat-on-notes I built a multi-provider abstraction layer. Both patterns are applied here.
Jarvis operates on the currently open note with no batch embedding, no persistent vector index, and no clustering. This proposal builds that missing layer: embed every note, reduce dimensions with UMAP, discover semantic groupings through clustering, and surface them as actionable tag and notebook suggestions.
4.3 Technology Stack & Dependency Graph
| Component | Technology | Details |
|---|---|---|
| Language | TypeScript | Consistent with Joplin's entire plugin ecosystem |
| Database | sqlite3 via joplin.require('sqlite3') | Officially supported. Zero setup, single file, all platforms |
| Vector storage | BLOB column (Float32Array) | Raw binary, no native extension required |
| Dimensionality reduction | UMAP via DruidJS | Pure JavaScript, IEEE-published, actively maintained |
| Clustering | Pure JavaScript K-Means | Runs entirely in-process, no native modules needed |
| Default embedding | ONNX local (BGE-small-en-v1.5) | ~33 MB, no API key required, fully offline |
| Category naming LLM | OpenAI / Ollama | Generates tag and notebook names from cluster summaries |
4.4 Validation-First Design: Week 1–2 Technical Spike
The two highest technical risks are treated as hypotheses before the architecture is locked:
Risk 1: Local embedding inference: Can ONNX load and run cross platform inside the Joplin plugin sandbox?
- Preferred: ONNX Runtime Node.js
- Fallback A: ONNX WASM
- Fallback B: HTTP to Ollama or OpenAI
Risk 2: WASM memory degradation: Transformers.js WebAssembly memory grows during batch embedding and never releases, dropping throughput from ~47 notes/sec to ~2 notes/sec after 100 notes. Mitigation: recycle the worker process every 80–100 notes; embeddings already written to sqlite3 are never lost on recycling.
4.5 Embedding Model Selection & Benchmarking (MTEB Analysis)
BGE-small-en-v1.5 is the default. Model selection was based on MTEB clustering task scores specifically, not the overall leaderboard, which averages across 8 task types and rewards retrieval quality irrelevant to this project. all-MiniLM-L6-v2 was considered but rejected because its 256-token limit silently truncates longer notes with no error or warning.
| Model | Dimensions | Size | Context | Notes |
|---|---|---|---|---|
| BGE-small-en-v1.5 (ONNX) | 384 | ~33 MB | 512 tokens | Default. Highest reliable clustering score on MTEB |
| all-MiniLM-L6-v2 (ONNX) | 384 | ~23 MB | 256 tokens | Rejected as default: silently truncates at 256 tokens |
| nomic-embed-text (Ollama) | 768 | ~274 MB | 8192 tokens | For the Ollama HTTP path |
| text-embedding-3-small (OpenAI) | 1536 | Cloud | 8191 tokens | $0.02 per million tokens |
4.6 System Workflow & Pipeline Stages
The plugin operates in four sequential phases: embedding, clustering, suggestion generation, and review and apply.
4.6.1 Phase I: Vector Ingestion & Multi-Stage Embedding
Notes are fetched via the Joplin Data API in paginated batches of 100. Each note is split into overlapping chunks (400 words, 50-word overlap) rather than by headings; this ensures the model never silently truncates content at a heading boundary. Each chunk is embedded using BGE-small-en-v1.5 and stored as a BLOB in sqlite3 alongside the note ID, title, SHA-256 hash, and user_updated_time.
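The chunking step above can be sketched as a small pure function. This is an illustrative sketch, not final plugin code; the window and overlap sizes match the proposal, while the function name and whitespace tokenisation are assumptions.

```typescript
// Hypothetical sketch of the fixed-size overlapping chunker described above.
// Window and overlap sizes mirror the proposal (400 words, 50-word overlap).
function chunkNote(body: string, windowSize = 400, overlap = 50): string[] {
  const words = body.split(/\s+/).filter((w) => w.length > 0);
  if (words.length === 0) return [];
  const chunks: string[] = [];
  const step = windowSize - overlap; // advance by 350 words per chunk
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + windowSize).join(' '));
    if (start + windowSize >= words.length) break; // final window reached
  }
  return chunks;
}
```

A 1,000-word note yields three chunks, with each consecutive pair sharing the 50-word overlap.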
Meaningful titles are embedded separately and blended into the final note vector using cosine similarity weighting. Generic titles ("Untitled", "New Note", dates) are filtered before any weighting is applied. SHA-256 hashing ensures only modified notes are re-embedded on subsequent runs.
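A minimal sketch of the title-blending idea, assuming a scheme where the title's weight is derived from its cosine similarity to the body vector. The exact weighting formula, the 50% cap, and the generic-title filter shown here are illustrative assumptions, not the final implementation.

```typescript
// Illustrative generic-title filter ("Untitled", "New Note", ISO dates).
const GENERIC_TITLES = /^(untitled|new note|\d{4}-\d{2}-\d{2})$/i;

function cosine(a: Float32Array, b: Float32Array): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Blend the title vector into the body vector, weighting the title by how
// semantically consistent it is with the body. Assumed formula, not final.
function blendTitle(body: Float32Array, title: Float32Array | null, rawTitle: string): Float32Array {
  if (!title || GENERIC_TITLES.test(rawTitle.trim())) return body;
  const w = Math.max(cosine(body, title), 0) * 0.5; // cap title influence at 50%
  const out = new Float32Array(body.length);
  for (let i = 0; i < body.length; i++) out[i] = (1 - w) * body[i] + w * title[i];
  return out;
}
```

A title unrelated to the body (cosine ≈ 0) contributes nothing, while a highly consistent title pulls the vector up to halfway toward itself.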
4.6.2 Phase II: Dimensionality Reduction & Unsupervised Clustering
Once all notes are embedded, chunk vectors are averaged into a single note-level vector. Rather than clustering directly on raw 384-dimensional vectors, where the curse of dimensionality makes everything look equally distant, the plugin first applies UMAP via DruidJS to reduce each note vector to 5 dimensions. UMAP parameters follow BERTopic's recommended defaults for topic clustering:
| Parameter | Value | Reason |
|---|---|---|
| n_neighbors | 15 | Balances local and global structure for 100–2000 note collections |
| n_components | 5 | Sweet spot: 2–3 loses too much, higher hurts K-Means |
| min_dist | 0.0 | Packs similar notes tightly for clean cluster boundaries |
| metric | cosine | Text embeddings should be compared by angle, not magnitude |
| random_state | 42 | Fixed seed: ensures consistent output across runs |
K-Means then runs on the UMAP-reduced vectors. The optimal K is selected using silhouette scoring across K values from 2 to √N; the K with the highest average silhouette score is chosen. The elbow method was considered and rejected because detecting the bend automatically, without a human looking at the plot, is unreliable.
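The silhouette-based K selection can be sketched end to end. This is a deliberately minimal illustration: the tiny K-Means below uses deterministic first-k initialisation as a stand-in for the real seeded initialiser, and all names are assumptions.

```typescript
type Vec = number[];
const dist = (a: Vec, b: Vec) => Math.sqrt(a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0));

// Minimal deterministic K-Means — an illustrative stand-in, not production code.
function kMeans(points: Vec[], k: number, iters = 50): number[] {
  let centroids = points.slice(0, k).map((p) => [...p]);
  let labels = new Array(points.length).fill(0);
  for (let it = 0; it < iters; it++) {
    labels = points.map((p) => {
      let best = 0;
      for (let c = 1; c < k; c++) if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
      return best;
    });
    centroids = centroids.map((c, ci) => {
      const members = points.filter((_, i) => labels[i] === ci);
      return members.length ? c.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length) : c;
    });
  }
  return labels;
}

// Mean silhouette: (b - a) / max(a, b) averaged over all points.
function meanSilhouette(points: Vec[], labels: number[]): number {
  const scores = points.map((p, i) => {
    const same = points.filter((_, j) => j !== i && labels[j] === labels[i]);
    if (same.length === 0) return 0;
    const a = same.reduce((s, q) => s + dist(p, q), 0) / same.length;
    const b = Math.min(...[...new Set(labels)].filter((l) => l !== labels[i]).map((l) => {
      const m = points.filter((_, j) => labels[j] === l);
      return m.reduce((s, q) => s + dist(p, q), 0) / m.length;
    }));
    return (b - a) / Math.max(a, b);
  });
  return scores.reduce((s, v) => s + v, 0) / scores.length;
}

// Scan K = 2 … √N and keep the K with the best mean silhouette.
function pickK(points: Vec[]): number {
  let bestK = 2, bestScore = -Infinity;
  for (let k = 2; k <= Math.ceil(Math.sqrt(points.length)); k++) {
    const s = meanSilhouette(points, kMeans(points, k));
    if (s > bestScore) { bestScore = s; bestK = k; }
  }
  return bestK;
}
```

On two well-separated blobs of notes, the scan correctly settles on K = 2: any finer split lowers the mean silhouette.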
Tag names are generated through a two-step pipeline. First, TF-IDF identifies terms that appear frequently in a cluster but not in others. Those terms are re-ranked by cosine similarity to the cluster centroid. Only the top five keywords per cluster are sent to the LLM, never actual note text, which keeps the process privacy-preserving even when a cloud provider is used.
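A simplified sketch of the pipeline's first stage, treating each cluster's concatenated note text as one "document" for TF-IDF. The tokeniser and scoring are intentionally crude for illustration, and the centroid re-ranking step is omitted.

```typescript
// Class-based TF-IDF sketch: one "document" per cluster. Illustrative only.
function clusterKeywords(clusters: string[][], topN = 5): string[][] {
  const docs = clusters.map((notes) =>
    notes.join(' ').toLowerCase().match(/[a-z]{3,}/g) ?? []);
  // Document frequency: how many clusters contain each term.
  const df = new Map<string, number>();
  for (const doc of docs) for (const t of new Set(doc)) df.set(t, (df.get(t) ?? 0) + 1);
  return docs.map((doc) => {
    const tf = new Map<string, number>();
    for (const t of doc) tf.set(t, (tf.get(t) ?? 0) + 1);
    return [...tf.entries()]
      .map(([t, f]) => [t, (f / doc.length) * Math.log(docs.length / df.get(t)!)] as const)
      .sort((a, b) => b[1] - a[1])
      .slice(0, topN)
      .map(([t]) => t);
  });
}
```

Terms shared by every cluster get an IDF of zero and fall to the bottom, so each cluster's keyword list stays distinctive.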
4.6.3 Phase III: Multi-Signal Heuristics for Archive Detection
Archive candidates are scored using a five-signal staleness score rather than a single timestamp field, which is insufficient on its own: a note untouched for two years but referenced by ten other notes is not a candidate for archiving.
| Signal | Weight | Calculation |
|---|---|---|
| Last edited | 0.30 | days_since_edit / 365, capped at 1.0 |
| Edit count | 0.15 | 1 - min(edit_count, 10) / 10 |
| Content Length | 0.10 | 1.0 if under 100 characters and not a to-do, else 0.0 |
| Backlinks | 0.15 | 1.0 if no other note links to this one, else 0.0 |
| Silhouette fit | 0.30 | 1-max(individual_silhouette, 0): poor cluster fit scores high |
Notes scoring above 0.6 appear in the archive suggestions section. The threshold is configurable in settings.
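The five-signal score from the table above, as a direct sketch. The weights and formulas follow the table; the interface and field names are illustrative assumptions.

```typescript
interface NoteSignals {
  daysSinceEdit: number;
  editCount: number;
  contentLength: number; // characters
  isTodo: boolean;
  backlinkCount: number;
  silhouette: number;    // individual silhouette fit, -1..1
}

// Weighted staleness score (weights from the table; names are illustrative).
function stalenessScore(n: NoteSignals): number {
  const lastEdited = Math.min(n.daysSinceEdit / 365, 1.0);              // 0.30
  const editRarity = 1 - Math.min(n.editCount, 10) / 10;                // 0.15
  const shortStub = n.contentLength < 100 && !n.isTodo ? 1.0 : 0.0;     // 0.10
  const orphaned = n.backlinkCount === 0 ? 1.0 : 0.0;                   // 0.15
  const poorFit = 1 - Math.max(n.silhouette, 0);                        // 0.30
  return 0.30 * lastEdited + 0.15 * editRarity + 0.10 * shortStub
       + 0.15 * orphaned + 0.30 * poorFit;
}
```

An old, rarely edited, orphaned stub with poor cluster fit scores well above the 0.6 threshold, while an actively edited, well-linked, well-clustered note scores near zero.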
onNoteChange() only fires for the currently selected note; it does not catch changes from other devices after a Joplin sync. The plugin therefore hooks into three event sources:
- onNoteChange(): immediate re-embedding of the currently edited note
- onSyncComplete(): runs syncIndex() via the Events API cursor to catch changes from other devices
- Periodic polling every 5 minutes: a fallback for anything that slipped through
UMAP and clustering only re-run when the user clicks Re-analyse or when more than 5% of the collection has changed; below that threshold the existing clustering remains accurate enough.
4.6.4 Phase IV: Suggestion Review Phase
No changes are ever applied without explicit user confirmation. The plugin presents all suggestions in a structured review panel with three sections:
- Tag suggestions: each proposed new tag with the list of notes that would receive it
- Notebook suggestions: each proposed new notebook with the notes that would be moved into it
- Archive suggestions: notes flagged as rarely-accessed with a proposed move to an archive notebook
The user can accept all, reject all, or handle each suggestion individually. Before applying any accepted suggestion the plugin writes the original state of every affected note to the categorisation_log table. A one-click 'Undo last categorisation' button restores all affected notes to their previous state from the log.
4.7 Constraint Management: Handling Data API Rate Limits & Throttling
The Joplin Data API is a local REST service accessed through the joplin.data module. The maximum number of items returned per request is 100, controlled by the limit parameter. The plugin fetches notes in controlled batches of 100 per page. Critically, Joplin's plugin sandbox has no Web Worker API. To prevent UI freezing during embedding a batch-and-yield pattern is used: process 10 notes, then yield control back to the event loop. This keeps Joplin responsive throughout initial indexing.
| Constraint | Value | Source |
|---|---|---|
| Max items per request | 100 | Official Joplin Data API docs |
| Pagination field | has_more (boolean) | Official Joplin Data API docs |
| Page parameter | page (starts at 1) | Official Joplin Data API docs |
| Web Worker API | Not available in plugin sandbox | Joplin plugin architecture |
| Mitigation | Batch-and-yield event loop pattern | — |
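The batch-and-yield pattern described above can be sketched with the per-note embedding call injected as a callback, so the loop is testable outside Joplin. Names and the progress signature are illustrative.

```typescript
// Yield control back to the event loop so Joplin's UI stays responsive.
const yieldToEventLoop = () => new Promise<void>((r) => setTimeout(r, 0));

// Process notes in batches of 10, yielding between batches. `embed` stands in
// for the real per-note embedding call; `onProgress` drives the progress bar.
async function embedAll<T>(
  notes: T[],
  embed: (note: T) => Promise<void>,
  onProgress?: (done: number, total: number) => void,
  batchSize = 10,
): Promise<void> {
  for (let i = 0; i < notes.length; i += batchSize) {
    for (const note of notes.slice(i, i + batchSize)) await embed(note);
    onProgress?.(Math.min(i + batchSize, notes.length), notes.length);
    await yieldToEventLoop(); // hand control back between batches
  }
}
```

Because each batch ends with a macrotask yield, UI events queued during embedding get a chance to run before the next batch starts.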
4.8 Data Serialization: BLOB Binary Efficiency vs. JSON Overhead
The vector is stored as binary rather than a JSON string because binary can be deserialised directly back into a Float32Array in a single operation. During clustering every note vector is compared against every centroid on every iteration so the deserialisation cost multiplies significantly across large collections.
| Criterion | BLOB (Float32Array binary) | JSON array |
|---|---|---|
| Speed | Fast, one Buffer read, directly usable | Slow, full JSON parse before every comparison |
| Human readable | No | Yes |
| Storage size | ~1.5 KB per vector (384-dim local model) | ~2.5 KB per vector |
| Used for | Storing vectors in sqlite3 | Not suitable for vector operations |
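The BLOB round-trip can be sketched in a few lines of Node TypeScript; helper names are illustrative. A 384-dimension vector serialises to exactly 1,536 bytes, matching the ~1.5 KB figure above.

```typescript
// Serialise an embedding for sqlite3 BLOB storage. Buffer.from over the
// underlying ArrayBuffer is zero-copy: no stringification, no parsing.
function vectorToBlob(vec: Float32Array): Buffer {
  return Buffer.from(vec.buffer, vec.byteOffset, vec.byteLength);
}

function blobToVector(blob: Buffer): Float32Array {
  // Copy into a fresh, aligned ArrayBuffer: a Buffer's byteOffset is not
  // guaranteed to be a multiple of 4, which Float32Array requires.
  const copy = new Uint8Array(blob); // copies the bytes
  return new Float32Array(copy.buffer, 0, copy.byteLength / 4);
}
```

During clustering the stored bytes become a usable Float32Array in one view construction, which is where the speed advantage over JSON comes from.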
4.9 API Cost Estimate
The local ONNX path has zero cost. Cloud providers are opt-in only.
| Collection size | Avg chunks | Local ONNX time | OpenAI cost | Storage (384-dim) |
|---|---|---|---|---|
| 100 notes | ~300 | ~15 seconds | ~$0.001 | ~0.5 MB |
| 1,000 notes | ~3,000 | ~2–3 minutes | ~$0.01 | ~5 MB |
| 5,000 notes | ~15,000 | ~12–15 minutes | ~$0.05 | ~25 MB |
| 10,000 notes | ~30,000 | ~25–30 minutes | ~$0.10 | ~50 MB |
4.10 IPC Bridge Constraints: Secure Sandbox Data Access
Plugins cannot access the Joplin database directly. The Joplin database is an SQLite file managed exclusively by the Joplin core application. All data access goes through the joplin.data module via an IPC bridge between the plugin sandbox and the Joplin main process. This matters significantly for the apply phase of this plugin, which requires multiple sequential write operations: creating tags, creating notebooks, assigning tags to notes, and moving notes, each of which is a separate API call through the IPC bridge. The plugin batches these write operations and applies them sequentially with a small delay between calls. A progress indicator shows the user how many changes have been applied out of the total.
| Access method | Available to plugins | Speed | Notes |
|---|---|---|---|
| Direct SQLite file access | No | Very fast | Reserved for Joplin core only |
| joplin.data REST API via IPC | Yes | Moderate | Only supported method |
| Max items per request | — | — | 100 items |
| Fields selection | Yes | Faster | Use fields param to fetch only what is needed |
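A sketch of the sequential apply loop, with each accepted change wrapped in its own closure so the logic can be exercised outside Joplin. In the real plugin each `run` closure would issue the corresponding joplin.data call; the interface and delay value here are assumptions.

```typescript
// One entry per accepted suggestion; `run` performs the actual IPC call.
interface Change { description: string; run: () => Promise<void>; }

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

// Apply changes one at a time with a small delay between IPC round-trips.
// `onProgress` drives the 'Applied 12 of 47 changes' indicator.
async function applyChanges(
  changes: Change[],
  onProgress: (done: number, total: number) => void,
  delayMs = 50,
): Promise<void> {
  for (let i = 0; i < changes.length; i++) {
    await changes[i].run();            // one IPC round-trip per change
    onProgress(i + 1, changes.length);
    if (i < changes.length - 1) await sleep(delayMs);
  }
}
```

Ordering matters in practice: tags and notebooks must exist before notes can be tagged or moved into them, which is why the apply pipeline runs creation changes first.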
4.11 Core Design Principles for End-to-End AI Systems
- No Changes Without User Confirmation: Every tag assignment, notebook creation, and note move is shown in the review panel before anything is written. The plugin never modifies the user's note collection silently.
- Reversibility: Before applying any accepted suggestion the plugin logs the original state of every affected note to the categorisation_log table. A one-click undo button restores all affected notes.
- Incremental Processing via Change Detection: SHA-256 hashing ensures only modified notes are re-embedded. The Events API cursor tracks all changes including those from other devices after a Joplin sync.
- Local-First Privacy: By default nothing leaves the machine. All embedding inference runs locally via ONNX. When a user switches to a remote provider the settings page shows a persistent warning.
- Chunking with Overlap for Context Preservation: Notes are split into structure-aware overlapping chunks. The heading path is prepended to each chunk so that the embedding captures both the local content and the broader document context.
- Provider Abstraction via Common Interface: The LLM provider sits behind a shared EmbeddingProvider interface. Switching between ONNX, Ollama, and OpenAI requires only a settings change.
- Background-Safe Processing: Embedding and clustering use the batch-and-yield event loop pattern since the Joplin plugin sandbox has no Web Worker API. Joplin remains fully usable throughout.
- Graceful Degradation: If the API is unreachable the plugin shows a clear error and waits for the user to fix configuration. If the ONNX model fails to load it falls back to the Ollama or OpenAI path.
- Encrypted API Key Storage: API keys are stored using Joplin's settings API with secure: true, which uses the OS keychain where available. Keys are never stored in plaintext config files.
4.12 Error Handling and Edge Cases
- API key invalid or expired: Before the first full embedding run the plugin sends a test embedding request. If it fails the user sees an error in settings immediately.
- ONNX model fails to load: The plugin disables local inference, falls back to the Ollama or OpenAI path, and shows a clear message in settings suggesting the user switch providers.
- Ollama not running: The plugin pings http://localhost:11434/api/tags on startup. If unreachable the settings panel shows a warning with a link to Ollama's installation instructions.
Very large note collections (5,000+ notes):
- Live progress indicator: 'Embedding note 342 of 5,127...'
- Batch-and-yield pattern: 10 notes then event loop yield
- Cancel button: already-embedded chunks survive in sqlite3
- Persistent partial progress: next startup picks up where it left off via source_hash comparison
- Notes that fit no cluster cleanly: Notes with similarity to their assigned centroid below a configurable threshold are flagged as 'uncategorised' and excluded from suggestions.
- Empty or very short notes: Notes producing fewer than 20 tokens are excluded from clustering and shown as 'too short to categorise' in a separate section of the review panel.
- User rejects all suggestions: The plugin does not re-run automatically. The user must manually trigger a new analysis run, preventing the plugin from repeatedly suggesting changes already rejected.
- sqlite3 database corruption: The plugin runs PRAGMA integrity_check on startup. If corruption is detected it offers a one-click 'Rebuild Analysis' button.
- Non-English notes: The default ONNX model (BGE-small-en-v1.5) is trained on English text. Users with multilingual collections can switch to paraphrase-multilingual-MiniLM-L12-v2 via the model selector.
- Accessibility: The review panel includes role='region', aria-label, and aria-live='polite'. Full keyboard navigation: Tab through suggestions, Enter to accept, Delete to reject, Escape to cancel.
4.13 First-Run Behaviour
| Step | What the user sees | What the plugin does |
|---|---|---|
| Plugin installed, Joplin restarted | Sidebar panel appears collapsed | Waits — no indexing starts automatically |
| User opens sidebar | Not Indexed state with Build Index button and estimated time | Ready to start |
| User clicks Build Index | Progress bar: 'Embedding note 342 of 1,247...' + Cancel button | Batch-and-yield embedding. Joplin remains fully usable |
| Embedding complete | 'Analysing your notes...' message | K-Means clustering and LLM category naming runs |
| Analysis complete | Review panel opens with tag, notebook, and archive suggestions | All suggestions visible, nothing applied yet |
| User reviews and confirms | Apply progress: 'Applied 12 of 47 changes' | Creates tags, notebooks, moves notes via Data API |
| Apply complete | 'Done. Undo last categorisation' button visible | categorisation_log written for rollback |
| Subsequent launches | Silent background sync message | Events API cursor check — re-embeds only changed notes |
4.14 Plugin Settings
| Setting | Type | Default | Description |
|---|---|---|---|
| Embedding Provider | Dropdown | local | local / ollama / openai |
| API Endpoint | String | "" | URL for Ollama or OpenAI. Hidden when local selected |
| API Key | Secure String | "" | Stored via secure: true in OS keychain |
| Embedding Model | String | BGE-small-en-v1.5 | Model identifier. Changes based on provider |
| Cluster Count | String | auto | auto (silhouette scoring) or manual integer 3–30 |
| Archive Threshold (months) | Integer | 12 | Notes not edited in this many months are flagged |
| Privacy Disclosure | Label | — | Read-only warning shown when a remote provider is selected |
4.15 UX Plan: Sidebar Panel States
| State | Display |
|---|---|
| Not Indexed | 'Click Build Index to enable AI categorisation' with estimated time |
| Embedding | 'Embedding note 342 of 1,200...' with progress bar and Cancel button |
| Clustering | 'Analysing your notes, discovering categories...' |
| Review | Three-section panel: tag suggestions, notebook suggestions, archive suggestions |
| Applying | 'Applied 12 of 47 changes...' with progress indicator |
| Done | Summary of applied changes + Undo button |
5. Implementation Plan
350 hours · May 26 – August 23 · Mentors: HahaBill, shikuz
Week 1–2 · Validation Spike (~40 hrs)
- Validate BGE-small-en-v1.5 loads and runs cross-platform via ONNX inside the plugin sandbox
- Confirm WASM memory degradation behaviour and validate worker recycling every 80–100 notes as the fix
- Validate sqlite3 BLOB storage and Float32Array round-trip
- Build minimal PoC: embed a string → store → retrieve → runs on macOS, Windows, Linux
- Share spike report with mentors before locking architecture
Week 3–4 · Note Ingestion & Embedding (~60 hrs)
- Paginated note fetcher via Joplin Data API (100 notes/request, has_more loop)
- Chunk notes (~400 words, 50-word overlap), embed with BGE-small-en-v1.5
- Title vector blending with cosine similarity weighting; filter generic titles
- SHA-256 change detection + user_updated_time stored per note
- Batch-and-yield event loop pattern (10 notes + setTimeout(0))
- Unit tests for chunker edge cases
Week 5–6 · UMAP, Clustering & Tag Generation (~60 hrs)
- Average chunk vectors into note level vectors
- UMAP via DruidJS (n_neighbors=15, n_components=5, min_dist=0, cosine metric, random_state=42)
- Silhouette scoring across K=2 to √N to select optimal K
- K-Means on UMAP-reduced vectors
- TF-IDF term extraction → re-rank by centroid cosine similarity → send top 5 keywords to LLM
- Five-signal staleness score for archive detection
- Integration tests on a sample note collection
Midterm (July 14–18) · Checkpoint
- Working embedding + UMAP + clustering pipeline producing named tag suggestions in a basic panel
Week 7–8 · Suggestion Review UI (~60 hrs)
- React sidebar panel: tag suggestions, notebook suggestions, archive suggestions sections
- Per-suggestion accept / reject + accept all / reject all controls
- Events API cursor sync (onNoteChange, onSyncComplete, 5-minute poll fallback)
- Ollama and OpenAI provider adapters
Week 9–10 · Apply Logic & Rollback (~50 hrs)
- Apply pipeline: create tags → create notebooks → assign tags → move notes via joplin.data
- Write categorisation_log before every apply; one-click undo from log
- Settings UI: provider dropdown, secure API key, privacy disclosure
- Cluster centroid stored per notebook for new-note placement suggestions
Week 11–12 · Testing & Polish (~40 hrs)
- Benchmark on 10,000+ note collections; confirm WASM recycling holds under load
- Edge cases: empty notes, very short notes, multilingual notes, notes that fit no cluster
- ARIA attributes and full keyboard navigation in review panel
- End-to-end integration test on a real Joplin database
Final Phase (Aug 23 – Sep 1) · Documentation & Submission (~40 hrs)
- README: installation, configuration, privacy model, architecture overview
- Architecture documentation for future contributors
- Demo screencast: full suggest → review → apply → undo flow
- Final code review with mentors; submit to Joplin plugin marketplace
Stretch Buffer (~20 hrs)
- Cross-encoder reranking for cluster quality improvement
- Additional LLM provider adapters
6. Deliverables
At the end of the GSoC period the following will exist as working, tested, and documented outputs. Required items represent the minimum successful outcome. Optional items will be completed if time permits.
Core Plugin
| Deliverable | Description | Type |
|---|---|---|
| Joplin plugin package | Installable .jpl plugin published to the Joplin plugin marketplace | Required |
| Plugin settings panel | Provider dropdown, secure API key, model selector, cluster count, archive threshold, privacy disclosure | Required |
| Categorisation sidebar panel | React-based panel covering all six UX states from Not Indexed through Done | Required |
Embedding Pipeline
| Deliverable | Description | Type |
|---|---|---|
| Validation spike report | Cross-platform test of ONNX runtime and sqlite3 BLOB round-trip shared with mentors | Required |
| Paginated note fetcher | Fetches all notes via Joplin Data API with full pagination and Events API cursor sync | Required |
| Structure-aware chunker | 400-word windows with 50-word overlap; heading path prepended for context | Required |
| ONNX local embedding adapter | BGE-small-en-v1.5 bundled with plugin — no API key required | Required |
| Ollama provider adapter | Local HTTP — no data sent to cloud | Required |
| OpenAI provider adapter | Cloud — API key stored via secure: true | Required |
| sqlite3 vector store | BLOB schema with source_hash, user_updated_time, and categorisation_log tables | Required |
| Batch-and-yield indexing | Event loop yield pattern keeping Joplin responsive during embedding | Required |
Clustering & Suggestion Engine
| Deliverable | Description | Type |
|---|---|---|
| Pure JavaScript K-Means | Clusters note vectors entirely in-process | Required |
| Automatic K selection | Silhouette scoring determines optimal cluster count | Required |
| LLM category naming | Sends cluster summaries to LLM to generate tag and notebook names | Required |
| Archive detection | Identifies rarely-accessed notes using the five-signal staleness score | Required |
| Hierarchical agglomerative clustering | Alternative algorithm for better quality on small collections | Optional |
Review & Apply
| Deliverable | Description | Type |
|---|---|---|
| Suggestion review panel | Three sections: tag suggestions, notebook suggestions, archive suggestions | Required |
| Per-suggestion controls | Accept and reject each suggestion individually with keyboard support | Required |
| Apply pipeline | Creates tags, creates notebooks, assigns tags, moves notes via Joplin Data API | Required |
| categorisation_log table | Stores pre-change state of every affected note before applying | Required |
| One-click undo | Restores all affected notes to their state before the last apply | Required |
Quality & Documentation
| Deliverable | Description | Type |
|---|---|---|
| Unit test suite | Tests for chunker, K-Means, silhouette K selection, archive detection | Required |
| Integration tests | End-to-end tests on a real Joplin database with a sample note collection | Required |
| User documentation | Setup guide, privacy FAQ, configuration reference | Required |
| Architecture documentation | Technical documentation for future contributors | Required |
| Demo screencast | Recording showing full suggest-review-apply-undo flow | Required |
| npm package | Core clustering and embedding logic as a standalone npm package | Optional |
7. Availability
I am fully available for the entire GSoC 2026 coding period with no competing employment, internship, or academic commitments. I treat GSoC as a full-time engagement. If I encounter a blocker I will raise it on the forum or Discord the same day rather than waiting. I will maintain a public weekly progress post so mentors and the community can track progress and give feedback at every stage of the project.
| Item | Details |
|---|---|
| Weekly availability | 40–45 hours per week during the coding period |
| Time zone | PKT — UTC+5 (Islamabad, Pakistan) |
| Mentor overlap | Morning PKT overlaps with European business hours, allowing daily async communication with mentors HahaBill and shikuz |
| Communication style | Weekly async progress report posted to the Joplin forum every Monday. Weekly 30-minute video sync with mentor. Daily availability for async communication with same-day responses. All code submitted as early draft PRs for incremental review. |
| Other commitments | No other employment, internship, or GSoC applications. University summer schedule is free of coursework obligations |
| Known absences | None currently planned. Any unavoidable absence communicated to mentors at least one week in advance |
| Blockers | Surfaced within 24 hours. If stuck, mentors will know the same day |