GSoC 2026 Proposal Draft – Idea 3: AI-based-Note-categorization

AI Based Categorization

Unsupervised Topic Modeling & Vector-Relational Taxonomy for Joplin

Muhammad Zohaib Irshad

Mid-Senior Full-Stack TypeScript Developer

[GitHub Account] [LinkedIn Account] [GSOC Idea#3]

1. Pull Requests & Relevant Work

Contributions made in Joplin

Plugins I Have Built

Contribution Status Links
Joplin Word Count, Spell Check & Reading Metrics Completed [Github Repository] [Npm Publish] [Video Demonstration]
AI Note Assistant: Chat On Your Notes Completed [Github Repository] [Npm Publish] [Video Demonstration]

Joplin Main Repository

Contribution Status Links
Pull Request Merged [Pull Request]
Pull Request Opened [Pull Request]
Pull Request Closed [Pull Request]
Pull Request Closed [Pull Request]

AI Notes Assistant Plugin

Contribution Status Links
Pull Request Merged [Pull Request]
Pull Request Merged [Pull Request]

Jarvis Plugin

Contribution Status Links
Pull Request Merged [Pull Request]
Pull Request Closed [Pull Request]
Pull Request Merged [Pull Request]

Contributions in Apache and FOSSASIA Organisations

Contribution Status Links
Pull Request (Complex Backend Token Storage Issue Solved) Merged [Pull Request]
Pull Request Merged [Pull Request]
Pull Request Opened [Pull Request]
Pull Request Opened [Pull Request]
Pull Request Opened [Pull Request]

Personal Projects Related to this Plugin

Project GitHub Links Tech Stack
RAG Pipeline [Frontend Repo] [Backend Repo] [Live Link] TypeScript, Node.js, Next.js, Clerk, LangchainJS, Qdrant Vector DB, Cohere
AI + Reddit Analysis Based Ecommerce Store [Github Repo] [Live Link] TypeScript, Node.js, Next.js, Reddit API, Gemini API, MongoDB
LeaderBoard Sphere [Frontend Repo] [Backend Repo] Next.js, Node.js, Redis, Kafka, Prisma, SocketIO

2. Introduction

I am Muhammad Zohaib Irshad, a mid-senior full-stack software engineer based in Islamabad, Pakistan. I am currently completing my Bachelor's degree in Software Engineering at Air University Islamabad.

Contact Information

Field Details
Name Muhammad Zohaib Irshad
Email zohaibirshad678@gmail.com
GitHub developerzohaib786 (Muhammad Zohaib Irshad) · GitHub
LinkedIn https://linkedin.com/in/developerzohaib
Address Islamabad, Pakistan
University Air University Islamabad
Degree Bachelors in Software Engineering

Programming Experience

Area Technologies
Frontend JavaScript, TypeScript, ReactJS, Next.js, HTML5, CSS3, Tailwindcss, Shadcn/UI
Backend Node.js, ExpressJS, NestJS, Prisma, JWT, Socket.io
Generative AI RAG System, LangchainJS, Qdrant Vector DB, Cohere API, Gemini API, Reddit API
Databases Vector Database (Qdrant), MongoDB, PostgreSQL, NeonDB, MySQL
System Design Redis, Kafka, BullMQ, Rate Limiting, Cache, Server Clustering

Tech Industry Experience

Company Role Timeline
SyncaAI Full Stack TypeScript Intern Jul 2025 – Sept 2025
Softechnova Enterprises MERN Stack Intern Jun 2025 – Jul 2025
SARTE Digital Marketing SEO Expert and WordPress Content Writer Oct 2023 – Sept 2024
Clients from Facebook, LinkedIn & WhatsApp MERN Stack & Next.js Developer 2023 – Present

Open Source Experience

I actively contribute to open source projects including Apache Polaris-Tools, Apache Doris, FOSSASIA, and Links-Hub. I understand how to navigate large established codebases, communicate through PRs, and follow project contribution standards. I have mentioned my open source work in the Pull Requests and Relevant Work section.

3. Project Summary

Problem Space:

As a Joplin user's note collection grows, it becomes increasingly disorganised. Notes accumulate without consistent tags, notebooks fill up with unrelated content, and rarely accessed notes get buried alongside frequently used ones. Manually reviewing and reorganising hundreds or thousands of notes is a task most users never complete: it is simply too time-consuming. The result is a knowledge base that reflects when notes were created rather than what they are actually about, making the overall collection harder to navigate over time.

Implementation Strategy:

The plugin embeds each note using BGE-small-en-v1.5, chosen for its 512-token context window and strong MTEB clustering benchmark scores. Notes are split into overlapping chunks, embedded, and averaged into a single note-level vector. Meaningful titles are blended into the note vector using cosine-similarity weighting; generic titles such as "Untitled" are filtered out before any weighting is applied.

Rather than clustering on raw 384-dimensional vectors, the plugin first applies UMAP via DruidJS to reduce vectors to 5 dimensions, separating topic clusters in a low-dimensional space where K-Means performs significantly better. The optimal K is selected using silhouette scoring across K values from 2 to √N, which is more reliable than the elbow method, whose bend cannot be detected programmatically.

Tag names are generated without sending note text to an LLM. TF-IDF first identifies cluster-specific terms, which are re-ranked by cosine similarity to the cluster centroid. Only the top five keywords per cluster are sent to the LLM, keeping the process privacy-preserving even when a cloud provider is used. All vectors are stored in a local SQLite database via joplin.require('sqlite3').

Archive candidates are scored across five signals: last edited date, edit count, content length, backlinks from other notes, and silhouette fit. This makes detection more accurate than checking a single timestamp. All suggestions are presented in a review panel before any change is applied. The plugin writes the pre-change state of every affected note to a categorisation_log table before applying, enabling one-click undo. Incremental sync uses the Joplin Events API cursor to catch changes from all devices.

Expected Outcome

A Joplin plugin that analyses the user's note collection using UMAP-enhanced clustering, discovers natural semantic categories, and presents tag and notebook suggestions in a review panel. If the user approves, the plugin applies those changes automatically: creating new tags, creating new notebooks, and moving notes. It never modifies any note without explicit confirmation. A one-click undo restores the full previous state from the categorisation log. The plugin supplements rather than replaces the user's existing organisational structure and works entirely offline by default, with cloud providers available as opt-in only.

4. Technical Approach

4.1 Architectural Justification: Decoupled Plugin Runtime vs. Core Integration

A plugin keeps Joplin's core lightweight, ships independently of the main release cycle, and touches zero core source code. It can be installed or removed without affecting the main application.

4.2 Comparative Analysis: Evolution Beyond Current LLM Baselines (Jarvis Case Study)

I have worked directly inside the Jarvis codebase, submitting PR #66 (Azure OpenAI support) and resolving Issue #18 (dedicated chatbox) via PR #69, which showed me clearly where Jarvis's boundaries are. In my own joplin-plugin-ai-chat-on-notes I built a multi-provider abstraction layer. I am applying both patterns here.

Jarvis operates on the currently open note with no batch embedding, no persistent vector index, and no clustering. This proposal builds that missing layer: embed every note, reduce dimensions with UMAP, discover semantic groupings through clustering, and surface them as actionable tag and notebook suggestions.

4.3 Technology Stack & Dependency Graph

Component Technology Details
Language TypeScript Consistent with Joplin's entire plugin ecosystem
Database sqlite3 via joplin.require('sqlite3') Officially supported. Zero setup, single file, all platforms
Vector storage BLOB column (Float32Array) Raw binary, no native extension required
Dimensionality reduction UMAP via DruidJS Pure JavaScript, IEEE-published, actively maintained
Clustering Pure JavaScript K-Means Runs entirely in-process, no native modules needed
Default embedding ONNX local (BGE-small-en-v1.5) ~23 MB, no API key required, fully offline
Category naming LLM OpenAI / Ollama Generates tag and notebook names from cluster summaries

4.4 Validation-First Design: Week 1–2 Technical Spike

The two highest technical risks are treated as hypotheses before the architecture is locked:

Risk 1: Local embedding inference: Can ONNX load and run cross-platform inside the Joplin plugin sandbox?

  • Preferred: ONNX Runtime Node.js
  • Fallback A: ONNX WASM
  • Fallback B: HTTP to Ollama or OpenAI

Risk 2: WASM memory degradation: Transformers.js WebAssembly memory grows during batch embedding and never releases, dropping throughput from ~47 notes/sec to ~2 notes/sec after 100 notes. Mitigation: recycle the worker process every 80–100 notes; embeddings already written to sqlite3 are never lost on recycling.
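The worker-recycling mitigation can be sketched as a batch loop that persists each vector immediately and replaces the embedding worker every N notes. This is an illustrative sketch only: `EmbedWorker`, `createWorker`, and `persist` stand in for the real Transformers.js worker setup and the sqlite3 write.

```typescript
// Hypothetical interface standing in for a Transformers.js embedding worker.
interface EmbedWorker {
  embed(text: string): number[];
  dispose(): void;
}

// Embed all notes, recycling the worker every `recycleEvery` notes so that
// WASM memory is reclaimed. Vectors are persisted as soon as they are
// computed, so nothing is lost when a worker is recycled.
async function embedAll(
  notes: string[],
  createWorker: () => EmbedWorker,
  recycleEvery: number,
  persist: (index: number, vec: number[]) => void,
): Promise<void> {
  let worker = createWorker();
  for (let i = 0; i < notes.length; i++) {
    persist(i, worker.embed(notes[i])); // write-through: survives recycling
    if ((i + 1) % recycleEvery === 0) {
      worker.dispose(); // recycle before WASM memory degrades throughput
      worker = createWorker();
    }
  }
  worker.dispose();
}
```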

4.5 Embedding Model Selection & Benchmarking (MTEB Analysis)

BGE-small-en-v1.5 is the default. Model selection was based specifically on MTEB clustering task scores, not the overall leaderboard, which averages across 8 task types and rewards retrieval quality irrelevant to this project. all-MiniLM-L6-v2 was considered but rejected because its 256-token limit silently truncates longer notes with no error or warning.

Model Dimensions Size Context Notes
BGE-small-en-v1.5 (ONNX) 384 ~23 MB 512 tokens Default. Highest reliable clustering score on MTEB
all-MiniLM-L6-v2 (ONNX) 384 ~33 MB 256 tokens Rejected as default, silently truncates at 256 tokens
nomic-embed-text (Ollama) 768 ~274 MB 8192 tokens For the Ollama HTTP path
text-embedding-3-small (OpenAI) 1536 Cloud 8191 tokens $0.02 per million tokens

4.6 System Workflow & Pipeline Stages

The plugin operates in four sequential phases: embedding, clustering, suggestion generation, and review and apply.

4.6.1 Phase I: Vector Ingestion & Multi-Stage Embedding

Notes are fetched via the Joplin Data API in paginated batches of 100. Each note is split into overlapping chunks (400 words, 50-word overlap) rather than by headings; this ensures the model never silently truncates content at a heading boundary. Each chunk is embedded using BGE-small-en-v1.5 and stored as a BLOB in sqlite3 alongside note ID, title, SHA-256 hash, and user_updated_time.

Meaningful titles are embedded separately and blended into the final note vector using cosine similarity weighting. Generic titles ("Untitled", "New Note", dates) are filtered before any weighting is applied. SHA-256 hashing ensures only modified notes are re-embedded on subsequent runs.
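The overlapping chunker described above can be sketched as a simple word-window splitter. This is a minimal illustration, not the plugin's actual code: the 400-word window and 50-word overlap come from the proposal, while the function name is hypothetical.

```typescript
// Split a note body into overlapping word windows. Each window is
// `chunkWords` long and advances by (chunkWords - overlapWords), so
// consecutive chunks share `overlapWords` words of context.
function chunkNote(body: string, chunkWords = 400, overlapWords = 50): string[] {
  const words = body.split(/\s+/).filter(w => w.length > 0);
  if (words.length === 0) return [];
  const chunks: string[] = [];
  const step = chunkWords - overlapWords; // each window advances by 350 words
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkWords).join(' '));
    if (start + chunkWords >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

A 1,000-word note yields three chunks: words 0–399, 350–749, and 700–999, so no content is lost at window boundaries.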

4.6.2 Phase II: Dimensionality Reduction & Unsupervised Clustering

Once all notes are embedded, chunk vectors are averaged into a single note-level vector. Rather than clustering directly on raw 384-dimensional vectors, where the curse of dimensionality makes everything look equally distant, the plugin first applies UMAP via DruidJS to reduce each note vector to 5 dimensions. UMAP parameters follow BERTopic's recommended defaults for topic clustering:

Parameter Value Reason
n_neighbors 15 Balances local and global structure for 100–2000 note collections
n_components 5 Sweet spot: 2–3 loses too much, higher hurts K-Means
min_dist 0.0 Packs similar notes tightly for clean cluster boundaries
metric cosine Text embeddings should be compared by angle, not magnitude
random_state 42 Fixed seed: ensures consistent output across runs

K-Means then runs on the UMAP-reduced vectors. The optimal K is selected using silhouette scoring across K values from 2 to √N; the K with the highest average silhouette score is chosen. The elbow method was considered and rejected because detecting the bend automatically in code, without a human looking at the plot, is unreliable.
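The K-selection loop above can be sketched in pure TypeScript: run K-Means for each candidate K, compute the mean silhouette score, and keep the best. This is a toy sketch on 2-D points for brevity (the real pipeline runs on 5-D UMAP output with a deterministic seed); all names are illustrative.

```typescript
type Vec = number[];

const dist = (a: Vec, b: Vec) => Math.hypot(...a.map((x, i) => x - b[i]));

// Minimal K-Means with deterministic initialisation (toy version; the real
// plugin would seed from the fixed random_state described above).
function kmeans(points: Vec[], k: number, iters = 50): number[] {
  let centroids = Array.from({ length: k }, (_, i) =>
    points[Math.floor((i * points.length) / k)].slice());
  let labels: number[] = new Array(points.length).fill(0);
  for (let it = 0; it < iters; it++) {
    labels = points.map(p => {
      let best = 0;
      for (let c = 1; c < k; c++)
        if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
      return best;
    });
    centroids = centroids.map((c, ci) => {
      const members = points.filter((_, i) => labels[i] === ci);
      if (members.length === 0) return c; // keep old centroid for empty clusters
      return c.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return labels;
}

// Mean silhouette: for each point, a = mean distance to its own cluster,
// b = mean distance to the nearest other cluster; score = (b - a) / max(a, b).
function meanSilhouette(points: Vec[], labels: number[]): number {
  let total = 0;
  for (let i = 0; i < points.length; i++) {
    const byCluster = new Map<number, number[]>();
    for (let j = 0; j < points.length; j++) {
      if (j === i) continue;
      const arr = byCluster.get(labels[j]) || [];
      arr.push(dist(points[i], points[j]));
      byCluster.set(labels[j], arr);
    }
    const own = byCluster.get(labels[i]);
    if (!own || own.length === 0) continue; // singleton cluster contributes 0
    const avg = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
    const a = avg(own);
    let b = Infinity;
    byCluster.forEach((ds, c) => { if (c !== labels[i]) b = Math.min(b, avg(ds)); });
    total += (b - a) / Math.max(a, b);
  }
  return total / points.length;
}

// Try K = 2..√N and return the K with the highest mean silhouette score.
function bestK(points: Vec[]): number {
  const maxK = Math.max(2, Math.floor(Math.sqrt(points.length)));
  let best = 2, bestScore = -Infinity;
  for (let k = 2; k <= maxK; k++) {
    const score = meanSilhouette(points, kmeans(points, k));
    if (score > bestScore) { bestScore = score; best = k; }
  }
  return best;
}
```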

Tag names are generated through a two-step pipeline. First, TF-IDF identifies terms that appear frequently in a cluster but not in others. Those terms are re-ranked by cosine similarity to the cluster centroid. Only the top five keywords per cluster are sent to the LLM, never actual note text, which keeps the process privacy-preserving even when a cloud provider is used.
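The two-step keyword pipeline can be sketched as follows. This is an illustrative sketch under simplifying assumptions: clusters are pre-tokenised word arrays, term vectors are supplied as a lookup (in the real pipeline they would come from the embedding model), and all names are hypothetical.

```typescript
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Step 1: TF-IDF at the cluster level -- how specific is each term to its
// cluster? Terms appearing in many clusters are down-weighted.
function clusterTfIdf(clusters: string[][]): Map<string, number>[] {
  const df = new Map<string, number>(); // number of clusters containing the term
  const tfs = clusters.map(words => {
    const tf = new Map<string, number>();
    words.forEach(w => tf.set(w, (tf.get(w) || 0) + 1));
    tf.forEach((_, w) => df.set(w, (df.get(w) || 0) + 1));
    return tf;
  });
  return tfs.map(tf => {
    const scored = new Map<string, number>();
    tf.forEach((f, w) =>
      scored.set(w, f * Math.log((1 + clusters.length) / (1 + (df.get(w) || 0)))));
    return scored;
  });
}

// Step 2: shortlist by TF-IDF, then re-rank by cosine similarity to the
// cluster centroid; only the top `top` keywords would go to the LLM.
function topKeywords(
  tfidf: Map<string, number>, termVecs: Map<string, Vec>, centroid: Vec, top = 5,
): string[] {
  const entries: [string, number][] = [];
  tfidf.forEach((score, w) => entries.push([w, score]));
  return entries
    .sort((a, b) => b[1] - a[1]).slice(0, 20)                       // TF-IDF shortlist
    .map(([w]) => [w, cosine(termVecs.get(w) || [], centroid)] as [string, number])
    .sort((a, b) => b[1] - a[1]).slice(0, top)                      // centroid re-rank
    .map(([w]) => w);
}
```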

4.6.3 Phase III: Multi-Signal Heuristics for Archive Detection

Archive candidates are scored using a five-signal staleness score rather than a single timestamp field, which is insufficient: a note untouched for two years but referenced by ten other notes is not a candidate for archiving.

Signal Weight Calculation
Last edited 0.30 days_since_edit / 365, capped at 1.0
Edit count 0.15 1 - min(edit_count, 10) / 10
Content Length 0.10 1.0 if under 100 characters and not a to-do, else 0.0
Backlinks 0.15 1.0 if no other note links to this one, else 0.0
Silhouette fit 0.30 1-max(individual_silhouette, 0): poor cluster fit scores high

Notes scoring above 0.6 appear in the archive suggestions section. The threshold is configurable in settings.
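The weighted score in the table above can be written directly as code. The weights and per-signal formulas mirror the table; the `NoteStats` shape and function name are hypothetical.

```typescript
// Input signals for one note (illustrative shape, not the plugin's schema).
interface NoteStats {
  daysSinceEdit: number;
  editCount: number;
  charCount: number;
  isTodo: boolean;
  hasBacklinks: boolean; // true if any other note links to this one
  silhouette: number;    // this note's individual silhouette score, -1..1
}

// Five-signal staleness score, 0..1; notes above the (configurable) 0.6
// threshold become archive suggestions.
function stalenessScore(n: NoteStats): number {
  const lastEdited = Math.min(n.daysSinceEdit / 365, 1.0);      // weight 0.30
  const editCount = 1 - Math.min(n.editCount, 10) / 10;         // weight 0.15
  const shortNote = n.charCount < 100 && !n.isTodo ? 1.0 : 0.0; // weight 0.10
  const orphan = n.hasBacklinks ? 0.0 : 1.0;                    // weight 0.15
  const poorFit = 1 - Math.max(n.silhouette, 0);                // weight 0.30
  return 0.30 * lastEdited + 0.15 * editCount + 0.10 * shortNote
       + 0.15 * orphan + 0.30 * poorFit;
}
```

A two-year-old orphaned stub with a poor cluster fit maxes out every signal and scores 1.0, while a heavily edited, well-linked, well-clustered note scores 0.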

Incremental Synchronization & Events API Hooking

onNoteChange() only fires for the currently selected note; it does not catch changes from other devices after a Joplin sync. The plugin hooks into three event sources:

  • onNoteChange(): immediate re-embedding of the currently edited note
  • onSyncComplete(): runs syncIndex() via the Events API cursor to catch changes from other devices
  • Periodic polling every 5 minutes as a fallback for anything that slipped through

UMAP and clustering only re-run when the user clicks Re-analyse or when more than 5% of the collection has changed; below that threshold the existing clustering remains accurate enough.
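The 5% gating rule can be expressed as a small predicate that the event handlers above would consult before triggering a full re-analysis. The function name and parameters are illustrative.

```typescript
// Decide whether enough of the collection has changed since the last full
// clustering run to justify re-running UMAP + K-Means. Below the threshold,
// changed notes are only re-embedded and tentatively assigned to existing
// clusters.
function shouldReanalyse(
  changedSinceLastRun: number,
  totalNotes: number,
  threshold = 0.05,
): boolean {
  if (totalNotes === 0) return false;
  return changedSinceLastRun / totalNotes > threshold;
}
```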

4.6.4 Phase IV: Suggestion Review Phase

No changes are ever applied without explicit user confirmation. The plugin presents all suggestions in a structured review panel with three sections:

  • Tag suggestions: each proposed new tag with the list of notes that would receive it
  • Notebook suggestions: each proposed new notebook with the notes that would be moved into it
  • Archive suggestions: notes flagged as rarely-accessed with a proposed move to an archive notebook

The user can accept all, reject all, or handle each suggestion individually. Before applying any accepted suggestion the plugin writes the original state of every affected note to the categorisation_log table. A one-click 'Undo last categorisation' button restores all affected notes to their previous state from the log.

4.7 Constraint Management: Handling Data API Rate Limits & Throttling

The Joplin Data API is a local REST service accessed through the joplin.data module. The maximum number of items returned per request is 100, controlled by the limit parameter. The plugin fetches notes in controlled batches of 100 per page. Critically, Joplin's plugin sandbox has no Web Worker API. To prevent UI freezing during embedding a batch-and-yield pattern is used: process 10 notes, then yield control back to the event loop. This keeps Joplin responsive throughout initial indexing.

Constraint Value Source
Max items per request 100 Official Joplin Data API docs
Pagination field has_more (boolean) Official Joplin Data API docs
Page parameter page (starts at 1) Official Joplin Data API docs
Web Worker API Not available in plugin sandbox Joplin plugin architecture
Mitigation Batch-and-yield event loop pattern
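The batch-and-yield pattern from the paragraph above can be sketched as follows. This is an illustrative sketch; `processNote` stands in for the per-note embedding work, and the batch size of 10 comes from the proposal.

```typescript
// Hand control back to the event loop so queued UI events can run.
const yieldToEventLoop = () => new Promise<void>(resolve => setTimeout(resolve, 0));

// Process notes in batches of `batchSize`, yielding between batches so the
// Joplin UI stays responsive even without Web Worker support in the sandbox.
async function indexNotes(
  notes: string[],
  processNote: (note: string) => void,
  batchSize = 10,
): Promise<void> {
  for (let i = 0; i < notes.length; i++) {
    processNote(notes[i]);
    if ((i + 1) % batchSize === 0) await yieldToEventLoop(); // let UI events run
  }
}
```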

4.8 Data Serialization: BLOB Binary Efficiency vs. JSON Overhead

The vector is stored as binary rather than a JSON string because binary can be deserialised directly back into a Float32Array in a single operation. During clustering every note vector is compared against every centroid on every iteration so the deserialisation cost multiplies significantly across large collections.

BLOB (Float32Array binary) JSON array
Speed Fast, one Buffer read, directly usable Slow, full JSON parse before every comparison
Human readable No Yes
Storage size ~1.5 KB per vector (384-dim local model) ~2.5 KB per vector
Used for Storing vectors in sqlite3 Not suitable for vector operations
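The BLOB round-trip above can be sketched with Node's Buffer and a typed-array view: 384 float32 values occupy exactly 384 × 4 = 1,536 bytes, matching the ~1.5 KB figure in the table. Helper names are illustrative; the read path copies into a fresh buffer so the Float32Array view is always 4-byte aligned.

```typescript
// Serialise a vector to the raw bytes stored in the sqlite3 BLOB column.
function vecToBlob(vec: Float32Array): Buffer {
  return Buffer.from(vec.buffer, vec.byteOffset, vec.byteLength);
}

// Deserialise in one copy, with no JSON parse on the clustering hot path.
function blobToVec(blob: Buffer): Float32Array {
  const out = new Float32Array(blob.byteLength / 4);
  Buffer.from(out.buffer).set(blob); // copy bytes into the aligned backing buffer
  return out;
}
```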

4.9 API Cost Estimate

The local ONNX path has zero cost. Cloud providers are opt-in only.

Collection size Avg chunks Local ONNX time OpenAI cost Storage (384-dim)
100 notes ~300 ~15 seconds ~$0.001 ~0.5 MB
1,000 notes ~3,000 ~2–3 minutes ~$0.01 ~5 MB
5,000 notes ~15,000 ~12–15 minutes ~$0.05 ~25 MB
10,000 notes ~30,000 ~25–30 minutes ~$0.10 ~50 MB

4.10 IPC Bridge Constraints: Secure Sandbox Data Access

Plugins cannot access the Joplin database directly. The Joplin database is an SQLite file managed exclusively by the Joplin core application. All data access goes through the joplin.data module via an IPC bridge between the plugin sandbox and the Joplin main process. This matters significantly for the apply phase of this plugin, which requires multiple sequential write operations: creating tags, creating notebooks, assigning tags to notes, and moving notes, each of which is a separate API call through the IPC bridge. The plugin batches these write operations and applies them sequentially with a small delay between calls. A progress indicator shows the user how many changes have been applied out of the total.

Access method Available to plugins Speed Notes
Direct SQLite file access No Very fast Reserved for Joplin core only
joplin.data REST API via IPC Yes Moderate Only supported method
Max items per request 100 items
Fields selection Yes Faster Use fields param to fetch only what is needed
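The sequential apply loop described above can be sketched as follows. This is a hedged sketch: `applyChange` stands in for the individual joplin.data calls, and the `Change` shape, delay value, and callback names are all illustrative.

```typescript
// One accepted suggestion becomes one write operation through the IPC bridge.
interface Change {
  kind: 'createTag' | 'createNotebook' | 'assignTag' | 'moveNote';
  payload: unknown;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Apply accepted changes one at a time, reporting progress after each call
// and pausing briefly between calls to avoid flooding the IPC bridge.
async function applyAll(
  changes: Change[],
  applyChange: (c: Change) => Promise<void>,
  onProgress: (done: number, total: number) => void,
  delayMs = 50,
): Promise<void> {
  for (let i = 0; i < changes.length; i++) {
    await applyChange(changes[i]);
    onProgress(i + 1, changes.length);
    if (i < changes.length - 1) await sleep(delayMs);
  }
}
```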

4.11 Core Design Principles for End-to-End AI Systems

  • No Changes Without User Confirmation: Every tag assignment, notebook creation, and note move is shown in the review panel before anything is written. The plugin never modifies the user's note collection silently.
  • Reversibility: Before applying any accepted suggestion the plugin logs the original state of every affected note to the categorisation_log table. A one-click undo button restores all affected notes.
  • Incremental Processing via Change Detection: SHA-256 hashing ensures only modified notes are re-embedded. The Events API cursor tracks all changes including those from other devices after a Joplin sync.
  • Local-First Privacy: By default nothing leaves the machine. All embedding inference runs locally via ONNX. When a user switches to a remote provider the settings page shows a persistent warning.
  • Chunking with Overlap for Context Preservation: Notes are split into structure-aware overlapping chunks. The heading path is prepended to each chunk so that the embedding captures both the local content and the broader document context.
  • Provider Abstraction via Common Interface: The LLM provider sits behind a shared EmbeddingProvider interface. Switching between ONNX, Ollama, and OpenAI requires only a settings change.
  • Background-Safe Processing: Embedding and clustering use the batch-and-yield event loop pattern since the Joplin plugin sandbox has no Web Worker API. Joplin remains fully usable throughout.
  • Graceful Degradation: If the API is unreachable the plugin shows a clear error and waits for the user to fix configuration. If the ONNX model fails to load it falls back to the Ollama or OpenAI path.
  • Encrypted API Key Storage: API keys are stored using Joplin's settings API with secure: true, which uses the OS keychain where available. Keys are never stored in plaintext config files.

4.12 Error Handling and Edge Cases

  • API key invalid or expired: Before the first full embedding run the plugin sends a test embedding request. If it fails the user sees an error in settings immediately.
  • ONNX model fails to load: The plugin disables local inference, falls back to the Ollama or OpenAI path, and shows a clear message in settings suggesting the user switch providers.
  • Ollama not running: The plugin pings http://localhost:11434/api/tags on startup. If unreachable the settings panel shows a warning with a link to Ollama's installation instructions.

Very large note collections (5,000+ notes):

  • Live progress indicator: 'Embedding note 342 of 5,127...'
  • Batch-and-yield pattern: 10 notes then event loop yield
  • Cancel button: already-embedded chunks survive in sqlite3
  • Persistent partial progress: next startup picks up where it left off via source_hash comparison
  • Notes that fit no cluster cleanly: Notes with similarity to their assigned centroid below a configurable threshold are flagged as 'uncategorised' and excluded from suggestions.
  • Empty or very short notes: Notes producing fewer than 20 tokens are excluded from clustering and shown as 'too short to categorise' in a separate section of the review panel.
  • User rejects all suggestions: The plugin does not re-run automatically. The user must manually trigger a new analysis run, preventing the plugin from repeatedly suggesting changes already rejected.
  • sqlite3 database corruption: The plugin runs PRAGMA integrity_check on startup. If corruption is detected it offers a one-click 'Rebuild Analysis' button.
  • Non-English notes: The default ONNX model (BGE-small-en-v1.5) is trained on English text. Users with multilingual collections can switch to paraphrase-multilingual-MiniLM-L12-v2 via the model selector.
  • Accessibility: The review panel includes role='region', aria-label, and aria-live='polite'. Full keyboard navigation: Tab through suggestions, Enter to accept, Delete to reject, Escape to cancel.

4.13 First-Run Behaviour

Step What the user sees What the plugin does
Plugin installed, Joplin restarted Sidebar panel appears collapsed Waits — no indexing starts automatically
User opens sidebar Not Indexed state with Build Index button and estimated time Ready to start
User clicks Build Index Progress bar: 'Embedding note 342 of 1,247...' + Cancel button Batch-and-yield embedding. Joplin remains fully usable
Embedding complete 'Analysing your notes...' message K-Means clustering and LLM category naming runs
Analysis complete Review panel opens with tag, notebook, and archive suggestions All suggestions visible, nothing applied yet
User reviews and confirms Apply progress: 'Applied 12 of 47 changes' Creates tags, notebooks, moves notes via Data API
Apply complete 'Done. Undo last categorisation' button visible categorisation_log written for rollback
Subsequent launches Silent background sync message Events API cursor check — re-embeds only changed notes

4.14 Plugin Settings

Setting Type Default Description
Embedding Provider Dropdown local local / ollama / openai
API Endpoint String "" URL for Ollama or OpenAI. Hidden when local selected
API Key Secure String "" Stored via secure: true in OS keychain
Embedding Model String BGE-small-en-v1.5 Model identifier. Changes based on provider
Cluster Count String auto auto (silhouette scoring) or manual integer 3–30
Archive Threshold (months) Integer 12 Notes not edited in this many months are flagged
Privacy Disclosure Label Read-only warning shown when a remote provider is selected

4.15 UX Plan: Sidebar Panel States

State Display
Not Indexed 'Click Build Index to enable AI categorisation' with estimated time
Embedding 'Embedding note 342 of 1,200...' with progress bar and Cancel button
Clustering 'Analysing your notes, discovering categories...'
Review Three-section panel: tag suggestions, notebook suggestions, archive suggestions
Applying 'Applied 12 of 47 changes...' with progress indicator
Done Summary of applied changes + Undo button

5. Implementation Plan

350 hours · May 26 – August 23 · Mentors: HahaBill, shikuz

Week 1–2 · Validation Spike (~40 hrs)

  • Validate BGE-small-en-v1.5 loads and runs cross-platform via ONNX inside the plugin sandbox
  • Confirm WASM memory degradation behaviour and validate worker recycling every 80–100 notes as the fix
  • Validate sqlite3 BLOB storage and Float32Array round-trip
  • Build minimal PoC: embed a string → store → retrieve → runs on macOS, Windows, Linux
  • Share spike report with mentors before locking architecture

Week 3–4 · Note Ingestion & Embedding (~60 hrs)

  • Paginated note fetcher via Joplin Data API (100 notes/request, has_more loop)
  • Chunk notes (~400 words, 50-word overlap), embed with BGE-small-en-v1.5
  • Title vector blending with cosine similarity weighting; filter generic titles
  • SHA-256 change detection + user_updated_time stored per note
  • Batch-and-yield event loop pattern (10 notes + setTimeout(0))
  • Unit tests for chunker edge cases

Week 5–6 · UMAP, Clustering & Tag Generation (~60 hrs)

  • Average chunk vectors into note-level vectors
  • UMAP via DruidJS (n_neighbors=15, n_components=5, min_dist=0, cosine metric, random_state=42)
  • Silhouette scoring across K=2 to √N to select optimal K
  • K-Means on UMAP-reduced vectors
  • TF-IDF term extraction → re-rank by centroid cosine similarity → send top 5 keywords to LLM
  • Five-signal staleness score for archive detection
  • Integration tests on a sample note collection

Midterm (July 14–18) · Checkpoint

  • Working embedding + UMAP + clustering pipeline producing named tag suggestions in a basic panel

Week 7–8 · Suggestion Review UI (~60 hrs)

  • React sidebar panel: tag suggestions, notebook suggestions, archive suggestions sections
  • Per-suggestion accept / reject + accept all / reject all controls
  • Events API cursor sync (onNoteChange, onSyncComplete, 5-minute poll fallback)
  • Ollama and OpenAI provider adapters

Week 9–10 · Apply Logic & Rollback (~50 hrs)

  • Apply pipeline: create tags → create notebooks → assign tags → move notes via joplin.data
  • Write categorisation_log before every apply; one-click undo from log
  • Settings UI: provider dropdown, secure API key, privacy disclosure
  • Cluster centroid stored per notebook for new-note placement suggestions

Week 11–12 · Testing & Polish (~40 hrs)

  • Benchmark on 10,000+ note collections; confirm WASM recycling holds under load
  • Edge cases: empty notes, very short notes, multilingual notes, notes that fit no cluster
  • ARIA attributes and full keyboard navigation in review panel
  • End-to-end integration test on a real Joplin database

Final Phase (Aug 23 – Sep 1) · Documentation & Submission (~40 hrs)

  • README: installation, configuration, privacy model, architecture overview
  • Architecture documentation for future contributors
  • Demo screencast: full suggest → review → apply → undo flow
  • Final code review with mentors; submit to Joplin plugin marketplace

Stretch Buffer (~20 hrs)

  • Cross-encoder reranking for cluster quality improvement
  • Additional LLM provider adapters

6. Deliverables

At the end of the GSoC period the following will exist as working, tested, and documented outputs. Required items represent the minimum successful outcome. Optional items will be completed if time permits.

Core Plugin

Deliverable Description Type
Joplin plugin package Installable .jpl plugin published to the Joplin plugin marketplace Required
Plugin settings panel Provider dropdown, secure API key, model selector, cluster count, archive threshold, privacy disclosure Required
Categorisation sidebar panel React-based panel covering all five UX states from Not Indexed through Done Required

Embedding Pipeline

Deliverable Description Type
Validation spike report Cross-platform test of ONNX runtime and sqlite3 BLOB round-trip shared with mentors Required
Paginated note fetcher Fetches all notes via Joplin Data API with full pagination and Events API cursor sync Required
Structure-aware chunker Splits into ~400-word chunks with 50-word overlap, heading path prepended Required
ONNX local embedding adapter BGE-small-en-v1.5 bundled with plugin — no API key required Required
Ollama provider adapter Local HTTP — no data sent to cloud Required
OpenAI provider adapter Cloud — API key stored via secure: true Required
sqlite3 vector store BLOB schema with source_hash, user_updated_time, and categorisation_log tables Required
Batch-and-yield indexing Event loop yield pattern keeping Joplin responsive during embedding Required

Clustering & Suggestion Engine

Deliverable Description Type
Pure JavaScript K-Means Clusters note vectors entirely in-process Required
Automatic K selection Silhouette scoring across K = 2 to √N determines optimal cluster count Required
LLM category naming Sends cluster summaries to LLM to generate tag and notebook names Required
Archive detection Identifies rarely-accessed notes using the five-signal staleness score Required
Hierarchical agglomerative clustering Alternative algorithm for better quality on small collections Optional

Review & Apply

Deliverable Description Type
Suggestion review panel Three sections: tag suggestions, notebook suggestions, archive suggestions Required
Per-suggestion controls Accept and reject each suggestion individually with keyboard support Required
Apply pipeline Creates tags, creates notebooks, assigns tags, moves notes via Joplin Data API Required
categorisation_log table Stores pre-change state of every affected note before applying Required
One-click undo Restores all affected notes to their state before the last apply Required

Quality & Documentation

Deliverable Description Type
Unit test suite Tests for chunker, K-Means, silhouette K selection, archive detection Required
Integration tests End-to-end tests on a real Joplin database with a sample note collection Required
User documentation Setup guide, privacy FAQ, configuration reference Required
Architecture documentation Technical documentation for future contributors Required
Demo screencast Recording showing full suggest-review-apply-undo flow Required
npm package Core clustering and embedding logic as a standalone npm package Optional

7. Availability

I am fully available for the entire GSoC 2026 coding period with no competing employment, internship, or academic commitments. I treat GSoC as a full-time engagement. If I encounter a blocker I will raise it on the forum or Discord the same day rather than waiting. I will maintain a public weekly progress post so mentors and the community can track progress and give feedback at every stage of the project.

Item Details
Weekly availability 40–45 hours per week during the coding period
Time zone PKT — UTC+5 (Islamabad, Pakistan)
Mentor overlap Morning PKT overlaps with European business hours, allowing daily async communication with mentors HahaBill and shikuz
Communication style Weekly async progress report posted to the Joplin forum every Monday. Weekly 30-minute video sync with mentor. Daily availability for async communication with same-day responses. All code submitted as early draft PRs for incremental review.
Other commitments No other employment, internship, or GSoC applications. University summer schedule is free of coursework obligations
Known absences None currently planned. Any unavoidable absence communicated to mentors at least one week in advance
Blockers Surfaced within 24 hours. If stuck, mentors will know the same day

@shikuz @HahaBill @malekhavasi I have added the pull requests section and am now eagerly awaiting reviews on my draft proposal from the possible mentors of this project.

Sorry, I accidentally posted this in the other proposal thread first.

The model comparison table has BGE-small-en-v1.5 at 256 tokens and all-MiniLM-L6-v2 at 512. I think those are swapped. Since the context window drives your chunking decisions, does the model choice change if the specs are reversed?

sqlite3 works on desktop but isn't available on Joplin mobile (at the moment). Is mobile out of scope, or have you thought about a storage path that works on both?

The proposal covers incremental re-embedding via the Events API, but what happens to the clusters when a user creates a new note? Does the whole UMAP + K-Means pipeline re-run, or is there a lighter path?

Have you tested the clustering pipeline on a real note collection? Curious what the clusters looked like.

Thanks for the review @shikuz!

1. Context window table: yes, those are swapped. Correct values: BGE-small-en-v1.5 → 512 tokens, all-MiniLM-L6-v2 → 256 tokens (silently truncates).

2. Mobile: explicitly out of scope. The pipeline depends on Node.js native modules (sqlite3, ONNX Runtime) unavailable in the mobile sandbox. The vector store sits behind an abstraction layer, though, so a future contributor could swap sqlite3 for sql.js without touching embedding or clustering logic.

3. New note: no full re-run

  1. onNoteChange() fires → note embedded, vector saved to sqlite3 (< 1 second)
  2. New vector compared against stored centroids → tentative assignment, no re-clustering
  3. Full re-analysis only triggers on manual Re-analyse click, or when 5%+ of collection has changed
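Step 2 above (tentative assignment without re-clustering) can be sketched as a nearest-centroid lookup. This is a toy sketch with illustrative names; the real plugin would read the stored centroids from sqlite3.

```typescript
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Tentatively assign a freshly embedded note to the most similar stored
// cluster centroid, without re-running UMAP or K-Means.
function tentativeCluster(noteVec: number[], centroids: number[][]): number {
  let best = 0;
  for (let c = 1; c < centroids.length; c++)
    if (cosineSim(noteVec, centroids[c]) > cosineSim(noteVec, centroids[best])) best = c;
  return best;
}
```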

4. Real collection test: Yes. Built a working Joplin plugin prototype with embedded clustering pipeline. The implementation validates the core architecture before potential production scaling.

Demo

There is a 10 MB video upload limit, so I have uploaded only the last part. Please see the full demo video at

data.json (100 notes) 
  ↓
Embedding extraction (BGE-small-en-v1.5 via Transformers.js in Web Worker)
  ↓
Optional dimensionality reduction (UMAP: 384-dim → 5-dim for tighter separation)
  ↓
K-Means clustering (K=2 to adaptive max)
  ↓
Silhouette scoring (automatic K selection without manual inspection)
  ↓
Final clustering + Benchmark UI (sidebar visualization with metrics)

Repository of my clustering-phase testing (with dummy data, not real notes) :slight_smile:

Will push the corrected proposal with the table fix shortly.

Phase # 1 System Architecture

Phase # 2 System Architecture

I also want @HahaBill to see my work :blush:

Thank you for your proposal, I had a look :slight_smile: