GSoC 2026 Proposal Draft – Idea 3: AI-based-Note-categorization

AI Based Categorization

Unsupervised Topic Modeling & Vector-Relational Taxonomy for Joplin

Muhammad Zohaib Irshad

Mid-Senior Full-Stack TypeScript Developer

[GitHub Account] [LinkedIn Account] [GSOC Idea#3]

1. Pull Requests & Relevant Work

Contributions made in Joplin

Plugins I Have Built

Contribution Status Links
Joplin Word Count, Spell Check & Reading Metrics Completed [Github Repository] [Npm Publish] [Video Demonstration]
AI Note Assistant: Chat On Your Notes Completed [Github Repository] [Npm Publish] [Video Demonstration]

Joplin Main Repository

Contribution Status Links
Pull Request Merged [Pull Request]
Pull Request Opened [Pull Request]
Pull Request Closed [Pull Request]
Pull Request Closed [Pull Request]

AI Notes Assistant Plugin

Contribution Status Links
Pull Request Merged [Pull Request]
Pull Request Merged [Pull Request]

Jarvis Plugin

Contribution Status Links
Pull Request Merged [Pull Request]
Pull Request Closed [Pull Request]
Pull Request Merged [Pull Request]

Contributions in Apache and FOSSASIA Organisations

Contribution Status Links
Pull Request (Complex Backend Token Storage Issue Solved) Merged [Pull Request]
Pull Request Merged [Pull Request]
Pull Request Opened [Pull Request]
Pull Request Opened [Pull Request]
Pull Request Opened [Pull Request]

Personal Projects Related to this Plugin

Project GitHub Links Tech Stack
RAG Pipeline [Frontend Repo] [Backend Repo] [Live Link] TypeScript, Node.js, Next.js, Clerk, LangchainJS, Qdrant Vector DB, Cohere
AI + Reddit Analysis Based Ecommerce Store [Github Repo] [Live Link] TypeScript, Node.js, Next.js, Reddit API, Gemini API, MongoDB
LeaderBoard Sphere [Frontend Repo] [Backend Repo] Next.js, Node.js, Redis, Kafka, Prisma, SocketIO

2. Introduction

I am Muhammad Zohaib Irshad, a mid-senior full-stack software engineer based in Islamabad, Pakistan. I am currently completing my Bachelor's degree in Software Engineering at Air University Islamabad.

Contact Information

Field Details
Name Muhammad Zohaib Irshad
Email zohaibirshad678@gmail.com
GitHub developerzohaib786 (Muhammad Zohaib Irshad) · GitHub
LinkedIn https://linkedin.com/in/developerzohaib
Address Islamabad, Pakistan
University Air University Islamabad
Degree Bachelors in Software Engineering

Programming Experience

Area Technologies
Frontend JavaScript, TypeScript, ReactJS, Next.js, HTML5, CSS3, Tailwindcss, Shadcn/UI
Backend Node.js, ExpressJS, NestJS, Prisma, JWT, Socket.io
Generative AI RAG System, LangchainJS, Qdrant Vector DB, Cohere API, Gemini API, Reddit API
Databases Vector Database (Qdrant), MongoDB, PostgreSQL, NeonDB, MySQL
System Design Redis, Kafka, BullMQ, Rate Limiting, Cache, Server Clustering

Tech Industry Experience

Company Role Timeline
SyncaAI Full Stack TypeScript Intern Jul 2025 – Sept 2025
Softechnova Enterprises MERN Stack Intern Jun 2025 – Jul 2025
SARTE Digital Marketing SEO Expert and WordPress Content Writer Oct 2023 – Sept 2024
Clients from Facebook, LinkedIn & WhatsApp MERN Stack & Next.js Developer 2023 – Present

Open Source Experience

I actively contribute to open source projects including Apache Polaris-Tools, Apache Doris, FOSSASIA, and Links-Hub. I understand how to navigate large established codebases, communicate through PRs, and follow project contribution standards. I have mentioned my open source work in the Pull Requests and Relevant Work section.

3. Project Summary

Problem Space:

As a Joplin user's note collection grows, it becomes increasingly disorganised. Notes accumulate without consistent tags, notebooks fill up with unrelated content, and rarely accessed notes get buried alongside frequently used ones. Manually reviewing and reorganising hundreds or thousands of notes is a task most users never complete: it is simply too time-consuming. The result is a knowledge base that reflects when notes were created rather than what they are actually about, making the overall collection harder to navigate over time.

Implementation Strategy:

The plugin embeds each note using BGE-small-en-v1.5, chosen for its 512-token context window and strong MTEB clustering benchmark scores. Notes are split into overlapping chunks, embedded, and averaged into a single note-level vector. Meaningful titles are blended into the note vector using cosine-similarity weighting; generic titles such as "Untitled" are filtered out before any weighting is applied.

Rather than clustering on raw 384-dimensional vectors, the plugin first applies UMAP via DruidJS to reduce vectors to 5 dimensions, separating topic clusters in a low-dimensional space where K-Means performs significantly better. The optimal K is selected using silhouette scoring across K values from 2 to √N, which is more reliable than the elbow method, whose bend cannot be detected programmatically.

Tag names are generated without sending note text to an LLM. TF-IDF first identifies cluster-specific terms, which are re-ranked by cosine similarity to the cluster centroid. Only the top five keywords per cluster are sent to the LLM, keeping the process privacy-preserving even when a cloud provider is used. All vectors are stored in a local SQLite database via joplin.require('sqlite3').

Archive candidates are scored across five signals: last edited date, edit count, content length, backlinks from other notes, and silhouette fit. This makes detection more accurate than checking a single timestamp. All suggestions are presented in a review panel before any change is applied. The plugin writes the pre-change state of every affected note to a categorisation_log table before applying, enabling one-click undo. Incremental sync uses the Joplin Events API cursor to catch changes from all devices.

Expected Outcome

A Joplin plugin that analyses the user's note collection using UMAP-enhanced clustering, discovers natural semantic categories, and presents tag and notebook suggestions in a review panel. If the user approves, the plugin applies those changes automatically: creating new tags, creating new notebooks, and moving notes. It never modifies any note without explicit confirmation. A one-click undo restores the full previous state from the categorisation log. The plugin supplements rather than replaces the user's existing organisational structure and works entirely offline by default, with cloud providers available as opt-in only.

4. Technical Approach

4.1 Architectural Justification: Decoupled Plugin Runtime vs. Core Integration

A plugin keeps Joplin's core lightweight, ships independently of the main release cycle, and touches zero core source code. It can be installed or removed without affecting the main application.

4.2 Comparative Analysis: Evolution Beyond Current LLM Baselines (Jarvis Case Study)

I have worked directly inside the Jarvis codebase, submitting PR #66 (Azure OpenAI support) and resolving Issue #18 (dedicated chatbox) via PR #69, which showed me clearly where Jarvis's boundaries are. In my own joplin-plugin-ai-chat-on-notes I built a multi-provider abstraction layer. I am applying both patterns here.

Jarvis operates on the currently open note with no batch embedding, no persistent vector index, and no clustering. This proposal builds that missing layer: embed every note, reduce dimensions with UMAP, discover semantic groupings through clustering, and surface them as actionable tag and notebook suggestions.

4.3 Technology Stack & Dependency Graph

Component Technology Details
Language TypeScript Consistent with Joplin's entire plugin ecosystem
Database sqlite3 via joplin.require('sqlite3') Officially supported. Zero setup, single file, all platforms
Vector storage BLOB column (Float32Array) Raw binary, no native extension required
Dimensionality reduction UMAP via DruidJS Pure JavaScript, IEEE-published, actively maintained
Clustering Pure JavaScript K-Means Runs entirely in-process, no native modules needed
Default embedding ONNX local (BGE-small-en-v1.5) ~23 MB, no API key required, fully offline
Category naming LLM OpenAI / Ollama Generates tag and notebook names from cluster summaries

4.4 Validation-First Design: Week 1–2 Technical Spike

The two highest technical risks are treated as hypotheses before the architecture is locked:

Risk 1: Local embedding inference: Can ONNX load and run cross-platform inside the Joplin plugin sandbox?

  • Preferred: ONNX Runtime Node.js
  • Fallback A: ONNX WASM
  • Fallback B: HTTP to Ollama or OpenAI

Risk 2: WASM memory degradation: Transformers.js WebAssembly memory grows during batch embedding and never releases, dropping throughput from ~47 notes/sec to ~2 notes/sec after 100 notes. Mitigation: recycle the worker process every 80–100 notes; embeddings already written to sqlite3 are never lost on recycling.
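The worker-recycling mitigation can be sketched as a batch loop that persists each vector immediately and replaces the embedding worker every N notes. This is an illustrative sketch only: `EmbedWorker`, `createWorker`, and `persist` stand in for the real Transformers.js worker setup and the sqlite3 write.

```typescript
// Hypothetical interface standing in for a Transformers.js embedding worker.
interface EmbedWorker {
  embed(text: string): number[];
  dispose(): void;
}

// Embed all notes, recycling the worker every `recycleEvery` notes so that
// WASM memory is reclaimed. Vectors are persisted as soon as they are
// computed, so nothing is lost when a worker is recycled.
async function embedAll(
  notes: string[],
  createWorker: () => EmbedWorker,
  recycleEvery: number,
  persist: (index: number, vec: number[]) => void,
): Promise<void> {
  let worker = createWorker();
  for (let i = 0; i < notes.length; i++) {
    persist(i, worker.embed(notes[i])); // write-through: survives recycling
    if ((i + 1) % recycleEvery === 0) {
      worker.dispose(); // recycle before WASM memory degrades throughput
      worker = createWorker();
    }
  }
  worker.dispose();
}
```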

4.5 Embedding Model Selection & Benchmarking (MTEB Analysis)

BGE-small-en-v1.5 is the default. Model selection was based specifically on MTEB clustering task scores, not the overall leaderboard, which averages across 8 task types and rewards retrieval quality irrelevant to this project. all-MiniLM-L6-v2 was considered but rejected because its 256-token limit silently truncates longer notes with no error or warning.

Model Dimensions Size Context Notes
BGE-small-en-v1.5 (ONNX) 384 ~23 MB 512 tokens Default. Highest reliable clustering score on MTEB
all-MiniLM-L6-v2 (ONNX) 384 ~33 MB 256 tokens Rejected as default, silently truncates at 256 tokens
nomic-embed-text (Ollama) 768 ~274 MB 8192 tokens For the Ollama HTTP path
text-embedding-3-small (OpenAI) 1536 Cloud 8191 tokens $0.02 per million tokens

4.6 System Workflow & Pipeline Stages

The plugin operates in four sequential phases: embedding, clustering, suggestion generation, and review and apply.

4.6.1 Phase I: Vector Ingestion & Multi-Stage Embedding

Notes are fetched via the Joplin Data API in paginated batches of 100. Each note is split into overlapping chunks (400 words, 50-word overlap) rather than by headings; this ensures the model never silently truncates content at a heading boundary. Each chunk is embedded using BGE-small-en-v1.5 and stored as a BLOB in sqlite3 alongside note ID, title, SHA-256 hash, and user_updated_time.

Meaningful titles are embedded separately and blended into the final note vector using cosine similarity weighting. Generic titles ("Untitled", "New Note", dates) are filtered before any weighting is applied. SHA-256 hashing ensures only modified notes are re-embedded on subsequent runs.
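The overlapping chunker described above can be sketched as a simple word-window splitter. This is a minimal illustration, not the plugin's actual code: the 400-word window and 50-word overlap come from the proposal, while the function name is hypothetical.

```typescript
// Split a note body into overlapping word windows. Each window is
// `chunkWords` long and advances by (chunkWords - overlapWords), so
// consecutive chunks share `overlapWords` words of context.
function chunkNote(body: string, chunkWords = 400, overlapWords = 50): string[] {
  const words = body.split(/\s+/).filter(w => w.length > 0);
  if (words.length === 0) return [];
  const chunks: string[] = [];
  const step = chunkWords - overlapWords; // each window advances by 350 words
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + chunkWords).join(' '));
    if (start + chunkWords >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

A 1,000-word note yields three chunks: words 0–399, 350–749, and 700–999, so no content is lost at window boundaries.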

4.6.2 Phase II: Dimensionality Reduction & Unsupervised Clustering

Once all notes are embedded, chunk vectors are averaged into a single note-level vector. Rather than clustering directly on raw 384-dimensional vectors, where the curse of dimensionality makes everything look equally distant, the plugin first applies UMAP via DruidJS to reduce each note vector to 5 dimensions. UMAP parameters follow BERTopic's recommended defaults for topic clustering:

Parameter Value Reason
n_neighbors 15 Balances local and global structure for 100–2000 note collections
n_components 5 Sweet spot: 2–3 loses too much, higher hurts K-Means
min_dist 0.0 Packs similar notes tightly for clean cluster boundaries
metric cosine Text embeddings should be compared by angle, not magnitude
random_state 42 Fixed seed: ensures consistent output across runs

K-Means then runs on the UMAP-reduced vectors. The optimal K is selected using silhouette scoring across K values from 2 to √N; the K with the highest average silhouette score is chosen. The elbow method was considered and rejected because detecting the bend automatically in code, without a human looking at the plot, is unreliable.
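The K-selection loop above can be sketched in pure TypeScript: run K-Means for each candidate K, compute the mean silhouette score, and keep the best. This is a toy sketch on 2-D points for brevity (the real pipeline runs on 5-D UMAP output with a deterministic seed); all names are illustrative.

```typescript
type Vec = number[];

const dist = (a: Vec, b: Vec) => Math.hypot(...a.map((x, i) => x - b[i]));

// Minimal K-Means with deterministic initialisation (toy version; the real
// plugin would seed from the fixed random_state described above).
function kmeans(points: Vec[], k: number, iters = 50): number[] {
  let centroids = Array.from({ length: k }, (_, i) =>
    points[Math.floor((i * points.length) / k)].slice());
  let labels: number[] = new Array(points.length).fill(0);
  for (let it = 0; it < iters; it++) {
    labels = points.map(p => {
      let best = 0;
      for (let c = 1; c < k; c++)
        if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
      return best;
    });
    centroids = centroids.map((c, ci) => {
      const members = points.filter((_, i) => labels[i] === ci);
      if (members.length === 0) return c; // keep old centroid for empty clusters
      return c.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return labels;
}

// Mean silhouette: for each point, a = mean distance to its own cluster,
// b = mean distance to the nearest other cluster; score = (b - a) / max(a, b).
function meanSilhouette(points: Vec[], labels: number[]): number {
  let total = 0;
  for (let i = 0; i < points.length; i++) {
    const byCluster = new Map<number, number[]>();
    for (let j = 0; j < points.length; j++) {
      if (j === i) continue;
      const arr = byCluster.get(labels[j]) || [];
      arr.push(dist(points[i], points[j]));
      byCluster.set(labels[j], arr);
    }
    const own = byCluster.get(labels[i]);
    if (!own || own.length === 0) continue; // singleton cluster contributes 0
    const avg = (xs: number[]) => xs.reduce((s, x) => s + x, 0) / xs.length;
    const a = avg(own);
    let b = Infinity;
    byCluster.forEach((ds, c) => { if (c !== labels[i]) b = Math.min(b, avg(ds)); });
    total += (b - a) / Math.max(a, b);
  }
  return total / points.length;
}

// Try K = 2..√N and return the K with the highest mean silhouette score.
function bestK(points: Vec[]): number {
  const maxK = Math.max(2, Math.floor(Math.sqrt(points.length)));
  let best = 2, bestScore = -Infinity;
  for (let k = 2; k <= maxK; k++) {
    const score = meanSilhouette(points, kmeans(points, k));
    if (score > bestScore) { bestScore = score; best = k; }
  }
  return best;
}
```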

Tag names are generated through a two-step pipeline. First, TF-IDF identifies terms that appear frequently in a cluster but not in others. Those terms are re-ranked by cosine similarity to the cluster centroid. Only the top five keywords per cluster are sent to the LLM, never actual note text, which keeps the process privacy-preserving even when a cloud provider is used.
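The two-step keyword pipeline can be sketched as follows. This is an illustrative sketch under simplifying assumptions: clusters are pre-tokenised word arrays, term vectors are supplied as a lookup (in the real pipeline they would come from the embedding model), and all names are hypothetical.

```typescript
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Step 1: TF-IDF at the cluster level -- how specific is each term to its
// cluster? Terms appearing in many clusters are down-weighted.
function clusterTfIdf(clusters: string[][]): Map<string, number>[] {
  const df = new Map<string, number>(); // number of clusters containing the term
  const tfs = clusters.map(words => {
    const tf = new Map<string, number>();
    words.forEach(w => tf.set(w, (tf.get(w) || 0) + 1));
    tf.forEach((_, w) => df.set(w, (df.get(w) || 0) + 1));
    return tf;
  });
  return tfs.map(tf => {
    const scored = new Map<string, number>();
    tf.forEach((f, w) =>
      scored.set(w, f * Math.log((1 + clusters.length) / (1 + (df.get(w) || 0)))));
    return scored;
  });
}

// Step 2: shortlist by TF-IDF, then re-rank by cosine similarity to the
// cluster centroid; only the top `top` keywords would go to the LLM.
function topKeywords(
  tfidf: Map<string, number>, termVecs: Map<string, Vec>, centroid: Vec, top = 5,
): string[] {
  const entries: [string, number][] = [];
  tfidf.forEach((score, w) => entries.push([w, score]));
  return entries
    .sort((a, b) => b[1] - a[1]).slice(0, 20)                       // TF-IDF shortlist
    .map(([w]) => [w, cosine(termVecs.get(w) || [], centroid)] as [string, number])
    .sort((a, b) => b[1] - a[1]).slice(0, top)                      // centroid re-rank
    .map(([w]) => w);
}
```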

4.6.3 Phase III: Multi-Signal Heuristics for Archive Detection

Archive candidates are scored using a five-signal staleness score rather than a single timestamp field, which is insufficient: a note untouched for two years but referenced by ten other notes is not a candidate for archiving.

Signal Weight Calculation
Last edited 0.30 days_since_edit / 365, capped at 1.0
Edit count 0.15 1 - min(edit_count, 10) / 10
Content Length 0.10 1.0 if under 100 characters and not a to-do, else 0.0
Backlinks 0.15 1.0 if no other note links to this one, else 0.0
Silhouette fit 0.30 1-max(individual_silhouette, 0): poor cluster fit scores high

Notes scoring above 0.6 appear in the archive suggestions section. The threshold is configurable in settings.
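The weighted score in the table above can be written directly as code. The weights and per-signal formulas mirror the table; the `NoteStats` shape and function name are hypothetical.

```typescript
// Input signals for one note (illustrative shape, not the plugin's schema).
interface NoteStats {
  daysSinceEdit: number;
  editCount: number;
  charCount: number;
  isTodo: boolean;
  hasBacklinks: boolean; // true if any other note links to this one
  silhouette: number;    // this note's individual silhouette score, -1..1
}

// Five-signal staleness score, 0..1; notes above the (configurable) 0.6
// threshold become archive suggestions.
function stalenessScore(n: NoteStats): number {
  const lastEdited = Math.min(n.daysSinceEdit / 365, 1.0);      // weight 0.30
  const editCount = 1 - Math.min(n.editCount, 10) / 10;         // weight 0.15
  const shortNote = n.charCount < 100 && !n.isTodo ? 1.0 : 0.0; // weight 0.10
  const orphan = n.hasBacklinks ? 0.0 : 1.0;                    // weight 0.15
  const poorFit = 1 - Math.max(n.silhouette, 0);                // weight 0.30
  return 0.30 * lastEdited + 0.15 * editCount + 0.10 * shortNote
       + 0.15 * orphan + 0.30 * poorFit;
}
```

A two-year-old orphaned stub with a poor cluster fit maxes out every signal and scores 1.0, while a heavily edited, well-linked, well-clustered note scores 0.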

Incremental Synchronization & Events API Hooking

onNoteChange() only fires for the currently selected note; it does not catch changes from other devices after a Joplin sync. The plugin hooks into three event sources:

  • onNoteChange(): immediate re-embedding of the currently edited note
  • onSyncComplete(): runs syncIndex() via the Events API cursor to catch changes from other devices
  • Periodic polling every 5 minutes as a fallback for anything that slipped through

UMAP and clustering only re-run when the user clicks Re-analyse or when more than 5% of the collection has changed; below that threshold the existing clustering remains accurate enough.
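The 5% gating rule can be expressed as a small predicate that the event handlers above would consult before triggering a full re-analysis. The function name and parameters are illustrative.

```typescript
// Decide whether enough of the collection has changed since the last full
// clustering run to justify re-running UMAP + K-Means. Below the threshold,
// changed notes are only re-embedded and tentatively assigned to existing
// clusters.
function shouldReanalyse(
  changedSinceLastRun: number,
  totalNotes: number,
  threshold = 0.05,
): boolean {
  if (totalNotes === 0) return false;
  return changedSinceLastRun / totalNotes > threshold;
}
```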

4.6.4 Phase IV: Suggestion Review Phase

No changes are ever applied without explicit user confirmation. The plugin presents all suggestions in a structured review panel with three sections:

  • Tag suggestions: each proposed new tag with the list of notes that would receive it
  • Notebook suggestions: each proposed new notebook with the notes that would be moved into it
  • Archive suggestions: notes flagged as rarely-accessed with a proposed move to an archive notebook

The user can accept all, reject all, or handle each suggestion individually. Before applying any accepted suggestion the plugin writes the original state of every affected note to the categorisation_log table. A one-click 'Undo last categorisation' button restores all affected notes to their previous state from the log.

4.7 Constraint Management: Handling Data API Rate Limits & Throttling

The Joplin Data API is a local REST service accessed through the joplin.data module. The maximum number of items returned per request is 100, controlled by the limit parameter. The plugin fetches notes in controlled batches of 100 per page. Critically, Joplin's plugin sandbox has no Web Worker API. To prevent UI freezing during embedding a batch-and-yield pattern is used: process 10 notes, then yield control back to the event loop. This keeps Joplin responsive throughout initial indexing.

Constraint Value Source
Max items per request 100 Official Joplin Data API docs
Pagination field has_more (boolean) Official Joplin Data API docs
Page parameter page (starts at 1) Official Joplin Data API docs
Web Worker API Not available in plugin sandbox Joplin plugin architecture
Mitigation Batch-and-yield event loop pattern
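The batch-and-yield pattern from the paragraph above can be sketched as follows. This is an illustrative sketch; `processNote` stands in for the per-note embedding work, and the batch size of 10 comes from the proposal.

```typescript
// Hand control back to the event loop so queued UI events can run.
const yieldToEventLoop = () => new Promise<void>(resolve => setTimeout(resolve, 0));

// Process notes in batches of `batchSize`, yielding between batches so the
// Joplin UI stays responsive even without Web Worker support in the sandbox.
async function indexNotes(
  notes: string[],
  processNote: (note: string) => void,
  batchSize = 10,
): Promise<void> {
  for (let i = 0; i < notes.length; i++) {
    processNote(notes[i]);
    if ((i + 1) % batchSize === 0) await yieldToEventLoop(); // let UI events run
  }
}
```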

4.8 Data Serialization: BLOB Binary Efficiency vs. JSON Overhead

The vector is stored as binary rather than a JSON string because binary can be deserialised directly back into a Float32Array in a single operation. During clustering every note vector is compared against every centroid on every iteration so the deserialisation cost multiplies significantly across large collections.

BLOB (Float32Array binary) JSON array
Speed Fast, one Buffer read, directly usable Slow, full JSON parse before every comparison
Human readable No Yes
Storage size ~1.5 KB per vector (384-dim local model) ~2.5 KB per vector
Used for Storing vectors in sqlite3 Not suitable for vector operations
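The BLOB round-trip above can be sketched with Node's Buffer and a typed-array view: 384 float32 values occupy exactly 384 × 4 = 1,536 bytes, matching the ~1.5 KB figure in the table. Helper names are illustrative; the read path copies into a fresh buffer so the Float32Array view is always 4-byte aligned.

```typescript
// Serialise a vector to the raw bytes stored in the sqlite3 BLOB column.
function vecToBlob(vec: Float32Array): Buffer {
  return Buffer.from(vec.buffer, vec.byteOffset, vec.byteLength);
}

// Deserialise in one copy, with no JSON parse on the clustering hot path.
function blobToVec(blob: Buffer): Float32Array {
  const out = new Float32Array(blob.byteLength / 4);
  Buffer.from(out.buffer).set(blob); // copy bytes into the aligned backing buffer
  return out;
}
```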

4.9 API Cost Estimate

The local ONNX path has zero cost. Cloud providers are opt-in only.

Collection size Avg chunks Local ONNX time OpenAI cost Storage (384-dim)
100 notes ~300 ~15 seconds ~$0.001 ~0.5 MB
1,000 notes ~3,000 ~2–3 minutes ~$0.01 ~5 MB
5,000 notes ~15,000 ~12–15 minutes ~$0.05 ~25 MB
10,000 notes ~30,000 ~25–30 minutes ~$0.10 ~50 MB

4.10 IPC Bridge Constraints: Secure Sandbox Data Access

Plugins cannot access the Joplin database directly. The Joplin database is an SQLite file managed exclusively by the Joplin core application. All data access goes through the joplin.data module via an IPC bridge between the plugin sandbox and the Joplin main process. This matters significantly for the apply phase of this plugin, which requires multiple sequential write operations: creating tags, creating notebooks, assigning tags to notes, and moving notes, each of which is a separate API call through the IPC bridge. The plugin batches these write operations and applies them sequentially with a small delay between calls. A progress indicator shows the user how many changes have been applied out of the total.

Access method Available to plugins Speed Notes
Direct SQLite file access No Very fast Reserved for Joplin core only
joplin.data REST API via IPC Yes Moderate Only supported method
Max items per request 100 items
Fields selection Yes Faster Use fields param to fetch only what is needed
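The sequential apply loop described above can be sketched as follows. This is a hedged sketch: `applyChange` stands in for the individual joplin.data calls, and the `Change` shape, delay value, and callback names are all illustrative.

```typescript
// One accepted suggestion becomes one write operation through the IPC bridge.
interface Change {
  kind: 'createTag' | 'createNotebook' | 'assignTag' | 'moveNote';
  payload: unknown;
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

// Apply accepted changes one at a time, reporting progress after each call
// and pausing briefly between calls to avoid flooding the IPC bridge.
async function applyAll(
  changes: Change[],
  applyChange: (c: Change) => Promise<void>,
  onProgress: (done: number, total: number) => void,
  delayMs = 50,
): Promise<void> {
  for (let i = 0; i < changes.length; i++) {
    await applyChange(changes[i]);
    onProgress(i + 1, changes.length);
    if (i < changes.length - 1) await sleep(delayMs);
  }
}
```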

4.11 Core Design Principles for End-to-End AI Systems

  • No Changes Without User Confirmation: Every tag assignment, notebook creation, and note move is shown in the review panel before anything is written. The plugin never modifies the user's note collection silently.
  • Reversibility: Before applying any accepted suggestion the plugin logs the original state of every affected note to the categorisation_log table. A one-click undo button restores all affected notes.
  • Incremental Processing via Change Detection: SHA-256 hashing ensures only modified notes are re-embedded. The Events API cursor tracks all changes including those from other devices after a Joplin sync.
  • Local-First Privacy: By default nothing leaves the machine. All embedding inference runs locally via ONNX. When a user switches to a remote provider the settings page shows a persistent warning.
  • Chunking with Overlap for Context Preservation: Notes are split into structure-aware overlapping chunks. The heading path is prepended to each chunk so that the embedding captures both the local content and the broader document context.
  • Provider Abstraction via Common Interface: The LLM provider sits behind a shared EmbeddingProvider interface. Switching between ONNX, Ollama, and OpenAI requires only a settings change.
  • Background-Safe Processing: Embedding and clustering use the batch-and-yield event loop pattern since the Joplin plugin sandbox has no Web Worker API. Joplin remains fully usable throughout.
  • Graceful Degradation: If the API is unreachable the plugin shows a clear error and waits for the user to fix configuration. If the ONNX model fails to load it falls back to the Ollama or OpenAI path.
  • Encrypted API Key Storage: API keys are stored using Joplin's settings API with secure: true, which uses the OS keychain where available. Keys are never stored in plaintext config files.

4.12 Error Handling and Edge Cases

  • API key invalid or expired: Before the first full embedding run the plugin sends a test embedding request. If it fails the user sees an error in settings immediately.
  • ONNX model fails to load: The plugin disables local inference, falls back to the Ollama or OpenAI path, and shows a clear message in settings suggesting the user switch providers.
  • Ollama not running: The plugin pings http://localhost:11434/api/tags on startup. If unreachable the settings panel shows a warning with a link to Ollama's installation instructions.

Very large note collections (5,000+ notes):

  • Live progress indicator: 'Embedding note 342 of 5,127...'
  • Batch-and-yield pattern: 10 notes then event loop yield
  • Cancel button: already-embedded chunks survive in sqlite3
  • Persistent partial progress: next startup picks up where it left off via source_hash comparison
  • Notes that fit no cluster cleanly: Notes with similarity to their assigned centroid below a configurable threshold are flagged as 'uncategorised' and excluded from suggestions.
  • Empty or very short notes: Notes producing fewer than 20 tokens are excluded from clustering and shown as 'too short to categorise' in a separate section of the review panel.
  • User rejects all suggestions: The plugin does not re-run automatically. The user must manually trigger a new analysis run, preventing the plugin from repeatedly suggesting changes already rejected.
  • sqlite3 database corruption: The plugin runs PRAGMA integrity_check on startup. If corruption is detected it offers a one-click 'Rebuild Analysis' button.
  • Non-English notes: The default ONNX model (BGE-small-en-v1.5) is trained on English text. Users with multilingual collections can switch to paraphrase-multilingual-MiniLM-L12-v2 via the model selector.
  • Accessibility: The review panel includes role='region', aria-label, and aria-live='polite'. Full keyboard navigation: Tab through suggestions, Enter to accept, Delete to reject, Escape to cancel.

4.13 First-Run Behaviour

Step What the user sees What the plugin does
Plugin installed, Joplin restarted Sidebar panel appears collapsed Waits — no indexing starts automatically
User opens sidebar Not Indexed state with Build Index button and estimated time Ready to start
User clicks Build Index Progress bar: 'Embedding note 342 of 1,247...' + Cancel button Batch-and-yield embedding. Joplin remains fully usable
Embedding complete 'Analysing your notes...' message K-Means clustering and LLM category naming runs
Analysis complete Review panel opens with tag, notebook, and archive suggestions All suggestions visible, nothing applied yet
User reviews and confirms Apply progress: 'Applied 12 of 47 changes' Creates tags, notebooks, moves notes via Data API
Apply complete 'Done. Undo last categorisation' button visible categorisation_log written for rollback
Subsequent launches Silent background sync message Events API cursor check — re-embeds only changed notes

4.14 Plugin Settings

Setting Type Default Description
Embedding Provider Dropdown local local / ollama / openai
API Endpoint String "" URL for Ollama or OpenAI. Hidden when local selected
API Key Secure String "" Stored via secure: true in OS keychain
Embedding Model String BGE-small-en-v1.5 Model identifier. Changes based on provider
Cluster Count String auto auto (silhouette scoring) or manual integer 3–30
Archive Threshold (months) Integer 12 Notes not edited in this many months are flagged
Privacy Disclosure Label Read-only warning shown when a remote provider is selected

4.15 UX Plan: Sidebar Panel States

State Display
Not Indexed 'Click Build Index to enable AI categorisation' with estimated time
Embedding 'Embedding note 342 of 1,200...' with progress bar and Cancel button
Clustering 'Analysing your notes, discovering categories...'
Review Three-section panel: tag suggestions, notebook suggestions, archive suggestions
Applying 'Applied 12 of 47 changes...' with progress indicator
Done Summary of applied changes + Undo button

5. Implementation Plan

350 hours · May 26 – August 23 · Mentors: HahaBill, shikuz

Week 1–2 · Validation Spike (~40 hrs)

  • Validate BGE-small-en-v1.5 loads and runs cross-platform via ONNX inside the plugin sandbox
  • Confirm WASM memory degradation behaviour and validate worker recycling every 80–100 notes as the fix
  • Validate sqlite3 BLOB storage and Float32Array round-trip
  • Build minimal PoC: embed a string → store → retrieve → runs on macOS, Windows, Linux
  • Share spike report with mentors before locking architecture

Week 3–4 · Note Ingestion & Embedding (~60 hrs)

  • Paginated note fetcher via Joplin Data API (100 notes/request, has_more loop)
  • Chunk notes (~400 words, 50-word overlap), embed with BGE-small-en-v1.5
  • Title vector blending with cosine similarity weighting; filter generic titles
  • SHA-256 change detection + user_updated_time stored per note
  • Batch-and-yield event loop pattern (10 notes + setTimeout(0))
  • Unit tests for chunker edge cases

Week 5–6 · UMAP, Clustering & Tag Generation (~60 hrs)

  • Average chunk vectors into note-level vectors
  • UMAP via DruidJS (n_neighbors=15, n_components=5, min_dist=0, cosine metric, random_state=42)
  • Silhouette scoring across K=2 to √N to select optimal K
  • K-Means on UMAP-reduced vectors
  • TF-IDF term extraction → re-rank by centroid cosine similarity → send top 5 keywords to LLM
  • Five-signal staleness score for archive detection
  • Integration tests on a sample note collection

Midterm (July 14–18) · Checkpoint

  • Working embedding + UMAP + clustering pipeline producing named tag suggestions in a basic panel

Week 7–8 · Suggestion Review UI (~60 hrs)

  • React sidebar panel: tag suggestions, notebook suggestions, archive suggestions sections
  • Per-suggestion accept / reject + accept all / reject all controls
  • Events API cursor sync (onNoteChange, onSyncComplete, 5-minute poll fallback)
  • Ollama and OpenAI provider adapters

Week 9–10 · Apply Logic & Rollback (~50 hrs)

  • Apply pipeline: create tags → create notebooks → assign tags → move notes via joplin.data
  • Write categorisation_log before every apply; one-click undo from log
  • Settings UI: provider dropdown, secure API key, privacy disclosure
  • Cluster centroid stored per notebook for new-note placement suggestions

Week 11–12 · Testing & Polish (~40 hrs)

  • Benchmark on 10,000+ note collections; confirm WASM recycling holds under load
  • Edge cases: empty notes, very short notes, multilingual notes, notes that fit no cluster
  • ARIA attributes and full keyboard navigation in review panel
  • End-to-end integration test on a real Joplin database

Final Phase (Aug 23 – Sep 1) · Documentation & Submission (~40 hrs)

  • README: installation, configuration, privacy model, architecture overview
  • Architecture documentation for future contributors
  • Demo screencast: full suggest → review → apply → undo flow
  • Final code review with mentors; submit to Joplin plugin marketplace

Stretch Buffer (~20 hrs)

  • Cross-encoder reranking for cluster quality improvement
  • Additional LLM provider adapters

6. Deliverables

At the end of the GSoC period the following will exist as working, tested, and documented outputs. Required items represent the minimum successful outcome. Optional items will be completed if time permits.

Core Plugin

Deliverable Description Type
Joplin plugin package Installable .jpl plugin published to the Joplin plugin marketplace Required
Plugin settings panel Provider dropdown, secure API key, model selector, cluster count, archive threshold, privacy disclosure Required
Categorisation sidebar panel React-based panel covering all five UX states from Not Indexed through Done Required

Embedding Pipeline

Deliverable Description Type
Validation spike report Cross-platform test of ONNX runtime and sqlite3 BLOB round-trip shared with mentors Required
Paginated note fetcher Fetches all notes via Joplin Data API with full pagination and Events API cursor sync Required
Structure-aware chunker Splits into ~400-word chunks with 50-word overlap, heading path prepended Required
ONNX local embedding adapter BGE-small-en-v1.5 bundled with plugin — no API key required Required
Ollama provider adapter Local HTTP — no data sent to cloud Required
OpenAI provider adapter Cloud — API key stored via secure: true Required
sqlite3 vector store BLOB schema with source_hash, user_updated_time, and categorisation_log tables Required
Batch-and-yield indexing Event loop yield pattern keeping Joplin responsive during embedding Required

Clustering & Suggestion Engine

Deliverable Description Type
Pure JavaScript K-Means Clusters note vectors entirely in-process Required
Automatic K selection Silhouette scoring across K = 2 to √N determines optimal cluster count Required
LLM category naming Sends cluster summaries to LLM to generate tag and notebook names Required
Archive detection Identifies rarely-accessed notes using the five-signal staleness score Required
Hierarchical agglomerative clustering Alternative algorithm for better quality on small collections Optional

Review & Apply

Deliverable Description Type
Suggestion review panel Three sections: tag suggestions, notebook suggestions, archive suggestions Required
Per-suggestion controls Accept and reject each suggestion individually with keyboard support Required
Apply pipeline Creates tags, creates notebooks, assigns tags, moves notes via Joplin Data API Required
categorisation_log table Stores pre-change state of every affected note before applying Required
One-click undo Restores all affected notes to their state before the last apply Required

Quality & Documentation

Deliverable Description Type
Unit test suite Tests for chunker, K-Means, silhouette K selection, archive detection Required
Integration tests End-to-end tests on a real Joplin database with a sample note collection Required
User documentation Setup guide, privacy FAQ, configuration reference Required
Architecture documentation Technical documentation for future contributors Required
Demo screencast Recording showing full suggest-review-apply-undo flow Required
npm package Core clustering and embedding logic as a standalone npm package Optional

7. Availability

I am fully available for the entire GSoC 2026 coding period with no competing employment, internship, or academic commitments. I treat GSoC as a full-time engagement. If I encounter a blocker I will raise it on the forum or Discord the same day rather than waiting. I will maintain a public weekly progress post so mentors and the community can track progress and give feedback at every stage of the project.

Item Details
Weekly availability 40–45 hours per week during the coding period
Time zone PKT — UTC+5 (Islamabad, Pakistan)
Mentor overlap Morning PKT overlaps with European business hours, allowing daily async communication with mentors HahaBill and shikuz
Communication style Weekly async progress report posted to the Joplin forum every Monday. Weekly 30-minute video sync with mentor. Daily availability for async communication with same-day responses. All code submitted as early draft PRs for incremental review.
Other commitments No other employment, internship, or GSoC applications. University summer schedule is free of coursework obligations
Known absences None currently planned. Any unavoidable absence communicated to mentors at least one week in advance
Blockers Surfaced within 24 hours. If stuck, mentors will know the same day

@shikuz @HahaBill @malekhavasi I have added the pull requests section and am now eagerly awaiting reviews on my draft proposal from the possible mentors of this project.

Sorry, I accidentally posted this in the other proposal thread first.

The model comparison table has BGE-small-en-v1.5 at 256 tokens and all-MiniLM-L6-v2 at 512. I think those are swapped. Since the context window drives your chunking decisions, does the model choice change if the specs are reversed?

sqlite3 works on desktop but isn't available on Joplin mobile (at the moment). Is mobile out of scope, or have you thought about a storage path that works on both?

The proposal covers incremental re-embedding via the Events API, but what happens to the clusters when a user creates a new note? Does the whole UMAP + K-Means pipeline re-run, or is there a lighter path?

Have you tested the clustering pipeline on a real note collection? Curious what the clusters looked like.

Thanks for the review @shikuz!

1. Context window table: yes, those are swapped. Correct values: BGE-small-en-v1.5 → 512 tokens, all-MiniLM-L6-v2 → 256 tokens (silently truncates).

2. Mobile: explicitly out of scope. The pipeline depends on Node.js native modules (sqlite3, ONNX Runtime) unavailable in the mobile sandbox. The vector store sits behind an abstraction layer, though, so a future contributor could swap sqlite3 for sql.js without touching embedding or clustering logic.

3. New note: no full re-run

  1. onNoteChange() fires → note embedded, vector saved to sqlite3 (< 1 second)
  2. New vector compared against stored centroids → tentative assignment, no re-clustering
  3. Full re-analysis only triggers on manual Re-analyse click, or when 5%+ of collection has changed
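Step 2 above (tentative assignment without re-clustering) can be sketched as a nearest-centroid lookup. This is a toy sketch with illustrative names; the real plugin would read the stored centroids from sqlite3.

```typescript
function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

// Tentatively assign a freshly embedded note to the most similar stored
// cluster centroid, without re-running UMAP or K-Means.
function tentativeCluster(noteVec: number[], centroids: number[][]): number {
  let best = 0;
  for (let c = 1; c < centroids.length; c++)
    if (cosineSim(noteVec, centroids[c]) > cosineSim(noteVec, centroids[best])) best = c;
  return best;
}
```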

4. Real collection test: Yes. Built a working Joplin plugin prototype with embedded clustering pipeline. The implementation validates the core architecture before potential production scaling.

Demo

There is a 10 MB video upload limit, so I have uploaded only the last part. Please see the full demo video at

data.json (100 notes) 
  ↓
Embedding extraction (BGE-small-en-v1.5 via Transformers.js in Web Worker)
  ↓
Optional dimensionality reduction (UMAP: 384-dim → 5-dim for tighter separation)
  ↓
K-Means clustering (K=2 to adaptive max)
  ↓
Silhouette scoring (automatic K selection without manual inspection)
  ↓
Final clustering + Benchmark UI (sidebar visualization with metrics)

Repository of my clustering-phase testing (with dummy data, not real notes) :slight_smile:

Will push the corrected proposal with the table fix shortly.

Phase # 1 System Architecture

Phase # 2 System Architecture

I also want @HahaBill to see my work :blush:

Thank you for your proposal, I had a look :slight_smile: