GSoC 2026 Proposal Draft - Idea 3: AI-based categorisation - Justin Charles


Links:

GitHub profile: justin212407 (Justin Charles)
Contributions to Joplin:

  • #14547: Fix ++insert++ syntax rendering in Markdown (Merged)
  • #14563: Prevent unclosed frontmatter from breaking Markdown rendering (Merged)
  • #14626: Add plugin website link to /help (Merged)
  • #14634: Application crashes when deleting a notebook (Closed, but considered for GSoC)
  • #14674: Hide new note/to-do buttons when no notebook exists (Merged)
  • #14692: Always require password confirmation when changing master password after encryption (Merged)
  • #14824: Add warning dialog before JEX import to prevent note duplication (Approved)

1. Introduction

I am Justin Charles, a B.Tech Information Science Engineering student with a strong interest in full-stack systems and distributed architectures. I am a year-long contributor at Sugar Labs and a C4GT DMP 2025 contributor, where I have worked on the Music Blocks v4 project. Through this, I have gained experience collaborating on large open source codebases and understanding project architecture.


2. Project Summary

Joplin users who maintain large note collections over months and years face a compounding organizational problem: tags become inconsistent, notes land in the wrong notebooks, and hundreds of old notes silently accumulate without ever being revisited. No current Joplin feature addresses this systematically.

This project will build a privacy-first AI categorization plugin that analyses a user's entire note collection, discovers semantic groupings, and proposes concrete organizational actions (creating tags, filing notes into notebooks, and surfacing archive candidates) that the user reviews and approves before anything changes. The system is designed around a strict suggest -> confirm -> apply contract: the AI never modifies a note without explicit user permission.

The implementation follows a plugin-first architecture, leveraging joplin.data for all read/write operations against the Joplin internal REST layer (packages/lib/services/rest/routes/). The AI pipeline defaults to a fully local Ollama backend for privacy, with a transformers.js fallback that requires no external dependencies, and an opt-in path for users who wish to use frontier model APIs. A phased timeline ensures a working MVP by Week 6, with the remaining weeks dedicated to polish, edge cases, and documentation.


3. Technical Approach and Implementation

Deliverable 1 - Core Plugin Infrastructure & Incremental Indexing Pipeline

What it is

The foundational layer of the entire system. This deliverable establishes the plugin scaffold, the paginated note fetching mechanism, the hash-based incremental index, and the persistence layer that stores per-note embedding metadata between sessions. Nothing else in the project works without this.

Approach in depth

Plugin scaffold is generated with the generator-joplin Yeoman template and structured with three top-level modules: indexer, analyser, and ui. The plugin registers a toolbar command and a panel in its onStart handler.

Paginated note fetching uses joplin.data.get with limit: 100 and page increments until has_more is false. Fields fetched are id, title, body, parent_id, updated_time. This mirrors the internal REST route at packages/lib/services/rest/routes/notes.ts.
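
To make the loop concrete, here is a minimal sketch of the paginated fetch; the NoteStub interface just names the requested fields and is illustrative only:

```ts
import joplin from 'api';

// Shape of the fields requested from the data API (illustrative only).
interface NoteStub {
  id: string;
  title: string;
  body: string;
  parent_id: string;
  updated_time: number;
}

async function fetchAllNotes(): Promise<NoteStub[]> {
  const notes: NoteStub[] = [];
  let page = 1;
  while (true) {
    const response = await joplin.data.get(['notes'], {
      fields: ['id', 'title', 'body', 'parent_id', 'updated_time'],
      limit: 100,
      page,
    });
    notes.push(...response.items);
    if (!response.has_more) break;
    page += 1;
  }
  return notes;
}
```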

Hash-based incremental indexing avoids re-embedding notes that haven't changed. On each run, the plugin computes an MD5 hash of each note's body and compares it against a stored hash. Only changed or new notes are sent to the embedding backend. This keeps the plugin fast and non-intrusive on large vaults.
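
A sketch of the change check, assuming the previously stored hashes have been loaded into a map (NoteStub as in the earlier sketch):

```ts
import * as crypto from 'crypto';

function bodyHash(note: NoteStub): string {
  return crypto.createHash('md5').update(note.body).digest('hex');
}

// Only notes whose body hash differs from the stored one need re-embedding.
function changedNotes(notes: NoteStub[], storedHashes: Map<string, string>): NoteStub[] {
  return notes.filter(note => storedHashes.get(note.id) !== bodyHash(note));
}
```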

Persistence layer stores all index metadata in Joplin plugin settings using namespaced keys (index.hash.<noteId>, index.embedding.<noteId>, index.clusterId.<noteId>). This avoids touching note userData which would trigger unnecessary sync events.

Trigger mechanism hooks into joplin.workspace.onSyncComplete so the index refreshes automatically after every sync, and exposes a manual "Analyse Notes" command for on-demand runs.

Deliverable 2 - AI Embedding & Analysis Engine

What it is

The intelligence core of the plugin. This deliverable implements the three-tier embedding backend, the HDBSCAN clustering algorithm, LLM-based cluster label generation, and the centroid classifier for mapping notes to existing notebooks. This is where the raw note text becomes actionable organizational suggestions.

Approach in depth

Three-tier embedding backend is selected via a plugin settings dropdown. All three tiers produce the same output - a float32[] vector - so the rest of the pipeline is backend-agnostic.
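
Because every tier returns the same vector type, the rest of the pipeline can be written against a small interface. A sketch (the interface name is an assumption), together with the cosine similarity measure used throughout:

```ts
interface EmbeddingBackend {
  readonly name: 'ollama' | 'transformers.js' | 'external-api';
  embed(text: string): Promise<Float32Array>;
}

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: Float32Array, b: Float32Array): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```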

Confidence thresholds control what gets surfaced (a small bucketing helper is sketched after the list):

  • similarity ≥ 0.85 -> high confidence, shown first in UI
  • 0.6 ≤ similarity < 0.85 -> medium confidence, shown with a warning badge
  • similarity < 0.6 -> noise point, moved to Uncategorized holding area
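
Expressed as a small helper (a sketch; the type and function names are assumptions):

```ts
type Confidence = 'high' | 'medium' | 'noise';

function confidenceOf(similarity: number): Confidence {
  if (similarity >= 0.85) return 'high';
  if (similarity >= 0.6) return 'medium';
  return 'noise'; // routed to the Uncategorised holding area
}
```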

Deliverable 3 - Intelligent Categorisation System (Tagging + Notebook Filing)

What it is

The layer that translates AI analysis output into concrete, validated Joplin actions. This deliverable implements the Command Dispatcher, the full LLM-to-API mapping, the duplicate-prevention logic, and the rollback mechanism that makes every action reversible within a session.

Approach in depth

Command Dispatcher is the critical bridge between LLM output and Joplin's API. The LLM is prompted to return a strictly typed JSON array. The dispatcher validates each command against a TypeScript schema before execution and refuses to run any command that fails validation. Duplicate prevention checks existing tags and notebooks before creating new ones, which stops the plugin from creating "machine-learning", "Machine Learning", and "machine learning" as three separate tags.
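
A sketch of what the command shape and validation could look like; the concrete schema will be settled during implementation, so the action names below are assumptions:

```ts
type Command =
  | { action: 'createTag'; name: string }
  | { action: 'tagNote'; noteId: string; tagTitle: string }
  | { action: 'moveNote'; noteId: string; notebookId: string };

function isValidCommand(cmd: unknown): cmd is Command {
  const c = cmd as any;
  switch (c?.action) {
    case 'createTag': return typeof c.name === 'string';
    case 'tagNote': return typeof c.noteId === 'string' && typeof c.tagTitle === 'string';
    case 'moveNote': return typeof c.noteId === 'string' && typeof c.notebookId === 'string';
    default: return false;
  }
}

// Case- and separator-insensitive key used for duplicate tag checks.
function tagKey(name: string): string {
  return name.trim().toLowerCase().replace(/[\s_]+/g, '-');
}
```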

Snapshot-based rollback captures the current state of every note that will be affected before dispatching any commands. If the user clicks "Undo" within the same session, the dispatcher replays the snapshot in reverse.
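
A sketch of the snapshot shape and undo replay (the shapes are assumptions; tag restoration would diff the current and snapshotted tag sets in the same way):

```ts
interface NoteSnapshot {
  noteId: string;
  parentId: string;  // notebook before the change
  tagIds: string[];  // tags before the change
}

// Replays snapshots in reverse order, restoring each note's notebook.
async function undo(snapshots: NoteSnapshot[]): Promise<void> {
  for (const snap of snapshots.slice().reverse()) {
    await joplin.data.put(['notes', snap.noteId], null, { parent_id: snap.parentId });
  }
}
```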

Agentic mode (stretch goal, opt-in) allows users with API keys to schedule automatic categorization runs. In agentic mode, high-confidence suggestions (≥ 0.92) are applied without UI review, and a summary notification is shown afterward. This is disabled by default and requires explicit user opt-in in settings.

Deliverable 4 - Review & Apply UI (React Panel)

What it is

The user-facing surface of the entire system. A React-based sidebar panel built with joplin.views.panels.create() that presents all AI suggestions grouped by type, lets the user approve or reject them individually or in bulk, and triggers the Command Dispatcher on confirmation.

Approach in depth

Panel architecture uses Joplin's panel API with joplin.views.panels.onMessage for two-way communication between the React webview and the plugin main process. The panel posts messages to the plugin, which executes joplin.data calls and responds with results.
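
A sketch of the main-process side of that exchange (the message names and the analyser/dispatcher objects are hypothetical):

```ts
const panel = await joplin.views.panels.create('aiCategorisationPanel');

await joplin.views.panels.onMessage(panel, async (message: any) => {
  switch (message.type) {
    case 'requestSuggestions':
      return analyser.latestSuggestions();       // hypothetical analyser module
    case 'applyCommands':
      return dispatcher.apply(message.commands); // hypothetical dispatcher module
  }
});
```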

Three-tab layout:

  • Tags tab — groups suggestions by proposed tag. Shows note count per tag, a confidence badge, and an expandable list of affected notes. Users can approve all notes for a tag, or deselect individual notes before applying.
  • Notebooks tab — shows proposed note moves with a before → after path display (Work > General → Work > Machine Learning). Users approve or reject per note.
  • Archive tab — shows cold notes sorted by last activity, with last_viewed and updated_time displayed. Users can select notes to archive in bulk.

State management uses React useReducer with an actions: Command[] array and a selected: Set<string> for tracking which suggestions are approved. The "Apply Selected" button is disabled until at least one suggestion is selected.
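
A sketch of that reducer (event names are assumptions; Command as in the dispatcher sketch):

```ts
interface PanelState {
  actions: Command[];    // suggestions from the analysis engine
  selected: Set<string>; // ids of approved suggestions
}

type PanelEvent =
  | { type: 'toggle'; id: string }
  | { type: 'selectAll'; ids: string[] }
  | { type: 'clear' };

function reducer(state: PanelState, event: PanelEvent): PanelState {
  switch (event.type) {
    case 'toggle': {
      const selected = new Set(state.selected);
      if (selected.has(event.id)) selected.delete(event.id);
      else selected.add(event.id);
      return { ...state, selected };
    }
    case 'selectAll':
      return { ...state, selected: new Set(event.ids) };
    case 'clear':
      return { ...state, selected: new Set() };
  }
}
```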

"Never suggest again" rule — users can right-click any suggestion and mark it as permanently ignored. The plugin stores ignored note-tag pairs in settings and filters them from all future runs.

Deliverable 5 - Archive Discovery, Settings, and Documentation

What it is

The final production-ready layer. This deliverable implements the cold note detection system, the complete settings UI, performance optimization for large vaults (2000+ notes), and full technical documentation.

Approach in depth

Cold note detection requires solving a gap in Joplin's data model: there is no native last_viewed_time field. The plugin implements a lightweight tracking mechanism using joplin.workspace.onNoteSelectionChange. This runs silently in the background from the moment the plugin is installed, building up a last_viewed record per note stored entirely in plugin settings — not in note userData, which avoids triggering spurious sync events.
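
A sketch of the tracking hook (the in-memory map would be flushed periodically to the plugin's storage):

```ts
// noteId -> last time the user opened the note.
const lastViewed = new Map<string, number>();

await joplin.workspace.onNoteSelectionChange(async () => {
  const note = await joplin.workspace.selectedNote();
  if (note) lastViewed.set(note.id, Date.now());
});
```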

Archive criteria — all three conditions must be true (a predicate implementing them is sketched after the list):

  • Date.now() - lastViewed[noteId] > viewThreshold (default: 180 days)
  • Date.now() - note.updated_time > updateThreshold (default: 90 days)
  • Note is not in a notebook the user has marked as "exclude from archive suggestions"
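
The criteria as a predicate (a sketch; the threshold parameter names are assumptions, NoteStub as in the earlier sketch):

```ts
const DAY_MS = 24 * 60 * 60 * 1000;

function isArchiveCandidate(
  note: NoteStub,
  lastViewed: Map<string, number>,
  excludedNotebooks: Set<string>,
  viewThresholdDays = 180,
  updateThresholdDays = 90,
): boolean {
  const now = Date.now();
  const viewed = lastViewed.get(note.id) ?? 0; // never viewed counts as cold
  return now - viewed > viewThresholdDays * DAY_MS
    && now - note.updated_time > updateThresholdDays * DAY_MS
    && !excludedNotebooks.has(note.parent_id);
}
```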

Privacy settings UI presents three clearly labelled backend options with a data-handling disclosure:

| Setting | Default | Description |
| --- | --- | --- |
| AI Backend | Ollama (local) | Where embeddings are computed |
| External API Key | Empty / disabled | Only visible if Tier 3 is selected |
| Archive threshold | 180 days | How long before a note is considered cold |
| Exclude notebooks | None | Notebooks never suggested for archive |
| Agentic mode | Off | Auto-apply high-confidence suggestions |

A permanent notice in the settings panel reads: "Your note content never leaves your device unless you choose an external AI provider." The external API option is greyed out until the user clicks an acknowledgement checkbox.

Performance optimization for large vaults: the indexer runs note fetching and embedding in batches of 20, with a 500ms yield between batches using setTimeout to avoid blocking the Joplin UI thread. On a 2000-note vault, initial indexing completes in approximately 8-12 minutes with Ollama. Subsequent runs (incremental only) complete in under 60 seconds for typical daily note volumes.
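
A sketch of the batching loop (EmbeddingBackend and NoteStub as in the earlier sketches):

```ts
async function embedInBatches(
  notes: NoteStub[],
  backend: EmbeddingBackend,
  batchSize = 20,
  pauseMs = 500,
): Promise<Map<string, Float32Array>> {
  const vectors = new Map<string, Float32Array>();
  for (let i = 0; i < notes.length; i += batchSize) {
    for (const note of notes.slice(i, i + batchSize)) {
      vectors.set(note.id, await backend.embed(note.body));
    }
    // Yield between batches so the Joplin UI stays responsive.
    await new Promise(resolve => setTimeout(resolve, pauseMs));
  }
  return vectors;
}
```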



4. Timeline


| Phase | Weeks | Work |
| --- | --- | --- |
| Community Bonding | --- | Study Joplin plugin APIs (joplin.data, panels), finalize architecture (indexer, analyser, UI), validate incremental indexing approach with mentors |
| Foundation | 1-2 | Plugin setup, paginated note fetching, hash-based incremental indexing, persistence in settings |
| Embeddings Pipeline | 3-4 | Integrate Ollama + transformers.js, generate/store embeddings, batching for performance |
| Clustering + MVP | 5-6 | HDBSCAN clustering, confidence thresholds, basic tag/notebook suggestions (MVP ready) |
| Intelligence Layer | 7-8 | LLM tag generation, confidence scoring, Command Dispatcher, duplicate prevention, rollback system |
| UI Development | 9-10 | React panel, Tags/Notebooks/Archive tabs, suggestion selection + apply workflow |
| Polish & Finalization | 11-12 | Cold note detection, settings UI, performance optimization, testing, documentation |

5. Availability

Weekly availability during GSoC

I can dedicate approximately 40 hours per week on average to the project throughout the GSoC period.

Timezone

India Standard Time (IST) (GMT+5:30).

Other commitments during the programme

I do not have any exams, internships, or other major commitments during the GSoC period. I will prioritize this project fully and ensure consistent progress. I will maintain regular communication with mentors through the Joplin forum and Discord, and provide structured weekly updates on progress.


I have also been implementing parts of this to get a better sense of what similarity threshold we would need and how many notes we can realistically index and categorize. So I created a simple prototype that categorizes a few notes into a few clusters. Here is the working demo of the same:

While working on this I ran into a minor but important issue: when forming clusters, Ollama (which I used to generate the embeddings for clustering) was unable to assign very short notes (<10-15 characters) to any cluster. I kept the threshold at >0.60, but as an experiment I lowered it to 0.50, and the short notes were then grouped into a single cluster even though their contents were entirely different.

@HahaBill @shikuz What would your take on tackling this edge case be?

I have worked on the following POC: justin212407/joplin-plugin-ai-categorization (GitHub).
There I have run embeddings through Ollama, obtained 768-dimension vectors, clustered notes, and written the output back to Joplin.
The similarity threshold of 0.65 is empirically calibrated for nomic-embed-text. We can fine-tune this by sweeping values across [0.50, 0.80] and measuring cluster quality using a simplified silhouette score on a test vault.

It would be great if you could take a deeper look at the POC and share any suggestions you might have.

Hi! Thank you for your proposal, we really appreciate that! I have a few questions:

  • I like the idea that you’re thinking about re-embedding notes and having a mechanism to track the drift. Could you elaborate on the re-embedding mechanism? Do you think storing embeddings in Joplin plugin settings makes sense?
  • How do you implement HDBSCAN? Are you using some pre-existing libraries?
  • What is the confidence threshold based on?
  1. The re-embedding mechanism works as follows: on each run, the plugin computes an MD5 hash of each note's title + body and compares it against the stored hash from the previous run. Only notes where the hash has changed are sent to the embedding backend; unchanged notes reuse their cached vector. This keeps incremental runs fast even on large vaults.
    Regarding storage in plugin settings: I considered this, but I have a concern I'd like your input on. Plugin settings in Joplin use a key-value store, which works well for small metadata (hashes, cluster assignments, timestamps) but becomes a bottleneck for raw embedding vectors. A 768-dimension float32 vector is around 3KB per note. At 1000 notes that comes to around 3MB of JSON in settings, which would cause slow reads and writes on every sync cycle.
    My current approach stores lightweight metadata (note ID, body hash, cluster ID, last embedded timestamp) in plugin settings, and writes the raw float32 vectors as binary files in joplin.plugins.dataDir() using fs-extra (a sketch follows below). This keeps settings lean while giving fast vector access. I'm also aware of the forum discussion about userData as a potential shared storage mechanism; I've designed the storage layer behind an interface so it can migrate to whatever shared infrastructure approach we decide upon. Happy to rethink this if you see a better pattern.
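
A sketch of that binary vector store (the one-file-per-note layout is an assumption):

```ts
import * as fs from 'fs-extra';
import * as path from 'path';

async function saveVector(noteId: string, vec: Float32Array): Promise<void> {
  const dir = await joplin.plugins.dataDir();
  const buf = Buffer.from(vec.buffer, vec.byteOffset, vec.byteLength);
  await fs.writeFile(path.join(dir, `${noteId}.vec`), buf);
}

async function loadVector(noteId: string): Promise<Float32Array | null> {
  const dir = await joplin.plugins.dataDir();
  const file = path.join(dir, `${noteId}.vec`);
  if (!(await fs.pathExists(file))) return null;
  const buf = await fs.readFile(file);
  // Copy into a fresh ArrayBuffer so the float32 view is correctly aligned.
  const ab = buf.buffer.slice(buf.byteOffset, buf.byteOffset + buf.byteLength);
  return new Float32Array(ab);
}
```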
  2. For the POC I implemented a simpler greedy cosine similarity clustering to validate the pipeline end-to-end — which is what the demo shows. For the full GSoC implementation I plan to use the hdbscan npm package (or port the core algorithm to TypeScript if the package has compatibility issues with the plugin sandbox). The reason I chose HDBSCAN over K-Means or centroid-only approaches is that it doesn't require specifying K upfront — critical for a cold-start vault where the number of topics is unknown. It also naturally produces noise points (genuinely unique notes that don't belong to any cluster), which I surface in an "Uncategorised" holding area rather than force-assigning them. The key parameters are min_cluster_size (default 3: a cluster needs at least 3 notes to be suggested as a tag) and min_samples (controls how conservative the density estimation is). These will be empirically calibrated during development.
    If the HDBSCAN library proves incompatible with the plugin sandbox, the fallback is a centroid-based classifier for existing notebooks (where centroids are stable) combined with a simpler density clustering for discovery of new categories. I would flag this risk early in the project and discuss it with mentors.
  3. The confidence threshold depends on the specific embedding model in use. During POC development I discovered that nomic-embed-text (768 dimensions) compresses cosine similarities into a narrow range: even semantically unrelated notes produce similarity scores above 0.50. This matches the behavior documented for models trained with contrastive learning, where the effective decision space is compressed into roughly [0.55, 1.0] rather than the full [-1, 1] range. From my POC experiments:
  • Threshold 0.75 -> only 1 cluster (too tight: travel notes cluster, but categories like ML and Cooking don't separate)
  • Threshold 0.60 -> 1-2 clusters (still too tight for short bodies)
  • Threshold 0.65 -> 3 clean clusters on notes with rich bodies (empirically optimal for nomic-embed-text)

The threshold is therefore model-dependent. We can handle this as follows:

  1. Ship with model-specific defaults (0.65 for nomic-embed-text, calibrated separately for bge-small and MiniLM)
  2. Compute a simplified silhouette score across threshold values [0.50, 0.80] on the user's actual vault during first-run calibration and select the threshold maximising cluster separation (a sketch of the metric follows below)
  3. Expose the threshold as an advanced setting for users who want manual control (this also gives users more control over how their note collections are organised)
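
For reference, this is how I read "simplified silhouette" in this context (my sketch, not the POC's exact code): for each note, compare its mean similarity to its own cluster against its best mean similarity to any other cluster, and average the differences; higher means better separation.

```ts
// clusters: one array of embedding vectors per cluster; sim: e.g. cosine similarity.
function simplifiedSilhouette(
  clusters: Float32Array[][],
  sim: (a: Float32Array, b: Float32Array) => number,
): number {
  if (clusters.length < 2) return 0;
  const mean = (v: Float32Array, group: Float32Array[]) =>
    group.length === 0 ? 0 : group.reduce((s, o) => s + sim(v, o), 0) / group.length;
  const scores: number[] = [];
  clusters.forEach((cluster, i) => {
    for (const v of cluster) {
      const a = mean(v, cluster.filter(o => o !== v)); // cohesion with own cluster
      const b = Math.max(...clusters
        .filter((_, j) => j !== i)
        .map(other => mean(v, other)));                // nearest other cluster
      scores.push(a - b);
    }
  });
  return scores.reduce((s, x) => s + x, 0) / scores.length;
}
```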

Hi @HahaBill I've made significant updates to the POC and wanted to share the progress.

Following your feedback and the forum discussion around shared infrastructure, I've extended the plugin with several additions:

What the POC now demonstrates end-to-end inside Joplin:

  • Paginated note fetching via joplin.data.get
  • Ollama embeddings (nomic-embed-text, 768d confirmed)
  • Greedy cosine similarity clustering with configurable threshold
  • Simplified silhouette score to measure cluster quality objectively; this lets the system self-report whether its own clustering is STRONG, MODERATE, WEAK, or POOR, and will drive automatic threshold calibration in the final implementation
  • KNN search timing (k=5 over 8 vectors: 1ms)
  • LLM cluster labelling via llama3
  • Performance projections based on measured throughput (7.6 notes/sec on Ollama → ~2.2 min for 1000 notes)
  • All output written back to a Joplin note via joplin.data.put

Key empirical finding from testing: nomic-embed-text compresses cosine similarities into a narrow range — ML notes peaked at 0.666 similarity with each other, while cross-topic similarities dropped sharply to 0.45-0.49. This confirms the threshold must be model-specific and corpus-calibrated, not hardcoded. At threshold 0.55, the silhouette score reaches STRONG with clean 3-cluster separation (ML / cooking / travel).

GitHub: justin212407/joplin-plugin-ai-categorization

Happy to get any feedback on the approach, especially on whether silhouette-based automatic threshold calibration is the right direction, or whether you'd prefer a different cluster quality metric.