GSoC 2026 Proposal Draft – Idea 3: AI-Based Categorisation – Sasha

Relevant Links

1. Introduction

Hello, I'm Sasha (a nickname, as my legal name is a bit difficult to pronounce), a 4th-year Computer Science student based in Southeast Asia with professional full-stack developer experience. My production experience includes React, TypeScript, Django, and AI/LLM systems across multiple companies, and I am very eager to dive into the world of open source!

Programming experience:

  • 에이사허브: Next.js, TypeScript, Django REST Framework, WebSocket streaming, OpenAI/Claude/Gemini integration, pgvector, RAG with embeddings, Celery, AWS ECS, Terraform

  • GDP Labs: LangChain.js, LangGraph, FastAPI, React, Elasticsearch, Milvus vector database, GraphRAG, ColBERT, BM25 reranking

  • Equinox Technology: Next.js, React, Shopify/Liquid, Core Web Vitals optimization

  • PT Winniecode: MERN stack, REST APIs, authentication

  • Freelance (Alpacca Studio): Next.js, Django, PostgreSQL, multi-tenant SaaS architecture, RBAC

Why this project?:

As note collections grow, the cognitive overhead of filing notes into the right notebook and tagging them consistently becomes tedious enough that most users just stop doing it — and once organisation falls behind, it compounds. This project automates that friction away using local embeddings to classify, tag, and surface structural patterns across the collection, without requiring an API key or sending data off-device. The core intelligence runs entirely offline, which fits naturally with Joplin's privacy-first philosophy.


2. Problem Statement

2.1 Notes end up in the wrong place. Users jot down a meeting note while browsing "Recipes" — and that's where it stays. On mobile especially, switching notebooks before writing feels like too much friction, so people just don't.

2.2 Tags become inconsistent. #project, #projects, #proj — same thing, three tags. Eventually most users stop tagging altogether, which defeats the purpose.

2.3 You can't see what you actually have. Notes about the same topic end up scattered across multiple notebooks with no shared tags. There's no way to discover the latent structure in your own collection, so users are left manually combing through large numbers of entries to find the notes they need.

2.4 Related work. The Jarvis plugin offers embedding-based tag suggestion. This project differs in three crucial ways: (a) it suggests notebooks, not just tags — no existing Joplin plugin does this; (b) it operates in batch mode across the entire collection ("Analyse All"), not just on the currently selected note; and (c) it provides a dedicated categorisation UI with accept/reject/undo, rather than embedding AI features into a general-purpose assistant. In the broader ecosystem, Obsidian's auto-tagging plugins (AI Tagger, Auto Classifier) all require an LLM API key — none offer offline, zero-config tagging via embeddings.


3. Building Plan

A Joplin desktop plugin that uses local embeddings to automatically suggest where notes belong and how they should be tagged — without needing an API key or sending any data off-device.

3.1 Core features (works without any LLM):

  1. Notebook Classification — Centroid-based classifier that compares a note's embedding against the mean vector of each notebook. When users open a note, the plugin suggests which notebook it actually belongs in with a one-click move.

  2. Auto-Tagging — KNN tag propagation from semantically similar notes, weighted by cosine similarity. If your 5 closest notes are all tagged #react and #frontend, it suggests those tags for the current note.

  3. Stale Note Detection — Flags notes not edited for 6+ months using user_updated_time and suggests archival to a dedicated "Archive" notebook. No AI is involved, just a timestamp check, but it is integrated into the same suggestion panel.

3.2 Enhancement layers (optional, when LLM is configured):

  1. Topic Discovery — K-Means clustering on note embeddings to reveal hidden themes. Users might discover 40 notes about "home renovation" scattered across 4 notebooks. The plugin finds these clusters, labels them via c-TF-IDF with optional LLM refinement, and suggests creating proper notebooks.

  2. Agentic Organisation — LLM-powered batch action planning: "move these 12 notes to a new 'Machine Learning' notebook, tag these 8 with #research, archive these 15 stale notes." Every action is reviewed in a sidebar panel before execution. Nothing auto-executes.

3.3 Expected outcome. For daily use: open a note, see notebook + tag suggestions instantly. For periodic use: click "Analyse All Notes", review a list of concrete suggestions (move, tag, create notebook, archive), and apply approved actions with one click. Every action is undoable.

3.4 Out of scope. Full RAG / chat with notes (Idea 4), multi-modal analysis (text only), mobile-specific UI (desktop first via joplin.views.panels).


4. Technical Approach

4.1 Architecture Overview

The plugin separates into an embedding service (shared infrastructure), an intelligence engine (core classification + tagging + stale detection, and optional clustering), and a suggestion panel (UI). All run inside the Joplin plugin sandbox.

4.2 Embedding Service

4.2.1 Model selection. We opted for BAAI/bge-small-en-v1.5 via transformers.js (ONNX Runtime, WASM backend) over the commonly-cited all-MiniLM-L6-v2 because bge-small scores significantly higher on MTEB classification benchmarks specifically — 74.14 vs ~63 — and classification is exactly what this plugin does. It also supports a longer token context (512 vs 256), meaning fewer notes get truncated. The size difference is negligible for desktop. Among all models under ~50MB quantised with ONNX availability, bge-small ranks #1 on MTEB Classification — the next best small model (gte-small) scores 72.31, and models that beat bge-small (gte-modernbert-base at 76.99, bge-base at 75.53) are all 100MB+, too large for a plugin.

Notes exceeding 512 tokens are truncated — the model embeds the first ~350 words. For typical notes this captures the topic; for very long notes, the title and opening content still provide a usable signal. The service is model-agnostic — swapping requires only a config change and a re-index.

4.2.2 Why WASM. Joplin's plugin sandbox only whitelists a few native packages (sqlite3, fs-extra). Native ONNX would need platform-specific binaries that can't be bundled. WASM runs everywhere. The build uses extraScripts in plugin.config.json to compile the Web Worker entry point separately, and a build-time copy script (following the pattern from Joplin's official worker example plugin at packages/app-cli/tests/support/plugins/worker/) to place ONNX WASM files into dist/. The plugin loads them from its installation directory at runtime.

4.2.3 Model delivery. The ONNX model weights (~34MB quantised) are auto-downloaded by transformers.js from HuggingFace on first use and cached locally in joplin.plugins.dataDir(). Subsequent loads read from the local cache — no internet required after the initial download. A progress indicator is shown during the one-time download. If the download fails, the plugin surfaces an error and retries on next trigger.

4.2.4 Lazy initialisation. The model loads on first use (when the user opens the plugin or triggers "Analyse All"), not on Joplin startup. Once loaded, the pipeline object stays in memory for the session.

4.3 Vector Storage

4.3.1 Approach. A custom binary Float32Array store: a contiguous buffer persisted as a binary file via joplin.require('fs-extra'), a Map<noteId, index> for constant-time lookup, and brute-force cosine similarity.

4.3.2 Why not the alternatives?

  • Vectra: JSON-based storage — significantly larger on disk, slower to load. Pre-1.0 library with a single maintainer. Risky dependency.

  • sqlite-vec: Good performance but requires platform-specific native extensions (.dll/.so/.dylib) that Joplin plugins cannot bundle — only sqlite3 and fs-extra are whitelisted via joplin.require().

The custom store is ~150 lines of TypeScript, zero dependencies, loads instantly, and handles KNN queries fast enough at personal-collection scale (5,000 notes ≈ 7.5MB, millisecond queries).

4.3.3 CRUD operations. For create, the new embedding is appended at the end of the buffer with its note ID and precomputed norm, then flushed to disk. Search is a brute-force cosine similarity scan using precomputed norms — <1ms for k=5 over 1,000 vectors, scaling linearly (10,000 vectors would still be <10ms). For update, the note's index is looked up from the ID map and the 384 floats at that offset are overwritten with the new embedding and recomputed norm. Delete uses a swap-with-last approach — the entry is swapped with the last element in the buffer and the count is shrunk by 1, giving O(1) deletion. Storage: 1,000 notes at 384 dimensions ≈ 1.5MB on disk; 10,000 notes ≈ 15MB. Disk writes are debounced during bulk indexing to avoid excessive I/O.
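The append/overwrite/swap-delete operations above can be sketched as a minimal in-memory version (illustrative names; the real store would add fs-extra persistence, debounced flushes, and geometric buffer growth rather than reallocating per insert):

```typescript
// Minimal sketch of the binary Float32Array vector store.
// Persistence (fs-extra writes of `data.buffer`) and debouncing are omitted.
class VectorStore {
  private data = new Float32Array(0);          // contiguous embeddings
  private norms: number[] = [];                // precomputed L2 norms
  private ids: string[] = [];                  // index -> noteId
  private index = new Map<string, number>();   // noteId -> index

  constructor(private dim = 384) {}            // 384 for bge-small-en-v1.5

  get count(): number { return this.ids.length; }

  upsert(noteId: string, vec: Float32Array): void {
    let norm = 0;
    for (const x of vec) norm += x * x;
    norm = Math.sqrt(norm);
    const i = this.index.get(noteId);
    if (i !== undefined) {                     // update: overwrite floats in place
      this.data.set(vec, i * this.dim);
      this.norms[i] = norm;
      return;
    }
    // create: append at the end (a real store grows the buffer geometrically)
    const grown = new Float32Array((this.count + 1) * this.dim);
    grown.set(this.data);
    grown.set(vec, this.count * this.dim);
    this.data = grown;
    this.index.set(noteId, this.count);
    this.ids.push(noteId);
    this.norms.push(norm);
  }

  remove(noteId: string): void {               // O(1) swap-with-last delete
    const i = this.index.get(noteId);
    if (i === undefined) return;
    const last = this.count - 1;
    if (i !== last) {
      this.data.copyWithin(i * this.dim, last * this.dim, (last + 1) * this.dim);
      this.ids[i] = this.ids[last];
      this.norms[i] = this.norms[last];
      this.index.set(this.ids[i], i);
    }
    this.ids.pop();
    this.norms.pop();
    this.index.delete(noteId);
    this.data = this.data.subarray(0, last * this.dim);
  }

  // Brute-force cosine KNN over the whole buffer, using precomputed norms.
  search(query: Float32Array, k: number): { id: string; score: number }[] {
    let qNorm = 0;
    for (const x of query) qNorm += x * x;
    qNorm = Math.sqrt(qNorm);
    const scored = this.ids.map((id, i) => {
      let dot = 0;
      for (let d = 0; d < this.dim; d++) dot += this.data[i * this.dim + d] * query[d];
      return { id, score: dot / ((qNorm * this.norms[i]) || 1) };
    });
    return scored.sort((a, b) => b.score - a.score).slice(0, k);
  }
}
```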

4.4 Incremental Indexing

4.4.1 The onNoteChange() limitation. My first instinct was to use onNoteChange() for all sync. Investigating JoplinWorkspace.ts (lines 115–128), I found it only fires for the currently selected note. That's fine for immediate feedback but misses notes synced from other devices.

4.4.2 Solution: Events API cursor polling. The reliable approach is GET /events with a persisted cursor, which returns up to 100 changes per call (as defined in packages/lib/services/rest/routes/events.ts, line 10) with has_more pagination. This catches every create/update/delete across the entire collection. Cursor bootstrap: on first run (no stored cursor), the plugin performs a full scan of all notes via paginated GET /notes, then calls GET /events without a cursor parameter — which returns an empty item list and the latest change ID as the starting cursor. From that point, incremental polling picks up only new changes.

4.4.3 Three sync triggers:

  • onNoteChange() — fast path for the currently edited note

  • onSyncComplete() — batch catch-up after Joplin sync

  • Periodic polling (every 5 minutes) — safety net for missed events

Each note gets a source_hash (MD5 of the note body). On re-index, only notes whose hash differs from the stored hash get re-embedded — unchanged notes are skipped entirely.

Notes with encryption_applied = 1 (E2EE) are skipped and queued — their ciphertext body would produce meaningless embeddings. When decryption completes, the next poll cycle picks up the change and indexes the decrypted content.

A running "Analyse All" job is guarded by a simple lock flag — if triggered again while running, the second invocation is queued rather than overlapping. If Joplin closes mid-indexing, partial progress is safe: the hash map and vector store are flushed to disk every 50 notes during bulk indexing, so the next run resumes from the last persisted point.
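The hash-gated re-embedding amounts to one check before each embed. A sketch (the in-memory map stands in for the hash map the plugin persists to dataDir()):

```typescript
import { createHash } from "crypto";

// noteId -> MD5 of the last body we embedded (persisted to dataDir in practice).
const sourceHashes = new Map<string, string>();

function md5(text: string): string {
  return createHash("md5").update(text).digest("hex");
}

// Returns true only when the note's body changed since it was last embedded,
// i.e. only then do we pay the embedding cost.
function needsReEmbed(noteId: string, body: string): boolean {
  const h = md5(body);
  if (sourceHashes.get(noteId) === h) return false; // unchanged: skip entirely
  sourceHashes.set(noteId, h);
  return true;
}
```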

4.5 Centroid-Based Notebook Classification

4.5.1 How it works. For each notebook, compute its centroid — the L2-normalised mean of all its note embeddings. For a new or uncategorised note, compute cosine similarity against each centroid and suggest the highest-scoring notebook. This is O(k) per note — instant. Centroids are updated incrementally: when a single note is created, updated, or moved, only the affected notebook's centroid is recomputed (sum + divide, O(n) for that notebook). During "Analyse All", all centroids are fully recomputed from scratch.

For nested notebooks (sub-notebooks via parent_id), centroids are computed per leaf notebook — a note in "Programming > Python" contributes only to the Python centroid, not the parent "Programming" centroid. Classification suggests the most specific matching notebook.

If the user has only a single default notebook with all their notes, the plugin detects this (one notebook with >90% of notes) and prompts "Run 'Analyse All' to discover topic-based notebooks" rather than making meaningless single-notebook suggestions.
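The centroid step, together with the threshold and margin from 4.5.2, can be sketched as follows (a minimal version with illustrative helper names; plain number arrays instead of the Float32Array store):

```typescript
type Vec = number[];

// L2-normalise a vector so that dot product equals cosine similarity.
function normalize(v: Vec): Vec {
  const n = Math.sqrt(v.reduce((s, x) => s + x * x, 0)) || 1;
  return v.map(x => x / n);
}

// Centroid = L2-normalised mean of a notebook's note embeddings.
function centroid(vectors: Vec[]): Vec {
  const dim = vectors[0].length;
  const mean = new Array(dim).fill(0);
  for (const v of vectors)
    for (let d = 0; d < dim; d++) mean[d] += v[d] / vectors.length;
  return normalize(mean);
}

const dot = (a: Vec, b: Vec): number => a.reduce((s, x, i) => s + x * b[i], 0);

// Suggest the best-matching notebook; abstain below the threshold (novelty
// rejection) and present both candidates when the top-2 gap falls inside the
// ambiguity margin (default values from 4.5.2).
function classify(
  note: Vec,
  centroids: Map<string, Vec>,
  threshold = 0.78,
  margin = 0.03,
): string[] {
  const ranked = Array.from(centroids.entries())
    .map(([id, c]) => ({ id, score: dot(normalize(note), c) }))
    .sort((a, b) => b.score - a.score);
  if (ranked.length === 0 || ranked[0].score < threshold) return []; // suggest a new notebook instead
  if (ranked.length > 1 && ranked[0].score - ranked[1].score < margin)
    return [ranked[0].id, ranked[1].id]; // ambiguous: present both options
  return [ranked[0].id];
}
```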

4.5.2 Threshold calibration for bge-small-en-v1.5. This model uses contrastive learning with temperature τ=0.01 (as documented in the official BAAI model card), which compresses its cosine similarity distribution into the interval [0.6, 1.0]. Even unrelated text pairs produce scores above 0.7. Thresholds must be calibrated for this compressed distribution:

  • Classification threshold: 0.78 — below this, suggest creating a new notebook. This sits just below BAAI's recommended filtering thresholds of 0.8+ — adjusted slightly lower because centroid classification (selecting the best match) requires a lower bar than similarity filtering (finding near-duplicates). This implements novelty/distance rejection (Dubuisson & Masson, 1993): rejecting inputs that are too far from all known classes.

  • Ambiguity margin: 0.03 — if the top-2 notebook scores are within this margin, present both options to the user. This implements ambiguity rejection (Chow, 1970; Hendrickx et al., 2021): abstaining when the input falls near a decision boundary. Within the compressed [0.6, 1.0] range, 0.03 represents 7.5% of the effective decision space. Margin-based confidence — using the gap between the top-1 and top-2 scores as an uncertainty signal — is a well-established heuristic in classification with reject option literature (Fumera & Roli, 2002), making this gap a reliable indicator of decision boundary proximity.

  • Minimum notes for stable centroid: 10 — below this, fall back to KNN classification among that notebook's individual notes.

These thresholds are starting defaults and will be initially calibrated during weeks 3–4 and refined during end-to-end testing in weeks 5–6, consistent with the literature consensus that absolute cosine similarity thresholds must be tuned per model and task (Reimers & Gurevych, EMNLP 2019). Calibration methodology: construct a labelled evaluation set from real user note collections (with mentor assistance) where each note has a known correct notebook. Sweep threshold values across [0.70, 0.90] in 0.01 increments, measure classification F1, and select the threshold maximising F1 on a held-out split. The ambiguity margin is tuned similarly by measuring the rate of correct vs incorrect suggestions in the ambiguous band.

4.5.3 When centroid vs full clustering applies:

| Scenario | Method | Why |
|---|---|---|
| Single note arrives | Centroid comparison | O(k) — instant |
| User clicks "Analyse All" | Full K-Means clustering | Discovers new structure |
| After sync with many changes | Centroid + periodic re-cluster | Balance speed and quality |
| First-time setup (no notebooks) | Full clustering + labelling | Build structure from scratch |

4.6 KNN Auto-Tagging

4.6.1 Algorithm. Find the k-nearest neighbours, collect their tags, weight by cosine similarity, and suggest tags that meet the threshold.

4.6.2 Parameters (calibrated for bge-small's compressed [0.6, 1.0] range per BAAI model card):

  • k = 5 (adaptive: 3–10 based on corpus size)

  • Weighting: cosine similarity directly — weighted voting typically produces ~1–5% better results than unweighted in classification tasks

  • Tag threshold: score ≥ 0.78 AND tag appears in ≥ 2 of k neighbours. The 0.78 threshold is reused from notebook classification because both tasks operate on the same embedding space with the same compressed similarity distribution — a neighbour scoring below 0.78 is too semantically distant to propagate tags from reliably. The dual condition (score AND frequency) provides additional filtering that classification does not need.

  • Maximum 5 suggestions per note — avoids choice overload
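The weighted vote with the dual score-and-frequency condition can be sketched as (neighbour scores are assumed to come from the vector store's KNN query; names are illustrative):

```typescript
interface Neighbour {
  tags: string[];
  score: number; // cosine similarity to the current note
}

// KNN tag propagation: neighbours below the similarity threshold do not vote;
// a tag needs support from >= minCount eligible neighbours; votes are weighted
// by cosine similarity and the top maxSuggestions tags are returned.
function suggestTags(
  neighbours: Neighbour[],
  threshold = 0.78,
  minCount = 2,
  maxSuggestions = 5,
): string[] {
  const eligible = neighbours.filter(n => n.score >= threshold);
  const weight = new Map<string, number>();
  const count = new Map<string, number>();
  for (const n of eligible) {
    for (const tag of n.tags) {
      weight.set(tag, (weight.get(tag) ?? 0) + n.score); // similarity-weighted vote
      count.set(tag, (count.get(tag) ?? 0) + 1);
    }
  }
  return Array.from(weight.entries())
    .filter(([tag]) => (count.get(tag) ?? 0) >= minCount)
    .sort((a, b) => b[1] - a[1])
    .slice(0, maxSuggestions)
    .map(([tag]) => tag);
}
```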

4.6.3 Cold-start handling. When fewer than 10 notes have tags, KNN lacks sufficient examples. The fallback chain:

  1. If LLM is configured → ask the LLM to suggest tags from the note content + user's existing tag vocabulary

  2. If no LLM → display "Tag 10+ notes to enable auto-suggestions." This is honest and avoids generating random tags from cluster keywords (users create organisational tags like #todo, not topical keywords like machine, learning, neural)

4.7 Stale Note Detection

During "Analyse All", query all notes and flag those where user_updated_time exceeds the configurable threshold (default: 6 months). Using user_updated_time rather than updated_time avoids false negatives from sync-triggered updates. Additionally, the plugin tracks last_viewed_time per note via onNoteSelectionChange — stored in dataDir(), not in note userData (which would trigger unnecessary sync events). A note that is frequently viewed but never edited (e.g. a reference document) is not stale. The effective staleness check uses max(user_updated_time, last_viewed_time). Stale notes surface as "Archive?" suggestions in the same panel, using move_note to relocate to a user-configured "Archive" notebook (auto-created if absent). No AI required.
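The effective staleness check reduces to a single comparison (sketch; `lastViewedTime` is the plugin-tracked timestamp described above, 0 if never viewed):

```typescript
const SIX_MONTHS_MS = 6 * 30 * 24 * 60 * 60 * 1000; // default threshold, configurable

// A note is stale only if it has been neither edited nor viewed recently:
// max(user_updated_time, last_viewed_time) must be older than the threshold.
function isStale(
  userUpdatedTime: number,   // Joplin's user_updated_time (ms since epoch)
  lastViewedTime: number,    // plugin-tracked view timestamp
  now: number = Date.now(),
  thresholdMs: number = SIX_MONTHS_MS,
): boolean {
  return now - Math.max(userUpdatedTime, lastViewedTime) > thresholdMs;
}
```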

4.8 Topic Discovery via K-Means Clustering

This is a periodic "Analyse All" feature — it answers "what hidden topics exist in my notes?"

4.8.1 Why K-Means over HDBSCAN.

  • Every note must be assigned. HDBSCAN marks outliers as "noise" — users expect every note in a notebook.

  • Notebook count is predictable (5–30). K-Means with auto-k maps naturally.

  • JavaScript ecosystem: ml-kmeans (v7.0.0, maintained TypeScript). No mature JS HDBSCAN exists.

  • On L2-normalised embeddings, minimising Euclidean distance is equivalent to maximising cosine similarity — ||u-v||² = 2(1 - cos(u,v)) — so K-Means directly optimises the right metric.

4.8.2 Automatic k selection. Centroid-based simplified silhouette replaces O(n²) full pairwise with O(n·k) centroid distances: a(i) = distance to own centroid, b(i) = distance to nearest other centroid, s(i) = (b-a)/max(a,b). Search range: k = 2 to min(√n, 30). Sampling for large collections.
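The simplified silhouette above can be sketched as (assumes each point's assigned cluster label and the centroid list; Euclidean distance, which on L2-normalised embeddings is monotone in cosine similarity):

```typescript
type Vec = number[];

const dist = (a: Vec, b: Vec): number =>
  Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));

// Centroid-based simplified silhouette, O(n*k) instead of the O(n^2) full version:
// a(i) = distance to own centroid, b(i) = distance to nearest other centroid,
// s(i) = (b - a) / max(a, b); the mean over all points scores a candidate k.
function simplifiedSilhouette(points: Vec[], labels: number[], centroids: Vec[]): number {
  let total = 0;
  points.forEach((p, i) => {
    const a = dist(p, centroids[labels[i]]);
    let b = Infinity;
    centroids.forEach((c, j) => {
      if (j !== labels[i]) b = Math.min(b, dist(p, c));
    });
    total += (b - a) / (Math.max(a, b) || 1);
  });
  return total / points.length; // higher = better-separated clustering
}
```

Auto-k then runs this for each k in [2, min(√n, 30)] and keeps the k with the highest mean score.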

4.8.3 Cluster labelling. c-TF-IDF treats each cluster as a single document and computes class-based term frequency weighted by inverse document frequency across clusters. Top terms become the cluster's keyword label. When an LLM is available, keywords + sample note titles get refined into human-readable labels.
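A minimal sketch of one common c-TF-IDF variant, treating each cluster's concatenated notes as a single document (naive whitespace tokenisation here; the real pipeline would add stop-word filtering):

```typescript
// c-TF-IDF sketch: term frequency within a cluster, weighted by inverse cluster
// frequency, so terms shared by every cluster score near zero and
// cluster-specific terms surface as keyword labels.
function clusterKeywords(clusters: string[][], top = 3): string[][] {
  const k = clusters.length;
  const docs = clusters.map(notes =>
    notes.join(" ").toLowerCase().split(/\W+/).filter(w => w.length > 2));
  const clusterFreq = new Map<string, number>(); // in how many clusters a term appears
  for (const doc of docs)
    for (const t of new Set(doc))
      clusterFreq.set(t, (clusterFreq.get(t) ?? 0) + 1);
  return docs.map(doc => {
    const tf = new Map<string, number>();
    for (const t of doc) tf.set(t, (tf.get(t) ?? 0) + 1);
    return Array.from(tf.entries())
      .map(([t, f]) => [t, (f / doc.length) * Math.log(1 + k / clusterFreq.get(t)!)] as const)
      .sort((a, b) => b[1] - a[1])
      .slice(0, top)
      .map(([t]) => t);
  });
}
```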

4.8.4 Cluster-to-suggestion flow. After clustering, the plugin compares discovered clusters against existing notebooks. For each cluster that does not align with any existing notebook (low centroid overlap), it generates a suggestion: "Create notebook '[cluster label]' and move [N] notes into it." For clusters that partially overlap an existing notebook, it suggests moving the outlier notes. All suggestions appear in the same sidebar panel as notebook classification and tagging suggestions, with the same Accept/Reject/Undo workflow. The user can preview which notes belong to each cluster before accepting.

4.9 Agentic LLM Layer

When an LLM provider is configured, the plugin generates a batch action plan that the user reviews before execution.

4.9.1 LLM-to-API mapping. Four tools following the OpenAI function-calling JSON Schema format (which Ollama also supports via its compatible endpoint):

| Tool | Description | Joplin API Call |
|---|---|---|
| add_tag | Add tag to note (creates if needed) | POST /tags, then POST /tags/:id/notes |
| remove_tag | Remove tag from note | DELETE /tags/:id/notes/:noteId |
| move_note | Move note to notebook | PUT /notes/:id with { parent_id } |
| create_notebook | Create new notebook | POST /folders with { title, parent_id } |

All four were manually verified to exist in the Joplin codebase:

  • Tag.addNote() via POST /tags/:id/notes in packages/lib/services/rest/routes/tags.ts

  • Tag.removeNote() via DELETE /tags/:id/notes/:noteId in the same file

  • Note.save({ parent_id }) via PUT /notes/:id in routes/notes.ts

  • Folder.save() via POST /folders via defaultAction in routes/folders.ts

4.9.2 Execution flow. The LLM produces a batch action plan; each tool call passes JSON Schema validation; the resulting suggestions appear in the sidebar for user review; approved actions execute via the Joplin Data API; each executed action records its inverse on the undo stack.

4.9.3 Safety guarantees.

  • Nothing auto-executes. The sidebar presents each suggestion with type (TAG/MOVE), confidence %, reason, and Accept/Reject buttons

  • Accept All / Reject All available at the top

  • "Never suggest again" — when rejecting a suggestion, the user can mark it as permanently ignored. The plugin stores an ignore map (noteId → Set<suggestionHash>) in dataDir() and filters these from all future runs

  • Every executed action records its inverse for the undo stack (session-scoped; cleared on Joplin restart)

  • All note/folder IDs validated via joplin.data.get() before execution

  • JSON Schema validation on every tool call; retry with error feedback (max 2)

4.9.4 LLM context design. The prompt sent to the LLM contains: (a) the note's title and first ~500 tokens of body text (not the full note, to limit token usage), (b) the list of existing notebook names and tag vocabulary, and (c) the top-3 candidate notebooks/tags from the embedding-based classifier as pre-computed hints. The LLM's role is to refine and plan batch actions across multiple notes — not to replace the embedding classifier. For remote LLM providers, note content is sent over HTTPS; this is explicitly disclosed in the settings panel (see 4.11).

4.9.5 Multi-provider support. Since Ollama exposes an OpenAI-compatible endpoint (/v1/chat/completions), a single client class handles both — only the base URL changes. No LLM is the default; all embedding-based features work without one.

4.10 Plugin Registration

The plugin registers via joplin.plugins.register() in onStart:

  • Settings section (joplin.settings.registerSection): "AI Categoriser" with LLM provider dropdown (none/Ollama/OpenAI), API key (SettingItem.secure — OS keychain), Ollama URL, and auto-tag-on-save toggle. Similarity thresholds are internally configurable and calibrated during development — not exposed as user-facing settings.

  • Sidebar panel (joplin.views.panels.create): suggestion review UI loaded via setHtml and addScript

  • Command (joplin.commands.register): "AI: Analyse All Notes" via Tools menu (joplin.views.menuItems.create)

  • Event hooks: onNoteChange for immediate suggestions, onSyncComplete for batch re-indexing, onNoteSelectionChange for cached suggestions on note switch

4.11 Privacy

Everything runs locally by default. No data leaves the machine. The embedding model runs via WASM, the vector store lives in joplin.plugins.dataDir(), and all core features (notebook classification, auto-tagging, stale detection) work fully offline. When a user opts into a remote LLM provider (OpenAI), note titles and truncated body text (~500 tokens per note) are sent over HTTPS for batch action planning — the settings page shows a persistent disclosure stating exactly this. Ollama keeps everything local since it runs on the user's machine. API keys are stored via secure: true (OS keychain).

4.12 Risks and Mitigations

| Risk | Mitigation |
|---|---|
| WASM model won't load in sandbox | Build-time copy script places ONNX WASM files into dist/; follows the official Joplin worker example plugin pattern (packages/app-cli/tests/support/plugins/worker/) |
| Initial indexing freezes UI | Embedding runs in a dedicated Web Worker (new Worker()) so the main Joplin UI thread stays completely free. Within the worker, notes are processed in batches of 10 with event-loop yields between batches to keep postMessage IPC responsive (cancel signals, progress reporting). Progress bar + cancel in the UI. Partial progress persists to disk every 50 notes; resumes after restart |
| ONNX WASM memory degradation | The WASM runtime's linear memory grows but never shrinks during sustained embedding — bge-small degrades from ~47 to ~2 notes/sec after ~100 notes. Mitigated by recycling the Web Worker periodically (worker.terminate() + new Worker()). Model reload from local cache costs ~325ms per recycle. For 5,000 notes with recycling every ~100 notes: ~50 reloads × 325ms ≈ 16s overhead on top of ~2.4 min embedding time — acceptable. Verified in POC for both bge-small and all-MiniLM-L6-v2 |
| Encrypted notes (E2EE) | Notes with encryption_applied = 1 are skipped during indexing — ciphertext produces meaningless embeddings. They are queued and indexed after decryption completes on the next poll cycle |
| LLM tool calls return garbage | JSON Schema validation per call; retry with error context (max 2); fall back to embedding-only |
| Non-English notes | Default model is English-optimised. Multilingual model (multilingual-e5-small) is post-GSoC. Non-English still gets indexed, just at lower quality |
| Cross-platform binary issues | Eliminated — custom Float32Array store is pure TypeScript, zero native deps |

4.13 Testing Strategy

  • Unit tests: Cosine similarity, centroid computation, KNN voting, c-TF-IDF, silhouette score — pure functions, Jest

  • Integration tests: Create notes via Data API → verify indexed → modify → verify re-indexed → classify → verify suggestions match expected notebooks

  • Agentic tests: Mock LLM responses with known tool calls → verify validation → verify execution → verify undo reverses action

  • Edge cases: Empty notes (skip), <20 chars (skip), image-only notes (detected by stripping markdown image/link syntax ![...](:/..) and checking if remaining text is <20 chars — skip), encrypted notes with encryption_applied = 1 (skip, queue for post-decryption), 0 tags cold start, single-note notebooks, <10 note notebooks (KNN fallback), single default notebook (prompt Analyse All)
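The image-only edge case from the list above reduces to a small markdown-stripping check. A sketch (the regexes are illustrative; the real detector may handle more syntax):

```typescript
// A note counts as "image-only" when stripping markdown image/link syntax
// (including Joplin's :/resource links) leaves fewer than 20 chars of text.
function isImageOnly(body: string): boolean {
  const stripped = body
    .replace(/!\[[^\]]*\]\([^)]*\)/g, "")     // drop images entirely: ![alt](:/resourceId)
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")  // keep link text, drop the URL part
    .trim();
  return stripped.length < 20;
}
```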


5. Proposed Timeline

| Weeks | Phase | Deliverable |
|---|---|---|
| 1–2 | Foundation | Validate WASM model loading. Build EmbeddingService + VectorStore (Float32Array binary) + incremental indexer (Events API cursor). Unit tests for embedding + cosine similarity + store CRUD. |
| 3–4 | Core Classification | Centroid notebook classifier with threshold calibration. KNN auto-tagger with weighted voting. Stale note detector. Cold-start handling. Integration tests. |
| 5–6 | UI + End-to-End | Sidebar panel via joplin.views.panels. WebView ↔ Plugin messaging. Accept/reject with confidence indicators. Progress bar. Joplin theme styling. |
| 7–8 | Topic Discovery | K-Means with auto-k (simplified silhouette). c-TF-IDF labelling. "Analyse All" command. Target: <5s for 1,000 notes. |
| 9–10 | Agentic Layer | 4 tool definitions + JSON Schema validation. ActionExecutor with Joplin API mappings. Ollama + OpenAI provider. Undo stack. |
| 11–12 | Polish | End-to-end testing (100 / 1K / 5K notes). Performance tuning. User + developer docs. Plugin marketplace packaging. Demo video. |

Risk mitigation: Core features (embedding, classification, tagging, UI) are done by week 6. If WASM or infrastructure takes longer, enhancement layers (clustering, agentic) can be descoped without losing a working, shippable product.


6. Deliverables

Required:

  1. Joplin plugin (.jpl) installable from the marketplace

  2. Local embedding pipeline — bge-small-en-v1.5 via transformers.js with incremental indexing

  3. Centroid-based notebook classifier

  4. KNN auto-tagger with weighted voting

  5. Stale note detector with configurable threshold

  6. Sidebar panel UI with accept/reject/undo

  7. Test suite — unit + integration

  8. User guide + developer documentation

Optional (enhancement layers):

  1. K-Means topic discovery with c-TF-IDF labelling and "Analyse All" command

  2. Agentic LLM organiser with 4 tools, multi-provider support, and human-in-the-loop approval


7. Availability

  • Weekly availability: ~30–35 hours/week during GSoC (primary commitment)

  • Time zone: Asia/Jakarta (UTC+7)

  • Other commitments: University courses — exam periods and quizzes will be communicated to mentors in advance.

  • Communication: Daily async on the Joplin forum + GitHub. Weekly sync with mentors on their preferred platform. Blockers will be surfaced within 24 hours. All code submitted as early draft PRs for incremental review.


@jellyfrostt Thank you for the proposal and it looks great!! :slight_smile: I have questions:

  • With all-MiniLM-L6-v2:
    • Could you estimate how fast it would run for 1000 notes based on couple of examples?
    • How long users have to wait to load all-MiniLM-L6-v2? I see that your plan is to lazy load it.
    • Do you plan to do the inference async or not? And if so you should be careful about lags on user’s computer.
    • What if some users use older computers? Would the wait time be long?
  • With binary Float32Array store, how does it handle CRUD? I see that in subsequent sections you are describing the syncs but there’s a missing explanation of the technical implementation.
  • I like the KNN → c.TF-IDF idea!! And I think providing titles as context to LLMs should be good enough for now
  • Are you sure about the cosine similarity threshold range for notebook classification?
    • Where are your reasoning on the parameter (classification threshold, ambiguity margin) values come from?
    • Will the threshold work on different types of notes? It’s something to think about
  • Overall, what LLMs are you using for both the agent and the LLM categorisation helper?

Hello! @HahaBill Thank you for your very insightful feedback, they were really helpful questions that took into account important scenarios and gave me ideas on how to improve the proposal further. To answer your questions:

With all-MiniLM-L6-v2:

= I need to clarify that the proposal opted for bge-small-en-v1.5 over all-MiniLM-L6-v2, but your feedback and questions inspired me to benchmark both models and run a direct comparison of their performance, so I will answer the related questions from that lens.

Could you estimate how fast it would run for 1000 notes based on couple of examples?

= I ended up building a small POC to benchmark this for the purpose of this question (as the proposal opted for bge-small-en-v1.5), using two independent datasets: 1000 real Wikipedia article excerpts (avg 109 words) and 1000 synthetic note-like texts (avg 155 words). The findings are as follows:

| | all-MiniLM-L6-v2 | bge-small-en-v1.5 (proposed) |
|---|---|---|
| Synthetic 1000 notes | 13s (75.5 notes/sec) | ~25s (35–43 notes/sec) |
| Wikipedia 1000 notes | 10s (102.9 notes/sec) | 19s (52.6 notes/sec) |
| 5K projection | ~49s - 1.1 min | ~1.6 - 2.4 min |
| 10K projection | ~1.6 - 2.2 min | ~3.2 - 4.8 min |

While all-MiniLM-L6-v2 is faster, we opted for bge-small because this plugin is fundamentally a classifier, so we prioritise accurately determining which notebook a note belongs in and which tags to suggest. bge-small scores 74.14 on MTEB classification benchmarks vs ~63 for all-MiniLM-L6-v2. A wrong suggestion that the user has to reject is worse UX than waiting a few extra seconds during one-time background indexing. bge-small also supports a 512-token context vs 256, so longer notes are classified using more of their content rather than being truncated halfway. I think the speed difference is acceptable because indexing is a one-time background task with progress indication and cancel support; after that, only new/changed notes need re-embedding. That said, the architecture is model-agnostic, so swapping models requires only a config change and a re-index.

How long users have to wait to load all-MiniLM-L6-v2? I see that your plan is to lazy load it.

=

| | all-MiniLM-L6-v2 | bge-small-en-v1.5 (proposed) |
|---|---|---|
| Load time (cached) | ~230-280ms | ~325ms |
| Model size (q8) | ~22MB | ~34MB |
| Memory | +70MB RSS | +96MB RSS |

Both were lazy-loaded during benchmarking. Since the model only loads when the user first triggers semantic search or background indexing, not on plugin startup, users who don't use the feature pay zero cost. The load-time difference between the two is negligible (~50ms).

Do you plan to do the inference async or not? And if so you should be careful about lags on user's computer.

= Yes, fully async in a dedicated Web Worker (new Worker()) so that the main Joplin UI thread stays completely free. Per-note latency is 15–30ms depending on note length. During testing I also observed that the ONNX WASM runtime degrades after sustained use in a single process — bge-small drops from ~47 notes/sec to ~2 notes/sec after ~100 notes, and all-MiniLM-L6-v2 drops from ~100 to ~5 notes/sec after ~200 notes. I believe this is because WASM linear memory grows but never shrinks. Therefore, I propose recycling the worker process periodically to maintain consistent throughput — I verified this briefly in POC and found that it fully mitigates the issue for both models. So far no user-visible lag on my end (tested on i7-13620H, 16 threads, 24GB RAM).

| Dataset | Notes/sec | 1000 notes | 5000 notes |
|---|---|---|---|
| Synthetic (avg 155 words) | 35–43 | ~25s | ~2 min |
| Wikipedia (avg 109 words) | 52.6 | 19s | ~1.5 min |

What if some users use older computers? Would the wait time be long?

= The ONNX WASM runtime is CPU-only (no GPU required), so performance scales roughly linearly with CPU speed. On an older machine 3–4x slower than my test machine (i7-13620H, 16 threads, 24GB RAM), bge-small would still process ~10–15 notes/sec, meaning 1000 notes in ~1–1.5 minutes in the background. That's only the initial bulk indexing, so it's a one-time cost. After that, only new or modified notes are embedded incrementally (one at a time on save), which takes under 100ms per note even on slow hardware. The worker-recycling mitigation also works regardless of hardware, since the degradation is a WASM runtime issue, not a CPU issue.

With binary Float32Array store, how does it handle CRUD? I see that in subsequent sections you are describing the syncs but there's a missing explanation of the technical implementation.

= The vector store is a contiguous Float32Array buffer (384 floats per note for bge-small), with a parallel array of note IDs for index mapping, persisted as binary files via fs.writeFile/fs.readFile.

For create, the new embedding is appended at the end of the buffer with its note ID and precomputed norm, then flushed to disk. Search is a brute-force cosine similarity scan using the precomputed norms. In the POC this measured under 1ms for k=5 over 1000 vectors, and it scales linearly, so even 10,000 vectors would take under 10ms.

For update, the note's index is looked up from the ID array and the 384 floats at that offset are overwritten with the new embedding and recomputed norm. Delete uses a swap-with-last approach — the entry is swapped with the last element in the buffer and the count is shrunk by 1, giving O(1) deletion.

Storage-wise, 1000 notes at 384 dimensions is about 1.5MB on disk, and 10,000 notes would be ~15MB. Disk writes can be debounced during bulk indexing to avoid excessive I/O.
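A minimal in-memory sketch of this store (persistence and debounced disk flushing omitted; `upsert` covers both create and update, and a small `dims` is used in testing while the real store uses 384):

```typescript
// Contiguous Float32Array vector store with parallel ID/norm arrays,
// O(1) swap-with-last deletion, and brute-force cosine top-k search.
class VectorStore {
  private dims: number;
  private buf: Float32Array;
  private norms: number[] = [];
  private ids: string[] = [];
  private count = 0;

  constructor(dims = 384, capacity = 1024) {
    this.dims = dims;
    this.buf = new Float32Array(dims * capacity);
  }

  size(): number { return this.count; }

  // Create (append) or update (overwrite in place) a note's embedding.
  upsert(id: string, vec: number[]): void {
    const i = this.ids.indexOf(id);
    const idx = i >= 0 ? i : this.count++;
    let cap = this.buf.length;
    while ((idx + 1) * this.dims > cap) cap *= 2;   // grow buffer if needed
    if (cap !== this.buf.length) {
      const next = new Float32Array(cap);
      next.set(this.buf);
      this.buf = next;
    }
    this.buf.set(vec, idx * this.dims);
    this.norms[idx] = Math.hypot(...vec);           // precompute norm for search
    this.ids[idx] = id;
  }

  // O(1) delete: swap the entry with the last one, shrink the count.
  remove(id: string): void {
    const i = this.ids.indexOf(id);
    if (i < 0) return;
    const last = this.count - 1;
    this.buf.copyWithin(i * this.dims, last * this.dims, (last + 1) * this.dims);
    this.ids[i] = this.ids[last];
    this.norms[i] = this.norms[last];
    this.ids.pop();
    this.norms.pop();
    this.count--;
  }

  // Brute-force cosine similarity scan, returning the top-k matches.
  search(query: number[], k: number): { id: string; score: number }[] {
    const qNorm = Math.hypot(...query);
    const scored: { id: string; score: number }[] = [];
    for (let i = 0; i < this.count; i++) {
      let dot = 0;
      for (let d = 0; d < this.dims; d++) dot += this.buf[i * this.dims + d] * query[d];
      scored.push({ id: this.ids[i], score: dot / (qNorm * this.norms[i] || 1) });
    }
    return scored.sort((a, b) => b.score - a.score).slice(0, k);
  }
}
```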

I will append the technical implementation onto the draft, thanks for pointing this out!

Are you sure about the cosine similarity threshold range for notebook classification?

Where does your reasoning for the parameter values (classification threshold, ambiguity margin) come from?

= The classification threshold of 0.78 is derived from BAAI's documented recommendation of 0.8+ for similarity filtering, adjusted slightly lower for classification vs filtering — centroid classification (selecting the best match) requires a lower bar than similarity filtering (finding near-duplicates).

The ambiguity margin of 0.03 implements ambiguity rejection — based on research it is abstaining from classification when the top-2 scores are too close, indicating the input falls near a decision boundary. Within bge-small's compressed [0.6, 1.0] range, 0.03 represents 7.5% of the effective decision space. This is grounded in margin-based confidence — using the gap between the top-1 and top-2 scores is a well-established heuristic in classification with reject option literature, making this gap a reliable uncertainty signal.

Both values are starting defaults scheduled for empirical calibration during weeks 5–6, consistent with the literature consensus that absolute cosine similarity thresholds must be tuned per model and task.

Will the threshold work on different types of notes? It's something to think about

= A single fixed threshold likely won't be equally optimal across all note types: technical notes tend to cluster tightly, while personal or journal-style notes may be more diffuse. The proposal handles this through three mechanisms:

First, the dual-threshold design already provides adaptability: the 0.78 threshold handles novelty rejection (note too far from all notebooks), while the 0.03 ambiguity margin handles ambiguity rejection (note equally close to two notebooks). These are the two canonical forms of rejection in the literature, covering both failure modes.

Second, the KNN fallback for small notebooks (<10 notes) means the system adapts its method rather than relying solely on centroids that may be unstable.

Third, the threshold values are internally configurable and will be calibrated through empirical testing during weeks 5–6. If a fixed threshold proves too brittle across diverse collections, I plan to explore relative thresholding, using the gap between the best match and the distribution mean rather than an absolute cutoff, which naturally adapts to different similarity distributions.
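Taken together, the dual-threshold decision rule is small enough to sketch directly (0.78 and 0.03 are the starting defaults discussed above; the function names are illustrative):

```typescript
// Dual-threshold notebook classifier: reject on low absolute similarity
// (novelty) or on a too-small top-1/top-2 gap (ambiguity).
interface Candidate {
  notebook: string;
  score: number; // cosine similarity to the notebook centroid
}

function classify(
  candidates: Candidate[],
  threshold = 0.78, // novelty rejection: best match must clear this
  margin = 0.03,    // ambiguity rejection: top-1 must beat top-2 by this gap
): string | null {
  const ranked = [...candidates].sort((a, b) => b.score - a.score);
  // Novelty rejection: the note is too far from every notebook.
  if (ranked.length === 0 || ranked[0].score < threshold) return null;
  // Ambiguity rejection: the note sits near a decision boundary.
  if (ranked.length > 1 && ranked[0].score - ranked[1].score < margin) return null;
  return ranked[0].notebook;
}
```

A `null` return means the plugin abstains and makes no suggestion, which matches the design principle that no action beats a wrong action.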

Admittedly, the thresholds primarily come from published research on the topic, and you are right to call them out and question them. I will shortly incorporate your feedback into the draft proposal, along with the relevant references. Much thanks!

Overall, what LLMs are you using for both the agent and the LLM categorisation helper?

= The embedding model (bge-small-en-v1.5) runs locally via transformers.js with ONNX WASM so no API key needed. This powers the core features: notebook classification and auto-tagging. Stale note detection is purely timestamp-based, so no AI involved.

For the LLM categorisation helper and agentic layer, the proposal supports Ollama (local) and OpenAI (remote) through a single client class: since Ollama exposes an OpenAI-compatible endpoint (/v1/chat/completions), only the base URL changes. No LLM is configured by default; all embedding-based features work without one. The LLM is only used for optional enhancement layers: c-TF-IDF cluster label refinement, cold-start tag suggestion when fewer than 10 notes have tags, and agentic batch action planning.

For the agent specifically, it uses OpenAI function-calling JSON Schema format (which Ollama also supports) with 4 defined tools: add_tag, remove_tag, move_note, create_notebook. Every tool call is validated against the schema, with retry on failure (max 2), and nothing auto-executes so users can review all actions in the sidebar before applying.
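To illustrate, here is how one of the four tools might be declared and validated in the OpenAI function-calling format (which Ollama also accepts); the property names are illustrative, not a final schema:

```typescript
// One of the agent's four tools, declared in OpenAI function-calling format.
// add_tag, remove_tag, and create_notebook follow the same shape.
const tools = [
  {
    type: 'function',
    function: {
      name: 'move_note',
      description: 'Move a note to a different notebook',
      parameters: {
        type: 'object',
        properties: {
          note_id: { type: 'string' },
          target_notebook: { type: 'string' },
        },
        required: ['note_id', 'target_notebook'],
        additionalProperties: false,
      },
    },
  },
];

// Validate a tool call emitted by the model against the declared schema
// before queueing it for user review (nothing auto-executes).
function isValidCall(name: string, args: Record<string, unknown>): boolean {
  const tool = tools.find(t => t.function.name === name);
  if (!tool) return false;
  const params = tool.function.parameters;
  return params.required.every(key => typeof args[key] === 'string')
    && Object.keys(args).every(key => key in params.properties);
}
```

Calls failing this check would trigger the retry path (max 2 attempts) rather than surfacing a broken action in the sidebar.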

Once again, thank you for taking the time to review my proposal!


Just a heads up, I've recently revised the proposal draft accordingly and have taken all of your feedback into account as crucial insights moving forward. Here is the list of changes in the revised proposal:

  1. Model delivery mechanism — Explains how the ONNX model reaches the user's machine and what happens on failure
  2. Centroid recomputation timing — Explains when and how centroids are recalculated after note changes
  3. Web Worker vs yield clarification — Clarifies how the Web Worker and event-loop yields coexist without contradiction
  4. LLM prompt design — Explains what context is included in the prompt sent to the LLM
  5. Privacy specifics — Explains exactly what data leaves the machine when a remote LLM is configured
  6. Threshold calibration methodology — Explains how "empirical calibration" will actually be conducted
  7. source_hash — Explains what source_hash is and where it's used
  8. Tag threshold reuse — Explains why the 0.78 threshold is reused from notebook classification for tagging
  9. Cluster-to-action UX — Explains how discovered clusters become actionable suggestions in the UI

Thank you for the feedback! If there are any additional concerns or suggestions for further improvement, I'm more than happy to iterate! :grinning_face_with_smiling_eyes:


Hi @jellyfrostt, thank you for answering my questions!! I really appreciate the effort that you are putting into this!! :slight_smile: Your answers are comprehensive and easy to understand!

I have a few more questions and requests to help me assess your proposal further:

  • Could you share the small POC (video + logs) that you used in your benchmark?
  • Could you also share a small demo video where you run Transformers.js embedding random notes in a very minimal Joplin plugin? It'd be great if you create logs that show its load and inference times. A GitHub link to this minimal plugin would be appreciated!
  • What specific LLMs are you thinking of running for your LLM categorisation and agent? I understood that you want to use OpenAI-compatible APIs, which is great! But I am more interested in specific LLMs.
    • Which ones for Ollama?
    • Which ones for frontier models (OpenAI, Anthropic, Gemini, etc.)?
    • Can you justify why?
    • Can you show me your LLM categorisation and agent prompts?
      • What are the inputs? Explain why.
      • What are the outputs? Explain why.
      • For both of them, show me the schemas in JSON format.

Helloo!! Thank you for the inputs and for taking the time to be involved in my proposal thoroughly. I'm really enjoying working on this. Here are the answers to your follow-up questions and results from further iterations:


1. POC Benchmark (Video + Logs)

I built two artifacts to demonstrate feasibility:

  1. Standalone benchmark (benchmark-bge.mjs) — runs in Node.js with onnxruntime-node (native C++ bindings) to measure embedding quality, latency scaling, throughput degradation, memory growth, and KNN search accuracy on 1,000 synthetic notes (generated with realistic word counts and topic distributions).
  2. Joplin plugin (Section 2 below) — the real demo running onnxruntime-web (WASM) inside Joplin's Electron sandbox via a Web Worker.

The standalone benchmark validates algorithmic correctness and scaling behaviour. The Joplin plugin demo (Section 2) shows real WASM performance inside the actual plugin sandbox — WASM is typically ~3–10x slower than native depending on threading (as noted in the script header).

Standalone Benchmark (Node.js, native ONNX)

================================================================
  bge-small-en-v1.5 Benchmark — 1000 Notes [notes]
================================================================

System:
  CPU: 13th Gen Intel(R) Core(TM) i7-13620H
  Cores: 16 (WASM threads: 4)
  RAM: 24236.9 MB total, 2126.6 MB free
  Platform: win32 x64, Node v24.13.0

Notes:
  Total: 1000 unique
  Words: avg=155, median=141, min=22, max=404
  p10=62, p50=141, p90=290

--- 1. Model Load Time ---
  Load: 605.9ms
  Memory: +97.4 MB RSS

  Warmup: 135.5ms (3 representative notes)

--- 2. Latency by Note Length ---
  short (<80 words): avg=16.3ms min=7.2ms max=25.3ms (n=15)
  medium (80-150 words): avg=37.2ms min=21.2ms max=62.3ms (n=15)
  long (150-300 words): avg=43.2ms min=21.8ms max=86.1ms (n=15)
  very long (300-500 words): avg=50.3ms min=40.1ms max=62.7ms (n=15)
  extra long (500+ words): no notes

--- 3. Full 1000-Note Batch ---
  1000/1000 (23.0 notes/sec, elapsed 43.57s)
  Total: 43.57s
  Throughput: 22.9 notes/sec
  Per note: avg=43.6ms, p50=37.4ms, p90=79.3ms, p99=130.6ms
  Memory delta: +69.0 MB RSS

--- 3a. Throughput Degradation (per 100-note window) ---
  notes    1– 100: avg=  24.3ms,  41.1 notes/sec
  notes  101– 200: avg=  28.6ms,  35.0 notes/sec
  notes  201– 300: avg=  28.1ms,  35.6 notes/sec
  notes  301– 400: avg=  56.2ms,  17.8 notes/sec
  notes  401– 500: avg=  43.9ms,  22.8 notes/sec
  notes  501– 600: avg=  51.9ms,  19.3 notes/sec
  notes  601– 700: avg=  54.8ms,  18.3 notes/sec
  notes  701– 800: avg=  43.0ms,  23.2 notes/sec
  notes  801– 900: avg=  52.1ms,  19.2 notes/sec
  notes  901–1000: avg=  47.0ms,  21.3 notes/sec

  First window: 41.1 notes/sec
  Last window:  21.3 notes/sec
  Degradation:  48.3%

--- 3b. Memory Growth (sampled every 100 notes) ---
  note  100: RSS=  240.7 MB, Heap=   22.9 MB
  note  200: RSS=  250.5 MB, Heap=   26.7 MB
  note  300: RSS=  260.5 MB, Heap=   21.0 MB
  note  400: RSS=  267.3 MB, Heap=   25.4 MB
  note  500: RSS=  268.1 MB, Heap=   24.5 MB
  note  600: RSS=  268.3 MB, Heap=   21.2 MB
  note  700: RSS=  268.9 MB, Heap=   21.1 MB
  note  800: RSS=  269.8 MB, Heap=   25.1 MB
  note  900: RSS=  271.2 MB, Heap=   21.1 MB
  note 1000: RSS=  270.6 MB, Heap=   24.4 MB

  RSS growth during batch: +29.9 MB

--- 4. KNN Search ---
  k=5 over 1000 vectors: 1.3ms avg (200 runs)

  Query: "work, travel and more" (148 words)
  [0.789] "health, study and more" (104w)
  [0.784] "travel, study and more" (133w)
  [0.777] "study, personal and more" (314w)
  [0.768] "work, personal and more" (167w)
  [0.763] "study, finance and more" (104w)

--- 5. Memory ---
  RSS: 270.7 MB | Heap: 24.6 MB | External: 39.3 MB
  Vector store: 1.5 MB

--- 6. Projections ---
  Embedding (based on measured avg, WITHOUT Worker recycling):
      500 notes:   21.79s (0.4 min)
     1000 notes:   43.57s (0.7 min)
     2000 notes:   87.15s (1.5 min)
     5000 notes:  217.87s (3.6 min)
    10000 notes:  435.74s (7.3 min)

  KNN k=5 brute-force (linear scaling from measured):
     1000 vectors:    1.3ms
     5000 vectors:    6.6ms
    10000 vectors:   13.2ms
    50000 vectors:   65.9ms

================================================================
  Done.
================================================================

Note: The proposal's throughput figures (~47 to ~2 notes/sec) were from early POC testing under different conditions (shorter test corpus, different WASM memory pressure patterns). This larger 1,000-note benchmark provides more representative measurements: peak 41.1 notes/sec natively with 48% degradation, and ~1.1–6.0 notes/sec in the WASM plugin depending on note length. The degradation pattern is confirmed but less extreme than initially observed, which is good news for the proposed Worker recycling strategy.

Key findings (native backend, 1,000 notes):

  • Model load: 606ms + 136ms warmup — fast cold start
  • Throughput: 22.9 notes/sec average (peak 41.1, degraded to 21.3 — 48% degradation over the batch)
  • Latency scales with note length: short 16ms → long 43ms → very long 50ms — as expected from tokenizer + attention complexity
  • KNN search: 1.3ms for k=5 over 1,000 vectors (brute-force cosine similarity) — instant for the user
  • Memory: RSS grew by only 69MB during the batch (+30MB during steady state), heap stayed flat — native backend manages memory well
  • Projections: 5,000 notes in ~3.6 min, 10,000 in ~7.3 min — comfortably background-able

Throughput degradation (48%) and Worker recycling. Even with the native backend, throughput degrades from 41→21 notes/sec over the batch. In the Joplin plugin (WASM backend), this degradation is expected to be more severe because WASM linear memory grows monotonically and never returns pages (WebAssembly spec, wasm design#1300). This is exactly why our proposal includes Worker recycling: terminating the Worker (worker.terminate()) tears down the Worker's runtime resources and associated WASM memory. A memory leak on terminate was fixed in Electron 9+ (electron#24965), so modern Electron reliably reclaims memory. Model reload from local file cache (transformers.js caches to disk via env.cacheDir) is sub-second, so recycling overhead is negligible.

WASM vs native context: The Joplin plugin runs onnxruntime-web (WASM) in an Electron Web Worker, not native bindings. WASM is typically ~3–10x slower than native depending on threading. Multi-threaded WASM requires SharedArrayBuffer, which is not enabled by default in Electron (electron#35905) — the plugin defaults to single-threaded WASM. This means real plugin throughput will be lower than the numbers above. See Section 2 for actual Electron WASM measurements. All architectural decisions (Worker recycling, incremental indexing, background processing) are designed for the slower WASM case — the native benchmark validates the algorithmic correctness and scaling patterns.

Future optimisation paths: Transformers.js v4 (released February 2026) reports a 4x speedup for BERT-based embedding models via a new MultiHeadAttention operator — this could significantly close the WASM-to-native gap. We target v3 for stability but v4 is a natural upgrade path during GSoC. Additionally, IBM's granite-embedding-small-english-r2 (47M params, 384d, ModernBERT-based, 8192 token context) is a promising swap candidate — same dimensionality as bge-small but 16x longer context window. The embedding service is model-agnostic, so swapping requires only a config change and re-index.


2. Minimal Joplin Plugin Demo (Video + Logs + GitHub)

I built a minimal Joplin plugin that loads bge-small-en-v1.5 via @huggingface/transformers v3 in a Web Worker inside the Joplin plugin sandbox.

I ran the POC in two scenarios: one with 5 notes and one with 20 notes:


GitHub: jellyfrostt/joplin-embedding-poc

What the plugin does:

  1. Load Model (robot icon) — Downloads/loads bge-small-en-v1.5 (q8) via @huggingface/transformers in a Web Worker. Logs model load time and warmup time.
  2. Embed Notes (bolt icon) — Fetches the 20 most recently updated notes via joplin.data.get(), sends each to the Worker for embedding, logs per-note inference time and dimensions.
  3. Summary — Reports batch throughput (notes/sec) in an alert dialog.

Architecture:

Plugin (index.ts)                    Worker (worker.ts)
  |                                    |
  |-- postMessage({ type: 'load' }) -->|
  |                                    |-- pipeline('feature-extraction', 'bge-small-en-v1.5')
  |<-- { loadTime, warmupTime } -------|
  |                                    |
  |-- postMessage({ type: 'embed',  -->|
  |      text, noteId })               |-- embedder(text, { pooling: 'mean', normalize: true })
  |<-- { inferenceTime, dimensions } --|

Key implementation details:

  • plugin.config.json: extraScripts: ["worker.ts"] + webpackOverrides: { target: "web" } — the target: "web" override is critical because it forces @huggingface/transformers to detect a browser environment and use onnxruntime-web (WASM) instead of onnxruntime-node (native bindings, which can't load in Joplin's Electron sandbox).
  • tools/copyAssets.js: Build-time script that copies ONNX WASM files from node_modules/onnxruntime-web into dist/onnx-dist/. This follows the same pattern as Joplin's official worker example plugin at packages/app-cli/tests/support/plugins/worker/.
  • Worker loads WASM from local path: env.backends.onnx.wasm.wasmPaths = './onnx-dist/' — ensures the WASM runtime loads from the plugin's installation directory, not from CDN.

A note on logging

Joplin plugins run in a sandboxed process separate from the main renderer. console.info() from within the plugin goes to the sandbox's own JS context — not the main DevTools console (Help > Toggle Development Tools), which only shows Joplin's internal logs. For production-installed plugins, there is no built-in way to open DevTools for the sandbox (dev-mode plugins loaded via plugins.devPluginPaths do get DevTools automatically). The plugin therefore uses alert() dialogs to surface timing data across the sandbox boundary. A production plugin would use Joplin's panel API (joplin.views.panels) instead.

Results (from alert dialogs):

Model load (from cache, second run — first run downloads ~34MB from HuggingFace):

[Embedding POC] Model loaded in 344ms | Warmup: 12ms

Embedding run 1 (5 default Joplin notes only — long documentation pages):

--- Embedding 5 notes ---
[1/5] 1. Welcome to Joplin! — 898.3ms (384d) | cumulative: 1.1 notes/sec
[2/5] 2. Importing and exporting notes — 1040.0ms (384d) | cumulative: 1.0 notes/sec
[3/5] 3. Synchronising your notes — 896.1ms (384d) | cumulative: 1.1 notes/sec
[4/5] 4. Tips — 879.2ms (384d) | cumulative: 1.1 notes/sec
[5/5] 5. Joplin Privacy Policy — 831.6ms (384d) | cumulative: 1.1 notes/sec

Batch: 5 notes in 4546ms (1.1 notes/sec)
Per note: avg=909.0ms, p50=896.1ms, p90=1040.0ms

Embedding run 2 (20 notes — mix of short user notes + long default notes):

--- Embedding 20 notes ---
[1/20]  reydfgbdfhdfjgrjgjgjty — 58.4ms (384d) | cumulative: 16.9 notes/sec
[2/20]  wrtreyrhyfhfhf — 44.6ms (384d) | cumulative: 19.2 notes/sec
[3/20]  vdshdherreyestuey — 34.3ms (384d) | cumulative: 21.6 notes/sec
[4/20]  API Design Principles — 100.3ms (384d) | cumulative: 16.7 notes/sec
[5/20]  Photography Tips — 88.7ms (384d) | cumulative: 15.2 notes/sec
[6/20]  Home Garden Log — 82.0ms (384d) | cumulative: 14.6 notes/sec
[7/20]  Japanese Study Notes — 115.2ms (384d) | cumulative: 13.3 notes/sec
[8/20]  Docker Basics — 100.1ms (384d) | cumulative: 12.8 notes/sec
[9/20]  Sourdough Starter Guide — 112.3ms (384d) | cumulative: 12.2 notes/sec
[10/20] Meeting Notes: Project Kickoff — 93.9ms (384d) | cumulative: 12.0 notes/sec
[11/20] Python Virtual Environments — 91.0ms (384d) | cumulative: 11.9 notes/sec
[12/20] Budget March 2026 — 97.3ms (384d) | cumulative: 11.8 notes/sec
[13/20] Book Notes: Atomic Habits — 77.2ms (384d) | cumulative: 11.8 notes/sec
[14/20] Git Commands Cheat Sheet — 93.3ms (384d) | cumulative: 11.8 notes/sec
[15/20] Daily Workout Routine — 115.5ms (384d) | cumulative: 11.5 notes/sec
[16/20] Travel Packing Checklist — 77.0ms (384d) | cumulative: 11.6 notes/sec
[17/20] Recipe: Pasta Carbonara — 101.0ms (384d) | cumulative: 11.4 notes/sec
[18/20] Machine learning — 77.4ms (384d) | cumulative: 11.5 notes/sec
[19/20] 1. Welcome to Joplin! — 948.9ms (384d) | cumulative: 7.6 notes/sec
[20/20] 2. Importing and exporting notes — 841.8ms (384d) | cumulative: 6.0 notes/sec

Batch: 20 notes in 3353ms (6.0 notes/sec)
Per note: avg=167.5ms, p50=93.9ms, p90=841.8ms

Key observations:

  • Model load from cache: 344ms — near-instant. First-run download (~34MB from HuggingFace) takes longer but is a one-time cost.
  • WASM latency scales with note length — short notes (1-3 sentences): 34–100ms, medium notes (paragraph-length): 77–115ms, long documentation pages (Joplin defaults): 832–1040ms. This matches the standalone benchmark's finding that latency scales with tokenizer + attention complexity.
  • 20-note batch throughput: 6.0 notes/sec (avg 167.5ms/note). The p90 of 841.8ms is pulled up by the two long default notes — the median (p50) of 93.9ms better reflects typical user notes.
  • 384-dimensional embeddings confirmed — matching bge-small-en-v1.5's output spec.
  • All notes processed successfully — the Worker + WASM pipeline works end-to-end inside Joplin's plugin sandbox.
  • Projection at WASM speed: For typical user notes (p50 ~94ms), throughput is ~10.6 notes/sec. At that rate, 5,000 notes ≈ 8 min — comfortably background-able. Even worst-case (all long notes at 1.1 notes/sec), 5,000 notes ≈ 76 min — still manageable with incremental indexing (only re-embed changed notes). Transformers.js v4's reported 4x BERT speedup would improve both cases significantly.

3. Specific LLM Choices

As stated previously, the plugin uses OpenAI-compatible endpoints (/v1/chat/completions with the tools parameter) so that a single client class handles both local Ollama and cloud providers; only the base URL and API key change. Here are my specific model choices (my bad, I misunderstood your question previously; English is not my first language):

3.1 Ollama (Local)

| | Model | VRAM | Context | Why |
|---|---|---|---|---|
| Primary | qwen3:8b | ~5–7 GB (Q4_K_M) | 32K tokens (extendable to 131K via YaRN) | Highest tool-calling F1 (0.933) among sub-10B models on Docker's practical evaluation (3,570 test cases across 21 models). Native thinking mode for complex batch planning. Ollama officially lists it in their Tools category. |
| Fallback | llama3.1:8b | ~5–7 GB (Q4_K_M) | 128K tokens | BFCL score ~76.1%. Longest track record for tool calling on Ollama, battle-tested since 2024. Good fallback if users experience Qwen3/Ollama version compatibility issues. |
| Lightweight | phi4-mini (3.8B) | ~3–4 GB | 128K tokens | For users on 8GB systems. Microsoft confirms function calling support (requires Ollama 0.5.13+). Listed in Ollama's Tools category, but has known reliability issues with parallel tool calling (ollama#9437); the phi4-mini:3.8b-fp16 variant is more reliable. Best for the categorization helper only, not reliable enough for agentic batch planning with chained tool calls. |

Why Qwen3:8b specifically? Docker's evaluation tested 21 models on 3,570 real-world function-calling scenarios. Qwen3-8B achieved F1 0.933, nearly matching GPT-4's 0.974. The next-best sub-10B model (Llama 3.1 8B) scored lower. Qwen3's 32K native context (extendable to 131K via YaRN) is more than sufficient for the batch planner which feeds in 10–20 notes at once. Its dual-mode reasoning (thinking + non-thinking) lets us use non-thinking mode for reliable tool emission (thinking mode has a known issue where ~60% of planned tool calls fail to emit). Note: Qwen3.5 (newer) had tool-calling issues in Ollama, now fixed in Ollama v0.17.6 — but Qwen3 remains the more battle-tested choice as of March 2026.

Future watch: Qwen3.5:4B. In JD Hodges' 2026 local LLM tool-calling evaluation (13 models), Qwen3.5:4B achieved a 97.5% pass rate at only 3.4GB — the best score among all models tested. Ollama's tool-calling bugs (#14493, #14745) were fixed in v0.17.6, making it a viable candidate to replace phi4-mini as the lightweight option and potentially challenge Qwen3:8B as primary.

3.2 Frontier (Cloud)

| | Model | Cost (per 1M in/out) | Context | Why |
|---|---|---|---|---|
| Primary | gpt-4o-mini | $0.15 / $0.60 | 128K | Supports both response_format: { type: "json_schema" } (guaranteed schema compliance via constrained decoding) and tools with strict: true. Battle-tested function calling since July 2024, with the longest production track record among small frontier models. gpt-4.1-nano is a ready drop-in replacement if OpenAI sunsets the 4o family. |
| Budget | gpt-4.1-nano | $0.10 / $0.40 | 1M | Cheapest OpenAI model. Outperforms gpt-4o-mini on several benchmarks (MMLU 80.1%). Supports strict: true on tool definitions. Does NOT support response_format: json_schema in Chat Completions, but this should be fine since we use function calling for structured output anyway. |

Cost for personal use: Categorizing 100 notes in a batch ≈ 50K input + 10K output tokens ≈ $0.014 (~1.4 cents) with gpt-4o-mini. Even running 10 batches/day for a month would cost under $5. Cost is effectively a non-factor.

Why OpenAI models as the default frontier provider? Three reasons:

  1. Price — gpt-4.1-nano at $0.10/$0.40 is the cheapest viable option with strict tool support. For comparison: Gemini 2.5 Flash is $0.30/$2.50, Claude Haiku 4.5 is $1.00/$5.00, DeepSeek V3 is $0.27/$1.10.
  2. Native API format — Ollama implements the OpenAI /v1/chat/completions format natively, so both local and cloud use a single client class. While Gemini now offers an OpenAI-compatible endpoint, Ollama's compatibility is tested against OpenAI's format specifically — fewer edge cases.
  3. Ecosystem maturity — gpt-4o-mini has the longest production track record for function calling among small frontier models. More libraries and community testing coverage.

Note: Gemini now also offers guaranteed structured output via constrained decoding — this is no longer unique to OpenAI. However, the price and format advantages remain.

Why not Claude Haiku? Claude Haiku 4.5 at $1.00/$5.00 per 1M tokens is 10x more expensive on input than gpt-4.1-nano for comparable quality on classification tasks, and requires an OpenRouter adapter for OpenAI-compatible endpoints, adding latency and a dependency.


4. LLM Prompts, Inputs, and Outputs

The plugin uses two separate LLM interactions with different purposes:

4.1 Categorization Helper

Purpose: For individual notes during real-time suggestions. Takes a single note + context, outputs structured JSON with notebook/tag suggestions. This is used when the user opens a note and the embedding-based classifier's confidence is borderline.

Why two separate prompts instead of one? The proposal described a single LLM interaction for batch action planning (Section 4.9). While working through the prompt design for this response, I realised the single-note case (cold-start fallback, mentioned in Section 4.6.3) deserves its own dedicated prompt — a categorization helper for single-note, low-latency suggestions (called inline when a note is opened). The batch planner handles multi-note, batch operations (called during "Analyse All"). Different contexts, different output formats, different latency requirements.

System Prompt

You are a note categorization assistant for a personal knowledge management app. Your task is to suggest the best notebook and tags for a given note based on its content and the user's existing organizational structure.

Rules:
- You may ONLY suggest notebooks from the provided list, OR recommend creating a new one.
- You may ONLY suggest tags from the provided vocabulary, OR recommend creating new tags (max 2 new).
- If the note does not clearly belong in any existing notebook, set notebook_action to "none".
- If you cannot confidently suggest tags, return an empty tags array.
- Confidence must honestly reflect your certainty. Do not inflate scores.
- It is better to suggest no action than a wrong action.
- Output ONLY the JSON object matching the schema. No other text.

User Prompt

<note>
Title: {note_title}
Current notebook: {current_notebook}
Current tags: {current_tags}
Body:
{note_body_first_500_tokens}
</note>

<context>
Notebooks: {notebook_names_list}
Tag vocabulary: {tag_list}
</context>

<embedding_hints>
The following candidates were identified by vector similarity search.
They may or may not be appropriate — use your judgment based on the note's actual content.

1. Notebook "{hint_notebook_1}" (similarity: {score_1})
2. Notebook "{hint_notebook_2}" (similarity: {score_2})
3. Notebook "{hint_notebook_3}" (similarity: {score_3})

Tag candidates: {hint_tags}
</embedding_hints>

Why these inputs:

  • Note title + first ~500 tokens of body: Captures the topic without sending entire notes (limits token cost, respects privacy).
  • Current notebook + current tags: So the LLM knows what's already applied — avoids suggesting what's already there.
  • Notebook/tag lists: Grounds the LLM in the user's actual organizational vocabulary. Without this, it would hallucinate notebook names.
  • Embedding hints placed AFTER note content: LLMs exhibit position bias — primacy effects cause early information to disproportionately influence outputs (Guo & Vosoughi, "Serial Position Effects of LLMs", ACL 2025 Findings). By placing the note content first, the LLM forms its own impression before seeing the embedding suggestions, reducing over-reliance on the similarity scores. Anthropic's prompt engineering best practices recommend putting long content first with queries at the end, citing up to 30% quality improvement in tests with complex multi-document inputs.
  • XML delimiters (<note>, <context>, <embedding_hints>): Clearly separate different types of input. Anthropic's documentation explicitly recommends XML tags for structuring complex prompts, as they help Claude parse inputs unambiguously — reducing misinterpretation when mixing instructions, context, and variable inputs. Research on prompt formatting impact (He et al., arxiv 2411.10541) shows up to 40% performance variation across formats (tested plain text, Markdown, JSON, YAML — XML was not tested, but the finding underscores that deliberate structural formatting matters).
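To make the assembly order concrete, here is a sketch of building the user prompt above (field names follow the template; truncation of the body to ~500 tokens is assumed to happen upstream):

```typescript
// Assemble the categorization-helper user prompt: note content first,
// embedding hints last, sections separated by XML delimiters.
interface NoteInput {
  title: string;
  notebook: string;
  tags: string[];
  body: string; // already truncated to ~500 tokens upstream
}

interface Hints {
  notebooks: { name: string; score: number }[];
  tags: string[];
}

function buildUserPrompt(
  note: NoteInput,
  notebooks: string[],
  vocab: string[],
  hints: Hints,
): string {
  const hintLines = hints.notebooks
    .map((h, i) => `${i + 1}. Notebook "${h.name}" (similarity: ${h.score.toFixed(2)})`)
    .join('\n');
  return [
    '<note>',
    `Title: ${note.title}`,
    `Current notebook: ${note.notebook}`,
    `Current tags: ${note.tags.join(', ')}`,
    'Body:',
    note.body,
    '</note>',
    '',
    '<context>',
    `Notebooks: ${notebooks.join(', ')}`,
    `Tag vocabulary: ${vocab.join(', ')}`,
    '</context>',
    '',
    '<embedding_hints>',
    hintLines,
    `Tag candidates: ${hints.tags.join(', ')}`,
    '</embedding_hints>',
  ].join('\n');
}
```

Because the sections are joined in a fixed order, the note content is always seen before the embedding hints, implementing the position-bias mitigation described above.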

Output Schema (JSON)

{
  "type": "object",
  "properties": {
    "reasoning": {
      "type": "string",
      "description": "2-3 sentences explaining categorization logic"
    },
    "notebook": {
      "type": "object",
      "properties": {
        "action": {
          "type": "string",
          "enum": ["move", "create_and_move", "none"]
        },
        "target": {
          "anyOf": [{"type": "string"}, {"type": "null"}],
          "description": "Existing notebook name, new name if create_and_move, or null if none"
        },
        "confidence": {
          "type": "number",
          "description": "0.0 to 1.0"
        }
      },
      "required": ["action", "target", "confidence"],
      "additionalProperties": false
    },
    "tags": {
      "type": "object",
      "properties": {
        "add": {
          "type": "array",
          "items": { "type": "string" },
          "description": "Up to 5 tag suggestions"
        },
        "confidence": {
          "type": "number",
          "description": "0.0 to 1.0"
        }
      },
      "required": ["add", "confidence"],
      "additionalProperties": false
    }
  },
  "required": ["reasoning", "notebook", "tags"],
  "additionalProperties": false
}

Why this output structure:

  • reasoning field: Forces concise chain-of-thought. Chain-of-thought prompting (Wei et al., NeurIPS 2022) improves reasoning quality; for classification tasks with implicit signals specifically, the THOR framework (Fei et al., ACL 2023 Short Papers) demonstrates that structured multi-hop reasoning chains significantly improve implicit sentiment classification — analogous to note categorization where topical signals may be implicit in the text. Requiring explanation forces the model to justify its choice, catching its own errors on ambiguous cases.
  • Separate confidence for notebook vs tags: These are independent decisions with different uncertainty profiles. A note might clearly belong in "Recipes" (confidence 0.95) but tag selection might be ambiguous (confidence 0.6).
  • action: "none" option: Explicit abstention path. Without this, LLMs tend to force a suggestion even when uncertain — a known failure mode in classification tasks.
  • create_and_move: Handles the case where no existing notebook fits but the note clearly has a topic. The plugin would create the notebook, then move the note.
  • Up to 5 tags: Matches our proposal's "maximum 5 suggestions per note" to avoid choice overload. This is enforced via the system prompt rather than schema keywords, since OpenAI's strict mode does not enforce maxItems.

Example Input/Output

Input note: "Sourdough Starter Maintenance" — body about feeding schedules, temperature, signs of healthy starter.

Output:

{
  "reasoning": "This note covers sourdough starter maintenance, clearly a baking/cooking topic. The top embedding suggestion of 'Recipes' with 0.87 similarity aligns well. Both 'baking' and 'cooking' tags are appropriate.",
  "notebook": {
    "action": "move",
    "target": "Recipes",
    "confidence": 0.92
  },
  "tags": {
    "add": ["baking", "cooking"],
    "confidence": 0.88
  }
}

Abstention example — input note: "Random Thoughts 2024-03-15" (stream-of-consciousness, already in "Personal"):

{
  "reasoning": "Stream-of-consciousness journal entry covering unrelated topics. Already in 'Personal' which is the most appropriate notebook. No tags clearly apply — passing mentions of health (dentist) don't warrant a tag.",
  "notebook": {
    "action": "none",
    "target": null,
    "confidence": 0.85
  },
  "tags": {
    "add": [],
    "confidence": 0.80
  }
}

4.2 Agentic Batch Planner

Purpose: For the "Analyse All Notes" feature. Takes a batch of ~10-20 notes with their embedding-based suggestions, plans a coherent sequence of tool calls that the user reviews before execution.

System Prompt

You are a batch organization planner for a personal knowledge management app. You receive notes with embedding-based suggestions and must plan a coherent sequence of actions to organize them.

Planning rules:
- Analyze all notes before planning actions. Look for patterns and groupings.
- If multiple notes should go to the same NEW notebook, create it ONCE then move all relevant notes.
- Tool calls execute in the order you specify. Ensure dependencies are respected (create before move).
- You may skip notes where the suggestion is uncertain — use skip_note for these.
- Do not move notes already in an appropriate notebook.
- Do not add tags already present on a note.
- Prefer fewer, higher-confidence actions over many low-confidence ones.
- Maximum 50 tool calls per batch.

User Prompt

<notes>
{for each note in batch}
Note {i}:
  ID: {note_id}
  Title: {note_title}
  Current notebook: {current_notebook}
  Current tags: [{current_tags}]
  Body preview: {first_200_tokens}
  Embedding suggestions:
    Notebooks: {top_3_notebook_suggestions_with_scores}
    Tags: {top_5_tag_suggestions_with_scores}
    Confidence: {embedding_confidence}
{end for}
</notes>

<structure>
Notebooks: {notebook_names_with_ids}
Tags: {tag_names_with_ids}
</structure>

Plan your tool calls in dependency order.
For each note, decide: organize it or skip it.

Why these inputs:

  • Batch of notes (not one-by-one): The batch planner's value is seeing patterns across notes. If 5 notes are all about "Machine Learning", it can create one notebook and move all 5 — instead of 5 independent LLM calls each suggesting create_notebook.
  • Note IDs included: Required for tool call arguments. The LLM must reference exact IDs, not names.
  • Body preview limited to ~200 tokens per note: With 20 notes, that's ~4,000 tokens of body text. Combined with metadata and the system prompt, the total stays under ~6K tokens — well within Qwen3's 32K context and gpt-4o-mini's 128K.
  • Embedding suggestions with scores: Pre-computed hints from the embedding classifier give the LLM a strong starting point. The scores let it distinguish strong signals (0.9+) from weak ones (0.6).
  • Notebook/tag lists with IDs: Grounds tool call arguments in real Joplin entity IDs.

Tool Definitions (JSON Schema)

{
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "move_note",
        "description": "Move a note to a different notebook. The target notebook must already exist or be created by a prior create_notebook call.",
        "strict": true,
        "parameters": {
          "type": "object",
          "properties": {
            "note_id": {
              "type": "string",
              "description": "Joplin note ID"
            },
            "target_notebook_id": {
              "type": "string",
              "description": "Target notebook ID (existing or placeholder from create_notebook)"
            },
            "reasoning": {
              "type": "string",
              "description": "Why this note should be moved (1-2 sentences)"
            },
            "confidence": {
              "type": "number",
              "description": "0.0 to 1.0"
            }
          },
          "required": ["note_id", "target_notebook_id", "reasoning", "confidence"],
          "additionalProperties": false
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "add_tag",
        "description": "Add a tag to a note. Creates the tag if it doesn't exist.",
        "strict": true,
        "parameters": {
          "type": "object",
          "properties": {
            "note_id": {
              "type": "string",
              "description": "Joplin note ID"
            },
            "tag_name": {
              "type": "string",
              "description": "Tag name to add"
            },
            "reasoning": {
              "type": "string",
              "description": "Why this tag is appropriate"
            },
            "confidence": {
              "type": "number",
              "description": "0.0 to 1.0"
            }
          },
          "required": ["note_id", "tag_name", "reasoning", "confidence"],
          "additionalProperties": false
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "remove_tag",
        "description": "Remove a tag from a note. Only use when the tag is clearly incorrect or redundant.",
        "strict": true,
        "parameters": {
          "type": "object",
          "properties": {
            "note_id": {
              "type": "string",
              "description": "Joplin note ID"
            },
            "tag_name": {
              "type": "string",
              "description": "Tag name to remove (must currently exist on the note)"
            },
            "reasoning": {
              "type": "string",
              "description": "Why this tag should be removed"
            }
          },
          "required": ["note_id", "tag_name", "reasoning"],
          "additionalProperties": false
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "create_notebook",
        "description": "Create a new notebook. Use only when no existing notebook fits a group of notes.",
        "strict": true,
        "parameters": {
          "type": "object",
          "properties": {
            "notebook_name": {
              "type": "string",
              "description": "Name for the new notebook"
            },
            "parent_notebook_id": {
              "anyOf": [{"type": "string"}, {"type": "null"}],
              "description": "Parent notebook ID for nesting, or null for top-level"
            },
            "reasoning": {
              "type": "string",
              "description": "Why a new notebook is needed"
            },
            "placeholder_id": {
              "type": "string",
              "description": "Temporary ID (e.g. 'new_nb_1') for subsequent move_note calls to reference"
            }
          },
          "required": ["notebook_name", "parent_notebook_id", "reasoning", "placeholder_id"],
          "additionalProperties": false
        }
      }
    },
    {
      "type": "function",
      "function": {
        "name": "skip_note",
        "description": "Explicitly skip a note — take no action. Use when current organization is already appropriate or confidence is too low.",
        "strict": true,
        "parameters": {
          "type": "object",
          "properties": {
            "note_id": {
              "type": "string",
              "description": "Joplin note ID"
            },
            "reason": {
              "type": "string",
              "description": "Why no action is being taken"
            }
          },
          "required": ["note_id", "reason"],
          "additionalProperties": false
        }
      }
    }
  ]
}

Why these tools: The proposal described 4 API-mapped tools. skip_note is a fifth, planning-only tool that does not correspond to any Joplin API call — it exists solely to give the LLM an explicit abstention option, reducing false-positive actions. Here's the mapping:

  • move_note: Maps to PUT /notes/:id with { parent_id } — verified in packages/lib/services/rest/routes/notes.ts.
  • add_tag: Maps to POST /tags (create if needed) → POST /tags/:id/notes — verified in routes/tags.ts lines 15-19.
  • remove_tag: Maps to DELETE /tags/:id/notes/:noteId — verified in routes/tags.ts lines 21-26.
  • create_notebook: Maps to POST /folders with { title, parent_id } — verified in routes/folders.ts.
  • skip_note: Not a Joplin API call — it's a planning-only tool that makes abstention a first-class action. The "Know Your Limits" survey (TACL 2025) shows that adding a "None of the Above" option is effective at reducing erroneous commitments in LLMs. The recent AbstentionBench benchmark (Facebook Research, 2025) further demonstrates that even reasoning-tuned LLMs degrade by ~24% on abstention tasks — making an explicit tool-based mechanism more reliable than hoping the model will self-abstain. Without skip_note, LLMs tend to force actions on every note.

Why reasoning on every mutating tool? Forces the LLM to justify each action, which reduces hallucinated/spurious calls. The reasoning is displayed to the user in the suggestion panel alongside the Accept/Reject buttons.

Why confidence on move and add_tag but not remove_tag? Moves and tag additions are speculative — the model might be wrong, and the confidence score enables application-layer filtering (reject actions below 0.7). Removals, by contrast, are only suggested when a tag is clearly incorrect, so a confidence score would carry little extra signal.

Why placeholder_id on create_notebook? Solves the create-before-move dependency. The LLM can write create_notebook with placeholder_id: "new_nb_1", then subsequent move_note calls reference "new_nb_1" as the target. The plugin maps placeholders to real Joplin IDs after creation. This is an established pattern from the LLMCompiler framework (Kim et al., ICML 2024), which uses placeholder variables in DAG-based task planning to resolve dependencies between tool calls.
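The resolution pass could look like the sketch below (names illustrative): after the user approves the plan, create_notebook calls execute first and each placeholder is mapped to the real folder ID Joplin returns; later move_note calls that reference a placeholder are then rewritten to use the real ID.

```typescript
// Sketch of the placeholder-resolution pass for create-before-move dependencies.

type PlannedCall = { tool: string; arguments: Record<string, any> };

function resolvePlaceholders(
  plan: PlannedCall[],
  createdIds: Map<string, string>, // placeholder_id -> real Joplin folder ID
): PlannedCall[] {
  return plan.map(call => {
    if (call.tool !== 'move_note') return call;
    const target = call.arguments.target_notebook_id;
    const real = createdIds.get(target);
    if (!real) return call; // target is already a real Joplin ID
    return { ...call, arguments: { ...call.arguments, target_notebook_id: real } };
  });
}
```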

Example Batch Output

The example below shows the logical plan extracted from the LLM response. The actual wire format differs by provider — OpenAI returns message.tool_calls[].function.{name, arguments} with arguments as a JSON string, while Ollama returns arguments as a parsed object. The plugin normalises both formats into this common representation for validation and UI display.
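A minimal sketch of that normalisation step, assuming the field shapes described above (OpenAI nests `{name, arguments}` under `function` with `arguments` as a JSON string; Ollama returns `arguments` as a parsed object) — worth re-verifying against each provider's current docs:

```typescript
// Normalise provider-specific tool-call payloads into the common shape
// used for validation and UI display.

type NormalisedCall = { tool: string; arguments: Record<string, any> };

function normaliseToolCall(raw: any): NormalisedCall {
  const fn = raw.function ?? raw; // both providers nest under `function`
  const args = typeof fn.arguments === 'string'
    ? JSON.parse(fn.arguments) // OpenAI: arguments is a JSON string
    : fn.arguments;            // Ollama: arguments is already an object
  return { tool: fn.name, arguments: args };
}
```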

Given 4 notes: two about programming (in Inbox), one about travel (in Inbox), one random (in Personal):

[
  {
    "tool": "move_note",
    "arguments": {
      "note_id": "n001",
      "target_notebook_id": "nb06",
      "reasoning": "React Hooks content belongs in Tech Notes alongside other programming notes.",
      "confidence": 0.93
    }
  },
  {
    "tool": "add_tag",
    "arguments": {
      "note_id": "n001",
      "tag_name": "javascript",
      "reasoning": "React is a JavaScript framework.",
      "confidence": 0.92
    }
  },
  {
    "tool": "move_note",
    "arguments": {
      "note_id": "n002",
      "target_notebook_id": "nb06",
      "reasoning": "TypeScript generics is a programming topic — same destination as the React note.",
      "confidence": 0.91
    }
  },
  {
    "tool": "add_tag",
    "arguments": {
      "note_id": "n002",
      "tag_name": "javascript",
      "reasoning": "TypeScript is a superset of JavaScript.",
      "confidence": 0.85
    }
  },
  {
    "tool": "move_note",
    "arguments": {
      "note_id": "n003",
      "target_notebook_id": "nb04",
      "reasoning": "Kyoto trip itinerary belongs in Travel notebook.",
      "confidence": 0.96
    }
  },
  {
    "tool": "add_tag",
    "arguments": {
      "note_id": "n003",
      "tag_name": "travel-plans",
      "reasoning": "Active trip itinerary.",
      "confidence": 0.94
    }
  },
  {
    "tool": "skip_note",
    "arguments": {
      "note_id": "n004",
      "reason": "Low embedding confidence (0.45). Note is already in Personal which seems appropriate. No confident action possible."
    }
  }
]

The user sees all 7 planned actions in the sidebar panel with confidence indicators (green >0.85, yellow 0.7-0.85) and Accept/Reject buttons. Nothing executes until approved.


API Parameters

| Parameter | Categorization Helper | Batch Planner |
| --- | --- | --- |
| temperature | 0 | 0 |
| max_tokens | 300 | 4000 |
| Output method | response_format: json_schema (gpt-4o-mini); tools with strict: true wrapping the schema as a single tool (gpt-4.1-nano, which lacks the json_schema response format); or format with schema (Ollama) | tools with strict: true + tool_choice: "required" |

Why temperature 0? Both tasks are deterministic — we want the same input to produce the same output. Temperature 0 maximises reproducibility and schema adherence. Several OpenAI cookbook examples (fine-tuning for function calling, function calling with OpenAPI spec) use temperature=0 for tool-calling tasks, and their prompt engineering guide recommends temperature 0 for "factual use cases such as data extraction." Note: strict: true is the primary quality lever for structured output — temperature 0 is a complementary best practice.
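A hedged sketch of how the plugin might assemble these parameters per provider. Field names (`response_format`, `format`, `tool_choice`) follow the OpenAI and Ollama APIs as described in the table; the gpt-4.1-nano tool-wrapping variant is omitted for brevity, and exact shapes should be verified against current provider docs.

```typescript
// Illustrative request-parameter builder for the two LLM tasks.
// `schema` is the JSON Schema from section 4.1; `tools` are the tool
// definitions from section 4.2.

function requestParams(
  task: 'categorise' | 'batch',
  provider: 'openai' | 'ollama',
  schema: object,
  tools?: object[],
) {
  const base = { temperature: 0, max_tokens: task === 'categorise' ? 300 : 4000 };
  if (task === 'batch') {
    // Batch planner always uses tool calling, with a tool call required.
    return { ...base, tools, tool_choice: 'required' };
  }
  return provider === 'openai'
    ? { ...base, response_format: { type: 'json_schema', json_schema: { strict: true, schema } } }
    : { ...base, format: schema }; // Ollama structured output takes the schema directly
}
```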


Application-Layer Safeguards (on top of LLM validation)

These run in the plugin code regardless of which LLM is used:

  1. Confidence threshold: Reject any move_note or add_tag action with confidence < 0.7 (configurable). remove_tag has no confidence field — removals are only suggested when clearly incorrect, so no threshold filter is needed.
  2. Dependency validation: Verify any move_note referencing a placeholder_id has a preceding create_notebook
  3. Idempotency checks: Skip add_tag if tag already exists on note; skip move_note if note is already in target
  4. ID validation: All note/folder IDs verified via joplin.data.get() before execution
  5. JSON Schema validation: Every tool call validated against the schema; retry with error feedback (max 2); fall back to embedding-only on persistent failure
  6. Rate limiting: Max 50 tool calls per batch — take only the highest-confidence actions if exceeded
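Safeguards 1–3 can be expressed as a pure filter over the normalised plan, which keeps them easy to unit-test. This is a sketch with illustrative names: it assumes placeholder IDs follow a `new_nb_` prefix convention and that the plugin snapshots each note's current notebook and tags before planning.

```typescript
// Sketch of the confidence-threshold, dependency-validation, and idempotency
// safeguards applied to a planned batch before anything executes.

type Call = { tool: string; arguments: Record<string, any> };
type NoteState = { parentId: string; tags: Set<string> };

function applySafeguards(
  plan: Call[],
  notes: Map<string, NoteState>, // snapshot of current note state
  threshold = 0.7,
): Call[] {
  const created = new Set<string>();
  const kept: Call[] = [];
  for (const c of plan) {
    const a = c.arguments;
    if (c.tool === 'create_notebook') { created.add(a.placeholder_id); kept.push(c); continue; }
    // 1. Confidence threshold (move_note / add_tag only)
    if ((c.tool === 'move_note' || c.tool === 'add_tag') && a.confidence < threshold) continue;
    const note = notes.get(a.note_id);
    if (c.tool === 'move_note') {
      // 2. Dependency: a placeholder target must be created earlier in the plan
      if (a.target_notebook_id.startsWith('new_nb_') && !created.has(a.target_notebook_id)) continue;
      // 3. Idempotency: skip if the note is already in the target notebook
      if (note && note.parentId === a.target_notebook_id) continue;
    }
    if (c.tool === 'add_tag' && note && note.tags.has(a.tag_name)) continue;
    kept.push(c);
  }
  return kept;
}
```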

Sorry it took me a while, as I had to make sure everything was covered. Grateful for the feedback, and I'll keep looking for more ways to improve the proposal! Happy to iterate further if needed. I'll make sure to append your newest feedback to the draft ASAP :grinning_face_with_smiling_eyes:


Thank you so much for this detailed and comprehensive research and experimentation on running Transformers.js!! I really appreciate the effort you put into this and how deep you dove into the topic!! :slight_smile: It's clear that you're having fun and are curious about this.

I've read up to section 3, because the post is quite long and information-dense, and other contributors need feedback from me too. But I want to give you some right away since the submission deadline is nearing. Your findings and explanations are well written, and it's great to see your minimal POC - I love this!! Here are some of my thoughts and considerations:

  • Model load time looks quite fast, so there's less concern there. But for processing a large number of long notes, consider:
    • how you would design this from a UI/UX perspective
      • how/what/where to notify users when embedding is done
      • how to show progress tracking
      • being creative, fixing UX issues you find annoying, etc.

I will continue reading the rest of your answer tomorrow and give you another round of feedback on the LLM and agentic AI parts of your solution.


Thank you so much for the kind words and for taking the time to review — I really appreciate it!!! And yes, I'm genuinely having a blast diving into this :grinning_face_with_smiling_eyes:

Regarding the draft: I hit the Discourse character limit, so I wasn't able to append the latest revisions directly to the post. The revised proposal is saved locally, and I'll make sure the final submission on the GSoC website reflects all corrections.

On your UI/UX questions — here's how I'm thinking about the embedding experience for large/long note collections:

1. Progress tracking during embedding:

Rather than a vague spinner, the sidebar panel would show a determinate progress bar with contextual info:


Indexing notes... 342 / 1,247 (27%)
├─ Current: "Meeting Notes: Q2 Planning"
├─ Speed: ~10.2 notes/sec
├─ Elapsed: 0:34
└─ [Cancel]

The progress bar updates via postMessage from the Worker after each note completes — the same Worker ↔ Plugin IPC pattern validated in the POC and used in Joplin's official worker example plugin (packages/app-cli/tests/support/plugins/worker/).

I opted for elapsed time + percentage rather than an ETA countdown: after researching this, I found that remaining-time displays actually increase frustration compared to elapsed-time feedback, because users anchor on the countdown and get frustrated when it fluctuates. Elapsed time paired with a determinate percentage bar lets users mentally extrapolate without the system making a promise it might break. The speed indicator (notes/sec) still gives a rough sense of scale. Since notes vary wildly in length (34ms for a short note vs 1,000ms for a long doc page, as measured in the POC), the speed display uses an exponential moving average (alpha ~0.10) rather than a simple arithmetic mean. An EMA biases toward recent throughput, which better reflects current processing speed when note lengths vary; this is consistent with benchmarking of 14 ETA algorithms showing that EMA significantly outperforms naive averages for variable-speed workloads.
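The EMA update is tiny; a sketch of how the speed readout would be maintained (class and field names are illustrative):

```typescript
// Exponential moving average of embedding throughput. Each completed note
// contributes its instantaneous rate; the displayed value leans toward
// recent samples, so slow long notes don't distort the readout for long.

class EmaSpeed {
  private ema: number | null = null;
  constructor(private alpha = 0.1) {}

  // durationMs: time the worker spent embedding the latest note
  update(durationMs: number): number {
    const rate = 1000 / durationMs; // notes per second for this sample
    this.ema = this.ema === null ? rate : this.alpha * rate + (1 - this.alpha) * this.ema;
    return this.ema;
  }
}
```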

2. Notification when done:

  • If the user stays in Joplin: The sidebar panel transitions from the progress view to the suggestions view — "Done! 1,247 notes indexed. 23 suggestions ready for review." This is the natural flow since the panel is already open.

  • If the user switches away (or minimises Joplin): A toast notification via joplin.views.dialogs.showToast() — the plugin API exposes this with ToastType.Success to display a corner notification inside the app: "Indexing complete — 23 suggestions ready." When the user returns to Joplin, the toast is visible and the sidebar panel already shows results. For longer jobs, the panel itself retains a persistent "completed" banner so the user doesn't miss it even if the toast has already dismissed.

  • For incremental re-indexing (background, after sync): No notification unless new suggestions are generated. The sidebar panel content updates via setHtml() to show a subtle "3 new suggestions" indicator at the top of the suggestion list — the user sees it next time they glance at the panel. This avoids notification fatigue for routine background work.

3. Designing for the "large batch" UX:

The core UX problem with batch embedding is that it's a wait-then-act workflow — the user triggers "Analyse All", waits, then reviews. A few ideas to make this less painful:

  • Streaming suggestions: Don't wait until all 1,247 notes are embedded to show suggestions. As each note completes, immediately run centroid classification + KNN tagging against the already-indexed portion. Suggestions trickle into the panel in real-time — the user can start reviewing while embedding continues in the background. This turns a "wait 8 minutes then review" into "start reviewing after 10 seconds." Early suggestions might shift slightly as more notes get indexed (centroids update), but for a first pass this is much better than staring at a progress bar.

  • Smart ordering: Process notes most likely to need action first — notes in "Inbox" or the default notebook, recently created notes, untagged notes. This front-loads the useful suggestions so the user sees value immediately.

  • "Pause & Resume": If the user needs to close Joplin mid-indexing, the progress persists — the hash map and vector store are flushed to disk every 50 notes during bulk indexing (as described in section 4.4.3 of the proposal). On next launch, a banner in the panel: "Indexing paused at 342/1,247. [Resume] [Cancel]" — no lost work.

  • First-run experience: On first install with, say, 2,000 existing notes, a dedicated onboarding view in the sidebar panel: "Welcome! Let's index your notes. This takes about 3–4 minutes for your collection. You can keep using Joplin — we'll notify you when suggestions are ready." Sets expectations upfront rather than surprising them with a long-running background task.
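The smart-ordering idea above boils down to a priority score per note, processed highest-first. A sketch with placeholder weights (these are assumptions to tune, not measured values):

```typescript
// Illustrative priority score for smart ordering of the embedding queue.
// Higher scores are processed first so useful suggestions surface early.

type NoteMeta = { parentTitle: string; tagCount: number; createdMs: number };

function priority(note: NoteMeta, nowMs: number): number {
  let score = 0;
  if (note.parentTitle === 'Inbox') score += 3; // likely unfiled
  if (note.tagCount === 0) score += 2;          // untagged
  const ageDays = (nowMs - note.createdMs) / 86_400_000;
  if (ageDays < 7) score += 1;                  // recently created
  return score;
}
```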

4. One UX annoyance I'd love to fix:

The "Accept/Reject one-by-one" flow gets tedious when there are 50+ suggestions. Beyond just "Accept All / Reject All" (already in the proposal), I'm thinking about grouped actions — suggestions clustered by type: "Move 8 notes to 'Recipes'" as a single expandable card in the panel, rather than 8 individual move suggestions. The user can accept the whole group, expand to cherry-pick, or reject the batch. This mirrors Apple's iOS 12 notification grouping and Gmail's batch actions — both respect the user's time while keeping them in control.

One important nuance here: automation bias research warns that blanket "Accept All" options risk complacency — users stop actually reviewing individual items. To mitigate this, the group card would show a preview of 2–3 representative items from the group before allowing group-level acceptance, so the user has to at least glance at what they're approving. For groups where individual items have mixed confidence scores (e.g., some green >0.85, some yellow 0.7–0.85), the card would flag this and encourage expanding before accepting. This way grouping reduces cognitive load without sacrificing review quality.

Looking forward to your feedback on the LLM/agentic sections soon!! :grinning_face_with_smiling_eyes: :grinning_face_with_smiling_eyes:


@jellyfrostt Amazing! Thank you for your answer! I like the streaming suggestions idea, where results come in as they're processed rather than waiting for the full batch, and the progress tracking UI is also well thought out. Those are great UX ideas!

The deadline is coming soon so make sure to have everything ready and best of luck with submitting your proposal on the GSoC platform!! :slight_smile:


Got it! Thank you for all of your feedback thus far :grinning_face_with_smiling_eyes:
