GSoC 2026 Proposal Draft – Idea 3: AI-based-Note-categorization

Thanks for the review @shikuz!

1. Context window table: yes, those are swapped. The correct values are: BGE-small-en-v1.5 → 512 tokens, all-MiniLM-L6-v2 → 256 tokens (silently truncates longer input).

2. Mobile: explicitly out of scope. The pipeline depends on Node.js native modules (sqlite3, ONNX Runtime) that are unavailable in the mobile sandbox. The vector store sits behind an abstraction layer though, so a future contributor could swap sqlite3 for sql.js without touching the embedding or clustering logic.

3. New note: no full re-run

  1. onNoteChange() fires → note embedded, vector saved to sqlite3 (< 1 second)
  2. New vector compared against stored centroids → tentative assignment, no re-clustering
  3. Full re-analysis only triggers on manual Re-analyse click, or when 5%+ of collection has changed

4. Real collection test: yes. I built a working Joplin plugin prototype with an embedded clustering pipeline. It validates the core architecture before any production scaling.
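To make point 3 concrete, here is a minimal sketch of the incremental path: a new vector is assigned to its nearest stored centroid, and a full re-analysis is suggested only once the share of changed notes reaches the threshold. The function names, the use of cosine similarity, and the `threshold` parameter are my assumptions for illustration and are not taken from the prototype.

```typescript
// Hypothetical helper types and names, for illustration only.
type Vector = number[];

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: Vector, b: Vector): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / Math.sqrt(normA * normB);
}

// Tentative assignment: index of the most similar centroid, no re-clustering.
// Returns -1 when there are no centroids yet.
function assignToNearestCentroid(vector: Vector, centroids: Vector[]): number {
  let best = -1;
  let bestScore = -Infinity;
  centroids.forEach((centroid, index) => {
    const score = cosineSimilarity(vector, centroid);
    if (score > bestScore) {
      bestScore = score;
      best = index;
    }
  });
  return best;
}

// Full re-analysis is suggested once the changed share reaches the threshold
// (5% by default, matching the figure in point 3).
function needsFullReanalysis(
  changedCount: number,
  totalCount: number,
  threshold = 0.05
): boolean {
  if (totalCount === 0) return false;
  return changedCount / totalCount >= threshold;
}
```

For example, `assignToNearestCentroid([1, 0.1], [[1, 0], [0, 1]])` picks the first centroid, and `needsFullReanalysis(10, 100)` reports that a re-run is due.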

Demo

There is a 10 MB limit on video uploads, so I have uploaded only the last part. Please see the full demo video at

data.json (100 notes) 
  ↓
Embedding extraction (BGE-small-en-v1.5 via Transformers.js in Web Worker)
  ↓
Optional dimensionality reduction (UMAP: 384-dim → 5-dim for tighter separation)
  ↓
K-Means clustering (K=2 to adaptive max)
  ↓
Silhouette scoring (automatic K selection without manual inspection)
  ↓
Final clustering + Benchmark UI (sidebar visualization with metrics)
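The K-selection step in the pipeline above can be sketched roughly as follows: run K-Means for each candidate K, score the result with the mean silhouette coefficient, and keep the best K. This is a simplified stand-in, not the prototype's code. It uses a deterministic farthest-point initialisation and Euclidean distance, skips the UMAP step, and takes `maxK` as a plain parameter. All names are hypothetical.

```typescript
type Point = number[];

function dist(a: Point, b: Point): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += (a[i] - b[i]) ** 2;
  return Math.sqrt(s);
}

// Deterministic farthest-point initialisation (assumed here so runs are repeatable).
function initCentroids(points: Point[], k: number): Point[] {
  const centroids: Point[] = [points[0].slice()];
  while (centroids.length < k) {
    let bestIdx = 0;
    let bestD = -1;
    points.forEach((p, i) => {
      const d = Math.min(...centroids.map((c) => dist(p, c)));
      if (d > bestD) { bestD = d; bestIdx = i; }
    });
    centroids.push(points[bestIdx].slice());
  }
  return centroids;
}

// Plain Lloyd iterations; returns one cluster label per point.
function kMeans(points: Point[], k: number, maxIter = 50): number[] {
  let centroids = initCentroids(points, k);
  let labels: number[] = new Array<number>(points.length).fill(-1);
  for (let iter = 0; iter < maxIter; iter++) {
    const next = points.map((p) => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
      }
      return best;
    });
    const changed = next.some((label, i) => label !== labels[i]);
    labels = next;
    if (!changed) break;
    centroids = centroids.map((old, c) => {
      const members = points.filter((_, i) => labels[i] === c);
      if (members.length === 0) return old; // keep an empty cluster's centroid
      return old.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return labels;
}

// Mean silhouette coefficient; singleton clusters contribute 0.
function silhouette(points: Point[], labels: number[], k: number): number {
  let total = 0;
  points.forEach((p, i) => {
    const sums: number[] = new Array<number>(k).fill(0);
    const counts: number[] = new Array<number>(k).fill(0);
    points.forEach((q, j) => {
      if (i === j) return;
      sums[labels[j]] += dist(p, q);
      counts[labels[j]]++;
    });
    const own = labels[i];
    if (counts[own] === 0) return;
    const a = sums[own] / counts[own]; // mean distance within own cluster
    let b = Infinity; // mean distance to the nearest other cluster
    for (let c = 0; c < k; c++) {
      if (c !== own && counts[c] > 0) b = Math.min(b, sums[c] / counts[c]);
    }
    const denom = Math.max(a, b);
    if (b !== Infinity && denom > 0) total += (b - a) / denom;
  });
  return total / points.length;
}

// Try K = 2..maxK and keep the K with the highest silhouette score.
function selectK(points: Point[], maxK: number): { k: number; score: number } {
  let best = { k: 2, score: -Infinity };
  for (let k = 2; k <= Math.min(maxK, points.length - 1); k++) {
    const score = silhouette(points, kMeans(points, k), k);
    if (score > best.score) best = { k, score };
  }
  return best;
}
```

On three well-separated groups of points, `selectK(points, 5)` should settle on K = 3. A production version would more likely use k-means++ with several restarts, since a single deterministic start can land in a poor local optimum.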

Repository for my clustering-phase testing (with dummy data, not real-time notes) :slight_smile:

Will push the corrected proposal with the table fix shortly.

Phase # 1 System Architecture

Phase # 2 System Architecture

I would also like @HahaBill to see my work :blush: