Hi everyone, here is my update for this week!
It’s been a really productive week. I focused on setting up the official repository infrastructure and running some hands-on benchmarks to see how we can make note embeddings run as fast as possible on-device.
Progress
- Repository Setup & Initial PR: We set up the official plugin repository at joplin/plugin-note-categorization. I also opened my first PR (#1) to establish the core embedding infrastructure. What the PR does:
  - Webpack Worker Compilation: Configures Webpack to support compilation of our background Web Worker (src/worker/embedWorker.ts) using the `web` target.
  - Local ONNX WASM Bundling: Integrates a build script (tools/copyAssets.js) to copy `onnxruntime-web` WASM binaries to `dist/onnx-dist/`, allowing the Web Worker to load assets offline locally and bypass Electron’s CSP/CORS restrictions.
  - Joplin API Integration: Implements a paginated note reader (src/pipeline/noteReader.ts) that fetches note IDs, titles, and bodies in pages of 50.
  - Test Utility Command: Registers an `AI Categorise: Test Embedding` command under Joplin’s Tools menu to verify the model loading, warm-up, and note embedding pipeline.
- Experimental Branch (WebGPU & Token-based Chunking): To address the speed constraints I noticed last week, Bill (Ton Hoang) suggested exploring WebGPU acceleration and testing the `fp16` and `q8` quantized model precision. Following his advice, I created a separate branch (feat/tiktoken-chunking) to test these optimizations alongside token-based note chunking using `js-tiktoken` (splitting note contents into safe segments of 250 tokens).
- Inference Benchmarks (Big thanks to Bill (Ton Hoang) for the suggestions!): I ran speed tests comparing different precision types (`fp32`, `fp16`, `q8`) across WASM (CPU fallback) and WebGPU (GPU acceleration). I posted a very detailed breakdown of all the benchmark runs in this PR comment: Detailed Benchmark Comment. Here are the high-level takeaways:
  - fp16 + WebGPU was the absolute fastest, taking only ~43 ms per note (nearly 25x faster than our initial `fp32` baseline).
  - q8 + WASM (CPU) was the clear winner for CPU fallback, taking ~557 ms per note (almost twice as fast as `fp32` on CPU).
  - In real-world testing with chunked notes, WebGPU finished the entire batch in just 16 seconds, compared to 8 minutes on WASM CPU (fp16).
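For anyone curious what the paginated note reader boils down to, here is a minimal sketch of the paging loop. It assumes the Joplin Data API convention where each page response carries `items` and a `has_more` flag. The names `readAllNotes`, `NoteRecord`, `NotePage`, and the injected `fetchPage` callback are hypothetical and used only for illustration; this is not the actual code in src/pipeline/noteReader.ts.

```typescript
// Hypothetical shapes for illustration; the real reader may differ.
interface NoteRecord {
  id: string;
  title: string;
  body: string;
}

interface NotePage {
  items: NoteRecord[];
  has_more: boolean;
}

// Fetches a single page. Injected so the loop can run without a live Joplin instance.
type FetchPage = (page: number, limit: number) => Promise<NotePage>;

const PAGE_SIZE = 50; // page size mentioned in the update

async function readAllNotes(
  fetchPage: FetchPage,
  pageSize: number = PAGE_SIZE
): Promise<NoteRecord[]> {
  const notes: NoteRecord[] = [];
  let page = 1; // Joplin's Data API pages start at 1
  while (true) {
    const result = await fetchPage(page, pageSize);
    notes.push(...result.items);
    if (!result.has_more) break; // no further pages to request
    page += 1;
  }
  return notes;
}
```

Inside the plugin, `fetchPage` could plausibly wrap `joplin.data.get(['notes'], { fields: ['id', 'title', 'body'], limit, page })`. Keeping the fetch injectable makes the loop easy to unit test.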
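The token-based chunking on the experimental branch can be pictured as a sliding window over token IDs. The sketch below works on a plain array of token IDs so it stays independent of the tokenizer. The function name `chunkTokens` and the `overlap` parameter are assumptions for illustration (overlap handling is still being finalized), and only the 250-token window size comes from the branch itself.

```typescript
// Splits a token ID sequence into windows of at most `maxTokens` tokens.
// Consecutive windows share `overlap` tokens (0 means no overlap).
function chunkTokens(
  tokens: number[],
  maxTokens: number = 250,
  overlap: number = 0
): number[][] {
  if (maxTokens <= 0) {
    throw new Error("maxTokens must be positive");
  }
  if (overlap < 0 || overlap >= maxTokens) {
    throw new Error("overlap must be in the range [0, maxTokens)");
  }
  const chunks: number[][] = [];
  const step = maxTokens - overlap; // how far the window advances each time
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + maxTokens));
    // Stop once a window has reached the end, so no redundant tail chunk is emitted.
    if (start + maxTokens >= tokens.length) break;
  }
  return chunks;
}
```

With `js-tiktoken`, the token IDs could come from something like `getEncoding("cl100k_base").encode(text)`, with each chunk turned back into text via `decode`. Which encoding fits the embedding model best is an open question here, so treat that choice as a placeholder.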
Plan for Next Week
I will be working on breaking down the core pipeline into logical pull requests:
- Finalize Chunking & Optimizations: Complete the token-based chunking logic (handling overlap and boundary details) and merge the q8 CPU fallback, WebGPU device selection, and tokenizer setup.
- Note Vector Aggregation & Title Weighting: Implement the averaging of chunk vectors to represent note bodies, filter out generic titles, and use cosine similarity to calculate descriptive title weights dynamically.
- Caching & Incremental Indexing: Set up the local vector database storage (via vectra) and add SHA-256 hashing to ensure we only run embeddings on new or modified notes.
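To make the aggregation step more concrete, here is a rough sketch of averaging chunk vectors and computing cosine similarity. The blending rule in `combineTitleAndBody` (title weight scaled by title/body similarity, capped by `maxTitleWeight`) is purely a hypothetical placeholder of mine, since the actual weighting scheme is not decided yet. All function names are illustrative.

```typescript
// Element-wise mean of equally sized vectors (one vector per chunk).
function meanVector(vectors: number[][]): number[] {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const dim = vectors[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("vectors must share one dimension");
    for (let i = 0; i < dim; i++) sum[i] += v[i];
  }
  return sum.map((x) => x / vectors.length);
}

// Cosine similarity in [-1, 1]; returns 0 if either vector has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must share one dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// ASSUMPTION: one possible weighting rule, not the project's final design.
// A title that agrees with the body gets up to `maxTitleWeight` of the blend;
// an unrelated (generic) title contributes almost nothing.
function combineTitleAndBody(
  titleVec: number[],
  bodyVec: number[],
  maxTitleWeight: number = 0.5
): number[] {
  const similarity = Math.max(0, cosineSimilarity(titleVec, bodyVec));
  const w = maxTitleWeight * similarity;
  return bodyVec.map((x, i) => (1 - w) * x + w * titleVec[i]);
}
```

Averaging keeps the note vector in the same space as the chunk vectors, so it can be compared against other notes with the same cosine function.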
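The incremental indexing idea is simple enough to sketch as well: hash each note's content with SHA-256 and only re-embed when the hash differs from the stored one. This sketch assumes Node's built-in `crypto` module is available to the plugin and uses a plain `Map` as a stand-in for wherever the hashes end up being stored. The names `sha256Hex`, `noteHash`, and `selectNotesToEmbed` are hypothetical.

```typescript
import { createHash } from "crypto";

interface NoteContent {
  id: string;
  title: string;
  body: string;
}

// Hex-encoded SHA-256 digest of a UTF-8 string.
function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Hashes title and body together; JSON encoding keeps the two fields unambiguous.
function noteHash(note: NoteContent): string {
  return sha256Hex(JSON.stringify([note.title, note.body]));
}

// Returns the notes that are new or whose content changed since the last run.
// `storedHashes` maps note ID to the hash recorded when it was last embedded.
function selectNotesToEmbed(
  notes: NoteContent[],
  storedHashes: Map<string, string>
): NoteContent[] {
  return notes.filter((note) => storedHashes.get(note.id) !== noteHash(note));
}
```

After a note is embedded successfully, its new hash would be written back next to the vector so that the next run skips it.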
Problem
- Linux WebGPU Support: WebGPU doesn't work out-of-the-box inside Electron on Linux without passing extra flags (I had to run Joplin with `--no-sandbox --enable-unsafe-webgpu --enable-features=Vulkan` to get it working). Fortunately, it works seamlessly on Windows without any manual flags. A big thank you to @akshajrawat for testing it on Windows to confirm this! Windows and macOS users should be able to get WebGPU hardware acceleration out of the box. Linux users who run without the flags will fall back silently to CPU WASM, where our default `q8 + WASM` fallback will still give them a roughly 2x speed improvement over the initial `fp32` baseline.
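The silent fallback could be driven by a small capability check like the sketch below. It relies on the standard WebGPU entry point `navigator.gpu.requestAdapter()`, which resolves to `null` when no suitable adapter is available. The `pickBackend` function, the `NavigatorLike` type, and the returned `device`/`dtype` labels are hypothetical names that mirror the configurations benchmarked above, not the branch's actual code.

```typescript
// Minimal structural types so the check can be exercised outside a browser.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}

interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend =
  | { device: "webgpu"; dtype: "fp16" }
  | { device: "wasm"; dtype: "q8" };

// Prefers fp16 on WebGPU when an adapter is available; otherwise uses q8 on WASM.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  try {
    const adapter = nav.gpu ? await nav.gpu.requestAdapter() : null;
    if (adapter) {
      return { device: "webgpu", dtype: "fp16" };
    }
  } catch {
    // Adapter request failed (for example, WebGPU disabled); fall through to CPU.
  }
  return { device: "wasm", dtype: "q8" };
}
```

In the worker this would be called as `pickBackend(navigator)` before the model is loaded, so a Linux user without the flags simply ends up on the `q8 + WASM` path.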