Hello everyone! Hope you all had a great week. Here's a quick update on where we're at with the note categorization plugin, and I've got some really exciting progress to share!
Progress from last week:
As planned last week, I worked on 3 PRs:
- Chunking (#5): The token-based chunking and the WebGPU/WASM fallback selection are done.
- Vector Aggregation (#6): Opened the PR for averaging chunk vectors, filtering out generic titles, and blending titles with body vectors.
- Caching: The local cache implementation using vectra and SHA-256 hashing is complete. I'm just finishing up testing and will open a PR soon.
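For anyone who wants a concrete picture of the aggregation and cache-key ideas above, here is a rough sketch. It is not the code from the PRs: the function names (`contentHash`, `averageVectors`, `blend`) and the `titleWeight` parameter are hypothetical, and it assumes Node's built-in `crypto` module is available in the plugin environment.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: SHA-256 of the note text, usable as a cache key so
// that notes whose content has not changed can skip re-embedding.
function contentHash(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Element-wise mean of equal-length chunk vectors (one vector per chunk).
function averageVectors(vectors: number[][]): number[] {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const dim = vectors[0].length;
  const mean = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("dimension mismatch");
    for (let i = 0; i < dim; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}

// Hypothetical blend: weighted sum of the title vector and the body vector.
// A titleWeight of 0 ignores the title, which is one way to handle
// generic titles that carry no useful signal.
function blend(title: number[], body: number[], titleWeight: number): number[] {
  if (title.length !== body.length) throw new Error("dimension mismatch");
  return body.map((b, i) => titleWeight * title[i] + (1 - titleWeight) * b);
}
```

In this sketch a note vector would be `blend(titleVec, averageVectors(chunkVecs), w)`, and the cache would store it under `contentHash(noteText)`, so a changed hash signals that the note needs re-indexing.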
Plan for this week:
- Local Vector Caching & Incremental Indexing: Get the caching PR (currently in testing) opened, reviewed, and merged.
- UMAP Integration with DruidJS: Set up DruidJS to project our averaged note vectors down into UMAP coordinates, ensuring we use a fixed random seed so the output coordinates are stable and reproducible.
- UMAP Pipeline Verification: Run and test the full pipeline (embedding → note vector → UMAP) across different note collection sizes (small, medium, large) to check for stability and performance.
No major problems came up this week!