Weekly Update 1: Laying the Pipeline Foundation (Chunking & Vector Caching)

Hello everyone! Hope you all had a great week. Here's a quick update on where things stand with the note categorization plugin, and there's some exciting progress to share!

Progress from last week:

As planned last week, I worked on 3 PRs:

  • Chunking (#5): The token-based chunking and the WebGPU/WASM fallback selection are done.
  • Vector Aggregation (#6): Opened the PR for averaging chunk vectors, filtering out generic titles, and blending title vectors with body vectors.
  • Caching: The local cache implementation using vectra and SHA-256 hashing is complete. I'm just finishing up testing and will open a PR soon.
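For anyone curious what the aggregation step in #6 roughly looks like, here is a minimal sketch. It is not the code from the PR: the function names, the list of generic titles, and the 0.3 title weight are all assumptions made up for illustration.

```typescript
// Illustrative sketch only: names, the generic-title list, and the default
// title weight are assumptions, not the plugin's actual implementation.
type Vector = number[];

// Element-wise mean of the chunk vectors of one note.
function averageVectors(vectors: Vector[]): Vector {
  if (vectors.length === 0) throw new Error("need at least one vector");
  const dim = vectors[0].length;
  const sum = new Array<number>(dim).fill(0);
  for (const v of vectors) {
    if (v.length !== dim) throw new Error("dimension mismatch");
    for (let i = 0; i < dim; i++) sum[i] += v[i];
  }
  return sum.map((x) => x / vectors.length);
}

// Hypothetical list of titles that carry no useful signal.
const GENERIC_TITLES = new Set(["untitled", "new note", "note"]);

function isGenericTitle(title: string): boolean {
  return GENERIC_TITLES.has(title.trim().toLowerCase());
}

// Weighted blend of title and body vectors; falls back to the body vector
// alone when no title vector is supplied.
function blendTitleAndBody(
  titleVec: Vector | null,
  bodyVec: Vector,
  titleWeight = 0.3
): Vector {
  if (titleVec === null) return bodyVec.slice();
  if (titleVec.length !== bodyVec.length) throw new Error("dimension mismatch");
  return bodyVec.map((b, i) => titleWeight * titleVec[i] + (1 - titleWeight) * b);
}

// Full note vector: average the chunks, then blend in the title vector
// unless the title is generic.
function noteVector(
  title: string,
  titleVec: Vector,
  chunkVecs: Vector[],
  titleWeight = 0.3
): Vector {
  const body = averageVectors(chunkVecs);
  return blendTitleAndBody(isGenericTitle(title) ? null : titleVec, body, titleWeight);
}
```

With this shape, a note titled "Untitled" ends up represented by its body average only, while a descriptive title nudges the note vector toward the title embedding.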

Plan for this week:

  • Local Vector Caching & Incremental Indexing: Get the caching PR (currently in testing) opened, reviewed, and merged.
  • UMAP Integration with DruidJS: Set up DruidJS to project our averaged note vectors down into UMAP coordinates, ensuring we use a fixed random seed so the output coordinates are stable and reproducible.
  • UMAP Pipeline Verification: Run and test the full pipeline (embedding → note vector → UMAP) across different note collection sizes (small, medium, large) to check for stability and performance.
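On the fixed-seed point: UMAP relies on random sampling, so two runs only produce the same coordinates if the random number stream is the same. The sketch below shows the general idea with a small seeded generator (mulberry32). It does not use DruidJS, and `randomInit2D` is a hypothetical helper that only stands in for any step that draws random numbers.

```typescript
// Illustrative sketch only: this is not DruidJS's API. It shows why a fixed
// seed makes a randomized step reproducible.

// mulberry32: a small deterministic PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical stand-in for a randomized layout step: one random 2D point
// per note, drawn from the seeded generator.
function randomInit2D(count: number, seed: number): [number, number][] {
  const rand = mulberry32(seed);
  const points: [number, number][] = [];
  for (let i = 0; i < count; i++) {
    points.push([rand(), rand()]);
  }
  return points;
}

// Same seed gives the same starting layout on every run.
const runA = randomInit2D(5, 42);
const runB = randomInit2D(5, 42);
console.log(JSON.stringify(runA) === JSON.stringify(runB));
```

The same check (run twice with one seed, compare the output) is a cheap way to verify stability when testing the full pipeline on the small, medium, and large collections.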

No major problems came up this week!
