Weekly Update 5: Integrating Native AI & Cluster Tags

Hey everyone! Hope you all had a great week.

This week, I started by working on automatic tag generation for our clusters and opened PR #23. I wrote a simple, dependency-free TF-IDF extractor in TypeScript that cleans the note text, filters out common stop words, and picks the most unique keywords for each cluster. These are now displayed as tags directly on the cluster cards in our React UI.
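For readers who want a concrete picture, here is a minimal sketch of what a dependency-free TF-IDF tag extractor could look like. It is not the code from PR #23: the function names, the tiny stop-word list, the smoothed IDF formula, and the choice to treat each cluster as a single document are all assumptions made for illustration.

```typescript
// Hypothetical sketch; names, stop words, and scoring details are assumptions.
const STOP_WORDS = new Set([
  'the', 'a', 'an', 'and', 'or', 'of', 'to', 'in', 'is', 'it', 'for', 'on', 'with', 'this', 'that',
]);

// Lowercase, strip punctuation, and drop stop words and very short tokens.
// Simplification: only ASCII letters and digits are kept.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter((w) => w.length > 2 && !STOP_WORDS.has(w));
}

// Each cluster is a list of note bodies; all notes in a cluster form one document.
function clusterTags(clusters: string[][], topK = 3): string[][] {
  const docs = clusters.map((notes) => tokenize(notes.join(' ')));

  // Document frequency: in how many clusters does each term appear?
  const docFreq = new Map<string, number>();
  docs.forEach((tokens) => {
    new Set(tokens).forEach((term) => {
      docFreq.set(term, (docFreq.get(term) ?? 0) + 1);
    });
  });

  return docs.map((tokens) => {
    // Term frequency within this cluster.
    const tf = new Map<string, number>();
    tokens.forEach((t) => tf.set(t, (tf.get(t) ?? 0) + 1));

    const scored = Array.from(tf.entries()).map(([term, count]) => {
      // Smoothed IDF so a term present in every cluster still gets a positive weight.
      const idf = Math.log((1 + docs.length) / (1 + (docFreq.get(term) ?? 0))) + 1;
      return { term, score: (count / tokens.length) * idf };
    });

    // Highest score first; ties are broken alphabetically for stable output.
    scored.sort((a, b) => b.score - a.score || a.term.localeCompare(b.term));
    return scored.slice(0, topK).map((s) => s.term);
  });
}
```

Calling `clusterTags(clusters, 3)` with one array of note bodies per cluster returns up to three candidate tags per cluster, in the same order as the input.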

After that, since Joplin introduced native on-device AI search, I shifted focus to integrating it (PR #24). The new API doesn't expose raw embedding vectors directly; it only lets us query for similar notes. So I built a distance matrix by querying similar notes for each note and converting their similarity scores into distances (1 - score). I also updated UMAP to support a custom distance function, so it can project this matrix into a 10D space that our clustering algorithms can use.
I also set it up as a hybrid pipeline, so it automatically uses Joplin's native search when it's ready, or falls back to our local ONNX web worker when it isn't.
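To make the distance-matrix step concrete, here is a rough sketch of turning per-note similarity results into a matrix using distance = 1 - score. The `SimilarHit` shape and the function name are hypothetical, not the plugin's actual types. Two further assumptions of the sketch: pairs that never show up in each other's results get the maximum distance of 1, and the matrix is made symmetric by keeping the smaller of the two distances.

```typescript
// Hypothetical shape for one search hit; the real result type may differ.
interface SimilarHit {
  noteId: string;
  score: number; // similarity, assumed to lie in [0, 1]
}

function buildDistanceMatrix(
  noteIds: string[],
  hitsByNote: Map<string, SimilarHit[]>,
): number[][] {
  const index = new Map<string, number>();
  noteIds.forEach((id, i) => index.set(id, i));

  const n = noteIds.length;
  // Assumption: unknown pairs start at the maximum distance 1; the diagonal is 0.
  const dist: number[][] = Array.from({ length: n }, (_row, i) =>
    Array.from({ length: n }, (_col, j) => (i === j ? 0 : 1)),
  );

  noteIds.forEach((id, i) => {
    for (const hit of hitsByNote.get(id) ?? []) {
      const j = index.get(hit.noteId);
      if (j === undefined || j === i) continue; // skip unknown notes and self-matches

      // Clamp defensively, then convert similarity to distance.
      const d = 1 - Math.min(1, Math.max(0, hit.score));

      // Assumption: keep the matrix symmetric by taking the smaller distance.
      dist[i][j] = Math.min(dist[i][j], d);
      dist[j][i] = dist[i][j];
    }
  });

  return dist;
}
```

The resulting matrix can then be handed to any projection step that accepts precomputed distances, for example through a custom distance function that looks up `dist[i][j]` by row index.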

One problem that I faced during testing was that our local ONNX fallback was sometimes returning NaN vectors on longer notes, which corrupted the cache and broke the clustering.

For next week, I plan to address this issue by adding a dynamic fallback, along with a self-healing cache validator to automatically find and re-embed corrupted records. I also want to work on auto-naming the formed clusters.
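As a rough idea of what such a validator might look like (this is only a sketch of the plan, not existing code): scan the cache for vectors that have the wrong length or contain NaN/Infinity, then re-embed just those records. The cache layout (`Map` of note ID to vector), the function names, and the `reEmbed` callback are all hypothetical.

```typescript
// A vector is valid if it has the expected dimension and only finite numbers.
function isValidVector(vector: number[], dim: number): boolean {
  return vector.length === dim && vector.every((x) => isFinite(x));
}

// Returns the IDs of cached records that fail validation.
function findCorruptedIds(cache: Map<string, number[]>, dim: number): string[] {
  const bad: string[] = [];
  cache.forEach((vector, noteId) => {
    if (!isValidVector(vector, dim)) bad.push(noteId);
  });
  return bad;
}

// Re-embeds corrupted records in place. `reEmbed` is a hypothetical callback
// supplied by the caller. Records that are still invalid after re-embedding
// are removed from the cache and reported back.
async function healCache(
  cache: Map<string, number[]>,
  dim: number,
  reEmbed: (noteId: string) => Promise<number[]>,
): Promise<string[]> {
  const stillBad: string[] = [];
  for (const noteId of findCorruptedIds(cache, dim)) {
    const fresh = await reEmbed(noteId);
    if (isValidVector(fresh, dim)) {
      cache.set(noteId, fresh);
    } else {
      cache.delete(noteId);
      stillBad.push(noteId);
    }
  }
  return stillBad;
}
```

Dropping records that cannot be repaired keeps NaN values out of the clustering input; the returned IDs could be logged or retried with a different strategy such as truncating the note.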

The plugin API could expose more information. It seems in particular that giving access to the raw vectors would help here?

"it automatically uses Joplin's native search when it's ready"

How do you currently check if it's ready or not? Because I recently added a few internal functions to check the readiness of the AI APIs - would it make sense to expose them?

Yes, exposing the raw vectors would help a lot. It would let us fetch all embeddings in a single batch call instead of running O(N) individual search queries. It would also avoid the top-20 search results limit, giving UMAP complete global distance data for the projection.

Currently, I check readiness using a query probe:

await joplin.ai.search({
    query: { text: 'probe' },
    relevance: 'strict',
});

I do it inside a try-catch block. This verifies that the ai namespace exists, that AI is enabled in settings, and that the native sqlite-vec extension is loaded. The problem is that it doesn't tell us whether the database is fully indexed.
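Roughly, the probe wrapped in its try-catch could be sketched like this. The search function is passed in as a parameter so the snippet stands on its own; the `AiSearch` type and both function names are hypothetical, and the real plugin typings may differ.

```typescript
// Minimal structural type for the probe call; an assumption, not the real typings.
type AiSearch = (args: {
  query: { text: string };
  relevance: string;
}) => Promise<unknown>;

// Resolves to true if the probe query succeeds, false if the function is
// missing or the call throws (namespace absent, AI disabled, extension not loaded).
// Note: success does not prove that indexing has finished.
async function isNativeSearchReady(search: AiSearch | undefined): Promise<boolean> {
  if (!search) return false;
  try {
    await search({ query: { text: 'probe' }, relevance: 'strict' });
    return true;
  } catch {
    return false;
  }
}

// Picks the backend for the hybrid pipeline based on the probe result.
async function chooseBackend(search: AiSearch | undefined): Promise<'native' | 'onnx'> {
  return (await isNativeSearchReady(search)) ? 'native' : 'onnx';
}
```

In the plugin, the argument would be something along the lines of a wrapper around `joplin.ai.search`, with `undefined` passed when the `ai` namespace does not exist.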

I checked your recent PR (#15785) - the embeddingAvailability() helper checks the conditions we need. Exposing it to the plugin API (e.g. joplin.ai.embeddingAvailability()) would be very helpful.

Also, if possible, having it indicate whether background note indexing is complete (so the database isn't empty on a fresh install) would be a great addition!

Thanks for the feedback! It helps to know how the API can potentially be used. I've now created two pull requests for these use cases:

Would you mind checking them and letting me know if they cover what's needed for your project?

I tested both PRs locally and they work perfectly!

I was able to check the status, fetch all note embeddings with pagination, group and average their chunks, and run clustering with no issues. It runs very fast and the clustering quality is great (I got a silhouette score of 0.93 on 50 notes).
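The group-and-average step can be sketched as follows, assuming each fetched record carries a note ID and one chunk vector. The `ChunkEmbedding` shape and the function name are hypothetical; the actual records returned by the new API may look different.

```typescript
// Hypothetical record shape: one embedding per chunk of a note.
interface ChunkEmbedding {
  noteId: string;
  vector: number[];
}

// Collapses chunk-level embeddings into one mean vector per note.
function averageByNote(chunks: ChunkEmbedding[]): Map<string, number[]> {
  const sums = new Map<string, { sum: number[]; count: number }>();

  for (const { noteId, vector } of chunks) {
    const entry = sums.get(noteId);
    if (!entry) {
      // Copy the vector so the caller's data is never mutated.
      sums.set(noteId, { sum: vector.slice(), count: 1 });
      continue;
    }
    if (entry.sum.length !== vector.length) {
      throw new Error(`Dimension mismatch for note ${noteId}`);
    }
    for (let i = 0; i < vector.length; i++) entry.sum[i] += vector[i];
    entry.count += 1;
  }

  const result = new Map<string, number[]>();
  sums.forEach(({ sum, count }, noteId) => {
    result.set(noteId, sum.map((x) => x / count));
  });
  return result;
}
```

With pagination, the chunks from every page can be concatenated first and averaged once at the end, so a note whose chunks span two pages is still averaged correctly.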

These APIs cover everything we need for the hybrid pipeline, so they are good to go from my end. Thanks a lot!

Great, glad to hear it's working! I'm going to merge the PRs then, and they will be part of the next release.