Weekly Update 5: Integrating Native AI & Cluster Tags

Harsh16gupta · 28 June 2026 20:24

Hey everyone! Hope you all had a great week.

This week, I started by working on automatic tag generation for our clusters and opened PR #23. I wrote a simple, dependency-free TF-IDF extractor in TypeScript that cleans the note text, filters out common stop words, and picks the most distinctive keywords for each cluster. These are now displayed as tags directly on the cluster cards in our React UI.
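For readers who want a feel for the approach, here is a minimal sketch of a TF-IDF tag extractor. It is not the code from PR #23: the stop-word list, the function names (`tokenize`, `topTags`) and the smoothed IDF formula are all assumptions made for illustration, and each cluster is treated as a single document.

```typescript
// Tiny illustrative stop-word list (assumption, not the plugin's actual list).
const STOP_WORDS = new Set(['the', 'and', 'for', 'with', 'that', 'this', 'from', 'are', 'was']);

// Lowercase, strip punctuation, drop short words and stop words.
function tokenize(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9\s]/g, ' ')
    .split(/\s+/)
    .filter((w) => w.length > 2 && !STOP_WORDS.has(w));
}

// Returns the highest-scoring terms for each cluster text.
function topTags(clusterTexts: string[], tagsPerCluster = 3): string[][] {
  const docs = clusterTexts.map(tokenize);

  // Document frequency: number of clusters in which each term appears.
  const df = new Map<string, number>();
  for (const tokens of docs) {
    new Set(tokens).forEach((term) => df.set(term, (df.get(term) ?? 0) + 1));
  }

  return docs.map((tokens) => {
    // Raw term counts within this cluster.
    const counts = new Map<string, number>();
    for (const term of tokens) counts.set(term, (counts.get(term) ?? 0) + 1);

    const scored: { term: string; score: number }[] = [];
    counts.forEach((count, term) => {
      // Smoothed IDF: terms found in every cluster still get a small positive weight.
      const idf = Math.log((1 + docs.length) / (1 + (df.get(term) ?? 0))) + 1;
      scored.push({ term, score: (count / tokens.length) * idf });
    });

    return scored
      .sort((a, b) => b.score - a.score || a.term.localeCompare(b.term))
      .slice(0, tagsPerCluster)
      .map((s) => s.term);
  });
}
```

Calling `topTags(texts, 3)` with one concatenated text per cluster would give up to three candidate tags per cluster card.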

After that, since Joplin introduced native on-device AI search, I shifted focus to integrating it (PR #24). The new API doesn't expose raw embedding vectors; it only lets us query for similar notes. So I built a distance matrix by querying similar notes for each note and converting their similarity scores into distances (1 - score). I then updated UMAP to accept a custom distance function, so it can project this matrix into a 10D space that our clustering algorithms can use.
I also set it up as a hybrid pipeline, so it automatically uses Joplin's native search when it's ready, or falls back to our local ONNX web worker when it isn't.
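The distance-matrix step could look roughly like the sketch below. It assumes the similarity results have already been collected into a map from note ID to hits; the `SimilarHit` shape, the `missingDistance` default and the choice to symmetrise by keeping the smaller distance are assumptions for illustration, not necessarily what PR #24 does.

```typescript
// One neighbour returned by a similarity query (shape is an assumption).
interface SimilarHit {
  noteId: string;
  score: number; // similarity, expected in [0, 1]
}

function buildDistanceMatrix(
  noteIds: string[],
  neighbours: Map<string, SimilarHit[]>,
  missingDistance = 1, // assumption: pairs never returned by any query count as maximally distant
): number[][] {
  const index = new Map<string, number>();
  noteIds.forEach((id, i) => index.set(id, i));

  // Start with 0 on the diagonal and the "missing" distance everywhere else.
  const dist: number[][] = noteIds.map((_, i) =>
    noteIds.map((__, j) => (i === j ? 0 : missingDistance)));

  noteIds.forEach((id, i) => {
    for (const hit of neighbours.get(id) ?? []) {
      const j = index.get(hit.noteId);
      if (j === undefined || j === i) continue; // unknown note or self-match
      const d = Math.min(1, Math.max(0, 1 - hit.score)); // distance = 1 - score, clamped to [0, 1]
      // A may list B without B listing A, so keep the smaller distance in both cells.
      const best = Math.min(d, dist[i][j]);
      dist[i][j] = best;
      dist[j][i] = best;
    }
  });
  return dist;
}
```

A projection step that accepts a custom distance function can then look distances up by row index instead of computing them from vectors.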

One problem that I faced during testing was that our local ONNX fallback was sometimes returning NaN vectors on longer notes, which corrupted the cache and broke the clustering.

For next week, I plan to address this issue by adding a dynamic fallback, along with a self-healing cache validator to automatically find and re-embed corrupted records. I also want to work on auto-naming the formed clusters.
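As a rough idea of what the cache validator might check, the sketch below scans cached vectors for NaN or infinite values and wrong dimensions, and returns the note IDs that would need re-embedding. The cache shape (`Map` of note ID to vector) and the function names are hypothetical; the real design is still open.

```typescript
// True only if the vector has the expected length and every component is a finite number.
function isValidVector(vec: ArrayLike<number> | undefined, expectedDim: number): boolean {
  if (!vec || vec.length !== expectedDim) return false;
  for (let i = 0; i < vec.length; i++) {
    if (!Number.isFinite(vec[i])) return false; // rejects NaN and +/-Infinity
  }
  return true;
}

// Returns the IDs of cached records that should be re-embedded.
// The Map-based cache is an assumption for illustration.
function findCorruptedIds(cache: Map<string, ArrayLike<number>>, expectedDim: number): string[] {
  const corrupted: string[] = [];
  cache.forEach((vec, noteId) => {
    if (!isValidVector(vec, expectedDim)) corrupted.push(noteId);
  });
  return corrupted;
}
```

Running the same `isValidVector` check before writing to the cache would also keep bad vectors from being stored in the first place.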

laurent · 28 June 2026 22:22

The plugin API could expose more information. It seems in particular that giving access to the raw vectors would help here?

> it automatically uses Joplin's native search when it's ready

How do you currently check if it's ready or not? Because I recently added a few internal functions to check the readiness of the AI APIs - would it make sense to expose them?

Harsh16gupta · 29 June 2026 09:15

Yes, exposing the raw vectors would help a lot. It would let us fetch all embeddings in a single batch call instead of running O(N) individual search queries. It would also avoid the top-20 search results limit, giving UMAP complete global distance data for the projection.

Currently, I check readiness using a query probe:

await joplin.ai.search({
    query: { text: 'probe' },
    relevance: 'strict',
});

I do it inside a try-catch block. This verifies that the ai namespace exists, that AI is enabled in settings, and that the native sqlite-vec extension is loaded. The problem is that it doesn't tell us whether the database is fully indexed.
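Wrapped up as a helper, the probe could be sketched as follows. The search function is passed in as a parameter so the snippet stays self-contained; `SearchFn` and `probeNativeSearch` are hypothetical names, and in the plugin the argument would presumably be a thin wrapper around the `joplin.ai.search` call shown above.

```typescript
// Minimal shape of the search call used by the probe (assumption based on the snippet above).
type SearchFn = (options: { query: { text: string }; relevance: string }) => Promise<unknown>;

// Resolves to true if a probe query succeeds, false if the function is missing or throws.
async function probeNativeSearch(search: SearchFn | undefined): Promise<boolean> {
  if (!search) return false; // e.g. the ai namespace does not exist in this Joplin version
  try {
    await search({ query: { text: 'probe' }, relevance: 'strict' });
    return true;
  } catch {
    // AI disabled, extension not loaded, or any other failure: treat as not ready.
    return false;
  }
}
```

As noted, a successful probe only shows that a query can run. It says nothing about how much of the database has been indexed.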

I checked your recent PR (#15785) - the embeddingAvailability() helper checks the conditions we need. Exposing it to the plugin API (e.g. joplin.ai.embeddingAvailability()) would be very helpful.

Also, if possible, having it indicate whether background note indexing is complete (so the database isn't empty on a fresh install) would be a great addition!

laurent · 29 June 2026 15:11

Thanks for the feedback! It helps to know how the API can potentially be used. I've now created two pull requests for these use cases:

github.com/laurent22/joplin

Desktop: Add joplin.ai.getIndexStatus() plugin API (#15800)

dev ← ai_index_status_plugin_api

opened 03:10PM - 29 Jun 26 UTC

laurent22

+143 -2

Exposes the on-device embedding indexer's state to plugins, so they can decide whether to use Joplin's native AI features or fall back to their own implementation.

API: joplin.ai.getIndexStatus() → { ready, state, modelId, notesIndexed, totalNotes }

- `ready` is `true` when the indexer is idle and at least one note is indexed, which is the common case for "should I use Joplin's AI or my fallback?"
- `state` is a coarse 5-value enum ('unavailable' | 'disabled' | 'preparing' | 'indexing' | 'ready') translated from the richer internal state machine, so the public API isn't pinned to current internals
- `modelId` pairs with vector-producing APIs so plugins can invalidate cached data on model swaps
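To illustrate how a plugin might act on this status object, here is a small sketch of the backend decision. The field types in `IndexStatus` and the optional `minCoverage` threshold are assumptions; only the field names and the `state` values come from the PR description.

```typescript
type IndexState = 'unavailable' | 'disabled' | 'preparing' | 'indexing' | 'ready';

// Field names follow the PR description; the field types are assumptions.
interface IndexStatus {
  ready: boolean;
  state: IndexState;
  modelId: string | null;
  notesIndexed: number;
  totalNotes: number;
}

// Picks the native index only when it is ready and covers enough of the notes.
// minCoverage is a hypothetical knob: 1 means "every note must be indexed".
function chooseBackend(status: IndexStatus | null, minCoverage = 1): 'native' | 'local' {
  if (!status || !status.ready) return 'local';
  const coverage = status.totalNotes > 0 ? status.notesIndexed / status.totalNotes : 0;
  return coverage >= minCoverage ? 'native' : 'local';
}
```

The status passed in would come from `joplin.ai.getIndexStatus()`, with `null` standing in for Joplin versions where the call is unavailable.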

github.com/laurent22/joplin

Desktop: Add joplin.ai.getEmbeddings() plugin API for raw vector access (#15799)

dev ← embedding_vector_search

opened 01:45PM - 29 Jun 26 UTC

laurent22

+275 -3

Exposes the raw embedding vectors stored in the on-device index to plugins, so they can run their own clustering, dimensionality reduction, or distance computations without making O(N) similarity queries to reconstruct the data.

API: joplin.ai.getEmbeddings({ noteIds?, cursor?, limit? }) → { modelId, dimension, chunks, nextCursor? }

- Chunk-level granularity, matching how vectors are stored
- Opaque cursor over monotonic rowids: chunks inserted behind the cursor are guaranteed to have been returned
- modelId on every page so plugins can detect a mid-iteration model swap and discard partial results
- Unindexed noteIds are silently skipped; throws when AI is disabled
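A plugin-side pagination loop over this API might look like the sketch below. The page fetcher is injected so the example is self-contained; the chunk field names (`noteId`, `vector`), the string cursor type and the default `limit` are assumptions that the PR description does not spell out.

```typescript
// Chunk and page shapes: top-level page fields follow the PR description,
// while the chunk fields and the cursor type are assumptions.
interface EmbeddingChunk {
  noteId: string;
  vector: number[];
}
interface EmbeddingPage {
  modelId: string;
  dimension: number;
  chunks: EmbeddingChunk[];
  nextCursor?: string;
}
type GetEmbeddingsFn = (options: { cursor?: string; limit?: number }) => Promise<EmbeddingPage>;

// Walks every page; returns null if the model changes mid-iteration so the caller can retry.
async function fetchAllChunks(
  getEmbeddings: GetEmbeddingsFn,
  limit = 500,
): Promise<{ modelId: string; chunks: EmbeddingChunk[] } | null> {
  let cursor: string | undefined;
  let modelId: string | undefined;
  const chunks: EmbeddingChunk[] = [];

  do {
    const page = await getEmbeddings({ cursor, limit });
    if (modelId !== undefined && page.modelId !== modelId) return null; // model swap: discard partial results
    modelId = page.modelId;
    for (const chunk of page.chunks) chunks.push(chunk);
    cursor = page.nextCursor;
  } while (cursor !== undefined);

  return modelId === undefined ? null : { modelId, chunks };
}
```

Returning `null` on a model swap keeps the caller from mixing vectors produced by two different models.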

Would you mind checking them and letting me know if it covers what's needed for your project?

Harsh16gupta · 29 June 2026 19:41

I tested both PRs locally and they work perfectly!

I was able to check the status, fetch all note embeddings page by page, group and average their chunks per note, and run clustering with no issues. Performance is very fast and the clustering quality is great (got a silhouette score of 0.93 on 50 notes).
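The group-and-average step can be sketched as simple mean pooling of chunk vectors per note. The chunk shape (`noteId`, `vector`) and the function name are assumptions, and the sketch assumes all vectors share the same dimension.

```typescript
// Chunk shape is an assumption for illustration.
interface NoteChunk {
  noteId: string;
  vector: number[];
}

// Averages all chunk vectors belonging to the same note (mean pooling).
function averageChunksByNote(chunks: NoteChunk[]): Map<string, number[]> {
  const sums = new Map<string, { sum: number[]; count: number }>();

  for (const { noteId, vector } of chunks) {
    const entry = sums.get(noteId);
    if (!entry) {
      sums.set(noteId, { sum: vector.slice(), count: 1 }); // copy so the input is not mutated
      continue;
    }
    for (let i = 0; i < vector.length; i++) entry.sum[i] += vector[i];
    entry.count += 1;
  }

  const result = new Map<string, number[]>();
  sums.forEach(({ sum, count }, noteId) => {
    result.set(noteId, sum.map((v) => v / count));
  });
  return result;
}
```

The resulting one-vector-per-note map can be fed directly into the projection and clustering steps.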

These APIs cover everything we need for the hybrid pipeline, so they are good to go from my end. Thanks a lot!

laurent · 29 June 2026 23:14

Great, glad to hear it's working! I'm going to merge the PRs then, and they will be part of the next release.
