Hello everyone
I was working on idea 3 (automatically grouping notes into topic-based clusters and suggesting tags or notebooks). To group notes by topic, each note is turned into an embedding, a vector of roughly 384 values. At that dimensionality, grouping becomes difficult and slow because everything starts to look equally distant.
So I was planning to use UMAP, which reduces each embedding to about 10 dimensions while keeping similar notes close to each other. Then HDBSCAN finds natural clusters in the reduced space. It also detects notes that don’t fit anywhere, which can help identify notes to archive. I was planning to use UMAP and HDBSCAN JS libraries but found that:
- The umap-js library hasn’t been updated for about 2 years, and the authors mention that a key part of the algorithm (spectral initialization) isn’t fully implemented.
- HDBSCAN has no JavaScript library at all.
Note: I am not using K-means because it requires specifying in advance how many clusters to create.
What I’m proposing
The plugin will still be written in TypeScript and handle all UI and Joplin API interactions. But when the user clicks to organise notes, it would:
- Start a small Python subprocess
- Send note data via stdin
- Run clustering in Python
- Return results via stdout
- Exit completely after finishing
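As a rough sketch of that round trip, assuming Node’s `child_process` is usable from the plugin process (the `Note` shape and the `cluster.py` worker name below are placeholders, not Joplin’s actual API):

```typescript
import { spawn } from "child_process";

// Hypothetical shapes for illustration only.
interface Note { id: string; embedding: number[]; }
interface ClusterResult { id: string; cluster: number; }

// Spawn a worker process, stream the notes as JSON on stdin,
// and resolve with the JSON the worker prints on stdout.
function clusterViaSubprocess(
  command: string,
  args: string[],
  notes: Note[],
): Promise<ClusterResult[]> {
  return new Promise((resolve, reject) => {
    const child = spawn(command, args);
    let stdout = "";
    let stderr = "";
    child.stdout.on("data", (chunk) => (stdout += chunk));
    child.stderr.on("data", (chunk) => (stderr += chunk));
    child.on("error", reject); // e.g. Python not installed
    child.on("close", (code) => {
      if (code !== 0) return reject(new Error(`worker exited ${code}: ${stderr}`));
      try { resolve(JSON.parse(stdout)); } catch (e) { reject(e); }
    });
    child.stdin.write(JSON.stringify(notes));
    child.stdin.end(); // worker reads stdin to EOF, clusters, prints JSON, exits
  });
}
```

In the plugin this would be called roughly as `clusterViaSubprocess("python3", ["cluster.py"], notes)`, with a graceful error path for when Python isn’t installed.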
I also checked that starting a child process is possible in Joplin plugins: Is it possible to use child_process?
Doing so will allow access to umap-learn and hdbscan, which are actively maintained.
There’s also some research supporting this pipeline: BERT embeddings + UMAP + HDBSCAN seems to work quite well for document clustering:
Questions
For mentors:
Is this approach correct?
For users:
If the plugin asked you to install Python once during setup in exchange for better clustering results, would you be okay with that?
Note: Depending on the discussion, I plan to keep a basic JavaScript fallback as well, so the plugin would still work without Python, just with lower-quality results.
I realise that many AI-related tools are more up to date in Python, but we should really avoid going for this approach in my opinion. Maybe there are some other, more recent JS libraries that could be used instead?
Personally, my Python environment is already a hellscape where the slightest touch stops me being able to build a couple of apps from source. Joking aside, this would not go well for professional users: corporate IT would be very unwilling to let you install anything like that.
I understand why you are saying to avoid Python. I tried searching for JS alternatives:
- For UMAP: there is umap-js, whose README directly says that spectral initialization isn’t fully implemented. But I found an alternative, druidJS; it is well maintained and was also published in an IEEE paper (I will compare umap-js and druidJS and decide which one to choose).
- For HDBSCAN (a clustering algorithm): there is hdbscanjs, but it is still not ready for use.
So if we want to stick with JS libraries, the only option left is K-means (where we need to manually enter how many clusters to create), which would degrade the user experience.
How about having a fallback like this?
Yep, Python can be a bit messy sometimes; I understand that many people will not be ready to install it.
How about having a fallback like this?
Just for more context, here is the difference between HDBSCAN (only available in Python) and K-means (workable in JavaScript):
HDBSCAN:
It finds groups on its own based on dense areas and can leave out points that don’t belong anywhere. (It creates better clusters and can also identify notes that aren’t related to any other notes.)
K-means:
We decide the number of groups first, then it puts every note into one of those groups, even if some don’t fit well. (It only creates the number of groups we ask for and cannot identify unrelated notes; every note ends up in some group.)
During my GSoC, I tried to use tools that run Python scientific packages from JS/TS, but it was quite difficult to do in Joplin’s desktop environment. You can try it, but I wouldn’t recommend it since I think it would be out of scope.
I like the idea of using density-based clustering! Not only do you not have to choose k, it deals with noise too.
But if you cannot find any packages that properly implement those algorithms, then you can either:
- Use K-means, calculating/finding the optimal k, and think about the UI/UX around the case where your k is not the optimal one.
- Or prompt engineer your agent to run any density-based clustering using Claude code execution tool: Code execution tool - Claude API Docs
Thanks for sharing your experience, that helps. I understand it might get difficult to manage this in the GSoC timeline. So I’ll avoid depending on Python for now and focus more on a pure JS approach.
Thanks! That’s why I was exploring HDBSCAN, since it can find natural clusters and also leave out unrelated notes.
I found that we can actually avoid asking the user to choose k manually. We can run K-means multiple times with different values of k (2, 3, 4, … up to some limit) and then pick the best one using the silhouette score.
The silhouette score measures how well each point fits inside its cluster compared to other clusters, so we can pick the k where the clustering looks closest to natural.
It’s not perfect. K-means will still force every note into some cluster. So even if a note doesn’t really belong anywhere, it still gets grouped.
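To make this concrete, here is a toy TypeScript sketch of the run-many-k-and-score loop (naive centroid seeding and a fixed iteration count; a real implementation would use k-means++ seeding and convergence checks):

```typescript
type Vec = number[];

const dist = (a: Vec, b: Vec): number =>
  Math.sqrt(a.reduce((s, v, i) => s + (v - b[i]) ** 2, 0));

// Plain Lloyd's k-means; seeds centroids with the first k points,
// which is enough for a sketch but not for production use.
function kmeans(points: Vec[], k: number, iters = 50): number[] {
  let centroids = points.slice(0, k).map((p) => [...p]);
  let labels: number[] = [];
  for (let it = 0; it < iters; it++) {
    labels = points.map((p) => {
      let best = 0;
      for (let c = 1; c < k; c++)
        if (dist(p, centroids[c]) < dist(p, centroids[best])) best = c;
      return best;
    });
    centroids = centroids.map((c, ci) => {
      const members = points.filter((_, i) => labels[i] === ci);
      if (members.length === 0) return c; // keep an empty cluster's centroid
      return c.map((_, d) => members.reduce((s, m) => s + m[d], 0) / members.length);
    });
  }
  return labels;
}

// Mean silhouette: for each point, s = (b - a) / max(a, b), where a is the
// mean distance to its own cluster and b the mean distance to the nearest
// other cluster. Singleton clusters contribute 0, as in scikit-learn.
function silhouette(points: Vec[], labels: number[]): number {
  const k = Math.max(...labels) + 1;
  let total = 0;
  for (let i = 0; i < points.length; i++) {
    const sums = new Array(k).fill(0);
    const counts = new Array(k).fill(0);
    for (let j = 0; j < points.length; j++) {
      if (j === i) continue;
      sums[labels[j]] += dist(points[i], points[j]);
      counts[labels[j]]++;
    }
    if (counts[labels[i]] === 0) continue; // singleton: s = 0
    const a = sums[labels[i]] / counts[labels[i]];
    let b = Infinity;
    for (let c = 0; c < k; c++)
      if (c !== labels[i] && counts[c] > 0) b = Math.min(b, sums[c] / counts[c]);
    if (b === Infinity) continue; // only one non-empty cluster
    total += (b - a) / Math.max(a, b);
  }
  return total / points.length;
}

// Try k = 2..maxK and keep the labelling with the best silhouette.
function bestKMeans(points: Vec[], maxK: number) {
  let best = { k: 2, score: -Infinity, labels: [] as number[] };
  for (let k = 2; k <= Math.min(maxK, points.length - 1); k++) {
    const labels = kmeans(points, k);
    const score = silhouette(points, labels);
    if (score > best.score) best = { k, score, labels };
  }
  return best;
}
```

The silhouette score lies in [-1, 1]; higher means tighter, better-separated clusters.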
That’s quite interesting, I was not aware of it. For categorising I was thinking that we could keep it fully local, but if we decide to go the Python route then this could be useful.
Yeah, that sounds like a great approach!
To tackle noise, you could then evaluate within each cluster whether each note is really semantically close to the others, using for example similarity metrics.
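A sketch of that per-cluster check, using cosine similarity over the note embeddings (the 0.3 cutoff is an arbitrary placeholder to tune):

```typescript
// Cosine similarity between two embedding vectors.
const cosine = (a: number[], b: number[]): number => {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] ** 2;
    nb += b[i] ** 2;
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
};

// Flag cluster members whose mean similarity to the rest of the
// cluster falls below a threshold; those are noise candidates.
function flagOutliers(cluster: number[][], threshold = 0.3): boolean[] {
  if (cluster.length < 2) return cluster.map(() => false); // nothing to compare
  return cluster.map((v, i) => {
    const others = cluster.filter((_, j) => j !== i);
    const mean = others.reduce((s, o) => s + cosine(v, o), 0) / others.length;
    return mean < threshold;
  });
}
```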