@Harsh16gupta Hi! Thank you for your proposal; it's solid, very well written, and carefully thought out.
I like how, at every step, you justify your choices and solutions with actual sources, comparisons, and explanations. That's great!!
I read from the beginning through section 3.7 Tag Generation. Rather than give you full feedback on the whole proposal, I decided to do it in parts, since the deadline is approaching and it's good to have some of the questions asked early.
- It's nice to see that you engaged with the community and based this proposal on that!!
- Great that you considered the case where the context could exceed the embedding model's window and thought about chunking!!
- Using UMAP to make KMeans Clustering more efficient is a smart approach.
- For Tag Generation, it’s great to see that you’re doing reranking. However, I was wondering whether you have thought about this edge case:
  - Let's say the notes are about AI and machine learning, and reranking gives you these words as a result:
    `[artificial, intelligence, machine, learning, data, science, vector, space]`
  - As you can see, these are all single words. But realistically, we want:
    `[artificial intelligence, machine learning, data science, vector space]`
  - The natural library that you're planning to use has this feature: N-grams | Natural. You could use n-grams to extract multi-word phrases like 'machine learning' or 'data science' as single terms, which would give you much better tag names than individual words. Note that you need to deal with duplication across the n-grams.
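For illustration, here is a minimal sketch of bigram extraction with deduplication. The pairing logic is implemented inline for clarity (natural's `NGrams.bigrams` provides equivalent n-gram extraction), and pairing adjacent candidates is purely illustrative; in practice the n-grams would come from the note text itself:

```javascript
// Minimal sketch: turn a list of single-word candidates into bigram tags,
// skipping bigrams that reuse a word already consumed by an earlier bigram.
// (natural's NGrams.bigrams offers equivalent n-gram extraction.)
function bigramTags(words) {
  const used = new Set();
  const tags = [];
  for (let i = 0; i + 1 < words.length; i += 1) {
    const [a, b] = [words[i], words[i + 1]];
    if (used.has(a) || used.has(b)) continue; // dedup overlapping n-grams
    tags.push(`${a} ${b}`);
    used.add(a);
    used.add(b);
  }
  return tags;
}

const reranked = ['artificial', 'intelligence', 'machine', 'learning',
                  'data', 'science', 'vector', 'space'];
console.log(bigramTags(reranked));
// → ['artificial intelligence', 'machine learning', 'data science', 'vector space']
```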
- Could you show me how fast the embedding model (BGE-small-en-v1.5) is with Transformers.js?
  - It would be good to show the inference time and estimate how long it would take for 1000 notes.
  - Does it work async, and where are the limitations?
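Here is a rough sketch of the kind of measurement I mean. The `embed` function passed in is a placeholder; in the real benchmark it would be the Transformers.js feature-extraction pipeline (the model id `Xenova/bge-small-en-v1.5` is my assumption of how it would be loaded):

```javascript
// Rough benchmark sketch: time N embedding calls sequentially, then
// extrapolate to 1000 notes. `embed` is a placeholder async function;
// in the real test it would be a Transformers.js pipeline, e.g.
// await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5').
async function benchmark(embed, texts) {
  const start = Date.now();
  for (const text of texts) {
    await embed(text); // sequential, to measure per-note latency
  }
  const elapsedMs = Date.now() - start;
  const perNoteMs = elapsedMs / texts.length;
  return { perNoteMs, estimatedMsFor1000Notes: perNoteMs * 1000 };
}

// Placeholder embed: returns a dummy 384-dim vector (BGE-small's size).
const fakeEmbed = async (text) => new Array(384).fill(text.length);

benchmark(fakeEmbed, ['note one', 'note two', 'note three'])
  .then((result) => console.log(result));
```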
- If possible, could you create a short video of you running the embedding model with Transformers.js in a Joplin plugin?