@Harsh16gupta Hi! Thank you for your proposal; it's solid, very well written, and carefully thought out.
I like how, at every step, you justify your choices and solutions with actual sources, comparisons, and explanations. That's great!!
I read from the beginning through section 3.7 Tag Generation. Rather than give you full feedback on the whole proposal, I decided to do it in parts, since the deadline is approaching and it's good to have some of the questions asked early.
- It's nice to see that you engaged with the community and based this proposal on that!!
- Great that you considered the case where the context could exceed the embedding model's window and thought about chunking!!
- Using UMAP to make KMeans Clustering more efficient is a smart approach.
- For Tag Generation, it’s great to see that you’re doing reranking. However, I was wondering whether you have thought about this edge case:
  - Let's say the notes are about AI and machine learning, and reranking gives you these words as a result:
    `[artificial, intelligence, machine, learning, data, science, vector, space]`
  - As you can see, these are all single words. But realistically, we want:
    `[artificial intelligence, machine learning, data science, vector space]`
  - The natural library that you're planning to use has this feature: N-grams | Natural. You could use n-grams to extract multi-word phrases like 'machine learning' or 'data science' as single terms, which would give you much better tag names than individual words. Note that you need to deal with duplication across the n-grams.
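For illustration, here is a minimal sketch of bigram extraction with deduplication. The pairing logic is implemented inline for clarity (natural's `NGrams.bigrams` provides equivalent n-gram extraction), and pairing adjacent candidates is purely illustrative; in practice the n-grams would come from the note text itself:

```javascript
// Minimal sketch: turn a list of single-word candidates into bigram tags,
// skipping bigrams that reuse a word already consumed by an earlier bigram.
// (natural's NGrams.bigrams offers equivalent n-gram extraction.)
function bigramTags(words) {
  const used = new Set();
  const tags = [];
  for (let i = 0; i + 1 < words.length; i += 1) {
    const [a, b] = [words[i], words[i + 1]];
    if (used.has(a) || used.has(b)) continue; // dedup overlapping n-grams
    tags.push(`${a} ${b}`);
    used.add(a);
    used.add(b);
  }
  return tags;
}

const reranked = ['artificial', 'intelligence', 'machine', 'learning',
                  'data', 'science', 'vector', 'space'];
console.log(bigramTags(reranked));
// → ['artificial intelligence', 'machine learning', 'data science', 'vector space']
```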
- Could you show me how fast the embedding model (BGE-small-en-v1.5) is with Transformers.js?
  - It would be good to show the inference time and estimate how long it would take for 1000 notes.
  - Does it work async, and where are the limitations?
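Here is a rough sketch of the kind of measurement I mean. The `embed` function passed in is a placeholder; in the real benchmark it would be the Transformers.js feature-extraction pipeline (the model id `Xenova/bge-small-en-v1.5` is my assumption of how it would be loaded):

```javascript
// Rough benchmark sketch: time N embedding calls sequentially, then
// extrapolate to 1000 notes. `embed` is a placeholder async function;
// in the real test it would be a Transformers.js pipeline, e.g.
// await pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5').
async function benchmark(embed, texts) {
  const start = Date.now();
  for (const text of texts) {
    await embed(text); // sequential, to measure per-note latency
  }
  const elapsedMs = Date.now() - start;
  const perNoteMs = elapsedMs / texts.length;
  return { perNoteMs, estimatedMsFor1000Notes: perNoteMs * 1000 };
}

// Placeholder embed: returns a dummy 384-dim vector (BGE-small's size).
const fakeEmbed = async (text) => new Array(384).fill(text.length);

benchmark(fakeEmbed, ['note one', 'note two', 'note three'])
  .then((result) => console.log(result));
```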
- If possible, could you create a short video of you running the embedding model with Transformers.js in a Joplin plugin?