@yugalkaushik Thank you for your answer and for providing some clarity! I really appreciate the effort you put into writing all of this down. Since it's quite a lot, I think it's best if I go through your answers and give my feedback progressively rather than all at once.
A) Data Collection to Pass A flow
Regarding your proposal to embed each note in the {title} {title} {body} format: I understand that repeating the title increases its term frequency and therefore its importance in schemes such as TF-IDF or keyword search queries. However, I am not sure whether that would be useful for embedding models. This is something to think about.
B) Regarding Pass A
- Do you think clustering with the proposed approaches in those dimensions will be reliable and efficient?
- How would you make sure that k is the optimal choice? The same question applies to the distance and linkage for hierarchical clustering. You might find this thread useful: Idea 3 discussion - Using Python subprocess for UMAP and HDBSCAN instead of JavaScript - #7 by HahaBill. You do not have to follow this solution; take it as a source of inspiration for your research.
- This is an interesting idea of having different variables for computing the final score for the semantic edges!
- How is `tagOverlap` or `linkBonus` calculated? I couldn't find your approach for that.
- You might want to think about normalizing the computed cosine similarity scores, because they may not be well distributed. What I mean is that the scores might range from 0.56 to 0.83 rather than from 0 to 1. In that case, a threshold of 0.5, for example, would let almost everything through if we scored purely on cosine similarity.
- Furthermore, I assume the threshold applies to your final score? If none of the notes have tags or links to each other, every final score is 0.7 * cosine similarity, yet your threshold assumes a range from 0 to 1. Your default threshold (0.5) would then be "stricter": it would correspond to a cosine similarity of about 0.71. How do you solve that?
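
To make the two points about normalization and the threshold more concrete, here is a rough sketch of what I have in mind. It is only an illustration, not a recommendation for the final design: the function names are hypothetical, the weights 0.2 and 0.1 for `tagOverlap` and `linkBonus` are my own placeholders (only the 0.7 for cosine similarity comes from your proposal), and dividing by the sum of the available weights is just one possible way to keep the final score on a 0 to 1 scale.

```python
def min_max_normalize(scores):
    """Rescale scores so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # No spread to rescale; returning zeros is an arbitrary choice here.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def final_score(cosine, tag_overlap=None, link_bonus=None,
                weights=(0.7, 0.2, 0.1)):
    """Weighted average over the signals that are actually available.

    Pass None for a signal that does not exist for a pair of notes
    (e.g. neither note has tags). The weights of the remaining signals
    are renormalized so the result stays between 0 and 1.
    The 0.2 and 0.1 weights are assumed placeholders.
    """
    signals = [
        (cosine, weights[0]),
        (tag_overlap, weights[1]),
        (link_bonus, weights[2]),
    ]
    available = [(value, w) for value, w in signals if value is not None]
    total_weight = sum(w for _, w in available)
    return sum(value * w for value, w in available) / total_weight


# Raw cosine scores squeezed into a narrow band, as in my example above.
raw = [0.56, 0.61, 0.70, 0.83]
print([s >= 0.5 for s in raw])            # threshold on the raw scores
normalized = min_max_normalize(raw)
print([s >= 0.5 for s in normalized])     # threshold after rescaling

# Without tags or links the score is no longer capped at 0.7.
print(final_score(0.6))
print(final_score(0.6, tag_overlap=0.0, link_bonus=0.0))
```

Note that min-max normalization depends on the set of scores it is computed over, so the same pair of notes can get a different normalized score when the collection changes. That trade-off would need to be considered as well.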
- How is