Weekly Update 4-5: KMeans Clustering, Evaluation Research / Survey and Express.js in a plugin

BioFacLay · 1 July 2024 19:33

Disclaimer: I've just stumbled upon this and have only read the OP, so feel free to ignore this post if my advice doesn't apply / makes no sense.

Clustering algorithms such as k-means often struggle with high-dimensional data, and word2vec produces high-dimensional embeddings. Dimensionality reduction helps in such cases. UMAP is a state-of-the-art dimensionality reduction algorithm that has been shown to significantly improve clustering performance on high-dimensional data. t-SNE is an alternative that doesn't perform quite as well, but it has been around longer and is easier to find libraries for. Using dimensionality reduction as a preprocessing step could, however, slow down the clustering by quite a bit.
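To illustrate why distance-based methods like k-means have a hard time in high dimensions, here is a small stdlib-only sketch. It is not UMAP or t-SNE, just a demonstration of the underlying problem: as the dimension grows, the nearest and farthest points end up at almost the same distance. The function name `relative_contrast`, the uniform random data and the chosen dimensions are my own assumptions for illustration (300 is a common word2vec size, but I don't know what the OP uses).

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """Return (d_max - d_min) / d_min for the distances from one query
    point to n_points uniformly random points in the unit hypercube."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    return (max(dists) - min(dists)) / min(dists)


# Illustrative dimensions only; real embeddings are not uniform noise.
for dim in (2, 10, 100, 300):
    print(f"dim={dim:>3}  relative contrast={relative_contrast(dim):.2f}")
```

The contrast shrinks as `dim` grows, which means "closest centroid" carries less and less information. Reducing the embeddings to a handful of dimensions first is one way to get that contrast back.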

You might also want to look at density-based clustering algorithms. I recommend HDBSCAN. Unlike k-means, it doesn't require the clusters to be hyper-elliptical in shape, and it can label outliers as noise instead of forcing them into a cluster. HDBSCAN's hyperparameters (chiefly the minimum cluster size) are very intuitive, and you might be able to set them to fixed values without having to tune them every time.
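To make the density-based idea concrete, here is a minimal stdlib-only sketch of plain DBSCAN, the simpler predecessor of HDBSCAN. To be clear, this is not HDBSCAN itself: HDBSCAN builds a hierarchy over density levels so you don't have to pick a single `eps`. The toy data (two concentric rings plus three outliers) and the values `eps=0.5` and `min_samples=4` are assumptions I picked for illustration.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Plain DBSCAN: returns one label per point, with -1 meaning noise."""
    n = len(points)
    # Precompute eps-neighbourhoods (O(n^2), fine for a toy example).
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # core point: keep growing
        cluster += 1
    return labels


def ring(radius, count):
    """Evenly spaced points on a circle around the origin."""
    return [
        (radius * math.cos(2 * math.pi * k / count),
         radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]


# Two concentric rings (not separable by k-means) plus three far-away outliers.
points = ring(1.0, 40) + ring(3.0, 120) + [(10.0, 10.0), (-10.0, 8.0), (9.0, -9.0)]
labels = dbscan(points, eps=0.5, min_samples=4)
print(sorted(set(labels)))
```

Each ring comes out as its own cluster and the three stray points are marked as noise. k-means with k=2 would instead cut the rings in half, since both rings share the same centre.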

In any case, this looks like a really cool project and I wish you the best of luck with it.
