Progress
- Understanding vectorization: in simple terms, it is a way to convert sentences into vectors so that we can run various algorithms on them. For example, in LSA we stack sentence vectors into a matrix and then perform SVD to discover the most important dimensions; with those dimensions, we can determine which sentences are the most important.
- Vectorization methods:
- Binary Matrix -> represent each sentence as a 0/1 vector over the vocabulary (1 if the word occurs in the sentence)
- TF-IDF -> weight each word by its frequency in the sentence and its rarity across sentences
- Word2Vec -> good for capturing semantic relationships between words
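As a quick illustration of the first two methods, here is a sketch (naive whitespace tokenization, not the plugin's actual code) of building the binary and TF-IDF matrices from tokenized sentences:

```javascript
// Build the vocabulary from tokenized sentences.
function vocabulary(sentences) {
  const vocab = new Set();
  for (const s of sentences) for (const w of s) vocab.add(w);
  return [...vocab];
}

// Binary matrix: cell (i, j) is 1 if word j occurs in sentence i.
function binaryMatrix(sentences, vocab) {
  return sentences.map((s) => vocab.map((w) => (s.includes(w) ? 1 : 0)));
}

// TF-IDF matrix: term frequency weighted by inverse document frequency,
// treating each sentence as a "document".
function tfIdfMatrix(sentences, vocab) {
  const n = sentences.length;
  const df = vocab.map((w) => sentences.filter((s) => s.includes(w)).length);
  return sentences.map((s) =>
    vocab.map((w, j) => {
      const tf = s.filter((t) => t === w).length / s.length;
      const idf = Math.log(n / df[j]); // df[j] >= 1 for every vocab word
      return tf * idf;
    })
  );
}

const sents = ["the cat sat", "the dog sat", "a bird flew"].map((s) => s.split(" "));
const vocab = vocabulary(sents);
const B = binaryMatrix(sents, vocab);
const T = tfIdfMatrix(sents, vocab);
```

Either matrix can then feed SVD (for LSA) or the similarity graph (for LexRank).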
- Implemented LexRank with TF-IDF and LSA with the binary matrix
- The original authors of LexRank use TF-IDF as the vectorization method
- The LSA paper, which tested all of these vectorization methods, found that the binary format produced the best-quality summaries, so I used the binary matrix for LSA
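The LexRank idea can be sketched as cosine similarity over sentence vectors plus a PageRank-style power iteration. This is an illustration, not the original authors' code; the 0.1 threshold and 0.85 damping here are common defaults, not values taken from the paper:

```javascript
// Cosine similarity between two sentence vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Score sentences by power iteration over the similarity graph:
// keep edges whose cosine similarity clears the threshold, then
// iterate a damped PageRank-style update until it settles.
function lexRank(vectors, { threshold = 0.1, damping = 0.85, iters = 50 } = {}) {
  const n = vectors.length;
  const adj = vectors.map((v) =>
    vectors.map((w) => (cosine(v, w) > threshold ? 1 : 0))
  );
  // Self-loops (cosine(v, v) = 1) keep every degree >= 1.
  const degree = adj.map((row) => row.reduce((a, b) => a + b, 0));
  let scores = new Array(n).fill(1 / n);
  for (let k = 0; k < iters; k++) {
    scores = scores.map((_, i) => {
      let s = 0;
      for (let j = 0; j < n; j++) if (adj[j][i]) s += scores[j] / degree[j];
      return (1 - damping) / n + damping * s;
    });
  }
  return scores; // higher score = more central sentence
}
```

The top-scoring sentences form the summary.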
- LexRank creates more concise and clearer summaries
- LSA creates more detailed summaries
- Both perform better than the TextRank implementation installed from npm => better to implement TextRank ourselves with Word2Vec and co-reference resolution
- From my observations, LSA performed best on a very long text (3,280 words)
- Discovered Pyodide and micropip:
- Pyodide is a Python distribution for the browser based on WebAssembly. It lets us run Python code in web browsers and, with micropip, install packages at runtime. We can use, for example, scientific Python packages such as NumPy, pandas, SciPy, Matplotlib, and scikit-learn, which is great!
- There is a Pyodide Webpack plugin: GitHub - pyodide/pyodide-webpack-plugin: A Webpack plugin for integrating pyodide into your project.
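A minimal sketch of what booting Pyodide with micropip looks like in the browser (this assumes the pyodide script has already been loaded on the page, e.g. via a script tag or the webpack plugin; it does not run in Node, and the numpy install is only an example):

```javascript
// Browser-side sketch: boot the WASM runtime, install a package
// at runtime with micropip, then run a small Python snippet.
async function runPythonDemo() {
  const pyodide = await loadPyodide();       // boot the Pyodide runtime
  await pyodide.loadPackage("micropip");     // built-in package installer
  const micropip = pyodide.pyimport("micropip");
  await micropip.install("numpy");           // fetches a wasm-compatible wheel
  return pyodide.runPythonAsync(`
import numpy as np
float(np.ones(3).sum())
`);
}
```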
- Chose the GPLv3 license for plugin-ai-summarization
Plan
- Implementing KMeans clustering with either TF-IDF or Word2Vec:
- [STEP 1] select K random sentence vectors as initial centroids -> [STEP 2] assign each sentence vector to its nearest centroid -> [STEP 3] recompute centroids and repeat until convergence -> [STEP 4] the most important sentences are the ones closest to the centroids
- Implement our own TextRank algorithm with Word2Vec (it seems to perform better according to one of the research papers I read, which I linked in the last updates)
- Run a first evaluation study with the Joplin community to find out which algorithm works best
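The four KMeans steps from the plan can be sketched as follows (a sketch only: deterministic initialization instead of random for clarity, Euclidean distance over whatever sentence vectors are supplied):

```javascript
// KMeans over sentence vectors (TF-IDF or Word2Vec rows both work).
function kMeans(vectors, k, iters = 100) {
  // [STEP 1] pick initial centroids (first k vectors here; random in practice)
  let centroids = vectors.slice(0, k).map((v) => [...v]);
  let labels = new Array(vectors.length).fill(-1);
  const dist = (a, b) =>
    Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
  for (let it = 0; it < iters; it++) {
    // [STEP 2] assign each vector to its nearest centroid
    const next = vectors.map((v) => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (dist(v, centroids[c]) < dist(v, centroids[best])) best = c;
      }
      return best;
    });
    const changed = next.some((l, i) => l !== labels[i]);
    labels = next;
    // recompute centroids as the mean of each cluster's members
    centroids = centroids.map((old, c) => {
      const members = vectors.filter((_, i) => labels[i] === c);
      if (!members.length) return old;
      return old.map(
        (_, d) => members.reduce((s, m) => s + m[d], 0) / members.length
      );
    });
    // [STEP 3] run until convergence (assignments stop changing)
    if (!changed) break;
  }
  // [STEP 4] the sentence closest to each centroid is a summary candidate
  const picks = centroids.map((c) => {
    let best = 0;
    for (let i = 1; i < vectors.length; i++) {
      if (dist(vectors[i], c) < dist(vectors[best], c)) best = i;
    }
    return best;
  });
  return { labels, picks };
}
```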
Extra:
- Implemented LSA without diversity awareness (which would use Maximal Marginal Relevance) => try to understand the equation for diversity awareness and implement it
- If time allows, find out whether we can use neuralcoref with spaCy in Joplin through Pyodide and micropip (neuralcoref · spaCy Universe) => it would most likely help significantly with increasing the quality of summaries
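For reference, the MMR criterion (Carbonell & Goldstein) greedily picks, at each step, the candidate maximizing λ·sim(sᵢ, doc) − (1 − λ)·maxⱼ∈selected sim(sᵢ, sⱼ), so relevance is traded off against redundancy with already-chosen sentences. A sketch with cosine similarity (illustrative only; how the document vector and λ are chosen are assumptions, not decisions from this project):

```javascript
// Greedy MMR re-ranking over sentence vectors.
function mmrSelect(vectors, docVector, count, lambda = 0.7) {
  const cosine = (a, b) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return na && nb ? dot / Math.sqrt(na * nb) : 0;
  };
  const selected = [];
  const remaining = new Set(vectors.map((_, i) => i));
  while (selected.length < count && remaining.size) {
    let best = null, bestScore = -Infinity;
    for (const i of remaining) {
      // relevance to the whole document
      const relevance = cosine(vectors[i], docVector);
      // redundancy: closest similarity to anything already picked
      const redundancy = selected.length
        ? Math.max(...selected.map((j) => cosine(vectors[i], vectors[j])))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; best = i; }
    }
    selected.push(best);
    remaining.delete(best);
  }
  return selected; // indices of the chosen sentences, in pick order
}
```

A low λ favors diversity; a high λ favors pure relevance.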
Problem
- Encountered a problem with natural not being able to resolve node:async_hooks (the node: scheme used by newer Node.js versions) -> solution: when node:async_hooks is requested, resolve commonjs async_hooks instead
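Assuming webpack is the bundler (suggested by the webpack plugin mention above), the described fix can be expressed as an externals mapping; this is a sketch, with the rest of the config omitted:

```javascript
// webpack.config.js (fragment): webpack leaves the module as a plain
// CommonJS require("async_hooks") instead of trying to resolve the
// node: scheme itself.
module.exports = {
  // ...existing config...
  externals: {
    "node:async_hooks": "commonjs async_hooks",
  },
};
```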