Progress
- Understanding vectorization: in simple terms, it is a way to convert sentences into vectors so that we can run various algorithms on them. For example, in LSA we stack sentence vectors into a matrix and then perform SVD to discover the most important dimensions; with those dimensions, we can determine which sentences are the most important.
- Vectorization methods:
- Binary Matrix -> represent each sentence as a 0/1 vector over the vocabulary (1 if the word occurs in the sentence)
- TF-IDF -> weight each word by its frequency in the sentence and its rarity across sentences
- Word2Vec -> good for capturing semantic relationships between words
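As a quick illustration of the first two methods, here is a sketch (naive whitespace tokenization, not the plugin's actual code) of building the binary and TF-IDF matrices from tokenized sentences:

```javascript
// Build the vocabulary from tokenized sentences.
function vocabulary(sentences) {
  const vocab = new Set();
  for (const s of sentences) for (const w of s) vocab.add(w);
  return [...vocab];
}

// Binary matrix: cell (i, j) is 1 if word j occurs in sentence i.
function binaryMatrix(sentences, vocab) {
  return sentences.map((s) => vocab.map((w) => (s.includes(w) ? 1 : 0)));
}

// TF-IDF matrix: term frequency weighted by inverse document frequency,
// treating each sentence as a "document".
function tfIdfMatrix(sentences, vocab) {
  const n = sentences.length;
  const df = vocab.map((w) => sentences.filter((s) => s.includes(w)).length);
  return sentences.map((s) =>
    vocab.map((w, j) => {
      const tf = s.filter((t) => t === w).length / s.length;
      const idf = Math.log(n / df[j]); // df[j] >= 1 for every vocab word
      return tf * idf;
    })
  );
}

const sents = ["the cat sat", "the dog sat", "a bird flew"].map((s) => s.split(" "));
const vocab = vocabulary(sents);
const B = binaryMatrix(sents, vocab);
const T = tfIdfMatrix(sents, vocab);
```

Either matrix can then feed SVD (for LSA) or the similarity graph (for LexRank).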
- Implemented LexRank with TF-IDF and LSA with the binary matrix
- The original authors of LexRank use TF-IDF as the vectorization method
- The LSA paper, which tested all of these vectorization methods, found that the binary format produced the best-quality summaries, so I used the binary matrix for LSA
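The LexRank idea can be sketched as cosine similarity over sentence vectors plus a PageRank-style power iteration. This is an illustration, not the original authors' code; the 0.1 threshold and 0.85 damping here are common defaults, not values taken from the paper:

```javascript
// Cosine similarity between two sentence vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Score sentences by power iteration over the similarity graph:
// keep edges whose cosine similarity clears the threshold, then
// iterate a damped PageRank-style update until it settles.
function lexRank(vectors, { threshold = 0.1, damping = 0.85, iters = 50 } = {}) {
  const n = vectors.length;
  const adj = vectors.map((v) =>
    vectors.map((w) => (cosine(v, w) > threshold ? 1 : 0))
  );
  // Self-loops (cosine(v, v) = 1) keep every degree >= 1.
  const degree = adj.map((row) => row.reduce((a, b) => a + b, 0));
  let scores = new Array(n).fill(1 / n);
  for (let k = 0; k < iters; k++) {
    scores = scores.map((_, i) => {
      let s = 0;
      for (let j = 0; j < n; j++) if (adj[j][i]) s += scores[j] / degree[j];
      return (1 - damping) / n + damping * s;
    });
  }
  return scores; // higher score = more central sentence
}
```

The top-scoring sentences form the summary.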
- LexRank creates more concise and clearer summaries
- LSA creates more detailed summaries
- Both perform better than the TextRank implementation installed from npm => better to implement TextRank ourselves with Word2Vec and co-reference resolution
- From my observations, LSA performed best on a very long text (3,280 words)
- Discovered Pyodide and micropip:
- Pyodide is a Python distribution for the browser based on WebAssembly. It lets us run Python code in web browsers and, with micropip, install packages at runtime. We can use, for example, scientific Python packages such as NumPy, pandas, SciPy, Matplotlib, and scikit-learn, which is great!
- There is a Pyodide Webpack plugin: GitHub - pyodide/pyodide-webpack-plugin: A Webpack plugin for integrating pyodide into your project.
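A minimal sketch of what booting Pyodide with micropip looks like in the browser (this assumes the pyodide script has already been loaded on the page, e.g. via a script tag or the webpack plugin; it does not run in Node, and the numpy install is only an example):

```javascript
// Browser-side sketch: boot the WASM runtime, install a package
// at runtime with micropip, then run a small Python snippet.
async function runPythonDemo() {
  const pyodide = await loadPyodide();       // boot the Pyodide runtime
  await pyodide.loadPackage("micropip");     // built-in package installer
  const micropip = pyodide.pyimport("micropip");
  await micropip.install("numpy");           // fetches a wasm-compatible wheel
  return pyodide.runPythonAsync(`
import numpy as np
float(np.ones(3).sum())
`);
}
```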
- Chose the GPLv3 license for plugin-ai-summarization
Plan
- Implementing KMeans clustering with either TF-IDF or Word2Vec:
- [STEP 1] select K random sentence vectors as initial centroids -> [STEP 2] assign each sentence vector to its nearest centroid -> [STEP 3] recompute centroids and repeat until convergence -> [STEP 4] the most important sentences are the ones closest to the centroids
- Implement our own TextRank algorithm with Word2Vec (it seems to perform better according to one of the research papers I read, which I linked in the last updates)
- Run a first evaluation study with the Joplin community to find out which algorithm works best
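The four KMeans steps from the plan can be sketched as follows (a sketch only: deterministic initialization instead of random for clarity, Euclidean distance over whatever sentence vectors are supplied):

```javascript
// KMeans over sentence vectors (TF-IDF or Word2Vec rows both work).
function kMeans(vectors, k, iters = 100) {
  // [STEP 1] pick initial centroids (first k vectors here; random in practice)
  let centroids = vectors.slice(0, k).map((v) => [...v]);
  let labels = new Array(vectors.length).fill(-1);
  const dist = (a, b) =>
    Math.sqrt(a.reduce((s, x, i) => s + (x - b[i]) ** 2, 0));
  for (let it = 0; it < iters; it++) {
    // [STEP 2] assign each vector to its nearest centroid
    const next = vectors.map((v) => {
      let best = 0;
      for (let c = 1; c < k; c++) {
        if (dist(v, centroids[c]) < dist(v, centroids[best])) best = c;
      }
      return best;
    });
    const changed = next.some((l, i) => l !== labels[i]);
    labels = next;
    // recompute centroids as the mean of each cluster's members
    centroids = centroids.map((old, c) => {
      const members = vectors.filter((_, i) => labels[i] === c);
      if (!members.length) return old;
      return old.map(
        (_, d) => members.reduce((s, m) => s + m[d], 0) / members.length
      );
    });
    // [STEP 3] run until convergence (assignments stop changing)
    if (!changed) break;
  }
  // [STEP 4] the sentence closest to each centroid is a summary candidate
  const picks = centroids.map((c) => {
    let best = 0;
    for (let i = 1; i < vectors.length; i++) {
      if (dist(vectors[i], c) < dist(vectors[best], c)) best = i;
    }
    return best;
  });
  return { labels, picks };
}
```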
Extra:
- Implemented LSA without diversity awareness (which would use Maximal Marginal Relevance) => try to understand the equation for diversity awareness and implement it
- If time allows, find out whether we can use neuralcoref with spaCy in Joplin through Pyodide and micropip (neuralcoref · spaCy Universe) => it would most likely help significantly with increasing the quality of summaries
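For reference, the MMR criterion (Carbonell & Goldstein) greedily picks, at each step, the candidate maximizing λ·sim(sᵢ, doc) − (1 − λ)·maxⱼ∈selected sim(sᵢ, sⱼ), so relevance is traded off against redundancy with already-chosen sentences. A sketch with cosine similarity (illustrative only; how the document vector and λ are chosen are assumptions, not decisions from this project):

```javascript
// Greedy MMR re-ranking over sentence vectors.
function mmrSelect(vectors, docVector, count, lambda = 0.7) {
  const cosine = (a, b) => {
    let dot = 0, na = 0, nb = 0;
    for (let i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      na += a[i] * a[i];
      nb += b[i] * b[i];
    }
    return na && nb ? dot / Math.sqrt(na * nb) : 0;
  };
  const selected = [];
  const remaining = new Set(vectors.map((_, i) => i));
  while (selected.length < count && remaining.size) {
    let best = null, bestScore = -Infinity;
    for (const i of remaining) {
      // relevance to the whole document
      const relevance = cosine(vectors[i], docVector);
      // redundancy: closest similarity to anything already picked
      const redundancy = selected.length
        ? Math.max(...selected.map((j) => cosine(vectors[i], vectors[j])))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; best = i; }
    }
    selected.push(best);
    remaining.delete(best);
  }
  return selected; // indices of the chosen sentences, in pick order
}
```

A low λ favors diversity; a high λ favors pure relevance.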
Problem
- Encountered a problem with natural not being able to resolve node:async_hooks (the node: scheme used by newer Node.js versions) -> solution: when node:async_hooks is requested, resolve commonjs async_hooks instead
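Assuming webpack is the bundler (suggested by the webpack plugin mention above), the described fix can be expressed as an externals mapping; this is a sketch, with the rest of the config omitted:

```javascript
// webpack.config.js (fragment): webpack leaves the module as a plain
// CommonJS require("async_hooks") instead of trying to resolve the
// node: scheme itself.
module.exports = {
  // ...existing config...
  externals: {
    "node:async_hooks": "commonjs async_hooks",
  },
};
```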