Weekly Update 3: LexRank, LSA and Pyodide with micropip

Progress

  • Understanding vectorization: in simple terms, it converts sentences into vectors so that algorithms can operate on them numerically. For example, in LSA we stack the sentence vectors into a matrix and then perform SVD on it to discover the most important dimensions. With those dimensions, we can determine which sentences are the most important.

    • Vectorization methods:
      • Binary Matrix -> each sentence becomes a 0/1 vector marking which words it contains
      • TF-IDF -> weights each word by its frequency in the sentence and its importance across the text
      • Word2Vec -> good for capturing semantic relationships between words
  • Implemented LexRank with TF-IDF and LSA with Binary Matrix

    • The original authors of LexRank use TF-IDF as the vectorization method

    • The LSA paper, which tested all the vectorization methods, found that the binary format performed best in terms of summary quality. Therefore, I used the binary matrix for LSA

    • LexRank creates more concise and clear summaries

    • LSA creates more detailed summaries

    • Both perform better than the TextRank implementation from a library installed from npm => it would be better to implement our own TextRank with word2vec and co-reference resolution

    • From my observations, LSA performed best on a very long text (3280 words)

  • Discovered Pyodide and micropip

  • Chose the GPLv3 license for plugin-ai-summarization
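
To make the vectorization and LexRank items above more concrete, here is a minimal Python sketch. It is not the plugin's code: the function names (`binary_vectors`, `tfidf_vectors`, `lexrank_scores`), the whitespace tokenizer, the smoothed IDF formula and the damping factor of 0.85 are all assumptions for illustration, and the ranking step is a simplified LexRank-style power iteration over plain cosine similarity rather than the exact published algorithm.

```python
import math
from collections import Counter


def binary_vectors(sentences):
    """Binary matrix: 1 if the vocabulary word occurs in the sentence, else 0."""
    tokenized = [set(s.lower().split()) for s in sentences]
    vocabulary = sorted(set().union(*tokenized))
    return [[1 if word in tokens else 0 for word in vocabulary] for tokens in tokenized]


def tfidf_vectors(sentences):
    """One sparse TF-IDF vector (dict: word -> weight) per sentence."""
    tokenized = [s.lower().split() for s in sentences]
    n = len(tokenized)
    # Document frequency: the number of sentences each word occurs in.
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption; several variants exist).
        vectors.append({w: c * (math.log((1 + n) / (1 + df[w])) + 1) for w, c in tf.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(word, 0.0) for word, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, damping=0.85, iterations=50):
    """Score sentences by centrality in the sentence-similarity graph."""
    vectors = tfidf_vectors(sentences)
    n = len(vectors)
    if n == 0:
        return []
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    # Row-normalize so each row is a probability distribution.
    transition = []
    for row in sim:
        total = sum(row)
        transition.append([v / total for v in row] if total else [1.0 / n] * n)
    # Power iteration with damping, as in PageRank.
    scores = [1.0 / n] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) / n
            + damping * sum(scores[i] * transition[i][j] for i in range(n))
            for j in range(n)
        ]
    return scores


sentences = [
    "notes use markdown",
    "notes use markdown and sync with plugins",
    "sync with plugins",
]
scores = lexrank_scores(sentences)
# Sentence indices ordered from most to least central.
ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
```

In this toy input the middle sentence overlaps with both of the others, so it ends up first in `ranked`. A summary would then keep the top few sentences in their original order.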

Plan

  • Implement K-Means clustering with either TF-IDF or Word2Vec:
    • [STEP 1] select K random sentence vectors (those will be the initial centroids) -> [STEP 2] assign each sentence vector to its nearest centroid -> [STEP 3] recompute the centroids and repeat until convergence -> [STEP 4] the most important sentences will be the ones closest to the centroids
  • Implement our own TextRank algorithm with word2vec (it seems to perform better according to one of the research papers I read, which I linked in the previous updates)
  • Run a first evaluation study with the Joplin community to find out which algorithm is the best
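
The four K-Means steps in the plan above can be sketched roughly as follows. This is only an illustration in plain Python: the function name `kmeans_summary`, the Euclidean distance, the fixed random seed and the toy 2-D vectors are assumptions, and real input would be TF-IDF or Word2Vec sentence vectors.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two dense vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_summary(vectors, k, seed=0, max_iterations=100):
    """Return the indices of the sentences closest to each of the k centroids."""
    rng = random.Random(seed)
    # STEP 1: pick k random sentence vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assignments = None
    for _ in range(max_iterations):
        # STEP 2: assign every sentence vector to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: euclidean(v, centroids[c])) for v in vectors
        ]
        # STEP 3: stop once the assignments no longer change (convergence).
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Move each centroid to the mean of its cluster members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignments) if a == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    # STEP 4: the sentence nearest to each centroid represents its cluster.
    chosen = {
        min(range(len(vectors)), key=lambda i: euclidean(vectors[i], centroids[c]))
        for c in range(k)
    }
    return sorted(chosen)


# Toy sentence vectors forming two well-separated groups.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
summary_indices = kmeans_summary(vectors, k=2)
```

Returning the indices in sorted order keeps the selected sentences in their original document order, which usually reads better in a summary.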

Extra:

  • LSA is currently implemented without diversity awareness (which would use Maximal Marginal Relevance) => try to understand the equation for diversity awareness and implement it
  • If time allows, find out whether we can use neuralcoref with spaCy in Joplin via Pyodide and micropip (neuralcoref · spaCy Universe) => this would most likely improve the quality of summaries significantly
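
For the diversity-awareness item above, Maximal Marginal Relevance is usually written as picking, at each step, the candidate that maximizes λ · relevance − (1 − λ) · (highest similarity to any sentence already selected). A minimal sketch of that greedy selection is below. The names `mmr_select` and `trade_off` (λ), the cosine similarity and the toy relevance scores are assumptions for illustration; in the plugin the relevance would presumably come from the LSA sentence scores.

```python
import math


def cosine(a, b):
    """Cosine similarity between two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(vectors, relevance, count, trade_off=0.7):
    """Greedily pick `count` sentence indices, balancing relevance and diversity."""
    selected = []
    candidates = list(range(len(vectors)))
    while candidates and len(selected) < count:

        def mmr(i):
            # Redundancy: similarity to the closest already-selected sentence.
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected), default=0.0
            )
            return trade_off * relevance[i] - (1 - trade_off) * redundancy

        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Sentences 0 and 1 are near-duplicates; sentence 2 covers a different topic.
vectors = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
relevance = [0.9, 0.85, 0.5]
diverse = mmr_select(vectors, relevance, count=2, trade_off=0.5)
greedy = mmr_select(vectors, relevance, count=2, trade_off=1.0)
```

With `trade_off=1.0` the selection ignores redundancy and keeps both near-duplicates, while a lower value swaps the second one for the sentence on the other topic.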

Problem

  • Encountered a problem where the natural library could not find node:async_hooks (newer Node.js versions) -> solution: when node:async_hooks is requested, use the CommonJS async_hooks module instead