Weekly Update 4-5: KMeans Clustering, Evaluation Research / Survey and Express.js in a plugin

1. Progress


  • Implemented KMeans Clustering:
    • the algorithm sometimes performs very well and sometimes does not -> the result depends on how well it converges to the threshold value (currently 0.0001, with a maximum of 1000 iterations)
    • we need to do hyperparameter tuning for the value of k (the number of clusters).
    • try word2vec -> it seems to be a better choice for KMeans Clustering
  • Understanding more in-depth how Joplin and Joplin Plugin API work:
    • Plugins are stored as .jpl. They are then loaded in PluginService.ts
    • Each profile has its own data. We can use dataDir to store the AI model's configurations, weights, etc.
  • Created an Evaluation Research / Survey to find out the best algorithm and get users' perspectives on the quality and length of summaries -> Currently under review by @Daeraxa
  • Tried once again to run Transformers.js in a plugin -> [FAILED]
  • Integrating word2vec in a plugin -> encountered some problems [RESOLVED]
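As a side note on the convergence settings above: the same thresholds map directly onto scikit-learn's KMeans, which can be handy for sanity-checking results outside the plugin (the plugin itself runs in TypeScript; this Python sketch uses random stand-in embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for sentence embeddings (e.g. TF-IDF or word2vec vectors).
X = rng.normal(size=(40, 50))

# tol and max_iter mirror the convergence threshold and iteration cap
# mentioned in the update; k (n_clusters) is the value needing tuning.
km = KMeans(n_clusters=4, tol=0.0001, max_iter=1000, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

Running it a few times with different `random_state` values shows the run-to-run variance described above.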

2. Plans


2.1 Retrospective

I wanted to create this section since next week marks Week 6 of the GSoC program, meaning we are already at the halfway point. I believe it is important to assess our current progress and compare our goals from the proposal with the plan for the next six weeks.

When I compare my progress to the plan in the proposal, I see that I am on track. However, I had a setback in Week 5 because I was not feeling very well, and there are still pending issues with using supervised methods (transformer models, LLMs, etc.).

Week 4
  • From the proposal, the goal for Week 4 is to implement text pre-processing
    • I have already done:
      • Removing stop words (commonly used words in a language, e.g. and, or, of, at) -> they create noise for the TF-IDF vectorization method
      • Removing unnecessary symbols
    • Still to add: removing more unnecessary data (links, images, etc.) -> Added to Week 6
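A minimal sketch of the pre-processing steps above (the stop-word list here is a tiny illustrative subset, not the one used in the plugin):

```python
import re

# Tiny illustrative stop-word list; a real implementation would use a fuller set.
STOP_WORDS = {"and", "or", "of", "at", "the", "a", "is"}

def preprocess(text: str) -> list[str]:
    # Strip symbols, keeping only word characters and whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Drop stop words, which mostly add noise for TF-IDF.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess("Joplin is a note-taking app, and it works offline!")
# -> ["joplin", "note", "taking", "app", "it", "works", "offline"]
```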
Week 5
  • Create unit tests with Jest [PENDING] -> Moving to Week 6
  • Create logs (electron-log) for better system monitoring -> Added to Week 6
Week 6

[DONE]

Week 7

[DONE (except for Midterm Evaluation)]

Week 8
  • Implement functionality that displays UI pop-up windows to notify users about the
    summarization process -> Moving to Week 6
    • Notifying users when the note is being summarized
    • Notifying users when the note is successfully summarized
    • Using React or joplin.views.dialog
Week 9

[DONE in Week 6]

Week 10

[DONE]

Week 11
  • Writing documentation, guide on how to add more AI features, and report -> Moving to Week 12
  • Set up GitHub Action for Joplin AI plugin - I will discuss with mentors whether it is
    needed -> Moving to Week 12
Week 12
  • Testing the feature and making sure everything is working
  • Re-reading and making final changes to documentation and report
Extra Goals

[DONE]

2.2 New plan and goals

Week 6
  • Posting the survey in the Joplin Forum and analyzing the results
  • Updating "Overview of Extractive Summarization Techniques"
  • Discussing with @laurent the problems encountered when running some ML packages in a plugin, WASM, and the option of creating a new Plugin API
  • @Daeraxa found an edge case with summarizing notebooks
    • summarize only the parent's immediate children, not the whole subtree.
      Assume that (1) is a parent (notebook) and (2) and (4) are its direct children (notes). When (1) is clicked, only summarize (2) and (4), excluding (5, 6, 7, 8).
  • Remove unnecessary data, such as links, images, etc. for text (notes) pre-processing
  • Create unit tests with Jest
  • Create logs (electron-log) for better system monitoring
  • Implement functionality that displays UI pop-up windows to notify users about the
    summarization process
    • Notifying users when the note is being summarized
    • Notifying users when the note is successfully summarized
    • Using React or joplin.views.dialog
  • Update README.md in plugin-ai-summarization repository on Github
  • (implementing new Joplin Plugin API?)
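The notebook edge case above comes down to filtering by parent ID. A rough sketch in Python (the plugin itself would go through the Joplin data API; the records below only mimic its `parent_id` field, and the helper name is made up for illustration):

```python
# Each item mimics a Joplin data API record: notes carry the id of their parent
# notebook in parent_id. Note "5" lives in a sub-notebook ("3"), not in "1".
items = [
    {"id": "2", "type": "note", "parent_id": "1"},
    {"id": "4", "type": "note", "parent_id": "1"},
    {"id": "5", "type": "note", "parent_id": "3"},
]

def immediate_child_notes(items, notebook_id):
    # Keep only notes whose parent is the clicked notebook itself,
    # excluding anything nested inside sub-notebooks.
    return [it["id"] for it in items
            if it["type"] == "note" and it["parent_id"] == notebook_id]

ids = immediate_child_notes(items, "1")  # -> ["2", "4"]
```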
Week 7
  • Midterm Evaluation
  • Add an option to summarize per section (#, ##, etc.)
  • Add settings for users to choose summarization algorithms (default: the best unsupervised algorithm)
  • Add info to let users know how to use the plugin
  • (implementing new Joplin Plugin API?)
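The per-section summarization option above could be built on a simple heading-based splitter. A sketch of the idea (not the plugin's actual implementation):

```python
import re

def split_sections(markdown: str):
    """Split a note into (heading, body) pairs on #/## style headings."""
    sections, current_head, current_body = [], None, []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # Close the previous section before starting a new one.
            if current_head is not None or current_body:
                sections.append((current_head, "\n".join(current_body).strip()))
            current_head, current_body = m.group(2), []
        else:
            current_body.append(line)
    sections.append((current_head, "\n".join(current_body).strip()))
    return sections

parts = split_sections("# Intro\nSome text.\n## Details\nMore text.")
# -> [("Intro", "Some text."), ("Details", "More text.")]
```

Each (heading, body) pair can then be summarized independently.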
Week 8
  • More UI/UX
Week 9
  • Improving unsupervised methods for extractive summarization with:
    • word2vec
    • co-reference resolution
  • Creating another evaluation research/survey
Week 10

[HOLIDAY]

  • Analyze results from the survey
Week 11

[BUFFER PERIOD]

Week 12
  • Writing documentation, guide on how to add more AI features, and report
  • Testing the feature and making sure everything is working
  • Re-reading and making final changes to documentation and report
  • (Set up GitHub Action for Joplin AI plugin - I will discuss with mentors whether it is
    needed)

3. Problems



Thanks for the update. Have you been able to release a first version of the plugin yet? The earlier, the better, since then you can share it here on the forum and potentially get feedback from users.

I mention this because I was wondering if that would help your use case. Since running WASM from a plugin seems tricky, maybe we could create an API where you'd pass a WASM file and the app would load it for you. For example, we couldn't get Tesseract to work from a plugin, but it works fine from the app (it's currently integrated into it).

Do you think that would be useful? Or did you manage to get a different approach working? (I see you've tried many different approaches so I'm not sure where you are at this point)


From our meeting yesterday, it seems we haven't managed to get any of these bundled into the plugin just yet: the issues were originally with native modules and now, it seems, with WASM (which was adopted to get around the native module issues in the first place).

I don't believe a .jpl has been published yet but the repo is being pushed to fairly regularly.

As for feedback, we are looking for some community engagement (though not depending on it) in the form of a poll regarding the summaries, which I'm just looking over before it is posted.


Disclaimer: I've just stumbled upon this and have only read the OP, so feel free to ignore this post if my advice doesn't apply / makes no sense.

Clustering algorithms such as k-means often struggle with high-dimensional data, and word2vec gives high-dimensional embeddings. Dimensionality reduction helps in such cases. UMAP is a state-of-the-art dimensionality reduction algorithm that has been shown to significantly improve clustering performance on high-dimensional data. t-SNE is an alternative that doesn't perform quite as well, but it has been around longer and it's easier to find libraries for. Using dimensionality reduction as a preprocessing step could, however, slow down the clustering by quite a bit.

You might also want to look at density-based clustering algorithms; I recommend HDBSCAN. This way the clusters don't have to be hyper-elliptical in shape, and you're able to detect noise. HDBSCAN's hyperparameters are very intuitive, and you might be able to set them to fixed values without needing to tune them every time.
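A minimal sketch of this reduce-then-cluster pipeline using scikit-learn only: t-SNE stands in for UMAP (which lives in the separate umap-learn package but follows the same fit/transform shape), and DBSCAN stands in for HDBSCAN (hdbscan package, or `sklearn.cluster.HDBSCAN` in scikit-learn >= 1.3). The embeddings are random stand-in data:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Stand-in for ~100-dimensional word2vec sentence embeddings.
X = rng.normal(size=(60, 100))

# Reduce to 2D first; this is where UMAP would slot in instead.
reduced = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

# Density-based clustering on the reduced data. No k to choose;
# the label -1 marks points treated as noise.
labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(reduced)
```

With a 2D target dimension the `reduced` array can also be plotted directly to eyeball the clusters.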

In any case, this looks like a really cool project and I wish you the best of luck with it.


Seconding the recommendation for dimensionality reduction: while it is an extra step, it's well worth it to get the data into a lower-dimensional form. In fact, this is such a common issue that it has a name, the "Curse of Dimensionality". It's the bane of the lives of data scientists like myself lol. (I'm being a little dramatic, but it is something you need to be aware of.)

For anyone unfamiliar, the reason it happens is that distance metrics scale with the number of dimensions they're used in. So if we're on a 1D line and you're 1 unit away, your distance from me is of course 1 unit. But if it's a 2D plane and you're 1 unit away on both the x and y axes, your distance from me is (1+1)^(1/2) = 1.41 units. If we're in 3D space and you're 1 unit away in x, y, and z, you're now (1+1+1)^(1/2) = 1.73 units away. So even closely related things get further apart the more dimensions they exist in. Since word2vec uses between dozens and often hundreds of dimensions depending on the version, you can see how this becomes an issue. (While you could use a lower-dimensional version to slightly reduce the issue, you'd also be degrading the quality of the output in the process: each dimension can be thought of as a way in which you measure a term, so the fewer dimensions you use, the less "expressive" the encoding becomes.)
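The arithmetic above can be checked in a couple of lines: the Euclidean distance to a point that is 1 unit away along every axis grows as sqrt(d) with the dimension d.

```python
import numpy as np

# Distance from the origin to the all-ones point in d dimensions.
for d in (1, 2, 3, 100):
    dist = np.linalg.norm(np.ones(d))  # equals sqrt(d)
    print(d, round(float(dist), 2))
```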

I'd also agree with moving away from k-means and using something like HDBSCAN, or honestly any other hierarchical clustering algorithm. One of the big limitations of k-means imo is that you need to specify the number of clusters you want. While this can be useful in some scenarios, if you're doing something like text summarisation it's good to have some flexibility: some text might be very broad in scope and cover a lot of topics, while some might be very focused. Having an algorithm that can cope with those different scenarios is much preferable to running k-means a bunch of times imo.

Otherwise looks like a really interesting project! Can't wait to see how it turns out!


Hi Laurent, we haven't released the first version yet, but I agree that it's a good idea to do so! I will discuss this with Daeraxa, and then we will decide when to do it. I think next week would work!


Alright, I got it! I glanced at Tesseract.js and OcrService.ts, and it seems that this approach could help get Transformers.js up and running in Joplin.

Do I need to create a web worker to communicate with the plugin via Joplin's Plugin API? Is that the idea? I haven't looked into it deeply yet.

Thank you for your input, it is really helpful! I will definitely look into HDBSCAN and other clustering algorithms.

I still have to look more into dimensionality reduction, but what do you think about using PCA? Furthermore, how much should we reduce the dimensionality? Or does UMAP already do that optimally while preserving the quality of the data?

Right, got it! I was having problems finding the right k. As you said, the text might be very broad in scope, so even if you find a good k for certain types of text, it won't work for others. It's a good idea to explore other clustering algorithms. Thank you to both of you for making me realize that.

Anyway, I like your explanation of the curse of dimensionality! You covered that well.


For anyone interested, I am keeping track of all the unsupervised algorithms for extractive summarization I have used so far in this topic: Overview of Unsupervised Methods for Extractive Summarization


Yes something like this I think, the app would load the WASM file using a web worker.

But looking at the Tesseract loading code, I'm wondering if it's possible to generalize this. @HahaBill, would you mind creating minimal loading code for the lib you want to use? Then we can see exactly what API would be needed.


Alright, I will try to implement that! I will start once we release the first version of the plugin and after the midterm evaluation.

Sorry for the late reply, I'm really busy at the moment.

I don't think PCA will be very helpful in this scenario, since it's linear, whereas UMAP and t-SNE are nonlinear. The graph-layout technique UMAP uses to construct the low-dimensional representation usually creates clumps of data points that lend themselves to clustering. There are a few caveats here, and UMAP is primarily focused on visualization, not clustering, but it has been shown to work really well as a preprocessing step for clustering high-dimensional data.

As for the dimensionality: you might be able to find benchmark studies that looked at clustering performance using different dimension values for the embeddings, but this is probably highly data-dependent. I would keep the target dimension low (2 or 3), as you can then also plot the results. From my experience (which is very limited), clustering performance isn't all that much affected by the choice of target dimension, but YMMV and it might be a good idea to tinker around a bit.


I think it's really important to stress this point.

From my time studying ML/data science/etc. in academia, you can't overstate how much just fiddling about and trying stuff is an important step. My PhD supervisor once phrased it as "machine learning is closer to alchemy than chemistry". There's as much an art as there is a science to doing this stuff well.

So, if you're trying to evaluate whether method A or method B is best, try both!


@BioFacLay @Imperial_Squid Thank you so much again for your input. I greatly appreciate it!! I am currently focusing on the UI/UX part of the plugin. I will look back into AI in a few weeks and let you know my findings and results.

In the meantime, I created a survey to help me create better summaries for Joplin's community: Help Shape our AI Summaries: Participate in Survey #1

To anyone else reading this: if you find a little time, it would be super nice if you could look at the survey and answer the questions. It would help me a lot!!

If users have survey feedback, how should they send it? I would reply to the thread, but I don't want to influence others' votes while the survey is still running.


Hi, feel free to provide survey feedback by replying to the thread!!