Weekly Update 4-5: KMeans Clustering, Evaluation Research / Survey and Express.js in a plugin

1. Progress


  • Implemented KMeans Clustering:
    • the algorithm sometimes performs very well and sometimes does not -> the result depends on how well it converges to the threshold value (currently 0.0001, with a maximum of 1000 iterations)
    • we need to do hyperparameter tuning for the value of k (the number of clusters).
    • try word2vec -> it seems to be a better choice for KMeans Clustering
  • Understanding more in-depth how Joplin and Joplin Plugin API work:
    • Plugins are stored as .jpl. They are then loaded in PluginService.ts
    • Each profile has its own data. We can use dataDir to store the AI model's configurations, weights, etc.
  • Created an Evaluation Research / Survey to find out the best algorithm and get users' perspectives on the quality and length of summaries -> Currently under review by @Daeraxa
  • Tried once again to run Transformers.js in a plugin -> [FAILED]
  • Integrating word2vec in a plugin -> encountered some problems [RESOLVED]
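As a side note on the convergence settings above: the same thresholds map directly onto scikit-learn's KMeans, which can be handy for sanity-checking results outside the plugin (the plugin itself runs in TypeScript; this Python sketch uses random stand-in embeddings):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Stand-in for sentence embeddings (e.g. TF-IDF or word2vec vectors).
X = rng.normal(size=(40, 50))

# tol and max_iter mirror the convergence threshold and iteration cap
# mentioned in the update; k (n_clusters) is the value needing tuning.
km = KMeans(n_clusters=4, tol=0.0001, max_iter=1000, n_init=10, random_state=0)
labels = km.fit_predict(X)
```

Running it a few times with different `random_state` values shows the run-to-run variance described above.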

2. Plans


2.1 Retrospective

I wanted to create this section since next week marks Week 6 of the GSoC program, meaning we are already at the halfway point. I believe it is important to assess our current progress and compare our goals from the proposal with the plan for the next six weeks.

When I compare my progress to the plan in the proposal, I see that I am on track. However, I had a setback in Week 5 because I was not feeling very well, and there are still pending issues with using supervised methods (transformer models, LLMs, etc.).

Week 4
  • From the proposal, the goal for Week 4 is to implement text pre-processing
    • I have already done:
      • Removing stop words (commonly used words in a language, e.g. and, or, of, at) -> they create noise for the TF-IDF vectorization method
      • Removing unnecessary symbols
    • Still to add: removing more unnecessary data (links, images, etc.) -> Added to Week 6
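A minimal sketch of the pre-processing steps above (the stop-word list here is a tiny illustrative subset, not the one used in the plugin):

```python
import re

# Tiny illustrative stop-word list; a real implementation would use a fuller set.
STOP_WORDS = {"and", "or", "of", "at", "the", "a", "is"}

def preprocess(text: str) -> list[str]:
    # Strip symbols, keeping only word characters and whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    # Drop stop words, which mostly add noise for TF-IDF.
    return [tok for tok in text.split() if tok not in STOP_WORDS]

tokens = preprocess("Joplin is a note-taking app, and it works offline!")
# -> ["joplin", "note", "taking", "app", "it", "works", "offline"]
```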
Week 5
  • Create unit tests with Jest [PENDING] -> Moving to Week 6
  • Create logs (electron-log) for better system monitoring -> Added to Week 6
Week 6

[DONE]

Week 7

[DONE (except for Midterm Evaluation)]

Week 8
  • Implement functionality that displays UI pop-up windows to notify users about the
    summarization process -> Moving to Week 6
    • Notifying users when the note is being summarized
    • Notifying users when the note is successfully summarized
    • Using React or joplin.views.dialog
Week 9

[DONE in Week 6]

Week 10

[DONE]

Week 11
  • Writing documentation, guide on how to add more AI features, and report -> Moving to Week 12
  • Set up GitHub Action for Joplin AI plugin - I will discuss with mentors whether it is
    needed -> Moving to Week 12
Week 12
  • Testing the feature and making sure everything is working
  • Re-reading and making final changes to documentation and report
Extra Goals

[DONE]

2.2 New plan and goals

Week 6
  • Posting the survey in the Joplin Forum and analyzing the results
  • Updating "Overview of Extractive Summarization Techniques"
  • Discussing with @laurent the problems encountered when running some ML packages in a plugin, WASM, and the option of creating a new Plugin API
  • @Daeraxa found an edge case with summarizing notebooks
    • summarize only the parent's immediate children, not the whole subtree.
      Assume that (1) is a parent (notebook) and (2) and (4) are its direct children (notes). When (1) is clicked, only summarize (2) and (4), excluding (5, 6, 7, 8).
  • Remove unnecessary data, such as links, images, etc. for text (notes) pre-processing
  • Create unit tests with Jest
  • Create logs (electron-log) for better system monitoring
  • Implement functionality that displays UI pop-up windows to notify users about the
    summarization process
    • Notifying users when the note is being summarized
    • Notifying users when the note is successfully summarized
    • Using React or joplin.views.dialog
  • Update README.md in plugin-ai-summarization repository on Github
  • (implementing new Joplin Plugin API?)
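The notebook edge case above comes down to filtering by parent ID. A rough sketch in Python (the plugin itself would go through the Joplin data API; the records below only mimic its `parent_id` field, and the helper name is made up for illustration):

```python
# Each item mimics a Joplin data API record: notes carry the id of their parent
# notebook in parent_id. Note "5" lives in a sub-notebook ("3"), not in "1".
items = [
    {"id": "2", "type": "note", "parent_id": "1"},
    {"id": "4", "type": "note", "parent_id": "1"},
    {"id": "5", "type": "note", "parent_id": "3"},
]

def immediate_child_notes(items, notebook_id):
    # Keep only notes whose parent is the clicked notebook itself,
    # excluding anything nested inside sub-notebooks.
    return [it["id"] for it in items
            if it["type"] == "note" and it["parent_id"] == notebook_id]

ids = immediate_child_notes(items, "1")  # -> ["2", "4"]
```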
Week 7
  • Midterm Evaluation
  • Add an option to summarize per section (#, ##, etc.)
  • Add settings for users to choose summarization algorithms (default: the best unsupervised algorithm)
  • Add info to let users know how to use the plugin
  • (implementing new Joplin Plugin API?)
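The per-section summarization option above could be built on a simple heading-based splitter. A sketch of the idea (not the plugin's actual implementation):

```python
import re

def split_sections(markdown: str):
    """Split a note into (heading, body) pairs on #/## style headings."""
    sections, current_head, current_body = [], None, []
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # Close the previous section before starting a new one.
            if current_head is not None or current_body:
                sections.append((current_head, "\n".join(current_body).strip()))
            current_head, current_body = m.group(2), []
        else:
            current_body.append(line)
    sections.append((current_head, "\n".join(current_body).strip()))
    return sections

parts = split_sections("# Intro\nSome text.\n## Details\nMore text.")
# -> [("Intro", "Some text."), ("Details", "More text.")]
```

Each (heading, body) pair can then be summarized independently.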
Week 8
  • More UI/UX
Week 9
  • Improving unsupervised methods for extractive summarization with:
    • word2vec
    • co-reference resolution
  • Creating another evaluation research/survey
Week 10

[HOLIDAY]

  • Analyze results from the survey
Week 11

[BUFFER PERIOD]

Week 12
  • Writing documentation, guide on how to add more AI features, and report
  • Testing the feature and making sure everything is working
  • Re-reading and making final changes to documentation and report
  • (Set up GitHub Action for Joplin AI plugin - I will discuss with mentors whether it is
    needed)

3. Problems



Thanks for the update. Have you been able to release a first version of the plugin yet? The earlier, the better, since then you can share it here on the forum and potentially get feedback from users.

I mention this because I was wondering if that would help your use case. Since running WASM from a plugin seems tricky, maybe we could create an API where you'd pass a WASM file and the app would load it for you. For example, we couldn't get Tesseract to work from a plugin, but it works fine from the app (it's currently integrated into it).

Do you think that would be useful? Or did you manage to get a different approach working? (I see you've tried many different approaches so I'm not sure where you are at this point)


From our meeting yesterday, it seems we haven't managed to get any of these bundled into the plugin just yet: the issues were originally with native modules and now, it seems, with WASM (which was adopted to get around the native module issues in the first place).

I don't believe a .jpl has been published yet but the repo is being pushed to fairly regularly.

As for feedback, we are looking for some community engagement (though not depending on it) in the form of a poll regarding the summaries, which I'm just looking over before it is posted.


Disclaimer: I've just stumbled upon this and have only read the OP, so feel free to ignore this post if my advice doesn't apply / makes no sense.

Clustering algorithms such as k-means often struggle with high-dimensional data, and word2vec gives high-dimensional embeddings. Dimensionality reduction helps in such cases. UMAP is a state-of-the-art dimensionality reduction algorithm that has been shown to significantly improve clustering performance on high-dimensional data. t-SNE is an alternative that doesn't perform quite as well, but it has been around longer and it's easier to find libraries for. Using dimensionality reduction as a preprocessing step could, however, slow down the clustering by quite a bit.

You might also want to look at density-based clustering algorithms; I recommend HDBSCAN. This way the clusters don't have to be hyper-elliptical in shape, and you're able to detect noise. HDBSCAN's hyperparameters are very intuitive, and you might be able to set them to fixed values without needing to tune them every time.
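A minimal sketch of this reduce-then-cluster pipeline using scikit-learn only: t-SNE stands in for UMAP (which lives in the separate umap-learn package but follows the same fit/transform shape), and DBSCAN stands in for HDBSCAN (hdbscan package, or `sklearn.cluster.HDBSCAN` in scikit-learn >= 1.3). The embeddings are random stand-in data:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
# Stand-in for ~100-dimensional word2vec sentence embeddings.
X = rng.normal(size=(60, 100))

# Reduce to 2D first; this is where UMAP would slot in instead.
reduced = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(X)

# Density-based clustering on the reduced data. No k to choose;
# the label -1 marks points treated as noise.
labels = DBSCAN(eps=3.0, min_samples=3).fit_predict(reduced)
```

With a 2D target dimension the `reduced` array can also be plotted directly to eyeball the clusters.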

In any case, this looks like a really cool project and I wish you the best of luck with it.


Seconding the recommendation for dimensionality reduction: while it is an extra step, it's well worth it to get the data into a lower-dimensional form. In fact, this is such a common issue that it has a name, the "Curse of Dimensionality". It's the bane of the lives of data scientists like myself lol. (I'm being a little dramatic, but it is something you need to be aware of.)

For anyone unfamiliar, the reason it happens is that distance metrics scale with the number of dimensions they're used in. So if we're on a 1D line and you're 1 unit away, your distance from me is of course 1 unit. But if it's a 2D plane and you're 1 unit away on both the x and y axes, your distance from me is (1+1)^(1/2) = 1.41 units. If we're in 3D space and you're 1 unit away in x, y, and z, you're now (1+1+1)^(1/2) = 1.73 units away. So even closely related things get further apart the more dimensions they exist in. Since word2vec uses between dozens and often hundreds of dimensions depending on the version, you can see how this becomes an issue. (While you could use a lower-dimensional version to slightly reduce the issue, you'd also be degrading the quality of the output in the process: each dimension can be thought of as a way in which you measure a term, so the fewer dimensions you use, the less "expressive" the encoding becomes.)
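The arithmetic above can be checked in a couple of lines: the Euclidean distance to a point that is 1 unit away along every axis grows as sqrt(d) with the dimension d.

```python
import numpy as np

# Distance from the origin to the all-ones point in d dimensions.
for d in (1, 2, 3, 100):
    dist = np.linalg.norm(np.ones(d))  # equals sqrt(d)
    print(d, round(float(dist), 2))
```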

I'd also agree with moving away from k-means and using something like HDBSCAN, or honestly any other hierarchical clustering algorithm. One of the big limitations of k-means imo is that you need to specify the number of clusters you want. While this can be useful in some scenarios, if you're doing something like text summarisation it's good to have some flexibility: some text might be very broad in scope and cover a lot of topics, while some might be very focused. Having an algorithm that can cope with those different scenarios is much preferable to running k-means a bunch of times imo.

Otherwise looks like a really interesting project! Can't wait to see how it turns out!


Hi Laurent, we haven't released the first version yet, but I agree that it's a good idea to do so! I will discuss this with Daeraxa, and then we will decide when to do it. I think next week would work!


Alright, I got it! I glanced at Tesseract.js and OcrService.ts, and it seems that this approach could help get Transformers.js up and running in Joplin.

Do I need to create a web worker to communicate with the plugin via Joplin's Plugin API? Is that the idea? I haven't looked into it deeply yet.

Thank you for your input, it is really helpful! I will definitely look into HDBSCAN and other clustering algorithms.

I still have to look more into dimensionality reduction, but what do you think about using PCA? Furthermore, how much should we reduce the dimensionality? Or does UMAP already do that optimally while preserving the quality of the data?

Right, got it! I was having problems finding the right k. As you said, the text might be very broad in scope, so even if you find a good k for certain types of text, it won't work for others. It's a good idea to explore other clustering algorithms. Thank you to both of you for making me realize that.

Anyway, I like your explanation of the curse of dimensionality! You covered that well.


For anyone interested, I am keeping track of all the unsupervised algorithms for extractive summarization I have used so far in this topic: Overview of Unsupervised Methods for Extractive Summarization


Yes something like this I think, the app would load the WASM file using a web worker.

But looking at the Tesseract loading code, I'm wondering if it's possible to generalize this. @HahaBill, would you mind creating minimal loading code for the lib you want to use? Then we can see exactly what API would be needed.


Alright, I will try to implement that! I will start once we release the first version of the plugin and after the midterm evaluation.

Sorry for the late reply, I'm really busy at the moment.

I don't think PCA will be very helpful in this scenario, since it's linear, whereas UMAP and t-SNE are nonlinear. The graph-layout technique UMAP uses to construct the low-dimensional representation usually creates clumps of data points that lend themselves to clustering. There are a few caveats here, and UMAP is primarily focused on visualization, not clustering, but it has been shown to work really well as a preprocessing step for clustering high-dimensional data.

As for the dimensionality: you might be able to find benchmark studies that looked at clustering performance using different dimension values for the embeddings, but this is probably highly data-dependent. I would keep the target dimension low (2 or 3), as you can then also plot the results. From my experience (which is very limited), clustering performance isn't all that much affected by the choice of target dimension, but YMMV and it might be a good idea to tinker around a bit.


I think it's really important to stress this point.

From my time studying ML/data science/etc. in academia, you can't overstate how much just fiddling about and trying stuff is an important step. My PhD supervisor once phrased it as "machine learning is closer to alchemy than chemistry". There's as much an art as there is a science to doing this stuff well.

So, if you're trying to evaluate whether method A or method B is best, try both!


@BioFacLay @Imperial_Squid Thank you so much again for your input. I greatly appreciate it!! I am currently focusing on the UI/UX part of the plugin. I will look back into AI in a few weeks and let you know my findings and results.

In the meantime, I created a survey to help me create better summaries for Joplin's community: Help Shape our AI Summaries: Participate in Survey #1

To anyone else reading this: if you find a little time, it would be super nice if you could look at the survey and answer the questions. It would help me a lot!!

If users have survey feedback, how should they send it? I would reply to the thread, but I don't want to influence others' votes while the survey is still running.


Hi, feel free to provide survey feedback by replying to the thread!!