Plugin: Semantically Similar Notes (beta)

tl;dr: this plugin displays notes that are semantically similar to the one you are currently viewing, with similarity being determined by a neural network pre-trained on English words.

(Please note that upon first install, the plugin will immediately start encoding all of your notes into embeddings using the full processing power of your computer, which will probably make your machine unresponsive during this time. If you have a lot of notes, this can take a few minutes. If you have large notes (500+ KB of text), it might take a really long time.)


I find it immensely productive to link notes together as part of my writing/thinking process, but I can't always remember all of the notes I have (I recently surpassed 1000 notes in my corpus), and even when I know a note exists, finding it can require a flow-breaking hunt (e.g. if I can't remember its name).

To help mitigate this, I've written a plugin that searches through my notes for other notes that are semantically similar. It ranks them according to how similar they are to the currently selected one and links the most similar ones in a UI panel. I'd eventually really love the ability to have Joplin automatically suggest notes that could be linked together. I consider this plugin a step towards that! Check out some screenshots:

Screenshot 1:


In this screenshot, I consider the notes within the green outline incredibly relevant to the note I'm working on. The other suggestions are also quite relevant, but less directly helpful.

Screenshot 2:


There isn't much to go off of in this note, but the model still finds relevant notes in my corpus via the title and url in the body. The top suggestion might be the only directly relevant one, but the others are still nice to be reminded of.

I've found this plugin increases my ability to iterate on and refine my "second brain" as a whole; the suggested notes are not so mentally distant (as determined artificially by a neural network), and therefore I don't need to context switch as much to consume and produce thoughts. It can actually feel a bit addictive at times to click through my lists of similar notes, add a blurb capturing how one /actually/ relates to another (or create a new note linking both), and repeat!

As a bonus feature, this plugin can be used as a very crude semantic search (with a really bad UI): create a note containing your search query, select a different note, then select your newly created note, and you'll see the notes most similar to your query. I don't really use it this way and don't vouch for its performance. I think better semantic search algorithms generally take into account the fact that the query is very small compared to the size of each returned result.

The technology powering the semanticness is the Universal Sentence Encoder Lite model via tensorflow.js. I'd eventually like to experiment with other models, but this model is certainly sufficient for my current needs. From the USE lite tfjs github page: "The Universal Sentence Encoder (Cer et al., 2018) (USE) is a model that encodes text into 512-dimensional embeddings ... This [TensorFlow.js] module is a GraphModel converted from the USE lite module on TFHub, a lightweight version of the original. The lite model is based on the Transformer (Vaswani et al, 2017) architecture, and uses an 8k [English] word piece vocabulary."
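
For a rough idea of what's happening under the hood, here is a minimal sketch (not the plugin's exact code; the function name is illustrative) of embedding text with the tfjs USE model and ranking notes by cosine similarity:

```ts
import * as tf from '@tensorflow/tfjs';
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Embed a query note plus candidate notes, then score each candidate by
// cosine similarity against the query. Higher score = more similar.
async function rankBySimilarity(queryText: string, noteTexts: string[]): Promise<number[]> {
  const model = await use.load();                                  // loads the lite USE model
  const embeddings = await model.embed([queryText, ...noteTexts]); // shape [n + 1, 512]

  const scores = tf.tidy(() => {
    const normalized = tf.div(embeddings, tf.norm(embeddings, 2, 1, true)); // unit-length rows
    const query = normalized.slice([0, 0], [1, 512]);              // first row = query embedding
    const notes = normalized.slice([1, 0]);                        // remaining rows = candidates
    return tf.matMul(notes, query, false, true).squeeze();         // dot product = cosine similarity
  });

  const result = Array.from(await scores.data());
  embeddings.dispose();
  scores.dispose();
  return result;
}
```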

It's available in the Joplin plugin repo (search in your Joplin app for the name). And here's a link to the plugin source code: semantic-joplin/similar-notes at master · marcgreen/semantic-joplin · GitHub

At the time of posting this, I consider the plugin to be something like a beta version: although there are a few obvious outstanding bugs/issues, there is enough functionality for it to be useful in my own daily life. I'm tracking the full list of known bugs and potential features in the README (please read it before installing, especially if you have large notes), and I'd love to hear your feature requests, bug reports, and thoughts in general!

15 Likes

Woah. That's really neat. Thanks for making and sharing!

Would you consider offering a built version? Lots of folks have no idea how to build one (myself included)

1 Like

It's available in the Joplin plugin repo, if that's what you mean. I just updated the main post to make this clear :+1:

plugin in repo

2 Likes

Ah, I assumed it wasn't in the repo. Sorry about that.

So I installed and I think it's doing the calculations.

How do we toggle the similarity panel? Is there an icon, shortcut key or command in the command palette? It would be great to have all three since different people use different methods. Here is my toolbar and I don't see an icon that toggles it:

I looked at the Font Awesome site and I couldn't find one that I thought would work. I hoped they would have ≈ for "approximately equal", but the closest thing they had was bacon. :laughing:

Also, it's been about 10 minutes and I'm not seeing anything other than:
image

I've restarted several times to see if that was needed. None of my notes are very long. I think my backup folder is less than 6 gb including images.

1 Like

Very nice!

Might be worth introducing some throttling, because when I first ran Joplin with the plugin it made my (reasonably powerful) desktop unusable for ~5 minutes.
And then this happened:

Logger.ts:217 12:07:07: joplin.plugins: Uncaught exception in plugin "com.github.marcgreen.joplin-plugin-semantically-similar-notes": Error: Failed to compile fragment shader.
    at bv (/home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:12024)
    at Nw.createProgram (/home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:12234)
    at /home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:12379
    at /home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:12379
    at jx.getAndSaveBinary (/home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:12379)
    at jx.runWebGLProgram (/home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:12379)
    at Object.kernelFunc (/home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:12493)
    at i (/home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:726)
    at /home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:726
    at y.scopedRun (/home/roman/.config/joplin-desktop/tmp/plugin_com.github.marcgreen.joplin-plugin-semantically-similar-notes.js:726)

I don't even have that many notes (1056) but a good part of them aren't in English, maybe that was a factor too.

2 Likes

Super interesting! Right away I'm thinking it would be great if a link were made to this new plugin :thinking: (@ylc395)

2 Likes

@whitewall, there is currently no way to toggle the panel. I didn't have time to add that in the two weeks I was given to work on this, but I can add it to the feature list as a requested item :+1:. I'm not sure why your screenshot makes it look like it's not even starting to encode your notes into embeddings... I'll think about it more later (when I'm not at work).

@roman_r_m heh, yeah, throttling would be a good feature to have. I'll add that to the feature list in the README; for now, I can make it clear in the 'first use' warning that this might happen. Thanks for the traceback of the uncaught exception - it looks like it might be related to the WebGL backend of TensorFlow. To confirm: the plugin isn't working for you because of this, right, even if you restart?

1 Like

Seems so, yes. The CPU fan no longer spins when I start Joplin, and the numbers in the side panels stay the same.

1 Like

It's been several hours and it still looks like it did in the screenshot above.

Please let me know if you need specific debugging information from me. I'm happy to help. Of course I realize this is a side project for you. But I'm happy to test things if you need it.

1 Like

@whitewall Thanks for your understanding, and thanks for offering your help! I think knowing the make and model of your operating system, CPU, and GPU would be a good start. I assume the problems are coming from TensorFlow and its backend, which is why I'm asking for this information.

@roman_r_m Gotcha. Does it not progress at all, like in whitewall's case, i.e. is it stuck at 0?

From the code it looks like you load all the notes into memory and then process them, is that right? In that case you might instead want to process them in batches - the API returns 100 notes by default, so you'd process those, then fetch the next 100 notes, and so on.

Some people have many thousands of notes, so loading them all into memory wouldn't work.
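
Roughly something like this (just a sketch; the function and callback names are placeholders):

```ts
import joplin from 'api';

// Fetch and process notes one page (100 by default) at a time instead of
// loading the whole corpus into memory up front.
async function processNotesInBatches(processBatch: (notes: any[]) => Promise<void>) {
  let page = 1;
  while (true) {
    const response = await joplin.data.get(['notes'], {
      fields: ['id', 'title', 'body'],
      page,
    });
    await processBatch(response.items); // encode this batch, then let it be garbage collected
    if (!response.has_more) break;      // has_more is false once the last page is reached
    page += 1;
  }
}
```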

1 Like

Not at 0; it processed a few dozen notes before crashing on the first start. Since then it has stayed the same.

1 Like

Joplin 2.4.9 (prod, win32)

Client ID: 55ce0b2fbc8e4c0c8c573a6923b41b59
Sync Version: 3
Profile Version: 39
Keychain Supported: Yes
Revision: bb44c4e

System Details
OS Name Microsoft Windows 10 Home
Version 10.0.19042 Build 19042
Other OS Description Not Available
OS Manufacturer Microsoft Corporation
System Name xxx
System Manufacturer ASUSTeK COMPUTER INC.
System Model TP410UAR
System Type x64-based PC
System SKU
Processor Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz, 1800 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Version/Date American Megatrends Inc. TP410UAR.309, 4/19/2019
SMBIOS Version 3.0
Embedded Controller Version 255.255
BIOS Mode UEFI
BaseBoard Manufacturer ASUSTeK COMPUTER INC.
BaseBoard Product TP410UAR
BaseBoard Version 1.0
Platform Role Mobile
Secure Boot State On
PCR7 Configuration Elevation Required to View
Windows Directory C:\WINDOWS
System Directory C:\WINDOWS\system32
Boot Device \Device\HarddiskVolume1
Locale United States
Hardware Abstraction Layer Version = 10.0.19041.1151
User Name DESKTOP-8CJ84UQ\learningtoletgo
Time Zone Sri Lanka Standard Time
Installed Physical Memory (RAM) 8.00 GB
Total Physical Memory 7.89 GB
Available Physical Memory 1.15 GB
Total Virtual Memory 16.6 GB
Available Virtual Memory 5.83 GB
Page File Space 8.70 GB
Page File C:\pagefile.sys
Kernel DMA Protection Off
Virtualization-based security Not enabled
Device Encryption Support Elevation Required to View
Hyper-V - VM Monitor Mode Extensions Yes
Hyper-V - Second Level Address Translation Extensions Yes
Hyper-V - Virtualization Enabled in Firmware Yes
Hyper-V - Data Execution Protection Yes

GPU is Intel UHD Graphics 620

The latest .jpx backup is ~~7 gb~~ 7 MB!

1 Like

@laurent you're correct, and I agree with you. I knew I eventually wanted to support out-of-core processing, but had put it out of scope for the first iteration. Since you brought it up, and since it might be related to the issues whitewall and roman_r_m report, I spent some time tonight thinking through the modifications I would need to make to support this, and now it's just a matter of finding the time to implement the redesign.

By the way, is there an onShutdown() event? I didn't see one in the API docs, but if there were one, that's where I would want to call TensorFlow's .dispose() method for the model I load and the tensors I create. Using the WebGL backend for tensorflow.js comes with the responsibility of managing our own tensor memory (i.e., calling .dispose() when we're done with them). If I could dispose of the tensors onShutdown, I think I could avoid a couple of transfers of data between GPU and CPU, which I hear is a good thing to do. I don't actually know to what extent this would improve performance; it was just a thought.
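
For context, here's a rough sketch (illustrative names, not the plugin's exact code) of the pattern I mean today: copy each embedding into CPU memory and dispose of the GPU tensor right away, which is the kind of transfer I was hoping an onShutdown() hook might let me avoid.

```ts
import * as tf from '@tensorflow/tfjs';
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Encode note bodies one at a time, pulling each embedding off the WebGL
// backend and freeing the GPU texture immediately.
async function embedCorpus(noteBodies: string[]): Promise<Float32Array[]> {
  const model = await use.load();
  const results: Float32Array[] = [];
  for (const body of noteBodies) {
    const embedding = await model.embed([body]);            // tensor lives on the GPU
    results.push((await embedding.data()) as Float32Array); // copy the 512 floats to CPU memory...
    embedding.dispose();                                    // ...then release the GPU memory
  }
  return results;                                           // nothing left to dispose at shutdown
}
```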

@roman_r_m I searched that error and one result indeed suggested it implies too much memory is in use. By the time the plugin starts batching the creation of note embeddings, all notes are already loaded in memory, so the fact that you see embeddings start to be created indicates your entire corpus can fit in your system memory. If I had to guess, it might be that you have a particularly large note or set of notes in a certain batch, and maybe the model needs more memory than is available to encode them. It is very possible that implementing laurent's idea would fix this, since it could free up enough system memory that would then be available for the model.

However, it could also be the case that there is an upper limit to the note size the model can handle. If this is the case, we might be able to divide large notes into chunks, encode them separately, and average their embedding values. But I don't yet know enough about the theory to know if this would actually work.
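
Purely as a hypothetical sketch (the chunk size and function name are made up, and I haven't verified that averaging preserves quality), the chunking idea would look something like this:

```ts
import * as tf from '@tensorflow/tfjs';
import * as use from '@tensorflow-models/universal-sentence-encoder';

// Split an oversized note into fixed-size chunks, embed each chunk, and
// average the chunk embeddings into a single 512-dimensional vector.
async function embedLargeNote(body: string, chunkChars = 10000): Promise<tf.Tensor> {
  const chunks: string[] = [];
  for (let i = 0; i < body.length; i += chunkChars) {
    chunks.push(body.slice(i, i + chunkChars));
  }
  const model = await use.load();
  const chunkEmbeddings = await model.embed(chunks); // shape [numChunks, 512]
  const averaged = chunkEmbeddings.mean(0);          // element-wise mean -> shape [512]
  chunkEmbeddings.dispose();
  return averaged;
}
```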

Would you be willing to share some statistics about your note sizes? Specifically, the size in KBs or MBs of your largest notes, not including attachments (so, text only). One way to determine this info is (maybe you know a more convenient way?):

  1. export each notebook RAW
  2. go into each exported folder, sort by filesize, and record the filesizes of the largest markdown files

For me, my largest notes are 300 KB, and I only have a handful of them. The distribution quickly dips to 100 KB and then 30 KB and below. I used to have a single note that was larger than 1 MB, and it seemed to cause the model to hang. I didn't wait to see if the computation would ever finish, because the note itself was an old HTML web clipping that I didn't care about, so I just deleted it. However, I did not see the plugin crash the way yours does, so maybe something different is happening here.

@whitewall, thanks for the system info. Would you mind also following the two steps I lay out above? It could be the case you have a large note in your very first batch that is causing the model to hang. I'm actually not sure what else it would be at the moment.

Hi Marc
My largest note is ~63k, the 2nd largest is 14k. I'll delete the largest one and try again later today.

To get these numbers I just opened my database with DB Browser for SQLite and ran

select length(body),* from notes order by length(body) DESC
1 Like

OK, I messed up. My latest backup is 7 meg, not 7 gig. Very sorry.

And of that, only 1.5 mb is notes. The rest is resources.

The largest single note is 60 KB. That's an instance of the note overview plugin page. It lists the titles of all of my notes, so I deleted it to try again with your plugin. However, after 30 minutes, I'm still seeing this:
image

Of my other notes, all are less than 30 KB. Most are much smaller than that.

1 Like

No, and I'm not sure I'd want to add one. I think it's better if plugins are designed with the assumption that the app can close at any time, including when crashing. I don't know the details of your plugin but if you have a "dispose()" handler to call, why not do it at the end of processing? I guess that processing doesn't need to be running all the time?

2 Likes

@laurent Good call, I agree that plugins should be designed with that assumption. I do call dispose() at the end of processing; I just had a (premature) optimization in mind that I could do if I didn't need to. Thanks for your help!

@whitewall and @roman_r_m , I just released an update that will either resolve your issues or help us learn a little more about them:

  1. You can change the TensorFlow backend to use your CPU instead of WebGL in the settings menu, which I think might solve whitewall's problem (see the sketch after this list).
  2. You can also change how many notes are fed to the model at one time. I recommend setting this to 1 when debugging issues. Restart the app after changing either of these settings.
  3. I added some error handling around the creation of embeddings; notes that cause exceptions to be thrown are logged to this file (on Windows): %APPDATA%\Roaming\@joplin\app-desktop\logs\renderer.log. I'm not sure of the path on other OSes, but I can find that out if needed (it's whatever the default is for electron-log).
  4. I also stopped loading all note bodies at once, per laurent's suggestion, and now only load 100 at a time.
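
For the curious, switching the backend boils down to something like this (a simplified sketch, not the exact plugin code):

```ts
import * as tf from '@tensorflow/tfjs';

// Apply the new backend setting: 'cpu' avoids the WebGL shader path entirely,
// at the cost of slower encoding.
async function applyBackendSetting(useCpu: boolean) {
  await tf.setBackend(useCpu ? 'cpu' : 'webgl');
  await tf.ready(); // wait for the chosen backend to finish initializing
}
```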

@roman_r_m, between items 2 and 3 above, we should be able to pinpoint whether a specific note is causing you trouble. The plugin should now just skip troublesome notes and continue encoding as many as it can, and the troublesome ones are logged to the file I mention above. I'm not sure whether item 4 will be the 'throttle' you wanted, so please let me know if your computer is still as unresponsive as it was the first time you tried this.

@whitewall, could you please set the batch size to 1, choose the 'cpu' backend, restart the app, and give it a minute or two to see whether it is able to create the embeddings for your notes? I was able to replicate your issue on my Windows 10 tablet, and switching from WebGL to CPU fixed it for me.

That did it! Great work.

Do you plan on limiting the number of notes displayed? For example, only those with 90%+ similarity? There doesn't seem to be much point in listing all notes.

Thanks for your work.

1 Like

In a way, this change made things even worse. Previously, when I opened Joplin my system froze, but then the plugin would crash and things would go back to normal.
Now, the first time I opened Joplin after updating the plugin, the system froze again and stayed that way for 5 or 10 minutes. I had to do a hard reset as I could not wait any longer.

The second time I tried, I managed to open Joplin and quickly create a new note, and it seems the plugin crashed trying to process this empty note.