Joplin Vacuum - A Python script to remove orphaned resources

Hi there!

I enjoy using Joplin, but currently when a note is deleted, the resources used in it are not automatically deleted, which leaves orphaned resources behind. I searched around and found an issue on GitHub: Potential orphaned resources left on sync target on resource conflict · Issue #5223 · laurent22/joplin (github.com). One solution is to first export all notes to a JEX file and then reimport it.

Based on this method, I wrote a small Python script that takes a JEX file and automatically removes all orphaned resources via the Joplin Web API. In my own case it shrank my note library by about 20%.
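To make the idea concrete, here is a minimal sketch of the core check, not the actual script. It assumes that a note body links a resource as `:/` followed by a 32-character hex id, and that the note bodies and resource ids have already been read from the export. The name `find_orphans` is made up for illustration.

```python
import re

# Assumption: a note body refers to a resource as ":/<32 hex characters>".
RESOURCE_LINK = re.compile(r":/([0-9a-f]{32})")


def find_orphans(note_bodies, resource_ids):
    """Return the resource ids that no note body refers to."""
    referenced = set()
    for body in note_bodies:
        referenced.update(RESOURCE_LINK.findall(body))
    return sorted(set(resource_ids) - referenced)


if __name__ == "__main__":
    used = "a" * 32
    unused = "b" * 32
    notes = [f"# Trip\n\n![photo.jpg](:/{used})"]
    print(find_orphans(notes, [used, unused]))
```

Anything this check reports is only orphaned with respect to the note bodies it was given, so an incomplete export or sync will produce false positives.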

You can find the script in the link down below. Only Python 3.6+ is needed. No third party dependencies required.

Hope this helps!


After the discussion with the Joplin team (see the replies below), I learned that if a sync is incomplete (e.g. the network failed during syncing), resources that were downloaded before the notes referring to them will look orphaned, even though those notes exist. Also, resources that are referred to only in note histories, and not in the current version of any note, are treated as orphaned.

Before using the script, it is best to perform a full sync on all your devices and try to reach the same state across all of them. On the device you plan to run the script on, make sure the sync has completed before running it, and always keep up-to-date backups in case the worst happens. Use at your own risk; you are responsible for your own data.


Also, check out rxliuli's amazing joplin-batch-web at #9, which provides a web interface where you can manually inspect the unused resources before removing them.

Awesome, thanks! A bit scary to run, but let's hope for the best :slight_smile:

If you follow the steps it should be fine. :grinning:
Also, since a JEX export file is needed as input, in the worst case there will always be a backup that you can fall back to.

Remember that there's always a risk of losing resources when doing this. For example, in the following case:

  1. You sync your data and the app downloads several resources
  2. Your network connection goes off
  3. Some of the notes associated with the resources didn't get downloaded

Now the resources in step 1 are "orphans" even though they have notes associated with them (and these notes will be downloaded on next sync). If you run any clean up script at this point, you will accidentally delete all these resources.

Thanks for mentioning that case. Just wondering, if I make sure the sync is completed (click the sync button, and wait until it finishes) before running the script, will the resources be orphaned? I am using the built-in JEX export function, so I am really not sure what's going on inside Joplin. I would advise anyone using the script to back up all their data (preferably on all devices) before actually starting the cleaning process.

if I make sure the sync is completed (click the sync button, and wait until it finishes) before running the script, will the resources be orphaned?

You can't know without looking at the sync state of all devices and server. The mostly decentralised nature of Joplin means you can never know if a client or the server has all the data. For example the sync operation of client A got interrupted so it didn't upload all its data. Now you do a full sync on client B and might think you've downloaded everything, but of course you didn't get the data from client A.

Dealing with orphaned resources is a complicated problem because resources and notes in Joplin are independent. I don't really have any solution to this at this point, and every time I look into it I get stuck eventually by some subtle issue.

Thanks for your explanation of this complicated issue. Now I understand more why there isn't an official vacuum function provided.

For my part, I would advise anyone using the script to perform a full sync on all devices and try to reach the same stable state on all of them before running it, and as always to back up their data. I will update the repo to reflect the insight you provided.

Thank you and your team for the amazing job of creating Joplin!

Are you sure that your script takes the note history into account?

For example, if you remove a resource that is orphaned in the current version of a note, and then go back to a revision from a few days earlier in which the attachment was still linked, is the attachment gone after running your script?

I once wrote a similar web tool. You can preview the resources before deleting them, so they are unlikely to be deleted by mistake.

https://joplin-utils.rxliuli.com/joplin-batch-web/#/en-US/unusedResource

This is an awesome tool! From the source code, it seems that "unused resources" is determined by whether the resource id can be found or not. I think that's a better approach as it eliminates the need to manually export.

Sadly no. From my own tests, it seems that the JEX export gives a snapshot of the current Joplin notes and ignores note histories. rxliuli's joplin-batch-web project searches for the resource id to determine whether a resource is unused, which also ignores note histories.

I just checked Joplin Data API | Joplin (joplinapp.org), and note history is missing from that page, so it seems that it is not exposed through the REST API that my script relies on. Unless there is a way to programmatically access note histories via the API, there isn't much I can do. Still, I will update the repo and my post to reflect the issue you shed light on.
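For readers who have not used the Data API, here is a minimal sketch of the part a clean-up script can rely on: paging through `/resources` and deleting one resource by id. It assumes the default Web Clipper port 41184 and a token copied from the Web Clipper settings. The helper names (`api_call`, `iter_items`, `delete_resource`) are made up for illustration and are not taken from either tool in this thread.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the Web Clipper service listens on its default port.
BASE_URL = "http://localhost:41184"


def api_call(method, path, token, **params):
    """Send one request to the Data API and return the decoded JSON body (or None)."""
    query = urllib.parse.urlencode({"token": token, **params})
    request = urllib.request.Request(f"{BASE_URL}{path}?{query}", method=method)
    with urllib.request.urlopen(request) as response:
        raw = response.read()
    return json.loads(raw) if raw else None


def iter_items(path, token, call=api_call):
    """Yield every item of a paginated collection by following `has_more`."""
    page = 1
    while True:
        result = call("GET", path, token, page=page)
        yield from result["items"]
        if not result.get("has_more"):
            break
        page += 1


def delete_resource(resource_id, token, call=api_call):
    """Delete a single resource by id."""
    call("DELETE", f"/resources/{resource_id}", token)
```

A call such as `[r["id"] for r in iter_items("/resources", token)]` would list the ids of current resources. Nothing here can see note history, which is exactly the limitation described above.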

I plan to solve the note history problem with another tool. For reference: https://github.com/rxliuli/joplin-utils/blob/master/apps/joplin-plugin-backup-prettier/README.ZH_CN.md. At present there is only a rough idea and some basic test code, and I am rather busy lately. If you want, you can implement it.

If we can recover any note at any point in time, then nothing will be lost


Maybe you can refer to: https://www.presslabs.com/docs/code/gitfs/

I checked all the related Joplin APIs, but note history is not accessible via the Joplin Data API or the Joplin Plugin API. According to this thread, the only way is to back up the Joplin sync directory directly. From my perspective, one issue with using Git or another revision control system is that a large amount of information is stored in an internal SQLite database, which is not very Git-friendly.

You may find some info about how the revision service works here: joplin/RevisionService.ts at 8a2ca0535d4dc9ac835262c4613225c40218ce8f · laurent22/joplin (github.com)

It seems that note history info is stored in the revision table in database.sqlite, which can be found under your Joplin config directory.
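As a rough sketch of how a script could take history into account, the snippet below collects every 32-character hex id that appears in stored revision diffs, so that such ids can be excluded from deletion. The table name `revisions` and the column name `body_diff` are assumptions that should be checked against your own database.sqlite; `ids_in_history` is a made-up name.

```python
import re
import sqlite3

# Matches any run of 32 hex characters; deliberately over-inclusive.
HEX_ID = re.compile(r"[0-9a-f]{32}")


def ids_in_history(conn, table="revisions", column="body_diff"):
    """Collect every 32-character hex id found in the stored revision diffs.

    The default table and column names are assumptions, not a documented schema.
    """
    found = set()
    for (diff,) in conn.execute(f"SELECT {column} FROM {table}"):
        found.update(HEX_ID.findall(diff or ""))
    return found
```

To try it without touching a live profile, work on a copy of the database, or open it read-only with `sqlite3.connect("file:/path/to/database.sqlite?mode=ro", uri=True)`, where the path is a placeholder. Matching bare ids errs on the side of keeping too many resources rather than too few.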

To be clear, I don't intend to access Joplin's built-in note history through the API, but to implement an external history system by continuously exporting all notes and resources to git. The problem with the existing backup plugin is that incremental backups are not possible, so this cannot be done with it.


Existing backup plugin

In any case, even if you can query the note history directly from SQLite, what makes you think no one will want to recover from a more distant history?

This is the reason why I access the db directly to determine what to delete. The delete operation, however, is done via the API.

I wrote a bash script a while back....

Sorry for my misunderstanding. Now I see what you mean by referring to GitFS. It would definitely help to have a better backup mechanism, but I am still not sure how to keep the Joplin state in sync with the external Git state (how to decide when to "commit" so that no git revision represents an intermediate state).

Thanks for sharing the script. Sadly I am a Windows user, but the SQL you provided is surely handy. I think I will try to adapt it into my script (definitely with credit to you) for better cross-platform compatibility. Just wondering, how can I know whether a future update will change the internal DB structure and render the script unusable?

Unfortunately we don't. If it breaks, it breaks and we'll have to figure out why.

I usually keep track of db changes by reading commits.

You can refer to the existing backup plugin above. It does a regular backup, but with git you can back up more frequently, because incremental backups are possible. As for when to back up, the simple solution is a fixed time interval (for example, 30 s; this may cause performance problems for people with a lot of notes, but as long as the Joplin API query layer is cached or the plugin hooks API is used, the work per backup can be kept small). The complex solution is batch processing: implement something similar to an IDE's local history, then commit to git in batches at the right time.


The git repository would grow without limit, so the historical records older than the last N before a commit could be deleted regularly.


The problems I expect to solve are already mentioned in the README:

  1. A real full backup that can be used to migrate or reset
  2. Incremental backups to avoid losing notes
  3. Human readable
  4. Recover any note (or notebook) at any point in time - this is an existing ability of git and should be integrated

I think the only thing you can do here is to check the schema version and report an error if it does not match the expected value.
The version is in the version table.
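A minimal sketch of such a guard is shown below. It assumes the `version` table holds a single row with an integer column that is also named `version`, which should be verified against your own database. The function name and the expected value passed to it are hypothetical.

```python
import sqlite3


def check_schema_version(conn, expected):
    """Raise RuntimeError unless the stored schema version equals `expected`.

    Assumption: table `version` has one row with an integer column `version`.
    """
    row = conn.execute("SELECT version FROM version").fetchone()
    if row is None:
        raise RuntimeError("the version table is empty")
    actual = row[0]
    if actual != expected:
        raise RuntimeError(
            f"unexpected schema version {actual} (this script was written for {expected})"
        )
    return actual
```

Calling this before any query against the database makes the script stop with a clear message after a Joplin update, instead of silently reading a changed schema.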