Based on this method, I wrote a small Python script that takes a JEX file as input and automatically removes all orphaned resources via the Joplin Data API. In my own experience it shrinks my note library by about 20%.
You can find the script at the link below. Only Python 3.6+ is needed; no third-party dependencies are required.
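For anyone curious before clicking through, here is a minimal sketch of the approach. It assumes Joplin's Web Clipper service is running on its default port 41184 and that you paste your own API token into `TOKEN`; the function names are mine, not necessarily the ones in the repo:

```python
import json
import re
import tarfile
import urllib.request

API = "http://localhost:41184"   # default Web Clipper / Data API port (assumption)
TOKEN = "your-api-token-here"    # from Tools -> Options -> Web Clipper

# Joplin resource ids are 32 hex digits; a real script should anchor this
# pattern more carefully, this is just a sketch.
RESOURCE_ID = re.compile(rb"[0-9a-f]{32}")

def referenced_ids(jex_path):
    """Collect every 32-hex-digit id mentioned in the JEX export's markdown files."""
    ids = set()
    with tarfile.open(jex_path) as tar:
        for member in tar.getmembers():
            if member.isfile() and member.name.endswith(".md"):
                data = tar.extractfile(member).read()
                ids.update(m.group().decode() for m in RESOURCE_ID.finditer(data))
    return ids

def all_resource_ids():
    """Page through GET /resources to list every resource id Joplin knows about."""
    ids, page = set(), 1
    while True:
        url = f"{API}/resources?token={TOKEN}&fields=id&page={page}"
        with urllib.request.urlopen(url) as resp:
            body = json.load(resp)
        ids.update(item["id"] for item in body["items"])
        if not body.get("has_more"):
            return ids
        page += 1

def delete_resource(rid):
    """Remove one resource via DELETE /resources/:id."""
    req = urllib.request.Request(f"{API}/resources/{rid}?token={TOKEN}",
                                 method="DELETE")
    urllib.request.urlopen(req)

# Usage (requires a running Joplin instance):
#   orphans = all_resource_ids() - referenced_ids("export.jex")
#   for rid in orphans:
#       delete_resource(rid)
```

The set difference is the whole trick: anything the API lists that the export never mentions is treated as orphaned, which is exactly why the sync caveats discussed below matter.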
Hope this helps!
After the discussion with the Joplin team (see the replies below), I learned that if a sync is incomplete (e.g. the network failed during syncing), resources that have not yet been synced locally will be marked as orphaned even though they are referenced by notes. Likewise, resources that are referenced only in note history, and not in the current version of any note, are marked as orphaned.
Before using the script, it is best to perform a full sync on all your devices and bring them all to the same state. On the device you plan to run the script on, make sure the sync has completed before running it, and always keep up-to-date backups in case the worst happens. Use at your own risk and be responsible for your own data.
Also, check out rxliuli's excellent joplin-batch-web at #9, which provides a web interface where you can manually inspect unused resources before removing them.
Remember that there is always a risk of losing resources when doing this, for example in the following case:
1. You sync your data and the app downloads several resources.
2. Your network connection goes down.
3. Some of the notes associated with those resources didn't get downloaded.
Now the resources from step 1 are "orphans" even though they have notes associated with them (and those notes will be downloaded on the next sync). If you run any clean-up script at this point, you will accidentally delete all of these resources.
Thanks for mentioning that case. Just wondering: if I make sure the sync is completed (click the sync button and wait until it finishes) before running the script, can resources still end up orphaned? I am using the built-in JEX export function, so I am really not sure what goes on inside Joplin. I would advise anyone using the script to back up all their data (preferably on all devices) before actually starting the cleaning process.
if I make sure the sync is completed (click the sync button and wait until it finishes) before running the script, can resources still end up orphaned?
You can't know without looking at the sync state of all devices and the server. The mostly decentralised nature of Joplin means you can never know whether a client or the server has all the data. For example, the sync operation of client A got interrupted, so it didn't upload all its data. Now you do a full sync on client B and might think you've downloaded everything, but of course you haven't got the data from client A.
Dealing with orphaned resources is a complicated problem because resources and notes in Joplin are independent. I don't really have a solution at this point, and every time I look into it I eventually get stuck on some subtle issue.
Thanks for your explanation of this complicated issue. Now I understand better why there isn't an official vacuum function.
For my part, I would advise anyone using the script to perform a full sync on all devices, try to reach a stable, identical state everywhere before running it, and, as always, back up their data. I will update the repo to reflect the insight you provided.
Thank you and your team for the amazing job of creating Joplin!
Are you sure that your script takes the note history into account?
E.g. if a resource is orphaned in the current version of a note and you remove it, then view that note's history from a few days earlier, when the attachment was still linked, is the attachment gone after running your script?
This is an awesome tool! From the source code, it seems that "unused resources" are determined by whether the resource id can be found in any note. I think that's a better approach, as it eliminates the need for a manual export.
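If I understand the idea correctly, an id-search check can be sketched against the Data API's `/search` endpoint like this. Whether joplin-batch-web does exactly this is my assumption, not a statement about its source code, and the port and token are placeholders:

```python
import json
import urllib.parse
import urllib.request

API = "http://localhost:41184"   # default Data API port (assumption)
TOKEN = "your-api-token-here"    # Web Clipper token

def search_url(rid):
    """Build the Data API full-text search URL for a resource id."""
    return f"{API}/search?query={urllib.parse.quote(rid)}&token={TOKEN}"

def is_resource_used(rid):
    """True if the full-text search finds any note mentioning the id.

    Note that this only inspects current note bodies, so it has the
    same note-history blind spot discussed later in this thread.
    """
    with urllib.request.urlopen(search_url(rid)) as resp:
        return len(json.load(resp)["items"]) > 0
```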
Sadly, no. From my own tests, the JEX export seems to give a snapshot of the current Joplin notes, ignoring note history. rxliuli's joplin-batch-web project searches for the resource id to determine whether a resource is unused, which also ignores note history.
I just checked Joplin Data API | Joplin (joplinapp.org), and note history is missing from that page, so it seems it is not exposed through the REST API my script relies on. Unless there is a way to access note history programmatically via the API, there isn't much I can do. Still, I will update the repo and my post to reflect the issue you shed light on.
I checked all the related Joplin APIs, but note history is not accessible via the Joplin Data API or the Joplin Plugin API. According to this thread, the only way is to back up the Joplin sync directory directly. From my perspective, one issue with using Git or another revision control system is that a large amount of information is stored in an internal SQLite database, which is not very Git-friendly.
So I wasn't expecting to access Joplin's built-in note history through the API; instead I want to implement an external history system by continuously exporting all notes and resources to Git. The problem with the existing backup plugin is that it cannot do incremental backups, so it won't work for this.
Existing backup plugin
In any case, even if you can query the note history directly from SQLite, why assume that no one wants to recover from a more distant point in history?
Sorry for my misunderstanding. Now I see what you mean by referring to GitFS. It would definitely help to have a better backup mechanism, but I am still not sure how to keep the external Git state in sync with Joplin's state (i.e. how to decide when to "commit" so that no Git revision represents an intermediate state).
Thanks for sharing the script. Sadly I am a Windows user, but the SQL you provide is surely handy. I think I will adapt it into my script (with credit to you, of course) for better cross-platform compatibility. Just wondering, how can I know whether a future update will change the internal DB structure and render the script unusable?
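For reference, the SQL-based check probably looks something like the sketch below. I am assuming the schema here (tables `resources` and `note_resources` with an `is_associated` flag) from inspecting my own database.sqlite; Joplin does not document or guarantee this internal schema, which is exactly the versioning risk raised above:

```python
import sqlite3

# Path to the profile database (assumption; the location varies by platform).
DB_PATH = "database.sqlite"

# note_resources appears to keep one row per (note, resource) pair, with
# is_associated = 1 when the link exists in the current version of the note.
# Table and column names are assumptions about an undocumented schema.
ORPHAN_QUERY = """
SELECT id FROM resources
WHERE id NOT IN (
    SELECT resource_id FROM note_resources WHERE is_associated = 1
)
"""

def find_orphans(db_path=DB_PATH):
    """Return ids of resources not associated with any current note."""
    with sqlite3.connect(db_path) as conn:
        return [row[0] for row in conn.execute(ORPHAN_QUERY)]
```

Run it only against a copy of the database, never the live profile, since Joplin may have the file open.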
You can refer to the existing backup plugin mentioned above. It does regular backups, but with Git you can back up more frequently, because incremental backups are possible. As for when to back up: the simple solution is a fixed time interval (for example every 30 s; that may cause performance problems for people with a lot of notes, but as long as the layer that queries the Joplin API is cached, or the plugin hooks API is used, each backup can be kept as small as possible). The complex solution is batch processing: implement something similar to an IDE's local history, and then commit the batches to Git at the right time.
The Git repository would grow without limit, though history older than the last N records can be pruned regularly.
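My reading of the batching idea can be sketched as a small policy object: collect change events and only trigger a commit once the notes have been idle for a while. The class and the idle threshold are my own illustration, not part of any existing plugin, and the actual `git commit` is left as a callback:

```python
import time

class BatchCommitter:
    """Collects change events and decides when a commit should happen.

    This implements only the batching policy (commit after a quiet
    period), similar in spirit to an IDE's local history. The Git call
    itself is injected as `commit`, so the policy stays testable.
    """

    def __init__(self, commit, idle_seconds=30):
        self.commit = commit          # callback receiving a list of changed ids
        self.idle = idle_seconds      # quiet period before committing
        self.pending = []             # ids changed since the last commit
        self.last_change = None

    def on_change(self, item_id, now=None):
        """Record that a note or resource changed."""
        now = time.monotonic() if now is None else now
        self.pending.append(item_id)
        self.last_change = now

    def tick(self, now=None):
        """Call periodically; commits the batch once changes have settled."""
        now = time.monotonic() if now is None else now
        if self.pending and now - self.last_change >= self.idle:
            batch, self.pending = self.pending, []
            self.commit(batch)
```

Keeping the policy separate from the Git plumbing also means the same object could drive any other snapshot backend.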
The problems I expect to solve are already mentioned in the README:
- Real full backups that can be used to migrate or reset
- Incremental backups, to avoid losing notes
- Recovering any note (or notebook) at any point in time - this is an existing capability of Git and should be integrated