Resource Deduplication

It appears that resources in Joplin are given names based on UUIDs.

When we attach the same file to multiple notes, we end up creating duplicate resources. We can presently work around this by attaching the resource in only one note and copying the markdown link into the other notes. However, I feel we can deduplicate (possibly only unencrypted) resources as follows:

  1. Switch to using a checksum (such as SHA-256) as the resource ID. This is backwards compatible, since existing links would still work as they are just paths. However, new resources would still be duplicated against ones created prior to this change. Also, there may be collisions between checksums and UUIDs.
  2. Add a checksum column to resources in database.sqlite and simply deduplicate unencrypted resources based on the checksum.
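
To make option 2 concrete, here is a rough sketch of what checksum-based duplicate detection could look like. It is only an illustration, not Joplin code: the table below is a simplified stand-in for the real resources table, the `checksum` column is the proposed (currently non-existent) addition, and `sha256_of` and `find_duplicates` are hypothetical helper names.

```python
import hashlib
import sqlite3


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a resource's content."""
    return hashlib.sha256(data).hexdigest()


def find_duplicates(conn: sqlite3.Connection) -> dict[str, list[str]]:
    """Map each checksum shared by 2+ unencrypted resources to their IDs.

    IDs are listed oldest first, so the first one can be kept as canonical.
    """
    rows = conn.execute(
        "SELECT checksum, id FROM resources "
        "WHERE encryption_applied = 0 AND checksum IS NOT NULL "
        "ORDER BY created_time, id"
    ).fetchall()
    groups: dict[str, list[str]] = {}
    for checksum, resource_id in rows:
        groups.setdefault(checksum, []).append(resource_id)
    return {c: ids for c, ids in groups.items() if len(ids) > 1}


# Demo on an in-memory database with a simplified, hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE resources ("
    "id TEXT PRIMARY KEY, title TEXT, created_time INTEGER, "
    "encryption_applied INTEGER, size INTEGER, checksum TEXT)"
)

pdf = b"stand-in bytes for the attached PDF"  # placeholder content
png = b"stand-in bytes for some other file"   # placeholder content
conn.executemany(
    "INSERT INTO resources VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("d594e0e41b28405aa1ecfd7ff50f7ce8", "MICRO44_Andre_Seznec.pdf",
         1582537597630, 0, len(pdf), sha256_of(pdf)),
        ("03ab241fadf644109c1d6a931d98666a", "MICRO44_Andre_Seznec.pdf",
         1582537643024, 0, len(pdf), sha256_of(pdf)),
        ("00000000000000000000000000000001", "logo.png",
         1582537700000, 0, len(png), sha256_of(png)),
    ],
)

duplicates = find_duplicates(conn)
for checksum, ids in duplicates.items():
    print("keep", ids[0], "and merge", ids[1:])
```

Ordering by creation time means the oldest copy is the one kept, so links that already point at it would not need rewriting; links to the newer copies would still have to be updated before those rows are removed.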

For instance, here are two entries in my database.sqlite that are exact duplicates:

d594e0e41b28405aa1ecfd7ff50f7ce8|MICRO44_Andre_Seznec.pdf|application/pdf||1582537597630|1582537597630|1582537597630|1582537597630|pdf||0|0|832663|0
03ab241fadf644109c1d6a931d98666a|MICRO44_Andre_Seznec.pdf|application/pdf||1582537643024|1582537643024|1582537643024|1582537643024|pdf||0|0|832663|0
  • I can work on this, unless I am missing something obvious that would make deduplication a bad idea.

good feature

This is something that I would definitely be interested in.

I use the web clipper fairly extensively, and logos and other common images are routinely duplicated across multiple pages. If Joplin used some form of file hash (SHA-2/SHA-3) as the identifier instead of a UUID, that would be ideal. Deduplicated notebooks could also be significantly smaller, depending on how many attachments are duplicated.

Thinking further about it, this would also make it easy to track the number of references from pages to each file (by adding a reference_count column, for example). That should significantly speed up cleaning up unreferenced attachments: either remove a file as soon as its reference_count reaches 0, or run a cleanup routine that searches all pages for references to files with a count of 0 before removing them.
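
A minimal sketch of the reference-counting idea, under a couple of assumptions: that notes link to resources with markdown links of the form `:/` followed by a 32-character hex ID (this pattern would need to change if IDs became 64-character hashes), and that `count_references` and `unreferenced` are hypothetical helpers rather than anything Joplin provides.

```python
import re

# Assumed link form: ":/" followed by a 32-character lowercase hex resource ID.
LINK_RE = re.compile(r":/([0-9a-f]{32})")


def count_references(note_bodies, resource_ids):
    """Count how many times each known resource ID is linked from the notes."""
    counts = {rid: 0 for rid in resource_ids}
    for body in note_bodies:
        for rid in LINK_RE.findall(body):
            if rid in counts:
                counts[rid] += 1
    return counts


def unreferenced(counts):
    """Return the IDs with no references, i.e. candidates for cleanup."""
    return sorted(rid for rid, n in counts.items() if n == 0)


notes = [
    "![logo](:/d594e0e41b28405aa1ecfd7ff50f7ce8)",
    "See [the paper](:/d594e0e41b28405aa1ecfd7ff50f7ce8) for details.",
]
resources = [
    "d594e0e41b28405aa1ecfd7ff50f7ce8",
    "03ab241fadf644109c1d6a931d98666a",
]

counts = count_references(notes, resources)
print(counts)
print(unreferenced(counts))
```

This version recomputes the counts by scanning every note, which corresponds to the cleanup-routine variant. A stored reference_count column would avoid the scan but would have to be kept in sync on every note edit.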

I opened an issue requesting this:
