Switching to Paperless NGX

... for some content 🙂
No, I am not dropping Joplin. On the contrary!

I used Evernote for about 10 years before it became useless, and then I discovered Joplin.
I switched quite quickly, amazed by all that Joplin offers. If I had known, I would have switched earlier.

I have probably been abusing Joplin as a DMS (Document Management System) since day 1, and I must say it works pretty well at it, although that is not really what Joplin is made for.

I am now well over 10K notes and I keep importing PDFs. Although Joplin has not shown its limits yet, I decided to handle my PDFs with a solution that is more dedicated to them: Paperless NGX.

Why the switch?
First of all, I want to keep Joplin "light" and keep doing daily backups.
My daily backup is now well over 10 GB. Joplin still behaves great, and I love the options to search notes, link them, and add comments and labels. I have tried DMSs in the past, and they were so bad that using Evernote/Joplin remained a MUCH better experience despite the manual work of tagging and filling in metadata (such as dates, amounts, ...).

Paperless NGX, however, scores a few extra points for document management, i.e. PDFs.
First, it imports documents by watching the content of a folder, just like Joplin can.
I like that it also organizes the PDFs as a folder tree based on pre-defined rules. For instance, it will take care of putting all your 2025 invoices into a folder called 2025 (if you wish to do so).
Which brings me to the next point: the main pain I had using Joplin as a DMS was extracting data such as dates from the OCR text by hand. Paperless NGX finds the one or few dates in your document automatically.

I would love to see some integration between the two, as I still think Joplin remains better for "commenting" on documents. For that, I would need to link documents stored in Paperless NGX from Joplin.

Linking to a single Paperless document is possible but painful, as it relies on an incremental ID. The @@... links in Joplin are much more convenient. Linking to a Paperless search is also possible, since it is a simple URL.

The ideal would be a plugin that calls the Paperless NGX API to fetch one doc, one search, one label, etc., based on a shortcut like @@.... Call it >>... or whatever.
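To make the shortcut idea a bit more concrete, here is a minimal sketch in Python (a real Joplin plugin would be written in JavaScript/TypeScript). The `>>query` syntax, the `expand_shortcuts` name and the example host are all hypothetical. The only Paperless NGX feature it assumes is the full-text `query` parameter on the documents API endpoint, and it assumes the API root already ends in `/api/`.

```python
import re
from urllib.parse import quote

# Hypothetical shortcut syntax: a line of the form ">>some search terms".
SHORTCUT = re.compile(r"^>>(.+)$", re.MULTILINE)


def expand_shortcuts(text: str, api_root: str) -> str:
    """Replace each '>>query' line with a Markdown link to a Paperless search.

    api_root is assumed to include the API prefix, e.g. https://host/api/
    """
    root = api_root.rstrip("/")

    def to_link(match: re.Match) -> str:
        query = match.group(1).strip()
        # Percent-encode the query so spaces and symbols survive in the URL
        url = f"{root}/documents/?query={quote(query)}"
        return f"[Paperless: {query}]({url})"

    return SHORTCUT.sub(to_link, text)
```

Lines that do not start with `>>` are left untouched, so this could run over a whole note body.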

This way, I hope to bloat my Joplin database a little less, while keeping my PDFs as plain documents in the backend (i.e. the file system) and not as BASE64 in the DB.

Paperless NGX simply offers more as a DMS than Joplin... and that's fair, as it is a DMS.
Since Paperless NGX allows custom fields with URL as a type, it is trivial to link back to a Joplin note, which can itself be a note with an embedded search, etc.

With this, I hope to keep Joplin smaller and faster, to keep doing what it does best.

How do YOU do it?


I just had a friend recommend Paperless NGX and will be installing it on my home server today. I have been using Joplin for around 2 years (self hosted on the same server) and have approximately 3,500 scanned PDFs (moved over from Evernote) in it. My scanner does OCR automatically, so all those PDFs have embedded text.

I use the Athena plugin to add the embedded text to Joplin for easy searchability, and will admit that it definitely doesn't provide the best looking notes. On the other hand the native searchability is huge, and I typically don't need to view the PDFs as text.

As I learn it, I will report how I'm able to use it, and I would also love to see how other Joplin users use Paperless NGX along with Joplin, or possibly instead of it.

Honestly, when I first saw Paperless NGX, I thought: yeah, cool, but I am not sure it brings me much. I would test it, but I was not convinced.

My scanner also does OCR so I can search all in Joplin.

I had my aha moment when I discovered the following things that Paperless NGX does and Joplin does not (because it is not really its job):

  • extract data such as dates... VERY handy
  • organise all your docs on disk with a naming scheme that you can update anytime. So you can have all your invoices, for instance, under invoices/yyyy/mm/yyyy_mm_<correspondent>_your_title.pdf. That's VERY convenient.
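In Paperless NGX this layout is configured rather than coded (as far as I know, through a filename format / storage path template with placeholders for things like the creation year, correspondent and title). Purely as an illustration of the scheme above, here is a small Python sketch that builds such a path from a document's metadata. The `storage_path` function and its parameters are made up for this example and are not part of Paperless.

```python
import re
from datetime import date


def _slug(value: str) -> str:
    """Collapse anything that is not a word character or '-' into '_'."""
    return re.sub(r"[^\w-]+", "_", value.strip()).strip("_")


def storage_path(doc_type: str, correspondent: str, title: str, created: date) -> str:
    """Build <type>/yyyy/mm/yyyy_mm_<correspondent>_<title>.pdf (illustrative only)."""
    year = f"{created:%Y}"
    month = f"{created:%m}"
    name = f"{year}_{month}_{_slug(correspondent)}_{_slug(title)}.pdf"
    return f"{_slug(doc_type)}/{year}/{month}/{name}"
```

For example, `storage_path("invoices", "ACME Corp", "March hosting", date(2025, 3, 14))` gives `invoices/2025/03/2025_03_ACME_Corp_March_hosting.pdf`.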

That allows me to sync far less data into Joplin, while I can still link to my docs or searches from Joplin with things like: "You can find the invoices for 2023 here".

Cherry on top, Paperless NGX supports custom fields. So you can have an "amount" field for your invoices for instance. Yet another thing not super convenient in Joplin (I used to stuff all of that in the titles...).

So IMO, these two pieces of software complement each other pretty well.


With a few days of experience, I drew a few conclusions:

  • Joplin is not a DMS, but it actually handles docs darn well! In short, for someone with no DMS at all, Joplin is already a far better solution than some of the "specialized" DMSs I have tried. The main limitation is metadata: Joplin mostly offers only title + tags.
  • Paperless NGX really helps attaching metadata to your docs
  • Paperless NGX deduplication showed me I had a lot of duplicates. This is actually NOT really an issue in Joplin, as Joplin will store the file under its HASH, so it will not use much extra disk space if 10 notes point to the same file. Yet, you will have 10 notes. In Paperless, you will get one single doc entry
  • The creation date recognition works much better than I thought. This is a real time saver
  • The Paperless workflows are really useful so you can automatically change metadata based on a trigger
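If you want a rough idea of how many duplicates you have before importing anything, a content-hash scan of the exported files is enough. The sketch below is my own helper, not something Paperless or Joplin ships. It uses MD5 on the assumption that this matches the checksum Paperless NGX stores per document, but any hash would do for spotting identical files.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large PDFs are not loaded into memory at once."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(folder: str) -> dict[str, list[Path]]:
    """Return {checksum: [paths]} for every checksum seen more than once."""
    by_hash: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(Path(folder).rglob("*")):
        if path.is_file():
            by_hash[file_md5(path)].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

Each entry in the result is a group of byte-identical files, which is what would collapse into a single doc entry after import.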

If you have many notes with only a PDF, a title and some tags (i.e. no body other than the file), the strategy below is the best I found, provided that:

  • you are willing to lose your title
  • you use tags and want to keep them

In that case, the best approach is to leverage the ability of Paperless (maybe the Joplin plugin should allow that as well) to auto-tag docs based on the folders they sit in, which lets you keep your tags and more.

For instance, if you have notes tagged foo + bar and are about to import 100 files, you may:

  • use a raw export
  • ignore the md files
  • move the content of the resources folder into nested subfolders foo / bar / batchxyz
  • Your docs will be imported with the tags foo, bar and batchxyz. After that, you may batch select those docs in Paperless and apply/change more properties.
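The steps above can be sketched as a small Python helper. It assumes the Paperless consumer is set up to recurse into subfolders and to turn subfolder names into tags (as far as I know, the PAPERLESS_CONSUMER_RECURSIVE and PAPERLESS_CONSUMER_SUBDIRS_AS_TAGS settings). The `stage_for_consume` function and the folder names are made up for this example. It copies rather than moves, so the raw export stays intact.

```python
import shutil
from pathlib import Path


def stage_for_consume(resources_dir: str, consume_dir: str, tags: list[str]) -> list[Path]:
    """Copy every PDF from a Joplin raw export's resources folder into
    <consume_dir>/<tag1>/<tag2>/..., skipping the md files and anything else."""
    target = Path(consume_dir).joinpath(*tags)
    target.mkdir(parents=True, exist_ok=True)

    copied = []
    for src in sorted(Path(resources_dir).iterdir()):
        if src.is_file() and src.suffix.lower() == ".pdf":
            copied.append(Path(shutil.copy2(src, target / src.name)))
    return copied
```

Calling `stage_for_consume("export/resources", "/path/to/consume", ["foo", "bar", "batchxyz"])` would place the files under `consume/foo/bar/batchxyz/`, with both paths here being placeholders.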

All of that to say, without hijacking this forum for another piece of software, that Joplin still does a VERY good job at managing docs, although it is not what it was designed for.
Paperless NGX is a nice option that is much more than what it initially looks like.

There is an open question on this, and if it turns out to be possible, it would be rather simple to import a doc into Paperless and convert a Joplin doc link to the new URL.

Still in my journey to help Joplin lose weight.
I thought I would share a small piece of code that makes my life much easier.

I am (searching and...) running into Joplin notes with a PDF linked:

[20210525.pdf](:/1bfb3540a19f48c3ebf35dfd3c84333e)

I save the doc to the paperless consume folder.
I usually then keep going with a few other docs as the consumption may take a while when OCR is required.

I then revisit the notes while having the snippet below running and copy:

[20210525.pdf](:/1bfb3540a19f48c3ebf35dfd3c84333e)

to the clipboard.

The piece of code queries paperless for the file and returns the information. Since we search via exact hash, there should be ONE match at most.

python joplin-paperless.py
📋 Hash match: 1bfb3540a19f48c3ebf35dfd3c84333e
🔍 Document info: {'url': 'https://<redacted>/api/documents/<some-id>/', 'filename': '1bfb3540a19f48c3ebf35dfd3c84333e'}
✅ Clipboard updated: [20210525.pdf](https://<redacted>/api/documents/<some-id>/)

The script then replaces the clipboard content with the new link:

[20210525.pdf](https://<redacted>/api/documents/<some-id>/)

So I only need to spot the attachments, copy, wait a second, paste.

The script needs 2 environment variables, PAPERLESS_API_URL and PAPERLESS_TOKEN, plus the pyperclip and requests packages:

import os
import re
import time
import pyperclip
import requests
import sys

# Load from environment
PAPERLESS_API_URL = os.getenv("PAPERLESS_API_URL")
PAPERLESS_TOKEN = os.getenv("PAPERLESS_TOKEN")

# Validate environment variables
if not PAPERLESS_API_URL or not PAPERLESS_TOKEN:
    sys.exit("❌ ERROR: PAPERLESS_API_URL and PAPERLESS_TOKEN environment variables must be set.")

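# Normalize URL. Judging by the sample output above, PAPERLESS_API_URL is
# expected to already include the API root, e.g. https://<host>/api/
if not PAPERLESS_API_URL.endswith("/"):
    PAPERLESS_API_URL += "/"

DOCUMENTS_ENDPOINT = PAPERLESS_API_URL + "documents/"
SLEEP_TIME = 1  # Seconds between clipboard checks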

# Pattern: [filename.pdf](:/hash)
pattern = re.compile(r'\[([^\]]+)\]\(:/([a-fA-F0-9]{32})\)')

last_clipboard = ""

def get_document_url_by_hash(doc_hash):
    headers = {
        "Authorization": f"Token {PAPERLESS_TOKEN}",
        "Accept": "application/json"
    }

    query_url = f"{PAPERLESS_API_URL.rstrip('/')}/documents/"

    try:
        # Let requests encode the query string; the timeout avoids hanging forever
        response = requests.get(query_url, headers=headers, params={"checksum__iexact": doc_hash}, timeout=10)
        response.raise_for_status()

        if 'application/json' not in response.headers.get('Content-Type', ''):
            print(f"⚠️ Unexpected content type: {response.headers.get('Content-Type')}")
            return None

        data = response.json()
        results = data.get("results", [])
        if results:
            doc = results[0]
            return {
                "url": f"{PAPERLESS_API_URL.rstrip('/')}/documents/{doc['id']}/",
                "filename": doc.get("title") or "document"
            }
        else:
            print("⚠️ Document not found for given hash.")
    except requests.RequestException as e:
        print(f"⚠️ Failed to fetch document info: {e}")
    except ValueError as e:
        print(f"⚠️ Failed to parse JSON: {e}")
        print(f"Response was: {response.text[:200]}")
    return None


while True:
    try:
        clipboard = pyperclip.paste()
        if clipboard != last_clipboard:
            match = pattern.search(clipboard)
            if match:
                # Keep the original link text (filename) from the clipboard
                filename, doc_hash = match.groups()
                print(f"📋 Hash match: {doc_hash}")

                # Default fallback replacement
                replacement = f"paperless {doc_hash}"

                # Try to get doc info from Paperless NGX
                doc_info = get_document_url_by_hash(doc_hash)
                print(f"🔍 Document info: {doc_info}")
                if doc_info:
                    replacement = f"[{filename}]({doc_info['url']})"

                pyperclip.copy(replacement)
                last_clipboard = replacement
                print(f"✅ Clipboard updated: {replacement}")
            else:
                last_clipboard = clipboard
        time.sleep(SLEEP_TIME)
    except KeyboardInterrupt:
        print("👋 Exiting...")
        break
    except Exception as e:
        print(f"❗ Unexpected error: {e}")
        time.sleep(SLEEP_TIME)

The script keeps running until you kill it.

That could certainly be automated further, but this is a one-time run for me, so this helper is actually more than enough.
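For anyone who does want to go further, one possible next step is to rewrite every attachment link in a note body in one pass instead of going through the clipboard. This is only a sketch: `rewrite_links` is a made-up helper, and `resolve` stands for any lookup that maps the 32-character ID to a Paperless URL (for instance the `get_document_url_by_hash` function above, adapted to return just the URL). Links that cannot be resolved are left as they are.

```python
import re
from typing import Callable, Optional

# Same pattern as in the script: [filename.pdf](:/<32 hex chars>)
LINK = re.compile(r'\[([^\]]+)\]\(:/([a-fA-F0-9]{32})\)')


def rewrite_links(body: str, resolve: Callable[[str], Optional[str]]) -> str:
    """Replace Joplin resource links with external URLs where resolve() finds one."""

    def swap(match: re.Match) -> str:
        label, resource_id = match.groups()
        url = resolve(resource_id)
        # Keep the original link untouched when nothing was found
        return f"[{label}]({url})" if url else match.group(0)

    return LINK.sub(swap, body)
```

With a plain dictionary as the lookup, usage would be `rewrite_links(note_body, known_urls.get)`.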

The Cherry on Top now would be if Joplin could show a PDF/image coming from an external URL...