Fix Evernote export ENEX files

Here's a Python program, derived from one found elsewhere in the forums, that will:

  1. Take the first non-empty line of text from each note titled "Untitled Note" in the ENEX file and use it as the note's title. That line may appear before the first div tag or inside it.

  2. Not choke on escaped HTML text in note-attributes, and not choke on escapes in unchanged titles (both of which happened with the original script).

As a new user, I didn't realize I could only reply 3 times on the other forum entry, which is titled "How to Auto Title notes that were imported?"

    # max chars in title
    titleLength = 100

    import re
    import html
    import os
    import xml.etree.ElementTree as ET

    # define a function that takes a string as an argument and strips all html fields
    def strip_html_fields(new_title):
        # use re.sub to replace any html tags with an empty string
        # the pattern is < followed by any characters until >, with the flags re.IGNORECASE and re.DOTALL
        # the replacement is an empty string
        # the string is new_title
        return re.sub("<.*?>", "", new_title, flags=re.IGNORECASE | re.DOTALL)

    # Define a function that takes a file name as an argument to process that file
    def process_file(file_name):
        print("##################")
        print(file_name)
        # parse the .enex file
        tree = ET.parse(file_name)
        root = tree.getroot()

        # Loop through all the notes in the tree
        for note in root.findall("note"):

            # Get the title and the content of the note
            title = note.find("title").text
            content = note.find("content").text

            # Check if the title is "Untitled Note"
            if title == "Untitled Note":

                # split the content on <div and on newlines, and look for the first non-empty line
                txts = content.split("<div")
                prefix = ''
                for txt in txts:
                    work = prefix + txt
                    work = strip_html_fields( work )
                    work = work.split('\n')
                    for tmp in work:
                        tmp = tmp.strip()[ :titleLength ].strip()
                        if len( tmp ):
                            break
                    new_title = tmp
                    if len( new_title ):
                        break
                    prefix = '<div'
            else:
                new_title = html.escape( title )

            if new_title:
                # Replace the title with the new title
                note.find("title").text = new_title
                print(new_title)
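            attrs = note.find("note-attributes")
            # escape attribute text; skip a missing element or attributes with no text
            for attr in (attrs if attrs is not None else []):
                if attr.text is not None:
                    attr.text = html.escape( attr.text )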
            note.find("content").text = '<![CDATA[' + content + ']]>'

        # write the modified tree back to the same .enex file (overwrites it in place)
        tree.write(file_name)

        # open the .enex file in binary mode to convert HTML character codes into text equivalents
        with open(file_name, "rb") as f:
            # read the file content as bytes
            data = f.read()
            # decode the bytes using utf-8 encoding
            text = data.decode("utf-8")
            # unescape the HTML character codes using the html.unescape function
            text = html.unescape(text)

        # open the .enex file in binary mode to write the output
        with open(file_name, "wb") as f:
            # encode the text using utf-8 encoding
            data = text.encode("utf-8")
            # write the data to the file using the file object's write method
            f.write(data)

    # Get the current directory
    current_dir = os.getcwd()

    # Loop through all the files in the current directory
    for file in os.listdir(current_dir):
        # Check if the file has the .enex extension
        if file.endswith(".enex"):
            # Apply the function to the file name
            process_file(file)

I had a similar issue to yours with the Python script you used as your source (I think? The import crashed because of an @ somewhere).

But when I copy and paste your code, it comes with indents that weren't in the other code (basically the whole thing, except the first line, is indented), and Python refuses to run it.

I'm not a coder so I don't know what's going on, but I tried to manually remove all of the indents (4 spaces at the start of each non-empty line), and... it didn't work.

The terminal spat this out.

Enote - 1 Jan 2024 - RETITLED 1.enex
Traceback (most recent call last):
File "C:\Users\Guest1\Dropbox\Admin\Archive (Admin)\Evernote and Joplin\PreTitle2.txt", line 89, in <module>
process_file(file)
File "C:\Users\Guest1\Dropbox\Admin\Archive (Admin)\Evernote and Joplin\PreTitle2.txt", line 22, in process_file
tree = ET.parse(file_name)
^^^^^^^^^^^^^^^^^^^
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\xml\etree\ElementTree.py", line 1203, in parse
tree.parse(source, parser)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\xml\etree\ElementTree.py", line 568, in parse
self._root = parser._parse_whole(source)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 20, column 44

But... line 20 in Notepad is only 21 columns long so... I'm guessing it's talking about something else.

This is my text file (sorry if I've posted this wrong)

# max chars in title
titleLength = 100

import re
import html
import os
import xml.etree.ElementTree as ET

# define a function that takes a string as an argument and strips all html fields
def strip_html_fields(new_title):
    # use re.sub to replace any html tags with an empty string
    # the pattern is < followed by any characters until >, with the flags re.IGNORECASE and re.DOTALL
    # the replacement is an empty string
    # the string is new_title
    return re.sub("<.*?>", "", new_title, flags=re.IGNORECASE | re.DOTALL)

# Define a function that takes a file name as an argument to process that file
def process_file(file_name):
    print("##################")
    print(file_name)
    # parse input.enex file
    tree = ET.parse(file_name)
    root = tree.getroot()

    # Loop through all the notes in the tree
    for note in root.findall("note"):

        # Get the title and the content of the note
        title = note.find("title").text
        content = note.find("content").text

        # Check if the title is "Untitled Note"
        if title == "Untitled Note":

            # split text on <div>, and on new lines, and look for a non-empty
            txts = content.split("<div")
            prefix = ''
            for txt in txts:
                work = prefix + txt
                work = strip_html_fields( work )
                work = work.split('\n')
                for tmp in work:
                    tmp = tmp.strip()[ :titleLength ].strip()
                    if len( tmp ):
                        break
                new_title = tmp
                if len( new_title ):
                    break
                prefix = '<div'
        else:
            new_title = html.escape( title )

        if new_title:
            # Replace the title with the new title
            note.find("title").text = new_title
            print(new_title)
        attrs = note.find("note-attributes")
        for attr in attrs:
            attr.text = html.escape( attr.text )
        note.find("content").text = '<![CDATA[' + content + ']]>'

    # write the modified tree to output.enex file
    tree.write(file_name)

    # open the .enex file in binary mode to convert HTML character codes into text equivalents
    with open(file_name, "rb") as f:
        # read the file content as bytes
        data = f.read()
        # decode the bytes using utf-8 encoding
        text = data.decode("utf-8")
        # unescape the HTML character codes using the html.unescape function
        text = html.unescape(text)

    # open the .enex file in binary mode to write the output
    with open(file_name, "wb") as f:
        # encode the text using utf-8 encoding
        data = text.encode("utf-8")
        # write the data to the file using the file object's write method
        f.write(data)

# Get the current directory
current_dir = os.getcwd()

# Loop through all the files in the current directory
for file in os.listdir(current_dir):
    # Check if the file has the .enex extension
    if file.endswith(".enex"):
        # Apply the function to the file name
        process_file(file)

I'm no expert on the "markdown" used in this forum, but it seems that indenting by 4 spaces makes it quote the text verbatim, without interpreting the code as markdown syntax. So removing the 4 spaces was the right thing to do. It seems you figured out a better way to quote an input file... care to share and educate me?
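For anyone else who copies code out of a 4-space-indented forum block, here is a minimal sketch of removing that extra indentation automatically instead of by hand. The function name `strip_forum_indent` is made up for illustration, and it assumes the forum added exactly 4 spaces to each indented line; lines that were copied without the extra indent (such as a first line) are left alone.

```python
def strip_forum_indent(text, width=4):
    """Remove `width` leading spaces from every line that has them."""
    prefix = " " * width
    lines = []
    for line in text.splitlines():
        # Only strip lines that really start with the forum's extra indent
        lines.append(line[width:] if line.startswith(prefix) else line)
    return "\n".join(lines) + "\n"


pasted = "# max chars in title\n    titleLength = 100\n    def f():\n        pass\n"
print(strip_forum_indent(pasted))
```

Python's own indentation inside functions and loops is kept, because only the first 4 spaces of each line are removed.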

Regarding the error: that is one of the modules complaining about the input file it is reading, so the line and column refer to that input file, not to the script. Sadly, the bottom line of the error doesn't name the input file, but the top line you pasted may show it. Also, the files should come straight from the ENEX export, not be processed by anything else before this script is run. Still, it might be enlightening to know what is at and near that character position in your input file.
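If it helps, here is a rough sketch of how one could find out which file and which line the parser is complaining about. It is not part of the script above; `describe_parse_error` is a hypothetical helper name, and it assumes the file can be read as UTF-8. It relies on the `position` attribute that `xml.etree.ElementTree.ParseError` provides.

```python
import xml.etree.ElementTree as ET


def describe_parse_error(file_name):
    """Return None if file_name parses as XML, otherwise a message that
    names the file, the reported position, and the offending line."""
    try:
        ET.parse(file_name)
        return None
    except ET.ParseError as err:
        # position is a (line, column) tuple; the line number starts at 1
        line_no, column = err.position
        with open(file_name, encoding="utf-8", errors="replace") as f:
            lines = f.read().split("\n")
        bad_line = lines[line_no - 1] if 0 < line_no <= len(lines) else ""
        return f"{file_name}: line {line_no}, column {column}: {bad_line!r}"
```

Calling this on each `.enex` file before processing it would show the text of the line that the "line 20, column 44" style of message refers to.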

Oh, okay sweet, yes I had just run your script on the same file(s) after running the original one that didn't work. I hadn't thought about that being an issue. Thanks!
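Since the script overwrites each .enex file in place, one way to avoid re-processing an already modified file is to keep an untouched copy first. This is only a sketch under assumptions: `backup_enex` and the `.orig` suffix are invented for illustration and are not part of the script above.

```python
import shutil
from pathlib import Path


def backup_enex(file_name, suffix=".orig"):
    """Copy file_name to file_name + suffix unless that backup already exists."""
    source = Path(file_name)
    backup = source.with_name(source.name + suffix)
    if not backup.exists():
        # copy2 also preserves timestamps where the platform allows it
        shutil.copy2(source, backup)
    return backup
```

Because the backup name ends in `.orig` rather than `.enex`, the loop over `.enex` files in the script would not pick it up.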

I ran it on a copy of my backup, and (judging by the terminal window) it has done an awesome job (aside from a small handful of " ' " characters it has translated as some &string, which is pretty inconsequential). I'm importing it to Joplin now...

To quote the code, I copied and pasted it, selected the whole thing, and chose 'preformatted text' (Ctrl+E).

Okay, as far as I can tell it's fixed everything perfectly!

The " ' " symbol must just display that way in the cmd terminal because it looks flawless in Joplin everywhere I can see it.

Thank you!

Thanks for the feedback. Yes, the "character escaping" required by the XML format is the source of the &string; stuff, and while it is visible in the .enex files, it should not be visible in the Joplin user interface. Which, of course, you figured out after doing the import.
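As a small illustration of that escaping, using the same `html` module the script imports (the sample text is invented, and the exact codes you saw in your terminal may differ):

```python
import html

raw = "Tom & Jerry's <notes>"

# escape() replaces the characters that have special meaning in XML/HTML
escaped = html.escape(raw)
print(escaped)   # Tom &amp; Jerry&#x27;s &lt;notes&gt;

# unescape() turns the codes back into the original characters
restored = html.unescape(escaped)
print(restored == raw)   # True
```

The apostrophe becomes `&#x27;` in the escaped form, which is the kind of &string; that shows up in the raw file or terminal but not in an application that decodes it.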

Hopefully your reported issues with using the script will help someone else avoid the same issue.

Thanks for the hint about using preformatted text.