How to Auto Title notes that were imported?

Genn · 21 December 2023 08:30

So I fixed the program for the title issues, and I could share it. But it seems that if a note-attributes tag exists and has content, each attribute in it should be processed with html.escape so that the later unescape doesn't destroy it.

I'm not familiar with ElementTree, so I'm not sure how to write a loop over the note-attributes content.
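One way such a loop could look is sketched below. Iterating over an ElementTree element yields its direct children, so each attribute under note-attributes can be visited in turn. The helper name `escape_note_attributes` and the sample note are made up for illustration; only `html.escape` and the note-attributes tag come from the post.

```python
import html
import xml.etree.ElementTree as ET


def escape_note_attributes(note):
    """Escape the text of every direct child of <note-attributes>, if present.

    Hypothetical helper: the name and behaviour are assumptions, not part of
    the original script.
    """
    attrs = note.find("note-attributes")
    if attrs is None:
        return  # this note has no note-attributes tag
    for attr in attrs:  # iterating an Element yields its direct children
        if attr.text:  # skip empty attributes
            attr.text = html.escape(attr.text)


# Made-up sample note with two attributes that contain special characters
sample = ET.fromstring(
    "<note><title>t</title>"
    "<note-attributes>"
    "<source-url>http://example.com/?a=1&amp;b=2</source-url>"
    "<author>A &lt;B&gt;</author>"
    "</note-attributes></note>"
)
escape_note_attributes(sample)
print(sample.find("note-attributes/source-url").text)
print(sample.find("note-attributes/author").text)
```

In the script, this would presumably be called once per note inside the `for note in root.findall("note")` loop, before the tree is written.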

Another issue is that sometimes there is text before the first div, so shouldn't that text be used as the title? And if there is nothing there, the first div is sometimes empty too. Perhaps it would be nice to check for text before the first div, and then iterate over the divs until finding something non-blank?
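A rough sketch of that idea, using the same regex-based approach as the script: collect the text before the first div, then each div's content in order, and return the first candidate that is non-blank once tags are stripped. The names `guess_title`, `strip_tags` and `TITLE_LENGTH` are hypothetical, and the code assumes the note content is a string with an en-note wrapper, as in typical ENEX exports.

```python
import re

TITLE_LENGTH = 40  # assumed to match titleLength in the script


def strip_tags(fragment):
    """Remove anything that looks like an HTML tag."""
    return re.sub(r"<.*?>", "", fragment, flags=re.DOTALL)


def guess_title(content, max_len=TITLE_LENGTH):
    """Return the first non-blank piece of text in the note, or ""."""
    # Look only inside <en-note>...</en-note> when that wrapper is present
    body = re.search(r"<en-note[^>]*>(.*)</en-note>", content, flags=re.DOTALL)
    text = body.group(1) if body else content

    # First candidate: text before the first div (whole body if no div)
    first_div = re.search(r"<div\b", text)
    candidates = [text[: first_div.start()] if first_div else text]
    # Then the content of each div, in document order
    candidates.extend(
        re.findall(r"<div\b[^>]*>(.*?)</div>", text, flags=re.DOTALL)
    )

    for candidate in candidates:
        title = strip_tags(candidate).strip()[:max_len].strip()
        if title:
            return title
    return ""


print(guess_title("<en-note>Intro text<div>first</div></en-note>"))
print(guess_title("<en-note><div><br/></div><div>Real <b>title</b></div></en-note>"))
```

Entities such as `&nbsp;` are not treated as blank here, so a div that contains only those would still be picked; that case would need extra handling.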

# max chars in title
titleLength = 40

import re
import html
import os
import xml.etree.ElementTree as ET

# define a function that takes a string as an argument and strips all html fields
def strip_html_fields(new_title):
    # use re.sub to replace any html tags with an empty string
    # the pattern is < followed by any characters until >, with the flags re.IGNORECASE and re.DOTALL
    # the replacement is an empty string
    # the string is new_title
    return re.sub("<.*?>", "", new_title, flags=re.IGNORECASE | re.DOTALL)

# Define a function that takes a file name as an argument to process that file
def process_file(file_name):
    print("##################")
    print(file_name)
    # parse input.enex file
    tree = ET.parse(file_name)
    root = tree.getroot()

    # Loop through all the notes in the tree
    for note in root.findall("note"):

        # Get the title and the content of the note
        title = note.find("title").text
        content = note.find("content").text

        # Check if the title is "Untitled Note"
        if title == "Untitled Note":

            # Find the first div in the content
            start = content.find("<div>")
            end = content.find("</div>", start)

            # Extract the text between the div tags (empty if no div was found)
            new_title = content[start + 5 : end] if start != -1 and end != -1 else ""
            new_title = strip_html_fields(new_title)
            new_title = new_title[:titleLength]
            new_title = new_title.strip()
        else:
            new_title = html.escape( title )

        if new_title:
            # Replace the title with the new title
            note.find("title").text = new_title
            print(new_title)
        note.find("content").text = '<![CDATA[' + content + ']]>'

    # write the modified tree to output.enex file
    tree.write(file_name)

    # open the .enex file in binary mode to convert HTML character codes into text equivalents
    with open(file_name, "rb") as f:
        # read the file content as bytes
        data = f.read()
        # decode the bytes using utf-8 encoding
        text = data.decode("utf-8")
        # unescape the HTML character codes using the html.unescape function
        text = html.unescape(text)

    # open the .enex file in binary mode to write the output
    with open(file_name, "wb") as f:
        # encode the text using utf-8 encoding
        data = text.encode("utf-8")
        # write the data to the file using the file object's write method
        f.write(data)

# Get the current directory
current_dir = os.getcwd()

# Loop through all the files in the current directory
for file in os.listdir(current_dir):
    # Check if the file has the .enex extension
    if file.endswith(".enex"):
        # Apply the function to the file name
        process_file(file)

jopjop · 26 December 2023 14:12

Hello, can I ask what the implication is of this 'note-attribute' problem? What actually goes wrong?

Genn · 3 January 2024 02:21

The way the ENEX file is processed is somewhat strange, but it mostly works. The ENEX file is a flavor of XML, so certain characters (particularly less-than, greater-than, and ampersand) must be escaped if they are used in the text. The library used to parse the file unescapes the text, so when the text is edited and written back, it must be re-escaped. The problem arose from this un-escaping and re-escaping: how many times it happened, and which pieces were escaped and which were not. It only mattered if those characters were in the data, and if some of the values contained any data at all. Anand likely did what was necessary for his ENEX files and shared the program, but it didn't work for all of mine, so I tweaked it until it did.

As a new user, I couldn't post the new version here, but I did post it in a separate post. I'm not sure how to link to that one, but it is the only new post by me. I'm also not sure this reply will work, as it is my 4th, but maybe the limit of 3 is for sequential ones? About to find out!

P.S. The title of my post is "Fix Evernote export ENEX files". Not sure why I can't include a link to it, but links seem to be forbidden here.
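The un-escaping and re-escaping described in this reply can be seen in a small experiment. This is only an illustration of the mechanism, assuming the script's approach of running html.unescape over the whole written file; the sample title is made up.

```python
import html
import xml.etree.ElementTree as ET

# Parsing unescapes: the element text holds the plain character.
elem = ET.fromstring("<title>Tom &amp; Jerry</title>")
parsed_text = elem.text

# Serializing escapes once, so a plain parse/write round trip is lossless.
plain_round_trip = ET.tostring(elem, encoding="unicode")

# Unescaping the whole file afterwards leaves a bare ampersand,
# which is no longer valid XML.
broken = html.unescape(plain_round_trip)

# Escaping the text first adds a second layer, which the whole-file
# unescape then removes again.
elem.text = html.escape(parsed_text)
double_escaped = ET.tostring(elem, encoding="unicode")
repaired = html.unescape(double_escaped)

print(parsed_text)       # Tom & Jerry
print(plain_round_trip)  # <title>Tom &amp; Jerry</title>
print(broken)            # <title>Tom & Jerry</title>
print(double_escaped)    # <title>Tom &amp;amp; Jerry</title>
print(repaired)          # <title>Tom &amp; Jerry</title>
```

Any value that skips the extra escape but still goes through the whole-file unescape ends up like `broken`, which is consistent with the problem only showing up when those characters appear in the data.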
