Here's a Python program, derived from one found elsewhere in the forums, that will:
- Grab the first non-empty line of text of an Untitled Note in the ENEX file and use that as the title for the note. The first line might be found before or inside the first div tag.
- Not choke on escaped HTML text in note-attributes, and not choke on escapes in unchanged titles (both of which happened with the original script).

As a new user, I didn't realize I could only reply 3 times on the other forum entry, which is titled "How to Auto Title notes that were imported?"

    # max chars in title
    titleLength = 100

    import re
    import html
    import os
    import xml.etree.ElementTree as ET

    # define a function that takes a string as an argument and strips all html fields
    def strip_html_fields(new_title):
        # use re.sub to replace any html tags with an empty string
        # the pattern is < followed by any characters until >, with the flags re.IGNORECASE and re.DOTALL
        # the replacement is an empty string
        # the string is new_title
        return re.sub("<.*?>", "", new_title, flags=re.IGNORECASE | re.DOTALL)

    # Define a function that takes a file name as an argument to process that file
    def process_file(file_name):
        print("##################")
        print(file_name)
        # parse input.enex file
        tree = ET.parse(file_name)
        root = tree.getroot()
        # Loop through all the notes in the tree
        for note in root.findall("note"):
            # Get the title and the content of the note
            title = note.find("title").text
            content = note.find("content").text
            # Check if the title is "Untitled Note"
            if title == "Untitled Note":
                # split text on <div>, and on new lines, and look for a non-empty
                txts = content.split("<div")
                prefix = ''
                for txt in txts:
                    work = prefix + txt
                    work = strip_html_fields( work )
                    work = work.split('\n')
                    for tmp in work:
                        tmp = tmp.strip()[ :titleLength ].strip()
                        if len( tmp ):
                            break
                    new_title = tmp
                    if len( new_title ):
                        break
                    prefix = '<div'
            else:
                new_title = html.escape( title )
            if new_title:
                # Replace the title with the new title
                note.find("title").text = new_title
                print(new_title)
            attrs = note.find("note-attributes")
            for attr in attrs:
                attr.text = html.escape( attr.text )
            note.find("content").text = '<![CDATA[' + content + ']]>'
        # write the modified tree to output.enex file
        tree.write(file_name)
        # open the .enex file in binary mode to convert HTML character codes into text equivalents
        with open(file_name, "rb") as f:
            # read the file content as bytes
            data = f.read()
            # decode the bytes using utf-8 encoding
            text = data.decode("utf-8")
            # unescape the HTML character codes using the html.unescape function
            text = html.unescape(text)
        # open the .enex file in binary mode to write the output
        with open(file_name, "wb") as f:
            # encode the text using utf-8 encoding
            data = text.encode("utf-8")
            # write the data to the file using the file object's write method
            f.write(data)

    # Get the current directory
    current_dir = os.getcwd()
    # Loop through all the files in the current directory
    for file in os.listdir(current_dir):
        # Check if the file has the .enex extension
        if file.endswith(".enex"):
            # Apply the function to the file name
            process_file(file)

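
To try the title-extraction step on its own, here is a minimal sketch of the same idea (split on `<div`, strip the tags, take the first non-empty line) applied to a plain string. The function name `first_text_line` and the sample note body are made up for illustration and are not part of the script above.

```python
import re

TITLE_LENGTH = 100  # same limit as titleLength in the script


def first_text_line(content, max_len=TITLE_LENGTH):
    """Return the first non-empty line of visible text, or '' if there is none."""
    # Split on "<div" so that text before the first div is checked first.
    for i, chunk in enumerate(content.split("<div")):
        if i > 0:
            chunk = "<div" + chunk  # put the tag start back so the regex can strip it
        text = re.sub(r"<.*?>", "", chunk, flags=re.DOTALL)
        for line in text.split("\n"):
            line = line.strip()[:max_len].strip()
            if line:
                return line
    return ""


# Invented sample: the first div is empty, the second holds the first real text.
sample = "<en-note>\n<div><br/></div>\n<div>Shopping list</div>\n<div>milk</div></en-note>"
print(first_text_line(sample))  # Shopping list
```

A note with no visible text at all gives an empty string, in which case the script leaves the title untouched.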
I had a similar issue to yours with the Python script you used as your source (I think? The import crashed because of an @ somewhere).
But when I copy and paste your code, it comes with indents that weren't in the other code (basically the whole thing, except the first line, is indented), and Python refuses to run it.
I'm not a coder so I don't know what's going on, but I tried to manually remove all of the indents (4 spaces at the start of each non-empty line), and... it didn't work.
The terminal spat this out:

    Enote - 1 Jan 2024 - RETITLED 1.enex
    Traceback (most recent call last):
      File "C:\Users\Guest1\Dropbox\Admin\Archive (Admin)\Evernote and Joplin\PreTitle2.txt", line 89, in <module>
        process_file(file)
      File "C:\Users\Guest1\Dropbox\Admin\Archive (Admin)\Evernote and Joplin\PreTitle2.txt", line 22, in process_file
        tree = ET.parse(file_name)
               ^^^^^^^^^^^^^^^^^^^
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\xml\etree\ElementTree.py", line 1203, in parse
        tree.parse(source, parser)
      File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\xml\etree\ElementTree.py", line 568, in parse
        self._root = parser._parse_whole(source)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 20, column 44

But... line 20 in Notepad is only 21 columns long so... I'm guessing it's talking about something else.
This is my text file (sorry if I've posted this wrong). It is the script from the post above, with the 4 leading spaces removed from each line.
So, I'm not an expert on the "markdown" used in this forum, but it seems that indenting 4 spaces makes it quote the text without interpreting the code as markdown syntax. So removing the 4 spaces was the right thing to do. It seems you figured out a better way to quote an input file... care to share and educate me?
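
Removing a forum's 4 extra spaces by hand is error-prone, because the deeper indentation inside the code has to stay. A minimal sketch using the standard `textwrap` module follows; the `pasted` string is an invented example, not the real script.

```python
import textwrap

# Hypothetical text as copied from a forum post: every line carries
# 4 extra leading spaces on top of the code's own indentation.
pasted = (
    "    def greet(name):\n"
    "        if name:\n"
    "            print('hello', name)\n"
)

# dedent() removes only the whitespace common to all lines,
# so the nesting inside the function is preserved.
cleaned = textwrap.dedent(pasted)
print(cleaned)

# The reverse direction: add 4 spaces to every line before posting.
quoted = textwrap.indent(cleaned, "    ")
```

`dedent()` only helps when every non-blank line carries the extra spaces. If the first line was copied without them, as described above, that line needs its 4 spaces added back (or the rest handled separately) before this works.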
Regarding the error, that is one of the modules complaining about the input file it is reading, so the line and column refer to that input file (the .enex file), not to the script. Sadly, the error doesn't give the name of the input file on the bottom line, but the top line you pasted probably shows that file name. Also, the files should come straight from the ENEX export, not processed by anything else before this script is run. Still, it would be enlightening to know what is at and near that character position in your input file.
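
One way to see what sits at the reported position is to catch the `ParseError` and print the offending line of the input. This is only a sketch: `show_parse_error` is a hypothetical helper and the broken sample XML is invented. `ParseError.position` holds the (line, column) pair that appears in the message.

```python
import xml.etree.ElementTree as ET


def show_parse_error(xml_text):
    """Return (line, column, offending_line), or None if the XML parses."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        line, column = err.position  # the same numbers as in the error message
        # Lines are counted from 1, so subtract 1 to index the list.
        offending = xml_text.split("\n")[line - 1]
        return line, column, offending
    return None


# Invented example: a bare "&" is not allowed in XML text.
broken = "<note>\n<title>Fish & chips</title>\n</note>"
result = show_parse_error(broken)
if result:
    line, column, offending = result
    print(f"line {line}, column {column}: {offending}")
```

For a real file, the text could be read with `open(path, encoding="utf-8").read()` and passed to the helper.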
Oh, okay sweet, yes I had just run your script on the same file(s) after running the original one that didn't work. I hadn't thought about that being an issue. Thanks!
I ran it on a copy of my backup, and (judging by the terminal window) it has done an awesome job (aside from a small handful of " ' " characters it has translated as some &string, which is pretty inconsequential). I'm importing it to Joplin now...
To quote the code, I copied and pasted it, selected the whole thing, and applied 'preformatted text' (Ctrl+E).
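
Since the script overwrites each .enex file in place, and a file that has already been processed may no longer parse, keeping untouched copies is a reasonable precaution. A minimal sketch, assuming a hypothetical helper `backup_enex_files` and a backup folder name `enex_backup` (neither is part of the script above):

```python
import shutil
from pathlib import Path


def backup_enex_files(folder, backup_name="enex_backup"):
    """Copy every .enex file in folder into a backup subfolder.

    Files that already have a backup are skipped, so the first
    (unprocessed) copy is never overwritten. Returns the new copies.
    """
    folder = Path(folder)
    backup_dir = folder / backup_name
    backup_dir.mkdir(exist_ok=True)
    copied = []
    for path in sorted(folder.glob("*.enex")):
        target = backup_dir / path.name
        if not target.exists():
            shutil.copy2(path, target)
            copied.append(target)
    return copied


# Example use, before running the retitling script in the same folder:
# backup_enex_files(".")
```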
Thanks for the feedback. Yes, the "character escaping" required by the XML format is the source of the &string; stuff, and while it is visible in the .enex files, it should not be visible in the Joplin user interface. Which, of course, you figured out after doing the import.
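
For anyone curious what that escaping looks like, here is a short illustration with Python's `html` module, which the script uses. The sample title is invented.

```python
import html

title = "Tom's \"Q&A\" <notes>"  # invented sample title

# escape() replaces the characters that have special meaning in XML/HTML.
escaped = html.escape(title)
print(escaped)  # Tom&#x27;s &quot;Q&amp;A&quot; &lt;notes&gt;

# unescape() reverses it, which is roughly what a reader of the file
# does before showing the text to the user.
restored = html.unescape(escaped)
print(restored == title)  # True
```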
Hopefully your reported issues with using the script will help someone else avoid the same issue.
Thanks for the hint about using preformatted text.