[SOLVED] Tips for safely removing duplicated notes from two very similar notebooks

manouchk · 11 October 2021 18:49

Hi,

I am looking for an efficient way to merge two very similar instances of the same notebook. They were identical until a synchronization problem (an old one) made them diverge. For some reason, one instance includes the title in the note body, so many notes are identical in meaning but not exactly identical in content. I looked for a plugin or approach that could merge those notes automatically, but I didn't find one and have not been able to program it myself yet. If anyone has an idea, please share it with me. I have about 5000 notes to merge.
The latest strategy I thought of is to export both instances of the notebook as Markdown and compare the two MD directories with meld. This lets me compare notes quite fast (compared to what I could do directly in Joplin): I can see whether two notes with the same name are identical in meaning and choose which one to keep. To merge or delete duplicated notes in Joplin, I first search for a phrase from the note in order to identify the notes. This usually gives me two notes, one from each instance of the notebook, and I can make the correct choice manually.
I initially thought I could handle 5000 notes in about 2-3 weeks, but it seems I would need maybe 5 months, so now I am looking for a better strategy!
Maybe I could do some steps faster by using the CLI or that kind of thing, which I have never used. Another possibility is that I could live with those duplicates without too much disturbance.
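
The export-and-compare strategy described above can be pre-sorted before opening meld, so that only the files that really differ need a manual look. This is a minimal sketch, assuming each export is a single flat folder of .md files and that duplicated notes were exported under the same file name; classify_exports and the bucket names are hypothetical, not something Joplin provides.

```python
import filecmp
from pathlib import Path


def classify_exports(dir_a: str, dir_b: str) -> dict:
    """Sort the Markdown files of two export folders into four buckets."""
    names_a = {p.name for p in Path(dir_a).glob("*.md")}
    names_b = {p.name for p in Path(dir_b).glob("*.md")}
    common = sorted(names_a & names_b)
    # shallow=False forces a byte-by-byte comparison instead of trusting
    # file size and modification time (the latter is unreliable after
    # an export/import round trip).
    same, different, errors = filecmp.cmpfiles(dir_a, dir_b, common, shallow=False)
    return {
        "identical": sorted(same),
        "different": sorted(different + errors),
        "only_in_a": sorted(names_a - names_b),
        "only_in_b": sorted(names_b - names_a),
    }
```

Only the "different" bucket then has to be reviewed by hand; the "identical" one can be deduplicated without looking at it.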

I'm open to any suggestions that may help me solve this problem.

I use joplin-desktop-2.3.5 on archlinux.

laurent · 11 October 2021 18:57

I don't know if that would be any easier but you could try that:

First create a Git repository (with git init)
Export all your notes to it in Markdown format
Commit the changes (git add -A && git commit -m 'init')
Then export the second set of notes to it

Then using something like Sublime Merge you would directly see everything that has changed, and you can choose to stage the changes you want to keep or discard the ones you don't.

Otherwise, couldn't you just keep the most recent version of each note? Isn't it the best version in most cases? Then next to it you can maybe keep a backup of both sets so if one day you realise that something's missing you have it there.

manouchk · 11 October 2021 19:12

I will try the Git merging approach (I have to learn some Git first).

As for the other suggestion, based on modification time: unfortunately I did some unfortunate manipulations, like exporting as Markdown and importing again, which makes me feel it is a somewhat risky approach. I think the modification date is no longer related to when the notes in those notebooks were actually edited.

manouchk · 12 October 2021 01:11

In fact, many notes have exactly the same content except for the first two lines. The first line contains a date in a fixed format and the second line is empty. Maybe I could identify the notes/files with this characteristic, remove those lines, find the notes that are identical after this modification, and merge them. The problem is that I would lose the attachments of notes that have them, right? When exporting as Markdown, do we lose attachments?
To analyze the other notes I am thinking of using Python, because I have some skill with it. I think I am too weak in Git, JavaScript and TypeScript; I would need quite a lot of time to learn enough to solve this problem.
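
The idea of ignoring the first two lines can be sketched in plain Python, independently of how the note bodies are obtained. This is only an illustration under assumptions: the function names are hypothetical, the date format is not checked (only the "second line is empty" rule is used), and trailing whitespace is ignored.

```python
def normalise(body: str) -> str:
    """Ignore trailing whitespace and leading/trailing blank lines."""
    return "\n".join(line.rstrip() for line in body.splitlines()).strip()


def strip_header(body: str) -> str:
    """Drop the first two lines when the second one is blank."""
    lines = body.splitlines()
    if len(lines) >= 2 and lines[1].strip() == "":
        lines = lines[2:]
    return normalise("\n".join(lines))


def same_content(body_a: str, body_b: str) -> bool:
    """True if the bodies match as-is, or once ONE of them loses its header."""
    a, b = normalise(body_a), normalise(body_b)
    return a == b or strip_header(body_a) == b or a == strip_header(body_b)
```

The header is removed from only one side at a time on purpose: stripping both sides would also match two different notes that merely share everything after their first paragraph.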

Am I the only one having this kind of problem?

The only languages I could use for "scripting" Joplin are JavaScript or TypeScript, right?

manouchk · 12 October 2021 02:17

I found that I could separate out the notes with attachments by searching for something like notebook:Google_Keep_desktop resource:* and treat these notes manually so as not to lose the attachments. I found 339 notes with attachments in one notebook and 340 in the other. It's getting "fun"...
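
The same separation can be done on the note body itself. As far as I know, Joplin references an attachment in Markdown as a link whose target is ":/" followed by the 32-character hexadecimal resource id; the sketch below relies on that assumption, and resource_ids is a hypothetical helper. Attachments embedded through HTML tags would not be caught by this pattern.

```python
import re

# Matches Markdown link targets such as (:/0123456789abcdef0123456789abcdef)
RESOURCE_LINK = re.compile(r"\(:/([0-9a-f]{32})\)")


def resource_ids(body: str) -> list:
    """Return the resource ids referenced by a note body, in order."""
    return RESOURCE_LINK.findall(body)


def has_attachment(body: str) -> bool:
    """True if the body references at least one resource."""
    return bool(resource_ids(body))
```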

manouchk · 17 October 2021 03:42

I finally opted for Python, as I know it a (little) bit better than any of the other options.

To visualize the differences I use:
d = difflib.Differ()
and
diff = d.compare(text1_lines, text2_lines)

With compare, the output looks like this:

----------------- new comparison of possibly identical notes---------------

+ Le langage du cerveau Lionel Naccache Tête au carré
+ 
  Le système nerveux digestif est important dans la régulation des émotions et dans l'idée de soi.
  
  Nerf pneumo gastrique
  
  Il n'y a pas une région de la conscience. On n'est pas dans la phrenologie.

---------------- end of comparison of possibly identical notes---------------
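
For long notes it can help to print only the lines that differ. A minimal sketch with the same difflib.Differ call, assuming the two note bodies are available as plain strings; changed_lines is a hypothetical helper name.

```python
import difflib


def changed_lines(body_a: str, body_b: str) -> list:
    """Return only the '- ' (removed) and '+ ' (added) lines of the diff."""
    differ = difflib.Differ()
    diff = differ.compare(body_a.splitlines(), body_b.splitlines())
    # Differ prefixes unchanged lines with two spaces and hint lines with '? '
    return [line for line in diff if line[:2] in ("- ", "+ ")]
```

With a note that only gained a title line and an empty line, as in the comparison above, the result contains just those two added lines.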

The code I use, which has many limitations and bugs, looks like this (it nevertheless makes it quite fast to choose which note to keep when two notes are very similar and have the same meaning):

import difflib
from joppy.api import Api
api = Api(token='xxxxxxxx')

import getkey

# initializing things
#notebook1 = None
#notebook2 = None
#notebook_of_analizednotes_name = None


#manually define the names of the notebooks of interest for merging (names are case sensitive)
#first, the two notebooks in which duplicated notes are to be found
notebook1name='Google_keep_desktop'
notebook2name='Google_keep_laptop'
#notebook where the notes that will be kept are put
notebook_of_analizednotes_name='Google_keep_analized_via_script'
#notebook where the notes that in principle should and could be deleted are put
notebook_of_tobedeletednotes_name='Google_keep_to_be_deleted'

#get the list of id and title of all notebooks
notebooklist=api.get_notebooks(fields='id,title')['items']

#get the four dictionaries of the notebooks with the selected names
#notebookx['id'] will give the id of notebook1
#notebookx['title'] will give the title of notebook1
for notebook in notebooklist:
    if notebook['title']==notebook1name:
        notebook1=notebook
    if notebook['title']==notebook2name:
        notebook2=notebook
    if notebook['title']==notebook_of_analizednotes_name:
        notebook_of_analizednotes=notebook
    if notebook['title']==notebook_of_tobedeletednotes_name:
        notebook_of_tobedeletednotes=notebook
        
print("notebook1 is", str(notebook1),"\n")
print("notebook2 is", str(notebook2),"\n")
print("notebook_of_analizednotes is", str(notebook_of_analizednotes),"\n")
print("notebook_of_tobedeletednotes is", str(notebook_of_tobedeletednotes),"\n")




#loop over the 100 notes of one of the notebooks that contain duplicated notes
# make a query from the longest line of the note
for note in api.get_notes(notebook_id=notebook2['id'],fields='id,title,body')['items']:
    # create a list containing all lines of the note
    notelines=note['body'].split('\n')
    print("title:",note['title'],"\n","longest line:",max(notelines,key=len),"\n")
    #make a search with the longest line of the note 
    q='"'+max(notelines,key=len)+'"'
    print("searching for:",q)
    identicalnotes=api.search(query=q)
    #if the query returns exactly 2 notes then...
    if len(identicalnotes['items'])==2:
        print("\n\n\n\n\n\n----------------- new comparison of possibly identical notes---------------\n")
        d = difflib.Differ()
        note1=api.get_note(id_=identicalnotes['items'][0]['id'],fields='id,title,body')
        #list of lines of note 1
        text1_lines=note1['body'].splitlines()
        note2=api.get_note(id_=identicalnotes['items'][1]['id'],fields='id,title,body')
        #list of lines of note 2
        text2_lines=note2['body'].splitlines()
        #compare notes and give a representation of the difference beetween notes
        diff = d.compare(text1_lines, text2_lines)

        print('\n'.join(diff))
        print("\n---------------- end of comparison of possibly identical notes---------------\n\n")

        while 1:
            print("""
            choose your option:
            1: note 1 -> analized notebook
               note 2 -> tobedeleted notebook
               analize next note
            2: note 2 -> analized notebook
               note 1 -> tobedeleted notebook
               analize next note
            3: do not modify note and analize next note

               analize next note
            """)
            key = getkey.getkey()
            if key == '1':
                print('note 1 -> analized notebook')
                #move note1 to analized notebook
                api.modify_note(id_=note1['id'], parent_id=notebook_of_analizednotes['id'])
                print('note 2 -> tobedeleted notebook')
                #move note2 to tobedeleted notebook
                api.modify_note(id_=note2['id'], parent_id=notebook_of_tobedeletednotes['id'])
                print('analizing next note')
                break
            elif key == '2':
                print('note 1 -> tobedeleted notebook')
                #move note1 to tobedeleted notebook
                api.modify_note(id_=note1['id'], parent_id=notebook_of_tobedeletednotes['id'])
                print('note 2 -> analized notebook')
                #move note2 to analized notebook
                api.modify_note(id_=note2['id'], parent_id=notebook_of_analizednotes['id'])
                print('analizing next note')
                break
            elif key == '3':
                break
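
One limitation of the script above is that a misspelled notebook name only shows up later as an undefined variable. A hedged sketch of a stricter lookup follows; find_notebooks and the role names are hypothetical, and the input is assumed to be a list of dictionaries with 'id' and 'title' keys, like the 'items' list used above.

```python
def find_notebooks(notebooks: list, wanted: dict) -> dict:
    """Map each role to its notebook dictionary, failing early if one is missing.

    notebooks: list of {'id': ..., 'title': ...} dictionaries
    wanted:    {role: notebook title}, titles are case sensitive
    """
    # If two notebooks share a title, the last one in the list wins.
    by_title = {nb["title"]: nb for nb in notebooks}
    missing = [title for title in wanted.values() if title not in by_title]
    if missing:
        raise KeyError("notebook(s) not found: " + ", ".join(missing))
    return {role: by_title[title] for role, title in wanted.items()}
```

It could be called as find_notebooks(notebooklist, {'source': notebook2name, 'kept': notebook_of_analizednotes_name}) before the main loop starts.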

manouchk · 1 November 2021 21:45

With this script, with small modifications, I was able to go from about 5000 notes down to about 170 in a few days. The remaining possibly duplicated notes are notes that have attachments and need a little more code. I have to do some research on how to compare attachments in order to check whether notes are identical including their attachments.
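
One possible way to compare attachments is to hash the attachment files and compare the digests. This is only a sketch of that idea, not what was used in this thread: it assumes the attachment files of each note have already been saved to disk somehow, and file_digest and same_attachments are hypothetical helper names.

```python
import hashlib


def file_digest(path: str, chunk_size: int = 65536) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    sha = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            sha.update(chunk)
    return sha.hexdigest()


def same_attachments(paths_a, paths_b) -> bool:
    """True if both notes carry the same files, whatever their names or order."""
    digests_a = sorted(file_digest(p) for p in paths_a)
    digests_b = sorted(file_digest(p) for p in paths_b)
    return digests_a == digests_b
```

Comparing digests rather than file names matters here because the same attachment may have received a different resource id in each notebook.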

manouchk · 11 November 2021 12:43

Here is the last script I used. Someone may be interested in adapting it to their own situation. It helped me a lot in speeding up the cleanup of duplicates. It shows a diff that lets you check whether two similar notes are semantically identical. When merging was necessary, I used meld and did the merge manually in Joplin. I could clean about 5000 duplicates in maybe 10-15 hours.

import difflib
from joppy.api import Api
api = Api(token='xxxxxxxx')

import getkey
import re

#manually define the names of the notebooks of interest for merging (names are case sensitive)
#first, the two notebooks in which duplicated notes are to be found
notebook1name='Google_keep_desktop'
notebook2name='Google_keep_laptop'
#notebook where the notes that will be kept are put
notebook_of_analizednotes_name='Google_keep_analized_notes'
#notebook where the notes that in principle should and could be deleted are put
notebook_of_tobedeletednotes_name='Google_keep_to_be_deleted'

#get the list of id and title of all notebooks
notebooklist=api.get_notebooks(fields='id,title')['items']
#notebooklist is such that:
#notebooklist[0]['id']    give the id    of the first notebook
#notebooklist[0]['title'] give the title of the first notebook

#select the notebook for which you want to see if it contains notes that are duplicated
#and select the notebook in which you want to put the note after managing it (notebook_of_analizednotes
# logic: if note have no doublon but

print("list of notebooks are the following:\n")
for i,notebookdic in enumerate(notebooklist):
    print(i,' ',notebookdic['title'])


#ask name of notebooks
#notebook1name=input('give the name of the notebook for which in you want to check if notes inside it have doublons:\n')
#notebook_of_analizednotes_name=input('give the name of the notebook in which you will put anotation that will be kept:\n')
#notebook_of_tobedeletednotes_name=input('give the name of the notebook in which you will put anotations that will not be kept:\n')
notebook1name='Google_Keep_desktop_attachement'
notebook_of_analizednotes_name='Google_Keep_laptop_analized_attachement'
notebook_of_tobedeletednotes_name='Google_keep_to_be_deleted'
########## stopped here on 08/11/2021


for notebookdic in notebooklist:
    if notebookdic['title']==notebook1name:
        notebook1=notebookdic
    if notebookdic['title']==notebook_of_analizednotes_name:
        notebook_of_analizednotes=notebookdic
    if notebookdic['title']==notebook_of_tobedeletednotes_name:
        notebook_of_tobedeletednotes=notebookdic

print("notebook1 is", str(notebook1),"\n")
print("notebook_of_analizednotes is", str(notebook_of_analizednotes),"\n")
print("notebook_of_tobedeletednotes is", str(notebook_of_tobedeletednotes),"\n")

#loop on all notes of notebook1
for note in api.get_all_notes(notebook_id=notebook1['id'],fields='id,title,body'):
    # create a list containing all lines of the note, after replacing some special characters by spaces
    notelines=note['body'].replace('(',' ').replace(')',' ').replace('[',' ').replace(']',' ').replace('"',' ').replace('&',' ').replace('#',' ').replace('%',' ').replace('.',' ').split('\n')
    # abandoned approach:
    #notelines=re.split(r'[`\-=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?]', note['body'])
    #make a search with the longest line of the note,
    #wrapped in " in order to get a correct search query
    q='"'+max(notelines,key=len)+'"'
    print("searching for:",q)
    identicalnotes=api.search(query=q)
    print("number of notes:",len(identicalnotes['items']))
    #printing all notes in order to see note titles
    print("identical notes:")
    for identicalnote in identicalnotes['items']:
        print(identicalnote)
    #if a single note is found
    if len(identicalnotes['items'])==1:
        #move not duplicated notes to the notebook of analyzed notes
        api.modify_note(id_=identicalnotes['items'][0]['id'], parent_id=notebook_of_analizednotes['id'])
    if len(identicalnotes['items'])>2:
        print("there are more than 2 notes at this point")
        # reduce the note list to the notes that are in the two notebooks where notes have to be analized:
        identicalnotes2=[i for i in identicalnotes['items'] if (i['parent_id']=='dfa6f637732f4d318b4fcf21b05aa6af' or i['parent_id']=='8b31a7b0486a453fb425516432a96225' )]
        print("identical notes that have not yet been analized:")
        for i,identicalnote in enumerate(identicalnotes2):
            print(i, identicalnote)
        while 1:
            print("""
            choose your option :
            Press Enter twice -> do nothing
            Press numbers of two notes -> compare these notes
            """)
            key1 = getkey.getkey()
            key2 = getkey.getkey()
            if key1 == '\n':
                break
            elif key1 != key2:
                #to be filled in
                d = difflib.Differ()
                #substitute identicalnotes['items'][0]['id'] by identicalnotes['items']
                note1=api.get_note(id_=identicalnotes2[int(key1)]['id'],fields='id,title,body')
                #list of lines of note 1
                text1_lines=note1['body'].splitlines()
                note2=api.get_note(id_=identicalnotes2[int(key2)]['id'],fields='id,title,body')
                #list of lines of note 2
                text2_lines=note2['body'].splitlines()
                #compare notes and give a representation of the difference beetween notes
                diff = d.compare(text1_lines, text2_lines)
                bodiescompare='\n'.join(diff)
                # choose only "+ notes" or "- notes"
                #if bodiescompare.splitlines()[0][0]=='-' and bodiescompare.splitlines()[0][0]=='-':
                #if bodiescompare.splitlines()[0][0]=='+' and bodiescompare.splitlines()[0][0]=='+':
                print("\n\n\n\n\n\n----------------- new comparison of possibly identical notes---------------\n")
                print(bodiescompare)
                print("\n---------------- end of comparison of possibly identical notes---------------\n\n")
                while 1:
                    print("""
                    choose your option:
                    1: note 1 -> analized notebook
                    note 2 -> tobedeleted notebook
                    analize next note
                    2: note 2 -> analized notebook
                    note 1 -> tobedeleted notebook
                    analize next note
                    3: do not modify note and analize next note

                    analize next note
                    """)
                    key3 = getkey.getkey()
                    if key3 == '1':
                        print('note 1 -> analized notebook')
                        #move note1 to analized notebook
                        api.modify_note(id_=note1['id'], parent_id=notebook_of_analizednotes['id'])
                        print('note 2 -> tobedeleted notebook')
                        #move note2 to tobedeleted notebook
                        api.modify_note(id_=note2['id'], parent_id=notebook_of_tobedeletednotes['id'])
                        print('analizing next note')
                        break
                    elif key3 == '2':
                        print('note 1 -> tobedeleted notebook')
                        #move note1 to tobedeleted notebook
                        api.modify_note(id_=note1['id'], parent_id=notebook_of_tobedeletednotes['id'])
                        print('note 2 -> analized notebook')
                        #move note2 to analized notebook
                        api.modify_note(id_=note2['id'], parent_id=notebook_of_analizednotes['id'])
                        print('analizing next note')
                        break
                    elif key3 == '3':
                        break
                #print('note 1 -> analized notebook')
                #move note1 to analized notebook
                #api.modify_note(id_=note1['id'], parent_id=notebook_of_analizednotes['id'])
                #print('note 2 -> tobedeleted notebook')
                #move note2 to tobedeleted notebook
                #api.modify_note(id_=note2['id'], parent_id=notebook_of_tobedeletednotes['id'])
                #print('analizing next note')
                break
            elif key1 == key2:
                #same number pressed twice: remove that note from the list of candidates
                identicalnotes['items'].pop(int(key1))
                print("list of identical notes is now:")
                for identicalnote in identicalnotes['items']:
                    print(identicalnote)
                #print('note 1 -> tobedeleted notebook')
                #move note1 to tobedeleted notebook
                #api.modify_note(id_=note1['id'], parent_id=notebook_of_tobedeletednotes['id'])
                #print('note 2 -> analized notebook')
                #move note2 to analized notebook
                #api.modify_note(id_=note2['id'], parent_id=notebook_of_analizednotes['id'])
                #print('analizing next note')
                break

        #q=input()
        #identicalnotes=api.search(query=q)
        # new ideas to develop:
        #print("they are more than 2 notes. Cycling through notes in order to find possibly some search query that can be addressed fast---")
        #for note2 in identicalnotes['items']:
        #    notelines=re.split(r'[`\-=~!@#$%^&*()_+\[\]{};\'\\:"|<,./<>?]', note2['body'])
        #    q=max(notelines,key=len)
        #    identicalnotes=api.search(query=q)
        #else:
        #    print("could strip down to 2 notes")
        #q is not empty, so the comparison of notes can be attempted
        #if len(q)>10:
        #    ok=1
        #    q='"'+q+'"'
        #    identicalnotes=api.search(query=q)
    if len(identicalnotes['items'])==2:
        #starting comparison of 2 notes and printing
        d = difflib.Differ()
        note1=api.get_note(id_=identicalnotes['items'][0]['id'],fields='id,title,body')
        #list of lines of note 1
        text1_lines=note1['body'].splitlines()
        note2=api.get_note(id_=identicalnotes['items'][1]['id'],fields='id,title,body')
        #list of lines of note 2
        text2_lines=note2['body'].splitlines()
        #compare notes and give a representation of the difference beetween notes
        diff = d.compare(text1_lines, text2_lines)
        bodiescompare='\n'.join(diff)
        # choose only "+ notes" or "- notes"
        #if bodiescompare.splitlines()[0][0]=='-' and bodiescompare.splitlines()[0][0]=='-':
        #if bodiescompare.splitlines()[0][0]=='+' and bodiescompare.splitlines()[0][0]=='+':
        print("\n\n\n\n\n\n----------------- new comparison of possibly identical notes---------------\n")
        print(bodiescompare)
        print("\n---------------- end of comparison of possibly identical notes---------------\n\n")
        while 1:
            print("""
            choose your option:
            1: note 1 -> analized notebook
            note 2 -> tobedeleted notebook
            analize next note
            2: note 2 -> analized notebook
            note 1 -> tobedeleted notebook
            analize next note
            3: do not modify note and analize next note

            analize next note
            """)
            key = getkey.getkey()
            if key == '1':
                print('note 1 -> analized notebook')
                #move note1 to analized notebook
                api.modify_note(id_=note1['id'], parent_id=notebook_of_analizednotes['id'])
                print('note 2 -> tobedeleted notebook')
                #move note2 to tobedeleted notebook
                api.modify_note(id_=note2['id'], parent_id=notebook_of_tobedeletednotes['id'])
                print('analizing next note')
                break
            elif key == '2':
                print('note 1 -> tobedeleted notebook')
                #move note1 to tobedeleted notebook
                api.modify_note(id_=note1['id'], parent_id=notebook_of_tobedeletednotes['id'])
                print('note 2 -> analized notebook')
                #move note2 to analized notebook
                api.modify_note(id_=note2['id'], parent_id=notebook_of_analizednotes['id'])
                print('analizing next note')
                break
            elif key == '3':
                break
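
The way the search query is built in the script above (longest line of the note, special characters replaced by spaces, wrapped in double quotes) can be isolated in a small function, which makes it easy to skip notes whose longest line is too short to be a useful search phrase. A sketch under assumptions: build_query is a hypothetical name, and the minimum length of 10 is only borrowed from the commented-out idea at the end of the script.

```python
def build_query(body: str, min_length: int = 10):
    """Return a quoted search phrase from the longest line, or None if too short."""
    cleaned = body
    # same characters as the chained replace() calls in the script
    for char in '()[]"&#%.':
        cleaned = cleaned.replace(char, " ")
    longest = max(cleaned.split("\n"), key=len).strip()
    if len(longest) < min_length:
        return None
    return '"' + longest + '"'
```

A None result means the note is better handled manually than searched for automatically.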

laurent · 11 November 2021 13:04

Nice that you got there eventually, it looks like it was quite some work!

manouchk · 11 November 2021 13:19

Yes, it was quite a bit of work for me, but in the end it was easier than I initially thought. Python's difflib really helped me. It made it possible to see immediately that two notes were semantically identical in my cases. For bigger notes I just had to scroll a bit, so 2 or 3 seconds were sufficient. With the getkey library, which reads a key press directly, I could make an almost immediate decision. I may say it was almost fun. I have to keep programming every 2 or 3 years in order not to forget everything!
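
The single-keypress decision described above boils down to a small loop that keeps reading keys until a valid one arrives. A minimal sketch, in which ask_choice is a hypothetical helper and the key reader is passed in as a function so the loop can be tried without a terminal:

```python
def ask_choice(read_key, valid=("1", "2", "3")):
    """Call read_key() until it returns one of the valid keys, then return it."""
    while True:
        key = read_key()
        if key in valid:
            return key
```

In the scripts above it would be called as ask_choice(getkey.getkey), and the three branches on '1', '2' and '3' would then act on the returned value.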

system · 11 December 2021 13:20

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.
