Hey everyone,
I'm working on Idea 3 for GSoC 2026, the auto-categorization plugin.
I have a few quick questions that will directly affect some decisions in my proposal and the effectiveness of the plugin.
Context (why I'm asking about titles)
One part of my proposal involves using note titles to improve how well notes get grouped. The idea is that a good title like "Backpacking trip to Japan 2024" or "Notes on gradient descent" is a really dense summary of what the note is about. So I was thinking of giving the title a bit more weight when calculating what a note means, something like:
final_vector = (note_body × 0.7) + (title × 0.3)
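A minimal Python sketch of this weighting, assuming the note body and the title have already been embedded as vectors of the same length. The name `combine_vectors` and the `title_weight` parameter are illustrative only and not part of any existing plugin code.

```python
def combine_vectors(body_vec, title_vec, title_weight=0.3):
    """Blend a body embedding and a title embedding into one note vector.

    title_weight=0.3 mirrors the 0.7 / 0.3 split in the formula above;
    the body always gets the remaining share (1 - title_weight).
    """
    if len(body_vec) != len(title_vec):
        raise ValueError("body and title vectors must have the same length")
    if not 0.0 <= title_weight <= 1.0:
        raise ValueError("title_weight must be between 0 and 1")
    body_weight = 1.0 - title_weight
    return [body_weight * b + title_weight * t
            for b, t in zip(body_vec, title_vec)]
```

Setting `title_weight=0.0` falls back to the body vector alone, which is the behaviour you would want for notes whose title carries no information.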
But this only works if titles are actually descriptive. If your title is "Untitled" or just today's date, giving it 30% weight would hurt the clustering by pulling the note's meaning in the wrong direction. So I need to know:
How do you usually write your note titles?
- Clear and descriptive (the title tells you what the note is about)
- Vague or short (just a quick label to remember it later)
- Dates, "Untitled", or left blank
- Mixed (some notes have good titles, others don't)
Context (why I'm asking about collection size)
For clustering the notes I am using K-means. Its one disadvantage is that it needs to be told in advance how many clusters (k) to create. To solve this I am using the Silhouette Score: the plugin runs K-means with several different values of k (the number of groups) and automatically picks the one that scores best.
The more notes you have, the more values of k it needs to test. For 100 notes it tests around 8 options, for 2000 notes around 44. Knowing your collection size will help me figure out how long this will take for most people.
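The selection loop can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the helper names are made up, the K-means seeding is a simple deterministic farthest-first rule rather than whatever the plugin ends up using, and the list of candidate k values is passed in by the caller because the exact rule for choosing candidates is not specified here.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Deterministic seeding (an assumption): start at points[0], then keep
    adding the point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return [list(c) for c in centers]

def kmeans(points, k, max_iters=100):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = farthest_first(points, k)
    labels = []
    for _ in range(max_iters):
        labels = [min(range(k), key=lambda c: dist(p, centers[c])) for p in points]
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return labels

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points (between -1 and 1)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return -1.0  # undefined for a single cluster; treat as worst case
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the nearest other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def pick_k(points, candidate_ks):
    """Run K-means once per candidate k and return the best-scoring k."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in candidate_ks}
    best_k = max(scores, key=scores.get)
    return best_k, scores
```

Calling `pick_k(vectors, range(2, 6))` would try k = 2, 3, 4 and 5 and return the winner together with all scores. Each candidate costs one full K-means run plus a silhouette pass that compares every pair of notes, which is why the number of candidates matters for run time.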
How many notes do you have (roughly)?
- Fewer than 100
- 100 to 500
- 500 to 2000
- More than 2000
Even if you skipped the context, a quick vote will still help.
I'm not sure you'll get useful data from a poll, although it's good that you're considering the fact that the note title, or any other field, can be unreliable. Maybe don't make assumptions about the quality of the data, as that could vary massively from one user to the next. One user could import thousands of notes without any proper titles, and another could do the same but all with good titles.
So I think you should write your proposal with the assumption that it can go either way.
Thank you for your input. I will consider both cases and change how I calculate the final_vector:
If the note's title is clear and descriptive, I will give the title some weight (based on how descriptive it is) when calculating the final_vector.
If the note's title is generic, I will only consider the note_body.
Perhaps some pre-processing could help too? For example, exclude all notes titled "Untitled" (since that's common for Evernote notes), or those with empty titles.
Yes, a pre-processing step will help a lot to clean out the useless titles. I will remove anything that is clearly generic ("Untitled", empty titles, "Note 1" type patterns) before applying any weighting.
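A rough sketch of such a filter in Python. The pattern list is an assumption for illustration: it covers the cases named above ("Untitled", empty titles, "Note 1" style titles) plus date-only titles, which came up earlier in the thread. A real collection would likely need more patterns, and `is_generic_title` is a hypothetical helper name.

```python
import re

# Illustrative patterns only; extend after looking at real note collections.
GENERIC_PATTERNS = [
    re.compile(r"^untitled( note)?( \d+)?$", re.IGNORECASE),  # "Untitled", "Untitled 2"
    re.compile(r"^(new )?note ?\d*$", re.IGNORECASE),          # "Note 1", "New note"
    re.compile(r"^\d{4}[-/.]\d{1,2}[-/.]\d{1,2}$"),            # 2024-05-31
    re.compile(r"^\d{1,2}[-/.]\d{1,2}[-/.]\d{2,4}$"),          # 31/05/2024
]

def is_generic_title(title):
    """Return True when the title carries no usable information."""
    cleaned = (title or "").strip()
    if not cleaned:
        return True  # empty, whitespace-only, or missing title
    return any(pattern.match(cleaned) for pattern in GENERIC_PATTERNS)
```

Notes for which `is_generic_title` returns True would simply skip the title weighting and use the body vector alone.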
After filtering, I thought about how to decide how much weight to give the remaining titles. It can be done in 2 ways:
- Word count: give more weight to longer titles and less to shorter ones. A 6-word title gets 0.3, and the weight decreases from there as titles get shorter. This is simple and fast, but it has an issue: if a title is long but wrong (unrelated to the note body), it will distort the clusters formed.
- Cosine similarity: embed the title separately and compare it to the body_avg_vector. If they're talking about the same thing, similarity will be high and I give the title more weight. If they're mismatched, similarity drops and the weight reduces automatically. This is the better approach, but it adds one extra embedding call per note (about 60 more seconds for 2000 notes, only once at the start).
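Both options can be sketched in a few lines of Python. The 0.3 cap and the 6-word threshold come from the description above; the linear scaling in both functions, the clamping of negative similarity to zero, and all function names are assumptions made for illustration.

```python
import math

MAX_TITLE_WEIGHT = 0.3  # cap taken from the 0.7 / 0.3 split discussed earlier

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction; treat as unrelated
    return dot / (norm_a * norm_b)

def weight_from_word_count(title, full_at=6):
    """Option 1: weight grows linearly with word count up to full_at words."""
    words = len(title.split())
    return MAX_TITLE_WEIGHT * min(words, full_at) / full_at

def weight_from_similarity(title_vec, body_avg_vec):
    """Option 2: weight scales with how well the title agrees with the body.
    Negative similarity is clamped to 0 so a mismatched title gets no weight."""
    similarity = max(0.0, cosine_similarity(title_vec, body_avg_vec))
    return MAX_TITLE_WEIGHT * similarity
```

With the second option a long but unrelated title ends up with a weight near zero, which is exactly the failure case the word-count rule cannot detect.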
How this will help:
For short notes the body average is already a great metric, so even without the title it will give good results. But for longer notes, where the body covers multiple topics and the average vector gets blurry (less useful because it mixes different topics), a good title pulls the final vector toward the actual main topic (when using the cosine similarity approach).
Analysis:
On a 2000-note collection, probably 300–400 of the longer notes would benefit noticeably. The extra time is around 60 seconds on the first run, and after that everything is cached.
How should I proceed?