Yes, a pre-processing step will help a lot here. I will filter out anything that is clearly generic ("Untitled", empty titles, "Note 1"-style patterns) before applying any weighting.
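As a rough sketch of that filter (the exact pattern list is an assumption; "New note" is my own addition and the regex would grow as more generic patterns show up):

```python
import re

# Hypothetical generic-title filter: matches empty titles, "Untitled",
# "Note 1"-style numbered defaults, and "New note". Case-insensitive.
GENERIC_TITLE = re.compile(r"^\s*(untitled|note\s*\d+|new note)?\s*$",
                           re.IGNORECASE)

def is_useful_title(title: str) -> bool:
    # Keep the title only if it does NOT match a generic pattern.
    return not GENERIC_TITLE.match(title)
```

Anything that fails this check would simply get a title weight of zero, so the note falls back to its body vector alone.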
After filtering, there are two ways to decide how much weight to give the remaining titles:
- Word count: give more weight to longer titles and less to shorter ones, e.g. a 6-word title gets 0.3, decreasing from there. This is simple and fast, but if a title is long yet unrelated to the note body, it still gets a high weight and distorts the cluster it lands in.
- Cosine similarity: embed the title separately and compare it to the body_avg_vector. If they're talking about the same thing, similarity is high and the title gets more weight; if they're mismatched, similarity drops and the weight shrinks automatically. This is the better approach, but it adds one extra embedding call per note (roughly 60 extra seconds for 2000 notes, paid only once at the start).
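Both weighting schemes can be sketched in a few lines. The 6-words-gets-0.3 scaling and the 0.3 cap are the numbers from above; clamping negative similarity to zero is my own assumption so a badly mismatched title never pushes the vector the wrong way:

```python
import numpy as np

def word_count_weight(title: str, max_weight: float = 0.3) -> float:
    # Heuristic: weight grows with title length, reaching max_weight
    # at 6 words (the scaling point mentioned above) and capping there.
    words = len(title.split())
    return min(words / 6.0, 1.0) * max_weight

def cosine_weight(title_vec: np.ndarray, body_avg_vec: np.ndarray,
                  max_weight: float = 0.3) -> float:
    # Weight the title by how well it matches the body: cosine similarity
    # is clamped to [0, 1] and scaled by max_weight, so a mismatched
    # title automatically contributes little or nothing.
    sim = float(np.dot(title_vec, body_avg_vec) /
                (np.linalg.norm(title_vec) * np.linalg.norm(body_avg_vec)))
    return max(sim, 0.0) * max_weight
```

A title identical in meaning to the body would get the full 0.3; an orthogonal (unrelated) one gets 0.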
How this will help:
For short notes the body average is already a great metric; even without the title it gives good results. But for longer notes that cover multiple topics, the average vector gets blurry (it smears several subjects together), and a good title pulls the final vector back toward the note's actual main topic, scaled by the cosine similarity.
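One simple way to do that pull, assuming a linear blend (the exact combination rule isn't fixed yet, so this is a sketch):

```python
import numpy as np

def final_note_vector(title_vec: np.ndarray, body_avg_vec: np.ndarray,
                      title_weight: float) -> np.ndarray:
    # Linear blend: a weight of 0 keeps the body average untouched,
    # while a high weight pulls the result toward the title's topic.
    combined = title_weight * title_vec + (1.0 - title_weight) * body_avg_vec
    # Re-normalize so cosine comparisons downstream stay consistent.
    return combined / np.linalg.norm(combined)
```

With `title_weight = 0` (a filtered-out generic title) this degrades gracefully to the plain body vector, so short notes are unaffected.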
Analysis:
On a 2000-note collection, probably 300–400 of the longer notes would benefit noticeably. The extra cost is around 60 seconds on the first run, and after that everything is cached.
How should I proceed?