Including the title and the full heading path is definitely a good improvement.
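To make the heading-path idea concrete, here's a minimal sketch in Python. It is not how any existing plugin does it: the function name `chunks_with_heading_path`, the ` > ` separator and the "one chunk per heading section" rule are all my own assumptions, and it deliberately ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# ATX-style Markdown headings only ("# Title" .. "###### Title").
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

def chunks_with_heading_path(note_title, markdown):
    """Split a note at headings and prefix each chunk with 'title > h1 > h2 ...'.

    Assumption: one chunk per heading section. Lines starting with '#'
    inside fenced code blocks are NOT handled specially in this sketch.
    """
    path = []    # stack of (level, heading text) leading to the current section
    body = []    # lines collected for the current section
    chunks = []

    def flush():
        text = "\n".join(body).strip()
        if text:
            trail = " > ".join([note_title] + [h for _, h in path])
            chunks.append(f"{trail}\n\n{text}")
        body.clear()

    for line in markdown.splitlines():
        m = HEADING_RE.match(line)
        if m:
            flush()
            level = len(m.group(1))
            # Leave any section at the same or a deeper level.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, m.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each returned string is what would be embedded, so a chunk from a deep subsection still carries the note title and every heading above it.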
As for sparse+dense search - optionally having sparse+dense plus a re-ranker will definitely help.
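One common way to combine the two result lists is reciprocal rank fusion (RRF). I'm naming that technique myself, nobody in this thread has committed to a fusion method, so treat this as a sketch under that assumption. The names `hybrid_search` and `rerank` are hypothetical; `rerank` stands in for whatever optional re-ranker is available and simply receives and returns a list of chunk ids.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one list, best first.

    Each list contributes 1 / (k + rank) per id; k=60 is the commonly
    used default for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the order is stable.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))

def hybrid_search(sparse_hits, dense_hits, rerank=None, top_n=5):
    """sparse_hits / dense_hits: chunk ids, best first, from each retriever.

    rerank is optional (hypothetical callable: list of ids -> reordered list),
    so the pipeline still works when no re-ranker model is configured.
    """
    fused = reciprocal_rank_fusion([sparse_hits, dense_hits])
    candidates = fused[: top_n * 4]  # give the re-ranker a wider pool
    if rerank is not None:
        candidates = rerank(candidates)
    return candidates[:top_n]
```

RRF only needs ranks, not scores, which is handy here because keyword scores and embedding similarities are not on the same scale.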
As I already stated:
Each chunk should additionally carry contextual information: how it contributes to the upper levels of the hierarchy within the note, how it contributes to the summary of the whole note, and a short summary of the chunks/links/images it references from the same or other notes. Then these “rich” pieces of text go into the vector DB.
Given that you want to make AI optional, the above should also be optional, because it requires an LLM.
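Here's roughly how I picture the "optional" part, as a sketch only. `enrich_chunk` and its `describe` parameter are names I made up; `describe` stands for any callable that wraps an LLM and returns a short context for the chunk given the whole note. When it is not supplied, the chunk falls back to the heading path alone, so nothing breaks with AI turned off.

```python
from typing import Callable, Optional

def enrich_chunk(
    chunk: str,
    heading_path: list[str],
    note_text: str,
    describe: Optional[Callable[[str, str], str]] = None,
) -> str:
    """Build the text that gets embedded for one chunk.

    describe is a hypothetical hook: (whole note, chunk) -> short context
    written by an LLM. It is optional on purpose.
    """
    parts = [" > ".join(heading_path)]
    if describe is not None:
        context = describe(note_text, chunk).strip()
        if context:
            parts.append(context)
    parts.append(chunk)
    # Skip empty parts (e.g. a note with no title or headings).
    return "\n\n".join(p for p in parts if p)
```

The same hook could later be fed link or image summaries; the point is just that the LLM-dependent step sits behind one optional parameter.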
Summaries of hyperlinked pages injected right into the chunk would help a lot, provided each summary explains how the link contributes to that specific chunk.
Using a vision-capable LLM to describe images, with a short summary of how each image contributes to the chunk, would be even more amazing.
Here’s a very basic idea of the improvement it all brings:
Now, as I understand it, Anthropic’s suggestion is not enough on its own; it should definitely be combined with semantic sectioning, retrieval-stage merging of neighboring chunks based on their original order, etc.
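For the neighboring-chunk merging, here's a minimal sketch of what I mean. It assumes each retrieved hit is a `(note_id, chunk_index, text)` tuple where `chunk_index` is the chunk's position in the original note; that shape, the function name `merge_neighbors` and the `max_gap` parameter are my assumptions, not an existing API.

```python
def merge_neighbors(hits, max_gap=1):
    """Merge retrieved chunks that are adjacent in the same note.

    hits: iterable of (note_id, chunk_index, text).
    Returns a list of (note_id, first_index, last_index, merged_text),
    with the text stitched together in original document order.
    """
    merged = []
    # Deduplicate, then sort by note and position so neighbors end up adjacent.
    for note_id, idx, text in sorted({(n, i, t) for n, i, t in hits}):
        prev = merged[-1] if merged else None
        if prev and prev[0] == note_id and idx - prev[2] <= max_gap:
            merged[-1] = (note_id, prev[1], idx, prev[3] + "\n\n" + text)
        else:
            merged.append((note_id, idx, idx, text))
    return merged
```

Note that the output is ordered by note and position, not by relevance, so if ranking matters downstream each merged passage would need a score carried along (for example the best score among its chunks).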
Regarding semantic sectioning - MD-heading-based sectioning is good, but letting an LLM dissect sections, even grouping multiple sections that articulate the same or adjacent ideas, is superior to just using MD headings or delimiters, especially if the document/note is not structured - as a lot of documents are in the early stages of development.
Here’s another good article from dsRag that explains additional techniques: article.
Now, I know the “Jarvis” plugin already has some standard RAG implemented and working in Joplin - you should definitely check it out - it’s got to have some Joplin-specific or otherwise interesting techniques.
Also subscribe to your “competitor” thread here.