Hey, I'm Yahya (yahya94812 on GitHub).
I built semantic search for Joplin: yahya94812/Semantic-Search on GitHub, a project for searching documents and notes semantically rather than by traditional text matching.
It lets you find relevant notes by meaning rather than exact keywords. Here's how it works:
- Generates embeddings for all notes using all-MiniLM-L6-v2
- Stores them in a vector database
- At search time, embeds the query and retrieves the most similar notes via vector similarity (rough sketch below)
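Minimal sketch of that pipeline (not the exact repo code; it assumes the sentence-transformers package and a plain in-memory index rather than a real vector store):

```python
from sentence_transformers import SentenceTransformer, util

# Same model is used at index time and at query time.
model = SentenceTransformer("all-MiniLM-L6-v2")

notes = {
    "grocery list": "buy milk, eggs and bread",
    "db migration": "steps to migrate the backend database to postgres",
}

# Index time: embed every note once and keep the vectors around.
ids = list(notes)
note_embeddings = model.encode([notes[i] for i in ids], normalize_embeddings=True)

def search(query: str, top_k: int = 3):
    # Query time: embed the query and rank notes by cosine similarity.
    query_embedding = model.encode(query, normalize_embeddings=True)
    scores = util.cos_sim(query_embedding, note_embeddings)[0]
    ranked = sorted(zip(ids, scores.tolist()), key=lambda x: x[1], reverse=True)
    return ranked[:top_k]

print(search("how do I move data to a new database?"))
```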
Would love to hear if there are better approaches worth exploring!
This by itself would be a great improvement if implemented in Joplin's internal search.
For even better results, we should utilize the Markdown structure for text chunking. Each chunk should additionally carry contextual information: how it contributes to the upper heading hierarchy within the note, how it contributes to a summary of the whole note, and a short summary of the chunks/links/images from the same or other notes that it references. These “rich” pieces of text then go into the vector DB.
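In code terms, the enrichment can be as simple as prepending a contextual header before embedding (a hypothetical sketch; the summary strings would come from an LLM or be written by hand, and all the names here are made up):

```python
def build_rich_chunk(chunk_text, note_title, heading_path,
                     note_summary, related_summaries):
    """Prepend context so the embedding captures not just the chunk itself
    but its role in the note and in the material it references."""
    context_lines = [
        f"Note: {note_title}",
        f"Section: {' > '.join(heading_path)}",
        f"Note summary: {note_summary}",
    ]
    # Short summaries of chunks/links/images this chunk references,
    # possibly from other notes.
    context_lines += [f"References: {s}" for s in related_summaries]
    return "\n".join(context_lines) + "\n\n" + chunk_text

rich_text = build_rich_chunk(
    chunk_text="Run the migration script before deploying.",
    note_title="Backend release checklist",
    heading_path=["Deployment", "Database"],
    note_summary="Steps for releasing the backend service.",
    related_summaries=["Link to the 'DB migration' note describing the script."],
)
# rich_text is what gets embedded and stored in the vector DB.
```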
During search we apply sparse+dense retrieval, then a re-ranker LLM to sort and filter by relevance, and finally a regular LLM to decide which subset of results should be shown to the user.
This approach is not as demanding as GraphRAG or other existing architectures, but it has worked amazingly well in my local RAG setup.
I think that for implementing an LLM-based reranker, we would need to use third-party LLMs through API keys. This can certainly be implemented, but I think it would be useful to keep it as an additional or optional functionality.
Embedding-based search is great for running semantic search locally, especially for typical users who may not want to rely on external APIs.
Your idea of implementing Markdown-conscious chunks is great, as it helps maintain the semantic meaning and metadata of blocks.
So I’m looking forward to implementing the experimental Markdown-conscious chunking feature.
Hey @executed — really appreciated your feedback! I've gone ahead and implemented most of what you suggested. Here's what's changed:
Markdown-aware chunking
Instead of blindly splitting by token count, the indexer now parses the heading hierarchy first (H1 → H2 → H3). Each section becomes its own chunk, and only falls back to overlapping token windows if a section is too long. This preserves the semantic meaning of each block.
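Something along these lines (a simplified sketch rather than the exact indexer code; it only handles ATX headings, ignores code fences, and approximates tokens by whitespace words):

```python
import re

MAX_TOKENS = 256  # assumed section-size limit, purely illustrative

def chunk_markdown(text: str):
    """Split a note into (breadcrumb, section_text) pairs by heading hierarchy,
    falling back to overlapping windows for oversized sections."""
    breadcrumb, sections, current = [], [], []
    for line in text.splitlines():
        match = re.match(r"^(#{1,6})\s+(.*)", line)
        if match:
            if current:
                sections.append((" > ".join(breadcrumb), "\n".join(current)))
                current = []
            level, title = len(match.group(1)), match.group(2).strip()
            # Keep only the ancestors above the new heading level.
            breadcrumb = breadcrumb[: level - 1] + [title]
        else:
            current.append(line)
    if current:
        sections.append((" > ".join(breadcrumb), "\n".join(current)))

    chunks = []
    for crumb, body in sections:
        words = body.split()
        if len(words) <= MAX_TOKENS:
            chunks.append((crumb, body))
        else:
            # Oversized section: overlapping token windows (50% overlap).
            step = MAX_TOKENS // 2
            for start in range(0, len(words), step):
                chunks.append((crumb, " ".join(words[start:start + MAX_TOKENS])))
    return chunks
```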
Rich context metadata per chunk
Every chunk now carries:
note_title — the note it belongs to
breadcrumb — full heading path (e.g. Project X > Backend > Database)
notebook — the parent folder
This means search results now tell you exactly where inside a note the match was found, not just which file.
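For example, a stored chunk ends up looking roughly like this (illustrative values only):

```python
chunk = {
    "note_title": "Project X",
    "breadcrumb": "Project X > Backend > Database",
    "notebook": "Work",
    "text": "We decided to use SQLite for the first release...",
}
# A hit can then be shown as "Work / Project X > Backend > Database"
# instead of just the note title.
```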
SQLite instead of pickle
Switched the storage backend from pickle to SQLite — safer, inspectable, and supports incremental re-indexing (unchanged files are skipped on re-runs).
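The incremental part boils down to a content hash per note, so re-runs can skip anything unchanged (a simplified sketch with hypothetical table and column names, not the exact schema):

```python
import hashlib
import sqlite3

conn = sqlite3.connect("index.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS chunks (
    note_id    TEXT,
    note_hash  TEXT,
    breadcrumb TEXT,
    text       TEXT,
    embedding  BLOB
)""")

def needs_reindex(note_id: str, body: str) -> bool:
    """Skip notes whose stored content hash matches the current body."""
    new_hash = hashlib.sha256(body.encode("utf-8")).hexdigest()
    row = conn.execute(
        "SELECT note_hash FROM chunks WHERE note_id = ? LIMIT 1", (note_id,)
    ).fetchone()
    return row is None or row[0] != new_hash
```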
Better ranking
max_similarity is now the primary ranking key instead of avg_similarity. This prevents long, broadly-relevant notes from outranking short, highly-specific ones with a single perfect chunk.
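In other words, chunk scores are aggregated per note and the single best chunk decides the note's rank (a small sketch, assuming per-chunk similarities have already been computed):

```python
from collections import defaultdict

def rank_notes(chunk_hits):
    """chunk_hits: iterable of (note_id, similarity) pairs, one per chunk.
    Rank notes by their best chunk rather than the average, so one highly
    relevant section is enough to surface a note."""
    best = defaultdict(float)
    for note_id, sim in chunk_hits:
        best[note_id] = max(best[note_id], sim)
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

hits = [
    ("specific-note", 0.95), ("specific-note", 0.30), ("specific-note", 0.20),
    ("broad-note", 0.62), ("broad-note", 0.60), ("broad-note", 0.58),
]
# max ranks specific-note (0.95) above broad-note (0.62);
# averaging would have ranked broad-note (0.60) above specific-note (0.48).
print(rank_notes(hits))
```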
On your point about sparse+dense retrieval and LLM reranking — I fully agree those would push quality even further. My plan is to keep the embedding-only pipeline as the local default (no external dependencies), and add BM25 hybrid search + optional LLM reranking as an opt-in for users who have API access.
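For the hybrid part, one low-dependency way it could look (a sketch, not settled code; it assumes the rank_bm25 package for the sparse side and min-max normalizes both score ranges before a weighted sum):

```python
from rank_bm25 import BM25Okapi
import numpy as np

def normalize(scores):
    scores = np.asarray(scores, dtype=float)
    span = scores.max() - scores.min()
    return (scores - scores.min()) / span if span else np.zeros_like(scores)

def hybrid_scores(query, docs, dense_scores, alpha=0.5):
    """Blend BM25 (lexical) and embedding (semantic) scores.
    dense_scores: cosine similarities for `docs`, in the same order.
    alpha: weight of the dense side; 0.5 is an illustrative default."""
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse = bm25.get_scores(query.lower().split())
    return alpha * normalize(dense_scores) + (1 - alpha) * normalize(sparse)
```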
Would love to hear your thoughts on the chunking approach, especially the breadcrumb strategy for deeply nested notes!
Source: yahya94812/Semantic-Search on GitHub
Including the title and full heading path is definitely a good improvement.
Regarding sparse+dense search: having an optional sparse+dense+re-ranker pipeline will definitely help.
As I already stated:
Each chunk should additionally carry contextual information: how it contributes to the upper heading hierarchy within the note, how it contributes to a summary of the whole note, and a short summary of the chunks/links/images from the same or other notes that it references. These “rich” pieces of text then go into the vector DB.
Given that you want to make AI optional, what's stated above should also be optional, because it will require an LLM.
Summaries of hyperlinked page content injected right into the chunk would help a lot, provided the summary explains how that link contributes to this specific chunk.
Utilizing a vision-capable LLM for image descriptions, with a short summary of how the image contributes to the chunk, would be even better.
Anthropic's contextual-retrieval suggestion gives a very basic idea of the improvement all of this brings. That suggestion alone is not enough by any means, per my understanding; it should definitely be combined with semantic sectioning, retrieval-stage merging of neighboring chunks based on their original order, etc.
Regarding semantic sectioning: MD-heading-based sectioning is good, but letting an LLM dissect sections, and even group multiple sections that articulate the same or adjacent ideas, is superior to using only MD headings or delimiters, especially if the document/note is not well structured, as many documents are in the early stages of development.
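To make the retrieval-stage neighbor merging concrete, something like this would do (a trivial sketch; it assumes each retrieved chunk carries its note id and its position in the original note):

```python
def merge_neighbors(hits):
    """hits: list of (note_id, position, text) for retrieved chunks.
    Chunks that sit next to each other in the original note are stitched
    back together so the reader (or an LLM) sees contiguous passages."""
    hits = sorted(hits, key=lambda h: (h[0], h[1]))
    merged = []
    for note_id, pos, text in hits:
        if merged and merged[-1][0] == note_id and pos == merged[-1][1] + 1:
            merged[-1] = (note_id, pos, merged[-1][2] + "\n" + text)
        else:
            merged.append((note_id, pos, text))
    return merged
```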
Here's another good article from dsRAG that explains additional techniques.
I know the “Jarvis” plugin already has some standard RAG implemented and working in Joplin; you should definitely check it out, as it is bound to have some Joplin-specific or otherwise interesting techniques.
Also, subscribe to your “competitor” thread here.
This is a solid baseline, especially for local-first semantic search.
One thing I’ve been focusing on is how to balance retrieval quality with Joplin’s constraints (local-first, no mandatory external APIs).
Instead of relying on LLM-heavy pipelines, I think a strong middle-ground is:
- structure-aware chunking (aligned with Markdown sections rather than fixed tokens),
- lightweight contextual metadata (note title, notebook, section path),
- and hybrid retrieval (lexical + semantic) with a simple re-ranking layer.
This already improves recall significantly while keeping the system efficient and fully local by default.
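For the simple re-ranking layer, even plain reciprocal rank fusion over the lexical and semantic result lists works with no model at all (a sketch; k=60 is the commonly used constant):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """rankings: e.g. [lexical_ids, semantic_ids], each a list of doc ids
    ordered best-first. Returns ids ordered by fused score."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

print(reciprocal_rank_fusion([["a", "b", "c"], ["c", "a", "d"]]))
```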
More advanced approaches like LLM-based reranking or contextual summaries seem useful, but I’d keep them optional layers rather than part of the core search pipeline.