[Proposal Discussion] AI Chat with Joplin Notes using PageIndex (Alternative to Vector RAG)

GSoC 2026 Proposal Draft – Idea 4: Chat with your note collection using AI – Krish Jain


Links


1. Introduction

I'm Krish Jain, a third-year B.Tech Information Technology student from Mumbai. I've been building with AI for about a year, not just experimenting but shipping things people actually use. One of those is Desk AI, an AI assistant with a RAG-based company policy chatbot currently used by 5+ companies. That project taught me a lot about what works in production RAG systems and, more importantly, what doesn't.

I've been using Joplin for my own notes for a while. The idea of being able to ask questions across a large personal knowledge base and get answers grounded in your actual notes is something I find genuinely useful. That's what drew me to this project.


2. Project Summary

The Problem

Joplin users who maintain large note collections currently have no good way to query that knowledge. Keyword search breaks down quickly. If you want to ask "what were the key decisions from my last project planning session?" there's no way to do that today.

What I Want to Build

A chat interface over your Joplin note collection. You ask a question, the system finds the relevant notes and sections, and the LLM generates an answer grounded in your actual content. You can follow up and refine, much like ChatGPT, except the answers come from your own knowledge base.

Expected Outcome

A working Joplin plugin with a chat UI that enables users to query their note collection conversationally. The retrieval system will be benchmarked against a keyword baseline so there's a measurable quality signal.

Out of Scope

Real-time note syncing during chat, voice input, and multi-user knowledge bases are out of scope. The focus is on getting core retrieval and chat working well.


3. Technical Approach

Why Not Standard Vector RAG?

The obvious approach is chunk → embed → vector search. I've built this before and it works, but it has real issues for a note-taking context:

  • Chunking breaks note structure. A chunk has no idea it belongs under "# Project Alpha > ## Action Items"

  • Similarity search finds related content, not necessarily the right content

  • Requires an embedding model in-process — if multiple plugins each load their own model, memory footprint adds up fast

  • No clear citation trail back to the specific note and section

The Approach: Structured Retrieval with Semantic Enrichment

I want to build a tree index from note structure and use BM25 over LLM-generated summaries for retrieval, inspired by the PageIndex framework (https://github.com/VectifyAI/PageIndex), adapted for Joplin's needs.

Concrete example: user has 2000 notes, asks "What was my action plan for the Joplin proposal?"

Phase 1 — Index Building (background, once per note)

  • Each note is parsed into a tree based on heading structure

  • For flat notes without headings, content is split into paragraph-sized blocks, each becoming a node

  • Weak or vague headings (e.g. "Monday", "Meeting") are strengthened — the LLM derives a more descriptive title from the node's content, so retrieval doesn't rely on poor heading quality

  • Each node gets an LLM-generated summary (1-2 sentences) — runs once, re-runs only when updated_time changes

  • Tree + summaries stored in Joplin userData, persisted across sessions (a sketch of the tree builder follows this list)
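
A minimal sketch of the tree-building step, assuming a simple node shape; the names and ID scheme are illustrative, not the final plugin API:

```typescript
// Illustrative node shape and builder; names and the ID scheme are assumptions, not final API.
interface IndexNode {
  id: string;            // e.g. `${noteId}:${headingText}` (placeholder scheme)
  noteId: string;
  title: string;         // heading text, or an LLM-strengthened title for weak headings
  summary?: string;      // filled in later by the summarisation pass
  level: number;         // 0 = note root, 1 = "#", 2 = "##", ...
  content: string;       // text sitting directly under this heading
  children: IndexNode[];
}

// Build a tree for one note by walking its markdown headings.
function buildTree(noteId: string, noteTitle: string, body: string): IndexNode {
  const root: IndexNode = { id: noteId, noteId, title: noteTitle, level: 0, content: '', children: [] };
  const stack: IndexNode[] = [root];

  for (const line of body.split('\n')) {
    const m = line.match(/^(#{1,6})\s+(.*)$/);
    if (m) {
      const level = m[1].length;
      const node: IndexNode = { id: `${noteId}:${m[2]}`, noteId, title: m[2], level, content: '', children: [] };
      // Pop back up until the top of the stack is a shallower heading, then attach.
      while (stack.length > 1 && stack[stack.length - 1].level >= level) stack.pop();
      stack[stack.length - 1].children.push(node);
      stack.push(node);
    } else {
      stack[stack.length - 1].content += line + '\n';
    }
  }
  return root;
}
```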

Phase 2 — Query Time (3 LLM calls)

  • BM25 over note summaries narrows 2000 notes to ~5 candidates. Because summaries are natural language, "Joplin proposal" matches a note even if the title just says "GSoC 2026"

  • LLM call 1: model sees 5 note summaries → picks the right note

  • LLM call 2: model sees ~8 section summaries → picks "Action Plan"

  • LLM call 3: full section text + question → answer with citation (see the pipeline sketch after this list)
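
Sketched as code, the query path could look roughly like this; `bm25Rank`, `selectWithLlm`, and `answerWithLlm` are hypothetical placeholders for the BM25 scorer and the selection/answer prompts:

```typescript
// Hypothetical shapes; the real plugin types would differ.
interface SectionEntry { id: string; summary: string; fullText: string; }
interface NoteCandidate { id: string; summary: string; sections: SectionEntry[]; }
interface Answer { text: string; citation: { noteId: string; sectionId: string }; }

// Placeholders for the BM25 scorer, the two selection prompts, and the answer prompt.
declare function bm25Rank(query: string, notes: NoteCandidate[]): NoteCandidate[];
declare function selectWithLlm<T extends { summary: string }>(question: string, items: T[]): Promise<T>;
declare function answerWithLlm(question: string, context: string, cite: Answer['citation']): Promise<Answer>;

async function answerQuestion(question: string, notes: NoteCandidate[]): Promise<Answer> {
  // BM25 over note summaries narrows ~2000 notes to ~5 candidates.
  const candidates = bm25Rank(question, notes).slice(0, 5);
  // LLM call 1: pick the right note from the candidate summaries.
  const note = await selectWithLlm(question, candidates);
  // LLM call 2: pick the right section from that note's section summaries.
  const section = await selectWithLlm(question, note.sections);
  // LLM call 3: answer from the full section text, with a citation back to note + section.
  return answerWithLlm(question, section.fullText, { noteId: note.id, sectionId: section.id });
}
```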

Optimisation: Contextual Neighbourhood Node Caching

After answering about the Joplin proposal, the system caches not just the retrieved node but also its parent, siblings, and children. If the user follows up with a related question, the full search pipeline is skipped and the follow-up resolves in a single LLM call. Conversations naturally stay in one topic area, so the cache hit rate is high in practice.
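
A rough sketch of that cache, under the assumption that a cheap check (or a single LLM call) decides whether the follow-up can be answered from the cached neighbourhood:

```typescript
// Illustrative neighbourhood cache; node shape and the lookup heuristic are assumptions.
interface CachedNode { id: string; title: string; summary: string; fullText: string; }

class NeighbourhoodCache {
  private nodes: CachedNode[] = [];

  // After answering, remember the hit node plus its parent, siblings, and children.
  store(hit: CachedNode, parent: CachedNode | null, siblings: CachedNode[], children: CachedNode[]) {
    this.nodes = [hit, ...(parent ? [parent] : []), ...siblings, ...children];
  }

  // For a follow-up, try the cached neighbourhood first; fall back to the full pipeline on a miss.
  lookup(question: string): CachedNode | undefined {
    const terms = question.toLowerCase().split(/\W+/).filter(Boolean);
    return this.nodes.find(n =>
      terms.some(t => n.title.toLowerCase().includes(t) || n.summary.toLowerCase().includes(t)));
  }
}
```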

Handling Flat Notes

For notes that do not contain headings, the content is automatically divided into paragraph-level blocks. Each block is treated as an individual node and enhanced with an LLM-generated summary.

This approach effectively creates meaningful structure from otherwise unorganised content, making notes easier to navigate, search, and understand even when they were originally written in a flat format.
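
A possible fallback splitter, reusing the IndexNode shape sketched earlier; splitting on blank lines is an assumption, and the real implementation may also merge very short paragraphs:

```typescript
// Split a heading-less note into paragraph-level nodes; titles are placeholders until the
// summarisation pass derives stronger ones from the content.
function splitFlatNote(noteId: string, body: string): IndexNode[] {
  return body
    .split(/\n\s*\n/)                  // blank lines delimit paragraphs
    .map(p => p.trim())
    .filter(p => p.length > 0)
    .map((content, i) => ({
      id: `${noteId}:p${i}`,
      noteId,
      title: `Paragraph ${i + 1}`,     // placeholder; replaced by an LLM-derived title
      level: 1,
      content,
      children: [],
    }));
}
```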

LLM Provider Strategy

  • Local: Ollama (no API key, fully private), in line with Joplin's offline-first approach

  • Cloud: OpenAI-compatible API endpoint — user provides their own key (a minimal provider sketch follows below)
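
A minimal sketch of the provider abstraction; the endpoints and payloads shown are the common defaults for Ollama and OpenAI-compatible servers, and the model names are placeholders:

```typescript
// Both providers expose the same tiny interface to the rest of the plugin.
interface LlmProvider { complete(prompt: string): Promise<string>; }

class OllamaProvider implements LlmProvider {
  constructor(private model = 'llama3', private base = 'http://localhost:11434') {}
  async complete(prompt: string): Promise<string> {
    const res = await fetch(`${this.base}/api/generate`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: this.model, prompt, stream: false }),
    });
    return (await res.json()).response;
  }
}

class OpenAiCompatibleProvider implements LlmProvider {
  constructor(private apiKey: string, private model = 'gpt-4o-mini', private base = 'https://api.openai.com') {}
  async complete(prompt: string): Promise<string> {
    const res = await fetch(`${this.base}/v1/chat/completions`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${this.apiKey}` },
      body: JSON.stringify({ model: this.model, messages: [{ role: 'user', content: prompt }] }),
    });
    return (await res.json()).choices[0].message.content;
  }
}
```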

Potential Challenges

  • Very large collections: lazy indexing (index notes as opened, not all at once) keeps initial setup manageable

  • Flat/low-quality notes: paragraph fallback handles structure, but summary quality depends on note quality

  • Latency: 3 LLM calls at query time; contextual cache reduces follow-ups to 1 call

Testing Strategy

  • Unit tests for tree builder and heading parser

  • Integration tests with fixed note collection and known Q&A pairs

  • Benchmark: retrieval accuracy vs a BM25-only baseline (a minimal harness sketch follows after this list)

  • Manual testing with real Joplin note collections of varying sizes
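
The benchmark itself can stay very small: a fixed set of question/expected-note pairs and a hit-rate comparison. The retrieval helpers in the usage comment are hypothetical:

```typescript
// Hypothetical benchmark harness: compare retrieval hit-rate of the full pipeline
// against a BM25-only baseline on the same fixed Q&A set.
interface QaCase { question: string; expectedNoteId: string; }

async function hitRate(cases: QaCase[], retrieve: (q: string) => Promise<string>): Promise<number> {
  let hits = 0;
  for (const c of cases) {
    if ((await retrieve(c.question)) === c.expectedNoteId) hits++;
  }
  return hits / cases.length;
}

// Usage sketch (retrieveWithPageIndex / retrieveWithBm25Only are placeholders):
// const pipeline = await hitRate(cases, q => retrieveWithPageIndex(q));
// const baseline = await hitRate(cases, q => retrieveWithBm25Only(q));
```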


4. Implementation Plan

Community Bonding (before Week 1)

  • Deep dive into Joplin plugin API and data API

  • Set up local Joplin dev environment

  • Finalise storage approach with mentors

| Period | Tasks |
| --- | --- |
| Week 1-2 | Tree builder: markdown heading parser, paragraph fallback for flat notes. Unit tests. Works standalone. |
| Week 3-4 | Summarisation pipeline: per-node summaries, batched processing, incremental updates via updated_time. Ollama + OpenAI API support. |
| Week 5-6 | BM25 pre-filter over summaries, note-level + section-level LLM selection. End-to-end retrieval working in isolation. |
| Week 7-8 | Joplin plugin scaffolding, connect to data API, store index in userData, background indexing. |
| Week 9-10 | React chat UI in plugin panel, conversation history, cited answers (note + section), follow-up questions. |
| Week 11 | Contextual neighbourhood cache, performance testing, benchmark vs BM25 baseline. |
| Week 12 | Bug fixes, edge cases, polish. |
| Week 13 | Documentation (user + developer), final testing, submission. |

5. Deliverables

Required:

  • Working Joplin plugin with chat interface over note collection

  • Tree-based index builder with semantic summarisation

  • BM25 + LLM retrieval pipeline with contextual caching

  • Support for local (Ollama) and cloud LLM providers

  • Retrieval benchmark vs keyword baseline

  • Unit and integration test suite

  • User and developer documentation

Nice to have:

  • Auto-tagging notes using the same summary index

  • Related notes suggestions

  • Streaming responses in chat UI


6. Availability

  • Availability: 40-50 hours per week

  • Time zone: IST (UTC+5:30)

  • No major conflicts during GSoC period. Minor academic commitments will be communicated to mentors in advance.

I've already started a TypeScript proof of concept at https://github.com/KrrishJain/pageindex-js demonstrating the tree builder, summarisation pipeline, and retrieval working end-to-end on markdown files and PDFs. This gives me confidence the approach is implementable within the GSoC timeline.

Hey @Krishh, on the PageIndex approach — how does the system know which note or section to navigate to for a given query? What maps the user's question to a node in the tree?

How does it handle notes without headings, where the structure is flat?

The proposal is missing several required sections - weekly milestones, testing strategy, availability, and the GSoC idea link. Check the submission template before finalising.

Hi,

Thanks for the detailed feedback.
Regarding how the system maps a query to a node:
Sorry, I didn’t explain this clearly earlier. I’ve updated the previous message with more details on the approach — it would be helpful if you could take another look. In short, the idea is to use node-level summaries and let the model reason over them to select the most relevant section, rather than relying on similarity scores.

For notes without headings (flat structure):
In such cases, the system falls back to splitting the content into smaller segments (like paragraphs) and treating them as nodes. During the summarisation phase, each of these nodes gets an LLM-generated summary, and we can also derive stronger semantic titles from the content itself. This effectively creates a structured, semantic tree even when the original note has no headings.

About the missing proposal sections:
This isn’t a final proposal yet — I was mainly trying to discuss and validate the approach. I believe the category was updated to “GSoC Proposal Draft” by @lauren, which is why it is showing as a proposal draft. I’ll make sure to add the required sections once I move towards a proper draft.

I missed the edge case where there are no headings due to my lack of experience, but after you pointed it out, I was able to figure out a way to handle it. If you notice any other edge cases or improvements, please do let me know. I’m really looking forward to working with you.

Thanks again for the guidance.

When a user asks a question, how many LLM calls does PageIndex need to navigate the tree and produce an answer? Is this approach based on a published method, or your own design?

Hi,

Thanks for the question.

On the number of LLM calls:
In the current implementation, it typically involves two main calls:

  1. One for selecting the most relevant node (reasoning over summaries)

  2. One for generating the final answer from the selected content

If we move towards a guided traversal approach, this can increase slightly (one call per level), but it can be controlled by limiting depth or using top-k selection.


On the approach:

The idea is inspired by PageIndex (https://github.com/VectifyAI/PageIndex), a vectorless, reasoning-based RAG framework introduced in September 2025 by Mingtian Zhang, Yu Tang, and the PageIndex team. It has gained significant attention in the tech community for proposing an alternative to similarity-based retrieval.

Some key advantages of this direction:

  • Preserves document structure instead of flattening into chunks

  • More explainable (clear path of which section was selected)

  • Closer to how users naturally navigate notes

At the same time, there are some limitations:

  • Higher cost due to multiple LLM calls during traversal

  • Can be slower compared to vector-based retrieval

However, since Joplin follows an offline-first and local-data approach, this trade-off may be acceptable in this context.

In my case, I’m not using PageIndex as-is. I’m adapting the idea to better fit Joplin’s use case by combining structure with semantic summaries and exploring more controlled traversal strategies.

I’d like your opinion on this — do you think it makes sense to continue with this direction in the proposal, or would you recommend focusing more on improving a vector-based pipeline?

Thanks.

The PageIndex reference is interesting. The trade-off between preserving note structure and the extra latency per query is a real one. That's your call; both directions are viable. What matters is showing that retrieval quality is measurably better than a basic similarity search, whichever approach you take.

On scale: your post describes Notebook → Note → Sections → Paragraphs. For a user with a few thousand notes, what does the model see at each level? How many summaries fit in one selection call?

Hi,

Thanks for the feedback — that helped me think more clearly about both scale and latency.

On scale:
The system doesn’t pass all notes to the LLM. The hierarchy itself limits what the model sees at each step:

  • Level 1: Notebook selection (~10–20 items)

  • Level 2: Notes within that notebook (~50–200)

  • Level 3: Sections within a note (~5–20)

  • Final step: Only the selected section’s content

So even with thousands of notes, each LLM call operates on a small, focused subset.


On latency:
There are typically 2–3 LLM calls per query (note selection → section selection → answer). To reduce repeated cost, I’m using a contextual neighbourhood cache.

Example:
If a user asks: “What is my Jenkins proposal about?”
→ the system retrieves the “Jenkins Proposal” node and caches:

  • its parent (“GSoC Proposals”)

  • its siblings (other proposals)

  • its children (timeline, deliverables)

If the next query is: “What is my Joplin proposal timeline?”
→ the system can resolve it directly from the cached neighbourhood without re-running the full retrieval pipeline.


On pre-filtering:
I agree that raw keyword search alone isn’t sufficient. For example:
“CI/CD tool proposal” may not match “Jenkins Proposal”.

To handle this, I’m exploring:

  • searching over LLM-generated summaries instead of raw text

  • or a lightweight semantic layer (embeddings only on summaries)

This keeps the system efficient while handling semantic gaps.


Overall, the idea is to keep the hierarchy doing most of the filtering work, minimise LLM calls per query, and use caching + better pre-filtering to keep latency practical.

Hi @shikuz,

I’ve drafted my final proposal using the template provided by @laurent. I also have a few additional ideas related to the project that I haven’t included in the proposal yet would it be okay to discuss them with you?