[GSoC 2026 Draft] Chat with your note collection using AI (Local-First RAG)

Hi everyone,

I recently introduced myself in the main welcome thread, and I'm excited to say my first accessibility PR (#14617) was officially merged today!

Now, I’ve been diving deep into the ideas.md document and decided to put together a full technical architecture draft for Project 4: Chat with your note collection using AI.

Because privacy and offline capability are part of Joplin's DNA, I wanted to avoid the standard "just send all the notes to a cloud API" approach. Instead, I've drafted a Local-First, Privacy-Preserving RAG Architecture.

The design focuses on running document chunking, vector embedding (via LanceDB), and inference (via Transformers.js/llama.cpp) entirely locally on the user's machine by default. It also includes an opt-in BYOK (Bring Your Own Key) cloud fallback strictly for older hardware that cannot sustain local inference.

Before I formalize the final application, I would love to get some early eyes on this from the mentors or community to see if this architectural direction aligns with the core team's vision.

You can review my technical draft here: Joplin_GSoC_Proposal.md · GitHub

Any critical feedback, especially regarding the IPC constraints or the choice of LanceDB over sqlite-vss, would be incredibly appreciated!

Best,
Chimuanya

Hello, we've recently updated the template for GSoC draft proposals. Please update your post as described here:


Hi @laurent,

Thanks for the heads up! It looks like the system won't let me edit my original post at the top of the thread anymore. I've also completely revamped the architecture to a 100 percent offline model to eliminate the need for any cloud API keys and ensure strict privacy.


1. Introduction

  • Background / studies: Computer Science student at Benson Idahosa University (Nigeria) with a strong focus on software engineering, modern web architectures, and the React ecosystem.

  • Programming experience: Solid experience with JavaScript, Node.js, React Native, React.js, CSS, and API integrations.

  • Experience with open source: Active contributor to the Joplin desktop application. I have merged UI fixes in the frontend stack and am currently addressing state management bugs within the joplin.views.panels API, which gives me direct experience with the infrastructure this project requires.

2. Project Summary

  • What problem it solves: Currently, users cannot query their notes using natural language without sending personal data to a third-party cloud API (like OpenAI), which inherently violates Joplin's privacy-first, offline-capable philosophy.

  • What will be implemented: A 100% offline, Local-First Retrieval-Augmented Generation (RAG) architecture. It uses local embedding models, a local vector database, and a hardware-aware "Silent Switch" inference engine to generate answers entirely on-device.

  • Expected outcome: A seamless, highly responsive chat panel where users can ask questions and receive synthesized answers backed by citations to their local notes, with zero configuration and zero data leaving their machine.

3. Technical Approach

  • Architecture or components involved: The system relies on a Background Document Processor for chunking Markdown, a Local Vector Store for embeddings, a Hardware-Aware Inference Engine for generation, and a React UI Chat Panel.

  • Changes to the Joplin codebase: Implementation will be structured as a Joplin Plugin utilizing the joplin.views.panels API, with heavy asynchronous IPC (Inter-Process Communication) to ensure LLM inference never blocks the main Electron UI thread.

  • Libraries or technologies you plan to use: LanceDB (or SQLite VSS) for local serverless vector storage, and Transformers.js (or WASM-ported llama.cpp) for on-device embedding and LLM inference entirely within the Node.js environment.

  • Potential challenges: Running LLMs locally risks leaving users on older hardware behind. Instead of relying on external cloud API keys (BYOK), which introduce UX friction and privacy risks, I will implement a Two-Tier Silent Switch: on initialization, the system profiles the hardware (RAM/GPU); high-end devices silently load a capable quantized model (e.g., Llama-3-8B), while older devices gracefully degrade to a highly optimized, sub-2-billion-parameter micro-model (e.g., Qwen-1.5B/TinyLlama), ensuring 100% offline functionality for all users. A sketch of the tier probe follows this list.
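
To make the Silent Switch concrete, here is a minimal sketch of the tier probe I have in mind. `selectModelTier` and the RAM threshold are my own illustrative assumptions, not settled values:

```ts
// Sketch of the Silent Switch hardware probe (illustrative names/thresholds).
import * as os from 'os';

type ModelTier = 'tier1-8b' | 'tier2-micro';

// Assumption: a 4-bit 8B model plus KV cache needs generous headroom, so
// require at least 12 GB of total RAM; the real cutoff would be tuned.
const TIER1_MIN_TOTAL_RAM = 12 * 1024 ** 3;

export function selectModelTier(webGpuAvailable: boolean): ModelTier {
	const enoughRam = os.totalmem() >= TIER1_MIN_TOTAL_RAM;
	// Tier 1 needs both a WebGPU-capable device and enough RAM; everything
	// else degrades silently to the sub-2B micro-model.
	return webGpuAvailable && enoughRam ? 'tier1-8b' : 'tier2-micro';
}

// The WebGPU check itself would run in the panel's webview, roughly:
//   const ok = !!navigator.gpu && !!(await navigator.gpu.requestAdapter());
// and be reported back to the plugin process.
```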

4. Implementation Plan

  • Weeks 1–2: Analyze Markdown structures to build intelligent document processing/chunking logic and prototype local embedding generation (a chunking sketch appears after this plan).

  • Weeks 3–5: Integrate the local Vector Store (LanceDB) and implement the background indexing workflow to populate embeddings seamlessly.

  • Weeks 6–7: Integrate the Local Inference Engine and build the hardware profiling logic to smoothly switch between Tier 1 (GPU) and Tier 2 (CPU/Micro) models without user intervention.

  • Weeks 8–9: Design and implement the React UI chat panel using joplin.views.panels, establishing asynchronous IPC messaging between the UI and the generation engine.

  • Weeks 10–12: Optimize memory footprint for low-end devices, conduct extensive integration testing on massive note collections, and finalize user/developer documentation.
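
As referenced in Weeks 1–2 above, here is a rough sketch of the heading-aware chunking direction. The function and types are hypothetical, and the real version would also respect code blocks, tables, and a per-chunk token budget:

```ts
// Minimal heading-aware chunker sketch: split on headings so each chunk
// keeps its section context, which should help retrieval precision.
interface Chunk {
	heading: string;
	text: string;
}

function chunkMarkdown(markdown: string): Chunk[] {
	const chunks: Chunk[] = [];
	let heading = '';
	let buffer: string[] = [];

	const flush = () => {
		const text = buffer.join('\n').trim();
		if (text) chunks.push({ heading, text });
		buffer = [];
	};

	for (const line of markdown.split('\n')) {
		if (/^#{1,6}\s/.test(line)) {
			flush(); // New section: emit the accumulated chunk first.
			heading = line.replace(/^#{1,6}\s*/, '');
		} else {
			buffer.push(line);
		}
	}
	flush();
	return chunks;
}
```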

5. Deliverables

  • Implemented features: A background note indexer and a 100% offline, hardware-aware AI chat panel integrated into the desktop application.

  • Tests: A robust suite of unit and integration tests covering Markdown chunking, retrieval accuracy, and graceful model degradation on restricted hardware.

  • Documentation: Clear developer guides on the IPC/RAG architecture and user-facing documentation highlighting the zero-config privacy features.

6. Availability

  • Weekly availability during GSoC: 40 hours per week.

  • Time zone: West Africa Time (WAT) / UTC+1 (Lagos, Nigeria).

  • Any other commitments during the programme: None. I have no summer classes or other employment; GSoC with Joplin will be my sole full-time focus.

Hey @chimzyfire-ship-it, on the LanceDB question you raised - worth checking whether it has a WASM build that works in an Electron plugin. The AI summarisation plugin by @HahaBill is a good reference for how WASM loading was handled there.

What embedding model are you planning to use, and how would it be downloaded and stored?

For the 8B model tier - roughly how large is the quantised model file, and how would a user install it?

Also, the Technical Approach section is missing a testing strategy - how would you validate that retrieval is actually working well?

Hi @shikuz,

Thank you so much for the detailed review! These are excellent questions, especially regarding the Electron sandbox constraints. Here is a breakdown of how I plan to handle each of those areas:

1. LanceDB, WASM, and the Electron Plugin Environment

You are completely right to call this out. Relying on native Node/Rust binaries inside Joplin's sandboxed plugin architecture is a recipe for cross-platform packaging issues. I just reviewed @HahaBill's summarisation plugin and saw how they ingeniously used Webpack's CopyPlugin to bundle the WebWorker/WASM files directly into the plugin's installationDir/dataDir to bypass Electron's strict sandbox restrictions.

I plan to replicate this exact Webpack bundling strategy for LanceDB's WASM bindings; a config sketch follows the fallback note below.

• Fallback Plan: If LanceDB's WASM build still proves brittle even with the Webpack workaround, my architectural fallback is to pivot the vector store to Orama (which is pure TS/WASM and optimized for the edge) or a WASM-compiled SQLite with vector extensions. This guarantees zero native-compilation headaches for end users.
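
For reference, the CopyPlugin approach would look roughly like this. The WASM package path is a placeholder, and the rest mirrors a standard Joplin plugin Webpack setup:

```ts
// Webpack config fragment (sketch): copy the WASM binary into the plugin
// bundle so it loads from the plugin's own directory at runtime.
import * as path from 'path';
import CopyPlugin from 'copy-webpack-plugin';

export default {
	// ...existing Joplin plugin webpack settings...
	plugins: [
		new CopyPlugin({
			patterns: [
				// '<wasm-package>' is a placeholder for whichever vector-store
				// package ships the .wasm artifact (LanceDB bindings, Orama, etc.).
				{
					from: path.resolve('node_modules/<wasm-package>/dist/index.wasm'),
					to: 'index.wasm',
				},
			],
		}),
	],
};
```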

2. Embedding Model & Storage

• The Model: I plan to use Xenova/all-MiniLM-L6-v2 via Transformers.js. It is lightweight (under 100MB), extremely fast for on-device WASM execution, and highly effective for document-level semantic search.

• Storage & Download: Transformers.js handles fetching from the Hugging Face Hub. I will override its default cache directory to point directly to the plugin's isolated storage using await joplin.plugins.dataDir(). This ensures the model is safely stored within Joplin's standard data structure and persists across app restarts.

3. The 8B Model Tier (Size & Installation UX)

• Size: A 4-bit quantized 8B model (e.g., Meta-Llama-3-8B-Instruct.Q4_K_M.gguf) is roughly 4.5GB to 4.8GB.

• Installation: Because silently downloading a 5GB file in the background is poor UX and could eat user bandwidth, the "Silent Switch" hardware check will only silently determine capability. The actual installation will require a one-time user confirmation. The React UI panel will display: "Your hardware supports High-Performance AI. Click here to download the local model (4.8GB)." The download will stream directly into the plugin's dataDir with a visible progress bar.
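
A rough sketch of that download flow (helper names are mine; assumes Node 18+ for fetch and web-stream interop):

```ts
// Illustrative download helper for the Tier 1 model: stream to disk while
// reporting progress that the React panel can render.
import { createWriteStream } from 'fs';
import { Readable } from 'stream';
import { pipeline } from 'stream/promises';

async function downloadModel(url: string, destPath: string, onProgress: (pct: number) => void) {
	const res = await fetch(url);
	if (!res.ok || !res.body) throw new Error(`Download failed: ${res.status}`);

	const total = Number(res.headers.get('content-length') ?? 0);
	let received = 0;

	// Convert the web stream to a Node stream and count bytes as they pass.
	const source = Readable.fromWeb(res.body as any);
	source.on('data', (chunk: Buffer) => {
		received += chunk.length;
		if (total) onProgress((received / total) * 100);
	});

	await pipeline(source, createWriteStream(destPath));
}
```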

4. Validating Retrieval (Testing Strategy)

This is a great catch. Standard unit tests aren't enough for RAG. To validate that the retrieval is actually returning the right context, I will build an automated evaluation pipeline:

• The Golden Dataset: I will create a dummy collection of ~50 diverse Joplin notes (testing edge cases like long code blocks, nested lists, and Markdown tables) and map a set of predefined test questions to the exact note chunks that contain the answers.

• The Metrics: The test script will index the Golden Dataset, run the queries, and validate the results using Hit Rate (is the expected chunk in the top k=3 results?) and Mean Reciprocal Rank (MRR). If a tweak to the Markdown chunking algorithm causes the MRR to drop, the test suite will fail.
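
A minimal sketch of that evaluation harness, with GoldenCase and the search signature as stand-ins for the real retrieval API:

```ts
// Illustrative evaluation harness for the Golden Dataset.
interface GoldenCase {
	question: string;
	expectedChunkId: string;
}

type SearchFn = (query: string, k: number) => Promise<string[]>; // returns chunk ids

async function evaluateRetrieval(cases: GoldenCase[], search: SearchFn, k = 3) {
	let hits = 0;
	let reciprocalRankSum = 0;

	for (const c of cases) {
		const results = await search(c.question, k);
		const rank = results.indexOf(c.expectedChunkId);
		if (rank !== -1) {
			hits += 1; // Hit Rate: expected chunk appeared in the top-k results.
			reciprocalRankSum += 1 / (rank + 1); // MRR rewards higher placement.
		}
	}

	return {
		hitRate: hits / cases.length,
		mrr: reciprocalRankSum / cases.length,
	};
}

// In CI, the suite would fail if metrics drop below a pinned baseline, e.g.:
//   expect(metrics.mrr).toBeGreaterThanOrEqual(0.85);
```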

I will update the main proposal draft above to explicitly include this retrieval testing strategy and the WASM/UX clarifications. Thanks again for pointing me toward the summarization plugin repository!

Thank you @chimzyfire. How does the 8B model run inference inside the plugin?

Hi @shikuz,

Running an 8B-parameter model purely in Node.js/WASM inside Electron would quickly hit V8 memory limits and freeze the app.

To solve this, the Tier 1 (8B) inference will rely heavily on WebGPU.

By utilizing Transformers.js v3 (which introduced WebGPU support via ONNX Runtime Web) or a WebGPU-native library like WebLLM, we can offload the inference computation directly to the user's local GPU VRAM.

Architecturally: Instead of running the heavy inference in the main plugin Node process, the plugin will spawn an isolated WebWorker (or a hidden offscreen WebView) with WebGPU enabled.

This completely sandboxes the heavy matrix multiplication away from Joplin's main Electron UI thread, ensuring the app remains buttery smooth while the 8B model streams its response back to the chat panel via IPC messages.
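
To illustrate the plumbing, here is a hedged sketch of the webview-side worker handoff; the worker file name and message shapes are my own assumptions:

```ts
// Inside the panel's webview (where WebGPU is available): spawn the inference
// worker and stream tokens back to the chat UI. All names are illustrative.
const worker = new Worker('inference.worker.js');

function ask(question: string, context: string[], onToken: (t: string) => void) {
	return new Promise<void>((resolve) => {
		worker.onmessage = (e: MessageEvent) => {
			if (e.data.type === 'token') onToken(e.data.token); // stream into the panel
			if (e.data.type === 'done') resolve();
		};
		// The worker runs the WebGPU-backed model (e.g. via Transformers.js v3),
		// so heavy matrix math never touches Joplin's main UI thread.
		worker.postMessage({ type: 'generate', question, context });
	});
}
```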