GSoC 2026 Proposal Draft – Idea 4: Multimodal AI Chat for Note Collections – S D Keerthiga Devi

(Update - March 28: Refined the Vector Store implementation to use sandbox-safe WASM/JS libraries instead of native Node.js modules, and clarified the local vs. API approaches for Vision Ingestion based on community feedback.)
Links:

  • Link to the project idea: Link to Idea 4 from Ideas.md
  • GitHub profile: GitHub Link
  • Forum introduction post: Link
  • Pull requests you have submitted to Joplin: Currently setting up the local environment and investigating open frontend issues to submit my first PR during the review period.
  • Other relevant development experience:
    • Published research: "Multimodal RAG-enhanced AI tutoring system" (IRJET, Dec 2025). Link
    • Built an AI-Powered Blog Generator using the MERN stack and the Google Gemini API.
    • Architected complex React/TypeScript collaborative platforms ("Code Mentor" and "Clipsify").

1. Introduction

Hi, I am S D Keerthiga Devi, a final-year B.Tech student in Computer Science and Engineering specializing in AI/ML. My core expertise lies at the intersection of modern frontend architecture (React/TypeScript) and applied Generative AI. I have extensive experience building scalable web applications and have published research specifically on Multimodal RAG (Retrieval-Augmented Generation) systems. I am passionate about open-source and want to bring a highly responsive, visually-aware AI assistant to the Joplin ecosystem.

2. Project Summary

  • What problem it solves: Standard AI note assistants only read text. However, users often clip web pages with crucial diagrams, save photos of whiteboards, or store scanned receipts. Searching and chatting with a note collection that ignores visual data leaves a massive knowledge gap.
  • Why it matters to users: By introducing a "Vision-Aware" Chat UI, users can ask questions about everything in their notebooks, including the content of their images, making the AI truly comprehensive.
  • What will be implemented: A React-based Joplin plugin featuring a conversational interface. It will utilize a Multimodal RAG pipeline to ingest both Markdown text and image attachments, store them in a local vector database, and use an LLM to provide context-aware answers with exact note citations.
  • Expected outcome: A non-blocking chat panel where users can interrogate their entire knowledge base (text and visuals) while keeping data strictly local (via tools like Ollama) or routing through cloud APIs for speed.

  • What is explicitly out of scope: Training a foundational LLM or Vision model from scratch.

3. Technical Approach

Architecture & Components:
I will build this as a plugin using a two-tier Multimodal RAG architecture:

  1. Text & Vision Ingestion: Fetch notes via Joplin's Data API. For Markdown text, apply semantic chunking. For image attachments, pass them through a vision-to-text pipeline to extract semantic descriptions (a sketch of this fetch step follows the list).
  2. Unified Embedding: Embed both the text chunks and the image descriptions into a sandbox-safe vector store. To comply with Electron's restrictions on native Node.js modules, I will utilize a pure JavaScript vector engine (like Orama) or a WASM-compiled store (e.g., Voy), modeling the WASM loading architecture after the existing plugin-ai-summarisation plugin.
  3. Retrieval Engine: On a user query, perform a cosine similarity search to fetch the top-K relevant chunks (whether they originated from text or an image).
  4. Generative UI: Pass the retrieved context to the LLM and stream the response back to a custom React UI, appending Joplin's internal :/ note links as clickable citations.
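
A minimal sketch of ingestion step 1, assuming the standard Joplin plugin Data API (paginated joplin.data.get); the field list and the image filter are choices to be refined during implementation:

```typescript
import joplin from 'api';

// Step 1: fetch every note page by page through the Data API.
async function fetchAllNotes() {
  const notes = [];
  let page = 1;
  while (true) {
    const batch = await joplin.data.get(['notes'], {
      fields: ['id', 'title', 'body'],
      page,
      limit: 50,
    });
    notes.push(...batch.items);
    if (!batch.has_more) break;
    page++;
  }
  return notes;
}

// List a note's image attachments so they can be routed to the vision-to-text pipeline.
async function fetchImageResources(noteId: string) {
  const res = await joplin.data.get(['notes', noteId, 'resources'], {
    fields: ['id', 'title', 'mime'],
  });
  return res.items.filter((r: any) => r.mime.startsWith('image/'));
}
```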

Libraries or technologies:

  • TypeScript & React: For the plugin infrastructure and a seamless, native-feeling UI panel.
  • LangChain.js: To orchestrate the document loaders, chunking, and retrieval chains.
  • Vision Ingestion Models: Image description generation will support a privacy-first local path (via Ollama + llava) and a low-latency cloud fallback using the Google Gemini API (sketched after this list).
  • Text LLM Options: Support for Local LLMs (via Llama.cpp/Ollama) for privacy, and cloud APIs (OpenAI/Gemini) for users with lower-end hardware.
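
A hedged sketch of the hybrid vision path: the local branch posts the image to Ollama's REST endpoint running a llava model, and the cloud branch uses the official @google/generative-ai client. The model names, prompt, and base64Image input are assumptions to be tuned later:

```typescript
import { GoogleGenerativeAI } from '@google/generative-ai';

const PROMPT = 'Describe this image in two or three factual sentences.';

// Local path: ask a llava model served by Ollama to describe the image.
async function describeImageLocally(base64Image: string): Promise<string> {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model: 'llava', prompt: PROMPT, images: [base64Image], stream: false }),
  });
  const data = await res.json();
  return data.response;
}

// Cloud fallback: send the same image to Gemini for users without capable local hardware.
async function describeImageWithGemini(apiKey: string, base64Image: string): Promise<string> {
  const genAI = new GoogleGenerativeAI(apiKey);
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });
  const result = await model.generateContent([
    PROMPT,
    { inlineData: { data: base64Image, mimeType: 'image/png' } },
  ]);
  return result.response.text();
}
```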

Potential challenges:

  • UI Thread Blocking: Running heavy embeddings locally could freeze Joplin.
    • Solution: Offload the ingestion and embedding generation to Web Workers or background processes, running only when the app is idle.
  • Context Window Limits: Retrieved context can exceed the model's token budget.
    • Solution: Apply Maximum Marginal Relevance (MMR) in the retrieval step to select diverse, non-redundant context without overflowing the token limit (see the sketch below).
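
To make the MMR idea concrete, a small sketch in plain TypeScript (no library assumptions): greedily pick chunks that are relevant to the query but dissimilar to the chunks already selected.

```typescript
type Chunk = { id: string; embedding: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Maximal Marginal Relevance: lambda trades relevance to the query
// against redundancy with already-selected chunks.
function mmrSelect(query: number[], candidates: Chunk[], k: number, lambda = 0.7): Chunk[] {
  const selected: Chunk[] = [];
  const remaining = [...candidates];
  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    for (let i = 0; i < remaining.length; i++) {
      const relevance = cosine(query, remaining[i].embedding);
      const redundancy = selected.length
        ? Math.max(...selected.map(s => cosine(s.embedding, remaining[i].embedding)))
        : 0;
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    }
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```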

4. Implementation Plan

Week 1–2: Plugin Scaffolding & Data Pipeline

  • Initialize plugin architecture and React webview.
  • Implement Joplin Data API listeners to fetch and sync notes and resource attachments incrementally.
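
A hedged sketch of the incremental part, assuming the documented joplin.workspace events; the pending-ID set is a placeholder for a real indexing queue:

```typescript
import joplin from 'api';

// Notes waiting to be (re-)ingested; drained by the embedding engine when the app is idle.
const pendingNoteIds = new Set<string>();

export async function registerIncrementalSync() {
  // Re-index a note whenever it is created, updated or deleted locally.
  await joplin.workspace.onNoteChange(async (event: any) => {
    pendingNoteIds.add(event.id);
  });

  // After a sync, notes changed on other devices arrive; schedule a sweep of the queue.
  await joplin.workspace.onSyncComplete(async () => {
    console.info(`Sync finished, ${pendingNoteIds.size} notes queued for re-indexing`);
  });
}
```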

Week 3–4: The Multimodal Ingestion Engine

  • Implement Markdown parsing and text splitting using LangChain.js (see the sketch below).
  • Build the image-processing bridge to generate descriptions for visual attachments.
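
For the text-splitting item, a minimal sketch using LangChain.js's RecursiveCharacterTextSplitter (the import path differs between LangChain.js versions, and the chunk sizes are placeholders to be tuned):

```typescript
import { RecursiveCharacterTextSplitter } from 'langchain/text_splitter';

// Split a note body into overlapping chunks, keeping the note id as metadata
// so each chunk can later be cited back to its source note.
export async function chunkNote(noteId: string, body: string) {
  const splitter = new RecursiveCharacterTextSplitter({
    chunkSize: 512,
    chunkOverlap: 64,
  });
  const docs = await splitter.createDocuments([body], [{ noteId }]);
  return docs.map(d => ({ text: d.pageContent, noteId: d.metadata.noteId as string }));
}
```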

Week 5–6: Vector Storage & Retrieval

  • Integrate local embedding models (e.g., Transformers.js or Ollama embeddings).
  • Build the semantic search function using the WASM/JS-safe vector store to accurately retrieve top-K context chunks.
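
A hedged sketch of embedding and top-K retrieval using Transformers.js for fully local embeddings and a plain in-memory array; the model name is an assumption, and the array would be replaced by Orama/Voy or a persisted index in practice:

```typescript
import { pipeline } from '@xenova/transformers';

type IndexedChunk = { noteId: string; text: string; embedding: number[] };
const index: IndexedChunk[] = [];

const cosine = (a: number[], b: number[]) =>
  a.reduce((s, v, i) => s + v * b[i], 0) / (Math.hypot(...a) * Math.hypot(...b));

// Lazily create the local embedding pipeline (pure JS/WASM, no native modules).
let embedder: any;
async function embed(text: string): Promise<number[]> {
  embedder ??= await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');
  const output = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(output.data as Float32Array);
}

export async function addChunk(noteId: string, text: string) {
  index.push({ noteId, text, embedding: await embed(text) });
}

// Rank all stored chunks by cosine similarity to the query and return the top-K.
export async function search(query: string, k = 5): Promise<IndexedChunk[]> {
  const q = await embed(query);
  return [...index]
    .sort((a, b) => cosine(q, b.embedding) - cosine(q, a.embedding))
    .slice(0, k);
}
```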

Week 7–8: Conversational UI & Streaming

  • Develop the React chat interface with message history state management.
  • Implement response streaming (SSE/NDJSON) so the AI's reply renders token by token without blocking the UI (see the sketch below).
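
A hedged sketch of the streaming loop, assuming Ollama's NDJSON /api/chat stream; an OpenAI or Gemini backend would change only the URL and per-line parsing, and onToken is the callback that appends text to the React message state:

```typescript
// Stream a chat completion and surface each token to the UI as it arrives,
// so the reply "types out" without blocking the interface.
export async function streamChat(prompt: string, onToken: (token: string) => void): Promise<void> {
  const res = await fetch('http://localhost:11434/api/chat', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llama3',
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }),
  });
  const reader = res.body!.getReader();
  const decoder = new TextDecoder();
  let buffered = '';
  while (true) {
    const { done, value } = await reader.read();
    if (done) break;
    buffered += decoder.decode(value, { stream: true });
    // Ollama streams one JSON object per line (NDJSON).
    const lines = buffered.split('\n');
    buffered = lines.pop() ?? '';
    for (const line of lines) {
      if (!line.trim()) continue;
      const chunk = JSON.parse(line);
      if (chunk.message?.content) onToken(chunk.message.content);
    }
  }
}
```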

Week 9–10: Citations & Prompt Engineering

  • Refine the system prompts to ensure the AI strictly answers from the provided context (preventing hallucinations).
  • Map retrieved vector chunks back to their parent Joplin Note IDs to render clickable citation badges in the chat.
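
To make the grounding and citation steps concrete, a sketch of the prompt assembly and of the mapping from retrieved chunks back to Joplin's :/ note links; the RetrievedChunk shape and the exact prompt wording are assumptions:

```typescript
type RetrievedChunk = { noteId: string; noteTitle: string; text: string };

// Keep the model grounded: it may only answer from the supplied excerpts.
const SYSTEM_PROMPT =
  'Answer using ONLY the provided note excerpts. ' +
  'If the excerpts do not contain the answer, say you do not know. ' +
  'Cite the notes you used by their [n] markers.';

export function buildPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.noteTitle})\n${c.text}`)
    .join('\n\n');
  return `${SYSTEM_PROMPT}\n\nContext:\n${context}\n\nQuestion: ${question}`;
}

// Turn a retrieved chunk into a clickable Joplin internal link for the citation badge.
export function renderCitation(chunk: RetrievedChunk, index: number): string {
  return `[${index + 1}. ${chunk.noteTitle}](:/${chunk.noteId})`;
}
```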

Week 11–12: Polish, Testing & Documentation

  • Finalize UI styling to respect Joplin's light/dark themes.
  • Write unit tests for the chunking and retrieval logic.
  • Publish user documentation outlining how to connect local vs. cloud AI models.
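
For the unit-testing item, a small Jest-style sketch; the import path and the chunkNote helper refer to the hypothetical splitter sketched under Week 3–4:

```typescript
import { chunkNote } from '../src/ingestion/chunkNote';

describe('chunkNote', () => {
  it('keeps chunks within the configured size and preserves the note id', async () => {
    const body = 'lorem ipsum '.repeat(500);
    const chunks = await chunkNote('note-123', body);

    expect(chunks.length).toBeGreaterThan(1);
    for (const chunk of chunks) {
      expect(chunk.text.length).toBeLessThanOrEqual(512);
      expect(chunk.noteId).toBe('note-123');
    }
  });
});
```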

5. Deliverables

  • A published Joplin Plugin with a React-based Chat UI.
  • A complete Multimodal RAG ingestion and retrieval pipeline.
  • Support for both local inference and external APIs.
  • Comprehensive user and developer documentation.

6. Availability

  • Weekly availability: 30–35 hours per week.
  • Time zone: IST (UTC +5:30).
  • Other commitments: Flexible schedule with dedicated daily blocks secured specifically for GSoC development to easily meet the 350-hour requirement.

Hey @S-D-Keerthiga-Devi,

Which vector store are you planning to use? Joplin plugins run in an Electron sandbox that can't load native Node.js modules - worth looking at the AI summarisation plugin by @HahaBill for how WASM loading was handled there.

How are image descriptions generated - local vision model, API call, or something else?

Hi @shikuz,

Thank you for the excellent catch regarding the Electron sandbox limitations! I really appreciate you pointing me toward @HahaBill's plugin-ai-summarisation repository; I am reviewing their WASM loading implementation now.

To address your questions:

1. Vector Store Implementation: Because native Node.js modules are out of the question, I am pivoting away from standard HNSWLib. Instead, I plan to use a pure JavaScript/TypeScript vector search engine that runs safely in the sandbox, such as Orama or LangChain.js's MemoryVectorStore, and persist the serialized index to Joplin's data directory or IndexedDB. Alternatively, if performance demands it, I will implement a WASM-based vector store (like Voy) mirroring the exact WASM loading pattern established in the summarisation plugin.
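
As a hedged sketch of the persistence side (assuming the idb-keyval helper over IndexedDB inside the webview; the chunk shape and storage key are arbitrary choices):

```typescript
import { get, set } from 'idb-keyval';

type IndexedChunk = { noteId: string; text: string; embedding: number[] };
const STORE_KEY = 'joplin-multimodal-chat/vector-index-v1';

// Persist the serialized index so reopening Joplin does not force a full re-embed.
export async function saveIndex(chunks: IndexedChunk[]): Promise<void> {
  await set(STORE_KEY, chunks);
}

export async function loadIndex(): Promise<IndexedChunk[]> {
  return (await get<IndexedChunk[]>(STORE_KEY)) ?? [];
}
```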

2. Generating Image Descriptions: Since Joplin users have varying hardware capabilities and privacy preferences, I plan to offer a hybrid approach for the vision pipeline:

  • Privacy-First (Local): For users with capable hardware, I will route the images through a local instance of Ollama running a lightweight multimodal model like llava.

  • Performance-First (API): For users on lower-end machines, I will provide a configuration field to input an API key to use a cloud vision model. I plan to default to the Google Gemini API (Gemini 1.5 Flash/Pro), as I have prior hands-on experience integrating it into MERN-stack applications for AI generation tasks.

I will update my main proposal draft above to reflect these architectural refinements immediately. Thanks again for the guidance!