(Update - March 28: Refined the Vector Store implementation to use sandbox-safe WASM/JS libraries instead of native Node.js modules, and clarified the local vs. API approaches for Vision Ingestion based on community feedback.)
Links:
- Link to the project idea: Link to Idea 4 from Ideas.md
- GitHub profile: GitHub Link
- Forum introduction post: Link
- Pull requests you have submitted to Joplin: Currently setting up the local environment and investigating open frontend issues to submit my first PR during the review period.
- Other relevant development experience: Published research: "Multimodal RAG-enhanced AI tutoring system" (IRJET, Dec 2025). Link
- Built an AI-Powered Blog Generator using the MERN stack and Google Gemini API.
- Architected complex React/TypeScript collaborative platforms ("Code Mentor" and "Clipsify").
1. Introduction
Hi, I am S D Keerthiga Devi, a final-year B.Tech student in Computer Science and Engineering specializing in AI/ML. My core expertise lies at the intersection of modern frontend architecture (React/TypeScript) and applied Generative AI. I have extensive experience building scalable web applications and have published research specifically on Multimodal RAG (Retrieval-Augmented Generation) systems. I am passionate about open-source and want to bring a highly responsive, visually-aware AI assistant to the Joplin ecosystem.
2. Project Summary
- What problem it solves: Standard AI note assistants only read text. However, users often clip web pages with crucial diagrams, save photos of whiteboards, or store scanned receipts. Searching and chatting with a note collection that ignores visual data leaves a massive knowledge gap.
- Why it matters to users: By introducing a "Vision-Aware" Chat UI, users can ask questions about everything in their notebooks, including the content of their images, making the AI truly comprehensive.
- What will be implemented: A React-based Joplin plugin featuring a conversational interface. It will utilize a Multimodal RAG pipeline to ingest both Markdown text and image attachments, store them in a local vector database, and use an LLM to provide context-aware answers with exact note citations.
- Expected outcome: A non-blocking chat panel where users can query their entire knowledge base (text and visuals) while keeping data strictly local (via tools like Ollama) or routing through cloud APIs for speed.
- What is explicitly out of scope: Training a foundational LLM or Vision model from scratch.
3. Technical Approach
Architecture & Components:
I will build this as a plugin using a two-tier Multimodal RAG architecture:
- Text & Vision Ingestion: Fetch notes via Joplin's Data API. For Markdown text, apply semantic chunking. For image attachments, pass them through a vision-to-text pipeline to extract semantic descriptions.
- Unified Embedding: Embed both the text chunks and the image descriptions into a sandbox-safe vector store. To comply with Electron's restrictions on native Node.js modules, I will utilize a pure JavaScript vector engine (like Orama) or a WASM-compiled store (e.g., Voy), modeling the WASM loading architecture after the existing plugin-ai-summarisation plugin.
- Retrieval Engine: On a user query, perform a cosine similarity search to fetch the top-K relevant chunks (whether they originated from text or an image).
- Generative UI: Pass the context to the LLM and stream the response back to a custom React UI, appending internal Joplin links (the `:/` syntax) as clickable citations.
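To make this flow concrete, here is a minimal sketch of the query-time path from the pipeline above. The helper names (`answerQuery`, `embed`, `searchTopK`, `streamLLM`) are illustrative placeholders standing in for the embedding model, the vector store, and the configured LLM backend, not a finalized API:

```typescript
// Illustrative query-time flow: embed the question, retrieve top-K chunks,
// build a grounded prompt, and stream the answer back to the UI.
interface Chunk {
  noteId: string; // parent Joplin note, later rendered as a :/noteId citation
  text: string;   // markdown chunk or generated image description
}

async function answerQuery(
  query: string,
  deps: {
    embed: (s: string) => Promise<number[]>;
    searchTopK: (v: number[], k: number) => Promise<Chunk[]>;
    streamLLM: (prompt: string, onToken: (t: string) => void) => Promise<void>;
  },
  onToken: (t: string) => void,
): Promise<Chunk[]> {
  const chunks = await deps.searchTopK(await deps.embed(query), 5);

  // Context first, question last; numbered so the model can cite [n].
  const context = chunks
    .map((c, i) => `[${i + 1}] (from note :/${c.noteId})\n${c.text}`)
    .join('\n\n');
  const prompt =
    'Answer ONLY from the context below and cite sources as [n]. ' +
    'If the answer is not in the context, say so.\n\n' +
    `${context}\n\nQuestion: ${query}`;

  await deps.streamLLM(prompt, onToken);
  return chunks; // the UI maps these back to clickable :/noteId citations
}
```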
Libraries or technologies:
- TypeScript & React: For the plugin infrastructure and a seamless, native-feeling UI panel.
- LangChain.js: To orchestrate the document loaders, chunking, and retrieval chains.
- Vision Ingestion Models: Image description generation will support a privacy-first local path (via Ollama + llava) and a low-latency cloud fallback using the Google Gemini API.
- Text LLM Options: Support for local LLMs (via llama.cpp/Ollama) for privacy, and cloud APIs (OpenAI/Gemini) for users with lower-end hardware.
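As a rough illustration of the local vision path, the snippet below asks a llava model running behind Ollama's default local endpoint to describe an attachment. The prompt wording is a placeholder, and error handling/retries are omitted:

```typescript
// Sketch of the privacy-first local path: generate a searchable text
// description of an image via Ollama's /api/generate endpoint.
async function describeImageLocally(imageBase64: string): Promise<string> {
  const res = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      model: 'llava',
      prompt: 'Describe this image in detail for semantic search indexing.',
      images: [imageBase64], // Ollama accepts base64-encoded images
      stream: false,         // single JSON response instead of a token stream
    }),
  });
  const data = await res.json();
  return data.response;      // the generated description text
}
```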
Potential challenges:
- UI Thread Blocking: Running heavy embeddings locally could freeze Joplin.
- Solution: Offload the ingestion and embedding generation to Web Workers or background processes, running only when the app is idle.
- Context Window Limits: Stuffing too many retrieved chunks into the prompt can overflow the model's token budget.
- Solution: Implement strict Maximum Marginal Relevance (MMR) in the retrieval step to ensure diverse context without overflowing the token limit (a minimal selection sketch follows this list).
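Below is a minimal sketch of the MMR selection step, assuming plain cosine similarity over pre-computed embeddings; the lambda value and helper names are illustrative:

```typescript
// Cosine similarity between two dense vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// MMR: trade off query relevance against redundancy with already-selected
// chunks. lambda = 1 is pure relevance; lower values favour diversity.
function mmrSelect(
  queryVec: number[],
  candidates: { vec: number[]; text: string }[],
  k: number,
  lambda = 0.7,
): { vec: number[]; text: string }[] {
  const selected: { vec: number[]; text: string }[] = [];
  const pool = [...candidates];
  while (selected.length < k && pool.length > 0) {
    let bestIdx = 0, bestScore = -Infinity;
    for (let i = 0; i < pool.length; i++) {
      const relevance = cosine(queryVec, pool[i].vec);
      const redundancy = Math.max(0, ...selected.map(s => cosine(pool[i].vec, s.vec)));
      const score = lambda * relevance - (1 - lambda) * redundancy;
      if (score > bestScore) { bestScore = score; bestIdx = i; }
    }
    selected.push(pool.splice(bestIdx, 1)[0]);
  }
  return selected;
}
```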
4. Implementation Plan
Week 1–2: Plugin Scaffolding & Data Pipeline
- Initialize plugin architecture and React webview.
- Implement Joplin Data API listeners to fetch and sync notes and resource attachments incrementally.
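A sketch of how incremental fetching could look with joplin.data.get; tracking changes via an updated_time cutoff is an assumption here, not a settled design:

```typescript
// Paginate through all notes and keep only those modified since the last
// indexing pass, so re-chunking and re-embedding stay incremental.
import joplin from 'api';

async function fetchUpdatedNotes(lastSyncTime: number) {
  const updated: { id: string; title: string; body: string }[] = [];
  let page = 1;
  let hasMore = true;
  while (hasMore) {
    const res = await joplin.data.get(['notes'], {
      fields: ['id', 'title', 'body', 'updated_time'],
      page,
    });
    for (const note of res.items) {
      if (note.updated_time > lastSyncTime) updated.push(note);
    }
    hasMore = res.has_more;
    page += 1;
  }
  return updated; // only these notes are re-chunked and re-embedded
}
```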
Week 3–4: The Multimodal Ingestion Engine
- Implement Markdown parsing and text splitting using LangChain.js (see the sketch after this list).
- Build the image-processing bridge to generate descriptions for visual attachments.
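For the text-splitting step above, a minimal sketch using LangChain.js's markdown-aware splitter; the package path, chunk size, and overlap are assumptions to be tuned during implementation:

```typescript
// Split a note body into overlapping markdown-aware chunks, tagging each
// chunk with its parent note id for later citation.
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters';

const splitter = RecursiveCharacterTextSplitter.fromLanguage('markdown', {
  chunkSize: 512,   // characters per chunk, sized for the embedding model
  chunkOverlap: 64, // overlap preserves context across chunk boundaries
});

async function chunkNote(noteId: string, body: string) {
  const docs = await splitter.createDocuments([body], [{ noteId }]);
  return docs; // each Document keeps { pageContent, metadata: { noteId } }
}
```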
Week 5–6: Vector Storage & Retrieval
- Integrate local embedding models (e.g., Transformers.js or Ollama embeddings).
- Build the semantic search function using the WASM/JS-safe vector store to accurately retrieve top-K context chunks.
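A minimal sketch of the sandbox-safe store, pairing Transformers.js embeddings with Orama's vector search. The all-MiniLM-L6-v2 model and the 384-dimension schema are assumptions chosen for illustration:

```typescript
// Pure JS/WASM pipeline: embed text with Transformers.js, index and query
// the vectors in Orama (no native Node.js modules required).
import { pipeline } from '@xenova/transformers';
import { create, insert, search } from '@orama/orama';

const embedder = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

async function embed(text: string): Promise<number[]> {
  const tensor = await embedder(text, { pooling: 'mean', normalize: true });
  return Array.from(tensor.data as Float32Array);
}

const db = await create({
  schema: { noteId: 'string', text: 'string', embedding: 'vector[384]' },
});

export async function indexChunk(noteId: string, text: string) {
  await insert(db, { noteId, text, embedding: await embed(text) });
}

export async function topK(query: string, k = 5) {
  return search(db, {
    mode: 'vector',
    vector: { value: await embed(query), property: 'embedding' },
    limit: k,
  });
}
```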
Week 7–8: Conversational UI & Streaming
- Develop the React chat interface with message history state management.
- Implement server-sent events/streaming so the AI's response types out dynamically without UI lag.
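As a rough sketch of the streaming path, the hook below reads an Ollama-style newline-delimited JSON stream and appends tokens to React state. The endpoint, model name, and single-pass line splitting (no buffering of partial lines across network chunks) are simplifications:

```typescript
// Stream LLM tokens into React state so the reply "types out" without
// blocking the webview thread.
import { useState } from 'react';

export function useStreamingReply() {
  const [reply, setReply] = useState('');

  async function streamChat(prompt: string): Promise<void> {
    setReply('');
    const res = await fetch('http://localhost:11434/api/generate', {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ model: 'llama3', prompt, stream: true }),
    });
    const reader = res.body!.getReader();
    const decoder = new TextDecoder();
    for (;;) {
      const { done, value } = await reader.read();
      if (done) break;
      // Each line is a JSON object carrying one `response` token.
      for (const line of decoder.decode(value, { stream: true }).split('\n')) {
        if (!line.trim()) continue;
        const token = JSON.parse(line).response ?? '';
        setReply(prev => prev + token);
      }
    }
  }

  return { reply, streamChat };
}
```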
Week 9–10: Citations & Prompt Engineering
- Refine the system prompts to ensure the AI strictly answers from the provided context (preventing hallucinations).
- Map retrieved vector chunks back to their parent Joplin Note IDs to render clickable citation badges in the chat.
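As a small illustration of that mapping step, assuming each retrieved chunk carries its parent note's id and title (field names are placeholders):

```typescript
// Turn retrieved chunks into de-duplicated, clickable citations using
// Joplin's internal link syntax: [title](:/noteId).
interface Citation { noteId: string; title: string }

function renderCitations(chunks: Citation[]): string {
  // Collapse multiple chunks from the same note into a single citation.
  const unique = new Map<string, Citation>();
  for (const c of chunks) unique.set(c.noteId, c);
  return [...unique.values()]
    .map(c => `[${c.title}](:/${c.noteId})`)
    .join(' · ');
}
```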
Week 11–12: Polish, Testing & Documentation
- Finalize UI styling to respect Joplin's light/dark themes.
- Write unit tests for the chunking and retrieval logic.
- Publish user documentation outlining how to connect local vs. cloud AI models.
5. Deliverables
- A published Joplin Plugin with a React-based Chat UI.
- A complete Multimodal RAG ingestion and retrieval pipeline.
- Support for both local inference and external APIs.
- Comprehensive user and developer documentation.
6. Availability
- Weekly availability: 30–35 hours per week.
- Time zone: IST (UTC +5:30).
- Other commitments: Flexible schedule with dedicated daily blocks secured specifically for GSoC development to easily meet the 350-hour requirement.
