| Organization | Joplin |
| Mentors | Henry Heino, Caleb John, Shikuz |
| Difficulty | Medium |
| Expected size of project | 175 hours |
| GSoC 2026 Proposal Draft, Idea 5: "Automatically label images using AI" - Mahadev Kumar |
|---|
Personal Information
Student Details
- Name: Mahadev Kumar
- GitHub: rajmahadev422
- Email: Mahadev Raj, 24je0035
- LinkedIn: Mahadev Kumar
- Portfolio: Mahadev
- Resume: Link
- Time Zone: IST (GMT+5:30), New Delhi
- Address: Sitamarhi, Bihar, India, Pin-843323
- Introduction post: link
University Details
- University: Indian Institute of Technology (ISM) Dhanbad
- Degree: Bachelor of Technology
- Branch: Civil Engineering
- Current Year: 2nd
- Expected Graduation: 2028
Background
- I have a strong foundation in full-stack web development, primarily working with the MERN stack to build scalable and efficient applications. I have extensive experience with React.js, where I focus on building reusable component architectures, managing state using Redux Toolkit and Context API, and optimizing performance for better user experience.
- I am highly proficient in JavaScript and TypeScript, with a clear understanding of core concepts such as asynchronous programming, closures, and the event loop.
- I am also familiar with C++ and Python.
- On the backend, I work with Node.js and Express.js to design and develop RESTful APIs, implement authentication systems using JWT, and handle secure data flow between client and server.
- I have worked with MongoDB for database design, schema structuring, and query optimization.
- Through my projects, including an e-commerce site and a campus management system, I have implemented features like authentication, protected routes, and dynamic data rendering.
- In addition to web development, I am actively exploring the integration of AI/ML into applications.
- I have experience working with APIs like Gemini to build intelligent systems such as automated issue and pull request labeling.
- I am also familiar with modern development workflows, including Git-based version control and contributing to open-source projects.
- I have built an open-source GitHub project that teaches newcomers how to contribute to open source.
- I have also contributed to OpenCV. [PR-Link]
Summary
This project introduces an AI-powered system for automatically generating descriptive labels for images in Joplin, significantly improving accessibility for visually impaired users. A working prototype has already been implemented using the Google Gemini API, demonstrating real-time caption generation.
Problem Statement
Joplin has a strong focus on accessibility. To enhance accessibility, we aim to use AI to automatically scan all images found within the notes and assign a descriptive label to each one. For instance, an image of the Mona Lisa could be labelled as "A portrait of a woman with an enigmatic smile, featuring a soft landscape background and masterful use of sfumato shading".
Proposed Solution
Architecture Overview
Joplin Plugin (React + TypeScript) → Image Extraction Layer (conversion to Base64) → AI Captioning Service (Local or API) → Description Storage (Metadata / Note DB) → UI for Displaying and Editing
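As a rough sketch, the stages above could be wired together as typed, pluggable functions. All names here are illustrative, not part of the Joplin API:

```typescript
// Illustrative pipeline stages; every name below is hypothetical.
interface ExtractedImage {
  resourceId: string; // ID of the Joplin resource referenced from the note body
  base64: string;     // image bytes, Base64-encoded for the captioning service
  mimeType: string;
}

interface Caption {
  resourceId: string;
  text: string;       // AI-generated description, editable by the user later
}

// A captioning backend (local model or remote API) plugs in behind this signature.
type CaptionService = (image: ExtractedImage) => Promise<Caption>;

async function labelImages(
  images: ExtractedImage[],
  captioner: CaptionService,
): Promise<Caption[]> {
  const captions: Caption[] = [];
  // Sequential processing keeps API usage within rate limits.
  for (const img of images) {
    captions.push(await captioner(img));
  }
  return captions;
}
```

Because the captioning backend is just a function, the local (privacy) and API-based options described later can be swapped without touching the rest of the pipeline.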
Core Features
1. Automatic Image Detection
- Scan notes for images
- Detect new or modified images
2. AI-Based Caption Generation
- Generate detailed descriptions
- Support:
- Local models (privacy-first)
- API-based models (performance)
3. Metadata Storage
- Store captions as:
- Alt text
- Searchable metadata
4. User Interface
- Edit captions manually
- Re-generate descriptions
- Enable/disable automation
5. Batch Processing
- Scan entire notebook collections
- Background processing support
Technical Approach
Frontend (Plugin)
- TypeScript-based Joplin plugin
- React UI for:
- Image preview panel
- Caption editor
- Settings dashboard
AI Integration
Option A: Local (Privacy Mode)
- FastAPI server (Python)
- Models: BLIP / Vision Transformers
Option B: API-Based (Recommended for this Project)
- Google Gemini (leveraging prior experience)
- Faster and easier to scale
- No separate backend required
Communication
- Plugin ↔ AI service via HTTP API
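A sketch of the request body the plugin would send to the captioning service over HTTP. The shape below follows the Gemini REST `generateContent` format; treat the exact field names as an assumption to verify against the current API documentation:

```typescript
// Hypothetical request shape, modelled on the Gemini REST generateContent body.
interface CaptionRequest {
  contents: Array<{
    parts: Array<
      | { text: string }
      | { inline_data: { mime_type: string; data: string } }
    >;
  }>;
}

function buildCaptionRequest(base64Image: string, mimeType: string): CaptionRequest {
  return {
    contents: [
      {
        // One text part (the prompt) plus one inline image part.
        parts: [
          { text: "Describe this image in one or two sentences for use as alt text." },
          { inline_data: { mime_type: mimeType, data: base64Image } },
        ],
      },
    ],
  };
}
```

The plugin would POST this body over HTTPS with the user's API key taken from the settings panel; the endpoint URL depends on the chosen backend (Option A or B).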
Implementation Timeline
Community Bonding Period (Before Coding Starts)
- Set up development environment for Joplin plugin
- Study Joplin plugin API and architecture
- Discuss scope, milestones, and expectations with mentors
- Finalize system design and AI approach
Week 1: Research & System Design
- Analyze Joplin note and resource structure
- Design complete architecture and data flow
- Evaluate AI models (Gemini vs local models)
- Finalize prompt strategy for caption generation
Week 2: Plugin Setup & Image Extraction
- Initialize Joplin plugin using TypeScript
- Implement command system and menu integration
- Extract images from notes using resource IDs
- Convert images to Base64 format
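A minimal sketch of the extraction step: Joplin notes reference attachments as Markdown links of the form `![alt](:/<32-hex resource ID>)`, so the resource IDs can be collected with a regular expression before each one is fetched and Base64-encoded through the plugin data API:

```typescript
// Collect Joplin resource IDs for all images referenced in a note body.
// Joplin resource IDs are 32 lowercase hex characters.
function extractImageResourceIds(noteBody: string): string[] {
  const ids: string[] = [];
  const re = /!\[([^\]]*)\]\(:\/([0-9a-f]{32})\)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(noteBody)) !== null) {
    ids.push(m[2]); // capture group 2 holds the resource ID
  }
  return ids;
}
```

In the actual plugin, each returned ID would then be resolved to the image bytes via the Joplin data API and converted to Base64 for the captioning service.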
Week 3: AI Integration (Core)
- Integrate Google Gemini API
- Implement image-to-caption pipeline
- Parse and validate AI responses
- Add error handling for failed requests
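For the parsing and validation step, a defensive sketch: the nested shape below (candidates → content → parts → text) follows the Gemini REST response format, but it is treated here as an assumption, and any missing or empty field maps to a failure:

```typescript
// Hypothetical response shape, modelled on the Gemini REST generateContent reply.
interface GeminiResponse {
  candidates?: Array<{
    content?: { parts?: Array<{ text?: string }> };
  }>;
}

// Returns the caption text, or null when the response is malformed or empty,
// so the caller can trigger the error-handling path.
function extractCaption(response: GeminiResponse): string | null {
  const text = response.candidates?.[0]?.content?.parts?.[0]?.text;
  if (!text || text.trim().length === 0) return null;
  return text.trim();
}
```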
Week 4: Storage & Metadata Handling
- Store captions in note metadata
- Ensure persistence across sessions
- Handle multiple images per note
- Begin basic search integration
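One way to realize caption storage is to write the generated caption back into the note body as Markdown alt text, so it is persisted with the note and picked up by search; a sketch that only fills in empty alt text, leaving user-written descriptions untouched:

```typescript
// Rewrite empty image links ![](:/<id>) so the AI caption becomes the alt text.
// The resource ID is hex-only, so it is safe to interpolate into a RegExp.
function applyCaption(noteBody: string, resourceId: string, caption: string): string {
  const emptyAlt = new RegExp(`!\\[\\]\\(:\\/${resourceId}\\)`, "g");
  return noteBody.replace(emptyAlt, `![${caption}](:/${resourceId})`);
}
```

Whether alt text, a separate metadata store, or both is used would be finalized with the mentors, since each option has different search and sync implications.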
Week 5: UI Development (Core Features)
- Build React-based UI panel inside Joplin
- Display images with generated captions
- Add manual editing and re-generation options
Week 6: Advanced Features
- Add settings panel (API key, automation toggle)
- Implement batch processing for multiple notes
- Improve prompt structure for better output quality
Week 7: Optimization & Testing
- Optimize performance for large datasets
- Implement background processing
- Handle edge cases and improve reliability
- Fix bugs and refine user experience
Week 8: Finalization & Documentation
- Complete documentation (user + developer)
- Prepare demo (GIF/video/screenshots)
- Code cleanup and final testing
- Submit final project
The timeline is structured to deliver a functional MVP early (by Week 4), followed by iterative improvements, ensuring steady progress and continuous incorporation of mentor feedback.
Prototype / Prior Implementation
I have already implemented a working prototype of this system:
- Built an AI Image Describer web app using React + TypeScript
- Integrated Google Gemini API for real-time caption generation
- Implemented:
- Drag-and-drop image upload
- Base64 image processing
- AI-generated descriptive outputs
- Live Demo: link
- Github link: link

Expected Outcomes
- Fully functional Joplin plugin
- Automated image caption generation
- Improved accessibility and usability
- Searchable image descriptions
Risk Analysis & Mitigation
- API Failure / Rate Limits → Implement retry + fallback mechanism
- Slow Processing → Batch processing and background execution
- Low Caption Quality → Prompt tuning + manual editing support
- Privacy Concerns → Optional local model support
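The retry-plus-fallback mitigation for API failures and rate limits could look like the following sketch: exponential backoff around the captioning call, degrading to a placeholder description after the final attempt instead of failing the whole batch:

```typescript
// Retry an async operation with exponential backoff (base delay doubles each
// attempt); return the fallback value once all attempts are exhausted.
async function withRetry<T>(
  fn: () => Promise<T>,
  fallback: T,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch {
      if (attempt === maxAttempts - 1) return fallback; // give up gracefully
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** attempt));
    }
  }
  return fallback;
}
```

In the plugin, the fallback would be a marker such as an empty caption that flags the image for manual description rather than silently inventing one.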
Why Me?
- Built a working AI image captioning system using the Gemini API
- Experience integrating LLMs into real-world applications
- Strong full-stack skills (React, TypeScript, JavaScript, Node.js)
- Experience with automation systems (Learn-to-PR)
- Familiar with open-source workflows and Git-based collaboration
Future Enhancements
- Multi-language captions
- Context-aware descriptions using note text
- OCR integration
- Voice output for accessibility
Previous Experience
1. Vehicle Prediction System [github-link]
- Built a machine learning model to predict vehicle-related outcomes using historical data
- Implemented data preprocessing, feature engineering, and model evaluation
- Focused on accuracy optimization and real-world applicability
- Built the frontend with HTML, CSS, and JavaScript and the backend with Python and PyTorch
2. Learn-to-PR (AI-powered GitHub Assistant) [github-link]
- Developed a system that integrates the Google Gemini API
- Automatically labels GitHub issues and pull requests
- Posts a welcome message to users who open an issue or pull request
- Demonstrates real-world LLM (google-genai) integration and automation
3. Portfolio [github-link] [live-link]
- Technologies used:
- Framework: Next.js
- Language: TypeScript
- Library: Tailwind CSS
- Supports both light and dark mode
- Fully responsive
Full-Stack & AI Skills
- Languages: TypeScript, JavaScript, Python, C++
- Frontend: React.js, Tailwind CSS
- Backend: Node.js, FastAPI
- Database: MongoDB, MySQL
- AI/ML: Model integration, API-based LLM usage, data pipelines, OpenCV
- Other: Plugin development, REST APIs, system design
Motivation
- Accessibility is a critical but often underdeveloped feature in note-taking tools. Users relying on screen readers face limitations when images lack descriptions.
- This project aligns with my goal of integrating AI into real-world applications and improving user experience at scale.
- I have hands-on experience building full-stack and AI-integrated systems, making me well-equipped to deliver this solution efficiently.
Conclusion
This project delivers a practical, scalable solution to improve accessibility in Joplin using AI. By combining modern captioning models with a well-designed plugin system, it ensures both usability and impact.
My prior experience with AI-powered automation and full-stack systems positions me to successfully deliver this project within the given timeline.
