GSoC 2026 Proposal Draft - Idea 5: Automatically label images using AI


Organization: Joplin
Mentors: Henry Heino, Caleb John, Shikuz
Difficulty: Medium
Expected size of project: 175 hours

TITLE

GSoC 2026 Proposal Draft Idea-5: "Automatically label images using AI" - MAHADEV KUMAR

Personal Information

Student Details

University Details

  • University: Indian Institute of Technology (ISM) Dhanbad
  • Degree: Bachelor of Technology
  • Branch: Civil Engineering
  • Current Year: 2nd
  • Expected Graduation: 2028

Background

  • I have a strong foundation in full-stack web development, primarily working with the MERN stack to build scalable and efficient applications. I have extensive experience with React.js, where I focus on building reusable component architectures, managing state using Redux Toolkit and Context API, and optimizing performance for better user experience.
  • I am highly proficient in JavaScript and TypeScript, with a clear understanding of core concepts such as asynchronous programming, closures, and the event loop.
  • I also have working knowledge of C++ and Python.
  • On the backend, I work with Node.js and Express.js to design and develop RESTful APIs, implement authentication systems using JWT, and handle secure data flow between client and server.
  • I have worked with MongoDB for database design, schema structuring, and query optimization through my projects, including an e-commerce site and a campus management system website.
  • I have implemented features like authentication, protected routes, and dynamic data rendering.
  • In addition to web development, I am actively exploring the integration of AI/ML into applications.
  • I have experience working with APIs like Gemini to build intelligent systems such as automated issue and pull request labeling.
  • I am also familiar with modern development workflows, including Git-based version control and contributing to open-source projects.
  • I have built an open-source GitHub project that walks newcomers through how to contribute to open source.
  • I have also contributed to OpenCV. [PR-Link]

Summary

This project introduces an AI-powered system for automatically generating descriptive labels for images in Joplin, significantly improving accessibility for visually impaired users. A working prototype has already been implemented using the Google Gemini API, demonstrating real-time caption generation.

Problem Statement

Joplin has a strong focus on accessibility. To enhance accessibility, we aim to use AI to automatically scan all images found within the notes and assign a descriptive label to each one. For instance, an image of the Mona Lisa could be labelled as "A portrait of a woman with an enigmatic smile, featuring a soft landscape background and masterful use of sfumato shading".

Proposed Solution

Architecture Overview

Joplin Plugin (React + TypeScript) → Image Extraction Layer (convert images to Base64) → AI Captioning Service (Local or API) → Description Storage (Metadata / Note DB) → UI for Displaying and Editing

Core Features

1. Automatic Image Detection

  • Scan notes for images (see the paginated scan sketch after this feature list)
  • Detect new or modified images

2. AI-Based Caption Generation

  • Generate detailed descriptions
  • Support:
    • Local models (privacy-first)
    • API-based models (performance)

3. Metadata Storage

  • Store captions as:
    • Alt text
    • Searchable metadata

4. User Interface

  • Edit captions manually
  • Re-generate descriptions
  • Enable/disable automation

5. Batch Processing

  • Scan entire notebook collections
  • Background processing support
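
Both the detection and batch-processing features above can be driven by Joplin's paginated data API. A minimal sketch, with the field selections and helper name kept illustrative:

```typescript
import joplin from 'api';

// Walk all notes page by page and collect the image resources attached to each.
async function findNotesWithImages() {
  const results: { noteId: string; imageResourceIds: string[] }[] = [];
  let page = 1;
  let hasMore = true;
  while (hasMore) {
    const notes = await joplin.data.get(['notes'], { fields: ['id', 'title'], page });
    for (const note of notes.items) {
      // Resources attached to this note, filtered down to image MIME types.
      const resources = await joplin.data.get(['notes', note.id, 'resources'], { fields: ['id', 'mime'] });
      const imageIds = resources.items
        .filter((r: any) => r.mime && r.mime.startsWith('image/'))
        .map((r: any) => r.id);
      if (imageIds.length) results.push({ noteId: note.id, imageResourceIds: imageIds });
    }
    hasMore = notes.has_more;
    page++;
  }
  return results;
}
```

New or modified images could be detected on top of the same loop by comparing each note's updated_time against the timestamp of the last processed run.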

Technical Approach

Frontend (Plugin)

  • TypeScript-based Joplin plugin (a minimal plugin skeleton follows this list)
  • React UI for:
    • Image preview panel
    • Caption editor
    • Settings dashboard
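
A minimal skeleton of the plugin entry point, assuming one command that captions the current note and a panel that will later host the React UI; all identifiers below are placeholders:

```typescript
import joplin from 'api';
import { MenuItemLocation } from 'api/types';

joplin.plugins.register({
  onStart: async () => {
    // Panel that will later host the image preview / caption editor UI.
    const panel = await joplin.views.panels.create('aiImageLabelerPanel');
    await joplin.views.panels.setHtml(panel, '<div id="root">Loading…</div>');

    // Command to caption the images of the currently selected note.
    await joplin.commands.register({
      name: 'aiLabelCurrentNote',
      label: 'Label images in current note',
      execute: async () => {
        const note = await joplin.workspace.selectedNote();
        if (!note) return;
        // captionNoteImages() would be the pipeline sketched in the other sections.
        // await captionNoteImages(note.id);
      },
    });

    await joplin.views.menuItems.create('aiLabelCurrentNoteMenu', 'aiLabelCurrentNote', MenuItemLocation.Tools);
  },
});
```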

AI Integration

Option A: Local (Privacy Mode)

  • FastAPI server (Python)

  • Models:

    • BLIP / Vision Transformers

Option B: API-Based (Recommended for this Project)

  • Google Gemini (leveraging prior experience; a caption-call sketch follows this list)

  • Faster and easier to scale

  • No separate backend server required
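
A rough sketch of the caption call, assuming the official @google/generative-ai SDK and a Base64-encoded image; the model name and prompt are placeholders to be finalized with mentors:

```typescript
import { GoogleGenerativeAI } from '@google/generative-ai';

async function generateCaption(base64Image: string, mimeType: string, apiKey: string): Promise<string> {
  const genAI = new GoogleGenerativeAI(apiKey);
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });

  // Send the prompt together with the inline image data.
  const result = await model.generateContent([
    'Describe this image in one or two sentences suitable as alt text for a screen reader.',
    { inlineData: { data: base64Image, mimeType } },
  ]);
  return result.response.text().trim();
}
```

Retry and fallback logic around this call would address the API-failure risk listed in the Risk Analysis section.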

Communication

  • Plugin ↔ AI service via HTTP API
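
For Option A, the plugin would reach the local FastAPI service over localhost. The endpoint path, port, and response shape below are assumptions to be settled during the design phase:

```typescript
// Hypothetical local captioning endpoint exposed by the FastAPI service.
const CAPTION_ENDPOINT = 'http://127.0.0.1:8000/caption';

async function captionViaLocalService(base64Image: string, mimeType: string): Promise<string> {
  const response = await fetch(CAPTION_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image: base64Image, mime_type: mimeType }),
  });
  if (!response.ok) throw new Error(`Captioning service error: ${response.status}`);
  const data = await response.json();
  return data.caption; // assumed response field
}
```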

Implementation Timeline

Community Bonding Period (Before Coding Starts)

  • Set up development environment for Joplin plugin
  • Study Joplin plugin API and architecture
  • Discuss scope, milestones, and expectations with mentors
  • Finalize system design and AI approach

Week 1: Research & System Design

  • Analyze Joplin note and resource structure
  • Design complete architecture and data flow
  • Evaluate AI models (Gemini vs local models)
  • Finalize prompt strategy for caption generation

Week 2: Plugin Setup & Image Extraction

  • Initialize Joplin plugin using TypeScript
  • Implement command system and menu integration
  • Extract images from notes using resource IDs
  • Convert images to Base64 format
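
A sketch of the extraction step. It assumes joplin.data.get(['resources', id, 'file']) returns the raw resource bytes, which I will verify against the current plugin API during this week:

```typescript
import joplin from 'api';

// Fetch an attached image resource and return it as Base64 plus its MIME type.
async function resourceToBase64(resourceId: string): Promise<{ data: string; mimeType: string }> {
  const meta = await joplin.data.get(['resources', resourceId], { fields: ['id', 'mime'] });
  // Assumption: the 'file' sub-path returns the binary content of the resource.
  const file = await joplin.data.get(['resources', resourceId, 'file']);
  const bytes = file && file.body ? file.body : file;
  const data = Buffer.from(bytes).toString('base64');
  return { data, mimeType: meta.mime };
}
```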

Week 3: AI Integration (Core)

  • Integrate Google Gemini API
  • Implement image-to-caption pipeline
  • Parse and validate AI responses
  • Add error handling for failed requests

Week 4: Storage & Metadata Handling

  • Store captions in note metadata (see the alt-text sketch below)
  • Ensure persistence across sessions
  • Handle multiple images per note
  • Begin basic search integration
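
One way to persist captions so that they double as alt text is to write them directly into the note's Markdown image links. A sketch, assuming images are referenced with Joplin's standard ![alt](:/resourceId) syntax:

```typescript
import joplin from 'api';

// Insert or replace the alt text of a given image resource inside a note body.
async function writeCaptionAsAltText(noteId: string, resourceId: string, caption: string) {
  const note = await joplin.data.get(['notes', noteId], { fields: ['body'] });
  // Matches ![anything](:/resourceId) so an existing alt text is replaced too.
  const imageRef = new RegExp(`!\\[[^\\]]*\\]\\(:/${resourceId}\\)`, 'g');
  const updatedBody = note.body.replace(imageRef, `![${caption}](:/${resourceId})`);
  if (updatedBody !== note.body) {
    await joplin.data.put(['notes', noteId], null, { body: updatedBody });
  }
}
```

Because the caption then lives in the note body, Joplin's full-text search indexes it automatically, which covers the basic search integration goal for this week.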

Week 5: UI Development (Core Features)

  • Build React-based UI panel inside Joplin
  • Display images with generated captions
  • Add manual editing and re-generation options

Week 6: Advanced Features

  • Add settings panel (API key, automation toggle)
  • Implement batch processing for multiple notes
  • Improve prompt structure for better output quality
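
An example of the kind of prompt structure that could be iterated on here; the wording is illustrative and would be tuned based on caption quality and mentor feedback:

```typescript
// Draft prompt aimed at short, screen-reader-friendly alt text rather than long prose.
const CAPTION_PROMPT = [
  'You are generating alt text for a note-taking app.',
  'Describe the image in one or two sentences.',
  'Mention the main subject, any important visible text, and the overall setting.',
  'Do not start with phrases like "An image of" or "A picture of".',
].join('\n');
```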

Week 7: Optimization & Testing

  • Optimize performance for large datasets
  • Implement background processing
  • Handle edge cases and improve reliability
  • Fix bugs and refine user experience

Week 8: Finalization & Documentation

  • Complete documentation (user + developer)
  • Prepare demo (GIF/video/screenshots)
  • Code cleanup and final testing
  • Submit final project

The timeline is structured to deliver a functional MVP early (by Week 4), followed by iterative improvements, ensuring steady progress and continuous mentor feedback integration.

Prototype / Prior Implementation

I have already implemented a working prototype of this system:

  • Built an AI Image Describer web app using React + TypeScript
  • Integrated Google Gemini API for real-time caption generation
  • Implemented:
    • Drag-and-drop image upload
    • Base64 image processing
    • AI-generated descriptive outputs
  • Live Demo: link
  • GitHub link: link

Expected Outcomes

  • Fully functional Joplin plugin
  • Automated image caption generation
  • Improved accessibility and usability
  • Searchable image descriptions

Risk Analysis & Mitigation

  • API Failure / Rate Limits → Implement retry + fallback mechanism
  • Slow Processing → Batch processing and background execution
  • Low Caption Quality → Prompt tuning + manual editing support
  • Privacy Concerns → Optional local model support

Why me?

  • Built a working AI image captioning system using Gemini API

  • Experience integrating LLMs into real-world applications

  • Strong full-stack skills (React, TypeScript, JavaScript, Node.js)

  • Experience with automation systems (Learn-to-PR)

  • Familiar with open-source workflows and Git-based collaboration

Future Enhancements

  • Multi-language captions
  • Context-aware descriptions using note text
  • OCR integration
  • Voice output for accessibility

Previous Experience

1. Vehicle Prediction System [github-link]

  • Built a machine learning model to predict vehicle-related outcomes using historical data
  • Implemented data preprocessing, feature engineering, and model evaluation
  • Focused on accuracy optimization and real-world applicability
  • Built the frontend using HTML, CSS, and JavaScript, and the backend using Python and PyTorch

2. Learn-to-PR (AI-powered GitHub Assistant) [github-link]

  • Developed a system that integrates Google Gemini API
  • Automatically labels GitHub issues and pull requests
  • Also posts a welcome message to users who open an issue or pull request
  • Demonstrates real-world LLM (google-genai) integration and automation

3. Portfolio [github-link] [live-link]

  • Technologies that I used for making this:
    • Framework: Next.js
    • Language: TypeScript
    • Library: Tailwind CSS
  • Supports both light and dark modes
  • Fully responsive

Full-Stack & AI Skills

  • Languages: TypeScript, JavaScript, Python, C++

  • Frontend: React.js, Tailwind CSS

  • Backend: Node.js, FastAPI

  • Database: MongoDB, MySQL

  • AI/ML: Model integration, API-based LLM usage, data pipelines, OpenCV

  • Other: Plugin development, REST APIs, system design

Motivation

  • Accessibility is a critical but often underdeveloped feature in note-taking tools. Users relying on screen readers face limitations when images lack descriptions.
  • This project aligns with my goal of integrating AI into real-world applications and improving user experience at scale.
  • I have hands-on experience building full-stack and AI-integrated systems, making me well-equipped to deliver this solution efficiently.

Conclusion

This project delivers a practical, scalable solution to improve accessibility in Joplin using AI. By combining modern captioning models with a well-designed plugin system, it ensures both usability and impact.

My prior experience with AI-powered automation and full-stack systems positions me to successfully deliver this project within the given timeline.