GSoC 2026 Proposal Draft - Idea 5: Automatically label images using AI


Organization: Joplin
Mentors: Henry Heino, Caleb John, Shikuz
Difficulty: Medium
Expected size of project: 175 hours

TITLE

GSoC 2026 Proposal Draft Idea-5: "Automatically label images using AI" - MAHADEV KUMAR

Personal Information

Student Details

University Details

  • University: Indian Institute of Technology (ISM) Dhanbad
  • Degree: Bachelor of Technology
  • Branch: Civil Engineering
  • Current Year: 2nd
  • Expected Graduation: 2028

Background

  • I have a strong foundation in full-stack web development, primarily working with the MERN stack to build scalable and efficient applications. I have extensive experience with React.js, where I focus on building reusable component architectures, managing state using Redux Toolkit and Context API, and optimizing performance for better user experience.
  • I am highly proficient in JavaScript and TypeScript, with a clear understanding of core concepts such as asynchronous programming, closures, and the event loop.
  • I also have working knowledge of C++ and Python.
  • On the backend, I work with Node.js and Express.js to design and develop RESTful APIs, implement authentication systems using JWT, and handle secure data flow between client and server.
  • I have worked with MongoDB for database design, schema structuring, and query optimization through my projects, including an e-commerce site and a campus management system website.
  • I have implemented features like authentication, protected routes, and dynamic data rendering.
  • In addition to web development, I am actively exploring the integration of AI/ML into applications.
  • I have experience working with APIs like Gemini to build intelligent systems such as automated issue and pull request labeling.
  • I am also familiar with modern development workflows, including Git-based version control and contributing to open-source projects.
  • I have built an open-source GitHub project that walks newcomers through how to contribute to open source.
  • I have also contributed to OpenCV. [PR-Link]

Summary

This project introduces an AI-powered system for automatically generating descriptive labels for images in Joplin, significantly improving accessibility for visually impaired users. A working prototype has already been implemented using the Google Gemini API, demonstrating real-time caption generation.

Problem Statement

Joplin has a strong focus on accessibility. To enhance accessibility, we aim to use AI to automatically scan all images found within the notes and assign a descriptive label to each one. For instance, an image of the Mona Lisa could be labelled as "A portrait of a woman with an enigmatic smile, featuring a soft landscape background and masterful use of sfumato shading".

Proposed Solution

Architecture Overview

Joplin Plugin (React + TypeScript) → Image Extraction Layer (convert images to Base64) → AI Captioning Service (Local or API) → Description Storage (Metadata / Note DB) → UI for Displaying and Editing

Core Features

1. Automatic Image Detection

  • Scan notes for images (see the paginated scan sketch after this feature list)
  • Detect new or modified images

2. AI-Based Caption Generation

  • Generate detailed descriptions
  • Support:
    • Local models (privacy-first)
    • API-based models (performance)

3. Metadata Storage

  • Store captions as:
    • Alt text
    • Searchable metadata

4. User Interface

  • Edit captions manually
  • Re-generate descriptions
  • Enable/disable automation

5. Batch Processing

  • Scan entire notebook collections
  • Background processing support
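
Both the detection and batch-processing features above can be driven by Joplin's paginated data API. A minimal sketch, with the field selections and helper name kept illustrative:

```typescript
import joplin from 'api';

// Walk all notes page by page and collect the image resources attached to each.
async function findNotesWithImages() {
  const results: { noteId: string; imageResourceIds: string[] }[] = [];
  let page = 1;
  let hasMore = true;
  while (hasMore) {
    const notes = await joplin.data.get(['notes'], { fields: ['id', 'title'], page });
    for (const note of notes.items) {
      // Resources attached to this note, filtered down to image MIME types.
      const resources = await joplin.data.get(['notes', note.id, 'resources'], { fields: ['id', 'mime'] });
      const imageIds = resources.items
        .filter((r: any) => r.mime && r.mime.startsWith('image/'))
        .map((r: any) => r.id);
      if (imageIds.length) results.push({ noteId: note.id, imageResourceIds: imageIds });
    }
    hasMore = notes.has_more;
    page++;
  }
  return results;
}
```

New or modified images could be detected on top of the same loop by comparing each note's updated_time against the timestamp of the last processed run.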

Technical Approach

Frontend (Plugin)

  • TypeScript-based Joplin plugin (a minimal plugin skeleton follows this list)
  • React UI for:
    • Image preview panel
    • Caption editor
    • Settings dashboard
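
A minimal skeleton of the plugin entry point, assuming one command that captions the current note and a panel that will later host the React UI; all identifiers below are placeholders:

```typescript
import joplin from 'api';
import { MenuItemLocation } from 'api/types';

joplin.plugins.register({
  onStart: async () => {
    // Panel that will later host the image preview / caption editor UI.
    const panel = await joplin.views.panels.create('aiImageLabelerPanel');
    await joplin.views.panels.setHtml(panel, '<div id="root">Loading…</div>');

    // Command to caption the images of the currently selected note.
    await joplin.commands.register({
      name: 'aiLabelCurrentNote',
      label: 'Label images in current note',
      execute: async () => {
        const note = await joplin.workspace.selectedNote();
        if (!note) return;
        // captionNoteImages() would be the pipeline sketched in the other sections.
        // await captionNoteImages(note.id);
      },
    });

    await joplin.views.menuItems.create('aiLabelCurrentNoteMenu', 'aiLabelCurrentNote', MenuItemLocation.Tools);
  },
});
```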

AI Integration

Option A: Local (Privacy Mode)

  • FastAPI server (Python)

  • Models:

    • BLIP / Vision Transformers

Option B: API-Based (Recommended for this Project)

  • Google Gemini (leveraging prior experience; a caption-call sketch follows this list)

  • Faster and easier to scale

  • No separate backend server required
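
A rough sketch of the caption call, assuming the official @google/generative-ai SDK and a Base64-encoded image; the model name and prompt are placeholders to be finalized with mentors:

```typescript
import { GoogleGenerativeAI } from '@google/generative-ai';

async function generateCaption(base64Image: string, mimeType: string, apiKey: string): Promise<string> {
  const genAI = new GoogleGenerativeAI(apiKey);
  const model = genAI.getGenerativeModel({ model: 'gemini-1.5-flash' });

  // Send the prompt together with the inline image data.
  const result = await model.generateContent([
    'Describe this image in one or two sentences suitable as alt text for a screen reader.',
    { inlineData: { data: base64Image, mimeType } },
  ]);
  return result.response.text().trim();
}
```

Retry and fallback logic around this call would address the API-failure risk listed in the Risk Analysis section.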

Communication

  • Plugin ↔ AI service via HTTP API
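
For Option A, the plugin would reach the local FastAPI service over localhost. The endpoint path, port, and response shape below are assumptions to be settled during the design phase:

```typescript
// Hypothetical local captioning endpoint exposed by the FastAPI service.
const CAPTION_ENDPOINT = 'http://127.0.0.1:8000/caption';

async function captionViaLocalService(base64Image: string, mimeType: string): Promise<string> {
  const response = await fetch(CAPTION_ENDPOINT, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ image: base64Image, mime_type: mimeType }),
  });
  if (!response.ok) throw new Error(`Captioning service error: ${response.status}`);
  const data = await response.json();
  return data.caption; // assumed response field
}
```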

Implementation Timeline

Community Bonding Period (Before Coding Starts)

  • Set up development environment for Joplin plugin
  • Study Joplin plugin API and architecture
  • Discuss scope, milestones, and expectations with mentors
  • Finalize system design and AI approach

Week 1: Research & System Design

  • Analyze Joplin note and resource structure
  • Design complete architecture and data flow
  • Evaluate AI models (Gemini vs local models)
  • Finalize prompt strategy for caption generation

Week 2: Plugin Setup & Image Extraction

  • Initialize Joplin plugin using TypeScript
  • Implement command system and menu integration
  • Extract images from notes using resource IDs
  • Convert images to Base64 format
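
A sketch of the extraction step. It assumes joplin.data.get(['resources', id, 'file']) returns the raw resource bytes, which I will verify against the current plugin API during this week:

```typescript
import joplin from 'api';

// Fetch an attached image resource and return it as Base64 plus its MIME type.
async function resourceToBase64(resourceId: string): Promise<{ data: string; mimeType: string }> {
  const meta = await joplin.data.get(['resources', resourceId], { fields: ['id', 'mime'] });
  // Assumption: the 'file' sub-path returns the binary content of the resource.
  const file = await joplin.data.get(['resources', resourceId, 'file']);
  const bytes = file && file.body ? file.body : file;
  const data = Buffer.from(bytes).toString('base64');
  return { data, mimeType: meta.mime };
}
```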

Week 3: AI Integration (Core)

  • Integrate Google Gemini API
  • Implement image-to-caption pipeline
  • Parse and validate AI responses
  • Add error handling for failed requests

Week 4: Storage & Metadata Handling

  • Store captions in note metadata (see the alt-text sketch below)
  • Ensure persistence across sessions
  • Handle multiple images per note
  • Begin basic search integration
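
One way to persist captions so that they double as alt text is to write them directly into the note's Markdown image links. A sketch, assuming images are referenced with Joplin's standard ![alt](:/resourceId) syntax:

```typescript
import joplin from 'api';

// Insert or replace the alt text of a given image resource inside a note body.
async function writeCaptionAsAltText(noteId: string, resourceId: string, caption: string) {
  const note = await joplin.data.get(['notes', noteId], { fields: ['body'] });
  // Matches ![anything](:/resourceId) so an existing alt text is replaced too.
  const imageRef = new RegExp(`!\\[[^\\]]*\\]\\(:/${resourceId}\\)`, 'g');
  const updatedBody = note.body.replace(imageRef, `![${caption}](:/${resourceId})`);
  if (updatedBody !== note.body) {
    await joplin.data.put(['notes', noteId], null, { body: updatedBody });
  }
}
```

Because the caption then lives in the note body, Joplin's full-text search indexes it automatically, which covers the basic search integration goal for this week.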

Week 5: UI Development (Core Features)

  • Build React-based UI panel inside Joplin
  • Display images with generated captions
  • Add manual editing and re-generation options

Week 6: Advanced Features

  • Add settings panel (API key, automation toggle)
  • Implement batch processing for multiple notes
  • Improve prompt structure for better output quality
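
An example of the kind of prompt structure that could be iterated on here; the wording is illustrative and would be tuned based on caption quality and mentor feedback:

```typescript
// Draft prompt aimed at short, screen-reader-friendly alt text rather than long prose.
const CAPTION_PROMPT = [
  'You are generating alt text for a note-taking app.',
  'Describe the image in one or two sentences.',
  'Mention the main subject, any important visible text, and the overall setting.',
  'Do not start with phrases like "An image of" or "A picture of".',
].join('\n');
```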

Week 7: Optimization & Testing

  • Optimize performance for large datasets
  • Implement background processing
  • Handle edge cases and improve reliability
  • Fix bugs and refine user experience

Week 8: Finalization & Documentation

  • Complete documentation (user + developer)
  • Prepare demo (GIF/video/screenshots)
  • Code cleanup and final testing
  • Submit final project

The timeline is structured to deliver a functional MVP early (by Week 4), followed by iterative improvements, ensuring steady progress and continuous mentor feedback integration.

Prototype / Prior Implementation

I have already implemented a working prototype of this system:

  • Built an AI Image Describer web app using React + TypeScript
  • Integrated Google Gemini API for real-time caption generation
  • Implemented:
    • Drag-and-drop image upload
    • Base64 image processing
    • AI-generated descriptive outputs
  • Live Demo: link
  • GitHub link: link

Expected Outcomes

  • Fully functional Joplin plugin
  • Automated image caption generation
  • Improved accessibility and usability
  • Searchable image descriptions

Risk Analysis & Mitigation

  • API Failure / Rate Limits → Implement retry + fallback mechanism
  • Slow Processing → Batch processing and background execution
  • Low Caption Quality → Prompt tuning + manual editing support
  • Privacy Concerns → Optional local model support

Why me?

  • Built a working AI image captioning system using Gemini API

  • Experience integrating LLMs into real-world applications

  • Strong full-stack skills (React, TypeScript, JavaScript, Node.js)

  • Experience with automation systems (Learn-to-PR)

  • Familiar with open-source workflows and Git-based collaboration

Future Enhancements

  • Multi-language captions
  • Context-aware descriptions using note text
  • OCR integration
  • Voice output for accessibility

Previous Experience

1. Vehicle Prediction System [github-link]

  • Built a machine learning model to predict vehicle-related outcomes using historical data
  • Implemented data preprocessing, feature engineering, and model evaluation
  • Focused on accuracy optimization and real-world applicability
  • Built the frontend using HTML, CSS, and JavaScript, and the backend using Python and PyTorch

2. Learn-to-PR (AI-powered GitHub Assistant) [github-link]

  • Developed a system that integrates Google Gemini API
  • Automatically labels GitHub issues and pull requests
  • Also posts a welcome message to users who open an issue or pull request
  • Demonstrates real-world LLM (google-genai) integration and automation

3. Portfolio [github-link] [live-link]

  • Technologies that I used for making this:
    • Framework: Next.js
    • Language: TypeScript
    • Library: Tailwind CSS
  • Supports both light and dark modes
  • Fully responsive

Full-Stack & AI Skills

  • Languages: TypeScript, JavaScript, Python, C++

  • Frontend: React.js, Tailwind CSS

  • Backend: Node.js, FastAPI

  • Database: MongoDB, MySQL

  • AI/ML: Model integration, API-based LLM usage, data pipelines, OpenCV

  • Other: Plugin development, REST APIs, system design

Motivation

  • Accessibility is a critical but often underdeveloped feature in note-taking tools. Users relying on screen readers face limitations when images lack descriptions.
  • This project aligns with my goal of integrating AI into real-world applications and improving user experience at scale.
  • I have hands-on experience building full-stack and AI-integrated systems, making me well-equipped to deliver this solution efficiently.

Conclusion

This project delivers a practical, scalable solution to improve accessibility in Joplin using AI. By combining modern captioning models with a well-designed plugin system, it ensures both usability and impact.

My prior experience with AI-powered automation and full-stack systems positions me to successfully deliver this project within the given timeline.