GSoC 2026 Proposal Draft – Idea 5: Automatically label images using AI – Vinayreddy765

Links


1. Introduction

My name is Vinay Reddy, and I am currently pursuing a Bachelor of Engineering in Information Science and Engineering at Global Academy of Technology, Bengaluru, India (expected graduation: 2027). I am based in IST (UTC+5:30).

I have been contributing to the Joplin codebase since early 2026, with three merged pull requests in the desktop application. Through these contributions I became familiar with Joplin's architecture — specifically the encryption settings UI, resource handling, and note model. For this proposal I studied Resource.ts, Note.ts, and ResourceService.ts directly to understand how images are stored and processed internally.

Joplin has a strong focus on accessibility. However, images embedded in notes currently carry no descriptive alt text — they appear in Markdown as:

```markdown
![](:/resourceId)
```

This means screen readers encounter a completely blank description for every image, leaving visually impaired users with no information about the content. For all users, image content is also invisible to Joplin's search system.

AI vision models can now generate accurate, detailed image descriptions automatically. This project proposes to build a Joplin plugin that scans all images in a user's note collection, calls an AI vision model to generate a description for each one, and writes the description back as Markdown alt text:

```markdown
![A portrait of a woman with an enigmatic smile, featuring a soft landscape background](:/resourceId)
```

This directly addresses the GSoC idea's expected outcome and makes Joplin's image content genuinely accessible for the first time.


2. Project Summary

Problem: Images in Joplin notes have empty alt text, making them inaccessible to screen readers and invisible to search.

What will be implemented: A Joplin plugin built with TypeScript using the official Plugin API that scans image resources, generates AI descriptions, and writes them back into note bodies as Markdown alt text. The plugin will support both a local AI provider (Ollama — free, private, no data leaves the device) and a cloud provider (OpenAI — opt-in, requires user to provide their own API key).

Expected outcome: Every image with missing or weak alt text in the user's note collection gets a meaningful accessibility description automatically, improving accessibility for visually impaired users and making image content searchable.


3. Technical Approach

3.1 How Joplin Stores Images — From Codebase Study

Images in Joplin are stored as resources and referenced in note bodies using resource IDs. They can appear in two formats — Markdown ![](:/resourceId) or HTML <img src=":/resourceId"/>. Joplin already has an existing OCR pipeline that follows a similar scan-process-write-back pattern. This plugin follows the same logical approach at the plugin level, building on existing resource handling patterns found in Resource.ts and Note.ts.
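The detection step could be sketched roughly as follows. The regexes are illustrative, not final — in particular, the HTML pattern would need hardening for extra attributes and quoting variants during implementation (Joplin resource IDs are 32-character hex strings):

```typescript
// Sketch: detect image references in a note body in both Markdown and
// HTML form. Regexes are illustrative and would be refined during
// implementation.
interface ImageRef {
  resourceId: string;
  altText: string;
  raw: string; // the full matched string, used later for replacement
}

function findImageRefs(body: string): ImageRef[] {
  const refs: ImageRef[] = [];
  // Markdown form: ![alt](:/32charhexid)
  const md = /!\[([^\]]*)\]\(:\/([0-9a-f]{32})\)/g;
  // HTML form: <img src=":/32charhexid" ...>
  const html = /<img[^>]*src=["']:\/([0-9a-f]{32})["'][^>]*>/g;
  let m: RegExpExecArray | null;
  while ((m = md.exec(body)) !== null) {
    refs.push({ resourceId: m[2], altText: m[1], raw: m[0] });
  }
  while ((m = html.exec(body)) !== null) {
    const alt = /alt=["']([^"']*)["']/.exec(m[0]);
    refs.push({ resourceId: m[1], altText: alt ? alt[1] : '', raw: m[0] });
  }
  return refs;
}
```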

3.2 Architecture

The plugin follows this pipeline:

Figure 1: All processing occurs locally via Ollama — no image data leaves the user's device.

The plugin reads image resources from the note via the Joplin Plugin API, prepares images for AI processing, and sends them to the configured AI provider. The provider returns a concise accessibility description which the plugin writes back into the note body as Markdown alt text. When using Ollama, this entire process happens locally on the user's device — no data is sent externally.

3.3 Plugin File Structure

```
joplin-plugin-image-alt-text/
├── src/
│   ├── index.ts (plugin entry point, Tools menu)
│   ├── AltTextService.ts (core scanning and processing)
│   ├── ReviewDialog.ts (user review before applying changes)
│   ├── NoteUpdater.ts (writes alt text back to notes)
│   └── providers/
│       ├── AltTextProvider.ts (common interface)
│       ├── OllamaProvider.ts (local AI — free, private)
│       └── OpenAIProvider.ts (cloud AI — opt-in)
├── manifest.json
└── package.json
```


3.4 How the Plugin Works

The plugin reads image resources from the note via the Joplin Plugin API, converts them to Base64, and sends them to the configured AI provider. The provider returns a concise accessibility description which the plugin presents to the user for review before writing back into the note body as alt text. The provider layer is abstracted behind a common interface — both Ollama and OpenAI work identically from the user's perspective. Images are detected in both Markdown and HTML formats to ensure no images are missed.
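The provider abstraction could look roughly like this. The interface shape is a sketch (the file names match the proposed structure, but the exact method signature would be settled with mentor feedback); the stub shows how the rest of the pipeline can be unit-tested without any AI backend:

```typescript
// Sketch of the common provider interface from providers/AltTextProvider.ts.
// Both OllamaProvider and OpenAIProvider would implement it, so the rest
// of the plugin never needs to know which backend is in use.
interface AltTextProvider {
  readonly name: string;
  // base64Image: the image bytes, already resized if needed
  // mimeType: e.g. 'image/png'
  // Returns a one-sentence accessibility description.
  describeImage(base64Image: string, mimeType: string): Promise<string>;
}

// A trivial stub provider, useful for unit tests of the surrounding
// pipeline without any AI backend installed.
class StubProvider implements AltTextProvider {
  readonly name = 'stub';
  async describeImage(_base64Image: string, _mimeType: string): Promise<string> {
    return 'Placeholder description';
  }
}
```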


3.5 Privacy Design

The plugin is privacy-first by default:

Ollama (default): Runs entirely locally. No image data ever leaves the user's device. No API key required. Free.

OpenAI (opt-in): Only enabled if the user explicitly selects it and provides their own API key. A clear warning is displayed explaining that images are sent to an external server.

This directly aligns with Joplin's privacy-first philosophy.


3.6 Why This Design Works

This design ensures:

  • Non-intrusive workflow — generation is always manually triggered by default. The plugin never modifies notes without the user's explicit action.
  • Full user control — nothing is written to a note without passing through the review dialog first. Users can accept, edit, or skip every description.
  • Privacy by default — Ollama processes everything locally on the user's device. No image data leaves the machine unless the user explicitly opts into OpenAI.
  • Safe for existing content — the plugin classifies alt text before processing. Meaningful descriptions written by the user are never overwritten.
  • Consistent experience — the plugin works across both Markdown and Rich Text editors, adapting its UI to match what the user sees in each editor.

3.7 Behaviour and Edge Cases

Handling Existing Alt Text

Images embedded in notes may already contain alt text. The plugin distinguishes between three cases:

Case 1 — Missing Alt Text

Example:

![](:/resourceId)

or

![image.png](:/resourceId)

In this case, the plugin automatically generates an accessibility description and replaces the alt text.

Example result:

![Screenshot of an e-commerce product page showing Nike football shoes with price and size options.](:/resourceId)


Case 2 — Meaningful Existing Alt Text

Example:

![Nike Phantom football shoes product page](:/resourceId)

The plugin will not overwrite this text by default.

Instead, the image is skipped and recorded as already having descriptive alt text.

This ensures that user-written descriptions are never replaced automatically.


Case 3 — Weak Alt Text

Example:

![IMG_2034.png](:/resourceId)

These cases usually originate from file names rather than real descriptions.

The plugin can optionally replace these descriptions if the user enables the setting:

Replace weak alt text

The plugin uses heuristics to identify weak alt text — such as filenames, very short strings, and common auto-generated patterns. Distinguishing weak from meaningful alt text is a nuanced problem that will be refined through testing and mentor feedback during implementation.
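As a starting point, the classifier could be sketched like this. The thresholds and patterns below are assumptions to be tuned during testing, not final rules:

```typescript
// Sketch of the alt-text classifier. The heuristics (filename patterns,
// very short strings) are a starting point and would be refined with
// mentor feedback during implementation.
type AltTextClass = 'missing' | 'weak' | 'meaningful';

// Matches filename-style alt text such as "IMG_2034.png"
const FILENAME_PATTERN = /^[\w\- ]+\.(png|jpe?g|webp|gif)$/i;

function classifyAltText(alt: string): AltTextClass {
  const trimmed = alt.trim();
  if (trimmed.length === 0) return 'missing';
  if (FILENAME_PATTERN.test(trimmed)) return 'weak';
  // Very short strings like "img" carry no real information
  if (trimmed.length < 5) return 'weak';
  return 'meaningful';
}
```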


Regeneration Rules

Users may want to regenerate alt text when:

  • the AI description is inaccurate

  • the image content changes

  • a different AI provider or model is selected

The plugin supports regeneration through:

Tools → Regenerate Alt Text

Rules:

  • regeneration applies only to images selected by the user

  • existing alt text is replaced only when regeneration is explicitly triggered

  • a preview dialog is shown before applying changes

This prevents accidental overwriting. When the user triggers regeneration, the plugin identifies all images with existing alt text in the current note, presents them one by one in the review dialog, and only replaces descriptions the user explicitly accepts.

Batch Processing

Notes may contain multiple images. The plugin supports batch processing with the following workflow:

  1. Scan the note for image resources

  2. Identify images requiring alt text generation

  3. Process images sequentially through the AI provider

  4. Update the note body after processing completes

Example:

Before:

![image1.png](:/id1)

![image2.png](:/id2)

After:

![Screenshot of a product page showing Nike football shoes.](:/id1)

![Digital certificate showing NPTEL course completion details.](:/id2)

Images that already contain meaningful alt text are skipped automatically.

Batch processing will also support:

  • progress reporting

  • safe cancellation

  • skipping previously processed images

Joplin's data API returns results in paginated batches. The plugin iterates through all pages using the has_more flag to ensure every note in the collection is scanned.
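The pagination loop could be sketched as below. The page fetcher is injected so the loop can be unit-tested without a running Joplin instance; in the plugin it would wrap a call along the lines of joplin.data.get(['notes'], { fields, page }):

```typescript
// Sketch of paginated scanning over Joplin's data API, which returns
// results as { items, has_more } pages.
interface Page<T> {
  items: T[];
  has_more: boolean;
}

async function fetchAllPages<T>(
  getPage: (page: number) => Promise<Page<T>>,
): Promise<T[]> {
  const all: T[] = [];
  let page = 1;
  for (;;) {
    const res = await getPage(page);
    all.push(...res.items);
    if (!res.has_more) break; // last page reached
    page++;
  }
  return all;
}
```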


4. User Review and Control

Users maintain full control over changes made by the plugin.

Manual Mode (Default)

Users manually trigger generation:

Tools → Generate Alt Text

The plugin processes images only in the current note.


Preview Mode

Before applying changes, users can review generated descriptions:

Example:

AI Alt Text Suggestions — 4 images found

Image 1 of 4

Generated Description

Accept

Edit

Skip

Users can:

  • accept the suggestion

  • edit the description

  • skip the image

When the user selects Edit, an inline text field appears allowing them to correct the description before saving.


Undo Support

Before modifying any note, the plugin stores the original note body in memory. Users can revert the entire batch operation using Undo Last Batch, restoring all notes to their previous state.
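The snapshot mechanism could be sketched as follows. The save callback is injected for testability; in the plugin it would call the data API's note-update endpoint:

```typescript
// Sketch of the undo mechanism: original note bodies are snapshotted
// in memory before a batch run, so the whole batch can be reverted.
class UndoBuffer {
  private snapshots = new Map<string, string>();

  remember(noteId: string, originalBody: string): void {
    // Keep only the first snapshot per note, so repeated edits within
    // one batch still revert to the true original.
    if (!this.snapshots.has(noteId)) {
      this.snapshots.set(noteId, originalBody);
    }
  }

  async undoAll(save: (noteId: string, body: string) => Promise<void>): Promise<number> {
    let restored = 0;
    for (const [id, body] of this.snapshots) {
      await save(id, body);
      restored++;
    }
    this.snapshots.clear();
    return restored;
  }
}
```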

Automatic Mode (Optional)

Users can enable automatic generation for newly inserted images.

Workflow:

Insert image

Plugin detects image

AI generates description

Alt text inserted automatically

This behaviour is configurable in plugin settings.


5. Error Handling

The plugin is designed to handle common edge cases gracefully, ensuring it never crashes or corrupts note content. The following cases have been identified and will be handled during implementation.

Edge Case Handling
Unsupported file types Only PNG, JPG, JPEG, WEBP are processed; others silently skipped
Encrypted resources Skipped — encrypted blobs cannot be read; reported in summary
AI runtime unavailable Clear message: 'AI runtime not detected. Please start Ollama.'
Ollama model not installed Shows exact install command: ollama pull llava
Large images Resized before sending; original file on disk never modified
Network failure mid-batch Skip failed image, continue, offer Retry Failed at end
Same resource in multiple notes All notes updated, description generated once and cached
Note modified during batch Skip to avoid conflict, reported in summary
OpenAI rate limiting Wait 5 seconds and retry once; report if retry also fails
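The rate-limit row could be implemented as a small generic helper. The sleep function is injected so tests do not actually wait; the 5-second default mirrors the behaviour described above:

```typescript
// Sketch of "wait 5 seconds and retry once": if the first attempt
// fails, wait, retry once, and let a second failure propagate to the
// caller (which reports it in the batch summary).
async function withRetryOnce<T>(
  op: () => Promise<T>,
  sleep: (ms: number) => Promise<void>,
  waitMs = 5000,
): Promise<T> {
  try {
    return await op();
  } catch {
    await sleep(waitMs);
    return await op(); // if this also fails, the error propagates
  }
}
```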

6. Potential Challenges and Solutions

AI accuracy and user trust: Vision models can occasionally produce vague or inaccurate descriptions. The plugin addresses this through careful prompt engineering and by always showing descriptions to the user for review before writing anything to notes. Users can edit any description before accepting it.

Local AI availability: The plugin requires Ollama to be running locally for the default provider. If Ollama is unavailable or the required model is not installed, the plugin will show a clear, actionable message guiding the user to resolve the issue rather than failing silently.

Large images and performance: High-resolution images may slow processing. Large images will be resized before sending to the AI provider without modifying the original file. Users can also configure lighter models for faster processing on lower-end hardware.

Encrypted resources: Joplin supports end-to-end encryption. Encrypted resources cannot be read at the plugin level and will be skipped gracefully, with a note in the completion summary.

Cloud provider limitations: When using OpenAI, rate limiting and API errors will be handled with appropriate retry logic and clear user messaging. Users are also warned before enabling the cloud provider that images will leave their device.

Cross-platform compatibility: Primary development and testing will be on Windows, with Linux tested via virtual machine. macOS compatibility will be validated with community assistance during the testing phase.


AI Runtime Considerations

Joplin plugins run inside a sandboxed environment which restricts the use of native modules directly within the plugin process.

For local AI inference, the plugin will therefore rely on runtime environments that work within this sandbox. The default approach uses the Ollama HTTP interface — Ollama runs as a separate local process outside the sandbox, and the plugin communicates with it via a simple HTTP API. This avoids all sandbox restrictions while keeping processing fully local on the user's device.
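A request to Ollama's local HTTP endpoint could be built roughly as below. The payload shape follows Ollama's /api/generate API (model, prompt, base64 images, non-streaming mode); the prompt wording and the llava default are assumptions to be refined:

```typescript
// Sketch: build the JSON body for a POST to
// http://localhost:11434/api/generate, Ollama's local HTTP endpoint.
interface OllamaRequest {
  model: string;
  prompt: string;
  images: string[]; // base64-encoded image data
  stream: boolean;
}

function buildOllamaRequest(base64Image: string, model = 'llava'): OllamaRequest {
  return {
    model,
    // Placeholder prompt; the final wording is part of the prompt
    // engineering work planned for Week 11.
    prompt: 'Describe this image in one concise sentence suitable as accessibility alt text.',
    images: [base64Image],
    stream: false, // one complete JSON response instead of a token stream
  };
}

// Usage in OllamaProvider (not executed here):
//   const res = await fetch('http://localhost:11434/api/generate', {
//     method: 'POST',
//     body: JSON.stringify(buildOllamaRequest(imageBase64)),
//   });
//   const description = (await res.json()).response;
```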

Alternative approaches such as WASM-based inference runtimes may also be explored during implementation as they can run directly inside the plugin sandbox without requiring a separate installation. The performance tradeoffs between these approaches will be evaluated with mentor guidance during the project.

7. User Experience and User Flows

The plugin is designed to be non-intrusive and fully controllable. Users are never surprised by changes to their notes.


Flow 1 — Single Note Mode (Default)

User opens a note and clicks Tools → Generate Alt Text. The plugin scans the note and presents each image for review one at a time:

AI Alt Text Review — 4 images found

Image 1 of 4 [Image preview]

Proposed description:

"A red torii gate surrounded by autumn maple trees at dusk."

Actions: [Accept] [Edit] [Skip]

Batch actions: [Accept All] [Cancel]

Nothing is written to the note until the user clicks Accept or Accept All.


Flow 2 — Batch Mode (All Notes)

User clicks Tools → Generate Alt Text (All Notes). The plugin processes the entire note collection in the background:

Processing notes…

47 of 312 scanned — 12 images labeled

Cancel

On completion:

✓ Processing complete

47 images labeled across 23 notes

Actions:

[View Changes] [Undo] [Close]

Undo is available for 60 seconds. If a note was modified during processing, it is skipped and reported in the summary.


Flow 3 — Auto-label Mode (Optional)

User enables auto-labeling in plugin settings. When a new image is inserted:

Insert image

Plugin detects new image

AI generates description in background

Alt text inserted automatically

User can edit at any time

This mode is off by default — users must explicitly enable it.


Flow 4 — Markdown Editor vs Rich Text Editor

The plugin adapts its UI based on which editor the user is working in:

| | Markdown Editor | Rich Text Editor |
|---|---|---|
| User sees | Raw syntax | Rendered image visually |
| How to trigger | Tools → Generate Alt Text | Right-click image → Generate Alt Text |
| Review dialog | Shows Markdown with proposed text | Shows rendered image with input field |
| Edit alt text | Edit text directly | Inline input field overlay on image |

Markdown Editor

When the plugin detects an image without alt text:

![](:/resourceId)

Missing alt text

User options:

[Generate with AI] [Write manually]


Rich Text Editor

Because the Rich Text editor hides Markdown syntax, the plugin exposes alt-text actions through the UI.

Available actions:

  • Right-click image → Generate Alt Text

  • Toolbar button → Alt Text

The review dialog then displays the image preview alongside the generated description.

When the user accepts a description in the Rich Text editor, the plugin writes the alt text back to the underlying note body by updating the alt attribute of the corresponding img tag directly — for example changing <img src=":/resourceId" alt=""/> to <img src=":/resourceId" alt="Generated description"/>.

This ensures the alt text is correctly stored in the note regardless of which editor the user is working in, and is immediately visible if the user switches to the Markdown editor.

Images in notes can appear as either Markdown or HTML format. The plugin detects and handles both formats automatically — for Markdown images the alt text is written by replacing the text inside the square brackets, and for HTML images by updating the alt attribute directly. This ensures no images are missed and alt text is correctly stored regardless of how the image was inserted.
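The write-back step for both formats could be sketched as follows. The regexes are illustrative and would be hardened during implementation (quoting variants, multiple occurrences, escaping):

```typescript
// Sketch: replace the alt text of one specific image reference,
// covering both Markdown and HTML forms. Resource IDs are hex strings,
// so interpolating them into a RegExp is safe here.
function writeAltText(body: string, resourceId: string, description: string): string {
  // Markdown: replace the text inside the square brackets
  const md = new RegExp(`!\\[[^\\]]*\\]\\(:\\/${resourceId}\\)`);
  if (md.test(body)) {
    return body.replace(md, `![${description}](:/${resourceId})`);
  }
  // HTML: update (or add) the alt attribute on the matching <img> tag
  const html = new RegExp(`<img([^>]*src=["']:\\/${resourceId}["'][^>]*)>`);
  return body.replace(html, (_tag, attrs: string) => {
    if (/alt=["'][^"']*["']/.test(attrs)) {
      return `<img${attrs.replace(/alt=["'][^"']*["']/, `alt="${description}"`)}>`;
    }
    return `<img${attrs} alt="${description}">`;
  });
}
```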

In all cases the user remains in control — no description is written to a note without explicit user action, except in optional auto-label mode which must be enabled in settings.

8. Evaluation and Performance Metrics

To ensure the effectiveness and reliability of the proposed plugin, the project will be evaluated using both performance metrics and quality evaluation metrics.

8.1 Performance Metrics

The performance of the plugin will be measured based on the following criteria:

Processing Time per Image

  • Measure the average time required to generate alt text for a single image.

  • Compare performance between local Ollama models and cloud-based models.

Batch Processing Performance

  • Evaluate the time required to process multiple images within a note or across the entire note collection.

  • Ensure the plugin processes images sequentially to maintain UI responsiveness.

Memory and Resource Usage

  • Monitor CPU and memory usage during image processing to ensure the plugin runs efficiently on typical user hardware.

Scalability

  • Test performance on notes containing a large number of images to ensure the plugin remains stable and responsive.

8.2 Evaluation Metrics

The quality and usefulness of generated alt text descriptions will be evaluated using the following criteria:

Accuracy of Image Descriptions

  • Verify that generated alt text correctly describes the primary subject and key elements of the image.

Accessibility Quality

  • Ensure descriptions follow accessibility guidelines by focusing on meaningful visual information useful for screen reader users.

Consistency of Output

  • Evaluate whether the generated descriptions maintain a consistent style and level of detail across different images.

Acceptance rate of generated descriptions (percentage accepted without editing during user testing).

Cross-Platform Reliability

  • Validate that the plugin functions correctly on Windows, macOS, and Linux environments.

8.3 Testing Strategy

The plugin will be tested at two levels:

Manual Testing

  • Test with different image types: PNG, JPG, JPEG, WEBP
  • Test with notes containing no alt text, weak alt text, and meaningful existing alt text
  • Test single note mode and batch mode with a collection of notes
  • Test cancellation mid-batch and undo functionality
  • Test with Ollama provider on Windows and Linux
  • Test error scenarios: Ollama not running, invalid API key, encrypted resources
  • Test in both Markdown editor and Rich Text editor

Automated Testing

  • Run Joplin's existing test suite after each change using yarn test to ensure no regressions
  • Write basic unit tests for core logic such as alt text classification (missing, weak, meaningful) and resource ID extraction from note bodies
  • Write integration tests for the Ollama provider connection and note body update logic

9. Implementation Plan

Community Bonding Period (May 8 – May 25)

  • Study Joplin Plugin API documentation in depth
  • Explore existing OCR pipeline in Resource.ts to understand patterns to follow
  • Discuss provider strategy, privacy requirements, and plugin architecture with mentors
  • Finalize design decisions (model selection UI, settings structure, error handling approach)
  • Set up development environment and plugin scaffold

Week 1–2 (May 26 – June 7) Plugin Foundation and Resource Scanning

  • Implement plugin entry point (index.ts) with command registration
  • Implement resource detection logic for images referenced in notes.
  • Implement image loading and preparation for AI processing.
  • Implement core image scanning service for current note.
  • Write unit tests for image scanning and processing logic.

Week 3–4 (June 8 – June 21) Local AI Integration — Ollama Provider

  • Implement AltTextProvider.ts interface
  • Implement OllamaProvider.ts — connect to local Ollama runtime via HTTP API
  • Implement NoteUpdater.ts — safely replace Markdown alt text using regex
  • Write integration tests for Ollama provider and note body replacement
  • Test with multiple image types (PNG, JPEG, WebP)
  • Implement ReviewDialog.ts — preview mode with Accept, Edit, Skip per image

Week 5–6 (June 22 – July 5) Cloud AI Integration and Edge Cases

  • Implement OpenAIProvider.ts — opt-in cloud provider with user API key
  • Handle all edge cases:
    • Skip images that already have meaningful alt text
    • Handle encrypted resources gracefully
    • Handle large images (resize before sending if needed)
    • Handle OpenAI rate limiting with retry logic
  • Write tests for edge case handling

Week 7–8 (July 6 – July 19) Settings UI and Tools Menu

  • Implement plugin settings using joplin.settings:
    • Provider selection (Ollama / OpenAI)
    • Ollama model name (configurable)
    • OpenAI API key (secure field)
    • Privacy warning for cloud provider
  • Add "Generate Alt Text (AI)" command to Tools menu
  • Add progress notifications during processing
  • Add clear error messages (Ollama not running, API key missing, etc.)

Week 9–10 (July 20 – August 2) Batch Processing and Auto-labeling

  • Extend from current-note mode to scan-all-notes mode
  • Implement auto-labeling on new image insertion using joplin.workspace events
  • Cross-platform QA on Windows and Linux via virtual machine; macOS with community assistance
  • Fix any UI or behavior inconsistencies found during testing

Week 11 (August 3 – August 9) Mentor Feedback and Optimization

  • Address all mentor code review feedback
  • Optimize performance for large note collections
  • Improve prompt engineering for more accurate descriptions
  • Conduct accessibility audit of the plugin UI itself

Week 12 (August 10 – August 18) Final Submission

  • Write complete user documentation and setup guide
  • Publish plugin to the official Joplin plugin repository
  • Submit final pull request
  • Write final GSoC progress blog post

10. Deliverables

At the end of this project the following will exist:

  • Complete Joplin plugin published to the official Joplin plugin repository
  • AltTextService.ts — scans image resources in current note and all notes
  • OllamaProvider.ts — local AI provider, fully private, no API key needed
  • OpenAIProvider.ts — cloud AI provider, opt-in, user provides own key
  • NoteUpdater.ts — safely writes AI descriptions as alt text into note bodies
  • ReviewDialog.ts — user review dialog with Accept, Edit, Skip per image
  • Plugin settings UI with provider selection, API key configuration, and privacy warning
  • "Generate Alt Text (AI)" command in Joplin's Tools menu
  • Auto-labeling of newly added images via workspace events
  • Manual test coverage across all supported image types and editor modes; unit tests for core classification and parsing logic
  • User documentation explaining setup for both Ollama and OpenAI
  • Verified functionality on Windows, macOS, and Linux

11. Availability

  • Weekly availability: 20–25 hours/week (175-hour medium project across 12 weeks)
  • Time zone: IST (UTC+5:30), Bengaluru, India
  • Other commitments: I have semester-end examinations in June which I am confident I can manage alongside GSoC commitments. No other planned vacations or internships during the coding period.
  • Communication plan: I will contact my mentor at least three times per week, post weekly progress updates on the Joplin forum, and maintain an up-to-date progress blog. I understand that lack of communication results in failing the programme.
  • Remote work: I am comfortable working independently under a remote mentor across time zones. I have already done this through my prior Joplin contributions, communicating with maintainers asynchronously via GitHub and the Joplin forum.
  • Language: My native language is Telugu. I am fully comfortable working and communicating in English.

Benefits to Joplin Community

This plugin directly advances Joplin's stated accessibility goals. Users who rely on screen readers will for the first time receive meaningful descriptions of images in their notes. The dual-provider design (Ollama local + OpenAI opt-in) ensures every user can benefit regardless of technical setup or budget — with privacy preserved by default. The plugin will be published to the Joplin plugin repository making it available to all Joplin users immediately upon completion.

GSoC proposal link: Automatically label images using AI

Thanks for your proposal! The main thing missing is more detail on behaviour and edge cases, especially around existing alt text, regeneration rules, batch processing, and how users review or control changes.


Thanks for the helpful feedback earlier — I really appreciate the suggestions.

I’ve updated the proposal to address the points you mentioned and expanded the behaviour and edge-case sections. In particular, I added:

• A heuristic (regex example) showing how weak alt text such as file names is detected.
• A clearer preview dialog mockup showing the number of images being reviewed (e.g., “Image 1 of 4”).
• Pagination logic for batch scanning so collections larger than 100 notes are handled correctly.
• Undo support that snapshots original note bodies before modification so batch changes can be reverted.
• Additional edge-case handling such as notes modified during batch processing.

Please let me know if there are any other areas that could be improved further. Thanks again for taking the time to review the draft!

One design detail I’m still evaluating is the choice of the default local vision model.

In the proposal I currently assume LLaVA via Ollama, mainly because it integrates well with Ollama and produces reasonably good image descriptions. However, I’m aware that LLaVA can be relatively heavy for users on lower-end hardware.

Are there other open-source vision models that might be worth considering for this use case — particularly models that run efficiently while still producing useful accessibility-style descriptions?

The plugin architecture keeps the provider layer modular, so switching models or adding additional providers later should be straightforward. I’d appreciate any suggestions on models worth evaluating during development.

Thanks for the submission

Please review the proposal draft submission guidelines specifically step 4

Update the draft by editing your first post directly so the latest version is always visible there.

For the document submission itself I have a few comments.

  • Overall, I think the submission has too many low-level details and not enough high-level description; we want to see that you have thought about edge cases, but you don't need to have solved them yet.
  • I like the idea of using heuristics to identify automatically filled descriptions, but I think it is overly simplistic as proposed. That said, you can keep that section high level.
  • You need more detail about user experience and user flows; some discussion of how this works with the Rich Text editor should also be added.

Thank you for the detailed feedback. I appreciate the suggestions. I'll revise the proposal by reducing the low-level implementation details, expanding the sections on user experience and user flows, and adding a discussion on compatibility with the Rich Text editor. I'll also keep the heuristics section more high-level as suggested.

Before revising the draft, I’d like to clarify one point regarding Rich Text editor compatibility. Does the Rich Text editor store image alt text using the same underlying Markdown representation (![alt text](:/resource_id)), or is there a different internal representation I should account for? I'll update the first post directly with the revised version once I incorporate the feedback.

The app only has one representation of each note, the editor doesn’t matter. But markdown isn’t the only possibility. In fact, even in a markdown document many users will have an html image. I suggest you review the codebase, and maybe even the forum to understand this a bit better.

The reason I bring up the rich text editor is that it doesn’t show the source, and will thus need different considerations in terms of UI compared to the markdown editor.

Thank you for the clarification. I'll review the codebase for HTML image handling in notes and study how the Rich Text editor differs in terms of UI considerations. I'll update the proposal accordingly

Thank you for the feedback earlier. I've updated the proposal in the first post to address the points discussed. The revision focuses on improving the high-level description, expanding the user experience and user flow sections, and adding clearer considerations for the Rich Text editor.

I also simplified some of the lower-level implementation details.

I'd appreciate any additional feedback or suggestions.

@Ronaldo

Thanks for the updates, but please make them in accordance with the draft submission guidelines that I linked above.


Thanks for the reminder. I’ve updated the first post directly in accordance with the draft submission guidelines so that the latest version of the proposal is visible there.