GSoC 2026 Proposal Draft - Idea 5: Automatically Label Images Using AI - Kanishka

Links

  • Project Idea: Idea #5 - Automatically Label Images Using AI

  • GitHub Profile: kanishka0411

  • Pull Requests Submitted to Joplin:

    • #14593 - Add toggle button to hide/show sync panel (Merged)
    • #14519 - Add table editing commands for CodeMirror (Open)
    • #14467 - Fix sidebar scroll jump on desktop (Merged)
    • #14442 - Handle missing script assets in HTML export (Merged)
    • #14403 - Move editor settings to dedicated section (Merged)
  • Other Open-Source Experience:

    • RoboSats (P2P Bitcoin exchange): 4 merged PRs - custom HTTP webhook notification system, coordinator ratings refactoring, USDT swap crash fix, Tor browser rendering fix
    • BTC Map (merchant map): 3 merged PRs - verified date badges, merchant actions refactoring, note field cleanup

1. Introduction

I'm a CS student, and I mainly work with TypeScript, React, and Python. Over the past few months I've been contributing to open source seriously, across a few projects in the privacy, payments, and mapping space. That has given me a good feel for navigating unfamiliar codebases, working with maintainers, and getting changes through review.

I started contributing to Joplin a few weeks ago and have 4 merged PRs and 1 open so far, touching the desktop frontend (app-desktop), the shared library (packages/lib), and the CodeMirror editor. That gave me a decent understanding of how things are wired across packages: the resource model, the plugin API, the settings system, and editor internals.


2. Project Summary

Problem

People attach a lot of images to their Joplin notes: screenshots, photos, diagrams, scanned docs. Right now those images are basically invisible to search. You can't type "dog" and find the photo of your dog, or search "whiteboard" and pull up that meeting snapshot. OCR helps with text in images, but it doesn't do anything for the actual visual content like photos, illustrations, or diagrams.

What Will Be Implemented

A Joplin plugin that automatically generates descriptive labels for image attachments using AI. It runs locally by default so nothing leaves your machine, but there's an optional cloud provider if you want better accuracy for unusual images. Here's what the plugin does:

  1. Detects when you attach or update an image in a note

  2. Runs inference to generate labels like "outdoor", "dog", "mountain", "diagram"

  3. Stores those labels as structured metadata on the resource using the plugin userData API

  4. Makes labels searchable through Joplin's search

  5. Shows labels in the UI through a sidebar panel

Expected Outcome

  • A published, installable Joplin plugin

  • Local-first labeling with zero cloud dependency by default

  • Search integration via indexed userData or a new plugin API, with note-tag fallback

  • Clean settings UI for picking providers, configuring models, and toggling things on/off

  • Tests and documentation

Why a Plugin?

A plugin-first approach makes the most sense here because a lot of users simply don't want AI features in their note-taking app, and that's a totally valid preference. Making it a plugin keeps it fully opt-in. It's also way easier to iterate quickly on a plugin during the GSoC timeline, and it gives users the choice to install it or not without anything controversial touching core.


3. Technical Approach

Architecture Overview

Component Breakdown

3.1 Resource Detection

There's an important nuance in Joplin's plugin API here: onResourceChange() only fires when an existing resource is modified, not when a new one is created. And onNoteChange() only fires for the currently selected note, not every note in the database. So we need a layered detection strategy:

For modified resources: Hook into joplin.workspace.onResourceChange() to catch updates to existing images (e.g., user replaces an attachment). When this fires, check if the blob actually changed via blob_updated_time and re-label if needed.

For new resources in the active note (latency optimization): Hook into joplin.workspace.onNoteChange() and diff the note body to detect newly referenced resource IDs. This only fires for the selected note, so it's a fast-path optimization for the common case of "user is editing a note and pastes an image." The diff parses all :/resourceId occurrences in the note body (not just ![alt](:/id) markdown syntax, since resources can also appear in raw HTML blocks), then filters by MIME type against supported image formats (image/jpeg, image/png, image/webp, image/bmp) and enqueues new ones.
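
A minimal sketch of that diff step, assuming the plugin keeps the previously seen resource IDs per note in an in-memory cache (the helper and variable names here are illustrative, not final):

```typescript
import joplin from 'api';

const SUPPORTED_MIMES = ['image/jpeg', 'image/png', 'image/webp', 'image/bmp'];

// Matches every ":/resourceId" occurrence (markdown or raw HTML), 32-char hex IDs.
const RESOURCE_ID_RE = /:\/([0-9a-fA-F]{32})/g;

// Previously seen resource IDs per note (illustrative in-memory cache).
const seenIdsByNote = new Map<string, Set<string>>();

async function findNewImageResources(noteId: string): Promise<string[]> {
	const note = await joplin.data.get(['notes', noteId], { fields: ['id', 'body'] });

	const currentIds = new Set<string>();
	for (const match of note.body.matchAll(RESOURCE_ID_RE)) currentIds.add(match[1].toLowerCase());

	const previous = seenIdsByNote.get(noteId) ?? new Set<string>();
	seenIdsByNote.set(noteId, currentIds);

	const newImageIds: string[] = [];
	for (const id of currentIds) {
		if (previous.has(id)) continue;
		// Filter by MIME type so only supported image formats are enqueued.
		const resource = await joplin.data.get(['resources', id], { fields: ['id', 'mime'] });
		if (SUPPORTED_MIMES.includes(resource.mime)) newImageIds.push(id);
	}
	return newImageIds;
}
```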

Fallback scanner (primary correctness mechanism): A periodic scan (configurable interval, default every 5 minutes) is the main guarantee that every image gets labeled. It queries resources via joplin.data.get(['resources'], { fields: [...], page: N }) and paginates through all pages (looping until has_more is false), checking which image resources don't have labels yet. This catches everything the event listeners miss: images added via sync, the web clipper, mobile, or notes the user wasn't actively editing. There's also a manual "Scan all unlabeled images" command for one-off bulk runs.
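
A sketch of the paginated fallback scan, assuming the queue's enqueue function is passed in; the 'labels' userData key is the one described in section 3.5:

```typescript
import joplin from 'api';
import { ModelType } from 'api/types';

// Walk every resource page by page and enqueue unlabeled images (interval wiring omitted).
async function scanAllResources(enqueue: (resourceId: string) => void): Promise<void> {
	let page = 1;
	while (true) {
		const result = await joplin.data.get(['resources'], {
			fields: ['id', 'mime'],
			limit: 100,
			page,
		});
		for (const resource of result.items) {
			if (!resource.mime || !resource.mime.startsWith('image/')) continue;
			// Skip resources that already have labels stored in userData.
			const existing = await joplin.data.userDataGet(ModelType.Resource, resource.id, 'labels');
			if (!existing) enqueue(resource.id);
		}
		if (!result.has_more) break;
		page += 1;
	}
}
```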

3.2 Processing Queue

There's a queue manager sitting between the event listener and the actual inference (a minimal sketch follows the list). It handles:

  • Batching: Groups multiple resource events so we don't process the same thing twice

  • Deduplication: If the same resource fires multiple events in a short window, only one gets processed

  • Rate limiting: Caps concurrent inference jobs (1 for local, configurable for cloud)

  • Retry with backoff: For when things fail transiently (cloud rate limits, model loading hiccups)

  • Progress tracking: So the UI panel can show what's happening
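
A minimal sketch of that queue, assuming a single async worker function per provider (class and field names are illustrative):

```typescript
// Dedup + concurrency-capped queue with retry/backoff (sketch).
class LabelQueue {
	private pending: string[] = [];
	private queued = new Set<string>();  // dedup: at most one pending entry per resource
	private running = 0;

	constructor(
		private worker: (resourceId: string) => Promise<void>,
		private maxConcurrent = 1,        // 1 for local inference, configurable for cloud
		private maxRetries = 3,
	) {}

	enqueue(resourceId: string): void {
		if (this.queued.has(resourceId)) return;  // same resource fired twice in a short window
		this.queued.add(resourceId);
		this.pending.push(resourceId);
		this.drain();
	}

	private drain(): void {
		while (this.running < this.maxConcurrent && this.pending.length > 0) {
			const id = this.pending.shift()!;
			this.queued.delete(id);
			this.running++;
			this.runWithRetry(id, 0)
				.catch(error => console.warn(`Labeling failed for ${id}:`, error))  // surfaced in the panel later
				.finally(() => {
					this.running--;
					this.drain();
				});
		}
	}

	private async runWithRetry(id: string, attempt: number): Promise<void> {
		try {
			await this.worker(id);
		} catch (error) {
			if (attempt >= this.maxRetries) throw error;
			// Exponential backoff for transient failures (cloud rate limits, model loading hiccups).
			await new Promise(resolve => setTimeout(resolve, 1000 * 2 ** attempt));
			return this.runWithRetry(id, attempt + 1);
		}
	}
}
```

Batching (grouping events from the same scan pass) and progress reporting would sit on top of enqueue(); they are omitted here for brevity.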

3.3 Provider System

I'm building this with a common provider interface so it's easy to swap or add new backends:


```typescript
interface LabelProvider {
	name: string;
	initialize(): Promise<void>;
	generateLabels(imagePath: string): Promise<LabelResult>;
	dispose(): Promise<void>;
}

interface LabelResult {
	labels: Array<{ name: string; confidence: number }>;
	model: string;
	timestamp: number;
}
```

Local Provider (Default):

  • Joplin's plugin sandbox only allows a few whitelisted native modules (sqlite3, fs-extra, 7zip-bin), so onnxruntime-node won't work here. Instead, the local provider uses ONNX Runtime Web (WASM-based), which is pure JavaScript/WebAssembly and runs fine inside the plugin sandbox without any native dependencies.

  • Primary model: MobileCLIP (~20-50MB, optimized for edge devices). The model is not bundled with the plugin package to keep installs fast. Instead, it's downloaded on first use with checksum verification and cached locally. The download is resumable so a flaky connection won't corrupt anything. Only a tiny bootstrap loader ships with the plugin itself.

  • Fallback: CLIP ViT-B/32 for higher accuracy (~350MB, opt-in download, same first-run download mechanism)

  • Image preprocessing: resize to model input dimensions using joplin.imaging API (createFromResource -> resize -> toJpgFile). This API is desktop-only, so the plugin initially targets desktop but the architecture is designed with future mobile portability in mind. Every image handle is freed after processing to avoid memory leaks.

  • Zero-shot classification against a predefined label vocabulary (500+ common categories); a rough sketch of this path follows the list

  • Everything runs on-device, no data leaves the machine
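
A rough sketch of the local labeling path, under these assumptions: the label vocabulary ships with precomputed, L2-normalised text embeddings; imageToTensor is a hypothetical helper that decodes the temp JPEG into the model's input tensor; and the exact joplin.imaging signatures will be confirmed against the API docs:

```typescript
import joplin from 'api';
import * as ort from 'onnxruntime-web';

// Precomputed, L2-normalised text embeddings for the label vocabulary (shipped with the model).
interface Vocabulary {
	labels: string[];
	embeddings: Float32Array[];
}

async function labelResource(
	resourceId: string,
	session: ort.InferenceSession,  // image encoder session created via ort.InferenceSession.create()
	vocab: Vocabulary,
	tempJpgPath: string,
	// Hypothetical helper: decodes the temp JPEG into the model's normalised input tensor.
	imageToTensor: (jpgPath: string) => Promise<ort.Tensor>,
): Promise<Array<{ name: string; confidence: number }>> {
	// Preprocess with Joplin's imaging API: load from the resource, resize to the model input size,
	// write a temp JPEG (exact toJpgFile parameters to be confirmed against the imaging API docs).
	const handle = await joplin.imaging.createFromResource(resourceId);
	const resized = await joplin.imaging.resize(handle, { width: 224, height: 224 });
	await joplin.imaging.toJpgFile(resized, tempJpgPath, 90);
	// Free every handle after use to avoid leaking image memory.
	await joplin.imaging.free(handle);
	await joplin.imaging.free(resized);

	// Run the image encoder, then score each vocabulary label by cosine similarity.
	const input = await imageToTensor(tempJpgPath);
	const output = await session.run({ [session.inputNames[0]]: input });
	const imageEmbedding = output[session.outputNames[0]].data as Float32Array;

	return vocab.labels
		.map((name, i) => ({ name, confidence: cosine(imageEmbedding, vocab.embeddings[i]) }))
		.sort((a, b) => b.confidence - a.confidence)
		.slice(0, 10);
}

function cosine(a: Float32Array, b: Float32Array): number {
	let dot = 0, na = 0, nb = 0;
	for (let i = 0; i < a.length; i++) { dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i]; }
	return dot / (Math.sqrt(na) * Math.sqrt(nb));
}
```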

Cloud Provider (Optional):

  • Supports OpenAI Vision API and Claude Vision API

  • User provides their own API key via plugin settings (secure: true, stored in the system keychain when available)

  • Sends the image with a structured prompt asking for labels in JSON format (a rough request sketch follows this list)

  • Respects rate limits and has cost controls (max images/day setting)
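
A rough sketch of the OpenAI Vision request, assuming fetch is available in the plugin runtime and the image has already been resized and base64-encoded; the model name and prompt are placeholders:

```typescript
async function labelViaOpenAI(base64Jpeg: string, apiKey: string): Promise<Array<{ name: string; confidence: number }>> {
	const response = await fetch('https://api.openai.com/v1/chat/completions', {
		method: 'POST',
		headers: {
			'Authorization': `Bearer ${apiKey}`,
			'Content-Type': 'application/json',
		},
		body: JSON.stringify({
			model: 'gpt-4o-mini',
			messages: [{
				role: 'user',
				content: [
					{ type: 'text', text: 'List up to 10 descriptive labels for this image as JSON: [{"name": string, "confidence": number}]. Respond with JSON only.' },
					{ type: 'image_url', image_url: { url: `data:image/jpeg;base64,${base64Jpeg}` } },
				],
			}],
		}),
	});
	if (!response.ok) throw new Error(`OpenAI request failed: ${response.status}`);
	const data = await response.json();
	// The model is instructed to return a JSON array; parse it (validation omitted here).
	return JSON.parse(data.choices[0].message.content);
}
```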

3.4 Label Normalizer

Takes the raw model output and cleans it up (a small sketch follows the list):

  • Maps synonyms to canonical labels (e.g., "puppy" -> "dog", "automobile" -> "car")

  • Filters out low-confidence labels (configurable threshold, default 0.3 for local, 0.5 for cloud)

  • Deduplicates overlapping labels

  • Caps at top N labels per image (default: 10)

  • Keeps labels separate from OCR text, but both end up searchable
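
A minimal sketch of the normaliser; the synonym table here is a tiny placeholder for the real mapping:

```typescript
const SYNONYMS: Record<string, string> = { puppy: 'dog', automobile: 'car' };

interface Label { name: string; confidence: number }

function normalizeLabels(raw: Label[], minConfidence = 0.3, maxLabels = 10): Label[] {
	const byName = new Map<string, Label>();
	for (const label of raw) {
		if (label.confidence < minConfidence) continue;  // drop low-confidence labels
		const name = SYNONYMS[label.name.toLowerCase()] ?? label.name.toLowerCase();
		const existing = byName.get(name);
		// Deduplicate: keep the highest confidence per canonical label.
		if (!existing || label.confidence > existing.confidence) byName.set(name, { name, confidence: label.confidence });
	}
	return [...byName.values()]
		.sort((a, b) => b.confidence - a.confidence)
		.slice(0, maxLabels);
}
```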

3.5 Storage Layer

Labels get stored using Joplin's plugin userData API on the Resource entity:


```typescript
// Store labels on a resource
await joplin.data.userDataSet(
	ModelType.Resource,
	resourceId,
	'labels', // key
	{
		labels: [
			{ name: 'dog', confidence: 0.92 },
			{ name: 'outdoor', confidence: 0.87 },
			{ name: 'grass', confidence: 0.74 },
		],
		model: 'mobileclip-s1',
		timestamp: 1710700000000,
		version: 1,
	},
);
```
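
Reading labels back is symmetric:

```typescript
// Retrieve labels from a resource; returns undefined if the key has never been set.
const stored = await joplin.data.userDataGet(ModelType.Resource, resourceId, 'labels');
```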

Why userData instead of new database columns:

  • No schema migrations needed, it just works as a plugin without touching core

  • Syncs across devices automatically through Joplin's existing sync

  • Namespaced per plugin, so no conflicts with other plugins or OCR data

  • This is the same pattern used by existing Joplin plugins (there's a user_data test plugin in the codebase)

3.6 Search Integration

Plugin userData syncs across devices but is not indexed by Joplin's SearchEngine (which only reads resource.title and resource.ocr_text). This is the main architectural challenge for this project.

Why not inject labels into note bodies? The obvious approach (appending hidden HTML comments like <!-- ai-labels: dog, outdoor -->) is fragile and creates real problems:

  • A single image resource can be referenced from multiple notes — which note gets the comment? All of them? That's duplication and drift.

  • The plugin would be silently modifying user notes, which breaks trust and triggers unnecessary syncs.

  • Rich Text mode can strip HTML comments, causing data loss.

  • If the plugin is disabled or uninstalled, orphaned comments stay in notes forever.

Proposed approach: extend the plugin API to make userData searchable. Both extending existing userData behavior and adding a completely new plugin API method are valid options here — the final choice comes down to whichever results in a cleaner design with minimal risk. That's how the Joplin plugin API generally evolves: new methods get added when they're needed by plugins. Candidate designs:

  1. Indexed userData keys — Add an indexed: true option to userDataSet() so plugins can flag specific keys for search indexing. The SearchEngine would then include those values in its FTS table alongside resource.title and ocr_text. This is the least invasive: small change to the userData API, small change to SearchEngine's sync logic. An illustrative call shape follows this list.

  2. New dedicated plugin API — A completely new API method (e.g., joplin.data.setSearchableMetadata()) designed specifically for plugins that need to contribute searchable text to resources. Cleaner separation of concerns, doesn't overload the existing userData API with indexing behavior.

  3. Resource metadata field — Write labels to a dedicated resource.ai_labels column. Simple and fast to query, but requires a schema migration and is less generic than the other approaches.
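
For illustration only, option 1 might look roughly like this from the plugin side. The indexed option does not exist today, and the separate plain-text key is just one possible shape:

```typescript
// Hypothetical API (does not exist yet): a plain-text companion key flagged for indexing,
// kept separate from the structured 'labels' key used by the plugin UI.
await joplin.data.userDataSet(ModelType.Resource, resourceId, 'labelsText', 'dog outdoor grass', { indexed: true });
```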

I'll evaluate these during community bonding with mentor guidance and prototype the most promising one before coding starts.

Fallback if core changes are too risky: Note-level tags (ai:dog, ai:outdoor) attached to the parent note. This uses existing Joplin infrastructure, requires zero core changes, and gives filtering through the tag sidebar. The downside is losing per-image granularity (tags attach to notes, not resources). This ships as a configurable option regardless.
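
A sketch of that fallback, assuming the parent note ID of the labeled resource is already known; it reuses the existing data API routes for tags:

```typescript
// Ensure an "ai:<label>" tag exists and attach it to the note.
async function attachLabelTag(noteId: string, label: string): Promise<void> {
	const title = `ai:${label.toLowerCase()}`;
	// Reuse an existing tag with this title if one exists.
	const search = await joplin.data.get(['search'], { query: title, type: 'tag' });
	const tag = search.items.length
		? search.items[0]
		: await joplin.data.post(['tags'], null, { title });
	// Attach the tag to the parent note of the labeled resource.
	await joplin.data.post(['tags', tag.id, 'notes'], null, { id: noteId });
}
```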

Consistency guarantees I'll address:

  • One resource in multiple notes: Labels live on the resource, not the note — no duplication

  • Sync conflicts: userData already has merge semantics, labels follow the same path

  • Plugin disable/uninstall: Labels persist harmlessly in userData, no orphaned data in notes

  • Re-index after model change: Version field in label data lets the plugin detect stale labels and re-process

3.7 UI Panel

A sidebar panel created via joplin.views.panels.create() (a minimal setup sketch follows the list):

  • Shows labels for images in the currently selected note

  • Displays label badges with confidence indicators

  • Lets you manually edit labels (add/remove/rename)

  • Shows processing status (queued, processing, done, error)

  • Has a "Re-scan" button to regenerate labels for an image

  • Built with HTML/CSS injected via setHtml() and addScript()
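
A minimal setup sketch for the panel; the HTML content and message handling are simplified and the file names are placeholders:

```typescript
import joplin from 'api';

async function createLabelsPanel() {
	const panel = await joplin.views.panels.create('aiImageLabelsPanel');
	await joplin.views.panels.setHtml(panel, '<div id="labels-root">Loading labels…</div>');
	await joplin.views.panels.addScript(panel, './panel.js');   // renders badges and edit controls
	await joplin.views.panels.addScript(panel, './panel.css');  // addScript also accepts CSS files

	// Messages from the webview script, e.g. manual label edits or a "Re-scan" click.
	await joplin.views.panels.onMessage(panel, async (message: any) => {
		if (message.type === 'rescan') {
			// enqueue the resource for re-labeling (queue wiring omitted here)
		}
	});
	return panel;
}
```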

Changes to the Joplin Codebase

This is primarily a plugin project. The plugin itself uses only existing public APIs:

| API | Usage |
| --- | --- |
| joplin.workspace.onResourceChange() | Detect modified images |
| joplin.workspace.onNoteChange() | Detect newly attached images via note body diff |
| joplin.data.get/put(['resources', id]) | Read resource metadata |
| joplin.data.userDataSet/Get() | Store/retrieve labels |
| joplin.imaging.createFromResource() | Load image for preprocessing |
| joplin.imaging.resize() | Resize for model input |
| joplin.views.panels.create() | Sidebar UI |
| joplin.settings.registerSection() | Settings UI |
| joplin.settings.registerSettings() | Provider config, API keys |

The joplin.imaging API is desktop-only, so the initial version targets Joplin Desktop. For search integration, a small focused core PR may be needed to make userData indexable by SearchEngine or to add a new plugin API method — this will be scoped during community bonding with mentor approval.

Libraries and Technologies

| Library | Purpose | Size |
| --- | --- | --- |
| onnxruntime-web | Local model inference (WASM, no native deps) | ~8MB |
| MobileCLIP (ONNX) | Image classification model | ~20-50MB |

No heavy ML frameworks like PyTorch or TensorFlow. ONNX Runtime Web is pure WASM/JS and runs inside the plugin sandbox without needing native modules. Image preprocessing is handled by Joplin's built-in joplin.imaging API, so no sharp dependency needed.

Potential Challenges

| Challenge | Mitigation |
| --- | --- |
| Model size vs accuracy tradeoff | Start with MobileCLIP (~25MB), offer larger CLIP as opt-in download. Benchmark both during Week 1-2 |
| Performance on low-end hardware | Run inference off the main plugin execution path (worker-based processing in the plugin renderer process) so the UI stays responsive. Process images one at a time with configurable concurrency |
| Label quality for niche content | Let users customize the label vocabulary. Cloud providers handle unusual images better |
| Search integration | Investigate indexed userData keys or a new dedicated plugin API method during community bonding. Fallback: note-level tags. No note body mutation |
| Offline model distribution | Models are not bundled with the plugin. Downloaded on first use with checksum verification, resumable, and cached locally. Only a tiny bootstrap ships with the package |
| Privacy concerns with cloud providers | Local is the default, always. Cloud needs explicit opt-in plus an API key. A clear warning in settings about data leaving your device |

4. Implementation Plan

Community Bonding

  • Dig into Joplin's plugin development workflow and testing setup

  • Validate ONNX Runtime Web (WASM) runs inside the plugin sandbox

  • Benchmark candidate models (MobileCLIP-S1, MobileCLIP-S2, CLIP ViT-B/32) on a representative image set

  • Pick the final model based on size/accuracy/speed tradeoffs

  • Set up dev environment with automated tests

  • Investigate search integration strategy with mentors: prototype userData indexing in SearchEngine, evaluate tradeoffs of candidate designs (indexed userData keys vs new dedicated plugin API vs resource field)

  • Decision checkpoint: finalize search integration approach with mentors before Week 3 implementation begins

Week 1-2: Core Plugin Skeleton + Local Provider

  • Set up plugin project structure (manifest, settings, entry point)

  • Implement resource detection (onNoteChange for new images, onResourceChange for updates, periodic fallback scanner)

  • Build the processing queue with deduplication and rate limiting

  • Integrate ONNX Runtime Web (WASM) with the chosen model

  • Build the image preprocessing pipeline using joplin.imaging API

  • Implement the LabelProvider interface and the local provider

  • Milestone: Plugin can detect new images and generate labels locally, visible in the console

Week 3-4: Storage + Search + Cloud Provider

  • Wire up userData-based label storage on resources

  • Build the label normalizer (synonym mapping, confidence filtering, deduplication)

  • Implement search integration using the approach finalized during community bonding, with tag-based fallback

  • Build the cloud provider (OpenAI Vision API) with secure API key storage

  • Add provider selection in settings

  • Milestone: Labels stored on resources, search path implemented and validated, cloud provider working

Week 5-6: UI Panel + Settings

  • Build the sidebar panel with label badges for the current note's images

  • Add manual label editing (add/remove labels)

  • Add processing status indicators (queued/processing/done/error)

  • Build settings UI: provider selection, model config, confidence threshold, max labels, cloud API key

  • Add "Re-scan" and "Scan all images in note" commands

  • Milestone: Full UI with settings, manual editing, and status tracking

Midterm Evaluation

  • Working plugin with local + cloud labeling, storage, search, and UI

  • Demo to mentors, get feedback

  • Write midterm progress report

Week 7-8: Polish + Edge Cases + Performance

  • Handle edge cases: encrypted resources, large images, unsupported formats, sync conflicts

  • Optimize performance: worker-based off-main-path inference, lazy model loading, image caching

  • Add bulk processing command ("Label all unlabeled images")

  • Add a progress bar for bulk operations

  • Handle label merging during sync (timestamp-based via userData merge)

  • Milestone: Solid plugin that covers the tricky cases

Week 9-10: Testing + Documentation

  • Write Jest tests:
    • Unit tests for label normalizer, queue, provider interface
    • Integration tests for storage and retrieval
    • Mock-based tests for cloud provider

  • Write user docs: installation, configuration, usage guide

  • Write developer docs: architecture, how to add new providers

  • Performance benchmarks: time per image, memory usage, model loading time

  • Milestone: Full test coverage, complete documentation

Week 11-12: Final Polish + Submission

  • Address mentor feedback from final review

  • Final bug fixes and cleanup

  • Prepare plugin for publishing to the Joplin plugin repository

  • Write final GSoC report

  • Submit final work product and final mentor evaluation

  • Milestone: Plugin published and ready for users


5. Deliverables

Implemented Features

  • Joplin plugin: joplin-plugin-ai-image-labels

  • Local AI inference using ONNX Runtime Web/WASM (MobileCLIP), no cloud dependency, no native modules

  • Optional cloud provider support (OpenAI Vision, Claude Vision)

  • Automatic labeling on image attachment

  • Label storage via userData API with sync support

  • Search integration via indexed userData or new plugin API, with note-tag fallback to ensure usable search within project timeline

  • Sidebar panel showing labels per image with confidence scores

  • Manual label editing (add/remove/rename)

  • Bulk labeling command for existing images

  • Settings UI for full configuration

Tests

  • Unit tests for all core modules (normalizer, queue, providers, storage)

  • Integration tests for end-to-end labeling pipeline

  • Performance benchmarks documented

Documentation

  • User guide: installation, setup, configuration

  • Developer guide: architecture, adding custom providers

  • README with screenshots and usage examples


6. Availability

  • Weekly availability: 30-35 hours per week dedicated to GSoC

  • Time zone: IST (UTC+5:30)

  • Other commitments: No conflicting internships or jobs. University coursework will not affect committed GSoC hours.


Just my opinion, but I think a plugin approach will probably be better in this case. The reason is that there is a large number of people who for various reasons do not want to use AI in any form, and if you build it into the core application, there will likely be a significant backlash against it.

Alternatively, it could be built-in but disabled by default. Using any cloud API will still end up being controversial though.


Additionally, as a general rule if something can be made as a plugin that's probably what it should be. It's easier to get started and to iterate quickly, which is important during GSoC, and indeed it also gives the users more choice.

Would it make sense to append an LLM-generated description of the image to the OCR text?
I'm thinking it might help a lot with search, as well as with existing/future RAG approaches.

Thanks for the feedback :slight_smile:

@tomasz86 @laurent That makes sense. I’ll scope this as a plugin-first, opt-in project rather than a core feature. That way it stays completely optional for users who don't want AI, while also making the feature easier to test and iterate on during GSoC.

@executed I think that’s a useful direction to explore. AI-generated image descriptions could definitely improve search, but I’d want to look carefully at how this should relate to OCR data. My current preference would be to keep OCR output and AI-generated descriptions distinct, while still making both searchable if that fits Joplin’s indexing model.

I’m going to dig further into how resource OCR/search data is currently stored and indexed, then reflect that in the PoC and proposal draft. I’ll share the formal draft once it’s in better shape.


Hello, just a note that we have now provided a template for the GSoC proposal drafts:

So I would suggest updating your top post according to this. I have also moved your post to the GSoC Proposal Drafts category.


That should work now, please try again. If not, for now add a reply to this post, and we can move it to the top post


Hey, just wanted to check: joplin.imaging is desktop-only right now, so my proposal is scoped to desktop. Should I plan for mobile too within GSoC, or is desktop fine for now?

Yes for now please scope it to desktop. I'd say during the development keep mobile in mind anyway so that we could potentially add support for it one day, but it's not necessary as part of this project.


The main concern here is the search integration strategy. Storing labels in resource userData makes sense, but duplicating them into note bodies to make them searchable seems fragile. It creates sync and consistency issues, especially when a resource is referenced from multiple notes, and it also means the plugin modifies notes, which users may not like. I think this part needs a cleaner design.

Maybe investigate a few options:

  • Perhaps a new plugin API would help here? User data is not indexed by the search engine, but perhaps it should be? Or maybe we could change the API so that certain keys are indexed?

  • Or maybe a plugin is not the right option and it should be integrated into core. But that immediately makes it more risky since it would touch resources and sync, so it needs to be considered carefully.


Thanks, that makes a lot of sense. The multi-note resource problem with HTML comments is a real issue. I'll rework that section around proper indexing instead. My current thinking is maybe extending userDataSet so plugins can flag certain keys as indexed, and then SearchEngine picks those up during its sync.


Yes, that could make sense. Or it could also be fine to add a completely new API if that helps. That's generally how we work with the plugin API: we add new methods when they become needed by various plugins.


Thanks, that helps a lot. I'll look into both options and compare the trade-offs in my proposal.

Hey, so I've been going through how SearchEngine indexes resources to understand the search integration part better. From what I can see, items_normalized (migration 45) feeds into items_fts, and right now it only picks up ocr_text and resource.title through allForNormalization().
One thing I noticed is that there are six unused reserved columns in both tables. Could one of these work for storing plugin-contributed searchable text like AI labels? That way we'd skip needing a new migration entirely. Or are those columns set aside for something specific?

The other option I was thinking about is extending allForNormalization() to also pull from userData when a plugin has marked a key as indexed, but that feels like it mixes the plugin layer with core search more than it should. What do you think: is one of these on the right track, or should I be looking at it differently?

If there is a performance benefit to using onnxruntime-node, then distributing it through the core should be explored (probably better to use the wasm, but good to understand the exact tradeoff). I would make note of the exploration in the proposal, it doesn’t need to be answered today.

Yeah, that's a fair point. I'll benchmark both during community bonding: same model, same images, compare inference times. If WASM is only marginally slower it probably doesn't matter much since labeling runs in the background anyway, but if it's something like 3-5x slower then exploring onnxruntime-node through core would make more sense. I'll add this as an early evaluation task in the proposal. Thanks for flagging it.

Hey @personalizedrefriger, a couple of things I wanted to ask. For running ONNX inference I need to make sure it doesn't freeze the UI: would spawning a web worker from the plugin's webview work inside the sandbox, or would that get blocked? If workers aren't an option, I was thinking of breaking inference into chunks with setTimeout.

Also, on the accessibility side, right now images just have empty alt text like ![](:/resourceId). Once the plugin generates labels, the sidebar panel would show them, but ideally they'd also show up as alt text on the image itself in the rendered note. Is there a way for plugins to modify how resources are rendered, like injecting alt attributes into the viewer?

  • That will work on desktop (but would be blocked on mobile).
  • A web worker might not be necessary: On desktop, each plugin currently also runs within its own (hidden) Electron BrowserWindow (source), so long-running tasks should not block the main window's UI.

For Markdown notes, it should be possible to do this with the joplin.contentScripts.register API, either directly through the markdown-it plugin, or through a content script that runs after rendering has completed.

Alternatively, could it make sense to store the accessibility labels directly in the note body?

  • Possible benefits:
    • Accessibility: Including the label in the note body allows screen readers to describe the image, while editing in the Markdown editor.
    • Simplifies search support: The search engine can index the ALT text included directly in the note body.
    • Supports publishing/sync/export: The ALT text will be included in exported, published, and synced notes. For example, even if the plugin doesn't support mobile/CLI, the generated ALT text can be used on synced mobile and CLI apps.
  • Drawbacks:
    • Involves modifying the note body. This can lead to conflicts if the note is also modified by sync.
    • Read-only/deleted notes can't be modified.

How well does MobileCLIP work for non-English languages? (For reference, I think Firefox's on-device ALT text generation involves a post-processing step with an on-device translation model.)