GSoC 2026 Proposal Draft - Idea 5: Automatically Label Images Using AI - Kanishka
Links
- Project Idea: Idea #5 - Automatically Label Images Using AI
- GitHub Profile: kanishka0411
- Pull Requests Submitted to Joplin:
- Other Open-Source Experience:
  - RoboSats (P2P Bitcoin exchange): 4 merged PRs - custom HTTP webhook notification system, coordinator ratings refactoring, USDT swap crash fix, Tor browser rendering fix
  - BTC Map (merchant map): 3 merged PRs - verified date badges, merchant actions refactoring, note field cleanup
1. Introduction
I'm a CS student. I mainly work with TypeScript, React, and Python. Over the past months I've gotten seriously into open source, contributing across a few projects in the privacy, payments, and mapping space. That gave me a good feel for navigating unfamiliar codebases, working with maintainers, and getting changes through review.
I started contributing to Joplin a few weeks ago and have 4 merged PRs and 1 open so far, touching the desktop frontend (app-desktop), the shared library (packages/lib), and the CodeMirror editor. That gave me a decent understanding of how things are wired across packages: the resource model, the plugin API, the settings system, and the editor internals.
2. Project Summary
Problem
People attach a lot of images to their Joplin notes: screenshots, photos, diagrams, scanned docs. Right now those images are basically invisible to search. You can't type "dog" and find the photo of your dog, or search "whiteboard" and pull up that meeting snapshot. OCR helps with text in images, but it doesn't do anything for the actual visual content like photos, illustrations, or diagrams.
What Will Be Implemented
A Joplin plugin that automatically generates descriptive labels for image attachments using AI. It runs locally by default so nothing leaves your machine, but there's an optional cloud provider if you want better accuracy for unusual images. Here's what the plugin does:
- Detects when you attach or update an image in a note
- Runs inference to generate labels like "outdoor", "dog", "mountain", "diagram"
- Stores those labels as structured metadata on the resource using the plugin userData API
- Makes labels searchable through Joplin's search
- Shows labels in the UI through a sidebar panel
Expected Outcome
- A published, installable Joplin plugin
- Local-first labeling with zero cloud dependency by default
- Search integration via indexed userData or a new plugin API, with a note-tag fallback
- A clean settings UI for picking providers, configuring models, and toggling features on/off
- Tests and documentation
Why a Plugin?
A plugin-first approach makes the most sense here because many users simply don't want AI features in their note-taking app, and that's a valid preference; a plugin keeps the feature fully opt-in. It's also much easier to iterate quickly on a plugin within the GSoC timeline, and nothing controversial touches core.
3. Technical Approach
Architecture Overview
Component Breakdown
3.1 Resource Detection
There's an important nuance in Joplin's plugin API here: onResourceChange() only fires when an existing resource is modified, not when a new one is created. And onNoteChange() only fires for the currently selected note, not every note in the database. So we need a layered detection strategy:
For modified resources: Hook into joplin.workspace.onResourceChange() to catch updates to existing images (e.g., user replaces an attachment). When this fires, check if the blob actually changed via blob_updated_time and re-label if needed.
For new resources in the active note (latency optimization): Hook into joplin.workspace.onNoteChange() and diff the note body to detect newly referenced resource IDs. This event only fires for the selected note, so it's a fast-path optimization for the common case of "user is editing a note and pastes an image." The diff parses all :/resourceId occurrences in the note body (not just the ![](:/resourceId) markdown image syntax, since resources can also appear in raw HTML blocks), then filters by MIME type against the supported image formats (image/jpeg, image/png, image/webp, image/bmp) and enqueues new ones.
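To make that fast path concrete, here's a minimal sketch of the body-diff step. The regex and function names are illustrative, not final; it assumes Joplin's standard 32-character hex resource IDs.

```typescript
// Sketch: extract referenced resource IDs from a note body. Matching the raw
// ":/<id>" token catches markdown images, HTML <img> tags, and plain links alike.
const RESOURCE_REF = /:\/([0-9a-fA-F]{32})/g;

function extractResourceIds(body: string): Set<string> {
  const ids = new Set<string>();
  for (const match of body.matchAll(RESOURCE_REF)) {
    ids.add(match[1].toLowerCase());
  }
  return ids;
}

// Diff against the previously seen body to get only the newly referenced IDs.
function newlyReferencedIds(previousBody: string, currentBody: string): string[] {
  const previous = extractResourceIds(previousBody);
  return [...extractResourceIds(currentBody)].filter(id => !previous.has(id));
}
```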
Fallback scanner (primary correctness mechanism): A periodic scan (configurable interval, default every 5 minutes) is the main guarantee that every image gets labeled. It queries resources via joplin.data.get(['resources'], { fields: [...], page: N }) and paginates through all pages (looping until has_more is false), checking which image resources don't have labels yet. This catches everything the event listeners miss: images added via sync, the web clipper, mobile, or notes the user wasn't actively editing. There's also a manual "Scan all unlabeled images" command for one-off bulk runs.
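A sketch of that pagination loop, assuming the documented joplin.data paging contract (1-based page parameter, has_more flag in the response). hasLabels() is a hypothetical helper that checks the plugin's own userData key for the resource.

```typescript
import joplin from 'api';

// Hypothetical helper, implemented elsewhere against the plugin's userData key.
declare function hasLabels(resourceId: string): Promise<boolean>;

const IMAGE_MIMES = new Set(['image/jpeg', 'image/png', 'image/webp', 'image/bmp']);

async function scanUnlabeledImages(enqueue: (resourceId: string) => void) {
  let page = 1;
  while (true) {
    const batch = await joplin.data.get(['resources'], {
      fields: ['id', 'mime', 'blob_updated_time'],
      page,
    });
    for (const resource of batch.items) {
      if (!IMAGE_MIMES.has(resource.mime)) continue; // skip non-image resources
      if (await hasLabels(resource.id)) continue;    // already labeled
      enqueue(resource.id);
    }
    if (!batch.has_more) break; // loop until every page has been visited
    page++;
  }
}
```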
3.2 Processing Queue
There's a queue manager sitting between the event listeners and the actual inference (a minimal sketch follows the list). It handles:
- Batching: Groups bursts of resource events so they're processed in a single pass
- Deduplication: If the same resource fires multiple events in a short window, only one job runs
- Rate limiting: Caps concurrent inference jobs (1 for local, configurable for cloud)
- Retry with backoff: For transient failures (cloud rate limits, model loading hiccups)
- Progress tracking: So the UI panel can show what's happening
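Here's a minimal sketch of the dedup-plus-concurrency core (retry/backoff and progress reporting omitted for brevity; all names are illustrative). process() stands in for whatever runs inference on one resource.

```typescript
// Sketch of the queue manager's core: Set-based dedup plus a concurrency cap.
class LabelQueue {
  private pending = new Set<string>(); // dedup: at most one entry per resource ID
  private running = 0;

  constructor(
    private process: (resourceId: string) => Promise<void>,
    private maxConcurrent = 1, // 1 for local inference by default
  ) {}

  enqueue(resourceId: string) {
    if (this.pending.has(resourceId)) return; // duplicate event in a short window
    this.pending.add(resourceId);
    void this.drain();
  }

  private drain() {
    while (this.running < this.maxConcurrent && this.pending.size > 0) {
      const id = this.pending.values().next().value as string;
      this.pending.delete(id);
      this.running++;
      this.process(id)
        .catch(() => { /* retry-with-backoff would hook in here */ })
        .finally(() => { this.running--; this.drain(); });
    }
  }
}
```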
3.3 Provider System
I'm building this with a common provider interface so it's easy to swap or add new backends:
```typescript
interface LabelProvider {
  name: string;
  initialize(): Promise<void>;
  generateLabels(imagePath: string): Promise<LabelResult>;
  dispose(): Promise<void>;
}

interface LabelResult {
  labels: Array<{ name: string; confidence: number }>;
  model: string;
  timestamp: number;
}
```
Local Provider (Default):
- Joplin's plugin sandbox only allows a few whitelisted native modules (sqlite3, fs-extra, 7zip-bin), so onnxruntime-node won't work here. Instead, the local provider uses ONNX Runtime Web (WASM-based), which is pure JavaScript/WebAssembly and runs fine inside the plugin sandbox without any native dependencies.
- Primary model: MobileCLIP (~20-50MB, optimized for edge devices). The model is not bundled with the plugin package, to keep installs fast. Instead, it's downloaded on first use with checksum verification and cached locally. The download is resumable, so a flaky connection won't corrupt anything. Only a tiny bootstrap loader ships with the plugin itself.
- Fallback: CLIP ViT-B/32 for higher accuracy (~350MB, opt-in download, same first-run download mechanism)
- Image preprocessing: resize to model input dimensions using the joplin.imaging API (createFromResource -> resize -> toJpgFile); see the sketch after this list. This API is desktop-only, so the plugin initially targets desktop, but the architecture is designed with future mobile portability in mind. Every image handle is freed after processing to avoid memory leaks.
- Zero-shot classification against a predefined label vocabulary (500+ common categories)
- Everything runs on-device; no data leaves the machine
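A sketch of that preprocessing step, assuming the desktop-only joplin.imaging API (createFromResource/resize/toJpgFile/free). MODEL_INPUT_SIZE and the function name are assumptions for illustration.

```typescript
import joplin from 'api';

const MODEL_INPUT_SIZE = 224; // typical CLIP-family input resolution

// Resize a resource's image to model dimensions and write it to a temp JPEG.
async function preprocessForModel(resourceId: string, outPath: string): Promise<string> {
  const original = await joplin.imaging.createFromResource(resourceId);
  try {
    const resized = await joplin.imaging.resize(original, {
      width: MODEL_INPUT_SIZE,
      height: MODEL_INPUT_SIZE,
    });
    try {
      await joplin.imaging.toJpgFile(resized, outPath, 90);
    } finally {
      await joplin.imaging.free(resized); // always free handles to avoid leaks
    }
  } finally {
    await joplin.imaging.free(original);
  }
  return outPath;
}
```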
Cloud Provider (Optional):
- Supports the OpenAI Vision API and the Claude Vision API
- User provides their own API key via plugin settings (secure: true, stored in the system keychain when available)
- Sends the image with a structured prompt asking for labels in JSON format (see the sketch after this list)
- Respects rate limits and has cost controls (max images/day setting)
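A hedged sketch of the request shape, assuming OpenAI's current Chat Completions vision format and that fetch is available in the plugin runtime (otherwise a small HTTP client would be swapped in). The model name and prompt wording are illustrative, not final.

```typescript
// Sketch: ask a vision model for labels as a JSON array of strings.
async function cloudLabels(apiKey: string, jpegBase64: string): Promise<string[]> {
  const response = await fetch('https://api.openai.com/v1/chat/completions', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json',
    },
    body: JSON.stringify({
      model: 'gpt-4o-mini', // illustrative model choice
      messages: [{
        role: 'user',
        content: [
          { type: 'text', text: 'List up to 10 descriptive labels for this image as a JSON array of strings. Reply with JSON only.' },
          { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${jpegBase64}` } },
        ],
      }],
    }),
  });
  const data = await response.json();
  return JSON.parse(data.choices[0].message.content);
}
```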
3.4 Label Normalizer
Takes the raw model output and cleans it up (a sketch follows the list):
- Maps synonyms to canonical labels (e.g., "puppy" -> "dog", "automobile" -> "car")
- Filters out low-confidence labels (configurable threshold, default 0.3 for local, 0.5 for cloud)
- Deduplicates overlapping labels
- Caps at the top N labels per image (default: 10)
- Keeps labels separate from OCR text, though both end up searchable
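A minimal sketch of the normalizer pass; the synonym map is a tiny illustrative subset, and the defaults mirror the rules above.

```typescript
const SYNONYMS: Record<string, string> = { puppy: 'dog', automobile: 'car' };

interface Label { name: string; confidence: number }

function normalizeLabels(raw: Label[], threshold = 0.3, maxLabels = 10): Label[] {
  const best = new Map<string, number>();
  for (const { name, confidence } of raw) {
    if (confidence < threshold) continue; // drop low-confidence labels
    const canonical = SYNONYMS[name.toLowerCase()] ?? name.toLowerCase();
    // Dedup overlapping labels, keeping the highest confidence seen.
    best.set(canonical, Math.max(best.get(canonical) ?? 0, confidence));
  }
  return [...best.entries()]
    .map(([name, confidence]) => ({ name, confidence }))
    .sort((a, b) => b.confidence - a.confidence)
    .slice(0, maxLabels); // cap at top N
}
```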
3.5 Storage Layer
Labels get stored using Joplin's plugin userData API on the Resource entity:
```typescript
import joplin from 'api';
import { ModelType } from 'api/types';

// Store labels on a resource
await joplin.data.userDataSet(
  ModelType.Resource,
  resourceId,
  'labels', // key
  {
    labels: [
      { name: 'dog', confidence: 0.92 },
      { name: 'outdoor', confidence: 0.87 },
      { name: 'grass', confidence: 0.74 },
    ],
    model: 'mobileclip-s1',
    timestamp: 1710700000000,
    version: 1,
  },
);
```
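The read side is symmetric. A sketch of how the version field detects stale labels after a model or schema upgrade (CURRENT_VERSION is illustrative; userDataGet is assumed to resolve to undefined when the key has never been set):

```typescript
import joplin from 'api';
import { ModelType } from 'api/types';

const CURRENT_VERSION = 1;

// True if the resource was never labeled, or was labeled by an older model/schema.
async function needsRelabel(resourceId: string): Promise<boolean> {
  const stored = await joplin.data.userDataGet<any>(ModelType.Resource, resourceId, 'labels');
  return !stored || stored.version !== CURRENT_VERSION;
}
```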
Why userData instead of new database columns:
- No schema migrations needed; it just works as a plugin without touching core
- Syncs across devices automatically through Joplin's existing sync
- Namespaced per plugin, so no conflicts with other plugins or OCR data
- This is the same pattern used by existing Joplin plugins (there's a user_data test plugin in the codebase)
3.6 Search Integration
Plugin userData syncs across devices but is not indexed by Joplin's SearchEngine (which only reads resource.title and resource.ocr_text). This is the main architectural challenge for this project.
Why not inject labels into note bodies? The obvious approach (appending hidden HTML comments like <!-- ai-labels: dog, outdoor -->) is fragile and creates real problems:
- A single image resource can be referenced from multiple notes — which note gets the comment? All of them? That's duplication and drift.
- The plugin would be silently modifying user notes, which breaks trust and triggers unnecessary syncs.
- Rich Text mode can strip HTML comments, causing data loss.
- If the plugin is disabled or uninstalled, orphaned comments stay in notes forever.
Proposed approach: extend the plugin API to make userData searchable. Both extending existing userData behavior and adding a completely new plugin API method are valid options here — the final choice comes down to whichever results in a cleaner design with minimal risk. That's how the Joplin plugin API generally evolves: new methods get added when they're needed by plugins. Candidate designs:
- Indexed userData keys — Add an indexed: true option to userDataSet() so plugins can flag specific keys for search indexing. The SearchEngine would then include those values in its FTS table alongside resource.title and ocr_text. This is the least invasive: a small change to the userData API, a small change to SearchEngine's sync logic.
- New dedicated plugin API — A completely new API method (e.g., joplin.data.setSearchableMetadata()) designed specifically for plugins that need to contribute searchable text to resources. Cleaner separation of concerns; doesn't overload the existing userData API with indexing behavior.
- Resource metadata field — Write labels to a dedicated resource.ai_labels column. Simple and fast to query, but requires a schema migration and is less generic than the other approaches.
I'll evaluate these during community bonding with mentor guidance and prototype the most promising one before coding starts.
Fallback if core changes are too risky: Note-level tags (ai:dog, ai:outdoor) attached to the parent note. This uses existing Joplin infrastructure, requires zero core changes, and gives filtering through the tag sidebar. The downside is losing per-image granularity (tags attach to notes, not resources). This ships as a configurable option regardless.
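A sketch of that tag fallback using only the existing data API, assuming it mirrors the REST endpoints for tags and search (search with type: 'tag' looks tags up by title; colon-prefixed titles may need escaping in the query, shown naively here):

```typescript
import joplin from 'api';

// Attach an "ai:<label>" tag to the image's parent note.
async function tagNoteWithLabel(noteId: string, label: string) {
  const title = `ai:${label.toLowerCase()}`;
  // Reuse the tag if it already exists, otherwise create it.
  const found = await joplin.data.get(['search'], { query: title, type: 'tag' });
  const tag = found.items[0] ?? await joplin.data.post(['tags'], null, { title });
  // Attach the tag to the note.
  await joplin.data.post(['tags', tag.id, 'notes'], null, { id: noteId });
}
```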
Consistency guarantees I'll address:
- One resource in multiple notes: labels live on the resource, not the note — no duplication
- Sync conflicts: userData already has merge semantics; labels follow the same path
- Plugin disable/uninstall: labels persist harmlessly in userData; no orphaned data in notes
- Re-index after model change: a version field in the label data lets the plugin detect stale labels and re-process
3.7 UI Panel
A sidebar panel created via joplin.views.panels.create():
- Shows labels for images in the currently selected note
- Displays label badges with confidence indicators
- Lets you manually edit labels (add/remove/rename)
- Shows processing status (queued, processing, done, error)
- Has a "Re-scan" button to regenerate labels for an image
- Built with HTML/CSS injected via setHtml() and addScript() (see the sketch after this list)
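A sketch of the panel wiring; the panel ID, HTML, file paths, and message shape are illustrative (addScript loads both JS and CSS files into the panel's webview):

```typescript
import joplin from 'api';

const panel = await joplin.views.panels.create('aiLabelsPanel');
await joplin.views.panels.setHtml(panel, '<div id="root">Loading labels…</div>');
await joplin.views.panels.addScript(panel, './webview.js');  // renders label badges
await joplin.views.panels.addScript(panel, './webview.css'); // panel styling

// Messages from the webview (e.g. a "remove label" click) arrive here.
await joplin.views.panels.onMessage(panel, async (msg: any) => {
  if (msg.type === 'removeLabel') {
    // update the resource's userData for msg.resourceId
  }
});
```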
Changes to the Joplin Codebase
This is primarily a plugin project. The plugin itself uses only existing public APIs:
| API | Usage |
|---|---|
| joplin.workspace.onResourceChange() | Detect modified images |
| joplin.workspace.onNoteChange() | Detect newly attached images via note body diff |
| joplin.data.get/put(['resources', id]) | Read resource metadata |
| joplin.data.userDataSet/Get() | Store/retrieve labels |
| joplin.imaging.createFromResource() | Load image for preprocessing |
| joplin.imaging.resize() | Resize for model input |
| joplin.views.panels.create() | Sidebar UI |
| joplin.settings.registerSection() | Settings UI |
| joplin.settings.registerSettings() | Provider config, API keys |
The joplin.imaging API is desktop-only, so the initial version targets Joplin Desktop. For search integration, a small focused core PR may be needed to make userData indexable by SearchEngine or to add a new plugin API method — this will be scoped during community bonding with mentor approval.
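Since settings carry the provider choice and the secure API key, here's a sketch of the registration; keys, labels, and the icon name are illustrative.

```typescript
import joplin from 'api';
import { SettingItemType } from 'api/types';

await joplin.settings.registerSection('aiImageLabels', {
  label: 'AI Image Labels',
  iconName: 'fas fa-tags',
});

await joplin.settings.registerSettings({
  'aiLabels.provider': {
    value: 'local',
    type: SettingItemType.String,
    isEnum: true,
    options: { local: 'Local (on-device)', openai: 'OpenAI Vision', claude: 'Claude Vision' },
    section: 'aiImageLabels',
    public: true,
    label: 'Label provider',
  },
  'aiLabels.apiKey': {
    value: '',
    type: SettingItemType.String,
    secure: true, // stored in the system keychain when available
    section: 'aiImageLabels',
    public: true,
    label: 'Cloud API key',
  },
});
```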
Libraries and Technologies
| Library | Purpose | Size |
|---|---|---|
| onnxruntime-web | Local model inference (WASM, no native deps) | ~8MB |
| MobileCLIP (ONNX) | Image classification model | ~20-50MB |
No heavy ML frameworks like PyTorch or TensorFlow. ONNX Runtime Web is pure WASM/JS and runs inside the plugin sandbox without needing native modules. Image preprocessing is handled by Joplin's built-in joplin.imaging API, so no sharp dependency needed.
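A sketch of the zero-shot scoring path with onnxruntime-web. It assumes the image encoder is exported to ONNX and that L2-normalized text embeddings for the label vocabulary are precomputed offline and shipped as data; all names and shapes below are illustrative. (Real CLIP scoring applies a temperature and softmax; raw cosine similarity is shown for brevity.)

```typescript
import * as ort from 'onnxruntime-web';

// Score every vocabulary label against one preprocessed image.
async function scoreLabels(
  modelUrl: string,
  pixels: Float32Array,                        // preprocessed CHW image tensor data
  labelEmbeddings: Map<string, Float32Array>,  // one L2-normalized vector per label
): Promise<Array<{ name: string; confidence: number }>> {
  const session = await ort.InferenceSession.create(modelUrl);
  const input = new ort.Tensor('float32', pixels, [1, 3, 224, 224]);
  const output = await session.run({ [session.inputNames[0]]: input });
  const imageVec = output[session.outputNames[0]].data as Float32Array;

  // Cosine similarity: label embeddings are pre-normalized, so only the image
  // vector's norm is needed.
  const norm = Math.hypot(...imageVec);
  const scores = [...labelEmbeddings].map(([name, vec]) => {
    let dot = 0;
    for (let i = 0; i < vec.length; i++) dot += imageVec[i] * vec[i];
    return { name, confidence: dot / norm };
  });
  return scores.sort((a, b) => b.confidence - a.confidence);
}
```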
Potential Challenges
| Challenge | Mitigation |
|---|---|
| Model size vs accuracy tradeoff | Start with MobileCLIP (~25MB), offer larger CLIP as opt-in download. Benchmark both during Week 1-2 |
| Performance on low-end hardware | Run inference off the main plugin execution path (worker-based processing in the plugin renderer process) so the UI stays responsive. Process images one at a time with configurable concurrency |
| Label quality for niche content | Let users customize the label vocabulary. Cloud providers handle unusual images better |
| Search integration | Will investigate indexed userData keys or a new dedicated plugin API method during community bonding. Fallback: note-level tags. No note body mutation |
| Offline model distribution | Models are not bundled with the plugin. Downloaded on first use with checksum verification, resumable, and cached locally. Only a tiny bootstrap ships with the package |
| Privacy concerns with cloud providers | Local is the default, always. Cloud needs explicit opt-in plus an API key. There'll be a clear warning in settings about data leaving your device |
4. Implementation Plan
Community Bonding
- Dig into Joplin's plugin development workflow and testing setup
- Validate that ONNX Runtime Web (WASM) runs inside the plugin sandbox
- Benchmark candidate models (MobileCLIP-S1, MobileCLIP-S2, CLIP ViT-B/32) on a representative image set
- Pick the final model based on size/accuracy/speed tradeoffs
- Set up the dev environment with automated tests
- Investigate the search integration strategy with mentors: prototype userData indexing in SearchEngine, evaluate tradeoffs of the candidate designs (indexed userData keys vs. a new dedicated plugin API vs. a resource field)
- Decision checkpoint: finalize the search integration approach with mentors before Week 3 implementation begins
Week 1-2: Core Plugin Skeleton + Local Provider
- Set up the plugin project structure (manifest, settings, entry point)
- Implement resource detection (onNoteChange for new images, onResourceChange for updates, periodic fallback scanner)
- Build the processing queue with deduplication and rate limiting
- Integrate ONNX Runtime Web (WASM) with the chosen model
- Build the image preprocessing pipeline using the joplin.imaging API
- Implement the LabelProvider interface and the local provider
- Milestone: plugin can detect new images and generate labels locally, visible in the console
Week 3-4: Storage + Search + Cloud Provider
- Wire up userData-based label storage on resources
- Build the label normalizer (synonym mapping, confidence filtering, deduplication)
- Implement search integration using the approach finalized during community bonding, with the tag-based fallback
- Build the cloud provider (OpenAI Vision API) with secure API key storage
- Add provider selection in settings
- Milestone: labels stored on resources, search path implemented and validated, cloud provider working
Week 5-6: UI Panel + Settings
- Build the sidebar panel with label badges for the current note's images
- Add manual label editing (add/remove labels)
- Add processing status indicators (queued/processing/done/error)
- Build the settings UI: provider selection, model config, confidence threshold, max labels, cloud API key
- Add "Re-scan" and "Scan all images in note" commands
- Milestone: full UI with settings, manual editing, and status tracking
Midterm Evaluation
- Working plugin with local + cloud labeling, storage, search, and UI
- Demo to mentors, gather feedback
- Write midterm progress report
Week 7-8: Polish + Edge Cases + Performance
- Handle edge cases: encrypted resources, large images, unsupported formats, sync conflicts
- Optimize performance: worker-based off-main-path inference, lazy model loading, image caching
- Add a bulk processing command ("Label all unlabeled images")
- Add a progress bar for bulk operations
- Handle label merging during sync (timestamp-based via userData merge)
- Milestone: solid plugin that covers the tricky cases
Week 9-10: Testing + Documentation
- Write Jest tests:
  - Unit tests for the label normalizer, queue, and provider interface
  - Integration tests for storage and retrieval
  - Mock-based tests for the cloud provider
- Write user docs: installation, configuration, usage guide
- Write developer docs: architecture, how to add new providers
- Performance benchmarks: time per image, memory usage, model loading time
- Milestone: full test coverage, complete documentation
Week 11-12: Final Polish + Submission
- Address mentor feedback from final review
- Final bug fixes and cleanup
- Prepare the plugin for publishing to the Joplin plugin repository
- Write the final GSoC report
- Submit the final work product and final mentor evaluation
- Milestone: plugin published and ready for users
5. Deliverables
Implemented Features
- Joplin plugin: joplin-plugin-ai-image-labels
- Local AI inference using ONNX Runtime Web/WASM (MobileCLIP), no cloud dependency, no native modules
- Optional cloud provider support (OpenAI Vision, Claude Vision)
- Automatic labeling on image attachment
- Label storage via the userData API with sync support
- Search integration via indexed userData or a new plugin API, with a note-tag fallback to ensure usable search within the project timeline
- Sidebar panel showing labels per image with confidence scores
- Manual label editing (add/remove/rename)
- Bulk labeling command for existing images
- Settings UI for full configuration
Tests
- Unit tests for all core modules (normalizer, queue, providers, storage)
- Integration tests for the end-to-end labeling pipeline
- Documented performance benchmarks
Documentation
- User guide: installation, setup, configuration
- Developer guide: architecture, adding custom providers
- README with screenshots and usage examples
6. Availability
- Weekly availability: 30-35 hours per week dedicated to GSoC
- Time zone: IST (UTC+5:30)
- Other commitments: no conflicting internships or jobs. University coursework will not affect committed GSoC hours.
