Author : ASHUTOSH
Google Docs link: Google Summer of Code Draft
GitHub profile: https://github.com/Ashutoshx7
Prototype repo: joplin-shared-embedding-infrastructure
AI Disclosure & How AI Was Used
This project involved a collaborative process between my own thinking and AI assistance.
I began by writing down initial points and ideas based on my own research and planning. From there, I used AI to help build a prototype, using prompts to progressively reach each milestone. As Jonas (creator of Debian Pure Blends) advised me: no matter how much research or planning you do, there are always things you won't anticipate until you're actually in the middle of building. That's not a failure; that's the nature of the process.
As I built the prototype, new problems and solutions surfaced that I hadn't originally considered. I documented these as I went, and by the end of the prototyping phase, I had a much richer and more grounded set of points than I started with, informed by real experience rather than just upfront planning.
AI was then used to help me articulate and write up these learnings clearly. The ideas, discoveries, and direction remained my own; AI served as a tool to help express them effectively.
Just as a car doesn't choose the destination (the driver does), AI here was simply the vehicle that helped me get there faster. It accelerated my workflow, but the thinking, the decisions, and the vision behind this project were entirely mine.
Introduction
My name is Ashutosh Singh and I am currently pursuing a Bachelor's degree in Computer Science at the Indian Institute of Information Technology (IIIT) Lucknow. I secured admission through the Joint Entrance Examination (JEE Main), one of the most competitive engineering entrance examinations in India, achieving an All India Rank of approximately 8,500 among more than 800,000 candidates.
You can find more about me at the end of this document.
Why Joplin
Every tool I use daily from my browser (Zen) to my IDE (Zed) is open source. That's not a coincidence; it's a deliberate choice. Better note-taking and scheduling has always been a priority for me, and I've tried plenty of options: Notion, Sunsama, Todoist. They all worked, but none of them were open source, and none of them ever will be.
When I came across Joplin in September 2025, I downloaded the mobile app immediately. What struck me first was something simple: it runs everywhere. But the more I used it, the more I appreciated what it actually stood for: your notes, your machine, your data.
Open Source & Development Experience
Extralit - Open Source Contribution (v0.4.0 Release)
- Contributed to the official v0.4.0 release of Extralit; credited as a key contributor alongside the project maintainer.
- Co-authored PR #57, a comprehensive overhaul of the Extralit CLI, migrating the entire command structure from Argilla V1 to V2 using Typer, with modular command modules for datasets, users, workspaces, schemas, files, and documents.
- Implemented full CRUD support for workspace schemas including Pandera-based serialization, versioning, and dataset sharing via CLI and Python API.
Tech Stack: Python, Typer, Argilla V2, Pandera
Links: v0.4.0 release, PR #57
Vengeance UI (Vercel Open Source Program)
- Part of the Vercel Open Source Program, Winter 2026 cohort.
- Engineered reusable React + TypeScript components and an MDX-based documentation platform with interactive previews.
- Scaled to 15,000+ monthly users and grew the project to 600+ GitHub stars, with external community contributions (37 Forks).
- Backed by Vercel's Open Source Program, recognizing the project's impact and community adoption.
Tech Stack: TypeScript, Next.js, Tailwind CSS, Framer Motion, Model Context Protocol
KDE
- Built QML/JavaScript-based dataset editors for multiple GCompris activities, enabling creation and validation of fixed and randomized datasets.
- Refactored legacy dataset formats into a unified, extensible schema, reducing parsing complexity and long-term maintenance overhead.
- Designed reusable QML UI components and implemented editor-level validation.
Tech Stack: QML, JavaScript
Industry & Research Experience
C4GT/DMP 2025 - Beckn (May 2025 – Aug 2025)
- Unified vector databases for 100K+ embeddings, enabling sub-150ms semantic search latency.
- Improved query accuracy by 70% through ranking optimization and intent recognition.
- Built an ETL pipeline processing 10K+ records/hour with 85% noise reduction.
- Developed an AI platform to track 100+ Indian Constitution amendments using NLP-based summarization.
Tech Stack: Python, NLP, Vector Databases, ETL
SuperKalam (YC 23) (September 2025 – December 2025)
- Improved retrieval and semantic search quality by refining ElasticSearch indexing, embedding generation, and Qdrant schemas, resulting in a 20% lift in search relevance metrics.
- Reduced inference costs by 35% through systematic model migration from OpenAI to Vertex AI Gemini.
- Built and maintained LLM evaluation systems to benchmark response quality, grounding accuracy, latency, and cost tradeoffs.
Tech Stack: Python, TypeScript, ElasticSearch, Qdrant, Vertex AI, OpenAI
Project Summary
The Problem
Joplin has 5 AI projects this year: Semantic Search, Chat With Notes, Auto-Categorization, Note Graphs, and Image Labeling. Every single one needs the same pipeline:
Split notes into chunks → Generate embeddings → Store vectors → Retrieve by similarity
Without shared infrastructure, each project rebuilds this independently:
- 5× memory: five model copies (5 × 127 MB = 635 MB)
- 5× compute: 2,000 notes × 5 pipelines = 10,000 embedding calls
- 5× bugs: five different chunking implementations
- 5× maintenance: model upgrades, migrations, and WASM compatibility all duplicated
This project is the unified backbone. One core service inside Joplin that all AI plugins consume. Built once, built well.
Architecture
The infrastructure lives in packages/lib/services/embedding/ as a core Joplin service, not a plugin. Consumer plugins call EmbeddingService.instance() with zero cross-plugin dependencies.
Why Core, Not Plugin
Plugins cannot depend on other plugins in Joplin's architecture. Making this a core service means every plugin gets access automatically, no install ordering issues, no missing dependency errors, no version conflicts.
Working Prototype
This is not just a plan: the core service is coded and compiles with zero TypeScript errors. (As disclosed above, AI was used to build the prototype quickly.)
| Component | Location | Lines | Purpose |
|---|---|---|---|
| EmbeddingService.ts | packages/lib/services/embedding/ | 265 | Singleton; full public API |
| VectorStore.ts | packages/lib/services/embedding/ | 275 | sql.js storage, cosine_sim, BLOBs |
| RetrievalEngine.ts | packages/lib/services/embedding/ | 190 | RRF + RSE + decomposition + reranking |
| ChunkingEngine.ts | packages/lib/services/embedding/ | 150 | Markdown-aware heading-based chunker |
| EmbeddingProvider.ts | packages/lib/services/embedding/ | 134 | Ollama / OpenAI / Local providers |
| Total core | | 1,014 | |
| Sample consumer plugin | sample-consumer/ | 158 | Related Notes sidebar; proves the API |
| Standalone prototype | joplin-ai-search/ | 1,700+ | Full prototype with 30 passing tests |
Component Deep-Dive
ChunkingEngine: Markdown-Aware Splitting
Raw text splitting breaks headings, code blocks, and topic boundaries. The chunker splits at markdown heading boundaries and preserves context with breadcrumb prefixes:
| Decision | Why |
|---|---|
| 350-token max | BGE-small max is 512. Reserve ~30% for title/heading prefix |
| 50-word overlap | Concepts at chunk boundaries captured in both chunks |
| Code block preservation | splitByHeadings tracks ``` fences so it never splits mid-code |
| SHA-256 content hash | O(1) change detection; skip re-embedding unchanged chunks |
| Heading breadcrumb prefix | Chunk text includes "Note Title > Heading" so the embedding model knows context |
| Long-line truncation | Cap lines at 500 chars to prevent bloated chunks |
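The splitting and breadcrumb decisions above can be sketched roughly like this (a simplified illustration, not the actual ChunkingEngine code; token-based sizing and the 50-word overlap are omitted):

```typescript
interface Chunk {
  text: string;     // chunk body, prefixed with a "Note Title > Heading" breadcrumb
  heading: string;  // nearest heading at the split point ("" before any heading)
}

// Split markdown at heading boundaries, never inside ``` code fences.
function splitByHeadings(noteTitle: string, markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let heading = "";
  let inFence = false;

  const flush = () => {
    const body = current.join("\n").trim();
    if (body.length > 0) {
      // Breadcrumb prefix gives the embedding model document context.
      const crumb = heading ? `${noteTitle} > ${heading}` : noteTitle;
      chunks.push({ text: `${crumb}\n${body}`, heading });
    }
    current = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    if (!inFence && /^#{1,6}\s/.test(line)) {
      flush(); // close the previous chunk at the heading boundary
      heading = line.replace(/^#+\s*/, "");
      continue;
    }
    // Long-line truncation: cap at 500 chars to prevent bloated chunks.
    current.push(line.length > 500 ? line.slice(0, 500) : line);
  }
  flush();
  return chunks;
}
```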
EmbeddingProvider: Provider Abstraction
Three providers, hot-swappable via settings:
| Provider | How | Use case |
|---|---|---|
| Ollama | HTTP localhost:11434 | Local server, any model (nomic-embed-text, BGE, etc.) |
| OpenAI | HTTPS API | Cloud, best quality |
| Local (GSoC deliverable) | Transformers.js WASM | Offline, zero cost, no server needed |
All providers L2-normalize embeddings so dot product == cosine similarity:
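A minimal sketch of that normalization step (helper names hypothetical):

```typescript
// L2-normalize a vector so that the dot product of two normalized
// vectors equals their cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  if (norm === 0) return v.slice(); // avoid division by zero for all-zero vectors
  return v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```

With normalized vectors stored, the VectorStore's cosine_sim reduces to a plain dot product.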
Users are never locked into a single choice because the provider abstraction supports any ONNX-compatible model.
VectorStore: sql.js with a Custom cosine_sim()
sql.js compiles SQLite to pure WebAssembly with zero native dependencies; it runs in Electron, mobile WebViews, and any browser, with no sandbox restrictions.
Schema: chunk_embeddings and note_embeddings tables storing vectors as BLOBs, plus an index_meta table for rebuild detection.
Custom cosine_sim SQL function:
Usage: SELECT *, cosine_sim(embedding, ?) AS score FROM chunk_embeddings ORDER BY score DESC LIMIT 20
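A sketch of what that scalar function computes (sql.js does expose Database.create_function for registering JavaScript functions as SQL functions; the decoding and wiring details here are illustrative):

```typescript
// Decode a BLOB (Uint8Array) back into the Float32Array it stores.
function decodeVector(blob: Uint8Array): Float32Array {
  return new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4);
}

// Cosine similarity over two stored vectors; with L2-normalized inputs
// this is just the dot product.
function cosineSim(aBlob: Uint8Array, bBlob: Uint8Array): number {
  const a = decodeVector(aBlob);
  const b = decodeVector(bBlob);
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Registered once per connection, e.g.:
// db.create_function("cosine_sim", cosineSim);
```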
BLOB vs JSON: for a 384-dim vector, JSON = 3.8 KB, BLOB = 1.5 KB. For 5,000 chunks the BLOB form is ~2.5× smaller.
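The size arithmetic can be checked directly (names illustrative; JSON size varies with float precision):

```typescript
// Store a 384-dim embedding as a raw Float32 BLOB rather than JSON text.
function encodeVector(v: number[]): Uint8Array {
  return new Uint8Array(Float32Array.from(v).buffer);
}

const dims = 384;
const vec: number[] = [];
for (let i = 0; i < dims; i++) vec.push(Math.sin(i)); // deterministic sample vector

const blobBytes = encodeVector(vec).byteLength;   // 384 * 4 = 1,536 bytes, always
const jsonBytes = JSON.stringify(vec).length;     // several KB of decimal text
```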
Notebook filtering: queries can be scoped to a single notebook (see the notebookId search parameter below).
Cache-first philosophy: the vector database is treated as a regenerable cache. Any of these triggers causes a full rebuild:
- Embedding model change
- Vector dimension change
- Chunk size or overlap change
- Database corruption
- User-initiated rebuild
All detected automatically via the index_meta table. Before rebuilding, the user sees estimated tokens, time, and cost.
Persistence: sql.js runs in-memory; the database is flushed to disk every 30 seconds and on shutdown.
RetrievalEngine: Hybrid Search Pipeline
Four retrieval improvements, all implemented:
1. Hybrid Scoring (RRF)
Hybrid balance slider:
hybridBalance = 0 → keyword only (traditional search), 1.0 → vector only (pure semantic). Users tune this based on their workflow. Semantic search means "kitty" finds notes about "cat"; it just works, transparently.
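The fusion step might look like this (a sketch; k = 60 as planned for Week 5, function name hypothetical):

```typescript
// Reciprocal Rank Fusion: score(doc) = sum over rankings of weight / (k + rank).
// hybridBalance weights the vector ranking against the keyword ranking.
function rrfFuse(
  keywordRanked: string[],  // doc ids, best first (FTS keyword search)
  vectorRanked: string[],   // doc ids, best first (cosine similarity)
  hybridBalance: number,    // 0 = keyword only, 1 = vector only
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  const add = (ids: string[], weight: number) => {
    ids.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  };
  add(keywordRanked, 1 - hybridBalance);
  add(vectorRanked, hybridBalance);
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document appearing in both rankings accumulates score from each, which is why hybrid search rewards results that match both ways.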
2. Reranking
Cross-encoder reranking via Ollama/OpenAI. Togglable: essential for smaller on-device models (2-4B), where context management matters, but optional for larger cloud models (GPT-4, Gemini) that already understand relevance.
3. Query Decomposition
"Find notes about API design with security considerations" → ["API design", "security considerations"] → each sub-query runs independently → results merged via RRF.
4. Relevant Segment Extraction (RSE)
Adjacent chunks (gap ≤ 2) are merged into coherent passages. Instead of returning fixed-size blocks, RSE dynamically combines related chunks into longer, more useful segments.
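The merging rule is small enough to sketch (illustrative):

```typescript
// Merge chunk indices whose gap is <= maxGap into contiguous segments,
// so adjacent relevant chunks become one coherent passage.
function mergeSegments(chunkIndices: number[], maxGap = 2): Array<[number, number]> {
  const sorted = chunkIndices.slice().sort((a, b) => a - b);
  const segments: Array<[number, number]> = [];
  for (const idx of sorted) {
    const last = segments[segments.length - 1];
    if (last && idx - last[1] <= maxGap) {
      last[1] = idx;               // extend the current segment
    } else {
      segments.push([idx, idx]);   // start a new segment
    }
  }
  return segments;
}
```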
Two Embedding Levels: Chunk-Based and Note-Based
Different AI projects need different granularities:
| Project Type | Granularity | What They Need |
|---|---|---|
| Search + Chat | Chunk-level | Specific passages matching a query |
| Categorization + Graphs | Note-level | Whole-note vectors for clustering/similarity |
The infrastructure serves both. Note-level embeddings are the normalized mean of chunk embeddings:
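A sketch of that computation (function name hypothetical):

```typescript
// Note-level embedding: element-wise mean of the note's chunk embeddings,
// re-normalized so it lives on the same unit sphere as the chunk vectors.
function noteEmbedding(chunkVectors: number[][]): number[] {
  const dims = chunkVectors[0].length;
  const mean = new Array<number>(dims).fill(0);
  for (const v of chunkVectors) {
    for (let i = 0; i < dims; i++) mean[i] += v[i] / chunkVectors.length;
  }
  const norm = Math.sqrt(mean.reduce((s, x) => s + x * x, 0)) || 1;
  return mean.map((x) => x / norm);
}
```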
No consumer needs to understand the other level. Search uses chunk_embeddings. Categorization uses note_embeddings. The infrastructure manages both.
EmbeddingService: The Public API
Singleton service consumed by all plugins. Every method returns JSON-serializable data:
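A hedged sketch of that surface, with the method list inferred from the tables in this document (final signatures may differ):

```typescript
interface SearchResult {
  noteId: string;
  title: string;
  snippet: string;
  score: number;
  headingPath: string;
}

interface SearchOptions {
  hybrid?: boolean;
  hybridBalance?: number; // 0 = keyword only, 1 = vector only
  limit?: number;
  notebookId?: string;    // notebook-scoped filtering
}

// Public surface of the singleton as described in this proposal.
// All return types are JSON-serializable.
interface IEmbeddingService {
  put(noteId: string): Promise<void>;
  search(query: string, options?: SearchOptions): Promise<SearchResult[]>;
  embed(text: string): Promise<number[]>;
  getNoteEmbedding(noteId: string): Promise<number[]>;
  getAllNoteEmbeddings(): Promise<Record<string, number[]>>;
  findSimilarNotes(noteId: string, limit?: number): Promise<SearchResult[]>;
  buildIndex(onProgress?: (current: number, total: number, stage: string) => void): Promise<void>;
  clearIndex(): Promise<void>;
  isReady(): boolean;
}

// Tiny stub showing the contract in use (not the real implementation):
const stub: Pick<IEmbeddingService, "isReady"> = { isReady: () => false };
```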
How Each Downstream Project Uses This
| GSoC Project | Methods They Call | What They Skip Building |
|---|---|---|
| AI Search | search() | Chunking, embedding, storage, retrieval |
| Chat With Notes | search() + embed() | Chunking, embedding, storage, retrieval |
| Auto-Categorization | getNoteEmbedding() + getAllNoteEmbeddings() | Chunking, embedding, storage |
| Note Graphs | findSimilarNotes() + getAllNoteEmbeddings() | Chunking, embedding, storage, similarity |
Context budget for RAG: chunk size (350 tokens) × limit parameter = a predictable token budget. The Chat plugin controls the total context window; our API returns scored chunks that can be truncated to fit.
LLM Tool Compatibility
The API is designed as LLM-callable tools. An AI agent inside Joplin gets both search_notes (keyword + vector hybrid) and query_embeddings (pure semantic), and the LLM decides which to call based on the query:
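The two tools might be declared like this (schema shape illustrative; only the tool names come from this proposal):

```typescript
// Tool declarations an in-app agent could be given. The LLM picks a tool
// per query; both route into the shared EmbeddingService internally.
const tools = [
  {
    name: "search_notes",
    description: "Hybrid keyword + vector search over the user's notes.",
    parameters: {
      type: "object",
      properties: {
        query: { type: "string" },
        limit: { type: "number", description: "Max results (default 10)" },
      },
      required: ["query"],
    },
  },
  {
    name: "query_embeddings",
    description: "Pure semantic (vector-only) search over note chunks.",
    parameters: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
];
```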
Internally maps to: EmbeddingService.instance().search(query, { hybrid: true })
The API follows a "negative friction" design: search() returns noteId, title, snippet, score, and heading path in one call. No follow-up calls needed. Only three tools total, keeping LLM context usage minimal.
For external clients (Claude Desktop, etc.), these same methods can be wrapped in an MCP server (~50 lines). But inside Joplin, direct API calls are simpler and faster.
Sample Consumer: Related Notes Sidebar
A standalone plugin that proves the API:
The consumer has zero embedding code. If the infrastructure isn't available, it falls back gracefully.
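That fallback pattern might look like this (the command name follows the Week 7 aiSearch.* naming and is illustrative; the executor is injected so the sketch runs outside Joplin, whereas a real plugin would pass joplin.commands.execute):

```typescript
type CommandExecutor = (command: string, ...args: unknown[]) => Promise<unknown>;

// Fetch related notes through the shared infrastructure's command API,
// degrading to an empty panel if the infrastructure is missing or not ready.
async function relatedNotes(
  execute: CommandExecutor,
  noteId: string,
): Promise<Array<{ noteId: string; title: string; score: number }>> {
  try {
    const result = await execute("aiSearch.findSimilar", noteId, { limit: 5 });
    return result as Array<{ noteId: string; title: string; score: number }>;
  } catch {
    // Infrastructure not installed or not loaded yet: graceful fallback.
    return [];
  }
}
```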
Inter-Project Migration
Other GSoC projects build their own simple pipeline initially, targeting the same interface: put(note) and query(text). When the shared infrastructure is ready, they swap the backend (one line change):
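A sketch of that contract and the swap (type and class names hypothetical):

```typescript
// The contract both the stop-gap pipeline and the shared service implement.
interface EmbeddingBackend {
  put(noteId: string, text: string): Promise<void>;
  query(text: string, limit?: number): Promise<string[]>; // ranked note ids
}

// A project's own naive stop-gap: substring match, no vectors.
class NaiveBackend implements EmbeddingBackend {
  private docs = new Map<string, string>();
  async put(noteId: string, text: string): Promise<void> {
    this.docs.set(noteId, text);
  }
  async query(text: string, limit = 10): Promise<string[]> {
    return Array.from(this.docs.entries())
      .filter(([, body]) => body.toLowerCase().includes(text.toLowerCase()))
      .map(([id]) => id)
      .slice(0, limit);
  }
}

// The one-line migration: swap the constructor when the shared service ships.
const backend: EmbeddingBackend = new NaiveBackend();
// const backend: EmbeddingBackend = new SharedServiceBackend(); // later
```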
No project depends on this one. No project is blocked by this one. The API contract is the bridge.
Prior Art: Jarvis Plugin
The Jarvis plugin is the most mature embedding implementation in the Joplin ecosystem. Studying its v0.12.0 release informed several design decisions:
Jarvis Feature Integration
| Jarvis Feature | How We Apply It |
|---|---|
| Database based on note properties (syncs between devices) | Evaluate during GSoC: note-property storage vs. separate SQLite. Trade-off: sync bandwidth vs. query speed |
| Q8 quantization (4× smaller) | Add quantization option: 384-dim × 4 bytes = 1.5 KB → Q8 = 384 bytes/chunk |
| Mobile support (all code migrated) | Validates our sql.js WASM approach, which runs in a mobile WebView |
| Device profile with platform-aware tuning | Desktop = full model, mobile = quantized/smaller. Auto-detect platform |
| Excluded notes/folders | excludedNotebooks setting, skip indexing for user-excluded content |
| Progress display with stage messages | buildIndex(onProgress) callback reports (current, total, stage) |
| Strip AI-generated blocks from context | Skip marked sections when chunking to prevent self-referencing |
Why sql.js (Vector DB Comparison)
The community discussed several vector DB options. Here's why sql.js is the right choice for Joplin:
| Database | Type | Native Deps | Mobile | Joplin Sandbox | Verdict |
|---|---|---|---|---|---|
| sql.js | WASM SQLite | ✓ None | ✓ WebView | ✓ Works | Our choice |
| ChromaDB | Cloud/Server | ✗ Python | ✗ | ✗ | Requires external server |
| Milvus | Cloud/Server | ✗ gRPC | ✗ | ✗ | Distributed; overkill |
| Qdrant | Rust server/edge | ✗ Rust binary | ✗ | ✗ | Edge mode still needs a binary |
| Weaviate | Cloud/Server | ✗ Go | ✗ | ✗ | Requires external server |
| pgvector | PostgreSQL ext | ✗ libpq | ✗ | ✗ | Requires PostgreSQL |
| Pinecone | Cloud SaaS | ✓ None | ✓ API | ✗ | Cloud-only, costs money |
| sqlite-vec | SQLite ext | ✗ C | ? | ✗ | Native module breaks webpack sandbox |
Why sql.js wins: pure WASM, zero native deps, works in Electron, mobile WebViews, and any browser. A custom cosine_sim() SQL function gives us vector search without additional dependencies. The trade-off is a linear O(n) scan instead of HNSW indexing, which is acceptable for <10K notes (a scan takes <50 ms at 5,000 chunks).
Memory & Storage Budget
| Scale | Notes | Chunks (~4 per note) | BLOB Storage | Note Embeddings | Total DB | RAM (sql.js) |
|---|---|---|---|---|---|---|
| Small | 500 | 2,000 | 3.0 MB | 0.3 MB | ~4 MB | ~8 MB |
| Medium | 2,000 | 8,000 | 12.0 MB | 1.2 MB | ~15 MB | ~25 MB |
| Large | 5,000 | 20,000 | 30.0 MB | 3.0 MB | ~38 MB | ~55 MB |
Calculation: 384-dim × 4 bytes = 1,536 bytes per vector. With Q8 quantization (stretch goal): 384 bytes per vector, a 4× reduction.
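The Q8 stretch goal is a simple affine quantization; a sketch (the actual scheme Jarvis or this project uses may differ):

```typescript
// Q8 quantization: map each float to an int8 via a per-vector scale, so a
// 384-float (1,536-byte) vector shrinks to 384 bytes plus one stored scale.
function quantizeQ8(v: Float32Array): { scale: number; data: Int8Array } {
  let maxAbs = 0;
  for (let i = 0; i < v.length; i++) maxAbs = Math.max(maxAbs, Math.abs(v[i]));
  const scale = maxAbs / 127 || 1;
  const data = new Int8Array(v.length);
  for (let i = 0; i < v.length; i++) data[i] = Math.round(v[i] / scale);
  return { scale, data };
}

function dequantizeQ8(q: { scale: number; data: Int8Array }): Float32Array {
  const out = new Float32Array(q.data.length);
  for (let i = 0; i < q.data.length; i++) out[i] = q.data[i] * q.scale;
  return out;
}
```

The reconstruction error per element is bounded by half the scale, which is small for L2-normalized embeddings.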
Model memory: BGE-small-en-v1.5 ONNX = ~127 MB, loaded once. Shared across all consumers; this is the key saving vs. five separate copies (635 MB).
Embedding speed estimates:
- Local WASM: ~50 ms per chunk → 2,000 chunks in ~100 seconds
- Ollama (nomic-embed-text): ~15 ms per chunk → 2,000 chunks in ~30 seconds
- OpenAI (ada-002): ~5 ms per chunk → 2,000 chunks in ~10 seconds (+ API cost ~$0.02)
Community Feedback: Addressed
This design was shaped by extensive community discussion across multiple forum threads. Every concern raised has been addressed in the architecture above:
Core architecture: The mentors confirmed this should be part of the core app, not a standalone plugin, since plugins cannot depend on other plugins. Done: the service lives in packages/lib/services/embedding/. A sample consumer plugin (Related Notes sidebar, 13 KB) demonstrates that the API works without any cross-plugin dependency.
API design: The suggested interface (put(note), query(text), getNoteEmbedding(), findSimilarNotes()) has been implemented and expanded to 12 methods covering every downstream use case. Retrieval features (reranking, decomposition, RSE) live in the shared search() so consumers get the full pipeline automatically.
Inter-plugin data access: Instead of shared files, we use Joplin's command system (joplin.commands.execute()). Commands return structured JSON across plugins. If the infrastructure hasn't loaded when a consumer calls, try/catch handles it gracefully.
Centralized vector DB: One shared database prevents duplication, storage bloat, and processing cost. The vector DB is treated as a regenerable cache: a model change, config change, or corruption triggers an automatic rebuild. Users see estimated tokens/time/cost before any indexing starts.
Provider flexibility: Users choose Local (WASM), Ollama, or OpenAI. Any ONNX model works via the Ollama provider, including nomic-embed-text:137m as a lightweight fallback for embedded hardware.
Optionality: Every AI feature is disabled by default with individual on/off toggles. Reranking is togglable: essential for smaller on-device models (2-4B), optional for larger cloud models. The hybrid balance slider lets users tune keyword vs. semantic search.
Chunk vs. note embeddings: Search and Chat need chunk-level vectors; Categorization and Graphs need note-level vectors (mean of chunks). Both levels are stored and exposed through separate API methods.
LLM tool compatibility: The API is designed as search_notes + query_embeddings tools with "negative friction": one call returns everything the LLM needs. MCP wrapping is ~50 lines for external clients, but unnecessary inside Joplin.
Migration path: Other GSoC projects code to put(note)/query(text) from day one with their own simple pipeline. When the shared infrastructure is ready, they swap the backend.
Optionality & User Control
All AI features are strictly optional. The service does nothing until explicitly enabled.
| Control | Behavior |
|---|---|
| Enable AI Index toggle | Service disabled by default. No background processing until enabled |
| Provider dropdown | Local / Ollama / OpenAI; the user chooses |
| Reranking toggle | Off by default; enable for small on-device models |
| Decomposition toggle | Off by default; optional for complex queries |
| Hybrid balance slider | 0%=keyword only, 100%=vector only |
| Cost estimation | Shows tokens/time/cost BEFORE indexing starts |
| Excluded notebooks | Skip indexing for specific folders |
| Disable | Index kept on disk, no processing. Re-enable = instant |
| Uninstall | Delete embedding-index.sqlite (~15–50 MB) |
Timeline
Community Bonding (May 8 – June 1)
- Validate Transformers.js WASM in Joplin's Electron environment
- Discuss service integration points with mentors
- Refine core service based on code review feedback
Week 1: Core Service Wiring
- Integrate EmbeddingService singleton into BaseApplication startup sequence
- Register Joplin settings: provider selection, model name, enable/disable toggle
- Set up settings UI in Joplin's preferences panel
- Deliverable: Service initializes on app start, settings visible in UI
Week 2: Local WASM Provider
- Bundle Transformers.js with BGE-small-en-v1.5 ONNX model
- Implement LocalEmbeddingProvider with WASM inference
- Pipeline recycling (reinitialize every 80 calls to prevent memory fragmentation)
- Deliverable: embed("hello world") returns 384-dim vector via WASM
Week 3: End-to-End put()
- Connect the ChunkingEngine → EmbeddingProvider → VectorStore pipeline
- put(noteId) fetches note from Joplin DB, chunks, embeds, stores
- Implement clearIndex() and isReady()
- Milestone: EmbeddingService.instance().put(noteId) works end-to-end
Week 4: Incremental Indexing
- onNoteChange listener: auto-index on save
- SHA-256 content hash: skip re-embedding unchanged notes
- buildIndex(onProgress) with progress callback for full re-index
- Deliverable: Notes auto-indexed on save, progress displayed
Week 5: Hybrid Retrieval
- Wire RetrievalEngine into search() API
- Hybrid search: combine vector cosine_sim with Joplin's FTS4 keyword search
- RRF fusion (k=60) with configurable hybridBalance slider
- Deliverable: search(query) returns ranked results from both engines
Week 6: Advanced Retrieval
- Query decomposition: split complex queries into sub-queries
- RSE: merge adjacent relevant chunks into coherent passages
- Notebook-scoped filtering via notebookId parameter
- Milestone: Full retrieval pipeline operational
Midterm Evaluation (July 7)
Deliverable: Core service with incremental indexing, hybrid retrieval, and complete API
Week 7: Plugin API + Commands
- Register all aiSearch.* commands for plugin access
- Commands: put, search, embed, findSimilar, getNoteEmbedding, getAllNoteEmbeddings, getStats, isReady
- Type-safe JSON responses for all commands
- Deliverable: Other plugins can call the infrastructure via commands
Week 8: Search Panel UI
- Search panel with query input, result cards, similarity scores
- Snippet extraction with heading breadcrumbs
- Source labels: "keyword", "vector", "hybrid"
- Hybrid balance slider in the UI
- Deliverable: Working search panel in Joplin's sidebar
Week 9: Reranking + Settings Panel
- Cross-encoder reranking via Ollama generate or OpenAI chat
- Settings panel: provider config, reranking toggle, cost estimation
- Cost estimation: show token count, estimated time, and $ before indexing
- Milestone: Full-featured search with optional reranking
Week 10: Sample Consumer Plugin
- Related Notes sidebar: upgrade from the prototype
- Demonstrates: findSimilarNotes, getNoteEmbedding, graceful fallback
- Package as standalone .jpl (target: <20 KB, zero deps)
- Deliverable: Installable consumer plugin that proves the API
Week 11: Edge Cases + Platform Testing
- Handle: encrypted notes, image-only, empty, very long (>50K words), trash, excluded folders
- Strip AI-generated blocks from chunking input
- Cross-platform testing: Windows, macOS, Linux
- Performance profiling: 50, 500, 2000, 5000 notes
- Deliverable: Hardened service with cross-platform validation
Week 12: Documentation + Polish
- API reference documentation (all 12 methods)
- "How to build a consumer plugin" tutorial with code examples
- Device profile for platform-aware tuning (desktop vs. mobile)
- Final performance optimization pass
- Milestone: Production-ready with full documentation
Final Evaluation (August 25)
More about me
My motivation for this project comes from my personal journey and the experiences that shaped how I think about technology, learning, and design.
It started with a broken computer my school was discarding. I repaired it just to play games, but in doing so I unknowingly got my first real lesson in how hardware and systems work. That curiosity never left me. Growing up with an artist mother added a different dimension altogether. Being around her work shaped my instinct for creativity, visual design, and user experience in ways I did not fully realise until I started building software.
During school, a genuine interest in biology particularly in understanding the human brain eventually led me to artificial intelligence and machine learning. Learning that neural networks are inspired by how the brain processes information felt like two of my biggest interests finally making sense together.
When I began building and contributing to software, all of these interests started working together naturally. My systems knowledge helped me think about architecture and constraints, while my design instincts kept me focused on usability and the learner's experience. Over time, I developed a real appreciation for the constructionist philosophy: the idea that people learn best by making things.
This project sits right at the intersection of AI, software systems, and UI/UX, exactly the space I have been growing into. It is a direct expression of that philosophy and feels like a natural continuation of the path I have been on.