Author : ASHUTOSH
Google Docs link: Google Summer of Code Draft
GitHub profile: https://github.com/Ashutoshx7
Prototype repo: joplin-shared-embedding-infrastructure
AI Disclosure & How AI Was Used
This project involved a collaborative process between my own thinking and AI assistance.
I began by writing down initial points and ideas based on my own research and planning. From there, I used AI to help build a prototype, using prompts to progressively reach each milestone. As Jonas (creator of Debian Pure Blends) advised me: no matter how much research or planning you do, there are always things you won't anticipate until you're actually in the middle of building. That's not a failure; that's the nature of the process.
As I built the prototype, new problems and solutions surfaced that I hadn't originally considered. I documented these as I went, and by the end of the prototyping phase, I had a much richer and more grounded set of points than I started with, informed by real experience rather than just upfront planning.
AI was then used to help me articulate and write up these learnings clearly. The ideas, discoveries, and direction remained my own; AI served as a tool to help express them effectively.
Just as a car doesn't choose the destination (the driver does), AI here was simply the vehicle that helped me get there faster. It accelerated my workflow, but the thinking, the decisions, and the vision behind this project were entirely mine.
Introduction
My name is Ashutosh Singh and I am currently pursuing a Bachelor's degree in Computer Science at the Indian Institute of Information Technology (IIIT) Lucknow. I secured admission through the Joint Entrance Examination (JEE Main), one of the most competitive engineering entrance examinations in India, achieving an All India Rank of approximately 8,500 among more than 800,000 candidates.
You can find more about me at the end of this document.
Why Joplin
Every tool I use daily from my browser (Zen) to my IDE (Zed) is open source. That's not a coincidence; it's a deliberate choice. Better note-taking and scheduling has always been a priority for me, and I've tried plenty of options: Notion, Sunsama, Todoist. They all worked, but none of them were open source, and none of them ever will be.
When I came across Joplin in September 2025, I downloaded the mobile app immediately. What struck me first was something simple: it runs everywhere. But the more I used it, the more I appreciated what it actually stood for: your notes, your machine, your data.
Open Source & Development Experience
Extralit - Open Source Contribution (v0.4.0 Release)
- Contributed to the official v0.4.0 release of Extralit; credited as a key contributor alongside the project maintainer.
- Co-authored PR #57, a comprehensive overhaul of the Extralit CLI, migrating the entire command structure from Argilla V1 to V2 using Typer, with modular command modules for datasets, users, workspaces, schemas, files, and documents.
- Implemented full CRUD support for workspace schemas including Pandera-based serialization, versioning, and dataset sharing via CLI and Python API.
Tech Stack: Python, Typer, Argilla V2, Pandera
Links: v0.4.0 release, PR #57
Vengeance UI (Vercel Open Source Program)
- Part of the Vercel Open Source Program, Winter 2026 cohort.
- Engineered reusable React + TypeScript components and an MDX-based documentation platform with interactive previews.
- Scaled to 15,000+ monthly users and grew the project to 600+ GitHub stars, with external community contributions (37 Forks).
- Backed by Vercel's Open Source Program, recognizing the project's impact and community adoption.
Tech Stack: TypeScript, Next.js, Tailwind CSS, Framer Motion, Model Context Protocol
KDE
- Built QML/JavaScript-based dataset editors for multiple GCompris activities, enabling creation and validation of fixed and randomized datasets.
- Refactored legacy dataset formats into a unified, extensible schema, reducing parsing complexity and long-term maintenance overhead.
- Designed reusable QML UI components and implemented editor-level validation.
Tech Stack: QML, JavaScript
Industry & Research Experience
C4GT/DMP 2025 - Beckn (May 2025 – Aug 2025)
- Unified vector databases for 100K+ embeddings, enabling sub-150ms semantic search latency.
- Improved query accuracy by 70% through ranking optimization and intent recognition.
- Built an ETL pipeline processing 10K+ records/hour with 85% noise reduction.
- Developed an AI platform to track 100+ Indian Constitution amendments using NLP-based summarization.
Tech Stack: Python, NLP, Vector Databases, ETL
SuperKalam (YC 23) (September 2025 – December 2025)
- Improved retrieval and semantic search quality by refining ElasticSearch indexing, embedding generation, and Qdrant schemas, resulting in a 20% lift in search relevance metrics.
- Reduced inference costs by 35% through systematic model migration from OpenAI to Vertex AI Gemini.
- Built and maintained LLM evaluation systems to benchmark response quality, grounding accuracy, latency, and cost tradeoffs.
Tech Stack: Python, TypeScript, ElasticSearch, Qdrant, Vertex AI, OpenAI
Project Summary
The Problem
Joplin has 5 AI projects this year: Semantic Search, Chat With Notes, Auto-Categorization, Note Graphs, and Image Labeling. Every single one needs the same pipeline:
Split notes into chunks → Generate embeddings → Store vectors → Retrieve by similarity
Without shared infrastructure, each project rebuilds this independently:
- 5× memory: five model copies (5 × 127 MB = 635 MB)
- 5× compute: 2,000 notes × 5 pipelines = 10,000 embedding calls
- 5× bugs: five different chunking implementations
- 5× maintenance: model upgrades, migrations, and WASM compatibility all duplicated
This project is the unified backbone. One core service inside Joplin that all AI plugins consume. Built once, built well.
Architecture
The infrastructure lives in packages/lib/services/embedding/ as a core Joplin service, not a plugin. Consumer plugins call EmbeddingService.instance() with zero cross-plugin dependencies.
Why Core, Not Plugin
Plugins cannot depend on other plugins in Joplin's architecture. Making this a core service means every plugin gets access automatically, no install ordering issues, no missing dependency errors, no version conflicts.
Working Prototype
This is not just a plan: the core service is coded and compiles with zero TypeScript errors. (As disclosed above, AI was used to build the prototype quickly.)
| Component | Location | Lines | Purpose |
|---|---|---|---|
| EmbeddingService.ts | packages/lib/services/embedding/ | 265 | Singleton; full public API |
| VectorStore.ts | packages/lib/services/embedding/ | 275 | sql.js storage, cosine_sim, BLOBs |
| RetrievalEngine.ts | packages/lib/services/embedding/ | 190 | RRF + RSE + decomposition + reranking |
| ChunkingEngine.ts | packages/lib/services/embedding/ | 150 | Markdown-aware heading-based chunker |
| EmbeddingProvider.ts | packages/lib/services/embedding/ | 134 | Ollama / OpenAI / Local providers |
| Total core | | 1,014 | |
| Sample consumer plugin | sample-consumer/ | 158 | Related Notes sidebar; proves the API |
| Standalone prototype | joplin-ai-search/ | 1,700+ | Full prototype with 30 passing tests |
Component Deep-Dive
ChunkingEngine: Markdown-Aware Splitting
Raw text splitting breaks headings, code blocks, and topic boundaries. The chunker splits at markdown heading boundaries and preserves context with breadcrumb prefixes:
| Decision | Why |
|---|---|
| 350-token max | BGE-small max is 512. Reserve ~30% for title/heading prefix |
| 50-word overlap | Concepts at chunk boundaries captured in both chunks |
| Code block preservation | splitByHeadings tracks ``` fences so it never splits mid-code |
| SHA-256 content hash | O(1) change detection; skip re-embedding unchanged chunks |
| Heading breadcrumb prefix | Chunk text includes "Note Title > Heading" so the embedding model knows context |
| Long-line truncation | Cap lines at 500 chars to prevent bloated chunks |
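The splitting and breadcrumb decisions above can be sketched roughly like this (a simplified illustration, not the actual ChunkingEngine code; token-based sizing and the 50-word overlap are omitted):

```typescript
interface Chunk {
  text: string;     // chunk body, prefixed with a "Note Title > Heading" breadcrumb
  heading: string;  // nearest heading at the split point ("" before any heading)
}

// Split markdown at heading boundaries, never inside ``` code fences.
function splitByHeadings(noteTitle: string, markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  let current: string[] = [];
  let heading = "";
  let inFence = false;

  const flush = () => {
    const body = current.join("\n").trim();
    if (body.length > 0) {
      // Breadcrumb prefix gives the embedding model document context.
      const crumb = heading ? `${noteTitle} > ${heading}` : noteTitle;
      chunks.push({ text: `${crumb}\n${body}`, heading });
    }
    current = [];
  };

  for (const line of markdown.split("\n")) {
    if (line.trimStart().startsWith("```")) inFence = !inFence;
    if (!inFence && /^#{1,6}\s/.test(line)) {
      flush(); // close the previous chunk at the heading boundary
      heading = line.replace(/^#+\s*/, "");
      continue;
    }
    // Long-line truncation: cap at 500 chars to prevent bloated chunks.
    current.push(line.length > 500 ? line.slice(0, 500) : line);
  }
  flush();
  return chunks;
}
```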
EmbeddingProvider: Provider Abstraction
Three providers, hot-swappable via settings:
| Provider | How | Use case |
|---|---|---|
| Ollama | HTTP localhost:11434 | Local server, any model (nomic-embed-text, BGE, etc.) |
| OpenAI | HTTPS API | Cloud, best quality |
| Local (GSoC deliverable) | Transformers.js WASM | Offline, zero cost, no server needed |
All providers L2-normalize embeddings so dot product == cosine similarity:
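A minimal sketch of that normalization step (helper names hypothetical):

```typescript
// L2-normalize a vector so that the dot product of two normalized
// vectors equals their cosine similarity.
function l2Normalize(v: number[]): number[] {
  const norm = Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  if (norm === 0) return v.slice(); // avoid division by zero for all-zero vectors
  return v.map((x) => x / norm);
}

function dot(a: number[], b: number[]): number {
  let s = 0;
  for (let i = 0; i < a.length; i++) s += a[i] * b[i];
  return s;
}
```

With normalized vectors stored, the VectorStore's cosine_sim reduces to a plain dot product.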
Users are never locked into a single choice because the provider abstraction supports any ONNX-compatible model.
VectorStore: sql.js with a Custom cosine_sim()
sql.js compiles SQLite to pure WebAssembly with zero native dependencies; it runs in Electron, mobile WebViews, and any browser, with no sandbox restrictions.
Schema: chunk_embeddings and note_embeddings tables storing vectors as BLOBs, plus an index_meta table for rebuild detection.
Custom cosine_sim SQL function:
Usage: SELECT *, cosine_sim(embedding, ?) AS score FROM chunk_embeddings ORDER BY score DESC LIMIT 20
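A sketch of what that scalar function computes (sql.js does expose Database.create_function for registering JavaScript functions as SQL functions; the decoding and wiring details here are illustrative):

```typescript
// Decode a BLOB (Uint8Array) back into the Float32Array it stores.
function decodeVector(blob: Uint8Array): Float32Array {
  return new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4);
}

// Cosine similarity over two stored vectors; with L2-normalized inputs
// this is just the dot product.
function cosineSim(aBlob: Uint8Array, bBlob: Uint8Array): number {
  const a = decodeVector(aBlob);
  const b = decodeVector(bBlob);
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Registered once per connection, e.g.:
// db.create_function("cosine_sim", cosineSim);
```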
BLOB vs JSON: for a 384-dim vector, JSON = 3.8 KB, BLOB = 1.5 KB. For 5,000 chunks the BLOB form is ~2.5× smaller.
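The size arithmetic can be checked directly (names illustrative; JSON size varies with float precision):

```typescript
// Store a 384-dim embedding as a raw Float32 BLOB rather than JSON text.
function encodeVector(v: number[]): Uint8Array {
  return new Uint8Array(Float32Array.from(v).buffer);
}

const dims = 384;
const vec: number[] = [];
for (let i = 0; i < dims; i++) vec.push(Math.sin(i)); // deterministic sample vector

const blobBytes = encodeVector(vec).byteLength;   // 384 * 4 = 1,536 bytes, always
const jsonBytes = JSON.stringify(vec).length;     // several KB of decimal text
```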
Notebook filtering: queries can be scoped to a single notebook (see the notebookId search parameter below).
Cache-first philosophy: the vector database is treated as a regenerable cache. Any of these triggers causes a full rebuild:
- Embedding model change
- Vector dimension change
- Chunk size or overlap change
- Database corruption
- User-initiated rebuild
All detected automatically via the index_meta table. Before rebuilding, the user sees estimated tokens, time, and cost.
Persistence: sql.js runs in-memory; the database is flushed to disk every 30 seconds and on shutdown.
RetrievalEngine: Hybrid Search Pipeline
Four retrieval improvements, all implemented:
1. Hybrid Scoring (RRF)
Hybrid balance slider:
hybridBalance = 0 → keyword only (traditional search), 1.0 → vector only (pure semantic). Users tune this based on their workflow. Semantic search means "kitty" finds notes about "cat"; it just works, transparently.
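The fusion step might look like this (a sketch; k = 60 as planned for Week 5, function name hypothetical):

```typescript
// Reciprocal Rank Fusion: score(doc) = sum over rankings of weight / (k + rank).
// hybridBalance weights the vector ranking against the keyword ranking.
function rrfFuse(
  keywordRanked: string[],  // doc ids, best first (FTS keyword search)
  vectorRanked: string[],   // doc ids, best first (cosine similarity)
  hybridBalance: number,    // 0 = keyword only, 1 = vector only
  k = 60,
): string[] {
  const scores = new Map<string, number>();
  const add = (ids: string[], weight: number) => {
    ids.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + weight / (k + rank + 1));
    });
  };
  add(keywordRanked, 1 - hybridBalance);
  add(vectorRanked, hybridBalance);
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document appearing in both rankings accumulates score from each, which is why hybrid search rewards results that match both ways.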
2. Reranking
Cross-encoder reranking via Ollama/OpenAI. Togglable: essential for smaller on-device models (2-4B), where context management matters, but optional for larger cloud models (GPT-4, Gemini) that already understand relevance.
3. Query Decomposition
"Find notes about API design with security considerations" → ["API design", "security considerations"] → each sub-query runs independently → results merged via RRF.
4. Relevant Segment Extraction (RSE)
Adjacent chunks (gap ≤ 2) are merged into coherent passages. Instead of returning fixed-size blocks, RSE dynamically combines related chunks into longer, more useful segments.
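The merging rule is small enough to sketch (illustrative):

```typescript
// Merge chunk indices whose gap is <= maxGap into contiguous segments,
// so adjacent relevant chunks become one coherent passage.
function mergeSegments(chunkIndices: number[], maxGap = 2): Array<[number, number]> {
  const sorted = chunkIndices.slice().sort((a, b) => a - b);
  const segments: Array<[number, number]> = [];
  for (const idx of sorted) {
    const last = segments[segments.length - 1];
    if (last && idx - last[1] <= maxGap) {
      last[1] = idx;               // extend the current segment
    } else {
      segments.push([idx, idx]);   // start a new segment
    }
  }
  return segments;
}
```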
Two Embedding Levels: Chunk-Based and Note-Based
Different AI projects need different granularities:
| Project Type | Granularity | What They Need |
|---|---|---|
| Search + Chat | Chunk-level | Specific passages matching a query |
| Categorization + Graphs | Note-level | Whole-note vectors for clustering/similarity |
The infrastructure serves both. Note-level embeddings are the normalized mean of chunk embeddings:
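A sketch of that computation (function name hypothetical):

```typescript
// Note-level embedding: element-wise mean of the note's chunk embeddings,
// re-normalized so it lives on the same unit sphere as the chunk vectors.
function noteEmbedding(chunkVectors: number[][]): number[] {
  const dims = chunkVectors[0].length;
  const mean = new Array<number>(dims).fill(0);
  for (const v of chunkVectors) {
    for (let i = 0; i < dims; i++) mean[i] += v[i] / chunkVectors.length;
  }
  const norm = Math.sqrt(mean.reduce((s, x) => s + x * x, 0)) || 1;
  return mean.map((x) => x / norm);
}
```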
No consumer needs to understand the other level. Search uses chunk_embeddings. Categorization uses note_embeddings. The infrastructure manages both.
EmbeddingService: The Public API
Singleton service consumed by all plugins. Every method returns JSON-serializable data:
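A hedged sketch of that surface, with the method list inferred from the tables in this document (final signatures may differ):

```typescript
interface SearchResult {
  noteId: string;
  title: string;
  snippet: string;
  score: number;
  headingPath: string;
}

interface SearchOptions {
  hybrid?: boolean;
  hybridBalance?: number; // 0 = keyword only, 1 = vector only
  limit?: number;
  notebookId?: string;    // notebook-scoped filtering
}

// Public surface of the singleton as described in this proposal.
// All return types are JSON-serializable.
interface IEmbeddingService {
  put(noteId: string): Promise<void>;
  search(query: string, options?: SearchOptions): Promise<SearchResult[]>;
  embed(text: string): Promise<number[]>;
  getNoteEmbedding(noteId: string): Promise<number[]>;
  getAllNoteEmbeddings(): Promise<Record<string, number[]>>;
  findSimilarNotes(noteId: string, limit?: number): Promise<SearchResult[]>;
  buildIndex(onProgress?: (current: number, total: number, stage: string) => void): Promise<void>;
  clearIndex(): Promise<void>;
  isReady(): boolean;
}

// Tiny stub showing the contract in use (not the real implementation):
const stub: Pick<IEmbeddingService, "isReady"> = { isReady: () => false };
```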
How Each Downstream Project Uses This
| GSoC Project | Methods They Call | What They Skip Building |
|---|---|---|
| AI Search | search() | Chunking, embedding, storage, retrieval |
| Chat With Notes | search() + embed() | Chunking, embedding, storage, retrieval |
| Auto-Categorization | getNoteEmbedding() + getAllNoteEmbeddings() | Chunking, embedding, storage |
| Note Graphs | findSimilarNotes() + getAllNoteEmbeddings() | Chunking, embedding, storage, similarity |
Context budget for RAG: chunk size (350 tokens) × limit parameter = a predictable token budget. The Chat plugin controls the total context window; our API returns scored chunks that can be truncated to fit.
LLM Tool Compatibility
The API is designed as LLM-callable tools. An AI agent inside Joplin gets both search_notes (keyword + vector hybrid) and query_embeddings (pure semantic), and the LLM decides which to call based on the query:
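The two tools might be declared like this (schema shape illustrative; only the tool names come from this proposal):

```typescript
// Tool declarations an in-app agent could be given. The LLM picks a tool
// per query; both route into the shared EmbeddingService internally.
const tools = [
  {
    name: "search_notes",
    description: "Hybrid keyword + vector search over the user's notes.",
    parameters: {
      type: "object",
      properties: {
        query: { type: "string" },
        limit: { type: "number", description: "Max results (default 10)" },
      },
      required: ["query"],
    },
  },
  {
    name: "query_embeddings",
    description: "Pure semantic (vector-only) search over note chunks.",
    parameters: {
      type: "object",
      properties: { query: { type: "string" } },
      required: ["query"],
    },
  },
];
```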
Internally maps to: EmbeddingService.instance().search(query, { hybrid: true })
The API follows a "negative friction" design: search() returns noteId, title, snippet, score, and heading path in one call. No follow-up calls needed. Only three tools total, keeping LLM context usage minimal.
For external clients (Claude Desktop, etc.), these same methods can be wrapped in an MCP server (~50 lines). But inside Joplin, direct API calls are simpler and faster.
Sample Consumer: Related Notes Sidebar
A standalone plugin that proves the API:
The consumer has zero embedding code. If the infrastructure isn't available, it falls back gracefully.
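That fallback pattern might look like this (the command name follows the Week 7 aiSearch.* naming and is illustrative; the executor is injected so the sketch runs outside Joplin, whereas a real plugin would pass joplin.commands.execute):

```typescript
type CommandExecutor = (command: string, ...args: unknown[]) => Promise<unknown>;

// Fetch related notes through the shared infrastructure's command API,
// degrading to an empty panel if the infrastructure is missing or not ready.
async function relatedNotes(
  execute: CommandExecutor,
  noteId: string,
): Promise<Array<{ noteId: string; title: string; score: number }>> {
  try {
    const result = await execute("aiSearch.findSimilar", noteId, { limit: 5 });
    return result as Array<{ noteId: string; title: string; score: number }>;
  } catch {
    // Infrastructure not installed or not loaded yet: graceful fallback.
    return [];
  }
}
```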
Inter-Project Migration
Other GSoC projects build their own simple pipeline initially, targeting the same interface: put(note) and query(text). When the shared infrastructure is ready, they swap the backend (one line change):
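A sketch of that contract and the swap (type and class names hypothetical):

```typescript
// The contract both the stop-gap pipeline and the shared service implement.
interface EmbeddingBackend {
  put(noteId: string, text: string): Promise<void>;
  query(text: string, limit?: number): Promise<string[]>; // ranked note ids
}

// A project's own naive stop-gap: substring match, no vectors.
class NaiveBackend implements EmbeddingBackend {
  private docs = new Map<string, string>();
  async put(noteId: string, text: string): Promise<void> {
    this.docs.set(noteId, text);
  }
  async query(text: string, limit = 10): Promise<string[]> {
    return Array.from(this.docs.entries())
      .filter(([, body]) => body.toLowerCase().includes(text.toLowerCase()))
      .map(([id]) => id)
      .slice(0, limit);
  }
}

// The one-line migration: swap the constructor when the shared service ships.
const backend: EmbeddingBackend = new NaiveBackend();
// const backend: EmbeddingBackend = new SharedServiceBackend(); // later
```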
No project depends on this one. No project is blocked by this one. The API contract is the bridge.
Prior Art: Jarvis Plugin
The Jarvis plugin is the most mature embedding implementation in the Joplin ecosystem. Studying its v0.12.0 release informed several design decisions:
Jarvis Feature Integration
| Jarvis Feature | How We Apply It |
|---|---|
| Database based on note properties (syncs between devices) | Evaluate during GSoC: note-property storage vs. separate SQLite. Trade-off: sync bandwidth vs. query speed |
| Q8 quantization (4× smaller) | Add quantization option: 384-dim × 4 bytes = 1.5 KB → Q8 = 384 bytes/chunk |
| Mobile support (all code migrated) | Validates our sql.js WASM approach, which runs in a mobile WebView |
| Device profile with platform-aware tuning | Desktop = full model, mobile = quantized/smaller. Auto-detect platform |
| Excluded notes/folders | excludedNotebooks setting, skip indexing for user-excluded content |
| Progress display with stage messages | buildIndex(onProgress) callback reports (current, total, stage) |
| Strip AI-generated blocks from context | Skip marked sections when chunking to prevent self-referencing |
Why sql.js (Vector DB Comparison)
The community discussed several vector DB options. Here's why sql.js is the right choice for Joplin:
| Database | Type | Native Deps | Mobile | Joplin Sandbox | Verdict |
|---|---|---|---|---|---|
| sql.js | WASM SQLite | ✓ None | ✓ WebView | ✓ Works | Our choice |
| ChromaDB | Cloud/Server | ✗ Python | ✗ | ✗ | Requires external server |
| Milvus | Cloud/Server | ✗ gRPC | ✗ | ✗ | Distributed; overkill |
| Qdrant | Rust server/edge | ✗ Rust binary | ✗ | ✗ | Edge mode still needs a binary |
| Weaviate | Cloud/Server | ✗ Go | ✗ | ✗ | Requires external server |
| pgvector | PostgreSQL ext | ✗ libpq | ✗ | ✗ | Requires PostgreSQL |
| Pinecone | Cloud SaaS | ✓ None | ✓ API | ✗ | Cloud-only, costs money |
| sqlite-vec | SQLite ext | ✗ C | ? | ✗ | Native module breaks webpack sandbox |
Why sql.js wins: pure WASM, zero native deps, works in Electron, mobile WebViews, and any browser. A custom cosine_sim() SQL function gives us vector search without additional dependencies. The trade-off is a linear O(n) scan instead of HNSW indexing, which is acceptable for <10K notes (a scan takes <50 ms at 5,000 chunks).
Memory & Storage Budget
| Scale | Notes | Chunks (~4 per note) | BLOB Storage | Note Embeddings | Total DB | RAM (sql.js) |
|---|---|---|---|---|---|---|
| Small | 500 | 2,000 | 3.0 MB | 0.3 MB | ~4 MB | ~8 MB |
| Medium | 2,000 | 8,000 | 12.0 MB | 1.2 MB | ~15 MB | ~25 MB |
| Large | 5,000 | 20,000 | 30.0 MB | 3.0 MB | ~38 MB | ~55 MB |
Calculation: 384-dim × 4 bytes = 1,536 bytes per vector. With Q8 quantization (stretch goal): 384 bytes per vector, a 4× reduction.
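The Q8 stretch goal is a simple affine quantization; a sketch (the actual scheme Jarvis or this project uses may differ):

```typescript
// Q8 quantization: map each float to an int8 via a per-vector scale, so a
// 384-float (1,536-byte) vector shrinks to 384 bytes plus one stored scale.
function quantizeQ8(v: Float32Array): { scale: number; data: Int8Array } {
  let maxAbs = 0;
  for (let i = 0; i < v.length; i++) maxAbs = Math.max(maxAbs, Math.abs(v[i]));
  const scale = maxAbs / 127 || 1;
  const data = new Int8Array(v.length);
  for (let i = 0; i < v.length; i++) data[i] = Math.round(v[i] / scale);
  return { scale, data };
}

function dequantizeQ8(q: { scale: number; data: Int8Array }): Float32Array {
  const out = new Float32Array(q.data.length);
  for (let i = 0; i < q.data.length; i++) out[i] = q.data[i] * q.scale;
  return out;
}
```

The reconstruction error per element is bounded by half the scale, which is small for L2-normalized embeddings.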
Model memory: BGE-small-en-v1.5 ONNX = ~127 MB, loaded once. Shared across all consumers; this is the key saving vs. five separate copies (635 MB).
Embedding speed estimates:
- Local WASM: ~50 ms per chunk → 2,000 chunks in ~100 seconds
- Ollama (nomic-embed-text): ~15 ms per chunk → 2,000 chunks in ~30 seconds
- OpenAI (ada-002): ~5 ms per chunk → 2,000 chunks in ~10 seconds (+ API cost ~$0.02)
Community Feedback: Addressed
This design was shaped by extensive community discussion across multiple forum threads. Every concern raised has been addressed in the architecture above:
Core architecture: The mentors confirmed this should be part of the core app, not a standalone plugin, since plugins cannot depend on other plugins. Done: the service lives in packages/lib/services/embedding/. A sample consumer plugin (Related Notes sidebar, 13 KB) demonstrates that the API works without any cross-plugin dependency.
API design: The suggested interface (put(note), query(text), getNoteEmbedding(), findSimilarNotes()) has been implemented and expanded to 12 methods covering every downstream use case. Retrieval features (reranking, decomposition, RSE) live in the shared search() so consumers get the full pipeline automatically.
Inter-plugin data access: Instead of shared files, we use Joplin's command system (joplin.commands.execute()). Commands return structured JSON across plugins. If the infrastructure hasn't loaded when a consumer calls, try/catch handles it gracefully.
Centralized vector DB: One shared database prevents duplication, storage bloat, and processing cost. The vector DB is treated as a regenerable cache: a model change, config change, or corruption triggers an automatic rebuild. Users see estimated tokens/time/cost before any indexing starts.
Provider flexibility: Users choose Local (WASM), Ollama, or OpenAI. Any ONNX model works via the Ollama provider, including nomic-embed-text:137m as a lightweight fallback for embedded hardware.
Optionality: Every AI feature is disabled by default with individual on/off toggles. Reranking is togglable: essential for smaller on-device models (2-4B), optional for larger cloud models. The hybrid balance slider lets users tune keyword vs. semantic search.
Chunk vs. note embeddings: Search and Chat need chunk-level vectors; Categorization and Graphs need note-level vectors (mean of chunks). Both levels are stored and exposed through separate API methods.
LLM tool compatibility: The API is designed as search_notes + query_embeddings tools with "negative friction": one call returns everything the LLM needs. MCP wrapping is ~50 lines for external clients, but unnecessary inside Joplin.
Migration path: Other GSoC projects code to put(note)/query(text) from day one with their own simple pipeline. When the shared infrastructure is ready, they swap the backend.
Optionality & User Control
All AI features are strictly optional. The service does nothing until explicitly enabled.
| Control | Behavior |
|---|---|
| Enable AI Index toggle | Service disabled by default. No background processing until enabled |
| Provider dropdown | Local / Ollama / OpenAI; the user chooses |
| Reranking toggle | Off by default; enable for small on-device models |
| Decomposition toggle | Off by default; optional for complex queries |
| Hybrid balance slider | 0%=keyword only, 100%=vector only |
| Cost estimation | Shows tokens/time/cost BEFORE indexing starts |
| Excluded notebooks | Skip indexing for specific folders |
| Disable | Index kept on disk, no processing. Re-enable = instant |
| Uninstall | Delete embedding-index.sqlite (~15–50 MB) |
Timeline
Community Bonding (May 8 – June 1)
- Validate Transformers.js WASM in Joplin's Electron environment
- Discuss service integration points with mentors
- Refine core service based on code review feedback
Week 1: Core Service Wiring
- Integrate EmbeddingService singleton into BaseApplication startup sequence
- Register Joplin settings: provider selection, model name, enable/disable toggle
- Set up settings UI in Joplin's preferences panel
- Deliverable: Service initializes on app start, settings visible in UI
Week 2: Local WASM Provider
- Bundle Transformers.js with BGE-small-en-v1.5 ONNX model
- Implement LocalEmbeddingProvider with WASM inference
- Pipeline recycling (reinitialize every 80 calls to prevent memory fragmentation)
- Deliverable: embed("hello world") returns 384-dim vector via WASM
Week 3: End-to-End put()
- Connect the ChunkingEngine → EmbeddingProvider → VectorStore pipeline
- put(noteId) fetches note from Joplin DB, chunks, embeds, stores
- Implement clearIndex() and isReady()
- Milestone: EmbeddingService.instance().put(noteId) works end-to-end
Week 4: Incremental Indexing
- onNoteChange listener: auto-index on save
- SHA-256 content hash: skip re-embedding unchanged notes
- buildIndex(onProgress) with progress callback for full re-index
- Deliverable: Notes auto-indexed on save, progress displayed
Week 5: Hybrid Retrieval
- Wire RetrievalEngine into search() API
- Hybrid search: combine vector cosine_sim with Joplin's FTS4 keyword search
- RRF fusion (k=60) with configurable hybridBalance slider
- Deliverable: search(query) returns ranked results from both engines
Week 6: Advanced Retrieval
- Query decomposition: split complex queries into sub-queries
- RSE: merge adjacent relevant chunks into coherent passages
- Notebook-scoped filtering via notebookId parameter
- Milestone: Full retrieval pipeline operational
Midterm Evaluation (July 7)
Deliverable: Core service with incremental indexing, hybrid retrieval, and complete API
Week 7: Plugin API + Commands
- Register all aiSearch.* commands for plugin access
- Commands: put, search, embed, findSimilar, getNoteEmbedding, getAllNoteEmbeddings, getStats, isReady
- Type-safe JSON responses for all commands
- Deliverable: Other plugins can call the infrastructure via commands
Week 8: Search Panel UI
- Search panel with query input, result cards, similarity scores
- Snippet extraction with heading breadcrumbs
- Source labels: "keyword", "vector", "hybrid"
- Hybrid balance slider in the UI
- Deliverable: Working search panel in Joplin's sidebar
Week 9: Reranking + Settings Panel
- Cross-encoder reranking via Ollama generate or OpenAI chat
- Settings panel: provider config, reranking toggle, cost estimation
- Cost estimation: show token count, estimated time, and $ before indexing
- Milestone: Full-featured search with optional reranking
Week 10: Sample Consumer Plugin
- Related Notes sidebar: upgrade from the prototype
- Demonstrates: findSimilarNotes, getNoteEmbedding, graceful fallback
- Package as standalone .jpl (target: <20 KB, zero deps)
- Deliverable: Installable consumer plugin that proves the API
Week 11: Edge Cases + Platform Testing
- Handle: encrypted notes, image-only, empty, very long (>50K words), trash, excluded folders
- Strip AI-generated blocks from chunking input
- Cross-platform testing: Windows, macOS, Linux
- Performance profiling: 50, 500, 2000, 5000 notes
- Deliverable: Hardened service with cross-platform validation
Week 12: Documentation + Polish
- API reference documentation (all 12 methods)
- "How to build a consumer plugin" tutorial with code examples
- Device profile for platform-aware tuning (desktop vs. mobile)
- Final performance optimization pass
- Milestone: Production-ready with full documentation
Final Evaluation (August 25)
More about me
My motivation for this project comes from my personal journey and the experiences that shaped how I think about technology, learning, and design.
It started with a broken computer my school was discarding. I repaired it just to play games, but in doing so I unknowingly got my first real lesson in how hardware and systems work. That curiosity never left me. Growing up with an artist mother added a different dimension altogether. Being around her work shaped my instinct for creativity, visual design, and user experience in ways I did not fully realise until I started building software.
During school, a genuine interest in biology particularly in understanding the human brain eventually led me to artificial intelligence and machine learning. Learning that neural networks are inspired by how the brain processes information felt like two of my biggest interests finally making sense together.
When I began building and contributing to software, all of these interests started working together naturally. My systems knowledge helped me think about architecture and constraints, while my design instincts kept me focused on usability and the learner's experience. Over time, I developed a real appreciation for the constructionist philosophy: the idea that people learn best by making things.
This project sits right at the intersection of AI, software systems, and UI/UX, exactly the space I have been growing into. It is a direct expression of that philosophy and feels like a natural continuation of the path I have been on.