GSoC 2026 Proposal: AI-Generated Note Graphs
Project idea: gsoc/ideas.md at master · joplin/gsoc (GitHub)
GitHub profile: yugalkaushik (Yugal Kaushik)
Pull Requests:
- #14401: Error logged on first startup
- #14411: Remove empty hidden divs from ENEX imports
- #14449: Store note history settings in sync info
- #14535: Call unmount in Note.test.tsx tests to suppress act warnings
- #14557: Invisible cursor in legacy editor when using dark theme in separate window
- #14642: ENEX import no longer breaks bullet items with a line break into separate paragraphs
- #14703: Validate password on re-enable encryption after master password cleared
- #14526: Fix search highlights breaking mermaid diagram rendering
1. Introduction
Hi, I'm Yugal Kaushik. I'm a final year B.Tech student in Computer Science with a specialization in AI/ML at Shree Guru Gobind Singh Tricentenary University, graduating in 2026. My AI/ML experience isn't just on paper. During my internship at Prodinit Software Solutions (July to Oct 2025), I built a testing framework for ElevenLabs conversational agents where I integrated custom LLM personas to evaluate agent performance in chat-based workflows. I also worked on a Voice AI Evaluation project using Pipecat for voice agent orchestration and Langfuse for structured evaluation of LLM-driven conversational systems. So I have hands-on experience with LLM APIs, embeddings, prompt design and evaluating AI outputs - which is basically what this project is about.
I've made multiple pull requests across projects like Processing Foundation, Joplin and Meshery. For Joplin specifically, I have 8 merged PRs that touch different parts of the codebase. I'm already familiar with the monorepo structure, the review process and how the plugin system connects to everything else. Outside of open source, I've built a few full-stack projects. WhisperSpace is a privacy-first real-time messaging platform with React, Socket.io and OAuth. I also made a URL shortener with Next.js and PostgreSQL that has QR generation and analytics. These gave me solid experience with React, Node.js, TypeScript and databases - all directly relevant to this plugin. I've also built RAG pipelines and chatbots, which gave me further hands-on experience working with LLMs.
I work primarily with TypeScript, JavaScript, React, React Native, Node.js and Python, with additional experience in Vue.js, Next.js and TailwindCSS. For AI/ML specifically I have worked with LLM APIs like OpenAI and Ollama, text embeddings, prompt engineering, cosine similarity, TF-IDF and tools like Pipecat and Langfuse for voice AI evaluation. I'm comfortable with Git, webpack, SQLite, Jest, Electron and npm plugin distribution. For databases I have worked with MongoDB, PostgreSQL, MySQL and SQLite across different projects.
Why this project specifically: I personally use various note-taking applications like Joplin and Obsidian, and the graph view is one of my favourite features. But Obsidian's graph only shows wikilink connections - just like existing Joplin graph plugins show :/noteId links. It never surfaces connections between notes that cover the same topic but were never explicitly linked. That's the gap this project fills. Making a note graph that actually understands your content using AI genuinely excites me.
2. Project Summary
The problem: After a while, Joplin users end up with a lot of notes scattered across notebooks. Understanding how they relate to each other, which ideas are central and how topics connect becomes pretty hard. Two graph plugins already exist for Joplin (Knowledge Graph by agerardin, Link Graph UI by treymo) but they only show manually created links and shared tags. Notes that talk about the same topic but were never explicitly linked just float as isolated nodes.
What I'll build: A Joplin plugin that uses AI (text embeddings + optional LLM refinement) to discover semantic relationships between notes, categorize them, score their importance and render everything as an interactive graph. It combines three types of connections: explicit inter-note links, shared tags and AI-detected semantic similarity. Users bring their own API key for the AI provider of their choice (OpenAI-compatible API, Google Gemini or local Ollama).
Expected outcome: An installable .jpl plugin distributed through the Joplin plugin repo. The user selects a scope (current notebook, selected notebooks or all notebooks), the plugin analyzes notes and shows a graph where node size reflects importance, node color reflects category, solid edges are explicit links and dashed edges are AI-discovered relationships. Click any node and it opens that note. Focus mode shows only that note's neighborhood. The graph exports as PNG, SVG or JSON.
3. Technical Approach
3.1 Why a Plugin and Not a Core Feature
I spent time reading through the Joplin codebase to understand whether this belongs in core or as a plugin. Plugin is the clear answer. Here's the reasoning with concrete evidence:
The panel API is built for exactly this use case. JoplinViewsPanels provides create(), setHtml(), addScript() for loading JS/CSS into webview panels, plus two-way messaging via onMessage() and postMessage(). The codebase even mentions graph rendering as a plugin use case explicitly.
Data API needs no auth token from plugins. JoplinData exposes get, post, put and delete methods that map to the REST API. Since plugins run inside the app, no authorization token is required.
Workspace events enable incremental updates. JoplinWorkspace provides onNoteChange(), onNoteSelectionChange() and onSyncComplete() which let the plugin react to note edits and refresh only the affected parts of the graph instead of recomputing everything.
Secure API key storage is built in. Plugin settings support a secure: true flag which stores values in the OS keychain rather than plaintext config files.
SQLite access for embedding cache. joplin.require('sqlite3') gives access to a native SQLite instance and joplin.plugins.dataDir() provides a persistent directory for the plugin's data. Together these let the plugin cache embedding vectors locally so unchanged notes are never re-analyzed.
Keeps AI dependencies out of core. All ML libraries and API clients stay in the plugin. Users who don't want AI features are completely unaffected. The plugin distributes as a standard .jpl archive through the Joplin plugin repository.
3.2 How This Differs from Existing Plugins
Two graph plugins exist for Joplin: Knowledge Graph and Link Graph UI. Both only extract structural connections. If two notes discuss the same topic but were never manually linked, they show nothing.
I also use Joplin and Obsidian personally, and Obsidian's graph view is great, but it only visualizes wikilink connections between notes. Structurally that's the same as what these existing Joplin plugins do. This project goes further by discovering semantic relationships that no explicit link represents.
3.3 Architecture
The plugin has three layers: Data Collection, AI Analysis and Visualization.
Data Collection
The plugin fetches all notes in scope using the Data API with pagination. The core call:
joplin.data.get(['folders', folderId, 'notes'], {
fields: ['id', 'title', 'body', 'parent_id', 'created_time', 'updated_time'],
limit: 100,
page
})
The output is an array of note objects containing id (32-char hex), title, body (full markdown), parent_id and timestamps for temporal analysis.
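Since notebooks can exceed one page of results, the plugin loops until has_more is false. A minimal sketch of the pagination loop (fetchAllNotes and NoteData are illustrative names):
async function fetchAllNotes(folderId: string): Promise<NoteData[]> {
  const notes: NoteData[] = [];
  let page = 1;
  while (true) {
    const response = await joplin.data.get(['folders', folderId, 'notes'], {
      fields: ['id', 'title', 'body', 'parent_id', 'created_time', 'updated_time'],
      limit: 100,
      page,
    });
    notes.push(...response.items);
    // The Data API signals further pages via has_more
    if (!response.has_more) break;
    page++;
  }
  return notes;
}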
Joplin Search API optimization: For Pass B candidate selection, the plugin uses Joplin's full-text search endpoint (joplin.data.get(['search'], { query: '...', type: 'note' })) to quickly identify notes containing specific keywords before sending them for expensive LLM processing. This significantly reduces Pass B token usage.
Explicit links are extracted using the :/noteId regex pattern. Tags are fetched via joplin.data.get(['notes', noteId, 'tags']).
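A sketch of the link extraction (extractExplicitLinks is an illustrative helper; the regex targets the markdown link form [title](:/noteId)):
function extractExplicitLinks(body: string): string[] {
  // Internal Joplin links appear as [title](:/32-char-hex-id)
  const linkRegex = /\(:\/([a-f0-9]{32})\)/g;
  const ids: string[] = [];
  let match;
  while ((match = linkRegex.exec(body)) !== null) {
    ids.push(match[1]);
  }
  return [...new Set(ids)]; // de-duplicate repeated links to the same note
}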
AI Analysis: Cost-Optimized Two-Pass Strategy
This directly addresses the cost concern that came up in the community. A user pointed out that LLM-driven graph analysis can be very expensive in terms of tokens. Fair concern. So the design uses embeddings as the primary tool (cheap and fast) and the LLM as an optional refinement (expensive but rich).
Pass A (embeddings, always runs, cheap)
For each note, the title and body are concatenated into a single text string for embedding. The format is:
Title: {title}
{body}
Putting the title first with a clear label gives it prominence since transformer-based embedding models give more attention to tokens at the beginning of the input. This is cleaner than repeating the title multiple times, which works for TF-IDF but doesn't translate the same way to embedding models that use attention mechanisms.
Before embedding, basic preprocessing happens: strip markdown syntax (links, images, code blocks, headers), remove Joplin resource references (:/noteId patterns), and normalize whitespace. If the text exceeds the model's context window (for example 8191 tokens for OpenAI), it gets truncated.
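A sketch of that preprocessing step, assuming the NoteData shape from Section 3.4 (token-level truncation is left to the provider layer):
function preprocessForEmbedding(note: NoteData): string {
  const body = note.body
    .replace(/```[\s\S]*?```/g, ' ')               // drop fenced code blocks
    .replace(/!\[.*?\]\(:\/[a-f0-9]{32}\)/g, ' ')  // drop image resource references
    .replace(/\[([^\]]*)\]\([^)]*\)/g, '$1')       // keep link text, drop targets
    .replace(/^#{1,6}\s+/gm, '')                   // strip heading markers
    .replace(/\s+/g, ' ')                          // normalize whitespace
    .trim();
  return `Title: ${note.title}\n${body}`;
}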
The output from Pass A includes embedding vectors for each note and semantic edges between notes above the similarity threshold.
Embedding Models
For API usage, the default is text-embedding-3-small from OpenAI:
- 1536 dimensions, 8191 token context window
- $0.02 per 1M tokens - very cheap
- Most reliable, well-documented API
Google Gemini (Free Tier): Google AI Studio provides completely free access to high-quality embeddings with generous usage limits. This is the recommended option for cost-sensitive users who want cloud quality without any API cost. Supported as a first-class provider.
For local inference with Ollama, the default is nomic-embed-text-v1.5:
- 768 dimensions, 8192 context, Apache 2.0 license
- Specifically trained for document retrieval and semantic similarity
- Important: Nomic models are instruction-aware and require task prefixes for optimal accuracy. The plugin automatically prepends search_document: when embedding notes and search_query: when querying for similar notes. Skipping these prefixes measurably degrades retrieval quality.
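A minimal sketch of the prefix handling (withTaskPrefix is an illustrative helper):
function withTaskPrefix(text: string, task: 'document' | 'query', model: string): string {
  // Only Nomic embedding models expect these task prefixes
  if (!model.startsWith('nomic-embed-text')) return text;
  return task === 'document' ? `search_document: ${text}` : `search_query: ${text}`;
}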
For multilingual notebooks, nomic-embed-text-v2 is recommended. It uses a Mixture-of-Experts architecture supporting 100+ languages with strong semantic performance.
Other alternatives: mxbai-embed-large (1024 dims, higher quality) and all-minilm (384 dims, faster but lower quality).
Local Inference Options
- Ollama: easy setup, cross-platform, good model selection, exposes a REST API; requires a separate installation and runs as a background daemon.
- transformers.js: runs directly in Node.js with no external dependencies; larger bundle size, slower CPU performance and a more limited set of models.
- llama.cpp (via Node bindings): highly efficient and fast, especially with GGUF models; setup can be complex and may require manual compilation on some platforms.
- LM Studio: user-friendly GUI and simple model management; a desktop application, not suitable for headless environments.
My recommendation is to default to Ollama because it has the best developer experience (one command to install, one command to pull models), REST API makes integration simple, and it's cross-platform (Windows, Mac, Linux). I'll also support a "custom endpoint" option in settings so users can point to any OpenAI-compatible embedding endpoint.
Batched Embedding Requests
OpenAI's embedding API accepts up to 2048 inputs per request. Sending one note at a time would be roughly 100× slower than batched requests. The plugin batches notes in groups of up to 100 (balancing memory usage and efficiency):
// Assumes a chunk() helper (e.g. lodash.chunk) and an embeddingProvider
// abstraction exposing embedBatch(texts: string[]): Promise<number[][]>
async function batchEmbed(notes: NoteData[], batchSize = 100): Promise<Map<string, number[]>> {
  const results = new Map<string, number[]>();
  const batches = chunk(notes, batchSize);
  for (const batch of batches) {
    const texts = batch.map(n => preprocessForEmbedding(n));
    // One API request per batch instead of one per note
    const embeddings = await embeddingProvider.embedBatch(texts);
    batch.forEach((note, i) => results.set(note.id, embeddings[i]));
  }
  return results;
}
Similarity Computation
Cosine similarity is used as the primary metric for comparing embedding vectors. It's the standard in NLP and what embedding models are optimized for. It's scale-invariant, meaning it measures only the angle between vectors, not their magnitude. It's fast: a simple dot product after normalization, O(n) per comparison where n is the vector dimension. And it's interpretable, since the 0-1 range maps nicely to confidence scores.
Alternative approaches were considered:
- Euclidean distance: Useful when magnitude matters but is sensitive to vector norms and typically requires normalization
- Dot product: Often used for raw similarity without normalization but is not bounded making it harder to define consistent thresholds
- Jaccard similarity: Works well for set-based comparisons like shared tags but can't be applied to continuous embeddings
- BM25 or TF-IDF: Suitable for keyword-based matching but only capture lexical similarity and fail to understand semantic relationships
My approach uses cosine similarity as primary but also incorporates tag overlap where notes sharing tags get a bonus and explicit links which are always shown regardless of similarity score.
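For reference, the plain cosine computation the similarity matrix is built from (a straightforward implementation, nothing provider-specific):
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // 1.0 = identical direction, 0.0 = orthogonal (unrelated)
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}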
Tag Overlap Calculation (Jaccard similarity):
function computeTagOverlap(noteA: NoteData, noteB: NoteData): number {
const tagsA = new Set(noteA.tags);
const tagsB = new Set(noteB.tags);
if (tagsA.size === 0 && tagsB.size === 0) return 0;
const intersection = new Set([...tagsA].filter(t => tagsB.has(t)));
const union = new Set([...tagsA, ...tagsB]);
return intersection.size / union.size; // 0 to 1
}
Link Bonus Calculation:
function computeLinkBonus(noteA: NoteData, noteB: NoteData): number {
const aLinksToB = noteA.explicitLinks.includes(noteB.id);
const bLinksToA = noteB.explicitLinks.includes(noteA.id);
if (aLinksToB && bLinksToA) return 1.0; // bidirectional
if (aLinksToB || bLinksToA) return 0.5; // one-way
return 0;
}
Temporal Proximity Bonus:
Notes created close in time often relate to the same project or research session. This is an implicit relationship signal most note graph tools ignore. The plugin computes a temporal bonus based on creation date proximity:
function computeTemporalBonus(noteA: NoteData, noteB: NoteData): number {
const daysDiff = Math.abs(noteA.created_time - noteB.created_time) / (1000 * 60 * 60 * 24);
if (daysDiff <= 1) return 0.1; // same day or adjacent
if (daysDiff <= 7) return 0.05; // same week
return 0;
}
This subtle bonus can surface connections between notes from the same research session or project sprint.
Score Normalization:
Raw cosine similarity scores often cluster in a narrow range (e.g., 0.56 to 0.83) rather than spanning the full 0-1 range. To make the threshold meaningful, the scores are min-max normalized across the dataset so the lowest similarity becomes 0 and the highest becomes 1.
Small notebook guard: If the similarity spread is very narrow (max - min < 0.1), normalization is skipped and the threshold applies to raw cosine scores directly. This prevents meaningless normalized thresholds when there are only 2-3 notes or when all notes are extremely similar.
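A sketch of this normalization, written to match the normalize() call used in the edge creation code below (fitNormalizer would run once after the similarity matrix is computed):
let simMin = 0;
let simMax = 1;

function fitNormalizer(allScores: number[]) {
  simMin = Math.min(...allScores);
  simMax = Math.max(...allScores);
}

function normalize(score: number): number {
  // Small notebook guard: skip normalization when the spread is too narrow
  if (simMax - simMin < 0.1) return score;
  return (score - simMin) / (simMax - simMin);
}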
Edge Creation Logic:
The threshold applies to the normalized cosine similarity directly. Tag and link bonuses are additive but only apply when there's already some semantic relationship. This prevents two unrelated notes with shared tags (e.g., both tagged "todo") from forming a spurious semantic edge:
function shouldCreateEdge(
cosineSim: number,
tagOverlap: number,
linkBonus: number,
temporalBonus: number,
threshold: number
): boolean {
const normalizedCosine = normalize(cosineSim);
// Require minimum semantic similarity (0.3) before bonuses apply
if (normalizedCosine < 0.3 && linkBonus === 0) return false;
const boostedScore = normalizedCosine
+ (0.1 * tagOverlap)
+ (0.05 * linkBonus)
+ temporalBonus;
return boostedScore >= threshold;
}
This ensures notes without tags or links compete fairly since the base threshold applies to cosine similarity alone, while preventing spurious edges from tag-only matches.
Similarity Threshold
Low confidence edges are discarded since showing everything would create a cluttered unusable graph. The default threshold is 0.5 on a 0-1 scale.
Justification:
- Cosine similarity of 1.0 means identical vectors
- Cosine similarity of 0.0 means completely unrelated
- In practice with modern embedding models:
- 0.8+ means very similar, almost the same topic
- 0.6-0.8 means related topics worth connecting
- 0.5-0.6 means somewhat related, marginal
- Below 0.5 is mostly noise
I chose 0.5 as the default because research literature such as work on semantic similarity with BERT shows that 0.5 is a common cutoff for identifying content that is meaningfully related. It provides a reasonable balance by capturing relevant connections without being too loose. It is also better to under-connect initially than to overwhelm the system with noisy or weak associations.
The threshold setting in the UI will have three presets:
- Strict (0.7): Only strong connections
- Balanced (0.5): Default
- Loose (0.3): More connections with more noise
Centrality Computation in Pass A
Once the similarity matrix (all pairwise cosine similarities) is computed, centrality can be calculated without the LLM. The plugin uses degree centrality, which counts how many connections each node has above the threshold and normalizes to a 1-10 scale. More sophisticated options exist: eigenvector centrality weights connections by how central their neighbors are, and PageRank applies a similar idea that works well for directed graphs. Degree centrality is the default since it's intuitive and fast.
function computeCentrality(notes: NoteData[], similarityMatrix: number[][]): Map<string, number> {
const centrality = new Map<string, number>();
const threshold = 0.5;
for (let i = 0; i < notes.length; i++) {
let connections = 0;
for (let j = 0; j < notes.length; j++) {
if (i !== j && similarityMatrix[i][j] >= threshold) {
connections++;
}
}
centrality.set(notes[i].id, Math.max(1, Math.round((connections / notes.length) * 10)));
}
return centrality;
}
Community Detection - Louvain on Similarity Graph
Since the plugin already constructs a weighted similarity graph, Louvain community detection is the natural fit. It operates directly on graph structure and avoids the curse of dimensionality that affects K-means on raw high-dimensional embeddings.
The implementation uses graphology-communities-louvain - the actively maintained, benchmark-proven choice. It is approximately 45× faster than the older jlouvain package (52ms vs 2368ms on a 1000-node graph) and is actively maintained with latest version 2.0.2. It pairs naturally with graphology, which is used as the graph data structure throughout the plugin.
import Graph from 'graphology';
import louvain from 'graphology-communities-louvain';
const graph = new Graph();
// Add nodes
notes.forEach(note => graph.addNode(note.id, { title: note.title }));
// Add weighted edges above threshold
edges.forEach(edge => {
if (!graph.hasEdge(edge.source, edge.target)) {
graph.addEdge(edge.source, edge.target, { weight: edge.confidence });
}
});
// Assign community to each node
louvain.assign(graph);
// graph.getNodeAttribute(nodeId, 'community') now returns community ID
Known limitation: Louvain can occasionally produce poorly-connected communities. The Leiden algorithm addresses this and is documented as a post-GSoC enhancement.
Fallback: When the graph is too sparse for Louvain (strict threshold, few notes), the plugin falls back to keyword-based categorization - scanning note content for common terms and grouping notes that share keywords. Category names are derived from the most frequent meaningful terms in each group.
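A rough sketch of the per-note keyword extraction behind this fallback (the stopword list is illustrative; grouping notes that share a top term follows from these labels):
function keywordLabel(note: NoteData, topN = 2): string {
  const stopwords = new Set(['the', 'and', 'for', 'with', 'that', 'this', 'from', 'have']);
  const counts = new Map<string, number>();
  for (const word of note.body.toLowerCase().match(/[a-z]{4,}/g) ?? []) {
    if (!stopwords.has(word)) counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  const top = [...counts.entries()]
    .sort((a, b) => b[1] - a[1])   // most frequent meaningful terms first
    .slice(0, topN)
    .map(([word]) => word);
  return top.join(' / ') || 'uncategorized';
}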
Cost Summary for Pass A
For 200 notes:
- Embeddings only via local Ollama: Free
- Embeddings only via OpenAI API: ~$0.01
Embeddings get cached in SQLite so they persist across restarts and unchanged notes are never re-analyzed.
Scalability for Large Notebooks
For notebooks under 300 notes, pairwise cosine similarity (O(n²)) is fast - 200 notes means ~20,000 comparisons, completing in milliseconds. For 500+ notes, the plugin processes in batches with a progress indicator and cancel button.
On native ANN libraries: Joplin's plugin API explicitly states that native packages cannot be bundled with plugins because they need to work cross-platform. This rules out hnswlib-node and similar native bindings. Pure-JS HNSW implementations exist but are sparsely maintained. For GSoC scope, O(n²) with early termination and batching handles typical notebook sizes well. ANN is documented as a stretch goal for post-GSoC, where Joplin could potentially expose a native ANN API.
Graph Filtering
Even with a similarity threshold, large notebooks can produce cluttered graphs where important connections get lost in noise. Two filtering strategies prevent this:
- Threshold filtering: Only edges above the similarity threshold are shown (default 0.5).
- Top-K per node: Each note keeps only its K strongest connections (default K=5). This prevents any single note from having dozens of weak edges cluttering the graph. The UI has a density slider that adjusts this value.
Together these ensure the graph remains readable regardless of notebook size.
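One reasonable implementation of the Top-K filter, under the assumption that an edge survives if it ranks in the top K of either endpoint (so mutual strong connections are never dropped):
interface SemanticEdge { source: string; target: string; confidence: number; }

function applyTopK(edges: SemanticEdge[], k = 5): SemanticEdge[] {
  const byNode = new Map<string, SemanticEdge[]>();
  const add = (id: string, e: SemanticEdge) => {
    if (!byNode.has(id)) byNode.set(id, []);
    byNode.get(id)!.push(e);
  };
  for (const e of edges) { add(e.source, e); add(e.target, e); }
  const kept = new Set<SemanticEdge>();
  for (const nodeEdges of byNode.values()) {
    nodeEdges
      .sort((a, b) => b.confidence - a.confidence) // strongest first
      .slice(0, k)
      .forEach(e => kept.add(e));
  }
  return [...kept];
}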
Pass B (LLM, optional, user enables it)
This pass only runs when the user explicitly enables it in settings and provides an API key. It sends batches of notes that Pass A already identified as related. The LLM returns category labels, centrality scores and relationship descriptions. By pre-filtering with embeddings first, token usage drops significantly.
LLM Prompt
Related notes are batched together (up to 15 per request to stay within context limits) and sent as a single prompt:
You are analyzing notes for a knowledge graph. Given these related notes, identify:
1. A category label for each note (1-2 words, e.g., "web development", "project planning")
2. A centrality score (1-10) based on how central/important each note seems
3. For each pair of notes, a brief relationship label (2-4 words, e.g., "expands on", "contradicts", "provides example of")
Notes to analyze:
Note 1 [ID: abc123]:
Title: React Component Architecture
Body: We'll use functional components with hooks. The main App component will manage global state via Context API...
---
Note 2 [ID: def456]:
Title: Performance Optimization
Body: React.memo for expensive renders. useMemo and useCallback to prevent unnecessary re-renders...
---
Respond in JSON format:
{
"notes": [
{"id": "abc123", "category": "...", "centrality": 7},
{"id": "def456", "category": "...", "centrality": 5}
],
"relationships": [
{"from": "abc123", "to": "def456", "label": "optimizes"}
]
}
LLM Output Schema
interface PassBResponse {
notes: {
id: string;
category: string;
centrality: number; // 1-10
summary?: string;
}[];
relationships: {
from: string;
to: string;
label: string;
}[];
}
The plugin parses the JSON response, updates each node with the LLM-assigned category and centrality, and for each relationship updates the corresponding semantic edge's label. If the LLM returns relationships not in the similarity set, they get added since the LLM might catch connections embeddings missed.
Error handling validates JSON structure before using it. If parsing fails, it falls back to Pass A results. If individual fields are invalid, Pass A defaults are used for those fields.
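A sketch of that validation step (parsePassBResponse is an illustrative name; a null return signals the caller to fall back to Pass A):
function parsePassBResponse(raw: string): PassBResponse | null {
  try {
    const parsed = JSON.parse(raw);
    if (!Array.isArray(parsed.notes) || !Array.isArray(parsed.relationships)) return null;
    for (const n of parsed.notes) {
      if (typeof n.id !== 'string' || typeof n.category !== 'string') return null;
      n.centrality = Math.min(10, Math.max(1, Number(n.centrality) || 5)); // clamp to 1-10
    }
    return parsed as PassBResponse;
  } catch {
    return null; // malformed JSON: caller falls back to Pass A results
  }
}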
LLM Model Recommendations
Default: gpt-4o-mini - $0.15/1M input and $0.60/1M output tokens, excellent at structured tasks, 128k context window, native JSON mode.
Alternatives: gpt-4o (higher quality, $2.50/$10.00 per 1M), claude-3-haiku ($0.25/$1.25 per 1M), or local Ollama with llama3.2:3b for speed or mistral:7b for better quality.
Guaranteed structured output: Instead of relying on prompt instructions alone, the plugin uses each provider's native structured output mechanism:
// OpenAI
const response = await openai.chat.completions.create({
model: 'gpt-4o-mini',
messages: [{ role: 'user', content: prompt }],
response_format: { type: 'json_object' }
});
// Ollama
const response = await fetch('http://localhost:11434/api/generate', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model, prompt, format: 'json' })
});
// Gemini
const response = await model.generateContent({
contents: [{ role: 'user', parts: [{ text: prompt }] }],
generationConfig: { responseMimeType: 'application/json' }
});
Cost Summary
For 200 notes:
- Local Ollama embeddings only: Free
- OpenAI embeddings only: ~$0.01
- OpenAI embeddings + gpt-4o-mini LLM refinement: ~$0.15–0.30
- Naive full-notebook LLM without embeddings pre-filter: $2–5+
Note Summarization Integration
The plugin can integrate with the existing "Summarise your notes and notebooks" plugin which uses extractive summarization algorithms (LexRank, LSA and KMeans clustering) running locally without API calls. If installed, the AI Note Graph plugin reads these summaries for node tooltips, providing quick context without any additional API usage.
Visualization - Sigma.js + Graphology
The graph renders in a webview panel using sigma.js paired with graphology as the backing data structure. Since graphology is already used for Louvain community detection, using it as the unified graph data structure throughout the plugin eliminates redundancy and keeps the architecture clean. Sigma.js renders thousands of nodes efficiently using WebGL, is actively maintained, and handles the large-notebook case cleanly without a separate upgrade path.
Visual encoding:
- Node size = centrality (larger means more connections)
- Node color = community/category
- Solid edges = explicit links
- Dashed edges = AI semantic relationships
- Dotted edges = shared tags
- Edge thickness = confidence score
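A sketch of how this encoding could map onto graphology attributes that sigma.js reads (size and color are standard attributes; dashed and dotted edge styles need a custom sigma edge program, left out here):
function applyVisualEncoding(graph: Graph, data: GraphData) {
  const palette = ['#e6194b', '#3cb44b', '#4363d8', '#f58231', '#911eb4'];
  for (const node of data.nodes) {
    graph.mergeNodeAttributes(node.noteId, {
      label: node.title,
      size: 2 + node.centrality,                        // centrality -> node size
      color: palette[node.community % palette.length],  // community -> node color
    });
  }
  for (const edge of data.edges) {
    graph.mergeEdgeAttributes(edge.source, edge.target, {
      size: 0.5 + edge.confidence * 2, // confidence -> edge thickness
    });
  }
}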
Interactions:
- Click a node to open the note via joplin.commands.execute('openNote', noteId)
- Hover for tooltip with category, centrality score and AI-generated summary
- Filter panel: by category, confidence threshold or edge type
- Toggle between force-directed and hierarchical layouts
- Search and highlight by note title
Focus Node Mode: Click any node to enter focus mode, showing only that note and its 1–2 hop neighbors. This is essential for large notebooks where the full graph becomes visually overwhelming. Users familiar with Obsidian's graph or the existing Joplin graph plugins expect this behavior.
Export Functionality: Users can export the graph as PNG, SVG or JSON. The JSON export preserves all node metadata and edge weights for external analysis. A "copy to clipboard as image" button enables quick sharing.
3.4 Graph Data Model
The in-memory graph structure passed between the plugin and webview:
interface GraphData {
nodes: {
noteId: string;
title: string;
summary: string; // from LLM (Pass B) or empty
category: string;
centrality: number; // 1-10
folderId: string;
community: number; // Louvain community ID
}[];
edges: {
source: string;
target: string;
type: 'explicit_link' | 'semantic' | 'tag_shared';
label: string;
confidence: number; // 0-1 (1.0 for explicit links)
}[];
metadata: {
analyzedAt: number;
scope: string[];
noteCount: number;
cursor: string;
};
}
The plugin persists analysis results in a local SQLite database via joplin.require('sqlite3') stored in joplin.plugins.dataDir() so unchanged notes are never re-analyzed.
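A plausible cache schema, assuming the node-sqlite3 API that joplin.require('sqlite3') exposes (table and column names are illustrative):
// Inside the plugin's async initialization
const sqlite3 = joplin.require('sqlite3');
const dataDir = await joplin.plugins.dataDir();
const db = new sqlite3.Database(`${dataDir}/cache.sqlite`);

db.serialize(() => {
  db.run(`CREATE TABLE IF NOT EXISTS embeddings (
    note_id TEXT PRIMARY KEY,
    updated_time INTEGER NOT NULL,  -- note's updated_time when embedded
    model TEXT NOT NULL,            -- invalidate when the embedding model changes
    vector BLOB NOT NULL            -- Float32Array bytes
  )`);
  db.run(`CREATE TABLE IF NOT EXISTS edges (
    source TEXT NOT NULL,
    target TEXT NOT NULL,
    type TEXT NOT NULL,
    confidence REAL NOT NULL,
    PRIMARY KEY (source, target, type)
  )`);
});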
3.5 Plugin Settings
| Setting | Type | Description |
|---|---|---|
| Analysis Scope | Dropdown | "Current notebook" / "All notebooks" / "Selected notebooks" |
| AI Provider | Dropdown | "openai" / "gemini" / "ollama" / "custom" |
| API Endpoint | String | URL, auto-populated based on selected provider |
| API Key | Secure String | Stored in OS keychain via secure: true |
| Embedding Model | String | e.g. "text-embedding-3-small" or "nomic-embed-text-v1.5" for Ollama |
| Enable LLM Analysis | Boolean | Toggle Pass B on/off |
| Chat Model | String | "gpt-4o-mini" (only used if Pass B is on) |
| Similarity Threshold | Slider (1-100, mapped to 0.0-1.0) | Lower = more edges, higher = fewer but stronger connections |
| Max Edges Per Node | Slider (1-20) | Top-K filtering to prevent cluttered graphs |
Cross-Notebook Analysis: Many Joplin users organize with multiple notebooks and rely on tags to span topics. The three scope options are:
- Current notebook (default): Analyzes only the selected notebook
- All notebooks: Analyzes every note in Joplin, useful for finding connections across topic boundaries
- Selected notebooks: User picks specific notebooks to include, good for project-based analysis
Tags that span notebooks naturally create cross-notebook edges when using broader scopes.
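A sketch of registering the API key with the secure flag, using the standard plugin settings API (section and key names are illustrative):
import joplin from 'api';
import { SettingItemType } from 'api/types';

// Inside the plugin's onStart handler
await joplin.settings.registerSection('aiNoteGraph', {
  label: 'AI Note Graph',
  iconName: 'fas fa-project-diagram',
});

await joplin.settings.registerSettings({
  'aiNoteGraph.apiKey': {
    value: '',
    type: SettingItemType.String,
    secure: true, // stored in the OS keychain rather than the plaintext settings file
    section: 'aiNoteGraph',
    public: true,
    label: 'API Key',
  },
});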
3.6 Graph Mockup
What it looks like with a notebook containing 9 notes (mockup image: node size reflects centrality, node color reflects category, solid edges are explicit links, dashed edges are AI-discovered relationships).
3.7 Error Handling and Edge Cases
API failures and rate limits: All API calls (embedding and LLM) are wrapped with retry logic using exponential backoff. If the provider is unreachable or returns a rate limit error, the plugin shows a clear message in the panel and falls back to displaying the graph with only explicit links and tags (no semantic edges). The graph is always usable even when the AI layer fails completely.
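A minimal sketch of the retry wrapper (withRetry is an illustrative name; jitter avoids synchronized retries):
async function withRetry<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  let delay = 1000;
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= maxRetries) throw err; // give up and let the caller fall back
      await new Promise(resolve => setTimeout(resolve, delay + Math.random() * 250));
      delay *= 2; // exponential backoff: 1s, 2s, 4s, ...
    }
  }
}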
Ollama not running: When the user selects Ollama as their provider, the plugin pings the local endpoint on startup. If Ollama is not running, the settings panel shows a warning with setup instructions instead of silently failing during analysis.
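The health check can be a simple request against Ollama's model-listing endpoint, as sketched here:
async function isOllamaRunning(endpoint = 'http://localhost:11434'): Promise<boolean> {
  try {
    const response = await fetch(`${endpoint}/api/tags`); // lists installed models
    return response.ok;
  } catch {
    return false; // connection refused: Ollama is not running
  }
}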
Large notes exceeding context windows: Embedding models have token limits. Notes longer than the model's context window are truncated to the first N tokens before embedding. For LLM analysis in Pass B, long notes are truncated to their first 2000 words before being sent in the batch. This is documented in the settings tooltip so users understand the behavior.
Very large notebooks: For notebooks with 500+ notes, the pairwise similarity computation (O(n²)) could become slow. The plugin mitigates this by processing notes in batches, showing a progress indicator in the panel and allowing the user to cancel mid-analysis. Embedding generation is the bottleneck, not similarity computation, and the SQLite cache makes repeat analyses near-instant.
Empty or minimal notebooks: If a notebook has 0 or 1 notes, the plugin shows a helpful message instead of an empty graph. If all notes are very short (less than 10 words), similarity scores will be unreliable, so the plugin displays a notice that results may be limited.
Invalid or expired API keys: The plugin validates the API key with a lightweight test request (a single embedding of the word "test") when the user first saves their key in settings. Invalid keys are flagged immediately rather than failing silently during full analysis.
3.8 Incremental Updates
The plugin stays in sync with note changes without re-analyzing the entire notebook.
Adding New Notes
When joplin.workspace.onNoteChange(handler) fires for a new note, the plugin checks if the note is in the currently displayed folder. If yes, it triggers an incremental update: generate embedding for just that note, update the cache, recompute similarities only for this note vs all others, update semantic edges, recalculate centrality for affected nodes, and re-render the graph.
async function handleNoteChange(event: { id: string }) {
  // onNoteChange delivers only the note id, so fetch the changed note first
  const note = await joplin.data.get(['notes', event.id], {
    fields: ['id', 'title', 'body', 'parent_id', 'created_time', 'updated_time'],
  });
  if (note.parent_id !== currentFolderId) return;
  const embedding = await embedNote(note);
  await cache.upsertEmbedding(note.id, embedding);
  const similarities = await computeSimilaritiesForNote(note.id, allOtherEmbeddings);
  await cache.updateEdgesForNote(note.id, similarities);
  await refreshGraph();
}
Deleting Notes
The onNoteChange event also fires on deletion. The plugin removes the note from the graph immediately, removes its edges, cleans up the embedding and edges from the cache, recalculates centrality for affected nodes (those that were connected to the deleted note), and re-renders the graph.
async function handleNoteDeletion(noteId: string) {
  // Capture neighbors before removing edges so their centrality can be recalculated
  const affectedNodeIds = graphData.edges
    .filter(e => e.source === noteId || e.target === noteId)
    .map(e => (e.source === noteId ? e.target : e.source));
  graphData.nodes = graphData.nodes.filter(n => n.noteId !== noteId);
  graphData.edges = graphData.edges.filter(e =>
    e.source !== noteId && e.target !== noteId
  );
  await cache.deleteEmbedding(noteId);
  await cache.deleteEdgesForNote(noteId);
  await recalculateCentrality(affectedNodeIds);
  await refreshGraph();
}
Sync from Other Devices
For notes synced from other devices, the plugin uses the Events API cursor combined with the onSyncComplete workspace event. After each full analysis, the cursor is stored. On sync complete, the plugin checks for changes since that cursor, processes only the changed notes, and updates the stored cursor.
joplin.workspace.onSyncComplete(async () => {
  const changes = await joplin.data.get(['events'], { cursor: storedCursor });
  for (const change of changes.items) {
    // currentScope is a Set of note ids in the analyzed scope
    if (currentScope.has(change.item_id)) {
      await handleNoteChange({ id: change.item_id });
    }
  }
  storedCursor = changes.cursor;
});
3.9 Handling Images in Notes
Images are stored as Joplin "resources" and referenced in note bodies via :/resourceId syntax.
Using Joplin's Built-in OCR
Joplin already has OCR support using Tesseract.js. When OCR is enabled, image resources get an ocr_text field populated with extracted text. The plugin fetches this via the Data API and includes it in the embedding text. This way diagrams, screenshots with text and handwritten notes all get included in the embedding.
const resources = await joplin.data.get(['notes', noteId, 'resources'], {
fields: ['id', 'mime', 'ocr_text']
});
for (const resource of resources.items) {
if (resource.ocr_text) {
noteText += '\n' + resource.ocr_text;
}
}
Image Alt Text
Markdown images often have alt text in the format ![alt text](:/resourceId). The plugin extracts and includes this in the embedding text.
const altTextRegex = /!\[(.*?)\]\(:\/[a-f0-9]{32}\)/g;
let match;
while ((match = altTextRegex.exec(noteBody)) !== null) {
noteText += ' ' + match[1];
}
Images Without Text
For images without OCR text or alt text, the default behavior is to skip them. Most notes have text content anyway. Notes that are mostly images will have sparser embeddings, but this is correct behavior since a note that's mostly images with no text isn't semantically similar to text-heavy notes.
A future enhancement could use multimodal embeddings (CLIP-style models) to embed images directly into the same vector space as text, enabling visual similarity detection. This is documented as a potential post-GSoC enhancement.
4. Implementation Plan
350 hours across 12 weeks.
Weeks 1-2: Plugin scaffold and data layer
Set up the plugin project using the Joplin generator. Create the webview panel, register the toolbar button and toggle command. Implement note fetching with pagination via the Data API and explicit link extraction using the :/noteId regex pattern.
Outcome: A working plugin skeleton that fetches notes and extracts all explicit links from a selected notebook.
Weeks 3-4: Structural graph rendering
Integrate sigma.js and graphology into the webview panel via the webpack entry point. Render explicit links and shared tags as a force-directed graph. Implement click-to-navigate, the two-way webview message passing pattern, hover tooltips and a basic legend.
Outcome: A functional graph of explicit connections with navigation, tooltips and category legend.
Weeks 5-6: Embedding pipeline and semantic edges
Week 5: Build the embedding provider abstraction supporting OpenAI, Gemini and Ollama. Implement batched embedding requests (up to 100 notes per batch). Apply instruction prefixes for Nomic models. Compute cosine similarity and add semantic edges above the threshold. Set up the SQLite cache.
Week 6: Build the settings panel with secure: true API key storage, provider/model configuration and scope selection. Implement Louvain community detection using graphology-communities-louvain on the similarity graph. Add keyword-based fallback for sparse graphs.
Outcome: Unlinked but semantically related notes appear connected as dashed edges, colored by community.
Weeks 7-8: LLM analysis and categorization
Implement optional Pass B with structured JSON output using each provider's native JSON mode. Build batching logic grouping notes by embedding similarity to minimize token usage. Add cost estimation displayed before analysis runs. Implement all error handling: API failures, rate limits, timeouts, JSON parse failures and fallback to Pass A results.
Outcome: The graph has LLM-assigned category labels, centrality-sized nodes and labeled relationship edges.
Weeks 9-10: Graph UI polish, focus mode, export and incremental updates
Add category coloring with legend, density slider (Top-K per node), edge type filtering, threshold presets and layout toggle. Implement focus node mode (1–2 hop neighborhood view). Add export functionality (PNG, SVG, JSON and clipboard copy). Wire up Events API cursor tracking and workspace event listeners for incremental updates on note edits, deletions and sync. Implement dark and light theme support via Joplin CSS variables.
Outcome: Production-ready graph that stays in sync with note changes, with focus mode and export.
Weeks 11-12: Testing, documentation and release
Unit tests: Link extraction, similarity computation, score normalization, edge creation logic, centrality computation, community detection.
Integration tests: Mock Data API responses for full pipeline runs.
AI Output Testing Strategy:
- Golden-set tests: ~20 curated notes with known relationships. Verifies that closely related notes (e.g., "React Hooks Guide" and "useState Examples") score above threshold and unrelated notes (e.g., "Grocery List" and "React Hooks Guide") score below.
- Mock embedding responses: Deterministic embedding vectors for CI - tests similarity computation and graph construction without API costs.
- Threshold calibration tests: Verifies different threshold settings produce expected graph densities (strict = sparse, loose = dense).
- Regression tests: Captures baseline output for the golden set and flags unexpected changes when the algorithm is modified.
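A sketch of such a golden-set test with deterministic mock embeddings (Jest, using cosineSimilarity as sketched earlier; vectors are toy 3-dimensional stand-ins):
describe('golden set similarity', () => {
  const hooksGuide = [1, 0, 0];        // mock embedding: "React Hooks Guide"
  const useStateNotes = [0.9, 0.1, 0]; // mock embedding: "useState Examples"
  const groceryList = [0, 0, 1];       // mock embedding: "Grocery List"

  test('related notes score above the 0.5 threshold', () => {
    expect(cosineSimilarity(hooksGuide, useStateNotes)).toBeGreaterThan(0.5);
  });

  test('unrelated notes score below the 0.5 threshold', () => {
    expect(cosineSimilarity(hooksGuide, groceryList)).toBeLessThan(0.5);
  });
});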
Edge case handling: empty notebooks, single note, notes with no body text, notes exceeding embedding context windows.
Documentation: Setup guides for OpenAI, Gemini, Ollama and custom endpoints. Performance benchmarking on small (50), medium (200) and large (500+) notebooks. Plugin repository submission and final polish.
Outcome: Published and installable from Joplin's plugin browser.
5. Deliverables
- Installable Joplin plugin published to the plugin repository
- Embedding-based semantic analysis discovering relationships between unlinked notes
- Support for OpenAI, Google Gemini and Ollama providers with batched requests
- Optional LLM enrichment (Pass B) for category labels, centrality scores and relationship labels
- Interactive sigma.js graph with click-to-navigate, focus mode, filtering and layout options
- Export functionality: PNG, SVG, JSON and clipboard copy
- SQLite caching with incremental updates via workspace events and Events API cursor
- Settings UI with secure API key storage, provider configuration and three-level scope selection
- Community detection via graphology-communities-louvain (Louvain algorithm)
- User documentation for setup with OpenAI, Gemini, Ollama and custom endpoints
6. Availability
Weekly availability during GSoC: I can dedicate 7 to 8 hours per day on weekdays, and I am also available for meetings or check-ins on weekends if needed. If the project demands extra effort at any point, I am happy to put in more time on weekends to keep things on track.
Time zone: I am in IST (Indian Standard Time) and flexible with scheduling. I am open to calls or chats based on whatever works best for the mentors.
Any other commitments during the programme: As of now I don't have any other commitments during the GSoC period so I will be fully focused on the project.


