GSoC 2026 Proposal Draft – Idea 2: AI-Generated note graphs – yugalkaushik

GSoC 2026 Proposal: AI-Generated Note Graphs

Link to project Idea: gsoc/ideas.md at master · joplin/gsoc · GitHub

GitHub: yugalkaushik (Yugal Kaushik) · GitHub

Pull Requests:

  • #14401: Error logged on first startup
  • #14411: Remove empty hidden divs from ENEX imports
  • #14449: Store note history settings in sync info
  • #14535: Call unmount in Note.test.tsx tests to suppress act warnings
  • #14557: Invisible cursor in legacy editor when using dark theme in separate window
  • #14642: ENEX import no longer breaks bullet items with a line break into separate paragraphs
  • #14703: Validate password on re-enable encryption after master password cleared
  • #14526: Fix search highlights breaking mermaid diagram rendering

1. Introduction

Hi, I'm Yugal Kaushik. I'm a final year B.Tech student in Computer Science with a specialization in AI/ML at Shree Guru Gobind Singh Tricentenary University, graduating in 2026. My AI/ML experience isn't just on paper. During my internship at Prodinit Software Solutions (July to Oct 2025), I built a testing framework for ElevenLabs conversational agents where I integrated custom LLM personas to evaluate agent performance in chat-based workflows. I also worked on a Voice AI Evaluation project using Pipecat for voice agent orchestration and Langfuse for structured evaluation of LLM-driven conversational systems. So I have hands-on experience with LLM APIs, embeddings, prompt design and evaluating AI outputs - which is basically what this project is about.

I've made multiple pull requests across projects like Processing Foundation, Joplin and Meshery. For Joplin specifically, I have 8 merged PRs that touch different parts of the codebase. I'm already familiar with the monorepo structure, the review process and how the plugin system connects to everything else. Outside of open source, I've built a few full-stack projects. WhisperSpace is a privacy-first real-time messaging platform with React, Socket.io and OAuth. I also made a URL shortener with Next.js and PostgreSQL that has QR generation and analytics. These gave me solid experience with React, Node.js, TypeScript and databases - all directly relevant to this plugin. I've also built RAG pipelines, chatbots and similar projects, which gave me further hands-on experience with LLMs.

I work primarily with TypeScript, JavaScript, React, React Native, Node.js and Python, with additional experience in Vue.js, Next.js and TailwindCSS. For AI/ML specifically I have worked with LLM APIs like OpenAI and Ollama, text embeddings, prompt engineering, cosine similarity, TF-IDF and tools like Pipecat and Langfuse for voice AI evaluation. I'm comfortable with Git, webpack, SQLite, Jest, Electron and npm plugin distribution. For databases I have worked with MongoDB, PostgreSQL, MySQL and SQLite across different projects.

Why this project specifically: I personally use various note-taking applications like Joplin and Obsidian, and the graph view is one of my favourite features. But Obsidian's graph only shows wikilink connections - just like existing Joplin graph plugins show :/noteId links. It never surfaces connections between notes that cover the same topic but were never explicitly linked. That's the gap this project fills. Making a note graph that actually understands your content using AI genuinely excites me.

2. Project Summary

The problem: After a while, Joplin users end up with a lot of notes scattered across notebooks. Understanding how they relate to each other, which ideas are central and how topics connect becomes pretty hard. Two graph plugins already exist for Joplin (Knowledge Graph by agerardin, Link Graph UI by treymo) but they only show manually created links and shared tags. Notes that talk about the same topic but were never explicitly linked just float as isolated nodes.

What I'll build: A Joplin plugin that uses AI (text embeddings + optional LLM refinement) to discover semantic relationships between notes, categorize them, score their importance and render everything as an interactive graph. It combines three types of connections: explicit inter-note links, shared tags and AI-detected semantic similarity. Users bring their own API key for the AI provider of their choice (OpenAI-compatible API, Google Gemini or local Ollama).

Expected outcome: An installable .jpl plugin distributed through the Joplin plugin repo. The user selects a scope (current notebook, selected notebooks or all notebooks), the plugin analyzes notes and shows a graph where node size reflects importance, node color reflects category, solid edges are explicit links and dashed edges are AI-discovered relationships. Click any node and it opens that note. Focus mode shows only that note's neighborhood. The graph exports as PNG, SVG or JSON.


3. Technical Approach

3.1 Why a Plugin and Not a Core Feature

I spent time reading through the Joplin codebase to understand whether this belongs in core or as a plugin. Plugin is the clear answer. Here's the reasoning with concrete evidence:

The panel API is built for exactly this use case. JoplinViewsPanels provides create(), setHtml(), addScript() for loading JS/CSS into webview panels, plus two-way messaging via onMessage() and postMessage(). The codebase even mentions graph rendering as a plugin use case explicitly.

Data API needs no auth token from plugins. JoplinData exposes get, post, put and delete methods that map to the REST API. Since plugins run inside the app, no authorization token is required.

Workspace events enable incremental updates. JoplinWorkspace provides onNoteChange(), onNoteSelectionChange() and onSyncComplete() which let the plugin react to note edits and refresh only the affected parts of the graph instead of recomputing everything.

Secure API key storage is built in. Plugin settings support a secure: true flag which stores values in the OS keychain rather than plaintext config files.

SQLite access for embedding cache. joplin.require('sqlite3') gives access to a native SQLite instance and joplin.plugins.dataDir() provides a persistent directory for the plugin's data. Together these let the plugin cache embedding vectors locally so unchanged notes are never re-analyzed.

Keeps AI dependencies out of core. All ML libraries and API clients stay in the plugin. Users who don't want AI features are completely unaffected. The plugin distributes as a standard .jpl archive through the Joplin plugin repository.

3.2 How This Differs from Existing Plugins

Two graph plugins exist for Joplin: Knowledge Graph and Link Graph UI. Both only extract structural connections. If two notes discuss the same topic but were never manually linked, they show nothing.

I use both Joplin and Obsidian personally, and Obsidian's graph view is great, but it only visualizes wikilink connections between notes. Structurally that's the same as what these existing Joplin plugins do. This project goes further by discovering semantic relationships that no explicit link represents.

3.3 Architecture

The plugin has three layers: Data Collection, AI Analysis and Visualization.

Data Collection

The plugin fetches all notes in scope using the Data API with pagination. The core call:

joplin.data.get(['folders', folderId, 'notes'], {
    fields: ['id', 'title', 'body', 'parent_id', 'created_time', 'updated_time'],
    limit: 100,
    page
})

The output is an array of note objects containing id (32-char hex), title, body (full markdown), parent_id and timestamps for temporal analysis.
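As a sketch of how the pagination is handled (assuming a NoteData type containing the requested fields; Joplin's paginated endpoints return an items array plus a has_more flag):

async function fetchAllNotes(folderId: string): Promise<NoteData[]> {
    const notes: NoteData[] = [];
    let page = 1;
    let hasMore = true;
    while (hasMore) {
        // Each response contains an `items` array and a `has_more` flag
        const response = await joplin.data.get(['folders', folderId, 'notes'], {
            fields: ['id', 'title', 'body', 'parent_id', 'created_time', 'updated_time'],
            limit: 100,
            page,
        });
        notes.push(...response.items);
        hasMore = response.has_more;
        page++;
    }
    return notes;
}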

Joplin Search API optimization: For Pass B candidate selection, the plugin uses Joplin's full-text search endpoint (joplin.data.get(['search'], { query: '...', type: 'note' })) to quickly identify notes containing specific keywords before sending them for expensive LLM processing. This significantly reduces Pass B token usage.

Explicit links are extracted using the :/noteId regex pattern. Tags are fetched via joplin.data.get(['notes', noteId, 'tags']).
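As an illustration, the link and tag extraction could look roughly like this (a sketch; the exact regex and field list may change during implementation):

function extractExplicitLinks(body: string): string[] {
    // Matches Joplin internal links of the form [title](:/32-char-hex-id)
    const linkRegex = /\(:\/([a-f0-9]{32})\)/g;
    const ids: string[] = [];
    let match;
    while ((match = linkRegex.exec(body)) !== null) {
        ids.push(match[1]);
    }
    return ids;
}

async function fetchTags(noteId: string): Promise<string[]> {
    const response = await joplin.data.get(['notes', noteId, 'tags'], { fields: ['id', 'title'] });
    return response.items.map((tag: { title: string }) => tag.title);
}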

AI Analysis: Cost-Optimized Two-Pass Strategy

This directly addresses the cost concern that came up in the community. A user pointed out that LLM-driven graph analysis can be very expensive in terms of tokens. Fair concern. So the design uses embeddings as the primary tool (cheap and fast) and the LLM as an optional refinement (expensive but richer).

Pass A (embeddings, always runs, cheap)

For each note, the title and body are concatenated into a single text string for embedding. The format is:

Title: {title}

{body}

Putting the title first with a clear label gives it prominence since transformer-based embedding models give more attention to tokens at the beginning of the input. This is cleaner than repeating the title multiple times, which works for TF-IDF but doesn't translate the same way to embedding models that use attention mechanisms.

Before embedding, basic preprocessing happens: strip markdown syntax (links, images, code blocks, headers), remove Joplin resource references (:/noteId patterns), and normalize whitespace. If the text exceeds the model's context window (for example 8191 tokens for OpenAI), it gets truncated.
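A rough sketch of that preprocessing step, assuming a NoteData type with title and body; truncation here is character-based as a stand-in for real token counting, which depends on the selected model:

function preprocessForEmbedding(note: NoteData, maxChars = 30000): string {
    let body = note.body;
    body = body.replace(/```[\s\S]*?```/g, ' ');                    // drop fenced code blocks
    body = body.replace(/!\[.*?\]\(:\/[a-f0-9]{32}\)/g, ' ');       // drop image resource references
    body = body.replace(/\[([^\]]*)\]\(:\/[a-f0-9]{32}\)/g, '$1');  // keep link text, drop :/noteId target
    body = body.replace(/^#{1,6}\s+/gm, '');                        // strip heading markers
    body = body.replace(/\s+/g, ' ').trim();                        // normalize whitespace
    const text = `Title: ${note.title}\n\n${body}`;
    return text.slice(0, maxChars);                                 // crude truncation guard
}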

The output from Pass A includes embedding vectors for each note and semantic edges between notes above the similarity threshold.

Embedding Models

For API usage, the default is text-embedding-3-small from OpenAI:

  • 1536 dimensions, 8191 token context window
  • $0.02 per 1M tokens - very cheap
  • Most reliable, well-documented API

Google Gemini (Free Tier): Google AI Studio provides completely free access to high-quality embeddings with generous usage limits. This is the recommended option for cost-sensitive users who want cloud quality without any API cost. Supported as a first-class provider.

For local inference with Ollama, the default is nomic-embed-text-v1.5:

  • 768 dimensions, 8192 context, Apache 2.0 license
  • Specifically trained for document retrieval and semantic similarity
  • Important: Nomic models are instruction-aware and require task prefixes for optimal accuracy. The plugin automatically prepends search_document: when embedding notes and search_query: when querying for similar notes. Skipping these prefixes measurably degrades retrieval quality.

For multilingual notebooks, nomic-embed-text-v2 is recommended. It uses a Mixture-of-Experts architecture supporting 100+ languages with strong semantic performance.

Other alternatives: mxbai-embed-large (1024 dims, higher quality) and all-minilm (384 dims, faster but lower quality).
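To illustrate the task-prefix handling described above for Nomic models, the embedding provider could wrap each text like this (a sketch, applied only when a Nomic model is selected):

function applyNomicPrefix(text: string, purpose: 'document' | 'query'): string {
    // Nomic models are instruction-aware: search_document for stored notes, search_query for lookups
    return purpose === 'document' ? `search_document: ${text}` : `search_query: ${text}`;
}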

Local Inference Options

Ollama offers an easy setup, works across platforms, provides a good selection of models, and exposes a REST API, but it requires a separate installation and runs as a background daemon. transformers.js can run directly in Node.js without external dependencies, though it comes with a larger bundle size, slower CPU performance, and a more limited set of models. llama.cpp via Node bindings is highly efficient and fast especially with GGUF models, but its setup can be complex and may require manual compilation on some platforms. LM Studio provides a user-friendly GUI and simplifies model management, however it is a desktop application and not suitable for headless environments.

My recommendation is to default to Ollama because it has the best developer experience (one command to install, one command to pull models), REST API makes integration simple, and it's cross-platform (Windows, Mac, Linux). I'll also support a "custom endpoint" option in settings so users can point to any OpenAI-compatible embedding endpoint.

Batched Embedding Requests

OpenAI's embedding API accepts up to 2048 inputs per request. Sending one note at a time would be roughly 100× slower than batched requests. The plugin batches notes in groups of up to 100 (balancing memory usage and efficiency):

// chunk(), preprocessForEmbedding() and embeddingProvider are plugin helpers: chunk splits
// an array into fixed-size groups, preprocessForEmbedding builds the "Title: ..." text
// described above, and embeddingProvider wraps the currently selected embedding API.
async function batchEmbed(notes: NoteData[], batchSize = 100): Promise<Map<string, number[]>> {
    const results = new Map<string, number[]>();
    const batches = chunk(notes, batchSize);
    for (const batch of batches) {
        const texts = batch.map(n => preprocessForEmbedding(n));
        const embeddings = await embeddingProvider.embedBatch(texts);
        batch.forEach((note, i) => results.set(note.id, embeddings[i]));
    }
    return results;
}

Similarity Computation

Cosine similarity is used as the primary metric for comparing embedding vectors. It's the standard in NLP and what embedding models are optimized for. It's scale invariant, meaning it measures only the angle between vectors, not their magnitude. It's fast: a simple dot product after normalization, O(n) per comparison. And it's interpretable, since the 0-1 range maps nicely to confidence scores.
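For reference, the core computation is small enough to implement directly (a minimal sketch, assuming both vectors have the same dimensionality):

function cosineSimilarity(a: number[], b: number[]): number {
    let dot = 0, normA = 0, normB = 0;
    for (let i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA === 0 || normB === 0) return 0; // guard against zero vectors
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}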

Alternative approaches were considered:

  • Euclidean distance: Useful when magnitude matters but is sensitive to vector norms and typically requires normalization
  • Dot product: Often used for raw similarity without normalization but is not bounded making it harder to define consistent thresholds
  • Jaccard similarity: Works well for set-based comparisons like shared tags but can't be applied to continuous embeddings
  • BM25 or TF-IDF: Suitable for keyword-based matching but only capture lexical similarity and fail to understand semantic relationships

My approach uses cosine similarity as primary but also incorporates tag overlap where notes sharing tags get a bonus and explicit links which are always shown regardless of similarity score.

Tag Overlap Calculation (Jaccard similarity):

function computeTagOverlap(noteA: NoteData, noteB: NoteData): number {
    const tagsA = new Set(noteA.tags);
    const tagsB = new Set(noteB.tags);
    if (tagsA.size === 0 && tagsB.size === 0) return 0;
    const intersection = new Set([...tagsA].filter(t => tagsB.has(t)));
    const union = new Set([...tagsA, ...tagsB]);
    return intersection.size / union.size; // 0 to 1
}

Link Bonus Calculation:

function computeLinkBonus(noteA: NoteData, noteB: NoteData): number {
    const aLinksToB = noteA.explicitLinks.includes(noteB.id);
    const bLinksToA = noteB.explicitLinks.includes(noteA.id);
    if (aLinksToB && bLinksToA) return 1.0;  // bidirectional
    if (aLinksToB || bLinksToA) return 0.5;  // one-way
    return 0;
}

Temporal Proximity Bonus:

Notes created close in time often relate to the same project or research session. This is an implicit relationship signal most note graph tools ignore. The plugin computes a temporal bonus based on creation date proximity:

function computeTemporalBonus(noteA: NoteData, noteB: NoteData): number {
    const daysDiff = Math.abs(noteA.created_time - noteB.created_time) / (1000 * 60 * 60 * 24);
    if (daysDiff <= 1) return 0.1;   // same day or adjacent
    if (daysDiff <= 7) return 0.05;  // same week
    return 0;
}

This subtle bonus can surface connections between notes from the same research session or project sprint.

Score Normalization:

Raw cosine similarity scores often cluster in a narrow range (e.g., 0.56 to 0.83) rather than spanning the full 0-1 range. To make the threshold meaningful, the scores are min-max normalized across the dataset so the lowest similarity becomes 0 and the highest becomes 1.

Small notebook guard: If the similarity spread is very narrow (max - min < 0.1), normalization is skipped and the threshold applies to raw cosine scores directly. This prevents meaningless normalized thresholds when there are only 2-3 notes or when all notes are extremely similar.
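A minimal sketch of the min-max normalization with that guard:

function normalizeScores(scores: number[]): number[] {
    const min = Math.min(...scores);
    const max = Math.max(...scores);
    // Small notebook guard: skip normalization when the similarity spread is too narrow
    if (max - min < 0.1) return scores;
    return scores.map(s => (s - min) / (max - min));
}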

Edge Creation Logic:

The threshold applies to the normalized cosine similarity directly. Tag and link bonuses are additive but only apply when there's already some semantic relationship. This prevents two unrelated notes with shared tags (e.g., both tagged "todo") from forming a spurious semantic edge:

function shouldCreateEdge(
    cosineSim: number,
    tagOverlap: number,
    linkBonus: number,
    temporalBonus: number,
    threshold: number
): boolean {
    const normalizedCosine = normalize(cosineSim);
    // Require minimum semantic similarity (0.3) before bonuses apply
    if (normalizedCosine < 0.3 && linkBonus === 0) return false;
    const boostedScore = normalizedCosine
        + (0.1 * tagOverlap)
        + (0.05 * linkBonus)
        + temporalBonus;
    return boostedScore >= threshold;
}

This ensures notes without tags or links compete fairly since the base threshold applies to cosine similarity alone, while preventing spurious edges from tag-only matches.

Similarity Threshold

Low confidence edges are discarded since showing everything would create a cluttered unusable graph. The default threshold is 0.5 on a 0-1 scale.

Justification:

  • Cosine similarity of 1.0 means identical vectors
  • Cosine similarity of 0.0 means completely unrelated
  • In practice with modern embedding models:
    • 0.8+ means very similar, almost the same topic
    • 0.6-0.8 means related topics worth connecting
    • 0.5-0.6 means somewhat related, marginal
    • Below 0.5 is mostly noise

I chose 0.5 as the default because research literature such as work on semantic similarity with BERT shows that 0.5 is a common cutoff for identifying content that is meaningfully related. It provides a reasonable balance by capturing relevant connections without being too loose. It is also better to under-connect initially than to overwhelm the system with noisy or weak associations.

The threshold setting in the UI will have three presets:

  • Strict (0.7): Only strong connections
  • Balanced (0.5): Default
  • Loose (0.3): More connections with more noise

Centrality Computation in Pass A

Once the similarity matrix (all pairwise cosine similarities) is computed, centrality can be calculated without the LLM. The plugin uses degree centrality, which counts how many connections each node has above the threshold and normalizes the result to a 1-10 scale. More sophisticated options exist: eigenvector centrality weights connections by how central their neighbors are, and PageRank applies a similar idea that works well for directed graphs. Degree centrality is used by default since it's intuitive and fast.

function computeCentrality(notes: NoteData[], similarityMatrix: number[][]): Map<string, number> {
    const centrality = new Map<string, number>();
    const threshold = 0.5;
    for (let i = 0; i < notes.length; i++) {
        let connections = 0;
        for (let j = 0; j < notes.length; j++) {
            if (i !== j && similarityMatrix[i][j] >= threshold) {
                connections++;
            }
        }
        centrality.set(notes[i].id, Math.max(1, Math.round((connections / notes.length) * 10)));
    }
    return centrality;
}

Community Detection - Louvain on Similarity Graph

Since the plugin already constructs a weighted similarity graph, Louvain community detection is the natural fit. It operates directly on graph structure and avoids the curse of dimensionality that affects K-means on raw high-dimensional embeddings.

The implementation uses graphology-communities-louvain, the actively maintained, benchmark-proven choice: it is approximately 45× faster than the older jlouvain package (52ms vs 2368ms on a 1000-node graph), with 2.0.2 as its latest version. It pairs naturally with graphology, which is used as the graph data structure throughout the plugin.

import Graph from 'graphology';
import louvain from 'graphology-communities-louvain';

const graph = new Graph();
// Add nodes
notes.forEach(note => graph.addNode(note.id, { title: note.title }));
// Add weighted edges above threshold
edges.forEach(edge => {
    if (!graph.hasEdge(edge.source, edge.target)) {
        graph.addEdge(edge.source, edge.target, { weight: edge.confidence });
    }
});
// Assign community to each node
louvain.assign(graph);
// graph.getNodeAttribute(nodeId, 'community') now returns community ID

Known limitation: Louvain can occasionally produce poorly-connected communities. The Leiden algorithm addresses this and is documented as a post-GSoC enhancement.

Fallback: When the graph is too sparse for Louvain (strict threshold, few notes), the plugin falls back to keyword-based categorization - scanning note content for common terms and grouping notes that share keywords. Category names are derived from the most frequent meaningful terms in each group.
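A simplified sketch of that fallback, labeling each note with its most frequent long words (a real implementation would add a proper stop-word list and then group notes that share those terms):

function keywordCategory(note: NoteData, topTerms = 2): string {
    const counts = new Map<string, number>();
    for (const word of note.body.toLowerCase().split(/\W+/)) {
        if (word.length < 5) continue; // crude filter for short/stop words
        counts.set(word, (counts.get(word) || 0) + 1);
    }
    const top = [...counts.entries()]
        .sort((a, b) => b[1] - a[1])
        .slice(0, topTerms)
        .map(([word]) => word);
    return top.join(' ') || 'uncategorized';
}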

Cost Summary for Pass A

For 200 notes:

  • Embeddings only via local Ollama: Free
  • Embeddings only via OpenAI API: ~$0.01

Embeddings get cached in SQLite so they persist across restarts and unchanged notes are never re-analyzed.

Scalability for Large Notebooks

For notebooks under 300 notes, pairwise cosine similarity (O(n²)) is fast - 200 notes means ~20,000 comparisons, completing in milliseconds. For 500+ notes, the plugin processes in batches with a progress indicator and cancel button.

On native ANN libraries: Joplin's plugin API explicitly states that native packages cannot be bundled with plugins because they need to work cross-platform. This rules out hnswlib-node and similar native bindings. Pure-JS HNSW implementations exist but are sparsely maintained. For GSoC scope, O(n²) with early termination and batching handles typical notebook sizes well. ANN is documented as a stretch goal for post-GSoC, where Joplin could potentially expose a native ANN API.

Graph Filtering

Even with a similarity threshold, large notebooks can produce cluttered graphs where important connections get lost in noise. Two filtering strategies prevent this:

  1. Threshold filtering: Only edges above the similarity threshold are shown (default 0.5).

  2. Top-K per node: Each note keeps only its K strongest connections (default K=5). This prevents any single note from having dozens of weak edges cluttering the graph. The UI has a density slider that adjusts this value.

Together these ensure the graph remains readable regardless of notebook size.
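A possible implementation of the Top-K filter (a sketch; ScoredEdge is a hypothetical minimal edge shape matching the source/target/confidence fields of the graph data model in section 3.4). An edge survives if it ranks among the K strongest for at least one of its endpoints:

interface ScoredEdge {
    source: string;
    target: string;
    confidence: number;  // 0-1
}

function applyTopKFilter(edges: ScoredEdge[], k = 5): ScoredEdge[] {
    const perNode = new Map<string, ScoredEdge[]>();
    for (const edge of edges) {
        for (const id of [edge.source, edge.target]) {
            if (!perNode.has(id)) perNode.set(id, []);
            perNode.get(id)!.push(edge);
        }
    }
    const kept = new Set<ScoredEdge>();
    for (const nodeEdges of perNode.values()) {
        nodeEdges
            .sort((a, b) => b.confidence - a.confidence)
            .slice(0, k)
            .forEach(edge => kept.add(edge));
    }
    return edges.filter(edge => kept.has(edge));
}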

Pass B (LLM, optional, user enables it)

This pass only runs when the user explicitly enables it in settings and provides an API key. It sends batches of notes that Pass A already identified as related. The LLM returns category labels, centrality scores and relationship descriptions. By pre-filtering with embeddings first, token usage drops significantly.

LLM Prompt

Related notes are batched together (up to 15 per request to stay within context limits) and sent as a single prompt:

You are analyzing notes for a knowledge graph. Given these related notes, identify:
1. A category label for each note (1-2 words, e.g., "web development", "project planning")
2. A centrality score (1-10) based on how central/important each note seems
3. For each pair of notes, a brief relationship label (2-4 words, e.g., "expands on", "contradicts", "provides example of")

Notes to analyze:
Note 1 [ID: abc123]:
Title: React Component Architecture
Body: We'll use functional components with hooks. The main App component will manage global state via Context API...
---
Note 2 [ID: def456]:
Title: Performance Optimization
Body: React.memo for expensive renders. useMemo and useCallback to prevent unnecessary re-renders...
---

Respond in JSON format:
{
    "notes": [
        {"id": "abc123", "category": "...", "centrality": 7},
        {"id": "def456", "category": "...", "centrality": 5}
    ],
    "relationships": [
        {"from": "abc123", "to": "def456", "label": "optimizes"}
    ]
}

LLM Output Schema

interface PassBResponse {
    notes: {
        id: string;
        category: string;
        centrality: number;    // 1-10
        summary?: string;
    }[];
    relationships: {
        from: string;
        to: string;
        label: string;
    }[];
}

The plugin parses the JSON response, updates each node with the LLM-assigned category and centrality, and for each relationship updates the corresponding semantic edge's label. If the LLM returns relationships not in the similarity set, they get added since the LLM might catch connections embeddings missed.

Error handling validates JSON structure before using it. If parsing fails, it falls back to Pass A results. If individual fields are invalid, Pass A defaults are used for those fields.
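A sketch of that merge step against the GraphData model from section 3.4 (the confidence value assigned to LLM-only edges is an assumption, not something the LLM returns):

function mergePassB(graph: GraphData, response: PassBResponse): void {
    for (const item of response.notes) {
        const node = graph.nodes.find(n => n.noteId === item.id);
        if (!node) continue; // ignore IDs the LLM invented
        if (typeof item.category === 'string' && item.category.trim()) node.category = item.category;
        if (Number.isInteger(item.centrality) && item.centrality >= 1 && item.centrality <= 10) {
            node.centrality = item.centrality; // otherwise keep the Pass A value
        }
        if (item.summary) node.summary = item.summary;
    }
    for (const rel of response.relationships) {
        const edge = graph.edges.find(e =>
            (e.source === rel.from && e.target === rel.to) ||
            (e.source === rel.to && e.target === rel.from));
        if (edge) {
            edge.label = rel.label;
        } else {
            // Relationship the embeddings missed: add it with an assumed mid-level confidence
            graph.edges.push({ source: rel.from, target: rel.to, type: 'semantic', label: rel.label, confidence: 0.5 });
        }
    }
}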

LLM Model Recommendations

Default: gpt-4o-mini - $0.15/1M input and $0.60/1M output tokens, excellent at structured tasks, 128k context window, native JSON mode.

Alternatives: gpt-4o (higher quality, $2.50/$10.00 per 1M), claude-3-haiku ($0.25/$1.25 per 1M), or local Ollama with llama3.2:3b for speed or mistral:7b for better quality.

Guaranteed structured output: Instead of relying on prompt instructions alone, the plugin uses each provider's native structured output mechanism:

// OpenAI
const response = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: prompt }],
    response_format: { type: 'json_object' }
});

// Ollama
const response = await fetch('http://localhost:11434/api/generate', {
    method: 'POST',
    body: JSON.stringify({ model, prompt, format: 'json' })
});

// Gemini
const response = await model.generateContent({
    contents: [{ role: 'user', parts: [{ text: prompt }] }],
    generationConfig: { responseMimeType: 'application/json' }
});

Cost Summary

For 200 notes:

  • Local Ollama embeddings only: Free
  • OpenAI embeddings only: ~$0.01
  • OpenAI embeddings + gpt-4o-mini LLM refinement: ~$0.15–0.30
  • Naive full-notebook LLM without embeddings pre-filter: $2–5+

Note Summarization Integration

The plugin can integrate with the existing "Summarise your notes and notebooks" plugin which uses extractive summarization algorithms (LexRank, LSA and KMeans clustering) running locally without API calls. If installed, the AI Note Graph plugin reads these summaries for node tooltips, providing quick context without any additional API usage.

Visualization - Sigma.js + Graphology

The graph renders in a webview panel using sigma.js paired with graphology as the backing data structure. Since graphology is already used for Louvain community detection, using it as the unified graph data structure throughout the plugin eliminates redundancy and keeps the architecture clean. Sigma.js renders thousands of nodes efficiently using WebGL, is actively maintained, and handles the large-notebook case cleanly without a separate upgrade path.

Visual encoding:

  • Node size = centrality (larger means more connections)
  • Node color = community/category
  • Solid edges = explicit links
  • Dashed edges = AI semantic relationships
  • Dotted edges = shared tags
  • Edge thickness = confidence score

Interactions:

  • Click a node to open the note via joplin.commands.execute('openNote', noteId)
  • Hover for tooltip with category, centrality score and AI-generated summary
  • Filter panel: by category, confidence threshold or edge type
  • Toggle between force-directed and hierarchical layouts
  • Search and highlight by note title

Focus Node Mode: Click any node to enter focus mode, showing only that note and its 1–2 hop neighbors. This is essential for large notebooks where the full graph becomes visually overwhelming. Users familiar with Obsidian's graph or the existing Joplin graph plugins expect this behavior.

Export Functionality: Users can export the graph as PNG, SVG or JSON. The JSON export preserves all node metadata and edge weights for external analysis. A "copy to clipboard as image" button enables quick sharing.

3.4 Graph Data Model

The in-memory graph structure passed between the plugin and webview:

interface GraphData {
    nodes: {
        noteId: string;
        title: string;
        summary: string;       // from LLM (Pass B) or empty
        category: string;
        centrality: number;    // 1-10
        folderId: string;
        community: number;     // Louvain community ID
    }[];
    edges: {
        source: string;
        target: string;
        type: 'explicit_link' | 'semantic' | 'tag_shared';
        label: string;
        confidence: number;    // 0-1 (1.0 for explicit links)
    }[];
    metadata: {
        analyzedAt: number;
        scope: string[];
        noteCount: number;
        cursor: string;
    };
}

The plugin persists analysis results in a local SQLite database via joplin.require('sqlite3') stored in joplin.plugins.dataDir() so unchanged notes are never re-analyzed.
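A minimal sketch of how that cache could be opened and initialized (table and column names are illustrative, not final):

async function openEmbeddingCache() {
    const sqlite3 = joplin.require('sqlite3');
    const dataDir = await joplin.plugins.dataDir();
    const db = new sqlite3.Database(`${dataDir}/ai-note-graph.sqlite`);
    db.run(`CREATE TABLE IF NOT EXISTS embeddings (
        note_id TEXT PRIMARY KEY,
        model TEXT NOT NULL,
        note_updated_time INTEGER NOT NULL,  -- used to detect stale entries
        vector BLOB NOT NULL                 -- serialized Float32Array
    )`);
    return db;
}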

3.5 Plugin Settings

Settings (name, type, description):

  • Analysis Scope (Dropdown): "Current notebook" / "All notebooks" / "Selected notebooks"
  • AI Provider (Dropdown): "openai" / "gemini" / "ollama" / "custom"
  • API Endpoint (String): URL, auto-populated based on selected provider
  • API Key (Secure String): Stored in OS keychain via secure: true
  • Embedding Model (String): e.g. "text-embedding-3-small" or "nomic-embed-text-v1.5" for Ollama
  • Enable LLM Analysis (Boolean): Toggle Pass B on/off
  • Chat Model (String): "gpt-4o-mini" (only used if Pass B is on)
  • Similarity Threshold (Slider, 1-100): Lower = more edges, higher = fewer but stronger connections
  • Max Edges Per Node (Slider, 1-20): Top-K filtering to prevent cluttered graphs

Cross-Notebook Analysis: Many Joplin users organize with multiple notebooks and rely on tags to span topics. The three scope options are:

  1. Current notebook (default): Analyzes only the selected notebook
  2. All notebooks: Analyzes every note in Joplin, useful for finding connections across topic boundaries
  3. Selected notebooks: User picks specific notebooks to include, good for project-based analysis

Tags that span notebooks naturally create cross-notebook edges when using broader scopes.

3.6 Graph Mockup

(Mockup image: an example graph for a notebook containing 9 notes.)

3.7 Error Handling and Edge Cases

API failures and rate limits: All API calls (embedding and LLM) are wrapped with retry logic using exponential backoff. If the provider is unreachable or returns a rate limit error, the plugin shows a clear message in the panel and falls back to displaying the graph with only explicit links and tags (no semantic edges). The graph is always usable even when the AI layer fails completely.
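The retry wrapper could be as simple as the sketch below, shared by the embedding and LLM clients:

async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3, baseDelayMs = 1000): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
            return await fn();
        } catch (error) {
            lastError = error;
            // Exponential backoff: 1s, 2s, 4s, ...
            await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt));
        }
    }
    throw lastError;
}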

Ollama not running: When the user selects Ollama as their provider, the plugin pings the local endpoint on startup. If Ollama is not running, the settings panel shows a warning with setup instructions instead of silently failing during analysis.

Large notes exceeding context windows: Embedding models have token limits. Notes longer than the model's context window are truncated to the first N tokens before embedding. For LLM analysis in Pass B, long notes are summarized to their first 2000 words before being sent in the batch. This is documented in the settings tooltip so users understand the behavior.

Very large notebooks: For notebooks with 500+ notes, the pairwise similarity computation (O(n^2)) could become slow. The plugin mitigates this by processing notes in batches, showing a progress indicator in the panel and allowing the user to cancel mid-analysis. Embedding generation is the bottleneck, not similarity computation and the SQLite cache makes repeat analyses near-instant.

Empty or minimal notebooks: If a notebook has 0 or 1 notes, the plugin shows a helpful message instead of an empty graph. If all notes are very short (less than 10 words), similarity scores will be unreliable, so the plugin displays a notice that results may be limited.

Invalid or expired API keys: The plugin validates the API key with a lightweight test request (a single embedding of the word "test") when the user first saves their key in settings. Invalid keys are flagged immediately rather than failing silently during full analysis.
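As a sketch, the validation simply reuses the embedding provider abstraction with a single tiny input (the provider shape here is an assumption based on the embedBatch call used earlier):

async function validateApiKey(provider: { embedBatch(texts: string[]): Promise<number[][]> }): Promise<boolean> {
    try {
        await provider.embedBatch(['test']); // one minimal embedding request
        return true;
    } catch {
        return false;
    }
}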

3.8 Incremental Updates

The plugin stays in sync with note changes without re-analyzing the entire notebook.

Adding New Notes

When joplin.workspace.onNoteChange(handler) fires for a new note, the plugin checks if the note is in the currently displayed folder. If yes, it triggers an incremental update: generate embedding for just that note, update the cache, recompute similarities only for this note vs all others, update semantic edges, recalculate centrality for affected nodes, and re-render the graph.

async function handleNoteChange(event: NoteChangeEvent) {
    const note = event.note;
    if (note.parent_id !== currentFolderId) return;

    const embedding = await embedNote(note);
    await cache.upsertEmbedding(note.id, embedding);
    const similarities = await computeSimilaritiesForNote(note.id, allOtherEmbeddings);
    await cache.updateEdgesForNote(note.id, similarities);
    await refreshGraph();
}

Deleting Notes

The onNoteChange event also fires on deletion. The plugin removes the note from the graph immediately, removes its edges, cleans up the embedding and edges from the cache, recalculates centrality for affected nodes (those that were connected to the deleted note), and re-renders the graph.

async function handleNoteDeletion(noteId: string) {
    graphData.nodes = graphData.nodes.filter(n => n.id !== noteId);
    graphData.edges = graphData.edges.filter(e =>
        e.source !== noteId && e.target !== noteId
    );
    await cache.deleteEmbedding(noteId);
    await cache.deleteEdgesForNote(noteId);
    await recalculateCentrality(affectedNodeIds);
    await refreshGraph();
}

Sync from Other Devices

For notes synced from other devices, the plugin uses the Events API cursor combined with the onSyncComplete workspace event. After each full analysis, the cursor is stored. On sync complete, the plugin checks for changes since that cursor, processes only the changed notes, and updates the stored cursor.

joplin.workspace.onSyncComplete(async () => {
    const changes = await joplin.data.get(['events'], { cursor: storedCursor });
    for (const change of changes.items) {
        if (change.type === 'note' && change.item_id in currentScope) {
            await handleNoteChange(change);
        }
    }
    storedCursor = changes.cursor;
});

3.9 Handling Images in Notes

Images are stored as Joplin "resources" and referenced in note bodies via :/resourceId syntax.

Using Joplin's Built-in OCR

Joplin already has OCR support using Tesseract.js. When OCR is enabled, image resources get an ocr_text field populated with extracted text. The plugin fetches this via the Data API and includes it in the embedding text. This way diagrams, screenshots with text and handwritten notes all get included in the embedding.

const resources = await joplin.data.get(['notes', noteId, 'resources'], {
    fields: ['id', 'mime', 'ocr_text']
});
for (const resource of resources.items) {
    if (resource.ocr_text) {
        noteText += '\n' + resource.ocr_text;
    }
}

Image Alt Text

Markdown images often have alt text in the format ![alt text](:/resourceId). The plugin extracts and includes this in the embedding text.

const altTextRegex = /!\[(.*?)\]\(:\/[a-f0-9]{32}\)/g;
let match;
while ((match = altTextRegex.exec(noteBody)) !== null) {
    noteText += ' ' + match[1];
}

Images Without Text

For images without OCR text or alt text, the default behavior is to skip them. Most notes have text content anyway. Notes that are mostly images will have sparser embeddings, but this is correct behavior since a note that's mostly images with no text isn't semantically similar to text-heavy notes.

A future enhancement could use multimodal embeddings (CLIP-style models) to embed images directly into the same vector space as text, enabling visual similarity detection. This is documented as a potential post-GSoC enhancement.


4. Implementation Plan

350 hours across 12 weeks.

Weeks 1-2: Plugin scaffold and data layer

Set up the plugin project using the Joplin generator. Create the webview panel, register the toolbar button and toggle command. Implement note fetching with pagination via the Data API and explicit link extraction using the :/noteId regex pattern.

Outcome: A working plugin skeleton that fetches notes and extracts all explicit links from a selected notebook.

Weeks 3-4: Structural graph rendering

Integrate sigma.js and graphology into the webview panel via the webpack entry point. Render explicit links and shared tags as a force-directed graph. Implement click-to-navigate, the two-way webview message passing pattern, hover tooltips and a basic legend.

Outcome: A functional graph of explicit connections with navigation, tooltips and category legend.

Weeks 5-6: Embedding pipeline and semantic edges

Week 5: Build the embedding provider abstraction supporting OpenAI, Gemini and Ollama. Implement batched embedding requests (up to 100 notes per batch). Apply instruction prefixes for Nomic models. Compute cosine similarity and add semantic edges above the threshold. Set up the SQLite cache.

Week 6: Build the settings panel with secure: true API key storage, provider/model configuration and scope selection. Implement Louvain community detection using graphology-communities-louvain on the similarity graph. Add keyword-based fallback for sparse graphs.

Outcome: Unlinked but semantically related notes appear connected as dashed edges, colored by community.

Weeks 7-8: LLM analysis and categorization

Implement optional Pass B with structured JSON output using each provider's native JSON mode. Build batching logic grouping notes by embedding similarity to minimize token usage. Add cost estimation displayed before analysis runs. Implement all error handling: API failures, rate limits, timeouts, JSON parse failures and fallback to Pass A results.

Outcome: The graph has LLM-assigned category labels, centrality-sized nodes and labeled relationship edges.

Weeks 9-10: Graph UI polish, focus mode, export and incremental updates

Add category coloring with legend, density slider (Top-K per node), edge type filtering, threshold presets and layout toggle. Implement focus node mode (1–2 hop neighborhood view). Add export functionality (PNG, SVG, JSON and clipboard copy). Wire up Events API cursor tracking and workspace event listeners for incremental updates on note edits, deletions and sync. Implement dark and light theme support via Joplin CSS variables.

Outcome: Production-ready graph that stays in sync with note changes, with focus mode and export.

Weeks 11-12: Testing, documentation and release

Unit tests: Link extraction, similarity computation, score normalization, edge creation logic, centrality computation, community detection.

Integration tests: Mock Data API responses for full pipeline runs.

AI Output Testing Strategy:

  • Golden-set tests: ~20 curated notes with known relationships. Verifies that closely related notes (e.g., "React Hooks Guide" and "useState Examples") score above threshold and unrelated notes (e.g., "Grocery List" and "React Hooks Guide") score below.
  • Mock embedding responses: Deterministic embedding vectors for CI - tests similarity computation and graph construction without API costs.
  • Threshold calibration tests: Verifies different threshold settings produce expected graph densities (strict = sparse, loose = dense).
  • Regression tests: Captures baseline output for the golden set and flags unexpected changes when the algorithm is modified.

Edge case handling: empty notebooks, single note, notes with no body text, notes exceeding embedding context windows.

Documentation: Setup guides for OpenAI, Gemini, Ollama and custom endpoints. Performance benchmarking on small (50), medium (200) and large (500+) notebooks. Plugin repository submission and final polish.

Outcome: Published and installable from Joplin's plugin browser.


5. Deliverables

  • Installable Joplin plugin published to the plugin repository
  • Embedding-based semantic analysis discovering relationships between unlinked notes
  • Support for OpenAI, Google Gemini and Ollama providers with batched requests
  • Optional LLM enrichment (Pass B) for category labels, centrality scores and relationship labels
  • Interactive sigma.js graph with click-to-navigate, focus mode, filtering and layout options
  • Export functionality: PNG, SVG, JSON and clipboard copy
  • SQLite caching with incremental updates via workspace events and Events API cursor
  • Settings UI with secure API key storage, provider configuration and three-level scope selection
  • Community detection via graphology-communities-louvain (Louvain algorithm)
  • Test suite: link extraction, similarity computation, graph building and AI output validation with golden-set tests
  • User documentation for setup with OpenAI, Gemini, Ollama and custom endpoints

6. Availability

Weekly availability during GSoC: I can dedicate 7 to 8 hours per day on weekdays, and I am also available for meetings or check-ins on weekends if needed. If the project demands extra effort at any point, I am happy to put in more time on weekends to keep things on track.

Time zone: I am in IST (Indian Standard Time) and flexible with scheduling. I am open to calls or chats based on whatever works best for the mentors.

Any other commitments during the programme: As of now I don't have any other commitments during the GSoC period so I will be fully focused on the project.


@tessus @malekhavasi Hi, I would love your feedback on my proposal

Looks like a strong proposal. I loved the way you described everything, and you clearly have solid knowledge of vector databases, including file-based vector databases. Best of luck on your journey!


@yugalkaushik Thank you for your proposal and it’s great that you made this quite comprehensive! I am the primary mentor of this project idea.

Looking at your Graph Data Model, it summarizes well what you're trying to do. I have a couple of questions; the focus will be more on the semantic part of your graph implementation:

A) Could you explain in greater detail the flow from DATA COLLECTION to AI ANALYSIS - Pass A?

  • What is the exact input/output?
  • What are you embedding?
  • Which specific embedding models are you trying to use as default and why?
  • Is Ollama truly the only best option for local inference of embedding models?

B) Regarding the Pass A:

  • Are you planning to show all semantic edges with a score? Or are you planning to discard the ones that have a low confidence score?
    • If so, then what is the threshold? And if possible, can you justify why?
  • Do you think it is possible to do community detection and centrality in Pass A? Could you then somehow create categories in Pass A?
  • Why is cosine similarity the best approach to create semantic edges? Are there other possibly better approaches?

C) Regarding Pass B:

  • What are the prompts? What are you inputting to the LLMs?
  • What is the expected output and in which format/schema? How do you then create a semantic graph with categories out of it?
  • Which LLMs are you thinking of using? I see gpt-4o-mini but why?

D) What happens if a user adds new notes/notebooks? How would they be integrated into the graph? Same goes for deletion.

E) What if notes contain images?


Hi, thanks for the detailed questions! Let me go through each one.

A) Data Collection to Pass A Flow

What is the exact input/output?

Input to Data Collection:

- Folder ID from `joplin.workspace.selectedFolder()`

- The Data API call: `joplin.data.get(['folders', folderId, 'notes'], { fields: ['id', 'title', 'body', 'parent_id'], limit: 100, page })`

Output from Data Collection (Input to Pass A):

interface NoteData {
    id: string;           // 32-char hex ID
    title: string;        // note title
    body: string;         // full markdown body
    parent_id: string;    // folder ID
}

Input to Pass A:

- Array of `NoteData` objects

- Each note's title and body are concatenated into a single text string for embedding

Output from Pass A:

interface EmbeddingResult {
    noteId: string;
    vector: number[];     // embedding vector
    model: string;
}

interface SemanticEdge {
    noteAId: string;
    noteBId: string;
    confidenceScore: number;  // cosine similarity, 0-1
}

What are you embedding?

For each note, I concatenate:

{title} {title} {body}

The title is repeated twice to give it more weight since titles are typically more semantically dense than body text. This is a common technique in document embedding.

Before embedding, basic preprocessing:

- Strip markdown syntax (links, images, code blocks, headers)

- Remove Joplin resource references (`:/noteId` patterns)

- Normalize whitespace

Which embedding models and why?

Default recommendation: `text-embedding-3-small` (OpenAI)

Why this model specifically:

- 1536 dimensions (good balance of quality vs storage)

- 8191 token context window (handles most notes)

- $0.02 per 1M tokens (very cheap)

- OpenAI API is the most reliable and well-documented

For local inference with Ollama:

- Default: `nomic-embed-text` (768 dimensions, 8192 context, Apache 2.0 license)

- Alternative: `mxbai-embed-large` (1024 dimensions, slightly better quality)

- Alternative: `all-minilm` (384 dimensions, faster but lower quality)

Why `nomic-embed-text`:

- Specifically trained for document retrieval and semantic similarity

- Open source, no API costs

- Good balance of speed and quality

- 8192 token context matches OpenAI's offering

Is Ollama truly the only best option for local inference?

Ollama offers an easy setup, works across platforms, provides a good selection of models, and exposes a REST API, but it requires a separate installation and runs as a background daemon. transformers.js, on the other hand, can run directly in Node.js without external dependencies, though it comes with a larger bundle size, slower CPU performance, and a more limited set of models. llama.cpp (via Node bindings) is highly efficient and fast, especially with GGUF models, but its setup can be complex and may require manual compilation on some platforms. LM Studio provides a user-friendly GUI and simplifies model management, however it is a desktop application, not suitable for headless environments, and can be excessive if the goal is just embeddings.

My recommendation: Default to Ollama because:

1. It has the best developer experience (one command to install, one command to pull models)

2. REST API makes integration simple

3. Cross-platform (Windows, Mac, Linux)

But I'll also support a "custom endpoint" option in settings so users can point to any OpenAI-compatible embedding endpoint.

B) Regarding Pass A

Are you planning to show all semantic edges or discard low confidence ones?

Discard low confidence edges. Showing everything would create a cluttered, unusable graph.

What is the threshold and why?

Default threshold: 0.5 (on 0-1 scale)

Justification:

- Cosine similarity of 1.0 = identical vectors

- Cosine similarity of 0.0 = completely unrelated

- In practice with modern embedding models:

- 0.8+ = very similar, almost the same topic

- 0.6-0.8 = related topics, worth connecting

- 0.5-0.6 = somewhat related, marginal

- Below 0.5 = mostly noise

I chose 0.5 as the default because research literature, such as work on semantic similarity with BERT, shows that 0.5 is a common cutoff for identifying content that is meaningfully related. It provides a reasonable balance by capturing relevant connections without being too loose. It is also better to under-connect initially than to overwhelm the system with noisy or weak associations, allowing for cleaner results and easier tuning later if needed.

The threshold setting in the UI will have three presets:

- "Strict" (0.7): Only strong connections

- "Balanced" (0.5): Default

- "Loose" (0.3): More connections, more noise

Can you do community detection and centrality in Pass A?

Yes, absolutely and I should have emphasized it more.

Centrality in Pass A:

Once I have the similarity matrix (all pairwise cosine similarities), I can compute centrality without needing the LLM:


function computeCentrality(notes: NoteData[], similarityMatrix: number[][]): Map<string, number> {
    const centrality = new Map<string, number>();
    const threshold = 0.5;
    for (let i = 0; i < notes.length; i++) {
        let connections = 0;
        for (let j = 0; j < notes.length; j++) {
            if (i !== j && similarityMatrix[i][j] >= threshold) {
                connections++;
            }
        }
        // Normalize to 1-10 scale
        centrality.set(notes[i].id, Math.max(1, Math.round((connections / notes.length) * 10)));
    }
    return centrality;
}

This gives us degree centrality (how connected a node is). For more sophisticated centrality:

- Eigenvector centrality: weight connections by how central their neighbors are

- PageRank: similar concept, good for directed graphs

I'll use degree centrality by default since it's intuitive and fast.

Community detection in Pass A:

Yes, we can cluster notes into communities using the similarity matrix:

Option 1: K-means clustering on embeddings

  • Cluster embedding vectors directly
  • Good for finding topic clusters

Option 2: Louvain algorithm on similarity graph

  • Treat similarity matrix as weighted adjacency matrix
  • Find communities that maximize modularity
  • Libraries: graphology (JS), or implement simple version

Option 3: Keyword-based fallback

  • Scan for keywords: "react", "api", "design", etc.
  • Assign category based on keyword frequency

Updated approach based on your question:

I'll implement a hybrid:

1. Primary: Cluster embeddings using k-means or hierarchical clustering

2. Labels: Use the most frequent meaningful terms in each cluster as the category name

3. Fallback: If clustering fails or user disables it, use keyword matching

This way Pass A can handle both centrality and categories without needing the LLM at all. Pass B becomes purely optional enrichment.

Why is cosine similarity the best approach? Other approaches?

Why cosine similarity:

1. Standard in NLP: It's what embedding models are optimized for

2. Scale invariant: Only measures angle between vectors, not magnitude

3. Fast: Simple dot product after normalization, O(n) per comparison

4. Interpretable: 0-1 range maps nicely to confidence scores

Alternative approaches:

Euclidean distance is useful when the magnitude of vectors matters, but it is sensitive to vector norms and typically requires normalization for meaningful comparisons. Dot product is often used for raw similarity without normalization, though it is not bounded, which makes it harder to define consistent thresholds. Jaccard similarity works well for set-based comparisons such as shared tags or keywords, but it is limited to discrete sets and cannot be applied to continuous embeddings. BM25 or TF-IDF are suitable for keyword-based matching when embeddings are not used, however they only capture lexical similarity and fail to understand semantic relationships. Learned similarity approaches involve training a model to predict relatedness between inputs, which can be powerful but require labeled training data and are often overkill for simpler use cases.

My approach: Use cosine similarity as primary, but also incorporate:

- Tag overlap: Notes sharing tags get a bonus

- Explicit links: Always shown regardless of similarity score

So the final edge weight combines:

finalScore = 0.7 * cosineSimilarity + 0.2 * tagOverlap + 0.1 * linkBonus

This hybrid approach captures both semantic and structural relationships.

C) Regarding Pass B

What are the prompts? What are you inputting to the LLMs?

Input format:

I batch related notes together (pre-filtered by Pass A similarity) and send them as a single prompt:

You are analyzing notes for a knowledge graph. Given these related notes, identify:

1. A category label for each note (1-2 words, e.g., "web development", "project planning")

2. A centrality score (1-10) based on how central/important each note seems

3. For each pair of notes, a brief relationship label (2-4 words, e.g., "expands on", "contradicts", "provides example of")

Notes to analyze:

Note 1 [ID: abc123]:

Title: React Component Architecture

Body: We'll use functional components with hooks. The main App component will manage global state via Context API...

---

Note 2 [ID: def456]:

Title: Performance Optimization

Body: React.memo for expensive renders. useMemo and useCallback to prevent unnecessary re-renders...

---

Note 3 [ID: ghi789]:

Title: Project Timeline

Body: Week 1-2: Setup and scaffolding. Week 3-4: Core features...

Respond in JSON format:

{
    "notes": [
        {"id": "abc123", "category": "...", "centrality": 7},
        {"id": "def456", "category": "...", "centrality": 5},
        {"id": "ghi789", "category": "...", "centrality": 8}
    ],
    "relationships": [
        {"from": "abc123", "to": "def456", "label": "optimizes"},
        {"from": "ghi789", "to": "abc123", "label": "schedules implementation of"}
    ]
}

What is the expected output and format?

JSON schema:

interface PassBResponse {
    notes: {
        id: string;
        category: string;
        centrality: number;    // 1-10
        summary?: string;
    }[];
    relationships: {
        from: string;
        to: string;
        label: string;
    }[];
}

How this creates the semantic graph:

1. Parse the JSON response

2. Update each node with the LLM-assigned category and centrality

3. For each relationship, update the corresponding semantic edge's label

4. If LLM returns relationships not in our similarity set, add them (the LLM might catch connections embeddings missed)

Error handling:

- Validate JSON structure before using

- If parsing fails, fall back to Pass A results

- If individual fields are invalid, use Pass A defaults for those fields

Which LLMs and why gpt-4o-mini?

Default: `gpt-4o-mini`

Why:

- Cost: very cheap

- Quality: Surprisingly good for structured tasks like categorization

- Speed: Fast response times

- JSON mode: Native support for structured JSON output

- Context window: 128k tokens (can handle large batches)

Alternatives:

The gpt-4o-mini model is typically the default choice due to its strong performance-to-cost ratio, priced at $0.15 per million input tokens and $0.60 per million output tokens, making it ideal for most applications. For use cases requiring the highest quality responses, gpt-4o is preferred, though it comes at a higher cost of $2.50 per million input tokens and $10.00 per million output tokens. claude-3-haiku serves as a viable alternative to OpenAI models, offering moderate pricing at $0.25 for input and $1.25 for output per million tokens. For users with strict privacy requirements or those looking to avoid API costs altogether, running models locally via Ollama is a good option, as it is free but requires local infrastructure and setup.

For Ollama local:

- Recommend `llama3.2:3b` for speed

- Or `mistral:7b` for better quality

- These handle structured JSON output reasonably well

The plugin settings will let users choose any model on their configured endpoint, but docs will recommend gpt-4o-mini as the default for the best cost/quality ratio.

D) What happens when users add/delete notes?

Adding new notes:

Detection:

1. `joplin.workspace.onNoteChange(handler)` fires when any note is created or modified

2. On each change event, check if the note is in the currently displayed folder

3. If yes, trigger incremental update

Incremental update flow:

async function handleNoteChange(event) {
    const note = event.note;
    // Check if note is in our current scope
    if (note.parent_id !== currentFolderId) return;
    // Generate embedding for just this note
    const embedding = await embedNote(note);
    // Update cache
    await cache.upsertEmbedding(note.id, embedding);
    // Recompute similarities only for this note vs all others
    const similarities = await computeSimilaritiesForNote(note.id, allOtherEmbeddings);
    // Update semantic edges
    await cache.updateEdgesForNote(note.id, similarities);
    // Re-render graph
    await refreshGraph();
}

Using Events API cursor for sync:

For notes synced from other devices:

// Store cursor after each full analysis
const cursor = await joplin.data.get(['events'], { cursor: lastCursor });

// On sync complete, check for changes
joplin.workspace.onSyncComplete(async () => {
    const changes = await joplin.data.get(['events'], { cursor: storedCursor });
    for (const change of changes.items) {
        if (change.type === 'note' && change.item_id in currentScope) {
            await handleNoteChange(change);
        }
    }
    storedCursor = changes.cursor;
});

Deleting notes:

Detection:

1. `onNoteChange` also fires on deletion

2. OR check if a note in our cache no longer exists via Data API

Cleanup flow:

async function handleNoteDeletion(noteId: string) {
    // Remove from graph immediately
    graphData.nodes = graphData.nodes.filter(n => n.id !== noteId);
    graphData.edges = graphData.edges.filter(e =>
        e.source !== noteId && e.target !== noteId
    );
    // Clean up cache
    await cache.deleteEmbedding(noteId);
    await cache.deleteEdgesForNote(noteId);
    // Recalculate centrality for affected nodes
    // (nodes that were connected to the deleted note)
    await recalculateCentrality(affectedNodeIds);
    // Re-render
    await refreshGraph();
}

E) What if notes contain images?

I would love your suggestion on this part; below is what I was able to work out so far. Images are stored as Joplin "resources" and referenced in note bodies via `:/resourceId` syntax.

Approach:

1. Use Joplin's built-in OCR text

Joplin already has OCR support (using Tesseract.js). When OCR is enabled, image resources get an `ocr_text` field populated with extracted text.

Via the Data API:

```typescript
// Fetch the resources attached to a note, including any OCR-extracted text
const resources = await joplin.data.get(['notes', noteId, 'resources'], {
	fields: ['id', 'mime', 'ocr_text'],
});

// For each image resource with OCR text, append it to the note content used for embedding
for (const resource of resources.items) {
	if (resource.ocr_text) {
		noteText += '\n' + resource.ocr_text;
	}
}
```

This way diagrams, screenshots with text, handwritten notes, etc. all get included in the embedding.

2. Image alt text and captions

Markdown images often have alt text: `![alt text](:/resourceId)`

Extract and include this in the embedding text:

```typescript
// Markdown image syntax: ![alt text](:/resourceId) where resourceId is a 32-char hex id
const altTextRegex = /!\[(.*?)\]\(:\/[a-f0-9]{32}\)/g;
let match;
while ((match = altTextRegex.exec(noteBody)) !== null) {
	noteText += ' ' + match[1]; // append the alt text to the embedding input
}
```

3. What about image content itself (visual)?

For images without OCR text or alt text, we have two options:

Option A: Skip them (simplest)

- Just don't include anything for images without text

- Most notes have text content anyway

- This is what I'll do by default

Option B: Image captioning (expensive and heavy)

- Send images to a vision model to generate captions

- Include captions in embedding text

- Expensive and slow

What I'll implement for GSoC:

1. Default: Include OCR text if available (leveraging Joplin's existing OCR)

2. Always: Extract markdown alt text from image syntax

Edge case: Notes that are mostly images

For notes with little text but many images:

- If OCR is available, we get text to embed

- If no OCR text is available, such notes will have low similarity to text-heavy notes; they might still cluster together as "visual notes" if they have similar alt text

1 Like

@HahaBill Hi, I have also updated the proposal based on your questions and added community detection and centrality concepts as well.

1 Like

@yugalkaushik Thank you for your answer and for providing some clarity! :slight_smile: I really appreciate the effort of writing all of this down. I think it’s best if I go through your answers and give my feedback progressively rather than all at once, because it’s quite a lot.

A) Data Collection to Pass A flow

With the way you are proposing to embed each note in the {title} {title} {body} format, I do understand that this is a way to increase the term frequency score and thus the term importance, as in tf-idf or search queries. But I am not sure whether that would be useful for embedding models; this is something to think about.

B) Regarding Pass A

  • Do you think clustering with the proposed approaches in those dimensions will be reliable and efficient?

  • How would you make sure that the k is the optimal choice? Same question for hierarchical clustering with the distance and linkage. You might find this thread useful: Idea 3 discussion - Using Python subprocess for UMAP and HDBSCAN instead of JavaScript - #7 by HahaBill - you do not have to follow this solution, take it as a source for your inspiration and research

  • This is an interesting idea of having different variables for computing the final score for the semantic edges!

    • How is tagOverlap or linkBonus calculated? I couldn’t find your approach on that.
    • You might want to think about normalizing the computed cosine similarity scores, because they may not be well distributed. What I mean is that the scores might be distributed from 0.56 to 0.83 rather than from 0 to 1. That means that for example the threshold of 0.5 would let almost everything through if we purely score on the cosine similarity.
      • Furthermore, I would assume that the threshold is applied to your final score? Then if notes have no tags or links to each other, all the final scores would be 0.7 * cosine similarity, but your threshold assumes a range from 0 to 1. Your default threshold (0.5) would then be "stricter". How do you solve that?
1 Like

@HahaBill Hi, I've been digging deeper into some of the implementation details and found a few things that should strengthen the approach.

For scalability with larger notebooks, I looked into Approximate Nearest Neighbor search. Worth noting that hnswlib-node is a native package and Joplin's plugin API explicitly prohibits native packages for cross-platform compatibility, so that route is out. For GSoC scope, O(n²) with batched processing and a progress indicator handles typical notebooks well since most users are under 300 notes anyway. ANN is documented as a post-GSoC stretch goal for when Joplin potentially exposes a native API for this.

I also noticed that dense graphs can turn into "hairballs" even with threshold filtering. Adding a Top-K per node limit where each note keeps only its 5 strongest connections keeps things clean and readable. There's a density slider in the UI so users can tune this value.
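
A rough sketch of that Top-K pruning step (the edge shape and the default K of 5 follow this proposal; the helper name is a placeholder):

```typescript
interface SemanticEdge { source: string; target: string; score: number }

// Keep only each note's K strongest semantic edges (K exposed via the density slider).
function pruneToTopK(edges: SemanticEdge[], k = 5): SemanticEdge[] {
	const byNode = new Map<string, SemanticEdge[]>();
	for (const edge of edges) {
		for (const nodeId of [edge.source, edge.target]) {
			if (!byNode.has(nodeId)) byNode.set(nodeId, []);
			byNode.get(nodeId)!.push(edge);
		}
	}
	// An edge survives if it is among the top K for at least one of its endpoints
	const kept = new Set<SemanticEdge>();
	for (const nodeEdges of byNode.values()) {
		nodeEdges
			.sort((a, b) => b.score - a.score)
			.slice(0, k)
			.forEach(e => kept.add(e));
	}
	return edges.filter(e => kept.has(e));
}
```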

One more thing, I saw your summarization plugin from GSoC 2024 uses LexRank, LSA and KMeans clustering. If both plugins are installed, the graph plugin can read those summaries directly for node tooltips instead of making extra LLM calls. Rich hover context at zero additional API cost.

Also refined the scoring approach based on your feedback. Raw cosine similarity scores cluster in a narrow range like 0.56 to 0.83 rather than spanning the full 0 to 1 scale, so min-max normalization is applied first to make the threshold meaningful. Tag and link bonuses are purely additive on top of the normalized cosine score, and they only activate when a minimum cosine floor of 0.3 is already met. This means notes without tags are never penalized and the base comparison is always semantic similarity alone. For community detection, rather than K-means on raw high-dimensional embeddings, the approach uses Louvain via graphology-communities-louvain running directly on the weighted similarity graph, which sidesteps the curse of dimensionality entirely and is faster than older JS alternatives.
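
To make the refined scoring concrete, here is a rough sketch; the exact bonus weights shown are placeholder values still being tuned, not final numbers:

```typescript
// Min-max normalize raw cosine scores so the user-facing threshold spans 0..1,
// then add tag/link bonuses only when the minimum cosine floor (0.3) is met.
function finalScores(
	pairs: { a: string; b: string; cosine: number; sharedTags: number; hasLink: boolean }[],
) {
	const cosines = pairs.map(p => p.cosine);
	const min = Math.min(...cosines);
	const max = Math.max(...cosines);
	const range = max - min || 1; // avoid division by zero when all scores are equal

	return pairs.map(p => {
		const normalized = (p.cosine - min) / range;
		let score = normalized;
		if (p.cosine >= 0.3) {                            // bonuses only activate above the cosine floor
			score += Math.min(p.sharedTags * 0.05, 0.15); // tag overlap bonus (assumed weight)
			score += p.hasLink ? 0.1 : 0;                 // explicit link bonus (assumed weight)
		}
		return { ...p, score: Math.min(score, 1) };
	});
}
```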

1 Like

Hi, I wanted to share a proof of concept I built to validate the core idea before diving into the full implementation.

The POC is a working Joplin plugin that fetches all notes from a selected notebook, generates embeddings using the Google Gemini API (free tier, gemini-embedding-001), computes cosine similarity between every pair of notes, and renders the results as an interactive graph inside a Joplin panel.
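
For reference, the POC's embedding call is a single REST request to the Gemini `embedContent` endpoint; roughly like this (API key handling, batching and error handling omitted):

```typescript
// Sketch of the embedding call used in the POC (Gemini embedContent REST endpoint).
async function embedWithGemini(text: string, apiKey: string): Promise<number[]> {
	const url =
		'https://generativelanguage.googleapis.com/v1beta/models/gemini-embedding-001:embedContent?key=' + apiKey;
	const response = await fetch(url, {
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify({
			model: 'models/gemini-embedding-001',
			content: { parts: [{ text }] },
		}),
	});
	const data = await response.json();
	return data.embedding.values; // one embedding vector per request
}
```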

It handles all three connection types from the proposal:

  • Solid lines for explicit :/noteId links between notes
  • Dotted lines for notes sharing the same Joplin tags
  • Dashed lines for AI-detected semantic relationships (cosine similarity above 0.5 threshold)

Clicking any node opens that note in Joplin. Hovering shows a tooltip with the full note title.

For the rendering in the POC I used plain HTML5 Canvas with a manual force-directed physics simulation to keep things simple and debuggable. The interesting result is that semantically related notes naturally cluster together in the layout, which visually validates the whole idea. Notes about unrelated topics float away as isolated nodes, which is exactly the correct behavior.

For the actual GSoC implementation the plan is to replace the Canvas renderer with sigma.js paired with graphology as the unified graph data structure, exactly as described in the proposal. Graphology will also power the Louvain community detection via graphology-communities-louvain for node coloring by topic cluster. The POC was intentionally kept lightweight to prove the AI layer works before adding the full library stack.
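
As a rough sketch of how that wiring could look (node coordinates, the color palette and the `notes`/`edges` inputs are illustrative placeholders, not final layout code):

```typescript
import Graph from 'graphology';
import louvain from 'graphology-communities-louvain';
import Sigma from 'sigma';

// `notes` and `edges` come from the plugin's analysis step (shapes assumed for this sketch)
declare const notes: { id: string; title: string }[];
declare const edges: { source: string; target: string; score: number }[];

const palette = ['#e74c3c', '#3498db', '#2ecc71', '#f1c40f', '#9b59b6'];

// Build the unified graph structure
const graph = new Graph();
for (const note of notes) {
	// sigma.js needs x/y; a force layout (e.g. graphology-layout-forceatlas2) would normally set these
	graph.addNode(note.id, { label: note.title, x: Math.random(), y: Math.random(), size: 4 });
}
for (const edge of edges) {
	graph.addEdge(edge.source, edge.target, { weight: edge.score });
}

// Louvain community detection on the weighted similarity graph;
// assign() writes a 'community' attribute onto every node
louvain.assign(graph);

// Color nodes by community so topic clusters are visible
graph.forEachNode((node, attrs) => {
	graph.setNodeAttribute(node, 'color', palette[attrs.community % palette.length]);
});

// Render inside the plugin panel's container element
new Sigma(graph, document.getElementById('graph-container') as HTMLElement);
```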

Preview: (screenshot of the POC graph panel)

Source Code:
GitHub - yugalkaushik/poc-graph-plugin · GitHub

1 Like

@HahaBill Hi, I came across two libraries worth considering. remove-markdown handles all GFM markdown stripping, including edge cases, in one call, and ml-distance from the mljs organization covers cosine similarity and other distance metrics in a fully typed TypeScript package. Neither adds native dependencies, so both bundle cleanly into the .jpl without any cross-platform issues. What are your thoughts on this?
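
For context, the intended usage of both libraries is tiny; roughly (the note and embedding variables are placeholders):

```typescript
import removeMd from 'remove-markdown';
import { similarity } from 'ml-distance';

// Strip markdown syntax before embedding so formatting doesn't leak into the vectors
const plainText = removeMd(note.title + '\n' + note.body);

// Cosine similarity between two embedding vectors
const score = similarity.cosine(embeddingA, embeddingB);
```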

1 Like

@yugalkaushik Thank you for looking into this in more depth and putting time in creating the POC!

For me, the proposal looks solid! For the system prompt and LLM choice, I think it’s a good start :slight_smile:

To answer your question regarding the two libraries:

  • I like ml-distance because you can easily try out different similarity/distance computations, and a fair number of people are using it.
  • remove-markdown sounds good; just think about what to remove and what to preserve. If you bump into its limitations, you may need to build additional functionality on top of it.
1 Like

This is small feedback and personal taste, but sigma.js doesn’t seem to have the best UX for me. It feels old and not snappy/modern, if you know what I mean. It would be cool if you could display it as a 3D viewer. But functionally, sigma.js works perfectly, and this is more a matter of personal taste and preference.

So yeah, sigma.js is a good choice!!

Update: Due to the scope of this program, let’s focus on the 2D view. There is other, more important functionality to implement.

1 Like

Thank you for the feedback. I’ll research more to find a better alternative to sigma.js.

1 Like

@yugalkaushik Hi! I hope you've been doing well! I re-read some of the parts of the proposal and I have an additional question:

  • For PASS A: You mention Transformers.js briefly as an alternative for local inference but chose Ollama as the default. Could you expand in more detail on why Ollama is the right default for your project specifically?
  • Could you show me a short video with minimal POC of running sigma.js in a Joplin plugin? This is to make sure that it works.
1 Like

@HahaBill Hi, I have answered your question about Pass A and attached a video of sigma.js running.

  1. I explored using Transformers.js for local inference, but it does not fit well with Joplin’s plugin environment. It relies on ONNX Runtime Web, which dynamically loads WASM and helper files that fail under Joplin’s strict CSP and webpack setup, and fixing this would require fragile build changes. Even if that worked, running embeddings inside Joplin’s Node.js/Electron process would be CPU intensive and could freeze the UI, and the model size of around 20 to 30 MB would either bloat the plugin or require extra download and caching logic. Ollama avoids these issues by running as a separate local service: the plugin just makes a simple HTTP request for embeddings (a minimal request sketch is included below the source link), offloading computation and keeping the plugin lightweight. The tradeoff is that it requires a separate installation, so Gemini is recommended as the easiest starting option with just an API key, while Ollama is better for users who want fully local and private inference.
  2. Here is sigma.js working in the plugin with simple links.

Source Code: GitHub - yugalkaushik/graph-poc · GitHub
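
To illustrate point 1, the embedding request to a local Ollama service is just a plain HTTP call; a rough sketch (the model name is an example, error handling omitted):

```typescript
// Sketch: request an embedding from a locally running Ollama daemon
async function embedWithOllama(text: string): Promise<number[]> {
	const response = await fetch('http://localhost:11434/api/embeddings', {
		method: 'POST',
		headers: { 'Content-Type': 'application/json' },
		body: JSON.stringify({ model: 'nomic-embed-text', prompt: text }),
	});
	const data = await response.json();
	return data.embedding; // Ollama returns a single embedding array per prompt
}
```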

1 Like

@yugalkaushik Hi! Thank you for your answer! It’s great to see your POC with sigma.js running in a Joplin plugin.

Your point regarding Transformers.js:

1 Like

@HahaBill

1. Model comparison

| Model | all-MiniLM-L6-v2 (Transformers.js) | bge-small-en-v1.5 (Transformers.js) | nomic-embed-text (Ollama) |
| --- | --- | --- | --- |
| Context window | 256 tokens | 512 tokens | 2048 tokens (via Ollama GGUF) |
| MTEB score | ~56 | 62.28 | 62.39 |
| Embedding dimensions | 384 | 384 | 768 |
| Model size | ~22 MB | ~34 MB | 274 MB |
| Requires external install | No | No | Ollama daemon |

The biggest gap across these models is context length. all-MiniLM-L6-v2 silently truncates past 256 tokens with no warning; that's barely two short paragraphs. A typical Joplin research or meeting note runs 400 to 800 words, so MiniLM throws away most of the note. bge-small-en-v1.5 improves this to 512 tokens but still struggles with anything longer. nomic-embed-text via Ollama handles up to 2048 tokens, covering most real notes without needing chunking at all.

On quality, the MTEB gap between MiniLM (~56) and the other two (~62) is meaningful for semantic similarity tasks. BGE-small and Nomic are essentially tied on overall MTEB, so that is not really a differentiator between those two.

The real advantage Nomic has over both is its task prefix system. It ships with four dedicated prefixes: `clustering:` for grouping texts into clusters and discovering topics, `search_document:` for indexing, `search_query:` for queries, and `classification:` for classifiers. The `clustering:` prefix is designed for exactly what this plugin does. Neither MiniLM nor BGE-small has anything like this; bge-small-en-v1.5 was actually redesigned specifically to work well without instructions, so there is no equivalent task-type system on that side. Nomic also supports Matryoshka dimensions (768 down to 64), which lets the plugin offer a compact storage mode at minimal quality cost. MiniLM and BGE do not have this.

So the hierarchy is: MiniLM is fast but practically too limited for real notes, BGE-small is a decent Transformers.js option if you need zero setup, and Nomic is clearly the best when Ollama is available, with better context, matched quality, a unique clustering prefix, and flexible dimensions.
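
To make the prefix point concrete, using nomic-embed-text for this plugin would just mean prepending the documented task prefix before embedding, for example:

```typescript
// nomic-embed-text ships task prefixes; "clustering: " is the documented prefix
// for grouping texts into topics, which is exactly this plugin's use case
function toClusteringInput(noteText: string): string {
	return 'clustering: ' + noteText;
}

// e.g. embedWithOllama(toClusteringInput(noteText)), reusing the Ollama sketch from the earlier post
```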

2. Inference Speed Comparison (1000 Notes)

| Provider | Model | Environment | 1000-note estimate |
| --- | --- | --- | --- |
| Transformers.js (WASM) | all-MiniLM-L6-v2 | Joplin plugin | ~5–7 min |
| Transformers.js (WASM) | bge-small-en-v1.5 | Joplin plugin | ~10–15 min |
| Ollama (native) | nomic-embed-text | CPU (llama.cpp AVX2) | ~3–4 min |

Ollama is powerful and can be significantly faster, but its performance depends entirely on the user’s CPU or GPU, so on lower-end or non-optimized systems it can end up performing similarly to or even slower than browser-based solutions like Transformers.js.

There are multiple reports of slow embedding performance when using Ollama, especially on CPU-bound setups. For example, in this GitHub issue (Ingestion of documents with Ollama is incredibly slow · Issue #1691 · zylon-ai/private-gpt · GitHub), users report extremely slow processing times, with even small documents taking a very long time despite high-end hardware. Another issue shows embedding generation taking several seconds per batch even on RTX 4090 systems. This reinforces that Ollama performance is highly dependent on CPU, GPU and memory configuration, and users without well-optimized systems may see slower or inconsistent performance compared to expectations.

Given this, one possible approach is to use Transformers.js as the primary backend to ensure a consistent, zero-setup experience across all users, and keep Ollama as an optional or advanced backend for users with capable hardware. Alternatively, Ollama could be removed entirely to simplify the system and avoid variability in performance.
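
If Transformers.js does become the primary backend, the embedding path would look roughly like this (package and model ids as published by the Xenova project on Hugging Face; whether this loads cleanly under Joplin's CSP is exactly what still needs verifying):

```typescript
import { pipeline } from '@xenova/transformers';

// Load the bge-small-en-v1.5 feature-extraction pipeline once and reuse it for every note
const extractorPromise = pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');

async function embedWithTransformersJs(text: string): Promise<number[]> {
	const extractor = await extractorPromise;
	// Mean-pool the token embeddings and L2-normalize so cosine similarity is meaningful
	const output = await extractor(text, { pooling: 'mean', normalize: true });
	return Array.from(output.data as Float32Array);
}
```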

1 Like

Having thought through the user experience, I'm suggesting Transformers.js with bge-small-en-v1.5 as the single local setup, because it provides a consistent, zero-setup and reliable experience across all users. What are your thoughts?

1 Like

@yugalkaushik Amazing! Thank you for looking into this further; your suggestion of using Transformers.js after these comparisons makes sense! But nomic-embed-text (Ollama) has other benefits, such as the larger context window and task instruction prefixes, which is not something to ignore. Maybe it would be best to implement both approaches and then compare the final graphs. But that is for during GSoC.

The submission deadline is coming soon, so make sure to have everything ready. Best of luck with submitting your proposal on the GSoC platform!! :slight_smile:

1 Like