Links:
project idea: gsoc/ideas.md at master · joplin/gsoc · GitHub (idea#2)
Github: IND-Anshuman (Anshuman Singh)
Introduction post: https://discourse.joplinapp.org/t/introducing-ind-anshuman/49145/3
AI Generated Note Graph
Introduction:
I am Anshuman Singh, a second-year B.Tech. undergraduate at IIIT Jabalpur, with a strong interest in generative-AI development and in building different types of chatbots. I have been on my open-source journey for almost a year, and along the way I have gained experience with advanced retrieval, knowledge graph construction, training ML models for data preprocessing, NER extraction, and more. I am proficient in Python and have basic knowledge of backend and frontend development.
My most recent project was Nsure AI, a GraphRAG chatbot involving high-fidelity knowledge graph construction on NetworkX and a powerful dual-reranked retrieval system. The project used LLM integrations at three stages, which proved so expensive that I was unable to deploy it for a long time. Since then I have been researching how to reduce reliance on API-key-based LLM integrations and build optimal, inexpensive solutions.
I am particularly interested in the Joplin project "AI generated note graph" because I can see the challenges within it. It could be solved with a simple knowledge-graph-generation step, as in the initial part of a GraphRAG pipeline, but that would require API-based LLM integrations for proper contextual understanding, which is quite expensive for users. Even an offline LLM would consume a large amount of disk space and increase RAM usage during computation, which is also costly. In this proposal I have therefore tried to balance cost and space consumption without sacrificing user experience.
Project Summary:
As users accumulate a large number of notes in Joplin, it becomes increasingly difficult to understand how ideas are related, which notes are important, and how topics are structured across notebooks.
Joplin already has graph-based plugins such as Graph and Link Graph UI, which visualize connections between notes. However, these plugins rely on explicit links, meaning that two notes discussing the same topic remain disconnected unless the user manually links them. There are also plugins like Semantically Similar Notes, which use embeddings to suggest related notes. While useful, these suggestions are presented as lists and do not provide a global view of how notes are structured.
Through this project I aim to build a knowledge graph with proper semantic and contextual understanding, organising the user's notes into a structure that lets them assess the importance of each note without reading and comparing them one by one. Throughout, I will minimise cost by using local, offline ML or LLM models, and minimise space by choosing the models that best balance compute cost against disk footprint.
What will be built:
In this project, I will build a Joplin plugin that:
- constructs a graph where each node represents a note
- identifies relationships between notes using semantic and structural signals
- groups notes into clusters representing topics
- highlights important notes using graph-based ranking
- provides an interactive visualization for exploration
With the help of these implementations users will be able to:
- quickly identify core ideas of notes.
- explore related notes without manual linking.
- understand how their knowledge is structured.
My main goal is not just to help visualise the notes, but to improve how users navigate and think about them.
Design Philosophy
The project is designed by considering the following parameters:
- Offline-first : The system should work without requiring external APIs
- Structured, not heuristic-heavy : Relationships should be based on measurable signals
- Scalable : The system should handle large notebooks efficiently
- Readable output : The graph should remain interpretable, not cluttered
- JavaScript-Native & Electron-Friendly: By using WebAssembly (Wasm) for heavy ML computations and pure JS/TS modules, the pipeline stays compatible with Joplin's ecosystem.
Why Plugin (and not core feature)
This feature is implemented as a plugin because it is not useful for all Joplin users, only for those who have accumulated a large number of notes and struggle to manage them. Shipping it as a core feature would add unnecessary weight for everyone else. It also pulls in AI/ML libraries, so a plugin is the better fit: users who do not want these AI features remain unaffected.
System Architecture:
I will divide this project into 5 steps:
1. Data Preprocessing and Embeddings
Before any analysis, note content is cleaned and normalized.
This includes:
- removing markdown artifacts
- normalizing text
- optionally splitting long notes
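A rough sketch of what this preprocessing could look like; the function names and regexes below are illustrative, not a full Markdown parser:

```typescript
// Hypothetical preprocessing helpers; the regexes are a coarse sketch
// of Markdown cleanup, not an exhaustive parser.
function stripMarkdown(text: string): string {
  return text
    .replace(/```[\s\S]*?```/g, " ")          // fenced code blocks
    .replace(/`[^`]*`/g, " ")                 // inline code
    .replace(/!\[([^\]]*)\]\([^)]*\)/g, "$1") // images -> alt text
    .replace(/\[([^\]]*)\]\([^)]*\)/g, "$1")  // links -> link text
    .replace(/^#{1,6}\s+/gm, "")              // heading markers
    .replace(/[*_~>]+/g, " ")                 // emphasis / quote markers
    .replace(/\s+/g, " ")                     // collapse whitespace
    .trim();
}

// Optionally split long notes into roughly fixed-size chunks,
// breaking on sentence boundaries so embeddings stay coherent.
function chunkText(text: string, maxChars = 1000): string[] {
  const sentences = text.split(/(?<=[.!?])\s+/);
  const chunks: string[] = [];
  let current = "";
  for (const s of sentences) {
    if (current.length + s.length > maxChars && current) {
      chunks.push(current.trim());
      current = "";
    }
    current += s + " ";
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```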
Each note is then represented using:
- Embedding vector → captures semantic meaning
- Keyword set → captures important terms
Embedding model: I will use Transformers.js to run the quantized Xenova/all-MiniLM-L6-v2 model directly via WebAssembly. This downloads a small (<30 MB) model into the plugin's local directory and executes inference at near-native speed, balancing accuracy against local desktop performance.
Keyword extraction model: Instead of loading a second, RAM-heavy ML model like KeyBERT, I will integrate a lightweight JavaScript NLP library such as wink-nlp, which can extract important noun phrases and core terminology instantly with virtually zero memory overhead.
Combined Representation: Each note is finally represented as:
{ embedding: Float32Array(384), keywords: ["term1", "term2", ...] }
However, embedding-based similarity may not work well for very short notes (e.g., TODOs or single-line ideas). These may produce weak or ambiguous vectors, leading to poor or missing connections in the graph. I am still working out a solution to this problem.
There is another problem: images in notes are not considered by the embedding model (only their labels are). To tackle this, I propose a lightweight, Xenova-optimized CLIP model (such as clip-vit-base-patch32) run through Transformers.js. The beauty of CLIP is that it maps both images and text into the same vector space, so a "Note on Architecture" (text) and a "System Diagram" (image) naturally cluster together because their vectors are mathematically similar. The drawback is size: CLIP significantly improves multimodal understanding but adds roughly 200 MB of model weight, so I will need to evaluate whether the overhead is justified for typical users.
2. Relationship Extraction:
After generating embeddings for each note, a straightforward approach would be to compute similarity between every pair of notes, but that does not scale: its complexity is O(n^2). Libraries like FAISS or HNSWlib are industry standards in Python environments, but they require native C++ bindings that frequently break during Electron's cross-platform builds (especially between Intel and Apple Silicon). To ensure reliability for all Joplin users, I will instead implement the index with Vectra, a lightweight, pure-TypeScript vector store.
Vectra allows us to perform Approximate Nearest Neighbor (ANN) searches natively within the plugin's sandbox. While it sacrifices the extreme microsecond speeds of C++ HNSW, it provides a stable, zero-dependency environment that can query a 5,000-note index in under 200 ms, well within the acceptable range for a background process. One concern is that Vectra may slow down for very large notebooks (>10k notes), which I will need to evaluate, preparing a fallback if necessary.
The next important decision is the choice of K (the number of neighbours). A high K produces "hairball" visualizations that are difficult to navigate, while a low K might miss subtle thematic bridges; K is a trade-off between completeness (more connections) and readability (less clutter). I will most likely test several K values to establish a baseline, and apply dynamic thresholding above it.
Edge scoring: Relationships are computed using a weighted combination of signals. Instead of fixing weights upfront, I plan to start with semantic similarity as the primary signal and then incrementally add keyword overlap and explicit links. The exact weights will likely need tuning on real note datasets.
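The weighted combination could be sketched as follows; the 0.6/0.3/0.1 weights are placeholders to be tuned on real notes, not final values:

```typescript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}

// Jaccard overlap between the two notes' keyword sets.
function keywordOverlap(a: string[], b: string[]): number {
  const setB = new Set(b);
  const inter = [...new Set(a)].filter((k) => setB.has(k)).length;
  const union = new Set([...a, ...b]).size;
  return union === 0 ? 0 : inter / union;
}

// Illustrative combined edge score; weights are placeholders.
function edgeScore(
  embA: number[], embB: number[],
  kwA: string[], kwB: string[],
  explicitlyLinked: boolean,
): number {
  return (
    0.6 * cosineSimilarity(embA, embB) +
    0.3 * keywordOverlap(kwA, kwB) +
    0.1 * (explicitlyLinked ? 1 : 0)
  );
}
```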
Optional Power-User Integration: Recognizing that many Joplin users already host their own local LLMs, the plugin will include an optional configuration panel where power users can point the embedding engine to a local Ollama API endpoint (e.g., http://127.0.0.1:11434), allowing them to utilize heavier, higher-dimensional models if they choose.
3. Graph Construction:
Without filtering, the graph becomes dense and unreadable. Even with ANN-based candidate selection there will be many moderately strong connections; left in place, they produce a "hairball graph" in which clear clusters do not exist and important nodes are indistinguishable. I saw this happen in my earlier project even with a small amount of data, so at the scale of real Joplin notebooks the graph structure is very likely to be distorted.
Based on that experience, I have settled on two strategies to tackle this problem:
1. Threshold filtering (Quality Control)
A threshold is applied to the edge score from the second stage, which represents the strength of the relationship; edges below it are removed.
2. Top-K filtering
This step complements the threshold: it restricts the total number of edges a node can have, so each node keeps only its strongest connections.
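A minimal sketch of both pruning passes together; the 0.55 threshold and K=3 defaults are illustrative, not final values:

```typescript
interface Edge { source: string; target: string; weight: number; }

// Sketch of the two pruning strategies: a global weight threshold,
// then per-node top-K (an edge survives if it ranks among the K
// strongest edges of at least one endpoint).
function pruneEdges(edges: Edge[], threshold = 0.55, topK = 3): Edge[] {
  // 1. Threshold filtering: drop weak relationships outright.
  const strong = edges.filter((e) => e.weight >= threshold);

  // 2. Top-K filtering: group edges by endpoint, keep each node's strongest.
  const byNode = new Map<string, Edge[]>();
  for (const e of strong) {
    for (const id of [e.source, e.target]) {
      if (!byNode.has(id)) byNode.set(id, []);
      byNode.get(id)!.push(e);
    }
  }
  const kept = new Set<Edge>();
  for (const list of byNode.values()) {
    list.sort((a, b) => b.weight - a.weight);
    list.slice(0, topK).forEach((e) => kept.add(e));
  }
  return strong.filter((e) => kept.has(e));
}
```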
The pipeline now has a refined set of high-quality relationships, so the next step is to transform raw pairwise relationships into a structured, queryable graph model. The graph is intentionally kept sparse, which is essential for both visualization and analysis.
After filtering, the graph is built with:
Node Structure: Each node represents a note and contains:
- id (noteId)
- title
- contentPreview
- embedding (optional, cached)
- keywords
- clusterId (assigned later)
- importanceScore (from PageRank)
- summary (made in the summarisation stage)
Edge Structure: Each edge represents a relationship between two notes:
- source (noteId)
- target (noteId)
- weight (relationship score)
- type (semantic / keyword / explicit / LLM-derived)
- confidence (normalized score)
- explanation (optional, from LLM)
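Transcribed as TypeScript types, these structures look like the sketch below; the field names mirror the lists above and may evolve during mentoring:

```typescript
// Sketch of the node/edge shapes listed above; optional fields are
// filled by later pipeline stages (clustering, PageRank, summarisation).
interface GraphNode {
  id: string;                 // noteId
  title: string;
  contentPreview: string;
  embedding?: Float32Array;   // optional, cached
  keywords: string[];
  clusterId?: number;         // assigned by community detection
  importanceScore?: number;   // from PageRank
  summary?: string;           // from the summarisation stage
}

interface GraphEdge {
  source: string;             // noteId
  target: string;             // noteId
  weight: number;             // relationship score
  type: "semantic" | "keyword" | "explicit" | "llm";
  confidence: number;         // normalized 0..1
  explanation?: string;       // optional, LLM-derived
}
```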
More fields may be added later, depending on the final pipeline structure agreed with the mentors.
Graph Storage: Two representations are maintained for the graph:
- In-Memory: used for graph visualisation; optimised for fast traversal.
- Persistent Storage: used for caching; survives application restarts.
4. Community Detection, Importance Ranking and Summarisation:
Once the graph is constructed, the next objective is to identify groups of closely related notes and automatically categorize them into distinct topic branches (e.g., "UX", "Programming", "Marketing").
- Clustering Engine: I will use the graphology library, specifically graphology-communities-louvain, to perform fast, weighted community detection directly in the TypeScript backend. This algorithm operates on edge weights to assign a unique cluster ID to every node. While Louvain is the primary choice due to its mature TypeScript implementation, I am also exploring a custom refinement phase similar to the Leiden algorithm, to ensure communities are internally well-connected and to avoid the "resolution limit" common in basic modularity optimization.
- Semantic Categorization (Zero-LLM Labeling): Running a 6 GB local LLM purely to name clusters is an unacceptable RAM burden for a background plugin. Instead, the plugin will label branches programmatically: once clusters are formed, the system identifies the highest-ranking "hub" node in each cluster (calculated via PageRank), then extracts the top 2-3 noun phrases from that hub node using wink-nlp, effectively turning the core note's subject into the label for the entire branch. For users who already have Ollama set up in their environment, I will also provide the option to choose a local LLM for cluster labelling.
The noun-phrase based cluster labeling approach may not always produce meaningful labels, especially for abstract or mixed-topic notes. This could require fallback strategies or optional user input.
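A hedged sketch of the hub-based labeling step; the names and the "Unlabeled" fallback are hypothetical choices, and the real version would draw keywords from wink-nlp:

```typescript
// Hypothetical labeling step: pick the highest-ranked note in a
// cluster and reuse its top keywords as the cluster label.
interface RankedNote { id: string; importanceScore: number; keywords: string[]; }

function labelCluster(members: RankedNote[], maxTerms = 3): string {
  if (members.length === 0) return "Unlabeled";
  // Hub node = member with the highest PageRank-style score.
  const hub = members.reduce((a, b) =>
    b.importanceScore > a.importanceScore ? b : a);
  return hub.keywords.slice(0, maxTerms).join(" / ") || "Unlabeled";
}
```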
After grouping notes into clusters, I also want to capture the relative importance of notes so they can be distinguished within the graph. Each node will therefore be assigned an importance score, later used for:
- visual emphasis (node size)
- prioritization
- better navigation
The PageRank algorithm will be used to calculate the importance scores, via the graphology-metrics/centrality/pagerank package.
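The plugin itself would call the graphology implementation; purely to illustrate what that computes, here is a minimal weighted PageRank by power iteration (dangling-node handling omitted, and every edge target is assumed to be a key in the map):

```typescript
// Minimal weighted PageRank sketch, for illustration only; the plugin
// would use graphology-metrics rather than this hand-rolled version.
type AdjList = Map<string, Array<{ target: string; weight: number }>>;

function pageRank(graph: AdjList, damping = 0.85, iterations = 50): Map<string, number> {
  const nodes = [...graph.keys()];
  const n = nodes.length;
  let rank = new Map<string, number>(nodes.map((id): [string, number] => [id, 1 / n]));
  for (let it = 0; it < iterations; it++) {
    // Base teleportation mass for every node.
    const next = new Map<string, number>(
      nodes.map((id): [string, number] => [id, (1 - damping) / n]));
    for (const [id, edges] of graph) {
      const total = edges.reduce((s, e) => s + e.weight, 0);
      if (total === 0) continue;
      // Distribute this node's rank along its weighted out-edges.
      for (const e of edges) {
        next.set(e.target,
          next.get(e.target)! + damping * rank.get(id)! * (e.weight / total));
      }
    }
    rank = next;
  }
  return rank;
}
```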
Summarisation Strategy: Instead of implementing summarization entirely from scratch, I plan to reuse the existing Joplin plugin "Summarise your notes and notebooks!" by HahaBill (GSoC 2024).
I initially considered using a local LLM for this step, but this plugin already offers several good summarisation algorithms (TextRank, LexRank, LSA, etc.) and an offline LLM mode that should give results comparable to an API-based LLM. I can also fall back to a plain TextRank implementation if the user does not want to use the summarisation plugin's API.
The summaries serve three purposes:
- Node interpretation: Each node displays a short summary to help users quickly understand its content. Without this, nodes are just titles that offer no semantic insight.
- Cluster understanding: Summaries of individual notes can be aggregated to generate cluster-level descriptions, helping users understand what a group of notes represents.
- Graph interaction: Summaries can be used in hover tooltips, side panels, and other UI features.
5. Graph Visualisation:
The visualization panel is where this whole project actually comes alive for the user. Nothing kills a graph plugin faster than a laggy, stuttering UI when loading a massive notebook. Because of that, I'm skipping standard Canvas-based libraries like vis-network, which start to lag once the node count approaches 5,000.
The Rendering Engine: I plan to use Sigma.js for the frontend. Since it's WebGL-based, it offloads the rendering straight to the GPU. This means even if a power user opens a notebook with thousands of notes and visible connections, they still get a buttery smooth 60 FPS while zooming and panning around. As a bonus, Sigma.js is built specifically to digest graphology data. This means passing our backend math (clusters, PageRank scores) directly to the frontend is completely seamless without needing to write heavy data-conversion scripts.
Making the Graph Readable (Avoiding the "Hairball"): Dumping a thousand nodes on a screen usually just creates an unreadable, tangled mess. To make the graph actually useful at a glance, I’m prioritizing a few visual cues and controls:
- Visual Hierarchy: The setup will be very simple. Node sizes are scaled by their PageRank score (so central "hub" notes stand out instantly), and colours map directly to their Louvain community (so users can visually separate their "Programming" notes from their "Graphics Design" notes).
- The Density Slider: This is probably the most important UI control. I want to give users a slider that adjusts the similarity threshold in real-time. If the graph looks too cluttered, they can just slide it up to peel back the weaker semantic links and only see the absolute strongest connections.
- Focus & Integration: Double-clicking a specific topic branch will fade out the rest of the graph, letting the user isolate a project. And naturally, clicking any node will fire a joplin.commands.execute('openNote') call, instantly opening that note in the main editor so they can start writing.
By default, the graph will use the Force-Atlas2 layout. It organically pushes unrelated clusters apart while pulling related notes together, creating clear territorial boundaries for different topics. Because this algorithm is computationally heavy, I’ll execute it inside a Web Worker so the Joplin window never freezes while the nodes are settling into place.
I'll also include a secondary Hierarchical (tree-like) layout for users whose notebooks rely heavily on strict sub-notebook structures, making it easier to see "Parent-Child" dependencies.
Limitations: Although the pipeline is designed to be lightweight, performance may vary depending on the user’s hardware, especially for large notebooks or during visualization. Even with GPU-based rendering, extremely large graphs may require additional techniques such as clustering or node collapsing to remain usable.
Integration with Joplin:
I’m structuring the system as a native Joplin plugin, utilizing the Joplin Plugin API to bridge the gap between markdown data and the WebGL visualization layer. The architecture focuses on "Local-First" performance, ensuring that background analysis never compromises the note-taking experience.
System Architecture: The data flow follows a decoupled, three-tier structure to maximize efficiency and maintainability:
- Data Layer (Joplin API & SQLite): High-speed fetching of note content and metadata via the joplin.data API. Embeddings, pre-computed similarity scores, and cluster labels are persisted in a local SQLite cache to prevent redundant computations.
- Analysis Engine (Node.js Backend): A non-blocking pipeline executing in the plugin's background process. It manages the lifecycle of Transformers.js (inference), Vectra (indexing), and Graphology (mathematical analysis).
- Presentation Layer (Sigma.js Webview): A dedicated panel that receives the processed graph data and renders the interactive map using GPU acceleration.
Execution & Event-Driven Synchronization: To ensure the graph is a "living" document, the plugin implements an event-driven synchronization model. Rather than forcing a full re-index of the notebook, the system uses incremental updates:
- Reactive Listeners: The plugin subscribes to onNoteChange and onSyncComplete events.
- Atomic Updates:
  - New/Updated Notes: Only the affected note is re-embedded. Its nearest neighbours are retrieved from the local index, and edges are updated dynamically without rebuilding the entire graph.
  - Deletions: The corresponding node and its incident edges are purged from the Graphology instance and the WebGL renderer instantly.
While incremental updates improve performance, they may introduce slight inconsistencies in global structures (e.g., cluster boundaries or PageRank scores) until a full recomputation is triggered.
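The incremental path for a single changed note might look like the sketch below; the embed and queryNeighbours callbacks are hypothetical stand-ins for Transformers.js inference and the Vectra index (both asynchronous in practice, kept synchronous here for brevity):

```typescript
// Minimal interface over whatever graph store is used (graphology in
// the real plugin); names here are illustrative.
interface GraphLike {
  removeEdgesOf(noteId: string): void;
  addEdge(source: string, target: string, weight: number): void;
}

// Sketch of the atomic-update path for one changed note: re-embed it,
// drop its stale edges, and rewire only its local neighbourhood.
function onNoteChanged(
  noteId: string,
  text: string,
  embed: (t: string) => number[],                                          // stand-in for Transformers.js
  queryNeighbours: (v: number[], k: number) => Array<{ id: string; score: number }>, // stand-in for Vectra
  graph: GraphLike,
  k = 5,
  threshold = 0.55,
): void {
  const vector = embed(text);      // re-embed only this note
  graph.removeEdgesOf(noteId);     // drop stale edges
  for (const n of queryNeighbours(vector, k)) {
    if (n.id !== noteId && n.score >= threshold) {
      graph.addEdge(noteId, n.id, n.score); // rewire locally
    }
  }
}
```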
Background Processing & Thread Management: To prevent the Joplin UI from "stuttering" during heavy graph math:
- Web Workers: Computationally expensive tasks—specifically Louvain Community Detection and the Force-Atlas2 layout—are offloaded to Web Workers.
- Periodic Global Recomputes: While node-level changes are instant, global metrics (like PageRank centrality and notebook-wide clustering) are recomputed during idle periods or triggered manually by the user to reflect broader structural shifts.
Intelligence & Summarization Fallback: The plugin prioritizes ecosystem synergy by attempting to hook into the "Summarize your notes and notebooks" plugin API.
- Primary Path: If the summarization plugin is active, its summaries are fetched and cached as node metadata for the "on-hover" tooltips.
- Native Fallback: In the absence of external plugins, the system utilizes a lightweight extractive summarizer (LexRank-based) written in TypeScript. This ensures every node has a concise preview regardless of the user's plugin configuration.
Final project structure:
Conclusion of the proposal:
My main goal with this architecture is to prove that we don't need expensive, power-hungry LLMs to build a high-quality note graph. By shifting the heavy lifting to a specialized TypeScript-native pipeline, the plugin stays fast, private, and completely free for the user. Even with the local embedding models and the vector index, the entire backend footprint stays well under 1GB. This makes the plugin accessible to everyone, regardless of whether they have a high-end GPU or a limited data plan. I've focused on making the graph structure robust enough for power users while keeping the setup "plug-and-play" for beginners.
I’m really excited about the potential of this approach, but I’m also totally open to pivoting. If there’s a specific part of the design or a different library you think would fit Joplin’s ecosystem better, I’m more than happy to jump in and adjust the plan. I'm looking forward to hearing your thoughts and refining this further!
Implementation Plan
Week 1–2: Plugin setup and note extraction
Week 3–4: Preprocessing and Embeddings
Week 5–6: Relationship Extraction
Week 7–8: Graph Construction
Week 9–10: Community Detection, Importance Ranking and Summarisation
Week 11–12: Graph Visualization
Week 13: UI improvements and documentation
Deliverables
- Joplin plugin (.jpl)
- semantic graph generation system
- clustering and ranking implementation
- interactive visualization panel
- node summarization
- documentation and test coverage
Availability
I will be available for approximately 30–35 hours per week during the GSoC period (IST timezone). I do not have any conflicting commitments and will be able to consistently dedicate time to development and mentor communication.


