GSoC 2026 Proposal Draft - Idea 3: AI-based categorisation

Project Name: AI-based Categorisation

GitHub Profile: angeladev333

Introduction post: here

1. Introduction

I am a 4th-year CS student at the University of Waterloo with experience in TypeScript, React, and Python. During a previous internship at Bloomberg, I trained a decision-tree classifier on real client data to match trade pairs for reconciliation, and I have helped others with their RAG-related projects.

This project seeks to automate the "administrative" side of note-taking by using local AI to suggest tags, organize notebook structures, and identify "cold" notes for archiving.

2. Project Summary

This project addresses a fundamental problem: organizing notes at scale. As users use Joplin for longer periods of time, the workspace becomes larger, more notebooks are created, and it becomes harder to find notes on a relevant topic.

The project will be a plugin that analyzes note content to provide three core organizational services:

  1. Smart Tagging: Automatically suggests and applies tags based on existing user patterns.

  2. Notebook Auto-Filing: Detects when a note semantically belongs in a different notebook and suggests a move.

  3. Archive Discovery: Identifies notes that haven't been touched or viewed in a long time and suggests moving them to an "Archive" stack to reduce clutter.

Expected Outcome:

  • Local-First Engine: A categorization system using transformers.js (WASM) to keep all data private.

  • Review UI: A dedicated Joplin Panel where users can "Approve" or "Reject" bulk organizational suggestions.

  • Custom Rules: Ability for users to "teach" the AI by giving it examples of how they prefer to categorize.

3. Technical Approach

3.1 Architecture: The "Librarian" Service

I will implement a background service that maintains a local semantic index of the user's notebooks.

  • Detection: Hook into joplin.workspace.onNoteChange and joplin.workspace.onSyncComplete.

  • Inference: Use transformers.js with the Xenova/all-MiniLM-L6-v2 model (~23MB) to generate embeddings for each note.

  • Classification:

    • For Tagging: Use K-Nearest Neighbors (KNN) to find notes with similar content and suggest their tags.

    • For Notebooks: Use a Centroid-based classifier where each notebook is represented by the average vector of its contained notes.
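As a sketch of the notebook side of this, the centroid classification could look like the following (helper names are illustrative, not Joplin API; plain `number[]` vectors stand in for the MiniLM embeddings produced by transformers.js):

```typescript
// Sketch of the centroid-based notebook classifier described above.

function centroid(vectors: number[][]): number[] {
  // Average embedding of all notes currently in a notebook.
  const dim = vectors[0].length;
  const mean = new Array(dim).fill(0);
  for (const v of vectors) {
    for (let i = 0; i < dim; i++) mean[i] += v[i] / vectors.length;
  }
  return mean;
}

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Suggest the notebook whose centroid is most similar to the note.
function suggestNotebook(
  noteEmbedding: number[],
  centroids: Map<string, number[]>,
): { notebookId: string; score: number } {
  let best = { notebookId: '', score: -Infinity };
  for (const [notebookId, c] of centroids) {
    const score = cosineSimilarity(noteEmbedding, c);
    if (score > best.score) best = { notebookId, score };
  }
  return best;
}
```

Centroids would be cached and updated incrementally as notes change, rather than recomputed from scratch on every event.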

3.2 Archive Discovery ("Cold Note" Detection)

Since Joplin does not natively track "last viewed" time, I will implement a lightweight tracking mechanism:

  • Activity Logging: Use the joplin.workspace.onNoteSelectionChange event to record a last_viewed_time in the note’s userData.

  • Archiving Logic: A weekly background task will query notes where (current_time - last_viewed_time) > User_Defined_Threshold and updated_time is also old. These will be surfaced in the "Archive Suggestions" UI.
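A minimal sketch of the "cold note" predicate the weekly task would apply, assuming millisecond-epoch timestamps and a threshold taken from plugin settings (field and function names are illustrative):

```typescript
// Illustrative "cold note" predicate for the weekly scan. updatedTime
// mirrors Joplin's updated_time property; lastViewedTime is the value
// the plugin tracks in userData.

interface NoteActivity {
  lastViewedTime: number; // ms epoch, recorded on onNoteSelectionChange
  updatedTime: number;    // ms epoch, Joplin's updated_time
}

function isCold(note: NoteActivity, now: number, thresholdMs: number): boolean {
  // Cold only if the note was neither viewed nor edited within the window.
  return (now - note.lastViewedTime) > thresholdMs
      && (now - note.updatedTime) > thresholdMs;
}
```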

3.3 The "Review & Apply" Workflow (React UI)

To avoid "AI anxiety," the plugin will never move or tag notes without permission.

  • The Panel: A React-based sidebar created via joplin.views.panels.create().

  • Batch Actions: Users can "Select All" suggestions (e.g., "Tag 12 notes as #Research") and apply them in one click via the joplin.data API.

4. Implementation Plan

  • Weeks 1-2: Setup transformers.js in the plugin sandbox. Implement the last_viewed_time tracker using userData.

  • Weeks 3-5: Develop the "Note-to-Notebook" similarity engine. Benchmark performance for users with 2,000+ notes to ensure no UI lag.

  • Weeks 6-8: Build the React Sidebar Panel. Implement the "Suggestion" logic and the "Accept/Reject" state management.

  • Weeks 9-11: Add "Auto-Archive" discovery. Refine the UI for bulk-applying changes.

  • Week 12: Final testing on mobile/desktop sync compatibility and documentation.

5. Deliverables

  • Joplin AI Plugin: The core .jpl package.

  • Semantic Model Integration: Optimized local inference pipeline.

  • Archive Dashboard: A UI tool for note lifecycle management.

  • Technical Documentation: Guidelines for extending the AI to support PARA or Johnny Decimal organizational methods.

6. Availability

  • Weekly availability: ~40 hours per week during GSoC
  • Time zone: EST

Thanks for the draft proposal, I think it makes sense. Question: how will you map the output from the LLM to actual API actions?

Also, if you haven't already done so please create a few pull requests as we need this when we review the proposals.

@angeladev333 Thank you for your proposal, it seems that this is going in the right direction! I think the proposal needs more clarity and could dive a bit deeper into the topic. Laurent made a good point about how you map your outputs to the actual API actions.

I’d suggest creating a nice flow diagram, it’ll also help you to see potential flaws or improvements. This contributor made very good diagrams: GSoC 2026 Proposal Draft – Idea 2: AI-Generated note graphs – yugalkaushik

I have a few questions:

  1. Can you explain more about the centroid-based classifier for notebooks?
  2. How are you planning to use KNN vs. the centroid-based approach?
  3. How do you semantically identify/categorize clusters? The proposal doesn’t seem to explain that
  4. I don’t think you should restrict yourself by ruling out LLMs or an agentic design - this could be optional for users that have API keys to frontier models.

Last thing, don’t forget to create a couple of PRs. As Laurent mentioned, this is something important.

Hi, I wanted to confirm my understanding of the AI-based note categorisation feature before moving forward.

From what I understand, this feature has two main parts:

  1. Creating categories: In Joplin, a category is essentially a notebook, which is stored as a folder in the database (folders table). So when the AI suggests a category like "Work" or "Cooking", we are creating a new row in the folders table and then updating the parent_id of the relevant notes to point to that folder?

  2. Tagging notes: This works across two tables. First we insert the tag name into the tags table, then we create a connection between each note and its tags by inserting rows into the note_tags table?

Hi @laurent and @HahaBill , thank you so much for your feedback! For the mapping of LLM outputs to API actions, I plan to implement the Command Dispatcher pattern. The LLM will be prompted to return a strictly validated JSON schema. For example, the output would look like:
{ "action": "moveNote", "params": { "noteId": "...", "parentId": "..." } }

The plugin will then pass these validated params to the existing joplin.data post methods.
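A sketch of how that dispatch could look (the action names and parameter shapes below are illustrative assumptions; the point is that only whitelisted, fully validated commands ever reach the API):

```typescript
// Command Dispatcher sketch: the LLM's JSON output is parsed and checked
// against a whitelist of actions before anything is passed to joplin.data.

type Command =
  | { action: 'moveNote'; params: { noteId: string; parentId: string } }
  | { action: 'tagNote'; params: { noteId: string; tagId: string } };

function parseCommand(raw: string): Command | null {
  let obj: any;
  try { obj = JSON.parse(raw); } catch { return null; }
  if (obj?.action === 'moveNote'
      && typeof obj.params?.noteId === 'string'
      && typeof obj.params?.parentId === 'string') {
    return { action: 'moveNote', params: { noteId: obj.params.noteId, parentId: obj.params.parentId } };
  }
  if (obj?.action === 'tagNote'
      && typeof obj.params?.noteId === 'string'
      && typeof obj.params?.tagId === 'string') {
    return { action: 'tagNote', params: { noteId: obj.params.noteId, tagId: obj.params.tagId } };
  }
  return null; // unknown or malformed commands are dropped, never executed
}
```

A validated moveNote would then translate to a PUT on notes/:id updating parent_id; anything that fails validation never touches the data layer.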

As for PRs, I expect to have these submitted within the next few days to support my proposal. Sorry for the delay!


Hey Zain, thank you for your careful analysis. That sounds exactly right to me! Moving a note would update the parent_id field through a PUT request to notes/:id. Also, tags are in a many-to-many relationship with notes, with the note_tags table acting as the join table. The plugin will ideally check if a tag already exists in the tags table before creating a new one to avoid duplicates, then link it to the note.
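The duplicate-avoiding tag flow could be sketched as follows. The data-access layer is injected as a minimal interface so the logic is testable outside Joplin; in the plugin it would be backed by joplin.data (GET /tags, POST /tags, POST /tags/:id/notes), and pagination of the tag list is ignored here for brevity:

```typescript
// Sketch of find-or-create tagging. Minimal data interface mirroring the
// joplin.data call shapes; not the real Joplin API object.

interface DataApi {
  get(path: string[], query?: any): Promise<any>;
  post(path: string[], query: any, body: any): Promise<any>;
}

async function applyTag(data: DataApi, noteId: string, tagTitle: string): Promise<string> {
  // Joplin stores tag titles lowercased; normalise before comparing.
  const title = tagTitle.toLowerCase();
  const page = await data.get(['tags'], { fields: ['id', 'title'] });
  let tag = page.items.find((t: any) => t.title === title);
  if (!tag) {
    tag = await data.post(['tags'], null, { title }); // create only if missing
  }
  await data.post(['tags', tag.id, 'notes'], null, { id: noteId }); // link note
  return tag.id;
}
```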

Thank you for your understanding of the database schema. I’ll add this consideration to my proposal!


Of course, thank you again for your guidance and involvement in my proposal! I will update my main post with all my responses to your suggestions :slight_smile:

  1. I treat each Notebook as a cluster in a vector space. The centroid is the average embedding vector of all notes currently in that notebook. When a new note is created, we calculate its embedding and compare it against each notebook's centroid using cosine similarity. The notebook with the highest similarity score becomes the "suggested" home.

  2. I plan to use centroids for broad notebook categorization (stable categories) and KNN for tagging. This is under the assumption that a note might share specific technical terms with only 3 or 4 other notes for tagging rather than a larger notebook average.

  3. The first categorization pass goes through centroids; a prompt is then used specifically to label new discoveries that the centroid-based logic couldn't handle, such as a completely new note. If the cosine similarity of this new note is above a certain threshold (e.g. 0.85), it is a clear match and we can return directly. If the similarity is low (e.g. < 0.6), it is classified as an outlier and moved to an "Uncategorized" group for temporary storage. Once there are enough outliers, we can run local clustering to see if they form a new distinct group.
    A prompt is also used to generate a human-readable name for new clusters, which the LLM will return in a strict structured format.

  4. I agree, yeah - I'm not going to restrict this project, and I'm open to any further suggestions! It could definitely be an option for users that already have API keys, letting them opt in to OpenAI/Gemini for higher-reasoning workflows.
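The threshold routing from point 3 could be sketched like this (0.85/0.6 are the example values above; treating the mid-band between them as a user-review suggestion is my assumption):

```typescript
// Threshold routing sketch for a freshly embedded note.

type Routing =
  | { kind: 'match'; notebookId: string }   // high confidence: suggest directly
  | { kind: 'review'; notebookId: string }  // mid-band: surface for user review
  | { kind: 'outlier' };                    // low similarity: "Uncategorized"

function route(bestNotebookId: string, similarity: number,
               high = 0.85, low = 0.6): Routing {
  if (similarity >= high) return { kind: 'match', notebookId: bestNotebookId };
  if (similarity < low) return { kind: 'outlier' };
  return { kind: 'review', notebookId: bestNotebookId };
}
```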


Nice! I like that you’re planning to use structured outputs!!

Are you planning to use LangChain for creating your agents? If so, then I’d recommend for you to check whether it’s possible to run it in the Joplin plugin - just in case.

Great suggestion! Yes I’ll look into the implementation of structured outputs for this plugin.


I’ve made a new pull request for the CLI: to return to the root folder using 'use /'


Thank you everyone for your suggestions! I have an unmerged PR with all CodeRabbit review comments resolved. Would appreciate any feedback there as well!

In the meantime, I have updated my proposal here:

Project Abstract


This project addresses a fundamental problem: organizing notes at scale. As users use Joplin for longer periods of time, the workspace becomes larger, more notebooks are created, and it becomes harder to find notes on a relevant topic. The project will be a plugin that analyzes note content to provide three core organizational services:

  1. Smart Tagging: Automatically suggests and applies tags based on existing user patterns.

  2. Notebook Auto-Filing: Detects when a note semantically belongs in a different notebook and suggests a move, including dynamically grouping completely new topics.

  3. Archive Discovery: Identifies notes that haven't been touched or viewed in a long time and suggests moving them to an "Archive" stack to reduce clutter.

Expected Outcome


  1. Local-First Engine: A categorization system using transformers.js to keep all data private.

  2. API Opt-In for Higher Reasoning: Options for users with existing OpenAI or Gemini API keys to opt-in to cloud-based LLMs for advanced clustering and categorization workflows.

  3. Review UI: A dedicated Joplin Panel where users can "Approve" or "Reject" bulk organizational suggestions.

Architecture


Smart Tagging + Notebook Auto-Filing

I will implement a background service that maintains a local semantic index of the user's notebooks. Before finalizing the agent architecture, I will investigate the feasibility of running LangChain within the Joplin plugin sandbox to handle agent routing.

  • Detection: Hook into joplin.workspace.onNoteChange and joplin.workspace.onSyncComplete.

  • Inference: Use transformers.js with the Xenova/all-MiniLM-L6-v2 model (~23MB) to generate embeddings for each note.

  • Classification & Clustering:

    • For Tagging: Use K-Nearest Neighbors (KNN). This assumes a note might share specific technical terms with only 3 or 4 other notes, making KNN ideal for highly specific tag mapping rather than a broad average.

    • For Notebooks: Use a Centroid-based classifier where each notebook is treated as a cluster in a vector space. The centroid is the average embedding vector of all notes currently in that notebook.

    • Threshold Logic: When a new note is created, its cosine similarity is checked against each notebook's centroid.

      • If similarity is high (e.g. > 0.85), it is a clear match and suggested for that notebook.

      • If similarity is low (e.g. < 0.6), it is classified as an outlier and temporarily stored in an "Uncategorized" group.

    • Dynamic Generation: Once enough outliers accumulate, the plugin will perform local clustering to form new, distinct groups. An LLM prompt utilizing strict structured outputs will then be used to generate human-readable names for these newly discovered clusters.
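The KNN tagging side of this could be sketched as follows (helper names are hypothetical; the minimum-vote filter is my assumed guard against noisy single-neighbour tags):

```typescript
// KNN tag suggestion sketch: take the K most similar indexed notes and
// vote on their tags.

interface IndexedNote { id: string; embedding: number[]; tags: string[]; }

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function suggestTags(query: number[], index: IndexedNote[],
                     k = 4, minVotes = 2): string[] {
  // K nearest neighbours by cosine similarity.
  const neighbours = [...index]
    .sort((a, b) => cosineSim(query, b.embedding) - cosineSim(query, a.embedding))
    .slice(0, k);
  // Count tag occurrences among neighbours.
  const votes = new Map<string, number>();
  for (const n of neighbours)
    for (const t of n.tags) votes.set(t, (votes.get(t) ?? 0) + 1);
  // Only tags shared by at least minVotes neighbours are suggested.
  return [...votes].filter(([, v]) => v >= minVotes).map(([t]) => t);
}
```

For 2,000+ notes a linear scan per query is likely fine at 384 dimensions, but the benchmarking phase would confirm whether an approximate index is needed.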

Archive Discovery

Since Joplin does not natively track "last viewed" time, I will implement a lightweight tracking mechanism:

  • Activity Logging: Use the joplin.workspace.onNoteSelectionChange event to record a last_viewed_time in the note’s userData.

  • Archiving Logic: A weekly background task will query notes where (current_time - last_viewed_time) > User_Defined_Threshold and updated_time is also old. These will be surfaced in the "Archive Suggestions" UI.

Code Affected


As this project will be developed primarily as a Joplin plugin, the core Joplin application codebase will remain largely untouched. This ensures stability and allows the categorization engine to be maintained independently.

Development will be centralized within a new plugin directory/repository and will interact strictly with the exposed Joplin Plugin API surfaces:

  • joplin.workspace: Hooking into onNoteChange, onSyncComplete, and onNoteSelectionChange to trigger embedding generation and log user activity.

  • joplin.data: Executing batch actions (applying tags, moving notes to new notebooks, or moving them to an Archive) and retrieving note text for semantic analysis.

  • joplin.views.panels: Creating the React-based UI sidebar for the "Review & Apply" workflow.

  • Note userData: Utilizing this property to store last_viewed_time and vector embeddings directly on the note objects, avoiding the need for any core database schema modifications.
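Since userData values are persisted and synced, storing a 384-dimension embedding per note is worth compacting first. A minimal sketch, assuming base64-encoding of the Float32 buffer is acceptable (Node's Buffer is used for illustration; the actual read/write would go through the userData API):

```typescript
// Compact an embedding to a base64 string for userData storage, and back.

function encodeEmbedding(vec: number[]): string {
  return Buffer.from(new Float32Array(vec).buffer).toString('base64');
}

function decodeEmbedding(s: string): number[] {
  const buf = Buffer.from(s, 'base64');
  return Array.from(new Float32Array(buf.buffer, buf.byteOffset, buf.byteLength / 4));
}
```

Float32 halves the size versus a JSON array of doubles at no practical cost in similarity precision.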

New Dependencies & Integrations: The plugin package will introduce the following key libraries into its isolated environment:

  • transformers.js (WASM) for local, private embedding generation.

  • LangChain (pending sandbox compatibility verification) for agent routing and logic.

  • REST API or SDK integrations for OpenAI/Gemini to handle the optional cloud-based structured-output workflows (e.g. naming new clusters).

Core Codebase Contributions: While the GSoC project centers on the Plugin API, I am already actively engaged with the core Joplin repository. I recently authored a pull request for the Joplin CLI (packages/app-cli) to implement the use / command for root folder navigation, demonstrating my ability to navigate and safely modify the central codebase should the plugin require new API endpoints in the future.

Pre-proposal Work


I have successfully cloned the Joplin repository, followed the build instructions, and created a PR for the CLI app adding a use / command for navigating back to the root folder.

Schedule of Deliverables (timeline)


Weeks 1-2 (Setup)

  • Setup transformers.js in the plugin sandbox.

  • Test LangChain compatibility.

  • Implement the last_viewed_time tracker using userData.

Weeks 3-5 (Phase I: ML Performance)

  • Develop the Centroid and KNN similarity engines.

  • Implement the 0.85/0.6 threshold logic for outlier detection and the structured output LLM generation for naming new clusters. Test performance of different threshold numbers.

  • Benchmark performance for users with 2,000+ notes.

Weeks 6-8 (Phase II: Feedback UI)

  • Build the React Sidebar Panel.

  • Implement the "Suggestion" logic and the "Accept/Reject" state management.

Weeks 9-11 (Phase III: Archive Feature)

  • Add "Auto-Archive" discovery.

  • Refine the UI for bulk-applying changes and add the OpenAI/Gemini API opt-in settings.

A one-week buffer (Week 12) has been allocated for any unexpected issues that arise during development, as well as for final documentation.

Communications


  • Timezone/Working hours: EST; I can dedicate roughly 9:00 AM to 5:00 PM, Mon-Fri, with short regular breaks. This schedule reflects a commitment of around 8 hours a day and is completely flexible should meetings with mentors in other time zones be needed.

  • Email: angela.xu.dev@gmail.com

  • Phone: Ideally I would love to communicate with the mentors hosting a weekly catch up on Google Meet, since the platform allows for screen sharing which would help a lot when discussing programming or doubts about the project.

  • Slack, IRC ids: I am flexible on the platform used for communication with mentors, I can use Discord, Teams, IRC or Slack.
