Voice memos to Joplin text notes

I started using Joplin two weeks ago and am genuinely impressed, mostly with its customizability. I installed only one plugin because most of the functionality I need is covered by the customizable base.
As part of my workflow I use voice memos, and being able to send those to Joplin as text is very appealing to me. So I put together a command-line Linux utility (also usable from the GNOME desktop) to record voice memos, transcribe them into text, and create new Joplin notes (using the data API).
It uses offline speech recognition based on whisper.cpp and, surprisingly, is quite usable and practical.
Please, take a look: GitHub - QuantiusBenignus/NoteWhispers: Voice memos recorded from the microphone, transcribed offline to text and converted to Joplin notes
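For anyone curious what such a pipeline looks like, here is a minimal sketch, not the actual NoteWhispers script; the paths, model file, and the JOPLIN_TOKEN variable are placeholders:

```bash
#!/usr/bin/env bash
# Minimal sketch: record, transcribe offline with whisper.cpp, create a Joplin note.
# Paths, model file and JOPLIN_TOKEN are placeholders, not the real script's values.

# Record a 10-second, 16 kHz mono memo with sox's `rec`.
rec -q -r 16000 -c 1 /tmp/memo.wav trim 0 10

# Transcribe offline with whisper.cpp (-nt suppresses timestamps).
TEXT="$(~/whisper.cpp/main -m ~/whisper.cpp/models/ggml-base.en.bin \
        -f /tmp/memo.wav -nt 2>/dev/null)"

# Create a new note through the Joplin data API (Web Clipper service, default port 41184).
curl -s -X POST "http://localhost:41184/notes?token=$JOPLIN_TOKEN" \
     -H "Content-Type: application/json" \
     --data "$(jq -n --arg body "$TEXT" '{title: "Voice memo", body: $body}')"
```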

It is definitely not a one-click solution, but the setup process, just like Joplin itself, is quite customizable and relatively easy to follow. The shell script (zsh or bash) will do some sanity checks and provide guidance, even if one does not follow the README file.


If Joplin was not running while using this little tool, the transcribed speech notes would be saved as JSON files stored temporarily in the user's config/resources folder. The idea was to pick them up later and insert them into Joplin. This has now been implemented.
On creation of a new voice-memo note, if successful (i.e. the Clipper service is up), the tool will also pick up the temporarily stored notes and insert them into Joplin.
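For illustration, the catch-up step works roughly along these lines (a simplified sketch, not the tool's actual code; the queue directory name and JOPLIN_TOKEN are assumptions):

```bash
#!/usr/bin/env bash
# Sketch of flushing temporarily stored notes once the Clipper service is reachable.
QUEUE="$HOME/.config/notewhispers/pending"   # hypothetical queue directory
API="http://localhost:41184"

# Only flush the queue if the Web Clipper service answers its ping endpoint.
if curl -sf "$API/ping" >/dev/null; then
    for f in "$QUEUE"/*.json; do
        [ -e "$f" ] || continue              # nothing pending
        # Each file is assumed to hold a ready-made JSON payload for POST /notes.
        if curl -sf -X POST "$API/notes?token=$JOPLIN_TOKEN" \
                -H "Content-Type: application/json" --data @"$f" >/dev/null; then
            rm -- "$f"                       # inserted into Joplin, drop the temp copy
        fi
    done
fi
```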

A question: Is there a straightforward way (outside of loading plugin code - I like it lean(er) and mean) to sync such "incoming" notes automatically from the Joplin side, say on startup?

Thanks for your post! It inspired me to read a bit into the current state of (offline and Google-free) speech-to-text apps, even if I barely use them. I had good results with GitHub - ideasman42/nerd-dictation: Simple, hackable offline speech to text - using the VOSK-API. Even with my German dialect :smiley:

The Whisper model seems to perform better and uses a single model for multiple languages. I'm curious about the results when trying it with your script.

About the syncing: you might take a look at the REST uploader (if I understood the question correctly), though it only uploads rather than syncs.


Thanks for the reply!
I like the reference to the REST uploader. Although I also use the Joplin REST API (with curl as a client, both uploading on creation and "syncing" by uploading the temp files), I can integrate the uploader into my workflow or borrow from it.

Concerning Whisper, yes, I have been passively monitoring the state of the art in the field for years (e.g. in the early 2000s using the MS Speech SDK to input formulae in Mathematica; I know, a real dinosaur here) and IMHO, with the transformer models it has now reached critical "offline" mass. I use an AMD A10 APU (circa 2012) with the 'tiny' or 'base' English-only models (~300 to 500 MB in memory; I think Whisper is very well versed in German too) and I find it more than acceptable. On an AMD Ryzen with 6 cores it becomes an essentially trivial, highly accurate task. Give it a try; looking at your Joppy, I think it would be a simple matter for you to quickly get all the parts up and running, plus I would get a knowledgeable and sympathetic beta tester :-)


Hi, are there any plans to use Whisper for voice typing? It is quite good, has punctuation, some settings, and even translation, all offline.
I am using voice typing more and more on Android and it will be a game changer for long notes.

Shopping for a good app that implements Whisper on Android, I came across these three:

  1. Whisper Voice Keyboard
  2. Whisper Journal
  3. WhisperInput

The first one is an Android keyboard, but it is English only (not multilingual) and there is no option to select the model; you can only select the number of threads.
It has a simple interface with three buttons to record audio, remove a line, and add a line. These buttons are useful for separating paragraphs and quickly editing while dictating. It processes the voice every 10 seconds. I would use this one if it were multilingual and could use a bigger model (even if that makes it slower).

The second one is a note-taking app, not a keyboard, from the same developers as the first one.
It is multilingual, has translation, and gives you the option to pick models of different sizes and compare how fast/slow each one is and the quality of the results. Models: tiny.en, tiny, base.en, base, small.en, small.

The third one is also a keyboard that allows some settings, like multilingual mode and model size. It is published at GitHub - alex-vt/WhisperInput: Offline voice input panel & keyboard with punctuation for Android.
It is somewhat similar to the first one (but without the buttons); they didn't release an app, only the code, and I don't know how to test it.

Anyway, I guess we will see several Android keyboards offering this feature soon, with all the bells and whistles, and then Whisper could be used in any app, not only in Joplin. SwiftKey already has Bing with ChatGPT integrated, but it still uses Google voice input.

The voice typing feature in Joplin for Android should have a shortcut for quick access when a note is in edit mode, instead of being hidden in a menu.

I found a Whisper-based keyboard, multilingual and with several settings; it is quite good when using the biggest model.
It would be cool to have something like this integrated in the app:

FUTO Voice Input
https://voiceinput.futo.org/

Sayboard is a multilingual Vosk-based Android keyboard. Unlike FUTO Voice Input, it can input text instantaneously and without a time limit, but it doesn't yet add capitalization or punctuation. It also avoids proprietary licensing.

Github: GitHub - ElishaAz/Sayboard: An open-source on-device voice IME (keyboard) for Android using the Vosk library.
F-Droid: Sayboard | F-Droid - Free and Open Source Android App Repository


Sorry for the late reply and thanks for the links.
For me the desktop remains the work platform of choice (power and screen size), so I am not considering phone apps, aside from, maybe, note intake into Joplin.
So I have done limited development for the Linux desktop in the context of this thread:
Blurt (GitHub - QuantiusBenignus/blurt: Gnome shell extension for accurate speech to text input in Linux using whisper.cpp. Input text from speech into any window that has the keyboard focus.) is a simple GNOME Shell extension that evolved from the command-line utility NoteWhispers, which itself is built around the great whisper.cpp.

Whisper.cpp has become a standard tool in my Linux workflow, initially mostly for Joplin note taking, but now, thanks to this extension, in every application with an editable text field. I wanted to avoid simulating input events (frowned upon for good reasons), so one still has to use the middle mouse button to paste the transcribed text from the clipboard.
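Roughly, the idea is along these lines (a simplified illustration, not the extension's actual code; paths and the model file are placeholders):

```bash
#!/usr/bin/env bash
# Illustration: place the transcribed text into the primary selection so a
# middle-click pastes it, instead of simulating keyboard input events.
TEXT="$(~/whisper.cpp/main -m ~/whisper.cpp/models/ggml-base.bin \
        -f /tmp/memo.wav -nt 2>/dev/null)"

if [ -n "$WAYLAND_DISPLAY" ]; then
    printf '%s' "$TEXT" | wl-copy --primary       # wl-clipboard on Wayland
else
    printf '%s' "$TEXT" | xsel --primary --input  # xsel on X11
fi
```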

The base Whisper model is used by default, with 30x-faster-than-realtime transcription (with CUDA GPU support), resulting in about 300 ms to transcribe 10 s of speech on an average machine with a new(ish) CPU. If you use GNOME on Linux, you can give it a try on GitHub or at Blurt - GNOME Shell Extensions.