That's interesting, thanks for sharing. The part that extracts text from the PDF seems a bit crazy but makes sense, and I assume there's no other way since PDFs don't store the plain text. Perhaps some of it could be reused if we ever implement the OCR plugin.
If you don't want to scan all the resources on startup, perhaps you could rely on the updated_time property? Once you've finished processing, you store the highest updated_time, and next time you don't need to process anything below it. Just an idea if you or someone else wants to improve this part.
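To make the idea concrete, here's a rough sketch in Python. It's only an illustration under my own assumptions: `Resource`, `process_new_resources` and the `process` callback are made-up names, not anything from the Joplin API, and saving the returned value between runs is left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Resource:
    # Hypothetical stand-in for a resource record; only the two
    # fields needed for this sketch are modelled.
    id: str
    updated_time: int  # assumed to be a timestamp in milliseconds


def process_new_resources(
    resources: Iterable[Resource],
    last_seen: int,
    process: Callable[[Resource], None],
) -> int:
    """Process resources not older than last_seen; return the new mark."""
    highest = last_seen
    for resource in resources:
        # Skip anything strictly below the stored mark. Resources equal
        # to the mark are processed again, so an update landing in the
        # same millisecond as the previous mark isn't missed.
        if resource.updated_time < last_seen:
            continue
        process(resource)
        highest = max(highest, resource.updated_time)
    return highest
```

On the first run you'd pass `last_seen=0` so everything gets processed, then store the returned value somewhere persistent and pass it back in on the next startup. Note that `process` would need to be safe to call twice on the same resource, since the one sitting exactly at the mark is handled again.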