Feedback wanted on SAST + Supply Chain tooling direction

Hi maintainers and any interested members of the Joplin community,

While comparing different security tools for the proposed plugin review pipeline, I realized that many of the decisions depend mainly on maintainer/reviewer preference and long-term maintainability, so I wanted to get some direct feedback before finalizing the tooling direction.

One of the biggest tradeoffs I keep running into is maintainability vs. depth of analysis.
For example, CodeQL seems significantly stronger for deep semantic and cross-file analysis, but writing and maintaining custom queries also appears much more complex, since they have to be written in QL. Semgrep, on the other hand, seems much easier to maintain and to write project-specific rules for, but its analysis is comparatively shallower.
Since Joplin plugins make heavy use of Joplin's custom API, custom rules become very important in this scenario.
So how much should maintainability matter? Is writing custom rules in QL viable?
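
To give a sense of how small a project-specific Semgrep rule can be, here is a rough sketch. The rule id, message, severity, and the choice of `joplin.commands.execute` as the call to flag are my own illustrative assumptions, not an agreed review policy. I build it as a Python dict and dump it as JSON (which is also valid YAML); I have not run it against real plugins.

```python
import json

# Hypothetical rule: the id, message, and the API call being flagged are
# illustrative assumptions, not an agreed Joplin review policy.
RULE = {
    "rules": [
        {
            "id": "joplin-plugin-commands-execute",
            "languages": ["typescript", "javascript"],
            "severity": "WARNING",
            "message": "Plugin calls joplin.commands.execute; check which command is run.",
            # $CMD is a Semgrep metavariable; "..." matches any remaining arguments.
            "pattern": "joplin.commands.execute($CMD, ...)",
        }
    ]
}


def render_rule(rule_dict: dict) -> str:
    """Serialize the rule as JSON, which is also valid YAML."""
    return json.dumps(rule_dict, indent=2)


if __name__ == "__main__":
    print(render_rule(RULE))
```

The point is only that the pattern looks like the code it matches, which is why I expect rules like this to be easier for reviewers to maintain.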

The Doyensec blog post "Comparing Semgrep and CodeQL" reports that CodeQL performed better than Semgrep in standard benchmark testing, but also produced many false positives (findings it flagged as threats that actually were not) and had a higher execution time.
Meanwhile, on real-world code, both Semgrep and CodeQL achieved a 100% detection rate, though CodeQL still generated 25% false positives.

So, which would you prefer: a tool that might miss 0-20% of threats but generates only 0-10% false positives, or a tool that would miss 0-5% of threats but generates 20-40% false positives?
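
To make that question a bit more concrete, here is a small back-of-the-envelope sketch. It assumes "false positives" means the share of reported findings that are not real threats (i.e. 1 minus precision), which may not be how every benchmark defines it. The function name `triage_load` and the figure of 100 real threats are made up for illustration.

```python
def triage_load(real_threats: int, miss_rate: float, false_positive_share: float) -> dict:
    """Estimate what reviewers would see for a given tool profile.

    Assumption: false_positive_share is the fraction of *reported* findings
    that are not real threats (1 - precision).
    """
    if not 0 <= miss_rate <= 1:
        raise ValueError("miss_rate must be between 0 and 1")
    if not 0 <= false_positive_share < 1:
        raise ValueError("false_positive_share must be in [0, 1)")
    detected = real_threats * (1 - miss_rate)
    # If a share f of the reports is noise, the real detections are (1 - f) of them.
    reported = detected / (1 - false_positive_share)
    return {
        "missed": real_threats - detected,
        "false_alarms": reported - detected,
        "reported": reported,
    }


if __name__ == "__main__":
    # Worst case of each option, for a hypothetical 100 real threats.
    print("quiet tool:", triage_load(100, miss_rate=0.20, false_positive_share=0.10))
    print("noisy tool:", triage_load(100, miss_rate=0.05, false_positive_share=0.40))
```

Under these assumptions, the worst case of the first option misses 20 threats and adds about 9 false alarms, while the worst case of the second misses 5 threats and adds about 63 false alarms for reviewers to triage.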

Another thing I’m unsure about is how strict the pipeline should be around build failures.
Some tools can still scan partially broken/unbuildable codebases, while others rely heavily on successful builds.
If a plugin fails to build, should that immediately stop the review process, or should scanning still continue on the raw source code?
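
The two options could be sketched roughly like this. Everything here is hypothetical (the policy names, `plan_review`, and the placeholder scanner names); it is only meant to show that the choice can be a single configurable switch instead of something hard-coded.

```python
from dataclasses import dataclass, field
from enum import Enum


class BuildFailurePolicy(Enum):
    STOP = "stop"                # a failed build ends the review
    SCAN_SOURCE = "scan_source"  # a failed build still allows source-only scans


@dataclass
class ReviewPlan:
    scanners: list
    notes: list = field(default_factory=list)


def plan_review(build_ok: bool, policy: BuildFailurePolicy,
                source_scanners: list, build_scanners: list) -> ReviewPlan:
    """Decide which scanners run, given the build result and the chosen policy."""
    if build_ok:
        return ReviewPlan(list(source_scanners) + list(build_scanners))
    if policy is BuildFailurePolicy.STOP:
        return ReviewPlan([], ["Build failed: review stopped."])
    # Build failed but policy allows partial scanning of the raw source.
    return ReviewPlan(
        list(source_scanners),
        ["Build failed: only source-level scanners ran, so results are partial."],
    )


if __name__ == "__main__":
    # Placeholder scanner names, not real tools.
    plan = plan_review(False, BuildFailurePolicy.SCAN_SOURCE,
                       ["source-scanner"], ["build-scanner"])
    print(plan.scanners, plan.notes)
```

In the second mode, I think the partial-results note should be surfaced to the reviewer so that a clean scan of an unbuildable plugin is not mistaken for a full pass.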

I’ve also been thinking about whether LLMs should have any role in the pipeline at all, even if it’s only for summarizing diffs/findings for reviewers rather than directly making security decisions (since they are prone to hallucination).
Articles such as Konvu's "Semgrep vs CodeQL (2026): Technical Comparison for Security Teams" suggest that LLM-based post-filtering could reduce initial false positive rates.

Right now my proposed tooling is:
Semgrep + Socket.dev (with an LLM used for post-filtering, summarization, and reducing noisy findings)
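
One way I imagine keeping the LLM advisory-only is sketched below. This is just an illustration under my own assumptions: `Finding`, `annotate_findings`, and the `classify` callable (a stand-in for whatever LLM call would be used) are all hypothetical names. The LLM may label and reorder findings, but it never removes one.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Finding:
    rule_id: str
    message: str
    llm_note: Optional[str] = None
    llm_likely_noise: bool = False


def annotate_findings(
    findings: List[Finding],
    classify: Callable[[Finding], Tuple[bool, str]],
) -> List[Finding]:
    """Attach an LLM opinion to every finding without dropping any of them."""
    annotated = []
    for finding in findings:
        try:
            likely_noise, note = classify(finding)
        except Exception as exc:
            # If the LLM call fails, keep the finding and treat it as real.
            likely_noise, note = False, f"LLM annotation unavailable: {exc}"
        annotated.append(Finding(finding.rule_id, finding.message, note, likely_noise))
    # Stable sort: findings not marked as noise come first, nothing is removed.
    return sorted(annotated, key=lambda f: f.llm_likely_noise)


if __name__ == "__main__":
    # Stub classifier standing in for a real LLM call.
    def stub(finding: Finding) -> Tuple[bool, str]:
        return finding.rule_id == "noisy-rule", "stub summary"

    for item in annotate_findings([Finding("noisy-rule", "a"), Finding("real-rule", "b")], stub):
        print(item)
```

With this shape, a hallucinated "likely noise" label can push a finding lower in the list but cannot hide it from the reviewer.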

I would really appreciate any suggestions or opinions on which direction feels most sustainable for the team long-term.