
The cgiwrapper-nollmcrawler program

cgiwrapper-nollmcrawler is a very ad-hoc, quick-and-dirty protection against LLM crawler bots for installations that run tipidee under super-servers from s6-networking. tipidee servers cannot run an anti-crawler solution like Anubis and need alternative protections.

cgiwrapper-nollmcrawler is a chainloading program that you wrap your CGI program with. It takes a regular expression on the command line; if a new client connects to the server and hits the CGI program with a query string that matches the regular expression, the request is denied and the IP of the client is immediately blacklisted. Otherwise, the client is whitelisted and can hit any URL on the server.
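The decision described above can be sketched in a few lines. This is only an illustration of the logic, not the program's actual code: the function name decide, the client_known flag and the return labels are hypothetical, and Python's re.search stands in for whatever regular expression flavor and matching mode the real program uses. In a CGI environment the query string would come from the standard QUERY_STRING variable.

```python
import re

def decide(query_string, client_known, pattern):
    """Illustrative only: classify a request the way the text describes.

    client_known is a hypothetical flag meaning "this IP already has a
    whitelist entry"; the real program's bookkeeping is not shown here.
    """
    if client_known:
        # Already whitelisted: any URL may be served.
        return "serve"
    if re.search(pattern, query_string):
        # First contact is a deep query: deny the request and ban the IP.
        return "deny-and-ban"
    # First contact looks ordinary: serve it and whitelist the IP.
    return "serve-and-allow"
```

With a hypothetical pattern such as a=blob, a first request carrying the query string a=blobdiff;h=abc would be classified as deny-and-ban, while a first request with an empty query string would be classified as serve-and-allow.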

This takes advantage of the propensity of LLM crawlers to hit servers from random IPs with random deep queries, while minimizing false positives from real users, who rarely make a deep query on their first visit.

Interface

As a CGI program:

     cgiwrapper-nollmcrawler [ -f ] [ -v verbosity ] [ -d depth ] rulesdir regex realcgi...

Access rules format

This permits the following implementation:

LLM crawler bots are ruthless and can attack from millions of IPs, which is why efficiency is important. Implementing a ban with just a symlink() is efficient.
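To illustrate why a ban that costs a single symlink() is cheap, here is a minimal sketch. The directory layout is an assumption made for the example: one entry per IP directly under the rules directory, pointing at a shared directory named deny-template. Neither the entry naming nor the template name is taken from the program's real access rules format, and ban is a hypothetical helper.

```python
import os

def ban(rulesdir, ip, template="deny-template"):
    """Illustrative only: record a ban for ip with one symlink() call.

    Assumed layout (hypothetical): rulesdir/<ip> is a symlink whose
    relative target resolves to rulesdir/<template>.
    Returns False if an entry for this IP already exists.
    """
    link = os.path.join(rulesdir, ip)
    try:
        # One system call, no file contents to write.
        os.symlink(template, link)
    except FileExistsError:
        return False
    return True
```

Since every banned IP points at the same target, the cost per ban in this sketch is one directory entry, and the call fails cleanly instead of overwriting when the entry already exists.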

Common usage

Exit codes

0
Success.
100
Bad usage.
111
System call failed. This usually signals an issue with the underlying operating system.

Options

-4
Expect IPv4 addresses. Use this option when reading logs from a server listening on an IPv4 address.
-6
Expect IPv6 addresses. Use this option when reading logs from a server listening on an IPv6 address.

Notes