The cgiwrapper-nollmcrawler program
cgiwrapper-nollmcrawler is a very ad-hoc, quick-and-dirty protection
against LLM crawler bots, for installations that run tipidee under super-servers from
s6-networking. Such tipidee servers
cannot run an anti-crawler solution like
Anubis, so they need alternative protections.
cgiwrapper-nollmcrawler is a chainloading program that you wrap your CGI program
with. It takes a regular expression on the command line; if a new client connects
to the server and hits the CGI program with a query string that matches the
regular expression, the request is denied and the IP of the client is immediately
blacklisted. Otherwise, the client is whitelisted and can hit any URL on the
server.
This takes advantage of LLM crawlers' propensity to hit servers from random
IPs with random deep queries, while minimizing false positives from real users,
who rarely make a deep query on their first visit.
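For instance, with a cgit backend (the URLs below are purely illustrative), a
human's first visit typically looks like

  GET /cgit.cgi/somerepo/ HTTP/1.1

whereas an LLM crawler's first hit from a given IP is more likely to look like

  GET /cgit.cgi/somerepo/tree/src/main.c?id=0123abcd&follow=1 HTTP/1.1

i.e. a deep PATH_INFO with a query string that a well-chosen regex can match.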
Interface
As a CGI program:
cgiwrapper-nollmcrawler [ -f ] [ -v verbosity ] [ -d depth ] rulesdir regex realcgi...
- cgiwrapper-nollmcrawler expects to be run by tipideed as a CGI program,
as a wrapper around realcgi..., which must also, obviously, be runnable
as a CGI program.
- It expects rulesdir to be the access rules directory given as argument
to the -i option to
s6-tcpserver-access
on the tipideed command line (see the example command line after this list).
This directory must be writable by the user cgiwrapper-nollmcrawler
is running as (so, typically, the user running the tipideed process). rulesdir must
follow a specific format, see below.
- When cgiwrapper-nollmcrawler is invoked, it first checks whether the client has previously
been whitelisted in rulesdir. In that case, it execs into realcgi... immediately.
- Then it compares the depth of the PATH_INFO variable against depth.
If the contents of PATH_INFO have depth slashes (/) or fewer, the query is
allowed and the client is whitelisted.
- Then it checks the contents of the QUERY_STRING variable against regex. If
the query string matches, then cgiwrapper-nollmcrawler blacklists the client in
rulesdir and answers with a 403 status and an ungracious message.
- If the query string does not match regex, then the client is whitelisted
and cgiwrapper-nollmcrawler execs into realcgi....
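For reference, here is what a matching server command line could look like.
This is a simplified, TLS-less sketch, and the rules directory path is a
hypothetical example; adapt it to your installation:

  s6-tcpserver -- 0.0.0.0 80 \
    s6-tcpserver-access -i /var/lib/tipidee/rules -- \
      tipideed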
Access rules format
- rulesdir/ip4 must exist if rulesdir performs access
control for IPv4 addresses, and rulesdir/ip6 must exist if
rulesdir performs access control for IPv6 addresses. This is the standard
access rules directory structure.
- The rulesdir/outputs/allow/allow and
rulesdir/outputs/deny/deny files must also exist. They can be empty.
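For instance, a conforming rulesdir skeleton can be created like this (run from
the directory where you want rulesdir to live):

  mkdir -p rulesdir/ip4 rulesdir/ip6 rulesdir/outputs/allow rulesdir/outputs/deny
  touch rulesdir/outputs/allow/allow rulesdir/outputs/deny/deny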
This permits the following implementation:
- When cgiwrapper-nollmcrawler whitelists a client, it simply creates, in either
rulesdir/ip4 or rulesdir/ip6, a symlink named after the client's IP in the canonical
s6-tcpserver-access
format, pointing to ../outputs/allow.
- When cgiwrapper-nollmcrawler blacklists a client, it creates the same symlink
pointing to ../outputs/deny instead.
- This ensures each entry uses only one inode, and as little room as possible.
LLM crawler bots are ruthless and can attack from millions of IPs, which is why
efficiency is important; implementing a ban with a single symlink() call is cheap.
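For instance (the addresses below are documentation examples), whitelisting the
IPv4 client 203.0.113.42 and blacklisting 198.51.100.7 boil down to the
equivalent of:

  ln -s ../outputs/allow rulesdir/ip4/203.0.113.42
  ln -s ../outputs/deny rulesdir/ip4/198.51.100.7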
Common usage
- cgiwrapper-nollmcrawler expects to be run by tipideed as a CGI program,
as a wrapper around realcgi....
- e.g. if the URL you want to protect is https://example.com/cgit.cgi,
and cgit.cgi is a direct cgit binary, then the way to protect it is:
- Move cgit.cgi to cgit.cgi-real and never link to this resource anywhere.
- Write a script (shell, execline, whatever language you want) standing in for cgit.cgi,
that execs into cgiwrapper-nollmcrawler with cgit.cgi-real as its last argument
(a sketch of such a script is shown after this list). Make it executable.
- cgiwrapper-nollmcrawler is typically used to protect cgit, but it can
protect any backend that uses CGI as its interface and has deep URLs with easily
identifiable query strings.
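Here is a minimal sh sketch of such a stand-in script. The rules directory,
depth threshold, regex (assumed here to be an ERE matching typical cgit query
strings) and path to the real cgit binary are all example values to adapt to
your installation:

  #!/bin/sh
  # Stand-in for cgit.cgi: hand the request to cgiwrapper-nollmcrawler,
  # which execs into the real cgit binary if the client is deemed legitimate.
  exec cgiwrapper-nollmcrawler -d 2 \
    /var/lib/tipidee/rules \
    'id=|ofs=|follow=' \
    /home/www/docroot/example.com/cgit.cgi-real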
Exit codes
- 0
- Success.
- 100
- Bad usage.
- 111
- System call failed. This usually signals an issue with the
underlying operating system.
Options
- -v verbosity
- Set the verbosity level.
- -d depth
- Set the threshold for the PATH_INFO depth check: if PATH_INFO contains
depth slashes (/) or fewer, the request is allowed and the client is
whitelisted without its query string being checked.
Notes
- This Fediverse
thread tells the story of how cgiwrapper-nollmcrawler came to be, and how it was
deployed on skarnet.org.