Using 444
Brett Cooper
bctrainers at gmail.com
Sun Sep 28 02:51:33 UTC 2025
Honestly, I wouldn't consider the 'AI vs resources' issue off-topic...
granted I have a modest wall of text incoming. :-)
While I have not used the `go-away` package from
https://git.gammaspectra.live/git/go-away, I've also seen the effects of
AI harvesters on web servers. The resource consumption can sometimes be
absolutely immense, whether it's software or physical capacity ceilings
being hit. Much of that is because AI bots completely ignore or evade
rate limiting by spreading requests across massive ranges of CIDRs.
With that said, the majority of "good" AI harvesters/agents identify
themselves with a distinct user agent, which makes blocking or rate
limiting them at the nginx level fairly straightforward. The more
'sneaky' AI harvesters will generally mimic real or near-real-looking
user agents. Those, however, predominantly operate out of 'cloud'
CIDRs, which makes them fairly easy to block/filter by network instead.
That, in turn, gives us some options to use against AI
agents/harvesters:
1) Using something like the 'nginx ultimate bad bot blocker' project
located at:
https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker and
configuring it to be exceptionally strict (return 444 or rate limiting)
against user agents deemed unwanted.
2) Using iptables/nftables on Linux or an appliance in front of the
nginx server to block/drop swaths of CIDRs relating to problematic/toxic
cloud networks/data centers.
3) Robust rate limiting via the ngx_http_limit_req module.
The best approach is to use all three options together; a rough
nginx-level sketch of options 1 and 3 follows.
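For what it's worth, here's a minimal (untested) sketch of what options
1 and 3 can look like at the nginx level. The variable name
($block_ua), the zone name and rate, and the 'backend' upstream are
placeholders of my own, not names taken from the bad bot blocker
project, so adjust to taste:

# http {} context: classify requests by user agent
map $http_user_agent $block_ua {
    default                  0;
    "~*(?:\b)GPTBot(?:\b)"   1;
    "~*(?:\b)ccbot(?:\b)"    1;
    # ...extend with the user agents from the list further down
}

# http {} context: 10 req/s per client address, 10 MB of shared state
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;
    server_name example.com;

    # allow short bursts, answer anything beyond that with 429
    limit_req zone=perip burst=20 nodelay;
    limit_req_status 429;

    location / {
        # flagged user agents get the connection closed with no response
        if ($block_ua) {
            return 444;
        }
        proxy_pass http://backend;   # placeholder upstream
    }
}

Returning 444 closes the connection without sending anything back,
which is about the cheapest thing nginx can do per request, while
limit_req keeps the better-behaved-but-still-noisy crawlers from
flattening the backend.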
In the case of using that ultimate bad bot blocker, I've added the
following to my blacklist-user-agents.conf file (this includes probing
and AI clients):
"~*(?:\b)libwww-perl(?:\b)" 3;
"~*(?:\b)wget(?:\b)" 3;
"~*(?:\b)Go\-http\-client(?:\b)" 3;
"~*(?:\b)LieBaoFast(?:\b)" 3;
"~*(?:\b)Mb2345Browser(?:\b)" 3;
"~*(?:\b)MicroMessenger(?:\b)" 3;
"~*(?:\b)zh_CN(?:\b)" 3;
"~*(?:\b)Kinza(?:\b)" 3;
"~*(?:\b)Bytespider(?:\b)" 3; #TikTok Scraper
"~*(?:\b)Baiduspider(?:\b)" 3;
"~*(?:\b)Sogou(?:\b)" 3;
"~*(?:\b)Datanyze(?:\b)" 3;
"~*(?:\b)AspiegelBot(?:\b)" 3;
"~*(?:\b)adscanner(?:\b)" 3;
"~*(?:\b)serpstatbot(?:\b)" 3;
"~*(?:\b)spaziodat(?:\b)" 3;
"~*(?:\b)undefined(?:\b)" 3;
"~*(?:\b)claudebot(?:\b)" 3;
"~*(?:\b)anthropic\-ai(?:\b)" 3;
"~*(?:\b)ccbot(?:\b)" 3;
"~*(?:\b)FacebookBot(?:\b)" 3;
"~*(?:\b)OmigiliBot(?:\b)" 3;
"~*(?:\b)cohere\-ai(?:\b)" 3;
"~*(?:\b)Diffbot(?:\b)" 3;
"~*(?:\b)omgili(?:\b)" 3;
"~*(?:\b)GoogleOther(?:\b)" 3;
"~*(?:\b)Google\-Extended(?:\b)" 3;
"~*(?:\b)ChatGPT-User(?:\b)" 3;
"~*(?:\b)GPTBot(?:\b)" 3;
"~*(?:\b)Amazonbot(?:\b)" 3;
"~*(?:\b)Applebot(?:\b)" 3;
"~*(?:\b)PerplexityBot(?:\b)" 3;
"~*(?:\b)YouBot(?:\b)" 3;
I've probably left a few off this list, but eh... This seems to have
stopped the majority of AI scrapers/harvesters (and probers/exploiters)
using such user agents. That leaves a handful of stragglers to be
blocked at the firewall level.
As for me, the ones being blocked at the firewall level are primarily
Chinese-based cloud providers. The worst case I've seen to date
occurred earlier this year, when four different cloud data centers were
being used (abused?). Ultimately, someone or some company used those
cloud providers' services for mass scraping to feed AI
harvesting/training.
At one point, one of my servers was fielding thousands of requests a
second from hundreds of different IP addresses, with no single address
standing out, all using varying user agents and hitting random pages
that had previously been scraped (a URL list seems to have been
collected initially). It wasn't until I blocked the majority of Alibaba
Cloud (AS45102), Huawei Cloud (AS136907), Tencent Cloud (AS132203), and
a small amount of OVH (AS16276) that things /mostly/ returned to
normalcy.
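If doing this at the firewall isn't an option, roughly the same effect
is possible inside nginx with the geo module. A hedged sketch, with
documentation prefixes standing in for the real announced prefixes of
those ASNs (pull the actual lists from whois/BGP data):

# http {} context: flag connections from unwanted cloud networks
geo $blocked_cloud {
    default           0;
    # placeholder prefixes only -- substitute the announced prefixes
    # for AS45102, AS136907, AS132203, AS16276, etc.
    203.0.113.0/24    1;
    198.51.100.0/24   1;
}

server {
    listen 80;
    server_name example.com;

    # close the connection without a response for flagged networks
    if ($blocked_cloud) {
        return 444;
    }

    location / {
        proxy_pass http://backend;   # placeholder upstream
    }
}

The firewall route is still cheaper, since those packets never reach
nginx at all, but the geo approach keeps everything in one config and
is easy to audit.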
I've never been a fan of the scorched-earth approach of blanket
banning/dropping entire providers like this. It is legitimately absurd
that I've had to resort to it just to get some sanity back and resource
usage under control.
--Brett
------ Original Message ------
>From "Jeffrey Walton" <noloader at gmail.com>
To nginx at freenginx.org
Date 09/27/2025 04:45:25 P
Subject Re: Using 444
>On Sat, Sep 27, 2025 at 2:28 PM Paul <paul at stormy.ca> wrote:
>>
>> [...]
>> Maxim, many thanks. Currently battling a DDoS including out of control
>> "AI". Front end nginx/1.18.0 (Ubuntu) easily handles volume (CPU usage
>> rarely above 1%) but proxied apache2 often runs up to 98% across 12
>> cores (complex cgi needs 20-40 ms per response.)
>>
>> I'm attempting to mitigate. Your advice appreciated. I've "snipped"
>> below for readability:
>
>My apologies if this wanders too off-topic.
>
>A lot of folks are having trouble due to AI Agents scraping their
>sites for training data. It hit the folks at GNU particularly hard.
>If AI is so smart, then why does it not clone a project instead of
>scraping source code presented as web pages???
>
>You might consider putting a box on the front-end to handle the abuse
>from AI agents. Anubis, go-away and several others are popular.
>go-away provides a list of similar projects at
><https://git.gammaspectra.live/git/go-away#other-similar-projects>.
>In fact, go-away names Nginx's ngx_http_js_challenge_module as a
>mitigation for the problem.
>
>Jeff