Using 444
Brett Cooper
bctrainers at gmail.com
Sun Sep 28 02:51:33 UTC 2025
Honestly, I wouldn't consider the 'AI vs resources' issue off-topic...
granted I have a modest wall of text incoming. :-)
While I have not used the `go-away` package from
https://git.gammaspectra.live/git/go-away, I've also seen the effects of
AI harvesters on web servers. The resource consumption can sometimes be
absolutely immense, whether it's software or physical capacity ceilings
being hit. Much of that is because AI bots completely ignore or evade
rate limiting by spreading requests across massive ranges of CIDRs.
With that said, the majority of "good" AI harvesters/agents identify
themselves with a distinct user agent, which makes blocking or rate
limiting them at the nginx level fairly straightforward. The more
'sneaky' AI harvesters will generally mimic real or near-real-looking
user agents. Those, however, predominantly operate out of 'cloud'
CIDRs, which makes them fairly easy to block/filter by network instead.
That, in turn, gives us some options to use against AI
agents/harvesters:
1) Using something like the 'nginx ultimate bad bot blocker' project
located at:
https://github.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker and
configuring it to be exceptionally strict (return 444 or rate limiting)
against user agents deemed unwanted.
2) Using iptables/nftables on Linux or an appliance in front of the
nginx server to block/drop swaths of CIDRs relating to problematic/toxic
cloud networks/data centers.
3) Robust rate limiting via the ngx_http_limit_req module.
The best approach is to use all three options together; a rough
nginx-level sketch of options 1 and 3 follows.
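For what it's worth, here's a minimal (untested) sketch of what options
1 and 3 can look like at the nginx level. The variable name
($block_ua), the zone name and rate, and the 'backend' upstream are
placeholders of my own, not names taken from the bad bot blocker
project, so adjust to taste:

# http {} context: classify requests by user agent
map $http_user_agent $block_ua {
    default                  0;
    "~*(?:\b)GPTBot(?:\b)"   1;
    "~*(?:\b)ccbot(?:\b)"    1;
    # ...extend with the user agents from the list further down
}

# http {} context: 10 req/s per client address, 10 MB of shared state
limit_req_zone $binary_remote_addr zone=perip:10m rate=10r/s;

server {
    listen 80;
    server_name example.com;

    # allow short bursts, answer anything beyond that with 429
    limit_req zone=perip burst=20 nodelay;
    limit_req_status 429;

    location / {
        # flagged user agents get the connection closed with no response
        if ($block_ua) {
            return 444;
        }
        proxy_pass http://backend;   # placeholder upstream
    }
}

Returning 444 closes the connection without sending anything back,
which is about the cheapest thing nginx can do per request, while
limit_req keeps the better-behaved-but-still-noisy crawlers from
flattening the backend.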
In the case of using that ultimate bad bot blocker, I've added the
following to my blacklist-user-agents.conf file (this includes probing
and AI clients):
"~*(?:\b)libwww-perl(?:\b)" 3;
"~*(?:\b)wget(?:\b)" 3;
"~*(?:\b)Go\-http\-client(?:\b)" 3;
"~*(?:\b)LieBaoFast(?:\b)" 3;
"~*(?:\b)Mb2345Browser(?:\b)" 3;
"~*(?:\b)MicroMessenger(?:\b)" 3;
"~*(?:\b)zh_CN(?:\b)" 3;
"~*(?:\b)Kinza(?:\b)" 3;
"~*(?:\b)Bytespider(?:\b)" 3; #TikTok Scraper
"~*(?:\b)Baiduspider(?:\b)" 3;
"~*(?:\b)Sogou(?:\b)" 3;
"~*(?:\b)Datanyze(?:\b)" 3;
"~*(?:\b)AspiegelBot(?:\b)" 3;
"~*(?:\b)adscanner(?:\b)" 3;
"~*(?:\b)serpstatbot(?:\b)" 3;
"~*(?:\b)spaziodat(?:\b)" 3;
"~*(?:\b)undefined(?:\b)" 3;
"~*(?:\b)claudebot(?:\b)" 3;
"~*(?:\b)anthropic\-ai(?:\b)" 3;
"~*(?:\b)ccbot(?:\b)" 3;
"~*(?:\b)FacebookBot(?:\b)" 3;
"~*(?:\b)OmigiliBot(?:\b)" 3;
"~*(?:\b)cohere\-ai(?:\b)" 3;
"~*(?:\b)Diffbot(?:\b)" 3;
"~*(?:\b)omgili(?:\b)" 3;
"~*(?:\b)GoogleOther(?:\b)" 3;
"~*(?:\b)Google\-Extended(?:\b)" 3;
"~*(?:\b)ChatGPT-User(?:\b)" 3;
"~*(?:\b)GPTBot(?:\b)" 3;
"~*(?:\b)Amazonbot(?:\b)" 3;
"~*(?:\b)Applebot(?:\b)" 3;
"~*(?:\b)PerplexityBot(?:\b)" 3;
"~*(?:\b)YouBot(?:\b)" 3;
I've probably left a few off this list, but eh... This seems to have
stopped the majority of AI scrapers/harvesters (and probers/exploiters)
using such user agents. That leaves a handful of stragglers to be
blocked at the firewall level.
As for me, the ones being blocked at the firewall level are primarily
Chinese-based cloud providers. The worst case I've seen to date
occurred earlier this year, when four different cloud data centers were
being used (abused?). Ultimately, someone or some company used those
cloud providers' services for mass scraping to feed AI
harvesting/training.
At one point, one of my servers was fielding thousands of requests a
second from hundreds of different IP addresses, with no single address
standing out, all using varying user agents and hitting random pages
that had previously been scraped (a URL list seems to have been
collected initially). It wasn't until I blocked the majority of Alibaba
Cloud (AS45102), Huawei Cloud (AS136907), Tencent Cloud (AS132203), and
a small amount of OVH (AS16276) that things /mostly/ returned to
normalcy.
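If doing this at the firewall isn't an option, roughly the same effect
is possible inside nginx with the geo module. A hedged sketch, with
documentation prefixes standing in for the real announced prefixes of
those ASNs (pull the actual lists from whois/BGP data):

# http {} context: flag connections from unwanted cloud networks
geo $blocked_cloud {
    default           0;
    # placeholder prefixes only -- substitute the announced prefixes
    # for AS45102, AS136907, AS132203, AS16276, etc.
    203.0.113.0/24    1;
    198.51.100.0/24   1;
}

server {
    listen 80;
    server_name example.com;

    # close the connection without a response for flagged networks
    if ($blocked_cloud) {
        return 444;
    }

    location / {
        proxy_pass http://backend;   # placeholder upstream
    }
}

The firewall route is still cheaper, since those packets never reach
nginx at all, but the geo approach keeps everything in one config and
is easy to audit.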
I've never been a fan of the scorched-earth approach of blanket
banning/dropping entire providers like this. It is legitimately absurd
that I've had to resort to it just to get some sanity back and resource
usage under control.
--Brett
------ Original Message ------
>From "Jeffrey Walton" <noloader at gmail.com>
To nginx at freenginx.org
Date 09/27/2025 04:45:25 P
Subject Re: Using 444
>On Sat, Sep 27, 2025 at 2:28 PM Paul <paul at stormy.ca> wrote:
>>
>> [...]
>> Maxim, many thanks. Currently battling a DDoS including out of control
>> "AI". Front end nginx/1.18.0 (Ubuntu) easily handles volume (CPU usage
>> rarely above 1%) but proxied apache2 often runs up to 98% across 12
>> cores (complex cgi needs 20-40 ms per response.)
>>
>> I'm attempting to mitigate. Your advice appreciated. I've "snipped"
>> below for readability:
>
>My apologies if this wanders too off-topic.
>
>A lot of folks are having trouble due to AI Agents scraping their
>sites for training data. It hit the folks at GNU particularly hard.
>If AI is so smart, then why does it not clone a project instead of
>scraping source code presented as web pages???
>
>You might consider putting a box on the front-end to handle the abuse
>from AI agents. Anubis, go-away and several others are popular.
>go-away provides a list of similar projects at
><https://git.gammaspectra.live/git/go-away#other-similar-projects>.
>In fact, go-away names Nginx's ngx_http_js_challenge_module as a
>mitigation for the problem.
>
>Jeff