Using 444
Maxim Dounin
mdounin at mdounin.ru
Mon Sep 29 08:17:48 UTC 2025
Hello!
On Sat, Sep 27, 2025 at 02:28:11PM -0400, Paul wrote:
> On 9/27/25 03:08, Maxim Dounin wrote:
> > Hello!
>
> Maxim, many thanks. Currently battling a DDoS including out of control
> "AI". Front end nginx/1.18.0 (Ubuntu) easily handles volume (CPU usage
> rarely above 1%) but proxied apache2 often runs up to 98% across 12 cores
> (complex cgi needs 20-40 ms per response.)
>
> I'm attempting to mitigate. Your advice appreciated. I've "snipped" below
> for readability:
>
> [snip]
> > > I am currently (a bit "hit and miss") using :
> > >
> > > proxy_buffering on; # maybe helps proxied apache2 ?
> >
> > Proxy buffering is on by default (see
> > http://freenginx.org/r/proxy_buffering), so there is no need to
> > switch it on unless you've switched it off at previous
> > configuration levels.
>
> Understood, thanks -- I had two lines (rem'd in or out for testing purposes)
> trying to respect genuine requests from regular users. Given that nginx has
> a lot of spare capacity, could this be better tuned to alleviate the load on
> the back end? I've read your doc, but in a production environment, I'm
> unsure of the implications of "proxy_buffers number size;" and
> "proxy_busy_buffers_size size;"
In general, "proxy_buffering on" (the default) is to minimize
usage of backend resources: it is designed to read the response
from the backend as fast as possible into nginx buffers, so the
backend connection can be released and/or closed even if the
client is slow and sending the response to the client takes
significant time. This matters less nowadays, since clients are
usually fast, yet it can still help in some cases. Unlikely in the
case of AI scrapers, though.
Other related settings, such as proxy_buffers, control what
nginx does with its buffers, and are mostly needed to optimize
processing
on the nginx side. In particular, larger proxy_buffers might be
needed if you want to keep more data in memory (vs. disk
buffering). As long as responses are small enough to fit into
existing memory buffers (4k proxy_buffer_size + 8 * 4k
proxy_buffers == 36k by default), you probably don't need to tune
anything.
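For example, if the backend routinely returns responses around 200k
and you want them kept in memory rather than spilled to disk, something
like the following might fit (the sizes here are illustrative, not a
recommendation):

```nginx
# Illustrative values only: sized so a ~200k response fits in memory.
proxy_buffer_size 8k;   # buffer for the response headers
proxy_buffers 32 8k;    # 32 * 8k == 256k of memory buffers per connection
```

Larger buffers trade memory per connection for fewer disk writes, so
the right values depend on typical response sizes and concurrency.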
The proxy_busy_buffers_size directive limits the amount of buffer
memory that can be busy sending the response to the client (vs.
writing the response to the file-based buffer). It often needs to
be explicitly configured to ensure it matches non-default
proxy_buffers settings, but otherwise there isn't much need to
tune it.
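For example, when enlarging proxy_buffers as sketched above,
proxy_busy_buffers_size usually needs a matching adjustment: it must be
at least the size of one buffer and less than the total buffer space
minus one buffer (again, illustrative values):

```nginx
proxy_buffers 32 8k;
proxy_busy_buffers_size 16k;  # >= one 8k buffer, well under 31 * 8k
```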
> > > connection_pool_size 512;
> > > client_header_buffer_size 512;
> > > large_client_header_buffers 4 512;
> >
> > Similarly, I would rather use the default values unless you
> > understand why you want to change these.
>
> Maybe mistakenly, I was trying to eliminate stupidly artificial cgi requests
> -- "GET /cgi-bin/....." that ran several kilobytes long. The backend apache
> could "swallow" them (normally a 404) but I was trying to eliminate the
> overhead.
If the goal is to stop requests with very long URIs, using an
explicit regular expression to limit such URIs might be a better
option. For example:
if ($request_uri ~ ".{256}") { return 444; }
The regular expression matches any request URI with more than 256
characters, and such requests are rejected.
> > > location ~ \.php$ { return 444; }
>
> You did not mention this, but it does not appear to work well. access.log
> today gives hundreds of:
>
> 104.46.211.169 - - [27/Sep/2025:12:32:12 +0000] "GET /zhidagen.php HTTP/1.1"
> 404 5013 "-" "-"
>
> and the 5013 bytes is our "404-solr-try-again" page, not the 444 expected.
This indicates there is something wrong with the configuration.
Possible issues include:
- Location being configured in the wrong/other server{} block.
- Other locations with regular expressions interfere and take
precedence.
From the details provided I suspect it's a 404 from nginx, so it
might simply be a request from an unrelated server{} block handled
by nginx?
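One way to rule out the unrelated-server{} case is an explicit
catch-all server that drops requests not matching any configured
server_name (a sketch; adjust the listen directives to your setup):

```nginx
server {
    listen 80 default_server;
    server_name _;   # matches no real name; chosen only as the default
    return 444;      # close the connection without sending a response
}
```

With this in place, only requests with a Host matching one of your
real server_name values reach the intended server blocks.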
> > Also, depending on the traffic pattern you are seeing, it might be
> > a good idea to configure limit_req / limit_conn with appropriate
> > limits.
>
> Again thanks, I had tried various 'location' lines such as
> limit_req_zone $binary_remote_addr zone=mylimit:5m rate=1r/s;
> limit_req zone=mylimit burst=5 nodelay;
>
> without success... obviously haven't fully understood
Depending on the traffic pattern, limiting per $binary_remote_addr
might not be effective. In particular, AI scrapers I've observed
tend to use lots of IP addresses, and limiting them based on the
IP address alone doesn't work well.
For freenginx.org source code repositories I currently use
something like this to limit abusive behaviour (yet still allow
automated requests when needed, such as for non-abusive search
engine indexing and repository cloning):
map $binary_remote_addr $net24 { ~^(\C\C\C) $1; }
map $binary_remote_addr $net16 { ~^(\C\C) $1; }
map $binary_remote_addr $net8 { ~^(\C) $1; }
limit_conn_zone $binary_remote_addr zone=conns:1m;
limit_conn_zone $net24 zone=conns24:1m;
limit_conn_zone $net16 zone=conns16:1m;
limit_conn_zone $net8 zone=conns8:1m;
Additionally, I use the following to limit the most abusive AI
scrapers with multiple netblocks, mostly filled in with netblocks
manually:
geo $remote_addr $netname {
# AS45102, Alibaba Cloud LLC
47.74.0.0/15 AS45102;
47.80.0.0/13 AS45102;
47.76.0.0/14 AS45102;
# AS32934, Facebook, netblocks observed in logs
57.141.0.0/16 AS32934;
57.142.0.0/15 AS32934;
57.144.0.0/14 AS32934;
57.148.0.0/15 AS32934;
# Huawei netblocks, from geofeed in whois records
1.178.32.0/23 HW;
...
}
limit_conn_zone $netname zone=connsname:1m;
With the following limits in proxied locations:
limit_conn conns 5;
limit_conn conns24 10;
limit_conn conns16 20;
limit_conn conns8 30;
limit_conn connsname 10;
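Put together in a proxied location, this might look like the
following sketch (the location path and upstream name are
placeholders, not part of the configuration above):

```nginx
location /cgi-bin/ {
    # per-address and per-netblock limits, narrowest to widest
    limit_conn conns 5;
    limit_conn conns24 10;
    limit_conn conns16 20;
    limit_conn conns8 30;
    # per-netname limit for manually listed abusive networks
    limit_conn connsname 10;
    proxy_pass http://backend;
}
```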
The backend is configured to serve 30 parallel requests and has
listen queue 128 (Apache httpd with "MaxRequestWorkers 30"). With
the above limits it currently works without issues, ensuring no
errors and reasonable response time for all users.
If the goal is to stop all automated scraping, using some
JS-based challenge as already recommended in this thread might be
a better option.
Hope this helps.
--
Maxim Dounin
http://mdounin.ru/