Possible issue with LRU and shared memory zones?

Sat Sep 21 13:14:08 UTC 2024

Hi!

We've used nginx on our FreeBSD systems for what feels like forever, and 
love it. Over the last few years we've been hit by pretty massive DDoS 
attacks, and have been employing various tricks in nginx to fend them 
off. One of them is, of course, rate limiting.

Given a config like:
   limit_req_zone $request zone=unique_request_5:100m rate=5r/s;

and then
     limit_req zone=unique_request_5 burst=50 nodelay;

we're getting messages like this:
   could not allocate node in limit_req zone "unique_request_5"

We see this on an idle node that only gets very sporadic requests. 
However, this is preceded by a DDoS attack several hours earlier, which 
consisted of requests hitting this exact location block with short 
requests like
   POST /foo/bar?token=DEADBEEF

When, after a few million requests like this in a short timespan, a 
"normal" request comes in that is *much* longer than the DDoS requests, 
e.g.
   POST /foo/bar?token=DEADBEEF&moredata=foo&evenmoredata=bar

this is immediately REJECTED by the rate limiter, and we get the 
aforementioned error in the log.

The current theory, supported by consulting with FreeBSD developers far 
more educated and experienced than myself, is that something is going 
wrong with the LRU allocator: since nearly all of the shared memory zone 
was filled with entries for short requests, freeing up one (or even two) 
of those entries does not release enough room for the larger entry that 
one of these new, longer requests needs. Only an nginx restart clears 
this up.
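
To make the theory concrete, here is a toy model in Python. It is NOT 
nginx's actual slab allocator, just a sketch under assumptions of mine: 
pages are 4096 bytes, each page serves a single power-of-two slot size, 
a page only becomes reusable once all of its slots are free, every entry 
carries 80 bytes of bookkeeping (NODE_OVERHEAD, a guess), and a failed 
allocation evicts the two oldest entries before retrying. The names 
ToySlab and simulate are made up for this illustration.

```python
from collections import deque

PAGE_SIZE = 4096
NODE_OVERHEAD = 80  # assumed per-entry bookkeeping bytes, a guess


def size_class(nbytes):
    """Round a request up to the next power-of-two slot size (minimum 8)."""
    size = 8
    while size < nbytes:
        size *= 2
    return size


class ToySlab:
    """Toy allocator: every page is dedicated to one slot size."""

    def __init__(self, total_pages):
        self.free_pages = total_pages
        self.pages = {}   # page id -> [slot size, used slot count]
        self.next_id = 0

    def alloc(self, nbytes):
        size = size_class(nbytes)
        capacity = PAGE_SIZE // size
        # First try a partially used page of the same slot size.
        for pid, page in self.pages.items():
            if page[0] == size and page[1] < capacity:
                page[1] += 1
                return pid
        # Otherwise a completely free page is required.
        if self.free_pages == 0:
            return None
        self.free_pages -= 1
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = [size, 1]
        return pid

    def free(self, pid):
        page = self.pages[pid]
        page[1] -= 1
        # A page is only handed back once ALL of its slots are free.
        if page[1] == 0:
            del self.pages[pid]
            self.free_pages += 1


def simulate(total_pages=4, evict_per_attempt=2):
    slab = ToySlab(total_pages)
    lru = deque()  # oldest entry on the left
    short = NODE_OVERHEAD + len("POST /foo/bar?token=DEADBEEF")
    long_ = NODE_OVERHEAD + len(
        "POST /foo/bar?token=DEADBEEF&moredata=foo&evenmoredata=bar")

    # The "attack": fill the whole zone with short entries.
    while True:
        pid = slab.alloc(short)
        if pid is None:
            break
        lru.append(pid)

    # A longer request arrives: try, evict a few oldest entries, retry.
    slot = slab.alloc(long_)
    if slot is None:
        for _ in range(evict_per_attempt):
            slab.free(lru.popleft())
        slot = slab.alloc(long_)
    return slot is not None, size_class(short), size_class(long_)


if __name__ == "__main__":
    print(simulate())
```

In this toy model simulate() returns (False, 128, 256): the short 
entries occupy 128-byte slots, the long request needs a 256-byte slot, 
and evicting two short entries frees no page for it. Only once a whole 
page's worth of old entries is evicted (evict_per_attempt=32 here) does 
the allocation succeed. Whether nginx really behaves like this is 
exactly what I am asking.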

Is there anything we can do to avoid this? I know the API for clearing 
and monitoring the shared memory zones has so far only been available 
in NGINX Plus, but we are strictly on a FOSS-only diet, so using 
anything like that is obviously out of the question.

Thanks, and take care,
Eirik Øverby