avoid limit_req / limit_conn limits for proxy_cache cached content

Mon May 18 00:19:35 UTC 2026

Hello!

On Sun, May 17, 2026 at 11:21:15AM -0500, Constantine A. Murenin wrote:

> My optimised and heavily-cached OpenGrok-based dev web site has finally
> succumbed to the DDoS from the supposed AI abuse, so, I'm, reevaluating
> resource usage and the applicable limits.
> 
> My objective is to serve everyone as long as there is available capacity.
> Instead of doing a cat and mouse game on blocking any specific identifier,
> like User-Agent or IP or netblock or region, I want to block excessive
> usage IFF the content isn't already cached.
> 
> For example, the search results page on my OpenGrok may take 10ms to 50ms
> to generate when the Lucene index is stored on an mfs, but once generated,
> the page is basically "free" to serve, and I want keep serving the cached
> entry even during what may look like a DDoS attack on my instance.
> Otherwise, a Slashdot-like event could result in the most popular
> combination of identifiers being promptly blocked, and legitimate users
> being denied access, even when the content was actually "free" and wouldn't
> have required any excessive resources to generate and serve.
> 
> I've re-looked at http://freenginx.org/r/limit_req and
> http://freenginx.org/r/limit_conn, but I don't see any way to exclude
> cached content from still getting subjected to the limits.
> 
> I think the standard route here may be to use an
> http://freenginx.org/r/error_page exception handler, to automatically
> handle the 503 errors thrown by limit_req and limit_conn, and continue
> serving the content if cached, but I'm not quite certain how to integrate
> it with http://freenginx.org/r/proxy_cache.  Any suggestions?

I don't think there is a good way to check if the particular 
request is going to be served from the cache or not.

An obvious solution would be to introduce an additional proxy layer 
after the cache, and apply limits there.
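
A minimal, untested sketch of such a two-layer setup might look as 
follows.  All addresses, ports, zone names, paths, and rates are 
hypothetical, and the limit key assumes the front layer passes the 
client address in the X-Real-IP header:

```nginx
# http{} context.  Untested sketch; names, ports and paths are assumptions.
proxy_cache_path /var/cache/nginx/front keys_zone=front:10m;
limit_req_zone $http_x_real_ip zone=perip:10m rate=1r/s;

# Front layer: caching only, no limits.  Cache hits never leave it.
server {
    listen 80;

    location / {
        proxy_cache front;
        proxy_cache_valid 200 10m;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_pass http://127.0.0.1:8081;
    }
}

# Inner layer: sees only requests the cache could not answer,
# so limits here apply only to content that must be generated.
server {
    listen 127.0.0.1:8081;

    location / {
        limit_req zone=perip burst=5;
        proxy_pass http://127.0.0.1:8080;   # the real backend
    }
}
```

With only "proxy_cache_valid 200" configured, 503 responses generated 
by limit_req in the inner layer should not end up in the cache, unless 
caching headers are added to them.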

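Another possible solution might be to use proxy_cache with 
proxy_pass to a backend which always returns an error, and 
error_page to handle these errors in a different location with 
limits, the same proxy_cache, and proxy_pass to the real backend.  
This way cached responses are returned by the first location 
without any limits, and only cache misses reach the location with 
limits.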
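
Something along these lines, as an untested sketch.  The addresses, 
zone names, paths, rates, and the 502 status are assumptions chosen 
for illustration.  Note the explicit proxy_cache_key: the default key 
includes $proxy_host, which would differ between the two locations:

```nginx
# http{} context.  Untested sketch; names, ports and paths are assumptions.
proxy_cache_path /var/cache/nginx/pages keys_zone=pages:10m;
limit_req_zone $binary_remote_addr zone=perip:10m rate=1r/s;

# Dummy backend which always returns an error.
server {
    listen 127.0.0.1:8082;
    return 502;
}

server {
    listen 80;

    # The same cache and the same key in both locations below.
    proxy_cache pages;
    proxy_cache_key $scheme$host$request_uri;
    proxy_cache_valid 200 10m;

    location / {
        # Cache hit: served right here, without limits.
        # Cache miss: the dummy backend returns 502, which is
        # intercepted and handled in @limited.
        proxy_pass http://127.0.0.1:8082;
        proxy_intercept_errors on;
        error_page 502 = @limited;
    }

    location @limited {
        # Only cache misses get here.
        limit_req zone=perip burst=5;
        proxy_pass http://127.0.0.1:8080;   # the real backend
    }
}
```

The "=" in error_page is there so that the response code from the 
real backend is returned to the client instead of 502.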

> One option may be to use http://freenginx.org/r/proxy_store instead of
> proxy_cache, but I'm not sure that'll work properly when I'm also caching
> the search result pages, too, to account for the Slashdot-like events
> (they're currently referred to as "When many people access the same link
> simultaneously -- such as when a GitLab link is shared in a chat room"),
> without creating new restrictions on the input for the search query string,
> for example, not to mention having to do manual purges of the cached data
> and missing all the other nice features of the standard proxy_cache.

Using proxy_store with non-trivial URIs might be problematic, as 
well as using it for content which might change.  Basically, it is 
a mechanism to mirror static files which never change.  While 
using it as a cache is certainly possible, it is going to be 
a non-trivial and error-prone solution.  An additional proxy layer 
is probably much easier.

Also, not directly related to the question, but rather about 
AI-scrapers in general:

- For Mercurial repositories on freenginx.org, which effectively 
  provide an infinite number of distinct resources, I observe that 
  AI-scrapers started to use large botnets with multiple IP 
  addresses from different netblocks (millions of unique IP 
  addresses identified as abusive AI-scrapers in just a couple of 
  days).  Limiting them with limit_req / limit_conn with 
  traditional IP-based or netblock-based limits becomes 
  ineffective.

- Using userid session cookies (http://freenginx.org/r/userid) and 
  limiting users without $uid_got seems to be an effective 
  last-resort measure: abusive bots don't seem to try to use 
  cookies at all.  It can block legitimate users (if all the limits 
  are already consumed by bots), but for legitimate users with real 
  browsers it's just a matter of refreshing the page.  I initially 
  thought I would have to implement some proof-of-work mechanism to 
  stop them, similarly to what Anubis does, but trivial cookies 
  seem to be quite effective as well.

- Some AI-labyrinth solutions might also be effective here.  Since 
  AI-scrapers ignore "nofollow" (and that's why they try to scrape 
  Mercurial repositories on freenginx.org in the first place), they 
  will basically index any infinite resource.  This gives an 
  opportunity to keep them indexing something really cheap to 
  generate rather than real resources, without any negative effects 
  on legitimate users or robots.
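
For the userid-based limiting mentioned above, a minimal untested 
sketch might look like this.  It assumes a single shared limit for 
all requests without the cookie; the cookie name, zone name, rate, 
and backend address are hypothetical:

```nginx
# http{} context.  Untested sketch; names and rates are assumptions.

# $uid_got is empty when the client did not send the userid cookie.
# Requests with an empty limit_req key are not accounted, so only
# cookieless requests are limited, all sharing the same key.
map $uid_got $no_cookie {
    ""       1;
    default  "";
}

limit_req_zone $no_cookie zone=nocookie:1m rate=10r/s;

server {
    listen 80;

    # Issue a session cookie to clients which do not have one yet.
    userid on;
    userid_name uid;

    location / {
        limit_req zone=nocookie burst=20;
        proxy_pass http://127.0.0.1:8080;   # the real backend
    }
}
```

A browser rejected by the limit gets the cookie with the response 
anyway, so the next request (for example, a page refresh) carries it 
and is no longer subject to the limit.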

Hope this helps.

-- 
Maxim Dounin
http://mdounin.ru/