avoid limit_req / limit_conn limits for proxy_cache cached content
Maxim Dounin
mdounin at mdounin.ru
Mon May 18 00:19:35 UTC 2026
Hello!
On Sun, May 17, 2026 at 11:21:15AM -0500, Constantine A. Murenin wrote:
> My optimised and heavily-cached OpenGrok-based dev web site has finally
> succumbed to the DDoS from the supposed AI abuse, so I'm reevaluating
> resource usage and the applicable limits.
>
> My objective is to serve everyone as long as there is available capacity.
> Instead of doing a cat and mouse game on blocking any specific identifier,
> like User-Agent or IP or netblock or region, I want to block excessive
> usage IFF the content isn't already cached.
>
> For example, the search results page on my OpenGrok may take 10ms to 50ms
> to generate when the Lucene index is stored on an mfs, but once generated,
> the page is basically "free" to serve, and I want to keep serving the cached
> entry even during what may look like a DDoS attack on my instance.
> Otherwise, a Slashdot-like event could result in the most popular
> combination of identifiers being promptly blocked, and legitimate users
> being denied access, even when the content was actually "free" and wouldn't
> have required any excessive resources to generate and serve.
>
> I've re-looked at http://freenginx.org/r/limit_req and
> http://freenginx.org/r/limit_conn, but I don't see any way to exclude
> cached content from still getting subjected to the limits.
>
> I think the standard route here may be to use an
> http://freenginx.org/r/error_page exception handler, to automatically
> handle the 503 errors thrown by limit_req and limit_conn, and continue
> serving the content if cached, but I'm not quite certain how to integrate
> it with http://freenginx.org/r/proxy_cache. Any suggestions?
I don't think there is a good way to check if the particular
request is going to be served from the cache or not.
An obvious solution would be to introduce an additional proxy layer
after the cache, and apply limits there.
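
A minimal, untested sketch of such a two-layer setup follows. The
zone names, cache path, ports, rates and the X-Real-IP header are
assumptions made for illustration, not part of the original message.
The front server answers cache hits without any limits; only cache
misses reach the inner server, where limit_req applies.

```nginx
# Hypothetical names and addresses throughout.
proxy_cache_path /var/cache/nginx/one keys_zone=cache_one:10m;

# The inner layer sees connections from 127.0.0.1 only, so the limit
# is keyed on the client address forwarded by the front layer.
limit_req_zone $http_x_real_ip zone=misses:10m rate=5r/s;

# Front layer: caching, no limits.
server {
    listen 80;

    location / {
        proxy_cache cache_one;
        proxy_cache_valid 200 10m;

        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Only requests not served from the cache are proxied here.
        proxy_pass http://127.0.0.1:8081;
    }
}

# Inner layer: every request arriving here was a cache miss.
server {
    listen 127.0.0.1:8081;

    location / {
        limit_req zone=misses burst=10;

        proxy_set_header Host $host;
        proxy_pass http://127.0.0.1:8080;   # real backend (assumed)
    }
}
```

Since only 200 responses are listed in proxy_cache_valid, a 503
produced by limit_req in the inner layer is passed to the client but
is not expected to be stored in the cache.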
Another possible solution might be to use proxy_cache with
proxy_pass to a backend which always returns an error, and
error_page to handle these errors in a different location with limits,
the same proxy_cache, and proxy_pass to the real backend.
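
The single-layer variant might look roughly like the untested sketch
below. All names, ports, the 502 status and the rates are
assumptions. One detail to watch: the default proxy_cache_key
includes $proxy_host, which differs between the two proxy_pass
targets, so the sketch sets an explicit key shared by both locations.

```nginx
# Hypothetical names and addresses throughout.
proxy_cache_path /var/cache/nginx/one keys_zone=cache_one:10m;
limit_req_zone $binary_remote_addr zone=misses:10m rate=5r/s;

server {
    listen 80;

    # Explicit key, so both locations look up the same cache entries.
    proxy_cache_key $scheme$host$request_uri;

    location / {
        # A cache hit is served from here, with no limits applied.
        proxy_cache cache_one;

        # On a miss, the request goes to a backend which always fails,
        # and the error is handed over to the limited location.
        proxy_pass http://127.0.0.1:8082;
        proxy_intercept_errors on;
        error_page 502 = @limited;
    }

    location @limited {
        limit_req zone=misses burst=10;

        # Same cache; responses from the real backend are stored here.
        proxy_cache cache_one;
        proxy_cache_valid 200 10m;
        proxy_pass http://127.0.0.1:8080;   # real backend (assumed)
    }
}

# Backend which always returns an error.
server {
    listen 127.0.0.1:8082;
    return 502;
}
```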
> One option may be to use http://freenginx.org/r/proxy_store instead of
> proxy_cache, but I'm not sure that'll work properly when I'm also caching
> the search result pages, too, to account for the Slashdot-like events
> (they're currently referred to as "When many people access the same link
> simultaneously -- such as when a GitLab link is shared in a chat room"),
> without creating new restrictions on the input for the search query string,
> for example, not to mention having to do manual purges of the cached data
> and missing all the other nice features of the standard proxy_cache.
Using proxy_store with non-trivial URIs might be problematic, as
well as using it for content which might change. Basically, it is
a mechanism to mirror static files which never change. While
using it as a cache is certainly possible, it is going to be
a non-trivial and error-prone solution. An additional proxy layer is
probably much easier.
Also, not directly related to the question, but rather about
AI-scrapers in general:
- For Mercurial repositories on freenginx.org, which effectively
provide an infinite number of distinct resources, I observe that
AI-scrapers started to use large botnets with multiple IP
addresses from different netblocks (millions of unique IP
addresses identified as abusive AI-scrapers in just a couple of
days). Limiting them with limit_req / limit_conn with traditional
IP-based or netblock-based limits becomes ineffective.
- Using userid session cookies (http://freenginx.org/r/userid) and
limiting users without $uid_got seems to be an effective
last-resort measure: abusive bots don't seem to try to use cookies
at all. It can block legitimate users (if all the limits are
already consumed by bots), but for legitimate users with real
browsers it's just a matter of refreshing the page. I initially
thought I would have to implement some proof-of-work mechanism to
stop them, similarly to what Anubis does, but trivial cookies seem
to be quite effective as well.
- Some AI-labyrinth solutions might also be effective here. Since
AI-scrapers ignore "nofollow" (and that's why they try to scrape
Mercurial repositories on freenginx.org in the first place), they
can basically be made to index any infinite resource. This gives an
opportunity to keep them indexing something really cheap to
generate rather than real resources, without any negative effects
on legitimate users or robots.
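
The cookie-based measure from the second item above could be
sketched as follows. This is an untested illustration: the cookie
name, zone name, rates, and the choice of a single shared bucket for
all cookieless requests are assumptions. It relies on limit_req not
accounting requests whose key is empty.

```nginx
# Hypothetical names and values throughout.
userid on;
userid_name uid;
userid_expires 365d;

# $uid_got is empty when the request carried no userid cookie.
map $uid_got $nocookie_key {
    ""      "nocookie";   # no cookie: all such requests share one bucket
    default "";           # cookie received: empty key, not limited
}

limit_req_zone $nocookie_key zone=nocookie:1m rate=10r/s;

server {
    listen 80;

    location / {
        limit_req zone=nocookie burst=20;
        proxy_pass http://127.0.0.1:8080;   # real backend (assumed)
    }
}
```

With a shared bucket, IP addresses do not matter at all, which fits
the botnet observation in the first item; a per-address key for
cookieless requests would be another possible choice.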
Hope this helps.
--
Maxim Dounin
http://mdounin.ru/