[PATCH 2 of 5] Syslog: introduced ngx_syslog_send() error logging moderation

Thu Mar 7 16:06:03 UTC 2024

Hello!

On Thu, Mar 07, 2024 at 05:24:47PM +0300, Vladimir Homutov wrote:

> On Fri, Mar 01, 2024 at 06:18:18AM +0300, Maxim Dounin wrote:
> > # HG changeset patch
> > # User Maxim Dounin <mdounin at mdounin.ru>
> > # Date 1709260932 -10800
> > #      Fri Mar 01 05:42:12 2024 +0300
> > # Node ID 1c9264603adc240b226e1a149c01491d62302ded
> > # Parent  c7c8354f99face52f3b51442360317f3e2768492
> > Syslog: introduced ngx_syslog_send() error logging moderation.
> >
> > Errors when logging to syslog are now logged at most once per second.
> > This ensures that persistent errors won't flood other logs, and spontaneous
> > errors, such as ENOBUFS as observed on BSD systems when syslogd cannot cope
> > with load, or EAGAIN as seen in similar situation on Linux, won't further
> > overload logging subsystem, leading to more errors.
> >
> > Further, errors now can only trigger reconnects at most once per second.
> > This ensures that persistent errors, which cannot be fixed with reconnects,
> > don't trigger too much unneeded work.
> >
> > Additionally, in case of connection errors, such as when syslogd is not
> > running, connection attempts are only made once per second.
> 
> The change is good, but I can't get the remaining logic:
> We don't do reconnects too often (good), but we are still trying
> to send into socket that responded with error last time.
> 
> Previously, we closed such socket (the goal was to recover from
> persistent error).
> 
> With the patch we are trying to recover once per second, but still
> trying to use the (presumably) bad socket.
> Maybe we should instead just stop sending anything for the second?

At least some errors on datagram sockets, such as ENOBUFS on BSD
systems, are transient, and basically mean that the particular
datagram was lost, while other datagrams sent to the same socket
might still succeed.  Reconnecting on each lost datagram is useless
(and requires additional resources), but sending other
datagrams to the same socket might still be beneficial: they
might be properly delivered.

The idea is that if we've encountered an error which wasn't fixed
by a reconnect, we are probably dealing with one of these transient
errors, and hence we suppress reconnects for a while.
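
As a rough illustration of that idea, here is a minimal sketch of
the decision taken after each send attempt.  All names below
(sketch_peer_t, sketch_after_send() and so on) are hypothetical and
made up for this example; this is not the code from the patch, and
it assumes time is tracked with one-second granularity.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical per-peer state; not the actual nginx structure. */
typedef struct {
    time_t  reconnect_time;  /* second in which an error last
                                triggered a reconnect */
} sketch_peer_t;

typedef enum {
    SKETCH_KEEP_SOCKET,      /* keep sending to the same socket */
    SKETCH_RECONNECT         /* close the socket and connect again */
} sketch_action_t;

/*
 * Decide what to do after a send attempt.  "ok" tells whether the
 * datagram was accepted by the socket, "now" is the current time
 * in seconds.
 */
static sketch_action_t
sketch_after_send(sketch_peer_t *peer, bool ok, time_t now)
{
    if (ok) {
        return SKETCH_KEEP_SOCKET;
    }

    if (peer->reconnect_time == now) {
        /*
         * An error already triggered a reconnect during this second
         * and it did not help: likely a transient error such as
         * ENOBUFS.  The datagram is lost, but the socket is kept, as
         * the next datagram might still be delivered.
         */
        return SKETCH_KEEP_SOCKET;
    }

    peer->reconnect_time = now;

    return SKETCH_RECONNECT;
}
```

With this, a burst of failed sends within one second results in a
single reconnect, while every datagram is still handed to the
socket.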

Suppressing logging altogether does not seem to be needed, since
a send which results in an error, as long as it doesn't trigger
additional error logging, takes roughly the same resources as a
normal send, and hence can be done safely.  At least I'm not
aware of anything similar to ENOSPC errors, which are known to
require lots of resources in some cases (and therefore we suppress
file logging for a second on ENOSPC).
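
To make the difference between the two policies explicit, below is
a small sketch.  The types and function names are hypothetical and
exist only for this example (they are not the actual nginx code);
the assumption is again that "for a second" means "for the rest of
the current second".

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical state for a file log. */
typedef struct {
    time_t  disk_full_time;  /* second in which ENOSPC was last seen */
} sketch_file_log_t;

/* Hypothetical state for a syslog peer. */
typedef struct {
    time_t  error_log_time;  /* second in which a send error was
                                last reported */
} sketch_syslog_t;

/*
 * File logging: writes themselves are skipped for the rest of the
 * second in which ENOSPC was seen, since such writes can be costly.
 */
static bool
sketch_file_should_write(const sketch_file_log_t *log, time_t now)
{
    return log->disk_full_time != now;
}

/*
 * Syslog: sends are always attempted; only the reporting of send
 * errors to other logs is limited to once per second.
 */
static bool
sketch_syslog_should_report(sketch_syslog_t *log, time_t now)
{
    if (log->error_log_time == now) {
        return false;
    }

    log->error_log_time = now;

    return true;
}
```

In the file case the caller would set disk_full_time when a write
fails with ENOSPC; in the syslog case nothing prevents the next
send, only the error report is moderated.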

> If it seems too much (we can produce a lot of logs during a second and
> if an error was random, things would run just fine), then I guess that
> we can introduce a counter: close the socket in case of N successive
> errors, and put it into moderation mode, and retry again in 1 second.
> 
> Other patches look good to me.

Thanks for looking.

[...]

-- 
Maxim Dounin
http://mdounin.ru/