Ticket #760 (new defect)

Opened 2 years ago

Last modified 1 year ago

Random crashing on FreeBSD 6.1

Reported by: anonymous
Assigned to: jan
Priority: highest
Milestone:
Component: core
Version: 1.4.11
Severity: critical
Keywords:
Cc:
Blocking:
Need Feedback:

Description

The backtrace is attached. lighttpd crashes randomly, about 30-40 times a day, on a fairly high-traffic website that serves 30-60 MB files.

Attachments

lighttpd.trace (1.3 kB) - added by Wilik on 07/23/2006 04:36:56 AM.
lighttpd.trace
lighttpd.strace (22.1 kB) - added by Wilik on 07/23/2006 04:57:09 PM.
strace file
lighttpd.conf (1.7 kB) - added by geoff@cs.hmc.edu on 05/23/2007 12:09:43 AM.
bug760.patch (4.2 kB) - added by geoff@cs.hmc.edu on 06/26/2007 10:39:35 PM.
Patch to work around problem of crashing on large files

Change History

07/23/2006 04:36:56 AM changed by Wilik

  • attachment lighttpd.trace added.

lighttpd.trace

07/23/2006 04:57:09 PM changed by Wilik

  • attachment lighttpd.strace added.

strace file

10/04/2006 09:05:40 PM changed by wiak

I have the same problem :/ lighttpd crashes randomly under heavy traffic.

10/07/2006 06:58:30 PM changed by weird_ed

This has been a problem for me on FreeBSD 6.x since I first started using Lighty on v1.4.10. It's running under supervise now, so it restarts immediately, but it's still annoying and not very impressive.

03/11/2007 10:49:37 PM changed by tbone

Upgrade to 1.4.13 and see if it still happens.

Also, what ulimits are you running lighttpd under?

03/11/2007 10:51:38 PM changed by tbone

Also, paste your lighttpd config.

05/22/2007 11:33:19 PM changed by geoff@cs.hmc.edu

I'm lucky to be able to reproduce the bug at will, so once I found this report it was easy to confirm that I'm having the same problem. Even better, it was also trivial to identify the proximate cause.

The problem is a malloc failure in buffer_prepare_copy. That, in turn, is caused by a massive memory leak. Lighty's process size when it died was 3013792 KB, or over 3 GB. Not coincidentally, the file I was downloading is 3.8 GB in size. Clearly, lighty is either trying to cache the entire file internally, or failing to free buffers as the copy progresses.

A test with a smaller file (0.5 GB) revealed that the process remains large after the file has been downloaded. Since the modern malloc often returns freed space to the system, this indicates that it's a plain memory leak. That, in turn, ought to make the bug pretty easy to find.

(follow-up: ↓ 8 ) 05/22/2007 11:46:03 PM changed by darix

  • blocking changed.

What is your usage pattern for lighttpd? I am only aware of one memory leak in lighttpd, in combination with mod_proxy. Are you using mod_proxy? And can you attach your config? (You can obfuscate stuff if needed.)

05/23/2007 12:09:43 AM changed by geoff@cs.hmc.edu

  • attachment lighttpd.conf added.

(in reply to: ↑ 7 ) 05/23/2007 08:27:55 AM changed by geoff@cs.hmc.edu

Replying to darix:

What is your usage pattern for lighttpd? I am only aware of one memory leak in lighttpd, in combination with mod_proxy. Are you using mod_proxy? And can you attach your config? (You can obfuscate stuff if needed.)

No need to obfuscate; I just attached the config file. There's no mod_proxy.

The usage pattern is VERY light (only a few users per day), but essentially all the activity is downloads of huge files over a slow link. I did a bit of code browsing, and my guess is that chunk.c doesn't limit the length of the chunk queue. So the slow link backs up, the chunk queue grows to the size of the file, and lighty runs out of memory.

If that guess (and it's only a guess) is correct, there's no memory leak, just a failure to limit the queue length. I haven't dug deeply into the code yet to see whether that's true, nor to see how hard it will be to add a queue limit.
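To make the guess above concrete, here is a toy simulation in Python. It is not lighttpd code: the function name, the tick model, and the byte rates are all made up for illustration. It only shows that when a producer reads faster than a slow link drains, an unbounded queue peaks near the full file size, while a capped queue stays near its limit.

```python
def peak_queue_bytes(total_bytes, read_per_tick, send_per_tick, limit=None):
    """Simulate a producer (CGI output) feeding a consumer (client socket).

    Returns the peak number of bytes held in the queue. limit=None models
    an unbounded queue; otherwise the producer pauses while the queue
    already holds at least `limit` bytes. All names are hypothetical.
    """
    queued = 0
    remaining = total_bytes
    peak = 0
    while remaining > 0 or queued > 0:
        # Producer: read only if input is left and the queue has room.
        if remaining > 0 and (limit is None or queued < limit):
            n = min(read_per_tick, remaining)
            remaining -= n
            queued += n
        peak = max(peak, queued)
        # Consumer: the slow link drains a small amount each tick.
        queued -= min(send_per_tick, queued)
    return peak
```

For example, peak_queue_bytes(1000, 100, 1) returns 991, just short of the 1000-byte "file", while the same call with limit=200 never holds more than the limit plus one read.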

05/23/2007 11:23:08 PM changed by geoff@cs.hmc.edu

OK, I did a test and it's not a memory leak. I downloaded a 553M-ish file, and lighty went up to 552M in size, then shrank back to a thrifty 26M after the download was done. (Note that it didn't quite get to the size of the file; I think that's because some of the chunks went out over the net while the file was being read in.)

I think this should be easy to fix. I just need to understand how lighty's asynchrony works. Then I could make it stop reading the file when the queue got too big, and come back later.
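A minimal sketch of the "stop reading when the queue gets too big, and come back later" idea, written in Python rather than against lighty's real chunk queue. The class name, the high/low water marks, and the read_paused flag are assumptions for illustration; an event loop would consult read_paused before reading more backend output.

```python
from collections import deque


class WriteQueue:
    """Hypothetical bounded write queue with high/low water marks."""

    def __init__(self, high_water, low_water):
        if low_water > high_water:
            raise ValueError("low_water must not exceed high_water")
        self.high_water = high_water
        self.low_water = low_water
        self.chunks = deque()
        self.size = 0             # total bytes currently queued
        self.read_paused = False  # event loop should skip reads while True

    def enqueue(self, chunk):
        """Queue backend output; pause reading once the queue is too big."""
        self.chunks.append(chunk)
        self.size += len(chunk)
        if self.size >= self.high_water:
            self.read_paused = True

    def drain(self, max_bytes):
        """Simulate the socket accepting up to max_bytes; returns bytes sent."""
        sent = 0
        while self.chunks and sent < max_bytes:
            chunk = self.chunks[0]
            take = min(len(chunk), max_bytes - sent)
            if take == len(chunk):
                self.chunks.popleft()
            else:
                self.chunks[0] = chunk[take:]  # keep the unsent tail
            sent += take
        self.size -= sent
        # Resume reading only after the queue has drained well below the cap.
        if self.size <= self.low_water:
            self.read_paused = False
        return sent
```

Using two thresholds instead of one avoids flipping between paused and resumed on every small write.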

Oh, one other thing. I keep talking about a file, but the bug is actually related to CGIs. As far as I can tell (I didn't write the Ruby code), our CGI stuffs the file directly to lighty, rather than using X-LIGHTTPD-send-file to get the data to go out. Obviously, that suggests an alternate fix on the Ruby side. But it's still a bug that lighty swallows whatever a broken CGI sends it, without limiting its memory usage.

(As a somewhat related comment, the crash is due to an assertion failure after a malloc. A web server should never crash due to a malloc failure; at an absolute minimum it should generate a log message, and really it should degrade gracefully. A relatively easy quick fix would be to replace assert with a macro that generated a log message before dying.)

05/24/2007 01:27:06 AM changed by darix

Configure lighttpd to use sendfile and the memory usage will be lower.

06/26/2007 10:38:35 PM changed by geoff@cs.hmc.edu

Unfortunately, configuring sendfile doesn't help because the Rails version I'm using doesn't support it (nor should it, since sendfile is server-specific). In any case, that only works around the bug. It shouldn't be possible for a misbehaving CGI script to crash the server simply by supplying a large amount of output.

Fortunately, I was able to come up with a patch that mitigates the problem. I will attach it after I complete this comment. My change limits the size of the write queue, and stops reading input from the FastCGI script when the queue becomes excessively large. The downside of my patch is that the entire server process blocks (this is undoubtedly because I don't properly understand lighty's asynchrony mechanisms). However, if you set max-procs to an appropriate value in the fastcgi.server section of your config file, the blocked process won't be problematic because other processes will handle other users. I used "max-procs" => 10, since my server has few users despite serving very large files.
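For comparison, here is a rough sketch of how a queue limit can pause one connection without blocking the whole process. It uses Python's asyncio, not lighty's event loop or the attached patch, and the function names, the maxsize of 4, and the sleep-based "slow link" are all assumptions for illustration. When the bounded queue is full, `await queue.put(...)` suspends only that producer coroutine, so other connections keep running.

```python
import asyncio


async def producer(queue, chunks):
    """Stand-in for reading backend output; waits while the queue is full."""
    for chunk in chunks:
        await queue.put(chunk)  # suspends only this coroutine when full
    await queue.put(None)       # sentinel: no more output


async def consumer(queue, delay, sent):
    """Stand-in for a client socket that accepts one chunk per `delay`."""
    while True:
        chunk = await queue.get()
        if chunk is None:
            break
        await asyncio.sleep(delay)  # simulated slow link
        sent.append(chunk)


async def serve_one(chunks, delay, maxsize=4):
    """Serve one hypothetical connection through a bounded queue."""
    queue = asyncio.Queue(maxsize=maxsize)
    sent = []
    await asyncio.gather(producer(queue, chunks), consumer(queue, delay, sent))
    return sent


async def serve_two():
    """Run a slow, large download alongside a small, fast one."""
    slow = serve_one([b"x" * 10] * 20, delay=0.01)
    fast = serve_one([b"y"] * 3, delay=0)
    return await asyncio.gather(slow, fast)
```

Calling asyncio.run(serve_two()) completes both transfers; the slow one never holds more than maxsize chunks in its queue.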

WARNING: Install this patch with caution. It will not crash your server, but it may make it inaccessible if lots of users are downloading large files at the same time. I doubt that this is the "correct" fix. However, I hope that this patch is useful to some people who are having this problem, and I hope it will help someone more knowledgeable to develop a better patch.

06/26/2007 10:39:35 PM changed by geoff@cs.hmc.edu

  • attachment bug760.patch added.

Patch to work around problem of crashing on large files

