Bug #1583 (closed)

lighttpd crashes after two days

Added by Anonymous about 16 years ago. Updated about 16 years ago.

Status:
Fixed
Priority:
Normal
Category:
core
Target version:
-
ASK QUESTIONS IN Forums:

Description

I'm using lighttpd 1.4.18 (packaged by Dag for RHEL4, lighttpd-1.4.18-1.el4.rf.src.rpm) on a CentOS 4 box. It's a public mirror server for a couple of Linux distros, with a fair amount of traffic and requests.

Only static pages and files are served, nothing fancy, no CGI, only 2 modules loaded (mod_access and mod_accesslog). The files served range from small rpm/deb packages up to 3-4 GB .iso files (like a normal public mirror).

Lighttpd crashes every 2 days for no apparent reason. There's nothing in the access or error logs indicating this crash. I do have JFFNMS monitoring the lighttpd service, and the memory usage never goes past 6-7 MB. When I restart the process, it works fine, fast, without problems (although there is the occasional connection reset when upgrading with APT, but I guess that's from bug #675).

I've run it through valgrind according to the bug reporting page; see the attached valgrind log, especially the end. I'm planning to compile it from source and try that, see how it works out - maybe the binary version I've installed is bugged somehow.

I've been using it for 6 days and it has crashed twice, each time after about two days of uptime.

-- gimre


Files

lighttpd.24900 (9.59 KB) - valgrind output -- gimre Anonymous, 2008-03-06 10:36
lighttpd-gdb-bt.txt (1.96 KB) - gdb output -- gimre Anonymous, 2008-03-07 15:28
Actions #1

Updated by stbuehler about 16 years ago

A backtrace of the crash would be really useful; and if it dies under heavy load, I suspect it is #1562.

Actions #2

Updated by Anonymous about 16 years ago

I tried to get a backtrace, but the binary wasn't compiled with any debug info, so I recompiled it from source (1.4.18.tar.gz) with the '-g' option, and then I got something. Hope this helps a little bit more. See attachment.

-- gimre

Actions #3

Updated by stbuehler about 16 years ago

Hmm... I don't think that backtrace helps us - the code referenced is well tested and should not fail on its own, so I think the memory got corrupted by something else and lighty crashed later.

I still think it could be related to #1562 (out of bounds array access), so if you want you can try the pre-release of 1.4.19: http://blog.lighttpd.net/

Perhaps it would be good to know what is in the buffer (b=0x9ab5448); with the following gdb commands you could have looked at that (too late now, I know, but perhaps next time, if it crashes at the same place):


frame 2
print *b
Actions #4

Updated by Anonymous about 16 years ago

Replying to stbuehler:

Perhaps it would be good to know what is in the buffer (b=0x9ab5448); with the following gdb commands you could have looked at that (too late now, I know, but perhaps next time, if it crashes at the same place):

frame 2
print *b

I actually didn't quit gdb; I was thinking you might need some more info, so here it is:


(gdb) frame 2
#2  0x0805c4f7 in buffer_copy_string_buffer (b=0x9ab5448, src=0x9c14060) at buffer.c:163
163             return buffer_copy_string_len(b, src->ptr, src->used - 1);
(gdb) print *b
$2 = {ptr = 0x0, used = 0, size = 64}

Before the crash(es) I noticed a lot of connections to port 80 - like 600-700 (or more) simultaneous connections in various states (as per netstat -tnp).

Anything else I can help with in this gdb session?

Actions #5

Updated by stbuehler about 16 years ago

Crazy...^^

Okay. It could be that your assert() function doesn't work and malloc returned 0 - but I don't think that is the case, as it doesn't look like your server was out of RAM.

So it looks like something set ptr to NULL but left size = 64; the buffer routine thinks there is enough space and just tries to reuse it, without checking ptr for zero (and checking wouldn't help in every case anyway - if something corrupts the buffer struct, it could just as well leave an invalid non-zero pointer).
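
A minimal sketch of that failure mode in C (not the upstream lighttpd source; the struct fields just mirror what the gdb session above printed, and buffer_copy_len_sketch is an illustrative stand-in for the real copy routine):

#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *ptr;   /* backing storage */
    size_t used;  /* bytes in use, including the trailing '\0' */
    size_t size;  /* allocated capacity */
} buffer_sketch;

/* Copies len bytes into b, growing the allocation only when size is
   too small - the same "trust size, reuse the old allocation" idea. */
static int buffer_copy_len_sketch(buffer_sketch *b, const char *s, size_t len) {
    if (NULL == b || NULL == s) return -1;
    if (b->size < len + 1) {
        char *np = realloc(b->ptr, len + 1);
        if (NULL == np) return -1;
        b->ptr = np;
        b->size = len + 1;
    }
    /* If something else already corrupted the struct (ptr == NULL but
       size == 64), the branch above is skipped and this memcpy writes
       through a NULL pointer - a crash that shows up far away from
       whatever did the corrupting, just like the backtrace here. */
    memcpy(b->ptr, s, len);
    b->ptr[len] = '\0';
    b->used = len + 1;
    return 0;
}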

Do you use any 3rd-party modules? I didn't find anything suspicious happening with con->physical.basedir (that is the buffer) in the upstream source.

So the possibility of #1562 still remains ;-)

And I think you can close your gdb session; I cannot think of anything else useful.

Actions #6

Updated by Anonymous about 16 years ago

Nope, I'm not using any fancy modules - in fact only mod_access and mod_accesslog, no virtual hosts, just the basic setup with one docroot.

In the netstat output, however, I did see a lot of connections from the same IP address, which would explain the high connection count, so I'm going to try mod_evasive and limit the concurrent connections per IP a bit, and see how it works out.
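
For reference, the kind of per-IP cap I have in mind would be something like this in lighttpd.conf (the limit of 10 is just a first guess to tune, not a recommendation):

server.modules += ( "mod_evasive" )

# cap simultaneous connections per client IP
evasive.max-conns-per-ip = 10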

After that, I'll still try the pre-release without the limit.

I'll update this ticket if I find something.

-- gimre

Actions #7

Updated by admin about 16 years ago

like 600-700 (or more) simultaneous connections

What does /server-status say?
And what is your max FDs value?
If it's the default (1024), you probably ran into http://trac.lighttpd.net/trac/ticket/1562

Actions #8

Updated by Anonymous about 16 years ago

Replying to Olaf van der Spek:

What does /server-status say?
And what is your max FDs value?
If it's the default (1024), you probably ran into http://trac.lighttpd.net/trac/ticket/1562

Yes, it's the default; I didn't mess with it. Maybe I should increase it? That won't resolve the problem, only delay it, right? I'm not sure I completely understand what that #1562 bug is all about.

I just enabled server-status to see what it says. For now the connections are increasing steadily, most of them in (W)rite state, serving files.

(btw, i have problems submitting comments to this ticket, see http://trac.edgewall.org/ticket/6975)

Actions #9

Updated by admin about 16 years ago

Maybe I should increase it? That won't resolve the problem, only delay it, right?

Yes, you should. ;)
It should solve the problem, as there is also a max-connections limit, which is 1024 by default. If you triple max FDs, you'll hit max connections and not max FDs, so you won't hit #1562.
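
In lighttpd.conf that would be something along these lines (the exact number doesn't matter much, as long as it stays comfortably above the connection limit):

# raise the fd limit well above the (default 1024) connection limit,
# so the fd limit is never the first thing to run out (see #1562)
server.max-fds = 3072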

Actions #10

Updated by Anonymous about 16 years ago

It seems increasing server.max-fds to 3072 helped - no crashes for 3 days. I'll keep monitoring it, and I'll definitely upgrade to 1.4.19 when it's ready.

Thanks for all the help.

-- gimre

Actions #11

Updated by admin about 16 years ago

1.4.19 has been released already.

Actions #12

Updated by stbuehler about 16 years ago

  • Status changed from New to Fixed
  • Resolution set to duplicate

Okay, this really seems to be a duplicate of #1562.
