Ticket #897 (reopened defect)

Opened 2 years ago

Last modified 2 weeks ago

lighttpd responds 500 to ALL requests after brief period of overload (it never recovers from "cool-down" period)

Reported by: namosys Assigned to: jan
Priority: highest Milestone: 1.4.20
Component: mod_fastcgi Version: 1.4.19
Severity: critical Keywords: patch
Cc: jan Blocking:
Need Feedback: 1

Description

I wrote a simple PHP script:

<?php

// Busy loop: build a large string so each request keeps a PHP backend occupied.
$r = "";
for ($i = 0; $i < 1000000; ++$i) {
    $r .= "sdjkfhsldkjfhsdlf" . time();
}

?>

Then I bound lighty to localhost and started ab:

/usr/local/apache2/bin/ab -n 100000 -c 200 http://127.0.0.1/s.php

top shows that the php processes are very busy:

  PID USERNAME   THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
29035 www          1 125    0 81656K 14240K RUN      0:33  5.27% php
29045 www          1 125    0 90428K 14248K *SYSMA   0:33  5.22% php
29047 www          1 125    0 90428K 14244K *SYSMA   0:33  5.18% php
29034 www          1 124    0 90668K 16776K *SYSMA   0:34  5.13% php
29039 www          1 125    0 90396K 22264K *SYSMA   0:34  5.13% php
29042 www          1 124    0 90492K 23084K RUN      0:33  5.13% php
29037 www          1 124    0 90396K 17300K RUN      0:33  5.13% php
29036 www          1 124    0 90332K 14196K *SYSMA   0:34  5.08% php

After about 30 seconds, lighty.error.log said:

2006-10-25 18:20:15: (mod_fastcgi.c.3423) all handlers for  /s.php on .php are down. 

After that, lighty answers 500 to every request.

I interrupted ab and waited until the php processes were idle, but lighty still refused to reply.

Change History

01/19/2007 08:48:39 AM changed by anonymous

this bug is present in lighty 1.4.13 too, on FreeBSD 6

(follow-up: ↓ 3 ) 02/19/2007 04:47:35 AM changed by anonymous

confirmed on FreeBSD 6.1-RELEASE-p14

any news on this?

(in reply to: ↑ 2 ) 02/19/2007 04:48:29 AM changed by anonymous

Replying to anonymous:

confirmed on FreeBSD 6.1-RELEASE-p14 any news on this?

forgot to add -- lighttpd-1.4.13

02/19/2007 12:50:58 PM changed by jan

  • status changed from new to closed.
  • resolution set to wontfix.

This is the expected behaviour on 1.4.x and is documented in Docs:PerformanceFastCGI in the wiki.

In 1.5.0 we do the scheduling of requests in the module ourselves and send a 502/503 in case we can't connect to the backends. The overall problem stays the same: if you don't have enough backends to handle the load, some requests have to be dropped.

02/25/2007 09:20:41 PM changed by anonymous

  • status changed from closed to reopened.
  • resolution deleted.

Hi Jan,

The problem in my case is that after such an "overload" incident, lighty responds with 500 to _all_ php requests. You are forced to restart it to get it back into "regular operation mode."

Is this going to be addressed?

Also, is there something you recommend we do so that backends are available most of the time (if not always)?

Thank you

03/02/2007 11:01:01 AM changed by jan

The backends should be reactivated after the cool-down period, right. Can you attach the output of

$ lsof -Pnp <pidofphp> -p <pidoflighty>

We have to see the state of the connections when you see this problem.

04/09/2007 08:46:48 AM changed by hutuworm

confirmed on RHEL 5 + lighttpd 1.4.13

06/05/2007 12:12:38 PM changed by njaguar

This also occurs on my site, specifically after being massively DDoSed (but probably under the same test case as above, as well) using lighttpd 1.4.11 + FreeBSD 4.7-RELEASE #3

Please note that it does not always occur after a DDoS, but only sometimes, which is even more odd.

I have to kill lighttpd and restart it manually to fix the issue; it never resolves itself. This morning it was down for almost 2 hours (ouch) with the exact same error as above (all backends are marked busy -> 500 error), even after all the attacks had subsided.

08/19/2007 02:48:57 AM changed by bb

  • blocking changed.
  • pending changed.

Yup, I have the same problem: Mandriva Linux 2007.1 using web.py with fcgi on an old Pentium II 450 MHz. The problem is apparent on pages that are pretty heavy (lots of database activity and text processing); manually refreshing such a page a couple of times in quick succession causes the server to "lock up" and send 500 for a couple of minutes, long after the python process has finished. During this time, the lighttpd process is at 99% CPU usage.

09/04/2007 04:52:50 AM changed by venkatraju@gmail.com

  • version changed from 1.4.12 to 1.4.17.

I see a similar problem where lighttpd stops forwarding any fastcgi requests to my app and returns 500. This does not get resolved automatically - the only option is a restart of lighttpd. Tried a little hack in mod_fastcgi.c and it seemed to solve the problem. More details in the forum post http://forum.lighttpd.net/topic/16057

03/03/2008 01:05:42 PM changed by anonymous

  • cc set to jan.
  • pending set to 1.

I believe I found the problem; take a look at mod_fastcgi.c. On line 2884, a process is assigned the state PROC_STATE_DIED_WAIT_FOR_PID. The next time fcgi_restart_dead_procs is called, the switch statement matches this state (line 2638). However, since the process is still alive, execution ends up on line 2671. At this point the process is not dead, merely overloaded. Because the process is left in the old state, the active_procs counter is decremented several times, underflows, and wraps around to a huge positive value (it's a size_t, so it cannot go negative). Note: the line numbers I refer to are from 1.4.18.

I am posting a patch that worked for me, but I don't recommend using it until a lighttpd developer verifies it since this is my first time looking at and working with the code...

--- mod_fastcgi.c.orig  Mon Mar  3 04:59:34 2008
+++ mod_fastcgi.c       Mon Mar  3 04:59:48 2008
@@ -2669,7 +2669,11 @@
                        }

                        /* fall through if we have a dead proc now */
-                       if (proc->state != PROC_STATE_DIED) break;
+                       if (proc->state != PROC_STATE_DIED) {
+                               proc->state = PROC_STATE_OVERLOADED;
+                               host->active_procs++;
+                               break;
+                       }

                case PROC_STATE_DIED:
                        /* local procs get restarted by us,

03/05/2008 10:27:25 AM changed by anonymous

  • version changed from 1.4.17 to 1.4.18.

03/05/2008 10:32:29 AM changed by anonymous

  • summary changed from lighty answers 500 to all requests when all backends are marked busy to lighty answers 500 to all requests when all backends are marked busy (INCLUDES PATCH, SOMEONE PLEASE CHECK).

03/09/2008 07:16:45 PM changed by anonymous

  • keywords set to patch.

03/12/2008 02:11:43 PM changed by anonymous

  • version changed from 1.4.18 to 1.4.19.

03/24/2008 11:09:36 PM changed by anonymous

I have been fighting this bug for the last few weeks. After spending a day in gdb fixing it, I find that someone has *already* fixed it. Why is there no developer traction on this bug, especially given that there is a patch?

04/20/2008 05:56:31 AM changed by anonymous

  • summary changed from lighty answers 500 to all requests when all backends are marked busy (INCLUDES PATCH, SOMEONE PLEASE CHECK) to lighttpd responds 500 to ALL requests after brief period of overload (it never recovers from "cool-down" period).

Perhaps the title is misleading. I am changing it in the hopes that it will attract some attention from the developers. I have been using the patch posted above for quite a while and haven't had any problems with it. I hope it will get fixed in the next release.

By the way, an extra note to the developers: try the method outlined in the first post. After a while, lighttpd will respond 500 to some requests. This is not the problem. Now stop the ab process and try executing the script from your browser. All subsequent requests will return 500 no matter how long you wait. This is the problem.

05/04/2008 01:40:16 PM changed by stbuehler

  1. The patch doesn't convince me:
    • I couldn't find the given line numbers.
    • The proc isn't counted as active in state OVERLOADED, so even if we do not count active_procs correctly, the patch will probably not really fix it.
    • A local proc should only get into that state if it killed the socket, so even if it is not dead, it is assumed that it will not work anymore. But perhaps we really shouldn't care about that and just try to use it again.
  2. I tried reproducing it (I used usleep instead of busy loops), but it just worked as it should.
