Ticket #575 (new defect)

Opened 2 years ago

Last modified 2 months ago

high-time connections in handle-req impact fastcgi overload calculation

Reported by: moorman Owned by: jan
Priority: highest Milestone:
Component: mod_fastcgi Version: 1.4.19
Severity: critical Keywords:
Cc: ts77@… Blocked By:
Need User Feedback: no Blocking:

Description

This ticket is a summary of details presented to Jan via IRC on 2006-03-10.

Based on a pool of six lighttpd heads receiving traffic from a load balancer, all six heads reached a terminal overload state where they could not recover without restart. From internal statistics, fastcgi load was 100+ on each head. After restart of lighttpd on a head, once it was picked up by the load balancer, fastcgi load stabilized at ~20.

fastcgi.backend.main-php.0.connected: 205994
fastcgi.backend.main-php.0.died: 0
fastcgi.backend.main-php.0.disabled: 0
fastcgi.backend.main-php.0.load: 144
fastcgi.backend.main-php.0.overloaded: 488
fastcgi.backend.main-php.1.connected: 155287
fastcgi.backend.main-php.1.died: 0
fastcgi.backend.main-php.1.disabled: 0
fastcgi.backend.main-php.1.load: 144
fastcgi.backend.main-php.1.overloaded: 488
fastcgi.backend.main-php.load: 288
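
For anyone comparing heads, the counters above can be checked mechanically. The following is a minimal sketch, assuming the statistics are available as plain text in the `fastcgi.backend.<name>.<index>.<counter>: <value>` form shown here; the function names and the `load_limit` threshold are hypothetical and are not part of lighttpd.

```python
import re
from collections import defaultdict

# Matches per-process counters such as "fastcgi.backend.main-php.0.load: 144".
# The aggregate line "fastcgi.backend.main-php.load: 288" has no process index
# and is intentionally skipped.
COUNTER_RE = re.compile(
    r"^fastcgi\.backend\.(?P<name>.+)\.(?P<idx>\d+)\.(?P<key>\w+):\s*(?P<value>\d+)$"
)

def parse_backend_counters(text):
    """Return {(backend_name, process_index): {counter: value}}."""
    stats = defaultdict(dict)
    for line in text.splitlines():
        m = COUNTER_RE.match(line.strip())
        if m:
            stats[(m["name"], int(m["idx"]))][m["key"]] = int(m["value"])
    return dict(stats)

def overloaded_processes(stats, load_limit):
    """List processes whose 'load' counter exceeds load_limit (hypothetical threshold)."""
    return sorted(key for key, counters in stats.items()
                  if counters.get("load", 0) > load_limit)

sample = """\
fastcgi.backend.main-php.0.load: 144
fastcgi.backend.main-php.0.overloaded: 488
fastcgi.backend.main-php.1.load: 144
fastcgi.backend.main-php.load: 288
"""
stats = parse_backend_counters(sample)
print(overloaded_processes(stats, load_limit=100))
```

With a limit of 100 both processes in the sample are reported, which matches the "100+ on each head" observation; a healthy head at a load of ~20 would not be.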

Confirmed at the load balancer that this was not a high amount of inbound traffic. lighttpd server status showed a reasonable distribution of various pages waiting in handle-req status with high values for the Time column.

338 connections
hWhhhhrhhhhhhhhhWrhhhrhhhhhhhrhWrhrhhhhhhhWhhhhhhh
hhhhhhhhhrhhrhhhhhhhhhhhhhhhhhhhhrhhhhhhhhhhhhrrhh
rhhhhhrWrrrrhhhhhhrhhhhhhhrhhhhhrhhhhhhrhWhhhhrrhr
hhrhhhhhhhhhhhhWhhhrhhhrhhrhhhrhhhWhhhhhhhhhhhrhhh
hhrrhhrhhrhhhrhrrhhhhhWhhhhhhhWhrhrrrhhhrrhhhhrhhh
WWrrhrrrrWrhrhWrrrrrrrhrWhrrhrrhhrhhhhrhrhhhWhrWrr
hrhrhhhhhhhhrhhrhhhWhrhhhrrrrrrhhhhhhh
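
The scoreboard above can be tallied rather than eyeballed. This is a small sketch, assuming the usual server-status legend in which `h` is handle-request, `r` is read and `W` is write; the helper name `count_states` is made up for illustration.

```python
from collections import Counter

def count_states(scoreboard):
    """Count each state letter in a scoreboard, ignoring whitespace and newlines."""
    return Counter(ch for ch in scoreboard if not ch.isspace())

# First row of the scoreboard above; paste the full block to tally all rows.
row = "hWhhhhrhhhhhhhhhWrhhhrhhhhhhhrhWrhrhhhhhhhWhhhhhhh"
counts = count_states(row)
for state, n in counts.most_common():
    print(state, n)
```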

Approximately 150 connections shown in handle-req status have Time of 2756 or higher. Approximately 30-40 connections of this set have Time of 5000 or higher.

The lighttpd error log shows a continual cycle of overload, disable, wait, and re-enable. Heads will not recover without a restart, but a head works fine once it has been restarted.

Based on discussion via IRC, as a workaround measure, plan is to add a global timeout for handle-req, such that these long-running connections in handle-req status will be shed.
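
The workaround idea can be illustrated as follows. This is only a sketch of the selection logic, not lighttpd code and not the patch itself: the `Conn` record, the state names and the `max_handle_req_age` limit are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Conn:
    conn_id: int
    state: str   # e.g. "handle-req", "read", "write"
    age: int     # seconds spent in the current state (the Time column)

def split_by_timeout(conns, max_handle_req_age):
    """Return (kept, shed); shed holds handle-req connections older than the limit."""
    kept, shed = [], []
    for c in conns:
        if c.state == "handle-req" and c.age > max_handle_req_age:
            shed.append(c)
        else:
            kept.append(c)
    return kept, shed

conns = [
    Conn(1, "handle-req", 2756),
    Conn(2, "handle-req", 12),
    Conn(3, "write", 5000),       # not in handle-req, so left alone
    Conn(4, "handle-req", 5000),
]
kept, shed = split_by_timeout(conns, max_handle_req_age=600)
print([c.conn_id for c in shed])
```

Only connections stuck in handle-req are shed, so the load figure would stop counting them while slow writes are still governed by their own idle timeout.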

-Jacob

Attachments

Change History

  Changed 2 years ago by jbyers@…

I see the same condition with lighttpd-1.4.11. Over time, many php fastcgi processes build up with large handle-req times. These php processes can be successfully killed and are then respawned. I do not, however, see anything in the lighttpd error log corresponding with processes falling into this state. PHP is not segfaulting, nor running out of memory.

The same behavior occurs with identical builds of PHP 5.1.2 and 5.1.6, the latter of which has a completely re-written fastcgi implementation. lighttpd-1.4.11 on AMD64 RHEL4.

  Changed 13 months ago by sblam@…

  • pending unset

I think the problem still persists in 1.4.16.

My log is full of this:

2007-08-08 11:02:46: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:49: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:02:49: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
2007-08-08 11:02:59: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:59: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:03:02: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
...

and while it isn't all locked-up, it fills with:

2007-08-08 11:21:37: (server.c.1165) NOTE: a request for /foo timed out after writing 26280 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
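
To see how often a head goes through the overload / disable / re-enable cycle, the error log can be counted by message type. The following is a rough sketch, assuming entries start with a `YYYY-MM-DD hh:mm:ss: (` stamp as in the excerpt above; the event names and helper functions are invented for this example, and the message patterns are taken only from the lines quoted in this ticket.

```python
import re
from collections import Counter

# Message fragments copied from the log excerpt in this ticket.
EVENTS = {
    "overloaded": re.compile(r"backend is overloaded"),
    "all_down": re.compile(r"all handlers for .* are down"),
    "re_enabled": re.compile(r"fcgi-server re-enabled"),
}
STAMP = r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}: \("

def split_entries(text):
    """Split log text into entries, each beginning at a timestamp.

    Works whether entries are on separate lines or run together on one line.
    """
    parts = re.split(rf"(?={STAMP})", text)
    return [p.strip() for p in parts if p.strip()]

def count_events(text):
    """Count how many entries match each known event pattern."""
    counts = Counter()
    for entry in split_entries(text):
        for name, pattern in EVENTS.items():
            if pattern.search(entry):
                counts[name] += 1
    return counts

sample = (
    "2007-08-08 11:02:46: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it "
    "for 2 seconds and send the request to another backend instead: reconnects: 0 load: 138 "
    "2007-08-08 11:02:49: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down. "
    "2007-08-08 11:02:49: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket"
)
print(count_events(sample))
```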

  Changed 11 months ago by ts77

  • cc ts77@… added

  Changed 9 months ago by anonymous

  • version changed from 1.4.11 to 1.4.18

I still experience this same issue in 1.4.18. After a server reboot it might work for another couple of weeks.

  Changed 9 months ago by oschonrock

We saw what may be a related issue with overloading (to do with PHP not indicating to lighty that it is in fact overloaded):

http://trac.lighttpd.net/trac/ticket/1488

have you considered trying to launch the php-fcgi server separately with spawn_fcgi as described in that issue?

  Changed 8 months ago by pat@…

  • priority changed from high to highest

We also experience this problem on a regular basis across three web servers under reasonable load (around 1M hits per day each - although the problem does not appear related to load and often occurs well outside of peak times).

We see the problem with the following configurations:

PHP4.4.4 (eAccelerator) under spawn_fcgi lighttpd 1.4.13

PHP5.2.5 (XCache/Suhosin) spawned directly by Lighty lighttpd 1.4.18

I have altered the priority, as this appears to be a show-stopping bug for PHP FastCGI under lighttpd.

Has anyone tried 1.5.x-svn?

  Changed 7 months ago by Aleksey Korzun

Same problem here, I was advised to upgrade to 1.5.x branch. I doubt that will change anything.

  Changed 7 months ago by ff@…

Same issues here. Has anyone experienced issues with the patch supplied? I would like to see some action in this "bug" (I know it is basically a PHP-not-obeying-fastcgi-standards-issue).

Thank you!

follow-up: ↓ 10   Changed 7 months ago by pat@…

WORKING RESOLUTION:

Given the comment above, and given that the 1.5.x branch is now close to release, (and given that 1.4.x was causing severe instability in our production environment) it seemed prudent to try 1.5.x to determine if this would have any effect. I built 1.5.0-r1992 from SVN using the following configuration:

./configure --prefix=/usr --libdir=/usr/lib/lighttpd \
            --with-bzip2 \
            --with-attr \
            --with-linux-aio \
            --with-openssl=/usr/include/openssl

/etc/lighttpd.conf
[...]
proxy-core.balancer               = "sqf"
proxy-core.allow-x-sendfile       = "enable"
proxy-core.allow-x-rewrite        = "enable"

$HTTP["url"] =~ "\.php" {
  proxy-core.protocol             = "fastcgi"
  proxy-core.max-pool-size        = 4 # (set to same as PHP_FCGI_CHILDREN)
  proxy-core.backends             = ( "unix:/tmp/.fcgi-php.socket" )
  proxy-core.rewrite-request = (
    "_pathinfo" => ( "\.php(/.*)" => "$1" )
  )
}
[...]

This configuration has thus far resolved the PHP lock-up issue that we have been experiencing. We have not experienced server downtime for over 4 days (we were previously experiencing downtime on individual members of our cluster several times per day).

In reference to the above comment (ff@…):

I don't pretend to be an expert (and indeed I know little about the FastCGI protocol); however, several people have suggested that PHP's mis-implementation of the FastCGI protocol does _not_ cause issues when running under spawn-php. I do not know whether this is indeed the case, but I experienced the issue described in this ticket under both configurations (spawn-php or lighttpd-spawned interpreters) as noted in my earlier post. It is therefore possible that these issues are entirely separate, but I am not able to determine this.

If it is of any use to those who may be attempting to debug this issue, it is worth noting that I also experienced this issue using all four of the following configurations (under lighttpd 1.4.x):

  • spawn-php over TCP/IP
  • spawn-php over unix socket
  • lighttpd spawns single PHP process which spawns own children (unix socket)
  • lighttpd spawns many individual PHP interpreters (unix socket)

Cheers, Patrick

in reply to: ↑ 9   Changed 6 months ago by anonymous

I've upgraded to 1.5 now and I don't get a build-up of handle-req any more. Now it's write-content connection times that go into the high thousands. I've set server.max-write-idle to 200 but that hasn't solved anything. Any ideas?

  Changed 6 months ago by Aleksey Korzun

Thanks, Pat.

I will wait until 1.5 is stable to roll it out to production. This looks promising so far!

in reply to: ↑ description   Changed 2 months ago by georgexsh

  • version changed from 1.4.18 to 1.4.19

It seems that 1.4.19 + php 5.2.4 + xcache have the same issue.
