From benjamin at octopuce.fr Fri Dec 12 02:25:43 2008
From: benjamin at octopuce.fr (Benjamin Sonntag)
Date: Fri, 12 Dec 2008 03:25:43 +0100
Subject: Bug : Assert error in exp_timer() | Child not responding to ping, killing it.
Message-ID: <4941CBA7.6040501@octopuce.fr>

Hi all,

(First of all, I'd be glad to get a login/pass on the trac so that I can
create a ticket for this one if it turns out to be a real bug ;) and help
the varnish community.)

I guess I found a bug :) Please find below the information I was able to
gather to start working on this issue.

We are using varnish 2.0.2 (Debian package version from lenny) on this
machine (from Dell):

Linux cache1b 2.6.25-2-amd64 #1 SMP Fri Jun 27 14:47:16 UTC 2008 x86_64 GNU/Linux
- 2 physical processors (8 pipelines total): Intel(R) Xeon(R) CPU X5460 @ 3.16GHz
- 8 * 4GB FB-DIMM (total 32GB RAM)
(yes, I like this machine ;) )

I found a similar error reported on the mailing list last July, but I don't
know how to solve it. We have set these parameters high enough (I guess):

cli_timeout 40 [seconds]
ping_interval 12 [seconds]

The main question is: how can I create a backtrace, and how can I obtain a
core dump to create this backtrace?

Thanks for your help. I will, of course, do my best to find a solution for
this issue.

Regards,
Benjamin Sonntag

Here is what syslog said (the most important part, I guess):

Dec 11 20:17:01 cache1b /USR/SBIN/CRON[30655]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 11 20:17:09 cache1b varnishd[63966]: Child (42470) not responding to ping, killing it.
Child (42470) died signal=6
Child (42470) Panic message: Assert error in exp_timer(), cache_expire.c line 303: Condition(oe2->timer_when >= oe->timer_when) not true.
thread = (cache-timeout)
Child cleanup complete
child (30657) Started
Child (30657) said Closed fds: 4 5 6 10 11 13 14
Child (30657) said Child starts
Child (30657) said managed to mmap 68719476736 bytes of 68719476736
Child (30657) said Ready
Child (30657) said Probe("GET /search/C=?definition=homepage HTTP/1.1^M
Child (30657) said Host: 192.168.131.101^M
Child (30657) said Connection: close^M
Child (30657) said ^M
Child (30657) said ", 4, 1)
Child (30657) said Probe("GET /search/C=?definition=homepage HTTP/1.1^M
Child (30657) said Host: 192.168.131.102^M
Child (30657) said Connection: close^M
Child (30657) said ^M
Child (30657) said ", 4, 1)
Child (30657) said Probe("GET /search/C=?definition=homepage HTTP/1.1^M
Child (30657) said Host: 192.168.131.107^M
Child (30657) said Connection: close^M
Child (30657) said ^M
Child (30657) said ", 4, 1)
Dec 11 20:20:01 cache1b /USR/SBIN/CRON[31317]: (root) CMD (if [ -x /etc/munin...

Here is the varnishlog extract (my server is quite busy, so I had a hard
time finding this place; awk|sort|uniq|grep were my friends ;) )

1137 StatSess c 192.168.131.8 43747 0 1 1 0 0 0 233 5
0 StatAddr - 192.168.131.8 0 285468 2281231 2281230 0 0 1873552 489132873 83107957357
1139 ReqStart c 192.168.131.7 51390 2047115919
1139 RxRequest c GET
1139 RxURL c /confidential/url/blabla&purge=1
1139 RxProtocol c HTTP/1.1
1139 RxHeader c Host: varnish:30000
1139 RxHeader c Accept: */*
1139 VCL_call c recv
1139 VCL_return c lookup
1139 VCL_call c hash
1139 VCL_return c hash
1139 VCL_call c miss
1139 VCL_return c fetch
1137 BackendOpen b be1b 192.168.131.30 35113 192.168.131.101 30000
1139 Backend c 1137 lb3 be1b
1137 TxRequest b GET
1137 TxURL b /confidential/url/blabla&purge=1
1137 TxProtocol b HTTP/1.1
1137 TxHeader b Host: varnish:30000
1137 TxHeader b Accept: */*
1137 TxHeader b X-Varnish: 2047115919
1137 TxHeader b X-Forwarded-For: 192.168.131.7
1157 SessionClose c remote closed
0 WorkThread - 0x43033cb0 start
0 CLI - Rd vcl.load boot ./vcl.1P9zoqAU.so
0 CLI - Wr 0 200 Loaded "./vcl.1P9zoqAU.so" as "boot"
0 CLI - Rd vcl.load test ./vcl.FANefPfn.so
0 WorkThread - 0x45037cb0 start
0 Backend_health - be2b Still sick 4--X-S-RH 1 3 8 0.021917 0.021917 HTTP/1.1 200 OK
0 Backend_health - be1b Still sick 4--X-S--- 0 3 8 0.000000 0.000000
0 Backend_health - be3b Still sick 4--X-S--- 0 3 8 0.000000 0.000000
0 WorkThread - 0x40c85cb0 start
0 CLI - Wr 0 200 Loaded "./vcl.FANefPfn.so" as "test"
0 CLI - Rd vcl.use test
0 CLI - Wr 0 200
0 CLI - Rd start
0 Debug - "Acceptor is epoll"
0 CLI - Wr 0 200
0 WorkThread - 0x4683acb0 start
0 WorkThread - 0x4703bcb0 start
0 WorkThread - 0x4783ccb0 start
0 WorkThread - 0x4803dcb0 start
11 SessionOpen c 192.168.131.9 49159 :30000
11 ReqStart c 192.168.131.9 49159 705356244
11 RxRequest c GET
11 RxURL c /confidential/url/blabla&ttl=120
11 RxProtocol c HTTP/1.1
11 RxHeader c Host: varnish:30000
11 RxHeader c Accept: */*
11 VCL_call c recv
11 VCL_return c lookup
11 VCL_call c hash
11 VCL_return c hash
11 VCL_call c miss
11 VCL_return c fetch
11 VCL_call c error
11 VCL_return c deliver
11 Length c 5
11 VCL_call c deliver
11 VCL_return c deliver
11 TxProtocol c HTTP/1.1
11 TxStatus c 503
11 TxResponse c Service Unavailable
11 TxHeader c Server: Varnish
11 TxHeader c Retry-After: 0
11 TxHeader c Content-Type: text/html; charset=utf-8
11 TxHeader c Content-Length: 5
11 TxHeader c Date: Thu, 11 Dec 2008 19:17:10 GMT
11 TxHeader c X-Varnish: 705356244
11 TxHeader c Age: 0
11 TxHeader c Via: 1.1 varnish
11 TxHeader c Connection: close

Then our VCL code (...
don't know if it's good or bad syntax, but it used to work)

backend be1b {
    .host = "192.168.131.101";
    .port = "30000";
    .probe = {
        .url = "/search/C=?definition=homepage";
        .timeout = 4s;
        .interval = 1s;
        .window = 8;
        .threshold = 3;
    }
}

backend be2b {
    .host = "192.168.131.102";
    .port = "30000";
    .probe = {
        .url = "/search/C=?definition=homepage";
        .timeout = 4s;
        .interval = 1s;
        .window = 8;
        .threshold = 3;
    }
}

backend be3b {
    .host = "192.168.131.107";
    .port = "30000";
    .probe = {
        .url = "/search/C=?definition=homepage";
        .timeout = 4s;
        .interval = 1s;
        .window = 8;
        .threshold = 3;
    }
}

director lb3 round-robin {
    { .backend = be1b; }
    { .backend = be2b; }
    { .backend = be3b; }
}

sub vcl_recv {
    set req.backend = lb3;
    if (req.url ~ "&nocache=1") {
        set req.url = regsub(req.url,"&nocache=1","");
        pass;
    }
    lookup;
}

sub vcl_hash {
    set req.hash += req.url;
    hash;
}

sub vcl_fetch {
    if (obj.status != 200 && obj.status != 403 && obj.status != 404) {
        restart;
    }
    if (obj.ttl < 3600s) {
        set obj.ttl = 3600s;
    }
    # If &ttl=number is present, set the TTL to that number of hours
    if (req.url ~ "&ttl=") {
        if (req.url ~ "&ttl=001") { set obj.ttl=3600s; }
        if (req.url ~ "&ttl=002") { set obj.ttl=7200s; }
        if (req.url ~ "&ttl=003") { set obj.ttl=10800s; }
        if (req.url ~ "&ttl=006") { set obj.ttl=21600s; }
        if (req.url ~ "&ttl=009") { set obj.ttl=32400s; }
        if (req.url ~ "&ttl=012") { set obj.ttl=43200s; }
        if (req.url ~ "&ttl=015") { set obj.ttl=54000s; }
        if (req.url ~ "&ttl=018") { set obj.ttl=64800s; }
        if (req.url ~ "&ttl=021") { set obj.ttl=75600s; }
        if (req.url ~ "&ttl=024") { set obj.ttl=86400s; }
        if (req.url ~ "&ttl=096") { set obj.ttl=345600s; }
        if (req.url ~ "&ttl=168") { set obj.ttl=604800s; }
        if (req.url ~ "&ttl=672") { set obj.ttl=2419200s; }
        set req.url = regsub(req.url,"&ttl=([0-9]+)","");
    }
    # If &purge=1 is present, purge the requested entry from the cache.
    if (req.url ~ "&purge=1") {
        set req.url = regsub(req.url,"&purge=1","");
        purge_url(req.url);
    }
}

sub vcl_error {
    set obj.http.Content-Type = "text/html; charset=utf-8";
    synthetic {"error"};
    deliver;
}

From perbu at linpro.no Fri Dec 12 08:01:59 2008
From: perbu at linpro.no (Per Buer)
Date: Fri, 12 Dec 2008 09:01:59 +0100
Subject: Bug : Assert error in exp_timer() | Child not responding to ping, killing it.
In-Reply-To: <4941CBA7.6040501@octopuce.fr>
References: <4941CBA7.6040501@octopuce.fr>
Message-ID: <49421A77.8080101@linpro.no>

Benjamin Sonntag wrote:
> Hi all,
>
> (first of all, I'll be glad to obtain a login/pass on the trac so that I
> may create a ticket for this one if it became a real bug ;) and help
> varnish community)

Please go to http://varnish.projects.linpro.no/register and register.
Then send me the username off list.

-- 
Per Buer - Head of Infrastructure and Operations - Redpill Linpro
Phone: 21 54 41 21 - Mobile: 958 39 117
http://linpro.no/ | http://redpill.se/

From phk at phk.freebsd.dk Fri Dec 12 10:50:19 2008
From: phk at phk.freebsd.dk (Poul-Henning Kamp)
Date: Fri, 12 Dec 2008 10:50:19 +0000
Subject: Bug : Assert error in exp_timer() | Child not responding to ping, killing it.
In-Reply-To: Your message of "Fri, 12 Dec 2008 03:25:43 +0100."
    <4941CBA7.6040501@octopuce.fr>
Message-ID: <9774.1229079019@critter.freebsd.dk>

Hi Benjamin,

I'll look at it.
Just wanted to point this out in the meantime:

> if (req.url ~ "&ttl=") {
> if (req.url ~ "&ttl=001") { set obj.ttl=3600s; }
> if (req.url ~ "&ttl=002") { set obj.ttl=7200s; }
> if (req.url ~ "&ttl=003") { set obj.ttl=10800s; }
> if (req.url ~ "&ttl=006") { set obj.ttl=21600s; }
> if (req.url ~ "&ttl=009") { set obj.ttl=32400s; }
> if (req.url ~ "&ttl=012") { set obj.ttl=43200s; }
> if (req.url ~ "&ttl=015") { set obj.ttl=54000s; }
> if (req.url ~ "&ttl=018") { set obj.ttl=64800s; }
> if (req.url ~ "&ttl=021") { set obj.ttl=75600s; }
> if (req.url ~ "&ttl=024") { set obj.ttl=86400s; }
> if (req.url ~ "&ttl=096") { set obj.ttl=345600s; }
> if (req.url ~ "&ttl=168") { set obj.ttl=604800s; }
> if (req.url ~ "&ttl=672") { set obj.ttl=2419200s; }

VCL supports other units of time than seconds, so for increased
readability, you could write:

    set obj.ttl = 1h;
    set obj.ttl = 2h;
    ...
    set obj.ttl = 1d;
    ...
    set obj.ttl = 1w;
    set obj.ttl = 4w;

-- 
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From benjamin at octopuce.fr Tue Dec 16 00:32:57 2008
From: benjamin at octopuce.fr (Benjamin Sonntag)
Date: Tue, 16 Dec 2008 01:32:57 +0100
Subject: Bug : Assert error in exp_timer() | (same bug, different log)
In-Reply-To: <4941CBA7.6040501@octopuce.fr>
References: <4941CBA7.6040501@octopuce.fr>
Message-ID: <4946F739.4000102@octopuce.fr>

Hi all,

I just had another crash on the same varnish server, with the log shown
below my signature.

Needless to say, I'm quite lost... I hope that most of this log means
nothing and that the really important issue here is "not responding to
ping, killing it." Maybe all the rest is only a consequence of the child
being killed?

As I set fairly big values for the child check timeouts and counts, I guess
it's not normal for a child to stay like that, waiting for something...

Is there a way to debug it properly?
Regards,
Benjamin Sonntag

Dec 15 19:39:46 cache1b varnishd[63966]: Child (30657) not responding to ping, killing it.
Dec 15 19:39:46 cache1b varnishd[63966]: Child (30657) died signal=6
Dec 15 19:39:46 cache1b varnishd[63966]: Child (30657) Panic message: Assert error in EXP_Rearm(), cache_expire.c line 242: Condition(oe->timer_idx != BINHEAP_NOIDX) not true.
thread = (cache-worker)
sp = 0x7f25cc591008 {
  fd = 1135, id = 1135, xid = 714117855,
  client = 192.168.131.9:33717,
  step = STP_LOOKUP,
  handling = HASH,
  ws = 0x7f25cc591078 {
    id = "sess",
    {s,f,r,e} = {0x7f25cc5917b0,,+388,(nil),+8192},
  },
  worker = 0x7f26168fdcb0 {
  },
  vcl = {
    srcname = {
      "/etc/varnish/default.vcl",
      "Default",
    },
  },
},
Dec 15 19:39:46 cache1b varnishd[63966]: Child cleanup complete
Dec 15 19:39:46 cache1b varnishd[63966]: child (24332) Started
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Closed fds: 4 5 6 10 11 13 14
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Child starts
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said managed to mmap 68719476736 bytes of 68719476736
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Ready
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Probe("GET /search/C=?definition=homepage HTTP/1.1^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Host: 192.168.131.102^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Connection: close^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said ^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said ", 4, 1)
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Probe("GET /search/C=?definition=homepage HTTP/1.1^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Host: 192.168.131.101^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Connection: close^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said ^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said ", 4, 1)
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Probe("GET /search/C=?definition=homepage HTTP/1.1^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Host: 192.168.131.107^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said Connection: close^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said ^M
Dec 15 19:39:46 cache1b varnishd[63966]: Child (24332) said ", 4, 1)

From benjamin at octopuce.fr Tue Dec 16 08:09:18 2008
From: benjamin at octopuce.fr (Benjamin Sonntag)
Date: Tue, 16 Dec 2008 09:09:18 +0100
Subject: Bug : Assert error in exp_timer() | (graph analysis)
In-Reply-To: <4946F739.4000102@octopuce.fr>
References: <4941CBA7.6040501@octopuce.fr> <4946F739.4000102@octopuce.fr>
Message-ID: <4947622E.4050702@octopuce.fr>

Hi,

Following up on the same bug (the main issue seems to be "child not
responding to ping, killing it."):

I found that there was a spike in the netstat statistics during the crash.
(In fact, the number of "established" connections keeps growing until
varnish crashes.)

At the following URLs you will find the cache usage during 3 crashes this
week, and the netstat graph for the same period (thanks, munin):

http://benjamin.sonntag.fr/download/cache1b-varnish_usage-week.png
http://benjamin.sonntag.fr/download/cache1b-netstat-week.png

I hope it helps in finding a solution (I keep searching; I will be in the
source again at the end of the week). Maybe a network connection freeing
procedure is missing somewhere.

Regards,
Benjamin Sonntag

From rurban at x-ray.at Tue Dec 16 09:02:14 2008
From: rurban at x-ray.at (Reini Urban)
Date: Tue, 16 Dec 2008 10:02:14 +0100
Subject: cygwin-1.5 + cygwin-1.7
Message-ID: <6910a60812160102t12f04cf2l1a712a68436f3a1b@mail.gmail.com>

Just a short notice: varnish-2.0.2 does not compile OOTB on cygwin-1.5.x,
because of a missing clock_gettime(CLOCK_MONOTONIC, ...) definition.

That means HAVE_CLOCK_GETTIME is set, but CLOCK_MONOTONIC is missing in
1.5.x; it is defined in the upcoming 1.7 cygwin release.
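
To illustrate the kind of guard this implies, here is a minimal sketch. It assumes HAVE_CLOCK_GETTIME comes from an autoconf-generated config.h, as the message suggests; the function name sketch_mono_time is hypothetical and this is not the actual libvarnish/time.c code. The idea is to require both the function and the CLOCK_MONOTONIC identifier before using the monotonic clock, and to fall back to gettimeofday() otherwise:

```c
#include <assert.h>
#include <sys/time.h>
#include <time.h>

/*
 * Hypothetical helper, NOT the real libvarnish/time.c code.
 * HAVE_CLOCK_GETTIME is assumed to be provided by config.h.
 * The monotonic clock is used only when both the function and the
 * CLOCK_MONOTONIC identifier are known at compile time.
 */
double
sketch_mono_time(void)
{
#if defined(HAVE_CLOCK_GETTIME) && defined(CLOCK_MONOTONIC)
	struct timespec ts;

	if (clock_gettime(CLOCK_MONOTONIC, &ts) == 0)
		return (ts.tv_sec + 1e-9 * ts.tv_nsec);
#endif
	/* Fallback: wall-clock time, which can be stepped. */
	struct timeval tv;
	int rc = gettimeofday(&tv, NULL);

	assert(rc == 0);
	return (tv.tv_sec + 1e-6 * tv.tv_usec);
}
```

Built without -DHAVE_CLOCK_GETTIME, the sketch always takes the gettimeofday() path, so it compiles on a system that lacks CLOCK_MONOTONIC at the cost of losing the monotonic guarantee.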
I haven't tried cygwin-1.5.x with HAVE_CLOCK_GETTIME undefined yet, though
it should work.

cygwin-1.7 /usr/include/time.h also has this:

#if defined(_POSIX_MONOTONIC_CLOCK)
/* The identifier for the system-wide monotonic clock, which is defined
 * as a clock whose value cannot be set via clock_settime() and which
 * cannot have backward clock jumps. */
#define CLOCK_MONOTONIC (clockid_t)4
#endif

Interestingly, _POSIX_MONOTONIC_CLOCK is not set by default, so I added it
manually. All this should probably be settled in autotools, and not here.
But varnish then needs a tighter check for _POSIX_MONOTONIC_CLOCK in
libvarnish/time.c

I'll keep you posted on whether this MacOS X warning also applies to cygwin
and compat has to be used:

"Fix build on MacOS X: add a fake clock_gettime() and fix some includes.
WARNING: varnish will build and run, but the lack of a monotonic clock may
lead to strange behaviour if the clock is stepped (rather than skewed)
while varnish is running."
http://projects.linpro.no/pipermail/varnish-commit/2006-October/001257.html

-- 
Reini Urban

From benjamin at octopuce.fr Thu Dec 18 10:57:58 2008
From: benjamin at octopuce.fr (Benjamin Sonntag)
Date: Thu, 18 Dec 2008 11:57:58 +0100
Subject: Bug: Child not responding to ping, killing it.
In-Reply-To: <4947622E.4050702@octopuce.fr>
References: <4941CBA7.6040501@octopuce.fr> <4946F739.4000102@octopuce.fr> <4947622E.4050702@octopuce.fr>
Message-ID: <494A2CB6.2010900@octopuce.fr>

Hi all (again ;) if I talk too much, tell me and I will stop),

I continue to investigate this problem.
It seems that varnish is really keeping ESTABLISHED connections to the
backend for a very, very, very long time:

cache1b# netstat -apnt |grep ESTABLISHED|awk '{print $5}' | cut -f 1 -d ':'| sort | uniq -c | sort -g
      6 client1
      8 client2
     10 client3
> total 24 open connections
     43 backend1
     50 backend2
     74 backend3
> total 167 open connections !!!

The strange thing in that situation is that, on the BACKEND side, the
number of ESTABLISHED connections is quite low:

for i in be1b be2b be3b ; do ssh $i netstat -apnt |grep :30000 |grep ESTABLISHED ; done | wc -l
20

Maybe the problem is in the BACKEND REUSE code? Maybe it is in the PROBE
code?

Maybe there is not really any problem on the varnish side. I have another
idea: it may come from the fact that the backends are behind an ipvs
load balancer (yes, our config is quite complex...). This ipvs load
balancer is in NAT mode, so there is a NAT (and therefore a connection
tracking list) somewhere between varnish and the backends.

Maybe the connection between varnish and its backend is using HTTP
keepalive, so the TCP channel is not closed at the end of a request, and
maybe it is closed some time AFTER the NAT connection tracking timeout. In
that case, varnish never receives the TCP connection closing packet, and
thus keeps the connection open until ... it fills up its connection stack.

There are so many scenarios that I don't think I will be able to test all
of them before my client (the user of this big cluster) kicks varnish off
;) but I will try them in order to find a solution. To be continued...

Regards,
B.

From phk at phk.freebsd.dk Thu Dec 18 10:59:37 2008
From: phk at phk.freebsd.dk (Poul-Henning Kamp)
Date: Thu, 18 Dec 2008 10:59:37 +0000
Subject: Bug: Child not responding to ping, killing it.
In-Reply-To: Your message of "Thu, 18 Dec 2008 11:57:58 +0100."
    <494A2CB6.2010900@octopuce.fr>
Message-ID: <89629.1229597977@critter.freebsd.dk>

In message <494A2CB6.2010900 at octopuce.fr>, Benjamin Sonntag writes:

>I continue to investigate this problem. It seems that varnish is really
>keeping ESTABLISHED connections to the backend for a very, very, very
>long time :

Varnish does not close backend connections, it leaves them open until
the backend times them out.

How your TCP connections get out of sync between varnish host and backend
host I have no idea...

-- 
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
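
On the NAT hypothesis raised earlier in this thread: one generic way for a client to notice that an idle connection has been silently dropped by a connection-tracking NAT is TCP keepalive. The following is a minimal, Linux-only sketch. Nothing in this thread says Varnish 2.0.2 does this; the helper name sketch_enable_keepalive and any timing values passed to it are assumptions for illustration only.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Hypothetical helper (not Varnish code): enable TCP keepalive probes
 * on an existing TCP socket. TCP_KEEPIDLE, TCP_KEEPINTVL and
 * TCP_KEEPCNT are Linux-specific socket options.
 *
 *   idle_s     - seconds of silence before the first probe
 *   interval_s - seconds between unanswered probes
 *   count      - unanswered probes before the connection is dropped
 *
 * Returns 0 on success, -1 if any setsockopt() call fails.
 */
int
sketch_enable_keepalive(int fd, int idle_s, int interval_s, int count)
{
	int on = 1;

	assert(fd >= 0);
	if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) != 0)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,
	    &idle_s, sizeof idle_s) != 0)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL,
	    &interval_s, sizeof interval_s) != 0)
		return (-1);
	if (setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,
	    &count, sizeof count) != 0)
		return (-1);
	return (0);
}
```

With, say, idle_s = 60, interval_s = 10 and count = 3, a peer that has gone away would be noticed after roughly 60 + 3 * 10 = 90 seconds of silence. For the probes to keep a NAT entry alive at all, the idle time would have to be shorter than the NAT's tracking timeout, which is not given anywhere in this thread.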