From info at dubistmeinheld.de Thu Jun 1 11:51:00 2017 From: info at dubistmeinheld.de (info at dubistmeinheld.de) Date: Thu, 1 Jun 2017 13:51:00 +0200 Subject: Can't create more than 494 threads Message-ID: Hi, I've been using Varnish for a couple of years and I'm very satisfied. Thank you for the work! Recently I set up a new server with Varnish 5.0. For some reason, threads_created is stuck at 494 and threads_failed increases every second, even with no load (-s malloc,2G -p thread_pools=2 -p thread_pool_min=250 -p thread_pool_max=2000 -p thread_pool_fail_delay=2). MAIN.threads 494 . Total number of threads MAIN.threads_limited 0 0.00 Threads hit max MAIN.threads_created 494 0.01 Threads created MAIN.threads_destroyed 0 0.00 Threads destroyed MAIN.threads_failed 356 0.01 Thread creation failed I have been trying for a couple of days to fix this issue and I'm looking in the area of a kernel parameter. But I'm completely stuck on what's causing it. I would appreciate any help pointing me in the right direction. Best regards, Jens From japrice at gmail.com Thu Jun 1 16:52:42 2017 From: japrice at gmail.com (Jason Price) Date: Thu, 1 Jun 2017 12:52:42 -0400 Subject: Can't create more than 494 threads In-Reply-To: References: Message-ID: dmesg might help. Log files should indicate if you're hitting an 'open files limit' issue... I don't think varnishlog will show the kind of errors you're looking for here. On Thu, Jun 1, 2017 at 7:51 AM, info at dubistmeinheld.de < info at dubistmeinheld.de> wrote: > Hi, > > I've been using Varnish for a couple of years and I'm very satisfied. Thank > you for the work! > > Recently I set up a new server with Varnish 5.0. For some reason, > threads_created is stuck at 494 and threads_failed increases every > second, even with no load (-s malloc,2G -p thread_pools=2 -p > thread_pool_min=250 -p thread_pool_max=2000 -p thread_pool_fail_delay=2). > > MAIN.threads 494 . 
Total number of threads > MAIN.threads_limited 0 0.00 Threads hit max > MAIN.threads_created 494 0.01 Threads created > MAIN.threads_destroyed 0 0.00 Threads destroyed > MAIN.threads_failed 356 0.01 Thread creation failed > > I tried now for a couple of days to fix this issue and I'm looking in > the area of a kernel param. But I'm completely stuck on what's causing it. > > I would appreciate any help and pointing me into the right direction. > > Best regards, > Jens > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From np.lists at sharphosting.uk Thu Jun 1 21:22:02 2017 From: np.lists at sharphosting.uk (Nigel Peck) Date: Thu, 1 Jun 2017 16:22:02 -0500 Subject: Unexplained Cache MISSes In-Reply-To: References: <211c667a-ce70-6373-c840-4482c159e38c@sharphosting.uk> <35dfe986-72dc-95f5-0319-9d0743aebbe4@sharphosting.uk> <3fdcafb5-4000-3d64-478b-fb60baa9a783@sharphosting.uk> Message-ID: <53cec1b0-0b57-7110-4b78-9ac280eaa782@sharphosting.uk> On 31/05/2017 18:33, Dridi Boukelmoune wrote: > There's no ordering guarantee in the varnishlog output, although they > should likely be ordered since they share the same hash. You'd need to > check the Timestamp records to get a grasp of chronology. Thanks, I'll keep that in mind. I looked at a typical set of entries that I saved. This is not a busy site. All the timestamps are in order. I've included the full log below, based on a search for the ReqURL. There is the PURGE and then the restart that gets a HIT. The restart shows itself as being the 7th hit on that object - X-Cache: HIT (7) - I can't check the VXID because the PURGE entry doesn't include it[1]. And then the next request, 40 minutes later, gets a MISS. All caching on this server is for one week. The next request an hour after that gets HIT(1). 
So all working properly apart from the restart getting a HIT, resulting in the next request getting a MISS instead[2]. It seems clear to me that there is some bug causing a delay on the PURGE going through in some cases (around 10% of purges in my case), so the restart comes back round before the PURGE has completed. The purge completes after the restart. [1] - I'm not sure how I can get the VXID for a purge, since it seems vcl_purge does not have access to the obj it is going to purge. Hopefully the obj.hits being more than 1 or 2 in the restarted hit is evidence enough, and the lack of intervening entries on this non-busy site. [2] - Very noticeable in my case, because I am using Varnish to ensure every request is a cache HIT, even for pages that only get viewed once or twice a week, to improve performance. So I'm monitoring MISSes and seeing this happening. > If it's a bug, it might be one of those hard to reproduce... > > Amazingly enough I never looked at the logs of a purge, maybe ExpKill > could give us a VXID to then check against the hit. If only SomeoneElse(tm) > could spare me the time and look at it themselves and tell us (wink wink=). I'm very happy to help in any way I can. Please let me know anything I can do or information I can provide. I'm no C programmer (web developer/server admin), so can't help out with coding/patching/debugging[3], but anything else I can do, please let me know what you need. Would a cleanly installed server and absolute minimum VCL to reproduce this be useful? You would be welcome to have access to that server, if useful, once I've got it set up and producing the same problem. Nigel [3] - Assuming it's a bug, which for my part I'm convinced it is at this point. 
* << Request >> 266604 - Begin req 266603 rxreq - Timestamp Start: 1495662133.465511 0.000000 0.000000 - Timestamp Req: 1495662133.465511 0.000000 0.000000 - ReqStart xxx.xxx.xxx.xx2 57250 - ReqMethod PURGE - ReqURL /example/url - ReqProtocol HTTP/1.1 - ReqHeader TE: deflate,gzip;q=0.3 - ReqHeader Connection: TE, close - ReqHeader Accept-Encoding: gzip - ReqHeader Host: www.example.com - ReqHeader User-Agent: SuperDuperApps-Cache-Purger/0.1 - ReqHeader X-Forwarded-For: xxx.xxx.xxx.xx2 - VCL_call RECV - ReqHeader X-Processed-By: Melian - VCL_acl MATCH purgers "xxx.xxx.xxx.xx2" - VCL_return purge - VCL_call HASH - VCL_return lookup - VCL_call PURGE - ReqMethod GET - VCL_return restart - Timestamp Restart: 1495662133.465563 0.000052 0.000052 - Link req 266605 restart - End * << Request >> 266605 - Begin req 266604 restart - Timestamp Start: 1495662133.465563 0.000052 0.000000 - ReqStart xxx.xxx.xxx.xx2 57250 - ReqMethod GET - ReqURL /example/url - ReqProtocol HTTP/1.1 - ReqHeader TE: deflate,gzip;q=0.3 - ReqHeader Connection: TE, close - ReqHeader Accept-Encoding: gzip - ReqHeader Host: www.example.com - ReqHeader User-Agent: SuperDuperApps-Cache-Purger/0.1 - ReqHeader X-Forwarded-For: xxx.xxx.xxx.xx2 - ReqHeader X-Processed-By: Melian - VCL_call RECV - VCL_return hash - VCL_call HASH - VCL_return lookup - Hit 132102 - VCL_call HIT - VCL_return deliver - RespProtocol HTTP/1.1 - RespStatus 200 - RespReason OK - RespHeader Date: Wed, 24 May 2017 02:37:14 GMT - RespHeader Server: Apache/2 - RespHeader P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM" - RespHeader Last-Modified: Wed, 24 May 2017 02:37:15 GMT - RespHeader Content-Type: text/html; charset=utf-8 - RespHeader X-Host: www.example.com - RespHeader X-URL: /example/url - RespHeader Cache-Control: max-age=3600 - RespHeader Content-Encoding: gzip - RespHeader Vary: Accept-Encoding - RespHeader X-Varnish: 266605 132102 - RespHeader Age: 68698 - RespHeader Via: 1.1 varnish-v4 - VCL_call DELIVER - RespUnset Age: 
68698 - RespHeader Age: 0 - RespHeader X-Cache: HIT (7) - RespUnset X-Host: www.example.com - RespUnset X-URL: /example/url - RespUnset X-Varnish: 266605 132102 - RespUnset Via: 1.1 varnish-v4 - RespHeader Via: Varnish - VCL_return deliver - Timestamp Process: 1495662133.465618 0.000107 0.000055 - RespHeader Accept-Ranges: bytes - RespHeader Content-Length: 7493 - Debug "RES_MODE 2" - RespHeader Connection: close - Timestamp Resp: 1495662133.465660 0.000149 0.000042 - ReqAcct 225 0 225 396 7493 7889 - End * << Request >> 3017 - Begin req 3016 rxreq - Timestamp Start: 1495664394.000921 0.000000 0.000000 - Timestamp Req: 1495664394.000921 0.000000 0.000000 - ReqStart xxx.xxx.xxx.xx3 45771 - ReqMethod GET - ReqURL /example/url - ReqProtocol HTTP/1.1 - ReqHeader Host: www.example.com - ReqHeader Connection: Keep-alive - ReqHeader Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 - ReqHeader From: googlebot(at)googlebot.com - ReqHeader User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) - ReqHeader Accept-Encoding: gzip,deflate,br - ReqHeader If-Modified-Since: Wed, 24 May 2017 15:16:04 GMT - ReqHeader X-Forwarded-For: xxx.xxx.xxx.xx3 - VCL_call RECV - ReqHeader X-Processed-By: Melian - VCL_return hash - ReqUnset Accept-Encoding: gzip,deflate,br - ReqHeader Accept-Encoding: gzip - VCL_call HASH - VCL_return lookup - VCL_call MISS - VCL_return fetch - Link bereq 3018 fetch - Timestamp Fetch: 1495664394.381188 0.380267 0.380267 - RespProtocol HTTP/1.1 - RespStatus 200 - RespReason OK - RespHeader Date: Wed, 24 May 2017 22:19:54 GMT - RespHeader Server: Apache/2 - RespHeader P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM" - RespHeader Last-Modified: Wed, 24 May 2017 22:19:54 GMT - RespHeader Content-Type: text/html; charset=utf-8 - RespHeader X-Host: www.example.com - RespHeader X-URL: /example/url - RespHeader Cache-Control: max-age=3600 - RespHeader Content-Encoding: gzip - RespHeader Vary: Accept-Encoding - 
RespHeader X-Varnish: 3017 - RespHeader Age: 0 - RespHeader Via: 1.1 varnish-v4 - VCL_call DELIVER - RespHeader X-Cache: MISS - RespUnset X-Host: www.example.com - RespUnset X-URL: /example/url - RespUnset X-Varnish: 3017 - RespUnset Via: 1.1 varnish-v4 - RespHeader Via: Varnish - VCL_return deliver - Timestamp Process: 1495664394.381214 0.380294 0.000026 - RespHeader Accept-Ranges: bytes - RespHeader Transfer-Encoding: chunked - Debug "RES_MODE 8" - RespHeader Connection: keep-alive - Timestamp Resp: 1495664394.396562 0.395641 0.015347 - ReqAcct 409 0 409 404 7493 7897 - End * << Request >> 35821 - Begin req 35820 rxreq - Timestamp Start: 1495668065.207785 0.000000 0.000000 - Timestamp Req: 1495668065.207785 0.000000 0.000000 - ReqStart xxx.xxx.xxx.xx1 33904 - ReqMethod GET - ReqURL /example/url - ReqProtocol HTTP/1.1 - ReqHeader Host: www.example.com - ReqHeader Connection: Keep-alive - ReqHeader Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 - ReqHeader From: googlebot(at)googlebot.com - ReqHeader User-Agent: Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) - ReqHeader Accept-Encoding: gzip,deflate,br - ReqHeader If-Modified-Since: Wed, 24 May 2017 22:19:54 GMT - ReqHeader X-Forwarded-For: xxx.xxx.xxx.xx1 - VCL_call RECV - ReqHeader X-Processed-By: Melian - VCL_return hash - ReqUnset Accept-Encoding: gzip,deflate,br - ReqHeader Accept-Encoding: gzip - VCL_call HASH - VCL_return lookup - Hit 3018 - VCL_call HIT - VCL_return deliver - RespProtocol HTTP/1.1 - RespStatus 200 - RespReason OK - RespHeader Date: Wed, 24 May 2017 22:19:54 GMT - RespHeader Server: Apache/2 - RespHeader P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM" - RespHeader Last-Modified: Wed, 24 May 2017 22:19:54 GMT - RespHeader Content-Type: text/html; charset=utf-8 - RespHeader X-Host: www.example.com - RespHeader X-URL: /example/url - RespHeader Cache-Control: max-age=3600 - RespHeader Content-Encoding: gzip - RespHeader Vary: 
Accept-Encoding - RespHeader X-Varnish: 35821 3018 - RespHeader Age: 3670 - RespHeader Via: 1.1 varnish-v4 - VCL_call DELIVER - RespUnset Age: 3670 - RespHeader Age: 0 - RespHeader X-Cache: HIT (1) - RespUnset X-Host: www.example.com - RespUnset X-URL: /example/url - RespUnset X-Varnish: 35821 3018 - RespUnset Via: 1.1 varnish-v4 - RespHeader Via: Varnish - VCL_return deliver - Timestamp Process: 1495668065.207927 0.000142 0.000142 - RespProtocol HTTP/1.1 - RespStatus 304 - RespReason Not Modified - RespReason Not Modified - Debug "RES_MODE 0" - RespHeader Connection: keep-alive - Timestamp Resp: 1495668065.208002 0.000217 0.000075 - ReqAcct 409 0 409 367 0 367 - End From np.lists at sharphosting.uk Thu Jun 1 21:29:57 2017 From: np.lists at sharphosting.uk (Nigel Peck) Date: Thu, 1 Jun 2017 16:29:57 -0500 Subject: Unexplained Cache MISSes In-Reply-To: References: <211c667a-ce70-6373-c840-4482c159e38c@sharphosting.uk> <35dfe986-72dc-95f5-0319-9d0743aebbe4@sharphosting.uk> Message-ID: <3a9c36a9-d503-0bde-802b-b5dcf15b3e3f@sharphosting.uk> On 31/05/2017 18:21, Dridi Boukelmoune wrote: > On Wed, May 31, 2017 at 6:25 PM, Guillaume Quintard >> I got that idea too, but the HIT after the purge return an object with a >> large age. > > The age is something that could come from the backend. Does the VXID > match the one that was just purged when a restart gets a hit? As mentioned in the other email, a purge log entry does not include the VXID as far as I can see, and "obj" is not available in vcl_purge either. I can say that in my case there is definitely no Age header coming from the back-end. Also as shown in the example I sent it is the 7th HIT on that object. Nigel From info at dubistmeinheld.de Fri Jun 2 08:49:06 2017 From: info at dubistmeinheld.de (info at dubistmeinheld.de) Date: Fri, 2 Jun 2017 10:49:06 +0200 Subject: Can't create more than 494 threads In-Reply-To: References: Message-ID: On 01.06.2017 18:52, Jason Price wrote: > dimesg might help. 
log files should indicate if you're in an 'open > files limit' issue... You pointed me in the right direction regarding log files. Thanks! From syslog (this error only shows up when starting Varnish, then it's omitted): kernel: [84513.627267] cgroup: fork rejected by pids controller in /system.slice/varnish.service Which led me to the solution of changing the systemd settings: https://www.novell.com/support/kb/doc.php?id=7018594 In the future this may affect other distros besides openSUSE, too. Happy again! > On Thu, Jun 1, 2017 at 7:51 AM, info at dubistmeinheld.de > > wrote: > > Hi, > > I'm using Varnish since a couple of years and I'm very satisfied. Thank > you for the work! > > Recently I setup a new server with Varnish 5.0. For some reasons, > threads_created is stuck at 494 and threads_failed increase every > second, even with no load (-s malloc,2G -p thread_pools=2 -p > thread_pool_min=250 -p thread_pool_max=2000 -p > thread_pool_fail_delay=2). > > MAIN.threads 494 . Total number of threads > MAIN.threads_limited 0 0.00 Threads hit max > MAIN.threads_created 494 0.01 Threads created > MAIN.threads_destroyed 0 0.00 Threads destroyed > MAIN.threads_failed 356 0.01 Thread creation failed > > I tried now for a couple of days to fix this issue and I'm looking in > the area of a kernel param. But I'm completely stuck on what's > causing it. > > I would appreciate any help and pointing me into the right direction. > > Best regards, > Jens > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > > > From joh.hendriks at gmail.com Fri Jun 2 10:06:18 2017 From: joh.hendriks at gmail.com (Johan Hendriks) Date: Fri, 2 Jun 2017 12:06:18 +0200 Subject: Varnish performance with phpinfo Message-ID: <6fa4576d-d25e-b770-44da-98877379a815@gmail.com> Hello all, First, sorry for the long email. I have a strange issue with varnish. 
At least I think it is strange. We started some tests with Varnish, but we have an issue. I am running varnish 4.1.6 on FreeBSD 11.1-prerelease. Varnish listens on port 82 and Apache on 80; this is just for the tests. We use the following start options. # Varnish varnishd_enable="YES" varnishd_listen="192.168.2.247:82" varnishd_pidfile="/var/run/varnishd.pid" varnishd_storage="default=malloc,2024M" varnishd_config="/usr/local/etc/varnish/default.vcl" varnishd_hash="critbit" varnishd_admin=":6082" varnishncsa_enable="YES" We did a test with a static page and that went fine. The first time we see it is not cached, the second attempt is cached. root at desk:~ # curl -I www.testdomain.nl:82/info.html HTTP/1.1 200 OK Date: Fri, 02 Jun 2017 09:19:52 GMT Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT ETag: "cf4-550e57bc1f812" Content-Length: 3316 Content-Type: text/html cache-control: max-age = 259200 X-Varnish: 2 Age: 0 Via: 1.1 varnish-v4 Server: varnish X-Powered-By: My Varnish X-Cache: MISS Accept-Ranges: bytes Connection: keep-alive root at desk:~ # curl -I www.testdomain.nl:82/info.html HTTP/1.1 200 OK Date: Fri, 02 Jun 2017 09:19:52 GMT Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT ETag: "cf4-550e57bc1f812" Content-Length: 3316 Content-Type: text/html cache-control: max-age = 259200 X-Varnish: 5 3 Age: 6 Via: 1.1 varnish-v4 Server: varnish X-Powered-By: My Varnish X-Cache: HIT Accept-Ranges: bytes Connection: keep-alive If I benchmark the server I get the following. The first run is directly to Apache: root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl/info.html Bombarding http://www.testdomain.nl/info.html with 10000 requests using 400 connections 10000 / 10000 [=============================================================] 100.00% 0s Done! Statistics Avg Stdev Max Reqs/sec 12459.00 898.32 13301 Latency 31.04ms 25.28ms 280.90ms HTTP codes: 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 others - 0 Throughput: 42.16MB/s This is via varnish. 
So that works as intended. Varnish does its job and serves the page better: root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl:82/info.html Bombarding http://www.testdomain.nl:82/info.html with 10000 requests using 400 connections 10000 / 10000 [=============================================================] 100.00% 0s Done! Statistics Avg Stdev Max Reqs/sec 19549.00 7649.32 24313 Latency 17.90ms 66.77ms 485.07ms HTTP codes: 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 others - 0 Throughput: 71.58MB/s The next one is against an info.php file, which runs phpinfo(). So first against the server without Varnish: root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl/info.php Bombarding http://www.testdomain.nl/info.php with 10000 requests using 400 connections 10000 / 10000 [============================================================] 100.00% 11s Done! Statistics Avg Stdev Max Reqs/sec 828.00 127.66 1010 Latency 472.10ms 59.10ms 740.43ms HTTP codes: 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 others - 0 Throughput: 75.51MB/s Then against the server with Varnish. First we make sure it is in cache: root at desk:~ # curl -I www.testdomain.nl:82/info.php HTTP/1.1 200 OK Date: Fri, 02 Jun 2017 09:36:16 GMT Content-Type: text/html; charset=UTF-8 cache-control: max-age = 259200 X-Varnish: 7 Age: 0 Via: 1.1 varnish-v4 Server: varnish X-Powered-By: My Varnish X-Cache: MISS Accept-Ranges: bytes Connection: keep-alive root at desk:~ # curl -I www.testdomain.nl:82/info.php HTTP/1.1 200 OK Date: Fri, 02 Jun 2017 09:36:16 GMT Content-Type: text/html; charset=UTF-8 cache-control: max-age = 259200 X-Varnish: 10 8 Age: 2 Via: 1.1 varnish-v4 Server: varnish X-Powered-By: My Varnish X-Cache: HIT Accept-Ranges: bytes Connection: keep-alive So it is in cache now. 
root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl:82/info.php Bombarding http://www.testdomain.nl:82/info.php with 10000 requests using 400 connections 10000 / 10000 [===========================================================================================================================================================================================================] 100.00% 8s Done! Statistics Avg Stdev Max Reqs/sec 1179.00 230.77 1981 Latency 219.94ms 340.29ms 2.00s HTTP codes: 1xx - 0, 2xx - 9938, 3xx - 0, 4xx - 0, 5xx - 0 others - 62 Errors: dialing to the given TCP address timed out - 62 Throughput: 83.16MB/s I expected this to be much more in favour of varnish, but it even generated some errors! Time taken is lower but I expected it to be much faster. Also the 62 errors is not good i guess. I do see the following with varnish log * << Request >> 11141123 - Begin req 1310723 rxreq - Timestamp Start: 1496396250.098654 0.000000 0.000000 - Timestamp Req: 1496396250.098654 0.000000 0.000000 - ReqStart 192.168.2.39 14818 - ReqMethod GET - ReqURL /info.php - ReqProtocol HTTP/1.1 - ReqHeader User-Agent: fasthttp - ReqHeader Host: www.testdomain.nl:82 - ReqHeader X-Forwarded-For: 192.168.2.39 - VCL_call RECV - ReqUnset X-Forwarded-For: 192.168.2.39 - ReqHeader X-Forwarded-For: 192.168.2.39, 192.168.2.39 - VCL_return hash - VCL_call HASH - VCL_return lookup - Hit 8 - VCL_call HIT - VCL_return deliver - RespProtocol HTTP/1.1 - RespStatus 200 - RespReason OK - RespHeader Date: Fri, 02 Jun 2017 09:36:16 GMT - RespHeader Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l - RespHeader X-Powered-By: PHP/7.0.19 - RespHeader Content-Type: text/html; charset=UTF-8 - RespHeader cache-control: max-age = 259200 - RespHeader X-Varnish: 11141123 8 - RespHeader Age: 73 - RespHeader Via: 1.1 varnish-v4 - VCL_call DELIVER - RespUnset Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l - RespHeader Server: varnish - RespUnset X-Powered-By: PHP/7.0.19 - RespHeader 
X-Powered-By: My Varnish - RespHeader X-Cache: HIT - VCL_return deliver - Timestamp Process: 1496396250.098712 0.000058 0.000058 - RespHeader Accept-Ranges: bytes - RespHeader Content-Length: 95200 - Debug "RES_MODE 2" - RespHeader Connection: keep-alive *- Debug "Hit idle send timeout, wrote = 89972/95508; retrying"** **- Debug "Write error, retval = -1, len = 5536, errno = Resource temporarily unavailable"* - Timestamp Resp: 1496396371.131526 121.032872 121.032814 - ReqAcct 82 0 82 308 95200 95508 - End Sometimes I see this Debug line also - *Debug "Write error, retval = -1, len = 95563, errno = Broken pipe"* I also installed varnish 5.1.2 but the results are the same. Is there something I miss? My vcl file is pretty basic. https://pastebin.com/rbb42x7h Thanks all for your time. regards Johan -------------- next part -------------- An HTML attachment was scrubbed... URL: From lagged at gmail.com Fri Jun 2 11:18:45 2017 From: lagged at gmail.com (Andrei) Date: Fri, 2 Jun 2017 06:18:45 -0500 Subject: Can't create more than 494 threads In-Reply-To: References: Message-ID: Good catch. Thanks for the details! On Fri, Jun 2, 2017 at 3:49 AM, info at dubistmeinheld.de < info at dubistmeinheld.de> wrote: > On 01.06.2017 18:52, Jason Price wrote: > > dimesg might help. log files should indicate if you're in an 'open > > files limit' issue... > > You pointed me in the right direction regarding log files. Thanks! > > From syslog (only shown when starting varnish this error comes up, then > it's omitted): > kernel: [84513.627267] cgroup: > fork rejected by pids controller in /system.slice/varnish.service > > Which led me to the solution to change systemd settings: > https://www.novell.com/support/kb/doc.php?id=7018594 > > This may also affect not only openSuSE distros in the future. Happy again! > > > On Thu, Jun 1, 2017 at 7:51 AM, info at dubistmeinheld.de > > > > wrote: > > > > Hi, > > > > I'm using Varnish since a couple of years and I'm very satisfied. 
> Thank > > you for the work! > > > > Recently I setup a new server with Varnish 5.0. For some reasons, > > threads_created is stuck at 494 and threads_failed increase every > > second, even with no load (-s malloc,2G -p thread_pools=2 -p > > thread_pool_min=250 -p thread_pool_max=2000 -p > > thread_pool_fail_delay=2). > > > > MAIN.threads 494 . Total number of > threads > > MAIN.threads_limited 0 0.00 Threads hit max > > MAIN.threads_created 494 0.01 Threads created > > MAIN.threads_destroyed 0 0.00 Threads destroyed > > MAIN.threads_failed 356 0.01 Thread creation > failed > > > > I tried now for a couple of days to fix this issue and I'm looking in > > the area of a kernel param. But I'm completely stuck on what's > > causing it. > > > > I would appreciate any help and pointing me into the right direction. > > > > Best regards, > > Jens > > > > _______________________________________________ > > varnish-misc mailing list > > varnish-misc at varnish-cache.org cache.org> > > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > > > > > > > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dridi at varni.sh Fri Jun 2 23:08:17 2017 From: dridi at varni.sh (Dridi Boukelmoune) Date: Sat, 3 Jun 2017 01:08:17 +0200 Subject: Unexplained Cache MISSes In-Reply-To: <53cec1b0-0b57-7110-4b78-9ac280eaa782@sharphosting.uk> References: <211c667a-ce70-6373-c840-4482c159e38c@sharphosting.uk> <35dfe986-72dc-95f5-0319-9d0743aebbe4@sharphosting.uk> <3fdcafb5-4000-3d64-478b-fb60baa9a783@sharphosting.uk> <53cec1b0-0b57-7110-4b78-9ac280eaa782@sharphosting.uk> Message-ID: >> Amazingly enough I never looked at the logs of a purge, maybe ExpKill >> could give us a VXID to then check against the hit. 
If only SomeoneElse(tm) >> could spare me the time and look at it themselves and tell us (wink wink=). > > > I'm very happy to help in any way I can. Please let me know anything I can > do or information I can provide. I'm no C programmer (web developer/server > admin), so can't help out with coding/patching/debugging[3], but anything > else I can do, please let me know what you need. Well, luckily I didn't write any C code to find out what purge logs look like. I'm certainly not going to debug code I'm not familiar with ;) I wrote a dummy test case instead: varnishtest "purge logs" server s1 { rxreq expect req.url == "/to-be-purged" txresp } -start varnish v1 -vcl+backend { sub vcl_recv { if (req.method == "PURGE") { return (purge); } } } -start client c1 { txreq -url "/to-be-purged" rxresp txreq -req PURGE -url "/to-be-purged" rxresp txreq -req PURGE -url "/unknown" rxresp } -run And then looked at the logs manually: varnishtest test.vtc -v | grep vsl | less Here's a sample: [...] **** v1 0.4 vsl| 1002 VCL_return b deliver **** v1 0.4 vsl| 1002 Storage b malloc s0 [...] **** v1 0.4 vsl| 0 ExpKill - EXP_When p=0x7f420b027000 e=1496443420.703764200 f=0xe **** v1 0.4 vsl| 0 ExpKill - EXP_expire p=0x7f420b027000 e=-0.000092268 f=0x0 **** v1 0.4 vsl| 0 ExpKill - EXP_Expired x=1002 t=-0 [...] **** v1 0.4 vsl| 1003 ReqMethod c PURGE **** v1 0.4 vsl| 1003 ReqURL c /to-be-purged [...] **** v1 0.4 vsl| 1003 VCL_return c purge **** v1 0.4 vsl| 1003 VCL_call c HASH **** v1 0.4 vsl| 1003 VCL_return c lookup **** v1 0.4 vsl| 1003 VCL_call c PURGE **** v1 0.4 vsl| 1003 VCL_return c synth [...] **** v1 0.4 vsl| 1004 ReqMethod c PURGE **** v1 0.4 vsl| 1004 ReqURL c /unknown [...] **** v1 0.4 vsl| 1004 VCL_return c purge **** v1 0.4 vsl| 1004 VCL_call c HASH **** v1 0.4 vsl| 1004 VCL_return c lookup **** v1 0.4 vsl| 1004 VCL_call c PURGE **** v1 0.4 vsl| 1004 VCL_return c synth [...] The interesting transaction id (VXID) is 1002. 
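Picking such VXIDs out of a captured raw log can be scripted; below is a minimal, hypothetical sketch in Python (only the "ExpKill ... EXP_Expired x=" record shape is taken from the sample above; the helper name and regex are illustrative assumptions, not something Varnish ships):

```python
import re

# Matches expiry-thread records like:
#   0 ExpKill - EXP_Expired x=1002 t=-0
# (record shape copied from the varnishtest sample above; the
# regex itself is an assumption, not part of Varnish)
EXP_EXPIRED = re.compile(r"ExpKill\s+-\s+EXP_Expired\s+x=(\d+)")

def expired_vxids(lines):
    """Return the VXIDs reported as expired in raw-grouped log lines."""
    vxids = []
    for line in lines:
        m = EXP_EXPIRED.search(line)
        if m:
            vxids.append(int(m.group(1)))
    return vxids

# Sample records from the test run above:
sample = [
    "0 ExpKill - EXP_When p=0x7f420b027000 e=1496443420.703764200 f=0xe",
    "0 ExpKill - EXP_expire p=0x7f420b027000 e=-0.000092268 f=0x0",
    "0 ExpKill - EXP_Expired x=1002 t=-0",
]
print(expired_vxids(sample))  # -> [1002]
```

The idea would be to feed it lines captured with raw grouping (varnishlog -g raw) and compare the extracted VXIDs against the id reported in the restarted hit.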
So 1) purge-related logs will only show up with raw grouping in varnishlog (which I find unfortunate but I should have remembered the expiry thread would have been involved) and 2) we don't see in a transaction log how many objects were actually purged (moved to the expiry inbox). The ExpKill records appear before because transactions commit their logs when they finish by default. > Would a cleanly installed server and absolute minimum VCL to reproduce this > be useful? You would be welcome to have access to that server, if useful, > once I've got it set up and producing the same problem. Not yet, at this point we know that we were looking at an incomplete picture so what you need to do is capture raw logs and we will be able to get both a VXID and a timestamp from the ExpKill records (although the timestamp for EXP_expire puzzles me). See man varnishlog to see how to write (-w) and then read (-r) logs to/from a file. When you notice the alleged bug, note the transaction id and write the current logs (with the -d option) so that you can pick up all the interesting bits at rest (instead of doing it on live traffic). > I can say that in my case there is definitely no Age header coming from the > back-end. Also as shown in the example I sent it is the 7th HIT on that > object. Yes, smells like a bug. But before capturing logs, make sure to remove Hash records from the vsl_mask (man varnishd) so we can confirm what's being purged too. I have a theory, a long shot that will only prove how unfamiliar I am with this part of Varnish. Since the purge moves the object to the expiry inbox, it could be that under load the restart may happen before the expiry thread marks it as expired, thus creating a race with the next lookup. 
Cheers, Dridi From info at dubistmeinheld.de Thu Jun 8 13:32:17 2017 From: info at dubistmeinheld.de (info at dubistmeinheld.de) Date: Thu, 8 Jun 2017 15:32:17 +0200 Subject: Stuck with sc_rx_timeout Message-ID: Hi, I'm closely monitoring varnish 5.0 stats. I could fix some issues, but I am stuck on where sc_rx_timeout is coming from. Looking at the documentation, it says "Number of session closes with Error RX_TIMEOUT (Receive timeout)". I do not fully understand this sentence. - Is this a timeout expressing that varnish does not receive an answer within a certain time from the backend? - Do these timeouts happen from time to time and are ok, or is there an issue on the server (code, params?). - If parameters, which ones could be candidates to tune? - Also I have issues with more parameters like sc_req_http10 (please have a look below) and unsure if they are severe. I would be happy if you could point me in some direction. Cheers, Jens # /usr/sbin/varnishstat -1 | grep -i '\(err\|fail\|drop\)' MAIN.sess_drop 0 0.00 Sessions dropped MAIN.sess_fail 0 0.00 Session accept failures MAIN.client_req_400 2 0.00 Client requests received, subject to 400 errors MAIN.client_req_417 0 0.00 Client requests received, subject to 417 errors MAIN.backend_fail 2 0.00 Backend conn. 
failures MAIN.fetch_failed 0 0.00 Fetch failed (all causes) MAIN.fetch_no_thread 0 0.00 Fetch failed (no thread) MAIN.threads_failed 0 0.00 Thread creation failed MAIN.sess_dropped 0 0.00 Sessions dropped for thread MAIN.sess_closed_err 21784 0.30 Session Closed with error MAIN.sc_req_http10 1223 0.02 Session Err REQ_HTTP10 MAIN.sc_rx_bad 0 0.00 Session Err RX_BAD MAIN.sc_rx_body 34 0.00 Session Err RX_BODY MAIN.sc_rx_junk 2 0.00 Session Err RX_JUNK MAIN.sc_rx_overflow 0 0.00 Session Err RX_OVERFLOW MAIN.sc_rx_timeout 20525 0.29 Session Err RX_TIMEOUT MAIN.sc_tx_error 0 0.00 Session Err TX_ERROR MAIN.sc_overload 0 0.00 Session Err OVERLOAD MAIN.sc_pipe_overflow 0 0.00 Session Err PIPE_OVERFLOW MAIN.sc_range_short 0 0.00 Session Err RANGE_SHORT MAIN.esi_errors 0 0.00 ESI parse errors (unlock) SMA.s0.c_fail 0 0.00 Allocator failures SMA.Transient.c_fail 0 0.00 Allocator failures From remofurlanetto at gmail.com Fri Jun 9 20:10:31 2017 From: remofurlanetto at gmail.com (Remo Furlanetto) Date: Fri, 9 Jun 2017 22:10:31 +0200 Subject: varnish daemon as non root - vmods libraries directory Message-ID: Hi, Is there a way to configure Varnish to read the vmod libraries from a different directory? I am asking because of permissions: I have compiled and installed Varnish in a different path using another user (non-root). Everything works when I don't use any import in my VCL file, but when I try to use for example "import std;", I receive an error because the process is trying to read a system directory. 
[docker at localhost varnish]$ /home/docker/varnish/sbin/varnishd -a :29800 -f /home/docker/varnish/etc/default.vcl -T 127.0.0.1:6082 -p thread_pool_min=50 -p thread_pool_max=1000 -S /home/docker/varnish/etc/secret -s malloc,256M -n /home/docker/varnish/tmp -P /home/docker/varnish/run/ varnish.pid Error: Message from VCC-compiler: Could not load VMOD std * File name: libvmod_std.so* * dlerror: /usr/local/lib/varnish/vmods/libvmod_std.so: cannot open shared object file: No such file or directory* ('/home/docker/varnish/etc/default.vcl' Line 3 Pos 8) import std; -------###- Running VCC-compiler failed, exited with 2 VCL compilation failed I am not sure, but I believe that could be possible to configure other folder because when the installation has finished, I saw that the libraries is under a folder "lib" [docker at localhost varnish]$ ls -ltr /home/docker/varnish total 40 drwxrwxr-x 3 docker docker 4096 Jun 9 11:24 include drwxrwxr-x 2 docker docker 4096 Jun 9 11:24 bin drwxr-xr-x 3 docker docker 4096 Jun 9 11:24 var drwxrwxr-x 6 docker docker 4096 Jun 9 11:24 share drwxrwxr-x 4 docker docker 4096 Jun 9 11:24 lib drwxrwxr-x 2 docker docker 4096 Jun 9 11:56 sysconfig drwxrwxr-x 2 docker docker 4096 Jun 9 12:29 etc drwxrwxr-x 2 docker docker 4096 Jun 9 12:29 run drwxrwxr-x 2 docker docker 4096 Jun 9 12:33 sbin drwxrwxr-x 4 docker docker 4096 Jun 9 12:37 tmp [docker at localhost lib]$ find /home/docker/varnish/lib/ /home/docker/varnish/lib/ /home/docker/varnish/lib/pkgconfig /home/docker/varnish/lib/pkgconfig/varnishapi.pc /home/docker/varnish/lib/libvarnishapi.so /home/docker/varnish/lib/libvarnishapi.la /home/docker/varnish/lib/libvarnishapi.so.1.0.6 /home/docker/varnish/lib/varnish /home/docker/varnish/lib/varnish/vmods /home/docker/varnish/lib/varnish/vmods/libvmod_directors.so */home/docker/varnish/lib/varnish/vmods/libvmod_std.so* /home/docker/varnish/lib/varnish/vmods/libvmod_directors.la /home/docker/varnish/lib/varnish/vmods/libvmod_std.la 
/home/docker/varnish/lib/libvarnishapi.so.1

I would appreciate it if someone could help me. Thank you very much.

--
Remo M. Furlanetto
E-mail: remofurlanetto at gmail.com
Phone: (11) 99910-0565

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dridi at varni.sh Fri Jun 9 21:37:54 2017
From: dridi at varni.sh (Dridi Boukelmoune)
Date: Fri, 9 Jun 2017 23:37:54 +0200
Subject: varnish daemon as non root - vmods libraries directory
In-Reply-To: 
References: 
Message-ID: 

On Fri, Jun 9, 2017 at 10:10 PM, Remo Furlanetto wrote:
> Hi,
>
> Is there a way to configure varnish to read the vmod libraries in a
> different directory?

You can use the `from` keyword:

    import <vmod> from "<path>";

There is also a vmod_dir or vmod_path parameter, depending on your
version of Varnish; see man varnishd.

Dridi

From remofurlanetto at gmail.com Fri Jun 9 22:16:26 2017
From: remofurlanetto at gmail.com (Remo Furlanetto)
Date: Sat, 10 Jun 2017 00:16:26 +0200
Subject: varnish daemon as non root - vmods libraries directory
In-Reply-To: 
References: 
Message-ID: 

Hi Dridi,

Thank you for your answer.

I have found a solution. I had to run the "configure" script with
--exec-prefix:

wget https://repo.varnish-cache.org/source/varnish-5.1.2.tar.gz
tar -xvzf varnish-5.1.2.tar.gz
cd varnish-5.1.2
./autogen.sh
./configure --prefix=/home/docker/varnish --exec-prefix=/home/docker/varnish
make
make install

Thank you,
Remo.

On Fri, Jun 9, 2017 at 11:37 PM, Dridi Boukelmoune wrote:
> On Fri, Jun 9, 2017 at 10:10 PM, Remo Furlanetto
> wrote:
> > Hi,
> >
> > Is there a way to configure varnish to read the vmod libraries in a
> > different directory?
>
> You can use the `from` keyword:
>
>     import <vmod> from "<path>";
>
> There is also a vmod_dir or vmod_path parameter, depending on your
> version of Varnish; see man varnishd.
>
> Dridi

--
Remo M. Furlanetto
E-mail: remofurlanetto at gmail.com
Phone: (11) 99910-0565

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From dridi at varni.sh Fri Jun 9 22:30:42 2017
From: dridi at varni.sh (Dridi Boukelmoune)
Date: Sat, 10 Jun 2017 00:30:42 +0200
Subject: varnish daemon as non root - vmods libraries directory
In-Reply-To: 
References: 
Message-ID: 

On Sat, Jun 10, 2017 at 12:16 AM, Remo Furlanetto wrote:
> Hi Dridi,
>
> Thank you for your answer.
>
> I have found a solution. I had to run the "configure" script with
> --exec-prefix

Oh yes, that too. Since you mentioned installing to a different prefix
on purpose, I didn't think you needed help in this area.

What's weird is that if you set --prefix but not --exec-prefix, the
latter falls back to the former. Glad to see it worked out anyway.

Cheers

From np.lists at sharphosting.uk Fri Jun 16 18:27:15 2017
From: np.lists at sharphosting.uk (Nigel Peck)
Date: Fri, 16 Jun 2017 13:27:15 -0500
Subject: Unexplained Cache MISSes
In-Reply-To: 
References: <211c667a-ce70-6373-c840-4482c159e38c@sharphosting.uk> <35dfe986-72dc-95f5-0319-9d0743aebbe4@sharphosting.uk> <3fdcafb5-4000-3d64-478b-fb60baa9a783@sharphosting.uk> <53cec1b0-0b57-7110-4b78-9ac280eaa782@sharphosting.uk>
Message-ID: <1a0267d7-8cc4-4a9c-5f0a-9719db34321d@sharphosting.uk>

Sorry for the delay in working on this. I've read your email a few times
now and am still confused! I need to read the man pages suggested but
haven't got to it yet. Will let you know when I make some progress on it.

I'm fixing the issue in the interim here by issuing another GET request
in my cache refresh scripts for any PURGE requests that come back with a
HIT.

Nigel

On 02/06/2017 18:08, Dridi Boukelmoune wrote:
>>> Amazingly enough I never looked at the logs of a purge, maybe ExpKill
>>> could give us a VXID to then check against the hit. If only SomeoneElse(tm)
>>> could spare me the time and look at it themselves and tell us (wink wink=).
>> >> >> I'm very happy to help in any way I can. Please let me know anything I can >> do or information I can provide. I'm no C programmer (web developer/server >> admin), so can't help out with coding/patching/debugging[3], but anything >> else I can do, please let me know what you need. > > Well, luckily I didn't write any C code to find out what purge logs > look like. I'm certainly not going to debug code I'm not familiar with ;) > > I wrote a dummy test case instead: > > varnishtest "purge logs" > > server s1 { > rxreq > expect req.url == "/to-be-purged" > txresp > } -start > > varnish v1 -vcl+backend { > sub vcl_recv { > if (req.method == "PURGE") { > return (purge); > } > } > } -start > > client c1 { > txreq -url "/to-be-purged" > rxresp > > txreq -req PURGE -url "/to-be-purged" > rxresp > > txreq -req PURGE -url "/unknown" > rxresp > } -run > > And then looked at the logs manually: > > varnishtest test.vtc -v | grep vsl | less > > Here's a sample: > > [...] > **** v1 0.4 vsl| 1002 VCL_return b deliver > **** v1 0.4 vsl| 1002 Storage b malloc s0 > [...] > **** v1 0.4 vsl| 0 ExpKill - EXP_When > p=0x7f420b027000 e=1496443420.703764200 f=0xe > **** v1 0.4 vsl| 0 ExpKill - EXP_expire > p=0x7f420b027000 e=-0.000092268 f=0x0 > **** v1 0.4 vsl| 0 ExpKill - EXP_Expired x=1002 t=-0 > [...] > **** v1 0.4 vsl| 1003 ReqMethod c PURGE > **** v1 0.4 vsl| 1003 ReqURL c /to-be-purged > [...] > **** v1 0.4 vsl| 1003 VCL_return c purge > **** v1 0.4 vsl| 1003 VCL_call c HASH > **** v1 0.4 vsl| 1003 VCL_return c lookup > **** v1 0.4 vsl| 1003 VCL_call c PURGE > **** v1 0.4 vsl| 1003 VCL_return c synth > [...] > **** v1 0.4 vsl| 1004 ReqMethod c PURGE > **** v1 0.4 vsl| 1004 ReqURL c /unknown > [...] > **** v1 0.4 vsl| 1004 VCL_return c purge > **** v1 0.4 vsl| 1004 VCL_call c HASH > **** v1 0.4 vsl| 1004 VCL_return c lookup > **** v1 0.4 vsl| 1004 VCL_call c PURGE > **** v1 0.4 vsl| 1004 VCL_return c synth > [...] > > The interesting transaction id (VXID) is 1002. 
> > So 1) purge-related logs will only show up with raw grouping in > varnishlog (which I find unfortunate but I should have remembered the > expiry thread would have been involved) and 2) we don't see in a > transaction log how many objects were actually purged (moved to the > expiry inbox). > > The ExpKill records appear before because transactions commit their > logs when they finish by default. > >> Would a cleanly installed server and absolute minimum VCL to reproduce this >> be useful? You would be welcome to have access to that server, if useful, >> once I've got it set up and producing the same problem. > > Not yet, at this point we know that we were looking at an incomplete > picture so what you need to do is capture raw logs and we will be able > to get both a VXID and a timestamp from the ExpKill records (although > the timestamp for EXP_expire puzzles me). > > See man varnishlog to see how to write (-w) and then read (-r) logs > to/from a file. When you notice the alleged bug, note the transaction > id and write the current logs (with the -d option) so that you can > pick up all the interesting bits at rest (instead of doing it on live > traffic). > >> I can say that in my case there is definitely no Age header coming from the >> back-end. Also as shown in the example I sent it is the 7th HIT on that >> object. > > Yes, smells like a bug. But before capturing logs, make sure to remove > Hash records from the vsl_mask (man varnishd) so we can confirm what's > being purged too. > > I have a theory, a long shot that will only prove how unfamiliar I am > with this part of Varnish. Since the purge moves the object to the > expiry inbox, it could be that under load the restart may happen > before the expiry thread marks it as expired, thus creating a race > with the next lookup. 
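(As a reference point for the purge-then-restart pattern discussed throughout this thread, a minimal VCL sketch — stock subroutine names and return actions, but not the actual VCL of any poster — could look like this:)

```vcl
vcl 4.0;

backend default {
    # hypothetical backend; adjust to your setup
    .host = "127.0.0.1";
    .port = "8080";
}

sub vcl_recv {
    if (req.method == "PURGE") {
        # ACL checks omitted for brevity; real configs should restrict PURGE
        return (purge);
    }
}

sub vcl_purge {
    # refresh-on-purge: once the object is purged, restart the
    # transaction as a GET so the cache is immediately repopulated
    set req.method = "GET";
    return (restart);
}
```

With something like this in place, a single PURGE both evicts the object and fetches a fresh copy, which is the behaviour the cache refresh scripts in this thread appear to rely on.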
> > Cheers, > Dridi > From np.lists at sharphosting.uk Fri Jun 16 19:09:40 2017 From: np.lists at sharphosting.uk (Nigel Peck) Date: Fri, 16 Jun 2017 14:09:40 -0500 Subject: Unexplained Cache MISSes In-Reply-To: <1a0267d7-8cc4-4a9c-5f0a-9719db34321d@sharphosting.uk> References: <211c667a-ce70-6373-c840-4482c159e38c@sharphosting.uk> <35dfe986-72dc-95f5-0319-9d0743aebbe4@sharphosting.uk> <3fdcafb5-4000-3d64-478b-fb60baa9a783@sharphosting.uk> <53cec1b0-0b57-7110-4b78-9ac280eaa782@sharphosting.uk> <1a0267d7-8cc4-4a9c-5f0a-9719db34321d@sharphosting.uk> Message-ID: <499ebc8d-e952-571c-e378-0fe092c6c709@sharphosting.uk> Here's an interesting thing about this. When I refreshed the cache just now (PURGE) for 204 URLs, 78 of them were a HIT instead of a MISS. All had been in the cache for 9 hours at least. (a re-issued GET request received a MISS for all 78) When I immediately issued a PURGE again a few seconds later for all 204 URLs, every one of them was a MISS and purged successfully. I did it again a few seconds after that, and again all good. Same again a few minutes after that. No HITs. So this seems to be in some way related to how long the objects have been in the cache. Nigel On 16/06/2017 13:27, Nigel Peck wrote: > > Sorry for the delay on working on this. I've read your email a few times > now and am still confused! I need to read the man pages suggested but > haven't got to it yet. Will let you know when I make some progress on it. > > I'm fixing the issue in the interim here by issuing another GET request > in my cache refresh scripts for any PURGE requests that come back with a > HIT. > > Nigel > > On 02/06/2017 18:08, Dridi Boukelmoune wrote: >>>> Amazingly enough I never looked at the logs of a purge, maybe ExpKill >>>> could give us a VXID to then check against the hit. If only >>>> SomeoneElse(tm) >>>> could spare me the time and look at it themselves and tell us (wink >>>> wink=). >>> >>> >>> I'm very happy to help in any way I can. 
Please let me know anything >>> I can >>> do or information I can provide. I'm no C programmer (web >>> developer/server >>> admin), so can't help out with coding/patching/debugging[3], but >>> anything >>> else I can do, please let me know what you need. >> >> Well, luckily I didn't write any C code to find out what purge logs >> look like. I'm certainly not going to debug code I'm not familiar with ;) >> >> I wrote a dummy test case instead: >> >> varnishtest "purge logs" >> >> server s1 { >> rxreq >> expect req.url == "/to-be-purged" >> txresp >> } -start >> >> varnish v1 -vcl+backend { >> sub vcl_recv { >> if (req.method == "PURGE") { >> return (purge); >> } >> } >> } -start >> >> client c1 { >> txreq -url "/to-be-purged" >> rxresp >> >> txreq -req PURGE -url "/to-be-purged" >> rxresp >> >> txreq -req PURGE -url "/unknown" >> rxresp >> } -run >> >> And then looked at the logs manually: >> >> varnishtest test.vtc -v | grep vsl | less >> >> Here's a sample: >> >> [...] >> **** v1 0.4 vsl| 1002 VCL_return b deliver >> **** v1 0.4 vsl| 1002 Storage b malloc s0 >> [...] >> **** v1 0.4 vsl| 0 ExpKill - EXP_When >> p=0x7f420b027000 e=1496443420.703764200 f=0xe >> **** v1 0.4 vsl| 0 ExpKill - EXP_expire >> p=0x7f420b027000 e=-0.000092268 f=0x0 >> **** v1 0.4 vsl| 0 ExpKill - EXP_Expired >> x=1002 t=-0 >> [...] >> **** v1 0.4 vsl| 1003 ReqMethod c PURGE >> **** v1 0.4 vsl| 1003 ReqURL c /to-be-purged >> [...] >> **** v1 0.4 vsl| 1003 VCL_return c purge >> **** v1 0.4 vsl| 1003 VCL_call c HASH >> **** v1 0.4 vsl| 1003 VCL_return c lookup >> **** v1 0.4 vsl| 1003 VCL_call c PURGE >> **** v1 0.4 vsl| 1003 VCL_return c synth >> [...] >> **** v1 0.4 vsl| 1004 ReqMethod c PURGE >> **** v1 0.4 vsl| 1004 ReqURL c /unknown >> [...] >> **** v1 0.4 vsl| 1004 VCL_return c purge >> **** v1 0.4 vsl| 1004 VCL_call c HASH >> **** v1 0.4 vsl| 1004 VCL_return c lookup >> **** v1 0.4 vsl| 1004 VCL_call c PURGE >> **** v1 0.4 vsl| 1004 VCL_return c synth >> [...] 
>> >> The interesting transaction id (VXID) is 1002. >> >> So 1) purge-related logs will only show up with raw grouping in >> varnishlog (which I find unfortunate but I should have remembered the >> expiry thread would have been involved) and 2) we don't see in a >> transaction log how many objects were actually purged (moved to the >> expiry inbox). >> >> The ExpKill records appear before because transactions commit their >> logs when they finish by default. >> >>> Would a cleanly installed server and absolute minimum VCL to >>> reproduce this >>> be useful? You would be welcome to have access to that server, if >>> useful, >>> once I've got it set up and producing the same problem. >> >> Not yet, at this point we know that we were looking at an incomplete >> picture so what you need to do is capture raw logs and we will be able >> to get both a VXID and a timestamp from the ExpKill records (although >> the timestamp for EXP_expire puzzles me). >> >> See man varnishlog to see how to write (-w) and then read (-r) logs >> to/from a file. When you notice the alleged bug, note the transaction >> id and write the current logs (with the -d option) so that you can >> pick up all the interesting bits at rest (instead of doing it on live >> traffic). >> >>> I can say that in my case there is definitely no Age header coming >>> from the >>> back-end. Also as shown in the example I sent it is the 7th HIT on that >>> object. >> >> Yes, smells like a bug. But before capturing logs, make sure to remove >> Hash records from the vsl_mask (man varnishd) so we can confirm what's >> being purged too. >> >> I have a theory, a long shot that will only prove how unfamiliar I am >> with this part of Varnish. Since the purge moves the object to the >> expiry inbox, it could be that under load the restart may happen >> before the expiry thread marks it as expired, thus creating a race >> with the next lookup. 
>> Cheers,
>> Dridi
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc

From r at roze.lv Wed Jun 21 12:23:42 2017
From: r at roze.lv (Reinis Rozitis)
Date: Wed, 21 Jun 2017 15:23:42 +0300
Subject: Assert error in http1_minimal_response(), http1/cache_http1_fsm.c line 234
Message-ID: <509EFFB6DD2147BE8DCF2FC0A6BD4E3A@MasterPC>

Hello, before making a new issue I wanted to clarify: after upgrading
varnish to 5.1.2 (from 3.x) it sometimes panics with:

Assert error in http1_minimal_response(), http1/cache_http1_fsm.c line 234:
Condition(VTCP_Check(1)) not true.
version = varnish-5.1.2 revision 6ece695, vrt api = 6.0
ident = Linux,4.11.4-1.gcba98ee-default,x86_64,-jnone,-sfile,-sfile,-sfile,-sfile,-smalloc,-hcritbit,epoll
now = 681947.494149 (mono), 1498046899.128715 (real)
Backtrace:
0x43af07: /data/varnish5/sbin/varnishd() [0x43af07]
0x45bfe5: /data/varnish5/sbin/varnishd() [0x45bfe5]
0x45cf72: /data/varnish5/sbin/varnishd() [0x45cf72]
0x454049: /data/varnish5/sbin/varnishd() [0x454049]
0x4544ab: /data/varnish5/sbin/varnishd() [0x4544ab]
0x7fee77d44744: /lib64/libpthread.so.0(+0x8744) [0x7fee77d44744]
0x7fee77a82d3d: /lib64/libc.so.6(clone+0x6d) [0x7fee77a82d3d]
errno = 32 (Broken pipe)

[also full backtrace]

I've checked past issues and there was exactly one match:
https://github.com/varnishcache/varnish-cache/issues/2267
but it is kind of closed with
https://github.com/varnishcache/varnish-cache/commit/a8b453cb432e9717e1a8afab91433aa4294ba27e

Should I add to the existing bug report or create a new one?
rr

From r at roze.lv Wed Jun 21 12:34:13 2017
From: r at roze.lv (Reinis Rozitis)
Date: Wed, 21 Jun 2017 15:34:13 +0300
Subject: Assert error in http1_minimal_response(), http1/cache_http1_fsm.c line 234
In-Reply-To: <509EFFB6DD2147BE8DCF2FC0A6BD4E3A@MasterPC>
References: <509EFFB6DD2147BE8DCF2FC0A6BD4E3A@MasterPC>
Message-ID: <77172C33544E4F7CBC9B79D585638E83@MasterPC>

Also, the fix ( https://github.com/varnishcache/varnish-cache/commit/a8b453cb432e9717e1a8afab91433aa4294ba27e )
itself is a bit odd, since it's within:

#if (defined (__SVR4) && defined (__sun)) || defined (__NetBSD__)

.. but both the original author's environment and mine are Linux (e.g. I
can't see how it actually changes anything regarding the issue).

rr

From guillaume at varnish-software.com Fri Jun 23 08:58:21 2017
From: guillaume at varnish-software.com (Guillaume Quintard)
Date: Fri, 23 Jun 2017 10:58:21 +0200
Subject: Varnish performance with phpinfo
In-Reply-To: <6fa4576d-d25e-b770-44da-98877379a815@gmail.com>
References: <6fa4576d-d25e-b770-44da-98877379a815@gmail.com>
Message-ID: 

Stupid question, but aren't you being limited by your client, or maybe a
firewall?

--
Guillaume Quintard

On Fri, Jun 2, 2017 at 12:06 PM, Johan Hendriks wrote:
> Hello all. First, sorry for the long email.
> I have a strange issue with varnish. At least I think it is strange.
>
> We started some tests with varnish, but we have an issue.
>
> I am running varnish 4.1.6 on FreeBSD 11.1-prerelease, where varnish
> listens on port 82 and apache on 80. This is just for the tests.
> We use the following start options.
>
> # Varnish
> varnishd_enable="YES"
> varnishd_listen="192.168.2.247:82"
> varnishd_pidfile="/var/run/varnishd.pid"
> varnishd_storage="default=malloc,2024M"
> varnishd_config="/usr/local/etc/varnish/default.vcl"
> varnishd_hash="critbit"
> varnishd_admin=":6082"
> varnishncsa_enable="YES"
>
> We did a test with a static page and that went fine. First we see it is
> not cached; the second attempt is cached.
>
> root at desk:~ # curl -I www.testdomain.nl:82/info.html
> HTTP/1.1 200 OK
> Date: Fri, 02 Jun 2017 09:19:52 GMT
> Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT
> ETag: "cf4-550e57bc1f812"
> Content-Length: 3316
> Content-Type: text/html
> cache-control: max-age = 259200
> X-Varnish: 2
> Age: 0
> Via: 1.1 varnish-v4
> Server: varnish
> X-Powered-By: My Varnish
> X-Cache: MISS
> Accept-Ranges: bytes
> Connection: keep-alive
>
> root at desk:~ # curl -I www.testdomain.nl:82/info.html
> HTTP/1.1 200 OK
> Date: Fri, 02 Jun 2017 09:19:52 GMT
> Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT
> ETag: "cf4-550e57bc1f812"
> Content-Length: 3316
> Content-Type: text/html
> cache-control: max-age = 259200
> X-Varnish: 5 3
> Age: 6
> Via: 1.1 varnish-v4
> Server: varnish
> X-Powered-By: My Varnish
> X-Cache: HIT
> Accept-Ranges: bytes
> Connection: keep-alive
>
> If I benchmark the server I get the following.
> The first run is directly to Apache:
>
> root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl/info.html
> Bombarding http://www.testdomain.nl/info.html with 10000 requests using 400 connections
> 10000 / 10000 [=============================================================] 100.00% 0s
> Done!
> Statistics    Avg        Stdev      Max
> Reqs/sec      12459.00   898.32     13301
> Latency       31.04ms    25.28ms    280.90ms
> HTTP codes:
> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0
> others - 0
> Throughput: 42.16MB/s
>
> This is via varnish. So that works as intended;
> varnish does its job and serves the page better.
>
> root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl:82/info.html
> Bombarding http://www.testdomain.nl:82/info.html with 10000 requests using 400 connections
> 10000 / 10000 [=============================================================] 100.00% 0s
> Done!
> Statistics    Avg        Stdev      Max
> Reqs/sec      19549.00   7649.32    24313
> Latency       17.90ms    66.77ms    485.07ms
> HTTP codes:
> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0
> others - 0
> Throughput: 71.58MB/s
>
> The next one is against an info.php file, which runs phpinfo();
>
> So first against the server without varnish:
>
> root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl/info.php
> Bombarding http://www.testdomain.nl/info.php with 10000 requests using 400 connections
> 10000 / 10000 [=============================================================] 100.00% 11s
> Done!
> Statistics    Avg        Stdev     Max
> Reqs/sec      828.00     127.66    1010
> Latency       472.10ms   59.10ms   740.43ms
> HTTP codes:
> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0
> others - 0
> Throughput: 75.51MB/s
>
> But then against the server with varnish.
> So we first make sure it is in cache:
>
> root at desk:~ # curl -I www.testdomain.nl:82/info.php
> HTTP/1.1 200 OK
> Date: Fri, 02 Jun 2017 09:36:16 GMT
> Content-Type: text/html; charset=UTF-8
> cache-control: max-age = 259200
> X-Varnish: 7
> Age: 0
> Via: 1.1 varnish-v4
> Server: varnish
> X-Powered-By: My Varnish
> X-Cache: MISS
> Accept-Ranges: bytes
> Connection: keep-alive
>
> root at desk:~ # curl -I www.testdomain.nl:82/info.php
> HTTP/1.1 200 OK
> Date: Fri, 02 Jun 2017 09:36:16 GMT
> Content-Type: text/html; charset=UTF-8
> cache-control: max-age = 259200
> X-Varnish: 10 8
> Age: 2
> Via: 1.1 varnish-v4
> Server: varnish
> X-Powered-By: My Varnish
> X-Cache: HIT
> Accept-Ranges: bytes
> Connection: keep-alive
>
> So it is in cache now.
> root at testserver:~ # bombardier -c400 -n10000 http://www.testdomain.nl:82/info.php
> Bombarding http://www.testdomain.nl:82/info.php with 10000 requests using 400 connections
> 10000 / 10000 [=============================================================] 100.00% 8s
> Done!
> Statistics    Avg        Stdev      Max
> Reqs/sec      1179.00    230.77     1981
> Latency       219.94ms   340.29ms   2.00s
> HTTP codes:
> 1xx - 0, 2xx - 9938, 3xx - 0, 4xx - 0, 5xx - 0
> others - 62
> Errors:
> dialing to the given TCP address timed out - 62
> Throughput: 83.16MB/s
>
> I expected this to be much more in favour of varnish, but it even
> generated some errors! The time taken is lower, but I expected it to be
> much faster. Also, the 62 errors are not good, I guess.
>
> I do see the following in the varnish log:
> * << Request >> 11141123
> - Begin          req 1310723 rxreq
> - Timestamp      Start: 1496396250.098654 0.000000 0.000000
> - Timestamp      Req: 1496396250.098654 0.000000 0.000000
> - ReqStart       192.168.2.39 14818
> - ReqMethod      GET
> - ReqURL         /info.php
> - ReqProtocol    HTTP/1.1
> - ReqHeader      User-Agent: fasthttp
> - ReqHeader      Host: www.testdomain.nl:82
> - ReqHeader      X-Forwarded-For: 192.168.2.39
> - VCL_call       RECV
> - ReqUnset       X-Forwarded-For: 192.168.2.39
> - ReqHeader      X-Forwarded-For: 192.168.2.39, 192.168.2.39
> - VCL_return     hash
> - VCL_call       HASH
> - VCL_return     lookup
> - Hit            8
> - VCL_call       HIT
> - VCL_return     deliver
> - RespProtocol   HTTP/1.1
> - RespStatus     200
> - RespReason     OK
> - RespHeader     Date: Fri, 02 Jun 2017 09:36:16 GMT
> - RespHeader     Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l
> - RespHeader     X-Powered-By: PHP/7.0.19
> - RespHeader     Content-Type: text/html; charset=UTF-8
> - RespHeader     cache-control: max-age = 259200
> - RespHeader     X-Varnish: 11141123 8
> - RespHeader     Age: 73
> - RespHeader     Via: 1.1 varnish-v4
> - VCL_call       DELIVER
> - RespUnset      Server:
Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l
> - RespHeader     Server: varnish
> - RespUnset      X-Powered-By: PHP/7.0.19
> - RespHeader     X-Powered-By: My Varnish
> - RespHeader     X-Cache: HIT
> - VCL_return     deliver
> - Timestamp      Process: 1496396250.098712 0.000058 0.000058
> - RespHeader     Accept-Ranges: bytes
> - RespHeader     Content-Length: 95200
> - Debug          "RES_MODE 2"
> - RespHeader     Connection: keep-alive
> - Debug          "Hit idle send timeout, wrote = 89972/95508; retrying"
> - Debug          "Write error, retval = -1, len = 5536, errno = Resource temporarily unavailable"
> - Timestamp      Resp: 1496396371.131526 121.032872 121.032814
> - ReqAcct        82 0 82 308 95200 95508
> - End
>
> Sometimes I also see this Debug line:
> - Debug          "Write error, retval = -1, len = 95563, errno = Broken pipe"
>
> I also installed varnish 5.1.2, but the results are the same.
> Is there something I'm missing?
>
> My vcl file is pretty basic:
>
> https://pastebin.com/rbb42x7h
>
> Thanks all for your time.
>
> regards
> Johan
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc at varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From guillaume at varnish-software.com Fri Jun 23 09:05:33 2017
From: guillaume at varnish-software.com (Guillaume Quintard)
Date: Fri, 23 Jun 2017 11:05:33 +0200
Subject: Stuck with sc_rx_timeout
In-Reply-To: 
References: 
Message-ID: 

Don't worry about it; your client just left while you were expecting
data from it.

--
Guillaume Quintard

On Thu, Jun 8, 2017 at 3:32 PM, info at dubistmeinheld.de
<info at dubistmeinheld.de> wrote:
> Hi,
>
> I'm closely monitoring varnish 5.0 stats. I could fix some issues, but I
> am stuck on where sc_rx_timeout is coming from.
>
> Looking at the documentation, it says "Number of session closes with
> Error RX_TIMEOUT (Receive timeout)".
>
> I do not fully understand this sentence.
> - Is this a timeout expressing that varnish does not receive an answer > within a certain time from the backend? > - Do these timeouts happen from time to time and are ok, or is there an > issue on the server (code, params?). > - If parameters, which ones could be candidates to tune? > - Also I have issues with more parameters like sc_req_http10 (please > have a look below) and unsure if they are severe. > > I would be happy if you could point me in some direction. > > Cheers, > Jens > > # /usr/sbin/varnishstat -1 | grep -i '\(err\|fail\|drop\)' > MAIN.sess_drop 0 0.00 Sessions dropped > MAIN.sess_fail 0 0.00 Session accept failures > MAIN.client_req_400 2 0.00 Client requests received, > subject to 400 errors > MAIN.client_req_417 0 0.00 Client requests received, > subject to 417 errors > MAIN.backend_fail 2 0.00 Backend conn. failures > MAIN.fetch_failed 0 0.00 Fetch failed (all causes) > MAIN.fetch_no_thread 0 0.00 Fetch failed (no thread) > MAIN.threads_failed 0 0.00 Thread creation failed > MAIN.sess_dropped 0 0.00 Sessions dropped for > thread > MAIN.sess_closed_err 21784 0.30 Session Closed with error > MAIN.sc_req_http10 1223 0.02 Session Err REQ_HTTP10 > MAIN.sc_rx_bad 0 0.00 Session Err RX_BAD > MAIN.sc_rx_body 34 0.00 Session Err RX_BODY > MAIN.sc_rx_junk 2 0.00 Session Err RX_JUNK > MAIN.sc_rx_overflow 0 0.00 Session Err RX_OVERFLOW > MAIN.sc_rx_timeout 20525 0.29 Session Err RX_TIMEOUT > MAIN.sc_tx_error 0 0.00 Session Err TX_ERROR > MAIN.sc_overload 0 0.00 Session Err OVERLOAD > MAIN.sc_pipe_overflow 0 0.00 Session Err PIPE_OVERFLOW > MAIN.sc_range_short 0 0.00 Session Err RANGE_SHORT > MAIN.esi_errors 0 0.00 ESI parse > errors (unlock) > SMA.s0.c_fail 0 0.00 Allocator > failures > SMA.Transient.c_fail 0 0.00 Allocator > failures > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part 
-------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Fri Jun 23 09:09:50 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Fri, 23 Jun 2017 11:09:50 +0200 Subject: Unexplained Cache MISSes In-Reply-To: <499ebc8d-e952-571c-e378-0fe092c6c709@sharphosting.uk> References: <211c667a-ce70-6373-c840-4482c159e38c@sharphosting.uk> <35dfe986-72dc-95f5-0319-9d0743aebbe4@sharphosting.uk> <3fdcafb5-4000-3d64-478b-fb60baa9a783@sharphosting.uk> <53cec1b0-0b57-7110-4b78-9ac280eaa782@sharphosting.uk> <1a0267d7-8cc4-4a9c-5f0a-9719db34321d@sharphosting.uk> <499ebc8d-e952-571c-e378-0fe092c6c709@sharphosting.uk> Message-ID: Hum, could you toy with ttl/grace/keep periods? Like having only a one week TTL but no grace/keep, then a one week grace but no TTL/keep? The period when the purge occurs may be important... -- Guillaume Quintard On Fri, Jun 16, 2017 at 9:09 PM, Nigel Peck wrote: > > Here's an interesting thing about this. When I refreshed the cache just > now (PURGE) for 204 URLs, 78 of them were a HIT instead of a MISS. All had > been in the cache for 9 hours at least. (a re-issued GET request received a > MISS for all 78) > > When I immediately issued a PURGE again a few seconds later for all 204 > URLs, every one of them was a MISS and purged successfully. I did it again > a few seconds after that, and again all good. Same again a few minutes > after that. No HITs. > > So this seems to be in some way related to how long the objects have been > in the cache. > > Nigel > > > On 16/06/2017 13:27, Nigel Peck wrote: > >> >> Sorry for the delay on working on this. I've read your email a few times >> now and am still confused! I need to read the man pages suggested but >> haven't got to it yet. Will let you know when I make some progress on it. >> >> I'm fixing the issue in the interim here by issuing another GET request >> in my cache refresh scripts for any PURGE requests that come back with a >> HIT. 
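(Guillaume's TTL/grace/keep experiment above could be expressed as a vcl_backend_response override — a sketch only, with the one-week duration assumed from the caching policy mentioned earlier in the thread:)

```vcl
sub vcl_backend_response {
    # experiment 1: one-week ttl, no grace, no keep
    set beresp.ttl = 1w;
    set beresp.grace = 0s;
    set beresp.keep = 0s;

    # experiment 2 (swap in instead): no ttl, one-week grace
    # set beresp.ttl = 0s;
    # set beresp.grace = 1w;
    # set beresp.keep = 0s;
}
```

Comparing how the PURGE-then-HIT anomaly behaves under each variant should narrow down which lifetime period the stale hits fall into.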
>> >> Nigel >> >> On 02/06/2017 18:08, Dridi Boukelmoune wrote: >> >>> Amazingly enough I never looked at the logs of a purge, maybe ExpKill >>>>> could give us a VXID to then check against the hit. If only >>>>> SomeoneElse(tm) >>>>> could spare me the time and look at it themselves and tell us (wink >>>>> wink=). >>>>> >>>> >>>> >>>> I'm very happy to help in any way I can. Please let me know anything I >>>> can >>>> do or information I can provide. I'm no C programmer (web >>>> developer/server >>>> admin), so can't help out with coding/patching/debugging[3], but >>>> anything >>>> else I can do, please let me know what you need. >>>> >>> >>> Well, luckily I didn't write any C code to find out what purge logs >>> look like. I'm certainly not going to debug code I'm not familiar with ;) >>> >>> I wrote a dummy test case instead: >>> >>> varnishtest "purge logs" >>> >>> server s1 { >>> rxreq >>> expect req.url == "/to-be-purged" >>> txresp >>> } -start >>> >>> varnish v1 -vcl+backend { >>> sub vcl_recv { >>> if (req.method == "PURGE") { >>> return (purge); >>> } >>> } >>> } -start >>> >>> client c1 { >>> txreq -url "/to-be-purged" >>> rxresp >>> >>> txreq -req PURGE -url "/to-be-purged" >>> rxresp >>> >>> txreq -req PURGE -url "/unknown" >>> rxresp >>> } -run >>> >>> And then looked at the logs manually: >>> >>> varnishtest test.vtc -v | grep vsl | less >>> >>> Here's a sample: >>> >>> [...] >>> **** v1 0.4 vsl| 1002 VCL_return b deliver >>> **** v1 0.4 vsl| 1002 Storage b malloc s0 >>> [...] >>> **** v1 0.4 vsl| 0 ExpKill - EXP_When >>> p=0x7f420b027000 e=1496443420.703764200 f=0xe >>> **** v1 0.4 vsl| 0 ExpKill - EXP_expire >>> p=0x7f420b027000 e=-0.000092268 f=0x0 >>> **** v1 0.4 vsl| 0 ExpKill - EXP_Expired x=1002 >>> t=-0 >>> [...] >>> **** v1 0.4 vsl| 1003 ReqMethod c PURGE >>> **** v1 0.4 vsl| 1003 ReqURL c /to-be-purged >>> [...] 
>>> **** v1 0.4 vsl| 1003 VCL_return c purge >>> **** v1 0.4 vsl| 1003 VCL_call c HASH >>> **** v1 0.4 vsl| 1003 VCL_return c lookup >>> **** v1 0.4 vsl| 1003 VCL_call c PURGE >>> **** v1 0.4 vsl| 1003 VCL_return c synth >>> [...] >>> **** v1 0.4 vsl| 1004 ReqMethod c PURGE >>> **** v1 0.4 vsl| 1004 ReqURL c /unknown >>> [...] >>> **** v1 0.4 vsl| 1004 VCL_return c purge >>> **** v1 0.4 vsl| 1004 VCL_call c HASH >>> **** v1 0.4 vsl| 1004 VCL_return c lookup >>> **** v1 0.4 vsl| 1004 VCL_call c PURGE >>> **** v1 0.4 vsl| 1004 VCL_return c synth >>> [...] >>> >>> The interesting transaction id (VXID) is 1002. >>> >>> So 1) purge-related logs will only show up with raw grouping in >>> varnishlog (which I find unfortunate but I should have remembered the >>> expiry thread would have been involved) and 2) we don't see in a >>> transaction log how many objects were actually purged (moved to the >>> expiry inbox). >>> >>> The ExpKill records appear before because transactions commit their >>> logs when they finish by default. >>> >>> Would a cleanly installed server and absolute minimum VCL to reproduce >>>> this >>>> be useful? You would be welcome to have access to that server, if >>>> useful, >>>> once I've got it set up and producing the same problem. >>>> >>> >>> Not yet, at this point we know that we were looking at an incomplete >>> picture so what you need to do is capture raw logs and we will be able >>> to get both a VXID and a timestamp from the ExpKill records (although >>> the timestamp for EXP_expire puzzles me). >>> >>> See man varnishlog to see how to write (-w) and then read (-r) logs >>> to/from a file. When you notice the alleged bug, note the transaction >>> id and write the current logs (with the -d option) so that you can >>> pick up all the interesting bits at rest (instead of doing it on live >>> traffic). >>> >>> I can say that in my case there is definitely no Age header coming from >>>> the >>>> back-end. 
Also as shown in the example I sent it is the 7th HIT on that >>>> object. >>>> >>> >>> Yes, smells like a bug. But before capturing logs, make sure to remove >>> Hash records from the vsl_mask (man varnishd) so we can confirm what's >>> being purged too. >>> >>> I have a theory, a long shot that will only prove how unfamiliar I am >>> with this part of Varnish. Since the purge moves the object to the >>> expiry inbox, it could be that under load the restart may happen >>> before the expiry thread marks it as expired, thus creating a race >>> with the next lookup. >>> >>> Cheers, >>> Dridi >>> >>> >> _______________________________________________ >> varnish-misc mailing list >> varnish-misc at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >> > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Fri Jun 23 14:01:26 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Fri, 23 Jun 2017 11:01:26 -0300 Subject: Child process recurrently being restarted Message-ID: Hello. I am having a critical problem with Varnish Cache in production for over a month and any help will be appreciated. The problem is that Varnish child process is recurrently being restarted after 10~20h of use, with the following message: Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it. 
Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9 Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 The following link is the varnishstat output just 1 minute before a restart: https://pastebin.com/g0g5RVTs Environment: varnish-5.1.2 revision 6ece695 Debian 8.7 - Debian GNU/Linux 8 (3.16.0) Installed using pre-built package from official repo at packagecloud.io CPU 2x2.9 GHz Mem 3.69 GiB Running inside a Docker container NFILES=131072 MEMLOCK=82000 Additional info: - I need to cache a large number of objects and the cache should last for almost a week, so I have set up a 450G storage space, I don't know if this is a problem; - I use ban a lot. There were about 40k bans in the system just before the last crash. I really don't know if this is too much or may have anything to do with it; - No registered CPU spikes (almost always around 30%); - No panic is reported, the only info I can retrieve is from syslog; - During all the time, even moments before the crashes, everything is okay and requests are being responded to very fast. Best, Stefano Baldo -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Fri Jun 23 14:30:18 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Fri, 23 Jun 2017 16:30:18 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Stefano, Let's cover the usual suspects: I/Os. I think here Varnish gets stuck trying to push/pull data and can't make time to reply to the CLI.
I'd recommend monitoring the disk activity (bandwidth and iops) to confirm. After some time, the file storage is terrible on a hard drive (SSDs take a bit more time to degrade) because of fragmentation. One solution to help the disks cope is to overprovision them if they're SSDs, and you can try different options in the file storage definition on the command line (the last parameter, after granularity). Is your /var/lib/varnish mounted on tmpfs? That could help too. 40K bans is a lot, are they ban-lurker friendly? -- Guillaume Quintard On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo wrote: > Hello. > > I am having a critical problem with Varnish Cache in production for over a > month and any help will be appreciated. > The problem is that Varnish child process is recurrently being restarted > after 10~20h of use, with the following message: > > Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding > to CLI, killed it. > Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: > 400 CLI communication error > Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9 > Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete > Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started > Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child > starts > Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 > mmap'ed 483183820800 bytes of 483183820800 > > The following link is the varnishstat output just 1 minute before a > restart: > > https://pastebin.com/g0g5RVTs > > Environment: > > varnish-5.1.2 revision 6ece695 > Debian 8.7 - Debian GNU/Linux 8 (3.16.0) > Installed using pre-built package from official repo at packagecloud.io > CPU 2x2.9 GHz > Mem 3.69 GiB > Running inside a Docker container > NFILES=131072 > MEMLOCK=82000 > > Additional info: > > - I need to cache a large number of objects and the cache should last for > almost a week, so I have set up a 450G
storage space, I don't know if this > is a problem; > - I use ban a lot. There were about 40k bans in the system just before the > last crash. I really don't know if this is too much or may have anything to > do with it; > - No registered CPU spikes (almost always around 30%); > - No panic is reported, the only info I can retrieve is from syslog; > - During all the time, even moments before the crashes, everything is > okay and requests are being responded to very fast. > > Best, > Stefano Baldo > > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joh.hendriks at gmail.com Fri Jun 23 14:52:53 2017 From: joh.hendriks at gmail.com (Johan Hendriks) Date: Fri, 23 Jun 2017 16:52:53 +0200 Subject: Varnish performance with phpinfo In-Reply-To: References: <6fa4576d-d25e-b770-44da-98877379a815@gmail.com> Message-ID: <83029bff-6f19-5d12-0514-fa6441ecbd6a@gmail.com> Thanks for your answer. I was thinking about that also, but I could not find anything that pointed in that direction. But shouldn't I hit that limit with the info.html file too, or could it be the size of the page? The info.html is of course way smaller than the whole php.info page. regards Johan On 23/06/2017 at 10:58, Guillaume Quintard wrote: > Stupid question but, aren't you being limited by your client, or a > firewall, maybe? > > -- > Guillaume Quintard > > On Fri, Jun 2, 2017 at 12:06 PM, Johan Hendriks > > wrote: > > Hello all, First sorry for the long email. > I have a strange issue with varnish. At least I think it is strange. > > We start some tests with varnish, but we have an issue. > > I am running varnish 4.1.6 on FreeBSD 11.1-prerelease, where > Varnish listens on port 82 and Apache on 80. This is just for the > tests. > We use the following start options.
> > # Varnish > varnishd_enable="YES" > varnishd_listen="192.168.2.247:82 " > varnishd_pidfile="/var/run/varnishd.pid" > varnishd_storage="default=malloc,2024M" > varnishd_config="/usr/local/etc/varnish/default.vcl" > varnishd_hash="critbit" > varnishd_admin=":6082" > varnishncsa_enable="YES" > > We did a test with a static page and that went fine. First we see > it is not cached, second attempt is cached. > > root at desk:~ # curl -I www.testdomain.nl:82/info.html > > HTTP/1.1 200 OK > Date: Fri, 02 Jun 2017 09:19:52 GMT > Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT > ETag: "cf4-550e57bc1f812" > Content-Length: 3316 > Content-Type: text/html > cache-control: max-age = 259200 > X-Varnish: 2 > Age: 0 > Via: 1.1 varnish-v4 > Server: varnish > X-Powered-By: My Varnish > X-Cache: MISS > Accept-Ranges: bytes > Connection: keep-alive > > root at desk:~ # curl -I www.testdomain.nl:82/info.html > > HTTP/1.1 200 OK > Date: Fri, 02 Jun 2017 09:19:52 GMT > Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT > ETag: "cf4-550e57bc1f812" > Content-Length: 3316 > Content-Type: text/html > cache-control: max-age = 259200 > X-Varnish: 5 3 > Age: 6 > Via: 1.1 varnish-v4 > Server: varnish > X-Powered-By: My Varnish > X-Cache: HIT > Accept-Ranges: bytes > Connection: keep-alive > > If I benchmark the server I get the following. > First is directly to Apache > > root at testserver:~ # bombardier -c400 -n10000 > http://www.testdomain.nl/info.html > > Bombarding http://www.testdomain.nl/info.html > with 10000 requests using 400 > connections > 10000 / 10000 > [=============================================================] > 100.00% 0s > Done! > Statistics Avg Stdev Max > Reqs/sec 12459.00 898.32 13301 > Latency 31.04ms 25.28ms 280.90ms > HTTP codes: > 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 > others - 0 > Throughput: 42.16MB/s > > This is via varnish. So that works as intended. > Varnish does its job and serves the page better.
> > root at testserver:~ # bombardier -c400 -n10000 > http://www.testdomain.nl:82/info.html > > Bombarding http://www.testdomain.nl:82/info.html > with 10000 requests using > 400 connections > 10000 / 10000 > [=============================================================] > 100.00% 0s > Done! > Statistics Avg Stdev Max > Reqs/sec 19549.00 7649.32 24313 > Latency 17.90ms 66.77ms 485.07ms > HTTP codes: > 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 > others - 0 > Throughput: 71.58MB/s > > > The next one is against an info.php file, which runs phpinfo(); > > So first against the server without varnish. > > root at testserver:~ # bombardier -c400 -n10000 > http://www.testdomain.nl/info.php > Bombarding http://www.testdomain.nl/info.php > with 10000 requests using 400 > connections > 10000 / 10000 > [============================================================] > 100.00% 11s > Done! > Statistics Avg Stdev Max > Reqs/sec 828.00 127.66 1010 > Latency 472.10ms 59.10ms 740.43ms > HTTP codes: > 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 > others - 0 > Throughput: 75.51MB/s > > But then against the server with varnish. > So we make sure it is in cache > > root at desk:~ # curl -I www.testdomain.nl:82/info.php > > HTTP/1.1 200 OK > Date: Fri, 02 Jun 2017 09:36:16 GMT > Content-Type: text/html; charset=UTF-8 > cache-control: max-age = 259200 > X-Varnish: 7 > Age: 0 > Via: 1.1 varnish-v4 > Server: varnish > X-Powered-By: My Varnish > X-Cache: MISS > Accept-Ranges: bytes > Connection: keep-alive > > root at desk:~ # curl -I www.testdomain.nl:82/info.php > > HTTP/1.1 200 OK > Date: Fri, 02 Jun 2017 09:36:16 GMT > Content-Type: text/html; charset=UTF-8 > cache-control: max-age = 259200 > X-Varnish: 10 8 > Age: 2 > Via: 1.1 varnish-v4 > Server: varnish > X-Powered-By: My Varnish > X-Cache: HIT > Accept-Ranges: bytes > Connection: keep-alive > > So it is in cache now.
> root at testserver:~ # bombardier -c400 -n10000 > http://www.testdomain.nl:82/info.php > > Bombarding http://www.testdomain.nl:82/info.php > with 10000 requests using > 400 connections > 10000 / 10000 > [===========================================================================================================================================================================================================] > 100.00% 8s > Done! > Statistics Avg Stdev Max > Reqs/sec 1179.00 230.77 1981 > Latency 219.94ms 340.29ms 2.00s > HTTP codes: > 1xx - 0, 2xx - 9938, 3xx - 0, 4xx - 0, 5xx - 0 > others - 62 > Errors: > dialing to the given TCP address timed out - 62 > Throughput: 83.16MB/s > > I expected this to be much more in favour of varnish, but it even > generated some errors! Time taken is lower but I expected it to be > much faster. Also the 62 errors is not good i guess. > > I do see the following with varnish log > * << Request >> 11141123 > - Begin req 1310723 rxreq > - Timestamp Start: 1496396250.098654 0.000000 0.000000 > - Timestamp Req: 1496396250.098654 0.000000 0.000000 > - ReqStart 192.168.2.39 14818 > - ReqMethod GET > - ReqURL /info.php > - ReqProtocol HTTP/1.1 > - ReqHeader User-Agent: fasthttp > - ReqHeader Host: www.testdomain.nl:82 > > - ReqHeader X-Forwarded-For: 192.168.2.39 > - VCL_call RECV > - ReqUnset X-Forwarded-For: 192.168.2.39 > - ReqHeader X-Forwarded-For: 192.168.2.39, 192.168.2.39 > - VCL_return hash > - VCL_call HASH > - VCL_return lookup > - Hit 8 > - VCL_call HIT > - VCL_return deliver > - RespProtocol HTTP/1.1 > - RespStatus 200 > - RespReason OK > - RespHeader Date: Fri, 02 Jun 2017 09:36:16 GMT > - RespHeader Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l > - RespHeader X-Powered-By: PHP/7.0.19 > - RespHeader Content-Type: text/html; charset=UTF-8 > - RespHeader cache-control: max-age = 259200 > - RespHeader X-Varnish: 11141123 8 > - RespHeader Age: 73 > - RespHeader Via: 1.1 varnish-v4 > - VCL_call DELIVER > - RespUnset Server: 
Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l > - RespHeader Server: varnish > - RespUnset X-Powered-By: PHP/7.0.19 > - RespHeader X-Powered-By: My Varnish > - RespHeader X-Cache: HIT > - VCL_return deliver > - Timestamp Process: 1496396250.098712 0.000058 0.000058 > - RespHeader Accept-Ranges: bytes > - RespHeader Content-Length: 95200 > - Debug "RES_MODE 2" > - RespHeader Connection: keep-alive > *- Debug "Hit idle send timeout, wrote = 89972/95508; > retrying"** > **- Debug "Write error, retval = -1, len = 5536, errno > = Resource temporarily unavailable"* > - Timestamp Resp: 1496396371.131526 121.032872 121.032814 > - ReqAcct 82 0 82 308 95200 95508 > - End > > Sometimes I see this Debug line also - *Debug "Write > error, retval = -1, len = 95563, errno = Broken pipe"* > > > I also installed varnish 5.1.2 but the results are the same. > Is there something I'm missing? > > My vcl file is pretty basic. > > https://pastebin.com/rbb42x7h > > Thanks all for your time. > > regards > Johan > > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Fri Jun 23 15:36:49 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Fri, 23 Jun 2017 17:36:49 +0200 Subject: Varnish performance with phpinfo In-Reply-To: <83029bff-6f19-5d12-0514-fa6441ecbd6a@gmail.com> References: <6fa4576d-d25e-b770-44da-98877379a815@gmail.com> <83029bff-6f19-5d12-0514-fa6441ecbd6a@gmail.com> Message-ID: Simple way to test: grow the info.html size :-) -- Guillaume Quintard On Fri, Jun 23, 2017 at 4:52 PM, Johan Hendriks wrote: > Thanks for your answer. > I was thinking about that also, but I could not find anything that pointed > in that direction.
> But should I hit that limit also with the info.html file then or could it > be the size of the page. > The info.html is off cource way smaller than the whole php.info page. > > regards > Johan > > Op 23/06/2017 om 10:58 schreef Guillaume Quintard: > > Stupid question but, aren't you being limited by your client, or a > firewall, maybe? > > -- > Guillaume Quintard > > On Fri, Jun 2, 2017 at 12:06 PM, Johan Hendriks > wrote: > >> Hello all, First sorry for the long email. >> I have a strange issue with varnish. At least I think it is strange. >> >> We start some tests with varnish, but we have an issue. >> >> I am running varnish 4.1.6 on FreeBSD 11.1-prerelease. Where varnish >> listen on port 82 and apache on 80, This is just for the tests. >> We use the following start options. >> >> # Varnish >> varnishd_enable="YES" >> varnishd_listen="192.168.2.247:82" >> varnishd_pidfile="/var/run/varnishd.pid" >> varnishd_storage="default=malloc,2024M" >> varnishd_config="/usr/local/etc/varnish/default.vcl" >> varnishd_hash="critbit" >> varnishd_admin=":6082" >> varnishncsa_enable="YES" >> >> We did a test with a static page and that went fine. First we see it is >> not cached, second attempt is cached. 
>> >> root at desk:~ # curl -I www.testdomain.nl:82/info.html >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:19:52 GMT >> Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT >> ETag: "cf4-550e57bc1f812" >> Content-Length: 3316 >> Content-Type: text/html >> cache-control: max-age = 259200 >> X-Varnish: 2 >> Age: 0 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: MISS >> Accept-Ranges: bytes >> Connection: keep-alive >> >> root at desk:~ # curl -I www.testdomain.nl:82/info.html >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:19:52 GMT >> Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT >> ETag: "cf4-550e57bc1f812" >> Content-Length: 3316 >> Content-Type: text/html >> cache-control: max-age = 259200 >> X-Varnish: 5 3 >> Age: 6 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: HIT >> Accept-Ranges: bytes >> Connection: keep-alive >> >> if I benchmark the server I get the following. >> First is derectly to Apache >> >> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl/info.html >> Bombarding http://www.testdomain.nl/info.html with 10000 requests using >> 400 connections >> 10000 / 10000 [============================= >> ================================] 100.00% 0s >> Done! >> Statistics Avg Stdev Max >> Reqs/sec 12459.00 898.32 13301 >> Latency 31.04ms 25.28ms 280.90ms >> HTTP codes: >> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 0 >> Throughput: 42.16MB/s >> >> This is via varnish. So that works as intended. >> Varnish does its job and servers the page better. >> >> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl:82/info.html >> Bombarding http://www.testdomain.nl:82/info.html with 10000 requests >> using 400 connections >> 10000 / 10000 [============================= >> ================================] 100.00% 0s >> Done! 
>> Statistics Avg Stdev Max >> Reqs/sec 19549.00 7649.32 24313 >> Latency 17.90ms 66.77ms 485.07ms >> HTTP codes: >> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 0 >> Throughput: 71.58MB/s >> >> >> The next one is against a info.php file, which runs phpinfo(); >> >> So first agains the server without varnish. >> >> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl/info.php >> Bombarding http://www.testdomain.nl/info.php with 10000 requests using >> 400 connections >> 10000 / 10000 [============================= >> ===============================] 100.00% 11s >> Done! >> Statistics Avg Stdev Max >> Reqs/sec 828.00 127.66 1010 >> Latency 472.10ms 59.10ms 740.43ms >> HTTP codes: >> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 0 >> Throughput: 75.51MB/s >> >> But then against the server with varnish. >> So we make sure it is in cache >> >> root at desk:~ # curl -I www.testdomain.nl:82/info.php >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:36:16 GMT >> Content-Type: text/html; charset=UTF-8 >> cache-control: max-age = 259200 >> X-Varnish: 7 >> Age: 0 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: MISS >> Accept-Ranges: bytes >> Connection: keep-alive >> >> root at desk:~ # curl -I www.testdomain.nl:82/info.php >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:36:16 GMT >> Content-Type: text/html; charset=UTF-8 >> cache-control: max-age = 259200 >> X-Varnish: 10 8 >> Age: 2 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: HIT >> Accept-Ranges: bytes >> Connection: keep-alive >> >> So it is in cache now. 
>> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl:82/info.php >> Bombarding http://www.testdomain.nl:82/info.php with 10000 requests >> using 400 connections >> 10000 / 10000 [============================= >> ============================================================ >> ============================================================ >> ======================================================] 100.00% 8s >> Done! >> Statistics Avg Stdev Max >> Reqs/sec 1179.00 230.77 1981 >> Latency 219.94ms 340.29ms 2.00s >> HTTP codes: >> 1xx - 0, 2xx - 9938, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 62 >> Errors: >> dialing to the given TCP address timed out - 62 >> Throughput: 83.16MB/s >> >> I expected this to be much more in favour of varnish, but it even >> generated some errors! Time taken is lower but I expected it to be much >> faster. Also the 62 errors is not good i guess. >> >> I do see the following with varnish log >> * << Request >> 11141123 >> - Begin req 1310723 rxreq >> - Timestamp Start: 1496396250.098654 0.000000 0.000000 >> - Timestamp Req: 1496396250.098654 0.000000 0.000000 >> - ReqStart 192.168.2.39 14818 >> - ReqMethod GET >> - ReqURL /info.php >> - ReqProtocol HTTP/1.1 >> - ReqHeader User-Agent: fasthttp >> - ReqHeader Host: www.testdomain.nl:82 >> - ReqHeader X-Forwarded-For: 192.168.2.39 >> - VCL_call RECV >> - ReqUnset X-Forwarded-For: 192.168.2.39 >> - ReqHeader X-Forwarded-For: 192.168.2.39, 192.168.2.39 >> - VCL_return hash >> - VCL_call HASH >> - VCL_return lookup >> - Hit 8 >> - VCL_call HIT >> - VCL_return deliver >> - RespProtocol HTTP/1.1 >> - RespStatus 200 >> - RespReason OK >> - RespHeader Date: Fri, 02 Jun 2017 09:36:16 GMT >> - RespHeader Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l >> - RespHeader X-Powered-By: PHP/7.0.19 >> - RespHeader Content-Type: text/html; charset=UTF-8 >> - RespHeader cache-control: max-age = 259200 >> - RespHeader X-Varnish: 11141123 8 >> - RespHeader Age: 73 >> - RespHeader Via: 1.1 
varnish-v4 >> - VCL_call DELIVER >> - RespUnset Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l >> - RespHeader Server: varnish >> - RespUnset X-Powered-By: PHP/7.0.19 >> - RespHeader X-Powered-By: My Varnish >> - RespHeader X-Cache: HIT >> - VCL_return deliver >> - Timestamp Process: 1496396250.098712 0.000058 0.000058 >> - RespHeader Accept-Ranges: bytes >> - RespHeader Content-Length: 95200 >> - Debug "RES_MODE 2" >> - RespHeader Connection: keep-alive >> *- Debug "Hit idle send timeout, wrote = 89972/95508; >> retrying"* >> *- Debug "Write error, retval = -1, len = 5536, errno = >> Resource temporarily unavailable"* >> - Timestamp Resp: 1496396371.131526 121.032872 121.032814 >> - ReqAcct 82 0 82 308 95200 95508 >> - End >> >> Sometimes I see this Debug line also - *Debug "Write error, >> retval = -1, len = 95563, errno = Broken pipe"* >> >> >> I also installed varnish 5.1.2 but the results are the same. >> Is there something I miss? >> >> My vcl file is pretty basic. >> >> https://pastebin.com/rbb42x7h >> >> Thanks all for your time. >> >> regards >> Johan >> >> >> _______________________________________________ >> varnish-misc mailing list >> varnish-misc at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >> > > > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From np.lists at sharphosting.uk Fri Jun 23 19:02:50 2017 From: np.lists at sharphosting.uk (Nigel Peck) Date: Fri, 23 Jun 2017 14:02:50 -0500 Subject: Unexplained Cache MISSes In-Reply-To: References: <211c667a-ce70-6373-c840-4482c159e38c@sharphosting.uk> <35dfe986-72dc-95f5-0319-9d0743aebbe4@sharphosting.uk> <3fdcafb5-4000-3d64-478b-fb60baa9a783@sharphosting.uk> <53cec1b0-0b57-7110-4b78-9ac280eaa782@sharphosting.uk> <1a0267d7-8cc4-4a9c-5f0a-9719db34321d@sharphosting.uk> <499ebc8d-e952-571c-e378-0fe092c6c709@sharphosting.uk> Message-ID: Sure, that's something I can understand! Will gather some data over the next couple of days for different time periods and configurations. On 23/06/2017 04:09, Guillaume Quintard wrote: > Hum, could you toy with ttl/grace/keep periods? Like having only a one > week TTL but no grace/keep, then a one week grace but no TTL/keep? > The period when the purge occurs may be important... > > -- > Guillaume Quintard > > On Fri, Jun 16, 2017 at 9:09 PM, Nigel Peck > wrote: > > > Here's an interesting thing about this. When I refreshed the cache > just now (PURGE) for 204 URLs, 78 of them were a HIT instead of a > MISS. All had been in the cache for 9 hours at least. (a re-issued > GET request received a MISS for all 78) > > When I immediately issued a PURGE again a few seconds later for all > 204 URLs, every one of them was a MISS and purged successfully. I > did it again a few seconds after that, and again all good. Same > again a few minutes after that. No HITs. > > So this seems to be in some way related to how long the objects have > been in the cache. > > Nigel > > > On 16/06/2017 13:27, Nigel Peck wrote: > > > Sorry for the delay on working on this. I've read your email a > few times now and am still confused! I need to read the man > pages suggested but haven't got to it yet. Will let you know > when I make some progress on it. 
> > I'm fixing the issue in the interim here by issuing another GET > request in my cache refresh scripts for any PURGE requests that > come back with a HIT. > > Nigel > > On 02/06/2017 18:08, Dridi Boukelmoune wrote: > > Amazingly enough I never looked at the logs of a > purge, maybe ExpKill > could give us a VXID to then check against the hit. > If only SomeoneElse(tm) > could spare me the time and look at it themselves > and tell us (wink wink=). > > > > I'm very happy to help in any way I can. Please let me > know anything I can > do or information I can provide. I'm no C programmer > (web developer/server > admin), so can't help out with > coding/patching/debugging[3], but anything > else I can do, please let me know what you need. > > > Well, luckily I didn't write any C code to find out what > purge logs > look like. I'm certainly not going to debug code I'm not > familiar with ;) > > I wrote a dummy test case instead: > > varnishtest "purge logs" > > server s1 { > rxreq > expect req.url == "/to-be-purged" > txresp > } -start > > varnish v1 -vcl+backend { > sub vcl_recv { > if (req.method == "PURGE") { > return (purge); > } > } > } -start > > client c1 { > txreq -url "/to-be-purged" > rxresp > > txreq -req PURGE -url "/to-be-purged" > rxresp > > txreq -req PURGE -url "/unknown" > rxresp > } -run > > And then looked at the logs manually: > > varnishtest test.vtc -v | grep vsl | less > > Here's a sample: > > [...] > **** v1 0.4 vsl| 1002 VCL_return b deliver > **** v1 0.4 vsl| 1002 Storage b malloc s0 > [...] > **** v1 0.4 vsl| 0 ExpKill - EXP_When > p=0x7f420b027000 e=1496443420.703764200 f=0xe > **** v1 0.4 vsl| 0 ExpKill - > EXP_expire > p=0x7f420b027000 e=-0.000092268 f=0x0 > **** v1 0.4 vsl| 0 ExpKill - > EXP_Expired x=1002 t=-0 > [...] > **** v1 0.4 vsl| 1003 ReqMethod c PURGE > **** v1 0.4 vsl| 1003 ReqURL c > /to-be-purged > [...] 
> **** v1 0.4 vsl| 1003 VCL_return c purge > **** v1 0.4 vsl| 1003 VCL_call c HASH > **** v1 0.4 vsl| 1003 VCL_return c lookup > **** v1 0.4 vsl| 1003 VCL_call c PURGE > **** v1 0.4 vsl| 1003 VCL_return c synth > [...] > **** v1 0.4 vsl| 1004 ReqMethod c PURGE > **** v1 0.4 vsl| 1004 ReqURL c /unknown > [...] > **** v1 0.4 vsl| 1004 VCL_return c purge > **** v1 0.4 vsl| 1004 VCL_call c HASH > **** v1 0.4 vsl| 1004 VCL_return c lookup > **** v1 0.4 vsl| 1004 VCL_call c PURGE > **** v1 0.4 vsl| 1004 VCL_return c synth > [...] > > The interesting transaction id (VXID) is 1002. > > So 1) purge-related logs will only show up with raw grouping in > varnishlog (which I find unfortunate but I should have > remembered the > expiry thread would have been involved) and 2) we don't see in a > transaction log how many objects were actually purged (moved > to the > expiry inbox). > > The ExpKill records appear before because transactions > commit their > logs when they finish by default. > > Would a cleanly installed server and absolute minimum > VCL to reproduce this > be useful? You would be welcome to have access to that > server, if useful, > once I've got it set up and producing the same problem. > > > Not yet, at this point we know that we were looking at an > incomplete > picture so what you need to do is capture raw logs and we > will be able > to get both a VXID and a timestamp from the ExpKill records > (although > the timestamp for EXP_expire puzzles me). > > See man varnishlog to see how to write (-w) and then read > (-r) logs > to/from a file. When you notice the alleged bug, note the > transaction > id and write the current logs (with the -d option) so that > you can > pick up all the interesting bits at rest (instead of doing > it on live > traffic). > > I can say that in my case there is definitely no Age > header coming from the > back-end. Also as shown in the example I sent it is the > 7th HIT on that > object. > > > Yes, smells like a bug. 
But before capturing logs, make sure > to remove > Hash records from the vsl_mask (man varnishd) so we can > confirm what's > being purged too. > > I have a theory, a long shot that will only prove how > unfamiliar I am > with this part of Varnish. Since the purge moves the object > to the > expiry inbox, it could be that under load the restart may happen > before the expiry thread marks it as expired, thus creating > a race > with the next lookup. > > Cheers, > Dridi > > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > > > From stefanobaldo at gmail.com Mon Jun 26 14:51:40 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Mon, 26 Jun 2017 11:51:40 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Guillaume. Thanks for answering. I'm using an SSD disk. I've changed from ext4 to ext2 to increase performance but it still keeps restarting. Also, I checked the I/O performance for the disk and there is no sign of overload. I've changed the /var/lib/varnish to a tmpfs and increased its 80m default size passing "-l 200m,20m" to varnishd and using "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a problem here. After a couple of hours varnish died and I received a "no space left on device" message - deleting the /var/lib/varnish solved the problem and varnish was up again, but it's weird because there was free memory on the host to be used with the tmpfs directory, so I don't know what could have happened. I will stop increasing the /var/lib/varnish size. Anyway, I am worried about the bans. You asked me if the bans are lurker friendly. Well, I don't think so.
My bans are created this way: ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot"); Are they lurker friendly? I was taking a quick look at the documentation and it looks like they're not. Best, Stefano On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Hi Stefano, > > Let's cover the usual suspects: I/Os. I think here Varnish gets stuck > trying to push/pull data and can't make time to reply to the CLI. I'd > recommend monitoring the disk activity (bandwidth and iops) to confirm. > > After some time, the file storage is terrible on a hard drive (SSDs take a > bit more time to degrade) because of fragmentation. One solution to help > the disks cope is to overprovision them if they're SSDs, and you can try > different options in the file storage definition on the command line (the last > parameter, after granularity). > > Is your /var/lib/varnish mounted on tmpfs? That could help too. > > 40K bans is a lot, are they ban-lurker friendly? > > -- > Guillaume Quintard > > On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo > wrote: > >> Hello. >> >> I am having a critical problem with Varnish Cache in production for over >> a month and any help will be appreciated. >> The problem is that Varnish child process is recurrently being restarted >> after 10~20h of use, with the following message: >> >> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >> responding to CLI, killed it.
>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: >> 400 CLI communication error >> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9 >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child >> starts >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 >> mmap'ed 483183820800 bytes of 483183820800 >> >> The following link is the varnishstat output just 1 minute before a >> restart: >> >> https://pastebin.com/g0g5RVTs >> >> Environment: >> >> varnish-5.1.2 revision 6ece695 >> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >> Installed using pre-built package from official repo at packagecloud.io >> CPU 2x2.9 GHz >> Mem 3.69 GiB >> Running inside a Docker container >> NFILES=131072 >> MEMLOCK=82000 >> >> Additional info: >> >> - I need to cache a large number of objets and the cache should last for >> almost a week, so I have set up a 450G storage space, I don't know if this >> is a problem; >> - I use ban a lot. There was about 40k bans in the system just before the >> last crash. I really don't know if this is too much or may have anything to >> do with it; >> - No registered CPU spikes (almost always by 30%); >> - No panic is reported, the only info I can retrieve is from syslog; >> - During all the time, event moments before the crashes, everything is >> okay and requests are being responded very fast. >> >> Best, >> Stefano Baldo >> >> >> _______________________________________________ >> varnish-misc mailing list >> varnish-misc at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >> > > -------------- next part -------------- An HTML attachment was scrubbed... 
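[Editor's illustration of the "ban-lurker friendly" distinction this thread revolves around. This is a toy model in Python, not Varnish source code; the dict fields and the `lurker_pass` function are invented for the example. The idea it sketches comes from the thread itself: a ban that tests only stored-object (obj.*) fields can be evaluated by a background thread and retired, while a ban that references req.* can only be tested when a client request actually looks the object up, so such bans pile up on the ban list.]

```python
# Toy model (illustration only, not Varnish internals).
# A "lurker-friendly" ban needs nothing but the cached object, so a
# background sweep can apply it to every object and then delete it.
# A ban mentioning req.* must wait for live lookups, so it survives
# every sweep -- which is how a ban list grows to 40k and never shrinks.

def lurker_pass(bans, objects):
    """One background sweep: apply and retire obj.*-only bans."""
    for ban in bans:
        if ban["uses_req"]:
            continue  # lurker has no request to test against; skip
        objects[:] = [o for o in objects if not ban["test"](o)]
    # Only the req.*-based bans remain pending after the sweep.
    return [b for b in bans if b["uses_req"]]

objects = [{"x-url": "example.com/a"}, {"x-url": "example.com/b"}]
bans = [
    # lurker-friendly: predicate uses only the stored object's headers
    {"uses_req": False, "test": lambda o: o["x-url"] == "example.com/a"},
    # not lurker-friendly: predicate would need the live request
    {"uses_req": True, "test": None},
]
remaining = lurker_pass(bans, objects)
print(len(remaining))   # the req.* ban is still pending
print(objects)          # the obj.* ban evicted example.com/a
```

This is why the advice later in the thread is to copy the host, URL, and User-Agent into beresp.http.* headers at fetch time and ban against those obj.http.* copies: the lurker can then work through the list in the background and MAIN.bans_deleted starts climbing.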
URL: From guillaume at varnish-software.com Mon Jun 26 15:43:54 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Mon, 26 Jun 2017 17:43:54 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Not lurker friendly at all indeed. You'll need to avoid req.* expression. Easiest way is to stash the host, user-agent and url in beresp.http.* and ban against those (unset them in vcl_deliver). I don't think you need to expand the VSL at all. -- Guillaume Quintard On Jun 26, 2017 16:51, "Stefano Baldo" wrote: Hi Guillaume. Thanks for answering. I'm using a SSD disk. I've changed from ext4 to ext2 to increase performance but it stills restarting. Also, I checked the I/O performance for the disk and there is no signal of overhead. I've changed the /var/lib/varnish to a tmpfs and increased its 80m default size passing "-l 200m,20m" to varnishd and using "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a problem here. After a couple of hours varnish died and I received a "no space left on device" message - deleting the /var/lib/varnish solved the problem and varnish was up again, but it's weird because there was free memory on the host to be used with the tmpfs directory, so I don't know what could have happened. I will try to stop increasing the /var/lib/varnish size. Anyway, I am worried about the bans. You asked me if the bans are lurker friedly. Well, I don't think so. My bans are created this way: ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot"); Are they lurker friendly? I was taking a quick look and the documentation and it looks like they're not. Best, Stefano On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Hi Stefano, > > Let's cover the usual suspects: I/Os. I think here Varnish gets stuck > trying to push/pull data and can't make time to reply to the CLI. 
I'd > recommend monitoring the disk activity (bandwidth and iops) to confirm. > > After some time, the file storage is terrible on a hard drive (SSDs take a > bit more time to degrade) because of fragmentation. One solution to help > the disks cope is to overprovision themif they're SSDs, and you can try > different advices in the file storage definition in the command line (last > parameter, after granularity). > > Is your /var/lib/varnish mount on tmpfs? That could help too. > > 40K bans is a lot, are they ban-lurker friendly? > > -- > Guillaume Quintard > > On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo > wrote: > >> Hello. >> >> I am having a critical problem with Varnish Cache in production for over >> a month and any help will be appreciated. >> The problem is that Varnish child process is recurrently being restarted >> after 10~20h of use, with the following message: >> >> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >> responding to CLI, killed it. >> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: >> 400 CLI communication error >> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9 >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child >> starts >> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 >> mmap'ed 483183820800 bytes of 483183820800 >> >> The following link is the varnishstat output just 1 minute before a >> restart: >> >> https://pastebin.com/g0g5RVTs >> >> Environment: >> >> varnish-5.1.2 revision 6ece695 >> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >> Installed using pre-built package from official repo at packagecloud.io >> CPU 2x2.9 GHz >> Mem 3.69 GiB >> Running inside a Docker container >> NFILES=131072 >> MEMLOCK=82000 >> >> Additional info: >> >> - I need to cache a large number of objets and 
the cache should last for >> almost a week, so I have set up a 450G storage space, I don't know if this >> is a problem; >> - I use ban a lot. There was about 40k bans in the system just before the >> last crash. I really don't know if this is too much or may have anything to >> do with it; >> - No registered CPU spikes (almost always by 30%); >> - No panic is reported, the only info I can retrieve is from syslog; >> - During all the time, event moments before the crashes, everything is >> okay and requests are being responded very fast. >> >> Best, >> Stefano Baldo >> >> >> _______________________________________________ >> varnish-misc mailing list >> varnish-misc at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Mon Jun 26 17:06:05 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Mon, 26 Jun 2017 14:06:05 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Guillaume, Can the following be considered "ban lurker friendly"? sub vcl_backend_response { set beresp.http.x-url = bereq.http.host + bereq.url; set beresp.http.x-user-agent = bereq.http.user-agent; } sub vcl_recv { if (req.method == "PURGE") { ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot"); return(synth(750)); } } sub vcl_deliver { unset resp.http.x-url; unset resp.http.x-user-agent; } Best, Stefano On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Not lurker friendly at all indeed. You'll need to avoid req.* expression. > Easiest way is to stash the host, user-agent and url in beresp.http.* and > ban against those (unset them in vcl_deliver). > > I don't think you need to expand the VSL at all. > > -- > Guillaume Quintard > > On Jun 26, 2017 16:51, "Stefano Baldo" wrote: > > Hi Guillaume. 
> > Thanks for answering. > > I'm using a SSD disk. I've changed from ext4 to ext2 to increase > performance but it stills restarting. > Also, I checked the I/O performance for the disk and there is no signal of > overhead. > > I've changed the /var/lib/varnish to a tmpfs and increased its 80m default > size passing "-l 200m,20m" to varnishd and using > "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a > problem here. After a couple of hours varnish died and I received a "no > space left on device" message - deleting the /var/lib/varnish solved the > problem and varnish was up again, but it's weird because there was free > memory on the host to be used with the tmpfs directory, so I don't know > what could have happened. I will try to stop increasing the > /var/lib/varnish size. > > Anyway, I am worried about the bans. You asked me if the bans are lurker > friedly. Well, I don't think so. My bans are created this way: > > ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " > && req.http.User-Agent !~ Googlebot"); > > Are they lurker friendly? I was taking a quick look and the documentation > and it looks like they're not. > > Best, > Stefano > > > On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Hi Stefano, >> >> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck >> trying to push/pull data and can't make time to reply to the CLI. I'd >> recommend monitoring the disk activity (bandwidth and iops) to confirm. >> >> After some time, the file storage is terrible on a hard drive (SSDs take >> a bit more time to degrade) because of fragmentation. One solution to help >> the disks cope is to overprovision themif they're SSDs, and you can try >> different advices in the file storage definition in the command line (last >> parameter, after granularity). >> >> Is your /var/lib/varnish mount on tmpfs? That could help too. 
>> >> 40K bans is a lot, are they ban-lurker friendly? >> >> -- >> Guillaume Quintard >> >> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo >> wrote: >> >>> Hello. >>> >>> I am having a critical problem with Varnish Cache in production for over >>> a month and any help will be appreciated. >>> The problem is that Varnish child process is recurrently being restarted >>> after 10~20h of use, with the following message: >>> >>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>> responding to CLI, killed it. >>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>> ping: 400 CLI communication error >>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9 >>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child >>> starts >>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 >>> mmap'ed 483183820800 bytes of 483183820800 >>> >>> The following link is the varnishstat output just 1 minute before a >>> restart: >>> >>> https://pastebin.com/g0g5RVTs >>> >>> Environment: >>> >>> varnish-5.1.2 revision 6ece695 >>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>> Installed using pre-built package from official repo at packagecloud.io >>> CPU 2x2.9 GHz >>> Mem 3.69 GiB >>> Running inside a Docker container >>> NFILES=131072 >>> MEMLOCK=82000 >>> >>> Additional info: >>> >>> - I need to cache a large number of objets and the cache should last for >>> almost a week, so I have set up a 450G storage space, I don't know if this >>> is a problem; >>> - I use ban a lot. There was about 40k bans in the system just before >>> the last crash. 
I really don't know if this is too much or may have >>> anything to do with it; >>> - No registered CPU spikes (almost always by 30%); >>> - No panic is reported, the only info I can retrieve is from syslog; >>> - During all the time, event moments before the crashes, everything is >>> okay and requests are being responded very fast. >>> >>> Best, >>> Stefano Baldo >>> >>> >>> _______________________________________________ >>> varnish-misc mailing list >>> varnish-misc at varnish-cache.org >>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>> >> >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Mon Jun 26 18:10:37 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Mon, 26 Jun 2017 20:10:37 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Looking good! -- Guillaume Quintard On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo wrote: > Hi Guillaume, > > Can the following be considered "ban lurker friendly"? > > sub vcl_backend_response { > set beresp.http.x-url = bereq.http.host + bereq.url; > set beresp.http.x-user-agent = bereq.http.user-agent; > } > > sub vcl_recv { > if (req.method == "PURGE") { > ban("obj.http.x-url == " + req.http.host + req.url + " && > obj.http.x-user-agent !~ Googlebot"); > return(synth(750)); > } > } > > sub vcl_deliver { > unset resp.http.x-url; > unset resp.http.x-user-agent; > } > > Best, > Stefano > > > On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Not lurker friendly at all indeed. You'll need to avoid req.* expression. >> Easiest way is to stash the host, user-agent and url in beresp.http.* and >> ban against those (unset them in vcl_deliver). >> >> I don't think you need to expand the VSL at all. >> >> -- >> Guillaume Quintard >> >> On Jun 26, 2017 16:51, "Stefano Baldo" wrote: >> >> Hi Guillaume. 
>> >> Thanks for answering. >> >> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >> performance but it stills restarting. >> Also, I checked the I/O performance for the disk and there is no signal >> of overhead. >> >> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >> default size passing "-l 200m,20m" to varnishd and using >> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a >> problem here. After a couple of hours varnish died and I received a "no >> space left on device" message - deleting the /var/lib/varnish solved the >> problem and varnish was up again, but it's weird because there was free >> memory on the host to be used with the tmpfs directory, so I don't know >> what could have happened. I will try to stop increasing the >> /var/lib/varnish size. >> >> Anyway, I am worried about the bans. You asked me if the bans are lurker >> friedly. Well, I don't think so. My bans are created this way: >> >> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " >> && req.http.User-Agent !~ Googlebot"); >> >> Are they lurker friendly? I was taking a quick look and the documentation >> and it looks like they're not. >> >> Best, >> Stefano >> >> >> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Hi Stefano, >>> >>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck >>> trying to push/pull data and can't make time to reply to the CLI. I'd >>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>> >>> After some time, the file storage is terrible on a hard drive (SSDs take >>> a bit more time to degrade) because of fragmentation. One solution to help >>> the disks cope is to overprovision themif they're SSDs, and you can try >>> different advices in the file storage definition in the command line (last >>> parameter, after granularity). >>> >>> Is your /var/lib/varnish mount on tmpfs? 
That could help too. >>> >>> 40K bans is a lot, are they ban-lurker friendly? >>> >>> -- >>> Guillaume Quintard >>> >>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo >>> wrote: >>> >>>> Hello. >>>> >>>> I am having a critical problem with Varnish Cache in production for >>>> over a month and any help will be appreciated. >>>> The problem is that Varnish child process is recurrently being >>>> restarted after 10~20h of use, with the following message: >>>> >>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>> responding to CLI, killed it. >>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>>> ping: 400 CLI communication error >>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>> signal=9 >>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child >>>> starts >>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 >>>> mmap'ed 483183820800 bytes of 483183820800 >>>> >>>> The following link is the varnishstat output just 1 minute before a >>>> restart: >>>> >>>> https://pastebin.com/g0g5RVTs >>>> >>>> Environment: >>>> >>>> varnish-5.1.2 revision 6ece695 >>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>> Installed using pre-built package from official repo at packagecloud.io >>>> CPU 2x2.9 GHz >>>> Mem 3.69 GiB >>>> Running inside a Docker container >>>> NFILES=131072 >>>> MEMLOCK=82000 >>>> >>>> Additional info: >>>> >>>> - I need to cache a large number of objets and the cache should last >>>> for almost a week, so I have set up a 450G storage space, I don't know if >>>> this is a problem; >>>> - I use ban a lot. There was about 40k bans in the system just before >>>> the last crash. 
I really don't know if this is too much or may have >>>> anything to do with it; >>>> - No registered CPU spikes (almost always by 30%); >>>> - No panic is reported, the only info I can retrieve is from syslog; >>>> - During all the time, even moments before the crashes, everything is >>>> okay and requests are being responded to very fast. >>>> >>>> Best, >>>> Stefano Baldo >>>> >>>> >>>> _______________________________________________ >>>> varnish-misc mailing list >>>> varnish-misc at varnish-cache.org >>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>> >>> >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Mon Jun 26 18:21:43 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Mon, 26 Jun 2017 15:21:43 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Guillaume. I think things will start going better now after changing the bans. This is what my last varnishstat looked like moments before a crash regarding the bans: MAIN.bans 41336 . Count of bans MAIN.bans_completed 37967 . Number of bans marked 'completed' MAIN.bans_obj 0 . Number of bans using obj.* MAIN.bans_req 41335 . Number of bans using req.* MAIN.bans_added 41336 0.68 Bans added MAIN.bans_deleted 0 0.00 Bans deleted And this is what it looks like now: MAIN.bans 2 . Count of bans MAIN.bans_completed 1 . Number of bans marked 'completed' MAIN.bans_obj 2 . Number of bans using obj.* MAIN.bans_req 0 . Number of bans using req.* MAIN.bans_added 2016 0.69 Bans added MAIN.bans_deleted 2014 0.69 Bans deleted Before the changes, bans were never deleted! Now the bans are added and quickly deleted after a minute or even a couple of seconds. May this have been the cause of the problem? It seems like varnish had a large number of bans to manage and test against. I will let it ride now. Let's see if the problem persists or it's gone! 
:-) Best, Stefano On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Looking good! > > -- > Guillaume Quintard > > On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo > wrote: > >> Hi Guillaume, >> >> Can the following be considered "ban lurker friendly"? >> >> sub vcl_backend_response { >> set beresp.http.x-url = bereq.http.host + bereq.url; >> set beresp.http.x-user-agent = bereq.http.user-agent; >> } >> >> sub vcl_recv { >> if (req.method == "PURGE") { >> ban("obj.http.x-url == " + req.http.host + req.url + " && >> obj.http.x-user-agent !~ Googlebot"); >> return(synth(750)); >> } >> } >> >> sub vcl_deliver { >> unset resp.http.x-url; >> unset resp.http.x-user-agent; >> } >> >> Best, >> Stefano >> >> >> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Not lurker friendly at all indeed. You'll need to avoid req.* >>> expression. Easiest way is to stash the host, user-agent and url in >>> beresp.http.* and ban against those (unset them in vcl_deliver). >>> >>> I don't think you need to expand the VSL at all. >>> >>> -- >>> Guillaume Quintard >>> >>> On Jun 26, 2017 16:51, "Stefano Baldo" wrote: >>> >>> Hi Guillaume. >>> >>> Thanks for answering. >>> >>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>> performance but it stills restarting. >>> Also, I checked the I/O performance for the disk and there is no signal >>> of overhead. >>> >>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>> default size passing "-l 200m,20m" to varnishd and using >>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a >>> problem here. 
After a couple of hours varnish died and I received a "no >>> space left on device" message - deleting the /var/lib/varnish solved the >>> problem and varnish was up again, but it's weird because there was free >>> memory on the host to be used with the tmpfs directory, so I don't know >>> what could have happened. I will try to stop increasing the >>> /var/lib/varnish size. >>> >>> Anyway, I am worried about the bans. You asked me if the bans are lurker >>> friedly. Well, I don't think so. My bans are created this way: >>> >>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " >>> && req.http.User-Agent !~ Googlebot"); >>> >>> Are they lurker friendly? I was taking a quick look and the >>> documentation and it looks like they're not. >>> >>> Best, >>> Stefano >>> >>> >>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>> >>>> Hi Stefano, >>>> >>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck >>>> trying to push/pull data and can't make time to reply to the CLI. I'd >>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>> >>>> After some time, the file storage is terrible on a hard drive (SSDs >>>> take a bit more time to degrade) because of fragmentation. One solution to >>>> help the disks cope is to overprovision themif they're SSDs, and you can >>>> try different advices in the file storage definition in the command line >>>> (last parameter, after granularity). >>>> >>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>> >>>> 40K bans is a lot, are they ban-lurker friendly? >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo >>>> wrote: >>>> >>>>> Hello. >>>>> >>>>> I am having a critical problem with Varnish Cache in production for >>>>> over a month and any help will be appreciated. 
>>>>> The problem is that Varnish child process is recurrently being >>>>> restarted after 10~20h of use, with the following message: >>>>> >>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>> responding to CLI, killed it. >>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>>>> ping: 400 CLI communication error >>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>> signal=9 >>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child >>>>> starts >>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>> >>>>> The following link is the varnishstat output just 1 minute before a >>>>> restart: >>>>> >>>>> https://pastebin.com/g0g5RVTs >>>>> >>>>> Environment: >>>>> >>>>> varnish-5.1.2 revision 6ece695 >>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>> Installed using pre-built package from official repo at >>>>> packagecloud.io >>>>> CPU 2x2.9 GHz >>>>> Mem 3.69 GiB >>>>> Running inside a Docker container >>>>> NFILES=131072 >>>>> MEMLOCK=82000 >>>>> >>>>> Additional info: >>>>> >>>>> - I need to cache a large number of objets and the cache should last >>>>> for almost a week, so I have set up a 450G storage space, I don't know if >>>>> this is a problem; >>>>> - I use ban a lot. There was about 40k bans in the system just before >>>>> the last crash. I really don't know if this is too much or may have >>>>> anything to do with it; >>>>> - No registered CPU spikes (almost always by 30%); >>>>> - No panic is reported, the only info I can retrieve is from syslog; >>>>> - During all the time, event moments before the crashes, everything is >>>>> okay and requests are being responded very fast. 
>>>>> >>>>> Best, >>>>> Stefano Baldo >>>>> >>>>> >>>>> _______________________________________________ >>>>> varnish-misc mailing list >>>>> varnish-misc at varnish-cache.org >>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>> >>>> >>>> >>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Mon Jun 26 18:47:33 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Mon, 26 Jun 2017 20:47:33 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Nice! It may have been the cause; time will tell. Can you report back in a few days to let us know? -- Guillaume Quintard On Jun 26, 2017 20:21, "Stefano Baldo" wrote: > Hi Guillaume. > > I think things will start to going better now after changing the bans. > This is how my last varnishstat looked like moments before a crash > regarding the bans: > > MAIN.bans 41336 . Count of bans > MAIN.bans_completed 37967 . Number of bans marked > 'completed' > MAIN.bans_obj 0 . Number of bans using obj.* > MAIN.bans_req 41335 . Number of bans using req.* > MAIN.bans_added 41336 0.68 Bans added > MAIN.bans_deleted 0 0.00 Bans deleted > > And this is how it looks like now: > > MAIN.bans 2 . Count of bans > MAIN.bans_completed 1 . Number of bans marked > 'completed' > MAIN.bans_obj 2 . Number of bans using obj.* > MAIN.bans_req 0 . Number of bans using req.* > MAIN.bans_added 2016 0.69 Bans added > MAIN.bans_deleted 2014 0.69 Bans deleted > > Before the changes, bans were never deleted! > Now the bans are added and quickly deleted after a minute or even a couple > of seconds. > > May this was the cause of the problem? It seems like varnish was having a > large number of bans to manage and test against. > I will let it ride now. Let's see if the problem persists or it's gone! 
:-) > > Best, > Stefano > > > On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Looking good! >> >> -- >> Guillaume Quintard >> >> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo >> wrote: >> >>> Hi Guillaume, >>> >>> Can the following be considered "ban lurker friendly"? >>> >>> sub vcl_backend_response { >>> set beresp.http.x-url = bereq.http.host + bereq.url; >>> set beresp.http.x-user-agent = bereq.http.user-agent; >>> } >>> >>> sub vcl_recv { >>> if (req.method == "PURGE") { >>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>> obj.http.x-user-agent !~ Googlebot"); >>> return(synth(750)); >>> } >>> } >>> >>> sub vcl_deliver { >>> unset resp.http.x-url; >>> unset resp.http.x-user-agent; >>> } >>> >>> Best, >>> Stefano >>> >>> >>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>> >>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>> expression. Easiest way is to stash the host, user-agent and url in >>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>> >>>> I don't think you need to expand the VSL at all. >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Jun 26, 2017 16:51, "Stefano Baldo" wrote: >>>> >>>> Hi Guillaume. >>>> >>>> Thanks for answering. >>>> >>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>> performance but it stills restarting. >>>> Also, I checked the I/O performance for the disk and there is no signal >>>> of overhead. >>>> >>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>> default size passing "-l 200m,20m" to varnishd and using >>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a >>>> problem here. 
After a couple of hours varnish died and I received a "no >>>> space left on device" message - deleting the /var/lib/varnish solved the >>>> problem and varnish was up again, but it's weird because there was free >>>> memory on the host to be used with the tmpfs directory, so I don't know >>>> what could have happened. I will try to stop increasing the >>>> /var/lib/varnish size. >>>> >>>> Anyway, I am worried about the bans. You asked me if the bans are >>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>> >>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + >>>> " && req.http.User-Agent !~ Googlebot"); >>>> >>>> Are they lurker friendly? I was taking a quick look and the >>>> documentation and it looks like they're not. >>>> >>>> Best, >>>> Stefano >>>> >>>> >>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Hi Stefano, >>>>> >>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck >>>>> trying to push/pull data and can't make time to reply to the CLI. I'd >>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>> >>>>> After some time, the file storage is terrible on a hard drive (SSDs >>>>> take a bit more time to degrade) because of fragmentation. One solution to >>>>> help the disks cope is to overprovision themif they're SSDs, and you can >>>>> try different advices in the file storage definition in the command line >>>>> (last parameter, after granularity). >>>>> >>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>> >>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>> >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo >>>> > wrote: >>>>> >>>>>> Hello. >>>>>> >>>>>> I am having a critical problem with Varnish Cache in production for >>>>>> over a month and any help will be appreciated. 
>>>>>> The problem is that Varnish child process is recurrently being >>>>>> restarted after 10~20h of use, with the following message: >>>>>> >>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>> responding to CLI, killed it. >>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>>>>> ping: 400 CLI communication error >>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>> signal=9 >>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>> Child starts >>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>> >>>>>> The following link is the varnishstat output just 1 minute before a >>>>>> restart: >>>>>> >>>>>> https://pastebin.com/g0g5RVTs >>>>>> >>>>>> Environment: >>>>>> >>>>>> varnish-5.1.2 revision 6ece695 >>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>> Installed using pre-built package from official repo at >>>>>> packagecloud.io >>>>>> CPU 2x2.9 GHz >>>>>> Mem 3.69 GiB >>>>>> Running inside a Docker container >>>>>> NFILES=131072 >>>>>> MEMLOCK=82000 >>>>>> >>>>>> Additional info: >>>>>> >>>>>> - I need to cache a large number of objets and the cache should last >>>>>> for almost a week, so I have set up a 450G storage space, I don't know if >>>>>> this is a problem; >>>>>> - I use ban a lot. There was about 40k bans in the system just before >>>>>> the last crash. I really don't know if this is too much or may have >>>>>> anything to do with it; >>>>>> - No registered CPU spikes (almost always by 30%); >>>>>> - No panic is reported, the only info I can retrieve is from syslog; >>>>>> - During all the time, event moments before the crashes, everything >>>>>> is okay and requests are being responded very fast. 
>>>>>> >>>>>> Best, >>>>>> Stefano Baldo >>>>>> >>>>>> >>>>>> _______________________________________________ >>>>>> varnish-misc mailing list >>>>>> varnish-misc at varnish-cache.org >>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>>> >>>>> >>>>> >>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Mon Jun 26 19:08:54 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Mon, 26 Jun 2017 16:08:54 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Sure, will do! On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Nice! It may have been the cause, time will tell. Can you report back in a > few days to let us know? > -- > Guillaume Quintard > > On Jun 26, 2017 20:21, "Stefano Baldo" wrote: > >> Hi Guillaume. >> >> I think things will start to go better now after changing the bans. >> This is what my last varnishstat looked like moments before a crash, >> regarding the bans: >> >> MAIN.bans 41336 . Count of bans >> MAIN.bans_completed 37967 . Number of bans marked >> 'completed' >> MAIN.bans_obj 0 . Number of bans using >> obj.* >> MAIN.bans_req 41335 . Number of bans using >> req.* >> MAIN.bans_added 41336 0.68 Bans added >> MAIN.bans_deleted 0 0.00 Bans deleted >> >> And this is what it looks like now: >> >> MAIN.bans 2 . Count of bans >> MAIN.bans_completed 1 . Number of bans marked >> 'completed' >> MAIN.bans_obj 2 . Number of bans using >> obj.* >> MAIN.bans_req 0 . Number of bans using >> req.* >> MAIN.bans_added 2016 0.69 Bans added >> MAIN.bans_deleted 2014 0.69 Bans deleted >> >> Before the changes, bans were never deleted! >> Now the bans are added and quickly deleted after a minute or even a >> couple of seconds. >> >> Maybe this was the cause of the problem? It seems like varnish was having a >> large number of bans to manage and test against. >> I will let it ride now. Let's see if the problem persists or it's gone! >> :-) >> >> Best, >> Stefano >> >> >> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Looking good! >>> >>> -- >>> Guillaume Quintard >>> >>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo >>> wrote: >>> >>>> Hi Guillaume, >>>> >>>> Can the following be considered "ban lurker friendly"? >>>> >>>> sub vcl_backend_response { >>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>> } >>>> >>>> sub vcl_recv { >>>> if (req.method == "PURGE") { >>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>> obj.http.x-user-agent !~ Googlebot"); >>>> return(synth(750)); >>>> } >>>> } >>>> >>>> sub vcl_deliver { >>>> unset resp.http.x-url; >>>> unset resp.http.x-user-agent; >>>> } >>>> >>>> Best, >>>> Stefano >>>> >>>> >>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>> expressions. The easiest way is to stash the host, user-agent and url in >>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>> >>>>> I don't think you need to expand the VSL at all. >>>>> >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Jun 26, 2017 16:51, "Stefano Baldo" wrote: >>>>> >>>>> Hi Guillaume. >>>>> >>>>> Thanks for answering. >>>>> >>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase >>>>> performance but it still keeps restarting. >>>>> Also, I checked the I/O performance for the disk and there is no >>>>> sign of overload. >>>>> >>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>> default size passing "-l 200m,20m" to varnishd and using >>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a >>>>> problem here. 
After a couple of hours varnish died and I received a "no >>>>> space left on device" message - deleting the /var/lib/varnish solved the >>>>> problem and varnish was up again, but it's weird because there was free >>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>> what could have happened. I will try to stop increasing the >>>>> /var/lib/varnish size. >>>>> >>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>>> >>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + >>>>> " && req.http.User-Agent !~ Googlebot"); >>>>> >>>>> Are they lurker friendly? I was taking a quick look and the >>>>> documentation and it looks like they're not. >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> >>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Hi Stefano, >>>>>> >>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck >>>>>> trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>> >>>>>> After some time, the file storage is terrible on a hard drive (SSDs >>>>>> take a bit more time to degrade) because of fragmentation. One solution to >>>>>> help the disks cope is to overprovision themif they're SSDs, and you can >>>>>> try different advices in the file storage definition in the command line >>>>>> (last parameter, after granularity). >>>>>> >>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>> >>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>> >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>> stefanobaldo at gmail.com> wrote: >>>>>> >>>>>>> Hello. 
>>>>>>> >>>>>>> I am having a critical problem with Varnish Cache in production for >>>>>>> over a month and any help will be appreciated. >>>>>>> The problem is that Varnish child process is recurrently being >>>>>>> restarted after 10~20h of use, with the following message: >>>>>>> >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>> responding to CLI, killed it. >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>>>>>> ping: 400 CLI communication error >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>> signal=9 >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>> Child starts >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>> >>>>>>> The following link is the varnishstat output just 1 minute before a >>>>>>> restart: >>>>>>> >>>>>>> https://pastebin.com/g0g5RVTs >>>>>>> >>>>>>> Environment: >>>>>>> >>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>> Installed using pre-built package from official repo at >>>>>>> packagecloud.io >>>>>>> CPU 2x2.9 GHz >>>>>>> Mem 3.69 GiB >>>>>>> Running inside a Docker container >>>>>>> NFILES=131072 >>>>>>> MEMLOCK=82000 >>>>>>> >>>>>>> Additional info: >>>>>>> >>>>>>> - I need to cache a large number of objets and the cache should last >>>>>>> for almost a week, so I have set up a 450G storage space, I don't know if >>>>>>> this is a problem; >>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>> before the last crash. 
I really don't know if this is too much or may have >>>>>>> anything to do with it; >>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>> - No panic is reported, the only info I can retrieve is from syslog; >>>>>>> - During all the time, event moments before the crashes, everything >>>>>>> is okay and requests are being responded very fast. >>>>>>> >>>>>>> Best, >>>>>>> Stefano Baldo >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> varnish-misc mailing list >>>>>>> varnish-misc at varnish-cache.org >>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Tue Jun 27 12:37:54 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Tue, 27 Jun 2017 09:37:54 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Guillaume, FYI, restarted again after ~16h :-( Uptime mgt: 0+17:48:50 Uptime child: 0+02:17:14 Best, Stefano On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Nice! It may have been the cause, time will tell.can you report back in a > few days to let us know? > -- > Guillaume Quintard > > On Jun 26, 2017 20:21, "Stefano Baldo" wrote: > >> Hi Guillaume. >> >> I think things will start to going better now after changing the bans. >> This is how my last varnishstat looked like moments before a crash >> regarding the bans: >> >> MAIN.bans 41336 . Count of bans >> MAIN.bans_completed 37967 . Number of bans marked >> 'completed' >> MAIN.bans_obj 0 . Number of bans using >> obj.* >> MAIN.bans_req 41335 . Number of bans using >> req.* >> MAIN.bans_added 41336 0.68 Bans added >> MAIN.bans_deleted 0 0.00 Bans deleted >> >> And this is how it looks like now: >> >> MAIN.bans 2 . Count of bans >> MAIN.bans_completed 1 . 
Number of bans marked >> 'completed' >> MAIN.bans_obj 2 . Number of bans using >> obj.* >> MAIN.bans_req 0 . Number of bans using >> req.* >> MAIN.bans_added 2016 0.69 Bans added >> MAIN.bans_deleted 2014 0.69 Bans deleted >> >> Before the changes, bans were never deleted! >> Now the bans are added and quickly deleted after a minute or even a >> couple of seconds. >> >> May this was the cause of the problem? It seems like varnish was having a >> large number of bans to manage and test against. >> I will let it ride now. Let's see if the problem persists or it's gone! >> :-) >> >> Best, >> Stefano >> >> >> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Looking good! >>> >>> -- >>> Guillaume Quintard >>> >>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo >>> wrote: >>> >>>> Hi Guillaume, >>>> >>>> Can the following be considered "ban lurker friendly"? >>>> >>>> sub vcl_backend_response { >>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>> } >>>> >>>> sub vcl_recv { >>>> if (req.method == "PURGE") { >>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>> obj.http.x-user-agent !~ Googlebot"); >>>> return(synth(750)); >>>> } >>>> } >>>> >>>> sub vcl_deliver { >>>> unset resp.http.x-url; >>>> unset resp.http.x-user-agent; >>>> } >>>> >>>> Best, >>>> Stefano >>>> >>>> >>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>> >>>>> I don't think you need to expand the VSL at all. >>>>> >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Jun 26, 2017 16:51, "Stefano Baldo" wrote: >>>>> >>>>> Hi Guillaume. >>>>> >>>>> Thanks for answering. 
>>>>> >>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>> performance but it stills restarting. >>>>> Also, I checked the I/O performance for the disk and there is no >>>>> signal of overhead. >>>>> >>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>> default size passing "-l 200m,20m" to varnishd and using >>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a >>>>> problem here. After a couple of hours varnish died and I received a "no >>>>> space left on device" message - deleting the /var/lib/varnish solved the >>>>> problem and varnish was up again, but it's weird because there was free >>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>> what could have happened. I will try to stop increasing the >>>>> /var/lib/varnish size. >>>>> >>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>>> >>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + >>>>> " && req.http.User-Agent !~ Googlebot"); >>>>> >>>>> Are they lurker friendly? I was taking a quick look and the >>>>> documentation and it looks like they're not. >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> >>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Hi Stefano, >>>>>> >>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck >>>>>> trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>> >>>>>> After some time, the file storage is terrible on a hard drive (SSDs >>>>>> take a bit more time to degrade) because of fragmentation. 
One solution to >>>>>> help the disks cope is to overprovision themif they're SSDs, and you can >>>>>> try different advices in the file storage definition in the command line >>>>>> (last parameter, after granularity). >>>>>> >>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>> >>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>> >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>> stefanobaldo at gmail.com> wrote: >>>>>> >>>>>>> Hello. >>>>>>> >>>>>>> I am having a critical problem with Varnish Cache in production for >>>>>>> over a month and any help will be appreciated. >>>>>>> The problem is that Varnish child process is recurrently being >>>>>>> restarted after 10~20h of use, with the following message: >>>>>>> >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>> responding to CLI, killed it. >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>>>>>> ping: 400 CLI communication error >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>> signal=9 >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>> Child starts >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>> >>>>>>> The following link is the varnishstat output just 1 minute before a >>>>>>> restart: >>>>>>> >>>>>>> https://pastebin.com/g0g5RVTs >>>>>>> >>>>>>> Environment: >>>>>>> >>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>> Installed using pre-built package from official repo at >>>>>>> packagecloud.io >>>>>>> CPU 2x2.9 GHz >>>>>>> Mem 3.69 GiB >>>>>>> Running inside a Docker container >>>>>>> NFILES=131072 >>>>>>> MEMLOCK=82000 
>>>>>>> >>>>>>> Additional info: >>>>>>> >>>>>>> - I need to cache a large number of objets and the cache should last >>>>>>> for almost a week, so I have set up a 450G storage space, I don't know if >>>>>>> this is a problem; >>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>> anything to do with it; >>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>> - No panic is reported, the only info I can retrieve is from syslog; >>>>>>> - During all the time, event moments before the crashes, everything >>>>>>> is okay and requests are being responded very fast. >>>>>>> >>>>>>> Best, >>>>>>> Stefano Baldo >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> varnish-misc mailing list >>>>>>> varnish-misc at varnish-cache.org >>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Tue Jun 27 21:07:31 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Tue, 27 Jun 2017 18:07:31 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Guillaume. It keeps restarting. Would you mind taking a quick look in the following VCL file to check if you find anything suspicious? Thank you very much. Best, Stefano vcl 4.0; import std; backend default { .host = "sites-web-server-lb"; .port = "80"; } include "/etc/varnish/bad_bot_detection.vcl"; sub vcl_recv { call bad_bot_detection; if (req.url == "/nocache" || req.url == "/version") { return(pass); } unset req.http.Cookie; if (req.method == "PURGE") { ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot"); return(synth(750)); } set req.url = regsuball(req.url, "(? 
" + req.url); return(deliver); } elsif (resp.status == 501) { set resp.status = 200; set resp.http.Content-Type = "text/html; charset=utf-8"; synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); return(deliver); } } sub vcl_backend_response { unset beresp.http.Set-Cookie; set beresp.http.x-host = bereq.http.host; set beresp.http.x-user-agent = bereq.http.user-agent; if (bereq.url == "/themes/basic/assets/theme.min.css" || bereq.url == "/api/events/PAGEVIEW" || bereq.url ~ "^\/assets\/img\/") { set beresp.http.Cache-Control = "max-age=0"; } else { unset beresp.http.Cache-Control; } if (beresp.status == 200 || beresp.status == 301 || beresp.status == 302 || beresp.status == 404) { if (bereq.url ~ "\&ordenar=aleatorio$") { set beresp.http.X-TTL = "1d"; set beresp.ttl = 1d; } else { set beresp.http.X-TTL = "1w"; set beresp.ttl = 1w; } } if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") { set beresp.do_gzip = true; } } sub vcl_pipe { set bereq.http.connection = "close"; return (pipe); } sub vcl_deliver { unset resp.http.x-host; unset resp.http.x-user-agent; } sub vcl_backend_error { if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) { set beresp.status = 200; set beresp.http.Content-Type = "text/html; charset=utf-8"; synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); return (deliver); } } sub vcl_hash { if (req.http.User-Agent ~ "Google Page Speed") { hash_data("Google Page Speed"); } elsif (req.http.User-Agent ~ "Googlebot") { hash_data("Googlebot"); } } sub vcl_deliver { if (resp.status == 501) { return (synth(resp.status)); } if (obj.hits > 0) { set resp.http.X-Cache = "hit"; } else { set resp.http.X-Cache = "miss"; } } On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Nice! It may have been the cause, time will tell.can you report back in a > few days to let us know? 
> -- > Guillaume Quintard > > On Jun 26, 2017 20:21, "Stefano Baldo" wrote: > >> Hi Guillaume. >> >> I think things will start to going better now after changing the bans. >> This is how my last varnishstat looked like moments before a crash >> regarding the bans: >> >> MAIN.bans 41336 . Count of bans >> MAIN.bans_completed 37967 . Number of bans marked >> 'completed' >> MAIN.bans_obj 0 . Number of bans using >> obj.* >> MAIN.bans_req 41335 . Number of bans using >> req.* >> MAIN.bans_added 41336 0.68 Bans added >> MAIN.bans_deleted 0 0.00 Bans deleted >> >> And this is how it looks like now: >> >> MAIN.bans 2 . Count of bans >> MAIN.bans_completed 1 . Number of bans marked >> 'completed' >> MAIN.bans_obj 2 . Number of bans using >> obj.* >> MAIN.bans_req 0 . Number of bans using >> req.* >> MAIN.bans_added 2016 0.69 Bans added >> MAIN.bans_deleted 2014 0.69 Bans deleted >> >> Before the changes, bans were never deleted! >> Now the bans are added and quickly deleted after a minute or even a >> couple of seconds. >> >> May this was the cause of the problem? It seems like varnish was having a >> large number of bans to manage and test against. >> I will let it ride now. Let's see if the problem persists or it's gone! >> :-) >> >> Best, >> Stefano >> >> >> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Looking good! >>> >>> -- >>> Guillaume Quintard >>> >>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo >>> wrote: >>> >>>> Hi Guillaume, >>>> >>>> Can the following be considered "ban lurker friendly"? 
>>>> >>>> sub vcl_backend_response { >>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>> } >>>> >>>> sub vcl_recv { >>>> if (req.method == "PURGE") { >>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>> obj.http.x-user-agent !~ Googlebot"); >>>> return(synth(750)); >>>> } >>>> } >>>> >>>> sub vcl_deliver { >>>> unset resp.http.x-url; >>>> unset resp.http.x-user-agent; >>>> } >>>> >>>> Best, >>>> Stefano >>>> >>>> >>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>> >>>>> I don't think you need to expand the VSL at all. >>>>> >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Jun 26, 2017 16:51, "Stefano Baldo" wrote: >>>>> >>>>> Hi Guillaume. >>>>> >>>>> Thanks for answering. >>>>> >>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>> performance but it stills restarting. >>>>> Also, I checked the I/O performance for the disk and there is no >>>>> signal of overhead. >>>>> >>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>> default size passing "-l 200m,20m" to varnishd and using >>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a >>>>> problem here. After a couple of hours varnish died and I received a "no >>>>> space left on device" message - deleting the /var/lib/varnish solved the >>>>> problem and varnish was up again, but it's weird because there was free >>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>> what could have happened. I will try to stop increasing the >>>>> /var/lib/varnish size. >>>>> >>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>> lurker friedly. 
Well, I don't think so. My bans are created this way: >>>>> >>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + >>>>> " && req.http.User-Agent !~ Googlebot"); >>>>> >>>>> Are they lurker friendly? I was taking a quick look and the >>>>> documentation and it looks like they're not. >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> >>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Hi Stefano, >>>>>> >>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets stuck >>>>>> trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>> >>>>>> After some time, the file storage is terrible on a hard drive (SSDs >>>>>> take a bit more time to degrade) because of fragmentation. One solution to >>>>>> help the disks cope is to overprovision themif they're SSDs, and you can >>>>>> try different advices in the file storage definition in the command line >>>>>> (last parameter, after granularity). >>>>>> >>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>> >>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>> >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>> stefanobaldo at gmail.com> wrote: >>>>>> >>>>>>> Hello. >>>>>>> >>>>>>> I am having a critical problem with Varnish Cache in production for >>>>>>> over a month and any help will be appreciated. >>>>>>> The problem is that Varnish child process is recurrently being >>>>>>> restarted after 10~20h of use, with the following message: >>>>>>> >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>> responding to CLI, killed it. 
>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>>>>>> ping: 400 CLI communication error >>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>> signal=9 >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>> Child starts >>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>> >>>>>>> The following link is the varnishstat output just 1 minute before a >>>>>>> restart: >>>>>>> >>>>>>> https://pastebin.com/g0g5RVTs >>>>>>> >>>>>>> Environment: >>>>>>> >>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>> Installed using pre-built package from official repo at >>>>>>> packagecloud.io >>>>>>> CPU 2x2.9 GHz >>>>>>> Mem 3.69 GiB >>>>>>> Running inside a Docker container >>>>>>> NFILES=131072 >>>>>>> MEMLOCK=82000 >>>>>>> >>>>>>> Additional info: >>>>>>> >>>>>>> - I need to cache a large number of objets and the cache should last >>>>>>> for almost a week, so I have set up a 450G storage space, I don't know if >>>>>>> this is a problem; >>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>> anything to do with it; >>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>> - No panic is reported, the only info I can retrieve is from syslog; >>>>>>> - During all the time, event moments before the crashes, everything >>>>>>> is okay and requests are being responded very fast. 
>>>>>>> >>>>>>> Best, >>>>>>> Stefano Baldo >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> varnish-misc mailing list >>>>>>> varnish-misc at varnish-cache.org >>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>>>> >>>>>> >>>>>> >>>>> >>>>> >>>> >>> >> -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Wed Jun 28 07:12:46 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Wed, 28 Jun 2017 09:12:46 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Sadly, nothing suspicious here, you can still try: - bumping the cli_timeout - changing your disk scheduler - changing the advice option of the file storage I'm still convinced this is due to Varnish getting stuck waiting for the disk because of the file storage fragmentation. Maybe you could look at SMF.*.g_alloc and compare it to the number of objects. Ideally, we would have a 1:1 relation between objects and allocations. If that number drops prior to a restart, that would be a good clue. -- Guillaume Quintard On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo wrote: > Hi Guillaume. > > It keeps restarting. > Would you mind taking a quick look in the following VCL file to check if > you find anything suspicious? > > Thank you very much. > > Best, > Stefano > > vcl 4.0; > > import std; > > backend default { > .host = "sites-web-server-lb"; > .port = "80"; > } > > include "/etc/varnish/bad_bot_detection.vcl"; > > sub vcl_recv { > call bad_bot_detection; > > if (req.url == "/nocache" || req.url == "/version") { > return(pass); > } > > unset req.http.Cookie; > if (req.method == "PURGE") { > ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent > !~ Googlebot"); > return(synth(750)); > } > > set req.url = regsuball(req.url, "(? 
} > > sub vcl_synth { > if (resp.status == 750) { > set resp.status = 200; > synthetic("PURGED => " + req.url); > return(deliver); > } elsif (resp.status == 501) { > set resp.status = 200; > set resp.http.Content-Type = "text/html; charset=utf-8"; > synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); > return(deliver); > } > } > > sub vcl_backend_response { > unset beresp.http.Set-Cookie; > set beresp.http.x-host = bereq.http.host; > set beresp.http.x-user-agent = bereq.http.user-agent; > > if (bereq.url == "/themes/basic/assets/theme.min.css" > || bereq.url == "/api/events/PAGEVIEW" > || bereq.url ~ "^\/assets\/img\/") { > set beresp.http.Cache-Control = "max-age=0"; > } else { > unset beresp.http.Cache-Control; > } > > if (beresp.status == 200 || > beresp.status == 301 || > beresp.status == 302 || > beresp.status == 404) { > if (bereq.url ~ "\&ordenar=aleatorio$") { > set beresp.http.X-TTL = "1d"; > set beresp.ttl = 1d; > } else { > set beresp.http.X-TTL = "1w"; > set beresp.ttl = 1w; > } > } > > if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") > { > set beresp.do_gzip = true; > } > } > > sub vcl_pipe { > set bereq.http.connection = "close"; > return (pipe); > } > > sub vcl_deliver { > unset resp.http.x-host; > unset resp.http.x-user-agent; > } > > sub vcl_backend_error { > if (beresp.status == 502 || beresp.status == 503 || beresp.status == > 504) { > set beresp.status = 200; > set beresp.http.Content-Type = "text/html; charset=utf-8"; > synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); > return (deliver); > } > } > > sub vcl_hash { > if (req.http.User-Agent ~ "Google Page Speed") { > hash_data("Google Page Speed"); > } elsif (req.http.User-Agent ~ "Googlebot") { > hash_data("Googlebot"); > } > } > > sub vcl_deliver { > if (resp.status == 501) { > return (synth(resp.status)); > } > if (obj.hits > 0) { > set resp.http.X-Cache = "hit"; > } else { > set resp.http.X-Cache = "miss"; > } > } > > > On Mon, Jun 26, 
2017 at 3:47 PM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Nice! It may have been the cause, time will tell.can you report back in a >> few days to let us know? >> -- >> Guillaume Quintard >> >> On Jun 26, 2017 20:21, "Stefano Baldo" wrote: >> >>> Hi Guillaume. >>> >>> I think things will start to going better now after changing the bans. >>> This is how my last varnishstat looked like moments before a crash >>> regarding the bans: >>> >>> MAIN.bans 41336 . Count of bans >>> MAIN.bans_completed 37967 . Number of bans marked >>> 'completed' >>> MAIN.bans_obj 0 . Number of bans using >>> obj.* >>> MAIN.bans_req 41335 . Number of bans using >>> req.* >>> MAIN.bans_added 41336 0.68 Bans added >>> MAIN.bans_deleted 0 0.00 Bans deleted >>> >>> And this is how it looks like now: >>> >>> MAIN.bans 2 . Count of bans >>> MAIN.bans_completed 1 . Number of bans marked >>> 'completed' >>> MAIN.bans_obj 2 . Number of bans using >>> obj.* >>> MAIN.bans_req 0 . Number of bans using >>> req.* >>> MAIN.bans_added 2016 0.69 Bans added >>> MAIN.bans_deleted 2014 0.69 Bans deleted >>> >>> Before the changes, bans were never deleted! >>> Now the bans are added and quickly deleted after a minute or even a >>> couple of seconds. >>> >>> May this was the cause of the problem? It seems like varnish was having >>> a large number of bans to manage and test against. >>> I will let it ride now. Let's see if the problem persists or it's gone! >>> :-) >>> >>> Best, >>> Stefano >>> >>> >>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>> >>>> Looking good! >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo >>>> wrote: >>>> >>>>> Hi Guillaume, >>>>> >>>>> Can the following be considered "ban lurker friendly"? 
>>>>> >>>>> sub vcl_backend_response { >>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>> } >>>>> >>>>> sub vcl_recv { >>>>> if (req.method == "PURGE") { >>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>> obj.http.x-user-agent !~ Googlebot"); >>>>> return(synth(750)); >>>>> } >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> unset resp.http.x-url; >>>>> unset resp.http.x-user-agent; >>>>> } >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> >>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>> expressions. The easiest way is to stash the host, user-agent and url in >>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>> >>>>>> I don't think you need to expand the VSL at all. >>>>>> >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>> wrote: >>>>>> >>>>>> Hi Guillaume. >>>>>> >>>>>> Thanks for answering. >>>>>> >>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase >>>>>> performance but it's still restarting. >>>>>> Also, I checked the I/O performance of the disk and there is no >>>>>> sign of overhead. >>>>>> >>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>>> default size, passing "-l 200m,20m" to varnishd and using >>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was >>>>>> a problem here. After a couple of hours varnish died and I received a "no >>>>>> space left on device" message - deleting /var/lib/varnish solved the >>>>>> problem and varnish was up again, but it's weird because there was free >>>>>> memory on the host to be used by the tmpfs directory, so I don't know >>>>>> what could have happened. I will try to stop increasing the >>>>>> /var/lib/varnish size.
>>>>>> >>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>> lurker friendly. Well, I don't think so. My bans are created this way: >>>>>> >>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url >>>>>> + " && req.http.User-Agent !~ Googlebot"); >>>>>> >>>>>> Are they lurker friendly? I took a quick look at the >>>>>> documentation and it looks like they're not. >>>>>> >>>>>> Best, >>>>>> Stefano >>>>>> >>>>>> >>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>> guillaume at varnish-software.com> wrote: >>>>>> >>>>>>> Hi Stefano, >>>>>>> >>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>> >>>>>>> After some time, the file storage is terrible on a hard drive (SSDs >>>>>>> take a bit more time to degrade) because of fragmentation. One solution to >>>>>>> help the disks cope is to overprovision them if they're SSDs, and you can >>>>>>> try different advice values in the file storage definition on the command >>>>>>> line (last parameter, after granularity). >>>>>>> >>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too. >>>>>>> >>>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>>> >>>>>>> -- >>>>>>> Guillaume Quintard >>>>>>> >>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>> >>>>>>>> Hello. >>>>>>>> >>>>>>>> I am having a critical problem with Varnish Cache in production for >>>>>>>> over a month and any help will be appreciated. >>>>>>>> The problem is that the Varnish child process is recurrently being >>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>> >>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>> responding to CLI, killed it.
>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from >>>>>>>> ping: 400 CLI communication error >>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>>> signal=9 >>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete >>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>> Child starts >>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>> >>>>>>>> The following link is the varnishstat output just 1 minute before a >>>>>>>> restart: >>>>>>>> >>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>> >>>>>>>> Environment: >>>>>>>> >>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>> Installed using pre-built package from official repo at >>>>>>>> packagecloud.io >>>>>>>> CPU 2x2.9 GHz >>>>>>>> Mem 3.69 GiB >>>>>>>> Running inside a Docker container >>>>>>>> NFILES=131072 >>>>>>>> MEMLOCK=82000 >>>>>>>> >>>>>>>> Additional info: >>>>>>>> >>>>>>>> - I need to cache a large number of objects and the cache should >>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know >>>>>>>> if this is a problem; >>>>>>>> - I use ban a lot. There were about 40k bans in the system just >>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>> anything to do with it; >>>>>>>> - No registered CPU spikes (almost always around 30%); >>>>>>>> - No panic is reported, the only info I can retrieve is from syslog; >>>>>>>> - During all this time, even moments before the crashes, everything >>>>>>>> is okay and requests are being answered very fast.
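The "not responding to CLI, killed it." line quoted above is regular enough to alert on automatically. As a minimal illustrative sketch (not part of Varnish; the helper name is invented), a filter that pulls the PIDs of killed children out of a syslog stream:

```python
import re

# Matches the manager's message when it kills an unresponsive child, e.g.
#   varnishd[11816]: Child (11824) not responding to CLI, killed it.
KILL_RE = re.compile(
    r"varnishd\[\d+\]: Child \((\d+)\) not responding to CLI, killed it\."
)

def find_child_kills(log_lines):
    """Return the PIDs of child processes killed for missing CLI pings."""
    pids = []
    for line in log_lines:
        m = KILL_RE.search(line)
        if m:
            pids.append(int(m.group(1)))
    return pids
```

Fed the syslog excerpt above, this reports child 11824; counting how often it fires over time gives a cheap restart-frequency metric.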
>>>>>>>> >>>>>>>> Best, >>>>>>>> Stefano Baldo >>>>>>>> >>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> varnish-misc mailing list >>>>>>>> varnish-misc at varnish-cache.org >>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>>> >>>>> >>>> >>> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joh.hendriks at gmail.com Wed Jun 28 08:58:44 2017 From: joh.hendriks at gmail.com (Johan Hendriks) Date: Wed, 28 Jun 2017 10:58:44 +0200 Subject: Varnish performance with phpinfo In-Reply-To: References: <6fa4576d-d25e-b770-44da-98877379a815@gmail.com> <83029bff-6f19-5d12-0514-fa6441ecbd6a@gmail.com> Message-ID: I created a .html file from the php.info page and the results are the same. So I think it is a local problem on the client. Thank you for your time and sorry for the noise. Regards, Johan Hendriks On 23/06/2017 at 17:36, Guillaume Quintard wrote: > Simple way to test: grow the info.html size :-) > > -- > Guillaume Quintard > > On Fri, Jun 23, 2017 at 4:52 PM, Johan Hendriks > > wrote: > > Thanks for your answer. > I was thinking about that also, but I could not find anything that > pointed in that direction. > But should I hit that limit also with the info.html file then, or > could it be the size of the page? > The info.html is of course way smaller than the whole php.info > page. > > regards > Johan > > > On 23/06/2017 at 10:58, Guillaume Quintard wrote: >> Stupid question but, aren't you being limited by your client, or >> a firewall, maybe? >> >> -- >> Guillaume Quintard >> >> On Fri, Jun 2, 2017 at 12:06 PM, Johan Hendriks >> > wrote: >> >> Hello all, First sorry for the long email. >> I have a strange issue with varnish. At least I think it is >> strange. >> >> We started some tests with varnish, but we have an issue. >> >> I am running varnish 4.1.6 on FreeBSD 11.1-prerelease.
Where >> varnish listens on port 82 and apache on 80. This is just for >> the tests. >> We use the following start options. >> >> # Varnish >> varnishd_enable="YES" >> varnishd_listen="192.168.2.247:82 " >> varnishd_pidfile="/var/run/varnishd.pid" >> varnishd_storage="default=malloc,2024M" >> varnishd_config="/usr/local/etc/varnish/default.vcl" >> varnishd_hash="critbit" >> varnishd_admin=":6082" >> varnishncsa_enable="YES" >> >> We did a test with a static page and that went fine. First we >> see it is not cached, the second attempt is cached. >> >> root at desk:~ # curl -I www.testdomain.nl:82/info.html >> >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:19:52 GMT >> Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT >> ETag: "cf4-550e57bc1f812" >> Content-Length: 3316 >> Content-Type: text/html >> cache-control: max-age = 259200 >> X-Varnish: 2 >> Age: 0 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: MISS >> Accept-Ranges: bytes >> Connection: keep-alive >> >> root at desk:~ # curl -I www.testdomain.nl:82/info.html >> >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:19:52 GMT >> Last-Modified: Thu, 01 Jun 2017 12:50:37 GMT >> ETag: "cf4-550e57bc1f812" >> Content-Length: 3316 >> Content-Type: text/html >> cache-control: max-age = 259200 >> X-Varnish: 5 3 >> Age: 6 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: HIT >> Accept-Ranges: bytes >> Connection: keep-alive >> >> If I benchmark the server I get the following. >> First directly to Apache >> >> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl/info.html >> >> Bombarding http://www.testdomain.nl/info.html >> with 10000 requests >> using 400 connections >> 10000 / 10000 >> [=============================================================] >> 100.00% 0s >> Done!
>> Statistics Avg Stdev Max >> Reqs/sec 12459.00 898.32 13301 >> Latency 31.04ms 25.28ms 280.90ms >> HTTP codes: >> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 0 >> Throughput: 42.16MB/s >> >> This is via varnish. So that works as intended. >> Varnish does its job and serves the page better. >> >> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl:82/info.html >> >> Bombarding http://www.testdomain.nl:82/info.html >> with 10000 requests >> using 400 connections >> 10000 / 10000 >> [=============================================================] >> 100.00% 0s >> Done! >> Statistics Avg Stdev Max >> Reqs/sec 19549.00 7649.32 24313 >> Latency 17.90ms 66.77ms 485.07ms >> HTTP codes: >> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 0 >> Throughput: 71.58MB/s >> >> >> The next one is against an info.php file, which runs phpinfo(); >> >> So first against the server without varnish. >> >> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl/info.php >> >> Bombarding http://www.testdomain.nl/info.php >> with 10000 requests using >> 400 connections >> 10000 / 10000 >> [============================================================] >> 100.00% 11s >> Done! >> Statistics Avg Stdev Max >> Reqs/sec 828.00 127.66 1010 >> Latency 472.10ms 59.10ms 740.43ms >> HTTP codes: >> 1xx - 0, 2xx - 10000, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 0 >> Throughput: 75.51MB/s >> >> But then against the server with varnish.
>> So we make sure it is in cache >> >> root at desk:~ # curl -I www.testdomain.nl:82/info.php >> >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:36:16 GMT >> Content-Type: text/html; charset=UTF-8 >> cache-control: max-age = 259200 >> X-Varnish: 7 >> Age: 0 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: MISS >> Accept-Ranges: bytes >> Connection: keep-alive >> >> root at desk:~ # curl -I www.testdomain.nl:82/info.php >> >> HTTP/1.1 200 OK >> Date: Fri, 02 Jun 2017 09:36:16 GMT >> Content-Type: text/html; charset=UTF-8 >> cache-control: max-age = 259200 >> X-Varnish: 10 8 >> Age: 2 >> Via: 1.1 varnish-v4 >> Server: varnish >> X-Powered-By: My Varnish >> X-Cache: HIT >> Accept-Ranges: bytes >> Connection: keep-alive >> >> So it is in cache now. >> root at testserver:~ # bombardier -c400 -n10000 >> http://www.testdomain.nl:82/info.php >> >> Bombarding http://www.testdomain.nl:82/info.php >> with 10000 requests >> using 400 connections >> 10000 / 10000 >> [===========================================================================================================================================================================================================] >> 100.00% 8s >> Done! >> Statistics Avg Stdev Max >> Reqs/sec 1179.00 230.77 1981 >> Latency 219.94ms 340.29ms 2.00s >> HTTP codes: >> 1xx - 0, 2xx - 9938, 3xx - 0, 4xx - 0, 5xx - 0 >> others - 62 >> Errors: >> dialing to the given TCP address timed out - 62 >> Throughput: 83.16MB/s >> >> I expected this to be much more in favour of varnish, but it >> even generated some errors! Time taken is lower but I >> expected it to be much faster. Also, the 62 errors are not good, >> I guess.
>> >> I do see the following with varnish log >> * << Request >> 11141123 >> - Begin req 1310723 rxreq >> - Timestamp Start: 1496396250.098654 0.000000 0.000000 >> - Timestamp Req: 1496396250.098654 0.000000 0.000000 >> - ReqStart 192.168.2.39 14818 >> - ReqMethod GET >> - ReqURL /info.php >> - ReqProtocol HTTP/1.1 >> - ReqHeader User-Agent: fasthttp >> - ReqHeader Host: www.testdomain.nl:82 >> >> - ReqHeader X-Forwarded-For: 192.168.2.39 >> - VCL_call RECV >> - ReqUnset X-Forwarded-For: 192.168.2.39 >> - ReqHeader X-Forwarded-For: 192.168.2.39, 192.168.2.39 >> - VCL_return hash >> - VCL_call HASH >> - VCL_return lookup >> - Hit 8 >> - VCL_call HIT >> - VCL_return deliver >> - RespProtocol HTTP/1.1 >> - RespStatus 200 >> - RespReason OK >> - RespHeader Date: Fri, 02 Jun 2017 09:36:16 GMT >> - RespHeader Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l >> - RespHeader X-Powered-By: PHP/7.0.19 >> - RespHeader Content-Type: text/html; charset=UTF-8 >> - RespHeader cache-control: max-age = 259200 >> - RespHeader X-Varnish: 11141123 8 >> - RespHeader Age: 73 >> - RespHeader Via: 1.1 varnish-v4 >> - VCL_call DELIVER >> - RespUnset Server: Apache/2.4.25 (FreeBSD) OpenSSL/1.0.2l >> - RespHeader Server: varnish >> - RespUnset X-Powered-By: PHP/7.0.19 >> - RespHeader X-Powered-By: My Varnish >> - RespHeader X-Cache: HIT >> - VCL_return deliver >> - Timestamp Process: 1496396250.098712 0.000058 0.000058 >> - RespHeader Accept-Ranges: bytes >> - RespHeader Content-Length: 95200 >> - Debug "RES_MODE 2" >> - RespHeader Connection: keep-alive >> - Debug "Hit idle send timeout, wrote = >> 89972/95508; retrying" >> - Debug "Write error, retval = -1, len = 5536, >> errno = Resource temporarily unavailable" >> - Timestamp Resp: 1496396371.131526 121.032872 121.032814 >> - ReqAcct 82 0 82 308 95200 95508 >> - End >> >> Sometimes I see this Debug line also - Debug >> "Write error, retval = -1, len = 95563, errno = Broken pipe" >> >> >> I also installed varnish 5.1.2 but the
results are the same. >> Is there something I'm missing? >> >> My vcl file is pretty basic. >> >> https://pastebin.com/rbb42x7h >> >> Thanks all for your time. >> >> regards >> Johan >> >> >> _______________________________________________ >> varnish-misc mailing list >> varnish-misc at varnish-cache.org >> >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >> >> >> > > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Wed Jun 28 13:20:16 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Wed, 28 Jun 2017 10:20:16 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Guillaume. I increased the cli_timeout yesterday to 900sec (15min) and it restarted anyway, which seems to indicate that the thread is really stalled. This was 1 minute after the last restart: MAIN.n_object 3908216 . object structs made SMF.s0.g_alloc 7794510 . Allocations outstanding I've just changed the I/O scheduler to noop to see what happens. One interesting thing I've found is about the memory usage. In the 1st minute of use: MemTotal: 3865572 kB MemFree: 120768 kB MemAvailable: 2300268 kB 1 minute before a restart: MemTotal: 3865572 kB MemFree: 82480 kB MemAvailable: 68316 kB It seems like the system is possibly running out of memory. When calling varnishd, I'm specifying only "-s file,..." as storage. I see in some examples that it is common to use "-s file" AND "-s malloc" together. Should I be passing "-s malloc" as well to somehow try to limit the memory usage by varnishd?
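The two /proc/meminfo snapshots quoted above can be compared mechanically. A minimal sketch, assuming the snapshots are parsed into dicts of kB values; the 5% warning threshold is an arbitrary illustration, not a Varnish or kernel default:

```python
def mem_available_ratio(meminfo):
    """Fraction of total memory the kernel reports as reclaimable/available.

    meminfo: dict of /proc/meminfo fields, values in kB.
    """
    return meminfo["MemAvailable"] / meminfo["MemTotal"]

# The two snapshots quoted in the message above:
first_minute = {"MemTotal": 3865572, "MemFree": 120768, "MemAvailable": 2300268}
before_restart = {"MemTotal": 3865572, "MemFree": 82480, "MemAvailable": 68316}

for name, snap in (("first minute", first_minute), ("before restart", before_restart)):
    ratio = mem_available_ratio(snap)
    # Flag the box as memory-starved below the (arbitrary) 5% threshold.
    print(f"{name}: {ratio:.1%} available" + (" - LOW" if ratio < 0.05 else ""))
```

Run on these numbers, it shows MemAvailable falling from roughly 59% of RAM to under 2%, which supports the "running out of memory" reading: MemFree barely moves, but the reclaimable page cache backing the file storage has been consumed.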
Best, Stefano On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Sadly, nothing suspicious here, you can still try: > - bumping the cli_timeout > - changing your disk scheduler > - changing the advice option of the file storage > > I'm still convinced this is due to Varnish getting stuck waiting for the > disk because of the file storage fragmentation. > > Maybe you could look at SMF.*.g_alloc and compare it to the number of > objects. Ideally, we would have a 1:1 relation between objects and > allocations. If that number drops prior to a restart, that would be a good > clue. > > > -- > Guillaume Quintard > > On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo > wrote: > >> Hi Guillaume. >> >> It keeps restarting. >> Would you mind taking a quick look in the following VCL file to check if >> you find anything suspicious? >> >> Thank you very much. >> >> Best, >> Stefano >> >> vcl 4.0; >> >> import std; >> >> backend default { >> .host = "sites-web-server-lb"; >> .port = "80"; >> } >> >> include "/etc/varnish/bad_bot_detection.vcl"; >> >> sub vcl_recv { >> call bad_bot_detection; >> >> if (req.url == "/nocache" || req.url == "/version") { >> return(pass); >> } >> >> unset req.http.Cookie; >> if (req.method == "PURGE") { >> ban("obj.http.x-host == " + req.http.host + " && >> obj.http.x-user-agent !~ Googlebot"); >> return(synth(750)); >> } >> >> set req.url = regsuball(req.url, "(?> } >> >> sub vcl_synth { >> if (resp.status == 750) { >> set resp.status = 200; >> synthetic("PURGED => " + req.url); >> return(deliver); >> } elsif (resp.status == 501) { >> set resp.status = 200; >> set resp.http.Content-Type = "text/html; charset=utf-8"; >> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >> return(deliver); >> } >> } >> >> sub vcl_backend_response { >> unset beresp.http.Set-Cookie; >> set beresp.http.x-host = bereq.http.host; >> set beresp.http.x-user-agent = bereq.http.user-agent; >> >> if (bereq.url == 
"/themes/basic/assets/theme.min.css" >> || bereq.url == "/api/events/PAGEVIEW" >> || bereq.url ~ "^\/assets\/img\/") { >> set beresp.http.Cache-Control = "max-age=0"; >> } else { >> unset beresp.http.Cache-Control; >> } >> >> if (beresp.status == 200 || >> beresp.status == 301 || >> beresp.status == 302 || >> beresp.status == 404) { >> if (bereq.url ~ "\&ordenar=aleatorio$") { >> set beresp.http.X-TTL = "1d"; >> set beresp.ttl = 1d; >> } else { >> set beresp.http.X-TTL = "1w"; >> set beresp.ttl = 1w; >> } >> } >> >> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >> { >> set beresp.do_gzip = true; >> } >> } >> >> sub vcl_pipe { >> set bereq.http.connection = "close"; >> return (pipe); >> } >> >> sub vcl_deliver { >> unset resp.http.x-host; >> unset resp.http.x-user-agent; >> } >> >> sub vcl_backend_error { >> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >> 504) { >> set beresp.status = 200; >> set beresp.http.Content-Type = "text/html; charset=utf-8"; >> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >> return (deliver); >> } >> } >> >> sub vcl_hash { >> if (req.http.User-Agent ~ "Google Page Speed") { >> hash_data("Google Page Speed"); >> } elsif (req.http.User-Agent ~ "Googlebot") { >> hash_data("Googlebot"); >> } >> } >> >> sub vcl_deliver { >> if (resp.status == 501) { >> return (synth(resp.status)); >> } >> if (obj.hits > 0) { >> set resp.http.X-Cache = "hit"; >> } else { >> set resp.http.X-Cache = "miss"; >> } >> } >> >> >> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Nice! It may have been the cause, time will tell.can you report back in >>> a few days to let us know? >>> -- >>> Guillaume Quintard >>> >>> On Jun 26, 2017 20:21, "Stefano Baldo" wrote: >>> >>>> Hi Guillaume. >>>> >>>> I think things will start to going better now after changing the bans. 
>>>> This is how my last varnishstat looked like moments before a crash >>>> regarding the bans: >>>> >>>> MAIN.bans 41336 . Count of bans >>>> MAIN.bans_completed 37967 . Number of bans marked >>>> 'completed' >>>> MAIN.bans_obj 0 . Number of bans using >>>> obj.* >>>> MAIN.bans_req 41335 . Number of bans using >>>> req.* >>>> MAIN.bans_added 41336 0.68 Bans added >>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>> >>>> And this is how it looks like now: >>>> >>>> MAIN.bans 2 . Count of bans >>>> MAIN.bans_completed 1 . Number of bans marked >>>> 'completed' >>>> MAIN.bans_obj 2 . Number of bans using >>>> obj.* >>>> MAIN.bans_req 0 . Number of bans using >>>> req.* >>>> MAIN.bans_added 2016 0.69 Bans added >>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>> >>>> Before the changes, bans were never deleted! >>>> Now the bans are added and quickly deleted after a minute or even a >>>> couple of seconds. >>>> >>>> May this was the cause of the problem? It seems like varnish was having >>>> a large number of bans to manage and test against. >>>> I will let it ride now. Let's see if the problem persists or it's gone! >>>> :-) >>>> >>>> Best, >>>> Stefano >>>> >>>> >>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Looking good! >>>>> >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo >>>> > wrote: >>>>> >>>>>> Hi Guillaume, >>>>>> >>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>> >>>>>> sub vcl_backend_response { >>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>> } >>>>>> >>>>>> sub vcl_recv { >>>>>> if (req.method == "PURGE") { >>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>> return(synth(750)); >>>>>> } >>>>>> } >>>>>> >>>>>> sub vcl_deliver { >>>>>> unset resp.http.x-url; >>>>>> unset resp.http.x-user-agent; >>>>>> } >>>>>> >>>>>> Best, >>>>>> Stefano >>>>>> >>>>>> >>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>> guillaume at varnish-software.com> wrote: >>>>>> >>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>> >>>>>>> I don't think you need to expand the VSL at all. >>>>>>> >>>>>>> -- >>>>>>> Guillaume Quintard >>>>>>> >>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>> wrote: >>>>>>> >>>>>>> Hi Guillaume. >>>>>>> >>>>>>> Thanks for answering. >>>>>>> >>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>>>> performance but it stills restarting. >>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>> signal of overhead. >>>>>>> >>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>>>> default size passing "-l 200m,20m" to varnishd and using >>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was >>>>>>> a problem here. After a couple of hours varnish died and I received a "no >>>>>>> space left on device" message - deleting the /var/lib/varnish solved the >>>>>>> problem and varnish was up again, but it's weird because there was free >>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>> what could have happened. 
I will try to stop increasing the >>>>>>> /var/lib/varnish size. >>>>>>> >>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>> lurker friendly. Well, I don't think so. My bans are created this way: >>>>>>> >>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url >>>>>>> + " && req.http.User-Agent !~ Googlebot"); >>>>>>> >>>>>>> Are they lurker friendly? I took a quick look at the >>>>>>> documentation and it looks like they're not. >>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> >>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>> guillaume at varnish-software.com> wrote: >>>>>>> >>>>>>>> Hi Stefano, >>>>>>>> >>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>> >>>>>>>> After some time, the file storage is terrible on a hard drive (SSDs >>>>>>>> take a bit more time to degrade) because of fragmentation. One solution to >>>>>>>> help the disks cope is to overprovision them if they're SSDs, and you can >>>>>>>> try different advice values in the file storage definition on the command line >>>>>>>> (last parameter, after granularity). >>>>>>>> >>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too. >>>>>>>> >>>>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>>>> >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hello. >>>>>>>>> >>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>> for over a month and any help will be appreciated.
>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>> >>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>> responding to CLI, killed it. >>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>> from ping: 400 CLI communication error >>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>>>> signal=9 >>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>> complete >>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started >>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>> Child starts >>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>> >>>>>>>>> The following link is the varnishstat output just 1 minute before >>>>>>>>> a restart: >>>>>>>>> >>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>> >>>>>>>>> Environment: >>>>>>>>> >>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>> packagecloud.io >>>>>>>>> CPU 2x2.9 GHz >>>>>>>>> Mem 3.69 GiB >>>>>>>>> Running inside a Docker container >>>>>>>>> NFILES=131072 >>>>>>>>> MEMLOCK=82000 >>>>>>>>> >>>>>>>>> Additional info: >>>>>>>>> >>>>>>>>> - I need to cache a large number of objets and the cache should >>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know >>>>>>>>> if this is a problem; >>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>> before the last crash. 
I really don't know if this is too much or may have >>>>>>>>> anything to do with it; >>>>>>>>> - No registered CPU spikes (almost always around 30%); >>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>> syslog; >>>>>>>>> - During all this time, even moments before the crashes, >>>>>>>>> everything is okay and requests are being answered very fast. >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano Baldo >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> varnish-misc mailing list >>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>>> >>>>>> >>>>> >>>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Wed Jun 28 13:26:20 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Wed, 28 Jun 2017 15:26:20 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi, can you look at "varnishstat -1 | grep g_bytes" and see if it matches the memory you are seeing? -- Guillaume Quintard On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo wrote: > Hi Guillaume. > > I increased the cli_timeout yesterday to 900sec (15min) and it restarted > anyway, which seems to indicate that the thread is really stalled. > > This was 1 minute after the last restart: > > MAIN.n_object 3908216 . object structs made > SMF.s0.g_alloc 7794510 . Allocations outstanding > > I've just changed the I/O scheduler to noop to see what happens. > > One interesting thing I've found is about the memory usage. > > In the 1st minute of use: > MemTotal: 3865572 kB > MemFree: 120768 kB > MemAvailable: 2300268 kB > > 1 minute before a restart: > MemTotal: 3865572 kB > MemFree: 82480 kB > MemAvailable: 68316 kB > > It seems like the system is possibly running out of memory. > > When calling varnishd, I'm specifying only "-s file,..."
as storage. I see > in some examples that is common to use "-s file" AND "-s malloc" together. > Should I be passing "-s malloc" as well to somehow try to limit the memory > usage by varnishd? > > Best, > Stefano > > > On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Sadly, nothing suspicious here, you can still try: >> - bumping the cli_timeout >> - changing your disk scheduler >> - changing the advice option of the file storage >> >> I'm still convinced this is due to Varnish getting stuck waiting for the >> disk because of the file storage fragmentation. >> >> Maybe you could look at SMF.*.g_alloc and compare it to the number of >> objects. Ideally, we would have a 1:1 relation between objects and >> allocations. If that number drops prior to a restart, that would be a good >> clue. >> >> >> -- >> Guillaume Quintard >> >> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo >> wrote: >> >>> Hi Guillaume. >>> >>> It keeps restarting. >>> Would you mind taking a quick look in the following VCL file to check if >>> you find anything suspicious? >>> >>> Thank you very much. 
>>> >>> Best, >>> Stefano >>> >>> vcl 4.0; >>> >>> import std; >>> >>> backend default { >>> .host = "sites-web-server-lb"; >>> .port = "80"; >>> } >>> >>> include "/etc/varnish/bad_bot_detection.vcl"; >>> >>> sub vcl_recv { >>> call bad_bot_detection; >>> >>> if (req.url == "/nocache" || req.url == "/version") { >>> return(pass); >>> } >>> >>> unset req.http.Cookie; >>> if (req.method == "PURGE") { >>> ban("obj.http.x-host == " + req.http.host + " && >>> obj.http.x-user-agent !~ Googlebot"); >>> return(synth(750)); >>> } >>> >>> set req.url = regsuball(req.url, "(?>> } >>> >>> sub vcl_synth { >>> if (resp.status == 750) { >>> set resp.status = 200; >>> synthetic("PURGED => " + req.url); >>> return(deliver); >>> } elsif (resp.status == 501) { >>> set resp.status = 200; >>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >>> return(deliver); >>> } >>> } >>> >>> sub vcl_backend_response { >>> unset beresp.http.Set-Cookie; >>> set beresp.http.x-host = bereq.http.host; >>> set beresp.http.x-user-agent = bereq.http.user-agent; >>> >>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>> || bereq.url == "/api/events/PAGEVIEW" >>> || bereq.url ~ "^\/assets\/img\/") { >>> set beresp.http.Cache-Control = "max-age=0"; >>> } else { >>> unset beresp.http.Cache-Control; >>> } >>> >>> if (beresp.status == 200 || >>> beresp.status == 301 || >>> beresp.status == 302 || >>> beresp.status == 404) { >>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>> set beresp.http.X-TTL = "1d"; >>> set beresp.ttl = 1d; >>> } else { >>> set beresp.http.X-TTL = "1w"; >>> set beresp.ttl = 1w; >>> } >>> } >>> >>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>> { >>> set beresp.do_gzip = true; >>> } >>> } >>> >>> sub vcl_pipe { >>> set bereq.http.connection = "close"; >>> return (pipe); >>> } >>> >>> sub vcl_deliver { >>> unset resp.http.x-host; >>> unset resp.http.x-user-agent; >>> } >>> >>> 
sub vcl_backend_error { >>> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >>> 504) { >>> set beresp.status = 200; >>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>> return (deliver); >>> } >>> } >>> >>> sub vcl_hash { >>> if (req.http.User-Agent ~ "Google Page Speed") { >>> hash_data("Google Page Speed"); >>> } elsif (req.http.User-Agent ~ "Googlebot") { >>> hash_data("Googlebot"); >>> } >>> } >>> >>> sub vcl_deliver { >>> if (resp.status == 501) { >>> return (synth(resp.status)); >>> } >>> if (obj.hits > 0) { >>> set resp.http.X-Cache = "hit"; >>> } else { >>> set resp.http.X-Cache = "miss"; >>> } >>> } >>> >>> >>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>> >>>> Nice! It may have been the cause, time will tell.can you report back in >>>> a few days to let us know? >>>> -- >>>> Guillaume Quintard >>>> >>>> On Jun 26, 2017 20:21, "Stefano Baldo" wrote: >>>> >>>>> Hi Guillaume. >>>>> >>>>> I think things will start to going better now after changing the bans. >>>>> This is how my last varnishstat looked like moments before a crash >>>>> regarding the bans: >>>>> >>>>> MAIN.bans 41336 . Count of bans >>>>> MAIN.bans_completed 37967 . Number of bans marked >>>>> 'completed' >>>>> MAIN.bans_obj 0 . Number of bans using >>>>> obj.* >>>>> MAIN.bans_req 41335 . Number of bans using >>>>> req.* >>>>> MAIN.bans_added 41336 0.68 Bans added >>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>> >>>>> And this is how it looks like now: >>>>> >>>>> MAIN.bans 2 . Count of bans >>>>> MAIN.bans_completed 1 . Number of bans marked >>>>> 'completed' >>>>> MAIN.bans_obj 2 . Number of bans using >>>>> obj.* >>>>> MAIN.bans_req 0 . Number of bans using >>>>> req.* >>>>> MAIN.bans_added 2016 0.69 Bans added >>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>> >>>>> Before the changes, bans were never deleted! 
>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>> couple of seconds. >>>>> >>>>> May this was the cause of the problem? It seems like varnish was >>>>> having a large number of bans to manage and test against. >>>>> I will let it ride now. Let's see if the problem persists or it's >>>>> gone! :-) >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> >>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Looking good! >>>>>> >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>> stefanobaldo at gmail.com> wrote: >>>>>> >>>>>>> Hi Guillaume, >>>>>>> >>>>>>> Can the following be considered "ban lurker friendly"? >>>>>>> >>>>>>> sub vcl_backend_response { >>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>> } >>>>>>> >>>>>>> sub vcl_recv { >>>>>>> if (req.method == "PURGE") { >>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>> return(synth(750)); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_deliver { >>>>>>> unset resp.http.x-url; >>>>>>> unset resp.http.x-user-agent; >>>>>>> } >>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>> guillaume at varnish-software.com> wrote: >>>>>>> >>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>> >>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>> >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>> wrote: >>>>>>>> >>>>>>>> Hi Guillaume. >>>>>>>> >>>>>>>> Thanks for answering. >>>>>>>> >>>>>>>> I'm using a SSD disk. 
I've changed from ext4 to ext2 to increase >>>>>>>> performance but it stills restarting. >>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>> signal of overhead. >>>>>>>> >>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>>>>> default size passing "-l 200m,20m" to varnishd and using >>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>> was a problem here. After a couple of hours varnish died and I received a >>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved >>>>>>>> the problem and varnish was up again, but it's weird because there was free >>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>> /var/lib/varnish size. >>>>>>>> >>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>>>>>> >>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>> >>>>>>>> Are they lurker friendly? I was taking a quick look and the >>>>>>>> documentation and it looks like they're not. >>>>>>>> >>>>>>>> Best, >>>>>>>> Stefano >>>>>>>> >>>>>>>> >>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>> >>>>>>>>> Hi Stefano, >>>>>>>>> >>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>>> >>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. 
One >>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs, >>>>>>>>> and you can try different advices in the file storage definition in the >>>>>>>>> command line (last parameter, after granularity). >>>>>>>>> >>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>> >>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Guillaume Quintard >>>>>>>>> >>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hello. >>>>>>>>>> >>>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>>> for over a month and any help will be appreciated. >>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>> >>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>>> responding to CLI, killed it. >>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>>>>> signal=9 >>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>> complete >>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>> Started >>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>>> Child starts >>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>> >>>>>>>>>> The following link is the varnishstat output just 1 minute before >>>>>>>>>> a restart: >>>>>>>>>> >>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>> >>>>>>>>>> Environment: >>>>>>>>>> >>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>> Installed using pre-built package from official repo at 
>>>>>>>>>> packagecloud.io >>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>> Mem 3.69 GiB >>>>>>>>>> Running inside a Docker container >>>>>>>>>> NFILES=131072 >>>>>>>>>> MEMLOCK=82000 >>>>>>>>>> >>>>>>>>>> Additional info: >>>>>>>>>> >>>>>>>>>> - I need to cache a large number of objects and the cache should >>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know >>>>>>>>>> if this is a problem; >>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>> anything to do with it; >>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>> syslog; >>>>>>>>>> - During all the time, even moments before the crashes, >>>>>>>>>> everything is okay and requests are being responded to very fast. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Stefano Baldo >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> _______________________________________________ >>>>>>>>>> varnish-misc mailing list >>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Wed Jun 28 13:39:52 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Wed, 28 Jun 2017 10:39:52 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi. root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes SMA.Transient.g_bytes 519022 . Bytes outstanding SMF.s0.g_bytes 23662845952 . Bytes outstanding You mean g_bytes from SMA.Transient? I have set no malloc storage. 
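The "-s file" vs. "-s malloc" question above can be made concrete on the varnishd command line. The sketch below is only illustrative (paths and sizes are assumptions, not values from this thread): adding "-s malloc" would create a second, separate storage rather than cap memory use of the file storage, while naming the Transient storage bounds the malloc space Varnish otherwise allocates without limit for short-lived and uncacheable objects.

```sh
# Hypothetical invocation, illustrative paths and sizes only.
# "-s file" keeps object bodies in an mmap'ed file; the kernel page cache
# decides how much of it stays resident in RAM.
# Naming "Transient" caps the otherwise-unbounded malloc storage used for
# short-lived/uncacheable objects.
varnishd -a :80 \
    -f /etc/varnish/default.vcl \
    -s file,/var/lib/varnish/storage.bin,450G \
    -s Transient=malloc,256m
```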
On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Hi, > > can you look at "varnishstat -1 | grep g_bytes" and see if it matches > the memory you are seeing? > > -- > Guillaume Quintard > > On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo > wrote: > >> Hi Guillaume. >> >> I increased the cli_timeout yesterday to 900sec (15min) and it restarted >> anyway, which seems to indicate that the thread is really stalled. >> >> This was 1 minute after the last restart: >> >> MAIN.n_object 3908216 . object structs made >> SMF.s0.g_alloc 7794510 . Allocations outstanding >> >> I've just changed the I/O Scheduler to noop to see what happens. >> >> One interesting thing I've found is about the memory usage. >> >> In the 1st minute of use: >> MemTotal: 3865572 kB >> MemFree: 120768 kB >> MemAvailable: 2300268 kB >> >> 1 minute before a restart: >> MemTotal: 3865572 kB >> MemFree: 82480 kB >> MemAvailable: 68316 kB >> >> It seems like the system is possibly running out of memory. >> >> When calling varnishd, I'm specifying only "-s file,..." as storage. I >> see in some examples that it is common to use "-s file" AND "-s malloc" >> together. Should I be passing "-s malloc" as well to somehow try to limit >> the memory usage by varnishd? >> >> Best, >> Stefano >> >> >> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Sadly, nothing suspicious here, you can still try: >>> - bumping the cli_timeout >>> - changing your disk scheduler >>> - changing the advice option of the file storage >>> >>> I'm still convinced this is due to Varnish getting stuck waiting for the >>> disk because of the file storage fragmentation. >>> >>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>> objects. Ideally, we would have a 1:1 relation between objects and >>> allocations. If that number drops prior to a restart, that would be a good >>> clue. 
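The allocations-to-objects comparison suggested above is easy to script. A sketch, using the two counters quoted earlier in this thread as canned input; in live use, the here-document would be replaced by real `varnishstat -1` output:

```shell
# Ratio of file-storage allocations to cached objects; a ratio well above
# 1.0 suggests fragmentation (one object split across several extents).
# The counter lines below are the ones reported earlier in this thread.
ratio=$(awk '$1 == "SMF.s0.g_alloc" { a = $2 }
             $1 == "MAIN.n_object"  { o = $2 }
             END { printf "%.1f", a / o }' <<'EOF'
MAIN.n_object 3908216 . object structs made
SMF.s0.g_alloc 7794510 . Allocations outstanding
EOF
)
echo "allocations per object: $ratio"
```

With the thread's own numbers this works out to about 2.0 allocations per object, i.e. well away from the 1:1 ideal, which is consistent with the fragmentation hypothesis.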
>>> >>> >>> -- >>> Guillaume Quintard >>> >>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo >>> wrote: >>> >>>> Hi Guillaume. >>>> >>>> It keeps restarting. >>>> Would you mind taking a quick look in the following VCL file to check >>>> if you find anything suspicious? >>>> >>>> Thank you very much. >>>> >>>> Best, >>>> Stefano >>>> >>>> vcl 4.0; >>>> >>>> import std; >>>> >>>> backend default { >>>> .host = "sites-web-server-lb"; >>>> .port = "80"; >>>> } >>>> >>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>> >>>> sub vcl_recv { >>>> call bad_bot_detection; >>>> >>>> if (req.url == "/nocache" || req.url == "/version") { >>>> return(pass); >>>> } >>>> >>>> unset req.http.Cookie; >>>> if (req.method == "PURGE") { >>>> ban("obj.http.x-host == " + req.http.host + " && >>>> obj.http.x-user-agent !~ Googlebot"); >>>> return(synth(750)); >>>> } >>>> >>>> set req.url = regsuball(req.url, "(?>>> } >>>> >>>> sub vcl_synth { >>>> if (resp.status == 750) { >>>> set resp.status = 200; >>>> synthetic("PURGED => " + req.url); >>>> return(deliver); >>>> } elsif (resp.status == 501) { >>>> set resp.status = 200; >>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >>>> return(deliver); >>>> } >>>> } >>>> >>>> sub vcl_backend_response { >>>> unset beresp.http.Set-Cookie; >>>> set beresp.http.x-host = bereq.http.host; >>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>> >>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>> || bereq.url == "/api/events/PAGEVIEW" >>>> || bereq.url ~ "^\/assets\/img\/") { >>>> set beresp.http.Cache-Control = "max-age=0"; >>>> } else { >>>> unset beresp.http.Cache-Control; >>>> } >>>> >>>> if (beresp.status == 200 || >>>> beresp.status == 301 || >>>> beresp.status == 302 || >>>> beresp.status == 404) { >>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>> set beresp.http.X-TTL = "1d"; >>>> set beresp.ttl = 1d; >>>> } else { >>>> set 
beresp.http.X-TTL = "1w"; >>>> set beresp.ttl = 1w; >>>> } >>>> } >>>> >>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>> { >>>> set beresp.do_gzip = true; >>>> } >>>> } >>>> >>>> sub vcl_pipe { >>>> set bereq.http.connection = "close"; >>>> return (pipe); >>>> } >>>> >>>> sub vcl_deliver { >>>> unset resp.http.x-host; >>>> unset resp.http.x-user-agent; >>>> } >>>> >>>> sub vcl_backend_error { >>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >>>> 504) { >>>> set beresp.status = 200; >>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>> return (deliver); >>>> } >>>> } >>>> >>>> sub vcl_hash { >>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>> hash_data("Google Page Speed"); >>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>> hash_data("Googlebot"); >>>> } >>>> } >>>> >>>> sub vcl_deliver { >>>> if (resp.status == 501) { >>>> return (synth(resp.status)); >>>> } >>>> if (obj.hits > 0) { >>>> set resp.http.X-Cache = "hit"; >>>> } else { >>>> set resp.http.X-Cache = "miss"; >>>> } >>>> } >>>> >>>> >>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Nice! It may have been the cause, time will tell.can you report back >>>>> in a few days to let us know? >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Jun 26, 2017 20:21, "Stefano Baldo" wrote: >>>>> >>>>>> Hi Guillaume. >>>>>> >>>>>> I think things will start to going better now after changing the bans. >>>>>> This is how my last varnishstat looked like moments before a crash >>>>>> regarding the bans: >>>>>> >>>>>> MAIN.bans 41336 . Count of bans >>>>>> MAIN.bans_completed 37967 . Number of bans >>>>>> marked 'completed' >>>>>> MAIN.bans_obj 0 . Number of bans using >>>>>> obj.* >>>>>> MAIN.bans_req 41335 . 
Number of bans using >>>>>> req.* >>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>> >>>>>> And this is how it looks like now: >>>>>> >>>>>> MAIN.bans 2 . Count of bans >>>>>> MAIN.bans_completed 1 . Number of bans >>>>>> marked 'completed' >>>>>> MAIN.bans_obj 2 . Number of bans using >>>>>> obj.* >>>>>> MAIN.bans_req 0 . Number of bans using >>>>>> req.* >>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>> >>>>>> Before the changes, bans were never deleted! >>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>>> couple of seconds. >>>>>> >>>>>> May this was the cause of the problem? It seems like varnish was >>>>>> having a large number of bans to manage and test against. >>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>> gone! :-) >>>>>> >>>>>> Best, >>>>>> Stefano >>>>>> >>>>>> >>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>> guillaume at varnish-software.com> wrote: >>>>>> >>>>>>> Looking good! >>>>>>> >>>>>>> -- >>>>>>> Guillaume Quintard >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>> >>>>>>>> Hi Guillaume, >>>>>>>> >>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>> >>>>>>>> sub vcl_backend_response { >>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>> } >>>>>>>> >>>>>>>> sub vcl_recv { >>>>>>>> if (req.method == "PURGE") { >>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>> return(synth(750)); >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> sub vcl_deliver { >>>>>>>> unset resp.http.x-url; >>>>>>>> unset resp.http.x-user-agent; >>>>>>>> } >>>>>>>> >>>>>>>> Best, >>>>>>>> Stefano >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>> >>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>> >>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Guillaume Quintard >>>>>>>>> >>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Guillaume. >>>>>>>>> >>>>>>>>> Thanks for answering. >>>>>>>>> >>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>>>>>> performance but it stills restarting. >>>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>>> signal of overhead. >>>>>>>>> >>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>>>>>> default size passing "-l 200m,20m" to varnishd and using >>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>>> was a problem here. 
After a couple of hours varnish died and I received a >>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved >>>>>>>>> the problem and varnish was up again, but it's weird because there was free >>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>>> /var/lib/varnish size. >>>>>>>>> >>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>>>>>>> >>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>> >>>>>>>>> Are they lurker friendly? I was taking a quick look and the >>>>>>>>> documentation and it looks like they're not. >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Stefano, >>>>>>>>>> >>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>>>> >>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs, >>>>>>>>>> and you can try different advices in the file storage definition in the >>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>> >>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>> >>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? 
>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hello. >>>>>>>>>>> >>>>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>>>> for over a month and any help will be appreciated. >>>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>>> >>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>>>> responding to CLI, killed it. >>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>>>>>> signal=9 >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>> complete >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>> Started >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>>>> Child starts >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>> >>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>> before a restart: >>>>>>>>>>> >>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>> >>>>>>>>>>> Environment: >>>>>>>>>>> >>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>> packagecloud.io >>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>> Running inside a Docker container >>>>>>>>>>> NFILES=131072 >>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>> >>>>>>>>>>> Additional info: >>>>>>>>>>> >>>>>>>>>>> - I need to cache a large number of objets and the cache should >>>>>>>>>>> last for almost a week, 
so I have set up a 450G storage space, I don't know >>>>>>>>>>> if this is a problem; >>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>>> anything to do with it; >>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>> syslog; >>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Stefano Baldo >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>> -misc >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Wed Jun 28 13:43:55 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Wed, 28 Jun 2017 15:43:55 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Yeah, I was wondering about Transient, but it seems under control. Apart from moving away from file storage, I have nothing at the moment :-/ -- Guillaume Quintard On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo wrote: > Hi. > > root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes > SMA.Transient.g_bytes 519022 . Bytes > outstanding > SMF.s0.g_bytes 23662845952 . Bytes > outstanding > > You mean g_bytes from SMA.Transient? I have set no malloc storage. 
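One way to reconcile the MemAvailable collapse reported above with the object counts: even with a pure file storage, Varnish keeps per-object bookkeeping in RAM. Assuming the commonly cited estimate of roughly 1 kB of metadata per cached object (an assumption, not a figure from this thread), the ~3.9 million objects alone approach the host's 3.69 GiB:

```shell
# Back-of-envelope: n_object (from the thread) times an assumed ~1 kB of
# per-object metadata, expressed in MB. Compare with the reported
# MemTotal of 3865572 kB (~3.7 GB).
n_object=3908216
overhead_mb=$(( n_object * 1024 / 1024 / 1024 ))   # kB -> MB, i.e. n_object / 1024
echo "approx. object metadata: ${overhead_mb} MB"
```

At roughly 3.8 GB of estimated metadata against under 4 GB of RAM, the host would be squeezed even before any page-cache pressure from the 450G file storage, which fits the suggestion to move away from this setup or add memory.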
> > > On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Hi, >> >> can you look that "varnishstat -1 | grep g_bytes" and see if if matches >> the memory you are seeing? >> >> -- >> Guillaume Quintard >> >> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo >> wrote: >> >>> Hi Guillaume. >>> >>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted >>> anyway, which seems to indicate that the thread is really stalled. >>> >>> This was 1 minute after the last restart: >>> >>> MAIN.n_object 3908216 . object structs made >>> SMF.s0.g_alloc 7794510 . Allocations outstanding >>> >>> I've just changed the I/O Scheduler to noop to see what happens. >>> >>> One interest thing I've found is about the memory usage. >>> >>> In the 1st minute of use: >>> MemTotal: 3865572 kB >>> MemFree: 120768 kB >>> MemAvailable: 2300268 kB >>> >>> 1 minute before a restart: >>> MemTotal: 3865572 kB >>> MemFree: 82480 kB >>> MemAvailable: 68316 kB >>> >>> It seems like the system is possibly running out of memory. >>> >>> When calling varnishd, I'm specifying only "-s file,..." as storage. I >>> see in some examples that is common to use "-s file" AND "-s malloc" >>> together. Should I be passing "-s malloc" as well to somehow try to limit >>> the memory usage by varnishd? >>> >>> Best, >>> Stefano >>> >>> >>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>> >>>> Sadly, nothing suspicious here, you can still try: >>>> - bumping the cli_timeout >>>> - changing your disk scheduler >>>> - changing the advice option of the file storage >>>> >>>> I'm still convinced this is due to Varnish getting stuck waiting for >>>> the disk because of the file storage fragmentation. >>>> >>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>>> objects. Ideally, we would have a 1:1 relation between objects and >>>> allocations. 
If that number drops prior to a restart, that would be a good >>>> clue. >>>> >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo >>> > wrote: >>>> >>>>> Hi Guillaume. >>>>> >>>>> It keeps restarting. >>>>> Would you mind taking a quick look in the following VCL file to check >>>>> if you find anything suspicious? >>>>> >>>>> Thank you very much. >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> vcl 4.0; >>>>> >>>>> import std; >>>>> >>>>> backend default { >>>>> .host = "sites-web-server-lb"; >>>>> .port = "80"; >>>>> } >>>>> >>>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>>> >>>>> sub vcl_recv { >>>>> call bad_bot_detection; >>>>> >>>>> if (req.url == "/nocache" || req.url == "/version") { >>>>> return(pass); >>>>> } >>>>> >>>>> unset req.http.Cookie; >>>>> if (req.method == "PURGE") { >>>>> ban("obj.http.x-host == " + req.http.host + " && >>>>> obj.http.x-user-agent !~ Googlebot"); >>>>> return(synth(750)); >>>>> } >>>>> >>>>> set req.url = regsuball(req.url, "(?>>>> } >>>>> >>>>> sub vcl_synth { >>>>> if (resp.status == 750) { >>>>> set resp.status = 200; >>>>> synthetic("PURGED => " + req.url); >>>>> return(deliver); >>>>> } elsif (resp.status == 501) { >>>>> set resp.status = 200; >>>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >>>>> return(deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_backend_response { >>>>> unset beresp.http.Set-Cookie; >>>>> set beresp.http.x-host = bereq.http.host; >>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>> >>>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>>> || bereq.url == "/api/events/PAGEVIEW" >>>>> || bereq.url ~ "^\/assets\/img\/") { >>>>> set beresp.http.Cache-Control = "max-age=0"; >>>>> } else { >>>>> unset beresp.http.Cache-Control; >>>>> } >>>>> >>>>> if (beresp.status == 200 || >>>>> beresp.status == 301 || >>>>> beresp.status == 302 || >>>>> beresp.status == 
404) { >>>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>>> set beresp.http.X-TTL = "1d"; >>>>> set beresp.ttl = 1d; >>>>> } else { >>>>> set beresp.http.X-TTL = "1w"; >>>>> set beresp.ttl = 1w; >>>>> } >>>>> } >>>>> >>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>>> { >>>>> set beresp.do_gzip = true; >>>>> } >>>>> } >>>>> >>>>> sub vcl_pipe { >>>>> set bereq.http.connection = "close"; >>>>> return (pipe); >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> unset resp.http.x-host; >>>>> unset resp.http.x-user-agent; >>>>> } >>>>> >>>>> sub vcl_backend_error { >>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >>>>> 504) { >>>>> set beresp.status = 200; >>>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>>> return (deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_hash { >>>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>>> hash_data("Google Page Speed"); >>>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>>> hash_data("Googlebot"); >>>>> } >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> if (resp.status == 501) { >>>>> return (synth(resp.status)); >>>>> } >>>>> if (obj.hits > 0) { >>>>> set resp.http.X-Cache = "hit"; >>>>> } else { >>>>> set resp.http.X-Cache = "miss"; >>>>> } >>>>> } >>>>> >>>>> >>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Nice! It may have been the cause, time will tell.can you report back >>>>>> in a few days to let us know? >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" >>>>>> wrote: >>>>>> >>>>>>> Hi Guillaume. >>>>>>> >>>>>>> I think things will start to going better now after changing the >>>>>>> bans. >>>>>>> This is how my last varnishstat looked like moments before a crash >>>>>>> regarding the bans: >>>>>>> >>>>>>> MAIN.bans 41336 . Count of bans >>>>>>> MAIN.bans_completed 37967 . 
Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 0 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 41335 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>>> >>>>>>> And this is how it looks like now: >>>>>>> >>>>>>> MAIN.bans 2 . Count of bans >>>>>>> MAIN.bans_completed 1 . Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 2 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 0 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>>> >>>>>>> Before the changes, bans were never deleted! >>>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>>>> couple of seconds. >>>>>>> >>>>>>> May this was the cause of the problem? It seems like varnish was >>>>>>> having a large number of bans to manage and test against. >>>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>>> gone! :-) >>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>>> guillaume at varnish-software.com> wrote: >>>>>>> >>>>>>>> Looking good! >>>>>>>> >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Guillaume, >>>>>>>>> >>>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>>> >>>>>>>>> sub vcl_backend_response { >>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_recv { >>>>>>>>> if (req.method == "PURGE") { >>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>>> return(synth(750)); >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_deliver { >>>>>>>>> unset resp.http.x-url; >>>>>>>>> unset resp.http.x-user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>> >>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>>> >>>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi Guillaume. >>>>>>>>>> >>>>>>>>>> Thanks for answering. >>>>>>>>>> >>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>>>>>>> performance but it stills restarting. >>>>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>>>> signal of overhead. >>>>>>>>>> >>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its >>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using >>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>>>> was a problem here. 
After a couple of hours varnish died and I received a >>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved >>>>>>>>>> the problem and varnish was up again, but it's weird because there was free >>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>>>> /var/lib/varnish size. >>>>>>>>>> >>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>>>>>>>> >>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>>> >>>>>>>>>> Are they lurker friendly? I was taking a quick look and the >>>>>>>>>> documentation and it looks like they're not. >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Stefano, >>>>>>>>>>> >>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>>>>> >>>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs, >>>>>>>>>>> and you can try different advices in the file storage definition in the >>>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>>> >>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>>> >>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? 
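[Editor's note] The distinction this thread circles around can be sketched in one place. A ban is "lurker friendly" when its expression references only obj.*, so the background ban-lurker thread can test it against cached objects without any request context; req.*-based bans can only be tested when a matching request arrives, which is why they pile up (40k bans, none deleted). A minimal sketch, reusing the x-url / x-user-agent header names from the snippet quoted above:

```vcl
sub vcl_backend_response {
    # Stash request facts on the object so bans can match on obj.* later.
    set beresp.http.x-url = bereq.http.host + bereq.url;
    set beresp.http.x-user-agent = bereq.http.user-agent;
}

sub vcl_recv {
    if (req.method == "PURGE") {
        # Lurker-UNFRIENDLY (references req.*; only request threads can
        # test it, so completed bans accumulate):
        #   ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url);
        # Lurker-friendly (references obj.* only; the ban lurker can walk
        # the cache and retire the ban in the background):
        ban("obj.http.x-url == " + req.http.host + req.url +
            " && obj.http.x-user-agent !~ Googlebot");
        return (synth(750));
    }
}

sub vcl_deliver {
    # Keep the stashed bookkeeping headers out of client responses.
    unset resp.http.x-url;
    unset resp.http.x-user-agent;
}
```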
>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Guillaume Quintard >>>>>>>>>>> >>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello. >>>>>>>>>>>> >>>>>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>>>>> for over a month and any help will be appreciated. >>>>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>>>> >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>>>>> responding to CLI, killed it. >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>> died signal=9 >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>>> complete >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> Started >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said Child starts >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>>> >>>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>>> before a restart: >>>>>>>>>>>> >>>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>>> >>>>>>>>>>>> Environment: >>>>>>>>>>>> >>>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>>> packagecloud.io >>>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>>> Running inside a Docker container >>>>>>>>>>>> NFILES=131072 >>>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>>> >>>>>>>>>>>> Additional info: >>>>>>>>>>>> >>>>>>>>>>>> - I need to cache a large number of objets and the 
cache should >>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know >>>>>>>>>>>> if this is a problem; >>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>>>> anything to do with it; >>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>> syslog; >>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>> -misc >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Wed Jun 28 13:47:17 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Wed, 28 Jun 2017 10:47:17 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: SMA.Transient.g_alloc 3518 . Allocations outstanding SMA.Transient.g_bytes 546390 . Bytes outstanding SMA.Transient.g_space 0 . Bytes available g_space is always 0. It could mean anything? On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Yeah, I was wondering about Transient, but it seems under control. > > Apart from moving away from file storage, I have nothing at the moment :-/ > > -- > Guillaume Quintard > > On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo > wrote: > >> Hi. 
>> >> root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes >> SMA.Transient.g_bytes 519022 . Bytes >> outstanding >> SMF.s0.g_bytes 23662845952 . Bytes >> outstanding >> >> You mean g_bytes from SMA.Transient? I have set no malloc storage. >> >> >> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Hi, >>> >>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches >>> the memory you are seeing? >>> >>> -- >>> Guillaume Quintard >>> >>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo >>> wrote: >>> >>>> Hi Guillaume. >>>> >>>> I increased the cli_timeout yesterday to 900sec (15min) and it >>>> restarted anyway, which seems to indicate that the thread is really stalled. >>>> >>>> This was 1 minute after the last restart: >>>> >>>> MAIN.n_object 3908216 . object structs made >>>> SMF.s0.g_alloc 7794510 . Allocations outstanding >>>> >>>> I've just changed the I/O Scheduler to noop to see what happens. >>>> >>>> One interest thing I've found is about the memory usage. >>>> >>>> In the 1st minute of use: >>>> MemTotal: 3865572 kB >>>> MemFree: 120768 kB >>>> MemAvailable: 2300268 kB >>>> >>>> 1 minute before a restart: >>>> MemTotal: 3865572 kB >>>> MemFree: 82480 kB >>>> MemAvailable: 68316 kB >>>> >>>> It seems like the system is possibly running out of memory. >>>> >>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I >>>> see in some examples that is common to use "-s file" AND "-s malloc" >>>> together. Should I be passing "-s malloc" as well to somehow try to limit >>>> the memory usage by varnishd? 
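[Editor's note] On the question quoted above: adding "-s malloc" next to "-s file" defines a second, separate storage backend rather than capping the file storage's footprint. With file storage, resident memory largely comes from the kernel's page cache over the mmap'ed file, plus the Transient storage, which by default is an unlimited malloc store. If the goal is bounding Transient, it can be declared explicitly. A hedged sketch (paths and sizes are placeholders, not the poster's actual values):

```shell
# Illustrative varnishd invocation: keep the file storage, but give
# Transient an explicit malloc cap instead of adding an unnamed -s malloc.
varnishd \
    -a :80 \
    -f /etc/varnish/default.vcl \
    -s file,/var/lib/varnish/storage.bin,450G \
    -s Transient=malloc,256m
```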
>>>> >>>> Best, >>>> Stefano >>>> >>>> >>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Sadly, nothing suspicious here, you can still try: >>>>> - bumping the cli_timeout >>>>> - changing your disk scheduler >>>>> - changing the advice option of the file storage >>>>> >>>>> I'm still convinced this is due to Varnish getting stuck waiting for >>>>> the disk because of the file storage fragmentation. >>>>> >>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>>>> objects. Ideally, we would have a 1:1 relation between objects and >>>>> allocations. If that number drops prior to a restart, that would be a good >>>>> clue. >>>>> >>>>> >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo < >>>>> stefanobaldo at gmail.com> wrote: >>>>> >>>>>> Hi Guillaume. >>>>>> >>>>>> It keeps restarting. >>>>>> Would you mind taking a quick look in the following VCL file to check >>>>>> if you find anything suspicious? >>>>>> >>>>>> Thank you very much. 
>>>>>> >>>>>> Best, >>>>>> Stefano >>>>>> >>>>>> vcl 4.0; >>>>>> >>>>>> import std; >>>>>> >>>>>> backend default { >>>>>> .host = "sites-web-server-lb"; >>>>>> .port = "80"; >>>>>> } >>>>>> >>>>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>>>> >>>>>> sub vcl_recv { >>>>>> call bad_bot_detection; >>>>>> >>>>>> if (req.url == "/nocache" || req.url == "/version") { >>>>>> return(pass); >>>>>> } >>>>>> >>>>>> unset req.http.Cookie; >>>>>> if (req.method == "PURGE") { >>>>>> ban("obj.http.x-host == " + req.http.host + " && >>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>> return(synth(750)); >>>>>> } >>>>>> >>>>>> set req.url = regsuball(req.url, "(?>>>>> } >>>>>> >>>>>> sub vcl_synth { >>>>>> if (resp.status == 750) { >>>>>> set resp.status = 200; >>>>>> synthetic("PURGED => " + req.url); >>>>>> return(deliver); >>>>>> } elsif (resp.status == 501) { >>>>>> set resp.status = 200; >>>>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht >>>>>> ml")); >>>>>> return(deliver); >>>>>> } >>>>>> } >>>>>> >>>>>> sub vcl_backend_response { >>>>>> unset beresp.http.Set-Cookie; >>>>>> set beresp.http.x-host = bereq.http.host; >>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>> >>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>>>> || bereq.url == "/api/events/PAGEVIEW" >>>>>> || bereq.url ~ "^\/assets\/img\/") { >>>>>> set beresp.http.Cache-Control = "max-age=0"; >>>>>> } else { >>>>>> unset beresp.http.Cache-Control; >>>>>> } >>>>>> >>>>>> if (beresp.status == 200 || >>>>>> beresp.status == 301 || >>>>>> beresp.status == 302 || >>>>>> beresp.status == 404) { >>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>>>> set beresp.http.X-TTL = "1d"; >>>>>> set beresp.ttl = 1d; >>>>>> } else { >>>>>> set beresp.http.X-TTL = "1w"; >>>>>> set beresp.ttl = 1w; >>>>>> } >>>>>> } >>>>>> >>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>>>> { >>>>>> 
set beresp.do_gzip = true; >>>>>> } >>>>>> } >>>>>> >>>>>> sub vcl_pipe { >>>>>> set bereq.http.connection = "close"; >>>>>> return (pipe); >>>>>> } >>>>>> >>>>>> sub vcl_deliver { >>>>>> unset resp.http.x-host; >>>>>> unset resp.http.x-user-agent; >>>>>> } >>>>>> >>>>>> sub vcl_backend_error { >>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status >>>>>> == 504) { >>>>>> set beresp.status = 200; >>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>>>> return (deliver); >>>>>> } >>>>>> } >>>>>> >>>>>> sub vcl_hash { >>>>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>>>> hash_data("Google Page Speed"); >>>>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>>>> hash_data("Googlebot"); >>>>>> } >>>>>> } >>>>>> >>>>>> sub vcl_deliver { >>>>>> if (resp.status == 501) { >>>>>> return (synth(resp.status)); >>>>>> } >>>>>> if (obj.hits > 0) { >>>>>> set resp.http.X-Cache = "hit"; >>>>>> } else { >>>>>> set resp.http.X-Cache = "miss"; >>>>>> } >>>>>> } >>>>>> >>>>>> >>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>>>> guillaume at varnish-software.com> wrote: >>>>>> >>>>>>> Nice! It may have been the cause, time will tell.can you report back >>>>>>> in a few days to let us know? >>>>>>> -- >>>>>>> Guillaume Quintard >>>>>>> >>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" >>>>>>> wrote: >>>>>>> >>>>>>>> Hi Guillaume. >>>>>>>> >>>>>>>> I think things will start to going better now after changing the >>>>>>>> bans. >>>>>>>> This is how my last varnishstat looked like moments before a crash >>>>>>>> regarding the bans: >>>>>>>> >>>>>>>> MAIN.bans 41336 . Count of bans >>>>>>>> MAIN.bans_completed 37967 . Number of bans >>>>>>>> marked 'completed' >>>>>>>> MAIN.bans_obj 0 . Number of bans >>>>>>>> using obj.* >>>>>>>> MAIN.bans_req 41335 . 
Number of bans >>>>>>>> using req.* >>>>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>>>> >>>>>>>> And this is how it looks like now: >>>>>>>> >>>>>>>> MAIN.bans 2 . Count of bans >>>>>>>> MAIN.bans_completed 1 . Number of bans >>>>>>>> marked 'completed' >>>>>>>> MAIN.bans_obj 2 . Number of bans >>>>>>>> using obj.* >>>>>>>> MAIN.bans_req 0 . Number of bans >>>>>>>> using req.* >>>>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>>>> >>>>>>>> Before the changes, bans were never deleted! >>>>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>>>>> couple of seconds. >>>>>>>> >>>>>>>> May this was the cause of the problem? It seems like varnish was >>>>>>>> having a large number of bans to manage and test against. >>>>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>>>> gone! :-) >>>>>>>> >>>>>>>> Best, >>>>>>>> Stefano >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>> >>>>>>>>> Looking good! >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Guillaume Quintard >>>>>>>>> >>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Guillaume, >>>>>>>>>> >>>>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>>>> >>>>>>>>>> sub vcl_backend_response { >>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> sub vcl_recv { >>>>>>>>>> if (req.method == "PURGE") { >>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>>>> return(synth(750)); >>>>>>>>>> } >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> sub vcl_deliver { >>>>>>>>>> unset resp.http.x-url; >>>>>>>>>> unset resp.http.x-user-agent; >>>>>>>>>> } >>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>> >>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>>>> >>>>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Guillaume Quintard >>>>>>>>>>> >>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>>>> wrote: >>>>>>>>>>> >>>>>>>>>>> Hi Guillaume. >>>>>>>>>>> >>>>>>>>>>> Thanks for answering. >>>>>>>>>>> >>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>>>>>>>> performance but it stills restarting. >>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>>>>> signal of overhead. >>>>>>>>>>> >>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its >>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using >>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>>>>> was a problem here. 
After a couple of hours varnish died and I received a >>>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved >>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free >>>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>>>>> /var/lib/varnish size. >>>>>>>>>>> >>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans >>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way: >>>>>>>>>>> >>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>>>> >>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the >>>>>>>>>>> documentation and it looks like they're not. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Stefano >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hi Stefano, >>>>>>>>>>>> >>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>>>>>> >>>>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs, >>>>>>>>>>>> and you can try different advices in the file storage definition in the >>>>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>>>> >>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>>>> >>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? 
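[Editor's note] One way to verify that a ban rewrite like the one discussed here actually helps is to watch the ban counters over time: with lurker-friendly bans, MAIN.bans_deleted should keep pace with MAIN.bans_added instead of the ban list growing monotonically, as it does later in this thread. A small sketch using varnishstat's field filter:

```shell
# Print only the ban-related counters (glob field filter).
varnishstat -1 -f 'MAIN.bans*'

# Or sample the add/delete rates every 10 seconds to watch bans being retired.
while sleep 10; do
    varnishstat -1 -f MAIN.bans_added -f MAIN.bans_deleted
done
```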
>>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Guillaume Quintard >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hello. >>>>>>>>>>>>> >>>>>>>>>>>>> I am having a critical problem with Varnish Cache in >>>>>>>>>>>>> production for over a month and any help will be appreciated. >>>>>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>>>>> >>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>>> not responding to CLI, killed it. >>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>>> died signal=9 >>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>>>> complete >>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>> Started >>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>> said Child starts >>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>>>> >>>>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>>>> before a restart: >>>>>>>>>>>>> >>>>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>>>> >>>>>>>>>>>>> Environment: >>>>>>>>>>>>> >>>>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>>>> packagecloud.io >>>>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>>>> Running inside a Docker container >>>>>>>>>>>>> NFILES=131072 >>>>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>>>> >>>>>>>>>>>>> Additional info: >>>>>>>>>>>>> >>>>>>>>>>>>> - 
I need to cache a large number of objets and the cache >>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I >>>>>>>>>>>>> don't know if this is a problem; >>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>>>>> anything to do with it; >>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>>> syslog; >>>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>>> -misc >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Wed Jun 28 13:54:28 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Wed, 28 Jun 2017 10:54:28 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Also, we are running varnish inside a docker container. The storage disk is attached to the same host, and mounted to the container via docker volume. Do you think it's worth a try to run varnish directly on the host, avoiding docker? I don't see how this could be a problem but I don't know what to do anymore. Best, On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Yeah, I was wondering about Transient, but it seems under control. 
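[Editor's note] Regarding the earlier "no space left on device" on the tmpfs-backed /var/lib/varnish: "-l 200m,20m" reserves about 220 MB for the shared memory log alone, and the working directory also holds other VSM segments and the compiled VCL shared objects, so a 256 MB tmpfs leaves little headroom. A quick check (mount point per the thread; adjust as needed):

```shell
# Compare the tmpfs capacity with what varnishd actually keeps in it.
df -h /var/lib/varnish
du -sh /var/lib/varnish/*
```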
I need to cache a large number of objets and the cache >>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I >>>>>>>>>>>>> don't know if this is a problem; >>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>>>>> anything to do with it; >>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>>> syslog; >>>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>>>> >>>>>>>>>>>>> Best, >>>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>>> -misc >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>> >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Wed Jun 28 13:58:55 2017 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Wed, 28 Jun 2017 15:58:55 +0200 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Transient is not limited I suppose, so the g_space == 0 is normal. You can try running on bare metal, not sure there will be a difference -- Guillaume Quintard On Wed, Jun 28, 2017 at 3:54 PM, Stefano Baldo wrote: > Also, we are running varnish inside a docker container. > The storage disk is attached to the same host, and mounted to the > container via docker volume. > > Do you think it's worth a try to run varnish directly on the host, > avoiding docker? 
I don't see how this could be a problem but I don't know > what to do anymore. > > Best, > > > On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Yeah, I was wondering about Transient, but it seems under control. >> >> Apart from moving away from file storage, I have nothing at the moment :-/ >> >> -- >> Guillaume Quintard >> >> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo >> wrote: >> >>> Hi. >>> >>> root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes >>> SMA.Transient.g_bytes 519022 . Bytes >>> outstanding >>> SMF.s0.g_bytes 23662845952 . Bytes >>> outstanding >>> >>> You mean g_bytes from SMA.Transient? I have set no malloc storage. >>> >>> >>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>> >>>> Hi, >>>> >>>> can you look at "varnishstat -1 | grep g_bytes" and see if it matches >>>> the memory you are seeing? >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo >>>> wrote: >>>> >>>>> Hi Guillaume. >>>>> >>>>> I increased the cli_timeout yesterday to 900sec (15min) and it >>>>> restarted anyway, which seems to indicate that the thread is really stalled. >>>>> >>>>> This was 1 minute after the last restart: >>>>> >>>>> MAIN.n_object 3908216 . object structs made >>>>> SMF.s0.g_alloc 7794510 . Allocations >>>>> outstanding >>>>> >>>>> I've just changed the I/O Scheduler to noop to see what happens. >>>>> >>>>> One interesting thing I've found is about the memory usage. >>>>> >>>>> In the 1st minute of use: >>>>> MemTotal: 3865572 kB >>>>> MemFree: 120768 kB >>>>> MemAvailable: 2300268 kB >>>>> >>>>> 1 minute before a restart: >>>>> MemTotal: 3865572 kB >>>>> MemFree: 82480 kB >>>>> MemAvailable: 68316 kB >>>>> >>>>> It seems like the system is possibly running out of memory. >>>>> >>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. 
I >>>>> see in some examples that is common to use "-s file" AND "-s malloc" >>>>> together. Should I be passing "-s malloc" as well to somehow try to limit >>>>> the memory usage by varnishd? >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> >>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Sadly, nothing suspicious here, you can still try: >>>>>> - bumping the cli_timeout >>>>>> - changing your disk scheduler >>>>>> - changing the advice option of the file storage >>>>>> >>>>>> I'm still convinced this is due to Varnish getting stuck waiting for >>>>>> the disk because of the file storage fragmentation. >>>>>> >>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>>>>> objects. Ideally, we would have a 1:1 relation between objects and >>>>>> allocations. If that number drops prior to a restart, that would be a good >>>>>> clue. >>>>>> >>>>>> >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo < >>>>>> stefanobaldo at gmail.com> wrote: >>>>>> >>>>>>> Hi Guillaume. >>>>>>> >>>>>>> It keeps restarting. >>>>>>> Would you mind taking a quick look in the following VCL file to >>>>>>> check if you find anything suspicious? >>>>>>> >>>>>>> Thank you very much. 
>>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> vcl 4.0; >>>>>>> >>>>>>> import std; >>>>>>> >>>>>>> backend default { >>>>>>> .host = "sites-web-server-lb"; >>>>>>> .port = "80"; >>>>>>> } >>>>>>> >>>>>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>>>>> >>>>>>> sub vcl_recv { >>>>>>> call bad_bot_detection; >>>>>>> >>>>>>> if (req.url == "/nocache" || req.url == "/version") { >>>>>>> return(pass); >>>>>>> } >>>>>>> >>>>>>> unset req.http.Cookie; >>>>>>> if (req.method == "PURGE") { >>>>>>> ban("obj.http.x-host == " + req.http.host + " && >>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>> return(synth(750)); >>>>>>> } >>>>>>> >>>>>>> set req.url = regsuball(req.url, "(?>>>>>> } >>>>>>> >>>>>>> sub vcl_synth { >>>>>>> if (resp.status == 750) { >>>>>>> set resp.status = 200; >>>>>>> synthetic("PURGED => " + req.url); >>>>>>> return(deliver); >>>>>>> } elsif (resp.status == 501) { >>>>>>> set resp.status = 200; >>>>>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht >>>>>>> ml")); >>>>>>> return(deliver); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_backend_response { >>>>>>> unset beresp.http.Set-Cookie; >>>>>>> set beresp.http.x-host = bereq.http.host; >>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>> >>>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>>>>> || bereq.url == "/api/events/PAGEVIEW" >>>>>>> || bereq.url ~ "^\/assets\/img\/") { >>>>>>> set beresp.http.Cache-Control = "max-age=0"; >>>>>>> } else { >>>>>>> unset beresp.http.Cache-Control; >>>>>>> } >>>>>>> >>>>>>> if (beresp.status == 200 || >>>>>>> beresp.status == 301 || >>>>>>> beresp.status == 302 || >>>>>>> beresp.status == 404) { >>>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>>>>> set beresp.http.X-TTL = "1d"; >>>>>>> set beresp.ttl = 1d; >>>>>>> } else { >>>>>>> set beresp.http.X-TTL = "1w"; >>>>>>> set beresp.ttl = 1w; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> if (bereq.url !~ 
"\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>>>>> { >>>>>>> set beresp.do_gzip = true; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_pipe { >>>>>>> set bereq.http.connection = "close"; >>>>>>> return (pipe); >>>>>>> } >>>>>>> >>>>>>> sub vcl_deliver { >>>>>>> unset resp.http.x-host; >>>>>>> unset resp.http.x-user-agent; >>>>>>> } >>>>>>> >>>>>>> sub vcl_backend_error { >>>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status >>>>>>> == 504) { >>>>>>> set beresp.status = 200; >>>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>>>>> return (deliver); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_hash { >>>>>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>>>>> hash_data("Google Page Speed"); >>>>>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>>>>> hash_data("Googlebot"); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_deliver { >>>>>>> if (resp.status == 501) { >>>>>>> return (synth(resp.status)); >>>>>>> } >>>>>>> if (obj.hits > 0) { >>>>>>> set resp.http.X-Cache = "hit"; >>>>>>> } else { >>>>>>> set resp.http.X-Cache = "miss"; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>>>>> guillaume at varnish-software.com> wrote: >>>>>>> >>>>>>>> Nice! It may have been the cause, time will tell.can you report >>>>>>>> back in a few days to let us know? >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Guillaume. >>>>>>>>> >>>>>>>>> I think things will start to going better now after changing the >>>>>>>>> bans. >>>>>>>>> This is how my last varnishstat looked like moments before a crash >>>>>>>>> regarding the bans: >>>>>>>>> >>>>>>>>> MAIN.bans 41336 . Count of bans >>>>>>>>> MAIN.bans_completed 37967 . Number of bans >>>>>>>>> marked 'completed' >>>>>>>>> MAIN.bans_obj 0 . 
Number of bans >>>>>>>>> using obj.* >>>>>>>>> MAIN.bans_req 41335 . Number of bans >>>>>>>>> using req.* >>>>>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>>>>> >>>>>>>>> And this is how it looks like now: >>>>>>>>> >>>>>>>>> MAIN.bans 2 . Count of bans >>>>>>>>> MAIN.bans_completed 1 . Number of bans >>>>>>>>> marked 'completed' >>>>>>>>> MAIN.bans_obj 2 . Number of bans >>>>>>>>> using obj.* >>>>>>>>> MAIN.bans_req 0 . Number of bans >>>>>>>>> using req.* >>>>>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>>>>> >>>>>>>>> Before the changes, bans were never deleted! >>>>>>>>> Now the bans are added and quickly deleted after a minute or even >>>>>>>>> a couple of seconds. >>>>>>>>> >>>>>>>>> May this was the cause of the problem? It seems like varnish was >>>>>>>>> having a large number of bans to manage and test against. >>>>>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>>>>> gone! :-) >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>> >>>>>>>>>> Looking good! >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Guillaume, >>>>>>>>>>> >>>>>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>>>>> >>>>>>>>>>> sub vcl_backend_response { >>>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> sub vcl_recv { >>>>>>>>>>> if (req.method == "PURGE") { >>>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>>>>> return(synth(750)); >>>>>>>>>>> } >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> sub vcl_deliver { >>>>>>>>>>> unset resp.http.x-url; >>>>>>>>>>> unset resp.http.x-user-agent; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Stefano >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>>>>> >>>>>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Guillaume Quintard >>>>>>>>>>>> >>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Guillaume. >>>>>>>>>>>> >>>>>>>>>>>> Thanks for answering. >>>>>>>>>>>> >>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to >>>>>>>>>>>> increase performance but it stills restarting. >>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is >>>>>>>>>>>> no signal of overhead. >>>>>>>>>>>> >>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its >>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using >>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. >>>>>>>>>>>> There was a problem here. 
After a couple of hours varnish died and I >>>>>>>>>>>> received a "no space left on device" message - deleting the >>>>>>>>>>>> /var/lib/varnish solved the problem and varnish was up again, but it's >>>>>>>>>>>> weird because there was free memory on the host to be used with the tmpfs >>>>>>>>>>>> directory, so I don't know what could have happened. I will try to stop >>>>>>>>>>>> increasing the /var/lib/varnish size. >>>>>>>>>>>> >>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans >>>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way: >>>>>>>>>>>> >>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>>>>> >>>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the >>>>>>>>>>>> documentation and it looks like they're not. >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Stefano >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>>>> >>>>>>>>>>>>> Hi Stefano, >>>>>>>>>>>>> >>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish >>>>>>>>>>>>> gets stuck trying to push/pull data and can't make time to reply to the >>>>>>>>>>>>> CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to >>>>>>>>>>>>> confirm. >>>>>>>>>>>>> >>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs, >>>>>>>>>>>>> and you can try different advices in the file storage definition in the >>>>>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>>>>> >>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>>>>> >>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? 
>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Guillaume Quintard >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hello. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in >>>>>>>>>>>>>> production for over a month and any help will be appreciated. >>>>>>>>>>>>>> The problem is that Varnish child process is recurrently >>>>>>>>>>>>>> being restarted after 10~20h of use, with the following message: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>>>> not responding to CLI, killed it. >>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected >>>>>>>>>>>>>> reply from ping: 400 CLI communication error >>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>>>> died signal=9 >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>>>>> complete >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>>> Started >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>>> said Child starts >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>>>>> >>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>>>>> before a restart: >>>>>>>>>>>>>> >>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>>>>> >>>>>>>>>>>>>> Environment: >>>>>>>>>>>>>> >>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>>>>> packagecloud.io >>>>>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>>>>> Running inside a Docker container >>>>>>>>>>>>>> NFILES=131072 >>>>>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>>>>> >>>>>>>>>>>>>> 
Additional info: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - I need to cache a large number of objets and the cache >>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I >>>>>>>>>>>>>> don't know if this is a problem; >>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system >>>>>>>>>>>>>> just before the last crash. I really don't know if this is too much or may >>>>>>>>>>>>>> have anything to do with it; >>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>>>> syslog; >>>>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>>>> -misc >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From reza at varnish-software.com Wed Jun 28 14:33:49 2017 From: reza at varnish-software.com (Reza Naghibi) Date: Wed, 28 Jun 2017 10:33:49 -0400 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Assuming the problem is running out of memory, you will need to do some memory tuning, especially given the number of threads you are using and your access patterns. 
Your options: - Add more memory to the system - Reduce thread_pool_max - Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10") - Use some of the tuning params in here: https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage -- Reza Naghibi Varnish Software On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard < guillaume at varnish-software.com> wrote: > Hi, > > can you look that "varnishstat -1 | grep g_bytes" and see if if matches > the memory you are seeing? > > -- > Guillaume Quintard > > On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo > wrote: > >> Hi Guillaume. >> >> I increased the cli_timeout yesterday to 900sec (15min) and it restarted >> anyway, which seems to indicate that the thread is really stalled. >> >> This was 1 minute after the last restart: >> >> MAIN.n_object 3908216 . object structs made >> SMF.s0.g_alloc 7794510 . Allocations outstanding >> >> I've just changed the I/O Scheduler to noop to see what happens. >> >> One interest thing I've found is about the memory usage. >> >> In the 1st minute of use: >> MemTotal: 3865572 kB >> MemFree: 120768 kB >> MemAvailable: 2300268 kB >> >> 1 minute before a restart: >> MemTotal: 3865572 kB >> MemFree: 82480 kB >> MemAvailable: 68316 kB >> >> It seems like the system is possibly running out of memory. >> >> When calling varnishd, I'm specifying only "-s file,..." as storage. I >> see in some examples that is common to use "-s file" AND "-s malloc" >> together. Should I be passing "-s malloc" as well to somehow try to limit >> the memory usage by varnishd? 
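For illustration only, Reza's tuning options might be combined on a concrete `varnishd` invocation. This is a sketch with placeholder paths and sizes; `thread_pool_max=1000` and the jemalloc `lg_tcache_max` value are examples to adapt, not tuned recommendations, and `cli_timeout=900` reflects the value Stefano tried earlier in the thread:

```shell
# Shrink jemalloc's per-thread cache before the child starts
export MALLOC_CONF="lg_tcache_max:10"

varnishd \
  -a :80 \
  -f /etc/varnish/default.vcl \
  -s file,/var/lib/varnish/storage.bin,450G \
  -p thread_pool_max=1000 \
  -p cli_timeout=900
```

Note that a second `-s malloc` would define an additional storage backend, not a cap on the memory used by the `file` storage's page cache.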
>> >> Best, >> Stefano >> >> >> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >> guillaume at varnish-software.com> wrote: >> >>> Sadly, nothing suspicious here, you can still try: >>> - bumping the cli_timeout >>> - changing your disk scheduler >>> - changing the advice option of the file storage >>> >>> I'm still convinced this is due to Varnish getting stuck waiting for the >>> disk because of the file storage fragmentation. >>> >>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>> objects. Ideally, we would have a 1:1 relation between objects and >>> allocations. If that number drops prior to a restart, that would be a good >>> clue. >>> >>> >>> -- >>> Guillaume Quintard >>> >>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo >>> wrote: >>> >>>> Hi Guillaume. >>>> >>>> It keeps restarting. >>>> Would you mind taking a quick look in the following VCL file to check >>>> if you find anything suspicious? >>>> >>>> Thank you very much. >>>> >>>> Best, >>>> Stefano >>>> >>>> vcl 4.0; >>>> >>>> import std; >>>> >>>> backend default { >>>> .host = "sites-web-server-lb"; >>>> .port = "80"; >>>> } >>>> >>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>> >>>> sub vcl_recv { >>>> call bad_bot_detection; >>>> >>>> if (req.url == "/nocache" || req.url == "/version") { >>>> return(pass); >>>> } >>>> >>>> unset req.http.Cookie; >>>> if (req.method == "PURGE") { >>>> ban("obj.http.x-host == " + req.http.host + " && >>>> obj.http.x-user-agent !~ Googlebot"); >>>> return(synth(750)); >>>> } >>>> >>>> set req.url = regsuball(req.url, "(?>>> } >>>> >>>> sub vcl_synth { >>>> if (resp.status == 750) { >>>> set resp.status = 200; >>>> synthetic("PURGED => " + req.url); >>>> return(deliver); >>>> } elsif (resp.status == 501) { >>>> set resp.status = 200; >>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >>>> return(deliver); >>>> } >>>> } >>>> >>>> sub 
vcl_backend_response { >>>> unset beresp.http.Set-Cookie; >>>> set beresp.http.x-host = bereq.http.host; >>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>> >>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>> || bereq.url == "/api/events/PAGEVIEW" >>>> || bereq.url ~ "^\/assets\/img\/") { >>>> set beresp.http.Cache-Control = "max-age=0"; >>>> } else { >>>> unset beresp.http.Cache-Control; >>>> } >>>> >>>> if (beresp.status == 200 || >>>> beresp.status == 301 || >>>> beresp.status == 302 || >>>> beresp.status == 404) { >>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>> set beresp.http.X-TTL = "1d"; >>>> set beresp.ttl = 1d; >>>> } else { >>>> set beresp.http.X-TTL = "1w"; >>>> set beresp.ttl = 1w; >>>> } >>>> } >>>> >>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>> { >>>> set beresp.do_gzip = true; >>>> } >>>> } >>>> >>>> sub vcl_pipe { >>>> set bereq.http.connection = "close"; >>>> return (pipe); >>>> } >>>> >>>> sub vcl_deliver { >>>> unset resp.http.x-host; >>>> unset resp.http.x-user-agent; >>>> } >>>> >>>> sub vcl_backend_error { >>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >>>> 504) { >>>> set beresp.status = 200; >>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>> return (deliver); >>>> } >>>> } >>>> >>>> sub vcl_hash { >>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>> hash_data("Google Page Speed"); >>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>> hash_data("Googlebot"); >>>> } >>>> } >>>> >>>> sub vcl_deliver { >>>> if (resp.status == 501) { >>>> return (synth(resp.status)); >>>> } >>>> if (obj.hits > 0) { >>>> set resp.http.X-Cache = "hit"; >>>> } else { >>>> set resp.http.X-Cache = "miss"; >>>> } >>>> } >>>> >>>> >>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>> guillaume at varnish-software.com> wrote: >>>> >>>>> Nice! 
It may have been the cause, time will tell.can you report back >>>>> in a few days to let us know? >>>>> -- >>>>> Guillaume Quintard >>>>> >>>>> On Jun 26, 2017 20:21, "Stefano Baldo" wrote: >>>>> >>>>>> Hi Guillaume. >>>>>> >>>>>> I think things will start to going better now after changing the bans. >>>>>> This is how my last varnishstat looked like moments before a crash >>>>>> regarding the bans: >>>>>> >>>>>> MAIN.bans 41336 . Count of bans >>>>>> MAIN.bans_completed 37967 . Number of bans >>>>>> marked 'completed' >>>>>> MAIN.bans_obj 0 . Number of bans using >>>>>> obj.* >>>>>> MAIN.bans_req 41335 . Number of bans using >>>>>> req.* >>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>> >>>>>> And this is how it looks like now: >>>>>> >>>>>> MAIN.bans 2 . Count of bans >>>>>> MAIN.bans_completed 1 . Number of bans >>>>>> marked 'completed' >>>>>> MAIN.bans_obj 2 . Number of bans using >>>>>> obj.* >>>>>> MAIN.bans_req 0 . Number of bans using >>>>>> req.* >>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>> >>>>>> Before the changes, bans were never deleted! >>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>>> couple of seconds. >>>>>> >>>>>> May this was the cause of the problem? It seems like varnish was >>>>>> having a large number of bans to manage and test against. >>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>> gone! :-) >>>>>> >>>>>> Best, >>>>>> Stefano >>>>>> >>>>>> >>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>> guillaume at varnish-software.com> wrote: >>>>>> >>>>>>> Looking good! >>>>>>> >>>>>>> -- >>>>>>> Guillaume Quintard >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>> >>>>>>>> Hi Guillaume, >>>>>>>> >>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>> >>>>>>>> sub vcl_backend_response { >>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>> } >>>>>>>> >>>>>>>> sub vcl_recv { >>>>>>>> if (req.method == "PURGE") { >>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>> return(synth(750)); >>>>>>>> } >>>>>>>> } >>>>>>>> >>>>>>>> sub vcl_deliver { >>>>>>>> unset resp.http.x-url; >>>>>>>> unset resp.http.x-user-agent; >>>>>>>> } >>>>>>>> >>>>>>>> Best, >>>>>>>> Stefano >>>>>>>> >>>>>>>> >>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>> >>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>> >>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>> >>>>>>>>> -- >>>>>>>>> Guillaume Quintard >>>>>>>>> >>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>> Hi Guillaume. >>>>>>>>> >>>>>>>>> Thanks for answering. >>>>>>>>> >>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>>>>>> performance but it stills restarting. >>>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>>> signal of overhead. >>>>>>>>> >>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m >>>>>>>>> default size passing "-l 200m,20m" to varnishd and using >>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>>> was a problem here. 
After a couple of hours varnish died and I received a >>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved >>>>>>>>> the problem and varnish was up again, but it's weird because there was free >>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>>> /var/lib/varnish size. >>>>>>>>> >>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way: >>>>>>>>> >>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>> >>>>>>>>> Are they lurker friendly? I was taking a quick look and the >>>>>>>>> documentation and it looks like they're not. >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>> >>>>>>>>>> Hi Stefano, >>>>>>>>>> >>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>>>> >>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs, >>>>>>>>>> and you can try different advices in the file storage definition in the >>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>> >>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>> >>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? 
>>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hello. >>>>>>>>>>> >>>>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>>>> for over a month and any help will be appreciated. >>>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>>> >>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>>>> responding to CLI, killed it. >>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died >>>>>>>>>>> signal=9 >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>> complete >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>> Started >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>>>> Child starts >>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said >>>>>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>> >>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>> before a restart: >>>>>>>>>>> >>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>> >>>>>>>>>>> Environment: >>>>>>>>>>> >>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>> packagecloud.io >>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>> Running inside a Docker container >>>>>>>>>>> NFILES=131072 >>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>> >>>>>>>>>>> Additional info: >>>>>>>>>>> >>>>>>>>>>> - I need to cache a large number of objets and the cache should >>>>>>>>>>> last for almost a week, 
so I have set up a 450G storage space, I don't know >>>>>>>>>>> if this is a problem; >>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>>> anything to do with it; >>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>> syslog; >>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Stefano Baldo >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> _______________________________________________ >>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>> -misc >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>>> >>>> >>> >> > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From reza at varnish-software.com Wed Jun 28 14:48:36 2017 From: reza at varnish-software.com (Reza Naghibi) Date: Wed, 28 Jun 2017 10:48:36 -0400 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Transient is using memory. One other option I forgot to mention: - Move transient allocations to file: -s Transient=file,/tmp,1G -- Reza Naghibi Varnish Software On Wed, Jun 28, 2017 at 9:39 AM, Stefano Baldo wrote: > Hi. > > root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes > SMA.Transient.g_bytes 519022 . Bytes > outstanding > SMF.s0.g_bytes 23662845952 . Bytes > outstanding > > You mean g_bytes from SMA.Transient? I have set no malloc storage. 
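To put the memory numbers discussed in this exchange in context, a rough back-of-the-envelope check suggests why a ~3.7 GiB machine can run out of memory even with "-s file": object bodies live on disk, but per-object metadata stays in RAM. The per-object overhead range below is an assumption for illustration only, not a measured Varnish figure; the object count and MemTotal are the values quoted in this thread:

```python
# Back-of-envelope check: with "-s file" the object *bodies* are mmap'ed
# on disk, but per-object metadata (objhead/objcore/hash entries) stays
# in RAM. The per-object overhead below is an assumed range for
# illustration, not a measured Varnish number.

N_OBJECTS = 3_908_216          # MAIN.n_object reported just before a restart
OVERHEAD_LOW = 300             # assumed bytes of in-RAM metadata per object (low)
OVERHEAD_HIGH = 1024           # assumed bytes of in-RAM metadata per object (high)
RAM_BYTES = 3_865_572 * 1024   # MemTotal from the thread (kB -> bytes)

def gib(n):
    return n / 2**30

low = N_OBJECTS * OVERHEAD_LOW
high = N_OBJECTS * OVERHEAD_HIGH

print(f"metadata: {gib(low):.2f}-{gib(high):.2f} GiB of {gib(RAM_BYTES):.2f} GiB RAM")
# -> metadata: 1.09-3.73 GiB of 3.69 GiB RAM
```

Under the high-end assumption, metadata alone is comparable to the machine's entire RAM, which is consistent with the out-of-memory behaviour reported in this thread.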
> > > On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: >> Hi, >> can you look at "varnishstat -1 | grep g_bytes" and see if it matches >> the memory you are seeing? >> -- >> Guillaume Quintard >> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo >> wrote: >>> Hi Guillaume. >>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted >>> anyway, which seems to indicate that the thread is really stalled. >>> This was 1 minute after the last restart: >>> MAIN.n_object 3908216 . object structs made >>> SMF.s0.g_alloc 7794510 . Allocations outstanding >>> I've just changed the I/O Scheduler to noop to see what happens. >>> One interesting thing I've found is about the memory usage. >>> In the 1st minute of use: >>> MemTotal: 3865572 kB >>> MemFree: 120768 kB >>> MemAvailable: 2300268 kB >>> 1 minute before a restart: >>> MemTotal: 3865572 kB >>> MemFree: 82480 kB >>> MemAvailable: 68316 kB >>> It seems like the system is possibly running out of memory. >>> When calling varnishd, I'm specifying only "-s file,..." as storage. I >>> see in some examples that it is common to use "-s file" AND "-s malloc" >>> together. Should I be passing "-s malloc" as well to somehow try to limit >>> the memory usage by varnishd? >>> Best, >>> Stefano >>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>>> Sadly, nothing suspicious here, you can still try: >>>> - bumping the cli_timeout >>>> - changing your disk scheduler >>>> - changing the advice option of the file storage >>>> I'm still convinced this is due to Varnish getting stuck waiting for >>>> the disk because of the file storage fragmentation. >>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>>> objects. Ideally, we would have a 1:1 relation between objects and >>>> allocations. 
If that number drops prior to a restart, that would be a good >>>> clue. >>>> >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo >>> > wrote: >>>> >>>>> Hi Guillaume. >>>>> >>>>> It keeps restarting. >>>>> Would you mind taking a quick look in the following VCL file to check >>>>> if you find anything suspicious? >>>>> >>>>> Thank you very much. >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> vcl 4.0; >>>>> >>>>> import std; >>>>> >>>>> backend default { >>>>> .host = "sites-web-server-lb"; >>>>> .port = "80"; >>>>> } >>>>> >>>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>>> >>>>> sub vcl_recv { >>>>> call bad_bot_detection; >>>>> >>>>> if (req.url == "/nocache" || req.url == "/version") { >>>>> return(pass); >>>>> } >>>>> >>>>> unset req.http.Cookie; >>>>> if (req.method == "PURGE") { >>>>> ban("obj.http.x-host == " + req.http.host + " && >>>>> obj.http.x-user-agent !~ Googlebot"); >>>>> return(synth(750)); >>>>> } >>>>> >>>>> set req.url = regsuball(req.url, "(?>>>> } >>>>> >>>>> sub vcl_synth { >>>>> if (resp.status == 750) { >>>>> set resp.status = 200; >>>>> synthetic("PURGED => " + req.url); >>>>> return(deliver); >>>>> } elsif (resp.status == 501) { >>>>> set resp.status = 200; >>>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >>>>> return(deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_backend_response { >>>>> unset beresp.http.Set-Cookie; >>>>> set beresp.http.x-host = bereq.http.host; >>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>> >>>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>>> || bereq.url == "/api/events/PAGEVIEW" >>>>> || bereq.url ~ "^\/assets\/img\/") { >>>>> set beresp.http.Cache-Control = "max-age=0"; >>>>> } else { >>>>> unset beresp.http.Cache-Control; >>>>> } >>>>> >>>>> if (beresp.status == 200 || >>>>> beresp.status == 301 || >>>>> beresp.status == 302 || >>>>> beresp.status == 
404) { >>>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>>> set beresp.http.X-TTL = "1d"; >>>>> set beresp.ttl = 1d; >>>>> } else { >>>>> set beresp.http.X-TTL = "1w"; >>>>> set beresp.ttl = 1w; >>>>> } >>>>> } >>>>> >>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>>> { >>>>> set beresp.do_gzip = true; >>>>> } >>>>> } >>>>> >>>>> sub vcl_pipe { >>>>> set bereq.http.connection = "close"; >>>>> return (pipe); >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> unset resp.http.x-host; >>>>> unset resp.http.x-user-agent; >>>>> } >>>>> >>>>> sub vcl_backend_error { >>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >>>>> 504) { >>>>> set beresp.status = 200; >>>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>>> return (deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_hash { >>>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>>> hash_data("Google Page Speed"); >>>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>>> hash_data("Googlebot"); >>>>> } >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> if (resp.status == 501) { >>>>> return (synth(resp.status)); >>>>> } >>>>> if (obj.hits > 0) { >>>>> set resp.http.X-Cache = "hit"; >>>>> } else { >>>>> set resp.http.X-Cache = "miss"; >>>>> } >>>>> } >>>>> >>>>> >>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Nice! It may have been the cause, time will tell.can you report back >>>>>> in a few days to let us know? >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" >>>>>> wrote: >>>>>> >>>>>>> Hi Guillaume. >>>>>>> >>>>>>> I think things will start to going better now after changing the >>>>>>> bans. >>>>>>> This is how my last varnishstat looked like moments before a crash >>>>>>> regarding the bans: >>>>>>> >>>>>>> MAIN.bans 41336 . Count of bans >>>>>>> MAIN.bans_completed 37967 . 
Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 0 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 41335 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>>> >>>>>>> And this is how it looks like now: >>>>>>> >>>>>>> MAIN.bans 2 . Count of bans >>>>>>> MAIN.bans_completed 1 . Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 2 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 0 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>>> >>>>>>> Before the changes, bans were never deleted! >>>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>>>> couple of seconds. >>>>>>> >>>>>>> May this was the cause of the problem? It seems like varnish was >>>>>>> having a large number of bans to manage and test against. >>>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>>> gone! :-) >>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>>> guillaume at varnish-software.com> wrote: >>>>>>> >>>>>>>> Looking good! >>>>>>>> >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Guillaume, >>>>>>>>> >>>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>>> >>>>>>>>> sub vcl_backend_response { >>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_recv { >>>>>>>>> if (req.method == "PURGE") { >>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>>> return(synth(750)); >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_deliver { >>>>>>>>> unset resp.http.x-url; >>>>>>>>> unset resp.http.x-user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>> >>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>>> >>>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi Guillaume. >>>>>>>>>> >>>>>>>>>> Thanks for answering. >>>>>>>>>> >>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase >>>>>>>>>> performance but it stills restarting. >>>>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>>>> signal of overhead. >>>>>>>>>> >>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its >>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using >>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>>>> was a problem here. 
After a couple of hours varnish died and I received a >>>>>>>>>> "no space left on device" message - deleting /var/lib/varnish solved >>>>>>>>>> the problem and varnish was up again, but it's weird because there was free >>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>>>> /var/lib/varnish size. >>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>>>>> lurker friendly. Well, I don't think so. My bans are created this way: >>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>>> Are they lurker friendly? I was taking a quick look at the >>>>>>>>>> documentation and it looks like they're not. >>>>>>>>>> Best, >>>>>>>>>> Stefano >>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>>> Hi Stefano, >>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>>> solution to help the disks cope is to overprovision them if they're SSDs, >>>>>>>>>>> and you can try different advice settings in the file storage definition in the >>>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? 
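For readers skimming the thread: the fix the discussion converges on is to make bans reference only obj.* expressions, so the ban lurker can evaluate them in the background against cached objects. A minimal sketch of that pattern follows; the helper header names mirror the ones used elsewhere in this thread, and the snippet is illustrative rather than a drop-in for this exact setup:

```vcl
# Lurker-friendly bans: stash request properties on the object at fetch
# time, then ban against obj.* only (req.* bans cannot be processed by
# the background ban lurker and pile up until matched on a request).

sub vcl_backend_response {
    set beresp.http.x-url = bereq.http.host + bereq.url;
    set beresp.http.x-user-agent = bereq.http.user-agent;
}

sub vcl_recv {
    if (req.method == "PURGE") {
        ban("obj.http.x-url == " + req.http.host + req.url +
            " && obj.http.x-user-agent !~ Googlebot");
        return (synth(750));
    }
}

sub vcl_deliver {
    # keep the helper headers out of client responses
    unset resp.http.x-url;
    unset resp.http.x-user-agent;
}
```

With only obj.* in the expression, completed bans are deleted shortly after the lurker walks the cache, instead of accumulating by the tens of thousands as reported above.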
>>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Guillaume Quintard >>>>>>>>>>> >>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello. >>>>>>>>>>>> >>>>>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>>>>> for over a month and any help will be appreciated. >>>>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>>>> >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>>>>> responding to CLI, killed it. >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>> died signal=9 >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>>> complete >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> Started >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said Child starts >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>>> >>>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>>> before a restart: >>>>>>>>>>>> >>>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>>> >>>>>>>>>>>> Environment: >>>>>>>>>>>> >>>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>>> packagecloud.io >>>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>>> Running inside a Docker container >>>>>>>>>>>> NFILES=131072 >>>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>>> >>>>>>>>>>>> Additional info: >>>>>>>>>>>> >>>>>>>>>>>> - I need to cache a large number of objets and the 
cache should >>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know >>>>>>>>>>>> if this is a problem; >>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>>>> anything to do with it; >>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>> syslog; >>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>> -misc >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>> >>>> >>> >> > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From reza at varnish-software.com Wed Jun 28 16:20:33 2017 From: reza at varnish-software.com (Reza Naghibi) Date: Wed, 28 Jun 2017 12:20:33 -0400 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: That means its unlimited. Those numbers are from the Varnish perspective, so they don't account for how jemalloc manages those allocations. -- Reza Naghibi Varnish Software On Wed, Jun 28, 2017 at 9:47 AM, Stefano Baldo wrote: > SMA.Transient.g_alloc 3518 . Allocations > outstanding > SMA.Transient.g_bytes 546390 . Bytes > outstanding > SMA.Transient.g_space 0 . Bytes > available > > g_space is always 0. 
Could it mean anything? > > On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: >> Yeah, I was wondering about Transient, but it seems under control. >> Apart from moving away from file storage, I have nothing at the moment :-/ >> -- >> Guillaume Quintard >> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo >> wrote: >>> Hi. >>> root at 2c6c325b279f:/# varnishstat -1 | grep g_bytes >>> SMA.Transient.g_bytes 519022 . Bytes >>> outstanding >>> SMF.s0.g_bytes 23662845952 . Bytes >>> outstanding >>> You mean g_bytes from SMA.Transient? I have set no malloc storage. >>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>>> Hi, >>>> can you look at "varnishstat -1 | grep g_bytes" and see if it matches >>>> the memory you are seeing? >>>> -- >>>> Guillaume Quintard >>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo >>>> wrote: >>>>> Hi Guillaume. >>>>> I increased the cli_timeout yesterday to 900sec (15min) and it >>>>> restarted anyway, which seems to indicate that the thread is really stalled. >>>>> This was 1 minute after the last restart: >>>>> MAIN.n_object 3908216 . object structs made >>>>> SMF.s0.g_alloc 7794510 . Allocations >>>>> outstanding >>>>> I've just changed the I/O Scheduler to noop to see what happens. >>>>> One interesting thing I've found is about the memory usage. >>>>> In the 1st minute of use: >>>>> MemTotal: 3865572 kB >>>>> MemFree: 120768 kB >>>>> MemAvailable: 2300268 kB >>>>> 1 minute before a restart: >>>>> MemTotal: 3865572 kB >>>>> MemFree: 82480 kB >>>>> MemAvailable: 68316 kB >>>>> It seems like the system is possibly running out of memory. >>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I >>>>> see in some examples that it is common to use "-s file" AND "-s malloc" >>>>> together. 
Should I be passing "-s malloc" as well to somehow try to limit >>>>> the memory usage by varnishd? >>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> >>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Sadly, nothing suspicious here, you can still try: >>>>>> - bumping the cli_timeout >>>>>> - changing your disk scheduler >>>>>> - changing the advice option of the file storage >>>>>> >>>>>> I'm still convinced this is due to Varnish getting stuck waiting for >>>>>> the disk because of the file storage fragmentation. >>>>>> >>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>>>>> objects. Ideally, we would have a 1:1 relation between objects and >>>>>> allocations. If that number drops prior to a restart, that would be a good >>>>>> clue. >>>>>> >>>>>> >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo < >>>>>> stefanobaldo at gmail.com> wrote: >>>>>> >>>>>>> Hi Guillaume. >>>>>>> >>>>>>> It keeps restarting. >>>>>>> Would you mind taking a quick look in the following VCL file to >>>>>>> check if you find anything suspicious? >>>>>>> >>>>>>> Thank you very much. 
>>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> vcl 4.0; >>>>>>> >>>>>>> import std; >>>>>>> >>>>>>> backend default { >>>>>>> .host = "sites-web-server-lb"; >>>>>>> .port = "80"; >>>>>>> } >>>>>>> >>>>>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>>>>> >>>>>>> sub vcl_recv { >>>>>>> call bad_bot_detection; >>>>>>> >>>>>>> if (req.url == "/nocache" || req.url == "/version") { >>>>>>> return(pass); >>>>>>> } >>>>>>> >>>>>>> unset req.http.Cookie; >>>>>>> if (req.method == "PURGE") { >>>>>>> ban("obj.http.x-host == " + req.http.host + " && >>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>> return(synth(750)); >>>>>>> } >>>>>>> >>>>>>> set req.url = regsuball(req.url, "(?>>>>>> } >>>>>>> >>>>>>> sub vcl_synth { >>>>>>> if (resp.status == 750) { >>>>>>> set resp.status = 200; >>>>>>> synthetic("PURGED => " + req.url); >>>>>>> return(deliver); >>>>>>> } elsif (resp.status == 501) { >>>>>>> set resp.status = 200; >>>>>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht >>>>>>> ml")); >>>>>>> return(deliver); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_backend_response { >>>>>>> unset beresp.http.Set-Cookie; >>>>>>> set beresp.http.x-host = bereq.http.host; >>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>> >>>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>>>>> || bereq.url == "/api/events/PAGEVIEW" >>>>>>> || bereq.url ~ "^\/assets\/img\/") { >>>>>>> set beresp.http.Cache-Control = "max-age=0"; >>>>>>> } else { >>>>>>> unset beresp.http.Cache-Control; >>>>>>> } >>>>>>> >>>>>>> if (beresp.status == 200 || >>>>>>> beresp.status == 301 || >>>>>>> beresp.status == 302 || >>>>>>> beresp.status == 404) { >>>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>>>>> set beresp.http.X-TTL = "1d"; >>>>>>> set beresp.ttl = 1d; >>>>>>> } else { >>>>>>> set beresp.http.X-TTL = "1w"; >>>>>>> set beresp.ttl = 1w; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> if (bereq.url !~ 
"\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>>>>> { >>>>>>> set beresp.do_gzip = true; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_pipe { >>>>>>> set bereq.http.connection = "close"; >>>>>>> return (pipe); >>>>>>> } >>>>>>> >>>>>>> sub vcl_deliver { >>>>>>> unset resp.http.x-host; >>>>>>> unset resp.http.x-user-agent; >>>>>>> } >>>>>>> >>>>>>> sub vcl_backend_error { >>>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status >>>>>>> == 504) { >>>>>>> set beresp.status = 200; >>>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>>>>> return (deliver); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_hash { >>>>>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>>>>> hash_data("Google Page Speed"); >>>>>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>>>>> hash_data("Googlebot"); >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> sub vcl_deliver { >>>>>>> if (resp.status == 501) { >>>>>>> return (synth(resp.status)); >>>>>>> } >>>>>>> if (obj.hits > 0) { >>>>>>> set resp.http.X-Cache = "hit"; >>>>>>> } else { >>>>>>> set resp.http.X-Cache = "miss"; >>>>>>> } >>>>>>> } >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>>>>> guillaume at varnish-software.com> wrote: >>>>>>> >>>>>>>> Nice! It may have been the cause, time will tell.can you report >>>>>>>> back in a few days to let us know? >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi Guillaume. >>>>>>>>> >>>>>>>>> I think things will start to going better now after changing the >>>>>>>>> bans. >>>>>>>>> This is how my last varnishstat looked like moments before a crash >>>>>>>>> regarding the bans: >>>>>>>>> >>>>>>>>> MAIN.bans 41336 . Count of bans >>>>>>>>> MAIN.bans_completed 37967 . Number of bans >>>>>>>>> marked 'completed' >>>>>>>>> MAIN.bans_obj 0 . 
Number of bans >>>>>>>>> using obj.* >>>>>>>>> MAIN.bans_req 41335 . Number of bans >>>>>>>>> using req.* >>>>>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>>>>> >>>>>>>>> And this is how it looks like now: >>>>>>>>> >>>>>>>>> MAIN.bans 2 . Count of bans >>>>>>>>> MAIN.bans_completed 1 . Number of bans >>>>>>>>> marked 'completed' >>>>>>>>> MAIN.bans_obj 2 . Number of bans >>>>>>>>> using obj.* >>>>>>>>> MAIN.bans_req 0 . Number of bans >>>>>>>>> using req.* >>>>>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>>>>> >>>>>>>>> Before the changes, bans were never deleted! >>>>>>>>> Now the bans are added and quickly deleted after a minute or even >>>>>>>>> a couple of seconds. >>>>>>>>> >>>>>>>>> May this was the cause of the problem? It seems like varnish was >>>>>>>>> having a large number of bans to manage and test against. >>>>>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>>>>> gone! :-) >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>> >>>>>>>>>> Looking good! >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>> >>>>>>>>>>> Hi Guillaume, >>>>>>>>>>> >>>>>>>>>>> Can the following be considered "ban lurker friendly"? 
>>>>>>>>>>> >>>>>>>>>>> sub vcl_backend_response { >>>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> sub vcl_recv { >>>>>>>>>>> if (req.method == "PURGE") { >>>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>>>>> return(synth(750)); >>>>>>>>>>> } >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> sub vcl_deliver { >>>>>>>>>>> unset resp.http.x-url; >>>>>>>>>>> unset resp.http.x-user-agent; >>>>>>>>>>> } >>>>>>>>>>> >>>>>>>>>>> Best, >>>>>>>>>>> Stefano >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in >>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>>>>> >>>>>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Guillaume Quintard >>>>>>>>>>>> >>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>>>>> wrote: >>>>>>>>>>>> >>>>>>>>>>>> Hi Guillaume. >>>>>>>>>>>> >>>>>>>>>>>> Thanks for answering. >>>>>>>>>>>> >>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to >>>>>>>>>>>> increase performance but it stills restarting. >>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is >>>>>>>>>>>> no signal of overhead. >>>>>>>>>>>> >>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its >>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using >>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. >>>>>>>>>>>> There was a problem here. 
After a couple of hours varnish died and I >>>>>>>>>>>> received a "no space left on device" message - deleting >>>>>>>>>>>> /var/lib/varnish solved the problem and varnish was up again, but it's >>>>>>>>>>>> weird because there was free memory on the host to be used with the tmpfs >>>>>>>>>>>> directory, so I don't know what could have happened. I will try to stop >>>>>>>>>>>> increasing the /var/lib/varnish size. >>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans >>>>>>>>>>>> are lurker friendly. Well, I don't think so. My bans are created this way: >>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>>>>> Are they lurker friendly? I was taking a quick look at the >>>>>>>>>>>> documentation and it looks like they're not. >>>>>>>>>>>> Best, >>>>>>>>>>>> Stefano >>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>>>>> Hi Stefano, >>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish >>>>>>>>>>>>> gets stuck trying to push/pull data and can't make time to reply to the >>>>>>>>>>>>> CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to >>>>>>>>>>>>> confirm. >>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>>>>> solution to help the disks cope is to overprovision them if they're SSDs, >>>>>>>>>>>>> and you can try different advice settings in the file storage definition in the >>>>>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too. >>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? 
>>>>>>>>>>>>> >>>>>>>>>>>>> -- >>>>>>>>>>>>> Guillaume Quintard >>>>>>>>>>>>> >>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>>>>> >>>>>>>>>>>>>> Hello. >>>>>>>>>>>>>> >>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in >>>>>>>>>>>>>> production for over a month and any help will be appreciated. >>>>>>>>>>>>>> The problem is that Varnish child process is recurrently >>>>>>>>>>>>>> being restarted after 10~20h of use, with the following message: >>>>>>>>>>>>>> >>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>>>> not responding to CLI, killed it. >>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected >>>>>>>>>>>>>> reply from ping: 400 CLI communication error >>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>>>> died signal=9 >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>>>>> complete >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>>> Started >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>>> said Child starts >>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>>>>> >>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>>>>> before a restart: >>>>>>>>>>>>>> >>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>>>>> >>>>>>>>>>>>>> Environment: >>>>>>>>>>>>>> >>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>>>>> packagecloud.io >>>>>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>>>>> Running inside a Docker container >>>>>>>>>>>>>> NFILES=131072 >>>>>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>>>>> >>>>>>>>>>>>>> 
Additional info: >>>>>>>>>>>>>> >>>>>>>>>>>>>> - I need to cache a large number of objets and the cache >>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I >>>>>>>>>>>>>> don't know if this is a problem; >>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system >>>>>>>>>>>>>> just before the last crash. I really don't know if this is too much or may >>>>>>>>>>>>>> have anything to do with it; >>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>>>> syslog; >>>>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>>>> everything is okay and requests are being responded very fast. >>>>>>>>>>>>>> >>>>>>>>>>>>>> Best, >>>>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>>>> >>>>>>>>>>>>>> >>>>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>>>> -misc >>>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>> >>>>>> >>>>> >>>> >>> >> > > _______________________________________________ > varnish-misc mailing list > varnish-misc at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stefanobaldo at gmail.com Thu Jun 29 17:09:32 2017 From: stefanobaldo at gmail.com (Stefano Baldo) Date: Thu, 29 Jun 2017 14:09:32 -0300 Subject: Child process recurrently being restarted In-Reply-To: References: Message-ID: Hi Guillaume and Reza. This time varnish restarted but it left some more info on syslog. It seems like the system is running out of memory. 
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.297487] pool_herder invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.300992] pool_herder cpuset=/ mems_allowed=0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.303157] CPU: 1 PID: 16214 Comm: pool_herder Tainted: G C O 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] 0000000000000000 ffffffff815123b5 ffff8800eb3652f0 0000000000000000 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff8150ff8d 0000000000000000 ffffffff810d6e3f 0000000000000000 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff81516d2e 0000000000000200 ffffffff810689d3 ffffffff810c43e4 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Call Trace: Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? dump_stack+0x5d/0x78 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? dump_header+0x76/0x1e8 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? smp_call_function_single+0x5f/0xa0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? mutex_lock+0xe/0x2a Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? put_online_cpus+0x23/0x80 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? rcu_oom_notify+0xc4/0xe0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? do_try_to_free_pages+0x4ac/0x520 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? oom_kill_process+0x21d/0x370 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? find_lock_task_mm+0x3d/0x90 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? out_of_memory+0x473/0x4b0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? __alloc_pages_nodemask+0x9ef/0xb50 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? copy_process.part.25+0x116/0x1c50 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? 
__do_page_fault+0x1d1/0x4f0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? do_fork+0xe0/0x3d0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? stub_clone+0x69/0x90 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [] ? system_call_fast_compare_end+0x10/0x15 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.367638] Mem-Info: Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.368962] Node 0 DMA per-cpu: Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.370768] CPU 0: hi: 0, btch: 1 usd: 0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.373249] CPU 1: hi: 0, btch: 1 usd: 0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.375652] Node 0 DMA32 per-cpu: Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.377508] CPU 0: hi: 186, btch: 31 usd: 29 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.379898] CPU 1: hi: 186, btch: 31 usd: 0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_anon:846474 inactive_anon:1913 isolated_anon:0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_file:408 inactive_file:415 isolated_file:32 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] unevictable:20736 dirty:27 writeback:0 unstable:0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free:16797 slab_reclaimable:15276 slab_unreclaimable:10521 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] mapped:22002 shmem:22935 pagetables:30362 bounce:0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free_cma:0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.397242] Node 0 DMA free:15192kB min:184kB low:228kB high:276kB active_anon:416kB inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:20kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:20kB dirty:0kB writeback:0kB mapped:20kB shmem:80kB slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:112kB pagetables:20kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? 
yes Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.416338] lowmem_reserve[]: 0 3757 3757 3757 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.419030] Node 0 DMA32 free:50120kB min:44868kB low:56084kB high:67300kB active_anon:3386780kB inactive_anon:7592kB active_file:1732kB inactive_file:2060kB unevictable:82924kB isolated(anon):0kB isolated(file):128kB present:3915776kB managed:3849676kB mlocked:82924kB dirty:108kB writeback:0kB mapped:88432kB shmem:91660kB slab_reclaimable:61072kB slab_unreclaimable:42184kB kernel_stack:27248kB pagetables:121428kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.440095] lowmem_reserve[]: 0 0 0 0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.442202] Node 0 DMA: 22*4kB (UEM) 6*8kB (EM) 1*16kB (E) 2*32kB (UM) 2*64kB (UE) 2*128kB (EM) 3*256kB (UEM) 1*512kB (E) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15192kB Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.451936] Node 0 DMA32: 4031*4kB (EM) 2729*8kB (EM) 324*16kB (EM) 1*32kB (R) 1*64kB (R) 0*128kB 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 46820kB Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.460240] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.464122] 24240 total pagecache pages Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.466048] 0 pages in swap cache Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.467672] Swap cache stats: add 0, delete 0, find 0/0 Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.470159] Free swap = 0kB Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.471513] Total swap = 0kB Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.472980] 982941 pages RAM Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.474380] 0 pages HighMem/MovableOnly Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.476190] 16525 pages reserved Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.477772] 0 pages hwpoisoned Jun 29 13:11:01 ip-172-25-2-8 
kernel: [93823.479189] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.482698] [ 163] 0 163 10419 1295 21 0 0 systemd-journal Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.486646] [ 165] 0 165 10202 136 21 0 -1000 systemd-udevd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.490598] [ 294] 0 294 6351 1729 14 0 0 dhclient Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.494457] [ 319] 0 319 6869 62 18 0 0 cron Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.498260] [ 321] 0 321 4964 67 14 0 0 systemd-logind Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.502346] [ 326] 105 326 10558 101 25 0 -900 dbus-daemon Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.506315] [ 342] 0 342 65721 228 31 0 0 rsyslogd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.510222] [ 343] 0 343 88199 2108 61 0 -500 dockerd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.514022] [ 350] 106 350 18280 181 36 0 0 zabbix_agentd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.518040] [ 351] 106 351 18280 475 36 0 0 zabbix_agentd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.522041] [ 352] 106 352 18280 187 36 0 0 zabbix_agentd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.526025] [ 353] 106 353 18280 187 36 0 0 zabbix_agentd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.530067] [ 354] 106 354 18280 187 36 0 0 zabbix_agentd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.534033] [ 355] 106 355 18280 190 36 0 0 zabbix_agentd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.538001] [ 358] 0 358 66390 1826 32 0 0 fail2ban-server Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.541972] [ 400] 0 400 35984 444 24 0 -500 docker-containe Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.545879] [ 568] 0 568 13796 168 30 0 -1000 sshd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.549733] [ 576] 0 576 3604 41 12 0 0 agetty Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.553569] [ 577] 0 577 3559 38 12 0 0 agetty Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.557322] [16201] 0 16201 29695 20707 60 
0 0 varnishd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.561103] [16209] 108 16209 118909802 822425 29398 0 0 cache-main Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.565002] [27352] 0 27352 20131 214 42 0 0 sshd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.568682] [27354] 1000 27354 20165 211 41 0 0 sshd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.572307] [27355] 1000 27355 5487 146 17 0 0 bash Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.575920] [27360] 0 27360 11211 107 26 0 0 sudo Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.579593] [27361] 0 27361 11584 97 27 0 0 su Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.583155] [27362] 0 27362 5481 142 15 0 0 bash Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.586782] [27749] 0 27749 20131 214 41 0 0 sshd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.590428] [27751] 1000 27751 20164 211 39 0 0 sshd Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.593979] [27752] 1000 27752 5487 147 15 0 0 bash Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.597488] [28762] 0 28762 26528 132 17 0 0 varnishstat Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.601239] [28764] 0 28764 11211 106 26 0 0 sudo Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.604737] [28765] 0 28765 11584 97 26 0 0 su Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.608602] [28766] 0 28766 5481 141 15 0 0 bash Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.612288] [28768] 0 28768 26528 220 18 0 0 varnishstat Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.616189] Out of memory: Kill process 16209 (cache-main) score 880 or sacrifice child Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.620106] Killed process 16209 (cache-main) total-vm:475639208kB, anon-rss:3289700kB, file-rss:0kB Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (16209) died signal=9 Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child cleanup complete Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) Started Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said Child starts Jun 29 13:11:01 
ip-172-25-2-8 varnishd[16201]: Child (30313) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 Best, Stefano On Wed, Jun 28, 2017 at 11:33 AM, Reza Naghibi wrote: > Assuming the problem is running out of memory, you will need to do some > memory tuning, especially given the number of threads you are using and > your access patterns. Your options: > > - Add more memory to the system > - Reduce thread_pool_max > - Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10") > - Use some of the tuning params in here: https://info.varnish- > software.com/blog/understanding-varnish-cache-memory-usage > > > > -- > Reza Naghibi > Varnish Software > > On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard < > guillaume at varnish-software.com> wrote: > >> Hi, >> >> can you look at "varnishstat -1 | grep g_bytes" and see if it matches >> the memory you are seeing? >> >> -- >> Guillaume Quintard >> >> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo >> wrote: >> >>> Hi Guillaume. >>> >>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted >>> anyway, which seems to indicate that the thread is really stalled. >>> >>> This was 1 minute after the last restart: >>> >>> MAIN.n_object 3908216 . object structs made >>> SMF.s0.g_alloc 7794510 . Allocations outstanding >>> >>> I've just changed the I/O Scheduler to noop to see what happens. >>> >>> One interesting thing I've found is about the memory usage. >>> >>> In the 1st minute of use: >>> MemTotal: 3865572 kB >>> MemFree: 120768 kB >>> MemAvailable: 2300268 kB >>> >>> 1 minute before a restart: >>> MemTotal: 3865572 kB >>> MemFree: 82480 kB >>> MemAvailable: 68316 kB >>> >>> It seems like the system is possibly running out of memory. >>> >>> When calling varnishd, I'm specifying only "-s file,..." as storage. I >>> see in some examples that it is common to use "-s file" AND "-s malloc" >>> together. Should I be passing "-s malloc" as well to somehow try to limit >>> the memory usage by varnishd?
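A back-of-the-envelope check against the counters quoted above: even with file storage, each cached object keeps its bookkeeping structures (objcore, objhead, hash entry) on the heap. Assuming roughly 1 KiB of overhead per object, which is an approximation in the spirit of the Varnish Software memory-usage article linked above (the real figure varies by version and workload), the ~3.9 million objects alone would outgrow this box's RAM:

```python
# Rough per-object RAM accounting for the numbers quoted in this thread.
n_object = 3_908_216          # MAIN.n_object one minute after a restart
overhead_per_obj_kib = 1      # assumed average metadata cost per object (approximation)
mem_total_kib = 3_865_572     # MemTotal from /proc/meminfo

est_overhead_gib = n_object * overhead_per_obj_kib / 1024 / 1024
mem_total_gib = mem_total_kib / 1024 / 1024

# Estimated bookkeeping alone (~3.73 GiB) already exceeds total RAM (~3.69 GiB).
print(round(est_overhead_gib, 2), round(mem_total_gib, 2))
```

If this estimate is even roughly right, the object bodies living in the 450G mmap are not the problem; the sheer object count is, which is consistent with the oom-killer taking out cache-main.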
>>> >>> Best, >>> Stefano >>> >>> >>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard < >>> guillaume at varnish-software.com> wrote: >>> >>>> Sadly, nothing suspicious here, you can still try: >>>> - bumping the cli_timeout >>>> - changing your disk scheduler >>>> - changing the advice option of the file storage >>>> >>>> I'm still convinced this is due to Varnish getting stuck waiting for >>>> the disk because of the file storage fragmentation. >>>> >>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of >>>> objects. Ideally, we would have a 1:1 relation between objects and >>>> allocations. If that number drops prior to a restart, that would be a good >>>> clue. >>>> >>>> >>>> -- >>>> Guillaume Quintard >>>> >>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo >>> > wrote: >>>> >>>>> Hi Guillaume. >>>>> >>>>> It keeps restarting. >>>>> Would you mind taking a quick look in the following VCL file to check >>>>> if you find anything suspicious? >>>>> >>>>> Thank you very much. 
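Guillaume's 1:1 heuristic can be checked directly against the two counters Stefano quoted (MAIN.n_object and SMF.s0.g_alloc); a small sketch:

```python
# Fragmentation clue: ideally one storage allocation per cached object.
# Numbers are the ones reported earlier in this thread.
n_object = 3_908_216   # MAIN.n_object (object structs made)
g_alloc = 7_794_510    # SMF.s0.g_alloc (allocations outstanding)

ratio = g_alloc / n_object
print(round(ratio, 2))
```

A ratio of roughly 2 means each object averages about two outstanding extents in the file storage; whether that indicates harmful fragmentation depends on object sizes, but per the advice above it is the trend of this ratio before a restart that would be the telling clue.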
>>>>> >>>>> Best, >>>>> Stefano >>>>> >>>>> vcl 4.0; >>>>> >>>>> import std; >>>>> >>>>> backend default { >>>>> .host = "sites-web-server-lb"; >>>>> .port = "80"; >>>>> } >>>>> >>>>> include "/etc/varnish/bad_bot_detection.vcl"; >>>>> >>>>> sub vcl_recv { >>>>> call bad_bot_detection; >>>>> >>>>> if (req.url == "/nocache" || req.url == "/version") { >>>>> return(pass); >>>>> } >>>>> >>>>> unset req.http.Cookie; >>>>> if (req.method == "PURGE") { >>>>> ban("obj.http.x-host == " + req.http.host + " && >>>>> obj.http.x-user-agent !~ Googlebot"); >>>>> return(synth(750)); >>>>> } >>>>> >>>>> set req.url = regsuball(req.url, "(?>>>> } >>>>> >>>>> sub vcl_synth { >>>>> if (resp.status == 750) { >>>>> set resp.status = 200; >>>>> synthetic("PURGED => " + req.url); >>>>> return(deliver); >>>>> } elsif (resp.status == 501) { >>>>> set resp.status = 200; >>>>> set resp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html")); >>>>> return(deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_backend_response { >>>>> unset beresp.http.Set-Cookie; >>>>> set beresp.http.x-host = bereq.http.host; >>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>> >>>>> if (bereq.url == "/themes/basic/assets/theme.min.css" >>>>> || bereq.url == "/api/events/PAGEVIEW" >>>>> || bereq.url ~ "^\/assets\/img\/") { >>>>> set beresp.http.Cache-Control = "max-age=0"; >>>>> } else { >>>>> unset beresp.http.Cache-Control; >>>>> } >>>>> >>>>> if (beresp.status == 200 || >>>>> beresp.status == 301 || >>>>> beresp.status == 302 || >>>>> beresp.status == 404) { >>>>> if (bereq.url ~ "\&ordenar=aleatorio$") { >>>>> set beresp.http.X-TTL = "1d"; >>>>> set beresp.ttl = 1d; >>>>> } else { >>>>> set beresp.http.X-TTL = "1w"; >>>>> set beresp.ttl = 1w; >>>>> } >>>>> } >>>>> >>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") >>>>> { >>>>> set beresp.do_gzip = true; >>>>> } >>>>> } >>>>> >>>>> sub vcl_pipe { >>>>> set 
bereq.http.connection = "close"; >>>>> return (pipe); >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> unset resp.http.x-host; >>>>> unset resp.http.x-user-agent; >>>>> } >>>>> >>>>> sub vcl_backend_error { >>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status == >>>>> 504) { >>>>> set beresp.status = 200; >>>>> set beresp.http.Content-Type = "text/html; charset=utf-8"; >>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html")); >>>>> return (deliver); >>>>> } >>>>> } >>>>> >>>>> sub vcl_hash { >>>>> if (req.http.User-Agent ~ "Google Page Speed") { >>>>> hash_data("Google Page Speed"); >>>>> } elsif (req.http.User-Agent ~ "Googlebot") { >>>>> hash_data("Googlebot"); >>>>> } >>>>> } >>>>> >>>>> sub vcl_deliver { >>>>> if (resp.status == 501) { >>>>> return (synth(resp.status)); >>>>> } >>>>> if (obj.hits > 0) { >>>>> set resp.http.X-Cache = "hit"; >>>>> } else { >>>>> set resp.http.X-Cache = "miss"; >>>>> } >>>>> } >>>>> >>>>> >>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard < >>>>> guillaume at varnish-software.com> wrote: >>>>> >>>>>> Nice! It may have been the cause, time will tell.can you report back >>>>>> in a few days to let us know? >>>>>> -- >>>>>> Guillaume Quintard >>>>>> >>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" >>>>>> wrote: >>>>>> >>>>>>> Hi Guillaume. >>>>>>> >>>>>>> I think things will start to going better now after changing the >>>>>>> bans. >>>>>>> This is how my last varnishstat looked like moments before a crash >>>>>>> regarding the bans: >>>>>>> >>>>>>> MAIN.bans 41336 . Count of bans >>>>>>> MAIN.bans_completed 37967 . Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 0 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 41335 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 41336 0.68 Bans added >>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted >>>>>>> >>>>>>> And this is how it looks like now: >>>>>>> >>>>>>> MAIN.bans 2 . Count of bans >>>>>>> MAIN.bans_completed 1 . 
Number of bans >>>>>>> marked 'completed' >>>>>>> MAIN.bans_obj 2 . Number of bans >>>>>>> using obj.* >>>>>>> MAIN.bans_req 0 . Number of bans >>>>>>> using req.* >>>>>>> MAIN.bans_added 2016 0.69 Bans added >>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted >>>>>>> >>>>>>> Before the changes, bans were never deleted! >>>>>>> Now the bans are added and quickly deleted after a minute or even a >>>>>>> couple of seconds. >>>>>>> >>>>>>> Could this have been the cause of the problem? It seems like varnish >>>>>>> had a large number of bans to manage and test against. >>>>>>> I will let it ride now. Let's see if the problem persists or it's >>>>>>> gone! :-) >>>>>>> >>>>>>> Best, >>>>>>> Stefano >>>>>>> >>>>>>> >>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard < >>>>>>> guillaume at varnish-software.com> wrote: >>>>>>> >>>>>>>> Looking good! >>>>>>>> >>>>>>>> -- >>>>>>>> Guillaume Quintard >>>>>>>> >>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo < >>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>> >>>>>>>>> Hi Guillaume, >>>>>>>>> >>>>>>>>> Can the following be considered "ban lurker friendly"? >>>>>>>>> >>>>>>>>> sub vcl_backend_response { >>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url; >>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_recv { >>>>>>>>> if (req.method == "PURGE") { >>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " && >>>>>>>>> obj.http.x-user-agent !~ Googlebot"); >>>>>>>>> return(synth(750)); >>>>>>>>> } >>>>>>>>> } >>>>>>>>> >>>>>>>>> sub vcl_deliver { >>>>>>>>> unset resp.http.x-url; >>>>>>>>> unset resp.http.x-user-agent; >>>>>>>>> } >>>>>>>>> >>>>>>>>> Best, >>>>>>>>> Stefano >>>>>>>>> >>>>>>>>> >>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard < >>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>> >>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.* >>>>>>>>>> expressions.
Easiest way is to stash the host, user-agent and url in >>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver). >>>>>>>>>> >>>>>>>>>> I don't think you need to expand the VSL at all. >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Guillaume Quintard >>>>>>>>>> >>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" >>>>>>>>>> wrote: >>>>>>>>>> >>>>>>>>>> Hi Guillaume. >>>>>>>>>> >>>>>>>>>> Thanks for answering. >>>>>>>>>> >>>>>>>>>> I'm using an SSD disk. I've changed from ext4 to ext2 to increase >>>>>>>>>> performance but it still keeps restarting. >>>>>>>>>> Also, I checked the I/O performance for the disk and there is no >>>>>>>>>> sign of overload. >>>>>>>>>> >>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its >>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using >>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There >>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a >>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved >>>>>>>>>> the problem and varnish was up again, but it's weird because there was free >>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know >>>>>>>>>> what could have happened. I will try to stop increasing the >>>>>>>>>> /var/lib/varnish size. >>>>>>>>>> >>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are >>>>>>>>>> lurker friendly. Well, I don't think so. My bans are created this way: >>>>>>>>>> >>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + >>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot"); >>>>>>>>>> >>>>>>>>>> Are they lurker friendly? I took a quick look at the >>>>>>>>>> documentation and it looks like they're not.
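The rule of thumb behind this exchange: the background ban lurker has no client request to evaluate against, so a ban that mentions any req.* field can only be tested lazily when an object is actually looked up, and such bans accumulate. A crude string check over the two ban shapes from this thread (a heuristic for illustration only, not a VCL parser; note it would also flag bereq.*, since "bereq." contains "req."):

```python
# A ban is lurker-friendly only if the lurker can test it without a
# request, i.e. it references obj.* fields exclusively.
def lurker_friendly(ban_expr: str) -> bool:
    return "req." not in ban_expr

# The original ban shape (req.*-based) and the reworked one (obj.*-based).
old_ban = 'req.http.host == x && req.url ~ y && req.http.User-Agent !~ Googlebot'
new_ban = 'obj.http.x-url == x && obj.http.x-user-agent !~ Googlebot'

print(lurker_friendly(old_ban), lurker_friendly(new_ban))
```

This matches the varnishstat evidence quoted later in the thread: with req.*-based bans, MAIN.bans_req grew and MAIN.bans_deleted stayed at zero; after switching to obj.http.* bans, bans are deleted within seconds.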
>>>>>>>>>> >>>>>>>>>> Best, >>>>>>>>>> Stefano >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard < >>>>>>>>>> guillaume at varnish-software.com> wrote: >>>>>>>>>>> Hi Stefano, >>>>>>>>>>> >>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets >>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd >>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm. >>>>>>>>>>> >>>>>>>>>>> After some time, the file storage is terrible on a hard drive >>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One >>>>>>>>>>> solution to help the disks cope is to overprovision them if they're SSDs, >>>>>>>>>>> and you can try different advice values in the file storage definition in the >>>>>>>>>>> command line (last parameter, after granularity). >>>>>>>>>>> >>>>>>>>>>> Is your /var/lib/varnish mounted on tmpfs? That could help too. >>>>>>>>>>> >>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly? >>>>>>>>>>> >>>>>>>>>>> -- >>>>>>>>>>> Guillaume Quintard >>>>>>>>>>> >>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo < >>>>>>>>>>> stefanobaldo at gmail.com> wrote: >>>>>>>>>>> >>>>>>>>>>>> Hello. >>>>>>>>>>>> >>>>>>>>>>>> I am having a critical problem with Varnish Cache in production >>>>>>>>>>>> for over a month and any help will be appreciated. >>>>>>>>>>>> The problem is that Varnish child process is recurrently being >>>>>>>>>>>> restarted after 10~20h of use, with the following message: >>>>>>>>>>>> >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not >>>>>>>>>>>> responding to CLI, killed it.
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply >>>>>>>>>>>> from ping: 400 CLI communication error >>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) >>>>>>>>>>>> died signal=9 >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup >>>>>>>>>>>> complete >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> Started >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said Child starts >>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) >>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800 >>>>>>>>>>>> >>>>>>>>>>>> The following link is the varnishstat output just 1 minute >>>>>>>>>>>> before a restart: >>>>>>>>>>>> >>>>>>>>>>>> https://pastebin.com/g0g5RVTs >>>>>>>>>>>> >>>>>>>>>>>> Environment: >>>>>>>>>>>> >>>>>>>>>>>> varnish-5.1.2 revision 6ece695 >>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0) >>>>>>>>>>>> Installed using pre-built package from official repo at >>>>>>>>>>>> packagecloud.io >>>>>>>>>>>> CPU 2x2.9 GHz >>>>>>>>>>>> Mem 3.69 GiB >>>>>>>>>>>> Running inside a Docker container >>>>>>>>>>>> NFILES=131072 >>>>>>>>>>>> MEMLOCK=82000 >>>>>>>>>>>> >>>>>>>>>>>> Additional info: >>>>>>>>>>>> >>>>>>>>>>>> - I need to cache a large number of objets and the cache should >>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know >>>>>>>>>>>> if this is a problem; >>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just >>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have >>>>>>>>>>>> anything to do with it; >>>>>>>>>>>> - No registered CPU spikes (almost always by 30%); >>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from >>>>>>>>>>>> syslog; >>>>>>>>>>>> - During all the time, event moments before the crashes, >>>>>>>>>>>> everything is okay and requests are being responded very fast. 
>>>>>>>>>>>> >>>>>>>>>>>> Best, >>>>>>>>>>>> Stefano Baldo >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> _______________________________________________ >>>>>>>>>>>> varnish-misc mailing list >>>>>>>>>>>> varnish-misc at varnish-cache.org >>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish >>>>>>>>>>>> -misc >>>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> >>>>>>>> >>>>>>> >>>>> >>>> >>> >> >> _______________________________________________ >> varnish-misc mailing list >> varnish-misc at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc >> > >