From phk at phk.freebsd.dk Mon Apr 4 07:54:42 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 04 Apr 2016 07:54:42 +0000 Subject: [patch] "monotonic wallclock time" In-Reply-To: <56FD5113.9000707@schokola.de> References: <56FD5113.9000707@schokola.de> Message-ID: <50511.1459756482@critter.freebsd.dk> -------- In message <56FD5113.9000707 at schokola.de>, Nils Goroll writes: Devon (@dho) has proposed switching to 64bit ints for time format and he claims to have numbers showing this being a good idea. Can you coordinate with him, so you don't stomp on each other's toes? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From slink at schokola.de Mon Apr 4 08:18:36 2016 From: slink at schokola.de (Nils Goroll) Date: Mon, 4 Apr 2016 10:18:36 +0200 Subject: [patch] "monotonic wallclock time" In-Reply-To: <50511.1459756482@critter.freebsd.dk> References: <56FD5113.9000707@schokola.de> <50511.1459756482@critter.freebsd.dk> Message-ID: <5702235C.90205@schokola.de> On 04/04/16 09:54, Poul-Henning Kamp wrote: > Devon (@dho) has proposed switching to 64bit ints for time format > and he claims to have numbers showing this being a good idea. Sure. Regarding his proposal, I am waiting for details. Other than that, changing the datatype and my proposal are independent. So my question at this point is: Does my proposal make any sense? 
Nils From phk at phk.freebsd.dk Mon Apr 4 08:20:40 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 04 Apr 2016 08:20:40 +0000 Subject: [patch] "monotonic wallclock time" In-Reply-To: <5702235C.90205@schokola.de> References: <56FD5113.9000707@schokola.de> <50511.1459756482@critter.freebsd.dk> <5702235C.90205@schokola.de> Message-ID: <50615.1459758040@critter.freebsd.dk> -------- In message <5702235C.90205 at schokola.de>, Nils Goroll writes: >On 04/04/16 09:54, Poul-Henning Kamp wrote: >> Deven (@dho) has proposed switching to 64bit ints for time format >> and he claims to have numbers showing this being a good idea. > >Sure. Regarding his proposal I am waiting for details. > >Other than that, changing the datatype and my proposal are independent. So my >question at this point is: Does my proposal make any sense? Since Devon has looked at this, he may have a more informed answer than me, so talk to him first. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From dridi at varni.sh Mon Apr 4 10:40:47 2016 From: dridi at varni.sh (Dridi Boukelmoune) Date: Mon, 4 Apr 2016 12:40:47 +0200 Subject: [PATCH] Retire VCL_EVENT_USE Message-ID: I need this out of the way to write a VIP on first-load and last-discard events :) Dridi -------------- next part -------------- A non-text attachment was scrubbed... Name: 0001-Retire-VCL_EVENT_USE.patch Type: text/x-patch Size: 4301 bytes Desc: not available URL: From phk at phk.freebsd.dk Mon Apr 4 10:49:38 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 04 Apr 2016 10:49:38 +0000 Subject: [PATCH] Retire VCL_EVENT_USE In-Reply-To: References: Message-ID: <51353.1459766978@critter.freebsd.dk> -------- In message , Dridi Boukelmoune writes: >I need this out of the way to write a VIP on first-load and >last-discard events :) Go for it. 
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From dridi at varni.sh Mon Apr 4 13:46:37 2016 From: dridi at varni.sh (Dridi Boukelmoune) Date: Mon, 4 Apr 2016 15:46:37 +0200 Subject: VTAILQ_EMPTY vs -Wparentheses-equality with clang Message-ID: Hi, As I said on IRC I have hit a bug while working on a VMOD and then on Varnish itself. I thought I had seen other false-positives but that's the only one, in several places. Basically it complains when you use it in an if or while statement or I suppose anything that expects a condition, because it interprets the outer set of parentheses as a hint not to warn about assignment. So we've gone full circle on this one! Do we use this kind of condition in Varnish? if ((var = expr)) If not I suggest we disable it in autogen.des, my workaround is to use gcc. I haven't tried clang above 3.7.0, but basically it is _not_ looking at preprocessed code, in which we obviously don't have the outer parentheses: if (VTAILQ_EMPTY(...)) make -k log attached. Current Travis CI continuous integration uses clang 3.4 FYI. Best Regards, Dridi -------------- next part -------------- A non-text attachment was scrubbed... Name: clang-warnings.log Type: text/x-log Size: 5922 bytes Desc: not available URL: From fgsch at lodoss.net Mon Apr 4 14:43:17 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Mon, 4 Apr 2016 15:43:17 +0100 Subject: VTAILQ_EMPTY vs -Wparentheses-equality with clang In-Reply-To: References: Message-ID: Weird. I'm using clang 3.7 and I'm not seeing any warnings. What OS is this? On Mon, Apr 4, 2016 at 2:46 PM, Dridi Boukelmoune wrote: > Hi, > > As I said on IRC I have hit a bug while working on a VMOD and then on > Varnish itself. I thought I had seen other false-positives but that's > the only one, in several places. 
> > Basically it complains when you use it in an if or while statement or > I suppose anything that expects a condition, because it interprets the > outer set of parentheses as a hint not to warn about assignment. So > we've gone full circle on this one! > > Do we use this kind of condition in Varnish? > > if ((var = expr)) > > If not I suggest we disable it in autogen.des, my workaround is to use gcc. > > I haven't tried clang above 3.7.0, but basically it is _not_ looking > at preprocessed code, in which we obviously don't have the outer > parentheses: > > if (VTAILQ_EMPTY(...)) > > make -k log attached. > > Current Travis CI continuous integration uses clang 3.4 FYI. > > Best Regards, > Dridi > > _______________________________________________ > varnish-dev mailing list > varnish-dev at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dridi at varni.sh Mon Apr 4 14:51:44 2016 From: dridi at varni.sh (Dridi Boukelmoune) Date: Mon, 4 Apr 2016 16:51:44 +0200 Subject: VTAILQ_EMPTY vs -Wparentheses-equality with clang In-Reply-To: References: Message-ID: On Mon, Apr 4, 2016 at 4:43 PM, Federico Schwindt wrote: > Weird. I'm using clang 3.7 and I'm not seeing any warnings. > > What OS is this? You know the OS is Fedora ;) I'm running f23, up to date: $ clang --version clang version 3.7.0 (tags/RELEASE_370/final) Target: x86_64-redhat-linux-gnu Thread model: posix Maybe it's fixed in 3.7.1? What's your clang version? Thanks From phk at phk.freebsd.dk Mon Apr 4 14:57:22 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 04 Apr 2016 14:57:22 +0000 Subject: VTAILQ_EMPTY vs -Wparentheses-equality with clang In-Reply-To: References: Message-ID: <52076.1459781842@critter.freebsd.dk> -------- In message , Federico Schwindt writes: >Weird. I'm using clang 3.7 and I'm not seeing any warnings. 
I haven't compared vqueue.h to FreeBSD's sys/queue.h, but since FreeBSD is on Clang 3.8 now, that would be a place to look... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From fgsch at lodoss.net Mon Apr 4 15:47:07 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Mon, 4 Apr 2016 16:47:07 +0100 Subject: VTAILQ_EMPTY vs -Wparentheses-equality with clang In-Reply-To: <52076.1459781842@critter.freebsd.dk> References: <52076.1459781842@critter.freebsd.dk> Message-ID: I think FreeBSD uses -Wno-parentheses-equality. Might be worth adding it (if supported). On Mon, Apr 4, 2016 at 3:57 PM, Poul-Henning Kamp wrote: > -------- > In message < > CAJV_h0bcW6MUNHbc8Qy4emHcrFfdvsGiPGSkoUSTQpXXrA64_g at mail.gmail.com> > , Federico Schwindt writes: > > >Weird. I'm using clang 3.7 and I'm not seeing any warnings. > > I haven't compared vqueue.h to FreeBSD's sys/queue.h, but since > FreeBSD is on Clang 3.8 now, that would be a place to look... > > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk at FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dho at fastly.com Mon Apr 4 17:34:10 2016 From: dho at fastly.com (Devon H. O'Dell) Date: Mon, 4 Apr 2016 10:34:10 -0700 Subject: [patch] "monotonic wallclock time" In-Reply-To: <50615.1459758040@critter.freebsd.dk> References: <56FD5113.9000707@schokola.de> <50511.1459756482@critter.freebsd.dk> <5702235C.90205@schokola.de> <50615.1459758040@critter.freebsd.dk> Message-ID: Hi! 
On Mon, Apr 4, 2016 at 1:20 AM, Poul-Henning Kamp wrote: > -------- > In message <5702235C.90205 at schokola.de>, Nils Goroll writes: >>On 04/04/16 09:54, Poul-Henning Kamp wrote: >>> Deven (@dho) has proposed switching to 64bit ints for time format >>> and he claims to have numbers showing this being a good idea. >> >>Sure. Regarding his proposal I am waiting for details. I'll send a separate email to the list regarding Varnish performance stuff that is immediately interesting to us. >>Other than that, changing the datatype and my proposal are independent. So my >>question at this point is: Does my proposal make any sense? > > Since Devon has looked at this, he may have a more informed answer than > me, so talk to him first. Looking at the patch, I believe this proposal is entirely independent of what I would propose and can be considered separately. I don't believe that the performance comment is necessarily useful -- monotonic time interfaces may be faster to access, but we don't need to manage a real and monotonic time component per-request, and it doesn't need to be double-precision FP (which kills some performance gains you get from using that interface anyway). But that doesn't really matter for the context of this patch, which I think is useful and solves a real problem, and changing it to behave differently is much more invasive. --dho > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk at FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. 
> > _______________________________________________ > varnish-dev mailing list > varnish-dev at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev From phk at phk.freebsd.dk Mon Apr 4 18:49:30 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 04 Apr 2016 18:49:30 +0000 Subject: VTAILQ_EMPTY vs -Wparentheses-equality with clang In-Reply-To: References: <52076.1459781842@critter.freebsd.dk> Message-ID: <74363.1459795770@critter.freebsd.dk> -------- In message , Federico Schwindt writes: >I think FreeBSD uses *-Wno-parentheses-equality. * > >Might be worth adding it (if supported). I've been pondering writing a program to automatically detect the optimal set of compiler flags for any given source code, but I have not been able to get myself to do so, because to follow style in that space I would have to write it in PL/1, Modula-3, SmallTalk or some other horrid and obsolete language. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From dho at fastly.com Mon Apr 4 19:21:39 2016 From: dho at fastly.com (Devon H. O'Dell) Date: Mon, 4 Apr 2016 12:21:39 -0700 Subject: Varnish performance musings Message-ID: Hi all, Probably best to bring this discussion out of IRC at this point. I've looked into Varnish performance (from a CPU consumption perspective) on our installations for nearly 3 years, and wanted to share some findings (as well as some thoughts on how to address these things). Unfortunately no TL;DR, but maybe it's good to separate these topics. It's going to be a bit before I'm going to have time to do a Varnish 4 test to get numbers -- our fork is based on 2.1.4 and things do not line up 1:1. 
Furthermore, getting noise-free numbers for just Varnish is difficult for a number of reasons, but effectively I cannot get a good sample of an individual process due in part to how perf performs and aggregates its samples, and in part to how many threads we run. But here are some of the things I've seen as problematic over the last few years, and ideas for fixing them. ## Header Scanning The number one CPU hog in Varnish right now outside of PCRE is header processing. Any header lookup or set is a linear scan, and any operation on a non-existent header has worst-case time. Any headers added with VCL become worst-case. Our plan for this turns header access / modification into O(1): a. All headers are accessed from offsetting inside a LUT. b. All headers in http_headers.h have a guaranteed slot in the LUT. c. VCL-accessed headers are managed at vcc-time and receive slots in the LUT at initialization-time. d. VMODs have an interface such that headers they access outside of headers defined in http_headers.h or VCL can be registered at initialization time. This also provides a means for accessing multiple same-named headers. We would introduce some VCL syntax to be able to access a specific header (e.g. beresp.http.Set-Cookie[2]), to get the number of occurrences of a header of a particular name. An interface would also exist to be able to apply some function to all headers (possibly also to all headers matching a specific name). The latter of these is something we have already -- we've added a "collection" type to VCL, as well as some functions to apply a function (called a "callback") to members of the collection. Callbacks operate in a context in which they are provided a key and value; they are not closures. This has workspace overhead that I'm not entirely happy with yet, so we have not made it a generally accessible thing yet. 
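To make the slot idea concrete, here is a minimal sketch of O(1) slot-based header access; all names, sizes, and the slot-assignment scheme are hypothetical, not Varnish's actual data structures:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: slots 0..HDR_N_KNOWN-1 are assigned at compile time to
 * the headers in http_headers.h; VCL- and VMOD-registered names get the
 * remaining slots at initialization time. */
enum { HDR_HOST, HDR_CONTENT_LENGTH, HDR_N_KNOWN, HDR_MAX = 64 };

struct hdr {
	const char	*name;
	const char	*value;		/* NULL while the header is absent */
};

struct http {
	struct hdr	tbl[HDR_MAX];
};

/* O(1): the slot was resolved before run time, so there is no scan and
 * no strcasecmp on the hot path. */
static const char *
hdr_get(const struct http *hp, unsigned slot)
{
	return (hp->tbl[slot].value);
}

static void
hdr_set(struct http *hp, unsigned slot, const char *name, const char *value)
{
	hp->tbl[slot].name = name;
	hp->tbl[slot].value = value;
}
```

With this layout, an operation on a non-existent header costs one array read instead of a worst-case walk over every header present.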
In the case of multiple headers with the same name, the storage format would still be a LUT, but the next member for a header would appear "chained" in some later part of the headers, any offset is defined as a 3-tuple of (p, q, next). When next is NULL, only one header of a particular instance appears. Since Varnish bounds the number of headers that can be handled in a request, this table doesn't have to be very large and can probably be bounded to (n_known_headers + max_headers) entries. ## Time Storage as double-precision FP and printf(3)-family Varnish uses double-precision FP for storing timestamps. The rationale for this is reasonable: a native type exists that can support fractional seconds. Arithmetic between two timestamps can be easily applied in code without relying on APIs that make said arithmetic difficult to read. This is a good argument for having times stored in a native format. Unfortunately, there are a few downsides: FP operations are typically slow, error-prone at a hardware / compiler level (https://github.com/varnishcache/varnish-cache/issues/1875 as a recent example), and stringifying floating point numbers correctly is really hard. I have just done a measurement of our production Varnish and TIM_foo functions no longer appear as significant CPU users. I believe this is because of a change I made a year or two ago in our fork that snapshots timestamps before calling into VCL. All VRT functions operate on the same timestamps, and therefore all VCL callbacks appear to occur at the same time. (This has numerous beneficial properties and only a few negative ones). Each VCL function gets its own snapshot. However, printf(3)-family functions are still super-heavy CPU consumers, accounting ~6.5% of total CPU time in our Varnish. A third of this time is spent in `__printf_fp`, which is the glibc function that handles representation of floating-point values. 
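One way to avoid `__printf_fp` entirely is to keep timestamps as integer nanoseconds and render the fractional part with integer arithmetic. A hedged sketch (`tim_format_ns` and the microsecond output precision are illustrative choices, not Varnish code):

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NSEC_PER_SEC	UINT64_C(1000000000)

/* Format ns-since-epoch as "seconds.microseconds" without touching any
 * floating point; returns the character count, as snprintf(3) does. */
static int
tim_format_ns(char *buf, size_t len, uint64_t t_ns)
{
	return (snprintf(buf, len, "%" PRIu64 ".%06" PRIu64,
	    t_ns / NSEC_PER_SEC, (t_ns % NSEC_PER_SEC) / 1000));
}
```

The format string still gets parsed, but the `%f` conversion machinery never runs.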
The *only* thing Varnish really uses FP for is doubles; it's logical to assume without doing a full audit that something like 20% of printf(3)-family time is spent converting double-precision numbers to strings and the majority of the remaining time is format string parsing. From this perspective, it is still worth it to analyze the performance of VTIM-family functions to get an idea of their overhead: 1. TIM_real in our tree showed up in top functions of a synthetic, all-hit workload. Annotating the function shows where samples saw most of the time spent. In this sample, we can see that nearly 2/3rds of the time is spent in setting up the call stack and 1/3 of the time is doing FP-ops. │ 000000000000aec0 <TIM_real>: 36.20 │ sub $0x18,%rsp 26.70 │ xor %edi,%edi │ mov %rsp,%rsi ... 1.81 │40: cvtsi2 0x8(%rsp),%xmm1 10.86 │ mulsd 0x8511(%rip),%xmm1 # 13420 <__func__.5739+0x13> 19.00 │ cvtsi2 (%rsp),%xmm0 2. Other time-related functions have FP-dominating components. TIM_format, for example, is dominated by nearly 2/3rds of its time on a single FP instruction (this is probably partially due to pipeline stalling). │ 000000000000ae40 <TIM_format>: 62.16 │ cvttsd %xmm0,%rax Inlining these functions would be beneficial in the TIM_real sense (sorry, I am still operating in V2 terminology, but I believe all of this still applies to V4), but moving away from double as a time storage format would be beneficial in general. This would be done by using 64-bit counters that represent the number of nanoseconds since the epoch. We will run out of those in something like 540 years, and I'm happy to make that someone else's problem :). a. It reduces a significant portion of overhead in VTIM-family functions b. It reduces a significant portion of overhead in printf c. It maintains composability of time arithmetic The major downside is that timestamp printing now needs additional work to print fractional time components. Finally, this gets a little into printf(3)-family inefficiencies as well. 
Because it parses format strings every time, we've optimized a number of places where we were using sprintf(3)-like interfaces to simply use string buffers. There is VSB of course, but we also use https://github.com/dhobsd/vstring (partially for FP stuff, partially for allowing static-backed buffers to upgrade to dynamic ones if necessary). The code overhead of string building is unfortunate, but at 6.5% overhead to use printf(3), this is a real win. (Some of the unlabeled blocks are things like _IO_default_xsputn, so the overhead of printf(3) here is likely still higher than 6.5%). See https://9vx.org/images/fg.png -- this was taken on a machine that is handling nearly 12k RPS on top of ~3-4k threads. By moving to integer times, conversion and printing would likely reduce the overhead of printf(3) by 20% without actually changing consumption of printf. I am unclear how this applies to Varnish 4, but I think relatively little is changed in this context between the versions. ## PCRE There are other things we've done (like optimizing regexes that are obviously prefix and suffix matches -- turns out lots of people write things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if (strncmp(req.http.x-foo, "bar", 3) == 0)` because it's easy), but I don't see those as being as high priority for upstream; they're largely issues for our multi-tenant use case. We have done this already; another thing we would like to do is to check regexes for things like backtracking and use DFA-based matching where possible. In the flame graph screenshot, the obvious VRT functions are PCRE. ## Expected Impact The expected impact of fixing these things is almost purely in latency. For this machine handling 12k RPS, that is the constant throughput bound, but we are bursting up to nearly 4k threads to serve the load. 
If header processing, PCRE, and printf were reduced to 50% of their current overhead, we'd expect to be able to handle the same load with something like 350 fewer threads, which is a real win for us. Note that even our 99%ile latency is largely covered by cache hits, so these effects would improve service for the vast majority of requests. Anyway, those are some thoughts. Looking forward to comments, though maybe there's a better venue for that than this ML? --dho From geoff at uplex.de Mon Apr 4 20:12:46 2016 From: geoff at uplex.de (Geoffrey Simmons) Date: Mon, 4 Apr 2016 22:12:46 +0200 Subject: Varnish performance musings In-Reply-To: References: Message-ID: <0B4BFCCA-C541-4EBD-AF0F-2499DA62F8DA@uplex.de> > On Apr 4, 2016, at 9:21 PM, Devon H. O'Dell wrote: > > ## PCRE > > There are other things we've done (like optimizing regexes that are > obviously prefix and suffix matches -- turns out lots of people write > things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if > (strncmp(req.http.x-foo, "bar" 3))` because it's easy), but I don't > see those as being as high priority for upstream; they're largely > issues for our multi-tenant use case. We have done this already; > another thing we would like to do is to check regexes for things like > backtracking and use DFA-based matching where possible. In the flame > graph screenshot, the obvious VRT functions are PCRE. You might be interested in this, although it's new as can be (just today tagged as v0.1) -- a VMOD to access Google's RE2 regular expression lib: https://code.uplex.de/uplex-varnish/libvmod-re2 For those not familiar with RE2: it limits the syntax so that patterns are regular languages in the strictly formal sense. Most notably, backrefs within a pattern are not allowed. That means that the matcher can run as DFAs/NFAs, there is never any backtracking, and the time requirement for matches is always linear in the length of the string to be matched. 
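To illustrate where the linear-time guarantee comes from: a DFA consumes exactly one table transition per input byte. A toy matcher for the single fixed pattern `^ab*c$` (purely illustrative; RE2 builds such automata for arbitrary patterns, this is not its implementation):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy DFA for ^ab*c$: one transition per input byte, so matching is
 * O(n) in the subject length, with no backtracking ever. */
static bool
match_abc(const char *s)
{
	int state = 0;		/* 0: start, 1: saw a(b*), 2: accept, -1: dead */

	for (; *s != '\0'; s++) {
		switch (state) {
		case 0:
			state = (*s == 'a') ? 1 : -1;
			break;
		case 1:
			if (*s == 'b')
				state = 1;
			else if (*s == 'c')
				state = 2;
			else
				state = -1;
			break;
		default:
			state = -1;	/* anything after 'c' rejects */
			break;
		}
		if (state < 0)
			return (false);
	}
	return (state == 2);
}
```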
So far this is just a proof of concept, and I haven't done any performance testing. From the documentation, I suspect that there are certain kinds of use cases for Varnish where RE2 would perform better than PCRE, and many cases where it doesn't make much difference (such as the prefix or suffix matches you mentioned). But that's all speculation until it's been tested. Best, Geoff From dho at fastly.com Mon Apr 4 21:30:49 2016 From: dho at fastly.com (Devon H. O'Dell) Date: Mon, 4 Apr 2016 14:30:49 -0700 Subject: Varnish performance musings In-Reply-To: <0B4BFCCA-C541-4EBD-AF0F-2499DA62F8DA@uplex.de> References: <0B4BFCCA-C541-4EBD-AF0F-2499DA62F8DA@uplex.de> Message-ID: On Mon, Apr 4, 2016 at 1:12 PM, Geoffrey Simmons wrote: >> On Apr 4, 2016, at 9:21 PM, Devon H. O'Dell wrote: >> >> ## PCRE >> >> There are other things we've done (like optimizing regexes that are >> obviously prefix and suffix matches -- turns out lots of people write >> things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if >> (strncmp(req.http.x-foo, "bar" 3))` because it's easy), but I don't >> see those as being as high priority for upstream; they're largely >> issues for our multi-tenant use case. We have done this already; >> another thing we would like to do is to check regexes for things like >> backtracking and use DFA-based matching where possible. In the flame >> graph screenshot, the obvious VRT functions are PCRE. > > You might be interested in this, although it's new as can be (just today tagged as v0.1) -- a VMOD to access Google's RE2 regular expression lib: > > https://code.uplex.de/uplex-varnish/libvmod-re2 > > For those not familiar with RE2: it limits the syntax so that patterns are regular languages in the strictly formal sense. Most notably, backrefs within a pattern are not allowed. 
That means that the matcher can run as DFAs/NFAs, there is never any backtracking, and the time requirement for matches is always linear in the length of the string to be matched. Thanks for pointing this out! We have also considered RE2, but one of the problems is determining whether a particular regular expression is actually regular. One way would be to first try to compile with RE2, then fall back to PCRE if that didn't work. But there are also other problems -- we expose regular expression groups into VCL through `re.group.N` where `N` is some number in [0, 9] (we do not currently support named captures, and I don't have any plans to implement that) -- and providing cross-API consistency there is also problematic. PCRE does allow one to execute regexps in a DFA context using `pcre_dfa_exec`, but this also changes the semantics of matching as well due to what PCRE calls "auto-possessification". So we've continued to punt on PCRE improvements in our Varnish until we have time to think about all the semantic changes. > So far this is just a proof of concept, and I haven't done any performance testing. From the documentation, I suspect that there are certain kinds of use cases for Varnish where RE2 would perform better than PCRE, and many cases where it doesn't make much difference (such as the prefix or suffix matches you mentioned). But that's all speculation until it's been tested. The prefix and suffix cases are really interesting. Several people here were surprised at the limitations of regex optimizations performed by the engines. There are some optimizations that can be performed, but we don't currently run pcre_study either, and I'm not sure the range of things it actually improves. Most of the performance problems I see with regular expressions are more to do with the regexes themselves either being poorly written, or just the wrong tool for the job. 
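The prefix-match rewrite mentioned above amounts to something like the following (a hedged sketch; `has_prefix` is a hypothetical helper, and newline edge cases of `.` in the original regex are ignored):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Equivalent of `subject ~ "^bar.*$"`: the regex only constrains a
 * literal prefix, so a length-bounded compare does the same job with
 * no regex engine involved at all. */
static bool
has_prefix(const char *subject, const char *prefix)
{
	return (strncmp(subject, prefix, strlen(prefix)) == 0);
}
```

The `strlen(prefix)` bound would of course be computed once at VCC time rather than per call.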
Anyway, I believe that RE2 will outperform PCRE for a great many things; I haven't benchmarked it against pcre_dfa_exec (though I've seen some benchmarks in the context of pcre-jit, which is another thing I haven't tried at all), but my primary concern about just adopting RE2 is about the classes of regexes that simply don't work in RE2 context. We do have a non-trivial number of customers who rely on being able to use effectively non-regular expressions. --dho > > Best, > Geoff > > From slink at schokola.de Mon Apr 4 21:37:30 2016 From: slink at schokola.de (Nils Goroll) Date: Mon, 4 Apr 2016 23:37:30 +0200 Subject: Varnish performance musings In-Reply-To: References: Message-ID: <5702DE9A.7040103@schokola.de> Hi Devon, thank you very much for the interesting writeup - despite the fact that I have so much unfinished Varnish work on my list already, I'd like to dump some thoughts in the hope that others may pick up on them: > All VRT functions operate on the same timestamps (for each VCL callback) This sounds perfectly reasonable, I think we should just do this. > 1. TIM_real in our tree showed up in top functions of a synthetic, > all-hit workload. I once optimized a CLOCK_REALTIME bound app by caching real time and offsetting it with the TSC as long as the thread didn't migrate. This turned a syscall into a handful of instructions for the fast path, but there's a portability question, and (unless one has constant_tsc) the inaccuracy due to speed stepping. > 64-bit counters that represent the number of nanoseconds since the > epoch. We will run out of those in something like 540 years, and I'm > happy to make that someone else's problem :). Besides the fact that this should be "(far) enough (in the future) for everyone", I'm not even convinced that we need nanos in varnish. Couldn't we shave off some 3-8 bits or even use only micros? 
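For reference, the headroom under discussion is easy to quantify: an unsigned 64-bit nanosecond counter spans roughly 584 years from the epoch, so it wraps around the year 2554. The arithmetic below is only a sanity check, not proposed code:

```c
#include <assert.h>
#include <stdint.h>

/* Whole 365.25-day years representable in a uint64_t nanosecond
 * counter: UINT64_MAX ns is about 584 years. */
static unsigned
u64_ns_range_years(void)
{
	uint64_t secs = UINT64_MAX / UINT64_C(1000000000);
	uint64_t secs_per_year = UINT64_C(86400) * 36525 / 100;

	return ((unsigned)(secs / secs_per_year));
}
```

Dropping to microseconds would multiply that range by 1000, but at the cost of the sub-microsecond resolution discussed later in the thread.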
> ## PCRE > > There are other things we've done (like optimizing regexes that are > obviously prefix and suffix matches -- turns out lots of people write > things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if > (strncmp(req.http.x-foo, "bar" 3))` because it's easy), but I don't > see those as being as high priority for upstream; they're largely > issues for our multi-tenant use case. Years ago I learned about the fastly "Host header switch" problem and actually it has an interesting generalization: As we compile VCC we could as well compile pattern matchers also. I do this in https://code.uplex.de/uplex-varnish/dcs_classifier - and the results are pretty impressive. I have no experience with it, but there's also re2c, which takes a generic approach to the problem. VCC could generate optimal matcher code for common yet simple expressions. Adding a glob pattern type could be an option. Ideally, I'd like the "Host header switch problem" solved by having a VCC statement which would compile something like this... select req.http.Host { "www.foo.com": { call vcl_recv_foo; break; } "www.bar.*": { call vcl_recv_bar; break; } "*": { call ... } } ...into classifier tree code. Nils From slink at schokola.de Mon Apr 4 21:46:08 2016 From: slink at schokola.de (Nils Goroll) Date: Mon, 4 Apr 2016 23:46:08 +0200 Subject: VIP6 / Re: backend PROXY support - thought dump In-Reply-To: <56F5555D.6050406@schokola.de> References: <56F5555D.6050406@schokola.de> Message-ID: <5702E0A0.3060101@schokola.de> Hi, On 25/03/16 16:12, Nils Goroll wrote: > I now think that we should really support configuration for > PROXY mode by means of a backend protocol property: Whether a socket we > connect to supports PROXY (or even CLUSTER as suggested above) is purely a > property of that socket and we should avoid the need to duplicate this > information in VCL. 
Quoting VIP6 https://github.com/varnishcache/varnish-cache/wiki/VIP6:-What-does-pipe-mean-in-Varnish5%3F We add a std.send_proxy_header() which can be called from vcl_pipe{} only. I think whether or not a PROXY header gets added should be based on an attribute of the chosen backend, anything else results in a blowup of VCL code. Nils From dho at fastly.com Mon Apr 4 22:30:54 2016 From: dho at fastly.com (Devon H. O'Dell) Date: Mon, 4 Apr 2016 15:30:54 -0700 Subject: Varnish performance musings In-Reply-To: <5702DE9A.7040103@schokola.de> References: <5702DE9A.7040103@schokola.de> Message-ID: On Mon, Apr 4, 2016 at 2:37 PM, Nils Goroll wrote: > Hi Devon, > > thank you very much for the interesting writeup - despite the fact that I have > so much unfinished Varnish work on my list already, I'd like to dump some > thoughts in the hope that others may pick up on them: Thanks for taking the time to reply! >> All VRT functions operate on the same timestamps (for each VCL callback) > > This sounds perfectly reasonable, I think we should just do this. I plan on sending a patch, but I need to learn more about VRT_CTX to do it. As I mentioned previously, I'm still working on effectively a 2.1.4 codebase, and I'm not ultra-familiar with the architectural changes, especially in the VCL/VRT area. If someone else wants to do this quickly, that'd be fine by me. But I'd also be happy to learn a bit more about current architecture and do it myself, if people are happy to wait a little bit. >> 1. TIM_real in our tree showed up in top functions of a synthetic, >> all-hit workload. > > I once optimized a CLOCK_REALTIME bound app by caching real time and offsetting > it with the TSC as long as the thread didn't migrate. This turned a syscall into > a handful instructions for the fast path, but there's a portability question, > and (unless one has constant_tsc) the inaccuracy due to speed stepping. 
On x86, the TSC runs at a fixed speed regardless of C-state and P-state, so while clock_gettime(3) might require constant_tsc to work as expected, static inline uint64_t rdtscp(void) { uint32_t eax, edx; __asm__ __volatile__("rdtscp" : "=a" (eax), "=d" (edx) :: "%ecx", "memory"); return (((uint64_t)edx << 32) | eax); } will always provide a value increasing at a constant rate. (This may not be true in a VM, and of course this doesn't help with portability, and doesn't help on machines where you need to serialize rdtsc with cpuid because no rdtscp instruction is present. None of that is true for us. One thing that is true is that C/P-state switches that change clock speed can skew the result of the measurement, but there are ways to solve this as well, sometimes just by ignoring it entirely since the numbers are usually only good for eyeballing anyway.) When we need performance measurements, we're usually using this `rdtscp` wrapper; see for example https://github.com/fastly/librip for implementation and https://9vx.org/post/ripping-rings/#ripping-rings:3dc2715b7d0d6dc78b3ff9173cd0b415 for write-up. We also use this wrapper to plot the time it takes to hold / wait on locks, which we report, e.g.: dho at cachenode:~$ varnishstat -1 | egrep lock.*expire lock_cycles_held_expire 16595254263058 71858036.60 Cycles exp_mtx was held lock_cycles_wait_expire 1821824709085 7888565.28 Cycles to acquire exp_mtx These numbers get plotted along with other counters in our monitoring services. I've found the resulting graphs extremely useful in debugging production scenarios with livelock (both hold/wait time spike tremendously) and deadlock (hold/wait time fall to 0) at a glance. (The librip thing mentioned above is useful for finding out exactly where deadlock happens, though that can be done semi-trivially with defined lock hierarchies.) Hand-rolled TSC opens a ton of possibilities when used judiciously. >> 64-bit counters that represent the number of nanoseconds since the >> epoch. 
We will run out of those in something like 540 years, and I'm >> happy to make that someone else's problem :). > > Besides the fact that this should be "(far) enough (in the future) for > everyone", I'm not even convinced that we need nanos in varnish. Couldn't we > shave off some 3-8 bits or even use only micros? Shaving off bits doesn't seem hugely useful to me without some other information to pack in. A `double` is already 64 bits per the C standard, so on C implementations with `uint64_t`, we may as well use all the bits. There are some cases where we have timestamps that may be <1µs apart depending on various error paths. Consider connect(2) returning EAGAIN as a straw man. This failure mode might cause the back-end side of things to error out in <1µs, and logging only microseconds isn't hugely helpful in that case -- if the duration is 0, does that mean I didn't run the code path, or did it just execute faster than could be measured with microseconds? This lack of information would have caused us headaches in debugging various failure modes in the state machine over the years. >> ## PCRE >> >> There are other things we've done (like optimizing regexes that are >> obviously prefix and suffix matches -- turns out lots of people write >> things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if >> (strncmp(req.http.x-foo, "bar", 3))` because it's easy), but I don't >> see those as being as high priority for upstream; they're largely >> issues for our multi-tenant use case. > Years ago I learned about the fastly "Host header switch" problem and actually > it has an interesting generalization: As we compile VCC we could as well compile > pattern matchers also. I do this in > https://code.uplex.de/uplex-varnish/dcs_classifier - and the results are pretty > impressive. The icache thrash from the domain lookup function is actually pretty bad, and it's hard to maintain.
We have an implementation of a minimal perfect hash table that we were considering as a candidate, but it doesn't work well for prefix or wildcard matching. > I have no experience with it, but there's also re2c, which takes a generic > approach to the problem. > > VCC could generate optimal matcher code for common yet simple expressions. > Adding a glob pattern type could be an option. > > Ideally, I'd like the "Host header switch problem" solved by having a VCC > statement which would compile something like this... > > select req.http.Host { > "www.foo.com": { > call vcl_recv_foo; > break; > } > "www.bar.*": { > call vcl_recv_bar; > break; > } > "*": { > call ... > } > } > > ...into classifier tree code. Our "table" extension to VCL would effectively allow this, except it does not support wildcards (we have a version that does, but it requires N-1 lookups where N is the bottom level of the domain). I have also done some work with qp tries (http://dotat.at/prog/qp/README.html), and suspect that a prefix-matching variant can be constructed. This is effectively how we do matching on our surrogate keys, but using critbit. (I am looking forward to replacing that with QP hopefully soon -- but then, I also have a large backlog of Stuff To Do -- if only there were 48 hours in a day!) Currently all of our customer VCLs are compiled into separate DSOs, so we do not have a single VCL that we indirect into based on host/ip/service. I believe there's also a strong argument to be made for the benefit of process isolation in multi-tenant environments, and I'd like to encourage a solution to this problem that moves more towards routing than centralization. 
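To illustrate the prefix-match rewrite discussed in the thread (the `^bar.*$` to strncmp transform), a VCC-time check might look roughly like this; the helper names are made up and this is not the actual Fastly or uplex code:

```c
#define _POSIX_C_SOURCE 200809L	/* for strndup(3) */
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: detect patterns of the form ^literal, ^literal.*
 * or ^literal.*$ at VCC time, so the match can be emitted as a bounded
 * compare instead of a call into PCRE. */

/* Return the literal prefix (caller frees) if `pat` is such a pattern,
 * else NULL, meaning: fall back to the regex engine. */
static char *
vcc_regex_prefix(const char *pat)
{
	size_t l;

	if (pat[0] != '^')
		return (NULL);
	pat++;
	/* Scan the literal part, stopping at any regex metacharacter. */
	for (l = 0; pat[l] != '\0'; l++)
		if (strchr(".*$[]()|+?\\{", pat[l]) != NULL)
			break;
	/* Only a trailing ".*" or ".*$" (or nothing) may follow it. */
	if (pat[l] != '\0' &&
	    strcmp(pat + l, ".*") != 0 && strcmp(pat + l, ".*$") != 0)
		return (NULL);
	return (strndup(pat, l));
}

/* At run time the match then degenerates to: */
static int
prefix_match(const char *subject, const char *prefix)
{
	return (strncmp(subject, prefix, strlen(prefix)) == 0);
}
```

So vcc_regex_prefix("^bar.*$") yields "bar" and the runtime test is a bounded memory compare, while anything with a metacharacter in the middle (say "^b[ao]r") returns NULL and keeps going through PCRE.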
--dho > > Nils From phk at phk.freebsd.dk Mon Apr 4 22:40:23 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 04 Apr 2016 22:40:23 +0000 Subject: VIP6 / Re: backend PROXY support - thought dump In-Reply-To: <5702E0A0.3060101@schokola.de> References: <56F5555D.6050406@schokola.de> <5702E0A0.3060101@schokola.de> Message-ID: <65053.1459809623@critter.freebsd.dk> -------- In message <5702E0A0.3060101 at schokola.de>, Nils Goroll writes: >Quoting VIP6 >https://github.com/varnishcache/varnish-cache/wiki/VIP6:-What-does-pipe-mean-in-Varnish5%3F > > We add a std.send_proxy_header() which can be called from vcl_pipe{} > only. > >I think whether or not a PROXY header gets added should be based on an attribute >of the chosen backend, anything else results in a blowup of VCL code. I came to the same conclusion. Then I tried to make it a backend attribute, which was easy, but then I realized that we do not have a hold of the client's struct sess in the backend fetch code, so we have no way to get the addresses we need to put into the PROXY header in the first place. Since we'll need to refcount struct sess for H2 anyway, that's not fatal, it's just not as trivial as I thought it would be. There is another detail though. If we've sent a PROXY header to the backend, that connection cannot be reused by anybody but the very same struct sess. Ideally we'd hang the connection on the struct sess, so we can reap them when it goes away, but that is not at all trivial (What if the VCL changed before the next req ?) The alternative is to mark them 'magic' in the backend connection pool, and only recycle them if it's the same struct sess. That leaves us with a garbage collection issue. Finally there is the really crude solution: Just never recycle backend connections on which we sent a PROXY header. I'm leaning heavily on the last one, until somebody shows me something that simply cannot be done with that restriction.
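The crude variant is at least simple to sketch; hypothetical names, nothing like the actual pool code:

```c
#include <stdbool.h>

/* Sketch of "never recycle backend connections on which we sent a
 * PROXY header": flag the connection when the header goes out, and
 * consult the flag at recycling time. */

struct vbc {
	int	fd;
	bool	proxy_sent;	/* a PROXY header went out on this fd */
};

/* Hypothetical: called from the backend fetch code once it has the
 * client addresses; the flag poisons the connection for reuse. */
static void
vbc_send_proxy(struct vbc *vc)
{
	/* ... write the PROXY v1/v2 header on vc->fd here ... */
	vc->proxy_sent = true;
}

/* Recycling decision: PROXY'd connections are simply closed instead
 * of going back into the pool, trading connection reuse for zero
 * garbage-collection bookkeeping. */
static bool
vbc_may_recycle(const struct vbc *vc)
{
	return (!vc->proxy_sent);
}
```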
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From phk at phk.freebsd.dk Wed Apr 6 08:27:21 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Wed, 06 Apr 2016 08:27:21 +0000 Subject: Varnish5 builtin VCL without pipe Message-ID: <59499.1459931241@critter.freebsd.dk> Can we make builtin::vcl_recv{} look like this in Varnish5 ? sub vcl_recv { if (req.method == "PRI" || /* HTTP/2.0 */ req.method == "CONNECT" || req.method == "OPTIONS" || req.method == "TRACE") { return (synth(405)); } if (req.method != "GET" && req.method != "HEAD") { /* We only deal with GET and HEAD by default */ return (pass); } if (req.http.Authorization || req.http.Cookie) { /* Not cacheable by default */ return (pass); } return (hash); } -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From fgsch at lodoss.net Wed Apr 6 12:34:03 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Wed, 6 Apr 2016 13:34:03 +0100 Subject: Varnish5 builtin VCL without pipe In-Reply-To: <59499.1459931241@critter.freebsd.dk> References: <59499.1459931241@critter.freebsd.dk> Message-ID: For the record, added my comment to the VIP at https://github.com/varnishcache/varnish-cache/wiki/VIP8:-No-pipe-in-builtin.vcl-in-V5 . On Wed, Apr 6, 2016 at 9:27 AM, Poul-Henning Kamp wrote: > Can we make builtin::vcl_recv{} look like this in Varnish5 ? 
> > sub vcl_recv {
>     if (req.method == "PRI" ||    /* HTTP/2.0 */
>       req.method == "CONNECT" ||
>       req.method == "OPTIONS" ||
>       req.method == "TRACE") {
>         return (synth(405));
>     }
>     if (req.method != "GET" && req.method != "HEAD") {
>         /* We only deal with GET and HEAD by default */
>         return (pass);
>     }
>     if (req.http.Authorization || req.http.Cookie) {
>         /* Not cacheable by default */
>         return (pass);
>     }
>     return (hash);
> }
> -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk at FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. > > _______________________________________________ > varnish-dev mailing list > varnish-dev at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From fgsch at lodoss.net Wed Apr 6 12:46:56 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Wed, 6 Apr 2016 13:46:56 +0100 Subject: Varnish performance musings In-Reply-To: References: Message-ID: Hi, I'm all for the LUT idea and I think it makes even more sense when you consider H/2. Having the same timestamp during a particular method also makes sense. I believe we only need to change the now calls to use ctx->now instead of getting the time on each call but haven't looked into detail. Being a bit smarter wrt some regular expressions might also be a good idea, especially for cases like the one you mentioned. FWIW, we have re-enabled JIT in recent Varnish versions for PCRE >= 8.32. Perhaps it's worth checking if this provides any benefit before we consider optimizing the custom patterns. On Mon, Apr 4, 2016 at 8:21 PM, Devon H. O'Dell wrote: > Hi all, > > Probably best to bring this discussion out of IRC at this point.
I've > looked into Varnish performance (from CPU consumption perspective) on > our installations for nearly 3 years, and wanted to share some > findings (as well as some thoughts on how to address these things). > Unfortunately no TL;DR, but maybe it's good to separate these topics. > > It's going to be a bit before I'm going to have time to do a Varnish 4 > test to get numbers -- our fork is based on 2.1.4 and things do not > line up 1:1. Furthermore, getting noise-free numbers for just Varnish > is difficult for a number of reasons, but effectively I cannot get a > good sample of an individual process due in part to how perf performs > and aggregates its samples, and in part to how many threads we run. > But here are some of the things I've seen as problematic over the last > few years, and ideas for fixing them. > > ## Header Scanning > > The number one CPU hog in Varnish right now outside of PCRE is header > processing. Any header lookup or set is a linear scan, and any > operation on a non-existent header has worst-case time. Any headers > added with VCL become worst-case. Our plan for this turns header > access / modification into O(1): > > a. All headers are accessed from offsetting inside a LUT. > b. All headers in http_headers.h have a guaranteed slot in the LUT. > c. VCL-accessed headers are managed at vcc-time and receive slots in > the LUT at initialization-time. > d. VMODs have an interface such that headers they access outside of > headers defined in http_headers.h or VCL can be registered at > initialization time. > > This also provides a means for accessing multiple same-named headers. > We would introduce some VCL syntax to be able to access a specific > header (e.g. beresp.http.Set-Cookie[2]), and to get the number of > occurrences of a header of a particular name. An interface would also > exist to be able to apply some function to all headers (possibly also > to all headers matching a specific name).
The latter of these is > something we have already -- we've added a "collection" type to VCL, > as well as some functions to apply a function (called a "callback") to > members of the collection. Callbacks operate in a context in which > they are provided a key and value; they are not closures. This has > workspace overhead that I'm not entirely happy with yet, so we have > not made it a generally accessible thing yet. > > In the case of multiple headers with the same name, the storage format > would still be a LUT, but the next member for a header would appear > "chained" in some later part of the headers, any offset is defined as > a 3-tuple of (p, q, next). When next is NULL, only one header of a > particular instance appears. Since Varnish bounds the number of > headers that can be handled in a request, this table doesn't have to > be very large and can probably be bounded to (n_known_headers + > max_headers) entries. > > ## Time Storage as double-precision FP and printf(3)-family > > Varnish uses double-precision FP for storing timestamps. The rationale > for this is reasonable: a native type exists that can support > fractional seconds. Arithmetic between two timestamps can be easily > applied in code without relying on APIs that make said arithmetic > difficult to read. This is a good argument for having times stored in > a native format. Unfortunately, there are a few downsides: FP > operations are typically slow, error-prone at a hardware / compiler > level (https://github.com/varnishcache/varnish-cache/issues/1875 as a > recent example), and stringifying floating point numbers correctly is > really hard. > > I have just done a measurement of our production Varnish and TIM_foo > functions no longer appear as significant CPU users. I believe this is > because of a change I made a year or two ago in our fork that > snapshots timestamps before calling into VCL. 
All VRT functions > operate on the same timestamps, and therefore all VCL callbacks appear > to occur at the same time. (This has numerous beneficial properties > and only a few negative ones). Each VCL function gets its own > snapshot. > > However, printf(3)-family functions are still super-heavy CPU > consumers, accounting for ~6.5% of total CPU time in our Varnish. A third > of this time is spent in `__printf_fp`, which is the glibc function > that handles representation of floating-point values. The *only* thing > Varnish really uses FP for is doubles; it's logical to assume without > doing a full audit that something like 20% of printf(3)-family time > is spent converting double-precision numbers to strings and the > majority of the remaining time is format string parsing. From this > perspective, it is still worth it to analyze the performance of > VTIM-family functions to get an idea of their overhead: > > 1. TIM_real in our tree showed up in top functions of a synthetic, > all-hit workload. Annotating the function shows where samples saw most > of the time spent. In this sample, we can see that nearly 2/3rds of the > time is spent in setting up the call stack and 1/3 of the time is > doing FP-ops.
>
>        │ 000000000000aec0 <TIM_real>:
>  36.20 │ sub    $0x18,%rsp
>  26.70 │ xor    %edi,%edi
>        │ mov    %rsp,%rsi
>    ...
>   1.81 │40: cvtsi2 0x8(%rsp),%xmm1
>  10.86 │ mulsd  0x8511(%rip),%xmm1    # 13420 <__func__.5739+0x13>
>  19.00 │ cvtsi2 (%rsp),%xmm0
>
> 2. Other time-related functions have FP-dominating components. > TIM_format, for example, is dominated by nearly 2/3rds of its time on > a single FP instruction (this is probably partially due to pipeline > stalling).
>
>        │ 000000000000ae40 <TIM_format>:
>  62.16 │ cvttsd %xmm0,%rax
>
> Inlining these functions would be beneficial in the TIM_real (sorry, I > am still operating in V2 terminology, but I believe all of this still > applies to V4) sense, but moving away from double as a time storage > format would be beneficial in general.
This would be done by using > 64-bit counters that represent the number of nanoseconds since the > epoch. We will run out of those in something like 540 years, and I'm > happy to make that someone else's problem :). > > a. It reduces significant portion of overhead in VTIM-family functions > b. It reduces significant portion of overhead in printf > c. It maintains composability of time arithmetic > > The major downside is that timestamp printing now needs additional > work to print fractional time components. > > Finally, this gets a little into printf(3)-family inefficiencies as > well. Because it parses format strings every time, we've optimized a > number of places where we were using sprintf(3)-like interfaces to > simply use string buffers. There is VSB of course, but we also use > https://github.com/dhobsd/vstring (partially for FP stuff, partially > for allowing static-backed buffers to upgrade to dynamic ones if > necessary). The code overhead of string building is unfortunate, but > at 6.5% overhead to use printf(3), this is a real win. (Some of the > unlabeled blocks are things like _IO_default_xsputn, so the overhead > of printf(3) here is likely still higher than 6.5%). See > https://9vx.org/images/fg.png -- this was taken on a machine that is > handling nearly 12k RPS on top of ~3-4k threads. By moving to integer > times, conversion and printing would likely reduce the overhead of > printf(3) by 20% without actually changing consumption of printf. > > I am unclear how this applies to Varnish 4, but I think relatively > little is changed in this context between the versions. 
> > ## PCRE > > There are other things we've done (like optimizing regexes that are > obviously prefix and suffix matches -- turns out lots of people write > things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if > (strncmp(req.http.x-foo, "bar", 3))` because it's easy), but I don't > see those as being as high priority for upstream; they're largely > issues for our multi-tenant use case. We have done this already; > another thing we would like to do is to check regexes for things like > backtracking and use DFA-based matching where possible. In the flame > graph screenshot, the obvious VRT functions are PCRE. > > ## Expected Impact > > The expected impact of fixing these things is almost purely in > latency. For this machine handling 12k RPS, that is the constant > throughput bound, but we are bursting up to nearly 4k threads to serve > the load. If header processing, PCRE, and printf were reduced to 50% > of their current overhead, we'd expect to be able to handle the same > load with something like 350 fewer threads, which is a real win for > us. Note that even our 99%ile latency is largely covered by cache > hits, so these effects would improve service for the vast majority of > requests. > > Anyway, those are some thoughts. Looking forward to comments, though > maybe there's a better venue for that than this ML?
URL: From ksorensen at nordija.com Wed Apr 6 13:47:26 2016 From: ksorensen at nordija.com (Kristian Grønfeldt Sørensen) Date: Wed, 06 Apr 2016 15:47:26 +0200 Subject: Varnish5 builtin VCL without pipe In-Reply-To: <59499.1459931241@critter.freebsd.dk> References: <59499.1459931241@critter.freebsd.dk> Message-ID: <1459950446.24083.40.camel@nordija.com> On ons, 2016-04-06 at 08:27 +0000, Poul-Henning Kamp wrote:
> Can we make builtin::vcl_recv{} look like this in Varnish5 ?
>
> sub vcl_recv {
>     if (req.method == "PRI" ||          /* HTTP/2.0 */
>       req.method == "CONNECT" ||
>       req.method == "OPTIONS" ||
>       req.method == "TRACE") {
>         return (synth(405));
>     }
>     if (req.method != "GET" && req.method != "HEAD") {
>         /* We only deal with GET and HEAD by default */
>         return (pass);
>     }
>     if (req.http.Authorization || req.http.Cookie) {
>         /* Not cacheable by default */
>         return (pass);
>     }
>     return (hash);
> }
I was actually going to suggest that Varnish5 should default to pass() PATCH requests rather than piping them, but this change seems to take care of that, so I think it's a good idea. BR Kristian Sørensen From slink at schokola.de Wed Apr 6 14:29:24 2016 From: slink at schokola.de (Nils Goroll) Date: Wed, 6 Apr 2016 16:29:24 +0200 Subject: VIP6 / Re: backend PROXY support - thought dump In-Reply-To: <65053.1459809623@critter.freebsd.dk> References: <56F5555D.6050406@schokola.de> <5702E0A0.3060101@schokola.de> <65053.1459809623@critter.freebsd.dk> Message-ID: <57051D44.2000702@schokola.de> On 05/04/16 00:40, Poul-Henning Kamp wrote: > > If we've sent a PROXY header to the backend, that connection cannot > be reused by anybody but the very same struct sess. As recycling for current PROXY semantics would only be relevant for ESI and restarts, I think not recycling connections should be acceptable.
Nils From slink at schokola.de Wed Apr 6 14:30:27 2016 From: slink at schokola.de (Nils Goroll) Date: Wed, 6 Apr 2016 16:30:27 +0200 Subject: backend PROXY support - thought dump In-Reply-To: <56F5555D.6050406@schokola.de> References: <56F5555D.6050406@schokola.de> Message-ID: <57051D83.8040707@schokola.de> On 25/03/16 16:12, Nils Goroll wrote: > A new accept socket protocol type (eg named CLUSTER) could require a PROXY2 > header with every request. My suggestion lacks a perspective for H2, but AFAIK we're not yet clear about whether or not we'd multiplex several sps onto the same vbe anyway. From dridi at varni.sh Wed Apr 6 14:54:38 2016 From: dridi at varni.sh (Dridi Boukelmoune) Date: Wed, 6 Apr 2016 16:54:38 +0200 Subject: Varnish5 builtin VCL without pipe In-Reply-To: References: <59499.1459931241@critter.freebsd.dk> Message-ID: On Wed, Apr 6, 2016 at 2:34 PM, Federico Schwindt wrote: > For the record, added my comment to the VIP at > https://github.com/varnishcache/varnish-cache/wiki/VIP8:-No-pipe-in-builtin.vcl-in-V5. Your comment on OPTIONS is valid, but we should also consider protocol upgrades. Maybe the test should be if the method is OPTIONS and the request contains an Upgrade header field. Regarding optional upgrades, what happens if the backend responds with 101? As the VIP says, no matter how pipe turns out in v5, how does H2 deal with protocol upgrades? I think it doesn't (eg. websockets). Dridi From fgsch at lodoss.net Wed Apr 6 15:02:01 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Wed, 6 Apr 2016 16:02:01 +0100 Subject: Varnish5 builtin VCL without pipe In-Reply-To: References: <59499.1459931241@critter.freebsd.dk> Message-ID: Can you elaborate? How is Upgrade related to OPTIONS? Also (and this might require more clarifications on the VIP itself), WebSockets support still requires VCL changes, so we'll be expanding the changes to vcl_recv{}, but either way it won't work out of the box.
On Wed, Apr 6, 2016 at 3:54 PM, Dridi Boukelmoune wrote: > On Wed, Apr 6, 2016 at 2:34 PM, Federico Schwindt > wrote: > > For the record, added my comment to the VIP at > > > https://github.com/varnishcache/varnish-cache/wiki/VIP8:-No-pipe-in-builtin.vcl-in-V5 > . > > Your comment on OPTIONS is valid, but we should also consider protocol > upgrades. Maybe the test should be if the method is OPTIONS and the > request contains an Upgrade header field. > > Regarding optional upgrades, what happens if the backend responds with 101? > > As the VIP says, no matter how pipe turns out in v5, but how does H2 > deal with protocol upgrades? I think it doesn't (eg. websockets) > > Dridi > -------------- next part -------------- An HTML attachment was scrubbed... URL: From guillaume at varnish-software.com Wed Apr 6 15:03:57 2016 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Wed, 6 Apr 2016 17:03:57 +0200 Subject: Varnish performance musings In-Reply-To: References: Message-ID: Thanks for all the info, all of you. I'm wondering, H/2 mandates lower-case for headers, allowing us to get rid of strncmp, and in addition, the name length is known, so checking for equality boils down to a memcmp. Would that affect the results you are seeing, Devon? I'm genuinely curious about the memory/speed balance of both the LUT and the simple list we have, especially for "small" (<20?) sets of headers. Also, H/2 makes it an error to send two same-named headers, except for cookies (exceptions are fun!), so I'm wondering if the section about duplicate headers is worth it (again, I'm not judging, really asking). -- Guillaume Quintard On Wed, Apr 6, 2016 at 2:46 PM, Federico Schwindt wrote: > Hi, > > I'm all for the LUT idea and I think it makes even more sense when you > consider H/2. > > Having the same timestamp during a particular method also makes sense.
> I believe we only need to change now calls to use ctx->now instead of > getting the time on each call but haven't looked into detail. > > Being a bit smarter wrt some regular expressions might also be a good > idea, specially for cases like the one you mentioned. > FWIW, we have re-enabled JIT in recent Varnish versions for PCRE >= 8.32. > Perhaps it's worth checking if this provides any benefit before we consider > optimizing the custom patterns. > > On Mon, Apr 4, 2016 at 8:21 PM, Devon H. O'Dell wrote: > >> Hi all, >> >> Probably best to bring this discussion out of IRC at this point. I've >> looked into Varnish performance (from CPU consumption perspective) on >> our installations for nearly 3 years, and wanted to share some >> findings (as well as some thoughts on how to address these things). >> Unfortunately no TL;DR, but maybe it's good to separate these topics. >> >> It's going to be a bit before I'm going to have time to do a Varnish 4 >> test to get numbers -- our fork is based on 2.1.4 and I things do not >> line up 1:1. Furthermore, getting noise-free numbers for just Varnish >> is difficult for a number of reasons, but effectively I cannot get a >> good sample of an individual process due in part to how perf performs >> and aggregates its samples, and in part to how many threads we run. >> But here are some of the things I've seen as problematic over the last >> few years, and ideas for fixing them. >> >> ## Header Scanning >> >> The number one CPU hog in Varnish right now outside of PCRE is header >> processing. Any header lookup or set is a linear scan, and any >> operation on a non-existent header has worst-case time. Any headers >> added with VCL become worst-case. Our plan for this turns header >> access / modification into O(1): >> >> a. All headers are accessed from offsetting inside a LUT. >> b. All headers in http_headers.h have a guaranteed slot in the LUT. >> c. 
VCL-accessed headers are managed at vcc-time and receive slots in >> the LUT at initialization-time. >> d. VMODs have an interface such that headers they access outside of >> headers defined in http_headers.h or VCL can be registered at >> initialization time. >> >> This also provides a means for accessing multiple same-named headers. >> We would introduce some VCL syntax to be able to access a specific >> header (e.g. beresp.http.Set-Cookie[2]), to get the number of >> occurrences of a header of a particular name. An interface would also >> exist to be able to apply some function to all headers (possibly also >> to all headers matching a specific name). The latter of these is >> something we have already -- we've added a "collection" type to VCL, >> as well as some functions to apply a function (called a "callback") to >> members of the collection. Callbacks operate in a context in which >> they are provided a key and value; they are not closures. This has >> workspace overhead that I'm not entirely happy with yet, so we have >> not made it a generally accessible thing yet. >> >> In the case of multiple headers with the same name, the storage format >> would still be a LUT, but the next member for a header would appear >> "chained" in some later part of the headers, any offset is defined as >> a 3-tuple of (p, q, next). When next is NULL, only one header of a >> particular instance appears. Since Varnish bounds the number of >> headers that can be handled in a request, this table doesn't have to >> be very large and can probably be bounded to (n_known_headers + >> max_headers) entries. >> >> ## Time Storage as double-precision FP and printf(3)-family >> >> Varnish uses double-precision FP for storing timestamps. The rationale >> for this is reasonable: a native type exists that can support >> fractional seconds. Arithmetic between two timestamps can be easily >> applied in code without relying on APIs that make said arithmetic >> difficult to read. 
This is a good argument for having times stored in >> a native format. Unfortunately, there are a few downsides: FP >> operations are typically slow, error-prone at a hardware / compiler >> level (https://github.com/varnishcache/varnish-cache/issues/1875 as a >> recent example), and stringifying floating point numbers correctly is >> really hard. >> >> I have just done a measurement of our production Varnish and TIM_foo >> functions no longer appear as significant CPU users. I believe this is >> because of a change I made a year or two ago in our fork that >> snapshots timestamps before calling into VCL. All VRT functions >> operate on the same timestamps, and therefore all VCL callbacks appear >> to occur at the same time. (This has numerous beneficial properties >> and only a few negative ones). Each VCL function gets its own >> snapshot. >> >> However, printf(3)-family functions are still super-heavy CPU >> consumers, accounting ~6.5% of total CPU time in our Varnish. A third >> of this time is spent in `__printf_fp`, which is the glibc function >> that handles representation of floating-point values. The *only* thing >> Varnish really uses FP for is doubles; it's logical to assume without >> doing a full audit is that something like 20% of printf(3)-family time >> is spent converting double-precision numbers to strings and the >> majority of the remaining time is format string parsing. From this >> perspective, it is still worth it to analyze the performance of >> VTIM-family functions to get an idea of their overhead: >> >> 1. TIM_real in our tree showed up in top functions of a synthetic, >> all-hit workload. Annotating the function shows where samples saw most >> of the time spent. In this sample, we can see that nearly 2/3ds of the >> time is spent in setting up the call stack and 1/3 of the time is >> doing FP-ops. >> >> ? 000000000000aec0 : >> 36.20 ? sub $0x18,%rsp >> 26.70 ? xor %edi,%edi >> ? mov %rsp,%rsi >> ... 
>> 1.81 ?40: cvtsi2 0x8(%rsp),%xmm1 >> 10.86 ? mulsd 0x8511(%rip),%xmm1 # 13420 >> <__func__.5739+0x13> >> 19.00 ? cvtsi2 (%rsp),%xmm0 >> >> 2. Other time-related functions have FP-dominating components. >> TIM_format, for example, is dominated by nearly 2/3rds of its time on >> a single FP instruction (this is probably partially due to pipeline >> stalling). >> >> ? 000000000000ae40 : >> 62.16 ? cvttsd %xmm0,%rax >> >> Inlining these functions would be beneficial in the TIM_real (sorry, I >> am still operating in V2 terminology, but I believe all of this still >> applies to V4) sense, but moving away from double as a time storage >> format would be beneficial in general. This would be done by using >> 64-bit counters that represent the number of nanoseconds since the >> epoch. We will run out of those in something like 540 years, and I'm >> happy to make that someone else's problem :). >> >> a. It reduces significant portion of overhead in VTIM-family functions >> b. It reduces significant portion of overhead in printf >> c. It maintains composability of time arithmetic >> >> The major downside is that timestamp printing now needs additional >> work to print fractional time components. >> >> Finally, this gets a little into printf(3)-family inefficiencies as >> well. Because it parses format strings every time, we've optimized a >> number of places where we were using sprintf(3)-like interfaces to >> simply use string buffers. There is VSB of course, but we also use >> https://github.com/dhobsd/vstring (partially for FP stuff, partially >> for allowing static-backed buffers to upgrade to dynamic ones if >> necessary). The code overhead of string building is unfortunate, but >> at 6.5% overhead to use printf(3), this is a real win. (Some of the >> unlabeled blocks are things like _IO_default_xsputn, so the overhead >> of printf(3) here is likely still higher than 6.5%). 
See >> https://9vx.org/images/fg.png -- this was taken on a machine that is >> handling nearly 12k RPS on top of ~3-4k threads. By moving to integer >> times, conversion and printing would likely reduce the overhead of >> printf(3) by 20% without actually changing consumption of printf. >> >> I am unclear how this applies to Varnish 4, but I think relatively >> little is changed in this context between the versions. >> >> ## PCRE >> >> There are other things we've done (like optimizing regexes that are >> obviously prefix and suffix matches -- turns out lots of people write >> things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if >> (strncmp(req.http.x-foo, "bar" 3))` because it's easy), but I don't >> see those as being as high priority for upstream; they're largely >> issues for our multi-tenant use case. We have done this already; >> another thing we would like to do is to check regexes for things like >> backtracking and use DFA-based matching where possible. In the flame >> graph screenshot, the obvious VRT functions are PCRE. >> >> ## Expected Impact >> >> The expected impact of fixing these things is almost purely in >> latency. For this machine handling 12k RPS, that is the constant >> throughput bound, but we are bursting up to nearly 4k threads to serve >> the load. If header processing, PCRE, and printf were reduced to 50% >> of their current overhead, we'd expect to be able to handle the same >> load with something like 350 fewer threads, which is a real win for >> us. Note than even our 99%ile latency is largely covered by cache >> hits, so these effects would improve service for the vast majority of >> requests. >> >> Anyway, those are some thoughts. Looking forward to comments, though >> maybe there's a better venue for that than this ML? 
>> >> --dho >> >> _______________________________________________ >> varnish-dev mailing list >> varnish-dev at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev > > > > _______________________________________________ > varnish-dev mailing list > varnish-dev at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From dridi at varni.sh Wed Apr 6 15:19:49 2016 From: dridi at varni.sh (Dridi Boukelmoune) Date: Wed, 6 Apr 2016 17:19:49 +0200 Subject: Varnish5 builtin VCL without pipe In-Reply-To: References: <59499.1459931241@critter.freebsd.dk> Message-ID: On Wed, Apr 6, 2016 at 5:02 PM, Federico Schwindt wrote: > Can you elaborate? How is Upgrade related to OPTIONS? It isn't; only the Upgrade header is important. You can use OPTIONS as a neutral request to try a protocol switch. But the question remains for status codes like 101 (Switching Protocols) or 426 (Upgrade Required). Dridi From guillaume at varnish-software.com Wed Apr 6 15:21:47 2016 From: guillaume at varnish-software.com (Guillaume Quintard) Date: Wed, 6 Apr 2016 17:21:47 +0200 Subject: Varnish5 builtin VCL without pipe In-Reply-To: References: <59499.1459931241@critter.freebsd.dk> Message-ID: On Apr 6, 2016 17:14, "Dridi Boukelmoune" wrote: > As the VIP says, no matter how pipe turns out in v5, but how does H2 > deal with protocol upgrades? I think it doesn't (e.g. websockets) > It doesn't. If you came through ALPN, you could have gone directly to the required protocol. Same thing if you upgraded from H/1. I'm not sure I get your point about the upgrade field as it's optional. -------------- next part -------------- An HTML attachment was scrubbed...
URL: From dridi at varni.sh Wed Apr 6 15:43:17 2016 From: dridi at varni.sh (Dridi Boukelmoune) Date: Wed, 6 Apr 2016 17:43:17 +0200 Subject: Varnish5 builtin VCL without pipe In-Reply-To: References: <59499.1459931241@critter.freebsd.dk> Message-ID: > I'm not sure I get your point about the upgrade field as it's optional. Like I said, I don't think H2 allows upgrades. The other question was about backends doing a protocol switch, which means the client asked for an Upgrade and got it. There's no way to go from vcl_backend_response to vcl_pipe (or fwiw the state after the subroutine) so upgrade requests should be intercepted in vcl_recv. Dridi From dho at fastly.com Wed Apr 6 16:12:08 2016 From: dho at fastly.com (Devon H. O'Dell) Date: Wed, 6 Apr 2016 09:12:08 -0700 Subject: Varnish performance musings In-Reply-To: References: Message-ID: On Wed, Apr 6, 2016 at 8:03 AM, Guillaume Quintard wrote: > Thanks for all the info, all of you. > > I'm wondering, H/2 mandates lower-case for headers, allowing us to get rid > of strncmp, and in addition, the name length is known, so checking for > equality boils down to a memcmp. Would that affect the results you are > seeing, Devon? I'm genuinely curious about the memory/speed balance of both > the LUT and the simple list we have, especially for "small" (<20?) sets of > headers. Maybe -- our h2 strategy is still nascent and not inside Varnish for a number of reasons. I'm sure that we can benefit from using something that is not strcasecmp, but I believe the overhead is more on the dereference / cache miss side than in strcasecmp (which doesn't show up in the profile.) Also worth noting that we are doing additional strict validation of protocol string and headers in request / response, and this did not change the profile considerably.
Apart from the cache miss portion, the issue also isn't with any particular individual scan, it's that people have VCLs that toy around with tens of various headers to store state between VCL states. So it's really that any (be)?re(q|sp).http.foo access potentially causes one, and there are probably lots of those kinds of access in any VCL. Things like mobile device detection are particularly bad, because they're typically implemented as multiple if-blocks of regexes against User-Agent, which is probably the longest header in the majority of requests. So if User-Agent appears near the end of headers, that can be bad. Other examples include cases where headers are checked / set / tested for a small portion of requests (for things like A/B testing or what-have-you). We've not yet implemented the LUT, but I'm hopeful that is a thing someone will be able to do soon. (I'm not sure how easy it will be to upstream that change given our code drift.) > Also, H/2 makes it an error to send two same-named headers, except for > cookies (exceptions are fun!), so I'm wondering if the section about > duplicate headers is worth it (again, I'm not judging, really asking). Yes, because Set-Cookie too (though we have a Set-Cookie-specific workaround, I believe). I believe technically it is an error today in HTTP/1.1 for headers other than Cookie / Set-Cookie / Via / maybe a couple others, but nobody cares and plenty of people do it. (Which is really great when you have cases where proxies disagree with origin about which Content-Length header to respect.) This said, I have very little insight into h2. That probably needs fixing, but will take time.
>> I believe we only need to change now calls to use ctx->now instead of >> getting the time on each call but haven't looked into detail. >> >> Being a bit smarter wrt some regular expressions might also be a good >> idea, specially for cases like the one you mentioned. >> FWIW, we have re-enabled JIT in recent Varnish versions for PCRE >= 8.32. >> Perhaps it's worth checking if this provides any benefit before we consider >> optimizing the custom patterns. >> >> On Mon, Apr 4, 2016 at 8:21 PM, Devon H. O'Dell wrote: >>> >>> Hi all, >>> >>> Probably best to bring this discussion out of IRC at this point. I've >>> looked into Varnish performance (from CPU consumption perspective) on >>> our installations for nearly 3 years, and wanted to share some >>> findings (as well as some thoughts on how to address these things). >>> Unfortunately no TL;DR, but maybe it's good to separate these topics. >>> >>> It's going to be a bit before I'm going to have time to do a Varnish 4 >>> test to get numbers -- our fork is based on 2.1.4 and I things do not >>> line up 1:1. Furthermore, getting noise-free numbers for just Varnish >>> is difficult for a number of reasons, but effectively I cannot get a >>> good sample of an individual process due in part to how perf performs >>> and aggregates its samples, and in part to how many threads we run. >>> But here are some of the things I've seen as problematic over the last >>> few years, and ideas for fixing them. >>> >>> ## Header Scanning >>> >>> The number one CPU hog in Varnish right now outside of PCRE is header >>> processing. Any header lookup or set is a linear scan, and any >>> operation on a non-existent header has worst-case time. Any headers >>> added with VCL become worst-case. Our plan for this turns header >>> access / modification into O(1): >>> >>> a. All headers are accessed from offsetting inside a LUT. >>> b. All headers in http_headers.h have a guaranteed slot in the LUT. >>> c. 
VCL-accessed headers are managed at vcc-time and receive slots in >>> the LUT at initialization-time. >>> d. VMODs have an interface such that headers they access outside of >>> headers defined in http_headers.h or VCL can be registered at >>> initialization time. >>> >>> This also provides a means for accessing multiple same-named headers. >>> We would introduce some VCL syntax to be able to access a specific >>> header (e.g. beresp.http.Set-Cookie[2]), to get the number of >>> occurrences of a header of a particular name. An interface would also >>> exist to be able to apply some function to all headers (possibly also >>> to all headers matching a specific name). The latter of these is >>> something we have already -- we've added a "collection" type to VCL, >>> as well as some functions to apply a function (called a "callback") to >>> members of the collection. Callbacks operate in a context in which >>> they are provided a key and value; they are not closures. This has >>> workspace overhead that I'm not entirely happy with yet, so we have >>> not made it a generally accessible thing yet. >>> >>> In the case of multiple headers with the same name, the storage format >>> would still be a LUT, but the next member for a header would appear >>> "chained" in some later part of the headers, any offset is defined as >>> a 3-tuple of (p, q, next). When next is NULL, only one header of a >>> particular instance appears. Since Varnish bounds the number of >>> headers that can be handled in a request, this table doesn't have to >>> be very large and can probably be bounded to (n_known_headers + >>> max_headers) entries. >>> >>> ## Time Storage as double-precision FP and printf(3)-family >>> >>> Varnish uses double-precision FP for storing timestamps. The rationale >>> for this is reasonable: a native type exists that can support >>> fractional seconds. 
Arithmetic between two timestamps can be easily >>> applied in code without relying on APIs that make said arithmetic >>> difficult to read. This is a good argument for having times stored in >>> a native format. Unfortunately, there are a few downsides: FP >>> operations are typically slow, error-prone at a hardware / compiler >>> level (https://github.com/varnishcache/varnish-cache/issues/1875 as a >>> recent example), and stringifying floating point numbers correctly is >>> really hard. >>> >>> I have just done a measurement of our production Varnish and TIM_foo >>> functions no longer appear as significant CPU users. I believe this is >>> because of a change I made a year or two ago in our fork that >>> snapshots timestamps before calling into VCL. All VRT functions >>> operate on the same timestamps, and therefore all VCL callbacks appear >>> to occur at the same time. (This has numerous beneficial properties >>> and only a few negative ones). Each VCL function gets its own >>> snapshot. >>> >>> However, printf(3)-family functions are still super-heavy CPU >>> consumers, accounting for ~6.5% of total CPU time in our Varnish. A third >>> of this time is spent in `__printf_fp`, which is the glibc function >>> that handles representation of floating-point values. The *only* thing >>> Varnish really uses FP for is doubles; it's logical to assume, without >>> doing a full audit, that something like 20% of printf(3)-family time >>> is spent converting double-precision numbers to strings and the >>> majority of the remaining time is format string parsing. From this >>> perspective, it is still worth it to analyze the performance of >>> VTIM-family functions to get an idea of their overhead:
In this sample, we can see that nearly 2/3rds of the >>> time is spent in setting up the call stack and 1/3 of the time is >>> doing FP-ops. >>> >>> ? 000000000000aec0 : >>> 36.20 ? sub $0x18,%rsp >>> 26.70 ? xor %edi,%edi >>> ? mov %rsp,%rsi >>> ... >>> 1.81 ?40: cvtsi2 0x8(%rsp),%xmm1 >>> 10.86 ? mulsd 0x8511(%rip),%xmm1 # 13420 >>> <__func__.5739+0x13> >>> 19.00 ? cvtsi2 (%rsp),%xmm0 >>> >>> 2. Other time-related functions have FP-dominating components. >>> TIM_format, for example, is dominated by nearly 2/3rds of its time on >>> a single FP instruction (this is probably partially due to pipeline >>> stalling). >>> >>> ? 000000000000ae40 : >>> 62.16 ? cvttsd %xmm0,%rax >>> >>> Inlining these functions would be beneficial in the TIM_real (sorry, I >>> am still operating in V2 terminology, but I believe all of this still >>> applies to V4) sense, but moving away from double as a time storage >>> format would be beneficial in general. This would be done by using >>> 64-bit counters that represent the number of nanoseconds since the >>> epoch. We will run out of those in something like 540 years, and I'm >>> happy to make that someone else's problem :). >>> >>> a. It reduces a significant portion of overhead in VTIM-family functions >>> b. It reduces a significant portion of overhead in printf >>> c. It maintains composability of time arithmetic >>> >>> The major downside is that timestamp printing now needs additional >>> work to print fractional time components. >>> >>> Finally, this gets a little into printf(3)-family inefficiencies as >>> well. Because it parses format strings every time, we've optimized a >>> number of places where we were using sprintf(3)-like interfaces to >>> simply use string buffers. There is VSB of course, but we also use >>> https://github.com/dhobsd/vstring (partially for FP stuff, partially >>> for allowing static-backed buffers to upgrade to dynamic ones if >>> necessary).
The code overhead of string building is unfortunate, but >>> at 6.5% overhead to use printf(3), this is a real win. (Some of the >>> unlabeled blocks are things like _IO_default_xsputn, so the overhead >>> of printf(3) here is likely still higher than 6.5%). See >>> https://9vx.org/images/fg.png -- this was taken on a machine that is >>> handling nearly 12k RPS on top of ~3-4k threads. By moving to integer >>> times, conversion and printing would likely reduce the overhead of >>> printf(3) by 20% without actually changing consumption of printf. >>> >>> I am unclear how this applies to Varnish 4, but I think relatively >>> little is changed in this context between the versions. >>> >>> ## PCRE >>> >>> There are other things we've done (like optimizing regexes that are >>> obviously prefix and suffix matches -- turns out lots of people write >>> things like `if (req.http.x-foo ~ "^bar.*$")` that are effectively `if >>> (!strncmp(req.http.x-foo, "bar", 3))` because it's easy), but I don't >>> see those as being as high priority for upstream; they're largely >>> issues for our multi-tenant use case. We have done this already; >>> another thing we would like to do is to check regexes for things like >>> backtracking and use DFA-based matching where possible. In the flame >>> graph screenshot, the obvious VRT functions are PCRE. >>> >>> ## Expected Impact >>> >>> The expected impact of fixing these things is almost purely in >>> latency. For this machine handling 12k RPS, that is the constant >>> throughput bound, but we are bursting up to nearly 4k threads to serve >>> the load. If header processing, PCRE, and printf were reduced to 50% >>> of their current overhead, we'd expect to be able to handle the same >>> load with something like 350 fewer threads, which is a real win for >>> us. Note that even our 99%ile latency is largely covered by cache >>> hits, so these effects would improve service for the vast majority of >>> requests.
>>> >>> Anyway, those are some thoughts. Looking forward to comments, though >>> maybe there's a better venue for that than this ML? >>> >>> --dho >>> >>> _______________________________________________ >>> varnish-dev mailing list >>> varnish-dev at varnish-cache.org >>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev >> >> >> >> _______________________________________________ >> varnish-dev mailing list >> varnish-dev at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev > > From phk at phk.freebsd.dk Wed Apr 6 16:27:16 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Wed, 06 Apr 2016 16:27:16 +0000 Subject: Varnish5 builtin VCL without pipe In-Reply-To: References: <59499.1459931241@critter.freebsd.dk> Message-ID: <1484.1459960036@critter.freebsd.dk> -------- In message , Federico Schwindt writes: >Can you elaborate? How is Upgrade related to OPTIONS? > >Also (and this might require more clarifications on the VIP itself), I welcome all of you to amend & discuss in the VIP itself, it is meant to be our record of the decision for later... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From reza at varnish-software.com Wed Apr 6 21:09:26 2016 From: reza at varnish-software.com (Reza Naghibi) Date: Wed, 6 Apr 2016 17:09:26 -0400 Subject: Expanding VCL object support Message-ID: Below are some thoughts and a prototype patch for expanding object support. Today, objects are limited to global objects which have a lifetime of the entire VCL. I feel it's useful to have objects which can be created during the request and their scope is limited to that request. When the request is done, the objects are finalized. The driver for this is creating a new curl/http vmod (I have several other vmods in mind which would benefit from this).
When making curl requests from VCL, you may want to have multiple outstanding (async) requests, so we need to have the ability to encapsulate these requests in isolated, independent, and request scoped objects. Another use case is creating simple type objects, like VCL_STRING, VCL_INT, and VCL_BLOB. We can wrap these types into an object and we now have VCL variables which don't require conversion into headers and we can define them on the fly per request. This will likely open the door for some very interesting VCL functionality :) Here is a VCL snippet which compiles and works with the patch and uses my new libvmod_types [0].

---
import types;

sub vcl_init
{
  //global objects, these are unchanged from 4.0
  new s = types.string("Hello!");
  new reqs = types.integer(0);
}

sub vcl_recv
{
  //new req scoped objects
  new req.var.slocal = types.string("Request scoped string");
  new req.var.s2 = types.string("request string two");
  new req.var.count = types.integer(1);
}

sub vcl_backend_fetch
{
  //new bereq scoped objects
  new bereq.var.sbe = types.string("berequest string v1");
  set bereq.http.sbe = bereq.var.sbe.value();

  bereq.var.sbe.set("berequest string v2");
  set bereq.http.sbe2 = bereq.var.sbe.value();
}

sub vcl_deliver
{
  //referencing a mix of global and req scoped objects
  set resp.http.X-s = s.value();
  set resp.http.X-s-length = s.length();

  set resp.http.X-slocal = req.var.slocal.value();
  set resp.http.X-slocal-length = req.var.slocal.length();

  req.var.count.increment(10);
  set resp.http.count = req.var.count.value();
  set resp.http.reqs = reqs.increment_get(1);
}
---

The theoretical curl/http example:

---
import http;

sub vcl_recv
{
  //http request #1
  new req.var.h1 = http.request();
  req.var.h1.set_header("foo", "bar");
  req.var.h1.set_url("POST", "http://host1/blah?ok=true");
  req.var.h1.send();

  //http request #2 (we don't read it so it's async)
  new req.var.h2 = http.request();
  req.var.h2.set_url("GET", "http://host2/ping");
  req.var.h2.send();
}

sub vcl_deliver
{
  //reference and
read http request #1 and block for result
  set resp.http.X-test-response-code = req.var.h1.get_response_code();
}
---

I left the legacy global objects alone in code and syntax. I introduced 2 new variable name scopes: req.var.* and bereq.var.*. This is completely cosmetic as these variables can still be request scoped without the (be)req.var prefix. However, the reason for adding it is to give the user some kind of indication that their variable is tied to a frontend, backend, or global scope. Otherwise I have the feeling that a bunch of un-prefixed variables throwing vcc scope errors when used incorrectly will be confusing. Also, the implementation is fairly simple because I piggybacked on the vmod/vrt priv_task implementation. Request scoped objects are basically given a shimmed struct vmod_priv. I had to jump through a few small hoops in vcc code to get the priv->priv to cast into an actual struct that the VMOD expects. This may or may not be related to VIP#1, but it would be cleaner to move objects to something more priv-like than trying to pass in an explicit struct. However, for the patch, I kept the object interface the same and made use of the previously mentioned vcc/vrt shims. The patch is enough to have the examples work and give you guys an idea of how it would work. I wanted to get some feedback before spending more time on this. It's based on this commit [1], so feel free to comment on github if you want. [0] https://github.com/rezan/libvmod-types [1] https://github.com/rezan/varnish-cache/commit/b547bd9ad2fca9db1ef17ee73b8e9b7df9950c34 Thanks! -- Reza Naghibi Varnish Software -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: 0001-Expanded-object-support.patch Type: text/x-patch Size: 6176 bytes Desc: not available URL: From slink at schokola.de Thu Apr 7 06:50:20 2016 From: slink at schokola.de (Nils Goroll) Date: Thu, 7 Apr 2016 08:50:20 +0200 Subject: VTIM_real vs VTIM_mono / double vs uint64_t In-Reply-To: <5702DE9A.7040103@schokola.de> References: <5702DE9A.7040103@schokola.de> Message-ID: <5706032C.2050807@schokola.de> On 04/04/16 23:37, Nils Goroll wrote: > I once optimized a CLOCK_REALTIME bound app by caching real time and offsetting > it with the TSC Long story short, glibc does this ifdef HP_TIMING_AVAIL and my age-old wisdom regarding mono vs. real does not hold any more - at least not on my own machine. - Linux 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-1 (2016-03-06) x86_64 - i7-4600M In particular, glibc doesn't syscall (!), which I was completely unaware of. I expect this may look completely different on other platforms. On 04/04/16 21:21, Devon H. O'Dell wrote: > using 64-bit counters that represent the number of nanoseconds since the > epoch. insights from a trivial benchmark (patch attached) on my machine: * real/mono overhead is <= ~5% * ignoring the VTIM value, the double/uint64 overhead is negligible (as expected) * with two arith ops per call (addition, division by 2), the double/uint64 overhead is ~10 - ~15% Please consider the usual disclaimers for synthetic benchmarks in a tight loop. So this basically confirms dho's more realistic benchmarks: - on this platform, optimizing for mono time does not make any sense - the main potential is avoiding FP ops by representing time as an integer Nils - # default optimizer varnish-cache/lib/libvarnish (master)$ cc -Wall -o foo -DTEST_DRIVER -I../..
-I../../include vtim.c vas.c -lm varnish-cache/lib/libvarnish (master)$ ./foo bench bench noop test value 0.000000 bench noop took 2.051ns per call bench warmup test value 0.000000 bench warmup took 19.128ns per call bench VTIM_mono test value 0.000000 bench VTIM_mono took 19.116ns per call bench VTIM_mono_i test value 0.000000 bench VTIM_mono_i took 19.034ns per call bench VTIM_real test value 0.000000 bench VTIM_real took 20.007ns per call bench VTIM_real_i test value 0.000000 bench VTIM_real_i took 19.995ns per call bench VTIM_mono test value 7008.426196 bench VTIM_mono took 25.131ns per call bench VTIM_mono_i test value 7008644393743.000000 bench VTIM_mono_i took 21.817ns per call bench VTIM_real test value 1460010972.766325 bench VTIM_real took 26.303ns per call bench VTIM_real_i test value 1460010972992060672.000000 bench VTIM_real_i took 22.570ns per call # -O6 varnish-cache/lib/libvarnish (master)$ cc -O6 -Wall -o foo -DTEST_DRIVER -I../.. -I../../include vtim.c vas.c -lm varnish-cache/lib/libvarnish (master)$ ./foo bench bench noop test value 0.000000 bench noop took 0.000ns per call bench warmup test value 0.000000 bench warmup took 15.500ns per call bench VTIM_mono test value 0.000000 bench VTIM_mono took 15.533ns per call bench VTIM_mono_i test value 0.000000 bench VTIM_mono_i took 15.552ns per call bench VTIM_real test value 0.000000 bench VTIM_real took 15.861ns per call bench VTIM_real_i test value 0.000000 bench VTIM_real_i took 15.899ns per call bench VTIM_mono test value 7624.422416 bench VTIM_mono took 20.277ns per call bench VTIM_mono_i test value 7624605637622.000000 bench VTIM_mono_i took 18.319ns per call bench VTIM_real test value 1460011588.673950 bench VTIM_real took 20.942ns per call bench VTIM_real_i test value 1460011588860285696.000000 bench VTIM_real_i took 18.630ns per call -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 0001-uint64_t-VTIM-functions-and-a-trivial-benchmark.patch Type: text/x-patch Size: 3295 bytes Desc: not available URL: From slink at schokola.de Thu Apr 7 06:55:33 2016 From: slink at schokola.de (Nils Goroll) Date: Thu, 7 Apr 2016 08:55:33 +0200 Subject: VTIM_real vs VTIM_mono / double vs uint64_t In-Reply-To: <5706032C.2050807@schokola.de> References: <5702DE9A.7040103@schokola.de> <5706032C.2050807@schokola.de> Message-ID: <57060465.9060403@schokola.de> stupid glitch: the patch had the wrong factor for gettimeofday seconds to nanoseconds diff --git a/lib/libvarnish/vtim.c b/lib/libvarnish/vtim.c index e0f81bb..91a7531 100644 --- a/lib/libvarnish/vtim.c +++ b/lib/libvarnish/vtim.c @@ -125,7 +125,7 @@ VTIM_mono_i(void) struct timeval tv; AZ(gettimeofday(&tv, NULL)); - return (tv.tv_sec * 1e6 + tv.tv_usec * 1e3); + return (tv.tv_sec * 1e9 + tv.tv_usec * 1e3); #endif } @@ -157,7 +157,7 @@ VTIM_real_i(void) struct timeval tv; AZ(gettimeofday(&tv, NULL)); - return (tv.tv_sec * 1e6 + tv.tv_usec * 1e3); + return (tv.tv_sec * 1e9 + tv.tv_usec * 1e3); #endif } From slink at schokola.de Thu Apr 7 08:16:26 2016 From: slink at schokola.de (Nils Goroll) Date: Thu, 7 Apr 2016 10:16:26 +0200 Subject: SmartOS: VTIM_real vs VTIM_mono / double vs uint64_t In-Reply-To: <5706032C.2050807@schokola.de> References: <5702DE9A.7040103@schokola.de> <5706032C.2050807@schokola.de> Message-ID: <5706175A.9010808@schokola.de> On 07/04/16 08:50, Nils Goroll wrote: > I expect this may look completely different on other platforms. SmartOS: significantly less efficient than linux, higher real/mono overhead, double/uint64 not as relevant because of the lower efficiency [uplex at varnishdev-il ~/src/varnish-cache/lib/libvarnish]$ uname -a SunOS varnishdev-il.ham1.v0.uplex.de 5.11 joyent_20151029T053122Z i86pc i386 i86pc [uplex at varnishdev-il ~/src/varnish-cache/lib/libvarnish]$ cc -o foo -DTEST_DRIVER -I../.. 
-I../../include vtim.c vas.c -lm [uplex at varnishdev-il ~/src/varnish-cache/lib/libvarnish]$ ./foo bench bench noop test value 0.000000 bench noop took 2.616ns per call bench warmup test value 0.000000 bench warmup took 276.545ns per call bench VTIM_mono test value 0.000000 bench VTIM_mono took 274.854ns per call bench VTIM_mono_i test value 0.000000 bench VTIM_mono_i took 269.778ns per call bench VTIM_real test value 0.000000 bench VTIM_real took 284.716ns per call bench VTIM_real_i test value 0.000000 bench VTIM_real_i took 285.487ns per call bench VTIM_mono test value 3678449.798561 bench VTIM_mono took 278.187ns per call bench VTIM_mono_i test value 3678452517365498.000000 bench VTIM_mono_i took 271.873ns per call bench VTIM_real test value 1460016144.865661 bench VTIM_real took 293.806ns per call bench VTIM_real_i test value 1460016147788149504.000000 bench VTIM_real_i took 292.245ns per call From martin at varnish-software.com Thu Apr 7 09:33:49 2016 From: martin at varnish-software.com (Martin Blix Grydeland) Date: Thu, 07 Apr 2016 09:33:49 +0000 Subject: VTIM_real vs VTIM_mono / double vs uint64_t In-Reply-To: <5706032C.2050807@schokola.de> References: <5702DE9A.7040103@schokola.de> <5706032C.2050807@schokola.de> Message-ID: On Thu, 7 Apr 2016 at 08:57 Nils Goroll wrote: > In particular, glibc doesn't syscall (!), which I was completely unaware of. > Linux uses a mechanism called VDSO(7) (Virtual Dynamic Shared Object) for its gettimeofday() implementation. It utilizes a shared read-only page that holds the kernel's current time. So the gettimeofday becomes much less expensive, as you avoid the context switch to read it. Will still be some memory reads though, so something based around the Intel in-chip clock would still be faster, though I'd guess not so much faster that it warrants the added complexity for Varnish' use cases. We do have more platforms to cater for than Linux though.
Martin -------------- next part -------------- An HTML attachment was scrubbed... URL: From phk at phk.freebsd.dk Thu Apr 7 10:10:04 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Thu, 07 Apr 2016 10:10:04 +0000 Subject: Patches as github issues from now on Message-ID: <31037.1460023804@critter.freebsd.dk> Next step of github migration: From now on send patches as github pull requests. Discussion of patches will be in the pull request comment fields. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From phk at phk.freebsd.dk Thu Apr 7 10:18:46 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Thu, 07 Apr 2016 10:18:46 +0000 Subject: Patches as github issues from now on In-Reply-To: <31037.1460023804@critter.freebsd.dk> References: <31037.1460023804@critter.freebsd.dk> Message-ID: <31380.1460024326@critter.freebsd.dk> -------- In message <31037.1460023804 at critter.freebsd.dk>, Poul-Henning Kamp writes: >Next step of github migration: > >From now on send patches as github pull requests. > >Discussion of patches will be in the pull request comment fields. I should add that any patches previously emailed to -dev but still in limbo should be resubmitted as github pull requests. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
From phk at phk.freebsd.dk Thu Apr 7 10:16:31 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Thu, 07 Apr 2016 10:16:31 +0000 Subject: VTIM_real vs VTIM_mono / double vs uint64_t In-Reply-To: References: <5702DE9A.7040103@schokola.de> <5706032C.2050807@schokola.de> Message-ID: <31310.1460024191@critter.freebsd.dk> -------- In message , Martin Blix Grydeland writes: >Will still be some >memory reads though, so something based around the Intel in-chip clock There is no such thing as the "Intel in-chip clock". And we'll only be using clock_gettime() or gettimeofday(), and leave the hardware frobbing to the OS. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From slink at schokola.de Thu Apr 7 11:28:26 2016 From: slink at schokola.de (Nils Goroll) Date: Thu, 7 Apr 2016 13:28:26 +0200 Subject: VTIM_real vs VTIM_mono / double vs uint64_t In-Reply-To: <31310.1460024191@critter.freebsd.dk> References: <5702DE9A.7040103@schokola.de> <5706032C.2050807@schokola.de> <31310.1460024191@critter.freebsd.dk> Message-ID: <5706445A.3060506@schokola.de> Hi, @Martin, thanks for the VDSO explanations - I knew about this but was not up-to-date: because I had once worked on an issue where Linux _did_ syscall for clock_gettime, I was surprised to see that my machine doesn't. But my remark on this list was superfluous. On 07/04/16 12:16, Poul-Henning Kamp wrote: >> Will still be some >> memory reads though, so something based around the Intel in-chip clock > > There is no such thing as the "Intel in-chip clock". I think martin was referring to the TSC. > And we'll only be using clock_gettime() or gettimeofday(), and leave > the hardware frobbing to the OS. I do fully support this decision now. 
Nils From fgsch at lodoss.net Thu Apr 7 11:45:59 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Thu, 7 Apr 2016 12:45:59 +0100 Subject: Expanding VCL object support In-Reply-To: References: Message-ID: I definitely like this and I can see a lot of places where it will become handy. What I'm not particularly keen on is the xxx.var namespace. While I understand the reasoning behind it, and I'd like to see some kind of variable support in Varnish, I'm not sure reusing the *req/*resp space is the way forward. If you really want to tie the variables to a particular namespace you could name them as `req_var_xxx`, `bereq_var_xxx`, etc. so this feels a bit superfluous. Speaking of variable support, if we were to have these new scopes isn't this exactly giving us that, perhaps with some extra cost? On Wed, Apr 6, 2016 at 10:09 PM, Reza Naghibi wrote: > Below are some thoughts and a prototype patch for expanding object > support. Today, objects are limited to global objects which have a lifetime > of the entire VCL. I feel it's useful to have objects which can be created > during the request and their scope is limited to that request. When the > request is done, the objects are finalized. > > The driver for this is creating a new curl/http vmod (I have several other > vmods in mind which would benefit from this). When making curl requests > from VCL, you may want to have multiple outstanding (async) requests, so we > need to have the ability to encapsulate these requests in isolated, > independent, and request scoped objects. Another use case is creating > simple type objects, like VCL_STRING, VCL_INT, and VCL_BLOB. We can wrap > these types into an object and we now have VCL variables which don't > require conversion into headers and we can define them on the fly per > request. This will likely open the door for some very interesting VCL > functionality :) > > Here is a VCL snippet which compiles and works with the patch and uses my > new libvmod_types [0].
> > --- > import types; > > sub vcl_init > { > //global objects, these are unchanged from 4.0 > new s = types.string("Hello!"); > new reqs = types.integer(0); > } > > sub vcl_recv > { > //new req scoped objects > new req.var.slocal = types.string("Request scoped string"); > new req.var.s2 = types.string("request string two"); > new req.var.count = types.integer(1); > } > > sub vcl_backend_fetch > { > //new bereq scoped objects > new bereq.var.sbe = types.string("berequest string v1"); > set bereq.http.sbe = bereq.var.sbe.value(); > > bereq.var.sbe.set("berequest string v2"); > set bereq.http.sbe2 = bereq.var.sbe.value(); > } > > sub vcl_deliver > { > //referencing a mix of global and req scoped objects > set resp.http.X-s = s.value(); > set resp.http.X-s-length = s.length(); > > set resp.http.X-slocal = req.var.slocal.value(); > set resp.http.X-slocal-length = req.var.slocal.length(); > > req.var.count.increment(10); > set resp.http.count = req.var.count.value(); > set resp.http.reqs = reqs.increment_get(1); > } > --- > > The theoretical curl/http example: > > --- > import http; > > sub vcl_recv > { > //http request #1 > new req.var.h1 = http.request(); > req.var.h1.set_header("foo", "bar"); > req.var.h1.set_url("POST", "http://host1/blah?ok=true"); > req.var.h1.send(); > > //http request #2 (we dont read it so its async) > new req.var.h2 = http.request(); > req.var.h2.set_url("GET", "http://host2/ping"); > req.var.h2.send(); > } > > sub vcl_deliver > { > //reference and read http request #1 and block for result > set resp.http.X-test-response-code = req.var.h1.get_response_code(); > } > --- > > I left the legacy global objects alone in code and syntax. I introduced 2 > new variable name scopes: req.var.* and bereq.var.*. This is completely > cosmetic as these variables can still be request scoped without the > (be)req.var prefix. 
However, the reason for adding it is to give the user > some kind of indication that their variable is tied to a frontend, backend, > or global scope. Otherwise I have the feeling having a bunch of un-prefixed > variables throwing vcc scope errors when used incorrectly will be confusing. > > Also, the implementation is fairly simple because I piggybacked on the > vmod/vrt priv_task implementation. Request scoped objects are basically > given a shimmed struct vmod_priv. I had to jump thru a few small hoops in > vcc code to get the priv->priv to cast into an actual struct that the VMOD > expects. This may or may not be related to VIP#1, but it would be cleaner > to move objects to something more priv like than trying to pass in an > explicit struct. However, for the patch, I kept the object interface the > same and made use of the previously mentioned vcc/vrt shims. > > The patch is enough to have the examples work and give you guys an idea of > how it would work. I wanted to get some feedback before spending more time > on this. Its based off of this commit [1], so feel free to comment on > github if you want. > > > [0] https://github.com/rezan/libvmod-types > [1] > https://github.com/rezan/varnish-cache/commit/b547bd9ad2fca9db1ef17ee73b8e9b7df9950c34 > > Thanks! > > -- > Reza Naghibi > Varnish Software > > _______________________________________________ > varnish-dev mailing list > varnish-dev at varnish-cache.org > https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev > -------------- next part -------------- An HTML attachment was scrubbed... URL: From reza at varnish-software.com Thu Apr 7 14:04:15 2016 From: reza at varnish-software.com (Reza Naghibi) Date: Thu, 7 Apr 2016 10:04:15 -0400 Subject: Expanding VCL object support In-Reply-To: References: Message-ID: I was thinking about this a bit more last night. I agree that "req.var.blah.foo()" is pretty verbose. We could drop it all and just do "new blah = ..." 
with "blah.foo()" and let vcc and the user reason about the scope. Maybe a shorthand prefix is best. One issue is if you want to target the top request lifetime (ESI), then without a prefix, there is no way to do that. There is no way to tell if "new blah = ..." should be request or top request scope. With a prefix, vcc would know which scope you want it to be. I'm thinking having top request scopes would be useful, but if not, then we could favor dropping the prefix altogether. -- Reza Naghibi Varnish Software On Thu, Apr 7, 2016 at 7:45 AM, Federico Schwindt wrote: > I definitely like this and I can see a lot of places where it will become > handy. > > What I'm not particularly keen is the xxx.var namespace. While I > understand the reasoning behind and I'd like to see some kind of variable > support in Varnish I'm not sure reusing the *req/*resp space is the way > forward. > If you really want to tie the variables to a particular namespace you > could name them as `req_var_xxx`, `bereq_var_xxx`, etc. so this feels a bit > superfluous. > > Speaking of variable support, if we were to have these new scopes isn't > this exactly giving us that, perhaps with some extra cost? > > On Wed, Apr 6, 2016 at 10:09 PM, Reza Naghibi > wrote: > >> Below are some thoughts and a prototype patch for expanding object >> support. Today, objects are limited to global objects which have a lifetime >> of the entire VCL. I feel its useful to have objects which can be created >> during the request and their scope is limited to that request. When the >> request is done, the objects are finalized. >> >> The driver for this is creating a new curl/http vmod (I have several >> other vmods in mind which would benefit from this). When making curl >> requests from VCL, you may want to have multiple outstanding (async) >> requests, so we need to have the ability to encapsulate these requests in >> isolated, independent, and request scoped objects.
Another use case is >> creating simple type objects, like VCL_STRING, VCL_INT, and VCL_BLOB. We >> can wrap these types into an object and we now have VCL variables which >> don't require conversion into headers and we can define them on the fly per >> request. This will likely open the door for some very interesting VCL >> functionality :) >> >> Here is a VCL snippet which compiles and works with the patch and uses my >> new libvmod_types [0]. >> >> --- >> import types; >> >> sub vcl_init >> { >> //global objects, these are unchanged from 4.0 >> new s = types.string("Hello!"); >> new reqs = types.integer(0); >> } >> >> sub vcl_recv >> { >> //new req scoped objects >> new req.var.slocal = types.string("Request scoped string"); >> new req.var.s2 = types.string("request string two"); >> new req.var.count = types.integer(1); >> } >> >> sub vcl_backend_fetch >> { >> //new bereq scoped objects >> new bereq.var.sbe = types.string("berequest string v1"); >> set bereq.http.sbe = bereq.var.sbe.value(); >> >> bereq.var.sbe.set("berequest string v2"); >> set bereq.http.sbe2 = bereq.var.sbe.value(); >> } >> >> sub vcl_deliver >> { >> //referencing a mix of global and req scoped objects >> set resp.http.X-s = s.value(); >> set resp.http.X-s-length = s.length(); >> >> set resp.http.X-slocal = req.var.slocal.value(); >> set resp.http.X-slocal-length = req.var.slocal.length(); >> >> req.var.count.increment(10); >> set resp.http.count = req.var.count.value(); >> set resp.http.reqs = reqs.increment_get(1); >> } >> --- >> >> The theoretical curl/http example: >> >> --- >> import http; >> >> sub vcl_recv >> { >> //http request #1 >> new req.var.h1 = http.request(); >> req.var.h1.set_header("foo", "bar"); >> req.var.h1.set_url("POST", "http://host1/blah?ok=true"); >> req.var.h1.send(); >> >> //http request #2 (we dont read it so its async) >> new req.var.h2 = http.request(); >> req.var.h2.set_url("GET", "http://host2/ping"); >> req.var.h2.send(); >> } >> >> sub vcl_deliver >> { >> 
//reference and read http request #1 and block for result >> set resp.http.X-test-response-code = req.var.h1.get_response_code(); >> } >> --- >> >> I left the legacy global objects alone in code and syntax. I introduced 2 >> new variable name scopes: req.var.* and bereq.var.*. This is completely >> cosmetic as these variables can still be request scoped without the >> (be)req.var prefix. However, the reason for adding it is to give the user >> some kind of indication that their variable is tied to a frontend, backend, >> or global scope. Otherwise I have the feeling having a bunch of un-prefixed >> variables throwing vcc scope errors when used incorrectly will be confusing. >> >> Also, the implementation is fairly simple because I piggybacked on the >> vmod/vrt priv_task implementation. Request scoped objects are basically >> given a shimmed struct vmod_priv. I had to jump thru a few small hoops in >> vcc code to get the priv->priv to cast into an actual struct that the VMOD >> expects. This may or may not be related to VIP#1, but it would be cleaner >> to move objects to something more priv like than trying to pass in an >> explicit struct. However, for the patch, I kept the object interface the >> same and made use of the previously mentioned vcc/vrt shims. >> >> The patch is enough to have the examples work and give you guys an idea >> of how it would work. I wanted to get some feedback before spending more >> time on this. Its based off of this commit [1], so feel free to comment on >> github if you want. >> >> >> [0] https://github.com/rezan/libvmod-types >> [1] >> https://github.com/rezan/varnish-cache/commit/b547bd9ad2fca9db1ef17ee73b8e9b7df9950c34 >> >> Thanks! 
>> >> -- >> Reza Naghibi >> Varnish Software >> >> _______________________________________________ >> varnish-dev mailing list >> varnish-dev at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From reza at varnish-software.com Thu Apr 7 14:12:12 2016 From: reza at varnish-software.com (Reza Naghibi) Date: Thu, 7 Apr 2016 10:12:12 -0400 Subject: Expanding VCL object support In-Reply-To: References: Message-ID: Maybe something like "new (top)blah = ..." and "blah.foo()" would allow us to target different scopes but keep the variable names and VCL syntax more clean. Just thinking out loud. -- Reza Naghibi Varnish Software On Thu, Apr 7, 2016 at 10:04 AM, Reza Naghibi wrote: > I was thinking about this a bit more last night. I agree that > "req.var.blah.foo()" is pretty verbose. We could drop it all and just do > "new blah = ..." with "blah.foo()" and let vcc and the user reason about > the scope. Maybe a shorthand prefix is best. One issue is if you want to > target the top request lifetime (ESI), then without a prefix, there is no > way to do that. There is no way to tell is "new blah = ..." should be > request or top request scope. With a prefix, vcc would know which scope you > want it to be. Im thinking having top request scopes would be useful, but > if not, then we could favor dropping the prefix altogether. > > -- > Reza Naghibi > Varnish Software > > On Thu, Apr 7, 2016 at 7:45 AM, Federico Schwindt > wrote: > >> I definitely like this and I can see a lot of places where it will become >> handy. >> >> What I'm not particularly keen is the xxx.var namespace. While I >> understand the reasoning behind and I'd like to see some kind of variable >> support in Varnish I'm not sure reusing the *req/*resp space is the way >> forward. 
>> If you really want to tie the variables to a particular namespace you >> could name them as `req_var_xxx`, `bereq_var_xxx`, etc. so this feels a bit >> superfluous. >> >> Speaking of variable support, if we were to have these new scopes isn't >> this exactly giving us that, perhaps with some extra cost? >> >> On Wed, Apr 6, 2016 at 10:09 PM, Reza Naghibi >> wrote: >> >>> Below are some thoughts and a prototype patch for expanding object >>> support. Today, objects are limited to global objects which have a lifetime >>> of the entire VCL. I feel its useful to have objects which can be created >>> during the request and their scope is limited to that request. When the >>> request is done, the objects are finalized. >>> >>> The driver for this is creating a new curl/http vmod (I have several >>> other vmods in mind which would benefit from this). When making curl >>> requests from VCL, you may want to have multiple outstanding (async) >>> requests, so we need to have the ability to encapsulate these requests in >>> isolated, independent, and request scoped objects. Another use case is >>> creating simple type objects, like VCL_STRING, VCL_INT, and VCL_BLOB. We >>> can wrap these types into an object and we now have VCL variables which >>> don't require conversion into headers and we can define them on the fly per >>> request. This will likely open the door for some very interesting VCL >>> functionality :) >>> >>> Here is a VCL snippet which compiles and works with the patch and uses >>> my new libvmod_types [0]. 
>>> >>> --- >>> import types; >>> >>> sub vcl_init >>> { >>> //global objects, these are unchanged from 4.0 >>> new s = types.string("Hello!"); >>> new reqs = types.integer(0); >>> } >>> >>> sub vcl_recv >>> { >>> //new req scoped objects >>> new req.var.slocal = types.string("Request scoped string"); >>> new req.var.s2 = types.string("request string two"); >>> new req.var.count = types.integer(1); >>> } >>> >>> sub vcl_backend_fetch >>> { >>> //new bereq scoped objects >>> new bereq.var.sbe = types.string("berequest string v1"); >>> set bereq.http.sbe = bereq.var.sbe.value(); >>> >>> bereq.var.sbe.set("berequest string v2"); >>> set bereq.http.sbe2 = bereq.var.sbe.value(); >>> } >>> >>> sub vcl_deliver >>> { >>> //referencing a mix of global and req scoped objects >>> set resp.http.X-s = s.value(); >>> set resp.http.X-s-length = s.length(); >>> >>> set resp.http.X-slocal = req.var.slocal.value(); >>> set resp.http.X-slocal-length = req.var.slocal.length(); >>> >>> req.var.count.increment(10); >>> set resp.http.count = req.var.count.value(); >>> set resp.http.reqs = reqs.increment_get(1); >>> } >>> --- >>> >>> The theoretical curl/http example: >>> >>> --- >>> import http; >>> >>> sub vcl_recv >>> { >>> //http request #1 >>> new req.var.h1 = http.request(); >>> req.var.h1.set_header("foo", "bar"); >>> req.var.h1.set_url("POST", "http://host1/blah?ok=true"); >>> req.var.h1.send(); >>> >>> //http request #2 (we dont read it so its async) >>> new req.var.h2 = http.request(); >>> req.var.h2.set_url("GET", "http://host2/ping"); >>> req.var.h2.send(); >>> } >>> >>> sub vcl_deliver >>> { >>> //reference and read http request #1 and block for result >>> set resp.http.X-test-response-code = req.var.h1.get_response_code(); >>> } >>> --- >>> >>> I left the legacy global objects alone in code and syntax. I introduced >>> 2 new variable name scopes: req.var.* and bereq.var.*. 
This is completely >>> cosmetic as these variables can still be request scoped without the >>> (be)req.var prefix. However, the reason for adding it is to give the user >>> some kind of indication that their variable is tied to a frontend, backend, >>> or global scope. Otherwise I have the feeling having a bunch of un-prefixed >>> variables throwing vcc scope errors when used incorrectly will be confusing. >>> >>> Also, the implementation is fairly simple because I piggybacked on the >>> vmod/vrt priv_task implementation. Request scoped objects are basically >>> given a shimmed struct vmod_priv. I had to jump thru a few small hoops in >>> vcc code to get the priv->priv to cast into an actual struct that the VMOD >>> expects. This may or may not be related to VIP#1, but it would be cleaner >>> to move objects to something more priv like than trying to pass in an >>> explicit struct. However, for the patch, I kept the object interface the >>> same and made use of the previously mentioned vcc/vrt shims. >>> >>> The patch is enough to have the examples work and give you guys an idea >>> of how it would work. I wanted to get some feedback before spending more >>> time on this. Its based off of this commit [1], so feel free to comment on >>> github if you want. >>> >>> >>> [0] https://github.com/rezan/libvmod-types >>> [1] >>> https://github.com/rezan/varnish-cache/commit/b547bd9ad2fca9db1ef17ee73b8e9b7df9950c34 >>> >>> Thanks! >>> >>> -- >>> Reza Naghibi >>> Varnish Software >>> >>> _______________________________________________ >>> varnish-dev mailing list >>> varnish-dev at varnish-cache.org >>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev >>> >> >> > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From phk at phk.freebsd.dk Thu Apr 7 21:25:35 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Thu, 07 Apr 2016 21:25:35 +0000 Subject: VTIM_real vs VTIM_mono / double vs uint64_t In-Reply-To: <5706445A.3060506@schokola.de> References: <5702DE9A.7040103@schokola.de> <5706032C.2050807@schokola.de> <31310.1460024191@critter.freebsd.dk> <5706445A.3060506@schokola.de> Message-ID: <35155.1460064335@critter.freebsd.dk> -------- In message <5706445A.3060506 at schokola.de>, Nils Goroll writes: >Hi, > >@Martin, thanks for the VDSO explanations - I knew about this but was not >up-to-date: because I had once worked on an issue where Linux _did_ syscall for >clock_gettime, I was surprised to see that my machine doesn't. But my remark on >this list was superfluous. > >On 07/04/16 12:16, Poul-Henning Kamp wrote: >>> Will still be some >>> memory reads though, so something based around the Intel in-chip clock >> >> There is no such thing as the "Intel in-chip clock". > >I think martin was referring to the TSC. I think so too, but I wanted to make it absolutely clear that it is not an "in-chip clock" :-) -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From lkarsten at varnish-software.com Fri Apr 8 08:29:30 2016 From: lkarsten at varnish-software.com (Lasse Karstensen) Date: Fri, 8 Apr 2016 10:29:30 +0200 Subject: New committer: Pål Hermunn Johansen (VS) Message-ID: <20160408082929.GA650@immer.varnish-software.com> Hi all. After consulting Poul-Henning, I've given Pål Hermunn Johansen commit rights to the Varnish Cache tree now. Pål Hermunn works in VS's Oslo office. He is working under Martin's kind supervision initially. His IRC handle is hermunn. Please make him feel welcome.
-- Lasse Karstensen From fgsch at lodoss.net Wed Apr 13 16:26:09 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Wed, 13 Apr 2016 17:26:09 +0100 Subject: github labels In-Reply-To: References: <56E7CE42.2020009@uplex.de> Message-ID: phk? On Tue, Mar 15, 2016 at 9:23 AM, Federico Schwindt wrote: > Good idea. > > On Tue, Mar 15, 2016 at 8:56 AM, Geoff Simmons wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA256 >> >> On 03/15/2016 09:29 AM, Federico Schwindt wrote: >> > >> > - 4.0 - 4.1 - master - more-info-needed - needs-backport >> >> ACK >> >> > We can also have labels for different components as we had in >> > trac: >> > >> > - build - documentation - varnishadm - varnishd - varnishhist - >> > varnishlog - varnishncsa - varnishstat - varnishtest - varnishtop - >> > vmod >> >> - - varnishapi >> >> I have occasionally brought up issues about VSL, and there's also VSC >> and the CLI. I may have been the only one, but it never fit well into >> any of these categories. >> >> >> Best, >> Geoff >> - -- >> ** * * UPLEX - Nils Goroll Systemoptimierung >> >> Scheffelstra?e 32 >> 22301 Hamburg >> >> Tel +49 40 2880 5731 >> Mob +49 176 636 90917 >> Fax +49 40 42949753 >> >> http://uplex.de >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1 >> >> iQIcBAEBCAAGBQJW5842AAoJEOUwvh9pJNURml0P/2tikLJkZ1kElCK7fV9/RFsY >> FS/RuCl16nLA6bitVt7T+7E2+w9Kh+qeOCPQh3TLMJ6VjCV9Qfuhde3ytT09I7b3 >> 6sP+kYybmKN2mObdoKwkgf6chcIrsZeGvYWiLbZ3VYcRMSwyelSajhhUxWHciczA >> 1C5aBwXNxCpH0TRoh6yndD4ReDHTiXeS/2/IALkelkNV1wZhD6iQ1dU22URfiAT1 >> ZIn7ECouiZxSgXLslDsx2oHcjHMvjhCDQEmRmdT3KTzMoS4Oyr5fCk53DgeILqV2 >> 2rJ2mADetd449gcTDMRQkw18PwJFoaGaWN7c4Jx1tOVCCQMPdZn4hue/YK6noqbe >> 1IcG1gVhjO3aUq4ItKlQxGAdAlDq0AS4E73+KRJpK3GhNn9dZpOvyROlYEJggN5Q >> tsGV6hBKBeFzdJNd9Ec0/cPJtWOyrigt/OD3mbvj+6YL8qn5v2tGUdBhFk67v4Iu >> goF8qArIxVDfPt5YLyuumAUZwyWC82tgYfj+aVXkKKYmyj8uOrcdmUsjNC3EwxEi >> BGgCYySxF3EBtQzn9hGXoM6/cy1zm5A8/rhuroCU2n9XSyWziEfzlcQ0Wj4y0E3H >> 
1tVLb08hOSjNe8z3CvQCowPce1B8kZlfXOEdMI4h09fxwS3GK6r1x5Lu7pEAldGX >> 2GsZIUg4sKby7PsQj60L >> =qAre >> -----END PGP SIGNATURE----- >> >> _______________________________________________ >> varnish-dev mailing list >> varnish-dev at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From phk at phk.freebsd.dk Wed Apr 13 17:42:45 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Wed, 13 Apr 2016 17:42:45 +0000 Subject: github labels In-Reply-To: References: <56E7CE42.2020009@uplex.de> Message-ID: <30901.1460569365@critter.freebsd.dk> -------- In message , Federico Schwindt writes: >phk? Whatever makes sense... -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From fgsch at lodoss.net Wed Apr 13 17:47:43 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Wed, 13 Apr 2016 18:47:43 +0100 Subject: github labels In-Reply-To: <30901.1460569365@critter.freebsd.dk> References: <56E7CE42.2020009@uplex.de> <30901.1460569365@critter.freebsd.dk> Message-ID: Someone with permissions have to create them though :) On Wed, Apr 13, 2016 at 6:42 PM, Poul-Henning Kamp wrote: > -------- > In message < > CAJV_h0aynNeFz75yHwdQ6Ua2nC1rB6Z2WdQFiBQgEWfbQ-g75w at mail.gmail.com> > , Federico Schwindt writes: > > >phk? > > Whatever makes sense... > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk at FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From phk at phk.freebsd.dk Wed Apr 13 18:18:44 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Wed, 13 Apr 2016 18:18:44 +0000 Subject: github labels In-Reply-To: References: <56E7CE42.2020009@uplex.de> <30901.1460569365@critter.freebsd.dk> Message-ID: <31041.1460571524@critter.freebsd.dk> -------- In message , Federico Schwindt writes: >Someone with permissions have to create them though :) Any idea what kind of permissions ? -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From fgsch at lodoss.net Wed Apr 13 22:24:15 2016 From: fgsch at lodoss.net (Federico Schwindt) Date: Wed, 13 Apr 2016 23:24:15 +0100 Subject: github labels In-Reply-To: References: <56E7CE42.2020009@uplex.de> Message-ID: I've created all the labels now. Happy tagging! On Tue, Mar 15, 2016 at 9:23 AM, Federico Schwindt wrote: > Good idea. > > On Tue, Mar 15, 2016 at 8:56 AM, Geoff Simmons wrote: > >> -----BEGIN PGP SIGNED MESSAGE----- >> Hash: SHA256 >> >> On 03/15/2016 09:29 AM, Federico Schwindt wrote: >> > >> > - 4.0 - 4.1 - master - more-info-needed - needs-backport >> >> ACK >> >> > We can also have labels for different components as we had in >> > trac: >> > >> > - build - documentation - varnishadm - varnishd - varnishhist - >> > varnishlog - varnishncsa - varnishstat - varnishtest - varnishtop - >> > vmod >> >> - - varnishapi >> >> I have occasionally brought up issues about VSL, and there's also VSC >> and the CLI. I may have been the only one, but it never fit well into >> any of these categories. 
>> >> >> Best, >> Geoff >> - -- >> ** * * UPLEX - Nils Goroll Systemoptimierung >> >> Scheffelstra?e 32 >> 22301 Hamburg >> >> Tel +49 40 2880 5731 >> Mob +49 176 636 90917 >> Fax +49 40 42949753 >> >> http://uplex.de >> -----BEGIN PGP SIGNATURE----- >> Version: GnuPG v1 >> >> iQIcBAEBCAAGBQJW5842AAoJEOUwvh9pJNURml0P/2tikLJkZ1kElCK7fV9/RFsY >> FS/RuCl16nLA6bitVt7T+7E2+w9Kh+qeOCPQh3TLMJ6VjCV9Qfuhde3ytT09I7b3 >> 6sP+kYybmKN2mObdoKwkgf6chcIrsZeGvYWiLbZ3VYcRMSwyelSajhhUxWHciczA >> 1C5aBwXNxCpH0TRoh6yndD4ReDHTiXeS/2/IALkelkNV1wZhD6iQ1dU22URfiAT1 >> ZIn7ECouiZxSgXLslDsx2oHcjHMvjhCDQEmRmdT3KTzMoS4Oyr5fCk53DgeILqV2 >> 2rJ2mADetd449gcTDMRQkw18PwJFoaGaWN7c4Jx1tOVCCQMPdZn4hue/YK6noqbe >> 1IcG1gVhjO3aUq4ItKlQxGAdAlDq0AS4E73+KRJpK3GhNn9dZpOvyROlYEJggN5Q >> tsGV6hBKBeFzdJNd9Ec0/cPJtWOyrigt/OD3mbvj+6YL8qn5v2tGUdBhFk67v4Iu >> goF8qArIxVDfPt5YLyuumAUZwyWC82tgYfj+aVXkKKYmyj8uOrcdmUsjNC3EwxEi >> BGgCYySxF3EBtQzn9hGXoM6/cy1zm5A8/rhuroCU2n9XSyWziEfzlcQ0Wj4y0E3H >> 1tVLb08hOSjNe8z3CvQCowPce1B8kZlfXOEdMI4h09fxwS3GK6r1x5Lu7pEAldGX >> 2GsZIUg4sKby7PsQj60L >> =qAre >> -----END PGP SIGNATURE----- >> >> _______________________________________________ >> varnish-dev mailing list >> varnish-dev at varnish-cache.org >> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From slink at schokola.de Fri Apr 15 15:44:14 2016 From: slink at schokola.de (Nils Goroll) Date: Fri, 15 Apr 2016 17:44:14 +0200 Subject: ban lurker questions Message-ID: <57110C4E.8010209@schokola.de> Hi, I am working on a ban (lurker) performance issue, which, at this point, is simply caused by too frequent and too inefficient bans - but anyway, I'd like to understand the code to the best of my abilities. 1) ban_cleantail why do we acquire the mtx for every ban we look at rather than collecting all the bans to be freed in a loop while holding the mtx? 2) would this assertion be correct? 
diff --git a/bin/varnishd/cache/cache_ban_lurker.c b/bin/varnishd/cache/cache_ban_lurker.c index 65c552e..fb69f78 100644 --- a/bin/varnishd/cache/cache_ban_lurker.c +++ b/bin/varnishd/cache/cache_ban_lurker.c @@ -190,6 +190,7 @@ ban_lurker_test_ban(struct worker *wrk, struct vsl_log *vsl, struct ban *bt, VSC_C_main->bans_lurker_obj_killed++; } else { if (oc->ban != bd) { + assert(oc->ban == bt); Lck_Lock(&ban_mtx); oc->ban->refcount--; VTAILQ_REMOVE(&oc->ban->objcore, oc, ban_list); 3) ban_lurker_getfirst questions: - for the contention case, shouldn't we continue walking the bt->objcore list and sleep only if we hit the marker? instead of sleeping, we could make progress on other objheads, and chances are that we've got more luck next time. - do IUC correctly that getfirst moves the oc to the tail of the bt->objcore list, behind the marker, to ensure we don't re-visit ocs which have not got killed yet, after being handed off to exp? 4) ban_lurker_work why do we mark_completed and then clean out the completed bans in the next step rather than removing the completed bans straight away? Why do we need to do the spec fiddling in ban_mark_completed (including a membar (!)) if we're about to ditch the ban anyway? Danke, Nils From slink at schokola.de Fri Apr 15 16:11:46 2016 From: slink at schokola.de (Nils Goroll) Date: Fri, 15 Apr 2016 18:11:46 +0200 Subject: ban_lurker_age, ban_lurker_sleep In-Reply-To: <57110C4E.8010209@schokola.de> References: <57110C4E.8010209@schokola.de> Message-ID: <571112C2.5070607@schokola.de> I've just pushed two minor improvements on the documentation, but I'd suggest two changes * ban_lurker_age: By holding off the ban lurker for a bit, - we increase the likelihood of removing duplicate bans before spending time on testing them and - increase the number of bans we test against each object in a single go but we also increase the likelihood of bans requiring request-time evaluation, which is bad for latency. 
I cannot present generic real world numbers, because I have only looked at one real life application today. This application issues bans in bursts which it fires within a couple of seconds. Are there any other applications which are likely to issue duplicate bans over longer periods of time? Otherwise I'd suggest that we lower the default to 5 seconds. * ban_lurker_sleep This is good for holding off the ban lurker after ban_lurker_batch, but IMHO it's really bad that the param is also used for the sleep time at lock contention. I strongly suggest that we make these two different parameters. 10ms default sounds sensible, but people may be tempted to increase ban_lurker_sleep to seconds timeframes and that could really hurt in the contention case. Thx, Nils From phk at phk.freebsd.dk Mon Apr 18 08:20:32 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 18 Apr 2016 08:20:32 +0000 Subject: ban lurker questions In-Reply-To: <57110C4E.8010209@schokola.de> References: <57110C4E.8010209@schokola.de> Message-ID: <41572.1460967632@critter.freebsd.dk> -------- In message <57110C4E.8010209 at schokola.de>, Nils Goroll writes: >Hi, > >I am working on a ban (lurker) performance issue, which, at this point, is >simply caused by too frequent and too inefficient bans - but anyway, I'd like to >understand the code to the best of my abilities. > >1) ban_cleantail > >why do we acquire the mtx for every ban we look at rather than collecting all >the bans to be freed in a loop while holding the mtx? Probably just an accident of how the code has developed. >2) would this assertion be correct? 
> >diff --git a/bin/varnishd/cache/cache_ban_lurker.c >b/bin/varnishd/cache/cache_ban_lurker.c >index 65c552e..fb69f78 100644 >--- a/bin/varnishd/cache/cache_ban_lurker.c >+++ b/bin/varnishd/cache/cache_ban_lurker.c >@@ -190,6 +190,7 @@ ban_lurker_test_ban(struct worker *wrk, struct vsl_log *vsl, >struct ban *bt, > VSC_C_main->bans_lurker_obj_killed++; > } else { > if (oc->ban != bd) { >+ assert(oc->ban == bt); > Lck_Lock(&ban_mtx); > oc->ban->refcount--; > VTAILQ_REMOVE(&oc->ban->objcore, oc, ban_list); I am not sure. Isn't there a window where a HSH_Lookup could race us ? >3) ban_lurker_getfirst questions: > >- for the contention case, shouldn't we continue walking the bt->objcore list > and sleep only if we hit the marker? We'd need to come back to the missed oc's later, the lurker cannot skip some of the oc's. >- do IUC correctly that getfirst moves the oc to the tail of the bt->objcore > list, behind the marker, to ensure we don't re-visit ocs which have not got > killed yet, after being handed off to exp? Not sure I understand the question... >4) ban_lurker_work > >why do we mark_completed and then clean out the completed bans in the next step >rather than removing the completed bans straight away? Why do we need to do the >spec fiddling in ban_mark_completed (including a membar (!)) if we're about to >ditch the ban anyway? again, probably just an accident of how the code developed. Improvements are most welcome -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. 
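The "collect everything under one lock, free outside it" change discussed for ban_cleantail (question 1) can be sketched as follows. The structures and names below are simplified stand-ins loosely modelled on the Varnish sources (struct ban, ban_mtx), not the actual code:

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <sys/queue.h>

/* Simplified stand-ins for the Varnish structures; not the real API. */
struct ban {
	TAILQ_ENTRY(ban) list;
	int completed;
};
TAILQ_HEAD(banhead, ban);

static pthread_mutex_t ban_mtx = PTHREAD_MUTEX_INITIALIZER;

/*
 * Unlink every completed ban at the tail of the list in a single
 * critical section, moving each onto *freelist.  The caller tears
 * the collected bans down after the mutex has been dropped, instead
 * of re-acquiring the mutex once per ban.
 */
static void
ban_cleantail_sketch(struct banhead *bans, struct banhead *freelist)
{
	struct ban *b;

	pthread_mutex_lock(&ban_mtx);
	while ((b = TAILQ_LAST(bans, banhead)) != NULL && b->completed) {
		TAILQ_REMOVE(bans, b, list);
		TAILQ_INSERT_TAIL(freelist, b, list);
	}
	pthread_mutex_unlock(&ban_mtx);
	/* teardown of the bans on *freelist happens here, lock-free */
}
```

The point of this shape is that the mutex is taken once per cleaning pass rather than once per ban, and the potentially slow teardown work runs with the lock already released.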
From phk at phk.freebsd.dk Mon Apr 18 08:39:52 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Mon, 18 Apr 2016 08:39:52 +0000 Subject: ban_lurker_age, ban_lurker_sleep In-Reply-To: <571112C2.5070607@schokola.de> References: <57110C4E.8010209@schokola.de> <571112C2.5070607@schokola.de> Message-ID: <19481.1460968792@critter.freebsd.dk> -------- In message <571112C2.5070607 at schokola.de>, Nils Goroll writes: >I've just pushed two minor improvements on the documentation, but I'd suggest >two changes > >* ban_lurker_age: >but we also increase the likelihood of bans requiring request-time evaluation, >which is bad for latency. The important thing is for the lurker to not get in the way of request-time evaluation, because it will by definition waste a lot of time on objects clients don't care for, thus needlessly delaying the ones the clients ask for. The right thing to do with respect to both ban_lurker_{age|sleep} is probably to look at the rate of req-time evaluations and modulate the lurker activity to minimize interference. Likewise the mutex contention delay should increase in response to the rate of req-time evaluations. For instance we could make a param which says how many ban-checks per second we allow. If the req-time processing leaves any unused, the lurker is free to "fill up the quota". Unfortunately such "quota" parameters need to scale and I have no idea on what (ncpu ?) But again, data-driven analysis is most welcome. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence.
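The shared "ban-checks per second" quota phk describes could look roughly like a token bucket that is drained by both request-time evaluation and the lurker. All names, numbers, and the scaling choice here are hypothetical, not existing varnishd parameters:

```c
#include <assert.h>

/*
 * Sketch of a shared check budget: a token bucket refilled at
 * checks_per_sec, drained first by request-time ban evaluation,
 * with the lurker only spending what is left over.
 */
struct check_budget {
	double tokens;
	double checks_per_sec;	/* hypothetical tunable */
	double last_refill;	/* monotonic timestamp, seconds */
};

static void
budget_refill(struct check_budget *cb, double now)
{
	cb->tokens += (now - cb->last_refill) * cb->checks_per_sec;
	if (cb->tokens > cb->checks_per_sec)	/* cap burst at 1s worth */
		cb->tokens = cb->checks_per_sec;
	cb->last_refill = now;
}

/* Request-time checks always proceed, but still draw down the bucket. */
static void
budget_req_check(struct check_budget *cb, double now)
{
	budget_refill(cb, now);
	cb->tokens -= 1.0;
}

/* The lurker may only run while leftover tokens remain. */
static int
budget_lurker_may_check(struct check_budget *cb, double now)
{
	budget_refill(cb, now);
	if (cb->tokens < 1.0)
		return (0);
	cb->tokens -= 1.0;
	return (1);
}
```

Because request-time checks stay on the fast path but still drain the bucket, the lurker automatically backs off whenever request-time evaluation is busy, which is the interference-avoidance property discussed above.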
From slink at schokola.de Mon Apr 18 18:40:14 2016 From: slink at schokola.de (Nils Goroll) Date: Mon, 18 Apr 2016 20:40:14 +0200 Subject: ban lurker questions In-Reply-To: <41572.1460967632@critter.freebsd.dk> References: <57110C4E.8010209@schokola.de> <41572.1460967632@critter.freebsd.dk> Message-ID: <57152A0E.5040404@schokola.de> Hi, I've created https://github.com/varnishcache/varnish-cache/pull/1910 based on the insights which my questions related to. FTR, regarding phk's last response: >> 1) ban_cleantail >> >> why do we acquire the mtx for every ban we look at rather than collecting all >> the bans to be freed in a loop while holding the mtx? > > Probably just an accident of how the code has developed. Implemented. >> 2) would this assertion be correct? >> >> diff --git a/bin/varnishd/cache/cache_ban_lurker.c >> b/bin/varnishd/cache/cache_ban_lurker.c >> index 65c552e..fb69f78 100644 >> --- a/bin/varnishd/cache/cache_ban_lurker.c >> +++ b/bin/varnishd/cache/cache_ban_lurker.c >> @@ -190,6 +190,7 @@ ban_lurker_test_ban(struct worker *wrk, struct vsl_log *vsl, >> struct ban *bt, >> VSC_C_main->bans_lurker_obj_killed++; >> } else { >> if (oc->ban != bd) { >> + assert(oc->ban == bt); >> Lck_Lock(&ban_mtx); >> oc->ban->refcount--; >> VTAILQ_REMOVE(&oc->ban->objcore, oc, ban_list); > > I am not sure. Isn't there a window where a HSH_Lookup could race us ? Yes, absolutely, thank you. >> 3) ban_lurker_getfirst questions: >> >> - for the contention case, shouldn't we continue walking the bt->objcore list >> and sleep only if we hit the marker? > > We'd need to come back to the missed oc's later, the lurker cannot skip > some of the oc's. Yes, that was clear. I've implemented this. >> - do IUC correctly that getfirst moves the oc to the tail of the bt->objcore >> list, behind the marker, to ensure we don't re-visit ocs which have not got >> killed yet, after being handed off to exp? > > Not sure I understand the question...
I was more or less babbling to myself, just re-confirming that my understanding is correct: ban_lurker_test_ban() inserts oc_marker into the oc list of the ban being tested. ban_lurker_getfirst() then moves the oc it could lock behind the marker, which prevents it from being found again during the same lurker run. I was not quite sure why the oc is not taken off the ban's list in the first place, and I presume this is because actual expiry will happen asynchronously in the exp thread, which should find consistent linkage (ban->oc) in place. >> 4) ban_lurker_work >> >> why do we mark_completed and then clean out the completed bans in the next step >> rather than removing the completed bans straight away? Why do we need to do the >> spec fiddling in ban_mark_completed (including a membar (!)) if we're about to >> ditch the ban anyway? > > again, probably just an accident of how the code developed. Implemented. From slink at schokola.de Mon Apr 18 19:04:56 2016 From: slink at schokola.de (Nils Goroll) Date: Mon, 18 Apr 2016 21:04:56 +0200 Subject: ban_lurker_age, ban_lurker_sleep In-Reply-To: <19481.1460968792@critter.freebsd.dk> References: <57110C4E.8010209@schokola.de> <571112C2.5070607@schokola.de> <19481.1460968792@critter.freebsd.dk> Message-ID: <57152FD8.8090705@schokola.de> On 18/04/16 10:39, Poul-Henning Kamp wrote: > -------- > In message <571112C2.5070607 at schokola.de>, Nils Goroll writes: > >> I've just pushed two minor improvements on the documentation, but I'd suggest >> two changes >> >> * ban_lurker_age: > >> but we also increase the likelihood of bans requiring request-time evaluation, >> which is bad for latency. > > The important thing is for the lurker to not get in the way of > request-time evaluation, because it will by definition waste a lot > of time on objects clients don't care for, thus needlessly delaying > the ones the clients ask for. This probably is the scenario which is most relevant for cases of good use of the ban lurker.
The case I'm working on is 40k expensive bans on 400k objects when the ban lurker cannot keep up for long periods of time (yes, I know this is bad use of bans, but still an interesting case). With such numbers of bans, avoiding ban tests at lookup time should be relevant, so kicking in the lurker early should help. > The right thing to do with respect to both ban_lurker_{age|sleep} > is probably to look at the rate of req-time evaluations and modulate > the lurker activity to minimize interference. > > Likewise the mutex contest delay should increase in response to > rate of req-time evaluations. Yeah, maybe, I'm not sure. The drawback is that such "magic" auto-tuning will make the whole system less predictable and can lead to oscillation effects. For the time being, I'd prefer to just lower the ban_lurker_age default. > For instance we could make a param which says how many ban-checks > per second we allow. If the req-time processing leaves any > unused, the lurker is free to "fill up the quota". Unfortunately > such "quota" parameters need to scale and I have no idea on what (ncpu ?) As long as the ban lurker is single threaded, it's pretty hard rate-limited already on any real-life system of relevance. Nils From phk at phk.freebsd.dk Tue Apr 19 19:33:17 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Tue, 19 Apr 2016 19:33:17 +0000 Subject: ban_lurker_age, ban_lurker_sleep In-Reply-To: <57152FD8.8090705@schokola.de> References: <57110C4E.8010209@schokola.de> <571112C2.5070607@schokola.de> <19481.1460968792@critter.freebsd.dk> <57152FD8.8090705@schokola.de> Message-ID: <75941.1461094397@critter.freebsd.dk> -------- In message <57152FD8.8090705 at schokola.de>, Nils Goroll writes: >This probably is the scenario which is most relevant for cases of good use of >the ban lurker.
The case I'm working on is 40k expensive bans on 400k objects >when the ban lurker cannot keep up for long periods of time (yes, I know this is >bad use of bans, but still an interesting case). > >With such numbers of bans, avoiding ban tests at lookup time should be >relevant, so kicking in the lurker early should help. Well, depends what you want helped I guess :-) Either way, that's why it is a parameter, I'm sure there's no one size that fits everybody. >As long as the ban lurker is single threaded, it's pretty hard rate-limited >already on any real-life system of relevance. Not really, there are many systems where the full attention of the ban lurker is undesirable. For instance a system with a gazillion objects where a single ban is used to take out a handful of objects because of some one-off mistake. In that scenario it is much better to let the req-time validations do the job, than to have the lurker go full tilt and page in the entire cache. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From phk at phk.freebsd.dk Tue Apr 19 22:28:10 2016 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Tue, 19 Apr 2016 22:28:10 +0000 Subject: Expanding VCL object support In-Reply-To: References: Message-ID: <76652.1461104890@critter.freebsd.dk> -------- I think we need to turn this proposal into a "VIP": https://github.com/varnishcache/varnish-cache/wiki/Varnish-Improvement-Proposals Can you do that Reza ? We already have another VIP (#1) with the same issues of scope/lifetime but seen only on the C-level of things, here the same issue rears its ugly head at VCL-level. This may be an issue we should focus on next time we have a VDD...
-- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From reza at varnish-software.com Wed Apr 20 14:25:37 2016 From: reza at varnish-software.com (Reza Naghibi) Date: Wed, 20 Apr 2016 10:25:37 -0400 Subject: Expanding VCL object support In-Reply-To: <76652.1461104890@critter.freebsd.dk> References: <76652.1461104890@critter.freebsd.dk> Message-ID: Sounds good to me! -- Reza Naghibi Varnish Software On Tue, Apr 19, 2016 at 6:28 PM, Poul-Henning Kamp wrote: > -------- > > I think we need to turn this proposal into a "VIP": > > > https://github.com/varnishcache/varnish-cache/wiki/Varnish-Improvement-Proposals > > Can you do that Reza ? > > > We already have another VIP (#1) with the same issues of scope/lifetime > but seen only on the C-level of things, here the same issue rears its > ugly head at VCL-level. > > This may be an issue we should focus on next time we have a VDD... > > > -- > Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 > phk at FreeBSD.ORG | TCP/IP since RFC 956 > FreeBSD committer | BSD since 4.3-tahoe > Never attribute to malice what can adequately be explained by incompetence. > -------------- next part -------------- An HTML attachment was scrubbed... URL: From lkarsten at varnish-software.com Wed Apr 20 16:56:35 2016 From: lkarsten at varnish-software.com (Lasse Karstensen) Date: Wed, 20 Apr 2016 18:56:35 +0200 Subject: HTTP/2 workshop April 2016 Message-ID: <20160420165634.GA31975@immer.varnish-software.com> Hi all. Martin, Poul-Henning, Guillaume and I have spent the day looking at HTTP/2 and how we'll organize the work in the coming months. We had a productive meeting and found a few modules that can be developed somewhat in parallel. Guillaume will continue his work on h2 in varnishtest. Martin will work out a plan for how logging and VSL will look with H2.
A performant hpack implementation needs to be written. The pipe proposal in the wiki was discussed and currently the wind is blowing in the direction of no return(pipe) in h2. We most certainly will need someone with access to production traffic to run the H2 code on when it has been written. Simple code shuffling to make room for H2 will happen in current git master. Remaining code will be kept in other repositories until there is something working to merge to master. We'll revisit this again at the next VDD, presumably in June. -- Lasse Karstensen Varnish Software AS From ruben at varnish-software.com Wed Apr 27 01:14:59 2016 From: ruben at varnish-software.com (=?UTF-8?Q?Rub=C3=A9n_Romero?=) Date: Wed, 27 Apr 2016 03:14:59 +0200 Subject: #VarnishCon1 Is On! Amsterdam, NL - June 17th, 2016 (VDD+Training the 16th) Message-ID: Hello everyone, It is my pleasure to announce our first VarnishCon, happening on June 17th in Amsterdam. This is our yearly European user community conference and we hope you can join us. Varnish Cache users, enthusiasts and core developers will once again reunite in one place to shape the future of our software. Everyone is invited, so RSVP now (it's free) > http://varnishcon2016.eventbrite.com/ So far the agenda is shaping up nicely: * Emanuele Rocca, Ops Eng @ Wikimedia Foundation will join us and share the story of how Wikipedia, the 7th biggest site on the web, stays up thanks to Varnish. * Poul-Henning Kamp, Varnish Chief Architect, will be holding a keynote and giving us an update on the state of affairs of the Varnish project. If you have a great Varnish story you want to share with the crowd, just reply to this email or join the discussion on github[1]. Otherwise you can expect the Varnish core developer team to share insights about future plans for Varnish. Are you curious about progress with HTTP/2 support or our plan for time-based releases? You will hear more about that and will have plenty of time for questions.
The day before the conference, June 16th, we plan to have training sessions (limited seats) as well as a VDD for the core developer team. If your day job is working as a web developer, sysadmin or devops engineer, you will find at least one of the two Varnish training tracks rather useful (1 day training is only €80,-). So tell your boss that this is your chance to go to VarnishCon, get some training and learn more about how to make their site shine even more! At this event you will be able to meet, connect and learn from varnishers in the region. When you go back home you will have quite a few tips & tricks to implement and know more about what pitfalls to avoid. Get a ticket now > http://varnishcon2016.eventbrite.com/ Hope to see you all there! Best regards, -- *Rubén Romero* Director, Community Engagement Varnish Software Group Cell: +47 95964088 / Office: +47 21989260 Skype, Twitter & IRC: ruben_varnish We Make Websites Fly! [1] https://github.com/varnishcache/homepage/issues/12 -------------- next part -------------- An HTML attachment was scrubbed... URL: