From phk at phk.freebsd.dk Thu Feb 9 08:08:35 2006
From: phk at phk.freebsd.dk (Poul-Henning Kamp)
Date: Thu, 09 Feb 2006 08:08:35 +0000
Subject: My random thoughts
Message-ID: <3697.1139472515@critter.freebsd.dk>

Here are my random thoughts on Varnish until now. Some of it mirrors
what we talked about in the meeting, some of it is more detailed or
reaches further into speculation.

Poul-Henning

Notes on Varnish
----------------

Philosophy
----------

It is not enough to deliver a technically superior piece of software
if it is not possible for people to deploy it usefully, in a sensible
way and in a timely fashion.

Deployment scenarios
--------------------

There are two fundamental usage scenarios for Varnish: when the first
machine is brought up to offload a struggling backend, and when a
subsequent machine is brought online to help handle the load.

The first (layer of) Varnish
----------------------------

Somebody's webserver is struggling and they decide to try Varnish.
Often this will be a skunkworks operation with some random PC
purloined from wherever it wasn't being used and the Varnish "HOWTO"
in one hand.

If they do it in an orderly fashion before things reach panic
proportions, a sensible model is to set up the Varnish box, test it
out from your own browser and see that it answers correctly. Test it
some more, then add the IP# to the DNS records so that it takes 50%
of the load off the backend.

If it happens as firefighting at 3AM, the backend will be moved to
another IP, the Varnish box given the main IP, and things had better
work really well, really fast.

In both cases, it would be ideal if all that is necessary to tell
Varnish are two pieces of information:

    Storage location
	Alternatively we can offer an "auto" setting that makes
	Varnish discover what is available and use what it finds.

    DNS or IP# of backend.
	IP# is useful when the DNS settings are not quite certain
	or when split DNS horizon setups are used.
Ideally this can be done on the commandline so that there is no
configuration file to edit to get going, just

    varnish -d /home/varnish -s backend.example.dom

and you're off and running.

A text, curses or HTML based facility to give some instant feedback
and stats is necessary.

If circumstances are not conducive to a structured approach, it
should be possible to repeat this process and set up N independent
Varnish boxes and get some sort of relief without having to read any
further documentation.

The subsequent (layers of) Varnish
----------------------------------

This is what happens once everybody has caught their breath, and
where we start to talk about Varnish clusters. We can assume that at
this point the already installed Varnish machines have been
configured more precisely and that people have studied Varnish
configuration in some level of detail.

When Varnish machines are put in a cluster, the administrator should
be able to consider the cluster as a unit and not have to think about
and interact with the individual nodes.

Some sort of central management node or facility must exist, and it
would be preferable if this was not a physical but a logical entity
so that it can follow the admin to the beach. Ideally it would give
basic functionality in any browser, even mobile phones.

The focus here is scalability; we want to avoid per-machine
configuration if at all possible. Ideally, preconfigured hardware can
be plugged into power and net, find an address with DHCP, contact a
preconfigured management node, get a configuration and start working.

But we also need to think about how we prevent a site of Varnish
machines from acting like a stampeding horde when power or
connectivity is brought back after a disruption. Some sort of slow
start ("warm-up" ?) must be implemented to prevent them from hitting
the backend with full force.
An important aspect of cluster operations is giving a statistically
meaningful judgement of the cluster size, in particular answering the
question "would adding another machine help ?" precisely.

We should have a facility that allows the administrator to type in a
REGEXP/URL and have all the nodes answer with a checksum, age and
expiry timer for any documents they have which match. The results
should be grouped by URL and checksum.

Technical concepts
------------------

We want the central Varnish process to be just that, one process, and
we want to keep it small and efficient at all cost.

Code that will not be used for the central functionality should not
be part of the central process. For instance, code to parse, validate
and interpret the (possibly) complex configuration file should be a
separate program.

Depending on the situation, the Varnish process can either invoke
this program via a pipe or receive the ready-to-use data structures
via a network connection.

Exported data from the Varnish process should be made as cheap as
possible, likely shared memory. That will allow us to deploy separate
processes for log-grabbing, statistics monitoring and similar
"off-duty" tasks and let the central process get on with the
important job.

Backend interaction
-------------------

We need a way to tune the backend interaction further than what the
HTTP protocol offers out of the box.

We can assume that all documents we get from the backend have an
expiry timer; if not, we will set a default timer (configurable of
course).

But we need further policy than that. Amongst the questions we have
to ask are:

    For how long after expiry can we serve a cached copy of this
    document while we have reason to believe the backend can supply
    us with an update ?

    For how long after expiry can we serve a cached copy of this
    document if the backend does not reply or is unreachable ?
    If we cannot serve this document out of cache and the backend
    cannot inform us, what do we serve instead (404 ? A default
    document of some sort ?)

    Should we just not serve this page at all if we are in a
    bandwidth crush (DoS/stampede) situation ?

It may also make sense to have an "emergency detector" which triggers
when the backend is overloaded and offers a scaling factor for all
timeouts while in such an emergency state. Something like "If the
average response time of the backend rises above 10 seconds, multiply
all expiry timers by two".

It probably also makes sense to have a bandwidth/request traffic
shaper for backend traffic to prevent any one Varnish machine from
pummeling the backend in case of attacks or misconfigured expiry
headers.

Startup/consistency
-------------------

We need to decide what to do about the cache when the Varnish process
starts. There may be a difference between it starting for the first
time after the machine booted and it being subsequently (re)started.

By far the easiest thing to do is to disregard the cache; that saves
a lot of code for locating and validating the contents, but it
carries a penalty in backend or cluster fetches whenever a node comes
up. Let's call this the "transient cache model".

The alternative is to allow persistently cached contents to be used
according to configured criteria:

    Can expired contents be served if we can't contact the
    backend ? (dangerous...)

    Can unexpired contents be served if we can't contact the
    backend ? If so, how much past the expiry ?

It is a very good question how big a fraction of the persistent cache
would be usable after typical downtimes:

    After a Varnish process restart: nearly all.

    After a power failure ? Probably at least half, but probably not
    the half that contains the most busy pages.

And we need to take into consideration whether validating the format
and contents of the cache might take more resources and time than
getting the content from the backend.
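The "emergency detector" rule described above is simple enough to sketch in a few lines. This is only an illustration of the idea (the class name, the EWMA smoothing and the exact thresholds are invented for the example, not taken from any Varnish code):

```python
# Sketch of the "emergency detector" idea: if the average backend
# response time exceeds a threshold, all expiry timers are scaled up.
# Names, smoothing and thresholds are illustrative only.

class EmergencyDetector:
    def __init__(self, threshold=10.0, scale=2.0, alpha=0.1):
        self.threshold = threshold  # seconds of average backend response time
        self.scale = scale          # multiply expiry timers by this in an emergency
        self.alpha = alpha          # EWMA smoothing factor
        self.avg = 0.0

    def record(self, response_time):
        # Exponentially weighted moving average of backend response times.
        self.avg = self.alpha * response_time + (1 - self.alpha) * self.avg

    def effective_ttl(self, ttl):
        # Stretch expiry timers while the backend is overloaded.
        return ttl * self.scale if self.avg > self.threshold else ttl

det = EmergencyDetector()
for t in [0.2, 0.3, 0.25]:          # backend healthy
    det.record(t)
assert det.effective_ttl(60) == 60  # TTL unchanged
for t in [30.0] * 50:               # backend slows down badly
    det.record(t)
assert det.effective_ttl(60) == 120 # emergency: timers doubled
```

The EWMA keeps the detector from flapping on a single slow request, which matters because flipping the scale factor on and off would itself cause load swings on the backend.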
Off the top of my head, I would prefer the transient model any day
because of the simplicity and lack of potential consistency problems,
but if the load on the backend is intolerable this may not be
practically feasible. The best way to decide is to carefully analyze
a number of cold starts and cache content replacement traces.

The choice we make does affect the storage management part of
Varnish, but I see that as being modular in any instance, so it may
merely be that some storage modules come up clean on any start while
others will come up with existing objects cached.

Clustering
----------

I'm somewhat torn on clustering for traffic purposes. For admin and
management: yes, certainly; but starting to pass objects from one
machine in a cluster to another is likely to just be a waste of time
and code.

Today one can trivially fit 1TB into a 1U machine, so the
partitioning argument for cache clusters doesn't sound particularly
urgent to me.

If all machines in the cluster have sufficient cache capacity, the
other remaining argument is backend offloading, and that would likely
be better mitigated by implementing a 1:10 style two-layer cluster
with the second-level node possibly having twice the storage of the
front-row nodes.

The coordination necessary for keeping track of, or discovering in
real time, who has a given object can easily turn into a traffic and
CPU load nightmare. And from a performance point of view it only
reduces quality: first we send out a discovery multicast, then we
wait some amount of time to see if a response arrives, and only then
should we start to ask the backend for the object. With a two-level
cluster we can ask the layer-two node right away, and if it doesn't
have the object it can ask the backend right away; no timeout is
involved in that.

Finally, consider the impact on a cluster of a "must get" object like
an IMG tag with a misspelled URL. Every hit on the front page results
in one GET of the wrong URL.
One machine in the cluster asks everybody else in the cluster "do you
have this URL" every time somebody gets the front page. If we
implement a negative feedback protocol ("No I don't"), then each hit
on the wrong URL will result in N+1 packets (assuming multicast). If
we use a silent negative protocol the result is less severe for the
machine that got the request, but still everybody wakes up to find
out that no, we didn't have that URL. Negative caching can mitigate
this to some extent.

Privacy
-------

Configuration data and instructions passed back and forth should be
encrypted and signed if so configured. Using PGP keys is a very
tempting and simple solution which would pave the way for
administrators typing a short ascii-encoded pgp-signed message into
an SMS from their Bahamas beach vacation...

Implementation ideas
--------------------

The simplest storage method mmap(2)'s a disk or file and puts objects
into the virtual memory on page-aligned boundaries, using a small
struct for metadata. Data is not persistent across reboots. Object
free is incredibly cheap. Object allocation should reuse recently
freed space if at all possible. "First free hole" is probably a good
allocation strategy. Sendfile can be used if file-backed. If nothing
else, disks can be used by making a 1-file filesystem on them.

More complex storage methods are the object-per-file and
object-in-database models. They are relatively trivial and well
understood. May offer persistence.

Read-only storage methods may make sense for getting hold of static
emergency contents from CD-ROM etc.

Treat each disk arm as a separate storage unit and keep track of
service time (if possible) to decide storage scheduling.

Avoid regular expressions at runtime. If the config file contains
regexps, compile them into executable code and dlopen() it into the
Varnish process. Use versioning and refcounts to do memory management
on such segments.
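The "first free hole" strategy mentioned above can be sketched as a free-list over a flat arena. This is a minimal illustration under assumed structures (offsets in pages, a sorted hole list); the real thing would mmap(2) a file and hand out page-aligned regions:

```python
# Minimal "first free hole" allocator over a flat arena of pages.
# All names here are invented for illustration, not taken from Varnish.

class Arena:
    def __init__(self, pages):
        self.holes = [(0, pages)]  # sorted list of (offset, length) free runs

    def alloc(self, length):
        # First free hole: take the first run big enough.
        for i, (off, ln) in enumerate(self.holes):
            if ln >= length:
                if ln == length:
                    del self.holes[i]
                else:
                    self.holes[i] = (off + length, ln - length)
                return off
        return None  # arena full

    def free(self, off, length):
        # Reinsert the run and coalesce with adjacent holes.
        self.holes.append((off, length))
        self.holes.sort()
        merged = []
        for off, ln in self.holes:
            if merged and merged[-1][0] + merged[-1][1] == off:
                merged[-1] = (merged[-1][0], merged[-1][1] + ln)
            else:
                merged.append((off, ln))
        self.holes = merged

a = Arena(10)
x = a.alloc(4)          # pages 0-3
y = a.alloc(4)          # pages 4-7
a.free(x, 4)            # a hole opens at the front again
assert a.alloc(3) == 0  # "first free hole" reuses the freed space
```

Object free really is just putting a run back on the list, which is why it is "incredibly cheap"; the cost is concentrated in allocation and coalescing.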
Avoid committing transmit buffer space until we have a bandwidth
estimate for the client. One possible way: send the HTTP header and
time the ACKs coming back, then calculate the transmit buffer size
and send the object. This makes DoS attacks more harmless and
mitigates traffic stampedes.

Kill all TCP connections after N seconds; nobody waits an hour for a
web page to load.

Abuse mitigation interface to firewall/traffic shaping: allow the
central node to put an IP/net into, or take it out of, traffic
shaping firewall rules. A monitor/interface process (not the main
Varnish process) calls a script to configure the firewalling.

"Warm-up" instructions can take a number of forms and we don't know
what is the most efficient or most usable. Here are some ideas:

    Start at these URLs, then...
	... follow all links down to N levels.
	... follow all links that match REGEXP no deeper than N levels down.
	... follow N random links no deeper than M levels down.
	... load N objects by following random links no deeper than M levels down.
    But...
	... never follow any links that match REGEXP
	... never pick up objects larger than N bytes
	... never pick up objects older than T seconds

It makes a lot of sense to not actually implement this in the main
Varnish process, but rather to supply a template perl or python
script that primes the cache by requesting the objects through
Varnish. (That would require us to listen separately on 127.0.0.1 so
the script can get in touch with Varnish during warm-up.)

One interesting but quite likely overengineered option in the cluster
case is to have the central monitor track a fraction of the requests
through the logs of the running machines in the cluster, spot the hot
objects and tell the warming-up Varnish what objects to get and from
where.

In the cluster configuration, it is probably best to run the cluster
interaction in a separate process rather than in the main Varnish
process.
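The warm-up ideas listed above amount to a depth-limited crawl with an exclude pattern. A template priming script might look like the following sketch; the fetch function is pluggable (a real script would fetch each URL through the Varnish listener, e.g. on 127.0.0.1, so the cache fills up), and the tiny in-memory "site" below stands in for a backend:

```python
# Sketch of a cache-priming "warm-up" script: breadth-first link
# following with a depth limit and an exclude pattern, as described
# in the ideas above. All names are invented for illustration.
import re
from collections import deque

LINK = re.compile(r'href="([^"]+)"')

def warm_up(start_urls, fetch, max_depth=2, exclude=None):
    seen = set()
    queue = deque((u, 0) for u in start_urls)
    while queue:
        url, depth = queue.popleft()
        if url in seen or (exclude and exclude.search(url)):
            continue
        seen.add(url)
        body = fetch(url)  # requesting through Varnish is what primes the cache
        if depth < max_depth:
            for link in LINK.findall(body):
                queue.append((link, depth + 1))
    return seen

# Tiny in-memory "site" standing in for the backend:
site = {
    "/":         '<a href="/a">a</a> <a href="/skip.cgi">s</a>',
    "/a":        '<a href="/b">b</a>',
    "/b":        '<a href="/c">c</a>',
    "/skip.cgi": '',
}
got = warm_up(["/"], lambda u: site.get(u, ""),
              max_depth=2, exclude=re.compile(r'\.cgi$'))
assert got == {"/", "/a", "/b"}  # /c is deeper than 2 levels; /skip.cgi excluded
```

The "never pick up objects larger than N bytes / older than T seconds" rules would slot in next to the exclude check, using the response headers.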
From Varnish to the cluster, info would go through the shared memory,
but we don't want to implement locking in the shmem, so some sort of
back-channel (UNIX domain or UDP socket ?) is necessary.

If we have such a "supervisor" process, it could also be tasked with
restarting the varnish process if vital signs fail: a timestamp in
the shmem or kill -0 $pid.

It may even make sense to run the "supervisor" process in stand-alone
mode as well; there it can offer an HTML-based interface to the
Varnish process (via shmem).

For cluster use the user would probably just pass an extra argument
when starting up Varnish:

    varnish -c $cluster_args $other_args

vs

    varnish $other_args

and a "varnish" shell script will Do The Right Thing.

Shared memory
-------------

The shared memory layout needs to be thought about somewhat. On one
hand we want it to be stable enough to allow people to write programs
or scripts that inspect it; on the other hand doing it entirely in
ascii is both slow and prone to race conditions.

The various data types in the shared memory can either be put into
one single segment (= 1 file) or into individual segments
(= multiple files). I don't think the number of small data types is
big enough to make the latter impractical.

Storing the "big overview" data in shmem in ASCII or HTML would allow
one to point cat(1) or a browser directly at the mmaped file with no
interpretation necessary, a big plus in my book. Similarly, if we
don't update them too often, statistics could be stored in shared
memory in a perl/awk-friendly ascii format.

But the logfile will have to be (one or more) FIFO logs, probably at
least three in fact: good requests, bad requests, and exception
messages. If we decide to make log entries fixed length, we could
make them ascii so that a simple "sort -n /tmp/shmem.log" would put
them in order after a leading numeric timestamp, but it is probably
better to provide a utility to cat/tail -f the log and keep the log
in a bytestring FIFO format.
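The fixed-length ascii log-entry idea above is easy to demonstrate: if every record starts with a numeric timestamp, plain "sort -n" recovers chronological order with no special tooling. A sketch (the field widths are invented for illustration):

```python
# Fixed-length ascii log entries with a leading numeric timestamp, so
# that "sort -n" (emulated here with Python's sort) recovers time order.
# The record length and field widths are illustrative, not a real format.
REC_LEN = 48  # fixed record length, padded with spaces

def log_entry(ts, tag, msg):
    return f"{ts:010d} {tag:4s} {msg}".ljust(REC_LEN)[:REC_LEN]

entries = [
    log_entry(1139472600, "req", "GET /index.html 200"),
    log_entry(1139472515, "req", "GET / 200"),
    log_entry(1139472580, "err", "backend timeout"),
]
# All records are the same length, and sorting by the leading number
# (what "sort -n" would do) puts them in time order.
assert all(len(e) == REC_LEN for e in entries)
ordered = sorted(entries, key=lambda e: int(e.split()[0]))
assert [e.split()[0] for e in ordered] == [
    "1139472515", "1139472580", "1139472600"]
```

The trade-off mentioned in the text is visible here too: fixed-length ascii truncates long messages (`[:REC_LEN]`), which is one reason a length-prefixed bytestring FIFO plus a cat/tail utility may be the better format.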
Overruns should be marked in the output.

*END*

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG      | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From des at linpro.no Thu Feb 9 13:51:11 2006
From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=)
Date: Thu, 09 Feb 2006 14:51:11 +0100
Subject: My random thoughts
References: 
Message-ID: 

Poul-Henning Kamp writes:
> Here are my random thoughts on Varnish until now.

Thank you. I will try to take the time to read them and comment
tomorrow; I am currently busy preparing for a trade show early next
week.

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From des at linpro.no Fri Feb 10 12:59:21 2006
From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=)
Date: Fri, 10 Feb 2006 13:59:21 +0100
Subject: r2 - trunk
References: 
Message-ID: 

des at projects.linpro.no writes:
> Added:
>   trunk/LICENSE
> Log:
> Two-clause BSD license. Assign copyright to Linpro for now.

As discussed with Anders on the phone, the assignment of copyright to
Linpro is temporary until we figure out the legal situation. I just
talked to our CEO (just back from two weeks of jury duty); he is open
to the idea of setting up a foundation or some similar non-profit
legal entity, and will consult with our lawyer.

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From des at linpro.no Fri Feb 10 18:09:27 2006
From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=)
Date: Fri, 10 Feb 2006 19:09:27 +0100
Subject: My random thoughts
References: 
Message-ID: 

Poul-Henning Kamp writes:
> It is not enough to deliver a technically superior piece of software,
> if it is not possible for people to deploy it usefully in a sensible
> way and timely fashion.

I tend to favor usability over performance. I believe you tend to
favor performance over usability.
Hopefully, our opposing tendencies will combine and the result will
be a perfect balance ;)

> In both cases, it would be ideal if all that is necessary to tell
> Varnish are two pieces of information:
>
>     Storage location
>	  Alternatively we can offer an "auto" setting that makes
>	  Varnish discover what is available and use what it finds.

I want Varnish to support multiple storage backends:

 - quick and dirty squid-like hashed directories, to begin with
 - fancy block storage straight to disk (or to a large preallocated
   file) like you suggested
 - memcached

> Ideally this can be done on the commandline so that there is no
> configuration file to edit to get going, just
>
>     varnish -d /home/varnish -s backend.example.dom

This would use hashed directories if /home/varnish is a directory,
and block storage if it's a file or device node.

> We need to decide what to do about the cache when the Varnish
> process starts. There may be a difference between it starting for
> the first time after the machine booted and it being subsequently
> (re)started.

This might vary depending on which storage backend is used. With
memcached, for instance, there is a possibility that varnish
restarted but memcached is still running and still has a warm cache;
and if memcached also restarted, it will transparently obtain any
cached object from its peers. The disadvantage with memcached is that
we can't sendfile() from it.

> By far the easiest thing to do is to disregard the cache; that saves
> a lot of code for locating and validating the contents, but it
> carries a penalty in backend or cluster fetches whenever a node
> comes up. Let's call this the "transient cache model".

Another issue is that a persistent cache must store both data and
metadata on disk, rather than just store data on disk and metadata in
memory. This complicates not only the logic but also the storage
format.

> Can expired contents be served if we can't contact the
> backend ? (dangerous...)
Dangerous, but highly desirable in certain circumstances. I need to
locate the architecture notes I wrote last fall and place them
online; I spent quite some time thinking about and describing how
this could / should be done.

> It is a very good question how big a fraction of the persistent
> cache would be usable after typical downtimes:
>
>     After a Varnish process restart: nearly all.
>
>     After a power failure ? Probably at least half, but probably
>     not the half that contains the most busy pages.

When using direct-to-disk storage, we can (fairly) easily design the
storage format in such a way that updates are atomic, and make
liberal use of fsync() or similar to ensure (to the extent possible)
that the cache is in a consistent state after a power failure.

> Off the top of my head, I would prefer the transient model any day
> because of the simplicity and lack of potential consistency
> problems, but if the load on the backend is intolerable this may
> not be practically feasible.

How about this: we start with the transient model, and add
persistence later.

> If all machines in the cluster have sufficient cache capacity, the
> other remaining argument is backend offloading, and that would
> likely be better mitigated by implementing a 1:10 style two-layer
> cluster with the second-level node possibly having twice the
> storage of the front-row nodes.

Multiple cache layers may give rise to undesirable and possibly
unpredictable interaction (compare this to tunneling TCP/IP over TCP,
with both TCP layers battling each other's congestion control).

> Finally, consider the impact on a cluster of a "must get" object
> like an IMG tag with a misspelled URL. Every hit on the front page
> results in one GET of the wrong URL. One machine in the cluster
> asks everybody else in the cluster "do you have this URL" every
> time somebody gets the front page.
Not if we implement negative caching, which we have to anyway -
otherwise all those requests go to the backend, which gets bogged
down sending out 404s.

> If we implement a negative feedback protocol ("No I don't"), then
> each hit on the wrong URL will result in N+1 packets (assuming
> multicast).

Or we can just ignore queries for documents which we don't have; the
requesting node will simply request the document from the backend if
no reply arrives within a short timeout (~1s).

> Configuration data and instructions passed back and forth should
> be encrypted and signed if so configured. Using PGP keys is
> a very tempting and simple solution which would pave the way for
> administrators typing a short ascii-encoded pgp-signed message
> into an SMS from their Bahamas beach vacation...

Unfortunately, PGP is very slow, so it should only be used to
communicate with some kind of configuration server, not with the
cache itself.

> The simplest storage method mmap(2)'s a disk or file and puts
> objects into the virtual memory on page-aligned boundaries,
> using a small struct for metadata. Data is not persistent
> across reboots. Object free is incredibly cheap. Object
> allocation should reuse recently freed space if at all possible.
> "First free hole" is probably a good allocation strategy.
> Sendfile can be used if file-backed. If nothing else, disks
> can be used by making a 1-file filesystem on them.

hmm, I believe you can sendfile() /dev/zero if you use that trick to
get a private mmap()ed arena.

> Avoid regular expressions at runtime. If the config file contains
> regexps, compile them into executable code and dlopen() it
> into the Varnish process. Use versioning and refcounts to
> do memory management on such segments.

unlike regexps, globs can be evaluated very efficiently.
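The glob-vs-regexp point can be made concrete: a shell-style glob can be matched directly with a simple two-pointer scan and no compilation step at all. The following sketch implements the classic greedy-backtracking matcher (the stdlib fnmatch is used only to cross-check it; the function itself is illustrative):

```python
# A shell-style glob can be matched with a direct two-pointer scan,
# no compilation step needed. (Python's fnmatch does the same job by
# translating the glob to a regexp first.)
from fnmatch import fnmatch

def glob_match(pat, s):
    # Supports '*' (any run) and '?' (any one char); enough for URL filters.
    pi = si = 0
    star = star_si = -1
    while si < len(s):
        if pi < len(pat) and (pat[pi] == '?' or pat[pi] == s[si]):
            pi += 1; si += 1
        elif pi < len(pat) and pat[pi] == '*':
            star, star_si = pi, si   # remember the star; try matching empty
            pi += 1
        elif star != -1:             # backtrack: let the star eat one more char
            pi = star + 1
            star_si += 1
            si = star_si
        else:
            return False
    while pi < len(pat) and pat[pi] == '*':
        pi += 1
    return pi == len(pat)

for pat, url, want in [("/img/*.gif", "/img/logo.gif", True),
                       ("/img/*.gif", "/css/site.css", False),
                       ("/cgi-bin/*", "/cgi-bin/search", True)]:
    assert glob_match(pat, url) == want
    assert fnmatch(url, pat) == want   # agrees with the stdlib matcher
```

Whether compiling such patterns to C buys anything beyond this, as PHK suggests later in the thread, is exactly the open question the two are debating.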
> It makes a lot of sense to not actually implement this in the main
> Varnish process, but rather to supply a template perl or python
> script that primes the cache by requesting the objects through
> Varnish. (That would require us to listen separately on 127.0.0.1
> so the script can get in touch with Varnish during warm-up.)

This can easily be done with existing software like w3mir.

> One interesting but quite likely overengineered option in the
> cluster case is to have the central monitor track a fraction of the
> requests through the logs of the running machines in the cluster,
> spot the hot objects and tell the warming-up Varnish what objects
> to get and from where.

You can probably do this in ~50 lines of Perl using Net::HTTP.

> In the cluster configuration, it is probably best to run the
> cluster interaction in a separate process rather than in the main
> Varnish process. From Varnish to the cluster, info would go through
> the shared memory, but we don't want to implement locking in the
> shmem, so some sort of back-channel (UNIX domain or UDP socket ?)
> is necessary.

Distributed lock managers are *hard*... but we don't need locking for
simple stuff like reading logs out of shmem.

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From phk at phk.freebsd.dk Fri Feb 10 18:42:24 2006
From: phk at phk.freebsd.dk (Poul-Henning Kamp)
Date: Fri, 10 Feb 2006 18:42:24 +0000
Subject: My random thoughts
In-Reply-To: Your message of "Fri, 10 Feb 2006 19:09:27 +0100."
Message-ID: <5868.1139596944@critter.freebsd.dk>

In message , Dag-Erling Smørgrav writes:

>Poul-Henning Kamp writes:
>> In both cases, it would be ideal if all that is necessary to tell
>> Varnish are two pieces of information:
>>
>>     Storage location
>>	   Alternatively we can offer an "auto" setting that makes
>>	   Varnish discover what is available and use what it finds.
>
>I want Varnish to support multiple storage backends:
>
> - quick and dirty squid-like hashed directories, to begin with

That's actually slow and dirty. So I'd prefer to wait with this one
until we know we need it (ie: persistence).

> - fancy block storage straight to disk (or to a large preallocated
>   file) like you suggested

This is actually the simpler one to implement: make one file, mmap
it, sendfile from it.

I don't see any advantage to memcached right off the bat, but I may
become wiser later on. Memcached is intended for when your app needs
a shared memory interface, which is then simulated using the network.
Our app is network oriented and we know a lot more about our data
than memcached would, so we can do the networking more efficiently
ourselves.

>> By far the easiest thing to do is to disregard the cache; that
>> saves a lot of code for locating and validating the contents, but
>> it carries a penalty in backend or cluster fetches whenever a node
>> comes up. Let's call this the "transient cache model".
>
>Another issue is that a persistent cache must store both data and
>metadata on disk, rather than just store data on disk and metadata
>in memory. This complicates not only the logic but also the storage
>format.

Yes, although we can get pretty far with mmap on this too.

>> It is a very good question how big a fraction of the persistent
>> cache would be usable after typical downtimes:
>>
>>     After a Varnish process restart: nearly all.
>>
>>     After a power failure ? Probably at least half, but probably
>>     not the half that contains the most busy pages.
>
>When using direct-to-disk storage, we can (fairly) easily design the
>storage format in such a way that updates are atomic, and make
>liberal use of fsync() or similar to ensure (to the extent possible)
>that the cache is in a consistent state after a power failure.

I meant "usable" as in "will be asked for", ie: usable for improving
the hitrate.
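The "make one file, mmap it, sendfile from it" recipe is short enough to sketch. Python can't show the zero-copy sendfile(2) serving path itself, so this only demonstrates the file-backed mapping with a page-aligned object slot; the path, arena size and offset are all invented for the example:

```python
# Sketch of the one-file store: preallocate a file, mmap it, and place
# objects at page-aligned offsets. sendfile(2) from the same file would
# then serve them without copying. All names/sizes are illustrative.
import mmap, os, tempfile

PAGE = mmap.PAGESIZE
path = os.path.join(tempfile.mkdtemp(), "varnish.store")
with open(path, "wb") as f:
    f.truncate(16 * PAGE)             # preallocate the arena

f = open(path, "r+b")
store = mmap.mmap(f.fileno(), 0)      # map the whole arena

body = b"<html>hello</html>"
off = 3 * PAGE                        # page-aligned slot, as the allocator would pick
store[off:off + len(body)] = body     # "storing an object" is just a memcpy
store.flush()

# Anything else mapping or reading the same file (or sendfile from it)
# sees the object at that offset:
with open(path, "rb") as g:
    g.seek(off)
    assert g.read(len(body)) == body
```

A small metadata struct per object (offset, length, expiry, checksum) kept in memory is all that is needed on top of this for the transient cache model; persistence is what forces the metadata onto disk too.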
>How about this: we start with the transient model, and add
>persistence later.

My idea exactly :-) Since I expect the storage to be pluggable, this
should be pretty straightforward.

>> If all machines in the cluster have sufficient cache capacity, the
>> other remaining argument is backend offloading, and that would
>> likely be better mitigated by implementing a 1:10 style two-layer
>> cluster with the second-level node possibly having twice the
>> storage of the front-row nodes.
>
>Multiple cache layers may give rise to undesirable and possibly
>unpredictable interaction (compare this to tunneling TCP/IP over
>TCP, with both TCP layers battling each other's congestion control)

I doubt it. The front-end Varnish fetches from the backend into its
store, and from there another thread will serve the users, so the two
TCP connections are not interacting directly.

>Or we can just ignore queries for documents which we don't have; the
>requesting node will simply request the document from the backend if
>no reply arrives within a short timeout (~1s).

I want to avoid any kind of timeouts like that. One slight bulge in
your load and everybody times out and hits the backend.

>Unfortunately, PGP is very slow, so it should only be used to
>communicate with some kind of configuration server, not with the
>cache itself.

Absolutely. My plan was to have the "management process" do that.

>unlike regexps, globs can be evaluated very efficiently.

But more efficiently still if compiled into C code.

>> It makes a lot of sense to not actually implement this in the main
>> Varnish process, but rather to supply a template perl or python
>> script that primes the cache by requesting the objects through
>> Varnish.
>This can easily be done with existing software like w3mir.
>[...]
>You can probably do this in ~50 lines of Perl using Net::HTTP.

Sounds like you just won this bite :-)

>Distributed lock managers are *hard*...

Nobody is talking about distributed lock managers.
The shared memory is strictly local to the machine and r/o by
everybody else than the main Varnish process.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG      | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From des at linpro.no Sat Feb 11 20:23:10 2006
From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=)
Date: Sat, 11 Feb 2006 21:23:10 +0100
Subject: My random thoughts
In-Reply-To: <5868.1139596944@critter.freebsd.dk> (Poul-Henning Kamp's
	message of "Fri, 10 Feb 2006 18:42:24 +0000")
References: <5868.1139596944@critter.freebsd.dk>
Message-ID: 

"Poul-Henning Kamp" writes:
> "Dag-Erling Smørgrav" writes:
> > Multiple cache layers may give rise to undesirable and possibly
> > unpredictable interaction (compare this to tunneling TCP/IP over
> > TCP, with both TCP layers battling each other's congestion
> > control)
> I doubt it. The front-end Varnish fetches from the backend into its
> store, and from there another thread will serve the users, so the
> two TCP connections are not interacting directly.

You took me a little too literally. What I meant is that we may see
undesirable interaction between the two layers, for instance in the
area of expiry handling (what will the front layer think when the
rear layer sends it expired documents?).

> > Unfortunately, PGP is very slow, so it should only be used to
> > communicate with some kind of configuration server, not with the
> > cache itself.
> Absolutely. My plan was to have the "management process" do that.

Hmm, we might as well go right ahead and call it a FEP :) (see
http://www.jargon.net/jargonfile/b/box.html if you didn't catch the
reference)

> > unlike regexps, globs can be evaluated very efficiently.
> But more efficiently still if compiled into C code.

I don't think so, but I may have overlooked something.
DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From andersb at vgnett.no Sun Feb 12 21:54:00 2006
From: andersb at vgnett.no (Anders Berg)
Date: Sun, 12 Feb 2006 22:54:00 +0100 (CET)
Subject: My random thoughts
Message-ID: <61614.193.213.34.102.1139781240.squirrel@denise.vg.no>

Good work guys. I had a great time reading the notes. Here comes the
sys.adm approach.

P.S. The sys.adm approach can easily be seen as an overengineered
solution; don't treat my approach as a must-have, more as a
nice-to-have.

>Notes on Varnish
>----------------
>
>Philosophy
>----------
>
>It is not enough to deliver a technically superior piece of software,
>if it is not possible for people to deploy it usefully in a sensible
>way and timely fashion.
>[...]
>If circumstances are not conducive to a structured approach, it
>should be possible to repeat this process and set up N independent
>Varnish boxes and get some sort of relief without having to read any
>further documentation.

I think these are reasonable scenarios and solutions.

>The subsequent (layers of) Varnish
>----------------------------------
>
>[...]
>When Varnish machines are put in a cluster, the administrator should
>be able to consider the cluster as a unit and not have to think
>about and interact with the individual nodes.

That would be great. Imho far too little software acts like this.
There could be a good reason for that, but I wouldn't know.

>Some sort of central management node or facility must exist and
>it would be preferable if this was not a physical but a logical
>entity so that it can follow the admin to the beach. Ideally it
>would give basic functionality in any browser, even mobile phones.

A web-browser interface and a CLI should cover 99% of use. An easy
protocol/API would make it possible for anybody to write their own
interface to the central management node.

>The focus here is scalability; we want to avoid per-machine
>configuration if at all possible.
Ideally, preconfigured hardware >can be plugged into power and net, find an address with DHCP, contact >preconfigured management node, get a configuration and start working. This would ease many things. If one makes an image of some sort, one does not have to change/make a new image for every config change (if that happens more often than software updates). >But we also need to think about how we avoid a site of Varnish >machines from acting like a stampeding horde when the power or >connectivity is brought back after a disruption. Some sort of >slow starting ("warm-up" ?) must be implemented to prevent them >from hitting the backend with full force. Yes. As you said in Oslo Poul, this could be a killer-app feature for some sites. >An important aspect of cluster operations is giving a statistically >meaningful judgement of the cluster size, in particular answering >the question "would adding another machine help ?" precisely. Is this possible? It would involve knowing how the backend is doing with added load. One thing is to measure how it's doing right now (response time), but to predict added load is hard. My guess is also that the only reason somebody would ask "would adding another machine help ?" was if the CPU or bandwidth was exhausted on the accelerator(s) in place, and one really needed to do something anyway. The only other reason I can think of is response time from the accelerator, and then we have the predict-load problem. >We should have a facility that allows the administrator to type >in a REGEXP/URL and have all the nodes answer with a checksum, age >and expiry timer for any documents they have which match. The >results should be grouped by URL and checksum. Not only the admin needs this. It's great when programmers/implementors need to debug how "good" the new/old application caches. In a world of rapid development, little or no time is often given to make/check the "cacheability" of the app.
A "check www.rapiddev.com/newapp/*" after a couple of clicks on the app could save developers a huge amount of time, and reduce backend load immensely. > >Technical concepts >------------------ > >We want the central Varnish process to be that, just one process, and >we want to keep it small and efficient at all cost. Yes. When you say 1 process, you mean 1 process per CPU/Core? >Code that will not be used for the central functionality should not >be part of the central process. For instance code to parse, validate >and interpret the (possibly) complex configuration file should be a >separate program. Let's list possible processes: 1. Varnish main. 2. Disk/storage process. 3. Config process/program. 4. Management process. 5. Logger/stats. >Depending on the situation, the Varnish process can either invoke >this program via a pipe or receive the ready to use data structures >via a network connection. > >Exported data from the Varnish process should be made as cheap as >possible, likely shared memory. That will allow us to deploy separate >processes for log-grabbing, statistics monitoring and similar >"off-duty" tasks and let the central process get on with the >important job. Sounds great. > >Backend interaction >------------------- > >We need a way to tune the backend interaction further than what the >HTTP protocol offers out of the box. > >We can assume that all documents we get from the backend have an >expiry timer; if not we will set a default timer (configurable of >course). > >But we need further policy than that. Amongst the questions we have >to ask are: > > How long after expiry can we serve a cached copy > of this document while we have reason to believe the backend > can supply us with an update ? > > How long after expiry can we serve a cached copy > of this document if the backend does not reply or is > unreachable ? > > If we cannot serve this document out of cache and the backend > cannot inform us, what do we serve instead (404 ?
A default > document of some sort ?) > > Should we just not serve this page at all if we are in a > bandwidth crush (DoS/stampede) situation ? You are correct. Did you mean to ask the user, or are these questions to answer in a specification? I think the best approach is to ask the user, and let him answer in the config. I can see as many answers to these questions (and more) as there are websites :) Also a site might answer differently in different scenarios. >It may also make sense to have an "emergency detector" which triggers >when the backend is overloaded and offer a scaling factor for all >timeouts for when in such an emergency state. Something like "If >the average response time of the backend rises above 10 seconds, >multiply all expiry timers by two". Good idea. Once again I opt for a config choice on that one. >It probably also makes sense to have a bandwidth/request traffic >shaper for backend traffic to prevent any one Varnish machine from >pummeling the backend in case of attacks or misconfigured >expiry headers. Good idea, but this one I am unsure about. The reason: one more thing that can make the accelerator behave in a way you don't understand. You are delivering stale documents from the accelerator. You start "debugging". "Hmm, most of the requests are served from the backend in a timely fashion..." You debug more and start examining the headers. I can see myself going through loads of different stuff, and then: "Ahh, the traffic shaper..." As I said, I like the idea, but too many backoff rules will make the sys.admin scratch his head even more. Can we come up with a way for Varnish to tell the sys.adm. "Hey, you are delivering stales here. Because ..." Or is this overengineering? > >Startup/consistency >------------------- > >We need to decide what to do about the cache when the Varnish >process starts. There may be a difference between it starting >for the first time after the machine booted and when it is subsequently >(re)started.
> >By far the easiest thing to do is to disregard the cache, that saves >a lot of code for locating and validating the contents, but this >carries a penalty in backend or cluster fetches whenever a node >comes up. Let's call this the "transient cache model" I agree with Dag here. Let's start with the "transient cache model" and add more later. We will discuss some scenarios at spec writing, and maybe come up with some models for later implementation. Better dig out those architecture notes Dag :) >The alternative is to allow persistently cached contents to be used >according to configured criteria: >[...] >The choice we make does affect the storage management part of Varnish, >but I see that as being modular in any instance, so it may merely be >that some storage modules come up clean on any start while others >will come up with existing objects cached. Ironically, at VG the stuff that can be cached long (JPGs, GIFs etc.) is cheap to generate, while the costly stuff, the documents that take CPU to make, cannot be cached long. I would not be surprised if it's like that many places. > >Clustering >---------- > >I'm somewhat torn on clustering for traffic purposes. For admin >and management: Yes, certainly, but starting to pass objects from >one machine in a cluster to another is likely to be just a waste >of time and code. > >Today one can trivially fit 1TB into a 1U machine so the partitioning >argument for cache clusters doesn't sound particularly urgent to me. > >If all machines in the cluster have sufficient cache capacity, the >other remaining argument is backend offloading, that would likely >be better mitigated by implementing a 1:10 style two-layer cluster >with the second level node possibly having twice the storage of >the front row nodes. I am also torn here. A part of me says: Hey, there is ICP v2 and such, let's use it, it's good economy.
Another part is thinking that ICP works at its best when you have many accelerators, and if Varnish can deliver what we hope, not many frontends are needed for most sites in the world :) At that level, you can for sure deliver the extra content ICP and such would save you from. I know that in saying that I am sacrificing design because of implementation, but there it is. >The coordination necessary for keeping track of, or discovering in >real-time, who has a given object can easily turn into a traffic >and cpu load nightmare. > >And from a performance point of view, it only reduces quality: >First we send out a discovery multicast, then we wait some amount >of time to see if a response arrives; only then should we start >to ask the backend for the object. With a two-level cluster >we can ask the layer-two node right away and if it doesn't have >the object it can ask the back-end right away, no timeout is >involved in that. A note. One of the reasons to be wary of two-level clusters in my opinion is that if you cache a document from the backend at the lowest level for say 2 min., and the level above comes and gets it 1 min. into those 2 min., looks up in its config and finds out this is a 2-min. cache document, the document will be 1 min. stale before a refresh. This could of course be solved with Expires tags, but it makes sys.adm's wary. Dag also noted problems with this when we have a two-layer approach and the first layer is in backoff mode. >Finally consider the impact on a cluster of a "must get" object >like an IMG tag with a misspelled URL. Every hit on the front page >results in one get of the wrong URL. One machine in the cluster >asks everybody else in the cluster "do you have this URL" every >time somebody gets the frontpage. >[...] >Negative caching can mitigate this to some extent. > > >Privacy >------- > >Configuration data and instructions passed forth and back should >be encrypted and signed if so configured.
Using PGP keys is >a very tempting and simple solution which would pave the way for >administrators typing a short ascii encoded pgp signed message >into a SMS from their Bahamas beach vacation... Bahamas? Vacation? :) > >Implementation ideas >-------------------- > >The simplest storage method mmap(2)'s a disk or file and puts >objects into the virtual memory on page aligned boundaries, >using a small struct for metadata. Data is not persistent >across reboots. Object free is incredibly cheap. Object >allocation should reuse recently freed space if at all possible. >"First free hole" is probably a good allocation strategy. >Sendfile can be used if filebacked. If nothing else disks >can be used by making a 1-file filesystem on them. > >More complex storage methods are object per file and object >in database models. They are relatively trivial and well >understood. May offer persistence. Dag says: >- quick and dirty squid-like hashed directories, to begin with > > - fancy block storage straight to disk (or to a large preallocated > file) like you suggested > > - memcached as Poul later comments, squid is slow and dirty. Let's try to avoid it. I am fine with fancy block storage, and I am tempted to suggest: Berkeley DB I have always pictured Varnish with a Berkeley DB backend. Why? I _think_ it is fast (only website info to go on here). http://www.sleepycat.com/products/bdb.html its block storage, and wildcard purge could potentially be as easy as: delete from table where URL like '%bye-bye%'; Another thing I am just gonna base on my wildest fantasies: could we use the Berkeley DB replication to make a cache up-to-date after downtime? Would be fun, wouldn't it? :) I also like memcached, and I am excited to hear Poul suggest that we build a "better" approach.
When I read that, I must admit that my first thought was that it would be really nice if this is a daemon/shm process that one can build a php (or whatever) interface against. This is out of scope, but imagine you have full access to the cache-data in php, if only in RO mode. That means you can build php apps with a superquick backend with loads of metadata. :) >Read-Only storage methods may make sense for getting hold >of static emergency contents from CD-ROM etc. Nice feature. >Treat each disk arm as a separate storage unit and keep track of >service time (if possible) to decide storage scheduling. > >Avoid regular expressions at runtime. If the config file contains >regexps, compile them into executable code and dlopen() it >into the Varnish process. Use versioning and refcounts to >do memory management on such segments. I smell a glob vs. compiled regexp showdown. Hehe. My only contrib here would be: don't do it in Java regexps :) >Avoid committing transmit buffer space until we have a bandwidth >estimate for the client. One possible way: Send HTTP header >and time ACKs getting back, then calculate transmit buffer size >and send object. This makes DoS attacks more harmless and >mitigates traffic stampedes. Yes. Are you thinking of writing a FreeBSD kernel module (accept_filter) for this? Like accf_http. >Kill all TCP connections after N seconds, nobody waits an hour >for a web-page to load. > >Abuse mitigation interface to firewall/traffic shaping: Allow >the central node to put an IP/Net into traffic shaping or take >it out of traffic shaping firewall rules. Monitor/interface >process (not main Varnish process) calls script to config >firewalling. This sounds like a really good feature. Hope it can be solved in Linux as well. Not sure they have the fancy IPFW filters etc. >"Warm-up" instructions can take a number of forms and we don't know >what is the most efficient or most usable. Here are some ideas: >[...]
> >One interesting but quite likely overengineered option in the >cluster case is if the central monitor tracks a fraction of the >requests through the logs of the running machines in the cluster, >spots the hot objects and tells the warming-up Varnish what objects >to get and from where. >>This can easily be done with existing software like w3mir. >>[...] >>You can probably do this in ~50 lines of Perl using Net::HTTP. >>>Sounds like you just won this bite :-) Nice :) But I am not sure this is as "easy" as it sounds at first. >In the cluster configuration, it is probably best to run the cluster >interaction in a separate process rather than the main Varnish >process. From Varnish to cluster, info would go through the shared >memory, but we don't want to implement locking in the shmem so >some sort of back-channel (UNIX domain or UDP socket ?) is necessary. > >If we have such a "supervisor" process, it could also be tasked >with restarting the varnish process if vital signs fail: A time >stamp in the shmem or kill -0 $pid. You have got to like programs that keep themselves alive. >It may even make sense to run the "supervisor" process in >standalone mode as well, where it can offer an HTML-based interface >to the Varnish process (via shmem). > >For cluster use the user would probably just pass an extra argument >when he starts up Varnish: > > varnish -c $cluster_args $other_args >vs > > varnish $other_args > >and a "varnish" shell script will Do The Right Thing. That's what we should aim at. >Shared memory >------------- > >The shared memory layout needs to be thought about somewhat. On one >hand we want it to be stable enough to allow people to write programs >or scripts that inspect it, on the other hand doing it entirely in >ascii is both slow and prone to race conditions. > >The various different data types in the shared memory can either be >put into one single segment (= 1 file) or into individual segments >(= multiple files).
I don't think the number of small data types will >be big enough to make the latter impractical. > >Storing the "big overview" data in shmem in ASCII or HTML would >allow one to point cat(1) or a browser directly at the mmaped file >with no interpretation necessary, a big plus in my book. > >Similarly, if we don't update them too often, statistics could be stored >in shared memory in perl/awk friendly ascii format. That would be a big plus, with the stats either in HTML or in ASCII at least. >But the logfile will have to be (one or more) FIFO logs, probably at least >three in fact: Good requests, Bad requests, and exception messages. And a debug log. The squid model is not too bad there, only poorly documented. In short it's a "binary configuration": 1=some part a, 4=some part b, ..., 128=some part i. Debug=133=a,b and i. I mentioned at the meeting some URLs that would provide some relevant reading: http://www.web-cache.com/ is old but good. It lists all relevant protocols: http://www.web-cache.com/Writings/protocols-standards.html and other written things: http://www.web-cache.com/writings.html Here is also the Hypertext Caching Protocol, an alternative and improvement to ICP, which I referred to as WCCP at the last meeting. Another RFC to take a look at might be the Web Cache Invalidation Protocol (WCIP). Here is what ESI.org has to say about WCIP: http://www.esi.org/tfaq.html#q8 And here is their approach: http://www.esi.org/invalidation_protocol_1-0.html Sorry about all the text :) P.S. I was not on the list when Poul wrote the first post, so I don't have the ID either. My post will come as a separate one. Anders Berg From andersb at vgnett.no Thu Feb 16 00:45:54 2006 From: andersb at vgnett.no (Anders Berg) Date: Thu, 16 Feb 2006 01:45:54 +0100 (CET) Subject: [Fwd: Re: My random thoughts] In-Reply-To: <2493.1139994855@critter.freebsd.dk> References: Your message of "Mon, 13 Feb 2006 10:10:23 +0100."
<52801.129.240.201.175.1139821823.squirrel@denise.vg.no> <2493.1139994855@critter.freebsd.dk> Message-ID: <65058.193.213.34.102.1140050754.squirrel@denise.vg.no> Thanks for the reply Poul. One thought that keeps coming back to me all the time is the need for a really well documented/well discussed/tested HTTP header strategy. It is crucial, and I believe we will spend much of our time on it next week and much more later. I do not think it is possible to cover all aspects in the spec alone. This is maybe stating the obvious, but I would rather state it so we all have time to ponder it. >>as Poul later comments, squid is slow and dirty. Let's try to avoid it. I >>am fine with fancy block storage, and I am tempted to suggest: Berkeley >> DB >>I have always pictured Varnish with a Berkeley DB backend. Why? I _think_ >>it is fast (only website info to go on here). > > We may want to use DB to hash urls into object identity, but I doubt we > will be putting the objects themselves into DB. Yes. Objects _could_ work fine for a website with ASCII text HTML pages and small JPEGs and GIFs, but anybody delivering "large" files and binaries would curse it. So I see the usage rather limited for objects. >>its block storage, and wildcard purge could potentially be as easy as: >>delete from table where URL like '%bye-bye%'; >>Another thing I am just gonna base on my wildest fantasies, could we use >>the Berkeley DB replication to make a cache up-to-date after downtime? >>Would be fun, wouldn't it? :) > > I fear it would be expensive. Considering that objects would be kept outside, this could work if the database held some more data, like how "hot" the object is; then parse ("select id from table order by hotness limit 200") it and fetch. But I see that it may be a lot more "effective" to do it the w3mir way Dag suggested. Hotness would be inserted from aggregated shm data? I note w3mir could maybe give us a license problem? Anyway, spec week is coming up and I am excited.
:) Anders Berg From phk at phk.freebsd.dk Thu Feb 16 10:09:17 2006 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Thu, 16 Feb 2006 10:09:17 +0000 Subject: [Fwd: Re: My random thoughts] In-Reply-To: Your message of "Thu, 16 Feb 2006 01:45:54 +0100." <65058.193.213.34.102.1140050754.squirrel@denise.vg.no> Message-ID: <4156.1140084557@critter.freebsd.dk> In message <65058.193.213.34.102.1140050754.squirrel at denise.vg.no>, "Anders Berg" writes: Let me just try to see if I can express the overall threading strategy I have formed without using a whiteboard: The [...] is which thread we're in. [acceptor] Incoming connections are handled by acceptfilters in a single thread, or, if acceptfilters are not available, with a single-threaded poll loop. [acceptor] Once a full HTTP request has been gathered, the URL is hashed and looked up to see if we have a hit or not. [acceptor] If we have a hit, and the object is in a "ready" state, a thread is pulled off the "sender" queue and given the request to complete. [sender] The object will be shipped out according to its state (it may still be arriving from the backend) and the HTTP headers. sendfile will be used if at all possible. Once done, the fd will be sent back to the acceptor if not closed {can we engage acceptfilters again ?} {We may ($config) engage in compression here and in such case we would embellish the object with the compressed version (up front) so it can be reused by other senders.} [acceptor] If we have a hit, but the object is not in a "ready" state (for instance we are trying to get the object from the backend, but haven't received any of it yet), the request is parked on the object. [acceptor] If we have no hit, the header needs to be analyzed (URL cleanup, rewriting, negative lookup etc etc). We could use a "sender" thread to do this, but I would rather not, in order to limit the amount of potentially expensive work we do here.
My initial thought therefore is to put the request into a queue to be dealt with by the "backend" threads. [backend] These threads will look for two kinds of work in order of priority: requests that need analysing and objects nearing expiration. [backend] Requests needing analysis are chewed upon according to the configured rules and one of four outcomes is possible: [backend] Invalid request. Grab a "sender" and ship out a static error-object. [backend] Rematched request (after analysis it matches an existing object): treat like the acceptor would for a hash hit. If configuration allows: add a new hash entry to put this URL on the fast track in the future. [backend] Unmatched request, cacheable (glob/regexp matching). Create object, queue request on it. Add hash entry. Initiate fetch from backend. When the HTTP header arrives, set expiry on the object accordingly. Once some data has arrived, grab a sender and pass it the object (NB: not the request). Receive full object. [backend] Unmatched request, uncacheable (glob/regexp matching). Create (transient) object. Initiate fetch from backend. Once some data has arrived, grab a sender thread and pass it the object. Receive full object. [backend] Near-expiry objects: Once an object nears expiry (defined by config) it is eligible for refresh. A backend thread will determine if the object is important enough (defined by config) compared to current backend responsiveness to be refreshed. If it is, a GET request is sent to the backend. (I'm not sure optimizing with a HEAD is worth much here; maybe a hybrid strategy: If the object has been refreshed before and a GET was necessary more often than not, then do GET, otherwise try HEAD first). [sender] When passed an object: If only one request is queued on the object, behave as if passed that request. If more than one request is queued, grab a sender for each and pass it that request. [sender] On transient object: Destroy object after transmission.
[any] If on attempting to pull a sender off the queue, none is available, the request or object is queued instead. [overseer] Monitor the number of sender threads and create/destroy them as appropriate. Sender threads go back to the front of the queue (for cache-efficiency reasons) and if they linger in the tail of the queue doing nothing for more than $config seconds, they get killed off. [overseer] Monitor backend responsiveness based on backend thread statistics. Switch between various policy states accordingly. [master] handle requests coming in via $channel from the janitor process. ... or something like that. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From des at linpro.no Fri Feb 17 12:47:26 2006 From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=) Date: Fri, 17 Feb 2006 13:47:26 +0100 Subject: My random thoughts References: <61614.193.213.34.102.1139781240.squirrel@denise.vg.no> Message-ID: "Anders Berg" writes: > "Dag-Erling Smørgrav" writes: > > - quick and dirty squid-like hashed directories, to begin with > as Poul later comments, squid is slow and dirty. Let's try to avoid it. I just mentioned it as a way of getting a storage backend up and running quickly so we can concentrate on other stuff. > I am fine with fancy block storage, and I am tempted to suggest: > Berkeley DB I have always pictured Varnish with a Berkeley DB > backend. Why? I _think_ it is fast (only website info to go on > here). > > http://www.sleepycat.com/products/bdb.html > > its block storage, and wildcard purge could potentially be as easy as: > delete from table where URL like '%bye-bye%'; Berkeley DB does not have an SQL interface or any kind of query engine.
> "Poul-Henning Kamp" writes: > > Abuse mitigation interface to firewall/traffic shaping: Allow > > the central node to put an IP/Net into traffic shaping or take > > it out of traffic shaping firewall rules. Monitor/interface > > process (not main Varnish process) calls script to config > > firewalling. > This sounds like a really good feature. Hope it can be solved in > Linux as well. Not sure they have the fancy IPFW filters etc. They have iptables and other equivalents. DES -- Dag-Erling Sm?rgrav Senior Software Developer Linpro AS - www.linpro.no From des at linpro.no Fri Feb 17 12:51:49 2006 From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=) Date: Fri, 17 Feb 2006 13:51:49 +0100 Subject: [Fwd: Re: My random thoughts] References: <4156.1140084557@critter.freebsd.dk> Message-ID: "Poul-Henning Kamp" writes: > [acceptor] If we have no hit, the header needs to be analyzed (URL > cleanup, rewriting, negative lookup etc etc). We could use a > "sender" thread to do this, but I would rather in order to limit > the amount of potentially expensive work we do here. My initial > thought therefore is to put the request into a queue to be dealt > with by the "backend" threads. The header always needs to be analyzed, as it may contain stuff like If-Modified-Since, Range, etc. DES -- Dag-Erling Sm?rgrav Senior Software Developer Linpro AS - www.linpro.no From phk at phk.freebsd.dk Fri Feb 17 13:26:58 2006 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Fri, 17 Feb 2006 13:26:58 +0000 Subject: [Fwd: Re: My random thoughts] In-Reply-To: Your message of "Fri, 17 Feb 2006 13:51:49 +0100." Message-ID: <15373.1140182818@critter.freebsd.dk> In message , Dag-Erling =?iso-8859-1?q?Sm=F8rgra v?= writes: >"Poul-Henning Kamp" writes: >> [acceptor] If we have no hit, the header needs to be analyzed (URL >> cleanup, rewriting, negative lookup etc etc). 
We could use a >> "sender" thread to do this, but I would rather in order to limit >> the amount of potentially expensive work we do here. My initial >> thought therefore is to put the request into a queue to be dealt >> with by the "backend" threads. > >The header always needs to be analyzed, as it may contain stuff like >If-Modified-Since, Range, etc. While those headers are relevant, they are of no use until we have the object in question, so we don't need to look at them until in the sender or backend thread. Since we only have one frontend thread, I want to minimize the amount of work it does to the absolute minimum. The number of sender and backend threads are variable and can/will be adjusted to fit the load. -- Poul-Henning Kamp | UNIX since Zilog Zeus 3.20 phk at FreeBSD.ORG | TCP/IP since RFC 956 FreeBSD committer | BSD since 4.3-tahoe Never attribute to malice what can adequately be explained by incompetence. From andersb at vgnett.no Fri Feb 17 17:11:45 2006 From: andersb at vgnett.no (Anders Berg) Date: Fri, 17 Feb 2006 18:11:45 +0100 (CET) Subject: My random thoughts In-Reply-To: References: <61614.193.213.34.102.1139781240.squirrel@denise.vg.no> Message-ID: <4263.195.139.5.194.1140196305.squirrel@denise.vg.no> > "Dag-Erling Sm?rgrav" writes: >> I am fine with fancy block storage, and I am tempted to suggest: >> Berkeley DB I have always pictured Varnish with a Berkley DB >> backend. Why? I _think_ it is fast (only website info to go on >> here). >> >> http://www.sleepycat.com/products/bdb.html and >> http://www.sleepycat.com/products/bdb.html >> >> its block storage, and wildcard purge could potentially be as easy as: >> delete from table where URL like '%bye-bye%'; > > Berkeley DB does not have an SQL interface or any kind of query > engine. Okay, I knew it did not have a SQL interface, but not that it did not deliver a query engine of some sort. 
Anyway Berkeley DB (now Oracle owned :)) does say this on their homepage: "Berkeley DB is the ideal choice for static queries over dynamic data, while traditional relational databases are well suited for dynamic queries over static data." I did not paste this in to argue that you and Berkeley have a different definition of queries :) But rather that the "queries" we are gonna use for this are the same, and the data dynamic. So at first glance it looks to be right for us if it's _fast_. But no fear, I can kill darlings :) >> "Poul-Henning Kamp" writes: >> > Abuse mitigation interface to firewall/traffic shaping: Allow >> > the central node to put an IP/Net into traffic shaping or take >> > it out of traffic shaping firewall rules. Monitor/interface >> > process (not main Varnish process) calls script to config >> > firewalling. >> This sounds like a really good feature. Hope it can be solved in >> Linux as well. Not sure they have the fancy IPFW filters etc. > > They have iptables and other equivalents. Brilliant. Now let's pray they work the way they should, and are dynamic :) Anders Berg From des at linpro.no Mon Feb 20 15:59:05 2006 From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=) Date: Mon, 20 Feb 2006 16:59:05 +0100 Subject: draft spec Message-ID: Here's a dump of what we've written down so far: http://varnish.projects.linpro.no/wiki/VarnishSpecDraft Please let me know if there are any glaring mistakes or if something seems to be headed in the wrong direction. I'd like to make one comment regarding the Components section - the various components are not necessarily separate threads, they're just distinct functional units, some of which may be implemented as threads with message passing or work queues while others are simply APIs. Oh, and we haven't started talking about management or logging, so this is all in the main Varnish process.
What I'd like to do tomorrow and on Wednesday is try to cover as much ground as possible without going into too much detail; we can save that for when Poul-Henning is here. DES -- Dag-Erling Sm?rgrav Senior Software Developer Linpro AS - www.linpro.no From des at linpro.no Wed Feb 22 18:55:32 2006 From: des at linpro.no (Dag-Erling =?iso-8859-1?q?Sm=F8rgrav?=) Date: Wed, 22 Feb 2006 19:55:32 +0100 Subject: r16 - trunk/varnish-doc/share References: <20060222184402.061351ED520@projects.linpro.no> Message-ID: des at projects.linpro.no writes: > Modified: > trunk/varnish-doc/share/docbook-xml.css > Log: > Set correct mime-type. Since mod_dav_svn uses svn:mime-type as Content-Type, it is now possible to read the draft spec online, straight from the repo: http://varnish.projects.linpro.no/svn/trunk/varnish-doc/en/varnish-specification/article.xml It's not perfect - no TOC, no links, no bibliography - but it's enough to be able to read the text without first having to check out the DocBook source and run it through xsltproc. In the medium term, I will look into creating a lightweight XSL stylesheet for DocBook which will allow a web browser to transform DocBook to XHTML on the fly (the full DocBook XSL stylesheets are not well suited for that purpose) DES -- Dag-Erling Sm?rgrav Senior Software Developer Linpro AS - www.linpro.no From phk at phk.freebsd.dk Fri Feb 24 12:53:08 2006 From: phk at phk.freebsd.dk (Poul-Henning Kamp) Date: Fri, 24 Feb 2006 12:53:08 +0000 Subject: notes1 Message-ID: <20060224125308.27584BC6D@phk.freebsd.dk> Notes on Varnish ---------------- Collected 2006-02-24 to 2006-02-.. Poul-Henning Kamp ----------------------------------------------------------------------- Policy Configuration Policy is configured in a simple unidirectional (no loops, no goto) programming language which is compiled into 'C' and from there binary modules which are dlopen'ed by the main Varnish process. 
The dl object contains one exported symbol: a pointer to a structure which contains a reference count, a number of function pointers and a couple of string variables with identifying information. All access into the config is protected by the reference counts.

Multiple policy configurations can be loaded at the same time, but only one is the "active configuration". Loading, switching and unloading of policy configurations happen via the management process.

A global config sequence number is incremented on each switch, and policy-modified object attributes (ttl, cache/nocache) are all qualified by the config sequence under which they were calculated, and are invalid if a different policy is now in effect.

-----------------------------------------------------------------------
Configuration Language

XXX: include lines.

BNF:

  program:              function | program function
  function:             "sub" function_name compound_statement
  compound_statement:   "{" statements "}"
  statements:           /* empty */ | statement | statements statement
  statement:            if_statement | call_statement | "finish"
                        | assignment_statement | action_statement
  if_statement:         "if" condition compound_statement elif_parts else_part
  elif_parts:           /* empty */ | elif_part | elif_parts elif_part
  elif_part:            "elseif" condition compound_statement
                        | "elsif" condition compound_statement
                        | "else if" condition compound_statement
  else_part:            /* empty */ | "else" compound_statement
  call_statement:       "call" function_name
  assignment_statement: field "=" value
  field:                object | field "." variable
  action_statement:     action arguments
  arguments:            /* empty */ | arguments argument

-----------------------------------------------------------------------
Sample request policy program

sub request_policy {
    if (client.ip in 10.0.0.0/8) {
        no-cache
        finish
    }
    if (req.url.host ~ "cnn.no$") {
        rewrite s/cnn.no$/vg.no/
    }
    if (req.url.path ~ "cgi-bin") {
        no-cache
    }
    if (req.useragent ~ "spider") {
        no-new-cache
    }
    if (backend.response_time > 0.8s) {
        set req.ttlfactor = 1.5
    } elseif (backend.response_time > 1.5s) {
        set req.ttlfactor = 2.0
    } elseif (backend.response_time > 2.5s) {
        set req.ttlfactor = 5.0
    }
    /*
     * the program contains no references to
     * maxage, s-maxage and expires, so the
     * default handling (RFC2616) applies
     */
}

-----------------------------------------------------------------------
Sample fetch policy program

sub backends {
    set backend.vg.ip = {...}
    set backend.ads.ip = {...}
    set backend.chat.ip = {...}
    set backend.chat.timeout = 10s
    set backend.chat.bandwidth = 2000 MB/s
    set backend.other.ip = {...}
}

sub vg_backend {
    set backend.ip = {10.0.0.1-5}
    set backend.timeout = 4s
    set backend.bandwidth = 2000Mb/s
}

sub fetch_policy {

    if (req.url.host ~ "/vg.no$/") {
        set req.backend = vg
        call vg_backend
    } else {
        /* XXX: specify 404 page url ? */
        error 404
    }

    if (backend.response_time > 2.0s) {
        if (req.url.path ~ "/landbrugspriser/") {
            error 504
        }
    }

    fetch

    if (backend.down) {
        if (obj.exist) {
            set obj.ttl += 10m
            finish
        }
        switch_config ohhshit
    }
    if (obj.result == 404) {
        error 300 "http://www.vg.no"
    }
    if (obj.result != 200) {
        finish
    }
    if (obj.size > 256k) {
        no-cache
    } else if (obj.size > 32k && obj.ttl < 2m) {
        obj.ttl = 5m
    }
    if (backend.response_time > 2.0s) {
        set ttl *= 2.0
    }
}

sub prefetch_policy {
    if (obj.usage < 10 && obj.ttl < 5m) {
        fetch
    }
}

-----------------------------------------------------------------------
Purging

When a purge request comes in, the regexp is tagged with the next generation number and added to the tail of the list of purge regexps.
Before a sender transmits an object, it is checked against any purge regexps which have a higher generation number than the object; if one matches, the request is sent to a fetcher and the object purged.

If there were purge regexps with a higher generation to match, but they didn't match, the object is tagged with the current generation number and moved to the tail of the list. Otherwise, the object does not change generation number and is not moved on the generation list.

New objects are tagged with the current generation number and put at the tail of the list. Objects are removed from the generation list when deleted.

When a purge object has a lower generation number than the first object on the generation list, the purge object has been completed and will be removed. A log entry is written with the number of compares and the number of hits.

-----------------------------------------------------------------------
Random notes

Swap-backed storage.

Slowstart by config-flipping: the start-config has peer servers as backend; once the hitrate goes above a limit, the management process flips config to the 'real' config.

stat-object is always a URL, not a regexp.

Management + varnish process in one binary, comms via pipe.

Change from a config with long expiry to one with short expiry - how does the ttl drop? (The config sequence number invalidates all calculated/modified attributes.)

Mgt process holds a copy of the acceptor socket -> restart without lost client requests.

BW limit per client IP: create a shortlived object (<4 sec) to hold status. Enforce limits by delaying responses.
-----------------------------------------------------------------------
Source structure

libvarnish    library with interface facilities, for instance
              functions to open & read the shmem log
varnish       varnish sources in three classes

-----------------------------------------------------------------------
protocol cluster/mgt/varnish

object_query url         -> TTL, size, checksum
{purge,invalidate} regexp
object_status url        -> object metadata
load_config filename
switch_config configname
list_configs
unload_config
freeze                   # stop the clock, freezes the object store
thaw
suspend                  # stop acceptor accepting new requests
resume
stop                     # forced stop (exits) varnish process
start
restart                  = "stop;start"
ping $utc_time -> pong $utc_time    # cluster only
config_contents filename $inline -> compilation messages
stats [-mr] -> $data
zero stats
help

-----------------------------------------------------------------------
CLI (local)

import protocol from above
telnet localhost someport
authentication: password $secret
secret stored in {/usr/local}/etc/varnish.secret (400 root:wheel)

-----------------------------------------------------------------------
HTML (local)

php/cgi-bin
thttpd ? (alternatively direct from C-code.)
Everything the CLI can do + stats
popen("rrdtool");
log view

-----------------------------------------------------------------------
CLI (cluster)

import protocol from above, prefix machine/all
compound stats
accept / deny machine (?)
curses if you set termtype

-----------------------------------------------------------------------
HTML (cluster)

ditto
ditto
http://clustercontrol/purge?regexp=fslkdjfslkfdj
POST with list of regexp
authentication ? (IP access list)

-----------------------------------------------------------------------
Mail (cluster)

pgp signed emails with CLI commands

-----------------------------------------------------------------------
connection varnish -> cluster controller

Encryption: SSL
Authentication (?): IP number checks.
varnish -c clusterid -C mycluster_ctrl.vg.no

-----------------------------------------------------------------------
Files

/usr/local/sbin/varnish
    Contains mgt + varnish process.
    If -C argument, open SSL to cluster controller.
    Arguments:
        -p portnumber
        -c clusterid at cluster_controller
        -f config_file
        -m memory_limit
        -s kind[,storage-options]
        -l logfile,logsize
        -b backend ip...
        -d debug
        -u uid
        -a CLI_port
    KILL SIGTERM -> suspend, stop

/usr/local/sbin/varnish_cluster
    Cluster controller. Use syslog.
    Arguments:
        -f config file
        -d debug
        -u uid (?)

/usr/local/sbin/varnish_logger
    Logfile processor.
    -i shmemfile
    -e regexp  -o "/var/log/varnish.%Y%m%d.traffic"
    -e regexp2 -n "/var/log/varnish.%Y%m%d.exception" (NCSA format)
    -e regexp3 -s syslog_level,syslogfacility
    -r host:port    send via TCP, prefix hostname
    SIGHUP: reopen all files.

/usr/local/bin/varnish_cli
    Command line tool.

/usr/local/share/varnish/etc/varnish.conf
    default request + fetch + backend scripts

/usr/local/share/varnish/etc/rfc2616.conf
    RFC2616 compliant handling function

/usr/local/etc/varnish.conf
    (optional) request + fetch + backend scripts

/usr/local/share/varnish/etc/varnish.startup
    default startup sequence

/usr/local/etc/varnish.startup
    (optional) startup sequence

/usr/local/etc/varnish_cluster.conf
    XXX

{/usr/local}/etc/varnish.secret
    CLI password file.

-----------------------------------------------------------------------
varnish.startup

    load config /foo/bar startup_conf
    switch config startup_conf
    !mypreloadscript
    load config /foo/real real_conf
    switch config real_conf
    resume
    *eof*

From andersb at vgnett.no Fri Feb 24 13:55:00 2006
From: andersb at vgnett.no (Anders Berg)
Date: Fri, 24 Feb 2006 14:55:00 +0100 (CET)
Subject: Module-map Varnish
Message-ID: <2158.193.69.165.4.1140789300.squirrel@denise.vg.no>

A non-text attachment was scrubbed...
Name: varnish_prosesser.pdf
Type: application/pdf
Size: 26872 bytes
Desc: not available
URL:

From des at linpro.no Fri Feb 24 14:54:28 2006
From: des at linpro.no (Dag-Erling Smørgrav)
Date: Fri, 24 Feb 2006 15:54:28 +0100
Subject: r24 - in trunk/varnish-cache: . bin bin/varnishd include lib lib/libvarnish lib/libvarnishapi
References: <20060224143556.09D131ED51F@projects.linpro.no>
Message-ID:

des at projects.linpro.no writes:
> Added:
>    trunk/varnish-cache/Makefile.am
>    trunk/varnish-cache/autogen.sh
>    trunk/varnish-cache/bin/
>    trunk/varnish-cache/bin/Makefile.am
>    trunk/varnish-cache/bin/varnishd/
>    trunk/varnish-cache/bin/varnishd/Makefile.am
>    trunk/varnish-cache/bin/varnishd/varnishd.c
>    trunk/varnish-cache/configure.ac
>    trunk/varnish-cache/include/
>    trunk/varnish-cache/include/Makefile.am
>    trunk/varnish-cache/include/varnishapi.h
>    trunk/varnish-cache/lib/
>    trunk/varnish-cache/lib/Makefile.am
>    trunk/varnish-cache/lib/libvarnish/
>    trunk/varnish-cache/lib/libvarnish/Makefile.am
>    trunk/varnish-cache/lib/libvarnishapi/
>    trunk/varnish-cache/lib/libvarnishapi/Makefile.am
> Log:
> Source tree structure as agreed.

To build, first make sure you have GNU autotools installed (FreeBSD: devel/gnu-autoconf, devel/gnu-automake, devel/gnu-libtool). Check out the sources, run autogen.sh to generate the configure script, then configure && make && make install as usual.

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From des at linpro.no Mon Feb 27 10:01:57 2006
From: des at linpro.no (Dag-Erling Smørgrav)
Date: Mon, 27 Feb 2006 11:01:57 +0100
Subject: RFC: namespaces
Message-ID:

Varnish is going to consist of quite a bit of code, and I'd like to keep separate modules in separate namespaces. I'd like to suggest the following convention:

- All external symbols get a three-letter prefix followed by an underscore.
- The first letter is always v for Varnish.
- The next two letters identify the module the symbol belongs to. Each module gets a unique two-letter mnemonic code.

For instance, we could assign the two-letter code "lo" to the logger; logging functions would be named e.g. vlo_emit(), and log-related preprocessor macros would be named e.g. VLO_LEVEL_DEBUG.

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From phk at phk.freebsd.dk Mon Feb 27 11:55:01 2006
From: phk at phk.freebsd.dk (Poul-Henning Kamp)
Date: Mon, 27 Feb 2006 11:55:01 +0000
Subject: RFC: namespaces
In-Reply-To: Your message of "Mon, 27 Feb 2006 11:01:57 +0100."
Message-ID: <903.1141041301@critter.freebsd.dk>

In message , Dag-Erling Smørgrav writes:
>For instance we could assign the two-letter code "lo" to the logger;
>logging functions would be named e.g. vlo_emit(), and log-related
>preprocessor macros would be named e.g. VLO_LEVEL_DEBUG.

Sounds a bit like overkill to me, but if it makes you happy I can live with it.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG      | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

From des at linpro.no Mon Feb 27 13:56:58 2006
From: des at linpro.no (Dag-Erling Smørgrav)
Date: Mon, 27 Feb 2006 14:56:58 +0100
Subject: RFC: namespaces
References: <903.1141041301@critter.freebsd.dk>
Message-ID:

"Poul-Henning Kamp" writes:
> "Dag-Erling Smørgrav" writes:
> > For instance we could assign the two-letter code "lo" to the logger;
> > logging functions would be named e.g. vlo_emit(), and log-related
> > preprocessor macros would be named e.g. VLO_LEVEL_DEBUG.
> Sounds a bit like overkill to me, but if it makes you happy I can
> live with it.
Well, the alternatives are worse IMHO:

open_log(const char *);          /* conflicts with libfoobar */
varnish_open_log(const char *);  /* too long */
open(const char *);              /* oops */
vlo_open(const char *);          /* that's better! */

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From des at linpro.no Mon Feb 27 14:18:01 2006
From: des at linpro.no (Dag-Erling Smørgrav)
Date: Mon, 27 Feb 2006 15:18:01 +0100
Subject: r24 - in trunk/varnish-cache: . bin bin/varnishd include lib lib/libvarnish lib/libvarnishapi
References: <20060224143556.09D131ED51F@projects.linpro.no>
Message-ID:

des at linpro.no (Dag-Erling Smørgrav) writes:
> To build, first make sure you have GNU autotools installed (FreeBSD:
> devel/gnu-autoconf, devel/gnu-automake, devel/gnu-libtool). Check out
> the sources, run autogen.sh to generate the configure script, then
> configure && make && make install as usual.

I forgot to add that the recommended configure command line for developers is the following:

$ ./configure --enable-pedantic --enable-wall --enable-werror \
      --enable-dependency-tracking

I need to get this all written down in the wiki...

DES
-- 
Dag-Erling Smørgrav
Senior Software Developer
Linpro AS - www.linpro.no

From phk at phk.freebsd.dk Mon Feb 27 17:29:07 2006
From: phk at phk.freebsd.dk (Poul-Henning Kamp)
Date: Mon, 27 Feb 2006 17:29:07 +0000
Subject: RFC: namespaces
In-Reply-To: Your message of "Mon, 27 Feb 2006 14:56:58 +0100."
Message-ID: <2100.1141061347@critter.freebsd.dk>

In message , Dag-Erling Smørgrav writes:
>Well, the alternatives are worse IMHO:
>
>open_log(const char *);          /* conflicts with libfoobar */
>varnish_open_log(const char *);  /* too long */
>open(const char *);              /* oops */
>vlo_open(const char *);          /* that's better! */

Well, I would tend to consider the varnish process + mgt_process namespace "private" and therefore not in need of a lot of prefix, whereas for the public API I fully agree a prefix is in order.
But as I said: I can live with it.

-- 
Poul-Henning Kamp       | UNIX since Zilog Zeus 3.20
phk at FreeBSD.ORG      | TCP/IP since RFC 956
FreeBSD committer       | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.