The Great Mystery of HTTP and the World of Web Servers

Before embarking upon this research I must confess I was fairly ignorant of the ins and outs of HTTP, or what web servers really were beyond being “big computers that store stuff”. Despite typing “http” into my browser’s address bar on a regular basis, I had little idea why I was doing that, or where my request was going when I hit enter.

The basics, then. HTTP (HyperText Transfer Protocol) is a simple client-server protocol for hypermedia information that has driven the World Wide Web since the early 1990s. It is based on a request-response model, where the client sends a request to the server (usually in the form of a website address) and the server sends a response back (usually the content of the requested page). There may be one or more intermediaries between the client and server, such as proxies, gateways or firewalls, but this is the fundamental journey.
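
To make this concrete, here is a minimal sketch of one request-response round trip, using Python’s standard http.client module (the host example.com is just an illustrative placeholder):

    # One HTTP request-response round trip, sketched with Python's
    # standard http.client. "example.com" is a placeholder host.
    import http.client

    conn = http.client.HTTPConnection("example.com", 80)
    conn.request("GET", "/")                 # the client's request
    response = conn.getresponse()            # the server's response
    print(response.status, response.reason)  # e.g. 200 OK
    body = response.read()                   # the requested content
    conn.close()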

Prior to the current version of HTTP – 1.1 – there were versions 0.9 and 1.0. The main difference is that the older versions open a separate TCP (Transmission Control Protocol) connection for every request, whereas 1.1 can keep the same connection open for a whole series of requests and responses, making it faster and more efficient as the connection does not have to be re-established each time. Persistent connections also enable HTTP pipelining, whereby multiple requests can be sent down a single connection without waiting for the corresponding responses. A persistent connection improves latency significantly, then, but pipelining is still very much in its infancy. It does not fix the problem of head-of-line blocking (a slow response at the front of the queue delays every response behind it), and it cannot safely be used with non-idempotent methods like POST. More significantly, the difficulties in deploying pipelining to web browsers have proved so great that it remains disabled by default in all major browsers. This drives clients to use multiple parallel TCP connections for concurrency, which is slow and inefficient.
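
A quick sketch of what a persistent connection buys you: with Python’s http.client, several requests can reuse a single HTTP/1.1 connection, though each response must be read in full before the next request goes out – the library does not pipeline, mirroring the browsers. The host and paths below are illustrative placeholders.

    # Several requests over one persistent HTTP/1.1 connection.
    # Note: each response must be read before the next request is
    # sent; http.client does not pipeline, just like the browsers.
    import http.client

    conn = http.client.HTTPConnection("example.com", 80)
    for path in ("/", "/about", "/contact"):
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()              # finish this response first
        print(path, response.status)
    conn.close()                     # one TCP handshake, not three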

HTTP 1.1 defines eight request methods – OPTIONS, GET, HEAD, POST, PUT, DELETE, TRACE and CONNECT – of which the most common is GET. This is the method used when you type a website address into your browser; it simply tells the server that you want to see the resource you are specifying. I will go into this further after a brief introduction to web servers.

A web server is essentially something that provides a service – in this case, serving web content over HTTP – to something else. The term applies both to the computer program that serves content (such as web pages) using HTTP, and to the computer (or virtual machine) that runs the program.

To break it down further, when you type a website address into your browser (for example http://www.sky.com/news), what you are really specifying is the following (a short parsing sketch follows the list):

  • The protocol: “http”
  • The server name, comprising:
    • The host name: “www”
    • The domain name: “sky”
    • The top-level domain name: “com”
  • The path (here a file name): “news”
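
As a quick illustration, Python’s standard urllib.parse performs the same decomposition:

    # Decomposing the address with Python's standard urllib.parse.
    from urllib.parse import urlparse

    url = urlparse("http://www.sky.com/news")
    print(url.scheme)   # "http"        -- the protocol
    print(url.netloc)   # "www.sky.com" -- the server name
    print(url.path)     # "/news"       -- the path to the resource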

In this simple way, the client requests a specific resource from the server (here the file “news” from the server “www.sky.com”) using HTTP’s GET method. The server responds by sending the contents of this file and then, in the simplest case, closing the connection. The response is usually an HTML file, which the web browser then parses and renders on the screen.
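
At the socket level, that request is nothing more than a few lines of text. Here is a sketch using Python’s standard socket module to send the GET by hand (response handling is trimmed to the status line):

    # Sending the GET request by hand, to show the literal bytes
    # HTTP puts on the wire. Response handling is deliberately
    # minimal; a real client would read until the connection closes.
    import socket

    sock = socket.create_connection(("www.sky.com", 80))
    request = (
        "GET /news HTTP/1.1\r\n"
        "Host: www.sky.com\r\n"
        "Connection: close\r\n"
        "\r\n"
    )
    sock.sendall(request.encode())
    reply = sock.recv(4096).decode(errors="replace")
    print(reply.splitlines()[0])   # e.g. HTTP/1.1 200 OK
    sock.close()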

This is a very basic explanation, as it implies the server is required to do no more than send the requested file. That is the case with static pages, but most modern web pages also contain dynamic content. With dynamic elements such as HTML forms, the client’s input dictates the response sent by the server. In this case the client typically submits its data with another HTTP method, POST, and the server processes the information and generates a page based on the specifics of the query.
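
As a sketch of the dynamic case, here is a tiny handler, written with Python’s standard http.server purely for illustration (real sites would use Apache modules, CGI or a framework), that builds its response from the client’s POST data; the form field “name” is an assumed example:

    # A page generated from the client's input: the handler reads the
    # POSTed form data and builds the HTML on the fly. Illustrative
    # only; the form field "name" is an assumed example.
    from http.server import BaseHTTPRequestHandler, HTTPServer
    from urllib.parse import parse_qs

    class FormHandler(BaseHTTPRequestHandler):
        def do_POST(self):
            length = int(self.headers["Content-Length"])
            fields = parse_qs(self.rfile.read(length).decode())
            name = fields.get("name", ["stranger"])[0]
            page = f"<html><body>Hello, {name}!</body></html>"
            self.send_response(200)
            self.send_header("Content-Type", "text/html")
            self.end_headers()
            self.wfile.write(page.encode())

    HTTPServer(("", 8000), FormHandler).serve_forever()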

Apache has been the most popular HTTP server software in use since 1996, serving 54.48% of all websites as of September 2009. Its goal is not to be the “fastest” web server, but (according to its website) to provide a “secure, efficient and extensible server that provides HTTP services in sync with the current HTTP standards.”

Although Apache is currently secure in this dominant position, there are developments elsewhere which must be considered. Most notable is Tornado, which has recently been made open source. This is the web server that powers FriendFeed, and it is unusual because it was designed specifically to handle the site’s real-time features. Rather than having clients repeatedly poll for updates, as traditional web servers require, Tornado maintains a standing connection that simply waits for them; updates can then be displayed the instant they arrive. It uses epoll, which scales far better than traditional polling mechanisms, so it can handle thousands of simultaneous standing connections – every active FriendFeed user maintains an open connection to the FriendFeed servers. As such it is ideal for real-time web services in a way that traditional web servers cannot be. As more emphasis is placed on the need for real-time web content, it will be interesting to see the impact Tornado will surely come to have.
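
For a flavour of what this looks like in practice, here is the canonical minimal Tornado application – a hello-world sketch, not FriendFeed’s actual code:

    # A minimal Tornado application: one process, one epoll-driven
    # event loop, many simultaneous connections. This is the standard
    # hello-world sketch, not FriendFeed's production code.
    import tornado.ioloop
    import tornado.web

    class MainHandler(tornado.web.RequestHandler):
        def get(self):
            self.write("Hello from Tornado")

    app = tornado.web.Application([(r"/", MainHandler)])
    app.listen(8888)                          # non-blocking listener
    tornado.ioloop.IOLoop.current().start()   # the epoll event loop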

The future of HTTP has rarely been challenged. While there have been attempts to improve the latency of web pages, most have targeted TCP rather than HTTP, and in practical terms a change to the transport protocol is very difficult to deploy. Google, realising this, have now started looking at alternatives to HTTP itself. As part of their “Let’s make the web faster” initiative they have been developing a new protocol called SPDY (pronounced “SPeeDY” – three guesses as to why), designed specifically to reduce the latency of web pages. Google’s approach is significant not least because it requires minimal changes to existing infrastructure: SPDY seeks to preserve existing HTTP semantics as much as possible. Features such as cookies and etags work exactly as they do with HTTP; SPDY simply replaces the way the data is written to the network.

SPDY aims to address the current problems with HTTP, specifically its reliance on multiple connections for concurrency, which causes delays for the client. The three main improvements over HTTP are:

  • An unlimited number of requests can be issued concurrently over a single SPDY connection, making much more efficient use of TCP.
  • Requests can be prioritised according to the client’s needs.
  • Headers are compressed, saving both latency and bandwidth.

Looking at the current SPDY whitepaper, there are additional aims which appear more ambitious still. Google highlight their desire for SPDY to support bi-directional streams, breaking from the current HTTP model, which caters exclusively for client-initiated requests. Data could thus be pushed to the client, allowing web applications to be notified the instant something happens rather than having to poll the server, which is expensive. Moreover, by staying one step ahead of the client, the server could deliver resources far more quickly. Google are currently using a faux-push hack to allow real-time updates in AJAX applications like Gmail and Google Wave, so this seems like a natural progression and should provide greater flexibility in such cases.

Google have so far built a high-speed, in-memory server and a modified Chrome client. As is standard with Google, the project is all open source (although some of it is not yet available). The source code for the modified Chrome client can be found here. One of SPDY’s primary high-level goals was to achieve a 50% reduction in page load time, and early results suggest this target can be met: Google claim to have seen reductions in page load times of up to 64%. A summary of the test results to date can be found here. It is important to note, however, that whilst these results are promising, Google themselves acknowledge that they do not know how well they represent the real world.

SPDY has been compared to HTTP pipelining, but it is easier to deploy and does not appear to share the same limitations. As it is still in development it is perhaps too early to say for sure, but this could well be the future of the web. It is certainly unlikely that Google will put much effort into developing HTTP pipelining support for the Chrome browser with SPDY in the (pun intended) pipeline.

For now, however, HTTP remains the application-layer protocol of choice. Which means that, for now at least, I can relax in the knowledge that I understand how it all works.
