Microsoft Projects Details

Challenges to Building Scalable Services

A Survey of Microsoft’s Internet Services

Version 1.0

September 24, 1999

 

Steven Levi and Galen Hunt

Microsoft Research

 

 

 

 

Microsoft Confidential

 

 

 

 

Acknowledgments

We have been continually impressed at the caliber of the people building and operating Microsoft’s megaservices.  Don’t underestimate either their intelligence or their commitment to deliver.  We have borrowed their time and their wisdom.  Without exception, these people have graciously shared both with us.  Any value derived from this report comes from their contribution.  Any inaccuracies are solely ours.

 

Revision History

The latest version of this paper is available at http://big/megasurvey/megasurvey.doc.

      Version 0.6      August 24, 1999              Rough draft distributed to executives.

      Version 0.7      September 3, 1999          Text complete and revisions from sites.

      Version 0.8      September 15, 1999        Added LinkExchange.

      Version 0.9      September 17, 1999        Revisions from LinkExchange.

 


Challenges to Building Scalable Services    1

Acknowledgments. 1

Revision History.. 1

1. Introduction.. 4

2. What is a Service?. 5

2.1. Service Characterization.. 5

2.2. Load Balancing.. 7

2.2.1. Multiple IP Addresses (DNS Round Robin) 7

2.2.2. Hardware Solutions (Cisco, Alteon, F5 and others) 7

2.2.3. Software Solution  (WLBS) 9

2.3. Example. 10

3. Observations. 12

3.1. Maintainability.. 12

3.2. Scalability.. 14

3.3. Availability.. 16

4. Services. 18

4.1. Hotmail. 18

4.1.1. Architecture. 18

4.1.2. Operations. 20

4.2. Home.Microsoft.Com (HMC) 21

4.2.1. Architecture. 21

4.2.2. Development/Testing methodology. 22

4.3. Sidewalk.. 23

4.3.1. Architecture. 23

4.3.2. Development Methodology/Issues. 24

4.4. MSNBC.. 25

4.4.1. Architecture. 26

4.4.2. Issues. 27

4.5. Instant Messaging.. 28

4.5.1. Architecture. 28

4.5.2. Issues. 30

4.6. Expedia.. 30

4.6.1. Architecture. 31

4.6.2. Development and Testing. 32

4.6.3. Wish List 32

4.7. MoneyCentral. 33

4.7.1. Architecture. 33

4.7.2. Development Methodology and Issues. 34

4.7.3. Wish List 35

4.8. Windows Update. 35

4.8.1. Architecture. 36

4.8.2. Development Methodology and Issues. 37

4.8.3. Operations. 37

4.9. CarPoint. 38

4.9.1. Architecture. 39

4.9.2. Issues. 40

4.10. Calendar.. 41

4.10.1. Architecture. 41

4.11. Chat. 42

4.11.1. Architecture. 42

4.11.2. Wish List 43

4.12. Communities. 44

4.12.1. Architecture. 44

4.12.2. Issues. 45

4.13. WebTV.. 45

4.13.1. Architecture. 46

4.13.2. Development methodology. 48

4.13.3. Operations. 49

4.13.4. Future: TVPAK.. 50

4.14. LinkExchange. 51

4.14.1. Architecture. 52

4.14.2. Development 54

4.15. LinkExchange ListBot. 55

4.15.1. Architecture. 55

4.16. Hydrogen.. 56

4.16.1. Architecture. 57

4.17. Passport/Wallet. 58

4.17.1. Architecture. 59

4.17.2. Deployment Plans. 60

4.18. AdsTech.. 62

4.18.1. Architecture. 62

4.18.2. Software Description. 63

4.19. Exchange Hosting.. 65

4.20. MSN Operations. 66

4.20.1. Issues. 66

4.20.2. Wish List 66

5. Conclusions. 68

 


 

1. Introduction

Desktop applications and megaservices have fundamentally different per-user scalability models.  The developers of desktop applications like Office, Flight Simulator and Visual Studio scale to more users by shipping more CD-ROMs.  The developers of megaservices like HotMail, Expedia, and home.microsoft.com must physically scale their servers and software to support millions of simultaneous users.

While programming and execution in the two domains can be quite different, deployment and operations are radically different.  Software developers bridging the two domains, whether MS product teams or ISVs, have very little incentive to use MS technology when moving to the Internet because frankly, MS has little, if any, technology to make their application scalable, deployable or operable with millions of users.  Moreover, the skill set of the MS-technology-capable developer is radically different than the skill set needed to build and deploy large, scalable services.

Three months ago, we initiated a study of Microsoft megaservices.  Recognizing our complete ignorance, we endeavored to visit each megaservice within the company.  Our goal was to understand their architectural, programming, and deployment challenges.  Our hope was to identify a set of commonalties upon which we could then propose a new Microsoft application platform for megaservices and create an explosion of new applications reminiscent of the Win3 desktop application explosion.

This report represents the first major milestone of our study.  We visited seventeen megaservices, including virtually all of the MSN properties, HotMail, LinkExchange, and WebTV.  We believe we have a strong basic understanding of the challenges of building, deploying, and operating today's megaservices and want to share our understanding with a broader audience.

The outline of the remainder of this paper is as follows:

Section 2 develops a notion of what a service is by introducing an extremely simple definition and then layering on top of that additional constraints and complexities that one must consider as part of any viable (internet) service.  To round out a common base of understanding a discussion of load balancing mechanisms is included.  Finally, this notion is applied to a real-world example where some of the more subtle design trade-offs are explored.

Section 3 details our key observations.  The reader is urged not to skip Section 2 before reading Section 3.  Without the base of understanding developed in Section 2, the findings in this section will not be as clear.

Section 4 describes each of the Microsoft services.

Section 5 summarizes our most important findings.


2. What is a Service?

We will not propose in this report an application platform for megaservices.  In fact, we go so far as to caution others of the danger of preaching a solution from the pulpit of naïveté.  Building scalable systems is hard.  Building reliable megaservices is even harder.

2.1. Service Characterization

Two-tier client/server computing has evolved into multi-tier service-based computing.  In this newer approach, servers provide logical services to their clients by partitioning the application workload among themselves, and by depending upon a common communication model to share their distributed computation.

In practice, application-level megaservices are often recursively composed of multiple services that cooperatively span many machines.  As simple as this statement is, it has profound implications.  The application author must become responsible for matching his or her code to the hardware and communications capabilities of what is typically a large and difficult-to-understand cluster of computers.  Thinking about multiple types of machines, and multiple instances of each type, forces the developer to think about network connectivity.  One must consider communications patterns between machines and the mechanisms needed/desired to balance both machine load and internal/external network capacity.

Virtually all of the Microsoft services we reviewed minimally decomposed their service into what are typically called front-end (FE) machines and back-end (BE) machines.  The FE machines usually service incoming user requests, acquire, generate or read some data, render this data into HTML, and then issue a response to the user.  For services that support the HTTP 1.0 protocol, a TCP connection is created and torn down for each individual user request. 

Front-end (FE) machines are generally stateless.  That is to say, after a connection is torn down there is no need to remember anything about that request to process future requests.  Because of their statelessness, most FE machines are considered interchangeable.  Back-end (BE) machines are typically stateful and are used to retrieve and persist data.  Examples of BE components include the Exchange Store, a SQL or Oracle database, and the HotMail USTORE machines.  FE machines quite often make requests to BE machines to get or store data.  End users rarely communicate directly with BE machines.

As services scale, data partitioning becomes an issue.  In the case of a small service where there is only one BE machine, the way in which the data is partitioned does not matter much.  As the system scales and the needed BE output rate exceeds the resources of a single box (CPU cycles, network bandwidth, number of spindles, etc.), more often than not there will be a need to find a natural partitioning of the data.  Fortunately, many of the current services being developed are “embarrassingly parallel,” which is to say that data is easily partitioned, usually on a per-user basis.  The developer now concerns himself not only with how the data can and must be partitioned, but also how it will be accessed and what the topology of the connectivity between the FE machines servicing the requests and the set of BE machines.  If not careful, the developer will find himself with a full mesh connectivity from every FE to every BE.  A mesh may or may not be a bad thing depending on how it is used.  If, however, the service accesses SQL via ADO in ASP it will open and tear down connections between the FE and the BE machines for each HTTP request.  The mesh in this case is, of course, disastrous.

Of course, our poor developer is not out of the woods yet.  Not only does he have to figure out how to partition the data to accommodate the current BE expansion but he must provision for all future expansions and new versions as well!  Now the developer probably needs to come up with some virtual resource manager  (VRM) that provides an indirection between the logical data and its physical partition.  As new data BE machines come on line he needs to adjust the VRM to reflect the new partitioning.  Most likely this new partitioning will necessitate the migration of data stored on one or more of the BE devices to other BE devices.  The developer has to either handle this directly in the service or provide support tools for migration.

As a service grows in scale and importance, more and more attention is paid to the minuscule details of performance.  With a single machine fulfilling only one functional aspect of the service, the developer’s attention turns to specific details of what is going on each box.  Where is performance going?  Are there things on this box that get in the way?  Are there services or processes running on this box that are not need?  Are the services provided by a general OS costing performance for the special application or sub-application?  Recall that a subcomponent of the system may run on a very large number of machines.  HotMail has over 1300 Front Door machines all doing exactly the same thing.  The cost of additional fine tuning is amortized over the entire set of machines on which it runs.  In extreme cases, the OS appears to be little more than overhead to be eradicated, rather than a useful foundation for code.

To understand the system better, the developer must add all manner of runtime instrumentation.  Sources of runtime instrumentation data serve two distinct purposes: to aid in monitoring the health of a current system; and to enable understanding of how the system behaves during real on-line operations.  Deeper understanding of the current system increases the chance of improving its implementation.  The most successful services have built extensive monitoring and logging facilities into their architectures.

As the number of components in the service increases, so does the likelihood of component failure.  Interestingly, as a consequence of their distributed nature, even if a function  (or a function servicing a subset of the user community) is not operational most likely some portion of the service is still operating.  It behooves the developer to exploit partial failure.  Because of partial failure, the developer must treat failures and/or the lack of sub-services (or infrastructure services) with style and grace when writing his components.  He needs to handle failures from all external components as gracefully as possible and certainly not cause errors to ripple back to the user in some unintelligible form (or god forbid, assert!).

Up to this point, a strong system developer or architect with some distributed systems knowledge could grapple with most of the concepts and issues presented.  The challenge is to expand their mindset to incorporate an intrinsic quality of services: megaservices must always be up!  This is a massive mental shift.  For services to become a cornerstone of electronic business, they must always be available, supporting at least a minimal quality of service.  Think telephone: you can (almost always) depend upon it.

No number of developers and architects versed in distributed computing are sufficient to ensure high availability – this guarantee can only be offered by a top-notch operations staff.  The requirements placed on a service by operational constraints are as important (and perhaps even more important) than end-user requirements.  Especially in the resource-rich environment of dedicated personal computer hardware, developers have often paid less attention to operational ease-of-use than to end-user visible application features.  This lack of engineering detail can have disastrous effects upon the scalability and availability of a service, however, and must be avoided at all costs.  Administrative simplicity, ease of configuration, ongoing health monitoring, and failure detection are as high priorities as any application feature; because of this, the developer must fully understand the context in which a service is deployed and run.  Conversely, the operations staff must also understand all partitioning schemes, administrative tools, and communications patterns that characterize the service and its runtime presence on the net.

2.2. Load Balancing

In the presence of diverse and plentiful machines, the developer is forced to think about partitioning the application workload, as well as the network connectivity needed to support each partitioning scheme.  One very important aspect of connectivity exists between external users and the service itself.  Except for the smallest of services, some form of load balancing mechanism is needed to distribute the load of incoming connections over the FE boxes.  Many megaservices also have sub-services, which are themselves load balanced.

Three typical load balancing mechanisms exist: multiple IP addresses (DNS round robin), hardware support for virtual-to-real IP address mapping, and software support for virtual-to-real IP address mapping.

2.2.1. Multiple IP Addresses (DNS Round Robin)

DNS name resolution is the process of translating a domain name to an IP address.  In “round robin” DNS, a random IP address will be returned with each DNS resolution request (if multiple entries exist in the DNS entry.)

The purpose of round robin is to allow use of multiple HTTP servers (with identical contents) in order to distribute the connection loads.  Round robin is not random, though it gives a random effect.  It operates in a round-robin fashion (as the name implies), in that it rotates the return record sequence by one for each response – one address is handed out, put at the end of the list, and then the next address is handed out for the next translation request yielding something like a translation list.

2.2.2. Hardware Solutions (Cisco, Alteon, F5 and others) 

The Cisco LocalDirector is a hardware-based load-balancing solution.  Several companies make similar devices.  Most of these devices manifest themselves as switches or bridges with additional software for managing specialized routing.  The switch learns the IP addresses of all the servers connected to it.  Based on machine availability and a balancing algorithm the switch takes the incoming packets, all with the same destination IP address (the LocalDirector’s IP address), and re-writes them to contain the appropriate chosen server’s IP address.  The high-end LocalDirector can re-write packets for a 230Mbps data stream with up to 32 destination servers.  Moreover, we believe, the director can support up to 16 independent data segments (sub-nets such that traffic on the sub-net is completely isolated from traffic on the other sub-nets)

The following is a blurb from http://www.cisco.com/warp/public­/cc­/cisco­/mkt­/scale­/locald­/index.shtml

“Cisco System's Local Director is a high-availability, Internet scalability solution that intelligently load balances TCP/IP traffic across multiple servers.  Servers can be automatically and transparently placed in or out of service, and LocalDirector itself is equipped with a hot standby failover mechanism, eliminating all points of failure for the server farm.  LocalDirector is a high performance Internet appliance with proven performance in the highest traffic Internet sites.”

Below is an example of a LocalDirector configured in one of its simplest forms.  A set of server machines is connected via a hub on one common Ethernet segment.  The segment is connected to the LocalDirector (bridge).

LocalDirector with Hubs or Switches (simplest configuration):

The example below is considerably more complex.  This configuration can survive any single point of failure (up to but not including, the servers) without the users being adversely affected.  This example is included to demonstrate the complexity and sophistication that can occur with basic network building blocks.

LocalDirector in Highly Fault-Tolerant Configuration:

 

2.2.3. Software Solution  (WLBS)

Microsoft’s Windows NT Load Balancing System (WLBS), also known as Convoy, is a software-only load balancing solution.  WLBS creates a shared virtual IP address (VIP) among a cluster of Windows NT servers.  WLBS load balances incoming TCP connections to the VIP across the members of the cluster. 

WLBS is an NDIS packet filter driver.  WLBS sits above the network adapter’s NDIS driver and below the TCP/IP stack.  Each machine in the cluster receives every packet for the VIP.  WLBS decides on a packet-by-packet basis which packets should be processed on each machine.  If the packet should be processed by another machine in the cluster, WLBS throws the packet away.  If the packet should be processed locally, it is passed up to the TCP/IP stack.

The WLBS driver on each machine sees all incoming packets because all members of the cluster sit on a shared Ethernet segment.  In essence, each packet is broadcast to every machine and each machine individually decides which packets to process.

WLBS uses a distributed hash function to determine which packets should be accepted on a given machine.  WLBS hosts exchange heartbeat messages with each other once a second.  The heartbeat message contains a host’s view of the cluster.  Based on heartbeats, each host knows about the state of other cluster members.  Each host independently makes the decision to accept or reject the packet based on its host ID, the number of hosts in the cluster, and the information in IP header of the packet.

WLBS effectively partitions (through hashing) the IP client space among available cluster hosts and lets each handle its share of the workload.  WLBS does not modify any information in the packet.

Using a management console, the system administrator can remove or add any host to the cluster at any time.

2.3. Example

Other architectural components of a megaservice include file storage, database storage (like SQL), database-access mechanisms (like ODBC), network bandwidth limitations, and server hardware (number of CPUs, hardware class, etc.).  While each of these issues is important, people at Microsoft tend to have a much better understanding of these fundamental problems and scalability trade-offs than they do of load balancing.

Naturally, not every service needs to worry about all of these architectural problems.  On the other hand, there is more than one service today that deals with virtually all the above-described issues.  Future services (or services that have yet to scale sufficiently) will likely have to deal with all of these issues. 

To reinforce the difficulty of constructing a megaservice, consider HotMail.  It is a service that easily scales by user.  In condensed form, HotMail consists of four classes of machines:

·        Web server Front Doors (FDs): Stateless web servers that present the HotMail user interface via HTML.

·        User data stores (USTOREs): Stateful servers persisting the email folders for up to 2M users.

·        Member index servers (MSERVs): Stateless servers containing the global mapping from user ID to USTORE.

·        Incoming SMTP servers: These servers accept incoming email messages and save them to the appropriate USTORE.

Hotmail is arguably the largest service Microsoft owns.  HotMail has 47M users and handles many millions of email messages a day.  On one occasion when HotMail suspended incoming SMTP connection for 2 hours, AOL’s outgoing SMTP queue grew to about 2M messages.  HotMail can independently scale each of the four classes of machines.  They currently have over 1300 Front Doors and 54 USTOREs.  Cisco LocalDirectors allow the Front Doors to share a common IP address and automatically balance incoming HTTP requests.

HotMail’s Achilles heel is the USTORE.  Each new USTORE currently cost over $750,000.  The number of I/O requests a USTORE can fulfill per second bounds the per user scalability of the HotMail architecture.  The USTOREs are continuously pounded by multiple I/O requests per user page view.  Furthermore, the USTORE is a single point of failure.  If a USTORE goes down, up to 2M users lose access to their email (although they can still send outgoing email messages).

One of the proposed solutions to fix HotMail’s problems is the following: since a user’s email is limited to 2MB (and in fact it is often closer to 400KB), transfer the entire email folder to the Front Door and back in a single pair of I/O operations per user session.  Furthermore, the email folder could be RAID striped across a cluster of USTOREs.  The Front Door reads the email folder, twiddles on the bits over the lifetime of the HTTP connection, then flushes the email folder back to the USTOREs.

This “easy” solution overlooks the realities of the web.  First, the law of large numbers: the access patterns of 47M users are indistinguishable from random noise.  Second, the inherent parallelism of web activity: while one Front Door is rendering an email message to HTML, another email message may arrive for the same user.  Third, the law of large number again: thousands of email messages arrive at a given SMTP server in any given minute, and they just keep coming.  Megaservices don’t get coffee breaks.

The “easy” solution becomes complex very quickly.  Moving the email folder for every connection can be very expensive.  HTTP and SMTP activity for the same folder can be concurrent forcing either expensive queuing or locking.  The Front Doors and SMTP servers become stateful drastically complicating load balancing.  Finally, HTTP connections can be very short lived: HTTP 1.0 clients reconnect on every page item request. 

Building scalable megaservices is not easy, but it can be done.  One can make the “easy” solution work and, in fact, a number of low-tech solutions work extremely well in the megaservice space.  Our challenge is to gather the collective wisdom of those who have built scalable megaservices and harness it for the benefit of the company and customers.


3. Observations

In this section, we present several of the most important insights and observations we have gathered from our survey.  While these insights are presented here for conciseness, we strongly encourage a complete reading of subsequent sections.  The true value of these insights is best appreciated in context.  The bulk of this report contains our detailed descriptions of each of the megaservices we visited.

Many of the lessons we learned are similar to those found in the ISTORE project (see http://big/megasurvey/istore1.ppt).  Traditional systems research has, up until now, focused on cost and performance, and mostly ignored scalability, availability and maintainability.  One of the fundamental cornerstones of the ISTORE project is the realization, in light of the world of services, that cost and performance are not very interesting if the system is not functioning.  Similarly, traditional application and operating system developers (and program managers, for that matter) also tend to focus on features and performance, thinking of management and deployment only after the fact.

In order to drive this point home the lessons learned have been classified into three buckets: Maintainability, Scalability and Availability. 

3.1. Maintainability

·        Simple solutions are often best.  Many web services are basically, in the words of the parallel computing community, “embarrassingly parallel.”  For example, HotMail has exploited the inherent simplicity of per-user email partitioning to avoid the extra layers of software and architectural complexity that come with general-purpose extensibility models such as Application Server Providers or Enterprise Java Beans.  They have created a service that directly reflects the natural partitioning of the domain being modeled, and that scales and performs exceptionally well because of this.  Whether this service will scale well into an era of inter-service integration remains an open question, but the simplicity remains striking – general architectures designed for effectively sharing the resources of a single machine are unlikely to adapt to a world in which the machines themselves constitute the component boundary.

·        Hire the best people for operations.  When we visited HotMail, we spent an afternoon with two development leads and the operations lead.  Almost without exception, the operations lead answered all of our software architecture questions.  He knew every scalability pitfall in the system.  He knew the architecture as intimately as the people who wrote it did.  Why does HotMail have fewer operators by almost any metric than any of our sites?  Our conjecture: because the operations team knows the software.

It is critical to recognize that while developers create code, operators create process.  In the same sense that a service needs strong developers, it also needs strong operators to create the appropriate processes.  Code may be the backbone of our products, but process is the backbone of our services.

·        Operations teams should be integrated into product development.  Development management at several of Microsoft’s largest megaservices insisted that one of Microsoft’s failings is that we separate operations and development into separate teams, often in separate divisions.  At both HotMail and WebTV, the operators are intimately involved in product development.  The WebTV operators are pushing features one, and even two, software releases in advance.  We found at least one example at HotMail where operations correctly predicted the lack of scalability of a particular feature long before the developers came to that same conclusion.  Operators feel their users pain, successful developers will feel the their operators pain.

·        Configurable off-the-shelf solutions are preferred to custom code.  When possible, it is better to adapt the design of the cluster to accommodate optimized hardware or software than to write custom code.  LocalDirector switches from Cisco are widely deployed for this reason; they are reliable, well understood, predictable, and can easily be dropped into a network without adverse impact.  This flies in the face of Microsoft’s traditional extensibility solution, calling user-supplied code.  Although code is a very general solution, it also demands much more operational coordination than LocalDirector’s simple configuration-driven solution.  Likewise, many services expressed a desire for off-the-shelf state management for both user profiles and content management.

·        Low-tech rules.  Low-tech systems are often much easier to operate because they are much easier to understand.  For example, with few exceptions the MSN properties use ROBOCOPY instead of CRS for content replication because it is restartable, reliable, and easy to understand.  When it comes to managing a site with hundreds or thousands of machines, command-line scriptable tools always beat a fancy GUI.

The value of low-tech is particularly evident in the megaservices messaging infrastructures.  Only one of the sites we visited uses cross-machine DCOM.  Even sites using very RPC-like communication patterns and proprietary software, such as the next version of MSN Chat, use low-tech message passing.

·        Single data center rather than geoscaling.  Supporting two data centers more than doubles the operations costs and the benefits of geographic distribution for distributed users is of questionable value given the topology of the Internet.  At acquisition, HotMail had planned to open a third data center in New Jersey to augment their Wyatt and Lawson data centers in the Bay area.  They have since abandoned these plans, and in fact, they are closing down Wyatt.  Other MS services have also consolidated their plans for geoscaled operations.  Plummeting prices for data transmission are enabling tunneling and back hauling to be a cheaper alternative to distributed presence. 

Cost considerations aside, geoscaling is a vital concern for service reliability.  Enough of the world’s email flows through HotMail, to make it a globally vital data center.  Many third party services, such as AOL, suffer serious disruption when HotMail’s incoming SMTP queues block.  One can ask if Microsoft’s focus on single data centers is socially irresponsible.

·        Less is more.  Internet users have shown a surprising tolerance for systems that are operationally reliable and responsive, but that aren’t feature complete.  This is probably symptomatic of both the immaturity of the market and the volatility of the Internet; it also suggests that the widely held conjecture that desktop applications have become more complex than warranted is true.  This suggests that, within reason, manageability and operations should be given a higher priority than feature creep.  Again, this is a profound shift from our traditional product focus.

·        Side-by-side component versioning.  Launching a new version of a service or one of its subcomponents involves risk.  Each new deployment should include a backward path in the case of failure, as well as an incremental rollout strategy that can be adjusted in real time.  Most sites employ physical side-by-side operations to accomplish this: the new version of service is deployed on new hardware parallel to the existing service.  Either a DNS or a router switch is thrown to enable the new service.  WebTV upgrades on the same hardware, in parallel directory trees; they switch between versions of the system by changing a soft link and restarting.  Many aspects of Win32, from the registry to the component model, complicate software-based side-by-side operation.

·        Process isolation and restart.  In the service space, Windows NT’s thread-based architecture is a liability.  As has been demonstrated with IIS, shared address space services can be very fragile in the face of ISV-provided code.  Individual threads are difficult, if not impossible, to debug in an online server.  Isolated processes are robust, easy to debug, and easy to restart.  Both threads and processes are too low level to represent cluster-wide concurrency since this involves numerous machines.  Because of this, each service has incorporated synchronization directly into their infrastructure.

·        A service is never finished.  The Internet changes, competitors change, and the load on a service changes.  Constant change demands that service be malleable and maintainable.  A web service is only truly finished when the developers give up and quit.  Once again, megaservices don’t get coffee breaks.

3.2. Scalability

·        The network is an integral part of the system.  Network topology including placement of routers, switches, LocalDirectors, and subnets is crucial to service scalability.  Service providers manipulate and exploit the network; it isn’t just part of the environment.  Imagine building HotMail without control over its physical network topology.  Although they have not yet moved to hardware assisted routing of data, services such as Instant Messaging that depend upon multicast topologies or distributed event routing are interested in the deployment possibilities represented by programmable switches.

·        Careful partitioning.  As mentioned above, one common feature of all of the services we examined is that at some fundamental level they are embarrassingly parallel.  Whether separating HotMail data by user, or chat rooms by topic, the data decomposition is parallel and, for the most part, obvious.  It is likely that as the Web evolves, the granularity of partitioning will become larger (moving from per-user to per-community, etc.), but ample opportunities for data partitioning and operation parallelism will persist.

·        Load balancing is a core component.  Load balancing is the invocation model that, when coupled with partitioning, enables scalability.  At the level of IP packets and TCP connections, load-balancing solutions (like DNS round robin, WLBS, and LocalDirector) are readily available.  Where appropriate, DNS round robin is particularly robust, as users re-balance themselves when they experience poor service by pressing the “retry” button.

·        Stateless front ends.  Most of the megaservices we visited employ stateless FE machines (although they often depend upon state passed by the client within their logic).  The front ends render HTML and embody the control logic used to issue requests to stateful back ends.  In one sense, the stateless front end is just the bottom half of the UI layer in a three-tier system.  In another sense the FE machines act as a high-level, application specific switch since they often switch back-end connections based on information in the HTTP request rather than data in IP or TCP headers.

The primary disadvantage of stateless front-end architectures is that state must be pushed either to the client or to the back-end servers.  Pushing state to the server implies that the site must support a notion of a unique ID (or login) and must provide a state database.  Pushing state to the client-side implies that state must either be embedded in the URL or transported in a browser cookie.  Cookies are problematic because they do not support user roaming and they are often considered an invasion of privacy.  Furthermore, cookies make data mining virtual impossible.  For example, home.microsoft.com has no idea how many users subscribe to a given subsection of the start page because they don’t have access to the user’s start-page cookie. 

·        Understand your connectivity.  Services must understand the nature of their internal dataflow in order to scale.  This is one reason that the FE/BE distinction is so useful – the FE can offload the processing bottleneck associated with slow client connections and enable greater BE concurrency.  In general, impedance matching by service components based on the connectivity patterns and bandwidths is an important part of megaservice design.  Concern with the impedance relationships between database processing, rendering, and personalization is a topic that recurred with several of the groups interviewed.

·        Cost and performance matter.  Scalability of a service is affected by the performance of individual components.  HotMail uses 9GB drives on their front doors, rather than cheaper 2GB drives, simply to get faster RPM drives; they don’t use the space.  Performance of ASP was a common complaint amongst the service developers we met with in our study.  By rendering its top four pages from pure ISAPI (instead of ASP), MoneyCentral reduced its average response time from 32ms to 8ms per page.  Optimization for the common case leads to a willingness to factor components to a very fine grain when necessary – any attempt at automatic or tool-generated factoring is better when it takes empirically gathered statistics into account.

3.3. Availability

·        A component should never fail due to an external component failure.  Individual components in WebTV have defined behaviors for all possible external failures.  For example, if the login server is unavailable, WebTV will service the user with “unauthenticated” permissions.  WebTV can function even when a remarkable number of its servers have failed.  The WebTV home page service will render a start page even if the mail, ad, and stock ticker services are all down.  It is considerably better to have broad and robust error coverage than to have an unprotected component that implements more features. 

·        Components should fast-fail on inconsistent state.  The quicker an errant component fails, the sooner the rest of the system can work around the failure.  As a negative example, consider the interaction between LocalDirector and IIS.  Cisco’s LocalDirector routes incoming packets to the server with fastest turn-around time.  When IIS runs out of internal resources, a fast path immediately replies to all HTTP requests with a “come back later” reply.  LocalDirector interprets this behavior as a very fast server and begins to route all incoming packets to the overloaded server.  Similarly, WLBS only considers a server failed if it no longer issues a heartbeat, regardless of the health of the resident IIS server.

·        Monitoring is absolutely essential.  Many of the megaservices use extensive logging and counter-based monitoring, along with as much remote administration as possible, to both ensure continuous availability and to provide data with which to improve their infrastructure.  Filtering, alerting, and visualization tools are an absolute necessity for sites with hundreds of machines in order to filter out important events from background “noise”.

·        Systems should be designed (and tested) with component failure as a rule not an exception.  The standard for designing communicating components for the Internet is considerably different from Microsoft’s traditional LAN-centric models such as ODBC and DCOM.  For the most part simple socket-based protocols are used by megaservice components when they cannot piggyback upon an existing web protocol.  Although not all of these protocols are designed to fail in a recoverable way, they all are designed to come back up as quickly as possible in the face of failure, and the services interviewed were always aware of the failure characteristics of their infrastructures.

·        The system should work partially even when components fail.  Running an Internet service is a double-edged sword.  Users expect service.  Sometimes they expect full service; sometimes they’ll tolerate less.  Even if a USERVE is offline, the affected users still get a HotMail web page and can still send email.  On the other hand, Expedia has found that when it loses connection to it’s BE machines, it is better to deny users entry (with a stylized retry-later message) than to let them get halfway through a ticket purchase before failing.

·        Test suites should be delivered to operations as part of the platform.  At most of the sites, the test suites used by development are also used continually to sanity check the health and the performance of the system.  Many of the sites (see full descriptions) have a “hidden” web page that exercises important site features.  Viewing the hidden page alerts operations staff of any problems in functionality or user-perceived performance.


4. Services

This section describes the seventeen Microsoft Internet services we visited.  Each subsection describes a service’s basic function, physical architecture, operations issues, and scalability issues.

4.1. Hotmail

Configuration

 

 

  Front Ends:

2198

  Back Ends:

58

  Other Boxes:

54

  Load Balancing:

LocalDirector

 

 

Interview Date:

July 22, 1999

 

 

 

 

 

 

HotMail is an Internet mail service.  HotMail currently has 47 million users, and supports up to 77K simultaneous online connections.  HotMail serves approximately 90M ad impressions per day.

Users connect to www.hotmail.com with a standard Internet browser.  HotMail.com is really a set of Front-End (FE) machines behind a local director.  All of the FE machines run FreeBSD and Apache.  The FE machines communicate with one of the member index servers (MSERVs), a set of replicated machines that contain the index of username, <machine name, data segment> mappings, to determine which USTORE machine to connect to for data.  The USTOREs are large SUN Solaris boxes that contain the user’s mail, password and customizations (aliases etc).

The FE acts as an agent for the end user; it reads and writes files on the USTORE through the XFS protocol (an atomic mail storage protocol with some similarities to NFS) and generates the appropriate HTML.  Ads and images are stored on separate servers to keep the load down on the front ends.  

Instant Messaging, which is housed at HotMail, is detailed in a separate section.

4.1.1. Architecture

For Scaling, HotMail has defined a Hotmail Capitalization Unit (HCU).  An HCU is the incremental unit of scale for adding new users to the system.  Given today’s’ hardware a HCU covers approximately 2M users.  An HCU is added approximately every month.

The prototypical HotMail HCU (HotMail Capitalization Unit) consists of the following:

·        User data server (USTORE) (1 machine).  Typically a very large Solaris box (latest machines are E4500s with 8 processors).  The USTORE holds all of the email for a large group of user (up to 2M users).  Backup is done to tape units attached directly to the USTORE.  Other machines (FE boxes) access the USTORE files through XFS.  Contrary to popular belief, XFS is not a file system; it is really an atomic mail storage and retrieval protocol.  USTOREs are bound by two constraints: the amount of time needed for backup and the number of I/O operations per second.  It takes 18 hours to backup a USTORE.  USTOREs typically have a CPU load of about 5%, but a disk utilization of 100%.

·        Front Door servers (16 machines).  Front doors are stateless front-end servers; their primary responsibilities are accessing storage and HTML rendering.  In addition to HTML, FDs also run spell checking, thesaurus, and dispatch outgoing email from HotMail users.  Current FDs are 400MHz P2 boxes running FreeBSD and Apache.  FDs are CPU and network bound.  Each incoming connection requires at least two FreeBSD processes.  The FDs within a cluster share a common IP address.  Incoming requests are distributed with a Cisco LocalDirector.

·        Login servers (15 machines).  Login servers are stateless web servers that redirect users at login to the appropriate cluster.  Physically they have the same configuration as the Front Doors.

·        Member index server (MSERVs (1 machine).  A global directory mapping users to USTOREs.  All MSERVs share a common IP address distributed with a LocalDirector.  Each MSERV contains the entire user directory. 

·        Graphics servers (4 machines).  Simple Web servers for static graphics content.  The graphics servers load all images into cache on boot up to reduce request latency.

·        Incoming mail servers (4 machines).  These run SMTP to accept incoming mail and dispatch it to the appropriate USTORE.  Mail servers sit behind a single IP address and LocalDirector.

Multiple HCUs are combined behind common Cisco LocalDirectors to form a cluster.  HotMail currently runs seven mail clusters; six at the Lawson facility and one at Wyatt.   HotMail runs the LocalDirectors beyond the maximum recommended speed with the expected instabilities.

The HCU is an idealized model.  As HotMail has scaled from 9M users at time of acquisition to 45M users, the ratios have morphed.  For example, a common set of four MSERVs is shared across all of the HotMail clusters.  Ad service has now moved to MSN Ads.

Given gains in hardware, offloading of ad tracking, and generally better performance from their code, the team expects that within 12 months one HCU (one USTORE), with the addition of the appropriate number of disk spindles, will be capable of supporting 4M users.

To support multiple clusters, users enter the site through a set of login servers (via DNS round robin).  The login servers then redirect the user to the appropriate cluster.  At login time, a file containing the user's last access time is updated.  A user's account is deleted after 120 days of inactivity.

At time of acquisition, the typical user had 240KB of stored email.  Storage has grown to 400-600KB per user primarily because of email attachments.  Cost has grown to $1.62/user/year with an addition $.60/user/year for Instant Messaging.  Approximately 70% of the hardware costs are in the USTOREs.

HotMail has abandoned plans for geoscaling.  While the software could geoscale with minor modifications, they see no immediate business justification.  Managing 2x sites requires more than 2x effort.  In fact, HotMail plans to shutdown the original Wyatt site when the next cluster comes on line at Lawson.

Load during off-peak is 2/3s of load during peak time; many of their customers are foreign based.

4.1.2. Operations

The operations team is intimately familiar with all of the quirks and details of the system (including the actual code).  In fact, Josh Goldsmith gave most of the technical development details.

HotMail’s machines are housed at Exodus.  Exodus supplies cages, power, and Internet connectivity.  Due to Exodus’ pricing model, HotMail has added redundant network connectivity to many of the major ISPs.

Operations are monitored remotely; all management is through remote shells and scripting.  Updates are propagated to all of the live machines at once using RDIST.  Feature upgrades are applied incrementally.  HotMail has never had a “major” release.

Operations is deeply involved in all phases of development.  They are a major feature driver for the system.  Operations is involved at least one version ahead of deployment.

Server up time can be as much as 300 days (USTOREs), with 94.5% uptime for the service as a whole.  However, on average across the whole system, they experience one USTORE kernel panic every two days.  USTORE hardware failures are due, in order, to tape drives gone bad, RAID controllers fried, or dead CPUs.  Front doors experience as many as eleven crashes per day.  Luckily, front-door crashes are largely masked by the LocalDirectors.

Nightly backup has proven useful.  HotMail has restored one tape backup in the last month.  The login process had not been correctly updating the last access time for 120 days.  After the garbage collector deleted a large faction of a USTORE, the operators noticed, the bug was fixed, and the previous tape was restored.  This was the third restore in the history of the site.  The other restores were due to serious RAID failures.

It is worth noting that users are routinely migrated between USTOREs.  For example, in the last week, approximately 1M accounts were moved in order to retire two old USTOREs.  The average churn rate between USTOREs is probably closer to 250K accounts per week.  The migration of users from one USTORE to another is a good example of dynamic partitioning in action with the smallest unit of granularity being a single mailbox.  It is interesting to note that migration is under administrative control.

The system has to be brought down briefly when adding a new HCU (all MSERV index structures have to be updated).  Migrating users between stores does not necessitate downtime. 

HotMail still is vulnerable to single points of failures – When a USTORE goes down mail is not available to 2M accounts.  HotMail is also vulnerable to catastrophic events as all of their machines are hosted in one site (excluding ad servers).

For the USTOREs, HotMail Operations insists that Solaris on Sun hardware supplies better reliability, better remote management support, more socket connections (at least compared to Windows NT 4), and faster process startup.  Moving HotMail off the current code base would be very difficult

4.2. Home.Microsoft.Com (HMC)

Configuration

 

 

  Front Ends:

42

  Back Ends:

?

  Other Boxes:

?

  Load Balancing:

DNS/WLBS

 

 

Interview Date:

August 3, 1999

 

 

 

 

 

 

HMC supplies the main web page for the Microsoft portal.  Much of the complexity of the home page arises due to personalization.  Personalization information is kept in cookies on the user’s machine and read before the page is rendered.  How many things are on the page, their order, and localized content are all determined at runtime from the cookie.

4.2.1. Architecture

HMC consists of seven clusters of six machines each.  Each box is connected to three networks: the Internet, an administrative LAN, and the shared services LAN.  Ads, profiles and stock quotes are all accessed through the shared services LAN.  Each machine is a quad-processor Xeon.  Ideally, each machine would run with just four threads; one thread per processor for the Internet, the shared services LAN, IIS, and NT.  In reality, they use approximately 40 threads due to the "lemmings" problem: large groups of users with the same source IP address (like AOL) hitting a single server.

Load is balanced between the clusters with DNS round robin; load is balanced within each cluster through WLBS.  Steve Bush's favorite WLBS feature is that a machine can easily be taken out of the rotation.  .

Browser clients are bucketized into four classes:

1)      IE4+ (support for DHTML).

2)      ECMA Script.

3)      HTML 2.0 - requires more server round trips.

4)      IE2 - pop up a forced upgrade dialog.

The ASP code detects the 70 or so flavors of browser in use and reduces them to one of these four cases.  The rest of the rending code is conditionalized to output for these four buckets.

4.2.2. Development/Testing methodology

HMC tests performance with ad hoc packet drops.  They also rely on MSNSRVT (the MSN Server Test team) a user experience tool, SOC watch, and the WebCatThreads from the IIS test team.  Most of their in-house testing is focused on UI.

Aside from scale, HMC’s major problem is content management.  For example, HMC acquires stories from the Wall Street Journal (WSJ).  In the ideal world, the WSJ would render news for each user.  Nevertheless, in reality the WSJ can't be depended on to write reliable rendering code and the cost to go out to a WSJ server is too expensive.  Instead, HMC must pre-process data to make it deliverable to users.

Data arrives from the WSJ in a CDF file encoded in XML.  The file is then split into headlines, HTML payloads and redirections for WSJ pages.  The file is retrieved from a WSJ-provided URL every 5-60 minutes (depending on the specific file).  The fetcher applies transformations to the data, publishes into HMC's CORAL SQL server and schedules it for display.  The SQL data is then copied to a stager, which pushes data out to the web servers.

Another part of content management is the application of business logic rules.  For example, sports scores come in unprocessed.  HMC sorts the scores by team and league using business logic that defines the league for each team and adds URLs such as the team's home page.  The logic of the team/league hierarchy must be encoded into the business logic; it isn't described in metadata delivered by the WSJ.  Another example of business logic is the conversion of times from GMT to local time.  As a final example of the complexity of business logic, HMC renders local news in the US based on the user's zip code if known.  Internationally, local news is determined by region, but there exists no standard taxonomy for defining local regions on an international basis.

Content management in the human world is governed by defined processes.  However, the state of web development is that processes must be captured and expressed in code.  Steve's comment, "people want to solve problems through process, not code."

Other problems HMC sees with NT:

·         There exists no standard infrastructure for managing and administering the system.  "MMC is totally useless because we don't have RPC access to server; no one is going to open up the RPC port on their web servers."

·         NT's thread-based architecture requires too much work.  The cost of creating a new process is prohibitively expensive.  The cost of a thread corrupting the global IIS process is even worse.

·         Most Internet sites don't want more OS, they want less.

·         NT's code isn't available as an educational or idea resource.

·         A developer creating business services doesn't have the flexibility to change features in NT the meet their needs.

4.3. Sidewalk

Configuration

 

 

  Front Ends:

3

  Back Ends:

3

  Other Boxes:

2

  Load Balancing:

DNS

 

 

Interview Date:

August 2, 1999

 

 

 

 

 

 

Sidewalk is an online guide to entertainment with targeted content for 74 cities, primarily in the US.  Sidewalk's primary challenges are content management (with customized data for 74 cities and 3-4 updates per day) and HTML rendering.  In July, MS sold the arts and entertainment sections of its first 10 cities to TicketMaster's City Search.  Essentially, TicketMaster bought version 2.0 of the Sidewalk rendering engine.  The other cities (and the rest of the site) use version 3.0.  The yellow pages section (along with local yellow page advertising) is the most lucrative side of the Sidewalk business.

4.3.1. Architecture

The sidewalk architecture consists of the following:

·        Front End (FE) IIS servers (3 machines).  Running a custom ISAPI rendering engine, each machine is DNS registered with 74 names (one for each Sidewalk city).  An ISAPI filter maps friendly URLs to internal URLs.  FE servers are connected to both the Internet and a local 100BaseT publishing LAN.  Load is balanced between the FE servers using DNS round robin. 

·        Mid-Tier (MT) SQL Servers (3 machines).  There is a 1:1 relationship between FE IIS servers and MT SQL servers.  All page content is stored on the SQL servers.  Custom OLEDB data service objects provide optimized support for connection management and query.  A fourth MT SQL server stores non-content data. 

·        IIS staging server (SS) and SQL staging server (1 machine each).  Identical to the FE IIS and MT SQL servers, these servers run on CorpNet.  The IIS server is actually registered under 592 names (through a custom HOSTS.TXT file propagated to Sidewalk editors’ and developers’ machines).  The 592 names cover the cross product of 74 cites, 2 modes (preview or live) and 4 content bases (current, next, and 2 others).  New data is replicated from the staging servers to FE and MT servers every 5 minutes.  When a new edition of Sidewalk is ready to publish (3-4 times