CMS Email: 'Adding capacity to address bottlenecks like these' Likely Ineffective

Emails sent by CMS in the days before the Obamacare website launch reveal growing concern about the system’s ability to handle concurrent users because of coding problems. One email dated September 27th warns “Adding capacity to address bottlenecks like these will likely be ineffective.”

The email chain was published Thursday by Jonathan Strong at National Review. Strong highlighted one portion of an email by project manager Henry Chao in which Chao warned he did not want a “meltdown.”

But a review of the other emails in the chain suggests concern was growing that the site would not be able to handle even a modest number of concurrent users on launch day. Concurrent users is a measure of how many individuals can simultaneously view and use a site.

On September 27th, four days before launch, David Nelson of CMS wrote to project manager Henry Chao with this blunt warning [emphasis added]:

The scripts are failing so far due to issues like load balancing, inefficient and defective code, and inefficient queries. We have not been successful in moving beyond 500 concurrent users filling applications without income verification. Adding capacity to address bottlenecks like these will likely be ineffective. We must give ourselves the ability to work through these tuning issues and at this point we do not have an operational environment for further performance testing.

If a site is unprepared to handle enough concurrent users to meet demand, one option is to simply multiply hardware to increase capacity. If, for example, each server can handle 500 concurrent users, then running 10 servers (virtual or otherwise) should allow the site to handle around 5,000 concurrent users. Nelson's warning is that the flaws they are seeing (defective code and inefficient queries) are of a kind that adding “capacity” (more servers) will not necessarily fix.
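To make the arithmetic concrete, here is a minimal sketch of why extra servers may not help. It assumes a deliberately simple model in which the site's capacity is the smaller of two numbers: the combined capacity of the web servers, and the limit of some shared component behind them, such as a database running slow queries. The function name and the figures other than the 500-user example are hypothetical and are not taken from the emails.

```python
def max_concurrent_users(servers, per_server_limit, shared_limit):
    """Return site capacity under a simple model (an assumption, not CMS data).

    Capacity is capped by whichever is smaller: the combined capacity of
    all servers, or the limit of a shared bottleneck behind them.
    """
    return min(servers * per_server_limit, shared_limit)


# Hypothetical figures for illustration only.
PER_SERVER = 500           # users one server can handle, as in the example above
HEALTHY_BACKEND = 100_000  # a shared component that is not the bottleneck
SLOW_BACKEND = 500         # a shared component choked by inefficient queries

for servers in (1, 2, 10):
    healthy = max_concurrent_users(servers, PER_SERVER, HEALTHY_BACKEND)
    slow = max_concurrent_users(servers, PER_SERVER, SLOW_BACKEND)
    print(f"{servers:>2} servers: healthy backend {healthy}, slow backend {slow}")
```

Under this model, ten servers reach 5,000 users only when the shared component keeps up. When the bottleneck is in the code or the queries, the ceiling stays where it was no matter how many servers are added.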

On September 30th at 4pm, just hours before the site went live, another member of the CMS team emailed project manager Chao. In addition to describing problems handling enough concurrent users, he made clear that significant portions of the site had not yet been tested [emphasis added]:

Currently we are seeing performance degradation starting around 1100-1200 concurrent users, and most of the pages (except few) are responding within 10 seconds at the load. Few transactions/pages are taking longer such as Application Summary Save, Family & Household Summary Save, Race and Ethnicity, etc. — which should be investigated by the development team, and should be brought to the attention of Monitoring & Helpdesk team. As of today we’re only focusing on individual application performance testing and going all the way to application submission. The plan is to continue testing to ramp up load to 10k concurrent users. Plan compare has not yet been tested. Currently we cannot get to Plan Compare because the plan data not loaded in ____ yet.

A few paragraphs later the author gives Chao this warning, “Bottom-line we need to focus more on application tuning (code, query optimization etc) rather just increasing the infrastructures, otherwise by the time we shoot for 50,000 concurrent users we may run out….” Again, this email was sent out just hours before the site went live.

Of course the site did “run out.” There was no certainty it could handle 10,000 users, much less 50,000, and the number who logged on in 36 states was much greater than 50,000. The administration was quick to blame the failure on a surge of traffic. Its spokespeople failed to mention that the site was not ready for even a modest amount of traffic on launch day.

Project manager Chao was warned at least twice that problems with the site were such that they could not be fixed by multiplying infrastructure. What was needed was more time to address problems with the code. Despite the warnings, someone within the administration made the decision to push the site out anyway.

In any private enterprise, the person who made that decision would likely be looking for a new job now. So far no one has been held accountable for the failure of