IRS Loses Lois Lerner Emails

The IRS told Congress yesterday that two years of emails belonging to Lois Lerner, manager of its Tax Exempt Organizations department, were irretrievably lost when her desktop's hard drive crashed in 2011. Since this is a technology blog, let's ask: how could that happen?

The Internal Revenue Service has 90,000 employees working in a complex financial-services organization. Like its private-sector counterparts, the IRS has a sophisticated Information Technology organization because the IRS mission is implementing the tax laws of the United States. The IRS is the epitome of a paper-pushing organization, and by 2011 paper-pushing was done by email.

1. The IRS first installed Microsoft's enterprise email product in 1998 (Exchange in the data center, Outlook on client desktops), about the same time as many Fortune 500 organizations. By 2011, the IRS had over a decade of operational experience.

2. Hard drives are the weak link in IT installations. These mechanical devices fail at a rate of about 5% a year. With 90,000 employees, that works out to an average of 4,500 failed drives a year, or roughly 18 every working day. The IRS IT staff is thoroughly familiar with the consequences of user-PC hard drive failures; data-center storage management is a page from the same book.

3. The IRS reported to Congress that senior executive Lerner's hard drive failed and that nothing could be recovered from it, even after forensic testing. As a result, the IRS claims, there is no record of the emails sent or received from Ms. Lerner's computer. The thousands of emails recovered to date were extracted from the mailboxes of senders or recipients within the IRS, not from Lerner's files. There is no record of emails to other federal departments, to outside organizations, or to personal accounts.

4. The implication is that the Lerner email history resided only on her computer. There is no other IT explanation. Yet Microsoft Exchange in the data center stores copies of all email chains on multiple hard drives across multiple, synchronized email servers. That is how all enterprise email systems have to work. So the facts as stated make no IT sense.
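The hard-drive numbers in point 2 are simple to sanity-check; the per-day figure depends on how many working days one assumes. A minimal sketch (the one-drive-per-PC and 250-working-day assumptions are mine, not IRS figures):

```python
# Back-of-the-envelope drive-failure arithmetic for a 90,000-seat fleet.
# Assumptions (not IRS-published figures): one drive per employee PC,
# a 5% annualized failure rate, and about 250 working days per year.
EMPLOYEES = 90_000
ANNUAL_FAILURE_RATE = 0.05
WORKING_DAYS = 250

failures_per_year = EMPLOYEES * ANNUAL_FAILURE_RATE
failures_per_day = failures_per_year / WORKING_DAYS

print(f"{failures_per_year:.0f} failed drives per year")   # 4500
print(f"{failures_per_day:.0f} failed drives per workday") # 18
```

At that volume, replacing crashed user drives is routine daily work for a large IT shop, which is exactly why standard backup practices exist.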

But let’s look at the implications of a setup in which the Lerner email history resided only on her computer and the hard drive failed so completely that nothing could be recovered:

  • Where are the Lerner PC backups? With a 5% annual failure rate, industry-wide PC backup strategies are as old as centralized email itself, so there should be Lerner PC backups made by IRS IT. Leave it up to the user to make backups? No organization the size of the IRS allows that, for all the obvious reasons, starting with the fact that it doesn’t work in practice.
  • How could Lois Lerner do her work? The hard drive was lost and there were no PC backups. Besides two years’ worth of emails, GS-15 department head Lerner would also have lost all the data of a digital business life: calendar, contacts, personnel notes, work-in-process plans, schedules, meeting notes, reviews, budget spreadsheets, official IRS rulings.
    It is inconceivable that a modern executive could be stripped of all her business data and not face-plant within a week. Could you? Not me. Nobody keeps paper backups of everything anymore; even your business smartphone backs up to your PC.
  • The Exchange servers log every email coming into and going out of the IRS. Did the whole set of IRS backup tapes fail in an unreported catastrophe? That primary (but undiscovered) failure would make the routine failure of Lerner’s PC unrecoverable.
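The point underlying these bullets can be made concrete with a toy model. In any Exchange-style deployment, the server (and its backups) holds the canonical mailbox; the client PC holds only a cache. The sketch below is purely illustrative, with invented class and method names, and is not a description of the actual IRS configuration:

```python
# Toy model of client-cache vs. server-side mail storage.
# All names (Mailbox, deliver, lose_client_drive) are invented for illustration.

class Mailbox:
    def __init__(self, replicas=2):
        # The server keeps the message store on multiple replicas,
        # mirroring how enterprise email servers synchronize copies.
        self.server_replicas = [[] for _ in range(replicas)]
        self.client_cache = []    # the copy on the user's PC hard drive

    def deliver(self, message):
        for replica in self.server_replicas:
            replica.append(message)        # synchronized server copies
        self.client_cache.append(message)  # local Outlook-style cache

    def lose_client_drive(self):
        self.client_cache = []             # the 2011 crash scenario

    def recoverable(self):
        # Mail survives as long as any server replica (or its tape backup) does.
        return self.server_replicas[0]

box = Mailbox()
box.deliver("2010-03-01: re: exemption applications")
box.lose_client_drive()
assert box.recoverable()  # server copies outlive the client disk
```

In this model, losing the client drive loses nothing permanent; only the simultaneous, unreported loss of every server replica and backup would make the mail unrecoverable, which is the scenario the bullets above find implausible.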

I cannot think of an acceptable reason for the unexplained yet unrecoverable loss of the data on Lerner’s PC under the usual practices of every IT organization I have worked with over the decades. Which leaves only two alternatives: a much clearer explanation from IRS IT professionals of how these events could have happened, or something nefarious going on.

Follow me on Twitter @peterskastner

The author’s experience with federal email and records management began with the Ronald Reagan White House in 1982.

Email Inbox

HealthCare.Gov: IT Rules Broken, Mistakes Made

Numerous friends, neighbors and clients have asked me about the IT fiasco in the eight weeks since the Obamacare federal exchange project, HealthCare.gov, was launched. “How did it happen and what went wrong?” they ask. Lacking subpoena power, I can only draw from experience and common sense. There were lots of information technology (IT) rules broken and mistakes made. Someone else can write the book.

Performance, delivery date, features, and quality are a zero sum game
The project was executed with the expectation of complex features, very high volume and performance metrics, robust production quality, and a drop-dead date of October 1, 2013. Even with all the money in the federal budget, tradeoffs are still necessary in the zero-sum equation of successful project management. The tradeoffs were clearly not made, the system went live October 1, and the results speak for themselves.

The Feds are different
Federal IT project procurement and management differ from the private sector like night and day. Some of the major factors include:

  • Politics squared.
  • IT procurement regulations are a gamed system that’s broken. Everybody in Washington knows it and nobody will do anything about it.
  • The federal government does little programming and development in-house. Most is contracted out.
  • The culture lacks accountability and real performance metrics.

The HealthCare.gov website is really a complex online marketplace. It’s not the first, though. HealthCare.gov has taken longer to complete than World Wars I and II, the building of the atomic bomb, and putting a man in space.

Too many cooks in the kitchen
The specifications were never really frozen in a version approved by all stakeholders. That meant the programmers never worked with a fixed design.

The HealthCare.gov project was always surrounded by politics and executive branch oversight that led to design changes, such as the late summer decision to graft a rigorous registration process into the site before users could see policy choices.

No surprise that this high-visibility project would have lots of micro-management. But the many IT contractors working on the project (over fifty) had no incentive to trade off time-to-completion against feature changes. They weren’t in charge. The timeline slipped, a lot.

Who’s in charge? Can the project manager do the job?
There was no take-charge project manager responsible for this half-billion-dollar undertaking. The Centers for Medicare & Medicaid Services (CMS) was assigned oversight, under an executive without extensive experience in complex system integration projects. Day-to-day coordination of the more than fifty contractors working on HealthCare.gov was obviously lacking, and project management was sub-par; witness that the first remedy in October was to assign government technicians with exactly that experience.

The White House trusted its own policy and political teams rather than bringing in outsiders with more experience putting in place something as technically challenging as HealthCare.gov and the infrastructure to support it.

Absent a take-charge project management team, the individual IT contractors pretty much did their own thing by programming the functions assigned to them and little else. This is obvious from all the finger-pointing about parts of the site that did not work October 1st. The lack of systems integration is telling.

After October 1st, a lead project manager (a contractor) was appointed.

We don’t need no stinking architecture.
Why was the federal website at Healthcare.gov set up the way it was? How were the choices of architecture made — the overall technology design for the site?

Everybody now knows the site needs to handle volumes of millions of subscribers during a short eligibility and sign-up window between October 1 and December 15th (now extended to the 23rd). The HealthCare.gov website handles 36 states. Each state has different insurance plans at different tiers and prices, but every state’s site works identically: each offers bronze, silver, and gold policies from one or more insurers to compare.

The Healthcare.gov website was architected as one humongous site for all 36 states handling all the visitors in a single system. All the eggs were put in one basket.

An alternative architecture would immediately route visitors from the home page to 36 separate but technologically identical state-level systems. More operations management, but vastly greater opportunities to scale out and scale up for volume.

Another benefit of a single application code base replicated dozens of times: it would have mitigated the risk and expense of the states that chose to run their own sites. The 14 states that built their own sites all did it differently, to no national benefit. While California succeeded, Oregon has yet to enroll a customer online. We paid 14 times.
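The alternative architecture described above amounts to a front door that shards by state. A minimal sketch; the hostnames and the `route_visitor` helper are invented for illustration, not part of any actual HealthCare.gov design:

```python
# Hypothetical front-door router: one identical application stack per state.
# Hostnames below are invented examples.

STATE_SHARDS = {
    state: f"https://{state.lower()}.exchange.example.gov"
    for state in ["TX", "FL", "OH", "PA"]  # ...and the other federal-exchange states
}

def route_visitor(state_code: str) -> str:
    """Send each visitor to their state's independent, identical stack.

    A failure or overload in one shard leaves the other shards unaffected,
    and each shard can be scaled up or out on its own.
    """
    try:
        return STATE_SHARDS[state_code]
    except KeyError:
        raise ValueError(f"{state_code} runs its own exchange") from None

print(route_visitor("TX"))  # https://tx.exchange.example.gov
```

The design choice is the classic one between a single scaled-up system and many small replicated ones; replication trades extra operations work for isolation of failures and far simpler capacity planning per shard.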

Another questionable architectural decision is the real-time “federal data hub” that makes eligibility and subsidy decisions. As designed, the website queries numerous agencies, including Social Security, Homeland Security, Internal Revenue, and Immigration, to make a complicated, regulation-driven decision on whether the (waiting) customer is qualified to buy a policy and what federal subsidies, if any, the customer gets to reduce policy premium costs.

This design approach puts a strain on all of the agency systems, and leads to inevitable response time delays as all the data needed to make a determination is gathered in real time. It also requires that agency systems not designed for 24/7 online operations reinvent their applications and operations. This hasn’t happened.

An alternative design would make some or all of the determination in batch programs run ahead of time, much like a credit score is instantly available to a merchant when you’re applying for new credit. That would greatly un-complicate the website and reduce the time online by each user.
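The batch alternative just described amounts to trading real-time cross-agency fan-out for a precomputed lookup. A minimal sketch with invented function names and toy data, assuming determinations could be computed overnight:

```python
# Hypothetical eligibility precomputation: an overnight batch job scores
# applicants against agency data, so the website performs a single fast
# lookup instead of querying SSA, IRS, DHS, etc. while the customer waits.
# All names, identifiers, and the subsidy formula are invented for illustration.

def overnight_batch(applicants):
    """Runs off-peak; the slow cross-agency checks happen here, not online."""
    determinations = {}
    for applicant_id, income in applicants:
        subsidy = max(0, 5_000 - income // 10)  # toy subsidy formula
        determinations[applicant_id] = {"eligible": True, "subsidy": subsidy}
    return determinations

CACHE = overnight_batch([("applicant-001", 30_000)])

def website_lookup(applicant_id):
    """What the website does in real time: one dictionary (or database) read."""
    return CACHE.get(applicant_id)  # milliseconds, no live agency calls

print(website_lookup("applicant-001"))
```

This is the same pattern as credit scoring: the expensive aggregation is done ahead of time, and the point-of-sale interaction is a cheap read, which keeps agency back-ends off the critical path.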

Security as an add-on, not a built-in.
The HealthCare.gov website needs the most personal financial information to make eligibility decisions. Therefore, it’s shocking that data security and privacy are not an integral and critical part of the system design, according to testimony by security experts to Congress. This is a glaring lapse.

We’re not going to make it, are we?
By April 2013, the project was well underway. An outside study by McKinsey looked at the project and pointed out the likelihood of missing the October deadline, and the unconventional approaches used to date on the project.

There’s almost always a cathartic moment in an IT project with a fixed deadline when the odds favor a missed deadline.

The HealthCare.gov project broke at least three IT project management best practice rules: overloading the roster, succumbing to threatening incentives, and ignoring human resistance to being cornered.

Let’s throw more people at the project.
More labor was thrown at the project both before and after October 1st. That slowed the project down even more. An IT management mentor explained it to me early in my career this way: “If you need a baby in a month, nine women can’t get that job done. Not possible. Try something different, like kidnapping.” The principle is known as Brooks’ Law: “adding manpower to a late software project makes it later.”

“Failure is not an option!”
When managers yell “Failure is not an option!” is it any surprise that project management reporting immediately becomes a meaningless exercise? The managers were flogging the troops to make the deadline. So the troops reported all sorts of progress that hadn’t actually happened. Passive resistance is real.

It therefore comes as no surprise when managers post-launch “concluded that some of the people working in the trenches on the website were not forthcoming about the problems.” It’s a fairytale world where nobody has ever read the real world depicted in Scott Adams’ Dilbert comic strip.

The project passed six technology reviews, and CMS told Congress all was well. CMS IT management approved the green-light schedule status on the website; it’s still online here. It was completely inaccurate.

“We’ll just all work 24/7 until we’re over the hump.”
As I write this a week ahead of the December 1 “drop-dead date 2.0”, it’s hard to fathom how many weeks the project teams have been working late, nights, and weekends. Soldiers are kept on the front line for only days or weeks, except in the most dire circumstances. The military knows from experience that troops need R&R. So do IT troops.

“Let’s turn it on all at once.”
Sketchy project progress. Testing done at the integration level; no time left to do system testing or performance testing. It’s September 30, 2013. The decision is made: “Let’s turn it on all at once.”

Turning the whole system on at once is known as a “light-switch conversion”. You flip the switch and it’s on. In the case of HealthCare.gov, the circuit breaker blew, and that’s pretty much the case since day one. Now what to do?

“Where’s the problem at?”
From the moment the website was turned on, the system throughput — enrollments per day — was somewhere between none and awful. We’ve all heard the stories of more than a million abandoned registrations and a pitiful number of successful enrollments. Where were the bottlenecks? The project team did not know.

They didn’t know because there was practically no internal instrumentation of the software. We know this because the tech SWAT team brought in after October first said so.
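Instrumentation of the kind the SWAT team found missing is cheap to add up front. A minimal sketch, assuming nothing about the actual HealthCare.gov codebase; the `timed` decorator and step names are invented for illustration:

```python
# Hypothetical timing instrumentation: wrap each pipeline step so the team
# can see where enrollments stall. All names are invented for illustration.
import time
from functools import wraps

TIMINGS = {}  # step name -> list of observed durations in seconds

def timed(step_name):
    """Decorator that records wall-clock time per pipeline step."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                TIMINGS.setdefault(step_name, []).append(
                    time.perf_counter() - start)
        return wrapper
    return decorator

@timed("identity_check")
def identity_check(user):
    time.sleep(0.05)  # stand-in for a slow federal data hub call
    return True

identity_check("applicant-1")
slowest = max(TIMINGS, key=lambda step: sum(TIMINGS[step]))
print(f"slowest step: {slowest}")
```

With per-step timings in hand, "where's the bottleneck?" becomes a sorted report rather than a guessing game, which is presumably why the post-launch team added instrumentation first.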

What happens next?
The next drop-dead date is December 1. There is no public report that the system has reached its 60,000-simultaneous-user design goal, nor that the “critical problem list” has been solved.

All the body language and weasel-wording suggest the website will support more simultaneous users than at launch two months ago. Do not be surprised if a press release admitting failure is quietly issued during the Thanksgiving holiday rush.

To make the exchange insurance process work, a couple of million (or more, when you include cancelled policies that must be replaced) enrollments need to take place in December. If that’s not possible, there needs to be a clock reset on the law’s implementation. The pressure to do so will be irresistible.

There is no assurance that the enrollments made to date supply accurate data. Insurance companies are reviewing each enrollment manually. That process is not scalable.

It was sobering to hear Congressional testimony this week that 30%-40% of the project’s coding had yet to be completed. Essentially, the backend financial system is not done or tested. That system reconciles customer policy payments and government subsidies, and makes payments to insurers. If the insurer does not have a payment, you’re not a customer. Which makes the remaining code yet another “failure is not an option” item to get working correctly by Christmas.

There is no backup plan. Telephone and paper enrollments all get entered into the same HealthCare.gov website.

There is a high probability of a successful attack. One thorough security hack and public confidence will dissolve. There’s a difference between successful and thorough, which leaves a lot of room for spin.

If individual healthcare insurance is 5% of the market, and we’ve had all these problems, what happens next year when the other 95% is subjected to ACA regulations? No one knows.

Comment here or tweet me @peterskastner

The author has ten years’ experience as a programmer, analyst, and group project manager for a systems integrator; established government marketing programs at two companies; has over twenty years’ experience in IT consulting and market research; and has served as an expert witness regarding failed computer projects.

HealthCare.gov improvements are in the works