Self-Driving Software: Why We Need E Pluribus Unum

Today, numerous large and small companies around the world are working diligently on perfecting their company’s self-driving software. All the large traditional automobile companies are included as well as large technology firms such as Google, Intel and Microsoft, and even Uber. These companies are working in true twentieth-century capitalist fashion: they’re doing it all independently and secretly. This approach leads to sub-optimal technology and foreseeable tragedies.

Self-Driving Vehicles Use Artificial Intelligence (AI)

Programming a self-driving vehicle (SDV) by traditional software-development methods is so fraught with complexity that no one, to my knowledge, is attempting it. So scrap that idea. Instead, developers have flocked to artificial intelligence, a red-hot technology built on rather old ideas about neural networks.

There’s a lot to AI technology beyond the scope of this blog. A quick Internet search will get you started on a deep dive. For today, let’s sketch a common approach to AI application development:

  • First, an AI rules-based model is fed real-world scenarios, rules, and practical knowledge. For example, “turning left into oncoming traffic (in the USA but not the UK) is illegal and hazardous and will likely result in a crash. Don’t do that.” This first phase is the AI Learning Phase.
  • Second, the neural network created in the learning phase is executed in a vehicle, often on a specialized chip, graphics processing unit (GPU) or multi-processor. This is the Execution Phase.
  • Third, the execution unit records real-world observations while driving, eventually feeding them back into the learning model.

The Problem of Many

Here’s the rub. Every SDV developer is on its own, creating a proprietary AI model with its own set of learning criteria. Each AI model is only as good as the data fed into its learning engine.

No single company is likely to encounter or imagine all of the third standard-deviation, Black Swan events that can and will lead to vehicle tragedies and loss of life. Why should Tesla and the state of Florida be the only beneficiaries of the lessons from a particular fatal crash? The industry should learn from the experience too. That’s how society progresses.

Cue the class-action trial lawyers.

E Pluribus Unum

E Pluribus Unum is Latin for “out of many, one”. (Yes, it’s the motto of the United States.) My proposal is simple:

  • The federal government should insist that all self-driving vehicles use an AI execution unit that is trained in its learning phase with an open-source database of events, scenarios, and real-world feedback. Out of many AI training models, one model.
  • The Feds preempt state regulation of core AI development and operation.
  • Vehicles that use the federalized learning database for training receive limited class-action immunity, just like we now do with immunization drugs.
  • The Feds charge fees to the auto industry that cover the costs of the program.


From a social standpoint, there’s no good reason for wild-west capitalism over proprietary AI learning engines that lead to avoidable crashes and accidents. With one common AI learning database, all SDVs will get smarter, faster because they are benefiting from the collective experience of the entire industry. By allowing and encouraging innovation in AI execution engines, the industry will focus on areas that yield better-faster-cheaper-smaller products and not on avoiding human-risk situations. Performance benchmarks are a well-understood concept.

Philosophically, I don’t turn first to government regulation. But air traffic control, railroads, and numerous aspects of medicine are regulated without controversy. Vehicle AI is ripe for regulation before production vehicles are built by the millions over the next decade.

I am writing this blog because I don’t see the subject being discussed. It ought to be.

Comments and feedback are welcome. See my feed on Twitter @peterskastner.

IRS Loses Lois Lerner Emails

The IRS told Congress yesterday that two years of emails on Tax Exempt Organizations department manager Lois Lerner’s desktop were irretrievably lost due to a hard drive crash in 2011. As this is a technology blog, the question here is: how could this event happen?

The Internal Revenue Service has 90,000 employees working in a complex financial-services organization. Like its private-sector counterparts, the IRS has a sophisticated Information Technology organization because the IRS mission is implementing the tax laws of the United States. The IRS is the epitome of a paper-pushing organization, and by 2011 paper-pushing was done by email.

1. The IRS first installed Microsoft’s enterprise email product, Exchange in the data center and Outlook on client desktops in 1998, about the same time as many Fortune 500 organizations. By 2011, the IRS had over a decade of operational experience.

2. Hard drives are the weak link in IT installations. These mechanical devices fail at the rate of about 5% a year. With 90,000 employees, that works out to an average of 4,500 a year or 22 per work day. The IRS IT staff is very familiar with the consequences of user-PC hard drive failures. Data center storage management is another leaf in the same book.

3. The IRS reported to Congress that senior executive Lerner’s hard drive failed, and nothing could be recovered from it. It was forensically tested. As a result, the IRS claims, there is no record of the emails that were sent or received from Ms. Lerner’s computer. The thousands of emails recovered to date were extracted from sender or recipient email lists within the IRS, not from Lerner’s files. There is no record of emails to other federal departments, or to other organizations or  personal emails.

4. The implication is that the Lerner email history only resided on her computer. There is no other IT explanation.  Yet Microsoft Exchange in the data center stores copies of all email chains on multiple hard drives on multiple, synchronized email servers. That’s the way all enterprise email systems have to work. So the facts as stated make no IT sense.
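The failure arithmetic in point 2 can be checked with a short sketch. It assumes one hard drive per employee and treats the 5% figure as a flat annual rate; the function name is hypothetical. The per-day result depends on how many work days are assumed: 250 work days gives about 18 failures a day, while the 22 quoted above corresponds to roughly 205 work days.

```python
def expected_drive_failures(drive_count, annual_failure_rate, work_days_per_year):
    """Return (expected failures per year, expected failures per work day).

    Assumes one drive per employee and a flat annual failure rate.
    """
    per_year = drive_count * annual_failure_rate
    per_work_day = per_year / work_days_per_year
    return per_year, per_work_day


# 90,000 drives at 5% a year; 250 work days is an assumption, not an IRS figure.
per_year, per_day = expected_drive_failures(90_000, 0.05, 250)
print(f"{per_year:.0f} failures a year, about {per_day:.0f} per work day")
```

Either way, the point stands: drive failures are a daily, routine event for an IT shop of this size.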

But let’s look at the implications of a strategy where the Lerner email history only resided on her computer and the hard drive failed completely so nothing could be recovered:

  • Where are the Lerner PC backups? With a 5% annual failure rate, industry-wide PC backup strategies are as old as centralized email. There should be Lerner PC backups made by IRS IT. Leave it up to the user to make backups? No organization the size of the IRS allows that for all the obvious reasons that come to mind, starting with it doesn’t work in practice.
  • How could Lois Lerner do her work? The hard drive was lost and there were no PC backups. Besides losing two years’ worth of emails, GS-15 department head Lerner also had to lose all the data of a digital business life: calendar; contacts; personnel notes; work-in-process plans, schedules, meeting notes, reviews, budget spreadsheets, official IRS rulings.
    It is inconceivable that a modern executive could be stripped of all her business data and not face-plant within a week. Could you? Not me. Nobody has paper backup for everything anymore. Your business smartphone backs up to your PC.
  • The Exchange servers log every email coming into and going out of the IRS. Did the whole set of IRS backup tapes fail in an unreported catastrophe? That primary (but undiscovered) failure would make the routine failure of Lerner’s PC unrecoverable.

I cannot think of an acceptable reason for the unexplained yet unrecoverable loss of the data on Lerner’s PC while following the usual practices every IT organization I have worked with over decades. Which leaves only two alternatives: a much clearer explanation from IRS IT professionals of how these events could happen; or something nefarious is going on.

Follow me on Twitter @peterskastner

The author’s experience with federal email and records management began with the Ronald Reagan White House in 1982.

HealthCare.Gov: IT Rules Broken, Mistakes Made

Numerous friends, neighbors and clients have asked me about the IT fiasco in the eight weeks since the  Obamacare federal exchange project, HealthCare.Gov, was launched. “How did it happen and what went wrong?”, they ask. Lacking subpoena power, I can only draw from experience and common sense. There were lots of information technology (IT) rules broken and mistakes made. Someone else can write the book.

Performance, delivery date, features, and quality are a zero sum game
The project was executed with the expectation of complex features, very high volume and performance metrics, robust production quality expectations, and a drop-dead date of October 1, 2013. Even with all the money in the federal budget, tradeoffs are still necessary in the zero-sum equation of successful project management. The tradeoffs were obviously not made, the system went live October 1, and the results are obvious.

The Feds are different
Federal IT project procurement and management differ from the private sector like night and day. Some of the major factors include:

  • Politics squared.
  • IT procurement regulations are a gamed system that’s broken. Everybody in Washington knows it and nobody will do anything about it.
  • The federal government does little programming and development in-house. Most is contracted out.
  • The culture lacks accountability and real performance metrics.

The website is really a complex online marketplace. It’s not the first, though. HealthCare.Gov has taken longer to complete than World Wars I and II, the building of the atomic bomb, and putting a man in space.

Too many cooks in the kitchen
The specifications were never really frozen in a version approved by all stakeholders. That meant the programmers never worked with a fixed design.

The project was always surrounded by politics and executive branch oversight that led to design changes, such as the late summer decision to graft a rigorous registration process into the site before users could see policy choices.

No surprise that this high-visibility project would have lots of micro-management. But the many IT contractors working on the project (over fifty) had no incentive to trade off time-to-completion against feature changes. They weren’t in charge. The timeline slipped, a lot.

Who’s in charge? Can the project manager do the job?
There was no take-charge project manager responsible for this half-billion dollar undertaking. The Centers for Medicare & Medicaid Services (CMS) was assigned oversight by an executive without extensive complex system integration project experience. It’s obvious that day-to-day coordination of the more than fifty contractors working on HealthCare.Gov was lacking. That project management was sub-par is evident from the first remedy in October: assigning government technicians with such experience.

The White House trusted its own policy and political teams rather than bringing in outsiders with more experience putting in place something as technically challenging as HealthCare.Gov and the infrastructure to support it.

Absent a take-charge project management team, the individual IT contractors pretty much did their own thing by programming the functions assigned to them and little else. This is obvious from all the finger-pointing about parts of the site that did not work October 1st. The lack of systems integration is telling.

After October 1st, a lead project manager (a contractor) was appointed.

We don’t need no stinking architecture.
Why was the federal website at HealthCare.Gov set up the way it was? How were the choices of architecture, the overall technology design for the site, made?

Everybody now knows the site needs to handle the volumes of millions of subscribers during a short eligibility and sign-up window between October 1 and December 15th (now extended to the 23rd). The website handles 36 states. Each state has different insurance plans at different tiers and prices, but the way an individual state site works is identical to other states. Every one has bronze, silver, and gold policies from one or more insurers to compare.

The website was architected as one humongous site for all 36 states handling all the visitors in a single system. All the eggs were put in one basket.

An alternative architecture would immediately route visitors from the home page to 36 separate but technologically identical state-level systems. More operations management, but far greater opportunities for scale-out and scale-up for volume.
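A minimal sketch of that routing idea follows. The state codes, host names, and function name are hypothetical and only illustrate the shape of the design: the home page does nothing but hand each visitor to one of many identical state-level systems.

```python
# Hypothetical state codes and host names, for illustration only.
STATE_BACKENDS = {
    "TX": "https://tx.exchange.example",
    "FL": "https://fl.exchange.example",
    "OH": "https://oh.exchange.example",
}


def route_visitor(state_code):
    """Return the base URL of the state-level system that should serve this visitor."""
    backend = STATE_BACKENDS.get(state_code.strip().upper())
    if backend is None:
        raise ValueError(f"no exchange system configured for {state_code!r}")
    return backend


print(route_visitor("tx"))
```

Because every backend runs the same code, a state that is overloaded can be scaled up or replicated without touching the other 35.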

Another benefit of a single application code-base replicated dozens of times is that the risk and expense of states that chose to run their own sites can be mitigated. The 14 states that built their own sites all did it differently, to no national benefit. While California succeeded, Oregon has yet to enroll a customer online. We paid 14 times.

Another questionable architectural decision is the real-time “federal data hub” that makes eligibility and subsidy decisions. As designed, the website queries numerous agencies including Social Security, Homeland Security, Internal Revenue, Immigration and other agencies to make a complicated, regulation-driven decision on whether the (waiting) customer is qualified to buy a policy and what federal subsidies, if any, the customer gets to reduce policy premium costs.

This design approach puts a strain on all of the agency systems, and leads to inevitable response time delays as all the data needed to make a determination is gathered in real time. It also requires that agency systems not designed for 24/7 online operations reinvent their applications and operations. This hasn’t happened.

An alternative design would make some or all of the determination in batch programs run ahead of time, much like a credit score is instantly available to a merchant when you’re applying for new credit. That would greatly un-complicate the website and reduce the time online by each user.
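The batch alternative can be sketched as follows. The eligibility rule, income threshold, field names, and function names are all placeholders invented for illustration; the real determination is a multi-agency, regulation-driven decision. The point is only that the expensive work happens ahead of time and the website does a single lookup.

```python
def determine_eligibility(applicant):
    """Placeholder rule standing in for the real multi-agency determination."""
    return {
        "eligible": applicant["lawfully_present"],
        # Hypothetical threshold, not taken from the law or the article.
        "subsidy": applicant["household_income"] < 45_000,
    }


def run_nightly_batch(applicants):
    """Batch step: compute every determination ahead of time, keyed by applicant id."""
    return {a["id"]: determine_eligibility(a) for a in applicants}


def lookup_at_signup(precomputed, applicant_id):
    """Online step: one dictionary read while the customer waits."""
    return precomputed.get(applicant_id)


applicants = [
    {"id": "A1", "lawfully_present": True, "household_income": 30_000},
    {"id": "A2", "lawfully_present": True, "household_income": 90_000},
]
precomputed = run_nightly_batch(applicants)
print(lookup_at_signup(precomputed, "A1"))
```

Applicants missing from the batch results would still need a slower fallback path, which this sketch leaves out.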

Security as an add-on, not a built-in.
The website needs the most personal financial information to make eligibility decisions. Therefore, it’s shocking that data security and privacy are not an integral and critical part of the system design, according to testimony by security experts to Congress. This is a glaring lapse.

We’re not going to make it, are we?
By April 2013, the project was well underway. An outside study by McKinsey looked at the project and pointed out the likelihood of missing the October deadline, and the unconventional approaches used to date on the project.

There’s almost always a cathartic moment in an IT project with a fixed deadline when the odds favor a missed deadline.

The project broke at least three IT project management best practice rules: overloading the roster, succumbing to threatening incentives, and ignoring human resistance to being cornered.

Let’s throw more people at the project.
More labor was thrown at the project both before and after October 1st. That slowed the project down even more. An IT management mentor explained it to me early in my career this way: “If you need a baby in a month, nine women can’t get that job done. Not possible. Try something different, like kidnapping.” This later became known as Brooks’ Law which says “adding manpower to a late software project makes it later”.
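One reason Brooks gives for the law is that the number of communication paths grows much faster than headcount: n parties have n(n-1)/2 possible pairings. A small sketch makes the point, applied here to the fifty-plus contractors mentioned earlier. Treating each contractor as a single party is a simplifying assumption, and the function name is hypothetical.

```python
def communication_paths(party_count):
    """Number of distinct pairs among party_count parties: n * (n - 1) / 2."""
    return party_count * (party_count - 1) // 2


for parties in (5, 10, 50):
    print(parties, "parties ->", communication_paths(parties), "possible pairings")
```

Going from 10 parties to 50 multiplies headcount by five but the possible pairings by more than twenty-five.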

“Failure is not an option!”
When managers yell “Failure is not an option!” is it any surprise that project management reporting immediately becomes a meaningless exercise? The managers were flogging the troops to make the deadline. So the troops reported all sorts of progress that hadn’t actually happened. Passive resistance is real.

It therefore comes as no surprise when managers post-launch “concluded that some of the people working in the trenches on the website were not forthcoming about the problems.” It’s a fairytale world where nobody has ever read the real world depicted in Scott Adams’ Dilbert comic strip.

The project passed six technology reviews, and Congress was told all was well. CMS IT management approved the green-light schedule status on the website. It’s still online here. Completely inaccurate.

“We’ll just all work 24/7 until we’re over the hump.”
As I write this a week ahead of the December 1 “drop-dead date 2.0”, it’s hard to fathom how many weeks the project teams have been working late, nights, and weekends. Soldiers are kept on the front line for only days or weeks at a time, except in the most dire circumstances. The military knows from experience that troops need R&R. So do IT troops.

“Let’s turn it on all at once.”
Sketchy project progress. Testing done at the integration level; no time left to do system testing or performance testing. It’s September 30, 2013. The decision is made: “Let’s turn it on all at once.”

Turning the whole system on at once is known as a “light-switch conversion”. You flip the switch and it’s on. In the case of HealthCare.Gov, the circuit breaker blew, and that’s pretty much been the case since day one. Now what to do?

“Where’s the problem at?”
From the moment the website was turned on, the system throughput — enrollments per day — was somewhere between none and awful. We’ve all heard the stories of more than a million abandoned registrations and a pitiful number of successful enrollments. Where were the bottlenecks? The project team did not know.

They didn’t know because there was practically no internal instrumentation of the software. We know this because the tech SWAT team brought in after October first said so.
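For readers who have not seen it, internal instrumentation can be as simple as timing each stage of a transaction so the slow one stands out. The sketch below is a generic illustration, not a description of the actual site; the stage names are hypothetical and the sleeps stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Elapsed seconds recorded per stage name.
stage_timings = defaultdict(list)


@contextmanager
def timed(stage):
    """Record how long the enclosed block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[stage].append(time.perf_counter() - start)


def slowest_stage():
    """Return the stage with the largest total recorded time."""
    totals = {stage: sum(samples) for stage, samples in stage_timings.items()}
    return max(totals, key=totals.get)


# Hypothetical stages; the sleeps simulate work of different durations.
with timed("create_account"):
    time.sleep(0.01)
with timed("verify_identity"):
    time.sleep(0.05)

print("bottleneck:", slowest_stage())
```

With numbers like these collected in production, “where are the bottlenecks?” becomes a report rather than a guess.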

What happens next?
The next drop-dead date is December 1. There is no public report that the system has reached its design goal of 60,000 simultaneous users, nor that the “critical problem list” has been solved.

All the body language and weasel-wording suggests the website will support more simultaneous users than at launch two months ago. Do not be surprised at a press release admitting failure hidden during the Thanksgiving holiday rush.

To make the exchange insurance process work, a couple of million (or more, when you include cancelled policies that must be replaced) enrollments need to take place in December. If that’s not possible, there needs to be a clock reset on the law’s implementation. The pressure to do so will be irresistible.

There is no assurance that the enrollments made to date supply accurate data. Insurance companies are reviewing each enrollment manually. That process is not scalable.

It was sobering to hear Congressional testimony this week that 30%-40% of the project coding had yet to be completed. Essentially, the backend financial system is not done or tested. That system reconciles customer policy payments and government subsidies, and makes payments to insurers. If the insurer does not have a payment, you’re not a customer. Which makes the remaining 30% of the system yet another “failure is not an option” to get working correctly by Christmas.

There is no backup plan. Telephone and paper enrollments all get entered into the same website.

There is a high probability of a successful attack. One thorough security hack and public confidence will dissolve. There’s a difference between successful and thorough which allows a lot of room for spin.

If individual healthcare insurance is 5% of the market, and we’ve had all these problems, what happens next year when the other 95% is subjected to ACA regulations? No one knows.

Comment here or tweet me @peterskastner

The author has ten years experience as a programmer, analyst, and group project manager for a systems integrator;  established government marketing programs at two companies; has over twenty years experience in IT consulting and market research; and, has served as an expert witness regarding failed computer projects.

Japan Earthquake: Toting Up the Supply Risk

My first business reaction to the Friday earthquake in Japan was that it happened in a country better prepared than any other to withstand an earthquake, and that tsunami damage would be limited to a few miles of the coast. By Saturday, I was concerned that electricity was still out and that the country’s transportation system was tattered at best. By Tuesday, I was toting up the real and potential damage to the global electronics supply  chain.

The world’s electronics supply chain is a beautiful, enormously complex process that typically runs better than a Swiss watch. But inventories are often measured in days and sometimes hours, as clusters of factories manufacture parts for final assembly with only minimal transportation delays.

When things go wrong in the industry, there are often backup geographical or alternate suppliers that can close a supply-chain gap, since no sane CEO wants to shut factories due to lack of component supply.

Japan’s prefectures of Fukushima, Miyagi, Aomori, Yamagata, Iwate and Akita bore the brunt of the earthquake and tsunami. They are also key producers, with double-digit shares, for the global electronics industry of silicon wafers, DRAM memory chips, NAND memory chips, disk drives, passive devices, lithium batteries for mobile devices, and glass for PC and gadget screens, to name just a few of the requisite components in virtually every modern electronic device.

So far, none of the major downstream electronics manufacturers (e.g., Sony, Apple, H-P, Acer, Dell) have voiced public concerns about anything more than a brief disruption as alternate supply sources are turned on. But note that publicly traded companies are loath to warn stockholders of potential risks, waiting until there is no way out of announcing bad news. So, the lack of concern on the airwaves is not a positive sign.

The worst-case assessment includes widespread nuclear contamination and decades of cleanup, knocking Japan out of its Number 3 spot in GDP. But that worst case has only a fraction of a percent chance of happening.

The bad case to watch for is a scenario that includes

  • prolonged electrical outages and brownouts — a bane of electronics production
  • a disrupted transportation system (both internal to Japan and via ships for export)
  • an inability to repair and restore damaged factories and production infrastructure
  • a months-long period of social unrest caused by an inability to put the earthquake and tsunami behind sufferers due to lack of food, water, housing, transportation, and knowledge of the fate of loved ones
  • a faltering government response including leadership, monetary policy, police, and moral suasion.

As of this morning, I’m very concerned that the bad case is quite possible: no one yet knows the whole scope of the damage, the time it will take to get back to a semblance of a modern economy, and the resulting impact on the global electronics supply chain.

Like the threat of the SARS virus in 2003 closing China’s factories to the world, the earthquake may have wiped out or seriously impacted a double-digit cohort of the electronics components industry. If Japan’s output is inhibited more than a few more days, the impact on Q2-2011 and subsequent quarterly revenues of major electronics products companies will head south quickly.

DRAM prices on the spot market jumped 5% yesterday as companies moved to shore up inventories.

I think the needed process is called “watchful waiting”, an apt expression.

Source: Digitimes

Cisco: The Lion King Fights for Data Center Fabric Leadership

It was hard to read the Wall Street Journal this morning and not realize that Cisco’s status as data center fabric-infrastructure Lion King is being challenged. And with the company’s stock at about a quarter of its zenith a decade ago, you have to wonder if the smart money is long gone. Now, the customers are moving on as well.

Lions, as all children and most parents know from watching the Disney movie, have an alpha leader who reigns as long as he is strong enough to fight off the ever-younger competition. I have to put Cisco in that spot. [Full disclosure: I once worked at Stratus with former Cisco CEO and chairman John Morgridge, a man I highly admire. But that was decades ago.]

As the WSJ reports,

… the growing competition Cisco faces in its biggest business, the $13.6 billion switching unit. The San Jose, Calif., company earlier this month reported a 7% drop in quarterly revenue from switches, a stumble for a business that represents about a third of Cisco’s $40 billion in annual revenue.

H-P is coming at Cisco from the bottom up with lower prices on switches, the ubiquitous component found in every data center rack. After buying 3Com, H-P’s price-concession strategy is having an effect on Cisco, which has a higher margin structure than H-P and other competitors. As a result, Cisco is gradually bleeding market share. Competitors smell blood and are circling the once invincible king.

Meanwhile, Juniper Networks is rolling out qFabric, a data center wiring and communications architecture that is aimed at incumbent Cisco, with a two-page color ad. The pitch is to lower IT operating expenses (op ex).

Every server in the data center typically requires one to four connecting cables. A brief glance at the photo below is all you need to know about the pressing need to lower capital expenses (i.e., Cisco’s Achilles heel as high-price incumbent) and lower operating expenses associated with set-up and management of a typical data center (i.e., the poor guy buried in cables during a server set-up or repair).

Photo: Buried in IT Cabling

I am hearing more from IT executives that status quo at lower prices is not a viable long-term outcome. That in turn leads me to think that Darwinian evolutionary jumps forward in server I/O fabric are a market opportunity waiting to happen. For example, NextIO, an Austin, Texas company, is coming to market this spring with a one-cable-per-server solution to rack-level I/O that connects to the data center fabric. PCI Express is the standards-based interface for all the server’s I/O in the NextIO solution.

I am drawn to the toast of UK citizens on the passing of a monarch: “The king is dead. Long live the king [queen].” King Cisco is certainly not dead yet. And it’s a fact that it’s nigh unto impossible to actually kill off a big IT company. But his reign is far along, and it’s not too early to start thinking and planning for a successor.

Computing in the Third World: A Success Story

I just got back from my third trip to Haiti, a third-world country hampered by numerous problems. Nevertheless, after two years in operation, our school installation is reporting 100% uptime. Careful planning on electrical infrastructure proved to be worth every penny.

The First Problem is Power

Funny how AC power, which in the U.S. we take so for granted, is completely problematical in the third world. Endemic power problems are the worst threat to achieving widespread computer adoption in the third world.

In Haiti, my recorded observations in two cities show total AC power grid failures at least once a day, at random times. Outages last from several minutes to several hours. In addition, power spikes and voltage problems are common. The results for personal computers are predictable: premature failure.

For example, I am involved in a 2008 project that put PCs into a K-12 school and teacher’s college in the city of Jacmel, on the Caribbean. To overcome the unacceptable mains power problems, our technology team installed the following infrastructure to support a 20-PC classroom:

  • Industrial window air conditioners to cool and dehumidify classroom air
  • Roof-mounted solar panels, lead-acid batteries, and an AC inverter
  • A 35 kW diesel generator for backup power when solar is not available

Installed costs including freight and duties were US$35,000; the PCs represent 40% of the total cost.

Side-by-Side Test: Protected Power PCs Keep Running

Five unprotected, three-year-old Dell Dimension computers in another classroom at the same Jacmel school are all dead with motherboard failures. The capital is not available to replace these computers, which failed well before the end of their expected useful life.

The solar-to-battery classroom has 20 PCs that are all operating after two years of operation that included a Category 4 hurricane and a devastating earthquake.

PC Education Starts With the Basics

It is common for first-world students to get hands-on PC experience during their K-12 years, and to expect college-level students to have basic computer competency. In the third world, it is common for even college-level students to have essentially no hands-on PC experience. Therefore, education needs to start with basics like mouse movement, menus, folders and all the UI concepts needed as a base for application-level experience.

At the Jacmel school, hands-on computer training starts at age 10 with a classroom-hour a week. Teachers in training receive extensive computer familiarization. However, the school population has doubled in the two years since the computer room was opened. Up to fifty students at a time now use the twenty PCs, so actual hand-eye skill training is not ideal.

Microsoft Has an Important Place in the Third World

Outside the U.S., it is common in the first and second world to see deliberate moves away from Microsoft operating systems and applications (e.g., EC). I am quite surprised at the third-world demand for Microsoft software. The key reason is job skill development. K-12 education is much more aimed at building usable work skills than in the first world, and here Microsoft operating systems and applications are, and buyers expect will continue to be, thoroughly embedded in business, commerce, and government.

Do Child Laptops have Limited Interest?

I have evaluated Eee PC, Intel Classmate, and OLPC laptops and extensively discussed the pros and cons with prospective buyers. Some observations:

  • The screens are sized for a single user; a plurality of buyers want desktops with large (19″) LCDs so that two students can sit at each PC, maximizing equipment utilization.
  • Buyers want Microsoft applications and Windows (see above).
  • Theft and loss are a concern, and capital for replacing lost machines is lacking.
  • In general, laptops drew less demand and interest than desktops.

In spite of these objections from Haiti and Brazil, Intel is making slow progress convincing governments to invest in country-wide child laptop deployments. Haiti, which lacks a fully functional government, is a poor candidate for country-wide PC deployments.

Classroom PCs Must Come Without IT Infrastructure

Classroom PCs need the same backup, anti-virus, software update, and other mundane but necessary care and feeding as a small business. However, the lack of IT personnel skills and availability makes even routine IT infrastructure and management exercises difficult. In short, a server is not much good if there is no administrator or operator trained and certified to keep the infrastructure running. Our solution to date is to use LAN-level automated backup products. I think it is a chicken-and-egg problem where lack of interest in learning about proper IT infrastructure management is tied to a lack of jobs related to IT infrastructure management.

By focusing 60% of our infrastructure investment on clean, reliable electrical power, our PCs are delivering the uptime needed to meet a demanding 14-hour-per-day usage schedule. The feared infant mortality was averted. The costs of that electrical infrastructure, which took an exorbitant 60% of the overall budget, are projected to support at least a decade of operations. That translates into about two generations of PCs.

The doubling of the school size was not forecast. Nor was the growth through success: adding more student classes to younger grades based on early success and the installation’s reliability. We’ll put in another six PCs this year to lower the students-per-PC ratio. However, the situation will never be ideal.

Looking to the Future

The rapid decline in desktop electrical usage makes it possible, in a couple of years, to think about doubling the number of PCs by expanding into an adjacent classroom, without increasing the solar-battery infrastructure. Stretching that existing solar-battery infrastructure is the obvious lever.

Netbooks are another future alternative for the classroom. At another Haiti project, a jobs-creating business was started with a pallet of 24 netbooks costing about $10,000. With wireless LAN and a router, this computer services business was in business in hours. However, unlike the take-it-home approach Intel advocates with the Classmate PC, these netbooks stay on the desks at night.

Wimpy Cores Are Not IT’s Answer to Data Center Power Limits

Using lots of “wimpy cores” like Intel’s Atom instead of “brawny cores” like Xeon is a growing topic in data center discussions, especially among cloud-computing giants like Google and Microsoft. What everybody wants is more computing for fewer kilowatt-hours. That’s gospel. But this analyst does not see wimpy-core servers playing a role in most enterprise data centers over the next five years. The computer science and economic hurdles are huge.

The Wimpy-Core Pitch
I picked up the arguments for a new approach to some data center workloads with a worthwhile white paper by Google’s Urs Hölzle, where he described “wimpy cores” as having low compute capabilities with commensurate low power requirements, as opposed to “brawny cores” with high electrical loads in traditional data center servers.

Lots of wimpy-core servers, the argument goes, running the appropriate work loads could get the job done at a much reduced electricity budget.

And make no mistake, the Internet giants running multiple data centers for global operations are very focused on electricity consumption due to server operation and server HVAC cooling.

A hybrid argument was made yesterday by Microsoft, calling for low-power 16-core Atom servers-on-a-chip at an industry conference.

The Contestants
In the wimpy corner on the right is Intel’s Atom D525. This 45 nm processor has 2 cores and 2 threads. It runs at 1.66 GHz, consuming up to 13 watts. It supports a 64-bit operating system, but not virtualization. Without a special chipset, you’d get one OS image per server, with one socket. Note that SeaMicro has clustered up to 512 Atoms in a single box with a proprietary interconnect, so building a lot of wimpy engines into a larger whole is already in the marketplace. And please note that AMD’s Brazos could substitute for Intel’s Atom in this article.

In the brawny corner on the left is Intel’s Xeon 7550 with 8 cores and 16 threads at 2.4 GHz, consuming 130 watts. The Xeon 7550 supports a 64-bit OS, virtualization, and every other Intel server technology. This chip works with a chipset that supports four processors on a motherboard, quadrupling the single-7550 specs.
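As a back-of-the-envelope comparison of the two contestants, the sketch below computes watts per core and aggregate core-GHz per watt from the figures quoted above. The `Chip` class and the "core-GHz per watt" metric are illustrative assumptions, not a benchmark: clock rate ignores the per-cycle differences between Atom and Xeon, and the wattages are per-chip maximums, not whole-server draw.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chip:
    """Per-chip figures as quoted in this article (not measured)."""
    name: str
    cores: int
    ghz: float
    watts: float

    def watts_per_core(self) -> float:
        return self.watts / self.cores

    def core_ghz_per_watt(self) -> float:
        # Crude proxy: aggregate clock rate per watt, ignoring per-cycle efficiency.
        return self.cores * self.ghz / self.watts


atom = Chip("Atom D525", cores=2, ghz=1.66, watts=13.0)
xeon = Chip("Xeon 7550", cores=8, ghz=2.4, watts=130.0)

for chip in (atom, xeon):
    print(f"{chip.name}: {chip.watts_per_core():.2f} W/core, "
          f"{chip.core_ghz_per_watt():.3f} core-GHz/W")
```

On this crude measure the Atom comes out ahead per watt, which is the heart of the wimpy-core pitch. It says nothing about the server-level costs that surround each chip.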

Handicapping Wimpy Cores Versus Brawny Cores
Atom’s low electrical consumption is certainly attractive versus the brawny Xeon. But buying lots of Atoms today is certainly not the answer for CIOs. There are several key issues that carry more weight than chip power consumption, as I’ll outline below:

Duplicated Server Infrastructure
In order to connect the wimpy server to the rest of the IT world, the same infrastructure elements are needed: network connections; I/O connections; memory; server power connections; KVM ports and all the other nuts-and-bolts hardware that make up a server in a data center.

Expanded Management Complexity
Since it takes lots of wimpy-core servers to equal the throughput of a brawny server, the number of server nodes could easily grow by hundreds or thousands. That raises operational manageability questions which not all enterprises are prepared to handle today.
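To get a feel for how quickly the node count grows, here is a minimal sketch that sizes a wimpy fleet against four-socket brawny servers using the chip figures quoted earlier. The function name, the use of core-GHz as a capacity proxy, and the `wimpy_per_clock_efficiency` knob are all assumptions for illustration; real sizing would need measured throughput for the actual workload.

```python
import math


def wimpy_nodes_needed(brawny_servers, sockets=4, brawny_cores=8, brawny_ghz=2.4,
                       wimpy_cores=2, wimpy_ghz=1.66,
                       wimpy_per_clock_efficiency=1.0):
    """Estimate single-socket wimpy nodes needed to match brawny capacity.

    Capacity is approximated as cores x GHz. The efficiency factor is a
    hypothetical discount for a wimpy core doing less work per clock.
    """
    brawny_capacity = brawny_servers * sockets * brawny_cores * brawny_ghz
    wimpy_capacity = wimpy_cores * wimpy_ghz * wimpy_per_clock_efficiency
    return math.ceil(brawny_capacity / wimpy_capacity)


print(wimpy_nodes_needed(1))    # one four-socket brawny server
print(wimpy_nodes_needed(10))   # a handful of brawny servers
print(wimpy_nodes_needed(10, wimpy_per_clock_efficiency=0.5))
```

Even under the generous assumption that an Atom core does as much per clock as a Xeon core, ten brawny servers turn into a couple of hundred nodes to rack, patch, and monitor.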

Microsoft’s proposed 16-way Atom is no panacea either. This chip will not run traditional symmetrical multi-processing (SMP) operating systems efficiently; too many processors and not enough horsepower per processor implies lots of SMP overhead losses. That means Microsoft needs a new OS that can separate the 16 Atom cores, and a software development system that brings everything together at the application level. Not an easy task.

Dispatch Overhead Increases
The servers that dispatch workloads become more complicated when hundreds of wimpy-server destinations take the place of a handful of brawny servers.

Software Stack Costs
A wimpy server costs chicken-feed beside the likely server software stack costs. Sure, Microsoft can talk about upping its server count by an order of magnitude or two using wimpy servers, but they don’t have to pay for the OS and middleware like everybody else. Google creates custom versions of Linux depending on a server’s workload, and they maintain those OS images themselves. Is your enterprise ready to drop Red Hat service and support and roll-your-own Linux? That’s what it will take to make software economics pay off in a wimpy-server world.
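The software-economics point can be sketched with simple arithmetic. Every dollar figure and node count below is a hypothetical placeholder chosen only to show the shape of the problem, not a quoted price; `fleet_cost` is an illustrative helper that assumes software is licensed and supported per server.

```python
def fleet_cost(nodes, hardware_per_node, software_per_node):
    """Return (hardware total, software total, software share of total)."""
    hardware = nodes * hardware_per_node
    software = nodes * software_per_node
    return hardware, software, software / (hardware + software)


# All dollar figures and node counts are hypothetical placeholders.
brawny = fleet_cost(nodes=10, hardware_per_node=20_000, software_per_node=5_000)
wimpy = fleet_cost(nodes=200, hardware_per_node=500, software_per_node=5_000)

for label, (hw, sw, share) in (("brawny", brawny), ("wimpy", wimpy)):
    print(f"{label}: hardware ${hw:,}, software ${sw:,}, software share {share:.0%}")
```

With per-server software pricing, multiplying the node count multiplies the software bill even when the hardware bill shrinks, which is why companies that own their stack see different economics from everybody else.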

Atom is Made for Gadgets, Not Servers
Atom today was not designed for a data center environment. Here are some server must-haves that are not available with an Atom-based wimpy server:

  • 64-bit OS and virtualization together (you can have one or the other);
  • Server memory with error correction;
  • Server redundancy and chipset-level error correction;
  • Other technologies like AES encryption and AVX scientific instructions.

Increased Latencies
With a slower core, work takes longer to get done. If a user is waiting for a response, how much added latency can be tolerated? If the wimpy cores are running in parallel, the latency goes way up when one wimpy core falls behind the rest.
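The straggler effect is easy to see in a toy model: when a request fans out across cores and must wait for every piece, its latency is that of the slowest piece. The timings below are made-up numbers for illustration, and `parallel_latency` is a hypothetical helper, not a measurement of any real system.

```python
from statistics import mean


def parallel_latency(task_times_ms):
    """A fan-out request finishes only when its slowest task finishes."""
    return max(task_times_ms)


# Hypothetical per-core completion times for one request split 16 ways.
even = [40.0] * 16
one_straggler = [40.0] * 15 + [120.0]

print(mean(even), parallel_latency(even))
print(mean(one_straggler), parallel_latency(one_straggler))
```

In this toy case one slow core moves the average only slightly, yet the user waits three times as long. The more cores a request touches, the more chances there are for one of them to fall behind.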

Picking the Right Workload
Perhaps the biggest hurdle to harnessing wimpy servers will be selecting the right work loads. There is not much literature or vendor experience today to assist IT planners. A lot of computer science and measurement is still in the future. Making the wrong choices will lead to obvious costs in delays, rework, training, and everything else that happens on the bleeding edge of technology. Moreover, the right answer will involve the business process owners, not just the systems architects, as the following example shows.

One wimpy server workload that comes to mind is stock option self-service in the human resources department. Most employees log in only occasionally, so work loads per employee are predictably light (and if the company is acquired and everybody wants to cash out, they can wait). There are only a couple of dozen columns and a couple of rows per option grant per employee. It’s easy to imagine this database sitting in memory all day long. If all the application does is display the employee’s option holdings, a wimpy server seems workable.
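A quick sizing estimate shows why an in-memory database is plausible here. The headcount, grants per employee, and bytes per field below are hypothetical placeholders; only the "couple of dozen columns" and "couple of rows per grant" come from the description above.

```python
def option_db_bytes(employees, grants_per_employee, rows_per_grant=2,
                    columns=24, bytes_per_field=16):
    """Rough raw-data size of the option-holdings table, ignoring indexes."""
    rows = employees * grants_per_employee * rows_per_grant
    return rows * columns * bytes_per_field


# Hypothetical company: 50,000 employees with 5 option grants each.
size = option_db_bytes(employees=50_000, grants_per_employee=5)
print(f"{size:,} bytes, about {size / 2**20:.0f} MiB")
```

Even with generous placeholder numbers the raw data comes to a couple of hundred megabytes, which fits comfortably in the memory of a small server.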

But if the self-service includes updating the option database through an option exercise, then all the rules and controls of an enterprise accounting application come into play. Sarbanes-Oxley applies, which means the CFO could go to jail if the wimpy server corrupts data or fails without accounting-level recoverability. As stated above, Atom is not ready for enterprise server duties, so no application that is even remotely mission-critical should be run on a wimpy server.

Wimpy Servers Are Not Ready for Enterprise Prime Time
All of the above leads me to conclude that wimpy servers are not up to the tasks of the vast majority of enterprise work loads.

My recommendation to CIOs is to track the technology in 2011, and follow early-technology companies like SeaMicro for early-adopter success stories. What applications and work loads can actually deliver a lower Total Cost of Ownership using wimpy servers?

The Rest of the Story
For the foreseeable future, IT planners can lower electrical consumption, the raison d’être of using wimpy servers, by purchasing the latest server technology early in its life-cycle. Fact is, huge improvements have been made in server power management over the past five years, and more are coming in 2011. I’d look to buy new brawny servers before dabbling in wimpy servers.

If Atom isn’t going to make it in the data center soon, I don’t see how ARM-based servers have a snowball’s chance in hell due to the massive software migration needed.

Over the next five years, it’s likely we will see:

  • Even lower power levels, especially at idle and with partial loads. Intel and AMD hear their customers loud and clear;
  • Microsoft will get its Servers on a Chip, but they won’t only be Atom-based; expect Xeon too;
  • Lower overhead for virtual work loads, making virtualization more attractive as an alternative to wimpy servers;
  • Smaller, more configurable operating system kernels that enable better wimpy computing.

Data Center Server Farm