Microsoft Azure’s Current Uptime and Support Struggle

Posted on Nov 20, 2015 in From the Cloud, From the Mind | 3 comments

Author’s note: Following this post, I had a productive call with a Microsoft Azure team. Valuable insights from that call are here.

Over the last couple of years, we at EfficiencyNext have become more bullish about how Microsoft Azure can help our clients. New features have been rolling out regularly, and extremely useful services like SQLAzure make the cloud platform unique. Microsoft has also been aggressive with its Service Level Agreements, guaranteeing up-times as high as 99.95%, backed by the promise of partial refunds.

Unfortunately, at least for our Azure Subscriptions and those of our clients, up-time over the last couple of months has been less than terrific. We have seen IaaS Servers lose their disks, SQLAzure databases go offline for several hours, Web Apps suddenly operate at 3% of their usual power, and the Azure Cache service become unreliable for over 24 hours, effectively bringing sites down. All this has happened to Azure-based systems we manage for our clients over the last couple of months, the vast majority of them on services backed by SLAs. And in most cases, the issues with these services were never reflected on the Azure Dashboard. We are on the 26th hour of a serious issue now, and yet the Azure Dashboard is green.
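
To put those numbers side by side, here is a rough back-of-the-envelope sketch. It assumes a 30-day month and, for the sake of argument, treats the whole 26 hours as downtime; the function names are mine and purely illustrative, and the real SLA terms define their own measurement rules.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime an up-time percentage permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def achieved_uptime_percent(outage_hours: float, days: int = 30) -> float:
    """Up-time percentage over `days` days, given total hours of outage."""
    total_hours = days * 24
    return 100 * (1 - outage_hours / total_hours)


# Assumption: a 30-day month and one continuous 26-hour outage.
budget = allowed_downtime_minutes(99.95)
uptime = achieved_uptime_percent(26)
print(f"99.95% allows about {budget:.1f} minutes of downtime per month")
print(f"A 26-hour outage leaves about {uptime:.2f}% up-time for the month")
```

Under those assumptions, a 99.95% target leaves roughly 21.6 minutes of slack per month, while a single 26-hour outage works out to about 96.39% up-time.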

Just as alarming are Microsoft’s moves to shut off the valves on how users can report issues and receive help when it is the Microsoft infrastructure at fault. In theory, “Web Incident Submission” is included with all Azure Subscriptions (https://azure.microsoft.com/en-us/support/plans/). At the very least, a way to submit technical issues to Microsoft as they happen is implied, if not promised. Unfortunately, this capacity doesn’t really exist, at least not in the way one would expect. In my experience, bringing up an issue to the Twitter handle @AzureSupport usually results in them recommending submitting a Billing Ticket through the Azure Portal. When you do that, the Billing Reps now regularly tell those seeking help that they need to pay extra for a paid Azure Support Plan; the standard one runs $300 a month. The situation has become so ridiculous that when this is protested, the folks at @AzureSupport are now telling customers to submit Billing Tickets and to lie by saying the issue to be resolved is the subscription portal not working. Evidently, that gets a tech on an issue, and once they are engaged, one can bring up the actual problem. In other instances I know of, customers have set up a technical support plan, submitted their tickets, and then had Billing Support cancel the plan with a refund after the issue was fixed.

Limiting Customer Input is a Direct Threat to Up-time

Restricting incoming information about outages while guaranteeing high up-time seems contradictory. Given that Azure’s Dashboard and health mechanisms clearly can’t catch every serious issue that occurs on the service, letting customers tell techs what is wrong seems very smart. An issue that affects one customer likely affects hundreds more. With that option now limited to only those who pay for support, it’s my theory that up-time has really suffered as a result. Whenever I bring up to Azure reps that SLAs are supposed to be backed by good faith efforts, they usually tell me it’s only about a promised partial refund, which as we all know often doesn’t cover the business cost of multiple hours, and in some cases days, of downtime. Call me old-fashioned; I’ve worked with other hosting companies for over a decade and know from experience that good companies put real effort into actually making the up-time expressed in SLAs a reality.

So What Should You Do?

In the heat of the moment, it’s tempting to run over to AWS (Amazon Web Services). While they have paid support offerings, they also provide free high priority support if a service goes down when it is their fault; the key difference vs Azure is that Microsoft makes its customers pay to report issues of Microsoft’s own causing.

That said, Azure provides valuable capabilities and integration options with Microsoft development tooling that in my opinion are unmatched in the industry. SQLAzure, for instance, is a great service and a great value, especially with the latest V12 upgrade. And once you get hold of a tech, they work hard to get you back online. So here are my recommendations if you plan on being on Microsoft Azure.

  1. Understand that buying a paid support plan is a requirement. If up-time is fairly important, the $30 Monthly Developer’s Plan might work. If up-time is very important, then expect to pay $300 a month. Without one of these, it is usually impossible to even talk to a Microsoft tech when their own mistakes bring down services you are paying thousands of dollars for annually. Think of it as buying a PC without a manufacturer’s warranty, and thus needing to buy an extended warranty for any coverage.
  2. Budget extra development time to circumvent outages when necessary. Tonight, I’m working on changing dependencies on a client site with intermittent issues, as the critical service we depend on is still unreliable after over 26 hours. At some point, the fix has to shift from waiting on an overwhelmed Azure team to perhaps a coding change, with all the regression testing that comes with that.
  3. Minimize the number of Azure Services you use. An Azure Cache service can’t fail you if you aren’t using it. If up-time is important, keep the dependencies simple, and perhaps have backup services in another Azure data center at the ready.
  4. Finally, and I say this with a heavy heart, understand that up-time on Azure just won’t be as good as on some other hosting services. This is based purely on my own observations, but up-time is just worse than anything I’ve seen on any other major hosting service, at least over the last couple of months. Perhaps it’s the rapid release of new capabilities or the surge in new customers. Or maybe it’s just cultural; Azure started more as an Application hosting service and later expanded to more website hosting; the up-time expectations of the latter perhaps don’t resonate yet. Whatever the reason, I now tell clients to just plan for some downtime. For their purposes, the SLA percentages aren’t real. In my opinion, it’s simply the price for using a constantly upgrading platform that has an increasingly large client base.
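
For recommendations 2 and 3, the kind of coding change I have in mind can be as small as making a cache optional instead of critical. The sketch below is only an illustration, not what we deployed: `FallbackCache`, `cache_get` and `load_from_source` are hypothetical names, and the two callables stand in for whatever cache client and database call your site actually uses.

```python
import time
from typing import Any, Callable


class FallbackCache:
    """Treat a cache as optional: on repeated failures, bypass it for a while."""

    def __init__(
        self,
        cache_get: Callable[[str], Any],         # hypothetical cache lookup
        load_from_source: Callable[[str], Any],  # hypothetical database/origin lookup
        max_failures: int = 3,
        cooldown_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cache_get = cache_get
        self._load_from_source = load_from_source
        self._max_failures = max_failures
        self._cooldown = cooldown_seconds
        self._clock = clock
        self._failures = 0
        self._skip_until = 0.0

    def get(self, key: str) -> Any:
        now = self._clock()
        if now >= self._skip_until:
            try:
                value = self._cache_get(key)
                self._failures = 0
                if value is not None:
                    return value  # cache hit
            except Exception:
                # Any cache error counts as a failure; a real client would
                # catch its own specific exception types here.
                self._failures += 1
                if self._failures >= self._max_failures:
                    # Stop calling the cache until the cool-down has passed.
                    self._skip_until = now + self._cooldown
                    self._failures = 0
        # Cache miss, cache error, or cache currently bypassed.
        return self._load_from_source(key)


def flaky_cache_get(key: str) -> Any:
    raise ConnectionError("cache unavailable")


def load_from_database(key: str) -> str:
    return f"value-for-{key}"


cache = FallbackCache(flaky_cache_get, load_from_database)
print(cache.get("homepage"))  # served from the source while the cache is down
```

The trade-off is more load on the database while the cache is bypassed, so this only helps if the source can carry the traffic on its own for a while.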

In Conclusion

I hope Microsoft will engage in good faith efforts to improve Azure up-time. Letting customers tell you when things are wrong seems like a good start. I’d argue it is bad faith to publish high up-time SLAs, promise free “Web Incident Submission”, and then block customers from submitting information to techs that would help maintain the SLA targets.

The world needs Azure. I want to recommend Azure. But these days, I have to add serious qualifications…

With that, I turn over the floor. Current Azure customers, am I right? Wrong? Any defectors from Azure to AWS out there? Is the up-time better over there? And Microsoft Azure folks, please feel free to join the conversation.

 

3 Comments

  1. Good read! Thanks for bringing this up. I agree on #4, though I wish it wasn’t so. I’m so happy for the empowerment we get from Azure, but these issues have to be resolved before we all run for the hills... Or should I say the competitor.

    Having to tell a customer that “it’s not our fault” just doesn’t cut it.

  2. It seems like it has been a while since any new comments. We have interest in moving to Azure, has this improved at all?

    • My apologies for the delay writing back! I can say that uptime in general seems to have improved, not that there isn’t an outage once in a while. Perhaps the biggest challenge right now is keeping up-to-date with everything new they are launching on Azure these days. Call us if you’d like some guidance! 🙂