Communications The Internet

Google Blames Gmail Troubles On Maintenance Goof

Slatterz writes "Google has apologised for the two-and-a-half-hour Gmail outage on Tuesday morning, and admitted that the cause was down to data center maintenance. 'Lots of people around the world who rely on Gmail were disrupted during their waking and working hours, and we are very sorry. We did everything we could to restore access as soon as possible, and the issue is now resolved,' said Gmail site reliability manager Acacio Cruz in a blog post. Google had been testing new code designed to keep data geographically closer to its owner, which brought about disruption when maintenance in one data center caused another facility to be overloaded. This had a cascade effect, according to Google, and it took the company an hour to get it back under control."
This discussion has been archived. No new comments can be posted.

  • Gmail = Goofmail
  • by Jon_Hanson ( 779123 ) <jon@the-hansons-az.net> on Wednesday February 25, 2009 @06:48PM (#26989071)
    Maybe it's related to this but I noticed this past weekend that the Jabber server running on my Linux machine no longer can get presence information for people on GMail/GTalk. From the logs I can see my server attempting to make a connection but nothing happens after 20 seconds and my server gives up for the time being. I haven't changed anything on my side but I'm unsure who to contact about issues like these.
  • by Narnie ( 1349029 ) on Wednesday February 25, 2009 @06:52PM (#26989125)
    The cloud can break down? WTF? I thought cloud computing fixed any conceivable computer problem out there.

    damn marketing bs...
    • Re: (Score:1, Offtopic)

      by TheLink ( 130905 )
      Yes the cloud can break down, it's ironic isn't it? It's like rain on your working day.
  • by Tarmus ( 1410207 ) on Wednesday February 25, 2009 @06:53PM (#26989139) Homepage
    So you're saying I *didn't* need to throw my iPhone out the car window the other day? I hit some poor lady right on the noggin with it.
    • Re: (Score:2, Funny)

      by Anonymous Coward

      it's ok, some poor lady got an iphone that she would have never been able to afford otherwise. cause she's poor.

  • by mea37 ( 1201159 ) on Wednesday February 25, 2009 @06:54PM (#26989167)

    I mean, sure, if the janitor brought down the service, that's pretty bad, but it seems a bit harsh to start calling him a "maintenance goof" ...

    (tip your bartenders and waiters)

  • by Anonymous Coward on Wednesday February 25, 2009 @06:54PM (#26989171)

    .. Gmail is Beta or something.

  • by buddyglass ( 925859 ) on Wednesday February 25, 2009 @06:57PM (#26989231)
    Nobody complain about that silly beta label anymore.
    • Re: (Score:1, Funny)

      by Anonymous Coward
      hope they make it betta soon...
    • I pay for my (business) GMail account, which definitely does not have that label, but still I was down. Oh well, I guess they'll still make the promised uptime.

  • by glebd ( 586769 ) on Wednesday February 25, 2009 @07:00PM (#26989315) Homepage
    "As the stunned world slowly recovers from 2.5 hours of complete hibernation, digging through wreckage, restarting life support systems we all came to depend on, re-animating accidentally dead and restoring their brains from backups (provided backups are available and reasonably error-free), Google has apologised for causing 'the disruption' and blamed it on a maintenance goof in the Google Cloud, said GCloud site reliability manager Acacio Cruz IV v10.0.013 in a BrainTwitter post. We can only envy our ancestors who used to just lose access to their electronic mail via primitive personal computers when Google was having a glitch."
    • by jnuzzo ( 313424 ) on Wednesday February 25, 2009 @07:22PM (#26989685)

      It's a FREE service. I don't have a problem with an outage when the service is free. It's when I pay for a premium service, they can't keep it stable, and finally raise my rate to cover their idiocy that p*sses me off.

      • by Idiomatick ( 976696 ) on Wednesday February 25, 2009 @08:06PM (#26990437)
        Google has better uptime than any paid email service I know of anyway. I'm pegging Gmail at around 4 9s and the search engine at around 5~6. Most ISPs get 2 9s and business ISPs 3, or 4 if you hit the ISP jackpot (even then I'd be shocked). So how are people so up in arms about this? Sheesh, if you saved it up you'd get a day off once every 30 years. Which reminds me, never mind ISPs. You get less uptime for ELECTRICITY in North America than you do from Google, going just on the continent-wide outage a few years back.
        • Technically this downtime already brought GMail down to 3 nines.
        • Re: (Score:3, Interesting)

          by Anonymous Coward

          How did you calculate 4 nine's for gMail? 4 9's is 52 minutes of downtime per year, while this outage was over 2 hours.

          And this isn't their first outage. The last one I remember was April of 2008.

          Is it even possible to measure 6 9's of downtime for an internet service? 6 9's is just 30 seconds of downtime per year -- less than 3 seconds per month -- 100 msec/day. Can you honestly say that you never experience 100 msec of additional latency once a day? Maybe once a month they have a hard disk timeout that m

          • Re: (Score:2, Insightful)

            by Anonymous Coward

            And this isn't their first outage. The last one I remember was April of 2008.

            Having seen forum threads around the internet discussing Gmail downtimes in the past, a general trend is that only one or two people see an outage while everybody else can access Gmail just fine. That makes me think the majority of their downtimes affect only a tiny fraction of their users. If you count all outages even though they affected maybe just 1% of users, then you are not giving correct availability figures. If 1% of the time there is an outage for 1% of the users, the availability isn't 0.99, it is 0.

          • Measuring downtime as a SOLID downtime over 3 minutes long. You can't honestly say your productivity was severely impacted by a 2-minute downtime. Because you are right, you can't measure it otherwise.
        • We're 5 9s. If you think GMail is 4 9s I have a bridge you might also be interested in.

        • I agree. I have another, pay, email account with netidentity, owned by tucows.com, which has had all kinds of reliability problems for the past several years. They've had several occasions where my email was inaccessible for several DAYS. Now I just forward that mail to my Gmail account, which I've never noticed a problem with.

      • That's all very well, EXCEPT people DO PAY for Gmail.

        There are lots of corporate/paid accounts who WERE paying for it, and have an SLA (service level agreement) with Google, and they were just as affected as everyone else.

    • by Lifyre ( 960576 )
      Thank you. That just made 7 hours of trouble shooting here in Iraq so very worth it.
  • by dave562 ( 969951 ) on Wednesday February 25, 2009 @07:06PM (#26989435) Journal
    The first thing that came to mind when reading the article is, "They were 'testing' code in a fscking production environment?!" Then I realized that Gmail is still a beta app. I think these things are to be expected from beta software. What I'm curious about is whether or not corporate users who are paying for Gmail were affected as well. If so, then Google better get their ducks in a row, and fast. It's one thing to play around with your servers when people aren't paying you for uptime. It's another thing entirely to test code on a production network.
    • by sloth jr ( 88200 ) on Wednesday February 25, 2009 @07:13PM (#26989545)
      There are all manner of tests, and sooner or later, you do have to test in production. It's important to know that in cloud computing, there are certain kinds of tests that are only possible in production; production load is the surest way to characterize your application and platform. Who knows where in the deployment lifecycle this happened? Someone at Google, certainly, but not us.
    • by PotatoFarmer ( 1250696 ) on Wednesday February 25, 2009 @07:21PM (#26989669)
      The fact that they have corporate accounts paying for access to the service should preclude the 'beta' label. I like a lot of what Google has done, but sometimes it seems like the whole beta thing is just a convenient excuse for failure, or as a free pass for iffy behavior like testing in production.
      • by lefiz ( 1475731 ) on Wednesday February 25, 2009 @08:38PM (#26990829)
        Google is an innovative company, and innovation often includes trial and error as well as improvements to an original idea. No one makes you use their products, and in this case, gmail is only one of many email providers. If you would prefer slightly more reliability from a corporation providing a product guarantee, feel free to look elsewhere. I like the way gmail works much more than any other email app I've tried, and am happy to accept the occasional issue, especially for all of the positive developments that have come from continued work on the project. Remember not so long ago when you couldn't chat in gmail?
      • The fact that they have corporate accounts paying for access to the service should preclude the 'beta' label. I like a lot of what Google has done, but sometimes it seems like the whole beta thing is just a convenient excuse for failure, or as a free pass for iffy behavior like testing in production.

        It's just a label. It doesn't mean anything other than "we're not finished with this yet". And when did they use it as an excuse?

      • Re: (Score:2, Insightful)

        by Strake ( 982081 )

        Be thankful that, at least, Google calls their testing versions "beta", not "Service Pack n" | n < 2.

    • by jadin ( 65295 ) on Wednesday February 25, 2009 @07:21PM (#26989673) Homepage

      I don't think they are testing it on their corporate users. My domain is signed up for google apps which includes email, but not the pay for premium version. When I read on slashdot that gmail was finally adding an option for 'always use https connection', I looked in the options where people said it would be, and found nothing. Logging into the "official" gmail I was able to find it right away. It took some time before it showed up in my domain's gmail client.

      My conclusion is they test all the code on the official gmail users to make sure it's stable enough before updating the corporate clients etc.

      • There's an option in gmail for domains you can change to use the latest version if you want the bleeding edge features. But this setting probably wouldn't help with the type of outage that happened this week.
    • Re: (Score:3, Interesting)

      by BikeHelmet ( 1437881 )

      It wasn't 2.5 hours for me - it was more like 14-15 hours.

      It stopped working at night time, around ~9PM (this is when Gmail Notifier failed to login, and curious, I tried to login manually). It wasn't working yet at 2AM in the morning. I went to sleep, woke up, and it was still broken. It finally came back online some time after lunch.

      This would be quite irritating if I were a business. As it was, I did have some important emails to send off, but waiting a day didn't kill me.

    • And pray tell what kind of test environment must exist to mimic the massive production environment! At some point there aren't many options because something so distributedly large is beyond conventional mechanisms and wisdom. In Google's case they may never be able to move out of beta if it means testing code in the production environment.
      • by dave562 ( 969951 )

        The test environment doesn't have to exactly mimic the production environment. It just has to serve as a model. Let's say that in their test environment they move mailboxes around, and they find that it takes X minutes and Y amount of bandwidth to move Z amount of data. They can then take those calculations and extrapolate what will happen when the numbers change. We obviously don't have the details involved, but the article mentions moving mailboxes and servers being overwhelmed by the amount of data moved.

      • In the same datacentres as the production servers.

        Or something like that.

    • My business depended on GMail, and yes it was down, but for a hell of a lot longer than 2.5 hours. It was more like from Monday night to Tuesday afternoon.

      Yes, I do pay for their services. And no, we will not be depending upon it any longer. "Testing" code on a production environment is just bone-headed, and I am quite frankly getting tired of the constant "Some features have failed to load..." (... because we're testing new code that doesn't work) messages.

      There are more reliable providers out there...

    • You can test and test all you want outside of production, and any respectable shop will have every piece of code thoroughly unit tested and will test "significant" changes against simulated (for changes that load can affect) and limited users.

      But, for an environment with huge infrastructure, it becomes literally impossible to test every scenario against real user loads with real user patterns ("random" requests are not real).

      When your test scripts get timeouts, they gently retry after $TIMEOUT. People arent

    • How long should a product remain in beta, exactly? Beta doesn't mean what it used to; Google has redefined it as a "get out of jail free card". That isn't beta, it's a scapegoat.
  • for their lame layout - should give people a way to avoid (or change) the styled buttons, not all of us can easily read them now.

  • My bad. (Score:5, Funny)

    by Maintenance Goof ( 1487053 ) on Wednesday February 25, 2009 @08:16PM (#26990575)
    Sorry, My bad.
  • by haruchai ( 17472 ) on Wednesday February 25, 2009 @09:03PM (#26991177)

    First off, it's free, it gives you 7 Gigs of mail storage and it's accessible from any where or any device with an Internet connection.
    It searches through my 4 years of e-mail faster than Outlook ( in cached Exchange mode) can search
    the last week. They keep adding features - for free;
    have no annoying Flash ads and the ones they do have are off on the extremes of the page.

    If you don't like it, stop using them - I promise you there won't be any pesky cancellation fees.
    Hotmail and Yahoo await you and we'll miss you all - maybe.

    • It's a lot worse than you seem to know.

      I've been having problems with gmail for 4 days now. My mail STILL isn't being delivered.

      I have sent two emails a day (morning and evening) to my Yahoo account over the last 4 days.

      None have been delivered. This still isn't fixed.

      • by haruchai ( 17472 )

        Sorry to hear that but it's not a universal problem - I've had no issues with Gmail and the only time that there was an outage that was long enough for me to notice was over 2 years ago.

        All in all, their free service has been an order of magnitude better than the various Exchange environments I've been in (including BP (2 years ago), HP (currently) and several medium-sized ISPs) in terms of service speed / reliability, mailbox size, searching and spam filtering / virus scanning.
        Of course, Google does

    • Except Gmail is NOT free if you are a PAID user and they were just as affected as everyone else.

  • I don't understand people who rely on Gmail, or any other free webmail, as their primary and business-critical point of contact. There is no SLA, no contractual obligation, no guarantee of anything. Anything can happen to your email and there's absolutely nothing you can do about it.

    The logic is quite simple: if you can't live without something, then get a guarantee in writing, and pay the premium for that extra service. In Gmail's case, there is no premium service, so you'd better start looking elsewher

    • They want people to use gmail, which is of course the reason they offer it in the first place. They make a significant profit off it, and would lose money if they drove away users.

    • by mkendall ( 69179 )

      The logic is quite simple: if you can't live without something, then get a guarantee in writing, and pay the premium for that extra service. In Gmail's case, there is no premium service, so you'd better start looking elsewhere.

      Actually there is. Gmail is available as part of Google Apps for Your Domain [google.com]. Premier Edition costs $50 per user per year and offers a 3-nines uptime guarantee.

    • There is no SLA, no contractual obligation, no guarantee of anything.

      If you're paying for the premium version of Google Apps (which you should be if it's a business-critical domain), you get a 99.9% SLA, which by my calculations they are still well within.

      Despite that, they are giving premium users 15 days of free service. There aren't many service providers I know of who would just go ahead and do that - most of the ones I've worked with would point blank refuse, and some of them will make you fight to get se

      • I routinely give away full months of free service to my customers if they have a problem - even for tiny things like management interface issues that prevent them from doing something they should have. For major issues, the only real way to compensate a customer is to give them x amount of their money back. Google's "you may get 15 days" SLA is very, very weak. Purely my opinion; obviously many people think it's the most awesome generous thing a company could ever do. If you look around though, it's bottom

    • Maybe because it's more reliable than the non-free services?

      I have a pay email account with netidentity, owned by tucows.com. They've had several outages in the past couple years that have gone on for several DAYS. What exactly can I do about that? Sue them? Good luck with that.

      Your guarantee in writing is worthless when something actually does go wrong. Your only recourse is to sue, and if you've ever used the court system, you'd know that you'll never get any money back that way. Only the lawyers p

  • by Puppet Master ( 19479 ) on Wednesday February 25, 2009 @11:29PM (#26992855) Homepage
    Not sure how widespread this is, but I use OpenDNS both at home and at the office as my resolving name servers. Recently some ass hat apparently set gmail.com on OpenDNS's filters. Labeled it as a Webmail client. So, for the past 2 days I couldn't get logged on to my Gmail account while at the office; it kept saying login failure. But at home it would work fine. I changed to the company's internal DNS servers for resolving and suddenly my Gmail would connect... So, anyone using OpenDNS and still not able to connect might look into that. I have sent OpenDNS admins a request to re-check that filter... It's kinda pointless to just block everything that someone *thinks* should be blocked.
    • Recently some ass hat apparently set gmail.com on OpenDNS's filters. Labeled it as a Webmail client.

      Bastards! They labeled one of the biggest webmail providers around as a webmail client?

      Next they'll be labelling myhotteenpussy.com as a porn site.

    • You know that you can whitelist domains with OpenDNS, right? Or just not block the "webmail providers" category?

      • I have... I had it set up that way, but for some reason it wasn't working. I logged in again to my OpenDNS account to verify that I had indeed not opted to filter the webmail providers category and I hadn't. So, I checked the filtered option and saved my config and then unchecked the filtered option and saved again... Now it works fine. But it shouldn't have done that to begin with.
  • Yes it's damn annoying when email or some other part of your critical infrastructure goes out, but this really should have been planned for in advance. Not by google but by you.
    Things happen. Things that are out of our control but we still have to deal with them. This outage was quite short for most people. A day at the most from what I hear, but what if the outage had been longer? A week? A month? How would you have dealt with it?

    I always keep a few lists of things to do, people to call, things to write sh

  • Too bad they don't tell more details. Their software can withstand lots of problems: network partitions, data center outages, failing routers, etc. This time, a new piece of algorithm apparently did not do a very good job at redistributing data at the time of the data center failure. I'd like to know what it tried to do. Did it try to push too much data to one single location, causing that location to become unresponsive, in turn causing it to start redistributing data as well? I'm glad they didn't lose

  • Am I the only one who noticed it should be apologized not apologised?
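
The availability figures traded in the comments above (for example "4 9's is 52 minutes of downtime per year", or the point that an outage hitting 1% of users should not count as a full outage) follow from simple arithmetic. Here is a minimal Python sketch of that arithmetic, assuming a 365-day year. The helper names `downtime_budget_seconds` and `availability` are hypothetical, chosen for illustration; they do not come from the thread or from any Google tooling, and this is not how Google or any SLA necessarily measures uptime.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: 365-day year, leap years ignored


def downtime_budget_seconds(nines, period_seconds=SECONDS_PER_YEAR):
    """Downtime allowed in a period at `nines` nines (e.g. 4 means 99.99%)."""
    return period_seconds * 10 ** (-nines)


def availability(downtime_seconds, period_seconds=SECONDS_PER_YEAR,
                 fraction_of_users_affected=1.0):
    """User-weighted availability over a period.

    Downtime is counted in proportion to the share of users who saw it,
    which is the weighting one commenter argues for.
    """
    downtime_fraction = downtime_seconds / period_seconds
    return 1.0 - downtime_fraction * fraction_of_users_affected


if __name__ == "__main__":
    # Yearly downtime budget for each level discussed in the thread.
    for n in (3, 4, 5, 6):
        minutes = downtime_budget_seconds(n) / 60
        print(f"{n} nines: {minutes:.2f} minutes of downtime per year")

    # A single 2.5-hour outage seen by every user, with no other downtime.
    outage = 2.5 * 3600
    print(f"2.5 h outage, all users: {availability(outage):.6f}")

    # An outage lasting 1% of the period but seen by only 1% of users.
    partial = availability(0.01 * SECONDS_PER_YEAR,
                           fraction_of_users_affected=0.01)
    print(f"1% of time, 1% of users: {partial:.4f}")
```

Under these assumptions, four nines allows about 52.6 minutes of downtime per year and six nines about 31.5 seconds (roughly 86 ms per day), so a single 2.5-hour outage affecting every user leaves about 99.97% for that year. Weighting by the share of users affected turns an outage that hits 1% of users for 1% of the time into 99.99%.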
