Last Weekend’s Outage

Posted August 1, 2011 by George

As many of you are aware, we experienced some unexpected downtime this past weekend. First, we want to apologize to our amazing community of genealogists for any inconvenience the outage may have caused. Our goal is to provide the best place for all of you to work together to build a single family tree of the world, and we failed to provide that service for more than 48 hours over the weekend.

Our engineers worked around the clock to resolve the issues that caused the outage, and they were finally able to restore service at approximately 8am PDT (3pm GMT) today.

What Went Wrong?

A couple of important points worth noting before we dig into the details:

  • We have measures in place to prevent the loss of data, and they worked. (Phew!)
  • You may notice that some data is missing right now. Please do not re-enter that data, as we are in the process of reloading it over the next couple of days.

The issues that caused the outage were with Geni’s PostgreSQL database. We know that neither hardware issues nor data corruption was the root of the problem, and we suspect that it was an issue with the database’s index. The eventual solution to the problem was a full restore of the database.

As you can imagine, fully restoring more than 100 million profiles and all associated data (source documents, images, videos, etc.) is a large task, and unfortunately we couldn’t simply “pedal faster”. We did try several alternatives before resorting to a full restore, including (a) rolling back our codebase, (b) attempting to move the data to a different database server, and (c) investigating all of our system logs to try to find an easier way to repair the site.

We will continue to investigate, and as we learn more about the cause(s) of this problem, we will put additional measures in place to minimize or completely prevent this sort of outage in the future.

Better Communication

During the outage, we were not as effective as we should have been with our communication to the Geni community.  While we didn’t know exactly when we would be able to restore service, we should have provided updates much more frequently than we did, and we will in the future.

We will provide updates to users on our Facebook page, our Twitter account (and our Twitter Uptime account), and if possible, on our blog and our support site.

Pro Users

For those of you who have supported Geni by purchasing a Pro account, we would like to offer an additional week of Pro service to you for this inconvenience.  Some of you have already inquired about this; as soon as all of Geni’s data is restored, we will begin working on a way to credit a week to each of your accounts.  We will notify you once we have figured out the best way to apply this credit.

We value your support so much, and we will do our best to ensure that you don’t have to experience an outage this long again. Thank you for helping make Geni the amazing community that it is. We apologize again for the inconvenience.

If You’re Still Having Issues

If you are still having issues and you think that they may be because of this outage, please feel free to leave a comment on this post or seek help at our Helpdesk. Thank you for your patience, and thanks for being so passionate about Geni.

Post written by George

George joined the Geni team in September 2010 as Geni’s marketing director. You can find him on Twitter, where he never posts but is happy to respond: @georgegeni
