Lessons Learned Building OnLive

I just presented at the Large Scale Production Engineering group at our August 26th meeting, giving a talk on some of the things I learned while building out OnLive. I’ve included the text of the talk below as well as the slide deck. The area that generated the most interest was our use of Kanban to manage workflow in our group. I was inspired to try out kanban by this blog entry on Jordan Sissel’s blog. Jordan’s blog is one of my favorites; you could spend hours going through his entries on operations and engineering.

In the interest of time there were things I didn’t cover in the presentation. I’ll be blogging about them soon, including:

  • The importance of a low-information diet, and not being overwhelmed by your email
  • Unicast for server software releases is dead at scale.

Onlive lessons learned

Introduction

I am looking forward to sharing with you several things I learned over the last two years building the OnLive game service. My name is Greg Retkowski, and I’m team lead for Operations Engineering at OnLive. I’m going to share with you what I learned about when to automate, why you can solve scaling problems by breaking your environment into chunks, and how to balance interrupt-driven work against long-term project work. But first, a little about us.

About OnLive

OnLive is a cloud gaming service that we launched in June – it provides instant-satisfaction gaming. You click ‘buy’ on a title, and you are playing that title within seconds – no media to bring home from the store, and nothing to install on your machine. You can play the latest games on your old, underpowered PCs or Macs.

We do this by running everything in the datacenter – when you play a game, that game is running on our server. You send up controller inputs, and we send down video.

Through a bunch of smart video and networking guys, and a few puppycorns, we provide an experience that’s the same as playing on your game console at home.

About Me

I’ve been working in Operations since ’95. I started as the first employee of an Internet Service Provider in Ft Lauderdale, Florida. Two years later I moved here to the Bay Area, and I’ve been doing operations for internet startups ever since.

I joined OnLive in 2008 – at the time we had around 100 nodes. Since then we’ve grown by more than a couple of orders of magnitude… and we’re continuing to grow today.

I currently lead a team called Operations Engineering, and we’re responsible for…

  • Automating the deployment and configuration
  • Creating the tools used to manage the service
  • Designing the operational fabric for the service
  • And we’re the FINAL escalation path for the most difficult
    sysadmin issues.

… Basically, we’re the greybeard sysadmins in the organization.

There were a lot of growing pains going from almost nothing to the size we needed to build to support the service. We had challenges, we had failures, and even a dumpster fire; BUT we learned a lot, and ultimately succeeded.

I’m going to cover three areas where I learned a bit about managing complexity in my last two years.

I’m sharing what I learned – hoping that I learned the hard way so that you don’t have to.

When to automate

You have probably worked in shops where every server is hand-built: some poor sysadmin sitting in a loud, cold datacenter, swapping Red Hat CDs and clicking through NEXT prompts, and then repeating it for a whole row of servers that has to be installed…

That’s probably not the best approach if you need to scale. But then, is the answer to spend a ton of time automating everything?

Our machine database is a rails app and I needed to decide if it would run under mongrel or mod_passenger. We ultimately decided on mod_passenger. I could have spent weeks building our puppet rules around mongrel, automating the install, rotating the logfiles, et cetera. And those weeks would have been wasted, as we chose something different… So what is the right balance?

You want anything going into production to be automated and repeatable. However, you don’t know what’s going into production the first time you build it. The first time you build something, you are building it for engineering to evaluate. You could be building a couple of different NoSQL boxes for engineering, each to be evaluated, and one, or none, will be used in the final product. So when do you automate something?

The FIRST time you build something JUST LEARN about what you are building. At this stage you may be doing throw-away work, you are learning what the software can do. And, more likely than not, you are going to implement it wrong out of inexperience. At this stage, there is no value in automating something you are as likely as not to throw away.

The SECOND time, DOCUMENT IT. Now you are building a second one – you are probably going to build a few more. You’ve learned your lessons on how to do it better the second time. This is the time to DOCUMENT how you did it and make a CHECKLIST. That’ll let you hand off the next build to another team-member, and validate the procedure.

The THIRD time, AUTOMATE IT. By now you know the software inside-out, and you know you will continue using it. Now is the best time to codify it using puppet, chef, or your automation du jour.

The FOURTH time – bonus points for pulling this off – is to TEST IT. There are many ways to approach testing your automation. You could consider your monitoring service checks as your ‘tests’, or you could use a VM test harness to test your automation code.

We’re using a VM test harness to test all of our service installations, but that’s a story for another presentation.

Break things into smaller chunks

The next big lesson we learned as we were building was to split our environment into chunks whenever possible. That takes some explanation, so I’ll relate what we experienced building out our service.

As a side effect of our architecture, we’ve split things into chunks we call slices. This side effect turned out to be very important to our efforts to scale later.

As we grew our service we ran into performance issues in different parts of our management software. One example – as we grew, generating our DNS data took longer and longer, and it eventually took an hour to generate.

But because we had cut our environment into chunks, we were able to rewrite our DNS generator to generate slice-by-slice. It was able to cache slices that didn’t change, and that cut the generation time down to a couple of minutes.
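As a rough illustration of slice-by-slice generation with caching (not the actual OnLive generator), here is a minimal Python sketch. The function names, the slice and host names, and the record format are all assumptions made up for the example. It fingerprints each slice's host data and only re-renders a slice whose fingerprint has changed.

```python
import hashlib
import json


def slice_fingerprint(hosts):
    """Stable hash of one slice's host data; it changes only when the slice changes."""
    payload = json.dumps(sorted(hosts.items())).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def render_slice(hosts):
    """Render simple A records for one slice (hypothetical record format)."""
    return "\n".join(f"{name} IN A {ip}" for name, ip in sorted(hosts.items()))


def generate_dns(slices, cache):
    """Re-render only the slices that changed.

    slices: {slice_name: {hostname: ip}}
    cache:  {slice_name: (fingerprint, rendered_text)}, updated in place
    Returns (full_zone_text, list_of_rebuilt_slice_names).
    """
    rebuilt = []
    parts = []
    for slice_name in sorted(slices):
        hosts = slices[slice_name]
        fingerprint = slice_fingerprint(hosts)
        cached = cache.get(slice_name)
        if cached is None or cached[0] != fingerprint:
            # Cache miss: this slice is new or its host data changed.
            cache[slice_name] = (fingerprint, render_slice(hosts))
            rebuilt.append(slice_name)
        parts.append(cache[slice_name][1])
    return "\n".join(parts), rebuilt
```

With this shape, the cost of a run depends on how many slices changed rather than on the total number of hosts, which is the property the slice-by-slice rewrite relies on.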

It is much easier to scale your service out when you’ve broken it into smaller and well understood chunks. Even once we understood how to generate DNS for a thousand hosts, we didn’t know how that would behave at two thousand hosts.

But we did understand how our system behaved at the smaller slice size, and going from ten slices, say, to twenty slices was much easier to manage, as we had good knowledge of how everything behaved within the slice.

So, whenever you can, break your environment into manageable chunks, and push as many of your scaling dependencies as possible down into those chunks – and when you have to do things as a whole, cache whenever you can.

Balancing Interrupt & Project Work

Our next big lesson was about how to balance interrupt and project work. It took a while to try and find a good balance between the two.

How many of you think that if you reduced interruptions you’d get a lot more work done? If you are like most in operations you know it is a huge problem.

I remember one case – A release deadline was approaching, and my group was on the hook for delivering a new web proxy system and some new admin tools to manage the site. At the same time our interrupt-driven requests from engineering went up because they had the same deadline and needed our support to get their projects done. It was very challenging to balance the two, to be able to unblock engineering, while delivering our projects by the deadline. We worked long hours and still some things got dropped because we didn’t have a good way to balance it out.

Our teams are pulled in many directions. A pure engineering team specs something out and then has several weeks to deliver. A ‘helpdesk’ team goes from interrupt to interrupt to solve problems in a short time window, but has no long-term deliverables. In operations, you have to be both engineering and helpdesk. So how did we balance the two?

I looked at development methods like Agile – that would help my team deliver complex projects – BUT in operations, weeks between identifying a problem and resolving it isn’t acceptable.

I knew the solution of going from interrupt to interrupt wasn’t the right way to go. That may make people happy in the short term, but then we’d deliver our long-term projects late and poorly implemented, because we didn’t give them the attention they deserved.

I dug a bit deeper and found a solution that was ideally suited for my group.

We borrowed from the lean manufacturing world to solve this problem. We use a system called KANBAN to manage our tasks. KANBAN is easy to implement – you only need a whiteboard and some post-it notes. The central premise of KANBAN is that projects go through different states – INCOMING – WORKING – INTEGRATION – PUSH TO PRODUCTION – and that only a limited number of projects are allowed to be in each state at any time.

We have an INCOMING queue where all projects arrive – each team member can have two projects going at one time, and as they finish one they pull the next-most-important project into their WORKING queue.

It is bad kanban form either to exceed the limit on the number of projects in a given state or to pull a project back into a previous state. We only do that when something crucial comes up.

Using kanban makes sure everyone can work on their projects with as little interruption as possible – and makes sure the next project that gets worked on is always the next most important project.
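A whiteboard and post-it notes are all you need, but the pull rule is easy to show in code. The following is a minimal Python sketch under stated assumptions: the class and method names are hypothetical, priorities are plain integers where a lower number means more important, and only the INCOMING and WORKING states are modeled. The limit of two projects per team member comes from the talk.

```python
import heapq

WIP_LIMIT = 2  # from the talk: two projects per team member at one time


class KanbanBoard:
    """Toy board with an INCOMING queue and per-member WORKING queues."""

    def __init__(self, wip_limit=WIP_LIMIT):
        self.wip_limit = wip_limit
        self._incoming = []  # heap of (priority, arrival_order, project)
        self._arrivals = 0   # tie-breaker so equal priorities stay first-in, first-out
        self.working = {}    # member -> list of projects in WORKING

    def add(self, project, priority):
        """Put a project into INCOMING; a lower priority number is more important."""
        heapq.heappush(self._incoming, (priority, self._arrivals, project))
        self._arrivals += 1

    def pull(self, member):
        """Pull the next-most-important project into the member's WORKING queue.

        Returns None if the member is at the WIP limit or INCOMING is empty.
        """
        current = self.working.setdefault(member, [])
        if len(current) >= self.wip_limit or not self._incoming:
            return None
        _, _, project = heapq.heappop(self._incoming)
        current.append(project)
        return project

    def finish(self, member, project):
        """Move a project out of WORKING, freeing a slot for the next pull."""
        self.working[member].remove(project)
```

The important design point is that work is pulled, not pushed: nothing enters a WORKING queue unless the person has a free slot, and what they pull is always the most important item waiting.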

Point Of Contact Rotation

I believe interruptions are a HUGE productivity killer in operations groups, so I’ve done what I can to minimize them.

It is very hard for you to focus on projects when you are constantly interrupted. After an interruption it can take 20 minutes before you are fully engaged again in what you were doing. In ops this means you could go all day and get nothing done due to interruptions.

So there’s another strategy we use to minimize interruptions – we rotate everyone on the team through a week-long Point-of-Contact (PoC) shift. Any interrupt-type work goes to the PoC first. You know your interruptions will be minimized most weeks, and you know that during the week you are PoC most of your time will be interrupt-driven.

It also gives my newer team members an opportunity to learn about all the parts of our service. They have to field all questions for our group regardless of their comfort level.

During this week the PoC is responsible for the bug queue and for triaging any new issues that arrive. This helps the rest of the team, as everyone knows that if they aren’t on PoC duty they’ll be able to work on their current project uninterrupted.
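A weekly rotation like this is simple enough to compute rather than track by hand. Here is a small Python sketch, assuming a fixed team order and a known start date for the first shift; the function name, team names, and dates are hypothetical and only for illustration.

```python
from datetime import date


def poc_for_week(team, day, start):
    """Return who holds the PoC shift during the week containing `day`.

    team:  list of team members in rotation order
    day:   the date being asked about
    start: first day of the first member's first shift
    """
    weeks_elapsed = (day - start).days // 7
    return team[weeks_elapsed % len(team)]


# Hypothetical three-person team, rotation starting on a Monday.
team = ["alice", "bob", "carol"]
start = date(2010, 8, 2)
current_poc = poc_for_week(team, date(2010, 8, 9), start)  # second week of the rotation
```

Because the schedule is a pure function of the date, anyone with an interrupt can work out who to contact without asking the team.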

Conclusion

I covered three areas where I learned important lessons, and I hope you find these lessons useful in your environments.

I covered…

  • When to automate – Automation is necessary, but is more
    effective when you do it at the right time.
  • If you break your environment into small chunks, and scale at
    those chunks, you’ll have fewer scaling headaches later.
  • And finally, if you can bring order to the chaos of
    interruptions, while still delivering on your projects, you’ll be
    happier and the engineers you support will be happier too.

Thanks to Yahoo for organizing and hosting this meetup – thanks to all of you for your time, and if you have any questions I’m happy to answer them now.
