Monitoring and Configuration Management to restart services

November 20th, 2011

This is the text of the talk I gave at the LSPE meetup in November 2011.

Introduction

Good Evening. Tonight I’m going to talk about something that I hope will free up your time from firefighting during the day – and help you sleep better at night – instead of catching up on your sleep during these talks. I’m going to talk about setting up your servers so that they can recover from faults automatically, without your intervention. And once you’ve followed this self-healing recipe you’ll be freed from firefighting your most common system failures.

My name is Greg Retkowski, and I’m an Operations Engineer at OnLive – a cloud gaming startup. I’ve sysadmin’ed at a dozen or so internet startups around the bay area since moving out here in 1997. I’m going to talk today about the self-healing setup I used at another startup I worked at; a company called Avvenu. This setup ties together your network monitoring and your configuration management system so that common faults that your monitoring system detects can be quickly fixed by your configuration management system.

This talk is based on an article I wrote a few years ago for O’Reilly. Originally it used NAGIOS and Cfengine. As most people are more familiar with puppet I’ve updated this talk to use it instead. With just a few small changes to tools you’re already using you’ll be able to tie these two systems together to resolve faults as they occur.

What’s in it for me?

So, how does this setup help you?

First – It’ll free you up from firefighting interruptions. I didn’t like getting paged late at night and having to fix common problems that a configuration management system could rectify. For example, we had some custom apache modules that’d sometimes crash the apache daemon. I’d have to VPN in and restart apache – which would be easy to automate.

Second – It can react faster than a human can. Once I implemented it, a pleasant side effect was that the setup would resolve common issues even faster than a human could. When I was paged I’d log in and find that the system had already self-corrected. Failures were shortened by removing a human from the loop.

Third – It’s a hedge against technical debt. We don’t want to get software from engineering that crashes every millionth request, but sometimes we do, and something like this can get us through till the next release.

Required Tools

Nagios

The first tool is NAGIOS. Most people in this room are already familiar with NAGIOS. It’s the most popular open-source monitoring package. It runs service checks against services and notifies sysadmins when things fail. It has the capability of running an external script when a failure occurs, and we’ll leverage that in our setup.

Puppet

The next tool is Puppet. When I mention puppet some people think of a scary doll with strings – but in this case I’m talking about the configuration management system developed by Luke Kanies. It will check the configuration of a host against policies you create, and will update the host to match your policies. Puppet has many capabilities that help in our system: it can correct corrupted config files, fix directory permissions, and ensure processes are running. In most installations puppet runs only, say, twice an hour. However it can be run on demand, and run remotely, and we’ll use this capability in our setup.

As an aside, if there’s one skill that’ll be crucial to have in the next five years it’s going to be a familiarity with configuration management systems – be that puppet or chef. It’s nearly impossible to manage large server farms without them. If you haven’t investigated either of them yet I recommend you do.

High Level diagram

This is a high level diagram of how it is all tied together. Nagios monitors services, and when a fault occurs it triggers an RPC mechanism to tell puppet to run. Puppet is configured to ensure apache is running and if it isn’t it restarts it via its init script.

Setting up puppet to start downed services

This is our puppet policy for our apache server. This may look like gibberish if you aren’t familiar with puppet. I’ll quickly walk through it.

This class ‘httpd’ tells puppet how to configure apache on our system. It says we should have the apache package installed, there’s a bunch of config file definitions we skip in this example, and then at the bottom we tell puppet that the service ‘apache’ requires the ‘apache’ package and that puppet should ensure it is running.

When puppet runs, if it finds the apache process is missing, it’ll restart it via the apache init script. This will be how puppet will restart apache if nagios notifies it that it isn’t running.

Setting up nagios with a postfail script

Now we’ll talk about the changes to Nagios. To make this work you’ll need to configure nagios to call an event handler script whenever a service goes into a different state. You’ll need to make changes in two places. First the services config file:

The important lines in this file are the ‘event_handler_enabled’ and the ‘event_handler’ lines. The first tells nagios to turn on an event handler for state changes for this service. ‘event_handler’ tells nagios what event handler to use.

The next file to update is the commands config file – we add a command handle_puppetrun, which invokes the handle_puppetrun shell script with several arguments. This will tell the script what host is affected and what the service state is. I’ve wrapped the lines here, but the command_line line must be all on the same line.

Setting up the glue between nagios and CM

Now we are going to set up the RPC glue between Nagios and Puppet. All these examples are with Puppet 2.6.

Setting up the puppet daemon on the host

First you’ll need to edit puppet’s auth config, and add a stanza that will allow it to accept remote requests to kick off puppet runs.

You’ll also need to create an empty namespaceauth config – this is a known issue with 2.6.

Next you’ll need to tell puppet to listen for incoming requests. You can do this by adding a stanza like this to your main puppet config.

You could also use the ‘listen’ flag on the command line when invoking the puppet agent.

For testing this you should use this command line. This will cause the puppet agent to stay in the foreground and print debugging information to your console. I recommend you run it like this at first while debugging your configuration. It’ll print all the logging information to the console which makes troubleshooting much easier.

Testing puppetrun from your monitoring host

You should now try invoking puppetrun from the command line, as the nagios user. This will ensure the end-to-end communication is working and that your nagios server will be able to fire off puppet when it needs to. If you are running the puppet agent in debug mode on your apache server you should see it running through its configuration.

When this isn’t working it’s often because the puppet certs aren’t issued to both hosts, or because the users running the commands don’t have access to the certs. Check both of these if you have trouble. In my installation I added an entry to the sudoers file so that nagios can invoke puppetrun as root to have access to the certs.

The puppetrun invocation script

Once you are satisfied that puppet is communicating properly, you’ll need the last piece – the handle_puppetrun shell script. It goes into the Nagios plugins directory. Once it’s in place, make sure it is executable by the NAGIOS user.

Nagios calls the script on all state changes. So the script looks for critical errors (here) and either HARD failures or three SOFT failures (here). In either case it calls puppetrun with the remote hostname – this causes puppet to run on that remote host.
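
The decision the script makes can be sketched in a few lines. This is a rough Python illustration, not the shell script from the talk. The function names are hypothetical, and it assumes the Nagios command passes $SERVICESTATE$, $SERVICESTATETYPE$, $SERVICEATTEMPT$ and $HOSTNAME$ as arguments. Firing on exactly the third SOFT attempt (rather than on every attempt from the third onward) is also an assumption.

```python
import subprocess

# "three SOFT failures" from the talk; firing on exactly the third is an assumption
SOFT_ATTEMPT_THRESHOLD = 3


def should_run_puppet(state, state_type, attempt):
    """Return True when a state change warrants kicking off a puppet run."""
    if state != "CRITICAL":
        return False  # ignore OK, WARNING and UNKNOWN
    if state_type == "HARD":
        return True
    return state_type == "SOFT" and attempt == SOFT_ATTEMPT_THRESHOLD


def puppetrun_command(hostname):
    """Build the puppetrun invocation for the affected host."""
    return ["puppetrun", "--host", hostname]


def handle_event(state, state_type, attempt, hostname, run=subprocess.call):
    """Hypothetical entry point; `run` is injectable so the logic can be tested."""
    if should_run_puppet(state, state_type, int(attempt)):
        return run(puppetrun_command(hostname))
    return 0
```

Nagios hands the attempt count over as a string, which is why handle_event converts it before comparing.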

Demonstration

And this is how it works once it is deployed.

Here’s our nagios instance – happily monitoring our network.

And here I segfault the apache process

Nagios notices that apache is down and calls our handle_puppetrun script

Puppet gets invoked on our webserver, and restarts apache for us

And here Nagios has noticed that our webserver has recovered

And our network is happy again.

Other monitoring packages and CM tools

There’s no reason you couldn’t apply this to other network monitoring systems or configuration management tools. I originally had this running under cfengine, and you could use chef as well. Other monitoring systems also support event handlers, same as nagios.

Conclusion

To find the examples and the original O’Reilly article visit my site for this talk at this URL.

To wrap up, I’ve shown a setup where your network can self correct for its most common failures. We’ve used tools that many of you are already using – tied together in a novel way. And I am hopeful I’ve freed you up from some firefighting so you’ll sleep better at night and be more productive during the day.


Managing your information diet

August 31st, 2010


How many emails do you have waiting for you each morning? 10? 50? 500?

Managing your information diet is going to give you amazing gains in getting productivity back, AND improve the response time to the really important messages you get.

I believe too much information is worse than not enough information. If your pager doesn’t go off, you suffer an outage and you get an irate phone call from your boss. If your pager is going off every 5 minutes because your monitoring sucks – you are going to be extremely frazzled AND you are going to miss that one alert that REALLY means your site is down – and you STILL get that irate call.

So what can you do to manage the emails you are getting?

First thing, EVERY monitoring alert should be followed up by an action. Either you are fixing a broken server, or you are tuning the false alert. You have to be vicious about chasing down and fixing false alerts. Don’t let yourself become numb to them.

Next, SPLIT your HIGH traffic messages from your LOW traffic messages. While you may want to follow every Subversion check-in in engineering – or know every time your ISP replaces a router line-card in Azerbaijan, be sure these aren’t also going to your group’s discussion email list. It may be nice to be able to refer back to them, but they should not be polluting your INBOX.

Get adept at the mail filtering tools you have at your disposal – I’m partial to procmail, just because it is what I’d grown up with. Do your best to only have ACTIONABLE email land in your INBOX; everything else should get filtered into folders that you can reference at your leisure.

Finally, do something SMART with your system mail. If you have hundreds of nodes and they all send you a note every time they rotate logs or burp on a cron job, that is bad.

I built an NNTP server for all system mail. All of that mail gets thrown into an NNTP news-group. So I don’t have to look at it, but when I need to investigate a problem I can always refer back to several days of history.

Managing your communications, particularly your email, is one of the keys to going from being a good sysadmin to a great one. A little effort on the front end will reap great gains. The better you manage your incoming email, the better your peace of mind, AND the faster you’ll be able to respond when a serious incident occurs.


Lessons Learned Building OnLive

August 27th, 2010

I just presented at the Large Scale Production Engineering group at our August 26th meeting, giving a talk on some of the things I learned while building out OnLive. I’ve included the text of the talk below as well as the slide-deck. The area that generated the most interest was our use of Kanban to manage workflow in our group. I was inspired to try out kanban based on this blog entry on Jordan Sissel’s blog. Jordan’s blog is one of my favorites; you could spend hours going through his entries on operations and engineering.

In the interest of time there were things I didn’t cover in the presentation. I’ll be blogging about them soon:

  • The importance of a low-information diet, and not being overwhelmed by your email
  • Unicast for server software releases is dead at scale.

Onlive lessons learned


Introduction

I am looking forward to sharing with you several things I learned over the last two years building the OnLive game service. My name is Greg Retkowski, and I’m team lead for Operations Engineering at OnLive. I’m going to share with you what I learned about when to automate, why you can solve scaling problems by breaking it into chunks, and how to balance between interrupt-driven work and long-term project work. But first a little about us..

About OnLive

OnLive is a cloud gaming service that we launched in June – it provides instant-satisfaction gaming. You click ‘buy’ on a title, and you are playing that title within seconds – no media to bring home from the store, and nothing to install on your machine. You can play the latest games on your underpowered and old PCs or Macs.

We do this by running everything in the datacenter – When you play a game, that game is running on our server – you send up controller inputs, we send down video.

Through a bunch of smart video and networking guys, and a few puppycorns, we provide an experience that’s the same as playing on your game console at home.

About Me

I’ve been working in Operations since ’95. I started as the first employee of an Internet Service Provider in Ft Lauderdale, Florida. Two years later I moved here to the Bay Area and I’ve been doing operations for internet startups ever since.

I joined OnLive in 2008 – at the time we had around 100 nodes. Since then we’ve grown by more than a couple of orders of magnitude… and we’re continuing to grow today.

I currently lead a team called Operations Engineering, and we’re responsible for…

  • Automating the deployment and configuration
  • Creating the tools used to manage the service
  • Designing the operational fabric for the service
  • And we’re the FINAL escalation path for the most difficult
    sysadmin issues.

… Basically, we’re the greybeard sysadmins in the organization.

There were a lot of growing pains going from almost nothing to the size we needed to build to support the service. We had challenges, we had failures, and even a dumpster fire; BUT we learned a lot, and ultimately succeeded.

I’m going to cover three areas where I learned a bit about managing complexity in my last two years.

I’m sharing what I learned – hoping that I learned the hard way so that you don’t have to.

When to automate.

You probably have worked in shops where every server is hand-built, some poor sysadmin sitting in a loud cold datacenter, swapping Red Hat CDs and clicking NEXT prompts.. And then repeating it for a row of servers that has to be installed…

That’s probably not the best approach if you need to scale.. But, then, is the answer to spend a ton of time in automating everything?

Our machine database is a rails app and I needed to decide if it would run under mongrel or mod_passenger. We ultimately decided on mod_passenger. I could have spent weeks building our puppet rules around mongrel, automating the install, rotating the logfiles, et cetera.. And those weeks would have been wasted as we chose something different… So what is the right balance?

You want anything going into production to be automated and repeatable. However you don’t know what’s going into production the first time you build it. The first time you are building this you are building it for engineering to evaluate.. You could be building a couple of different NoSQL boxes for engineering, each to be evaluated, and one, or none, will be used in the final product. So when do you automate something?

The FIRST time you build something JUST LEARN about what you are building. At this stage you may be doing throw-away work, you are learning what the software can do. And, more likely than not, you are going to implement it wrong out of inexperience. At this stage, there is no value in automating something you are as likely as not to throw away.

The SECOND time, DOCUMENT IT. Now you are building a second one – you are probably going to build a few more. You’ve learned your lessons on how to do it better the second time. This is the time to DOCUMENT how you did it and make a CHECKLIST. That’ll let you hand off the next build to another team-member, and validate the procedure.

The THIRD time, AUTOMATE IT. By now you know the software inside-out, and you know you will continue using it. Now is the best time to codify it using puppet, chef, or your automation du jour.

The FOURTH time – bonus points for pulling this off – is to TEST IT. There are many ways to approach testing your automation. You could consider your monitoring service checks as your ‘tests’, or you could use a VM test harness to test your automation code.

We’re using a VM test harness to test all of our service installations, but that’s a story for another presentation.

Break things into smaller chunks

The next big lesson we learned as we were building was to split our environment into chunks whenever possible. That takes some explanation, so I’ll relate what we experienced building out our service.

As a side effect of our architecture, we’ve split things into chunks we call slices. This side effect turned out to be very important on our efforts to scale later.

As we grew our service we ran into performance issues in different parts of our management software. One example – as we grew, generating our DNS data took longer and longer, and it eventually took an hour to generate.

But – Because we cut our environment into chunks, we were able to re-write our DNS generator to generate slice-by-slice. It was able to cache slices that didn’t change, and that cut the generation down to a couple of minutes.
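
To make the idea concrete, here is a minimal Python sketch of slice-by-slice generation with caching. It is not our actual generator: the data layout (a dict of slices, each mapping hostnames to addresses), the function names, and the record format are all assumptions made for illustration.

```python
import hashlib


def slice_fingerprint(hosts):
    """Hash a slice's host data so unchanged slices can be detected cheaply."""
    digest = hashlib.sha256()
    for name, ip in sorted(hosts.items()):
        digest.update(f"{name}={ip}\n".encode())
    return digest.hexdigest()


def render_slice(hosts):
    """Render the records for one slice (illustrative A-record format)."""
    return "".join(f"{name} IN A {ip}\n" for name, ip in sorted(hosts.items()))


def generate_dns(slices, cache):
    """Regenerate only slices whose data changed; reuse cached output otherwise.

    `cache` maps slice name -> (fingerprint, rendered text) and is updated in place.
    Returns the combined output and the list of slices that were regenerated.
    """
    parts = []
    regenerated = []
    for slice_name in sorted(slices):
        hosts = slices[slice_name]
        fingerprint = slice_fingerprint(hosts)
        cached = cache.get(slice_name)
        if cached is None or cached[0] != fingerprint:
            cache[slice_name] = (fingerprint, render_slice(hosts))
            regenerated.append(slice_name)
        parts.append(cache[slice_name][1])
    return "".join(parts), regenerated
```

The cost of a run is then driven by how many slices changed, not by the total number of hosts.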

It is much easier to scale your service out when you’ve broken it into smaller and well understood chunks. Where we faced the issue of understanding how to generate DNS for a thousand hosts, we didn’t know how that would behave at two thousand hosts.

But we did understand how our system behaved at the smaller slice size, and going from ten slices, say, to twenty slices, was much easier to manage as we had good knowledge of how everything behaved within the slice.

So, whenever you can, break your environment into manageable chunks, and put as much of your scaling dependencies down into those chunks – and when you have to do things as a whole, cache whenever you can.

Balancing interrupt & Project Work

Our next big lesson was about how to balance interrupt and project work. It took a while to try and find a good balance between the two.

How many of you think that if you reduced interruptions you’d get a lot more work done? If you are like most in operations you know it is a huge problem.

I remember one case – A release deadline was approaching, and my group was on the hook for delivering a new web proxy system and some new admin tools to manage the site. At the same time our interrupt-driven requests from engineering went up because they had the same deadline and needed our support to get their projects done. It was very challenging to balance the two, to be able to unblock engineering, while delivering our projects by the deadline. We worked long hours and still some things got dropped because we didn’t have a good way to balance it out.

Our teams are pulled in many directions. A pure engineering team specs something out and then has several weeks to deliver. A ‘helpdesk’ team goes from interrupt to interrupt to solve problems in a short time window, but no long-term deliverables. In operations, you have to be both engineering and helpdesk. So how did we balance the two?

I looked at development methods like Agile – that would help my team deliver complex projects – BUT weeks between identification of a problem and a resolution isn’t acceptable.

I knew the solution of going from interrupt to interrupt wasn’t the right way to go. That may make people happy in the short term, but then we’d deliver our long-term projects late and poorly implemented, because we didn’t give them the attention they deserved.

I dug a bit deeper and found a solution that was ideally suited for my group.

We borrowed from the lean manufacturing world to solve this problem. We use a system called KANBAN to manage our tasks. KANBAN is easy to implement – you only need a whiteboard and some post-it notes. The central premise of KANBAN is that projects go through different states – INCOMING – WORKING – INTEGRATION – PUSH TO PRODUCTION – and that only a limited number of projects are allowed to be in each state at any time.

We have an INCOMING queue where all projects arrive – each team member can have two projects going at one time, and as they finish one they pull the next-most-important project into their WORKING queue.

It is bad kanban form to either overload the number of projects in a given state or to pull a project and put it in a previous state. We only do that when something crucial comes up.

Using kanban makes sure everyone can work on their projects with as little interruption as possible – and makes sure the next project that gets worked on is always the next most important project.
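
We run this on a whiteboard, but the pull rule is simple enough to sketch in code. The Python below is only an illustration of the two-projects-per-person limit and the pull-the-next-most-important rule. The class and method names are hypothetical, and it models just the INCOMING and WORKING states.

```python
WIP_LIMIT_PER_PERSON = 2  # "two projects going at one time"


class KanbanBoard:
    """Toy model of an INCOMING queue and per-person WORKING queues."""

    def __init__(self, wip_limit=WIP_LIMIT_PER_PERSON):
        self.wip_limit = wip_limit
        self.incoming = []  # (priority, name); lower number = more important
        self.working = {}   # team member -> list of project names

    def add_project(self, name, priority):
        self.incoming.append((priority, name))
        self.incoming.sort()

    def pull_next(self, member):
        """Pull the most important incoming project, unless the member is at the limit."""
        current = self.working.setdefault(member, [])
        if len(current) >= self.wip_limit or not self.incoming:
            return None
        _, name = self.incoming.pop(0)
        current.append(name)
        return name

    def finish(self, member, name):
        """Finishing a project frees a slot so the member can pull again."""
        self.working[member].remove(name)
```

A member who already holds two projects gets nothing from pull_next until they finish one, which is the whole point of the limit.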

Point Of Contact Rotation

I believe interruptions are a HUGE productivity killer in operations groups, so I’ve done what I can to minimize them.

It is very hard for you to focus on projects when you are constantly interrupted. After an interruption you remain unproductive for 20 minutes – before you can be fully engaged in what you were doing. In ops this means you could go all day and get nothing done due to interruptions.

So there’s another strategy we use to minimize interruptions – We rotate everyone on the team through a Point-of-Contact shift each week. Any interrupt-type work goes to the PoC first. You know your interruptions will be minimized most weeks, and that during the week you are PoC most of your time will be interrupt-driven.

It also gives my newer team members an opportunity to learn about all the parts of our service. They have to field all questions for our group regardless of their comfort level.

During this week the PoC guy is responsible for the bug queue and for triaging any new issues that arrive. This helps the rest of the team as everyone knows if they aren’t on-call they’ll be able to work on their current project uninterrupted.

Conclusion

I covered three areas where I learned important lessons, and I hope you find these lessons useful in your environments.

I covered…

  • When to automate – Automation is necessary, but is more
    effective when you do it at the right time.
  • If you break your environment into small chunks, and scale at
    those chunks, you’ll have fewer scaling headaches later.
  • And finally, if you can find a way to bring order from the chaos of
    interruptions, while still delivering on your projects, you’ll be
    happier and the engineers you support will be happier too.

Thanks to Yahoo for organizing and hosting this meetup – thanks to all of you for your time, and if you have any questions I’m happy to answer them now.


A week of work on Zortit

October 1st, 2008

It has been a hectic week working on our webapp. I planned to put up regular updates but have been so busy coding that I haven’t been able to keep up with it. Here’s a big update of what we’ve had going on for the last week or so. We’re coming up on the home stretch with the contest deadline looming on Friday.

Marketing
Cherie has been hard at work. She’s been working on creating the product description, writing a rough draft of a marketing plan, doing competitive research, putting together a board of advisors, and coordinating the efforts of our graphics team.

Branding
Our graphics artist, Chris Barela, got us a great logo last week and we’ve gotten some great website mockups. HTML/CSS guru Jean Leitner has been hard at work converting the mockups into code. Here’s the new logo:

Infrastructure

Our application runs in the AWS cloud. We’re using EC2 instances. We’re using S3 to cache some web API results. There are other AWS services we use which I won’t dive into here.

Right now our app is running as a single EC2 instance, however I’ve partitioned the components out on this instance so that they can be spread across machines. On the front end we’re using HAProxy, with apache/mod_passenger (aka mod_rails) running rails instances, with MySQL as the database. We’re using memcache for a performance speedup, as well as S3 as a cache. I’m doing deployments via Capistrano which works pretty well.

‘I have a dream’ of having instances come up and self configure. Sometime in the future (probably when things are burning down) I’ll set up iclassify and puppet, and perhaps even configure user auth via LDAP. And then systems will spin up, register with iclassify and I’ll be able to provision them mostly automatically. I’d hoped to use pool party, but it’s in re-write right now — perhaps when it is finished.

I also got nagios set up to monitor from an existing machine, and during the process found one of my nameservers was broken – funny things you find out when you start monitoring things!

Collaboration
We’ve been using trac to collaborate – and we’ve managed to proxy tickets from email into trac’s bug tracking. So testers can click a mailto link when something breaks on the site. Very neat! Trac also has a subversion browser I’ve used on occasion and I’ve been posting links to system management pages and whatnot there.

So that’s where we’re at. More news as it happens!


Some random links

September 24th, 2008

I’ve been cleaning out my email today, and finding several gems among the cruft..

A friend sent me a link to a pretty rocking collaborative whiteboarding application: Dabbleboard

I found this great slide-deck from a presentation at the Velocity conference done by Adam Jacob. It’s a great introduction to the latest tools that you can leverage for ‘deploying to the cloud’. Check it out:
Building an Automated Infrastructure (Powerpoint Slides)

Right now I’m busy hacking on an AWS Startup Challenge entry. Our entry is using the theme ‘redefining search’. The product is called Zortit, and will be leveraging the AWS cloud services and be built around Ruby on Rails. Keep tuned in for more updates!


Sizing your infrastructure before launch

March 12th, 2008

So you’ve got a webapp – How do you decide how many servers to deploy? Even if you are still in development and don’t have a single outside user you can make an informed decision on how big to build and what your future network infrastructure will look like.

By gathering some data and doing a little load testing you can launch a new application confident in the fact that you know how many users your application will support.

I will outline the process you can use to size your infrastructure. I’ll be discussing it in the context of a web-based application but these methods can be applied to other types of applications. At my last client, Avvenu, half the network communication was not HTTP based and I used these methods to scale it regardless.

At the end of this process you’ll have a spreadsheet where you’ll be able to plug in arbitrary numbers and get out the scaling information you need. If bizdev asks “what happens if we close this deal and double our user base?” or if engineering finds a way to increase server performance by 100% you’ll be able to quickly answer what the impact on your network would be.

Understanding your usage

The first step in building our scaling model is to understand how your users use the system. There is a big series of questions that you’ll need to answer to get an idea of what that usage looks like.

First you’ll need to know how many active users to expect in the future. This data often comes from your marketing department.

The data is usually presented something like – in one month we’ll have X active users, in two months we’ll have Y, in three months we’ll have Z. You’ll need all these for your scaling spreadsheet.

Next you’ll need to find out how the typical user either uses the site (for existing sites) or is expected to use the site (for new sites). You’ll want this data in a given time period, such as per week. Some examples of what you’ll want to know are:

  • How many times a week does he visit?
  • When he visits what does he do?
      • Downloads a large file?
      • Looks at pages that require a large amount of processing?
          • How many times and which ones?
      • Looks at images that are dynamically created?
      • Looks at static pages?
      • Uploads data?

How much data do you have to maintain per user? This includes files, database rows, or in some applications constant open connections. This will also have to be accounted for in your scaling model.

For an existing application you’ll be able to mine your access logs. Always keep and archive these logs when at all possible. They come in handy to mine for usage pattern data. Throw together some scripts to extract the answers from your access logs.
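
As a rough sketch of what such a script might look like, the Python below tallies requests and bytes sent per client from lines in the Apache Common Log Format. The function names are hypothetical, and treating each client address as one user is a simplifying assumption. If your application logs an account name you would key on that instead.

```python
import re
from collections import Counter

# Common Log Format: host ident user [time] "request" status size
LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} (\d+|-)')


def summarize(lines):
    """Count requests and bytes sent per client address, skipping unparseable lines."""
    requests = Counter()
    bytes_sent = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue
        client, size = match.groups()
        requests[client] += 1
        bytes_sent[client] += 0 if size == "-" else int(size)
    return requests, bytes_sent


def average_requests_per_client(requests):
    """Average request count across all clients seen in the log."""
    return sum(requests.values()) / len(requests) if requests else 0.0
```

Run it over one week of logs and the average becomes the per-user, per-week figure the rest of the model needs.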

For new sites put together a detailed but not overly technical questionnaire for your product manager. The answers from the questionnaire can be used to model typical visitor usage patterns.

One final note on usage patterns. You’ll find that you’ll have some users that look at a few pages every couple of months, and then some users who integrate your site into their daily routine. You’ll need to find the /average/ across all your active users.
 
Distilling the estimated traffic

Now you know how many users you have and the activity of each user. You can now determine how many requests your service will have to handle. You can figure this out by multiplying the number of users by the number of operations per user and then dividing that by the number of seconds in your time period (i.e. a week) to find the average number of operations you’ll have to perform per second.
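
As a worked example of that arithmetic, here is a small Python sketch. The user count and the operations-per-week figure are made-up numbers for illustration, and the function name is hypothetical.

```python
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800


def average_ops_per_second(active_users, ops_per_user, period_seconds=SECONDS_PER_WEEK):
    """Average operations per second across the whole time period."""
    return active_users * ops_per_user / period_seconds


# Illustrative only: 100,000 active users doing 30 operations a week each
# works out to 3,000,000 operations / 604,800 seconds, or just under 5 per second.
example_rate = average_ops_per_second(100_000, 30)
```

Keep in mind this is an average over the whole week, so it understates what you’ll see in your busiest hour.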

It is important to note when sizing your bandwidth that file sizes are measured in BYTES and bandwidth in BITS. Multiply all file sizes by 8 to find the number of bits they would be when crossing Ethernet.
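
The conversion is easy to get wrong in a spreadsheet, so here it is as a tiny Python sketch. The download rate and file size are made-up numbers, the function name is hypothetical, and the result covers payload only (protocol overhead would add to it).

```python
BITS_PER_BYTE = 8


def average_bandwidth_bps(transfers_per_second, avg_file_size_bytes):
    """Payload bandwidth in bits per second for a given transfer rate and file size."""
    return transfers_per_second * avg_file_size_bytes * BITS_PER_BYTE


# Illustrative only: 5 downloads per second of 2,000,000-byte files
# is 80,000,000 bits per second, i.e. 80 Mbit/s of payload.
example_bps = average_bandwidth_bps(5, 2_000_000)
```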

Load Testing

Once you’ve determined what your average user will do you’ll need to automate that behavior for load testing. Typically you’ll set up a load testing cluster – or just test against your pre-production or development environment on off hours. You’ll need to ensure your load-generating machines that run your load testing scripts do not become your bottleneck. In this phase it is very useful to be running server monitoring and graphing software like NAGIOS and CACTI. Make sure your server graphing captures CPU, Disk, Memory, Network, and process utilization so that you can identify which machines bottleneck and what parts of the machines have to be scaled. Sometimes you’ll think an application should bottleneck on CPU and find it bottlenecks on Memory. This helps you make informed purchasing decisions when you buy new machines for your production environment.

You can set up scripts and use tools such as ab (ApacheBench) to throw traffic at your servers and determine the number of operations per second your servers can handle. You’ll have to try to isolate each class of machine (e.g. DB or HTTP) and determine its maximum load. With unlimited resources you could load test a single webserver to determine its limits, then throw 100 load-testers against 100 web-servers to find your DB’s load limits. But for most of us this is impractical. So you may have to be clever: profile the database traffic generated by the webserver load testing, then create a script to drive simulated load at your DB server directly.

It is important in this step to discover any horizontal scaling issues. If you find that adding new servers does NOT increase your capacity as you expect, then you’ll need to work with your software engineering team to fix the scaling problems, or warn management that there is a likely hard limit of X users the system will support.
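
One simple way to put a number on this is to compare measured cluster throughput against ideal linear scaling. The sketch below is only an illustration; the benchmark figures and the method name are hypothetical.

```ruby
# Ratio of measured cluster throughput to ideal linear scaling.
# 1.0 means perfectly linear; lower values point to a shared bottleneck.
def scaling_efficiency(single_server_rps, server_count, measured_cluster_rps)
  measured_cluster_rps.to_f / (single_server_rps * server_count)
end

# Hypothetical benchmark results: one server handles 200 requests/sec,
# but four servers together only reach 600 requests/sec.
eff = scaling_efficiency(200, 4, 600)
puts format('scaling efficiency: %.0f%%', eff * 100) # prints "scaling efficiency: 75%"
```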

Peak vs. Average usage

You will need to determine the peak usage hour(s) of your service and how these relate to your average usage.

I have found that your peak usage will typically be double your average usage. If you have no other data then go ahead and size for that.

If you are sizing an existing application you already know your ratio of peak vs. average by looking at your log data.
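
As a sketch, once you have request counts bucketed by hour from your logs, the ratio is just the busiest hour divided by the mean hour. The counts below are invented.

```ruby
# Ratio of the busiest hour to the average hour.
def peak_to_average(hourly_counts)
  average = hourly_counts.sum.to_f / hourly_counts.size
  hourly_counts.max / average
end

# Hypothetical request counts for a few sample hours.
hourly = [50, 50, 100, 200, 150, 50]
ratio = peak_to_average(hourly)
puts format('peak/average ratio: %.2f', ratio) # prints "peak/average ratio: 2.00"
```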
   
Building the Spreadsheet

TOTAL          (users * usage / time-period-in-seconds) * peak/avg
REQUIRED  =  ------------------------------------------------------
SERVERS           benchmarked-requests-per-second-per-server

Do this for each class of server: web servers, app servers, DB servers, etc. Then make a column for each month of growth. Make your formula round up the number of servers. You can’t deploy 2.3333333 servers, can you?
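
The same formula can be sketched outside a spreadsheet. Everything numeric here (monthly user projections, per-user activity, the 2x peak ratio, and the benchmarked rate per server) is a hypothetical input.

```ruby
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

# Servers needed for one class of server, rounded up.
def servers_required(users, ops_per_user, period_seconds, peak_ratio, rps_per_server)
  peak_rps = (users * ops_per_user).to_f / period_seconds * peak_ratio
  (peak_rps / rps_per_server).ceil
end

# Hypothetical projections: users per month, 120 operations per user per week,
# peak at 2x average, 5 requests per second per web server.
{ 'Jan' => 50_000, 'Feb' => 100_000, 'Mar' => 200_000 }.each do |month, users|
  puts "#{month}: #{servers_required(users, 120, SECONDS_PER_WEEK, 2.0, 5)} web servers"
end
```

With these inputs the unrounded January figure is about 3.97 servers, which is why the formula rounds up rather than to the nearest whole number.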

Often I’ll break this down into the number of active users each server can support. I can then divide the number of projected users by that figure to get the number of required servers.

USERS       benchmarked-requests-per-second-per-server
PER       = ---------------------------------------
SERVER     (per-user-usage / time-period-in-seconds ) * peak/avg

TOTAL                USERS
REQUIRED = ---------------------
SERVERS      USERS-PER-SERVER
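
A sketch of the users-per-server variant, again with hypothetical inputs (5 requests per second per server, 120 operations per user per week, peak at 2x average):

```ruby
SECONDS_PER_WEEK = 7 * 24 * 60 * 60

# Active users one server can carry at peak load.
def users_per_server(rps_per_server, ops_per_user, period_seconds, peak_ratio)
  per_user_peak_rps = ops_per_user.to_f / period_seconds * peak_ratio
  rps_per_server / per_user_peak_rps
end

# Servers needed for a projected user count, rounded up.
def total_servers(users, per_server)
  (users / per_server).ceil
end

capacity = users_per_server(5, 120, SECONDS_PER_WEEK, 2.0)
puts "users per server: #{capacity.round}"
puts "servers for 200,000 users: #{total_servers(200_000, capacity)}"
```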

Your total server numbers can drive other parts of the spreadsheet as well. Every so many servers you’ll need a new Ethernet switch, another rack at the colo, and perhaps increased headcount (try to reduce this by automating as much as possible!)

Make sure your spreadsheet also accounts for the amount of static data you have to maintain per user. For example, how many file servers will you need for the files your users upload? How many users will the disks on your DB server support?
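
Storage can be sized the same way. In this sketch the per-user upload volume and usable capacity per file server are assumptions picked for the example.

```ruby
# File servers needed for user uploads, rounded up.
def file_servers_required(users, bytes_per_user, usable_bytes_per_server)
  (users * bytes_per_user).fdiv(usable_bytes_per_server).ceil
end

GB = 1000**3
TB = 1000**4

# Hypothetical: 200,000 users, 0.5 GB each, 8 TB usable per file server.
needed = file_servers_required(200_000, GB / 2, 8 * TB)
puts "file servers needed: #{needed}" # prints "file servers needed: 13"
```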

Your model should also determine the maximum network traffic at peak times so that you’ll understand when you’ll need to order more bandwidth from your connectivity provider or will need bigger routers and load balancers.

In Conclusion

Using this process has allowed me to help size networks for many internet startups and kept my network operations groups from being caught with their pants down. Determining your scalability and using this data to anticipate required infrastructure growth will help you and the rest of your organization have confidence going forward with a growing userbase.

Maintaining documentation — It’s in the wiki!

February 19th, 2008

One of the most important parts of maintaining a big network environment – with a small staff – is keeping up-to-date documentation on configurations, customizations, and instructions for frequently executed tasks. Commonly when I walk into a new company the documentation is terrible. Why? Because either no thought is given to maintaining documentation, or the documentation system/procedure in place is too time-consuming to use.

If a documentation system is hard to use it won’t be used at all. It should take less effort to update a piece of documentation than to send an email. Locating a document should be just as easy, and should support freeform text searching. That’s why the best documentation setup I’ve worked with is a wiki. It’s easy to create, locate, and change documentation, which encourages people to actually document things! You will have current, verbose documentation when you need it.

If you do use a wiki to maintain your documentation, periodically produce an offline copy and burn it to CD. Put this CD, along with one copy of every vendor-supplied CD, into a CD wallet and keep it at the datacenter. It will prove invaluable when you have outages.

Here’s the wiki engine I’ve used – and liked – in the past. It runs on top of your vanilla LAMP stack.

tikiwiki.org — TikiWiki CMS/Groupware

Small Business: How not to behave on the internet

February 17th, 2008

This is an example of how not to behave if you are a small business on the internet. A friend of mine simply posted a question on a forum; the entirety of his question was: “I’m curious if anybody knows anything about Lucas Environmental Stormwater Services, Inc.?” This simple question has led to the owner threatening legal action in email and via rambling voice-mails. It is never a good idea to threaten someone unless they are blatantly in the wrong and doing something clearly illegal. Otherwise you just rile people up and turn what should have been nothing into a huge negative-publicity exercise for your company. For more information see: mhalligan: Greatest voicemail transcript EVER

RoR: Testing with simple_captcha & HTTP-Auth

February 9th, 2008

While developing a small Ruby on Rails application for The Pilot’s Camping Directory website I ran into a few problems that weren’t solved by a simple Google search – so I’m documenting them here for posterity and future googling. I had problems with testing when using some security features to keep out riff-raff. It was not obvious how to handle simple_captcha or simple_http_auth while doing testing, so I scratched around the net and pieced together a solution for each of the problems. These work with Rails 1.2. With Rails 2.0 YMMV – but then 2.0 breaks every Rails tutorial ever written, so I don’t feel bad if this blows up in 2.0.

Using Mocks for testing with simple_captcha

Tests will fail when trying to save something protected by a captcha – obviously – as stopping automated lever-pulling is exactly what a captcha is designed to do. In my application I use the captcha at the model level, so I simply override the save_with_captcha method with a simple save.

Here’s what my mocks/test/recipient.rb looks like:

# Can't fake captcha for testing - so we mock it out.
require_dependency 'models/recipient'
class Recipient < ActiveRecord::Base
  # save_with_captcha is called on a record, so override the instance method
  def save_with_captcha
    save
  end
end


Functional Testing HTTP-Auth

To test HTTP Authorization / Authentication you must set up your request environment to pass the http authorization into the application. This is known to work with the simple_http_auth plugin, the plugin that I used for my application. Specify this in the setup section of your functional test.

def setup
  @controller = SupersecretController.new
  @request = ActionController::TestRequest.new
  # Send HTTP Basic credentials with every request made in this test
  @request.env['HTTP_AUTHORIZATION'] = "Basic " + Base64.encode64(ADMIN_USER + ':' + ADMIN_PASSWORD)
end
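
For reference, here is a standalone sketch of how that header value is built. It assumes Ruby 1.9 or later, where `Base64.strict_encode64` is available; unlike `encode64`, it does not append a line feed to the encoded string. The helper name and credentials are hypothetical.

```ruby
require 'base64'

# Build an HTTP Basic Authorization header value.
# strict_encode64 emits no line feeds, unlike encode64.
def basic_auth_header(user, password)
  'Basic ' + Base64.strict_encode64("#{user}:#{password}")
end

# Hypothetical credentials for illustration only.
header = basic_auth_header('admin', 'secret')
puts header # prints "Basic YWRtaW46c2VjcmV0"
```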


Integration Testing HTTP-Auth

Integration testing simulates making requests directly to the webserver. To work with http authorization here you must pass in the appropriate authentication headers when making each get/post request. An example is below:

@htauth = "Basic " + Base64.encode64(ADMIN_USER+':' + ADMIN_PASSWORD )
get("/supersecret/index", nil , {:authorization => @htauth})

Sharpening the saw, html and graphics.

January 16th, 2008

In my off-season (winter) I am usually traveling internationally – mostly places that are sunnier and warmer than the San Francisco bay area. It’s often the perfect time for me to sharpen my various skills, being unconstrained by the usual grand infrastructure projects I do in the summer.

It’s often during these times that I brush up my html/coding/graphics skills. Wifi bandwidth here in Puerto Vallarta has gotten much more ubiquitous and reliable, so I’ve got connectivity almost as good as back in SF. I’ve been diving back into apps like Gimp, Aptana, & Inkscape.

I also enjoy catching up on the avant-garde of web artistry and seeing what people are creating with html and css. I appreciate simple designs, so I really enjoyed the sites on display at the link below:

25 Beautiful, Minimalistic Website Designs – Part 2 | Vandelay Website Design
