We love Amazon Web Services: the last thing a fast-growing company like ours needs is to worry about how many servers to own. Much, much easier to use Amazon's technology to let our server stack grow to respond to demand.
However, it's one thing to say 'set it up so that it grows to meet demand', and slightly another to make it so. Here are our ten top tips. Some of these are detailed and technical, others are broad-brush architectural principles, but hopefully there's at least one of interest to most readers.
1. Don't design servers to shut down
Once you have an army of servers, there will be casualties. Servers will crash, Amazon will occasionally reboot one, others will just go AWOL. You need to design your system to cope with chaotic shutdown, so why worry about the other kind as well? If you start from the beginning by just pulling the plug, you're more likely to be ready when the plug falls out.
2. The database is the only bottleneck
Most of the bottlenecks you might be used to from outside the cloud are dealt with by AWS. It's up to you to design to use the AWS tools. So, don't put data on disk and mount NFS, just put it in S3. Don't try to roll your own server messaging, make SQS work. Use load balancers and auto scaling groups. About the only place you're likely to have a bottleneck is in front of the database, so focus your technical smarts there.
3. Use AWS I/O
Particularly useful if you have a lot of videos coming in and out: don't send and receive them through your web servers, use AWS servers instead. Their content distribution network is simple to use, so use it. And your contributors can upload directly to S3, so let them do that too...
4. S3 upload gotcha
... although, bear in mind the annoying restriction of S3 uploads, which is that the object key must be the first field in the multipart form upload. This can be a bit tricky if, for example, your client code uses a reasonable dictionary to hold the list of fields, because a dictionary may not hand the fields back in the order you need, and you have to work around this. Groan. You can see why they might want to get the key before the file contents, but before everything else?
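One way to work around the ordering problem is to turn the dictionary into an ordered list of fields just before building the form. The following is a minimal sketch, not our production code: the function name order_upload_fields is made up for illustration, and it assumes the form fields are called "key" and "file". It also puts the file content last, since S3 ignores form fields that come after it.

```python
def order_upload_fields(fields, key_field="key", file_field="file"):
    """Return (name, value) pairs with the object key first and the file last.

    `fields` is a plain dict of form fields. The helper name and the default
    field names are assumptions made for this sketch.
    """
    if key_field not in fields:
        raise ValueError(f"missing required field: {key_field}")

    # The key goes first, whatever order the dict happens to iterate in.
    ordered = [(key_field, fields[key_field])]

    # Everything else (policy, signature, and so on) goes in the middle.
    ordered += [
        (name, value)
        for name, value in fields.items()
        if name not in (key_field, file_field)
    ]

    # The file content goes last, if it is present.
    if file_field in fields:
        ordered.append((file_field, fields[file_field]))
    return ordered


if __name__ == "__main__":
    form = {"acl": "private", "file": b"video-bytes", "key": "uploads/clip.mp4"}
    for name, _ in order_upload_fields(form):
        print(name)
```

Most HTTP client libraries will accept a list of pairs like this in place of a dictionary, which keeps the order intact all the way to the wire.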
5. Script (nearly) everything
Each step that's manual can and will go wrong at just the wrong moment. One of the beauties of AWS is that you can use fifty computers for an hour and never use them again. Take advantage of this by making it simple to create a whole environment for a short while and then throw it away. Give yourself shortcut ways to scoop up log files from an army, log into machines, and so on: time spent on this pays for itself, even in the quite short term.
6. Don't expect things to happen immediately
All those scripts need to be robust enough to cope with major variations in the time it takes to do something. Most operations need a kind of "do it, wait for a while checking whether it happened every few seconds, finally bail if it didn't" logic.
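That "do it, wait, check, bail" pattern can be wrapped up once and reused. Here is a minimal sketch, assuming the thing you are waiting for can be expressed as a function that returns True when it has happened. The name wait_until and the default timings are illustrative, not taken from any AWS library.

```python
import time


def wait_until(check, timeout=300.0, interval=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns a true value.

    Returns True as soon as check() succeeds, and raises TimeoutError if
    `timeout` seconds pass first. `sleep` and `clock` are parameters only
    so that the function can be tested without really waiting.
    """
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        sleep(interval)


if __name__ == "__main__":
    # Hypothetical check: pretend the server comes up on the third poll.
    polls = []

    def server_is_up():
        polls.append(1)
        return len(polls) >= 3

    wait_until(server_is_up, timeout=10, interval=0.1)
    print("ready after", len(polls), "polls")
```

In a real script, the check function would ask AWS for the state of whatever you just started, and the caller would catch TimeoutError to clean up and report the failure.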
7. Use cloud-init
If you're using Linux boxes, then use the cloud-init package to tailor machines at launch time. We create one machine image for each release of our software, whether the machines are web servers or video processors, and whether they're in the production or test environments. Then we use a launch configuration to attach data to a machine at start-up, which gets picked up by cloud-init to tell the machine what to do and who to do it with. That way we have high confidence that test machines and production machines will behave the same (they're built off identical machine images) and the flexibility to add new environments and move environments to different releases without rebuilding our code.
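As an illustration of the kind of data that can be attached at launch, here is a minimal sketch that renders a cloud-config document using cloud-init's write_files module to drop a small settings file onto the machine. The function name, the file path, and the ROLE and ENVIRONMENT settings are all assumptions made for this example, and our real launch data looks different.

```python
def render_user_data(role, environment, path="/etc/example-app/instance.env"):
    """Render a #cloud-config document that writes a small settings file.

    The path and the ROLE / ENVIRONMENT names are hypothetical. The values
    are assumed to be simple words with no newlines or special characters.
    """
    settings = {"ROLE": role, "ENVIRONMENT": environment}
    lines = [
        "#cloud-config",   # first line tells cloud-init how to read the rest
        "write_files:",
        f"  - path: {path}",
        "    content: |",
    ]
    # Each setting becomes one NAME=value line inside the written file.
    lines += [f"      {name}={value}" for name, value in settings.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_user_data("web", "test"))
```

The same machine image can then become a test web server or a production video processor depending only on the text passed in at launch.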
8. Use elastic IPs for special computers
Our database servers need to be reachable by our armies of web servers and video processors. We achieve this by assigning them elastic IP addresses, which means they won't change address. It also means that if one goes down, its replacement steps into place without reconfiguring the other servers.
9. Use the Amazon-assigned public DNS name for those special computers
Once a computer has an elastic IP, it has an Amazon-assigned public DNS name, a fixed public IP, a private DNS name, and a private IP. Traffic routed via the public IP address will incur fees, and the private IP and private DNS name might change, so point your other servers at the public DNS name. The DNS servers inside AWS resolve this to the private IP, so there are no fees.
10. Integrate continuously
You presumably already do continuous integration: run the tests and do a build on every check-in. If you don't, then now would be a good time to start. Once the build is built, deploy it to AWS and run some tests against a real environment. Machines are cheap, so why not?