Top 6 lessons we've learned about AWS
A couple of us attended the first AWS Dublin User Group last Tuesday. We met some interesting people and hope it will help us maximise our AWS investment.
I thought I'd write a quick blog to summarise the main lessons we've learned since we first migrated to Amazon's cloud in 2009.
Lesson 1: The Low Risk Approach is High Risk
We were hosted on physical hardware, so this may not be as big an issue if you are already virtualised, but the first (and potentially biggest) mistake we made when moving to AWS was to replicate our existing hardware, software and network infrastructure rather than redesign with the cloud and AWS in mind. We thought this would be a quick win, a simple "lift and drop", and that we could sit back and design how we should really be using AWS once we had more experience. It seemed like a good idea at the time, but we lost a huge amount of time and resources tracking down performance and stability issues, because the virtual set-up behaved differently from the physical one. With the benefit of hindsight, we would have saved a lot of time and money by designing our solution for AWS from the start.
Lesson 2: Stop Hugging Your Servers
Our physical servers had names, we'd applied labels to them, and if truth be told we'd started to assign personality characteristics to them ("I bet you that's BizNet2 going down again, just because it's Friday night"). With that mindset, we'd start an AWS instance (server) intending never to turn it off unless we really needed to, and we wanted to apply updates directly to these "named" servers. AWS encourages a different approach, and once you get your head around it there is tremendous flexibility and freedom (not to mention scalability and stability) to be gained by treating your servers as disposable instances of a Launch Configuration controlled via Auto Scaling. Need to deploy a new version of the software? Simple: launch new instances with the latest code, then terminate all the existing ones. For our core product we use various scheduled and dynamic autoscaling groups, which means the longest an instance will live is 12 hours. Not enough time to form a strong relationship and start bringing it to the cinema...
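To make that concrete, here's roughly what the deploy pattern looks like using the AWS Python SDK (boto3). This is a minimal sketch: the group name, configuration names and AMI ID are placeholders, and there's no error handling or connection draining.

```python
# Minimal sketch (boto3): treat servers as disposable instances of a
# launch configuration. Names, AMI ID and instance sizes are placeholders.
import boto3

autoscaling = boto3.client('autoscaling')

# Register a new launch configuration baked with the latest code.
autoscaling.create_launch_configuration(
    LaunchConfigurationName='myapp-v2',
    ImageId='ami-12345678',          # AMI containing the new release
    InstanceType='m1.small',
)

# Point the autoscaling group at the new configuration; instances
# launched from now on run the new code.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='myapp-asg',
    LaunchConfigurationName='myapp-v2',
)

# Terminate the old instances; the group replaces each one with a
# fresh instance built from 'myapp-v2', so capacity never drops.
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=['myapp-asg'])['AutoScalingGroups'][0]
for instance in group['Instances']:
    autoscaling.terminate_instance_in_auto_scaling_group(
        InstanceId=instance['InstanceId'],
        ShouldDecrementDesiredCapacity=False,
    )
```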
Lesson 3: Failure shouldn't be a Drama
Our old model (while we had our own physical hardware) was to have Engineers on call to respond to events or outages. Something would go wrong, we'd be alerted by email/SMS (or, worst case, directly by a customer), and an Engineer would diagnose and fix the problem. Often that was high-pressure, liable-to-make-mistakes work, but it also included simple things like restarting a service that we'd promised ourselves we'd script so we wouldn't have to intervene manually next time... except we never got around to it. Now, with AWS, we often become aware of issues after they have already been resolved. You can set up simple application-level health-checks for your instances so that any instance that fails the check is automatically terminated and a new instance is launched to take its place. Likewise, with all the pieces of your solution in AWS, it's relatively simple to put automated procedures in place for most failure scenarios and have automated actions resolve the problems.
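As an example, here's a sketch of that set-up with boto3, using a classic Elastic Load Balancer health-check; the load balancer name, group name, URL and thresholds are all made up:

```python
# Minimal sketch (boto3): an application-level health-check on a classic
# load balancer, plus an autoscaling group that replaces instances that
# fail it. Names and thresholds are placeholders.
import boto3

elb = boto3.client('elb')
autoscaling = boto3.client('autoscaling')

# Ping an application URL rather than just the instance port, so a hung
# app (not just a dead box) counts as a failure.
elb.configure_health_check(
    LoadBalancerName='myapp-elb',
    HealthCheck={
        'Target': 'HTTP:80/healthcheck',  # app-level endpoint
        'Interval': 30,
        'Timeout': 5,
        'UnhealthyThreshold': 2,
        'HealthyThreshold': 2,
    },
)

# Tell the group to trust the ELB's verdict: unhealthy instances are
# terminated and replaced automatically, no pager required.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='myapp-asg',
    HealthCheckType='ELB',
    HealthCheckGracePeriod=300,
)
```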
Lesson 4: Deploy to Multiple Availability Zones
Trust me, there is no sympathy from anyone when your application goes down because AWS has an issue with a single availability zone. You need to design your solution to have instances running in at least two availability zones, so that it keeps running when one zone goes down. Services like DynamoDB, Elastic Load Balancing and RDS are great because they handle this automatically; you really only need to worry about your own instances. One thing to note: when an availability zone does go down, everyone stampedes in unison to the other availability zones, and we've found it can be hours before the AWS API starts responding to requests to launch more instances. This means your Multi-AZ solution can't be "I'll fire up instances in a different AZ if I'm affected"; the instances need to be ALREADY running.
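In boto3 terms, that can be as simple as making sure your autoscaling group spans two zones with enough minimum capacity for roughly one instance in each (the group name, zones and sizes below are placeholders):

```python
# Minimal sketch (boto3): keep instances ALREADY running in at least two
# availability zones. The group balances instances across the zones.
import boto3

autoscaling = boto3.client('autoscaling')

autoscaling.update_auto_scaling_group(
    AutoScalingGroupName='myapp-asg',
    AvailabilityZones=['eu-west-1a', 'eu-west-1b'],
    MinSize=2,   # roughly one instance per zone, running at all times
    MaxSize=8,
)
```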
Lesson 5: Pay for what you Provision
One of the accepted benefits of the cloud model is that you only "pay for what you use". That's not exactly correct: you really pay for what you provision, whether it's running at 80% CPU or 1% CPU. For HA/DR purposes you will need at least one instance in each of two AZs, and you are much better off splitting your application across smaller instances that you can scale up and down on demand. If you create a DynamoDB table with a high provisioned throughput and forget to reset it after a batch job, you keep paying at the higher rate. You pay because you provisioned it, not because you used it. We found our storage costs were creeping up unexpectedly until we realised that EBS volumes attached to terminated instances were just hanging around afterwards, because we hadn't set them to delete on termination. Giving your Dev team the ability to fire up hundreds of instances at will is hugely empowering, but it has to come with the responsibility of turning them off or terminating them.
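Both of those traps are cheap to avoid in code. Here's a sketch with boto3; the table name, capacities, AMI ID and device name are all placeholders:

```python
# Minimal sketch (boto3): two of the provisioning traps above.
import boto3

dynamodb = boto3.client('dynamodb')
ec2 = boto3.client('ec2')

# Drop a table's provisioned throughput back down once the batch job
# is done; you pay for the provisioned rate whether you use it or not.
dynamodb.update_table(
    TableName='address-results',
    ProvisionedThroughput={
        'ReadCapacityUnits': 5,
        'WriteCapacityUnits': 5,
    },
)

# Launch instances with their EBS root volume set to delete on
# termination, so dead instances don't leave paid-for volumes behind.
ec2.run_instances(
    ImageId='ami-12345678',
    MinCount=1,
    MaxCount=1,
    InstanceType='m1.small',
    BlockDeviceMappings=[{
        'DeviceName': '/dev/sda1',
        'Ebs': {'DeleteOnTermination': True},
    }],
)
```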
Lesson 6: Take Advantage of AWS Products
AWS's distributed products are great. They work reliably, and you don't have to manage them. It took a while for our Dev team to realise that we should use everything AWS has to offer as the default position rather than stick with what we already knew (like keeping our own file server instead of using S3 because we'd have had to re-write existing code; the file server fell over, and the code is now re-written). We can process 3 million addresses in an hour when needed by inserting the requests into an SQS queue, firing up 100 spot instances and reading the results from a DynamoDB table. Total cost? $10 for the instances and a few bucks for the DynamoDB table.
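If you wanted to sketch that fan-out pattern in code, it might look something like this with boto3. The queue name, spot bid, AMI ID and the two sample addresses are all made up, and the worker AMI is assumed to drain the queue and write its results to DynamoDB:

```python
# Minimal sketch (boto3) of the fan-out pattern described above: queue
# the work in SQS, then bid for cheap spot capacity to chew through it.
import boto3

sqs = boto3.resource('sqs')
ec2 = boto3.client('ec2')

# Push each address-processing request onto the queue.
queue = sqs.create_queue(QueueName='address-requests')
for address in ['1 Main St', '2 High Rd']:   # stand-in for 3M addresses
    queue.send_message(MessageBody=address)

# Bid for 100 spot instances; each worker reads requests from the
# queue and writes its results to a DynamoDB table.
ec2.request_spot_instances(
    SpotPrice='0.05',
    InstanceCount=100,
    LaunchSpecification={
        'ImageId': 'ami-12345678',   # worker AMI
        'InstanceType': 'm1.small',
    },
)
```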