Approach to high-availability in the cloud

By Mahesh Khambadkone on November 1, 2011

I’m having a debate with our infrastructure provider partner.

It’s their opinion that to achieve high-availability (HA), one needs to isolate infrastructure at a per-domain level. So, if one application needs 2 load-balanced web servers, a master database and a slave database, you set up 4 nodes. For another application with similar requirements, you set up another 4 nodes. That makes a total of 8 nodes and 2 load-balancers to manage.

I feel that’s a textbook approach to HA, and that the smarter way is to plan for high-availability at a “per-functionality” level.

Using the same example, you would have 2 nodes running the PHP web servers for both applications, a central master DB and a slave DB. That is 4 nodes with 1 load-balancer, and you scale out after monitoring performance.
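To make the node counts concrete, here is a small Python sketch of the two layouts. The function names are made up for illustration, and it assumes every application needs the same four roles as in the example (2 web servers, 1 master DB, 1 slave DB).

```python
# Roles each application needs, taken from the example above.
ROLES_PER_APP = {"web": 2, "db_master": 1, "db_slave": 1}


def per_domain_nodes(num_apps: int) -> dict:
    """Every application gets its own isolated nodes and its own load-balancer."""
    nodes = sum(ROLES_PER_APP.values()) * num_apps
    return {"nodes": nodes, "load_balancers": num_apps}


def per_functionality_nodes(num_apps: int) -> dict:
    """All applications share one set of nodes per role.

    The starting footprint does not depend on num_apps; extra nodes are
    added later only if monitoring shows they are needed.
    """
    nodes = sum(ROLES_PER_APP.values())
    return {"nodes": nodes, "load_balancers": 1}


if __name__ == "__main__":
    print("per-domain:       ", per_domain_nodes(2))
    print("per-functionality:", per_functionality_nodes(2))
```

For two applications this gives 8 nodes and 2 load-balancers for the per-domain layout, against 4 nodes and 1 load-balancer for the per-functionality layout.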

The principle is that hardware (and therefore its failure) is agnostic to the software that runs on it. Will a piece of hardware fail because it’s running application A or application B, or will it fail because it has to?

This is even more apparent in the cloud, where infrastructure is shared across seemingly independent applications. (The key word is “seemingly”, as the same physical hardware runs multiple types of software through virtualization.)

One can argue that the chance of failure is a function of the number of nodes, but again, I feel that’s more of an academic response, because management and, most likely, running costs become a challenge with more nodes, even in the cloud.
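The "function of the number of nodes" argument can be sketched with a simple model. This is only an illustration: it assumes nodes fail independently and with the same probability over some period, and the 1% figure is a made-up number, not a measurement.

```python
def p_any_node_fails(p_node: float, num_nodes: int) -> float:
    """Probability that at least one of num_nodes fails in a given period.

    Assumes each node fails independently with probability p_node,
    which is a simplification (shared racks, networks and hypervisors
    make real failures correlated).
    """
    if not 0.0 <= p_node <= 1.0:
        raise ValueError("p_node must be between 0 and 1")
    return 1.0 - (1.0 - p_node) ** num_nodes


if __name__ == "__main__":
    P_NODE = 0.01  # hypothetical per-node failure probability
    for n in (4, 8):
        print(f"{n} nodes: {p_any_node_fails(P_NODE, n):.4f}")
```

Under these assumptions, 4 nodes give about a 3.9% chance of having at least one failure to deal with, and 8 nodes about 7.7%. Note that this counts failure events to handle, not application outages, since each layout has redundancy at every tier.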

What do other technically inclined folks think?

4 Comments

Nayana Somaratna November 2, 2011, 10:00 am

@Mahesh:

Quick question: your argument does not seem to take into account that running applications A and B on the same set of nodes may potentially double the load per server. Or were you considering increasing the server capacities at the same time?

Mahesh Khambadkone November 2, 2011, 10:26 am

My point is: if there is capacity on a node, shouldn’t we be able to add more apps to it, rather than being forced to create a new node just because it is a different app?

Nayana Somaratna November 2, 2011, 10:43 am

@Mahesh: now I get it!

I actually had a similar discussion with my co-founder Sandaruwan.

My argument was that aside from capacity, other issues also tend to come into play – such as security and even maintainability (especially if the applications use different technology stacks).

However, his point was that it would be a waste of capacity to have multiple nodes – and that the running costs would also be increased unnecessarily.

In the end we decided to go with multiple apps on a single node, since that seemed to make more sense for our situation.

Mahesh Khambadkone November 2, 2011, 10:45 am

Yes, and keep measuring it. If you find there’s a problem, move out to additional servers, or spawn new servers just for the app that is under load. More often than not, the extra load will only occur on specific days. I would even simply replicate the entire node.
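As a rough illustration of "keep measuring, then scale out", here is a minimal Python sketch. The function name, the 80% threshold and the 5-sample window are all assumptions chosen for the example; a real setup would take these readings from whatever monitoring tool is in use.

```python
from statistics import mean


def should_scale_out(cpu_samples, threshold=0.8, window=5):
    """Return True when average utilisation over the last `window`
    samples is above `threshold`.

    cpu_samples: utilisation readings between 0.0 and 1.0, oldest first.
    Returns False until at least `window` samples have been collected,
    so a single spike does not trigger a new node.
    """
    if len(cpu_samples) < window:
        return False
    return mean(cpu_samples[-window:]) > threshold


if __name__ == "__main__":
    quiet_day = [0.35, 0.40, 0.38, 0.42, 0.37]
    busy_day = [0.70, 0.85, 0.90, 0.88, 0.92]
    print("quiet day:", should_scale_out(quiet_day))
    print("busy day: ", should_scale_out(busy_day))
```

Averaging over a window rather than reacting to one reading fits the point that the load problem usually shows up only on specific days.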