djeps (2463954) writes "I used to achieve this with Nagios event handler scripts and RabbitMQ, but Facebook has done it at a far larger scale than anything from my old sysadmin days: 'When your infrastructure is the size of Facebook’s, there are always broken servers and pieces of software that have gone down or are generally misbehaving. In most cases, our systems are engineered such that these issues cause little or no impact to people using the site. But sometimes small outages can become bigger outages, causing errors or poor performance on the site. If a piece of broken software or hardware does impact the site, then it's important that we fix it or replace it as quickly as possible. Even if it's not causing issues for users yet, it could in the future so we need to take care of it quickly.'"