ServiceNow Advanced High Availability Infrastructure
Documentation From ServiceNow
Advanced High Availability Architecture
Every organization, regardless of size, relies upon access to IT and business data and services. In many cases, this accessibility is critical to the continued operation and success of the enterprise.
This white paper provides an overview of the ServiceNow® Advanced High Availability (AHA) architecture – a key element in delivering a true enterprise cloud. Through ServiceNow’s unique, multi-instance architecture, Advanced High Availability meets and exceeds stringent requirements surrounding data sovereignty, availability and performance.
ServiceNow’s data centers and cloud-based infrastructure have been designed to be highly available. All servers and network devices have redundant components and multiple network paths to avoid single points of failure.
At the heart of this architecture, each customer application instance is supported by a multi-homed network configuration with multiple connections to the Internet. Production application servers are load balanced within each data center. Production databases are replicated in near-real time to a peer data center within the same geographic region: Asia Pacific Japan (APJ); Europe, Middle East and Africa (EMEA); North America; or South America.
We leverage AHA for customer production instances in two ways:
• In the event of the failure of one or more infrastructure components, service is restored by transferring the operation of customer instances associated with the failed components to the peer data center.
• Before executing required maintenance, ServiceNow can proactively transfer operation of customer instances impacted by the maintenance to the peer data center. The maintenance can then proceed without impacting service availability.
This approach means that transfers between the active and standby data centers are executed regularly as part of our standard operating procedures. As a result, when a transfer is needed to address a failure, it is a well-practiced operation, which helps ensure that it succeeds and that service disruption is minimized.
Global Data Center Pairs
ServiceNow’s data centers are arranged in pairs. ServiceNow has eight data center pairs (16 data centers in total) across four geographic regions: Asia Pacific Japan (APJ); Europe, Middle East and Africa (EMEA); North America; and South America. Within several of these regions, there are country-specific pairs for Canadian, U.S., Australian, and Swiss customers.
All customer production data is stored in both data centers and kept in sync using asynchronous database replication. Both data centers are active at all times, and each can support the combined production load of the pair. At any given time, one customer's production instance may be running in one data center of the pair while another customer's production instance runs in the other.
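The capacity rule above can be expressed as a simple check. The sketch below is illustrative only: the instance names, data center names, and load units are hypothetical, and it assumes a pair's load is just the sum of its instances' loads. It shows why placement does not matter for survivability: whichever data center fails, every instance ends up in the survivor, so each data center on its own must cover the total.

```python
def per_dc_load(placement: dict[str, str], demand: dict[str, float]) -> dict[str, float]:
    """Sum the production load each data center carries in normal operation.

    placement: instance name -> data center currently active for it (hypothetical names)
    demand:    instance name -> production load in arbitrary units
    """
    loads: dict[str, float] = {}
    for instance, dc in placement.items():
        loads[dc] = loads.get(dc, 0.0) + demand[instance]
    return loads


def survives_loss_of_either(placement: dict[str, str],
                            demand: dict[str, float],
                            capacity: dict[str, float]) -> bool:
    """True if either data center alone could run the pair's whole production load."""
    if len(capacity) != 2:
        raise ValueError("expected exactly one data center pair")
    total = sum(demand[i] for i in placement)
    # After losing one data center, all instances run from the survivor.
    return all(cap >= total for cap in capacity.values())
```

For example, with one instance of load 40 in `dc-1` and one of load 30 in `dc-2`, each data center needs a capacity of at least 70, even though neither carries more than 40 in normal operation.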
ServiceNow maintains continuous, asynchronous replication from the database in the current primary data center (read-write) to the database in the secondary data center (read-only). To transfer a customer instance from the primary data center to the secondary, ServiceNow promotes the secondary to primary and, if the original primary is still available, designates it as the new secondary.
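The role swap described above can be modeled in a few lines. This is a minimal sketch, not ServiceNow's implementation: the class names, fields, and data center names are hypothetical, and "reachable" stands in for the document's "if it still exists". A planned transfer yields a swapped pair; a failover with an unreachable primary leaves the instance running without a secondary.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatabaseCopy:
    data_center: str        # hypothetical data center name
    read_only: bool         # True for the secondary (replica) copy
    reachable: bool = True  # False models a primary that no longer exists


@dataclass
class InstanceDatabases:
    primary: DatabaseCopy
    secondary: Optional[DatabaseCopy]


def swap_roles(dbs: InstanceDatabases) -> InstanceDatabases:
    """Promote the secondary to read-write primary; demote the old primary if possible."""
    if dbs.secondary is None:
        raise ValueError("no secondary database to promote")
    new_primary = dbs.secondary
    new_primary.read_only = False           # promoted: now accepts writes
    old_primary = dbs.primary
    if old_primary.reachable:
        old_primary.read_only = True        # demoted: becomes the read-only replica
        return InstanceDatabases(primary=new_primary, secondary=old_primary)
    # Old primary is gone: run without a secondary until a replica is rebuilt.
    return InstanceDatabases(primary=new_primary, secondary=None)
```

Calling `swap_roles` twice on a healthy pair returns the roles to where they started, which mirrors the document's point that transfers are routine and reversible.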
High-Level Overview of AHA Process
The AHA process comprises eight main steps and is invoked through ServiceNow’s Service Automation Platform under one of two conditions:
- In the event of a service disruption, the ServiceNow operations team determines whether a failover is required.
- For scheduled maintenance activity, the ServiceNow operations team determines if an AHA transfer should be performed.
High-level automated AHA transfer steps:
- Run an end-to-end automated suite of pre-flight checks to ensure that all infrastructure and application configurations associated with the customer’s active and standby instances are in a healthy state, including data replication between the data centers.
- Change the Domain Name System (DNS) information associated with the customer instance.
- Stop all application nodes associated with the customer instance.
- Reverse the roles of the databases: the active (read-write) database becomes passive (read-only), and vice versa.
- Update the database pointer within the application nodes so that it references the new read-write database.
- Start all application nodes associated with the instance.
- Run an end-to-end automated suite of post-flight checks to ensure all systems and configurations are in a healthy state.
- Perform discovery so that the configuration management database (CMDB) is updated with the new configuration.
In the event that an AHA failover is required, some of the above steps are bypassed, because the active instance may not be accessible. In both the AHA transfer and AHA failover scenarios, the cloud automation platform makes the customer instance in the peer data center active.
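The eight steps can be thought of as an ordered pipeline that halts at the first failure. The sketch below assumes each step is a callable returning `True` on success; the step names paraphrase the list above. The document does not say which steps a failover bypasses, so `SKIPPED_ON_FAILOVER` is a hypothetical choice (steps that need the possibly unreachable active side), not ServiceNow's actual rule.

```python
from typing import Callable

# The eight transfer steps, in the order the document lists them.
STEP_NAMES = [
    "pre-flight checks",
    "update DNS",
    "stop application nodes",
    "reverse database roles",
    "repoint application nodes to read-write database",
    "start application nodes",
    "post-flight checks",
    "CMDB discovery",
]

# Hypothetical: the document says some steps are bypassed in a failover
# but does not name them. These two depend on the active side being reachable.
SKIPPED_ON_FAILOVER = {"pre-flight checks", "stop application nodes"}


def run_aha(actions: dict[str, Callable[[], bool]], failover: bool = False) -> list[str]:
    """Run the steps in order, stopping at the first one that fails.

    `actions` maps each step name to a callable that returns True on success.
    Returns the names of the steps that were executed.
    """
    executed = []
    for name in STEP_NAMES:
        if failover and name in SKIPPED_ON_FAILOVER:
            continue
        if not actions[name]():
            raise RuntimeError(f"AHA step failed: {name}")
        executed.append(name)
    return executed
```

Halting on failure matters most for the pre-flight checks: if replication is unhealthy, the transfer never reaches the DNS change or the database role reversal.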
Backup and Recovery
While Advanced High Availability is the primary means to recover data and restore service in the case of a service disruption, in certain cases it is desirable to use ServiceNow’s more traditional data backup and recovery mechanism. This data backup and recovery system works in concert with AHA and acts as a secondary recovery mechanism.
ServiceNow stores production instances in two geographically separate regional data centers, with sub-production instances hosted in a single data center. Backups of the two production databases and the single sub-production database are taken every day for all instances throughout the private cloud infrastructure.
The backup cycle consists of four weekly full backups and the past six days of daily differential backups, which together provide 28 days of backups. All backups are written to disk; no tapes are used, and no backups are sent off site. All the controls that apply to live customer data also apply to backups: if data is encrypted in the live database, it is also encrypted in the backups.
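The retention rule can be sketched to show which backups a restore needs. This is an illustrative model under stated assumptions, not ServiceNow's scheduler: full backups are assumed to run on days divisible by 7, a differential on every other day, and "the past six days" is read as the six most recent calendar days. Because a differential backup holds every change since the last full backup, a restore needs only the latest full on or before the target day plus that day's differential.

```python
from dataclasses import dataclass
from typing import Optional

FULL_EVERY_DAYS = 7   # weekly full backups (from the document)
FULLS_KEPT = 4        # four weekly fulls retained (from the document)
DIFF_WINDOW_DAYS = 6  # differentials kept for the past six days (from the document)


@dataclass(frozen=True)
class Backup:
    day: int   # day the backup was taken (day 0 = first full backup)
    kind: str  # "full" or "diff"


def retained(today: int) -> list[Backup]:
    """Backups still on disk at the end of `today` (assumed schedule, see lead-in)."""
    fulls = [d for d in range(today, -1, -1) if d % FULL_EVERY_DAYS == 0][:FULLS_KEPT]
    diffs = [d for d in range(max(0, today - DIFF_WINDOW_DAYS + 1), today + 1)
             if d % FULL_EVERY_DAYS != 0]
    backups = [Backup(d, "full") for d in fulls] + [Backup(d, "diff") for d in diffs]
    return sorted(backups, key=lambda b: b.day)


def restore_chain(target: int, today: int) -> Optional[list[Backup]]:
    """Backups needed to restore the database as of day `target`, or None if unavailable."""
    kept = retained(today)
    fulls = [b for b in kept if b.kind == "full" and b.day <= target]
    if not fulls:
        return None  # older than the oldest retained full backup
    base = fulls[-1]
    if base.day == target:
        return [base]
    diff = next((b for b in kept if b.kind == "diff" and b.day == target), None)
    # Outside the differential window, only full-backup days are restorable.
    return [base, diff] if diff else None
```

In this model, on day 30 a restore to day 26 uses the day-21 full plus the day-26 differential, while day 20 is not restorable because its differential has aged out and it is not a full-backup day.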
Regular, automated tests are run to verify the quality of backups, and any failures are reported within ServiceNow for remediation.
ServiceNow is responsible for managing the ServiceNow environment, supporting infrastructure, and vendor relationships. As part of these responsibilities, we maintain a 24x7 Site Reliability Engineering (SRE) center to monitor uptime and availability. The SRE center uses a “follow-the-sun” model, which provides continuous security and operational monitoring and support for the ServiceNow environment and infrastructure. ServiceNow rotates operations and technical support daily across teams in North America, the Netherlands, and the U.K. to provide 24x7 operations and security monitoring.
Critical system resources, including DNS, email, ServiceNow’s cloud automation platform and ServiceNow’s Customer Service System are operated in high availability configurations in a minimum of two data centers. None of these resources relies upon ServiceNow’s internal corporate IT infrastructure.
We also use AHA for our own development systems for source code control and the software build process. These systems are hosted in the production data centers to provide the highest continuity for our developers, which enables ServiceNow developers to support and continue developing the application without requiring physical access to ServiceNow offices.
The AHA architecture uses the same transfer process for preventive maintenance and for recovery from actual disasters. Because every maintenance transfer exercises the process that would be used in a disaster, the transfer is practiced routinely, which eliminates the need for a separate yearly disaster recovery test.
Through ServiceNow’s unique, multi-instance architecture, Advanced High Availability meets and exceeds stringent requirements surrounding data sovereignty, availability and performance. If you would like more information on ServiceNow, AHA or our security measures, please contact your local ServiceNow sales representative.