Azure Pack - Recover from a disaster

This post is not about recovering from a failed data center or about Azure Site Recovery procedures. No, it’s just “simple” recovery when everything went down for whatever reason. I mean, we are humans and we make mistakes. Even in an environment like Azure they make mistakes. Remember the Storage outage last year? That was caused by a human making a mistake while pushing an update. Sorry… shit happens.

But what if something happened to the power, storage or networking, and suddenly the VMs come back up, you log in to Azure Pack and there is an error on your screen? When we run projects at customers we build from the ground up and “assume” it’s running forever. In this post I want to discuss the proper way of bringing an environment back up and running, and the critical points you need to be aware of.

So let’s start the environment :)

START DOMAIN CONTROLLERS
First we need to bring up all the domain controllers for each domain where fabric and Azure Pack components are running. If an ADFS instance is co-located on the domain controller and is using SQL, check once the SQL Server has booted that the services have started correctly.

START HYPER-V SERVERS
If the Hyper-V servers went down, start them first.

The first two steps depend on how you have built your environment. You know the chicken-and-egg story? When you have one physical domain controller, start it first. If you have only virtual domain controllers, start the Hyper-V servers first. Just be careful when you have virtual domain controllers and your Hyper-V server is joined to the domain served by the domain controllers it hosts: make sure it can start the VM when the first Hyper-V node boots. You would not be the first environment I have seen where the cluster couldn’t start because the domain was not up, and the domain couldn’t be brought up because the storage and/or the cluster couldn’t start.
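To make the chicken-and-egg problem a bit more concrete, here is a minimal sketch that writes the dependencies down and lets Python work out a start order, or report the loop when there is none. The component names and both layouts are made up for illustration; they are assumptions, not a description of any real environment.

```python
from graphlib import TopologicalSorter, CycleError


def boot_order(depends_on):
    """Return a start order in which every component follows its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


def find_boot_deadlock(depends_on):
    """Return the dependency loop as a list of components, or None if there is none."""
    try:
        boot_order(depends_on)
    except CycleError as err:
        # graphlib reports the loop as the second element of the exception args.
        return err.args[1]
    return None


# Hypothetical layout with one physical domain controller.
with_physical_dc = {
    "physical-dc": set(),
    "storage": set(),
    "hyperv-cluster": {"physical-dc", "storage"},
    "virtual-dc": {"hyperv-cluster"},
}

# Hypothetical layout with only a virtual domain controller,
# where the cluster itself needs the domain to start.
only_virtual_dc = {
    "storage": set(),
    "hyperv-cluster": {"virtual-dc", "storage"},
    "virtual-dc": {"hyperv-cluster"},
}

print(boot_order(with_physical_dc))
print(find_boot_deadlock(only_virtual_dc))
```

If the second call returns a loop, that is exactly the situation where nothing can start first, and something in the design has to break the loop (for example a domain controller that does not depend on the cluster).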

START SQL SERVERS
Start the SQL Servers that exist in each domain where fabric and Azure Pack components are running.

START VIRTUAL MACHINE MANAGER (VMM) SERVERS
Start the VMM Server in the domain where fabric components are running. If it is a cluster, check that the cluster is up and running, that you can start the VMM Console, and that you can see the Hyper-V hosts and virtual machines.

START SERVICE PROVIDER FOUNDATION (SPF) SERVERS
Start the SPF servers in the domain where fabric components are running. Check that the SPF site in IIS is running and the app pools are started. When load balanced, make sure both nodes appear up in the load balancer.

START OPERATIONS MANAGER SERVERS
Start the Operations Manager Server in the domain where fabric components are running. If you have more than one domain and are using SCOM Gateway servers, start them as well. Make sure the SCOM services are started.

START SERVICE MANAGEMENT AUTOMATION (SMA) SERVERS
Start the SMA runbook servers and web servers in the domain where fabric components are running. Check that the SMA website in IIS is running and the app pools are started. If the SMA web servers are load balanced, make sure they appear up in the load balancer. On the runbook servers, the Runbook service should have started successfully.

START NETWORK VIRTUALIZATION GATEWAY SERVERS
Start the Hyper-V Network Virtualization Gateways and check if the cluster and roles are up and running.

START OTHER INFRA MANAGEMENT SERVERS
By this I mean WSUS, WDS, PKI and, for example, an RDS server used for management of the fabric. Make sure the services are started properly.

START AZURE PACK SQL / MYSQL SERVERS
Start the SQL and MySQL servers to bring up Database as a Service. Make sure the SQL and/or MySQL services are running.

START AZURE PACK WEBSITES FILESERVER
Start the Azure Pack Websites Fileserver. If it is a clustered role, check that the cluster is up and running and the file share is accessible.

START AZURE PACK WEBSITES CONTROLLER SERVERS
Start the Azure Pack Websites Controller servers. We can check later in Azure Pack Admin whether the status is Ready; when it’s not, we need to troubleshoot the issue.

START AZURE PACK WEBSITES MANAGEMENT SERVERS
Start the Azure Pack Websites Management servers. We can check later in Azure Pack Admin whether the status is Ready; when it’s not, we need to troubleshoot the issue.

START AZURE PACK WEBSITES PUBLISHER SERVERS
Start the Azure Pack Websites Publisher servers. We can check later in Azure Pack Admin whether the status is Ready; when it’s not, we need to troubleshoot the issue.

START AZURE PACK WEBSITES WORKER SERVERS
Start the Azure Pack Websites Worker servers. We can check later in Azure Pack Admin whether the status is Ready; when it’s not, we need to troubleshoot the issue.

START AZURE PACK WEBSITES FRONTEND SERVERS
Start the Azure Pack Websites Frontend servers. We can check later in Azure Pack Admin whether the status is Ready; when it’s not, we need to troubleshoot the issue.

START AZURE PACK ADMIN SERVERS
Start the Azure Pack Admin servers. If you scaled these out onto separate boxes, start the AdminAPI, TenantAPI, PowershellAPI, SQLProvider, MySQLProvider, Usage, UsageCollector, WebAppGallery, AdminAuthSite, and AdminSite. Check that the websites are running in IIS and the app pools are started.

START THE AZURE PACK CONSOLE CONNECT SERVERS
Start the Remote Desktop Gateway servers that are used for console connect in the domain where Azure Pack components are running. Check that the Remote Desktop Gateway services are running.

START AZURE PACK TENANT SERVERS
Start the Azure Pack Tenant servers. If you scaled these out onto separate boxes, start TenantSite, TenantPublicAPI, and AuthSite. We usually use the AuthSite for dev/test, so it might not be installed in your environment. In that case you depend on ADFS. Check that the websites are running in IIS and the app pools are started.

START PROXY SERVERS
You might expose ADFS or Azure Pack through a proxy server. Make sure the ADFS proxy service is running.
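Since the order matters so much, it can help to walk the tiers in that same order and stop at the first one that does not answer. The sketch below does this with a plain TCP connect. All host names and port numbers are placeholders I made up for the example, so replace them with your own. Keep in mind that an open port only tells you something is listening; it does not prove the IIS site or app pool behind it is healthy.

```python
import socket


def is_listening(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def first_failing_tier(tiers, probe=is_listening):
    """Walk the tiers in start order and return (tier, (host, port)) for the
    first endpoint that does not answer, or None when everything answers."""
    for tier, endpoints in tiers:
        for host, port in endpoints:
            if not probe(host, port):
                return tier, (host, port)
    return None


# Hypothetical endpoints, listed in the order the servers are started.
TIERS = [
    ("SQL", [("sql01.fabric.local", 1433)]),
    ("SPF", [("spf01.fabric.local", 8090), ("spf02.fabric.local", 8090)]),
    ("SMA", [("sma01.fabric.local", 9090)]),
    ("Azure Pack Admin", [("wapadmin01.wap.local", 30091)]),
    ("Azure Pack Tenant", [("waptenant01.wap.local", 30081)]),
]
```

Calling `first_failing_tier(TIERS)` from a management box gives you the earliest tier to look at, which is usually the right place to start because everything after it depends on it.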

So… now that the environment is back up and running, what I always do is the following:
Log in to SCOM and delete all alerts. Put all machines that show errors in the computer view into maintenance mode for at least half an hour. Be careful not to put your management servers into maintenance mode. Been there, done that. And you don’t want to fix that! When the machines with errors exit maintenance mode, the agent resets their health status. From that moment on we can take any new alerts seriously.
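The one rule in this step is: machines with errors go into maintenance mode, management servers never do. As a small sketch of that selection logic only, assuming you have exported the computer view into a simple name-to-state mapping (the names, the "Error" state string and the function are all hypothetical; this does not talk to SCOM):

```python
def maintenance_candidates(health_by_computer, management_servers):
    """Return the computers in an error state, skipping management servers."""
    # Compare names case-insensitively so "SCOM01" and "scom01" match.
    protected = {name.lower() for name in management_servers}
    return sorted(
        name
        for name, state in health_by_computer.items()
        if state == "Error" and name.lower() not in protected
    )


# Hypothetical export of the computer view after the outage.
health = {
    "hv01.fabric.local": "Error",
    "sql01.fabric.local": "Healthy",
    "scom01.fabric.local": "Error",
    "spf01.fabric.local": "Error",
}
scom_management_servers = ["SCOM01.fabric.local"]

print(maintenance_candidates(health, scom_management_servers))
```

Filtering the list before you act on it is the cheap way to avoid the "been there, done that" moment.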

That was the monitoring part. Let’s move on to the functional part:
Test that you can log in to the Azure Pack Admin Portal and check that none of the enabled providers give an error message in the portal or fail to show up at all. When that’s successful, check in the Websites Cloud in the Admin Portal that you see the roles and that the status is Ready for all components. If they stay in an installing state for too long, check the role and cloud logs and eventually the server event log.

When the Admin Portal is running and OK, we can move on to the Tenant Portal. Log on to the portal. If the screen stays blank or gives an error, don’t shout at the Azure Pack tenant portal right away. When using ADFS it might be that ADFS is not running, so check ADFS first. When that’s running and you can log in but see errors for some resource providers, check the event viewer on the tenant server. There is an Azure Pack event log below Applications and Services Logs. If issues still occur, also look at the Azure Pack event log on the admin server to see what’s going on. If you are in a load-balanced scenario and need to start troubleshooting, I always shut down one side of the load balancing. For me this makes the event logs easier to troubleshoot.

When the tenant server is up and running, check console connect to your VM. Check that you can change network rules when using NVGRE gateways, and that you can create, stop and start a virtual machine. For troubleshooting Remote Desktop Gateways, first check the VMM log to see whether a request arrived at VMM. Then check the Remote Desktop Gateway servers for any events in the event log, and maybe check the event log on the Hyper-V host where the VM is running. Check that your websites are accessible from the internet and that the databases in applications are functioning.

That was the functional part. Let’s move on to the administrative part:
When using custom billing solutions, check that usage is OK and correct by comparing it against some random VMs. You need to do those checks. I have seen usage break a couple of times for whatever reason. Just make sure it’s running and working. Customers pay your boss, your boss pays you… just keep the circle of life running… Further, update the tickets in your helpdesk system and prepare an RFO (reason for outage).
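The usage check on a few random VMs can be sketched like this. It assumes you can get two simple mappings out of your own tooling: the hours you expect each VM to have run, and the hours your billing solution recorded. Both mappings, the 5% tolerance and the function name are assumptions for the example, not part of Azure Pack.

```python
import random


def spot_check_usage(expected_hours, recorded_hours, sample_size=3,
                     tolerance=0.05, seed=None):
    """Compare recorded usage with expected usage for a random sample of VMs.

    Returns a dict of VM name -> (expected, recorded) for every sampled VM
    whose usage is missing or deviates more than the tolerance.
    """
    rng = random.Random(seed)
    names = sorted(expected_hours)
    sample = rng.sample(names, min(sample_size, len(names)))
    mismatches = {}
    for vm in sample:
        expected = expected_hours[vm]
        recorded = recorded_hours.get(vm)
        # Allow a small relative deviation; missing records always count.
        allowed = tolerance * max(expected, 1.0)
        if recorded is None or abs(recorded - expected) > allowed:
            mismatches[vm] = (expected, recorded)
    return mismatches


# Hypothetical figures for the day after the outage.
expected = {"vm-a": 24.0, "vm-b": 24.0, "vm-c": 12.0}
recorded = {"vm-a": 24.0, "vm-b": 10.0}

print(spot_check_usage(expected, recorded, sample_size=3))
```

An empty result on a handful of VMs is no proof that billing is correct, but a non-empty one tells you right away that usage collection needs attention.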

I hope I could give you a better understanding of how the stack depends on its components. If I have forgotten something or you have any suggestions, please let me know.

Twitter: @markscholman