Channel: Windows Azure – Troubleshooting & Debugging

Role Instance Restarts Due to OS Upgrades


Update March 7, 2012

Added to the Q&A section --- Q: How long will the upgrade take?  How long will my VM be down?

 

------------

 

Roughly once per month Microsoft releases a new Guest OS version for Windows Azure PaaS VMs.  The exact schedule varies and the historical trend can be seen at http://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx.  During this rollout the Windows Azure Fabric Controller will do two passes through all of the datacenters:

  1. Host OS.  The first pass will upgrade the Host OS.  The host OS reboots instances and the fabric controller ensures that only instances from one upgrade domain at a time will be rebooted.  During this reboot, your role instances will go through the standard shutdown process: the RoleEnvironment.Stopping event will be raised and the OnStop method will be called to give you a chance to gracefully shut down the instance.
  2. Guest OS.  Once the Host OS has finished upgrading across the datacenter, the Guest OS will be upgraded for services which are configured to use automatic Guest OS versions.  This upgrade will proceed using standard upgrade domain rules for your service.  Your VM will be rebooted and the Windows Partition (the D drive) will be reimaged with the upgraded OS.

Mark Russinovich has a great blog post which describes the Host OS upgrade process - http://blogs.technet.com/b/markrussinovich/archive/2012/08/22/3515679.aspx.

 

Impact to Your Service

  • As long as each of your roles has 2 or more instances then your service will not experience downtime due to the adherence to upgrade domains.  The blog at http://blog.toddysm.com/2010/04/upgrade-domains-and-fault-domains-in-windows-azure.html has a great explanation of upgrade and fault domains, and why having 2 instances of a role is required to meet the 99.95% uptime SLA.
  • Approximately every month, expect your instances to reboot once for the Host OS update.  If you have automatic guest OS updates, expect your instances to reboot again.  These reboots are typically several hours apart, but this time frame can change depending on the makeup of different services within a datacenter.
  • Your role needs to adhere to the rules around host OS updates, in particular instances should reach the Ready state within 15 minutes of starting the Startup tasks.  For more information about this limitation see http://msdn.microsoft.com/en-us/library/hh543978.
  • Your role instances should be able to handle a Reboot and a Reimage.  The Host OS upgrade will cause a Reboot of your instance, and the Guest OS upgrade will cause the equivalent of a Reimage of your instance.  See the common issues below for more information.

 

Common Issues

See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the processes which are running and the location of log files which can be used to troubleshoot.

 

  1. The most common problem is roles not reaching the Ready state after the OS upgrades.  The most common root cause for this problem is a startup task or code in the OnStart or Run function not running correctly.  There are 2 common categories of this root cause:
    1. A failure of the code to run twice.  The Host OS reboot will cause your startup tasks to run again.  If a startup task executes a command which returns an error when run a second time (e.g. ‘appcmd set config’ to add a section will fail when run twice with the error “New add object missing required attributes. Cannot add duplicate collection entry of type…”) then your startup task will fail and cause your role instance to begin recycling.  To troubleshoot this type of failure, RDP into the VM, look in the Event Logs for errors, and look in WaHostBootstrapper.log for startup task failures.  During your normal development and testing process you should proactively initiate a Reboot of your role instances from the Windows Azure Management portal in order to test your service and make sure that it works correctly in this scenario.  A common fix for startup task failures is to add an 'exit /b 0' to the end of your startup task.  See http://msdn.microsoft.com/en-us/library/windowsazure/hh124132.aspx for more information on why this is needed.
    2. A failure of the code to run after the Windows partition is reimaged.  During the Guest OS portion of the update, the Windows Partition is reimaged.  The Windows Partition is typically where program installations and registry changes are stored, and during the reimage those changes will be lost.  If the startup code assumes that the change still exists (e.g. the startup task makes a registry change and then stores a record of that change on the C: or E: drive so that the code isn’t run twice) then the role instance may fail to work properly.  During your normal development and testing process you should proactively initiate a Reimage of your role instances from the Windows Azure Management portal in order to test your service and make sure that it works correctly in this scenario.
  2. If your startup code takes longer than 15 minutes to complete then you may have multiple role instances taken out of service at the same time.  This is most common when a startup task installs a program or feature, downloads cache data, or downloads website information.  See the Host OS update rules in the ‘Impact to Your Service’ section above for more information about this.
  3. Occasionally the Windows Azure Platform will fail to restart the host or guest OS after an update.  Overall this is a rare scenario and the platform is constantly improving to eliminate these types of failures.  If you are in this scenario then your symptoms will typically be a ‘Waiting for Host’ message in the portal that does not change after at least 15 minutes, and the inability to RDP into the role instance.  In this scenario there is little you can do short of deleting the deployment to recover this instance.  If you open a support incident (http://www.windowsazure.com/en-us/support/contact/) the support team can manually recover that instance.  Note: If you are able to RDP into the role instance then the problem is almost always due to a failure in the startup code as described in common issue #1 above.
  4. During the OS upgrades one or more of your instances will be unavailable at any given time, which will reduce the capacity of your service.  For example, suppose you have 2 instances of a web role and both instances typically run at 75% CPU.  During the OS upgrade one instance will be rebooted, which means all traffic will be directed to the remaining instance.  That load will exceed the capacity of the remaining instance and your service availability will be impacted.  You should ensure that your service has sufficient excess capacity to absorb X% of the instances being unavailable, where X is 1/<number of upgrade domains> (e.g. for 2 upgrade domains you will lose 50% of your capacity, and for 5 upgrade domains you will lose 20% of your capacity).
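
The capacity rule in issue #4 can be checked with a little arithmetic.  The following is a minimal sketch in Python, not part of any Azure SDK; the function names are hypothetical, and it assumes instances are spread as evenly as possible across upgrade domains and that load is redistributed evenly across the instances that remain online.

```python
import math


def capacity_lost_fraction(upgrade_domains: int) -> float:
    """Fraction of capacity offline while one upgrade domain is being updated."""
    if upgrade_domains < 1:
        raise ValueError("upgrade_domains must be at least 1")
    return 1 / upgrade_domains


def load_during_upgrade(instances: int, upgrade_domains: int, cpu_per_instance: float) -> float:
    """Estimated per-instance CPU (as a fraction, 1.0 = 100%) on the remaining
    instances while the largest upgrade domain is offline.

    Assumption: even spread across domains and even redistribution of load.
    """
    if instances < 1 or upgrade_domains < 1:
        raise ValueError("instances and upgrade_domains must be at least 1")
    # A service cannot occupy more upgrade domains than it has instances.
    domains_in_use = min(instances, upgrade_domains)
    # Worst case: the domain holding the most instances goes offline.
    offline = math.ceil(instances / domains_in_use)
    remaining = instances - offline
    if remaining == 0:
        return math.inf  # single instance: nothing left to serve traffic
    return cpu_per_instance * instances / remaining


if __name__ == "__main__":
    # The example from the text: 2 instances, 2 upgrade domains, 75% CPU each.
    print(load_during_upgrade(2, 2, 0.75))
```

For the 2-instance example this gives 1.5, i.e. the surviving instance would need 150% of its CPU, so the service is under-provisioned for an upgrade.  Any result above 1.0 suggests adding instances before the next rollout.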

 

Detection and Notification

Notification

At this time the Windows Azure platform does not offer proactive notifications when an OS upgrade is happening.  The Windows Azure development team is working on this functionality so that service administrators can better plan for upgrades and possible service impact.  Your role instances will receive a RoleEnvironment.Stopping event prior to being shut down and you can use that event to gracefully terminate any work that the role instance is doing or notify an administrator that an instance is shutting down.

In the meantime you can subscribe to the Windows Azure OS Updates RSS feed at http://sxp.microsoft.com/feeds/3.0/msdntn/WindowsAzureOSUpdates.  This feed should be updated the same day that the OS updates start being rolled out to the datacenter.  This does not give advance notification, but it does help identify when the updates are happening.

The Guest OS list at http://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx and the OS version selection dropdown in the management portal are typically updated after the Guest OS rollout has completed, so you should not use the latest entry in these lists as an indication that OS updates are in progress.

 

Detection

At this time there is no direct way to detect a Host OS upgrade, but you can see the evidence of the reboot within the logs on the VM:

    • Search System event logs for event source USER32, event ID 1074, with message “The process D:\Packages\GuestAgent\WaAppAgent.exe (RD00155D50206D) has initiated the shutdown of computer RD00155D50206D on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown”.  This indicates that the Windows Azure fabric’s guest agent (WaAppAgent.exe) initiated a shutdown of the VM.
    • Look in the AppAgentRuntime.log.old files for a message saying “Context_Start” with a Context=”StopContainer()”
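
If you export the System event log and want to pick out these shutdown events automatically, the check can be sketched as follows.  This is a rough Python illustration, not an official tool: the function name is hypothetical, and it assumes each event has already been read into a source, an event ID, and a message string by whatever export method you use.

```python
import re
from typing import Optional

# Matches the message format quoted above, capturing the initiating
# process path and the computer name in parentheses.
SHUTDOWN_PATTERN = re.compile(
    r"The process (?P<process>.+?) \((?P<computer>[^)]+)\) has initiated the shutdown"
)


def guest_agent_shutdown(source: str, event_id: int, message: str) -> Optional[str]:
    """Return the computer name if this event records a shutdown initiated by
    WaAppAgent.exe (USER32, event ID 1074); otherwise return None."""
    if source.upper() != "USER32" or event_id != 1074:
        return None
    match = SHUTDOWN_PATTERN.search(message)
    if match and match.group("process").lower().endswith("waappagent.exe"):
        return match.group("computer")
    return None
```

A 1074 event initiated by any other process (for example an administrator running shutdown.exe) returns None, which helps separate fabric-initiated reboots from manual ones.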

 

Frequently Asked Questions

  • Q: How can I opt out of the OS updates?

A: You cannot opt out of the Host OS updates because Microsoft must maintain updated and patched host OSes within the datacenter.  You can opt out of the Guest OS update by specifying a version of the Guest OS, but note that your service will no longer receive security patches and may be vulnerable.  See http://msdn.microsoft.com/en-us/library/windowsazure/ff729422.aspx.

 

  • Q: How do I force the reboots to be done only during non-business hours?

A: There is no way to control when an individual instance or service will be upgraded.  The upgrade is started on all Azure datacenters across the world at approximately the same time, and the fabric works continuously on upgrading each datacenter.  This process takes several days due to the complexity of making sure upgrade domain rules are followed for all hosted services, and there is no way to control or determine when a specific instance will be impacted.

 

  • Q: I installed something on the VM and now the VM has rebooted and the software I installed is gone, why?

A: Connecting to an Azure PaaS VM via RDP and making changes or installing software is unsupported.  At any point in time the VM may be completely rebuilt and any changes you make will be lost.  This can happen if the hardware fails and we have to start up a new VM on new hardware.  This will also happen during the Guest OS update when the Windows Partition is rebuilt.  If you need to install software or make changes to the VM you must create a startup task and do the work from there.  This ensures that when the VM is recreated your configuration will be applied again.
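
Because a startup task runs on every boot, it has to tolerate both a plain reboot (the change is already there) and a rebuilt VM (the change is gone).  One way to get both is to inspect the actual target state rather than a separate "already ran" marker.  The sketch below shows the pattern in Python under stated assumptions: real startup tasks are usually .cmd scripts, and ensure_line and the settings file are hypothetical stand-ins for whatever change your task makes.

```python
from pathlib import Path


def ensure_line(config_path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `config_path`.

    Checks the file itself instead of a marker stored elsewhere, so it is
    safe to run twice and it re-applies the change after the file is reset.
    Returns True if the file was modified, False if it was already correct.
    """
    existing = config_path.read_text().splitlines() if config_path.exists() else []
    if line in existing:
        return False  # already applied: a second run is a no-op, not an error
    existing.append(line)
    config_path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical settings file; running this script repeatedly is harmless.
    changed = ensure_line(Path("example-settings.conf"), "feature=enabled")
    print("applied" if changed else "already present")
```

The same check-then-apply shape carries over to batch startup tasks: query for the setting first, and only add it when the query shows it is missing.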

 

  • Q: Can one of the updates in the new Guest OS version break my service?

A: The updates that are installed onto the new guest OS version are publicly available and thoroughly tested hotfixes which are also being deployed to servers around the world via Windows Update, and the chance of negative impact to your service is extremely small.  However, the root of the question goes back to how you manage OS patches in your on-premises services: do you install directly on the production servers and assume it will work, or do you have a staging environment where you test the patches first?  You will follow the same pattern in Azure.  If you want to have a staging environment to test patches prior to production then you should configure your production service to use a fixed version OS string in the .cscfg file.  Then when a new guest OS is available you can deploy your service into the staging slot using the newest guest OS version.  After you have validated that the service works correctly on the latest guest OS you can then either do a VIP swap, or do an in-place upgrade of your production service to use the latest OS.

 

  • Q: How long will the upgrade take?  How long will my VM be down?

A: There is a common misconception that the more patches being applied, the longer the update will take.  This is based on the belief that the upgrade works similarly to a Windows Update upgrade on your local desktop machine, where a bunch of patches are copied to Windows and installed with subsequent reboots, but this is not how upgrading works in Azure.  When a new OS version is being released in Azure, the OS team will take the latest image, apply the patches, and then save a new VHD with this new base image.  This base image is then copied to a repository in Azure.  When the fabric is instructed to do an OS upgrade it will first make a copy pass where it copies this new base image VHD to the hard disks on each server in the datacenter that is going to be upgraded.  Once this copy process is finished the fabric will begin the upgrade process, following the normal upgrade domain rules.  When a given host is going to be updated the fabric will do a graceful shutdown of the OS and then start a new VM using the new base image.  So the time it takes to upgrade a given VM is the time it takes to do a graceful Windows shutdown plus the time it takes to start Windows.

