
Troubleshooting Scenario 7 – Role Recycling


In Troubleshooting Scenario 1 we looked at a scenario where the role would recycle after deployment and the root cause was easily seen in the Windows Azure Event Logs.  This blog post will show another example of this same type of behavior, but with a different, and more difficult to find, root cause.  This is a continuation of the troubleshooting series.


Symptom

You have deployed your Azure hosted service and it shows as Recycling in the portal.  But there is no additional information such as an Exception type or error message.  The role status in the portal might switch between a few different messages such as, but not limited to:

  • Recycling (Waiting for role to start… System startup tasks are running.)
  • Recycling (Waiting for role to start… Sites are being deployed.)
  • Recycling (Role has encountered an error and has stopped. Sites were deployed.)


Get the Big Picture

Similar to the previous troubleshooting scenarios we want to get a quick idea of where we are failing.  Watching task manager we see that WaIISHost.exe starts for a few seconds and then disappears along with WaHostBootstrapper.


From the ‘Get the Big Picture’ section in Troubleshooting Scenario 1 we know that if we see WaIISHost (or WaWorkerHost) then the problem is most likely a bug in our code which is throwing an exception and that the Windows Azure and Application Event logs are a good place to start.


Check the logs

Looking at the Windows Azure event logs we don’t see any errors.  The logs show that the guest agent finishes initializing (event ID 20001), starts a startup task (10001), successfully finishes a start task (10002), then IIS configurator sets up IIS (10003 and 10004), and then the guest agent initializes itself again and repeats the loop.  No obvious errors or anything to indicate a problem other than the fact that we keep repeating this cycle a couple times per minute.


Next we will check the Application event logs to see if there is anything interesting there.

The Application event logs are even less interesting.  There is virtually nothing in there, and certainly nothing in there that would correlate to an application failing every 30 seconds.


As we have done in the previous troubleshooting scenarios we can check some of the other commonly used logs to see if anything interesting shows up.

WaHostBootstrapper logs
If we check the C:\Resources folder we will see several WaHostBootstrapper.log.old.{index} files.  WaHostBootstrapper.exe creates a new log file (and archives the previous one) every time it starts up, so based on what we were seeing in Task Manager and the Windows Azure event logs it makes sense to see lots of these host bootstrapper log files.  When looking at the host bootstrapper log file for a recycling role we want to look at one of the archived files rather than the current WaHostBootstrapper.log file.  The reason is that the current file is still being written, so depending on when you open the file it could be at any point in the startup process (ie. starting a startup task) and most likely won’t have any information about the crash or error which ultimately causes the processes to shut down.  You can typically pick any of the .log.old files, but I usually start with the most recent one.

The host bootstrapper log starts off normally and we can see all of the startup tasks executing and returning with a 0 (success) return code.  The log file ends like this:


[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Client reported status 1.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86EF0) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (3752).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (2596).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86E00) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:31.300, INFO ] Client reported status 2.

No error messages or failures (remember from scenario 2 that we can ignore the ‘Failed to connect to client’ and 0x800706ba errors), just a status value of 2 from WaIISHost.exe.  The status is defined as an enum with the following values:


0 = Healthy
1 = Unhealthy
2 = Busy

We would typically expect to see a 1 (Unhealthy while the role is starting up), then a 2 (Busy while the role is running startup code), and then a 0 once the role is running in the Run() method.  So this host bootstrapper log file is basically just telling us that the role is in the Busy state while starting up and then disappears, which is pretty much what we already knew.

WindowsAzureGuestAgent logs

Once WaIISHost.exe starts up, the guest agent is pretty much out of the picture, so we won’t expect to find anything in these logs, but since we haven’t found anything else useful it is good to take a quick look to see if anything stands out.  When looking at multiple log files, especially for role recycling scenarios, I typically find one point in time when I know the problem happened and use that consistent time period to look across all logs.  This helps prevent aimlessly looking through huge log files hoping that something jumps out.  In this case I will use the timestamp 2013/10/02, 22:09:31.300 since that is the last entry in the host bootstrapper log file.

AppAgentRuntime.log


[00002608:00003620, 2013/10/02, 22:09:21.789, INFO ] Role process with id 2916 is successfully resumed
[00002608:00003620, 2013/10/02, 22:09:21.789, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateSuspended to RoleStateBusy.
[00002608:00001840, 2013/10/02, 22:09:29.566, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateUnhealthy.
[00002608:00003244, 2013/10/02, 22:09:31.300, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateUnhealthy to RoleStateBusy.
[00002608:00003620, 2013/10/02, 22:09:31.535, FATAL] Role process exited with exit code of 0
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateStopping.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] Waiting for ping from LB.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] TIMED OUT waiting for LB ping. Proceeding to stop the role.
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateStopping to RoleStateStopped.

We can see the WaHostBootstrapper process starting (PID 2916, which matches the PID:TID we see in the WaHostBootstrapper.log – {00002916:00001744}).  Then we see the role status change to Busy, Unhealthy, then Busy, which is exactly what we see in the host bootstrapper log file.  Then the role process exits and the guest agent proceeds to do a normal stop role and then start role.  So nothing useful in this log.


Debugging

At this point we have looked at all of the useful logs and have not found any indication of what the source of the problem might be.  Now it is time to do a live debug session in order to find out why WaIISHost.exe is shutting down.

The easiest way to start debugging on an Azure VM is with AzureTools.  You can learn more about AzureTools and how to download it from http://blogs.msdn.com/b/kwill/archive/2013/08/26/azuretools-the-diagnostic-utility-used-by-the-windows-azure-developer-support-team.aspx.

First we want to download AzureTools and then double-click the X64 Debuggers tool, which will download and install the Debugging Tools for Windows (including WinDBG).


Now we have to get WinDBG attached to WaIISHost.exe.  Typically when debugging a process you can just start WinDBG and go to File –> Attach to a Process and select the process from the list, but in this case WaIISHost.exe is crashing immediately on startup so it won’t show up in the currently running process list.  The typical way to attach to a process that is crashing on startup is to set the Image File Execution Options Debugger key to start and attach WinDBG as soon as the process starts.  Unfortunately this solution doesn’t work in an Azure VM (for various reasons) so we have to come up with a new way to attach a debugger.

AzureTools includes an option under the Utils tab to attach a debugger to the startup of a process.  Switch to the Utils tab, click Attach Debugger, select WaIISHost from the process list, then click Attach Debugger.  You will see WaIISHost show up in the Currently Monitoring list.  AzureTools will attach WinDBG (or whatever you specify in Debugger Location) to a monitored process the next time that process starts up.  Note that AzureTools will only attach the next instance of the target process that is started – if the process is currently running then AzureTools will ignore it.


Now we just wait for Azure to recycle the processes and start WaIISHost again.  Once WaIISHost is started, AzureTools will attach WinDBG and you will see a screen like this:

[Screenshot: WinDBG attached to the new WaIISHost.exe process]

Debugging an application, especially using a tool like WinDBG, is oftentimes more art than science.  There are lots of articles that talk about how to use WinDBG, but Tess’s Debugging Demos series is a great place to start.  Typically in these role recycling scenarios where there is no indication of why the role host process is exiting (ie. the event logs aren’t showing us an exception to look for) I will just hit ‘g’ to let the debugger go and see what happens when the process exits.

WinDBG produces lots of output, but here are the more interesting pieces of information:


Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALIZING
[00000704:00003424, INFO ] Initializing runtime.
Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALED RETURNED. HResult=0
Microsoft.WindowsAzure.ServiceRuntime Information: 101 : Role environment . INITIALIZED
ModLoad: 00000000`00bd0000 00000000`00bda000   E:\approot\bin\MissingDependency.dll
Microsoft.WindowsAzure.ServiceRuntime Critical: 201 : ModLoad: 000007ff`a7c00000 000007ff`a7d09000   D:\Windows\Microsoft.NET\Framework64\v4.0.30319\diasymreader.dll
Role entrypoint could not be created:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. —> System.IO.FileNotFoundException: Could not load file or assembly ‘Microsoft.Synchronization, Version=1.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91′ or one of its dependencies. The system cannot find the file specified.
   at MissingDependency.WebRole..ctor()
   — End of inner exception stack trace —
   at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandleInternal& ctor, Boolean& bNeedSecurityCheck)
   at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
   at System.RuntimeType.CreateInstanceDefaultCtor(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
   at System.Activator.CreateInstance(Type type, Boolean nonPublic)
   at System.Activator.CreateInstance(Type type)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CreateRoleEntryPoint(RoleType roleTypeEnum)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.InitializeRoleInternal(RoleType roleTypeEnum)

  • The first 4 lines tell us that the Azure serviceruntime was initialized successfully. 
  • Line 5 shows that my role entry point (MissingDependency.dll – the WebRole.cs code such as the OnStart method) is loaded.  This tells us that we are getting into custom code and the problem is probably not with Azure itself.
  • Line 6 is loading diasymreader.dll.  This is the diagnostic symbol reader and you will see it loaded whenever a managed process throws a second chance exception.  The fact that this comes shortly after loading my DLL tells me that it is probably something within my DLL that is causing a crash.
  • Line 7, “Role entrypoint could not be created:”, tells me that Azure (WaIISHost.exe) is trying to enumerate the types in the role entry point module (MissingDependency.dll) to find the class that inherits from RoleEntryPoint so that it knows where to call the OnStart and Run methods, but it failed for some reason.
  • The rest of the lines show the exception being raised which ultimately is causing the process to exit.

The exception message and callstack tell us that WaIISHost.exe was not able to find the Microsoft.Synchronization DLL when trying to load the role entry point class and running the code in MissingDependency.WebRole..ctor(). 
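
To make the failure mode concrete, here is a rough sketch (not the actual WaIISHost implementation) of the kind of reflection the runtime performs to create the role entry point.  If the entry point’s constructor references a type from an assembly that was never deployed, loading that type fails with a FileNotFoundException, which Activator.CreateInstance surfaces as the inner exception of a TargetInvocationException – exactly the callstack shown above.

// Simplified sketch of role entry point creation via reflection (illustrative only)
using System;
using System.Linq;
using System.Reflection;
using Microsoft.WindowsAzure.ServiceRuntime;

static class RoleEntryPointLoader
{
    public static RoleEntryPoint CreateEntryPoint(string assemblyPath)
    {
        // e.g. E:\approot\bin\MissingDependency.dll
        Assembly entryPointAssembly = Assembly.LoadFrom(assemblyPath);

        // Find the class that inherits from RoleEntryPoint
        Type entryPointType = entryPointAssembly.GetTypes()
            .First(t => typeof(RoleEntryPoint).IsAssignableFrom(t) && !t.IsAbstract);

        // Invoking the constructor runs the .ctor body; if that body references a type from a
        // missing assembly (Microsoft.Synchronization in this scenario) the FileNotFoundException
        // is wrapped in a TargetInvocationException.
        return (RoleEntryPoint)Activator.CreateInstance(entryPointType);
    }
}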

 

Intellitrace

The above shows how to do a live debug, which has some nice benefits – there is no need to redeploy in order to troubleshoot, so it can be much faster if you are experienced with debugging, and you are in a much richer debugging environment, which is often required for the most complex problem types.  But for issues such as role recycles it is often easier to turn on Intellitrace and redeploy.  For more information about setting up and using Intellitrace see http://msdn.microsoft.com/en-us/library/windowsazure/ff683671.aspx or http://blogs.msdn.com/b/jnak/archive/2010/06/07/using-intellitrace-to-debug-windows-azure-cloud-services.aspx.

For this particular issue I redeployed the application with Intellitrace turned on and was quickly able to get to the root cause.

[IntelliTrace screenshots showing the exception details that identify the root cause]
 

 

Solution

Typically once I think I have found the root cause of a problem I like to validate the fix directly within the VM before spending the time to fix the problem in the project and redeploy.  This is especially valuable if there are multiple things wrong (ie. multiple dependent DLLs that are missing) so that you don’t spend a couple hours in a fix/redeploy cycle.  See http://blogs.msdn.com/b/kwill/archive/2013/09/05/how-to-modify-a-running-azure-service.aspx for more information about making changes to an Azure service.

Applying the temporary fix:

  1. On your dev machine, check in Visual Studio to see where the missing DLL is located.
  2. Copy that DLL to the Azure VM into the same folder as the role entry point DLL (e:\approot\bin\MissingDependency.dll in this case).
  3. On the Azure VM, close WinDBG in order to let WaIISHost.exe finish shutting down which will then let Azure recycle the host processes and attempt to restart WaIISHost.

Validating the fix:

  • The easiest way to validate the fix is to just watch Task Manager to see if WaIISHost.exe starts and stays running.
  • You should also validate that the role reaches the Ready state.  You can do this 3 different ways:
    • Check the portal.  This may take a couple minutes for the HTML portal to reflect the current status.
    • Open C:\Logs\WaAppAgent.log and scroll to the end.  You are looking for “reporting state Ready.”
    • Within AzureTools download the DebugView.zip tool.  Run DebugView.exe and check Capture –> Capture Global Win32.  You will now see the results of the app agent heartbeat checks in real time.

Applying the solution:

At this point we have validated that the only problem is the missing Microsoft.Synchronization.DLL so we can go to Visual Studio and mark that reference as CopyLocal=true and redeploy.


Topology Blast–Send topology change updates to all instances at once


 

Windows Azure SDK 2.2 introduces the concept of a topology blast.  This blog post will describe how topology changes happen at the fabric level and how you can take advantage of topology blast to build a more robust service.

 

Definitions:

  • Role Topology – The number of instances, number of internal endpoints, and composition of internal endpoints (ie. internal IP addresses of instances, also known as DIP addresses).
  • Topology Change – Any change in this topology.  Typically a scale up or down of the number of instances, or a service healing event which causes one VM to move to a new physical server and obtain a new internal IP address.
  • Rolling Upgrade – The process the fabric controller uses to make changes to a hosted service.  The fabric controller will send the change to Upgrade Domain #0, wait for all instances in UD #0 to return to the Ready state, and then move to UD #1, continuing until all UDs have been walked.  See the Upgrade Domain information at http://msdn.microsoft.com/en-us/library/windowsazure/hh472157.aspx.
  • Topology Blast – A new feature to allow topology changes to be sent to all UDs at one time, bypassing the normal UD walk.
  • topologyChangeDiscovery – Use this .csdef <ServiceDefinition> attribute to control the type of topology change your service receives.  topologyChangeDiscovery=”Blast” will turn on Topology Blast.
  • RoleEnvironment.SimultaneousChanged/SimultaneousChanging – These events are raised in your code when a topology change happens and you have set topologyChangeDiscovery=”Blast”.

* Note that you must have an InternalEndpoint defined in order to receive topology changes.  If you turn on RDP for your service an internal endpoint is implicitly created.

 

 

Picture 1 – Hosted service with 3 instances


 

Picture 1 shows a standard Windows Azure hosted service with 3 role instances.  Each instance has a Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CurrentRoleInstance.InstanceEndpoints list with the endpoints pointing to the correct DIPs for all other instances.
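
As a minimal sketch of how an instance can discover those internal endpoints (the role name “WorkerRole1” and the endpoint name “InternalEp” are illustrative – use the names from your own csdef):

// Enumerate the internal endpoints of every instance of a role
using System;
using Microsoft.WindowsAzure.ServiceRuntime;

public static class EndpointLister
{
    public static void DumpInternalEndpoints()
    {
        foreach (RoleInstance instance in RoleEnvironment.Roles["WorkerRole1"].Instances)
        {
            RoleInstanceEndpoint endpoint = instance.InstanceEndpoints["InternalEp"];

            // IPEndpoint contains the DIP and port to use when connecting to that instance
            Console.WriteLine("{0} -> {1}", instance.Id, endpoint.IPEndpoint);
        }
    }
}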

 

 

Picture 2 – Standard topology change and rolling upgrade


 

Picture 2 shows a standard rolling upgrade after a topology change has occurred.

  1. The server hosting IN_2 (with original DIP 10.31.70.8) has had a failure.  The fabric controller automatically detects this failure and recreates IN_2 on a new server.  IN_2 receives a new DIP of 10.25.18.2.  This constitutes a topology change.
  2. The fabric controller begins the rolling upgrade process in order to notify the rest of the instances that there has been a topology change.  Each instance will receive a RoleEnvironment.Changing and RoleEnvironment.Changed event, with the change of type ServiceRuntime.RoleEnvironmentTopologyChange.  The InstanceEndpoints list will be updated with the new DIP(s).
  3. In Picture 2 the fabric controller is currently processing UD #0 and IN_0 has a correct InstanceEndpoints list.  When IN_0 attempts to communicate with IN_2 using an InternalEndpoint it will connect to the new DIP 10.25.18.2.
  4. IN_1 is in UD #1 and has not yet been notified of the topology change and IN_1 is still using the old incorrect InstanceEndpoints list.  When IN_1 attempts to communicate with IN_2 using an InternalEndpoint it will fail to connect to the old DIP 10.31.70.8.

Depending on the architecture of your service and how well your code tolerates communication failures this scenario of some instances with the correct InstanceEndpoints list and some with an incorrect InstanceEndpoints list can cause significant problems in your application.

 

 

Picture 3 – Topology change with Topology Blast enabled


Picture 3 shows a topology blast after a topology change has occurred.

  1. The server hosting IN_2 (with original DIP 10.31.70.8) has had a failure.  The fabric controller automatically detects this failure and recreates IN_2 on a new server.  IN_2 receives a new DIP of 10.25.18.2.  This constitutes a topology change.
  2. The fabric controller initiates a topology change.  Because the service has set topologyChangeDiscovery=”Blast” the fabric will initiate a topology blast and send the topology change to all instances at the same time.
  3. Both IN_0 and IN_1 receive a RoleEnvironment.SimultaneousChanging event at the same time with an updated InstanceEndpoints list.  Both instances are now able to successfully communicate with IN_2.

Note that your architecture must still be tolerant of the communication failures that will happen from the time that the server hosting IN_2 fails until the fabric recreates IN_2 and sends the topology change.

 

 

Turning on Topology Blast

Topology Blast is enabled per deployment.  To turn this on for a deployment set topologyChangeDiscovery=”Blast” in the csdef.  Your service will now begin receiving topology blast configuration changes.

<ServiceDefinition name="waTestFramework"
                   topologyChangeDiscovery="Blast"
                   xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
                   schemaVersion="2013-10.2.2">

 

Optionally, if you need to execute code to respond to a topology change you can implement the following events:

Topology changes will now raise the RoleEnvironment.SimultaneousChanged/SimultaneousChanging events instead of the default Changed/Changing events.  Add handlers for these two events in OnStart and then implement your code in the appropriate event handler.  These new events behave the same as the old ones with two exceptions:

  • With topology blast turned on, only topology changes will fire the Simultaneous* events and all other types of changes will fire the standard Changed/Changing events.
  • The SimultaneousChangingEventArgs does not implement a Cancel property.  This is to prevent all role instances from recycling at the same time.

 

public override bool OnStart()
{
    // For information on handling configuration changes
    // see the MSDN topic at http://go.microsoft.com/fwlink/?LinkId=166357.
    RoleEnvironment.SimultaneousChanged += RoleEnvironment_SimultaneousChanged;
    RoleEnvironment.SimultaneousChanging += RoleEnvironment_SimultaneousChanging;

    return base.OnStart();
}

void RoleEnvironment_SimultaneousChanging(object sender, SimultaneousChangingEventArgs e)
{
    // Add code to run before the InstanceEndpoints list is updated
    // WARNING: Make sure you do not call RequestRecycle or throw an unhandled exception
}

void RoleEnvironment_SimultaneousChanged(object sender, SimultaneousChangedEventArgs e)
{
    // Add code to run after the InstanceEndpoints list is updated
}

Windows Azure Storage Analytics SDP Package


In a previous post we looked at the Windows Azure PaaS SDP package which allows you to quickly and easily gather all of the log data to determine root cause for a variety of PaaS compute issues.  This post will look at a new SDP package which allows you to quickly and easily gather all of the storage analytics logs.

 


Getting the SDP Package

This package will only work on a computer running Windows 7 or later, or Windows Server 2008 R2 or later.

  1. Open PowerShell
  2. Copy/Paste and Run the following script

md c:\Diagnostics; Import-Module bitstransfer; Start-BitsTransfer http://dsazure.blob.core.windows.net/azuretools/AzureStorageAnalyticsLogs_global.DiagCab c:\Diagnostics\AzureStorageAnalyticsLogs_global.DiagCab; c:\Diagnostics\AzureStorageAnalyticsLogs_global.DiagCab

Alternatively you can download and save the .DiagCab directly from http://dsazure.blob.core.windows.net/azuretools/AzureStorageAnalyticsLogs_global.DiagCab.

 


Running the SDP Package

  1. Enter the storage account name.
  2. Enter the storage account key.  Note that this key is only temporarily used within the SDP package utility.  It is not saved or transferred.
  3. Enter the starting time and ending time.  The default values will gather logs from the past 24 hours.
  4. Select the analytics logs to gather.
  5. When the tool is finished gathering data click Next and an Explorer window will open showing the latest.cab, which is a compressed file containing the most recent set of data, along with folders containing the data from each time the SDP package was run.

 

The Results

There will be several files created as a result of running this SDP package.  The important ones are:

  1. ResultReport.xml.  This file lists the data collected and includes the storage account name and time range specified.  In the future we will include intelligent analytics results within this file (ie. “Event <x> found in Blob logs.  This usually indicates <y>.  You can find more information at <link>”).
  2. *.csv.  These are the raw data files containing the logs and metrics.  A header line is included in the file to make analysis easy.  The headers correspond to the Logs format and Metrics format.
  3. *.xlsx.  If Excel is installed on the computer running the SDP package then these .xlsx files will be created which include pre-built charts showing the most commonly used metrics along with the option to select additional metrics.

 

Excel charts (*.xlsx)

You can add or remove metrics from the Excel charts using the standard Chart filter tools.

 

Logs (*.csv)

You can easily filter and sort the .CSV files within Excel.  For example, a filter on the server latency column for requests that take longer than X number of milliseconds on the server can help identify potentially inefficient queries.

 

Look for additional blog posts in the future which walk through using the analytics data to identify and solve common issues.

 

Additional Resources

http://msdn.microsoft.com/en-us/library/windowsazure/hh343270.aspx – In depth documentation about storage analytics and what each field means.

http://www.windowsazure.com/en-us/documentation/articles/storage-monitor-storage-account/ – How to enable and use metrics from the Azure Management Portal.

http://channel9.msdn.com/Series/DIY-Windows-Azure-Troubleshooting/Storage-Analytics – A short 5 and a half minute video showing how to enable and use storage analytics.

Windows Azure Diagnostics – Upgrading from Azure SDK 2.4 to Azure SDK 2.5



Overview

Windows Azure SDK 2.4 and prior used Windows Azure Diagnostics 1.0 which provided multiple options for configuring diagnostics collection, including code based configuration using the Microsoft.WindowsAzure.Diagnostics classes.  Windows Azure SDK 2.5 introduces Windows Azure Diagnostics 1.2 which streamlines the configuration and adds enhanced capabilities for post-deployment enablement, however it removes the code based configuration capabilities.  This blog post will describe how to convert your older code based diagnostics configuration to the new WAD 1.2 XML based configuration model.

 

For more information about WAD 1.2 see http://azure.microsoft.com/en-us/documentation/articles/cloud-services-dotnet-diagnostics/.

 

 

Specifying Diagnostics Configuration in XML

In Azure SDK 2.4 it was possible to configure diagnostics through code as well as xml configuration (the diagnostics.wadcfg file) and developers needed to understand the precedence rules of how the diagnostics agent picks up the configuration settings (see here for more information). Azure SDK 2.5 removes this complexity and the diagnostics agent will now always use the xml configuration. This has some implications when you migrate your Azure SDK 2.4 project to Azure SDK 2.5. If your Azure SDK 2.4 project already uses the xml based diagnostics configuration then when you upgrade the project in Visual Studio to target Azure SDK 2.5, Visual Studio will automatically update the xml based configuration to the new format (diagnostics.wadcfgx). If your project continued to use the code based configuration (for example, using the API in Microsoft.WindowsAzure.Diagnostics and Microsoft.WindowsAzure.Diagnostics.Management) then when it is upgraded to SDK 2.5, you will get build warnings which will inform you of the deprecated APIs. The diagnostics data will not be collected unless you configure your diagnostics using the XML file (diagnostics.wadcfgx) or through the Visual Studio diagnostics configuration UI.

 

Let’s look at an example.  In Azure SDK 2.4 you could be using the code below to update the diagnostics configuration for the diagnostics infrastructure logs. In this particular example the code is setting the LogLevel for the diagnostics infrastructure logs to Error and the Transfer Period to 5 minutes. When you migrate this project to SDK 2.5 you will get a build warning around this code with the message that the API is deprecated – “Warning: ‘Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitor’ is obsolete: “This API is deprecated””. The project will still build and deploy but it won’t update the diagnostics configuration.

 

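The screenshot with that code is not reproduced here, but the SDK 2.4 code being described would look roughly like the following sketch (assuming the standard Diagnostics plugin connection string setting name):

// Sketch of WAD 1.0 (Azure SDK 2.4 and earlier) code-based configuration of the
// diagnostics infrastructure logs: Error level, transferred every 5 minutes.
using System;
using Microsoft.WindowsAzure.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        config.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Error;
        config.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

        // This is the call that produces the "obsolete" build warning after upgrading to SDK 2.5
        DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);

        return base.OnStart();
    }
}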

 

To configure the diagnostics infrastructure logs in Azure SDK 2.5 you must remove that code based configuration and update the diagnostics configuration xml file (diagnostics.wadcfgx) associated with your role. Visual Studio provides a configuration UI to let you specify these diagnostics settings. To configure the diagnostics through Visual Studio you can right-click on the Role and select Properties to display the Role Designer. On the Configuration tab make sure Enable Diagnostics is selected and click Configure. This will bring up the Diagnostics configuration UI where you can go to the Infrastructure Logs tab and select Enable transfer of Diagnostics Infrastructure Logs and set the Transfer Period and Log Level appropriately.

 


 

Behind the scenes Visual Studio is updating the diagnostics.wadcfgx file with the appropriate xml to configure the infrastructure logs.  The DiagnosticInfrastructureLogs element is the relevant piece:

<?xml version="1.0" encoding="utf-8"?>
<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <PublicConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <WadCfg>
      <DiagnosticMonitorConfiguration overallQuotaInMB="4096">
        <DiagnosticInfrastructureLogs scheduledTransferPeriod="PT5M" scheduledTransferLogLevelFilter="Error" />
      </DiagnosticMonitorConfiguration>
    </WadCfg>
  </PublicConfig>
  <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <StorageAccount name="" endpoint="" />
  </PrivateConfig>
  <IsEnabled>true</IsEnabled>
</DiagnosticsConfiguration>

 

The diagnostics configuration dialog in Visual Studio lets you configure other settings like Application Logs, Windows Event Logs, Performance Counters, Infrastructure Logs, Log Directories, ETW Logs and Crash Dumps.

Let’s look at another example for Performance Counters. With Azure SDK 2.4 (and prior) you would have to do the following in code:
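Again the screenshot is not reproduced here; a minimal sketch of that SDK 2.4 code-based performance counter configuration (the helper class name is just for illustration) might look like this:

// Sketch of WAD 1.0 code-based performance counter configuration, typically used from
// RoleEntryPoint.OnStart before calling DiagnosticMonitor.Start.
using System;
using Microsoft.WindowsAzure.Diagnostics;

public static class LegacyPerfCounterConfig
{
    public static DiagnosticMonitorConfiguration Build()
    {
        DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

        config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
        {
            CounterSpecifier = @"\Processor(_Total)\% Processor Time",
            SampleRate = TimeSpan.FromMinutes(3)
        });
        config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
        {
            CounterSpecifier = @"\Memory\Available MBytes",
            SampleRate = TimeSpan.FromMinutes(3)
        });
        config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

        return config;
    }
}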

 

With Azure SDK 2.5 you can use the diagnostics configuration UI to enable the default performance counters.

 

For custom performance counters you still have to instrument your code to create the custom performance category and counters as before. Once you have instrumented your code you can enable transferring these custom performance counters to storage by adding them to the list of performance counters in the diagnostics configuration UI. For example, for a custom performance counter you can enter the category and name “\SampleCustomCategory\Total Button1 Clicks” in the text box and click Add.
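
As a rough sketch of that instrumentation (the category and counter names mirror the example above; note that creating a performance counter category requires elevation, so in a real role this would typically run from an elevated startup task or an elevated execution context):

// Illustrative instrumentation for a custom counter named "\SampleCustomCategory\Total Button1 Clicks"
using System.Diagnostics;

public static class Button1ClickCounter
{
    public static void EnsureCategoryExists()
    {
        if (PerformanceCounterCategory.Exists("SampleCustomCategory")) return;

        var counters = new CounterCreationDataCollection
        {
            new CounterCreationData(
                "Total Button1 Clicks",
                "Number of times Button1 has been clicked",
                PerformanceCounterType.NumberOfItems32)
        };

        PerformanceCounterCategory.Create(
            "SampleCustomCategory",
            "Sample custom counters",
            PerformanceCounterCategoryType.SingleInstance,
            counters);
    }

    public static void Increment()
    {
        // readOnly: false opens the counter for writing
        using (var counter = new PerformanceCounter("SampleCustomCategory", "Total Button1 Clicks", readOnly: false))
        {
            counter.Increment();
        }
    }
}

EnsureCategoryExists would be called once at startup, and Increment from wherever the click (or other event) is handled.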

 

Visual Studio will automatically add the performance counters selected in the UI to the diagnostics configuration file diagnostics.wadcfgx:

<?xml version="1.0" encoding="utf-8"?>
<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <PublicConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <WadCfg>
      <DiagnosticMonitorConfiguration overallQuotaInMB="4096">
        <PerformanceCounters scheduledTransferPeriod="PT1M">
          <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT3M" />
          <PerformanceCounterConfiguration counterSpecifier="\Memory\Available MBytes" sampleRate="PT3M" />
        </PerformanceCounters>
      </DiagnosticMonitorConfiguration>
    </WadCfg>
  </PublicConfig>
  <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <StorageAccount name="" endpoint="" />
  </PrivateConfig>
  <IsEnabled>true</IsEnabled>
</DiagnosticsConfiguration>

 

For WindowsEventLog data you can enable transfer of logs for common data sources directly from the diagnostics configuration UI.


 

For custom event logs that you define in your code, for example:

 


EventLog.CreateEventSource("GuestBookSource", "Application");

EventLog.WriteEntry("GuestBookSource", "WebRole Started", EventLogEntryType.Error, 9191);

 

You have to manually add the custom source to the wadcfgx XML file under the WindowsEventLog node. Make sure the name matches the name specified in code:

 


<WindowsEventLog scheduledTransferPeriod="PT1M">
  <DataSource name="Application!*" />
  <DataSource name="GuestBookSource!*" />
</WindowsEventLog>

 

 

Enabling diagnostics extension through PowerShell

 

Since Azure SDK 2.5 uses the extension model, the diagnostics extension, the configuration, and the connection string to the diagnostics storage are no longer part of the deployment package and cscfg. All of the diagnostics configuration is contained within the wadcfgx file. The advantage of this approach is that the diagnostics agent and settings are decoupled from the project and can be dynamically enabled and updated even after your application is deployed.

 

Due to this change some existing workflows need to be rethought – instead of configuring the diagnostics as part of the application that gets deployed to each environment, you can first deploy the application to the environment and then apply the diagnostics configuration for it.  When you publish the application from Visual Studio this process is done automatically for you. However, if you are deploying your application outside of Visual Studio using PowerShell, then you have to install the extension separately through PowerShell.

 

The PowerShell cmdlets for managing the diagnostics extension on a Cloud Service are:

  • Set-AzureServiceDiagnosticsExtension
  • Get-AzureServiceDiagnosticsExtension
  • Remove-AzureServiceDiagnosticsExtension

You can use the Set-AzureServiceDiagnosticsExtension cmdlet to enable the diagnostics extension on a cloud service. One of the parameters on this cmdlet is the XML configuration file. This file is slightly different from the diagnostics.wadcfgx file. You can create this file from scratch as described here or you can modify the wadcfgx file and pass in the modified file as a parameter to the PowerShell cmdlet.

 

To modify the wadcfgx file:

  1. Make a copy of the .wadcfgx file.

  2. Remove the following elements from the copy:

<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <StorageAccount name=" " endpoint="https://core.windows.net/" />
  </PrivateConfig>
  <IsEnabled>false</IsEnabled>
</DiagnosticsConfiguration>

  3. Make sure the top of the file still has the xml version and encoding declaration:

<?xml version="1.0" encoding="utf-8"?>

 

Effectively you are stripping down the wadcfgx to only contain the <PublicConfig> section and the <?xml> header. You can then call the PowerShell cmdlet along with the appropriate parameters for the target slot and role:


# Placeholders - fill in your own values
$storage_name = '<storagename>'
$key = '<key>'
$service_name = '<servicename>'
$public_config = '<thepublicconfigfrom_diagnostics.wadcfgx>'

# Create a storage context for the diagnostics storage account
$storageContext = New-AzureStorageContext -StorageAccountName $storage_name -StorageAccountKey $key

# Enable the diagnostics extension for the Staging slot of WebRole1
Set-AzureServiceDiagnosticsExtension -StorageContext $storageContext -DiagnosticsConfigurationPath $public_config -ServiceName $service_name -Slot 'Staging' -Role 'WebRole1'

 

How to Restrict RDP Access in an Azure PaaS Cloud Service


 

A question I see periodically is how to restrict RDP access for PaaS services to specific network IP addresses.  In the past this has always been difficult to do and the typical solution was to use a Startup task to configure firewall rules (ie. using Set-NetFirewallRule or netsh advfirewall per http://msdn.microsoft.com/en-us/library/azure/jj156208.aspx).  This technique generally works fine, but it introduces the extra complexity of a startup task and is not built into the Azure platform itself.

Network ACLs

With the (relatively) recent introduction of network ACLs it becomes much easier to robustly secure an input endpoint on a cloud service.  My colleague Walter Myers has a great blog post about how to enable network ACLs for PaaS roles at http://blogs.msdn.com/b/walterm/archive/2014/04/22/windows-azure-paas-acls-are-here.aspx.  To apply a network ACL to the RDP endpoint it is simply a matter of defining your ACL rules targeting the role which imports the RemoteForwarder plugin, and specifying the name of the RDP endpoint in the endPoint attribute.

Here is the resulting NetworkConfiguration section to add to the CSCFG file:

  <NetworkConfiguration>
    <AccessControls>
      <AccessControl name="RDPRestrict">
        <Rule action="permit" description="PermitRDP" order="100" remoteSubnet="167.220.26.0/24" />
        <Rule action="deny" description="DenyRDP" order="200" remoteSubnet="0.0.0.0/0" />
      </AccessControl>
    </AccessControls>
    <EndpointAcls>
      <EndpointAcl role="WebRole1" endPoint="Microsoft.WindowsAzure.Plugins.RemoteForwarder.RdpInput" accessControl="RDPRestrict" />
    </EndpointAcls>
  </NetworkConfiguration>

 

 

Important information:

  1. You must enable RDP in the package before publishing your service.  The new model of enabling RDP post-deployment via the management portal or extension APIs will not work.  You can enable RDP in the package using Visual Studio by right-clicking the cloud service project in Solution Explorer and selecting ‘Configure Remote Desktop…’, or in the Publish wizard by checking the ‘Enable Remote Desktop for all roles’ checkbox.
  2. The role="WebRole1" attribute must specify the role which imports the RemoteForwarder plugin.  You can look in the CSDEF file and find the role which has <Import moduleName="RemoteForwarder" />.  If you have multiple roles in your service then all of them will import RemoteAccess, but only one of them will import RemoteForwarder and you must specify the role which imports RemoteForwarder.
  3. The network configuration defined above will restrict all clients except for those with IP addresses in the range 167.220.26.1-167.220.26.255.  See Walter’s blog post for more information about how to specify network ACLs, and the MSDN documentation (http://msdn.microsoft.com/en-us/library/azure/dn376541.aspx) for more information about order and precedence of rules.

Cloud Service RDP Configuration not available via portal


 

 

<Update March 2, 2015>

The ability to modify RDP plugin settings via the portal has been restored.  I will leave this blog post up since the password encryption portion can still be valuable for people setting up RDP outside of the Visual Studio tools.

</Update>

 

 

There are 2 ways to enable RDP for a PaaS cloud service – via the CSPKG using the RemoteAccess plugin, or via the portal (or Powershell or REST API) using an extension.  There was a recent change to the management portal that disabled the ability to configure the plugin style RDP settings after deployment.  The error you get is:

“Failed to launch remote desktop configuration.”

And if you click Details you get:

“Remote Desktop is enabled for this deployment using the RemoteAccess module, which is specified in the ServiceDefinition.csdef file. To allow configuring Remote Desktop using the management portal, remove the RemoteAccess module and update the deployment. Learn More.”


    

The best solution is to remove the RDP plugin from the package, rebuild the package, and redeploy.  This will let you enable RDP via the portal after deployment and manage the configuration in the portal.  This also provides a more reliable RDP experience since this shifts the RDP functionality from a static feature in your CSPKG to an extension that can be managed and upgraded by the Azure guest agent.  To disable RDP in your package you can right-click the cloud service project, select ‘Configure Remote Desktop’ and then uncheck the ‘Enable connections for all roles’ option.


 

However, modifying and then redeploying the package is not always a viable solution.  For the scenarios where you want to update the configuration of an existing service you can use the following steps to manually update the configuration.

 

Download the configuration

Download the configuration and save a local .CSCFG file.  You can do this via the portal or Powershell.

To do this from the portal go to the Configure tab and click Download.

To do this from Powershell run:

$deployment = Get-AzureDeployment -ServiceName <servicename> -Slot <slot>
([xml]$deployment.Configuration).Save("c:\temp\config.cscfg")

 

Update the configuration setting

Open the .CSCFG file you saved, make any necessary changes, and then save the .CSCFG.

*See below for how to modify the password.

 

Upload back to the portal

Upload the modified .CSCFG file back to the portal and wait for the configuration change to complete.

To do this from the portal go to the Configure tab and click Upload.

To do this from Powershell run:

Set-AzureDeployment -Config -ServiceName <servicename> -Configuration "c:\temp\config.cscfg" -Slot <slot>

 

Updating the password

Updating settings such as the expiration date or username is easy, but updating the password is more difficult.  The RDP password is encrypted using a user-provided certificate that is uploaded to the cloud service.  If you are making these RDP changes on the computer that already has that certificate in the cert store (ie. from the dev/deploy machine) then getting a new encrypted password is pretty straightforward:

  1. Open a ‘Windows Azure Command Prompt’
  2. Execute: csencrypt encrypt-password -copytoclipboard -thumbprint <thumbprint_from_CSCFG>
  3. Paste the new password into the AccountEncryptedPassword setting in the .CSCFG

If you don’t have the certificate in the local computer’s cert store you can download it and generate a new encrypted certificate using Powershell:

# Download the cert
$cert = Get-AzureCertificate -ServiceName <servicename> -Thumbprint <thumbprint_from_CSCFG> -ThumbprintAlgorithm "SHA1"

# Save the cert to a file
  "—–BEGIN CERTIFICATE—–" > "C:\Temp\rdpcert.cer"
  $cert.Data >> "C:\Temp\rdpcert.cer"
  "—–END CERTIFICATE—–" >> "C:\Temp\rdpcert.cer"

# Prompt for the new password
  $password = Read-Host -Prompt "Enter the password to encrypt"

#Load the certificate
  [System.Reflection.Assembly]::LoadWithPartialName("System.Security") | Out-Null
  $cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2("c:\temp\rdpcert.cer")
  $thumbprint = $cert[0].thumbprint
  $pass = [Text.Encoding]::UTF8.GetBytes($password)

# Encrypt the cert
  $content = new-object Security.Cryptography.Pkcs.ContentInfo -argumentList (,$pass)
  $env = new-object Security.Cryptography.Pkcs.EnvelopedCms $content
  $env.Encrypt((new-object System.Security.Cryptography.Pkcs.CmsRecipient($cert)))

# Write the new password to a file and load the file
  write-host "Writing encrypted password, cut/paste the text below the line to CSCFG file"
  [Convert]::ToBase64String($env.Encode()) > "c:\temp\encrypted_password.txt"
  Invoke-Item "c:\temp\encrypted_password.txt"

 

Azure Cloud Services only support SHA-1 Thumbprint Algorithm


 

“My certificate provider recently switched to only providing SHA2/SHA256 certificates because SHA-1 certificates are no longer safe.  But Azure only supports SHA1 certificates!  https://msdn.microsoft.com/library/azure/gg465718.aspx says ‘The only thumbprint algorithm currently supported is sha1’”.

 

Lately I have been seeing this issue more often due to some larger cert providers recently making this change.  The entire industry has been deprecating SHA-1 certificates for a while and Chrome has recently started showing warnings in the browser.

 

Signing algorithm vs. Thumbprint algorithm

The issue stems from confusion between the two types of algorithms used by certificates.

  • Signing algorithm.  This is the algorithm used to actually sign the certificate and this is what makes the certificate secure (or in the case of SHA-1, less secure).  The signing algorithm and resulting signature is specified by the certificate authority when it creates the cert and is built into the cert itself.  This algorithm is where SHA1 is being deprecated.  Azure doesn’t know or care what this algorithm is.

  • Thumbprint algorithm.  This algorithm is used to generate a thumbprint in order to uniquely identify and find a certificate.  This algorithm and value is not built into the certificate but is instead calculated whenever a cert lookup is done.  Multiple thumbprints can be generated using different algorithms all from the same certificate data.  The thumbprint has nothing to do with certificate security since it is just used to identify/find the cert within the cert store.  Windows, .NET, and Azure all use SHA1 algorithm for the thumbprint algorithm, and SHA1 is the only algorithm allowed in the ServiceConfiguration.cscfg file:

<Certificates>
  <Certificate name="Certificate1" thumbprint="69BF333452DAA85E462E33B138F3B65842C8B428" thumbprintAlgorithm="sha1" />
</Certificates>

 

Solution

You can use your SHA2/SHA256 signed certificate in Azure, you just have to specify an SHA1 thumbprint.  Your certificate provider should be able to provide you with an SHA1 thumbprint, but it is relatively straightforward to find or calculate the SHA1 thumbprint on your own.  Here are a few options (a minimal .NET sketch follows the list):

  1. The easiest option is to simply open the certificate in the Certificate Manager on any Windows OS.  Windows will display the SHA1 thumbprint in the certificate properties window.
  2. On a Windows OS you can run ‘certutil -store my’ (replace my with whatever store your cert is in).
  3. In Powershell you can call System.Security.Cryptography.SHA1Cng.ComputeHash per http://ig2600.blogspot.com/2010/01/how-do-you-thumbprint-certificate.html
  4. There are a few other options including .NET code, openssl, and Apache Commons Codec at http://stackoverflow.com/questions/1270703/how-to-retrieve-compute-an-x509-certificates-thumbprint-in-java.
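
The .NET sketch mentioned above is about as small as it gets – the X509Certificate2.Thumbprint property returns the SHA-1 thumbprint regardless of the certificate’s signature algorithm, which is exactly the value the .cscfg expects (the file path below is a placeholder):

// Print the SHA-1 thumbprint of a certificate file
using System;
using System.Security.Cryptography.X509Certificates;

class ShowThumbprint
{
    static void Main()
    {
        // Placeholder path - point this at your SHA2-signed .cer or .pfx file
        var cert = new X509Certificate2(@"C:\temp\mycert.cer");

        // The Thumbprint property is the SHA-1 hash of the certificate
        Console.WriteLine(cert.Thumbprint);
    }
}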

 

Thank you to Morgan Simonsen for his excellent blog post Understanding X.509 digital certificate thumbprints which details the different certificate algorithms.

 

SDK 2.5 / WAD 1.2 — IIS Logs Not Transferring to Storage in PaaS WebRoles


After upgrading to Azure SDK 2.5 with Windows Azure Diagnostics 1.2 (see http://blogs.msdn.com/b/kwill/archive/2014/12/02/windows-azure-diagnostics-upgrading-from-azure-sdk-2-4-to-azure-sdk-2-5.aspx) you may notice that IIS logs and failed request (FREB) logs are no longer transferred to storage.

 

Root Cause

When WAD generates the diagnostics configuration it queries the IIS Management Service to find the location of the IIS logs, and by default this location is set to %SystemDrive%\inetpub\logs\LogFiles.  In a PaaS WebRole IISConfigurator will configure IIS according to your service definition, and part of this setup changes the IIS log file location to C:\Resources\directory\{deploymentid.rolename}.DiagnosticStore\LogFiles\Web.  The WAD configuration happens prior to IISConfigurator running which means WAD is watching the wrong folder for IIS logs.

 

Workaround

To work around this issue you have to restart the WAD diagnostics agent after IISConfigurator has setup IIS.  When the WAD diagnostics agent starts up again it will query the IIS Management Service for the IIS log file location and will get the correct C:\Resources\directory\{deploymentid.rolename}.DiagnosticStore\LogFiles\Web location.

The two ways to restart the diagnostics agent are:

  1. Reboot the VM.  This can be done from the portal or from an RDP session with the VM.
  2. Update the WAD configuration, which will cause the diagnostics agent to refresh its configuration.  This can be done from Visual Studio (Server Explorer –> Cloud Services –> Right-click a role –> Update Diagnostics –> Make any change and update) or from Powershell (see this post).

One problem with these two options is that you have to manually do this for each role/VM in your service after deploying.  The bigger problem is that any operation which recreates the Windows (D: drive) partition will also reset the IIS log file location to the default %SystemDrive% location which will cause the diagnostics agent to again get the wrong location.  This will happen to all instances roughly once per month for Guest OS updates, or randomly to single instances due to service healing (see this and this for more info).

 

Resolution

The WAD dev team is working to fix this issue with the next Azure SDK release.  In the meantime you can add the following code to your WebRole.OnStart method in order to automatically reboot the VM once during initial startup.

public override bool OnStart()
{
    // For information on handling configuration changes
    // see the MSDN topic at http://go.microsoft.com/fwlink/?LinkId=166357.

    // Write a RebootFlag.txt file to the %RoleRoot%\Approot\bin folder to track if this VM has rebooted to fix WAD 1.3 IIS logs issue
    string path = "RebootFlag.txt";

    // If RebootFlag.txt already exists then skip rebooting the VM
    if (!System.IO.File.Exists(path))
    {
        System.IO.File.WriteAllText(path, "Writing RebootFlag at " + DateTime.Now.ToString("O"));
        System.Diagnostics.Trace.WriteLine("Rebooting");
        System.Diagnostics.Process.Start("shutdown", "/r /t 0");
    }

    return base.OnStart();
}

 

Note that this code uses a file on the %RoleRoot% drive as a flag, so this will also cause an additional reboot in extra scenarios such as portal reboots and in-place upgrades (see this post), but these scenarios are rare enough that it should not cause an issue in your service.  If you wish to avoid these extra reboots you can set the role runtime execution context to elevated by adding <Runtime executionContext="elevated" /> to the .csdef file and then write either a flag file to the %SystemRoot% drive or a flag to the registry.


Authenticating Storage Requests Using SharedKeyAuthenticationHandler


 

With the older version of the storage client library (version 1.7) you could sign HttpWebRequests using the SignRequestLite function, and there were several examples on the web of how to do this.  The SignRequestLite function has been removed from the 2.0+ versions of the storage client library and so far I have not seen any examples of using the new signing functions.

 

The new SCL uses Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler to sign the request, and Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyTableCanonicalizer (for table storage requests) or Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer (for blob and queue storage requests) to create the canonicalizer.  In the simplest form, this is the code (using the blob canonicalizer) to sign an httpWebRequest object:

Microsoft.WindowsAzure.Storage.Auth.StorageCredentials credentials = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(storageAccountName, storageAccountKey);
Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler auth;
Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer canonicalizer = Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer.Instance;
auth = new Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler(canonicalizer, credentials, storageAccountName);
auth.SignRequest(httpWebRequest, null);

 

For a more full featured example (you can largely ignore the MakeRequest function since it is just a generic HTTP request/response handler):


using System;
using System.Net;
using System.Text;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Auth.Protocol;

private HttpWebRequest SignRequest(HttpWebRequest request, bool isTable = false)
{
    // Create a StorageCredentials object with account name and key
    string storageAccountName = "kwillstorage2";
    string storageAccountKey = "S4RUcCvGKBPoFvhyJ0p6Wu0ciJnPTn5+b5MgU5olWhqGABAfvFhMFCbOBSeDgL9VF27TFrYzCQnRHYkbgJgxxg==";
    Microsoft.WindowsAzure.Storage.Auth.StorageCredentials credentials = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(storageAccountName, storageAccountKey);

    // Create the SharedKeyAuthenticationHandler which is used to sign the HttpWebRequest
    Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler auth;
    if (isTable)
    {
        // Tables use SharedKeyTableCanonicalizer along with storage credentials and storage account name
        Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyTableCanonicalizer canonicalizertable = Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyTableCanonicalizer.Instance;
        auth = new Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler(canonicalizertable, credentials, storageAccountName);
        if (request.Headers["MaxDataServiceVersion"] == null)
        {
            request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx");
        }
    }
    else
    {
        // Blobs and Queues use SharedKeyCanonicalizer
        Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer canonicalizer = Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer.Instance;
        auth = new Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler(canonicalizer, credentials, storageAccountName);
    }

    // Sign the request which will add the Authorization header
    auth.SignRequest(request, null);
    return request;
}

private void button1_Click(object sender, EventArgs e)
{
    HttpWebRequest requestBlob, requestTable;

    requestBlob = (HttpWebRequest)HttpWebRequest.Create("https://kwillstorage2.blob.core.windows.net/vsdeploy?restype=container&comp=list");
    requestBlob.Method = "GET";
    requestBlob.Headers.Add("x-ms-version", "2014-02-14");
    SignRequest(requestBlob);
    MakeRequest(requestBlob);

    requestTable = (HttpWebRequest)HttpWebRequest.Create("https://kwillstorage2.table.core.windows.net/testtable");
    requestTable.Method = "GET";
    requestTable.Headers.Add("x-ms-version", "2014-02-14");
    SignRequest(requestTable, true);
    MakeRequest(requestTable);
}

private void MakeRequest(HttpWebRequest request)
{
    HttpWebResponse response = null;
    System.IO.Stream receiveStream;
    System.IO.StreamReader readStream;
    Encoding encode;
    try
    {
        response = (HttpWebResponse)request.GetResponse();
    }
    catch (WebException ex)
    {
        Console.WriteLine(request.Headers.ToString());
        Console.WriteLine(ex.Message + Environment.NewLine + Environment.NewLine + ex.Response.Headers.ToString());
        try
        {
            receiveStream = ex.Response.GetResponseStream();
            encode = System.Text.Encoding.GetEncoding("utf-8");
            // Pipes the stream to a higher level stream reader with the required encoding format.
            readStream = new System.IO.StreamReader(receiveStream, encode);
            Console.WriteLine(readStream.ReadToEnd());

            // Releases the resources of the error response and the stream.
            // (response is null here because GetResponse threw, so close ex.Response instead)
            ex.Response.Close();
            readStream.Close();
        }
        catch
        {
        }
        return;
    }

    Console.WriteLine(request.Method + " " + request.RequestUri + " " + request.ProtocolVersion + Environment.NewLine + Environment.NewLine);
    Console.WriteLine(request.Headers.ToString());
    Console.WriteLine((int)response.StatusCode + " - " + response.StatusDescription + Environment.NewLine + Environment.NewLine);
    Console.WriteLine(response.Headers + Environment.NewLine + Environment.NewLine);

    receiveStream = response.GetResponseStream();
    encode = System.Text.Encoding.GetEncoding("utf-8");
    // Pipes the stream to a higher level stream reader with the required encoding format.
    readStream = new System.IO.StreamReader(receiveStream, encode);
    Console.WriteLine(readStream.ReadToEnd());

    // Releases the resources of the response.
    response.Close();
    // Releases the resources of the Stream.
    readStream.Close();
}

PaaS Cloud Service Role Restart Scenarios


There are several reasons why a PaaS cloud service role instance would restart or recycle.  This list may not be exhaustive, but I believe it covers all of the scenarios as of today; a short sketch of the in-code triggers follows the list.  Some day when I have free time I will try to edit this blog post to include details about how to detect each of these restarts.

The items at the bottom marked with [VM] will cause a reboot or reimage of the VM itself, while the other items will leave the VM running but cause a restart of the role processes.  For more information about the role processes, see https://blogs.msdn.microsoft.com/kwill/2011/05/05/windows-azure-role-architecture/.

 

  • A new version of the Azure Guest Agent
  • In place upgrade of the cloud service
  • Calling RoleEnvironment.RequestRecycle from within the VM
  • Stop/Start management operation
  • Setting e.Cancel=true in the RoleEnvironment_Changing event
  • The role host process exiting the RoleEntryPoint_Run method
  • The role host process crashing
  • Host or Guest OS Updates [VM]
  • Reboot or Reimage management operation [VM]
  • A hardware issue on the physical server causing service healing and node movement [VM]
  • A communication failure between the Guest Agent and the Host Agent that lasts longer than 10 minutes [VM]
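
For the two in-code triggers above (calling RoleEnvironment.RequestRecycle and cancelling the Changing event), a minimal sketch is below.  This is illustrative only and not taken from a specific service; the trigger condition in Run() is a hypothetical placeholder you would replace with your own logic.

using System;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Setting e.Cancel = true in the Changing event tells the fabric to restart
        // the role instance rather than apply the change while it is running.
        RoleEnvironment.Changing += (sender, e) => { e.Cancel = true; };
        return base.OnStart();
    }

    public override void Run()
    {
        // Any code running in the VM can also ask the fabric to recycle the instance.
        // The condition below is a hypothetical placeholder.
        if (Environment.GetEnvironmentVariable("FORCE_RECYCLE") == "1")
        {
            RoleEnvironment.RequestRecycle();
        }

        // Note that simply returning from Run() will also cause the role to recycle.
        base.Run();
    }
}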

Windows Azure Role Architecture


*Update Aug 20, 2013: The guest agent process used to be WaAppAgent, and any updates to this guest agent would come at the same time as a Guest OS update.  This has been changed to use 2 guest agents – WaAppAgent and WindowsAzureGuestAgent.  WindowsAzureGuestAgent takes on all of the work that WaAppAgent used to do, and WaAppAgent is now responsible for installing, configuring, and updating WindowsAzureGuestAgent.  This decouples the guest agent from the Guest OS itself and allows updates to the guest agent to be out of band with Guest OS updates.  This is reflected in the chart below by the C2 process.

 

One of the more common questions I get is about the architecture of an Azure role and which processes are responsible for the various steps in getting an Azure role instance up and running, how they interact, etc.  This blog post will attempt to give a good overview of the major steps that happen when deploying a service, and the core processes that run on an Azure VM.

 

AzureRoleArchitecture

 

Process Information

A.      RDFE / FFE is the communication path from the user to the fabric.  RDFE (RedDog Front End) is the publicly exposed API which is the front end to the Management Portal and the Service Management API (ie. Visual Studio, Azure MMC, etc).  All requests from the user go through RDFE.  FFE (Fabric Front End) is the layer which translates requests from RDFE into the fabric commands.  All requests from RDFE go through the FFE to reach the fabric controllers.

B.      Fabric controller is responsible for maintaining and monitoring all of the resources in the data center.  It communicates with the host agents on the Host OS, sending information such as the Guest OS version, service package, service configuration, service state, etc.

C.      Host Agent lives on the Host OS and is responsible for setting up the Guest OS and communicating with the Guest Agent (WindowsAzureGuestAgent) in order to drive the role towards its goal state and to do heartbeat checks with the guest agent.  If the host agent does not receive a heartbeat response for 10 minutes, it will restart the guest OS.

C2.     WaAppAgent is responsible for installing, configuring, and updating WindowsAzureGuestAgent.exe.

D.      WindowsAzureGuestAgent is responsible for:

a.       Configuring the guest OS including firewall, ACLs, LocalStorage resources, service package and configuration, certificates.

b.      Setting up the SID for the user account which the role will run under.

c.       Communicating role status to the fabric.

d.      Starting WaHostBootstrapper and monitoring it to ensure role is in goal state.

E.       WaHostBootstrapper is responsible for:

a.       Reading the role configuration and starting up all of the appropriate tasks and processes to configure and run the role.

b.      Monitoring all of its child processes.

c.       Raising the StatusCheck event on the role host process.

F.       IISConfigurator runs when the role is configured as a Full IIS web role (it will not run for SDK 1.2 HWC roles).  It is responsible for:

a.       Starting the standard IIS services

b.      Configuring rewrite module in web config

c.       Setting up the AppPool for the <Sites> configured in the service model.

d.      Setting up IIS logging to point to the DiagnosticStore LocalStorage folder

e.      Configuring permissions and ACLs

f.       The website resides in %roleroot%:\sitesroot\0 and the apppool is pointed to this location to run IIS.

G.     Startup tasks are defined by the role model and started by WaHostBootstrapper.  Startup tasks can be configured to run in the Background asynchronously and the host bootstrapper will start the startup task and then continue on to other startup tasks.  Startup tasks can also be configured to run in Simple (default) mode where the host bootstrapper will wait for the startup task to finish running and return with a success (0) exit code before continuing on to the next startup task.

H.      These tasks are part of the SDK and defined as plugins in the role’s service definition (.csdef).  When expanded into startup tasks the DiagnosticsAgent and RemoteAccessAgent are unique in that they define 2 startup tasks each, one regular and one with a /blockStartup parameter.  The normal startup task is defined as a Background startup task so that it can run in the background while the role itself is running.  The /blockStartup startup task is defined as a Simple startup task so that WaHostBootstrapper will wait for it to exit before continuing.  The /blockStartup task simply waits for the regular task to finish initializing and then it will exit and allow the host bootstrapper to continue.  The reason this is done is so that diagnostics and RDP access can be configured prior to the role processes starting up (this is done via the /blockStartup task), and that diagnostics and RDP access can continue running after the host bootstrapper has finished with startup tasks (this is done via the normal task).

I.        WaWorkerHost is the standard host process for normal worker roles.  This host process will host all of the role’s DLLs and entry point code such as OnStart and Run.

J.        WaWebHost is the standard host process for web roles when they are configured to use the SDK 1.2 compatible Hostable Web Core (HWC).  Roles can enable the HWC mode by removing the <Sites> element from the service definition (.csdef).  In this mode all of the service’s code and DLLs run from the WaWebHost process.  IIS (w3wp) is not used and there are no AppPools configured in IIS Manager because IIS is hosted inside of WaWebHost.exe.

K.      WaIISHost is the host process for role entry point code for web roles using Full IIS.  This process will load the first DLL found which implements the RoleEntryPoint class (this DLL is defined in E:\__entrypoint.txt) and execute the code from this class (OnStart, Run, OnStop).  Any RoleEnvironment events (ie. StatusCheck, Changed, etc) created in the RoleEntryPoint class will be raised in this process.

L.       W3WP is the standard IIS worker process which will be used when the role is configured to use Full IIS.  This will run the AppPool configured from IISConfigurator.  Any RoleEnvironment events (ie. StatusCheck, Changed, etc) created here will be raised in this process.  Note that RoleEnvironment events will fire in both locations (WaIISHost and w3wp.exe) if you subscribe to events from both processes.
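
To make the last two points concrete, here is a minimal sketch of a role entry point that subscribes to the RoleEnvironment events.  This is illustrative only; which process the handlers run in (WaIISHost, WaWorkerHost, or w3wp) depends on which process loads the code, as described above.

using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Raised periodically by WaHostBootstrapper through the role host process.
        RoleEnvironment.StatusCheck += (sender, e) =>
        {
            // Calling e.SetBusy() here would take the instance out of the load balancer rotation.
        };

        // Raised after a configuration or topology change has been applied.
        RoleEnvironment.Changed += (sender, e) =>
        {
            foreach (var change in e.Changes)
            {
                // Inspect each RoleEnvironmentChange (for example, a setting change).
            }
        };

        return base.OnStart();
    }
}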

 

Workflow steps

1.       Customer makes a request such as uploading a .cspkg and .cscfg, telling a role to stop, doing a configuration change, etc.  This can be done through the Azure Management portal or via a tool that uses the Service Management API such as Visual Studio’s Publish feature.  This request goes to RDFE which does all of the subscription related work and then communicates the request to FFE.  The rest of these workflow steps will assume the process of deploying a new package and starting it.

2.       FFE finds the correct machine pool (based on customer input such as affinity group or geo location, and input from fabric such as machine availability) and communicates with the master fabric controller in that machine pool.

3.       The fabric controller finds a host with available CPU cores (or spins up a new host).  The service package and configuration is copied to the host and the fabric controller communicates with the host agent on the host OS to deploy the package (configure DIPs, ports, guest OS, etc).

4.       The host agent starts the Guest OS and communicates with the guest agent (WindowsAzureGuestAgent).  The host sends heartbeats to the guest to make sure that the role is working towards its goal state.

5.       WindowsAzureGuestAgent sets up the guest OS (firewall, ACLs, LocalStorage, etc), copies a new XML configuration file to c:\Config, then starts the WaHostBootstrapper process.

6.       For Full IIS web roles, WaHostBootstrapper starts IISConfigurator and tells it to delete any existing AppPools for the webrole from IIS.

7.       WaHostBootstrapper reads the <Startup> tasks from E:\RoleModel.xml and begins executing startup tasks.  WaHostBootstrapper will wait until all Simple startup tasks have finished and returned a success.

8.       For Full IIS web roles, WaHostBootstrapper tells IISConfigurator to configure the IIS AppPool and points the site to E:\Sitesroot\<index> where <index>  is a 0 based index into the number of <Sites> elements defined for the service.

9.       WaHostBootstrapper will start the host process depending on the role type:

a.       Worker Role: WaWorkerHost.exe is started.  WaHostBootstrapper executes the OnStart() method, and once it returns it starts to execute the Run() method and simultaneously marks the role as Ready and puts it into the load balancer rotation (if InputEndpoints are defined).  WaHostBootstrapper will then go into a loop of checking the role status.

b.      SDK 1.2 HWC Web Role: WaWebHost is started.  WaHostBootstrapper executes the OnStart() method, and once it returns it starts to execute the Run() method and simultaneously marks the role as Ready and puts it into the load balancer rotation.  WaWebHost issues a warmup request (GET /do.__rd_runtime_init__).  All web requests are sent to WaWebHost.exe.  WaHostBootstrapper will then go into a loop of checking the role status.

c.       Full IIS Web Role: WaIISHost is started.  WaHostBootstrapper executes the OnStart() method, and once it returns it starts to execute the Run() method and simultaneously marks the role as Ready and puts it into the load balancer rotation.  WaHostBootstrapper will then go into a loop of checking the role status.

10.   Incoming web requests to a Full IIS web role will trigger IIS to start the W3WP process and serve the request just like it would in an on-prem IIS environment.

 

 

 

Log File locations

 

*Update August 9, 2013: A new blog post at http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx has updated Log File information and locations.

 

 

WindowsAzureGuestAgent

  • C:\Logs\AppAgentRuntime.Log.  This log contains changes to the service including starts, stops, and new configurations.  If the service does not change then there is expected to be large gaps of time in this log file.
  • C:\Logs\WaAppAgent.Log.  This log contains status updates and heartbeat notifications and is updated every 2-3 seconds.  This log will contain a historic view of the status of the instance and will tell you when the instance was not in the Ready state.

 

WaHostBootstrapper

C:\Resources\Directory\<deploymentID>.<role>.DiagnosticStore\WaHostBootstrapper.log

 

WaWebHost

C:\Resources\Directory\<guid>.<role>\WaWebHost.log

 

WaIISHost

C:\Resources\Directory\<deploymentID>.<role>\WaIISHost.log

 

IISConfigurator

C:\Resources\Directory\<deploymentID>.<role>\IISConfigurator.log

 

IIS Logs

C:\Resources\Directory\<guid>.<role>.DiagnosticStore\LogFiles\W3SVC1

 

Windows Event Logs

D:\Windows\System32\Winevt\Logs

 

 

Debugging Windows Azure – Resources on the VM


During the course of debugging Windows Azure applications on a daily basis I have accumulated several tips, tricks, and debugging techniques that are specific to the Azure platform.  I plan to create a series of blog posts to share a lot of the information that my team uses internally to troubleshoot Windows Azure.  The first one will start off with some of the resources that you can find once you RDP into an Azure VM.

Debugging Windows Azure Series

  1. Resources on the VM
  2. Getting Tools onto the VM, Part 1

 

 

Resources on the VM

One of the keys to troubleshooting an issue is understanding the overall environment and what resources and log files are available.  The C: drive contains config info and local storage, the D: drive contains Windows and WaAppAgent logs, and the E: drive contains the customer’s role code. 

 

C: – log files, LocalResource folders, etc

  • You can use the C: drive for any local temporary storage while debugging (ie. WinDBG, tools, dump files, log files, etc).  I usually create C:\Scratch to load all of my troubleshooting tools.
  • C:\Config – Service configuration information
    • <DeploymentID>.<RoleName>.<index>.xml
      This is the main configuration information for your role.  If you update your service configuration (either updating the config via the management portal or doing an in-place upgrade) there will be multiple copies of the XML file, where the <index> is a counter incremented by one for each new config version.  You can find the following information in this file:
      • Deployment Name
      • Service Name
      • Certificates
      • Host process
      • cscfg settings and values
      • LocalStorage folders
      • InputEndpoints, InternalEndpoints and port mappings
    • <GUID>.ccf
      • OS Version
      • Cluster name
      • Deployment ID
      • VM Size (look for <ProcessorCount>)
      • ipconfig info
  • C:\Resources – LocalStorage, log files, AspNetTemp
    • Directory\<GUID>.<RoleName>.DiagnosticStore – IIS Logs, Failed Request (freb) logs
    • Directory\<guid>.<role>.DiagnosticStore\Monitor\Tables – All diagnostics collected by the Azure diagnostics monitor (event logs, role logs, WAD diagnostics, etc).
    • Temp\<GUID>.<RoleName>\RoleTemp – Host process logs
    • Directory\<GUID>.<RoleName>.<LocalStorage name> – Folder for a <LocalStorage> as defined in the service model (.csdef).
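
If you want to find one of these LocalStorage folders from code rather than by browsing the C: drive, the LocalResource API maps the <LocalStorage> name to its on-disk path.  A minimal sketch, assuming a hypothetical LocalStorage resource named "MyLocalStorage" is defined in the .csdef:

using Microsoft.WindowsAzure.ServiceRuntime;

public static class LocalStorageHelper
{
    public static string GetMyLocalStoragePath()
    {
        // "MyLocalStorage" is a hypothetical <LocalStorage> name from the service definition (.csdef).
        LocalResource storage = RoleEnvironment.GetLocalResource("MyLocalStorage");

        // RootPath resolves to C:\Resources\Directory\<GUID>.<RoleName>.MyLocalStorage
        return storage.RootPath;
    }
}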

 

 

D: – Windows (%SystemDrive%, %SystemRoot%, etc), agent logs

  • D:\Packages\GuestAgent – agent logs
    • WaAppAgent.exe runs from this location
    • WaAppAgent.exe.log – Log file for agent showing instance startup details.
    • WaAppAgent.<index>.log – Historic log files

 

E: – Virtual drive containing service package

This drive is created when the guest agent deploys the service package.  Note that this is dynamically created and may end up being the F: drive, G: drive, etc.

  • _entrypoint.txt – DLL containing the entry point for the host process to call.
  • RoleModel.xml – XML file including <Startup> tasks and <Sites> configuration elements.  Also includes msbuild information and target .NET framework version.
  • <GUID>.csman – PackageManifest contains a list of all files in the package (copy/paste to Visual Studio to get readable format)
  • \Base\x64 – The host process that runs the service’s code.
  • \Approot – The customer’s code, aspx pages, DLLs, etc.
  • \Sitesroot\<index> – This is the folder where the full IIS site is run from.  When the role is deployed, the files from \Approot are copied to this \Sitesroot folder and then IIS will be configured to run the site from \Sitesroot.  Any temporary test changes to the role (web.config changes, new DLLs, etc) should be placed here.

Debugging Windows Azure–Getting Tools onto the VM, Part 1


Debugging Windows Azure Series

  1. Resources on the VM
  2. Getting Tools onto the VM, Part 1

In the second part of the Debugging Windows Azure series we will look at the various ways you can get your troubleshooting tools onto the VM.

  1. The simple and obvious way is to do a Copy from your desktop machine and then Paste in the Azure VM.  This uses rdpclip.exe to do standard Copy/Paste between your machine and the RDP machine.  This works really well if your files are small and you have them handy on your local machine.  The biggest problem with rdpclip is that transferring files can be very slow.  It also replaces whatever was on your clipboard, just like doing a Copy/Paste on your local machine.  Transferring a 12 MB file took about 1.5 minutes.
  2. You can also do a pseudo-UNC file share by using \\tsclient\<drive> from within the Azure VM.  The file transfers are still slow (the same 12 MB file took about 1.5 minutes), but it does allow easy file transfer back and forth, especially if you are working with multiple files.  To enable this functionality you first have to turn on local resource sharing in mstsc.  To do this you will click the Connect button on the Management Portal to connect to your Azure VM, but instead of doing an Open on the file, do a Save-As and save the .rdp file onto your local machine.  Browse to this file and select Edit.  This will open the mstsc client and allow you to configure the settings used to RDP into the machine.  In the RDP client go to the Local Resources tab, select More under Local devices and resources, then expand Drives and select any drive that you want to share.  Then within the VM you can go to Start –> Run and enter \\tsclient\<drive>.  In the below example, I would enter \\tsclient\c (note there is no ‘$’ after the c).

    image

    image

  3. You can also download tools using Internet Explorer.  The benefit of this method is that the transfers are very fast (the 12 MB file took about 10 seconds), but you do have to know where to find all of your tools on the internet.  You will also have to add the download site to your list of trusted sites.  Using the Sysinternals Suite as an example you would browse to http://technet.microsoft.com/en-us/sysinternals/bb842062 and click on ‘Download Sysinternals Suite’, then click ‘Add’ on the Trusted Sites dialog.

    image

  4. If you have a set of frequently used tools (for me it is Debugging Tools for Windows a.k.a WinDBG, Sysinternals Suite, and Netmon) you can put those tools into Blob Storage and then access the files from blob storage while on the VM.  The benefit here is that file transfers are extremely fast since the transfers are all in the same datacenter.  The drawback is that you have to get an Azure storage explorer tool (see list here) onto the VM using one of the above methods and then configure it with your account and key before you can download from blob storage.  Stay tuned for a future blog post about a custom lightweight tool which provides access to these blobs along with several other useful features when you are RDPed onto an Azure VM.  Alternatively you could store your tools in public blobs and then just browse to the URL, but you would have to make sure the license of the tool allows you to do this.
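
As a lightweight alternative to installing a storage explorer, a few lines of code against the 1.x storage client can pull a tool down from blob storage once you are on the VM.  This is only a sketch; the account name, key, container, and blob name are placeholders.

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

class ToolDownloader
{
    static void Main()
    {
        // Placeholder account name and key - substitute your own storage account.
        CloudStorageAccount account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=ACCOUNTNAME;AccountKey=ACCOUNTKEY");

        CloudBlobClient client = account.CreateCloudBlobClient();
        CloudBlobContainer container = client.GetContainerReference("tools");

        // Download a hypothetical tools archive to the local scratch folder.
        CloudBlob blob = container.GetBlobReference("SysinternalsSuite.zip");
        blob.DownloadToFile(@"C:\Scratch\SysinternalsSuite.zip");
    }
}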

Asynchronous Parallel Block Blob Transfers with Progress Change Notification


UPDATE March 6, 2013:

This code has been significantly updated and the new version can be found at http://blogs.msdn.com/b/kwill/archive/2013/03/06/asynchronous-parallel-block-blob-transfers-with-progress-change-notification-2-0.aspx. 

———————————————————-

 

 

 

Have you ever wanted to asynchronously upload or download blobs from a client application and be able to show progress change notifications to the end user?  I started off using the System.Net.WebClient class and calling UploadFileAsync or DownloadFileAsync and subscribing to the ProgressChanged event handlers.  This works fairly well for most scenarios, but there are a couple problems with this approach:  UploadFileAsync will throw an OutOfMemoryException on large files due to the fact that it tries to read the entire file into a buffer prior to sending it, and the WebClient transfers are slow compared to taking advantage of transferring a blob using multiple parallel blocks.

You could also use the CloudBlockBlob class and call BeginUploadFromStream or BeginDownloadFromStream, but you don’t get progress change notifications or parallel block uploads.  You could wrap the storage client stream in a ProgressStream to get the progress change notifications, but then you still don’t get the speed of parallel block uploads.  If you don’t need the extra speed of the parallel block uploads, but you still want progress change notifications and an easy programming model I would suggest checking out the code from http://appfabriccat.com/2011/02/exploring-windows-azure-storage-apis-by-building-a-storage-explorer-application/.

To get all of the features I wanted I ended up writing my own BlobTransfer class which gives me the following benefits:

  • Fast uploads and downloads by using parallel block blob transfers.  Check out http://azurescope.cloudapp.net/Default.aspx for some of the performance benefits.
  • Asynchronous programming model
  • Progress change notifications

 

BlobTransfer.cs

using System;
using System.Text;
using System.ComponentModel;
using System.Windows.Forms;
 
using System.Collections.Generic;
using System.Threading;
using System.Runtime.Remoting.Messaging;
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
using Microsoft.WindowsAzure.StorageClient.Protocol;
using System.IO;
using System.Net;
using System.Security.Cryptography;
using System.Linq;
 
namespace BlobTransferUI
{
    class BlobTransfer
    {
        // Async events and properties
        public event AsyncCompletedEventHandler TransferCompleted;
        public event EventHandler<BlobTransferProgressChangedEventArgs> TransferProgressChanged;
        private delegate void BlobTransferWorkerDelegate(MyAsyncContext asyncContext, out bool cancelled, AsyncOperation async);
        private bool TaskIsRunning = false;
        private MyAsyncContext TaskContext = null;
        private readonly object _sync = new object();
 
        // Used to calculate download speeds
        Queue<long> timeQueue = new Queue<long>(100);
        Queue<long> bytesQueue = new Queue<long>(100);
        DateTime updateTime = System.DateTime.Now;
 
        // BlobTransfer properties
        private string m_FileName;
        private CloudBlockBlob m_Blob;
 
        public TransferTypeEnum TransferType;
 
        public void UploadBlobAsync(CloudBlob blob, string LocalFile)
        {   
            TransferType = TransferTypeEnum.Upload;
            //attempt to open the file first so that we throw an exception before getting into the async work
            using (FileStream fs = new FileStream(LocalFile, FileMode.Open, FileAccess.Read)) { }
 
            m_Blob = blob.ToBlockBlob;
            m_FileName = LocalFile;
 
            BlobTransferWorkerDelegate worker = new BlobTransferWorkerDelegate(UploadBlobWorker);
            AsyncCallback completedCallback = new AsyncCallback(TaskCompletedCallback);
 
            lock (_sync)
            {
                if (TaskIsRunning)
                    throw new InvalidOperationException("The control is currently busy.");
 
                AsyncOperation async = AsyncOperationManager.CreateOperation(null);
                MyAsyncContext context = new MyAsyncContext();
                bool cancelled;
 
                worker.BeginInvoke(context, out cancelled, async, completedCallback, async);
 
                TaskIsRunning = true;
                TaskContext = context;
            }
        }
 
        public void DownloadBlobAsync(CloudBlob blob, string LocalFile)
        {
            TransferType = TransferTypeEnum.Download;
            m_Blob = blob.ToBlockBlob;
            m_FileName = LocalFile;
 
 
            BlobTransferWorkerDelegate worker = new BlobTransferWorkerDelegate(DownloadBlobWorker);
            AsyncCallback completedCallback = new AsyncCallback(TaskCompletedCallback);
 
            lock (_sync)
            {
                if (TaskIsRunning)
                    throw new InvalidOperationException("The control is currently busy.");
 
                AsyncOperation async = AsyncOperationManager.CreateOperation(null);
                MyAsyncContext context = new MyAsyncContext();
                bool cancelled;
 
                worker.BeginInvoke(context, out cancelled, async, completedCallback, async);
 
                TaskIsRunning = true;
                TaskContext = context;
            }
        }
 
        public bool IsBusy
        {
            get { return TaskIsRunning; }
        }
 
        public void CancelAsync()
        {
            lock (_sync)
            {
                if (TaskContext != null)
                    TaskContext.Cancel();
            }
        }
 
        private void UploadBlobWorker(MyAsyncContext asyncContext, out bool cancelled, AsyncOperation async)
        {
            cancelled = false;
 
            ParallelUploadFile(asyncContext, async);
 
            // check for Cancelling
            if (asyncContext.IsCancelling)
            {
                cancelled = true;
            }
 
        }
 
        private void DownloadBlobWorker(MyAsyncContext asyncContext, out bool cancelled, AsyncOperation async)
        {
            cancelled = false;
 
            ParallelDownloadFile(asyncContext, async);
 
            // check for Cancelling
            if (asyncContext.IsCancelling)
            {
                cancelled = true;
            }
 
        }
 
        private void TaskCompletedCallback(IAsyncResult ar)
        {
            // get the original worker delegate and the AsyncOperation instance
            BlobTransferWorkerDelegate worker = (BlobTransferWorkerDelegate)((AsyncResult)ar).AsyncDelegate;
            AsyncOperation async = (AsyncOperation)ar.AsyncState;
 
            bool cancelled;
 
            // finish the asynchronous operation
            worker.EndInvoke(out cancelled, ar);
 
            // clear the running task flag
            lock (_sync)
            {
                TaskIsRunning = false;
                TaskContext = null;
            }
 
            // raise the completed event
            AsyncCompletedEventArgs completedArgs = new AsyncCompletedEventArgs(null, cancelled, null);
            async.PostOperationCompleted(delegate(object e) { OnTaskCompleted((AsyncCompletedEventArgs)e); }, completedArgs);
        }
 
        protected virtual void OnTaskCompleted(AsyncCompletedEventArgs e)
        {
            if (TransferCompleted != null)
                TransferCompleted(this, e);
        }
 
        private double CalculateSpeed(long BytesSent)
        {
            double speed = 0;
 
            if (timeQueue.Count == 80)
            {
                timeQueue.Dequeue();
                bytesQueue.Dequeue();
            }
 
            timeQueue.Enqueue(System.DateTime.Now.Ticks);
            bytesQueue.Enqueue(BytesSent);
 
            if (timeQueue.Count > 2)
            {
                updateTime = System.DateTime.Now;
                speed = (bytesQueue.Max() - bytesQueue.Min()) / TimeSpan.FromTicks(timeQueue.Max() - timeQueue.Min()).TotalSeconds;
            }
 
            return speed;
        }
 
        protected virtual void OnTaskProgressChanged(BlobTransferProgressChangedEventArgs e)
        {
            if (TransferProgressChanged != null)
                TransferProgressChanged(this, e);
        }
 
        // Blob Upload Code
        // 200 GB max blob size
        // 50,000 max blocks
        // 4 MB max block size
        // Try to get close to 100k block size in order to offer good progress update response.
        private int GetBlockSize(long fileSize)
        {
            const long KB = 1024;
            const long MB = 1024 * KB;
            const long GB = 1024 * MB;
            const long MAXBLOCKS = 50000;
            const long MAXBLOBSIZE = 200 * GB;
            const long MAXBLOCKSIZE = 4 * MB;
 
            long blocksize = 100 * KB;
            //long blocksize = 4 * MB;
            long blockCount;
            blockCount = ((int)Math.Floor((double)(fileSize / blocksize))) + 1;
            while (blockCount > MAXBLOCKS - 1)
            {
                blocksize += 100 * KB;
                blockCount = ((int)Math.Floor((double)(fileSize / blocksize))) + 1;
            }
 
            if (blocksize > MAXBLOCKSIZE)
            {
                throw new ArgumentException("Blob too big to upload.");
            }
 
            return (int)blocksize;
        }
 
        private void ParallelUploadFile(MyAsyncContext asyncContext, AsyncOperation asyncOp)
        {
            BlobTransferProgressChangedEventArgs eArgs = null;
            object AsyncUpdateLock = new object();
 
            // stats from azurescope show 10 to be an optimal number of transfer threads
            int numThreads = 10;
            var file = new FileInfo(m_FileName);
            long fileSize = file.Length;
 
            int maxBlockSize = GetBlockSize(fileSize);
            long bytesUploaded = 0;
            int blockLength = 0;
 
            // Prepare a queue of blocks to be uploaded. Each queue item is a key-value pair where
            // the 'key' is block id and 'value' is the block length.
            Queue<KeyValuePair<int, int>> queue = new Queue<KeyValuePair<int, int>>();
            List<string> blockList = new List<string>();
            int blockId = 0;
            while (fileSize > 0)
            {
                blockLength = (int)Math.Min(maxBlockSize, fileSize);
                string blockIdString = Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}", blockId.ToString("0000000"))));
                KeyValuePair<int, int> kvp = new KeyValuePair<int, int>(blockId++, blockLength);
                queue.Enqueue(kvp);
                blockList.Add(blockIdString);
                fileSize -= blockLength;
            }
 
            m_Blob.DeleteIfExists();
 
            BlobRequestOptions options = new BlobRequestOptions()
            {
                RetryPolicy = RetryPolicies.RetryExponential(RetryPolicies.DefaultClientRetryCount, RetryPolicies.DefaultMaxBackoff),
                Timeout = TimeSpan.FromSeconds(90)
            };
 
            // Launch threads to upload blocks.
            List<Thread> threads = new List<Thread>();
 
            for (int idxThread = 0; idxThread < numThreads; idxThread++)
            {
                Thread t = new Thread(new ThreadStart(() =>
                {
                    KeyValuePair<int, int> blockIdAndLength;
 
                    using (FileStream fs = new FileStream(file.FullName, FileMode.Open, FileAccess.Read))
                    {
                        while (true)
                        {
                            // Dequeue block details.
                            lock (queue)
                            {
                                if (asyncContext.IsCancelling)
                                    break;
 
                                if (queue.Count == 0)
                                    break;
 
                                blockIdAndLength = queue.Dequeue();
                            }
 
                            byte[] buff = new byte[blockIdAndLength.Value];
                            BinaryReader br = new BinaryReader(fs);
 
                            // move the file system reader to the proper position
                            fs.Seek(blockIdAndLength.Key * (long)maxBlockSize, SeekOrigin.Begin);
                            br.Read(buff, 0, blockIdAndLength.Value);
 
                            // Upload block.
                            string blockName = Convert.ToBase64String(BitConverter.GetBytes(
                                blockIdAndLength.Key));
                            using (MemoryStream ms = new MemoryStream(buff, 0, blockIdAndLength.Value))
                            {
                                string blockIdString = Convert.ToBase64String(ASCIIEncoding.ASCII.GetBytes(string.Format("BlockId{0}", blockIdAndLength.Key.ToString("0000000"))));
                                string blockHash = GetMD5HashFromStream(buff);
                                m_Blob.PutBlock(blockIdString, ms, blockHash, options);
                            }
 
                            lock (AsyncUpdateLock)
                            {
                                bytesUploaded += blockIdAndLength.Value;
 
                                int progress = (int)((double)bytesUploaded / file.Length * 100);
 
                                // raise the progress changed event
                                eArgs = new BlobTransferProgressChangedEventArgs(bytesUploaded, file.Length, progress, CalculateSpeed(bytesUploaded), null);
                                asyncOp.Post(delegate(object e) { OnTaskProgressChanged((BlobTransferProgressChangedEventArgs)e); }, eArgs);
                            }
                        }
                    }
                }));
                t.Start();
                threads.Add(t);
            }
 
            // Wait for all threads to complete uploading data.
            foreach (Thread t in threads)
            {
                t.Join();
            }
 
            if (!asyncContext.IsCancelling)
            {
                // Commit the blocklist.
                m_Blob.PutBlockList(blockList, options);
            }
 
        }
 
        /// <summary>
        /// Downloads content from a blob using multiple threads.
        /// </summary>
        /// <param name="blob">Blob to download content from.</param>
        /// <param name="numThreads">Number of threads to use.</param>
        private void ParallelDownloadFile(MyAsyncContext asyncContext, AsyncOperation asyncOp)
        {
            BlobTransferProgressChangedEventArgs eArgs = null;
 
            int numThreads = 10;
            m_Blob.FetchAttributes();
            long blobLength = m_Blob.Properties.Length;
 
            int bufferLength = GetBlockSize(blobLength);  // 4 * 1024 * 1024;
            long bytesDownloaded = 0;
 
            // Prepare a queue of chunks to be downloaded. Each queue item is a key-value pair 
            // where the 'key' is start offset in the blob and 'value' is the chunk length.
            Queue<KeyValuePair<long, int>> queue = new Queue<KeyValuePair<long, int>>();
            long offset = 0;
            while (blobLength > 0)
            {
                int chunkLength = (int)Math.Min(bufferLength, blobLength);
                queue.Enqueue(new KeyValuePair<long, int>(offset, chunkLength));
                offset += chunkLength;
                blobLength -= chunkLength;
            }
 
            int exceptionCount = 0;
 
            FileStream fs = new FileStream(m_FileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Read);
 
            using (fs)
            {
                // Launch threads to download chunks.
                List<Thread> threads = new List<Thread>();
                for (int idxThread = 0; idxThread < numThreads; idxThread++)
                {
                    Thread t = new Thread(new ThreadStart(() =>
                    {
                        KeyValuePair<long, int> blockIdAndLength;
 
                        // A buffer to fill per read request.
                        byte[] buffer = new byte[bufferLength];
 
                        while (true)
                        {
                            if (asyncContext.IsCancelling)
                                return;
 
                            // Dequeue block details.
                            lock (queue)
                            {
                                if (queue.Count == 0)
                                    break;
 
                                blockIdAndLength = queue.Dequeue();
                            }
 
                            try
                            {
                                // Prepare the HttpWebRequest to download data from the chunk.
                                HttpWebRequest blobGetRequest = BlobRequest.Get(m_Blob.Uri, 60, null, null);
 
                                // Add header to specify the range
                                blobGetRequest.Headers.Add("x-ms-range", string.Format(System.Globalization.CultureInfo.InvariantCulture, "bytes={0}-{1}", blockIdAndLength.Key, blockIdAndLength.Key + blockIdAndLength.Value - 1));
 
                                // Sign request.
                                StorageCredentials credentials = m_Blob.ServiceClient.Credentials;
                                credentials.SignRequest(blobGetRequest);
 
                                // Read chunk.
                                using (HttpWebResponse response = blobGetRequest.GetResponse() as
                                    HttpWebResponse)
                                {
                                    using (Stream stream = response.GetResponseStream())
                                    {
                                        int offsetInChunk = 0;
                                        int remaining = blockIdAndLength.Value;
                                        while (remaining > 0)
                                        {
                                            int read = stream.Read(buffer, offsetInChunk, remaining);
                                            lock (fs)
                                            {
                                                fs.Position = blockIdAndLength.Key + offsetInChunk;
                                                fs.Write(buffer, offsetInChunk, read);
                                            }
                                            offsetInChunk += read;
                                            remaining -= read;
                                            Interlocked.Add(ref bytesDownloaded, read);
                                        }
 
                                        int progress = (int)((double)bytesDownloaded / m_Blob.Attributes.Properties.Length * 100);
 
                                        // raise the progress changed event
                                        eArgs = new BlobTransferProgressChangedEventArgs(bytesDownloaded, m_Blob.Attributes.Properties.Length, progress, CalculateSpeed(bytesDownloaded), null);
                                        asyncOp.Post(delegate(object e) { OnTaskProgressChanged((BlobTransferProgressChangedEventArgs)e); }, eArgs);
                                    }
                                }
                            }
                            catch (Exception ex)
                            {
                                // Add block back to queue
                                queue.Enqueue(blockIdAndLength);
 
                                exceptionCount++;
                                // If we have had more than 100 exceptions then break
                                if (exceptionCount == 100)
                                {
                                    throw new Exception("Received 100 exceptions while downloading. Cancelling download. " + ex.ToString());
                                }
                                if (exceptionCount >= 100)
                                {
                                    break;
                                }
                            }
                        }
                    }));
                    t.Start();
                    threads.Add(t);
                }
 
 
                // Wait for all threads to complete downloading data.
                foreach (Thread t in threads)
                {
                    t.Join();
                }
            }
        }
 
        private string GetMD5HashFromStream(byte[] data)
        {
            MD5 md5 = new MD5CryptoServiceProvider();
            byte[] blockHash = md5.ComputeHash(data);
            return Convert.ToBase64String(blockHash, 0, 16);
        }
 
        internal class MyAsyncContext
        {
            private readonly object _sync = new object();
            private bool _isCancelling = false;
 
            public bool IsCancelling
            {
                get
                {
                    lock (_sync) { return _isCancelling; }
                }
            }
 
            public void Cancel()
            {
                lock (_sync) { _isCancelling = true; }
            }
        }
 
 
        public class BlobTransferProgressChangedEventArgs : ProgressChangedEventArgs
        {
            private long m_BytesSent = 0;
            private long m_TotalBytesToSend = 0;
            private double m_Speed = 0;
 
            public long BytesSent
            {
                get { return m_BytesSent; }
            }
 
            public long TotalBytesToSend
            {
                get { return m_TotalBytesToSend; }
            }
 
            public double Speed
            {
                get { return m_Speed; }
            }
 
            public TimeSpan TimeRemaining
            {
                get
                {
                    TimeSpan time = new TimeSpan(0, 0, (int)((TotalBytesToSend - m_BytesSent) / (m_Speed == 0 ? 1 : m_Speed)));
                    return time;
                }
            }
 
            public BlobTransferProgressChangedEventArgs(long BytesSent, long TotalBytesToSend, int progressPercentage, double Speed, object userState)
                : base(progressPercentage, userState)
            {
                m_BytesSent = BytesSent;
                m_TotalBytesToSend = TotalBytesToSend;
                m_Speed = Speed;
            }
        }
    }
 
    public enum TransferTypeEnum
    {
        Download,
        Upload
    }
}

 

Simple Console Client

Calling the upload or download method from BlobTransfer is a pretty simple matter of obtaining a CloudBlob reference to the blob of interest, subscribing to the TransferProgressChanged and TransferCompleted eventargs, and then calling UploadBlobAsync or DownloadBlobAsync.  The following console app shows a simple example.

  1. Create a new console application
  2. Add a reference to System.Web (you will need to change the project’s Target Framework property to .NET Framework 4 instead of .NET Framework 4 Client Profile) and Microsoft.WindowsAzure.StorageClient.
  3. Add BlobTransfer.cs
  4. Add the following code to Program.CS and change the const members to valid values.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
 
using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;
 
namespace ConsoleApplication1
{
    class Program
    {
        const string ACCOUNTNAME = "ENTER ACCOUNT NAME";
        const string ACCOUNTKEY = "ENTER ACCOUNT KEY";
        const string LOCALFILE = @"ENTER LOCAL FILE";
        const string CONTAINER = "temp";
 
        private static CloudStorageAccount AccountFileTransfer;
        private static CloudBlobClient BlobClientFileTransfer;
        private static CloudBlobContainer ContainerFileTransfer;
 
        private static bool Transferring;
 
        static void Main(string[] args)
        {
            System.Net.ServicePointManager.DefaultConnectionLimit = 35;
 
            AccountFileTransfer = CloudStorageAccount.Parse("DefaultEndpointsProtocol=http;AccountName=" + ACCOUNTNAME + ";AccountKey=" + ACCOUNTKEY);
            if (AccountFileTransfer != null)
            {
                BlobClientFileTransfer = AccountFileTransfer.CreateCloudBlobClient();
                ContainerFileTransfer = BlobClientFileTransfer.GetContainerReference(CONTAINER);
                ContainerFileTransfer.CreateIfNotExist();
            }
 
            // Upload the file
            CloudBlob blobUpload = ContainerFileTransfer.GetBlobReference(CONTAINER + "/" + System.IO.Path.GetFileName(LOCALFILE));
            BlobTransfer transferUpload = new BlobTransfer();
            transferUpload.TransferProgressChanged += new EventHandler<BlobTransfer.BlobTransferProgressChangedEventArgs>(transfer_TransferProgressChanged);
            transferUpload.TransferCompleted += new System.ComponentModel.AsyncCompletedEventHandler(transfer_TransferCompleted);
            transferUpload.UploadBlobAsync(blobUpload, LOCALFILE);
 
            Transferring = true;
            while (Transferring)
            {
                Console.ReadLine();
            }
 
            // Download the file
            CloudBlob blobDownload = ContainerFileTransfer.GetBlobReference(CONTAINER + "/" + System.IO.Path.GetFileName(LOCALFILE));
            BlobTransfer transferDownload = new BlobTransfer();
            transferDownload.TransferProgressChanged += new EventHandler<BlobTransfer.BlobTransferProgressChangedEventArgs>(transfer_TransferProgressChanged);
            transferDownload.TransferCompleted += new System.ComponentModel.AsyncCompletedEventHandler(transfer_TransferCompleted);
            transferDownload.DownloadBlobAsync(blobDownload, LOCALFILE + ".copy");
 
            Transferring = true;
            while (Transferring)
            {
                Console.ReadLine();
            }
        }
 
        static void transfer_TransferCompleted(object sender, System.ComponentModel.AsyncCompletedEventArgs e)
        {
            Transferring = false;
            Console.WriteLine("Transfer completed. Press any key to continue.");
        }
 
        static void transfer_TransferProgressChanged(object sender, BlobTransfer.BlobTransferProgressChangedEventArgs e)
        {
            Console.WriteLine("Transfer progress percentage = " + e.ProgressPercentage + " - " + (e.Speed / 1024).ToString("N2") + "KB/s");
        }
    }
}

 

UI Client

For a more full-featured sample, check out the BlobTransferUI project for a simple UI showing progress bars for multiple simultaneous uploads and downloads.

image

 

Edit June 19, 2011

  • Fixed some issues in BlobTransfer to change some int’s into long’s in order to fix a bug when transferring a >2GB file.  Thanks to Jeff Baxter with our HPC team for helping me discover the issue.

How to increase the size of the Windows Azure Web Role ASP.NET Temporary Folder


By default the ASP.NET temporary folder size in a Windows Azure web role is limited to 100 MB.  This is sufficient for the vast majority of applications, but some applications may require more storage space for temporary files.  In particular this will happen for very large applications which generate a lot of dynamically generated code, or applications which use controls that make use of the temporary folder such as the standard FileUpload control.  If you are encountering the problem of running out of temporary folder space you will get error messages such as OutOfMemoryException or ‘There is not enough space on the disk.’. 

In order to change the size of the temp folder you need to create a new folder by defining a LocalStorage resource in your service definition file, then on role startup modify the website configuration so that the system.web/compilation tempDirectory property points to this folder.

  1. Create a new cloud service project and add a web role.
  2. In the ServiceDefinition.csdef create 2 LocalStorage resources in the Web Role, and set the Runtime executionContext to elevated.  The elevated executionContext allows us to use the ServerManager class to modify the IIS configuration during role startup.  One LocalStorage resource will be for the AspNetTemp folder and one will be used to store the file uploaded by the user.
    <WebRole name="WebRole1" >
      <Runtime executionContext="elevated" />
      <Sites>
        <Site name="Web">
          <Bindings>
            <Binding name="Endpoint1" endpointName="Endpoint1" />
          </Bindings>
        </Site>
      </Sites>
      <Endpoints>
        <InputEndpoint name="Endpoint1" protocol="http" port="80" />
      </Endpoints>
      <LocalResources>
        <LocalStorage name="AspNetTemp1GB" sizeInMB="1000" />
        <LocalStorage name="FileUploadFolder" sizeInMB="1000" />
      </LocalResources>
    </WebRole>
  3. Add a FileUpload control and an Upload button to Default.aspx.
    <asp:FileUpload ID="FileUpload1" runat="server" />
    <asp:Button ID="Button1" Text="Upload" runat="server" OnClick="Button1_OnClick" />
  4. In Default.aspx.cs add the code for the Upload button’s OnClick event handler.  This code will simply store the file uploaded into the FileUploadFolder LocalStorage resource.
    protected void Button1_OnClick(object sender, EventArgs e)
    {
        Microsoft.WindowsAzure.ServiceRuntime.LocalResource FileUploadFolder = Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetLocalResource("FileUploadFolder");
        FileUpload1.SaveAs(System.IO.Path.Combine(FileUploadFolder.RootPath, FileUpload1.FileName));
    }
  5. Add a reference to %System32%\inetsrv\Microsoft.Web.Administration.dll and set CopyLocal=True.
  6. Add the following code to the OnStart routine in WebRole.cs.  This code configures the Website to point to the AspNetTemp1GB LocalStorage resource.
    public override bool OnStart()
    {
        // Get the location of the AspNetTemp1GB resource
        Microsoft.WindowsAzure.ServiceRuntime.LocalResource aspNetTempFolder = Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetLocalResource("AspNetTemp1GB");
     
        //Instantiate the IIS ServerManager
        ServerManager iisManager = new ServerManager();
        // Get the website.  IIS names the site <RoleInstanceId>_<SiteName>, and "Web" is the site name from the ServiceDefinition.csdef, so make sure you change this code if you change the site name in the .csdef
        Application app = iisManager.Sites[RoleEnvironment.CurrentRoleInstance.Id + "_Web"].Applications[0];
        // Get the web.config for the site
        Configuration webHostConfig = app.GetWebConfiguration();
        // Get a reference to the system.web/compilation element
        ConfigurationSection compilationConfiguration = webHostConfig.GetSection("system.web/compilation");
        // Set the tempDirectory property to the AspNetTemp1GB folder
        compilationConfiguration.Attributes["tempDirectory"].Value = aspNetTempFolder.RootPath;
        // Commit the changes
        iisManager.CommitChanges();
     
        return base.OnStart();
    }

  7. Modify the web.config to increase the size of the allowed requests so that you can upload large files.  Edit the configuration/system.webServer/security/requestFiltering/requestLimits element to increase the maxAllowedContentLength, and the configuration/system.web/httpRuntime element to increase the maxRequestLength.
    <system.webServer>
      <security>
        <requestFiltering>
          <!-- maxAllowedContentLength = 1 GB -->
          <requestLimits maxAllowedContentLength="1073741824" />
        </requestFiltering>
      </security>
    </system.webServer>
    <!-- maxRequestLength = 1 GB -->
    <httpRuntime maxRequestLength="1048576" />
  8. Deploy your service and you are now able to upload a file larger than 100 MB.

For more information about the ASP.NET Temporary Folder see http://msdn.microsoft.com/en-us/magazine/cc163496.aspx.


Role Instance Restarts Due to OS Upgrades


Update March 7, 2013

Added to the Q&A section — Q: How long will the upgrade take?  How long will my VM be down?

Update October 17, 2014

Added information about Guest Agent updates.  Thanks to my colleague Anurag Sharma for this idea.

 

————

 

Roughly once per month Microsoft releases a new Guest OS version for Windows Azure PaaS VMs.  The exact schedule varies and the historic trend can be seen at http://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx.  During this rollout the Windows Azure Fabric Controller will do two passes through all of the datacenters.  There is also a periodic update of the Azure guest agent that runs inside of your VM.

  1. Host OS.  The first pass will upgrade the Host OS.  The Host OS upgrade reboots every instance on that host, and the fabric controller ensures that only instances from one upgrade domain at a time will be rebooted.  During this reboot, your role instances will go through the standard shutdown process and your role entry point’s OnStop will be called to give you a chance to gracefully shut down the instance (see the sketch after this list).  The Host OS update can take several days for the fabric to coordinate the upgrades across all of the different hosted services and upgrade domains within a datacenter.  It is not uncommon for different instances of your deployment to be updated several hours apart from each other.
  2. Guest OS.  Once the Host OS has finished upgrading across the datacenter then the Guest OS will be upgraded for services which are configured to use automatic Guest OS versions and this upgrade will proceed using standard upgrade domain rules for your service.  Your VM will be rebooted and the Windows Partition (the D drive) will be reimaged with the upgraded OS.  The Guest OS update process is much faster than the Host OS update since the fabric only has to coordinate the update within your hosted service and your upgrade domains.  The duration of the Guest OS update process for your service will largely depend on how many instances you have, how many upgrade domains you have, and how long your service takes to shut down (Stopping/OnStop events) and start up (startup tasks and OnStart event).
  3. Guest Agent.  The Azure guest agent is updated on a roughly monthly basis.  When the guest agent is updated the host process running your role (typically WaWorkerHost or WaWebHost) will be gracefully shutdown, then the guest agent will update itself, then the host process will start again.  See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the guest agent process and how it interacts with your service.
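
Since the instance goes through the normal shutdown path during these updates, the OnStop override in your role entry point is the natural place to drain in-flight work, as mentioned in item 1 above.  A minimal sketch, assuming you keep your own counter of in-flight work (the names and the four-minute bound below are placeholders, not platform limits):

using System;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    // Hypothetical counter that the Run() loop increments and decrements around each work item.
    private static int _inFlightRequests;

    public override void OnStop()
    {
        // The instance is about to be rebooted (Host OS), reimaged (Guest OS), or stopped,
        // so give in-flight work a bounded amount of time to drain.
        TimeSpan waited = TimeSpan.Zero;
        while (Interlocked.CompareExchange(ref _inFlightRequests, 0, 0) > 0 &&
               waited < TimeSpan.FromMinutes(4))
        {
            Thread.Sleep(TimeSpan.FromSeconds(5));
            waited += TimeSpan.FromSeconds(5);
        }

        base.OnStop();
    }
}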

 

Mark Russinovich has a great blog post which describes the Host OS upgrade process – http://blogs.technet.com/b/markrussinovich/archive/2012/08/22/3515679.aspx.

Note that this article is focused on PaaS scenarios, but the Host OS update process applies to IaaS Persistent VMs as well.  For more information about IaaS VM restarts see http://blogs.msdn.com/b/windows_azure_technical_support_wats_team/archive/2013/11/27/windows-azure-iaas-host-os-update-demystified.aspx.

 

Impact to Your Service

  • As long as each of your roles has 2 or more instances then your service will not experience downtime due to the adherence to upgrade domains.  The blog at http://blog.toddysm.com/2010/04/upgrade-domains-and-fault-domains-in-windows-azure.html has a great explanation of upgrade and fault domains, and why having 2 instances of a role is required to meet the 99.95% uptime SLA.
  • Approximately every month, expect your instances to reboot once for the Host OS update.  If you have automatic guest OS updates, expect your instances to reboot again.  These reboots are typically several hours apart, but this time frame can change depending on the makeup of different services within a datacenter.
  • Your role needs to adhere to the rules around host OS updates, in particular instances should reach the Ready state within 15 minutes of starting the Startup tasks.  For more information about this limitation see http://msdn.microsoft.com/en-us/library/hh543978.
  • Your role instances should be able to handle a Reboot, a Reimage, and a Recycle.  The Host OS upgrade will cause a Reboot of your instance, and the Guest OS upgrade will cause the equivalent of a Reimage of your instance.  See the common issues below for more information.

 

Common Issues

See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the processes which are running and the location of log files which can be used to troubleshoot.

 

  1. The most common problem is roles not reaching the Ready state after the OS upgrades.  The most common root cause for this problem is a startup task or code in the OnStart or Run function not running correctly.  There are 2 common categories of this root cause:
    1. A failure of the code to run twice due to the Host OS reboot which will cause your startup tasks to run again.  If you are doing something in a Startup task and executing a command which returns an error when run twice (ie. ‘appcmd set config’ to add a section will fail when run twice with the error “New add object missing required attributes. Cannot add duplicate collection entry of type…”) then your startup task will fail and cause your role instance to begin recycling.  To troubleshoot this type of failure, RDP into the VM and look in the Event Logs for errors, and look in the WaHostBootstrapper.log for Startup task failures.  During your normal development and testing process you should proactively initiate a Reboot of your role instances from the Windows Azure Management portal in order to test your service and make sure that it works correctly in this scenario.  A common fix for startup task failures is to add an ‘exit /b 0’ to the end of your startup task.  See http://msdn.microsoft.com/en-us/library/windowsazure/hh124132.aspx for more information on why this is needed.
    2. A failure of the code to run after the Windows partition is reimaged.  During the Guest OS portion of the update, the Windows Partition is reimaged.  The Windows Partition is typically where program installations and registry changes are stored, and during the reimage those changes will be lost.  If the startup code assumes that the change exists (ie. if the startup task makes a registry change and then stores a record of that change on the C: or E: drive so that the code isn’t run twice) then the role instance may fail to work properly.  During your normal development and testing process you should proactively initiate a Reimage of your role instances from the Windows Azure Management portal in order to test your service and make sure that it works correctly in this scenario.
  2. If your startup code takes longer than 15 minutes to complete then you may have multiple role instances taken out of service at the same time.  This is most common when a startup task installs a program or feature, downloads cache data, or downloads website information.  See the Host OS update rules in the ‘Impact to Your Service’ section above for more information about this.
  3. Occasionally the Windows Azure Platform will fail to restart the host or guest OS after an update.  Overall this is a rare scenario and the platform is constantly improving to eliminate these types of failures.  If you are in this scenario then your symptoms will typically be a ‘Waiting for Host’ message in the portal that does not change after at least 15 minutes, and the inability to RDP into the role instance.  In this scenario there is little you can do short of deleting the deployment to recover this instance.  If you open a support incident (http://www.windowsazure.com/en-us/support/contact/) the support team can manually recover that instance.  Note: If you are able to RDP into the role instance then the problem is almost always due to a failure in the startup code as described in common issue #1 above.
  4. During the OS upgrades one or more of your instances will be unavailable at any given time which will cause reduced capacity for your service.  For example, you have 2 instances of a web role and both instances typically run at 75% CPU.  During the OS upgrade one instance will be rebooted during the upgrade which means all traffic will be directed to the remaining instance which will exceed the capacity for that instance and your service availability will be impacted.  You should ensure that your service has sufficient excess capacity to absorb X% of the instances being unavailable, where X is 1/<number of upgrade domains> (ie. for 2 upgrade domains you will lose 50% of your capacity, and for 5 upgrade domains you will lose 20% of your capacity).
  5. If your website takes several minutes to warmup (either standard IIS/ASP.NET warmup of precompilation and module loading, or warming up a cache or other app specific tasks) then your clients may experience an outage or random timeouts.  After a role instance restarts and your OnStart code completes then your role instance will be put back in the load balancer rotation and will begin receiving incoming requests.  If your website is still warming up then all of those incoming requests will queue up and time out.  If you only have 2 instances of your web role then IN_0, which is still warming up, will be taking 100% of the incoming requests while IN_1 is being restarted for the Guest OS update.  This can lead to a complete outage of your service until your website is finished warming up on both instances.  It is recommended to keep your instance in OnStart, which will keep it in the Busy state where it won’t receive incoming requests from the load balancer, until your warmup is complete.  You can use the following code to accomplish this:

using System.Net;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // For information on handling configuration changes
        // see the MSDN topic at http://go.microsoft.com/fwlink/?LinkId=166357.

        // Find the instance's IPv4 address so the warmup request hits this instance
        // directly rather than going back through the load balancer.
        IPHostEntry ipEntry = Dns.GetHostEntry(Dns.GetHostName());
        string ip = null;
        foreach (IPAddress ipaddress in ipEntry.AddressList)
        {
            if (ipaddress.AddressFamily.ToString() == "InterNetwork")
            {
                ip = ipaddress.ToString();
            }
        }

        // The blocking GetResponse call keeps the instance in OnStart (Busy state,
        // out of the load balancer rotation) until the local site has finished warming up.
        string urlToPing = "http://" + ip;
        HttpWebRequest req = (HttpWebRequest)WebRequest.Create(urlToPing);
        using (WebResponse resp = req.GetResponse()) { }

        return base.OnStart();
    }
}

 

Detection and Notification

Notification

At this time the Windows Azure platform does not offer proactive notifications when an OS upgrade is happening.  The Windows Azure development team is working on this functionality so that service administrators can better plan for upgrades and possible service impact.  Your role instances will receive a RoleEnvironment.Stopping event prior to being shut down and you can use that event to gracefully terminate any work that the role instance is doing or notify an administrator that an instance is shutting down.
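As a minimal sketch (assuming a role project that already references Microsoft.WindowsAzure.ServiceRuntime and has a Trace listener configured), you could subscribe to the Stopping event in OnStart and use it to flush logs or alert an administrator:

using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // Raised by the Azure runtime before the instance is shut down,
        // for example during a Host OS or Guest OS update.
        RoleEnvironment.Stopping += (sender, e) =>
        {
            Trace.TraceWarning("Instance {0} is stopping.",
                RoleEnvironment.CurrentRoleInstance.Id);
            // Finish or hand off in-flight work, or notify an administrator, here.
        };

        return base.OnStart();
    }
}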

In the meantime you can subscribe to the Windows Azure OS Updates RSS feed at http://sxp.microsoft.com/feeds/3.0/msdntn/WindowsAzureOSUpdates.  This feed should be updated the same day that the OS updates start being rolled out to the datacenter.  This typically does not give advance proactive notification, but it does help identify when the updates are happening.  As noted above in the Host OS and Guest OS description the update process can take several days to complete, so it may be one or more days between when the RSS feed is updated and your hosted service begins updating.

The Guest OS list at http://msdn.microsoft.com/en-us/library/windowsazure/ee924680.aspx and the OS version selection dropdown in the management portal are typically updated after the Guest OS rollout has completed so you should not use the latest entry in these lists as an indication of when the OS updates are in progress.

 

Detection

At this time there is no direct way to detect a Host OS upgrade, but you can see the evidence of the reboot within the logs on the VM:

    • Search System event logs for event source USER32, event ID 1074, with message “The process D:\Packages\GuestAgent\WaAppAgent.exe (RD00155D50206D) has initiated the shutdown of computer RD00155D50206D on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown”.  This indicates that the Windows Azure fabric’s guest agent (WaAppAgent.exe) initiated a shutdown of the VM.
    • Look in the AppAgentRuntime.log.old files for a message saying “Context_Start” with a Context=”StopContainer()”

 

Frequently Asked Questions

  • Q: How can I opt out of the OS updates?

A: You cannot opt out of the Host OS updates because Microsoft must maintain updated and patched host OSes within the datacenter.  You can opt out of the Guest OS update by specifying a version of the Guest OS, but note that your service will no longer receive security patches and may be vulnerable.  See http://msdn.microsoft.com/en-us/library/windowsazure/ff729422.aspx.

 

  • Q: How do I force the reboots to be done only during non-business hours?

A:  There is no way to control when an individual instance or service will be upgraded for the Host OS.  The upgrade is started on all Azure datacenters across the world at approximately the same time, and the fabric works continuously on upgrading each datacenter.  This process takes several days due to the complexity of making sure upgrade domain rules are followed for all cloud services, and there is no way to control or determine when a specific instance will be impacted.  To control the Guest OS update you can specify a fixed Guest OS version and then update it whenever you are ready.

 

  • Q: I installed something on the VM and now the VM has rebooted and the software I installed is gone, why?

A: Connecting to an Azure PaaS VM via RDP and making changes or installing software is unsupported.  At any point in time the VM may be completely rebuilt and any changes you make will be lost.  This can happen if the hardware fails and we have to startup a new VM on new hardware.  This will also happen during the Guest OS update when the Windows Partition is rebuilt.  If you need to install software or make changes to the VM you must create a startup task and do the work from there.  This ensures that when the VM is recreated that your configuration will be executed again.

 

  • Q: Can one of the updates in the new Guest OS version break my service?

A: The updates that are installed onto the new guest OS version are publicly available and thoroughly tested hotfixes which are also being deployed to servers around the world via Windows Update, and the chance of negative impact to your service is extremely small.  However, the root of the question goes back to how you manage OS patches in your on-premise services – do you install directly on the production servers and assume it will work, or do you have a staging environment where you test the patches first?  You will follow the same pattern in Azure.  If you want to have a staging environment to test patches prior to production then you should configure your production service to use a fixed version OS string in the .cscfg file.  Then when a new guest OS is available you can deploy your service into the staging slot using the newest guest OS version.  After you have validated that the service works correctly on the latest guest OS you can then either do a VIP swap, or do an in-place upgrade of your production service to use the latest OS.

 

  • Q: How long will the upgrade take?  How long will my VM be down?

A: There is a common misconception that the more patches being applied, the longer the update will take.  This is based on the belief that the upgrade works similar to how a Windows Update upgrade happens on your local desktop machine, where a batch of patches is copied to Windows and installed with subsequent reboots, but this is not how upgrading works in Azure.  When a new OS version is being released in Azure, the OS team will take the latest image, apply the patches, and then save a new VHD with this new base image.  This base image is then copied to a repository in Azure.  When the fabric is instructed to do an OS upgrade it will first make a copy pass where it copies this new base image VHD to the hard disks on each server in the datacenter that is going to be upgraded.  Once this copy process is finished the fabric will begin the upgrade process, following the normal upgrade domain rules.  When a guest is going to be updated the fabric will do a graceful shutdown of the OS and then start a new VM using the new base image.  The time it takes to upgrade a given VM for a Guest OS is roughly the time it takes to do a graceful Windows shutdown plus the time it takes to start Windows.

The timing for a Host OS update is a little different.  When a Host is being upgraded it first sends the shutdown message to each Guest OS running on that Host.  Each Guest OS is then given the standard OnStop and Windows Shutdown time to finish shutting down.  Once every Guest OS is shut down, the Host OS does a graceful shutdown and goes through its normal shutdown procedure.  Once the Host OS is shut down, the Host is rebooted using the new OS image.  Once the Host is up and running it will start each of the Guest OSes.  Typically this Host OS update process will take 15 to 20 minutes, but it can vary depending on how many other Guests are on that Host and how long they take.  Having said that, there will always be exceptions if there is a failure on a particular node and the Azure fabric determines that the Guests on that node need to be moved to a different node.

 

 

  • Q: How do I gracefully handle the OS shutdown?

A: When the OS is being updated the Azure Fabric will perform a graceful shutdown of your role instance.  This means that your ASP.NET code will receive the Application_End event, and the Azure service runtime will raise the Stopping and OnStop events.  Your code will have 5 minutes to finish cleanup work in OnStop before the process is shut down.  After your Azure host process is shut down then Windows will go through a normal graceful shutdown including raising the standard OnStop and related events for Windows Services.  For more information about gracefully handling a shut down of your instance see http://blogs.msdn.com/b/windowsazure/archive/2013/01/14/the-right-way-to-handle-azure-onstop-events.aspx, http://msdn.microsoft.com/en-us/library/hh180152.aspx and http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.windowsazure.serviceruntime.roleentrypoint.onstop.aspx.
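For illustration, a minimal OnStop sketch that drains in-flight work within that window might look like the following (the in-flight counter is a hypothetical placeholder for your own bookkeeping):

using System;
using System.Diagnostics;
using System.Threading;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    // Hypothetical counter tracking work items currently being processed.
    private static int inFlightWorkItems = 0;

    public override void OnStop()
    {
        // Wait for in-flight work to finish, but stay well inside the 5 minute limit.
        DateTime deadline = DateTime.UtcNow.AddMinutes(4);
        while (inFlightWorkItems > 0 && DateTime.UtcNow < deadline)
        {
            Thread.Sleep(1000);
        }

        Trace.TraceInformation("OnStop completed; instance shutting down gracefully.");
        base.OnStop();
    }
}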

 

Windows Azure Disk Partition Preservation


 

 

One of the regular questions I get asked is what happens to my disks and drive letters in Windows Azure VMs when <x> happens.  The following chart will outline the different scenarios that affect your VM instances, and what happens to the different disks.  Note that this information is for stateless PaaS VMs and not the new Windows Azure Virtual Machine persistent VMs.

 

Windows Azure Disk Partitions

  • C: – Local Resource disk.  This disk contains Azure logs and configuration files, Windows Azure Diagnostics (which includes your IIS logs), and any LocalStorage resources you define. 
  • D: – Windows disk.  This is the OS disk and contains the Program Files folder (which includes installations done via startup tasks unless you specify another disk), registry changes, System32 folder, and the .NET framework.
  • E: or F: – Application disk.  This is the disk where your .cspkg is extracted and includes your website, binaries, role host process, startup tasks, web.config, etc.

 

Disk Preservation

| Scenario | C (Local Resource) | D (Windows) | E / F (Application) |
|---|---|---|---|
| VM Reboot From Within VM* | Preserved | Preserved | Preserved |
| Internal Fabric Node Recovery (power cycle node) | Preserved | Rebuilt | Preserved |
| Portal Reboot, Host OS Update, Stop/Start Service | Preserved | Preserved | Rebuilt |
| Portal ReImage or Guest OS Update | Preserved | Rebuilt | Rebuilt |
| In-Place Upgrade (default when deploying from Visual Studio) | Preserved | Preserved | Rebuilt** |
| Node Migration (ie. server failure) | Rebuilt | Rebuilt | Rebuilt |
| Rebuild Role Instance API (link) | Rebuilt | Rebuilt | Rebuilt |

 

* This reboot is one done from within the VM by doing something such as executing shutdown /r /t 0.  The portal reboot is done via the ‘Reboot’ button in the portal.

** In this scenario the application disk will switch from E to F (or vice-versa).  To detect the current application disk, applications should query the %RoleRoot% environment variable.
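For example, a one-line check of the current application disk (a small illustrative sketch; approot is the typical content folder for a worker role):

// %RoleRoot% resolves to the drive (E: or F:) where the .cspkg content was extracted.
string roleRoot = Environment.GetEnvironmentVariable("RoleRoot");
string appRoot = roleRoot + @"\approot";
System.Diagnostics.Trace.TraceInformation("Application disk root: {0}", appRoot);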

Heartbeats, Recovery, and the Load Balancer


Some of the more common questions I get are around heartbeats/probes, how the fabric recovers from failed probes, and how the load balancer manages traffic to these instances.

 

Q: How does the fabric know that an instance has failed, and what actions does it take to recover that instance?

A: There are a series of heartbeat probes between the fabric and the instance — Fabric <-> Host Agent <-> Guest Agent (WaAppAgent.exe) <-> Host Bootstrapper (WaHostBootstrapper.exe) <-> Host Process (typically WaIISHost.exe or WaWorkerHost.exe). 

    1. If the Fabric <-> Host Agent probe fails then the fabric will attempt to restart the host.  There are heuristics in the fabric to determine what to do with that host if a restart fails to resolve the problem, taking more aggressive actions to remedy the problem until ultimately the fabric may determine that the server itself is bad and it will create a new host on a new server and then start all of the affected guest VMs on that new host. 
    2. If the Host Agent <-> Guest Agent probe fails then the Host will attempt to restart the Guest OS, and this also includes a set of heuristics to take additional actions including attempting to start that Guest VM on a new server.  If the Host <-> Guest  probe succeeds then the fabric no longer takes action on that instance and any further recovery is handled by the guest agent within the VM. 
    3. The only recovery action that the guest agent will take is to restart the host stack (WaHostBootstrapper and all of its children) if one of the child processes crashes.  If the probe times out then the guest agent assumes the host process is busy working and lets it continue running indefinitely.  The guest agent will not restart the VM as part of a recovery process. 

See http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx for more information about the processes and probes on the Guest OS.

 

Q: How does the load balancer know when an instance is unhealthy?

A: There are 2 different mechanisms the load balancer can use to determine instance health and whether or not to include that instance in the round robin rotation and send new traffic to it.

    • The default mechanism is that the load balancer sends probes to the Guest Agent to request the instance health.  If the Guest Agent returns anything besides ‘Ready’ then the load balancer will mark that instance as unhealthy and remove it from the rotation.  Looking back at the heartbeats from the guest agent to the host process, this means that if any of those processes running in the Guest OS has crashed or hung then the guest agent will not return Ready and the instance will be removed from the LB rotation.
    • The other mechanism is for you to define a custom LoadBalancerProbe in your service definition.  A LoadBalancerProbe gives you much more control over how the load balancer determines instance health and allows you to more accurately reflect the status of your service, in particular the health of w3wp.exe and any other external dependencies your service has.  Make sure your probe path is not a simple HTML page, but actually includes logic to determine your service health (eg. try to connect to your SQL database); a sketch of such a probe endpoint follows this list.
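A minimal sketch of such a probe endpoint is below.  It assumes an ASP.NET Generic Handler (HealthCheck.ashx is a name chosen here for illustration) that the LoadBalancerProbe path points at, and a connection string named DbConnectionString in web.config; any failure returns a 500 so the load balancer drops the instance from rotation:

using System.Configuration;
using System.Data.SqlClient;
using System.Web;

// Hypothetical HealthCheck.ashx handler used as the LoadBalancerProbe target.
public class HealthCheck : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        try
        {
            // Check an external dependency - here, connectivity to the SQL database.
            string connectionString =
                ConfigurationManager.ConnectionStrings["DbConnectionString"].ConnectionString;
            using (SqlConnection connection = new SqlConnection(connectionString))
            {
                connection.Open();
            }

            context.Response.StatusCode = 200;   // healthy - stay in the rotation
            context.Response.Write("OK");
        }
        catch
        {
            context.Response.StatusCode = 500;   // unhealthy - drop out of the rotation
            context.Response.Write("Unhealthy");
        }
    }
}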

 

Q: What does the load balancer do when an instance is detected as unhealthy?

A: The load balancer will route new incoming TCP connections to instances which are in rotation.  The instances that are in rotation are either:

    1. Returning a ‘Ready’ state from the guest agent for roles which do not have a LoadBalancerProbe.
    2. Returning 200 or TCP ACK from a LoadBalancerProbe element.

If an instance drops out of rotation, the load balancer will not terminate any existing TCP connections.  So if the client and server maintain the TCP connection then traffic on that connection will still be sent to the instance which has dropped out of rotation, but no new TCP connections will be sent to that instance.  If the TCP connection is broken by the server (ie. the VM restarts or the process holding the TCP connection crashes) then the client should retry the connection, at which time the load balancer will see it as a new TCP connection and route it to an instance which is in rotation.

Note that for single instance deployments, the load balancer considers that instance to always be in rotation.  So regardless of the status of the instance the load balancer will send traffic to that instance.

 

Q: How can you determine if a role instance was recycled or moved to a new server?

A: There is no direct way to know if an instance was recycled.  Fabric initiated restarts (ie. OS updates) will raise the Stopping/OnStop events, but for unexpected shutdowns you will not receive these events.  There are some strategies to detect these events:

    1. The most common way to achieve this is to write a log entry in the RoleEntryPoint.OnStart method (see the sketch after this list).  If you unexpectedly see an instance of this log then you know a role instance was recycled and you can look at various pieces of evidence to determine why.
    2. If an instance is moved to a new VM/server then the Changing/Changed events will be raised on all other roles and instances with a type of RoleEnvironmentTopologyChange.  Note that this will only happen if you have an InternalEndpoint defined.  Also note that an InternalEndpoint is implicitly defined for you if you have enabled RDP.
    3. See http://blogs.msdn.com/b/kwill/archive/2012/09/19/role-instance-restarts-due-to-os-upgrades.aspx for information about determining when an instance is restarted due to OS updates.
    4. The guest agent logs (reference the Role Architecture blog post for log file location) will contain evidence of all restarts, both planned and unplanned, but they are internal undocumented logs and interpreting them is not trivial.  But if you are following #1 and you know the timestamp for when your role restarted then you can focus on a specific timeframe in the agent logs.
    5. The host bootstrapper logs (reference the Role Architecture blog post for log file location) will tell you if a startup task or host process failed and caused the guest agent to recycle the instance.
    6. The state of the drives on the guest OS can provide information about what happened.  See http://blogs.msdn.com/b/kwill/archive/2012/10/05/windows-azure-disk-partition-preservation.aspx.
    7. If the above doesn’t help, the support team can help investigate through a support incident.
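As a rough sketch of items #1 and #2 (assuming Windows Azure Diagnostics or another Trace listener is already configured):

using System.Diagnostics;
using System.Linq;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        // #1 - an unexpected occurrence of this entry in your logs means the instance was restarted.
        Trace.TraceInformation("OnStart called on instance {0}",
            RoleEnvironment.CurrentRoleInstance.Id);

        // #2 - topology changes are raised on the other instances when an instance is moved
        // (requires an InternalEndpoint, which enabling RDP defines implicitly).
        RoleEnvironment.Changed += (sender, e) =>
        {
            foreach (RoleEnvironmentTopologyChange change in e.Changes.OfType<RoleEnvironmentTopologyChange>())
            {
                Trace.TraceWarning("Topology changed for role {0}", change.RoleName);
            }
        };

        return base.OnStart();
    }
}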

 

 

I will be following up with a couple more blog posts to go deeper into a couple of these topics – troubleshooting using various logs from the VM, and more details about how the Load Balancer works.

Asynchronous Parallel Blob Transfers with Progress Change Notification 2.0


This post is an update to the post at http://blogs.msdn.com/b/kwill/archive/2011/05/30/asynchronous-parallel-block-blob-transfers-with-progress-change-notification.aspx.

 

Improvements from previous version

  • Upgraded to Azure Storage Client library 2.0 (Microsoft.WindowsAzure.Storage.dll).
  • Switched from custom parallel transfer code to the built-in BeginDownloadToStream and BeginUploadFromStream methods, which provide better performance and more reliable functionality with the same async parallel operations.
  • Helper functions to allow clients using Storage Client library 1.7 (Microsoft.WindowsAzure.StorageClient.dll) to utilize the functionality.

 

Upgrade instructions

The changes were designed to give clients using the older version of the code a drop-in replacement with almost zero code changes.  If you are upgrading a client to use this new code there are a few small changes to make:

  • Add a reference to Azure Storage Client Library 2.0.  The Nuget package manager makes this a near 1-click operation.
  • The TransferTypeEnum has been moved into the BlobTransfer class.  If your client code utilizes TransferTypeEnum, update it to use BlobTransfer.TransferTypeEnum.

 

BlobTransfer.cs

 

using System;
using System.ComponentModel;

using System.Collections.Generic;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;
using System.IO;
using System.Linq;

namespace BlobTransferUI
{

// Class to allow for easy async upload and download functions with progress change notifications
// Requires references to Microsoft.WindowsAzure.Storage.dll (Storage client 2.0) and Microsoft.WindowsAzure.StorageClient.dll (Storage client 1.7).
// See comments on UploadBlobAsync and DownloadBlobAsync functions for information on removing the 1.7 client library dependency
class BlobTransfer
{
// Public async events
public event AsyncCompletedEventHandler TransferCompleted;
public event EventHandler<BlobTransferProgressChangedEventArgs> TransferProgressChanged;

// Public BlobTransfer properties
public TransferTypeEnum TransferType;

// Private variables
private ICancellableAsyncResult asyncresult;
private bool Working = false;
private object WorkingLock = new object();
private AsyncOperation asyncOp;

// Used to calculate download speeds
private Queue<long> timeQueue = new Queue<long>(200);
private Queue<long> bytesQueue = new Queue<long>(200);
private DateTime updateTime = System.DateTime.Now;

// Private BlobTransfer properties
private string m_FileName;
private ICloudBlob m_Blob;

// Helper function to allow Storage Client 1.7 (Microsoft.WindowsAzure.StorageClient) to utilize this class.
// Remove this function if only using Storage Client 2.0 (Microsoft.WindowsAzure.Storage).
public void UploadBlobAsync(Microsoft.WindowsAzure.StorageClient.CloudBlob blob, string LocalFile)
{
Microsoft.WindowsAzure.StorageCredentialsAccountAndKey account = blob.ServiceClient.Credentials as Microsoft.WindowsAzure.StorageCredentialsAccountAndKey;
ICloudBlob blob2 = new CloudBlockBlob(blob.Attributes.Uri, new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(blob.ServiceClient.Credentials.AccountName, account.Credentials.ExportBase64EncodedKey()));
UploadBlobAsync(blob2, LocalFile);
}

// Helper function to allow Storage Client 1.7 (Microsoft.WindowsAzure.StorageClient) to utilize this class.
// Remove this function if only using Storage Client 2.0 (Microsoft.WindowsAzure.Storage).
public void DownloadBlobAsync(Microsoft.WindowsAzure.StorageClient.CloudBlob blob, string LocalFile)
{
Microsoft.WindowsAzure.StorageCredentialsAccountAndKey account = blob.ServiceClient.Credentials as Microsoft.WindowsAzure.StorageCredentialsAccountAndKey;
ICloudBlob blob2 = new CloudBlockBlob(blob.Attributes.Uri, new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(blob.ServiceClient.Credentials.AccountName, account.Credentials.ExportBase64EncodedKey()));
DownloadBlobAsync(blob2, LocalFile);
}

public void UploadBlobAsync(ICloudBlob blob, string LocalFile)
{
// The class currently stores state in class level variables so calling UploadBlobAsync or DownloadBlobAsync a second time will cause problems.
// A better long term solution would be to better encapsulate the state, but the current solution works for the needs of my primary client.
// Throw an exception if UploadBlobAsync or DownloadBlobAsync has already been called.
lock (WorkingLock)
{
if (!Working)
Working = true;
else
throw new Exception("BlobTransfer already initiated. Create new BlobTransfer object to initiate a new file transfer.");
}

// Attempt to open the file first so that we throw an exception before getting into the async work
using (FileStream fstemp = new FileStream(LocalFile, FileMode.Open, FileAccess.Read)) { }

// Create an async op in order to raise the events back to the client on the correct thread.
asyncOp = AsyncOperationManager.CreateOperation(blob);

TransferType = TransferTypeEnum.Upload;
m_Blob = blob;
m_FileName = LocalFile;

var file = new FileInfo(m_FileName);
long fileSize = file.Length;

FileStream fs = new FileStream(m_FileName, FileMode.Open, FileAccess.Read, FileShare.Read);
ProgressStream pstream = new ProgressStream(fs);
pstream.ProgressChanged += pstream_ProgressChanged;
pstream.SetLength(fileSize);
m_Blob.ServiceClient.ParallelOperationThreadCount = 10;
asyncresult = m_Blob.BeginUploadFromStream(pstream, BlobTransferCompletedCallback, new BlobTransferAsyncState(m_Blob, pstream));
}

public void DownloadBlobAsync(ICloudBlob blob, string LocalFile)
{
// The class currently stores state in class level variables so calling UploadBlobAsync or DownloadBlobAsync a second time will cause problems.
// A better long term solution would be to better encapsulate the state, but the current solution works for the needs of my primary client.
// Throw an exception if UploadBlobAsync or DownloadBlobAsync has already been called.
lock (WorkingLock)
{
if (!Working)
Working = true;
else
throw new Exception("BlobTransfer already initiated. Create new BlobTransfer object to initiate a new file transfer.");
}

// Create an async op in order to raise the events back to the client on the correct thread.
asyncOp = AsyncOperationManager.CreateOperation(blob);

TransferType = TransferTypeEnum.Download;
m_Blob = blob;
m_FileName = LocalFile;

m_Blob.FetchAttributes();

FileStream fs = new FileStream(m_FileName, FileMode.OpenOrCreate, FileAccess.Write, FileShare.Read);
ProgressStream pstream = new ProgressStream(fs);
pstream.ProgressChanged += pstream_ProgressChanged;
pstream.SetLength(m_Blob.Properties.Length);
m_Blob.ServiceClient.ParallelOperationThreadCount = 10;
asyncresult = m_Blob.BeginDownloadToStream(pstream, BlobTransferCompletedCallback, new BlobTransferAsyncState(m_Blob, pstream));
}

private void pstream_ProgressChanged(object sender, ProgressChangedEventArgs e)
{
BlobTransferProgressChangedEventArgs eArgs = null;
int progress = (int)((double)e.BytesRead / e.TotalLength * 100);

// raise the progress changed event on the asyncop thread
eArgs = new BlobTransferProgressChangedEventArgs(e.BytesRead, e.TotalLength, progress, CalculateSpeed(e.BytesRead), null);
asyncOp.Post(delegate(object e2) { OnTaskProgressChanged((BlobTransferProgressChangedEventArgs)e2); }, eArgs);
}

private void BlobTransferCompletedCallback(IAsyncResult result)
{
BlobTransferAsyncState state = (BlobTransferAsyncState)result.AsyncState;
ICloudBlob blob = state.Blob;
ProgressStream stream = (ProgressStream)state.Stream;

try
{
stream.Close();

// End the operation.
if (TransferType == TransferTypeEnum.Download)
blob.EndDownloadToStream(result);
else if (TransferType == TransferTypeEnum.Upload)
blob.EndUploadFromStream(result);

// Operation completed normally, raise the completed event
AsyncCompletedEventArgs completedArgs = new AsyncCompletedEventArgs(null, false, null);
asyncOp.PostOperationCompleted(delegate(object e) { OnTaskCompleted((AsyncCompletedEventArgs)e); }, completedArgs);
}
catch (StorageException ex)
{
if (!state.Cancelled)
{
throw (ex);
}

// Operation was cancelled, raise the event with the cancelled flag = true
AsyncCompletedEventArgs completedArgs = new AsyncCompletedEventArgs(null, true, null);
asyncOp.PostOperationCompleted(delegate(object e) { OnTaskCompleted((AsyncCompletedEventArgs)e); }, completedArgs);
}
}

// Cancel the async download
public void CancelAsync()
{
((BlobTransferAsyncState)asyncresult.AsyncState).Cancelled = true;
asyncresult.Cancel();
}

// Helper function to only raise the event if the client has subscribed to it.
protected virtual void OnTaskCompleted(AsyncCompletedEventArgs e)
{
if (TransferCompleted != null)
TransferCompleted(this, e);
}

// Helper function to only raise the event if the client has subscribed to it.
protected virtual void OnTaskProgressChanged(BlobTransferProgressChangedEventArgs e)
{
if (TransferProgressChanged != null)
TransferProgressChanged(this, e);
}

// Keep the last 200 progress change notifications and use them to calculate the average speed over that duration.
private double CalculateSpeed(long BytesSent)
{
double speed = 0;

if (timeQueue.Count >= 200)
{
timeQueue.Dequeue();
bytesQueue.Dequeue();
}

timeQueue.Enqueue(System.DateTime.Now.Ticks);
bytesQueue.Enqueue(BytesSent);

if (timeQueue.Count > 2)
{
updateTime = System.DateTime.Now;
speed = (bytesQueue.Max() - bytesQueue.Min()) / TimeSpan.FromTicks(timeQueue.Max() - timeQueue.Min()).TotalSeconds;
}

return speed;
}

// A modified version of the ProgressStream from http://blogs.msdn.com/b/paolos/archive/2010/05/25/large-message-transfer-with-wcf-adapters-part-1.aspx
// This class allows progress changed events to be raised from the blob upload/download.
private class ProgressStream : Stream
{
#region Private Fields
private Stream stream;
private long bytesTransferred;
private long totalLength;
#endregion

#region Public Handler
public event EventHandler<ProgressChangedEventArgs> ProgressChanged;
#endregion

#region Public Constructor
public ProgressStream(Stream file)
{
this.stream = file;
this.totalLength = file.Length;
this.bytesTransferred = 0;
}
#endregion

#region Public Properties
public override bool CanRead
{
get
{
return this.stream.CanRead;
}
}

public override bool CanSeek
{
get
{
return this.stream.CanSeek;
}
}

public override bool CanWrite
{
get
{
return this.stream.CanWrite;
}
}

public override void Flush()
{
this.stream.Flush();
}

public override void Close()
{
this.stream.Close();
}

public override long Length
{
get
{
return this.stream.Length;
}
}

public override long Position
{
get
{
return this.stream.Position;
}
set
{
this.stream.Position = value;
}
}
#endregion

#region Public Methods
public override int Read(byte[] buffer, int offset, int count)
{
int result = stream.Read(buffer, offset, count);
bytesTransferred += result;
if (ProgressChanged != null)
{
try
{
OnProgressChanged(new ProgressChangedEventArgs(bytesTransferred, totalLength));
//ProgressChanged(this, new ProgressChangedEventArgs(bytesTransferred, totalLength));
}
catch (Exception)
{
ProgressChanged = null;
}
}
return result;
}

protected virtual void OnProgressChanged(ProgressChangedEventArgs e)
{
if (ProgressChanged != null)
ProgressChanged(this, e);
}

public override long Seek(long offset, SeekOrigin origin)
{
return this.stream.Seek(offset, origin);
}

public override void SetLength(long value)
{
totalLength = value;
//this.stream.SetLength(value);
}

public override void Write(byte[] buffer, int offset, int count)
{
this.stream.Write(buffer, offset, count);
bytesTransferred += count;
{
try
{
OnProgressChanged(new ProgressChangedEventArgs(bytesTransferred, totalLength));
//ProgressChanged(this, new ProgressChangedEventArgs(bytesTransferred, totalLength));
}
catch (Exception)
{
ProgressChanged = null;
}
}
}

protected override void Dispose(bool disposing)
{
stream.Dispose();
base.Dispose(disposing);
}

#endregion
}

private class BlobTransferAsyncState
{
public ICloudBlob Blob;
public Stream Stream;
public DateTime Started;
public bool Cancelled;

public BlobTransferAsyncState(ICloudBlob blob, Stream stream)
: this(blob, stream, DateTime.Now)
{ }

public BlobTransferAsyncState(ICloudBlob blob, Stream stream, DateTime started)
{
Blob = blob;
Stream = stream;
Started = started;
Cancelled = false;
}
}

private class ProgressChangedEventArgs : EventArgs
{
#region Private Fields
private long bytesRead;
private long totalLength;
#endregion

#region Public Constructor
public ProgressChangedEventArgs(long bytesRead, long totalLength)
{
this.bytesRead = bytesRead;
this.totalLength = totalLength;
}
#endregion

#region Public properties

public long BytesRead
{
get
{
return this.bytesRead;
}
set
{
this.bytesRead = value;
}
}

public long TotalLength
{
get
{
return this.totalLength;
}
set
{
this.totalLength = value;
}
}
#endregion
}

public enum TransferTypeEnum
{
Download,
Upload
}

public class BlobTransferProgressChangedEventArgs : System.ComponentModel.ProgressChangedEventArgs
{
private long m_BytesSent = 0;
private long m_TotalBytesToSend = 0;
private double m_Speed = 0;

public long BytesSent
{
get { return m_BytesSent; }
}

public long TotalBytesToSend
{
get { return m_TotalBytesToSend; }
}

public double Speed
{
get { return m_Speed; }
}

public TimeSpan TimeRemaining
{
get
{
TimeSpan time = new TimeSpan(0, 0, (int)((TotalBytesToSend - m_BytesSent) / (m_Speed == 0 ? 1 : m_Speed)));
return time;
}
}

public BlobTransferProgressChangedEventArgs(long BytesSent, long TotalBytesToSend, int progressPercentage, double Speed, object userState)
: base(progressPercentage, userState)
{
m_BytesSent = BytesSent;
m_TotalBytesToSend = TotalBytesToSend;
m_Speed = Speed;
}
}
}
}

 

Sample usage

BlobTransfer transfer;

private void button1_Click(object sender, EventArgs e)
{
CloudStorageAccount account = new CloudStorageAccount(new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials("accountname", "accountkey"), false);
CloudBlobClient client = account.CreateCloudBlobClient();
CloudBlobContainer container = client.GetContainerReference("container");
CloudBlockBlob blob = container.GetBlockBlobReference("file");

transfer = new BlobTransfer();
transfer.TransferProgressChanged += transfer_TransferProgressChanged;
transfer.TransferCompleted += transfer_TransferCompleted;
transfer.DownloadBlobAsync(blob, @"C:\temp\file");
}

private void button2_Click(object sender, EventArgs e)
{
transfer.CancelAsync();
}

void transfer_TransferCompleted(object sender, AsyncCompletedEventArgs e)
{
System.Diagnostics.Debug.WriteLine("Completed. Cancelled = " + e.Cancelled);
}

void transfer_TransferProgressChanged(object sender, BlobTransfer.BlobTransferProgressChangedEventArgs e)
{
System.Diagnostics.Debug.WriteLine("Changed - " + e.BytesSent + " / " + e.TotalBytesToSend + " = " + e.ProgressPercentage + "% " + e.Speed);
}

 

Simple Console Client

Calling the upload or download method from BlobTransfer is a pretty simple matter of obtaining a CloudBlob reference to the blob of interest, subscribing to the TransferProgressChanged and TransferCompleted events, and then calling UploadBlobAsync or DownloadBlobAsync.  The following console app shows a simple example.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

using Microsoft.WindowsAzure;
using Microsoft.WindowsAzure.StorageClient;

namespace ConsoleApplication1
{
class Program
{
const string ACCOUNTNAME = "ENTER ACCOUNT NAME";
const string ACCOUNTKEY = "ENTER ACCOUNT KEY";
const string LOCALFILE = @"ENTER LOCAL FILE";
const string CONTAINER = "temp";

private static CloudStorageAccount AccountFileTransfer;
private static CloudBlobClient BlobClientFileTransfer;
private static CloudBlobContainer ContainerFileTransfer;

private static bool Transferring;

static void Main(string[] args)
{
System.Net.ServicePointManager.DefaultConnectionLimit = 35;

AccountFileTransfer = CloudStorageAccount.Parse("DefaultEndpointsProtocol=http;AccountName=" + ACCOUNTNAME + ";AccountKey=" + ACCOUNTKEY);
if (AccountFileTransfer != null)
{
BlobClientFileTransfer = AccountFileTransfer.CreateCloudBlobClient();
ContainerFileTransfer = BlobClientFileTransfer.GetContainerReference(CONTAINER);
ContainerFileTransfer.CreateIfNotExist();
}

// Upload the file
CloudBlob blobUpload = ContainerFileTransfer.GetBlobReference(CONTAINER + "/" + System.IO.Path.GetFileName(LOCALFILE));
BlobTransfer transferUpload = new BlobTransfer();
transferUpload.TransferProgressChanged += new EventHandler<BlobTransfer.BlobTransferProgressChangedEventArgs>(transfer_TransferProgressChanged);
transferUpload.TransferCompleted += new System.ComponentModel.AsyncCompletedEventHandler(transfer_TransferCompleted);
transferUpload.UploadBlobAsync(blobUpload, LOCALFILE);

Transferring = true;
while (Transferring)
{
Console.ReadLine();
}

// Download the file
CloudBlob blobDownload = ContainerFileTransfer.GetBlobReference(CONTAINER + "/" + System.IO.Path.GetFileName(LOCALFILE));
BlobTransfer transferDownload = new BlobTransfer();
transferDownload.TransferProgressChanged += new EventHandler<BlobTransfer.BlobTransferProgressChangedEventArgs>(transfer_TransferProgressChanged);
transferDownload.TransferCompleted += new System.ComponentModel.AsyncCompletedEventHandler(transfer_TransferCompleted);
transferDownload.DownloadBlobAsync(blobDownload, LOCALFILE + ".copy");

Transferring = true;
while (Transferring)
{
Console.ReadLine();
}
}

static void transfer_TransferCompleted(object sender, System.ComponentModel.AsyncCompletedEventArgs e)
{
Transferring = false;
Console.WriteLine("Transfer completed. Press any key to continue.");
}

static void transfer_TransferProgressChanged(object sender, BlobTransfer.BlobTransferProgressChangedEventArgs e)
{
Console.WriteLine("Transfer progress percentage = " + e.ProgressPercentage + " - " + (e.Speed / 1024).ToString("N2") + "KB/s");
}
}
}

 

UI Client

For a more full featured UI client check out the full source code at 0601.BlobTransferUI.zip.

Windows Azure Root Certificate Migration – Impact Scenarios


 

Overview

We recently published a blog post outlining the upcoming root certificate change from the GTE CyberTrust Global Root to the Baltimore CyberTrust Root which will take effect starting April 15, 2013.  This new blog post will attempt to outline the potential impact that this will have on users in various scenarios.  The first thing to point out is that this change will have zero impact on the vast majority of Windows Azure customers or end users who are consuming services hosted in Windows Azure.  Standard applications running on modern OSes will automatically trust the new certificate.  The potential impact is limited to developers who have written very specific code to do custom cert validation, or niche scenarios such as specific embedded devices, ultra secure environments, or old development frameworks and OSes.

If you have additional questions not covered by the original blog post or this one, please feel free to post your question in the forum thread and we will attempt to address it.

 

What is Changing?

Microsoft is changing from the GTE CyberTrust Global Root to the Baltimore CyberTrust Root for all public facing HTTPs services.  This specific post will only target the Windows Azure Platform scenarios.  If you need assistance with other platforms such as Office 365 please engage with their support team.

This change only affects Microsoft owned domain names and does not have any impact on custom domain names you have specified for your hosted service.  For example, if you have a hosted service at myapp.cloudapp.net and a CNAME setup for www.myapp.com, then your HTTPs certificate for https://www.myapp.com is not affected by this change and your end users’ experience will not change.  The root cert change will only affect HTTPs calls to Microsoft owned URLs such as the following:

  • *.windowsazure.com
  • *.windows.net
  • *.azure.com
  • *.microsoftonline.com
  • *.msecn.net
  • *.azurewebsites.net

 

Impacted Scenarios

The following are the scenarios which might be impacted by this root cert change, roughly ordered by the probability of impact.  Because the scope of these scenarios is very limited, and the code or configuration required to be in one of these scenarios is very deliberate, if you aren’t sure whether you are impacted by any of these scenarios then most likely you are not.  Also note that in all of these scenarios you will only be impacted if your code is calling into one of the Azure domains noted above.  If your code is not calling an Azure domain then you will not be impacted by any of these scenarios.

  1. Sharepoint Server.  Sharepoint uses a custom certificate store rather than the standard Windows certificate store.  Sharepoint administrators must add certificates to the Sharepoint store via the admin console or the New-SPTrustedRootAuthority Powershell cmdlet.  If your Sharepoint Server uses Windows Azure Active Directory (Access Control Service) for user authentication, or if you have modules which make HTTPs calls to Azure services such as Storage Service, then you will need to add the new Baltimore CyberTrust Root certificate.
  2. .NET applications using ServerCertificateValidationCallback.  .NET exposes the System.Net.ServicePointManager.ServerCertificateValidationCallback and in .NET 4.5 the System.Net.HttpWebRequest.ServerCertificateValidationCallback callback functions which allow developers to use custom logic to determine certificate validity rather than relying on the standard Windows certificate store.  A developer can add logic which checks for a specific subject name or thumbprint, or use logic which only allows a specific root authority such as GTE CyberTrust Global Root.  If your application uses this callback function you should make sure that it accepts both the old and new certificates (a sketch of such a callback follows this list).
  3. Devices with custom certificate stores.  Embedded devices such as TV set top boxes and mobile devices often ship with a limited set of root authority certificates and have no easy way to update the certificate store.  If you write code for, or manage deployments of, custom embedded or mobile devices you will want to make sure the devices trust the new Baltimore CyberTrust Root certificate.  Most modern smartphone devices already include the Baltimore CyberTrust Root certificate, the notable exception being Android, where Google first included the certificate with Android 2.3 Gingerbread, released mid-2011 (source).
  4. Highly secured environments.  Clients running in environments which are highly secured may run into issues with not having the standard root certificates installed on the OS, having outbound network traffic restricted to specific addresses, or not allowing automatic certificate updates.  System administrators may remove all certificates that are not explicitly required, and may have removed the Baltimore CyberTrust Root from the trusted root store on the OS.  Network administrators may restrict outbound network traffic which may block the standard Certificate Revocation List checks performed by Windows during the process of validating certificates.  The “Microsoft Internet Authority” intermediate cert has changed as part of this update and the CRL distribution point has changed from http://www.public-trust.com/cgi-bin/CRL/2018/cdp.crl to http://cdp1.public-trust.com/CRL/Omniroot2025.crl.
  5. Runtime Environments.  Some runtime environments such as Java use custom certificate validation mechanisms instead of the standard Windows certificate validation.  We are currently unaware of any runtime environments which do not trust the Baltimore CyberTrust Root certificate, but if you are running on a niche or older runtime environment you should check with the vendor to determine if your application will be impacted.
  6. Native applications using WINHTTP_CALLBACK_STATUS_SENDING_REQUEST.  Similar to #2 above with the ServerCertificateValidationCallback, the WINHTTP_CALLBACK_STATUS_SENDING_REQUEST notification allows native applications to implement custom certificate validation algorithms.  Usage of this notification is very rare and requires a significant amount of custom code to implement.
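For scenario #2, a minimal sketch of a validation callback that accepts chains ending in either root is shown below.  The thumbprint strings are placeholders that you would replace with the actual GTE CyberTrust Global Root and Baltimore CyberTrust Root thumbprints:

using System.Linq;
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

public static class CertificateValidation
{
    // Placeholder values - replace with the real GTE and Baltimore root thumbprints.
    private static readonly string[] TrustedRootThumbprints =
    {
        "GTE-CYBERTRUST-GLOBAL-ROOT-THUMBPRINT",
        "BALTIMORE-CYBERTRUST-ROOT-THUMBPRINT"
    };

    public static void Register()
    {
        ServicePointManager.ServerCertificateValidationCallback = ValidateCertificate;
    }

    private static bool ValidateCertificate(object sender, X509Certificate certificate,
        X509Chain chain, SslPolicyErrors sslPolicyErrors)
    {
        if (sslPolicyErrors != SslPolicyErrors.None)
            return false;

        // The last chain element is the root authority - accept either trusted root.
        X509ChainElement root = chain.ChainElements[chain.ChainElements.Count - 1];
        return TrustedRootThumbprints.Contains(root.Certificate.Thumbprint);
    }
}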

 

Details and Testing

To get details about the old and new certificates you can browse to the following URLs.  The way you check the certificate will depend on your OS, but in Windows using IE you can browse to the URL and then click the lock icon near the address bar.  This will bring up the certificate where you can inspect the details and certificate chain.  You should also try connecting to these URLs from your environment to ensure that you will not run into any issues.

Old GTE CyberTrust Global Root certificate – https://jag-db.accesscontrol.windows.net/FederationMetadata/2007-06/FederationMetadata.xml

New Baltimore CyberTrust Root certificate – https://jag-sn.accesscontrol.windows.net/FederationMetadata/2007-06/FederationMetadata.xml

New Baltimore CyberTrust Root certificate – https://ewew3t8qjn.database.windows.net/
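If you would rather script the check than use a browser, a small console sketch along these lines (plain .NET, nothing Azure-specific) will show whether your environment trusts each certificate chain:

using System;
using System.Net;

class CertCheck
{
    static void Main()
    {
        string[] urls =
        {
            "https://jag-db.accesscontrol.windows.net/FederationMetadata/2007-06/FederationMetadata.xml",
            "https://jag-sn.accesscontrol.windows.net/FederationMetadata/2007-06/FederationMetadata.xml"
        };

        foreach (string url in urls)
        {
            try
            {
                // An HTTPS request fails with a WebException if the certificate chain is not trusted.
                using (WebResponse response = WebRequest.Create(url).GetResponse()) { }
                Console.WriteLine("OK    - " + url);
            }
            catch (WebException ex)
            {
                Console.WriteLine("FAIL  - " + url + " : " + ex.Message);
            }
        }
    }
}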

 

Error Messages

If you are impacted by this certificate update then the error messages you receive will be dependent on the type of environment you are running in and which scenario you are impacted by.  You should check Windows Application event logs, CAPI2 event logs, and custom application logs.  Here is an example of an error message you may receive in the Application event logs:

An operation failed because the following certificate has validation errors:\n\nSubject Name: CN=accesscontrol.windows.net\nIssuer Name: CN=MSIT Machine Auth CA 2, DC=redmond, DC=corp, DC=microsoft, DC=com\nThumbprint: 3562E9B2583B8E4AA74E4E6C07A0A419B11DEBAD\n\nErrors:\n\n The root of the certificate chain is not a trusted root authority.

 

Q & A

Q:  I have an HTTPs certificate for my website hosted in Azure.  Will this affect me?

A:  No.  If you are using a standard HTTPs certificate which you obtained from a domain registrar to secure traffic to your custom domain name (ie. https://www.myapp.com) then this change does not affect you.  Your HTTPs certificate is chained to whatever root certificate authority your registrar is using and has nothing to do with the change that Microsoft is making for Azure specific domain names.  The only caveat to this is that if your website is utilizing Azure services such as Table or Blob storage then you might fall into the Scenario #2 if you have custom code in your application to implement the ServerCertificateValidationCallback.

 

Q:  But the HTTPs certificate for my website has GTE CyberTrust Global Root as its root certificate.  Does this mean I need to change my certificate?

A:  No.  Your service’s certificate is completely unrelated to the certificates for Microsoft owned domain names as listed earlier.  Your service will continue to be chained to whatever root certificate was provided to you and the end users who access your website will continue to trust the GTE CyberTrust Global Root certificate, which means they will continue to trust your certificate.

 

Q:  Does this affect my service management or RDP certificates which I got from Microsoft to manage my Azure services?

A:  No.  The service management certificate (used by things like Visual Studio, Powershell, and the REST APIs) and RDP certificates are self-signed certificates which are not chained to a root authority.

 

Q:  I am still not sure if I will be impacted by this change, what can I do?

A:  As mentioned in the scenarios section above, it takes a very deliberate effort to modify the default certificate validation processes which means that typically the only people who would be impacted by a change like this are the people who already have processes in place to track and mitigate these types of issues.  This means that if you aren’t sure if you are impacted by this event, then you are almost certainly not going to be impacted.  If you have questions that are not addressed here, please post them in the forum thread.

 

Q:  I see the Baltimore CyberTrust Root and the GTE CyberTrust Global Root in my Trusted Root Authority certificate store.  Should I remove the GTE one?

A:  No.  The GTE root certificate is still a valid root certificate and many other websites around the world will continue to use it as the root for their certificate chains.

 

Q:  Do my Azure VMs already have the Baltimore CyberTrust Root in the certificate store or do I need to deploy this certificate with my service?

A:  All Azure VMs are built with the Baltimore CyberTrust Root (among many other known trusted root authority certificates).  You do not need to make any modifications to your hosted service package to include a new certificate.

 

 I would like to say thanks to Hui Zhu and Jaganathan Thangavelu for their expertise in putting together this post.

Compatibility Issue – SQL Azure connectivity using the SQL Native Client may break after the July release of the Windows Azure Guest OS


 

Prior to July 2013 the Windows Azure Guest OS included version 10 of the SQL Native Client libraries (commonly named SQLNCLI or SNAC).  The July release of the Windows Azure Guest OS (1.24, 2.16 and 3.4) now ship version 11 of SQL Native Client libraries.  This change can cause exceptions being thrown by client applications connecting to SQL Azure (or any other database accessed by SNAC).

 

Symptoms:

  • The most common symptom is a failure in connectivity to the database starting mid-July (July 16th is the date when July releases started shipping) or when you manually upgrade to the latest Windows Azure Guest OS version (1.24, 2.16 or 3.4).
  • Error messages will depend on the language or framework you are using, but will generally be thrown when opening a connection to the database. Here are some variations of the errors you may see:
     

With SNAC ODBC: 

ERROR [IM002] [Microsoft][ODBC Driver Manager] Data source name not found and no default driver specified

 

With SNAC OLEDB:

                The ‘SQLNCLI10’ provider is not registered on the local machine.

 

With PHP:

Error establishing a Database Connection
The connection has timed out – the server at XXXXXXXX.database.windows.net is taking too long to respond.

 

Cause:

Prior to July 2013 the Windows Azure Guest OS shipped with the Microsoft SQL Server 2008 Native Client (version 10 of the SQLNCLI – sqlncli10.dll), but after July 2013 the Windows Azure Guest OS ships with the Microsoft SQL Server 2012 Native Client (version 11 of the SQLNCLI – sqlncli11.dll).

ODBC/OLEDB Connection Strings generally hardcode the SQLNCLI version as in the following examples:

SQL Server Native Client 10.0 ODBC Driver

Driver={SQL Server Native Client 10.0};Server=tcp:[serverName].database.windows.net;Database=myDataBase;Uid=[LoginForDb]@[serverName];Pwd=myPassword;Encrypt=yes;

SQL Server Native Client 10.0 OLE DB Provider

Provider=SQLNCLI10;Password=myPassword;User ID=[username]@[servername];Initial Catalog=databasename;Data Source=tcp:[servername].database.windows.net;

Since the sqlncli10.dll no longer exists on the OS, the calls to its methods will fail.

 

 

Impacted OS Versions

| Guest OS Version | Configuration String | Release Date | SQL Native Client Version |
|---|---|---|---|
| 1.24 | WA-GUEST-OS-1.24_201306-01 | July 16, 2013 | SQL 2012 (v11) |
| 2.16 | WA-GUEST-OS-2.16_201306-01 | July 16, 2013 | SQL 2012 (v11) |
| 3.4 | WA-GUEST-OS-3.4_201306-01 | July 16, 2013 | SQL 2012 (v11) |

 

 

How to check the version of the SQL Native Client Libraries on your VM?

  • You can login to your Virtual Machine with RDP and look in D:\Windows\System32.
    • If sqlncli10.dll is present then you are using version 10.
    • If sqlncli11.dll is present then you are using version 11.
  • You can check the Programs and Features List via the control panel.
    • If Microsoft SQL Server 2008 Native Client is present then you are using version 10.
    • If Microsoft SQL Server 2012 Native Client is present then you are using version 11.

 

How to determine if you are impacted?

  • If you are using a database connection string check for the presence of “Driver=” or “Provider=”.  This value will indicate if you are using version 10 or version 11.
  • If you are using PHP/WordPress, you can look at the PHP logs and search for the string “sql”; the logged calls will generally indicate that the code expects version 10.

 

Note: If you have a .NET application which directly uses System.Data.SqlClient.dll you should not be affected as this assembly doesn’t use SNAC.  

 

Solutions:

The latest Windows Azure Guest OS includes both SNAC 10 and SNAC 11 installed side-by-side.  If you had rolled back to a previous Guest OS version you can now upgrade to the latest one (201308-01 (3.6, 2.18, 1.26) or newer).

  • For a quick, temporary solution you can roll back to the previous Guest OS (June release – versions 1.23, 2.15, 3.3) which contains SNAC 10. This can be done via the Azure Management Portal or by uploading a new service configuration file with the correct OS configuration string.  This should be a temporary solution since the June Guest OS releases will be deprecated.
  • Upgrade your code or connection string to use SNAC version 11. This could be as simple as changing “10” into “11” (see SQL Azure ConnectionStrings samples on http://www.connectionstrings.com/sql-azure/ and the example after this list) – but this of course needs testing.
  • If you are using PHP, you can use the latest version 3.0 of the PHP SQL drivers – available on http://www.microsoft.com/en-us/download/details.aspx?id=20098
  • You can install SQLNCLI10 x64 (SQLNCLI version 10 in x64) side-by-side with version 11.  Note that you will need to deploy SQLNCLI version 10 with a Startup Task and redeploy your application.  You can redeploy your application using an In Place Upgrade or VIP swap in order to prevent downtime.
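As an illustration of the connection string change (the second bullet above), here is a hedged sketch using ODBC; the server, database, and credential values are placeholders:

using System.Data.Odbc;

class ConnectionExample
{
    static void Main()
    {
        // Before (fails on Guest OS 1.24 / 2.16 / 3.4 because sqlncli10.dll is no longer present):
        // "Driver={SQL Server Native Client 10.0};Server=tcp:myserver.database.windows.net;..."

        // After - reference SNAC version 11 instead:
        string connectionString =
            "Driver={SQL Server Native Client 11.0};" +
            "Server=tcp:myserver.database.windows.net;Database=myDataBase;" +
            "Uid=myLogin@myserver;Pwd=myPassword;Encrypt=yes;";

        using (OdbcConnection connection = new OdbcConnection(connectionString))
        {
            connection.Open();
        }
    }
}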

 

 

A special thanks to Axel Guerrier with our Windows Azure Technical Support team for providing the technical information.

 

Windows Azure PaaS Compute Diagnostics Data


 

When troubleshooting a problem one of the most important things to know is what diagnostic data is available.  If you don’t know where to look for logs or other diagnostics information then you end up having to resort to the trial-and-error or shotgun approach to troubleshooting problems.  However, with access to logs you have a fair chance to troubleshoot any problem, even if it isn’t in your domain of expertise.  This blog post will describe the data available in Windows Azure PaaS compute environments, how to easily gather this data, and will begin a series of posts discussing how to troubleshoot issues when using the Azure platform.

In conjunction with this blog post I would highly recommend reading the post at http://blogs.msdn.com/b/kwill/archive/2011/05/05/windows-azure-role-architecture.aspx which explains the different processes in a PaaS VM and how they interact with each other.  An understanding of the high level architecture of something you are trying to troubleshoot will significantly improve your ability to resolve problems.

 

Troubleshooting Series (along with the major concepts or tools covered in each scenario):

  1. Windows Azure PaaS Compute Diagnostics Data
  2. AzureTools – The Diagnostic Utility used by the Windows Azure Developer Support Team
  3. Troubleshooting Scenario 1 – Role Recycling
    1. Using Task Manager to determine which process is failing and which log to look at first.
    2. Windows Azure Event Logs
  4. Troubleshooting Scenario 2 – Role Recycling After Running Fine For 2 Weeks
    1. WaHostBootstrapper.log
    2. Failing startup task
    3. OS restarts
  5. Troubleshooting Scenario 3 – Role Stuck in Busy
    1. WaHostBootstrapper.log
    2. Failing startup task
    3. Modifying a running service
  6. Troubleshooting Scenario 5 – Internal Server Error 500 in WebRole
    1. Browse IIS using DIP
  7. Troubleshooting Scenario 6 – Role Recycling After Running For Some Time
    1. Deep dive on WindowsAzureGuestAgent.exe logs (AppAgentRuntime.log and WaAppAgent.log)
    2. DiagnosticStore LocalStorage resource
  8. Troubleshooting Scenario 7 – Role Recycling
    1. Brief look at WaHostBootstrapper and WindowsAzureGuestAgent logs
    2. AzureTools
    3. WinDBG
    4. Intellitrace

 

There is a new short Channel 9 video demonstrating some of these blog file locations and the use of the SDP package at https://channel9.msdn.com/Series/DIY-Windows-Azure-Troubleshooting/Windows-Azure-PaaS-Diagnostics-Data.

 

Diagnostic Data Locations

This list includes the most commonly used data sources used when troubleshooting issues in a PaaS VM, roughly ordered by importance (ie. the frequency of using the log to diagnose issues).

  • Windows Azure Event Logs – Event Viewer –> Applications and Services Logs –> Windows Azure
    • Contains key diagnostic output from the Windows Azure Runtime, including information such as Role starts/stops, startup tasks, OnStart start and stop, OnRun start, crashes, recycles, etc. 
    • This log is often overlooked because it is under the “Applications and Services Logs” folder in Event Viewer and thus not as visible as the standard Application or System event logs. 
    • This one diagnostic source will help you identify the cause of several of the most common issues with Azure roles failing to start correctly – startup task failures, and crashing in OnStart or OnRun.
    • Captures crashes, with callstacks, in the Azure runtime host processes that run your role entrypoint code (ie. WebRole.cs or WorkerRole.cs).
  • Application Event Logs – Event Viewer –> Windows Logs –> Application
    • This is standard troubleshooting for both Azure and on-premise servers.  You will often find w3wp.exe related errors in these logs.
  • App Agent Runtime Logs – C:\Logs\AppAgentRuntime.log
    • These logs are written by WindowsAzureGuestAgent.exe and contain information about events happening within the guest agent and the VM.  This includes information such as firewall configuration, role state changes, recycles, reboots, health status changes, role stops/starts, certificate configuration, etc.
    • This log is useful to get a quick overview of the events happening over time to a role since it logs major changes to the role without logging heartbeats.
    • If the guest agent is not able to start the role correctly (ie. a locked file preventing directory cleanup) then you will see it in this log.
  • App Agent Heartbeat Logs – C:\Logs\WaAppAgent.log
    • These logs are written by WindowsAzureGuestAgent.exe and contain information about the status of the health probes to the host bootstrapper. 
    • The guest agent process is responsible for reporting health status (ie. Ready, Busy, etc) back to the fabric, so the health status as reported by these logs is the same status that you will see in the Management Portal.
    • These logs are typically useful for determining what is the current state of the role within the VM, as well as determining what the state was at some time in the past.  With a problem description like “My website was down from 10:00am to 11:30am yesterday”, these heartbeat logs are very useful to determine what the health status of the role was during that time.
  • Host Bootstrapper Logs – C:\Resources\WaHostBootstrapper.log
    • This log contains entries for startup tasks (including plugins such as Caching or RDP) and health probes to the host process running your role entrypoint code (ie. WebRole.cs code running in WaIISHost.exe).
    • A new log file is generated each time the host bootstrapper is restarted (ie. each time your role is recycled due to a crash, recycle, VM restart, upgrade, etc) which makes these logs easy to use to determine how often or when your role recycled.
  • IIS Logs – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\LogFiles\Web
    • This is standard troubleshooting for both Azure and on-premise servers. 
    • One key problem scenario where these logs are often overlooked is the scenario of “My website was down from 10:00am to 11:30am yesterday”.  The natural tendency is to blame Azure for the outage (“My site has been working fine for 2 weeks, so it must be a problem with Azure!”), but the IIS logs will often indicate otherwise.  You may find increased response times immediately prior to the outage, or non-success status codes being returned from IIS, which would indicate a problem within the website itself (ie. in the ASP.NET code running in w3wp.exe) rather than an Azure issue.
  • Performance Counters – perfmon, or Windows Azure Diagnostics
    • This is standard troubleshooting for both Azure and on-premise servers.
    • The interesting aspect of these logs in Azure is that, assuming you have set up WAD ahead of time (a minimal configuration sketch is shown after this list), you will often have valuable performance counters to troubleshoot problems which occurred in the past (ie. “My website was down from 10:00am to 11:30am yesterday”).
    • Other than scenarios where you are gathering specific performance counters for a known problem, the most common use for the counters gathered by WAD is to look for regular performance counter entries, then a period of no entries, then a resumption of the regular entries (indicating a scenario where the VM was potentially not running), or 100% CPU (usually indicating an infinite loop or some other logic problem in the website code itself).
  • HTTP.SYS Logs – D:\Windows\System32\LogFiles\HTTPERR
    • This is standard troubleshooting for both Azure and on-premise servers. 
    • Similar to the IIS Logs, these are often overlooked but very important when trying to troubleshoot an issue with a hosted service website not responding.  Often times it can be the result of IIS not being able to process the volume of requests coming in, the evidence of which will usually show up in the HTTP.SYS logs.
  • IIS Failed Request Log Files – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\FailedReqLogFiles
    • This is standard troubleshooting for both Azure and on-premise servers. 
    • This is not turned on by default in Windows Azure and is not frequently used.  But if you are troubleshooting IIS/ASP.NET specific issues you should consider turning FREB tracing on in order to get additional details.
  • Windows Azure Diagnostics Tables and Configuration – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\Monitor
    • This is the local on-VM cache of the Windows Azure Diagnostics (WAD) data.  WAD captures the data as you have configured it, stores it in custom .TSF files on the VM, then transfers it to storage based on the scheduled transfer period you have specified.
    • Unfortunately, because they are in a custom .TSF format, the contents of the WAD data are of limited use; however, you can see the diagnostics configuration files, which are useful for troubleshooting issues when Windows Azure Diagnostics itself is not working correctly.  Look in the Configuration folder for a file called config.xml which will include the configuration data for WAD.  If WAD is not working correctly you should check this file to make sure it reflects the way that you are expecting WAD to be configured.
  • Windows Azure Caching Log Files – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\AzureCaching
    • These logs contain detailed information about Windows Azure role-based caching and can help troubleshoot issues where caching is not working as expected.
  • WaIISHost Logs – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\WaIISHost.log
    • This contains logs from the WaIISHost.exe process which is where your role entrypoint code (ie. WebRole.cs) runs for WebRoles.  The majority of this information is also included in other logs covered above (ie. the Windows Azure Event Logs), but you may occasionally find additional useful information here.
  • IISConfigurator Logs – C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\IISConfigurator.log
    • This contains information about the IISConfigurator process which is used to do the actual IIS configuration of your website per the model you have defined in the service definition files. 
    • This process rarely fails or encounters errors, but if IIS/w3wp.exe does not seem to be setup correctly for your service then this log is the place to check.
  • Role Configuration Files – C:\Config\{DeploymentID}.{DeploymentID}.{Rolename}.{Version}.xml
    • This contains information about the configuration for your role such as settings defined in the ServiceConfiguration.cscfg file, LocalResource directories, DIP and VIP IP addresses and ports, certificate thumbprints, Load Balancer Probes, other instances, etc.
    • Similar to the Role Model Definition File, this is not a log file which contains runtime generated information, but can be useful to ensure that your service is being configured in the way that you are expecting.
  • Role Model Definition File – E:\RoleModel.xml (or F:\RoleModel.xml)
    • This contains information about how your service is defined according to the Azure Runtime, in particular it contains entries for every startup task and how the startup task will be run (ie. background, environment variables, location, etc).  You will also be able to see how your <sites> element is defined for a web role. 
    • This is not a log file which contains runtime generated information, but it will help you validate that Azure is running your service as you are expecting it to.  This is often helpful when a developer has a particular version of a service definition on his development machine, but the build/package server is using a different version of the service definition files.

 

* A note about ETL files

If you look in the C:\Logs folder you will find RuntimeEvents_{iteration}.etl and WaAppAgent_{iteration}.etl files.  These are ETW traces which contain a compilation of the information found in the Windows Azure Event Logs, Guest Agent Logs, and other logs.  This is a very convenient compilation of all of the most important log data in an Azure VM, but because they are in ETL format it requires a few extra steps to consume the information.  If you have a favorite ETW viewing tool then you can ignore several of the above mentioned log files and just look at the information in these two ETL files.

 

 

Gathering The Log Files For Offline Analysis and Preservation

In most circumstances you can analyze all of the log files while you are RDPed onto the VM doing a live troubleshooting session, and you aren’t concerned with gathering all of the log files into one central location.  However, there are several scenarios in which you want to easily gather all of the log files and save them off of the VM, either for analysis by someone else or to preserve them for later analysis so that you can redeploy your hosted service and restore your application’s functionality. 

There are three ways to quickly gather diagnostics logs from a PaaS VM:

  1. The easiest way is to simply RDP to the VM and run CollectGuestLogs.exe.  CollectGuestLogs.exe ships with the Azure Guest Agent which is present on all PaaS VMs and most IaaS VMs and it will create a ZIP file of the logs from the VM.  For PaaS VMs, this file is located at D:\Packages\GuestAgent\CollectGuestLogs.exe.  For IaaS VMs this file is located at C:\WindowsAzure\Packages\CollectGuestLogs.exe.  Note that CollectGuestLogs collects IIS Logs by default, which can be quite large for long running web roles.  To prevent IIS log collection run “CollectGuestLogs.exe -Mode:ga”, or run “CollectGuestLogs.exe -?” for more information.
  2. If you want to collect logs without having to RDP to the VM, or collect logs from multiple VMs at once, you can run the Azure Log Collector Extension from your local dev machine.  For more information see http://azure.microsoft.com/blog/2015/03/09/simplifying-virtual-machine-troubleshooting-using-azure-log-collector/.
  3. The older way (before CollectGuestLogs existed) is to use the SDP package created by the Azure support teams.  See below for instructions on using this package.

 

 

Using the older SDP package

The Windows Azure Developer support team has created an SDP (Support Diagnostics Platform) script which automatically gathers all of the above information into a .CAB file which allows for easy transfer of the necessary logs to the support professional for analysis during the course of working on a support incident.  This same SDP package is also available outside of a normal support incident via the following URL:

* You can find more information about the SDP package at http://support.microsoft.com/kb/2772488

 

Obtaining the SDP Package for Windows Azure Guest OS Family 2 & 3

Obtaining and running the SDP Package is easy in Windows Server 2008 R2 or 2012.

  1. RDP to the Azure VM
  2. Open Powershell
  3. Copy/Paste and Run the following script

md c:\Diagnostics; md $env:LocalAppData\ElevatedDiagnostics\1239425890; Import-Module bitstransfer; explorer $env:LocalAppData\ElevatedDiagnostics\1239425890; Start-BitsTransfer http://dsazure.blob.core.windows.net/azuretools/AzurePaaSLogs_global-Windows2008R2_Later.DiagCab c:\Diagnostics\AzurePaaSLogs_global-Windows2008R2_Later.DiagCab; c:\Diagnostics\AzurePaaSLogs_global-Windows2008R2_Later.DiagCab

 

This script will do the following:

    • Create a folder C:\Diagnostics
    • Load the BitsTransfer module and download the SDP package to C:\Diagnostics\AzurePaaSLogs.DiagCab
    • Launch AzurePaaSLogs.DiagCab
    • Open Windows Explorer to the folder which will contain the CAB file after the package has completed running.

 

Obtaining the SDP Package for Windows Azure Guest OS Family 1

If you are on Windows Azure Guest OS Family 1, or you prefer not to directly download and run the file from within your Azure VM, you can download the appropriate file to an on-prem machine and then copy it to the Azure VM when you are ready to run it (standard Copy/Paste [ctrl+c, ctrl+v] the file between your machine and the Azure VM in the RDP session).

 

 

Using the SDP Package

The SDP package will present you with a standard wizard with Next/Cancel buttons. 

  1. The first screen has an ‘Advanced’ option which allows you to check or uncheck the ‘Apply repairs automatically’ option.  This option has no impact on this SDP package since this package only gathers data and does not make any changes or repairs to the Azure VM.  Click Next. 
  2. The next screen gives you the option for ‘Express [Recommended]’ or ‘Custom’.  If you choose Custom you will have the option to gather WAD (*.tsf) files and IIS log files.  You will typically not gather the WAD *.tsf files since you have no way to analyze the custom format, but depending on the issue you are troubleshooting you may want to gather the IIS log files.  Click Next.
  3. The SDP Package will now gather all of the files into a .CAB file.  Depending on how long the VM has been running and how much data is in the various log files this process could take several minutes.
  4. Once the files have been gathered you will be presented with a screen showing any common problems that have been detected.  You can click on the ‘Show Additional Information’ button to see a full report with additional details about any detected common problems.  This same report is also available within the .CAB file itself so you can view it later.  Click ‘Close’.
  5. At this point the CAB file will be available in the %LocalAppData%\ElevatedDiagnostics folder.  You will notice a ‘latest.cab’ file along with one or more subfolders.  The subfolders contain the timestamped results from every time you run the SDP package, while the latest.cab file contains the results from the most recent execution of the SDP package.
  6. You can now copy the latest.cab file from the VM to your on-premise machine for preservation or off-line analysis.

 

image

 

image

 

image

 

image


Troubleshooting Scenario 1 – Role Recycling


Continuing from the diagnostic information at Windows Azure PaaS Compute Diagnostics Data, this blog post will describe how to troubleshoot a role that fails to start.  This particular scenario is fairly easy to troubleshoot, but serves well to show the beginning steps in troubleshooting in Azure.  Future scenarios will be more complex and build on these troubleshooting techniques.

 

 

If you have been developing in Azure for very long you have probably had a deployment that seemed to work fine on your local development machine, but after deploying to Azure it recycled (cycling between the Starting, Busy, Recycling, etc states).

image

 

 

RDP To The VM

The first step to troubleshoot a recycling role is to RDP to the VM hosting that instance.  If you have already turned on RDP prior to deploying the solution then you can just click the Connect button at the bottom of the Management Portal screen.  If you have not, then switch to the Configure tab and click Remote to turn on RDP access to the VM.  Switch back to the Instances tab and select the instance that is recycling, and after a few minutes RDP access will be enabled and the Connect button will become available.

 

 

Get the Big Picture

I usually start out by trying to figure out roughly where in the role startup process the failure is occurring.  The quick and easy way to do this is to reference the diagram at Windows Azure Role Architecture and compare that to what processes are running in Task Manager.  Open Task Manager, switch to the Details tab, and sort descending by Process name.  Assuming this is a webrole, then per the role architecture diagram I would expect to see WaAppAgent.exe, WindowsAzureGuestAgent.exe, WaHostBootstrapper.exe, WaIISHost.exe, RemoteAccessAgent.exe, RemoteForwarderAgent.exe, and maybe w3wp.exe (depending on if any HTTP requests have come in). 

 

If I don’t see one or more of those processes running then I will watch task manager for a minute or two to see which processes startup.  I would always expect WaAppAgent and WindowsAzureGuestAgent to be running since those are Azure owned processes, and if they aren’t running then something pretty significant is wrong.  But beyond the guest agent processes, I am looking to see which of these processes is the last to startup:

  • WindowsAzureGuestAgent – If this and WaAppAgent are the only processes running, and I never see WaHostBootstrapper startup, then there must be a failure in the guest agent.  Per the role architecture blog post we know that the guest agent is responsible for setting up things like the firewall, LocalStorage resources, etc.  A common error here is when the guest agent is trying to delete LocalStorage resources (when CleanOnRoleRecycle=true) but fails due to a file lock.  The guest agent logs are a good place to start troubleshooting.
  • WaHostBootstrapper – We know that WaHostBootstrapper is responsible for startup tasks, so if this process starts but you don’t see the WaIISHost (or WaWorkerHost) processes start, then it is most likely a startup task that is failing.  WaHostBootstrapper logs are a good place to start troubleshooting.
  • WaIISHost/WaWorkerHost – The role host process runs your role entrypoint code (WebRole.cs or WorkerRole.cs) so if we see the role host process start and then exit (ie. a recycling role) then we can be fairly certain that there is a bug in that code throwing an exception, or perhaps a missing dependency causing the process to fail on startup.  The Azure and Application event logs are a good place to start troubleshooting.

 

In this particular example when I opened task manager I saw the following:

image

Notice that only WindowsAzureGuestAgent.exe is running.  At this point I will usually watch for a minute or two to see what else happens.  A few seconds later I saw the WaHostBootstrapper process, and then within a few seconds I saw the WaIISHost process.  Then a few seconds after that the WaHostBootstrapper and WaIISHost processes were gone and the only thing left was WindowsAzureGuestAgent.exe.

image

At this point I can be fairly certain that something within my role entry point code is throwing an exception and causing WaIISHost.exe to crash.

 

 

Check the Logs

Now that I know roughly where to begin my investigation I can start looking at specific logs for error messages.  Given the information from Windows Azure PaaS Compute Diagnostics Data I know which logs should be of interest in this scenario and I will start with the Windows Azure Event Logs.

Right away you can tell from this log exactly what the problem is and where the error is coming from:

image

I can see an Error from the Azure Runtime, with a System.Exception being thrown from my CrashInOnStart.WebRole.OnStart() code.  At this point I would know to go review that code to see how it could be throwing that type of exception, which would hopefully lead me to the root cause.  If you don’t immediately know what the root cause is then you need to do basic troubleshooting just like you would do on-prem – add tracing code, use Intellitrace, attach a debugger, etc.
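If the event log entry alone doesn’t point you to the root cause, one quick option is to temporarily wrap your OnStart code so the full exception gets traced before the process exits.  This is just a sketch (the class name here matches the sample above, but the idea applies to any role entry point):

using System;
using System.Diagnostics;
using Microsoft.WindowsAzure.ServiceRuntime;

public class WebRole : RoleEntryPoint
{
    public override bool OnStart()
    {
        try
        {
            // ... existing initialization code that is suspected of throwing ...
            return base.OnStart();
        }
        catch (Exception ex)
        {
            // Surfaces the full exception and call stack in whatever trace
            // listeners are configured (DebugView, WAD, etc.).
            Trace.TraceError("OnStart failed: {0}", ex);
            throw;
        }
    }
}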

AzureTools – The Diagnostic Utility used by the Windows Azure Developer Support Team


 

Getting AzureTools

Direct Download Link – http://dsazure.blob.core.windows.net/azuretools/AzureTools.exe

   — or —

Downloading from within an Azure VM (*note, this only works on Guest OS Family 2 or later.  Guest OS Family 1 includes Powershell v1 which does not support Import-Module or BitsTransfer.  For Guest OS Family 1 use the direct download link above)

  1. RDP to the Azure VM
  2. Open Powershell
  3. Copy/Paste and Run the following script

md c:\tools; Import-Module bitstransfer; Start-BitsTransfer http://dsazure.blob.core.windows.net/azuretools/AzureTools.exe c:\tools\AzureTools.exe; c:\tools\AzureTools.exe

  

The Windows Azure Developer Support team has engineers all over the world who spend significant amounts of time troubleshooting a wide range of issues within Windows Azure PaaS Virtual Machines.  Over the last few years this team has developed an in-house utility which provides quick and easy access to the most common troubleshooting tasks and tools.  In an effort to provide a better experience for our customers this utility is being released publicly and this blog post will outline how to get the tool and how to use it.  The troubleshooting series at http://blogs.msdn.com/b/kwill/archive/2013/08/09/windows-azure-paas-compute-diagnostics-data.aspx will make use of this tool and show different scenarios for how it can be used.

This utility is constantly being developed and improved so keep an eye on this blog post for updates on new features.  The utility has an auto-update feature so you will be notified when a new version is available.  Also, if you have any suggestions for new features please add a comment.

AzureTools is primarily intended to be run from within an RDP session in an Azure PaaS Virtual Machine, but there are some useful functions that can also be used on any machine outside of the Azure environment.

 

<Update September 5, 2013>

Version 1.5.0.0

Added following tools:

  • PerfView
  • Remote Tools for Visual Studio 2012 Update 2
  • Table2csv

Adding the following to Utils tab:

  • Table2csv

Misc Bug Fixes

</Update September 5, 2013>

<Update October 29, 2013>

Version 1.6.0.0

Added following tools:

  • DebugDiag v2 (replaces DebugDiag v1.2)

Adding the following to Misc Tools:

  • Storage REST API
  • Windows Azure Service Management REST API
  • String Encoding

Misc Bug Fixes

</Update October 29, 2013>

 

Downloadable Tools

The primary use of AzureTools is to quickly and easily get the most common troubleshooting tools (ie. Netmon, Sysinternals Suite, etc) onto an Azure VM.  The IE enhanced internet security configuration on Azure VMs makes it cumbersome to search for and download tools from the internet, but with AzureTools you can easily double-click on a tool and have it automatically download and extract/install.  Several of these tools are also useful on your on-premise machines.

image

Tools List

DebugDiagx64 – The Debug Diagnostic Tool (DebugDiag) is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or fragmentation, and crashes in any user-mode process. The tool includes additional debugging scripts focused on Internet Information Services (IIS) applications, web data access components, COM+ and related Microsoft technologies.  This is used frequently in Azure to setup monitoring to capture a problem that happens infrequently.

DebugView – DebugView is an application that lets you monitor debug output on your local system, or any computer on the network that you can reach via TCP/IP. It is capable of displaying both kernel-mode and Win32 debug output, so you don’t need a debugger to catch the debug output your applications or device drivers generate, nor do you need to modify your applications or drivers to use non-standard debug output APIs.  This is used in Azure to monitor the heartbeat output from WindowsAzureGuestAgent in real time.  To enable this go to Capture –> Capture Global Win32.

Fiddler4Setup – Fiddler is used to capture, analyze, and manipulate HTTP traffic.  This is a vital tool when troubleshooting any web based application and is frequently used in Azure to troubleshoot webroles, Azure Storage, etc.

ILSpy_Master_2.1.0.1603_RTW_Binaries – ILSpy is the open-source .NET assembly browser and decompiler.  This allows you to disassemble any .NET assembly and is a quick and easy way to see what a binary on an Azure VM is doing and validate that it is doing what you expect it to do.

PerfView.zip – PerfView is a performance-analysis tool that helps isolate CPU- and memory-related performance issues.

NM34_x64 – Network Monitor (Netmon) is used to capture and analyze network traffic.  If you are using Azure you are almost certainly building an application which communicates over the network, and Netmon is vital for troubleshooting the communication between your Azure VM and an external resource.

Psscor4 – Psscor4 is a Windows Debugger extension used to debug .NET Framework 4 applications.  If you are using WinDBG to debug a .NET application or dump file then psscor4 is a key debugger extension.

SysinternalsSuite – The Sysinternals Troubleshooting Utilities have been rolled up into a single Suite of tools. This file contains the individual troubleshooting tools and help files. It does not contain non-troubleshooting tools like the BSOD Screen Saver or NotMyFault.  The most common utilities to use in an Azure VM are Process Explorer and Process Monitor.

X64 Debuggers And Tools-x64_en-us – Use Debugging Tools for Windows to debug drivers, applications, and services on Windows systems. Debugging Tools for Windows includes a core debugging engine and several tools that provide interfaces to the debugging engine. A Visual Studio extension provides a graphical user interface, as does Windows Debugger (WinDbg). Console Debugger (CDB), NT Symbolic Debugger (NTSD), and Kernel Debugger (KD) provide command line user interfaces.  WinDBG is the debugger of choice for most Microsoft support teams and most hard core debuggers and is frequently used in Azure to do live debugging since it is much easier to get onto an Azure VM than a debugger such as Visual Studio.

rtools_setup_x64 – Remote Tools for Visual Studio 2012 enables remote debugging, remote testing, and performance unit testing on computers that don’t have Visual Studio installed.

Table2csv – Use this tool from the Utils tab of AzureTools.  Converts WAD TSF Files into a CSV file.  See below under Utils for more information.

 

File Transfer

There are several ways to get a file onto and off of an Azure VM while you are troubleshooting, with the easiest being to do a simple Copy/Paste via RDP.  However, for very large files this becomes very time consuming and error prone.  You could also download one of many free storage explorer tools and configure it, but if you only have one or two files to transfer (ie. a .dmp file you want to analyze later, or a .CAB from the Azure PaaS SDP package) then you can spend more time setting up the storage explorer tool than actually transferring the files.  AzureTools provides data transfer functionality that uploads and downloads files to and from blob storage, which makes getting files onto or off of an Azure VM quick and easy. 
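If you prefer to script the transfer yourself rather than use AzureTools, a few lines against blob storage will do the same job.  This is a rough sketch assuming the .NET storage client library (2.x era names) and placeholder connection string, container, and file names:

using System.IO;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

// Upload a file (ie. a .dmp you want to analyze later) to a blob container.
CloudStorageAccount account = CloudStorageAccount.Parse("<your storage connection string>");
CloudBlobContainer container = account.CreateCloudBlobClient().GetContainerReference("transfers");
container.CreateIfNotExists();

CloudBlockBlob blob = container.GetBlockBlobReference("w3wp.dmp");
using (FileStream file = File.OpenRead(@"C:\temp\w3wp.dmp"))
{
    blob.UploadFromStream(file);
}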

 

 image

Utilities

When troubleshooting on an Azure VM you can find yourself repeating the same types of tasks on every VM, such as turning on Fusion Logging or showing File Extensions in Windows Explorer.  There are also some tasks that are much easier to achieve programmatically than manually, such as pinging the input endpoints of every role instance in the hosted service to identify which one of them might be encountering issues. 

image

Misc Tools – Miscellaneous tools that require more UI space than can fit here (yes, there are probably better ways to organize this, but I am not a UI designer)…

Set Explorer Options – Change the Windows Explorer options to show Hidden Files and show File Extensions.

Auto Gather Logs – This was the predecessor to the Azure PaaS SDP Package which was recently introduced.  It will gather all of the most commonly used log files into a ZIP file which can then be copied off of the Azure VM for data retention or offline analysis.  This is especially useful if your production system is down and you need to get it back up and running quickly and you want to grab the log files before rebooting the VM or taking any other remediation steps.

Attach Debugger – This is arguably one of the most useful utilities in AzureTools.  This will let you attach a debugger to a process which fails and exits immediately on startup.  The most common scenario is a role which is recycling and you want to attach WinDBG to the WaIISHost/WaWorkerHost process, but it crashes too quickly for you to manually attach the debugger.  In Azure the normal trick of setting the Image File Execution Options debugger registry key doesn’t work and this Attach Debugger utility is the only consistent way I have found to attach a debugger.  To use, first download a debugger (ie. double-click the X64 Debuggers And Tools-x64_en-us tool from the Tools tab), enter the process name you are interested in, click Attach Debugger, and wait for the Azure guest agent to automatically start that process again.

Check Input Endpoints – This will send an HTTP GET to every endpoint of every role and instance in the hosted service that the VM you are RDPed into belongs to.  Because Azure VMs with input endpoints sit behind a round robin load balancer, if one VM is misbehaving then the problem can look very random from the outside world (ie. 5% of requests from my monitoring site are failing).  If you only have a small number of instances then it is not too difficult to RDP to each one and see if it is working correctly, but this technique quickly becomes infeasible as the instance count grows.  To identify if it is a specific VM which is having problems you can RDP to any VM in that hosted service, run AzureTools, click on Check Input Endpoints and see if a failure is returned from any of them.
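For reference, the same idea can be sketched in a few lines of role code using the ServiceRuntime API.  Note that this is only an illustration: it probes every endpoint it can see (non-HTTP endpoints will simply show up as failures), and instances of other roles are only visible if those roles define at least one internal endpoint:

using System;
using System.Net;
using Microsoft.WindowsAzure.ServiceRuntime;

// Probe every instance endpoint in the deployment with an HTTP GET.
foreach (Role role in RoleEnvironment.Roles.Values)
{
    foreach (RoleInstance instance in role.Instances)
    {
        foreach (RoleInstanceEndpoint endpoint in instance.InstanceEndpoints.Values)
        {
            string url = string.Format("http://{0}/", endpoint.IPEndpoint);
            try
            {
                using (HttpWebResponse response = (HttpWebResponse)WebRequest.Create(url).GetResponse())
                {
                    Console.WriteLine("{0} {1} -> {2}", instance.Id, url, response.StatusCode);
                }
            }
            catch (WebException ex)
            {
                Console.WriteLine("{0} {1} -> FAILED: {2}", instance.Id, url, ex.Message);
            }
        }
    }
}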

Set Busy – This will set the VM you are currently RDPed into to Busy which will remove it from the load balancer rotation.  This is useful in a couple scenarios: 1) You are troubleshooting a production service and you don’t want your debugging (ie. breaking into w3wp) to interfere with live traffic to your site, or 2) You want to isolate this VM in order to troubleshoot it without the extra noise of additional unexpected incoming traffic.
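If you want the same behavior from your own role code rather than from AzureTools, the RoleEnvironment.StatusCheck event is the place to do it.  A minimal sketch, where the flag is just a placeholder for whatever condition you care about:

using Microsoft.WindowsAzure.ServiceRuntime;

public class WorkerRole : RoleEntryPoint
{
    // Placeholder flag: set it from your own logic when you want the
    // instance taken out of the load balancer rotation.
    private static volatile bool takeOutOfRotation = false;

    public override bool OnStart()
    {
        RoleEnvironment.StatusCheck += (sender, e) =>
        {
            if (takeOutOfRotation)
            {
                // Reports Busy to the fabric for this status check interval.
                e.SetBusy();
            }
        };
        return base.OnStart();
    }
}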

Open Log Files – This will open the current log file for all of the log files most commonly used when troubleshooting on an Azure VM (see Windows Azure PaaS Compute Diagnostics Data).  This is useful when you are troubleshooting an issue and you don’t quite know where to begin looking so you want to see all of the data without having to go searching for all of it.

Fusion Logging – This turns on or off .NET Fusion Logging verbose output by setting keys under HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Fusion [ForceLog=1, LogFailures=1, LogResourceBinds=1, LogPath=<AzureTools startup path>].
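Those same registry values can also be set from code or a script if you find yourself toggling Fusion logging repeatedly.  A small sketch using the Win32 registry API (the log path here is just an example):

using Microsoft.Win32;

// Requires elevation.  The LogPath folder must exist, and the affected
// process (ie. w3wp.exe) typically needs to be restarted to pick up the change.
const string fusionKey = @"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Fusion";
Registry.SetValue(fusionKey, "ForceLog", 1, RegistryValueKind.DWord);
Registry.SetValue(fusionKey, "LogFailures", 1, RegistryValueKind.DWord);
Registry.SetValue(fusionKey, "LogResourceBinds", 1, RegistryValueKind.DWord);
Registry.SetValue(fusionKey, "LogPath", @"C:\FusionLogs\", RegistryValueKind.String);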

Table2csv – Convert Windows Azure Diagnostics TSF files into CSV files.  WAD TSF files are the locally cached binary files on an Azure VM where WAD stores diagnostic data prior to sending it to Azure storage.  These files can be found at C:\Resources\Directory\{DeploymentID}.{Rolename}.DiagnosticStore\Monitor\Tables.

 

Misc Tools

This is a collection of miscellaneous utilities that required more UI space than a simple button/link.

Time Zone Conversion

Use this to quickly convert between different time zones, UTC ticks, and format strings.  Because everything in Azure is UTC this functionality is very handy when converting between your local time zone and a time that you see in an Azure log file.  The UTC Ticks conversion is handy when searching Windows Azure Diagnostics log tables since they are partitioned based on tick count.
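If you don’t have AzureTools handy, the same conversions are one-liners in C#.  A quick sketch (the timestamp is just an example):

using System;

// Local log timestamp -> UTC (everything in Azure is UTC)
DateTime localTime = DateTime.Parse("2013-08-26 13:48:30");
DateTime utcTime = localTime.ToUniversalTime();

// UTC time -> tick count, and back again
long ticks = utcTime.Ticks;
DateTime fromTicks = new DateTime(ticks, DateTimeKind.Utc);

Console.WriteLine("{0:u} = {1} ticks", utcTime, ticks);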

image

Build RDP File

This will build an RDP file to a specific VM in an Azure deployment.  This is handy if you don’t have access to the portal, but you know the URL and role name of the VM that you want to RDP to.

image

Shared Access Signatures

Still under development for Table and Queue resources.  This lets you easily build one-off shared access signatures to blob resources without having to write a bunch of code just to generate a SAS.

image

Storage REST API

This will let you build a REST request to the storage API and specify the headers and body.  This is particularly useful if you want to test different x-ms-versions, troubleshoot storage calls, or test the REST APIs.

image

Service Management REST API

Similar to the Storage REST API above, but this targets the service management API.

image

String Encoding

Encode and Decode strings between Base64, URLEncoding, and URLPathEncoding.  Useful when making Service Management API calls as most of the APIs require input in the form of Base64 encoded strings.

image

Troubleshooting Scenario 2 – Role Recycling After Running Fine For 2 Weeks


 

Continuing from the diagnostic information at Windows Azure PaaS Compute Diagnostics Data, this blog post will describe how to troubleshoot a role that begins recycling after it has been running fine for some period of time.  This is a fairly common scenario and the first instinct is to naturally blame Azure/Microsoft (“Hey!  This thing was running fine for a long time so it must be a problem with your datacenter!”). 

 

If this is your first exposure to this troubleshooting scenario series I would recommend starting with Windows Azure PaaS Compute Diagnostics Data which will walk through the diagnostic data locations and link to the earlier troubleshooting scenarios which lay out some fundamental troubleshooting techniques.

 

 

 

Symptom

You have deployed a hosted service to Azure and it is working great.  A couple weeks go by and you start getting alerts from your monitoring service that your Azure hosted site is down.  You go to the portal and you see your instances in a Recycling state (cycling between the Starting, Busy, Recycling, Initializing, etc states). 

image

Immediately assuming it must be a Microsoft problem, because you haven’t changed anything, you check the Service Health Dashboard and see that everything is operating normally.  Time to start troubleshooting.

 

 

 

Get the Big Picture

Just like in Scenario 1 we want to RDP onto the Azure VM and get a high level overview of what is happening on the VM.  Similar to Scenario 1 we start off only seeing WindowsAzureGuestAgent running:

image

 

Then after a few seconds we see WaHostBootstrapper, but unlike Scenario 1 we do not see WaIISHost start, and then a few seconds later WaHostBootstrapper goes away.

image

As we know from scenario 1, if we only see WaHostBootstrapper then it is probably a startup task that is failing.

 

 

 

Check the Logs

According to Windows Azure PaaS Compute Diagnostics Data, WaHostBootstrapper logs are in C:\Resources\WaHostBootstrapper.log.  A new host bootstrapper log is written every time the WaHostBootstrapper process is started and the old one is saved as WaHostBootstrapper.log.old.<index> (up to a maximum of 15 old logs), and looking in the C:\Resources folder I can see 16 log files which indicates several restarts of WaHostBootstrapper (ie. a recycling role).  The current log file (WaHostBootstrapper.log) will probably only be partially written because it is the host bootstrapper that is failing and exiting, so I am more interested in one of the older log files which will show the last log entry written before the process exited.  Opening WaHostBootstrapper.log.old.001 we see the following:

 

WaHostBootstrapper log:

These first several entries are standard and denote the host bootstrapper startup:

[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapXmlReadRoleModel=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapXmlReadContainerId=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapGetVirtualAccountName=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapGetAppCmdPath=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapSetDefaultEnvironment=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapGetAppHostConfigPath=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- GetDebugger=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- GetStartupTaskDebugger=0x1
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:03.620, ERROR] <- WapGetEnvironmentVariable=0x800700cb

 

You will then see several startup tasks executing: IISConfigurator (if running a web role), modules defined in <Import> elements in the csdef file, and your own startup tasks.  Note that the 0x800700cb returns from WapGetEnvironmentVariable are normal and can be ignored*.  Startup tasks with “type=0” are simple tasks, which block host bootstrapper progression until they exit, while startup tasks with “type=2” are background tasks, which the host bootstrapper continues past while they run.

[00002068:00002012, 2013/08/26, 21:21:03.620, INFO ] Executing Startup Task type=0 rolemodule=IISConfigurator cmd=”E:\base\x64\IISConfigurator.exe”
[00002068:00002012, 2013/08/26, 21:21:03.620, INFO ] Executing “E:\base\x64\IISConfigurator.exe” .
[00002068:00002012, 2013/08/26, 21:21:03.683, INFO ] Program “E:\base\x64\IISConfigurator.exe”  exited with 0. Working Directory = E:\base\x64
[00002068:00002012, 2013/08/26, 21:21:03.683, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:03.683, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:03.683, INFO ] Executing Startup Task type=2 rolemodule=Diagnostics cmd=”E:\plugins\Diagnostics\DiagnosticsAgent.exe”
[00002068:00002012, 2013/08/26, 21:21:03.683, INFO ] Executing “E:\plugins\Diagnostics\DiagnosticsAgent.exe” .
[00002068:00002012, 2013/08/26, 21:21:03.745, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:03.745, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:03.745, INFO ] Executing Startup Task type=0 rolemodule=Diagnostics cmd=”E:\plugins\Diagnostics\DiagnosticsAgent.exe” /blockStartup
[00002068:00002012, 2013/08/26, 21:21:03.745, INFO ] Executing “E:\plugins\Diagnostics\DiagnosticsAgent.exe” /blockStartup.
[00002068:00004008, 2013/08/26, 21:21:04.276, INFO ] Registering client with PID 3888.
[00002068:00004008, 2013/08/26, 21:21:04.276, INFO ] Client DiagnosticsAgent.exe (3888) registered.
[00002068:00003276, 2013/08/26, 21:21:04.620, INFO ] Registering client with PID 4076.
[00002068:00003276, 2013/08/26, 21:21:04.620, INFO ] Client DiagnosticsAgent.exe (4076) registered.
[00002068:00002012, 2013/08/26, 21:21:09.825, INFO ] Program “E:\plugins\Diagnostics\DiagnosticsAgent.exe” /blockStartup exited with 0. Working Directory = E:\plugins\Diagnostics
[00002068:00002012, 2013/08/26, 21:21:09.825, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:09.825, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:09.825, INFO ] Executing Startup Task type=2 rolemodule=RemoteAccess cmd=”E:\plugins\RemoteAccess\RemoteAccessAgent.exe”
[00002068:00002012, 2013/08/26, 21:21:09.825, INFO ] Executing “E:\plugins\RemoteAccess\RemoteAccessAgent.exe” .
[00002068:00002012, 2013/08/26, 21:21:09.825, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:09.825, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:09.825, INFO ] Executing Startup Task type=0 rolemodule=RemoteAccess cmd=”E:\plugins\RemoteAccess\RemoteAccessAgent.exe” /blockStartup
[00002068:00002012, 2013/08/26, 21:21:09.825, INFO ] Executing “E:\plugins\RemoteAccess\RemoteAccessAgent.exe” /blockStartup.
[00002068:00003308, 2013/08/26, 21:21:10.185, INFO ] Registering client with PID 3448.
[00002068:00003356, 2013/08/26, 21:21:10.200, INFO ] Registering client with PID 908.
[00002068:00003356, 2013/08/26, 21:21:10.200, INFO ] Client RemoteAccessAgent.exe (908) registered.
[00002068:00003308, 2013/08/26, 21:21:10.232, INFO ] Client RemoteAccessAgent.exe (3448) registered.
[00002068:00002012, 2013/08/26, 21:21:10.575, INFO ] Program “E:\plugins\RemoteAccess\RemoteAccessAgent.exe” /blockStartup exited with 0. Working Directory = E:\plugins\RemoteAccess
[00002068:00002012, 2013/08/26, 21:21:11.122, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:11.122, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:11.122, ERROR] <- WapGetEnvironmentVariable=0x800700cb
[00002068:00002012, 2013/08/26, 21:21:11.122, INFO ] Executing Startup Task type=0 rolemodule=RemoteForwarder cmd=”E:\plugins\RemoteForwarder\install.cmd”
[00002068:00002012, 2013/08/26, 21:21:11.138, INFO ] Executing “E:\plugins\RemoteForwarder\install.cmd” .
[00002068:00002012, 2013/08/26, 21:21:11.388, INFO ] Program “E:\plugins\RemoteForwarder\install.cmd”  exited with 0. Working Directory = E:\plugins\RemoteForwarder

 

And finally we see the last startup task executing before the host bootstrapper process exits:

[00002068:00002012, 2013/08/26, 21:21:11.388, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd=”E:\approot\bin\StartupTasks\Startup.cmd”
[00002068:00002012, 2013/08/26, 21:21:11.388, INFO ] Executing “E:\approot\bin\StartupTasks\Startup.cmd” .
[00002068:00002012, 2013/08/26, 21:21:11.497, INFO ] Program “E:\approot\bin\StartupTasks\Startup.cmd”  exited with 183. Working Directory = E:\approot\bin
[00002068:00002012, 2013/08/26, 21:21:11.497, ERROR] <- WaitForProcess=0x80004005
[00002068:00002012, 2013/08/26, 21:21:11.497, ERROR] <- ExecuteProcessAndWait=0x80004005
[00002068:00002012, 2013/08/26, 21:21:11.497, ERROR] <- WapDoStartup=0x80004005
[00002068:00002012, 2013/08/26, 21:21:11.497, ERROR] <- DoStartup=0x80004005
[00002068:00002012, 2013/08/26, 21:21:11.497, ERROR] <- wmain=0x80004005

 

The interesting thing to note is that the last startup task to run is E:\approot\bin\StartupTasks\Startup.cmd, which is the custom startup task that we have deployed with this service.  WaHostBootstrapper expects all startup tasks to exit with a success (a 0 return code), and considers any non-zero return code to be a failure which causes the host bootstrapper to exit and recycle the role.  We can see in this case that our Startup.cmd startup task is exiting with a 183 exit code which WaHostBootstrapper sees as a fatal error.  Now that we know which process is failing it is time to go find out why.

 

 *Additional WaHostBootstrapper messages which can be ignored:

ERROR] Failed to connect to client RemoteAccessAgent.exe (3120).   {Note that RemoteAccessAgent could be any startup task, and 3120 could be any PID}

ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86E00) =0x800706ba

 

Check the startup task

First let’s browse to the startup task and open it in Notepad to see what it is doing:

Startup.cmd

%windir%\system32\inetsrv\appcmd set config -section:system.webServer/httpCompression /+"dynamicTypes.[mimeType='application/json; charset=utf-8',enabled='True']" /commit:apphost

 

This startup task is pretty simple (just one line) so we know which command is failing, and at this point we could do some internet searches for ‘appcmd error 183’ or something like that.  But a better way to troubleshoot this, especially if the startup script is more complicated, is to just run it manually from a command prompt and see if we get a better error message.

image

We can clearly see that appcmd is throwing an error because we are trying to add a duplicate collection entry.  The simple way to fix this is to add an ‘exit /b 0’ to the end of our startup script as per the documentation (and all over the web).

 

 

 

 

Why Did This Just Now Start Failing?

We found the root cause and fixed the issue, but the next logical question is why did this problem just now start happening?  My site monitoring alerts tell me that I started getting failures at 08/26/2013 20:48:30 UTC, so let’s see what was going on at that time:

 

From the Role Architecture diagram we know that the guest agent is responsible for running everything in the guest OS, so starting there we look for logs in C:\Logs.

 

We see that prior to my monitoring alerts the instance is in the Ready state, which means that according to the Azure guest agent this instance is working fine and in load balancer rotation.

[00000010] [08/26/2013 20:47:57.45] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 has current state Started, desired state Started, and goal state execution status UpdateSucceeded.
[00000009] [08/26/2013 20:47:57.72] [HEART] WindowsAzureGuestAgent Heartbeat.
[00000009] [08/26/2013 20:47:57.72] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is reporting state Ready.
[00000010] [08/26/2013 20:48:02.45] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 has current state Started, desired state Started, and goal state execution status UpdateSucceeded.
[00000009] [08/26/2013 20:48:02.74] [HEART] WindowsAzureGuestAgent Heartbeat.
[00000009] [08/26/2013 20:48:02.74] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is reporting state Ready.
[00000010] [08/26/2013 20:48:07.45] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 has current state Started, desired state Started, and goal state execution status UpdateSucceeded.
[00000007] [08/26/2013 20:48:07.77] [HEART] WindowsAzureGuestAgent Heartbeat.
[00000007] [08/26/2013 20:48:07.77] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is reporting state Ready.
[00000010] [08/26/2013 20:48:12.47] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 has current state Started, desired state Started, and goal state execution status UpdateSucceeded.
[00000009] [08/26/2013 20:48:12.80] [HEART] WindowsAzureGuestAgent Heartbeat.
[00000009] [08/26/2013 20:48:12.80] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is reporting state Ready.
[00000010] [08/26/2013 20:48:17.47] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 has current state Started, desired state Started, and goal state execution status UpdateSucceeded.
[00000007] [08/26/2013 20:48:17.81] [HEART] WindowsAzureGuestAgent Heartbeat.
[00000007] [08/26/2013 20:48:17.81] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is reporting state Ready.
[00000010] [08/26/2013 20:48:22.48] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 has current state Started, desired state Started, and goal state execution status UpdateSucceeded.
[00000009] [08/26/2013 20:48:22.83] [HEART] WindowsAzureGuestAgent Heartbeat.
[00000009] [08/26/2013 20:48:22.83] [INFO]  Role e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is reporting state Ready.

 

Now we see a new Goal state being received, and the goal state is ‘Stopped’.
[00000009] [08/26/2013 20:48:22.84] [INFO]  Goal state 3 received.
[00000009] [08/26/2013 20:48:22.84] [INFO]  Received stop role deadline hint: 300000.
[00000009] [08/26/2013 20:48:22.84] [INFO]  Machine goal state is Stopped. Stopping all roles.

 

The guest agent begins shutting down the VM.
[00000009] [08/26/2013 20:48:22.84] [INFO]  Stopping container.
[00000009] [08/26/2013 20:48:22.84] [INFO]  Execution for container 0db3c7b9-521a-4681-885c-dd363d797e61 is being stopped.
[00000010] [08/26/2013 20:48:22.84] [INFO]  Role State Executor for e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is exiting.
[00000011] [08/26/2013 20:48:22.84] [INFO]  Role State Inquirer for e85c82ee2e5b48f7ac6f88a8848dba16.WebRole1_IN_0 is exiting.
[00000009] [08/26/2013 20:48:44.21] [INFO]  Container stopped.
[00000009] [08/26/2013 20:48:44.24] [INFO]  Initiating system shutdown.
[00000009] [08/26/2013 20:48:44.26] [INFO]  Agent Runtime shutdown.
[00000005] [08/26/2013 20:48:44.69] [INFO]  WindowsAzureGuestAgent stopping.
[00000005] [08/26/2013 20:48:44.69] [INFO]  Stopping state driver.
[00000005] [08/26/2013 20:48:44.69] [INFO]  State driver stopped.
[00000005] [08/26/2013 20:48:44.69] [INFO]  Un-initializing Container State Machine.
[00000005] [08/26/2013 20:48:44.69] [INFO]  Container already stopped.
[00000005] [08/26/2013 20:48:44.69] [INFO]  Agent runtime already uninitialized.
[00000005] [08/26/2013 20:48:44.69] [INFO]  Clearing Deployment Management URIs, if set.
[00000005] [08/26/2013 20:48:44.69] [INFO]  Container State Machine uninitialized.
[00000009] [08/26/2013 20:48:44.69] [INFO]  WindowsAzureGuestAgent stopped successfully.

 

And about a minute later the guest agent is starting up again.
[00000005] [08/26/2013 20:49:15.00] [INFO]  WindowsAzureGuestAgent starting. Version 2.1.0.0

 

The System Event Logs also contain an entry at that time saying the Azure guest agent is initiating a system shut down:

The process D:\Packages\GuestAgent\GuestAgent\WindowsAzureGuestAgent.exe (RD00155D66C9DA) has initiated the shutdown of computer RD00155D66C9DA on behalf of user NT AUTHORITY\SYSTEM for the following reason: Legacy API shutdown
  Reason Code: 0x80070000
  Shutdown Type: shutdown
  Comment:

 

Why is Azure shutting down the VM?  Get a more in depth look from Role Instance Restarts Due to OS Upgrades, along with some tips that would have helped you avoid this service interruption to begin with.

“HTTP 403 Server failed to authenticate the request” When Using Shared Access Signatures


 

One of the more common Azure Storage shared access signature issues I see is “403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.”  The challenge with this error is that it can often feel very random.  When you run your code on some computers it works fine, but on other computers you get a 403.  Or if you return a collection of SAS URLs to a client some of those URLs work fine and others get a 403.

 

The code to generate the SAS is typically very simple:

 1: string sas = azureContainer.GetSharedAccessSignature (new SharedAccessPolicy ()
 2: {
 3:     SharedAccessStartTime =  DateTime.UtcNow,
 4:     SharedAccessExpiryTime = DateTime.UtcNow.AddHours(1),
 5:     Permissions = SharedAccessPermissions.Write | SharedAccessPermissions.Read
 6: });

So how can it behave so randomly?

 

 

 

Troubleshooting

As with most Azure Storage problems, we want to start with Fiddler.  Download and install Fiddler and let it run and capture traffic while you reproduce the problem.  The Raw results will look something like the following:

 

Request

GET http://teststorage.blob.core.windows.net/Images/TestImage.png?st=2013-08-27T10%3A36%3A43Z&se=2013-08-27T11%3A31%3A43Z&sr=b&sp=r&sig=l95QElg18CEa55%2BXuhJIQz56gFFs%1FppYz%2E024uj3aYc%3D HTTP/1.1
Accept: image/png, image/svg+xml, image/*;q=0.8, */*;q=0.5
Accept-Language: en-US
User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0)
Accept-Encoding: gzip, deflate, peerdist
Proxy-Connection: Keep-Alive
Host: teststorage.blob.core.windows.net
X-P2P-PeerDist: Version=1.0

Response

HTTP/1.1 403 Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
Proxy-Connection: Keep-Alive
Connection: Keep-Alive
Content-Length: 422
Via: 1.1 APS-PRXY-10
Date: Tue, 27 Aug 2013 10:36:41 GMT
Content-Type: application/xml
Server: Microsoft-HTTPAPI/2.0
x-ms-request-id: bf0d4d25-0110-4719-945c-afae8cbcdf0b
AuthenticationFailed
Server failed to authenticate the request. Make sure the value of Authorization header is formed correctly including the signature.
RequestId:bf0d4d25-0110-4719-945c-afae8cbcdf0b
Time:2013-08-27T10:36:42.3077388Z
Signature not valid in the specified time frame

 

I have highlighted the key portions of the Fiddler trace.  Notice the error message indicates that the shared access signature’s time frame is not valid. 

Looking at the Request URL and comparing it to the SAS documentation we can see that the start time is set for 10:36:43.  Given the code above we know that this is the value being returned on the client machine when calling DateTime.UtcNow.

Looking at the Response’s Date header we can see that the server side time is 10:36:41, which is 2 seconds earlier than the start time in the Request URL that the client created.  This means that the client machine’s system time is at least 2 seconds ahead of the system time on the front end authentication server handling that particular Azure Storage request.  The Azure Storage authentication server rejects this request because the client is attempting to access the storage resource 2 seconds earlier than the shared access signature allows.  Clock drift like this is not an uncommon scenario.

So depending on which machine is generating the SAS URL, how recently the system time was synchronized, the time delta between the SAS-generating machine and the storage servers, and the speed of the client issuing the SAS request, you will randomly get HTTP 403 responses.

 

 

 

Resolution

As we know from the SAS documentation the Start Time is optional (from MSDN: “Optional. The time at which the shared access signature becomes valid, in an ISO 8061 format. If omitted, start time for this call is assumed to be the time when the storage service receives the request.”).  If you are going to specify DateTime.UtcNow as your start time, then why specify a start time at all?  The only reason you would want to specify a start time is if you were intentionally trying to time-delay a client’s access to a storage resource.

To resolve this issue simply remove the SharedAccessStartTime from the code so that it looks like this:

string sas = azureContainer.GetSharedAccessSignature(new SharedAccessPolicy()
{
    SharedAccessExpiryTime = DateTime.UtcNow.AddHours(1),
    Permissions = SharedAccessPermissions.Write | SharedAccessPermissions.Read
});
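
If you do have a legitimate reason to specify a start time (for example, to intentionally time-delay access), a common alternative is to back-date it by a few minutes so that small amounts of clock drift between the SAS-generating machine and the storage front ends cannot invalidate the signature.  This is only a sketch using the same SharedAccessPolicy as above; the 5 minute window is an arbitrary choice, not a prescribed value:

string sas = azureContainer.GetSharedAccessSignature(new SharedAccessPolicy()
{
    // Back-date the start time to tolerate clock drift between this machine
    // and the Azure Storage front end servers.
    SharedAccessStartTime = DateTime.UtcNow.AddMinutes(-5),
    SharedAccessExpiryTime = DateTime.UtcNow.AddHours(1),
    Permissions = SharedAccessPermissions.Write | SharedAccessPermissions.Read
});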

 

How to Modify a Running Azure Service


 

You have deployed your service to Azure and discover that it isn’t working correctly.  After going through the troubleshooting series you have identified where the problem is and you think you know how to fix the issue (or you want to add more logging to narrow down the problem in your code), but the prospect of having to delete the existing deployment, rebuild the entire solution, redeploy to Azure, and wait for the VMs to start up is not very appealing.  It would be much easier to just put the fix into the current deployment for testing purposes.

 

WARNING: The following is only for troubleshooting and testing purposes.  The nature of stateless Azure cloud services means that any changes you make to an Azure VM could be lost at any time so you should absolutely not make any changes to a production service.  If you use these techniques during your troubleshooting or testing phase you must make sure that you incorporate those changes into a new package and redeploy the new package to Azure.

 

The basic steps to modify a live service are:

  1. RDP to the Azure VM
  2. Copy the changes (ie. the new DLL) into a temporary location such as the C: drive
  3. Terminate the WaHostBootstrapper process from task manager
  4. Wait for a few seconds for the role host process (WaWorkerHost or WaIISHost) to terminate
  5. Copy the changes (new DLL) into the appropriate folder in the E:/F: drive
  6. Wait for the Azure guest agent to restart WaHostBootstrapper
  7. Test the changes

 

 

Details

1. Copy the changes (ie. the new DLL) into a temporary location such as the C: drive

Shortly after the WaHostBootstrapper process is terminated the guest agent will detect this and begin the process of recycling the role, which includes restarting WaHostBootstrapper.  You will typically have less than a minute between terminating WaHostBootstrapper and the guest agent restarting the process.  Because of this you need to be quick in replacing the updated DLL and you typically won’t have time to do a copy/paste across the RDP connection.  This is why you should first copy the DLL to a temporary location on the VM, then have 2 Windows Explorer windows open (or a cmd prompt with the ‘copy’ command ready to go) so you can quickly copy/paste the updated DLL into the E:/F: drive.

 

2. Terminate the WaHostBootstrapper process from task manager

From the Role Architecture diagram we know that the WaHostBootstrapper process is the parent process for your startup tasks and role entry point (WaIISHost/WaWorkerHost).  Terminating the host bootstrapper process will also cause all of its child processes to terminate.  The guest agent does heartbeats and will detect that WaHostBootstrapper is no longer running and begin the process of recycling the role.  It will first start WaHostBootstrapper, and then WaHostBootstrapper will go through the normal role startup process which includes running startup tasks, configuring IIS, and running your role entry point.

 

3. Wait for a few seconds for the role host process (WaWorkerHost or WaIISHost) to terminate

It will take a few seconds for Windows to terminate the child processes of WaHostBootstrapper.  If this does not happen automatically you can manually terminate the role host process.

 

4. Copy the changes (new DLL) into the appropriate folder in the E:/F: drive

Your cspkg contents (role entry point DLL, startup tasks, etc) are extracted to and executed from E:\approot\bin (see http://blogs.msdn.com/b/kwill/archive/2012/10/05/windows-azure-disk-partition-preservation.aspx for scenarios where it will change to the F: drive).  Prior to copying the changed file into the E: drive I would recommend making a backup copy of the file already on the E: drive so you can easily revert back to it as needed.  After you have killed WaHostBootstrapper (and before the new WaHostBootstrapper starts) you need to copy your changes into the appropriate location in the E: drive.  Note that you can change pretty much anything you want – startup tasks, website content, DLLs, etc.

 

5. Wait for the Azure guest agent to restart WaHostBootstrapper

Once the guest agent restarts WaHostBootstrapper your role will start up again and it will load the updated binary that you just copied into the E: drive.  You can now test to make sure that your fixes resolve the problem (or check your logs for additional trace data, etc).

 

 

 

Changing website content or binaries

The above steps are primarily intended to replace files which are being used directly by Azure (ie. startup tasks or role entry point DLLs).  If you just need to replace website content you can simply terminate w3wp.exe and replace the files in E:\sitesroot\0 (if you have multiple <sites> elements in your csdef then you will need to find the correct folder under E:\Sitesroot).  The next time the site is accessed w3wp.exe will start up and load the updated content.

Note that if you are trying to change IIS configuration settings you cannot simply execute iisreset on an Azure VM.  iisreset also stops the WAS and W3SVC services and then restarts the WAS service.  On an Azure VM the W3SVC service is set to Manual startup so it won’t automatically restart.  If you want to do an iisreset on an Azure VM you need to also run ‘net start w3svc’ (or start the World Wide Web Publishing Service from services.msc).

 

 

 

Preventing WaHostBootstrapper from automatically restarting

If you need more time to make your changes before the Azure guest agent automatically restarts WaHostBootstrapper then you can break into WindowsAzureGuestAgent.exe with a debugger.  To do this:

  1. Download AzureTools from http://blogs.msdn.com/b/kwill/archive/2013/08/26/azuretools-the-diagnostic-utility-used-by-the-windows-azure-developer-support-team.aspx
  2. Within AzureTools download the X64 Debuggers And Tools-x64_en-us tool
  3. Launch WinDBG.exe (should be at C:\tools\Debuggers\x64\WinDBG.exe)
  4. Within WinDBG.exe go to File –> Attach to a Process
  5. Select WindowsAzureGuestAgent.exe.  Select No when prompted to save information for workspace.
  6. WindowsAzureGuestAgent is now broken into by the debugger and it will stop performing heartbeats and will not automatically restart WaHostBootstrapper.
  7. Terminate WaHostBootstrapper and follow the steps above to modify the service.  Note that you will need to manually terminate the role host process.
  8. Within WinDBG execute ‘.detach’ (without the quotes).  The WinDBG prompt will change to “NoTarget>”.
  9. Close WinDBG.  WindowsAzureGuestAgent will proceed with heartbeats and automatic recovery of the role.

Caution: Do not remain broken into WindowsAzureGuestAgent for 10 minutes or longer.  The Azure host agent process does regular heartbeats to the guest agent process and if it detects no response for 10 minutes it will reboot the VM.

 

Troubleshooting Scenario 3 – Role Stuck in Busy


Continuing from the troubleshooting series at Windows Azure PaaS Compute Diagnostics Data, this blog post will describe how to troubleshoot a role that is stuck in the Busy state with the error message “Busy (Waiting for role to start… Application startup tasks are running.)”.

 

Symptom

You are deploying a service to Azure and your service never starts correctly and never gets to the Ready state.  It is stuck forever in the Busy state.

image

 

Get the Big Picture

Just like in Scenario 1 we want to RDP onto the Azure VM and get a high level overview of what is happening on the VM.  We see WaHostBootstrapper running, but we do not see WaWorkerHost.exe and watching task manager for a minute or two we don’t see any other changes.  From the notes in Scenario 1 we know that this probably means a problem with a startup task.

image

 

Check the logs

We know from Windows Azure PaaS Compute Diagnostics Data that the WaHostBootstrapper logs are in C:\Resources\WaHostBootstrapper.log.  Checking that folder we see only one WaHostBootstrapper.log file, and no instances of WaHostBootstrapper.log.old.<index>.  This means that WaHostBootstrapper has only tried to start one time and it isn’t recycling due to an error.

WaHostBootstrapper log:

I am going to skip over all of the stuff that we expect to see in every host bootstrapper log (see Scenario 2 for more info) and go straight to the bottom of the log because I am interested in whatever host bootstrapper is doing right now, not what it did in the past.  These are the last two lines in the file:

[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002968:00001680, 2013/09/05, 21:47:15.487, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .

We can compare the timestamp in host bootstrapper (2013/09/05, 21:47:15.487) to the current system time (2013/09/05, 22:13:34.581) to see that the host bootstrapper has been trying to run E:\approot\StartupTasks\Startup.cmd for about 30 minutes.  Notice that the task type is 0 (“Executing Startup Task type=0”) which is a simple startup task which causes the host bootstrapper to wait until the task finishes before continuing execution (see the taskType entry at http://msdn.microsoft.com/en-us/library/windowsazure/gg557552.aspx). 

Note that we can also get this same information from the event viewer.  We can see that the last thing that Azure is doing is trying to run the startup task:

image

 

Check the startup task and test the fix

Now that we know where we are failing, let’s go check the startup task.  Here are the contents of e:\approot\StartupTasks\Startup.cmd:

ECHO Press Enter When Ready
PAUSE
REM Other commands here…

This one is pretty simple and obvious to tell what is wrong, but if this were more complicated logic then we could run the startup task manually, attach a debugger to it, etc.  Now it would be nice to test a quick fix to the startup task to make sure it fixes the problem without having to completely redeploy the solution.  Using the information at How to Modify a Running Azure Service we know that we can terminate WaHostBootstrapper, replace the Startup.cmd, and then wait.

I modified Startup.cmd to remove the PAUSE line, terminated WaHostBootstrapper, waited for a minute, and now I see WaWorkerHost.exe running:

image

And the WaHostBootstrapper.log shows that my startup task finished correctly and the role host process (WaWorkerHost) was started:

[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing Startup Task type=0 rolemodule=(null) cmd="E:\approot\StartupTasks\Startup.cmd"
[00002844:00002244, 2013/09/06, 14:28:55.386, INFO ] Executing "E:\approot\StartupTasks\Startup.cmd" .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Program "E:\approot\StartupTasks\Startup.cmd"  exited with 0. Working Directory = E:\approot\
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, ERROR] <- GetRoleHostDebugger=0x1
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Executing base\x64\WaWorkerHost.exe .
[00002844:00002244, 2013/09/06, 14:28:55.401, INFO ] Role host process PID: 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Registering client with PID 4036.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client WaWorkerHost.exe (4036) registered.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Client process 4036 is the role host.
[00002844:00003348, 2013/09/06, 14:28:55.479, INFO ] Role host process registered.

Troubleshooting Scenario 4 – Windows Azure Traffic Manager Degraded Status


 

This post will describe how to troubleshoot a Windows Azure Traffic Manager profile which is showing a Degraded status, and provide some key points to understand about traffic manager probes.  This is a continuation of the troubleshooting series.

 

Symptom

You have configured a Windows Azure Traffic Manager profile pointing to some of your .cloudapp.net hosted services and after a few seconds you see the Status as Degraded.

 

image

 

If you go into the Endpoints tab of that profile you will see one or more of the endpoints in an Offline status:

image

 

 

Important notes about WATM probing

  • WATM only considers an endpoint as ONLINE if the probe gets a 200 back from the probe path.
  • A 30x redirect (or any other non-200 response) will fail, even if the redirected URL returns a 200.
  • For HTTPS probes, certificate errors are ignored.
  • The actual content of the probe path doesn’t matter, as long as a 200 is returned.  A common technique if the actual website content doesn’t return a 200 (ie. if the ASP pages redirect to an ACS login page or some other CNAME URL) is to set the path to something like “/favicon.ico”.
  • Best practice is to set the Probe path to something which has enough logic to determine if the site is up or down.  In the above example setting the path to “/favicon.ico” you are only testing if w3wp.exe is responding, but not if your website is healthy.  A better option would be to set the path to something such as “/Probe.aspx”, and within Probe.aspx include enough logic to determine if your site is healthy (ie. check perf counters to make sure you aren’t at 100% CPU or receiving a large number of failed requests, attempt to access resources such as the database or session state to make sure the application’s logic is working, etc).  See the sketch after this list.
  • If all endpoints in a profile are degraded then WATM will treat all endpoints as healthy and route traffic to all endpoints.  This is to ensure that any potential problem with the probing mechanism which results in incorrectly failed probes will not result in a complete outage of your service.
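
As a sketch of that “probe page with real health logic” approach, the following is illustrative rather than a prescribed implementation (the class name, the individual checks, and the choice of an IHttpHandler instead of an .aspx page are all assumptions).  The key point is simply that the probe path returns a 200 only when your application is actually healthy:

using System;
using System.Web;

// Hypothetical health probe handler (mapped to whatever path the WATM probe points at).
// Replace the placeholder checks with whatever determines that *your* application is healthy.
public class ProbeHandler : IHttpHandler
{
    public bool IsReusable { get { return true; } }

    public void ProcessRequest(HttpContext context)
    {
        bool healthy;
        try
        {
            healthy = DatabaseIsReachable() && DependenciesAreHealthy();
        }
        catch (Exception)
        {
            healthy = false;
        }

        // WATM only treats an HTTP 200 as ONLINE; any other status marks the endpoint Offline.
        context.Response.StatusCode = healthy ? 200 : 503;
        context.Response.Write(healthy ? "OK" : "Unhealthy");
    }

    private bool DatabaseIsReachable() { return true; /* application-specific check */ }
    private bool DependenciesAreHealthy() { return true; /* application-specific check */ }
}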

 

Troubleshooting

The best tool for troubleshooting WATM probe failures is wget.  You can get the binaries and dependencies package from http://gnuwin32.sourceforge.net/packages/wget.htm.  Note that you can use other programs such as Fiddler or curl instead of wget – basically you just need something that will show you the raw HTTP response.

Once you have wget installed, go to a command prompt and run wget against the URL + Probe port & path that is configured in WATM.  For this example it would be http://watestsdp2008r2.cloudapp.net:80/Probe.

image

 

image

 

Notice that wget indicates that the URL returned a 301 redirect to http://watestsdp2008r2.cloudapp.net/Default.aspx.  As we know from the “Important notes about WATM probing” section above, a 30x redirect is considered a failure by WATM probing and this will cause the probe to report Offline.  At this point it is a simple matter to check the website configuration and make sure that a 200 is returned from the /Probe path (or reconfigure the WATM probe to point to a path which will return a 200).

 

If your probe is using the HTTPS protocol you will want to add the “--no-check-certificate” parameter to wget so that it will ignore the certificate mismatch on the cloudapp.net URL.


August 2013 Windows Azure Guest OS issue with System.Runtime.Caching.MemoryCache


 

<Update Sept 18, 2013>

The hotfix which resolves this issue has been released – http://support.microsoft.com/kb/2888303.  See solution #2 below.

</Update>

<Update Sept 17, 2013>

A hotfix has been built and a link to the KB article will be available shortly.

Updated the hotfix to http://support.microsoft.com/kb/2836939/en-us, and corrected information about OS Family 3 (Windows Server 2012).

</Update>

Symptom

Your ASP.NET application which uses the System.Runtime.Caching.MemoryCache class and is hosted in a Windows Azure WebRole begins throwing the following exception some time after August 31, 2013:

System.Exception: Type ‘System.Threading.ExecutionContext’ does not have a public property named ‘PreAllocatedDefault’

The callstack may look something like:

System.Web.Util.ExecutionContextUtil.GetDummyDefaultEC()
System.Web.Util.ExecutionContextUtil..cctor()
System.Web.Util.ExecutionContextUtil.RunInNullExecutionContext(System.Action)
System.Web.Hosting.ObjectCacheHost.System.Runtime.Caching.Hosting.IMemoryCacheManager.UpdateCacheSize(Int64, System.Runtime.Caching.MemoryCache)
System.Runtime.Caching.CacheMemoryMonitor.GetCurrentPressure()
System.Runtime.Caching.MemoryMonitor.Update()
System.Runtime.Caching.MemoryCacheStatistics.CacheManagerThread(Int32)
System.Threading.ExecutionContext.runTryCode(System.Object)
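
For context, the exception is raised on the cache’s background memory-monitoring thread (the CacheManagerThread shown in the callstack) rather than at your call site, so any ordinary MemoryCache usage in the affected web role can surface it.  A minimal, purely illustrative sketch of the kind of code involved (the cache key, value, and helper method are hypothetical):

using System;
using System.Runtime.Caching;

// Any ordinary use of MemoryCache in an ASP.NET application running on the affected
// Guest OS can trigger the failure; the exception surfaces later on the cache's
// internal monitoring thread, not on these calls.
public static class ProductCache
{
    public static string GetProductName(string productId)
    {
        string name = MemoryCache.Default.Get(productId) as string;
        if (name == null)
        {
            name = LoadProductNameFromDatabase(productId);  // placeholder for the real lookup
            MemoryCache.Default.Set(productId, name, DateTimeOffset.UtcNow.AddMinutes(5));
        }
        return name;
    }

    private static string LoadProductNameFromDatabase(string productId)
    {
        return "Product " + productId;  // illustrative only
    }
}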

 

Root Cause

The August 2013 Windows Azure Guest OS Release began deployment into the production environment on August 31, 2013, with most hosted services which use automatic guest OS updates being upgraded starting around September 7.  This OS release contains the standard security patches, including the fix in http://support.microsoft.com/kb/2836939/en-us which is an ASP.NET hotfix for .NET Framework 4.  This ASP.NET patch introduced a regression which causes the above exception when using System.Runtime.Caching.MemoryCache in an ASP.NET application.

Note that this issue does not impact OS Family 3 (Windows Server 2012) since this OS comes with .NET 4.5 which is not impacted by the ASP.NET hotfix.

 

Solution

The following are the possible solutions, roughly ordered by ease of implementation:

  1. The easiest fix is to roll back to the previous Windows Azure Guest OS Release.  This can be done via the management portal by changing the Operating System Version to 1.25 or 2.17, or by updating the .cscfg file and changing osVersion=* to the version string of the previous Guest OS release.
  2. A hotfix which resolves this problem is available via http://support.microsoft.com/kb/2888303.  If solution #1 is not viable for you then you can include this hotfix in your package and install it via a startup script.  Once the next Azure Guest OS release is released you can then remove this hotfix from your package.
  3. Upgrade to OS Family 3, which uses Windows Server 2012.
  4. Remove the use of System.Runtime.Caching.MemoryCache from the ASP.NET application.

 

The next Windows Azure Guest OS Release will include the ASP.NET hotfix to resolve this issue and should begin rolling out to Azure by early October.  If you have chosen solution #1 (roll back to the previous Guest OS), once this new Guest OS is released you will be able to upgrade to this release or set your OS configuration back to Automatic updates.  To be notified when the next Guest OS begins deployment to production sign up for the RSS feed at http://sxp.microsoft.com/feeds/3.0/msdntn/WindowsAzureOSUpdates.

    Windows Azure Traffic Manager Performance Impact


     

    A somewhat common question regarding Windows Azure Traffic Manager (WATM) deals with potential performance problems that it might cause.  The questions are typically along the lines of “How much latency will WATM add to my website?”, “My monitoring site says that my website was slow for a couple hours yesterday – were there any WATM issues at that time?”, “Where are the WATM servers? I want to make sure they are in the same datacenter as my website so that performance isn’t impacted.”.

    Note that this post is only about the direct performance impact that WATM can cause to a website.  If you have a website in East US and one in Asia and your East US website is failing the WATM probes, then all of your users will be directed to your Asia website and you will see performance impacts, but this performance impact has nothing to do with WATM itself.

     

    Important notes about how WATM works

    http://msdn.microsoft.com/en-us/library/windowsazure/hh744833.aspx is an excellent resource to learn how WATM works, but there is a lot of information on that page and picking out the key information relating to performance can be difficult.  The important points to look at in the MSDN documentation are steps #5 and #6 from Image 3, which I will explain in more detail here:

    • WATM essentially only does one thing – DNS resolution.  This means that the only performance impact that WATM can have on your website is the initial DNS lookup.
    • A point of clarification about the WATM DNS lookup.  WATM populates, and regularly updates, the normal Microsoft DNS root servers based on your policy and the probe results.  So even during the initial DNS lookup there is no involvement by WATM since the DNS request is handled by the normal Microsoft DNS root servers.  If WATM goes ‘down’ (ie. a failure in the VMs doing the policy probing and DNS updating) then there will be no impact to your WATM DNS name since the entries in the Microsoft DNS servers will still be preserved – the only impact will be that probing and updating based on policy will not happen (ie. if your primary site goes down, WATM will not be able to update DNS to point to your failover site).
    • Traffic does NOT flow through WATM.  There are no WATM servers acting as a middle-man between your clients and your Azure hosted service.  Once the DNS lookup is finished then WATM is completely removed from the communication between client and server.
    • DNS lookup is very fast, and is cached.  The initial DNS lookup will depend on the client and their configured DNS servers, but typically a client can do a DNS lookup in ~50 ms (see http://www.solvedns.com/dns-comparison/).  Once the first lookup is done the results will be cached for the DNS TTL, which for WATM defaults to 300 seconds.  See the sketch after this list for a simple way to measure this yourself.
    • The WATM policy you choose (performance, failover, round robin) has no influence on the DNS performance.  Your performance policy can negatively impact your user’s experience, for example if you send US users to a service hosted in Asia, but this performance issue is not caused by WATM.
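
    To see the “first lookup versus cached lookup” behavior yourself you can time two back-to-back resolutions of your Traffic Manager name from a client.  This is only a small sketch; the host name below is a placeholder and the exact timings will vary with your resolver:

    using System;
    using System.Diagnostics;
    using System.Net;

    // Times two consecutive DNS resolutions of a Traffic Manager name.  The first call
    // pays the full lookup cost; the second is typically answered from the local
    // resolver cache until the TTL (300 seconds by default for WATM) expires.
    class DnsTiming
    {
        static void Main()
        {
            const string host = "myservice.trafficmanager.net";  // placeholder WATM name

            for (int i = 1; i <= 2; i++)
            {
                var sw = Stopwatch.StartNew();
                IPAddress[] addresses = Dns.GetHostAddresses(host);
                sw.Stop();
                Console.WriteLine("Lookup {0}: {1} ms -> {2}", i, sw.ElapsedMilliseconds, addresses[0]);
            }
        }
    }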

     

    Testing WATM Performance

    There are a few publicly available websites that you can use to determine your WATM performance and behavior.  These sites are useful to determine the DNS latency and which of your hosted services your users around the world are being directed to.  Keep in mind that most of these tools do not cache the DNS results so running the tests multiple times will show the full DNS lookup, whereas clients connecting to your WATM endpoint will only see the full DNS lookup performance impact once during the TTL duration.

    http://www.websitepulse.com/help/tools.php

    One of the simplest tools is WebSitePulse.  Enter the URL and you will see statistics such as DNS resolution time, First Byte, Last Byte, and other performance statistics.  You can choose from three different locations to test your site from.  In this example you will see that the first execution shows a DNS lookup time of 0.204 sec.  The second time we run this test on the same WATM endpoint the DNS lookup takes 0.002 sec since the results are already cached.

    image

    image

     

    http://www.watchmouse.com/en/checkit.php

    Another really useful tool to get DNS resolution time from multiple geographic regions simultaneously is Watchmouse’s Check Website tool.  Enter the URL and you will see DNS resolution time, connection time, and speed from several geo locations.  This is also handy to test the WATM Performance policy to see which hosted service your different users around the world are being sent to.

    image

     

    http://tools.pingdom.com/ – This will test a website and provide performance statistics for each element on the page on a visual graph.  If you switch to the Page Analysis tab you can see the percentage of time spent doing DNS lookup.

     

    http://www.whatsmydns.net/ – This site will do a DNS lookup from 20 different geo locations and display the results on a map.  This is a great visual representation to help determine which hosted service your clients will connect to.

     

    http://www.digwebinterface.com – Similar to the watchmouse site, but this one shows more detailed DNS information including CNAMEs and A records.  Make sure you check the ‘Colorize output’ and ‘Stats’ under options, and select ‘All’ under Nameservers.

     


    Summary

    Given the above information we know that the only performance impact that WATM will have on a website is the first DNS lookup (times vary, but average ~50 ms), then zero performance impact for the duration of the DNS TTL (300 seconds by default), and then a refresh of the DNS cache after the TTL expires.  So the answer to the question “How much latency will WATM add to my website?” is, essentially, zero.

     

    Troubleshooting Scenario 5 – Internal Server Error 500 in WebRole


     

    This post will describe how to troubleshoot an Internal Server Error 500 in an Azure webrole.  This is a continuation of the troubleshooting series.

    Symptom

    You have deployed your WebRole, which works perfectly fine on your development machine, to Windows Azure and it shows as Ready in the portal, but when you browse to the site you get:

    500 – Internal server error.

    There is a problem with the resource you are looking for, and it cannot be displayed.

    image

     

    Troubleshooting

    If the role itself is showing Ready in the portal, but there are functional issues with your hosted service (ie. this 500 Internal Server Error) then the first and easiest step is to RDP to the Azure VM and attempt to browse to the site using the DIP.  The DIP is the VM’s internal IP address (a 10.xxx or 100.xxx address) which you can get from ipconfig or IIS Manager. This will give you the more detailed error information that you would expect to get when browsing a website from the server where IIS is running.  Typically the error and root cause of the issue will be immediately apparent.

    image

    The easiest way to browse the website on the local DIP is to open IIS Manager, expand Sites and click on the website.  On the right-hand side you will see ‘Browse Website’.  Alternatively you can use ipconfig to get the local IP address and then open Internet Explorer and browse to that address, but if your site is not on the standard port 80 you will also have to find the port number.  You can get the port number from IIS Manager, the management portal, or your .csdef file, but in general it is just easier to browse directly using IIS Manager.

    image
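
    If you would rather get the DIP and port programmatically while you are on the VM (for example from a small diagnostic console app), the ServiceRuntime API exposes the instance endpoints.  This is only a sketch under the assumption that the process can see the role environment (RoleEnvironment.IsAvailable is true); the endpoint names are whatever you declared in your csdef:

    using System;
    using Microsoft.WindowsAzure.ServiceRuntime;

    // Prints the internal IP address (DIP) and port for each endpoint declared in the csdef.
    class DumpEndpoints
    {
        static void Main()
        {
            foreach (var endpoint in RoleEnvironment.CurrentRoleInstance.InstanceEndpoints)
            {
                Console.WriteLine("{0}: {1}", endpoint.Key, endpoint.Value.IPEndpoint);
            }
        }
    }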

    Browsing to the local DIP in IE will result in more detailed error information:

    image

    In this case we can see the following problem:

    Error Code
       0x80070032

    Config Error
       The configuration section ‘system.web.webPages.razor’ cannot be read because it is missing a section declaration 

     

    Solution

    This particular problem can be resolved by adding the razor sectionGroup settings to your web.config file.  But more generically this blog post is meant to show you how to get the detailed error information, which at that point should be easy enough to do a quick web search to find a solution.

    Troubleshooting Scenario 6 – Role Recycling After Running For Some Time


     

    In Troubleshooting Scenario 2 we looked at a scenario where the role would recycle after running fine for some time due to a bug in a startup task triggered by a role recycle such as an OS update.  This blog post will show another example of this same type of behavior, but with a different, and more difficult to find, root cause.  This is a continuation of the troubleshooting series.

     

    Symptom

    Similar to what we looked at in Scenario 2, your role has been running fine for some time but seemingly without cause it enters a recycling state and the service health dashboard shows that everything is operating normally.

     

    Get the Big Picture

    As we have in the other troubleshooting scenarios we will start by looking at Task Manager for a minute to see what processes are running, or starting and stopping.  When we initially look we only see WindowsAzureGuestAgent.exe:

    image

    After watching for a minute or two we see WindowsAzureGuestAgent.exe consuming CPU cycles, so we know it is doing some work, but we don’t see any other processes.  We know that the guest agent is supposed to start WaHostBootstrapper.exe, but we never see this process in task manager.

    From the ‘Get the Big Picture’ section in Troubleshooting Scenario 1 we know that if we don’t see WaHostBootstrapper running then the problem is most likely within the Azure guest agent (WindowsAzureGuestAgent.exe) itself.

     

    Guidelines for analyzing Windows Azure Guest Agent logs

    From the diagnostic data blog post we know that there are 2 types of guest agent logs – App Agent Runtime Logs in AppAgentRuntime.log, and App Agent Heartbeat Logs in WaAppAgent.log.  This section will briefly describe the content in the logs and how to look at them.

    App Agent Runtime (AppAgentRuntime.log)

    • These logs are written by WindowsAzureGuestAgent.exe and contain information about events happening within the guest agent and the VM.  This includes information such as firewall configuration, role state changes, recycles, reboots, health status changes, role stops/starts, certificate configuration, etc.
    • This log is useful to get a quick overview of the events happening over time to a role since it logs major changes to the role without logging heartbeats.
    • If the guest agent is not able to start the role correctly (ie. a locked file preventing directory cleanup) then you will see it in this log.

    The app agent runtime logs are normally filled with lots of error messages and hresults which look like problems, but are expected during the normal execution of the guest agent.  This makes analyzing these logs very difficult and more art than science.  Here are some general guidelines for how to look at these files and what messages to look for.

    1. Compare guest agent logs from a good and bad VM so that you can see where the differences are.  This is probably the most effective way to rule out a lot of the noise and benign error messages.
    2. Scroll to the bottom of the log and start looking from there.  The start and middle of the log includes a lot of basic setup messages that you are most likely not interested in.  Any failures will be occurring later in the logs.
    3. Look for repeating patterns of messages.  The Azure guest agent works like a state machine.  If the current goal state is Running then the guest agent will continue retrying the same set of actions until it reaches that goal state.
    4. Look for _Context_Start and _Context_Ends messages.  These correspond to the actions taken as the guest agent tries to reach the goal state.  Every Context_Start will have a Context_End.  A context can contain subcontexts, so you can see multiple Context_Start events before you see a Context_End event.
    5. Lines that begin with a “<-“ are function returns, along with the hresult being returned.  So a line of “<- RuntimepDestroyResource” means that the function RuntimepDestroyResource is returning.  A series of lines in a row showing “<- {some function}” can be looked at much like a callstack.
    6. The normal order of Context actions are (indented to show sub-contexts):
      1. AgentFirewallInitialize
      2. RuntimeHttpMonitor
        1. AgentCreateContainer
          1. AgentpCreateContainerWorker
      3. SendConfig
      4. StartContainer
        1. AgentpStartContainerWorker
      5. GetTransportCertificate
      6. SendConfig
      7. StartRole
        1. AgentpStartRoleWorker
    7. The _Context_Ends should always have a Return Value of 00000000 indicating success.  If a context in the log does not have a success return value, then that is the context to focus on for the source of the problem.  You can typically trace the same failed HRESULT back up in the log to see where it originates.
    8. Some common entries in the log file that look like failures, but can be ignored:
      1. {most HRESULTS, unless they are part of a CONTEXT_END}
      2. Failed to remove CIS from Lsa (or Failed to remove CIS\{guid} from Lsa)
      3. TIMED OUT waiting for LB ping. Proceeding to start the role.
      4. Failed to delete URLACL
      5. RuntimeFindContainer=0x80070490
    9. Once the host bootstrapper is started successfully you will see an entry in the log with the PID for WaHostBootstrapper: Role process with id {pid} is successfully resumed

     

    App Agent Heartbeat (WaAppAgent.log)

    • These logs are written by WindowsAzureGuestAgent.exe and contain information about the status of the health probes to the host bootstrapper.
    • The guest agent process is responsible for reporting health status (ie. Ready, Busy, etc) back to the fabric, so the health status as reported by these logs is the same status that you will see in the Management Portal.
    • These logs are typically useful for determining what is the current state of the role within the VM, as well as determining what the state was at some time in the past.  With a problem description like "My website was down from 10:00am to 11:30am yesterday", these heartbeat logs are very useful to determine what the health status of the role was during that time.

    The heartbeat logs are very verbose and are typically best used to determine the status of the VM at a given point in time.  Here are some guidelines on how to look at these files:

    1. Every time the role starts (initial start, VM reboot, role recycle) you will see a large group of lines with ipconfig.exe and route.exe output.  This can be ignored.
    2. When the role starts you will see a few messages showing the state as NotReady with sub-status of Starting.
    3. If the role never leaves the Busy state then it usually means that startup tasks are still executing or the role host is still in the OnStart method.  The role can also show as Busy if you use the StatusCheck event (see the sketch after this list).
    4. Once the role is running you will see Role {roleid} is reporting state Ready.
    5. The software load balancer communicates with the guest agent to determine when to put an instance into LB rotation.  If the role is reporting state Ready then the instance will be in LB rotation.  Note that this is using the default LB configuration, which can be overridden by using a custom LB probe.
    6. Common entries that look like failures but can be ignored:
      1. GetMachineGoalState() Error: 410 – Goal State not yet available from server. Will retry later.
      2. Caught exception in pre-initialization heartbeat thread, will continue heartbeats: System.NullReferenceException: Object reference not set to an instance of an object.
      3. Unable to get SDKVersion. System.IO.FileNotFoundException: Could not find file ‘D:\Packages\GuestAgent\GuestAgent\RoleModel.xml’.
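
    As referenced in item 3 of the list above (the role showing Busy), the StatusCheck event lets a role report Busy deliberately.  A minimal sketch of a role entry point that does this; the isReady flag and the initialization step are illustrative:

    using Microsoft.WindowsAzure.ServiceRuntime;

    // Sketch: the instance reports Busy on every status check until its own
    // initialization flag is set.  While it reports Busy it shows as Busy in the
    // portal and is kept out of load balancer rotation (with the default LB probe).
    public class WorkerRole : RoleEntryPoint
    {
        private static volatile bool isReady;

        public override bool OnStart()
        {
            RoleEnvironment.StatusCheck += (sender, e) =>
            {
                if (!isReady)
                {
                    e.SetBusy();   // report Busy for this heartbeat
                }
            };

            // ... long-running initialization happens here, then:
            isReady = true;

            return base.OnStart();
        }
    }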

     

    Check the Logs

    Normally for role recycles we should start with the Windows Azure event logs, Application event logs, and WaHostBootstrapper log files.  But in this scenario we know that the problem is in the guest agent so we will start with the guest agent logs in C:\Logs.  The runtime logs are where the guest agent logs the major events that occur so that is usually the first place to start looking when something is preventing the guest agent from correctly starting the host bootstrapper.

    From the guidelines above we know to start with the AppAgentRuntime logs because those track the major events that happen with the guest agent, and we know to start off by scrolling to the bottom of the file and working our way up.  We also know to start looking for a _Context_Ends entry with a non-success hresult.

    The first entry we find is:

    <<<<_Context_Ends: {B7B98274-CF7B-4D0B-95B5-A13E3D973E4C}    Return value = 80070490.         Context={{ AgentpStopRoleWorker

    The interesting aspect of this line is that it is occurring on a StopRole context, but we know that we are trying to start the role.  Whenever a StartRole fails the guest agent will then do a StopRole in order to tear everything down to prepare for another StartRole.  So most likely this HRESULT is just a symptom of the real root cause and can be ignored.  We also know that the hresult 0x80070490 is one of the ones that can usually be ignored.

    Continuing to search up we find another Context_Ends with a non-success return value:

    <<<<_Context_Ends: {28E5D4C1-654E-4631-8B8C-C9809E4074C7}    Return value = 80070020.         Context={{ AgentpStartRoleWorker

    This one looks more promising since we know the failure is occurring while the guest agent is trying to start the role.  Continuing to search up in the file on that hresult (80070020) we find several more entries, and finally we find the origination point:

    <- RuntimepDestroyResource=0x80070020        Context={{ AgentpStartRoleWorker

    So we know that the RuntimepDestroyResource function call returned an 0x80070020 hresult which bubbled up to the StartRole context and caused the role to fail to start.  Next we want to continue to look up in the log file to see what other details are being logged about the execution of RuntimepDestroyResource and any logged failures.  The very next line up in the log file is:

    Failed to delete file C:\Resources\directory\31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle.DiagnosticStore\LogFiles\Web\W3SVC1273337584\u_ex13092320.log        Context={{ AgentpStartRoleWorker

    Nothing else looks interesting in the few lines preceding this delete file entry so this must be the root cause of the problem.  The section of the log file where we see this error is:

    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] Failed to delete file C:\Resources\directory\31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle.DiagnosticStore\LogFiles\Web\W3SVC1273337584\u_ex13092320.log        Context={{ AgentpStartRoleWorker:     ConfigFileName=31fa1ff786e645beb0ecd18eb9854fa9.31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle_IN_0.1.xml     ContainerId=68fdd1f2-865b-4ebf-b2b9-c9b0288526ba     RoleInstanceId=31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle_IN_0 }}
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimepDestroyResource=0x80070020
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimepSetupDirectoryResource=0x80070020
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeSetupRoleResources=0x80070020
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeRole::SetupEnvironment(0x000000001BEA1CC0) =0x80070020
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeRole::StartUnsafe(0x000000001BEA1CC0) =0x80070020
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeRole::Start(0x000000001BEA1CC0) =0x80070020
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeBaseContainer::StartRole(0x0000000001327280) =0x80070020
    [00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeStartRole=0x80070020

    As we know from the guidelines above, the series of lines starting with “<-“ can be seen as a callstack.  This tells us that the failure to delete a file is coming from the RuntimeSetupRoleResources function. 

    We know that the failure is due to the fact that the guest agent can’t delete a specific file, and that the file is trying to be deleted when executing RuntimeSetupRoleResources.  At this point some experience with the Azure service model and VM setup is helpful, along with some knowledge of the DiagnosticStore folder (the folder where the file is being deleted from).  The RuntimeSetupRoleResources function is responsible for setting up the <LocalResources> as defined in the service’s ServiceDefinition.csdef. 

     

    Solution

    This hosted service is changing the size of the DiagnosticStore LocalStorage resource in order to accommodate a larger diagnostic store for Windows Azure Diagnostics, per the information at http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.windowsazure.diagnostics.diagnosticmonitorconfiguration.overallquotainmb.aspx.  The entry in the csdef looks like this:

    <LocalResources>
      <LocalStorage name="DiagnosticStore" sizeInMB="8192" cleanOnRoleRecycle="true"/>
    </LocalResources>

    The problem is that this definition is incorrect.  The cleanOnRoleRecycle setting is set to true, which instructs the guest agent to delete the folder and recreate it during every role start.  But if a file in that folder is locked (in this case it is w3wp.exe locking an IIS log file) then the delete will fail, causing the guest agent to fail to start the role.  The solution is to change cleanOnRoleRecycle=”false” and then redeploy.

    I am not sure why so many people set cleanOnRoleRecycle to true for the DiagnosticStore folder since the MSDN documentation’s sample sets it to false, but I have seen this specific issue several times.
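
    For completeness, the larger DiagnosticStore LocalStorage resource shown above is normally paired with raising OverallQuotaInMB when Windows Azure Diagnostics is started.  The following is only a sketch using the WAD 1.0 API; the 8000 MB value is an arbitrary choice that leaves a little headroom below the 8192 MB LocalStorage size, and the connection string setting name is the standard Diagnostics plugin default:

    using Microsoft.WindowsAzure.Diagnostics;

    public static class DiagnosticsSetup
    {
        public static void Start()
        {
            // Pair the 8192 MB DiagnosticStore LocalStorage resource with a matching
            // overall WAD quota (leave some headroom below the LocalStorage size).
            DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();
            config.OverallQuotaInMB = 8000;

            DiagnosticMonitor.Start(
                "Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString",
                config);
        }
    }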

    Azure Service Fabric – Common Networking Scenarios


     

    One of the more common questions we have seen when creating Service Fabric clusters is how to integrate the cluster with various Azure networking features.  This blog post will show how to create clusters using the following features:

     

    1. Existing Virtual Network / Subnet
    2. Static Public IP Address
    3. Internal Only Load Balancer
    4. Internal + External Load Balancer

     

    A key concept to keep in mind is that Service Fabric simply runs in a standard Virtual Machine Scale Set, so any functionality you can use in a Virtual Machine Scale Set can also be used with a Service Fabric Cluster, and the networking portions of the ARM template will be identical.  And once you can achieve #1 above (deploying into an existing VNet) then it is easy to incorporate other networking features such as ExpressRoute, VPN gateway, Network Security Group (NSG), VNet peering, etc.

    The only Service Fabric specific aspect is that the Azure Management Portal internally uses the Service Fabric Resource Provider (SFRP) to call into a cluster in order to get information about nodes and applications, and SFRP requires publicly accessible inbound access to the HTTP Gateway port (19080 by default) on the management endpoint.  This port is used by Service Fabric Explorer to browse and manage your cluster, and it is also used by the Service Fabric Resource Provider to query information about your cluster in order to display in the Azure Management Portal.  If this port is not accessible from the SFRP then you will see a message such as ‘Nodes Not Found’ in the management portal and your node and application list will appear empty.  This means that if you wish to have visibility of your cluster via the Azure Management Portal then your load balancer must expose a public IP address and your NSG must allow incoming 19080 traffic.  If you do not meet these requirements then the Azure Management Portal will not be able to display current status of your cluster, but otherwise your cluster will not be affected and you can use Service Fabric Explorer to get the current status, so this may be an acceptable limitation based on your networking requirements.  Note that this is a temporary limitation that we are planning to remove in the coming months, at which time your cluster can be publicly inaccessible without any loss of management portal functionality.

     

    Templates

    All of the templates can be found here and you should be able to deploy them as-is using the below powershell commands – just make sure you go through the ‘Initial Setup’ section first if deploying the existing VNet template or the static public IP template.

     

    Initial Setup

    Existing Virtual Network

    I am starting with an existing Virtual Network named ‘ExistingRG-vnet’ in the resource group ‘ExistingRG’, with a subnet named ‘default’.  These resources are the default ones created when using the Azure management portal to create a standard IaaS Virtual Machine.  You could also just create the VNet and subnet without creating the Virtual Machine, but because the main goal of adding a cluster to an existing VNet is to provide network connectivity to other VMs, creating the VM gives a concrete example of how this is typically used.  The VM with its public IP can also be used as a jump box if your SF cluster only uses an internal load balancer without a public IP address.

    image

     

    Static Public IP Address

    Because a static public IP address is generally a dedicated resource that is managed separately from the VM(s) it is assigned to, it is usually provisioned in a dedicated networking resource group (as opposed to within the Service Fabric cluster resource group itself).  Create a static public IP address with the name ‘staticIP1’ in the same ExistingRG resource group, either via the portal or Powershell:

    PS C:\Users\kwill> New-AzureRmPublicIpAddress -Name staticIP1 -ResourceGroupName ExistingRG -Location westus -AllocationMethod Static -DomainNameLabel sfnetworking
    
    Name                     : staticIP1
    ResourceGroupName        : ExistingRG
    Location                 : westus
    Id                       : /subscriptions/1237f4d2-3dce-1236-ad95-123f764e7123/resourceGroups/ExistingRG/providers/Micr
                               osoft.Network/publicIPAddresses/staticIP1
    Etag                     : W/"fc8b0c77-1f84-455d-9930-0404ebba1b64"
    ResourceGuid             : 77c26c06-c0ae-496c-9231-b1a114e08824
    ProvisioningState        : Succeeded
    Tags                     :
    PublicIpAllocationMethod : Static
    IpAddress                : 40.83.182.110
    PublicIpAddressVersion   : IPv4
    IdleTimeoutInMinutes     : 4
    IpConfiguration          : null
    DnsSettings              : {
                                 "DomainNameLabel": "sfnetworking",
                                 "Fqdn": "sfnetworking.westus.cloudapp.azure.com"
                               }

     

    Service Fabric Template

    I am using the Service Fabric template.json that can be downloaded from the portal just prior to creating a cluster using the standard portal wizard:

    image

     

    You can also use one of the templates in the template gallery such as the 5 Node Service Fabric Cluster.

    Existing Virtual Network / Subnet

     

    1. Change the subnet parameter to the name of the existing subnet and add two new parameters to reference the existing VNet:

    image

            "subnet0Name": {
                "type": "string",
                "defaultValue": "default"
            },
            "existingVNetRGName": {
                "type": "string",
                "defaultValue": "ExistingRG"
            },
            "existingVNetName": {
                "type": "string",
                "defaultValue": "ExistingRG-vnet"
            },
            /*
            "subnet0Name": {
                "type": "string",
                "defaultValue": "Subnet-0"
            },
            "subnet0Prefix": {
                "type": "string",
                "defaultValue": "10.0.0.0/24"
            },*/

     

     

    2. Change the vnetID variable to point to the existing VNet:

    image

            /*old "vnetID": "[resourceId('Microsoft.Network/virtualNetworks',parameters('virtualNetworkName'))]",*/
            "vnetID": "[concat('/subscriptions/', subscription().subscriptionId, '/resourceGroups/', parameters('existingVNetRGName'), '/providers/Microsoft.Network/virtualNetworks/', parameters('existingVNetName'))]",

     

    3. Remove the Microsoft.Network/virtualNetworks from the Resources so that Azure does not try to create a new VNet:

    image

     

    4. Comment out the VNet from the dependsOn attribute of the Microsoft.Compute/virtualMachineScaleSets so that we don’t depend on creating a new VNet:

    image

     

    5. Deploy the template:

    image

     

    New-AzureRmResourceGroup -Name sfnetworkingexistingvnet -Location westus
    
    New-AzureRmResourceGroupDeployment -Name deployment -ResourceGroupName sfnetworkingexistingvnet -TemplateFile C:\SFSamples\Final\template_existingvnet.json

     

     

    After the deployment your Virtual Network should include the new VMSS VMs:

    image

     

    And the Virtual Machine Scale Set node type should show the existing VNet and subnet:

    image

     

    You can also RDP to the existing VM that was already in the VNet and ping the new VMSS VMs:

    image

    Static Public IP Address

    1. Add parameters for the name of the existing static IP resource group, name, and FQDN:

    image

            "existingStaticIPResourceGroup": {
                "type": "string"
            },
            "existingStaticIPName": {
                "type": "string"
            },
            "existingStaticIPDnsFQDN": {
                "type": "string"
            }

     

    2. Remove the dnsName parameter since the static IP already has one:

    image

     

    3. Add a variable to reference the existing static IP:

    image

            "existingStaticIP": "[concat('/subscriptions/', subscription().subscriptionId, '/resourceGroups/', parameters('existingStaticIPResourceGroup'), '/providers/Microsoft.Network/publicIPAddresses/', parameters('existingStaticIPName'))]",
    

     

    4. Remove the Microsoft.Network/publicIPAddresses from the Resources so that Azure does not try to create a new IP address:

    image

     

    5. Comment out the IP address from the dependsOn attribute of the Microsoft.Network/loadBalancers so that we don’t depend on creating a new IP address:

    image

     

    6. Change the publicIPAddress element of the frontendIPConfigurations in the Microsoft.Network/loadBalancers resource to reference the existing static IP instead of a newly created one:

    image

                    "frontendIPConfigurations": [
                        {
                            "name": "LoadBalancerIPConfig",
                            "properties": {
                                "publicIPAddress": {
                                    /*"id": "[resourceId('Microsoft.Network/publicIPAddresses',concat(parameters('lbIPName'),'-','0'))]"*/
                                    "id": "[variables('existingStaticIP')]"
                                }
                            }
                        }
                    ],

     

    7. Change the managementEndpoint in the Microsoft.ServiceFabric/clusters resource to the DNS FQDN of the static IP.  If you are using a secure cluster make sure you change ‘http://’ to ‘https://’. (Note: This instruction is only for Service Fabric clusters.  If you are only using a VMSS then skip this step):

    image

                    "fabricSettings": [],
                    /*"managementEndpoint": "[concat('http://',reference(concat(parameters('lbIPName'),'-','0')).dnsSettings.fqdn,':',parameters('nt0fabricHttpGatewayPort'))]",*/
                    "managementEndpoint": "[concat('http://',parameters('existingStaticIPDnsFQDN'),':',parameters('nt0fabricHttpGatewayPort'))]",

     

    8. Deploy the template:

    image

    image

    image

    New-AzureRmResourceGroup -Name sfnetworkingstaticip -Location westus

    $staticip = Get-AzureRmPublicIpAddress -Name staticIP1 -ResourceGroupName ExistingRG

     

    $staticip

     

    New-AzureRmResourceGroupDeployment -Name deployment -ResourceGroupName sfnetworkingstaticip -TemplateFile C:\SFSamples\Final\template_staticip.json -existingStaticIPResourceGroup $staticip.ResourceGroupName -existingStaticIPName $staticip.Name -existingStaticIPDnsFQDN $staticip.DnsSettings.Fqdn

     

    After the deployment you will see that your load balancer is bound to the public static IP address from the other resource group:

    image

     

    And the Service Fabric client connection endpoint and SFX endpoint point to the DNS FQDN of the static IP address:

    image

    Internal Only Load Balancer

    This scenario replaces the external load balancer in the default Service Fabric template with an internal only load balancer.  See above for the Azure Management Portal and SFRP implications.

     

    1. Remove the dnsName parameter since it is not needed:

    image

     

    2. Optionally add a static IP address parameter, if using static allocation method. If using dynamic allocation method then this is not needed:

    image

            "internalLBAddress": {
                "type": "string",
                "defaultValue": "10.0.0.250"
            }

     

    3. Remove the Microsoft.Network/publicIPAddresses from the Resources so that Azure does not try to create a new IP address:

    image

     

    4. Remove the IP address dependsOn attribute of the Microsoft.Network/loadBalancers so that we don’t depend on creating a new IP address, and add the VNet depends on since the load balancer now depends on the subnet from the VNet:

    image

                "apiVersion": "[variables('lbApiVersion')]",
                "type": "Microsoft.Network/loadBalancers",
                "name": "[concat('LB','-', parameters('clusterName'),'-',parameters('vmNodeType0Name'))]",
                "location": "[parameters('computeLocation')]",
                "dependsOn": [
                    /*"[concat('Microsoft.Network/publicIPAddresses/',concat(parameters('lbIPName'),'-','0'))]"*/
                    "[concat('Microsoft.Network/virtualNetworks/',parameters('virtualNetworkName'))]"
                ],

     

    5. Change the load balancer’s frontendIPConfigurations from using a publicIPAddress to using a subnet and privateIPAddress.  Note that this uses a predefined static internal IP address; you could switch this to a dynamic IP address by removing the privateIPAddress element and changing the privateIPAllocationMethod to “Dynamic”.

    image

                    "frontendIPConfigurations": [
                        {
                            "name": "LoadBalancerIPConfig",
                            "properties": {
                                /*
                                "publicIPAddress": {
                                    "id": "[resourceId('Microsoft.Network/publicIPAddresses',concat(parameters('lbIPName'),'-','0'))]"
                                } */
                                "subnet" :{
                                    "id": "[variables('subnet0Ref')]"
                                },
                                "privateIPAddress": "[parameters('internalLBAddress')]",
                                "privateIPAllocationMethod": "Static"
                            }
                        }
                    ],

     

    6. In the Microsoft.ServiceFabric/clusters resource change the managementEndpoint to point to the internal load balancer address.  If you are using a secure cluster make sure you change ‘http://’ to ‘https://’. (Note: This instruction is only for Service Fabric clusters.  If you are only using a VMSS then skip this step):

    image

                    "fabricSettings": [],
                    /*"managementEndpoint": "[concat('http://',reference(concat(parameters('lbIPName'),'-','0')).dnsSettings.fqdn,':',parameters('nt0fabricHttpGatewayPort'))]",*/
                    "managementEndpoint": "[concat('http://',reference(variables('lbID0')).frontEndIPConfigurations[0].properties.privateIPAddress,':',parameters('nt0fabricHttpGatewayPort'))]",

     

    7. Deploy the template:

    image

    New-AzureRmResourceGroup -Name sfnetworkinginternallb -Location westus
    
    New-AzureRmResourceGroupDeployment -Name deployment -ResourceGroupName sfnetworkinginternallb -TemplateFile C:\SFSamples\Final\template_internalonlyLB.json

     

    After the deployment you will see that your load balancer is using the private static 10.0.0.250 IP address:

    image

     

    If you have another machine in that same VNet then you can also browse to the internal SFX endpoint and see that it connects to one of the nodes behind the load balancer:

    image

    Internal and External Load Balancer

    This scenario will take the existing single node type external load balancer and add an additional internal load balancer for the same node type.  A back end port attached to a back end address pool can only be assigned to a single load balancer so you will have to decide which load balancer should have your application ports and which load balancer should have your management endpoints (port 19000/19080).  Keep in mind the SFRP restrictions from above if you decide to put the management endpoints on the internal load balancer.  This sample keeps the management endpoints on the external load balancer and adds a port 80 application port and places it on the internal load balancer.

    If you want a two node type cluster, with one node type on the external load balancer and the other on the internal load balancer, then simply take the portal-created two node type template (which will come with 2 load balancers) and switch the second load balancer to an internal load balancer per the ‘Internal Only Load Balancer’ section above.

     

    1. Add the static internal LB IP address parameter (see the notes above if a dynamic IP address is desired):

    image

            "internalLBAddress": {
                "type": "string",
                "defaultValue": "10.0.0.250"
            }

     

    2. Add application port 80 parameter:

    image

            "loadBalancedAppPort1": {
                "type": "int",
                "defaultValue": 80,
                "metadata": {
                    "description": "Input endpoint1 for the application to use. Replace it with what your application uses"
                }
            },

     

    3. Add internal versions of the existing networking variables by copying and pasting them and appending “-Int” to the names:

    image

            /* Add internal LB networking variables */
            "lbID0-Int": "[resourceId('Microsoft.Network/loadBalancers', concat('LB','-', parameters('clusterName'),'-',parameters('vmNodeType0Name'), '-Internal'))]",
            "lbIPConfig0-Int": "[concat(variables('lbID0-Int'),'/frontendIPConfigurations/LoadBalancerIPConfig')]",
            "lbPoolID0-Int": "[concat(variables('lbID0-Int'),'/backendAddressPools/LoadBalancerBEAddressPool')]",
            "lbProbeID0-Int": "[concat(variables('lbID0-Int'),'/probes/FabricGatewayProbe')]",
            "lbHttpProbeID0-Int": "[concat(variables('lbID0-Int'),'/probes/FabricHttpGatewayProbe')]",
            "lbNatPoolID0-Int": "[concat(variables('lbID0-Int'),'/inboundNatPools/LoadBalancerBEAddressNatPool')]",
            /* internal LB networking variables end */

     

     

    4. If you are starting with the portal generated template with an application port 80, the default portal template will add AppPort1 (port 80) to the external load balancer.  In this case remove it from the external load balancer’s loadBalancingRules and probes so you can add it to the internal load balancer:

    image

    image

     

     

    5. Add a second Microsoft.Network/loadBalancers resource.  This will look very similar to the internal load balancer created in the previous ‘Internal Only Load Balancer’ section, but it uses the ‘-Int’ load balancer variables and only implements the application port 80.  It also omits the inboundNatPools in order to keep the RDP endpoints on the public load balancer; if you want RDP on the internal load balancer instead, move the inboundNatPools from the external load balancer to this internal load balancer.

            /* Add a second load balancer, configured with a static privateIPAddress and the "-Int" LB variables */
            {
                "apiVersion": "[variables('lbApiVersion')]",
                "type": "Microsoft.Network/loadBalancers",
                /* Add '-Internal' to name */
                "name": "[concat('LB','-', parameters('clusterName'),'-',parameters('vmNodeType0Name'), '-Internal')]",
                "location": "[parameters('computeLocation')]",
                "dependsOn": [
                    /* Remove public IP dependsOn, add vnet dependsOn
                    "[concat('Microsoft.Network/publicIPAddresses/',concat(parameters('lbIPName'),'-','0'))]"
                    */
                    "[concat('Microsoft.Network/virtualNetworks/',parameters('virtualNetworkName'))]"
                ],
                "properties": {
                    "frontendIPConfigurations": [
                        {
                            "name": "LoadBalancerIPConfig",
                            "properties": {
                                /* Switch from Public to Private IP address
                                "publicIPAddress": {
                                    "id": "[resourceId('Microsoft.Network/publicIPAddresses',concat(parameters('lbIPName'),'-','0'))]"
                                }
                                */
                                "subnet" :{
                                    "id": "[variables('subnet0Ref')]"
                                },
                                "privateIPAddress": "[parameters('internalLBAddress')]",
                                "privateIPAllocationMethod": "Static"
                            }
                        }
                    ],
                    "backendAddressPools": [
                        {
                            "name": "LoadBalancerBEAddressPool",
                            "properties": {}
                        }
                    ],
                    "loadBalancingRules": [
                        /* Add the AppPort rule, making sure to reference the "-Int" versions of the backendAddressPool, frontendIPConfiguration, and probe variables */
                        {
                            "name": "AppPortLBRule1",
                            "properties": {
                                "backendAddressPool": {
                                    "id": "[variables('lbPoolID0-Int')]"
                                },
                                "backendPort": "[parameters('loadBalancedAppPort1')]",
                                "enableFloatingIP": "false",
                                "frontendIPConfiguration": {
                                    "id": "[variables('lbIPConfig0-Int')]"
                                },
                                "frontendPort": "[parameters('loadBalancedAppPort1')]",
                                "idleTimeoutInMinutes": "5",
                                "probe": {
                                    "id": "[concat(variables('lbID0-Int'),'/probes/AppPortProbe1')]"
                                },
                                "protocol": "tcp"
                            }
                        }
                    ],
                    "probes": [
                       /* Add the probe for the app port */
                       {
                            "name": "AppPortProbe1",
                            "properties": {
                                "intervalInSeconds": 5,
                                "numberOfProbes": 2,
                                "port": "[parameters('loadBalancedAppPort1')]",
                                "protocol": "tcp"
                            }
                        }
                    ],
                    "inboundNatPools": [
                    ]
                },
                "tags": {
                    "resourceType": "Service Fabric",
                    "clusterName": "[parameters('clusterName')]"
                }
            },

     

    6. In the networkProfile for the Microsoft.Compute/virtualMachineScaleSets resource add the internal back end address pool:

    image

                                                    "loadBalancerBackendAddressPools": [
                                                        {
                                                            "id": "[variables('lbPoolID0')]"
                                                        },
                                                        {
                                                            /* Add internal BE pool */
                                                            "id": "[variables('lbPoolID0-Int')]"
                                                        }
                                                    ],
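
    Once the template is deployed (step 7 below), you can optionally confirm that the scale set model picked up both pools. This is a sketch, assuming the resource group name used in step 7 and substituting your node type name for the placeholder:

    # Show the back end address pools referenced by the first NIC/IP configuration of the VMSS;
    # both the external pool ID and the '-Internal' pool ID should be listed.
    $vmss = Get-AzureRmVmss -ResourceGroupName sfnetworkinginternalexternallb -VMScaleSetName "<your node type name>"
    $vmss.VirtualMachineProfile.NetworkProfile.NetworkInterfaceConfigurations[0].IpConfigurations[0].LoadBalancerBackendAddressPools |
        Select-Object Id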

     

    7. Deploy the template:

    image

    image

    New-AzureRmResourceGroup -Name sfnetworkinginternalexternallb -Location westus
    
    New-AzureRmResourceGroupDeployment -Name deployment -ResourceGroupName sfnetworkinginternalexternallb -TemplateFile C:\SFSamples\Final\template_internalexternalLB.json
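
    If the deployment fails, a useful first step is to validate the edited template without deploying it. A minimal sketch, assuming the resource group created above:

    # Returns nothing on success; returns the validation error (for example a bad variable
    # reference introduced while editing) without creating any resources.
    Test-AzureRmResourceGroupDeployment -ResourceGroupName sfnetworkinginternalexternallb -TemplateFile C:\SFSamples\Final\template_internalexternalLB.json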
    

     

    After the deployment you will see two load balancers in the resource group.  Browsing through them, you will see the public IP address and the management endpoints (ports 19000/19080) assigned to the external load balancer, the static internal IP address and the application endpoint (port 80) assigned to the internal load balancer, and both load balancers using the same VMSS back end pool.

    image
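
    The same split can be confirmed from PowerShell; here is a sketch assuming the resource group name used above:

    # For each load balancer show its load balancing rules and ports; the management ports
    # (19000/19080) should appear on the external LB and the app port (80) on the '-Internal' LB.
    Get-AzureRmLoadBalancer -ResourceGroupName sfnetworkinginternalexternallb |
        ForEach-Object {
            $lbName = $_.Name
            $_.LoadBalancingRules | Select-Object @{n='LoadBalancer';e={$lbName}}, Name, FrontendPort, BackendPort
        }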

    HTTP 401 Access Denied when calling Azure Cognitive Services APIs


     

    When calling one of the Cognitive Services APIs you may receive an HTTP 401 Access Denied error with a message “Access denied due to invalid subscription key. Make sure to provide a valid key for an active subscription.”

     

    Here are the most common causes and solutions for this issue.

     

    Using the wrong API account key in the Ocp-Apim-Subscription-Key header

    When first starting out with Azure or with Cognitive Services it can be confusing to understand which key to use in your application.  It is common to see developers use the Azure subscription key, which is a GUID with a format like 4b2714d2-3d2e-4b96-ad95-677f764e7b39, instead of the API account key.

    Make sure you are using one of the two API account keys, which look like 18de248475134eb49ae4a4e94b93461c.  These keys can be found in the Azure Management portal by going to a Cognitive Services account and selecting Keys under Resource Management.  There is a walkthrough for creating an account and finding the keys at https://docs.microsoft.com/en-us/azure/cognitive-services/cognitive-services-apis-create-account.

    image
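
    A quick way to test a key outside of your application is a bare REST call with the key in the Ocp-Apim-Subscription-Key header. This is a minimal sketch assuming a Bing Web Search (v5.0) account; substitute the endpoint that matches your API account:

    # A valid API account key returns search results; an Azure subscription GUID returns
    # the 401 "Access denied due to invalid subscription key" error described above.
    $apiKey = "<your API account key>"
    Invoke-RestMethod -Uri "https://api.cognitive.microsoft.com/bing/v5.0/search?q=azure" `
        -Headers @{ "Ocp-Apim-Subscription-Key" = $apiKey }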

     

    You can also call Get-AzureRmCognitiveServicesAccountKey from PowerShell to get the correct account key.  I use this script to print out all Cognitive Services accounts and keys in my subscription:

    $formatstring = "{0,-29} {1,-19} {2,34} {3,34} {4, 5}"
    $formatstring -f "Account", "Type", "Key1", "Key2", "SKU"
    Get-AzureRmCognitiveServicesAccount | ForEach-Object {
        $account = $_
        $keys = Get-AzureRmCognitiveServicesAccountKey -ResourceGroupName $_.ResourceGroupName -Name $account.AccountName
        $formatstring -f $account.AccountName, $account.AccountType, $keys.Key1, $keys.Key2, $account.Sku.Name
    }

    Account                       Type                                              Key1                               Key2   SKU
    kwillAcademic                 Academic              21b2b86e7b78abcde282bb5d602f8867   6f77ba002185abcdeeed04127da5e70d    S0
    kwillBingSearchS5             Bing.Search           c7d85f71945abcdebdff2303c3c4d285   8f0de6ed747446abcdee3db939b401c0    S5
    kwillBingSpeech               Bing.Speech           98de24847513abcdeae4a4e94b93461d   ad2269940791abcdea36b7389167c838    S0
    kwillBingSpellCheck           Bing.SpellCheck       a8007a55a725abcde80ffd96d26664d3   6c118bba971d4abcde7d8d71e12fee6a    S1

     

     

    Using an account key for the wrong API type

    Each Cognitive Services API has its own account key, and you can’t use an account key for one API when making a call to another API.  Generally this is a pretty easy concept and not common for developers to run into, but for the Translator API this can be confusing because there is a Translator Text API and a Translator Speech API.  They both use the same API endpoint, but the two keys cannot be used interchangeably. 

    You can confirm that your key matches the API you are calling by checking the API account type and Endpoint in the Azure management portal.  Make sure that the URL you are using for the API call matches the Endpoint shown for the account that the key belongs to.

    clip_image001

     

     

    Using the incorrect regional endpoint

    Most of the Cognitive Services APIs are region specific, which means that during API account creation you select the region you want to create the account in.  These APIs have region-specific endpoints such as westus.api.cognitive.microsoft.com or eastus2.api.cognitive.microsoft.com, and an API key for an account created in one region will only work against the endpoint for that specific region.  This means that if you create an API account in West US you will not be able to call the eastus2.api.cognitive.microsoft.com API endpoint.

     

    You can verify the region and endpoint in the Azure management portal.

    image
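
    To illustrate, here is a minimal sketch of a Face API detect call where the region segment of the URL is a variable; the key only works when that segment matches the region the account was created in (the image URL is just a placeholder):

    $region = "westeurope"                         # must match the region of your Face API account
    $faceKey = "<your Face API account key>"
    $body = @{ url = "https://example.com/sample.jpg" } | ConvertTo-Json
    Invoke-RestMethod -Method Post `
        -Uri "https://$region.api.cognitive.microsoft.com/face/v1.0/detect" `
        -Headers @{ "Ocp-Apim-Subscription-Key" = $faceKey } `
        -ContentType "application/json" `
        -Body $body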

     

    This also applies to the web based testing console.  The API documentation will show the different regions available and links to the different regional testing consoles.  Here is the Face API documentation as an example:

    image

     

    There are a few additional scenarios where issues with the regional API keys are commonly seen:

    • Applications that use Cognitive Services APIs.  If you are using a packaged application that asks you for an API key (for example, a Translator application using the Translator API), the application *should* also ask for the endpoint or region.  However, many applications were built before the Cognitive Services APIs had multiple region-specific endpoints, so they may not provide an option to configure an endpoint.  In this case the app is usually built against the West US region (westus.api.cognitive.microsoft.com).
    • If you use the client library SDKs you will often instantiate the client using code like:

          IFaceServiceClient faceServiceClient = new FaceServiceClient("key");

      This will use the default West US endpoint, which will fail for accounts in other regions.  To resolve, pass in the optional endpoint parameter like this:

          IFaceServiceClient faceServiceClient = new FaceServiceClient(key, "https://westeurope.api.cognitive.microsoft.com/face/v1.0");

    • The LUIS.AI portal also has regional endpoints, and creating an app in the LUIS.AI portal must be done in the same region as the LUIS API account that was created in the Azure management portal.  For example, if you create a LUIS API account in West Europe you must use the https://eu.luis.ai portal to create the LUIS application.  In this scenario you may also get the error “Bad Argument, Invalid Subscription Key” from the LUIS.AI portal.

     

     

    Using the incorrect version endpoint

    Some APIs, such as Bing Search, have multiple versions.  New versions of the APIs are released as free APIs that can be obtained from https://azure.microsoft.com/en-us/try/cognitive-services/, but these new versions may not yet be available in Azure, and calling them with your Azure API account key will fail with a 401.

    As of this blog post, the Bing Search API accounts created via Azure Cognitive Services use the v5.0 version of the API with the endpoint https://api.cognitive.microsoft.com/bing/v5.0.  We have also released a preview version of the v7.0 API with the endpoint https://api.cognitive.microsoft.com/bing/v7.0.  If you call the v7.0 endpoint with an Azure account key it will fail.  You can verify the correct endpoint in the Azure management portal:

    image
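
    Here is a sketch that makes the version mismatch easy to see, assuming a Bing Search key obtained from Azure:

    # The Azure key should succeed against the v5.0 endpoint and fail with a 401 against v7.0
    # until the v7.0 API is available through Azure.
    $key = "<your Bing Search account key>"
    foreach ($version in "v5.0", "v7.0") {
        try {
            Invoke-RestMethod -Uri "https://api.cognitive.microsoft.com/bing/$version/search?q=azure" `
                -Headers @{ "Ocp-Apim-Subscription-Key" = $key } | Out-Null
            Write-Host "$version call succeeded"
        }
        catch {
            Write-Host "$version call failed: $($_.Exception.Message)"
        }
    }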

     

     

    Using the LUIS programmatic key instead of the API account key

    There are three different API keys that can be used with LUIS applications:

    1. Programmatic Key, which is used to automate the management of LUIS applications created via the LUIS.AI portal.  This is created and accessed via the LUIS.AI portal.
    2. API Endpoint Key, which is used by a LUIS application to call into the LUIS API for processing.  This is created and accessed via the Azure Management portal in the Azure Cognitive Services LUIS account blade.
    3. External Keys, which are used by a LUIS application to provide additional functionality such as Spell Check.

    Further documentation about these 3 keys can be found at https://docs.microsoft.com/en-us/azure/cognitive-services/LUIS/manage-keys.

     

    Each of these API keys is specific to its use case, and you will get a 401 if you try to use an API key against the wrong endpoint.

    • You cannot use the Programmatic Key, which you get from the LUIS.AI portal, when making a LUIS API call (such as Analyze Intents) to https://westus.api.cognitive.microsoft.com/luis/v1/application.
    • You cannot use the API account key, which you get from Azure, when making a LUIS Programmatic call (such as Add Application) to https://westus.api.cognitive.microsoft.com/luis/api/v2.0/apps/.

     

     

    Calling an API immediately after creating an account or regenerating keys

    The API keys can take a few minutes to be provisioned and propagated within the various subsystems that make up the Cognitive Services service.  If you try to call an API before the keys have propagated you will get a 401 error.  Wait up to 10 minutes and then try making the API call again.  This is also noted in the Account Keys blade in the Azure Management portal:

    clip_image001[6]
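
    If you want to automate the wait, here is a simple retry sketch (substitute the endpoint for your own API account):

    # Retry for up to 10 minutes; if the call still fails after that, the 401 is probably
    # caused by one of the other issues in this post rather than key propagation.
    $apiKey = "<your API account key>"
    $uri = "https://api.cognitive.microsoft.com/bing/v5.0/search?q=azure"
    for ($i = 0; $i -lt 10; $i++) {
        try {
            Invoke-RestMethod -Uri $uri -Headers @{ "Ocp-Apim-Subscription-Key" = $apiKey } | Out-Null
            Write-Host "Key is active"
            break
        }
        catch {
            Write-Host "Still failing, waiting 60 seconds..."
            Start-Sleep -Seconds 60
        }
    }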

     

     

     

    Troubleshooting

    If you are troubleshooting a 401 error, the best place to start is the API testing console.  The testing console usually requires minimal input (often just the API key), so it is ideal for isolating problems caused by things like custom code or machine-specific issues.

    image

     

     

    You can get to the testing console by going to one of your Cognitive Services API accounts in the Azure management portal and clicking Quick start and then API reference:

    image

     

    You can also get to the testing console by going to the main Cognitive Services documentation page at https://docs.microsoft.com/en-us/azure/cognitive-services/, selecting a service under the “Services/APIs” list on the left, and then going to the Reference –> API Reference section.


