
Windows Azure Traffic Manager Performance Impact


 

A somewhat common question regarding Windows Azure Traffic Manager (WATM) deals with potential performance problems that it might cause.  The questions are typically along the lines of “How much latency will WATM add to my website?”, “My monitoring site says that my website was slow for a couple hours yesterday – were there any WATM issues at that time?”, “Where are the WATM servers? I want to make sure they are in the same datacenter as my website so that performance isn’t impacted.”

Note that this post is only about the direct performance impact that WATM can cause to a website.  If you have a website in East US and one in Asia, and your East US website is failing the WATM probes, then all of your users will be directed to your Asia website and you will see a performance impact, but that impact has nothing to do with WATM itself.

 

Important notes about how WATM works

http://msdn.microsoft.com/en-us/library/windowsazure/hh744833.aspx is an excellent resource to learn how WATM works, but there is a lot of information on that page and picking out the key information relating to performance can be difficult.  The important points to look at in the MSDN documentation are steps #5 and #6 from Image 3, which I will explain in more detail here:

  • WATM essentially only does one thing – DNS resolution.  This means that the only performance impact that WATM can have on your website is the initial DNS lookup.
  • A point of clarification about the WATM DNS lookup.  WATM populates, and regularly updates, the normal Microsoft DNS root servers based on your policy and the probe results.  So even during the initial DNS lookup there is no involvement by WATM since the DNS request is handled by the normal Microsoft DNS root servers.  If WATM goes ‘down’ (ie. a failure in the VMs doing the policy probing and DNS updating) then there will be no impact to your WATM DNS name since the entries in the Microsoft DNS servers will still be preserved – the only impact will be that probing and updating based on policy will not happen (ie. if your primary site goes down, WATM will not be able to update DNS to point to your failover site).
  • Traffic does NOT flow through WATM.  There are no WATM servers acting as a middle-man between your clients and your Azure hosted service.  Once the DNS lookup is finished then WATM is completely removed from the communication between client and server.
  • DNS lookup is very fast, and is cached.  The initial DNS lookup will depend on the client and their configured DNS servers, but typically a client can do a DNS lookup in ~50 ms (see http://www.solvedns.com/dns-comparison/).  Once the first lookup is done the results will be cached for the DNS TTL, which for WATM defaults to 300 seconds.  (A quick way to measure this yourself is sketched just after this list.)
  • The WATM policy you choose (performance, failover, round robin) has no influence on the DNS performance.  Your performance policy can negatively impact your user’s experience, for example if you send US users to a service hosted in Asia, but this performance issue is not caused by WATM.
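
You can measure the lookup yourself from any client.  Below is a minimal PowerShell sketch; the endpoint name is hypothetical (substitute your own .trafficmanager.net name), it requires the DnsClient cmdlets (Windows 8 / Windows Server 2012 or later), and note that even after clearing the local client cache your configured DNS server may still answer from its own cache:

# Hypothetical endpoint name - substitute your own WATM DNS name
$name = 'myservice.trafficmanager.net'

# Clear the local client cache to force a real lookup
Clear-DnsClientCache

# Time the lookup, then show the CNAME chain, A record, and remaining TTL
$sw = [System.Diagnostics.Stopwatch]::StartNew()
$records = Resolve-DnsName $name -Type A
$sw.Stop()
'DNS lookup took {0:N0} ms' -f $sw.Elapsed.TotalMilliseconds
$records | Select-Object Name, Type, TTL, IPAddress

Running it a second time without Clear-DnsClientCache illustrates the cached case described above.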

 

Testing WATM Performance

There are a few publicly available websites that you can use to determine your WATM performance and behavior.  These sites are useful to determine the DNS latency and which of your hosted services your users around the world are being directed to.  Keep in mind that most of these tools do not cache the DNS results so running the tests multiple times will show the full DNS lookup, whereas clients connecting to your WATM endpoint will only see the full DNS lookup performance impact once during the TTL duration.

http://www.websitepulse.com/help/tools.php

One of the simplest tools is WebSitePulse.  Enter the URL and you will see statistics such as DNS resolution time, First Byte, Last Byte, and other performance statistics.  You can choose from three different locations to test your site from.  In this example the first execution shows a DNS lookup time of 0.204 sec.  The second time we run this test on the same WATM endpoint the DNS lookup takes 0.002 sec since the results are already cached.

image

image

 

http://www.watchmouse.com/en/checkit.php

Another really useful tool to get DNS resolution time from multiple geographic regions simultaneously is Watchmouse’s Check Website tool.  Enter the URL and you will see DNS resolution time, connection time, and speed from several geo locations.  This is also handy to test the WATM Performance policy to see which hosted service your different users around the world are being sent to.

image

 

http://tools.pingdom.com/ – This will test a website and provide performance statistics for each element on the page on a visual graph.  If you switch to the Page Analysis tab you can see the percentage of time spent doing DNS lookup.

 

http://www.whatsmydns.net/ – This site will do a DNS lookup from 20 different geo locations and display the results on a map.  This is a great visual representation to help determine which hosted service your clients will connect to.

 

http://www.digwebinterface.com – Similar to the Watchmouse site, but this one shows more detailed DNS information including CNAMEs and A records.  Make sure you check the ‘Colorize output’ and ‘Stats’ under options, and select ‘All’ under Nameservers.

 

Summary

Given the above information we know that the only performance impact that WATM will have on a website is the first DNS lookup (times vary, but average ~50 ms), then zero performance impact for the duration of the DNS TTL (300 seconds by default), and then a refresh of the DNS cache once the TTL expires.  So the answer to the question “How much latency will WATM add to my website?” is, essentially, zero.

 


Troubleshooting Scenario 5 – Internal Server Error 500 in WebRole


 

This post will describe how to troubleshoot an Internal Server Error 500 in an Azure webrole.  This is a continuation of the troubleshooting series.

Symptom

You have deployed your WebRole, which works perfectly fine on your development machine, to Windows Azure and it shows as Ready in the portal, but when you browse to the site you get:

500 - Internal server error.

There is a problem with the resource you are looking for, and it cannot be displayed.

image

 

Troubleshooting

If the role itself is showing Ready in the portal, but there are functional issues with your hosted service (ie. this 500 Internal Server Error) then the first and easiest step is to RDP to the Azure VM and attempt to browse to the site using the DIP.  The DIP is the VM's internal IP address (a 10.xxx or 100.xxx address) which you can get from ipconfig or IIS Manager. This will give you the more detailed error information that you would expect to get when browsing a website from the server where IIS is running.  Typically the error and root cause of the issue will be immediately apparent.

image

The easiest way to browse the website on the local DIP is to open IIS Manager, expand Sites and click on the website.  On the right-hand side you will see ‘Browse Website’.  Alternatively you can use ipconfig to get the local IP address and then open Internet Explorer and browse to that address, but if your site is not on the standard port 80 you will also have to find the port number.  You can get the port number from IIS Manager, the management portal, or your .csdef file, but in general it is just easier to browse directly using IIS Manager.

image
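
If you prefer the command line over the IIS Manager UI, the WebAdministration module (present wherever the IIS role is installed) can list each site and its bindings so you know which port to use against the DIP.  A minimal sketch:

# List every site with its state and bindings (IP:port:hostheader)
Import-Module WebAdministration
Get-Website | Select-Object name, state,
    @{ n = 'bindings'; e = { ($_.bindings.Collection | ForEach-Object { $_.bindingInformation }) -join ', ' } }

# The DIP itself is the IPv4 address reported by ipconfig
ipconfig | Select-String 'IPv4'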

Browsing to the local DIP in IE will result in more detailed error information:

image

In this case we can see the following problem:

Error Code
   0x80070032

Config Error
   The configuration section 'system.web.webPages.razor' cannot be read because it is missing a section declaration 

 

Solution

This particular problem can be resolved by adding the razor sectionGroup settings to your web.config file.  But more generically this blog post is meant to show you how to get the detailed error information, which at that point should be easy enough to do a quick web search to find a solution.
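
For reference, the missing declaration belongs in the <configSections> element of web.config.  The sketch below assumes ASP.NET Web Pages 2.0; the Version attribute must match the version your project actually references (1.0.0.0, 2.0.0.0, or 3.0.0.0), so treat these values as placeholders:

<configSections>
  <sectionGroup name="system.web.webPages.razor" type="System.Web.WebPages.Razor.Configuration.RazorWebSectionGroup, System.Web.WebPages.Razor, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35">
    <section name="host" type="System.Web.WebPages.Razor.Configuration.HostSection, System.Web.WebPages.Razor, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" requirePermission="false" />
    <section name="pages" type="System.Web.WebPages.Razor.Configuration.RazorPagesSection, System.Web.WebPages.Razor, Version=2.0.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" requirePermission="false" />
  </sectionGroup>
</configSections>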

Troubleshooting Scenario 6 – Role Recycling After Running For Some Time


 

In Troubleshooting Scenario 2 we looked at a scenario where the role would recycle after running fine for some time due to a bug in a startup task triggered by a role recycle such as an OS update.  This blog post will show another example of this same type of behavior, but with a different, and more difficult to find, root cause.  This is a continuation of the troubleshooting series.

 

Symptom

Similar to what we looked at in Scenario 2, your role has been running fine for some time but seemingly without cause it enters a recycling state and the service health dashboard shows that everything is operating normally.

 

Get the Big Picture

As we have in the other troubleshooting scenarios we will start by looking at Task Manager for a minute to see what processes are running, or starting and stopping.  When we initially look we only see WindowsAzureGuestAgent.exe:

image

After watching for a minute or two we see WindowsAzureGuestAgent.exe consuming CPU cycles, so we know it is doing some work, but we don’t see any other processes.  We know that the guest agent is supposed to start WaHostBootstrapper.exe, but we never see this process in task manager.

From the ‘Get the Big Picture’ section in Troubleshooting Scenario 1 we know that if we don’t see WaHostBootstrapper running then the problem is most likely within the Azure guest agent (WindowsAzureGuestAgent.exe) itself.

 

Guidelines for analyzing Windows Azure Guest Agent logs

From the diagnostic data blog post we know that there are 2 types of guest agent logs – App Agent Runtime Logs in AppAgentRuntime.log, and App Agent Heartbeat Logs in WaAppAgent.log.  This section will briefly describe the content in the logs and how to look at them.

App Agent Runtime (AppAgentRuntime.log)

  • These logs are written by WindowsAzureGuestAgent.exe and contain information about events happening within the guest agent and the VM.  This includes information such as firewall configuration, role state changes, recycles, reboots, health status changes, role stops/starts, certificate configuration, etc.
  • This log is useful to get a quick overview of the events happening over time to a role since it logs major changes to the role without logging heartbeats.
  • If the guest agent is not able to start the role correctly (ie. a locked file preventing directory cleanup) then you will see it in this log.

The app agent runtime logs are normally filled with lots of error messages and HRESULTs which look like problems, but are expected during the normal execution of the guest agent.  This makes analyzing these logs very difficult and more art than science.  Here are some general guidelines for how to look at these files and what messages to look for (a small script that automates guideline #7 follows the list).

  1. Compare guest agent logs from a good and bad VM so that you can see where the differences are.  This is probably the most effective way to rule out a lot of the noise and benign error messages.
  2. Scroll to the bottom of the log and start looking from there.  The start and middle of the log includes a lot of basic setup messages that you are most likely not interested in.  Any failures will be occurring later in the logs.
  3. Look for repeating patterns of messages.  The Azure guest agent works like a state machine.  If the current goal state is Running then the guest agent will continue retrying the same set of actions until it reaches that goal state.
  4. Look for _Context_Start and _Context_Ends messages.  These correspond to the actions taken as the guest agent tries to reach the goal state.  Every Context_Start will have a Context_End.  A context can contain subcontexts, so you can see multiple Context_Start events before you see a Context_End event.
  5. Lines that begin with a “<-“ are function returns, along with the hresult being returned.  So a line of “<- RuntimepDestroyResource” means that the function RuntimepDestroyResource is returning.  A series of lines in a row showing “<- {some function}” can be looked at much like a callstack.
  6. The normal order of Context actions are (indented to show sub-contexts):
    1. AgentFirewallInitialize
    2. RuntimeHttpMonitor
      1. AgentCreateContainer
        1. AgentpCreateContainerWorker
    3. SendConfig
    4. StartContainer
      1. AgentpStartContainerWorker
    5. GetTransportCertificate
    6. SendConfig
    7. StartRole
      1. AgentpStartRoleWorker
  7. The _Context_Ends should always have a Return Value of 00000000 indicating success.  If a context in the log does not have a success return value, then that is the context to focus on for the source of the problem.  You can typically trace the same failed HRESULT back up in the log to see where it originates.
  8. Some common entries in the log file that look like failures, but can be ignored:
    1. {most HRESULTS, unless they are part of a CONTEXT_END}
    2. Failed to remove CIS from Lsa. (or Failed to remove CIS\{guid} from Lsa.)
    3. TIMED OUT waiting for LB ping. Proceeding to start the role.
    4. Failed to delete URLACL
    5. RuntimeFindContainer=0x80070490
  9. Once the host bootstrapper is started successfully you will see an entry in the log with the PID for WaHostBootstrapper: Role process with id {pid} is successfully resumed
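
Guideline #7 is easy to automate.  A minimal PowerShell sketch, assuming the log is in its usual C:\Logs location:

# Surface _Context_Ends entries whose return value is not success (00000000)
Select-String -Path 'C:\Logs\AppAgentRuntime.log' -Pattern '_Context_Ends' |
    Where-Object { $_.Line -notmatch 'Return value = 00000000' } |
    ForEach-Object { $_.Line }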

 

App Agent Heartbeat (WaAppAgent.log)

  • These logs are written by WindowsAzureGuestAgent.exe and contain information about the status of the health probes to the host bootstrapper.
  • The guest agent process is responsible for reporting health status (ie. Ready, Busy, etc) back to the fabric, so the health status as reported by these logs is the same status that you will see in the Management Portal.
  • These logs are typically useful for determining what is the current state of the role within the VM, as well as determining what the state was at some time in the past.  With a problem description like "My website was down from 10:00am to 11:30am yesterday", these heartbeat logs are very useful to determine what the health status of the role was during that time.

The heartbeat logs are very verbose and are typically best used to determine the status of the VM at a given point in time.  Here are some guidelines on how to look at these files (a one-liner for pulling the reported state out of the log follows the list):

  1. Every time the role starts (initial start, VM reboot, role recycle) you will see a large group of lines with ipconfig.exe and route.exe output.  This can be ignored.
  2. When the role starts you will see a few messages showing the state as NotReady with sub-status of Starting.
  3. If the role never leaves the Busy state then it usually means that startup tasks are still executing or the role host is still in the OnStart method.  The role can also show as Busy if you use the StatusCheck event.
  4. Once the role is running you will see Role {roleid} is reporting state Ready.
  5. The software load balancer communicates with the guest agent to determine when to put an instance into LB rotation.  If the role is reporting state Ready then the instance will be in LB rotation.  Note that this is using the default LB configuration, which can be overridden by using a custom LB probe.
  6. Common entries that look like failures but can be ignored:
    1. GetMachineGoalState() Error: 410 - Goal State not yet available from server. Will retry later.
    2. Caught exception in pre-initialization heartbeat thread, will continue heartbeats: System.NullReferenceException: Object reference not set to an instance of an object.
    3. Unable to get SDKVersion. System.IO.FileNotFoundException: Could not find file 'D:\Packages\GuestAgent\GuestAgent\RoleModel.xml'.
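
Because the heartbeat log is so verbose, a quick filter helps when all you need is the state history.  A minimal sketch, again assuming the default C:\Logs location:

# Show the most recent role state reports from the heartbeat log
Select-String -Path 'C:\Logs\WaAppAgent.log' -Pattern 'reporting state' |
    Select-Object -Last 20 |
    ForEach-Object { $_.Line }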

 

Check the Logs

Normally for role recycles we should start with the Windows Azure event logs, Application event logs, and WaHostBootstrapper log files.  But in this scenario we know that the problem is in the guest agent so we will start with the guest agent logs in C:\Logs.  The runtime logs are where the guest agent logs the major events that occur so that is usually the first place to start looking when something is preventing the guest agent from correctly starting the host bootstrapper.

From the guidelines above we know to start with the AppAgentRuntime logs because those track the major events that happen with the guest agent, and we know to start off by scrolling to the bottom of the file and working our way up.  We also know to start looking for a _Context_Ends entry with a non-success hresult.

The first entry we find is:

<<<<_Context_Ends: {B7B98274-CF7B-4D0B-95B5-A13E3D973E4C}    Return value = 80070490.         Context={{ AgentpStopRoleWorker

The interesting aspect of this line is that it is occurring on a StopRole context, but we know that we are trying to start the role.  Whenever a StartRole fails the guest agent will then do a StopRole in order to tear everything down to prepare for another StartRole.  So most likely this HRESULT is just a symptom of the real root cause and can be ignored.  We also know that the hresult 0x80070490 is one of the ones that can usually be ignored.

Continuing to search up we find another Context_Ends with a non-success return value:

<<<<_Context_Ends: {28E5D4C1-654E-4631-8B8C-C9809E4074C7}    Return value = 80070020.         Context={{ AgentpStartRoleWorker

This one looks more promising since we know the failure is occurring while the guest agent is trying to start the role.  Continuing to search up in the file on that hresult (80070020) we find several more entries, and finally we find the origination point:

<- RuntimepDestroyResource=0x80070020        Context={{ AgentpStartRoleWorker

So we know that the RuntimepDestroyResource function call returned an 0x80070020 hresult which bubbled up to the StartRole context and caused the role to fail to start.  Next we want to continue to look up in the log file to see what other details are being logged about the execution of RuntimepDestroyResource and any logged failures.  The very next line up in the log file is:

Failed to delete file C:\Resources\directory\31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle.DiagnosticStore\LogFiles\Web\W3SVC1273337584\u_ex13092320.log        Context={{ AgentpStartRoleWorker

Nothing else looks interesting in the few lines preceding this delete file entry so this must be the root cause of the problem.  The section of the log file where we see this error is:

[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] Failed to delete file C:\Resources\directory\31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle.DiagnosticStore\LogFiles\Web\W3SVC1273337584\u_ex13092320.log        Context={{ AgentpStartRoleWorker:     ConfigFileName=31fa1ff786e645beb0ecd18eb9854fa9.31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle_IN_0.1.xml     ContainerId=68fdd1f2-865b-4ebf-b2b9-c9b0288526ba     RoleInstanceId=31fa1ff786e645beb0ecd18eb9854fa9.DiagnosticStoreCleanOnRoleRecycle_IN_0 }}
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimepDestroyResource=0x80070020
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimepSetupDirectoryResource=0x80070020
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeSetupRoleResources=0x80070020
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeRole::SetupEnvironment(0x000000001BEA1CC0) =0x80070020
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeRole::StartUnsafe(0x000000001BEA1CC0) =0x80070020
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeRole::Start(0x000000001BEA1CC0) =0x80070020
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeBaseContainer::StartRole(0x0000000001327280) =0x80070020
[00001440:00002292, 2013/09/23, 20:32:26.993, ERROR] <- RuntimeStartRole=0x80070020

As we know from the guidelines above, the series of lines starting with “<-“ can be seen as a callstack.  This tells us that the failure to delete a file is coming from the RuntimeSetupRoleResources function. 

We know that the failure is due to the guest agent being unable to delete a specific file, and that the delete is attempted while executing RuntimeSetupRoleResources.  At this point some experience with the Azure service model and VM setup is helpful, along with some knowledge of the DiagnosticStore folder (the folder the file is being deleted from).  The RuntimeSetupRoleResources function is responsible for setting up the <LocalResources> as defined in the service’s ServiceDefinition.csdef.

 

Solution

This hosted service is changing the size of the DiagnosticStore LocalStorage resource in order to accommodate a larger diagnostic store for Windows Azure Diagnostics, per the information at http://msdn.microsoft.com/en-us/library/windowsazure/microsoft.windowsazure.diagnostics.diagnosticmonitorconfiguration.overallquotainmb.aspx.  The entry in the csdef looks like this:

<LocalResources>
  <LocalStorage name="DiagnosticStore" sizeInMB="8192" cleanOnRoleRecycle="true" />
</LocalResources>

The problem is that this definition is incorrect.  The cleanOnRoleRecycle setting is set to true, which instructs the guest agent to delete the folder and recreate it during every role start.  But if a file in that folder is locked (in this case it is w3wp.exe locking an IIS log file) then the delete will fail, causing the guest agent to fail to start the role.  The solution is to change cleanOnRoleRecycle="false" and then redeploy.
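
The corrected entry looks like this:

<LocalResources>
  <LocalStorage name="DiagnosticStore" sizeInMB="8192" cleanOnRoleRecycle="false" />
</LocalResources>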

I am not sure why so many people set cleanOnRoleRecycle to true for the DiagnosticStore folder since the MSDN documentation’s sample sets it to false, but I have seen this specific issue several times.

Troubleshooting Scenario 7 – Role Recycling


In Troubleshooting Scenario 1 we looked at a scenario where the role would recycle after deployment and the root cause was easily seen in the Windows Azure Event Logs.  This blog post will show another example of this same type of behavior, but with a different, and more difficult to find, root cause.  This is a continuation of the troubleshooting series.

Symptom

You have deployed your Azure hosted service and it shows as Recycling in the portal.  But there is no additional information such as an Exception type or error message.  The role status in the portal might switch between a few different messages such as, but not limited to:

  • Recycling (Waiting for role to start... System startup tasks are running.)
  • Recycling (Waiting for role to start... Sites are being deployed.)
  • Recycling (Role has encountered an error and has stopped. Sites were deployed.)

Get the Big Picture

Similar to the previous troubleshooting scenarios we want to get a quick idea of where we are failing.  Watching task manager we see that WaIISHost.exe starts for a few seconds and then disappears along with WaHostBootstrapper.

image

From the ‘Get the Big Picture’ section in Troubleshooting Scenario 1 we know that if we see WaIISHost (or WaWorkerHost) then the problem is most likely a bug in our code which is throwing an exception and that the Windows Azure and Application Event logs are a good place to start.

Check the logs

Looking at the Windows Azure event logs we don’t see any errors.  The logs show that the guest agent finishes initializing (event ID 20001), starts a startup task (10001), successfully finishes a start task (10002), then IIS configurator sets up IIS (10003 and 10004), and then the guest agent initializes itself again and repeats the loop.  No obvious errors or anything to indicate a problem other than the fact that we keep repeating this cycle a couple times per minute.

image

Next we will check the Application event logs to see if there is anything interesting there.

The Application event logs are even less interesting.  There is virtually nothing in there, and certainly nothing in there that would correlate to an application failing every 30 seconds.

image

As we have done in the previous troubleshooting scenarios we can check some of the other commonly used logs to see if anything interesting shows up.

WaHostBootstrapper logs
If we check the C:\Resources folder we will see several WaHostBootstrapper.log.old.{index} files.  WaHostBootstrapper.exe creates a new log file (and archives the previous one) every time it starts up, so based on what we were seeing in task manager and the Windows Azure event logs it makes sense to see lots of these host bootstrapper log files.  When looking at the host bootstrapper log file for a recycling role we want to look at one of the archived files rather than the current WaHostBootstrapper.log file.  The reason is that the current file is still being written, so depending on when you open the file it could be at any point in the startup process (ie. starting a startup task) and most likely won’t have any information about the crash or error which ultimately causes the processes to shut down.  You can typically pick any of the .log.old files, but I usually start with the most recent one.
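
A small PowerShell sketch for picking out that most recent archived log:

# List the host bootstrapper logs, newest first
Get-ChildItem 'C:\Resources\WaHostBootstrapper.log*' |
    Sort-Object LastWriteTime -Descending |
    Select-Object Name, LastWriteTime

# Open the most recently archived (completed) log rather than the live one
notepad (Get-ChildItem 'C:\Resources\WaHostBootstrapper.log.old*' |
    Sort-Object LastWriteTime -Descending |
    Select-Object -First 1).FullName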

The host bootstrapper log starts off normally and we can see all of the startup tasks executing and returning with a 0 (success) return code.  The log file ends like this:

[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Client reported status 1.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86EF0) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (3752).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (2596).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86E00) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:31.300, INFO ] Client reported status 2.

No error messages or failures (remember from scenario 2 that we can ignore the ‘Failed to connect to client’ and 0x800706ba errors), just a status value of 2 from WaIISHost.exe.  The status is defined as an enum with the following values:

0 = Healthy
1 = Unhealthy
2 = Busy

We would typically expect to see a 1 (Unhealthy while the role is starting up), then a 2 (Busy while the role is running startup code), and then a 0 once the role is running in the Run() method.  So this host bootstrapper log file is basically just telling us that the role is in the Busy state while starting up and then disappears, which is pretty much what we already knew.

WindowsAzureGuestAgent logs

Once WaIISHost.exe starts up then the guest agent is pretty much out of the picture so we won’t expect to find anything in these logs, but since we haven’t found anything else useful it is good to take a quick look at these logs to see if anything stands out.  When looking at multiple log files, especially for role recycling scenarios, I typically find one point in time when I know the problem happened and use that consistent time period to look across all logs.  This helps prevent just aimlessly looking through huge log files hoping that something jumps out.  In this case I will use the timestamp 2013/10/02, 22:09:31.300 since that is the last entry in the host bootstrapper log file.
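
That cross-log search is easy to script.  A minimal sketch using the timestamp above, trimmed so that nearby entries also match:

# Find entries across the guest agent logs around the failure time
$when = '2013/10/02, 22:09:3'
Select-String -Path 'C:\Logs\AppAgentRuntime.log', 'C:\Logs\WaAppAgent.log' -Pattern $when -SimpleMatch |
    ForEach-Object { '{0}: {1}' -f $_.Filename, $_.Line }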

AppAgentRuntime.log

[00002608:00003620, 2013/10/02, 22:09:21.789, INFO ] Role process with id 2916 is successfully resumed
[00002608:00003620, 2013/10/02, 22:09:21.789, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateSuspended to RoleStateBusy.
[00002608:00001840, 2013/10/02, 22:09:29.566, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateUnhealthy.
[00002608:00003244, 2013/10/02, 22:09:31.300, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateUnhealthy to RoleStateBusy.
[00002608:00003620, 2013/10/02, 22:09:31.535, FATAL] Role process exited with exit code of 0
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateStopping.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] Waiting for ping from LB.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] TIMED OUT waiting for LB ping. Proceeding to stop the role.
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateStopping to RoleStateStopped.

We can see the WaHostBootstrapper process starting (PID 2916, which matches the PID:TID we see in the WaHostBootstrapper.log – {00002916:00001744}).  Then we see the role status change to Busy, Unhealthy, then Busy, which is exactly what we see in the host bootstrapper log file.  Then the role process exits and the guest agent proceeds to do a normal stop role and then start role.  So nothing useful in this log.

Debugging

At this point we have looked at all of the useful logs and have not found any indication of what the source of the problem might be.  Now it is time to do a live debug session in order to find out why WaIISHost.exe is shutting down.

The easiest way to start debugging on an Azure VM is with AzureTools.  You can learn more about AzureTools and how to download it from http://blogs.msdn.com/b/kwill/archive/2013/08/26/azuretools-the-diagnostic-utility-used-by-the-windows-azure-developer-support-team.aspx.

First we want to download AzureTools and then double-click the X64 Debuggers tool which will download and install the Debugging Tools for Windows which contains WinDBG.

image

Now we have to get WinDBG attached to WaIISHost.exe.  Typically when debugging a process you can just start WinDBG and go to File –> Attach to a Process and select the process from the list, but in this case WaIISHost.exe is crashing immediately on startup so it won’t show up in the currently running process list.  The typical way to attach to a process that is crashing on startup is to set the Image File Execution Options Debugger key to start and attach WinDBG as soon as the process starts.  Unfortunately this solution doesn’t work in an Azure VM (for various reasons) so we have to come up with a new way to attach a debugger.

AzureTools includes an option under the Utils tab to attach a debugger to the startup of a process.  Switch to the Utils tab, click Attach Debugger, select WaIISHost from the process list, then click Attach Debugger.  You will see WaIISHost show up in the Currently Monitoring list.  AzureTools will attach WinDBG (or whatever you specify in Debugger Location) to a monitored process the next time that process starts up.  Note that AzureTools will only attach the next instance of the target process that is started – if the process is currently running then AzureTools will ignore it.

image

Now we just wait for Azure to recycle the processes and start WaIISHost again.  Once WaIISHost is started then AzureTools will attach WinDBG and you will see a screen like this:

image

Debugging an application, especially using a tool like WinDBG, is oftentimes more art than science.  There are lots of articles that talk about how to use WinDBG, but Tess’s Debugging Demos series is a great place to start.  Typically in these role recycling scenarios where there is no indication of why the role host process is exiting (ie. the event logs aren’t showing us an exception to look for) I will just hit ‘g’ to let the debugger go and see what happens when the process exits.

WinDBG produces lots of output, but here are the more interesting pieces of information:

Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALIZING
[00000704:00003424, INFO ] Initializing runtime.
Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALED RETURNED. HResult=0
Microsoft.WindowsAzure.ServiceRuntime Information: 101 : Role environment . INITIALIZED
ModLoad: 00000000`00bd0000 00000000`00bda000   E:\approot\bin\MissingDependency.dll
Microsoft.WindowsAzure.ServiceRuntime Critical: 201 : ModLoad: 000007ff`a7c00000 000007ff`a7d09000   D:\Windows\Microsoft.NET\Framework64\v4.0.30319\diasymreader.dll
Role entrypoint could not be created:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.Synchronization, Version=1.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91' or one of its dependencies. The system cannot find the file specified.
   at MissingDependency.WebRole..ctor()
   --- End of inner exception stack trace ---
   at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandleInternal& ctor, Boolean& bNeedSecurityCheck)
   at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
   at System.RuntimeType.CreateInstanceDefaultCtor(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
   at System.Activator.CreateInstance(Type type, Boolean nonPublic)
   at System.Activator.CreateInstance(Type type)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CreateRoleEntryPoint(RoleType roleTypeEnum)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.InitializeRoleInternal(RoleType roleTypeEnum)

  • The first 4 lines tell us that the Azure serviceruntime was initialized successfully. 
  • Line 5 shows that my role entry point (MissingDependency.dll - the WebRole.cs code such as the OnStart method) is loaded.  This tells us that we are getting into custom code and the problem is probably not with Azure itself.
  • Line 6 is loading diasymreader.dll.  This is the diagnostic symbol reader and you will see it loaded whenever a managed process throws a second chance exception.  The fact that this comes shortly after loading my DLL tells me that it is probably something within my DLL that is causing a crash.
  • Line 7, “Role entrypoint could not be created:”, tells me that Azure (WaIISHost.exe) is trying to enumerate the types in the role entry point module (MissingDependency.dll) to find the class that inherits from RoleEntryPoint so that it knows where to call the OnStart and Run methods, but it failed for some reason.
  • The rest of the lines show the exception being raised which ultimately is causing the process to exit.

The exception message and callstack tell us that WaIISHost.exe was not able to find the Microsoft.Synchronization DLL when trying to load the role entry point class and running the code in MissingDependency.WebRole..ctor(). 
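
Before touching the project, you can confirm the diagnosis on the VM itself.  A minimal sketch (the approot path matches the ModLoad output above):

# Is the assembly named in the exception actually in the deployed bin folder?
Test-Path 'E:\approot\bin\Microsoft.Synchronization.dll'

# List what did ship next to the role entry point DLL
Get-ChildItem 'E:\approot\bin' -Filter *.dll | Select-Object Name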

 

Intellitrace

The above shows how to do a live debug which has some nice benefits – no need to redeploy in order to troubleshoot so it can be much faster if you are experienced with debugging, and you are in a much richer debugging environment which is often required for the most complex problem types.  But for issues such as role recycles it is often easier to turn on Intellitrace and redeploy.  For more information about setting up and using Intellitrace see http://msdn.microsoft.com/en-us/library/windowsazure/ff683671.aspx or http://blogs.msdn.com/b/jnak/archive/2010/06/07/using-intellitrace-to-debug-windows-azure-cloud-services.aspx.

For this particular issue I redeployed the application with Intellitrace turned on and was quickly able to get to the root cause.

image

image
 

 

Solution

Typically once I think I have found the root cause of a problem I like to validate the fix directly within the VM before spending the time to fix the problem in the project and redeploy.  This is especially valuable if there are multiple things wrong (ie. multiple dependent DLLs that are missing) so that you don’t spend a couple hours in a fix/redeploy cycle.  See http://blogs.msdn.com/b/kwill/archive/2013/09/05/how-to-modify-a-running-azure-service.aspx for more information about making changes to an Azure service.

Applying the temporary fix:

  1. On your dev machine, check the reference properties in Visual Studio to see where the DLL lives.
  2. Copy that DLL to the Azure VM into the same folder as the role entry point DLL (e:\approot\bin\MissingDependency.dll in this case).
  3. On the Azure VM, close WinDBG in order to let WaIISHost.exe finish shutting down which will then let Azure recycle the host processes and attempt to restart WaIISHost.

Validating the fix:

  • The easiest way to validate the fix is to just watch Task Manager to see if WaIISHost.exe starts and stays running.
  • You should also validate that the role reaches the Ready state.  You can do this 3 different ways:
    • Check the portal.  This may take a couple minutes for the HTML portal to reflect the current status.
    • Open C:\Logs\WaAppAgent.log and scroll to the end.  You are looking for “reporting state Ready.”
    • Within AzureTools download the DebugView.zip tool.  Run DebugView.exe and check Capture –> Capture Global Win32.  You will now see the results of the app agent heartbeat checks in real time.

Applying the solution:

At this point we have validated that the only problem is the missing Microsoft.Synchronization.DLL so we can go to Visual Studio and mark that reference as CopyLocal=true and redeploy.

Windows Azure Storage Analytics SDP Package


In a previous post we looked at the Windows Azure PaaS SDP package which allows you to quickly and easily gather all of the log data to determine root cause for a variety of PaaS compute issues.  This post will look at a new SDP package which allows you to quickly and easily gather all of the storage analytics logs.

 

Getting the SDP Package

This package will only work on Windows 7 or later, or Windows Server 2008 R2 or later.

  1. Open PowerShell
  2. Copy/Paste and Run the following script

md c:\Diagnostics; Import-Module bitstransfer; Start-BitsTransfer http://dsazure.blob.core.windows.net/azuretools/AzureStorageAnalyticsLogs_global.DiagCab c:\Diagnostics\AzureStorageAnalyticsLogs_global.DiagCab; c:\Diagnostics\AzureStorageAnalyticsLogs_global.DiagCab

Alternatively you can download and save the .DiagCab directly from http://dsazure.blob.core.windows.net/azuretools/AzureStorageAnalyticsLogs_global.DiagCab.

 

Running the SDP Package

  1. Enter the storage account name
    image
     
  2. Enter the storage account key.  Note that this key is only temporarily used within the SDP package utility.  It is not saved or transferred.
    image
     
  3. Enter the starting time and ending time.  The default values will gather logs from the past 24 hours
    image   .image
     
  4. Select the analytics logs to gather.
    image
     
  5. When the tool is finished gathering data click Next and an Explorer window will open showing the latest.cab which is a compressed file containing the most recent set of data, along with folders containing the data from each time the SDP package was run.
    image

 

The Results

There will be several files created as a result of running this SDP package.  The important ones are:

  1. ResultReport.xml.  This file lists the data collected and includes the storage account name and time range specified.  In the future we will include intelligent analytics results within this file (ie. “Event <x> found in Blob logs.  This usually indicates <y>.  You can find more information at <link>”).
  2. *.csv.  These are the raw data files containing the logs and metrics.  A header line is included in each file to make analysis easy.  The headers correspond to the Logs format and Metrics format.
  3. *.xlsx.  If Excel is installed on the computer running the SDP package then these .xlsx files will be created which include pre-built charts showing the most commonly used metrics along with the option to select additional metrics.

 

Excel charts (*.xlsx)

You can add or remove metrics from the Excel charts using the standard Chart filter tools:

image

 

Logs (*.csv)

You can easily filter and sort the .CSV files within Excel.  The following filter can help identify potentially inefficient queries by identifying requests that take longer than X number of milliseconds on the server:

image
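
The same filter can be scripted against the raw .csv.  A sketch, assuming the column names from the storage analytics Logs format documentation and a hypothetical output file name from the SDP run:

# Requests whose server-side processing took longer than 1000 ms
Import-Csv 'C:\Diagnostics\latest\blob_logs.csv' |
    Where-Object { [double]$_.'server-latency-in-ms' -gt 1000 } |
    Select-Object 'request-start-time', 'operation-type', 'server-latency-in-ms', 'requested-object-key'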

 

Look for additional blog posts in the future which walk through using the analytics data to identify and solve common issues.

 

Additional Resources

http://msdn.microsoft.com/en-us/library/windowsazure/hh343270.aspx – In-depth documentation about storage analytics and what each field means.

http://www.windowsazure.com/en-us/documentation/articles/storage-monitor-storage-account/ – How to enable and use metrics from the Azure Management Portal.

http://channel9.msdn.com/Series/DIY-Windows-Azure-Troubleshooting/Storage-Analytics – A short 5 and a half minute video showing how to enable and use storage analytics.

Windows Azure Diagnostics – Upgrading from Azure SDK 2.4 to Azure SDK 2.5


Overview

Windows Azure SDK 2.4 and prior used Windows Azure Diagnostics 1.0 which provided multiple options for configuring diagnostics collection, including code based configuration using the Microsoft.WindowsAzure.Diagnostics classes.  Windows Azure SDK 2.5 introduces Windows Azure Diagnostics 1.2 which streamlines the configuration and adds enhanced capabilities for post-deployment enablement, however it removes the code based configuration capabilities.  This blog post will describe how to convert your older code based diagnostics configuration to the new WAD 1.2 XML based configuration model.

 

For more information about WAD 1.2 see http://azure.microsoft.com/en-us/documentation/articles/cloud-services-dotnet-diagnostics/.

 

 

Specifying Diagnostics Configuration in XML

In Azure SDK 2.4 it was possible to configure diagnostics through code as well as XML configuration (the diagnostics.wadcfg file), and developers needed to understand the precedence rules of how the diagnostics agent picks up the configuration settings (see here for more information). Azure SDK 2.5 removes this complexity and the diagnostics agent will now always use the XML configuration. This has some implications when you migrate your Azure SDK 2.4 project to Azure SDK 2.5. If your Azure SDK 2.4 project already uses the XML based diagnostics configuration then when you upgrade the project in Visual Studio to target Azure SDK 2.5, Visual Studio will automatically update the XML based configuration to the new format (diagnostics.wadcfgx). If your project continued to use the code based configuration (for example, using the API in Microsoft.WindowsAzure.Diagnostics and Microsoft.WindowsAzure.Diagnostics.Management) then when it's upgraded to SDK 2.5 you will get build warnings which will inform you of the deprecated APIs. The diagnostics data will not be collected unless you configure your diagnostics using the XML file (diagnostics.wadcfgx) or through the Visual Studio diagnostics configuration UI.

 

Let’s look at an example. In Azure SDK 2.4 you could be using the code below to update the diagnostics configuration for the diagnostics infrastructure logs. In this particular example the code sets the LogLevel for the diagnostics infrastructure logs to Error and the transfer period to 5 minutes. When you migrate this project to SDK 2.5 you will get a build warning around this code indicating that the API is deprecated: “Warning: ‘Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitor’ is obsolete: ‘This API is deprecated’”. The project will still build and deploy but it won’t update the diagnostics configuration.

 

image

 

To configure the diagnostics infrastructure logs in Azure SDK 2.5 you must remove that code based configuration and update the diagnostics configuration xml file (diagnostics.wadcfgx) associated with your role. Visual Studio provides a configuration UI to let you specify these diagnostics settings. To configure the diagnostics through Visual Studio you can right click on the Role and select Properties to display the Role Designer. On the Configuration tab make sure Enable Diagnostics is selected and click Configure. This will bring up the Diagnostics configuration UI where you can go to the Infrastructure Logs tab, select Enable transfer of Diagnostics Infrastructure Logs, and set the Transfer Period and Log Level appropriately.

 

image

 

Behind the scenes Visual Studio is updating the diagnostics.wadcfgx file with the appropriate xml to configure the infrastructure logs. See highlighted as an example:

<?xml version="1.0" encoding="utf-8"?>
<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <PublicConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <WadCfg>
      <DiagnosticMonitorConfiguration overallQuotaInMB="4096">
        <DiagnosticInfrastructureLogs scheduledTransferPeriod="PT5M" scheduledTransferLogLevelFilter="Error" />
      </DiagnosticMonitorConfiguration>
    </WadCfg>
  </PublicConfig>
  <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <StorageAccount name="" endpoint="" />
  </PrivateConfig>
  <IsEnabled>true</IsEnabled>
</DiagnosticsConfiguration>

 

The diagnostics configuration dialog in Visual Studio lets you configure other settings like Application Logs, Windows Event Logs, Performance Counters, Infrastructure Logs, Log Directories, ETW Logs and Crash Dumps.

Let’s look at another example for Performance Counters. With Azure SDK 2.4 (and prior) you would have to do the following in code:
image

 

With Azure SDK 2.5 you can use the diagnostics configuration UI to enable the default performance counters:

image

 

For custom performance counters you still have to instrument your code to create the custom performance category and counters as before. Once you have instrumented your code you can enable transferring these custom performance counters to storage by adding them to the list of performance counters in the diagnostics configuration UI. E.g. for a custom performance counter you can enter the category and name “\SampleCustomCategory\Total Button1 Clicks” in the text box and click Add.

 

Visual Studio will automatically add the performance counters selected in the UI to the diagnostics configuration file diagnostics.wadcfgx:

<?xml version="1.0" encoding="utf-8"?>
<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <PublicConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <WadCfg>
      <DiagnosticMonitorConfiguration overallQuotaInMB="4096">
        <PerformanceCounters scheduledTransferPeriod="PT1M">
          <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT3M" />
          <PerformanceCounterConfiguration counterSpecifier="\Memory\Available MBytes" sampleRate="PT3M" />
        </PerformanceCounters>
      </DiagnosticMonitorConfiguration>
    </WadCfg>
  </PublicConfig>
  <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <StorageAccount name="" endpoint="" />
  </PrivateConfig>
  <IsEnabled>true</IsEnabled>
</DiagnosticsConfiguration>
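
If you also added the custom counter from the earlier example, the same PerformanceCounters node would pick up one more element along these lines (the sampleRate shown is an assumption):

<PerformanceCounterConfiguration counterSpecifier="\SampleCustomCategory\Total Button1 Clicks" sampleRate="PT1M" />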

 

For WindowsEventLog data you can enable transfer of logs for common data sources directly from the diagnostics configuration UI.

image

 

For custom event logs that you define in your code, for example:

 

EventLog.CreateEventSource("GuestBookSource", "Application");

EventLog.WriteEntry("GuestBookSource", "WebRole Started", EventLogEntryType.Error, 9191);

 

You have to manually add the custom source to the wadcfgx XML file under the WindowsEventLog node. Make sure the name matches the name specified in code:

 

<WindowsEventLog scheduledTransferPeriod="PT1M">
   <DataSource name="Application!*" />
   <DataSource name="GuestBookSource!*" />
</WindowsEventLog>

 

 

Enabling diagnostics extension through PowerShell

 

Since Azure SDK 2.5 uses the extension model, the diagnostics extension, its configuration, and the connection string to the diagnostics storage account are no longer part of the deployment package and cscfg. All the diagnostics configuration is contained within the wadcfgx file. The advantage of this approach is that the diagnostics agent and settings are decoupled from the project and can be dynamically enabled and updated even after your application is deployed.

 

Due to this change some existing workflows need to be rethought – instead of configuring the diagnostics as part of the application that gets deployed to each environment you can first deploy the application to the environment and then apply the diagnostics configuration for it.  When you publish the application from Visual Studio this process is done automatically for you. However if you were deploying your application outside of VS using PowerShell then you have to install the extension separately through PowerShell.

 

The PowerShell cmdlets for managing the diagnostics extension on a Cloud Service are:

  • Set-AzureServiceDiagnosticsExtension
  • Get-AzureServiceDiagnosticsExtension
  • Remove-AzureServiceDiagnosticsExtension

You can use the Set-AzureServiceDiagnosticsExtension cmdlet to enable the diagnostics extension on a cloud service. One of the parameters on this cmdlet is the XML configuration file. This file is slightly different from the diagnostics.wadcfgx file. You can create this file from scratch as described here, or you can modify the wadcfgx file and pass in the modified file as a parameter to the PowerShell cmdlet.

 

To modify the wadcfgx file:

  1. Make a copy of the .wadcfgx file.

  2. Remove the following elements from the copy:

<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
   <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
     <StorageAccount name=" " endpoint="https://core.windows.net/" />
   </PrivateConfig>
   <IsEnabled>false</IsEnabled>
</DiagnosticsConfiguration>

  3. Make sure the top of the file still has the xml version and encoding declaration:

<?xml version="1.0" encoding="utf-8"?>

 

Effectively you are stripping down the Wadcfgx to only contain the <PublicConfig> section and the <?xml> header. You can then call the PowerShell cmdlet along with the appropriate parameters for the staging slots and roles:

$storage_name = '<storagename>'
$key = '<key>'
$service_name = '<servicename>'
$public_config = '<thepublicconfigfrom_diagnostics.wadcfgx>'

$storageContext = New-AzureStorageContext -StorageAccountName $storage_name -StorageAccountKey $key

Set-AzureServiceDiagnosticsExtension -StorageContext $storageContext -DiagnosticsConfigurationPath $public_config -ServiceName $service_name -Slot 'Staging' -Role 'WebRole1'

 

How to Restrict RDP Access in an Azure PaaS Cloud Service


 

A question I see periodically is how to restrict RDP access for PaaS services to specific network IP addresses.  In the past this has always been difficult to do and the typical solution was to use a Startup task to configure firewall rules (ie. using Set-NetFirewallRule or netsh advfirewall per http://msdn.microsoft.com/en-us/library/azure/jj156208.aspx).  This technique generally works fine, but it introduces the extra complexity of a startup task and is not built into the Azure platform itself.

Network ACLs

With the (relatively) recent introduction of network ACLs it becomes much easier to robustly secure an input endpoint on a cloud service.  My colleague Walter Myers has a great blog post about how to enable network ACLs for PaaS roles at http://blogs.msdn.com/b/walterm/archive/2014/04/22/windows-azure-paas-acls-are-here.aspx.  To apply a network ACL to the RDP endpoint it is simply a matter of defining your ACL rules targeting the role which imports the RemoteForwarder plugin, and specifying the name of the RDP endpoint in the endPoint attribute.

Here is the resulting NetworkConfiguration section to add to the CSCFG file:

  <NetworkConfiguration>
    <AccessControls>
      <AccessControl name="RDPRestrict">
        <Rule action="permit" description="PermitRDP" order="100" remoteSubnet="167.220.26.0/24" />
        <Rule action="deny" description="DenyRDP" order="200" remoteSubnet="0.0.0.0/0" />
      </AccessControl>
    </AccessControls>
    <EndpointAcls>
      <EndpointAcl role="WebRole1" endPoint="Microsoft.WindowsAzure.Plugins.RemoteForwarder.RdpInput" accessControl="RDPRestrict" />
    </EndpointAcls>
  </NetworkConfiguration>

 

 

Important information:

  1. You must enable RDP in the package before publishing your service.  The new model of enabling RDP post-deployment via the management portal or extension APIs will not work.  You can enable RDP in the package using Visual Studio by right-clicking the cloud service project in Solution Explorer and selecting ‘Configure Remote Desktop…’, or in the Publish wizard by checking the ‘Enable Remote Desktop for all roles’ checkbox.
  2. The role="WebRole1" attribute must specify the role which imports the RemoteForwarder plugin.  You can look in the CSDEF file and find the role which has <Import moduleName="RemoteForwarder" />.  If you have multiple roles in your service then all of them will import RemoteAccess, but only one of them will import RemoteForwarder and you must specify the role which imports RemoteForwarder.
  3. The network configuration defined above will restrict all clients except for those with IP addresses in the 167.220.26.0/24 range (167.220.26.0-167.220.26.255).  See Walter’s blog post for more information about how to specify network ACLs, and the MSDN documentation (http://msdn.microsoft.com/en-us/library/azure/dn376541.aspx) for more information about the order and precedence of rules.  A quick way to verify the ACL from a client is sketched after this list.
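
A minimal sketch of that check from a client (requires Windows 8.1 / Windows Server 2012 R2 or later; the service name is hypothetical):

# From a permitted IP this reports TcpTestSucceeded : True;
# from a denied IP the TCP connection attempt times out
Test-NetConnection -ComputerName 'myservice.cloudapp.net' -Port 3389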

Cloud Service RDP Configuration not available via portal


There are 2 ways to enable RDP for a PaaS cloud service – via the CSPKG using the RemoteAccess plugin, or via the portal (or PowerShell or REST API) using an extension.  There was a recent change to the management portal that disabled the ability to configure the plugin style RDP settings after deployment.  The error you get is:

“Failed to launch remote desktop configuration.”

And if you click Details you get:

“Remote Desktop is enabled for this deployment using the RemoteAccess module, which is specified in the ServiceDefinition.csdef file. To allow configuring Remote Desktop using the management portal, remove the RemoteAccess module and update the deployment. Learn More.”

image

image

   

The best solution is to remove the RDP plugin from the package, rebuild the package, and redeploy.  This will let you enable RDP via the portal after deployment and manage the configuration in the portal.  This also provides a more reliable RDP experience since this shifts the RDP functionality from a static feature in your CSPKG to an extension that can be managed and upgraded by the Azure guest agent.  To disable RDP in your package you can right-click the cloud service project, select ‘Configure Remote Desktop’ and then uncheck the ‘Enable connections for all roles’ option.

image

 

However, modifying and then redeploying the package is not always a viable solution.  For the scenarios where you want to update the configuration of an existing service you can use the following steps to manually update the configuration.

 

Download the configuration

Download the configuration and save a local .CSCFG file.  You can do this via the portal or Powershell.

To do this from the portal go to the Configure tab and click Download:

image

To do this from Powershell run:

$deployment = Get-AzureDeployment -ServiceName <servicename> -Slot <slot>

([xml]$deployment.Configuration).Save("c:\temp\config.cscfg")

 

Update the configuration setting

Open the .CSCFG file you saved, make any necessary changes, and then save the .CSCFG.

*See below for how to modify the password.
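
For reference, the RDP-related settings in the CSCFG are the RemoteAccess/RemoteForwarder plugin settings, which look like this (the values shown are placeholders):

<ConfigurationSettings>
  <Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.Enabled" value="true" />
  <Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountUsername" value="myusername" />
  <Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountEncryptedPassword" value="MIIB...base64..." />
  <Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountExpiration" value="2015-12-31T23:59:59.0000000+00:00" />
  <Setting name="Microsoft.WindowsAzure.Plugins.RemoteForwarder.Enabled" value="true" />
</ConfigurationSettings>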

 

Upload back to the portal

Upload the modified .CSCFG file back to the portal and wait for the configuration change to complete.

To do this from the portal go to the Configure tab and click Upload:

image

To do this from Powershell run:

Set-AzureDeployment -Config -ServiceName <servicename> -Configuration "c:\temp\config.cscfg" -Slot <slot>

 

Updating the password

Updating settings such as the expiration date or username is easy, but updating the password is more difficult.  The RDP password is encrypted using a user-provided certificate that is uploaded to the cloud service.  If you are making these RDP changes on the computer that already has that certificate in the cert store (ie. from the dev/deploy machine) then getting a new encrypted password is pretty straightforward:

  1. Open a ‘Windows Azure Command Prompt’
  2. Execute: csencrypt encrypt-password -copytoclipboard -thumbprint <thumbprint_from_CSCFG>
  3. Paste the new password into the AccountEncryptedPassword setting in the .CSCFG

If you don’t have the certificate in the local computer’s cert store you can download it and generate a new encrypted password using PowerShell:

# Download the cert
$cert = Get-AzureCertificate -ServiceName <servicename> -Thumbprint <thumbprint_from_CSCFG> -ThumbprintAlgorithm "SHA1"


# Save the cert to a file
  "-----BEGIN CERTIFICATE-----" > "C:\Temp\rdpcert.cer"
  $cert.Data >> "C:\Temp\rdpcert.cer"
  "-----END CERTIFICATE-----" >> "C:\Temp\rdpcert.cer"


# Prompt for the new password
  $password = Read-Host -Prompt "Enter the password to encrypt"


# Load the certificate
  [System.Reflection.Assembly]::LoadWithPartialName("System.Security") | Out-Null
  $cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2("c:\temp\rdpcert.cer")
  $pass = [Text.Encoding]::UTF8.GetBytes($password)


# Encrypt the password with the certificate
  $content = new-object Security.Cryptography.Pkcs.ContentInfo -argumentList (,$pass)
  $env = new-object Security.Cryptography.Pkcs.EnvelopedCms $content
  $env.Encrypt((new-object System.Security.Cryptography.Pkcs.CmsRecipient($cert)))


# Write the new password to a file and load the file
  write-host "Writing encrypted password, cut/paste the text below the line to CSCFG file"
  [Convert]::ToBase64String($env.Encode()) > "c:\temp\encrypted_password.txt"
  Invoke-Item "c:\temp\encrypted_password.txt"

 


Azure Cloud Services only support SHA-1 Thumbprint Algorithm


 

“My certificate provider recently switched to only providing SHA2/SHA256 certificates because SHA-1 certificates are no longer safe.  But Azure only supports SHA1 certificates!  https://msdn.microsoft.com/library/azure/gg465718.aspx says ‘The only thumbprint algorithm currently supported is sha1’”.

 

Lately I have been seeing this issue more often due to some larger cert providers recently making this change.  The entire industry has been deprecating SHA-1 certificates for a while and Chrome has recently started showing warnings in the browser.

 

Signing algorithm vs. Thumbprint algorithm

The issue stems from confusion between the two types of algorithms used by certificates.

  • Signing algorithm.  This is the algorithm used to actually sign the certificate and this is what makes the certificate secure (or in the case of SHA-1, less secure).  The signing algorithm and resulting signature is specified by the certificate authority when it creates the cert and is built into the cert itself.  This algorithm is where SHA1 is being deprecated.  Azure doesn't know or care what this algorithm is.

  • Thumbprint algorithm.  This algorithm is used to generate a thumbprint in order to uniquely identify and find a certificate.  This algorithm and value is not built into the certificate but is instead calculated whenever a cert lookup is done.  Multiple thumbprints can be generated using different algorithms all from the same certificate data.  The thumbprint has nothing to do with certificate security since it is just used to identify/find the cert within the cert store.  Windows, .NET, and Azure all use SHA-1 as the thumbprint algorithm, and sha1 is the only algorithm allowed in the ServiceConfiguration.cscfg file:

<Certificates>
  <Certificate name="Certificate1" thumbprint="69BF333452DAA85E462E33B138F3B65842C8B428" thumbprintAlgorithm="sha1" />
</Certificates>

 

Solution

You can use your SHA2/SHA256 signed certificate in Azure, you just have to specify an SHA1 thumbprint.  Your certificate provider should be able to provide you with an SHA1 thumbprint, but it is relatively straightforward to find or calculate the SHA1 thumbprint on your own.  Here are a few options:

  1. The easiest option is to simply open the certificate in the Certificate Manager on any Windows OS.  Windows will display the SHA-1 thumbprint in the certificate properties window.
  2. On a Windows OS you can run ‘certutil -store my’ (replace my with whatever store your cert is in).
  3. In PowerShell you can call System.Security.Cryptography.SHA1Cng.ComputeHash per http://ig2600.blogspot.com/2010/01/how-do-you-thumbprint-certificate.html (see the sketch after this list).
  4. There are a few other options including .NET code, openssl, and Apache Commons Codec at http://stackoverflow.com/questions/1270703/how-to-retrieve-compute-an-x509-certificates-thumbprint-in-java.
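
As a minimal sketch of option 3, assuming the certificate has been exported to a .cer file at a placeholder path, PowerShell can produce the SHA-1 thumbprint directly:

# Load the certificate from a file (the path is a placeholder)
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2("c:\temp\mycert.cer")

# The Thumbprint property is the SHA-1 hash of the certificate, formatted for the CSCFG
$cert.Thumbprint

# Equivalently, compute the hash explicitly from the raw certificate bytes
$sha1 = [System.Security.Cryptography.SHA1]::Create()
[BitConverter]::ToString($sha1.ComputeHash($cert.GetRawCertData())) -replace "-",""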

 

Thank you to Morgan Simonsen for his excellent blog post Understanding X.509 digital certificate thumbprints which details the different certificate algorithms.

 

SDK 2.5 / WAD 1.2 --- IIS Logs Not Transferring to Storage in PaaS WebRoles


After upgrading to Azure SDK 2.5 with Windows Azure Diagnostics 1.2 (see http://blogs.msdn.com/b/kwill/archive/2014/12/02/windows-azure-diagnostics-upgrading-from-azure-sdk-2-4-to-azure-sdk-2-5.aspx) you may notice that IIS logs and failed request (FREB) logs are no longer transferred to storage.

 

Root Cause

When WAD generates the diagnostics configuration it queries the IIS Management Service to find the location of the IIS logs, and by default this location is set to %SystemDrive%\inetpub\logs\LogFiles.  In a PaaS WebRole IISConfigurator will configure IIS according to your service definition, and part of this setup changes the IIS log file location to C:\Resources\directory\{deploymentid.rolename}.DiagnosticStore\LogFiles\Web.  The WAD configuration happens prior to IISConfigurator running which means WAD is watching the wrong folder for IIS logs.

 

Workaround

To work around this issue you have to restart the WAD diagnostics agent after IISConfigurator has setup IIS.  When the WAD diagnostics agent starts up again it will query the IIS Management Service for the IIS log file location and will get the correct C:\Resources\directory\{deploymentid.rolename}.DiagnosticStore\LogFiles\Web location.

The two ways to restart the diagnostics agent are:

  1. Reboot the VM.  This can be done from the portal or from an RDP session with the VM.
  2. Update the WAD configuration, which will cause the diagnostics agent to refresh its configuration.  This can be done from Visual Studio (Server Explorer –> Cloud Services –> Right-click a role –> Update Diagnostics –> Make any change and update) or from PowerShell (see this post).

One problem with these two options is that you have to manually do this for each role/VM in your service after deploying.  The bigger problem is that any operation which recreates the Windows (D: drive) partition will also reset the IIS log file location to the default %SystemDrive% location which will cause the diagnostics agent to again get the wrong location.  This will happen to all instances roughly once per month for Guest OS updates, or randomly to single instances due to service healing (see this and this for more info).

 

Resolution

The WAD dev team is working to fix this issue with the next Azure SDK release.  In the meantime you can add the following code to your WebRole.OnStart method in order to automatically reboot the VM once during initial startup.

public override bool OnStart()
{
    // For information on handling configuration changes
    // see the MSDN topic at http://go.microsoft.com/fwlink/?LinkId=166357.

    // Write a RebootFlag.txt file to the %RoleRoot%\Approot\bin folder to track
    // whether this VM has already rebooted to fix the WAD IIS logs issue
    string path = "RebootFlag.txt";

    // If RebootFlag.txt already exists then skip rebooting the VM
    if (!System.IO.File.Exists(path))
    {
        System.IO.File.WriteAllText(path, "Writing RebootFlag at " + DateTime.Now.ToString("O"));
        System.Diagnostics.Trace.WriteLine("Rebooting");
        System.Diagnostics.Process.Start("shutdown", "/r /t 0");
    }

    return base.OnStart();
}

 

Note that this code uses a file on the %RoleRoot% drive as a flag, so this will also cause an additional reboot in extra scenarios such as portal reboots and in-place upgrades (see this post), but these scenarios are rare enough that it should not cause an issue in your service.  If you wish to avoid these extra reboots you can set the role runtime execution context to elevated by adding <Runtime executionContext="elevated" /> to the .csdef file and then write either a file to the %SystemRoot% drive or a flag to the registry, as sketched below.
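
A minimal sketch of the registry-flag variant, assuming the role runs elevated; the subkey and value names here are hypothetical:

public override bool OnStart()
{
    // Hypothetical subkey and value names; requires <Runtime executionContext="elevated" />
    // in the .csdef because writing to HKLM needs administrative rights
    using (var key = Microsoft.Win32.Registry.LocalMachine.CreateSubKey(@"SOFTWARE\MyService"))
    {
        // Only reboot if this VM has not rebooted for this fix yet
        if (key.GetValue("WadRebootFlag") == null)
        {
            key.SetValue("WadRebootFlag", DateTime.UtcNow.ToString("O"));
            System.Diagnostics.Trace.WriteLine("Rebooting");
            System.Diagnostics.Process.Start("shutdown", "/r /t 0");
        }
    }

    return base.OnStart();
}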

Topology Blast–Send topology change updates to all instances at once


 

Windows Azure SDK 2.2 introduces the concept of a topology blast.  This blog post will describe how topology changes happen at the fabric level and how you can take advantage of topology blast to build a more robust service.

 

Definitions:

  • Role Topology – The number of instances, number of internal endpoints, and composition of internal endpoints (ie. internal IP addresses of instances, also known as DIP addresses).
  • Topology Change – Any change in this topology.  Typically a scale up or down of the number of instances, or a service healing event which causes one VM to move to a new physical server and obtain a new internal IP address.
  • Rolling Upgrade – The process the fabric controller uses to make changes to a hosted service.  The fabric controller will send the change to Upgrade Domain #0, wait for all instances in UD #0 to return to the Ready state, and then move to UD #1, continuing until all UDs have been walked.  See the Upgrade Domain information at http://msdn.microsoft.com/en-us/library/windowsazure/hh472157.aspx.
  • Topology Blast – A new feature to allow topology changes to be sent to all UDs at one time, bypassing the normal UD walk.
  • topologyChangeDiscovery – Use this .csdef <ServiceDefinition> attribute to control the type of topology change your service receives.  topologyChangeDiscovery="Blast" will turn on Topology Blast.
  • RoleEnvironment.SimultaneousChanged/SimultaneousChanging – These events are raised in your code when a topology change happens and you have set topologyChangeDiscovery="Blast".

* Note that you must have an InternalEndpoint defined in order to receive topology changes.  If you turn on RDP for your service an internal endpoint is implicitly created.
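
For reference, a minimal internal endpoint definition in the CSDEF looks like this (a sketch; the role name, endpoint name, and protocol are placeholders):

<WorkerRole name="WorkerRole1" vmsize="Small">
  <Endpoints>
    <InternalEndpoint name="InternalTcp1" protocol="tcp" />
  </Endpoints>
</WorkerRole>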

 

 

Picture 1 – Hosted service with 3 instances

Picture 1

 

Picture 1 shows a standard Windows Azure hosted service with 3 role instances.  Each instance has a Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CurrentRoleInstance.InstanceEndpoints list with the endpoints pointing to the correct DIPs for all other instances.

 

 

Picture 2 – Standard topology change and rolling upgrade

image

 

Picture 2 shows a standard rolling upgrade after a topology change has occurred.

  1. The server hosting IN_2 (with original DIP 10.31.70.8) has had a failure.  The fabric controller automatically detects this failure and recreates IN_2 on a new server.  IN_2 receives a new DIP of 10.25.18.2.  This constitutes a topology change.
  2. The fabric controller begins the rolling upgrade process in order to notify the rest of the instances that there has been a topology change.  Each instance will receive a RoleEnvironment.Changing and RoleEnvironment.Changed event, with the change of type ServiceRuntime.RoleEnvironmentTopologyChange.  The InstanceEndpoints list will be updated with the new DIP(s).
  3. In Picture 2 the fabric controller is currently processing UD #0 and IN_0 has a correct InstanceEndpoints list.  When IN_0 attempts to communicate with IN_2 using an InternalEndpoint it will connect to the new DIP 10.25.18.2.
  4. IN_1 is in UD #1 and has not yet been notified of the topology change and IN_1 is still using the old incorrect InstanceEndpoints list.  When IN_1 attempts to communicate with IN_2 using an InternalEndpoint it will fail to connect to the old DIP 10.31.70.8.

Depending on the architecture of your service and how well your code tolerates communication failures, this scenario, where some instances have the correct InstanceEndpoints list and some have an incorrect one, can cause significant problems in your application.

 

 

Picture 3 – Topology change with Topology Blast enabled

image

Picture 3 shows a topology blast after a topology change has occurred.

  1. The server hosting IN_2 (with original DIP 10.31.70.8) has had a failure.  The fabric controller automatically detects this failure and recreates IN_2 on a new server.  IN_2 receives a new DIP of 10.25.18.2.  This constitutes a topology change.
  2. The fabric controller initiates a topology change.  Because the service has set topologyChangeDiscovery="Blast" the fabric will initiate a topology blast and send the topology change to all instances at the same time.
  3. Both IN_0 and IN_1 receive a RoleEnvironment.SimultaneousChanging event at the same time with an updated InstanceEndpoints list.  Both instances are now able to successfully communicate with IN_2.

Note that your architecture must still be tolerant of the communication failures that will happen from the time that the server hosting IN_2 fails until the fabric recreates IN_2 and sends the topology change.

 

 

Turning on Topology Blast

Topology Blast is enabled per deployment.  To turn this on for a deployment set topologyChangeDiscovery="Blast" in the csdef.  Your service will now begin receiving topology blast configuration changes.

<ServiceDefinition name="waTestFramework"
                   topologyChangeDiscovery="Blast"
                   xmlns="http://schemas.microsoft.com/ServiceHosting/2008/10/ServiceDefinition"
                   schemaVersion="2013-10.2.2">

 

Optionally, if you need to execute code to respond to a topology change you can implement the following events:

Topology changes will now raise the RoleEnvironment.SimultaneousChanged/SimultaneousChanging events instead of the default Changed/Changing events.  Add handlers for these two events in OnStart and then implement your code in the appropriate event handler.  These new events behave the same as the old ones with two exceptions:

  • With topology blast turned on, only topology changes will fire the Simultaneous* events and all other types of changes will fire the standard Changed/Changing events.
  • The SimultaneousChangingEventArgs does not implement a Cancel property.  This is to prevent all role instances from recycling at the same time.

 

public override bool OnStart()
{
    // For information on handling configuration changes
    // see the MSDN topic at http://go.microsoft.com/fwlink/?LinkId=166357.
    RoleEnvironment.SimultaneousChanged += RoleEnvironment_SimultaneousChanged;
    RoleEnvironment.SimultaneousChanging += RoleEnvironment_SimultaneousChanging;

    return base.OnStart();
}

void RoleEnvironment_SimultaneousChanging(object sender, SimultaneousChangingEventArgs e)
{
    // Add code to run before the InstanceEndpoints list is updated
    // WARNING: Make sure you do not call RequestRecycle or throw an unhandled exception
}

void RoleEnvironment_SimultaneousChanged(object sender, SimultaneousChangedEventArgs e)
{
    // Add code to run after the InstanceEndpoints list is updated
}

Authenticating Storage Requests Using SharedKeyAuthenticationHandler


 

With the older version of the storage client library (version 1.7) you could sign HttpWebRequests using the SignRequestLite function, and there were several examples on the web of how to do this.  The SignRequestLite function has been removed from the 2.0+ versions of the storage client library and so far I have not seen any examples of using the new signing functions.

 

The new SCL uses Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler to sign the request, and Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyTableCanonicalizer (for table storage requests) or Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer (for blob and queue storage requests) to canonicalize the request.  In the simplest form, this is the code (using the blob canonicalizer) to sign an HttpWebRequest object:

Microsoft.WindowsAzure.Storage.Auth.StorageCredentials credentials = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(storageAccountName, storageAccountKey);
Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler auth;
Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer canonicalizer = Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer.Instance;
auth = new Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler(canonicalizer, credentials, storageAccountName);
auth.SignRequest(httpWebRequest, null);

 

For a more full-featured example (you can largely ignore the MakeRequest function since it is just a generic HTTP request/response handler):


using System.Net;
using Microsoft.WindowsAzure.Storage.Auth;
using Microsoft.WindowsAzure.Storage.Auth.Protocol;

private HttpWebRequest SignRequest(HttpWebRequest request, bool isTable = false)
{
    // Create a StorageCredentials object with account name and key
    string storageAccountName = "kwillstorage2";
    string storageAccountKey = "S4RUcCvGKBPoFvhyJ0p6Wu0ciJnPTn5+b5MgU5olWhqGABAfvFhMFCbOBSeDgL9VF27TFrYzCQnRHYkbgJgxxg==";
    Microsoft.WindowsAzure.Storage.Auth.StorageCredentials credentials = new Microsoft.WindowsAzure.Storage.Auth.StorageCredentials(storageAccountName, storageAccountKey);

    // Create the SharedKeyAuthenticationHandler which is used to sign the HttpWebRequest
    Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler auth;
    if (isTable)
    {
        // Tables use SharedKeyTableCanonicalizer along with storage credentials and storage account name
        Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyTableCanonicalizer canonicalizertable = Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyTableCanonicalizer.Instance;
        auth = new Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler(canonicalizertable, credentials, storageAccountName);
        if (request.Headers["MaxDataServiceVersion"] == null)
        {
            request.Headers.Add("MaxDataServiceVersion", "2.0;NetFx");
        }
    }
    else
    {
        // Blobs and Queues use SharedKeyCanonicalizer
        Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer canonicalizer = Microsoft.WindowsAzure.Storage.Core.Auth.SharedKeyCanonicalizer.Instance;
        auth = new Microsoft.WindowsAzure.Storage.Auth.Protocol.SharedKeyAuthenticationHandler(canonicalizer, credentials, storageAccountName);
    }

    // Sign the request which will add the Authorization header
    auth.SignRequest(request, null);
    return request;
}

private void button1_Click(object sender, EventArgs e)
{
    HttpWebRequest requestBlob, requestTable;

    requestBlob = (HttpWebRequest)HttpWebRequest.Create("https://kwillstorage2.blob.core.windows.net/vsdeploy?restype=container&comp=list");
    requestBlob.Method = "GET";
    requestBlob.Headers.Add("x-ms-version", "2014-02-14");
    SignRequest(requestBlob);
    MakeRequest(requestBlob);

    requestTable = (HttpWebRequest)HttpWebRequest.Create("https://kwillstorage2.table.core.windows.net/testtable");
    requestTable.Method = "GET";
    requestTable.Headers.Add("x-ms-version", "2014-02-14");
    SignRequest(requestTable, true);
    MakeRequest(requestTable);
}

private void MakeRequest(HttpWebRequest request)
{
    HttpWebResponse response = null;
    System.IO.Stream receiveStream;
    System.IO.StreamReader readStream;
    Encoding encode;

    try
    {
        response = (HttpWebResponse)request.GetResponse();
    }
    catch (WebException ex)
    {
        Console.WriteLine(request.Headers.ToString());
        Console.WriteLine(ex.Message + Environment.NewLine + Environment.NewLine + ex.Response.Headers.ToString());
        try
        {
            receiveStream = ex.Response.GetResponseStream();
            encode = System.Text.Encoding.GetEncoding("utf-8");
            // Pipes the stream to a higher level stream reader with the required encoding format.
            readStream = new System.IO.StreamReader(receiveStream, encode);
            Console.WriteLine(readStream.ReadToEnd());

            // Releases the resources of the error response (response itself is null here).
            ex.Response.Close();
            // Releases the resources of the Stream.
            readStream.Close();
        }
        catch
        {
        }
        return;
    }

    Console.WriteLine(request.Method + " " + request.RequestUri + " " + request.ProtocolVersion + Environment.NewLine + Environment.NewLine);
    Console.WriteLine(request.Headers.ToString());
    Console.WriteLine((int)response.StatusCode + " - " + response.StatusDescription + Environment.NewLine + Environment.NewLine);
    Console.WriteLine(response.Headers + Environment.NewLine + Environment.NewLine);

    receiveStream = response.GetResponseStream();
    encode = System.Text.Encoding.GetEncoding("utf-8");
    // Pipes the stream to a higher level stream reader with the required encoding format.
    readStream = new System.IO.StreamReader(receiveStream, encode);
    Console.WriteLine(readStream.ReadToEnd());

    // Releases the resources of the response.
    response.Close();
    // Releases the resources of the Stream.
    readStream.Close();
}

Troubleshooting Scenario 7 – Role Recycling


In Troubleshooting Scenario 1 we looked at a scenario where the role would recycle after deployment and the root cause was easily seen in the Windows Azure Event Logs.  This blog post will show another example of this same type of behavior, but with a different, and more difficult to find, root cause.  This is a continuation of the troubleshooting series.


Symptom

You have deployed your Azure hosted service and it shows as Recycling in the portal.  But there is no additional information such as an Exception type or error message.  The role status in the portal might switch between a few different messages such as, but not limited to:

  • Recycling (Waiting for role to start… System startup tasks are running.)
  • Recycling (Waiting for role to start… Sites are being deployed.)
  • Recycling (Role has encountered an error and has stopped. Sites were deployed.)


Get the Big Picture

Similar to the previous troubleshooting scenarios we want to get a quick idea of where we are failing.  Watching task manager we see that WaIISHost.exe starts for a few seconds and then disappears along with WaHostBootstrapper.

image

From the ‘Get the Big Picture’ section in Troubleshooting Scenario 1 we know that if we see WaIISHost (or WaWorkerHost) then the problem is most likely a bug in our code which is throwing an exception and that the Windows Azure and Application Event logs are a good place to start.


Check the logs

Looking at the Windows Azure event logs we don’t see any errors.  The logs show that the guest agent finishes initializing (event ID 20001), starts a startup task (10001), successfully finishes a start task (10002), then IIS configurator sets up IIS (10003 and 10004), and then the guest agent initializes itself again and repeats the loop.  No obvious errors or anything to indicate a problem other than the fact that we keep repeating this cycle a couple times per minute.

image

Next we will check the Application event logs to see if there is anything interesting there.

The Application event logs are even less interesting.  There is virtually nothing in there, and certainly nothing in there that would correlate to an application failing every 30 seconds.

image

As we have done in the previous troubleshooting scenarios we can check some of the other commonly used logs to see if anything interesting shows up.

WaHostBootstrapper logs
If we check the C:\Resources folder we will see several WaHostBootstrapper.log.old.{index} files.  WaHostBootstrapper.exe creates a new log file (and archives the previous one) every time it starts up, so based on what we were seeing in task manager and the Windows Azure event logs it makes sense to see lots of these host bootstrapper log files.  When looking at the host bootstrapper log file for a recycling role we want to look at one of the archived files rather than the current WaHostBootstrapper.log file.  The reason is that the current file is still being written, so depending on when you open the file it could be at any point in the startup process (ie. starting a startup task) and most likely won’t have any information about the crash or error which ultimately causes the processes to shut down.  You can typically pick any of the .log.old files, but I usually start with the most recent one.

The host bootstrapper log starts off normally and we can see all of the startup tasks executing and returning with a 0 (success) return code.  The log file ends like this:


[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:30.660, INFO ] Client reported status 1.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client DiagnosticsAgent.exe (1788).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86EF0) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client DiagnosticsAgent.exe (3752).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (2596).
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Client reported status 0.
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] Failed to connect to client RemoteAccessAgent.exe (3120).
[00002916:00001744, 2013/10/02, 22:09:31.285, ERROR] <- CRuntimeClient::OnRoleStatusCallback(0x00000035CFE86E00) =0x800706ba
[00002916:00001744, 2013/10/02, 22:09:31.285, INFO ] Getting status from client WaIISHost.exe (2976).
[00002916:00001744, 2013/10/02, 22:09:31.300, INFO ] Client reported status 2.

No error messages or failures (remember from scenario 2 that we can ignore the ‘Failed to connect to client’ and 0x800706ba errors), just a status value of 2 from WaIISHost.exe.  The status is defined as an enum with the following values:


0 = Healthy
1 = Unhealthy
2 = Busy

We would typically expect to see a 1 (Unhealthy while the role is starting up), then a 2 (Busy while the role is running startup code), and then a 0 once the role is running in the Run() method.  So this host bootstrapper log file is basically just telling us that the role is in the Busy state while starting up and then disappears, which is pretty much what we already knew.

WindowsAzureGuestAgent logs

Once WaIISHost.exe starts up then the guest agent is pretty much out of the picture so we won’t expect to find anything in these logs, but since we haven’t found anything else useful it is good to take a quick look at these logs to see if anything stands out.  When looking at multiple log files, especially for role recycling scenarios, I typically find one point in time when I know the problem happened and use that consistent time period to look across all logs.  This helps prevent just aimlessly looking through huge log files hoping that something jumps out.  In this case I will use the timestamp 2013/10/02, 22:09:31.300 since that is the last entry in the host bootstrapper log file.

AppAgentRuntime.log


[00002608:00003620, 2013/10/02, 22:09:21.789, INFO ] Role process with id 2916 is successfully resumed
[00002608:00003620, 2013/10/02, 22:09:21.789, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateSuspended to RoleStateBusy.
[00002608:00001840, 2013/10/02, 22:09:29.566, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateUnhealthy.
[00002608:00003244, 2013/10/02, 22:09:31.300, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateUnhealthy to RoleStateBusy.
[00002608:00003620, 2013/10/02, 22:09:31.535, FATAL] Role process exited with exit code of 0
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateBusy to RoleStateStopping.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] Waiting for ping from LB.
[00002608:00003620, 2013/10/02, 22:09:31.613, INFO ] TIMED OUT waiting for LB ping. Proceeding to stop the role.
[00002608:00003620, 2013/10/02, 22:09:31.613, IMPRT] State of 36ec83922b34432b808b37e73e6a216d.MissingDependency_IN_0 changed from RoleStateStopping to RoleStateStopped.

We can see the WaHostBootstrapper process starting (PID 2916, which matches the PID:TID we see in the WaHostBootstrapper.log – {00002916:00001744}).  Then we see the role status change to Busy, Unhealthy, then Busy, which is exactly what we see in the host bootstrapper log file.  Then the role process exits and the guest agent proceeds to do a normal stop role and then start role.  So nothing useful in this log.


Debugging

At this point we have looked at all of the useful logs and have not found any indication of what the source of the problem might be.  Now it is time to do a live debug session in order to find out why WaIISHost.exe is shutting down.

The easiest way to start debugging on an Azure VM is with AzureTools.  You can learn more about AzureTools and how to download it from http://blogs.msdn.com/b/kwill/archive/2013/08/26/azuretools-the-diagnostic-utility-used-by-the-windows-azure-developer-support-team.aspx.

First we want to download AzureTools and then double-click the X64 Debuggers tool which will download and install the Debugging Tools for Windows which contains WinDBG.

image

Now we have to get WinDBG attached to WaIISHost.exe.  Typically when debugging a process you can just start WinDBG and go to File –> Attach to a Process and select the process from the list, but in this case WaIISHost.exe is crashing immediately on startup so it won’t show up in the currently running process list.  The typical way to attach to a process that is crashing on startup is to set the Image File Execution Options Debugger key to start and attach WinDBG as soon as the process starts.  Unfortunately this solution doesn’t work in an Azure VM (for various reasons) so we have to come up with a new way to attach a debugger.

AzureTools includes an option under the Utils tab to attach a debugger to the startup of a process.  Switch to the Utils tab, click Attach Debugger, select WaIISHost from the process list, then click Attach Debugger.  You will see WaIISHost show up in the Currently Monitoring list.  AzureTools will attach WinDBG (or whatever you specify in Debugger Location) to a monitored process the next time that process starts up.  Note that AzureTools will only attach the next instance of the target process that is started – if the process is currently running then AzureTools will ignore it.

image

Now we just wait for Azure to recycle the processes and start WaIISHost again.  Once WaIISHost is started then AzureTools will attach WinDBG and you will see a screen like this:

image

Debugging an application, especially using a tool like WinDBG, is oftentimes more art than science.  There are lots of articles that talk about how to use WinDBG, but Tess’s Debugging Demos series is a great place to start.  Typically in these role recycling scenarios where there is no indication of why the role host process is exiting (ie. the event logs aren’t showing us an exception to look for) I will just hit ‘g’ to let the debugger go and see what happens when the process exits.

WinDBG produces lots of output, but here are the more interesting pieces of information:


Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALIZING
[00000704:00003424, INFO ] Initializing runtime.
Microsoft.WindowsAzure.ServiceRuntime Information: 100 : Role environment . INITIALED RETURNED. HResult=0
Microsoft.WindowsAzure.ServiceRuntime Information: 101 : Role environment . INITIALIZED
ModLoad: 00000000`00bd0000 00000000`00bda000   E:\approot\bin\MissingDependency.dll
Microsoft.WindowsAzure.ServiceRuntime Critical: 201 : ModLoad: 000007ff`a7c00000 000007ff`a7d09000   D:\Windows\Microsoft.NET\Framework64\v4.0.30319\diasymreader.dll
Role entrypoint could not be created:
System.Reflection.TargetInvocationException: Exception has been thrown by the target of an invocation. ---> System.IO.FileNotFoundException: Could not load file or assembly 'Microsoft.Synchronization, Version=1.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91' or one of its dependencies. The system cannot find the file specified.
   at MissingDependency.WebRole..ctor()
   --- End of inner exception stack trace ---
   at System.RuntimeTypeHandle.CreateInstance(RuntimeType type, Boolean publicOnly, Boolean noCheck, Boolean& canBeCached, RuntimeMethodHandleInternal& ctor, Boolean& bNeedSecurityCheck)
   at System.RuntimeType.CreateInstanceSlow(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
   at System.RuntimeType.CreateInstanceDefaultCtor(Boolean publicOnly, Boolean skipCheckThis, Boolean fillCache, StackCrawlMark& stackMark)
   at System.Activator.CreateInstance(Type type, Boolean nonPublic)
   at System.Activator.CreateInstance(Type type)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.GetRoleEntryPoint(Assembly entryPointAssembly)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.CreateRoleEntryPoint(RoleType roleTypeEnum)
   at Microsoft.WindowsAzure.ServiceRuntime.RoleEnvironment.InitializeRoleInternal(RoleType roleTypeEnum)

  • The first 4 lines tell us that the Azure ServiceRuntime was initialized successfully.
  • Line 5 shows that my role entry point (MissingDependency.dll – the WebRole.cs code such as the OnStart method) is loaded.  This tells us that we are getting into custom code and the problem is probably not with Azure itself.
  • Line 6 is loading diasymreader.dll.  This is the diagnostic symbol reader and you will see it loaded whenever a managed process throws a second-chance exception.  The fact that this comes shortly after loading my DLL tells me that it is probably something within my DLL that is causing a crash.
  • Line 7, “Role entrypoint could not be created:”, tells me that Azure (WaIISHost.exe) is trying to enumerate the types in the role entry point module (MissingDependency.dll) to find the class that inherits from RoleEntryPoint so that it knows where to call the OnStart and Run methods, but it failed for some reason.
  • The rest of the lines show the exception being raised, which ultimately causes the process to exit.

The exception message and callstack tell us that WaIISHost.exe was not able to find the Microsoft.Synchronization DLL when trying to load the role entry point class and running the code in MissingDependency.WebRole..ctor(). 

 

Intellitrace

The above shows how to do a live debug which has some nice benefits – no need to redeploy in order to troubleshoot so it can be much faster if you are experienced with debugging, and you are in a much richer debugging environment which is often required for the most complex problem types.  But for issues such as role recycles it is often easier to turn on Intellitrace and redeploy.  For more information about setting up and using Intellitrace see http://msdn.microsoft.com/en-us/library/windowsazure/ff683671.aspx or http://blogs.msdn.com/b/jnak/archive/2010/06/07/using-intellitrace-to-debug-windows-azure-cloud-services.aspx.

For this particular issue I redeployed the application with Intellitrace turned on and was quickly able to get to the root cause.

image

image
 

 

Solution

Typically once I think I have found the root cause of a problem I like to validate the fix directly within the VM before spending the time to fix the problem in the project and redeploy.  This is especially valuable if there are multiple things wrong (ie. multiple dependent DLLs that are missing) so that you don’t spend a couple hours in a fix/redeploy cycle.  See http://blogs.msdn.com/b/kwill/archive/2013/09/05/how-to-modify-a-running-azure-service.aspx for more information about making changes to an Azure service.

Applying the temporary fix:

  1. On your dev machine, check in Visual Studio to see where the DLL is located.
  2. Copy that DLL to the Azure VM into the same folder as the role entry point DLL (e:\approot\bin\MissingDependency.dll in this case).
  3. On the Azure VM, close WinDBG in order to let WaIISHost.exe finish shutting down which will then let Azure recycle the host processes and attempt to restart WaIISHost.

Validating the fix:

  • The easiest way to validate the fix is to just watch Task Manager to see if WaIISHost.exe starts and stays running.
  • You should also validate that the role reaches the Ready state.  You can do this 3 different ways:
    • Check the portal.  This may take a couple minutes for the HTML portal to reflect the current status.
    • Open C:\Logs\WaAppAgent.log and scroll to the end.  You are looking for “reporting state Ready.”
    • Within AzureTools download the DebugView.zip tool.  Run DebugView.exe and check Capture –> Capture Global Win32.  You will now see the results of the app agent heartbeat checks in real time.

Applying the solution:

At this point we have validated that the only problem is the missing Microsoft.Synchronization.DLL, so we can go to Visual Studio, mark that reference as CopyLocal=true (see the snippet below), and redeploy.
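
For reference, setting CopyLocal=true in Visual Studio corresponds to the Private element on the Reference node in the .csproj (a sketch; the Include string should match the reference in your own project):

<Reference Include="Microsoft.Synchronization, Version=1.0.0.0, Culture=neutral, PublicKeyToken=89845dcd8080cc91">
  <Private>True</Private>
</Reference>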


Windows Azure Storage Analytics SDP Package


In a previous post we looked at the Windows Azure PaaS SDP package, which allows you to quickly and easily gather all of the log data needed to determine root cause for a variety of PaaS compute issues.  This post will look at a new SDP package which does the same for the storage analytics logs.

 


Getting the SDP Package

This package will only work on Windows 7 or later, or Windows Server 2008 R2 or later.

  1. Open PowerShell
  2. Copy/Paste and Run the following script

md c:\Diagnostics; Import-Module bitstransfer; Start-BitsTransfer http://dsazure.blob.core.windows.net/azuretools/AzureStorageAnalyticsLogs_global.DiagCab c:\Diagnostics\AzureStorageAnalyticsLogs_global.DiagCab; c:\Diagnostics\AzureStorageAnalyticsLogs_global.DiagCab

Alternatively you can download and save the .DiagCab directly from http://dsazure.blob.core.windows.net/azuretools/AzureStorageAnalyticsLogs_global.DiagCab.

 


Running the SDP Package

  1. Enter the storage account name
    image
     
  2. Enter the storage account key.  Note that this key is only temporarily used within the SDP package utility.  It is not saved or transferred.
    image
     
  3. Enter the starting time and ending time.  The default values will gather logs from the past 24 hours
    image
     
  4. Select the analytics logs to gather.
    image
     
  5. When the tool is finished gathering data click Next and an Explorer window will open showing the latest.cab which is a compressed file containing the most recent set of data, along with folders containing the data from each time the SDP package was run.
    image

 

The Results

There will be several files created as a result of running this SDP package.  The important ones are:

  1. ResultReport.xml.  This file lists the data collected and includes the storage account name and time range specified.  In the future we will include intelligent analytics results within this file (ie. “Event <x> found in Blob logs.  This usually indicates <y>.  You can find more information at <link>”).
  2. *.csv.  These are the raw data files containing the logs and metrics.  A header line is included in the file to make analysis easy.  The headers correspond to the Logs format and Metrics format.
  3. *.xlsx.  If Excel is installed on the computer running the SDP package then these .xlsx files will be created which include pre-built charts showing the most commonly used metrics along with the option to select additional metrics.

 

Excel charts (*.xlsx)

You can add or remove metrics from the Excel charts using the standard Chart filter tools:

image

 

Logs (*.csv)

You can easily filter and sort the .CSV files within Excel.  The following filter can help identify potentially inefficient queries by identifying requests that take longer than X number of milliseconds on the server:

image

 

Look for additional blog posts in the future which walk through using the analytics data to identify and solve common issues.

 

Additional Resources

http://msdn.microsoft.com/en-us/library/windowsazure/hh343270.aspx – In depth documentation about storage analytics and what each field means.

http://www.windowsazure.com/en-us/documentation/articles/storage-monitor-storage-account/ – How to enable and use metrics from the Azure Management Portal.

http://channel9.msdn.com/Series/DIY-Windows-Azure-Troubleshooting/Storage-Analytics – A short 5 and a half minute video showing how to enable and use storage analytics.


Windows Azure Diagnostics – Upgrading from Azure SDK 2.4 to Azure SDK 2.5



Overview

Windows Azure SDK 2.4 and prior used Windows Azure Diagnostics 1.0, which provided multiple options for configuring diagnostics collection, including code based configuration using the Microsoft.WindowsAzure.Diagnostics classes.  Windows Azure SDK 2.5 introduces Windows Azure Diagnostics 1.2, which streamlines the configuration and adds enhanced capabilities for post-deployment enablement; however, it removes the code based configuration capabilities.  This blog post will describe how to convert your older code based diagnostics configuration to the new WAD 1.2 XML based configuration model.

 

For more information about WAD 1.2 see http://azure.microsoft.com/en-us/documentation/articles/cloud-services-dotnet-diagnostics/.

 

 

Specifying diagnostics Configuration in XML

In Azure SDK 2.4 it was possible to configure diagnostics through code as well as through XML configuration (the diagnostics.wadcfg file), and developers needed to understand the precedence rules governing how the diagnostics agent picks up the configuration settings (see here for more information). Azure SDK 2.5 removes this complexity and the diagnostics agent will now always use the XML configuration. This has some implications when you migrate your Azure SDK 2.4 project to Azure SDK 2.5. If your Azure SDK 2.4 project already uses the XML based diagnostics configuration then when you upgrade the project in Visual Studio to target Azure SDK 2.5, Visual Studio will automatically update the XML based configuration to the new format (diagnostics.wadcfgx). If your project continued to use the code based configuration (for example, using the API in Microsoft.WindowsAzure.Diagnostics and Microsoft.WindowsAzure.Diagnostics.Management) then when it is upgraded to SDK 2.5 you will get build warnings which inform you of the deprecated APIs. The diagnostics data will not be collected unless you configure your diagnostics using the XML file (diagnostics.wadcfgx) or through the Visual Studio diagnostics configuration UI.

 

Let’s look at an example.  In Azure SDK 2.4 you could be using code like the below to update the diagnostics configuration for the diagnostics infrastructure logs. In this particular example the code sets the LogLevel for the diagnostics infrastructure logs to Error and the transfer period to 5 minutes. When you migrate this project to SDK 2.5 you will get a build warning around this code with the message that the API is deprecated – “Warning: ‘Microsoft.WindowsAzure.Diagnostics.DiagnosticMonitor’ is obsolete: “This API is deprecated””. The project will still build and deploy but it won’t update the diagnostics configuration.

 

image
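
Since the code screenshot does not reproduce here, a representative WAD 1.0 sketch of this kind of configuration (not necessarily the exact code from the original post) would be:

using System;
using Microsoft.WindowsAzure.Diagnostics;

// Deprecated in Azure SDK 2.5 / WAD 1.2 -- shown only to illustrate the old model
DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

// Transfer diagnostics infrastructure logs of level Error every 5 minutes
config.DiagnosticInfrastructureLogs.ScheduledTransferLogLevelFilter = LogLevel.Error;
config.DiagnosticInfrastructureLogs.ScheduledTransferPeriod = TimeSpan.FromMinutes(5);

DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);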

 

To configure the diagnostics infrastructure logs in Azure SDK 2.5 you must remove that code based configuration and update the diagnostics configuration XML file (diagnostics.wadcfgx) associated with your role. Visual Studio provides a configuration UI to let you specify these diagnostics settings. To configure the diagnostics through Visual Studio, right-click the role and select Properties to display the role designer. On the Configuration tab make sure Enable Diagnostics is selected and click Configure. This will bring up the diagnostics configuration UI where you can go to the Infrastructure Logs tab, select Enable transfer of Diagnostics Infrastructure Logs, and set the transfer period and log level appropriately.

 

image

 

Behind the scenes Visual Studio updates the diagnostics.wadcfgx file with the appropriate XML to configure the infrastructure logs.  The DiagnosticInfrastructureLogs element below is the relevant part:

<?xml version="1.0" encoding="utf-8"?>
<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <PublicConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <WadCfg>
      <DiagnosticMonitorConfiguration overallQuotaInMB="4096">
        <DiagnosticInfrastructureLogs scheduledTransferPeriod="PT5M" scheduledTransferLogLevelFilter="Error" />
      </DiagnosticMonitorConfiguration>
    </WadCfg>
  </PublicConfig>
  <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
    <StorageAccount name="" endpoint="" />
  </PrivateConfig>
  <IsEnabled>true</IsEnabled>
</DiagnosticsConfiguration>

 

The diagnostics configuration dialog in Visual Studio lets you configure other settings like Application Logs, Windows Event Logs, Performance Counters, Infrastructure Logs, Log Directories, ETW Logs and Crash Dumps.

Let’s look at another example for Performance Counters. With Azure SDK 2.4 (and prior) you would have to do the following in code:
image
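
This screenshot also does not reproduce here; a representative WAD 1.0 sketch (not necessarily the exact original code) of adding a performance counter through code would be:

using System;
using Microsoft.WindowsAzure.Diagnostics;

// Deprecated WAD 1.0 model -- shown only for comparison with the SDK 2.5 UI
DiagnosticMonitorConfiguration config = DiagnosticMonitor.GetDefaultInitialConfiguration();

// Sample the processor counter every 3 minutes and transfer to storage every minute
config.PerformanceCounters.DataSources.Add(new PerformanceCounterConfiguration
{
    CounterSpecifier = @"\Processor(_Total)\% Processor Time",
    SampleRate = TimeSpan.FromMinutes(3)
});
config.PerformanceCounters.ScheduledTransferPeriod = TimeSpan.FromMinutes(1);

DiagnosticMonitor.Start("Microsoft.WindowsAzure.Plugins.Diagnostics.ConnectionString", config);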

 

With Azure SDK 2.5 you can use the diagnostics configuration UI to enable the default performance counters:

image

 

For custom performance counters you still have to instrument your code to create the custom performance category and counters as before (a sketch of this instrumentation follows below). Once you have instrumented your code you can enable transferring these custom performance counters to storage by adding them to the list of performance counters in the diagnostics configuration UI. For example, for a custom performance counter you can enter the category and name “\SampleCustomCategory\Total Button1 Clicks” in the text box and click Add.
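
A minimal sketch of that instrumentation, using the sample category and counter names from above; creating a category requires administrative rights, so in a role this would typically run from an elevated OnStart or a startup task:

using System.Diagnostics;

// Create the custom category and counter once, if they do not already exist
if (!PerformanceCounterCategory.Exists("SampleCustomCategory"))
{
    var counters = new CounterCreationDataCollection
    {
        new CounterCreationData(
            "Total Button1 Clicks",
            "Total number of times Button1 has been clicked",
            PerformanceCounterType.NumberOfItems32)
    };

    PerformanceCounterCategory.Create(
        "SampleCustomCategory",
        "Sample custom counters",
        PerformanceCounterCategoryType.SingleInstance,
        counters);
}

// Update the counter somewhere in the application code
var clicks = new PerformanceCounter("SampleCustomCategory", "Total Button1 Clicks", readOnly: false);
clicks.Increment();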

 

Visual Studio will automatically add the performance counters selected in the UI to the diagnostics configuration file diagnostics.wadcfgx:

<?xml version="1.0" encoding="utf-8"?>

<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">

  <PublicConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">

    <WadCfg>

      <DiagnosticMonitorConfiguration overallQuotaInMB="4096">

        <PerformanceCounters scheduledTransferPeriod="PT1M">

          <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT3M" />

          <PerformanceCounterConfiguration counterSpecifier="\Memory\Available MBytes" sampleRate="PT3M" />

        </PerformanceCounters>

      </DiagnosticMonitorConfiguration>

    </WadCfg>

  </PublicConfig>

  <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">

    <StorageAccount name="" endpoint="" />

  </PrivateConfig>

  <IsEnabled>true</IsEnabled>
</DiagnosticsConfiguration>

 

For WindowsEventLog data you can enable transfer of logs for common data sources directly from the diagnostics configuration UI.

image

 

For custom event logs that you define in your code, for example:

 


EventLog.CreateEventSource("GuestBookSource", "Application");

EventLog.WriteEntry("GuestBookSource", "WebRole Started", EventLogEntryType.Error, 9191);

 

You have to manually add the custom source to the wadcfgx XML file under the WindowsEventLog node. Make sure the name matches the name specified in code:

 


<WindowsEventLog scheduledTransferPeriod="PT1M">
  <DataSource name="Application!*" />
  <DataSource name="GuestBookSource!*" />
</WindowsEventLog>

 

 

Enabling diagnostics extension through PowerShell

 

Since Azure SDK 2.5 uses the extension model, the diagnostics agent, the configuration, and the connection string to the diagnostics storage account are no longer part of the deployment package and CSCFG. All the diagnostics configuration is contained within the wadcfgx. The advantage of this approach is that the diagnostics agent and settings are decoupled from the project and can be dynamically enabled and updated even after your application is deployed.

 

Due to this change some existing workflows need to be rethought – instead of configuring the diagnostics as part of the application that gets deployed to each environment you can first deploy the application to the environment and then apply the diagnostics configuration for it.  When you publish the application from Visual Studio this process is done automatically for you. However if you were deploying your application outside of VS using PowerShell then you have to install the extension separately through PowerShell.

 

There are three PowerShell cmdlets for managing the diagnostics extension on a cloud service:

  • Set-AzureServiceDiagnosticsExtension

  • Get-AzureServiceDiagnosticsExtension

  • Remove-AzureServiceDiagnosticsExtension
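As a quick sketch, inspecting and removing the extension looks roughly like this (assuming the standard service-management -ServiceName/-Slot parameters):

# View the current diagnostics extension configuration on a slot
Get-AzureServiceDiagnosticsExtension -ServiceName '<servicename>' -Slot 'Staging'

# Remove the diagnostics extension from the service
Remove-AzureServiceDiagnosticsExtension -ServiceName '<servicename>' -Slot 'Staging'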

You can use the Set-AzureServiceDiagnosticsExtension cmdlet to enable the diagnostics extension on a cloud service. One of the parameters to this cmdlet is the path to an XML configuration file. This file is slightly different from the diagnostics.wadcfgx file. You can create this file from scratch as described here, or you can modify the wadcfgx file and pass in the modified file as a parameter to the PowerShell cmdlet.

 

To modify the wadcfgx file:

  1. Make a copy of the .wadcfgx file.

  2. Remove the following elements from the copy:

<DiagnosticsConfiguration xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
   <PrivateConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
     <StorageAccount name=" " endpoint="https://core.windows.net/" />
   </PrivateConfig>
   <IsEnabled>false</IsEnabled>
</DiagnosticsConfiguration>

  3. Make sure the top of the file still has the XML version and encoding declaration:

<?xml version="1.0" encoding="utf-8"?>

 

Effectively you are stripping the wadcfgx down to only the <?xml> header and the <PublicConfig> section. Based on the wadcfgx shown earlier, the stripped file would look roughly like this:
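<?xml version="1.0" encoding="utf-8"?>
<PublicConfig xmlns="http://schemas.microsoft.com/ServiceHosting/2010/10/DiagnosticsConfiguration">
  <WadCfg>
    <DiagnosticMonitorConfiguration overallQuotaInMB="4096">
      <PerformanceCounters scheduledTransferPeriod="PT1M">
        <PerformanceCounterConfiguration counterSpecifier="\Processor(_Total)\% Processor Time" sampleRate="PT3M" />
        <PerformanceCounterConfiguration counterSpecifier="\Memory\Available MBytes" sampleRate="PT3M" />
      </PerformanceCounters>
    </DiagnosticMonitorConfiguration>
  </WadCfg>
</PublicConfig>

You can then call the PowerShell cmdlet along with the appropriate parameters for the slot and role: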


$storage_name = '<storagename>'
$key = '<key>'
$service_name = '<servicename>'
$public_config = '<path_to_stripped_wadcfgx_xml>'

$storageContext = New-AzureStorageContext -StorageAccountName $storage_name -StorageAccountKey $key

Set-AzureServiceDiagnosticsExtension -StorageContext $storageContext -DiagnosticsConfigurationPath $public_config -ServiceName $service_name -Slot 'Staging' -Role 'WebRole1'

 

How to Restrict RDP Access in an Azure PaaS Cloud Service


 

A question I see periodically is how to restrict RDP access for PaaS services to specific network IP addresses.  In the past this has always been difficult to do, and the typical solution was to use a startup task to configure firewall rules (i.e. using Set-NetFirewallRule or netsh advfirewall per http://msdn.microsoft.com/en-us/library/azure/jj156208.aspx).  This technique generally works fine, but it introduces the extra complexity of a startup task and is not built into the Azure platform itself.
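For reference, a startup-task sketch of that old approach might look like the following (the rule name and permitted subnet are illustrative, and the rule name must match the actual inbound RDP allow rule on the VM). Because block rules take precedence over allow rules in Windows Firewall, the restriction is done by narrowing the scope of the existing allow rule rather than adding a block rule:

REM RestrictRDP.cmd - illustrative startup task
netsh advfirewall firewall set rule name="Remote Desktop (TCP-In)" new remoteip=167.220.26.0/24
exit /b 0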

Network ACLs

With the (relatively) recent introduction of network ACLs it becomes much easier to robustly secure an input endpoint on a cloud service.  My colleague Walter Myers has a great blog post about how to enable network ACLs for PaaS roles at http://blogs.msdn.com/b/walterm/archive/2014/04/22/windows-azure-paas-acls-are-here.aspx.  To apply a network ACL to the RDP endpoint it is simply a matter of defining your ACL rules targeting the role which imports the RemoteForwarder plugin, and specifying the name of the RDP endpoint in the endPoint attribute.

Here is the resulting NetworkConfiguration section to add to the CSCFG file:

  <NetworkConfiguration>
    <AccessControls>
      <AccessControl name="RDPRestrict">
        <Rule action="permit" description="PermitRDP" order="100" remoteSubnet="167.220.26.0/24" />
        <Rule action="deny" description="DenyRDP" order="200" remoteSubnet="0.0.0.0/0" />
      </AccessControl>
    </AccessControls>
    <EndpointAcls>
      <EndpointAcl role="WebRole1" endPoint="Microsoft.WindowsAzure.Plugins.RemoteForwarder.RdpInput" accessControl="RDPRestrict" />
    </EndpointAcls>
  </NetworkConfiguration>

 

 

Important information:

  1. You must enable RDP in the package before publishing your service.  The new model of enabling RDP post-deployment via the management portal or extension APIs will not work.  You can enable RDP in the package using Visual Studio by right-clicking the cloud service project in Solution Explorer and selecting ‘Configure Remote Desktop…’, or in the Publish wizard by checking the ‘Enable Remote Desktop for all roles’ checkbox.
  2. The role="WebRole1" attribute must specify the role which imports the RemoteForwarder plugin.  You can look in the CSDEF file and find the role which has <Import moduleName="RemoteForwarder" /> (see the sketch after this list).  If you have multiple roles in your service then all of them will import RemoteAccess, but only one of them will import RemoteForwarder, and you must specify the role which imports RemoteForwarder.
  3. The network configuration defined above will restrict all clients except for those with IP addresses in the range 167.220.26.1-167.220.26.255.  See Walter’s blog post for more information about how to specify network ACLs, and the MSDN documentation (http://msdn.microsoft.com/en-us/library/azure/dn376541.aspx) for more information about order and precedence of rules.
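A CSDEF sketch of what to look for (other role and site details omitted for brevity):

<WebRole name="WebRole1">
  <Imports>
    <Import moduleName="RemoteAccess" />
    <!-- the role with this import is the one to name in EndpointAcl -->
    <Import moduleName="RemoteForwarder" />
  </Imports>
</WebRole>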

Cloud Service RDP Configuration not available via portal


 

 

<Update March 2, 2015>

The ability to modify RDP plugin settings via the portal has been restored.  I will leave this blog post up since the password encryption portion can still be valuable for people setting up RDP outside of the Visual Studio tools.

</Update>

 

 

There are two ways to enable RDP for a PaaS cloud service: via the CSPKG using the RemoteAccess plugin, or via the portal (or PowerShell or REST API) using an extension.  A recent change to the management portal disabled the ability to configure the plugin-style RDP settings after deployment.  The error you get is:

“Failed to launch remote desktop configuration.”

And if you click Details you get:

“Remote Desktop is enabled for this deployment using the RemoteAccess module, which is specified in the ServiceDefinition.csdef file. To allow configuring Remote Desktop using the management portal, remove the RemoteAccess module and update the deployment. Learn More.”


The best solution is to remove the RDP plugin from the package, rebuild the package, and redeploy.  This will let you enable RDP via the portal after deployment and manage the configuration in the portal.  This also provides a more reliable RDP experience since this shifts the RDP functionality from a static feature in your CSPKG to an extension that can be managed and upgraded by the Azure guest agent.  To disable RDP in your package you can right-click the cloud service project, select ‘Configure Remote Desktop’ and then uncheck the ‘Enable connections for all roles’ option.


 

However, modifying and then redeploying the package is not always a viable solution.  For the scenarios where you want to update the configuration of an existing service you can use the following steps to manually update the configuration.

 

Download the configuration

Download the configuration and save a local .CSCFG file.  You can do this via the portal or PowerShell.

To do this from the portal, go to the Configure tab and click Download.

To do this from PowerShell, run:

$deployment = Get-AzureDeployment -ServiceName <servicename> -Slot <slot>

([xml]$deployment.Configuration).Save("c:\temp\config.cscfg")

 

Update the configuration setting

Open the .CSCFG file you saved, make any necessary changes, and then save the .CSCFG.

*See below for how to modify the password.
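For reference, the RemoteAccess plugin settings in the .CSCFG look roughly like this (values are illustrative):

<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.Enabled" value="true" />
<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountUsername" value="rdpuser" />
<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountEncryptedPassword" value="MIIB...base64..." />
<Setting name="Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountExpiration" value="2015-12-31T23:59:59.0000000+00:00" />
<Setting name="Microsoft.WindowsAzure.Plugins.RemoteForwarder.Enabled" value="true" />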

 

Upload back to the portal

Upload the modified .CSCFG file back to the portal and wait for the configuration change to complete.

To do this from the portal, go to the Configure tab and click Upload.

To do this from PowerShell, run:

Set-AzureDeployment -Config -ServiceName <servicename> -Configuration "c:\temp\config.cscfg" -Slot <slot>

 

Updating the password

Updating settings such as the expiration date or username is easy, but updating the password is more difficult.  The RDP password is encrypted using a user-provided certificate that is uploaded to the cloud service.  If you are making these RDP changes on a computer that already has that certificate in the certificate store (i.e. the dev/deploy machine) then getting a new encrypted password is pretty straightforward:

  1. Open a ‘Windows Azure Command Prompt’
  2. Execute: csencrypt encrypt-password -copytoclipboard -thumbprint <thumbprint_from_CSCFG>
  3. Paste the new password into the AccountEncryptedPassword setting in the .CSCFG

If you don’t have the certificate in the local computer’s certificate store you can download it and generate a new encrypted password using PowerShell:

# Download the cert
$cert = Get-AzureCertificate -ServiceName <servicename> -Thumbprint <thumbprint_from_CSCFG> -ThumbprintAlgorithm "SHA1"

# Save the cert to a file
"-----BEGIN CERTIFICATE-----" > "C:\Temp\rdpcert.cer"
$cert.Data >> "C:\Temp\rdpcert.cer"
"-----END CERTIFICATE-----" >> "C:\Temp\rdpcert.cer"

# Prompt for the new password
$password = Read-Host -Prompt "Enter the password to encrypt"

# Load the certificate
[System.Reflection.Assembly]::LoadWithPartialName("System.Security") | Out-Null
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2("C:\Temp\rdpcert.cer")
$thumbprint = $cert[0].Thumbprint
$pass = [Text.Encoding]::UTF8.GetBytes($password)

# Encrypt the password with the cert
$content = New-Object Security.Cryptography.Pkcs.ContentInfo -ArgumentList (,$pass)
$env = New-Object Security.Cryptography.Pkcs.EnvelopedCms $content
$env.Encrypt((New-Object System.Security.Cryptography.Pkcs.CmsRecipient($cert)))

# Write the new encrypted password to a file and open it
Write-Host "Writing encrypted password, cut/paste the text below the line to CSCFG file"
[Convert]::ToBase64String($env.Encode()) > "C:\Temp\encrypted_password.txt"
Invoke-Item "C:\Temp\encrypted_password.txt"

 

Azure Cloud Services only support SHA-1 Thumbprint Algorithm


 

“My certificate provider recently switched to only providing SHA2/SHA256 certificates because SHA-1 certificates are no longer safe.  But Azure only supports SHA1 certificates!  https://msdn.microsoft.com/library/azure/gg465718.aspx says ‘The only thumbprint algorithm currently supported is sha1’”.

 

Lately I have been seeing this issue more often because some larger cert providers recently made this change.  The entire industry has been deprecating SHA-1 certificates for a while, and Chrome has recently started showing warnings in the browser.

 

Signing algorithm vs. Thumbprint algorithm

The issue stems from confusion between the two types of algorithms used by certificates.

  • Signing algorithm.  This is the algorithm used to actually sign the certificate and this is what makes the certificate secure (or in the case of SHA-1, less secure).  The signing algorithm and resulting signature is specified by the certificate authority when it creates the cert and is built into the cert itself.  This algorithm is where SHA1 is being deprecated.  Azure doesn’t know or care what this algorithm is.

  • Thumbprint algorithm.  This algorithm is used to generate a thumbprint in order to uniquely identify and find a certificate.  The thumbprint and its algorithm are not built into the certificate but are instead calculated whenever a cert lookup is done.  Multiple thumbprints can be generated using different algorithms, all from the same certificate data.  The thumbprint has nothing to do with certificate security since it is just used to identify/find the cert within the cert store.  Windows, .NET, and Azure all use the SHA-1 algorithm for the thumbprint, and sha1 is the only value allowed in the ServiceConfiguration.cscfg file:

<Certificates>
  <Certificate name="Certificate1" thumbprint="69BF333452DAA85E462E33B138F3B65842C8B428" thumbprintAlgorithm="sha1" />
</Certificates>

 

Solution

You can use your SHA2/SHA256 signed certificate in Azure, you just have to specify an SHA1 thumbprint.  Your certificate provider should be able to provide you with an SHA1 thumbprint, but it is relatively straightforward to find or calculate the SHA1 thumbprint on your own.  Here are a few options:

  1. The easiest option is to simply open the certificate in the Certificate Manager on any Windows OS.  Windows will display the SHA1 thumbprint in the certificate properties window.
  2. On a Windows OS you can run ‘certutil -store my’ (replace my with whatever store your cert is in).
  3. In PowerShell you can call System.Security.Cryptography.SHA1Cng.ComputeHash per http://ig2600.blogspot.com/2010/01/how-do-you-thumbprint-certificate.html, or use the X509Certificate2 class as sketched after this list.
  4. There are a few other options including .NET code, openssl, and Apache Commons Codec at http://stackoverflow.com/questions/1270703/how-to-retrieve-compute-an-x509-certificates-thumbprint-in-java.
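A minimal PowerShell sketch of option 3 (the file path is illustrative):

# .NET computes X509Certificate2.Thumbprint with SHA-1 regardless of the
# certificate's signing algorithm, so this value can go straight into the CSCFG
$cert = New-Object System.Security.Cryptography.X509Certificates.X509Certificate2("C:\Temp\mycert.cer")
$cert.Thumbprint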

 

Thank you to Morgan Simonsen for his excellent blog post Understanding X.509 digital certificate thumbprints which details the different certificate algorithms.

 

SDK 2.5 / WAD 1.2 — IIS Logs Not Transferring to Storage in PaaS WebRoles


After upgrading to Azure SDK 2.5 with Windows Azure Diagnostics 1.2 (see http://blogs.msdn.com/b/kwill/archive/2014/12/02/windows-azure-diagnostics-upgrading-from-azure-sdk-2-4-to-azure-sdk-2-5.aspx) you may notice that IIS logs and failed request (FREB) logs are no longer transferred to storage.

 

Root Cause

When WAD generates the diagnostics configuration it queries the IIS Management Service to find the location of the IIS logs, and by default this location is set to %SystemDrive%\inetpub\logs\LogFiles.  In a PaaS WebRole, IISConfigurator will configure IIS according to your service definition, and part of this setup changes the IIS log file location to C:\Resources\directory\{deploymentid.rolename}.DiagnosticStore\LogFiles\Web.  The WAD configuration happens prior to IISConfigurator running, which means WAD is watching the wrong folder for IIS logs.

 

Workaround

To work around this issue you have to restart the WAD diagnostics agent after IISConfigurator has set up IIS.  When the WAD diagnostics agent starts up again it will query the IIS Management Service for the IIS log file location and will get the correct C:\Resources\directory\{deploymentid.rolename}.DiagnosticStore\LogFiles\Web location.

The two ways to restart the diagnostics agent are:

  1. Reboot the VM.  This can be done from the portal or from an RDP session with the VM.
  2. Update the WAD configuration, which will cause the diagnostics agent to refresh its configuration.  This can be done from Visual Studio (Server Explorer –> Cloud Services –> Right-click a role –> Update Diagnostics –> Make any change and update) or from PowerShell (see this post and the sketch below).
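As a sketch, re-applying the current public configuration with the SDK 2.5 cmdlet is enough to restart the agent (variable names follow the PowerShell section earlier in this document; values are illustrative):

# Re-applying the same configuration forces the diagnostics agent to restart
# and re-query IIS for the correct log folder
Set-AzureServiceDiagnosticsExtension -StorageContext $storageContext -DiagnosticsConfigurationPath $public_config -ServiceName $service_name -Slot 'Production' -Role 'WebRole1'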

One problem with these two options is that you have to manually do this for each role/VM in your service after deploying.  The bigger problem is that any operation which recreates the Windows (D: drive) partition will also reset the IIS log file location to the default %SystemDrive% location which will cause the diagnostics agent to again get the wrong location.  This will happen to all instances roughly once per month for Guest OS updates, or randomly to single instances due to service healing (see this and this for more info).

 

Resolution

The WAD dev team is working to fix this issue with the next Azure SDK release.  In the meantime you can add the following code to your WebRole.OnStart method in order to automatically reboot the VM once during initial startup.

public override bool OnStart()
{
    // For information on handling configuration changes
    // see the MSDN topic at http://go.microsoft.com/fwlink/?LinkId=166357.

    // Write a RebootFlag.txt file to the %RoleRoot%\Approot\bin folder to track
    // whether this VM has already rebooted to fix the WAD 1.3 IIS logs issue
    string path = "RebootFlag.txt";

    // If RebootFlag.txt already exists then skip rebooting the VM
    if (!System.IO.File.Exists(path))
    {
        System.IO.File.WriteAllText(path, "Writing RebootFlag at " + DateTime.Now.ToString("O"));
        System.Diagnostics.Trace.WriteLine("Rebooting");
        System.Diagnostics.Process.Start("shutdown", "/r /t 0");
    }

    return base.OnStart();
}

 

Note that this code uses a file on the %RoleRoot% drive as a flag, so it will also cause an additional reboot in extra scenarios such as portal reboots and in-place upgrades (see this post), but these scenarios are rare enough that it should not cause an issue in your service.  If you wish to avoid these extra reboots you can set the role runtime execution context to elevated by adding <Runtime executionContext="elevated" /> to the .csdef file and then write a flag file to the %SystemRoot% drive or a flag to the registry.
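A sketch of the registry-flag variant (the key path and value name are hypothetical; this assumes the elevated execution context described above):

// Key path and value name are illustrative. The registry lives on the Windows
// partition, so the flag survives role-root redeployments but is reset (as
// desired) when the Guest OS partition is rebuilt.
using (var key = Microsoft.Win32.Registry.LocalMachine.CreateSubKey(@"SOFTWARE\MyRole"))
{
    if (key.GetValue("WadIisRebootFlag") == null)
    {
        key.SetValue("WadIisRebootFlag", DateTime.UtcNow.ToString("O"));
        System.Diagnostics.Process.Start("shutdown", "/r /t 0");
    }
}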
