Dumping core memory: 2013

Sunday 1 September 2013

Dynamic resource pool with Weak References

Sometimes its necessary to allocate a number of expensive resources at runtime. These resources can be expensive in terms of threads or memory. They maybe individually expensive, have a large memory footprint, have a high number of threads or cumulatively expensive for example many instances have to be allocated. It may also be the case that the creation of the resource objects is an expensive task in itself.

To address this problem it is usual to implement an object pool. This pattern is one of the creational design patterns, the intent of this pattern is to reuse and share objects that are expensive to create. It can also be put to use to manage objects that encapsulate expensive resources.

One of the implementation constraints of an object pool implementation is the management of the number resources that are allocated by the object pool. The general implementation of an object pool supposes a fixed number of resources that are allocated when the pool is constructed and are destroyed when the object pool is destroyed.

For the specific problem I am trying to solve, fixed resource allocation is too restrictive, I need some elasticity in the pool. I require the pool to shrink and expand base on the number of outstanding requests for objects.

There are a number of options available to implement this elasticity for example

A cache configured with sliding window flushing of least recently used resources
A background thread that manages last access time and release resources

Both of the implementations above would allow the resource allocation to be elastic. Both have the disadvantage of there response time to clean up of unused resources contained within the pool. This is due to the fact that both will have some periodic interval for clean up. The sliding wind value for the cache and the polling interval for the threaded solution. Tuning an optimal “rate of elasticity” adds to the complexity of the solution. It also means that one needs to tune this rate for different applications that have different resource allocation profiles.

The solution that I have chosen is to allocate resources and maintain references with Weak References inside the resource pool. This allows the pool to respond to the request for resources and also to memory pressure on the system.

Dynamic Resource Pool

First lets define an interface for the resource pool. Using a generic interface allows the implementation of the pool to be abstracted from the implementation of the object pool.

/// <summary>
/// A resource pool that dynamically allocates and releases resources of type T
/// </summary>
/// <typeparam name="T"></typeparam>
public interface IDynamicResourcePool<T> where T : class
{
 /// <summary>
 /// Acquires a resource from the pool.
 /// </summary>
 /// <returns>An instance of a resource from the pool</returns>
 T AcquireResource();

 /// <summary>
 /// Release a resource back to the pool
 /// </summary>
 /// <param name="resource">The resource to release</param>
 void ReleaseResource(T resource);
}

Dissecting the DynamicResourcePool first we look at the local variables and the constructor for the class.

private readonly ConcurrentQueue<WeakReference> _resources;

private readonly PerformanceCounter _rateOfCreation;
private readonly PerformanceCounter _rateOfRelease;
private readonly PerformanceCounter _rateOfAquire;
private readonly PerformanceCounter _totalResources;
private readonly Func<T> _factoryFunc;

/// <summary>
/// Construct an instance of a <see cref="DynamicResourcePool{T}"/>
/// </summary>
/// <param name="resourceFactoryFunc">The factory used to create resources</param>
public DynamicResourcePool(Func<T> resourceFactoryFunc)
{
 _resources = new ConcurrentQueue<WeakReference>();
 _factoryFunc = resourceFactoryFunc;

 if(PerformanceCounterCategory.Exists(CounterMetadata.CategoryName))
 {
  String instanceName = GetType().GetGenericArguments()[0].Name;

  _rateOfCreation = new PerformanceCounter(CounterMetadata.CategoryName, CounterMetadata.CreationCounterName, instanceName, false);
  _rateOfRelease = new PerformanceCounter(CounterMetadata.CategoryName, CounterMetadata.ReleasedCounterName, instanceName, false);
  _rateOfAquire = new PerformanceCounter(CounterMetadata.CategoryName, CounterMetadata.AquiredCounterName, instanceName, false);
  _totalResources = new PerformanceCounter(CounterMetadata.CategoryName, CounterMetadata.TotalCreatedCounterName, instanceName, false);
 }
}

The ConcurrentQueue<WeakReference> is used to store the weak references that contain the allocated resources. There are also a set of performance counters for tracking the performance characteristics of the DynamicResourcePool at runtime. The performance counters will be used later in this post to examine the DynamicResourcePool.

Looking at the AcquireResource operation excluding some code for incrementing and decrementing counters its actually pretty simple. Try dequeue a resource from the resources queue if either a resource cannot be dequeued or the dequeued resource is null, i.e. it had been garbage collected, then create a new resource via the factory. Note: because a queue is used as the resource container the release and acquire of resources are round robin. This behaviour could be made configurable by delegating the internal acquire and release operations to an Action<T> or Func<T> delegate like the factory method.

/// <summary>
/// Acquires a resource from the pool.
/// </summary>
/// <returns>An instance of a resource from the pool</returns>
public T AcquireResource()
{
 IncrementCounter(_rateOfAquire);
 WeakReference weakReference;
 T resource = default(T);

 if(_resources.TryDequeue(out weakReference))
 {
  resource = (T)(weakReference.Target);

  if(resource == null)
  {
   DecrementCounter(_totalResources);
  }
 }

 if(resource == null)
 {
  if(_factory != null)
  {
   IncrementCounter(_totalResources);
   IncrementCounter(_rateOfCreation);
   resource = _factoryFunc();
  }
 }

 return resource;
}

The Release method is really simple. The resource is simply queued back in the resource pool.

/// <summary>
/// Release a resource back to the pool
/// </summary>
/// <param name="resource">The resource to release</param>
/// <exception cref="ArgumentNullException">Thrown when the argument is resource is null</exception>
public void ReleaseResource(T resource)
{
 IncrementCounter(_rateOfRelease);

 if(resource == null)
 {
  throw new ArgumentNullException("resource");
 }

 _resources.Enqueue(new WeakReference(resource));
}

Test Client

To demonstrate the behaviour of the DynamicResourcePool the test client performs the following:

Defines a resource type, ByteBuffer
Creates a DynamicResourcePool <ByteBuffer>
Starts a number of Tasks to allocate and release ByteBuffer resources
Forces a garbage collection
At each stage reports the memory consumption of the process

After running the test client the following output is displayed.

The Memory to be Allocated figure indicates the amount of bytes that the ByteBuffer instances will contain, this is 10Mb each times 20 tasks. The Memory used after Allocation shows the initial processes allocated managed memory plus the memory allocated for the 20 ByteBuffer instances. The test client forces a garbage collection cycle, since there are no tasks running at this point all ByteBuffer instances are collected.

Looking at the performance counters we can see the behaviour of the DynamicResourcePool. First the Total Resources Created counter shows that it starts at zero and rises linearly to 20, which corresponds to the allocation profile of the Tasks.

The Resources Created counter shows that the DynamicResourcePool creates resources initially and then stops when the Total Resources Created reaches 20 as expected.

The Resources Acquired counter shows the rate of resources acquired. The Resources Release counter tracks the acquired counter due to the nature of the test client code. The Resources Acquired counter reaches about 200 per second which is as expected as the test code acquires and releases a resource every 100 milliseconds on 20 separate threads.

Further Enhancements

Allow acquire and release strategy to be configured.
Allow the pool to pre allocate resources at construction time

Source Code

Note: you will need to launch either VS or the test client as administrator as it registers performance counters.
The code is available here: GitHub DumpingCoreMemory.ResoucePooling

Wednesday 31 July 2013

C++/CLI Asynchronous events using EventHandler<T>

While developing some code to expose a native C++ library as a managed component to be used in a C# component. I found that I needed to raise an event asynchronously from the C++/CLI class. Since I have a preference to use EventHandler and EventHandler<TEventArgs> delegates for exposing events rather than define the delegate and the event separately.

So naively, as its been a while since I wrote C++ and my first time writing C++/CLI, I proceeded to simply define an event as shown in the following example snippet:

ref class EventSource
{
 public:
  event EventHandler^ SyncEvent;
}

And invoke it as in the following example snippet:

void EventSource::RaiseSyncEvent(void)
{
 SyncEvent->BeginInvoke(this, EventArgs::Empty, gcnew AsyncCallback(this, &EventSource::Callback), nullptr);
}

Two which I received the following compiler error:
Error 1 error C3918: usage requires 'EventSource::SyncEvent' to be a data member

In the case above this is caused by the fact that you cannot access the backing field for the event delegate. If we decompile the code with ILSpy we can see that the compiler generates the following C# code:

This is due to the fact that the event syntax used to define the SyncEvent is called trivial event. A trivial event is similar to a trivial property where by the compiler provides the backing store field and the assessor methods.

The C++/CLI compiler also provides a raise_xxxxxxx method as shown below, which is kind of neat as the C# compiler requires that you write the equivalent code to insure thread safety when accessing the delegate.

To get access to the underlying event delegate the nontrivial event syntax must be used. This requires the definition of the assessor methods. The add and remove are mandatory and the raise method is optional.

Here is where I discovered another small problem. There are no examples of how to use EventHandler and EventHandler<TEventArgs> delegates on the Microsoft web site. All of the examples use the explicit delegate and event definition syntax.

To define a non trivial event first define the private field/property.

ref class EventSource
{
 private:
  EventHandler^ m_asyncEvent;

Then define the explicit add and remove methods.

event EventHandler^ AsyncEvent
{
 void add(EventHandler^ eventHandler) 
 {
  m_asyncEvent += eventHandler;
 }

 void remove(EventHandler^ eventHandler) 
 {
  m_asyncEvent -= eventHandler;
 }
}

You can now invoke the event asynchronously via BeginInvoke, note you must check for null as the compiler no longer generates the nice raise method.

void EventSource::RaiseAsyncEvent(void)
{
 if(m_asyncEvent != nullptr)
 {
  m_asyncEvent->BeginInvoke(this, EventArgs::Empty, gcnew AsyncCallback(this, &EventSource::Callback), nullptr);
 }
}

Note that a call-back must provided to clean up the IAsyncResult instance by calling the EndInvoke method, see here for further reference.

void EventSource::Callback(IAsyncResult^ ar)
{
 Console::WriteLine("EventSource::Callback invoked on Thread Id: {0}", Thread::CurrentThread->ManagedThreadId);
 AsyncResult^ result = (AsyncResult^) ar;
    EventHandler^ caller = (EventHandler^) result->AsyncDelegate;
 caller->EndInvoke(ar);
}

The code is available here: GitHub AsyncEvent

Wednesday 24 July 2013

A Bounded Task Dispatcher

During a recent implementation I had to resolve some information from a database in response to a network event. These network events are delivered in a serialised order via a custom protocol. In order to decouple the processing of event messages from the query of the database, which is an order of magnitude slower than the event rate, it was necessary to dispatch the event processing onto a separate thread. This would allow the events to be processed in parallel. A secondary requirement was to limit the number of parallel threads to some upper bound.

To execute event processing on a separate thread there are a number of options:

Create a Thread explicitly and manage its lifecycle
Use the ThreadPool to dispatch the request processing
Use a BackGroundWorker to dispatch request processing
Create a Task and manage its lifecycle

There are a number of disadvantages to the options 1 and 2, which basically boil down to the amount of plumbing code required to manage and synchronise thread execution and lifecycle. For example handling exceptions, return data etc.

Option 3 is not really a viable option, it is on the list just for completeness, I would not use, or recommend, the BackGroundWorker for dispatching requests for a backend component as it is primarily a UI component.

The .Net framework 4.0 introduced the System.Threading.Tasks namespace which contains a number of classes and types that simplify the work of writing concurrent and asynchronous code. The Task class is the type from System.Threading.Tasks that the bounded dispatcher is based on.

The Task dispatcher code walk through

The TaskDispatcher type encapsulates the management of the dispatched tasks. Looking at the constructor we can see it takes an argument that specifies the maximum number of tasks that can be dispatched in parallel.

/// <summary>
/// The <see cref="TaskDispatcher"/> constructor
/// </summary>
/// <param name="maxTasks">The maximum number of tasks that can be dispatched before the call to <seealso cref="Dispatch"/> blocks</param>
public TaskDispatcher(Int32 maxTasks)
{
 _executingTasks = new List<Task>();
 _tasks = new Semaphore(maxTasks, maxTasks);
 _tokenSource = new CancellationTokenSource();
}

The _executingTasks is a list that references the executing tasks. The _tasks Semaphore is used to bound the upper limit of Tasks that can be dispatched by the TaskDispatcher. The _tokenSource is a CancellationToken that is use to cancel executing tasks.

Dispatching Tasks

/// <summary>
/// Dispatches an <see cref="Action"/> for execution. 
/// </summary>
/// <param name="action">The <see cref="Action"/> to dispatch</param>
/// <param name="completionCallback">The callback action to invoke when the task completes</param>
/// <param name="dispatchTimeout">The <see cref="TimeSpan"/> to wait for a task to be dispatched</param>
/// <exception cref="TimeoutException">An exception thrown when the <see cref="action"/> cannot be dispatched within the <paramref name="dispatchTimeout"/> </exception>
public void Dispatch(Action<CancellationToken> action, Action<Task,Exception> completionCallback, TimeSpan dispatchTimeout)
{
 if (_tasks.WaitOne(dispatchTimeout))
 {
  lock (_executingTasks)
  {
   var task = new Task(() => action(_tokenSource.Token));
   task.ContinueWith(t => Completion(t, completionCallback));
   task.Start();

   _executingTasks.Add(task);
  }
 }
 else
 {
  throw new TimeoutException(String.Format("Unable to dispatch action within timeout {0}", dispatchTimeout));
 }
}

To dispatch a task the Dispatch method is called, passing a parameter action of type Action<CancellationToken> and a parameter completionCallback of type Action<Task, Exception> delegate which is invoked when the dispatched action delegate completes.

The thread invoking the Dispatch method attempts to enter the _tasks semaphore by calling the WaitOne method. If the semaphore is not signalled within the timeout period then an exception is thrown to indicate that a Task cannot be dispatched.

If the thread can enter the _tasks semaphore a Task is created passing a CancellationToken to the constructor, the CancellationToken is used to control the executing Tasks cancellation. The Task is added to the list of executing tasks, and the calling thread returns.

If there are multiple competing threads trying to enter the _tasks semaphore there is no guaranteed order as to which competing thread will be signalled. This is due to the behaviour of the semaphore. Also it is possible that to threads enter the Semaphore, thread A followed by thread B. Due to pre-emptive scheduling thread B may acquire the lock on the _executingTasks before thread A.

On Completion

/// <summary>
/// The completion callback thats called when a dispatch task was completed
/// </summary>
/// <param name="completedTask">The <see cref="Task"/> that completed</param>
/// <param name="completionCallback">The callback that is invoked when the dispatched <see cref="Task"/> executes to completion</param>
private void Completion(Task completedTask, Action<Task,Exception> completionCallback)
{
 _tasks.Release();

 lock (_executingTasks)
 {
  _executingTasks.Remove(completedTask);
 }

 completionCallback(completedTask, completedTask.Exception);
}

On completion of an instance of a Task the private method Completion is called. This method releases the _tasks semaphore and removes the Task from the _executingTasks list. It then invokes the completionCallback delegate which was provided to the completedTask when it was constructed. If there is an exception associated with the completed task it is passed to the completion call-back.

Cancelling Tasks

/// <summary>
/// Cancel all executing <see cref="Task"/>
/// </summary>
public void CancelAll()
{
 _tokenSource.Cancel();

 Task.WaitAll(_executingTasks.ToArray());
}

To cancel executing tasks the CancelAll method of the task dispatcher is invoked. This method sets the _tokenSource to cancelled. This sets the isCancellationRequested property to true. Dispatched actions should check this property for example:

static void CancellableTask(CancellationToken cancellationToken)
{
 for (int i = 0; i < 100000; i++)
 {
  // Simulating work.
  Thread.SpinWait(5000000);

  if (cancellationToken.IsCancellationRequested)
  {
   // Perform cleanup if necessary.
   //...
   // Terminate the operation.
   break;
  }
 }
}

Note: An errant task method that does not honour the CancellationToken will block the call to CancelAll method. This behaviour could of the CancelAll method could be modified by causing the Task.WaitAll to wait a period of time and then terminate the tasks that did not terminate with the timeout. This would be drastic and should be avoided, if it is possible modify the behaviour of the dispatched methods to hour honour the CancellationToken

Further enhancements

The code presented here is the basic functionality of a bounded task dispatcher some further improvements could be made. For example:

Make the entire Dispatch class generic so that the action parameter type can be specified for example as a Func delegate so a return value can be passed back from the dispatched method.

Add methods to allow individual tasks to be cancelled.

The code is available here: GitHub DumpingCoreMemory.Threading

Thursday 20 June 2013

Windows Azure SDK 1.7 + Windows Server App Fabric 1.1 Install Error

While setting up my development machine for some proof of concept development for migration of an existing application to Windows Azure, I experienced an issue with installing Windows Azure SDK 1.7, released in June, on the machine that had Windows Server AppFabric 1.1 installed previously.
MSI (s) (58:BC) [12:08:37:660]: Executing op: ActionStart(Name=RollbackRegisterEventManifest,,)
MSI (s) (58:BC) [12:08:37:662]: Executing op: CustomActionSchedule(Action=RollbackRegisterEventManifest,ActionType=3393,Source=BinaryData,Target=CAQuietExec,CustomActionData="wevtutil.exe" um "C:\Program Files\Microsoft SDKs\Windows Azure\.NET SDK\2012-06\bin\plugins\Caching\Microsoft.ApplicationServer.Caching.EventDefinitions.man")
MSI (s) (58:BC) [12:08:37:664]: Executing op: ActionStart(Name=RegisterEventManifest,,)
MSI (s) (58:BC) [12:08:37:666]: Executing op: CustomActionSchedule(Action=RegisterEventManifest,ActionType=3073,Source=BinaryData,Target=CAQuietExec,CustomActionData="wevtutil.exe" im "C:\Program Files\Microsoft SDKs\Windows Azure\.NET SDK\2012-06\bin\plugins\Caching\Microsoft.ApplicationServer.Caching.EventDefinitions.man")
MSI (s) (58:50) [12:08:37:705]: Invoking remote custom action. DLL: C:\Windows\Installer\MSI4266.tmp, Entrypoint: CAQuietExec
CAQuietExec: Provider Microsoft-Windows Server AppFabric Caching is already installed with GUID {a77dcf21-545f-4191-b3d0-c396cf2683f2}.
CAQuietExec: Configuration error.
CAQuietExec: Error 0x80073aa2: Command line returned an error.
CAQuietExec: Error 0x80073aa2: CAQuietExec Failed
CustomAction RegisterEventManifest returned actual error code 1603 (note this may not be 100% accurate if translation happened inside sandbox)
Action ended 12:08:37: InstallFinalize. Return value 3.
The workaround is to remove the AppFabric Cache component first and then install the Windows Azure SDK 1.7, there is some more information here: http://social.msdn.microsoft.com/Forums/en-US/windowsazuredevelopment/thread/cafcc159-578a-469f-b1d5-59a4b9c6b5c4/

Friday 1 February 2013

A small ARM race

I have been on a quest lately to find a computer that meets the following criteria:

Cheap enough to let the kids have it
I don't really care if the kids break it
I would like them to maybe learn something about computers, Don’t want app junkies
I’m old school and want the kids to learn to use a keyboard and a mouse, not mash their paws on a screen
I can reuse some old hardware, keyboard + mouse + monitor
Low power consumption

The initial logical answer is the Raspberry PI, yes its cheap but, I’m of the opinion based on discussion on the web that it is a bit under powered as a computer and is more suited to embedded control projects then as a low cost PC like device.

Next up on the list is the Beagleboard-xm it has a bit “power” than the Raspberry PI but it is still considered an embedded board. Having the opportunity to play around with a Beagleboard-xm I was impressed buy its low power consumption, about 750mA @ 5 volts which works out at about 3.75 watts, that was running Ångström while playing 720p video. Unfortunately that is all this board is capable of due to the video processor. But the gnome UI was smooth and use able.

I the installed Ubuntu 12.10 on the micro SD card, which is what most of these small arm board boot from, the result was less that impressive. The response from the UI was pretty slow.

Next on the list is the Pandaboard ES which is a step above the Beagleboard-xm, it has 1GB of ram and a 1.2 Ghz processor and is capable of 1080p. This looks like a pretty capable board but it is more expensive than the Beagleboard-xm. Unfortunately I’ve not had the chance to play with one yet but its on the list of contenders.

I recently found a new entry into the small low powered ARMBrix Zero from a Korean company.

Unfortunately the first batch has sold out. Hopefully there will be more. This is a pretty powerful board it has a Samsung Exynos 5250 dual core ARM Cortex-A15 which is a pretty powerful processor on a board that has 2GB of ram, HDMI, USB 3.0 and a SATA port, 1080p video decoding all for $145 + shipping.

The catch is that there was only one pre order batch available to order which was for an engineering evaluation run. The people behind the board are in the process of testing the design. I for one am eagerly awaiting the next run.

Only problem is I might not let the kids have it :>

Tuesday 29 January 2013

IASA Ireland Chapter

Attended the IASA IT Architects – 2013 Kick Off Event on Thursday 24th. Wasn't really too sure what to expect but was pleasantly surprised. The presentation covered the usual who and what is IASA, information about the Irish chapter, training and certification.

The really big news is that the Irish chapter is to become the IT architecture network within ICS, ICS and IASA have signed a partnership agreement with ICS.

Expect to hear more from March onwards as the details are worked out.

Custom WCF UserNamePasswordValidator Performance Impact

Recently I have had to perform some debug/analysis on an unusual case of clients receiving timeouts on a critical WCF service in a production environment. I use the term “unusual” to describe the timeouts due to the nature in which these timeouts occur combined with the operation behaviour of the service when they are occurring. That and the timeouts are extremely severe, every client experiences timeouts, in my experience with WCF is not an expected operational behaviour of the WCF stack. Under normal high load scenarios a percentage of clients will experience timeouts or receive the expected Server To Busy fault when the WCF service has breached its configured throttle values.
In this specific case knowing the operation capabilities of the WCF service in question I knew that the amount of load on the server although high was well within the capacity of the service in question. Nor was the load significant enough to breach the configured service throttles. I eventually got to the bottom of the issue through debug and crash dump analysis. It was an unexpected/undocumented behaviour of the WCF stack coupled/influenced by a specific behaviour of a custom UserNamePasswordValidator used by the WCF service.
I thought it would be worth describing the problem in more detail in the hope it might help others who experience WCF Service performance issues, that done seem to "make sense", giving one additional diagnostic check to ad to the toolbox.

Symptoms

The symptoms of the service outage are as follows:

A high percentage of clients, > 90%, receive timeouts or Server To Busy exceptions
The Instances service performance Counter is at or close to the the configured or defaulted ServiceThrottlingElement.MaxConcurrentInstances value
The Calls Per Second Rate service performance counter is low or at 0 for periods of time
If the service is deployed in an IIS app pool the ASP.NET Application Performance Counter Pipeline Instance Count will be at or close to its configured or defaulted value
The Security Validation And Authentication Failures performance counter can be non zero
The service appears to be deadlocked or being starved of requests, the service is not writing log files or updating database etc. but clients experience timeouts or very slow reponses
Very low CPU utilisation by the process in question

Cause

After generating a process memory dump, from the WCF Service. After examining all of the threads stacks by executing the EEStack command it was clearly evident that most of the threads in the process where blocked on the unmanaged windows API method GetQueuedCompletionStatus which is a blocking I/O call. One call was blocked on the Validate method of the UserNamePasswordValidator.
It transpires that the the Validate method of the custom UserNamePasswordValidator used by the WCF service, talks to a proprietary back office system to validate a clients username and password. To prevent a client from brute forcing a users password it will delay the username/password validation response for 5 seconds after the nth unsuccessful attempt. This has very serious consequences for the throughput of the WCF service.

Demonstration Code And Results

To demonstrate this problem I have put together a simple demo base on one of the WCF sample applications. It is the calculator application that implements a simple custom UserNamePasswordValidator. The code itself is not really the interesting part and is pretty trivial, the intent is to demonstrate the symptoms.

Setup

Run the Setup.Bat file in the .\UserNamePasswordValidator\CS folder. Note you should run this from a VS2010 command prompt with administrative privileges. As it creates a test certificate and grants access to the private key for encryption/signing the message credentials.

Clean-up

Run the CleanUp.bat file in the .\UserNamePasswordValidator\CS folder. Note you should run this from a VS2010 command prompt with administrative privileges. AS it deletes the test certificate created by setup.bat
Open the solution in VS2010 running as a administrator as the code attempts to register a http.sys listener on port 8001.
There are a number of application settings that can be used to modify the behaviour of the client and service code.
In the LoadTestClient there are the following settings:

<applicationSettings>
 <LoadTestClient.Properties.Settings>
   <!-- The number of tasks to create with valid passwords -->
   <setting name="NumberTasksValidPassword" serializeAs="String">
    <value>10</value>
   </setting>
   <!-- The number of tasks to create with invalid passwords -->
   <setting name="NumberTasksInvalidPassword" serializeAs="String">
    <value>2</value>
   </setting>
 </LoadTestClient.Properties.Settings>
</applicationSettings>

In the service there is one setting to configure the sleep time for invalid requests:

<applicationSettings>
 <service.Properties.Settings>
  <!-- The sleep time for a request with an invalid password -->
  <setting name="InvalidCredentialSleep" serializeAs="String">
   <value>1000</value>
  </setting>
 </service.Properties.Settings>
</applicationSettings>

Test Runs

To demonstrate the behaviour I have put together a few test cases, that vary each of the parameters of the test and show the results.

10 Tasks valid credentials , 0 Tasks invalid credentials

Form the graphs above we can see 10 service instances performing approximately 85 calls/second. The calls outstanding counter averages < 1. This is normal operation of the service.

10 Tasks valid credentials , 2 Tasks invalid credentials, Invalid Credential sleep time 1 sec

As we can see can see from the above, the results are relatively similar.

10 Tasks valid credentials , 3 Tasks invalid credentials, Invalid Credential sleep time 1 sec

As we can see from the above we have reached a tipping point. When the number of clients making invalid requests increases to 3, we can see initially the service respond times and calls/second are ok. When the three client tasks with invalid passwords start the throughput of the service is severely throttled to approximately 10 calls/sec. I would also draw your attention to the Calls Duration metric which is close to the other two runs.

10 Tasks valid credentials , 2 Tasks invalid credentials, Invalid Credential sleep time 5 sec

When the number of clients making invalid requests is set to 2 but the sleep time is increased to 5 seconds we see a very erratic calls/second profile. Again the Calls Duration metric at the service level is in line with the other tests, but if measured from the client we would see a higher Calls Duration. The Calls Outstanding metric is also not zero more often than in the previous tests.

10 Tasks valid credentials , 4 Tasks invalid credentials, Invalid Credential sleep time 1 sec

When the number of clients making invalid requests is set to 4 with a sleep time of 1 seconds, we can see initially the service respond times and calls/second are ok. When the When the 4 client tasks with invalid passwords start the calls/second profile of the service is very erratic. Again the Calls Duration metric at the service level is in line with the other tests.

In a real world environment it is clear that you will not see such clear patterns to the calls/second metric. It is more likely that the service will in fact not process any requests if the client invalid call pattern is more random and at a higher volume.

Crash Dump Analysis

If we take a crash dump of the WCF Service process while it is in the “Deadlocked/Blocked” state we can examine the processes threads.

0:000> !threads
ThreadCount:      12
UnstartedThread: 0
BackgroundThread: 11
PendingThread:    0
DeadThread:       0
Hosted Runtime:   no
                                       PreEmptive                                                   Lock
   ID OSID        ThreadOBJ     State GC       GC Alloc Context                  Domain           Count APT Exception
0    1 16d0 0000000000205eb0      a020 Enabled 0000000000000000:0000000000000000 00000000001f9780     1 MTA
2    2 184c 000000000020c1f0      b220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Finalizer)
6    a 1708 000000001b9df9d0   a009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
8    6 1ae8 000000001b927570   a009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
9    4 19fc 000000001b926cf0   8009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
10    8 1958 000000001b9e1e40   8009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
11    b   e84 000000001b9e6c60   8009220 Enabled 00000000033bb8c0:00000000033bd178 00000000001f9780     0 MTA (Threadpool Completion Port)13    c   e18 000000001b9f2260   8009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
14    5 1a58 000000001b90c4a0   8009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
17    9   728 000000001af1f7c0   8009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
18    d 1aec 000000001af1ef40   8009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)
19    3 1be0 000000001aed7e00   8009220 Enabled 0000000000000000:0000000000000000 00000000001f9780     0 MTA (Threadpool Completion Port)

We can see that all of the threads are running but the Exception column which lists the last thrown exception (if any) for the thread, shows (ThreadPool Completion Port). We can exclude a few of the threads from inspection, for example thread 0 is the thread executing the Microsoft.ServiceModel.Samples.CalculatorService.Main() method. Thread 2 is the CLR Finalizer thread.

If we look at a few of the other threads call stacks, shortened for brevity, we see the following:

---------------------------------------------
Thread 17
Current frame: ntdll!NtRemoveIoCompletion+0xa
Child-SP         RetAddr          Caller, Callee
000000001d5deb40 000007fefdf616ad KERNELBASE!GetQueuedCompletionStatus+0x39, calling ntdll!ZwRemoveIoCompletion
000000001d5deb50 000007fef4d46b1c clr!GCHolderBase<1,0,0,1>::Pop+0x2c, calling clr!Thread::EnablePreemptiveGC
000000001d5deb80 000007fef4e309ff clr!Thread::InternalReset+0x140, calling clr!Thread::SetBackground
000000001d5deba0 00000000777aa4e1 kernel32!GetQueuedCompletionStatusStub+0x11, calling kernel32!GetQueuedCompletionStatus
000000001d5debb0 000007fef4d7dd49 clr!Thread::LeaveRuntime+0x9, calling clr!Thread::LeaveRuntimeNoThrow
000000001d5debe0 000007fef4f1eb2f clr!ThreadpoolMgr::CompletionPortThreadStart+0x113, calling kernel32!GetQueuedCompletionStatusStub
000000001d5dec50 000007fef4d46a14 clr!operator delete+0x45, calling kernel32!GetLastErrorStub---------------------------------------------
Thread 18
Current frame: ntdll!NtRemoveIoCompletion+0xa
Child-SP         RetAddr          Caller, Callee
000000001ce7ec10 000007fefdf616ad KERNELBASE!GetQueuedCompletionStatus+0x39, calling ntdll!ZwRemoveIoCompletion
000000001ce7ec20 000007fef4d46b1c clr!GCHolderBase<1,0,0,1>::Pop+0x2c, calling clr!Thread::EnablePreemptiveGC
000000001ce7ec50 000007fef4e309ff clr!Thread::InternalReset+0x140, calling clr!Thread::SetBackground
000000001ce7ec70 00000000777aa4e1 kernel32!GetQueuedCompletionStatusStub+0x11, calling kernel32!GetQueuedCompletionStatus
000000001ce7ec80 000007fef4d7dd49 clr!Thread::LeaveRuntime+0x9, calling clr!Thread::LeaveRuntimeNoThrow
000000001ce7ecb0 000007fef4f1eb2f clr!ThreadpoolMgr::CompletionPortThreadStart+0x113, calling kernel32!GetQueuedCompletionStatusStub
000000001ce7ed20 000007fef4d46a14 clr!operator delete+0x45, calling kernel32!GetLastErrorStub

As mentioned earlier GetQueuedCompletionStatus is a blocking operation.

Looking at Thread 8 we can see that it is the only thread executing the Validate method of the CustomUsernameValidator.

Thread   8
Current frame: ntdll!NtDelayExecution+0xa
Child-SP         RetAddr          Caller, Callee
000000001c4cd200 000007fefdf61203 KERNELBASE!SleepEx+0xab, calling ntdll!ZwDelayExecution
000000001c4cd220 000007fee14f2f9a (MethodDesc 000007fee148ffb0 +0x4a System.Xml.XmlBufferReader.GetString(Int32, Int32)), calling (MethodDesc 000007fef39e5540 +0 System.String.CtorCharArrayStartLength(Char[], Int32, Int32))
000000001c4cd270 000007fefdf6358b KERNELBASE!SleepEx+0x12d, calling ntdll!RtlActivateActivationContextUnsafeFast
000000001c4cd2a0 000007fef4d76455 clr!EESleepEx+0x22, calling kernel32!SleepExStub
000000001c4cd2d0 000007fef4f33fba clr!Thread::UserSleep+0x97
000000001c4cd310 000007fee14e95bf (MethodDesc 000007fee1470360 +0x2f System.Xml.XmlBaseReader.get_Value()), calling clr!JIT_WriteBarrier_Fast
000000001c4cd320 000007fef4f3410c clr!ThreadNative::Sleep+0x100, calling clr!Thread::UserSleep
000000001c4cd3d0 000007fee166fca5 (MethodDesc 000007fee148f5e8 +0x85 System.Xml.XmlBaseReader.ReadElementString())
000000001c4cd3e0 000007fee167ed39 (MethodDesc 000007fee148d818 +0x19 System.Xml.XmlDictionaryReader.ReadStartElement(System.Xml.XmlDictionaryString, System.Xml.XmlDictionaryString))
000000001c4cd3e8 000007ff00160be6 (MethodDesc 000007ff001b4f98 +0x116 Microsoft.ServiceModel.Samples.CalculatorService+CustomUserNameValidator.Validate(System.String, System.String)), calling (MethodDesc 000007fef3a5cd98 +0 System.Threading.Thread.Sleep(Int32))

Summary

In my opinion it is evident from the test results and the stack trace analysis presented in this post that there is some inherent "queuing/pipelining" in the processing of messages with credentials. Whether or not this is due to the fact that a custom UserNamePasswordValidator is being used is unclear. It is also evident that this “pipeline” is sensitive to execution latency of the Validate method of the UserNamePasswordValidator, possibly as a side effect of being a shallow queue.

Either a number of requests of moderate latency or a smaller number of requests with high latency can have a impact on the overall performance of the service. It is therefore something to be aware of when implementing custom UserNamePasswordValidator types that perform credential validation that may be subject to non trivial execution latency either due to I/O or CPU.

In my opinion this behaviour could be exploited to create an effective DOS attack, as the attacker does not have to create a huge request load to affect the WCF services performance. In the specific case that I discovered in the production environment I modified the code to mitigate the problem while remaining secure.