Sunday, August 12, 2007

Correctness of Parallel Code

One of the big problems in writing parallel applications is code correctness. How do we know that our parallel implementations are correct? While multi-core correctness has been researched and documented a great deal, correctness for clustering code appears to still be in its infancy. Writing, debugging, and verifying the correctness of your applications remains incredibly difficult. Now, I am not talking about grid computing or web services. I am specifically referring to high performance computing with MPI and the like.

When debugging a parallel application, a user must set breakpoints, attach to multiple processes, and then try to step through the applications in parallel. This becomes even more unwieldy if you are also trying to write the library/service and test your applications. I am not a fan of debugging, and parallel debugging just makes my skin crawl. So, what can one do?

When designing the implementation of SPPM I wanted to approach it in a way that would allow flexibility never seen in a parallel programming system. Starting at the highest level, I abstracted the service to a well defined interface. Given this interface, I can actually implement several versions of my service that can be swapped out at runtime. But why would I want multiple versions? Using a framework like Castle, I can specify implementations of my service in a configuration file for the WindsorContainer to resolve. Depending on how I want to test, I can change my service implementation. I have two implementations right now: the real service, and a local virtual service. The real service tells my library that when calls are made, try to talk to an external service which runs on every node in the cluster. The local virtual service is used when you want to test a master and worker application without needing a cluster or external service. From the applications' point of view nothing changes. They make the same library calls, but the service implementation has changed allowing the programmer to focus on their application and not its dependencies.
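The swap-at-runtime idea can be sketched without the container. This is a hypothetical sketch: `ISppmService`, `LocalVirtualService`, and `ServiceFactory` are made-up names, and the real SPPM code resolves the implementation through Castle's WindsorContainer configuration file rather than a hand-rolled switch.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface; the real SPPM API surely differs.
public interface ISppmService {
    void Put(string name, object value);
    object Get(string name);
}

// Stands in for the local virtual service: everything lives in-process,
// so a master and worker can be tested without a cluster.
public class LocalVirtualService : ISppmService {
    private readonly Dictionary<string, object> space = new Dictionary<string, object>();

    public void Put(string name, object value) { space[name] = value; }

    public object Get(string name) {
        object value;
        return space.TryGetValue(name, out value) ? value : null;
    }
}

// With Castle, the choice below would come from the WindsorContainer's
// configuration file instead of this switch statement.
public static class ServiceFactory {
    public static ISppmService Create(string implementation) {
        switch (implementation) {
            case "local": return new LocalVirtualService();
            // case "real": return new RealService(); // talks to the external cluster service
            default: throw new ArgumentException("unknown implementation: " + implementation);
        }
    }
}
```

An application would resolve the service once at startup; swapping "local" for "real" changes the dependency without touching any application code, which is exactly what the container's configuration file buys you.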

Another set of work dealt with the library implementation. All calls were implemented with a composite pattern partnering the command pattern and the template method pattern. All commands request a pointer from a communication interface factory, which returns a pointer to some kind of network transmission class. This class is switched out when the service implementation changes - it may get a TCP socket communicator, a remoting communicator, or just a pointer to an object that routes calls into the local virtual service.
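The shape of that factory-plus-command arrangement looks roughly like this. All names here (`ICommunicator`, `CommunicatorFactory`, `GetCommand`) are invented for illustration; the actual library's types differ.

```csharp
using System;

// Hypothetical transmission interface; real implementations might wrap a
// TCP socket, .NET Remoting, or the in-process local virtual service.
public interface ICommunicator {
    string Send(string request);
}

// Routes calls straight into an in-process service instead of the network.
public class LocalVirtualCommunicator : ICommunicator {
    public string Send(string request) { return "local:" + request; }
}

// The factory decides which transmission class the commands receive.
// Swapping the service implementation means swapping this one field.
public static class CommunicatorFactory {
    public static ICommunicator Communicator = new LocalVirtualCommunicator();
    public static ICommunicator Create() { return Communicator; }
}

// Template method: every command asks the factory for its communicator,
// so a change of transport never touches the command classes.
public abstract class Command {
    public string Execute() {
        ICommunicator comm = CommunicatorFactory.Create();
        return Run(comm);
    }
    protected abstract string Run(ICommunicator comm);
}

public class GetCommand : Command {
    private readonly string name;
    public GetCommand(string name) { this.name = name; }
    protected override string Run(ICommunicator comm) {
        return comm.Send("GET " + name);
    }
}
```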

Up to this point, all of these features have been implemented with design patterns and object containers. To push things one step further, I had to solve a problem that has really bothered me. Why must I actually execute all of my parallel applications together in order to test them? Why must I run a service, master, and worker just to verify that the simplest case works? I have already shown how to test using just the master and worker by abstracting the service, so what about verifying the correctness of the master and worker independently of one another? Impossible? Why? External dependencies? What about the service layer? Is it at all possible to apply TDD to HPC and clusters? Of course, but it is far from obvious.

Aspect-oriented programming will allow us to simulate our dependencies without changing our code. The TypeMock web site does a good job at illustrating this. Our problem breaks down to the need to simulate external dependencies. Write unit tests for your parallel applications, but where external library calls would be made, like a parallel library which talks to the cluster, mock the calls! Below is a small sample trying to mock the worker of a parallel matrix multiplication application. The worker application is supposed to get the B matrix, and a series of rows Ai from A, and compute a sub-result Ci which is sent back to the master. The Check method will be called when the Put call is made and intercepted, and the Ci matrix will be verified for correct contents. The term tuple tells the worker application to exit when it tries to get the next piece of A.

[Test]
public void WorkerTest() {
    // We use this call to tell TypeMock that the next creation of
    // the InjectionLibrary needs to be faked. TypeMock will replace
    // the instance that should have been created with its own
    // empty implementation. The mocks we set up below tell
    // TypeMock that we are going to have calls made by another component
    // and we need to insert fake results.
    Mock mock = MockManager.Mock( typeof(InjectionLibrary) );

    // Fake the B call, made when the Worker application calls
    // Slm.Get("B").
    mock.ExpectAndReturn( "Get", 0 ).Args( new Assign( "B" ), new Assign( B ) );

    // Fake the Ai call.
    string tupleName = String.Format( "A0:A{0}", N - 1 );
    mock.ExpectAndReturn( "Get", 0 ).Args( new Assign( tupleName ), new Assign( A ) );

    // Grab the Put of Ci and validate the result.
    // We tell TypeMock to call the Check method below to
    // validate the argument (the Ci matrix).
    mock.ExpectCall( "Put" ).Args( Check.CustomChecker( new ParameterCheckerEx( Check ) ) );

    // Fake the term tuple.
    mock.ExpectAndReturn( "Get", 0 ).Args( new Assign( "A0:A0" ), new Assign( term ) );

    // This next line is mocked; the library is static
    // and the faked instance is shared.
    InjectionLibrary Slm = InjectionLibrary.Instance;

    // Call our worker's main method.
    // All of the calls that were mocked above will
    // have their calls hijacked.
    OCMMWorker.Main( null );
}


Here we have referenced a .NET application like an external DLL and invoked its Main method. Everything is run from the comfort of your TestFixture and you have verified one case in your worker application without the need of a cluster, master, or external/virtual service. Using these methods, we can write parallel cluster code while minimizing our need to debug. Why debug when you know it works?

Saturday, August 11, 2007

Overhead of Parallel Applications and Scheduling Algorithms

When measuring the speedup of a parallelized algorithm, many people only compare the serial and parallel running times (Sp = Tseq/Tpar). The effective speedup can be more closely approximated by further defining Tpar with the following equation:

Tpar = Tcomp + Tcomm + Tmem + Tsync

Tpar is the total parallel running time.
Tcomp is the time spent in parallel computation on all machines/processors.
Tcomm is the time spent communicating between processes and machines.
Tmem is the time spent in main memory or on disk.
Tsync is the time required to synchronize.

All of these are easily quantifiable with the exception of Tsync. In an ideal world Tsync is zero, but this is never the case. A single slow processor/machine/data link can make a huge difference with Tsync. So, how can we manage it?

The most straightforward way to try to minimize Tsync is to apply a fixed chunk load. MPI uses this method. This is fine unless you have a heterogeneous cluster or other applications are running on the processors/machines being used.

An extension of fixed chunking is to create a normalized value for all of your computing resources, and then let each resource get jobs whose size is comparable to its ability. For example, say we have three machines whose computing powers are A:1, B:2, C:3 (C is three times as powerful as A). If we have twelve pieces of work to distribute, we can group them into six groups of two (total / sum of power). In the time C computes three of the chunks, B will have done two and A will have done one. The entire job should wrap up with a minimal Tsync. This kind of work load balancing works very well for heterogeneous clusters.
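The A:1, B:2, C:3 example above can be sketched as a few lines of code (the class and method names are made up; remainder handling when the total does not divide evenly is omitted):

```csharp
using System;

// Normalized-power distribution: group size is total / sum-of-powers,
// and a machine of power p takes p groups per round.
public static class WeightedChunking {
    public static int[] ItemsPerMachine(int totalItems, int[] powers) {
        int powerSum = 0;
        foreach (int p in powers) powerSum += p;

        int groupSize = totalItems / powerSum;   // 12 items / 6 power = 2 items per group
        int[] counts = new int[powers.Length];
        for (int i = 0; i < powers.Length; i++) {
            counts[i] = powers[i] * groupSize;   // power p takes p groups
        }
        return counts;
    }
}
```

With powers {1, 2, 3} and twelve items, the machines receive 2, 4, and 6 items respectively, so all three should finish at roughly the same time and Tsync stays small.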

Other scheduling methods have been tried in clusters, factoring and guided self-scheduling being two, but they seem to add too much overhead to Tcomm. There has to be a balance between keeping communication chunky and minimizing Tsync. They may, however, work as valid scheduling algorithms for multi-core processing, where Tcomm is greatly reduced compared to cluster computing.
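To make the Tcomm objection concrete, here is a minimal sketch of how guided self-scheduling sizes its chunks (my own illustrative code, not the GuidedSelfScheduling class used later): each chunk is the ceiling of remaining work divided by the number of workers, so chunk sizes shrink as the work drains.

```csharp
using System;
using System.Collections.Generic;

public static class GuidedSelfSchedulingSketch {
    // Produces the sequence of chunk sizes for a loop of `size` iterations
    // shared by `workers` workers.
    public static List<int> ChunkSizes(int size, int workers) {
        List<int> chunks = new List<int>();
        int remaining = size;
        while (remaining > 0) {
            int chunk = (remaining + workers - 1) / workers; // ceiling division
            chunks.Add(chunk);
            remaining -= chunk;
        }
        return chunks;
    }
}
```

For 12 iterations and 3 workers this yields 4, 3, 2, 1, 1, 1: six chunks instead of three fixed ones. The small tail chunks are what keep Tsync low, but on a cluster every extra chunk is another network round-trip, which is exactly the Tcomm overhead complained about above; on a multi-core machine the hand-off is nearly free.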

Working again from Joe Duffy's MSDN article "Reusable Parallel Data Structures and Algorithms", I have reworked the loop tiling code to support scheduling algorithms.

// Note: [ThreadStatic] initializers only run for the first thread;
// other threads see default values until they assign their own.
[ThreadStatic]
private static int partitions = ProcessorCount;

[ThreadStatic]
private static SchedulingAlgorithms algorithm = SchedulingAlgorithms.Factoring;

[ThreadStatic]
private static double factor = 0.5;

// Minimum chunk size handed to the Factoring algorithm.
[ThreadStatic]
private static int factoringThreshold = 1;

private static void For<TInput>( IList<TInput> data, int from, int to,
                                 Action<TInput> dataBoundClause, Action<int> indexBoundClause )
{
    Debug.Assert( from < to );
    Debug.Assert( ( dataBoundClause != null ) || ( indexBoundClause != null ) );
    Debug.Assert( ( data != null && dataBoundClause != null ) ||
                  ( data == null && dataBoundClause == null ) );

    int size = to - from;
    int offset = 0;
    List<int> gp = GetPartitionList( size );

    int parts = gp.Count;
    CountdownLatch latch = new CountdownLatch( parts );
    for (int i = 0; i < parts; i++) {
        int start = offset + from;
        int partitionSize = gp[0];
        gp.RemoveAt( 0 );
        int end = Math.Min( to, start + partitionSize );
        offset += partitionSize;
        ThreadPool.QueueUserWorkItem( delegate
        {
            for (int j = start; j < end; j++) {
                if ( data != null ) {
                    dataBoundClause( data[j] );
                }
                else {
                    indexBoundClause( j );
                }
            }

            latch.Signal();
        } );
    }
    latch.Wait();
}

private static List<int> GetPartitionList( int size ) {
    ISchedulingAlgorithm schedulingAlgorithm = null;

    switch ( algorithm ) {
        case SchedulingAlgorithms.FixedChunking:
            schedulingAlgorithm = new FixedChunking( size, Math.Min( size, partitions ) );
            break;
        case SchedulingAlgorithms.GuidedSelfScheduling:
            schedulingAlgorithm = new GuidedSelfScheduling( size );
            break;
        case SchedulingAlgorithms.Factoring:
            schedulingAlgorithm = new Factoring( size, factor, factoringThreshold );
            break;
    }

    Debug.Assert( schedulingAlgorithm != null );
    return schedulingAlgorithm.GetPartitionSizes();
}



With this setup, all you have to do is set the scheduling algorithm that you would like to use along with its parameters. The variables are thread static so that many threads can each have their own algorithm, but since the for loop is blocking you only need one copy per thread.


It will take a little bit of tweaking for your application's algorithm, but these should produce reasonable results. My next goal is to get them benchmarked under different scenarios.

Follow Up: Frustration Caused by Amdahl's 'Law'

Michael Suess from Thinking Parallel was kind enough to point me to a paper he coauthored "Implementing Irregular Parallel Algorithms with OpenMP". In it he goes on to describe a way to implement the early thread sync that I mentioned using OpenMP. If you are interested in more of his papers, here is his publications page.

RE: Why are int[,], double[,], single[,], et alia so slow?

I actually wrote the previous post in October of 2006. Since then, someone was nice enough to look up the reasons for me. Wesner Moise posted this article on CodeProject that I think explains a lot.

Wednesday, August 8, 2007

Jamie Cansdale - TestDriven.Net - Silently Posts a Resolution with Microsoft

From the 2.8.2130 release notes

Release Notes - TestDriven.NET: 2.8

993: Retire Visual Studio Express support

Joint statement: "Jamie Cansdale and Microsoft Corporation have agreed to concentrate on working together on future releases of TestDriven.Net for Microsoft's officially extensible editions of Visual Studio, as opposed to spending time litigating their differences."

Saturday, August 4, 2007

Frustration Caused by Amdahl's 'Law'

I was looking over some OpenMP resources and I found this line on the Wikipedia page:

A large portion of the program may not be parallelized by OpenMP, which sets theoretical upper limit of speedup according to Amdahl's law.

I will leave my rant concerning Amdahl's 'law' for another post, but I feel like I need to point something out. According to Amdahl's law, linear speedup is the theoretical maximum (using N processors yields a speedup of at most N). While some have argued that caching effects allow for a superlinear speedup, it is still only S * N where S is a small multiplier.

Wouldn't you like to put the 'super' back into superlinear? Amdahl says you can't; I say you can. Crack open that MIT bible on algorithms (I know you have/had it) and find the section on decision problems. If you no longer have the book, take a quick look at Wikipedia.

When running code in parallel you still have, in the worst case, near linear speedup. But what happens when you are partitioning? What if you partition the data into two pieces, one for each processor? If there is no solution, both processors run exhaustively and you have the worst case. Now, let there be a solution to your decision problem. We have a few scenarios:

The solution is:

  1. At the end of the first chunk: no speedup over non-parallel.
  2. At the end of the second chunk: near linear speedup.
  3. At the beginning of the first chunk: no speedup over non-parallel.
  4. Anywhere in the second chunk other than the end: linear to truly superlinear speedup!!!
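The scenarios above can be sketched with a two-worker search that stops as soon as either half finds the answer. This is a minimal illustration of the early-exit idea, not the Synergy/SPPM code; the cooperative `stop` flag is checked without memory barriers, which is fine for a sketch but sloppy for production.

```csharp
using System;
using System.Threading;

public static class EarlyExitSearch {
    // Two workers scan the two halves of the data; whoever finds the
    // target first sets the shared flag so the other worker stops early.
    public static int ParallelFind(int[] data, int target) {
        int foundIndex = -1;
        bool stop = false;
        int half = data.Length / 2;
        Thread[] workers = new Thread[2];

        for (int w = 0; w < 2; w++) {
            int start = (w == 0) ? 0 : half;
            int end = (w == 0) ? half : data.Length;
            workers[w] = new Thread(delegate() {
                for (int i = start; i < end && !stop; i++) {
                    if (data[i] == target) {
                        foundIndex = i;  // benign race: any hit is a valid answer
                        stop = true;
                    }
                }
            });
            workers[w].Start();
        }

        foreach (Thread t in workers) t.Join();
        return foundIndex;
    }
}
```

If the target sits at the very start of the second half, a sequential scan must walk the entire first half before reaching it, while the second worker here finds it on its first comparison: scenario 4's superlinear case.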

As the solution moves toward the beginning of the second chunk, you can achieve amazing results. In working on Synergy and SPPM I have seen a speedup of over 2000 using two machines doing the 0/1 knapsack!

I still need to look into OpenMP some more to see if I can get it to shortcut the thread synchronization, but a true superlinear capable OpenMP application would be great. I made a number of modifications to the code from Joe Duffy's MSDN article "Reusable Parallel Data Structures and Algorithms" to implement scheduling algorithms for loop tiling. I have fixed chunking, guided self-scheduling, and factoring algorithms implemented. I am also hoping to add a synchronization scheme to allow one thread to tell the rest to stop when a solution is found. Once complete I should have a load balanced multi-core loop tiling library capable of superlinear speedup.

Thursday, August 2, 2007

Castle References

I have been following BitterCoder's series of Castle tutorials for a while, but I just stumbled upon his container tutorials wiki. I think that there is a lot of good information there for someone new to Castle.

Another Way to Optimize Code

In a previous post I showed the performance of several matrix multiplication implementations. As impressive as the JIT compiler is, I had an inkling that the C++/CLI compiler could do better. I compiled the jagged array (type[N][N]) version in C# with optimization enabled. Here is the original source:

public static TimeSpan Multiply(int N) {
    double[][] C = new double[N][];
    double[][] A = new double[N][];
    double[][] B = new double[N][];
    int i, j, k;

    for (i = 0; i < N; i++) {
        C[i] = new double[N];
        B[i] = new double[N];
        A[i] = new double[N];

        for (j = 0; j < N; j++) {
            C[i][j] = 0;
            B[i][j] = i * j;
            A[i][j] = i * j;
        }
    }

    DateTime now = DateTime.Now;

    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            for (j = 0; j < N; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }

    return DateTime.Now - now;
}

Using Reflector on the DLL, the only thing that the compiler does is move the declaration of k into line 23. I used the C++/CLI plugin for Reflector to get the code into C++. With a little modification we get this code:

static TimeSpan Multiply(int N)
{
    int i;
    int j;
    array<array<double>^>^ C = gcnew array<array<double>^>(N);
    array<array<double>^>^ A = gcnew array<array<double>^>(N);
    array<array<double>^>^ B = gcnew array<array<double>^>(N);
    for (i = 0; (i < N); i++)
    {
        C[i] = gcnew array<double>(N);
        B[i] = gcnew array<double>(N);
        A[i] = gcnew array<double>(N);
        for (j = 0; (j < N); j++)
        {
            C[i][j] = 0;
            B[i][j] = (i * j);
            A[i][j] = (i * j);
        }
    }
    DateTime now = DateTime::Now;
    for (i = 0; (i < N); i++)
    {
        for (int k = 0; (k < N); k++)
        {
            for (j = 0; (j < N); j++)
            {
                C[i][j] = (C[i][j] + (A[i][k] * B[k][j]));
            }
        }
    }
    return ((TimeSpan) (DateTime::Now - now));
}

Compiling this code with



/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /FD /EHa /MD /Yu"stdafx.h" /Fp"Release\CppDemo.pch" /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Zi /clr /TP /errorReport:prompt /FU "c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\System.dll"


We get an optimized matrix multiply. Again, reflecting the exe to get the C# code, we get our new code, which is quite a bit faster:

public static TimeSpan Multiply(int N) {
    double[][] C = new double[N][];
    double[][] A = new double[N][];
    double[][] B = new double[N][];
    int index = 0;
    if (0 < N)
    {
        do
        {
            C[index] = new double[N];
            B[index] = new double[N];
            A[index] = new double[N];
            int column = 0;
            int value = 0;
            do
            {
                C[index][column] = 0;
                double num7 = value;
                B[index][column] = num7;
                A[index][column] = num7;
                column++;
                value = index + value;
            }
            while (column < N);
            index++;
        }
        while (index < N);
    }
    DateTime now = DateTime.Now;
    int i = 0;
    if (0 < N)
    {
        do
        {
            int k = 0;
            do
            {
                int j = 0;
                do
                {
                    double[] Ci = C[i];
                    Ci[j] = (A[i][k] * B[k][j]) + Ci[j];
                    j++;
                }
                while (j < N);
                k++;
            }
            while (k < N);
            i++;
        }
        while (i < N);
    }
    return (TimeSpan) (DateTime.Now - now);
}



This optimized code averages 182 MOPS on my machine compared to the original C# which only achieved 118 MOPS. This is a lot of work, but the speedup is amazing. I will try to put up some IL later to figure out exactly why the last version is so much faster.

Monday, July 30, 2007

Why are int[,], double[,], single[,], et alia so slow?

I wrote a few matrix multiplication (MM) applications for benchmarking my parallel processing research. I was blown away by how slowly the C# implementations ran. I tried to write the MM application a few ways. It turns out that the fastest is a jagged array, type[N][N], rather than type[N*N] or type[N,N]. I even tried to use unsafe C# with pointers, but the type[N][N] was still by far the fastest. I am going to compare the type[N][N] and the type[N,N] here.

Here is the C# code for the type[N,N]. I am using int as it benchmarks faster and I spend less time waiting. I have removed the initialization of the arrays as it just added more code to analyze and isn't needed for this analysis:

private static int[,] Standard() {
    Console.WriteLine( "Init" );
    int rows = 1500;
    int cols = 1500;
    int[,] first = new int[rows, cols];
    int[,] second = new int[rows, cols];
    int[,] third = new int[rows, cols];

    Console.WriteLine( "Calculate" );
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            for (int k = 0; k < rows; k++)
                third[i, j] += first[i, k] * second[k, j];

    return third;
}

I have added the console output to identify the IL more easily.


Here is the type[N][N] version:

private static int[][] Jagged() {
    Console.WriteLine( "Init" );
    int rows = 1500;
    int cols = 1500;
    int[][] first = new int[rows][];
    int[][] second = new int[rows][];
    int[][] third = new int[rows][];

    for (int i = 0; i < rows; i++) {
        first[i] = new int[cols];
        second[i] = new int[cols];
        third[i] = new int[cols];
    }

    Console.WriteLine( "Calculate" );
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            for (int k = 0; k < rows; k++)
                third[i][j] += first[i][k] * second[k][j];

    return third;
}



At first glance it looks like we have a bit more initialization overhead for the type[N][N]. I compiled everything and reflected out the IL. I then stripped all locations to keep this as clean and simple as possible (and the diff cleaner). There is a difference, though. The type[N][N] uses ldelem and stelem to initialize, while type[N,N] uses call instance void int32[0...,0...]::Set(int32, int32, int32). I think this is where we start seeing why the C# array is slower than the jagged version.


If we look at the IL for the initialization that takes care of everything up to the writing of "Calculate" to the console:



On the type[N,N] side we are creating a new object on the evaluation stack and then popping it from the stack and storing it in a local variable - this is done for each of the three arrays.


On the type[N][N] side we push a new primitive array of int32 onto the stack, and for each position we use the same call to create a new primitive array and replace the value on the stack at the given position.


Here is the grand conclusion to our IL:



As you can see, the calculations happen 'almost' identically. For the type[N,N] array we are making external calls to the runtime in order to get an address and set values, whereas the type[N][N] simply issues a ldelem. Now, I wish there were a way to see exactly how long the Address and Get methods take, but the evidence is clear that ldelem is a lot faster.


I ran all of the matrix permutations for all of the types of arrays I could use. The ikj MM was the fastest for all. I benchmarked all of these on two machines (S01, S02). The machines were as identical as possible, with nothing running except background services. Each point was run three times and the average of the three was plotted. The matrix size (bottom) was NxN (square) and the left hand side is the number of algorithmic steps per second. For MM, the multiplication, addition, and assignment count as one operation.




For a 3000x3000 matrix we are making millions of method calls using the type[N,N] matrix that the type[N][N] does not. Why is the C# matrix implemented in this way? Please let me know if I missed anything or if you have some insight.

Using Castle.Facilities.Synchronize

I removed the last post as the formatting was horrible. Here again is my first demo app with the Synchronization facility. It is normally quite painful to try to make your WinForms application thread-safe. VS2005 made it much easier to see that you are doing something wrong. Your application will move along and then the debugger will break and you have a window with an InvalidOperationException in your face.

System.InvalidOperationException was unhandled
  Message="Cross-thread operation not valid: Control 'richTextBox' accessed from a thread other than the thread it was created on."
  Source="System.Windows.Forms"
  StackTrace:
       at System.Windows.Forms.Control.get_Handle()
       at System.Windows.Forms.RichTextBox.StreamIn(Stream data, Int32 flags)
       at System.Windows.Forms.RichTextBox.StreamIn(String str, Int32 flags)
       at System.Windows.Forms.RichTextBox.set_Text(String value)
       at SafelyUpdatingWinForm.MainForm.UpdateTime() in C:\SafelyUpdatingWinForm\SafelyUpdatingWinForm\MainForm.cs:line 38
       at SafelyUpdatingWinForm.MainForm.UpdateLoop() in C:\SafelyUpdatingWinForm\SafelyUpdatingWinForm\MainForm.cs:line 27
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()

The IDE is nice enough, though, to point you in the direction of a solution: How to: Make Thread-Safe Calls to Windows Forms Controls (MSDN). They suggest that for any control you want to change, you create delegates and use Control.Invoke to make the call on the appropriate thread. This gets very bloated when you have a large and complex UI. To simplify things, Craig Neuwirt created a new facility to plug into the Castle WindsorContainer. It can automatically marshal the calls to the UI thread for you and a whole lot more. Roy Osherove has his own implementation that uses the DynamicProxy2.
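For contrast, the hand-rolled MSDN pattern that the facility replaces looks roughly like this. This is a sketch: `ClockForm` and its control are made up, and real designer-generated forms wire the control up differently.

```csharp
using System;
using System.Windows.Forms;

// Sketch of the manual Control.Invoke pattern from the MSDN article.
public class ClockForm : Form {
    private readonly RichTextBox richTextBox = new RichTextBox();

    // One delegate type per update method is part of the bloat.
    private delegate void SetTimeCallback(string time);

    // Safe to call from any thread.
    private void SetTime(string time) {
        if (richTextBox.InvokeRequired) {
            // Worker thread: re-dispatch this same call onto the UI thread.
            richTextBox.Invoke(new SetTimeCallback(SetTime), new object[] { time });
        } else {
            richTextBox.Text = time;
        }
    }
}
```

Multiply this by every control and every update method in a large UI and the appeal of having the container do the marshaling becomes obvious.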


Program.cs:

using System;
using System.Windows.Forms;
using Castle.Facilities.Synchronize;
using Castle.Windsor;

namespace SafelyUpdatingWinForm {
    internal static class Program {
        [STAThread]
        private static void Main() {
            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault( false );

            WindsorContainer container = new WindsorContainer();
            container.AddFacility( "sync.facility", new SynchronizeFacility() );
            container.AddComponent( "mainform.form.class", typeof (MainForm) );

            Application.Run( container.Resolve( "mainform.form.class" ) as Form );
        }
    }
}



And then in the MainForm.cs:

using System;
using System.ComponentModel;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;
using System.Windows.Forms;

namespace SafelyUpdatingWinForm {
    public partial class MainForm : Form {
        private Thread updateLoop;
        private bool loop = true;
        private const int delay = 2000;

        public MainForm() {
            InitializeComponent();
        }

        private void UpdateLoop() {
            while ( loop ) {
                UpdateTime();
                Thread.Sleep( delay );
            }
        }

        protected virtual void UpdateTime() {
            string text = GetWebPage();
            Regex regex = new Regex( @"<b>\d\d:\d\d:\d\d<br>" );
            Match match = regex.Match( text );
            string time = match.Value.TrimStart( "<b>".ToCharArray() )
                .TrimEnd( "<br>".ToCharArray() );
            richTextBox.Text = time;
        }

        private static string GetWebPage() {
            string url = "http://www.time.gov/timezone.cgi?Eastern/d/-5";
            HttpWebRequest webRequest = WebRequest.Create( url ) as HttpWebRequest;
            webRequest.Method = "GET";
            using (WebResponse webResponse = webRequest.GetResponse()) {
                using (StreamReader sr = new StreamReader( webResponse.GetResponseStream(), Encoding.UTF8 )) {
                    return sr.ReadToEnd();
                }
            }
        }

        protected override void OnClosing( CancelEventArgs e ) {
            loop = false;
            while ( updateLoop.IsAlive ) { }
            base.OnClosing( e );
        }

        protected override void OnLoad( EventArgs e ) {
            updateLoop = new Thread( new ThreadStart( UpdateLoop ) );
            updateLoop.Name = "UpdateLoop";
            updateLoop.Start();
            base.OnLoad( e );
        }
    }
}