Monday, August 27, 2007

Running Mocked Unit Tests in FinalBuilder With Coverage

I just read an article by Craig Murphy on running unit tests in FinalBuilder. I liked the article, and it gives some good pointers; I wish he had written it a while ago, because it would have been helpful to me. We use TypeMock, which means our unit tests cannot be run quite as simply, since everything has to be launched through TMockRunner. When I wrote our build scripts I took the same general approach he did.

Starting in the Main ActionList, we define a few variables that give the paths to the applications we use. I also created an ActionList called RunTest, which I will go into in detail below. For each DLL being tested, a call to RunTest is made, passing in the name of the DLL to test.

In each call we set the CURRENTDLL ActionList Parameter with the name of the DLL (Animation,Audio,Core,..).

NCOVEROPTIONS holds the command-line options for ncover.console.exe; the same goes for NUNITOPTIONS and TMOCKRUNNEROPTIONS. The rest of the ActionList Parameters are used as temporary variables, allowing me to put together very complex command argument strings and parse XML into variables.

The RunTest ActionList is separated into two try...catch blocks: the first runs the tests and the second processes the results.

 

Here are what the commands are doing:

Set Variable NCOVEROPTIONS
//x "%BUILDDIR%\%CURRENTDLL%Coverage.Xml" //l "%BUILDDIR%\%CURRENTDLL%Coverage.Log"

Set Variable TESTDLLS
"%ROOTDIR%\bin\%CURRENTDLL%Tests\%CURRENTDLL%.Tests.dll"

Set Variable NUNITOPTIONS
%TESTDLLS% /xml:"%BUILDDIR%\%CURRENTDLL%TestResult.xml" /noshadow

Set Variable PARAMETERS
-first -link NCover %NCOVER% %NCOVEROPTIONS% %NUNIT% %NUNITOPTIONS%

Execute Program
%TMOCKRUNNER% %PARAMETERS%
Wait For Completion, Log Output, and Hide Window are checked. I do not check the exit code; the second try block handles failures.

The first block sets up all of our variables and launches TMockRunner with all of our parameters. The second tries to read the test results into the TESTFAILURES variable. If TMockRunner failed, the results file will not exist and the catch block appends an error message to indicate this. If there are test failures, we read the failed test cases into a temporary error variable using XPath. This temporary variable is appended to the error message variable, which is emailed to our team, so we can see exactly which tests failed directly from the email.
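For illustration, here is roughly what that XPath step does, written as a standalone C# snippet rather than the actual FinalBuilder actions. The result file name is hypothetical, and I am assuming the standard NUnit 2.x result schema, where each failed test shows up as a test-case element with success="False".

using System.Text;
using System.Xml;

class FailedTestReport {
    static string GetFailures(string resultFile) {
        // resultFile would be something like the expanded %BUILDDIR%\CoreTestResult.xml path.
        XmlDocument doc = new XmlDocument();
        doc.Load(resultFile);

        StringBuilder errors = new StringBuilder();
        // Every failed NUnit test case is reported with success="False".
        foreach (XmlNode testCase in doc.SelectNodes("//test-case[@success='False']")) {
            errors.AppendLine(testCase.Attributes["name"].Value);
        }
        return errors.ToString();
    }
}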

That is basically it. Our CI server runs the FinalBuilder script with a few extra options and processes the NCover output to give us a nice web page.

Saturday, August 18, 2007

Parallel Functionals

I have used the loop tiling code with scheduling algorithms to parallelize Dustin Campbell's code from his post A Higher Calling. The parallelized functionals can be found here; they use the scheduling algorithm code from the last post. The parallel reduction is actually based on Joe Duffy's serial reduction.
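To give a rough idea of the shape of that code (this is a sketch, not the linked source), a parallel Map built on top of the tiled For loop from the scheduling algorithms post might look like the following. The ParallelLoops class name and a publicly visible, index-bound For overload are assumptions on my part.

using System;
using System.Collections.Generic;

public static class ParallelFunctionals {
    // Sketch only: each index is independent, so the tiled For loop can hand
    // disjoint ranges of the input to worker threads via the chosen scheduling algorithm.
    public static TOutput[] Map<TInput, TOutput>(IList<TInput> input, Converter<TInput, TOutput> mapping) {
        TOutput[] output = new TOutput[input.Count];
        ParallelLoops.For<TInput>(null, 0, input.Count, null,
            delegate(int i) { output[i] = mapping(input[i]); });
        return output;
    }
}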

Scheduling Algorithms

I have uploaded my current version of the scheduling algorithms code that I referenced in my previous post. You can view the file here. I no longer have access to a multi-core/multi-processor machine, so I cannot benchmark them properly right now.

Sunday, August 12, 2007

Correctness of Parallel Code

One of the big problems in writing parallel applications is code correctness. How do we know that our parallel implementations are correct? While multi-core correctness has been researched and documented a great deal, correctness for clustering code appears to still be in its infancy. Writing, debugging, and verifying the correctness of your applications remains incredibly difficult. Now, I am not talking about grid computing or web services. I am specifically referring to MPI and the like doing high performance computing.

When debugging a parallel application, a user must set breakpoints, attach to multiple processes, and then try to step through the applications in parallel. This becomes even more unwieldy if you are also trying to write the library/service and test your applications. I am not a fan of debugging, and parallel debugging just makes my skin crawl. So, what can one do?

When designing the implementation of SPPM I wanted to approach it in a way that would allow flexibility never seen in a parallel programming system. Starting at the highest level, I abstracted the service to a well-defined interface. Given this interface, I can implement several versions of my service that can be swapped out at runtime. But why would I want multiple versions? Using a framework like Castle, I can specify implementations of my service in a configuration file for the WindsorContainer to resolve, so depending on how I want to test, I can change my service implementation. I have two implementations right now: the real service and a local virtual service. With the real service, library calls talk to an external service that runs on every node in the cluster. The local virtual service is used when you want to test a master and worker application without needing a cluster or external service. From the applications' point of view nothing changes: they make the same library calls, but the service implementation has changed, allowing the programmer to focus on their application and not its dependencies.
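As a rough sketch of what that looks like (the interface, class, and file names here are made up, not the real SPPM types, and the Windsor API details are from memory of the versions of that era), the container is loaded from an XML file and asked for whichever implementation the configuration names:

using Castle.Windsor;
using Castle.Windsor.Configuration.Interpreters;

// Hypothetical service interface with the Get/Put tuple calls used later in this post.
public interface IClusterService {
    object Get(string tupleName);
    void Put(string tupleName, object value);
}

public static class ServiceLocator {
    public static IClusterService GetService() {
        // castle.config would hold a <component> entry whose type attribute points at
        // either the real service proxy or the local virtual service, for example:
        //   <component id="cluster.service"
        //              service="MyLib.IClusterService, MyLib"
        //              type="MyLib.LocalVirtualService, MyLib" />
        IWindsorContainer container = new WindsorContainer(new XmlInterpreter("castle.config"));
        return (IClusterService) container.Resolve(typeof(IClusterService));
    }
}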

Another set of work dealt with the library implementation. All calls were implemented with a composite pattern partnering the command pattern and the template method pattern. All commands request a pointer from a communication interface factory, which returns a pointer to some kind of network transmission class. This class is switched out when the service implementation changes - it may be a TCP socket communicator, a remoting communicator, or just a pointer to an object that routes calls into the local virtual service.
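A stripped-down sketch of that arrangement (again with illustrative names, not the real ones) might look like this; the point is that a command never knows which transport it has been handed.

// Illustrative shapes only.
public interface ICommunicator {
    object Send(string operation, params object[] args);
}

public interface ICommunicatorFactory {
    // May hand back a TCP socket communicator, a remoting communicator,
    // or an object that routes calls straight into the local virtual service.
    ICommunicator Create();
}

public class GetCommand {
    private readonly ICommunicatorFactory factory;

    public GetCommand(ICommunicatorFactory factory) {
        this.factory = factory;
    }

    public object Execute(string tupleName) {
        ICommunicator channel = factory.Create();
        return channel.Send("Get", tupleName);
    }
}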

Up to this point, all of these features have been achieved with design patterns and object containers. To push things one step further, I had to solve a problem that has really bothered me. Why must I actually execute all of my parallel applications together in order to test them? Why must I run a service, master, and worker just to verify that the simplest case works? I have already shown how to test using just the master and worker by abstracting the service, so what about verifying the correctness of the master and worker independently of one another? Impossible? Why? External dependencies? What about the service layer? Is it at all possible to apply TDD to HPC and clusters? Of course, but it is far from obvious.

Aspect-oriented programming will allow us to simulate our dependencies without changing our code. The TypeMock web site does a good job at illustrating this. Our problem breaks down to the need to simulate external dependencies. Write unit tests for your parallel applications, but where external library calls would be made, like a parallel library which talks to the cluster, mock the calls! Below is a small sample trying to mock the worker of a parallel matrix multiplication application. The worker application is supposed to get the B matrix, and a series of rows Ai from A, and compute a sub-result Ci which is sent back to the master. The Check method will be called when the Put call is made and intercepted, and the Ci matrix will be verified for correct contents. The term tuple tells the worker application to exit when it tries to get the next piece of A.

[Test]
public void WorkerTest() {
    // We use this call to tell TypeMock that the next creation of
    // the InjectionLibrary needs to be faked. TypeMock will replace
    // the instance that should have been created with its own
    // empty implementation. The mocks we set up below tell
    // TypeMock that we are going to have calls made by another component
    // and we need to insert fake results.
    Mock mock = MockManager.Mock(typeof(InjectionLibrary));

    // Fake the B call: when the worker application calls
    // Slm.Get("B"), hand back our B matrix.
    mock.ExpectAndReturn("Get", 0).Args(new Assign("B"), new Assign(B));

    // Fake the Ai call.
    string tupleName = String.Format("A0:A{0}", N - 1);
    mock.ExpectAndReturn("Get", 0).Args(new Assign(tupleName), new Assign(A));

    // Grab the put of Ci and validate the result.
    // We tell TypeMock to call the Check method below to
    // validate the argument (the Ci matrix).
    mock.ExpectCall("Put").Args(Check.CustomChecker(new ParameterCheckerEx(Check)));

    // Fake the term tuple.
    mock.ExpectAndReturn("Get", 0).Args(new Assign("A0:A0"), new Assign(term));

    // This next line is mocked; the library instance is static
    // and the faked instance is shared.
    InjectionLibrary Slm = InjectionLibrary.Instance;

    // Call our worker's main method.
    // All of the calls that were mocked above will
    // have their calls hijacked.
    OCMMWorker.Main(null);
}


Here we have referenced a .NET application like an external DLL and invoked its Main method. Everything is run from the comfort of your TestFixture and you have verified one case in your worker application without the need of a cluster, master, or external/virtual service. Using these methods, we can write parallel cluster code while minimizing our need to debug. Why debug when you know it works?

Saturday, August 11, 2007

Overhead of Parallel Applications and Scheduling Algorithms

When trying to measure the speedup of a parallelized algorithm, many people only calculate the ratio of serial to parallel running time (Sp = Tseq / Tpar). The effective speedup can be approximated more closely by breaking Tpar down with the following equation:

Tpar = Tcomp + Tcomm + Tmem + Tsync
Tpar is the total parallel running time.
Tcomp is the time spent in parallel computation on all machines/processors.
Tcomm is the time spent communicating between machines/processors.
Tmem is the time spent in main memory or on disk.
Tsync is the time required to synchronize.

All of these are easily quantifiable with the exception of Tsync. In an ideal world Tsync is zero, but this is never the case. A single slow processor/machine/data link can make a huge difference with Tsync. So, how can we manage it?

The most straightforward way to try to minimize Tsync is to apply a fixed chunk load; MPI applications commonly take this approach. This is fine unless you have a heterogeneous cluster, or other applications are running on the processors/machines being used.

An extension of fixed chunking is to create a normalized value for each of your computing resources and then let each resource take jobs whose size is comparable to its ability. Put another way, say we have three machines whose computing powers are A:1, B:2, C:3 (C is three times as powerful as A). If we have twelve pieces of work to distribute, we can group them into six chunks of two (total / sum of power). In the time C computes three of the chunks, B will have done two and A will have done one, so the entire job should wrap up with minimal Tsync. This kind of load balancing works very well for heterogeneous clusters.
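A quick sketch of the arithmetic for that example (the power ratings and item count are just the hypothetical A:1, B:2, C:3 scenario above):

using System;

class NormalizedChunking {
    static void Main() {
        int[] powers = { 1, 2, 3 };   // machines A, B, C
        int totalItems = 12;

        int powerSum = 0;
        foreach (int p in powers) powerSum += p;

        int chunkSize = totalItems / powerSum;    // 12 / 6 = 2 items per chunk
        int chunkCount = totalItems / chunkSize;  // 6 chunks sitting in the work queue

        // If every machine pulls chunks at its own pace, the shares that let them all
        // finish at about the same time are proportional to their power ratings.
        for (int i = 0; i < powers.Length; i++) {
            int expectedChunks = chunkCount * powers[i] / powerSum;   // A:1, B:2, C:3
            Console.WriteLine("Machine {0}: {1} chunks ({2} items)",
                (char)('A' + i), expectedChunks, expectedChunks * chunkSize);
        }
    }
}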

Other scheduling methods have been tried in clusters, factoring and guided self-scheduling being two, but they seem to add too much Tcomm overhead. There has to be a balance between keeping communication chunky and minimizing Tsync. They may, however, work as valid scheduling algorithms for multi-core processing, where Tcomm is greatly reduced compared to cluster computing.

Working again from Joe Duffy's MSDN article "Reusable Parallel Data Structures and Algorithms", I have reworked the loop tiling code to support scheduling algorithms.

[ThreadStatic]
private static int partitions = Environment.ProcessorCount;

[ThreadStatic]
private static SchedulingAlgorithms algorithm = SchedulingAlgorithms.Factoring;

[ThreadStatic]
private static double factor = 0.5;

// Assumed field: the minimum chunk size handed to the Factoring algorithm.
[ThreadStatic]
private static int factoringThreshold = 1;


private static void For<TInput>( IList<TInput> data, int from, int to,
                                 Action<TInput> dataBoundClause, Action<int> indexBoundClause )
{
    Debug.Assert( from < to );
    Debug.Assert( ( dataBoundClause != null ) || ( indexBoundClause != null ) );
    Debug.Assert( ( data != null && dataBoundClause != null ) ||
                  ( data == null && dataBoundClause == null ) );

    int size = to - from;
    int offset = 0;
    List<int> gp = GetPartitionList( size );

    int parts = gp.Count;
    CountdownLatch latch = new CountdownLatch( parts );
    for (int i = 0; i < parts; i++) {
        int start = offset + from;
        int partitionSize = gp[0];
        gp.RemoveAt( 0 );
        int end = Math.Min( to, start + partitionSize );
        offset += partitionSize;
        ThreadPool.QueueUserWorkItem( delegate
        {
            for (int j = start; j < end; j++) {
                if ( data != null ) {
                    dataBoundClause( data[j] );
                }
                else {
                    indexBoundClause( j );
                }
            }

            latch.Signal();
        } );
    }
    latch.Wait();
}

private static List<int> GetPartitionList( int size ) {
    ISchedulingAlgorithm schedulingAlgorithm = null;

    switch ( algorithm ) {
        case SchedulingAlgorithms.FixedChunking:
            schedulingAlgorithm = new FixedChunking( size, Math.Min( size, partitions ) );
            break;
        case SchedulingAlgorithms.GuidedSelfScheduling:
            schedulingAlgorithm = new GuidedSelfScheduling( size );
            break;
        case SchedulingAlgorithms.Factoring:
            schedulingAlgorithm = new Factoring( size, factor, factoringThreshold );
            break;
    }

    Debug.Assert( schedulingAlgorithm != null );
    return schedulingAlgorithm.GetPartitionSizes();
}



With this setup, all you have to do is set the scheduling algorithm you would like to use along with its parameters. The variables are thread static so that many threads can each have their own algorithm, but since the For loop blocks, you only need one copy per thread.
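Here is a hedged usage sketch. In the listing above the fields and the For method are private, so this assumes they are exposed publicly on the same class, which I am calling ParallelLoops purely for illustration, with Algorithm and Factor properties wrapping the thread-static fields.

using System;

class Example {
    static void Main() {
        double[] values = new double[100000];

        // Pick the scheduling algorithm (and its tuning knob) for this thread.
        ParallelLoops.Algorithm = SchedulingAlgorithms.Factoring;
        ParallelLoops.Factor = 0.5;

        // Index-bound form: data and dataBoundClause stay null, the index clause does the work.
        ParallelLoops.For<double>(null, 0, values.Length, null,
            delegate(int i) { values[i] = Math.Sqrt(i); });
    }
}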


It will take a little bit of tweaking for your application's algorithm, but these should produce reasonable results. My next goal is to get them benchmarked under different scenarios.

Follow Up: Frustration Caused by Amdahl's 'Law'

Michael Suess from Thinking Parallel was kind enough to point me to a paper he coauthored, "Implementing Irregular Parallel Algorithms with OpenMP". In it he describes a way to implement, using OpenMP, the early thread synchronization that I mentioned. If you are interested in more of his papers, here is his publications page.

RE: Why are int[,], double[,], single[,], et alia so slow?

I actually wrote the previous post in October of 2006. Since then, someone was nice enough to look up the reasons for me. Wesner Moise posted this article on CodeProject that I think explains a lot.

Wednesday, August 8, 2007

Jamie Cansdale - TestDriven.Net - Silently Posts a Resolution with Microsoft

From the 2.8.2130 release notes

Release Notes - TestDriven.NET: 2.8

993: Retire Visual Studio Express support

Joint statement: "Jamie Cansdale and Microsoft Corporation have agreed to concentrate on working together on future releases of TestDriven.Net for Microsoft's officially extensible editions of Visual Studio, as opposed to spending time litigating their differences."

Saturday, August 4, 2007

Frustration Caused by Amdahl's 'Law'

I was looking over some OpenMP resources and I found this line on the Wikipedia page:

A large portion of the program may not be parallelized by OpenMP, which sets theoretical upper limit of speedup according to Amdahl's law.

I will leave my rant concerning Amdahl's 'law' for another post, but I feel like I need to point something out. According to Amdahl's law, linear speedup is the theoretical maximum (using N processors, the most you can hope for is N). While some have argued that caching effects allow for a superlinear speedup, it is still only S * N, where S is a small multiplier.

Wouldn't you like to put the 'super' back into superlinear? Amdahl says you can't; I say you can. Crack open that MIT bible on algorithms (I know you have, or had, it) and find the section on decision problems. If you no longer have the book, take a quick look at Wikipedia.

When you run code in parallel, the worst case is still near-linear speedup. But what happens when you partition a decision problem? Partition the data into two pieces, one for each processor. If there is no solution, both processors run exhaustively and you have the worst case. Now, let there be a solution to your decision problem. We have a few scenarios:

The solution is:

  1. At the end of the first chunk: no speedup over the non-parallel version.
  2. At the end of the second chunk: near-linear speedup.
  3. At the beginning of the first chunk: no speedup over the non-parallel version (both find it almost immediately).
  4. Anywhere in the second chunk other than its end: linear to truly superlinear speedup!

As the solution moves toward the beginning of the second chunk, you can achieve amazing results. Working on Synergy and SPPM, I have seen a speedup of over 2000 using two machines on the 0/1 knapsack problem!
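To make the effect concrete, here is a minimal two-thread sketch (nothing like the Synergy/SPPM code) of the partitioned decision search described above. A shared flag lets whichever worker finds the answer stop the other early.

using System;
using System.Threading;

class PartitionedSearch {
    private static volatile bool found;

    public static bool Search(int[] data, int target) {
        found = false;
        int mid = data.Length / 2;

        // The second worker takes the second chunk while this thread scans the first.
        Thread second = new Thread(delegate() { Scan(data, mid, data.Length, target); });
        second.Start();
        Scan(data, 0, mid, target);
        second.Join();
        return found;
    }

    private static void Scan(int[] data, int from, int to, int target) {
        for (int i = from; i < to && !found; i++) {
            if (data[i] == target) {
                found = true;   // signals the other worker to bail out early
                return;
            }
        }
    }
}

If the target sits ten elements into the second chunk, the sequential scan does data.Length / 2 + 10 comparisons while the parallel version does about 10, which is where the superlinear numbers come from.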

I still need to look into OpenMP some more to see if I can get it to short-circuit the thread synchronization, but a truly superlinear-capable OpenMP application would be great. I made a number of modifications to the code from Joe Duffy's MSDN article "Reusable Parallel Data Structures and Algorithms" to implement scheduling algorithms for loop tiling. I have fixed chunking, guided self-scheduling, and factoring algorithms implemented. I am also hoping to add a synchronization scheme that lets one thread tell the rest to stop when a solution is found. Once complete, I should have a load-balanced multi-core loop tiling library capable of superlinear speedup.

Thursday, August 2, 2007

Castle References

I have been following BitterCoder's series of Castle tutorials for a while, but I just stumbled upon his container tutorials wiki. I think that there is a lot of good information there for someone new to Castle.

Another Way to Optimize Code

In a previous post I showed the performance of several matrix multiplication implementations. As impressive as the JIT compiler is, I had an inkling that the C++/CLI compiler could do better. I compiled the jagged array version (type[N][N]) in C# with optimization enabled. Here is the original source:

public static TimeSpan Multiply(int N) {
    double[][] C = new double[N][];
    double[][] A = new double[N][];
    double[][] B = new double[N][];
    int i, j, k;

    for (i = 0; i < N; i++) {
        C[i] = new double[N];
        B[i] = new double[N];
        A[i] = new double[N];

        for (j = 0; j < N; j++) {
            C[i][j] = 0;
            B[i][j] = i*j;
            A[i][j] = i*j;
        }
    }

    DateTime now = DateTime.Now;

    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            for (j = 0; j < N; j++) {
                C[i][j] += A[i][k]*B[k][j];
            }
        }
    }

    return DateTime.Now - now;
}

Using Reflector on the DLL, the only thing the compiler does is move the declaration of k into the for statement of the middle loop. I used the C++/CLI plugin for Reflector to get the code into C++. With a little modification we get this code:

static TimeSpan Multiply(int N)
{
    int i;
    int j;
    array<array<double>^>^ C = gcnew array<array<double>^>(N);
    array<array<double>^>^ A = gcnew array<array<double>^>(N);
    array<array<double>^>^ B = gcnew array<array<double>^>(N);
    for (i = 0; (i < N); i++)
    {
        C[i] = gcnew array<double>(N);
        B[i] = gcnew array<double>(N);
        A[i] = gcnew array<double>(N);
        for (j = 0; (j < N); j++)
        {
            C[i][j] = 0;
            B[i][j] = (i * j);
            A[i][j] = (i * j);
        }
    }
    DateTime now = DateTime::Now;
    for (i = 0; (i < N); i++)
    {
        for (int k = 0; (k < N); k++)
        {
            for (j = 0; (j < N); j++)
            {
                C[i][j] = (C[i][j] + (A[i][k] * B[k][j]));
            }
        }
    }
    return ((TimeSpan) (DateTime::Now - now));
}

Compiling this code with the following options:

/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /FD /EHa /MD /Yu"stdafx.h" /Fp"Release\CppDemo.pch" /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Zi /clr /TP /errorReport:prompt /FU "c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\System.dll"


We get an optimized matrix multiply. Reflecting the exe back into C#, we get our new code, which is quite a bit faster:

public static TimeSpan Multiply(int N) {
    double[][] C = new double[N][];
    double[][] A = new double[N][];
    double[][] B = new double[N][];
    int index = 0;
    if (0 < N)
    {
        do
        {
            C[index] = new double[N];
            B[index] = new double[N];
            A[index] = new double[N];
            int column = 0;
            int value = 0;
            do
            {
                C[index][column] = 0;
                double num7 = value;
                B[index][column] = num7;
                A[index][column] = num7;
                column++;
                value = index + value;
            }
            while (column < N);
            index++;
        }
        while (index < N);
    }
    DateTime now = DateTime.Now;
    int i = 0;
    if (0 < N)
    {
        do
        {
            int k = 0;
            do
            {
                int j = 0;
                do
                {
                    double[] Ci = C[i];
                    Ci[j] = (A[i][k] * B[k][j]) + Ci[j];
                    j++;
                }
                while (j < N);
                k++;
            }
            while (k < N);
            i++;
        }
        while (i < N);
    }
    return (TimeSpan) (DateTime.Now - now);
}



This optimized code averages 182 MOPS on my machine compared to the original C# which only achieved 118 MOPS. This is a lot of work, but the speedup is amazing. I will try to put up some IL later to figure out exactly why the last version is so much faster.