Tuesday, December 11, 2007

Visual Studio 2008 InstallFest

I will be attending the Visual Studio 2008 InstallFest in Spokane Valley on Wednesday the 12th with a few other developers from PLAYXPERT. It will be nice to meet some other .NET coders in the Spokane area. Hopefully something better gets going soon (stronger SIGs). I was really looking forward to attending the Agile Philly and Philly ALT.NET group meetings before the move to Spokane.

Settling In

Well, I have been in Spokane for a week now and I still feel like I am just settling in. My car finally shipped (thanks to MN for helping) and should arrive this week. The contents of my apartment may not be here until Christmas, so it will be a little rough until then. The weather the last two weekends has been as bad as the worst I saw in Philadelphia (and it is only December). I wish I could find reliable information on which winter tires are best.

Monday, November 12, 2007

SC07: It Begins

All around you will find Intel trying to make waves

Celebrating the 45nm Launch - 45 steps to Technology Leadership

OK, so they made 45nm. With the physical limit only a few nanometers away (from what I understand, 30nm is the limit), what are they going to shock and awe us with next?

Everyone is putting the final touches on their booths while eagerly awaiting the food, which arrives in just five minutes. After an hour of eating, we will have a half-hour break before the VIPs get their privileged two-hour private run of the show floor. From what I hear, D-Wave is claiming to have the first quantum computer on display here, but we will have to see about that.

Saturday, November 10, 2007

News

It has been too long since my last post, for which there are a number of reasons.

  1. Super Computing 2007
  2. Vacation Preparation
  3. New Job
  4. Moving

I head to Reno, Nevada early tomorrow morning for SC07. I have been trying to get my research ready for presentation. Hopefully this year will be as fun and interesting as last year.

The day after I get back from Reno, my girlfriend and I are flying to Aruba for ten days.

There is a lot of preparation involved when I will essentially be gone for fifteen days. Still, this should be a nice break compared to the bedlam to come.

I resigned. When I get back from Aruba, I will work my last week as a Software Developer for NAVTEQ. Then I will move 2529 miles from Philadelphia to Spokane, WA to take the role of Software Quality Assurance Director for PlayXPert.

I will be doing software architecture, software development, developer management, project management, system administration, continuous integration, QA management, and other tasks.

Along with the new job comes the move. I work for one more week, during which I need to pack every morning and night (since I will be gone for the two weeks prior), which leaves no room for anything else. I start Monday, December 3rd.

This comes as a stumbling block for my Ensurance project on CodePlex. It is stable, fully featured, and usable, but I will not be able to work on it for a month, which really pains me. The only pieces missing are distribution tasks such as strong naming and finishing internationalization. I was really looking forward to implementing the full set of fluent parameter attributes, but only a minimal implementation exists right now.

I will try to post from SC07 with anything that I find.

Thursday, October 11, 2007

New Ensurance Release

Yesterday I released version 0.4 on CodePlex. All of the functionality for the 1.0 release is finished. I am trying to wrap up all of the tasks needed for a distribution, and when that is done I will mark the release as 1.0. I am still looking for functionality ideas.

Friday, October 5, 2007

Long time, no posts, new OS project Ensurance

I have been incredibly busy lately and thus there have been no posts. I have finally published the first release of my project Ensurance on CodePlex. It is based on the NUnit 2.4.3 framework and is intended to be used for production code to verify particular constraints in an application. Check out the CodePlex page for more information. I will be posting sample code soon.

Sunday, September 2, 2007

Portable 'supercomputer'

The story of the Calvin College portable supercomputer really irks me. It is getting a lot of attention for doing nothing. You can see it on Engadget, Slashdot, and others. Anyone can build a cluster like the one they have. The only credit I will give them is that not everyone would go through the benchmarking and analysis that they did. On the other hand, these people are still on the same old trail of HPC. They ignore the entire issue of programming -- they used HPL code. Ask if they can write their own code for the Microwulf. What kind of real application performance can they achieve? The Engadget article says "Talk about a solid price-to-performance ratio." - what a load of crap. How about asking real questions and reporting real HPC news?

Ninject Dependency Injection Framework

I have been playing with Ninject for a couple of days now and I am having a lot of fun. One feature I have been really digging is Contextual Binding. The kernel allows multiple modules to be loaded that define the bindings of your types. So depending on which modules and bindings are set, I can change how other bindings resolve. Here is an example:

Say I want to resolve a different type of car depending on the selected engine. I have an interface IEngine and two classes, V4 and V6, that implement IEngine. If V4 is resolved by the kernel then the car returned will be a Civic; if V6 is resolved, an Accord. Here is the code in custom modules:

internal class EngineModule : StandardModule
{
    public override void Load()
    {
        Bind<IEngine>().To( typeof ( V4 ) );
    }
}


internal class CarModule : StandardModule
{
    public override void Load()
    {
        IContext context = new StandardContext( Kernel, typeof ( IEngine ) );
        IBinding binding = Kernel.GetBinding<IEngine>( context );
        Type implementationType = binding.Provider.GetImplementationType( context, false );

        Bind<ICar>().To( typeof ( Accord ) ).OnlyIf(
            delegate
            {
                if ( implementationType == typeof ( V6 ) )
                {
                    return true;
                }

                return false;
            } );

        Bind<ICar>().To( typeof ( Civic ) ).OnlyIf(
            delegate
            {
                if ( implementationType == typeof ( V4 ) )
                {
                    return true;
                }

                return false;
            } );
    }
}



I can then load the modules into my kernel and use it to resolve cars:

internal class Program
{
    private static void Main( string[] args )
    {
        IModule[] modules = new IModule[] { new SettingsModule(), new EngineModule(), new CarModule() };
        IKernel kernel = new StandardKernel( modules );

        ICar car = kernel.Get<ICar>();
        Console.WriteLine( "I just got a brand new {0} {1}.", car.Manufacturer, car.Model );
        Console.WriteLine( "It has a {0} cylinder engine with {1} horsepower and {2} ft-lb torque.", car.Engine.Cylinders, car.Engine.Horsepower, car.Engine.Torque );
        Console.WriteLine( "Time to start the Engine!" );
        car.Engine.Start();
        Console.WriteLine( "Sounds good, but I am going to turn it off for now." );
        car.Engine.Stop();
        Console.ReadLine();
    }
}



Nate Kohari just wrote a first cut user guide that is very helpful when getting started.

Monday, August 27, 2007

Running Mocked Unit Tests in FinalBuilder With Coverage

I just read an article by Craig Murphy on running unit tests in FinalBuilder. I liked the article and it gives some good pointers. I wish he had written it a while ago; it would have been helpful to me. We use TypeMock, which does not allow us to run our unit tests quite as simply. When I wrote our build scripts I took the same general approach he did.

Starting in the Main ActionList, we defined a few variables that give the paths to the applications we use. I also created an ActionList called RunTest, which I describe in detail below. For each DLL being tested, a call to RunTest is made, passing in the name of the DLL to be tested.

In each call we set the CURRENTDLL ActionList Parameter with the name of the DLL (Animation,Audio,Core,..).

NCOVEROPTIONS holds the command-line options for ncover.console.exe. The same follows for NUNITOPTIONS and TMOCKRUNNEROPTIONS.  The rest of the ActionList Parameters are used as temporary variables allowing me to put together very complex command argument strings and parse XML into variables.

The RunTest ActionList is separated into two try...catch blocks: the first runs the tests and the second processes the results.

 

Here is what each command is doing:

Set Variable NCOVEROPTIONS
//x "%BUILDDIR%\%CURRENTDLL%Coverage.Xml" //l "%BUILDDIR%\%CURRENTDLL%Coverage.Log"

Set Variable TESTDLLS
"%ROOTDIR%\bin\%CURRENTDLL%Tests\%CURRENTDLL%.Tests.dll"

Set Variable NUNITOPTIONS
%TESTDLLS% /xml:"%BUILDDIR%\%CURRENTDLL%TestResult.xml" /noshadow

Set Variable PARAMETERS
-first -link NCover %NCOVER% %NCOVEROPTIONS% %NUNIT% %NUNITOPTIONS%

Execute Program
%TMOCKRUNNER% %PARAMETERS%
- Wait For Completion, Log Output, and Hide Window are checked. I do not check the exit code; the second try block handles it.

The first block sets up all of our variables and launches the TMockRunner application with all of our parameters. The second block tries to read the test results into the TESTFAILURES variable. If TMockRunner failed, the results file will not exist and the Catch block will append our error message to indicate this. If there are test failures, we read the failed test cases into a temporary error variable using XPath. This temporary variable is appended to the error message variable, which is sent via email to our team. This lets us find the exact tests that failed directly from our email.
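The XPath work happens inside FinalBuilder in our script, but for illustration, here is roughly the same extraction in C#. This is a sketch assuming the NUnit 2.x result schema, where each test-case element carries a name and a success attribute; the file name is a placeholder.

using System;
using System.Text;
using System.Xml;

internal static class TestResultParser
{
    // Pulls the names of failed test cases out of an NUnit 2.x result file.
    private static string GetFailedTests( string resultFile )
    {
        XmlDocument doc = new XmlDocument();
        doc.Load( resultFile );

        StringBuilder failures = new StringBuilder();
        foreach ( XmlNode testCase in doc.SelectNodes( "//test-case[@success='False']" ) )
        {
            failures.AppendLine( testCase.Attributes["name"].Value );
        }
        return failures.ToString();
    }

    private static void Main()
    {
        // CoreTestResult.xml is a placeholder name for this sketch.
        Console.WriteLine( GetFailedTests( "CoreTestResult.xml" ) );
    }
}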

That is basically it. Our CI server runs the FinalBuilder script with a few extra options and processes the NCover output to give us a nice web page.

Saturday, August 18, 2007

Parallel Functionals

I have used the loop tiling code with scheduling algorithms to parallelize Dustin Campbell's code from his post A Higher Calling. The parallelized functionals can be found here. It uses the scheduling algorithm code from the last post. The parallel reduction is actually based on Joe Duffy's serial reduction.

Scheduling Algorithms

I have uploaded my current version of the scheduling algorithms code that I referenced in my previous post. You can view the file here. I no longer have access to a multi-core/multi-processor machine, so I cannot benchmark them properly right now.

Sunday, August 12, 2007

Correctness of Parallel Code

One of the big problems in writing parallel applications is code correctness. How do we know that our parallel implementations are correct? While multi-core correctness has been researched and documented a great deal, correctness for clustering code appears to still be in its infancy. Writing, debugging, and verifying the correctness of your applications remains incredibly difficult. Now, I am not talking about grid computing or web services. I am specifically referring to MPI and the like doing high performance computing.

When debugging a parallel application, a user must set breakpoints, attach to multiple processes, and then try to step through the applications in parallel. This becomes even more unwieldy if you are also trying to write the library/service and test your applications. I am not a fan of debugging, and parallel debugging just makes my skin crawl. So, what can one do?

When designing the implementation of SPPM I wanted to approach it in a way that would allow flexibility never seen in a parallel programming system. Starting at the highest level, I abstracted the service to a well defined interface. Given this interface, I can actually implement several versions of my service that can be swapped out at runtime. But why would I want multiple versions? Using a framework like Castle, I can specify implementations of my service in a configuration file for the WindsorContainer to resolve. Depending on how I want to test, I can change my service implementation. I have two implementations right now: the real service, and a local virtual service. The real service tells my library that when calls are made, try to talk to an external service which runs on every node in the cluster. The local virtual service is used when you want to test a master and worker application without needing a cluster or external service. From the applications' point of view nothing changes. They make the same library calls, but the service implementation has changed allowing the programmer to focus on their application and not its dependencies.
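As a rough sketch of the idea (the interface and class names below are mine for illustration, not SPPM's actual API), the wiring looks something like this with Windsor:

using System;
using System.Collections.Generic;
using Castle.Windsor;

// The well-defined service interface that the library codes against.
public interface IClusterService
{
    void Put( string name, object value );
    object Get( string name );
}

// Would talk to the external service running on every node in the cluster.
public class RealClusterService : IClusterService
{
    public void Put( string name, object value ) { /* send over the network */ }
    public object Get( string name ) { /* fetch over the network */ return null; }
}

// Runs entirely in-process so a master or worker can be tested without
// a cluster or an external service.
public class LocalVirtualClusterService : IClusterService
{
    private readonly Dictionary<string, object> store = new Dictionary<string, object>();
    public void Put( string name, object value ) { store[name] = value; }
    public object Get( string name ) { return store[name]; }
}

internal class ServiceSwapDemo
{
    private static void Main()
    {
        // In the real setup the implementation is named in the Windsor
        // configuration file, not in code; registering it here just keeps
        // the sketch self-contained.
        WindsorContainer container = new WindsorContainer();
        container.AddComponent( "cluster.service", typeof (IClusterService), typeof (LocalVirtualClusterService) );

        IClusterService service = (IClusterService) container.Resolve( typeof (IClusterService) );
        service.Put( "B", new double[4, 4] );
        Console.WriteLine( service.Get( "B" ) );
    }
}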

Another set of work dealt with the library implementation. All calls were implemented with a composite pattern partnering the command pattern and template method pattern. All commands request a pointer from a communication interface factory, which returns a pointer to some kind of network transmission class. This class is switched out when the service implementation changes - it may get a TCP socket communicator, a remoting communicator, or just a pointer to an object that routes calls into the local virtual service.

Up to this point, all of these features have been achieved with design patterns and object containers. To push things one step further, I had to solve a problem that has really bothered me. Why must I actually execute all of my parallel applications together in order to test them? Why must I run a service, master, and worker just to verify that the simplest case works? I have already shown how to test using just the master and worker by abstracting the service, so what about verifying the correctness of the master and worker independently of one another? Impossible? Why? External dependencies? What about the service layer? Is it at all possible to apply TDD to HPC and clusters? Of course, but it is far from obvious.

Aspect-oriented programming allows us to simulate our dependencies without changing our code. The TypeMock web site does a good job of illustrating this. Our problem breaks down to the need to simulate external dependencies. Write unit tests for your parallel applications, but where external library calls would be made, like a parallel library which talks to the cluster, mock the calls! Below is a small sample mocking the worker of a parallel matrix multiplication application. The worker application is supposed to get the B matrix and a series of rows Ai from A, then compute a sub-result Ci which is sent back to the master. The Check method will be called when the Put call is made and intercepted, and the Ci matrix will be verified for correct contents. The term tuple tells the worker application to exit when it tries to get the next piece of A.

[Test]
public void WorkerTest() {
    // We use this call to tell TypeMock that the next creation of
    // the InjectionLibrary needs to be faked. TypeMock will replace
    // the instance that should have been created with its own
    // empty implementation. The mocks we set up below tell
    // TypeMock that we are going to have calls made by another component
    // and we need to insert fake results.
    Mock mock = MockManager.Mock(typeof(InjectionLibrary));

    // Fake the B call. When the Worker application calls
    // Slm.Get("B")
    mock.ExpectAndReturn("Get", 0).Args(new Assign("B"), new Assign(B));

    // fake the Ai call
    string tupleName = String.Format("A0:A{0}", N - 1);
    mock.ExpectAndReturn("Get", 0).Args(new Assign(tupleName), new Assign(A));

    // Grab the put of Ci and validate the result.
    // We tell TypeMock to call the Check method below to
    // validate the argument (Ci matrix).
    mock.ExpectCall("Put").Args(Check.CustomChecker(new ParameterCheckerEx(Check)));

    // Fake the term tuple
    mock.ExpectAndReturn("Get", 0).Args(new Assign("A0:A0"), new Assign(term));

    // This next line is mocked; the library is static
    // and the faked instance is shared.
    InjectionLibrary Slm = InjectionLibrary.Instance;

    // Call our worker's main method.
    // All of the calls that were mocked above will
    // have their calls hijacked.
    OCMMWorker.Main(null);
}


Here we have referenced a .NET application like an external DLL and invoked its Main method. Everything is run from the comfort of your TestFixture, and you have verified one case in your worker application without the need for a cluster, master, or external/virtual service. Using these methods, we can write parallel cluster code while minimizing our need to debug. Why debug when you know it works?

Saturday, August 11, 2007

Overhead of Parallel Applications and Scheduling Algorithms

When we try to measure the speedup of a parallelized algorithm, many people only calculate the ratio of serial to parallel running time (Sp = Tseq/Tpar). The effective speedup can be approximated more closely with the following equation, which breaks Tpar down further:

Tpar = Tcomp + Tcomm + Tmem + Tsync

Tpar is the total parallel running time.
Tcomp is the time spent in parallel computation on all machines/processors.
Tcomm is the time spent communicating between machines/processors.
Tmem is the time spent in main memory or on disk.
Tsync is the time required to synchronize.

All of these are easily quantifiable with the exception of Tsync. In an ideal world Tsync would be zero, but that is never the case. A single slow processor, machine, or data link can make a huge difference in Tsync. So, how can we manage it?
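To make the effect concrete, here is a small worked example with made-up numbers: a job that runs in 100 seconds serially, with the computation split perfectly across four processors.

S_p = \frac{T_{seq}}{T_{comp} + T_{comm} + T_{mem} + T_{sync}}
    = \frac{100}{25 + 5 + 2 + 8} = 2.5
\qquad \text{versus the ideal} \qquad
\frac{100}{25} = 4.0

If a single straggler doubles Tsync to 16 seconds, the speedup falls to about 2.1 even though the computation itself is perfectly balanced.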

The most straightforward way to try to minimize Tsync is to apply a fixed chunk load. MPI uses this method. This is fine unless you have a heterogeneous cluster or other applications are running on the processors/machines being used.

An extension of fixed chunking is to create a normalized value for each of your computing resources and then let each resource take jobs whose size is proportional to its ability. Put another way, say we have three machines whose computing powers are A:1, B:2, C:3 (C is three times as powerful as A). If we have twelve pieces of work to distribute, we can group them into six groups of two (total / sum of power). In the time C computes three of the groups, B will have done two and A will have done one. The entire job should wrap up with minimal Tsync. This kind of work load balancing works very well for heterogeneous clusters.
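A minimal sketch of that idea (the class and method below are my own illustration, not part of any library): each machine is assigned a number of work pieces proportional to its normalized power.

using System;

internal static class WeightedChunking
{
    // Splits totalPieces units of work among machines in proportion to their
    // relative power: powers { 1, 2, 3 } and 12 pieces give 2, 4, and 6.
    private static int[] Distribute( int totalPieces, int[] powers )
    {
        int totalPower = 0;
        foreach ( int p in powers )
        {
            totalPower += p;
        }

        int[] assigned = new int[powers.Length];
        int remaining = totalPieces;
        for ( int i = 0; i < powers.Length; i++ )
        {
            // The last machine picks up any rounding leftovers.
            assigned[i] = ( i == powers.Length - 1 )
                              ? remaining
                              : ( totalPieces * powers[i] ) / totalPower;
            remaining -= assigned[i];
        }
        return assigned;
    }

    private static void Main()
    {
        foreach ( int pieces in Distribute( 12, new int[] { 1, 2, 3 } ) ) // A:1, B:2, C:3
        {
            Console.Write( pieces + " " );   // prints: 2 4 6
        }
        Console.WriteLine();
    }
}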

Other scheduling methods have been tried in clusters, factoring and guided self-scheduling being two, but they seem to add too much overhead to Tcomm. There has to be a balance between keeping communication chunky and minimizing Tsync. They may, however, work well as scheduling algorithms for multi-core processing, where Tcomm is greatly reduced compared to cluster computing.

Working again from Joe Duffy's MSDN article "Reusable Parallel Data Structures and Algorithms", I have reworked the loop tiling code to support scheduling algorithms.

[ThreadStatic]
private static int partitions = ProcessorCount;

[ThreadStatic]
private static SchedulingAlgorithms algorithm = SchedulingAlgorithms.Factoring;

[ThreadStatic]
private static double factor = 0.5;


private static void For<TInput>( IList<TInput> data, int from, int to, Action<TInput> dataBoundClause, Action<int> indexBoundClause )
{
    Debug.Assert( from < to );
    Debug.Assert( ( dataBoundClause != null ) || ( indexBoundClause != null ) );
    Debug.Assert( ( data != null && dataBoundClause != null ) ||
                  ( data == null && dataBoundClause == null ) );

    int size = to - from;
    int offset = 0;
    List<int> gp = GetPartitionList( size );

    int parts = gp.Count;
    CountdownLatch latch = new CountdownLatch( parts );
    for (int i = 0; i < parts; i++) {
        int start = offset + from;
        int partitionSize = gp[0];
        gp.RemoveAt( 0 );
        int end = Math.Min( to, start + partitionSize );
        offset += partitionSize;
        ThreadPool.QueueUserWorkItem( delegate
        {
            for (int j = start; j < end; j++) {
                if ( data != null ) {
                    dataBoundClause( data[j] );
                }
                else {
                    indexBoundClause( j );
                }
            }

            latch.Signal();
        } );
    }
    latch.Wait();
}

private static List<int> GetPartitionList( int size ) {
    ISchedulingAlgorithm schedulingAlgorithm = null;

    switch ( Algorithm ) {
        case SchedulingAlgorithms.FixedChunking:
            schedulingAlgorithm = new FixedChunking( size, Math.Min( size, partitions ) );
            break;
        case SchedulingAlgorithms.GuidedSelfScheduling:
            schedulingAlgorithm = new GuidedSelfScheduling( size );
            break;
        case SchedulingAlgorithms.Factoring:
            schedulingAlgorithm = new Factoring( size, factor, factoringThreshold );
            break;
    }

    Debug.Assert( schedulingAlgorithm != null );
    return schedulingAlgorithm.GetPartitionSizes();
}



With this setup, all you have to do is set the scheduling algorithm that you would like to use along with its parameters. The variables are thread static so that many threads can each use their own algorithm, but since the For loop is blocking you only need one copy per thread.


It will take a little bit of tweaking for your application's algorithm, but these should produce reasonable results. My next goal is to get them benchmarked under different scenarios.
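As a usage sketch only (it assumes the For<TInput> method and the Algorithm setting above are exposed on a public class, here called ParallelFor; the names are mine, not a published API):

using System;

internal class SchedulingDemo
{
    private static void Main()
    {
        // Hypothetical public surface over the code shown above.
        ParallelFor.Algorithm = SchedulingAlgorithms.GuidedSelfScheduling;  // pick the scheduler

        double[] results = new double[1000];
        ParallelFor.For<double>( null, 0, results.Length, null,
            delegate( int i )
            {
                results[i] = Math.Sqrt( i );  // index-bound clause, run in tiled chunks
            } );

        Console.WriteLine( results[999] );
    }
}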

Follow Up: Frustration Caused by Amdahl's 'Law'

Michael Suess from Thinking Parallel was kind enough to point me to a paper he coauthored "Implementing Irregular Parallel Algorithms with OpenMP". In it he goes on to describe a way to implement the early thread sync that I mentioned using OpenMP. If you are interested in more of his papers, here is his publications page.

RE: Why are int[,], double[,], single[,], et alia so slow?

I actually wrote the previous post in October of 2006. Since then, someone was nice enough to look up the reasons for me. Wesner Moise posted this article on CodeProject that I think explains a lot.

Wednesday, August 8, 2007

Jamie Cansdale - TestDriven.Net - Silently Posts a Resolution with Microsoft

From the 2.8.2130 release notes

Release Notes - TestDriven.NET: 2.8

993: Retire Visual Studio Express support

Joint statement: "Jamie Cansdale and Microsoft Corporation have agreed to concentrate on working together on future releases of TestDriven.Net for Microsoft's officially extensible editions of Visual Studio, as opposed to spending time litigating their differences."

Saturday, August 4, 2007

Frustration Caused by Amdahl's 'Law'

I was looking over some OpenMP resources and I found this line on the Wikipedia page:

A large portion of the program may not be parallelized by OpenMP, which sets theoretical upper limit of speedup according to Amdahl's law.

I will leave my rant concerning Amdahl's 'law' for another post, but I feel like I need to point something out. According to Amdahl's law, linear speedup is the theoretical maximum (using N processors, the best you can do is N). While some have argued that caching effects allow for superlinear speedup, that is still only S * N where S is a small multiplier.

Wouldn't you like to put the 'super' back into superlinear? But oh wait, Amdahl says no. I say you can do it. Crack open that MIT bible on algorithms (I know you have, or had, it) and find the section on decision problems. If you no longer have the book, take a quick look at Wikipedia.

When running code in parallel, the worst case is still near linear speedup. But what happens when you are partitioning? What if you partition the data into two pieces, one for each processor? If there is no solution, both processors will search exhaustively and you have the worst case. Now, let there be a solution to your decision problem. We have a few scenarios:

The solution is:

  1. At the end of the first chunk: no speedup over the non-parallel run.
  2. At the end of the second chunk: near linear speedup.
  3. At the beginning of the first chunk: no speedup over the non-parallel run.
  4. Anywhere in the second chunk other than its end: linear to truly superlinear speedup!

As the solution moves toward the beginning of the second chunk, you can achieve amazing results. Working on Synergy and SPPM, I have seen a speedup of over 2000 using two machines on the 0/1 knapsack problem!
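Here is a toy sketch of scenario four (my own illustration, not the Synergy or SPPM code): two threads each search half of the data, and a shared flag lets whichever thread finds the solution tell the other to stop early.

using System;
using System.Threading;

internal class PartitionedSearch
{
    private static volatile bool found;
    private static int answer = -1;

    // Each worker scans its own chunk and bails out as soon as any
    // worker has reported a hit.
    private static void Search( int[] data, int from, int to, Predicate<int> isSolution )
    {
        for ( int i = from; i < to && !found; i++ )
        {
            if ( isSolution( data[i] ) )
            {
                answer = data[i];
                found = true;     // tell the other worker to stop early
            }
        }
    }

    private static void Main()
    {
        int[] data = new int[2000000];
        Random random = new Random( 42 );
        for ( int i = 0; i < data.Length; i++ )
        {
            data[i] = random.Next( 1000000 );
        }
        data[data.Length / 2 + 10] = -1;   // plant the solution just inside chunk two

        Predicate<int> isSolution = delegate( int value ) { return value == -1; };

        int middle = data.Length / 2;
        Thread worker = new Thread( delegate() { Search( data, middle, data.Length, isSolution ); } );
        worker.Start();
        Search( data, 0, middle, isSolution );   // chunk one on the main thread
        worker.Join();

        Console.WriteLine( "Found {0}", answer );
        // A serial scan would have walked the entire first chunk before reaching
        // the solution; the second worker hits it almost immediately.
    }
}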

I still need to look into OpenMP some more to see if I can get it to short-circuit the thread synchronization, but a truly superlinear-capable OpenMP application would be great. I made a number of modifications to the code from Joe Duffy's MSDN article "Reusable Parallel Data Structures and Algorithms" to implement scheduling algorithms for loop tiling. I have fixed chunking, guided self-scheduling, and factoring algorithms implemented. I am also hoping to add a synchronization scheme to allow one thread to tell the rest to stop when a solution is found. Once complete, I should have a load-balanced multi-core loop tiling library capable of superlinear speedup.

Thursday, August 2, 2007

Castle References

I have been following BitterCoder's series of Castle tutorials for a while, but I just stumbled upon his container tutorials wiki. I think that there is a lot of good information there for someone new to Castle.

Another Way to Optimize Code

In a previous post I showed the performance of several matrix multiplication implementations. As impressive as the JIT compiler is, I had an inkling that the C++/CLI compiler could do better. I compiled the jagged array version (type[N][N]) in C# with optimization enabled. Here is the original source:

public static TimeSpan Multiply(int N) {
    double[][] C = new double[N][];
    double[][] A = new double[N][];
    double[][] B = new double[N][];
    int i, j, k;

    for (i = 0; i < N; i++) {
        C[i] = new double[N];
        B[i] = new double[N];
        A[i] = new double[N];

        for (j = 0; j < N; j++) {
            C[i][j] = 0;
            B[i][j] = i*j;
            A[i][j] = i*j;
        }
    }

    DateTime now = DateTime.Now;

    for (i = 0; i < N; i++) {
        for (k = 0; k < N; k++) {
            for (j = 0; j < N; j++) {
                C[i][j] += A[i][k]*B[k][j];
            }
        }
    }

    return DateTime.Now - now;
}

Using Reflector on the DLL, the only thing the compiler does is move the declaration of k into line 23. I used the C++/CLI plugin for Reflector to get the code into C++. With a little modification we get this code:

static TimeSpan Multiply(int N)
{
    int i;
    int j;
    array<array<double>^>^ C = gcnew array<array<double>^>(N);
    array<array<double>^>^ A = gcnew array<array<double>^>(N);
    array<array<double>^>^ B = gcnew array<array<double>^>(N);
    for (i = 0 ; (i < N); i++)
    {
        C[i] = gcnew array<double>(N);
        B[i] = gcnew array<double>(N);
        A[i] = gcnew array<double>(N);
        for (j = 0 ; (j < N); j++)
        {
            C[i][j] = 0;
            B[i][j] = (i * j);
            A[i][j] = (i * j);
        }
    }
    DateTime now = DateTime::Now;
    for (i = 0 ; (i < N); i++)
    {
        for (int k = 0 ; (k < N); k++)
        {
            for (j = 0 ; (j < N); j++)
            {
                C[i][j] = (C[i][j] + (A[i][k] * B[k][j]));
            }
        }
    }
    return ((TimeSpan) (DateTime::Now - now));
}

Compiling this code with



/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_UNICODE" /D "UNICODE" /FD /EHa /MD /Yu"stdafx.h" /Fp"Release\CppDemo.pch" /Fo"Release\\" /Fd"Release\vc80.pdb" /W3 /nologo /c /Zi /clr /TP /errorReport:prompt /FU "c:\WINDOWS\Microsoft.NET\Framework\v2.0.50727\System.dll"


We get an optimized matrix multiply. Again, reflecting the exe to get the C# code, we get our new version, which is quite a bit faster:

public static TimeSpan Multiply(int N) {
    double[][] C = new double[N][];
    double[][] A = new double[N][];
    double[][] B = new double[N][];
    int index = 0;
    if (0 < N)
    {
        do
        {
            C[index] = new double[N];
            B[index] = new double[N];
            A[index] = new double[N];
            int column = 0;
            int value = 0;
            do
            {
                C[index][column] = 0;
                double num7 = value;
                B[index][column] = num7;
                A[index][column] = num7;
                column++;
                value = index + value;
            }
            while (column < N);
            index++;
        }
        while (index < N);
    }
    DateTime now = DateTime.Now;
    int i = 0;
    if (0 < N)
    {
        do
        {
            int k = 0;
            do
            {
                int j = 0;
                do
                {
                    double[] Ci = C[i];
                    Ci[j] = (A[i][k] * B[k][j]) + Ci[j];
                    j++;
                }
                while (j < N);
                k++;
            }
            while (k < N);
            i++;
        }
        while (i < N);
    }
    return (TimeSpan) (DateTime.Now - now);
}



This optimized code averages 182 MOPS on my machine compared to the original C# which only achieved 118 MOPS. This is a lot of work, but the speedup is amazing. I will try to put up some IL later to figure out exactly why the last version is so much faster.

Monday, July 30, 2007

Why are int[,], double[,], single[,], et alia so slow?

I wrote a few matrix multiplication (MM) applications for benchmarking my parallel processing research. I was blown away by how slowly the C# implementations ran. I tried writing the MM application a few ways. It turns out that the fastest is a jagged array, type[N][N], rather than type[N*N] or type[N,N]. I even tried using unsafe C# with pointers, but the type[N][N] was still by far the fastest. I am going to compare the type[N][N] and the type[N,N] here.

Here is the C# code for the type[N,N]. I am using int as it benchmarks faster and I spend less time waiting. I have removed the initialization of the arrays as it just added more code to analyze and isn't needed for this analysis:

private static int[,] Standard() {
    Console.WriteLine( "Init" );
    int rows = 1500;
    int cols = 1500;
    int[,] first = new int[rows,cols];
    int[,] second = new int[rows,cols];
    int[,] third = new int[rows,cols];

    Console.WriteLine( "Calculate" );
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            for (int k = 0; k < rows; k++)
                third[i, j] += first[i, k] * second[k, j];

    return third;
}

I have added the console output to identify the IL more easily.


Here is the type[N][N] version:

private static int[][] Jagged() {
    Console.WriteLine( "Init" );
    int rows = 1500;
    int cols = 1500;
    int[][] first = new int[rows][];
    int[][] second = new int[rows][];
    int[][] third = new int[rows][];

    for (int i = 0; i < rows; i++) {
        first[i] = new int[cols];
        second[i] = new int[cols];
        third[i] = new int[cols];
    }

    Console.WriteLine( "Calculate" );
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < cols; j++)
            for (int k = 0; k < rows; k++)
                third[i][j] += first[i][k] * second[k][j];

    return third;
}



At first glance it looks like we have a bit more initialization overhead for the type[N][N]. I compiled everything and reflected out the IL. I then stripped all locations to keep this as clean and simple as possible (and the diff cleaner). There is a difference, though: the type[N][N] uses ldelem and stelem to initialize, while the type[N,N] uses call instance void int32[0...,0...]::Set(int32, int32, int32). I think this is where we start to see why the C# rectangular array is slower than the jagged version.


If we look at the IL for the initialization that takes care of everything up to the writing of "Calculate" to the console:



On the type[N,N] side we create a new object on the evaluation stack and then pop it from the stack and store it in a local variable - this is done for each of the three arrays.


On the type[N][N] side we push a new primitive array of int32 onto the stack, and for each position we use the same call to create a new primitive array and replace the value on the stack at the given position.


Here is the grand conclusion to our IL:



As you can see, the calculations happen 'almost' identically. With the type[N,N] array we make external calls to the runtime to get an address and set values, whereas the type[N][N] simply issues a ldelem. Now, I wish there were a way to see exactly how long the Address and Get methods take, but the evidence is clear that ldelem is a lot faster.


I ran all of the matrix permutations for all of the types of arrays I could use. The ikj MM was the fastest for all. I benchmarked all of these on two machines (S01, S02). The machines were as identical as possible, with nothing running except background services. Each point was run three times and the average of the three was plotted. The matrix size (bottom axis) is NxN (square) and the left hand side is the number of algorithmic steps per second. For MM, the multiplication, addition, and assignment count as one operation.




For a 3000x3000 matrix we are making millions of method calls with the type[N,N] matrix that the type[N][N] version does not make. Why is the C# rectangular array implemented this way? Please let me know if I missed anything or if you have some insight.

Using Castle.Facilities.Synchronize

I removed the last post as the formatting was horrible. Here again is my first demo app with the Synchronization facility. It is normally quite painful to make your WinForms application thread-safe. VS2005 made it much easier to see that you are doing something wrong: your application will move along, and then the debugger will break and you have a window with an InvalidOperationException in your face.

System.InvalidOperationException was unhandled
  Message="Cross-thread operation not valid: Control 'richTextBox' accessed from a thread other than the thread it was created on."
  Source="System.Windows.Forms"
  StackTrace:
       at System.Windows.Forms.Control.get_Handle()
       at System.Windows.Forms.RichTextBox.StreamIn(Stream data, Int32 flags)
       at System.Windows.Forms.RichTextBox.StreamIn(String str, Int32 flags)
       at System.Windows.Forms.RichTextBox.set_Text(String value)
       at SafelyUpdatingWinForm.MainForm.UpdateTime() in C:\SafelyUpdatingWinForm\SafelyUpdatingWinForm\MainForm.cs:line 38
       at SafelyUpdatingWinForm.MainForm.UpdateLoop() in C:\SafelyUpdatingWinForm\SafelyUpdatingWinForm\MainForm.cs:line 27
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()

The IDE is nice enough, though, to point you in the direction of a solution: How to: Make Thread-Safe Calls to Windows Forms Controls (MSDN). They suggest that for any control you want to update, you create delegates and use Control.Invoke to make the call on the appropriate thread. This gets very bloated when you have a large and complex UI. To simplify things, Craig Neuwirt created a new facility to plug into the Castle WindsorContainer. It can automatically marshal the calls to the UI thread for you, and a whole lot more. Roy Osherove has his own implementation that uses DynamicProxy2.
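For contrast, here is roughly the manual pattern that MSDN article describes and that the facility saves you from writing (a sketch with my own class name, showing the InvokeRequired/Invoke dance for a single control):

using System;
using System.Threading;
using System.Windows.Forms;

// The manual pattern the facility replaces: every cross-thread UI update has
// to check InvokeRequired and marshal itself back onto the UI thread.
public class ManualMarshalingForm : Form {
    private delegate void SetTimeDelegate( string time );
    private readonly RichTextBox richTextBox = new RichTextBox();

    public ManualMarshalingForm() {
        Controls.Add( richTextBox );
    }

    protected override void OnLoad( EventArgs e ) {
        base.OnLoad( e );
        // Update the text box from a worker thread, as UpdateLoop does below.
        new Thread( delegate() { SetTime( DateTime.Now.ToLongTimeString() ); } ).Start();
    }

    private void SetTime( string time ) {
        if ( richTextBox.InvokeRequired ) {
            // Called from the worker thread: re-invoke this method on the UI thread.
            richTextBox.Invoke( new SetTimeDelegate( SetTime ), new object[] { time } );
        }
        else {
            richTextBox.Text = time;
        }
    }
}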


Program.cs:

using System;
using System.Windows.Forms;
using Castle.Facilities.Synchronize;
using Castle.Windsor;

namespace SafelyUpdatingWinForm {
    internal static class Program {
        [STAThread]
        private static void Main() {
            Application.EnableVisualStyles();
            Application.SetCompatibleTextRenderingDefault( false );

            WindsorContainer container = new WindsorContainer();
            container.AddFacility( "sync.facility", new SynchronizeFacility() );
            container.AddComponent( "mainform.form.class", typeof (MainForm) );

            Application.Run( container.Resolve( "mainform.form.class" ) as Form );
        }
    }
}



And then in the MainForm.cs:

using System;
using System.ComponentModel;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;
using System.Windows.Forms;

namespace SafelyUpdatingWinForm {
    public partial class MainForm : Form {
        private Thread updateLoop;
        private bool loop = true;
        private const int delay = 2000;

        public MainForm() {
            InitializeComponent();
        }

        private void UpdateLoop() {
            while ( loop ) {
                UpdateTime();
                Thread.Sleep( delay );
            }
        }

        protected virtual void UpdateTime() {
            string text = GetWebPage();
            Regex regex = new Regex( @"<b>\d\d:\d\d:\d\d<br>" );
            Match match = regex.Match( text );
            string time = match.Value.TrimStart( "<b>".ToCharArray() )
                .TrimEnd( "<br>".ToCharArray() );
            richTextBox.Text = time;
        }

        private static string GetWebPage() {
            string url = "http://www.time.gov/timezone.cgi?Eastern/d/-5";
            HttpWebRequest webRequest = WebRequest.Create( url ) as HttpWebRequest;
            webRequest.Method = "GET";
            using (WebResponse webResponse = webRequest.GetResponse()) {
                using (StreamReader sr = new StreamReader( webResponse.GetResponseStream(), Encoding.UTF8 )) {
                    return sr.ReadToEnd();
                }
            }
        }

        protected override void OnClosing( CancelEventArgs e ) {
            loop = false;
            while ( updateLoop.IsAlive ) { }
            base.OnClosing( e );
        }

        protected override void OnLoad( EventArgs e ) {
            updateLoop = new Thread( new ThreadStart( UpdateLoop ) );
            updateLoop.Name = "UpdateLoop";
            updateLoop.Start();
            base.OnLoad( e );
        }
    }
}

Thursday, March 15, 2007

Inversion of Control and Dependency Injection

I found a great article today while looking around for IoC and DI articles. Microsoft has an article on MSDN about it with the Windsor container.

Wednesday, March 14, 2007

Q.E.F.

Being that this is the first post, I will explain the cryptic title. It is the Greek equivalent of Q.E.F. which is an abbreviation for the Latin phrase "quod erat faciendum" or "that which was to be done". I will not be proving anything on this blog; I will just be trying to justify what I say without going through the pain of full proofs.

-Ian