Monday, July 30, 2007

Why are int[,], double[,], single[,], et alia so slow?

I wrote a few matrix multiplication (MM) applications for benchmarking my parallel processing research. I was blown away by how slow the C# implementations were working. I tried to write the MM application a few ways. It turns out that the fastest is a jagged array, type[N][N], rather than type[N*N] or type[N,N]. I even tried to use unsafe C# to use pointers, but the type[N][N] was still by far the fastest. I am going to compare the type[N][N] and the type[N,N] here.

Here is the C# code for the type[N,N]. I am using int as it benchmarks faster and I spend less time waiting. I have removed the initialization of the arrays as it just added more code to analyze and isn't needed for this analysis:

private static int[,] Standard() {
Console.WriteLine( "Init" );
int rows = 1500;
int cols = 1500;
int[,] first = new int[rows,cols];
int[,] second = new int[rows,cols];
int[,] third = new int[rows,cols];

Console.WriteLine( "Calculate" );
for (int i = 0; i < rows; i++
for (int j = 0; j < cols; j++)
for (int k = 0; k < rows; k++)
third[i, j] += first[i, k] * second[k, j];

return third;
}

I have added the console output to identify the IL more easily.


Here is the type[N][N] version:

private static int[][] Jagged() {
Console.WriteLine( "Init" );
int rows = 1500;
int cols = 1500;
int[][] first = new int[rows][];
int[][] second = new int[rows][];
int[][] third = new int[rows][];

for (int i = 0; i < rows; i++) {
first[i] = new int[cols];
second[i] = new int[cols];
third[i] = new int[cols];
}

Console.WriteLine( "Calculate" );
for (int i = 0; i < rows; i++)
for (int j = 0; j < cols; j++)
for (int k = 0; k < rows; k++)
third[i][j] += first[i][k] * second[k][j];

return third;
}



On first glance it looks like we have a bit more initialization overhead for the type[N][N]. I compiled everything and reflected out the IL. I then stripped all locations to keep this as clean and simple as possible (and the diff cleaner). Though there is a difference. The type[N][N] uses ldelem and stelem to initialize while type[N,N] uses call instance void int32[0...,0...]::Set(int32, int32, int32). I think this is where we start seeing why the C# array is slower than the jagged version.


If we look at the IL for the initialization that takes care of everything up to the writing of "Calculate" to the console:



On the type[N,N] side we are creating a new object on the evaluation stack and then popping from the stack and storing in a local variable - this is done for each the three arrays.


On the type[N][N] we push a new primitive array of int32 onto the stack and for each position we use the same call to create a new primitive array and replace the value on the stack at the given position.


Here is the grand conclusion to our IL:



As you can see, the calculations happen 'almost' identically. In the type[N,N] array we are making external calls to the runtime in order to get an address and set values whereas the type[N][N] simply issue a ldelem. Now, I wish there were a way to see exactly how long the Address and Get methods take, but the evidence is clear that ldelem is a lot faster.


I ran all of the matrix permutations for all of the types of arrays I could use. The ikj MM was the fastest for all. I benchmarked all of these on two machines (S01, S02). The machines were as identical as possible and nothing else running except background services. Each point was run three times and the average of the three was plotted. The matrix size (bottom) was NxN (square) and the left hand side is the number of algorithmic steps per second. For MM the multiplication, addition, and assignment are 1 operation.




For a 3000x3000 matrix we are making millions or method calls using the type[N,N] matrix that the type[N,N] does. Why is the C# matrix implemented in the way? Please let me know if I missed anything or you have some insight.

Using Castle.Facilities.Synchronize

I removed the last post as the formatting was horrible. Here again is my first demo app with the Synchronization facility. It is normally quite painful to try to make your WinForms application threadsafe. VS2005 made it much easier to see that you are doing something wrong. Your application will move along and then the debugger will break and you have a window with an InvalidOperationException in you face.

System.InvalidOperationException was unhandled
  Message="Cross-thread operation not valid: Control 'richTextBox' accessed from a thread other than the thread it was created on."
  Source="System.Windows.Forms"
  StackTrace:
       at System.Windows.Forms.Control.get_Handle()
       at System.Windows.Forms.RichTextBox.StreamIn(Stream data, Int32 flags)
       at System.Windows.Forms.RichTextBox.StreamIn(String str, Int32 flags)
       at System.Windows.Forms.RichTextBox.set_Text(String value)
       at SafelyUpdatingWinForm.MainForm.UpdateTime() in C:\SafelyUpdatingWinForm\SafelyUpdatingWinForm\MainForm.cs:line 38
       at SafelyUpdatingWinForm.MainForm.UpdateLoop() in C:\SafelyUpdatingWinForm\SafelyUpdatingWinForm\MainForm.cs:line 27
       at System.Threading.ThreadHelper.ThreadStart_Context(Object state)
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()

The IDE is nice enough though to point you in the direction of a solution How to: Make Thread-Safe Calls to Windows Forms Controls (MSDN). They suggest that for any component you want to change that you create delegates and try to use Control.Invoke to make the call on the appropriate thread. This gets very bloated when you have a large and complex UI. To simplify things, Craig Neuwirt created a new facility to plug into the Castle WindsorContainer. It can automatically martial the calls to the UI thread for you and a whole lot more. Roy Osherove has his own implementation that uses the DynamicProxy2.


Program.cs:

using System;
using System.Windows.Forms;
using Castle.Facilities.Synchronize;
using Castle.Windsor;

namespace SafelyUpdatingWinForm {
internal static class Program {
[STAThread]
private static void Main() {
Application.EnableVisualStyles();
Application.SetCompatibleTextRenderingDefault( false );

WindsorContainer container = new WindsorContainer();
container.AddFacility( "sync.facility", new SynchronizeFacility() );
container.AddComponent( "mainform.form.class", typeof (MainForm) );

Application.Run( container.Resolve( "mainform.form.class" ) as Form );
}
}
}



And then in the MainForm.cs:

using System;
using System.ComponentModel;
using System.IO;
using System.Net;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;
using System.Windows.Forms;

namespace SafelyUpdatingWinForm {
public partial class MainForm : Form {
private Thread updateLoop;
private bool loop = true;
private const int delay = 2000;

public MainForm() {
InitializeComponent();
}

private void UpdateLoop() {
while ( loop ) {
UpdateTime();
Thread.Sleep( delay );
}
}

protected virtual void UpdateTime() {
string text = GetWebPage();
Regex regex = new Regex( @"<b>\d\d:\d\d:\d\d<br>" );
Match match = regex.Match( text );
string time = match.Value.TrimStart( "<b>".ToCharArray() )
.TrimEnd( "<br>".ToCharArray() );
richTextBox.Text = time;
}

private static string GetWebPage() {
string url = "http://www.time.gov/timezone.cgi?Eastern/d/-5";
HttpWebRequest webRequest = WebRequest.Create( url ) as HttpWebRequest;
webRequest.Method = "GET";
using (WebResponse webResponse = webRequest.GetResponse()) {
using (StreamReader sr = new StreamReader( webResponse.GetResponseStream(), Encoding.UTF8 )) {
return sr.ReadToEnd();
}
}
}

protected override void OnClosing( CancelEventArgs e ) {
loop = false;
while ( updateLoop.IsAlive ) { }
base.OnClosing( e );
}

protected override void OnLoad( EventArgs e ) {
updateLoop = new Thread( new ThreadStart( UpdateLoop ) );
updateLoop.Name = "UpdateLoop";
updateLoop.Start();
base.OnLoad( e );
}
}
}