Sunday, March 30, 2008

Parallel extensions for .NET and multi-core cpu

This time we look a bit at parallel extensions library for .NET. This library intended for speeding-up work for multi-core cpu. If you want to speed-up program in this way -you need:
  • of course multi-core machine
  • parallel extensions library for .NET. You can get it there
  • Also .NET Framework 3.5
  • And at last- you need to optimize code to be especially suitable for multi-cores.
    This issue will be mentioned later.

This library was tested on Intel Core2 Duo machine with the help of C#.
Testing procedure:

  • There was created array of 10000000 elements.
  • This array then was initialized in sequential and parallel ways.
    And perfomance measurements was done.
  • When measuring perfomance, several factors was changed to look for impact on parallel execution speed:
    a) Data clusters amount (array was partitioned into different number of clusters)
    b) Different mathematical operations on array elements was used.

C# code which did all measurement comparisions between sequential and parallel executions is here:



using System;

class Program
{
static int ProcessVariable(int i)
{
int res;

res =
//i * i
(int)Math.Sqrt(i)
//i >> 1
;
return res;
}

static void InitArrayParallel(int[] num, int clusters)
{
int clustcount = num.Length / clusters;

Parallel.For(1, clusters + 1, k =>
{
int ixlow = (k - 1) * clustcount;
int ixhigh = ixlow + clustcount;

for (int i = ixlow; i < ixhigh; i++)
{ num[i] = ProcessVariable(i); }
});
}

static void InitArraySerial(int[] num)
{
for (int i = 1; i < num.Length; i++)
{ num[i] = ProcessVariable(i); }
}

static void Main(string[] args)
{
int limit = 10000000;
int[] num = new int[limit];
double ds = 0.0;
double dp = 0.0;
double speed = 0.0;
int avg_count = 6;
DateTime t1, t2;

for (int c = 1; c <= limit; c *= 2)
{
ds = 0.0;
dp = 0.0;
for (int i = 1; i <= avg_count; i++)
{
t1 = DateTime.Now;
InitArraySerial(num);
t2 = DateTime.Now;
ds += ((TimeSpan)(t2 - t1)).TotalMilliseconds;
t1 = DateTime.Now;
InitArrayParallel(num, c);
t2 = DateTime.Now;
dp += ((TimeSpan)(t2 - t1)).TotalMilliseconds;
}
speed = ds / dp;
Console.WriteLine("Data clusters {1}; Speed-up factor {0:F2}", speed, c);
}
Console.ReadLine();
}
}



Results is interesting:


There results is displayed as XY graph of speed-up factor dependance on data clusters amount.
Clusters amount scale is logarithmic.

Several conclusions that can be made from this graph (and this test):

  • As it was supposed to be, maximal speed-up factor is about 2
    (because machine was with 2 cores)
  • speed-up heavily depends on data partitions amount.
  • It seems there are a lower number of data clusters -> 16, when workload starts being nicelly distributed between cores and speed-up very sharply increases. Or else - if data partitions amount is smaller then 16,- seems there are no speed-up at all, and no use of 2 cores :-)
  • Speed-up may depend on operations used on array elements. For example- from graph can be seen that parallel execution speeds-up more square root operation than bitshift, or multiplication. It seems somehow Sqrt work can be distributed between cores more friendly.
  • Last- there exists also upper limit of data clusters amount, on which speed-up is still noticable. So users should not divide data into very big amount of chunks also, because in that way cores are not capable to interoperate efficently. Also higher limit may be dependent on actual operations on array elements.

So all in all - its good that we got this library for .NET. It simplifies a lot writing efficent algorithm for several cores. But.. as we saw- we need to remember that its not enough to have this library for multi-cores enabled code. We need to optimized it for that- Splitting jobs between cores is a different world of programming.

Sunday, March 23, 2008

Projecteuler.net problem 14 - Csharp vs Fsharp

Problem 14 stated as following:

The following iterative sequence is defined for the set of positive integers:
n → n/2 (n is even)
n → 3n + 1 (n is odd)
Using the rule above and starting with 13, we generate the following sequence:
13 → 40 → 20 → 10 → 5 → 16 → 8 → 4 → 2 → 1
It can be seen that this sequence (starting at 13 and finishing at 1) contains 10 terms. Although it has not been proved yet (Collatz Problem), it is thought that all starting numbers finish at 1.

Which starting number, under one million, produces the longest chain?

Ok, so here it is. Lets have fun with F#:




let seq_length n = n > Seq.unfold(fun x -> match x with
x when x = 0L -> None
x when x = 1L -> Some((1L,0L))
x when x%2L=0L -> Some((x,x/2L))
_ -> Some((x,3L*x + 1L))
) > Seq.length
[for i in 1L..1000000L -> (i, seq_length i)]
> List.fold1_left(fun (ai,al) (i,l) -> if l > al then (i,l) else (ai,al) )
> (fun (x,y) -> x ) > print_any

Elegant, slow version of solution-> ~20 lines; ~10 seconds.

Now lets try C#:




static long CountTerms(long x, ref long[] terms)
{
long tc = 1;
long k = x;

while (k!=1)
{
tc++;
k = (k % 2 == 0) ? k / 2 : 3 * k + 1;
if (k <= x)
{
if (terms[k] > 0)
{
terms[x] = terms[k] + tc;
return terms[x];
}
}
}
terms[x] = tc;
return terms[x] ;
}

static long LongestSeq()
{
long max = 1000000;
long[] termcount = new long[max+1];
long tcmax = 0;
long c = 0;
long ix = 0;

for (int i = 1; i <= max; i++)
{
c = CountTerms(i, ref termcount);
if ( c > tcmax)
{
tcmax = c;
ix = i;
}
}
return ix;
}

static void Main(string[] args)
{
Console.WriteLine(LongestSeq());
Console.ReadLine();
}



Ugly, fast version of solution -> ~0.18 second; ~60 lines.

Conclusion:

F# solutions can be done in more elegant/less code way, but in contrary it is slower that C#
counterpart. Also C# solutions can be written in fast way, but needs more code and attention.
F# solution was by factor 55 slower than C# solution, but also code was by factor of 3 shorter than C# counterpart. So if you want elegancy/generics power - take F#; and if you want only perfomance- take C#.
P.S. There is a way in F# to code (using mutable keyword) very very ugly like in C#, but i`ve not
suggest using it on F#,- why to code ugly in F# if you can grab C# instead ?? :-)

Tuesday, March 18, 2008

Reversing number in Csharp

This time i want to compare 2 methods for reversing a number. One method is based on Array.Reverse method and second - on number modulus 10 operator. Code is this for reversing number as string (C#):



static long ReverseString(long number)
{
char[] str = number.ToString().ToCharArray();
Array.Reverse(str);
return long.Parse(new string(str));
}



And code for reversing number by math operations (C#):



static long ReverseNumber(long number)
{
long digit = number % 10;
long rest = (long) number / 10;
long o = digit;

while (rest > 0)
{
digit = rest % 10;
rest /= 10;
o = (o*10)+ digit;
}
return o;
}



And these separate functions are run for 1000000 iterations.
Then execution time is measured.
Situation is this:

Function ReverseString --> 984 ms.
Function ReverseNumber --> 765 ms.

Conclusion:
Reversing by math operations is a bit faster- faster by factor of 1.3
So if for some reason you need to reverse a number - do it with math,
rather than by string/array operations.

Sunday, March 16, 2008

Sieve of eratosthenes algorithm perfomance

Prime numbers are very interesting for just 1 reason - they are unique in that sense that they are not divisible from any number without remainder, except 1 and itself. So there are many algorithms for finding prime numbers. But first not very complex and fast algoritm is "Sieve of eratosthenes". This algorithm pseudocode:
1. Take next number
2. Check is number X exists in composite numbers list
3.
if (2) not - it is prime number, - then add numbers X*2,X*3,X*4,...X*n (X*n- maximum number until primes will be looked) to the composite numbers list
- go to step 1
if (2) yes - it is not prime number, rather composite.
- go to step 1
4. Repeat 1-4 until the maximum number you need to search.
C# code:



static int PrimesBelow(long until)
{
int sum = 0;
bool[] composite = new bool[until + 1];
for (long i = 2; i < until; i++)
{
if (!composite[i])
{
for (long j = i + i; j <= until; j += i)
{
composite[j] = true;
}
sum++;
}
}
return sum;
}



So after executing this code to find primes with different prime search limit, we get this picture of algorithm perfomance:
_________________________
Prime count --- CPU time (ms)
78498 ________ 15
120322 _______46
184586 _______62
283551 _______93
436077 _______234
671165 _______421
1034145 ______765
1594712 ______1296
2461527 ______2140
3801997 ______3515
5876959 ______5671
9091052 ______9218
14073059 _____15109
21798316 _____24671
33785616 _____40781
________________________

So we see that for example 9 million first prime numbers will be found in just 9 seconds.
Also we see that running time almost linear, that is time depends on prime search space linearly.

Conclusion:
IMHO if you want fast, not very complex algorithm for finding prime numbers- I would suggest this algorithm - "Sieve of eratosthenes". But of course this algorithm is not suitable if you want search big prime numbers, because of it requirements on memory consumption. When array gets out of RAM, this algorithm is not effective anymore...
But in any way this algorithm is old but still good, at least comparing with trivial trial division algorithm...