Sunday, August 31, 2008

Number of runs in random sequence

While reading this article about human ability on outputing random data-
I started to think - How we can simply detect that entered sequence is NOT random ( or entered by human) ? Of course to make some real-world measurements we need statistics a lot. But if we need just some straightfoward method for preliminary results ? Well, the idea was to check the number of runs in sequence and to compare the result with the expected value. (Number of runs is total number of subsequences in sequence).
So that if number_of_runs != expected_number_of_runs then we can conclude that data is not much random. OK. Number of runs we can measure simply. But what about expected_number_of_runs ? Somehow I suspected that this expected runs count should be dependent on the random variable values amount. For example sequences 'azazaz' and 'azxazx' are the same length, but one is composed of a,z (2 values) and second- a,z,x (3 values). So if we know what is going on with the number_of_runs when random sequences is composed from different values amount - we may calculate expected_number_of_runs given any sequence. Second point - for eliminating need to recalculate everything when sequence length changes - better calculate expected_number_of_runs / sequence_length.
So I made such experiment for detection of such dependancy.
Experiment Python code (to run it you need matplotlib python module):

from random import *
from pylab import plot, show, xlabel, ylabel, title, grid, legend

def runs(lst):
return sum([int(not lst[i]==lst[i+1]) for i in range(len(lst)-1)])+1

def testruns(n):
s=[randint(1, n) for x in range(10000)]
return float(runs(s)) / len(s)

max = 20
xrange = range(2, max)
yruns = [testruns(x) for x in xrange] # Generating random data and counting runs
yreg = [1.0-1.0/x for x in xrange] # Function for fitting data
plot(xrange, yruns , 'ro', xrange, yreg)
xlabel('Variable values amount')
ylabel('Number of runs per item')
title('Number of runs dependance on variable values amount')
legend(('Experimental' , 'Data fit: ' r'$1-n^{-1}$'), loc=7)

So after running this random data test- we get such plot:

So it tells us that expected_number_of_runs we need to calculate like that-
expected_number_of_runs = sequence_length * (1 - 1/n);
n- unique element count in sequence. So now we have some tool for detecting sequence being not much random.
First- we measure real number_of_runs in sequence.
Second - we calculate expected_number_of_runs.
Third - if these two values is not approximately equal - then we can conclude that data is not random. Of course we can`t be sure about randomness just from this test.
We need a lot of different measures - information entropy and as such... But
the goal of this article was to play with number_of_runs. Maybe later - I will write something about information entropy. It is very interesting also.

Have fun!!!

No comments:

Post a Comment

Comment will be posted after comment moderation.
Thank you for your appreciation.