Sunday, August 31, 2008

Number of runs in random sequence

While reading this article about human ability on outputing random data-
http://faculty.rhodes.edu/wetzel/random/mainbody.html
I started to think - How we can simply detect that entered sequence is NOT random ( or entered by human) ? Of course to make some real-world measurements we need statistics a lot. But if we need just some straightfoward method for preliminary results ? Well, the idea was to check the number of runs in sequence and to compare the result with the expected value. (Number of runs is total number of subsequences in sequence).
So that if number_of_runs != expected_number_of_runs then we can conclude that data is not much random. OK. Number of runs we can measure simply. But what about expected_number_of_runs ? Somehow I suspected that this expected runs count should be dependent on the random variable values amount. For example sequences 'azazaz' and 'azxazx' are the same length, but one is composed of a,z (2 values) and second- a,z,x (3 values). So if we know what is going on with the number_of_runs when random sequences is composed from different values amount - we may calculate expected_number_of_runs given any sequence. Second point - for eliminating need to recalculate everything when sequence length changes - better calculate expected_number_of_runs / sequence_length.
So I made such experiment for detection of such dependancy.
Experiment Python code (to run it you need matplotlib python module):



from random import *
from pylab import plot, show, xlabel, ylabel, title, grid, legend

def runs(lst):
return sum([int(not lst[i]==lst[i+1]) for i in range(len(lst)-1)])+1

def testruns(n):
s=[randint(1, n) for x in range(10000)]
return float(runs(s)) / len(s)

max = 20
xrange = range(2, max)
yruns = [testruns(x) for x in xrange] # Generating random data and counting runs
yreg = [1.0-1.0/x for x in xrange] # Function for fitting data
plot(xrange, yruns , 'ro', xrange, yreg)
xlabel('Variable values amount')
ylabel('Number of runs per item')
title('Number of runs dependance on variable values amount')
legend(('Experimental' , 'Data fit: ' r'$1-n^{-1}$'), loc=7)
grid()
show()



So after running this random data test- we get such plot:

So it tells us that expected_number_of_runs we need to calculate like that-
expected_number_of_runs = sequence_length * (1 - 1/n);
n- unique element count in sequence. So now we have some tool for detecting sequence being not much random.
First- we measure real number_of_runs in sequence.
Second - we calculate expected_number_of_runs.
Third - if these two values is not approximately equal - then we can conclude that data is not random. Of course we can`t be sure about randomness just from this test.
We need a lot of different measures - information entropy and as such... But
the goal of this article was to play with number_of_runs. Maybe later - I will write something about information entropy. It is very interesting also.

Have fun!!!

Wednesday, August 27, 2008

Codegolf - Vigenere Cipher

Problem definition:
---------------------------------------------------
The Vigenere cipher is a simple form of a polyalphabetic substitution, using a series of different Caesar ciphers based on the letters of a keyword. It's over 450 years old now, but that shouldn't stop us having a bit of golfing fun with it now.

The following information is largely taken from Wikipedia's Vigenere cipher page. You might find some interesting information that isn't included here, so I recommend you check it out.

Your job is to write some code which takes a keyword and some plaintext, and encrypts it according to the Vigenere cipher.

In a Caesar cipher, each letter of the alphabet is shifted along some number of places; for example, in a Caesar cipher of shift 3, A would become D, B would become E and so on. The Vigenere cipher consists of several Caesar ciphers in sequence with different shift values.

To encipher, a table of alphabets can be used, termed a tabula recta, Vigenere square, or Vigenere table. It consists of the alphabet written out 26 times in different rows, each alphabet shifted cyclically to the left compared to the previous alphabet, corresponding to the 26 possible Caesar ciphers. At different points in the encryption process, the cipher uses a different alphabet from one of the rows. The alphabet used at each point depends on a repeating keyword.

Click here to see the Vigenere square.

See the examples section below for a worked example.
More Information

* Your code will be given both the keyword and the plaintext on stdin. It will be in the format

KEYWORD
PLAINTEXT

with newlines at the end of each line.

* You should print the encrypted plaintext on stdout.
* Both the keyword and the plaintext will contain only the uppercase letters A through Z.
* Your code will be run three times and will need to pass all three tests to be deemed successful.
---------------------------------------------------
Code golf challenge is interesting because it measures your ability to write code as short as possible. My try to give shortest solution in Python:

Run no 1 -> code size 111 bytes:


k=raw_input()*100;p=raw_input()
print ''.join(map(lambda x, y: chr(65+(ord(x)+ord(y)-130)%26), k[:len(p)], p))


Run no 2 -> code size 84 bytes:


i=raw_input;print ''.join(chr(65+(ord(x)+ord(y)-130)%26) for x,y in zip(i()*10,i()))


Have fun!
Code golf

Monday, August 25, 2008

Project Euler p22

Problem description:
-----------------------------------------------------------
Using names.txt (right click and 'Save Link/Target As...'), a 46K text file containing over five-thousand first names, begin by sorting it into alphabetical order. Then working out the alphabetical value for each name, multiply this value by its alphabetical position in the list to obtain a name score.

For example, when the list is sorted into alphabetical order, COLIN, which is worth 3 + 15 + 12 + 9 + 14 = 53, is the 938th name in the list. So, COLIN would obtain a score of 938 * 53 = 49714.

What is the total of all the name scores in the file?
-----------------------------------------------------------
Solution in Python:



def ReadFile():
""" Read data input file"""
f = open('C:/ProjectEuler/names.txt','r')
str = f.readlines()
f.close()
str[0] = str[0].replace('"','')
ret = str[0].split(',')
ret.sort()
return ret

def AlphabeticalValue(string):
alph = list('ABCDEFGHIJKLMNOPQRSTUVWXYZ')
values = map(lambda x: alph.index(x)+1, list(string))
return sum(values)

def TotalScore():
names = ReadFile()
namsc = map(lambda (x,y): AlphabeticalValue(y)*(x+1), list(enumerate(names)))
return sum(namsc)

if __name__ == '__main__':
print TotalScore()


What is missing ? Of course - Have fun :-)