Incorrect output from my implementation of SHA-256 - python

For a personal project, I'm working on implementing SHA-256 in Python 3, without using the hashlib module (since that would defeat the purpose of learning how SHA-256 works). I've been working from the Wikipedia pseudocode, but my code gives incorrect output (compared to the hashlib output). I've been staring at the code for an hour, and besides a headache, I've made no headway on figuring out what I've done wrong.
The code:
#!/usr/bin/env python3
import hashlib
import sys
# ror function taken from http://stackoverflow.com/a/27229191/2508324
def ror(val, r_bits, max_bits=32):
    return ((val & (2**max_bits-1)) >> r_bits%max_bits)|(val << (max_bits-(r_bits%max_bits)) & (2**max_bits-1))
h = [0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a, 0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19]
k = [0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5, 0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed5,
0xd807aa98, 0x12835b01, 0x243185be, 0x550c7dc3, 0x72be5d74, 0x80deb1fe, 0x9bdc06a7, 0xc19bf174,
0xe49b69c1, 0xefbe4786, 0x0fc19dc6, 0x240ca1cc, 0x2de92c6f, 0x4a7484aa, 0x5cb0a9dc, 0x76f988da,
0x983e5152, 0xa831c66d, 0xb00327c8, 0xbf597fc7, 0xc6e00bf3, 0xd5a79147, 0x06ca6351, 0x14292967,
0x27b70a85, 0x2e1b2138, 0x4d2c6dfc, 0x53380d13, 0x650a7354, 0x766a0abb, 0x81c2c92e, 0x92722c85,
0xa2bfe8a1, 0xa81a664b, 0xc24b8b70, 0xc76c51a3, 0xd192e819, 0xd6990624, 0xf40e3585, 0x106aa070,
0x19a4c116, 0x1e376c08, 0x2748774c, 0x34b0bcb5, 0x391c0cb3, 0x4ed8aa4a, 0x5b9cca4f, 0x682e6ff3,
0x748f82ee, 0x78a5636f, 0x84c87814, 0x8cc70208, 0x90befffa, 0xa4506ceb, 0xbef9a3f7, 0xc67178f2]
s = sys.stdin.read().encode()
msg = [int(x,2) for c in s for x in '{:08b}'.format(c)]
msg.append(1)
while len(msg) % 512 != 448:
    msg.append(0)
msg.extend([int(x,2) for x in '{:064b}'.format(len(s))])
for i in range(len(msg)//512):
    chunk = msg[512*i:512*(i+1)] # sloth love chunk
    w = [0 for _ in range(64)]
    for j in range(16):
        w[j] = int(''.join(str(x) for x in chunk[32*j:32*(j+1)]),2)
    for j in range(16, 64):
        s0 = ror(w[j-15], 7) ^ ror(w[j-15], 18) ^ (w[j-15] >> 3)
        s1 = ror(w[j-2], 17) ^ ror(w[j-2], 19) ^ (w[j-2] >> 10)
        w[j] = (w[j-16] + s0 + w[j-7] + s1) % 2**32
    work = h[:]
    for j in range(64):
        S1 = ror(work[4], 6) ^ ror(work[4], 11) ^ ror(work[4], 25)
        ch = (work[4] & work[5]) ^ (~work[4] & work[6])
        temp1 = (work[7] + S1 + ch + k[j] + w[j]) % 2**32
        S0 = ror(work[0], 2) ^ ror(work[0], 13) ^ ror(work[0], 22)
        maj = (work[0] & work[1]) ^ (work[0] & work[2]) ^ (work[1] & work[2])
        temp2 = (S0 + maj) % 2**32
        work = [(temp1 + temp2) % 2**32] + work[:-1]
        work[4] = (work[4] + temp1) % 2**32
    h = [(H+W)%2**32 for H,W in zip(h,work)]
print(''.join('{:08x}'.format(H) for H in h))
print(hashlib.sha256(s).hexdigest())
If the implementation were correct, the two outputs would match. Instead, I get this (with input abc):
$ echo -n abc | ./sha256.py
203b1d9016060802fe5ef80436611159de1868b58d44940e3d3979eab5f4d193
ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
I have thoroughly examined the code, but I do not see any differences between it and the Wikipedia pseudocode. I suspect the error is in the compression loop (for j in range(64):). I've manually debugged and reviewed the state of the program up through initializing the first 16 words of the w array, and it all checks out.
Any help would be greatly appreciated!

SHA-256 works on bits, not bytes. Therefore, the 64-bit length field at the end of the padding is expressed in bits as well; the mistake is in the line
msg.extend([int(x,2) for x in '{:064b}'.format(len(s))])
which should be
msg.extend([int(x,2) for x in '{:064b}'.format(len(s) * 8)])
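To see the fix in isolation, here is a minimal sketch of just the padding step, reusing the question's bit-list approach (pad_bits is a made-up helper name, not part of the original code):

def pad_bits(data: bytes) -> list:
    # Same padding as in the question, but with the length field counted in bits
    bits = [int(x, 2) for c in data for x in '{:08b}'.format(c)]
    bits.append(1)
    while len(bits) % 512 != 448:
        bits.append(0)
    bits.extend(int(x, 2) for x in '{:064b}'.format(len(data) * 8))  # bit count, not byte count
    return bits

padded = pad_bits(b'abc')
assert len(padded) % 512 == 0                                       # a whole number of 512-bit blocks
assert padded[-64:] == [int(x, 2) for x in '{:064b}'.format(24)]    # 3 bytes == 24 bits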

CRC32 calculation in Python without using libraries

I have been trying to get my head around CRC32 calculations without much success, the values that I seem to get do not match what I should get.
I am aware that Python has libraries that are capable of generating these checksums (namely zlib and binascii), but I do not have the luxury of being able to use them, as the CRC functionality does not exist in MicroPython.
So far I have the following code:
import binascii
import zlib
from array import array
poly = 0xEDB88320
table = array('L')
for byte in range(256):
    crc = 0
    for bit in range(8):
        if (byte ^ crc) & 1:
            crc = (crc >> 1) ^ poly
        else:
            crc >>= 1
        byte >>= 1
    table.append(crc)

def crc32(string):
    value = 0xffffffffL
    for ch in string:
        value = table[(ord(ch) ^ value) & 0x000000ffL] ^ (value >> 8)
    return value
teststring = "test"
print "binascii calc: 0x%08x" % (binascii.crc32(teststring) & 0xffffffff)
print "zlib calc: 0x%08x" % (zlib.crc32(teststring) & 0xffffffff)
print "my calc: 0x%08x" % (crc32(teststring))
Then I get the following output:
binascii calc: 0xd87f7e0c
zlib calc: 0xd87f7e0c
my calc: 0x2780810c
The binascii and zlib calculations agree, whereas mine doesn't. I believe the calculated table of bytes is correct, as I have compared it to examples available on the net. So the issue must be in the routine where each byte is calculated; could anyone point me in the right direction?
Thanks in advance!
I haven't looked closely at your code, so I can't pinpoint the exact source of the error, but you can easily tweak it to get the desired output:
import binascii
from array import array
poly = 0xEDB88320
table = array('L')
for byte in range(256):
    crc = 0
    for bit in range(8):
        if (byte ^ crc) & 1:
            crc = (crc >> 1) ^ poly
        else:
            crc >>= 1
        byte >>= 1
    table.append(crc)

def crc32(string):
    value = 0xffffffffL
    for ch in string:
        value = table[(ord(ch) ^ value) & 0xff] ^ (value >> 8)
    return -1 - value

# test
data = (
    '',
    'test',
    'hello world',
    '1234',
    'A long string to test CRC32 functions',
)

for s in data:
    print repr(s)
    a = binascii.crc32(s)
    print '%08x' % (a & 0xffffffffL)
    b = crc32(s)
    print '%08x' % (b & 0xffffffffL)
    print
output
''
00000000
00000000
'test'
d87f7e0c
d87f7e0c
'hello world'
0d4a1185
0d4a1185
'1234'
9be3e0a3
9be3e0a3
'A long string to test CRC32 functions'
d2d10e28
d2d10e28
Here are a couple more tests that verify that the tweaked crc32 gives the same result as binascii.crc32.
from random import seed, randrange

print 'Single byte tests...',
for i in range(256):
    s = chr(i)
    a = binascii.crc32(s) & 0xffffffffL
    b = crc32(s) & 0xffffffffL
    assert a == b, (repr(s), a, b)
print('ok')

seed(42)
print 'Multi-byte tests...'
for width in range(2, 20):
    print 'Width', width
    r = range(width)
    for n in range(1000):
        s = ''.join([chr(randrange(256)) for i in r])
        a = binascii.crc32(s) & 0xffffffffL
        b = crc32(s) & 0xffffffffL
        assert a == b, (repr(s), a, b)
print('ok')
output
Single byte tests... ok
Multi-byte tests...
Width 2
Width 3
Width 4
Width 5
Width 6
Width 7
Width 8
Width 9
Width 10
Width 11
Width 12
Width 13
Width 14
Width 15
Width 16
Width 17
Width 18
Width 19
ok
As discussed in the comments, the source of the error in the original code is that this CRC-32 algorithm inverts the initial crc buffer, and then inverts the final buffer contents. So value is initialised to 0xffffffff instead of zero, and we need to return value ^ 0xffffffff, which can also be written as ~value & 0xffffffff, i.e. invert value and then select the low-order 32 bits of the result.
For binary data where the CRC is chained over multiple buffers, I used the following (using the OP's table):
def crc32(data, crc=0xffffffff):
    for b in data:
        crc = table[(b ^ crc) & 0xff] ^ (crc >> 8)
    return crc
One can XOR the final result with 0xffffffff to agree with the online calculators.
crc = crc32(b'test')
print('0x{:08x}'.format(crc))
crc = crc32(b'te')
crc = crc32(b'st', crc)
print('0x{:08x}'.format(crc))
print('xor: 0x{:08x}'.format(crc ^ 0xffffffff))
output
0x278081f3
0x278081f3
xor: 0xd87f7e0c
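As a quick check of the inversion identity described above, using the pre-inversion value from the output just shown:

value = 0x278081f3                               # register contents before the final inversion
assert value ^ 0xffffffff == 0xd87f7e0c          # same result as ~value & 0xffffffff
print('0x{:08x}'.format(~value & 0xffffffff))    # 0xd87f7e0c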

Python - reading 10 bit integers from a binary file

I have a binary file containing a stream of 10-bit integers. I want to read it and store the values in a list.
It is working with the following code, which reads my_file and fills pixels with integer values:
file = open("my_file", "rb")
pixels = []
new10bitsByte = ""
try:
    byte = file.read(1)
    while byte:
        bits = bin(ord(byte))[2:].rjust(8, '0')
        for bit in reversed(bits):
            new10bitsByte += bit
            if len(new10bitsByte) == 10:
                pixels.append(int(new10bitsByte[::-1], 2))
                new10bitsByte = ""
        byte = file.read(1)
finally:
    file.close()
It doesn't seem very elegant to read the bytes into bits, and read it back into "10-bit" bytes. Is there a better way to do it?
With 8 or 16 bit integers I could just use file.read(size) and convert the result to an int directly. But here, as each value is stored in 1.25 bytes, I would need something like file.read(1.25)...
Here's a generator that does the bit operations without using text string conversions. Hopefully, it's a little more efficient. :)
To test it, I write all the numbers in range(1024) to a BytesIO stream, which behaves like a binary file.
from io import BytesIO
def tenbitread(f):
    ''' Generate 10 bit (unsigned) integers from a binary file '''
    while True:
        b = f.read(5)
        if len(b) == 0:
            break
        n = int.from_bytes(b, 'big')
        # Split n into 4 10 bit integers
        t = []
        for i in range(4):
            t.append(n & 0x3ff)
            n >>= 10
        yield from reversed(t)

# Make some test data: all the integers in range(1024),
# and save it to a byte stream
buff = BytesIO()
maxi = 1024
n = 0
for i in range(maxi):
    n = (n << 10) | i
    # Convert the 40 bit integer to 5 bytes & write them
    if i % 4 == 3:
        buff.write(n.to_bytes(5, 'big'))
        n = 0

# Rewind the stream so we can read from it
buff.seek(0)

# Read the data in 10 bit chunks
a = list(tenbitread(buff))

# Check it
print(a == list(range(maxi)))
output
True
Doing list(tenbitread(buff)) is the simplest way to turn the generator output into a list, but you can easily iterate over the values instead, e.g.
for v in tenbitread(buff):
or
for i, v in enumerate(tenbitread(buff)):
if you want indices as well as the data values.
Here's a little-endian version of the generator which gives the same results as your code.
def tenbitread(f):
    ''' Generate 10 bit (unsigned) integers from a binary file '''
    while True:
        b = f.read(5)
        if not len(b):
            break
        n = int.from_bytes(b, 'little')
        # Split n into 4 10 bit integers
        for i in range(4):
            yield n & 0x3ff
            n >>= 10
We can improve this version slightly by "un-rolling" that for loop, which lets us get rid of the final masking and shifting operations.
def tenbitread(f):
    ''' Generate 10 bit (unsigned) integers from a binary file '''
    while True:
        b = f.read(5)
        if not len(b):
            break
        n = int.from_bytes(b, 'little')
        # Split n into 4 10 bit integers
        yield n & 0x3ff
        n >>= 10
        yield n & 0x3ff
        n >>= 10
        yield n & 0x3ff
        n >>= 10
        yield n
This should give a little more speed...
As there is no direct way to read a file x-bit by x-bit in Python, we have to read it byte by byte. Following MisterMiyagi and PM 2Ring's suggestions I modified my code to read the file by 5 byte chunks (i.e. 40 bits) and then split the resulting string into 4 10-bit numbers, instead of looping over the bits individually. It turned out to be twice as fast as my previous code.
file = open("my_file", "rb")
pixels = []
exit_loop = False
try:
    while not exit_loop:
        # Read 5 consecutive bytes into fiveBytesString
        fiveBytesString = ""
        for i in range(5):
            byte = file.read(1)
            if not byte:
                exit_loop = True
                break
            byteString = format(ord(byte), '08b')
            fiveBytesString += byteString[::-1]
        # Split fiveBytesString into 4 10-bit numbers, and add them to pixels
        pixels.extend([int(fiveBytesString[i:i+10][::-1], 2) for i in range(0, 40, 10) if len(fiveBytesString[i:i+10]) > 0])
finally:
    file.close()
Adding a Numpy based solution suitable for unpacking large 10-bit packed byte buffers like the ones you might receive from AVT and FLIR cameras.
This is a 10-bit version of #cyrilgaudefroy's answer to a similar question; there you can also find a Numba alternative capable of yielding an additional speed increase.
import numpy as np
def read_uint10(byte_buf):
    data = np.frombuffer(byte_buf, dtype=np.uint8)
    # 5 bytes contain 4 10-bit pixels (5x8 == 4x10)
    b1, b2, b3, b4, b5 = np.reshape(data, (data.shape[0]//5, 5)).astype(np.uint16).T
    o1 = (b1 << 2) + (b2 >> 6)
    o2 = ((b2 % 64) << 4) + (b3 >> 4)
    o3 = ((b3 % 16) << 6) + (b4 >> 2)
    o4 = ((b4 % 4) << 8) + b5
    unpacked = np.reshape(np.concatenate((o1[:, None], o2[:, None], o3[:, None], o4[:, None]), axis=1), 4*o1.shape[0])
    return unpacked
Reshape can be omitted if returning a buffer instead of a Numpy array:
unpacked = np.concatenate((o1[:, None], o2[:, None], o3[:, None], o4[:, None]), axis=1).tobytes()
Or if image dimensions are known it can be reshaped directly, e.g.:
unpacked = np.reshape(np.concatenate((o1[:, None], o2[:, None], o3[:, None], o4[:, None]), axis=1), (1024, 1024))
If the use of the modulus operator appears confusing, try playing around with:
np.unpackbits(np.array([255%64], dtype=np.uint8))
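As a quick sanity check of the read_uint10 defined above, here is a tiny hand-packed buffer: the values 0, 1, 2, 3 packed MSB-first into five bytes, which is the bit layout the formulas above assume (Python 3):

packed = bytes([0x00, 0x00, 0x10, 0x08, 0x03])   # 0000000000 0000000001 0000000010 0000000011
print(read_uint10(packed))                        # [0 1 2 3]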
Edit: It turns out that the Allied Vision Mako-U cameras employ a different ordering than the one I originally suggested above:
o1 = ((b2 % 4) << 8) + b1
o2 = ((b3 % 16) << 6) + (b2 >> 2)
o3 = ((b4 % 64) << 4) + (b3 >> 4)
o4 = (b5 << 2) + (b4 >> 6)
So you might have to test different orders if images come out looking wonky initially for your specific setup.

Can't reproduce working C bitwise encoding function in Python

I'm reverse engineering a proprietary network protocol that generates a (static) one-time pad on launch and then uses that to encode/decode each packet it sends/receives. It uses the one-time pad in a series of complex XORs, shifts, and multiplications.
I have produced the following C code after walking through the decoding function in the program with IDA. This function encodes/decodes the data perfectly:
void encodeData(char *buf)
{
    int i;
    size_t bufLen = *(unsigned short *)buf;
    unsigned long entropy = *((unsigned long *)buf + 2);
    int xorKey = 9 * (entropy ^ ((entropy ^ 0x3D0000) >> 16));
    unsigned short baseByteTableIndex = (60205 * (xorKey ^ (xorKey >> 4)) ^ (668265261 * (xorKey ^ (xorKey >> 4)) >> 15)) & 0x7FFF;

    //Skip first 24 bytes, as that is the header
    for (i = 24; i <= (signed int)bufLen; i++)
        buf[i] ^= byteTable[((unsigned short)i + baseByteTableIndex) & 2047];
}
Now I want to try my hand at making a Peach fuzzer for this protocol. Since I'll need a custom Python fixup to do the encoding/decoding prior to doing the fuzzing, I need to port this C code to Python.
I've made the following Python function but haven't had any luck with it decoding the packets it receives.
def encodeData(buf):
    newBuf = bytearray(buf)
    bufLen = unpack('H', buf[:2])
    entropy = unpack('I', buf[2:6])
    xorKey = 9 * (entropy[0] ^ ((entropy[0] ^ 0x3D0000) >> 16))
    baseByteTableIndex = (60205 * (xorKey ^ (xorKey >> 4)) ^ (668265261 * (xorKey ^ (xorKey >> 4)) >> 15)) & 0x7FFF
    #Skip first 24 bytes, since that is header data
    for i in range(24,bufLen[0]):
        newBuf[i] = xorPad[(i + baseByteTableIndex) & 2047]
    return str(newBuf)
I've tried with and without using array() or pack()/unpack() on various variables to force them to be the right size for the bitwise operations, but I must be missing something, because I can't get the Python code to work as the C code does. Does anyone know what I'm missing?
In case it would help you to try this locally, here is the one-time pad generating function:
def buildXorPad():
    global xorPad
    xorKey = array('H', [0xACE1])
    for i in range(0, 2048):
        xorKey[0] = -(xorKey[0] & 1) & 0xB400 ^ (xorKey[0] >> 1)
        xorPad = xorPad + pack('B',xorKey[0] & 0xFF)
And here is the hex-encoded original (encoded) and decoded packet.
Original: 20000108fcf3d71d98590000010000000000000000000000a992e0ee2525a5e5
Decoded: 20000108fcf3d71d98590000010000000000000000000000ae91e1ee25252525
Solution
It turns out that my problem didn't have much to do with the difference between C and Python types, but rather some simple programming mistakes.
def encodeData(buf):
    newBuf = bytearray(buf)
    bufLen = unpack('H', buf[:2])
    entropy = unpack('I', buf[8:12])
    xorKey = 9 * (entropy[0] ^ ((entropy[0] ^ 0x3D0000) >> 16))
    baseByteTableIndex = (60205 * (xorKey ^ (xorKey >> 4)) ^ (668265261 * (xorKey ^ (xorKey >> 4)) >> 15)) & 0x7FFF
    #Skip first 24 bytes, since that is header data
    for i in range(24,bufLen[0]):
        padIndex = (i + baseByteTableIndex) & 2047
        newBuf[i] ^= unpack('B',xorPad[padIndex])[0]
    return str(newBuf)
Thanks everyone for your help!
This line of C:
unsigned long entropy = *((unsigned long *)buf + 2);
should translate to
entropy = unpack('I', buf[8:12])
because buf is cast to an unsigned long first before adding 2 to the address, which adds the size of 2 unsigned longs to it, not 2 bytes (assuming an unsigned long is 4 bytes in size).
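To illustrate that pointer arithmetic in Python terms, here is a small sketch with a made-up 16-byte buffer (explicit little-endian '<I' is used so the printed values are deterministic):

from struct import unpack

buf = b'\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0a\x0b\x0c\x0d\x0e\x0f'   # made-up packet bytes

# *((unsigned long *)buf + 2) advances by 2 * sizeof(unsigned long) = 8 bytes:
right = unpack('<I', buf[8:12])[0]   # reads bytes 08 09 0a 0b
# *(unsigned long *)(buf + 2) would advance by only 2 bytes:
wrong = unpack('<I', buf[2:6])[0]    # reads bytes 02 03 04 05
print(hex(right))                     # 0xb0a0908
print(hex(wrong))                     # 0x5040302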
Also:
newBuf[i] = xorPad[(i + baseByteTableIndex) & 2047]
should be
newBuf[i] ^= xorPad[(i + baseByteTableIndex) & 2047]
to match the C, otherwise the output isn't actually based on the contents of the buffer.
Python integers don't overflow - they are automatically promoted to arbitrary precision when they exceed sys.maxint (or fall below -sys.maxint-1).
>>> sys.maxint
9223372036854775807
>>> sys.maxint + 1
9223372036854775808L
Using array and/or unpack does not seem to make a difference (as you discovered)
>>> array('H', [1])[0] + sys.maxint
9223372036854775808L
>>> unpack('H', '\x01\x00')[0] + sys.maxint
9223372036854775808L
To truncate your numbers, you'll have to simulate overflow by manually ANDing with an appropriate bitmask whenever you're increasing the size of the variable.
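A minimal sketch of that masking (u32 is a made-up helper name), applied to one of the 32-bit quantities from the code above:

MASK32 = 0xffffffff

def u32(x):
    # Keep only the low 32 bits, emulating C unsigned 32-bit wrap-around
    return x & MASK32

# e.g. the 9 * (...) multiplication for xorKey can exceed 32 bits in Python
# (0x7fffffff is just an arbitrary example operand):
print(hex(9 * 0x7fffffff))       # 0x47ffffff7 - Python keeps all the bits
print(hex(u32(9 * 0x7fffffff)))  # 0x7ffffff7  - what a 32-bit C variable would hold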

Trying to get a UInt64 into a base32 string... works in Python, not in VB.NET

Hi, hope I can get an answer here to a stupid question.
Stupid because it's been explained to me & I still can't get it right.
Python Code:
sqln = 13173942528351756918
sqlid = ''
alphabet = '0123456789abcdfghjkmnpqrstuvwxyz'
for i in range(0, 13):
    sqlid = alphabet[(sqln / (32 ** i)) % 32] + sqlid
RESULT: bdntyxtax2smq
vb.net code:
Dim sqln As UInt64 = 13173942528351756918
Dim sqlid as string = ""
Dim alphabet as String = "0123456789abcdfghjkmnpqrstuvwxyz"
For iCount As Integer = 0 To 12
    sqlid = alphabet((sqln / (32 ^ iCount)) Mod 32) + sqlid
Next iCount
RESULT: bfpuzytbx3s00
I suspect the results differ because VB.NET's ^ operator returns a Double, and I should be "shifting" instead, but I can't seem to work out the syntax (or find it).
Thanks for your time =+ ideas =+ help
This time, you're dividing by 32 ^ iCount, which is equivalent to shifting right by 5 * iCount:
Dim sqln As UInt64 = 13173942528351756918UL
Dim sqlid as string = ""
Dim alphabet as String = "0123456789abcdfghjkmnpqrstuvwxyz"
For iCount As Integer = 0 To 12
    sqlid = alphabet((sqln >> (iCount * 5)) Mod 32) + sqlid
Next iCount
Console.WriteLine(sqlid)
Output:
bdntyxtax2smq
It sounds like it would be a good idea to read up on bitwise operations :)
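Here is a small Python check of the same equivalence (Python 3, hence // for integer division):

sqln = 13173942528351756918
alphabet = '0123456789abcdfghjkmnpqrstuvwxyz'

# Integer-dividing by 32**i and shifting right by 5*i select the same 5-bit group:
assert all((sqln // (32 ** i)) % 32 == (sqln >> (5 * i)) & 31 for i in range(13))

sqlid = ''.join(alphabet[(sqln >> (5 * i)) & 31] for i in range(13))[::-1]
print(sqlid)   # bdntyxtax2smq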

Translating a C binary data read function to Python

(I've edited this for clarity, and changed the actual question a bit based on EOL's answer)
I'm trying to translate the following function in C to Python but failing miserably (see C code below). As I understand it, it takes four 1-byte chars starting from the memory location pointed to by from, treats them as unsigned long ints in order to give each one 4 bytes of space, and does some bitshifting to arrange them as a big-endian 32-bit integer. It's then used in an algorithm of checking file validity. (from the Treaty of Babel)
static int32 read_alan_int(unsigned char *from)
{
    return ((unsigned long int) from[3]) | ((unsigned long int)from[2] << 8) |
           ((unsigned long int) from[1] << 16) | ((unsigned long int)from[0] << 24);
}

/*
 The claim algorithm for Alan files is:
 * For Alan 3, check for the magic word
 * load the file length in blocks
 * check that the file length is correct
 * For alan 2, each word between byte address 24 and 81 is a
   word address within the file, so check that they're all within
   the file
 * Locate the checksum and verify that it is correct
*/
static int32 claim_story_file(void *story_file, int32 extent)
{
    unsigned char *sf = (unsigned char *) story_file;
    int32 bf, i, crc=0;
    if (extent < 160) return INVALID_STORY_FILE_RV;
    if (memcmp(sf,"ALAN",4))
    { /* Identify Alan 2.x */
        bf=read_alan_int(sf+4);
        if (bf > extent/4) return INVALID_STORY_FILE_RV;
        for (i=24;i<81;i+=4)
            if (read_alan_int(sf+i) > extent/4) return INVALID_STORY_FILE_RV;
        for (i=160;i<(bf*4);i++)
            crc+=sf[i];
        if (crc!=read_alan_int(sf+152)) return INVALID_STORY_FILE_RV;
        return VALID_STORY_FILE_RV;
    }
    else
    { /* Identify Alan 3 */
        bf=read_alan_int(sf+12);
        if (bf > (extent/4)) return INVALID_STORY_FILE_RV;
        for (i=184;i<(bf*4);i++)
            crc+=sf[i];
        if (crc!=read_alan_int(sf+176)) return INVALID_STORY_FILE_RV;
    }
    return INVALID_STORY_FILE_RV;
}
I'm trying to reimplement this in Python. For implementing the read_alan_int function, I would think that importing struct and doing struct.unpack_from('>L', data, offset) would work. However, on valid files, this always returns 24 for the value bf, which means that the for loop is skipped.
def read_alan_int(file_buffer, i):
    i0 = ord(file_buffer[i]) * (2 ** 24)
    i1 = ord(file_buffer[i + 1]) * (2 ** 16)
    i2 = ord(file_buffer[i + 2]) * (2 ** 8)
    i3 = ord(file_buffer[i + 3])
    return i0 + i1 + i2 + i3

def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    if file_buffer[0:4] == 'ALAN':
        # Identify Alan 2.x
        bf = read_alan_int(file_buffer, 4)
        if bf > len(file_buffer)/4:
            return False
        for i in range(24, 81, 4):
            if read_alan_int(file_buffer, i) > len(file_buffer)/4:
                return False
        for i in range(160, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 152):
            return False
        return True
    else:
        # Identify Alan 3.x
        #bf = read_long(file_buffer, 12, '>')
        bf = read_alan_int(file_buffer, 12)
        print bf
        if bf > len(file_buffer)/4:
            return False
        for i in range(184, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 176):
            return False
        return True
    return False

if __name__ == '__main__':
    import sys, struct
    data = open(sys.argv[1], 'rb').read()
    print is_a(data)
...but the damn thing still returns 24. Unfortunately, my C skills are non-existent so I'm having trouble getting the original program to print some debug output so I can know what bf is supposed to be.
What am I doing wrong?
Ok, so I'm apparently doing read_alan_int correctly. However, what's failing for me is the check that the first 4 characters are "ALAN". All of my test files fail this test. I've changed the code to remove this if/else statement and to instead just take advantage of early returns, and now all of my unit tests pass. So, on a practical level, I'm done. However, I'll keep the question open to address the new problem: how can I possibly wrangle the bits to get "ALAN" out of the first 4 chars?
def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    #if file_buffer.startswith('ALAN'):
    # Identify Alan 2.x
    bf = read_long(file_buffer, 4)
    if bf > len(file_buffer)/4:
        return False
    for i in range(24, 81, 4):
        if read_long(file_buffer, i) > len(file_buffer)/4:
            return False
    for i in range(160, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 152):
        return True
    # Identify Alan 3.x
    crc = 0
    bf = read_long(file_buffer, 12)
    if bf > len(file_buffer)/4:
        return False
    for i in range(184, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 176):
        return True
    return False
Ah, I think I've got it. Note that the description says
/*
 The claim algorithm for Alan files is:
 * For Alan 3, check for the magic word
 * load the file length in blocks
 * check that the file length is correct
 * For alan 2, each word between byte address 24 and 81 is a
   word address within the file, so check that they're all within
   the file
 * Locate the checksum and verify that it is correct
*/
which I read as saying that there's a magic word in Alan 3, but not in Alan 2. However, your code goes the other way, even though the C code only assumes that the ALAN exists for Alan 3 files.
Why? Because you don't speak C, so you guessed -- naturally enough! -- that memcmp would return (the equivalent of a Python) True if the first four characters of sf and "ALAN" are equal... but it doesn't. memcmp returns 0 if the contents are equal, and nonzero if they differ.
And that seems to be the way it works:
>>> import urllib2
>>>
>>> alan2 = urllib2.urlopen("http://ifarchive.plover.net/if-archive/games/competition2001/alan/chasing/chasing.acd").read(4)
>>> alan3 = urllib2.urlopen("http://mirror.ifarchive.org/if-archive/games/competition2006/alan/enterthedark/EnterTheDark.a3c").read(4)
>>>
>>> alan2
'\x02\x08\x01\x00'
>>> alan3
'ALAN'
Hypothesis 1: You are running on Windows, and you haven't opened your file in binary mode.
Your Python version looks fine to me.
PS: I missed the "memcmp() catch" that DSM found, so the Python code for if memcmp(…)… should actually be if file_buffer[0:4] != 'ALAN'.
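A minimal illustration of the memcmp inversion, using the header bytes shown above (looks_like_alan3 is a made-up helper name):

def looks_like_alan3(file_buffer):
    # memcmp(sf, "ALAN", 4) returns 0 when the bytes are equal, so the C branch
    # if (memcmp(sf, "ALAN", 4)) { ... } is the Alan 2.x path; the Python
    # equivalent of "is Alan 3" is a plain equality test:
    return file_buffer[:4] == 'ALAN'

print(looks_like_alan3('\x02\x08\x01\x00...'))   # False -> take the Alan 2.x path
print(looks_like_alan3('ALAN...'))               # True  -> take the Alan 3 path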
As far as I can see from the C code and from the sample file you give in the comments to the original question, the sample file is indeed invalid; here are the values:
read_alan_int(sf+12) == 24 # 0, 0, 0, 24 in file sf, big endian
crc = 0
read_alan_int(sf+176) == 46 # 0, 0, 0, 46 in file sf, big endian
So, crc != read_alan_int(sf+176), indeed.
Are you sure that the sample file is a valid file? Or is part of the calculation of crc missing from the original post??
