Problem:
I made an ELF executable that modifies one of its own bytes. It simply changes a 0 for a 1. When I run the executable normally, I can see that the change was successful because it runs exactly as expected (more on that further down). The problem arises when debugging it: the debugger (radare2) returns the wrong value when I look at the modified byte.
Context:
I made a reverse engineering challenge, inspired by Smallest elf. You can see the "source code" (if you can even call it that) there: https://pastebin.com/Yr1nFX8W.
To assemble and execute:
nasm -f bin -o tinyelf tinyelf.asm
chmod +x tinyelf
./tinyelf [flag]
If the flag is right, it returns 0. Any other value means your answer is wrong.
./tinyelf FLAG{wrong-flag}; echo $?
... outputs "255".
!Solution SPOILERS!
It's possible to reverse it statically. Once that is done, you will find that each character of the flag is computed as follows:
flag[i] = b[i] + b[i+32] + b[i+64] + b[i+96];
...where i is the index of the character, and b is the bytes of the executable itself. Here is a C program that solves the challenge without a debugger:
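#include <stdio.h>

int main(void)
{
    unsigned char buffer[128];
    FILE *fp;
    int i;
    unsigned char c = 0;

    /* open in binary mode: we want the raw bytes of the executable */
    fp = fopen("tinyelf", "rb");
    if (fp == NULL || fread(buffer, 1, sizeof buffer, fp) != sizeof buffer) {
        fprintf(stderr, "could not read 128 bytes from tinyelf\n");
        return 1;
    }
    fclose(fp);

    for (i = 0; i < 32; i++) {
        c = buffer[i];
        /* handle self-modifying code */
        if (i == 10) {
            c = 0;
        }
        /* unsigned char arithmetic wraps modulo 256, like the byte-sized register */
        c += buffer[i + 32] + buffer[i + 64] + buffer[i + 96];
        printf("%c", c);
    }
    printf("\n");
    return 0;
}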
You can see that my solver handles a special case: When i == 10, c = 0. That's because it's the index of the byte that is modified during execution. Running the solver and calling tinyelf with it I get:
FLAG{Wh3n0ptiMizaTioNGOesT00F4r}
./tinyelf FLAG{Wh3n0ptiMizaTioNGOesT00F4r} ; echo $?
Output: 0. Success!
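For comparison with the Python scripts further down, the same static calculation can be sketched in Python. This is only a restatement of the formula above; the function name solve_static is made up for this sketch, and it assumes the executable is at least 128 bytes long.

```python
def solve_static(image: bytes) -> str:
    """Apply flag[i] = b[i] + b[i+32] + b[i+64] + b[i+96] (mod 256)."""
    if len(image) < 128:
        raise ValueError("need at least 128 bytes of the executable")
    chars = []
    for i in range(32):
        # byte 10 is the self-modified byte; it is treated as 0, as in the C solver
        first = 0 if i == 10 else image[i]
        total = first + image[i + 32] + image[i + 64] + image[i + 96]
        chars.append(total & 0xFF)  # keep one byte, like the 8-bit register
    return bytes(chars).decode("latin-1")
```

It would be called as solve_static(open("tinyelf", "rb").read()).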
Great, let's try to solve it dynamically now, using python and radare2:
import r2pipe
r2 = r2pipe.open('./tinyelf')
r2.cmd('doo FLAG{AAAAAAAAAAAAAAAAAAAAAAAAAA}')
r2.cmd('db 0x01002051')
flag = ''
for i in range(0, 32):
    r2.cmd('dc')
    eax = r2.cmd('dr? al')
    c = int(eax, 16)
    flag += chr(c)
print('\n\n' + flag)
It puts a breakpoint on the instruction that compares the input characters with the expected characters, then reads the value the input is compared with (al). This SHOULD work. Yet, here is the output:
FLAG{Wh3n0�tiMiza�ioNGOesT00F4r}
Two incorrect values, one of which is at index 10 (the modified byte). Weird, maybe a bug in radare2? Let's try unicorn (a CPU emulator) next:
from unicorn import *
from unicorn.x86_const import *
from pwn import *
ADDRESS = 0x01002000
mu = Uc(UC_ARCH_X86, UC_MODE_32)
code = bytearray(open('./tinyelf').read())
mu.mem_map(ADDRESS, 20 * 1024 * 1024)
mu.mem_write(ADDRESS, str(code))
mu.reg_write(UC_X86_REG_ESP, ADDRESS + 0x2000)
mu.reg_write(UC_X86_REG_EBP, ADDRESS + 0x2000)
mu.mem_write(ADDRESS + 0x2000, p32(2)) # argc
mu.mem_write(ADDRESS + 0x2000 + 4, p32(ADDRESS + 0x5000)) # argv[0]
mu.mem_write(ADDRESS + 0x2000 + 8, p32(ADDRESS + 0x5000)) # argv[1]
mu.mem_write(ADDRESS + 0x5000, "x" * 32)
flag = ''
def hook_code(uc, address, size, user_data):
    global flag
    eip = uc.reg_read(UC_X86_REG_EIP)
    if eip == 0x01002051:
        c = uc.reg_read(UC_X86_REG_EAX) & 0x7f
        #print(str(c) + " " + chr(c))
        flag += chr(c)
mu.hook_add(UC_HOOK_CODE, hook_code)
try:
    mu.emu_start(0x01002004, ADDRESS + len(code))
except Exception:
    print flag
This time the solver outputs: FLAG{Wh3n0otiMizaTioNGOesT00F4r}
Notice at the index 10: 'o' instead of 'p'. That's an off by 1 mistake exactly where the byte is modified. That can't be a coincidence, right?
Does anyone have an idea why neither of these scripts works? Thank you.
There is no issue with radare2, but your analysis of the program is incorrect, so the code you wrote handles this RE challenge incorrectly.
Let's start with
When i == 10, c = 0. That's because it's the index of the byte that is modified during execution.
That is partially true. It is set to zero at the beginning, but after each round there is this code:
xor al, byte [esi]
or byte [ebx + 0xa], al
So let's understand what's happening here. al is the currently calculated char of the flag, esi points to the flag that was entered as an argument, and at [ebx + 0xa] we currently have 0 (set at the beginning). The xor gives 0 only when the calculated char is equal to the one in the argument, so the byte at index 0xa stays zero only as long as every char checked so far has matched. Since you are running r2 with a fake flag, that starts to be a problem from the 6th char, but you only see the result at the first � at index 10. To mitigate that we need to update your script a little bit.
eax = r2.cmd('dr? al')
c = int(eax, 16)
r2.cmd("ds 2")
r2.cmd("dr al = 0x0")
What we do here is this: after the breakpoint is hit and we have read the calculated flag char, we step two instructions further (to reach 0x01002054) and then set al to 0x0, to emulate that our char at [esi] was actually the same as the calculated one (xor returns 0 in that case). By doing this we keep the value at 0xa at zero.
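To see why a single wrong character is enough to spoil the byte, here is a small Python model of the two instructions quoted above. It is only a sketch: the function name is invented, and the loop around the xor/or pair is an assumption, since only those two instructions are shown here.

```python
def sticky_byte_after(calculated: bytes, supplied: bytes) -> int:
    """Model the byte at [ebx + 0xa] after comparing the given characters."""
    sticky = 0  # the byte is set to zero at the beginning
    for calc, given in zip(calculated, supplied):
        al = calc ^ given  # xor al, byte [esi]
        sticky |= al       # or byte [ebx + 0xa], al
    return sticky
```

With a matching prefix such as b"FLAG{" the result is 0. As soon as one supplied character differs, the differing bits are OR-ed in and never cleared again, which is why the script forces al to 0 before the or executes.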
Now the second character. This RE is tricky ;) - it reads itself, and if you forget about that you might end up with a case like this. Let's try to analyze why this character is off. It is the 18th character of the flag (so its index is 17, as we start from 0), and if we check the formula for the offsets that are read from the binary, we notice that they are: 17(dec) = 11(hex), 17 + 32 = 49(dec) = 31(hex), 17 + 64 = 81(dec) = 51(hex), 17 + 96 = 113(dec) = 71(hex). But doesn't this 51(hex) look oddly familiar? Didn't we see that somewhere before? Yup, it's the offset at which you set your breakpoint to read the al value.
This is the code that breaks your second char:
r2.cmd('db 0x01002051')
Yup - your breakpoint. You are setting a breakpoint at that address, and a software breakpoint writes 0xcc to that memory address, so when the instruction that reads the 3rd byte for the 18th char hits that spot, it does not get 0x5b (the original value); it gets 0xcc. So to fix that we need to correct the calculation. This can probably be done in a smarter or more elegant way, but I went for a simple solution and just did this:
if i == 17:
    c -= (0xcc-0x5b)
Just subtract what was unintentionally added by putting a breakpoint in the code.
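The arithmetic behind that correction can be checked in a few lines of Python. This is a sketch under the assumptions already stated in this thread: the image is loaded at 0x01002000 (the ADDRESS used in the Unicorn script), the original byte at the breakpoint is 0x5b, and the names below are made up for illustration.

```python
BASE = 0x01002000        # load address, as used in the Unicorn script
BREAKPOINT = 0x01002051  # address passed to "db"
INT3 = 0xCC              # byte written by a software breakpoint
ORIGINAL = 0x5B          # original byte at that address, per the analysis above

file_offset = BREAKPOINT - BASE   # 0x51
affected_index = file_offset % 32  # which flag character reads that offset

def undo_breakpoint(observed: int) -> int:
    """Remove the extra (0xcc - 0x5b) that the patched byte added to al."""
    return (observed - (INT3 - ORIGINAL)) & 0xFF
```

Here affected_index comes out as 17, matching the second wrong character, and undo_breakpoint maps the inflated byte back to the intended one.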
The final code:
import r2pipe
r2 = r2pipe.open('./tinyelf')
print r2
r2.cmd("doo FLAG{AAAAAAAAAAAAAAAAAAAAAAAAAA}")
r2.cmd("db 0x01002051")
flag = ''
for i in range(0, 32):
    r2.cmd("dc")
    eax = r2.cmd('dr? al')
    c = int(eax, 16)
    if i == 17:
        c -= (0xcc-0x5b)
    r2.cmd("ds 2")
    r2.cmd("dr al = 0x0")
    flag += chr(c)
print('\n\n' + flag)
That prints the correct flag:
FLAG{Wh3n0ptiMizaTioNGOesT00F4r}
As for Unicorn, you are not setting a breakpoint, so problem 2 goes away, and the off-by-one at index 10 has the same cause as in r2.
I'm a newbie in this field and am trying to learn a bit about how to write cryptographic hash functions.
To get some hands-on, I tried updating the PySHA2 algorithm for Python 3.6 and up (the original version doesn't work on Python 2.5+ and the author says he won't fix this). I don't intend to use this algorithm for any work, just coding this for the sake of knowledge.
I've reached this far:
import copy
import struct
_initial_hashes = [0x6a09e667f3bcc908, 0xbb67ae8584caa73b, 0x3c6ef372fe94f82b, 0xa54ff53a5f1d36f1,
0x510e527fade682d1, 0x9b05688c2b3e6c1f, 0x1f83d9abfb41bd6b, 0x5be0cd19137e2179]
_round_constants = [0x428a2f98d728ae22, 0x7137449123ef65cd, 0xb5c0fbcfec4d3b2f, 0xe9b5dba58189dbbc,
0x3956c25bf348b538, 0x59f111f1b605d019, 0x923f82a4af194f9b, 0xab1c5ed5da6d8118,
0xd807aa98a3030242, 0x12835b0145706fbe, 0x243185be4ee4b28c, 0x550c7dc3d5ffb4e2,
0x72be5d74f27b896f, 0x80deb1fe3b1696b1, 0x9bdc06a725c71235, 0xc19bf174cf692694,
0xe49b69c19ef14ad2, 0xefbe4786384f25e3, 0x0fc19dc68b8cd5b5, 0x240ca1cc77ac9c65,
0x2de92c6f592b0275, 0x4a7484aa6ea6e483, 0x5cb0a9dcbd41fbd4, 0x76f988da831153b5,
0x983e5152ee66dfab, 0xa831c66d2db43210, 0xb00327c898fb213f, 0xbf597fc7beef0ee4,
0xc6e00bf33da88fc2, 0xd5a79147930aa725, 0x06ca6351e003826f, 0x142929670a0e6e70,
0x27b70a8546d22ffc, 0x2e1b21385c26c926, 0x4d2c6dfc5ac42aed, 0x53380d139d95b3df,
0x650a73548baf63de, 0x766a0abb3c77b2a8, 0x81c2c92e47edaee6, 0x92722c851482353b,
0xa2bfe8a14cf10364, 0xa81a664bbc423001, 0xc24b8b70d0f89791, 0xc76c51a30654be30,
0xd192e819d6ef5218, 0xd69906245565a910, 0xf40e35855771202a, 0x106aa07032bbd1b8,
0x19a4c116b8d2d0c8, 0x1e376c085141ab53, 0x2748774cdf8eeb99, 0x34b0bcb5e19b48a8,
0x391c0cb3c5c95a63, 0x4ed8aa4ae3418acb, 0x5b9cca4f7763e373, 0x682e6ff3d6b2b8a3,
0x748f82ee5defb2fc, 0x78a5636f43172f60, 0x84c87814a1f0ab72, 0x8cc702081a6439ec,
0x90befffa23631e28, 0xa4506cebde82bde9, 0xbef9a3f7b2c67915, 0xc67178f2e372532b,
0xca273eceea26619c, 0xd186b8c721c0c207, 0xeada7dd6cde0eb1e, 0xf57d4f7fee6ed178,
0x06f067aa72176fba, 0x0a637dc5a2c898a6, 0x113f9804bef90dae, 0x1b710b35131c471b,
0x28db77f523047d84, 0x32caab7b40c72493, 0x3c9ebe0a15c9bebc, 0x431d67c49c100d4c,
0x4cc5d4becb3e42b6, 0x597f299cfc657e2a, 0x5fcb6fab3ad6faec, 0x6c44198c4a475817]
def _rit_rot(on: int, by: int) -> int:
    """
    helper function for right rotation as it isn't done by a simple bitwise operation (xor is done by '^')
    :param on: value to be rotated
    :param by: value by which to rotate
    :return: right rotated 'on'
    """
    return ((on >> by) | (on << (64 - by))) & 0xFFFFFFFFFFFFFFFF
def hash_main(chunk):
    global _initial_hashes, _round_constants
    # start the hashing process
    # to begin, create a place to store the 80 words that we'll make
    words = [0] * 80
    # first 16 words will be saved without any changes
    words[:16] = struct.unpack('!16Q', chunk)
    # extend these 16 words into the remaining 64 words of 'message schedule array'
    for i in range(16, 80):
        part_1 = _rit_rot(words[i - 15], 1) ^ _rit_rot(words[i - 15], 8) ^ (words[i - 15] >> 7)
        part_2 = _rit_rot(words[i - 2], 19) ^ _rit_rot(words[i - 2], 61) ^ (words[i - 2] >> 6)
        words[i] = (words[i - 16] + part_1 + words[i - 7] + part_2) & 0xFFFFFFFFFFFFFFFF
    # create the working variables
    a, b, c, d, e, f, g, h = _initial_hashes
    # start the compression function
    for z in range(80):
        var_1 = _rit_rot(a, 28) ^ _rit_rot(a, 34) ^ _rit_rot(a, 39)
        var_2 = _rit_rot(e, 14) ^ _rit_rot(e, 18) ^ _rit_rot(e, 41)
        var_3 = (a & b) ^ (a & c) ^ (b & c)
        var_4 = (e & f) ^ ((~e) & g)
        temp_1 = var_1 + var_3
        temp_2 = h + var_2 + var_4 + _round_constants[z] + words[z]
        # remix the hashes
        h = g
        g = f
        f = e
        e = (d + temp_2) & 0xFFFFFFFFFFFFFFFF
        d = c
        c = b
        b = a
        a = (temp_1 + temp_2) & 0xFFFFFFFFFFFFFFFF
    # add this chunk to initial hashes
    _initial_hashes = [(x + y) & 0xFFFFFFFFFFFFFFFF for x, y in zip(_initial_hashes,
                                                                    [a, b, c, d, e, f, g, h])]
def _sha_backend_update(text_copy, _buffer, _counter):
    """
    backend function that hashes given string
    """
    global _initial_hashes, _round_constants
    # create variables for cycling
    _buffer += text_copy
    _counter += len(text_copy)
    # assert the variables are correct
    if not text_copy:
        return
    if type(text_copy) is not str:
        raise TypeError("Invalid Object! Please enter a valid string for hashing!")
    # break the buffer into 128-bit chunks
    while len(_buffer) >= 128:
        chunk = _buffer[:128].encode()[1:]
        hash_main(chunk)
        _buffer = _buffer[128:]
def sha_backend_digest(text_to_hash: str, _buffer: str, _counter: int,
                       _output_size: int, hex_output: bool = False):
    # initialize variables
    variable_x = _counter & 0x7F
    length = str(struct.pack('!Q', _counter << 3))
    # set the thresholds
    if variable_x < 112:
        padding_len = 111 - variable_x
    else:
        padding_len = 239 - variable_x
    # make a copy of the text_to_hash before starting hashing
    text_copy = copy.deepcopy(text_to_hash)
    m = '\x80' + ('\x00' * (padding_len + 8)) + length
    # run the update function
    _sha_backend_update(text_copy, _buffer, _counter)
    # return the hash value
    return_val = [hex(stuff) for stuff in _initial_hashes[:_output_size]]
    if hex_output is True:
        return_val = [int(stuff, base=16) for stuff in return_val]
        return return_val
    return ''.join(return_val)
def sha_512(text_to_hash: str, hex_digest: bool = False) -> str:
    """
    frontend function for SHA512 hashing
    :return: hashed string
    """
    # before anything, check if the input is correct
    if not text_to_hash:
        return ""
    if type(text_to_hash) is not str:
        raise TypeError("Invalid content! Please provide content in correct format for hashing!")
    # initialize default variables
    _buffer = ''
    _counter = 0
    _output_size = 8
    # start the backend function
    return sha_backend_digest(text_to_hash, _buffer, _counter, _output_size, hex_output=hex_digest)
message = "This is a string to be hashed"
from hashlib import sha512
print("hashlib gives: ", sha512(message.encode()).hexdigest())
print("I give: ", sha_512(message))
As is obvious, I don't understand a lot of things in this algorithm and have literally copied many parts from the original code (also, I know it isn't good practice to write everything in a single function but I find it easier when trying to understand something).
But the biggest problem I have right now is it doesn't work! Whatever input message I provide to my function, it gives the same output:
0x6a09e667f3bcc9080xbb67ae8584caa73b0x3c6ef372fe94f82b0xa54ff53a5f1d36f1
0x510e527fade682d10x9b05688c2b3e6c1f0x1f83d9abfb41bd6b0x5be0cd19137e2179
I wrote a code at the bottom to compare it with python's hashlib module.
Where am I going wrong in this and how do I fix this?
EDIT: As mentioned in the comments, I tried to feed in a longer message string and the code seems to be working (it still gives longer output than hashlib though):
message = "This is a string to be hashed. I'll try to make this string as long as possible by adding" \
"as much information to it as I can, in the hopes that this string would somehow become longer than" \
"128 bits and my code can run properly. Hopefully, this is already longer than 128 bits, so lets see" \
"how it works..."
hash: 0x6fcc0f346f2577800x334bd9b6c1178a970x90964a3f45f7b5bb0xc14033d12f6607e60xb598bea0a8b0ac1e0x116b0e134691ab540x73d88e77e5b862ba0x89181da7462c5574
message = "This is a string to be hashed. I'll try to make this string as long as possible by adding" \
"as much information to it as I can, in the hopes that this string would somehow become longer than"
hash: 0x166e40ab03bc98750xe81fe34168b6994f0xe56b81bd5972b5560x8789265c3a56b30b0x2c810d652ea7b1550xa23ca2704602a8240x12ffb1ec8f3dd6d10x88c29f84cbef8988
You always have to pad the message. Padding and appending the length are always required as the last step of the SHA-2 process. Currently you aren't performing that last step (to completion).
Here are my last two comments that pointed you in the right direction:
So generally you try and take one 128 byte block from the binary message, update the hash state using the information in that block, then move to the next one until you have a partial or 0 byte block. That block you need to pad & add size indication (in bits) and process. If you've not enough space for the padding / size indication then you need yet another block consisting entirely of padding and the size indication. If you read carefully, then you always process at least one block.
and
Hmm, it is already in sha_backend_digest (the 0x80 followed by zero bytes and the length, which is the input size * 8 (_counter << 3)).
But of course you do need to perform that and not skip any step.
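As a rough illustration of that last step, here is a minimal sketch of SHA-512 style padding on bytes. The helper names sha512_pad and split_blocks are invented for this sketch and are not part of the code in the question; it assumes the whole message is available at once, and it writes the length as one 16-byte big-endian field, which is equivalent to the 8 zero bytes plus 8-byte length built in sha_backend_digest.

```python
BLOCK_SIZE = 128  # SHA-512 processes 128-byte (1024-bit) blocks

def sha512_pad(message: bytes) -> bytes:
    """Append 0x80, zero bytes, and the 128-bit message length in bits."""
    # 1 byte for 0x80 plus 16 bytes for the length must fit in the last block
    zero_count = (-(len(message) + 17)) % BLOCK_SIZE
    length_bits = (len(message) * 8).to_bytes(16, "big")
    return message + b"\x80" + b"\x00" * zero_count + length_bits

def split_blocks(padded: bytes) -> list:
    """Cut the padded message into the blocks the compression function sees."""
    return [padded[i:i + BLOCK_SIZE] for i in range(0, len(padded), BLOCK_SIZE)]
```

Even an empty message produces one full block, and a 112-byte message needs a second block because the 0x80 byte and the length no longer fit in the first one. Each block returned by split_blocks would then be fed to the compression function, so short inputs get hashed too.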
I'm trying to solve a puzzle, which is to reverse engineer this code to get a list of possible passwords; among those there should be one that 'stands out' and should work.
function checkPass(password) {
var total = 0;
var charlist = "abcdefghijklmnopqrstuvwxyz";
for (var i = 0; i < password.length; i++) {
var countone = password.charAt(i);
var counttwo = (charlist.indexOf(countone));
counttwo++;
total *= 17;
total += counttwo;
}
if (total == 248410397744610) {
//success//
} else {...
I wrote something in Python that I think should work: I reversed the order in which it adds and multiplies, and it tries every character to see whether what is left divides evenly by 17.
from random import shuffle
def shuffle_string(word):
word = list(word)
shuffle(word)
return ''.join(word)
charlist = "abcdefghijklmnopqrstuvwxyz"
total = 248410397744610
temp = total
password = ''
runtime = 20
def runlist():
global charlist
global total
global temp
global password
global runtime
for i in range(25):
if (temp - (i+1)) % 17 == 0:
password += charlist[i]
temp -= i+1
temp = temp /17
if temp == 0:
print password
password = ''
temp = total
runtime += 1
charlist = shuffle_string(charlist)
if runtime < 21:
runlist()
else:
runlist()
But when I try to run it I only get
deisx
Process finished with exit code 0
I'm wondering why my function isn't recursing properly, because from what I can see it looks like it should. Try running it yourself and see what happens.
There should be multiple solutions for this puzzle (I think?), and I was planning on making it repeat until it gets all solutions, but I'm a little lost on why it just runs through every letter once, then dies.
EDIT:
Revised code to actually recurse, but now I get no output, and still finish with exit code 0.
EDIT 2:
Revised code again to fix a mistake
Afraid I don't know much about python, but I can probably help you with the algorithm.
The encoding process repeats the following:
multiply the current total by 17
add a value (a = 1, b = 2, ..., z = 26) for the next letter to the total
So at any point, the total is a multiple of 17 plus the value of the final letter. So each step in the recursive solver must remove the final letter then divide the result by 17.
However, because the multiplier is 17 and there are 26 characters in the alphabet, some of the remainder values may be produced by more than one letter - this is where many passwords may give rise to the same solution.
For example:
encoding "a" gives a total of 1, and 1 % 17 = 1
encoding "r" gives a total of 18, and 18 % 17 = 1
So if the current remainder is 1, then the encoded letter may be either "a" or "r". I think in your solution you only ever look for the first of these cases, and miss the second.
In pseudo code, my function to solve this would look something like:
function decodePassword(total, password)
    if total == 0
        print password
        return
    end if
    var rem = total / 17 # integer division - truncate
    var char1 = total % 17 # potential first character
    var char2 = char1 + 17 # potential second character
    # char1 values 1-16 map to letters a-p
    if char1 >= 1
        decodePassword(rem, charlist.charAt(char1 - 1) + password)
    end if
    # char2 values 17-26 map to letters q-z
    if rem > 0 && char2 <= 26
        decodePassword(rem - 1, charlist.charAt(char2 - 1) + password)
    end if
end function
If it helps, the answer you are looking for is 12 chars long, and probably not printable in this forum!
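Since the question is in Python, the pseudocode above translates fairly directly. The following is a sketch of that translation, not tested against the original puzzle page; the names encode and decode are made up, and it only considers the lowercase letters in charlist.

```python
CHARLIST = "abcdefghijklmnopqrstuvwxyz"

def encode(password: str) -> int:
    """Forward direction, mirroring checkPass for lowercase input."""
    total = 0
    for ch in password:
        total = total * 17 + CHARLIST.index(ch) + 1
    return total

def decode(total: int, suffix: str = ""):
    """Yield every lowercase password that encodes to total."""
    if total == 0:
        yield suffix
        return
    rem, low = divmod(total, 17)
    # values 1-16 map to letters a-p
    if low >= 1:
        yield from decode(rem, CHARLIST[low - 1] + suffix)
    # values 17-26 map to letters q-z and borrow one from the quotient
    high = low + 17
    if rem > 0 and high <= 26:
        yield from decode(rem - 1, CHARLIST[high - 1] + suffix)
```

Calling list(decode(248410397744610)) collects the candidates for the puzzle, and encode can be used to double-check each one.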
HTH
Your code is neither repeating nor recursing because:
The runlist function is only called once
The runlist function does not fit the pattern for recursion, which is:
Check for the end-of-processing condition and, if it is met, return the final result
Otherwise return the results so far plus the result of calling yourself
I have a string that is packed such that each character was originally an unsigned byte but is stored as 7 bits and then packed into an unsigned byte array. I'm trying to find a quick way to unpack this string in Python. The function I wrote, which uses the bitstring module, works well but is very slow. It seems like something like this should not be so slow, but I'm probably doing it very inefficiently...
This seems like something that is probably trivial but I just don't know what to use, maybe there is already a function that will unpack the string?
from bitstring import BitArray
def unpackString(raw):
    msg = ''
    bits = BitArray(bytes=raw)
    mask = BitArray('0b01111111')
    i = 0
    while 1:
        try:
            iByte = (bits[i:i + 8] & mask).int
            # value of 0 denotes a line break
            if iByte == 0:
                msg += '\n'
            elif iByte >= 32 and iByte <= 126:
                msg += chr(iByte)
            i += 7
        except:
            break
    return msg
This took me a while to figure out, as your solution seems to ignore the first bit of data. Given the input byte of 129 (0b10000001) I would expect to see 64 '1000000' printed by the following, but your code produces 1 '0000001' -- ignoring the first bit.
bs = b'\x81' # one byte string, whose value is 129 (0x81)
arr = BitArray(bs)
mask = BitArray('0b01111111')
byte = (arr[0:8] & mask).int
print(byte, repr("{:07b}".format(byte)))
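The same point can be made with plain integers, without bitstring. This small sketch (variable names are illustrative only) contrasts masking an 8-bit window with taking the first seven bits of the byte:

```python
byte = 0x81  # 0b10000001

# masking the 8-bit window keeps the LAST seven bits and drops the first one
low_seven = byte & 0b01111111

# taking the FIRST seven bits means shifting the last bit away instead
high_seven = byte >> 1
```

Here low_seven is 1 and high_seven is 64, which is the difference described above.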
Simplest solution would be to modify your solution to use bitstring.ConstBitStream -- I got an order of magnitude speed increase with the following.
from bitstring import ConstBitStream
def unpack_bitstream(raw):
    num_bytes, remainder = divmod(len(raw) * 8 - 1, 7)
    bitstream = ConstBitStream(bytes=raw, offset=1) # use offset to ignore leading bit
    msg = b''
    for _ in range(num_bytes):
        byte = bitstream.read("uint:7")
        if not byte:
            msg += b'\n'
        elif 32 <= byte <= 126:
            msg += bytes((byte,))
            # msg += chr(byte) # python 2
    return msg
However, this can be done quite easily using only the standard library. This makes the solution more portable and, in the instances I tried, faster by another order of magnitude (I didn't try the cythonised version of bitstring).
def unpack_bytes(raw, zero_replacement=ord("\n")):
    # use - 1 to ignore leading bit
    num_bytes, remainder = divmod(len(raw) * 8 - 1, 7)
    i = int.from_bytes(raw, byteorder="big")
    # i = int(raw.encode("hex"), 16) # python 2
    if remainder:
        # remainder means there are unused trailing bits, so remove these
        i >>= remainder
    msg = []
    for _ in range(num_bytes):
        byte = i & 127
        if not byte:
            msg.append(zero_replacement)
        elif 32 <= byte <= 126:
            msg.append(byte)
        i >>= 7
    msg.reverse()
    return bytes(msg)
    # return b"".join(chr(c) for c in msg) # python 2
I've used python 3 to create these methods. If you're using python 2 then there are a number of adjustments you'll need to make. I've added these as comments after the line they are intended to replace and marked them python 2.
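A quick way to sanity-check this kind of unpacker is a round trip. The sketch below assumes the layout used in this answer (one ignored leading bit, then 7 bits per character, zero-padded to a whole byte); pack_7bit is a hypothetical helper written only for the test, and unpack_7bit is a condensed variant of the integer approach without the zero replacement and printable-range filter.

```python
def pack_7bit(text: bytes) -> bytes:
    """Hypothetical packer: one ignored leading bit, then 7 bits per character."""
    bits = "0" + "".join(format(ch & 0x7F, "07b") for ch in text)
    bits += "0" * (-len(bits) % 8)  # pad with zero bits to a whole byte
    return int(bits, 2).to_bytes(len(bits) // 8, "big")

def unpack_7bit(raw: bytes) -> bytes:
    """Condensed integer-based unpacker, no filtering of the values."""
    count, spare = divmod(len(raw) * 8 - 1, 7)
    value = int.from_bytes(raw, "big") >> spare  # drop unused trailing bits
    out = bytearray()
    for _ in range(count):
        out.append(value & 0x7F)  # take the lowest 7 bits
        value >>= 7
    out.reverse()  # characters were collected last-to-first
    return bytes(out)
```

One caveat of this layout: when the padding happens to be exactly 7 bits (for example with an 8-character input), the padding is indistinguishable from one more zero character, so an extra zero value is decoded at the end.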
I am new to Python. In Perl, to set specific bits to a scalar variable(integer), I can use vec() as below.
#!/usr/bin/perl -w
$vec = '';
vec($vec, 3, 4) = 1; # bits 0 to 3
vec($vec, 7, 4) = 10; # bits 4 to 7
vec($vec, 11, 4) = 3; # bits 8 to 11
vec($vec, 15, 4) = 15; # bits 12 to 15
print("vec() Has a created a string of nybbles,
in hex: ", unpack("h*", $vec), "\n");
Output:
vec() Has a created a string of nybbles,
in hex: 0001000a0003000f
I was wondering how to achieve the same in Python, without having to write bit manipulation code and using struct.pack manually?
Not sure how the vec function works in Perl (I haven't worked with it). However, according to the output you have mentioned, the following code in Python works fine. I do not see the significance of the second argument, so this version is called as vec(value, size). Every time you call it, the output string is appended to the global final_vec variable.
final_vec = ''
def vec(value, size):
    global final_vec
    prefix = ''
    str_hex = str(hex(value)).replace('0x','')
    str_hex_size = len(str_hex)
    for i in range (0, size - str_hex_size):
        prefix = prefix + '0'
    str_hex = prefix + str_hex
    final_vec = final_vec + str_hex
    return 0
vec(1, 4)
vec(10, 4)
vec(3, 4)
vec(15, 4)
print(final_vec)
If you really want to create a hex string from nibbles, you could solve it this way
nibbles = [1,10,3,15]
hex_string = '0x' + "".join(["%04x" % x for x in nibbles])  # avoid shadowing the built-in hex()
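For what it's worth, the second argument of vec in the Perl example looks like an element offset, which is why the values land in every fourth nibble. Below is a sketch that emulates this for 4-bit elements on a bytearray. It assumes, as I understand the perlfunc description, that element 0 is the low nibble of byte 0 and that unpack("h*") prints the low nibble of each byte first; set_nibble and hex_low_nibble_first are made-up names.

```python
def set_nibble(buf: bytearray, index: int, value: int) -> None:
    """Store a 4-bit value at nibble position index, growing buf as needed."""
    byte_index, is_high = divmod(index, 2)
    if byte_index >= len(buf):
        buf.extend(bytes(byte_index + 1 - len(buf)))  # zero-fill the gap
    if is_high:
        buf[byte_index] = (buf[byte_index] & 0x0F) | ((value & 0x0F) << 4)
    else:
        buf[byte_index] = (buf[byte_index] & 0xF0) | (value & 0x0F)

def hex_low_nibble_first(buf: bytearray) -> str:
    """Hex dump with the low nibble of each byte printed first."""
    return "".join(f"{b & 0x0F:x}{b >> 4:x}" for b in buf)

vec_buf = bytearray()
for offset, value in [(3, 1), (7, 10), (11, 3), (15, 15)]:
    set_nibble(vec_buf, offset, value)
vec_hex = hex_low_nibble_first(vec_buf)
```

Under those assumptions vec_hex matches the hex string printed by the Perl script in the question.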
(I've edited this for clarity, and changed the actual question a bit based on EOL's answer)
I'm trying to translate the following function in C to Python but failing miserably (see C code below). As I understand it, it takes four 1-byte chars starting from the memory location pointed to by from, treats them as unsigned long ints in order to give each one 4 bytes of space, and does some bitshifting to arrange them as a big-endian 32-bit integer. It's then used in an algorithm of checking file validity. (from the Treaty of Babel)
static int32 read_alan_int(unsigned char *from)
{
    return ((unsigned long int) from[3])| ((unsigned long int)from[2] << 8) |
           ((unsigned long int) from[1]<<16)| ((unsigned long int)from[0] << 24);
}
/*
The claim algorithm for Alan files is:
* For Alan 3, check for the magic word
* load the file length in blocks
* check that the file length is correct
* For alan 2, each word between byte address 24 and 81 is a
word address within the file, so check that they're all within
the file
* Locate the checksum and verify that it is correct
*/
static int32 claim_story_file(void *story_file, int32 extent)
{
    unsigned char *sf = (unsigned char *) story_file;
    int32 bf, i, crc=0;
    if (extent < 160) return INVALID_STORY_FILE_RV;
    if (memcmp(sf,"ALAN",4))
    { /* Identify Alan 2.x */
        bf=read_alan_int(sf+4);
        if (bf > extent/4) return INVALID_STORY_FILE_RV;
        for (i=24;i<81;i+=4)
            if (read_alan_int(sf+i) > extent/4) return INVALID_STORY_FILE_RV;
        for (i=160;i<(bf*4);i++)
            crc+=sf[i];
        if (crc!=read_alan_int(sf+152)) return INVALID_STORY_FILE_RV;
        return VALID_STORY_FILE_RV;
    }
    else
    { /* Identify Alan 3 */
        bf=read_alan_int(sf+12);
        if (bf > (extent/4)) return INVALID_STORY_FILE_RV;
        for (i=184;i<(bf*4);i++)
            crc+=sf[i];
        if (crc!=read_alan_int(sf+176)) return INVALID_STORY_FILE_RV;
    }
    return INVALID_STORY_FILE_RV;
}
I'm trying to reimplement this in Python. For implementing the read_alan_int function, I would think that importing struct and doing struct.unpack_from('>L', data, offset) would work. However, on valid files, this always returns 24 for the value bf, which means that the for loop is skipped.
def read_alan_int(file_buffer, i):
    i0 = ord(file_buffer[i]) * (2 ** 24)
    i1 = ord(file_buffer[i + 1]) * (2 ** 16)
    i2 = ord(file_buffer[i + 2]) * (2 ** 8)
    i3 = ord(file_buffer[i + 3])
    return i0 + i1 + i2 + i3
def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    if file_buffer[0:4] == 'ALAN':
        # Identify Alan 2.x
        bf = read_alan_int(file_buffer, 4)
        if bf > len(file_buffer)/4:
            return False
        for i in range(24, 81, 4):
            if read_alan_int(file_buffer, i) > len(file_buffer)/4:
                return False
        for i in range(160, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 152):
            return False
        return True
    else:
        # Identify Alan 3.x
        #bf = read_long(file_buffer, 12, '>')
        bf = read_alan_int(file_buffer, 12)
        print bf
        if bf > len(file_buffer)/4:
            return False
        for i in range(184, bf * 4):
            crc += ord(file_buffer[i])
        if crc != read_alan_int(file_buffer, 176):
            return False
        return True
    return False
if __name__ == '__main__':
    import sys, struct
    data = open(sys.argv[1], 'rb').read()
    print is_a(data)
...but the damn thing still returns 24. Unfortunately, my C skills are non-existent so I'm having trouble getting the original program to print some debug output so I can know what bf is supposed to be.
What am I doing wrong?
Ok, so I'm apparently doing read_alan_int correctly. However, what's failing for me is the check that the first 4 characters are "ALAN". All of my test files fail this test. I've changed the code to remove this if/else statement and to instead just take advantage of early returns, and now all of my unit tests pass. So, on a practical level, I'm done. However, I'll keep the question open to address the new problem: how can I possibly wrangle the bits to get "ALAN" out of the first 4 chars?
def is_a(file_buffer):
    crc = 0
    if len(file_buffer) < 160:
        return False
    #if file_buffer.startswith('ALAN'):
    # Identify Alan 2.x
    bf = read_long(file_buffer, 4)
    if bf > len(file_buffer)/4:
        return False
    for i in range(24, 81, 4):
        if read_long(file_buffer, i) > len(file_buffer)/4:
            return False
    for i in range(160, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 152):
        return True
    # Identify Alan 3.x
    crc = 0
    bf = read_long(file_buffer, 12)
    if bf > len(file_buffer)/4:
        return False
    for i in range(184, bf * 4):
        crc += ord(file_buffer[i])
    if crc == read_long(file_buffer, 176):
        return True
    return False
Ah, I think I've got it. Note that the description says
/*
The claim algorithm for Alan files is:
* For Alan 3, check for the magic word
* load the file length in blocks
* check that the file length is correct
* For alan 2, each word between byte address 24 and 81 is a
word address within the file, so check that they're all within
the file
* Locate the checksum and verify that it is correct
*/
which I read as saying that there's a magic word in Alan 3, but not in Alan 2. However, your code goes the other way, even though the C code only expects ALAN to be present in Alan 3 files.
Why? Because you don't speak C, so you guessed -- naturally enough! -- that memcmp would return (the equivalent of a Python) True if the first four characters of sf and "ALAN" are equal... but it doesn't. memcmp returns 0 if the contents are equal, and nonzero if they differ.
And that seems to be the way it works:
>>> import urllib2
>>>
>>> alan2 = urllib2.urlopen("http://ifarchive.plover.net/if-archive/games/competition2001/alan/chasing/chasing.acd").read(4)
>>> alan3 = urllib2.urlopen("http://mirror.ifarchive.org/if-archive/games/competition2006/alan/enterthedark/EnterTheDark.a3c").read(4)
>>>
>>> alan2
'\x02\x08\x01\x00'
>>> alan3
'ALAN'
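Put into Python, the two pieces discussed here could look like the sketch below. It is written for Python 3 bytes objects (the code in the question is Python 2), the helper name has_alan3_magic is invented, and it only covers the integer read and the magic-word test, not the full claim algorithm.

```python
import struct

def read_alan_int(buf: bytes, offset: int) -> int:
    """Read a big-endian unsigned 32-bit integer, like the C helper."""
    return int.from_bytes(buf[offset:offset + 4], "big")

def has_alan3_magic(buf: bytes) -> bool:
    # C: `if (memcmp(sf, "ALAN", 4))` is taken when the bytes DIFFER,
    # because memcmp returns 0 for equal contents. So the C "if" branch
    # (Alan 2.x) corresponds to `not has_alan3_magic(buf)` here.
    return buf[:4] == b"ALAN"

sample = b"\x00\x00\x00\x18"
same_as_struct = read_alan_int(sample, 0) == struct.unpack_from(">L", sample, 0)[0]
```

With the two header samples shown above, has_alan3_magic returns False for the .acd file and True for the .a3c file.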
Hypothesis 1: You are running on Windows, and you haven't opened your file in binary mode.
Your Python version looks fine to me.
PS: I missed the "memcmp() catch" that DSM found, so the Python code for if memcmp(…)… should actually be if file_buffer[0:4] != 'ALAN'.
As far as I can see from the C code and from the sample file you give in the comments to the original question, the sample file is indeed invalid; here are the values:
read_alan_int(sf+12) == 24 # 0, 0, 0, 24 in file sf, big endian
crc = 0
read_alan_int(sf+176) == 46 # 0, 0, 0, 46 in file sf, big endian
So, crc != read_alan_int(sf+176), indeed.
Are you sure that the sample file is a valid file? Or is part of the calculation of crc missing from the original post?