Should I be comparing bytes using struct? - python

I'm trying to compare the data within two files, and retrieve a list of offsets of where the differences are.
I tried it on some text files and it worked quite well.
However, it doesn't work on non-text files that still contain ASCII text; I call them binary data files (executables and so on).
It seems to think some bytes are the same, even though when I look at them in a hex editor they are obviously not. I tried printing out the binary data that it thinks is the same, and I get blank lines where it should be printed.
Thus, I think this is the source of the problem.
So what is the best way to compare bytes of data that could be both binary and contain ASCII text? I thought using the struct module might be a starting point...
As you can see below, I compare the bytes with the == operator
Here's the code:
import os
import math

#file1 = 'file1.txt'
#file2 = 'file2.txt'
file1 = 'file1.exe'
file2 = 'file2.exe'

file1size = os.path.getsize(file1)
file2size = os.path.getsize(file2)
a = file1size - file2size
end = file1size #if they are both same size
if a > 0:
    #file 2 is smallest
    end = file2size
    big = file1size
elif a < 0:
    #file 1 is smallest
    end = file1size
    big = file2size

f1 = open(file1, 'rb')
f2 = open(file2, 'rb')
readSize = 500
r = readSize
off = 0
data = []
looking = False
d = open('data.txt', 'w')

while off < end:
    f1.seek(off)
    f2.seek(off)
    b1, b2 = f1.read(r), f2.read(r)
    same = b1 == b2
    print ''
    if same:
        print 'Same at: '+str(off)
        print 'readSize: '+str(r)
        print b1
        print b2
        print ''
        #save offsets of the section of "different" bytes
        #data.append([diffOff, diffOff+off-1]) #[begin diff off, end diff off]
        if looking:
            d.write(str(diffOff)+" => "+str(diffOff+off-2)+"\n")
            looking = False
            r = readSize
            off = off + 1
        else:
            off = off + r
    else:
        if r == 1:
            looking = True
            diffOff = off
            off = off + 1 #continue reading 1 at a time, until u find a same reading
        r = 1 #it will shoot back to the last off, since we didn't increment it here

d.close()
f1.close()
f2.close()

#add the diff ending portion to diff data offs, if 1 file is longer than the other
a = int(math.fabs(a)) #get abs val of diff
if a:
    data.append([big-a, big-1])
print data

Did you try the difflib and filecmp modules?

This module provides classes and functions for comparing sequences. It can be used, for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also the filecmp module.

The filecmp module defines functions to compare files and directories, with various optional time/correctness trade-offs. For comparing files, see also the difflib module.
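For example, a minimal sketch (the file names are placeholders, and note that SequenceMatcher can be slow on large binaries):

import filecmp
import difflib

# Quick yes/no: are the two files identical?
# shallow=False compares contents, not just os.stat() signatures.
print filecmp.cmp('file1.exe', 'file2.exe', shallow=False)

# Offsets of the differing ranges, via difflib's opcodes.
b1 = open('file1.exe', 'rb').read()
b2 = open('file2.exe', 'rb').read()
for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(None, b1, b2).get_opcodes():
    if tag != 'equal':
        print '%s: file1[%d:%d] vs file2[%d:%d]' % (tag, i1, i2, j1, j2)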

You are probably encountering encoding/decoding problems. Someone may suggest a better solution, but you could try reading the file into a bytearray so you're reading raw bytes instead of decoded characters:
Here's a crude example:
$ od -Ax -tx1 /tmp/aa
000000 e0 b2 aa 0a
$ od -Ax -tx1 /tmp/bb
000000 e0 b2 bb 0a
$ cat /tmp/diff.py
a = bytearray(open('/tmp/aa', 'rb').read())
b = bytearray(open('/tmp/bb', 'rb').read())
print "%02x, %02x" % (a[2], a[3])
print "%02x, %02x" % (b[2], b[3])
$ python /tmp/diff.py
aa, 0a
bb, 0a
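Building on that, a minimal sketch of how you might collect the list of differing offsets the question asks for (assuming both files fit in memory; file names reused from the example above):

a = bytearray(open('/tmp/aa', 'rb').read())
b = bytearray(open('/tmp/bb', 'rb').read())

# Offsets where the overlapping regions differ, compared byte by byte.
overlap = min(len(a), len(b))
diffs = [i for i in range(overlap) if a[i] != b[i]]
# Any trailing bytes of the longer file differ by definition.
diffs.extend(range(overlap, max(len(a), len(b))))
print diffs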


Load and modify YAML without breaking the indents?

Hi, I want to update the integer value of x to 8, but I could not find a good explanation of how to do that without affecting the indentation inside sub in the YAML file. I want to read the YAML file, replace the string so it reads x = 8, and save the file as it is.
I am using Python for the modification; here is the sample file:
parent:
  -
    subchild: something
    subchild2: something
  - sub:
      y = 4;
      x = 6 # I wanted to replace this integer to 8
      z = 10
Note: x = 6 will be in multiple files, so I want to open the files one by one, make the modification (x = 8), and save them one by one.
Problem
I was able to do the replacement, but the problem I am facing is that the result becomes this:
parent:
  -
    subchild: something
    subchild2: something
  - sub: y = 4; x = 8; z = 10;
And what I want is the same indentation inside sub as in the original YAML file. I hope you get the point here.
[EDIT] Additional information:
This is the way I am reading the file and saving the file.
import yaml

f = open('test.yaml', 'r')
newf = f.read().replace('6', '8')
overrides = yaml.load(newf)
f.close()
with open('updated_test.yaml', 'w') as ff:
    yaml.dump(overrides, ff)
Just don't use the yaml module for this task at all, as you aren't doing anything that it is needed for! ;)
f = open("test.yaml", 'r')
newf = f.read().replace('6', '8')
f.close()
with open('updated_test.yaml', 'w') as ff:
    ff.write(newf)
Your input is invalid YAML, as you cannot start an indented sequence after the second something.
Your output is invalid as well.
If you want something like:
y = 4;
x = 6 # I wanted to replace this integer to 8
z = 10
in your YAML document as a multiline string, and want to preserve the multiple lines, you
need to use ruamel.yaml and specify the string as a literal block scalar (using |).
And since x = 6 is part of the string, you need to do a string replacement to
change the value. I would use a regex for that:
import sys
import re
import ruamel.yaml

yaml_str = """\
- sub: |
    y = 4;
    x = 6 # I wanted to replace this integer to 8
    z = 10
"""

yaml = ruamel.yaml.YAML()
# yaml.indent(mapping=4, sequence=4, offset=2)
# yaml.preserve_quotes = True
data = yaml.load(yaml_str)
target = data[0]['sub']
new_value = 8
data[0]['sub'] = type(target)(re.sub('x = [0-9]*', f'x = {new_value}', target))
yaml.dump(data, sys.stdout)
which gives:
- sub: |
    y = 4;
    x = 8 # I wanted to replace this integer to 8
    z = 10

Reading variable length binary values from a file in python

I have three text values that I am encrypting and then writing to a file. Later I want to read the values back (in another script) and decrypt them.
I've successfully encrypted the values:
cenc = rsa.encrypt(client_name.encode('utf8'), publicKey)
denc = rsa.encrypt(expiry_date.encode('utf8'), publicKey)
fenc = rsa.encrypt(features.encode('utf8'), publicKey)
and written to a binary file:
licensefh = open("license.sfb", "wb")
licensefh.write(cenc)
licensefh.write(denc)
licensefh.write(fenc)
licensefh.close()
The three values cenc, denc and fenc are all of different lengths so when I read the file back:
licensefh = open("license.sfb", "rb")
encMessage = licensefh.read()
encMessage contains the entire file and I don't know how to get the three values back again.
I've tried using a separator between the values:
SEP = bytes(chr(0x02).encode('utf8'))
...
licensefh.write(cenc)
licensefh.write(SEP)
...
and then using encMessage.partition(SEP) or encMessage.split(SEP) but the data invariably contains the SEP value in it somewhere (I've tried a few different characters) so that didn't work.
I tried getting the length of the bytes objects cenc, denc and fenc, but this returned 256 for each value even though the contents of the variables are all different lengths.
My question is this. How do I write these three variable length values to a binary file and then separate them when I read them back again?
Here's an example of the 3 binary values:
b'tX\x10Fo\x89\x10~\x83Pok\xd1\xfb\xbe\x0e<a\xe5\x11md:\xe6\x84#\xfa\xf8\xe5\xeb\xf8\xdc{\xc0Z\xa0\xc0^\xc1\xd9\x820\xec\xec\xb0R\x99/\xa2l\x88\xa9\xa6g\xa3\x01m\xf9\x7f\x91\xb9\xe1\x80\xccs|\xb7_\xa9Fp\x11yvG\xdc\x02d\x8aK2\x92t\x0e\x1f\xca\x19\xbb&\xaf{\xc0y>\t|\x86\xab\x16.\xa5kZ"\xab6\xaaV\xf4w\x7f\xc5q\x07\xef\xa9\xa5\xa3\xf3 6\xdb\x03\x19S\xbd\x81\xf9\xc8\xc5\x90\x1e\x19\x86\xa4q\xe3?i\xc4\xac\t\xd5=3C\x9b#\xc3IuAN,\xeat\xc6\x96VFL\x1eFWZ\xa4\xd73\x92P#\x1d\xb9\x12\x15\xc9\xd4~\x8aWm^\xb8\x8b\x9d\x88\n)\xeb#\xe3\x93\xb1\\\xd6^\xe0\xce\xa2(\x05\xf5\xe6\x8b\xd1\x15\xd8v\xf0\xae\x90\xd8?\x01\r\x00\xf4\xa5\xadM|%\x98\xa9SR\xc6\xd0K\x9e&\xc3\xe0M\x81\x87\xdea\xcc\xd5\x9c\xcd\xfd1l\x1f\xb9?\xed\xd1\x95\xbc\x11\x85U9'
b'l\xd3S\xcc\x03\x9a\xf2\xfdr\xca\xbbA\x06\xfb\xd8\xbbWi\xdc\xb1\xf6&\x97T\x81Kl\r\x86\x9b\x95?\x94}\x8a\xd3\xa1V\x81\xd3]*B\x1f\x96`\xa3\xd1\xf2|B\x84?\xa0\ns\xb7\xcf\x18Y\x87\xcfR\x87!\x14\x81!\xf7\xf2\xe5x|=O\xe3\xba2\xf2!\x93\x0fT7\x0c~4\xa3\xe5\xb7\xf9wy\xb5\x12FM\x96\xd9\xfd\xedn\x9c\xacw\x1b\xc2\x17+\xb6\x05`\x10\xf8\xe4\x01\xde\xc7\xa2\xa0\x80\xd8\x15\xb1+<s\xc7\x19\x9c\x14\xb0\x1a"\x10\xbb\x0f\xe1\x05\x93\xd2?xX\xd9\x93\x8an\x8d\xcd\xbd!c\xd0,\xa45\xbai\xe3\xccx\x08\xaa,\xd1\xe5\'t\x91\xb8\xf2n$\x0c\xf9-\xb4\xc2\x07\x81\xe1\xe7\x8e\xb3\x98\x11\xf3\xa6\xd9wz\x9a3\xc9\x9c?z\xd8\xaa\x08}\xa2\x9c[\xf2\x9d\xe4\xcdb\xddl\xceV\x7f\xf1\x81\xb3\x88\x1e\x9c5?k\x0f\xc9\x86\x86&\xedV.\xa7\x8d\x13&V\xad\xca\xe5\x93\xfe\xa5\x94\xbc\xf5\xd1{Cl\xc0\x030\x92\x03\xc9'
b'#\xbdd7\xe9\xa0{\t\xb9\x87B\x9e\xf9\x97P^\xf3V\xb6\x93\x1f(J\x0b\xa3\xbf\xd8\x04\x86T\xa4\xca\xf3\xe8%\xddC\x11\xdb5\xff,\xf7\x13\xd7\xd2\xbc\xf3\x893\x83\xdcmJ\xc8p\xdf\x07V\x7fb\xeb\xa9\x8b\x0f\xca\xf9\x05\xfc\xdfS\x94b\x90\xcd\xfcn?/]\x11\xaf\xe606\xfb\\U59\xa0>\xbd\xd8\x1c\xa8\xca\x83\xf4C\x95v7\xc6\xe00\xe4,d_/\x83\xa0\xb9mO\x0e\xc4\x97J\x15\xf0\xca-\xa0\xafT\xe4\x82\x03\n\x14:\xa1\xdcL\x98\x9d,1\xfa\x10\xf4\xfd\xa0\x0b\xc7\x13!\xf7\xdb/\xda\x1a\x9df\x1cQ\xc0\x99H\x08\xa0c\x8f9/4\xc4\x05\xc6\x9eM\x8e\xe5V\xf8D\xc3\xfd\xad4\x94A\xb9[\x80\xb9\xcf\xe6\xd9\xb3M2\xd9N\xfbA\x18\x84/W\x9b\x92\xfe\xbb\xd6C\x85\xa3\xc6\xd2T\xd0\xb2\xb9\xf7R\xb4(s\xda\xbcX,9w\x17\x1c\xfb|\xa0\x87\xba\xca6>y\xba\\L4wc\x94\xe7$Y\x89\x07\x9b\xfe\x9b?{\x85'
@pippo1980's comment is how I would do it, using struct:
import struct
cenc = b'tX\x10Fo\x89\x10~\x83Pok\xd1\xfb\xbe\x0e<a\xe5\x11md:\xe6\x84#\xfa\xf8\xe5\xeb\xf8\xdc{\xc0Z\xa0\xc0^\xc1\xd9\x820\xec\xec\xb0R\x99/\xa2l\x88\xa9\xa6g\xa3\x01m\xf9\x7f\x91\xb9\xe1\x80\xccs|\xb7_\xa9Fp\x11yvG\xdc\x02d\x8aK2\x92t\x0e\x1f\xca\x19\xbb&\xaf{\xc0y>\t|\x86\xab\x16.\xa5kZ"\xab6\xaaV\xf4w\x7f\xc5q\x07\xef\xa9\xa5\xa3\xf3 6\xdb\x03\x19S\xbd\x81\xf9\xc8\xc5\x90\x1e\x19\x86\xa4q\xe3?i\xc4\xac\t\xd5=3C\x9b#\xc3IuAN,\xeat\xc6\x96VFL\x1eFWZ\xa4\xd73\x92P#\x1d\xb9\x12\x15\xc9\xd4~\x8aWm^\xb8\x8b\x9d\x88\n)\xeb#\xe3\x93\xb1\\\xd6^\xe0\xce\xa2(\x05\xf5\xe6\x8b\xd1\x15\xd8v\xf0\xae\x90\xd8?\x01\r\x00\xf4\xa5\xadM|%\x98\xa9SR\xc6\xd0K\x9e&\xc3\xe0M\x81\x87\xdea\xcc\xd5\x9c\xcd\xfd1l\x1f\xb9?\xed\xd1\x95\xbc\x11\x85U9'
denc = b'l\xd3S\xcc\x03\x9a\xf2\xfdr\xca\xbbA\x06\xfb\xd8\xbbWi\xdc\xb1\xf6&\x97T\x81Kl\r\x86\x9b\x95?\x94}\x8a\xd3\xa1V\x81\xd3]*B\x1f\x96`\xa3\xd1\xf2|B\x84?\xa0\ns\xb7\xcf\x18Y\x87\xcfR\x87!\x14\x81!\xf7\xf2\xe5x|=O\xe3\xba2\xf2!\x93\x0fT7\x0c~4\xa3\xe5\xb7\xf9wy\xb5\x12FM\x96\xd9\xfd\xedn\x9c\xacw\x1b\xc2\x17+\xb6\x05`\x10\xf8\xe4\x01\xde\xc7\xa2\xa0\x80\xd8\x15\xb1+<s\xc7\x19\x9c\x14\xb0\x1a"\x10\xbb\x0f\xe1\x05\x93\xd2?xX\xd9\x93\x8an\x8d\xcd\xbd!c\xd0,\xa45\xbai\xe3\xccx\x08\xaa,\xd1\xe5\'t\x91\xb8\xf2n$\x0c\xf9-\xb4\xc2\x07\x81\xe1\xe7\x8e\xb3\x98\x11\xf3\xa6\xd9wz\x9a3\xc9\x9c?z\xd8\xaa\x08}\xa2\x9c[\xf2\x9d\xe4\xcdb\xddl\xceV\x7f\xf1\x81\xb3\x88\x1e\x9c5?k\x0f\xc9\x86\x86&\xedV.\xa7\x8d\x13&V\xad\xca\xe5\x93\xfe\xa5\x94\xbc\xf5\xd1{Cl\xc0\x030\x92\x03\xc9'
fenc = b'#\xbdd7\xe9\xa0{\t\xb9\x87B\x9e\xf9\x97P^\xf3V\xb6\x93\x1f(J\x0b\xa3\xbf\xd8\x04\x86T\xa4\xca\xf3\xe8%\xddC\x11\xdb5\xff,\xf7\x13\xd7\xd2\xbc\xf3\x893\x83\xdcmJ\xc8p\xdf\x07V\x7fb\xeb\xa9\x8b\x0f\xca\xf9\x05\xfc\xdfS\x94b\x90\xcd\xfcn?/]\x11\xaf\xe606\xfb\\U59\xa0>\xbd\xd8\x1c\xa8\xca\x83\xf4C\x95v7\xc6\xe00\xe4,d_/\x83\xa0\xb9mO\x0e\xc4\x97J\x15\xf0\xca-\xa0\xafT\xe4\x82\x03\n\x14:\xa1\xdcL\x98\x9d,1\xfa\x10\xf4\xfd\xa0\x0b\xc7\x13!\xf7\xdb/\xda\x1a\x9df\x1cQ\xc0\x99H\x08\xa0c\x8f9/4\xc4\x05\xc6\x9eM\x8e\xe5V\xf8D\xc3\xfd\xad4\x94A\xb9[\x80\xb9\xcf\xe6\xd9\xb3M2\xd9N\xfbA\x18\x84/W\x9b\x92\xfe\xbb\xd6C\x85\xa3\xc6\xd2T\xd0\xb2\xb9\xf7R\xb4(s\xda\xbcX,9w\x17\x1c\xfb|\xa0\x87\xba\xca6>y\xba\\L4wc\x94\xe7$Y\x89\x07\x9b\xfe\x9b?{\x85'
packing_format = "<HHH" # little-endian, 3 * (2-byte unsigned short)
with open("license.sfb", "wb") as licensefh:
licensefh.write(struct.pack(packing_format, len(cenc), len(denc), len(fenc)))
licensefh.write(cenc)
licensefh.write(denc)
licensefh.write(fenc)
# close is automatic with a context-manager
with open("license.sfb", "rb") as licensefh2:
header_length = struct.calcsize(packing_format)
cenc2_len, denc2_len, fenc2_len = struct.unpack(packing_format, licensefh2.read(header_length))
cenc2 = licensefh2.read(cenc2_len)
denc2 = licensefh2.read(denc2_len)
fenc2 = licensefh2.read(fenc2_len)
assert len(cenc2) == cenc2_len and len(denc2) == denc2_len and len(fenc2) == fenc2_len # the file was not truncated
unread_bytes = licensefh2.read() # until EOF
assert len(unread_bytes) == 0 # there is nothing else in the file, everything has been read
assert cenc == cenc2
assert denc == denc2
assert fenc == fenc2

how to create an index to parse big text file

I have two files A and B in FASTQ format, which are basically several hundred million lines of text organized in groups of 4 lines, each group starting with a # as follows:
#120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
GCCAATGGCATGGTTTCATGGATGTTAGCAGAAGACATGAGACTTCTGGGACAGGAGCAAAACACTTCATGATGGCAAAAGATCGGAAGAGCACACGTCTGAACTCN
+120412_SN549_0058_BD0UMKACXX:5:1101:1156:2031#0/1
bbbeee_[_ccdccegeeghhiiehghifhfhhhiiihhfhghigbeffeefddd]aegggdffhfhhihbghhdfffgdb^beeabcccabbcb`ccacacbbccB
I need to compare the
5:1101:1156:2031#0/
part between files A and B and write the groups of 4 lines in file B that matched to a new file. I got a piece of code in python that does that, but only works for small files as it parses through the entire #-lines of file B for every #-line in file A, and both files contain hundreds of millions of lines.
Someone suggested that I should create an index for file B; I have googled around without success and would be very grateful if someone could point out how to do this or let me know of a tutorial so I can learn. Thanks.
==EDIT==
In theory each group of 4 lines should only exist once in each file. Would breaking out of the parsing after each match increase the speed enough, or do I need a different algorithm altogether?
An index is just a shortened version of the information you are working with. In this case, you will want the "key" - the text between the first colon(':') on the #-line and the final slash('/') near the end - as well as some kind of value.
Since the "value" in this case is the entire contents of the 4-line block, and since our index is going to store a separate entry for each block, we would be storing the entire file in memory if we used the actual value in the index.
Instead, let's use the file position of the beginning of the 4-line block. That way, you can move to that file position, print 4 lines, and stop. Total cost is the 4 or 8 or however many bytes it takes to store an integer file position, instead of however-many bytes of actual genome data.
Here is some code that does the job, but also does a lot of validation and checking. You might want to throw stuff away that you don't use.
import sys

def build_index(path):
    index = {}
    for key, pos, data in parse_fastq(path):
        if key not in index:
            # Don't overwrite duplicates- use first occurrence.
            index[key] = pos
    return index

def error(s):
    sys.stderr.write(s + "\n")

def extract_key(s):
    # This much is fairly constant:
    assert(s.startswith('#'))
    (machine_name, rest) = s.split(':', 1)
    # Per wikipedia, this changes in different variants of FASTQ format:
    (key, rest) = rest.split('/', 1)
    return key

def parse_fastq(path):
    """
    Parse the 4-line FASTQ groups in path.
    Validate the contents, somewhat.
    """
    f = open(path)
    i = 0
    # Note: iterating a file is incompatible with fh.tell(). Fake it.
    pos = offset = 0
    for line in f:
        offset += len(line)
        lx = i % 4
        i += 1
        if lx == 0:    # #machine: key
            key = extract_key(line)
            len1 = len2 = 0
            data = [ line ]
        elif lx == 1:
            data.append(line)
            len1 = len(line)
        elif lx == 2:  # +machine: key or something
            assert(line.startswith('+'))
            data.append(line)
        else:          # lx == 3 : quality data
            data.append(line)
            len2 = len(line)
            if len2 != len1:
                error("Data length mismatch at line "
                      + str(i-2)
                      + " (len: " + str(len1) + ") and line "
                      + str(i)
                      + " (len: " + str(len2) + ")\n")
            #print "Yielding #%i: %s" % (pos, key)
            yield key, pos, data
            pos = offset
    if i % 4 != 0:
        error("EOF encountered in mid-record at line " + str(i))

def match_records(path, index):
    results = []
    for key, pos, d in parse_fastq(path):
        if key in index:
            # found a match!
            results.append(key)
    return results

def write_matches(inpath, matches, outpath):
    rf = open(inpath)
    wf = open(outpath, 'w')
    for m in matches:
        rf.seek(m)
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
        wf.write(rf.readline())
    rf.close()
    wf.close()

#import pdb; pdb.set_trace()
index = build_index('afile.fastq')
matches = match_records('bfile.fastq', index)
posns = [ index[k] for k in matches ]
write_matches('afile.fastq', posns, 'outfile.fastq')
Note that this code goes back to the first file to get the blocks of data. If your data is identical between files, you would be able to copy the block from the second file when a match occurs.
Note also that depending on what you are trying to extract, you may want to change the order of the output blocks, and you may want to make sure that the keys are unique, or perhaps make sure the keys are not unique but are repeated in the order they match. That's up to you - I'm not sure what you're doing with the data.
These guys claim to parse files of a few gigs using a dedicated library; see http://www.biostars.org/p/15113/
from Bio import SeqIO

fastq_parser = SeqIO.parse(fastq_filename, "fastq")
wanted = (rec for rec in fastq_parser if ...)
SeqIO.write(wanted, output_file, "fastq")
A better approach IMO would be to parse it once and load the data into some database instead of that output_file (e.g. MySQL) and later run the queries there.
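For instance, a rough sketch of that idea using the standard-library sqlite3 in place of MySQL (the file names, table layout, and key-extraction line are illustrative assumptions, not code from the question):

import sqlite3

conn = sqlite3.connect('reads.db')
conn.execute("CREATE TABLE IF NOT EXISTS reads (key TEXT PRIMARY KEY, pos INTEGER)")

# Index file B once: store each key with the file offset of its 4-line block.
with open('bfile.fastq') as f:
    while True:
        pos = f.tell()
        header = f.readline()
        if not header:
            break
        f.readline(); f.readline(); f.readline()  # sequence, '+' line, quality
        key = header.split(':', 1)[1].rsplit('/', 1)[0]
        conn.execute("INSERT OR IGNORE INTO reads VALUES (?, ?)", (key, pos))
conn.commit()

# Later (even from another script), look keys up without rescanning the file.
row = conn.execute("SELECT pos FROM reads WHERE key = ?",
                   ('5:1101:1156:2031#0',)).fetchone()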

Read between 2 offsets of a file

I was wondering how the read() function can be used to read between 2 offsets that are in hex?
I tried using this to convert the offset values to int, but I get a syntax error on the read() line. Any ideas?
OFFSETS = ('3AF7','3ECF')
OFFSETE = ('3B04','3EDE')
for r, d, f in os.walk("."):
    for hahahoho, value in enumerate(OFFSETS and OFFSETE):
        try:
            with open(os.path.join(r,f), 'rb' ) as fileread:
                texttoprint = fileread.seek(int(OFFSETS[hahahoho], 16) -1)
                yeeha = texttoprint.read[int(OFFSETS[hahahoho], 16) -1 : int(OFFSETE[damn],16)]
                print (yeeha)
                hahahoho + 1
This is not the entire code though, I just posted the part I need help with =(
EDIT:
Alright, I think I should listen to the advice of you people; this is the entire code:
nost = 1
OFFSETS = ('3AF7','3ECF')
OFFSETE = ('3B04','3EDE')
endscript = 'No'
nooffile = 1
import os, glob, sys, tempfile

try:
    directory = input('Enter your directory:').replace('\\','\\\\')
    os.chdir(directory)
except FileNotFoundError:
    print ('Directory not found!')
    endscript = 'YES!'

if endscript == 'YES!':
    sys.exit('Error. Be careful of what you do to your computer!')
else:
    if os.path.isfile('Done.txt') == True:
        print ('The folder has already been modified!')
    else:
        print ('Searching texts...\r\n')
        print ('Printing...')
        for r, d, f in os.walk("."):
            for HODF in f:
                if HODF.endswith(".hod") or "." not in HODF:
                    for damn, value in enumerate(OFFSETS and OFFSETE):
                        try:
                            with open(os.path.join(r,HODF), 'rb' ) as fileread:
                                fileread.seek(int(OFFSETS[damn],16) -1)
                                yeeha = fileread.read(int(OFFSETE[damn], 16) - (int(OFFSETS[damn],16) -1))
                                if b'?\x03\x00\x00\x00\x01\x00\x00\x00Leg2.' not in yeeha and b'?\x03\x00\x00\x00\x01\x00\x00\x00Leg2_r.' not in yeeha:
                                    print (yeeha)
                                damn + 1
                        except FileNotFoundError:
                            print('Invalid file path!')
                            os._exit(1)
                        except IndexError:
                            print ('File successfully modified!')
                            nooffile = nooffile + 1
                            nost = 1
        print ('\r\n'+str(nooffile)+' files read.',)
        print ('\tANI_list.txt, End.dat, Group.txt, Head.txt, Tail.dat files ignored.')
        print ('\r\nFiles successfully read! Hope you found what you are looking for!')
May I know what's wrong with it? Because it works just fine for me.
There are other problems with your code, but it sounds like you want to solve those yourself. When it comes to reading a particular byte range from a file, you can do that like this:
start = 1000
end = 1020 # Just examples
fileread.seek(start)
stuff = fileread.read(end - start)
That is, you start by seeking to the start position, and then you read as many bytes as you need (that is 20, in this example).
EDIT:
The only real "problem" with your code is that you're using enumerate in a strange fashion that makes it completely unnecessary. The expression OFFSETS and OFFSETE will simply evaluate to OFFSETE, making the OFFSETS and part superfluous. Then, you're only actually using the first value from enumerate (the index), which makes enumerate itself superfluous: you could just have used range(len(OFFSETE)) instead.
More proper, however, would be to loop directly over the values instead of going via an index, like this:
for start, end in zip(OFFSETS, OFFSETE):
    # snip
    fileread.seek(int(start, 16) - 1)
    yeeha = fileread.read(int(end, 16) - int(start, 16) + 1)
The other things are more like slight uglinesses that could be eliminated to make your code much nicer, but aren't strictly speaking wrong. Among them are that you don't need to represent your offsets as strings, but could use hexadecimal literals instead; that you open the file multiple times for no reason; that the hahahoho + 1 expression is completely superfluous; and that you could just bake the - 1 extra offsets directly into your actual offsets instead of adding them later.
I would write it closer to this instead:
OFFSETS = [0x3AF7 - 1, 0x3ECF - 2]
OFFSETE = [0x3B04 - 1, 0x3EDE - 2]

for r, d, f in os.walk("."):
    for fn in f:
        with open(os.path.join(r, fn), "rb") as fp:
            for start, end in zip(OFFSETS, OFFSETE):
                fp.seek(start)
                yeeha = fp.read(end - start)
                # Do whatever it is you need with yeeha

Verify that an uploaded file is a word document in Python

In my web app (Flask) I'm letting the user upload a word document.
I check that the extension of the file is either .doc or .docx.
However, I changed a .jpg file's extension to .docx and it passed as well (as I expected).
Is there a way to verify that an uploaded file is indeed a word document? I searched and read something about the header of a file but could not find any other information.
I'm using boto to upload the files to aws, in case it matters.
Thanks.
Well, that python-magic library in the question linked in the comments looks like a pretty straightforward solution.
Nevertheless, I'll give a more manual option. According to this site, DOC files have a signature of D0 CF 11 E0 A1 B1 1A E1 (8 bytes), while DOCX files have 50 4B 03 04 (4 bytes). Both have an offset of 0. It's safe to assume that the files are little-endian since they're from Microsoft (though, maybe Office files are Big Endian on Macs? I'm not sure)
You can unpack the binary data using the struct module like so:
>>> with open("foo.doc", "rb") as h:
... buf = h.read()
>>> byte = struct.unpack_from("<B", buf, 0)[0]
>>> print("{0:x}".format(byte))
d0
So, here we unpacked the first little-endian ("<") byte ("B") from a buffer containing the binary data read from the file, at an offset of 0 and we found "D0", the first byte in a doc file. If we set the offset to 1, we get CF, the second byte.
Let's check if it is, indeed, a DOC file:
def is_doc(file):
    with open(file, 'rb') as h:
        buf = h.read()
    fingerprint = []
    if len(buf) >= 8:
        for i in range(8):
            byte = struct.unpack_from("<B", buf, i)[0]
            fingerprint.append("{0:x}".format(byte))
    if ' '.join(fingerprint).upper() == "D0 CF 11 E0 A1 B1 1A E1":
        return True
    return False

>>> is_doc("foo.doc")
True
Unfortunately I don't have any DOCX files to test on but the process should be the same, except you only get the first 4 bytes and you compare against the other fingerprint.
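A sketch of what that might look like, following the same recipe (untested against real files, as noted above; the 02x padding matters here because bytes like 03 and 04 would otherwise lose their leading zeros and fail the string comparison):

import struct

def is_docx(file):
    with open(file, 'rb') as h:
        buf = h.read(4)  # only the first 4 bytes are needed
    fingerprint = []
    if len(buf) >= 4:
        for i in range(4):
            byte = struct.unpack_from("<B", buf, i)[0]
            # 02x keeps leading zeros, so 0x03 becomes "03", not "3"
            fingerprint.append("{0:02x}".format(byte))
    return ' '.join(fingerprint).upper() == "50 4B 03 04"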
Docx files are actually zip files. This zip contains three basic folders: word, docProps and _rels. Thus, use zipfile to test if those files exist in this file.
import zipfile

def isdir(z, name):
    return any(x.startswith("%s/" % name.rstrip("/")) for x in z.namelist())

def isValidDocx(filename):
    f = zipfile.ZipFile(filename, "r")
    return isdir(f, "word") and isdir(f, "docProps") and isdir(f, "_rels")
Code adapted from Check if a directory exists in a zip file with Python
However, any ZIP that contains those folders will bypass the verification.
I also don't know if it works for DOC files or for encrypted DOCX files.
You can use the python-docx library.
The code below will raise a ValueError if the file is not a docx file.
from docx import Document

try:
    Document("abc.docx")
except ValueError:
    print "Not a valid document type"
I used python-magic to verify whether a file is a Word document.
However, I ran into a lot of problems: different Word versions and different software produced different detected types, so I gave up on python-magic.
Here is my solution.
DOC_MAGIC_BYTES = [
    "D0 CF 11 E0 A1 B1 1A E1",
    "0D 44 4F 43",
    "CF 11 E0 A1 B1 1A E1 00",
    "DB A5 2D 00",
    "EC A5 C1 00"
]
DOCX_MAGIC_BYTES = [
    "50 4B 03 04"
]

def validate_is_word(content):
    magic_bytes = content[:8]
    fingerprint = []
    bytes_len = len(magic_bytes)
    if bytes_len >= 4:
        for i in xrange(bytes_len):
            byte = struct.unpack_from("<B", magic_bytes, i)[0]
            fingerprint.append("{:02x}".format(byte))
    if not fingerprint:
        return False
    if is_docx_file(fingerprint):
        return True
    if is_doc_file(fingerprint):
        return True
    return False

def is_doc_file(magic_bytes):
    four_bytes = " ".join(magic_bytes[:4]).upper()
    all_bytes = " ".join(magic_bytes).upper()
    return four_bytes in DOC_MAGIC_BYTES or all_bytes in DOC_MAGIC_BYTES

def is_docx_file(magic_bytes):
    type_ = " ".join(magic_bytes[:4]).upper()
    return type_ in DOCX_MAGIC_BYTES
You can try this.
I use the filetype Python lib to check the MIME type and compare it against the file's extension, so my users can't fool me just by changing the extension.
pip install filetype
Then
import filetype
kind = filetype.guess('path/to/file')
mime = kind.mime
ext = kind.extension
You can check their doc here
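As a sketch of the comparison step described above (looks_like_docx is a made-up helper; the MIME string is the standard one for .docx, and the None check matters because filetype.guess returns None for unrecognized content):

import filetype

DOCX_MIME = 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'

def looks_like_docx(path):
    # filetype.guess() returns None when it cannot identify the file.
    kind = filetype.guess(path)
    return kind is not None and kind.mime == DOCX_MIME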
python-magic does a very good job of detecting docx as well as pptx formats.
Here are a few examples:
In [60]: magic.from_file("oz123.docx")
Out[60]: 'Microsoft Word 2007+'
In [61]: magic.from_file("oz123.docx", mime=True)
Out[61]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
In [62]: magic.from_file("presentation.pptx")
Out[62]: 'Microsoft PowerPoint 2007+'
In [63]: magic.from_file("presentation.pptx", mime=True)
Out[63]: 'application/vnd.openxmlformats-officedocument.presentationml.presentation'
Since the OP asked about a file upload, a path on disk isn't always available. Luckily,
magic supports detecting from a buffer:
In [63]: fdox
Out[63]: <_io.BufferedReader name='/home/oz123/Documents/oz123.docx'>
In [64]: magic.from_buffer(fdox.read(2048))
Out[64]: 'Zip archive data, at least v2.0 to extract'
Naively, we read an amount which is too small ... Reading more bytes solves the problem:
In [65]: fdox.seek(0)
Out[65]: 0
In [66]: magic.from_buffer(fdox.read(4096))
Out[66]: 'Microsoft Word 2007+'
In [67]: fdox.seek(0)
Out[67]: 0
In [68]: magic.from_buffer(fdox.read(4096), mime=True)
Out[68]: 'application/vnd.openxmlformats-officedocument.wordprocessingml.document'
