Comparing slices of binary data in python

Say I have a file in hexadecimal and I need to search for repeating sets of bytes in it. What would be the best way to do this in python?
Right now what I'm doing is treating everything as a string with the re module, which is extremely slow and not the right way to do it. I just can't figure out how to slice up and compare binary data.
for i in range(int(len(data))):
    string = data[i:i+16]
    pattern = re.compile(string)
    m = pattern.findall(data)
    count += 1
    if len(m) > 1:
        k = [str(i), str(len(m))]
        t = ":".join(k)
        output_file.write(' {}'.format(t))
    else:
        continue
Just to make sure there's no confusion, data here is just a big string of hex data from open('pathtofile/file', 'r')
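One possible approach, offered as a sketch and not taken from the thread: decode the hex text to bytes (or open the file in binary mode) and record the offset of every 16-byte window in a dictionary. That needs a single pass over the data instead of one regex search per offset. The function name find_repeats and the sample data are made up for illustration.

```python
from collections import defaultdict

def find_repeats(data, size=16):
    """Map each `size`-byte chunk that occurs more than once to its offsets."""
    offsets = defaultdict(list)
    # Slide a window over the data; bytes slices are hashable, so they work as keys.
    for i in range(len(data) - size + 1):
        offsets[data[i:i + size]].append(i)
    return {chunk: pos for chunk, pos in offsets.items() if len(pos) > 1}

# Hypothetical sample: hex text decoded to bytes, searched with 4-byte windows.
sample = bytes.fromhex("deadbeef00112233deadbeef")
repeats = find_repeats(sample, size=4)
for chunk, pos in repeats.items():
    print(chunk.hex(), pos)
```

Unlike re.findall, this counts overlapping occurrences too, and it keeps every window in memory, so for very large files it may need to be adapted.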

Related

Replace space-char by newline-char within max-line-length in python

I am trying to write a python script,
which breaks a continuous string into lines,
when the max_line_length has been exceeded.
It shall not break words,
and searches therefore the last occurrence of a whitespace-char,
which will be replaced by a newline-char.
For some reason it does not break within the specified limit.
E.g. when defining the max_line_length = 80,
the text sometimes breaks at 82 or 83, etc.
I have been trying to fix this for quite some time,
but I seem to have tunnel vision
and can't see the problem here:
#!/usr/bin/python
import sys
if len(sys.argv) < 3:
    print('usage: $ python3 breaktext.py <max_line_length> <file>')
    print('example: $ python3 breaktext.py 80 infile.txt')
    exit()
filename = str(sys.argv[2])
with open(filename, 'r') as file:
    text_str = file.read().replace('\n', '')
m = int(sys.argv[1]) # max_line_length
text_list = list(text_str) # convert string to list
l = 0; # line_number
i = m+1 # line_character_index
index = m+1 # total_list_index
while index < len(text_list):
    while text_list[l * m + i] != ' ':
        i -= 1
        pass
    text_list[l * m + i] = '\n'
    l += 1
    i = m+1
    index += m+1
    pass
text_str = ''.join(text_list)
print(text_str)
I guess we'll take this from the top.
text_str = file.read().replace('\n', '')
This makes an assumption about the input data that may not hold. You're replacing every newline character with nothing; if a newline had no space next to it, the words on either side are fused together, and the code below can never break the line at that spot again.
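A small illustration of that point, using a made-up two-line input: removing the newline fuses the neighbouring words, while splitting on whitespace and rejoining with single spaces keeps them apart. This is only a sketch of one alternative, not something the original script does.

```python
text = "first line\nsecond line"   # hypothetical input with no space before the newline

fused = text.replace('\n', '')     # the words "line" and "second" are glued together
spaced = ' '.join(text.split())    # every run of whitespace becomes one space

print(repr(fused))
print(repr(spaced))
```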
text_list = list(text_str) # convert string to list
This splits the input file into single character strings. I guess you might have done so to make it mutable, such that you can replace individual characters, but it's a very expensive operation and loses all the features of a string. Python is a high level language that would allow you to split into e.g. words instead.
index = m+1 # total_list_index
while index < len(text_list):
    #...
    index += m+1
Let's consider what this means. We don't enter the loop once index reaches the text_list length, and index advances in steps of m+1. So we split at most math.floor(len(text_list)/(m+1)) times. Unless every line is exactly max_line_length characters, not counting the space we replace with a newline, that's too few times. Too few splits means lines that are too long, at least at the end.
l = 0; # line_number
i = m+1 # line_character_index
#loop:
    while text_list[l * m + i] != ' ':
        i -= 1
    text_list[l * m + i] = '\n'
    l += 1
    i = m+1
This is making things difficult with index math. Quite clearly the one index we ever use is l * m + i. This moves in a quite odd way; it searches backwards for a space, then leaps forward as l increments and i resets. Whatever position it had reversed to is lost as all the leaps are in steps of m.
Let's apply m=5 to the string "Fee fie faw fum who did you see now". For the first iteration, 0 * 5 + 5+1 hits the second word, and i seeks back to the first space. The first line then is "Fee", as expected. The second search starts at 1*5 + 5+1, which is a space, and the second line becomes "fie faw", which already exceeds our limit of 5! The reason is that l * m isn't the beginning of the line; it's actually in the middle of "fie", a discrepancy which can only grow as you continue through the file. It grows whenever you split off a line that is shorter than m.
The solution involves remembering where you did your split. That could be as simple as replacing l * m with index, and updating it by index += i instead of m+1.
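A minimal sketch of that idea, under my own assumptions: the function name break_lines is hypothetical, the variable start plays the role of the remembered split position, and a word longer than the limit is left on a line of its own instead of triggering a search into earlier lines. It is a variation on the suggested fix, not the asker's final code.

```python
def break_lines(text, m):
    """Replace spaces with newlines so lines hold at most m characters where possible."""
    chars = list(text)
    start = 0  # index of the first character of the current line
    while len(chars) - start > m:
        # Search backwards from offset m for a space inside the current line only.
        i = m
        while i > 0 and chars[start + i] != ' ':
            i -= 1
        if i == 0:
            # No usable space: the word is longer than m, so search forwards instead.
            i = m
            while start + i < len(chars) and chars[start + i] != ' ':
                i += 1
            if start + i == len(chars):
                break  # the overlong word is the last one
        chars[start + i] = '\n'
        start += i + 1  # remember where the next line begins
    return ''.join(chars)

print(break_lines("Fee fie faw fum who did you see now", 7))
```

Because start always points at the real beginning of the current line, short lines no longer shift the later searches.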
Another odd effect happens if you ever encounter a word that exceeds the maximum line length. Besides making that line longer than the limit, i will keep searching backwards until it finds a space; that space could be in an earlier line altogether, producing extra short lines as well as overlong ones. That's a result of handling the entire text as one array without limiting which section we're looking at.
Personally I'd much rather use Python's built in methods, such as str.rindex, which can find a particular character in a given region within a string:
s = "Fee fie faw fum who did you see now"
maxlen = 5
start = 8
end = s.rindex(' ', start, start+maxlen+1) # the end bound is exclusive; +1 allows a line of exactly maxlen characters
print(s[start:end])
start = end + 1
We also, as PaulMcG pointed out, can go full "batteries included" and use the standard library textwrap module for the entire task.
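For reference, a short sketch of the textwrap route, reusing the example string from above with a width of 7 chosen for illustration:

```python
import textwrap

s = "Fee fie faw fum who did you see now"
# fill() wraps the text so that no line is longer than `width` and returns one string.
wrapped = textwrap.fill(s, width=7)
print(wrapped)
```

textwrap.wrap returns the same result as a list of lines, which may be more convenient if the lines are processed further.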

Want to convert text file of complex numbers to a list of numbers in Python

I have a text file of complex numbers called output.txt in the form:
[-3.74483279909056 + 2.54872970226369*I]
[-3.64042002652517 + 0.733996349939531*I]
[-3.50037473491252 + 2.83784532111642*I]
[-3.80592861109028 + 3.50296053533826*I]
[-4.90750592116062 + 1.24920836601026*I]
[-3.82560512449716 + 1.34414866823615*I]
etc...
I want to create a list from these (read in as a string in Python) of complex numbers.
Here is my code:
data = [line.strip() for line in open("output.txt", 'r')]
for i in data:
    m = map(complex,i)
However, I'm getting the error:
ValueError: complex() arg is a malformed string
Any help is appreciated.
From the help information, for the complex builtin function:
>>> help(complex)
class complex(object)
| complex(real[, imag]) -> complex number
|
| Create a complex number from a real part and an optional imaginary part.
| This is equivalent to (real + imag*1j) where imag defaults to 0.
So you need to format the string properly, and pass the real and imaginary parts as separate arguments.
Example:
# Python 3: remove the brackets, '*' and 'I', then split into real and imaginary parts
line = "[-3.74483279909056 + 2.54872970226369*I]"
real, im = line.translate(str.maketrans('', '', '[]*I')).split(None, 1)
print(real, im)
# -3.74483279909056 + 2.54872970226369
im = im.replace(" ", "") # remove whitespace
c = complex(float(real), float(im))
print(c)
# (-3.74483279909056+2.54872970226369j)
Try this:
numbers = []
with open("output.txt", 'r') as data:
    for line in data:  # a file object is iterated line by line; it has no splitlines()
        parts = line.strip().split('+')  # strip() removes the trailing newline first
        real, imag = parts[0].strip(' ['), parts[1].strip(' *I]')
        numbers.append(complex(float(real), float(imag)))
The problem with your original approach is that your input file contains lines of text that complex() does not know how to process. We first need to break each line down to a pair of numbers - real and imag. To do that, we need to do a little string manipulation (split and strip). Finally, we convert the real and imag strings to floats as we pass them into the complex() function.
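Here is a concise way to create the list of complex values (based on dal102's answer; written for Python 3, where str.translate takes a table built by str.maketrans):
data = [complex(*map(float, line.translate(str.maketrans('', '', ' []*I')).split('+'))) for line in open("output.txt")]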
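The answers above split on '+', which assumes a positive imaginary part. As an alternative sketch (my own assumption, not from the thread), each line can be rewritten into Python's own complex literal form and handed to complex() directly, which also copes with a minus sign. The helper name parse_complex and the second sample line are hypothetical.

```python
def parse_complex(line):
    # "[-3.7 + 2.5*I]" becomes "-3.7+2.5j", which complex() accepts as a string.
    cleaned = line.strip().strip('[]').replace(' ', '').replace('*I', 'j')
    return complex(cleaned)

lines = [
    "[-3.74483279909056 + 2.54872970226369*I]",
    "[1.5 - 2.0*I]\n",  # hypothetical line with a negative imaginary part
]
numbers = [parse_complex(line) for line in lines]
print(numbers)
```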

Extracting columns having values >= 90%

I wrote this script to extract values from my .txt file that have >= 90 % identity. However, this program does not take into consideration values higher than 100.00 for example 100.05, why?
import re
output=open('result.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    id_per=new_list[2]
    if id_per >= '90':
        new_list.append(id_per)
        output.writelines(line)
f.close()
output.close()
Input file example
A 99.12
B 93.45
C 100.00
D 100.05
E 87.5
You should compare them as floats not strings. Something as follows:
import re
output=open('result.txt','w')
f=open('file.txt','r')
lines=f.readlines()
for line in lines:
    new_list=re.split(r'\t+',line.strip())
    id_per=new_list[2]
    if float(id_per) >= 90.0:
        new_list.append(id_per)
        output.writelines(line)
f.close()
output.close()
This is because Python is comparing the values as strings, even though you want them compared as numbers. Strings are compared character by character, in ASCII/Unicode order. That is why your code raises no error but does not behave as you expect: it applies string ordering rather than numeric ordering.
As an alternative to @sshashank124's answer, you could use simple string manipulation if your lines have a simple format:
output=open('result.txt','w')
f=open('file.txt','r')
for line in f:
    words = line.split()
    num_per=words[1]
    if float(num_per) >= 90:
        output.writelines(line)
f.close()
output.close()
Python is a dynamically but strongly typed language. Therefore 90 and '90' are completely different things: one is an integer and the other is a string.
You're comparing strings, and in string comparison '90' is "greater" than '100.05' (strings are compared character by character, and '9' is greater than '1').
So what you need to do is:
convert id_per to number (you'll want probably floats, as you care about decimal places)
compare it to number, i.e., 90, not a '90'
In code:
id_per = float(new_list[2])
if id_per >= 90:
You are using string comparison - lexically 100 is less than 90. I bet that it works for 950...
Get rid of the quotes around the '90', and convert id_per with float() so that both sides of the comparison are numbers.
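The difference between the two kinds of comparison discussed in these answers can be checked directly. The sketch below uses the values from the input file example; the variable names are my own.

```python
values = ['99.12', '93.45', '100.00', '100.05', '87.5']

# String comparison: '1' sorts before '9', so both values above 100 are dropped.
as_strings = [v for v in values if v >= '90']

# Numeric comparison: convert each value first, then compare with a number.
as_floats = [v for v in values if float(v) >= 90]

print(as_strings)
print(as_floats)
```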

Converting from hex to binary without losing leading 0's python

I have a hex value in a string like
h = '00112233aabbccddee'
I know I can convert this to binary with:
h = bin(int(h, 16))[2:]
However, this loses the leading 0's. Is there any way to do this conversion without losing the 0's? Or is the best way just to count the number of leading 0's before the conversion and add them back in afterwards?
I don't think there is a way to keep those leading zeros by default.
Each hex digit translates to 4 binary digits, so the length of the new string should be exactly 4 times the size of the original.
h_size = len(h) * 4
Then, you can use .zfill to fill in zeros to the size you want:
h = ( bin(int(h, 16))[2:] ).zfill(h_size)
This is actually quite easy in Python, since it doesn't have any limit on the size of integers. Simply prepend a '1' to the hex string, and strip the corresponding '1' from the output.
>>> h = '00112233aabbccddee'
>>> bin(int(h, 16))[2:] # old way
'1000100100010001100111010101010111011110011001101110111101110'
>>> bin(int('1'+h, 16))[3:] # new way
'000000000001000100100010001100111010101010111011110011001101110111101110'
Basically the same but padding to 4 bindigits each hexdigit
''.join(bin(int(c, 16))[2:].zfill(4) for c in h)
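Another way to express the same padding, sketched here as an alternative to the answers above, is to let format() zero-pad to four bits per hex digit in one step. The variable name bits is my own.

```python
h = '00112233aabbccddee'

# Build a format spec such as '072b': binary, zero-padded to 4 bits per hex digit.
bits = format(int(h, 16), '0{}b'.format(len(h) * 4))
print(bits)
print(len(bits))
```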
A newbie to python such as I would proceed like so
datastring = 'HexInFormOfString'
Padding to accommodate preceding zeros if any, when python converts string to Hex.
datastrPadded = 'ffff' + datastring
Convert padded value to binary.
databin = bin(int(datastrPadded,16))
Remove the 2 characters ('0b') that python adds to denote binary, plus the 16 padding bits.
databinCrop = databin[18:]
This converts a hex string into a string of raw byte values, one character per pair of hex digits. Since you want the length to be dependent on the original, this may be what you want.
data = ""
while len(h) > 0:
    data = data + chr(int(h[0:2], 16))
    h = h[2:]
print(data)
I needed an integer as input and pure hex/bin strings out, without the prefixes '0b' and '0x', so my general solution is this:
# NO_OF_BITS is a constant that must be defined before these functions.
def pure_bin(data, no_of_bits=NO_OF_BITS):
    data = data + 2**(no_of_bits)
    return bin(data)[3:]
def pure_hex(data, no_of_bits=NO_OF_BITS):
    if (no_of_bits%4) != 0:
        no_of_bits = 4*int(no_of_bits / 4) + 4
    data = data + 2**(no_of_bits)
    return hex(data)[3:]
hexa = '91278c4bfb3cbb95ffddc668d995bfe0'
binary = bin(int(hexa, 16))[2:]
print(binary)
hexa_dec = hex(int(binary, 2))[2:]
print(hexa_dec)

python: find and replace numbers < 1 in text file

I'm pretty new to Python programming and would appreciate some help to a problem I have...
Basically I have multiple text files which contain velocity values as such:
0.259515E+03 0.235095E+03 0.208262E+03 0.230223E+03 0.267333E+03 0.217889E+03 0.156233E+03 0.144876E+03 0.136187E+03 0.137865E+00
etc for many lines...
What I need to do is convert all the values in the text file that are less than 1 (e.g. 0.137865E+00 above) to an arbitrary value of 0.100000E+01. While it seems pretty simple to replace specific values with the 'replace()' method and a while loop, how do you do this if you want to replace a range?
thanks
I think when you are beginning programming, it's useful to see some examples; and I assume you've tried this problem on your own first!
Here is a break-down of how you could approach this:
contents='0.259515E+03 0.235095E+03 0.208262E+03 0.230223E+03 0.267333E+03 0.217889E+03 0.156233E+03 0.144876E+03 0.136187E+03 0.137865E+00'
The split method works on strings. It returns a list of strings. By default, it splits on whitespace:
string_numbers=contents.split()
print(string_numbers)
# ['0.259515E+03', '0.235095E+03', '0.208262E+03', '0.230223E+03', '0.267333E+03', '0.217889E+03', '0.156233E+03', '0.144876E+03', '0.136187E+03', '0.137865E+00']
The map function applies its first argument (the function float) to each of the elements of its second argument (the list string_numbers). The float function converts each string into a floating-point object. In Python 3, map returns an iterator, so it is wrapped in list() here:
float_numbers=list(map(float,string_numbers))
print(float_numbers)
# [259.515, 235.095, 208.262, 230.223, 267.333, 217.889, 156.233, 144.876, 136.187, 0.137865]
You can use a list comprehension to process the list, converting numbers less than 1 into the number 1. The conditional expression (1 if num<1 else num) equals 1 when num is less than 1, otherwise, it equals num.
processed_numbers=[(1 if num<1 else num) for num in float_numbers]
print(processed_numbers)
# [259.515, 235.095, 208.262, 230.223, 267.333, 217.889, 156.233, 144.876, 136.187, 1]
This is the same thing, all in one line:
processed_numbers=[(1 if num<1 else num) for num in map(float,contents.split())]
To generate a string out of the elements of processed_numbers, you could use the str.join method:
comma_separated_string=', '.join(map(str,processed_numbers))
# '259.515, 235.095, 208.262, 230.223, 267.333, 217.889, 156.233, 144.876, 136.187, 1'
A typical technique would be:
read file line by line
split each line into a list of strings
convert each string to the float
compare converted value with 1
replace when needed
write back to the new file
As I don't see you having any code yet, I hope that this would be a good start
def float_filter(input):
    for number in input.split():
        if float(number) < 1.0:
            yield "0.100000E+01"
        else:
            yield number
input = "0.259515E+03 0.235095E+03 0.208262E+03 0.230223E+03 0.267333E+03 0.217889E+03 0.156233E+03 0.144876E+03 0.136187E+03 0.137865E+00"
print(" ".join(float_filter(input)))
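The generator above handles a single string. A hedged sketch of how it might be applied to a whole file, following the read/split/convert/compare/replace/write steps listed earlier: the helper name convert_file and the temporary file names are assumptions for illustration, and the parameter is renamed to line to avoid shadowing the built-in input.

```python
import os
import tempfile

def float_filter(line):
    # Yield each value unchanged unless it is below 1, in which case yield the replacement.
    for number in line.split():
        yield "0.100000E+01" if float(number) < 1.0 else number

def convert_file(src_path, dst_path):
    # Read line by line and write the filtered values to a new file.
    with open(src_path) as src, open(dst_path, 'w') as dst:
        for line in src:
            dst.write(' '.join(float_filter(line)) + '\n')

# Demonstration on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'in.txt')
    dst = os.path.join(tmp, 'out.txt')
    with open(src, 'w') as f:
        f.write("0.259515E+03 0.137865E+00\n0.500000E-01 0.144876E+03\n")
    convert_file(src, dst)
    with open(dst) as f:
        converted = f.read()
print(converted)
```

Because values that are not replaced are written back as the original strings, their formatting is preserved exactly.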
import numpy as np
a = np.genfromtxt('file.txt') # read file
a[a<1] = 1.0 # replace values below 1 with 1.0 (0.100000E+01)
np.savetxt('converted.txt', a) # save to file
You could use regular expressions for parsing the string. I'm assuming here that the mantissa is never larger than 1 (ie, begins with 0). This means that for the number to be less than 1, the exponent must be either 0 or negative. The following regular expression matches '0', '.', unlimited number of decimal digits (at least 1), 'E' and either '+00' or '-' and two decimal digits.
0\.\d+E(-\d\d|\+00)
Assuming that you have the file read into variable 'text', you can use the regexp with the following python code:
import re
result = re.sub(r"0\.\d+E(-\d\d|\+00)", "0.100000E+01", text)
Edit: Just realized that the description doesn't limit the valid range of input numbers to positive numbers. Negative numbers can be matched with the following regexp:
-0\.\d+E[-+]\d\d
This can be alternated with the first one using the (pattern1|pattern2) syntax which results in the following Python code:
result = re.sub(r"(0\.\d+E(-\d\d|\+00)|-0\.\d+E[-+]\d\d)", "0.100000E+01", text)
Also, if there's a chance that the exponent goes past 99, the regexp can be further modified by adding a '+' sign after the '\d\d' patterns. This allows matching exponents of two OR MORE digits.
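A quick way to try the combined pattern is shown below, with the replacement value 0.100000E+01 from the question. The sample text is made up, and the sketch keeps the answer's assumption that every mantissa starts with "0.".

```python
import re

# First alternative: positive values with exponent +00 or negative.
# Second alternative: any negative value.
pattern = re.compile(r"(0\.\d+E(-\d\d|\+00)|-0\.\d+E[-+]\d\d)")

text = "0.259515E+03 0.137865E+00 0.500000E-02 -0.300000E+02"
result = pattern.sub("0.100000E+01", text)
print(result)
```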
I've got the script working as I want now...thanks people.
When writing the list to a new file I used the replace method to get rid of the brackets and commas - is there a simpler way?
ftext = open("C:\\Users\\hhp06\\Desktop\\out.grd", "r")
otext = open("C:\\Users\\hhp06\\Desktop\\out2.grd", "w+")
for line in ftext:
    stringnum = line.split()
    floatnum = map(float, stringnum)
    procnum = [(1.0 if num<1 else num) for num in floatnum]
    stringproc = str(procnum)
    s = (stringproc).replace(",", " ").replace("[", " ").replace("]", "")
    otext.writelines(s + "\n")
otext.close()
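On the follow-up question, one simpler option (a sketch, using the str.join approach shown in an earlier answer) is to join the numbers directly instead of converting the list to a string and stripping brackets and commas. The sample line is made up.

```python
line = "0.259515E+03 0.137865E+00 0.144876E+03"  # hypothetical input line

procnum = [(1.0 if num < 1 else num) for num in map(float, line.split())]
# Convert each number to a string and join with spaces; no brackets or commas appear.
s = ' '.join(str(num) for num in procnum)
print(s)
```

Note that str() does not reproduce the original 0.259515E+03 notation; if that format matters, writing back the original strings for unchanged values keeps it intact.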
