I'm trying to convert a 2.5 program to 3.
Is there a way in python 3 to change a byte string, such as b'\x01\x02' to a python 2.5 style string, such as '\x01\x02', so that string and byte-by-byte comparisons work similarly to 2.5? I'm reading the string from a binary file.
I have a 2.5 program that reads bytes from a file, then compares or processes each byte or combination of bytes with specified constants. To run the program under 3, I'd like to avoid changing all my constants to bytes and byte strings ('\x01' to b'\x01'), then dealing with issues in 3 such as:
a = b'\x01'
b = b'\x02'
results in
(a+b)[0] != a
even though similar operations work in 2.5. I have to do (a+b)[0] == ord(a), while a+b == b'\x01\x02' works fine. (By the way, what do I do to (a+b)[0] so it equals a?)
Unpacking structures is also an issue.
Am I missing something simple?
Bytes is an immutable sequence of integers (in the range 0 <= x < 256), so when you access (a+b)[0] you get back an integer, exactly the same one you'd get by accessing a[0]. When you compare the sequence a to the integer (a+b)[0], they are naturally different.
Using slice notation, however, you get a sequence back:
>>> (a+b)[:1] == a # 1 == len(a) ;)
True
because slicing returns a bytes object.
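A minimal sketch of the difference under Python 3, reusing a and b from the question. Wrapping the integer in bytes([...]) is one way to turn a single element back into a length-1 bytes object; the variable names are only illustrative.

```python
a = b'\x01'
b = b'\x02'
joined = a + b

first_as_int = joined[0]        # indexing yields an int
first_as_bytes = joined[:1]     # slicing yields a bytes object of length 1
rebuilt = bytes([joined[0]])    # wrap the int in a list to get bytes back

print(first_as_int == ord(a))
print(first_as_bytes == a)
print(rebuilt == a)
```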
I would also advise running the 2to3 utility (it needs to be run with py2k) to convert some of the code automatically. It won't solve all your problems, but it'll help a lot.
I'm writing code in python 3.5 that uses hashlib to spit out MD5 encryption for each packet once it is given a pcap file and the password. I am traversing through the pcap file using pyshark. Currently, the values it is spitting out are not the same as the MD5 encryptions on the packets in the pcap file.
One of the reasons I have attributed this to is that in the hex representation of the packet, the values are represented with leading 0s. E.g. the protocol number is shown as b'06', but the value I am updating the hashlib variable with is b'6'. And these two values are not the same for some reason:
>>> b'06' == b'6'
False
The way I am encoding integers is:
(hex(int(value))[2:]).encode()
I am doing this encoding because otherwise it would result in this error: "TypeError: Unicode-objects must be encoded before hashing"
I was wondering if I could get some help finding a python encoding library that ignores leading 0s or if there was any way to get the inbuilt hex method to ignore the leading 0s.
Thanks!
Hashing b'06' and b'6' gives different results because, in this context, '06' and '6' are different.
The b string prefix in Python tells the Python interpreter to convert each character in the string into a byte. Thus, b'06' will be converted into the two bytes 0x30 0x36, whereas b'6' will be converted into the single byte 0x36. Just as hashing b'a' and b' a' (note the space) produces different results, hashing b'06' and b'6' will similarly produce different results.
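To make this concrete, here is a small sketch, assuming the goal is a two-digit, zero-padded hex string as in the b'06' example. format(value, '02x') is a built-in way to keep the leading zero. Whether the hash should be fed hex text at all, rather than the raw byte, depends on how the original packets were hashed, which the question does not settle; the raw variant is shown only for comparison.

```python
import hashlib

value = 6

unpadded = hex(value)[2:].encode()        # hex() drops the leading zero
padded = format(value, '02x').encode()    # zero-padded to two hex digits
raw = bytes([value])                      # the single raw byte 0x06

# Three different byte strings give three different digests.
for label, data in [('unpadded', unpadded), ('padded', padded), ('raw', raw)]:
    print(label, data, hashlib.md5(data).hexdigest())
```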
If you don't understand why this happens, I recommend looking up how bytes work, both within Python and more generally - Python's handling of bytes has always been a bit counterintuitive, so don't worry if it seems confusing! It's also important to note that the way Python represents bytes has changed between Python 2 and Python 3, so be sure to check which version of Python any information you find is talking about. You can comment here, too,
I have a Python2 codebase that makes extensive use of str to store raw binary data. I want to support both Python2 and Python3.
The bytes type in Python2 (an alias of str) and bytes in Python3 are completely different. They take different arguments to construct, index to different types and have different str and repr.
What's the best way of unifying the code for both Python versions, using a single type to store raw data?
The python-future package has a backport of the Python3 bytes type.
>>> from builtins import bytes # in py2, this picks up the backport
>>> b = bytes(b'ABCD')
This provides the Python 3 interface in both Python 2 and Python 3. In Python 3, it is the builtin bytes type. In Python 2, it is a compatibility layer on top of the str type.
I don't know which parts you want to work with bytes in. I almost always work with bytearrays, and this is how I do it when reading from a file:
with open(file, 'rb') as imageFile:
    f = imageFile.read()
    b = bytearray(f)
I took that right out of a project I am working on, and it works in both 2 and 3. Maybe something for you to look at?
If your project is small and simple, use six.
Otherwise I suggest to have two independent codebases: one for Python 2 and one for Python 3. Initially it may sound like a lot of unnecessary work, but eventually it's actually a lot easier to maintain.
As an example of what your project may become if you decide to support both pythons in a single codebase, take a look at google's protobuf: lots of often counterintuitive branching all over the code, and abstractions that were modified just to allow hacks. And as your project evolves it won't get better: deadlines play against the quality of the code.
With two separate codebases you will simply apply almost identical patches, which isn't a lot of work compared to what is ahead of you if you want a single codebase. And it will be easier to migrate to Python 3 completely once the number of Python 2 users of your package drops.
Assuming you only need to support Python 2.6 and newer, you can simply use bytes for, well, bytes. Use b literals to create bytes objects, such as b'\x0a\x0b\x00'. When working with files, make sure the mode includes a b (as in open('file.bin', 'rb')).
Beware that iteration and element access are different, though. In these cases, you can write your code to use chunks. Instead of b[0] == 0 (Python 3) or b[0] == b'\x00' (Python 2), write b[0:1] == b'\x00'. Another option is to use bytearray (when the bytes need to be mutable) or helper functions.
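A short sketch of the chunk approach, written so that the same lines behave identically on Python 2.6+ and Python 3. The sample data is made up for illustration.

```python
data = b'\x00\x0a\x0b'

# Slicing returns a bytes object on both Python 2 and Python 3.
first_is_zero = data[0:1] == b'\x00'

# Walk the data in one-byte chunks instead of iterating over it directly.
chunks = [data[i:i + 1] for i in range(len(data))]

# A bytearray indexes and iterates to ints on both versions.
as_ints = list(bytearray(data))

print(first_is_zero, chunks, as_ints)
```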
Strings of characters should be unicode in Python 2, independent from Python 3 porting; otherwise the code would likely be wrong when encountering non-ASCII characters anyways. The equivalent is str in Python 3.
Either use u literals to create character strings (such as u'Düsseldorf') and/or make sure to start every file with from __future__ import unicode_literals. Declare file encodings when necessary by starting files with # encoding: utf-8.
Use io.open to read character strings from files. For network code, fetch bytes and call decode on them to get a character string.
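As a rough illustration of both routes, the sketch below writes a small UTF-8 file and reads it back once as text via io.open and once as bytes followed by decode. The temporary path and file name are hypothetical and exist only to keep the example self-contained.

```python
import io
import os
import tempfile

# Hypothetical file, created only so the example can run on its own.
path = os.path.join(tempfile.mkdtemp(), 'city.txt')
with io.open(path, 'w', encoding='utf-8') as f:
    f.write(u'D\u00fcsseldorf')

# Text mode with an explicit encoding returns a character string.
with io.open(path, encoding='utf-8') as f:
    text = f.read()

# Binary mode returns bytes, which are decoded explicitly.
with io.open(path, 'rb') as f:
    raw = f.read()
decoded = raw.decode('utf-8')

print(text == decoded, len(text), len(raw))
```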
If you need to support Python 2.5 or 3.2, have a look at six to convert literals.
Add plenty of assertions to make sure that functions which operate on character strings don't get bytes, and vice versa. As usual, a good test suite with 100% coverage helps a lot.
As shown in the IPython (Python 3) snapshot below I expect to see an array of Boolean values printed in the end. However, I see ONLY 1 Boolean value returned.
I am unable to identify why.
What does the character 'b' before every value in the first print statement denote? Am I using the wrong dtype=numpy.string_ in my numpy.genfromtxt() command?
Python has the distinction between unicode strings and ASCII bytes. In Python3, the default is that "strings" are unicode.
The b prefixing the "strings" indicates that the interpreter considers these to be bytes.
For the comparison, you need to compare it to bytes as well, i.e.,
... == b"1984"
and then numpy will understand that it should perform broadcasting on same-type elements.
I'm creating some fuzz tests in python and it would be invaluable for me to be able to, given a binary string, randomly flip some bits and ensure that exceptions are correctly raised, or results are correctly displayed for slight alterations on given valid binaries. Does anyone know how I might go about this in Python? I realize this is pretty trivial in lower level languages but for work reasons I've been told to do this in Python, but I'm not sure how to start this, or get the binary representation for something in python. Any ideas on how to execute these fuzz tests in Python?
Strings are immutable, so to make changes, the first thing to do is probably to convert it into a list. At the same time, you can convert the digits into ints for greater ease in manipulation.
hexstring = "1234567890deadbeef"
values = [int(digit, 16) for digit in hexstring]
Then you can flip an individual bit in any of the hex digits.
digitindex = 2
bitindex = 3
values[digitindex] ^= 1 << bitindex
If needed, you can then convert back to hex.
result = "".join("0123456789abcdef"[val] for val in values)
One thing you could try is to convert the string into a bytearray, then perform bit manipulations on each character. You can access each character by index and treat it as an integer.
For example:
>>> a = "hello world"
>>> b = bytearray(a)
>>> b[0] = b[0] ^ 5 # bitwise XOR
>>> print b # or do str(b) to convert it back to a string
mello world
You may also find this article on the Python wiki about bit manipulation to be useful. It goes over bit manipulation in Python to far greater detail, along with loads of useful tips and tricks.
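For fuzzing specifically, the bytearray idea can be wrapped in a small helper. This is only a sketch under Python 3: flip_random_bits is a hypothetical name, and the optional seed is an assumption added so that a failing test case can be reproduced.

```python
import random

def flip_random_bits(data, count, seed=None):
    """Return a copy of `data` with `count` distinct, randomly chosen bits flipped."""
    rng = random.Random(seed)
    mutated = bytearray(data)
    # Choose distinct bit positions so that no flip cancels another one out.
    for position in rng.sample(range(len(mutated) * 8), count):
        mutated[position // 8] ^= 1 << (position % 8)
    return bytes(mutated)

original = b'hello world'
fuzzed = flip_random_bits(original, 3, seed=1)
print(original, fuzzed)
```

Feeding `fuzzed` to the parser under test and asserting on the exception or result is then ordinary test code.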
I usually perform things like this in C++, but I'm using python to write a quick script and I've run into a wall.
If I have a binary list (or whatever python stores the result of an "fread" in), I can access the individual bytes in it with buffer[0], buffer[1], etc.
I need to change the bytes [8-11] to hold a new 32-bit file-size (read: there's already a filesize there, I need to update it). In C++ I would just get a pointer to the location and cast it to store the integer, but with python I suddenly realized I have no idea how to do something like this.
How can I update 4 bytes in my buffer at a specific location to hold the value of an integer in python?
EDIT
I'm going to add more because I can't seem to figure it out from the solutions (though I can see they're on the right track).
First of all, I'm on python 2.4 (and can't upgrade, big corporation servers) - so that apparently limits my options. Sorry for not mentioning that earlier, I wasn't aware it had so many fewer features.
Secondly, let's make this ultra-simple.
Lets say I have a binary file named 'myfile.binary' with the five-byte contents '4C53535353' in hex - this equates to the ascii representations for letters "L and 4xS" being alone in the file.
If I do:
f = open('myfile.binary', 'rb')
contents = f.read(5)
contents should (from Sven Marnach's answer) hold a five-byte immutable string.
Using Python 2.4 facilities only, how could I change the 4 S's held in 'contents' to an arbitrary integer value? I.e. give me a line of code that can make byte indices contents [1-4] contain the 32-bit integer 'myint' with value 12345678910.
What you need is this function:
struct.pack_into(fmt, buffer, offset, v1, v2, ...)
It's documented at http://docs.python.org/library/struct.html near the top.
Example code:
import struct
import ctypes
data=ctypes.create_string_buffer(10)
struct.pack_into(">i", data, 5, 0x12345678)
print list(data)
Similar posting: Python: How to pack different types of data into a string buffer using struct.pack_into
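On Python 3, a bytearray can serve as the writable buffer as well. The sketch below targets bytes 8-11 from the original question; the 16-byte buffer, the little-endian unsigned format '<I' and the value of new_size are assumptions made for illustration, since the question does not state the file's byte order.

```python
import struct

buf = bytearray(16)      # stand-in for data read from the file
new_size = 0x12345678    # hypothetical new file size

# Overwrite bytes 8..11 in place with a little-endian unsigned 32-bit integer.
struct.pack_into('<I', buf, 8, new_size)

# Read the field back to confirm the update.
(read_back,) = struct.unpack_from('<I', buf, 8)
print(hex(read_back), bytes(buf[8:12]))
```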
EDIT: Added a Python 2.4 compatible example:
import struct
f=open('myfile.binary', 'rb')
contents=f.read(5)
data=list(contents)
data[1:5]=struct.pack(">i", 0x12345678)  # indices 1-4, leaving the leading 'L' untouched
print data
Have a look at the struct module. You need the pack function.
EDIT:
The code:
import struct
s = "LSSSS" # your string
s = s[0] + struct.pack('<I', 1234567891) # note "shorter" constant than in your example
print s
Output:
L╙☻ЦI
struct.pack should be available in Python2.4.
Your number "12345678910" cannot be packed into 4 bytes, I shortened it a bit.
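A quick way to see the size limit, sketched under Python 3: packing the original value with a 32-bit format raises struct.error, while the shortened constant fits. The variable names are illustrative only.

```python
import struct

too_big = 12345678910   # value from the question, larger than 2**32 - 1
fits = 1234567891       # shortened constant used in the answer

try:
    struct.pack('<I', too_big)
    overflowed = False
except struct.error:
    # The value does not fit into an unsigned 32-bit field.
    overflowed = True

packed = struct.pack('<I', fits)
print(overflowed, packed)
```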
The result of file.read() is a string in Python, and it is immutable. Depending on the context of the task you are trying to accomplish, there are different solutions to the problem.
One is using the array module: Read the file directly as an array of 32-bit integers. You can modify this array and write it back to the file.
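import array

with open("filename", "rb") as f:  # binary mode, since the file holds raw data
    f.seek(0, 2)  # jump to the end to find the file size
    size = f.tell()
    f.seek(0)
    data = array.array("i")
    assert data.itemsize == 4
    data.fromfile(f, size // 4)
data[2] = new_value  # item 2 covers bytes 8-11
# use data.tofile(g) to write the data back to a new file g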
You could install the numpy module, which is often used for scientific computing.
read_data = numpy.fromfile(file=id, dtype=numpy.uint32)
Then access the data at the desired location and make your changes.
The following is just a demonstration for you to understand what really happens when the four bytes are converted into an integer.
Suppose you have a number: 15213
Decimal: 15213
Binary: 0011 1011 0110 1101
Hex: 3 B 6 D
On little-endian systems (e.g. x86 machines), this number can be represented as a length-4 byte string: b"\x6d\x3b\x00\x00", which prints as b"m;\x00\x00". To convert the four bytes into an integer, we simply do a bit of base conversion, which in this case is:
sum(n*(256**i) for i,n in enumerate(b"\x6d\x3b\x00\x00"))
This gives you the result: 15213
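On Python 3, the same conversion can be cross-checked against two standard-library routes, int.from_bytes and struct.unpack. This is just a sketch to confirm the arithmetic above, with illustrative variable names.

```python
import struct

raw = b"\x6d\x3b\x00\x00"

# Manual base-256 conversion, least significant byte first.
manual = sum(n * (256 ** i) for i, n in enumerate(raw))

# Standard-library equivalents for little-endian data.
builtin = int.from_bytes(raw, byteorder="little")
(unpacked,) = struct.unpack("<I", raw)

print(manual, builtin, unpacked)
```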