How does one read binary and text from the same file in Python? I know how to do each separately, and can imagine doing both very carefully, but not both with the built-in IO library directly.
So I have a file whose format has large chunks of UTF-8 text interspersed with binary data. The text has no length prefix or special character like "\0" delimiting it from the binary data; instead, a large portion of text near the end, when parsed, means "we are coming to an end".
The optimal solution would be to have the built-in file reading classes have "read(n)" and "read_char(n)" methods, but alas they don't. I can't even open the file twice, once as text and once as binary, since the return value of tell() on the text one can't be used with the binary one in any meaningful way.
So my first idea would be to open the whole file as binary and, when I reach a chunk of text, read it "character by character" until I realize the text is ending, then go back to reading it as binary. However, this means I have to read byte by byte and do my own decoding of UTF-8 characters (do I need to read another byte for this character before doing something with it?). If it were a fixed-width character encoding, I would just read that many bytes each time. In the end I would also like the universal newline handling supported by Python's text readers, but that would be even more difficult to implement while reading byte by byte.
Another easier solution would be if I could ask the text file object its real offset in the file. That alone would solve all my problems.
One way might be to use Hachoir to define a file parsing protocol.
The simple alternative is to open the file in binary mode and manually initialise a buffer and text wrapper around it. You can then switch in and out of binary pretty neatly:
import io

my_file = io.open("myfile.txt", "rb")
my_file_buffer = io.BufferedReader(my_file, buffer_size=1)  # not as performant, but a larger buffer would "eat" into the binary data
my_file_text_reader = io.TextIOWrapper(my_file_buffer, encoding="utf-8")

string_buffer = ""
while True:
    while "near the end" not in string_buffer:
        string_buffer += my_file_text_reader.read(1)  # read one Unicode char at a time
    # binary data must be next. Where do we get the binary length from?
    print(string_buffer)
    data = my_file_buffer.read(3)
    print(data)
    string_buffer = ""
A quicker, less extensible way might be to use the approach you've suggested in your question, intelligently parsing the text portions by reading one UTF-8 byte sequence at a time. The following code (from http://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python) seems to be a neat way to conservatively read UTF-8 bytes into characters from a binary file:
def get_next_character(f):
    # note: assumes valid utf-8
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # we've encountered a multibyte character
                # read another byte and try again
                c += f.read(1)
            else:
                # c was a valid char and was yielded; continue
                c = f.read(1)
                break
# Usage:
with open("input.txt", "rb") as f:
    my_unicode_str = ""
    for c in get_next_character(f):
        my_unicode_str += c
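To tie this back to your format, a hedged sketch (the sentinel string and the 3-byte binary read below are placeholders for whatever your format actually specifies): because the generator only consumes the bytes of the characters it has already yielded, the file object sits exactly at the start of the binary data when you break out of the loop.

with open("input.txt", "rb") as f:
    text = ""
    for c in get_next_character(f):
        text += c
        if text.endswith("near the end"):
            # the text chunk is over; f now points at the binary data
            break
    data = f.read(3)  # read the following binary bytes directly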
I'm trying to open and interpret a P6 PPM image file by hand in Python. A PPM P6 file has a few lines of plain ASCII at the start, followed by the actual image data in binary (this is in contrast to a PPM P3 file, which is all plain text).
I've found a few modules that can read PPM files (opencv, numpy), but I'd really like to try to do it by hand (especially since PPM is supposed to be a fairly simple image format).
When I try to open and read the file I encounter errors, whether I use open("image.ppm", "rb") or open("image.ppm", "r"), because these both expect a file that's either just binary or just plaintext.
So, more broadly: is there an easy way to open files that are mixed binary/text in Python?
You can do something like this: open the file in "rb" mode and check whether the current byte is printable; if it is, print it as a character, otherwise print its hex value.
import string

with open("file name", "rb") as file:
    data = file.read()

# to print, go through the file data
for byte in data:
    # check if the byte is printable
    if chr(byte) in string.printable:
        # if it is, print it as a character
        print(chr(byte), end="")
    else:
        # if it isn't, print the hex value
        print(hex(byte), end="")
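For the PPM case specifically, the usual approach is to open the file in binary mode, parse the ASCII header tokens, and then read the raw pixel data. A minimal sketch, assuming the common layout of magic number, dimensions, and maxval on separate lines (real PPM headers may also contain '#' comment lines, which this ignores):

with open("image.ppm", "rb") as f:
    magic = f.readline().split()[0]       # b'P6'
    width, height = map(int, f.readline().split())
    maxval = int(f.readline())
    pixels = f.read(width * height * 3)   # raw RGB triples follow the header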
I am currently experimenting with how Python 3 handles bytes when reading and writing data, and I have come across a particularly troubling problem whose source I can't seem to find. I am basically reading bytes out of a JPEG file, converting them to integers using ord(), then returning the bytes to their original characters using the line chr(character).encode('utf-8') and writing them back into a JPEG file. No issue, right? Well, when I go to try opening the JPEG file, I get a Windows 8.1 notification saying it cannot open the photo. When I check the two files against each other, one is 5.04MB and the other is 7.63MB, which has me awfully confused.
def __main__():
    operating_file = open('photo.jpg', 'rb')
    while True:
        data_chunk = operating_file.read(64 * 1024)
        if len(data_chunk) == 0:
            print('COMPLETE')
            break
        else:
            new_operation = open('newFile.txt', 'ab')
            for character in list(data_chunk):
                new_operation.write(chr(character).encode('utf-8'))

if __name__ == '__main__':
    __main__()
This is the exact code I am using, any ideas on what is happening and how I can fix it?
NOTE: I am assuming that the list of numbers that list(data_chunk) provides is the equivalent to ord().
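That assumption is correct, for what it's worth: in Python 3, iterating over a bytes object yields the same integers that ord() would give for the corresponding characters. A quick sanity check:

# iterating bytes in Python 3 yields ints, identical to applying ord() per character
assert list(b'AB') == [ord('A'), ord('B')] == [65, 66]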
Here is a simple example you might wish to play with:
import sys

f = open('gash.txt', 'rb')
stuff = f.read()  # stuff refers to a bytes object
f.close()

print(stuff)

f2 = open('gash2.txt', 'wb')
for i in stuff:
    f2.write(i.to_bytes(1, sys.byteorder))
f2.close()
As you can see, the bytes object is iterable, but in the for loop we get back an int in i. To convert that int to a single byte, I use the int.to_bytes() method.
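An equivalent alternative (a small variation on the loop above, assuming the same stuff and f2): the bytes() constructor turns a list of ints into raw bytes, so wrapping each int in a length-one list avoids the sys.byteorder argument:

for i in stuff:
    f2.write(bytes([i]))  # bytes([65]) == b'A'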
When you have a code point and you encode it in UTF-8, it is possible for the result to contain more bytes than the original.
For a specific example, refer to the Wikipedia page on UTF-8 and consider the hexadecimal value 0xA2.
This is a single byte value, less than 255, but when encoded as UTF-8 it becomes two bytes: 0xC2, 0xA2.
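You can see the inflation directly; a minimal demonstration (Python 3), where chr() treats the raw byte as code point U+00A2:

raw = bytes([0xA2])                     # one byte of raw data
inflated = chr(raw[0]).encode('utf-8')  # re-encode it as UTF-8, as the question's code does
print(raw, len(raw))                    # b'\xa2' 1
print(inflated, len(inflated))          # b'\xc2\xa2' 2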
Given that you are pulling bytes out of your source file, my first recommendation would be to simply pass the bytes directly to the writer of your target file.
If you are trying to understand how file I/O works, be wary of encode() when using a binary file mode. Binary files don't need to be encoded or decoded; they are raw data.
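A minimal sketch of that recommendation, copying the bytes through untouched (copy.jpg is a placeholder name):

with open('photo.jpg', 'rb') as src, open('copy.jpg', 'wb') as dst:
    while True:
        chunk = src.read(64 * 1024)
        if not chunk:  # read() returns b'' at end of file
            break
        dst.write(chunk)  # raw bytes in, raw bytes out: no encoding step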
I want to read a file with data, coded in hex format:
01ff0aa121221aff110120...etc
The files contain >100,000 such bytes, some more than 1,000,000 (they come from DNA sequencing).
I tried the following code (and other similar):
filele = 1234563
f = open('data.geno', 'r')
c = []
for i in range(filele):
    a = f.read(1)
    b = a.encode("hex")
    c.append(b)
f.close()
This gives each byte separately, "aa" "01" "f1" etc., which is perfect for me!
This works fine up to (in this case) byte no. 905, which happens to be "1a". I also tried the ord() function, which stopped at the same byte.
Is there perhaps a simple solution?
Simple solution is binascii:
import binascii

# Open in binary mode (so you don't read two-byte line endings on Windows as one byte)
# and use a with statement (always do this, to avoid leaked file descriptors and unflushed files)
with open('data.geno', 'rb') as f:
    # Slurp the whole file and efficiently convert it to hex all at once
    hexdata = binascii.hexlify(f.read())
This just gets you a str of the hex values, but it does it much faster than what you're trying to do. If you really want a bunch of length 2 strings of the hex for each byte, you can convert the result easily:
hexlist = map(''.join, zip(hexdata[::2], hexdata[1::2]))
which will produce the list of len 2 strs corresponding to the hex encoding of each byte. To avoid temporary copies of hexdata, you can use a similar but slightly less intuitive approach that avoids slicing by using the same iterator twice with zip:
hexlist = map(''.join, zip(*[iter(hexdata)]*2))
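To see why the iterator trick pairs up characters, try it on a tiny input (Python 2 shown, matching the code above). The list [iter(hexdata)]*2 contains the same iterator object twice, so each output tuple consumes two characters:

hexdata = '01ff0a'
print(zip(*[iter(hexdata)] * 2))                # [('0', '1'), ('f', 'f'), ('0', 'a')]
print(map(''.join, zip(*[iter(hexdata)] * 2)))  # ['01', 'ff', '0a']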
Update:
For people on Python 3.5 and higher, bytes objects gained a .hex() method, so no module is required to convert from raw binary data to ASCII hex. The block of code at the top simplifies to just:
with open('data.geno', 'rb') as f:
    hexdata = f.read().hex()
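The reverse direction is also built in: bytes.fromhex() turns an ASCII hex string back into raw bytes, should you need to round-trip:

raw = bytes.fromhex(hexdata)  # one byte per hex-digit pair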
Just an additional note to these: make sure to add a break condition to your read loop, or it will just keep going.
def HexView():
    with open(<yourfilehere>, 'rb') as in_file:
        while True:
            hexdata = in_file.read(16).hex()  # I like to read 16 bytes in, then newline it
            if len(hexdata) == 0:  # break the loop once no more binary data is read
                break
            print(hexdata.upper())  # I also like it all in caps
If the file is encoded in hex format, shouldn't each byte be represented by 2 characters? So
c = []
with open('data.geno', 'rb') as f:
    b = f.read(2)
    while b:
        c.append(b.decode('hex'))
        b = f.read(2)
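Note that decode('hex') only exists on Python 2. A rough Python 3 equivalent for the same two-character chunks (my adaptation, untested against the actual data) would be:

c.append(bytes.fromhex(b.decode('ascii')))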
Thanks for all the interesting answers!
The simple solution that worked immediately was to change "r" to "rb", so:
f = open('data.geno', 'r')   # doesn't work
f = open('data.geno', 'rb')  # works fine
The data in this case is actually only two bits per value, so one byte contains four values: binary 00, 01, 10, 11.
Yours!
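Since each byte packs four 2-bit values, a hedged sketch of unpacking them (the bit order within a byte is an assumption here; your format may pack them the other way round):

def unpack_2bit(byte):
    # extract four 2-bit values, lowest bits first (assumed order)
    return [(byte >> shift) & 0b11 for shift in (0, 2, 4, 6)]

print(unpack_2bit(0x1B))  # 0x1B = 0b00011011 -> [3, 2, 1, 0]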
I'm writing a program to 'manually' convert a CSV file into proper JSON syntax, using a short Python script. From the input file I use readlines() to format the file as a list of rows, which I manipulate and concatenate into a single string, which is then output into a separate .txt file. The output, however, contains gibberish instead of the Hebrew characters that were present in the input file, and the output is double-spaced horizontally (a whitespace character is added between each character). As far as I can understand, the problem has to do with the encoding, but I haven't been able to figure out what it is. When I detect the encoding of the input and output files (using the .encoding attribute), they both return None, which means they use the system default. Technical details: Python 2.7, Windows 7.
While there are a number of questions out there on this topic, I didn't find a direct answer to my problem.
Detecting the system defaults won't help me in this case, because I need the program to be portable.
Here's the code:
def txt_to_JSON(csv_list):
    ...some manipulation of the list...
    return JSON_string

file_name = "input_file.txt"
my_file = open(file_name)

# make each line of the input file a value in a list
lines = my_file.readlines()

# break up each line into a list such that each 'column' is a value in that list
for i in range(0, len(lines)):
    lines[i] = lines[i].split("\t")

J_string = txt_to_JSON(lines)

json_file = open("output_file.txt", "w+")
json_file.write(J_string)
json_file.close()
All data needs to be encoded to be stored on disk. If you don't know the encoding, the best you can do is guess. There's a library for that: https://pypi.python.org/pypi/chardet
I highly recommend Ned Batchelder's presentation http://nedbatchelder.com/text/unipain.html for details.
There's an explanation about the use of "unicode" as an encoding name on Windows: What's the difference between Unicode and UTF-8?
TL;DR:
Microsoft uses UTF-16 as the encoding for Unicode strings, but decided to call it "unicode", as that is also what they use internally.
Even if Python 2 is a bit lenient about string/unicode conversions, you should get used to always decoding on input and encoding on output.
In your case:

filename = 'where your data lives'
with open(filename, 'rb') as f:
    encoded_data = f.read()
decoded_data = encoded_data.decode("utf-16")

# do stuff, resulting in result (all on unicode strings)
result = text_to_json(decoded_data)
encoded_result = result.encode("utf-16")  # really, just using UTF-8 for everything makes things a lot easier

outfile = 'where your data goes'
with open(outfile, 'wb') as f:
    f.write(encoded_result)
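Alternatively, a sketch that lets io.open do the decoding and encoding for you (works on both Python 2 and 3, reusing the text_to_json placeholder from above and assuming result is a unicode string):

import io

with io.open(filename, 'r', encoding='utf-16') as f:
    decoded_data = f.read()

result = text_to_json(decoded_data)

with io.open(outfile, 'w', encoding='utf-8') as f:
    f.write(result)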
You need to tell Python to use the Unicode character encoding to decode the Hebrew characters.
Here's a link to how you can read Unicode characters in Python: Character reading from file in Python
I'm programming in Python 3 and I'm having a small problem to which I can't find any reference on the net.
As far as I understand it, the default string encoding is UTF-16, but I must work with UTF-8, and I can't find the command that will convert from the default encoding to UTF-8.
I'd appreciate your help very much.
In Python 3 there are two different datatypes that are important when you are working with string manipulation. First there is the string class, an object that represents Unicode code points. The important thing to grasp is that this string is not a sequence of bytes, but really a sequence of characters. Second, there is the bytes class, which is just a sequence of bytes, often representing a string stored in an encoding (like UTF-8 or ISO-8859-15).
What does this mean for you? As far as I understand, you want to read and write UTF-8 files. Let's make a program that replaces all 'ć' characters with 'ç':
def main():
    # Open an output file first. By giving an encoding, we let Python know that
    # anything we print to the file should be encoded as UTF-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. We give open() the encoding so it returns Unicode strings.
        for line in open('input_file', encoding='utf-8'):
            # Replace the characters we want. A string literal in Python 3 is
            # automatically Unicode, so no worries about encoding there. Because
            # the output file was opened with the utf-8 encoding, printing to it
            # encodes the whole string to UTF-8. (Note file=out_file: without it,
            # print would write to stdout; end='' avoids doubling the newlines.)
            print(line.replace('ć', 'ç'), end='', file=out_file)
So when should you use bytes? Not often. An example I can think of would be when you read something from a socket. If you have the data in a bytes object, you can make it a Unicode string by calling bytes.decode('encoding'), and vice versa with str.encode('encoding'). But as said, you probably won't need it.
Still, because it is interesting, here the hard way, where you encode everything yourself:
def main():
    # Open the file in binary mode, so we write bytes to it instead of strings
    with open('output_file', 'wb') as out_file:
        # Read every line. Again, we open it in binary mode, so we get bytes
        for line_bytes in open('input_file', 'rb'):
            # Convert the bytes to a string
            line_string = line_bytes.decode('utf-8')
            # Replace the characters we want
            line_string = line_string.replace('ć', 'ç')
            # Encode the string back to bytes
            out_bytes = line_string.encode('utf-8')
            # Write the bytes (print() would not work on a binary file)
            out_file.write(out_bytes)
Good reading about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Really recommended read!
Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(P.S. As you see, I didn't mention UTF-16 in this post. I actually don't know whether Python uses it as its internal encoding or not, but it is totally irrelevant: the moment you are working with a string, you work with characters (code points), not bytes.)