Simple way to read a mixed binary / ascii file in python? - python

I'm trying to open and interpret a P6 ppm image file by hand in python. A ppm p6 file has a few lines of plain ascii at the start, followed by the actual image data in binary (this in contrast to a ppm p3 file, which is all plain text).
I've found a few modules that can read ppm files (opencv, numpy), but I'd really like to try to do it by hand (especially since ppm is supposed to be a fairly simple image format).
When I try to open and read the file I encounter errors, whether I use open("image.ppm", "rb") or open("image.ppm", "r"), because these both expect a file that's either just binary or just plaintext.
So, more broadly: is there an easy way to open files that are mixed binary/text in python?

You can do something like this: open the file in rb mode and check whether each byte is printable. If it is, print it as a character; if not, print its hex value.
import string

with open("file name", "rb") as file:
    data = file.read()

# to print, go through the file data
for byte in data:
    # check if the byte is printable
    if chr(byte) in string.printable:
        # if it is print as character
        print(chr(byte), end="")
    else:
        # if it isn't print the hex value
        print(hex(byte), end="")
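For the P6 case specifically, one possible approach is to stay in binary mode the whole time and parse the ASCII header tokens by hand. The sketch below is not from the thread; the function name read_ppm_p6 is made up for illustration, and it assumes the usual P6 layout (magic number, width, height, maxval separated by whitespace, optional # comments, then exactly one whitespace byte before the raw pixel data).

```python
import io


def read_ppm_p6(f):
    """Parse a P6 PPM from a binary file object (hypothetical helper)."""

    def next_token():
        # Collect bytes up to the next whitespace, skipping '#' comments.
        token = b""
        while True:
            c = f.read(1)
            if not c:
                raise ValueError("unexpected end of header")
            if c == b"#":
                f.readline()  # discard the rest of the comment line
                if token:
                    return token
                continue
            if c.isspace():
                if token:
                    return token  # the terminating whitespace is consumed
                continue
            token += c

    if next_token() != b"P6":
        raise ValueError("not a P6 file")
    width = int(next_token())
    height = int(next_token())
    maxval = int(next_token())
    # Samples take two bytes each when maxval does not fit in one byte.
    bytes_per_sample = 1 if maxval < 256 else 2
    pixels = f.read(width * height * 3 * bytes_per_sample)
    return width, height, maxval, pixels


# A tiny in-memory image: 2x1 pixels, one of them containing a 0x0A byte.
sample = b"P6\n# comment\n2 1\n255\n" + bytes([255, 0, 0, 0, 10, 255])
print(read_ppm_p6(io.BytesIO(sample)))
```

With a real file you would pass open("image.ppm", "rb") instead of the BytesIO object. Because nothing is ever decoded as text, pixel bytes that happen to look like newlines are left alone.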

Related

Convert EBCDIC file to ASCII using Python 2

I need to convert the EBCDIC files to ASCII using python 2.
The sample extract from the sample file looks like the below (in notepad++)
I have tried to decode it with 'cp500' and then encode it in 'utf8' in python like below
with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('cp500').encode('utf8').strip()
    print line
And below
with io.open(path, 'rb', encoding="cp500") as input_file:
    line = input_file.read()
    print line
Also, tried with codecs
with codecs.open(path, 'rb') as input_file:
    count = 0
    line = input_file.read()
    line = codecs.decode(line, 'cp500').encode('utf8')
    print line
Also, tried importing/installing the ebcdic module, but it doesn't seem to be working properly.
here is the sample output for the first 58 chars
It does transform the data to some human-readable values for some bytes but doesn't seem to be 100 percent in ASCII. For example, the 4th character in the input file is 'P' (after the first three NUL), and if I open the file in hex mode, the hex code for 'P' is 0x50, which maps to character 'P' in ASCII. But the code above gives me the character '&' for this in output, which is the EBCDIC character for hex value 0x50.
Also, tried the below code,
with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('utf8').strip()
    print line
It gives me the below error.
UnicodeDecodeError: 'utf8' codec can't decode byte 0xf0 in position 4: invalid continuation byte
And if I change 'utf8' to 'latin1' in the above code, it generates the same output as the input file shown above, which was opened in notepad++.
Can anyone please help me with how to transform the EBCDIC to ASCII correctly?
Should I build my own mapping dictionary/table/map to transform the EBCDIC to ASCII i.e. convert the file data in hex codes and then get the corresponding ASCII char from that mapping table/dict?
If I do so, then hex 0x40 is 'Space' and 0xe2 is 'S' in EBCDIC but in ASCII 0x40 is '@' and 0xe2 doesn't have a mapping in ASCII. But as per the input data, it looks like I need EBCDIC characters in this case.
So should I construct some map by looking at the input data, decide whether I want the EBCDIC or the ASCII character for each particular hex value, and build that map accordingly for lookup?
Or do I need to follow some other approach to parse the data correctly?
Note: the non-alphanumeric data is needed as well. There are some images at particular places in the input file, encoded in those non-alphanumeric/alphanumeric chars, which we can extract, so I'm not sure whether I need to convert that to ASCII or leave it as is.
Thanks in advance
Posting for others how I was able to transform the EBCDIC to ASCII.
I learned that I only needed to convert the non-binary alpha-numeric data to ASCII from EBCDIC. To know which data will be non-binary alphanumeric data, one needs to understand the format/structure of the EBCDIC/input file. Since I knew the format/structure of the input file, I was aware of which fields/bytes of the input files needed transformation and did transform only those bytes leaving other binary data as it is in the input file.
Earlier I was trying to convert the whole file into ASCII, which was converting the binary data as well, hence distorting the data in conversion. Hence, by understanding the structure/format of the files I converted only the required alphanumeric data to ASCII and processed it. It worked.
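The post does not show the code, so here is only a rough sketch of the idea: decode the known text fields from EBCDIC and leave every other byte untouched. The function name, the (start, length) field list, and the sample record are all invented for illustration; the real offsets have to come from the file's format description. It is written for Python 3 and assumes cp500 is the right EBCDIC code page.

```python
def convert_text_fields(record, text_fields, source_encoding="cp500"):
    """Return a copy of record with only the listed fields converted to ASCII.

    text_fields is a list of (start, length) pairs (hypothetical layout).
    """
    converted = bytearray(record)
    for start, length in text_fields:
        text = bytes(record[start:start + length]).decode(source_encoding)
        # cp500 and ASCII are both single-byte, so the field length is unchanged.
        converted[start:start + length] = text.encode("ascii", errors="replace")
    return bytes(converted)


# Three binary bytes, a six-character EBCDIC text field, two more binary bytes.
sample = b"\x00\x00\x00" + "PYTHON".encode("cp500") + b"\xf0\x01"
print(convert_text_fields(sample, [(3, 6)]))
```

Characters that have no ASCII equivalent are replaced with '?' here; whether that is acceptable depends on the data.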
Options
Convert the file to text on the mainframe - they have the tools that understand the formats
You might be able to use Stingray to read the file in python
Write a Cobol program (GNU Cobol) to translate the file
Use java utilities coboltocsv or coboltoxml to convert the file
Java/Jython code with JRecord
ZOS Mainframe Files
The 2 main mainframe file formats
FB - all records (lines) are the same length
VB - each record starts with a length and is followed by the data. These files can be transferred to other platforms with or without the record length.
Cobol Files
A Cobol copybook allows you to work out
Where fields start and End
The format of the field
Some examples of Cobol fields and their representation
In this example I will look at 2 Cobol field definitions and how 4 values are represented in a file
Cobol field definition
03 fld1 pic s999v99.
03 fld2 pic s999v99 comp-3.
Representation in the file
Numeric-Value    pic s999v99    pic s999v99 comp-3
 12.34           0123D          x'01234C'
-12.34           0123M          x'01234d'
 12.35           0123E          x'01235C'
-12.35           0123N          x'01235d'
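To make the table concrete, here is a small sketch that turns both representations back into numbers. It is not part of the answer: the function names are invented, the zoned field is assumed to have already been decoded from EBCDIC into the characters shown in the table, and only the common sign conventions are handled (a trailing {, A-I for positive and }, J-R for negative in zoned fields; a final nibble of D for negative in comp-3).

```python
from decimal import Decimal

POSITIVE_OVERPUNCH = "{ABCDEFGHI"  # digits 0-9 with a positive sign
NEGATIVE_OVERPUNCH = "}JKLMNOPQR"  # digits 0-9 with a negative sign


def parse_zoned(text, scale):
    """Decode a signed zoned-decimal field such as '0123D' (pic s999v99)."""
    last = text[-1]
    if last in POSITIVE_OVERPUNCH:
        sign, digit = 1, POSITIVE_OVERPUNCH.index(last)
    elif last in NEGATIVE_OVERPUNCH:
        sign, digit = -1, NEGATIVE_OVERPUNCH.index(last)
    else:
        sign, digit = 1, int(last)  # plain digit, no overpunched sign
    value = sign * int(text[:-1] + str(digit))
    return Decimal(value).scaleb(-scale)


def unpack_comp3(raw, scale):
    """Decode a packed-decimal (comp-3) field such as x'01234C'."""
    nibbles = []
    for byte in raw:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    sign_nibble = nibbles.pop()  # the last nibble holds the sign
    value = int("".join(str(n) for n in nibbles))
    if sign_nibble == 0x0D:
        value = -value
    return Decimal(value).scaleb(-scale)


print(parse_zoned("0123M", 2), unpack_comp3(bytes.fromhex("01234d"), 2))
```

The scale argument is the number of digits after the implied decimal point, so v99 corresponds to scale=2.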
You are reading the file in binary mode so the content in the buffer is in EBCDIC. You need to decode it to ASCII. Try the following:
with open(path, 'rb') as input_file:
    line = input_file.read()
    line = line.decode('utf8').strip()
    print line
The above suggestion was tested on a z/OS machine, but if you are running on an ASCII machine you can try the following instead:
with codecs.open(path, 'rb', 'cp500') as input_file:
    line = input_file.read()
    print line
These suggestions assume you have a text file, but if the file contains binary data mixed with text you will need a different approach, as suggested by @bruce-martin.

Writing more data to file than reading?

I am currently experimenting with how Python 3 handles bytes when reading, and writing data and I have come across a particularly troubling problem that I can't seem to find the source of. I am basically reading bytes out of a JPEG file, converting them to an integer using ord(), then returning the bytes to their original character using the line chr(character).encode('utf-8') and writing it back into a JPEG file. No issue right? Well when I go to try opening the JPEG file, I get a Windows 8.1 notification saying it can not open the photo. When I check the two files against each other one is 5.04MB, and the other is 7.63MB which has me awfully confused.
def __main__():
    operating_file = open('photo.jpg', 'rb')
    while True:
        data_chunk = operating_file.read(64*1024)
        if len(data_chunk) == 0:
            print('COMPLETE')
            break
        else:
            new_operation = open('newFile.txt', 'ab')
            for character in list(data_chunk):
                new_operation.write(chr(character).encode('utf-8'))

if __name__ == '__main__':
    __main__()
This is the exact code I am using, any ideas on what is happening and how I can fix it?
NOTE: I am assuming that the list of numbers that list(data_chunk) provides is the equivalent to ord().
Here is a simple example you might wish to play with:
import sys

f = open('gash.txt', 'rb')
stuff=f.read() # stuff refers to a bytes object
f.close()
print(stuff)

f2 = open('gash2.txt', 'wb')
for i in stuff:
    f2.write(i.to_bytes(1, sys.byteorder))
f2.close()
As you can see, the bytes object is iterable, but in the for loop we get back an int in i. To convert that to a byte I use int.to_bytes() method.
When you have a code point and you encode it in UTF-8, it is possible for the result to contain more bytes than the original.
For a specific example, refer to the Wikipedia page and consider the hexadecimal value 0xA2.
This is a single binary value, less than 255, but when encoded to UTF-8 it becomes 0xC2, 0xA2.
Given that you are pulling bytes out of your source file, my first recommendation would be to simply pass the bytes directly to the writer of your target file.
If you are trying to understand how file I/O works, be wary of encode() when using a binary file mode. Binary files don't need to be encoded or decoded - they are raw data.
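The size difference can be reproduced in a few lines. This is only an illustrative sketch with made-up function names: the first function mimics the chr(...).encode('utf-8') step from the question, the second copies bytes through unchanged, as recommended above. In-memory streams stand in for the real files.

```python
import io


def reencode_as_utf8(data):
    """Mimic the question's loop: byte value -> code point -> UTF-8 bytes."""
    return b"".join(chr(b).encode("utf-8") for b in data)


def copy_bytes(src, dst, chunk_size=64 * 1024):
    """Copy a binary stream unchanged, chunk by chunk; return the byte count."""
    total = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            return total
        dst.write(chunk)
        total += len(chunk)


all_byte_values = bytes(range(256))
# Values 0x80-0xFF each turn into two bytes, so 256 bytes become 384.
print(len(all_byte_values), len(reencode_as_utf8(all_byte_values)))

source, target = io.BytesIO(all_byte_values), io.BytesIO()
copy_bytes(source, target)
print(target.getvalue() == all_byte_values)
```

If the byte values in a file are spread roughly evenly, about half of them are 0x80 or above, which gives a growth factor of about 1.5. That is in line with 5.04MB turning into 7.63MB.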

Reading binary and text from same file in Python

How does one read binary and text from the same file in Python? I know how to do each separately, and can imagine doing both very carefully, but not both with the built-in IO library directly.
So I have a file format that has large chunks of UTF-8 text interspersed with binary data. The text does not have a length written before it or a special character like "\0" delineating it from the binary data; instead, there is a large portion of text near the end that, when parsed, means "we are coming to an end".
The optimal solution would be to have the built-in file reading classes have "read(n)" and "read_char(n)" methods, but alas they don't. I can't even open the file twice, once as text and once as binary, since the return value of tell() on the text one can't be used with the binary one in any meaningful way.
So my first idea would be to open the whole file as binary and when I reach a chunk of text, read it "character by character" until I realize that the text is ending and then go back to reading it as binary. However this means that I have to read byte-by-byte and do my own decoding of UTF-8 characters (do I need to read another byte for this character before doing something with it?). If it was a fixed-width character encoding I would just read that many bytes each time. In the end I would also like the universal line endings as supported by the Python text-readers, but that would be even more difficult to implement while reading byte-by-byte.
Another easier solution would be if I could ask the text file object its real offset in the file. That alone would solve all my problems.
One way might be to use Hachoir to define a file parsing protocol.
The simple alternative is to open the file in binary mode and manually initialise a buffer and text wrapper around it. You can then switch in and out of binary pretty neatly:
import io

my_file = io.open("myfile.txt", "rb")
my_file_buffer = io.BufferedReader(my_file, buffer_size=1) # Not as performant but a larger buffer will "eat" into the binary data
my_file_text_reader = io.TextIOWrapper(my_file_buffer, encoding="utf-8")
string_buffer = ""

while True:
    while "near the end" not in string_buffer:
        string_buffer += my_file_text_reader.read(1) # read one Unicode char at a time
    # binary data must be next. Where do we get the binary length from?
    print string_buffer
    data = my_file_buffer.read(3)
    print data
    string_buffer = ""
A quicker, less extensible way might be to use the approach you've suggested in your question by intelligently parsing the text portions, reading each UTF-8 sequence of bytes at a time. The following code (from http://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python), seems to be a neat way to conservatively read UTF-8 bytes into characters from a binary file:
def get_next_character(f):
    # note: assumes valid utf-8
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # we've encountered a multibyte character
                # read another byte and try again
                c += f.read(1)
            else:
                # c was a valid char, and was yielded, continue
                c = f.read(1)
                break

# Usage:
with open("input.txt","rb") as f:
    my_unicode_str = ""
    for c in get_next_character(f):
        my_unicode_str += c
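A variation on the same idea, not taken from the answers, is to let the standard library's incremental UTF-8 decoder deal with multi-byte sequences instead of catching UnicodeDecodeError. The sketch below reads one byte at a time from a binary file object until the decoded text ends with a marker, then leaves the file positioned at the first byte after it. The function name and the "END" marker are placeholders.

```python
import codecs
import io


def read_text_until(f, marker, encoding="utf-8"):
    """Read decoded text from binary file f until it ends with marker."""
    decoder = codecs.getincrementaldecoder(encoding)()
    text = ""
    while not text.endswith(marker):
        byte = f.read(1)
        if not byte:
            raise EOFError("marker not found before end of file")
        # Returns "" while a multi-byte character is still incomplete.
        text += decoder.decode(byte)
    return text


stream = io.BytesIO("héllo END".encode("utf-8") + b"\x00\xff\x10")
print(read_text_until(stream, "END"))  # the text chunk, marker included
print(stream.read(3))                  # the binary bytes that follow
```

Since the file is only ever read in binary mode, tell() and seek() keep their normal byte-offset meaning. Reading byte by byte is slow for large text chunks, so this is more a starting point than a tuned solution.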

python, crawler for website, stored the jpg and png files, but they can't be opened. why?

Win8.1 32-bit, Python 3.4.
I made a web robot for www.douban.com to fetch the main HTML page plus its jpg and png files.
But when it finished, I couldn't open the picture files (Windows Photo Viewer can't open this picture, blah blah blah).
Questions:
1: Why can't the pictures be opened?
2: If line 35 is edited like this: dbr.write(data), the command line reports: TypeError: 'str' does not support the buffer interface.
The same thing happens for lines 51 and 59.
But when line 35 is dbr.write(bytes(data, 'UTF-8')), I get the right HTML file. So I did the same on lines 51 and 59 for the picture files, but something went wrong. I suspected a bug in write(), but I can't figure out what exactly is wrong.
Here is the code.
import urllib.request
import os
import re

#make dirs for douban_robot, jpg, png
dirpath = 'D:/Pwork/webrobot/'
if not os.path.isdir(dirpath):
    os.makedirs(dirpath)
jpg_path = dirpath + 'jpgfiles/'
png_path = dirpath + 'pngfiles/'
if not os.path.isdir(jpg_path):
    os.makedirs(jpg_path)
if not os.path.isdir(png_path):
    os.makedirs(png_path)
douban_robot = dirpath + 'douban.html'
url = 'http://www.douban.com'

#get .html
data = urllib.request.urlopen(url).read().decode('UTF-8')
with open(douban_robot, 'wb') as dbr:
    dbr.write(bytes(data, 'UTF-8'))
    dbr.close()

# create regex
re_jpg = re.compile(r'<img src="(http.+?.jpg)"')
re_png = re.compile(r'<img src="(http:.+?.png)"')
jpg_data = re_jpg.findall(data)
png_data = re_png.findall(data)

# for test jpg and png date
print(jpg_data, png_data)

#get jpg files
i = 1
for image in jpg_data:
    jpg_name = jpg_path + str(i)+'.jpg'
    #urllib.request.urlretrieve(image, jpg_name)
    with open(jpg_name, 'wb') as jpg_file:
        jpg_file.write(bytes(image, 'UTF-8'))
        jpg_file.close()
    i += 1

for image in png_data:
    png_name = png_path + str(i)+'.png'
    #urllib.request.urlretrieve(image, png_name)
    with open(png_name, 'wb') as png_file:
        png_file.write(bytes(image, 'UTF-8'))
        png_file.close()
    i += 1
The variables jpg_data and png_data are lists containing captured URLs. Your loops iterate over each URL, placing the URL string in the variable image. Then, in both loops, you write the URL string to the file, not the actual image. It actually looks like the commented out urllib lines would do the trick, instead of what you're currently doing now.
The .write() function expects you to give it an object that matches the mode of the file. When you call open(..., 'wb'), you're saying to open the file in write and binary mode, which means that you need to give it bytes instead of str.
Bytes are the fundamental way everything is stored in a computer. Everything is a series of bytes -- the data on your hard drive, and the data you send and receive on the Internet. Bytes don't really have meaning on their own -- each one is just 8 bits strung together. The meaning depends on how you interpret the bytes. For instance, you could interpret a single byte as representing a number from 0 to 255. Or, you could interpret it as a number from -128 to 127 (both of these are common). You could also assign these "numbers" to characters, and interpret a sequence of bytes as text. However, this only allows you to represent 256 characters, and there are many more than that in the world's various languages. So, there are multiple ways of representing text as sequences of bytes. These are called "character encodings". The most popular modern one is "UTF-8".
In Python, a bytes object is just a series of bytes. It has no special meaning -- nobody has said what it represents yet. If you want to use that as text, you need to decode it, using one of the character encodings. Once you do that (.decode('UTF-8')), you have a str object. In order to write it to disk (or the network), your str will have to eventually be encoded back into bytes. When you open a file in text mode, Python chooses your computer's default encoding, and it will decode everything you read using that, and encode everything you write with it. However, when you open a file in b mode, Python expects that you will give it bytes, and so it throws an error when you give it a str instead. Since you know the HTML file you downloaded and put in data is text, it would have been best for you to save it to a file in text mode. However, encoding it as UTF-8 and writing it to a binary file works too, as long as whatever reads the file later also decodes it as UTF-8. In general, when you have a str and you want to write it to a file, open the file in text mode (just don't pass b in the mode parameter) and let Python pick the encoding, since it knows better than you do!
For more info on the character sets and encoding stuff (which I only glossed over), you really should read this article.
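As a rough sketch of what the image loops could do instead: fetch the bytes behind each URL and write those bytes, untouched, to a file opened in 'wb' mode. The function name and the opener parameter are invented here (the parameter exists only so the sketch can be tried without network access); urllib.request.urlopen and response.read() are the same calls the question already uses.

```python
import io
import os
import tempfile
import urllib.request


def download_binary(url, dest_path, opener=urllib.request.urlopen):
    """Fetch url and save the raw response bytes to dest_path.

    opener is a hypothetical hook for testing; it defaults to urlopen.
    """
    with opener(url) as response:
        payload = response.read()  # bytes, not str: no decode/encode step
    with open(dest_path, "wb") as out:
        out.write(payload)
    return len(payload)


# Offline demonstration with a stand-in opener returning a PNG signature.
fake_png = b"\x89PNG\r\n\x1a\n"
target = os.path.join(tempfile.mkdtemp(), "1.png")
print(download_binary("http://example.invalid/a.png", target,
                      opener=lambda url: io.BytesIO(fake_png)))
```

In the question's loops this would replace the jpg_file.write(bytes(image, 'UTF-8')) line, with image as the URL and jpg_name or png_name as the destination. The commented-out urllib.request.urlretrieve(image, jpg_name) call does much the same thing in one step.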

Stop python from writing a carriage return when hex value is 0x0A

I decided to brush up on my python, and for the first project I thought I would write a script to help me with some homework for another class I have. For this class, I have an excel spread sheet that converts some data to hex. I would normally have to manually type out the hex values into the program I'm using or I could load the binary data from a file.
The script I have set up right now lets me copy the section in the spread sheet and paste the contents into a temporary text file. When I run my script it creates a file with the binary data. It works except when I have a hex value of 0x0A: the way I convert the string value to hex turns it into '\n', so the binary file has 0D 0A rather than just 0A.
The way I get the hex value:
hexvalue = chr(int(string_value,16))
The way I write to the file:
f = file("code_out", "w")
for byte in data:
    f.write(byte)
f.close()
What's the proper way of doing this?
I think you need to open the file for binary writing!
f = open('file', 'wb')
Open the file in "binary mode"
On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written.
That is, under non-"binary mode" in Windows, when a "\n" (LF, \x0A) is written to the file stream the actual result is that the byte sequence "\r\n" (CR LF, \x0D \x0A) is written. See newline for additional notes.
This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.
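Putting the pieces together for Python 3, a minimal sketch might look like this. The function name and the sample values are made up; the point is that the hex strings are turned into a bytes object and written to a file opened with 'wb', so a 0x0A value is stored as a single byte on every platform.

```python
import os
import tempfile


def write_hex_values(hex_strings, path):
    """Convert strings like '0A' to raw bytes and write them in binary mode."""
    payload = bytes(int(s, 16) for s in hex_strings)
    with open(path, "wb") as out:  # 'b' disables newline translation
        out.write(payload)
    return payload


out_path = os.path.join(tempfile.mkdtemp(), "code_out")
write_hex_values(["48", "0A", "FF"], out_path)
with open(out_path, "rb") as check:
    print(check.read())
```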
