How to read a binary file as a string efficiently in Python?

I am trying to read a file and pass its contents through a data redundancy and cryptography algorithm that takes a string. How can I properly read this file in as a string? Since these are raw binary bytes, I need an encoding that maps every possible byte value to a character. So far I have tried 'cp866', but with this encoding the file is read very, very slowly.
How can I read from the file as a string just as the UNIX cat command or the Windows type command does?
This is my file
character_encoding = 'cp866'
with open(r'Insert_Your_Large_Binary_File_Here',
          encoding=character_encoding) as file:
    text = file.read()
print(text)
How can I speed up this function or better replicate the string generation that the cat and type commands yield?
How do I print the data to STDOUT? Is print sufficient?
Essentially, I am looking for a cross-platform Python script to replicate this data.
This is an extension of my previous question
Any help or pointing me to proper Python package would be greatly appreciated.
Update: When I don't specify an encoding, I get the following error:
Traceback (most recent call last):
  File "filename_redacted", line 13, in <module>
    text = file.read()
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 34: character maps to <undefined>
Based off this question, it looks like I should be using this ancient MSDOS encoding. Is there really no better way to do this?
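One possible approach, offered as a sketch rather than a definitive answer: cat and type do not decode anything, they just copy bytes, and Python 3 can do the same by opening the file in 'rb' mode and writing to sys.stdout.buffer. If the algorithm really insists on a str, 'latin-1' maps each byte value 0-255 to the code point with the same number, so the conversion is lossless in both directions. The names cat, read_as_str and CHUNK_SIZE below are hypothetical and chosen only for illustration.

```python
import sys

CHUNK_SIZE = 64 * 1024  # arbitrary illustrative chunk size


def cat(path, out=None):
    """Copy the raw bytes of `path` to a binary stream (stdout by default)."""
    if out is None:
        out = sys.stdout.buffer  # the binary layer underneath sys.stdout
    with open(path, 'rb') as src:
        while True:
            chunk = src.read(CHUNK_SIZE)
            if not chunk:
                break
            out.write(chunk)


def read_as_str(path):
    """Return the file contents as a str with exactly one character per byte."""
    with open(path, 'rb') as src:
        return src.read().decode('latin-1')
```

Calling text.encode('latin-1') on the result of read_as_str gives back the original bytes, which is what makes this encoding suitable as a byte-transparent wrapper. print(text) is not a faithful replacement for cat, because print re-encodes the string with the console encoding.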

Related

Python3 - Cannot read docx, odt file - UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte

I am trying to split a large docx file into small files. To do that, I read the file in Python 3.6 with the following code:
with open('h.docx', 'r') as f:
    a = f.read()
It throws this error:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xea in position 10: invalid continuation byte
h.docx is created using LibreOffice Calc with just 'hello world' in it as content. I can read this successfully in Python 2.7 without any errors.
I tried
with open('h.docx', 'r', encoding='latin-1') as f:
    a = f.read()
With this I can read the file without any errors, but when the data is written to another file, the original contents are lost.
I also tried errors='surrogateescape', but when written to another file the original contents are lost.

Not really an answer, but too long for a comment. What you are doing is just nonsense: you are trying to read a ".docx" file as if it were a text file, which it is not. It is a complex format in which several XML files (and possibly others...) are packed into a single zip file. You should not even contemplate processing such a file by hand unless you want:
trivial changes, such as replacing one word with another
a read-only operation, such as searching for a particular string
to write a docx processing package (good luck with it)
and even those would not be simple operations.
What is possible:
process the file as a binary file when you only treat it as opaque content, for example to send it over a network connection
use a dedicated library like python-docx
under Windows, use the automation interface of Word to have Word itself process the file (comtypes could help here)
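The first option can be sketched with the standard library alone. This is a minimal illustration, not a docx processor: copy_opaque and list_members are hypothetical helper names, and the second function relies only on the fact stated above that a .docx is a zip archive.

```python
import shutil
import zipfile


def copy_opaque(src_path, dst_path):
    """Copy a file byte for byte without ever decoding it."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        shutil.copyfileobj(src, dst)


def list_members(docx_path):
    """Return the names of the files packed inside a .docx (zip) archive."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.namelist()
```

Because both files are opened in binary mode, no encoding is involved and the copy is identical to the original. Cutting the byte stream into arbitrary pieces, however, does not give smaller valid documents, since each piece would no longer be a complete zip archive.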

python file open() throws exception for non utf-8 character

I wrote the simplest python program that exhibits the error I need help with.
lines_read = 0
urllist_file = open('../fall11_urls.txt', 'r')
for line in urllist_file:
    lines_read += 1
print('line count:', lines_read)
I run this on most files and of course it works as expected, but "fall11_urls.txt" is a 14-million-line text file that contains URLs, one per line. Some of these lines contain apparently non-UTF-8 characters, and I get the error quoted below. I need access to every one of these URLs. What is the best way to handle this? These URLs can be "anything"; some are 400 characters of random characters, as in "https://bbswigr.fty.com/_Kcsnuk4J71A/RjzGhXZGmfI/AAAARg/xP3FO-Xbt68/s320/Axolo.jpg". Some of these strings contain bytes such as 0x96. I need my Python program to be robust against whatever might be in the file. (If it matters, this runs on Ubuntu 16.04.)
Here is the error
Traceback (most recent call last):
  File "./count_lines.py", line 2, in <module>
    for line in urllist_file:
  File "/home/chris/.virtualenvs/cvml3/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x96 in position 5529: invalid start byte
One more bit of information: iconv finds the same problem with the same file. See below:
$ iconv ../fall11_urls.txt >> /dev/null
iconv: illegal input sequence at position 1042953625
My current workaround is UGLY. I use iconv to find the problem, hand-edit the file in vi, then process it, and keep doing this until it is clean. But I have MILLIONS of lines in many files to process. The URLs mostly do work after I hand-correct them, so these are not noise or "flipped bits".

Answering my own question to let you all know what worked. Yes, opening in binary worked; I tried it, but then I don't have a "text" file. I read up on encodings and found that the following works, because every byte value is valid in Latin-1. It is the safest thing to do.
urllist_file = open('../fall11_urls.txt', 'r', encoding="latin-1")
It seems that anyone opening text files they get from other people, with no way to control or know in advance what is inside, might be well advised to use "latin-1", because there are no invalid byte values in Latin-1.
Thanks. The suggestion to open in binary got me to investigate what other parameters open() accepts. I'm new to Python and was astounded to find that strings are not just a list of bytes. (That is what 20+ years of working in C will condition you to expect.)
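To illustrate why Latin-1 never raises, here is a small sketch; the function name count_lines_latin1 is made up for this example. One caveat worth keeping in mind: the decoding cannot fail, but bytes above 0x7F may come out as the wrong characters if the file was really written in another encoding (0x96 is an en dash in cp1252 but a control character in Latin-1).

```python
def count_lines_latin1(path):
    """Count lines; Latin-1 decoding cannot fail on any byte value."""
    count = 0
    with open(path, 'r', encoding='latin-1') as handle:
        for _ in handle:
            count += 1
    return count


# Every byte value 0..255 decodes to the code point with the same number,
# so decoding and re-encoding returns the original bytes unchanged.
all_bytes = bytes(range(256))
round_tripped = all_bytes.decode('latin-1').encode('latin-1')
```

Since the round trip is lossless, a line read this way can be turned back into its exact original bytes with line.encode('latin-1') whenever the real URL bytes are needed.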

Did you try crook method? This should work.
urllist_file = open('../fall11_urls.txt', 'rb')
Then convert to whatever format you want.
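The "convert afterwards" step could look roughly like this. It is only a sketch: iter_decoded_lines is a hypothetical helper, and the choice of UTF-8 first with Latin-1 as the fallback is an assumption about the data, not something the file itself guarantees.

```python
def iter_decoded_lines(path, primary='utf-8', fallback='latin-1'):
    """Yield each line as str, trying `primary` first, then `fallback`."""
    with open(path, 'rb') as handle:
        for raw in handle:  # binary mode splits lines on b'\n' only
            try:
                yield raw.decode(primary)
            except UnicodeDecodeError:
                # Latin-1 accepts every byte value, so this cannot fail.
                yield raw.decode(fallback)
```

Decoding line by line means one bad byte only affects the line it appears on, and the well-formed UTF-8 lines are still decoded correctly.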

What kind of Encoding does a standard midi file use?

Here's what brought this question up:
with open(path + "/OneChance1.mid") as f:
    for line in f.readline():
        print(line)
Here I am simply trying to read a MIDI file to scour its contents. I then receive this error message: UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 153: character maps to <undefined>
If I use open()'s encoding param like so: with open(path + "/OneChance1.mid", encoding='utf-8') as f: then I receive this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 13: invalid start byte
If I change the encoding param to ascii I get another error about an ordinal being out of range. Lastly I tried utf-16 and it said that the file didn't start with BOM (which made me smile for some reason). Also, if I ignore the errors I get characters that resemble nothing of the kind of data I am expecting. My expectations are based on this source: http://www.sonicspot.com/guide/midifiles.html
Anyway, does anyone know what kind of encoding a midi file uses? My research is coming up short in that regard so I thought it would be worth asking on SO. Or maybe someone can point out some other possibilities or blunders?

MIDI files are binary content. By opening the file as a text file however, Python applies the default system encoding in trying to decode the text as Unicode.
Open the file in binary mode instead:
with open(midifile, 'rb') as mfile:
    leader = mfile.read(4)
    if leader != b'MThd':
        raise ValueError('Not a MIDI file!')
You'd have to study the MIDI standard file format if you wanted to learn more from the file. Also see What is the structure of a MIDI file?
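As a small step beyond checking the leader, the rest of the header chunk can be unpacked with struct. This sketch assumes the Standard MIDI File header layout (the 'MThd' tag, a 32-bit big-endian chunk length, then three 16-bit fields: format, number of tracks, division); read_midi_header is a hypothetical name.

```python
import struct


def read_midi_header(path):
    """Return (format, track_count, division) from a Standard MIDI File."""
    with open(path, 'rb') as mfile:
        header = mfile.read(14)
    if len(header) < 14:
        raise ValueError('Not a MIDI file!')
    # Big-endian: 4-byte tag, 32-bit chunk length, three 16-bit fields.
    tag, length, fmt, ntracks, division = struct.unpack('>4sIHHH', header)
    if tag != b'MThd' or length < 6:
        raise ValueError('Not a MIDI file!')
    return fmt, ntracks, division
```

Everything after the header is a sequence of track chunks, which need the format description to interpret; none of it is text in any encoding.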

It's a binary file, it's not text using a text encoding like you seem to expect.
To open a file in binary mode in Python, pass a string containing "b" as the second argument to open().
This page contains a description of the format.

How do I convert LF to CRLF?

I found a list of the majority of English words online, but the line breaks are of unix-style (encoded in Unicode: UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html
How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.
This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard
It should be:
bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard
How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).
I don't know where to start because I've never worked with unicode. Thanks in advance!
Using rU as the parameter (as suggested), with this in my code:
with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
    my_file.close()
I get this error:
Traceback (most recent call last):
  File "<pyshell#5>", line 1, in <module>
    addWords('B Words')
  File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
    for line in my_file:
  File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>
Can anyone help me with this?

Instead of converting, you should be able to just open the file using Python's universal newline support:
f = open('words.txt', 'rU')
(Note the U.)
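A note for current Python versions: in Python 3, text mode already uses universal newlines by default, and the 'U' flag was deprecated and then removed in Python 3.11. A rough equivalent is sketched below; read_words is a hypothetical helper, and the UTF-8 default is an assumption taken from the question's statement that the files are UTF-8.

```python
def read_words(path, encoding='utf-8'):
    """Return one word per line, whatever newline convention the file uses."""
    words = []
    # newline=None (the default) translates '\r\n' and '\r' to '\n' on read.
    with open(path, 'r', encoding=encoding, newline=None) as handle:
        for line in handle:
            word = line.strip()
            if word:
                words.append(word)
    return words
```

Passing the encoding explicitly also avoids depending on the platform default, which is cp1252 in the traceback shown in the question.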

You can use the replace method of strings, like this:
txt.replace('\n', '\r\n')
EDIT:
In your case:
# newline='' stops Python from translating the '\n' in '\r\n' a second time on Windows
with open('input.txt') as inp, open('output.txt', 'w', newline='') as out:
    txt = inp.read()
    txt = txt.replace('\n', '\r\n')
    out.write(txt)
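A variant that sidesteps both encodings and newline translation is to do the replacement on bytes. This is a sketch with hypothetical helper names (lf_to_crlf, convert_file); the negative lookbehind is there so that line endings that are already CRLF are not doubled.

```python
import re


def lf_to_crlf(data):
    """Convert bare LF line endings in `data` (bytes) to CRLF."""
    # (?<!\r) skips any '\n' that is already preceded by '\r'.
    return re.sub(rb'(?<!\r)\n', b'\r\n', data)


def convert_file(src_path, dst_path):
    """Write a CRLF copy of src_path to dst_path without decoding it."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        dst.write(lf_to_crlf(src.read()))
```

This works for UTF-8 files because the byte 0x0A never occurs inside a multi-byte UTF-8 sequence, so replacing it at the byte level cannot corrupt a character.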

You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use python's universal newlines mode.
The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8, but no encoding was passed to open(), so Python uses the platform default (cp1252 here) to decode the bytes read from the file into a Python 3 string (i.e. a sequence of Unicode code points) as you iterate over it. There are bytes in those files that cannot be decoded with cp1252, and that causes the UnicodeDecodeError.
If you pass encoding='utf-8' to open() you should no longer get the UnicodeDecodeError; str(line) is then unnecessary, because each line is already a str. (line.decode('utf-8') only applies if you open the file in binary mode, where each line is a bytes object.) Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.

You can use the cereja package:
pip install cereja==1.2.0
import cereja
cereja.lf_to_crlf(dir_or_file_path)
or
cereja.lf_to_crlf(dir_or_file_path, ext_in=[".py", ".csv"])
You can substitute for any standard. See the filetools module

Pythonic way of reading NUL in a file

I'm using Python to read a text file with the segment below
(I can't post a screenshot since I'm a noob), but this is what it looks like in Notepad++:
NULSOHSOHNULNULNULSUBMesssage-ID:
error:
Traceback (most recent call last):
  File "<pyshell#3>", line 1, in <module>
    print(f.readline())
  File "C:\Python32\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 7673: character maps to <undefined>
Opening the file as binary:
f = open('file.txt','rb')
f.readline()
gives me the text as binary
b'\x00\x01\x01\x00\x00\x00\x1a\xb7Message-ID:
But how do I get the text as ASCII? And what's the easiest/most Pythonic way of handling this?

The problem is with "byte 0x8f in position 7673", not with "byte 0x00 in position 1". I.e., your NUL is not the problem. If you look at the cp-1252 codepage on wikipedia, you can see that 0x8f has no corresponding character.
The larger issue is that your file is not in a single encoding: it appears to be a mix of binary framing of text segments. What you really need to do is figure out the format of this file and parse it into binary pieces (or perhaps some richer data structure, like a tuple, list, dict, object, etc), then decode the text pieces into unicode if you need to process further.

When opening a file in text mode, you can specifically tell which encoding to use:
f = open('file.txt','r',encoding='ascii')
However your real problem is different: the binary piece that you cited can not be read as ASCII, because the byte \xb7 is outside of ASCII range (0-127). The exception traceback tells that Python is using cp1252 codec by default, which cannot decode your file either.
You need either to figure out which encoding the file has, or to handle it as binary all the time.
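Handling it as binary could start out like the sketch below. The real layout of the file is unknown, so this only assumes that the readable part begins at the b'Message-ID:' marker seen in the question; split_header is a hypothetical helper, not a parser for any actual format.

```python
def split_header(raw, marker=b'Message-ID:'):
    """Split raw bytes into (binary_prefix, text) at the first `marker`."""
    index = raw.find(marker)
    if index == -1:
        return raw, ''  # no text part found, everything stays binary
    # errors='replace' turns any non-ASCII byte into U+FFFD instead of raising.
    return raw[:index], raw[index:].decode('ascii', errors='replace')
```

The binary prefix stays a bytes object and can be inspected or unpacked separately, while only the part known to be text is ever decoded.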

Perhaps open it in the correct read mode?
f = open('file.txt','r')
f.readline()
