I tried
with zipfile.ZipFile("5.csv.zip", "r") as zfile:
    for name in zfile.namelist():
        with zfile.open(name, 'rU') as readFile:
            line = readFile.readline()
            print(line)
            split = line.split('\t')
It outputs:
b'$0.0\t1822\t1\t1\t1\n'
Traceback (most recent call last):
File "zip.py", line 6
split = line.split('\t')
TypeError: Type str doesn't support the buffer API
How can I open the text file as unicode instead of as bytes?
To convert a byte stream into a Unicode (text) stream, you could use io.TextIOWrapper():
import io
import zipfile

encoding = 'utf-8'
with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        with zfile.open(name) as readfile:
            for line in io.TextIOWrapper(readfile, encoding):
                print(repr(line))
Note: TextIOWrapper() uses universal newline mode by default. The 'rU' mode for zfile.open() is deprecated since Python 3.4.
This approach avoids the issues with multibyte encodings described in @Peter DeGlopper's answer.
Edit: For Python 3, using io.TextIOWrapper as this answer describes is the best choice. The answer below could still be helpful for 2.x. I don't think anything below is actually incorrect even for 3.x, but io.TextIOWrapper is still better.
If the file is utf-8, this will work:
# the rest of the code as above, then:
with zfile.open(name, 'rU') as readFile:
    line = readFile.readline().decode('utf8')
    # etc
If you're going to be iterating over the file you can use codecs.iterdecode, but that won't work with readline().
import codecs

with zfile.open(name, 'rU') as readFile:
    for line in codecs.iterdecode(readFile, 'utf8'):
        print line
        # etc
Note that neither approach is necessarily safe for multibyte encodings. For example, little-endian UTF-16 represents the newline character with the bytes b'\x0A\x00'. A non-unicode aware tool looking for newlines will split that incorrectly, leaving the null bytes on the following line. In such a case you'd have to use something that doesn't try to split the input by newlines, such as ZipFile.read, and then decode the whole byte string at once. This is not a concern for utf-8.
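As a rough sketch of that whole-file approach (assuming, purely for illustration, a UTF-16-LE encoded member in the same archive):

import zipfile

with zipfile.ZipFile("5.csv.zip") as zfile:
    for name in zfile.namelist():
        # Read the whole member as bytes and decode it in one go, so line
        # splitting happens on the decoded text rather than on raw bytes.
        text = zfile.read(name).decode('utf-16-le')  # assumed encoding
        for line in text.splitlines():
            print(line.split('\t'))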
The reason why you're seeing that error is because you are trying to mix bytes with unicode. The argument to split() must also be a byte string:
>>> line = b'$0.0\t1822\t1\t1\t1\n'
>>> line.split(b'\t')
[b'$0.0', b'1822', b'1', b'1', b'1\n']
To get a unicode string, use decode:
>>> line.decode('utf-8')
'$0.0\t1822\t1\t1\t1\n'
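Once the line is decoded, split() takes an ordinary str separator:
>>> line.decode('utf-8').split('\t')
['$0.0', '1822', '1', '1', '1\n']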
Related
I want to open a text file (.dat) in python and I get the following error:
'utf-8' codec can't decode byte 0x92 in position 4484: invalid start byte
but the file is encoded using utf-8, so maybe there is some character that cannot be read. Is there a way to handle the problem without tracking down every single weird character? The text file is rather huge, so it would take me hours to find each character that isn't valid UTF-8.
Here is my code
import codecs

f = codecs.open('compounds.dat', encoding='utf-8')
for line in f:
    if "InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
        print(line)
f.close()
It shouldn't "take you hours" to find the bad byte. The error tells you exactly where it is; it's at index 4484 in your input with a value of 0x92; if you did:
with open('compounds.dat', 'rb') as f:
    data = f.read()
the invalid byte would be at data[4484], and you can slice as you like to figure out what's around it.
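For illustration (the offset 4484 and the value 0x92 are taken straight from the error message):

with open('compounds.dat', 'rb') as f:
    data = f.read()

print(data[4484])        # the offending byte (an int in Python 3; 0x92 == 146)
print(data[4470:4500])   # a slice of the surrounding bytes for context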
In any event, if you just want to ignore or replace invalid bytes, that's what the errors parameter is for. Using io.open (because codecs.open is subtly broken in many ways, and io.open is both faster and more correct):
# If this is Py3, you don't even need the import, just use plain open which is
# an alias for io.open
import io

with io.open('compounds.dat', encoding='utf-8', errors='ignore') as f:
    for line in f:
        if u"InChI=1S/C11H8O3/c1-6-5-9(13)10-7(11(6)14)3-2-4-8(10)12/h2-5" in line:
            print(line)
will just ignore the invalid bytes (dropping them as if they never existed). You can also pass errors='replace' to insert a replacement character for each garbage byte, so you're not silently dropping data.
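A minimal sketch of the errors='replace' variant, which keeps a visible marker for every undecodable byte:

import io

with io.open('compounds.dat', encoding='utf-8', errors='replace') as f:
    text = f.read()

# every byte that could not be decoded shows up as U+FFFD instead of vanishing
print(text.count(u'\ufffd'))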
If you are working with huge data, it is better to specify the encoding, and if the error persists, use errors="ignore" as well:
with open("filename" , 'r' , encoding="utf-8",errors="ignore") as f:
f.read()
I'm trying to remove all non-ascii characters from a text document. I found a package that should do just that, https://pypi.python.org/pypi/Unidecode
It should accept a string and convert all non-ascii characters to the closest ascii character available. I used this same module in Perl easily enough by just calling while (<input>) { $_ = unidecode($_); }, and this one is a direct port of the Perl module; the documentation indicates that it should work the same.
I'm sure this is something simple, I just don't understand enough about character and file encoding to know what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem may have more to do with my lack of encoding knowledge and handling strings wrong than with the module, but hopefully someone can explain why. I've tried everything I know without just randomly inserting code, and searching the errors I'm getting has turned up no luck so far.
Here's my Python:
from unidecode import unidecode

def toascii():
    origfile = open(r'C:\log.convert', 'rb')
    convertfile = open(r'C:\log.toascii', 'wb')

    for line in origfile:
        line = unidecode(line)
        convertfile.write(line)

    origfile.close()
    convertfile.close()

toascii()
If I don't open the original file in byte mode (origfile = open('file.txt', 'r')), then I get an error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the for line in origfile: line.
If I do open it in byte mode ('rb'), I get TypeError: ord() expected string length 1, but int found from the line = unidecode(line) line.
If I declare line as a string (line = unidecode(str(line))), then it will write to the file, but... not correctly: \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It's writing out the \n, \r, etc. and the escaped byte values instead of converting them to anything.
If I convert the line to a string as above and open convertfile in byte mode ('wb'), it gives the error TypeError: 'str' does not support the buffer interface.
If I open it in byte mode without declaring it a string ('wb' and unidecode(line)), then I get the TypeError: ord() expected string length 1, but int found error again.
The unidecode module accepts unicode string values and returns a unicode string in Python 3. You are giving it binary data instead. Decode to unicode or open the input text file in text mode, and encode the result to ASCII before writing it to a file, or open the output text file in text mode.
Quoting from the module documentation:
The module exports a single function that takes an Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)
Emphasis mine.
This should work:
def toascii():
    with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, \
         open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
        for line in origfile:
            line = unidecode(line)
            convertfile.write(line)
This opens the input file in text mode (using UTF-8 encoding, which judging by your sample line is correct) and writes in text mode (encoding to ASCII).
You do need to explicitly specify the encoding of the file you are opening; if you omit the encoding the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.
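For instance, you can check what default your system would otherwise fall back to:

import locale

# This is what open() uses when no encoding is given
# (commonly cp1252 on Windows, UTF-8 on most Linux systems).
print(locale.getpreferredencoding(False))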
I'm trying to read a text file into python, but it seems to use some very strange encoding. I try the usual:
file = open('data.txt', 'r')
lines = file.readlines()
for line in lines[0:1]:
    print line,
    print line.split()
Output:
0.0200197 1.97691e-005
['0\x00.\x000\x002\x000\x000\x001\x009\x007\x00', '\x001\x00.\x009\x007\x006\x009\x001\x00e\x00-\x000\x000\x005\x00']
Printing the line works fine, but after I try to split the line so that I can convert it into a float, it looks crazy. Of course, when I try to convert those strings to floats, this produces an error. Any idea about how I can convert these back into numbers?
I put the sample datafile here if you would like to try to load it:
https://dl.dropboxusercontent.com/u/3816350/Posts/data.txt
I would like to simply use numpy.loadtxt or numpy.genfromtxt, but they also do not want to deal with this crazy file.
I'm willing to bet this is a UTF-16-LE file, and you're reading it as whatever your default encoding is.
In UTF-16, each character takes two bytes.* If your characters are all ASCII, this means the UTF-16 encoding looks like the ASCII encoding with an extra '\x00' after each character.
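You can see this directly by encoding a small ASCII-only string (Python 2 shown, to match the snippets below):

>>> u'0.02'.encode('utf-16-le')
'0\x00.\x000\x002\x00'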
To fix this, just decode the data:
print line.decode('utf-16-le').split()
Or do the same thing at the file level with the io or codecs module:
file = io.open('data.txt','r', encoding='utf-16-le')
* This is a bit of an oversimplification: Each BMP character takes two bytes; each non-BMP character is turned into a surrogate pair, with each of the two surrogates taking two bytes. But you probably didn't care about these details.
Looks like UTF-16 to me.
>>> test_utf16 = '0\x00.\x000\x002\x000\x000\x001\x009\x007\x00'
>>> test_utf16.decode('utf-16')
u'0.0200197'
You can work directly off the Unicode strings:
>>> float(test_utf16)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ValueError: null byte in argument for float()
>>> float(test_utf16.decode('utf-16'))
0.020019700000000001
Or encode them to something different, if you prefer:
>>> float(test_utf16.decode('utf-16').encode('ascii'))
0.020019700000000001
Note that you need to do this as early as possible in your processing. As your comment noted, split will behave incorrectly on the utf-16 encoded form. The utf-16 representation of the space character ' ' is ' \x00', so split removes the whitespace but leaves the null byte.
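For example, in Python 2:

>>> u'0.02 1.97'.encode('utf-16-le').split()
['0\x00.\x000\x002\x00', '\x001\x00.\x009\x007\x00']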
The 2.6 and later io library can handle this for you, as can the older codecs library. io handles linefeeds better, so it's preferable if available.
This is really just @abarnert's suggestion, but I wanted to post it as an answer since this is the simplest solution and the one that I ended up using:
import io
import numpy as np

file = io.open(filename, 'r', encoding='utf-16-le')
data = np.loadtxt(file, skiprows=8)
This demonstrates how you can create a file object using io.open using whatever crazy encoding your file happens to have, and then pass that file object to np.loadtxt (or np.genfromtxt) for quick-and-easy loading.
This piece of code will do what's necessary:
file_handle = open(file_name, 'rb')
file_first_line = file_handle.readline()
file_handle.close()
print file_first_line
if '\x00' in file_first_line:
    file_first_line = file_first_line.replace('\x00', '')
print file_first_line
When you try to use file_first_line.split() before replacing, the output would contain '\x00'. I just tried replacing '\x00' with an empty string, and it worked.
I found a list of the majority of English words online, but the line breaks are of unix-style (encoded in Unicode: UTF-8). I found it on this website: http://dreamsteep.com/projects/the-english-open-word-list.html
How do I convert the line breaks to CRLF so I can iterate over them? The program I will be using them in goes through each line in the file, so the words have to be one per line.
This is a portion of the file: bitbackbitebackbiterbackbitersbackbitesbackbitingbackbittenbackboard
It should be:
bit
backbite
backbiter
backbiters
backbites
backbiting
backbitten
backboard
How can I convert my files to this type? Note: it's 26 files (one per letter) with 80,000 words or so in total (so the program should be very fast).
I don't know where to start because I've never worked with unicode. Thanks in advance!
Using rU as the parameter (as suggested), with this in my code:
with open(my_file_name, 'rU') as my_file:
    for line in my_file:
        new_words.append(str(line))
my_file.close()
I get this error:
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
addWords('B Words')
File "D:\my_stuff\Google Drive\documents\SCHOOL\Programming\Python\Programming Class\hangman.py", line 138, in addWords
for line in my_file:
File "C:\Python3.3\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8d in position 7488: character maps to <undefined>
Can anyone help me with this?
Instead of converting, you should be able to just open the file using Python's universal newline support:
f = open('words.txt', 'rU')
(Note the U.)
You can use the replace method of strings, like:
txt.replace('\n', '\r\n')
EDIT: in your case:
with open('input.txt') as inp, open('output.txt', 'w') as out:
    txt = inp.read()
    txt = txt.replace('\n', '\r\n')
    out.write(txt)
You don't need to convert the line endings in the files in order to be able to iterate over them. As suggested by NPE, simply use python's universal newlines mode.
The UnicodeDecodeError happens because the files you are processing are encoded as UTF-8 and when you attempt to decode the contents from bytes to a string, via str(line), Python is using the cp1252 encoding to convert the bytes read from the file into a Python 3 string (i.e. a sequence of unicode code points). However, there are bytes in those files that cannot be decoded with the cp1252 encoding and that causes a UnicodeDecodeError.
If you change str(line) to line.decode('utf-8') you should no longer get the UnicodeDecodeError. Check out the Text Vs. Data Instead of Unicode Vs. 8-bit writeup for some more details.
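A minimal sketch of that suggestion (note that the file has to be opened in binary mode for the lines to arrive as bytes; my_file_name and new_words are the names from your own snippet):

with open(my_file_name, 'rb') as my_file:
    for line in my_file:
        # decode each line explicitly instead of relying on the cp1252 default
        new_words.append(line.decode('utf-8'))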
Finally, you might also find The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky useful.
You can use the cereja package:
pip install cereja==1.2.0
import cereja

cereja.lf_to_crlf(dir_or_file_path)
or
cereja.lf_to_crlf(dir_or_file_path, ext_in=[".py", ".csv"])
You can substitute any extensions you need. See the filetools module.
I'm trying to get a Python 3 program to do some manipulations with a text file filled with information. However, when trying to read the file I get the following error:
Traceback (most recent call last):
File "SCRIPT LOCATION", line NUMBER, in <module>
text = file.read()
File "C:\Python31\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 2907500: character maps to <undefined>
The file in question is not using the CP1252 encoding. It's using another encoding. Which one you have to figure out yourself. Common ones are Latin-1 and UTF-8. Since 0x90 doesn't actually mean anything in Latin-1, UTF-8 (where 0x90 is a continuation byte) is more likely.
You specify the encoding when you open the file:
file = open(filename, encoding="utf8")
If file = open(filename, encoding="utf-8") doesn't work, try
file = open(filename, errors="ignore"), if you want to remove unneeded characters. (docs)
Alternatively, if you don't need to decode the file, such as uploading the file to a website, use:
open(filename, 'rb')
where r = reading, b = binary
As an extension to @Lennart Regebro's answer:
If you can't tell what encoding your file uses, the solution above does not work (it's not utf8), and you find yourself merely guessing, there are online tools that you can use to identify the encoding. They aren't perfect, but usually work just fine. After you figure out the encoding, you should be able to use the solution above.
EDIT: (Copied from comment)
The quite popular text editor Sublime Text has a command to display the encoding if it has been set:
Go to View -> Show Console (or Ctrl+`)
Type view.encoding() into the field at the bottom and hope for the best (I was unable to get anything but Undefined, but maybe you will have better luck...)
TLDR: Try: file = open(filename, encoding='cp437')
Why? When one uses:
file = open(filename)
text = file.read()
Python assumes the file uses the same codepage as the current environment (cp1252 in the case of the opening post) and tries to decode it to its own str (Unicode) type. If the file contains characters with values not defined in this codepage (like 0x90), we get UnicodeDecodeError. Sometimes we don't know the encoding of the file, sometimes the file's encoding may be unhandled by Python (like e.g. cp790), sometimes the file can contain mixed encodings.
If such characters are unneeded, one may decide to replace them by question marks, with:
file = open(filename, errors='replace')
Another workaround is to use:
file = open(filename, errors='ignore')
The offending characters are then simply dropped, but other errors will be masked too.
A very good solution is to specify the encoding, yet not any encoding (like cp1252), but the one which has ALL characters defined (like cp437):
file = open(filename, encoding='cp437')
Codepage 437 is the original DOS encoding. All codes are defined, so there are no errors while reading the file, no errors are masked out, the characters are preserved (not quite left intact but still distinguishable).
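For instance, the stray 0x90 byte from the traceback decodes cleanly under cp437 and round-trips back to the same byte:

>>> b'\x90'.decode('cp437')
'É'
>>> 'É'.encode('cp437')
b'\x90'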
Stop wasting your time, just add the following encoding="cp437" and errors='ignore' to your code in both read and write:
open('filename.csv', encoding="cp437", errors='ignore')
open(file_name, 'w', newline='', encoding="cp437", errors='ignore')
Godspeed
For me, encoding with utf16 worked:
file = open('filename.csv', encoding="utf16")
def read_files(file_path):
    with open(file_path, encoding='utf8') as f:
        text = f.read()
    return text
OR (AND)
def write_files(text, file_path):
    # open for writing in binary mode so the encoded bytes can be written
    with open(file_path, 'wb') as f:
        f.write(text.encode('utf8', 'ignore'))
For those working in Anaconda on Windows, I had the same problem. Notepad++ helped me to solve it.
Open the file in Notepad++. In the bottom right it will tell you the current file encoding.
In the top menu, next to "View", locate "Encoding". In "Encoding" go to "Character sets" and there patiently look for the encoding that you need. In my case the encoding "Windows-1252" was found under "Western European".
Before you apply the suggested solution, you can check what Unicode character appeared in your file (and in the error log), in this case 0x90: https://unicodelookup.com/#0x90/1 (or directly at the Unicode Consortium site http://www.unicode.org/charts/ by searching 0x0090)
and then consider removing it from the file.
In newer versions of Python (starting with 3.7), you can add the interpreter option -Xutf8, which should fix your problem. If you use PyCharm, just go to Run > Edit configurations (in the Configuration tab, change the value in the Interpreter options field to -Xutf8).
Or, equivalently, you can just set the environmental variable PYTHONUTF8 to 1.
For me, changing the MySQL character encoding to be the same as my code helped to sort out the solution: photo = open('pic3.png', encoding='latin1')