I am becoming more and more convinced that the business of file encodings is made as confusing as possible on purpose. I have a problem reading a file in UTF-8 encoding that contains just one line:
“blabla this is some text”
(note that the quotation marks are some fancy version of the standard quotation marks).
Now, I run this piece of Python code on it:
import fileinput

def charinput(paths):
    with open(paths) as fi:
        for line in fi:
            for char in line:
                yield char

i = charinput('path/to/file.txt')
for item in i:
    print(item)
with two results:
If I run my Python code from the command prompt, the result is some strange characters, followed by an error message:
ď
»
ż
â
Traceback (most recent call last):
File "krneki.py", line 11, in <module>
print(item)
File "C:\Python34\lib\encodings\cp852.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u20ac' in position
0: character maps to <undefined>
I get the idea that the problem comes from the fact that Python tries to read a "wrongly" encoded document, but is there a way to tell fileinput.input to read UTF-8?
EDIT: Some really weird stuff is happening and I have NO idea how any of it works. After saving the same file as before in Notepad++, the Python code now runs within IDLE and results in the following output (newlines removed):
“blabla this is some text”
while I can get the command prompt to not crash if I first input chcp 65001. Running the file then results in
Ä»żâ€śblabla this is some text ”
Any ideas? This is a horrible mess, if you ask me, but it is vital I understand it...
Encoding
Every file is encoded. The byte 0x4C is interpreted as latin capital letter L according to the ASCII encoding, but as less-than sign ('<') according to the EBCDIC encoding. There Ain't No Such Thing As Plain Text.
There are single byte character sets like ASCII that use a single byte to encode each symbol, there are double byte character sets like KS X 1001 that use two bytes to encode each symbol, and there are encodings like the popular UTF-8 that use a variable number of bytes per symbol.
UTF-8 has become the most popular encoding for new applications, so I'll give some examples: The Latin Capital Letter A is stored as a single byte: 0x41. The Left Double Quotation Mark (“) is stored as three bytes: 0xE2 0x80 0x9C. The emoji Pile of Poo is stored as four bytes: 0xF0 0x9F 0x92 0xA9.
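A quick way to check those byte counts yourself, as a sketch (the expected output is shown in comments):
print('A'.encode('utf-8'))           # b'A' (1 byte)
print('\u201c'.encode('utf-8'))      # b'\xe2\x80\x9c' (3 bytes)
print('\U0001f4a9'.encode('utf-8'))  # b'\xf0\x9f\x92\xa9' (4 bytes)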
Any program that reads a file and has to interpret the bytes as symbols has to know (or to guess) which encoding was used.
If you are not familiar with Unicode or UTF-8 you might want to read http://www.joelonsoftware.com/articles/unicode.html
Reading Files in Python 3
Python 3's built-in function open() has an optional keyword argument encoding to support different encodings. To open a UTF-8 encoded file you can write open(filename, encoding="utf-8") and Python will take care of the decoding.
Also, the fileinput module supports encodings via the openhook keyword argument: fileinput.input(filename, openhook=fileinput.hook_encoded("utf-8")).
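Applied to the question's charinput() generator, a sketch of both approaches might look like this (path/to/file.txt stands in for the real file):
import fileinput

def charinput(path):
    # open() decodes the UTF-8 bytes for us
    with open(path, encoding='utf-8') as fi:
        for line in fi:
            for char in line:
                yield char

# or let fileinput do the decoding:
for line in fileinput.input('path/to/file.txt',
                            openhook=fileinput.hook_encoded('utf-8')):
    pass  # line is already a decoded str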
If you are not familiar with Python and Unicode or UTF-8 you should read http://docs.python.org/3/howto/unicode.html
I also found some nice tricks in http://www.chirayuk.com/snippets/python/unicode
Reading Files in Python 2
In Python 2 open() does not know about encodings. Instead you can use the codecs module to specify which encoding should be used: codecs.open(filename, encoding="utf-8")
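A minimal Python 2 sketch, assuming a UTF-8 encoded input.txt (a hypothetical file name):
import codecs

with codecs.open('input.txt', encoding='utf-8') as f:
    for line in f:
        pass  # line is a unicode object, already decoded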
The best source for Python 2/Unicode enlightenment is http://docs.python.org/2/howto/unicode.html
Related
In my code I used something like file = open(path + '/' + filename, 'wb') to write the file, but in my attempt to support non-ASCII filenames, I encoded the name like this:
naming = path + '/' + filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
# write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path:
    data = open(file.as_posix(), 'rb')
    ...
I keep getting this error: 'ascii' codec can't encode characters in position...
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb') but I get 'utf-8' codec can't encode characters in position...'
I also tried file.as_posix().encode('utf-8', 'surrogateescape'), I found that both encode and print just fine but with open() I still get the error 'utf-8' codec can't encode characters in position...'
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on Ubuntu Linux.
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to UTF-8, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating if the issue's related to my linux locale, it was set to POSIX, I changed it to C.UTF-8 but still no luck atm.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
I don't understand why you feel you need to recode filepaths.
Linux (Unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters into surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there's no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding of a surrogate codepoint, you're likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called 'ñ' with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
>>> from pathlib import Path
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path('ñ2').mkdir()
>>> for path in Path('ñ').iterdir():
...     open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ASCII characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript one" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
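A short demonstration of the difference, as a sketch (results are shown in comments):
s = '\xd8\xb9'            # two codepoints: Ø and ¹
print(len(s))             # 2
b = b'\xd8\xb9'           # two *bytes*, a valid UTF-8 sequence
print(b.decode('utf-8'))  # ع, i.e. '\u0639'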
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
So after being in a rabbit hole for the past few days, I figured out that the issue isn't with Python itself but with the locale that my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was weird considering I had set the Linux locale to C.UTF-8. I discovered that, since I was running WSGI on Apache2, I had to add the locale to my WSGI daemon process with WSGIDaemonProcess my_app locale='C.UTF-8' in the Apache configuration file, thanks to this post.
I'm new to Python and am having problems understanding Unicode. I'm using Python 3.4.
I've spent an entire day trying to figure this out by reading about Unicode, including http://www.fileformat.info/info/unicode/char/201C/index.htm and
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
I need to refer to special quotes because they are used in the text I'm analyzing. I did verify that the Windows 7 command window can read and write the two special quote characters.
To make things simple, I wrote a one line script:
print('“')  # that's the special quote mark between normal single quotes
and get this output:
Traceback (most recent call last):
File "C:\Users\David\Documents\Python34\Scripts\wordCount3.py", line 1, in <module>
print ('\u201c')
File "C:\Python34\lib\encodings\cp437.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\u201c' in position 0: character maps to <undefined>
So how do I write something to refer to these two characters, \u201C and \u201D?
Is this the correct encoding choice in the file open statement?
with open(fileIn, mode='r', encoding='utf-8', errors='replace') as f:
The reason is that in Python 3.x you can't just mix Unicode strings with byte strings. You've probably read manuals dealing with Python 2.x, where such things are possible as long as the byte string contains convertible characters.
print('\u201c', '\u201d')
works fine for me, so the only explanation is that you're using the wrong encoding for your source file or terminal.
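You can check which codec print() will use for your console; a quick sketch:
import sys
print(sys.stdout.encoding)  # e.g. 'cp437' in a default US Windows console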
You may also explicitly tell Python which codepage your source file uses by putting the following line at the top of your source:
# -*- coding: utf-8 -*-
Added: it seems that you're working on a Windows machine. If so, you can change your console codepage to UTF-8 by running
chcp 65001
before you fire up your Python interpreter. That change is temporary; if you want it to be permanent, run the following .reg file:
Windows Registry Editor Version 5.00
[HKEY_CURRENT_USER\Console]
"CodePage"=dword:fde9
Using Python 3.4.2, I want to get a part of a website. According to the meta tags, that website is encoded with iso-8859-1. And I want to write one part (along with other parts) to a CSV file.
However, this part contains an undefined character with the hex value 0x8b. In order to preserve the part as faithfully as possible, I want to write it as is into the CSV file. However, Python doesn't let me do it.
Here's a minimal example:
import urllib.request
import urllib.parse
import csv
if __name__ == "__main__":
    with open("bytewrite.csv", "w", newline="") as csvfile:
        a = b'\x8b'  # byte literal by urllib.request
        b = a.decode("iso-8859-1")
        w = csv.writer(csvfile)
        w.writerow([b])
And this is the output:
Traceback (most recent call last):
File "D:\Eigene\Dateien\Code\Python\writebyte.py", line 12, in <module>
w.writerow([b])
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>
Eventually, I did it manually. It was just copy and paste with Notepad++, and according to a hex editor the value was inserted correctly. But how can I do it with Python 3? Why does Python even care what 0x8b stands for, instead of just writing it to the file?
It further irritates me that according to iso8859_1.py (and also cp1252.py) in C:\Python34\lib\encodings\ the lookup table seems to not interfere:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
Quoted from the csv docs:
Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
What is happening is you've decoded to Unicode from iso-8859-1, but getpreferredencoding() returns cp1252 and the Unicode character \x8b is not supported in that encoding.
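You can confirm the mismatch on your own machine with something like this (the printed value is an example):
import locale
print(locale.getpreferredencoding())  # e.g. 'cp1252' on US Windows
'\x8b'.encode('cp1252')               # raises UnicodeEncodeError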
Corrected minimal example:
import csv

with open('bytewrite.csv', 'w', encoding='iso-8859-1', newline='') as csvfile:
    a = b'\x8b'
    b = a.decode("iso-8859-1")
    w = csv.writer(csvfile)
    w.writerow([b])
Your interpretation of the lookup tables in encodings is not correct. The code you've listed:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
Tells you two things:
1. How to map the Unicode character '\x8b' to bytes in iso8859-1: it's just a control character.
2. How to map the Unicode character '\u2039' to bytes in cp1252: it's a piece of punctuation: ‹
This does not tell you how to map the Unicode character '\x8b' to bytes in cp1252, which is what you're trying to do.
The root of the problem is that "\x8b" is not a valid iso8859-1 character. Look at the table here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
0x8B is undefined there, so it just decodes as a control character. After it's decoded and we're in Unicode land, what is U+008B? This is a little tricky to find out, but it's defined in the Unicode database here:
008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;
Now, does CP1252 have this control character, "PARTIAL LINE FORWARD"?
http://en.wikipedia.org/wiki/Windows-1252#Code_page_layout
No, it does not. So you get an error when trying to encode it in CP1252.
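The whole round trip, sketched (results shown as comments):
u = b'\x8b'.decode('iso-8859-1')  # '\x8b', i.e. U+008B PARTIAL LINE FORWARD
u.encode('cp1252')                # raises UnicodeEncodeError
'\u2039'.encode('cp1252')         # b'\x8b', what cp1252 actually puts at 0x8B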
Unfortunately there's no good solution for this. Some ideas:
1. Guess what encoding the page actually is. It's probably CP1252, not ISO-8859-1, but who knows. It could even contain a mix of encodings, or incorrectly encoded data (mojibake). You can use chardet to guess the encoding, or force this URL to use CP1252 in your program (overriding what the meta tag says), or you could try a series of codecs and take the first one that decodes and encodes successfully.
2. Fix up the input text or the decoded Unicode string using some kind of mapping of problematic characters like this. This will work most of the time, but will fail silently or do something weird if you're trying to "fix up" data where it doesn't make sense.
3. Do not try to convert from ISO-8859-1 to CP1252, as they aren't compatible with each other. If you use UTF-8 that might work better.
4. Use an encoding error handler (sketched below). See this table for a list of handlers. Using xmlcharrefreplace and backslashreplace will preserve the information (but then require you to do extra steps when decoding), while replace and ignore will silently skip over the bad character.
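A sketch of the last idea: an error handler preserves the information instead of raising (out.csv is a hypothetical output file):
text = '\x8b'  # the problematic control character
with open('out.csv', 'w', encoding='cp1252',
          errors='backslashreplace', newline='') as f:
    f.write(text)  # written as the literal characters \x8b, no exception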
These types of issues caused by older encodings are really hard to solve, and there is no perfect solution. This is the reason why unicode was invented.
I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:
1. I pull the file up in Notepad
2. Save As...
3. Change the encoding from Unicode to UTF-8
4. Then run my Python program on it
So this is the process I would like to automate in Python 3.4: pretty much just convert the file to UTF-8, or open it with something like open(filename, 'r', encoding='utf-8'). That exact line, however, threw the following error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing in UTF-8, so that I don't have to str.encode (or something like that) every time I analyze a string.
Anybody been through this and know which method I should use and how to do it?
EDIT:
In the Python 3 REPL, I did
>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>
So now my Python code opens the file with open('file.txt', 'r', encoding='cp1252'). I am running a lot of regexes over this file, though, and they aren't matching (I think because it isn't UTF-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom
What Notepad considers Unicode is UTF-16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
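A quick way to check for that BOM, as a sketch using the question's file name:
with open('log.txt', 'rb') as f:
    print(f.read(2) == b'\xff\xfe')  # True for a Notepad "Unicode" file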
To convert to UTF-8, you could use:
with open('log.txt', encoding='utf16') as f:
    data = f.read()
with open('utf8.txt', 'w', encoding='utf8') as f:
    f.write(data)
Note that many Windows editors like to see a UTF-8 signature (BOM) at the beginning of the file, or may assume ANSI instead. ANSI is really the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
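That alternative, sketched (reusing data from the snippet above):
with open('utf8.txt', 'w', encoding='utf-8-sig') as f:
    f.write(data)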
I'm trying to remove all non-ASCII characters from a text document. I found a package that should do just that, https://pypi.python.org/pypi/Unidecode
It should accept a string and convert all non-ASCII characters to the closest ASCII characters available. I used this same module in Perl easily enough by just calling while (<input>) { $_ = unidecode($_); }. This one is a direct port of the Perl module, and the documentation indicates that it should work the same.
I'm sure this is something simple, I just don't understand enough about character and file encoding to know what the problem is. My origfile is encoded in UTF-8 (converted from UCS-2LE). The problem may have more to do with my lack of encoding knowledge and handling strings wrong than the module, hopefully someone can explain why though. I've tried everything I know without just randomly inserting code and search the errors I'm getting with no luck so far.
Here's my Python:
from unidecode import unidecode

def toascii():
    origfile = open(r'C:\log.convert', 'rb')
    convertfile = open(r'C:\log.toascii', 'wb')
    for line in origfile:
        line = unidecode(line)
        convertfile.write(line)
    origfile.close()
    convertfile.close()

toascii()
If I don't open the original file in byte mode (origfile = open('file.txt', 'r')), then I get the error UnicodeDecodeError: 'charmap' codec can't decode byte 0x90 in position 1563: character maps to <undefined> from the for line in origfile: line.
If I do open it in byte mode 'rb' I get TypeError: ord() expected string length 1, but int found from the line = unidecode(line) line.
If I declare line as a string (line = unidecode(str(line))), then it will write to the file, but... not correctly: \r\n'b'\xef\xbb\xbf[ 2013.10.05 16:18:01 ] User_Name > .\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\xe2\x95\x90\ It's writing out the \n, \r, etc. and the raw Unicode byte values instead of converting them to anything.
If I convert the line to string as above, and open the convertfile in byte mode 'wb' it gives the error TypeError: 'str' does not support the buffer interface
If I open it in byte mode without declaring it a string 'wb' and unidecode(line) then I get the TypeError: ord() expected string length 1, but int found error again.
The unidecode module accepts Unicode string values and returns a Unicode string in Python 3. You are giving it binary data instead. Decode to Unicode or open the input text file in text mode, and encode the result to ASCII before writing it to a file, or open the output text file in text mode.
Quoting from the module documentation:
The module exports a single function that takes an Unicode object (Python 2.x) or string (Python 3.x) and returns a string (that can be encoded to ASCII bytes in Python 3.x)
Emphasis mine.
This should work:
def toascii():
    with open(r'C:\log.convert', 'r', encoding='utf8') as origfile, \
         open(r'C:\log.toascii', 'w', encoding='ascii') as convertfile:
        for line in origfile:
            line = unidecode(line)
            convertfile.write(line)
This opens the input file in text mode (using the UTF-8 encoding, which judging by your sample line is correct) and writes in text mode (encoding to ASCII).
You do need to explicitly specify the encoding of the file you are opening; if you omit the encoding the current system locale is used (the result of a locale.getpreferredencoding(False) call), which usually won't be the correct codec if your code needs to be portable.
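A one-liner to see what open() would default to on your system:
import locale
print(locale.getpreferredencoding(False))  # the codec used when encoding= is omitted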