Python Decoding Uncode files with 'ÆØÅ' - python

I read in some data from a danish text file. But i can't seem to find a way to decode it.
The original text is "dør" but in the raw text file its stored as "d√∏r"
So i tried the obvious
InputData = "d√∏r"
Print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can i decode this text so the printed message would be "dør"?

C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.

Related

Python opening files with utf-8 file names

In my code I used something like file = open(path +'/'+filename, 'wb') to write the file
but in my attempt to support non-ascii filenames, I encode it as such
naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path:
data = open(file.as_posix(), 'rb)
...
I keep getting this error 'ascii' codec can't encode characters in position..
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb') but I get 'utf-8' codec can't encode characters in position...'
I also tried file.as_posix().encode('utf-8', 'surrogateescape'), I found that both encode and print just fine but with open() I still get the error 'utf-8' codec can't encode characters in position...'
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on ubuntu linux
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to utf, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating if the issue's related to my linux locale, it was set to POSIX, I changed it to C.UTF-8 but still no luck atm.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
I don't understand why you feel you need to recode filepaths.
Linux (unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters in surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there's no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding for a surrogate codepoint, you're likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called ´ñ´ with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path.mkdir('ñ2')
>>> for path in Path('ñ').iterdir():
... open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ascii characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal, "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
So after being in a rabbit hole for the past few days, I figured the issue isn't with python itself but with the locale that my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was weird considering I had set the linux locale to C.UTF-8 but discovered that since I was running WSGI on Apache2, I had to add locale to my WSGI as such WSGIDaemonProcess my_app locale='C.UTF-8' in the Apache configuration file thanks to this post.

Parse big JSON file with font-encoding cp1252

I have to handle a big JSON file (approx. 47GB) and it seems as if I found the solution in ijson.
However, when I want to go through the objects I get the following error:
byggesag = (o for o in objects if o["h�ndelse"] == 'Byggesag')
^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe6 in position 12: invalid continuation byte
Here is the code I am using so far:
import ijson
with open("C:/Path/To/Json/JSON_20220703180000.json", "r", encoding="cp1252") as json_file:
objects = ijson.items(json_file, 'SagList.item')
byggesag = (o for o in objects if o['hændelse'] == 'Byggesag')
How can I deal with the encoding of the input file?
The problem is with the python script itself, which is encoded with cp1252 but python expects it to be in utf8. You seem to be dealing with the input JSON file correctly (but you won't be able to tell until you actually are able to run your script).
First, note that the error is a SyntaxError, which probably happens when you are loading your script/module.
Secondly, note how in the first bit of code you shared hændelse appears somewhat scrambled, and python is complaining about how utf-8 cannot handle byte 0xe6. This is becase the character æ (U+00E6, https://www.compart.com/de/unicode/U+00E6) is encoded as 0xe6 in cp1252, which isn't a valid utf8 byte sequence; hence the error.
To solve it save your python script with utf8 encoding, or specify that it's saved with cp1252 (see https://peps.python.org/pep-0263/ for reference).

Encoding problems in Python - 'ascii' codec can't encode character '\xe3' when using UTF-8

I've created a program to print out some html content. My source file is in utf-8, the server's terminal is in utf-8, and I also use:
out = out.encode('utf8')
to make sure, the character chain is in utf8.
Despite all that, when I use some characters like "ã", "é" in the string out, I get:
UnicodeEncodeError: 'ascii' codec can't encode character '\xe3' in position 84: ordinal not in range(128)
It seems to me that the print after:
print("Content-Type: text/html; charset=utf-8 \n\n")
It's being forced to use ASCII encoding... But, I just don't know this would be the case.
Thanks a lot.
Here it goes how I've solved the encoding problem in with Python 3.4.1:
First I've inserted this line in the code to check the output encoding:
print(sys.stdout.encoding)
And I saw that the output encoding was:
ANSI_X3.4-1968 -
which stands for ASCII and doesn't support characters like 'ã', 'é', etc.
so, I've deleted the previous line, and inserted theses ones here to change the standard output encoding with theses lines
import codecs
if sys.stdout.encoding != 'UTF-8':
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer, 'strict')
if sys.stderr.encoding != 'UTF-8':
sys.stderr = codecs.getwriter('utf-8')(sys.stderr.buffer, 'strict')
Here is where I found the information:
http://www.macfreek.nl/memory/Encoding_of_Python_stdout
P.S.: everybody says it's not a good practice to change the default encoding. I really don't know about it. In my case it has worked fine for me, but I'm building a very small and simple webapp.
I guess you should read the file as unicode object, that way you might not need to encode it.
import codecs
file = codecs.open('file.html', 'w', 'utf-8')

Python 3 unicode to utf-8 on file

I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:
I pull file up in notepad
Save as...
change encoding from unicode to UTF-8
Then run python program on it
So this is the process I would like to automate in Python 3.4. Pretty much just changed the file to UTF-8 or something like open(filename,'r',encoding='utf-8') although this exact line was throwing me this error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing in UTF-8 that way I don't have to str.encode (or something like that) every time I analyze a string.
Anybody been through this and know which method I should use and how to do it?
EDIT:
In the python3 repr, I did
>>> f = open('file.txt','r')
>>> f
(_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252')
So now my python code in my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regex looking through this file though and it isn't picking it up (I think because it isn't utf-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you #Mark Ransom
What notepad considers Unicode is utf16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To convert to UTF-8, you could use:
with open('log.txt',encoding='utf16') as f:
data = f.read()
with open('utf8.txt','w',encoding='utf8') as f:
f.write(data)
Note that many Windows editors like a UTF-8 signature at the beginning of the file, or may assume ANSI instead. ANSI is really the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.

How to open an ascii-encoded file as UTF8?

My files are in US-ASCII and a command like a = file( 'main.html') and a.read() loads them as an ASCII text. How do I get it to load as UTF8?
The problem I am tring to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating as in template_str.format(attrib=val). But the string to interpolate is of a superset of ASCII.
Our team's version control and text editors does not care about the encoding. So how do I handle it in the code?
You are trying to opening files without specifying an encoding, which means that python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the .decode() function:
template_str = template_str.decode('utf8')
Your val variable you tried to interpolate into your template is itself a unicode value, and python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both, and it'll use the default encoding to do so.
Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
A solution working in Python2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.

Categories