Parse big JSON file with font-encoding cp1252

I have to handle a big JSON file (approx. 47GB) and it seems as if I found the solution in ijson.
However, when I want to go through the objects I get the following error:
byggesag = (o for o in objects if o["h�ndelse"] == 'Byggesag')
^
SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0xe6 in position 12: invalid continuation byte
Here is the code I am using so far:
import ijson
with open("C:/Path/To/Json/JSON_20220703180000.json", "r", encoding="cp1252") as json_file:
    objects = ijson.items(json_file, 'SagList.item')
    byggesag = (o for o in objects if o['hændelse'] == 'Byggesag')
How can I deal with the encoding of the input file?

The problem is with the Python script itself, which is saved as cp1252 while Python expects UTF-8. You seem to be handling the input JSON file correctly (though you won't be able to tell until the script actually runs).
First, note that the error is a SyntaxError, which is raised while Python is loading and compiling your script/module, before any of it runs.
Secondly, note how in the first bit of code you shared hændelse appears scrambled, and Python complains that utf-8 cannot decode byte 0xe6. This is because the character æ (U+00E6, https://www.compart.com/de/unicode/U+00E6) is encoded as the single byte 0xe6 in cp1252. In UTF-8, 0xe6 is the lead byte of a three-byte sequence and must be followed by continuation bytes; here it is followed by a plain n, hence the "invalid continuation byte" error.
To solve it, save your Python script as UTF-8, or declare that it is saved as cp1252 with an encoding comment on the first or second line (see https://peps.python.org/pep-0263/ for reference).
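The mismatch can be reproduced without any file at all. The following is a minimal sketch, assuming the script really was saved as cp1252; the variable names are only illustrative:

```python
# Bytes an editor writes for "hændelse" when the file is saved as cp1252.
raw = "h\u00e6ndelse".encode("cp1252")
print(raw)

# Python 3 reads source files as UTF-8 unless told otherwise,
# so decoding those bytes as UTF-8 fails on the 0xe6 byte.
try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = str(exc)
print(error)

# Decoding with the encoding the file was actually saved in works.
recovered = raw.decode("cp1252")
print(recovered == "h\u00e6ndelse")
```

If re-saving the script as UTF-8 is not an option, PEP 263 allows a comment such as # -*- coding: cp1252 -*- on the first or second line of the file.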

Related

Python opening files with utf-8 file names

In my code I used something like file = open(path +'/'+filename, 'wb') to write the file
but in my attempt to support non-ascii filenames, I encode it as such
naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path:
    data = open(file.as_posix(), 'rb')
    ...
I keep getting this error 'ascii' codec can't encode characters in position..
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb') but I get 'utf-8' codec can't encode characters in position...'
I also tried file.as_posix().encode('utf-8', 'surrogateescape'), I found that both encode and print just fine but with open() I still get the error 'utf-8' codec can't encode characters in position...'
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on ubuntu linux
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to utf, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating if the issue's related to my linux locale, it was set to POSIX, I changed it to C.UTF-8 but still no luck atm.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format

I don't understand why you feel you need to recode filepaths.
Linux (unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters in surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there's no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding for a surrogate codepoint, you're likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called 'ñ' with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual encoding or decoding.
>>> from pathlib import Path
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path('ñ2').mkdir()
>>> for path in Path('ñ').iterdir():
...     open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ascii characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal, "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
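The difference between the two kinds of literal is easy to check. This is a small sketch using only the first two escapes from the file name above; the variable names are illustrative:

```python
text_literal = "\xd8\xb9"   # str: two code points, U+00D8 and U+00B9
byte_literal = b"\xd8\xb9"  # bytes: the UTF-8 encoding of one code point

# The str literal holds two characters, not one Arabic letter.
print(len(text_literal), [hex(ord(c)) for c in text_literal])

# Decoding the bytes as UTF-8 yields the single letter ain, U+0639.
ain = byte_literal.decode("utf-8")
print(len(ain), hex(ord(ain)))

# If a string was built from \x escapes by mistake, latin-1 maps code
# points 0-255 back to the same byte values, so it can be repaired.
repaired = text_literal.encode("latin-1").decode("utf-8")
print(repaired == "\u0639")
```

The latin-1 round trip is only a repair for strings that were mis-typed this way; a file name received as a proper str needs no conversion at all.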

So after being in a rabbit hole for the past few days, I figured the issue isn't with python itself but with the locale that my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was weird considering I had set the Linux locale to C.UTF-8. It turned out that, since I was running WSGI on Apache2, I had to set the locale for the WSGI daemon process as well, by adding WSGIDaemonProcess my_app locale='C.UTF-8' to the Apache configuration file (thanks to this post).
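For anyone debugging a similar setup, a slightly fuller diagnostic can be dropped into the running application. This is only a sketch of what one might log; which of these values matters depends on the deployment:

```python
import locale
import sys

# Encoding Python uses to convert str file names to bytes for the OS.
fs_encoding = sys.getfilesystemencoding()

# Encoding open() normally uses for text files when encoding= is omitted.
preferred = locale.getpreferredencoding(False)

# Encoding used when writing text to standard output (may be None if
# stdout has been replaced by the hosting environment).
stdout_encoding = getattr(sys.stdout, "encoding", None)

print("filesystem encoding:", fs_encoding)
print("preferred encoding: ", preferred)
print("stdout encoding:    ", stdout_encoding)
```

Running it both from a shell and from inside the web process shows whether the two environments really agree.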

Write bytes literal with undefined character to CSV file (Python 3)

Using Python 3.4.2, I want to get a part of a website. According to the meta tags, that website is encoded with iso-8859-1. And I want to write one part (along with other parts) to a CSV file.
However, this part contains an undefined character with the hex value 0x8b. In order to preserve the part as well as possible, I want to write it as is into the CSV file. However, Python doesn't let me do it.
Here's a minimal example:
import urllib.request
import urllib.parse
import csv
if __name__ == "__main__":
    with open("bytewrite.csv", "w", newline="") as csvfile:
        a = b'\x8b' # byte literal by urllib.request
        b = a.decode("iso-8859-1")
        w = csv.writer(csvfile)
        w.writerow([b])
And this is the output:
Traceback (most recent call last):
File "D:\Eigene\Dateien\Code\Python\writebyte.py", line 12, in <module>
w.writerow([b])
File "C:\Python34\lib\encodings\cp1252.py", line 19, in encode
return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode character '\x8b' in position 0: character maps to <undefined>
Eventually, I did it manually. It was just copy and paste with Notepad++, and according to a hex editor the value was inserted correctly. But how can I do it with Python 3? Why does Python even care what 0x8b stands for, instead of just writing it to the file?
It further irritates me that according to iso8859_1.py (and also cp1252.py) in C:\Python34\lib\encodings\ the lookup table seems to not interfere:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK

Quoted from csv docs:
Since open() is used to open a CSV file for reading, the file will by
default be decoded into unicode using the system default encoding (see
locale.getpreferredencoding()). To decode a file using a different
encoding, use the encoding argument of open:
import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)
The same applies to writing in something other than the system default encoding: specify the encoding argument when opening the output file.
What is happening is that you've decoded the byte to Unicode using iso-8859-1, which gives the control character U+008B. When writing, open() falls back to locale.getpreferredencoding(), which returns cp1252 on your system, and cp1252 has no byte for U+008B.
Corrected minimal example:
import csv
with open('bytewrite.csv', 'w', encoding='iso-8859-1', newline='') as csvfile:
    a = b'\x8b'
    b = a.decode("iso-8859-1")
    w = csv.writer(csvfile)
    w.writerow([b])

Your interpretation of the lookup tables in encodings is not correct. The code you've listed:
# iso8859_1.py
'\x8b' # 0x8B -> <control>
# cp1252.py
'\u2039' # 0x8B -> SINGLE LEFT-POINTING ANGLE QUOTATION MARK
Tells you two things:
How to map the unicode character '\x8b' to bytes in iso8859-1, it's just a control character.
How to map the unicode character '\u2039' to bytes in cp1252, it's a piece of punctuation: ‹
This does not tell you how to map the unicode character '\x8b' to bytes in cp1252, which is what you're trying to do.
The root of the problem is that "\x8b" is not a valid iso8859-1 character. Look at the table here:
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
8b is undefined, so it just decodes as a control character. After it's decoded and we're in unicode land, what is 0x8b? This is a little tricky to find out, but it's defined in the unicode database here:
008B;<control>;Cc;0;BN;;;;;N;PARTIAL LINE FORWARD;;;;
Now, does CP1252 have this control character, "PARTIAL LINE FORWARD"?
http://en.wikipedia.org/wiki/Windows-1252#Code_page_layout
No, it does not. So you get an error when trying to encode it in CP1252.
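The chain of events described above can be reproduced in a few lines. This is a sketch using only the standard library; the variable names are illustrative:

```python
import unicodedata

raw = b"\x8b"

# iso-8859-1 maps every byte to the code point with the same number,
# so 0x8b becomes U+008B, a C1 control character (category "Cc").
decoded = raw.decode("iso-8859-1")
print(repr(decoded), unicodedata.category(decoded))

# cp1252 has no byte for U+008B, so encoding fails.
try:
    decoded.encode("cp1252")
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)
print(error)

# Read as cp1252 instead, the same byte is a printable character.
as_cp1252 = raw.decode("cp1252")
print(unicodedata.name(as_cp1252))
```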
Unfortunately there's no good solution for this. Some ideas:
Guess what encoding the page actually is. It's probably CP1252, not ISO-8859-1, but who knows. It could even contain a mix of encodings, or incorrectly encoded data (mojibake). You can use chardet to guess the encoding, or force this URL to use CP1252 in your program (overriding what the meta tag says), or you could try a series of codecs and take the first one that decodes & encodes successfully.
Fix up the input text or the decoded unicode string using some kind of mapping of problematic characters like this. This will work most of the time, but will fail silently or do something weird if you're trying to "fix up" data where it doesn't make sense.
Do not try to convert from ISO-8859-1 to CP1252, as they aren't compatible with each other. If you use UTF-8 that might work better.
Use an encoding error handler. See this table for a list of handlers. Using xmlcharrefreplace and backslashreplace will preserve the information (but then require extra steps when decoding), while replace and ignore lose it: replace substitutes a ? for the bad character and ignore silently drops it.
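To make the trade-offs between the handlers concrete, here is a short sketch that encodes the problematic character to cp1252 with each of them, and to UTF-8 for comparison:

```python
ch = b"\x8b".decode("iso-8859-1")  # the control character U+008B

# Encode to cp1252 with each of the error handlers mentioned above.
handlers = ("backslashreplace", "xmlcharrefreplace", "replace", "ignore")
results = {name: ch.encode("cp1252", errors=name) for name in handlers}
for name, encoded in results.items():
    print(f"{name:18} {encoded!r}")

# UTF-8 can represent every code point, so no handler is needed there.
utf8_bytes = ch.encode("utf-8")
print(utf8_bytes)
```

The same errors= argument is accepted by open(), so a handler can be applied to the whole output file rather than to individual strings.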
These types of issues caused by older encodings are really hard to solve, and there is no perfect solution. This is the reason why unicode was invented.

Python 3 unicode to utf-8 on file

I am trying to parse through a log file, but the file format is always in unicode. My usual process that I would like to automate:
I pull file up in notepad
Save as...
change encoding from unicode to UTF-8
Then run python program on it
So this is the process I would like to automate in Python 3.4. Pretty much just changed the file to UTF-8 or something like open(filename,'r',encoding='utf-8') although this exact line was throwing me this error when I tried to call read() on it:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
It would be EXTREMELY helpful if I could convert the entire file (like in my first scenario) or just open the whole thing in UTF-8 that way I don't have to str.encode (or something like that) every time I analyze a string.
Anybody been through this and know which method I should use and how to do it?
EDIT:
In the Python 3 REPL, I did
>>> f = open('file.txt','r')
>>> f
<_io.TextIOWrapper name='file.txt' mode='r' encoding='cp1252'>
So now my program opens the file with open('file.txt','r',encoding='cp1252'). I am running a lot of regexes over this file, though, and they aren't matching anything (I think because it isn't UTF-8). So I just have to figure out how to switch from cp1252 to UTF-8. Thank you @Mark Ransom

What notepad considers Unicode is utf16 to Python. Windows "Unicode" files start with a byte order mark (BOM) of FF FE, which indicates little-endian UTF-16. This is why you get the following when using utf8 to decode the file:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
To convert to UTF-8, you could use:
with open('log.txt',encoding='utf16') as f:
    data = f.read()
with open('utf8.txt','w',encoding='utf8') as f:
    f.write(data)
Note that many Windows editors like a UTF-8 signature at the beginning of the file, or may assume ANSI instead. ANSI is really the local language locale. On US Windows it is cp1252, but it varies for other localized builds. If you open utf8.txt and it still looks garbled, use encoding='utf-8-sig' when writing instead.
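The conversion can be tried end to end without a real log file. The sketch below fakes a Notepad "Unicode" file in a temporary directory; the file names and sample text are made up for illustration:

```python
import codecs
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp, "log.txt")   # hypothetical input file
    target = Path(tmp, "utf8.txt")  # hypothetical output file

    # Simulate a Notepad "Unicode" file: BOM FF FE, then UTF-16-LE text.
    sample = "error: caf\u00e9\r\n"
    source.write_bytes(codecs.BOM_UTF16_LE + sample.encode("utf-16-le"))
    first_bytes = source.read_bytes()[:2]

    # The utf-16 codec uses the BOM to pick the byte order and strips it.
    # newline="" keeps the original line endings untouched.
    with open(source, encoding="utf-16", newline="") as f:
        data = f.read()
    with open(target, "w", encoding="utf-8", newline="") as f:
        f.write(data)

    converted = target.read_bytes()

print(first_bytes, converted)
```

Checking the first two bytes for FF FE before converting is a cheap way to confirm that a file really is UTF-16 rather than guessing from the error message.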

How to open an ascii-encoded file as UTF8?

My files are in US-ASCII, and a command like a = file('main.html') followed by a.read() loads them as ASCII text. How do I get it to load as UTF-8?
The problem I am trying to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating, as in template_str.format(attrib=val). But the string to interpolate contains characters outside ASCII.
Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?

You are opening files without specifying an encoding, so read() gives you a byte string, and whenever Python 2 has to convert it implicitly it uses the default encoding (ASCII).
You need to decode the byte string explicitly, using the .decode() method:
template_str = template_str.decode('utf8')
Your val variable you tried to interpolate into your template is itself a unicode value, and python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both, and it'll use the default encoding to do so.
Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.

A solution working in Python2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str

I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
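The answers above target Python 2. As a rough Python 3 sketch of the same idea (the template text and the title value are invented for illustration), decoding pure-ASCII bytes gives the same str with either codec, and the error only appears when a non-ASCII result is encoded back to ASCII:

```python
# Pure-ASCII bytes, as they would come from a file opened in binary mode.
ascii_bytes = b"<h1>{title}</h1>"

# ASCII is a subset of UTF-8, so both codecs give the same str.
as_ascii = ascii_bytes.decode("ascii")
as_utf8 = ascii_bytes.decode("utf-8")
print(as_ascii == as_utf8)

# Interpolating a non-ASCII value works once everything is str.
filled = as_utf8.format(title="Acme\u00ae")

# Encoding the result as ASCII fails much like in the question...
try:
    filled.encode("ascii")
    error = None
except UnicodeEncodeError as exc:
    error = str(exc)
print(error)

# ...while UTF-8 can represent it.
utf8_out = filled.encode("utf-8")
print(utf8_out)
```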

Python Decoding Unicode files with 'ÆØÅ'

I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør", but in the raw text file it's stored as "d√∏r".
So I tried the obvious:
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message would be "dør"?

C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.
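The bytes behind the garbled text can be checked directly. The sketch below is written for Python 3; the guess that the "d√∏r" rendering comes from viewing the UTF-8 bytes as Mac OS Roman is an inference from the characters shown in the question, not something the question states:

```python
original = "d\u00f8r"  # "dør"

# UTF-8 stores ø as the two bytes C3 B8.
utf8_bytes = original.encode("utf-8")
print(utf8_bytes)

# Viewed through the Mac OS Roman code page (an assumption), those two
# bytes show up as the square-root and n-ary product signs.
mojibake = utf8_bytes.decode("mac_roman")
print(ascii(mojibake))

# Decoding with the right codec recovers the original text.
decoded = utf8_bytes.decode("utf-8")
print(decoded == original)
```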
