pandas read_csv encoding weird character - python

I tried to read my dataset in text file format using pandas. However, some characters are not encoded correctly. I got ??? for apostrophe.
What should I do to encode my file correctly? I've tried
encoding = "utf8" but I got UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 2044: unexpected end of data.
encoding = "latin1" but this gave me a lot of ???
encoding = "ISO-8859-1" or "ISO-8859-2", but these also came out garbled.
When I open my data in Sublime, I see this character: ’.
UPDATED: But when I access the entry using loc I got something like \u0102\u02d8\xe2\x82\u0179\xc2\u015, \u0102\u02d8\xe2\x82\u0179\xe2\x84\u02d8

You may be able to determine the encoding with chardet:
$ pip install chardet
>>> from urllib.request import urlopen
>>> rawdata = urlopen('http://yahoo.co.jp/').read()
>>> import chardet
>>> chardet.detect(rawdata)
{'encoding': 'EUC-JP', 'confidence': 0.99}
The basic usage docs also show how to infer the encoding of files too large to read into memory: the detector reads the file incrementally until it's confident enough about the encoding.
According to this answer you should try encoding="ISO-8859-2":
My guess is that your input is encoded as ISO-8859-2 which contains Ă as 0xC3.
Note: Sublime may not infer the encoding correctly either, so take its output with a pinch of salt; it's best to check with your vendor (wherever you're getting the file from) what the actual encoding is.
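To see why the same byte comes out differently under each guess, here is a quick sketch decoding the single byte 0xC3 (from the error message) under the candidate encodings:

```python
# The same byte value maps to different characters depending on the encoding.
raw = b"\xc3"

print(raw.decode("iso-8859-1"))  # 'Ã' (U+00C3)
print(raw.decode("iso-8859-2"))  # 'Ă' (U+0102), matching the guess above

# As UTF-8, this single byte is invalid: 0xC3 is a lead byte that expects a
# continuation byte, which is exactly the "unexpected end of data" error.
try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

This is why trying encodings one by one can "succeed" without being correct: most single-byte encodings will decode any byte to *something*, just not necessarily the right character.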

Python opening files with utf-8 file names

In my code I used something like file = open(path + '/' + filename, 'wb') to write the file,
but in my attempt to support non-ASCII filenames, I encode it as such:
naming = path+'/'+filename
file = open(naming.encode('utf-8', 'surrogateescape'), 'wb')
write binary data...
so the file is named something like directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt
and it works, but the issue arises when I try to get that file again by crawling into the same directory using:
for file in path.iterdir():
    data = open(file.as_posix(), 'rb')
...
I keep getting this error 'ascii' codec can't encode characters in position..
I tried converting the string to bytes like data = open(bytes(file.as_posix(), encoding='utf-8'), 'rb') but I get 'utf-8' codec can't encode characters in position...'
I also tried file.as_posix().encode('utf-8', 'surrogateescape'), I found that both encode and print just fine but with open() I still get the error 'utf-8' codec can't encode characters in position...'
How can I open a file with a utf-8 filename?
I'm using Python 3.9 on Ubuntu Linux.
Any help is greatly appreciated.
EDIT
I figured out why the issue happens when crawling to the directory after writing.
So, when I write the file and give it the raw string directory/path/\xd8\xb9\xd8\xb1\xd8\xa8\xd9.txt and encode the string to UTF-8, it writes fine.
But when finding the file again by crawling into the directory the str(filepath) or filepath.as_posix() returns the string as directory/path/????????.txt so it gives me an error when I try to encode it to any codec.
Currently I'm investigating whether the issue is related to my Linux locale; it was set to POSIX, and I changed it to C.UTF-8, but still no luck so far.
More context: this is a file system where the file is uploaded through a site, so I receive the filename string in utf-8 format
I don't understand why you feel you need to recode filepaths.
Linux (Unix) filenames are just sequences of bytes (with a couple of prohibited byte values). There's no need to break astral characters into surrogate pairs; the UTF-8 sequence for an astral character is perfectly acceptable in a filename. But creating surrogate pairs is likely to get you into trouble, because there's no UTF-8 encoding for a surrogate. So if you actually manage to create something that looks like the UTF-8 encoding of a surrogate codepoint, you're likely to encounter a decoding error when you attempt to turn it back into a Unicode codepoint.
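A minimal sketch of what surrogateescape actually does may help here: it is a way to smuggle undecodable bytes through a str, and only encoding back with the *same* handler restores the original bytes (the filename below is illustrative):

```python
# A filename containing a byte that is not valid UTF-8:
raw = b"file_\xff.txt"

# Decoding with surrogateescape maps the bad byte to a lone surrogate
# (U+DCFF) instead of raising UnicodeDecodeError.
name = raw.decode("utf-8", "surrogateescape")
print(repr(name))  # 'file_\udcff.txt'

# Encoding back with the same handler round-trips to the original bytes:
assert name.encode("utf-8", "surrogateescape") == raw

# But strict UTF-8 encoding of that string fails, because surrogates
# have no UTF-8 encoding -- the trap described above.
try:
    name.encode("utf-8")
except UnicodeEncodeError as exc:
    print(exc)
```

This is also what Python itself does for filenames on POSIX: os.fsdecode/os.fsencode use surrogateescape under the hood, which is why round-tripping paths through pathlib "just works" without manual recoding.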
Anyway, there's no need to go to all that trouble. Before running this session, I created a directory called 'ñ' with two empty files, 𝔐 and mañana. The first one is an astral character, U+1D510. As you can see, everything works fine, with no need for manual decoding.
>>> from pathlib import Path
>>> [*Path('ñ').iterdir()]
[PosixPath('ñ/𝔐'), PosixPath('ñ/mañana')]
>>> Path('ñ2').mkdir()
>>> for path in Path('ñ').iterdir():
...     open(Path('ñ2', path.name), 'w').close()
...
>>> [*Path('ñ2').iterdir()]
[PosixPath('ñ2/𝔐'), PosixPath('ñ2/mañana')]
>>> [open(path).read() for path in Path('ñ2').iterdir()]
['', '']
Note:
In a comment, OP says that they had previously tried:
file = open('/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
and received the error
UnicodeEncodeError: 'ascii' codec can't encode characters in position 8-11: ordinal not in range(128)
Without more details, it's hard to know how to respond to that. It's possible that open will raise that error for a filesystem which doesn't allow non-ascii characters, but that wouldn't be normal on Linux.
However, it's worth noting that the string literal
'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png'
is not the string you think it is. \x escapes in a Python string are Unicode codepoints (with a maximum value of 255), not individual UTF-8 byte values. The Python string literal, "\xd8\xb9" contains two characters, "O with stroke" (Ø) and "superscript 1" (¹); in other words, it is exactly the same as the string literal "\u00d8\u00b9".
To get the Arabic letter ain (ع), either just type it (if you have an Arabic keyboard setting and your source file encoding is UTF-8, which is the default), or use a Unicode escape for its codepoint U+0639: "\u0639".
If for some reason you insist on using explicit UTF-8 byte encoding, you can use a byte literal as the argument to open:
file = open(b'/upload/\xd8\xb9\xd8\xb1\xd8\xa8\xd9\x8a.png', 'wb')
But that's not recommended.
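A quick check of the str-versus-bytes distinction described above, using the Arabic letter ain from the answer:

```python
# In a str literal, \xNN is a Unicode codepoint, not a UTF-8 byte:
# '\xd8\xb9' is two Latin-1 characters, 'Ø' and '¹'.
assert "\xd8\xb9" == "\u00d8\u00b9"

# In a bytes literal, \xNN is an actual byte, and D8 B9 is the
# UTF-8 encoding of U+0639 (Arabic letter ain, ع):
assert b"\xd8\xb9".decode("utf-8") == "\u0639"

# Equivalently, encoding the character produces those two bytes:
assert "\u0639".encode("utf-8") == b"\xd8\xb9"
```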
So after being in a rabbit hole for the past few days, I figured the issue isn't with python itself but with the locale that my web framework was using. Debugging this, I saw that
import sys
print(sys.getfilesystemencoding())
returned 'ASCII', which was weird considering I had set the Linux locale to C.UTF-8. It turned out that since I was running WSGI on Apache2, I had to add the locale to my WSGI daemon process, as WSGIDaemonProcess my_app locale='C.UTF-8', in the Apache configuration file, thanks to this post.

Failed to read pdf file

text= textract.process("/Users/dg/Downloads/Data Wrangling/syllabi/82445.pdf")
I tried to read this file, but it throws the following error:-
'charmap' codec can't decode byte 0x9d in position 6583: character maps to <undefined>
Why does it throw this error? How do I fix this?
This error can sometimes be worked around by changing how the path is written. You can do it in two ways:
The first is to use a raw string, r"THEPATH", so the path is passed exactly as written, e.g. text = textract.process(r"/Users/dg/Downloads/Data Wrangling/syllabi/82445.pdf")
Or you can double the separators, such as "//Users//dg//Downloads//Data Wrangling//syllabi//82445.pdf" (this works the same way).
Hopefully this helped you :), and feel free to ask any further questions.
I could do it like this :
file = open("/Users/dg/Downloads/Data Wrangling/syllabi/82445.pdf", "r")
text = file.read()
file.close()
That's an encoding problem.
Textract uses chardet to detect the encoding of the pdf file (utf-8, latin1, cp1252, etc.). Detecting the encoding of a file is not always an easy task, and chardet can fail at detecting the encoding of the file. In your case, it seems that for this particular pdf file, it failed.
If you know the encoding of your file, then you could use the input_encoding parameter like this:
textract.process(filename, input_encoding="cp1252", output_encoding="utf8")
(see issue 309 in the links below)
Note that the encoding parameter specifies the output encoding, not the input encoding.
So, writing
text = textract.process(filename, encoding='ascii')
means that you want to write the output file with ascii encoding. But it doesn't mean that ascii is the encoding of your input file.
A note about chardet:
You can guess the encoding of a file like this with chardet:
import chardet

with open(filename, 'rb') as f:
    guessed_encoding = chardet.detect(f.read())
print(guessed_encoding)
And it will output something like this:
{'encoding': 'EUC-JP', 'confidence': 0.99}
Or:
{'encoding': 'EUC-JP', 'confidence': 0.24}
Here you can see that there is a confidence key. In the first example, chardet is very confident that the encoding is EUC-JP, but that's not the case in the second example.
You could try to use chardet with the pdf file that causes problem and see what is its confidence score.
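If chardet's confidence is low (or it isn't installed), a common fallback is to simply try a short list of likely encodings in order. A sketch under that assumption (the function name and encoding list are illustrative; latin-1 goes last because it maps every byte and therefore never fails):

```python
def guess_decode(data: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (encoding, text) for the first encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return enc, data.decode(enc)
        except UnicodeDecodeError:
            continue

# 0x96 is an en dash in cp1252 but invalid as a standalone UTF-8 byte:
print(guess_decode(b"caf\x96"))  # falls through utf-8, succeeds as cp1252

# 0x9d is unmapped even in cp1252, so it falls through to latin-1:
print(guess_decode(b"\x9d"))
```

Note that "decodes cleanly" is not the same as "decoded correctly" -- latin-1 will happily produce mojibake -- so this is only a pragmatic last resort when the true encoding can't be determined.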
Useful links:
https://github.com/deanmalmgren/textract/issues/309
https://github.com/deanmalmgren/textract/issues/164

utf-8 error when opening csv file in pandas on mac

I am trying to open a csv file with Japanese characters using utf8 on my mac.
The code that I am using is as follows:
foo = pd.read_csv("filename.csv", encoding = 'utf8')
However, I have been getting the following error message.
'utf-8' codec can't decode byte 0x96 in position 0
I've tried looking around but a lot of the solutions seem to be for windows/I haven't had any success with other solutions yet.
Appreciate the help!
It seems that your file contains bytes that aren't valid UTF-8. The correct encoding for this file strongly depends on its content, but in the most common case 0x96 can be decoded with cp1252. So just try decoding it like this:
foo = pd.read_csv("filename.csv", encoding = 'cp1252')
If you don't know the original encoding of the file, you can try to detect it with third-party libs such as chardet.
I may be able to help you a little more if you upload a chunk of the file to reproduce the problem.
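Since the file contains Japanese text, it may also be worth trying Japanese legacy encodings: 0x96 is a valid lead byte in Shift-JIS. A hedged sketch of how that byte reads under each candidate (the two-byte sequence below is illustrative):

```python
# In cp1252, 0x96 is an en dash:
print(b"\x96".decode("cp1252"))          # '–'

# In Shift-JIS, 0x96 is the lead byte of a two-byte character,
# e.g. 96 7B is 本 (U+672C):
print(b"\x96\x7b".decode("shift_jis"))   # '本'

# As UTF-8, 0x96 alone is invalid -- the error from the question:
try:
    b"\x96".decode("utf-8")
except UnicodeDecodeError as exc:
    print(exc)
```

If Shift-JIS turns out to be right for your file, pd.read_csv("filename.csv", encoding="shift_jis") (or encoding="cp932", its Windows superset) would be the fix.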

How to open an ascii-encoded file as UTF8?

My files are in US-ASCII and a command like a = file('main.html') and a.read() loads them as ASCII text. How do I get it to load as UTF-8?
The problem I am trying to solve is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xae' in position 38: ordinal not in range(128)
I was using the content of the files for templating as in template_str.format(attrib=val). But the string to interpolate is of a superset of ASCII.
Our team's version control and text editors do not care about the encoding. So how do I handle it in the code?
You are trying to opening files without specifying an encoding, which means that python uses the default value (ASCII).
You need to decode the byte-string explicitly, using the .decode() function:
template_str = template_str.decode('utf8')
Your val variable you tried to interpolate into your template is itself a unicode value, and python wants to automatically convert your byte-string template (read from the file) into a unicode value too, so that it can combine both, and it'll use the default encoding to do so.
Did I mention already you should read Joel Spolsky's article on Unicode and the Python Unicode HOWTO? They'll help you understand what happened here.
A solution working in Python2:
import codecs
fo = codecs.open('filename.txt', 'r', 'ascii')
content = fo.read() ## returns unicode
assert type(content) == unicode
fo.close()
utf8_content = content.encode('utf-8')
assert type(utf8_content) == str
I suppose that you are sure that your files are encoded in ASCII. Are you? :) As ASCII is included in UTF-8, you can decode this data using UTF-8 without expecting problems. However, when you are sure that the data is just ASCII, you should decode the data using just ASCII and not UTF-8.
"How do I get it to load as UTF8?"
I believe you mean "How do I get it to load as unicode?". Just decode the data using the ASCII codec and, in Python 2.x, the resulting data will be of type unicode. In Python 3, the resulting data will be of type str.
You will have to read about this topic in order to learn how to perform this kind of decoding in Python. Once understood, it is very simple.
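In Python 3 the distinction above largely disappears, because reading a file with an explicit encoding gives you str directly. A short sketch (the file name and contents are illustrative):

```python
import tempfile
from pathlib import Path

# Write a pure-ASCII template file:
tmp = Path(tempfile.mkdtemp()) / "main.html"
tmp.write_bytes(b"Hello {attrib}")

# ASCII is a strict subset of UTF-8, so either encoding decodes it:
text_ascii = tmp.read_text(encoding="ascii")
text_utf8 = tmp.read_text(encoding="utf-8")
assert text_ascii == text_utf8 == "Hello {attrib}"

# Once decoded to str, interpolating a non-ASCII value is safe --
# this is exactly the mix that raised UnicodeEncodeError in Python 2:
result = text_ascii.format(attrib="\xae")  # U+00AE, ®
assert result == "Hello \xae"
```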

Python Decoding Unicode files with 'ÆØÅ'

I read in some data from a Danish text file, but I can't seem to find a way to decode it.
The original text is "dør" but in the raw text file its stored as "d√∏r"
So I tried the obvious:
InputData = "d√∏r"
print InputData.decode('iso-8859-1')
sadly resulting in the following error:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-6: ordinal not in range(128)
UTF-8 gives the same error.
(using Python 2.6.5)
How can I decode this text so the printed message would be "dør"?
C3 B8 is the UTF-8 encoding for "ø". You need to read the file in UTF-8 encoding:
import codecs
codecs.open(myfile, encoding='utf-8')
The reason that you're getting a UnicodeEncodeError is that you're trying to output the text and Python doesn't know what encoding your terminal is in, so it defaults to ascii. To fix this issue, use sys.stdout = codecs.getwriter('utf8')(sys.stdout) or use the environment variable PYTHONIOENCODING="utf-8".
Note that this will give you the text as unicode objects; if everything else in your program is str then you're going to run into compatibility issues. Either convert everything to unicode or (probably easier) re-encode the file into Latin-1 using ustr.encode('iso-8859-1'), but be aware that this will break if anything is outside the Latin-1 codepage. It might be easier to convert your program to use str in utf-8 encoding internally.
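The "d√∏r" in the question looks like UTF-8 bytes that were already mis-decoded once, most likely as Mac Roman (where 0xC3 displays as '√' and 0xB8 as '∏'). If that is what happened, the mojibake can be repaired by reversing the wrong decode -- a sketch under that assumption:

```python
mojibake = "d√∏r"

# Re-encode with the encoding that was wrongly used to decode,
# recovering the original bytes b'd\xc3\xb8r', then decode as UTF-8:
fixed = mojibake.encode("mac_roman").decode("utf-8")
print(fixed)  # dør
```

This only works when the round trip is lossless, i.e. when every mojibake character actually maps back to a byte in the guessed wrong encoding; otherwise the encode step will raise and a different wrong encoding should be tried.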
