Reading .dat file through pandas.read_csv() gives unicode error - python

I have a set of .dat files and I'm not sure what type of data they carry (mostly non-video/audio content; it should be a mix of integers, text, and special characters). I came to learn that .dat files can be read into Python using pandas read_csv or read_table, so I tried the below:
DATA = pd.read_csv(r'file_path\Data.dat', header=None)
Below is the error I get:
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes
in position 2-3: truncated \UXXXXXXXX escape
I've tried the solutions listed around the web, including Stack Overflow and blogs, and tried the options below too, but none of them worked:
Used double quotes for the file path instead of single quotes (pd.read_csv(r"filepath"))
Used a double backslash instead of a single backslash
Used a forward slash
Used a double backslash only at the beginning of the file path, something like C:\User\folder....
Tried a few encodings like utf-8, ascii, latin-1, etc., and the error for all of the above is "EmptyDataError: No columns to parse from file"
Tried without the r in the read_csv argument. Didn't work
Tried sep='\s+' and also set skiprows. No use
One thing to mention is that one of my folder names contains numbers as well as text. Could that be causing this issue by any chance?
Can someone highlight what I need to do? Thanks in advance.
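For reference, a minimal sketch of the usual fix: a raw-string (or forward-slash) path so Python doesn't treat \U as an escape, plus an explicit separator and encoding for the .dat contents. The path, separator, and encoding below are assumptions, not values from the question.

import pandas as pd

# Hypothetical path -- the r'' prefix stops Python from interpreting
# \U, \n, \t, etc. as escape sequences at parse time.
path = r'C:\Users\me\data\Data.dat'

# Many .dat files are just delimited text; if the path is fine but you
# get EmptyDataError or garbled columns, the separator and encoding are
# the next things to experiment with.
DATA = pd.read_csv(path, sep='\t', header=None, encoding='latin-1')
print(DATA.head())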

Related

Dataset loading error in Python in Jupyter

import pandas as pd
import numpy as ny
studentPerfomance = 'C:\Users\Vignesh\Desktop\project\students-performance-in-exams\StudentsPerformance.csv'
Error:
File "<ipython-input-10-056bf84aaa71>", line 1
studentPerfomance = 'C:\Users\Vignesh\Desktop\project\students-performance-in-exams\StudentsPerformance.csv'
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
Use the standard slash / and not the backslash. It is not good practice to use the backslash to separate folders; I do not know why Windows still uses it as the standard way to display paths.
The problem with the backslash is that it starts escape sequences like \n (new line) or \t (tab).
So the solution is to replace all backslashes with the standard slash /.
import pandas as pd
import numpy as ny
studentPerfomance = 'C:/Users/Vignesh/Desktop/project/students-performance-in-exams/StudentsPerformance.csv'
The problem is that you are using a normal string literal as a path, so the backslashes are interpreted as escape sequences.
Just put r before your normal string; it converts the normal string to a raw string:
studentPerfomance = r'C:\Users\Vignesh\Desktop\project\students-performance-in-exams\StudentsPerformance.csv'
or
studentPerfomance = 'C:\\Users\\Vignesh\\Desktop\\project\\students-performance-in-exams\\StudentsPerformance.csv'
In general, there is nothing wrong with what you did. I'm also proud of you for not having any spaces in your path! (Spaces in paths are very unprofessional.) The issue is that the backslashes (\) in your studentPerfomance string start escape sequences in Python, so Python tries to interpret an escape every time it sees a \.
That said, Windows uses backslashes in system paths instead of forward slashes like Linux-based operating systems do, causing users extra pain.
The best way to fix this issue is to prefix your string with an r, like so:
studentPerfomance = r'C:\Users\Vignesh\Desktop\project\students-performance-in-exams\StudentsPerformance.csv'
This tells Python to treat the backslashes literally instead of interpreting them as escape sequences.
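A quick demonstration of why the original line fails at parse time (a sketch; the paths are made up):

# In 'C:\Users\...', Python sees \U and expects an 8-digit \UXXXXXXXX
# unicode escape, so the string literal itself fails to parse:
#   path = 'C:\Users\Vignesh'   # SyntaxError: (unicode error) 'unicodeescape' ...
# A raw string leaves the backslashes alone:
path = r'C:\Users\Vignesh'
print(path)                            # C:\Users\Vignesh
print('C:\\Users\\Vignesh' == path)    # True -- doubling the backslashes is equivalent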

Python decoding issue with Chinese characters

I'm using Python 3.5, and I'm trying to take a block of byte text that may or may not contain special Chinese characters and output it to a file. It works for entries that do not contain Chinese characters, but breaks when they do. The Chinese characters are always a person's name, and are always in addition to the English spelling of their name. The text is JSON formatted and needs to be decoded before I can load it. The decoding seems to go fine and doesn't give me any errors. When I try to write the decoded text to a file, it gives me the following error message:
UnicodeEncodeError: 'charmap' codec can't encode characters in position 14-18: character maps to <undefined>
Here is an example of the raw data that I get before I do anything to it:
b' "isBulkRecipient": "false",\r\n "name": "Name in, English \xef'
b'\xab\x62\xb6\xe2\x15\x8a\x8b\x8a\xee\xab\x89\xcf\xbc\x8a",\r\n
Here is the code that I am using:
import csv
import json
from pprint import pprint

recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
recipientName = recipientData['signers'][0]['name']
pprint(recipientName)

with open('envelope recipient list.csv', 'a', newline='') as fp:
    a = csv.writer(fp, delimiter=',')
    csvData = [[recipientName]]
    a.writerows(csvData)
The recipientContent is obtained from an API call. I do not need to have the Chinese characters in the output file. Any advice will be greatly appreciated!
Update:
I've been doing some manual workarounds for each entry that breaks, and came across other entries that didn't contain special Chinese characters but had special characters from other languages, and they broke the program as well. The special characters are only in the name field, so a name could be something like "Ałex", a mixture of normal and special characters. Before I decode the string that contains this information, I am able to print it to the screen and it looks like this: b'name": "A\xc5ex",\r\n
But after I decode it into utf-8, it gives me an error if I try to output it. The error message is: UnicodeEncodeError: 'charmap' codec can't encode character '\u0142' in position 2: character maps to <undefined>
I looked up what \u0142 was and it is the ł special character.
The error you're getting is when you're writing to the file.
In Python 3.x, when you open() in text mode (the default) without specifying an encoding=, Python will use an encoding most suitable to your locale or language settings.
If you're on Windows, this will use the charmap codec to map to your language encoding.
Although you could just write bytes straight to a file, you're doing the right thing by decoding it first. As others have said, you should really decode using the encoding specified by the web server. You could also use the Python Requests module, which does this for you. (Your example doesn't decode as UTF-8, so I assume your example isn't correct.)
To solve your immediate error, simply pass an encoding to open(), which supports the characters you have in your data. Unicode in UTF-8 encoding is the obvious choice. Therefore, you should change your code to read:
with open('envelope recipient list.csv', 'a', encoding='utf-8', newline='') as fp:
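To see why this fixes the error, here is a self-contained sketch (the name is made up):

import csv

recipientName = 'A\u0142ex'  # 'Ałex' -- U+0142 is missing from most Windows code pages

# Without encoding=, Windows opens the file with a locale 'charmap'
# codec that can't represent U+0142 and raises UnicodeEncodeError;
# UTF-8 can represent any Unicode character.
with open('recipients.csv', 'a', encoding='utf-8', newline='') as fp:
    csv.writer(fp, delimiter=',').writerows([[recipientName]])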
Warning: shotgun solution ahead
Assuming you just want to get rid of all foreign characters in your file (that is, they are not important for your future processing of the other fields), you can simply ignore all non-ASCII characters. Replace
recipientData = json.loads(recipientContent.decode('utf-8', 'ignore'))
with
recipientData = json.loads(recipientContent.decode('ascii', 'ignore'))
This way you remove all non-ASCII characters before any further processing.
I called it a shotgun solution because it might not work correctly under certain circumstances:
Obviously, if the non-ASCII characters need to be kept for future use
If b'\' or b'"' characters appear, for example as part of a UTF-16 character (stripping them would then corrupt the data)
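A quick illustration of what the ascii/'ignore' decode does (made-up bytes):

# b'A\xc5\x82ex' is 'Ałex' encoded as UTF-8; the ascii codec silently
# drops the two non-ASCII bytes instead of raising an error:
print(b'A\xc5\x82ex'.decode('ascii', 'ignore'))   # Aex
print(b'A\xc5\x82ex'.decode('utf-8'))             # Ałex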
Add this line to your code:
from __future__ import unicode_literals

character set issue UTF-8

I have a document in which the word "doesn't" contains an apostrophe, as shown below.
When I tried to process it via a Python program, it showed the word as "doesnÆt" and exited with the error mentioned below.
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 70: invalid start byte
I opened the document in Notepad and changed the encoding from ANSI to UTF-8 (a fix I found somewhere on the web), and now it is working fine.
But can someone shed some light on what all these things are about, and how I can type this kind of apostrophe with my laptop keyboard?
MS Word famously converts quotes to "smart-quotes", so that they properly wrap around words or point in the right direction as an apostrophe.
You haven't been exactly faithful with your copy-paste so it's hard to be sure we're talking about the same thing.
For example here's the smart-quotes compared to plain ascii:
Doesn’t vs. Doesn't
or
“hello” vs. "hello"
Notice how the smart quotes on the left are curlier. In your screenshot, ’ will have been mapped to the Unicode code point U+2019 ('RIGHT SINGLE QUOTATION MARK'). To type a smart quote manually on Windows, hold Alt and type the character code on the numeric keypad (e.g. Alt+0146 for ’), though normally it's Word's AutoCorrect that inserts them for you.
You've then likely saved this text in the Windows-1252 (Western Europe) encoding (aka ANSI), which assigns this character the byte 0x92. Then you loaded this into Python but passed the incorrect encoding of UTF-8. That's when you saw the exception.
The way to deal with this in the future is to specify the correct encoding when opening the file in Python. E.g.
import io

with io.open("myfile.txt", 'r', encoding="windows-1252") as my_file:
    my_data = my_file.read()
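If you'd rather convert the file once instead of remembering the encoding everywhere, a minimal sketch (the file names are assumptions):

import io

# Read the ANSI (Windows-1252) file and rewrite it as UTF-8,
# which is what the asker did manually in Notepad.
with io.open("myfile.txt", 'r', encoding="windows-1252") as src:
    text = src.read()
with io.open("myfile_utf8.txt", 'w', encoding="utf-8") as dst:
    dst.write(text)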

Importing file with unknown encoding from Python into MongoDB

I'm working on importing a tab-delimited file over HTTP in Python.
Before inserting a row's data into MongoDB, I'm removing slashes, ticks and quotes from the string.
Whatever the encoding of the data is, MongoDB is throwing me the exception:
bson.errors.InvalidStringData: strings in documents must be valid UTF-8
So, in an endeavour to solve this problem, from the reading I've done I want to convert the row's data to Unicode as early as I can, using the unicode() function. In addition, I have tried calling the decode() function with "unicode" as the first parameter, but I receive the error:
LookupError: unknown encoding: unicode
From there, I can make my string manipulations, such as replacing the slashes, ticks, and quotes. Then, before inserting the data into MongoDB, I convert it to UTF-8 using str.encode('utf-8').
Problem: When converting to Unicode, I am receiving the error
UnicodeDecodeError: 'ascii' codec can't decode byte 0x93 in position 1258: ordinal not in range(128)
With this error, I'm not exactly sure where to continue.
My question is this: How do I successfully import the data from a file without knowing its encoding and successfully insert it into MongoDB, which requires UTF-8?
Thanks Much!
Try these in order:
(0) Check that your removal of the slashes/ticks/etc. is not butchering the data. What's a tick? Please show your code. Please show a sample of the raw data ... use print repr(sample_raw_data) and copy/paste the output into an edit of your question.
(1) There's an old maxim: "If the encoding of a file is unknown, or stated to be ISO-8859-1, it is cp1252" ... where are you getting it from? If it's coming from Western Europe, the Americas, or any English/French/Spanish-speaking country/territory elsewhere, and it's not valid UTF-8, then it's likely to be cp1252.
[Edit 2] Your error byte 0x93 decodes to U+201C LEFT DOUBLE QUOTATION MARK for all encodings cp1250 to cp1258 inclusive ... what language is the text written in? [/Edit 2]
(2) Save the file (before tick removal), then open the file in your browser: Does it look sensible? What do you see when you click on View / Character Encoding?
(3) Try chardet
Edit with some more advice:
Once you know what the encoding is (let's assume it's cp1252):
(1) convert your input data to unicode: uc = raw_data.decode('cp1252')
(2) process the data (remove slashes/ticks/etc) as unicode: clean_uc = manipulate(uc)
(3) you need to output your data encoded as utf8: to_mongo = clean_uc.encode('utf8')
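A sketch of that pipeline in Python 3 terms (the question is Python 2-era; the use of chardet and the sample bytes are assumptions):

import chardet  # third-party: pip install chardet

raw_data = b'He said \x93hello\x94'   # made-up cp1252-encoded bytes

# chardet gives a best-effort guess with a confidence score,
# not a certainty -- sanity-check it against your data:
guess = chardet.detect(raw_data)
encoding = guess['encoding'] or 'cp1252'

uc = raw_data.decode(encoding)                     # (1) bytes -> unicode
clean_uc = uc.replace('\\', '').replace("'", '')   # (2) manipulate as unicode
to_mongo = clean_uc.encode('utf8')                 # (3) unicode -> UTF-8 bytes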
Note 1: Your error message says "can't decode byte 0x93 in position 1258" ... 1258 bytes is a rather long chunk of text; is this reasonable? Have you had a look at the data that it is complaining about? How? What did you see?
Note 2: Please consider reading the Python Unicode HOWTO and this article

Python Unicode CSV export (using Django)

I'm using a Django app to export a string to a CSV file. The string is a message that was submitted through a front end form. However, I've been getting this error when a unicode single quote is provided in the input.
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019'
in position 200: ordinal not in range(128)
I've been trying to convert the unicode to ascii using the code below, but still get a similar error.
UnicodeEncodeError: 'ascii' codec can't encode characters in
position 0-9: ordinal not in range(128)
I've sifted through dozens of websites and learned a lot about unicode; however, I'm still not able to convert this unicode to ascii. I don't care if the algorithm removes the unicode characters. The commented lines indicate various options I've tried, but the error persists.
import csv
import unicodedata
...
#message = unicode( unicodedata.normalize(
# 'NFKD',contact.message).encode('ascii','ignore'))
#dmessage = (contact.message).encode('utf-8','ignore')
#dmessage = contact.message.decode("utf-8")
#dmessage = "%s" % dmessage
dmessage = contact.message
csv_writer.writerow([
    dmessage,
])
Does anyone have any advice on removing Unicode characters so I can export them to CSV? This seemingly easy problem has kept my head spinning. Any help is much appreciated.
Thanks,
Joe
You can't encode the Unicode character u'\u2019' (U+2019 Right Single Quotation Mark) into ASCII, because ASCII doesn't have that character in it. ASCII is only the basic Latin alphabet, digits and punctuation; you don't get any accented letters or ‘smart quotes’ like this character.
So you will have to choose another encoding. Now normally the sensible thing to do would be to export to UTF-8, which can hold any Unicode character. Unfortunately for you if your target users are using Office (and they probably are), they're not going to be able to read UTF-8-encoded characters in CSV. Instead Excel will read the files using the system default code page for that machine (also misleadingly known as the ‘ANSI’ code page), and end up with mojibake like ’ instead of ’.
So that means you have to guess the user's system default code page if you want the characters to show up correctly. For Western users, that will be code page 1252. Users with non-Western Windows installs will see the wrong characters, but there's nothing you can do about that (other than organise a letter-writing campaign to Microsoft to just drop the stupid nonsense with ANSI already and use UTF-8 like everyone else).
Code page 1252 can contain U+2019 (’), but obviously there are many more characters it can't represent. To avoid getting UnicodeEncodeError for those characters you can use the ignore argument (or replace to replace them with question marks).
dmessage = contact.message.encode('cp1252', 'ignore')
alternatively, to give up and remove all non-ASCII characters, so that everyone gets an equally bad experience regardless of locale:
dmessage = contact.message.encode('ascii', 'ignore')
Encoding is a pain, but if you're working in Django, have you tried smart_unicode(str) from django.utils.encoding? I find that usually does the trick.
The only other option I've found is to use the built-in Python encode() and decode() for strings, but you have to specify the encoding for those and, honestly, it's a pain.
[caveat: I'm not a djangoist; django may have a better solution].
General non-django-specific answer:
If you have a smallish number of known non-ASCII characters and there are user-acceptable ASCII equivalents for them, you can set up a translation table and use the unicode.translate method:
smashcii = {
    0x2019 : u"'",
    # etc
}
smashed = input_string.translate(smashcii)
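For example, in Python 3, where str.translate accepts the same ordinal-keyed mapping (the extra table entries are illustrative assumptions):

smashcii = {
    0x2019: "'",   # RIGHT SINGLE QUOTATION MARK  -> apostrophe
    0x201C: '"',   # LEFT DOUBLE QUOTATION MARK   -> plain quote
    0x201D: '"',   # RIGHT DOUBLE QUOTATION MARK  -> plain quote
}
print('It\u2019s a \u201Ctest\u201D'.translate(smashcii))
# -> It's a "test"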
