Retrieving over-encoded information in a csv file? - python

I have been logging thousands of messages for a data science project into a csv file. Many of these messages contain emojis or non-English characters, so when I opened the csv file in Excel, these appeared in an encoded format (e.g. the red heart emoji ❤️ got encoded as â¤ï¸). This didn't disturb me much in the beginning, as I only needed the csv to store the data that I periodically analyzed. When reading the csv file using Python, I didn't notice any data corruption.
(However, I made an apparently huge mistake a couple of days ago: I ran into an error when reading the csv file, so I specified the engine parameter of pd.read_csv as 'python', and I believe this launched it all: every time I re-ran the script that updates the csv, all the text data got encoded again, possibly in utf-8 instead of the csv's original windows-1252.) Edit: I realized thanks to Tomalak's comments below that the real problem wasn't this modification but me manually modifying the csv file in Excel a number of times along the way.
The older the csv entries, the more the repeated encoding-recoding affected them: for the newest entries there is no issue, but for the oldest ones I now have a single heart emoji appearing in the csv as:
���
I found numerous entries in the csv file where I could simply apply .encode('windows-1252').decode('utf-8') 3-6 times (depending on how old the given entry is, and therefore how many times it got re-encoded) and obtain a favorable outcome, such as:
One such mangled entry stands for the sad/disappointed face emoji (😞). Applying the encoding-decoding pattern four times returned \U0001f61e, which is good enough for me; I can easily use the unicodedata library's excellent conversion method to obtain the corresponding unicodedata.name. I believe that's how I should be storing emojis from now on...
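As a quick sanity check of that plan, here is a minimal sketch (assuming only the standard unicodedata module) of the name lookup and the round trip back:
import unicodedata

print(unicodedata.name('\U0001f61e'))           # DISAPPOINTED FACE
print(unicodedata.lookup('DISAPPOINTED FACE'))  # 😞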
My understanding of applying the above-mentioned encode-decode pattern numerous times is that I cannot overdo it. If one string needs only three of these rounds while the next cell needs six, I could just do something like this (yes, I know iterrows() is a terribly inefficient approach, but it's just for the example):
for idx, _ in df.iterrows():
    tmp = df.loc[idx, 'text']
    for _ in range(6):
        tmp = tmp.encode("windows-1252").decode("utf-8")
    df.loc[idx, 'text'] = tmp
The problem, however, is that there are still quite a lot of entries where the above solution doesn't work. Let's just consider the above-mentioned encoded string that stands for a red heart:
���
Applying .encode("windows-1252").decode("utf-8") three times yields ���, but when attempting to apply the pattern a fourth time, I get: UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 1: character maps to <undefined>. My hunch is that not all strings were encoded with windows-1252...?
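A more defensive variant (a sketch) would stop as soon as a round trip fails instead of hard-coding six rounds; note that entries already containing U+FFFD are beyond repair, because the replacement character means the original bytes were discarded:
def undo_mojibake(text, max_rounds=6):
    """Reverse repeated utf-8-read-as-windows-1252 corruption.

    Stops at the first round that fails, so strings needing fewer
    rounds (or none at all) pass through unchanged.
    """
    for _ in range(max_rounds):
        try:
            text = text.encode("windows-1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break
    return text

df['text'] = df['text'].map(undo_mojibake)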
Is there any hope to get back my data in an uncorrupted format?

Emoji reading discrepancy between different applications

I have a bunch of tweet/thread datasets that I need to process, along with some separate annotation files. These annotation files consist of spans, represented by indexes that correspond to a word/sentence. The indexes are, as you may have predicted, the positions of the characters in the tweet/thread files.
The problem arises when I process the files with some emojis in them. To go with a specific example:
This is a part of the file in question (download):
TeamKhabib 😂😂😂 #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 #McTapper xxxxx://x.xx/xxxxxxxxxx
mmafan1709 #TeamKhabib #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 Conor is Khabib hardest fight and Khabib is Conors hardest fight
I read the file in python with plain open function, with the parameter encoding='utf8':
with open('028_948124816611139589.branch318.txt.username_text_tabseparated', 'r', encoding='utf-8') as f:
    content = f.read()
print(content[211:214])
An annotation says the word 'and' is in the span 211-214. But reading the file the way I show above, that span contains ' kh'.
When I use the indexes in the annotation files to get the spanned string, the string I am getting is 3 characters off (to the right), because in the annotations each 😂 apparently takes up 2 positions, whereas when Python reads them it is one; hence the character shift. It becomes much more obvious when I get the length of the file with len(list(file.read())): this returns 7809, while the actual length of the file is 7812. 7812 is the position I get at the end of the file in VS Code, via a plugin called vscode-position. Another file gives me an inconsistency of 513 vs. 527.
I have no problem reading the emojis; I see them in my output/array. However, the space they take up in the encoding is different. My question is not answered in other relevant questions.
Obviously, there is a point in reading this file: these files were read/created with some format/method/concept/encoding/whatever that this plugin and the annotators agree on, but open().read() does not.
I am using Python 3.8.
What am I missing here?
I believe, after the discussion, that the issue is that the spans were computed from Unicode strings that used surrogate pairs for code points above U+FFFF. Python 2 and other languages like Java and C# store Unicode strings as UTF-16 code units instead of abstracted code points like Python 3. If I treat the test data as UTF-16LE-encoded, the answer comes out:
# Important to note that the original file has two tabs in it that SO doesn't display:
# * between the first "TeamKhabib" and the smiley
# * between "mmafan1709" and "#TeamKhabib"
# Use the download link while it is valid.
with open('test.txt', 'r', encoding='utf-8') as f:
    content = f.read()
b = content.encode('utf-16le')
print(b[211 * 2:214 * 2].decode('utf-16le'))
# result: and
The offsets need to be doubled because each UTF-16 code unit is two bytes; the result must then be decoded to display it correctly.
I specifically used utf-16le vs. utf-16 because the latter adds a BOM and would throw off the count by another two bytes (or one code unit).
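For intuition, a small check (a sketch reusing the same 'face with tears of joy' emoji) of why the two counts drift apart: each emoji above U+FFFF is one Python 3 character but two UTF-16 code units.
s = 'TeamKhabib \U0001F602\U0001F602\U0001F602'
print(len(s))                          # 14 code points (Python 3's view)
print(len(s.encode('utf-16le')) // 2)  # 17 code units (the annotators' view)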

str.replace not working on my series - but works well on example

In my dataframe, I have a variable that should be a number but is currently recognized as a string, with a space as the thousands separator, for example: "5 948.5".
Before I can convert it to a float, I need to remove that space:
import pandas as pd

d = {'col1': [1, 2], 'numbers': [' 4 856.4', '5 000.5']}
data = pd.DataFrame(data=d)
data['numbers'] = data['numbers'].str.replace(" ", "")
This works perfectly.
But when I do the exact same thing to my series, nothing happens (no error message, but the spaces remain). Other manipulations on that series work normally.
Any idea of what I can try to understand and fix the problem on my series?
Thanks!
Edit:
I've loaded the data with
pd.read_csv("file.csv", encoding="ISO-8859-1")
Could that be responsible for the immovable spaces? If I don't do that, I get an error message when loading: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 2: invalid continuation byte.
I have tried calling read_csv with encoding='latin1' and encoding='cp1252'; the problem remains.
Edit 1.b: it seems to be an issue with the encoding of the space (thanks @Marat). I downloaded an Excel version of the data and tried to replace all spaces in that column with nothing. It did not work. Removing a few spaces manually did work (but the file is too large to do it this way).
Edit 2: sample data. It really looks like the example I gave above, which works... but it really doesn't. I know nobody can reproduce this on their computer; I am not asking for the solution, but rather for ideas of what could be wrong...
As requested: here is a copytoclipboard of my data:
GroupeRA,SecteurMA,StatutMA,TypeMA,fields_ancienMatricule,historizedFields_denomination,Annee,region,fields_codeComiteSubregional,arrondissement,fields_ins,adresse_commune,perequation
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2017,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"0,00 "
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"0,00 "
Crèches,MASS,Collectif,CREC,x,Le xyzI,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"1 302,26 "
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"687,56 "
Crèches,MASS,Collectif,CREC,632100101,xyz,2019,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"1 372,91 "
Edit 3: the data is in csv (though, as mentioned in edit 1.b, I also got the data in xls and have the same issue; even when opening the xls directly, "find & replace all" cannot find the spaces, as if Excel did not read them as such).
I used DbVisualizer to extract the data from our database.
Thanks all for your help. It was indeed an issue with the "space" character, which was not a space like the one produced by my keyboard. It got solved with the following SQL command when extracting the data:
[perequation]= CONVERT(MONEY,REPLACE(REPLACE(ds.computedFields_perequation, CHAR(160),''), ',', '.')),
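For anyone who cannot rerun the extraction, an equivalent pandas-side cleanup should work too. This is a sketch assuming the invisible character really is U+00A0, the non-breaking space that CHAR(160) denotes:
import pandas as pd

df = pd.read_csv("file.csv", encoding="ISO-8859-1")
# '\xa0' is the non-breaking space (CHAR(160)); strip it along with
# ordinary spaces, swap the decimal comma for a dot, then cast.
df["perequation"] = (
    df["perequation"]
    .str.replace("\xa0", "", regex=False)
    .str.replace(" ", "", regex=False)
    .str.replace(",", ".", regex=False)
    .astype(float)
)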

UnicodeEncodeError When Attempting to Print Pandas DataFrame Created With Query in Python 3

I have searched and searched. I can't exactly find an issue quite like mine. I did try.
I have read Parquet data into a pandas dataframe and used a .query statement to filter the data.
import pandas as pd
import fastparquet as fp

fieldsToInclude = ['ACCURACY', 'STATE', 'LOCATION', 'COUNTRY_CODE']
criteria = 'ACCURACY == 1.0 or COUNTRY_CODE == "AD"'
pandaParqFile = fp.ParquetFile(fn=inputPath + "World Zip Code.parquet")
newDF = pandaParqFile.to_pandas()
dataset = newDF[fieldsToInclude]
extraction = dataset.query(criteria)
with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    print(extraction)
When it prints, I get UnicodeEncodeError: 'charmap' codec can't encode character '\u0310' in position 4174: character maps to <undefined>. This is in Geany; I get a different character and position if I print from the administrator console. I'm running Windows 7. The data does have characters that are Latin, German, etc.
I'm actually seeing some special characters when I print the data to the screen using other criteria for .query, so I guess it's only certain characters? I looked up '\u0310' and that's some sort of Latin i. But I can print other Latin characters.
I've tried some suggestions for trying to resolve this with specifying encoding, but they didn't seem to work because this is a dataframe. Other questions I came across were about this error occurring when trying to open CSV files. Not what I'm experiencing here.
The zip code data is just something to work with to learn Pandas. In the future, there's no telling what kind of data will be processed by this script. I'm really looking for a solution to this problem that will prevent it from happening regardless of what kinds of characters the data will have. Simply removing the LOCATION field, which is where all of these special characters are for this particular data, isn't viable.
Has anyone seen this before? Thanks in advance.
You need to specify utf-8 as encoding format.
Try:
with pd.option_context('display.encoding', 'UTF-8', 'display.max_rows', 100, 'display.max_columns', 10):
    print(extraction)
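If changing the display option alone doesn't help, the error ultimately comes from the codec attached to stdout, so another hedged workaround (a Python 3 sketch) is to re-wrap the stream with a lossy error handler; unencodable characters then degrade to '?' instead of raising:
import io
import sys

# Keep the console's own encoding, but substitute '?' for any character
# the 'charmap' codec cannot encode instead of raising UnicodeEncodeError.
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding=sys.stdout.encoding, errors='replace')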

Python, Hex and common file signatures

I’ve got files from a system restore which have odd bits of data padded onto the front, which makes them gobbledegook when opened. I’ve got a text file of file signatures which I’ve collected, and which contains information represented like this at the moment:
Sig_MicrosoftOffice_before2007= \xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1
What I am planning on is reading the text file, using the data to identify the correct header in the data of the corrupt file, and stripping everything off before it, hopefully leaving a readable file after. I’m stuck on how best to get this data into Python in a readable format, though.
My first try was simply reading the values from the file, but as Python does, it’s representing the backslashes as the escape character. Is this the best method to achieve what I need? Do I need to think about representing the data in the text file some other way? Or maybe in a dictionary? Any help you could provide would be really appreciated.
You can decode the \xhh escapes by using the string_escape codec (Python 2) or the unicode_escape codec (Python 3, or when you have to use Unicode in Python 2):
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
'\\xD0\\xCF\\x11\\xE0\\xA1\\xB1\\x1A\\xE1'
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'.decode('string_escape')
'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'
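In Python 3 the same un-escaping takes one extra step, since decode('unicode_escape') only exists on bytes and yields str. A sketch, with restored.doc and repaired.doc as hypothetical file names:
sig_text = r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
# unicode_escape turns the literal backslash escapes into code points
# U+0000..U+00FF; latin-1 then maps those 1:1 back to raw bytes.
sig_bytes = sig_text.encode('ascii').decode('unicode_escape').encode('latin-1')
print(sig_bytes)  # b'\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1'

# Drop everything before the first occurrence of the signature.
with open('restored.doc', 'rb') as f:
    data = f.read()
start = data.find(sig_bytes)
if start != -1:
    with open('repaired.doc', 'wb') as f:
        f.write(data[start:])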

Decoding a String that Contains Encoded Characters

I have some strings that I am pasting into my script as test data. The strings come from emails that contain encoded characters, and they're throwing a SyntaxError. So far, I have not been able to find a solution to this issue. When I print repr(string), I get these strings:
'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
'Total Value for 2nd Load \xe2\x80\x93 approx. $74,300\n'
And this error pops up when I run my script:
SyntaxError: Non-ASCII character '\xe2' in file <filename> on line <line number>, but no
encoding declared; see http://www.python.org/peps/pep-0263.html
When I just print the lines containing the encoded characters I get this:
'Total Value for 2nd Load – approx. $74,300'
The data looks like this when I copy it from the email:
'Total Value for 1st Load – approx. $75,200'
'Total Value for 2nd Load – approx. $74,300'
From doing my searches, I believe it's encoded with utf-8, but I have no idea how to work with this data, given that some characters are encoded but most of them are not (maybe?). I have tried various "solutions" I have found so far, including adding # -*- coding: utf-8 -*- to the top of my script, and the script just hangs... It doesn't do anything :(
If someone could provide some information on how to deal with this scenario, that would be amazing.
I have tried decoding and encoding using string.encode() and string.decode(), with different encodings based on what I could find on Google, but that hasn't solved the problem.
I would really prefer a Python solution, because the project I'm working on requires people to paste data into a text field in a GUI, and then that data will be processed. Other solutions suggested pasting the data into something like Word or Notepad, saving it as plain text, then doing another copy/paste back from that file. This is a bit much. Does anybody know of a pythonic way of dealing with this issue?
>>> msg = 'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
>>> print msg.decode("utf-8")
Total Value for 1st Load – approx. $75,200
Make sure you use something like IDLE that can display these characters (e.g. the DOS terminal probably will not!).
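For reference, the Python 3 equivalent (a sketch) keeps the raw data as a bytes literal and decodes explicitly; the PEP 263 coding declaration only matters for literals in the source file, which must then actually be saved as UTF-8:
# Python 3: decode the raw bytes explicitly.
msg = b'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
print(msg.decode('utf-8'))  # Total Value for 1st Load – approx. $75,200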
