I have a dataset of tweets/threads that I need to process, along with some separate annotation files. These annotation files consist of spans represented by indexes that correspond to a word/sentence. The indexes are, as you may have predicted, the positions of the characters in the tweet/thread files.
The problem arises when I process files that have emojis in them. To take a specific example:
This is part of the file in question (download):
TeamKhabib 😂😂😂 #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 #McTapper xxxxx://x.xx/xxxxxxxxxx
mmafan1709 #TeamKhabib #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 Conor is Khabib hardest fight and Khabib is Conors hardest fight
I read the file in Python with the plain open function, with the parameter encoding='utf-8':
with open('028_948124816611139589.branch318.txt.username_text_tabseparated', 'r', encoding='utf-8') as f:
    content = f.read()

print(content[211:214])
An annotation says the word 'and' occupies the span 211-214. Read the way I show above, that span contains ' kh'.
When I use the indexes in the annotation files to get the spanned string, the string I get is 3 characters off (to the right). That is because, in the annotations, each 😂 apparently takes up 2 positions, whereas when Python reads the file it takes up one, hence the shift. It becomes much more obvious when I get the length of the file with len(list(file.read())): this returns 7809, while the actual length of the file is 7812. 7812 is the position I get at the end of the file in VS Code with a plugin called vscode-position. Another file gives me an inconsistency of 513 versus 527.
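A minimal way to see the per-emoji off-by-one (a sketch I am adding for illustration): each 😂 is a single code point in a Python 3 str but two code units when encoded as UTF-16.

s = '😂'
print(len(s))                          # 1: Python 3 counts code points
print(len(s.encode('utf-16le')) // 2)  # 2: UTF-16 counts code units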
I have no problem reading the emojis; I see them in my output/array. However, the space they take up in the counting is different. My question is not answered in the other relevant questions I have found.
Obviously, there is some way of reading this file consistently, as these files were read/created with some format/method/concept/encoding that the plugin and the annotators agree on, but open().read() does not.
I am using Python 3.8.
What am I missing here?
After discussion, I believe the issue is that the spans were computed from Unicode strings that used surrogate pairs for code points above U+FFFF. Python 2 and other languages like Java and C# store Unicode strings as UTF-16 code units rather than the abstracted code points Python 3 uses. If I treat the test data as UTF-16LE-encoded, the answer comes out:
# Important to note that the original file has two tabs in it that SO doesn't display:
# * between the first "TeamKhabib" and the smiley
# * between "mmafan1709" and "#TeamKhabib"
# Use the download link while it is valid.
with open('test.txt', 'r', encoding='utf-8') as f:
    content = f.read()

b = content.encode('utf-16le')
print(b[211 * 2:214 * 2].decode('utf-16le'))
# result: and
The offsets need to be doubled because each UTF-16 code unit is two bytes; the result must then be decoded to display it correctly.
I specifically used utf-16le rather than utf-16 because the latter adds a BOM and throws the count off by another two bytes (or one code unit).
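If re-encoding the whole file feels heavy, the same correction can be computed directly on the Python string. A sketch under the same assumption (the annotation offsets count UTF-16 code units); the helper name is mine:

def utf16_offset_to_index(text, offset):
    # Walk the string, counting two UTF-16 code units for any
    # code point above U+FFFF and one for everything else.
    units = 0
    for i, ch in enumerate(text):
        if units >= offset:
            return i
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)

start = utf16_offset_to_index(content, 211)
end = utf16_offset_to_index(content, 214)
print(content[start:end])  # and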
In my dataframe, I have a variable that should be a number but is currently recognized as a string, with a space as the thousands separator, for example: "5 948.5".
Before I can convert it to a float, I need to remove that space:
import pandas as pd

d = {'col1': [1, 2], 'numbers': [' 4 856.4', '5 000.5']}
data = pd.DataFrame(data=d)
data['numbers'] = data['numbers'].str.replace(" ", "")
This works perfectly.
But when I do the exact same thing to my series, nothing happens (no error message, but the spaces remain). Other manipulations to that series work normally.
Any idea of what I can try to understand and fix the problem on my series?
Thanks!
Edit:
I've loaded the data with
pd.read_csv("file.csv", encoding="ISO-8859-1")
Could that be responsible for the immovable spaces? If I did not pass that encoding, I'd get an error when loading: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 2: invalid continuation byte".
I have tried calling read_csv with encoding='latin1' and encoding='cp1252'; the problem remains.
Edit 1.b: it seems to be an issue with the encoding of the space (thanks @Marat). I downloaded an Excel version of the data and tried to replace all spaces in that column with nothing. It did not work. Removing a few spaces manually did work (but the file is too large to fix this way).
Edit 2: sample data. It really looks like the example I gave above, which works... but mine really doesn't. I know nobody can reproduce this on their own computer; I am not asking for the solution, but rather for ideas of what could be wrong...
As requested, here is a copy-to-clipboard of my data:
GroupeRA,SecteurMA,StatutMA,TypeMA,fields_ancienMatricule,historizedFields_denomination,Annee,region,fields_codeComiteSubregional,arrondissement,fields_ins,adresse_commune,perequation
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2017,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"0,00 "
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"0,00 "
Crèches,MASS,Collectif,CREC,x,Le xyzI,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"1 302,26 "
Crèches,MASS,Collectif,CREC,632100101,Le Bocage I,2018,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"687,56 "
Crèches,MASS,Collectif,CREC,632100101,xyz,2019,RBC,BX,Bruxelles-capitale,21001,Anderlecht,"1 372,91 "
Edit 3: the data is in CSV (though, as mentioned in edit 1.b, I also got the data in XLS and have the same issue there; even when opening the XLS directly, "find & replace all" cannot find the spaces, as if Excel did not read them as such).
I used DbVisualizer to extract the data from our database.
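Given the sample above, one diagnostic worth trying (a sketch; the file name is illustrative, the column name is taken from the sample) is to print the code point of every character in a problem value: a keyboard space shows up as 0x20, while a non-breaking space shows up as 0xa0.

import pandas as pd

data = pd.read_csv('file.csv', encoding='ISO-8859-1')  # file name is illustrative
# A normal space prints as '0x20'; a non-breaking space prints as '0xa0'.
print([hex(ord(c)) for c in data['perequation'].iloc[0]])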
Thanks all for your help. It was indeed an issue with the "space" character, which was not the space produced by my keyboard but a non-breaking space (CHAR(160)). It got solved with the following SQL command when extracting the data:
[perequation]= CONVERT(MONEY,REPLACE(REPLACE(ds.computedFields_perequation, CHAR(160),''), ',', '.')),
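For anyone who cannot change the SQL extraction, an equivalent cleanup on the pandas side should work; a sketch, assuming the offending character really is CHAR(160) (which pandas sees as '\xa0') and noting that the values use a decimal comma:

data['perequation'] = (
    data['perequation']
    .str.replace('\xa0', '', regex=False)  # drop non-breaking spaces (CHAR(160))
    .str.replace(',', '.', regex=False)    # decimal comma -> decimal point
    .astype(float)
)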
I have searched and searched, and I can't find an issue quite like mine. I did try.
I have read Parquet data into a pandas dataframe and used a .query statement to filter the data.
import pandas as pd
import fastparquet as fp

fieldsToInclude = ['ACCURACY', 'STATE', 'LOCATION', 'COUNTRY_CODE']
criteria = 'ACCURACY == 1.0 or COUNTRY_CODE == "AD"'

pandaParqFile = fp.ParquetFile(fn=inputPath + "World Zip Code.parquet")
newDF = pandaParqFile.to_pandas()
dataset = newDF[fieldsToInclude]
extraction = dataset.query(criteria)

with pd.option_context('display.max_rows', 100, 'display.max_columns', 10):
    print(extraction)
When it prints, I get UnicodeEncodeError: 'charmap' codec can't encode character '\u0310' in position 4174: character maps to <undefined>. This is in Geany; I get a different character and position if I print from the administrator console. I'm running Windows 7. The data does have Latin, German, etc. characters in it.
I'm actually seeing some special characters when I print the data to the screen using other criteria for .query, so I guess it's only certain characters? I looked up '\u0310' and that's some sort of Latin i mark. But I can print other Latin characters.
I've tried some suggestions for resolving this by specifying an encoding, but they didn't seem to apply because this is a dataframe. Other questions I came across were about this error occurring when opening CSV files, which is not what I'm experiencing here.
The zip code data is just something to work with while learning pandas. In the future, there's no telling what kind of data this script will process, so I'm really looking for a solution that prevents the problem regardless of what characters the data contains. Simply removing the LOCATION field, which is where all of these special characters are in this particular data, isn't viable.
Has anyone seen this before? Thanks in advance.
You need to specify UTF-8 as the display encoding format.
Try:
with pd.option_context('display.encoding', 'UTF-8', 'display.max_rows', 100, 'display.max_columns', 10):
    print(extraction)
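If the console itself cannot render the characters, another common workaround (my addition, not part of the answer above; it requires Python 3.7+) is to re-encode stdout so unprintable characters are replaced instead of raising:

import sys

# Replace anything the console cannot display instead of raising UnicodeEncodeError.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')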
I've got files from a system restore which have odd bits of data padded onto the front, which makes them gobbledegook when opened. I've got a text file of file signatures which I've collected, containing information represented like this at the moment:
Sig_MicrosoftOffice_before2007= \xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1
What I am planning is to read the text file and use the data to identify the correct header in the corrupt file's data, then strip off everything before it, hopefully leaving a readable file. I'm stuck on how best to get this data into Python in a usable format, though.
My first try was simply reading the values from the file, but Python then treats the backslashes as literal characters rather than as escape sequences. Is this the best method to achieve what I need? Do I need to represent the data in the text file some other way? Or maybe in a dictionary? Any help you could provide would be really appreciated.
You can decode the \xhh escapes by using the string_escape codec (Python 2) or the unicode_escape codec (Python 3, or when you have to use Unicode in Python 2):
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
'\\xD0\\xCF\\x11\\xE0\\xA1\\xB1\\x1A\\xE1'
>>> r'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'.decode('string_escape')
'\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1'
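In Python 3, where str objects have no .decode method, the same idea works via an encode/decode round trip. A sketch of that plus the planned header-stripping; the helper names and file name are mine, for illustration only:

def parse_signature(line):
    name, _, escaped = line.partition('=')
    # unicode_escape turns each literal \xhh into code point U+00hh;
    # latin-1 then maps those code points back to the raw bytes.
    sig = escaped.strip().encode('ascii').decode('unicode_escape').encode('latin-1')
    return name.strip(), sig

def strip_before_header(data, sig):
    # Drop everything in front of the first occurrence of the signature.
    pos = data.find(sig)
    return data[pos:] if pos != -1 else data

name, sig = parse_signature(r'Sig_MicrosoftOffice_before2007= \xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1')
with open('restored_file.doc', 'rb') as f:  # file name is illustrative
    cleaned = strip_before_header(f.read(), sig)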
I have some strings that I am pasting into my script as test data. The strings come from emails that contain encoded characters, and it's throwing a SyntaxError. So far, I have not been able to find a solution to this issue. When I print repr(string), I get these strings:
'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
'Total Value for 2nd Load \xe2\x80\x93 approx. $74,300\n'
And this error pops up when I run my script:
SyntaxError: Non-ASCII character '\xe2' in file <filename> on line <line number>, but no encoding declared; see http://www.python.org/peps/pep-0263.html
When I just print the lines containing the encoded characters I get this:
'Total Value for 2nd Load – approx. $74,300'
The data looks like this when I copy it from the email:
'Total Value for 1st Load – approx. $75,200'
'Total Value for 2nd Load – approx. $74,300'
From my searches, I believe it's encoded as UTF-8, but I have no idea how to work with this data, given that some characters are encoded but most of them are not (maybe?). I have tried various "solutions" I've found so far, including adding # -*- coding: utf-8 -*- to the top of my script, and the script just hangs... It doesn't do anything :(
If someone could provide some information on how to deal with this scenario, that would be amazing.
I have tried decoding and encoding using string.encode() and string.decode() with different encodings based on what I could find on Google, but that hasn't solved the problem.
I would really prefer a Python solution, because the project I'm working on requires people to paste data into a text field in a GUI, and then that data will be processed. Other suggestions involved pasting the data into something like Word or Notepad, saving it as plain text, then doing another copy/paste back from that file. This is a bit much. Does anybody know of a pythonic way of dealing with this issue?
>>> msg = 'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
>>> print msg.decode("utf-8")
Total Value for 1st Load – approx. $75,200
Make sure you use something like IDLE that can support these characters (a DOS terminal probably will not!).
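For completeness: the SyntaxError itself comes from PEP 263, because Python 2 refuses non-ASCII bytes in source code unless the file declares its encoding on its first or second line. A minimal sketch combining the declaration with the decode shown above (assuming the pasted text really is UTF-8):

# -*- coding: utf-8 -*-
# Python 2: the byte string holds the UTF-8 encoding of an en dash (U+2013).
msg = 'Total Value for 1st Load \xe2\x80\x93 approx. $75,200\n'
print msg.decode("utf-8")  # Total Value for 1st Load – approx. $75,200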