I have a dataset of tweets/threads that I need to process, along with some separate annotation files. These annotation files consist of spans, represented by indexes that correspond to a word or sentence. The indexes are, as you may have predicted, the positions of the characters in the tweet/thread files.
The problem arises when I process the files with some emojis in them. To go with a specific example:
This is a part of the file in question (download):
TeamKhabib 😂😂😂 #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 #McTapper xxxxx://x.xx/xxxxxxxxxx
mmafan1709 #TeamKhabib #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 Conor is Khabib hardest fight and Khabib is Conors hardest fight
I read the file in Python with the plain open function, passing encoding='utf-8':
with open('028_948124816611139589.branch318.txt.username_text_tabseparated', 'r', encoding='utf-8') as f:
    content = f.read()

print(content[211:214])
An annotation says the span 211-214 contains the word 'and'. Reading the file the way I show above, that slice gives ' kh'.
When I use the indexes from the annotation files to extract the spanned string, the string I get is 3 characters off (to the right), because in the annotations each 😂 apparently takes up 2 positions, whereas when Python reads it, it is one character, hence the shift. It becomes much more obvious when I take the length of the file with len(list(file.read())): this returns 7809, while the actual length of the file is 7812. (7812 is the position I get at the end of the file in VS Code, from a plugin called vscode-position.) Another file gives me an inconsistency of 513 vs. 527.
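For what it's worth, here is a quick check that reproduces the 7809 vs. 7812 mismatch by counting every character above U+FFFF (such as 😂) as two positions; this is only an illustration, and the second count is just my guess at how the plugin/annotators count:
with open('028_948124816611139589.branch318.txt.username_text_tabseparated', 'r', encoding='utf-8') as f:
    text = f.read()

print(len(text))                                       # 7809: code points, as Python 3 counts them
print(sum(2 if ord(c) > 0xFFFF else 1 for c in text))  # 7812, if the 3-character shift comes from three such emojis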
I have no problem reading the emojis themselves; I see them in my output/array. However, the space they take up in the encoding is different. My question is not answered by the other related questions.
Obviously there must be a correct way to read this file: it was read/created with some format/method/encoding that the plugin and the annotators agree on, but open().read() does not.
I am using Python 3.8.
What am I missing here?
After discussion, I believe the issue is that the spans were computed from Unicode strings that use surrogate pairs for code points above U+FFFF. Python 2 and other languages such as Java and C# store Unicode strings as UTF-16 code units, rather than as abstract code points like Python 3 does. If I treat the test data as UTF-16LE-encoded, the answer comes out:
# Important to note that the original file has two tabs in it that SO doesn't display:
# * between the first "TeamKhabib" and the smiley
# * between "mmafan1709" and "#TeamKhabib"
# Use the download link while it is valid.
with open('test.txt', 'r', encoding='utf-8') as f:
    content = f.read()

b = content.encode('utf-16le')
print(b[211 * 2:214 * 2].decode('utf-16le'))
# result: and
The offsets need to be doubled because each UTF-16 code unit is two bytes; the result then has to be decoded to display it correctly.
I specifically used utf-16le vs. utf-16 because the latter will add a BOM and throw off the count another two bytes (or one code unit).
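Alternatively, one could build a lookup table from UTF-16 code-unit offsets to Python string indexes once and slice the original str directly. This is just a sketch of the same idea without re-encoding the whole file (the helper name is mine):
def utf16_unit_to_index(text):
    # Map each UTF-16 code-unit offset to the index of the character that contains it.
    mapping = []
    for i, ch in enumerate(text):
        mapping.append(i)
        if ord(ch) > 0xFFFF:     # non-BMP characters such as 😂 occupy two code units
            mapping.append(i)
    mapping.append(len(text))    # allow spans that end at the very end of the text
    return mapping

m = utf16_unit_to_index(content)
print(content[m[211]:m[214]])    # and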
Related
I have encountered a weird problem and could not solve it for days. I created a byte array containing the values from 1 to 250 and wrote it to a binary file from C# using WriteAllBytes.
Later I read it back in Python using np.fromfile(filename, dtype=np.ubyte). However, I realized this function appeared to be adding arbitrary commas. Interestingly, they are not visible when inspecting the array itself, and if I call numpy.array2string, the comma turns into '\n'. One solution would be to replace the commas with nothing, but I have very long sequences and using a replace function on 100 GB of data would take forever. I also rechecked the files by reading them with .NET Core, and I am quite sure the comma is not there.
What could I be missing?
Edit:
I was trying to read all the byte values into an array and cast each member (or the entire array) to a string. I found out that the most reliable way to do this is:
list(map(str, ubyte_array))
The code above returns a list of strings whose elements contain no arbitrary commas or blank spaces.
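For reference, this is roughly how I use it; the file name and the array below are made up for illustration, not my actual data:
import numpy as np

values = np.arange(1, 251, dtype=np.ubyte)   # 1..250, the same values I write from C#
values.tofile("bytes.bin")                   # raw bytes, like WriteAllBytes produces

ubyte_array = np.fromfile("bytes.bin", dtype=np.ubyte)
print(list(map(str, ubyte_array))[:5])       # ['1', '2', '3', '4', '5'], no commas or blanks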
I have been logging thousands of messages for a data science project into a csv file. Many of these messages contain emojis or non-English characters, so when I opened the csv file in Excel, these already appeared in a garbled form (e.g. the red heart emoji ❤️ showed up as â¤ï¸). This didn't disturb me much in the beginning, as I only needed the csv to store the data I periodically analyzed, and when reading the csv file with Python I didn't notice any data corruption.
(However, I made an apparently huge mistake a couple of days ago: I ran into an error when reading the csv file, so I set the engine argument of pd.read_csv to 'python', and I believe this set it all off: every time I re-ran the script that updates the csv, all the text data got re-encoded, possibly in utf-8 instead of the csv's original windows-1252.) Edit: I realized, thanks to Tomalak's comments below, that the real problem wasn't this modification but my manually editing the csv file in Excel a number of times along the way.
The older the csv entries are, the more the repeated encoding and re-encoding affected them: for the newest entries there is no issue, but for the oldest ones a single heart emoji now appears in the csv as:
���
I found numerous entries in the csv file where I could apply the .encode('windows-1252').decode('utf-8') pattern 3-6 times (depending on how old the given entry is and therefore how many times it got re-encoded) and obtain a favorable outcome, such as:
😞 stands for the sad/disappointed face emoji (😞). Applying the encoding-decoding pattern four times returned \U0001f61e, which is good enough for me; I can easily use the unicodedata library's excellent conversion method to obtain the corresponding unicodedata.name. I believe that's how I should be storing emojis from now on...
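For example, that conversion is just a standard-library call (nothing specific to my data):
import unicodedata

emoji = "\U0001f61e"
print(unicodedata.name(emoji))                  # DISAPPOINTED FACE
print(unicodedata.lookup("DISAPPOINTED FACE"))  # 😞, converting back the other way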
My understanding was that I cannot overdo applying the above-mentioned encode-decode pattern: if one string needs only three rounds while the next cell needs six, I could just do something like this (yes, I know iterrows() is a terribly inefficient approach, but it's just for the example):
for idx, _ in df.iterrows():
    tmp = df.loc[idx, 'text']
    for _ in range(6):
        tmp = tmp.encode("windows-1252").decode("utf-8")
    df.loc[idx, 'text'] = tmp
The problem, however, is that there are still quite a lot of entries where the above solution doesn't work. Let's just consider the above-mentioned encoded string which stands for a red heart:
���
Applying .encode("windows-1252").decode("utf-8") three times yields ���, but when attempting to apply the pattern a fourth time, I get: UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 1: character maps to <undefined>. My hunch is that not all the strings were encoded with windows-1252...?
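The defensive version of the loop would presumably look something like this (just a sketch; it simply leaves the entries that already contain U+FFFD untouched, which is exactly my problem):
def undo_mojibake(text, max_rounds=10):
    # Keep re-applying the cp1252 -> utf-8 round trip until it no longer works.
    for _ in range(max_rounds):
        try:
            text = text.encode("windows-1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break
    return text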
Is there any hope to get back my data in an uncorrupted format?
I have some csv data of some users' tweets.
In Excel it is displayed like this:
‰ÛÏIt felt like they were my friends and I was living the story with them‰Û #retired #IAN1
I imported this csv file into Python, and there the same tweet appears like this (I am using PuTTY to connect to a server, and I copied this from PuTTY's screen):
▒▒▒It felt like they were my friends and I was living the story with them▒ #retired #IAN1
I am wondering how to display these emoji characters properly. I am trying to separate all the words in this tweet, but I am not sure how I can separate those emoji Unicode characters.
In fact, you certainly have a loss of data…
I don't know how you obtained your CSV file from the users' tweets (you should explain that). But generally, CSV files are encoded in "cp1252" (or "windows-1252"), sometimes in "iso-8859-1". Nowadays, we can also find CSV files encoded in "utf-8".
If your tweets are encoded in "cp1252" or any other 8-bit single-byte character set, the emojis are lost (replaced by "?") or badly converted.
Then, if you open your CSV file in Excel, it will use its default encoding ("cp1252") and load the file with corrupted characters. You can try LibreOffice; it has a dialog box which lets you choose the encoding more easily.
Copy/pasting from PuTTY will also convert your characters, depending on your console encoding… which is even worse!
If your CSV file uses "utf-8" encoding (or "utf-16", "utf-32"), you have a better chance of preserving the emojis. But there is still a problem: most emojis have a code point greater than U+FFFF (65535 in decimal). For instance, Grinning Face "😀" has the code point U+1F600.
This kind of character is badly handled in Python 2; try this:
# coding: utf8
from __future__ import unicode_literals
emoji = u"😀"
print(u"emoji: " + emoji)
print(u"repr: " + repr(emoji))
print(u"len: {}".format(len(emoji)))
You’ll get (if your console allow it):
emoji: 😀
repr: u'\U0001f600'
len: 2
The first line won't print if your console doesn't allow Unicode.
The \U escape sequence is similar to \u, but expects 8 hex digits, not 4.
Yes, this character has a length of 2!
EDIT: With Python 3, you get:
emoji: 😀
repr: '😀'
len: 1
There is no escape sequence in the repr(), and the length is 1!
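If you need a portable code-point count despite this (for instance on a Python 2 narrow build), one workaround is to count through a fixed-width encoding, for example:
# coding: utf8
from __future__ import unicode_literals

emoji = "😀"
print(len(emoji))                           # 2 on a narrow Python 2 build (UTF-16 code units)
print(len(emoji.encode("utf-32-le")) // 4)  # 1: code points, regardless of build or version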
What you can do is post a fragment of your CSV file as an attachment; then someone could analyse it…
See also Unicode Literals in Python Source Code in the Python 2.7 documentation.
First of all, you shouldn't work with text copied from a console (let alone over a remote connection) because of formatting differences and how unreliable clipboards are. I'd suggest exporting your CSV and reading it directly.
I'm not quite sure what you are trying to do, but Twitter emojis cannot be displayed in a console, since they are basically compressed images. Would you mind explaining your issue further?
I would personally treat the whole string as Unicode, separate each character into a list, then rebuild the words based on spaces.
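A rough sketch of that idea (Python 3, where each character, emojis included, is a single element; the sample text is made up):
text = "It felt like they were my friends and I was living the story with them 😀 #retired #IAN1"
chars = list(text)              # one list element per character, emojis included
words = "".join(chars).split()  # rebuild the words on whitespace
print(words)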
I have a string that looks like this:
'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
I have been able to match the data I want, which is "12102 G1103543", with this:
re.findall('\x10\x42(.*)\x10\x54', data)
Which outputs this:
'\x00\x0e12102 G1103543'
The problem I'm having is that \x10\x54 is not always at the end of the data I want. However, what I have noticed is that the first two bytes correspond to the length of the data, e.g. \x00\x0e = 14, so the data is 14 characters long.
Is there a better way to do this, like matching the first part and then cutting off the next 14 characters? I should also say that the length will vary, as I'm looking to match several things.
Also, is there a way to output the string entirely in hex so it's easier for me to read when working in a Python shell, i.e. so that \x10B shows as \x10\x42?
Thank You!
Edit: I managed to come up with this working solution:
newdata = re.findall('\x10\x42(.*)', data)
newdata[0][2:2 + int(newdata[0][0:2].encode('hex'), 16)]  # skip the two length bytes, then take that many characters
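As for displaying the whole string in hex (my other question above), something like this seems to do the job (Python 2 syntax, to match the rest):
import binascii

data = '\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
print(binascii.hexlify(data))                              # 00031042000e3132... as one long hex string
print(' '.join('\\x{:02x}'.format(ord(c)) for c in data))  # \x00 \x03 \x10 \x42 \x00 \x0e \x31 ...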
Please note that you have a structured binary file on your hands, and it is foolish to try to use regular expressions to extract data from it.
First of all, the "hex data" you talk about is not "hex data": it is just bytes in your stream that fall outside the printable ASCII range, so Python 2 displays such a character as \x10 and so on, but internally it is just a single byte with the value 16 (in decimal). The \x42 you write corresponds to the ASCII letter B, and that is why you see B in your representation.
So your best bet there would be to get the file specification, and read the data you want from there using the struct module and byte-string slicing.
If you can't get the file spec, then it is reverse-engineering work to find the fields of interest, just as you are already doing. But even then, you should write some code with the struct module to get your values, since field lengths (and most likely offsets) are encoded in the byte stream itself.
In this example, your marker "\x10\x42" will rarely be a marker per se; most likely its position is indicated by other factors in the file (either a fixed place in the file definition, or an offset given earlier in the file).
But if you are correct in using it as a marker, you can use regular expressions just to find all offsets of the "\x10\x42" marker, as you are doing, and then interpret the following two bytes as the message length:
import struct, re

def get_data(data, sep=b"\x10B"):
    results = []
    for match in re.finditer(sep, data):
        offset = match.start()
        # the two bytes after the marker hold the message length, as a big-endian unsigned short
        msglen = struct.unpack(">H", data[offset + 2: offset + 4])[0]
        print(msglen)
        results.append(data[offset + 4: offset + 4 + msglen])
    return results
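Applied to the sample string from the question (as a byte string), this would print the length it found and return the extracted field:
sample = b'\x00\x03\x10B\x00\x0e12102 G1103543\x10T\x07\x21'
print(get_data(sample))
# prints 14, then [b'12102 G1103543']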
I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii

path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()

print binascii.hexlify(line)
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it were a homogeneous sequence of a single datatype, you'd still need to know what you are dealing with; flipping the byte order of double values is very different from flipping that of short integers.
Take a look at the format characters table; anything with a different byte size in it results in a different set of bytes being swapped; for double values, for example, you need to reverse the order of every group of 8 bytes.
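To make that concrete, here is a small illustration (with a made-up value, not data from your file) of how the swap depends on the element size:
import struct

be = struct.pack(">d", 1.5)   # 8 bytes of a big-endian double

as_double_swap = be[::-1]                                            # one 8-byte reversal
as_short_swap = b"".join(be[i:i + 2][::-1] for i in range(0, 8, 2))  # four 2-byte reversals

print(struct.unpack("<d", as_double_swap)[0])  # 1.5: the correct swap for a double
print(struct.unpack("<d", as_short_swap)[0])   # a tiny garbage value: the wrong swap for this type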
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the given bytes. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, it is best not to start, as the task is impossible.