Python np.fromfile() adding arbitrary random comma when reading from binary file

I have run into a weird problem and could not solve it for days. I created a byte array containing the values 1 to 250 and wrote it to a binary file from C# using WriteAllBytes.
Later I read it from Python using np.fromfile(filename, dtype=np.ubyte). However, I realized this function was adding arbitrary commas (see the image). Interestingly, they are not visible in the array properties, and if I call numpy.array2string, the commas turn into '\n'. One solution would be to replace the commas with nothing, but I have very long sequences and using a replace function on 100 GB of data would take forever. I also rechecked the files by reading them with .NET Core, and I'm quite sure the comma is not there.
What could I be missing?
Edit:
I was trying to read all byte values into an array and cast each member (or the entire array) to a string. I found out that the most reliable way to do this is:
list(map(str, ubyte_array))
The code above returns a list of strings whose elements contain no arbitrary commas or blank spaces.
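For completeness, the commas and newlines only appear in NumPy's printed representation of the array, not in the underlying data. A minimal check (the filename below is a placeholder):
import numpy as np

arr = np.fromfile("data.bin", dtype=np.ubyte)   # placeholder filename
print(arr[:10])                   # NumPy's printed repr adds separators for display only
print(list(map(str, arr[:10])))   # clean per-element strings, as in the edit above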

Related

Emoji reading discrepancy between different applications

I have a bunch of tweet/thread datasets that I need to process, along with some separate annotation files. These annotation files consist of spans represented by indexes that correspond to a word/sentence. The indexes are, as you may have predicted, the positions of the characters in the tweet/thread files.
The problem arises when I process the files with some emojis in them. To go with a specific example:
This is a part of the file in question (download):
TeamKhabib 😂😂😂 #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 #McTapper xxxxx://x.xx/xxxxxxxxxx
mmafan1709 #TeamKhabib #danawhite #seanshelby #arielhelwani #AliAbdelaziz00 Conor is Khabib hardest fight and Khabib is Conors hardest fight
I read the file in Python with the plain open function, with the parameter encoding='utf-8':
with open('028_948124816611139589.branch318.txt.username_text_tabseparated', 'r', encoding='utf-8') as f:
    content = f.read()
print(content[211:214])
An annotation says the word 'and' is in the span 211-214. With the file read the way I mention above, that span contains ' kh'.
When I use the indexes in the annotation files to get the spanned string, the string I am getting is 3 chars off (to the right). This is because, in the annotations, each 😂 apparently takes up 2 positions, whereas when Python reads them it takes up one, hence the character shift. It becomes much more obvious when I get the length of the file with len(list(file.read())). This returns 7809, while the actual length of the file is 7812. 7812 is the position I get at the end of the file in VS Code, using a plugin called vscode-position. Another file gives me an inconsistency of 513 versus 527.
I have no problem reading the emojis; I see them in my output/array. However, the space they take up in the encoding is different. My question is not answered in the other relevant questions.
Obviously, there is a point in reading this file, as these files were read/created with some format/method/concept/encoding/whatever that this plugin and the annotators agree on, but open/read does not.
I am using python 3.8.
What am I missing here?
I believe, after discussion, that the issue is that the spans were computed from Unicode strings that use surrogate pairs for Unicode code points above U+FFFF. Python 2 and other languages like Java and C# store Unicode strings as UTF-16 code units instead of abstract code points like Python 3. If I treat the test data as UTF-16LE-encoded, the answer comes out:
# Important to note that the original file has two tabs in it that SO doesn't display:
# * between the first "TeamKhabib" and the smiley
# * between "mmafan1709" and "#TeamKhabib"
# Use the download link while it is valid.
with open('test.txt', 'r', encoding='utf-8') as f:
    content = f.read()
b = content.encode('utf-16le')
print(b[211 * 2:214 * 2].decode('utf-16le'))
# result: and
The offsets need to be doubled because each UTF-16 code unit is two bytes, and then the result must be decoded to display it correctly.
I specifically used utf-16le rather than utf-16 because the latter adds a BOM and would throw off the count by another two bytes (one code unit).
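The code-unit arithmetic is easy to verify on its own; a small sketch of the counting difference the spans rely on:
ch = '😂'
print(len(ch))                           # 1: Python 3 counts code points
print(len(ch.encode('utf-16le')) // 2)   # 2: UTF-16 stores it as a surrogate pair
print(len(ch.encode('utf-8')))           # 4: bytes in UTF-8, for comparison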

How to convert String (containing a table of numbers without comma delimiter) into Array in Python

I have a CSV file that I load with pd.read_csv. One of the columns is a variable with a String datatype, but it actually contains a table of numbers (like a 2D array) without comma delimiters.
I would like to convert it into an array. I tried the eval() function, but it gives an error (as can be seen in the following image).
If you have any idea how to solve this issue, please let me know.
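A minimal sketch of one way to do this, assuming each cell holds whitespace-separated rows of numbers such as "1 2 3\n4 5 6" (the file name and column name below are made up):
import io
import numpy as np
import pandas as pd

df = pd.read_csv("data.csv")                 # hypothetical file
df["table_col"] = df["table_col"].apply(     # hypothetical column name
    lambda s: np.loadtxt(io.StringIO(s)))    # parse the whitespace-separated numbers into an array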

Python Matplotlib ValueError

Hi,
how do I plot the attached DataFrame in Python? I am looking for a multiple-series line graph.
Any help will be much appreciated.
Error: ValueError: could not convert string to float
Thanks
Your problem here is that the % signs in your CSV file are making Pandas read each value as a string object rather than as a float.
The best option for resolving this would probably be to not have extraneous characters like % everywhere in your CSV file. Instead, it would probably make more sense to list units in your column names, or elsewhere in descriptions.
However, in this case it can also be fixed afterwards by removing the extraneous characters and converting manually, e.g., for a DataFrame a (using .loc here, since the older .ix indexer has been removed from current pandas):
a.loc[:, a.dtypes == object] = a.loc[:, a.dtypes == object].applymap(lambda x: float(x[:-1]))
This will work for your specific case, where a single % at the end is consistently the offending character:
The indexing selects all columns of dtype object, which in this case are all strings whose last character is %.
The lambda function applied to each element removes the last character from the string and then converts the result to a float.
The result is then assigned back to the same columns.
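For illustration, the same idea as a self-contained sketch on made-up data (the column names are invented):
import pandas as pd

df = pd.DataFrame({"cpu": ["12%", "48%"], "mem": ["30%", "75%"]})
obj_cols = df.columns[df.dtypes == object]
df[obj_cols] = df[obj_cols].apply(lambda col: col.str.rstrip("%").astype(float))
print(df.dtypes)   # both columns are now float64, ready for df.plot()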

Speeding up (simple) text processing

I need to display data from two files (of equal size) so I can visually compare them. For this, I made a new Tk widget consisting of four Text widgets. The first widget contains characters representing bytes from the first file, the second one contains the hexadecimal values of the bytes in the left widget, and the same goes for the third and fourth ones respectively (containing data/hex values for the second file). The input data to be displayed are two bytearrays.
To fill the Text widgets, I have to process the input data (bytearrays), because
I have to get rid of unprintable characters and some characters that caused misalignment of the respective lines in the four widgets,
I have to fill the second/fourth widgets with hex values of the bytes, therefore I have to convert the byte values to hex numbers.
The code I use implements the functionality described, and it works quite well for small files (several hundred kilobytes). However, when I try to load bigger files (several megabytes), the time it takes to process and load the data is unacceptable (tens of seconds).
An example of my widget for displaying the data can be seen here:
To process the input data, I use the following code. _ldata and _rdata are bytearrays with the input data, ldata and rdata are strings to be loaded in the first and third Text widgets, lhexdata and rhexdata are strings with the hexadecimal values to be loaded in the second and fourth Text widget. wrap is an integer determining how many bytes will be represented on one line. The print_chars function replaces all the characters that caused misalignment or couldn't be selected in the Text widgets.
def print_chars(self, byte):
    if (byte < 0x20 or
            (byte > 0x7E and byte < 0xB1)):
        return 0x07
    else:
        return byte
...
ldata = "\n".join(["".join(map(chr,
                               map(self.print_chars, _ldata[i:i+wrap])))
                   for i in range(0, len(_ldata), wrap)])
rdata = "\n".join(["".join(map(chr,
                               map(self.print_chars, _rdata[i:i+wrap])))
                   for i in range(0, len(_rdata), wrap)])
lhexdata = "\n".join([" ".join(map("{0:02X}".format, _ldata[i:i+wrap]))
                      for i in range(0, len(_ldata), wrap)])
rhexdata = "\n".join([" ".join(map("{0:02X}".format, _rdata[i:i+wrap]))
                      for i in range(0, len(_rdata), wrap)])
I think there is a way to speed things up, but I can't figure one out. Before I implemented the list comprehensions, I used for loops for the data processing, and it was a real pain in the neck even for very short inputs. The list comprehensions were a big improvement in performance, yet not sufficient. Thanks for any advice.
I think your first two lines can be improved by using bytearray.translate with an appropriate translation table rather than using your own escaping and converting system. Then you can turn it into a string with bytearray.decode. You still need an additional step to split the text into lines and recombine it, but I suspect that it will be faster if you've done the translation work already.
table = bytearray.maketrans(bytes(range(0x20)) + bytes(range(0x7f, 0xb1)),
                            b"\x07" * (0x20 + 0xb1 - 0x7f))
ldata_string = _ldata.translate(table).decode("latin-1")  # pick some 8-bit encoding
ldata = "\n".join(ldata_string[i:i+wrap] for i in range(0, len(ldata_string), wrap))
You can do something similar for the hex output, using the b16encode function from the base64 module to convert to hex, then decode to make the bytes output into a string. The splitting and rejoining gets a bit more complicated due to the need for spaces between each pair of hex digits, but I suspect it will still be faster than encoding each byte separately.
import base64
lhexdata_string = base64.b16encode(_ldata).decode("ascii")  # hex will always be ASCII
lhexdata = "\n".join(" ".join(lhexdata_string[j+i:j+i+2] for i in range(0, 2*wrap, 2))
                     for j in range(0, len(lhexdata_string), 2*wrap))
Note that the code above assumes that you're using Python 3. If you're using Python 2 you'll need to change a few things (such as working around the lack of maketrans and not needing to decode).
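As a quick sanity check of the approach, here is a tiny self-contained run on made-up input with a wrap of 4 bytes:
import base64

_ldata = bytearray(b"\x01Hi\x7f\xffOK!")
wrap = 4

table = bytearray.maketrans(bytes(range(0x20)) + bytes(range(0x7f, 0xb1)),
                            b"\x07" * (0x20 + 0xb1 - 0x7f))
ldata_string = _ldata.translate(table).decode("latin-1")
ldata = "\n".join(ldata_string[i:i+wrap] for i in range(0, len(ldata_string), wrap))

lhexdata_string = base64.b16encode(_ldata).decode("ascii")
lhexdata = "\n".join(" ".join(lhexdata_string[j+i:j+i+2] for i in range(0, 2*wrap, 2))
                     for j in range(0, len(lhexdata_string), 2*wrap))

print(repr(ldata))   # control bytes replaced by '\x07', split into 4-character lines
print(lhexdata)      # 01 48 69 7F / FF 4F 4B 21, on two lines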

Using struct.unpack() without knowing anything about the string

I need to parse a big-endian binary file and convert it to little-endian. However, the people who have handed the file over to me seem unable to tell me anything about what data types it contains, or how it is organized — the only thing they know for certain is that it is a big-endian binary file with some old data. The function struct.unpack(), however, requires a format character as its first argument.
This is the first line of the binary file:
import binascii
path = "BC2003_lr_m32_chab_Im.ised"
with open(path, 'rb') as fd:
    line = fd.readline()
print(binascii.hexlify(line))
a0040000dd0000000000000080e2f54780f1094840c61a4800a92d48c0d9424840a05a48404d7548e09d8948a0689a48e03fad48a063c248c01bda48c0b8f448804a0949100b1a49e0d62c49e0ed41499097594900247449a0a57f4900d98549b0278c49a0c2924990ad9949a0eba049e080a8490072b049c0c2b849d077c1493096ca494022d449a021de49a099e849e08ff349500a
Is it possible to change the endianness of a file without knowing anything about it?
You cannot do this without knowing the datatypes. There is little point in attempting to do so otherwise.
Even if it was a homogeneous sequence of one datatype, you'd still need to know what you are dealing with; flipping the byte order in double values is very different from short integers.
Take a look at the formatting characters table; anything with a different byte size in it will result in a different set of bytes being swapped; for double values, you need to reverse the order of every 8 bytes, for example.
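To make that concrete, here is a small illustration using four bytes taken from the hex dump above; the format character you assume determines both how the bytes are grouped and what value comes out:
import struct

raw = bytes.fromhex("40c61a48")
print(struct.unpack(">f", raw))    # one big-endian 32-bit float
print(struct.unpack("<f", raw))    # same bytes read little-endian: a completely different number
print(struct.unpack(">hh", raw))   # or two big-endian 16-bit integers instead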
If you know what data should be in the file, then at least you have a starting point; you'd have to puzzle out how those values fit into the bytes given. It'll be a puzzle, but with a target set of values you can build a map of the datatypes contained, then write a byte-order adjustment script. If you don't even have that, best not to start as the task is impossible to achieve.
