I have an array arr=[07,10,11,05].
I actually want these values converted to hex and sent to a function. I know I can convert the values individually to hex, like hex(arr[0]), hex(arr[1]), etc. What I have been doing until now is converting the array to unicode manually, as arr = [u'07 0a 0b 05'], and then sending that to the function, which serves the purpose for me. But it is too impractical when I have to send hundreds of unicode strings in arr, as in
arr = [u'a3 06 06 01 00 01 01 00',
u'a3 06 06 02 00 01 02 00',
u'a3 06 06 03 00 01 03 00',
u'a3 06 06 04 00 01 04 00',
u'a3 06 06 05 00 01 05 00',
u'a3 06 06 06 00 01 06 00',
u'a3 06 06 07 00 01 07 00']
I am sending this one by one (arr[i]) using a loop. Is there a way I can dynamically convert an array of integers to unicode and send it to an external function which readily accepts such unicode strings?
Is ' '.join(map(hex,arr)) what you are looking for?
More specifically:
import numpy as np
a = np.array([[1,2,3],[6,11,23]])
def strip_hexstring(s):
    # s[2:] drops the '0x' prefix; rstrip('L') handles Python 2 long ints.
    # (s.strip('0x') would also eat trailing zeros, e.g. hex(16) -> '1'.)
    return s[2:].rstrip('L')
def format_line(line):
    return u'{}'.format(' '.join(map(strip_hexstring, map(hex, line))))
arr = [format_line(a[i, :]) for i in range(a.shape[0])]
You can pad with zeros like in your example, if necessary.
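If zero padding is needed (as in the example strings in the question), a format spec can produce it directly; a minimal sketch using the first array from the question:

```python
# Zero-padded, space-separated hex rendering of a list of ints.
arr = [7, 10, 11, 5]
line = u' '.join(u'{:02x}'.format(v) for v in arr)
# line == u'07 0a 0b 05'
```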
Summary: when I use Thrift to serialize map in C++ to disk, and then de-serialize it using Python, I do not get back the same object.
A minimal example to reproduce the problem is in the GitHub repo https://github.com/brunorijsman/reproduce-thrift-crash
Clone this repo on Ubuntu (tested on 16.04) and follow the instructions at the top of the file reproduce.sh
I have the following Thrift model file, which (as you can see) contains a map indexed by a struct:
struct Coordinate {
1: required i32 x;
2: required i32 y;
}
struct Terrain {
1: required map<Coordinate, i32> altitude_samples;
}
I use the following C++ code to create an object with 3 coordinates in the map (see the repo for complete code for all snippets below):
Terrain terrain;
add_sample_to_terrain(terrain, 10, 10, 100);
add_sample_to_terrain(terrain, 20, 20, 200);
add_sample_to_terrain(terrain, 30, 30, 300);
where:
void add_sample_to_terrain(Terrain& terrain, int32_t x, int32_t y, int32_t altitude)
{
Coordinate coordinate;
coordinate.x = x;
coordinate.y = y;
std::pair<Coordinate, int32_t> sample(coordinate, altitude);
terrain.altitude_samples.insert(sample);
}
I use the following C++ code to serialize an object to disk:
shared_ptr<TFileTransport> transport(new TFileTransport("terrain.dat"));
shared_ptr<TBinaryProtocol> protocol(new TBinaryProtocol(transport));
terrain.write(protocol.get());
Important note: for this to work correctly, I had to implement the function Coordinate::operator<. Thrift does generate the declaration for the Coordinate::operator< but does not generate the implementation of Coordinate::operator<. The reason for this is that Thrift does not understand the semantics of the struct and hence cannot guess the correct implementation of the comparison operator. This is discussed at http://mail-archives.apache.org/mod_mbox/thrift-user/201007.mbox/%3C4C4E08BD.8030407#facebook.com%3E
// Thrift generates the declaration but not the implementation of operator< because it has no way
// of knowning what the criteria for the comparison are. So, provide the implementation here.
bool Coordinate::operator<(const Coordinate& other) const
{
if (x < other.x) {
return true;
} else if (x > other.x) {
return false;
} else if (y < other.y) {
return true;
} else {
return false;
}
}
Then, finally, I use the following Python code to de-serialize the same object from disk:
file = open("terrain.dat", "rb")
transport = thrift.transport.TTransport.TFileObjectTransport(file)
protocol = thrift.protocol.TBinaryProtocol.TBinaryProtocol(transport)
terrain = Terrain()
terrain.read(protocol)
print(terrain)
This Python program outputs:
Terrain(altitude_samples=None)
In other words, the de-serialized Terrain contains no altitude_samples, instead of the expected dictionary with 3 coordinates.
I am 100% sure that the file terrain.dat contains valid data: I also de-serialized the same data using C++ and in that case, I do get the expected results (see repo for details)
I suspect that this has something to do with the comparison operator.
My gut feeling is that I should have done something similar in Python with respect to the comparison operator, as I did in C++. But I don't know what that missing something would be.
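To illustrate what I mean: in Python, using a struct as a dict key requires it to be hashable, so I imagine the missing piece would look something like this hand-written sketch (not the actual thrift-generated code):

```python
class Coordinate(object):
    """Hand-written sketch mirroring the Thrift Coordinate struct."""
    def __init__(self, x, y):
        self.x = x
        self.y = y
    def __eq__(self, other):
        return (self.x, self.y) == (other.x, other.y)
    def __hash__(self):
        # Needed so Coordinate can serve as a dict key in Python.
        return hash((self.x, self.y))
    def __lt__(self, other):
        # Counterpart of the C++ operator< implemented above.
        return (self.x, self.y) < (other.x, other.y)
```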
Additional information added on 19-Sep-2018:
Here is a hexdump of the encoding produced by the C++ encoding program:
Offset: 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
00000000: 01 00 00 00 0D 02 00 00 00 00 01 01 00 00 00 0C ................
00000010: 01 00 00 00 08 04 00 00 00 00 00 00 03 01 00 00 ................
00000020: 00 08 02 00 00 00 00 01 04 00 00 00 00 00 00 0A ................
00000030: 01 00 00 00 08 02 00 00 00 00 02 04 00 00 00 00 ................
00000040: 00 00 0A 01 00 00 00 00 04 00 00 00 00 00 00 64 ...............d
00000050: 01 00 00 00 08 02 00 00 00 00 01 04 00 00 00 00 ................
00000060: 00 00 14 01 00 00 00 08 02 00 00 00 00 02 04 00 ................
00000070: 00 00 00 00 00 14 01 00 00 00 00 04 00 00 00 00 ................
00000080: 00 00 C8 01 00 00 00 08 02 00 00 00 00 01 04 00 ..H.............
00000090: 00 00 00 00 00 1E 01 00 00 00 08 02 00 00 00 00 ................
000000a0: 02 04 00 00 00 00 00 00 1E 01 00 00 00 00 04 00 ................
000000b0: 00 00 00 00 01 2C 01 00 00 00 00 .....,.....
The first 4 bytes are 01 00 00 00
Using a debugger to step through the Python decoding function reveals that:
This is being decoded as a struct (which is expected)
The first byte 01 is interpreted as the field type. 01 means field type VOID.
The next two bytes are interpreted as the field id. 00 00 means field ID 0.
For field type VOID, nothing else is read and we continue to the next field.
The next byte is interpreted as the field type. 00 means STOP.
We stop reading data for the struct.
The final result is an empty struct.
All of the above is consistent with the information at https://github.com/apache/thrift/blob/master/doc/specs/thrift-binary-protocol.md, which describes the Thrift binary encoding format.
My conclusion thus far is that the C++ encoder appears to produce an "incorrect" binary encoding (I put incorrect in quotes because certainly something as blatant as that would have been discovered by lots of other people, so I am sure that I am still missing something).
Additional information added on 19-Sep-2018:
It appears that the C++ implementation of TFileTransport has the concept of "events" when writing to disk.
The output which is written to disk is divided into a sequence of "events" where each "event" is preceded by a 4-byte length field of the event, followed by the contents of the event.
Looking at the hexdump above, the first couple of events are:
01 00 00 00 0d : event length 1, event value 0d
02 00 00 00 00 01 : event length 2, event value 00 01
Etc.
The Python implementation of TFileTransport does not understand this concept of events when parsing the file.
It appears that the problem is one of the following two:
1) Either the C++ code should not be inserting these event lengths into the encoded file,
2) Or the Python code should understand these event lengths when decoding the file.
Note that all these event lengths make the C++ encode file much larger than the Python encoded file.
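To illustrate option 2, here is a sketch of stripping the event framing so the Python decoder sees only the raw TBinaryProtocol bytes (assuming every event is a 4-byte little-endian length followed by its payload, as the hexdump above suggests; this is an illustration, not a patched transport):

```python
import struct

def strip_event_framing(data):
    """Concatenate event payloads from a C++ TFileTransport dump.

    Assumes each event is a 4-byte little-endian length followed by
    that many payload bytes (the framing observed in the hexdump).
    """
    out = bytearray()
    i = 0
    while i + 4 <= len(data):
        (length,) = struct.unpack_from('<I', data, i)
        i += 4
        out += data[i:i + length]
        i += length
    return bytes(out)

# The first two events from the hexdump:
# strip_event_framing(b'\x01\x00\x00\x00\x0d\x02\x00\x00\x00\x00\x01')
# -> b'\x0d\x00\x01'
```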
Sadly C++ TFileTransport is not totally portable and will not work with Python's TFileObjectTransport. If you switch to C++ TSimpleFileTransport it will work as expected, with Python TFileObjectTransport and with Java TSimpleFileTransport.
Take a look at the examples here:
https://github.com/RandyAbernethy/ThriftBook/tree/master/part2/types/complex
They do pretty much exactly what you are attempting in Java and Python and you can find examples with C++, Java and Python here (though they add a zip compression layer):
https://github.com/RandyAbernethy/ThriftBook/tree/master/part2/types/zip
Another caution however would be against the use of complex key types. Complex key types require (as you discovered) comparators but will flat out not work with some languages. I might suggest, for example:
map<x,map<y,alt>>
giving the same utility but eliminating a whole class of possible problems (and no need for comparators).
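For instance, the Terrain model above could be reworked along these lines (a sketch; the nesting maps x to a map from y to altitude):

```
struct Terrain {
  1: required map<i32, map<i32, i32>> altitude_samples;
}
```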
I've recently been assigned to maintain an application written with Flask. I'm trying to add a feature that allows users to download a pre-generated Excel file; however, whenever I try to send it, my browser appears to re-encode the file as UTF-8, which adds multibyte sequences and corrupts the file.
File downloaded with wget:
(venv) luke@ubuntu:~$ hexdump -C wget.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 06 06 fb 4a 1f 23 |PK...........J.#|
00000010 cf 03 c0 00 00 00 13 02 00 00 0b 00 00 00 5f 72 |.............._r|
The file downloaded with Chrome (notice the EF BF BD sequences?)
(venv) luke@ubuntu:~$ hexdump -C chrome.xlsx | head -n 2
00000000 50 4b 03 04 14 00 00 00 08 00 ef bf bd 03 ef bf |PK..............|
00000010 bd 4a 1f 23 ef bf bd 03 ef bf bd 00 00 00 13 02 |.J.#............|
Does anyone know how I could fix this? This is the code I'm using:
data = b'PK\x03\x04\x14\x00\x00\x00\x08\x00}\x0c\xfbJ\x1f#\xcf\x03\xc0\x00\x00\x00\x13\x02\x00\x00\x0b\x00\x00\x00'
send_file(BytesIO(data), attachment_filename="x.xlsx", as_attachment=True)
Related issue: Encoding problems trying to send a binary file using flask_restful
Chrome was expecting to receive utf-8 encoded text, and found some bytes that couldn't be interpreted as valid utf-8 encoding of a char - which is normal, because your file is binary. So it replaced these invalid bytes with EF BF BD, the utf-8 encoding of the Unicode replacement character. The content-type header you send is probably text/..... Maybe try something like Content-Type:application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
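The mangling can be reproduced directly; a sketch of what the browser effectively did to the "text" response body (the sample bytes are taken from the wget dump above):

```python
# A few bytes from the real .xlsx file; 0xfb is not a valid UTF-8 sequence.
data = b'PK\x03\x04\x14\x00\x00\x00\x08\x00\xfb\x4a'
# Decoding as text with replacement, then re-encoding, mimics what the
# browser did when it treated the binary body as UTF-8 text.
mangled = data.decode('utf-8', errors='replace').encode('utf-8')
# The invalid 0xfb byte has become EF BF BD (U+FFFD, the replacement char).
```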
The following Fortran code:
INTEGER*2 :: i, Array_A(32)
Array_A(:) = (/ (i, i=0, 31) /)
OPEN (unit=11, file = 'binary2.dat', form='unformatted', access='stream')
Do i = 1, 32
    WRITE(11) Array_A(i)
End Do
CLOSE (11)
Produces streaming binary output with the numbers 0 to 31 as 16-bit integers. Each record takes up 2 bytes, so they are written at bytes 1, 3, 5, 7, and so on. The access='stream' suppresses Fortran's standard per-record header (I need that to keep the files as tiny as possible).
Looking at it with a Hex-Editor, I get:
00 00 01 00 02 00 03 00 04 00 05 00 06 00 07 00
08 00 09 00 0A 00 0B 00 0C 00 0D 00 0E 00 0F 00
10 00 11 00 12 00 13 00 14 00 15 00 16 00 17 00
18 00 19 00 1A 00 1B 00 1C 00 1D 00 1E 00 1F 00
which is completely fine (the second byte is never used simply because the values in my example are too small).
Now I need to import these binary files into Python 2.7, but I can't. I tried many different routines, but I always fail in doing so.
1. attempt: "np.fromfile"
with open("binary2.dat", 'r') as f:
    content = np.fromfile(f, dtype=np.int16)
returns
[ 0 1 2 3 4 5 6 7 8 9 10 11
12 13 14 15 16 17 18 19 20 21 22 23
24 25 0 0 26104 1242 0 0]
2. attempt: "struct"
import struct
with open("binary2.dat", 'r') as f:
    content = f.readlines()
struct.unpack('h' * 32, content)
delivers
struct.error: unpack requires a string argument of length 64
because
print content
['\x00\x00\x01\x00\x02\x00\x03\x00\x04\x00\x05\x00\x06\x00\x07\x00\x08\x00\t\x00\n', '\x00\x0b\x00\x0c\x00\r\x00\x0e\x00\x0f\x00\x10\x00\x11\x00\x12\x00\x13\x00\x14\x00\x15\x00\x16\x00\x17\x00\x18\x00\x19\x00']
(note the delimiter, and the \t and \n which shouldn't be there according to what Fortran's "streaming" access does)
3. attempt: "FortranFile"
f = FortranFile("D:/Fortran/Sandbox/binary2.dat", 'r')
print(f.read_ints(dtype=np.int16))
With the error:
TypeError: only length-1 arrays can be converted to Python scalars
(remember how it detected a delimiter in the middle of the file, but it would also crash for shorter files without line break (e.g. decimals from 0 to 8))
Some additional thoughts:
Python seems to have troubles with reading parts of the binary file. For np.fromfile it reads Hex 19 (dec: 25), but crashes for Hex 1A (dec: 26). It seems to be confused with the letters, although 0A, 0B ... work just fine.
For attempt 2 the content-result is weird. Decimals 0 to 8 work fine, but then there is this strange \t\x00\n thing. What is it with hex 09 then?
I've been spending hours trying to find the logic, but I'm stuck and really need some help. Any ideas?
The problem is the file open mode. By default it is text mode; change it to binary:
with open("binary2.dat", 'rb') as f:
    content = np.fromfile(f, dtype=np.int16)
and all the numbers will be read successfully. See the Dive Into Python chapter on binary files for more details.
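Once the file is opened in binary mode, the struct approach from attempt 2 also works; a sketch (assuming the file was written little-endian, which matches the hexdump in the question):

```python
import struct

def read_int16_stream(raw):
    """Unpack a raw byte string as little-endian 16-bit integers."""
    count = len(raw) // 2
    return list(struct.unpack('<{}h'.format(count), raw))

# usage:
# with open('binary2.dat', 'rb') as f:
#     values = read_int16_stream(f.read())   # [0, 1, 2, ..., 31]
```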
I am trying to plot water column height with python from DART data,
http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart
import pandas as pd
link = "http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart"
data = pd.read_table(link)
but the data has only one column and I can't access the separate fields
Site: 23228
0 #Paroscientific Serial: W23228
1 #YY MM DD hh mm ss T HEIGHT
2 #yr mo dy hr mn s - m
3 2014 08 08 06 00 00 1 2609.494
4 2014 08 08 05 45 00 1 2609.550
5 2014 08 08 05 30 00 1 2609.605
6 2014 08 08 05 15 00 1 2609.658
7 2014 08 08 05 00 00 1 2609.703
8 2014 08 08 04 45 00 1 2609.741
9 2014 08 08 04 30 00 1 2609.769
10 2014 08 08 04 15 00 1 2609.787
11 2014 08 08 04 00 00 1 2609.799
12 2014 08 08 03 45 00 1 2609.802
For example, I just want the HEIGHT values as a numpy array, but I don't know how to access this specific column.
With pure Python (no NumPy) I would use the csv module:
import urllib2
import csv
u = urllib2.urlopen('http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart')
r = csv.reader(u, delimiter=' ')
# skip the headers
for _ in range(3):
    next(r, None)
Now r contains an iterable which gives one row (list of 8 items) at a time for whatever you need. Of course, if you need a list of lists, you may just do list(r).
However, as you are handling rather a large amount of data, you may probably want to use NumPy. In that case:
import urllib2
import numpy as np
u = urllib2.urlopen('http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart')
arr = np.loadtxt(u, skiprows=3)
This gives you an array of 92551 x 8 values.
Accessing the heights as a NumPy array is then simple:
arr[:,7]
Pandas is another possibility, as you correctly thought. It is just a matter of a few parameters...
import urllib2
import pandas as pd
link = 'http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart'
df = pd.read_table(link, delimiter=r'\s+', skiprows=[1,3], header=1)
Now you have a nice DataFrame with df["HEIGHT"] as the height. (The column names are taken from row 2 of the file.)
And for the plotting...
df["HEIGHT"].plot()
creates the plot of the height column (image omitted).
(Then I guess you will ask how to get the proper date on the X axis. I think that is worth a completely new question...)
Perhaps you can modify the following:
import urllib2, numpy
response = urllib2.urlopen('http://www.ndbc.noaa.gov/data/dart_deployment_realtime/23228.dart')
all_lines = response.read().splitlines()
lines_of_interest = all_lines[4:]
heights = numpy.zeros(len(lines_of_interest), dtype=float)
for idx, line in enumerate(lines_of_interest):
    heights[idx] = float(line.split()[7])
Then:
>>> heights.shape
(92551,)
>>> heights
array([ 2609.27 , 2609.213, 2609.153, ..., 2611.157, 2611.084,
2611.008])
Etc.
I want to parse the output of a serial monitoring program called Docklight (I highly recommend it).
It outputs 'hexadecimal' strings, i.e. sequences of (two capital hex digits followed by a space). The corresponding regular expression is ([0-9A-F]{2} )+, for example: '05 03 DA 4B 3F '.
When the program detects particular sequences of characters, it places comments in the 'hexadecimal' string, for example:
'05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
Comments are strings of the following format ' .+ ' (a sequence of characters preceded by a space and followed by a space).
I want to get rid of the comments. For example, the 'hexadecimal' string above, filtered, would be:
'05 03 04 01 0A 03 08 0B BD AF 0D 0A '
How do I go about doing this with a regular expression?
You could try re.findall():
>>> a='05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> re.findall(r"\b[0-9A-F]{2}\b", a)
['05', '03', '04', '01', '0A', '03', '08', '0B', 'BD', 'AF', '0D', '0A']
The \b in the regular expression matches a "word boundary".
Of course, your input is ambiguous if the serial monitor inserts something like THIS BE THE HEADER.
Using your regex, with the repeated group made non-capturing (otherwise re.findall returns only the last repetition of the capturing group, not the whole run):
hexa = '(?:[0-9A-F]{2} )+'
''.join(re.findall(hexa, line))
It might be easier to find all the hexadecimal numbers, assuming the inserted strings won't contain a match (notice below how AD from PAYLOAD slips through):
>>> data = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> import re
>>> pattern = re.compile("[0-9A-F]{2} ")
>>> "".join(pattern.findall(data))
'05 03 04 01 0A 03 08 0B BD AF AD 0D 0A '
Otherwise you could use the fact that the inserted strings are preceded by two spaces:
>>> data = '05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
>>> re.sub("( .*?)(?=( [0-9A-F]{2} |$))","",data)
'05 03 04 01 0A 03 08 0B BD AF 0D 0A'
This uses a look ahead to work out when the inserted string ends. It looks for either a hexadecimal string surround by spaces or the end of the source string.
How about a solution that actually uses regex negation? ;)
result = re.sub(r"[ ]+(?:(?!\b[0-9A-F]{2}\b).)+", "", subject)
While you already received two answers that find all the hexadecimal numbers, here's a direct regular expression that finds all text that does not look like a hexadecimal number (assuming that's two letters/digits in the 0-9, A-F, a-f range, followed by a space).
Something like this (sorry, I'm not a pythoneer, but you get the idea):
newstring = re.sub(r"[^ ]+(?<![0-9A-Fa-f ]{2}|^.)", "", yourstring)
It works by "looking back". It finds every consecutive non-space substring, then negatively looks back with (?<!....). It says: "if the previous two characters were not a hex number, then succeed". The little ^. at the end prevents incorrectly matching the first character of the string.
Edit
As suggested by Alan Moore, here's the same idea with a lookahead expression (a negative lookahead rejects any word that isn't exactly two hex digits):
newstring = re.sub(r" (?![0-9A-Fa-f]{2}\b)[^ ]+", "", yourstring)
Why a regexp? More pythonic for me is (fixed to check for hex digits, not just decimal digits):
command='05 03 04 01 0A The Header 03 08 0B BD AF The PAYLOAD 0D 0A The Footer'
print ' '.join(com for com in command.split()
if len(com)==2 and all(c.upper() in '0123456789ABCDEF' for c in com))