Python generically parse data into object structure

Python generically parse data into object structure - python

What I want to know is if I have a defined structured object with known parameters and a known order. I want to parse a binary blob into this structure in a generic way.
For example, I know that my file is a binary file of this structure
typedef struct {
uint frCompressedSize;
uint frUncompressedSize;
ushort frFileNameLength;
ushort frExtraFieldLength;
char frFileName[ frFileNameLength ];
uchar frExtraField[ frExtraFieldLength ];
uchar frData[ frCompressedSize ];
} ZIPFILERECORD;
Is there a better way to do this than reading in individual fields at a time in a hard coded manner? In my real code the structure has almost 100 parameters so the hardcoded method is not my first choice.
Any Ideas?
thanks!

You are looking for the python struct library

Related

How could I work with binary data in Python?

I'm trying to work with robotics binary data but I am very stuck. I don't understand even how to work with it. I want to create a dataframe from it and use pandas to get some statistics.
When I open the file I get this:
struct Time
{
long long time;
unsigned short millitm;
short timezone;
short dstflag;
};
struct wxp1
{
float x;
float y;
};
struct wxp2
struct position
{
wxp1 position; // based on Basic Full(679x382)
wxp1 position2; // based on Basic Full(679x382)
wxp1 position3; // based on Basic Full(679x382)
wxp1 estimatedposition;
wxp1 estimatedposition2;
wxp1 estimatedposition3;
float score;
};
Followed by binary
00\x00\x00\x00\x00\x00\x00\x00\x etc
Not familiar at all with it I tried to open it with struct or methods I found on stackoverflow without success.
from numpy import fromfile, dtype
from pandas import DataFrame
records = fromfile('/content/my_file.blog')
df=DataFrame(records)
But I don't get something relevant with it...

Get address of member of struct using pyelftools

I am using the pyelftools to read an elf file. How can I get an offset value or address of a member in a struct? For example, say I have the following struct in C.
typedef struct
{
int valA;
} TsA;
typedef struct
{
int valB;
} TsB;
typedef struct
{
int valC;
TsB b;
} TsC;
typedef struct
{
TsA a;
TsC c;
} TsStruct;
TsStrcut myStruct;
How can I get an address of myStruct.c.b.valB? I found a similar question here but did not find any good answer.

Find the DIE for the structure, the one with tag DW_TAG_structure_type and DW_AT_name equal to structure names.
Enumerate the DW_TAG_member subdies under it. While there, look at the DW_AT_member_location, it's the offset of the corresponding structure element.
It might help if you take a look at the DIE structure visually first. DWARF Explorer might help (disclaimer: I wrote it).

c++ read map in binary file which created in python

I created a Python script which creates the following map (illustration):
map<uint32_t, string> tempMap = {{2,"xx"}, {200, "yy"}};
and saved it as map.out file (a binary file).
When I try to read the binary file from C++, it doesn't copy the map, why?
map<uint32_t, string> tempMap;
ifstream readFile;
std::streamsize length;
readFile.open("somePath\\map.out", ios::binary | ios::in);
if (readFile)
{
readFile.ignore( std::numeric_limits<std::streamsize>::max() );
length = readFile.gcount();
readFile.clear(); // Since ignore will have set eof.
readFile.seekg( 0, std::ios_base::beg );
readFile.read((char *)&tempMap,length);
for(auto &it: tempMap)
{
/* cout<<("%u, %s",it.first, it.second.c_str()); ->Prints map*/
}
}
readFile.close();
readFile.clear();

It's not possible to read in raw bytes and have it construct a map (or most containers, for that matter)[1]; and so you will have to write some code to perform proper serialization instead.
If the data being stored/loaded is simple, as per your example, then you can easily devise a scheme for how this might be serialized, and then write the code to load it. For example, a simple plaintext mapping can be established by writing the file with each member after a newline:
<number>
<string>
...
So for your example of:
std::map<std::uint32_t, std::string> tempMap = {{2,"xx"}, {200, "yy"}};
this could be encoded as:
2
xx
200
yy
In which case the code to deserialize this would simply read each value 1-by-1 and reconstruct the map:
// Note: untested
auto loadMap(const std::filesystem::path& path) -> std::map<std::uint32_t, std::string>
{
auto result = std::map<std::uint32_t, std::string>{};
auto file = std::ifstream{path};
while (true) {
auto key = std::uint32_t{};
auto value = std::string{};
if (!(file >> key)) { break; }
if (!std::getline(file, value)) { break; }
result[key] = std::move(value);
}
return result;
}
Note: For this to work, you need your python program to output the format that will be read from your C++ program.
If the data you are trying to read/write is sufficiently complicated, you may look into different serialization interchange formats. Since you're working between python and C++, you'll need to look into libraries that support both. For a list of recommendations, see the answers to Cross-platform and language (de)serialization
[1]
The reason you can't just read (or write) the whole container as bytes and have it work is because data in containers isn't stored inline. Writing the raw bytes out won't produce something like 2 xx\n200 yy\n automatically for you. Instead, you'll be writing the raw addresses of pointers to indirect data structures such as the map's internal node objects.
For example, a hypothetical map implementation might contain a node like:
template <typename Key, typename Value>
struct map_node
{
Key key;
Value value;
map_node* left;
map_node* right;
};
(The real map implementation is much more complicated than this, but this is a simplified representation)
If map<Key,Value> contains a map_node<Key,Value> member, then writing this out in binary will write the binary representation of key, value, left, and right -- the latter of which are pointers. The same is true with any container that uses indirection of any kind; the addresses will fundamentally differ between the time they are written and read, since they depend on the state of the program at any given time.
You can write a simple map_node to test this, and just print out the bytes to see what it produces; the pointer will be serialized as well. Behaviorally, this is the exact same as what you are trying to do with a map and reading from a binary file. See the below example which includes different addresses.
Live Example

You can use protocol buffers to serialize your map in python and deserialization can be performed in C++.
Protocol buffers supports both Python and C++.

Fastest way to store and retrieve a large stream of small unstructured messages

I am developing an IOT application that requires me to handle many small unstructured messages (meaning that their fields can change over time - some can appear and others can disappear). These messages typically have between 2 and 15 fields, whose values belong to basic data types (ints/longs, strings, booleans). These messages fit very well within the JSON data format (or msgpack).
It is critical that the messages get processed in their order of arrival (understand: they need to be processed by a single thread - there is no way to parallelize this part). I have my own logic for handling these messages in realtime (the throughput is relatively small, a few hundred thousand messages per second at most), but there is an increasing need for the engine to be able to simulate/replay previous periods by replaying a history of messages. Though it wasn't initially written for that purpose, my event processing engine (written in Go) could very well handle dozens (maybe in the low hundreds) of millions of messages per second if I was able to feed it with historical data at a sufficient speed.
This is exactly the problem. I have been storing many (hundreds of billions) of these messages over a long period of time (several years), for now in delimited msgpack format (https://github.com/msgpack/msgpack-python#streaming-unpacking). In this setting and others (see below), I was able to benchmark peak parsing speeds of ~2M messages/second (on a 2019 Macbook Pro, parsing only), which is far from saturating disk IO.
Even without talking about IO, doing the following:
import json
message = {
'meta1': "measurement",
'location': "NYC",
'time': "20200101",
'value1': 1.0,
'value2': 2.0,
'value3': 3.0,
'value4': 4.0
}
json_message = json.dumps(message)
%%timeit
json.loads(json_message)
gives me a parsing time of 3 microseconds/message, that is slightly above 300k messages/second. Comparing with ujson, rapidjson and orjson instead of the standard library's json module, I was able to get peak speeds of 1 microsecond/message (with ujson), that is about 1M messages/second.
Msgpack is slightly better:
import msgpack
message = {
'meta1': "measurement",
'location': "NYC",
'time': "20200101",
'value1': 1.0,
'value2': 2.0,
'value3': 3.0,
'value4': 4.0
}
msgpack_message = msgpack.packb(message)
%%timeit
msgpack.unpackb(msgpack_message)
Gives me a processing time of ~750ns/message (about 100ns/field), that is about 1.3M messages/second. I initially thought that C++ could be much faster. Here's an example using nlohmann/json, though this is not directly comparable with msgpack:
#include <iostream>
#include "json.hpp"
using json = nlohmann::json;
const std::string message = "{\"value\": \"hello\"}";
int main() {
auto jsonMessage = json::parse(message);
for(size_t i=0; i<1000000; ++i) {
jsonMessage = json::parse(message);
}
std::cout << jsonMessage["value"] << std::endl; // To avoid having the compiler optimize the loop away.
};
Compiling with clang 11.0.3 (std=c++17, -O3), this runs in ~1.4s on the same Macbook, that is to say a parsing speed of ~700k messages/second with even smaller messages than the Python example. I know that nlohmann/json can be quite slow, and was able to get parsing speeds of about 2M messages/second using simdjson's DOM API.
This is still far too slow for my use case. I am open to all suggestions to improve message parsing speed with potential applications in Python, C++, Java (or whatever JVM language) or Go.
Notes:
I do not necessarily care about the size of the messages on disk (consider it a plus if the storage method you suggest is memory-efficient).
All I need is a key-value model for basic data types - I do not need nested dictionaries or lists.
Converting the existing data is not an issue at all. I am simply looking for something read-optimized.
I do not necessarily need to parse the entire thing into a struct or a custom object, only to access some of the fields when I need it (I typically need a small fraction of the fields of each message) - it is fine if this comes with a penalty, as long as the penalty does not destroy the whole application's throughput.
I am open to custom/slightly unsafe solutions.
Any format I choose to use needs to be naturally delimited, in the sense that the messages will be written serially to a file (I am currently using one file per day, which is sufficient for my use case). I've had issues in the past with unproperly delimited messages (see writeDelimitedTo in the Java Protobuf API - lose a single byte and the entire file is ruined).
Things I have already explored:
JSON: experimented with rapidjson, simdjson, nlohmann/json, etc...)
Flat files with delimited msgpack (see this API: https://github.com/msgpack/msgpack-python#streaming-unpacking): what I am currently using to store the messages.
Protocol Buffers: slightly faster, but does not really fit with the unstructured nature of the data.
Thanks!!

I assume that messages only contain few named attributes of basic types (defined at runtime) and that these basic types are for example strings, integers and floating-point numbers.
For the implementation to be fast, it is better to:
avoid text parsing (slow because sequential and full of conditionals);
avoid checking if messages are ill-formed (not needed here as they should all be well-formed);
avoid allocations as much as possible;
work on message chunks.
Thus, we first need to design a simple and fast binary message protocol:
A binary message contains the number of its attributes (encoded on 1 byte) followed by the list of attributes. Each attribute contains a string prefixed by its size (encoded on 1 byte) followed by the type of the attribute (the index of the type in the std::variant, encoded on 1 byte) as well as the attribute value (a size-prefixed string, a 64-bit integer or a 64-bit floating-point number).
Each encoded message is a stream of bytes that can fit in a large buffer (allocated once and reused for multiple incoming messages).
Here is a code to decode a message from a raw binary buffer:
#include <unordered_map>
#include <variant>
#include <climits>
// Define the possible types here
using AttrType = std::variant<std::string_view, int64_t, double>;
// Decode the `msgData` buffer and write the decoded message into `result`.
// Assume the message is not ill-formed!
// msgData must not be freed or modified while the resulting map is being used.
void decode(const char* msgData, std::unordered_map<std::string_view, AttrType>& result)
{
static_assert(CHAR_BIT == 8);
const size_t attrCount = msgData[0];
size_t cur = 1;
result.clear();
for(size_t i=0 ; i<attrCount ; ++i)
{
const size_t keyLen = msgData[cur];
std::string_view key(msgData+cur+1, keyLen);
cur += 1 + keyLen;
const size_t attrType = msgData[cur];
cur++;
// A switch could be better if there is more types
if(attrType == 0) // std::string_view
{
const size_t valueLen = msgData[cur];
std::string_view value(msgData+cur+1, valueLen);
cur += 1 + valueLen;
result[key] = std::move(AttrType(value));
}
else if(attrType == 1) // Native-endian 64-bit integer
{
int64_t value;
// Required to not break the strict aliasing rule
std::memcpy(&value, msgData+cur, sizeof(int64_t));
cur += sizeof(int64_t);
result[key] = std::move(AttrType(value));
}
else // IEEE-754 double
{
double value;
// Required to not break the strict aliasing rule
std::memcpy(&value, msgData+cur, sizeof(double));
cur += sizeof(double);
result[key] = std::move(AttrType(value));
}
}
}
You probably need to write the encoding function too (based on the same idea).
Here is an example of usage (based on your json-related code):
const char* message = "\x01\x05value\x00\x05hello";
void bench()
{
std::unordered_map<std::string_view, AttrType> decodedMsg;
decodedMsg.reserve(16);
decode(message, decodedMsg);
for(size_t i=0; i<1000*1000; ++i)
{
decode(message, decodedMsg);
}
visit([](const auto& v) { cout << "Result: " << v << endl; }, decodedMsg["value"]);
}
On my machine (with an Intel i7-9700KF processor) and based on your benchmark, I get 2.7M message/s with the code using the nlohmann json library and 35.4M message/s with the new code.
Note that this code can be much faster. Indeed, most of the time is spent in efficient hashing and allocations. You can mitigate the problem by using a faster hash-map implementation (eg. boost::container::flat_map or ska::bytell_hash_map) and/or by using a custom allocator. An alternative is to build your own carefully tuned hash-map implementation. Another alternative is to use a vector of key-value pairs and use a linear search to perform lookups (this should be fast because your messages should not have a lot of attributes and because you said that you need a small fraction of the attributes per message).
However, the larger the messages, the slower the decoding. Thus, you may need to leverage parallelism to decode message chunks faster.
With all of that, this is possible to reach more than 100 M message/s.

C send struct over UART to Python

I'm trying to send a struct over UART (from an ESP32) to be processed by Python by using this guide.
// we send this to the host, to be processed by python script
struct package {
uint8_t modifier;
uint8_t keyboard_keys[6];
};
// instantiate struct
package to_send = {};
// send the contents of keyboard_keys and keyboard_modifier_keys
// https://folk.uio.no/jeanra/Microelectronics/TransmitStructArduinoPython.html
void usb_keyboard_send(void)
{
to_send.modifier = keyboard_modifier_keys;
for(uint8_t i = 0; i < 6; i++) {
to_send.keyboard_keys[i] = keyboard_keys[i];
}
printf("S");
printf((uint8_t *)&to_send, sizeof(to_send));
printf("E");
}
However I get the error: invalid conversion from 'uint8_t* {aka unsigned char*}' to 'const char*' [-fpermissive]
I'm pretty new to C++, and I've tried all sorts of casting, but I just can't get it to work. Could someone offer guidance please?

Setting aside that it's generally a bad idea to mix ASCII and raw binary, your code is almost right.
You have 2 major errors:
// instantiate struct
package to_send = {};
should be:
// instantiate struct
struct package to_send = {};
Also, to write directly (not formatted text) to STDOUT you want to use fwrite()
i.e.
printf("S");
fwrite((uint8_t *)&to_send, sizeof(uint8_t), sizeof(struct_package), STDOUT);
printf("E");
As an aside, after fixing these 2 errors you may be surprised to find that your struct isn't the number of bytes you expect. The compiler may optimize it to make memory accesses faster by padding fields to word sized boundaries (32 bits on ESP32). sizeof() will return the correct value taking in to account whatever optimizations are done, but your Python code may not expect that. To fix this you probably wan to use a compiler hint, e.g. __attribute__((__packed__)). See here for a general guide to structure packing.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python generically parse data into object structure - python

You are looking for the python struct library

Related

How could I work with binary data in Python?

Get address of member of struct using pyelftools

c++ read map in binary file which created in python

Fastest way to store and retrieve a large stream of small unstructured messages

C send struct over UART to Python

Categories

Resources