I am trying to decode and process a binary data file. The data has the following format:
input:9,data:443,gps:3
The file has more data in the same key:value format.
Basically, I need to build a dictionary from the file so I can process it later.
input:b'input:9,data:443,gps:3'
Desired output:{'input': '9', 'data': '443', 'gps': '3'}
Your input data is bytes (a sequence of bytes). To convert it to a str object, use bytes.decode(). Then you can work with the data like any string and split it on , and :. Code:
inp = b"input:9,data:443,gps:3"
out = dict(s.split(":") for s in inp.decode().split(","))
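A slightly more defensive variant of the same idea, as a sketch only. It splits each pair on the first colon, so a value that itself contains a colon survives, and it strips stray whitespace. The function name parse_pairs and the UTF-8 default are assumptions of mine, not part of the question.

```python
def parse_pairs(raw: bytes, encoding: str = "utf-8") -> dict:
    """Turn b'key:value,key:value' into a dict of str -> str."""
    result = {}
    for pair in raw.decode(encoding).split(","):
        if not pair.strip():
            continue  # skip empty fragments, e.g. from a trailing comma
        key, _, value = pair.partition(":")  # split on the first ':' only
        result[key.strip()] = value.strip()
    return result


out = parse_pairs(b"input:9,data:443,gps:3")
print(out)  # {'input': '9', 'data': '443', 'gps': '3'}
```

Unlike the one-liner, this does not raise if a value contains a second colon.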
The string input:9,data:443,gps:3 is text, not binary data, so I am going to guess that it is a format template, not a sample of the file contents. This format would mean that your file has an "input" field that is 9 bytes long, followed by 443 bytes of "data", followed by a 3-byte "gps" value. This description does not specify the types of the fields, so it is incomplete; but it's a start.
The Python tool for structured binary files is the module struct. Here's how to extract the three fields as bytes objects:
import struct
with open("some_file.bin", "rb") as binfile:
    content = binfile.read()
input_, data, gps = struct.unpack("9s443s3s", content)
The function struct.unpack provides many other formats besides s; this is just an example. But there is no specifier for plain text strings, so if input_ is a text string, the next step is to convert it:
input_ = input_.decode("ascii") # or other encoding
Since you ask for a dictionary of the results, here is one way:
result = { "input":input_, "data": data, "gps": gps }
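To try this without a real file, here is a small self-contained sketch that builds a fake 455-byte record in memory and unpacks it again. The field contents (sensor-01, the filler bytes, abc) are invented for illustration, and the layout is only the guess described above.

```python
import struct

# Hypothetical layout taken from the template: 9 + 443 + 3 bytes.
RECORD_FORMAT = "9s443s3s"

# Build a fake record in memory instead of reading some_file.bin.
content = struct.pack(RECORD_FORMAT, b"sensor-01", b"\x01" * 443, b"abc")
assert len(content) == struct.calcsize(RECORD_FORMAT)  # 455 bytes

input_, data, gps = struct.unpack(RECORD_FORMAT, content)
result = {"input": input_.decode("ascii"), "data": data, "gps": gps}
print(result["input"], len(result["data"]), result["gps"])  # sensor-01 443 b'abc'
```

struct.unpack requires the buffer to be exactly struct.calcsize(RECORD_FORMAT) bytes long, so a real file with several records would need to be read in chunks of that size.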
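Solution using eval (note that the values come back as int rather than str, and eval should only be used on input you trust):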
inp = b"input:9,data:443,gps:3"
out = eval(b'dict(%s)' % inp.replace(b':', b'='))
I am writing a python script using web3 package.
The process explained:
I have a transaction, and I read its transaction receipt:
txn_receipt = w3.eth.getTransactionReceipt('0x8ddd5ab8f53df7365a2feb8ee249ca2d317edcdcb6f40faae728a3cb946b4eb1')
Just for this example, I read a specific section of the log. This returns a hex string.
x = txn_receipt['logs'][4]['data']
PROBLEM:
How do I decode this hex? If you go to BSC SCAN, you will see the decoded value I am expecting at block 453.
Expected value:
amount0In :
2369737542851785768252
amount1In :
0
amount0Out :
0
amount1Out :
82650726831815053455
See here:
https://bscscan.com/tx/0x8ddd5ab8f53df7365a2feb8ee249ca2d317edcdcb6f40faae728a3cb946b4eb1#eventlog
Assuming you don't need the key names, you could do this with basic python (no need for web3 library).
The data field in BSC logs is a string of hex encoded values, which is just a base 16 representation of the decimal value. In your example:
0x00000000000000000000000000000000000000000000008076b6fbd0ebb5bd3c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000047b025e26b62ed08f
Just trim the beginning, split it up, and convert each string with python's int() function:
hexdata = '0x00000000000000000000000000000000000000000000008076b6fbd0ebb5bd3c000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000047b025e26b62ed08f'
# Trim '0x' from beginning of string
hexdataTrimed = hexdata[2:]
# Split trimmed string every 64 characters
n = 64
dataSplit = [hexdataTrimed[i:i+n] for i in range(0, len(hexdataTrimed), n)]
# Fill new list with converted decimal values
data = []
for val in range(len(dataSplit)):
    toDec = int(dataSplit[val], 16)
    data.append(toDec)
print(data)
# returns [2369737542851785768252, 0, 0, 82650726831815053455]
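If you do want the key names, a sketch along these lines pairs each 32-byte word with a name. It assumes every field is an unsigned 256-bit integer and that the words appear in the order listed in the question. The helper decode_words is hypothetical, and the sample payload is rebuilt from the expected values rather than copied from the log.

```python
FIELD_NAMES = ["amount0In", "amount1In", "amount0Out", "amount1Out"]  # assumed order


def decode_words(hexdata: str) -> list:
    """Split a hex string (optionally 0x-prefixed) into 32-byte words as ints."""
    raw = bytes.fromhex(hexdata[2:] if hexdata.startswith("0x") else hexdata)
    return [int.from_bytes(raw[i:i + 32], "big") for i in range(0, len(raw), 32)]


# Build a sample payload from the expected values: 64 hex digits per word.
expected = [2369737542851785768252, 0, 0, 82650726831815053455]
sample = "0x" + "".join(f"{value:064x}" for value in expected)

decoded = dict(zip(FIELD_NAMES, decode_words(sample)))
print(decoded)
```

Signed integers, addresses, or dynamic types such as strings would need different handling than int.from_bytes(..., "big").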
Sources:
https://www.binaryhexconverter.com/hex-to-decimal-converter
https://www.w3schools.com/python/ref_func_int.asp
Currently reading from s3 and saving within a dataframe.
S3 objects are read in as bytes; however, it seems that a bytes literal is also embedded within my string.
I am unable to decode the string using example_string.decode().
Another problem arising from this is finding emojis within the text. These are saved as UTF-8 and, because they are stored as a byte string within a string, extra backslashes (\) and so on are added.
I want just the string, with no embedded byte string.
Any help would be appreciated.
bucket_iter = iter(bucket)
while True:
    next_val = next(bucket_iter)
    current_file = next_val.get()['Body'].read().decode('utf-8')
    split_file = current_file.split(']')
    for tweet in split_file:
        a = tweet.split(',')
        if len(a) == 10:
            a[0] = a[0][2:12]
            new_row = {'date':a[0], 'tweet':a[1], 'user':a[2], 'cashtags':a[3],'number_cashtags':a[4],'Hashtags':a[5],'number_hashtags':a[6],'quoted_tweet':a[7],'urs_present':a[8],'spam':a[9]}
            df = df.append(new_row, ignore_index=True)
Example of a line in the S3 bucket:
["2021-01-06 13:41:48", "Q1 2021 Earnings Estimate for The Walt Disney Company $DIS Issued By Truist Securiti https://t co/l5VSCCCgDF #stocks", "b'AmericanBanking'", "$DIS", "1", "#stocks'", "1", "False", "1", "0"]
Even though the item is a string, it keeps the b'...' wrapper inside it. A small bit of code can keep only what is inside the quotes:
def bytes_to_string(b):
    return str(b)[2:-1]
EDIT: you could technically use regexes to do this, but this is a much more readable way of doing it (and shorter)
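As an alternative sketch, ast.literal_eval can turn the embedded b'...' text back into real bytes, which can then be decoded so that escaped UTF-8 sequences (such as emojis) become characters again. This assumes each record in the file is a JSON array like the example line. The helper name unwrap_bytes_repr and the shortened sample line are made up for illustration.

```python
import ast
import json


def unwrap_bytes_repr(value: str, encoding: str = "utf-8") -> str:
    """Return the text inside a "b'...'" wrapper; leave other strings untouched."""
    if value.startswith(("b'", 'b"')):
        try:
            parsed = ast.literal_eval(value)
        except (ValueError, SyntaxError):
            return value  # not a valid bytes literal after all
        if isinstance(parsed, bytes):
            return parsed.decode(encoding, errors="replace")
    return value


# Shortened, hypothetical version of the example line from the bucket.
line = '["2021-01-06 13:41:48", "some tweet $DIS", "b\'AmericanBanking\'", "$DIS"]'
row = [unwrap_bytes_repr(field) for field in json.loads(line)]
print(row[2])  # AmericanBanking

# Escaped UTF-8 bytes inside the wrapper are turned back into characters.
print(unwrap_bytes_repr("b'\\xf0\\x9f\\x98\\x80 to the moon'"))
```

Unlike slicing with [2:-1], this leaves ordinary strings such as "$DIS" unchanged.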
I want to parse a bytes string in JSON format to convert it into python objects. This is the source I have:
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
And this is the desired outcome I want to have:
[{
"Date": "2016-05-21T21:35:40Z",
"CreationDate": "2012-05-05",
"LogoType": "png",
"Ref": 164611595,
"Classes": [
"Email addresses",
"Passwords"
],
"Link": "http://some_link.com"}]
First, I converted the bytes to string:
my_new_string_value = my_bytes_value.decode("utf-8")
but when I try to invoke loads to parse it as JSON:
my_json = json.loads(my_new_string_value)
I get this error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 174 (char 173)
Your bytes object is almost JSON, but it's using single quotes instead of double quotes, and it needs to be a string. So one way to fix it is to decode the bytes to str and replace the quotes. Another option is to use ast.literal_eval; see below for details. If you want to print the result or save it to a file as valid JSON you can load the JSON to a Python list and then dump it out. Eg,
import json
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
# Decode UTF-8 bytes to Unicode, and convert single quotes
# to double quotes to make it valid JSON
my_json = my_bytes_value.decode('utf8').replace("'", '"')
print(my_json)
print('- ' * 20)
# Load the JSON to a Python list & dump it back out as formatted JSON
data = json.loads(my_json)
s = json.dumps(data, indent=4, sort_keys=True)
print(s)
output
[{"Date": "2016-05-21T21:35:40Z", "CreationDate": "2012-05-05", "LogoType": "png", "Ref": 164611595, "Classe": ["Email addresses", "Passwords"],"Link":"http://some_link.com"}]
- - - - - - - - - - - - - - - - - - - -
[
{
"Classe": [
"Email addresses",
"Passwords"
],
"CreationDate": "2012-05-05",
"Date": "2016-05-21T21:35:40Z",
"Link": "http://some_link.com",
"LogoType": "png",
"Ref": 164611595
}
]
As Antti Haapala mentions in the comments, we can use ast.literal_eval to convert my_bytes_value to a Python list, once we've decoded it to a string.
from ast import literal_eval
import json
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
data = literal_eval(my_bytes_value.decode('utf8'))
print(data)
print('- ' * 20)
s = json.dumps(data, indent=4, sort_keys=True)
print(s)
Generally, this problem arises because someone has saved data by printing its Python repr instead of using the json module to create proper JSON data. If it's possible, it's better to fix that problem so that proper JSON data is created in the first place.
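A short sketch of how the two representations differ, and why the quote-replacement trick is fragile. The record below is invented for illustration; it contains an apostrophe on purpose.

```python
import json
from ast import literal_eval

record = [{"Name": "O'Brien", "Ref": 164611595}]

as_repr = str(record)         # what print(record) would have written
as_json = json.dumps(record)  # proper JSON

print(as_repr)  # [{'Name': "O'Brien", 'Ref': 164611595}]
print(as_json)  # [{"Name": "O'Brien", "Ref": 164611595}]

# Replacing quotes breaks as soon as a value contains an apostrophe.
try:
    json.loads(as_repr.replace("'", '"'))
    replace_worked = True
except json.JSONDecodeError:
    replace_worked = False
print(replace_worked)  # False

# literal_eval on the repr and json.loads on real JSON both round-trip.
print(literal_eval(as_repr) == json.loads(as_json) == record)  # True
```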
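You can simply use the following (json.loads accepts bytes directly in Python 3.6+, provided the bytes contain valid JSON with double quotes):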
import json
json.loads(my_bytes_value)
Python 3.5+: use the io module
import json
import io
my_bytes_value = b'[{\'Date\': \'2016-05-21T21:35:40Z\', \'CreationDate\': \'2012-05-05\', \'LogoType\': \'png\', \'Ref\': 164611595, \'Classe\': [\'Email addresses\', \'Passwords\'],\'Link\':\'http://some_link.com\'}]'
fix_bytes_value = my_bytes_value.replace(b"'", b'"')
my_json = json.load(io.BytesIO(fix_bytes_value))
d = json.dumps(byte_str.decode('utf-8'))
To convert this bytes array directly to JSON, first convert it to a string with decode() (UTF-8 is the default), then change the quotation marks. The last step is to remove the surrounding " from the dumped string, which changes the JSON value from a string to a list.
dumps(s.decode()).replace("'", '"')[1:-1]
Better solution is:
import json
byte_array_example = b'{"text": "\u0627\u06CC\u0646 \u06CC\u06A9 \u0645\u062A\u0646 \u062A\u0633\u062A\u06CC \u0641\u0627\u0631\u0633\u06CC \u0627\u0633\u062A."}'
res = json.loads(byte_array_example.decode('unicode_escape'))
print(res)
result:
{'text': 'این یک متن تستی فارسی است.'}
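Decoding with utf-8 alone leaves \uXXXX escape sequences in the text as written; unicode_escape converts them to the actual characters. Note that json.loads also interprets \uXXXX escapes by itself, so decoding this example with utf-8 and then calling json.loads gives the same result.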
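For comparison, a small sketch of how the two codecs behave on two kinds of input: bytes that contain literal \uXXXX escapes, and bytes that contain real UTF-8. The sample strings are shortened and invented for illustration.

```python
import json

# Bytes containing literal \uXXXX escapes (backslash, 'u', four hex digits).
escaped = b'{"text": "\\u0627\\u06CC\\u0646"}'
via_utf8 = json.loads(escaped.decode("utf-8"))
via_escape = json.loads(escaped.decode("unicode_escape"))
print(via_utf8 == via_escape)  # True: json.loads handles the escapes itself

# Bytes containing real UTF-8 for a non-ASCII character.
utf8_bytes = '{"text": "é"}'.encode("utf-8")
print(json.loads(utf8_bytes.decode("utf-8"))["text"])           # é
print(json.loads(utf8_bytes.decode("unicode_escape"))["text"])  # Ã©
```

unicode_escape treats non-escaped bytes as Latin-1, so it garbles input that is already UTF-8 encoded.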
It is OK
If you have a bytes object and want to store it in a JSON file, you should first decode it, because JSON has only a few data types and raw byte data isn't one of them. It has objects, arrays, strings, numbers, booleans, and null.
To decode a byte object you first have to know its encoding. For this, you can use
import chardet
encoding = chardet.detect(your_byte_object)['encoding']
Then you can save this object to your JSON file like this:
data = {"data": your_byte_object.decode(encoding)}
with open('request.txt', 'w') as file:
    json.dump(data, file)
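When the bytes are not text in any encoding (for example, image or sensor data), decoding may fail or give a misleading result. One common alternative, sketched here with an invented payload, is to store the bytes as base64 text inside the JSON document.

```python
import base64
import json

payload = b"\x00\xff\x10 raw"  # arbitrary bytes, not valid UTF-8

# Encode to ASCII text so the value fits in a JSON string.
doc = json.dumps({"data": base64.b64encode(payload).decode("ascii")})

# Decode on the way back in.
restored = base64.b64decode(json.loads(doc)["data"])
print(restored == payload)  # True
```

The reader of the file has to know that the field is base64-encoded; JSON itself does not record that.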
The most simple solution is to use the json function that comes with http request.
For example:
I have a binary file mixed with ASCII in which there are some floating point numbers I want to find. The file contains some lines like this:
1,1,'11.2','11.3';1,1,'100.4';
In my favorite regex tester I found that the correct regex should be ([0-9]+\.{1}[0-9]+).
Here's the code:
import re
data = open('C:\\Users\\Me\\file.bin', 'rb')
pat = re.compile(b'([0-9]+\.{1}[0-9]+)')
print(pat.match(data.read()))
I do not get a single match, why is that? I'm on Python 3.5.1.
You can try like this,
import re
with open('C:\\Users\\Me\\file.bin', 'rb') as f:
    data = f.read()
re.findall("\d+\.\d+", data)
Output:
['11.2', '11.3', '100.4']
re.findall returns a list of strings. To convert them to float, you can do this:
>>> list(map(float, re.findall("\d+\.\d+", data)))
[11.2, 11.3, 100.4]
How to find floating point numbers in binary file with Python?
float_re = br"[+-]? *(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?"
for m in generate_tokens(r'C:\Users\Me\file.bin', float_re):
    print(float(m.group()))
where float_re is from this answer and generate_tokens() is defined here.
pat.match() only tries to match at the very start of the input, and your data does not start with a float, so you "do not get a single match".
re.findall("\d+\.\d+", data) raises a TypeError because the pattern is Unicode (str) but data is a bytes object in your case. Pass the pattern as bytes:
re.findall(b"\d+\.\d+", data)
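Putting both points together, here is a self-contained sketch. The sample bytes stand in for the contents of file.bin and are invented; the pattern is written as a raw bytes literal (rb"...") so the backslashes reach the regex engine unchanged.

```python
import re

# Sample bytes standing in for the contents of file.bin.
data = b"\x00\x01 1,1,'11.2','11.3';1,1,'100.4';\xff"

pat = re.compile(rb"\d+\.\d+")  # raw bytes pattern

print(pat.match(data))           # None: the data does not start with a float
print(pat.search(data).group())  # b'11.2': first match anywhere in the data

# findall returns bytes objects here; decode them before converting.
floats = [float(m.decode("ascii")) for m in pat.findall(data)]
print(floats)                    # [11.2, 11.3, 100.4]
```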
I'm having an issue parsing data after reading a file. I read a binary file and need to create a list of attributes from it. All of the data in the file is terminated with a null byte, so what I'm trying to do is find every null-byte-terminated attribute.
Essentially taking a string like
Health\x00experience\x00charactername\x00
and storing it in a list.
The real issue is I need to keep the null bytes intact; I just need to be able to find each instance of a null byte and store the data that precedes it.
Python doesn't treat NUL bytes as anything special; they're no different from spaces or commas. So, split() works fine:
>>> my_string = "Health\x00experience\x00charactername\x00"
>>> my_string.split('\x00')
['Health', 'experience', 'charactername', '']
Note that split is treating \x00 as a separator, not a terminator, so we get an extra empty string at the end. If that's a problem, you can just slice it off:
>>> my_string.split('\x00')[:-1]
['Health', 'experience', 'charactername']
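Since the question reads the file in binary mode and wants the NUL bytes kept, here is a sketch of the same split on a bytes object with the terminator put back on each item. It assumes the data always ends with a NUL byte. The regex variant is an equivalent alternative of my own, not something the answer proposes.

```python
import re

data = b"Health\x00experience\x00charactername\x00"

# Drop the empty trailing piece, then put the terminator back on each item.
with_nul = [part + b"\x00" for part in data.split(b"\x00")[:-1]]
print(with_nul)  # [b'Health\x00', b'experience\x00', b'charactername\x00']

# Equivalent: match "anything but NUL, followed by NUL".
with_nul_re = re.findall(rb"[^\x00]*\x00", data)
print(with_nul_re == with_nul)  # True
```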
While it boils down to using split('\x00'), a convenience wrapper might be nice.
def readlines(f, bufsize):
    # The file is opened in binary mode, so work with bytes throughout
    buf = b""
    data = True
    while data:
        data = f.read(bufsize)
        buf += data
        lines = buf.split(b'\x00')
        buf = lines.pop()
        for line in lines:
            yield line + b'\x00'
    if buf:
        # Leftover data that had no terminator in the file
        yield buf + b'\x00'
then you can do something like
with open('myfile', 'rb') as f:
    mylist = [item for item in readlines(f, 524288)]
This has the added benefit of not needing to load the entire contents into memory before splitting the text.
To check whether the data contains a NUL byte, simply use the in operator, for example:
if b'\x00' in data:
To find its position, use find(), which returns the lowest index at which the substring is found (or -1 if it is absent). Its optional start and end arguments restrict the search to a slice, which lets you continue searching after the previous match.
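A sketch of that loop, using the sample string from the question as bytes. Each call to find() starts just past the previous NUL byte, and the collected positions are then used to slice out the fields.

```python
data = b"Health\x00experience\x00charactername\x00"

positions = []
start = 0
while True:
    index = data.find(b"\x00", start)  # search from `start` onward
    if index == -1:
        break  # no more NUL bytes
    positions.append(index)
    start = index + 1

print(positions)  # [6, 17, 31]

# Each field runs from just after the previous NUL up to the next one.
starts = [0] + [p + 1 for p in positions]
fields = [data[a:b] for a, b in zip(starts, positions)]
print(fields)  # [b'Health', b'experience', b'charactername']
```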
Split on null bytes; .split() returns a list:
>>> print("Health\x00experience\x00charactername\x00".split("\x00"))
['Health', 'experience', 'charactername', '']
If you know the data always ends with a null byte, you can slice the list to chop off the last empty string (like result_list[:-1]).