How to Convert Wrapped Bytes to Actual Bytes in Python Dataframe? - python

I have a column in my pandas dataframe which stores bytes. I believe the bytes are getting converted to a string when I put it in the dataframe because dataframe doesn't support actual bytes as a dtype. So instead of the column values being b'1a2b', it ends up getting wrapped in a string like this: "b'1a2b'".
I'm passing these values into a method that expects bytes. When I pass it like this ParseFromString("b'1a2b'"), I get the error message:
TypeError: memoryview: a bytes-like object is required, not 'str'
I'm not sure whether encode or decode applies here, or whether there is some other way to convert these wrapped bytes into real bytes. (I'm using Python 3.)
Since these values are in a dataframe, I can use a helper method during the conversion process from string-->bytes--> protocol buffer since the actual dataframe might not be able to store it as bytes. For example, my_dataframe.apply(_helper_method_convert_string_to_bytes_to_protobuf).

So the problem seems to be that you are unable to extract the byte object from the string. When you pass the string to the function, which is expecting a byte object like b'1a2b', it throws an error. My suggestion would be to try wrapping your string in an eval function. Like:
a = "b'1a2b'"
b = eval(a)
b is what you want. You haven't shared the code for your function, so I'm unable to amend the actual code for you.

You can take a few approaches here. Note that eval() is considered bad practice and is best avoided where possible.
Store the byte content as a plain string (without the b'' wrapper) and call encode() when passing it to the function.
Extract the byte content from the wrapped string, then call encode() on it before passing it to the function.
Ideally you would store the value as plain 1a2b when importing the data. If that's not possible, you could use a regex to extract the contents of the string between b' and ' and pass the result to encode().
import re
string = "b'1a2b'"
re.search(r"(?<=').*(?=')", string).group().encode()
Output:
#b'1a2b'
type(re.search(r"(?<=').*(?=')", string).group().encode())
#<class 'bytes'>
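When the stored value is the repr of a bytes object, one eval-free option is ast.literal_eval, which accepts only Python literals and so also handles escaped bytes such as \x08. The following is a minimal sketch under that assumption; the helper name wrapped_to_bytes and the sample values are hypothetical and not taken from the question.

```python
import ast


def wrapped_to_bytes(value):
    """Turn a string such as "b'1a2b'" back into a bytes object."""
    if isinstance(value, bytes):
        # Already real bytes: nothing to do.
        return value
    parsed = ast.literal_eval(value)  # accepts literals only, never runs code
    if not isinstance(parsed, bytes):
        raise TypeError(f"expected a bytes literal, got {type(parsed).__name__}")
    return parsed


# Hypothetical column values; the second one contains escaped bytes.
column = ["b'1a2b'", r"b'\x08\x01'"]
converted = [wrapped_to_bytes(v) for v in column]
print(converted)
```

With pandas, the same helper could probably be passed to Series.apply on the affected column before each value is handed to ParseFromString.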

Related

convert python string of byte data to bytes

I have a Python string of bytes data. An example string looks like this:
string = "b'\xabVJ-K\xcd+Q\xb2R*.M*N.\xcaLJU\xd2QJ\xceH\xcc\xcbK\xcd\x01\x89\x16\xe4\x97\xe8\x97d&g\xa7\x16Y\x85\x06\xbb8\xeb\x02\t\xa5Z\x00'"
It is a string, not bytes. I wish to convert it to bytes. Normal approaches (like encode) yield this:
b'\\xabVJ-K\\xcd+Q\\xb2R*.M*N.\\xcaLJU\\xd2QJ\\xceH\\xcc\\xcbK\\xcd\\x01\\x89\\x16\\xe4\\x97\\xe8\\x97d&g\\xa7\\x16Y\\x85\\x06\\xbb8\\xeb\\x02\\t\\xa5Z\\x00'
which leads to issues (note the addition of all the extra slashes).
I've looked through 10+ potential answers to this question on SO and only one of them works, and it's a solution I'd prefer not to use, for obvious reasons:
this_works = eval(string)
Is there any way to get this to work without eval? The other potential solutions I've tried all failed.
I assume that you have a Python-like string representation in the variable s:
s = r"b'\xabVJ-K\xcd+Q\xb2R*.M*N.\xcaLJU\xd2QJ\xceH\xcc\xcbK\xcd\x01\x89\x16\xe4\x97\xe8\x97d&g\xa7\x16Y\x85\x06\xbb8\xeb\x02\t\xa5Z\x00'"
Yes, if you eval this, you get a real Python bytes object.
But you can also parse it with the ast module:
import ast
s = r"b'\xabVJ-K\xcd+Q\xb2R*.M*N.\xcaLJU\xd2QJ\xceH\xcc\xcbK\xcd\x01\x89\x16\xe4\x97\xe8\x97d&g\xa7\x16Y\x85\x06\xbb8\xeb\x02\t\xa5Z\x00'"
tree = ast.parse(s)
value = tree.body[0].value.value
print(type(value), value)
This will output your bytes object:
<class 'bytes'> b'\xabVJ-K\xcd+Q\xb2R*.M*N.\xcaLJU\xd2QJ\xceH\xcc\xcbK\xcd\x01\x89\x16\xe4\x97\xe8\x97d&g\xa7\x16Y\x85\x06\xbb8\xeb\x02\t\xa5Z\x00'
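The extra backslashes produced by encode() appear because the string holds the text of a bytes literal (a backslash, an x, two hex digits, and so on) rather than the bytes themselves. The sketch below contrasts the two results on a shortened sample value chosen for illustration. It reuses the ast.parse idea and assumes Python 3.8 or later, where literals are parsed as ast.Constant nodes.

```python
import ast

# Shortened, hypothetical sample: the text of a bytes literal, not bytes.
s = r"b'\xabVJ\t\x00'"

# encode() keeps every character of the text, backslashes included.
encoded = s.encode()

# Parsing the text as an expression recovers the bytes the literal denotes.
node = ast.parse(s, mode="eval").body
if not (isinstance(node, ast.Constant) and isinstance(node.value, bytes)):
    raise ValueError("not a bytes literal")
parsed = node.value

print(len(encoded), encoded)
print(len(parsed), parsed)
```

The type check before reading node.value is a design choice: it rejects input that parses as something other than a bytes literal instead of silently returning it.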

How to replace `c2a0` with none character in python3?

I want to convert b'\xc2\xa0\x38' into b'\x38' in Python 3.
b'\xc2\xa0\x38'.replace(u'\xc2\xa0',"")
b'\xc2\xa0\x38'.replace(u'\xc2a0',"")
TypeError: a bytes-like object is required, not 'str'
On the web page, c2 a0 is the UTF-8 encoding of NO-BREAK SPACE, whose Unicode code point is U+00A0.
Unicode code point   character   UTF-8 (hex.)   name
U+00A0                           c2 a0          NO-BREAK SPACE
Note: NO-BREAK SPACE has no visible glyph, so the character column is blank here.
Relationship between Unicode code point, character, and UTF-8
How to convert b'\xc2\xa0\x38' into b'\x38' with replace method?
You were already almost there:
b'\xc2\xa0\x38'.replace(b'\xc2\xa0',b'')
The two calls from the question fail because their arguments are str objects:
b'\xc2\xa0\x38'.replace(u'\xc2\xa0',"")
b'\xc2\xa0\x38'.replace(u'\xc2a0',"")
Since b'\xc2\xa0\x38' is a bytes object, you cannot use string methods on it. So when you call .replace() on it, you are not calling str.replace but bytes.replace. While those two look and behave very similarly, they still operate on different types:
str.replace replaces a substring inside of a string with another string. And bytes.replace replaces a sub-bytestring inside of a bytestring with another bytestring. So the types of all arguments always match:
str.replace(str, str)
bytes.replace(bytes, bytes)
So in order to replace something inside of a bytes string, you need to pass bytes objects:
>>> b'\xc2\xa0\x38'.replace(b'\xc2\xa0', b'')
b'8'
>>> b'\xc2\xa0\x38'.replace(b'\xc2a0', b'')
b'\xc2\xa08'
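To make the type matching concrete, here is a minimal sketch that puts the bytes-level replacement next to a text-level alternative. The decode step assumes the data is valid UTF-8, which the question suggests but does not state.

```python
data = b'\xc2\xa0\x38'

# Bytes-level: pattern and replacement are both bytes.
stripped = data.replace(b'\xc2\xa0', b'')

# Text-level: decode first, then remove U+00A0 (NO-BREAK SPACE) as a str.
text = data.decode('utf-8')
stripped_text = text.replace('\u00a0', '')

print(stripped)
print(stripped_text.encode('utf-8'))
```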
How to make b'8' displayed as b'\x38'?
You generally cannot do that. b'8' and b'\x38' are equal to one another:
>>> b'8' == b'\x38'
True
Both contain the same single byte value, a 0x38. It’s just that there are multiple ways to represent that content as a bytes literal in Python. Just like you can write 10, 0xA, 0b1010 or 0o12 to refer to the same int object with the decimal value of 10, you can describe a bytes object in multiple ways.
Now, when you use the interactive Python REPL and just write b'\x38', Python will interpret that bytes literal, create a bytes object with the single byte 0x38, and then the REPL will print out the repr() of that bytes object. And the repr() of a bytes object just happens to use printable ASCII characters whenever possible.
There is no way to change this, but there’s also no need to change that. The b'8' that you see is just one representation of the same bytes object. And if you use that object and do something with it (e.g. write it to a file, transform it, or send over the network), then it’s the actual bytes that are sent, and not some string representation of the bytes object.
If you however want to actually print the bytes object, you can deliberately convert it into a string using your favorite representation. For example, if you want a hex representation of your bytes string, you could use one of the many ways to do that:
>>> print(b'8'.hex())
38
>>> print(b'\x38'.hex())
38
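If an escaped form such as \x38 is wanted for every byte, it has to be built as a string by hand. A minimal sketch follows; the function name escape_all is made up for this example.

```python
def escape_all(data):
    """Render every byte as \\xNN, regardless of whether it is printable."""
    return "b'" + "".join(f"\\x{byte:02x}" for byte in data) + "'"


print(escape_all(b'8'))
print(escape_all(b'\xc2\xa0\x38'))
```

The result is a str meant for display only; the underlying bytes object is unchanged.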
Is that data being read from a file? Maybe you opened the file in binary mode:
with open(fname, 'rb') as f:
This means that the data read from the file is returned as bytes object, not str.
If that is so, try to open the file as a textfile instead by replacing the 'rb' mode with 'r'.
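The difference between the two modes can be seen with a throwaway file. This is only a sketch, not the asker's code: the temporary file and the UTF-8 encoding are assumptions.

```python
import os
import tempfile

# Write the three bytes from the question to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b'\xc2\xa0\x38')
    fname = tmp.name

try:
    with open(fname, 'rb') as f:
        raw = f.read()    # bytes: replace() needs bytes arguments
    with open(fname, 'r', encoding='utf-8') as f:
        text = f.read()   # str: replace() needs str arguments
finally:
    os.remove(fname)

print(type(raw), raw.replace(b'\xc2\xa0', b''))
print(type(text), text.replace('\u00a0', ''))
```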

Turned bytes string and saved. Is it possible to get the original bytes back

I took a Python bytes object returned by a method and assigned it to a text column in a database. I meant to call decode on the bytes prior to saving. Is there a way to take the string representation of that bytes object and turn it back into bytes so that I can call decode and re-save it?
The string in the database is:
\x30316331643763386665356566663764303761626132633030373931376531343835616334623136346131633633663564663235393532656361373663353966
I'd like to be able to read that into bytes somehow but can't quite figure out the correct way to instantiate it so that I can make the decode('utf-8') call I missed the first time.
>>> from binascii import unhexlify
>>> unhexlify("30316331643763386665356566663764303761626132633030373931376531343835616334623136346131633633663564663235393532656361373663353966")
b'01c1d7c8fe5eff7d07aba2c007917e1485ac4b164a1c63f5df25952eca76c59f'
unhexlify might be what you are looking for. This particular example unhexlifies to what looks like some sort of hash. Note that the leading \x of the stored value has to be dropped before the call.
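Putting the pieces together for the stored value: the sketch below assumes that the leading \x is a prefix added when the bytes were saved (as it appears in the question) and that the payload is UTF-8 text. The helper name stored_hex_to_text is hypothetical.

```python
def stored_hex_to_text(stored):
    """Convert a '\\x'-prefixed hex dump back to the text it encodes."""
    hex_digits = stored[2:] if stored.startswith("\\x") else stored
    raw = bytes.fromhex(hex_digits)   # same result as binascii.unhexlify
    return raw.decode("utf-8")


# The value from the question, split over two lines for readability.
stored = (
    "\\x3031633164376338666535656666376430376162613263303037393137653134"
    "3835616334623136346131633633663564663235393532656361373663353966"
)
print(stored_hex_to_text(stored))
```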

In Python, what would be a single, compact method to convert an int, a float or unicode to a string?

I have a very large number of database objects from SQLite3 to loop over and print to the terminal. I'm trying to find a robust method that can be applied to every object retrieved from the database so that it is converted to a string. The objects are likely to be strings, ints, floats and unicode.
My initial approach was to simply use the function str() on every object, but this fails on some unicode. I was prompted then to try to use .encode("utf-8") on every object, but this fails on ints ('int' object has no attribute 'encode'). What would be a compact way to try to convert these objects to strings?
The best I've got right now is something like the following:
try:
    string_representation = str(row[column])
except:
    string_representation = str(row[column].encode("utf-8"))
row_contents.append(string_representation)
Is there a better, perhaps more compact approach? A one-liner would be nice.
Call unicode() on the numeric values.
But if you have a collection that includes both unicode string and byte strings, and you want to avoid any implicit encoding, you would have to check the types.
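A type check of that kind could look like the following sketch. It is written for Python 3, where str is the Unicode type and byte strings are bytes, so it is not a drop-in answer for the Python 2 code in the question. The helper name to_text and the UTF-8 default are assumptions.

```python
def to_text(value, encoding="utf-8"):
    """Return a text representation of a database value."""
    if isinstance(value, (bytes, bytearray)):
        # Byte strings are decoded explicitly instead of relying on str().
        return bytes(value).decode(encoding)
    return str(value)  # ints, floats, None and str all go through str()


# Hypothetical row mixing numbers, text and a UTF-8 byte string.
row = (1, 2.5, "caf\u00e9", b"caf\xc3\xa9")
row_contents = [to_text(item) for item in row]
print(row_contents)
```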
In C you would call
const unsigned char *sqlite3_column_text(sqlite3_stmt*, int iCol);
which returns the column value converted to text, whatever its stored type.
https://www.sqlite.org/c3ref/column_blob.html
I don't know what the Python equivalent would be, but it is better to let the database engine do these conversions: it is faster and more reliable.

Should I define a function to handle a string object or a unicode object?

I found that the docs say that software should only work with Unicode strings internally, converting to a particular encoding on output.
Does this mean that every method I define should treat its parameter as a unicode object instead of a string object? If not, when should I handle it as a string and when as unicode?
Yes, this is exactly what they mean.
Handle textual input from outside sources as byte strings, but decode it to unicode immediately. Only encode back to some encoding to output it (preferably this is done by whatever function/method you call to do the output, rather than you needing to explicitly encode and then pass the encoded string somewhere).
Obviously, if you're dealing with non-text binary bytes, keep them in byte strings.
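As a sketch of that decode-early, encode-late pattern, here is a small example using Python 3 naming, where text is str and encoded data is bytes. The function shout and the Latin-1 input are hypothetical.

```python
def shout(raw, encoding):
    """Decode at the boundary, work on text, encode only for output."""
    text = raw.decode(encoding)      # bytes -> text as soon as data arrives
    result = text.upper()            # all processing happens on text
    return result.encode("utf-8")    # text -> bytes only when writing out


incoming = "caf\u00e9".encode("latin-1")   # pretend this came from outside
print(shout(incoming, "latin-1"))
```

Because the processing step sees only text, it does not need to know which encoding the input used or which one the output will use.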
