I'm having trouble converting a gzip-compressed, hex-encoded string back into its original form without introducing numerous, seemingly erroneous backslashes and unconverted Unicode escape sequences.
The code I'm using for this is:
import gzip
from io import StringIO, BytesIO
def string_to_bytes(input_str: str) -> bytes:
    """
    Read the given string, encode it in utf-8, gzip compress
    the data and return it as a byte array.
    """
    bio = BytesIO()
    bio.write(input_str.encode("utf-8"))
    bio.seek(0)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')
    while True:  # until EOF
        chunk = bio.read(8192)
        if not chunk:  # EOF?
            compressor.close()
            return stream.getvalue()
        compressor.write(chunk)
def bytes_to_string(input_bytes: bytes) -> str:
    """
    Decompress the given byte array (which must be valid
    compressed gzip data) and return the decoded text (utf-8).
    """
    bio = BytesIO()
    stream = BytesIO(input_bytes)
    decompressor = gzip.GzipFile(fileobj=stream, mode='r')
    while True:  # until EOF
        chunk = decompressor.read(8192)
        if not chunk:
            decompressor.close()
            bio.seek(0)
            return bio.read().decode("utf-8")
        bio.write(chunk)
    return None
In the script I'm running the input_string gets compressed + saved as hex with:
saved_hex = string_to_bytes(input_string).hex()
This gets stored as a BINARY datatype in a Snowflake database (using the HEX binary format).
This gets loaded out from there like so:
hex_bytes = bytes.fromhex(hex_html)
html_string = bytes_to_string(hex_bytes)
And the results are coming out like:
href\\\\\\\\u003d\\\\\\\\\\\\x22https://www.google.com/advanced_search\\\\\\\\\\\\x22 target\\\\\\\\u003d\\\\\\\\\\\\x22_blank\\\\\\\\\\\\x22\\\\\\\\u003eadvanced search\\\\\\\\u003c/a\\\\\\\\u003e to find results...
There are multiple backslashes that I'm unable to convert back to a single backslash (in the case of the Unicode escapes) or remove entirely.
Is there any way to more efficiently:
1. Gzip compress the string
2. Convert to hex
3. Decode the hex + decompress, without adding any of these weird unconverted Unicode characters?
Thank you all for the answers. Foolishly, I realised that:
1. I was applying an additional json.dumps() to the input string (further encoding it as a string and adding all the additional backslashes).
2. Snowflake stores the data as BINARY, which must first be converted back to a hex string using TO_VARCHAR(saved_hex_data) before you can call bytes_to_string(bytes.fromhex(output_string)) on it.
At that point everything is preserved as before. Many thanks again.
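For the "more efficient" part of the question, here is a minimal sketch of the same round trip using gzip.compress() and gzip.decompress() from the standard library instead of manual chunked loops. The helper names compress_to_hex and decompress_from_hex are made up for illustration, and the sample string is only loosely modelled on the output shown above. The last lines show how an extra json.dumps() call alters the text before it is ever compressed.

```python
import gzip
import json


def compress_to_hex(text: str) -> str:
    """UTF-8 encode, gzip-compress, and hex-encode a string."""
    return gzip.compress(text.encode("utf-8")).hex()


def decompress_from_hex(hex_text: str) -> str:
    """Reverse compress_to_hex(): hex-decode, gunzip, UTF-8 decode."""
    return gzip.decompress(bytes.fromhex(hex_text)).decode("utf-8")


# Hypothetical sample input.
original = '<a href="https://www.google.com/advanced_search" target="_blank">advanced search</a>'

# The round trip returns the input unchanged.
restored = decompress_from_hex(compress_to_hex(original))

# json.dumps() wraps the string in quotes and escapes every inner double
# quote, so calling it repeatedly keeps multiplying the backslashes.
double_encoded = json.dumps(original)
```

If the text really is JSON-encoded before storage, the matching json.loads() call on the way out undoes that step.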
Related
I have written a small python script which encrypts a message with rsa.
Now I want to save the bytes in a txt to read them later.
But when I use str(...) on it I don't know how to convert the string back.
For example I encrypted "Test" to b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'
and saved it as a string.
When I apply bytes(...) to it I get the error: TypeError: string argument without an encoding.
What can I do in order to do this?
You've saved the Python string representation of a binary byte array (bytestring).
To get the actual bytes back from such a representation, pass it through ast.literal_eval():
>>> import ast
>>> s = r"b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'"
>>> b = ast.literal_eval(s)
>>> b
b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'
Better yet, just save the binary bytes to your file without passing through a string:
encrypted_bytes = my_rsa("Test")
with open("encrypted.bin", "wb") as f:
    f.write(encrypted_bytes)
# ...
with open("encrypted.bin", "rb") as f:
    encrypted_bytes = f.read()
If you really want a "text-safe" format for those bytes, use base64.b64encode() and base64.b64decode().
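As a rough illustration of that last suggestion, the sketch below stores bytes in a text file as Base64 and reads them back. The byte string is a shortened stand-in for real ciphertext, and the temporary file path is only there to keep the example self-contained.

```python
import base64
import os
import tempfile

# Stand-in for the encrypted bytes; any bytes value behaves the same way.
encrypted_bytes = b"Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce"

# bytes -> Base64 bytes -> ASCII text that is safe to store in a .txt file
text_safe = base64.b64encode(encrypted_bytes).decode("ascii")

# Hypothetical location; a temporary directory keeps the example tidy.
path = os.path.join(tempfile.mkdtemp(), "encrypted.txt")
with open(path, "w", encoding="ascii") as f:
    f.write(text_safe)

# Reading the text back and decoding it restores the original bytes.
with open(path, "r", encoding="ascii") as f:
    recovered = base64.b64decode(f.read())
```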
I need to save a params file in Python, and this file contains some parameters that I don't want to leave in plain text, so I encode the entire file in Base64 (I know this isn't the most secure encoding in the world, but it works for the kind of data I need to store).
The encoding works well. I encode the content of my file (a simple text file with a custom extension) and save it. The problem comes with decoding. I print the encoded text before saving the file and the encoded text read back from the saved file, and they are exactly the same, but for a reason I don't understand, decoding the text read from the saved file raises UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 1: invalid start byte, while decoding the text before saving works fine.
Any idea to resolve this issue?
This is my code, I have tried converting all to bytes, to string, and everything...
params = open('params.bpr','r').read()
paramsencoded = base64.b64encode(bytes(params,'utf-8'))
print(paramsencoded)
paramsdecoded = str(base64.b64decode(str(paramsencoded,'utf-8')),'utf-8')
newparams = open('paramsencoded.bpr','w+',encoding='utf-8')
newparams.write(str(paramsencoded))
newparams.close()
params2 = open('paramsencoded.bpr',encoding='utf-8').read()
print(params2)
paramsdecoded = str(base64.b64decode(str(paramsencoded,'utf-8')),'utf-8')
paramsdecoded = base64.b64decode(str(params2))
print(str(paramsdecoded,'utf-8'))
Your error lies in how you handle the bytes object returned by base64.b64encode(): you called str() on it:
newparams.write(str(paramsencoded))
That doesn't decode the bytes object:
>>> bytesvalue = b'abc='
>>> str(bytesvalue)
"b'abc='"
Note the b'...' notation. You produced the representation of the bytes object, which is a string containing Python syntax that can reproduce the value for debugging purposes (you can copy that string value and paste it into Python to re-create the same bytes value).
This may not be that easy to notice at first, as base64.b64encode() otherwise only produces output with printable ASCII bytes.
But your decoding problem originates there, because the value read back from the file includes the b' characters at the start. Those first two characters are interpreted as Base64 data too; the b is a valid Base64 character, and the ' is ignored by the parser:
>>> bytesvalue = b'hello world'
>>> base64.b64encode(bytesvalue)
b'aGVsbG8gd29ybGQ='
>>> str(base64.b64encode(bytesvalue))
"b'aGVsbG8gd29ybGQ='"
>>> base64.b64decode(str(base64.b64encode(bytesvalue))) # with str()
b'm\xa1\x95\xb1\xb1\xbc\x81\xdd\xbd\xc9\xb1\x90'
>>> base64.b64decode(base64.b64encode(bytesvalue)) # without str()
b'hello world'
Note how the output is completely different, because the Base64 decoding now starts from the wrong place: the b supplies the first 6 bits of the first byte (making the first decoded byte 6C, 6D, 6E or 6F, so l, m, n or o in ASCII).
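The bit arithmetic behind that shift can be checked directly. This small sketch looks up the stray b in the standard Base64 alphabet and lists the values the first decoded byte can take; the variable names are illustrative only.

```python
import string

# Standard Base64 alphabet: A-Z, a-z, 0-9, '+', '/'
alphabet = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"

# Position of 'b' in the alphabet, and the six bits it contributes.
index = alphabet.index("b")
top_six_bits = format(index, "06b")

# The remaining two bits of the first byte come from the next character,
# so the first decoded byte must be one of these four values.
candidates = [bytes([(index << 2) | low]) for low in range(4)]
```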
You could properly decode the value (using paramsencoded.decode('ascii') or str(paramsencoded, 'ascii')), but you shouldn't treat any of this data as text.
Instead, open your files in binary mode. Reading and writing then operates with bytes objects, and the base64.b64encode() and base64.b64decode() functions also operate on bytes, making for a perfect match:
import base64
with open('params.bpr', 'rb') as params_source:
    params = params_source.read()  # bytes object
params_encoded = base64.b64encode(params)
print(params_encoded.decode('ascii'))  # base64 data is always ASCII data
params_decoded = base64.b64decode(params_encoded)
with open('paramsencoded.bpr', 'wb') as new_params:
    new_params.write(params_encoded)  # write binary data
with open('paramsencoded.bpr', 'rb') as new_params:
    params_written = new_params.read()
print(params_written.decode('ascii'))  # still Base64 data, so decode as ASCII
params_decoded = base64.b64decode(params_written)  # decode the bytes value
print(params_decoded.decode('utf8'))  # assuming the original source was UTF-8
I explicitly use bytes.decode(codec) rather than str(..., codec) to avoid accidental str(...) calls.
How can I convert bytes to a string without changing the data?
E.g.
Input:
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
Output:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
I want to write image data using StringIO along with some additional data. Below is my code snippet:
img_buf = StringIO()
f = open("Sample_image.jpg", "rb")
file_data = f.read()
img_buf.write('\r\n' + file_data + '\r\n')
This works fine with Python 2.7, but I want it to work with Python 3.4.
On the read operation, file_data = f.read() returns a bytes object like this:
b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
When writing data with img_buf, it accepts only string data, so I'm unable to write file_data together with the additional characters.
So I want to convert file_data as it is into a string object without changing its data. Something like this:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
so that I can concat and write the image data.
I don't want to decode or encode the data. Any suggestions would be helpful. Thanks in advance.
It is not clear what kind of output you desire. If you are interested in aesthetically translating bytes to a string representation without encoding:
s = str(file_data)[1:]
print(s)
# '\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
This is the informal string representation of the original byte string (no conversion).
Details
The official string representation looks like this:
s
# "'\\xb4\\xeb7s\\x14q[\\xc4\\xbb\\x8e\\xd4\\xe0\\x01\\xec+\\x8f\\xf8c\\xff\\x00 \\xeb\\xff'"
String representation handles how a string looks. Double escape characters and double quotes are implicitly interpreted in Python to do the right thing so that the print function outputs a formatted string.
String interpretation handles what a string means. Each block of characters means something different depending on the applied encoding. Here we interpret these blocks of characters (e.g. \\xb4, \\xeb, 7, s) with the UTF-8 encoding. Blocks unrecognized by this encoding are replaced with a default character, �:
file_data.decode("utf-8", "replace")
# '��7s\x14q[Ļ���\x01�+��c�\x00 ��'
Converting from bytes to strings is required for reliably working with strings.
In short, there is a difference in string output between how it looks (representation) and what it means (interpretation). Clarify which you prefer and proceed accordingly.
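If the goal is a str whose characters carry exactly the same numeric values as the original bytes, one commonly used option is the latin-1 (ISO-8859-1) codec, which maps byte value N to code point N. Strictly speaking this is still a decode, so treat the sketch below as one possible reading of the question rather than the only answer.

```python
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'

# latin-1 maps byte value N to code point N, so no byte is lost or replaced.
as_text = file_data.decode("latin-1")

# Encoding with the same codec restores the original bytes exactly.
round_tripped = as_text.encode("latin-1")
```

The resulting string can be concatenated with '\r\n' and written to a StringIO, but it has to be encoded with latin-1 again before it is written anywhere that expects the raw image bytes.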
Addendum
If your question is "how do I concatenate a byte string?", here is one approach:
import io
buffer = io.BytesIO()
with buffer as f:
    f.write(b"\r\n")
    f.write(file_data)
    f.write(b"\r\n")
    print(buffer.getvalue())  # read the value before the with block closes the buffer
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
Equivalently:
buffer = b""
buffer += b"\r\n"
buffer += file_data
buffer += b"\r\n"
buffer
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
I'm new to Python and I do have an issue that is bothering me.
I use the following code to get a base64 string representation of my zip file.
with open( "C:\\Users\\Mario\\Downloads\\exportTest1.zip",'rb' ) as file:
    zipContents = file.read()
    encodedZip = base64.encodestring(zipContents)
Now, if I output the string, it is wrapped in a b'' representation. I don't need this and would like to avoid it. It also adds a newline character every 76 characters, which is another issue. Is there a way to get the binary content and represent it without the newline characters and without the leading and trailing b''?
Just for comparison, if I do the following in PowerShell:
$fileName = "C:\Users\Mario\Downloads\exportTest1.zip"
$fileContentBytes = [System.IO.File]::ReadAllBytes($fileName)
$fileContentEncoded = [System.Convert]::ToBase64String($fileContentBytes)
I do get the exact string I'm looking for, no b'' and no \n every 76 chars.
From the base64 package doc:
base64.encodestring:
"Encode the bytes-like object s, which can contain arbitrary binary data, and return bytes containing the base64-encoded data, with newlines (b"\n") inserted after every 76 bytes of output, and ensuring that there is a trailing newline, as per RFC 2045 (MIME)."
You want to use
base64.b64encode:
"Encode the bytes-like object s using Base64 and return the encoded bytes."
Example:
import base64
with open("test.zip", "rb") as f:
    encodedZip = base64.b64encode(f.read())
    print(encodedZip.decode())
The decode() will convert the binary string to text.
Use b64encode to encode without the newlines and then decode the resulting binary string with .decode('ascii') to get a normal string.
encodedZip = base64.b64encode(zipContents).decode('ascii')
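The difference between the two functions is easy to see side by side. The sketch below uses base64.encodebytes(), the current name for the MIME-style encoder (base64.encodestring() was a deprecated alias and was removed in Python 3.9), and a made-up byte string in place of the zip file contents.

```python
import base64

# Stand-in for the zip file contents.
data = bytes(range(256))

# MIME-style output: a newline after every 76 bytes of output.
mime_style = base64.encodebytes(data)

# Plain Base64: one unbroken run of characters.
single_line = base64.b64encode(data)

# Decoding as ASCII gives an ordinary str without the b'' wrapper.
text = single_line.decode("ascii")
```

Apart from the inserted newlines, both calls produce the same Base64 characters.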
I've got a chunk of code that reads binary data off a string buffer (StringIO object), and tries to convert it to a bytearray object, but it's throwing errors when the value is greater than 127, which the ascii encoding can't handle, even when I'm trying to override it:
file = open(filename, 'r+b')
file.seek(offset)
chunk = file.read(length)
chunk = zlib.decompress(chunk)
chunk = StringIO(chunk)
d = bytearray(chunk.read(10), encoding="iso8859-1", errors="replace")
Running that code gives me:
d = bytearray(chunk.read(10), encoding="iso8859-1", errors="replace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 3: ordinal not in range(128)
Obviously 240 (decimal of 0xf0) can't fit in the ascii encoding range, but that's why I'm explicitly setting the encoding. But it seems to be ignoring it.
When converting a string to another encoding, its original encoding is taken to be ASCII if it is a str or Unicode if it is a unicode object. When creating the bytearray, the encoding parameter is required only if the string is unicode. Just don't specify an encoding and you will get the results you want.
I am not quite sure what the problem is.
StringIO is for string IO, not for binary IO.
If you want a bytearray representing the whole content of the file, use:
with open('filename', 'rb') as file:
    data = bytearray(file.read())
If you want a string with only the ASCII characters contained in that file, use:
with open('filename', 'rb') as file:
    asciis = file.read().decode('ascii', 'ignore')
(The binary flag matters on Windows, and in Python 3 it is also what makes read() return bytes.)
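For readers on Python 3, the same pipeline can stay in bytes from start to finish, which sidesteps the implicit ASCII decode entirely. This is a sketch under that assumption, with a small hand-made payload standing in for the chunk read from the file.

```python
import io
import zlib

# Stand-in for the compressed chunk read from the file; it deliberately
# contains byte values above 127.
compressed = zlib.compress(bytes([1, 2, 3, 0xF0, 0xFF]) * 4)

# zlib.decompress() returns bytes, so wrap it in a binary buffer.
chunk = io.BytesIO(zlib.decompress(compressed))

# bytearray() accepts bytes directly; no encoding argument is needed.
d = bytearray(chunk.read(10))
```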