How to convert bytes data to string without changing data using python3 - python

How can I convert bytes to a string without changing the data?
E.g.
Input:
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
Output:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
I want to write image data using StringIO along with some additional data. Below is my code snippet:
img_buf = StringIO()
f = open("Sample_image.jpg", "rb")
file_data = f.read()
img_buf.write('\r\n' + file_data + '\r\n')
This works fine with Python 2.7, but I want it to work with Python 3.4.
On the read operation, file_data = f.read() returns a bytes object with data something like this:
b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
When writing the data using img_buf, it accepts only str data, so I am unable to write file_data together with the additional characters.
So I want to convert file_data to a str object as-is, without changing its data, something like this:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
so that I can concatenate and write the image data.
I don't want to decode or encode the data. Any suggestions would be helpful. Thanks in advance.

It is not clear what kind of output you desire. If you are interested in aesthetically translating the bytes to a string representation without decoding them:
s = str(file_data)[1:]
print(s)
# '\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
This is the informal string representation of the original byte string (no conversion).
Details
The official string representation looks like this:
s
# "'\\xb4\\xeb7s\\x14q[\\xc4\\xbb\\x8e\\xd4\\xe0\\x01\\xec+\\x8f\\xf8c\\xff\\x00 \\xeb\\xff'"
String representation handles how a string looks. Escape sequences and quotes are interpreted by Python so that the print function outputs a nicely formatted string.
String interpretation handles what a string means. Each block of characters means something different depending on the applied encoding. Here we interpret these blocks of characters (e.g. \\xb4, \\xeb, 7, s) with the UTF-8 encoding. Blocks unrecognized by this encoding are replaced with a default replacement character, �:
file_data.decode("utf-8", "replace")
# '��7s\x14q[Ļ���\x01�+��c�\x00 ��'
Decoding from bytes to str is required if you want to work with the data reliably as text.
In short, there is a difference in string output between how it looks (representation) and what it means (interpretation). Clarify which you prefer and proceed accordingly.
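If the goal is specifically a str that can be converted back to the identical bytes, one option not mentioned above is the latin-1 codec, which maps every byte value 0-255 to the code point of the same value, so the round trip is lossless. A minimal sketch:
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
as_text = file_data.decode('latin-1')          # one character per byte, nothing is lost
assert as_text.encode('latin-1') == file_data  # encodes back to the identical bytes
Note that such a string is only a carrier for the raw bytes, not meaningful text.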
Addendum
If your question is "how do I concatenate a byte string?", here is one approach:
import io

buffer = io.BytesIO()
with buffer as f:
    f.write(b"\r\n")
    f.write(file_data)
    f.write(b"\r\n")
    print(buffer.getvalue())  # call getvalue() before the with block closes the buffer
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
Equivalently:
buffer = b""
buffer += b"\r\n"
buffer += file_data
buffer += b"\r\n"
buffer
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
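Applied to the snippet from the question, the StringIO simply becomes a BytesIO and the extra characters become bytes literals (a minimal sketch, assuming Sample_image.jpg exists in the working directory):
import io

img_buf = io.BytesIO()
with open("Sample_image.jpg", "rb") as f:
    file_data = f.read()
# concatenate bytes with bytes instead of str with bytes
img_buf.write(b'\r\n' + file_data + b'\r\n')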

Related

Save bytes in a .txt and read out as bytes later

I have written a small Python script which encrypts a message with RSA.
Now I want to save the bytes in a .txt file to read them out later.
But when I use str(...) on it I don't know how to convert the string back.
For example I encrypted "Test" to b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'
and saved it as a string.
When I apply bytes(...) to it I get the error: TypeError: string argument without an encoding.
What can I do to make this work?
You've saved the Python string representation of a binary byte array (bytestring).
To get the actual bytes back from such a representation, pass it through ast.literal_eval():
>>> import ast
>>> s = r"b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'"
>>> b = ast.literal_eval(s)
b'Y\xf8\xbc\xca\x14\x0f\x80\xd3\xc6\xce\xecE\x14\xc1\xaf\xbd\x82\xd24\xcf\x04\xe2\x9a\x81NF\xbeXi\x85\xef\xc4\xbbl\xd3(5\x80\xe4\xde3\x8eC\xd2jR*\xb7.gq\x8c\x8b\xa12\x1a\x10+\xbf\xefHZ\n/'
Better yet, just save the binary bytes to your file without passing through a string:
encrypted_bytes = my_rsa("Test")
with open("encrypted.bin", "wb") as f:
f.write(encrypted_bytes)
# ...
with open("encrypted.bin", "rb") as f:
encrypted_bytes = f.read()
If you really want a "text-safe" format for those bytes, use base64.b64encode() and base64.b64decode().
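A minimal sketch of that base64 approach, reusing encrypted_bytes from above (the file name encrypted.txt is just an example); the ciphertext is stored as plain ASCII text:
import base64

encoded = base64.b64encode(encrypted_bytes).decode("ascii")  # text-safe representation
with open("encrypted.txt", "w") as f:
    f.write(encoded)

with open("encrypted.txt") as f:
    restored = base64.b64decode(f.read())  # back to the original bytes
assert restored == encrypted_bytes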

Convert Hex Encoded GZIP string back to uncompressed string

I'm having trouble converting a compressed, hex-encoded string back into its original format, without introducing numerous / seemingly erroneous backslashes + unconverted unicode characters.
The code I'm using to do this process is:
import gzip
from io import StringIO, BytesIO
def string_to_bytes(input_str: str) -> bytes:
    """
    Read the given string, encode it in utf-8, gzip compress
    the data and return it as a byte array.
    """
    bio = BytesIO()
    bio.write(input_str.encode("utf-8"))
    bio.seek(0)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')
    while True:  # until EOF
        chunk = bio.read(8192)
        if not chunk:  # EOF?
            compressor.close()
            return stream.getvalue()
        compressor.write(chunk)

def bytes_to_string(input_bytes: bytes) -> str:
    """
    Decompress the given byte array (which must be valid
    compressed gzip data) and return the decoded text (utf-8).
    """
    bio = BytesIO()
    stream = BytesIO(input_bytes)
    decompressor = gzip.GzipFile(fileobj=stream, mode='r')
    while True:  # until EOF
        chunk = decompressor.read(8192)
        if not chunk:
            decompressor.close()
            bio.seek(0)
            return bio.read().decode("utf-8")
        bio.write(chunk)
    return None
In the script I'm running, input_string gets compressed and saved as hex with:
saved_hex = string_to_bytes(input_string).hex()
This gets stored as a BINARY datatype in a Snowflake database (using the HEX binary format).
This gets loaded out from there like so:
hex_bytes = bytes.fromhex(hex_html)
html_string = bytes_to_string(hex_bytes)
And the results are coming out like:
href\\\\\\\\u003d\\\\\\\\\\\\x22https://www.google.com/advanced_search\\\\\\\\\\\\x22 target\\\\\\\\u003d\\\\\\\\\\\\x22_blank\\\\\\\\\\\\x22\\\\\\\\u003eadvanced search\\\\\\\\u003c/a\\\\\\\\u003e to find results...
There are multiple backslashes which I'm unable to convert back to a single backslash (in the case of the unicode escapes) or remove entirely.
Is there any way to more efficiently:
Gzip compress the string
Convert to Hex
Decode the hex + decompress - without adding any of these weird unconverted unicode characters?
Thank you all for the answers - foolishly I realised that:
I was adding an additional json.dumps() to the input string (encoding it as a JSON string a second time and adding all the additional backslashes).
Snowflake stores the data as binary, so it must be converted back to a hex string with TO_VARCHAR(saved_hex_data) before bytes_to_string(bytes.fromhex(output_string)) can be called on it.
At which point everything is preserved as before. Many thanks again.
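For reference, the same round trip can be written much more compactly with gzip.compress()/gzip.decompress() and bytes.hex()/bytes.fromhex(); this is a minimal sketch, not the original functions:
import gzip

def string_to_hex(input_str: str) -> str:
    # utf-8 encode, gzip compress, then render the compressed bytes as hex
    return gzip.compress(input_str.encode("utf-8")).hex()

def hex_to_string(hex_str: str) -> str:
    # reverse the steps: hex -> bytes -> decompress -> decode
    return gzip.decompress(bytes.fromhex(hex_str)).decode("utf-8")

original = '<a href="https://www.google.com/advanced_search">advanced search</a>'
assert hex_to_string(string_to_hex(original)) == original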

How not to decode escaped sequences when reading from file but keep the string representation

I am reading in a text file that contains lines with binary data dumped in an encoded fashion, but still as a string (at least in emacs):
E.g.:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
This is perfectly fine for me, and when I read in that file I want to keep this string and not decode or change it in any way. However, when I am reading in the file, Python does the decoding. How can I prevent that?
with open("/path/to/file") as file:
for line in file:
print line
the output will look like:
'���k���G�r��#�\0320^��\021�C\035\000�\016ׁ��'
but should look like:
\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207\016\327\201\360\242
Edit: However, this encoded data is not the only content of the file; it is part of a larger text dump.
You can read the file as binary with the 'rb' option and it will retain the data as it is.
EX:
with open(PathToFile, 'rb') as file:
    raw_binary_data = file.read()
print(raw_binary_data)
If you really want the octal representation you can define a function that prints it back out.
import string

def octal_print(s):
    print(''.join(map(lambda x: x if x in string.printable else '\\' + oct(ord(x))[2:], s)))

s = '\240\263\205k\347\301\360G\224\217yr\335\355#\333\0320^\242\367\021\227C\035\000\207'
octal_print(s)
# prints:
# \240\263\205k\347\301\360G\224\217yr\335\355#\333\320^\242\367\21\227C\35\0\207
Based on the answer of James, I adapted the octal_print function to discriminate between actual octals and innocent characters.
def octal_print(s):
    charlist = list()
    for character in s:
        try:
            character.decode('ascii')
            charlist.append(character)
        except:
            charlist.append('\\' + oct(ord(character))[1:])
    return ''.join(charlist)

Python 3 and base64 encoding of a binary file

I'm new to Python and I do have an issue that is bothering me.
I use the following code to get a base64 string representation of my zip file.
with open( "C:\\Users\\Mario\\Downloads\\exportTest1.zip",'rb' ) as file:
zipContents = file.read()
encodedZip = base64.encodestring(zipContents)
Now, if I output the string it is wrapped in a b'' representation, which for me is not necessary and I would like to avoid. It also adds a newline character every 76 characters, which is another issue. Is there a way to get the binary content and represent it without the newline characters and the leading and trailing b'' markers?
Just for comparison, if I do the following in PowerShell:
$fileName = "C:\Users\Mario\Downloads\exportTest1.zip"
$fileContentBytes = [System.IO.File]::ReadAllBytes($fileName)
$fileContentEncoded = [System.Convert]::ToBase64String($fileContentBytes)
I do get the exact string I'm looking for, no b'' and no \n every 76 chars.
From the base64 module documentation:
base64.encodestring (a deprecated alias of base64.encodebytes, removed in Python 3.9):
"Encode the bytes-like object s, which can contain arbitrary binary data, and return bytes containing the base64-encoded data, with newlines (b"\n") inserted after every 76 bytes of output, and ensuring that there is a trailing newline, as per RFC 2045 (MIME)."
You want to use
base64.b64encode:
"Encode the bytes-like object s using Base64 and return the encoded bytes."
Example:
import base64

with open("test.zip", "rb") as f:
    encodedZip = base64.b64encode(f.read())
    print(encodedZip.decode())
The decode() call converts the bytes object to a text string.
Use b64encode to encode without the newlines and then decode the resulting binary string with .decode('ascii') to get a normal string.
encodedZip = base64.b64encode(zipContents).decode('ascii')
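For completeness, base64.b64decode() reverses the conversion (it accepts either str or bytes), so you can verify the round trip; a minimal sketch reusing zipContents from above:
import base64

encodedZip = base64.b64encode(zipContents).decode('ascii')  # plain str: no b'' wrapper, no newlines
assert base64.b64decode(encodedZip) == zipContents          # decodes back to the original bytes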

Converting from utf-16 to utf-8 in Python 3

I'm programming in Python 3 and I'm having a small problem which I can't find any reference to it on the net.
As far as I understand, the default string encoding is UTF-16, but I must work with UTF-8, and I can't find the command that will convert from the default one to UTF-8.
I'd appreciate your help very much.
In Python 3 there are two different datatypes that matter when you are working with string manipulation. First there is the str class, an object that represents unicode code points. The important thing to grasp is that this string is not a bunch of bytes, but really a sequence of characters. Secondly, there is the bytes class, which is just a sequence of bytes, often representing a string stored in an encoding (like utf-8 or iso-8859-15).
What does this mean for you? As far as I understand you want to read and write utf-8 files. Let's make a program that replaces all 'ć' with 'ç' characters
def main():
    # Let's first open an output file. We give an encoding to let Python know that
    # whatever we print to the file should be encoded as utf-8.
    with open('output_file', 'w', encoding='utf-8') as out_file:
        # Read every line. We give open() the encoding so it will return a Unicode string.
        for line in open('input_file', encoding='utf-8'):
            # Replace the characters we want. A string defined in Python 3 is automatically
            # a unicode string, so there is nothing to worry about there. Because we opened
            # the output file with the utf-8 encoding, the print call will encode the whole
            # string to utf-8. (end='' avoids doubling the newline already present in line.)
            print(line.replace('ć', 'ç'), end='', file=out_file)
So when should you use bytes? Not often. An example I could think of would be when you read something from a socket. If you have the data in a bytes object, you can make it a unicode string with bytes.decode('encoding') and vice versa with str.encode('encoding'). But as said, you probably won't need it.
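A minimal round-trip sketch of that (the payload here is just an invented example):
payload = 'naïve café'.encode('utf-8')   # bytes, e.g. as received from a socket
text = payload.decode('utf-8')           # bytes -> str
assert text.encode('utf-8') == payload   # str -> bytes, identical to the original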
Still, because it is interesting, here the hard way, where you encode everything yourself:
def main():
    # Open the file in binary mode, so we are going to write bytes to it instead of strings.
    with open('output_file', 'wb') as out_file:
        # Read every line. Again, we open it binary, so we get bytes.
        for line_bytes in open('input_file', 'rb'):
            # Convert the bytes to a string.
            line_string = line_bytes.decode('utf-8')
            # Replace the characters we want.
            line_string = line_string.replace('ć', 'ç')
            # Encode the string back to bytes.
            out_bytes = line_string.encode('utf-8')
            # Write the bytes to the output file.
            out_file.write(out_bytes)
Good reading about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Really recommended read!
Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(P.S. As you can see, I didn't mention utf-16 in this post. I actually don't know whether Python uses it as its internal encoding or not, but it is totally irrelevant: the moment you are working with a string, you work with characters (code points), not bytes.)
