Python 3 and base64 encoding of a binary file - python

I'm new to Python and I do have an issue that is bothering me.
I use the following code to get a base64 string representation of my zip file.
with open( "C:\\Users\\Mario\\Downloads\\exportTest1.zip",'rb' ) as file:
zipContents = file.read()
encodedZip = base64.encodestring(zipContents)
Now, if I output the string it is contained inside a b'' representation. This for me is not necessary and I would like to avoid it. Also it adds a newlines character every 76 characters which is another issue. Is there a way to get the binary content and represent it without the newline characters and trailing and leading b''?
Just for comparison, if I do the following in PowerShell:
$fileName = "C:\Users\Mario\Downloads\exportTest1.zip"
$fileContentBytes = [System.IO.File]::ReadAllBytes($fileName)
$fileContentEncoded = [System.Convert]::ToBase64String($fileContentBytes)
I do get the exact string I'm looking for, no b'' and no \n every 76 chars.

From the base64 package doc:
base64.encodestring:
"Encode the bytes-like object s, which can contain arbitrary binary data, and return bytes containing the base64-encoded data, with newlines (b"\n") inserted after every 76 bytes of output, and ensuring that there is a trailing newline, as per RFC 2045 (MIME)."
You want to use
base64.b64encode:
"Encode the bytes-like object s using Base64 and return the encoded bytes."
Example:
import base64
with open("test.zip", "rb") as f:
encodedZip = base64.b64encode(f.read())
print(encodedZip.decode())
The decode() will convert the binary string to text.

Use b64encode to encode without the newlines and then decode the resulting binary string with .decode('ascii') to get a normal string.
encodedZip = base64.b64encode(zipContents).decode('ascii')

Related

Reading bytes from file python And converting to String

I have a file including some data like : \xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b
How do i read this and write the string format(አማርኛ) in another file? And also vice versa?
[\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b == አማርኛ ]
That is a byte string, so you need to decode it to a utf-8 Unicode string.
b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'.decode('utf8')
result: 'አማርኛ'
And to encode it back to byte string:
'አማርኛ'.encode()
result: b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
Basicly you have a byte string, you can do what you are talking about with the functions encode() and decode() respectively, in the example below, i will start by printing the byte string. And then i'm taking the byte string and decoding it to utf-8 (default value in all python versions above 2.7 if you don't specify a version yourself)
f = open("input.txt","rb")
x = f.read()
print(x) # b'\xe1\x8a\xa0\xe1\x88\x9b\xe1\x88\xad\xe1\x8a\x9b'
print(x.decode()) # አማርኛ
If you want to do the inverse operation, you can achieve this by just encoding back the decoded byte array! (Do note that the open function i'm using the arguments "rb" that stands for (following the wiki) "Opens a file for reading only in binary format."

Base64 encoding issue in Python

I need to save a params file in python and this params file contains some parameters that I won't leave on plain text, so I codify the entire file to base64 (I know that this isn't the most secure encoding of the world but it works for the kind of data that I need to use).
With the encoding, everything works well. I encode the content of my file (a simply txt with a proper extension) and save the file. The problem comes with the decode. I print the text coded before save the file and the text coded from the file saved and there are exactly the same, but for a reason I don't know, the decode of the text of the file saved returns me this error UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8d in position 1: invalid start byte and the decode of the text before save the file works well.
Any idea to resolve this issue?
This is my code, I have tried converting all to bytes, to string, and everything...
params = open('params.bpr','r').read()
paramsencoded = base64.b64encode(bytes(params,'utf-8'))
print(paramsencoded)
paramsdecoded = str(base64.b64decode(str(paramsencoded,'utf-8')),'utf-8')
newparams = open('paramsencoded.bpr','w+',encoding='utf-8')
newparams.write(str(paramsencoded))
newparams.close()
params2 = open('paramsencoded.bpr',encoding='utf-8').read()
print(params2)
paramsdecoded = str(base64.b64decode(str(paramsencoded,'utf-8')),'utf-8')
paramsdecoded = base64.b64decode(str(params2))
print(str(paramsdecoded,'utf-8'))
Your error lies in your handling of the bytes object returned by base64.b64encode(), you called str() on the object:
newparams.write(str(paramsencoded))
That doesn't decode the bytes object:
>>> bytesvalue = b'abc='
>>> str(bytesvalue)
"b'abc='"
Note the b'...' notation. You produced the representation of the bytes object, which is a string containing Python syntax that can reproduce the value for debugging purposes (you can copy that string value and paste it into Python to re-create the same bytes value).
This may not be that easy to notice at first, as base64.b64encode() otherwise only produces output with printable ASCII bytes.
But your decoding problem originates from there, because when decoding the value read back from the file includes the b' characters at the start. Those first two characters are interpreted as Base64 data too; the b is a valid Base64 character, and the ' is ignored by the parser:
>>> bytesvalue = b'hello world'
>>> base64.b64encode(bytesvalue)
b'aGVsbG8gd29ybGQ='
>>> str(base64.b64encode(bytesvalue))
"b'aGVsbG8gd29ybGQ='"
>>> base64.b64decode(str(base64.b64encode(bytesvalue))) # with str()
b'm\xa1\x95\xb1\xb1\xbc\x81\xdd\xbd\xc9\xb1\x90'
>>> base64.b64decode(base64.b64encode(bytesvalue)) # without str()
b'hello world'
Note how the output is completely different, because the Base64 decoding is now starting from the wrong place, as b is the first 6 bits of the first byte (making the first decoded byte a 6C, 6D, 6E or 6F bytes, so m,n, o or p ASCII).
You could properly decode the value (using paramsencoded.decode('ascii') or str(paramsencoded, 'ascii')) but you should't treat any of this data as text.
Instead, open your files in binary mode. Reading and writing then operates with bytes objects, and the base64.b64encode() and base64.b64decode() functions also operate on bytes, making for a perfect match:
with open('params.bpr', 'rb') as params_source:
params = params_source.read() # bytes object
params_encoded = base64.b64encode(params)
print(params_encoded.decode('ascii')) # base64 data is always ASCII data
params_decoded = base64.b64decode(params_encoded)
with open('paramsencoded.bpr', 'wb') as new_params:
newparams.write(params_encoded) # write binary data
with open('paramsencoded.bpr', 'rb') as new_params:
params_written = new_params.read()
print(params_written.decode('ascii')) # still Base64 data, so decode as ASCII
params_decoded = base64.b64decode(params_written) # decode the bytes value
print(params_decoded.decode('utf8')) # assuming the original source was UTF-8
I explicitly use bytes.decode(codec) rather than str(..., codec) to avoid accidental str(...) calls.

How to convert bytes data to string without changing data using python3

How can I convert bytes to string without changing data ?
E.g
Input:
file_data = b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
Output:
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
I want to write an image data using StringIO with some additional data, Below is my code snippet,
img_buf = StringIO()
f = open("Sample_image.jpg", "rb")
file_data = f.read()
img_buf.write('\r\n' + file_data + '\r\n')
This works fine with python 2.7 but I want it to be working with python 3.4.
on read operation file_data = f.read() returns bytes object data something like this
b'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
While writting data using img_buf it accepts only String data, so unable to write file_data with some additional characters.
So I want to convert file_data as it is in String object without changing its data. Something like this
'\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
so that I can concat and write the image data.
I don't want to decode or encode data. Any suggestions would be helpful for me. thanks in advance.
It is not clear what kind of output you desire. If you are interested in aesthetically translating bytes to a string representation without encoding:
s = str(file_data)[1:]
print(s)
# '\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff'
This is the informal string representation of the original byte string (no conversion).
Details
The official string representation looks like this:
s
# "'\\xb4\\xeb7s\\x14q[\\xc4\\xbb\\x8e\\xd4\\xe0\\x01\\xec+\\x8f\\xf8c\\xff\\x00 \\xeb\\xff'"
String representation handles how a string looks. Double escape characters and double quotes are implicitly interpreted in Python to do the right thing so that the print function outputs a formatted string.
String intrepretation handles what a string means. Each block of characters means something different depending on the applied encoding. Here we interpret these blocks of characters (e.g. \\xb4, \\xeb, 7, s) with the UTF-8 encoding. Blocks unrecognized by this encoding are replaced with a default character, �:
file_data.decode("utf-8", "replace")
# '��7s\x14q[Ļ���\x01�+��c�\x00 ��'
Converting from bytes to strings is required for reliably working with strings.
In short, there is a difference in string output between how it looks (representation) and what it means (interpretation). Clarify which you prefer and proceed accordingly.
Addendum
If your question is "how do I concatenate a byte string?", here is one approach:
buffer = io.BytesIO()
with buffer as f:
f.write(b"\r\n")
f.write(file_data)
f.write(b"\r\n")
print(buffer.getvalue())
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'
Equivalently:
buffer = b""
buffer += b"\r\n"
buffer += file_data
buffer += b"\r\n"
buffer
# b'\r\n\xb4\xeb7s\x14q[\xc4\xbb\x8e\xd4\xe0\x01\xec+\x8f\xf8c\xff\x00 \xeb\xff\r\n'

Python - writing a two byte string as a single byte hex character to a binary file

I'm writing a python script that opens a file and reads a two character string value e.g. 32 or 6f. It is then supposed to write that value as a single byte to a binary file as hex e.g. 0x32 or 0x6f.
file_in = open('32.xml')
file_contents = file_in.read()
file_in.close()
file_out = open('testfile', 'wb')
file_out.write(file_contents)
file_out.close()
In this example, 32.xml is a plain text file that contains the string '32'. But the contents of the testfile output file are '32' instead of 0x32 (or just 2).
I've tried all kinds of variations on the write command. I tried the chr() function but that requires converting the string to an int.
file_out.write(chr(int(file_contents)))
That ended up writing the hex value of the string, not what I wanted. It also failed as soon as you had a value containing a-f.
I also tried
file_out.write('\x' + file_contents)
but the python interpreter didn't like that.
You need to interpret the original string as a hexadecimal integer. Hexadecimal is a base-16 notation, so add 16 to the int() call:
file_out.write(chr(int(file_contents, 16)))

Converting from utf-16 to utf-8 in Python 3

I'm programming in Python 3 and I'm having a small problem which I can't find any reference to it on the net.
As far as I understand the default string in is utf-16, but I must work with utf-8, I can't find the command that will convert from the default one to utf-8.
I'd appreciate your help very much.
In Python 3 there are two different datatypes important when you are working with string manipulation. First there is the string class, an object that represents unicode code points. Important to get is that this string is not some bytes, but really a sequence of characters. Secondly, there is the bytes class, which is just a sequence of bytes, often representing an string stored in an encoding (like utf-8 or iso-8859-15).
What does this mean for you? As far as I understand you want to read and write utf-8 files. Let's make a program that replaces all 'ć' with 'ç' characters
def main():
# Let's first open an output file. See how we give an encoding to let python know, that when we print something to the file, it should be encoded as utf-8
with open('output_file', 'w', encoding='utf-8') as out_file:
# read every line. We give open() the encoding so it will return a Unicode string.
for line in open('input_file', encoding='utf-8'):
#Replace the characters we want. When you define a string in python it also is automatically a unicode string. No worries about encoding there. Because we opened the file with the utf-8 encoding, the print statement will encode the whole string to utf-8.
print(line.replace('ć', 'ç'), out_file)
So when should you use bytes? Not often. An example I could think of would be when you read something from a socket. If you have this in an bytes object, you could make it a unicode string by doing bytes.decode('encoding') and visa versa with str.encode('encoding'). But as said, probably you won't need it.
Still, because it is interesting, here the hard way, where you encode everything yourself:
def main():
# Open the file in binary mode. So we are going to write bytes to it instead of strings
with open('output_file', 'wb') as out_file:
# read every line. Again, we open it binary, so we get bytes
for line_bytes in open('input_file', 'rb'):
#Convert the bytes to a string
line_string = bytes.decode('utf-8')
#Replace the characters we want.
line_string = line_string.replace('ć', 'ç')
#Make a bytes to print
out_bytes = line_string.encode('utf-8')
#Print the bytes
print(out_bytes, out_file)
Good reading about this topic (string encodings) is http://www.joelonsoftware.com/articles/Unicode.html. Really recommended read!
Source: http://docs.python.org/release/3.0.1/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit
(P.S. As you see, I didn't mention utf-16 in this post. I actually don't know whether python uses this as internal decoding or not, but it is totally irrelevant. At the moment you are working with a string, you work with characters (code points), not bytes.

Categories