Base64 decode does not work every time in Python


I have to decode a string. This one comes from Flash AS3 and I want to decode it in Python. I don't have any problems with PHP, but I cannot decode the following string with Python 2.6 'base64.b64decode'.
f3hvQgQaBFp9IC4NQhYZQiAhNhxBAkwIJC0pDR8fBl12ZjkWXwMEWn57bU0dGgBfcWdsTwAbGB4xLmVLAh0FXXd5a0gGHQRWdy5iQANNVAl/KmNLAhUBXyV8PkFQHwNefntjGgpPU18nK21OURtSC35wPE4FHFUJdi4/TlMUVFwlez9JVxtVDH0TB0IGHAc%Pr
Python raises "TypeError: Incorrect padding". The string seems to have superfluous characters at the end (from the '%' onward). But why doesn't the Python base64 library handle this?
Thank you for your answer.

It seems to me you are not feeding a valid string to the function - it tells you so. You can't expect a function to "guess" what you wanted, and base its response on that. You have to use a valid parameter, or the function doesn't work.
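If the trailing junk is expected and you want behaviour closer to a tolerant decoder, one option is to clean the input yourself before decoding. The sketch below is only an illustration: lenient_b64decode is a hypothetical helper, not part of the base64 module, and it assumes every character outside the standard alphabet is junk. It cannot tell whether alphabet characters that follow the junk (such as the 'Pr' after the '%' above) are real data.

```python
import base64
import re


def lenient_b64decode(text):
    """Decode base64 text after dropping junk characters and repairing padding.

    Hypothetical helper for illustration; not part of the standard library.
    """
    # Keep only characters from the standard base64 alphabet.
    cleaned = re.sub(r'[^A-Za-z0-9+/]', '', text)
    # A length of 4n + 1 can never be valid; drop the dangling character.
    if len(cleaned) % 4 == 1:
        cleaned = cleaned[:-1]
    # Pad with '=' up to the next multiple of 4.
    cleaned += '=' * (-len(cleaned) % 4)
    return base64.b64decode(cleaned)


print(lenient_b64decode('U3RhY2sgT3ZlcmZsb3c%'))
```

Silently repairing input can hide real corruption, so this is probably best reserved for sources you know to be sloppy.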

Related

Is there a way to decode bytes inside a string object in Python? [duplicate]

This question already has an answer here:
Converting python string into bytes directly without eval()
(1 answer)
Closed 2 years ago.
Let me be more clear.
I'm receiving a string in Python like this:
file = "b'x\\x9c\\xb4'"
The type of file is str. But inside that string you can see the format of a <class 'bytes'>. It was the result of calling str(file) once file was already encoded. I would like to decode it, but I don't know how to decode the bytes inside a string object.
My question is: is there a way to interpret file as bytes instead of str without having to call something like bytes(file, 'utf-8') or file.encode('utf-8')? The problem with these methods is that I would encode the already encoded bytes, as I stated before.
Why do I need that?
I'm building an API and I need to send back a significantly big string as a JSON value. Since there was plenty of room to compress it, I ended up using zlib:
import zlib
file = BIG_STRING
file_compressed = zlib.compress(BIG_STRING.encode('utf-8')) # Since zlib expects a bytes object
send_back({"SOME_BIG_STRING": str(file_compressed)})
I'm sending it back as a string because I can't send it back as a bytes object; it doesn't support that. And if I try to decode the compressed data before sending, I get an error:
send_back({"SOME_BIG_STRING": file_compressed.decode('utf-8')})
-> UnicodeDecodeError: 'utf-8' codec can't decode byte 0x9c in position 1: invalid start byte
And when I receive the same string later in the program, I find myself stuck on the problem described initially.
I don't know enough right now to work around this and couldn't find an answer. I'd be extremely grateful if anyone could help me!

As a last resort, you can call eval("b'x\\x9c\\xb4'") and get b'x\x9c\xb4' back if you don't find any other solution. But eval executes arbitrary code, so using it on data received from elsewhere is bad practice.
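A safer sketch of the same idea uses ast.literal_eval, which only parses literals and does not run code. The second half shows one way to avoid the problem altogether, assuming you control both ends: base64 turns the compressed bytes into ASCII text that JSON can carry. The names big_string and payload are placeholders, and json.dumps stands in for the question's send_back.

```python
import ast
import base64
import json
import zlib

# Recovering bytes from their str() form without executing arbitrary code.
received = "b'x\\x9c\\xb4'"
recovered = ast.literal_eval(received)  # parses literals only, unlike eval
print(recovered)

# Avoiding the problem: base64 makes the compressed bytes JSON-safe text.
big_string = "some repetitive text " * 100  # placeholder for BIG_STRING
compressed = zlib.compress(big_string.encode("utf-8"))
payload = json.dumps(
    {"SOME_BIG_STRING": base64.b64encode(compressed).decode("ascii")}
)

# Receiving side: reverse each step in the opposite order.
field = json.loads(payload)["SOME_BIG_STRING"]
restored = zlib.decompress(base64.b64decode(field)).decode("utf-8")
```

Base64 adds roughly a third to the compressed size, so whether this is worthwhile depends on how well the data compresses.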

Convert an input to string

I am using the lib presented in https://github.com/cmcqueen/cobs-python to make sure I can send data over a serial line.
However, this lib requires a string as input, which can be about anything (text files, images, etc...). Since I'm using serial line without any special features on, there is no need to worry about special characters triggering some events.
If I could, I would send for example the image in raw mode, since it does not matter how the data is passed to the other end, but I need to encode it with this lib.
I have tried the following:
data = open('img.jpg', 'rb')
cobs_packet = cobs.encode(''.join(data).encode('utf-8'))
This gives me the following error:
>>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
The problem is, if I use different encoding types, the data length is changed, and that can't happen for what I'm trying to do.
Isn't there any way to simply convert the input to string as-is?
EDIT: I'm using python version 2.7

I haven't tested how far this holds, but I have found a simple solution that seems to work: wrapping the data in bytes() (in Python 2.7, bytes is just an alias for str) gives something the lib in question accepts as a string.
Thanks for the help everyone. Cheers.
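For reference, a minimal sketch of reading a file as raw bytes, with no text encoding involved, so the length is preserved exactly. The temporary file stands in for img.jpg, and the commented-out cobs.encode call is an assumption about the library (that it accepts a byte string) rather than something tested here.

```python
import os
import tempfile


def read_raw(path):
    """Return the file contents as an unmodified byte string."""
    with open(path, 'rb') as handle:  # 'rb': binary mode, no decoding
        return handle.read()


# Stand-in for img.jpg: a few bytes that are not valid ASCII or UTF-8.
fd, path = tempfile.mkstemp()
os.write(fd, b'\xff\xd8\xff\xe0')
os.close(fd)

raw = read_raw(path)
os.remove(path)
print(len(raw))

# Assumption: the COBS library takes the byte string as-is, e.g.
# cobs_packet = cobs.encode(raw)
```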

Python & fql: getting "Dami\u00e1n" instead of "Damián"

I created a file containing a dictionary with data written in Spanish (i.e. Damián, etc.):
fileNameX.write(json.dumps(dictionaryX, indent=4))
The data come from some fql fetching operations, i.e.:
select name from user where uid in XXX
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n".
I've tried some options:
ensure_ascii=False:
fileNameX.write(json.dumps(dictionaryX, indent=4, ensure_ascii=False))
But I get an error (UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position XXX: ordinal not in range(128)).
encode(encoding='latin-1'):
dictionaryX.append({
'name': unicodeVar.encode(encoding='latin-1'),
...
})
But I get another error (UnicodeDecodeError: 'utf8' codec can't decode byte 0xe1 in position XXX: invalid continuation byte)
To sum up, I've tried several possibilities, but have less than a clue. I'm lost. Please, I need help. Thanks!

You have many options, and have stumbled upon something rather complicated that depends on your Python version and which you absolutely must understand fully in order to write correct code. Generally the approach taken in 3.x is stricter and a bit harder to work with, but it is much less likely that you will make a mistake or get yourself into a complicated situation. (Based on the exact symptoms you report, you seem to be using 2.x.)
json.dumps has different behaviour in 2.x and 3.x. In 2.x, it produces a str, which is a byte-string (unknown encoding). In 3.x, it still produces a str, but now str in 3.x is a proper Unicode string.
JSON is inherently a Unicode-supporting format, but it expects files to be in UTF-8 encoding. However, please understand that JSON supports \u style escapes in strings. When you read in this data, you will get the correct decoded string back. The reading code produces Unicode strings (unicode in 2.x, str in 3.x) when it reads strings out of the JSON.
When I open the file, I find that, for instance, "Damián" looks like "Dami\u00e1n"
á cannot be represented in ASCII. It gets encoded as \u00e1 by default, to avoid the other problems you had. This happens even in 3.x.
ensure_ascii=False
This disables the previous encoding. In 2.x, it means you get a unicode object instead - a real Unicode object, preserving the original á character. In 3.x, it means that the character is not explicitly translated. But either way, ensure_ascii=False means that json.dumps will give you a Unicode string.
Unicode strings must be encoded to be written to a file. There is no such thing as "unicode data"; Unicode is an abstraction. In 2.x, this encoding is implicitly 'ascii' when you feed a Unicode object to file.write; it was expecting a str. To get around this, you can use the codecs module, or explicitly encode as 'utf-8' before writing. In 3.x, the encoding is set with the encoding keyword argument when you open the file (the default is again probably not what you want).
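A short sketch of the difference described so far, written for Python 3. Both forms are valid JSON and decode to the same data; only the text representation differs.

```python
import json

data = {"name": "Dami\u00e1n"}

escaped = json.dumps(data)                      # non-ASCII becomes \uXXXX
literal = json.dumps(data, ensure_ascii=False)  # non-ASCII is kept as-is

print(escaped)
print(literal)

# Reading either form back yields the original string.
assert json.loads(escaped) == json.loads(literal) == data
```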
encode(encoding='latin-1')
Here, you are encoding before producing the dictionary, so that you have a str object in your data. Now a problem occurs because when there are str objects in your data, the JSON encoder assumes, by default, that they represent Unicode strings in UTF-8 encoding. This can be changed, in 2.x, using the encoding keyword argument to json.dumps. (In 3.x, the encoder will simply refuse to serialize bytes objects, i.e. non-Unicode strings!)
However, if your goal is simply to get the data into the file directly, then json.dumps is the wrong tool for you. Have you wondered what that s in the name is for? It stands for "string"; this is the special case. The ordinary case, in fact, is writing directly to a file! (Instead of giving you a string and expecting you to write it yourself.) Which is what json.dump (no 's') does. Again, the JSON standard expects UTF-8 encoding, and again 2.x has an encoding keyword parameter that defaults to UTF-8 (you should leave this alone).
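As a rough Python 3 illustration of writing straight to a file, the sketch below opens the file with an explicit UTF-8 encoding and lets json.dump do the writing. The file name names.json and the sample data are made up for the example; on Python 2 the details differ, as discussed above.

```python
import io
import json
import os
import tempfile

data = [{"name": "Dami\u00e1n"}]
path = os.path.join(tempfile.mkdtemp(), "names.json")  # hypothetical file

# The encoding is chosen when the file is opened; json.dump writes text.
with io.open(path, "w", encoding="utf-8") as handle:
    json.dump(data, handle, indent=4, ensure_ascii=False)

# Read it back with the same encoding to check the round trip.
with io.open(path, "r", encoding="utf-8") as handle:
    text = handle.read()

print(text)
```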

Use codecs.open() to open fileNameX with a specific encoding like encoding='utf-8' for example instead of using open().
Also, use json.dump() to write directly to the file.

Since the string has a \u inside it that means it's a Unicode string. The string is actually correct! Your problem lies in displaying the string. If you print the string, Python's output encoding should print it in the proper encoding for your environment.
For example, this is what I get inside IDLE on Windows:
>>> print u'Dami\u00e1n'
Damián

Decode base64 string in python 3 (with lxml or not)

I know this looks embarrassingly easy, and I guess the problem is that I just don't have a clear understanding of all this bytes-str-unicode (and encoding-decoding, speaking frankly) stuff yet.
I've been trying to get my working code to run on Python 3. The part I'm stuck with is when I parse an XML with lxml and decode a base64 string that is in that XML.
The code now works in the following manner:
I retrieve the binary data with an XPath query '.../binary/text()'. This produces a one-element list containing a lxml.etree._ElementUnicodeResult object. Then, with python 2, I was able to do:
decoded = source.decode('base64')
and finally
output = numpy.frombuffer(decoded)
However, on python 3 I get an error message saying
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'decode'
This is not so surprising, because lxml.etree._ElementUnicodeResult is a subclass of str.
Another way would be to get a real str with the same data in it with
binary = tree.xpath('//binary')[0]
binary_string = binary.text
That would be essentially the same. So what do I do to decode it from base64? I've looked at the base64 module, but it takes a bytes object as an argument, and I can't think of the way to present str as bytes, because if I try to construct a bytes object, Python will try to encode the string, which I don't need.
Googling further, I came across the binascii module (which is invoked indirectly from base64 anyway, if I'm not mistaken), but calling binascii.b2a_base64() on my string produces
TypeError: 'str' does not support the buffer interface
P.S. I've even found an answered question on how to decode a hex string in Python 3, but this is done with a dedicated method bytes.fromhex() so I don't see how it would be helpful.
Could someone please tell me what I'm missing? I'm afraid most of the post is irrelevant and only aggravates my shame, but at least you guys know what I tried.

OK, I think I'm going to summarize my current understanding of things (feel free to correct me). Hopefully it will help someone else out there as confused as I've been.
The credit totally goes to thebjorn and delnan, of course.
So, starting with the most common things:
There's Unicode, and it's a global standard that assigns codes (or code points) to all the exotic characters you can imagine. Those codes are just integer numbers. As of Unicode 6.1 there are 109,975 graphic characters, says Wikipedia.
Then there are encodings that define how to designate Unicode characters with byte codes. One byte isn't enough to designate an arbitrary Unicode char. Although, if you only take a small subset of them (English alphabet, digits, punctuation, some control characters), you can do with one byte per character (or even 7 bits; see ASCII).
To pass a Unicode string anywhere, one needs to encode it in bytes, then it can be decoded on the other end.
In Python 2, str is actually bytes, and unicode is Unicode, but Python 2 will do implicit encoding/decoding for you when needed. It will try to use ASCII encoding.
In Python 3, str is always a Unicode string, and bytes is a new data type for actual bytes. No implicit conversion is ever done by Python 3, you always need to do it yourself and specify the encoding. That means that your program won't work until you understand what's going on, which totally happened to me.
Now, that being more or less clear, let's move on to base64 encoding, which is also an encoding of sorts, but has a slightly different meaning.
Suppose you have some binary data (i.e. bytes) that may mean anything (in my case it's a bunch of floats). Now you want to represent this binary array with a string. That's what base64 encoding means: you have your bytes represented as an ASCII string.
Base64 means 6 bits per character (2^6 = 64), so in a base64-encoded string a single character stands for 6 bits of your data. Four characters carry 24 bits, which is exactly 3 bytes. That is why base64-encoded strings are padded with '=' to a length that is a multiple of 4: otherwise the encoded bits would not add up to a whole number of bytes.
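The arithmetic can be checked with a small sketch. encoded_length is a hypothetical helper written for this illustration; it just rounds the byte count up to whole groups of 3 bytes and multiplies by 4 characters per group.

```python
import base64


def encoded_length(byte_count):
    """Length of padded base64 text for the given number of input bytes."""
    groups = (byte_count + 2) // 3  # round up to whole 3-byte groups
    return 4 * groups               # each group becomes 4 characters


for byte_count in range(1, 7):
    encoded = base64.b64encode(b"x" * byte_count)
    print(byte_count, encoded, len(encoded), encoded_length(byte_count))
```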
Finally, to decode from base64 you need an ASCII string. An arbitrary Unicode string won't do; it may only contain characters from the base64 alphabet. The base64 module does the job in Python. The base64.b64decode() function takes a byte string as the argument. In Python 2 that means str. In Python 3 that means bytes (since Python 3.3 an ASCII-only str is accepted as well, but encoding explicitly works on every version). So if you have a str, such as
>>> s = 'U3RhY2sgT3ZlcmZsb3c='
In Python 2 you could just do
>>> s.decode('base64')
because s is already in ASCII.
In Python 3, you need to encode it in ASCII first, so you'll have to do:
>>> base64.b64decode(s.encode('ascii'))
And by the way, this will return a bytes object, so it's really up to you how to treat those bytes then. Maybe it's my floats, but maybe you should try to decode it as ASCII :)
In Python 2 however it will be just a str. Anyway, have a look at struct for the tools to unpack your data from those bytes.
So if you need the code to work on both Python 2 and 3, go with the last one. To make sure you have Unicode in the end (if you are decoding text from base64), you'll have to decode it:
>>> base64.b64decode(s.encode('ascii')).decode('ascii')
On Python 2, encode('ascii') won't effectively do anything because it's applied to str. So it will do an implicit conversion to Unicode first, and then do what you want (convert it back to ASCII). decode('ascii') will return a unicode object on Python 2.
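Putting it together for binary payloads such as floats, here is a minimal sketch using struct. The little-endian double format ('<d') and the sample values are assumptions for the illustration; the real layout depends on whatever produced the data.

```python
import base64
import struct

# Build a sample payload: three little-endian doubles, base64-encoded as text.
values = (1.5, -2.25, 3.0)
encoded_text = base64.b64encode(struct.pack("<3d", *values)).decode("ascii")

# Decoding side: text -> ASCII bytes -> raw bytes -> floats.
raw = base64.b64decode(encoded_text.encode("ascii"))
count = len(raw) // struct.calcsize("<d")  # 8 bytes per double
decoded = struct.unpack("<%dd" % count, raw)

print(decoded)
```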

I don't have Python 3 installed, but it sounds like you need to convert the Unicode returned from lxml to bytes, perhaps by calling .encode('ascii') ?

django + unicode constant errors

I built a django site last year that utilises both a dashboard and an API for a client.
They are, on occasion, putting unicode information (usually via a Microsoft keyboard and a single quote character!) into the database.
It's fine to change this one instance for everything, but what I constantly get is something like this error when a new character is added that I haven't "converted":
UnicodeDecodeError at /xx/xxxxx/api/xxx.json
'ascii' codec can't decode byte 0xeb in position 0: ordinal not in range(128)
The issue is actually that I need to be able to convert this unicode (from the model) into HTML.
# if a char breaks the system, replace it here (duplicate line)
text = unicode(str(text).replace('\xa3', '£'))
I duplicate this line here, but it just breaks otherwise.
Tearing my hair out because I know this is straightforward and I'm doing something remarkably silly somewhere.
Have searched elsewhere and realised that while my issue is not new, I can't find the answer elsewhere.

I assume that text is unicode (which seems a safe assumption, as \xa3 is the unicode for the £ character).
I'm not sure why you need to encode it at all, seeing as the text will be converted to utf-8 on output in the template, and all browsers are perfectly capable of displaying that. There is likely another point further down the line where something (probably your code, unfortunately) is assuming ASCII, and the implicit conversion is breaking things.
In that case, you could just do this:
text = text.encode('ascii', 'xmlcharrefreplace')
which converts the non-ASCII characters into HTML/XML character references like &#163; (for £).
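A quick sketch of what that call produces, using a made-up sample string. The result is a byte string containing only ASCII, with each non-ASCII character replaced by its numeric character reference.

```python
# Sample text with a pound sign (U+00A3) and an e-acute (U+00E9).
text = u"\xa3100 caf\xe9"

html_safe = text.encode("ascii", "xmlcharrefreplace")
print(html_safe)
```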

Tell the JSON decoder to decode the JSON file as Unicode. When using the json module directly in Python 2, this can be done with the following code:
json.JSONDecoder(encoding='utf8').decode(
json.JSONEncoder(encoding='utf8').encode('blä'))
If the JSON decoding takes place via some other modules (django, ...) maybe you can pass the information through this other module into the json stuff.
