Validate base64 in python

Validate base64 in python - python

I would like to check wether a string is base64 encoded in python. As the built in module is very forgiving, I tried the following
s = b'111='
b64encode(b64decode(s)) == s
To my surprise it returned False. Indeed, b64encode(b64decode(s)) returns b'110='. I expected it to return True as I'm under the impression '111=' is a valid base64 string.
My python version is 3.6.4
$ python --version
Python 3.6.4
Why is this? Can someone explain this?

Since "0" and "1" both only differ in the last 4 bits (and the encoding has 4 padding bits), both are valid encodings of the plaintext. b64encode(), however, will only encode using "0" there.
In order to be robust you will need to calculate how many padding bits and bytes the encoded text should have and then only compare the significant bytes and bits.

Related

Converting integer to 8-bit ASCII characters, NOT Unicode in Python 3

I've been working on a project where I'm encoding numbers as characters. Being used to C++, I assumed I could just use any 8bit number and cast it to a character. However, python's chr() function is returning Unicode characters, which aren't 8-bit, so that will not work.
I am new to Python and, from what I've read, previous versions used to have 2 separate functions: chr() for ASCII characters and unichr() for Unicode characters.
I am also limited to what I can get in the standard python library for windows (we are not allowed to install modules with pip).
This might usually be okay, but here's an example of when this can mess with my program:
If I'm encoding the integer 143:
# this is not taken from my actual code
num = 143
c = chr(143)
print(c)
I would expect this to print the ASCII character (a capital A with a little circle above it). Instead, I get the unicode \x8f, which represents "SS3" (Single Shift 3).
TL;DR: I'm converting 8-bit numbers to characters, but chr() converts to Unicode and I REALLY need a way to convert to ASCII instead, but I can't seem to find it in the standard library.
I know that this is such a simple problem and it's extremely frustrating to be stuck on this of all things.
Thanks a lot in advance!
Have a nice day!
- Vlad

"A with a little circle above it" is not an ASCII character, and 143 is outside the ASCII range (0-127).
It seems you are thinking in terms of the encoded bytes rather than unicode codepoints (which Python3 uses to represent string values). See here for 8 bit encodings where b'\x8f' represents 'Å‎'.
You probably want to do something like this:
import sys
c = 143
# Convert to byte
b = c.to_bytes(1, sys.byteorder)
# Decode to unicode (str) and print
print(b.decode('cp437'))
Å‎
You could also take a look at the struct package in the standard library, which deals with bytes and chars in a more "C-like" fashion.

Python 3.5 base64 decoding seems to be incorrect?

In Python 3.5 the base64 module has a method, standard_b64decode() for decoding strings from base64, which returns a bytes object.
When I run base64.standard_b64decode("wc==") the output is b\xc1. When you base64 encode "\xc1", you get "wQ==". It looks like there is an error in the decoding function. Actually, I think "wc==" is an invalid base64 encoded string, by this reasoning:
wc== ends with ==, which means that it was produced from a single input byte.
The corresponding values of 'w' and 'c' in the regular base64 alphabet are, respectively, 48 and 28, meaning their 6-bit representations are, respectively, 110000 and 011100.
Concatenating these, the first 8 bits are 11000001, which is \xc1, but the remaining bits (1100) are non-zero, so couldn't have been produced by the padding process performed during base64 encoding, as that only appends bits with value 0, which means these extra 1 bits can't have been produced through valid base64 encoding -> the string is not a valid base64 encoded string.
I think this is true for any 4 character chunk of base64 encoding ending in == when any of the last 4 bits of the second character are 1.
I'm pretty convinced that this is right, but I'm rather less experienced than the Python developers.
Can anyone confirm the above, or explain why it's wrong, if indeed it is?

The Base64 standard is defined by RFC 4648. Your question is answered by §3.5:
Canonical Encoding
The padding step in base 64 and base 32 encoding can, if improperly implemented, lead to non-significant alterations of the encoded data. For example, if the input is only one octet for a base 64 encoding, then all six bits of the first symbol are used, but only the first two bits of the next symbol are used. These pad bits MUST be set to zero by conforming encoders, which is described in the descriptions on padding below. If this property do not hold, there is no canonical representation of base-encoded data, and multiple base- encoded strings can be decoded to the same binary data. If this property (and others discussed in this document) holds, a canonical encoding is guaranteed.
In some environments, the alteration is critical and therefore decoders MAY chose to reject an encoding if the pad bits have not been set to zero.
The meaning of MAY is defined by RFC 2119:
MAY This word, or the adjective "OPTIONAL", mean that an item is truly optional. One vendor may choose to include the item because a particular marketplace requires it or because the vendor feels that it enhances the product while another vendor may omit the same item.
So Python is not obliged by the standard to reject non-canonical encodings.

Python Encoding that ignores leading 0s

I'm writing code in python 3.5 that uses hashlib to spit out MD5 encryption for each packet once it is is given a pcap file and the password. I am traversing through the pcap file using pyshark. Currently, the values it is spitting out are not the same as the MD5 encryptions on the packets in the pcap file.
One of the reasons I have attributed this to is that in the hex representation of the packet, the values are represented with leading 0s. Eg: Protocol number is shown as b'06'. But the value I am updating the hashlib variable with is b'6'. And these two values are not the same for same reason:
>> b'06'==b'6'
False
The way I am encoding integers is:
(hex(int(value))[2:]).encode()
I am doing this encoding because otherwise it would result in this error: "TypeError: Unicode-objects must be encoded before hashing"
I was wondering if I could get some help finding a python encoding library that ignores leading 0s or if there was any way to get the inbuilt hex method to ignore the leading 0s.
Thanks!

Hashing b'06' and b'6' gives different results because, in this context, '06' and '6' are different.
The b string prefix in Python tells the Python interpreter to convert each character in the string into a byte. Thus, b'06' will be converted into the two bytes 0x30 0x36, whereas b'6' will be converted into the single byte 0x36. Just as hashing b'a' and b' a' (note the space) produces different results, hashing b'06' and b'6' will similarly produce different results.
If you don't understand why this happens, I recommend looking up how bytes work, both within Python and more generally - Python's handling of bytes has always been a bit counterintuitive, so don't worry if it seems confusing! It's also important to note that the way Python represents bytes has changed between Python 2 and Python 3, so be sure to check which version of Python any information you find is talking about. You can comment here, too,

Decode base64 string in python 3 (with lxml or not)

I know this looks embarrassingly easy, and I guess the problem is that I just don't have a clear understanding of all this bytes-str-unicode (and encoding-decoding, speaking frankly) stuff yet.
I've been trying to get my working code to run on Python 3. The part I'm stuck with is when I parse an XML with lxml and decode a base64 string that is in that XML.
The code now works in the following manner:
I retrieve the binary data with an XPath query '.../binary/text()'. This produces a one-element list containing a lxml.etree._ElementUnicodeResult object. Then, with python 2, I was able to do:
decoded = source.decode('base64')
and finally
output = numpy.frombuffer(decoded)
However, on python 3 I get an error message saying
AttributeError: 'lxml.etree._ElementUnicodeResult' object has no attribute 'decode'
This is not so surprising, because lxml.etree._ElementUnicodeResult is a subclass of str.
Another way would be to get a real str with the same data in it with
binary = tree.xpath('//binary')[0]
binary_string = binary.text
That would be essentially the same. So what do I do to decode it from base64? I've looked at the base64 module, but it takes a bytes object as an argument, and I can't think of the way to present str as bytes, because if I try to construct a bytes object, Python will try to encode the string, which I don't need.
Googling further, I came across the binascii module (which is invoked indirectly from base64 anyway, if I'm not mistaken), but calling binascii.b2a_base64() on my string produces
TypeError: 'str' does not support the buffer interface
P.S. I've even found an answered question on how to decode a hex string in Python 3, but this is done with a dedicated method bytes.fromhex() so I don't see how it would be helpful.
Could someone please tell me what I'm missing? I'm afraid most of the post is irrelevant and only aggravates my shame, but at least you guys know what I tried.

OK, I think I'm going to summarize my current understanding of things (feel free to correct me). Hopefully it will help someone else out there as confused as I've been.
The credit totally goes to thebjorn and delnan, of course.
So, starting with the most common things:
there's Unicode, and it's a global standard that assigns codes (or code points) to all the exotic characters you can imagine. Those codes are just integer numbers. As of Unicode 6.1 there are 109,975 graphic characters, says Wikipedia.
Then there are encodings that define how to designate Unicode characters with byte codes. One byte isn't enough to designate an arbitrary Unicode char. Although, if you only take a small subset of them (English alphabet, digits, punctuation, some control characters), you can do with one byte per character (or even 7 bits; see ASCII).
To pass a Unicode string anywhere, one needs to encode it in bytes, then it can be decoded on the other end.
In Python 2, str is actually bytes, and unicode is Unicode, but Python 2 will do implicit encoding/decoding for you when needed. It will try to use ASCII encoding.
In Python 3, str is always a Unicode string, and bytes is a new data type for actual bytes. No implicit conversion is ever done by Python 3, you always need to do it yourself and specify the encoding. That means that your program won't work until you understand what's going on, which totally happened to me.
Now, that being more or less clear, let's move on to base64 encoding, which is also an encoding of sorts, but has a slightly different meaning.
Suppose you have some binary data (i.e. bytes) that may mean anything (in my case it's a bunch of floats). Now you want to represent this binary array with a string. That's what base64 encoding means: you have your bytes represented as an ASCII string.
Base64 means 6 bit, so in a base64-encoded string a single character stands for 6 bits of your data. That is why base64-encoded strings need to have the length that is a multiple of 4: otherwise the number of bytes encoded will be not integer.
Finally, to decode from base64 you need an ASCII string. A Unicode string won't do, there can only be characters from the base64 alphabet. Base64 module does the job in Python. The base64.b64decode() function takes a byte string as the argument. In Python 2 it means: str. In Python 3 it means: bytes. So if you have a str, such as
>>> s = 'U3RhY2sgT3ZlcmZsb3c='
In Python 2 you could just do
>>> s.decode('base64')
because s is already in ASCII.
In Python 3, you need to encode it in ASCII first, so you'll have to do:
>>> base64.b64decode(s.encode('ascii'))
And by the way, this will return a bytes object, so it's really up to you how to treat those bytes then. Maybe it's my floats, but maybe you should try to decode it as ASCII :)
In Python 2 however it will be just a str. Anyway, have a look at struct for the tools to unpack your data from those bytes.
So if you need the code to work on both Python 2 and 3, go with the last one. To make sure you have Unicode in the end (if you are decoding text from base64), you'll have to decode it:
>>> base64.b64decode(s.encode('ascii')).decode('ascii')
On Python 2, encode('ascii') won't effectively do anything because it's applied to str. So it will do an implicit conversion to Unicode first, and then do what you want (convert it back to ASCII). decode('ascii') will return a unicode object on Python 2.

I don't have Python 3 installed, but it sounds like you need to convert the Unicode returned from lxml to bytes, perhaps by calling .encode('ascii') ?

PyCrypto: Encode chinese charaters with RSA asymmetric key

I am trying to use PyCrypto to encrypt/decrypt some strings, and I am having troubles with Chinese characters.
When I try to encrypt "ni-hao" (hello)...
pemFile = open("/home/borrajax/keys/myKey.pem", "r")
encryptor = RSA.importKey(pemFile, passphrase="f00")
return encryptor.encrypt("你好", 0)[0]
... I keep getting errors:
Module Crypto.PublicKey.pubkey:64 in encrypt
>> ciphertext=self._encrypt(plaintext, K)
Module Crypto.PublicKey.RSA:92 in _encrypt
>> return (self.key._encrypt(c),)
ValueError: Plaintext too large
I have tried many combinations,
encryptor.encrypt(u"你好"...
encryptor.encrypt(u"你好".encode("utf-8")...
without any luck.
I guess I could always try to use base64 before encoding, but I'd like to leave that as a "last resource"... I was hoping for a more "elegant" way of doing this.
Has anyone encountered the same problems? Any hint will be appreciated. Thank you in advance.

First, you should be encrypting only binary data, not Unicode text. That means str type (in Python 2.x) or bytes (in Python 3.x and most recent Python 2.x). You must encode text before encrypting it, and you must decode it after you decrypt.
Second, the byte string you are encrypting must be smaller than the RSA modulus (e.g. less than 256 bytes for RSA2048). If your data is longer, think of using an intermediate AES session key.
Third, if you use PyCrypto 2.5, there is no good reason for using the .encrypt/.decrypt methods of the RSA key object. It is better and more secure to use one of the PKCS#1 methods: OAEP or v1.5. With them, the plaintext must be even shorter.

I tested with PyCrypto v2.5 installed from pip on Ubuntu Linux 10.04 on python 2.6.5 in the interactive interpreter from the yakuake terminal.
I'm not able to reproduce the error you're seeing, especially the "Plaintext too large" bit. Some errors I have seen:
encryptor.encrypt(u"你好",0)[0]
TypeError: argument 1 must be long, not unicode
Seems like it doesn't like unicode objects - only wants str.
Both of these work on my setup, and both produce the same output, however the first solution is more correct:
encryptor.encrypt(u"你好".encode("utf-8"), 0)[0]
encryptor.encrypt("你好", 0)[0]
Are you trying this from the interactive interpreter, or from a file? If file, is the file UTF-8 encoded? If console, does it have proper UTF-8 support?

I checked related codes of PyCrypto, this error is only thrown when plain-text (after converted to long) is bigger than one of the key parameter. Assuming encoding of the script is correctly set, it may be because your RSA key is invalid or too short. I tried this snippet, it works with no problem:
rsa = RSA.generate(1024)
print(rsa.encrypt("你好", 0))

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.