How can I convert a python urandom to a string? - python

If I call os.urandom(64), I am given 64 random bytes. With reference to Convert bytes to a Python string I tried
a = os.urandom(64)
a.decode()
a.decode("utf-8")
but got the traceback error stating that the bytes are not in utf-8.
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 0: invalid start byte
with the bytes
b'\x8bz\xaf$\xb6\x93q\xef\x94\x99$\x8c\x1eO\xeb\xed\x03O\xc6L%\xe70\xf9\xd8
\xa4\xac\x01\xe1\xb5\x0bM#\x19\xea+\x81\xdc\xcb\xed7O\xec\xf5\\}\x029\x122
\x8b\xbd\xa9\xca\xb2\x88\r+\x88\xf0\xeaE\x9c'
Is there a foolproof method to decode these bytes into some string representation? I am generating pseudo-random tokens to keep track of related documents across multiple database engines.

The code below will work on both Python 2.7 and 3:
from base64 import b64encode
from os import urandom
random_bytes = urandom(64)
token = b64encode(random_bytes).decode('utf-8')
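As a rough sketch of the round trip (Python 3 here, and the extra variable names are only illustrative), the token can be turned back into the original bytes with b64decode. urlsafe_b64encode is a variant of the same idea if the token has to appear in URLs or file names:

```python
from base64 import b64decode, b64encode, urlsafe_b64encode
from os import urandom

random_bytes = urandom(64)

# Base64 output is pure ASCII, so decoding it to text cannot fail.
token = b64encode(random_bytes).decode('ascii')

# The encoding is reversible: the original bytes can be recovered.
recovered = b64decode(token)

# URL-safe variant: uses '-' and '_' instead of '+' and '/'.
url_token = urlsafe_b64encode(random_bytes).decode('ascii')

print(len(token))                  # 64 bytes become 88 characters
print(recovered == random_bytes)   # True
```

Since every 3 bytes become 4 characters, the token is about a third longer than the raw data.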

You have random bytes; I'd be very surprised if that ever was decodable to a string.
If you have to have a unicode string, decode from Latin-1:
a.decode('latin1')
because it maps bytes one-to-one to the corresponding Unicode code points.
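A small check of that one-to-one claim, written for Python 3 as an illustration only: every possible byte value decodes under Latin-1, and encoding the result gives the same bytes back. Keep in mind that the resulting string contains control characters, so whether a given database or protocol accepts it is a separate question.

```python
import os

# Latin-1 assigns byte value N to code point N, so every byte string decodes.
all_bytes = bytes(range(256))
text = all_bytes.decode('latin1')

# The mapping is reversible, so no information is lost.
round_tripped = text.encode('latin1')

a = os.urandom(64)
s = a.decode('latin1')
print(len(s))                    # 64: one character per byte
print(s.encode('latin1') == a)   # True
```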

You can use base-64 encoding. In this case (Python 2, where os.urandom returns a str):
a = os.urandom(64)
a.encode('base-64')
Also note that I'm using encode here rather than decode, as decode is trying to take it from whatever format you specify into unicode. So in your example, you're treating the random bytes as if they form a valid utf-8 string, which is rarely going to be the case with random bytes.
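The a.encode('base-64') call above relies on Python 2, where the random bytes are a str. In Python 3 a bytes object has no encode method, so a sketch of the equivalent would go through the codecs or base64 modules instead:

```python
import base64
import codecs
import os

a = os.urandom(64)

# bytes objects have no .encode() method in Python 3.
has_encode = hasattr(a, 'encode')

# The 'base64' codec is still reachable through codecs.encode();
# it returns bytes and inserts line breaks in the output.
via_codec = codecs.encode(a, 'base64')

# base64.b64encode() produces the same data on a single line.
one_line = base64.b64encode(a)

# Base64 text is ASCII, so this final decode always succeeds.
token = one_line.decode('ascii')
print(token)
```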

Are you sure that you need 64 bytes represented as string?
Maybe what you really need is N-bits token?
If so, use secrets. The secrets module provides functions for generating secure tokens, suitable for applications such as password resets, hard-to-guess URLs, and similar.
>>> import secrets
>>> secrets.token_bytes(16)
b'\xebr\x17D*t\xae\xd4\xe3S\xb6\xe2\xebP1\x8b'
>>> secrets.token_hex(16)
'f9bf78b9a18ce6d46a0cd2b0b86df9da'
>>> secrets.token_urlsafe(16)
'Drmhze6EPcv0fN_81Bj-nA'
Or maybe you need a random string that is 64 characters long?
import string
import secrets
alphabet = string.ascii_letters + string.digits
password = ''.join(secrets.choice(alphabet) for i in range(64))
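To make the "N-bit token" idea concrete, here is a short sketch of how the byte count passed to secrets relates to the length of the text that comes back (nbytes is just an illustrative name):

```python
import secrets

nbytes = 16   # 16 bytes = 128 bits of randomness

hex_token = secrets.token_hex(nbytes)       # 2 hex characters per byte
url_token = secrets.token_urlsafe(nbytes)   # roughly 1.3 characters per byte

print(len(hex_token))   # 32
print(len(url_token))   # 22
```

So for the same amount of randomness, the URL-safe form gives the shorter string, while the hex form is easier to read and compare by eye.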

The easy way is to call str() on the bytes:
a = str(os.urandom(64))
print(F"the: {a}")
print(type(a))
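One caveat worth checking before relying on this: in Python 3, str() on a bytes object returns its repr, so the text includes the b'...' wrapper and backslash escapes. A sketch of what that means in practice, using ast.literal_eval as one possible way to get the bytes back:

```python
import ast
import os

a = os.urandom(64)
s = str(a)   # same text as repr(a) in Python 3

# The string starts with the bytes-literal prefix, not with the data itself.
looks_like_literal = s.startswith("b'") or s.startswith('b"')

# Recovering the bytes means parsing that literal again.
recovered = ast.literal_eval(s)

print(looks_like_literal)   # True
print(recovered == a)       # True
```

The length of s also varies from run to run, because each non-printable byte is written as a four-character escape.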

Related

Python script to encrypt a message fails

I'm trying to create an HMAC-SHA256 signature by giving my script a key and a message.
A popular example that I saw online fails to run on my machine:
import hmac
import hashlib
import binascii
def create_sha256_signature(key, message):
    byte_key = binascii.unhexlify(key)
    message = message.encode()
    enc = hmac.new(byte_key, message, hashlib.sha256).hexdigest().upper()
    print(enc)
create_sha256_signature("KeepMySecret", "aaaaa")
Why am I getting this error?
Traceback (most recent call last):
File "encryption.py", line 12, in <module>
create_sha256_signature("SaveMyScret", "aaaaa")
File "encryption.py", line 8, in create_sha256_signature
byte_key = binascii.unhexlify(key)
binascii.Error: Odd-length string
How should I change my code so I will be able to give my own short key?
When you call unhexlify it implies that your key is a hexadecimal representation of bytes. E.g. A73FB0FF.... In this kind of encoding, every character represents just 4 bits and therefore you need two characters for a byte and an even number of characters for the whole input string.
From the docs:
hexstr must contain an even number of hexadecimal digits
But actually the given secrets are not valid hex in the first place: "SaveMyScret" (the key in the traceback) has an odd number of characters, and "KeepMySecret", although even in length, contains non-hexadecimal characters, so it would fail anyway with:
binascii.Error: Non-hexadecimal digit found
You can either provide a key in hex encoded form, or instead of calling unhexlify use something like
byte_key = key.encode('utf-8')
to get bytes as input for hmac.new()
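Putting that together, a minimal sketch of the function with the key treated as plain text might look like this (Python 3; returning the value instead of printing it is my own change). Note that HMAC is a keyed hash, so the result is a signature for the message rather than an encrypted copy of it.

```python
import hashlib
import hmac


def create_sha256_signature(key, message):
    # Treat the key as text and encode it, instead of parsing it as hex.
    byte_key = key.encode('utf-8')
    byte_message = message.encode('utf-8')
    return hmac.new(byte_key, byte_message, hashlib.sha256).hexdigest().upper()


signature = create_sha256_signature("KeepMySecret", "aaaaa")
print(signature)   # 64 uppercase hex characters
```

When verifying a signature received from elsewhere, hmac.compare_digest is the usual way to compare the two values.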

Send bytes encoded data over JSON

I'm working on a Python RESTful API and I need to send bytes-encoded data (specifically data encrypted using a public RSA key from the package rsa) over the network, via JSON forms.
Here is what it looks like :
>>> import rsa
>>> pubkey, privkey = rsa.newkeys(512, True) # Create a pair of public/private rsa keys ; 512 is for the example
>>> encStr = rsa.encrypt(b"Test string", pubkey) # Encrypt a bytes object using the public key
>>> encStr
b'r\x10\x03e\xc6*\xa8\xb1\xee\xbd\x18\x0f\x7f\xecz\xcex\xabP~\xb3]\x8f)R\x9b>i\x03\xab-m\x0c\x19\xd7\xa5f$\x07\xc1;X\x0b\xaa2\x99\xa8&\xfc/\x9f\x05!nk\x93%\xc0\xf5\x1d\xf8C\x1fo'
"encStr" is what I need to send, however, I can't tell what encoding it is, and the package documentation doesn't mention it. If you have any idea, please share it :)
>>> encStr.decode("utf-8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec cannot decode byte 0x8e in position 0: invalid start byte
>>> encStr.decode("latin1")
'\x8eM\x96Æ\'zÈZ\x89\x85±\x98Z¯Ûzùæ¯;£zñ8\x9b§Ù\x9dÏ\x8eâ0®\x89ó(*?\x92ªg\x12ôsä\x1d\x96\x19\x82\x19-3\x15SBýh"3òÖß\x91Ô' # This could be it
>>> encStr.decode("latin1").encode("latin1")
b'\x8eM\x96\xc6\'z\xc8Z\x89\x85\xb1\x98Z\xaf\xdbz\xf9\xe6\xaf;\xa3z\xf18\x9b\xa7\xd9\x9d\xcf\x8e\xe20\xae\x89\xf3(*?\x92\xaag\x12\xf4s\xe4\x1d\x96\x19\x82\x19-3\x15SB\xfdh"3\xf2\xd6\xdf\x91\xd4' # Nop, garbage
After manipulating for a while, I found a way to get a proper string using base64.
>>> import base64
>>> b64_encStr = base64.b64encode(encStr)
>>> b64_encStr
b'jk2Wxid6yFqJhbGYWq/bevnmrzujevE4m6fZnc+O4jCuifMoKj+SqmcS9HPkHZYZghktMxVTQv1oIjPy1t+R1A=='
>>> b64_encStr.decode("utf-8")
'jk2Wxid6yFqJhbGYWq/bevnmrzujevE4m6fZnc+O4jCuifMoKj+SqmcS9HPkHZYZghktMxVTQv1oIjPy1t+R1A=='
Now I'd just have to send this. However, I would like to know if there is a more efficient way of doing this (shorter string, fewer operations, considering the client has to encode and the server decode, etc).
Thanks !
Shawn
Base64 is a relatively efficient method of sending bytes as text: each character carries 6 bits of data while taking up 8 bits in a normal single-byte character encoding, so the encoded form is about one third larger than the input. The bytes may have any value, as is the case for ciphertext. There are more efficient encodings such as basE91, but they offer little advantage for the complexity that they bring.
However, I often see ciphertext being "stringified" while there is no need. Files, HTTP, sockets etc. all handle any byte values well. If you want to use it in a GET request then you should use base64url instead of the normal base 64 encoding. Often developers encode strings needlessly so that the values can be easily seen in traces and such, but in that case only the trace printout itself needs to be encoded.
Note that I'd advise you to use OAEP padding rather than PKCS#1 v1.5 padding, and a key size of at least 3072 bits, especially if you want to encrypt data that is transported rather than encrypted "in place".
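For the JSON case specifically, a sketch of the full round trip could look like the following. The random bytes stand in for the RSA ciphertext and the "data" field name is made up for the example:

```python
import base64
import json
import os

# Stand-in for the ciphertext; 64 bytes matches a 512-bit RSA key.
enc = os.urandom(64)

# Sender: bytes -> Base64 text -> JSON document.
payload = json.dumps({"data": base64.b64encode(enc).decode("ascii")})

# Receiver: JSON document -> Base64 text -> bytes.
received = base64.b64decode(json.loads(payload)["data"])

print(received == enc)                    # True
print(len(json.loads(payload)["data"]))   # 88 characters for 64 bytes
```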

Convert a UTF-8 String to a string in Python

If I have a unicode string such as:
s = u'c\r\x8f\x02\x00\x00\x02\u201d'
how can I convert this to just a regular string that isn't in unicode format; i.e. I want to extract:
f = '\x00\x00\x02\u201d'
and I do not want it in unicode format. The reason why I need to do this is because I need to convert the unicode in s to an integer value, but if I try it with just s:
int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
Traceback (most recent call last):
File "<pyshell#48>", line 1, in <module>
int((s[-4]+s[-3]+s[-2]+s[-1]).encode('hex'), 16)
File "C:\Python27\lib\encodings\hex_codec.py", line 24, in hex_encode
output = binascii.b2a_hex(input)
UnicodeEncodeError: 'ascii' codec can't encode character u'\u201d' in position 3: ordinal not in range(128)
yet if I do it with f:
int(f.encode('hex'), 16)
664608376369508L
And this is the correct integer value I want to extract from s. Is there a method where I can do this?
Normally, the device sends back something like: \x00\x00\x03\xcc which I can easily convert to 972
OK, so I think what's happening here is you're trying to read four bytes from a byte-oriented device, and decode that to an integer, interpreting the bytes as a 32-bit word in big-endian order.
To do this, use the struct module and byte strings:
>>> import struct
>>> struct.unpack('>i', '\x00\x00\x03\xCC')[0]
972
(I'm not sure why you were trying to reverse the string then hex-encode; that would put the bytes in the wrong order and give much too large output.)
I don't know how you're reading from the device, but at some point you've decoded the bytes into a text (Unicode) string. Judging from the U+201D character in there I would guess that the device originally gave you a byte 0x94 and you decoded it using code page 1252 or another similar Windows default (‘ANSI’) code page.
>>> struct.unpack('>i', '\x00\x00\x02\x94')[0]
660
It may be possible to reverse the incorrect decoding step by encoding back to bytes using the same mapping, but this is dicey and depends on which encodings are involved (not all bytes are mapped to anything usable in all encodings). Better would be to look at where the input is coming from, find where that decode step is happening, and get rid of it so you keep hold of the raw bytes the device sent you.
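As an illustration of that reversal, here is a sketch in Python 3 that assumes the stray decode really was code page 1252 (that is a guess, as noted above). It takes the last four characters of the string from the question, encodes them back to bytes, and unpacks the result:

```python
import struct

# Last four characters of the mis-decoded string from the question.
garbled = '\x00\x00\x02\u201d'

# Undo the assumed cp1252 decoding; U+201D maps back to byte 0x94.
raw = garbled.encode('cp1252')

# Interpret the four bytes as a big-endian signed 32-bit integer.
value = struct.unpack('>i', raw)[0]
print(value)                        # 660

# int.from_bytes gives the same result without struct.
print(int.from_bytes(raw, 'big'))   # 660
```

This only works for bytes that cp1252 actually maps to a character, which is why keeping the raw bytes in the first place is the safer fix.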

Converting Unicode to in python [duplicate]

Closed 10 years ago.
Possible Duplicate:
Convert Unicode to UTF-8 Python
I'm a very new Python programmer, working on my first script. The script pulls in text from a plist string, then does some things to it, then packages it up as an HTML email.
From a few of the entries, I'm getting the dreaded Unicode "outside ordinal 128" error.
Having read as much as I can find about encoding and decoding, I know that it is important for me to get the encoding right, but I'm having a difficult time understanding when or how exactly to do this.
The offending variable is first pulled in using plistlib, and converted to HTML from markdown, like this:
entry = result['Entry Text']
donotecontent = markdown2.markdown(entry)
Later, it is put in the email like this:
html = donotecontent + '<br /><br />' + var3
part1 = MIMEText(html, 'html')
msg.attach(part1)
My question is, what is the best way for me to make sure that Unicode characters in this content don't cause this to throw an error? I prefer not to ignore the characters.
Sorry for my broken English. I speak Chinese/Japanese and use CJK characters every day.
Ceron has solved most of this problem, so I won't go over how to use encode()/decode() again.
When we use str() to cast a unicode object, it encodes the unicode string to byte data; when we use unicode() to cast a str object, it decodes the byte data to unicode characters.
The encoding used is the one returned by sys.getdefaultencoding().
By default, sys.getdefaultencoding() returns 'ascii', so an encoding/decoding exception may be thrown when doing str()/unicode() casting.
If you want to do str <-> unicode conversion with str() or unicode(), and also have the implicit encoding/decoding use 'utf-8', you can execute the following statements:
import sys # sys.setdefaultencoding is cancelled by site.py
reload(sys) # to re-enable sys.setdefaultencoding()
sys.setdefaultencoding('utf-8')
and it will cause later calls to str() and unicode() to convert any basestring object using the utf-8 encoding.
However, I would prefer to use encode()/decode() explicitly, because it makes code maintenance easier for me.
Assuming you're using Python 2.x, remember that there are two types of strings: str and unicode. str objects are byte strings, whereas unicode objects are Unicode strings. Unicode strings can be used to represent text in any language, but to store text in a computer or to send it via email, you need to represent that text using bytes. To represent text using bytes, you need a coding format. There are many coding formats. Python uses ascii by default, but ascii can only represent a few characters, mostly English letters. If you try to encode a text with other letters using ascii, you will get the famous "outside ordinal 128" error. For example:
>>> u'Cerón'.encode('ascii')
Traceback (most recent call last):
File "<input>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf3' in position 3:
ordinal not in range(128)
The same happens if you use str(u'Cerón'), because Python uses ascii by default to convert unicode to str.
To make this work, you have to use a different coding format. UTF-8 is a coding format that can express any unicode text as bytes. To convert the u'Cerón' unicode string to bytes you have to use:
>>> u'Cerón'.encode('utf-8')
'Cer\xc3\xb3n'
No errors this time.
Now, back to your email problem. I can see that you're using MIMEText, which accepts an already encoded str argument, which in your case is the html variable. MIMEText also accepts an argument specifying what kind of encoding is being used. So, in your case, if html is a unicode string, you have to encode it as utf-8 and pass the charset parameter too (because MIMEText uses ascii by default):
part1 = MIMEText(html.encode('utf-8'), 'html', 'utf-8')
But be careful, because if html is already a str instead of unicode, then the encoding may fail. This is one of the problems of Python 2.x: it lets you call encode() on an already encoded str, but it first decodes the bytes implicitly using ascii, which throws an error as soon as they contain non-ascii bytes.
Another problem to add to the list is that utf-8 is compatible with ascii characters, and Python will always try to automatically encode/decode strings using ascii. If you're not properly encoding your strings, but you only use ascii characters, things will work fine. However, if for some reason some non-ascii characters slip into your message, you will get the error, which makes such bugs harder to detect.
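For comparison, here is a sketch of the same MIMEText step in Python 3, where str is already Unicode and MIMEText does the encoding itself when given a charset. The html value below is a made-up stand-in for the question's variable:

```python
from email.mime.text import MIMEText

# Hypothetical content standing in for the question's html variable.
html = '<p>Cer\u00f3n</p><br /><br />'

# In Python 3, MIMEText takes a str and encodes it with the given charset.
part1 = MIMEText(html, 'html', 'utf-8')

# The charset is recorded in the header, and the body decodes back intact.
print(part1['Content-Type'])
decoded = part1.get_payload(decode=True).decode('utf-8')
print(decoded == html)   # True
```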
Remember: You can't decode a unicode, and you can't encode a str
>>> u"\xa0".decode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#7>", line 1, in <module>
u"\xa0".decode("ascii", "ignore")
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)
>>> "\xc2".encode("ascii", "ignore")
Traceback (most recent call last):
File "<pyshell#6>", line 1, in <module>
"\xc2".encode("ascii", "ignore")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 0: ordinal not in range(128)
Check out this excellent tutorial
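The same rule is enforced by the types themselves in Python 3, which may make it easier to remember. A small sketch, not taken from the answers above:

```python
text = 'Cer\u00f3n'           # str: a sequence of Unicode code points
data = text.encode('utf-8')   # bytes: the encoded form

# str only has encode(), bytes only has decode().
str_can_decode = hasattr(text, 'decode')     # False
bytes_can_encode = hasattr(data, 'encode')   # False

print(data)                           # b'Cer\xc3\xb3n'
print(data.decode('utf-8') == text)   # True
```

Because the wrong direction is simply missing, the confusing implicit ascii step from Python 2 cannot happen.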

Double-decoding unicode in python

I am working against an application that seems keen on returning, what I believe to be, double UTF-8 encoded strings.
I send the string u'XüYß' encoded using UTF-8, thus becoming X\u00fcY\u00df (equal to X\xc3\xbcY\xc3\x9f).
The server should simply echo what I sent it, yet returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (should be X\xc3\xbcY\xc3\x9f). If I decode it using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a ... unicode string containing the original string encoded using UTF-8.
But Python won't let me decode a unicode string without re-encoding it first - which fails for some reason, that escapes me:
>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...
How do I persuade Python to re-decode the string? - and/or is there any (practical) way of debugging what's actually in the strings, without passing it through all the implicit conversion print uses?
(And yes, I have reported this behaviour with the developers of the server-side.)
ret.decode() tries implicitly to encode ret with the system encoding - in your case ascii.
If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:
>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'
Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)
In case you run into this sort of mixed data, you can use the codec again, to normalize everything:
>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'
>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:
def double_decode(bstr):
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
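Here is the same idea spelled out step by step for Python 3, as a sketch only. The name double_decode_py3 is made up, and ascii() is shown as one way to inspect what is really in a string without going through print's conversions:

```python
def double_decode_py3(data: bytes) -> str:
    """Undo one extra layer of UTF-8 encoding (hypothetical helper)."""
    once = data.decode('utf-8')      # text whose code points are really bytes
    inner = once.encode('latin-1')   # code point N becomes byte N
    return inner.decode('utf-8')     # the text that was originally sent


reply = b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'
print(double_decode_py3(reply))       # XüYß
print(ascii(reply.decode('utf-8')))   # 'X\xc3\xbcY\xc3\x9f'
```

The latin-1 step raises UnicodeEncodeError if the intermediate text contains code points above 255, which is a useful signal that the data was not double-encoded in this way.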
Don't use this! Use @hop's solution.
My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)
def double_decode_unicode(s, encoding='utf-8'):
    return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)
Then,
>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print _
XüYß
Here's a little script that might help you, doubledecode.py --
https://gist.github.com/1282752
