I'm using PyCrypto for generating secure key hashes. I want to store one or more of the partial keys I generate. Each partial key is in the form
\x0f|4\xcc\x02b\xc3\xf8\xb0\xd8\xfc\xd4\x90VE\xf2
I have an ndb StringProperty() in which I'd like to store that info. However, it raises a BadValueError saying it expects a UTF-8 encoded string. I tried using str's .encode('utf-8') method, but that also raises an error telling me it couldn't encode because of a byte at a bad position.
Anyway, my question is, how can I convert that byte string into something I can store in ndb?
Improved Answer:
In this case instead of storing the key as String or Text, you should use a BlobProperty which stores an uninterpreted byte string.
Original Answer:
To convert bytes (strings) to unicode you use the method decode. You also need to use an encoding that preserves the original binary data, which is ISO-8859-1. See ISO-8859-1 encoding and binary data preservation
unicode_key = key.decode('iso-8859-1')
bytes_key = unicode_key.encode('iso-8859-1')
Consider also using a TextProperty instead, as StringProperties are indexed.
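As a quick check of the ISO-8859-1 round trip above, here is a minimal sketch in Python 3 syntax (the answer itself targets Python 2, where the same decode and encode calls apply). The sample bytes are the partial key from the question, and utf8_ok is just an illustrative name.

```python
# Partial key from the question: 16 raw bytes that are not valid UTF-8.
key = b'\x0f|4\xcc\x02b\xc3\xf8\xb0\xd8\xfc\xd4\x90VE\xf2'

# UTF-8 rejects the data because \xcc is not followed by a continuation byte.
try:
    key.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# ISO-8859-1 maps every byte 0-255 to the code point with the same number,
# so decoding never fails and encoding restores the original bytes.
unicode_key = key.decode('iso-8859-1')
bytes_key = unicode_key.encode('iso-8859-1')

print(utf8_ok)           # False
print(bytes_key == key)  # True
```

The round trip is lossless only if the same codec is used in both directions.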
Related
I am working on a migration project to upgrade a layer of web server from python 2.7.8 to python 3.6.3 and I have hit a roadblock for some special cases.
When a request is received from a client, payload is transmitted locally using pyzmq which now interacts in bytes in python3 instead of str (as it is in python2).
Now, the payload which I am receiving is encoded using iso-8859-1 (latin-1) scheme and I can easily convert it into string as payload.decode('latin-1') and pass it to next service (svc-save-entity) which expects string argument.
However, the subsequent service 'svc-save-entity' expects latin-1 chars (if present) to be represented in ASCII Character Reference (such as &#233; for é) rather than in Hex (such as \xe9 for é).
I am struggling to find an efficient way to achieve this conversion. Can any python expert guide me here? Essentially I need the definition of a function say decode_tostring():
payload = b'Banco Santander (M\xe9xico)' #payload is in bytes
payload_str = decode_tostring(payload) #function to convert into string
payload_str == 'Banco Santander (M&#233;xico)' #payload_str is a string in ASCII Character Reference
Definition of decode_tostring() please. :)
The encode() and decode() methods accept a parameter called errors which allows you to specify how characters which are not representable in the specified encoding should be handled. The one you're looking for is XML numeric character reference replacement, which is fortunately one of the standard handlers provided in the codecs module.
Now, it's a little complex to actually do the replacement the way you want it, because the operation of replacing non-ASCII characters with their corresponding XML numeric character references happens during encoding, not decoding. After all, encoding is the process that takes in characters and emits bytes, so it's only during encoding that you can tell whether you have a character that is not part of ASCII. The cleanest way I can think of at the moment to get the transformation you want is to decode, re-encode, and re-decode, applying the XML entity reference replacement during the encoding step.
def decode_tostring(payload):
    return payload.decode('latin-1').encode('ascii', errors='xmlcharrefreplace').decode('ascii')
I wouldn't be surprised if there is a method somewhere out there that will replace all non-ASCII characters in a string with their XML numeric character refs and give you back a string, and if so, you could use it to replace the encoding and the second decoding. But I don't know of one. The closest I found at the moment was xml.sax.saxutils.escape(), but that only acts on certain specific characters.
This isn't really relevant to your main question, but I did want to clarify one thing: the numeric entities like &#233; are a feature of SGML, HTML, and XML, which are markup languages - a way to represent structured data as text. They have nothing to do with ASCII. A character encoding like ASCII is nothing more than a table of some characters and some byte sequences such that each character in the table is mapped to one byte sequence in the table and vice versa, with a few constraints to make the mapping unambiguous.
If you have a string with characters that are not in a particular encoding's table, you can't encode the string using that encoding. But what you can do is convert the string into a new string by replacing the characters which aren't in the table with sequences of characters that are in the table, and then encode the new string. There are many ways to do the replacement, of which XML numeric entity references are one example. Some of the other error handlers in Python's codecs module represent other approaches to this replacement.
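To make the comparison between error handlers concrete, here is a small sketch (Python 3) that encodes the payload from the question to ASCII with several of the standard handlers. The names handlers and results are illustrative only.

```python
text = b'Banco Santander (M\xe9xico)'.decode('latin-1')

# Each handler decides what to emit for characters missing from ASCII.
handlers = ['xmlcharrefreplace', 'backslashreplace', 'replace', 'ignore']
results = {h: text.encode('ascii', errors=h).decode('ascii') for h in handlers}

for name, value in results.items():
    print(name, '->', value)
# xmlcharrefreplace -> Banco Santander (M&#233;xico)
# backslashreplace -> Banco Santander (M\xe9xico)
# replace -> Banco Santander (M?xico)
# ignore -> Banco Santander (Mxico)
```

With the default handler ('strict'), the same encode call raises UnicodeEncodeError instead of substituting anything.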
I am working on migrating project code from Python 2 to Python 3.
One piece of code uses struct.pack, which gives me a string in Python 2 and a byte string in Python 3.
I want to convert the byte string in Python 3 to a normal string. The converted string should have the same content, to keep it consistent with existing values.
For e.g.
in_val = b'\0x01\0x36\0xff\0x27' # Input value
out_val = '\0x01\0x36\0xff\0x27' # Output should be this
I have one solution: convert in_val to a string, then explicitly remove the 'b' and '\' characters that appear after the conversion.
Is there a cleaner way to do this conversion?
Any help appreciated.
str values are always Unicode code points. The first 256 values are the Latin-1 range, so you can use that codec to decode bytes directly to those codepoints:
out_val = in_val.decode('latin1')
However, you want to re-assess why you are doing this. Don't store binary data in strings, there are almost always better ways to deal with binary data. If you want to store binary data in JSON, for example, then you'd want to use Base64 or some other binary-to-text encoding scheme that better handles edge cases such as binary data containing escape codes when interpreted as text.
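As an illustration of the Base64 suggestion, here is a minimal sketch that packs four bytes with struct.pack, stores them in a JSON document as Base64 text, and recovers them unchanged. The '4B' format and the 'value' key are assumptions made for the example, not part of the question's code.

```python
import base64
import json
import struct

# Four unsigned bytes, packed the way the question's code might produce them.
in_val = struct.pack('4B', 0x01, 0x36, 0xff, 0x27)

# Base64 output is pure ASCII, so it is safe to embed in JSON as a str.
encoded = base64.b64encode(in_val).decode('ascii')
document = json.dumps({'value': encoded})

# Reading it back yields exactly the original bytes.
restored = base64.b64decode(json.loads(document)['value'])
print(restored == in_val)  # True
```

Unlike the Latin-1 trick, the stored text contains no control characters or escape codes, which is the edge case the answer warns about.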
I have a column in a spreadsheet whose header contains non-ASCII characters, thus:
'ï»¿Campaign'
If I pop this string into the interpreter, I get:
'\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
The string is one of the keys in the rows of a csv.DictReader()
When I try to populate a new dict with the value of this key:
spends['ï»¿Campaign'] = 2
I get:
KeyError: '\xc3\xaf\xc2\xbb\xc2\xbfCampaign'
If I print the value of the keys of row, I can see that it is '\xef\xbb\xbfCampaign'
Obviously then I can just update my program to access this key thus:
spends['\xef\xbb\xbfCampaign']
But is there a "better" way of doing this, in Python? Indeed, if the value of this key ever changes to contain other non-ASCII characters, what is an all-encompassing way of handling any non-ASCII characters that may arise?
Your specific problem is the first three bytes of the file, "\xef\xbb\xbf". That's the UTF-8 encoding of the byte order mark (BOM), which is often prepended to text files to indicate they're encoded using UTF-8. You should strip these bytes. See Removing BOM from gzip'ed CSV in Python.
Second, you're decoding with the wrong codec. "ï»¿" is what you get if you decode those bytes using the Windows-1252 character set. That's why the bytes look different if you use these characters in a source file. See the Python 2 Unicode howto.
In general, you should decode a bytestring into Unicode text using the corresponding character encoding as soon as possible on input. And, in reverse, encode Unicode text into a bytestring as late as possible on output. Some APIs such as io.open() can do it implicitly so that your code sees only Unicode.
Unfortunately, the csv module does not support Unicode directly on Python 2. See UnicodeReader and UnicodeWriter in the doc examples. You could create their analogs for csv.DictReader or, as an alternative, just pass UTF-8 encoded bytestrings to the csv module.
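For comparison, here is a rough sketch of how the BOM can be handled on Python 3, where the csv module reads text directly. It assumes the file is UTF-8 with a leading BOM and uses an in-memory sample in place of the real spreadsheet export. The 'utf-8-sig' codec drops the BOM while decoding. The column name Spend and the row values are made up for the example.

```python
import csv
import io

# In-memory stand-in for the CSV file: UTF-8 BOM followed by the data.
raw = b'\xef\xbb\xbfCampaign,Spend\r\nSpring,2\r\n'

# Decoding the BOM bytes as Windows-1252 produces the stray characters.
print(raw[:3].decode('cp1252'))  # ï»¿

# 'utf-8-sig' decodes UTF-8 and silently removes a leading BOM if present.
text = io.TextIOWrapper(io.BytesIO(raw), encoding='utf-8-sig', newline='')
rows = list(csv.DictReader(text))

print(list(rows[0].keys()))  # ['Campaign', 'Spend']
print(rows[0]['Campaign'])   # Spring
```

On Python 2, a comparable effect can be had by checking whether the first bytes equal codecs.BOM_UTF8 and skipping them before handing the data to csv.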
I was doing some work in Python with graphs and wanted to save some structures in files so I could load them quickly when I resumed work. One of those was a dictionary, which I saved in JSON format using json.dump.
When I load it back with json.load, the keys have changed from "1" to u'1'. Why is that? What does it mean? How can I change it? I use the keys later to make some lists which I will then use with the original graph, whose nodes are the keys (in integer form), and it causes problems in comparisons...
The u prefix signifies a Unicode string. In Python 2.x, you can convert it to a regular string with str(). That shouldn't really be necessary, though; u'1' == '1' because Python will do any conversion for you before comparing.
The u'' or u"" just means that this is a unicode string. Which in general should not be any problem unless you need a byte string. Though I would expect that your original data already was unicode, so it should not be a problem.
It is a unicode string. You can treat it as a normal python string in most cases. If you really want to convert it to a normal string use str(). If you need to convert it to a bytes type, use object.encode(encoding) where encoding is the target encoding, usually 'utf-8'.
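Since the question mentions comparing the loaded keys against integer graph nodes, note that JSON object keys are always strings, so the keys come back as text whatever type they had before dumping. A minimal sketch (Python 3, with a made-up adjacency dict named graph) of converting them back:

```python
import json

# Hypothetical graph: integer nodes mapped to lists of neighbours.
graph = {1: [2, 3], 2: [1], 3: [1]}

# JSON has no integer keys, so they are written and read back as strings.
loaded = json.loads(json.dumps(graph))
print(sorted(loaded))  # ['1', '2', '3']

# Convert the keys back to int before comparing with the original nodes.
restored = {int(node): neighbours for node, neighbours in loaded.items()}
print(restored == graph)  # True
```

The same int() conversion works on the u'1' style keys that Python 2 returns.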
I found that the doc says that software should only work with Unicode strings internally, converting to a particular encoding on output.
Does it mean that every method I define should handle the parameter as a unicode object instead of a string object? If not, when do I need to handle as a string and when do I need to handle as a unicode?
Yes, this is exactly what they mean.
Handle textual input from outside sources as strings, but immediately decode to unicode. Only encode back to some encoding to output it (preferably this is done by whatever function/method you call to do the output, rather than you needing to explicitly encode and then pass the encoded string somewhere).
Obviously, if you're dealing with non-text binary bytes, keep them in byte strings.
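The decode-early, encode-late pattern described above might look like the following sketch (Python 3 syntax, where text is str and encoded data is bytes). The functions shout and handle_request are hypothetical names invented for the illustration.

```python
def shout(text):
    """Internal logic: works purely on decoded text, never on bytes."""
    return text.upper()


def handle_request(raw, encoding='utf-8'):
    """Boundary code: decode on the way in, encode on the way out."""
    text = raw.decode(encoding)      # bytes -> text as early as possible
    result = shout(text)             # everything in between sees only text
    return result.encode(encoding)   # text -> bytes as late as possible


reply = handle_request('México'.encode('utf-8'))
print(reply.decode('utf-8'))  # MÉXICO
```

Only handle_request needs to know about the encoding; shout and anything else in the middle can ignore it.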