A Confusing Case in Python

I'm new to PyQt and there is something I didn't get. I made an app using hashlib and, of course, a GUI with PyQt.
self.pushButton.connect(self.pushButton,QtCore.SIGNAL("clicked()"),self.clickedButton)
and the click handler:
def clickedButton(self):
    if self.comboBox.currentText() == "MD5":
        self.MD5(self.lineEdit.text())
and MD5:
def MD5(self, text):
    self.hash = hashlib.md5(text).hexdigest()
    self.textEdit.setText(self.hash)
The result for "hello":
a6f145a01ad0127e555c051d15806eb5
No error. It looks OK. But trying the same thing in the Python shell:
>>> print hashlib.md5("hello").hexdigest()
5d41402abc4b2a76b9719d911017c592
>>>
Is this an error, or why am I getting different results?

The problem is that you are passing md5 a QString, not a regular Python string. From some experimentation, I can see that both a regular string and a unicode string in Python produce the same result - from this I can tell that it tries to convert the unicode version to a sequence of "narrow" characters using the ascii codec.
>>> print hashlib.md5("hello").hexdigest()
5d41402abc4b2a76b9719d911017c592
>>> print hashlib.md5(u"hello").hexdigest()
5d41402abc4b2a76b9719d911017c592
The QString, instead, gets hashed differently, as in your program:
>>> a=PyQt4.QtCore.QString("hello")
>>> print hashlib.md5(a).hexdigest()
a6f145a01ad0127e555c051d15806eb5
although its UTF-8 and Latin-1 representations (both of which are the same as the output of the ascii codec, since we are dealing only with alphabetic characters) are hashed the same way as Python strings:
>>> print hashlib.md5(a.toUtf8()).hexdigest()
5d41402abc4b2a76b9719d911017c592
>>> print hashlib.md5(a.toLatin1()).hexdigest()
5d41402abc4b2a76b9719d911017c592
Probably what's happening here is that the hashing algorithm is working on the internal representation of the QString, which is UTF-16, which obviously differs from the UTF-8 representation, producing different outputs.
Thus, the lesson here is that before performing any hashing on text you have to choose its encoding explicitly before passing it to the hashing function - which works only with bytes - since there ain't no such thing as plain text.
Edit: it's not working on the UTF-16 representation (otherwise you would get the same result with
>>> print hashlib.md5(u'hello'.encode('utf_16')).hexdigest()
25af7f84a93a6cf5cb00967c60910c7d
) but on something else; still, the point is that hashlib isn't meant to work with QString, so it produces some "strange" output. Again, before using it, convert the QString to a narrow string in some adequate encoding.
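For completeness, here is a minimal sketch of that advice applied to the original program (assuming PyQt4 on Python 2, where toUtf8() returns a QByteArray and str() turns it into plain bytes):
def MD5(self, text):
    # Explicitly encode the QString as UTF-8 bytes before hashing;
    # hashlib works on bytes, not on QString's internal storage.
    data = str(text.toUtf8())
    self.hash = hashlib.md5(data).hexdigest()
    self.textEdit.setText(self.hash)
With that change, the GUI produces the same digest for "hello" as the interactive shell.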

Related

convert Unicode to normal string [duplicate]

When I parse this XML with p = xml.parsers.expat.ParserCreate():
<name>Fortuna Düsseldorf</name>
The character parsing event handler includes u'\xfc'.
How can u'\xfc' be turned into u'ü'?
This is the main question in this post; the rest just shows further (ranting) thoughts about it.
Isn't Python unicode broken since u'\xfc' shall yield u'ü' and nothing else?
u'\xfc' is already a unicode string, so converting it to unicode again doesn't work!
Converting it to ASCII as well doesn't work.
The only thing that I found works is: (This cannot be intended, right?)
exec( 'print u\'' + 'Fortuna D\xfcsseldorf'.decode('8859') + u'\'')
Replacing 8859 with utf-8 fails! What is the point of that?
Also, what is the point of the Python Unicode HOWTO? It only gives examples of failures instead of showing how to do the conversions that people (especially the hundreds who ask similar questions here) actually use in real-world practice.
Unicode is no magic - why do so many people here have issues?
The underlying problem of unicode conversion is dirt simple:
One bidirectional lookup table '\xFC' <-> u'ü'
unicode( 'Fortuna D\xfcsseldorf' )
What is the reason why the creators of Python think it is better to show an error instead of simply producing this: u'Fortuna Düsseldorf'?
Also, why did they make it not reversible?
>>> u'Fortuna Düsseldorf'.encode('utf-8')
'Fortuna D\xc3\xbcsseldorf'
>>> unicode('Fortuna D\xc3\xbcsseldorf','utf-8')
u'Fortuna D\xfcsseldorf'
You already have the value. Python simply tries to make debugging easier by giving you a representation that is ASCII friendly. Echoing values in the interpreter gives you the result of calling repr() on the result.
In other words, you are confusing the representation of the value with the value itself. The representation is designed to be safely copied and pasted around, without worry about how other systems might handle non-ASCII codepoints. As such the Python string literal syntax is used, with any non-printable and non-ASCII characters replaced by \xhh and \uhhhh escape sequences. Pasting those strings back into a Python string or interactive Python session will reproduce the exact same value.
As such, ü has been replaced by \xfc, because that's the escape for U+00FC LATIN SMALL LETTER U WITH DIAERESIS, its Unicode codepoint.
If your terminal is configured correctly, you can just use print and Python will encode the Unicode value to your terminal codec, resulting in your terminal display giving you the non-ASCII glyphs:
>>> u'Fortuna Düsseldorf'
u'Fortuna D\xfcsseldorf'
>>> print u'Fortuna Düsseldorf'
Fortuna Düsseldorf
If your terminal is configured for UTF-8, you can also write the UTF-8 bytes directly to your terminal, after encoding explicitly:
>>> u'Fortuna Düsseldorf'.encode('utf8')
'Fortuna D\xc3\xbcsseldorf'
>>> print u'Fortuna Düsseldorf'.encode('utf8')
Fortuna Düsseldorf
The alternative is for you to upgrade to Python 3; there, repr() only uses escape sequences for codepoints that have no printable glyphs (control codes, reserved codepoints, surrogates, and so on; if the codepoint is not a space but falls in a C* or Z* general category, it is escaped). The new ascii() function still gives you the Python 2 repr() behaviour.
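A quick way to convince yourself that the escaped representation and the glyph really are the same value (a sketch; any Python 2 interpreter with a properly configured terminal):
>>> u'Fortuna D\xfcsseldorf' == u'Fortuna Düsseldorf'
True
>>> len(u'Fortuna D\xfcsseldorf')
18
Both literals parse to the identical 18-character string; only their spellings in source code differ.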

What is actually happening when I encode a string into bytes? [duplicate]

What's a Python bytestring?
All I can find are topics on how to encode to bytestring or decode to ASCII or UTF-8. I'm trying to understand how it works under the hood. In a normal ASCII string, it's an array or list of characters, and each character represents an ASCII value from 0-255, so that's how you know what character is represented by the number. In Unicode, it's the 8- or 16-bit representation for the character that tells you what character it is.
So what is a bytestring? How does Python know which characters to represent as what? How does it work under the hood? Since you can print or even return these strings and it shows you the string representation, I don't quite get it...
Ok, so my point is definitely getting missed here. I've been told that it's an immutable sequence of bytes without any particular interpretation.
A sequence of bytes.. Okay, let's say one byte:
'a'.encode() returns b'a'.
Simple enough. Why can I read the a?
Say I get the ASCII value for a, by doing this:
printf "%d" "'a"
It returns 97. Okay, good, the integer value for the ASCII character a. If we interpret 97 as ASCII, say in a C char, then we get the letter a. Fair enough. If we convert the byte representation to bits, we get this:
01100001
2^0 + 2^5 + 2^6 = 97. Cool.
So why is 'a'.encode() returning b'a' instead of 01100001??
If it's without a particular interpretation, shouldn't it be returning something like b'01100001'?
It seems like it's interpreting it like ASCII.
Someone mentioned that it's calling __repr__ on the bytestring, so it's displayed in human-readable form. However, even if I do something like:
with open('testbytestring.txt', 'wb') as f:
    f.write(b'helloworld')
It will still insert helloworld as a regular string into the file, not as a sequence of bytes... So is a bytestring in ASCII?
It is a common misconception that text is ASCII or UTF-8 or Windows-1252, and therefore bytes are text.
Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes (JPEG, PNG, SVG) and likewise many ways to encode text (ASCII, UTF-8, or Windows-1252).
Once encoding has happened, bytes are just bytes. Bytes are not images anymore; they have forgotten the colors they mean; although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember whether they were images or text at all. Only out of band knowledge (filename, media headers, etcetera) can guess what those bytes should mean, and even that can be wrong (in case of data corruption).
So, in Python (Python 3), we have two types for things that might otherwise look similar. For text, we have str, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes those might be, since letters are not bytes. We also have bytestring, which doesn't know if it's text or images or any other kind of data.
The two types are superficially similar, since they are both sequences of things, but the things that they are sequences of is quite different.
Implementationally, str is stored in memory as UCS-?, where the ? is implementation-defined: it may be UCS-4, UCS-2, or UCS-1, depending on compile-time options and on which codepoints are present in the represented string.
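You can observe that flexible storage indirectly (a sketch for CPython 3.3+; the exact byte counts are implementation details and will vary):
import sys

# The per-character width is chosen by the widest codepoint in the string.
print(sys.getsizeof('abcd'))            # all ASCII: 1 byte per character
print(sys.getsizeof('abc\u03b1'))       # needs UCS-2: 2 bytes per character
print(sys.getsizeof('abc\U0001f600'))   # needs UCS-4: 4 bytes per character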
"But why"?
Some things that look like text are actually defined in other terms. A really good example of this are the many Internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFCs. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:
2.3. Terminal Values
Rules resolve into a string of terminal values, sometimes called
characters. In ABNF, a character is merely a non-negative integer.
In certain contexts, a specific mapping (encoding) of values into a
character set (such as ASCII) will be specified.
This distinction is important, because it's not possible to send text over the internet; the only thing you can do is send bytes. Saying "text, but in 'foo' encoding" makes the format that much more complex, since clients and servers now need to somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway, and only a convenience for implementers. Neither the server owners nor the end users are ever interested in reading the words Transfer-Encoding: chunked, so long as both the server and the browser understand it correctly.
By comparison, when working with text, you don't really care how it's encoded. You can express the "Heävy Mëtal Ümlaüts" any way you like, except "Heδvy Mλtal άmlaόts"
The distinct types thus give you a way to say "this value 'means' text" or "bytes".
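A small demonstration of that split (a sketch, Python 3): the same text becomes different byte sequences under different encodings, so the text and any one encoding of it are distinct values.
>>> s = 'Heävy'               # str: a sequence of Unicode codepoints
>>> s.encode('utf-8')         # one possible byte encoding
b'He\xc3\xa4vy'
>>> s.encode('latin-1')       # a different byte encoding of the same text
b'He\xe4vy'
>>> len(s), len(s.encode('utf-8'))
(5, 6)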
Python does not know how to represent a bytestring. That's the point.
When you output a character with value 97 into pretty much any output window, you'll get the character 'a' but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.
Your piece about .txt files shows you have misunderstood what is happening. You see, plain text files don't have an encoding either. They're just a series of bytes. Those bytes get translated into letters by the text editor, but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
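A sketch of that: the same bytes come out as different letters depending on which decoding the reader happens to pick (Python 3):
>>> data = 'ž'.encode('utf-8')
>>> data
b'\xc5\xbe'
>>> data.decode('utf-8')      # the intended reading
'ž'
>>> data.decode('latin-1')    # a different editor's guess
'Å¾'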
As the name implies, a Python 3 bytestring (or simply a str in Python 2.7) is a string of bytes. And, as others have pointed out, it is immutable.
It is distinct from a Python 3 str (or, more descriptively, a unicode in Python 2.7), which is a string of abstract Unicode characters (a.k.a. UTF-32, though Python 3 adds fancy compression under the hood to reduce the actual memory footprint, similar to UTF-8, perhaps even in a more general way).
There are essentially three ways of "interpreting" these bytes. You can look at the numeric value of an element, like this:
>>> ord(b'Hello'[0]) # Python 2.7 str
72
>>> b'Hello'[0] # Python 3 bytestring
72
Or you can tell Python to emit one or more elements to the terminal (or a file, device, socket, etc.) as 8-bit characters, like this:
>>> print b'Hello'[0] # Python 2.7 str
H
>>> import sys
>>> sys.stdout.buffer.write(b'Hello'[0:1]) and None; print() # Python 3 bytestring
H
As Jack hinted at, in this latter case it is your terminal interpreting the character, not Python.
Finally, as you have seen in your own research, you can also get Python to interpret a bytestring. For example, you can construct an abstract unicode object like this in Python 2.7:
>>> u1234 = unicode(b'\xe1\x88\xb4', 'utf-8')
>>> print u1234.encode('utf-8') # if terminal supports UTF-8
ሴ
>>> u1234
u'\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<type 'unicode'>
>>> len(u1234)
1
>>>
Or like this in Python 3:
>>> u1234 = str(b'\xe1\x88\xb4', 'utf-8')
>>> print (u1234) # if terminal supports UTF-8 AND python auto-infers
ሴ
>>> u1234.encode('unicode-escape')
b'\\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<class 'str'>
>>> len(u1234)
1
(And I am sure that the amount of syntax churn between Python 2.7 and Python 3 around bytestrings, strings, and Unicode had something to do with the continued popularity of Python 2.7. I suppose that when Python 3 was invented they didn't yet realize that everything would become UTF-8, and therefore all the fuss about abstraction was unnecessary.)
But the Unicode abstraction does not happen automatically if you don't want it to. The point of a bytestring is that you can directly get at the bytes. Even if your string happens to be a UTF-8 sequence, you can still access bytes in the sequence:
>>> len(b'\xe1\x88\xb4')
3
>>> b'\xe1\x88\xb4'[0]
'\xe1'
And this works in both Python 2.7 and Python 3, with the difference being that in Python 2.7 you have str, so indexing yields a one-byte string ('\xe1', as shown above), while in Python 3 you have a bytestring, so indexing yields the integer value of that byte (225).
You can also do other wonderful things with bytestrings, like knowing if they will fit in a reserved space within a file, sending them directly over a socket, calculating the HTTP content-length field correctly, and avoiding Python Bug 8260. In short, use bytestrings when your data is processed and stored in bytes.
Bytes objects are immutable sequences of single bytes. The documentation has a very good explanation of what they are and how to use them.
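For example, immutability means in-place modification fails, while the mutable counterpart bytearray accepts it (a sketch, Python 3):
>>> data = b'hello'
>>> data[0] = 72
Traceback (most recent call last):
  ...
TypeError: 'bytes' object does not support item assignment
>>> mutable = bytearray(data)
>>> mutable[0] = ord('H')     # bytearray does support item assignment
>>> bytes(mutable)
b'Hello'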

Lxml trying to extract data with windows-1250 characters

Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a webpage which contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text,parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this:
\xc2\x9ea
instead of the characters. Then I have the problem of storing it into the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8, without success. Can I convert these codes to the original characters somehow?
Try (Python 2; note that the pair '\xc2\x9e' is the UTF-8 encoding of the codepoint U+009E, so that layer has to be undone before reinterpreting the raw byte as windows-1250):
text = "\xc2\x9ea"
raw = text.decode('utf-8').encode('latin-1')  # back to the raw bytes '\x9ea'
print raw.decode('windows-1250').encode('utf-8')
Output:
ža
And save the nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead extract the character ordinals and work with those:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
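One way to write that guard (a sketch, Python 3; the helper name strict_ord is made up, and it sidesteps the hex round-trip by building the bytes directly):
def strict_ord(ch):
    # Refuse any character that doesn't fit in a single byte, so a
    # multi-byte codepoint can't be silently smeared across the output.
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError('character %r does not fit in one byte' % ch)
    return n

raw = bytes(strict_ord(ch) for ch in '\x9ea')   # b'\x9ea'
print(raw.decode('cp1250'))                     # ža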

Python - Unicode to ASCII conversion

I am unable to convert the following Unicode to ASCII without losing data:
u'ABRA\xc3O JOS\xc9'
I tried encode and decode and they won’t do it.
Does anyone have a suggestion?
The Unicode characters u'\xc3' and u'\xc9' do not have any corresponding ASCII values. So, if you don't want to lose data, you have to encode that data in some way that's valid as ASCII. Options include:
>>> print s.encode('ascii', errors='backslashreplace')
ABRA\xc3O JOS\xc9
>>> print s.encode('ascii', errors='xmlcharrefreplace')
ABRAÃO JOSÉ
>>> print s.encode('unicode-escape')
ABRA\xc3O JOS\xc9
>>> print s.encode('punycode')
ABRAO JOS-jta5e
All of these are ASCII strings, and contain all of the information from your original Unicode string (so they can all be reversed without loss of data), but none of them are all that pretty for an end-user (and none of them can be reversed just by decode('ascii')).
See str.encode, Python Specific Encodings, and Unicode HOWTO for more info.
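For instance, the escaped forms really do round-trip (a sketch, Python 2):
>>> s = u'ABRA\xc3O JOS\xc9'
>>> s.encode('unicode-escape').decode('unicode-escape') == s
True
>>> s.encode('punycode').decode('punycode') == s
True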
As a side note, when some people say "ASCII", they really don't mean "ASCII" but rather "any 8-bit character set that's a superset of ASCII" or "some particular 8-bit character set that I have in mind". If that's what you meant, the solution is to encode to the right 8-bit character set:
>>> s.encode('utf-8')
'ABRA\xc3\x83O JOS\xc3\x89'
>>> s.encode('cp1252')
'ABRA\xc3O JOS\xc9'
>>> s.encode('iso-8859-15')
'ABRA\xc3O JOS\xc9'
The hard part is knowing which character set you meant. If you're writing both the code that produces the 8-bit strings and the code that consumes it, and you don't know any better, you meant UTF-8. If the code that consumes the 8-bit strings is, say, the open function or a web browser that you're serving a page to or something else, things are more complicated, and there's no easy answer without a lot more information.
I found this library very useful: https://pypi.org/project/Unidecode/
>>> from unidecode import unidecode
>>> unidecode('ko\u017eu\u0161\u010dek')
'kozuscek'
>>> unidecode('30 \U0001d5c4\U0001d5c6/\U0001d5c1')
'30 km/h'
>>> unidecode('\u5317\u4EB0')
'Bei Jing '
I needed to calculate the MD5 hash of a unicode string received in an HTTP request. MD5 was giving UnicodeEncodeError, and Python's built-in encoding methods didn't work because they replace the characters in the string with their corresponding hex escapes, thus changing the MD5 hash.
So I came up with the following code, which keeps the string intact while converting from unicode.
unicode_string = ''.join([chr(ord(x)) for x in unicode_string]).strip()
This removes the unicode part from the string and keeps all the data intact. (Note that this only works when every codepoint in the string is below 256; in Python 2, chr() raises ValueError otherwise.)

Python UTF-8 conversion problem

In my database, I have stored some UTF-8 characters. E.g. 'α' in the "name" field
Via Django ORM, when I read this out, I get something like
>>> p.name
u'\xce\xb1'
>>> print p.name
Î±
I was hoping for 'α'.
After some digging, I found that if I do
>>> a = 'α'
>>> a
'\xce\xb1'
So when Python is trying to display '\xce\xb1' I get alpha, but when it's trying to display u'\xce\xb1', it's double encoding?
Why did I get u'\xce\xb1' in the first place? Is there a way I can just get back '\xce\xb1'?
Thanks. My UTF-8 and unicode handling knowledge really needs some help...
Try putting the unicode signature u before your string, e.g. u'YOUR_ALFA_CHAR', and revise your database encoding, because Django always supports UTF-8.
What you seem to have is the individual bytes of a UTF-8 encoded string interpreted as unicode codepoints. You can "decode" your string out of this strange form with:
p.name = ''.join(chr(ord(x)) for x in p.name)
or perhaps
p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8')
One way to get your strings "encoded" into this form is
''.join(unichr(ord(x)) for x in '\xce\xb1')
although I have a feeling your strings actually got in this state by different components of your system disagreeing on the encoding in use.
You will probably have to fix the source of your bad "encoding" rather than just fixing the data currently in your database. And the code above might be okay to convert your bad data once, but I would advise you don't insert this code into your Django app.
The problem is that p.name was not correctly stored and/or read in from the database.
Unicode small alpha is U+03B1, so p.name should have printed as u'\u03b1', or, on a Unicode-capable terminal, as the actual alpha symbol itself in quotes. Note the difference between u'\xce\xb1' and u'\u03b1': the former is a two-character string and the latter a single-character string. In fact '\xce\xb1' is exactly the UTF-8 encoding of U+03B1, so somewhere along the way those two bytes were read back as if each byte were its own codepoint.
You can turn any byte sequence into internal unicode representation through the decode function:
print '\xce\xb1'.decode('utf-8')
This allows you to import a byte sequence from any source and then turn it into a Python unicode string.
Reference: http://docs.python.org/library/stdtypes.html#string-methods
Try converting the encoding with p.name.encode('latin-1'). Here's a demonstration:
>>> print u'\xce\xb1'
Î±
>>> print u'\xce\xb1'.encode('latin-1')
α
>>> print '\xce\xb1'
α
>>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
True
For more information, see str.encode and Standard Encodings.
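Putting the pieces together, a one-off repair for data already in this state is a round trip through Latin-1 (a sketch, Python 2):
>>> bad = u'\xce\xb1'    # UTF-8 bytes that were mis-decoded as codepoints
>>> good = bad.encode('latin-1').decode('utf-8')
>>> good
u'\u03b1'
>>> print good           # on a UTF-8 capable terminal
α
As the other answers note, fix the component that produced the bad values rather than leaving a conversion like this in your Django app.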
