EOF while scanning triple-quoted string literal - python

i've looked on the web and here but i didn't find an answer :
here is my code
zlib.decompress("""
xワᆳヤ=ラᄇHナs~Ʀᄑç\ムîà
Z#ÑÁÔQÇlxÇÆïPP~ýVãì゙M6ÛÐ|ê֭ᄁᄂヤ=)}éÓUe﬿ö3ᄎᄌú"}ʿïÿ÷1þ8ñ́U÷ᄏñíLÒVi:`ᄈᄎL!Ê҆p6-%Fë^ヘ÷à,Q.K!ユô`ÄA!ÑêweÌ ÊÚAロYøøÂjôóᅠÂcñ䊧fᆴùテúN :nüzAÝ7%ᄌcdUタᄌ3ôPۂタlンyHᆲᄑ$/yzᄒíàヌ'ÕÓ&`|S!<'ᄂ÷Zļᄐ2ホモ;ニ(ÅÛfb!úü$ナテᄒ,9ßhàPᄎᄄێフÑbØὛホQᄍ-Ü}(n;ᄄホLヤ\^ï9ᆭᄍラDdВéÞ|åPOGᄂÐÙ%â&AÔë)ÎTÐC ᄐïc枢í%Èï!フᄋëiq*ᄌVKÐNᄡ[ᄁfOq{OᆭÆÊ,0GᄂリmtツᄈOᄌΥ$#îヘqbYᄆメUニᄉÞáP`
ヨ×ᆵÃPwaレǩâ×)ハFcêÚ=!Åöᄊ
)AFñᄈ/cMᄃ!NóNΈór?pàÜòXw
Bvæ0ïçIÉoマ>5pᆭ-ØWÚNᄆùFᄆØPçÃþdᅠ;ル1[Oᄈホ~6ツᄈᆬŕìᄄޠ=øð#ネV﾿ᄅ)÷%ユÜib{HᄆKŅVlDCテîfÑWì÷ìáár.ワîv﾿<dᄎn~ú*ÁÕ7ýá}EsYᆵWᄂÈ:R×ãQңメ?Ø1vヘäツ~èR1ᄉÜ*ᄡónAᆬjmNoツユᄈÌښᆬf[8ᆭÛ>゙OWラ|ÌbDᄁÖ녡M=Ð÷èâミム'ÂÝÐ ;ë mᄎQÂäԤۢ:モᄆdᄎᄑLȂ1ᄈ_÷YZᆲNòÛ â\ロxÐlݵᆵムᆱøm5Ëá=ïoÍlMᆪ[×#Ypᅠトx[ÉÊyæツoモナz)ᆭᄀÝÏìò
""")
so it was a string that i got by zlib.compress an other string.
How can i decompress this string ?
Regards
Bussiere

The zlib.decompress should work if you pass it the output of zlib.compress.
Since the compressed string is really not text it is a binary string. It will not play friendly with displaying to the terminal as you have found.
You can use base64 encoding to give you something safe to drop into unittests, paste into code etc.
>>> import zlib
>>> a = zlib.compress('fooo')
>>> b = a.encode('base64')
>>> b
'eJxLy8/PBwAENgG0\n'
>>> c = 'eJxLy8/PBwAENgG0\n'.decode('base64')
>>> zlib.decompress(c)
'fooo'
>>> zlib.decompress(a)
'fooo'
a as an output is ok for binary transmission or saving to a file.
b is friendly to use with the clipboard, send in email, etc.

I would not have it in that representation. Use repr() in the other code to generate an ASCII-clean representation, and use that instead. Then just look for triple quotes in the result and break them up.

Related

Python - Issues with Unicode String from API Call

I'm using Python to call an API that returns the last name of some soccer players. One of the players has a "ć" in his name.
When I call the endpoint, the name prints out with the unicode attached to it:
>>> last_name = (json.dumps(response["response"][2]["player"]["lastname"]))
>>> print(last_name)
"Mitrovi\u0107"
>>> print(type(last_name))
<class 'str'>
If I were to take copy and paste that output and put it in a variable on its own like so:
>>> print("Mitrovi\u0107")
Mitrović
>>> print(type("Mitrovi\u0107"))
<class 'str'>
Then it prints just fine?
What is wrong with the API endpoint call and the string that comes from it?
Well, you serialise the string with json.dumps() before printing it, that's why you get a different output.
Compare the following:
>>> print("Mitrović")
Mitrović
and
>>> print(json.dumps("Mitrović"))
"Mitrovi\u0107"
The second command adds double quotes to the output and escapes non-ASCII chars, because that's how strings are encoded in JSON. So it's possible that response["response"][2]["player"]["lastname"] contains exactly what you want, but maybe you fooled yourself by wrapping it in json.dumps() before printing.
Note: don't confuse Python string literals and JSON serialisation of strings. They share some common features, but they aren't the same (eg. JSON strings can't be single-quoted), and they serve a different purpose (the first are for writing strings in source code, the second are for encoding data for sending it accross the network).
Another note: You can avoid most of the escaping with ensure_ascii=False in the json.dumps() call:
>>> print(json.dumps("Mitrović", ensure_ascii=False))
"Mitrović"
Count the number of characters in your string & I'll bet you'll notice that the result of json is 13 characters:
"M-i-t-r-o-v-i-\-u-0-1-0-7", or "Mitrovi\\u0107"
When you copy "Mitrovi\u0107" you're coping 8 characters and the '\u0107' is a single unicode character.
That would suggest the endpoint is not sending properly json-escaped unicode, or somewhere in your doc you're reading it as ascii first. Carefully look at exactly what you're receiving.

Unicode strings to byte strings without the addition of backslashes

I'm learning python by doing the python challenge using python3.3 and I'm on question eight. There's a comment in the markup providing you with two bz2-compressed unicode strings outputting byte strings, one for username and one for password. There's also a link where you need the decompressed credentials to enter.
One way to easily solve this is just to manually copy the strings and assign it to two variables as byte strings and then just use the bz2 library to decompress it:
>>>un=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(bz2.decompress(un).decode('utf-8'))
huge
But that's not for me since I want the answer by just running my python file.
My code like this:
>>>import bz2, re, requests
>>>url = requests.get('http://www.pythonchallenge.com/pc/def/integrity.html')
>>>un = re.findall(r'un: \'(.*)\'',url.text)[0]
>>>correct=b'BZh91AY&SYA\xaf\x82\r\x00\x00\x01\x01\x80\x02\xc0\x02\x00 \x00!\x9ah3M\x07<]\xc9\x14\xe1BA\x06\xbe\x084'
>>>print(un,un is correct,sep='\n')
b'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'
False
The problem is that when it converts from unicode string to byte string the escaping backslash gets added so that it cannot be read by bz2 module. I have tried everything I know and what got up when I searched.
How do I get it from unicode to byte so that it doesn't get changed?
Here it is a solution:
import urllib
import bz2
import re
def decode(line):
out = re.search(r"\'(.*?)\'",''.join(line)).group()
out = eval("b%s" % out)
return bz2.decompress(out)
#read lines that contain the encoded message
page = urllib.urlopen('http://www.pythonchallenge.com/pc/def/integrity.html').readlines()[20:22]
print "Click on the bee and insert: "
User_Name = decode(page[0])
print "User Name is: " + User_Name
Password = decode(page[1])
print "Password is: " + Password
The backslashes are present in the HTML source, so it's not surprising that the requests module preserves them. I don't have requests installed on my Python 3 environment, so I haven't been able to exactly replicate your situation, but it looks to me like if you start capturing the surrounding ' characters, you can use ast.literal_eval to parse the character sequence into a bytes array:
>>> test
"'BZh91AY&SYA\\xaf\\x82\\r\\x00\\x00\\x01\\x01\\x80\\x02\\xc0\\x02\\x00 \\x00!\\x9ah3M\\x07<]\\xc9\\x14\\xe1BA\\x06\\xbe\\x084'"
>>> import ast
>>> res = ast.literal_eval("b%s" % test)
>>> import bz2
>>> len(bz2.decompress(res))
4
There are probably other ways, but why not use Python's built in knowledge that the byte sequence b'\\xaf' can be parsed into a bytes array?

Convert Unicode string to UTF-8, and then to JSON

I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:
>>> unicode('©', 'utf-8').encode('utf-8')
'\xc2\xa9'
Note that I’m using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want — a string consisting of two separate code points: U+00C2 and U+00A9. (When UTF-8-decoded, it gives back the original string, '\xA9'.)
Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:
>>> import json; json.dumps('\xc2\xa9')
'"\\u00a9"'
Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.
TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?
If you really want "\u00c2\u00a9" as the output, give json a Unicode string as input.
>>> print json.dumps(u'\xc2\xa9')
"\u00c2\u00a9"
You can generate this Unicode string from the raw bytes:
s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)
I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.

A Confusing Case in Python

Im new to pyqt and there is something i didnt get.I made an app using hashlib and of course gui with pyqt.
self.pushButton.connect(self.pushButton,QtCore.SIGNAL("clicked()"),self.clickedButton)
and click button:
def clickedButton(self):
if self.comboBox.currentText() == "MD5":
self.MD5(self.lineEdit.text())
and MD5:
def MD5(self,text):
self.hash = hashlib.md5(text).hexdigest()
self.textEdit.setText(self.hash)
the result for "hello":
a6f145a01ad0127e555c051d15806eb5
No error.It looks ok.But trying same thing on python shell:
>>> print hashlib.md5("hello").hexdigest()
5d41402abc4b2a76b9719d911017c592
>>>
is this an error or why am i getting different results?
The problem is that you are passing to md5 a QString, not a regular Python string. From some experimentation, I can see that both a regular string and a unicode string in Python produce the same result - from here i can tell that it tries to convert the unicode version to a sequence of "narrow" characters using the ascii codec.
>>> print hashlib.md5("hello").hexdigest()
5d41402abc4b2a76b9719d911017c592
>>> print hashlib.md5(u"hello").hexdigest()
5d41402abc4b2a76b9719d911017c592
The QString, instead, gets hashed differently, as in your program:
>>> a=PyQt4.QtCore.QString("hello")
>>> print hashlib.md5(a).hexdigest()
a6f145a01ad0127e555c051d15806eb5
although its UTF-8 or Latin 1 representation (both of which are the same as the output of the ascii codec, since we are dealing only with alphabetic characters) are hashed the same way as Python strings:
>>> print hashlib.md5(a.toUtf8()).hexdigest()
5d41402abc4b2a76b9719d911017c592
>>> print hashlib.md5(a.toLatin1()).hexdigest()
5d41402abc4b2a76b9719d911017c592
Probably what's happening here is that the hashing algorithm is working on the internal representation of the QString, which is UTF-16, which obviously differs from the UTF-8 representation, producing different outputs.
Thus, the lesson here is that before performing any hashing on text you have to choose explicitly its encoding before passing it to the hashing function - which works just with bytes - since there ain't no such thing as plain text.
Edit: it's not working on the UTF-16 representation (otherwise you would get the same result with
>>> print hashlib.md5(u'hello'.encode('utf_16')).hexdigest()
25af7f84a93a6cf5cb00967c60910c7d
) but on something else; still, the point is that hashlib isn't thought to work with QString, so it produces some "strange" output. Again, before using it convert the QString to a narrow string in some adequate encoding.

Python UTF-8 conversion problem

In my database, I have stored some UTF-8 characters. E.g. 'α' in the "name" field
Via Django ORM, when I read this out, I get something like
>>> p.name
u'\xce\xb1'
>>> print p.name
α
I was hoping for 'α'.
After some digging, I think if I did
>>> a = 'α'
>>> a
'\xce\xb1'
So when Python is trying to display '\xce\xb1' I get alpha, but when it's trying to display u'\xce\xb1', it's double encoding?
Why did I get u'\xce\xb1' in the first place? Is there a way I can just get back '\xce\xb1'?
Thanks. My UTF-8 and unicode handling knowledge really need some help...
Try to put the unicode signature u before your string, e.g. u'YOUR_ALFA_CHAR' and revise your database encoding, because Django always supports UTF-8 .
What you seem to have is the individual bytes of a UTF-8 encoded string interpreted as unicode codepoints. You can "decode" your string out of this strange form with:
p.name = ''.join(chr(ord(x)) for x in p.name)
or perhaps
p.name = ''.join(chr(ord(x)) for x in p.name).decode('utf8')
One way to get your strings "encoded" into this form is
''.join(unichr(ord(x)) for x in '\xce\xb1')
although I have a feeling your strings actually got in this state by different components of your system disagreeing on the encoding in use.
You will probably have to fix the source of your bad "encoding" rather than just fixing the data currently in your database. And the code above might be okay to convert your bad data once, but I would advise you don't insert this code into your Django app.
The problem is that p.name was not correctly stored and/or read in from the database.
Unicode small alpha is U+03B1 and p.name should have printed as u'\x03b1' or if you were using a Unicode capable terminal the actual alpha symbol itself may have been printed in quotes. Note the difference between u'\xce\xb1' and u'\xceb1'. The former is a two character string and the latter in a single character string. I have no idea how the '03' byte of the UTF-8 got translated into 'CE'.
You can turn any byte sequence into internal unicode representation through the decode function:
print '\xce\xb1'.decode('utf-8')
This allows you to import a byte sequence from any source and then turn it into a Python unicode string.
Reference: http://docs.python.org/library/stdtypes.html#string-methods
Try converting the encoding with p.name.encode('latin-1'). Here's a demonstration:
>>> print u'\xce\xb1'
α
>>> print u'\xce\xb1'.encode('latin-1')
α
>>> print '\xce\xb1'
α
>>> '\xce\xb1' == u'\xce\xb1'.encode('latin1')
True
For more information, see str.encode and Standard Encodings.

Categories