problem with unicode decoding

problem with unicode decoding - python

This is funny. I am trying to read geographic lookup data from OpenStreetMap. The code that performs the query looks like this:
params = urllib.urlencode({'q': ",".join([e for e in full_address]), 'format': "json", "addressdetails" : "1"})
query = "http://nominatim.openstreetmap.org/search?%s" % params
print query
time.sleep(5)
response = json.loads(unicode(urllib.urlopen(query).read(), "UTF-8"), encoding="UTF-8")
print response
The query for Zürich is correctly URL-encoded from UTF-8 data. No surprises here.
http://nominatim.openstreetmap.org/search?q=Z%C3%BCrich%2CSWITZERLAND&addressdetails=1&format=json
When I print the response, the u with umlaut is encoded latin1 (0xFC)
[{u'display_name': u'Z\xfcrich, Bezirk Z\xfcrich, Z\xfcrich, Schweiz, Europe', u'place_id': 588094, u'lon': 8.540443
but that's nonsense because openstreetmap returns the JSON data in UTF-8
Connecting to nominatim.openstreetmap.org (nominatim.openstreetmap.org)|128.40.168.106|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Wed, 26 Jan 2011 13:48:33 GMT
Server: Apache/2.2.14 (Ubuntu)
Content-Location: search.php
Vary: negotiate
TCN: choice
X-Powered-By: PHP/5.3.2-1ubuntu4.7
Access-Control-Allow-Origin: *
Content-Length: 3342
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: application/json; charset=UTF-8
Length: 3342 (3.3K) [application/json]
which is also confirmed by the file contents, and then I explicitly say that it's UTF-8 both at read and json parsing.
What's going on here ?
EDIT : apparently it's the json.loads that screws up somehow.

"When I go and print the response, the u with umlaut is encoded latin1 (0xFC)"
You are just misinterpreting the output. It's a unicode string (you can tell by the u prefix), so there's no encoding "attached" to it: the \xfc means the code point with number 0xFC, which happens to be the u with umlaut (see http://www.fileformat.info/info/unicode/char/fc/index.htm). It looks like latin1 because the numbering of the first 256 Unicode code points coincides with the latin1 encoding.
In short, you did everything right: you have a unicode object with the right content, which is agnostic to encodings. You choose the encoding when you use that content for output somewhere, either by calling unicodestr.encode("utf-8") or by using codecs; see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data
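A minimal sketch of the point above, written for Python 3 (an assumption; the question uses Python 2, where the text type is called unicode). The JSON payload is a made-up stand-in for the Nominatim response. It shows that decoding UTF-8 bytes yields text whose u with umlaut is code point U+00FC, and that the bytes only differ once an encoding is chosen for output.

```python
import json

# Hypothetical payload standing in for the response body (UTF-8 bytes).
raw = '[{"display_name": "Z\u00fcrich"}]'.encode("utf-8")

# Decode the bytes to text first, then parse the JSON.
response = json.loads(raw.decode("utf-8"))
name = response[0]["display_name"]

# The string holds code points, not bytes: U+00FC is the u with umlaut.
code_point = ord(name[1])

# ascii() gives an escaped form, similar to what Python 2's repr() prints.
escaped = ascii(name)

# An encoding is chosen only when the text is turned back into bytes.
as_utf8 = name.encode("utf-8")
as_latin1 = name.encode("latin-1")

print(escaped)
print(as_utf8, as_latin1)
```

The same text produces two different byte sequences, which is why the \xfc in the escaped form says nothing about how the data was encoded on the wire.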

The output is fine. When you print to the console, Python encodes a unicode string only when you print the string itself. If you print a list of unicode strings, each one is shown on the console as its repr():
>>> a=u'á'
>>> a
u'\xe1'
>>> print a
á
>>> [a]
[u'\xe1']
>>> print [a]
[u'\xe1']
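The same rule can be checked in Python 3 (an assumption; the session above is Python 2). There, repr() of a string no longer escapes printable non-ASCII characters, so this sketch uses ascii() to get the escaped form. The point it illustrates is unchanged: converting a list to text uses the repr() of each element, not its str().

```python
a = "\u00e1"  # the character a with acute accent

# What print(a) would write: the string itself.
direct = str(a)

# What print([a]) would write: the list formats each element with repr().
in_list = str([a])

# ascii() escapes non-ASCII characters, similar to Python 2's repr().
escaped = ascii([a])

print(direct)
print(in_list)
print(escaped)
```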

Related

Send "\r\n" symbols in python requests data

Everything ok with curl:
curl -v "http://user:password@localhost/_control.html" -d $'data1=1\r\n'
I tried this way in python:
url = "http://localhost/_control.html"
payload = {'data1': '1\r\n'}
headers = {"Content-type": "application/x-www-form-urlencoded", "Accept": "text/plain"}
r = requests.post(url, data=payload, headers=headers, auth=('user', 'password'))
But it doesn't work. Content-length in this case is 13 instead of 9 (with curl request)
Is it possible to send same data (with \r\n at the end) using python requests?

The \r and \n characters are being URL-encoded, as they should be, because application/x-www-form-urlencoded data cannot contain those characters directly:
Non-alphanumeric characters are replaced by %HH, a percent sign and two hexadecimal digits representing the ASCII code of the character. Line breaks are represented as "CR LF" pairs (i.e., %0D%0A).
Logging the request sent from Python using this technique, we can see that it correctly sends these 13 bytes:
data1=1%0D%0A
In the case of curl, its manual page makes no mention of encoding of whatever you pass to -d/--data, so presumably you're expected to encode the string yourself before passing it to curl. We can confirm this with --trace-ascii -:
=> Send data, 9 bytes (0x9)
0000: data1=1
The \r\n pair doesn't show up clearly here, but we can infer that it's not encoded because of the byte count.
In short, the request you are sending with that curl command is not valid to begin with.
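The byte counts can be reproduced with the standard library alone. This is a small Python 3 sketch (an assumption; the question does not state a version) that form-encodes the same payload with urllib.parse.urlencode and compares its length with the raw string that curl sends.

```python
from urllib.parse import urlencode

payload = {"data1": "1\r\n"}

# Form-encoding percent-escapes CR and LF, giving the 13-byte body.
encoded = urlencode(payload)

# The unencoded string, as curl -d sends it, is 9 bytes long.
raw = "data1=1\r\n"

print(encoded, len(encoded))
print(len(raw))
```

If the server really does expect the unencoded bytes, passing a prebuilt string instead of a dict as data should, as far as I know, make requests send it as-is rather than form-encoding it.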

python assign literal value of a dictionary to key of another dictionary

I am trying to form a web payload for a particular request body but unable to get it right. What I need is to pass my body data as below
data={'file-data':{"key1": "3","key2": "6","key3": "8"}}
My complete payload request looks like this
payload={url,headers, data={'file-data':{"key1": "3","key2": "6","key3": "8"}},files=files}
However, when I pass this, python tries to parse each individual key value and assigns to the 'file-data' key like this
file-data=key1
file-data=key2
file-data=key3
and so on for as many keys I pass within the nested dictionary. The requirement however, is to pass the entire dictionary as a literal content like this(without splitting the values by each key):
file-data={"key1": "3","key2": "6","key3": "8"}
The intended HTTP trace should thus ideally look like this:
POST /sample_URL/ HTTP/1.1
Host: sample_host.com
Authorization: Basic XYZ=
Cache-Control: no-cache
Content-Type: multipart/form-data; boundary=----UVWXXXX
------WebKitFormBoundaryXYZ
Content-Disposition: form-data; name="file-data"
{"key1": "3","key2": "6","key3":"8" }
------WebKitFormBoundaryMANZXC
Content-Disposition: form-data; name="file"; filename=""
Content-Type:
------WebKitFormBoundaryBNM--
I want to use this as part of the payload for a POST request (using the python requests library). Any suggestions are appreciated.
Edit1: To provide more clarity, the API definition is this:
Body
Type: multipart/form-data
Form Parameters
file: required (file)
The file to be uploaded
file-data: (string)
Example:
{
"key1": "3",
"key2": "6",
"key3": "8"
}
The python code snippet I used(after checking suggestions) is this:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from base64 import encodestring
import requests
url = "https://sample_url/upload"
filepath='mypath'
filename='logo.png'
f=open(filepath+'\\'+filename)
filedata={'file-data':"{'key1': '3','key2': '6','key3': '8'}"}
# Build the Basic auth value; encodestring appends a newline, which is stripped
base64string = encodestring('%s:%s' % ('user', 'password')).replace('\n', '')
headers={'Content-type': 'multipart/form-data','Authorization':'Basic %s' % base64string}
r = requests.post(url=url,headers=headers,data=filedata,files={'file':f})
print r.text
The error I get now is still the same as shown below:
{"statusCode":400,"errorMessages":[{"severity":"ERROR","errorMessage":"An exception has occurred"]
It also says that some entries are either missing or incorrect. Note that I have tried passing the file parameter after opening it in binary mode as well but it throws the same error message
I got the HTTP trace printed out via python too and it looks like this:
send: 'POST sample_url HTTP/1.1
Host: abc.com
Connection: keep-alive
Accept-Encoding: gzip,deflate
Accept: */*
python-requests/2.11.1
Content-type: multipart/form-data
Authorization: Basic ABCDXXX=
Content-Length: 342
--CDXXXXYYYYY
Content-Disposition:form-data; name="file-data"
{\'key1\': \'3\',\'key2\': \'6\'
,\'key3\': \'8\'}
--88cdLMNO999999
Content-Disposition: form-data; name="file";
filename="logo.png"\x89PNG\n\r\n--cbCDEXXXNNNN--

If you want to post JSON with python requests, you should NOT use data but json:
r = requests.post('http://httpbin.org/post', json={"key": "value"})
I can only guess that you are using data because of your example
payload={url,headers, data={'file-data':{"key1": "3","key2": "6","key3": "8"}},files=files}
Which is not valid Python syntax, by the way.
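Since the API definition in the question describes file-data as a string parameter inside a multipart/form-data body, another option may be to serialize the dictionary yourself and send it as an ordinary form field. The sketch below is an untested assumption about that API, not a confirmed fix; it only shows the serialization step with the standard json module. The names file_data and form_fields are made up for illustration.

```python
import json

file_data = {"key1": "3", "key2": "6", "key3": "8"}

# json.dumps produces valid JSON (double quotes), unlike a hand-written
# string with single quotes or str() of the dict.
form_fields = {"file-data": json.dumps(file_data)}

# With requests (not imported here), the call might then look like:
#   requests.post(url, data=form_fields,
#                 files={"file": open(path, "rb")},
#                 auth=("user", "password"))
# Leaving out a manual Content-Type header lets requests generate the
# multipart boundary itself.

print(form_fields["file-data"])
```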

Unicode encoding in email

I have manually created and sent myself an HTML email in Gmail. I want to reuse this HTML output to send it programmatically (using smtplib in Python).
In Gmail, I view the source, which looks like this:
Mime-Version: 1.0
Content-Type: multipart/alternative;
 boundary="--==_mimepart_57daadsdas2e101427152ee"; charset=UTF-8
----==_mimepart_57daadsdas2e101427152ee
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: quoted-printable
Hi all !
=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D
Venez d=C3=A9couvrir
My problem is that when I then try to send this content as HTML programmatically, it's not displayed correctly. I suspect it's because of unicode conversion. For example, I can't convert the characters "d=C3=A9couvrir" back to what they should be: "découvrir".
Could anyone help?

There are some MIME examples that are probably more suitable, but the simple answer from the headers is that the content is UTF-8 with quoted-printable encoding, so you can use the quopri module:
>>> import quopri
>>> quopri.decodestring('Venez d=C3=A9couvrir').decode('utf8')
'Venez découvrir'
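To connect this back to sending the mail, here is a minimal Python 3 sketch (an assumption; the question does not state a version). It decodes the quoted-printable text and wraps it in a MIMEText part, which takes care of the transfer encoding on its own. The HTML wrapper and the subject are placeholders, and actually sending the message with smtplib is left out.

```python
import quopri
from email.mime.text import MIMEText

encoded = b"Venez d=C3=A9couvrir"

# Undo the quoted-printable layer, then decode the UTF-8 bytes to text.
text = quopri.decodestring(encoded).decode("utf-8")

# Build an HTML part from the decoded text; the charset is declared
# explicitly and the library picks a transfer encoding.
msg = MIMEText("<p>" + text + "</p>", "html", "utf-8")
msg["Subject"] = "Hypothetical subject"

print(text)
print(msg["Content-Type"])
```

The idea is to hand the library decoded text rather than pasting the already-encoded source back in, which would otherwise be displayed literally.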

How to extract JSON data from a response containing a header and body?

this is my first question posed to Stack Overflow, because typically I can find the solutions to my problem here, but for this particular situation, I cannot. I am writing a Python plugin for my compiler that outputs REST calls in various languages for interaction with an API. I am authenticating with the socket and ssl modules by sending a username and password in the request body in JSON form. Upon successful authentication, the API returns a response in the following format with important response data in the body:
HTTP/1.1 200 OK
Server: Apache-Coyote/1.1
Date: Tue, 05 Feb 2013 03:36:18 GMT
Vary: Accept-Charset, Accept-Encoding, Accept-Language, Accept
Accept-Ranges: bytes
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: POST,OPTIONS,GET
Access-Control-Allow-Headers: Content-Type
Server: Restlet-Framework/2.0m5
Content-Type: text/plain;charset=ISO-8859-1
Content-Length: 94
{"authentication-token":"<token>","authentication-secret":"<secret>"}
This is probably a very elementary question for Pythonistas, given its powerful tools for String manipulation. But alas, I am a new programmer who started with Java. I would like to know what would be the best way to parse this entire response to obtain the "<token>" and "<secret>"? Should I use a search for a "{" and dump the substring into a json object? My intuition is telling me to try and use the re module, but I cannot seem to figure out how it would be used in this situation, since the pattern of the token and secret are obviously not predictable. Because I have opted to authenticate with a low-level module set, this response is one big String obtained by constructing the header and appending JSON data to it in the body, then executing the request and obtaining the response with the following code:
#Socket configuration and connection execution
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
conn = ssl.wrap_socket(sock, ca_certs = pem_file)
conn.connect((host, port))
conn.send(req)
response = conn.recv()
print(response)
The print statement outputs the first code sample. Any help or insight would be greatly appreciated!

HTTP headers are separated from the body by a \r\n\r\n sequence. Do something like:
import json
...
# Split only at the first blank line, in case the body itself contains one
(headers, js) = response.split("\r\n\r\n", 1)
data = json.loads(js)
token = data["authentication-token"]
secret = data["authentication-secret"]
You'll probably want to check the response, etc, and various libraries (e.g. requests) can do all of this a whole lot easier for you.
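A slightly more defensive version of the same idea, sketched for Python 3 (an assumption), where the socket returns bytes. The function name parse_http_json is made up, and the sample response is trimmed from the one in the question. It does not handle chunked transfer encoding or a response that arrives across several recv() calls.

```python
import json


def parse_http_json(raw, charset="iso-8859-1"):
    """Split a raw HTTP response at the first blank line and parse the body."""
    head, sep, body = raw.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no header/body separator found")
    # The status line is the first header line, e.g. b"HTTP/1.1 200 OK".
    status_parts = head.split(b"\r\n", 1)[0].split()
    if len(status_parts) < 2 or status_parts[1] != b"200":
        raise ValueError("unexpected status line")
    return json.loads(body.decode(charset))


# Sample response modeled on the one in the question, with fewer headers.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Type: text/plain;charset=ISO-8859-1\r\n"
    b"\r\n"
    b'{"authentication-token":"<token>","authentication-secret":"<secret>"}'
)

data = parse_http_json(sample)
token = data["authentication-token"]
secret = data["authentication-secret"]
print(token, secret)
```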

python: HTTP PUT with unencoded binary data

I cannot for the life of me figure out how to perform an HTTP PUT request with verbatim binary data in Python 2.7 with the standard Python libraries.
I thought I could do it with urllib2, but that fails because urllib2.Request expects its data in application/x-www-form-urlencoded format. I do not want to encode the binary data, I just want to transmit it verbatim, after the headers that include
Content-Type: application/octet-stream
Content-Length: (whatever my binary data length is)
This seems so simple, but I keep going round in circles and can't seem to figure out how.
How can I do this? (aside from open up a raw binary socket and write to it)

I found out my problem. It seems there is some obscure behavior in urllib2.Request / urllib2.urlopen() (at least in Python 2.7)
The urllib2.Request(url, data, headers) constructor seems to expect the same type of string in its url and data parameters.
I was giving the data parameter raw data from a file read() call (which in Python 2.7 returns it in the form of a 'plain' string), but my url was accidentally Unicode because I concatenated a portion of the URL from the result of another function which returned Unicode strings.
Rather than trying to "downcast" url from Unicode -> plain strings, it tried to "upcast" the data parameter to Unicode, and it gave me a codec error. (oddly enough, this happens on the urllib2.urlopen() function call, not the urllib2.Request constructor)
When I changed my function call to
# headers contains `{'Content-Type': 'application/octet-stream'}`
r = urllib2.Request(url.encode('utf-8'), data, headers)
it worked fine.

You're misreading the documentation: urllib2.Request expects the data already encoded, and for POST that usually means the application/x-www-form-urlencoded format. You are free to associate any other, binary data, like this:
import urllib2
data = b'binary-data'
r = urllib2.Request('http://example.net/put', data,
                    {'Content-Type': 'application/octet-stream'})
r.get_method = lambda: 'PUT'
urllib2.urlopen(r)
This will produce the request you want:
PUT /put HTTP/1.1
Accept-Encoding: identity
Content-Length: 11
Host: example.net
Content-Type: application/octet-stream
Connection: close
User-Agent: Python-urllib/2.7
binary-data
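For readers on Python 3, which is outside the scope of the question, a rough equivalent is sketched below. There, urllib.request.Request accepts a method argument, so overriding get_method is not needed. The helper name build_put_request is made up, and the request is only built, not sent, so the sketch runs without network access.

```python
import urllib.request


def build_put_request(url, data):
    """Build a PUT request that carries the bytes in data unchanged."""
    return urllib.request.Request(
        url,
        data=data,
        headers={"Content-Type": "application/octet-stream"},
        method="PUT",
    )


req = build_put_request("http://example.net/put", b"binary-data")

# urllib.request.urlopen(req) would send it; omitted to stay offline.
# Request stores header names capitalized, hence "Content-type" here.
print(req.get_method(), req.get_header("Content-type"))
```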

Have you considered/tried using httplib?
HTTPConnection.request(method, url[, body[, headers]])
This will send a request to the server using the HTTP request method
method and the selector url. If the body argument is present, it
should be a string of data to send after the headers are finished.
Alternatively, it may be an open file object, in which case the
contents of the file is sent; this file object should support fileno()
and read() methods. The header Content-Length is automatically set to
the correct value. The headers argument should be a mapping of extra
HTTP headers to send with the request.

This snippet worked for me to PUT an image on an HTTPS site. If you don't need HTTPS, use httplib.HTTPConnection(URL) instead.
import httplib
import ssl
API_URL="api-mysight.com"
TOKEN="myDummyToken"
IMAGE_FILE="myimage.jpg"
imageID="myImageID"
URL_PATH_2_USE="/My/image/" + imageID +"?objectId=AAA"
headers = {"Content-Type":"application/octet-stream", "X-Access-Token": TOKEN}
imgData = open(IMAGE_FILE, "rb")
REQUEST="PUT"
conn = httplib.HTTPSConnection(API_URL, context=ssl.SSLContext(ssl.PROTOCOL_TLSv1))
conn.request(REQUEST, URL_PATH_2_USE, imgData, headers)
response = conn.getresponse()
result = response.read()
