Ñ not displayed in Google App Engine website - python

I'm using Google App Engine to build a website and I'm having problems with special characters. I think I've reduced the problem to these two code samples:
request = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
print request.content
The previous code displays the content just fine, including the special characters. But the correct way to output something with the framework is:
request = urlfetch.fetch(
    url=self.WWW_INFO,
    payload=urllib.urlencode(inputs),
    method=urlfetch.POST,
    headers={'Content-Type': 'application/x-www-form-urlencoded'})
self.response.out.write(request.content)
That version doesn't display the special characters, and instead just prints �. What should I do so it displays correctly?
I know I'm missing something, but I can't seem to grasp what it is. The website sets the <meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">, and I've tried with charset=UTF-8 with no success.
I'll appreciate any advice that can point me in the right direction.

You need to get the charset from the Content-Type header in the fetch's result, use it to decode the bytes into Unicode, then, on the response, set the header with your favorite encoding (I suggest utf-8; there's no good reason to do otherwise) and emit the Unicode text encoded via that codec. The pass through Unicode is not strictly needed here (when you're doing nothing at all with the contents, just bouncing them right back to the response, you might use the identical content-type and charset you received), but it's recommended on general grounds: use encoded byte strings only on input/output, and always keep all text "within" your app as Unicode.
In other words, your problem seems to be mostly that you're not setting the headers correctly on the response.
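A minimal sketch of that flow in plain Python for illustration (`parse_charset` and `reencode` are hypothetical helpers, not App Engine APIs):

```python
# Decode upstream bytes using the charset from their Content-Type header,
# keep text as Unicode inside the app, and re-encode as UTF-8 on output.

def parse_charset(content_type, default="ISO-8859-1"):
    """Extract the charset parameter from a Content-Type header value."""
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.lower() == "charset":
            return value.strip('"') or default
    return default

def reencode(body_bytes, content_type, out_encoding="utf-8"):
    # Bytes -> Unicode on input, Unicode -> bytes on output.
    text = body_bytes.decode(parse_charset(content_type))
    return text.encode(out_encoding)
```

In the App Engine handler you would then write the re-encoded bytes to `self.response.out` and set the response's Content-Type header to `text/html; charset=utf-8` to match.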

Related

Python Requests returning different HTML values from browser occasionally

I am trying to access the data that loads on https://www.hellofresh.com/menus for a project, which can be done by reconstructing the API endpoint using the following endpoint as a template: https://www.hellofresh.com/_next/data/1.964.0/menus/2023-W01.json
I believe "1.964.0" is a build number related to Next.js, and "2023-W01" serves as a key that returns the meals for a particular week. Because the "1.964.0" string changes at unpredictable times, I find the latest one by looking through the head tag of the menu page's HTML, where it appears as <meta content="1.964.0" property="version">.
However, when using Python Requests to automate this string lookup, I sometimes get an incorrect older string after running the script a few times. In this example, it returned "1.961.0," and putting that string into the endpoint does not work. In other words, I am getting a discrepancy between the HTML I see in the browser and the HTML that's being served by the GET request in Python.
The weird thing is that if I rerun the request script several times, it will eventually get the correct numerical string (for example 1.964.0) without my making any changes to the script. I have tried sending the request with cache-control: no-cache and pragma: no-cache headers in addition to the user-agent, referer, and accept headers, and the behavior is the same regardless of the combination of headers. I am really scratching my head at this point, so anything that points toward an answer is much appreciated.
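For illustration, the meta-tag lookup described above could be sketched like this (the regex and function name are assumptions; a real script would first fetch the menu page's HTML with requests):

```python
import re

# Pull the build number out of the page's
# <meta content="..." property="version"> tag described in the question.
def extract_build_version(html):
    m = re.search(r'<meta content="([\d.]+)" property="version"', html)
    return m.group(1) if m else None
```

With the HTML from the question, `extract_build_version` would return "1.964.0"; the intermittent "1.961.0" result means the GET request is sometimes being served older HTML than the browser sees.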

HPE_UNEXPECTED_CONTENT_LENGTH error when setting Content-Length to BytesIO length

I'm moving a Python Pyramid project from Python 2 to 3. I was using ReportLab to generate PDF files and send them to the front end. According to their examples I need to use io.BytesIO(), where previously it was StringIO().
Now using the generated document length to set the Content-Length in my response, I get an HPE_UNEXPECTED_CONTENT_LENGTH error.
pdf = io.BytesIO()
doc = SimpleDocTemplate(pdf)
doc.build(story)
pdfcontent = pdf.getvalue()
pdf.close()
response = Response(content_type='application/pdf', body=pdfcontent)
response.headers.add("Content-Length", str(len(pdfcontent)))
If I don't set the Content-Length attribute the download works fine, but I would rather not leave it blank.
I'm not sure about your particular example and error, but I'm pretty sure that when you provide the response body bytes like this, Pyramid sends the Content-Length header itself. There's no need to set it manually: the framework already has the bytes and therefore knows their size.
You should check the response headers (using your browser's developer tools or a command-line tool like curl or httpie).
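A minimal illustration of the point above (`build_headers` is a stand-in for what Pyramid does with `body=...`, not a real Pyramid API):

```python
import io

# The framework already knows the body size, so a manually added
# Content-Length can only agree with it or conflict with it.
def build_headers(body):
    """Stand-in for Pyramid's behavior when given body= (assumption)."""
    return {"Content-Type": "application/pdf",
            "Content-Length": str(len(body))}

pdf = io.BytesIO()
pdf.write(b"%PDF-1.4\n% stand-in for doc.build(story) output")
pdfcontent = pdf.getvalue()
pdf.close()

headers = build_headers(pdfcontent)
```

If the manually added header ever disagrees with the actual body length (or is sent twice), the client sees exactly the kind of HPE_UNEXPECTED_CONTENT_LENGTH error described.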

Dump JSON from string in unknown character encoding

I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.
I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:
json = None
for possible_encoding in ["utf-8", "ISO-8859-1"]:
    try:
        # post_dict contains, among other things, website html retrieved
        # with urllib2
        json = simplejson.dumps(post_dict, encoding=possible_encoding)
        break
    except UnicodeDecodeError:
        pass
if json is None:
    raise UnicodeError("no suitable encoding found")
This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.
The reason I'm trying to serialize the HTML in the first place is because I need to send it in a POST request to our NodeJS server. So, if someone has a different solution that allows me to do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.
You should know the character encoding regardless of the media type you use to send the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see "A good way to get the charset/encoding of an HTTP response in Python".
To send post_dict as JSON, make sure all strings in it are Unicode (just convert the HTML to Unicode as soon as you receive it) and don't use the encoding parameter in the json.dumps() call. That parameter won't help you anyway if different websites (where you get your HTML strings) use different encodings.
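A minimal sketch of that advice (using the stdlib `json` module for illustration; the original code used `simplejson`, and the sample bytes and their encoding are assumptions):

```python
import json

# Decode the fetched bytes to text as soon as you receive them; from
# then on everything is Unicode and json.dumps needs no encoding hints.
raw = b"caf\xe9"                    # bytes as fetched from some site
html = raw.decode("ISO-8859-1")     # -> 'café'; pure text from here on
payload = json.dumps({"html": html})
```

The resulting `payload` is a plain JSON string that can be POSTed to the NodeJS server regardless of which encoding each source site used.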

Browser charsets order of precedence

Client browsers are sending the header HTTP_ACCEPT_CHARSET: ISO-8859-1,utf-8;q=0.7,*;q=0.3. I only serve webpages as UTF-8 with the correct header, but browsers are posting form data encoded with the ISO-8859-1 charset. My question is: will a browser always prefer charsets in the order of its Accept-Charset header, so that I can reliably write a middleware that decodes any posted data with the first entry (in this case ISO-8859-1) and re-encodes it as UTF-8?
UPDATE:
I updated the form tag with accept-charset="utf-8" and I'm still seeing incorrectly decoded characters. Is it possible that a user copy/pasting their password from somewhere else (LastPass, an Excel file) could be injecting non-UTF-8 bytes?
The request header Accept-Charset (which may be exposed server-side as HTTP_ACCEPT_CHARSET) expresses the client's preferences, to be used when the server is capable of serving the resource in different encodings. The server may ignore it, and often will.
If your page is UTF-8 encoded and declared as such, then any form on your page will send its data as UTF-8 encoded, unless you specify an accept-charset attribute. So if a browser posts data as ISO-8859-1 encoded, then this is a browser bug. However, this would need to be analyzed before drawing conclusions.
There's an old technique of including some special character, written as a character reference for safety, as the value of a hidden field. The server-side handler can then pick up the value of this field and detect an encoding mismatch, or even heuristically deduce the actual encoding from the encoded form of the special character.
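A hypothetical server-side sketch of that probe (the probe character, field handling, and candidate encodings are all assumptions):

```python
# The form would contain <input type="hidden" name="probe" value="&#233;">,
# so the browser submits the character in whatever encoding it used.
PROBE = "\u00e9"   # é, written in the form as the character reference &#233;

def detect_encoding(posted_bytes):
    """Return the candidate encoding that round-trips the probe, or None."""
    for enc in ("utf-8", "ISO-8859-1"):
        try:
            if posted_bytes.decode(enc) == PROBE:
                return enc
        except UnicodeDecodeError:
            pass
    return None
```

Once the probe field reveals the actual encoding, the rest of the POSTed fields can be decoded with it and normalized to UTF-8.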
I am not sure whether all browsers prefer charsets in the same specific order, but you can set accept-charset on the form, which forces the browser to send UTF-8 encoded data.
Like this:
<form accept-charset="utf-8"></form>

Using Python to download a document that's not explicitly referenced in a URL

I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way, in general, to tell whether a URL links to a pdf/doc etc. file when it doesn't do so explicitly (e.g. www.domain.com/file.pdf)? And is there a way to get Python to snag that file?
Edit:
Thanks for replies, several of which suggest downloading the file to see if it's of the correct type. Only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an html file with an href containing that same url.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
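A sketch of that HEAD-request check, using Python 3's urllib for illustration (the question itself targets Python 2.6; `looks_like_pdf` is a hypothetical helper, and the network call is left commented out):

```python
from urllib.request import Request, urlopen

# A HEAD request returns only the headers, so no document body is
# transferred just to learn the content type.
def looks_like_pdf(content_type):
    """True if a Content-Type header value names a PDF."""
    return content_type.split(";")[0].strip().lower() == "application/pdf"

# req = Request(url, method="HEAD")
# with urlopen(req) as resp:
#     if looks_like_pdf(resp.headers.get("Content-Type", "")):
#         ...  # fetch the document with a follow-up GET
```

As noted above, this only works if the server reports the type honestly; downloading and sniffing the bytes is the fallback.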
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies, this tutorial might help)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the PDF because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it is using a "session cookie" somewhere, and if you haven't got one yet, it keeps on trying to redirect.
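A sketch of such a cookie-enabled fetch, shown with the Python 3 equivalents of the modules mentioned (urllib2 became urllib.request, cookielib became http.cookiejar; the final fetch is commented out since it needs a live URL):

```python
import http.cookiejar
import urllib.request

# An opener whose requests share a cookie jar: the first response's
# session cookie is stored and replayed on every subsequent request,
# so the redirect chain can terminate and actually serve the PDF.
jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(
    urllib.request.HTTPCookieProcessor(jar))

# data = opener.open(url).read()   # url: the OECD link from the question
```

In Python 2, the same pattern uses cookielib.CookieJar with urllib2.build_opener and urllib2.HTTPCookieProcessor.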
As has been said, there's no way to tell the content type from the URL. But if you don't mind fetching the headers for every URL, you can do this:
obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
    # we have a PDF file, download the whole thing
    ...
This way you won't have to download each URL, just its headers. It's still not exactly saving network traffic, but you won't do better than that.
Also, you should use mime-types instead of my crude find('pdf').
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what it gives you when you request a certain URL.
Check the mimetype with the info() method of the object returned by urllib.urlopen(). This might not be 100% accurate; it really depends on what the site returns as a Content-Type header. If it's well behaved it'll return the proper mime type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
You can't see it from the url directly. You could try to only download the header of the HTTP response and look for the Content-Type header. However, you have to trust the server on this - it could respond with a wrong Content-Type header not matching the data provided in the body.
To detect the file type in Python 3.x in a web app, given a URL to a file that may have no extension or a fake extension, you should install python-magic, using
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import magic
from urllib.request import urlopen

url = "http://...url to the file ..."
response = urlopen(url)
# mime=True returns a MIME type such as 'application/pdf'
# instead of a human-readable description
mime_type = magic.from_buffer(response.read(2048), mime=True)
print(mime_type)
