Python/Feedparser: reading RSS feed fails - python

I'm using feedparser to fetch RSS feed data. For most RSS feeds that works perfectly fine. However, I know stumbled upon a website where fetching RSS feeds fails (example feed). The return result does not contain the expected keys and the values are some HTML codes.
I tries simply reading the feed URL with urllib2.Request(url). This fails with a HTTP Error 405: Not Allowed error. If I add a custom header like
headers = {
'Content-type' : 'text/xml',
'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = urllib2.Request(url)
I don't get the 405 error anymore, but the returned content is a HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, same when I look at "View Page Source". feedparser.parse also allows to set agent and request_headers, I tried various agents. I'm still not able to correctly read the XML let alone the parsed feed from feedparse.
What am I missing here?

So, this feed yields a 405 error when the client making the request does not use a User-Agent. Try this:
$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
While without the UA, you get:
$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding

Related

Download file from response header Content-Disposition without a file name - Python

I am trying to scrape information from youtube. Where youtube uses infinite scroll, after every pull ajax calls up for more data. I am using scrapy on python, while i request to this url(with continuation token)
'https://www.youtube.com/results?search_query=tamil&ctoken=xyz&continuation=xyz' i received the status 200 with the following header.
HTTP/2.0 200 OK
cache-control: no-cache
content-disposition: attachment
expires: Tue, 27 Apr 1971 19:44:06 GMT
content-type: application/json; charset=UTF-8
content-encoding: br
x-frame-options: SAMEORIGIN
strict-transport-security: max-age=31536000
x-spf-response-type: multipart
x-content-type-options: nosniff
date: Mon, 09 Dec 2019 11:59:25 GMT
server: YouTube Frontend Proxy
x-xss-protection: 0
alt-svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000
X-Firefox-Spdy: h2
I just need to download the response json. i can view the response in Chrome and firefox inspector.
here is what i tried.
links = https://www.youtube.com/result?xyxxxx
ctoken = xyxxxxxxxx
ajax_url = "{links}&ctoken={ctoken}&continuation={ctoken}".format(ctoken=ctoken, links=links)
new_data = requests.get(ajax_url).json()
I am getting error on this.
What i am interested is, can i download the response as JSON file for further usage, by making use of content-disposition: attachment. If i need to download the response how can i implement.
Try:
header('Content-Disposition: attachment; filename=data.json');
header('Content-Type: application/json');
where header should the response of ajax call.

Python Scraping, Web page doesnot exist but the website redirects to another page

i am trying to find a way to know that if a web page exists or not. there are plenty of methods like httlib2, urlparse and using requests . but in my case the website redirects me to the home page if the webpage doesnot exist
e.g
https://www.thenews.com.pk/latest/category/sports/2015-09-21
Is there any method to catch that ?
The URL you mention gives a Redirect return code (307) which you can catch. See here:
$ curl -i https://www.thenews.com.pk/latest/category/sports/2015-09-21
HTTP/1.1 307 Temporary Redirect
Date: Sun, 26 Mar 2017 10:13:39 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=ddcd246615efb68a7c72c73f480ea81971490523219; expires=Mon, 26-Mar-18 10:13:39 GMT; path=/; domain=.thenews.com.pk; HttpOnly
Set-Cookie: bf_session=b02fb5b6cc732dc6c3b60332288d0f1d4f9f7360; expires=Sun, 26-Mar-2017 11:13:39 GMT; Max-Age=3600; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Location: https://www.thenews.com.pk/
X-Cacheable: YES
X-Varnish: 654909723
Age: 0
Via: 1.1 varnish
X-Age: 0
X-Cache: MISS
Access-Control-Allow-Origin: *
Server: cloudflare-nginx
CF-RAY: 345956a8be8a7289-AMS
You can check if the final url is the one you get redirected to, as well as if there was any history of redirects.
>>> import requests
>>> target_url = "https://www.thenews.com.pk/latest/category/sports/2015-09-21"
>>> response = requests.get(target_url)
>>> response.history[0].url
u'https://www.thenews.com.pk/latest/category/sports/2015-09-21'
>>> response.url
u'https://www.thenews.com.pk/'
>>> response.history and response.url == 'https://www.thenews.com.pk/' != target_url
True

download files with python (REST URL)

I am trying to write a script that will download a bunch files from a website that has REST URLs.
Here is the GET request:
GET /test/download/id/5774/format/testTitle HTTP/1.1
Host: testServer.com
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Cookie: __utma=11863783.1459862770.1379789243.1379789243.1379789243.1; __utmb=11863783.28.9.1379790533699; __utmc=11863783; __utmz=11863783.1379789243.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); PHPSESSID=fa844952890e9091d968c541caa6965f; loginremember=Qraoz3j%2BoWXxwqcJkgW9%2BfGFR0SDFLi1FLS7YVAfvbcd9GhX8zjw4u6plYFTACsRruZM4n%2FpX50%2BsjXW5v8vykKw2XNL0Vqo5syZKSDFSSX9mTFNd5KLpJV%2FFlYkCY4oi7Qyw%3D%3D; ma-refresh-storage=1; ma-pref=KLSFKJSJSD897897; skipPostLogin=0; pp-sid=hlh6hs1pnvuh571arl59t5pao0; __utmv=11863783.|1=MemberType=Yearly=1; nats_cookie=http%253A%252F%252Fwww.testServer.com%252F; nats=NDc1NzAzOjQ5MzoyNA%2C74%2C0%2C0%2C0; nats_sess=fe3f77e6e326eb8d18ef0111ab6f322e; __utma=163815075.1459708390.1379790355.1379790355.1379790355.1; __utmb=163815075.1.9.1379790485255; __utmc=163815075; __utmz=163815075.1379790355.1.1.utmcsr=ppp.contentdef.com|utmccn=(referral)|utmcmd=referral|utmcct=/postlogin; unlockedNetworks=%5B%22rk%22%2C%22bz%22%2C%22wkd%22%5D
Connection: close
If the request is good, it will return a 302 response such as this one:
HTTP/1.1 302 Found
Date: Sat, 21 Sep 2013 19:32:37 GMT
Server: Apache
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Vary: User-Agent,Accept-Encoding
Content-Length: 0
Connection: close
Content-Type: text/html; charset=UTF-8
What I need the script to do is check to see if it was a 302 response. If it is not, it will "pass", if it is, it will need to parse out the location parameter shown here:
location: http://downloads.test.stuff.com/5774/stuff/picture.jpg?wed=20130921152237&wer=20130922153237&hash=0f20f4a6d0c9f1720b0b6
Once I have the location parameter, I will have to make another GET request to download that file. I will also have to maintain the cookie for my session in order to download the file.
Can someone point me in the right direction for what library is best to use for this? I am having trouble finding out how to parse the 302 response and adding a cookie value like the one shown in my GET request above. I am sure there must be some library that can do all of this.
Any help would be much appreciated.
import urllib.request as ur
import urllib.error as ue
'''
Note that http.client.HTTPResponse.read([amt]) reads and returns the response body, or up to
the next amt bytes. This is because there is no way for urlopen() to automatically determine
the encoding of the byte stream it receives from the http server.
'''
url = "http://www.example.org/images/{}.jpg"
dst = ""
arr = ["01","02","03","04","05","06","07","08","09"]
# arr = range(10,20)
try:
for x in arr:
print(str(x)+"). ".ljust(4),end="")
hrio = ur.urlopen(url.format(x)) # HTTPResponse iterable object (returns the response header and body, together, as bytes)
fh = open(dst+str(x)+".jpg","b+w")
fh.write(hrio.read())
fh.close()
print("\t[REQUEST COMPLETE]\t\t<Error ~ [None]>")
except ue.URLError as e:
print("\t[REQUEST INCOMPLETE]\t",end="")
print("<Error ~ [{}]>".format(e))

HEAD request in python not working as desired

I am trying to check the status code of any URL in Python using the following code
class HeadRequest(urllib2.Request):
def get_method(self):
return "HEAD"
when I use it like this:
response = urllib2.urlopen(HeadRequest("http://www.nativeseeds.org/"))
it throws following exception:
HTTPError: HTTP Error 503: Service Temporarily Unavailable
However when I open the above URL "http://www.nativeseeds.org/" in firefox/restclient, it returns 200 status code.
Any help will be highly appreciated.
After some investigating, the website requires that both Accept and User-Agent request headers are set. Otherwise, it returns a 503. This is terribly broken. It also appears to be doing user-agent sniffing. I get a 403 when using curl:
$ curl --head http://www.nativeseeds.org/
HTTP/1.1 403 Forbidden
Date: Wed, 26 Sep 2012 14:54:59 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=PjZxNjvNmn6IlVh4Ac-tH0; path=/
Vary: Accept-Encoding
Content-Type: text/html
but works fine if I set the user-agent to Firefox:
$ curl --user-agent "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" --head http://www.nativeseeds.org/
HTTP/1.1 200 OK
Date: Wed, 26 Sep 2012 14:55:57 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=ykOpGnEE%2CQOMUaVJLnM7W0; path=/
Last-Modified: Wed, 26 Sep 2012 14:56:27 GMT
Vary: Accept-Encoding
Content-Type: text/html; charset=utf-8
It appears to work using the requests module:
>>> import requests
>>> r = requests.head('http://www.nativeseeds.org/')
>>> import pprint; pprint.pprint(r.headers)
{'cache-control': 'post-check=0, pre-check=0',
'content-encoding': 'gzip',
'content-length': '20',
'content-type': 'text/html; charset=utf-8',
'date': 'Wed, 26 Sep 2012 14:58:05 GMT',
'expires': 'Mon, 1 Jan 2001 00:00:00 GMT',
'last-modified': 'Wed, 26 Sep 2012 14:58:09 GMT',
'p3p': 'CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"',
'pragma': 'no-cache',
'server': 'Apache',
'set-cookie': 'f65129b0cd2c5e10c387f919ac90ad66=2NtRrDBra9jPsszChZXDm2; path=/',
'vary': 'Accept-Encoding'}
The problem you see has nothing to do with Python. The website itself seems to require something more than just a HEAD request. Even a simple telnet session results in the error:
$ telnet www.nativeseeds.org 80
Trying 208.113.230.85...
Connected to www.nativeseeds.org (208.113.230.85).
Escape character is '^]'.
HEAD / HTTP/1.1
Host: www.nativeseeds.org
HTTP/1.1 503 Service Temporarily Unavailable
Date: Wed, 26 Sep 2012 14:29:33 GMT
Server: Apache
Vary: Accept-Encoding
Connection: close
Content-Type: text/html; charset=iso-8859-1
Try adding some more headers; the http command line client does get a 200 response:
$ http -v head http://www.nativeseeds.org
HEAD / HTTP/1.1
Host: www.nativeseeds.org
Content-Type: application/x-www-form-urlencoded; charset=utf-8
Accept-Encoding: identity, deflate, compress, gzip
Accept: */*
User-Agent: HTTPie/0.2.2
HTTP/1.1 200 OK
Date: Wed, 26 Sep 2012 14:33:21 GMT
Server: Apache
P3P: CP="NOI ADM DEV PSAi COM NAV OUR OTRo STP IND DEM"
Expires: Mon, 1 Jan 2001 00:00:00 GMT
Cache-Control: post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: f65129b0cd2c5e10c387f919ac90ad66=34hOijDSzeskKYtULx9V83; path=/
Last-Modified: Wed, 26 Sep 2012 14:33:23 GMT
Vary: Accept-Encoding
Content-Encoding: gzip
Content-Length: 20
Content-Type: text/html; charset=utf-8
Reading urllib2 docs, get_method only returns 'GET' or 'POST'.
You may be interested in this.

Mechanize for Python and authorization without first getting a 401 error

I'm trying to use Mechanize to automate interactions with a very picky legacy system. In particular, after the first login page the authorization must be sent with every request it knocks you out of the system. Unfortunately, Mechanize seems content on only sending the authorization after first getting a 401 Unauthorized error. Is there any way to have it send authorization every time?
Here's some sample code:
br.add_password("http://example.com/securepage", "USERNAME", "PASSWORD", "/MYREALM")
br.follow_link(link_to_secure_page) # where the url is the previous URL
Here's the response I get from debugging Mechanize:
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 401 Unauthorized\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:04 GMT
header: Connection: close
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=US-ASCII
header: Content-Length: 210
header: WWW-Authenticate: Basic realm="/MYREALM"
header: Cache-control: no-cache
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nAuthorization: Basic VVNFUk5BTUU6UEFTU1dPUkQ=\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:07 GMT
header: Connection: close
header: Last-Modified: Thu, 08 Dec 2011 03:08:06 GMT
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=UTF-8
header: Content-Length: 33333
header: Cache-control: no-cache
The problem is that contrary to what should happen in modern web application with a GET request, by hitting the 401 error first I get the wrong page. I've confirmed with CURL and urllib2 that if I hit the URL directly by passing in the auth header on the first request I get the correct page.
Any hints on how to tell mechanize to always send the auth headers and avoid the first 401 error? This needs to be fixed on the client side. I can't modify the server.
from base64 import b64encode
import mechanize
url = 'http://192.168.3.5/table.js'
username = 'admin'
password = 'password'
# I have had to add a carriage return ('%s:%s\n'), but
# you may not have to.
b64login = b64encode('%s:%s' % (username, password))
br = mechanize.Browser()
# # I needed to change to Mozilla for mine, but most do not
# br.addheaders= [('User-agent', 'Mozilla/5.0')]
br.addheaders.append(
('Authorization', 'Basic %s' % b64login )
)
br.open(url)
r = br.response()
data = r.read()
print data

Categories