Download file from response header Content-Disposition without a file name - Python

I am trying to scrape information from YouTube. YouTube uses infinite scroll, so after every scroll an AJAX call fetches more data. I am using Scrapy with Python, and when I request this URL (with the continuation token)
'https://www.youtube.com/results?search_query=tamil&ctoken=xyz&continuation=xyz' I receive status 200 with the following headers.
HTTP/2.0 200 OK
cache-control: no-cache
content-disposition: attachment
expires: Tue, 27 Apr 1971 19:44:06 GMT
content-type: application/json; charset=UTF-8
content-encoding: br
x-frame-options: SAMEORIGIN
strict-transport-security: max-age=31536000
x-spf-response-type: multipart
x-content-type-options: nosniff
date: Mon, 09 Dec 2019 11:59:25 GMT
server: YouTube Frontend Proxy
x-xss-protection: 0
alt-svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000
X-Firefox-Spdy: h2
I just need to download the response JSON. I can view the response in the Chrome and Firefox inspectors.
Here is what I tried:
links = "https://www.youtube.com/result?xyxxxx"
ctoken = "xyxxxxxxxx"
ajax_url = "{links}&ctoken={ctoken}&continuation={ctoken}".format(ctoken=ctoken, links=links)
new_data = requests.get(ajax_url).json()
I am getting an error on this.
What I am interested in is whether I can download the response as a JSON file for further use, by making use of content-disposition: attachment. If I need to download the response, how can I implement that?

Try:
header('Content-Disposition: attachment; filename=data.json');
header('Content-Type: application/json');
where the header should be set on the response of the AJAX call.
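If you only control the client, a plain requests call can still save the body to a JSON file; content-disposition: attachment only tells a browser to download instead of render, it does not block programmatic access. A minimal sketch, using placeholder tokens and assuming a browser-like User-Agent is enough to get JSON back (that part is an assumption):
import json
import requests

# Placeholders: the real ctoken/continuation values come from the previous response.
links = "https://www.youtube.com/results?search_query=tamil"
ctoken = "xyz"
ajax_url = "{links}&ctoken={ctoken}&continuation={ctoken}".format(links=links, ctoken=ctoken)

# A browser-like User-Agent is an assumption, not something YouTube documents.
headers = {"User-Agent": "Mozilla/5.0"}
response = requests.get(ajax_url, headers=headers)
response.raise_for_status()

# The header is just "content-disposition: attachment" with no filename,
# so pick a local file name yourself and dump the parsed JSON to it.
data = response.json()
with open("youtube_continuation.json", "w") as fh:
    json.dump(data, fh)
If the response comes back brotli-encoded (content-encoding: br), installing the brotli package lets requests decode it transparently.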

Related

Is it possible to get the content-length of a web page if the server doesn't include content-length as a header?

Is it possible to get the 'content-length' of a web page when the server doesn't send a Content-Length header, ideally without downloading the whole content (though downloading it is acceptable if there is no other way)?
I am doing some scraping using Python requests and only need to classify pages based on the size of the response.
Below are the response headers from the server in question.
HTTP/1.1 200 OK
Cache-Control: no-cache,private
Content-Type: text/html
Server: Microsoft-IIS/7.2
X-Powered-By: ASP.NET
Date: Fri, 04 Jul 2017 11:28:40 GMT
Connection: close
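One hedged approach: try a HEAD request first and, if Content-Length is still missing, stream the body and count bytes as they arrive. A minimal sketch (the URL is a placeholder):
import requests

url = "http://example.com/page"  # placeholder

# Some servers report Content-Length on a HEAD request even when the GET response is chunked.
head = requests.head(url, allow_redirects=True)
length = head.headers.get("Content-Length")

if length is not None:
    size = int(length)
else:
    # No header at all: the only reliable fallback is to download,
    # but streaming avoids holding the whole page in memory.
    size = 0
    with requests.get(url, stream=True) as resp:
        for chunk in resp.iter_content(chunk_size=8192):
            size += len(chunk)

print("size in bytes:", size)
Note that the streamed count is taken after any gzip decoding, so it may differ slightly from a Content-Length the server would have sent.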

Python scraping: web page does not exist but the website redirects to another page

I am trying to find a way to know whether a web page exists or not. There are plenty of methods, like httplib2, urlparse and requests, but in my case the website redirects me to the home page if the web page does not exist,
e.g.
https://www.thenews.com.pk/latest/category/sports/2015-09-21
Is there any method to catch that?
The URL you mention gives a Redirect return code (307) which you can catch. See here:
$ curl -i https://www.thenews.com.pk/latest/category/sports/2015-09-21
HTTP/1.1 307 Temporary Redirect
Date: Sun, 26 Mar 2017 10:13:39 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
Set-Cookie: __cfduid=ddcd246615efb68a7c72c73f480ea81971490523219; expires=Mon, 26-Mar-18 10:13:39 GMT; path=/; domain=.thenews.com.pk; HttpOnly
Set-Cookie: bf_session=b02fb5b6cc732dc6c3b60332288d0f1d4f9f7360; expires=Sun, 26-Mar-2017 11:13:39 GMT; Max-Age=3600; path=/; HttpOnly
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Location: https://www.thenews.com.pk/
X-Cacheable: YES
X-Varnish: 654909723
Age: 0
Via: 1.1 varnish
X-Age: 0
X-Cache: MISS
Access-Control-Allow-Origin: *
Server: cloudflare-nginx
CF-RAY: 345956a8be8a7289-AMS
You can check if the final url is the one you get redirected to, as well as if there was any history of redirects.
>>> import requests
>>> target_url = "https://www.thenews.com.pk/latest/category/sports/2015-09-21"
>>> response = requests.get(target_url)
>>> response.history[0].url
u'https://www.thenews.com.pk/latest/category/sports/2015-09-21'
>>> response.url
u'https://www.thenews.com.pk/'
>>> response.history and response.url == 'https://www.thenews.com.pk/' != target_url
True
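Alternatively, you can stop requests from following the redirect at all and inspect the first response directly; a small sketch along the same lines:
import requests

target_url = "https://www.thenews.com.pk/latest/category/sports/2015-09-21"

# Don't follow the redirect; look at the very first response instead.
response = requests.get(target_url, allow_redirects=False)

if response.is_redirect:
    # 3xx status with a Location header, i.e. the page "does not exist" here.
    print("redirected to:", response.headers["Location"])
else:
    print("page exists, status:", response.status_code)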

Python library requests opens the wrong page

I am trying to open an HTML page with the Python requests library, but my code opens the site's root page instead and I don't understand how to solve the problem.
import requests
scraping = requests.request("POST", url = "http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print scraping.content
Thank you for any suggestions!
You can easily see that the server is redirecting to the main page.
➜ ~ http -v http://www.pollnet.it/WeeklyReport_it.aspx\?ID\=69
GET /WeeklyReport_it.aspx?ID=69 HTTP/1.1
Accept: */*
Accept-Encoding: gzip, deflate
Connection: keep-alive
Host: www.pollnet.it
User-Agent: HTTPie/0.9.3
HTTP/1.1 302 Found
Content-Length: 131
Content-Type: text/html; charset=utf-8
Date: Sun, 07 Feb 2016 11:24:52 GMT
Location: /default.asp
Server: Microsoft-IIS/7.5
X-Powered-By: ASP.NET
<html><head><title>Object moved</title></head><body>
<h2>Object moved to here.</h2>
</body></html>
On further checking, it can be seen that the web server uses session cookies.
➜ ~ http -v http://www.pollnet.it/default_it.asp
HTTP/1.1 200 OK
Cache-Control: private
Content-Encoding: gzip
Content-Length: 9471
Content-Type: text/html; Charset=utf-8
Date: Sun, 07 Feb 2016 13:21:41 GMT
Server: Microsoft-IIS/7.5
Set-Cookie: ASPSESSIONIDSQTSTAST=PBHDLEIDFCNMPKIGANFDNMLK; path=/
Vary: Accept-Encoding
X-Powered-By: ASP.NET
It means that every time the main page is visited, the server sends a "Set-Cookie" header, which instructs the browser to set certain cookies. Then every time the browser asks for a Weekly Report, the server validates the session cookie.
Normally, the requests package does not save cookies between requests, but for this scraping we can use a Session object, which will save the cookies between page requests.
import requests
# create a Session object
s = requests.Session()
# first visit the main page
s.get("http://www.pollnet.it/default_it.asp")
# then we can visit the weekly report pages
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=69")
print(r.text)
# another page
r = s.get("http://www.pollnet.it/WeeklyReport_it.aspx?ID=89")
print(r.text)
But here is some advice: the web server may only allow a fixed number of pages (maybe 10, maybe 15) to be opened with a given Session object. Either validate r.text immediately each time (for example, check the length of the response body to make sure it is not too small), or create a new Session object every 5 or 6 pages.
More info on Session objects here.
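A minimal sketch of that validate-or-recreate advice (the length threshold and the every-5-pages cutoff are arbitrary guesses, not values the server documents):
import requests

BASE = "http://www.pollnet.it"
report_ids = [69, 89]  # the report IDs used above

def new_session():
    # Visit the main page first so the server sets its session cookie.
    s = requests.Session()
    s.get(BASE + "/default_it.asp")
    return s

s = new_session()
for n, report_id in enumerate(report_ids, start=1):
    r = s.get("{}/WeeklyReport_it.aspx?ID={}".format(BASE, report_id))
    # Arbitrary sanity check: a real report should not be a tiny stub.
    if len(r.text) < 1000:
        s = new_session()
        r = s.get("{}/WeeklyReport_it.aspx?ID={}".format(BASE, report_id))
    print(report_id, len(r.text))
    if n % 5 == 0:
        s = new_session()  # rotate the session every few pages as a precaution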

Python/Feedparser: reading RSS feed fails

I'm using feedparser to fetch RSS feed data. For most RSS feeds that works perfectly fine. However, I have now stumbled upon a website where fetching the RSS feed fails (example feed). The returned result does not contain the expected keys, and the values are some HTML code.
I tried simply reading the feed URL with urllib2.Request(url). This fails with an HTTP Error 405: Not Allowed. If I add a custom header like
headers = {
'Content-type' : 'text/xml',
'User-Agent': 'Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0',
}
request = urllib2.Request(url, headers=headers)
I don't get the 405 error anymore, but the returned content is an HTML document with some HEAD tags and an essentially empty BODY. In the browser everything looks fine, and the same goes for "View Page Source". feedparser.parse also allows setting agent and request_headers; I tried various agents. I'm still not able to correctly read the XML, let alone get the parsed feed out of feedparser.
What am I missing here?
So, this feed yields a 405 error when the client making the request does not use a User-Agent. Try this:
$ curl 'http://www.propertyguru.com.sg/rss' -H 'User-Agent: hum' -o /dev/null -D- -s
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 21 May 2015 15:48:44 GMT
Content-Type: application/xml; charset=utf-8
Content-Length: 24616
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
While without the UA, you get:
$ curl 'http://www.propertyguru.com.sg/rss' -o /dev/null -D- -s
HTTP/1.1 405 Not Allowed
Server: nginx
Date: Thu, 21 May 2015 15:49:20 GMT
Content-Type: text/html
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Vary: Accept-Encoding
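Applied back to feedparser, which the question notes accepts an agent argument, a sketch might be (the exact User-Agent string does not seem to matter, only that one is sent):
import feedparser

url = "http://www.propertyguru.com.sg/rss"

# Any non-empty, browser-like User-Agent appears to satisfy this server.
feed = feedparser.parse(url, agent="Mozilla/5.0 (X11; Linux i586; rv:31.0) Gecko/20100101 Firefox/31.0")

print(feed.status)  # should now be 200 rather than 405
for entry in feed.entries[:5]:
    print(entry.title)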

Mechanize for Python and authorization without first getting a 401 error

I'm trying to use Mechanize to automate interactions with a very picky legacy system. In particular, after the first login page the authorization must be sent with every request or it knocks you out of the system. Unfortunately, Mechanize seems content to only send the authorization after first getting a 401 Unauthorized error. Is there any way to have it send the authorization every time?
Here's some sample code:
br.add_password("http://example.com/securepage", "USERNAME", "PASSWORD", "/MYREALM")
br.follow_link(link_to_secure_page) # where the url is the previous URL
Here's the response I get from debugging Mechanize:
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 401 Unauthorized\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:04 GMT
header: Connection: close
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=US-ASCII
header: Content-Length: 210
header: WWW-Authenticate: Basic realm="/MYREALM"
header: Cache-control: no-cache
send: 'GET /securepage HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: example.com\r\nReferer: http://example.com/home\r\nConnection: close\r\nAuthorization: Basic VVNFUk5BTUU6UEFTU1dPUkQ=\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: Tandy1000Web
header: Date: Thu, 08 Dec 2011 03:08:07 GMT
header: Connection: close
header: Last-Modified: Thu, 08 Dec 2011 03:08:06 GMT
header: Expires: Tue, 01 Jan 1980 06:00:00 GMT
header: Content-Type: text/html; charset=UTF-8
header: Content-Length: 33333
header: Cache-control: no-cache
The problem is that, contrary to what should happen in a modern web application with a GET request, by hitting the 401 error first I get the wrong page.
Any hints on how to tell mechanize to always send the auth headers and avoid the first 401 error? This needs to be fixed on the client side. I can't modify the server.
from base64 import b64encode
import mechanize
url = 'http://192.168.3.5/table.js'
username = 'admin'
password = 'password'
# I have had to add a carriage return ('%s:%s\n'), but
# you may not have to.
b64login = b64encode('%s:%s' % (username, password))
br = mechanize.Browser()
# I needed to change the User-Agent to Mozilla for mine, but most do not need to:
# br.addheaders = [('User-agent', 'Mozilla/5.0')]
br.addheaders.append(
    ('Authorization', 'Basic %s' % b64login)
)
br.open(url)
r = br.response()
data = r.read()
print data
