I'm making a lot of requests, but I only need a small part of the data near the start of each page's HTML. Since I'm currently requesting the whole page every time, looping over the requests uses a lot of bandwidth. Can I request only a section of a website's HTML with any module?
If you know how many bytes are enough, you can request a partial "range" of the resource:
curl -q http://www.example.com -i -H "Range: bytes=0-50"
HTTP/1.1 206 Partial Content
Accept-Ranges: bytes
Age: 506953
Cache-Control: max-age=604800
Content-Range: bytes 0-50/1256
...
Content-Length: 51
<!doctype html>
<html>
<head>
<title>Example Do%
See https://developer.mozilla.org/en-US/docs/Web/HTTP/Range_requests
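In Python, the same thing works with any client that lets you set request headers. A minimal sketch using the requests module (one option among many); note that Range support is optional, so a server may ignore the header and reply 200 with the full body:
import requests

resp = requests.get('http://www.example.com',
                    headers={'Range': 'bytes=0-50'})

if resp.status_code == 206:   # 206 Partial Content: the range was honoured
    print(resp.text)          # only the first 51 bytes of the page
else:                         # 200 OK: the server sent everything anyway
    print(resp.text[:51])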
I am making GET/POST requests to a URL and getting an HTML page in response, but I only want the response headers, not the response body.
I have already used the HEAD method, but it does not work in all situations.
Receiving the complete HTML page in every response drives up bandwidth usage,
and I also need a solution that works for both HTTP and HTTPS requests.
For example:
import urllib2
urllib2.urlopen('http://www.google.com')
If I send a request to this URL using urllib2 or requests, I get both the response body and the headers from the server. The request takes 14.08 KB in total; broken down, the response headers take 775 bytes and the response body takes 13.32 KB. I only need the response headers, so I would save 13.32 KB.
What you want is a so-called HEAD request. See this question for how to do it.
Is this what you are looking for:
import urllib2
l = urllib2.urlopen('http://www.google.com')
print(l.headers)
#Date: Thu, 11 Oct 2018 09:07:20 GMT
#Expires: -1
#...
EDIT
This seems to do what you are looking for:
import requests
a = requests.head('https://www.google.com')
a.headers
#{'X-XSS-Protection': '1; mode=block', 'Content-Encoding':...
a.text
#u''
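If you are stuck on urllib2, the same effect can be had by overriding get_method so the request goes out as HEAD. A minimal sketch (the HeadRequest class is illustrative, not part of urllib2):
import urllib2

class HeadRequest(urllib2.Request):
    # urllib2 derives the HTTP verb from get_method(); returning "HEAD"
    # asks the server for the headers only, with no response body.
    def get_method(self):
        return "HEAD"

r = urllib2.urlopen(HeadRequest('http://www.google.com'))
print(r.headers)  # headers arrive as usual
print(r.read())   # '' -- no body was transferred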
I am trying to create a response that returns three files in a single request.
However, I am stuck, because I do not know whether this can actually be achieved through the response body.
I am generating the response body I want to deliver with Python's MultipartEncoder:
[ response body ]
Note: the boundary is generated as well.
--dd7457a7dc684f32b2fd26ec468ed4b8
Content-Disposition: form-data; name=file1; filename="test1"
Content-Type: application/octet-stream
test1 sample
--dd7457a7dc684f32b2fd26ec468ed4b8
Content-Disposition: form-data; name=file2; filename="test2"
Content-Type: application/octet-stream
test2 sample
--dd7457a7dc684f32b2fd26ec468ed4b8
Content-Disposition: form-data; name=file3; filename="test3"
Content-Type: application/octet-stream
test3 sample
--dd7457a7dc684f32b2fd26ec468ed4b8--
The body is as above, with the following header:
response.headers["Content-Type"] = 'multipart/form-data'
I know that swagger-ui.js creates download links using the File API's Blob library. Is it possible to download the three files, either through three separate download links or through a single download link built with the Blob library?
I can already handle this by consolidating the files into a tar or zip archive and downloading that, or by returning them in JSON format.
I would like to know whether there is another way.
[version]
swagger-ui 2.2.10
Python 3.4.4
flask 0.10.1
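For reference, a minimal sketch of producing the body above with the MultipartEncoder from requests_toolbelt inside a Flask view (assuming that is the MultipartEncoder meant; the route and file contents are just the examples from the question, and whether swagger-ui 2.2.10 turns this into usable download links is exactly the open question):
from flask import Flask, Response
from requests_toolbelt import MultipartEncoder

app = Flask(__name__)

@app.route('/files')  # hypothetical route, for illustration only
def files():
    enc = MultipartEncoder(fields={
        'file1': ('test1', 'test1 sample', 'application/octet-stream'),
        'file2': ('test2', 'test2 sample', 'application/octet-stream'),
        'file3': ('test3', 'test3 sample', 'application/octet-stream'),
    })
    # enc.content_type already includes the generated boundary
    return Response(enc.to_string(), content_type=enc.content_type)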
I am trying to extract articles from The New York Times using the python goose extractor.
I have tried the standard URL retrieval approach:
g.extract(url=url)
However, this yields an empty string, so I tried the approach recommended in the documentation:
import urllib2
import goose
url = "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html?_r=0"
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
response = opener.open(url)
raw_html = response.read()
g = goose.Goose()
a = g.extract(raw_html=raw_html)
a.cleaned_text
Again, an empty string is returned for "cleaned_text", even though the HTML is retrieved from the website. I have also tried using requests, with the same result.
I presume this is a python-goose problem: it cannot extract the article body from the raw HTML being returned. I have searched, but I can't find anything that solves my problem.
It looks like goose has traditionally had problems with The New York Times, because (1) they redirect users through another page to set and check cookies (see the curl output below), and (2) they don't actually load the text of articles on page load; it is fetched asynchronously after the ad-display code has run. A partial workaround for (1) is sketched after the curl output.
~ curl -I "http://www.nytimes.com/reuters/2015/12/21/world/africa/21reuters-kenya-attacks-somalia.html"
HTTP/1.1 303 See Other
Server: Varnish
Location: http://www.nytimes.com/glogin?URI=http%3A%2F%2Fwww.nytimes.com%2Freuters%2F2015%2F12%2F21%2Fworld%2Fafrica%2F21reuters-kenya-attacks-somalia.html%3F_r%3D0
Accept-Ranges: bytes
Date: Tue, 22 Dec 2015 15:46:55 GMT
X-Varnish: 1338962331
Age: 0
Via: 1.1 varnish
X-API-Version: 5-0
X-PageType: article
Connection: close
X-Frame-Options: DENY
Set-Cookie: RMID=007f01017a275679706f0004;Path=/; Domain=.nytimes.com;Expires=Wed, 21 Dec 2016 15:46:55 UTC
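A requests.Session keeps the cookies set during that glogin redirect, which gets past (1); because of (2), the extracted text may still be thin or empty. A sketch, assuming the requests library:
import requests
import goose

url = ('http://www.nytimes.com/reuters/2015/12/21/world/africa/'
       '21reuters-kenya-attacks-somalia.html?_r=0')

# The session retains cookies across the redirect, so the follow-up
# request returns the article page instead of another bounce.
session = requests.Session()
raw_html = session.get(url).text

g = goose.Goose()
article = g.extract(raw_html=raw_html)
print(article.cleaned_text)  # may still be empty if the body loads via JS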
I have a strange situation: a link that works in all the browsers I currently have (Chrome, IE, Firefox).
However, when I try to crawl the page with Scrapy in Python, I get response.status == 400.
I am using Tor + Polipo to crawl anonymously.
response.body is:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html><head>
<title>Proxy error: 400 Couldn't parse URL.</title>
</head><body>
<h1>400 Couldn't parse URL</h1>
<p>The following error occurred while trying to access <strong>https://exmpale.com/blah</strong>:<br><br>
<strong>400 Couldn't parse URL</strong></p>
<hr>Generated Thu, 11 Dec 2014 13:55:38 UTC by Polipo on <em>localhost:8123</em>.
</body></html>
I'm just wondering why that should be. Could it be that a browser can get results but Scrapy cannot?
I want to know how to get the direct link to an embedded video (the link to the .flv/.mp4 or whatever file) from just the embed link.
For example, http://www.kumby.com/ano-hana-episode-1/ has
<embed src="http://www.4shared.com/embed/571660264/396a46be"></embed>
, though the link to the video seems to be
"http://dc436.4shared.com/img/571660264/396a46be/dlink__2Fdownload_2FM2b0O5Rr_3Ftsid_3D20120514-093834-29c48ef9/preview.flv"
How does the browser know where to load the video from? How can I write code that converts the embed link to a direct link?
UPDATE:
Thanks for the quick answer, Quentin.
However, I don't seem to receive a 'Location' header when connecting to "http://www.4shared.com/embed/571660264/396a46be".
import urllib2
r=urllib2.urlopen('http://www.4shared.com/embed/571660264/396a46be')
gives me the following headers:
'content-length', 'via', 'x-cache', 'accept-ranges', 'server', 'x-cache-lookup', 'last-modified', 'connection', 'etag', 'date', 'content-type', 'x-jsl'
from urllib2 import Request
r=Request('http://www.4shared.com/embed/571660264/396a46be')
gives me no headers at all.
The server issues a 302 HTTP status code and a Location header.
$ curl -I http://www.4shared.com/embed/571660264/396a46be
HTTP/1.1 302 Moved Temporarily
Server: Apache-Coyote/1.1
(snip cookies)
Location: http://static.4shared.com/flash/player/5.6/player.swf?file=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&provider=image&image=http://dc436.4shared.com/img/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.flv&displayclick=link&link=http://www.4shared.com/video/M2b0O5Rr/gg_Ano_Hi_Mita_Hana_no_Namae_o.html&controlbar=none
Content-Length: 0
Date: Mon, 14 May 2012 10:01:59 GMT
See "How do I prevent Python's urllib(2) from following a redirect" if you want to get information about the redirect response instead of following the redirect automatically.
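With urllib2, one well-known recipe is to neutralize the error processor so the 302 response is handed back to you instead of being followed. A minimal sketch:
import urllib2

class NoRedirection(urllib2.HTTPErrorProcessor):
    # Returning the response untouched stops urllib2 from dispatching
    # the 302 to its redirect handler, so we receive it directly.
    def http_response(self, request, response):
        return response
    https_response = http_response

opener = urllib2.build_opener(NoRedirection)
r = opener.open('http://www.4shared.com/embed/571660264/396a46be')
print(r.code)                     # 302
print(r.headers.get('Location'))  # the player URL containing the .flv link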