How can I get part of the response to a GET/POST request made with the python-requests library? I need to fetch a URL's content and analyze it quickly, but the web server may return a very large response (500 MB, for example).
Set stream to True.
import requests

example_url = 'http://www.example.com/somethingbig'
r = requests.get(example_url, stream=True)
At this point only the response headers have been downloaded and the connection remains open.
Documentation: see the "Body Content Workflow" section of the requests docs.
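To actually stop after part of the body, you can iterate over the streamed chunks and break once you have enough. This is only a rough sketch: `read_prefix`, `fetch_prefix`, the 64 KiB limit and the timeout are names and values I made up for illustration, not anything from requests itself.

```python
def read_prefix(chunks, limit):
    """Collect at most `limit` bytes from an iterable of byte chunks."""
    buf = bytearray()
    for chunk in chunks:
        buf.extend(chunk)
        if len(buf) >= limit:
            break  # stop pulling data as soon as we have enough
    return bytes(buf[:limit])


def fetch_prefix(url, limit=64 * 1024):
    """Download only the first `limit` bytes of the body at `url` (hypothetical helper)."""
    # Third-party dependency, imported here so read_prefix stays usable without it.
    import requests

    # stream=True defers the body download; leaving the with-block closes the
    # connection even though the body was not read to the end.
    with requests.get(url, stream=True, timeout=10) as r:
        r.raise_for_status()
        return read_prefix(r.iter_content(chunk_size=8192), limit)
```

Something like `head = fetch_prefix(example_url)` would then give you the first 64 KiB to analyze without waiting for the full 500 MB.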
I am trying to make a simple program that will help with the confusing part of rooting.
I need to download the file from tiny.cc/latestmagisk
I am using this python code
import requests
url = 'http://tiny.cc/latestmagisk'
r = requests.get(url)
r.content
The content it returns is the usual 403 Forbidden for nginx
I need this to work with the shortened URL. Is there any way to make that happen?
It's not necessary to import the requests library.
All you need to do is import ssl and urllib, and pass ssl._create_unverified_context() as the context when you send the request.
Your code should look like this:
import ssl, urllib

# Python 2 code; in Python 3, use urllib.request.urlopen instead.
certcontext = ssl._create_unverified_context()  # skips certificate verification
# Fetch the image from the URL and save it as image.jpg
with open('image.jpg', 'wb') as f:
    f.write(urllib.urlopen("https://i.stack.imgur.com/IKh7E.png", context=certcontext).read())
Note: this saves the image as image.jpg.
Contrary to the other answer, you really should use requests for this as requests has better support for redirects.
For getting a page through a redirect from requests:
r = requests.get(url, allow_redirects=True)
For downloading files through redirects:
r = requests.get(url, allow_redirects=True, stream=True)
with open(filename, 'wb') as f:
    for chunk in r.iter_content(chunk_size=1024):
        if chunk:
            f.write(chunk)
However, in this case, either tiny.cc or XDA does not allow a simple requests.get. The 403 Forbidden is likely due to the User-Agent or another default header, since this method works well with bit.ly and other link shorteners. You may need to fake headers.
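Faking headers with requests just means passing a `headers` dict. Below is a minimal sketch, assuming the server is checking the User-Agent (which I have not verified for tiny.cc). The header values and the function name `download_with_headers` are hypothetical, so adjust them to whatever the server actually expects.

```python
# Hypothetical header values that imitate a desktop browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
    ),
    "Accept": "*/*",
}


def download_with_headers(url, filename, headers=BROWSER_HEADERS):
    """Follow redirects from `url` and save the body to `filename`."""
    # Third-party dependency, imported here so the header dict can be reused without it.
    import requests

    with requests.get(url, headers=headers, allow_redirects=True,
                      stream=True, timeout=30) as r:
        r.raise_for_status()  # raises on 403 and other 4xx/5xx responses
        with open(filename, "wb") as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
        return r.url  # final URL after any redirects
```

If the 403 persists with browser-like headers, the block is probably based on something else (cookies, referrer, or IP), and this approach will not help.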
I'm fairly new to Python and I'm trying to execute an HTTP request to a URL that returns JSON. The code I have is:
import urllib.request

url = "http://myurl.com/"
req = urllib.request.Request(url)
response = urllib.request.urlopen(req)
data = response.read()
I'm getting an error reading: "'bytes' object has no attribute 'read'". I searched around, but haven't found a solution. Any suggestions?
You may find the requests library easier to use:
import requests
data = requests.get('http://example.com').text
or, if you need the raw, undecoded bytes,
import requests
data = requests.get('http://example.com').content
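Since the URL returns JSON, the remaining step is turning the body into Python objects. Here is a small sketch; `parse_json_body` is a name I made up, and it assumes the body is UTF-8 unless you pass another encoding.

```python
import json


def parse_json_body(body, encoding="utf-8"):
    """Decode a response body (bytes or str) and parse it as JSON."""
    if isinstance(body, (bytes, bytearray)):
        # response.read() and requests' .content both return bytes
        body = body.decode(encoding)
    return json.loads(body)
```

With urllib you would call `parse_json_body(response.read())`. With requests, `requests.get(url).json()` does the decoding and parsing in one step.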
I'm writing a script to download the entire collection of BBC podcasts from various show hosts. My script uses BS4, Mechanize, and wget.
I would like to know how I can test whether a request for a URL yields a response code of '404' from the server. I have written the function below:
def getResponseCode(br, url):
    print("Opening: " + url)
    try:
        response = br.open(url)
        print("Response code: " + str(response.code))
        return True
    except (mechanize.HTTPError, mechanize.URLError) as e:
        if isinstance(e, mechanize.HTTPError):
            print("Mechanize error: " + str(e.code))
        else:
            print("Mechanize error: " + str(e.reason.args))
        return False
I pass in my Browser() object and a URL string. It returns True or False depending on whether the response is a '200' or a '404' (well, actually, Mechanize throws an exception if it is anything other than a '200', hence the exception handling).
In main() I am basically looping over this function passing in a number of URLs from a list of URLs that I have scraped with BS4. When the function returns True I proceed to download the MP3 with wget.
However, my problem is:
The URLs are direct paths to the podcast MP3 files on the remote server, and I have noticed that when the URL is available, br.open(<URL>) will hang. I suspect this is because Mechanize is caching/downloading the actual data from the server. I do not want this because I merely want to return True if the response code is '200'. How can I avoid the caching/download and just test the response code?
I have tried using br.open_novisit(url, data=None); however, the hang still persists.
I don't think there's any good way to get Mechanize to do what you want. The whole point of Mechanize is that it's trying to simulate a browser visiting a URL, and a browser visiting a URL downloads the page. If you don't want to do that, don't use an API designed for that.
On top of that, whatever API you're using, by sending a GET request for the URL, you're asking the server to send you the entire response. Why do that just to hang up on it as soon as possible? Use the HEAD request to ask the server whether it's available. (Sometimes servers won't HEAD things even when they should, so you'll have to fall back to GET. But cross that bridge if you come to it.)
For example:
req = urllib.request.Request(url, method='HEAD')
resp = urllib.request.urlopen(req)
return 200 <= resp.code < 300
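One thing to watch: urlopen raises HTTPError for a 404 rather than returning a response, so the check needs a try/except. Here is a fuller sketch that also includes the GET fallback mentioned above. The name `url_is_available`, the timeout, and the choice of 405/501 as the "HEAD not supported" signals are my own assumptions.

```python
import urllib.error
import urllib.request


def url_is_available(url, timeout=10):
    """Return True if `url` answers with a 2xx status, without reading the body."""
    for method in ("HEAD", "GET"):
        req = urllib.request.Request(url, method=method)
        try:
            # For the GET fallback, urlopen returns once the headers arrive;
            # leaving the with-block closes the connection with the body unread.
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except urllib.error.HTTPError as e:
            # 405/501: the server rejects HEAD, so retry once with GET.
            if method == "HEAD" and e.code in (405, 501):
                continue
            return False  # 404 and other HTTP errors
        except urllib.error.URLError:
            return False  # DNS failure, refused connection, timeout, ...
    return False
```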
But this raises a question:
When the function returns True I proceed to download the MP3 with wget.
Why? Why not just use wget in the first place? If the URL is gettable, it will get the URL; if not, it will give you an error—just as easily as Mechanize will. And that avoids hitting each URL twice.
For that matter, why try to script wget, instead of using the built-in support in the stdlib or a third-party module like requests?
If you're just looking for a way to parallelize things, that's easy to do in Python:
from concurrent import futures
import urllib.request

def is_good_url(url):
    req = urllib.request.Request(url, method='HEAD')
    resp = urllib.request.urlopen(req)
    return url, 200 <= resp.code < 300

with futures.ThreadPoolExecutor(max_workers=8) as executor:
    fs = [executor.submit(is_good_url, url) for url in urls]
    results = (f.result() for f in futures.as_completed(fs))
    good_urls = [url for (url, good) in results if good]
And to change this to actually download the valid URLs instead of just making a note of which ones are valid, just change the task function to something that fetches and saves the data from a GET instead of doing the HEAD thing. The ThreadPoolExecutor Example in the docs does almost exactly what you want.
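A download version of the task function might look roughly like this. Treat it as a sketch: `download`, `download_all`, the destination-directory argument and the filename-from-URL rule are all my own choices, not something from the question.

```python
import os
import shutil
import urllib.parse
import urllib.request
from concurrent import futures


def download(url, dest_dir):
    """GET `url` and save the body under `dest_dir`; return the local path."""
    # Derive a filename from the URL path (hypothetical naming rule).
    name = os.path.basename(urllib.parse.urlparse(url).path) or "download"
    path = os.path.join(dest_dir, name)
    with urllib.request.urlopen(url) as resp, open(path, "wb") as f:
        shutil.copyfileobj(resp, f)  # copies in chunks instead of loading it all
    return path


def download_all(urls, dest_dir, max_workers=8):
    """Download every URL in parallel; return ({url: path}, {url: exception})."""
    saved, failed = {}, {}
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        tasks = {executor.submit(download, url, dest_dir): url for url in urls}
        for task in futures.as_completed(tasks):
            url = tasks[task]
            try:
                saved[url] = task.result()
            except Exception as exc:  # e.g. HTTPError for a 404
                failed[url] = exc
    return saved, failed
```

This hits each URL once: a missing file simply ends up in `failed`, so no separate availability check is needed.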
A portion of code that I have that will parse a web site does not work.
I can trace the problem to the .read function of my urllib2.urlopen object.
page = urllib2.urlopen('http://magiccards.info/us/en.html')
data = page.read()
Until yesterday, this worked fine, but now the length of the data is always 69496 instead of 122989. When I open smaller pages, my code works fine.
I have tested this on Ubuntu, Linux Mint and Windows 7. All show the same behaviour.
I'm assuming that something has changed on the web server, but the page is complete when I use a web browser. I have tried to diagnose the issue with Wireshark, but there the page is received as complete.
Does anybody know why this may be happening or what I could try to determine the issue?
The page seems to be misbehaving unless you request the content encoded as gzip. Give this a shot:
import urllib2
import zlib
request = urllib2.Request('http://magiccards.info/us/en.html')
request.add_header('Accept-Encoding', 'gzip')
response = urllib2.urlopen(request)
data = zlib.decompress(response.read(), 16 + zlib.MAX_WBITS)
As Nathan suggested, you could also use the great Requests library, which accepts gzip by default.
import requests
data = requests.get('http://magiccards.info/us/en.html').text
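In case the `16 + zlib.MAX_WBITS` argument in the urllib2 snippet looks like magic: it tells zlib to expect a gzip header and trailer rather than a bare zlib stream. Here is a small Python 3 sketch you can run offline (`gunzip_body` is just a name I picked):

```python
import gzip
import zlib


def gunzip_body(raw):
    """Decompress a gzip-encoded HTTP body held in memory."""
    # wbits = 16 + MAX_WBITS makes zlib expect the gzip wrapper.
    return zlib.decompress(raw, 16 + zlib.MAX_WBITS)


payload = b"<html>" + b"x" * 1000 + b"</html>"
compressed = gzip.compress(payload)

# Round trip: the gzip-wrapped data decompresses back to the original bytes.
assert gunzip_body(compressed) == payload

# With the default wbits, zlib rejects the gzip header.
try:
    zlib.decompress(compressed)
except zlib.error:
    pass
```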
Yes, the server is closing the connection and you need keep-alive to be sent. urllib2 does not have that facility ( :-( ). There used to be urlgrabber, which provided an HTTPHandler that works alongside the urllib2 opener, but unfortunately I can't get that working either. For now, you could use other libraries, such as requests, as demonstrated in the other answer, or httplib2.
import httplib2
h = httplib2.Http(".cache")
resp, content = h.request("http://magiccards.info/us/en.html", "GET")
print len(content)
Possible Duplicate:
How to use Python to login to a webpage and retrieve cookies for later usage?
I want to download the whole source of a web page from a service that handles cookies in an unusual way. I wrote a script that works and seems fine, but at some point it returned this error:
urllib2.HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
My script runs in a loop and changes the link to the subpage whose content I want to download.
I get a cookie, send a package of data, and then I am able to reach the proper link and download the HTML.
The script looks like this:
import urllib2
data = 'some_string'
url = "http://example/index.php"
url2 = "http://example/source"
req1 = urllib2.Request(url)
response = urllib2.urlopen(req1)
cookie = response.info().getheader('Set-Cookie')
## Use the cookie in subsequent requests
req2 = urllib2.Request(url, data)
req2.add_header('cookie', cookie)
response = urllib2.urlopen(req2)
## reuse again
req3 = urllib2.Request(url2)
req3.add_header('cookie', cookie)
response = urllib2.urlopen(req3)
html = response.read()
I've been reading about cookiejar/cookielib, because using it is supposed to get rid of the error mentioned above. However, I have no clue how to rewrite my code to use http.cookiejar and urllib.request.
I tried something like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener( urllib.request.HTTPCookieProcessor(cj) )
r = opener.open(url) # now cookies are stored in cj
r1 = urllib.request(url, data) #TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
r2 = opener.open(url2)
print( r2.read() )
But it's not working the way my first script does.
P.S. Sorry for my English; I am not a native speaker.
@Piotr Dobrogost thanks for the link, it solved the issue.
TypeError solved by using data=b"string" instead of data="string"
I still have some issues from porting to Python 3, but this issue can be closed.
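For anyone landing here later, the Python 3 port of the three-step flow could look something like the sketch below. The helper names (`make_session`, `encode_form`, `fetch_source`) and the idea that the POST data is a dict of form fields are assumptions on my part. If your data is already a string, just call `.encode()` on it.

```python
import http.cookiejar
import urllib.parse
import urllib.request


def make_session():
    """Return (opener, jar); the opener stores and resends cookies automatically."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """URL-encode a dict of form fields into the bytes Python 3 requires for POST."""
    return urllib.parse.urlencode(fields).encode("ascii")


def fetch_source(url, url2, fields):
    """Visit `url`, POST `fields` to it, then return the body of `url2`."""
    opener, _jar = make_session()
    opener.open(url).close()                            # server sets the cookie
    opener.open(url, data=encode_form(fields)).close()  # POST with the cookie attached
    with opener.open(url2) as resp:                     # cookie is sent again
        return resp.read()
```

Because the same opener is used for all three requests, there is no need to copy the Set-Cookie header around by hand.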