How to retrieve html with grequests - python

While doing some research on Python web scraping I came across a package named grequests, which is said to send HTTP requests in parallel and thus be faster than the normal Python requests module. That sounds great, but I was not able to get the HTML of the web pages I requested, since there doesn't seem to be a .text attribute like in the normal requests module. Any help would be great!

grequests.imap returns a generator of responses (grequests.map returns a list), so iterate over it; each item is a regular requests Response and has the usual .text attribute:
import grequests

reqs = (grequests.get(u) for u in ["https://example.com", "https://example.org"])  # example URLs
for response in grequests.imap(reqs):
    print(response.text)
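If you'd rather get the responses back as an ordered list, grequests.map works as well; a minimal sketch with placeholder URLs:
import grequests

reqs = [grequests.get(u) for u in ["https://example.com", "https://example.org"]]
for response in grequests.map(reqs):
    if response is not None:  # requests that failed come back as None
        print(response.status_code, len(response.text))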

Related

Is there any way to get the HTTP response code without receiving the HTML in Python?

Maybe I'm asking a weird question, I'm not sure.
Normally, if we use a module like requests or urllib3 in Python, we get a response for each request: the status code, cookies, headers, and the HTML content.
The problem is that this HTML data is huge and I don't need it; I just need the response code for each request. So, is there any method or module to do that?
Send an HTTP HEAD request; the server returns the status line and headers but no body:
import requests

r = requests.head("https://example.com")  # placeholder URL
print(r.status_code)

Deferred Downloading using Python Requests Library

I am trying to fetch some information from Workflowy using the Python Requests library. Basically I am trying to programmatically get the content under this URL: https://workflowy.com/s/XCL9FCaH1b
The problem is that Workflowy goes through a 'loading phase' before the actual content is displayed, so I end up getting the content of the 'loading' page when I make the request. Basically I need a way to defer getting the content so I can bypass the loading phase.
It seemed like the Requests library addresses this problem here: http://www.python-requests.org/en/latest/user/advanced/#body-content-workflow but I couldn't get that example to work for my purposes.
Here is the super simple block of code that ends up getting the 'loading page':
import requests
path = "https://workflowy.com/s/XCL9FCaH1b"
r = requests.get(path, stream=True)
print(r.content)
Note that I don't have to use Requests; I just picked it because it looked like it might offer a solution to my problem. I'm currently using Python 2.7.
Thanks a lot for your time!
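For reference, a minimal sketch of the body-content workflow the linked docs describe; note that stream=True only defers downloading the response body over the network, it does not wait for the page's JavaScript to run:
import requests

path = "https://workflowy.com/s/XCL9FCaH1b"
r = requests.get(path, stream=True)
print(r.status_code)   # the status line and headers are available immediately
body = r.content       # the body is only pulled from the socket at this point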

Python trace URL get requests - using python script

I'm writing a script, to help me do some repetitive testing of a bunch of URLs.
I've written a Python method in the script that opens the URL and sends a GET request. I'm using the Requests: HTTP for Humans API (http://docs.python-requests.org/en/latest/) to handle the HTTP calls.
There's request.history, which returns the list of redirect responses, but I need to be able to access the particular redirects in that list of 301s. There doesn't seem to be a way to do this - to access and trace what my URLs are redirecting to. I want to be able to access the redirected URLs (status code 301).
Can anyone offer any advice?
Thanks
Okay, I'm so silly. Here's the answer I was looking for
r = requests.get("http://someurl")
print(r.history[1].url)  # the URL of the second response in the redirect chain
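To trace the whole chain rather than one entry, you can walk r.history; a minimal sketch (the URL is a placeholder):
import requests

r = requests.get("http://someurl")
for hop in r.history:               # each intermediate (301/302) response
    print(hop.status_code, hop.url)
print(r.status_code, r.url)         # the final destination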

Unshorten the URL without downloading whole page in python

I want to unshorten URLs to get the real address. In some cases there is more than one redirection. I have tried using urllib2, but it seems to be making GET requests, which consumes time and bandwidth. I want to get only the headers, so that I have the final URL without needing to download the whole body/data of the page.
Thanks
You need to execute an HTTP HEAD request to get just the headers.
The second answer to this question shows how to perform a HEAD request using urllib:
How do you send a HEAD HTTP request in Python 2?
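As a minimal sketch using the requests library rather than urllib (the short URL is a placeholder): a HEAD request returns no body, and allow_redirects=True makes requests follow the whole chain for you.
import requests

def unshorten(url):
    # HEAD returns only the status line and headers, never a body;
    # allow_redirects=True (off by default for .head) follows every redirect hop.
    r = requests.head(url, allow_redirects=True)
    return r.url

print(unshorten("http://bit.ly/example"))  # hypothetical shortened URL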

Extracting information from AJAX based sites using Python

I am trying to retrieve query results from AJAX-based sites like www.snapbird.org using Python. Since the data doesn't show up in the page source, I am not sure how to proceed.
I am a Python newbie, so it would be great if I could get a pointer in the right direction.
I am also open to some other approach to the task if that is easier.
This is going to be complex, but as a start, open Firebug and find the URL that gets called when the AJAX request is made. You can call that URL directly in your Python program and parse the output.
You could use Selenium's Python client driver to parse the page source. I usually use this in conjunction with PyQuery to make web scraping easier.
Here's the basic tutorial for Selenium's Python driver. Be sure to follow the instructions for Selenium version 2 instead of version 1 (unless you're using version 1 for some reason).
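A minimal sketch of that combination (it assumes Firefox and its WebDriver are installed; the selector is illustrative):
from selenium import webdriver
from pyquery import PyQuery as pq

driver = webdriver.Firefox()             # or webdriver.Chrome()
driver.get("http://www.snapbird.org")    # the AJAX-driven page from the question
doc = pq(driver.page_source)             # page source after the JavaScript has run
print(doc("title").text())               # illustrative PyQuery selector
driver.quit()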
You could also point Chrome/Firefox at an HTTP proxy and then log/extract the necessary content with the proxy. I've tinkered with Python proxies that save/log the requests/content based on content-type or URI globs.
For other projects I've written site-specific JavaScript bookmarklets which poll for new data and then POST it to my server (by dynamically creating both a form and an iframe, and setting myform.target=myiframe).
Other JavaScript scripts/bookmarklets simulate a user interacting with the site, so instead of polling every few seconds the JavaScript automates clicking buttons and submitting forms, etc. These scripts are always very site-specific of course, but they've been hugely useful for me, especially when iterating over all the paginated results for a given search.
Here is a stripped down version of walking over a list of "paginated" results and preparing to send the data off to my server (which then further parses it with BeautifulSoup). In particular this was designed for Youtube's Sent/Inbox messages.
var tables = [];
function process_and_repeat(){
    if(!(inbox && inbox.message_pane_ && inbox.message_pane_.innerHTML)){
        alert("We've got no data!");
        return; // nothing to process, so stop here
    }
    if(inbox.message_pane_.innerHTML.indexOf('<table') === 0){
        // Save this page of results, move to the next page, and check again shortly.
        tables.push(inbox.message_pane_.innerHTML);
        inbox.next_page();
        setTimeout(process_and_repeat, 3000);
    } else {
        alert("Finished, [" + tables.length + " processed]");
        // POST the collected HTML to my server, which parses it with BeautifulSoup.
        document.write('<form action=http://curl.sente.cc method=POST><textarea name=sent.html>' + escape(tables.join('\n')) + '</textarea><input type=submit></form>');
    }
}
process_and_repeat(); // now we wait and watch as all the paginated pages are viewed :)
This is a stripped down example without any fancy iframes/non-essentials which just add complexity.
Adding to what Liam said, Selenium is a great tool too; it has helped with my various scraping needs. I'd be more than happy to help you out with this if you'd like.
One easy solution might be to use a programmatic browser like Mechanize, so you can browse a site, follow links, make searches, and do nearly everything you can do with a browser that has a user interface.
But for a very specific job you may not even need such a library; you can use the urllib and urllib2 Python libraries to make a connection and read the response... You can use Firebug to see the data structure of a search and the response body, then use urllib to make a request with the relevant parameters...
With an example...
I made a search for joyvalencia and checked the request URL with Firebug to see:
http://api.twitter.com/1/statuses/user_timeline.json?screen_name=joyvalencia&count=100&page=2&include_rts=true&callback=twitterlib1321017083330
So calling this URL with urllib2.urlopen() is the same as making the query on Snapbird. The response body is:
twitterlib1321017083330([{"id_str":"131548107799396357","place":null,"geo":null,"in_reply_to_user_id_str":null,"coordinates":.......
When you use urlopen() and read the response, the string above is what you get... Then you can use Python's json library to read the data and parse it into a Pythonic data structure...
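A minimal sketch of that last step, reusing the URL from above (Python 2, matching the urllib2 usage in this answer; the old Twitter endpoint is purely illustrative, and the callback=... parameter makes the response JSONP, so the wrapper is stripped before parsing):
import json
import urllib2

url = ("http://api.twitter.com/1/statuses/user_timeline.json"
       "?screen_name=joyvalencia&count=100&page=2&include_rts=true"
       "&callback=twitterlib1321017083330")
body = urllib2.urlopen(url).read()

# Strip the JSONP wrapper, e.g. twitterlib1321017083330([...]), to get plain JSON.
tweets = json.loads(body[body.index("(") + 1:body.rindex(")")])
print(tweets[0]["id_str"])   # e.g. "131548107799396357" from the sample above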
