What I'm trying to do:
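from requests_html import HTMLSession
with HTMLSession() as s:
    # my_cookie_jar is the RequestsCookieJar exported from selenium
    r = s.get('url', cookies=my_cookie_jar)
    r.html.render()  # render() lives on the response's html object, not on the session
    print(r.html.html)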
I want to access a page that requires me to log in. I already logged in using a selenium browser and then exported the cookies as a RequestsCookieJar.
Now when I print the text returned by the GET request, I receive the text of the correct webpage (but without the JavaScript rendered). As soon as I render the HTML, however, the cookies seem to have no effect and I get the HTML of a page asking me to log in (the same page I get when issuing the request without the cookies in the first place).
Now my question:
Is it possible to specify the cookies when rendering the html (or should requests-html already do this by default)?
Yes, you can, by passing the cookies keyword argument to the render method.
r.html.render(cookies=my_cookie_jar)  # r is the response returned by s.get(...)
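A hedged sketch related to the answer above: depending on the requests-html version, render() may expect the cookies as a list of plain dicts (the format a headless browser typically uses) rather than a RequestsCookieJar. That is an assumption to check against your installed version. The helper name jar_to_cookie_dicts is hypothetical; it relies only on the fact that a RequestsCookieJar can be iterated like a standard cookie jar.

```python
def jar_to_cookie_dicts(jar):
    """Convert a cookie jar into a list of plain dicts, one per cookie."""
    return [
        {
            "name": cookie.name,
            "value": cookie.value,
            "domain": cookie.domain,
            "path": cookie.path,
        }
        for cookie in jar
    ]


# Hypothetical usage, where r is the response returned by s.get(...):
# r.html.render(cookies=jar_to_cookie_dicts(my_cookie_jar))
```

If render() in your version already accepts the jar directly, the conversion is unnecessary.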
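A solution from GitHub (https://github.com/psf/requests-html/issues/109) that seems to work for me. With reload=False, render() works on the content that was already fetched instead of loading the page again in the browser: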
html.render(reload=False)
When this page is scraped with urllib2:
import urllib2
url = "https://www.geckoboard.com/careers/"
response = urllib2.urlopen(url)
content = response.read()
the following element (the link to the job) is nowhere to be found in the source (content)
Taking a look at the full source that gets rendered in a browser:
So it would appear that the FRONT-END ENGINEER element is dynamically loaded by JavaScript. Is it possible to have this JavaScript executed by urllib2 (or another low-level library) without involving e.g. Selenium, BeautifulSoup, or similar tools?
The information is loaded through an AJAX request. You can inspect it with the Firebug extension for Firefox, or with Chrome's own developer tools: just press F12 in Chrome while opening the URL. You can find the complete details there.
There you will find a request to the URL https://app.recruiterbox.com/widget/13587/openings/
The information returned by that URL is what gets rendered in the web page.
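As a rough illustration of calling such an endpoint directly instead of scraping the rendered page, here is a minimal sketch. It is written for Python 3 (where urllib2 became urllib.request), the function name fetch_json is hypothetical, and it assumes the endpoint answers with JSON, which the answer above does not state.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=15):
    """Request a URL and decode the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))


# Example call, using the endpoint named in the answer (it may have changed):
# openings = fetch_json("https://app.recruiterbox.com/widget/13587/openings/")
```

If the endpoint returns HTML or JSONP instead, the decoding step would need to be adapted.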
From what I understand, you are building something generic for multiple web-sites and don't want to go deep down in how a certain site is loaded, what requests are made under-the-hood to construct the page. In this case, a real browser is your friend - load the page in a real browser automated via selenium - then, once the page is loaded, pass the .page_source to lxml.html (from what I see this is your HTML parser of choice) for further parsing.
If you don't want a browser to show up or you don't have a display, you can go headless - PhantomJS or a regular browser on a virtual display.
Here is some sample code to get you started:
from lxml.html import fromstring
from selenium import webdriver
driver = webdriver.PhantomJS()
driver.set_page_load_timeout(15)
driver.get("https://www.geckoboard.com/careers/")
# TODO: you might need a delay here
tree = fromstring(driver.page_source)
driver.close()
# TODO: parse HTML
You should also know that there are plenty of methods to locate elements in selenium, so you might not even need a separate HTML parser here.
I think you're looking for something like this: https://github.com/scrapinghub/splash
I am signing into my account at www.goodreads.com to scrape the list of books from my profile.
However, when I go to the goodreads page, even if I am logged in, my scraper gets only the home page. It cannot log in to my account. How do I redirect it to my account?
Edit:
from bs4 import BeautifulSoup
import urllib2
response=urllib2.urlopen('http://www.goodreads.com')
soup = BeautifulSoup(response.read(), 'html.parser')
# Remove <script> elements so that only the visible text remains
for script in soup.find_all('script'):
    script.extract()
print(soup.get_text())
If I run this code, I only get as far as the homepage. I cannot log in to my profile, even though I am already logged in in the browser.
What do I do to log in from a scraper?
When you visit the site in a browser, the browser keeps a session (roughly speaking, cookies that hold information about your account), so every time you go to the main page you are already logged in. Your code doesn't use that session, so it has to do everything from scratch:
1) go to the main page 2) log in 3) gather your data
Also, this question shows how to log in to your account.
I hope it helps.
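The log-in-then-fetch steps above can be sketched as follows. This is a minimal sketch for Python 3 (urllib2 became urllib.request), not the site's actual login flow: the function names are hypothetical, and the login URL and form field names have to be read from the real login form.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, build_opener


def build_session():
    """Return an opener that stores and resends cookies, plus its cookie jar."""
    jar = CookieJar()
    opener = build_opener(HTTPCookieProcessor(jar))
    return opener, jar


def encode_form(fields):
    """Encode form fields as a POST body."""
    return urlencode(fields).encode("utf-8")


def login_and_fetch(opener, login_url, credentials, data_url):
    """Log in, then fetch a page using the cookies set by the login."""
    # Passing a body makes this a POST; cookies from the response land in the jar
    opener.open(login_url, encode_form(credentials)).close()
    # The same opener sends the stored cookies back automatically
    with opener.open(data_url) as response:
        return response.read()
```

Many sites also require a hidden token from the login page itself, in which case the main page has to be fetched with the same opener first.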
Goodreads has an API that you might want to use instead of trying to log in and scrape the site's HTML. It's formatted in XML, so you can still use BeautifulSoup - just make sure you have lxml installed and use it as the parser. You'll need to register for a developer key, and also register your application, but then you're good to go.
You can use urllib2 or the requests library to log in and then scrape the response. In my experience, using requests is a lot easier.
Here's a good explanation on logging in using both urllib2 and requests:
How to use Python to login to a webpage and retrieve cookies for later usage?
I have to retrieve some text from a website called morningstar.com. To access that data I have to log in. Once I log in and provide the URL of the web page, I get the HTML that a normal (not logged-in) user would see. As a result, I am not able to access that information. Any solutions?
BeautifulSoup is for parsing HTML once you've already fetched it. You can fetch the HTML using any standard URL fetching library. I prefer curl, since that is how you tagged your post, but Python's built-in urllib2 also works well.
If you're saying that after logging in the response HTML is the same as for those who are not logged in, I'm gonna guess that your login is failing for some reason. If you are using urllib2, are you making sure to store the cookie properly after your first login and then passing this cookie to urllib2 when you send the request for the data?
It would help if you posted the code you are using to make the two requests (the initial login, and the attempt to fetch the data).
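One possible way to store the login cookie and pass it to a later request is to keep the jar in a file. The sketch below assumes Python 3 (in Python 2 the same classes live in cookielib and urllib2), and the function names are hypothetical.

```python
from http.cookiejar import LWPCookieJar
from urllib.request import HTTPCookieProcessor, build_opener


def save_cookies(jar, path):
    """Write all cookies to disk, including session cookies without an expiry."""
    jar.save(path, ignore_discard=True, ignore_expires=True)


def load_cookies(path):
    """Read cookies that were previously written by save_cookies."""
    jar = LWPCookieJar()
    jar.load(path, ignore_discard=True, ignore_expires=True)
    return jar


def opener_for(jar):
    """Build an opener that sends the jar's cookies with every request."""
    return build_opener(HTTPCookieProcessor(jar))
```

Printing the cookie names in the jar right after the login request is a quick way to check whether the login actually succeeded.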
I've got a link that I know redirects to another end URL, and I'm trying to get the address of that end URL using Python. But the original link is a little weird and doesn't work like a normal redirect, and I can't figure out why. When I paste the link (it's below for you to try, if you'd like) into a browser, it redirects perfectly. But when I run the following code, it doesn't.
import urllib2
request = urllib2.Request('http://www.facebook.com/ajax/emu/end.php?eid=AQJSWpZ3e4cCTHoNdahpJzPYzmzHOENzbTWBVlW4SgIxX0rL9bo6NXmS3q06cjeh5jO9wbsmr3IyGrpbXPSj0GPLbRJl4VUH-EBnmSy_R4j7iYzpMe1ooZ6IEqSEIlBl0-5SEldIhxI82m75YPa5nOhuBdokiwTw79hoiRB-Zn1auxN-6WLVe3e5WNSt3HLAEjZL-2e4ox_7yAyLcBo1nkamEvShTyZ-GfIf0A9oFXylwRnV8oNaqNmUnqrFYqDbUhzh7d6LSm3jbv1ue2coS3w8N7OxTKVwODHa-Hd3qRbYskB9weio8eKdDFtkvDKuzSSq5hjr711UjlDsgpxLuAmdD95xVwpomxeEsBsMCYJoUEQYa-cM7q3W1aiIYBHlyn2__t74qHWVvzK5zaLKFMKjRFQqphDlUMgMni6AP1VHSn1wli_3lgeVD8TzcJMSlJIF7DC_O44WdjBIMY8OufER3ZB_mm2NqwUe6cvV9oV9SNyYHE4UUURYjW_Z6sUxz3SpHG8c6QxJ-ltSeShvU3mIwAhFE3M0jGTg7AQ7nIoOUfC8PDainFZ1NV8g31aqaqDsF7UxdlOmBT6w-Y8TPmHOXfSlWB-M3MQYUBmcWS3UzlbSsavQG8LXPqYbyKfvkAfncSnZS3_tkoqbTksFirQWlSxJ3mgXrO5PqopH63Esd9ynCbFQM1q_3_wgkYvTeGS9XK6G63_Ag3N9dCHsO_bCJToJT4jeHQCSQ83cb1U5Qpe_7EWbw1ilzgyL-LBVrpH424dwK-4AoaL00W-gWzShSdOynjcoGeB7KE0pHbg-XhuaVribSodriSGybNdADBosnddVvZldY22-_97MqEuA&&c=4&&f=4&&ui=6003071106023-id_4e0b51323f9d01393198225&&en=1&&a=0&&sig=78154')
opener = urllib2.build_opener()
f = opener.open(request)
f.geturl()
I simply get my original url back. I encounter the same problem when I save cookies and use mechanize. Any help would be much appreciated! Thanks!
It looks like this is using Javascript to perform the redirect. You'll either have to figure out exactly how the Javascript is performing the redirects and pull out the appropriate urls, or you'll have to actually run the Javascript. As far as I know, running Javascript from python is not an easy task.
(original answer deleted)
If you look at the contents of f.read() you'll see what's going on here. Instead of returning a 301 or 302 that redirects to the new URL, Facebook actually returns a real HTML document - which contains a piece of Javascript that uses document.location.replace to change the URL in the browser.
There's no easy way of replicating that with Python - the best thing to do is to parse the document with something like BeautifulSoup to find the Javascript, and somehow extract the new URL. It won't be pretty.
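As a rough sketch of the "extract the new URL" step, a regular expression can pull the argument out of the document.location.replace call. The function name is hypothetical, and the sketch assumes the URL appears as a plain string literal with slashes escaped as \/; the real page may build the URL differently, in which case this will not match.

```python
import re

# Matches document.location.replace("...") or document.location.replace('...')
_REPLACE_CALL = re.compile(
    r"""document\.location\.replace\(\s*(["'])(.*?)\1\s*\)"""
)


def extract_js_redirect(html):
    """Return the URL passed to document.location.replace, or None if absent."""
    match = _REPLACE_CALL.search(html)
    if match is None:
        return None
    # JavaScript string literals often escape slashes as \/
    return match.group(2).replace("\\/", "/")
```

The html argument would be the decoded result of f.read() from the code in the question.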
I am using Python 2.7.1 to access an online website. I need to load a URL, then submit a POST request to that URL that causes the website to redirect to a new URL. I would then like to POST some data to the new URL. This would be easy to do, except that the website in question does not allow the user to use browser navigation. (As in, you cannot just type in the URL of the new page or press the back button, you must arrive there by clicking the "Next" button on the website). Therefore, when I try this:
import urllib, urllib2, cookielib
url = "http://www.example.com/"
jar = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(jar))
form_data_login = "POSTDATA"
form_data_try = "POSTDATA2"
resp = opener.open(url, form_data_login)
resp2 = opener.open(resp.geturl(), form_data_try)
print resp2.read()
I get a "Do not use the back button on your browser" message from the website in resp2. Is there any way to POST data to the website resp gives me? Thanks in advance!
EDIT: I'll look into Mechanize, so thanks for that pointer. For now, though, is there a way to do it with just Python?
Have you taken a look at mechanize? I believe it has the functionality you need.
You're probably getting to that page by posting something via that Next button. You'll have to take a look at the POST parameters sent when pressing that button and add all of these post parameters to your call.
The website could, though, be set up in such a way that it only accepts a particular POST parameter that forces you to go through the website itself (e.g. by hashing a timestamp in a certain way), but that's not very likely.
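One way to pick up the parameters that the "Next" button would send is to read the input fields out of the page returned by the first request and merge them with your own data. This is a minimal sketch for Python 3 using only the standard library; the class and function names are hypothetical, and it assumes the page contains a single form.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs from every <input> element on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        if name:
            self.fields[name] = attributes.get("value") or ""


def build_post_body(page_html, extra_fields):
    """Merge the page's own form fields with extra data into a POST body."""
    collector = FormFieldCollector()
    collector.feed(page_html)
    fields = dict(collector.fields)
    fields.update(extra_fields)  # your own values override the page's defaults
    return urlencode(fields).encode("ascii")
```

The resulting bytes can be passed as the second argument to opener.open, in place of the hard-coded "POSTDATA2" string.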