Python-Logging in to a site while scraping it - python

I am signing into my account at www.goodreads.com to scrape the list of books from my profile.
However, when I go to the goodreads page, even if I am logged in, my scraper gets only the home page. It cannot log in to my account. How do I redirect it to my account?
Edit:
from bs4 import BeautifulSoup
import urllib2
response=urllib2.urlopen('http://www.goodreads.com')
soup = BeautifulSoup(response.read())
[x.extract() for x in soup.find_all('script')]
print(soup.get_text())
If I run this code, I get only till the homepage, I cannot login to the my profile, even if I am already logged in to the browser.
What do I do to log in from a scraper?

Actually when you go to the site there is something called sessions that contains information about your accout ( not exactly but something like that ) and your browser can use them so every time that you go to the main page you are logged in , but you code doesn't use sessions and these things so you should do everything from the first
1) go to mainpage 2) log in 3) gathering your data
and also this question showed how to login to your account
I hope it helps.

Goodreads has an API that you might want to use instead of trying to log in and scrape the site's HTML. It's formatted in XML, so you can still use BeautifulSoup - just make sure you have lxml installed and use it as the parser. You'll need to register for a developer key, and also register your application, but then you're good to go.

You can use urllib2 or requests library to login and then scrape the response. In my experience using requests is a lot easier.
Here's a good explanation on logging in using both urllib2 and requests:
How to use Python to login to a webpage and retrieve cookies for later usage?

Related

How to get Requests to go to a certain webage after AUTH in

I am trying to get a table that in on an HTML page.i am using Selenium to go to the page and log in because the page i need to access is only available after you log into the site. I wanted to get a date from the site using a MAC id for a product. This date is in a table. Instead of using Selenium though i want to try and use requests as it is faster, but my problem is to access this webpage i need to log in and after i log in it takes me to the homepage which is not the page i want. How can i go to the page i want while stil saying logged in?
with requests.Session() as t:
t.get("https://www.axis.com/partner_pages/serialno.php?serialno=00-40-8C-EB-4F-AD&submit=Search", auth=HTTPBasicAuth('user', "pass"))
t.get(("https://www.axis.com/partner_pages/serialno.php?serialno"))
print(t.)

Parsing bot protected site

I am trying to parse the website "https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-price" and extract its most recent messages from its board. It is bot protected with Cloud-flare. I am using python and its relative libraries and this is what I have so far
from bs4 import BeautifulSoup as soup #parses/cuts the html
import cfscrape
import requests
url = 'https://ih.advfn.com/stock-market/NYSE/gen-electric-GE/stock-
price'
r=requests.get(url)
html = soup(r.text, "html.parser")
containers = html.find("div",{"id":"bbPosts"})
print(containers.text.strip())
I am not able to use the html parser because the site detects and blocks my script then.
My questions are:
How can I parse the web pages to pull the table data?
Might I mention that this is for a security class I am taking. I am not using this for malicious reasons.
There are multiple ways of bypassing the site protection. You have to see exactly how they are blocking you.
One common way of blocking requests is to look at the User Agent header. The client ( in your case the requests library ) will inform the server about it's identity.
Generally speaking, a browser will say I am a browser and a library will say I am a library. The server can then say I allow browsers but not libraries to access my content.
However, for this particular case, you can simply lie to the server by sending your own User Agent header.
You can see a example here. Try to use your browsers user agent.
Other blocking techniques include ip ranges. One way to bypass this is via a vpn. This is one of the easiest vpns to set up. Just spin up a machine on amazon and get this container running.
What else could happen, you might try to access a single page application that is not rendered server side. In this case, what you should receive with that get requests is a very small html file that essentially references a javascript file. If this is the case, what you need is a actual browser that you control programatically. I would suggest you look at Google Chrome Headless however there are others. You can also use Selenium
Web crawling is a beautiful but very deep subject. I think these pointers should set you on the right direction.
Also, as a quick mention, my advice is to avoid from bs4 import BeautifulSoup as soup. I would recommend html2text

How to logout of website during requests session

I am attempting to scrape some data from a website which requires a login. To complicate matters, I am scraping data from three different accounts. So in other words, I need to login to the site, scrape the data and then logout, three times.
The html behind the logout button looks like this:
The (very simplified) code I've tried is below:
import requests
for account in [account1,account2,account3]:
with requests.session() as session:
[[login code here]]
[[scraping code here]]
session.get(url + "/logout")
The scraping using the first account works fine, but after that it doesn't. I'm assuming this is because I'm not logging out properly. What can I do to fix this?
It's quite simple:
You should forge correct login request.
To do it go to the login page:
open 'Inspect' tool, 'Network' tab. Checking 'Preserve log' option is quite useful as well.
Log in to the site, and you'll see login request appeared in Network tab (Usually it's a POST request).
Right-click to request, select Copy -> Copy as Curl, and then just use this brilliant tool
Usually, you can trim up and headers and cookies of the code produced by the tool(but be careful trimming Content-Type header, it can break your code).
Replace requests.[get|post](...) to session.[get|post](...)
Profit. You'll have logged in session by execution of the upper code. Logging out and any form population is made pretty much the same way.

Loading Ajax with Python Requests

For a personal project, I'm trying to get a full friends list of a user (myself for now) from Facebook using Requests and BeautifulSoup.
The main friends page however displays only 20, and the rest are loaded with Ajax when you scroll down.
The request url looks something like this (method is GET):
https://www.facebook.com/ajax/pagelet/generic.php/AllFriendsAppCollectionPagelet?dpr=1&data={"collection_token":"1244314824:2256358349:2","cursor":"MDpub3Rfc3RydWN0dXJlZDoxMzU2MDIxMTkw","tab_key":"friends","profile_id":1244214828,"overview":false,"ftid":null,"order":null,"sk":"friends","importer_state":null}&__user=1364274824&__a=1&__dyn=aihaFayfyGmagngDxfIJ3G85oWq2WiWF298yeqrWo8popyUW3F6wAxu13y78awHx24UJi28cWGzEgDKuEjKeCxicxabwTz9UcTCxaFEW58nVV8-cxnxm1typ9Voybx24oqyUf9UgC_UrQ4bBv-2jAxEhw&__af=o&__req=5&__be=-1&__pc=EXP1:DEFAULT&__rev=2677430&__srp_t=1474288976
My question is, is it possible to recreate the dynamically generated tokens such as the __dyn, cursor, collection_token etc. to send manually in my request? Is there some way to figure out how they are generated or is it a lost cause?
I know that the current Facebook API does not support viewing a full friends list. I also know that I can do this with Selenium, or some other browser simulator, but that feels way too slow, ideally I want to scrape thousands of friends lists (of users whose friends lists are public) in a reasonable time.
My current code is this:
import requests
from bs4 import BeautifulSoup
with requests.Session() as S:
requests.utils.add_dict_to_cookiejar(S.cookies, {'locale': 'en_US'})
form = {}
form['email'] = 'myusername'
form['pass'] = 'mypassword'
response = S.post('https://www.facebook.com/login.php?login_attempt=1&lwv=110', data=form)
# Im logged in
page = S.get('https://www.facebook.com/yoshidakai/friends?source_ref=pb_friends_tl')
Any help will be appreciated, including other methods to achieve this :)
As of this writing, you can extract this information by parsing the page and then get the next cursor for latter pages by parsing the preceding ajax response. However, as Facebook regularly makes updates to its backend, I have had more stable results using selenium to drive a Chrome headless browser to scroll through the page, and then parsing the resulting HTML.

How to extract text from a web page that requires logging in using python and beautiful soup?

i have to retrieve some text from a website called morningstar.com . To access that data i have to log in. Once i log in and provide the url of the web page , i get the HTML text of a normal user (not logged in).As a result am not able to accees that information . ANy solutions ?
BeautifulSoup is for parsing html once you've already fetched it. You can fetch the html using any standard url fetching library. I prefer curl, as you tagged your post, python's built-in urllib2 also works well.
If you're saying that after logging in the response html is the same as for those who are not logged in, I'm gonna guess that your login is failing for some reason. If you are using urllib2, are are you making sure to store the cookie properly after your first login and then passing this cookie to urllib2 when you are sending the request for the data?
It would help if you posted the code you are using to make the two requests (the initial login, and the attempt to fetch the data).

Categories