I am attempting to scrape some data from a website which requires a login. To complicate matters, I am scraping data from three different accounts. So in other words, I need to login to the site, scrape the data and then logout, three times.
The html behind the logout button looks like this:
The (very simplified) code I've tried is below:
import requests

for account in [account1, account2, account3]:
    with requests.session() as session:
        [[login code here]]
        [[scraping code here]]
        session.get(url + "/logout")
The scraping using the first account works fine, but after that it doesn't. I'm assuming this is because I'm not logging out properly. What can I do to fix this?
It's quite simple:
You should forge the correct login request.
To do that, go to the login page:
open the 'Inspect' tool, 'Network' tab. Checking the 'Preserve log' option is quite useful as well.
Log in to the site, and you'll see the login request appear in the Network tab (usually it's a POST request).
Right-click the request, select Copy -> Copy as cURL, and then just use this brilliant tool.
Usually you can trim the headers and cookies from the code produced by the tool (but be careful trimming the Content-Type header, it can break your code).
Replace requests.[get|post](...) with session.[get|post](...)
Profit. You'll have a logged-in session after executing the code above. Logging out and any form population is done pretty much the same way.
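For illustration, a minimal sketch of that flow for several accounts; the base url, the login endpoint and the form field names below are placeholders to be replaced with whatever you captured in the Network tab:

import requests

# Hypothetical base url and form field names -- copy the real ones from the
# login request you captured in DevTools.
base_url = "https://example.com"
accounts = [
    {"username": "user1", "password": "pass1"},
    {"username": "user2", "password": "pass2"},
    {"username": "user3", "password": "pass3"},
]

for account in accounts:
    # A fresh session per account keeps cookies from leaking between logins.
    with requests.Session() as session:
        # Reproduce the POST the browser sends (same url, same form fields).
        resp = session.post(base_url + "/login", data=account)
        resp.raise_for_status()

        # ... scraping code here, always via session.get(...) ...

        # Log out the same way the browser does.
        session.get(base_url + "/logout")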
I'm just trying to log into VRV and get the list of shows from the Crunchyroll page so I can just open the site later, but when I try to get back the parsed website after logging in, there's a lot of info missing, like titles and images, and it's incomplete. This is the code I have up to now. Obviously my email and password aren't 'email' and 'password', I just changed them to post it here.
import requests
import pyperclip as p

def enterVrv():
    s = requests.Session()
    dataL = {'email': 'email', 'password': 'password'}
    s.post('https://static.vrv.co/vrvweb/build/common.6fb25c4cff650ac4e6ae.js', data=dataL)
    crunchy = s.get('https://vrv.co/crunchyroll/browse')
    p.copy(str(crunchy.content))
    exit(0)
I've tried posting from the normal 'https://vrv.co' site, I tried from the 'https://vrv.co.signin' link, and I tried the link you currently see in the code, which I got from the Network pane in the developer tools. After I ran the code I would take the copied HTML and replace the current one in a web browser to see if it pulled up correctly, but it all comes in incomplete.
It looks like your problem is that you're trying to get data from a web page that's being loaded dynamically. Indeed, if you navigate to https://vrv.co/crunchyroll/browse in your browser you'll likely notice there's a delay between the page loading and the Crunchyroll titles being displayed.
It also looks like vrv does not expose an API for you to programmatically access this data either.
To get around this you could try accessing the page via a web automation tool such as selenium and scraping the data that way. As for just making a basic request to the site though, you're probably out of luck.
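If you do go the Selenium route, a rough sketch might look like the following; the CSS selector for the show cards is only a guess and would need to be checked against the actual page (and you would still need to log in through the site's sign-in form first if the content requires it):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
try:
    driver.get("https://vrv.co/crunchyroll/browse")
    # Wait for the JavaScript-rendered content to appear. ".browse-card" is a
    # placeholder selector -- inspect the real page to find the right one.
    WebDriverWait(driver, 20).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, ".browse-card"))
    )
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, ".browse-card")]
    print(titles)
finally:
    driver.quit()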
I am trying to write a python script to login to the following site in order to automatically keep on eye on some account details: https://gateway.usps.com/eAdmin/view/signin
I have the right credentials, but something isn't quite working correctly, I don't know if it is because of the hidden inputs that exist on the form
import requests
from bs4 import BeautifulSoup

user = 'myusername'
passwd = 'mypassword'

s = requests.Session()
r = s.get("https://gateway.usps.com/eAdmin/view/signin")
soup = BeautifulSoup(r.content)
sp = soup.find("input", {"name": "_sourcePage"})['value']
fp = soup.find("input", {"name": "__fp"})['value']
si = soup.find("input", {"name": "securityId"})['value']

data = {
    "securityId": si,
    "username": user,
    "password": passwd,
    "_sourcePage": sp,
    "__fp": fp}
headers = {"Content-Type": "application/x-www-form-urlencoded",
           "Host": "gateway.usps.com",
           "Origin": "https://gateway.usps.com",
           "Referer": "https://gateway.usps.com/eAdmin/view/signin"}
login_url = "https://gateway.usps.com/eAdmin/view/signin"

r = s.post(login_url, headers=headers, data=data, cookies=r.cookies)
print(r.content)
_sourcePage, securityId and __fp are all hidden input values from the page source. I am scraping this from the page, but obviously when I get to do the POST request, I'm opening the url again, so these values change and are no longer valid. However, I'm unsure how to rewrite the POST line to ensure that I extract the correct hidden values for submission.
I don't think that this is only relevant to this site, but for any site with hidden random values.
You can't do that.
You are trying to authenticate using an HTTP POST request outside the application scope, i.e. outside the login page and its own web form.
For security reasons the web page implements different techniques, one of which is the anti-CSRF token (which is probably _sourcePage), to ensure that the login request comes exclusively from the web page.
For this reason, every time you scrape the page and grab the content of the hidden security inputs, the web application generates them anew. Thus, when you reuse them to craft the final request, of course they are no longer valid.
See also: https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF)
I am signing into my account at www.goodreads.com to scrape the list of books from my profile.
However, when I go to the goodreads page, even if I am logged in, my scraper gets only the home page. It cannot log in to my account. How do I redirect it to my account?
Edit:
from bs4 import BeautifulSoup
import urllib2
response=urllib2.urlopen('http://www.goodreads.com')
soup = BeautifulSoup(response.read())
[x.extract() for x in soup.find_all('script')]
print(soup.get_text())
If I run this code, I only get as far as the homepage; I cannot log in to my profile, even though I am already logged in in the browser.
What do I do to log in from a scraper?
Actually, when you go to the site there is something called a session that contains information about your account (not exactly, but something like that), and your browser can use it, so every time you go to the main page you are logged in. Your code, however, doesn't use sessions and the like, so you have to do everything from the start:
1) go to the main page 2) log in 3) gather your data
A minimal sketch of these steps is shown below, and this question also shows how to log in to your account.
I hope it helps.
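Here is that sketch using the requests library; the sign-in url and the form field names are assumptions you should verify in your browser's developer tools:

import requests
from bs4 import BeautifulSoup

session = requests.Session()

# 1) go to the main/sign-in page (the url below is an assumption)
session.get("https://www.goodreads.com/user/sign_in")

# 2) log in by posting whatever fields the real form expects (the field
#    names here are guesses -- check the form in your browser)
payload = {"user[email]": "you@example.com", "user[password]": "secret"}
session.post("https://www.goodreads.com/user/sign_in", data=payload)

# 3) gather your data with the now-authenticated session
profile = session.get("https://www.goodreads.com/review/list/YOUR_USER_ID")
soup = BeautifulSoup(profile.content, "html.parser")
print(soup.get_text())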
Goodreads has an API that you might want to use instead of trying to log in and scrape the site's HTML. It's formatted in XML, so you can still use BeautifulSoup - just make sure you have lxml installed and use it as the parser. You'll need to register for a developer key, and also register your application, but then you're good to go.
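For instance, a rough sketch of fetching one shelf through the API and parsing the XML with BeautifulSoup; the endpoint and parameters shown are assumptions to be checked against the API documentation:

import requests
from bs4 import BeautifulSoup

# Hypothetical call to the reviews endpoint -- verify the exact endpoint,
# parameters and XML layout in the Goodreads API documentation.
params = {"v": 2, "key": "YOUR_DEVELOPER_KEY", "id": "YOUR_USER_ID", "shelf": "read"}
resp = requests.get("https://www.goodreads.com/review/list.xml", params=params)

# Parse the XML response; the "xml" parser requires lxml to be installed.
soup = BeautifulSoup(resp.content, "xml")
for title in soup.find_all("title"):
    print(title.get_text())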
You can use urllib2 or requests library to login and then scrape the response. In my experience using requests is a lot easier.
Here's a good explanation on logging in using both urllib2 and requests:
How to use Python to login to a webpage and retrieve cookies for later usage?
I want, with a Python script, to be able to log in to a website and retrieve some data, all from behind my company's proxy.
I know that this question seems a duplicate of others that you can find searching, but it isn't.
I already tried the solutions proposed in the answers to those questions, but they didn't work... I don't only need a piece of code to log in and get a specific web page, but also some grasp of the "concepts" behind how this whole mechanism works.
Here is a description of what I want to be able to do:
Log into a website > Get to page X > Insert data in some form of page X and push "Calculate" button > Capture the results of my query
Once I have the results I'll see how to sort out the data.
How can I achieve this behind a proxy? Every time I try to use the "requests" library to log in it doesn't work, saying I am unable to get page X since I did not authenticate... or worse, I am even unable to get to that site because I didn't set up the proxy first.
Clarification of Requirements
First, make sure you understand the context for getting the results of your calculation
(F12 shall show DevTools in Chrome or Firebug in Firefox where you can learn most details discussed below)
can you access the target page from your web browser?
is it really necessary to use a proxy? If yes, then test it in the browser and note exactly what proxy to use
what sort of authentication do you have to use to access the target web app? Options are "basic", "digest", or some custom scheme requiring filling in a form and having something in cookies etc.
when you access the calculation form in your browser, does pressing the "Calculate" button result in a visible HTTP request? Is it a POST? What is the content of the request?
Simple: HTTP based scenario
It is very likely that your situation will allow the use of simple HTTP communication. I will assume the following situation:
a proxy is used and you know the url and possibly the user name and password to use the proxy
All pages on the target web application require either basic authentication or digest authentication.
The Calculate button uses a classical HTML form and results in an HTTP POST request with all data seen in the form parameters.
Complex: Browser emulation scenario
There is some chance that part of the interaction needed to get your result depends on JavaScript code performing something on the page. Often this can be converted into the HTTP scenario by investigating what the final HTTP requests are, but here I will assume this is not feasible or possible and we will emulate using a real browser.
For this scenario I will assume:
you are able to perform the task yourself in web browser and have all required information available
proxy url
proxy user name and password, if required
url to log in
user name and password to fill into some login form to get in
knowing "where to follow" after login to reach your calculation form
you are able to find enough information about each page element to use (form to fill, button to press etc.), like its name, id, or something else which will allow you to target it at the moment of simulation.
Resolving HTTP based scenario
Python provides the excellent requests package, which will serve our needs:
Proxy
Assuming a proxy at http://10.10.1.10:3128, username being user and password pass
import requests
proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
#ready for `req = requests.get(url, proxies=proxies)`
Basic Authentication
Assuming the web app allows access for the user appuser with password apppass
url = "http://example.com/form"
auth=("appuser", "apppass")
req = requests.get(url, auth=auth)
or, explicitly using HTTPBasicAuth:
from requests.auth import HTTPBasicAuth
url = "http://example.com/path"
auth = HTTPBasicAuth("appuser", "apppass")
req = requests.get(url, auth=auth)
Digest authentication differs only in classname being HTTPDigestAuth
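For example, the Digest variant of the earlier snippet:

import requests
from requests.auth import HTTPDigestAuth

url = "http://example.com/path"
auth = HTTPDigestAuth("appuser", "apppass")
req = requests.get(url, auth=auth)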
Other authentication methods are documented at requests pages.
HTTP POST for a HTML Form
import requests
a = 4
b = 5
data = {"a": a, "b": b}
url = "http://example.com/formaction/url"
req = requests.post(url, data=data)
Note that this url is not the url of the form, but of the "action" taken when you press the submit button.
All together
Users often reach the final HTML form in two steps: first log in, then navigate to the form.
However, web applications typically allow (with knowledge of the form url) direct access. This performs authentication in the same step and this is the way described below.
Note: If this does not work, you will have to use sessions with requests, which is possible, but I will not elaborate on that here (a minimal sketch follows the combined example below).
import requests
from requests.auth import HTTPBasicAuth

proxies = {
    "http": "http://user:pass@10.10.1.10:3128/",
}
auth = HTTPBasicAuth("appuser", "apppass")
a = 4
b = 5
data = {"a": a, "b": b}
url = "http://example.com/formaction/url"
req = requests.post(url, data=data, proxies=proxies, auth=auth)
By now you should have your result available via req and you are done.
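If the direct, single-step access does not work, here is the minimal sketch of the session-based variant mentioned in the note above; the login url and the form field names are hypothetical:

import requests

# Hypothetical login url and form field names -- adapt them to what DevTools
# shows for the real application.
login_url = "http://example.com/login"
form_url = "http://example.com/formaction/url"
proxies = {"http": "http://user:pass@10.10.1.10:3128/"}

with requests.Session() as session:
    session.proxies.update(proxies)
    # Step 1: log in; the session keeps the resulting cookies.
    session.post(login_url, data={"username": "appuser", "password": "apppass"})
    # Step 2: submit the calculation form with the authenticated session.
    req = session.post(form_url, data={"a": 4, "b": 5})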
Resolving Browser emulation scenario
Proxy
The Selenium doc for configuring a proxy recommends configuring your proxy in your web browser. The same link provides details on how to set up the proxy from your script, but here I will assume you use Firefox and have already (during manual testing) succeeded with configuring the proxy.
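If you prefer setting the proxy from the script, a minimal sketch using Firefox preferences; the host and port are placeholders, and a proxy requiring authentication may still prompt for credentials:

from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)            # 1 = manual proxy configuration
profile.set_preference("network.proxy.http", "10.10.1.10") # placeholder host
profile.set_preference("network.proxy.http_port", 3128)    # placeholder port
profile.set_preference("network.proxy.ssl", "10.10.1.10")
profile.set_preference("network.proxy.ssl_port", 3128)
driver = webdriver.Firefox(firefox_profile=profile)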
Basic or Digest Authentication
The following modified snippet originates from an SO answer by Mimi, using Basic Authentication:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.set_preference('network.http.phishy-userpass-length', 255)
driver = webdriver.Firefox(firefox_profile=profile)
driver.get("https://appuser:apppass#somewebsite.com/")
Note that Selenium does not seem to provide a complete solution for Basic/Digest authentication; the sample above is likely to work, but if not, you may check this Selenium Developer Activity Google Group thread and see that you are not alone. Some of the solutions there might work for you.
The situation with Digest authentication seems even worse than with Basic; some people report success with AutoIt or with blindly sending keys, and the discussion referenced above shows some attempts.
Authentication via Login Form
If the web site allows logging in by entering credentials into some form, you might be the lucky one, as this is a rather easy task to do with Selenium. For more, see the next chapter about filling in forms.
Fill in a Form and Submit
In contrast to Authentication, filling data into forms, clicking buttons and similar activities are where Selenium works very well.
import time
from selenium import webdriver

a = 4
b = 5
url = "http://example.com/form"
# formactionurl = "http://example.com/formaction/url"  # this is not relevant in Selenium

# Start up Firefox
browser = webdriver.Firefox()

# Assume you get somehow authenticated now
# You might succeed with Basic Authentication by using url = "http://appuser:apppass@example.com/form"

# Navigate to your url
browser.get(url)

# find the element whose id is param_a and fill it in
inputElement = browser.find_element_by_id("param_a")
inputElement.send_keys(str(a))

# repeat for "b"
inputElement = browser.find_element_by_id("param_b")
inputElement.send_keys(str(b))

# submit the form (if having problems, try to set inputElement to the Submit button)
inputElement.submit()

time.sleep(10)  # wait 10 seconds (better methods can be used)
page_text = browser.page_source

# now you have what you asked for
browser.quit()
Conclusions
The information provided in the question describes what is to be done in a rather general manner, but lacks the specific details which would allow providing a tailored solution. That is why this answer focuses on proposing a general approach.
There are two scenarios, one being HTTP based, the second using an emulated browser.
The HTTP solution is preferable, despite the fact that it requires a bit more preparation in finding out which HTTP requests are to be used. The big advantage is that it is much faster in production, requires much less memory, and should be more robust.
In rare cases, when there is some essential JavaScript activity in the browser, we may use the browser emulation solution. However, this is much more complex to set up and has major problems at the authentication step.
The goal here: given a user's Facebook profile url, access and open the profile page. Some simple Python code:
from urllib2 import urlopen
url = "http://www.facebook.com/username"
page = urlopen(url)
The problem is that for some "username" values this causes HTTP ERROR 404. I noticed this error happening only when the path includes a name rather than the "profile.php?id=XXX" format.
Notice that we only have the url here and not the user id.
UPDATE:
This turned out to happen also for some of the "profile.php?id=XXX" and other username formats.
This is a privacy feature of Facebook. Users have the ability to hide their profile page so that only logged in users can view their page. Accessing the page with /profile.php?id=XXX or with /username makes no difference. You must be logged-in in order to view the HTML page.
In your context, you'd have to first log in to a valid Facebook account before requesting the page and you should no longer receive the 404's.
One way to check this is via the Graph API: graph.facebook.com/USERNAME will return a link property in the resulting JSON if the user has a public page, and it will be omitted for private pages.
Not every Facebook account is accessible as FIRST.LAST, so you won't be able to reliably do this.
There is currently no guarantee that an account is accessible with a vanity name.
Works perfectly fine as long as the username exists.
Are you trying to open the page in a Web Browser or access the HTML source generated by the page?
If the latter, have you thought of using the Facebook Graph API to achieve whatever it is that you are doing? This will be much faster and the API is all documented. Plus the page's HTML source could change at any point in time, whereas the Graph API will not.
Edit
You could use the Graph API to get the user ID without even having to create an application, by going to http://graph.facebook.com/username and parsing the JSON response. You can then access the profile HTML using http://www.facebook.com/profile.php?id=userId
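A rough sketch of that approach, in the same urllib2 style as the question; it assumes the Graph API still answers such unauthenticated requests, which may no longer be the case:

import json
from urllib2 import urlopen

# Resolve the username to a numeric id via the Graph API (assumption: the
# endpoint responds without an access token), then build the profile url.
username = "username"
graph = json.load(urlopen("http://graph.facebook.com/" + username))
user_id = graph["id"]
page = urlopen("http://www.facebook.com/profile.php?id=" + user_id)
html = page.read()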