I have access to an admin page by using a basic HTTP authentication system.
This page loads its data with JavaScript by fetching JSON from another URL, which I can see in the Firefox Web Developer Tools (Ctrl+Shift+I, then the Network tab, then reload the page).
If I copy and paste this URL in the same instance of my browser, I retrieve the JSON data I need.
So:
1. Using Firefox, I connect to the admin page and provide the username/passwd.
2. Using Firefox Webdev toolbox, I retrieve the URL used to retrieve the JSON data I want.
3. I copy and paste this URL and get the JSON data I need, ready to be parsed.
Now, I would like to do the same automatically using Python 3.
I use Requests to make it easier. However, if I try to retrieve the URL found in step 3 directly, I get a 401 authentication error:
import requests
url = "http://xxx/services/users?from=0&to=50"
r = requests.get(url, auth=('user', 'passwd'))
r.status_code
>>> 401
I can do an authenticated request on the admin URL (something like http://xxx/admin-ui/) and I can retrieve the content of the web page, but it doesn't contain anything interesting since everything is loaded in JavaScript from that JSON data coming from the URL in step 3...
Any help would be more than welcome!
I needed to use form-based authentication, not HTTP Basic Auth as I originally thought.
So first I needed to log in at the first URL in order to retrieve an auth cookie:
url = "http://xxx/admin-ui/"
credentials = {'j_username':'my_username','j_password':'my_passwd'}
s = requests.session()
s.post(url, credentials)
s.cookies
>>> <<class 'requests.cookies.RequestsCookieJar'>[Cookie(version=0, name='JSESSIONID', value='...>
Then I could connect to the second URL using this cookie and retrieve the data I needed:
url2 = "http://xxx/services/users?from=0&to=50"
r = requests.get(url2, cookies=s.cookies)
r.content
>>> (a lot of JSON data! \o/)
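The two steps above can be folded into one helper that reuses a single session, so the login cookie never has to be passed by hand. This is only a sketch: the function name login_and_fetch_json is made up for illustration, the j_username / j_password field names are the ones from this particular site, and a 200 response to the login POST does not by itself prove that the login worked.

```python
def login_and_fetch_json(session, login_url, data_url, username, password):
    """Log in through the form, then fetch JSON with the same session."""
    # Field names taken from the answer above; other sites use different ones.
    credentials = {"j_username": username, "j_password": password}
    login_response = session.post(login_url, data=credentials)
    login_response.raise_for_status()

    # The session re-sends the cookies it received at login automatically.
    data_response = session.get(data_url)
    data_response.raise_for_status()
    return data_response.json()


# Intended use (performs real network requests, so it is left commented out):
# import requests
# with requests.Session() as s:
#     users = login_and_fetch_json(
#         s,
#         "http://xxx/admin-ui/",
#         "http://xxx/services/users?from=0&to=50",
#         "my_username",
#         "my_passwd",
#     )
```

Calling s.get(url2) on the session has the same effect here as requests.get(url2, cookies=s.cookies), with less to keep track of.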
I'm new to webscraping and have been trying for fun to scrape a boxing website.
My code below worked on the first attempt, but when I tried to re-run it, it no longer retrieved the link data.
I can still access the website from my browser, so not sure what the error is!
Appreciate any pointers.
import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
os.system('cls')
heavy = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
pages = set()
def get_links(page_url):
    print("running crawler...")
    global pages
    req = Request(heavy, headers = {'User-Agent':'Mozilla/5.0'})
    html = urlopen(req)
    bs = BeautifulSoup(html.read(), 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/en/box-pro/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                new_page = link.attrs['href']
                print(new_page)
                pages.add(new_page)
                get_links(new_page)
get_links('')
print("crawling done.")
If you inspect html.read() you will find that the page displays a login form. It might be that a detection system picks up your bot and tries to prevent you from scraping (or at least make it harder).
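A quick way to confirm this from code is to check whether the fetched HTML contains a password field. The sketch below is a heuristic that uses only the standard library; the class and function names are invented for illustration, and a site could present a login wall that this check misses.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Sets .found to True when the document contains <input type="password">."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.found = True


def looks_like_login_page(html_text):
    """Heuristic: a page with a password input is probably a login form."""
    finder = PasswordFieldFinder()
    finder.feed(html_text)
    return finder.found
```

In the crawler from the question, this could be called on html.read().decode() before parsing links, to stop early with a clear message instead of silently finding nothing.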
As an engineer at WebScrapingAPI I've tested your URL using our API and it passes each time (it returns the data, not the login page). That is because we've implemented a number of detection evasion features, including an IP rotation system. So by sending the request from another IP with a completely different browser fingerprint, the targeted website 'thinks' it's another person and passes on the information. If you want to test it yourself, here is the script you can use:
import requests
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
}
response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
If you want to build your own scraper, I suggest you implement some of the techniques in this article. You might also want to actually create an account on the target website, log in with those credentials, collect the cookies, and pass them with your request.
In order to collect the cookies:
Navigate to the login screen
Open developer tools in your browser (Network tab)
Log in and check the login request:
(Note that I have a failed attempt, because I didn't use real credentials to log in)
To pass the cookies with your request, add them as a Cookie header on your req. Example: req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'}). Also, try to use the same User-Agent as the original request (the one made when you logged in). It can be found in the same login request where you picked up the cookies.
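When there is more than one cookie, they all go into a single Cookie header, separated by "; ". The following sketch builds such a request with urllib without sending it; the helper name, the URL, and the cookie names and values are placeholders, not values from the real site.

```python
from urllib.request import Request


def build_request_with_cookies(url, cookies, user_agent):
    """Return an unsent Request that carries all cookies in one Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"User-Agent": user_agent, "Cookie": cookie_header})


req = build_request_with_cookies(
    "https://example.com/protected",           # placeholder URL
    {"myCookie": "lovely", "other": "value"},  # placeholder cookies
    "Mozilla/5.0",
)
print(req.get_header("Cookie"))  # prints: myCookie=lovely; other=value
```

The request can then be passed to urlopen(req) exactly as in the question's code.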
I'm relatively new to Python so excuse any errors or misconceptions I may have. I've done hours and hours of research and have hit a stopping point.
I'm using the Requests library to pull data from a website that requires a login. I was initially successful logging in through session.post(payload) / session.get, and I got a [200] response. Once I tried to view the JSON data behind the login, I hit a [403] response. Long story short, I can make it work by logging in through a browser, inspecting the page to find the current session cookie, and then defining the headers in requests to pass along that exact cookie with session.get.
My question is: is it possible to set, generate, or find this cookie through Python after logging in? After logging in and out a few times, I can see that some components of the cookie remain the same but others do not. The website I'm using is Garmin Connect.
Any and all help is appreciated.
If your issue is about login purposes, then you can use a session object. It stores the corresponding cookies so you can make requests, and it generally handles the cookies for you. Here is an example:
s = requests.Session()
# all cookies received will be stored in the session object
s.post('http://www...',data=payload)
s.get('http://www...')
Furthermore, with the requests library, you can get a cookie from a response, like this:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies
But you can also send cookies back to the server on subsequent requests, like this:
url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
I hope this helps!
Reference: How to use cookies in Python Requests
I'm just studying the requests library (http://docs.python-requests.org/en/latest/),
and I have a problem fetching a page with cookies using requests.
For example:
url2= 'https://passport.baidu.com'
parsedCookies={'PTOKEN': '412f...', 'BDUSS': 'hnN2...', ...} # cookie values are replaced by ... for privacy
req = requests.get(url2, cookies=parsedCookies)
text=req.text.encode('utf-8','ignore')
f=open('before.html','w')
f.write(text)
f.close()
req.close()
When I use the code above to fetch the page, it just saves the login page to 'before.html' instead of the logged-in page, which suggests that I haven't actually logged in successfully.
But if I use urllib2 to fetch the page, it works as expected.
parsedCookies="PTOKEN=412f...;BDUSS=hnN2...;..." # different format, same content as the cookies above
req = urllib2.Request(url2)
req.add_header('Cookie', parsedCookies)
ret = urllib2.urlopen(req)
f=open('before_urllib2.html','w')
f.write(ret.read())
f.close()
ret.close()
When I use this code, it saves the logged-in page to before_urllib2.html.
--
Are there any mistakes in my code?
Any reply would be appreciated.
You can use a Session object to get what you want:
url2='http://passport.baidu.com'
session = requests.Session() # create a Session object
cookie = requests.utils.cookiejar_from_dict(parsedCookies)
session.cookies.update(cookie) # set the cookies of the Session object
# headers is assumed to be a dict of request headers that you have defined elsewhere
req = session.get(url2, headers=headers,allow_redirects=True)
If you use the requests.get function, it doesn't send the cookies for the redirected page. If you use Session().get instead, it maintains and sends cookies for all HTTP requests, which is exactly what the concept of a "session" means.
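The question holds the same cookies in two shapes: a dict for requests and a "name=value;name=value" string for the Cookie header. A small sketch for converting between the two, in case the cookies were copied from a browser as a single string; both function names are made up for illustration and the sample values in the comments are placeholders.

```python
def cookie_header_to_dict(header_value):
    """Turn 'a=1; b=2' into {'a': '1', 'b': '2'}."""
    cookies = {}
    for part in header_value.split(";"):
        part = part.strip()
        if not part or "=" not in part:
            continue  # skip empty pieces such as a trailing ';'
        name, value = part.split("=", 1)  # values may themselves contain '='
        cookies[name.strip()] = value.strip()
    return cookies


def dict_to_cookie_header(cookies):
    """Turn {'a': '1', 'b': '2'} into 'a=1; b=2'."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())
```

The resulting dict can be passed to requests.utils.cookiejar_from_dict, as in the answer above.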
Let me elaborate on what happens here:
When I sent cookies to http://passport.baidu.com/center with the parameter allow_redirects set to False, the returned status code was 302 and one of the response headers was 'location': '/center?_t=1380462657' (this is a dynamic value generated by the server; replace it with whatever you get from the server):
url2= 'http://passport.baidu.com/center'
req = requests.get(url2, cookies=parsedCookies, allow_redirects=False)
print req.status_code # output 302
print req.headers
But when I set allow_redirects to True, it still doesn't end up on the redirected page (http://passport.baidu.com/center?_t=1380462657); the server returns the login page. The reason is that requests.get doesn't send the cookies for the redirected page, here http://passport.baidu.com/center?_t=1380462657, so we can't log in successfully. That is why we need the Session object.
If I set url2 = http://passport.baidu.com/center?_t=1380462657, it returns the page you want. One solution is to use the code above to get the dynamic location value and build the path to your account, like http://passport.baidu.com/center?_t=1380462657; then you can get the desired page.
url2= 'http://passport.baidu.com' + req.headers.get('location')
req = session.get(url2, cookies=parsedCookies, allow_redirects=True )
But this is cumbersome, so when dealing with cookies, the Session object does an excellent job for us!
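If you do follow the redirect by hand, concatenating the host and the location value only works when the header holds an absolute path. A slightly more robust sketch resolves it against the URL that was requested, which also covers absolute and relative Location values; the helper name resolve_redirect is made up for illustration.

```python
from urllib.parse import urljoin


def resolve_redirect(request_url, location):
    """Resolve a Location header value against the URL that was requested."""
    return urljoin(request_url, location)


next_url = resolve_redirect(
    "http://passport.baidu.com/center",
    "/center?_t=1380462657",  # the dynamic value from the example above
)
print(next_url)  # prints: http://passport.baidu.com/center?_t=1380462657
```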
I'm trying to scrape a web site with the requests module.
Using chrome and inspect elements, I go to the url, fill in a form and click the continue button. Chrome's inspect elements (network documents) shows what chrome sent with post. It also shows multiple cookies. The site redirects to a url with among other things a session ID.
To simulate this, I try using requests. I take the form data from inspect elements and reformat it to a dictionary. I use requests.session to include the cookies.
import requests
form_data = 'currentCalForm=dep&currentCodeForm=&tripType=oneWay&searchCategory=award&originAirport=JFK&flightParams.flightDateParams.travelMonth=5&flightParams.flightDateParams.travelDay=14&flightParams.flightDateParams.searchTime=040001&destinationAirport=LHR&returnDate.travelMonth=-1000&returnDate.travelDay=-1000&adultPassengerCount=2&adultPassengerCount=1&serviceclass=coach&searchTypeMode=matrix&awardDatesFlexible=true&originAlternateAirportDistance=0&destinationAlternateAirportDistance=0&discountCode=&flightSearch=award&dateChanged=false&fromSearchPage=true&advancedSearchOpened=false&numberOfFlightsToDisplay=10&searchCategory=&aairpassSearchType=false&moreOptionsIndicator=oneWay&seniorPassengerCount=0&youngAdultPassengerCount=0&childPassengerCount=0&infantPassengerCount=0&passengerCount=2'.split('&')
payload = {}
for item in form_data:
    key, value = item.split('=')
    if value:
        payload[key] = value
with requests.session() as s:
    r = s.post('https://www.aa.com/homePage.do', params = payload, allow_redirects=True)
    print r.headers
    print r.history
    print r.url
    print r.status_code
    with open('x.htm', 'wb') as f:
        f.write(r.text.encode('utf8'))
requests, however, does not appear to follow the redirect. history is empty and the url appears to be the data I sent rather than what the site returned. x.htm shows a web page, but does not contain the info I expected.
From http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history I expected r.url to contain the redirected url and r.history to contain an http response code.
What am I doing wrong?
What you are doing seems to be wrong. I am not sure how you decided to send a POST to https://www.aa.com/homePage.do, but that appears to be a GET endpoint and doesn't take the params you send. When you click search, your browser sends this POST: https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do;jsessionid=XXXXXXXXXXXXXXXXXXX with these parameters:
currentCalForm=dep
currentCodeFrom=
tripType=roundTrip
originAirport=LAX
flightParams.flightDateParams.travelMonth=10
flightParams.flightDateParams.travelDay=24
flightParams.flightDateParams.searchTime=040001
destinationAirport=JFK
returnDate.travelMonth=10
returnDate.travelDay=31
returnDate.searchTime=400001
adultPassengerCount=1
adultPassengerCount=1
childPassengerCount=0
hotelRoomCount=1
serviceclass=coach
searchTypeMode=matrix
awardDatesFlexible=true
originAlternateAirportDistance=0
destinationAlternateAirportDistance=0
discountCode=
flightSearch=revenue
dateChanged=false
fromSearchPage=true
advancedSearchOpened=false
numberOfFlightsToDisplay=10
searchCategory=
aairpassSearchType=false
moreOptionsIndicator=
seniorPassengerCount=0
youngAdultPassengerCount=0
infantPassengerCount=0
passengerCount=1
This will then give you HTML back. You pretty much have to send all the requests that the browser sends. It might be easier for you to do it with Selenium.
I found this using HttpFox, which is probably similar to Chrome's Network panel.
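One detail worth noting when replaying a captured form: the parameter list contains repeated keys (adultPassengerCount appears twice) and empty values, both of which are lost when the string is split by hand into a dict that skips blanks. A sketch using the standard library keeps them as a list of pairs; the function name is made up, and the raw string here is a shortened excerpt of the one in the question.

```python
from urllib.parse import parse_qsl


def form_string_to_pairs(raw):
    """Parse 'a=1&b=&a=2' into [('a', '1'), ('b', ''), ('a', '2')]."""
    # keep_blank_values=True preserves fields such as 'currentCodeForm='
    return parse_qsl(raw, keep_blank_values=True)


raw = (
    "currentCalForm=dep&currentCodeForm=&tripType=oneWay"
    "&adultPassengerCount=2&adultPassengerCount=1"
)
pairs = form_string_to_pairs(raw)
for key, value in pairs:
    print(key, repr(value))
```

requests accepts a list of tuples like this for data=, which sends the fields in the POST body; params= appends them to the URL as a query string instead.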
I am trying to scrape some selling data using the StubHub API. An example of this data can be seen here:
https://sell.stubhub.com/sellapi/event/4236070/section/null/seatmapdata
You'll notice that if you try to visit that URL without logging in to stubhub.com, it won't work. You need to log in first.
Once I've signed in via my web browser, I open the URL which I want to scrape in a new tab, then use the following command to retrieve the scraped data:
r = requests.get('https://sell.stubhub.com/sellapi/event/4236070/section/null/seatmapdata')
However, once the browser session expires after ten minutes, I get this error:
<FormErrors>
<FormField>User Auth Check</FormField>
<ErrorMessage>
Either is not active or the session might have expired. Please login again.
</ErrorMessage>
I think that I need to implement the session ID via cookie to keep my authentication alive and well.
The Requests library documentation is pretty terrible for someone who has never done this sort of thing before, so I was hoping you folks might be able to help.
The example provided by Requests is:
s = requests.Session()
s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
r = s.get("http://httpbin.org/cookies")
print r.text
# '{"cookies": {"sessioncookie": "123456789"}}'
I honestly can't make heads or tails of that. How do I preserve cookies between POST requests?
I don't know how stubhub's api works, but generally it should look like this:
s = requests.Session()
data = {"login":"my_login", "password":"my_password"}
url = "http://example.net/login"
r = s.post(url, data=data)
Now your session contains the cookies provided by the login form. To access this session's cookies, simply use
s.cookies
Any further requests made through this session will send these cookies.
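Since the question mentions the session expiring after about ten minutes, one possible pattern is to detect the expiry message and log in again before retrying. This is a rough sketch under stated assumptions: the marker text is copied from the error message in the question, the function name and the login callable are hypothetical, and the real StubHub login flow is not known here.

```python
# Text taken from the error message quoted in the question.
SESSION_EXPIRED_MARKER = "session might have expired"


def get_with_relogin(session, url, login):
    """GET url; if the body reports an expired session, log in and retry once."""
    response = session.get(url)
    if SESSION_EXPIRED_MARKER in response.text:
        # login is a hypothetical callable that POSTs the credentials
        # through the same session, e.g. the s.post(url, data=data) call above.
        login(session)
        response = session.get(url)
    return response
```

With a requests.Session() passed in as session, the cookies set by the fresh login are reused automatically on the retry.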