I want to download a webpage using Python for a web scraping task. The problem is that the website requires cookies to be enabled, otherwise it serves a different version of the page. I did implement a solution that solves the problem, but in my opinion it is inefficient. I need your help to improve it!
This is how I do it now:
import requests
import cookielib
cj = cookielib.CookieJar()
user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
# first request to get the cookies
requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', headers=user_agent, timeout=2, cookies=cj)
# second request reusing cookies served first time
r = requests.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', headers=user_agent, timeout=2, cookies=cj)
html_text = r.text
Basically, I create a CookieJar object and then send two consecutive requests for the same URL. The first time it serves me the wrong page, but as compensation it gives me the cookies. The second request reuses those cookies and I get the right page.
The question is: is it possible to use just one request and still get the cookie-enabled version of the page?
I tried sending a HEAD request the first time instead of GET to minimize traffic, but in that case no cookies are served. Googling didn't give me the answer either.
So, it would be interesting to understand how to do this efficiently. Any ideas?
You need to make a request to get the cookie, so no, you cannot obtain and reuse the cookie without making two separate requests. If by "cookie-enabled" you mean the version that recognizes your script as having cookies, then it all depends on the server, and you could try:
hardcoding the cookies before making the first request (a short sketch follows this list),
requesting the smallest possible page (one whose response is as small as possible but still sets the cookies) to obtain the first cookie,
trying to find some workaround (maybe adding some GET parameter will fool the site into believing you have cookies, but you would need to find it for this specific site).
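For the first option, a minimal sketch of what hardcoding might look like. The cookie name and value here are purely illustrative; you would need to copy a real pair from a browser session that has already passed the site's cookie check:
import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
url = 'https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&'

# Hypothetical cookie copied from a browser session that already has the right version.
hardcoded_cookies = {'PS_TOKEN': 'value-copied-from-browser'}

r = requests.get(url, headers=user_agent, cookies=hardcoded_cookies, timeout=2)
html_text = r.text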
I think the winner here might be to use requests's session framework, which takes care of the cookies for you.
That would look something like this:
import requests

user_agent = {'User-agent': 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'}
s = requests.Session()
s.headers.update(user_agent)  # the session sends these headers on every request
r = s.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', timeout=2)
html_text = r.text
Try that and see if it works.
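If a single GET still comes back with the cookie-less page, the same Session can simply repeat the request; it resends whatever cookies the first response set, so you never touch the jar yourself:
# The session stored any cookies set by the first response and sends them
# back automatically on this second request.
r = s.get('https://ccirecruit.cox.com/psc/RECRUIT/EMPLOYEE/HRMS/c/HRS_HRAM.HRS_CE.GBL?JobOpeningId=42845&SiteId=1&Page=HRS_CE_JOB_DTL&PostingSeq=1&', timeout=2)
html_text = r.text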
Related
I am a complete beginner with Python. I tried to crawl some product information from my www.Alibaba.com console. When I got to the visitor details page, I found that the cookie changed every time I clicked the search button; it changed for each request. Because of that, I cannot crawl the data the way I crawled other pages, where the cookies stayed fixed for a certain period.
After comparing the cookie data, I found that only 3 key-value pairs changed. I think those 3 values are what make my crawl fail. So I want to know how to handle such a situation.
For Python 3, urllib.request in the standard library can be configured to use an http.cookiejar CookieJar, which will keep track of cookies within the opener automatically.
You can set this up like this:
import http.cookiejar, urllib.request
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
r = opener.open("http://example.com/")
If you're using Python 2, then a similar approach works with urllib2:
import urllib2
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
r = opener.open("http://example.com/")
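As a small follow-up (still using the example.com URL from above): the opener keeps the cookie jar between calls, so cookies that change on every response, like the ones described in the question, are picked up and resent automatically:
import http.cookiejar, urllib.request

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# First request: any Set-Cookie headers in the response are stored in cj.
opener.open("http://example.com/")

# Second request through the same opener sends those cookies back automatically.
r = opener.open("http://example.com/")
print([c.name for c in cj])  # names of the cookies collected so far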
I'm relatively new to Python so excuse any errors or misconceptions I may have. I've done hours and hours of research and have hit a stopping point.
I'm using the Requests library to pull data from a website that requires a login. I was initially successful logging in through a session.post(payload)/session.get and got a [200] response. Once I tried to view the JSON data that sits behind the login, though, I hit a [403] response. Long story short, I can make it work by logging in through a browser, inspecting the web elements to find the current session cookie, and then defining the headers in Requests so that session.get passes along that exact cookie.
My question is: is it possible to set/generate/find this cookie through Python after logging in? After logging in and out a few times, I can see that some components of the cookie remain the same but others do not. The website I'm using is Garmin Connect.
Any and all help is appreciated.
If your issue is about logging in, then you can use a session object. It stores the cookies it receives and sends them back on subsequent requests, so it generally handles the cookies for you. Here is an example:
s = requests.Session()
# all cookies received will be stored in the session object
s.post('http://www...',data=payload)
s.get('http://www...')
Furthermore, with the requests library, you can get a cookie from a response, like this:
url = 'http://example.com/some/cookie/setting/url'
r = requests.get(url)
r.cookies
But you can also send cookies back to the server on subsequent requests, like this:
url = 'http://httpbin.org/cookies'
cookies = dict(cookies_are='working')
r = requests.get(url, cookies=cookies)
I hope this helps!
Reference: How to use cookies in Python Requests
I'm trying to scrape data from Mexico's central bank website but have hit a wall. In terms of actions, I need to first access a link within an initial URL. Once that link has been accessed, I need to select two dropdown values and then activate a submit button. If all goes well, I will be taken to a new URL where a set of links to PDFs is available.
The original url is:
"http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html"
The nested URL (the one with the drop-downs) is:
"http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX"
The inputs (arbitrary) are, say: '07/03/2019' and '14/03/2019'.
Using BeautifulSoup and requests I feel like I got as far as filling in the drop-down values, but I failed to click the button and reach the final URL with the list of links.
My code follows below:
from bs4 import BeautifulSoup
import requests
pagem=requests.get("http://www.banxico.org.mx/mercados/valores-gubernamentales-secto.html")
soupm = BeautifulSoup(pagem.content,"lxml")
lst=soupm.find_all('a', href=True)
url=lst[-1]['href']
page = requests.get(url)
soup = BeautifulSoup(page.content,"lxml")
xin= soup.find("select",{"id":"_id0:selectOneFechaIni"})
xfn= soup.find("select",{"id":"_id0:selectOneFechaFin"})
ino=list(xin.stripped_strings)
fino=list(xfn.stripped_strings)
headers = {'Referer': url}
data = {'_id0:selectOneFechaIni':'07/03/2019', '_id0:selectOneFechaFin':'14/03/2019',"_id0:accion":"_id0:accion"}
respo=requests.post(url,data,headers=headers)
print(respo.url)
In the code, respo.url is equal to url, i.e. the code fails. Can anybody please help me identify where the problem is? I'm a newbie to scraping, so the issue might be obvious; apologies in advance for that. I'd appreciate any help. Thanks!
Last time I checked, you cannot submit a form via clicking buttons with BeautifulSoup and Python. There are typically two approaches I often see:
Reverse engineer the form
If the form makes AJAX calls (e.g. makes a request behind the scenes, common for SPAs written in React or Angular), then the best approach is to use the network requests tab in Chrome or another browser to understand what the endpoint is and what the payload is. Once you have those answers, you can make a POST request with the requests library to that endpoint with data=your_payload_dictionary (e.g. manually do what the form is doing behind the scenes). Read this post for a more elaborate tutorial.
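To make that concrete, here is a minimal sketch; the endpoint URL, payload keys and header values below are placeholders you would replace with whatever the network tab shows for your form:
import requests

# Hypothetical endpoint and field names; copy the real ones from the
# browser's network requests tab after submitting the form once manually.
endpoint = "http://www.example.com/path/seen/in/network/tab"
your_payload_dictionary = {
    "startDate": "07/03/2019",
    "endDate": "14/03/2019",
}
headers = {
    "User-Agent": "Mozilla/5.0",
    "Referer": "http://www.example.com/form-page",
}

response = requests.post(endpoint, data=your_payload_dictionary, headers=headers)
print(response.status_code)
print(response.text[:500])  # peek at the start of the response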
Use a headless browser
If the website is written in something like ASP.NET or a similar MVC framework, then the best approach is to use a headless browser to fill out a form and click submit. A popular framework for this is Selenium. This simulates a normal browser. Read this post for a more elaborate tutorial.
Judging by a cursory look at the page you're working on, I recommend approach #2.
The page you have to scrape is:
http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces
Add the date to consult and the JSESSIONID from the cookies to the payload, and put Referer, User-Agent and all the usual good stuff in the request headers.
Example:
import requests
import pandas as pd
cl = requests.session()
url = "http://www.banxico.org.mx/valores/PresentaDetalleSectorizacionGubHist.faces"
payload = {
"JSESSIONID": "cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000",
"fechaAConsultar": "21/03/2019"
}
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
"Content-Type": "application/x-www-form-urlencoded",
"Referer": "http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=cWQD8qxoNJy_fecAaN2k8N0IQ6bkQ7f3AtzPx4bWL6wcAmO0T809!-1120047000"
}
response = cl.post(url, data=payload, headers=headers)
tables = pd.read_html(response.text)
When just clicking through the pages it looks like there's some sort of cookie/session stuff going on that might be difficult to take into account when using requests.
(Example: http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces;jsessionid=8AkD5D0IDxiiwQzX6KqkB2WIYRjIQb2TIERO1lbP35ClUgzmBNkc!-1120047000)
It might be easier to code this up using Selenium, since that will automate the browser (and take care of all the headers and whatnot). You'll still have access to the HTML to scrape what you need, and you can probably reuse a lot of what you're already doing in Selenium.
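A minimal sketch of what that might look like with Selenium, reusing the field names from your requests attempt (_id0:selectOneFechaIni, _id0:selectOneFechaFin, _id0:accion); the assumption that the submit button can be located by the name "_id0:accion" is worth verifying in the page source:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select

driver = webdriver.Chrome()  # or webdriver.Firefox(), depending on what you have installed
driver.get("http://www.banxico.org.mx/valores/LeePeriodoSectorizacionValores.faces?BMXC_claseIns=GUB&BMXC_lang=es_MX")

# Pick the two dates from the drop-downs (values taken from the question).
Select(driver.find_element(By.ID, "_id0:selectOneFechaIni")).select_by_visible_text("07/03/2019")
Select(driver.find_element(By.ID, "_id0:selectOneFechaFin")).select_by_visible_text("14/03/2019")

# Assumed: the submit button is addressable by the name "_id0:accion".
driver.find_element(By.NAME, "_id0:accion").click()

# The browser follows the navigation; grab the resulting page and collect the links.
html = driver.page_source
links = [a.get_attribute("href") for a in driver.find_elements(By.TAG_NAME, "a")]
driver.quit()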
I'm currently writing a program which will help users to determine optimal times to make a post on tumblr. As with Twitter, most followers have so many subscriptions that there is no way they can keep up, meaning it's best to know when one's own specific following is (mostly) online. On tumblr this can be determined in two ways -- first whether they have recently shared any content which was recently posted, and secondly whether they have recently added to their liked-posts list.
Frustratingly, even when set to 'public', the liked-posts stream of an arbitrary user (other than self) is only available to logged-in entities. As far as I know, that means I've either got to upload a login-cookie to the application every so often, or get this post-request working.
I've looked at a number of successful outbound requests via Opera's inspector but I must still be missing something, or perhaps requests is doing something that the server is rejecting no matter what I do.
The essence of the problem is below. This is currently written in Python 2.7
and uses Python requests and BeautifulSoup. To run it yourself, update the e and p pair at the top of get_login_response() to a real set of values.
import requests
from bs4 import BeautifulSoup
class Login:
    def __init__(self):
        self.session = requests.session()

    def get_hidden_fields(self):
        """ -> string. tumblr dynamically generates a key for its login forms
        This should extract that key from the form so that the POST-data to
        login will be accepted.
        """
        pageRequest = requests.Request("GET","https://www.tumblr.com/login")
        received = self.session.send( pageRequest.prepare() )
        html = BeautifulSoup(received.content)
        hiddenFieldDict = {}
        hiddenFields = html.find_all("input",type="hidden")
        for x in hiddenFields: hiddenFieldDict[x["name"]]=x["value"]
        return hiddenFieldDict

    def get_login_response(self):
        e = u"dead#live.com"
        p = u"password"
        endpoint = u"https://tumblr.com/login"
        payload = { u"user[email]": e,
                    u"user[password]": p,
                    u"user[age]":u"",
                    u"tumblelog[name]": u"",
                    u"host": u"www.tumblr.com",
                    u"Connection:":u"keep-alive",
                    u"Context":u"login",
                    u"recaptcha_response_field":u""
                    }
        payload.update( self.get_hidden_fields() )
        ## headers = {"Content-Type":"multipart/form-data"}
        headers = {u"Content-Type":u"application/x-www-form-urlencoded",
                   u"Connection:":u"keep-alive",
                   u"Origin":u"https://tumblr.com",
                   u"Referer": u"https://www.tumblr.com/login",
                   u"User-Agent":u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
                   u"Accept":u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
                   u"Accept-Encoding":u"gzip,deflate,sdch",
                   u"Accept-Language":u"en-US,en;q=0.8",
                   u"Cache-Control":u"max-age=0"
                   #"Content-Length":VALUE is still needed
                   }
        # this cookie is stale but it seems we get these for free anyways,
        # so I'm not sure whether it's actually needed. It's mostly
        # google analytics info.
        sendCookie = {"tmgioct":"52c720e28536530580783210",
                      "__qca":"P0-1402443420-1388781796773",
                      "pfs":"POIPdNt2p1qmlMGRbZH5JXo5k",
                      "last_toast":"1388783309",
                      "capture":"GDTLiEN5hEbMxPzys1ye1Gf4MVM",
                      "logged_in":"0",
                      "_ga":"GA1.2.2064992906.1388781797",
                      "devicePixelRatio":"1",
                      "documentWidth":"1280",
                      "anon_id":"VNHOJWQXGTQXHNCFKYJQUMUIVQBRISPR",
                      "__utma":"189990958.2064992906.1388781797.1388781797.1388781797.1",
                      "__utmb":"189990958.28.10.1388781797",
                      "__utmc":"189990958",
                      "__utmz":"189990958.1388781797.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)"}
        loginRequest = requests.Request("POST",
                                        endpoint,
                                        headers,
                                        data=payload,
                                        cookies=sendCookie # needed?
                                        ## ,auth=(e,p) # may not be needed
                                        )
        contentLength = len(loginRequest.prepare().body)
        loginRequest.data.update({u"Content-Length":unicode(contentLength)})
        return self.session.send( loginRequest.prepare() )
l = Login()
res = l.get_login_response()
print "All cookies: ({})".format(len(l.session.cookies))
print l.session.cookies # has a single generic cookie from the initial GET query
print "Celebrate if non-empty:"
print res.cookies # this should theoretically contain the login cookie
Output on my end:
All cookies: (1)
<<class 'requests.cookies.RequestsCookieJar'>[<Cookie tmgioct=52c773ed65cfa30622446430 for www.tumblr.com/>]>
Celebrate if non-empty:
<<class 'requests.cookies.RequestsCookieJar'>[]>
Bonus points if my code is insecure and you have pointers for me on that as well. I chose the requests module for its simplicity, but if it lacks the features I need and my goal is possible using httplib2 or something else, I am willing to switch.
There are a number of things you're not doing that you need to be, and quite a few things you are doing that you don't need to.
Firstly, go back and examine the POST fields being sent on your login request. When I do this in Chrome, I see the following:
user[email]:<redacted>
user[password]:<redacted>
tumblelog[name]:
user[age]:
recaptcha_public_key:6Lf4osISAAAAAJHn-CxSkM9YFNbirusAOEmxqMlZ
recaptcha_response_field:
context:other
version:STANDARD
follow:
http_referer:http://www.tumblr.com/logout
form_key:!1231388831237|jS7l2SHeUMogRjxRiCbaJNVduXU
seen_suggestion:0
used_suggestion:0
Your Requests-based POST is missing a few of these fields, specifically recaptcha_public_key, version, follow, http_referer, form_key, seen_suggestion and used_suggestion.
These fields are not optional: they will need to be sent on this POST. Some of these can safely be used generically, but the safest way to get these is to get the data for the login page itself, and use BeautifulSoup to pull the values out of the HTML. I'm going to assume you've got the skillset to do that (e.g. you know how to find form inputs in HTML and parse them to get their default values).
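For reference, a small sketch of that scraping step, generalizing the get_hidden_fields method from the question to all input fields on the login form (the form selector is an assumption; check the actual page markup):
import requests
from bs4 import BeautifulSoup

session = requests.Session()
login_page = session.get("https://www.tumblr.com/login")
soup = BeautifulSoup(login_page.content, "html.parser")

# Collect the default value of every named input on the login form, not just
# the hidden ones, so fields like form_key, version and seen_suggestion come
# along for free. (Assumption: the login form is the first <form> on the page.)
form = soup.find("form")
form_defaults = {inp.get("name"): inp.get("value", "")
                 for inp in form.find_all("input") if inp.get("name")}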
A good habit to get in here is to start using a tool like Wireshark or tcpdump to examine your requests HTTP traffic, and compare it to what you get from Chrome/Opera. This will allow you to see what is and isn't being sent, and how the two requests differ.
Secondly, once you start hitting the login page you won't need to send cookies on your POST, so you can stop doing that. More generally, when using a requests Session object, you shouldn't inject any additional cookies yourself: just emulate the flow of HTTP requests from an actual browser and your cookie state will be fine.
Thirdly, you're massively over-specifying your headers dictionary. Most of the fields you're providing will be automatically populated by Requests. Now, given that you're trying to emulate a browser (Opera by the looks of things), you will want to override a few of them, but most can be left alone. You should be using this header dictionary:
{
u"Origin":u"https://tumblr.com",
u"Referer": u"https://www.tumblr.com/login",
u"User-Agent":u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
u"Accept":u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
u"Accept-Language":u"en-US,en;q=0.8",
}
Below is a list of the fields I removed from your header dictionary and why I removed them:
Content-Type: When you provide a dictionary to the data argument in Requests, we set the Content-Type to application/x-www-form-urlencoded for you. There's no need to do it yourself.
Connection: Requests manages HTTP connection pooling and keep-alives itself: don't get involved in the process, it'll just go wrong.
Accept-Encoding: Again, please let Requests set this unless you're actually prepared to deal with decoding the content. Requests only knows how to do gzip and deflate: if you send sdch and actually get it back, you'll have to decode it yourself. Best not to advertise you support it.
Cache-Control: POST requests cannot be cached, so this is irrelevant.
Fourth, and I want to be very clear here, do not calculate Content-Length yourself. Requests will do it for you and will get it right. If you send that header yourself, all kinds of weird bugs can come up that the Requests core dev team have to chase. There is never a good reason to set that header yourself. With this in mind, you can stop using PreparedRequest objects and just go back to using session.post().
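Putting those points together, a rough sketch of the simplified flow under the assumptions above: the trimmed header dictionary, no manual cookies, no Content-Length, and a plain session.post() instead of a PreparedRequest. The credentials are placeholders, and the remaining form fields come from the scrape described earlier:
import requests

session = requests.Session()

headers = {
    u"Origin": u"https://tumblr.com",
    u"Referer": u"https://www.tumblr.com/login",
    u"User-Agent": u"Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.63 Safari/537.36 OPR/18.0.1284.68",
    u"Accept": u"text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8",
    u"Accept-Language": u"en-US,en;q=0.8",
}

# Placeholder credentials; in practice you would also merge in the default
# values scraped from the login form (form_key, version, follow, etc.).
payload = {
    u"user[email]": u"you@example.com",
    u"user[password]": u"your-password",
}

# Requests sets Content-Type and Content-Length itself when given a dict,
# and the Session keeps whatever cookies the server sends back.
response = session.post(u"https://www.tumblr.com/login", data=payload, headers=headers)
print(session.cookies)  # should now contain the login cookies if the POST succeeded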
I am looking for a way to view the request (not response) headers, specifically which browser mechanize claims to be. Also, how would I go about manipulating them, e.g. setting another browser?
Example:
import mechanize
browser = mechanize.Browser()
# Now I want to make a request to eg example.com with custom headers using browser
The purpose is of course to test a website and see whether or not it shows different pages depending on the reported browser.
It has to be the mechanize browser as the rest of the code depends on it (but is left out as it's irrelevant.)
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
You've got an answer on how to change the headers, but if you want to see the exact headers that are being used, try using a proxy that displays the traffic, e.g. Fiddler2 on Windows, or see this question for some Linux alternatives.
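Alternatively, mechanize itself can log the request headers it sends; a small sketch, assuming the set_debug_http switch behaves as described in the mechanize documentation:
import sys
import logging
import mechanize

# Route mechanize's debug output to stdout so the raw headers are visible.
logger = logging.getLogger("mechanize")
logger.addHandler(logging.StreamHandler(sys.stdout))
logger.setLevel(logging.INFO)

browser = mechanize.Browser()
browser.set_debug_http(True)  # print request and response headers
browser.addheaders = [('User-Agent', 'Mozilla/5.0 blahblah')]
browser.open("http://example.com/")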
You can modify the Referer too...
br.addheaders = [('Referer', 'http://google.com')]