XPath on an authenticated page in Python

I am extracting content from a page using the code below, but now I want to use it on a page that sits behind authentication. Is there any way to do this within Python?
Here is the sample code I am using:
from lxml import html
import requests

page = requests.get('http://www.thesiteurl.com/')
tree = html.fromstring(page.text)
# Attributes are selected with @ in XPath, e.g. @id and @src
logo = tree.xpath('//*[@id="wraper"]/div[3]/header/div[1]/div[2]/div[1]/a/img/@src')
print(logo)

I assume you mean you want to fetch an authenticated page using requests (since you can do whatever you want with the HTML after you fetch it)?
If so, it depends on how the page authenticates. The requests documentation discusses the various ways of doing so here: link. The simplest scheme (username and password) is supported with fairly painless syntax:
>>> requests.get('https://api.github.com/user', auth=('user', 'pass'))
<Response [200]>
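Other schemes are similar; here is a minimal sketch, assuming the server uses HTTP digest auth instead (the URL and credentials are placeholders):
import requests
from requests.auth import HTTPDigestAuth

# Digest auth works the same way as basic auth, just with an explicit auth class
r = requests.get('https://example.com/protected', auth=HTTPDigestAuth('user', 'pass'))
print(r.status_code)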

Related

How to Login and Scrape Websites with Python?

I understand there are similar questions out there; however, I couldn't make this code work. Does anyone know how to log in and scrape the data from this website?
from bs4 import BeautifulSoup
import requests

# Start the session
session = requests.Session()

# Create the payload
payload = {
    'login': <USERNAME>,
    'password': <PASSWORD>
}

# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)

# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost every site; the best way to work it out is to use your browser's request inspector (in Firefox, the network monitor in the developer tools) and watch how the site behaves when you try to log in.
For your website, clicking the login button sends a POST request to https://www.beeradvocate.com/community/login/login; with a little trial and error you should be able to replicate it, as in the sketch below.
Make sure you match the content type and the request headers (specifically the cookies, in case you need auth tokens).
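A minimal sketch of that replication, assuming the form fields are still named login and password (verify the exact field names and any hidden tokens in the inspector):
import requests

session = requests.Session()
payload = {'login': 'your_username', 'password': 'your_password'}  # assumed field names
headers = {'User-Agent': 'Mozilla/5.0'}

# The action URL observed in the network inspector
resp = session.post('https://www.beeradvocate.com/community/login/login',
                    data=payload, headers=headers)
print(resp.status_code)
# The session now carries any auth cookies for subsequent requests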

Python - Get cookies of a website saved in a browser (Chrome/Firefox)

I can see the cookies set in the browser manually.
How can I fetch a cookie from a Python script?
import requests

res = requests.get("https://stackoverflow.com/questions/50404771/python-get-cookiesof-a-website-saved-in-a-browser-chrome-firefox")
res.cookies  # a RequestsCookieJar holding the cookies this response set
print(res.cookies.keys())
print(res.cookies["prov"])
I hope I read your question right.
You may really be asking "how do I read cookies already stored in my browser?", which I don't think you can do directly. But Selenium will give you access to a fresh browser session from which you can collect cookies; a sketch follows.
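A minimal sketch of that approach, assuming chromedriver is installed; the URL is a placeholder and the actual login steps are left to you:
from selenium import webdriver
import requests

driver = webdriver.Chrome()
driver.get('https://example.com/login')
# ... perform the login inside the browser window here ...

# Copy the browser session's cookies into a requests session
session = requests.Session()
for c in driver.get_cookies():  # a list of dicts with name/value/domain keys
    session.cookies.set(c['name'], c['value'], domain=c['domain'])
driver.quit()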
UPDATE
Thanks to Sraw for the pointer to browsercookie. I have tried it now, but it would not transfer my login to the requests API. So maybe this is not possible on modern sites, or the OP can try these tools, since the question is clearer in their mind than in ours.
import re

import browsercookie
import requests

url = "https://stackoverflow.com/questions/50404771/python-get-cookiesof-a-website-saved-in-a-browser-chrome-firefox"
res = requests.get(url)

# Retry with the cookies my local Chrome profile has stored
cj = browsercookie.chrome()
res2 = requests.get(url, cookies=cj)

get_title = lambda html: re.findall('<title>(.*?)</title>', html, flags=re.DOTALL)[0].strip()
get_me = lambda html: re.findall('John', html, flags=re.DOTALL)

# At this point I had deleted my answer, so this got nothing;
# now my answer is reinstated it returns matches, but not in place of the login button.
print(len(get_me(res2.text)))

Bypassing intrusive cookie statement with requests library

I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the page you requested.
Note: you can find the above by analyzing the JavaScript that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult to follow. If you run into the same type of problem again, look at what cookies the JavaScript executed by the button's event handler sets.
I found this SO question, which asks how to send cookies in a POST with requests. The accepted answer states that recent builds of requests will construct CookieJars for you from simple dictionaries. Below is the proof-of-concept code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)

Login to JSP website using Requests

I have the following script:
import requests
from http.cookiejar import CookieJar

jar = CookieJar()
login_url = 'http://www.whispernumber.com/signIn.jsp?source=calendar.jsp'
acc_pwd = {'USERNAME': 'myusername',
           'PASSWORD': 'mypassword'}
r = requests.get(login_url, cookies=jar)
r = requests.post(login_url, cookies=jar, data=acc_pwd)
page = requests.get('http://www.whispernumber.com/calendar.jsp?day=20150129', cookies=jar)
print(page.text)
But print(page.text) shows that the site is trying to forward me back to the login page:
<script>location.replace('signIn.jsp?source=calendar.jsp');</script>
I have a feeling this is because of the JSP, and I'm not sure how to log in to a JSP page. Thanks for the help!
Firstly, you're posting to the wrong page. If you view the HTML at your link you'll see the login form is as follows:
<form action="ValidatePassword.jsp" method="post">
Assuming you authenticate correctly, you will probably get a cookie back that you can use for subsequent page requests (you seem to be thinking along the right lines); see the sketch below.
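A minimal sketch of that flow, assuming the field names USERNAME and PASSWORD from the question are correct and that the form action resolves to /ValidatePassword.jsp:
import requests

session = requests.Session()  # a Session persists cookies across requests

# Visit the sign-in page first so any initial session cookie gets set
session.get('http://www.whispernumber.com/signIn.jsp?source=calendar.jsp')

# Post the credentials to the form's actual action URL
acc_pwd = {'USERNAME': 'myusername', 'PASSWORD': 'mypassword'}
session.post('http://www.whispernumber.com/ValidatePassword.jsp', data=acc_pwd)

# The session should now carry the authenticated cookie
page = session.get('http://www.whispernumber.com/calendar.jsp?day=20150129')
print(page.text)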
Requests isn't a web browser; it is an HTTP client that simply grabs the raw text of the page and does not run any JavaScript. To drive a real login flow programmatically, you will want something like Selenium or another headless browser.

Parsing webpage with beautifulsoup to get dynamic content

I am trying to parse the following page for the list of similar songs:
http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936
The list of similar songs is not present in the page source, but it does appear when I use 'Inspect Element' in the browser.
How do I get to it?
Current code:
from urllib.request import Request, urlopen
from bs4 import BeautifulSoup

url = 'http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936'
request = Request(url)
lyricsPage = urlopen(request).read()
soup = BeautifulSoup(lyricsPage, 'html.parser')
The code to generate the links is:
for p in soup.find_all('p'):
    s = p.find('a', {"class": 'title'}).get('href')
Which methods are available to do this?
The list is probably filled in by AJAX calls, so it will not be in the page source.
You need to use "monitor network" in your browser's developer tools and look for the requests you are interested in.
For example, here is a randomly picked request URL from this page:
http://ws.audioscrobbler.com/2.0/?api_key=73581584905631c5fc15720f03b0b9c8&format=json&callback=jQuery1703329798618797213_1380004055342&method=track.getSimilar&limit=10&artist=roxy%20music&track=while%20my%20heart%20is%20still%20beating&_=1380004055943
To see the response, enter the above URL in the browser and look at what comes back.
So you need to simulate that request in Python, and once you have the response, parse it for the details you are interested in; a sketch follows.
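A minimal sketch of that simulation, dropping the jQuery callback parameter so the endpoint returns plain JSON; the response shape used below is an assumption based on the usual track.getSimilar output, so check the real response first:
import requests

# Parameters lifted from the captured request URL (minus the JSONP callback)
params = {
    'api_key': '73581584905631c5fc15720f03b0b9c8',
    'format': 'json',
    'method': 'track.getSimilar',
    'limit': 10,
    'artist': 'roxy music',
    'track': 'while my heart is still beating',
}
resp = requests.get('http://ws.audioscrobbler.com/2.0/', params=params)
data = resp.json()

# Assumed shape: {'similartracks': {'track': [{'name': ..., 'artist': {'name': ...}}]}}
for track in data['similartracks']['track']:
    print(track['name'], '-', track['artist']['name'])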
