I am writing a program that should read certain data from a website and output only specific data (the contents of a table). However, I have run into a problem. I wrote a program that logs into the website, but from that site I have to navigate to a second page and then open the document containing the data. Unfortunately, I have no idea how to move to that page, open the document, and read the data out.
Does anyone have an idea how I could get there?
from bs4 import BeautifulSoup
import requests

User = ''
Pass = ''
LOGIN_URL = ''
LOGIN_API_URL = ''

def main():
    session_requests = requests.session()
    result = session_requests.get(LOGIN_URL)
    cookies = result.cookies
    soup = BeautifulSoup(result.content, "html.parser")
    auth_token = soup.find("input", {'name': 'logintoken'}).get('value')
    payload = {'username': User, 'password': Pass, 'logintoken': auth_token}
    result = session_requests.post(
        LOGIN_API_URL,
        data=payload,
        cookies=cookies
    )
    # Report successful login
    print("Login succeeded: ", result.ok)
    print("Status code:", result.status_code)
    print(result.text)
    # Get data
    # Close the session we actually used (requests.session().close()
    # would create and close a brand-new session instead)
    session_requests.close()
    print('Session closed')

# Entry point
if __name__ == '__main__':
    main()
You should read into Selenium with Python. Since there are no specific URL or login details (which you shouldn't post here anyway), it would be quite hard for any of us to create a working example, because we have nothing to work with.
Try Selenium from the link above, and if you have any questions or run into any issues from there, come back and ask that specific question.
BS4 and requests can be powerful, but Selenium emulates a web browser and lets you move through websites the way a "human" would. Start there, for example with a sketch like the one below.
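Here is a minimal Selenium sketch of the login-then-navigate flow. Every URL, locator, and credential in it is a placeholder, since the question doesn't include the real site:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Placeholder URLs -- adapt them to the actual site.
LOGIN_URL = 'https://example.com/login'
DATA_URL = 'https://example.com/reports/data'

driver = webdriver.Chrome()
try:
    # Fill in and submit the login form, like a human would
    driver.get(LOGIN_URL)
    driver.find_element(By.NAME, 'username').send_keys('your_username')
    driver.find_element(By.NAME, 'password').send_keys('your_password')
    driver.find_element(By.CSS_SELECTOR, 'button[type="submit"]').click()

    # Navigate to the page that holds the document with the table
    driver.get(DATA_URL)

    # Read the table rows (the selector is an assumption)
    for row in driver.find_elements(By.CSS_SELECTOR, 'table tr'):
        cells = [cell.text for cell in row.find_elements(By.TAG_NAME, 'td')]
        if cells:
            print(cells)
finally:
    driver.quit()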
I'm new to web scraping and have been trying, for fun, to scrape a boxing website.
My code below worked on the first attempt, but when I re-ran it, it no longer retrieved the link data.
I can still access the website from my browser, so I'm not sure what the error is!
I'd appreciate any pointers.
import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

os.system('cls')

heavy = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='

pages = set()

def get_links(page_url):
    print("running crawler...")
    global pages
    req = Request(heavy, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req)
    bs = BeautifulSoup(html.read(), 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/en/box-pro/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                new_page = link.attrs['href']
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links('')
print("crawling done.")
If you inspect html.read() you will find that the page returns a login form. It might be that a detection system picks up your bot and tries to prevent you from scraping (or at least make it harder).
As an engineer at WebScrapingAPI, I've tested your URL using our API and it passes each time (it returns the data, not the login page). That is because we've implemented a number of detection-evasion features, including an IP rotation system. By sending the request from another IP with a completely different browser fingerprint, the targeted website 'thinks' it's another person and passes on the information. If you want to test it yourself, here is the script you can use:
import requests

API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='

PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
}

response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
If you want to build your own scraper, I suggest you implement some of the techniques in this article. You might also want to actually create an account on your targeted website, log in using the credentials, collect the cookies, and pass them to your request.
In order to collect the cookies:
Navigate to the login screen
Open developer tools in your browser (Network tab)
Log in and check the login request:
(Note that my own attempt shows as failed, because I didn't use real credentials to log in.)
To pass the cookies to your request, simply add them as a header to your req. Example: req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie': 'myCookie=lovely'}). Also, try to use the same User-Agent as the original request (the one made when you logged in). It can be found in the same login request from which you picked up the cookies.
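Putting that together, a minimal sketch for the URL from the question (the cookie name and value are made up; copy the real Cookie header from your own logged-in request in the Network tab):

from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='

req = Request(url, headers={
    # Use the same User-Agent your browser sent when you logged in
    'User-Agent': 'Mozilla/5.0',
    # Placeholder cookie -- replace with the real Cookie header value
    'Cookie': 'PHPSESSID=<your_session_id>',
})
html = urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
print(soup.title)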
I am trying to write a script that logs into a website and then scrapes data from a specific page that can only be accessed after logging in. The data does not come through in the IDLE shell regardless of whether I am already logged in or not, which tells me the website must use some sort of verification key or ID that I am not seeing in the login page's code. I have reviewed the login page's code several times, but I cannot find anything else that I might be missing. I am not sure if I am allowed to post HTML data for other websites here, but here is the script I am writing.
Please excuse the commented section labeled excel scripts:
I have already tried using lxml and BeautifulSoup for navigation purposes, but it does not seem to have any effect. I have tried a similar script on other, simpler websites, and it seemed to work there for the most part.
import requests
from lxml import html

USERNAME = '<username>'
PASSWORD = '<password>'

LOGIN_URL = "https://www.tm3.com/homepage/login.jsf"
URL = "https://www.tm3.com/mmdrewrite/mmd/14902.faces"

def main():
    session_requests = requests.session()

    # Get login csrf token
    result = session_requests.get(LOGIN_URL)
    tree = html.fromstring(result.text)
    authenticity_token = list(set(tree.xpath(
        '//input[@name="javax.faces.ViewState"]/@value')))[0]

    # Create payload
    payload = {
        "username": USERNAME,
        "password": PASSWORD,
        "javax.faces.ViewState": authenticity_token
    }

    # Perform login
    result = session_requests.post(LOGIN_URL, data=payload,
                                   headers=dict(referer=LOGIN_URL))

    # Scrape url
    result = session_requests.get(URL, headers=dict(referer=URL))
    tree = html.fromstring(result.content)
    print('', result.content)

"""
#excel scripts
def excel():
    import xlwt
    book = xlwt.Workbook(encoding="utf-8")
    sheet1 = book.add_sheet("Sheet1")
    #for loop for putting data into different cells
    num = 0
    row = sheet1.row(num)
    row.write(num, test)
    print("EVEN:", test)
    print("ODD:", ODD)
    book.save("Testing.xls")
"""

if __name__ == '__main__':
    main()
I want the webpage to be printed in its entirety, but instead the script just prints out the login webpage.
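A possible next step here (a sketch, not a verified fix for this site): JSF login forms often carry several hidden fields besides javax.faces.ViewState, so it can help to post every hidden input the form contains. The credential field names below are assumptions; check them in your browser's developer tools:

import requests
from lxml import html

LOGIN_URL = "https://www.tm3.com/homepage/login.jsf"

session_requests = requests.session()
tree = html.fromstring(session_requests.get(LOGIN_URL).text)

# Seed the payload with every hidden input the form already carries,
# so no server-side token is silently dropped.
payload = {
    inp.get("name"): inp.get("value", "")
    for inp in tree.xpath('//input[@type="hidden"]')
    if inp.get("name")
}
payload["username"] = "<username>"  # assumed field name
payload["password"] = "<password>"  # assumed field name

result = session_requests.post(LOGIN_URL, data=payload,
                               headers=dict(referer=LOGIN_URL))
print(result.status_code)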
After some discussion of my problem in Unable to print links using beautifulsoup while automating through selenium,
I realized that the main problem is the URL, which the request is not able to extract. The URL of the page is actually https://society6.com/discover, but I am using selenium to log into my account, so the URL becomes https://society6.com/society?show=2
However, I can't use the second URL with requests since it shows an error. How do I scrape information from a URL like this?
You need to log in first!
To do that you can use the requests library together with bs4.BeautifulSoup.
Here is an implementation that I have used:

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://society6.com/"

def log_in_and_get_session():
    """
    Get the session object with login details
    :return: requests.Session
    """
    ss = requests.Session()
    ss.verify = False  # optional, for sites with certificate problems
    text = ss.get(f"{BASE_URL}login").text
    csrf_token = BeautifulSoup(text, "html.parser").input["value"]
    data = {"username": "your_username", "password": "your_password", "csrfmiddlewaretoken": csrf_token}
    results = ss.post(f"{BASE_URL}login", data=data)
    if results.ok:
        print("Login success", results.status_code)
        return ss
    else:
        print("Can't login", results.status_code)

Using the `post` method to log in...
Hope this helps you!
Edit
Added the beginning of the function.
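Usage could then look like this (continuing the snippet above; the post-login URL is taken from the question):

ss = log_in_and_get_session()
if ss is not None:
    response = ss.get(f"{BASE_URL}society?show=2")
    print(response.status_code)
    soup = BeautifulSoup(response.text, "html.parser")
    # parse whatever you need out of `soup` from here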
Here is the URL:
"https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994"
Login details:
username : life#tech69.com
pwd : shiva#123
While opening the page with the above credentials, we get info like:
Contact details
0770228XXXX
However, adding ?srn=true at the end of the URL will give the following info:
(https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true)
Contact details
07702287887
The code I've used is below:
import requests
from bs4 import BeautifulSoup
s = requests.session()
login_data = dict(email='life#tech69.com', password='shiva#123')
s.post('https://my.gumtree.com/login', data=login_data)
r = s.get('https://www.gumtree.com/p/sofas/dfs-couches.-two-3-seaters.-one-teal-and-one-green.-pink-storage-footrest.-less-than-2-years-old.-/1265932994?srn=true')
soup = BeautifulSoup(r.content, 'lxml')
y = soup.find('strong' , 'txt-large txt-emphasis form-row-label').text
print str(y)
However, the above Python code still gives the partial info:
0770228XXXX
How can I fetch the full info using Python?
That site is protected by reCAPTCHA, a technology that is specifically designed to prevent automated logins,
so the line s.post('https://my.gumtree.com/login', data=login_data)
comes back with a CAPTCHA challenge instead of a logged-in session.
So when you try to go to the other URL you are not actually logged in, and it will not reveal the number...
There may be ways to circumvent this, but I'm not sure of any offhand...
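You can confirm this for yourself with a quick check (a sketch; the marker string is an assumption about how the CAPTCHA widget shows up in the HTML):

import requests

s = requests.session()
login_data = dict(email='life#tech69.com', password='shiva#123')
resp = s.post('https://my.gumtree.com/login', data=login_data)

# If the response still contains a reCAPTCHA widget, the login was
# blocked before the credentials were even checked.
if 'recaptcha' in resp.text.lower():
    print('Login blocked by reCAPTCHA')
else:
    print('No CAPTCHA marker found -- inspect resp.text further')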
I am attempting to scrape a website using the following code
import re
import requests

def get_csrf(page):
    matchme = r'name="csrfToken" value="(.*)" /'
    csrf = re.search(matchme, str(page))
    csrf = csrf.group(1)
    return csrf

def login():
    login_url = 'https://www.edline.net/InterstitialLogin.page'
    with requests.Session() as s:
        login_page = s.get(login_url)
        csrf = get_csrf(login_page.text)
        username = 'USER'
        password = 'PASS'
        login = {'screenName': username,
                 'kclq': password,
                 'csrfToken': csrf,
                 'TCNK': 'authenticationEntryComponent',
                 'submitEvent': '1',
                 'enterClicked': 'true',
                 'ajaxSupported': 'yes'}
        page = s.post(login_url, data=login)
        r = s.get("https://www.edline.net/UserDocList.page?")
        print(r.text)

login()
I log into https://www.edline.net/InterstitialLogin.page successfully, but the problem comes when I try to do
r = s.get("https://www.edline.net/UserDocList.page?")
print(r.text)
It doesn't print the expected page; instead it throws an error. Upon further testing I discovered that it throws this error even if you try to go to the page directly from a browser. So when I investigated the page source, I found that the button linking to the page I'm trying to scrape is a JavaScript-driven link labeled
Private Reports
So essentially I am looking for a way to trigger that JavaScript code in Python in order to scrape the resulting page.
It is impossible to answer this question fully without more context than this single link.
However, the first thing you want to check, in the case of JavaScript-driven content generation, is the requests made by your web page when you click that link.
To do this, take a look at the network panel in your browser's developer console. Record the requests being made, looking especially for XHR requests. Then you can try to replicate them, e.g. with the requests library:
content = requests.get('xhr-url')
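Fleshed out a little, that might look like this (the endpoint URL and headers are placeholders; copy the real ones from the recorded XHR entry):

import requests

# Placeholder values -- copy the real URL and headers from the
# Network panel entry for the XHR request.
XHR_URL = 'https://www.edline.net/some/xhr/endpoint'
HEADERS = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',  # many XHR endpoints expect this
}

def fetch_private_reports(session):
    """Replay the recorded XHR call with an already-logged-in session."""
    response = session.get(XHR_URL, headers=HEADERS)
    response.raise_for_status()
    return response.text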