I am writing a script to download files from a website.
import requests
import bs4 as bs
import urllib.request
import re

with requests.session() as c:  # c is the requests session object
    link = "https://gpldl.com/wp-login.php"  # login link
    initial = c.get(link)  # passing link through .get()
    headers = {
        'User-agent': 'Mozilla/5.0'
    }
    login_data = {"log": "****", "pwd": "****", "redirect_to": "https://gpldl.com/my-gpldl-account/", "redirect_to_automatic": 1, "rememberme": "forever"}  # login data for logging in
    page_int = c.post(link, data=login_data, headers=headers)  # posting the login data to the login link
    prefinal_link = "https://gpldl.com"  # initializing a part of the link to be used later
    page = c.get("https://gpldl.com/repository/", headers=headers)  # passing the given URL through .get() to be used later
    good_data = bs.BeautifulSoup(page.content, "lxml")  # parsing the data from the previous statement into lxml form by BS4
    # loop for finding all required links
    for category in good_data.find_all("a", {"class": "dt-btn-m"}):
        inner_link = str(prefinal_link) + str(category.get("href"))
        my_var_2 = requests.get(inner_link)
        good_data_2 = bs.BeautifulSoup(my_var_2.content, "lxml")  # parsing each link with lxml
        for each in good_data_2.find_all("tr", {"class": "row-2"}):
            for down_link_pre in each.find_all("td", {"class": "column-4"}):  # downloading all files and getting their addresses to be entered into the .csv file
                for down_link in down_link_pre.find_all("a"):
                    link_var = down_link.get("href")
                    file_name = link_var.split('/')[-1]
                    urllib.request.urlretrieve(str(down_link), str(file_name))
                    my_var.write("\n")
Using my code, when I access the website to download the files, the login keeps failing. Can anyone help me find what's wrong with my code?
Edit: I think the problem is with maintaining the logged-in state. When I access the pages one at a time, I can reach the links that are only accessible when logged in. But when the script navigates from there, the bot seems to get logged out, so it cannot retrieve the download links or download the files.
Websites use cookies to check the login status of every request, to tell whether it comes from a logged-in user, and modern browsers (Chrome/Firefox etc.) manage your cookies automatically. requests.session() supports cookies and handles them by default, so in your code with requests.session() as c, c is like a miniature browser: a cookie is sent with every request made by c, and once you log in with c you can use c.get() to browse all of the login-only pages.
In your code, urllib.request.urlretrieve(str(down_link),str(file_name)) is used for downloading. It has no idea of the previous login state, which is why you cannot download those files.
Instead, keep using c, which holds the login state, to download all those files:
with open(str(file_name), 'wb') as download:  # 'wb' because response.content is bytes
    response = c.get(link_var)  # link_var is the href string extracted in your loop
    download.write(response.content)
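For larger files it may be worth streaming the download instead of holding the whole response in memory. A minimal sketch of the inner loop from your code, rewritten to use the logged-in session c (the variable names are the ones from your question):
for down_link in down_link_pre.find_all("a"):
    link_var = down_link.get("href")
    file_name = link_var.split('/')[-1]
    # Stream the file through the logged-in session so the cookies are sent.
    with c.get(link_var, headers=headers, stream=True) as response:
        with open(file_name, 'wb') as download:
            for chunk in response.iter_content(chunk_size=8192):
                download.write(chunk)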
Related
I'm new to webscraping and have been trying for fun to scrape a boxing website.
My code below was working on the first attempt, but when I tried to re-run it, it no longer retrieved the link data.
I can still access the website from my browser, so I'm not sure what the error is!
Appreciate any pointers.
import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
os.system('cls')
heavy = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
pages = set()
def get_links(page_url):
    print("running crawler...")
    global pages
    req = Request(heavy, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req)
    bs = BeautifulSoup(html.read(), 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/en/box-pro/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                new_page = link.attrs['href']
                print(new_page)
                pages.add(new_page)
                get_links(new_page)

get_links('')
print("crawling done.")
If you inspect html.read() you will find that the page displays a login form. It might be that a detection system has picked up your bot and is trying to prevent you from scraping (or at least make it harder).
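A quick way to confirm this, as a rough sketch (the 'login' marker is just an assumption; check the actual markup you get back):
from urllib.request import urlopen, Request

req = Request(heavy, headers={'User-Agent': 'Mozilla/5.0'})
body = urlopen(req).read().decode('utf-8', errors='replace')

# If the anti-bot system intercepted the request, the body contains the
# login form instead of the ratings table.
if 'login' in body.lower():
    print("Got the login page - the request was blocked.")
else:
    print("Got the ratings page.")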
As an engineer at WebScrapingAPI I've tested your URL using our API, and it passes each time (it returns the data, not the login page). That is because we've implemented a number of detection-evasion features, including an IP rotation system. By sending the request from another IP with a completely different browser fingerprint, the targeted website 'thinks' it's another person and hands over the information. If you want to test it yourself, here is the script you can use:
import requests
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
}
response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
If you want to build your own scraper, I suggest you implement some of the techniques in this article. You might also want to actually create an account on your targeted website, log in using the credentials, collect the cookies and pass them to your request.
In order to collect the cookies:
Navigate to the login screen
Open developer tools in your browser (Network tab)
Log in and inspect the login request in the Network tab
To pass the cookies to your request, simply add them as a header to your req. Example: req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'}). Also, try to use the same User-Agent as the original request (the one made when you logged in). It can be found in the same login request from which you picked up the cookies.
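For example, a minimal sketch with a placeholder cookie string (copy the real name/value pair and the User-Agent from the captured login request):
from urllib.request import urlopen, Request

# 'sessionCookieName=...' is a placeholder - replace it with the cookie string
# copied from the login request in the Network tab, and reuse the same User-Agent.
req = Request(heavy, headers={
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'sessionCookieName=valueCopiedFromTheBrowser',
})
html = urlopen(req).read()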
I'm trying to scrape data about my band's upcoming shows from our agent's web service (such as venue capacity, venue address, set length, set start time ...).
With Python 3.6 and Selenium I've successfully logged in to the site, scraped a bunch of data from the main page, and opened the deal sheet, which is a PDF-like ASPX page. From there I'm unable to scrape the deal sheet. I've successfully switched the Selenium driver to the deal sheet. But when I inspect that page, none of the content is there, just a list of JavaScript scripts.
I tried...
innerHTML = driver.execute_script("return document.body.innerHTML")
...but this yields the same list of scripts rather than the PDF content I can see in the browser.
I've tried the solution suggested here: Python scraping pdf from URL
But the HTML that solution returns is for the login page, not the deal sheet. My problem is different because the PDF is protected by a password.
You won't be able to read the PDF file using the Selenium Python API bindings; the solution would be:
Download the file from the web page using the requests library. Given that you need to be logged in, my expectation is that you will need to fetch the cookies from the browser session via the driver.get_cookies() command and add them to the request that downloads the PDF file
Once you download the file you will be able to read its content using, for instance, PyPDF2
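A rough sketch of the first step, assuming driver is the logged-in Selenium session and pdf_url is the address of the deal sheet (both placeholders here):
import requests

# Copy the cookies from the logged-in Selenium session into a requests
# session so the download request carries the same login state.
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

response = session.get(pdf_url)
with open('deal_sheet.pdf', 'wb') as f:
    f.write(response.content)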
This 3-part solution works for me:
Part 1 (Get the URL for the password protected PDF)
# with selenium
driver.find_element_by_xpath('xpath To The PDF Link').click()
# wait for the new window to load
sleep(6)
# switch to the new window that just popped up
driver.switch_to.window(driver.window_handles[1])
# get the URL to the PDF
plugin = driver.find_element_by_css_selector("#plugin")
url = plugin.get_attribute("src")
The element with the url might be different on your page. Michael Kennedy also suggested #embed and #content.
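If #plugin is not present on your page, one way to probe the alternatives (a sketch only; the selectors come from the suggestions above):
# Try each suggested selector until one exposes a src attribute.
url = None
for selector in ("#plugin", "#embed", "#content"):
    elements = driver.find_elements_by_css_selector(selector)
    if elements and elements[0].get_attribute("src"):
        url = elements[0].get_attribute("src")
        break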
Part 2 (Create a persistent session with Python requests, as described here: How to "log in" to a website using Python's Requests module?, and download the PDF.)
# Fill in your details here to be posted to the login form.
# Your parameter names are probably different. You can find them by inspecting the login page.
payload = {
    'logOnCode': username,
    'passWord': password
}
# Use 'with' to ensure the session context is closed after use.
with requests.Session() as session:
    session.post(logonURL, data=payload)
    # An authorized request.
    f = session.get(url)  # this is the protected url
    open('c:/yourFilename.pdf', 'wb').write(f.content)
Part 3 (Scrape the PDF with PyPDF2 as suggested by Dmitri T)
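A minimal sketch of this last step, reading the file saved in Part 2 (the API names are from PyPDF2 3.x; older releases use PdfFileReader and extractText):
from PyPDF2 import PdfReader

# Read the PDF downloaded in Part 2 and print the text of each page.
reader = PdfReader('c:/yourFilename.pdf')
for page in reader.pages:
    print(page.extract_text())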
I'm looking at something which might be interesting to you as well.
I'm developing a feature in Python which should be able to authenticate (using user ID/password and/or other preferred authentication methods), connect to a specified website, navigate through the website, and download a file under a specific option.
Later I have to add schedules to the developed code and automate it.
Has anyone come across such a scenario and developed code for it in Python?
Please suggest any Python libraries that could help.
What I have achieved right now is:
I can download a file from a specific URL.
I know how to authenticate and download the file.
I'm able to pull the links from the specific website.
This is something we could achieve using Selenium, but I want to write it in plain Python.
After 5 days of research, I found what I wanted. Your urlLogin and urlAuth could be the same; it totally depends on what action is taken by the Login button or the form action. I used the Chrome inspect option to find out the actual GET or POST request used by the portal.
Here is the answer to my own question:
import requests
urlLogin = 'https://example.com/jsp/login.jsp'
urlAuth = 'https://example.com/CheckLoginServlet'
urlBd = 'https://example.com/jsp/batchdownload.jsp'
payload = {
    "username": "username",
    "password": "password"
}
# Session will be closed at the end of the with block
with requests.Session() as s:
    s.get(urlLogin)
    headers = s.cookies.get_dict()
    print(f"Session cookies {headers}")
    r1 = s.post(urlAuth, data=payload, headers=headers)
    print(f'MainFrame text:::: {r1.status_code}')  # 200
    r2 = s.post(urlBd, data=payload)
    print(f'MainFrame text:::: {r2.status_code}')  # 200
    print(f'MainFrame text:::: {r2.text}')  # page source
    # 3. Again cookies will be used through session to access batch download page
    r2 = s.post(config['access-url'])
    print(f'Batch Download status:::: {r2.status_code}')  # 200
    source_code = r2.text
    # print(f'Batch Download source:::: {source_code}')
I'm using the Python requests library for this, but I can't seem to be able to log in to this website.
The url is https://www.bet365affiliates.com/ui/pages/affiliates/, and I've been trying POST requests to https://www.bet365affiliates.com/Members/CMSitePages/SiteLogin.aspx?lng=1 with the data of "ctl00$MasterHeaderPlaceHolder$ctl00$passwordTextbox", "ctl00$MasterHeaderPlaceHolder$ctl00$userNameTextbox", etc., but I never seem to be able to get logged in.
Could someone more experienced check the page's source code and tell me what I am missing here?
The solution could be this. Note that you can do it without Selenium: first get the main affiliates page, and from the response data fetch all the required information (which I gather here by XPath). I just didn't have enough time to write it fully in requests.
To gather the information from the response data you could use an XML/HTML parsing library such as lxml; with the same XPath expressions you can easily find all the required values.
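As an illustration of that requests-only idea, a hedged sketch using lxml (the field names mirror the XPaths used in the Selenium version below; the page's actual markup may differ):
import requests
from lxml import html

session = requests.session()
page = session.get('https://www.bet365affiliates.com/ui/pages/affiliates/Affiliates.aspx')
tree = html.fromstring(page.content)

# Pull the hidden ASP.NET form fields straight out of the response
# instead of reading them through a browser.
viewstate = tree.xpath('//*[@id="__VIEWSTATE"]/@value')[0]
eventvalidation = tree.xpath('//*[@id="__EVENTVALIDATION"]/@value')[0]
previouspage = tree.xpath('//*[@id="__PREVIOUSPAGE"]/@value')[0]
session_id = tree.xpath('//*[@id="CMSessionId"]/@value')[0]
The Selenium-assisted version follows: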
import os

import requests
from selenium import webdriver

Password = 'YOURPASS'
Username = 'YOURUSERNAME'

browser = webdriver.Chrome(os.getcwd() + "/" + "Chromedriver.exe")
browser.get('https://www.bet365affiliates.com/ui/pages/affiliates/Affiliates.aspx')

# Read the hidden ASP.NET form fields and the session id from the rendered page.
VIEWSTATE = browser.find_element_by_xpath('//*[@id="__VIEWSTATE"]').get_attribute('value')
SESSIONID = browser.find_element_by_xpath('//*[@id="CMSessionId"]').get_attribute('value')
PREVPAG = browser.find_element_by_xpath('//*[@id="__PREVIOUSPAGE"]').get_attribute('value')
EVENTVALIDATION = browser.find_element_by_xpath('//*[@id="__EVENTVALIDATION"]').get_attribute('value')

# Copy the browser cookies into the requests session so the login POST carries them.
cookies = browser.get_cookies()
session = requests.session()
for cookie in cookies:
    print(cookie['name'])
    print(cookie['value'])
    session.cookies.set(cookie['name'], cookie['value'])

payload = {'ctl00_AjaxScriptManager_HiddenField': '',
           '__EVENTTARGET': 'ctl00$MasterHeaderPlaceHolder$ctl00$goButton',
           '__EVENTARGUMENT': '',
           '__VIEWSTATE': VIEWSTATE,
           '__PREVIOUSPAGE': PREVPAG,
           '__EVENTVALIDATION': EVENTVALIDATION,
           'txtPassword': Password,
           'txtUserName': Username,
           'CMSessionId': SESSIONID,
           'returnURL': '/ui/pages/affiliates/Affiliates.aspx',
           'ctl00$MasterHeaderPlaceHolder$ctl00$userNameTextbox': Username,
           'ctl00$MasterHeaderPlaceHolder$ctl00$passwordTextbox': Password,
           'ctl00$MasterHeaderPlaceHolder$ctl00$tempPasswordTextbox': 'Password'}

session.post('https://www.bet365affiliates.com/Members/CMSitePages/SiteLogin.aspx?lng=1', data=payload)
Did you inspect the HTTP request the browser uses to log you in?
You should replicate it.
FB
I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the requested page.
Note: You could find the above by analyzing the JavaScript code that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult. If you run into the same type of problem again, take a look at what kind of cookies the JavaScript code executed by the event handler sets.
I have found this SO question which asks how to send cookies in a post using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)
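If the cookie has to accompany several requests, a variation of the same POC sets it once on a Session object so it is sent automatically each time:
import requests

# Set the cookie once on the session; every subsequent request will send it.
session = requests.Session()
session.cookies.set('enwiki_session', '17ab96bd8ffbe8ca58a78657a918558')
r = session.post('http://wikipedia.org')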