webscraper no longer retrieving data - can still access website via browser - python

I'm new to webscraping and have been trying for fun to scrape a boxing website.
My code below worked on the first attempt, but when I tried to re-run it, it stopped retrieving the link data.
I can still access the website from my browser, so I'm not sure what the error is!
Appreciate any pointers.
import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
os.system('cls')
heavy = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
pages = set()
def get_links(page_url):
    print("running crawler...")
    global pages
    req = Request(heavy, headers={'User-Agent': 'Mozilla/5.0'})
    html = urlopen(req)
    bs = BeautifulSoup(html.read(), 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/en/box-pro/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                new_page = link.attrs['href']
                print(new_page)
                pages.add(new_page)
                get_links(new_page)
get_links('')
print("crawling done.")

If you inspect html.read(), you will find that the page now returns a login form. It might be that a detection system has picked up your bot and is trying to prevent (or at least make it harder for) you to scrape.
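For example, you can confirm what the response actually contains with a quick check like this (a sketch; the exact markup of the login page is an assumption, but the point is that the ratings links disappear when you are blocked):
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

url = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
bs = BeautifulSoup(urlopen(req).read(), 'html.parser')

# If the anti-bot layer kicked in, there is a login <form> and no boxer links.
print(bs.title)
print(bs.find('form'))
print(len(bs.find_all('a', href=re.compile('^(/en/box-pro/)'))))  # 0 when blocked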
As an engineer at WebScrapingAPI, I've tested your URL using our API and it passes each time (it returns the data, not the login page). That is because we've implemented a number of detection-evasion features, including an IP rotation system. By sending the request from another IP with a completely different browser fingerprint, the targeted website 'thinks' it's another person and hands over the information. If you want to test it yourself, here is the script you can use:
import requests
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
}
response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
If you want to build your own scraper, I suggest you implement some of the techniques in this article. You might also want to actually create an account on your targeted website, log in using the credentials, collect the cookies and pass them to your request.
In order to collect the cookies:
Navigate to the login screen
Open developer tools in your browser (Network tab)
Log in and check the login request.
To pass the cookies to your request, simply add them as a header to your req. Example: req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'}). Also, use the same User-Agent as the original request (the one sent when you logged in). It can be found in the same login request from which you picked up the cookies.
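Putting that together with your original request, it would look roughly like this (a sketch; the cookie string and User-Agent are placeholders that you copy from your own logged-in browser session):
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup

url = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='

# Both values are placeholders -- copy the real Cookie header and User-Agent
# string from the login request in your browser's Network tab.
cookie_string = '<COOKIE_HEADER_COPIED_FROM_BROWSER>'
user_agent = '<USER_AGENT_COPIED_FROM_BROWSER>'

req = Request(url, headers={'User-Agent': user_agent, 'Cookie': cookie_string})
bs = BeautifulSoup(urlopen(req).read(), 'html.parser')
print(bs.title)  # should now show the ratings page rather than the login form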

Related

div not showing up in html from url using requests library and bs4

I have a simple script where I want to scrape a menu from a url:
https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822
When I inspect the page using dev tools, I can see that the menu is contained in the section <div class="menu-area" id="section_1026228">.
So my script is fairly simple as follows:
import requests
from bs4 import BeautifulSoup
venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
soup = BeautifulSoup(response.text, 'html.parser')
menu = soup.find('div', {'class': 'menu-area'})
print(menu.text)
I have tried this on a locally saved copy of the page and it works. But when I run it against the live URL using the requests library, it does not work. It cannot find the div and throws this error:
print(menu.text)
AttributeError: 'NoneType' object has no attribute 'text'
which basically means it cannot find the div. Does anyone know why this is happening and how to fix it?
I just logged out from my browser and it showed me a different page. However, my script has no login part at all. I'm not even sure how that would work.
[It doesn't work with all sites, but it seems to be enough for this site so far.] You can log in with requests.Session.
# import requests
sess = requests.Session()
headers = {'user-agent': 'Mozilla/5.0'}
data = {'username': 'YOUR_EMAIL/USERNAME', 'password': 'YOUR_PASSWORD'}
loginResp = sess.post('https://untappd.com/login', headers=headers, data=data)
print(loginResp.status_code, loginResp.reason, 'from', loginResp.url) ## should print 200 OK...
response = sess.get(venue_url, headers = {'User-agent': 'Mozilla/5.0'})
## CAN CONTINUE AS BEFORE ##
I've edited my solution to one of your previous questions about this site to include cookies so that the site will treat you as logged in. For example:
# venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
gloryMenu = scrape_untappd_menu(venue_url, cookies=sess.cookies)
will collect the menu data for that venue while the site treats you as logged in.
Note: They have a captcha when logging in, so I was worried it would be too hard to automate; if it becomes an issue, you can [probably] still log in in your browser before going to the page and then paste the request from your network log into curlconverter to get the cookies as a dictionary. Of course, the process is then no longer fully automated, since you'll have to repeat this manual login every time the cookies expire (which could be as quickly as a few hours). If you wanted to automate the login at that point, you might have to use some kind of browser automation, such as Selenium.
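For example, once curlconverter has given you the cookies as a dictionary, using them looks roughly like this (the cookie name and value below are placeholders, not Untappd's real cookie names):
import requests
from bs4 import BeautifulSoup

# Placeholder -- replace with the dictionary curlconverter generated from
# your own logged-in browser session.
cookies = {'some_session_cookie': '<VALUE_FROM_BROWSER>'}

venue_url = 'https://untappd.com/v/glory-days-grill-of-ellicott-city/3329822'
response = requests.get(venue_url, headers={'user-agent': 'Mozilla/5.0'}, cookies=cookies)
menu = BeautifulSoup(response.text, 'html.parser').find('div', {'class': 'menu-area'})
print(menu.text if menu else 'menu-area not found -- cookies may have expired')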

How to Login and Scrape Websites with Python?

I understand there are similar questions out there; however, I couldn't make this code work. Does anyone know how to log in and scrape the data from this website?
from bs4 import BeautifulSoup
import requests
# Start the session
session = requests.Session()
# Create the payload
payload = {'login': '<USERNAME>',
           'password': '<PASSWORD>'}
# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)
# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost every site; the best way to work it out is to use your browser's request inspector (Firefox) and look at how the site behaves when you try to log in.
For your website, when you click the login button a POST request is sent to https://www.beeradvocate.com/community/login/login; with a little bit of trial and error you should be able to replicate it.
Make sure you match the content-type and request headers (specifically cookies, in case you need auth tokens).
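As a rough sketch of that (the field names mirror the payload from the question and the forum software may also expect hidden token fields, so verify both against the POST shown in the inspector):
from bs4 import BeautifulSoup
import requests

session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}

# Field names below mirror the question's payload; confirm them (and any
# hidden token fields) against the actual POST your browser sends.
payload = {'login': '<USERNAME>', 'password': '<PASSWORD>'}

# Post to the endpoint the login button actually targets.
resp = session.post('https://www.beeradvocate.com/community/login/login',
                    data=payload, headers=headers)
print(resp.status_code, resp.url)

# The session now carries the login cookies for subsequent requests.
page = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y',
                   headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
print(soup.find('div', class_='titleBar'))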

Python requests module error while logging into a WordPress site

I am writing a script to download files from a website.
import requests
import bs4 as bs
import urllib.request
import re
with requests.session() as c:  # c denotes the requests.session() object that keeps our state
    link = "https://gpldl.com/wp-login.php"  # login link
    initial = c.get(link)  # fetch the login page first
    headers = {
        'User-agent': 'Mozilla/5.0'
    }
    login_data = {"log": "****", "pwd": "****", "redirect_to": "https://gpldl.com/my-gpldl-account/",
                  "redirect_to_automatic": 1, "rememberme": "forever"}  # login data for logging in
    page_int = c.post(link, data=login_data, headers=headers)  # posting the login data to the login link
    prefinal_link = "https://gpldl.com"  # base part of the links, used later
    page = c.get("https://gpldl.com/repository/", headers=headers)  # page that lists the downloads
    good_data = bs.BeautifulSoup(page.content, "lxml")  # parsing the repository page with lxml via BS4
    # loop for finding all required links
    for category in good_data.find_all("a", {"class": "dt-btn-m"}):
        inner_link = str(prefinal_link) + str(category.get("href"))
        my_var_2 = requests.get(inner_link)
        good_data_2 = bs.BeautifulSoup(my_var_2.content, "lxml")  # parsing each link with lxml
        for each in good_data_2.find_all("tr", {"class": "row-2"}):
            for down_link_pre in each.find_all("td", {"class": "column-4"}):  # downloading all files and collecting their addresses to be entered into the .csv file
                for down_link in down_link_pre.find_all("a"):
                    link_var = down_link.get("href")
                    file_name = link_var.split('/')[-1]
                    urllib.request.urlretrieve(str(down_link), str(file_name))
                    my_var.write("\n")
Using my code, when I access the website to download the files, the login keeps failing. Can anyone help me find what's wrong with my code?
Edit: I think the problem is with maintaining the logged-in state: when I access one page at a time, I can reach links that are only available when logged in, but as the bot navigates further it seems to get logged out and can no longer retrieve the download links or download the files.
Websites use cookies to check login status on every request, to tell whether it comes from a logged-in user or not, and modern browsers (Chrome, Firefox, etc.) manage your cookies automatically. requests.session() supports cookies and handles them by default, so in your code with requests.session() as c, c is like a miniature browser: a cookie is sent with every request made by c, and once you log in with c you can use c.get() to browse all of those login-only pages.
In your code, however, urllib.request.urlretrieve(str(down_link), str(file_name)) is used for downloading. It has no idea of the previous login state, and that's why you're not able to download those files.
Instead, you should keep using c, which has the login state, to download all those files:
response = c.get(link_var)  # link_var holds the href; c carries the login cookies
with open(str(file_name), 'wb') as download:  # 'wb' because response.content is bytes
    download.write(response.content)

Bypassing intrusive cookie statement with requests library

I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.
Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the page you requested.
Note: You could find the above by analysing the JavaScript code that runs on the cookie consent page; it is a bit obfuscated, but it should not be difficult. If you run into the same type of problem again, take a look at what kind of cookies the JavaScript executed in the event handler sets.
I have found this SO question which asks how to send cookies in a post using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)

Python web scraping requests follow redirect

I'm trying to scrape a web site with the requests module.
Using Chrome and Inspect Elements, I go to the URL, fill in a form and click the continue button. Chrome's Inspect Elements (Network, Documents) shows what Chrome sent with the POST. It also shows multiple cookies. The site redirects to a URL that includes, among other things, a session ID.
To simulate this, I try using requests. I take the form data from inspect elements and reformat it to a dictionary. I use requests.session to include the cookies.
import requests
form_data = 'currentCalForm=dep&currentCodeForm=&tripType=oneWay&searchCategory=award&originAirport=JFK&flightParams.flightDateParams.travelMonth=5&flightParams.flightDateParams.travelDay=14&flightParams.flightDateParams.searchTime=040001&destinationAirport=LHR&returnDate.travelMonth=-1000&returnDate.travelDay=-1000&adultPassengerCount=2&adultPassengerCount=1&serviceclass=coach&searchTypeMode=matrix&awardDatesFlexible=true&originAlternateAirportDistance=0&destinationAlternateAirportDistance=0&discountCode=&flightSearch=award&dateChanged=false&fromSearchPage=true&advancedSearchOpened=false&numberOfFlightsToDisplay=10&searchCategory=&aairpassSearchType=false&moreOptionsIndicator=oneWay&seniorPassengerCount=0&youngAdultPassengerCount=0&childPassengerCount=0&infantPassengerCount=0&passengerCount=2'.split('&')
payload = {}
for item in form_data:
    key, value = item.split('=')
    if value:
        payload[key] = value

with requests.session() as s:
    r = s.post('https://www.aa.com/homePage.do', params = payload, allow_redirects=True)
    print r.headers
    print r.history
    print r.url
    print r.status_code
    with open('x.htm', 'wb') as f:
        f.write(r.text.encode('utf8'))
requests, however, does not appear to follow the redirect: r.history is empty and r.url appears to contain the data I sent rather than the URL the site redirected to. x.htm shows a web page, but it does not contain the info I expected.
From http://docs.python-requests.org/en/latest/user/quickstart/#redirection-and-history I expected r.url to contain the redirected url and r.history to contain an http response code.
What am I doing wrong?
OK, what you're doing seems to be wrong. I'm not sure how you decided to send a POST to https://www.aa.com/homePage.do, but that is a GET and doesn't take the params you send. When you click search, your browser sends this POST: https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do;jsessionid=XXXXXXXXXXXXXXXXXXX with these parameters:
currentCalForm=dep
currentCodeFrom=
tripType=roundTrip
originAirport=LAX
flightParams.flightDateParams.travelMonth=10
flightParams.flightDateParams.travelDay=24
flightParams.flightDateParams.searchTime=040001
destinationAirport=JFK
returnDate.travelMonth=10
returnDate.travelDay=31
returnDate.searchTime=400001
adultPassengerCount=1
adultPassengerCount=1
childPassengerCount=0
hotelRoomCount=1
serviceclass=coach
searchTypeMode=matrix
awardDatesFlexible=true
originAlternateAirportDistance=0
destinationAlternateAirportDistance=0
discountCode=
flightSearch=revenue
dateChanged=false
fromSearchPage=true
advancedSearchOpened=false
numberOfFlightsToDisplay=10
searchCategory=
aairpassSearchType=false
moreOptionsIndicator=
seniorPassengerCount=0
youngAdultPassengerCount=0
infantPassengerCount=0
passengerCount=1
This will then give you HTML back. Pretty much, you have to send all the requests the browser sends; it might be easier for you to do it with Selenium, or you can try to replicate the POST directly with requests, as sketched below.
I found this using HttpFox, which is probably similar to Chrome's Network tab.
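If you want to stay with requests rather than Selenium, a sketch of replicating that browser POST could look like the following (the field names come from the capture above but the set is incomplete, and the warm-up GET for cookies is an assumption; verify everything against your own network log):
import requests

search_url = 'https://www.americanairlines.co.uk/reservation/searchFlightsSubmit.do'

# A subset of the captured form fields; fill in or adjust the rest from your
# own network capture before relying on this.
payload = {
    'currentCalForm': 'dep',
    'tripType': 'roundTrip',
    'originAirport': 'LAX',
    'destinationAirport': 'JFK',
    'flightParams.flightDateParams.travelMonth': '10',
    'flightParams.flightDateParams.travelDay': '24',
    'flightParams.flightDateParams.searchTime': '040001',
    'returnDate.travelMonth': '10',
    'returnDate.travelDay': '31',
    'returnDate.searchTime': '400001',
    'adultPassengerCount': '1',
    'serviceclass': 'coach',
    'searchTypeMode': 'matrix',
    'flightSearch': 'revenue',
    'fromSearchPage': 'true',
}

with requests.Session() as s:
    # Warm-up GET so the session picks up the cookies (including the session id)
    # before sending the search POST -- this step is an assumption.
    s.get('https://www.aa.com/homePage.do')
    r = s.post(search_url, data=payload, allow_redirects=True)
    print(r.status_code, r.url)
    print(r.history)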
