Bypassing intrusive cookie statement with requests library - python

I'm trying to crawl a website using the requests library. However, the particular website I am trying to access (http://www.vi.nl/matchcenter/vandaag.shtml) has a very intrusive cookie statement.
I am trying to access the website as follows:
from bs4 import BeautifulSoup as soup
import requests
website = r"http://www.vi.nl/matchcenter/vandaag.shtml"
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"})
htmlsoup = soup(html.text, "html.parser")
This returns a web page that consists of just the cookie statement with a big button to accept. If you try accessing this page in a browser, you find that pressing the button redirects you to the requested page. How can I do this using requests?
I considered using mechanize.Browser but that seems a pretty roundabout way of doing it.

Try setting:
cookies = dict(BCPermissionLevel='PERSONAL')
html = requests.get(website, headers={"User-Agent": "Mozilla/5.0"}, cookies=cookies)
This will bypass the cookie consent page and land you straight on the requested page.
Note: You can find the above by analyzing the javascript code that runs on the cookie consent page. It is a bit obfuscated, but it should not be difficult. If you run into the same type of problem again, look at which cookies the javascript code sets when it handles the event (here, the click on the accept button).
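As a rough illustration of that kind of analysis, the sketch below scans script text for plain `document.cookie = "..."` assignments and lists the cookie names and values it finds. The helper name `find_cookie_assignments` and the sample script are made up for this example. Obfuscated code may build the cookie string in ways a simple regular expression will not catch, so treat it as a starting point only.

```python
import re

# Matches literal assignments such as: document.cookie = "name=value; path=/"
COOKIE_ASSIGNMENT = re.compile(r"""document\.cookie\s*=\s*(["'])(.*?)\1""")


def find_cookie_assignments(script_text):
    """Return (name, value) pairs for cookies set via literal strings."""
    found = []
    for match in COOKIE_ASSIGNMENT.finditer(script_text):
        # Only the first "name=value" segment is the cookie; the rest are attributes.
        first_pair = match.group(2).split(";", 1)[0]
        name, _, value = first_pair.partition("=")
        found.append((name.strip(), value.strip()))
    return found


# Hypothetical snippet of consent-page JavaScript, for demonstration only.
sample_js = 'function accept(){ document.cookie = "BCPermissionLevel=PERSONAL; path=/"; }'
print(find_cookie_assignments(sample_js))
```

Any pair it reports can then be tried in the `cookies` dictionary passed to `requests.get`.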

I have found this SO question which asks how to send cookies in a post using requests. The accepted answer states that the latest build of Requests will build CookieJars for you from simple dictionaries. Below is the POC code included in the original answer.
import requests
cookie = {'enwiki_session': '17ab96bd8ffbe8ca58a78657a918558'}
r = requests.post('http://wikipedia.org', cookies=cookie)

Related

webscraper no longer retrieving data - can still access website via browser

I'm new to webscraping and have been trying for fun to scrape a boxing website.
My code below was working on the first attempt, and when I tried to re-run it, it was no longer retrieving the link data any more.
I can still access the website from my browser, so not sure what the error is!
Appreciate any pointers.
import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
os.system('cls')
heavy = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
pages = set()
def get_links(page_url):
    print("running crawler...")
    global pages
    req = Request(heavy, headers = {'User-Agent':'Mozilla/5.0'})
    html = urlopen(req)
    bs = BeautifulSoup(html.read(), 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/en/box-pro/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                new_page = link.attrs['href']
                print(new_page)
                pages.add(new_page)
                get_links(new_page)
get_links('')
print("crawling done.")
If you inspect html.read() you will find that the page displays a login form. It might be that a detection system picks up your bot and tries to prevent (or at least make it harder for) you to scrape.
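One way to notice this situation from code is to check whether the response contains a password field before looking for links. This is a minimal sketch using only the standard library; the class and function names are invented for illustration, and the sample markup is not taken from the real site.

```python
from html.parser import HTMLParser


class PasswordFieldFinder(HTMLParser):
    """Sets found=True when an <input type="password"> is seen."""

    def __init__(self):
        super().__init__()
        self.found = False

    def handle_starttag(self, tag, attrs):
        input_type = (dict(attrs).get("type") or "").lower()
        if tag == "input" and input_type == "password":
            self.found = True


def looks_like_login_page(html_text):
    """Heuristic: a page with a password input is probably a login form."""
    finder = PasswordFieldFinder()
    finder.feed(html_text)
    return finder.found


# Made-up markup for demonstration.
login_html = '<form><input name="user"><input type="password" name="pw"></form>'
ratings_html = '<a href="/en/box-pro/1">Boxer</a>'
print(looks_like_login_page(login_html), looks_like_login_page(ratings_html))
```

In the crawler, the decoded body of the response could be passed to `looks_like_login_page` so the script stops early instead of silently finding no links.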
As an engineer at WebScrapingAPI I've tested your URL using our API and it passes each time (it returns the data, not the login page). That is because we've implemented a number of detection evasion features, including an IP rotation system. So by sending the request from another IP with a completely different browser fingerprint, the targeted website 'thinks' it's another person and passes on the information. If you want to test it yourself, here is the script you can use:
import requests
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
PARAMS = {
    "api_key": API_KEY,
    "url": TARGET_URL,
    "render_js": 1,
}
response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
If you want to build your own scraper, I suggest you implement some of the techniques in this article. You might also want to actually create an account on your targeted website, log in using the credentials, collect the cookies and pass them to your request.
In order to collect the cookies:
Navigate to the login screen
Open developer tools in your browser (Network tab)
Log in and check the login request:
(Note that I have a failed attempt, because I didn't use real credentials to log in)
To pass the cookies to your request, add them as a Cookie header on your req. Example: req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'}). Also, try to use the same User-Agent as the original request (the one made when you logged in). It can be found in the same login request from which you picked up the cookies.
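When there are several cookies, they all go into one Cookie header, separated by semicolons. The sketch below builds such a request with urllib; `build_request` is a hypothetical helper and the cookie values are placeholders to be replaced with the ones copied from the login request in developer tools. Nothing is sent until the request is passed to `urlopen`.

```python
from urllib.request import Request


def build_request(url, cookies, user_agent):
    """Build a Request carrying the given cookies in a single Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"User-Agent": user_agent, "Cookie": cookie_header})


# Placeholder values: copy the real ones from the login request in dev tools.
browser_cookies = {"myCookie": "lovely", "sessionId": "abc123"}
req = build_request("https://boxrec.com/en/ratings", browser_cookies, "Mozilla/5.0")
print(req.get_header("Cookie"))
```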

How to Login and Scrape Websites with Python?

I understand there are similar questions out there, but I couldn't get this code to work. Does anyone know how to log in and scrape the data from this website?
from bs4 import BeautifulSoup
import requests
# Start the session
session = requests.Session()
# Create the payload
payload = {'login': '<USERNAME>',
           'password': '<PASSWORD>'
           }
# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)
# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost every site. The best way to find out how to do it is to use your browser's request inspector (for example in Firefox) and look at how the site behaves when you try to log in.
For your website, when you click the login button a POST request is sent to https://www.beeradvocate.com/community/login/login. With a little bit of trial and error you should be able to replicate it.
Make sure you match the content-type and request headers (specifically cookies in case you need auth tokens).
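A rough sketch of what replicating that POST could look like is shown below. It uses only the standard library so it runs without extra packages; with requests, the equivalent is posting the same payload to the same URL through the session. The form field names `login` and `password` are taken from the question and have not been verified against the site, and the real form may need additional hidden fields.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

LOGIN_URL = "https://www.beeradvocate.com/community/login/login"


def build_login_request(username, password):
    """Build the form-encoded POST; field names are assumed from the question."""
    body = urlencode({"login": username, "password": password}).encode("ascii")
    return Request(
        LOGIN_URL,
        data=body,
        headers={
            "User-Agent": "Mozilla/5.0",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )


# The opener keeps cookies between calls, much like requests.Session does.
opener = build_opener(HTTPCookieProcessor(CookieJar()))
login_request = build_login_request("<USERNAME>", "<PASSWORD>")
# opener.open(login_request) would send it; later opener.open(...) calls reuse the cookies.
print(login_request.get_method(), login_request.full_url)
```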

web scraping python <span> with id

I want to scrape the data inside a <span> element on a given website using BeautifulSoup. The screenshot shows where it is located. However, the code that I'm using just returns an empty list. I can't find the data that I want in the list. What am I doing wrong?
from bs4 import BeautifulSoup
import urllib.request
url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
    your_data.append(line.text)
for line in soup.findAll('span'):
    your_data.append(line.text)
ScreenShot : https://imgur.com/a/z0vNh
Thank you.
The dashboard from the screenshot looks to me like something javascript would generate. If you can't find the tag in the page source, that means it was later added by some javascript code or your browser tried to fix some html which it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves you the plain html back. A browser would parse the html and execute any javascript code if it finds any. In your case, beautiful soup or urllib doesn't execute any javascript code. urllib fetches the html and beautiful soup makes it easier to parse and extract relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render your page and only then parsing its html with beautiful soup or any other parser.
Give selenium a try: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically. You can make it request the page for you, render it, save the new html in a variable, parse it using beautiful soup and extract the values you're interested in. I believe that it already has its own parser implemented, which you can use directly to search for that tag.
Or maybe even scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real time and that value is continuously received from the server, you could look at which requests are sent to the server to get that value. Press F12 to open the developer console and click on the Network tab. Refresh the page and you should see all the requests sent to the server along with the responses. Requests sent by the javascript are usually XMLHttpRequests. Click on XHR in the Network tab to filter out any other requests. (These are instructions for Google Chrome. Firefox might differ a bit.)
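Once the relevant XHR call has been identified, it can often be repeated directly and its response parsed without rendering the page. The sketch below assumes the endpoint returns JSON; the endpoint path, the response shape and the key name are all hypothetical and have to be replaced with what the Network tab actually shows.

```python
import json
from urllib.request import Request


def build_xhr_request(endpoint_url):
    """Build a request that resembles the browser's XHR call."""
    return Request(
        endpoint_url,
        headers={
            "User-Agent": "Mozilla/5.0",
            "Accept": "application/json",
            # Header that jQuery-style XHR calls usually send; harmless if ignored.
            "X-Requested-With": "XMLHttpRequest",
        },
    )


def read_value(response_body, key):
    """Pull one field out of a JSON response body."""
    return json.loads(response_body)[key]


# Hypothetical endpoint and payload: use what the Network tab actually shows.
request = build_xhr_request("http://144.122.167.229/hypothetical/endpoint")
sample_body = '{"mc1_legend_value": "12.5"}'
print(read_value(sample_body, "mc1_legend_value"))
```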

How to make HTTP POST on website that uses asp.net?

I'm using Python library requests for this, but I can't seem to be able to log in to this website.
The url is https://www.bet365affiliates.com/ui/pages/affiliates/, and I've been trying post requests to https://www.bet365affiliates.com/Members/CMSitePages/SiteLogin.aspx?lng=1 with the data of "ctl00$MasterHeaderPlaceHolder$ctl00$passwordTextbox", "ctl00$MasterHeaderPlaceHolder$ctl00$userNameTextbox", etc, but I never seem to be able to get logged in.
Could someone more experienced check the page's source code and tell me what I am missing here?
One possible solution is below. Note that you could also do this without selenium: first fetch the main affiliate page, then extract all the required values (which I gather with XPaths) from the response data. I just didn't have enough time to write it fully with requests.
To gather the information from the response data, you could use an XML tree library. With the same XPath method, you could easily find all the required values.
import os
import requests
from selenium import webdriver
from selenium.webdriver.common.by import By
Password = 'YOURPASS'
Username = 'YOURUSERNAME'
# Selenium 3 style driver setup; recent Selenium 4 releases expect a Service object instead
browser = webdriver.Chrome(os.path.join(os.getcwd(), "Chromedriver.exe"))
browser.get('https://www.bet365affiliates.com/ui/pages/affiliates/Affiliates.aspx')
# Read the values of the hidden form fields that the login page expects to be posted back
VIEWSTATE = browser.find_element(By.XPATH, '//*[@id="__VIEWSTATE"]').get_attribute('value')
SESSIONID = browser.find_element(By.XPATH, '//*[@id="CMSessionId"]').get_attribute('value')
PREVPAG = browser.find_element(By.XPATH, '//*[@id="__PREVIOUSPAGE"]').get_attribute('value')
EVENTVALIDATION = browser.find_element(By.XPATH, '//*[@id="__EVENTVALIDATION"]').get_attribute('value')
# Copy the browser's cookies into a requests session
cookies = browser.get_cookies()
session = requests.session()
for cookie in cookies:
    print(cookie['name'])
    print(cookie['value'])
    session.cookies.set(cookie['name'], cookie['value'])
payload = {'ctl00_AjaxScriptManager_HiddenField': '',
           '__EVENTTARGET': 'ctl00$MasterHeaderPlaceHolder$ctl00$goButton',
           '__EVENTARGUMENT': '',
           '__VIEWSTATE': VIEWSTATE,
           '__PREVIOUSPAGE': PREVPAG,
           '__EVENTVALIDATION': EVENTVALIDATION,
           'txtPassword': Password,
           'txtUserName': Username,
           'CMSessionId': SESSIONID,
           'returnURL': '/ui/pages/affiliates/Affiliates.aspx',
           'ctl00$MasterHeaderPlaceHolder$ctl00$userNameTextbox': Username,
           'ctl00$MasterHeaderPlaceHolder$ctl00$passwordTextbox': Password,
           'ctl00$MasterHeaderPlaceHolder$ctl00$tempPasswordTextbox': 'Password'}
session.post('https://www.bet365affiliates.com/Members/CMSitePages/SiteLogin.aspx?lng=1', data=payload)
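The selenium-free route mentioned in the answer could start by collecting the hidden form fields from the HTML of the affiliate page. Here is a minimal sketch with the standard library's HTML parser instead of XPath; `HiddenFieldCollector` and `hidden_fields` are invented names, and the sample markup only imitates the general shape of an ASP.NET form rather than the real page.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collects name -> value for every <input type="hidden"> on a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "input" and (attributes.get("type") or "").lower() == "hidden":
            name = attributes.get("name")
            if name:
                self.fields[name] = attributes.get("value") or ""


def hidden_fields(html_text):
    """Return all hidden inputs of the page as a dictionary."""
    collector = HiddenFieldCollector()
    collector.feed(html_text)
    return collector.fields


# Made-up markup in the general shape of an ASP.NET form.
sample_page = (
    '<form><input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />'
    '<input type="hidden" name="__EVENTVALIDATION" value="abc" />'
    '<input type="text" name="user" /></form>'
)
print(hidden_fields(sample_page))
```

The resulting dictionary could be merged with the username and password entries to form the payload for `session.post`.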
Did you inspect the http request used by the browser to log you in?
You should replicate it.
FB

Facebook Login Using Requests error

import requests
from bs4 import BeautifulSoup
a = requests.Session()
soup = BeautifulSoup(a.get("https://www.facebook.com/").content)
payload = {
    "lsd":soup.find("input",{"name":"lsd"})["value"],
    "email":"my_email",
    "pass":"my_password",
    "persistent":"1",
    "default_persistent":"1",
    "timezone":"300",
    "lgnrnd":soup.find("input",{"name":"lgnrnd"})["value"],
    "lgndim":soup.find("input",{"name":"lgndim"})["value"],
    "lgnjs":soup.find("input",{"name":"lgnjs"})["value"],
    "locale":"en_US",
    "qsstamp":soup.find("input",{"name":"qsstamp"})["value"]
}
soup = BeautifulSoup(a.post("https://www.facebook.com/",data = payload).content)
print([i.text for i in soup.find_all("a")])
I'm playing around with requests and have read several threads here on SO about it, so I decided to try it out myself.
I am stumped by this line: "qsstamp":soup.find("input",{"name":"qsstamp"})["value"]
It returns nothing, which causes an error.
However, in Chrome developer tools this "qsstamp" is populated. What am I missing here?
The payload is everything shown in the form data in Chrome dev tools, so what is going on?
Searching for qsstamp with Firebug turns up a match in one of the page's scripts.
You can see: j.createHiddenInputs({qsstamp:u},v)
That means qsstamp is dynamically generated by JavaScript.
requests will not run JavaScript (all it does is fetch the page's HTML). You may want to use something like dryscrape, or an automated browser like Selenium.
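A quick way to spot such fields before building the payload is to compare the field names seen in the browser's form data with the inputs present in the HTML that requests actually receives. The sketch below does that with the standard library; the helper names and the sample markup are invented for illustration.

```python
from html.parser import HTMLParser


class InputNameCollector(HTMLParser):
    """Records the name attribute of every <input> element."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            name = dict(attrs).get("name")
            if name:
                self.names.add(name)


def fields_missing_from_html(html_text, expected_names):
    """Return the expected field names that the static HTML does not contain."""
    collector = InputNameCollector()
    collector.feed(html_text)
    return [name for name in expected_names if name not in collector.names]


# Made-up static markup: qsstamp is absent because a script would add it later.
static_html = '<form><input name="lsd" value="AVq"><input name="email"><input name="pass"></form>'
print(fields_missing_from_html(static_html, ["lsd", "email", "pass", "qsstamp"]))
```

Any name reported as missing is a candidate for a value generated by JavaScript, which is the case where a real or headless browser is needed.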
