Webscraping my grades - python

I'm trying to create a program that grabs my school grades from a website everyday. Then stores the values and creates a graph for my grades, but when i try to scrape the page the HTML that i receive is different then the HTML that i get with inspect element.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("https://ames.usoe-dcs.org/Students/2567")
bsObj = BeautifulSoup(html.read(), 'lxml');
print(bsObj)
inspect element gives me: http://pastebin.com/BakmpqUM
while python gives me: http://pastebin.com/7gPY1WgB
i figure this is because the URL to my grades (https://ames.usoe-dcs.org/Students/2567) is private, so when you type it into the browser it returns me here:https://ames.usoe-dcs.org/Login/?DestinationURL=%2FStudents%2F2566
is there a way to use python to automatically sign me in?

The URL isn't necessarily private, however requesting the URL without cookies verifying your status as a user won't get you to the information you see when you logged in.
I would recommend opening Inspect Element to the network tab and reloading the page with your grades on it (while signed in). Then right click on the first request (should be a GET request answered with HTML, code 200), hover over copy, then click Copy as cURL command (bash). Then paste into this webpage and copy the python. It will give you the proper request for the page including the cookies and verification parameters you used to access them in the browser. From there you can parse the HTML response for your grade.
You should have something like this to receive and parse your HTML from the request:
cookies = {
...stuff...
}
headers = {
...stuff...
}
r = requests.get("https://ames.usoe-dcs.org/Students/2567", headers=headers, cookies=cookies)
soup = BeautifulSoup(r.text, "lxml")
grade = soup.find("h1", {"class":"grade"}).contents # Customize to find your grade
print(grade)
The cookies and headers dictionaries come from the cURL to Python output.

Related

webscraper no longer retrieving data - can still access website via browser

I'm new to webscraping and have been trying for fun to scrape a boxing website.
My code below was working on the first attempt, and when I tried to re-run it, it was no longer retrieving the link data any more.
I can still access the website from my browser, so not sure what the error is!
Appreciate any pointers.
import os
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re
os.system('cls')
heavy = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
pages = set()
def get_links(page_url):
print("running crawler...")
global pages
req = Request(heavy, headers = {'User-Agent':'Mozilla/5.0'})
html = urlopen(req)
bs = BeautifulSoup(html.read(), 'html.parser')
for link in bs.find_all('a', href=re.compile('^(/en/box-pro/)')):
if 'href' in link.attrs:
if link.attrs['href'] not in pages:
new_page = link.attrs['href']
print(new_page)
pages.add(new_page)
get_links(new_page)
get_links('')
print("crawling done.")
If you inspect html.read() you will find that the page displays a login form. It might be that a detection system picks up your bot and tries to prevent (or at least make it harder for) you to scrape.
As an engineer at WebScrapingAPI I've tested your URL using our API and it passes each time (it returns the data, not the login page). That is because we've implemented a number of detection evasion features, including an IP rotation system. So by sending the request from another IP with a completely different browser fingerprint, the targeted website 'thinks' it's another person and passes on the information. If you want to test it yourself, here is the script you can use:
import requests
API_KEY = '<YOUR_API_KEY>'
SCRAPER_URL = 'https://api.webscrapingapi.com/v1'
TARGET_URL = 'https://boxrec.com/en/ratings?r%5Brole%5D=box-pro&r%5Bsex%5D=M&r%5Bstatus%5D=a&r%5Bdivision%5D=Heavyweight&r%5Bcountry%5D=&r_go='
PARAMS = {
"api_key":API_KEY,
"url": TARGET_URL,
"render_js":1,
}
response = requests.get(SCRAPER_URL, params=PARAMS)
print(response.text)
If you want to build your own scraper, I suggest you implement some of the techniques in this article. You might also want to actualyy create an account on your targeted website, log in using the credentials, collect the cookies and pass them to your request.
In order to collect the cookies:
Navigate to the login screen
Open developer tools in your browser (Network tab)
Log in and check the login request:
(Note that I have a failed attempt, because I didn't use real credentials to log in)
To pass the cookies to your request, simply add it as a header to your req. Example: req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Cookie':'myCookie=lovely'}). Also, try to use the same User-Agent as the original request (the one made when you logged in). It can be found in the same login request from where you picked up the cookies.

How to Login and Scrape Websites with Python?

I understand there are similar questions out there, however, I couldn't make this code to work out. Does anyone know how to login and scrape the data from this website?
from bs4 import BeautifulSoup
import requests
# Start the session
session = requests.Session()
# Create the payload
payload = {'login':<USERNAME>,
'password':<PASSWORD>
}
# Post the payload to the site to log in
s = session.post("https://www.beeradvocate.com/community/login", data=payload)
# Navigate to the next page and scrape the data
s = session.get('https://www.beeradvocate.com/place/list/?c_id=AR&s_id=0&brewery=Y')
soup = BeautifulSoup(s.text, 'html.parser')
soup.find('div', class_='titleBar')
print(soup)
The process is different for almost each site, the best way to know how to do it is to use your browser's request inspector (firefox) and look at how the site behaves when you try to login.
For your website, when you click the login button a post request is sent to https://www.beeradvocate.com/community/login/login, with a little bit of trial and error your should be able to replicate it.
Make sure you match the content-type and request headers (specifically cookies in case you need auth tokens).

web scraping python <span> with id

I want to scrap data in the <span/> attribute for a given website using BeautifulSoup. You can see at the screenshot where it locates. However, the code that I'm using is just returning an empty list. I can't find the data in the list that I want. What am I doing wrong?
from bs4 import BeautifulSoup
from urllib import request
url = "http://144.122.167.229"
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
data = opener.open(url).read()
soup = BeautifulSoup(data, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'mc1_legend_value'}):
your_data.append(line.text)
for line in soup.findAll('span'):
your_data.append(line.text)
ScreenShot : https://imgur.com/a/z0vNh
Thank you.
The dashboard from the screenshot looks to me like something javascript would generate. If you can't find the tag in the page source, that means it was later added by some javascript code or your browser tried to fix some html which it considered broken or out of place.
Keep in mind that right now you're sending a request to a server and it serves you the plain html back. A browser would parse the html and execute any javascript code if it finds any. In your case, beautiful soup or urllib doesn't execute any javascript code. urllib fetches the html and beautiful soup makes it easier to parse and extract relevant information.
If you want to get the value from that tag, I recommend using a headless browser to render your page and just after that parse it's html through beautiful soup or any other parser.
Give a try to selenium: http://selenium-python.readthedocs.io/.
You can control your own browser programmatically. You can make it request the page for you, render it, save the new html in a variable, parse it using beautifoul soup and extract the values you're interested in. I believe that it already has it's own parser implemented which you can use directly to search for that tag.
Or maybe even scrapinghub's splash: https://github.com/scrapinghub/splash
If the dashboard communicates with a server in real-time and that value is continuously received from the server, you could take a look at what requests are sent to the server in order to get that value. Take a look in developer console under the networks tab. Press F12 to open the developer console and click on Network. Refresh the page and you should get all the request send to the server along with the responses. Requests sent by the javascript are usually XMLHttpRequests. Click on XHR in the Network tab to filter out any other requests. (These are instructions for Google Chrome. Firefox might differ a bit).

Website form login using Python urllib2

I've breen trying to learn to use the urllib2 package in Python. I tried to login in as a student (the left form) to a signup page for maths students: http://reg.maths.lth.se/. I have inspected the code (using Firebug) and the left form should obviously be called using POST with a key called pnr whose value should be a string 10 characters long (the last part can perhaps not be seen from the HTML code, but it is basically my social security number so I know how long it should be). Note that the action in the header for the appropriate POST method is another URL, namely http://reg.maths.lth.se/login/student.
I tried (with a fake pnr in the example below, but I used my real number in my own code).
import urllib
import urllib2
url = 'http://reg.maths.lth.se/'
values = dict(pnr='0000000000')
data = urllib.urlencode(values)
req = urllib2.Request(url,data)
resp = urllib2.urlopen(req)
page = resp.read()
print page
While this executes, the print is the source code of the original page http://reg.maths.lth.se/, so it doesn't seem like I logged in. Also, I could add any key/value pairs to the values dictionary and it doesn't produce any error, which seems strange to me.
Also, if I go to the page http://reg.maths.lth.se/login/student, there is clearly no POST method for submitting data.
Any suggestions?
If you would inspect what request is sent to the server when you enter the number and submit the form, you would notice that it is a POST request with pnr and _token parameters:
You are missing the _token parameter which you need to extract from the HTML source of the page. It is a hidden input element:
<input name="_token" type="hidden" value="WRbJ5x05vvDlzMgzQydFxkUfcFSjSLDhknMHtU6m">
I suggest looking into tools like Mechanize, MechanicalSoup or RoboBrowser that would ease the form submission. You may also parse the HTML with an HTML parser, like BeautifulSoup yourself, extract the token and send via urllib2 or requests:
import requests
from bs4 import BeautifulSoup
PNR = "00000000"
url = "http://reg.maths.lth.se/"
login_url = "http://reg.maths.lth.se/login/student"
with requests.Session() as session:
# extract token
response = session.get(url)
soup = BeautifulSoup(response.content, "html.parser")
token = soup.find("input", {"name": "_token"})["value"]
# submit form
session.post(login_url, data={
"_token": token,
"pnr": PNR
})
# navigate to the main page again (should be logged in)
response = session.get(url)
soup = BeautifulSoup(response.content, "html.parser")
print(soup.title)

Parsing webpage with beautifulsoup to get dynamic content

I am trying to parse the following page
http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936 for the list of similar songs.
The list of similar songs is not present in the page source but is present when I use 'Inspect Element' in the browser.
How do I do it??
Current code:
url = 'http://www.lyricsnmusic.com/roxy-music/while-my-heart-is-still-beating-lyrics/26925936'
request = urllib2.Request(url)
lyricsPage = urllib2.urlopen(request).read()
soup = BeautifulSoup(lyricsPage)
The code to generate the links is:
for p in soup.find_all('p'):
s = p.find('a', { "class" : 'title' }).get('href')
Which methods are available to do this??
This is handled probably by some ajax calls so it will not be in the source,
I think you would need to "monitor network" through developer tools in the browser and look for requests you are interested in.
i.e. a random picked request URL from this page:
http://ws.audioscrobbler.com/2.0/?api_key=73581584905631c5fc15720f03b0b9c8&format=json&callback=jQuery1703329798618797213_1380004055342&method=track.getSimilar&limit=10&artist=roxy%20music&track=while%20my%20heart%20is%20still%20beating&_=1380004055943
to get/see the response enter the above URL in the browser and see the content of the response.
so you need to simulate the requests in python and after you get the response you have to parse the response for interesting details.

Categories