Sorry if this is not the place for this question, but I'm not sure where else to ask.
I'm trying to scrape data from rotogrinders.com and I'm running into some challenges.
In particular, I want to be able to scrape previous NHL game data using URLs of this format (you can change the date for other days' data):
https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016
However, when I get to the page, I notice that the data is broken up into pages, and I'm unsure how to make my script retrieve the data that's shown after clicking the "all" button at the bottom of the page.
Is there a way to do this in Python? Perhaps some library that allows button clicks? Or is there some way to get the data without actually clicking the button, by being clever about the URL/request?
Actually, things are not that complicated in this case. When you click "All", no network requests are issued: all the data is already there, inside a script tag in the HTML. You just need to extract it.
Working code using requests (to download the page content), BeautifulSoup (to parse the HTML and locate the desired script element), re (to extract the "player" array from the script) and json (to load the array string into a Python list):
import json
import re

import requests
from bs4 import BeautifulSoup

url = "https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# locate the script tag containing the "var data = [...]" assignment
pattern = re.compile(r"var data = (\[.*?\]);$", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)

# extract the array string and load it as a Python list
data = pattern.search(script.text).group(1)
data = json.loads(data)

# printing player names for demonstration purposes
for player in data:
    print(player["player"])
Prints:
Jeff Skinner
Jordan Staal
...
William Carrier
A.J. Greer
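The extraction step can be exercised in isolation on a minimal, hypothetical HTML snippet (this markup is invented for illustration, not rotogrinders' actual page):

```python
import json
import re

# hypothetical page embedding its payload as a JavaScript array literal
html = """
<html><body>
<script>
var data = [{"player": "Jeff Skinner"}, {"player": "Jordan Staal"}];
</script>
</body></html>
"""

# capture the array assigned to "var data" and parse it as JSON
pattern = re.compile(r"var data = (\[.*?\]);", re.DOTALL)
players = json.loads(pattern.search(html).group(1))
print([p["player"] for p in players])  # ['Jeff Skinner', 'Jordan Staal']
```

This works whenever the embedded JavaScript literal is also valid JSON; if the site uses unquoted keys or single quotes, json.loads will fail and a tolerant parser would be needed.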
I've been at this all day and I'm getting a little overwhelmed, so let me explain. I have a personal project: scrape all the links using the acestream:// protocol from a website and turn them into a playlist for Ace Stream. For now I can either extract all the links from the site (something like the site map) or extract the acestream links from a specific subpage. One problem I have is that the same acestream link appears several times on the page, so obviously I get the same link multiple times, and I only want it once.
Besides, I don't know how (I'm very new to this) to make the script take the URL automatically from a list of links in a .csv instead of hardcoding it, because I need to get an acestream link from each link I put in the .csv. Sorry about the tirade, I hope it's not a nuisance.
Hope you understand, I translated it with Google Translate.
from bs4 import BeautifulSoup
import requests

# list of unique acestream links found so far
urls = []

def scrape(site):
    # request the page
    r = requests.get(site)
    # parse the HTML
    s = BeautifulSoup(r.text, "html.parser")
    # look at every anchor that actually has an href attribute
    for a in s.find_all("a", href=True):
        href = a.attrs['href']
        # keep only acestream:// links, skipping duplicates
        if href.startswith("acestream://") and href not in urls:
            urls.append(href)
            print(href)

# main entry point
if __name__ == "__main__":
    site = "https://www.websitehere.com/index.htm"
    scrape(site)
Based on your last comment and your code, you can read in a .csv using pandas:
import pandas as pd

file_path = r'C:\<path to your csv>'
df = pd.read_csv(file_path)
csv_links = df['<your_column_name_for_links>'].to_list()
With this, you can get the URLs from the .csv. Just change the values in the <>.
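Putting the two pieces together, here is a minimal sketch that reads one page URL per row from a .csv and feeds each one to the scrape() function. The column name "link" is an assumption; the file is simulated with io.StringIO so the example is self-contained (swap in open("links.csv") for a real file):

```python
import csv
import io

# simulated .csv contents: a "link" header plus one URL per row (placeholder URLs)
csv_data = "link\nhttps://example.com/page1.htm\nhttps://example.com/page2.htm\n"

# read the URLs out of the "link" column
csv_links = [row["link"] for row in csv.DictReader(io.StringIO(csv_data))]
print(csv_links)

# each URL can then be handed to the scrape() function defined above:
# for site in csv_links:
#     scrape(site)
```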
I have a strange problem with my script, which extracts some dates from a webpage.
Here is the script:
# import libraries
import json
import re

import requests
from bs4 import BeautifulSoup

# request the website and download the HTML contents
url = 'https://www.coteur.com/cotes-basket.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# the dates live in ld+json script tags inside the #mediaTable table
s = soup.find("table", id="mediaTable").find_all('script', type='application/ld+json')
date_hour = [json.loads(re.search(r'>(.+)<', str(j), re.S).group(1))["startDate"] for j in s]
date_hour = [sub.replace("T", " ") for sub in date_hour]

print(len(date_hour))
print(date_hour)
This code is functional: it returns the startDate element inside each script tag.
But one date is duplicated: on the webpage I count 24 basketball events, but my list has length 25. The webpage shows 3 events starting at 00:00, while my script extracts 4 dates with 00:00.
Maybe you have an idea why the site does not display these extra entries?
Maybe you have an idea why the site does not display these extra entries?
It does not display rows where there are no odds. A script runs on the page and removes the rows with no odds from view. I think that is currently the script identified by script:nth-child(25), which starts with $(document).on('click'...; it tests odds.length and, if it is 0, removes the row.
You can verify this by disabling JavaScript and reloading the page. You will get the same result as your Python request (where the js doesn't run): the row is present. Re-enable js and the row will disappear.
You can check whether there are odds by going to Rencontres (main table) for a given match > Cotes (also see Pronostics). If you do this with js disabled, you can follow the Rencontres links for all matches and see whether there are odds. In Pronostics there should be odds-based calculations that aren't both 0.
How to solve a doubling problem when scraping with BeautifulSoup?
There is no way, from the response you get with requests, to distinguish the row(s) that will be missing on the rendered page. I am not sure you can even make additional requests to check the odds info, as it is missing for all matches without js. You would likely need to switch to selenium/browser automation, in which case you wouldn't really need BeautifulSoup at all.
There is a small outside chance you might find an API or another site that pulls the same odds, so you could cross-reference.
I'm trying to write a simple application that reads the HTML from a webpage, converts it to a string, and displays certain slices of that string to the user.
However, it seems like these slices change themselves! Each time I run my code I get a different output! Here's the code.
# import urllib so we can get HTML source
from urllib.request import urlopen
# import time, so we can choose which date to read from
import time
# save HTML to a variable
content = urlopen("http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang")
# make the HTML readable and convert it to a string
content = str(content.read())
# select part of the string containing the prayer time table
table = content[24885:24935]
print(table) # print to test what is being selected
I'm not sure what's going on here.
You should really be using something like Beautiful Soup. Something along the lines of the following should help. Looking at the source code for that URL, the table has no id/class, which makes it a little trickier to find.
from bs4 import BeautifulSoup
import requests

url = "http://www.islamicfinder.org/prayerDetail.php?country=canada&city=Toronto&state=ON&lang"
r = requests.get(url)
# pass an explicit parser to avoid the "no parser specified" warning
soup = BeautifulSoup(r.text, "html.parser")

for table in soup.find_all('table'):
    # here you can find the table you want and deal with the results
    print(table)
You shouldn't be looking for the part you want by grabbing specific indexes of the string: websites are often dynamic, and the content won't sit at the same offsets each time.
What you want to do is search for the table you want. Say the table started with the keyword class="prayer_table"; you could find that with str.find().
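As a minimal sketch of that idea (the class name "prayer_table" and the markup are assumptions for illustration; check the real page source for the actual marker):

```python
# hypothetical page fragment containing the table we want
html = '<html><body><p>intro</p><table class="prayer_table"><tr><td>Fajr</td></tr></table></body></html>'

# locate the marker string instead of relying on fixed offsets
start = html.find('class="prayer_table"')
if start != -1:
    # slice from the opening <table ...> tag up to and including its closing tag
    table_start = html.rfind('<table', 0, start)
    table_end = html.find('</table>', start) + len('</table>')
    table = html[table_start:table_end]
    print(table)
```

This is still fragile (it assumes no nested tables), which is exactly why parsing the HTML properly is the better option.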
Better yet, extract the tables from the webpage instead of relying on str.find(). The code below is from a question about extracting tables from a webpage:
from lxml import etree
import urllib

# note: this is Python 2 (urllib.urlopen); in Python 3 use urllib.request.urlopen
web = urllib.urlopen("http://www.ffiec.gov/census/report.aspx?year=2011&state=01&report=demographic&msa=11500")
s = web.read()
html = etree.HTML(s)

## Get all 'tr'
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')

## 'th' is inside the first 'tr'
header = [i[0].text for i in tr_nodes[0].xpath("th")]

## Get text from all the remaining 'tr'
td_content = [[td.text for td in tr.xpath('td')] for tr in tr_nodes[1:]]
So I'm very new to Python and I'm trying to get data from a table on iso-ne.com/isoexpress/ using bs4 and urllib. Here's what I have so far:
from bs4 import BeautifulSoup
from urllib import urlopen

website = 'http://www.iso-ne.com/isoexpress/'
html = urlopen(website).read().decode('utf-8')
soup = BeautifulSoup(html, 'html.parser')
table = soup.find('div', {'class': 'chart'})
rows = table.find_all('tr')
for tr in rows:
    col = tr.find_all('td')
    for td in col:
        text = td.find_all(class_='lmp-list-energy')
        print text
When I run this, I get 6 empty brackets:
[]
[]
[]
[]
[]
[]
The data I am trying to get is the Five Minute Real Time LMP price for the state of New Hampshire on the iso-ne website.
The data are filled in by JavaScript, which is not interpreted by BeautifulSoup: you get the raw, empty container.
What I would do (but I would check legality and the site's conditions first...): look at the requests made to the backend, e.g. using the Network tab in Chrome's developer tools.
You'll find a call to http://iso-ne.com/ws/wsclient. Grab the parameters your client is sending (cookies...) and replay the request (or fine-tune the parameters through trial and error).
Good luck (I did manage to replay the request for data from curl, so it should be doable in Python).
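To replay such a backend call from Python, a minimal sketch using requests. The endpoint comes from the answer above, but the form fields and cookie values are placeholders you would copy from the browser's network tab, not the site's real parameters. The request is only prepared here, not sent:

```python
import requests

# rebuild the observed backend call as a prepared request
req = requests.Request(
    "POST",
    "http://iso-ne.com/ws/wsclient",
    data={"requestType": "five-minute-lmp"},   # placeholder: copy the real form fields
    cookies={"session": "copy-from-your-browser"},  # placeholder cookie
)
prepared = req.prepare()
print(prepared.method, prepared.url)

# once the parameters match what the browser sends, fire it with:
# response = requests.Session().send(prepared)
```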
I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea whether this is against the website's terms of service, so you would need to check that yourself.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests, not to mention taking all of their data.
Consider an alternate approach, such as (politely) asking for a dump of the data, as mentioned above.
Or, if you absolutely do need to scrape:
Space your requests using a timer
Scrape smartly
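The "space your requests" point can be sketched as a small helper. The fetch callable is injected so the example stays self-contained and network-free; in practice you would pass something like requests.get:

```python
import time

def polite_fetch(urls, fetch, delay=1.0):
    # Fetch each URL via the supplied callable, sleeping `delay` seconds
    # between requests so the server is not hammered.
    results = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(delay)
        results.append(fetch(url))
    return results

# demo with a stand-in fetch function (no network needed)
pages = polite_fetch(["/a", "/b"], lambda u: "page:" + u, delay=0.1)
print(pages)
```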
I took a quick glance at that page, and it appears they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by requesting only the data you need, and the data will also be easier to process because it will arrive in a nice format.
Edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working": an empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.