How to solve a duplicate-date problem when scraping with BeautifulSoup - python

I have a strange problem with my script, which extracts some dates from a webpage.
Here is the script:
# import libraries
import json
import re
import requests
from bs4 import BeautifulSoup
# request the website and download the HTML contents
url = 'https://www.coteur.com/cotes-basket.php'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
# collect the ld+json script tags inside the odds table
s = soup.find("table", id="mediaTable").find_all('script', type='application/ld+json')
# pull the startDate field out of each JSON blob
date_hour = [json.loads(re.search(r'>(.+)<', str(j), re.S).group(1))["startDate"] for j in s]
date_hour = [sub.replace("T", " ") for sub in date_hour]
print(len(date_hour))
print(date_hour)
This code is functional. It returns the startDate field from each script tag.
But one date is duplicated: on the webpage I count 24 basketball events, yet my list has length 25. On the webpage you can see 3 events which start at 00:00, but my script extracts 4 dates with 00:00.
Maybe you have an idea why the site does not display these extra entries?

Maybe you have an idea why the site does not display these extra entries?
It does not display rows where there are no odds. This is due to a script which runs and removes the odds-less rows from view. I think it is currently the script identified by script:nth-child(25), which starts with $(document).on('click', .... It has a test on odds.length and, if that is 0, it removes the row.
You can test this by disabling JavaScript and reloading the page. You will get the same result as your Python request (where the JS doesn't run): the row is present. Re-enable JS and the row will disappear.
You can check whether a match has odds by going to Rencontres (main table) for a given match > Cotes (also see Pronostics). If you do this with JS disabled, you can follow the Rencontres links for all matches and see whether there are odds. Under Pronostics there should be odds-based calculations that aren't both 0.
How to solve a duplicate-date problem when scraping with BeautifulSoup?
There is no way, from the response you get with requests, to distinguish the row(s) that will be missing on the rendered page. I am not sure you can even make additional requests to check the odds info, as it is missing for all matches without JS. You would likely need to switch to selenium/browser automation, in which case you wouldn't really need BeautifulSoup at all.
There is a small outside chance you might find an API/other site that pulls the same odds and you could cross-reference.
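If you do switch to browser automation, here is a minimal sketch (Selenium 4 syntax; selecting rows of the #mediaTable table is an assumption based on the table id used in your own code):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # recent Selenium versions can locate a driver automatically
driver.get('https://www.coteur.com/cotes-basket.php')
# with JS running, the odds-less row should already be gone from the DOM;
# you may still need an explicit wait if the removal script runs late
rows = driver.find_elements(By.CSS_SELECTOR, '#mediaTable tbody tr')
print(len(rows))
driver.quit()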

Related

Python web-scraping trouble

I've been at this all day and I'm getting a little overwhelmed. I have a personal project: scrape all the acestream:// protocol links from a website and turn them into a playlist for acestream. For now I can either extract all the links from the site (something like the site map) or extract the acestream links from one specific subpage. One problem is that, since the same acestream link appears several times on the page, I obviously get the same link multiple times, and I only want it once. Also, I don't know how (I'm very new to this) to make the script take the URL from a list of links in a .csv instead of hard-coding it, because I need to get an acestream link from each link I put in the .csv. Sorry for the tirade, I hope it's not a nuisance.
Hope you understand; I translated this with Google Translate.
from bs4 import BeautifulSoup
import requests

# creating an empty list for the links
urls = []

def scrape(site):
    # getting the request from the url
    r = requests.get(site)
    # converting the text
    s = BeautifulSoup(r.text, "html.parser")
    for i in s.find_all("a"):
        href = i.attrs['href']
        if href.startswith("acestream://"):
            site = site + href
            if site not in urls:
                urls.append(site)
                print(site)
            # calling the scrape function itself
            # (recursion)
            scrape(site)

# main function
if __name__ == "__main__":
    site = "https://www.websitehere.com/index.htm"
    scrape(site)
Based on your last comment and your code, you can read in a .csv using:
import pandas as pd
file_path = r'C:\<path to your csv>'
df = pd.read_csv(file_path)
csv_links = df['<your_column_name_for_links>'].to_list()
With this, you can get the URLs from the .csv. Just change the values in the <>.
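Putting the pieces together, a possible sketch: read the URLs from the .csv, then collect each acestream:// href only once by using a set (the file and column names here are placeholders):
import pandas as pd
import requests
from bs4 import BeautifulSoup

# placeholder file and column names - substitute your own
csv_links = pd.read_csv('links.csv')['links'].to_list()

found = set()  # a set keeps each link exactly once
for site in csv_links:
    r = requests.get(site)
    soup = BeautifulSoup(r.text, 'html.parser')
    for a in soup.find_all('a', href=True):
        if a['href'].startswith('acestream://'):
            found.add(a['href'])

for link in sorted(found):
    print(link)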

Data scraper: the contents of the div tag is empty (??)

I am scraping a website to get a number. The number changes dynamically every split second, but upon inspection it is shown on the page. I just need to capture that number, but the div wrapper that contains it returns no value. What am I missing? (Please go easy on me, I am quite new to Python and data scraping.)
I have some code that works and returns the piece of HTML that supposedly contains the data I want, but no joy: the div wrapper returns no value.
import requests
from bs4 import BeautifulSoup
r = requests.get('https://deuda-publica-espana.com')
deuda = BeautifulSoup(r.text, 'html.parser')
deuda = deuda.findAll('div', {'id': 'contador_PDEH'})
print(deuda)
I don't receive any errors; I just get [<div class="contador_xl contador_verde" id="contador_PDEH"></div>] with no value!
Indeed, it is easy with selenium. I suspect there is a JS script running a counter that supplies the number, which is why you can't find it with your requests-based method (as mentioned in the comments):
from selenium import webdriver

# point the webdriver at a local chromedriver executable
d = webdriver.Chrome(r'C:\Users\User\Documents\chromedriver.exe')
d.get('https://deuda-publica-espana.com/')
# once the page has rendered, the JS counter has filled in the value
print(d.find_element_by_id('contador_PDEH').text)
d.quit()
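As a side note, find_element_by_id has since been removed in Selenium 4; a sketch of the equivalent with the current API:
from selenium import webdriver
from selenium.webdriver.common.by import By

d = webdriver.Chrome()  # Selenium 4.6+ can locate a matching driver automatically
d.get('https://deuda-publica-espana.com/')
print(d.find_element(By.ID, 'contador_PDEH').text)
d.quit()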

Advice about using a loop through parameters of a get request

I am trying to get each runner's information from this 2017 marathon. The problem is that to get the information I want, I would have to click on each runner's name to get his partial splits.
I know that I can use a GET request to fetch each runner's information. For example, for the runner Josh Griffiths I can use requests.get with the parameters in the URL.
My problem is that I don't know how to figure out the idp parameter, because this value changes with every runner.
My questions are the following:
Is it possible to use a loop to get all runners' information using GET requests? How can I solve the issue with the idp parameter? I mean, I don't know how this value is determined, so I don't know how to define a loop over it.
Is there a better method to get each runner's information? I thought about using Selenium WebDriver, but this would be very slow.
Any advice would be appreciated!
You will need to use something like BeautifulSoup to parse the HTML for the links you need; that way there is no need to figure out how to construct the request yourself.
# note: this answer uses Python 2 print statements
import requests
from bs4 import BeautifulSoup

base_url = "http://results-2017.virginmoneylondonmarathon.com/2017/"
r = requests.get(base_url + "?pid=list")
soup = BeautifulSoup(r.content, "html.parser")
tbody = soup.find('tbody')

for tr in tbody.find_all('tr'):
    for a in tr.find_all('a', href=True, class_=None):
        print
        print a.parent.get_text(strip=True)[1:]
        r_runner = requests.get(base_url + a['href'])
        soup_runner = BeautifulSoup(r_runner.content, "html.parser")
        # Find the start of the splits
        for h2 in soup_runner.find_all('h2'):
            if "Splits" in h2:
                splits_table = h2.find_next('table')
                splits = []
                for tr in splits_table.find_all('tr'):
                    splits.append([td.text for td in tr.find_all('td')])
                for row in splits:
                    print '  {}'.format(', '.join(row))
                break
For each link, you then need to follow it and parse the splits from the returned HTML. The script's output starts as follows:
Boniface, Anna (GBR)
5K, 10:18:05, 00:17:55, 17:55, 03:35, 16.74, -
10K, 10:36:23, 00:36:13, 18:18, 03:40, 16.40, -
15K, 10:54:53, 00:54:44, 18:31, 03:43, 16.21, -
20K, 11:13:25, 01:13:15, 18:32, 03:43, 16.19, -
Half, 11:17:31, 01:17:21, 04:07, 03:45, 16.04, -
25K, 11:32:00, 01:31:50, 14:29, 03:43, 16.18, -
30K, 11:50:44, 01:50:34, 18:45, 03:45, 16.01, -
35K, 12:09:34, 02:09:24, 18:51, 03:47, 15.93, -
40K, 12:28:43, 02:28:33, 19:09, 03:50, 15.67, -
Finish, 12:37:17, 02:37:07, 08:35, 03:55, 15.37, 1
Griffiths, Josh (GBR)
5K, 10:15:52, 00:15:48, 15:48, 03:10, 18.99, -
10K, 10:31:42, 00:31:39, 15:51, 03:11, 18.94, -
....
To better understand how this works, first take a look at the HTML source for each of the pages. The idea is to find something unique in the structure of the page about what you are looking for, to allow you to extract it with a script.
Next, I would recommend reading through the documentation page for BeautifulSoup. This assumes you understand the basic structure of an HTML document. The library gives you many tools to help you search for and extract elements from the HTML, for example finding where the links are. Not all webpages can be parsed like this, as the information is often created using JavaScript; in those cases you would need something like selenium. But in this case, requests and BeautifulSoup do the job nicely.
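If you do want the idp values themselves, note that you never have to guess how they are generated: you can read them out of the scraped hrefs. A small sketch with Python 3's urllib.parse (the href value here is illustrative, not a real id):
from urllib.parse import urlparse, parse_qs

# illustrative href of the shape found in the results list; real ones come from a['href']
href = "?content=detail&idp=SOME_OPAQUE_ID&pid=search"
params = parse_qs(urlparse(href).query)
print(params["idp"][0])  # the runner-specific identifier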

Scraping Web data with Python

Sorry if this is not the place for this question, but I'm not sure where else to ask.
I'm trying to scrape data from rotogrinders.com and I'm running into some challenges.
In particular, I want to be able to scrape previous NHL game data using urls of this format (obviously you can change the date for other day's data):
https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016
However, when I get to the page, I notice that the data is broken up into pages, and I'm unsure how to make my script fetch the data that's presented after clicking the "all" button at the bottom of the page.
Is there a way to do this in Python? Perhaps some library that allows button clicks? Or is there some way to get the data without actually clicking the button, by being clever about the URL/request?
Actually, things are not that complicated in this case. When you click "All", no network requests are issued: all the data is already there, inside a script tag in the HTML, and you just need to extract it.
Working code using requests (to download the page content), BeautifulSoup (to parse the HTML and locate the desired script element), re (to extract the desired "player" array from the script) and json (to load the array string into a Python list):
import json
import re
import requests
from bs4 import BeautifulSoup
url = "https://rotogrinders.com/game-stats/nhl-skater?site=draftkings&date=11-22-2016"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
pattern = re.compile(r"var data = (\[.*?\]);$", re.MULTILINE | re.DOTALL)
script = soup.find("script", text=pattern)
data = pattern.search(script.text).group(1)
data = json.loads(data)
# printing player names for demonstration purposes
for player in data:
    print(player["player"])
Prints:
Jeff Skinner
Jordan Staal
...
William Carrier
A.J. Greer
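If you want more than the names, the data list of dicts drops straight into pandas; a possible follow-up, assuming the data variable from the snippet above:
import pandas as pd

# `data` is the list of player dicts produced by the snippet above
df = pd.DataFrame(data)
print(df.head())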

web scraping in python

I'd like to scrape all the ~62000 names from this petition, using python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
This just prints an empty list: []
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []
while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")
    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass
    # Find the next page to scrape
    prev = soup.find("prev_signature")
    # Check if another page of results exists - if not, break from the loop
    if prev is None:
        break
    # Get the previous URL
    url = prev.contents[0]
    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data to go through, and I have no idea whether this is against the website's terms of service, so you should check that first.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on the site in a short amount of time, slowing down legitimate users' requests, not to mention taking all of their data.
Consider an alternate approach, such as asking (politely) for a dump of the data (as mentioned above).
Or, if you absolutely do need to scrape:
Space your requests using a timer (see the sketch after this list)
Scrape smartly
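A minimal sketch of the timer idea (the URL list is hypothetical):
import time
import requests

pages = ['http://example.com/page/1', 'http://example.com/page/2']  # hypothetical URLs
for url in pages:
    response = requests.get(url)
    # ... process response.text here ...
    time.sleep(2)  # pause between requests to keep the load on the server low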
I took a quick glance at that page and it appears they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be some sort of REST call. By doing this you lessen the load on their server by requesting only the data you need, and it will also be easier to process because it comes back in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working" - an empty list or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
