Advice about using a loop through parameters of a get request - python

I am trying to get each runner's information from this 2017 marathon. The problem is that to get the information I want, I would have to click on each runner's name to get their partial splits.
I know that I can use a GET request to get each runner's information. For example, for the runner Josh Griffiths I can call requests.get with the parameters in the URL.
My problem is that I don't know how to figure out the idp term, because this term changes with every runner.
My questions are the following:
Is it possible to use a loop to get all runners' information using GET requests? How can I solve the issue with the idp parameter, given that I don't know how it is determined and therefore can't define a loop over it?
Is there a better method to get each runner's information? I thought about using Selenium WebDriver, but that would be very slow.
Any advice would be appreciated!

You will need to use something like BeautifulSoup to parse the HTML for the links you need; that way there is no need to figure out how to construct each request yourself.
import requests
from bs4 import BeautifulSoup

base_url = "http://results-2017.virginmoneylondonmarathon.com/2017/"

r = requests.get(base_url + "?pid=list")
soup = BeautifulSoup(r.content, "html.parser")
tbody = soup.find('tbody')

for tr in tbody.find_all('tr'):
    for a in tr.find_all('a', href=True, class_=None):
        print()
        print(a.parent.get_text(strip=True)[1:])

        r_runner = requests.get(base_url + a['href'])
        soup_runner = BeautifulSoup(r_runner.content, "html.parser")

        # Find the heading that marks the start of the splits
        for h2 in soup_runner.find_all('h2'):
            if "Splits" in h2.text:
                splits_table = h2.find_next('table')
                splits = []

                for tr_splits in splits_table.find_all('tr'):
                    splits.append([td.text for td in tr_splits.find_all('td')])

                for row in splits:
                    print(' {}'.format(', '.join(row)))
                break
For each link, you then follow it and parse the splits from the returned HTML, which is what the script above does. Its output starts as follows:
Boniface, Anna (GBR)
5K, 10:18:05, 00:17:55, 17:55, 03:35, 16.74, -
10K, 10:36:23, 00:36:13, 18:18, 03:40, 16.40, -
15K, 10:54:53, 00:54:44, 18:31, 03:43, 16.21, -
20K, 11:13:25, 01:13:15, 18:32, 03:43, 16.19, -
Half, 11:17:31, 01:17:21, 04:07, 03:45, 16.04, -
25K, 11:32:00, 01:31:50, 14:29, 03:43, 16.18, -
30K, 11:50:44, 01:50:34, 18:45, 03:45, 16.01, -
35K, 12:09:34, 02:09:24, 18:51, 03:47, 15.93, -
40K, 12:28:43, 02:28:33, 19:09, 03:50, 15.67, -
Finish, 12:37:17, 02:37:07, 08:35, 03:55, 15.37, 1
Griffiths, Josh (GBR)
5K, 10:15:52, 00:15:48, 15:48, 03:10, 18.99, -
10K, 10:31:42, 00:31:39, 15:51, 03:11, 18.94, -
....
To better understand how this works, first take a look at the HTML source of each of the pages. The idea is to find something unique in the page's structure about the information you are looking for, so that a script can extract it.
Next I would recommend reading through the BeautifulSoup documentation (which assumes you understand the basic structure of an HTML document). The library gives you many tools to search for and extract elements from the HTML, for example to find where the links are. Not all webpages can be parsed like this, as the information is often generated with JavaScript; in those cases you would need something like Selenium, but here requests and BeautifulSoup do the job nicely.

Related

How to solve a doubling problem when scraping with BeautifulSoup

I have a strange problem with my script which extracts some dates from a webpage.
Here is the script:
# import library
import json
import re
import requests
from bs4 import BeautifulSoup
import datetime
# Request the website and download the HTML contents
url = 'https://www.coteur.com/cotes-basket.php'
#page = requests.get(url)
response = requests.get(url)
#soup = BeautifulSoup(page.text, 'html.parser')
soup = BeautifulSoup(response.text, 'html.parser')
s = soup.find("table", id="mediaTable").find_all('script', type='application/ld+json')
date_hour = [json.loads(re.search(r'>(.+)<', str(j), re.S).group(1))["startDate"] for j in s]
#print(date_hour)
date_hour = [sub.replace("T", " ") for sub in date_hour]
print(len(date_hour))
print(date_hour)
This code is functional. It returns the startDate element inside each script tag.
But one date is doubled: on the webpage I count 24 basketball events, but my list has length 25. On the webpage you can see 3 events which start at 00:00, but my script extracts 4 dates with 00:00.
Maybe you have an idea why the site does not display these extra entries?
Maybe you have an idea why the site does not display these extra entries?
It does not display rows where there are no odds. This is due to a script which runs and removes the rows with no odds from view. I think it is currently the script identified by script:nth-child(25), which starts with $(document).on('click'...; it tests odds.length and removes the row if that is 0.
You can test this by disabling JavaScript and reloading the page. You will get the same result as your Python request (where the JS doesn't run): the row is present. Re-enable JS and the row will disappear.
You can check whether there are odds by going to Recontres (main table) for a given match > Cotes (also see Prognostics). If you do this with JS disabled, you can follow the Recontres links for all matches and see whether there are odds. In Prognostics there should be odds-based calculations that aren't both 0.
How to solve a doubling problem when scraping with BeautifulSoup?
There is no way, from the response you get with requests, to distinguish the row(s) that will be missing on the page. I am not sure you can even make additional requests to check the odds info, as it is missing for all matches without JS. You would likely need to switch to Selenium/browser automation; you then wouldn't really need BeautifulSoup at all.
There is a small outside chance you might find an API/other site that pulls the same odds and you could cross-reference.
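If you do go the Selenium route, a minimal sketch of the idea might look like the following; the row selector is an assumption based on the mediaTable id from the question, and a production script would use WebDriverWait rather than a fixed sleep.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.coteur.com/cotes-basket.php")
time.sleep(5)  # crude wait so the page's own scripts can remove the no-odds rows

# With JavaScript running, the remaining rows should match what you see in the browser
rows = driver.find_elements(By.CSS_SELECTOR, "#mediaTable tbody tr")
print(len(rows))

driver.quit()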

Python Web Scraping with lxml

I am trying to scrape column names (player, cost, sel., form, pts) from the page below:
https://fantasy.premierleague.com/a/statistics/total_points
However, I am failing to do so.
Before I go further, let me show you what I have done.
from lxml import html
import requests
page = 'https://fantasy.premierleague.com/a/statistics/total_points'
#Take site and structure html
page = requests.get(page)
tree = html.fromstring(page.content)
#Using the page's CSS classes, extract all links pointing to a team
Location = tree.cssselect('.ism-thead-bold tr .ism-table--el-stats__name')
When I do this, Location should be a list that contains the string "Player".
However, it returns an empty list, which means cssselect did not capture anything.
Though each column name has a different th class, I used one of them (ism-table--el-stats__name) for this specific trial just to keep it simple.
When this problem is fixed, I want to use a regex, since every class has a different suffix after the two underscores.
If anyone can help me with these two tasks, I would really appreciate it!
Thank you.
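As a side note, the "different suffix after two underscores" idea can be expressed without a regex by using a CSS attribute prefix selector. A minimal sketch of that approach is below; it will only return something if the table is actually present in the HTML that requests fetches, which the empty result above suggests may not be the case (the page may build the table with JavaScript).
import requests
from lxml import html

page = requests.get('https://fantasy.premierleague.com/a/statistics/total_points')
tree = html.fromstring(page.content)

# Match every <th> whose class starts with the shared prefix
headers = tree.cssselect('th[class^="ism-table--el-stats__"]')
print([th.text_content().strip() for th in headers])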

Python scraping deep nested divs whose classes change

I'm somewhat new to Python and working on the first part of a project where I need to get the link(s) on a FanDuel page, and I've been spinning my tires trying to get the 'href'.
Here's what Inspect Element shows:
What I'm trying to get to is highlighted above.
I see what seems to be the parent, but as you go down the tree, the classes listed with lettering (i.e. "_a _ch _al _nr _dq _ns _nt _nu") change from day to day.
What I noticed is that the 'href' I need has a constant "data-test-id" that does not change, so I was trying to use that as my way to find what I need, but it does not seem to be working.
I'm not sure how much further, if at all, I need to drill down to get what I need, or if my code is totally off. Thanks for your help in advance!
import requests
from bs4 import BeautifulSoup
url = "https://www.fanduel.com/contests/mlb/96"
# Authentication might not be necessary; it was a test, still getting the same results
site = requests.get(url, cookies={'X-Auth-Token':'MY TOKEN IS HERE'})
soup = BeautifulSoup(site.content, 'lxml')
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})
# If I use this, I get an error
game = soup.find_all('a', {'data-test-id':"ContestCardEnterLink"})[('href')]
print(game)
The HTML is constructed by JavaScript. To check this, instead of using Inspect Element, use View Page Source and see whether the element is already present there (that is the HTML you get when you do requests.get()); I've already checked this, and it isn't. To resolve this, you will have to use Selenium to render the JavaScript on the page, and then you can get the page source from Selenium after it has constructed the elements in the DOM.
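A minimal sketch of that approach, reusing the data-test-id from the question (untested here; you may still need to log in or pass cookies, and WebDriverWait is more robust than a fixed sleep):
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://www.fanduel.com/contests/mlb/96")
time.sleep(5)  # crude wait for the page's JavaScript to build the DOM

# Hand the rendered source to BeautifulSoup and reuse the constant data-test-id
soup = BeautifulSoup(driver.page_source, 'lxml')
links = [a['href'] for a in soup.find_all('a', {'data-test-id': 'ContestCardEnterLink'})]
print(links)

driver.quit()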

How to scrape Price from a site that has a changing structure?

I want to scrape the pricing data from an e-commerce site called Flipkart. I tried using BeautifulSoup with CasperJS (a Node.js utility) and similar libraries, but none of them is good enough.
Here's the URL and the structure.
https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct?
The problem is the layout... What are some ways to get around this?
P.S.: Is there any way I could apply machine learning to get the pricing data without knowing complex math? Like, where do I even start?
You should probably construct your XPath in a way that does not rely on the class, but rather on the content (node()) of the element you want to match. Alternatively, you could match the data-reactid, if that doesn't change.
For matching the div by data-reactid:
//div[@data-reactid=220]
Or for matching the div based on its location:
//span[child::img[@src="//img1a.flixcart.com/www/linchpin/fk-cp-zion/img/fa_8b4b59.png"]]/preceding-sibling::div
Assuming the img src doesn't change, you're on the safe side.
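For illustration only, a minimal sketch of applying such an XPath with lxml; the data-reactid value is the one quoted above and may well have changed:
import requests
from lxml import html

url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
tree = html.fromstring(requests.get(url).content)

# Match the price div by its data-reactid attribute instead of its class
price = tree.xpath('//div[@data-reactid="220"]/text()')
print(price)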
Since you can't rely on XPath due to the dynamically changing classes, you could try a regex to find the price inside a script tag on the page.
Something like this:
import requests
import re
url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
r = requests.get(url)
pattern = re.compile(r'prexoAvailable":\w+,"price":(\d+)')
result = pattern.search(r.text)
print(result.group(1))
Alternatively, using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = "https://www.flipkart.com/redmi-note-4-gold-32-gb/p/itmer37fmekafqct"
headers = {'User-Agent': 'Mozilla/5.0'}  # a User-Agent header is assumed here; adjust as needed

page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
for a in soup.findAll('a', href=True, attrs={'class': '_31qSD5'}):
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    print(price.text)
E-commerce sites no longer allow you to scrape data like before; every entity of the product, such as the price, specification, and reviews, is now enclosed in a separate "dynamic" class name.
To scrape certain data from the webpage you need to target that specific class name, which is dynamic, so plain requests.get() and BeautifulSoup won't work.

web scraping in python

I'd like to scrape all of the ~62,000 names from this petition using Python. I'm trying to use the beautifulsoup4 library.
However, it's just not working.
Here's my code so far:
import urllib2, re
from bs4 import BeautifulSoup
soup = BeautifulSoup(urllib2.urlopen('http://www.thepetitionsite.com/104/781/496/ban-pesticides-used-to-kill-tigers/index.html').read())
divs = soup.findAll('div', attrs={'class' : 'name_location'})
print divs
[]
What am I doing wrong? Also, I want to somehow access the next page to add the next set of names to the list, but I have no idea how to do that right now. Any help is appreciated, thanks.
You could try something like this:
import urllib2
from bs4 import BeautifulSoup

html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/latest.xml?1374861495')
# uncomment to try with a smaller subset of the signatures
#html = urllib2.urlopen('http://www.thepetitionsite.com/xml/petitions/104/781/496/signatures/00/00/00/05.xml')

results = []

while True:
    # Read the web page in XML mode
    soup = BeautifulSoup(html.read(), "xml")

    try:
        for s in soup.find_all("signature"):
            # Scrape the names from the XML
            firstname = s.find('firstname').contents[0]
            lastname = s.find('lastname').contents[0]
            results.append(str(firstname) + " " + str(lastname))
    except:
        pass

    # Find the next page to scrape
    prev = soup.find("prev_signature")

    # Check if another page of results exists - if not, break from the loop
    if prev == None:
        break

    # Get the previous URL
    url = prev.contents[0]

    # Open the next page of results
    html = urllib2.urlopen(url)
    print("Extracting data from {}".format(url))

# Print the results
print("\n")
print("====================")
print("= Printing Results =")
print("====================\n")
print(results)
Be warned, though: there is a lot of data there to go through, and I have no idea whether this is against the website's terms of service, so you would need to check that.
In most cases it is extremely inconsiderate to simply scrape a site. You put a fairly large load on it in a short amount of time, slowing down legitimate users' requests, not to mention stealing all of their data.
Consider an alternate approach such as asking (politely) for a dump of the data (as mentioned above).
Or if you do absolutely need to scrape:
Space your requests using a timer (see the sketch below)
Scrape smartly
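A minimal sketch of what spacing requests with a timer might look like (the URLs are hypothetical placeholders):
import time
import requests

urls = ["http://example.com/signatures/1", "http://example.com/signatures/2"]

for url in urls:
    response = requests.get(url)
    # ... parse and store what you need from the response here ...
    time.sleep(2)  # pause between requests so you don't hammer the server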
I took a quick glance at that page, and it appears to me they use AJAX to request the signatures. Why not simply copy their AJAX request? It will most likely be using some sort of REST call. By doing this you lessen the load on their server by requesting only the data you need, and it will also be easier for you to process because it will be in a nice format.
Re-edit: I looked at their robots.txt file. It disallows /xml/. Please respect this.
What do you mean by "not working"? An empty list, or an error?
If you are receiving an empty list, it is because the class "name_location" does not exist in the document. Also check out bs4's documentation on findAll.
