Scraping nested html with Selenium - python

I'm looking for some help with scraping with selenium in python.
You need a paid account to view this page so creating a reproducible won't be possible.
The page I'm trying to scrape
I'm attempting to scrape the data from the pitch in the top right corner of the image under 'Spots on Field'.
<div class="player-details-football-map__UEFA player-details-football-map">
<div class="shots">
<div>
<a class="shot episode" style="left: 39.8529%; top: 28.9474%;"></a>
<div class="tooltip" style="left: 39.8529%; top: 28.9474%;">
<div class="tooltip-title">
<div class="tooltip-shoot-type">Shot on target</div>
<div class="tooltip-blow-type">Donyell Malen </div>
<div class="tooltip-shoot-name"></div>
</div>
<div class="tooltip-time">h Viktoria Koln</div>
<div class="tooltip-time">Half 1, 18:22 02/09/20</div>
<div class="tooltip-time">Length: 7.1 m</div>
<div class="tooltip-shoot-xg">Expected goals: 0.17</div>
</div>
</div>
The above is a snippet of just one of the data points I want to scrape.
I've tried using BeautifulSoup
from bs4 import BeautifulSoup
from requests import get
url = 'https://football.instatscout.com/players/294322/shots'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
shots = html_soup.find_all('div', class_ = 'tooltip')
print(type(shots))
print(len(shots))
and nothing was being returned.
So now I've tried using Selenium.
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Users\James\OneDrive\Desktop\webdriver\chromedriver.exe')
driver.get('https://football.instatscout.com/players/294322/shots')
print("Page Title is : %s" %driver.title)
driver.find_element_by_name('email').send_keys('my username')
driver.find_element_by_name('pass').send_keys('my password')
driver.find_element_by_xpath('//*[contains(concat( " ", #class, " " ), concat( " ", "hRAqIl", " " ))]').click()
goals = driver.find_element_by_class_name('tooltip')
but I'm getting the error of
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".tooltip"}
Can someone please help point me in the right direction? I'm basically trying to scrape everything from the above HTML, that includes 'tooltip' in the class name.
Thanks

Using css selectors with bs4:
from bs4 import BeautifulSoup as soup
import re #for extracting offsets
r = [{**dict(zip(['left', 'top'], re.findall('[\d\.]+', i.div['style']))),
'shoot_type':i.select_one('.tooltip-shoot-type').text,
'name':i.select_one('.tooltip-blow-type').text,
'team':i.select_one('div:nth-of-type(2).tooltip-time').text,
'time':i.select_one('div:nth-of-type(3).tooltip-time').text,
'length':i.select_one('div:nth-of-type(4).tooltip-time').text[8:],
'expected_goals':i.select_one('.tooltip-shoot-xg').text[16:]}
for i in soup(html, 'html.parser').select('div.shots > div')]
Output:
[{'left': '39.8529', 'top': '28.9474', 'shoot_type': 'Shot on target', 'name': 'Donyell Malen ', 'team': 'h Viktoria Koln', 'time': 'Half 1, 18:22 02/09/20', 'length': '7.1 m', 'expected_goals': '0.17'}]

Related

Web scrape second number between tags

I am new to Python, and never done HTML. So any help would be appreciated.
I need to extract two numbers: '1062' and '348', from a website's inspect element.
This is my code:
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, 'html.parser')
Adv = soup.select_one ('.col-sm-6 .advDec:nth-child(1)').text[10:]
Dec = soup.select_two ('.col-sm-6 .advDec:nth-child(2)').text[10:]
The website element looks like below:
<div class="nifty-header-shade1 col-xs-12 col-sm-6 col-md-3">
<div class="row">
<div class="col-sm-12">
<h4>Stocks</h4>
</div>
<div class="col-sm-6">
<p class="advDec">Advanced: 1062</p>
</div>
<div class="col-sm-6">
<p class="advDec">Declined: 348</p>
</div>
</div>
</div>
Using my code, am able to extract first number (1062). But unable to extract the second number (348). Can you please help.
Assuming the Pattern is always the same, you can select your elements by text and get its next_sibling:
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
Example
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content)
adv = soup.select_one('a:-soup-contains("Advanced:")').next_sibling.strip()
dec = soup.select_one('a:-soup-contains("Declined:")').next_sibling.strip()
print(adv, dec)
If there are always 2 elements, then the simplest way would probably be to destructure the array of selected elements.
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.traderscockpit.com/?pageView=live-nse-advance-decline-ratio-chart")
soup = BeautifulSoup(page.content, "html.parser")
adv, dec = [elm.next_sibling.strip() for elm in soup.select(".advDec a") ]
print("Advanced:", adv)
print("Declined", dec)

Web scraping from the span element

I am on a scraping project and I am lookin to scrape from the following.
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
I want to extract only Christian, Islam as the output.(Without the 'Faith:').
This is my try:
faithdiv = soup.find('div', class_='spec-subcat attributes-religion')
faith = faithdiv.find('span').text.strip()
How can I make this done?
There are several ways you can fix this, I would suggest the following - Find all <span> in <div> that have not the class="h5":
soup.select('div.spec-subcat.attributes-religion span:not(.h5)')
Example
import requests
html_text = '''
<div class="spec-subcat attributes-religion">
<span class="h5">Faith:</span>
<span>Christian</span>
<span>Islam</span>
</div>
'''
soup = BeautifulSoup(html_text, 'lxml')
', '.join([x.get_text() for x in soup.select('div.spec-subcat.attributes-religion span:not(.h5)')])
Output
Christian, Islam

Unable to fetch the relevant links and discard others

I've written a script in python in combination with selenium along with BeautifulSoup to get the links leading to property details from a webpage. As the content are heavily dynamic, I made use of selenium to get the page source. When I run my script, I get lots of links including those required links.
How can I get only the relevant link from each container out of the three?
My try:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def fetch_info(link):
driver.get(link)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
soup = BeautifulSoup(driver.page_source, "lxml")
linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")]
return linklist
if __name__ == '__main__':
url = "https://www.khov.com/find-new-homes/arizona/buckeye"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
for newlink in fetch_info(url):
print(newlink)
driver.quit()
Results I'm having:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-at-silverstone
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye
/find-new-homes/arizona/phoenix/85020/k-hovnanian-homes/pointe-16
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/fusion-ii-at-the-meadows
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/aire
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-at-silverstone
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/montage-at-the-meadows
/find-new-homes/arizona/sun-city/85373/four-seasons/k.-hovnanian-s-four-seasons-at-ventana-lakes
/find-new-homes/arizona/peoria/85382/k-hovnanian-homes/park-paseo
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/affinity-at-montana-vista
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/aspire-at-montana-vista
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-ii-at-silverstone
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-ii-at-silverstone
Results I would like to get:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
A chunk of html elements (the link I'm after is within the second line of the following elements):
<div class="propertyWrapper clear">
<span class="link-outside"></span>
<div class="propertyCarouselWrapper">
<div class="responsiveImageCarousel enabled" style="touch-action: pan-y; user-select: none; -webkit-user-drag: none; -webkit-tap-highlight-color: rgba(0, 0, 0, 0);">
<div class="prevBtn"></div>
<div class="nextBtn"></div>
<div class="images" data-detail-url="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills">
<ul style="width: 960px; left: 0px;">
<li style="width: 320px;"><img alt="holiday exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/holiday-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&build=1019&encoder=wic&useresizingpipeline=true&w=450&h=280&mode=crop"></li>
<li style="width: 320px;"><img alt="carnival exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/carnival-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&build=1019&encoder=wic&useresizingpipeline=true&w=450&h=280&mode=crop"></li>
</ul>
</div>
<div class="pagination" style="width: 56px;"><ul><li class="active"></li><li></li></ul></div>
</div>
</div>
<div class="propertyInfoWrapper">
<div class="marker-details-container">
<h3 class="marker-details">New Homes in Buckeye, Arizona</h3>
<div class="spacer"></div>
<h4 class="propertyListingHeader">Aspire at Sienna Hills</h4>
<p class="marker-details">21007 West Almeria Road, Buckeye, AZ 85396</p>
<p class="marker-details marker-status">Final Opportunities</p>
<div class="spacer"></div>
<p class="marker-details marker-price"><span class="bold">Priced from: </span>Mid $200s</p>
<p class="marker-details"><span class="bold">Home type: </span>Single Family Homes</p>
<p class="marker-details marker-amenities"><span class="bold">Amenities: </span>Pool, Hiking Trails, Park</p>
</div>
<div class="community-tag-container">
<a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills#quick-move-in-homes" onclick="KHOV.Analytics.trackEvent('Qmi_Icon_Qmi');">
<div class="community-tag">
<div class="ctaDesc quick-move-in-badge link-inside">Quick Move In Homes</div>
<div class="ctaIcon quick-move-in-badge-icon link-inside"></div>
</div>
</a>
</div>
<a href="#request-info-form-modal" class="open-inline-modal-link" onclick="KHOV.Analytics.trackEvent('Orange_Ri_Request_Info');">
<div class="button orange-color requestInfoButton link-inside" data-urlname="aspire-at-sienna-hills">Request Info</div>
</a>
</div>
</div>
You need to include the featured id as well as results. You can use Or to combine. Latest bs4 supports not.
#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer .propertyWrapper :not([onclick])[href*=find]
This can also be shortened to
#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer
But that shortening may be less robust.
You can just check for the desired keyword in the link and print those, and ignore the others:
if __name__ == '__main__':
url = "https://www.khov.com/find-new-homes/arizona/buckeye"
driver = webdriver.Chrome()
wait = WebDriverWait(driver,10)
for newlink in fetch_info(url):
if url.split('/')[-1] in newlink:
print(newlink)
driver.quit()
Output:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
Would list slicing works?
def fetch_info(link):
driver.get(link)
wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
soup = BeautifulSoup(driver.page_source, "lxml")
linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")][:3]
return linklist

how to .find() from the actual div in my for

I'm parsing a huge file, the following HTML code is only a little part. I have many times the first div. In this div I want to get differents tags in <a> I don't care if I also get the element into the a.
I'm doing this but It doesn't work :
from bs4 import BeautifulSoup
import requests
import re
page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
with open('pmu.html', 'a+')as file:
for div in soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") }):
event_information = div.find('a', class_ = 'trow--event tc-track-element-events')
print(re.sub(r'\s+', ' ', event_information.text))
An exemple of HTML :
<div class="time_group" data-time_group="group0">
<div class="row">
<div class="col-sm-12">
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
</div>
</div>
</div>
With the for loop i get into the different div which interest me but I don't know how I can use this div to do the next : div.find I want to do the find in the element on this div not outside (in the soup).
What I except :
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
Then I just have to find the different tag values in my var.
I hope my english isn't horrible.
Thank you, in advance for your valuable assistance
EDIT 1 :
Let's take an exemple of source code : https://pastebin.com/KZBp9c3y
in this file when i do for div in soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") }): I find the first div but imagine we have multiple match in the for loop.
Then I want to find in this div the element with tag a and class trow--event... div.find('a', class_ = 'trow--event tc-track-element-events')
An exemple of possible result is:
data-event_id="brescia__pescara"
data-compet_id="italie_-_serie_b"
data-sport_id="football"
score-both :
Anyway the problem is that I don't know how to do a find from the div where I am. I'm in <div class="time_group" data-time_group="group1"> and I want to get different information. I want to parse the div from the top to the bottom.
concretely :
for div in soup:
if current_div is:
do this.....
else if:
do this...
How can I get the current_div ?
Tell me if you don't understand what I want.
Thanks you
I've find something it's not exactly what I wanted but it works :
from bs4 import BeautifulSoup
import requests
import re
page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
soupdiv = soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") })
for div in soupdiv:
test = div.find("a", {"class":"trow--event tc-track-element-events"})
print(test.text)
I doing my find from the current div in the for.
thanks you.

Extract text from webpage using Selenium in Python

How could i use python selenium to extract ": Sahih al-Bukhari 248"
the following does not seem to work
reference = find_element_by_xpath(".//div[3]/table/tbody/tr[1]/td[2]").text
print reference
see html code below
<div class="actualHadithContainer">
<!-- Begin hadith -->
<a name="1"></a>
<div class="englishcontainer">
<div class="english_hadith_full" style="display: block;">
<div class="hadith_narrated"><p>Narrated `Aisha:</p></div>
<div class="text_details">
<p>Whenever the Prophet (ﷺ) took a bath after Janaba he started by washing his hands and then performed ablution like that for the prayer. After that he would put his fingers in water and move the roots of his hair with them, and then pour three handfuls of water over his head and then pour water all over his body.</p></div>
<div class="clear"></div></div></div>
<div class="arabic_hadith_full arabic"><span class="arabic_sanad arabic"></span>
<span class="arabic_text_details arabic">حَدَّثَنَا عَبْدُ اللَّهِ بْنُ يُوسُفَ، قَالَ أَخْبَرَنَا مَالِ</span><span class="arabic_sanad arabic"></span></div>
<!-- End hadith -->
<div class="bottomItems">
<table class="hadith_reference" cellspacing="0" cellpadding="0">
<tbody><tr><td><b>Reference</b></td>
<td> : Sahih al-Bukhari 248</td></tr>
<tr><td>In-book reference</td>
<td> : Book 5, Hadith 1</td></tr>
<tr><td>USC-MSA web (English) reference</td><td> : Vol. 1, Book 5, Hadith 248</td></tr>
<tr><td> <i>(deprecated numbering scheme)</i></td></tr></tbody></table><div class="hadith_permalink">Report Error | <span class="sharelink" onclick="share('/bukhari/5/1')">Share</span></div></div>
<div class="clear"></div></div>
I am using the code below to extract other items but having difficulties with the required excerpt above.
Code:
from selenium import webdriver
import os
import re
driver = webdriver.PhantomJS()
driver.implicitly_wait(30)
driver.set_window_size(1120, 550)
driver.get("https://www.sunnah.com/bukhari/5");
print driver.title
print driver.find_element_by_css_selector('.book_page_english_name').text
print driver.find_element_by_xpath('//*[#id="main"]/div[2]/div[1]/div[3]').text
for person in driver.find_elements_by_class_name('actualHadithContainer'):
try:
title1 = person.find_element_by_xpath('.//div[#class="hadith_narrated"]/p').text
if title1:
print title1
else:
print "exception"
title1 = person.find_element_by_xpath('.//div[#class="hadith_narrated"]').text
print title1
title2 = person.find_element_by_xpath('.//div[#class="text_details"]/p').text
if title2:
print title2
else:
title2 = person.find_element_by_xpath('.//div[#class="text_details"]').text
print title2
reference = find_element_by_xpath(".//div[3]/table/tbody/tr[1]/td[2]").text
print reference
except:
print "exception"
When using selenium API, you should perform some tasks like click a button or scroll to bottom.
When you need to extract information from HTML, you should use BeautifulSoup, it is much simple:
from selenium import webdriver
import os
import re
from bs4 import BeautifulSoup
driver = webdriver.PhantomJS()
driver.implicitly_wait(30)
driver.set_window_size(1120, 550)
driver.get("https://www.sunnah.com/bukhari/5")
soup = BeautifulSoup(driver.page_source, 'lxml')
soup.find(name='table', class_='hadith_reference').tr.text
And this page is static, you should use requests:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.sunnah.com/bukhari/5")
soup = BeautifulSoup(r.text, 'lxml')
for div in soup.find_all(class_='actualHadithContainer'):
ref = div.find(name='table', class_='hadith_reference').tr.text
print(ref)
out:
Reference : Sahih al-Bukhari 248
Reference : Sahih al-Bukhari 249
Reference : Sahih al-Bukhari 250
Reference : Sahih al-Bukhari 251
Reference : Sahih al-Bukhari 252
Reference : Sahih al-Bukhari 253
Reference : Sahih al-Bukhari 254

Categories