Extract text from webpage using Selenium in Python


How can I use Python Selenium to extract ": Sahih al-Bukhari 248"?
The following does not seem to work:
reference = find_element_by_xpath(".//div[3]/table/tbody/tr[1]/td[2]").text
print reference
See the HTML below:
<div class="actualHadithContainer">
<!-- Begin hadith -->
<a name="1"></a>
<div class="englishcontainer">
<div class="english_hadith_full" style="display: block;">
<div class="hadith_narrated"><p>Narrated `Aisha:</p></div>
<div class="text_details">
<p>Whenever the Prophet (ﷺ) took a bath after Janaba he started by washing his hands and then performed ablution like that for the prayer. After that he would put his fingers in water and move the roots of his hair with them, and then pour three handfuls of water over his head and then pour water all over his body.</p></div>
<div class="clear"></div></div></div>
<div class="arabic_hadith_full arabic"><span class="arabic_sanad arabic"></span>
<span class="arabic_text_details arabic">حَدَّثَنَا عَبْدُ اللَّهِ بْنُ يُوسُفَ، قَالَ أَخْبَرَنَا مَالِ</span><span class="arabic_sanad arabic"></span></div>
<!-- End hadith -->
<div class="bottomItems">
<table class="hadith_reference" cellspacing="0" cellpadding="0">
<tbody><tr><td><b>Reference</b></td>
<td> : Sahih al-Bukhari 248</td></tr>
<tr><td>In-book reference</td>
<td> : Book 5, Hadith 1</td></tr>
<tr><td>USC-MSA web (English) reference</td><td> : Vol. 1, Book 5, Hadith 248</td></tr>
<tr><td> <i>(deprecated numbering scheme)</i></td></tr></tbody></table><div class="hadith_permalink">Report Error | <span class="sharelink" onclick="share('/bukhari/5/1')">Share</span></div></div>
<div class="clear"></div></div>
I am using the code below to extract other items, but I am having difficulty with the excerpt shown above.
Code:
from selenium import webdriver
import os
import re
driver = webdriver.PhantomJS()
driver.implicitly_wait(30)
driver.set_window_size(1120, 550)
driver.get("https://www.sunnah.com/bukhari/5")
print driver.title
print driver.find_element_by_css_selector('.book_page_english_name').text
print driver.find_element_by_xpath('//*[@id="main"]/div[2]/div[1]/div[3]').text
for person in driver.find_elements_by_class_name('actualHadithContainer'):
    try:
        title1 = person.find_element_by_xpath('.//div[@class="hadith_narrated"]/p').text
        if title1:
            print title1
        else:
            print "exception"
            title1 = person.find_element_by_xpath('.//div[@class="hadith_narrated"]').text
            print title1
        title2 = person.find_element_by_xpath('.//div[@class="text_details"]/p').text
        if title2:
            print title2
        else:
            title2 = person.find_element_by_xpath('.//div[@class="text_details"]').text
            print title2
        reference = find_element_by_xpath(".//div[3]/table/tbody/tr[1]/td[2]").text
        print reference
    except:
        print "exception"

The Selenium API is meant for interacting with a page, for example clicking a button or scrolling to the bottom.
When you only need to extract information from the HTML, BeautifulSoup is much simpler:
from selenium import webdriver
import os
import re
from bs4 import BeautifulSoup
driver = webdriver.PhantomJS()
driver.implicitly_wait(30)
driver.set_window_size(1120, 550)
driver.get("https://www.sunnah.com/bukhari/5")
soup = BeautifulSoup(driver.page_source, 'lxml')
soup.find(name='table', class_='hadith_reference').tr.text
This page is static, so you can fetch it with requests instead:
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.sunnah.com/bukhari/5")
soup = BeautifulSoup(r.text, 'lxml')
for div in soup.find_all(class_='actualHadithContainer'):
    ref = div.find(name='table', class_='hadith_reference').tr.text
    print(ref)
Output:
Reference : Sahih al-Bukhari 248
Reference : Sahih al-Bukhari 249
Reference : Sahih al-Bukhari 250
Reference : Sahih al-Bukhari 251
Reference : Sahih al-Bukhari 252
Reference : Sahih al-Bukhari 253
Reference : Sahih al-Bukhari 254
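The same per-container lookup can be narrowed to just the value cell. The sketch below is a minimal illustration that runs against a trimmed copy of the HTML from the question rather than the live site; the helper name extract_references and the SAMPLE_HTML constant are made up for this example. As a side note, the last lookup in the question's loop appears to call find_element_by_xpath as a bare name instead of as a method of person, and the bare except would hide the resulting error.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup shown in the question (not fetched from the site).
SAMPLE_HTML = """
<div class="actualHadithContainer">
  <div class="hadith_narrated"><p>Narrated `Aisha:</p></div>
  <table class="hadith_reference">
    <tbody>
      <tr><td><b>Reference</b></td><td> : Sahih al-Bukhari 248</td></tr>
      <tr><td>In-book reference</td><td> : Book 5, Hadith 1</td></tr>
    </tbody>
  </table>
</div>
"""


def extract_references(html):
    """Return the text of the second cell of the first reference row, per container."""
    soup = BeautifulSoup(html, "html.parser")
    refs = []
    for container in soup.select("div.actualHadithContainer"):
        # select_one returns the first match in document order,
        # i.e. the second <td> of the first <tr>.
        cell = container.select_one("table.hadith_reference tr td:nth-of-type(2)")
        if cell is not None:
            # Drop the leading ": " that the page puts in front of the value.
            refs.append(cell.get_text(strip=True).lstrip(": "))
    return refs


references = extract_references(SAMPLE_HTML)
print(references)
```

Searching from each container keeps the lookup independent of how many div elements precede the table, which is what made the positional div[3] path fragile.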

Related

Scraping nested html with Selenium

I'm looking for some help with scraping using Selenium in Python.
You need a paid account to view this page, so creating a reproducible example won't be possible.
The page I'm trying to scrape
I'm attempting to scrape the data from the pitch in the top right corner of the image under 'Spots on Field'.
<div class="player-details-football-map__UEFA player-details-football-map">
<div class="shots">
<div>
<a class="shot episode" style="left: 39.8529%; top: 28.9474%;"></a>
<div class="tooltip" style="left: 39.8529%; top: 28.9474%;">
<div class="tooltip-title">
<div class="tooltip-shoot-type">Shot on target</div>
<div class="tooltip-blow-type">Donyell Malen </div>
<div class="tooltip-shoot-name"></div>
</div>
<div class="tooltip-time">h Viktoria Koln</div>
<div class="tooltip-time">Half 1, 18:22 02/09/20</div>
<div class="tooltip-time">Length: 7.1 m</div>
<div class="tooltip-shoot-xg">Expected goals: 0.17</div>
</div>
</div>
The above is a snippet of just one of the data points I want to scrape.
I've tried using BeautifulSoup
from bs4 import BeautifulSoup
from requests import get
url = 'https://football.instatscout.com/players/294322/shots'
response = get(url)
html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)
shots = html_soup.find_all('div', class_ = 'tooltip')
print(type(shots))
print(len(shots))
but nothing was returned.
So now I've tried using Selenium:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
options = Options()
options.add_argument("start-maximized")
driver = webdriver.Chrome(chrome_options=options, executable_path=r'C:\Users\James\OneDrive\Desktop\webdriver\chromedriver.exe')
driver.get('https://football.instatscout.com/players/294322/shots')
print("Page Title is : %s" %driver.title)
driver.find_element_by_name('email').send_keys('my username')
driver.find_element_by_name('pass').send_keys('my password')
driver.find_element_by_xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "hRAqIl", " " ))]').click()
goals = driver.find_element_by_class_name('tooltip')
but I'm getting this error:
NoSuchElementException: Message: no such element: Unable to locate element: {"method":"css selector","selector":".tooltip"}
Can someone please help point me in the right direction? I'm basically trying to scrape everything from the above HTML, that includes 'tooltip' in the class name.
Thanks

Using CSS selectors with bs4:
from bs4 import BeautifulSoup as soup
import re  # for extracting offsets
# html is assumed to hold the page source containing the markup above
r = [{**dict(zip(['left', 'top'], re.findall(r'[\d\.]+', i.div['style']))),
      'shoot_type': i.select_one('.tooltip-shoot-type').text,
      'name': i.select_one('.tooltip-blow-type').text,
      'team': i.select_one('div:nth-of-type(2).tooltip-time').text,
      'time': i.select_one('div:nth-of-type(3).tooltip-time').text,
      'length': i.select_one('div:nth-of-type(4).tooltip-time').text[8:],
      'expected_goals': i.select_one('.tooltip-shoot-xg').text[16:]}
     for i in soup(html, 'html.parser').select('div.shots > div')]
Output:
[{'left': '39.8529', 'top': '28.9474', 'shoot_type': 'Shot on target', 'name': 'Donyell Malen ', 'team': 'h Viktoria Koln', 'time': 'Half 1, 18:22 02/09/20', 'length': '7.1 m', 'expected_goals': '0.17'}]
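For readers who find the one-expression version hard to follow, here is a sketch of the same extraction written as a plain loop. It is self-contained only because it embeds the snippet from the question; on the real page the string would presumably come from driver.page_source after logging in. The function name parse_shots is made up for this example, and unlike the answer above it strips surrounding whitespace from each value and splits on the colon instead of slicing at fixed offsets.

```python
import re
from bs4 import BeautifulSoup

# The snippet from the question, embedded so the example runs on its own.
html = """
<div class="shots">
  <div>
    <a class="shot episode" style="left: 39.8529%; top: 28.9474%;"></a>
    <div class="tooltip" style="left: 39.8529%; top: 28.9474%;">
      <div class="tooltip-title">
        <div class="tooltip-shoot-type">Shot on target</div>
        <div class="tooltip-blow-type">Donyell Malen </div>
        <div class="tooltip-shoot-name"></div>
      </div>
      <div class="tooltip-time">h Viktoria Koln</div>
      <div class="tooltip-time">Half 1, 18:22 02/09/20</div>
      <div class="tooltip-time">Length: 7.1 m</div>
      <div class="tooltip-shoot-xg">Expected goals: 0.17</div>
    </div>
  </div>
</div>
"""


def parse_shots(page_source):
    """Turn every tooltip under div.shots into a dictionary."""
    soup = BeautifulSoup(page_source, "html.parser")
    shots = []
    for tip in soup.select("div.shots div.tooltip"):
        # The style attribute carries the pitch coordinates as percentages.
        left, top = re.findall(r"[\d.]+", tip["style"])
        # The three tooltip-time rows are team, time and length, in that order.
        team, time, length = [t.get_text(strip=True) for t in tip.select(".tooltip-time")]
        xg = tip.select_one(".tooltip-shoot-xg").get_text(strip=True)
        shots.append({
            "left": left,
            "top": top,
            "shoot_type": tip.select_one(".tooltip-shoot-type").get_text(strip=True),
            "name": tip.select_one(".tooltip-blow-type").get_text(strip=True),
            "team": team,
            "time": time,
            "length": length.split(":", 1)[1].strip(),
            "expected_goals": xg.split(":", 1)[1].strip(),
        })
    return shots


shots = parse_shots(html)
print(shots)
```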

How to select second div tag with same classname?

I'm trying to select the second div tag with the info class name, but with no success using bs4 find_next. How do you go about selecting the text inside the second of several div tags that share a class name?
[<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>
<div class="info">
<a href="/clubs/12/Manchester-United/overview">
Manchester United<span class="playerClub badge-20 t1"></span>
</a>
</div>
<div class="info">Defender</div>]
Here is what I have tried:
from bs4 import BeautifulSoup
import requests
players_url = ['http://www.premierleague.com//players/13559/Axel-Tuanzebe/stats']
# this is the dict where we store all information:
players = {}
for url in players_url:
    player_page = requests.get(url)
    cont = BeautifulSoup(player_page.content, 'lxml')
    data = dict((k.contents[0].strip(), v.get_text(strip=True)) for k, v in zip(cont.select('.topStat span.stat, .normalStat span.stat'), cont.select('.topStat span.stat > span, .normalStat span.stat > span')))
    club = {"Club" : cont.find('div', attrs={'class' : 'info'}).get_text(strip=True)}
    position = {"Position": cont.find_next('div', attrs={'class' : 'info'})}
    players[cont.select_one('.playerDetails .name').get_text(strip=True)] = data
    print(position)

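You can try the following. Call find_next on the first matched element rather than on the whole document:
club_ele = cont.find('div', attrs={'class' : 'info'})
club = {"Club" : club_ele.get_text(strip=True)}
position = {"Position": club_ele.find_next('div', attrs={'class' : 'info'}).get_text(strip=True)}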
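As a quick check of this idea, the sketch below runs it against the two div elements from the question, embedded as a string so that no request is needed. It also shows indexing into find_all as an alternative; which of the two reads better is a matter of taste.

```python
from bs4 import BeautifulSoup

# The first two div.info elements from the question.
html = """
<div class="info">
  <a href="/clubs/12/Manchester-United/overview">
    Manchester United<span class="playerClub badge-20 t1"></span>
  </a>
</div>
<div class="info">Defender</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Option 1: start from the first match and walk forward to the next one.
first = soup.find("div", class_="info")
second = first.find_next("div", class_="info")
club = first.get_text(strip=True)
position = second.get_text(strip=True)

# Option 2: collect all matches and pick by index.
infos = soup.find_all("div", class_="info")
position_by_index = infos[1].get_text(strip=True)

print(club, position, position_by_index)
```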

Unable to fetch the relevant links and discard others

I've written a Python script that uses Selenium together with BeautifulSoup to get the links leading to property details from a webpage. As the content is heavily dynamic, I used Selenium to get the page source. When I run my script, I get lots of links, including the ones I need.
How can I get only the relevant link from each of the three containers?
My attempt:
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")]
    return linklist

if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for newlink in fetch_info(url):
        print(newlink)
    driver.quit()
Results I'm having:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-at-silverstone
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye
/find-new-homes/arizona/phoenix/85020/k-hovnanian-homes/pointe-16
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/fusion-ii-at-the-meadows
/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/aire
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-at-silverstone
/find-new-homes/arizona/peoria/85383/k-hovnanian-homes/montage-at-the-meadows
/find-new-homes/arizona/sun-city/85373/four-seasons/k.-hovnanian-s-four-seasons-at-ventana-lakes
/find-new-homes/arizona/peoria/85382/k-hovnanian-homes/park-paseo
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/affinity-at-montana-vista
/find-new-homes/arizona/laveen/85339/k-hovnanian-homes/aspire-at-montana-vista
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/pinnacle-ii-at-silverstone
/find-new-homes/arizona/scottsdale/85255/k-hovnanian-homes/summit-ii-at-silverstone
Results I would like to get:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
A chunk of html elements (the link I'm after is within the second line of the following elements):
<div class="propertyWrapper clear">
<span class="link-outside"></span>
<div class="propertyCarouselWrapper">
<div class="responsiveImageCarousel enabled" style="touch-action: pan-y; user-select: none; -webkit-user-drag: none; -webkit-tap-highlight-color: rgba(0, 0, 0, 0);">
<div class="prevBtn"></div>
<div class="nextBtn"></div>
<div class="images" data-detail-url="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills">
<ul style="width: 960px; left: 0px;">
<li style="width: 320px;"><img alt="holiday exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/holiday-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&build=1019&encoder=wic&useresizingpipeline=true&w=450&h=280&mode=crop"></li>
<li style="width: 320px;"><img alt="carnival exterior new homes sienna hills usp" src="https://khovcachecdn.azureedge.net/azure/sitefinitylibraries/images/default-source/images/az/aspire-at-sienna-hills/community-thumbnails/carnival-exterior-new-homes-sienna-hills-usp.jpg?sfvrsn=4&build=1019&encoder=wic&useresizingpipeline=true&w=450&h=280&mode=crop"></li>
</ul>
</div>
<div class="pagination" style="width: 56px;"><ul><li class="active"></li><li></li></ul></div>
</div>
</div>
<div class="propertyInfoWrapper">
<div class="marker-details-container">
<h3 class="marker-details">New Homes in Buckeye, Arizona</h3>
<div class="spacer"></div>
<h4 class="propertyListingHeader">Aspire at Sienna Hills</h4>
<p class="marker-details">21007 West Almeria Road, Buckeye, AZ 85396</p>
<p class="marker-details marker-status">Final Opportunities</p>
<div class="spacer"></div>
<p class="marker-details marker-price"><span class="bold">Priced from: </span>Mid $200s</p>
<p class="marker-details"><span class="bold">Home type: </span>Single Family Homes</p>
<p class="marker-details marker-amenities"><span class="bold">Amenities: </span>Pool, Hiking Trails, Park</p>
</div>
<div class="community-tag-container">
<a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills#quick-move-in-homes" onclick="KHOV.Analytics.trackEvent('Qmi_Icon_Qmi');">
<div class="community-tag">
<div class="ctaDesc quick-move-in-badge link-inside">Quick Move In Homes</div>
<div class="ctaIcon quick-move-in-badge-icon link-inside"></div>
</div>
</a>
</div>
<a href="#request-info-form-modal" class="open-inline-modal-link" onclick="KHOV.Analytics.trackEvent('Orange_Ri_Request_Info');">
<div class="button orange-color requestInfoButton link-inside" data-urlname="aspire-at-sienna-hills">Request Info</div>
</a>
</div>
</div>

You need to include the id of the featured container as well as the id of the results container. You can combine the two selectors with a comma, which acts as an OR. Recent versions of bs4 support :not.
#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer .propertyWrapper :not([onclick])[href*=find]
This can also be shortened to
#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], #propertyFeaturedResultsContainer
But that shortening may be less robust.
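To see what the longer selector does, here is a small sketch run against made-up markup. The two container ids come from the selector above; everything else in the HTML (the third container, the exact nesting, the onclick values) is a simplified assumption and not the real page structure.

```python
from bs4 import BeautifulSoup

# Hypothetical, heavily simplified markup: only the two container ids are taken
# from the selector in the answer.
html = """
<div id="propertyFeaturedResultsContainer">
  <div class="propertyWrapper">
    <a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills"></a>
    <a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills#quick-move-in-homes" onclick="track()">Quick Move In Homes</a>
    <a href="#request-info-form-modal" onclick="track()">Request Info</a>
  </div>
</div>
<div id="propertyResultsContainer">
  <div class="propertyWrapper">
    <a href="/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado"></a>
  </div>
</div>
<div id="someOtherContainer">
  <div class="propertyWrapper">
    <a href="/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye"></a>
  </div>
</div>
"""

SELECTOR = (
    "#propertyResultsContainer .propertyWrapper :not([onclick])[href*=find], "
    "#propertyFeaturedResultsContainer .propertyWrapper :not([onclick])[href*=find]"
)

soup = BeautifulSoup(html, "html.parser")
# Keep elements whose href contains "find" and which carry no onclick handler,
# but only inside the two named containers.
links = [tag["href"] for tag in soup.select(SELECTOR)]
print(links)
```

In this sample the tracking links are dropped because they have an onclick attribute, and the Scottsdale link is dropped because it sits outside both containers.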

You can just check for the desired keyword in the link and print those, and ignore the others:
if __name__ == '__main__':
    url = "https://www.khov.com/find-new-homes/arizona/buckeye"
    driver = webdriver.Chrome()
    wait = WebDriverWait(driver,10)
    for newlink in fetch_info(url):
        if url.split('/')[-1] in newlink:
            print(newlink)
    driver.quit()
Output:
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills
/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado
/find-new-homes/arizona/buckeye/85396/four-seasons/k.-hovnanian's-four-seasons-at-victory-at-verrado
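One caveat with a plain substring test is that the keyword could also match inside a longer word elsewhere in the link. The sketch below is a slightly stricter variant that compares whole path segments; the helper name filter_by_city is made up for this example, and the sample links are a subset of the results listed in the question.

```python
from urllib.parse import urlparse


def filter_by_city(page_url, links):
    """Keep links that contain the last path segment of page_url as a whole segment."""
    city = urlparse(page_url).path.rstrip("/").split("/")[-1]
    return [link for link in links if city in urlparse(link).path.split("/")]


page_url = "https://www.khov.com/find-new-homes/arizona/buckeye"
links = [
    "/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/aspire-at-sienna-hills",
    "/find-new-homes/arizona/scottsdale/85257/k-hovnanian-homes/skye",
    "/find-new-homes/arizona/buckeye/85396/k-hovnanian-homes/affinity-at-verrado",
]

buckeye_links = filter_by_city(page_url, links)
for link in buckeye_links:
    print(link)
```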

Would list slicing work?
def fetch_info(link):
    driver.get(link)
    wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "#community-search-homes .propertyWrapper > a")))
    soup = BeautifulSoup(driver.page_source, "lxml")
    linklist = [item.get("href") for item in soup.select("#community-search-homes .propertyWrapper > a")][:3]
    return linklist

how to .find() from the actual div in my for

I'm parsing a huge file; the following HTML is only a small part of it. The first div occurs many times. Inside each such div I want to get various attributes of the <a> tag; I don't mind if I also get the elements inside the <a>.
This is what I'm doing, but it doesn't work:
from bs4 import BeautifulSoup
import requests
import re
page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
with open('pmu.html', 'a+') as file:
    for div in soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") }):
        event_information = div.find('a', class_ = 'trow--event tc-track-element-events')
        print(re.sub(r'\s+', ' ', event_information.text))
An example of the HTML:
<div class="time_group" data-time_group="group0">
<div class="row">
<div class="col-sm-12">
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
</div>
</div>
</div>
With the for loop I reach the divs that interest me, but I don't know how to use the current div for the next step: I want div.find to search only among the elements inside that div, not outside it (in the whole soup).
What I expect:
<a class="trow--event tc-track-element-events" href="/event/522788/football/football/maroc-botola-pro-1/rsb-berkane-rapide-oued-zem" data-event_id="rsb_berkane__rapide_oued_zem" data-compet_id="maroc_-_botola_pro_1" data-sport_id="football" data-name="sportif.clic.paris_live.details" data-toggle="tooltip" data-placement="bottom" title="Football - Maroc - Botola Pro 1 - RSB Berkane // Rapide Oued Zem - 29 mars 2018 - 19h00">
<em class="trow--event--name">
<span>RSB Berkane // Rapide Oued Zem</span>
</em>
</a>
Then I just have to read the different attribute values from my variable.
I hope my English isn't horrible.
Thank you in advance for your valuable assistance.
EDIT 1:
Let's take an example of the source code: https://pastebin.com/KZBp9c3y
In this file, when I run for div in soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") }): I find the first div, but imagine we have multiple matches in the for loop.
Then, within this div, I want to find the element with tag a and class trow--event...: div.find('a', class_ = 'trow--event tc-track-element-events')
An example of a possible result is:
data-event_id="brescia__pescara"
data-compet_id="italie_-_serie_b"
data-sport_id="football"
score-both :
Anyway, the problem is that I don't know how to run a find starting from the div I am currently in. I'm in <div class="time_group" data-time_group="group1"> and I want to get different pieces of information from it. I want to parse the div from top to bottom.
Concretely:
for div in soup:
    if current_div is:
        do this.....
    else if:
        do this...
How can I get the current_div?
Tell me if you don't understand what I want.
Thank you.

I've found something. It's not exactly what I wanted, but it works:
from bs4 import BeautifulSoup
import requests
import re
page_url = 'https://paris-sportifs.pmu.fr/'
page = requests.get(page_url)
soup = BeautifulSoup(page.text, 'html.parser')
soupdiv = soup.find_all('div', class_ = 'time_group', attrs={ 'data-time_group' : re.compile("group[1-9]") })
for div in soupdiv:
    test = div.find("a", {"class":"trow--event tc-track-element-events"})
    print(test.text)
I'm doing my find from the current div in the for loop.
Thank you.
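To make the scoping explicit, here is a small offline sketch. Calling find on a tag searches only that tag's descendants, so the loop variable already is the "current div". The markup is a cut-down version of the question's HTML; the group1 block is invented around the brescia__pescara attributes mentioned in the question, and its display name is an assumption.

```python
import re
from bs4 import BeautifulSoup

# Cut-down sample; the group1 and group2 blocks are assumptions for illustration.
html = """
<div class="time_group" data-time_group="group0">
  <a class="trow--event tc-track-element-events" data-event_id="rsb_berkane__rapide_oued_zem" data-sport_id="football">
    <em class="trow--event--name"><span>RSB Berkane // Rapide Oued Zem</span></em>
  </a>
</div>
<div class="time_group" data-time_group="group1">
  <a class="trow--event tc-track-element-events" data-event_id="brescia__pescara" data-compet_id="italie_-_serie_b" data-sport_id="football">
    <em class="trow--event--name"><span>Brescia // Pescara</span></em>
  </a>
</div>
<div class="time_group" data-time_group="group2"></div>
"""

soup = BeautifulSoup(html, "html.parser")
events = []
for div in soup.find_all("div", class_="time_group",
                         attrs={"data-time_group": re.compile(r"group[1-9]")}):
    # div.find only looks inside this div, never in the rest of the soup.
    link = div.find("a", class_="trow--event")
    if link is None:
        # A group without an event link (group2 here) is skipped instead of crashing.
        continue
    events.append({
        "group": div["data-time_group"],
        "event_id": link.get("data-event_id"),
        "competition": link.get("data-compet_id"),
        "name": link.get_text(strip=True),
    })

print(events)
```

Note that the pattern group[1-9] deliberately leaves out group0, as in the question, and that the None check guards against the AttributeError that test.text would raise on an empty group.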

Scraping <span>flow text</span> with BeautifulSoup and urllib

I am working on scraping the data from a website using BeautifulSoup. For whatever reason, I cannot seem to find a way to get the text between span elements to print. Here is what I am running.
data = """ <div class="grouping">
<div class="a1 left" style="width:20px;">Text</div>
<div class="a2 left" style="width:30px;"><span
id="target_0">Data1</span>
</div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2
</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3
</span</div>
</div>
"""
My ultimate goal would be to be able to print a list ["Text", "Data1", "Data2"] for each entry. But right now I am having trouble getting Python and urllib to produce any text between the <span> tags. Here is what I am running:
import urllib
from bs4 import BeautifulSoup
url = 'http://target.com'
html = urllib.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
Search_List = [0,4,5] # list of Target IDs to scrape
for i in Search_List:
    h = str(i)
    root = 'target_' + h
    taggr = soup.find("span", { "id" : root })
    print taggr, ", ", taggr.text
When I use urllib it produces this:
<span id="target_0"></span>,
<span id="target_4"></span>,
<span id="target_5"></span>,
However, I also downloaded the html file, and when I parse the downloaded file it produces this output (the one that I want):
<span id="target_0">Data1</span>, Data1
<span id="target_4">Data1</span>, Data1
<span id="target_5">Data1</span>, Data1
Can anyone explain to me why urllib doesn't produce the outcome?

Use this code:
...
soup = BeautifulSoup(html, 'html.parser')
your_data = list()
for line in soup.findAll('span', attrs={'id': 'target_0'}):
    your_data.append(line.text)
...
Similarly, add the other attributes you need to extract data from, and write the your_data list to a CSV file. I hope this helps; if it doesn't work, let me know.

You can use the following approach to create your lists based on the source HTML you have shown:
from bs4 import BeautifulSoup
data = """
<div class="grouping">
<div class="a1 left" style="width:20px;">Text0</div>
<div class="a2 left" style="width:30px;"><span id="target_0">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text2</div>
<div class="a2 left" style="width:30px;"><span id="target_2">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
<div class="grouping">
<div class="a1 left" style="width:20px;">Text4</div>
<div class="a2 left" style="width:30px;"><span id="target_4">Data1</span></div>
<div class="a3 left" style="width:45px;"><span id="div_target_0">Data2</span></div>
<div class="a4 left" style="width:32px;"><span id="reg_target_0">Data3</span></div>
</div>
"""
soup = BeautifulSoup(data, "lxml")
search_ids = [0, 4, 5] # list of Target IDs to scrape
for i in search_ids:
    span = soup.find("span", id='target_{}'.format(i))
    if span:
        grouping = span.parent.parent
        print list(grouping.stripped_strings)[:-1]  # -1 to remove "Data3"
The example has been slightly modified to show it finding IDs 0 and 4. This would display the following output:
[u'Text0', u'Data1', u'Data2']
[u'Text4', u'Data1', u'Data2']
Note: if the HTML you get back from your URL is different from what you see when viewing the page in your browser (i.e. the data you want is missing completely), then you will need to use a solution such as Selenium to drive a browser and extract the HTML. In that case the HTML is probably being generated locally by JavaScript, and urllib does not have a JavaScript processor.
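A quick way to tell the two situations apart is to check whether the target span exists in the fetched HTML but has no text. The sketch below does that for two hand-written strings that mimic the outputs shown in the question; the helper name span_is_empty is made up for this example, and it is written for Python 3 even though the code above uses Python 2 print statements.

```python
from bs4 import BeautifulSoup


def span_is_empty(html, span_id):
    """Return True if the span is present in the HTML but contains no text."""
    soup = BeautifulSoup(html, "html.parser")
    span = soup.find("span", id=span_id)
    return span is not None and not span.get_text(strip=True)


# Mimics what urllib returned in the question: the span is there, the text is not.
fetched_html = '<div class="a2 left"><span id="target_0"></span></div>'
# Mimics the saved copy of the page after the browser had filled it in.
saved_html = '<div class="a2 left"><span id="target_0">Data1</span></div>'

print(span_is_empty(fetched_html, "target_0"))
print(span_is_empty(saved_html, "target_0"))
```

If the check is true for the fetched HTML but false for a copy saved from the browser, the text is most likely being filled in by a script after the page loads.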
