I need to scrape the text that appears before and after a script tag.
HTML:
<div class="card-body">
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">EUR/USD signal</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="timeago fw-normal small" datetime="1656687480000" timeago-id="10">1 day ago</span>
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
From
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
<div class="signal-row signal-status signal-color">
Filled
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Sold at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGMP');</script>1.0407
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Bought at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGML');</script>1.0408
</div>
I need to extract UTC, +05:30 and the other details that sit around the script tags inside the span, e.g. <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
I tried next_sibling, but it returns nothing. I also tried lxml's etree with XPath, which returns nothing either.
I tried using lxml etree:
dom = etree.HTML(str(soup))
t = dom.xpath("//div[@class='ms-auto signal-value signal-color']/span/script/following-sibling::text()")
for i in t:
    print(i)
Using next_siblings:
l = soup.find('script').next_siblings
Expected Output :
UTC +05:30
20:28
Simply call .text or the get_text() method on the element; the contents of the script tag are ignored.
soup.select_one('.card-body span').parent.get_text(' ', strip=True)
Note: this assumes the HTML is generated dynamically, so the markup you actually receive may differ from what is shown in your question.
Example
This selects every <span> inside .card-body and iterates over the ResultSet, printing the text of each span's parent.
from bs4 import BeautifulSoup
html='''
<div class="card-body">
<div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('.card-body span'):
    print(e.parent.get_text(' ', strip=True))
Output
UTC +05:30 20:28
UTC +05:30 23:28
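If the time zone and the time are needed as separate values, as in the expected output of the question, the direct text children of each <span> and the text node after the following <script> can be read individually. This is only a minimal sketch, assuming the markup from the question and the html.parser backend; the names rows, tz and time_text are illustrative and not from the original answer.

```python
from bs4 import BeautifulSoup

html = '''
<div class="card-body">
  <div class="ms-auto signal-value signal-color xh-highlight">
    <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
    <script class="">w(hhmm(1656687480));</script>20:28
  </div>
  <div class="ms-auto signal-value signal-color xh-highlight">
    <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
    <script class="">w(hhmm(1656698280));</script>23:28
  </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

rows = []
for span in soup.select(".card-body span.fw-normal"):
    # Direct text children of the span only: "UTC" and "+05:30".
    # The JavaScript source sits inside <script>, so it is not a direct child.
    tz = " ".join(s.strip() for s in span.find_all(string=True, recursive=False))
    # The time is the text node that follows the <script> after the span.
    script = span.find_next_sibling("script")
    time_text = script.next_sibling.strip() if script and script.next_sibling else ""
    rows.append((tz, time_text))

for tz, time_text in rows:
    print(tz)
    print(time_text)
```

Reading only the direct text children avoids relying on how a particular BeautifulSoup version treats script contents in get_text().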
I'm struggling with scraping a few pages. It happens when the page structure involves a lot of nested divs.
Here is the page's HTML:
<div>
<section class="ui-accordion-header ui-state-default ui-corner-all ui-accordion-icons" role="tab" id="ui-id-1" aria-controls="ui-id-2" aria-selected="false" aria-expanded="false" tabindex="0"><span class="ui-accordion-header-icon ui-icon ui-icon-triangle-1-e"></span>
<div class="detail-avocat">
<div class="nom-avocat">Me <span class="avocat_name">NAME </span></div>
<div class="type-avocat">Avocat postulant au Tribunal Judiciaire</div>
</div>
<div class="more-info">Plus d'informations</div>
</section>
<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom" style="display: none;" id="ui-id-2" aria-labelledby="ui-id-1" role="tabpanel" aria-hidden="true">
<div class="details">
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Structure :</span>
<div>
<p>Cabinet individuel NAME</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Adresse :</span>
<div>
<p>21 rue Belle Isle 57000 VILLE</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Mail :</span>
<div>
<p>cabinet@mail.fr</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Tél :</span>
<div>
<p>Telnum</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Fax :</span>
<div>
<p> </p>
</div>
</div>
</div>
<div class="contact-avocat"> Contacter </div>
</div>
</div>
</div>
And here is my python code:
divtel = self.driver.find_elements(by=By.XPATH,
    value='//div[@class="detail-avocat-content overflow-h"]/div/p')
for p in divtel:
    print(p.text)
It doesn't print anything. On other, similar pages it prints the text, but in this case it doesn't, although there is text in the nested span and div/p. Do you know why?
How can I resolve my problem, please?
Thank you.
The .text property only returns text when the web element is visible on the page. If the element is hidden (here the enclosing container has style="display: none;"), use .get_attribute('innerText'), .get_attribute('textContent') or .get_attribute('innerHTML') instead. So, for example, change
print(p.text)
to
print(p.get_attribute('innerText'))
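A small helper can make the fallback explicit. The sketch below is an assumption, not part of Selenium: element_text is a hypothetical name, and it relies only on the .text property and the get_attribute() method that Selenium WebElements provide.

```python
def element_text(element):
    """Return the visible text of a WebElement, or its textContent if hidden."""
    visible = element.text
    if visible:
        return visible.strip()
    # Hidden elements (e.g. inside a display: none container) give "" for
    # .text, so fall back to the DOM property, which ignores visibility.
    hidden = element.get_attribute("textContent")
    return (hidden or "").strip()
```

In the loop from the question this would be used as print(element_text(p)).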
The HTML is shown below. If the span value is less than 20%, I want to remove the span together with its ancestors, up to and including the <div class="action"> parent, but nothing above it.
So for example:
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
From the HTML above, only this code should be removed:
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
So what should be left is:
<div class="item">
<div class="info">
</div>
</div>
This is my current python code:
items = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='content-name']")))
for item in items:
    percentage_text = re.findall(r"\d+", item.text)[0]
    if int(percentage_text) <= 20:
        driver.execute_script("arguments[0].remove();", item)
But it only removes the span and not its parent.
Here is the full HTML. I think removing the elements needs JavaScript, but I am very new to JavaScript; I researched for more than two hours and still can't find a solution. Thank you very much.
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 95% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 32% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 15% </span>
</div>
</div>
</div>
</div>
Go up two levels to the parent of the parent (the span's parent is div.content, and its parent is div.action):
driver.execute_script("arguments[0].parentElement.parentElement.remove();", item)
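Counting parentElement hops breaks as soon as the nesting changes. An alternative sketch uses the DOM method Element.closest() to find the nearest div.action ancestor instead. The names percentage, remove_low_items and REMOVE_ACTION_JS and the threshold parameter are assumptions for illustration; driver and items are the objects from the question. The sketch removes values strictly below the threshold, following the question's wording rather than its <= 20 check.

```python
import re

# JavaScript run in the browser: remove the nearest <div class="action">
# ancestor of the element passed as arguments[0], if there is one.
REMOVE_ACTION_JS = (
    "const action = arguments[0].closest('div.action');"
    "if (action) { action.remove(); }"
)


def percentage(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def remove_low_items(driver, items, threshold=20):
    """Remove the div.action ancestor of every item below threshold percent."""
    removed = 0
    for item in items:
        value = percentage(item.text)
        if value is not None and value < threshold:
            driver.execute_script(REMOVE_ACTION_JS, item)
            removed += 1
    return removed
```

With the question's variables this would be called as remove_low_items(driver, items).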
I'm trying to scrape this page with BeautifulSoup.
https://www.nb.co.za/en/view-book/?id=9780798182539
How do I target specific elements if they don't have a unique class or id?
Is it possible to scrape a div based on the value/text in the sibling div?
For instance, in the code below, how can I get 9780798182539 if the sibling div contains <p>ISBN:</p>?
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
Here is the complete html:
<div class="col-lg-7 col-md-12 col-sm-12 author-details">
<h2>Step by Step: Counting to 50 </h2>
<h5>
Cuberdon
</h5>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Publisher:</p>
</div>
<div class="col-md-9 noPadding">
Human & Rousseau
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Date Released:</p>
</div>
<div class="col-md-9 noPadding">
November 2021
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Price (incl. VAT):</p>
</div>
<div class="col-md-9 noPadding">
R 120.00
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Format:</p>
</div>
<div class="col-md-9 noPadding">
<p>Hard cover, 32pp</p>
</div>
</div>
</div>
You can use :-soup-contains to target the p tag by its text. Wrap it in the :has pseudo-class with the child combinator > to select the div that is the p's direct parent. Then add the adjacent sibling combinator + with a div type selector to move to the next div:
import requests
from bs4 import BeautifulSoup as bs
r = requests.get('http://www.nb.co.za/nb/view-book?id=9780798182539')
soup = bs(r.content, 'lxml')
print(soup.select_one('div:has(> p:-soup-contains("ISBN:")) + div' ).text.strip())
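Here is a working solution so far. Note that find() returns the first div with class col-md-9 noPadding, which is the ISBN only because the ISBN row comes first.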
from bs4 import BeautifulSoup
html = '''
<div class="col-lg-7 col-md-12 col-sm-12 author-details">
<h2>Step by Step: Counting to 50 </h2>
<h5>
Cuberdon
</h5>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>ISBN:</p>
</div>
<div class="col-md-9 noPadding">
9780798182539
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Publisher:</p>
</div>
<div class="col-md-9 noPadding">
Human & Rousseau
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Date Released:</p>
</div>
<div class="col-md-9 noPadding">
November 2021
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Price (incl. VAT):</p>
</div>
<div class="col-md-9 noPadding">
R 120.00
</div>
</div>
<div class="row clearfix">
<div class="col-md-3 noPadding">
<p>Format:</p>
</div>
<div class="col-md-9 noPadding">
<p>Hard cover, 32pp</p>
</div>
</div>
</div>
'''
soup = BeautifulSoup(html, "html.parser")
div_text = soup.find('div', class_="col-md-9 noPadding")
print(div_text.get_text(strip=True))
Output:
9780798182539
You could do a find_all on the main divs with class row clearfix, keep the ones whose text contains ISBN:, and then do a find inside each for the div with class col-md-9 noPadding. As a list comprehension it would look like this:
[i.find('div', class_='col-md-9 noPadding').get_text().strip() for i in soup.find_all('div', class_='row clearfix') if 'ISBN:' in i.get_text()][0]
Output:
9780798182539
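The same row-by-row idea can collect every label/value pair at once, so that any field can be looked up by name. A minimal sketch on a shortened copy of the HTML from the question, assuming each row holds one col-md-3 label div and one col-md-9 value div; the details dictionary is an illustrative name.

```python
from bs4 import BeautifulSoup

html = '''
<div class="author-details">
  <div class="row clearfix">
    <div class="col-md-3 noPadding"><p>ISBN:</p></div>
    <div class="col-md-9 noPadding">
      9780798182539
    </div>
  </div>
  <div class="row clearfix">
    <div class="col-md-3 noPadding"><p>Format:</p></div>
    <div class="col-md-9 noPadding"><p>Hard cover, 32pp</p></div>
  </div>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

details = {}
for row in soup.find_all("div", class_="row clearfix"):
    label_div = row.find("div", class_="col-md-3")
    value_div = row.find("div", class_="col-md-9")
    if label_div is None or value_div is None:
        continue  # skip rows that do not follow the label/value layout
    # "ISBN:" becomes the key "ISBN"
    label = label_div.get_text(strip=True).rstrip(":")
    details[label] = value_div.get_text(strip=True)

print(details["ISBN"])
```

Looking values up by label does not depend on the order of the rows.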
I have written code to fetch LinkedIn profile details, but sometimes the entire HTML is not loaded for some user profiles.
I have already tried the classic wait mechanisms, i.e.
driver.implicitly_wait(10)
time.sleep(10)
element_present = EC.presence_of_element_located((By.CLASS_NAME, '.pv-profile-section__card-item-v2.pv-profile-section.pv-position-entity.ember-view'))
WebDriverWait(driver, 300).until(element_present)
but none of them seem to work.
A snippet of my code:
firstName = urllib.parse.quote(userFirstName)
lastName = urllib.parse.quote(userLastName)
company = urllib.parse.quote(userCompany)
driver.get('https://www.linkedin.com/search/results/people/?company='+company+'&firstName='+firstName+'&lastName='+lastName+'&origin=FACETED_SEARCH')
results = len(driver.find_elements_by_css_selector('.name.actor-name'))
for i in range(1):
    print(i)
    driver.find_elements_by_css_selector('.name.actor-name')[i].click()
    time.sleep(10)
    print(driver.current_url)
    content = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
    driver.implicitly_wait(2)
    soup = BeautifulSoup(content, "html.parser")
    #print(soup)
    companyList = soup.findAll('section',{'class':'pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view'})
    print("Company list length: "+str(len(companyList)))
The code does return the company list for many users, but it simply fails in some cases. I checked those profiles in my browser and the elements used in the code did exist.
Any help or past experience with this would be appreciated. I know solving this will take some effort, so thanks in advance!
P.S.: A part of HTML (the experience part I care about):
<ul class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-no-more">
<li class="pv-entity__position-group-pager pv-profile-section__list-item ember-view" id="ember394"> <section class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view" id="ember396"> <div class="display-flex justify-space-between full-width">
<a class="full-width ember-view" data-control-name="background_details_company" href="/search/results/index/?keywords=Aditya%20Birla%20Direct" id="ember397"> <div class="pv-entity__company-details">
<div class="pv-entity__logo company-logo">
<img alt="Aditya Birla Direct" class="pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-5 lazy-image ember-view" id="ember399"/>
</div>
<div class="pv-entity__company-summary-info">
<h3 class="t-16 t-black t-bold">
<span class="visually-hidden">Company Name</span>
<span>Aditya Birla Direct</span>
</h3>
<h4 class="t-14 t-black t-normal">
<span class="visually-hidden">Total Duration</span>
<span>2 yrs 6 mos</span>
</h4>
</div>
</div>
</a>
<!-- --> </div>
<ul class="pv-entity__position-group mt2 ember-view" id="ember400"><li class="pv-entity__position-group-role-item sortable-item ember-view" id="ember402"> <div class="ember-view" id="ember403"><div class="pv-entity__role-details">
<span class="pv-entity__timeline-node"></span>
<div class="display-flex justify-space-between full-width">
<div class="pv-entity__role-container">
<div class="pv-entity__role-details-container pv-entity__role-details-container--timeline pv-entity__role-details-container--bottom-margin">
<div class="pv-entity__summary-info-v2 pv-entity__summary-info--background-section pv-entity__summary-info-margin-top">
<h3 class="t-14 t-black t-bold">
<span class="visually-hidden">Title</span>
<span>Product Designer</span>
</h3>
<!-- --> <div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>Jun 2018 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">1 yr 5 mos</span>
</h4>
</div>
<h4 class="pv-entity__location t-14 t-black--light t-normal block">
<span class="visually-hidden">Location</span>
<span>Mumbai, Maharashtra, India</span>
</h4>
</div>
<!-- --> </div>
</div>
<!-- --> </div>
</div>
</div>
</li><li class="pv-entity__position-group-role-item sortable-item ember-view" id="ember405"> <div class="ember-view" id="ember406"><div class="pv-entity__role-details">
<span class="pv-entity__timeline-node"></span>
<div class="display-flex justify-space-between full-width">
<div class="pv-entity__role-container">
<div class="pv-entity__role-details-container">
<div class="pv-entity__summary-info-v2 pv-entity__summary-info--background-section pv-entity__summary-info-margin-top">
<h3 class="t-14 t-black t-bold">
<span class="visually-hidden">Title</span>
<span>UI/UX Designer</span>
</h3>
<!-- --> <div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>May 2017 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">2 yrs 6 mos</span>
</h4>
</div>
<h4 class="pv-entity__location t-14 t-black--light t-normal block">
<span class="visually-hidden">Location</span>
<span>Mumbai, Maharashtra, India</span>
</h4>
</div>
<!-- --> </div>
</div>
<!-- --> </div>
</div>
</div>
</li>
</ul>
<!-- --></section>
</li><li class="pv-entity__position-group-pager pv-profile-section__list-item ember-view" id="ember408"> <section class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view" id="1192970710"> <div class="display-flex justify-space-between full-width">
<div class="display-flex flex-column full-width">
<a class="full-width ember-view" data-control-name="background_details_company" href="/search/results/index/?keywords=improove%20technology%20pvt%20ltd" id="ember411"> <div class="pv-entity__logo company-logo">
<img alt="improove technology pvt ltd" class="pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-5 lazy-image ghost-company ember-view" id="ember413"/>
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section">
<h3 class="t-16 t-black t-bold">UI/UX Designer</h3>
<p class="visually-hidden">Company Name</p>
<p class="pv-entity__secondary-title t-14 t-black t-normal">improove technology pvt ltd</p>
<!-- -->
<div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>May 2015 – May 2017</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">2 yrs 1 mo</span>
</h4>
</div>
<h4 class="pv-entity__location t-14 t-black--light t-normal block">
<span class="visually-hidden">Location</span>
<span>Delhi</span>
</h4>
</div>
</a>
<!-- --> </div>
<!-- --> </div>
</section>
</li>
I basically need the company name, role and dates employed.
Based on the updated HTML you posted, it's possible that the section element is fully loading, but its contents are not fully loaded, which could result in companyList being empty as you mentioned.
I would prefer to wait on something more specific than section:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
# Wait on ALL sections to load
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//section[contains(@class, 'pv-profile-section')]")))
# Wait on Company Name labels to load
WebDriverWait(driver, 10).until(EC.presence_of_all_elements_located((By.XPATH, "//*[contains(text(), 'Company Name')]")))
# Get company list (find_elements_by_xpath was removed in Selenium 4)
companyList = driver.find_elements(By.XPATH, "//section[contains(@class, 'pv-profile-section')]")
print(len(companyList))
This code will wait on all section elements to load, and also wait on the Company Name to load -- this might avoid the issue where a section is loaded, but its contents are not fully loaded yet.
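Once a section has loaded, the company name and dates can be read from the label/value pairs in the markup: each value follows an element with class visually-hidden that names it. This is a rough sketch with BeautifulSoup on a trimmed copy of the single-position HTML from the question. labelled_values is a hypothetical helper, and LinkedIn's class names may well have changed since the question was posted.

```python
from bs4 import BeautifulSoup

html = '''
<section class="pv-profile-section__card-item-v2 pv-profile-section pv-position-entity ember-view">
  <h3 class="t-16 t-black t-bold">UI/UX Designer</h3>
  <p class="visually-hidden">Company Name</p>
  <p class="pv-entity__secondary-title t-14 t-black t-normal">improove technology pvt ltd</p>
  <h4 class="pv-entity__date-range t-14 t-black--light t-normal">
    <span class="visually-hidden">Dates Employed</span>
    <span>May 2015 – May 2017</span>
  </h4>
</section>
'''


def labelled_values(section):
    """Map each visually-hidden label to the text of the element after it."""
    values = {}
    for label in section.select(".visually-hidden"):
        value = label.find_next_sibling()  # next tag on the same level
        if value is not None:
            values[label.get_text(strip=True)] = value.get_text(strip=True)
    return values


soup = BeautifulSoup(html, "html.parser")
for section in soup.select("section.pv-position-entity"):
    info = labelled_values(section)
    # In this single-position layout the role is the plain <h3> text.
    role = section.find("h3").get_text(strip=True)
    print(role, info.get("Company Name"), info.get("Dates Employed"))
```

In the grouped layout (several roles under one company) the role sits behind a "Title" label instead, so the same helper would return it under that key.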
I am trying to download a table from HTML that is not in the usual td/tr format and includes images.
The html code looks like this:
<div class="dynamicBottom">
<div class="dynamicLeft">
<div class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<div class="improve_listing_btn ui_button primary small">improve this entry</div>
<h3 class="tabs_header">Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Location</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
</li>
I would like to get the table:
[Location 45 out of fifty points,
Service 45 out of fifty points].
The following code only prints "Location" and "Service" and does not include the rating.
for url in urls:
    r = requests.get(url)
    time.sleep(delayTime)
    soup = BeautifulSoup(r.content, "lxml")
    data17 = soup.findAll('div', {'class': 'dynamicBottom'})
    for item in data17:
        print(item.text)
And the code
data18= soup.find(attrs={'class': 'sprite-rating_s_fill rating_s_fill s45'})
print(data18["alt"] if data18 else "No meta title given")
does not help either: it only prints "45 out of fifty points", so it is not clear which category the rating belongs to. Additionally, the image class ('sprite-rating_s_fill rating_s_fill s45') varies in other tables depending on the rating.
Is there a way to extract the full table?
Or to tell Python to extract the image after a certain word, e.g. "Location"?
Thank you very much for your help!
html = '''<div class="dynamicBottom">
<div class="dynamicLeft">
<div class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<div class="improve_listing_btn ui_button primary small">improve this entry</div>
<h3 class="tabs_header">Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Location</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
</li>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
for div in soup.find_all('div', class_="ratingRow wrap"):
    text = div.text.strip()
    alt = div.find('img').get('alt')
    print(text, alt)
out:
Location 45 out of fifty points
Service 45 out of fifty points
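Because the image class changes with the rating (s45 and so on), it is safer to look up the <img> by tag within each row than by class. The sketch below stores the result as a dictionary and tolerates rows without an image. The second row with no <img> is an invented edge case for illustration, and ratings is an illustrative name.

```python
from bs4 import BeautifulSoup

html = '''
<ul class="barChart"><li>
  <div class="ratingRow wrap">
    <div class="label part "><span class="text">Location</span></div>
    <div class="wrap row part ">
      <span class="rate sprite-rating_s rating_s"><img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points"></span>
    </div>
  </div>
  <div class="ratingRow wrap">
    <div class="label part "><span class="text">Service</span></div>
    <div class="wrap row part "></div>
  </div>
</li></ul>
'''

soup = BeautifulSoup(html, "html.parser")

ratings = {}
for row in soup.select("div.ratingRow"):
    label = row.select_one("span.text")
    img = row.find("img")
    if label is None:
        continue
    # Rows without an <img> are kept with None instead of raising AttributeError.
    ratings[label.get_text(strip=True)] = img.get("alt") if img else None

print(ratings)
```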