Extract table from html including images using Python

I am trying to extract a table from an HTML page. The table is not built from the usual tr/td elements, and it includes images.
The HTML looks like this:
<div class="dynamicBottom">
<div class="dynamicLeft">
<div class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<div class="improve_listing_btn ui_button primary small">improve this entry</div>
<h3 class="tabs_header">Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Location</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
</li>
I would like to get the table:
[Location 45 out of fifty points,
Service 45 out of fifty points].
The following code only prints "Location" and "Service" and does not include the rating.
for url in urls:
    r = requests.get(url)
    time.sleep(delayTime)
    soup = BeautifulSoup(r.content, "lxml")
    data17 = soup.find_all('div', {'class': 'dynamicBottom'})
    for item in data17:
        print(item.text)
And the code
data18= soup.find(attrs={'class': 'sprite-rating_s_fill rating_s_fill s45'})
print(data18["alt"] if data18 else "No meta title given")
does not help either: it only prints "45 out of fifty points", with no indication of which category the rating belongs to. In addition, the image class ('sprite-rating_s_fill rating_s_fill s45') varies between tables depending on the rating.
Is there a way to extract the full table?
Or can I tell Python to extract the image that follows a certain word, e.g. "Location"?
Thank you very much for your help!

html = '''<div class="dynamicBottom">
<div class="dynamicLeft">
<div class="content_block details_block scroll_tabs" data-tab="TABS_DETAILS">
<div class="header_with_improve wrap">
<div class="improve_listing_btn ui_button primary small">improve this entry</div>
<h3 class="tabs_header">Details</h3> </div>
<div class="details_tab">
<div class="table_section">
<div class="row">
<div class="ratingSummary wrap">
<div class="histogramCommon bubbleHistogram wrap">
<div class="colTitle">
Rating
</div>
<ul class="barChart">
<li>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Location</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
<div class="ratingRow wrap">
<div class="label part ">
<span class="text">Service</span>
</div>
<div class="wrap row part ">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s45" src="https://static.tacdn.com/img2/x.gif" alt="45 out of fifty points">
</span>
</div>
</div>
</li>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
# Each ratingRow div holds both the label text and the rating image.
for div in soup.find_all('div', class_="ratingRow wrap"):
    text = div.text.strip()
    alt = div.find('img').get('alt')
    print(text, alt)
Output:
Location 45 out of fifty points
Service 45 out of fifty points
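Since the image class changes with the rating (s45 and so on), a variant that does not depend on the exact class string may be more robust. The following is a minimal sketch, assuming each rating row keeps the ratingRow, text and rating_s_fill class names shown in the question. The trimmed markup, the s30 value in the second row and the extract_ratings helper are made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup; the second row's rating (s30)
# is a hypothetical value used to show that the class suffix may vary.
html = """
<ul class="barChart"><li>
  <div class="ratingRow wrap">
    <div class="label part "><span class="text">Location</span></div>
    <div class="wrap row part ">
      <span class="rate sprite-rating_s rating_s">
        <img class="sprite-rating_s_fill rating_s_fill s45" src="x.gif" alt="45 out of fifty points">
      </span>
    </div>
  </div>
  <div class="ratingRow wrap">
    <div class="label part "><span class="text">Service</span></div>
    <div class="wrap row part ">
      <span class="rate sprite-rating_s rating_s">
        <img class="sprite-rating_s_fill rating_s_fill s30" src="x.gif" alt="30 out of fifty points">
      </span>
    </div>
  </div>
</li></ul>
"""


def extract_ratings(markup):
    """Map each rating label to the alt text of the image in the same row."""
    soup = BeautifulSoup(markup, "html.parser")
    ratings = {}
    for row in soup.select("div.ratingRow"):
        label = row.select_one("span.text")
        # Match on the stable class only, not on the sNN suffix.
        img = row.select_one("img.rating_s_fill")
        if label is None or img is None:
            continue  # skip rows that do not follow the expected layout
        ratings[label.get_text(strip=True)] = img.get("alt", "")
    return ratings


ratings = extract_ratings(html)
for name, alt in ratings.items():
    print(name, alt)
```

Keeping the label and the image lookup inside the same row is what ties each rating to its category.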

Related

BeautifulSoup extract text after tag

I need to scrape the text that appears before and after a script tag.
HTML:
<div class="card-body">
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">EUR/USD signal</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="timeago fw-normal small" datetime="1656687480000" timeago-id="10">1 day ago</span>
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
From
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
<div class="signal-row signal-status signal-color">
Filled
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Sold at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGMP');</script>1.0407
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Bought at
</div>
<div class="ms-auto signal-value signal-color user-select-all">
<script>f('OCKGML');</script>1.0408
</div>
I need to extract "UTC" and "+05:30", along with the other values that sit next to script tags in the HTML, e.g. <span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
I tried next_sibling, but it returns nothing. I also tried lxml etree with XPath, which returns nothing either.
My lxml etree attempt:
dom = etree.HTML(str(soup))
t = dom.xpath("//div[@class='ms-auto signal-value signal-color']/span/script/following-sibling::text()")
for i in t:
    print(i.text)
Using next_siblings:
l = soup.find('script').next_siblings
Expected output:
UTC +05:30
20:28

Simply call .text or get_text() on the parent element; the contents of the script tags are ignored.
soup.select_one('.card-body span').parent.get_text(' ', strip=True)
Note: this assumes the HTML is generated dynamically, so the markup your code receives may differ from what is shown in your question.
Example
This selects every <span> inside .card-body and iterates over the ResultSet, printing the text of each span's parent element.
from bs4 import BeautifulSoup
html='''
<div class="card-body">
<div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
</div>
<div class="d-flex flex-row flex-wrap signal-row">
<div class="signal-title">
Till
</div>
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656698280));</script>23:28
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for e in soup.select('.card-body span'):
    print(e.parent.get_text(' ', strip=True))
Output
UTC +05:30 20:28
UTC +05:30 23:28
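The expected output in the question puts the timezone and the time on separate lines. One way to get that is to read the span and the loose text after it separately. This is only a sketch: the split_signal_value helper is a hypothetical name, and the script tags are removed explicitly so the result does not depend on how the installed bs4 version handles script contents.

```python
from bs4 import BeautifulSoup, NavigableString

html = """
<div class="ms-auto signal-value signal-color xh-highlight">
<span class="fw-normal small">UTC<script>w(tzo());</script>+05:30</span>
<script class="">w(hhmm(1656687480));</script>20:28
</div>
"""


def split_signal_value(div):
    """Return (timezone, clock) for one signal-value div."""
    # Remove <script> tags up front so their source never reaches the output.
    for script in div.find_all("script"):
        script.decompose()
    # Text inside the <span>: "UTC" and "+05:30", joined with a space.
    timezone = div.span.get_text(" ", strip=True)
    # Strings that are direct children of the div, i.e. outside the span.
    loose = [c.strip() for c in div.children if isinstance(c, NavigableString)]
    clock = " ".join(s for s in loose if s)
    return timezone, clock


soup = BeautifulSoup(html, "html.parser")
for div in soup.select("div.signal-value"):
    tz, clock = split_signal_value(div)
    print(tz)
    print(clock)
```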

python selenium, cant retrieve text of xpath

I'm struggling to scrape a few pages. It happens when the page structure involves a lot of nested divs.
Here is the page source:
<div>
<section class="ui-accordion-header ui-state-default ui-corner-all ui-accordion-icons" role="tab" id="ui-id-1" aria-controls="ui-id-2" aria-selected="false" aria-expanded="false" tabindex="0"><span class="ui-accordion-header-icon ui-icon ui-icon-triangle-1-e"></span>
<div class="detail-avocat">
<div class="nom-avocat">Me <span class="avocat_name">NAME </span></div>
<div class="type-avocat">Avocat postulant au Tribunal Judiciaire</div>
</div>
<div class="more-info">Plus d'informations</div>
</section>
<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom" style="display: none;" id="ui-id-2" aria-labelledby="ui-id-1" role="tabpanel" aria-hidden="true">
<div class="details">
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Structure :</span>
<div>
<p>Cabinet individuel NAME</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Adresse :</span>
<div>
<p>21 rue Belle Isle 57000 VILLE</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Mail :</span>
<div>
<p>cabinet@mail.fr</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Tél :</span>
<div>
<p>Telnum</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Fax :</span>
<div>
<p> </p>
</div>
</div>
</div>
<div class="contact-avocat"> Contacter </div>
</div>
</div>
</div>
And here is my Python code:
divtel = self.driver.find_elements(by=By.XPATH,
    value='//div[@class="detail-avocat-content overflow-h"]/div/p')
for p in divtel:
    print(p.text)
It doesn't print anything. With other similar pages it prints the text, but in this case it doesn't, although there is text in the nested span and div/p. Do you know why?
How can I resolve this problem, please?
Thank you.

The .text property only returns text when the web element is visible on the page. If the element is hidden (here the whole panel has style="display: none;"), use .get_attribute('innerText'), .get_attribute('textContent') or .get_attribute('innerHTML') instead. So, for example, change
print(p.text)
to
print(p.get_attribute('innerText'))

Remove parents of the elements in Python Selenium

The HTML is shown below. If the span value is less than 20%, I want to remove everything from the span up to and including its <div class="action"> ancestor, but nothing above that.
So for example:
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
From the above HTML, only this code should be removed:
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
So what should be left is:
<div class="item">
<div class="info">
</div>
</div>
This is my current Python code:
items = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='content-name']")))
for item in items:
    percentage_text = re.findall(r"\d+", item.text)[0]
    if int(percentage_text) <= 20:
        driver.execute_script("arguments[0].remove();", item)
But it only removes the span, not its parents.
Here is the full HTML. I think it needs JavaScript to remove the elements, but I am very new to JavaScript; I researched for more than 2 hours and still can't find a solution. Thank you very much.
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 95% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 32% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 15% </span>
</div>
</div>
</div>
</div>

Go up to the parent of the span's parent, which is the <div class="action"> element:
driver.execute_script("arguments[0].parentElement.parentElement.remove();", item)
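Counting parentElement hops breaks if the nesting depth changes. A possible alternative is the standard DOM method Element.closest(), which walks up to the nearest ancestor matching a selector. The sketch below assumes the target is always a div with class action. The names REMOVE_ACTION_JS, parse_percentage and remove_low_items are hypothetical, and the <= comparison follows the question's code rather than its "less than 20%" wording.

```python
import re

# JavaScript run in the browser: find the nearest <div class="action">
# ancestor of the span and remove it, if there is one.
REMOVE_ACTION_JS = (
    "const a = arguments[0].closest('div.action');"
    "if (a) { a.remove(); }"
)


def parse_percentage(text):
    """Return the first integer found in text, or None if there is none."""
    match = re.search(r"\d+", text)
    return int(match.group()) if match else None


def remove_low_items(driver, items, threshold=20):
    """Remove the action block of every span whose value is <= threshold."""
    removed = 0
    for item in items:
        value = parse_percentage(item.get_attribute("textContent") or "")
        if value is not None and value <= threshold:
            driver.execute_script(REMOVE_ACTION_JS, item)
            removed += 1
    return removed
```

With a real driver, remove_low_items(driver, items) would be called on the list returned by the WebDriverWait in the question.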

How to scrape data from a website with same div class names with beautifulsoup?

I am a beginner in Python and web scraping. I have been scraping data and images successfully for 3 months and just got my first freelance job. This time I am having a hard time because the data I am after has the same div class name as other elements, and I can't figure out how to select it specifically.
The parsed HTML is below:
<div class="stage-star-main-aside">
<ul class="star-characteristics">
<li class="row is-copy is-bold">
<div class="gr-6">
<span class="is-copy is-std">Country</span>
</div>
<div class="gr-6">
United States
</div>
</li>
<li class="row is-copy is-bold">
<div class="gr-6">
<span class="is-copy is-std">Eye color</span>
</div>
<div class="gr-6">
blue
</div>
</li>
<li class="row is-copy is-bold">
<div class="gr-6">
<span class="is-copy is-std">Hair color</span>
</div>
<div class="gr-6">
blonde
</div>
</li>
<li class="row is-copy is-bold">
<div class="gr-6">
<span class="is-copy is-std">Height</span>
</div>
<div class="gr-6">
<span class="is-copy is-std is-muted">173.0 cm (5'8")</span>
</div>
</li>
<li class="row is-copy is-bold">
<div class="gr-6">
<span class="is-copy is-std">Weight</span>
</div>
<div class="gr-6">
<span class="is-copy is-std is-muted">58 kg (128 lbs)</span>
</div>
</li>
<li class="row is-copy is-bold">
<div class="gr-6">
<span class="is-copy is-std">BMI</span>
</div>
<div class="gr-6">
<span class="is-copy is-std is-muted">19.0 (normal)</span>
</div>
</li>
<a class="add-to-wrapper tc-add-to is-disabled is-small has-text" data-entity-id='{"star_id":"1991"}' data-hover-text="Remove from favorites" data-modal-text="Hot, hot hot! 🙂 If you would like to save this hot pornstar for later, please log in or" data-route="likeStar"
data-type="favourites" href="#" tabindex="-1">
<i class="i-fav i-anim"></i>
<span>
<span class="is-bold">
<span class="is-default-text">
Add to favorites
</span>
<span class="is-added-text">
Is favorite
</span>
</span>
<span class="is-regular">
</span>
</span>
</a>
</div>
I am trying to get the country, height, weight and hair color, but as you can see, they all have the same div class="gr-6". With the code below I get the HTML, but how do I scrape that specific data from it?
import requests
from bs4 import BeautifulSoup
url = 'https://egeniotik.com/en/funnystar/ann-ann'
response = requests.get(url)
soup = BeautifulSoup(response.text,'html.parser')
tags = soup.find_all("div", attrs={'class': 'stage-star-main-aside'})
tagsec= tags.find_all("li", attrs={'class': 'row is-copy is-bold'})
On the tagsec line I get the following error:
AttributeError: ResultSet object has no attribute 'find_all'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?

Loop through the rows. The attribute name is in the first div and the value is in the second div.
rows = soup.select(".stage-star-main-aside li.row")
for row in rows:
    divs = row.find_all("div", class_="gr-6")
    attr_name = divs[0].get_text().strip()
    attr_value = divs[1].get_text().strip()
    print(f"{attr_name} = {attr_value}")
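To pick out only some fields, the rows can be collected into a dictionary first. This is a minimal sketch on a trimmed copy of the question's HTML. It also uses find() rather than find_all() for the outer container, which avoids the ResultSet error from the question. The details and wanted names are illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (two rows only).
html = """
<div class="stage-star-main-aside">
<ul class="star-characteristics">
<li class="row is-copy is-bold">
<div class="gr-6"><span class="is-copy is-std">Country</span></div>
<div class="gr-6">
United States
</div>
</li>
<li class="row is-copy is-bold">
<div class="gr-6"><span class="is-copy is-std">Height</span></div>
<div class="gr-6"><span class="is-copy is-std is-muted">173.0 cm (5'8")</span></div>
</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag, so find_all() can be called on the result.
aside = soup.find("div", class_="stage-star-main-aside")

details = {}
for row in aside.find_all("li", class_="row"):
    cells = row.find_all("div", class_="gr-6")
    if len(cells) < 2:
        continue  # not a name/value row
    details[cells[0].get_text(strip=True)] = cells[1].get_text(strip=True)

# Look up only the fields of interest; missing ones come back as None.
wanted = ["Country", "Height", "Weight"]
for key in wanted:
    print(key, "=", details.get(key))
```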

Extracting HTML content with Python using Beautiful Soup

Hi, I am using the Beautiful Soup library to parse content from an HTML page.
I use the following script to get to the part of the page I want:
review_list = soup.find(class_="review_list_score_breakdown_right")
<span class=" review_list_score_breakdown_right">
<ul class="review_score_breakdown_list list_tighten clearfix" data-et-view="bLTQHcXJVNRCSPOMcAQJO:1 bLTQHcXJVNRCSPOMcAQJO:3 " id="review_list_score_breakdown">
<li class="clearfix one_col" data-question="hotel_clean">
<p class="review_score_name">
Cleanliness
</p>
<div class="score_bar">
<div class="score_bar_value" data-score="100" style="width: 100%;">
</div>
</div>
<p class="review_score_value">
10
</p>
</li>
<li class="clearfix one_col" data-question="hotel_comfort">
<p class="review_score_name">
Comfort
</p>
<div class="score_bar">
<div class="score_bar_value" data-score="100" style="width: 100%;">
</div>
</div>
<p class="review_score_value">
10
</p>
</li>
<li class="clearfix one_col" data-question="hotel_services">
<p class="review_score_name">
Facilities
</p>
<div class="score_bar">
<div class="score_bar_value" data-score="100" style="width: 100%;">
</div>
</div>
<p class="review_score_value">
10
</p>
</li>
<li class="clearfix one_col" data-question="hotel_staff">
<p class="review_score_name">
Staff
</p>
<div class="score_bar">
<div class="score_bar_value" data-score="100" style="width: 100%;">
</div>
</div>
<p class="review_score_value">
10
</p>
</li>
<li class="clearfix one_col" data-question="hotel_value">
<p class="review_score_name">
Value for money
</p>
<div class="score_bar">
<div class="score_bar_value" data-score="100" style="width: 100%;">
</div>
</div>
<p class="review_score_value">
10
</p>
</li>
<li class="clearfix one_col" data-question="hotel_wifi">
<p class="review_score_name">
Free WiFi
</p>
<div class="score_bar">
<div class="score_bar_value" data-score="100" style="width: 100%;">
</div>
</div>
<p class="review_score_value">
10
</p>
</li>
<li class="clearfix one_col" data-question="hotel_location">
<p class="review_score_name">
Location
</p>
<div class="score_bar">
<div class="score_bar_value" data-score="100" style="width: 100%;">
</div>
</div>
<p class="review_score_value">
10
</p>
</li>
</ul>
</span>
I need to extract the score for a given data-question attribute. For example, to get the hotel comfort score, I need to access data-question="hotel_confort". I've tried the find() function but it doesn't work.

There is no hotel_confort value in your HTML; the attribute is spelled hotel_comfort.
review = soup.find(class_="review_list_score_breakdown_right")
hotel = review.find(attrs={"data-question" : "hotel_comfort"})
This code returns
<li class="clearfix one_col" data-question="hotel_comfort"> ..... </li>

I think what you need is a find query on attrs.
Your question is similar to "Extracting an attribute value with beautifulsoup".
I will make it a bit more specific to your case.
review = soup.find(class_="review_list_score_breakdown_right")
input = review.find(attrs={"data-question" : "hotel_comfort"})
output = input.find(class_="review_score_value").get_text(strip=True)
It's been a while since I used bs4, so please debug the code.
Edit:
Here's some working code based on your example string:
review = soup.find('span', {'class' : "review_list_score_breakdown_right"})
input = review.find_all(attrs={"data-question": "hotel_comfort"})
print(input)  # prints the matching HTML, which you can drill into further
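Going one step further from the matching li to the score itself could look like the sketch below. It assumes the score is always in the p with class review_score_value, as in the question's HTML. The score_for and all_scores helpers are hypothetical names, and the markup is a trimmed copy with two entries.

```python
from bs4 import BeautifulSoup

html = """
<span class=" review_list_score_breakdown_right">
<ul class="review_score_breakdown_list" id="review_list_score_breakdown">
<li class="clearfix one_col" data-question="hotel_clean">
<p class="review_score_name">
Cleanliness
</p>
<p class="review_score_value">
10
</p>
</li>
<li class="clearfix one_col" data-question="hotel_comfort">
<p class="review_score_name">
Comfort
</p>
<p class="review_score_value">
10
</p>
</li>
</ul>
</span>
"""


def score_for(soup, question):
    """Return the score text for one data-question value, or None."""
    item = soup.find("li", attrs={"data-question": question})
    if item is None:
        return None  # e.g. a misspelled key such as "hotel_confort"
    value = item.find("p", class_="review_score_value")
    return value.get_text(strip=True) if value else None


def all_scores(soup):
    """Map every data-question value to its score text."""
    scores = {}
    # attrs={"data-question": True} matches any <li> that has the attribute.
    for item in soup.find_all("li", attrs={"data-question": True}):
        scores[item["data-question"]] = score_for(soup, item["data-question"])
    return scores


soup = BeautifulSoup(html, "html.parser")
print(score_for(soup, "hotel_comfort"))
print(all_scores(soup))
```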
