Web Scraping with BeautifulSoup4 - python

I have the HTML below. I want to extract all of the times from the page and store them in a list variable. How can I do that?
<div class=panchang-box-secondary-header>
<div class="list-wrapper pl-2">
<div class="list-style-thumbnail list-layout-horizontal">
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्योदय</span>
<span class="d-block b">5:31 AM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्यास्त</span>
<span class="d-block b">7:24 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रोदय</span>
<span class="d-block b">10:05 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रास्त</span>
<span class="d-block b">9:12 AM</span>
</div>
</div>
</div>

Try using this:
from bs4 import BeautifulSoup
a = '''<div class=panchang-box-secondary-header>
<div class="list-wrapper pl-2">
<div class="list-style-thumbnail list-layout-horizontal">
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्योदय</span>
<span class="d-block b">5:31 AM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-sunset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">सूर्यास्त</span>
<span class="d-block b">7:24 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonrise"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रोदय</span>
<span class="d-block b">10:05 PM</span>
</div>
</div>
</div>
<div class="list-item-outer py-2">
<div class="d-flex w-100 align-items-center">
<span class="icon-sprite icon-sprite-moonset"></span>
<div class=flex-grow-1>
<span class="d-block t-sm">चन्द्रास्त</span>
<span class="d-block b">9:12 AM</span>
</div>
</div>
</div>'''
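soup = BeautifulSoup(a, 'html.parser')
# Select every element that carries both classes "d-block" and "b".
time_tags = soup.select('.d-block.b')
# Collect the text of each matched <span> into a list.
times = [tag.text for tag in time_tags]
print(times)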
Output:
['5:31 AM', '7:24 PM', '10:05 PM', '9:12 AM']

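Just select the elements whose class is "d-block b" and put their text wherever you want.

times = [tag.text for tag in soup.find_all(class_="d-block b")]
find_all returns a list of tags, so .text has to be read from each tag rather than from the result itself. This collects every time in the page source and stores the list in the variable times.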
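If the extracted strings need to be sorted or compared later, they can be converted into time objects. The following is only a minimal sketch: the helper name parse_clock is hypothetical, the list is copied from the output shown above, and it assumes the default (English) locale so that the AM/PM marker is recognised.

```python
from datetime import datetime

# Strings as printed by the answer above.
times = ['5:31 AM', '7:24 PM', '10:05 PM', '9:12 AM']


def parse_clock(text):
    """Turn a string such as '5:31 AM' into a datetime.time object."""
    # %I = hour on a 12-hour clock, %M = minutes, %p = AM/PM marker.
    return datetime.strptime(text.strip(), "%I:%M %p").time()


parsed = [parse_clock(t) for t in times]
print(sorted(parsed))
```

Keeping the original strings in one list and the parsed values in another makes it easy to fall back to the raw text if a page ever uses a different time format.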

Related

python selenium, cant retrieve text of xpath

I'm struggling with scraping a few pages ... it happens when the page structure contains a lot of nested divs.
Here is the page markup:
<div>
<section class="ui-accordion-header ui-state-default ui-corner-all ui-accordion-icons" role="tab" id="ui-id-1" aria-controls="ui-id-2" aria-selected="false" aria-expanded="false" tabindex="0"><span class="ui-accordion-header-icon ui-icon ui-icon-triangle-1-e"></span>
<div class="detail-avocat">
<div class="nom-avocat">Me <span class="avocat_name">NAME </span></div>
<div class="type-avocat">Avocat postulant au Tribunal Judiciaire</div>
</div>
<div class="more-info">Plus d'informations</div>
</section>
<div class="ui-accordion-content ui-helper-reset ui-widget-content ui-corner-bottom" style="display: none;" id="ui-id-2" aria-labelledby="ui-id-1" role="tabpanel" aria-hidden="true">
<div class="details">
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Structure :</span>
<div>
<p>Cabinet individuel NAME</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Adresse :</span>
<div>
<p>21 rue Belle Isle 57000 VILLE</p>
</div>
</div>
</div>
<div class="detail-avocat-row ">
<div class="detail-avocat-content overflow-h">
<span>Mail :</span>
<div>
<p>cabinet@mail.fr</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Tél :</span>
<div>
<p>Telnum</p>
</div>
</div>
</div>
<div class="detail-avocat-row">
<div class="detail-avocat-content overflow-h">
<span>Fax :</span>
<div>
<p> </p>
</div>
</div>
</div>
<div class="contact-avocat"> Contacter </div>
</div>
</div>
</div>
And here is my python code:
divtel = self.driver.find_elements(by=By.XPATH,
                                   value=f'//div[@class="detail-avocat-content overflow-h"]/div/p')  #div[@class="detail-avocat-content overflow-h"]')
for p in divtel:
    print(p.text)
It doesn't print anything. On other, similar pages it prints the text, but in this case it doesn't, although there is text in the nested span and div/p. Do you know why?
How can I resolve this problem, please?
Thank you.
The .text property only returns text when the web element is visible on the page. Here the accordion panel that contains the paragraphs has style="display: none;", so they are hidden. For a hidden element, use .get_attribute('innerText'), .get_attribute('textContent') or .get_attribute('innerHTML') instead. So, for example, change
print(p.text)
to
print(p.get_attribute('innerText'))

Remove parents of the elements in Python Selenium

The HTML is below. If the span value is less than 20%, I want to remove everything from the span up to and including its <div class="action"> ancestor, and nothing above that.
So for example:
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
From the above HTML, only this code should be removed:
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
So what should be left is:
<div class="item">
<div class="info">
</div>
</div>
This is my current python code:
items = WebDriverWait(driver, 10).until(EC.visibility_of_all_elements_located((By.XPATH, "//span[@class='content-name']")))
for item in items:
    percentage_text = re.findall(r"\d+", item.text)[0]
    if int(percentage_text) <= 20:
        driver.execute_script("arguments[0].remove();", item)
But it only removes the span and not its parents.
Here is the full HTML. I think it needs JavaScript to remove the elements, but I am very new to JavaScript. I researched for more than two hours and still can't find a solution. Thank you very much.
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 5% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 95% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 32% </span>
</div>
</div>
</div>
</div>
<div class="item">
<div class="info">
<div class="action">
<div class="content">
<span class="content-name"> 15% </span>
</div>
</div>
</div>
</div>
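Go up to the parent of the span's parent and remove that. The span's parent is <div class="content">, and that element's parent is <div class="action">:
driver.execute_script("arguments[0].parentElement.parentElement.remove();", item)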
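As a possible variation, the script can look up the ancestor by selector with the DOM method closest() instead of counting parentElement hops, and the percentage check can be moved into a small helper. This is only a sketch: the names REMOVE_ACTION_JS and is_below_threshold are hypothetical, the strict "less than 20" comparison follows the wording of the question (the question's code uses <= 20), and the driver call is shown as a comment because it needs a live browser.

```python
import re

# JavaScript for execute_script: closest() walks up from the span to the
# nearest ancestor matching the selector; remove() then deletes that subtree.
REMOVE_ACTION_JS = (
    "const box = arguments[0].closest('div.action');"
    "if (box) { box.remove(); }"
)


def is_below_threshold(text, threshold=20):
    """Return True if text holds a percentage strictly below threshold."""
    match = re.search(r"\d+", text)
    if match is None:
        # No digits at all: leave the element alone instead of raising.
        return False
    return int(match.group()) < threshold


# Sample values taken from the HTML in the question.
for sample in [" 5% ", " 95% ", " 32% ", " 15% "]:
    print(repr(sample), is_below_threshold(sample))

# Inside the loop from the question, the call would look like this (not run here):
#     if is_below_threshold(item.text):
#         driver.execute_script(REMOVE_ACTION_JS, item)
```

Selecting by class keeps the script working if another wrapper element is later added between the span and the action div.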

showing only two out of three cards in my dashboard

I am creating the dashboard view for my CRM. However, when the cards are displayed, only two of the three cards are visible. Can anyone help me with this? Is this a code formatting issue?
The code for the cards is given below.
example.html:
{% extends 'base.html' %}
{% block content %}
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/4.0.0/css/bootstrap.min.css" />
<!-- Begin Page Content -->
<div class="container-fluid">
<!-- Page Heading -->
<div class="d-sm-flex align-items-center justify-content-between mb-4 mt-4">
<h1 class="h3 mb-0 text-gray-800">Welcome to NexCRM</h1>
<i class="fas fa-download fa-sm text-white-50"></i> Generate Report
</div>
<!-- Main Content Here -->
<div class="row">
<!-- Company Card Example -->
<div class="col-xl-3 col-md-6 mb-4">
<div class="card border-left-primary shadow h-100 py-2">
<div class="card-body">
<div class="row no-gutters align-items-center">
<div class="col mr-2">
<div class="text-xs font-weight-bold text-primary text-uppercase mb-1">Companies</div>
<div class="h5 mb-0 font-weight-bold text-gray-800">4,083</div>
</div>
<div class="col-auto">
<i class="fas fa-building fa-2x text-gray-300"></i>
</div>
</div>
</div>
</div>
</div>
<!-- Company Card Example -->
<div class="col-xl-3 col-md-6 mb-4">
<div class="card border-left-primary shadow h-100 py-2">
<div class="card-body">
<div class="row no-gutters align-items-center">
<div class="col mr-2">
<div class="text-xs font-weight-bold text-primary text-uppercase mb-1">Companies</div>
<div class="h5 mb-0 font-weight-bold text-gray-800">4,083</div>
</div>
<div class="col-auto">
<i class="fas fa-building fa-2x text-gray-300"></i>
</div>
</div>
</div>
</div>
</div>
<!-- Contact Card Example -->
<div class="col-xl-3 col-md-6 mb-4">
<div class="card border-left-warning shadow h-100 py-2">
<div class="card-body">
<div class="row no-gutters align-items-center">
<div class="col mr-2">
<div class="text-xs font-weight-bold text-warning text-uppercase mb-1">Contacts</div>
<div class="h5 mb-0 font-weight-bold text-gray-800">18,002</div>
</div>
<div class="col-auto">
<i class="fas fa-comments fa-2x text-gray-300"></i>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- Content Row -->
<div class="row">
<!-- Area Chart -->
<div class="col-xl-8 col-lg-7">
<div class="card shadow mb-4">
<!-- Card Header - Dropdown -->
<div class="card-header py-3 d-flex flex-row align-items-center justify-content-between">
<h6 class="m-0 font-weight-bold text-primary">Earnings Overview</h6>
<div class="dropdown no-arrow">
<a class="dropdown-toggle" href="#" role="button" id="dropdownMenuLink" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
<i class="fas fa-ellipsis-v fa-sm fa-fw text-gray-400"></i>
</a>
<div class="dropdown-menu dropdown-menu-right shadow animated--fade-in" aria-labelledby="dropdownMenuLink">
<div class="dropdown-header">Dropdown Header:</div>
<a class="dropdown-item" href="#">Action</a>
<a class="dropdown-item" href="#">Another action</a>
<div class="dropdown-divider"></div>
<a class="dropdown-item" href="#">Something else here</a>
</div>
</div>
</div>
<!-- Card Body -->
<div class="card-body">
<div class="chart-area">
<canvas id="myAreaChart"></canvas>
</div>
</div>
</div>
</div>
<!-- Pie Chart -->
<div class="col-xl-4 col-lg-5">
<div class="card shadow mb-4">
<!-- Card Header - Dropdown -->
<div class="card-header py-3 d-flex flex-row align-items-center justify-content-between">
<h6 class="m-0 font-weight-bold text-primary">Revenue Sources</h6>
<div class="dropdown no-arrow">
<a class="dropdown-toggle" href="#" role="button" id="dropdownMenuLink" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
<i class="fas fa-ellipsis-v fa-sm fa-fw text-gray-400"></i>
</a>
<div class="dropdown-menu dropdown-menu-right shadow animated--fade-in" aria-labelledby="dropdownMenuLink">
<div class="dropdown-header">Dropdown Header:</div>
<a class="dropdown-item" href="#">Action</a>
<a class="dropdown-item" href="#">Another action</a>
<div class="dropdown-divider"></div>
<a class="dropdown-item" href="#">Something else here</a>
</div>
</div>
</div>
<!-- Card Body -->
<div class="card-body">
<div class="chart-pie pt-4 pb-2">
<canvas id="myPieChart"></canvas>
</div>
<div class="mt-4 text-center small">
<span class="mr-2">
<i class="fas fa-circle text-primary"></i> Direct
</span>
<span class="mr-2">
<i class="fas fa-circle text-success"></i> Social
</span>
<span class="mr-2">
<i class="fas fa-circle text-info"></i> Referral
</span>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- /.container-fluid -->
{% endblock content %}

Get multiple elements with soup

I have the following HTML code. I'm trying to get the "clients" for each specific "date",
but I only get the clients from the first "clients-list" element after each date:
<div class="info">
<div class="left-wrap"><span class="date">DATE-1</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client1</span>
<span class="client" >client2</span>
<span class="client" >client3</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client4</span>
<span class="client" >client5</span>
<span class="client" >client6</span>
</div>
</div>
<div class="info">
<div class="left-wrap"><span class="date" >DATE-2</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client7</span>
<span class="client" >client8</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client9</span>
<span class="client" >client10</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client11</span>
<span class="client" >client12</span>
</div>
</div>
I'm using the following code :
soup = BeautifulSoup(html, 'html.parser')
dates = soup.find_all(class_='date')
for date in dates:
    print(date.text)
    for item in date.find_next(class_='clients-list').find_all(class_='client'):
        print(item.text)
The output I get is:
DATE-1
client1
client2
client3
DATE-2
client7
client8
I tried with find_next_all, but got the same output.
It's a bit tricky, but this gives the expected output. Use find_next_siblings():
from bs4 import BeautifulSoup
html='''<div class="info">
<div class="left-wrap"><span class="date">DATE-1</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client1</span>
<span class="client" >client2</span>
<span class="client" >client3</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client4</span>
<span class="client" >client5</span>
<span class="client" >client6</span>
</div>
</div>
<div class="info">
<div class="left-wrap"><span class="date" >DATE-2</span></div>
</div>
<div class="clients-list">
<div>
<span class="client" >client7</span>
<span class="client" >client8</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client9</span>
<span class="client" >client10</span>
</div>
</div>
<div class="clients-list">
<div>
<span class="client" >client11</span>
<span class="client" >client12</span>
</div>
</div>'''
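soup = BeautifulSoup(html, 'html.parser')
dates = soup.find_all(class_='date')
for date in dates:
    print(date.text)
    # date.parent.parent is the enclosing <div class="info">; look at every
    # "clients-list" sibling that comes after it.
    for item in date.parent.parent.find_next_siblings(class_='clients-list'):
        # Keep the list only if its nearest preceding "info" block is this date.
        if item.find_previous_sibling(class_='info').find_next(class_='date').text == date.text:
            for client in item.find_all(class_='client'):
                print(client.text)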
Output:
DATE-1
client1
client2
client3
client4
client5
client6
DATE-2
client7
client8
client9
client10
client11
client12
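The grouping can also be done in a single pass: walk the document in order, remember the most recent date, and attach each client to it. The sketch below illustrates that idea with the standard library's html.parser rather than BeautifulSoup, so it is a swapped-in technique and not the answer's method. The class name DateClientParser and the shortened sample markup are hypothetical, and it assumes the date and client spans contain plain text with no nested spans.

```python
from html.parser import HTMLParser


class DateClientParser(HTMLParser):
    """Collect the text of 'client' spans under the most recent 'date' span."""

    def __init__(self):
        super().__init__()
        self.groups = {}      # date text -> list of client names
        self._current = None  # most recent date seen so far
        self._capture = None  # 'date', 'client', or None

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "span" and "date" in classes:
            self._capture = "date"
        elif tag == "span" and "client" in classes:
            self._capture = "client"

    def handle_endtag(self, tag):
        if tag == "span":
            self._capture = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._capture is None:
            return
        if self._capture == "date":
            # A new date starts a new group.
            self._current = text
            self.groups.setdefault(text, [])
        elif self._current is not None:
            # A client belongs to the last date that appeared before it.
            self.groups[self._current].append(text)


# Shortened sample with the same structure as the HTML in the question.
html = '''<div class="info"><div class="left-wrap"><span class="date">DATE-1</span></div></div>
<div class="clients-list"><div><span class="client" >client1</span></div></div>
<div class="clients-list"><div><span class="client" >client2</span></div></div>
<div class="info"><div class="left-wrap"><span class="date" >DATE-2</span></div></div>
<div class="clients-list"><div><span class="client" >client3</span></div></div>'''

parser = DateClientParser()
parser.feed(html)
parser.close()
for date, clients in parser.groups.items():
    print(date, clients)
```

Because each element is visited once, this avoids re-scanning all later siblings for every date, which the nested loops in the answer do.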

Upload a file using selenium(python) , the input element being inside a iframe

Here is the outer HTML:
<div class="container">
<div class="header">
<div id="logo"></div>
<div class="feedback_wrap">
<span id="rating-ask" style="display: inline-block;">
<span class="ext_text">Rate us</span>
<br>
<span class="stars">
<a class="star" data-pos="1">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="2">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="3">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="4">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
<a class="star" data-pos="5">
<span class="empty" style="display: inline;"></span>
<span class="filled" style="display: none;"></span>
</a>
</span>
</span>
</div>
<div class="device_toggler_wrap">
<div class="device_toggler_wrap2">
<span class="phone_icon"></span>
<div class="device_toggler">
<label class="switch">
<input type="checkbox" id="deviceswitch">
<span class="slider round"></span>
</label>
</div>
<span class="tablet_icon"></span>
</div>
</div>
<div class="device_color_switcher">
<div class="device_color_switcher_row">
<div class="color color_1" data-color="#FDD8D5"></div>
<div class="color color_2" data-color="#F1D9BF"></div>
<div class="color color_3" data-color="#E6E7E9"></div>
</div>
<div class="device_color_switcher_row">
<div class="color color_4" data-color="#35383D"></div>
<div class="color color_5" data-color="#000000"></div>
<div class="color color_6" data-color="#873239"></div>
</div>
<div class="device_color_switcher_row">
<div class="color color_7" data-color="#54E9DD"></div>
<div class="color color_8" data-color="#28B3D4"></div>
<div class="color color_9" data-color="#A7A7A7"></div>
</div>
</div>
</div>
<div class="phone_wrap">
<div class="phone" style="border-color: rgb(230, 231, 233); width: 340px; max-height: 650px; border-width: 28px 7px 50px;">
<div class="arrow_left">
<span class="ext_icon"></span>
</div>
<div class="arrow_right">
<span class="ext_icon"></span>
</div>
<div class="label">
<span class="text" style="color: rgb(0, 0, 0);"></span>
</div>
<div class="insta_loading" style="display: none;">
<i class="icon"></i>
<span class="text">Loading...</span>
</div>
<div id="iframe_wrap" style="width: 340px; opacity: 1;"><iframe frameborder="0"></iframe></div>
<div class="circle"></div>
</div>
</div>
</div>
Here is the inner iframe:
https://gist.github.com/ishandutta2007/67c1698b34e58634b9a855051b67fdfd
This is my effort:
iframe = browser.find_element_by_tag_name("iframe")
browser.switch_to.frame(iframe)
try:
    browser.find_element_by_css_selector("form > input").send_keys("/my/path/to/sample.jpg")
except Exception as e:
    print(e)
No exception is thrown, but nothing happens either.
