I have a lot of HTML pages that are formatted differently, but the content that interests me is the same. For example:
page_1.html:
<div class = "block_person">
<div class="persons"><span>Jules Rodrigez</span></div>
<div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
<div class = "block_person">
<div class="persons"><span>James Alfonso</span></div>
<div class="contents"><h1>James is a singer</h1></div>
</div>
page_2.html :
<div class="many_speakers" >
<div class="speakers"><h1>Jules Rodrigez</h1></div>
<div class="summary"><span>Jules Rodrigez is a programmer specialized in data science</span></div>
</div>
<div class="many_speakers" >
<div class="speakers"><h1>Peka Yaya</h1></div>
<div class="summary"><span>Peka is a professor</span></div>
</div>
<div class="many_speakers" >
<div class="speakers"><h1>Cristiano dimaria</h1></div>
<div class="summary"><span>Cristiano is a football player</span></div>
</div>
From an HTML page (page_1 or page_2), I want to get a list of objects like this:
from page_1.html:
[{"person":"Jules Rodrigez","content":"Jules Rodrigez is a programmer specialized in machine learning"},{"person":"James Alfonso","content":"James is a singer"}]
The problem is that each page has a different structure. How can we detect that a block is repeated several times in an HTML page, and therefore contains the requested information? For example, in page_1.html the block that is repeated several times is:
<div class = "block_person">
<div class="persons"><span>Jules Rodrigez</span></div>
<div class="contents"><h1>Jules Rodrigez is a programmer specialized in machine learning</h1></div>
</div>
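One possible heuristic, offered as a minimal sketch rather than a general solution: treat a tag as a candidate block when it has at least two child elements, group the candidates by tag name and class, and take the most frequent group. The names extract_people and child_elements are hypothetical, and the code assumes that the first child of each block holds the person and the second holds the content, which the pages do not guarantee.

```python
from collections import Counter

from bs4 import BeautifulSoup


def child_elements(tag):
    # Direct child tags only; whitespace text nodes are skipped.
    return tag.find_all(True, recursive=False)


def signature(tag):
    # A block is identified by its tag name and its class list.
    return (tag.name, tuple(tag.get("class", [])))


def extract_people(html):
    soup = BeautifulSoup(html, "html.parser")
    # Candidate blocks: any tag with at least two child elements.
    candidates = [t for t in soup.find_all(True) if len(child_elements(t)) >= 2]
    counts = Counter(signature(t) for t in candidates)
    if not counts:
        return []
    best, count = counts.most_common(1)[0]
    if count < 2:
        return []  # nothing is repeated
    people = []
    for block in candidates:
        if signature(block) != best:
            continue
        # Assumption: first child is the person, second is the content.
        first, second = child_elements(block)[:2]
        people.append({
            "person": first.get_text(strip=True),
            "content": second.get_text(strip=True),
        })
    return people
```

On a real page, other repeated structures (menus, footers) may outnumber the blocks of interest, so the winning signature should be checked before trusting the result.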
Is it possible to make a Beautifulsoup Strainer that strains all 'order-cards' from 'container-01' only (without 'order-cards' from other containers)?
Below the sample HTML
<div class="items-container" container-id="container-01">
<div class="order-card">order_01
<div class="item-card">item1</div>
<div class="item-card">item2</div>
<div class="item-card">item3</div>
<div class="item-card">item4</div>
</div>
<div class="order-card">order_02
<div class="item-card">itemA</div>
<div class="item-card">itemB</div>
<div class="item-card">itemC</div>
<div class="item-card">itemD</div>
</div>
<div class="order-card">order_03
<div class="item-card">itemW</div>
<div class="item-card">itemX</div>
<div class="item-card">itemY</div>
<div class="item-card">itemZ</div>
<div class="item-card">item</div>
</div>
</div>
<div class="items-container" container-id="container-02">
<div class="order-card">order_53
<div class="item-card">item_7</div>
<div class="item-card">item_8</div>
</div>
</div>
<div class="items-container" container-id="container-03">
<div class="order-card">order_13
<div class="item-card">item_16</div>
<div class="item-card">item_17</div>
<div class="item-card">item_18</div>
</div>
</div>
What I have so far is the code below, which strains ALL 'order-card' divs from ALL containers.
The goal is to get the details from each 'item-card' that is in 'container-01' only, so 'page_soup' should contain only the 'order-card' items of 'container-01'.
There is no need to parse any container other than 'container-01'.
only_item_cells = SoupStrainer('div', attrs={"class":"order-card"})
page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_item_cells)
Following that is a loop that gets the details from ALL the 'item-cards' in ALL containers. That is NOT wanted, as the output includes items from containers other than 'container-01'.
Running Python 3.8.8, on Anaconda, Win64
Use the appropriate attribute as you have indicated:
only_item_cells = SoupStrainer('div', attrs= {"container-id": "container-01"})
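As a rough sketch of how the strained soup could then be used, the function below parses only the matching container and collects the item texts per order. The function name items_in_container and the choice to key the result by the order's leading text are assumptions made for illustration; they are not part of the question's code.

```python
from bs4 import BeautifulSoup, SoupStrainer


def items_in_container(page_html, container_id):
    """Map each order title to its item texts, for one container only."""
    # Parse only the div whose container-id attribute matches.
    only_container = SoupStrainer('div', attrs={"container-id": container_id})
    page_soup = BeautifulSoup(page_html, 'html.parser', parse_only=only_container)

    orders = {}
    for order in page_soup.find_all('div', class_='order-card'):
        # The first non-empty text inside the card is the order label.
        title = next(order.stripped_strings, '')
        orders[title] = [
            item.get_text(strip=True)
            for item in order.find_all('div', class_='item-card')
        ]
    return orders
```

Because the strainer matches the container div itself, everything nested inside it is kept, while the other containers are never added to the tree.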
I am reading a webpage whose HTML has the following form:
<div class="cart">
<div class="cart-title">
<img src="https://ug3.technion.ac.il/rishum/img/regCourses.png" width="50" height="50" alt="My Courses">
המקצועות שלי
</div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
104134
</div>
<div class="course-name">
אלגברה מודרנית ח
</div>
<div class="course-points">
2.5 נק'
</div>
<div class="entry-group">
קבוצה 11
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG104134" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>12</option><option>13</option><option>21</option><option>22</option><option>23</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div class="entry-spacer"></div><div class="cart-entry">
<div class="course-number">
234118
</div>
<div class="course-name">
ארגון ותכנות המחשב
</div>
<div class="course-points">
3 נק'
</div>
<div class="entry-group">
קבוצה 22
</div><div class="change-group">
שנה קבוצה ל
<select name="UPG234118" onchange="showWaitAndSubmit('regCart')" class="change-group-options">
<option value=""> </option><option>11</option><option>12</option><option>13</option><option>14</option><option>21</option>
</select>
</div><div class="more-actions">
</div>
<div class="clear"></div></div><div>
Now the question is: how can I read the course numbers?
Here's an example of how a course number appears in the webpage:
<div class="course-number">
104134
</div>
and I want to read 104134 in this example.
First, I'd advise using BeautifulSoup to parse the HTML, and then look for the div tags with that class name, like this:
import requests
from bs4 import BeautifulSoup

url = "<your-target>"  # placeholder: replace with the URL of the page
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
# The number is the div's own text, so strip the surrounding whitespace.
numbers = [div.get_text(strip=True) for div in soup.find_all('div', attrs={"class": "course-number"})]
I didn't check this, but if it doesn't work as written, it should point you towards a solution. Check BeautifulSoup's documentation for more information.
Note that reading a nested tag instead (for example i.a.text) raises an AttributeError when the div has no a tag, and the divs shown in the question have none. If you don't trust the structure of the website to stay the same, use a normal for-loop with a try-except, or handle the missing case in some other way.
Beware that the previous method will obtain all div tags with class course-number. You may want only a subset of those, so you should either apply more filtering or traverse the HTML tree first until you get to the root of your target content.
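One way to apply that extra filtering, sketched here under the assumption that the wanted numbers always sit inside a cart-entry div within the cart div (as in the question's HTML), is a CSS selector that spells out the path. The function name course_numbers is hypothetical.

```python
from bs4 import BeautifulSoup


def course_numbers(html):
    """Return course numbers found inside the cart only."""
    soup = BeautifulSoup(html, 'html.parser')
    # Only course-number divs that are direct children of a cart-entry
    # inside the cart; the same class elsewhere on the page is ignored.
    divs = soup.select('div.cart div.cart-entry > div.course-number')
    return [div.get_text(strip=True) for div in divs]
```

The selector can be loosened or tightened as needed; the point is only that the search starts from the root of the target content instead of the whole page.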
I am using BeautifulSoup.
I would like to extract coordinates from a website. The page's HTML looks like this:
<a class="button button--outline link link--emphasis button-full-width js-choose-store" href="/sklep?StoreID=R034" title="Informacje o sklepie">Informacje o sklepie</a>
</div>
</div>
</div>
</div>
<div class="storelist__item ui-expandable js-accordion-store js-store" data-lat="52.225155" data-lng="20.998965" data-icon="/on/demandware.static/Sites-Hebe-Site/-/default/dw081970e9/images/map_markers/hebe.png" data-id="R379" data-coming-soon="false" data-index="81">
<div class="visually-hidden" data-popup-html>
<div class="store-popup">
<div class="store-popup__name text--uppercase">Drogeria Hebe</div>
<div class="store-popup__address">Lindleya 16</div>
<div class="store-popup__city">Warszawa, 02-013</div>
<div class="store-popup__directions">
I need to get 'data-lat' and 'data-lng'.
I had no problem getting the address or name of the store (it was text), using for example:
find("div", {"class": "store-popup__city"}).text
Try something along the lines of:
dat = soup.select_one('div[data-lat]')
print(dat['data-lat'],dat['data-lng'])
Output:
52.225155 20.998965
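If the page lists several stores, the same attribute selector can be applied to all of them. The sketch below assumes every store div carries data-lat, data-lng and data-id attributes, as the one in the question does; the function name store_coordinates is made up for the example.

```python
from bs4 import BeautifulSoup


def store_coordinates(html):
    """Map each store's data-id to a (latitude, longitude) pair."""
    soup = BeautifulSoup(html, 'html.parser')
    coords = {}
    # Match only divs that have both coordinate attributes.
    for store in soup.select('div[data-lat][data-lng]'):
        coords[store.get('data-id')] = (
            float(store['data-lat']),
            float(store['data-lng']),
        )
    return coords
```

Attribute values come back as strings, so they are converted to float here; skip the conversion if the original text is needed.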
I have extracted the HTML content below from a URL using Scrapy:
<div id="data">
<div style="position:absolute">
<h4 class="course">Python</h4>
<h4 class="count">45</h4>
</div>
<h1 style="position:absolute">Available</h1>
<h2 style="position:absolute">Weekend</h1>
<h1 style="position:absolute">Paid Version</h1>
</div>
and using XPath
headerResponse = response.xpath('//div[@id="data"]').extract()
I have loaded it into the headerResponse variable. Now I want to get the values of the h1 and h2 elements. Since they don't have an id or class, how can I extract them?
I want to enter text in a text area. The HTML code is as follows:
<li class="order-unavailable string-type-key string-block clear-fix status- require_changes expanded working autogrowed activity-opened" data-string_status="require_changes" data-master_unit_count="22" data-string_id="2394473">
<div class="key-area clear-fix">
<div class="key-area-container-one clear-fix">
<div class="key-area-container-two">
<div class="col-50 col-left">
<div class="string-controls">
<a class="control-expand-toggle selected" href="#"></a>
<a class="control-activity-toggle " href="#">2</a>
<input class="control-select-string" type="checkbox">
</div>
<div class="master-content">
</div>
<div class="col-50 col-right slave-side-container">
</div>
</div>
</div>
<div class="activity-area clear-fix">
<div class="col-50 col-left">
<div class="col-50 col-right">
<div class="comment-area-inner">
<h3>Add comment</h3>
<div class="comment-container">
<textarea class="comment-content" name="comment_content"></textarea>
</div>
<div class="col-right">
<div class="clear"></div>
<strong>Notification settings</strong>
<p>The people you select will get an email when you post this comment. They'll also be notified by email every time a new comment is added.</p>
<div class="notification-settings">
</div>
</div>
</div>
The textarea's class is comment-content (its name attribute is comment_content).
The xpath of the textarea is:
/html/body/div/section/ol/li[16]/div[2]/div[2]/div/div/textarea
This is the code I am using:
driver.find_element_by_xpath("*//div[@title=\"NOTIFICATION_HOMEPAGE_REDIRECT_CHANGED_SITE\"]/following-sibling::div[2]/div[2]/div/div/textarea").send_keys("Test comment")
Can someone help me frame the path after the sibling step?
div[2]/div[2]/div/div/textarea
The tag before the following-sibling keyword is correct.
Select the textarea by its class and type into it:
driver.find_element_by_xpath("//textarea[@class='comment-content']").send_keys('Test Comment')
To work out XPath expressions, you can use the FirePath plugin for Firefox.