The website I'm trying to scrape looks like this:
<div align="center" class="movietable">
<span style="width:45px;height:47px;vertical-align:middle;display:table-cell;">
<img border="0" src="styles/images/cat/hd.png" alt="HdO">
</span>
</div>
<div align="left" class="movietable">
<span style="padding:0px 5px;width:455px;height:47px;vertical-align:middle;display:table-cell;">
<a data-toggle="tooltip" data-placement="bottom" data-html="true" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
How can I extract:
The text in the <b> tag - in this case GET THIS TEXT
The content of the <font class="small"> tag - in this case Action, Horror, Sci-Fi
.movietable b works great!!
The img src link - in this case https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg
I have no idea how to do this.
Below are CSS selectors you can use:
driver.find_element_by_css_selector('div[align=left] b')
driver.find_element_by_css_selector('div[align=left] .small')
driver.find_element_by_css_selector('a[title]').get_attribute('data-original-title')
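If you want to try these selectors without a browser, here is a minimal sketch. It assumes BeautifulSoup (bs4) is installed and that you have the page source as a string, for example from driver.page_source. The html variable is only a trimmed copy of the snippet from the question, not the live page.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question (assumed to match the live page).
html = """
<div align="left" class="movietable">
<span>
<a data-toggle="tooltip" title="" href="details.php?id=578197" data-original-title="<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>">
<b>GET THIS TEXT</b></a><br><font class="small">[Action, Horror, Sci-Fi]</font>
</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Same CSS selectors as in the Selenium calls above.
title = soup.select_one("div[align=left] b").get_text(strip=True)
genres = soup.select_one("div[align=left] .small").get_text(strip=True)
# The tooltip attribute holds an HTML fragment, not a bare URL.
tooltip = soup.select_one("a[title]")["data-original-title"]

print(title)
print(genres)
print(tooltip)
```

Note that the third value is still the whole <img ...> fragment, so the URL has to be pulled out of it in a separate step.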
You can access all of them using xpath:
1) [parents before this div]/div[2]/span/a/b
2) [parents before this div]/div[2]/span/font
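3) [parents before this div]/div[2]/span/a (the image URL is inside this element's data-original-title attribute; div[1]/span/img is only the category icon)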
[parents before this div] should be /html/body/...
As per the HTML you have shared to extract the items you can use the following solution:
GET THIS TEXT:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']/b").get_attribute("innerHTML")
[Action, Horror, Sci-Fi]:
driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span//font[@class='small']").get_attribute("innerHTML")
https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg:
img_src = driver.find_element_by_xpath("//div[@class='movietable' and @align='left']/span/a[@data-toggle='tooltip' and @data-placement='bottom']").get_attribute("data-original-title")
# split on the single quotes around the URL; splitting on "-" would break URLs that contain a hyphen
src = img_src.split("'")
print(src[1])
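Splitting on quote characters works for this tooltip, but it depends on the exact shape of the attribute value. A slightly more defensive sketch, using only the standard library, parses the attribute value as HTML and reads the src of the first <img>. The names ImgSrcParser and first_img_src are made up for illustration, and the sample string is the data-original-title value from the question.

```python
from html.parser import HTMLParser


class ImgSrcParser(HTMLParser):
    """Collects the src attribute of every <img> tag it sees."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def first_img_src(fragment):
    """Return the src of the first <img> in an HTML fragment, or None."""
    parser = ImgSrcParser()
    parser.feed(fragment)
    parser.close()
    return parser.sources[0] if parser.sources else None


# Value of data-original-title as shown in the question.
tooltip = "<img src='https://trasd.tmdb.org//tqistSlQGQVlvDZHweD.jpg'>"
print(first_img_src(tooltip))
```

In the Selenium code above you would pass img_src to first_img_src instead of the hard-coded string.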
How to get the h1 text "Mini Militia - Doodle Army 2 apk" from this page?
https://www.apkmonk.com/app/com.appsomniacs.da2/
I tried this, but I got None:
title = soup.find('div', class_='col l8 s8')
Please note that there are multiple elements on the page with the classes "hide-on-med-and-down" and "hide-on-large-only".
<div class="col l8 s8">
<h1 class="hide-on-med-and-down" style="font-size: 2em;">Mini Militia - Doodle Army 2 apk</h1>
<h1 class="hide-on-large-only" style="font-size: 1.5em;margin:0px; padding: 0px;">Mini Militia - Doodle Army 2 apk</h1>
<p class="hide-on-small-only" style="font-size: 1.2em;"><span class="item" style="display:none !important;"><span class="fn">Download Mini Militia - Doodle Army 2 APK Latest Version</span></span><b>App Rating</b>: <span class="rating"><span class="average">4.1</span>/<span class="best">5</span></span></p>
<a class="hide-on-med-and-up" onclick="ga('send', 'event', 'nav', 'similar_mob_link', 'com.appsomniacs.da2');" style="font-size: 1.2em;" href="#similar">(<b>Similar Apps</b>)</a>
</div>
This will help you.
title = soup.find("h1", {'class':'hide-on-med-and-down'}).text
What happens?
There are two <h1> elements on the page with the same content. Only the class names differ, to control which size is displayed at different resolutions.
How to fix?
Because the content is identical, just select the first <h1> in the tree. The class names do not matter in this case, because the result is always the same:
title = soup.find('h1').text
Output
Mini Militia - Doodle Army 2 apk
I have a table of search results in a Selenium browser, and each search result is defined in HTML like this:
<div class="item
itemWrapper
ItemPosition1
ItemMonitor
" data-position="1" data-it-name="NAME OF THE ITEM" data-it-category="Category" role="article">
<div class="item-image">
<a href="/some/link/" target="_blank" rel="noopener" class="itemRec">
<img src="https://some.jpg" alt="some name" class="img-responsive">
</a>
</div>
<h2 class="small-text item-title">
Link Text
</h2>
<div class="item-bottom">
<div class="pull-left item-price">
<span>999</span>
</div>
<div class="pull-right detail-link">
<a href="/link/to/detail" title="link title" class="detail">
Detail
</a>
</div>
</div>
</div>
I am able to find all webelements by classname = item.
elements = driver.find_elements_by_class_name("item")
I need to iterate over the elements and get their position, name and price so that I can click on one of them:
for e in elements:
    position=e.get_attribute("data-position").value,
    name=e.get_attribute("data-it-name").value,
    price=e.find_element(By.CLASS_NAME,'item-price').value
but this does not work - get_attribute returns None and find_element does not find any child element.
Can you please advise me how to get the "data-" attributes and child element values correctly?
Whole code using Webbot:
import webbot
from selenium.webdriver.common.by import By
web = webbot.Browser()
web.go_to('www.***.cz')
web.type('bed', classname='header-search-form')
web.press(web.Key.ENTER)
elements = web.find_elements(classname="product-item")
for e in elements:
    name = e.get_attribute("data-it-name").value
    price = e.find_element(By.CLASS_NAME, 'item-price').value
    print(name,price)
    break
classname acts weirdly in webbot. You definitely are not getting a product item there:
In [56]: elements[0].get_attribute('outerHTML')
Out[56]: '\n\n\t\t\t\t\t\t<img src="https://s.favi.cz/static/frontend/_global/images/favi-logo/favi-logo.60d511aff13247dd52f15acf6bdf2af9.svg" role="banner">\n\n\t\t\t\t\t'
Works well with a CSS selector:
In [58]: elements = web.find_elements(css_selector=".product-item")
In [59]: elements[0].get_attribute('outerHTML')
Out[59]: '<div class="\n\t\t\tproduct-item\n\t\t\titemWrapper\n\t\t\tproductItemPosition1\n\t\t\tproductItemMonitor\n\t\t\tproductItemWrapper\n\t\t\tsendProductTransactionWrapper\n\t\t\t\t\t" data-position="1" data-pr-name="Moderní box spring postel Alvares 160x200, bílá" data-tr-id="04d62b60-9d00-4d1b-b03c-2258c50bfdb9" data-pr-category="Postele" data-tr-ob-id="2144583" data-m-ob-id="2345478" role="article">\n\n\t\t<div class="product-image">\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t<img src="https://s.favi.cz/static/images/t/product/300/6f/92/6f922779-bc84-483e-b1cd-ad8522ef0c92.jpg" alt="Moderní box spring postel Alvares 160x200, bílá" class="img-responsive">\n\t\t\t\t\t\t\t\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t<span class="count">485</span>\n\t\t\t\t\t\t\t\n\n\t\t\t\n\t\t\t\n\t\t</div>\n\n\t\t<div class="product-labels stickers-holder">\n\n\t\t\t\t\t\t\t<span class="sticker storage white">\n\t\t\t\t\t<span class="text">Skladem</span>\n\t\t\t\t</span>\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t\t\n\t\t</div>\n\n\t\t<h2 class="small-text product-item-title">\n\t\t\tModerní box spring postel Alvares 160x200, bílá\n\t\t</h2>\n\n\t\t<div class="product-bottom">\n\n\t\t\t<div class="pull-left product-item-price">\n\t\t\t\t<span>15 599 Kč</span>\n\t\t\t\t\t\t\t</div>\n\n\t\t\t<div class="pull-right product-shop-link">\n\t\t\t\t\n\t\t\t\t\tDetail\n\t\t\t\t\n\n\t\t\t\t\n\t\t\t\t\t<strong>Do obchodu</strong>\n\t\t\t\t\n\t\t\t</div>\n\n\t\t</div>\n\n\t\t\n\t</div>'
In [60]: elements[0].get_attribute('data-position')
Out[60]: '1'
In [61]: elements[0].get_attribute('data-pr-name')
Out[61]: 'Moderní box spring postel Alvares 160x200, bílá'
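Once the right elements are selected, note that get_attribute already returns a plain string (or None), so there is no .value to read, and a child element's visible text comes from .text. Below is a rough sketch of the loop from the question with those two changes. It assumes the simplified markup from the question (data-it-name, item-price); the live page shown above uses data-pr-name and product-item-price instead. The function name read_items is made up, and the locator strategy is passed as the literal string "css selector", which is the value of Selenium's By.CSS_SELECTOR, so the function can be exercised without a browser.

```python
CSS_SELECTOR = "css selector"  # the string value of selenium's By.CSS_SELECTOR


def read_items(elements):
    """Return a (position, name, price) tuple for each result element."""
    rows = []
    for element in elements:
        # get_attribute returns a str (or None if the attribute is missing)
        position = element.get_attribute("data-position")
        name = element.get_attribute("data-it-name")
        # search only inside this element, then read the visible text
        price = element.find_element(CSS_SELECTOR, ".item-price span").text
        rows.append((position, name, price.strip()))
    return rows
```

With webbot this would be called as read_items(web.find_elements(css_selector=".item")), adjusting the class and attribute names to whatever the real page uses.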
I am trying to extract some data from a webpage that I've parsed through BeautifulSoup.
<div class="product-data-list data-points-en_GB">
<div class="float-left in-left col-totalNetAssets" style="height: 36px;">
<span class="caption">
Net Assets of Share Class
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 36,636,694,134
</span>
</div>
<div class="float-left in-right col-totalNetAssetsFundLevel">
<span class="caption">
Net Assets of Fund
<span class="as-of-date">
as of 20-Jul-20
</span>
</span>
<span class="data">
USD 37,992,258,237
</span>
</div>
<div class="float-left in-left col-baseCurrencyCode" style="height: 16px;">
<span class="caption">
Fund Base Currency
<span class="as-of-date">
</span>
</span>
<span class="data">
USD
</span>
</div>
I want to capture the information from the 'caption', 'as-of-date' and 'data' spans to create something like:
[('Net Assets of Share Class','20-Jul-20','USD 36,636,694,134'),
('Net Assets of Fund','20-Jul-20','USD 37,992,258,237'),
('Fund Base Currency','','USD')]
This is my code:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for span in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        a = span.find("span", {"class": "caption"}).text
        b = span.find("span", {"class": "as-of-date"}).text
        c = span.find("span", {"class": "data"}).text
        data.append((a,b,c))
However, I only get one result when I look at the list 'data':
[('\nNet Assets of Share Class\n\nas of 20-Jul-20\n\n', '\nas of 20-Jul-20\n', '\nUSD 36,636,694,134\n')]
Aside from needing to strip out the new lines, I know I am missing something to get the script to go through all the other spans but have been staring at the screen for so long, it isn't getting any clearer.
Can anyone help put me out of my misery?!
One solution is to loop over all the div elements under your main container, the div with class "product-data-list data-points-en_GB". That way you get the spans you want for each child div.
data = []
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        for divEle in element.findAll('div'):
            a = divEle.find("span", {"class": "caption"}).text
            b = divEle.find("span", {"class": "as-of-date"}).text
            c = divEle.find("span", {"class": "data"}).text
            data.append((a, b, c))
This makes for a lot of nested loops, so I don't recommend it; I suggest finding a more precise selector. If you have a URL for the page, I could take a look.
I have stumbled upon a solution which seems to do the trick:
data=[]
for tag in soup.findAll("div", {"id": "keyFundFacts"}):
    for element in tag.findAll("div", {"class": "product-data-list data-points-en_GB"}):
        for thing in element.findChildren('div'):
            a = thing.findNext("span", {"class": "caption"}).text
            b = thing.findNext("span", {"class": "as-of-date"}).text
            c = thing.findNext("span", {"class": "data"}).text
            data.append((a,b,c))
It's not perfect but hopefully functional.
thanks all
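For anyone landing here later: the original loop returned one tuple because it iterated over the single container div, and find only ever returns the first match inside it. Here is a tidier sketch that also strips the whitespace and keeps the date out of the caption. It assumes bs4 is installed and uses a trimmed copy of the markup from the question (with the outer div closed); the "as of " prefix handling is my own guess at what the desired output needs.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the markup from the question.
html = """
<div class="product-data-list data-points-en_GB">
  <div class="float-left in-left col-totalNetAssets">
    <span class="caption">Net Assets of Share Class
      <span class="as-of-date">as of 20-Jul-20</span>
    </span>
    <span class="data">USD 36,636,694,134</span>
  </div>
  <div class="float-left in-right col-totalNetAssetsFundLevel">
    <span class="caption">Net Assets of Fund
      <span class="as-of-date">as of 20-Jul-20</span>
    </span>
    <span class="data">USD 37,992,258,237</span>
  </div>
  <div class="float-left in-left col-baseCurrencyCode">
    <span class="caption">Fund Base Currency
      <span class="as-of-date"></span>
    </span>
    <span class="data">USD</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="product-data-list")

data = []
# recursive=False visits only the direct child divs, one per data point
for block in container.find_all("div", recursive=False):
    caption = block.find("span", class_="caption")
    # join only the caption's own text nodes, skipping the nested date span
    label = "".join(caption.find_all(string=True, recursive=False)).strip()
    date = caption.find("span", class_="as-of-date").get_text(strip=True)
    if date.startswith("as of "):
        date = date[len("as of "):]
    value = block.find("span", class_="data").get_text(strip=True)
    data.append((label, date, value))

print(data)
```

On the real page you would start from the keyFundFacts div instead of parsing a string literal.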
I am trying to extract the data-message-id from the following HTML. My goal is to extract the data-message-id for the span containing a particular text and then click on the star_button to star it.
<div class="message_content_header">
<div class="message_content_header_left">
krishnag0902
<span class="ts_tip_float message_current_status ts_tip ts_tip_top ts_tip_multiline ts_tip_delay_150 color_U5TPDSMQQ color_9f69e7 hidden ts_tip_hidden">
<span class="ts_tip_tip ts_tip_inner_current_status">
<span class="ts_tip_multiline_inner">
</span>
</span>
</span>
<i class="copy_only">[</i>4:34 PM<i class="copy_only">]</i><span class="ts_tip_tip"><span class="ts_tip_multiline_inner">Yesterday at 4:34:07 PM</span></span>
<span class="message_star_holder">
Star this message
</div>
</div>
<span class="message_body">hoho<span class="constrain_triple_clicks"></span></span>
<div class="rxn_panel rxns_key_message-1498084447_119862-C5UGEFBS9"></div>
<i class="copy_only"><br></i>
<span id="msg_1498084447_119862_label" class="message_aria_label hidden">
<strong>krishnag0902</strong>.
hoho.
four thirty-four PM.
</span>
I am using the following code on the above span (message_star_holder), but it returns None:
data_mess = star_button_span.find_element_by_xpath("//button[@class='star ts_icon ts_icon_star_o ts_icon_inherit ts_tip_top star_message ts_tip ts_tip_float ts_tip_hidden btn_unstyle']")
print(data_mess.get_attribute("innerHTML"))
print(star_button_span.get_attribute("data-msg-id"))
star_button_span doesn't have a data-msg-id attribute; data_mess does:
print(data_mess.get_attribute("data-msg-id"))
I'm web scraping a Wikipedia page using BeautifulSoup in Python and I was wondering whether there is any way to know the number of text objects in an HTML object. For example, the following code gets me the following HTML:
soup.find_all(class_ = 'toctext')
<span class="toctext">Actors and actresses</span>, <span class="toctext">Archaeologists and anthropologists</span>, <span class="toctext">Architects</span>, <span class="toctext">Artists</span>, <span class="toctext">Broadcasters</span>, <span class="toctext">Businessmen</span>, <span class="toctext">Chefs</span>, <span class="toctext">Clergy</span>, <span class="toctext">Criminals</span>, <span class="toctext">Conspirators</span>, <span class="toctext">Economists</span>, <span class="toctext">Engineers</span>, <span class="toctext">Explorers</span>, <span class="toctext">Filmmakers</span>, <span class="toctext">Historians</span>, <span class="toctext">Humourists</span>, <span class="toctext">Inventors / engineers</span>, <span class="toctext">Journalists / newsreaders</span>, <span class="toctext">Military: soldiers/sailors/airmen</span>, <span class="toctext">Monarchs</span>, <span class="toctext">Musicians</span>, <span class="toctext">Philosophers</span>, <span class="toctext">Photographers</span>, <span class="toctext">Politicians</span>, <span class="toctext">Scientists</span>, <span class="toctext">Sportsmen and sportswomen</span>, <span class="toctext">Writers</span>, <span class="toctext">Other notables</span>, <span class="toctext">English expatriates</span>, <span class="toctext">References</span>, <span class="toctext">See also</span>
I can get the first text object by running the following:
soup.find_all(class_ = 'toctext')[0].text
My goal here is to get and store all of the text objects in a list. I'm doing this with a for loop, but I don't know how many text objects there are in the HTML block. Naturally I would hit an error if I reached an index that doesn't exist. Is there an alternative?
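You can use a for...in loop, which iterates over the results directly so you never need the count. Here it is as a list comprehension: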
In [13]: [t.text for t in soup.find_all(class_ = 'toctext')]
Out[13]:
['Actors and actresses',
'Archaeologists and anthropologists',
'Architects',
'Artists',
'Broadcasters',
'Businessmen',
'Chefs',
'Clergy',
'Criminals',
'Conspirators',
'Economists',
'Engineers',
'Explorers',
'Filmmakers',
'Historians',
'Humourists',
'Inventors / engineers',
'Journalists / newsreaders',
'Military: soldiers/sailors/airmen',
'Monarchs',
'Musicians',
'Philosophers',
'Photographers',
'Politicians',
'Scientists',
'Sportsmen and sportswomen',
'Writers',
'Other notables',
'English expatriates',
'References',
'See also']
Try the following code:
for txt in soup.find_all(class_ = 'toctext'):
    print(txt.text)