Web parsing using Beautiful Soup - Python

I need to parse data from a website: https://finance.yahoo.com/quote/MSFT/community.
I am trying to scrape the time of each post, for example "21 hours ago".
Here is the HTML of the site. I am trying to extract the time from
<span class = "F....>21 hours ago<...
<li class="comment Pend(2px) Mt(5px) Mb(11px) P(12px) " data-reactid="24">
<div class="Pos(r) Pstart(52px) " data-reactid="25">
<div class="Fz(12px) Mend(20px) Mb(5px)" data-reactid="26">
<div class="avatar D(ib) Bdrs(46.5%) Pos(a) Start(0) Cur(p)" data-reactid="27">...</div>
<button aria-label="See reaction history for 💰Chef💰" class="D(ib) Fw(b) P(0) Bd(0) M(0) Mend(10px)
Fz(16px) Ta(start) C($c-fuji-blue-1-a)" data-reactid="31">💰Chef💰</button>
<span class="Fz(12px) C(#828c93)"><span>21 hours ago</span></span>
<div class="Wow(bw)" data-reactid="33">...</div>
<div class="Py(4px)" data-reactid="39">...</div>
<div>...</div>
<div class="Pos(r) Pt(5px)" data-reactid="45">...</div>
</div>
</li>
The problem is that I cannot find the time after reading the page with BeautifulSoup. Here is the code I have written:
import requests
from bs4 import BeautifulSoup

url = 'https://finance.yahoo.com/quote/MSFT/community'
source = requests.get(url).text
soup = BeautifulSoup(source, 'lxml')
content = soup.find('ul' , class_= "comments-list List(n) Ovs(touch) Pos(r) Mt(10px) Mstart(-12px) Pt(5px)")
li = content.find('li' , class_ = "comment Pend(2px) Mt(5px) Mb(11px) P(12px)")
print(li.prettify())
Output
<li class="comment Pend(2px) Mt(5px) Mb(11px) P(12px)" data-reactid="24">
<div class="Pos(r) Pstart(52px)" data-reactid="25">
<div class="Fz(12px) Mend(20px) Mb(5px)" data-reactid="26">
<div class="avatar D(ib) Bdrs(46.5%) Pos(a) Start(0) Cur(p)" data-reactid="27">
<div class="Pos(r)" data-reactid="28">
<div class="avatar-text Ta(c) Bdrs(48%)" data-reactid="29" style="background-color:#ff333a;color:#fff;font-size:24px;line-height:40px;width:40px;height:40px;" title="See reaction history for 💰Chef💰">
�
</div>
<img alt="💰Chef💰" class="avatar-img Bdrs(48%) Pos(a) StretchedBox Bgc(#400090.03)" data-reactid="30" height="40" src="https://s.yimg.com/it/api/res/1.2/H4mqpmacU.CnalUl7leuZA--~A/YXBwaWQ9eW5ld3M7dz04MDtoPTgwO3E9ODA-/https://s.yimg.com/gq/1735/40147141066_fc6972_o.jpg" title="See reaction history for 💰Chef💰" width="40"/>
</div>
</div>
<button aria-label="See reaction history for 💰Chef💰" class="D(ib) Fw(b) P(0) Bd(0) M(0) Mend(10px) Fz(16px) Ta(start) C($c-fuji-blue-1-a)" data-reactid="31">
💰Chef💰
</button>
<!-- react-empty: 32 -->
</div>
<div class="Wow(bw)" data-reactid="33">
<div class="C($c-fuji-grey-l) Mb(2px) Fz(14px) Lh(20px) Pend(8px)" data-reactid="34">
<!-- react-text: 35 -->
I managed to pick up an Xbox series X and I have to say the tech is very impressive (bought it not knowing). This only makes up around 10% of Microsoft’s business and slated to out perform PS5. I personally believe this stock is still undervalued. Longs stay strong!
<!-- /react-text -->
</div>
<div class="Pos(r) D(ib) Lh(12px)" data-reactid="36">
<video autoplay="" class="W(a) Mah(320px) My(3px) H(a) Maw(100%)" data-reactid="37" height="236" loop="" muted="" playsinline="" poster="https://media.tenor.com/images/e239c83d92cd0e896dc8b9ac05be0bf4/tenor.png" width="258">
<source data-reactid="38" src="https://media.tenor.com/videos/1eff610b01f69f793fe1f267c88bbbd4/mp4" type="video/mp4"/>
</video>
</div>
</div>
<div class="Py(4px)" data-reactid="39">
<div class="D(ib) Pos(r)" data-reactid="40">
<div class="Fz(12px) Px(8px) Mend(4px) Va(m) Bdrs(3px) C($c-fuji-grey-g) Bgc($c-fuji-grey-b) Cur(d) Py(3px)" data-reactid="41">
<div class="D(ib) Mend(6px) Cur(d) Cur(p)" data-icon="traffic" data-reactid="42" style="vertical-align:middle;fill:#333;stroke:#333;stroke-width:0;width:18px;height:18px;">
</div>
<!-- react-text: 43 -->
Bullish
<!-- /react-text -->
</div>
</div>
</div>
<!-- react-empty: 44 -->
<div class="Pos(r) Pt(5px)" data-reactid="45">
<div class="D(ib)" data-reactid="46">
<button class="reply-button O(n):h O(n):a P(0) Bd(n) Cur(p) Fz(12px) Fw(n) C(#828c93) D(ib) Mend(20px)" data-reactid="47">
<svg class="Mend(6px) Cur(p)" data-icon="reply" data-reactid="48" height="15" style="fill:#828c93;stroke:#828c93;stroke-width:0;vertical-align:bottom;" viewbox="0 0 25 20" width="15">
<path d="M12.8 5.8V2.4c0-.4-.2-.7-.6-.8-.3-.2-.8-.1-1.1.1L.9 8.8c-.2.2-.4.4-.4.7s.2.6.4.8l10.2 7.1c.3.2.7.3 1.1.1.3-.2.6-.5.6-.8v-3.3h1.3c3 0 6.1.6 8.5 5.2.2.3.5.5.9.5h.3c.5-.1.8-.6.7-1C23 7.3 18.5 5.9 12.8 5.8zm-.1 5.6h-1c-.6 0-1 .4-1 .9v2.4L3.2 9.5l7.5-5.2v2.4c0 .5.5.9 1 .9 4.2 0 7.9 0 9.8 6.3-2.9-2.6-6.3-2.6-8.8-2.5z" data-reactid="49">
</path>
</svg>
<!-- react-text: 50 -->
Reply
<!-- /react-text -->
</button>
</div>
<div class="D(ib) Pos(r)" data-reactid="51">
<button aria-label="8 Thumbs Up" class="O(n):h O(n):a Bgc(t) Bdc(t) M(0) P(0) Bd(n) Mend(20px)" data-reactid="52">
<svg class="Cur(p)" data-icon="thumbsup-outline" data-reactid="53" height="12" style="vertical-align:middle;fill:#828c93;stroke:#828c93;stroke-width:0;" viewbox="0 0 24 24" width="12">
<path d="M2.4 21.7V11.6h2.1c1.1 0 1.7-.7 2.1-1.1.7-.7 1.5-1.4 2.2-2.1.3-.2.5-.4.7-.6 1.6-1.8 2.3-3.4 2.2-5.5 0 0 .1-.1.3 0 .5.1 1 .6 1.1 1.6.3 2.4-.2 4.3-.7 4.7-.9.7-.4 2 .8 2h7.1c.7 0 1.3.4 1.3.8 0 .3-.5.8-.7.8-1.6 0-1.6 2.1-.1 2.2 0 .1.1.2.1.4 0 .1 0 .2-.3.3-.3-.1-.5-.1-.7-.1-1.3 0-1.7 1.7-.5 2.2 0 0 .2.1.3.2.2.2.3.3.3.6 0 .1-.1.1-.3.3-.2.1-.5.2-.7.2-1.3.2-1.4 1.9-.1 2.2l.1.1s.2.2.2.5c0 .4-.2.6-.6.6l-16.2-.2zM20.3 8.3h-5.1c.4-1.3.6-2.9.3-4.7-.2-2-1.5-3.3-3.1-3.6-1.7-.3-3.2.7-3.1 2.4.1 1.5-.4 2.6-1.7 4l-.5.5c-.6.7-1.3 1.4-2 2-.3.2-.5.4-.7.5H1.2c-.7 0-1.2.5-1.2 1.1v12.3c0 .7.6 1.2 1.2 1.2h17.3c1.6 0 3-1 3-2.8 0-.5-.1-.9-.3-1.2.7-.5 1.1-1.1 1.1-2 0-.5-.1-.9-.3-1.3.7-.4 1.2-1.1 1.2-2 0-.4-.1-.9-.3-1.3.6-.5 1-1.3 1-2.1.1-1.8-1.7-3-3.6-3z" data-reactid="54">
</path>
</svg>
<span class="D(ib) Va(m) Fz(12px) Mstart(6px) C(#828c93)" data-reactid="55">
8
</span>
</button>
<button aria-label="0 Thumbs Down" class="O(n):h O(n):a Bgc(t) Bdc(t) M(0) P(0) Bd(n)" data-reactid="56">
<svg class="Cur(p)" data-icon="thumbsdown-outline" data-reactid="57" height="12" style="vertical-align:middle;fill:#828c93;stroke:#828c93;stroke-width:0;" viewbox="0 0 24 24" width="12">
<path d="M21.6 2.3v10.1h-2.1c-1.1 0-1.7.7-2.1 1.1-.7.7-1.5 1.4-2.2 2.1-.2.2-.4.4-.5.6-1.6 1.8-2.3 3.4-2.2 5.5 0 0-.1.1-.3 0-.5-.1-1-.6-1.1-1.6-.3-2.4.2-4.3.7-4.7.9-.7.4-2-.8-2H3.9c-.7 0-1.3-.4-1.3-.8-.2-.2.4-.6.5-.6 1.6 0 1.6-2.1.1-2.2 0-.1-.1-.2-.1-.4 0-.1 0-.2.3-.3s.5-.1.7-.1c1.3 0 1.7-1.7.5-2.2 0 0-.2-.1-.3-.2-.2-.2-.3-.3-.3-.6 0 0 .1-.1.3-.2.3-.1.6-.2.7-.2 1.3-.2 1.4-1.9.1-2.2L5 3.3s-.1-.2-.1-.5c0-.4.2-.6.6-.6l16.1.1zM0 12.7c0 1.8 1.8 3 3.7 3h5.1c-.4 1.3-.6 2.9-.3 4.7.3 1.9 1.5 3.3 3.1 3.5 1.7.3 3.2-.7 3.1-2.4-.1-1.5.4-2.6 1.7-4 .2-.2.3-.4.5-.5.7-.7 1.4-1.3 2.1-1.9.3-.2.5-.4.6-.5h3.2c.7 0 1.2-.5 1.2-1.1V1.2c0-.7-.6-1.2-1.2-1.2H5.5c-1.6 0-3 1-3 2.8 0 .5.1.9.3 1.2-.8.5-1.2 1.2-1.2 2 0 .5.1.9.3 1.3-.7.4-1.2 1.1-1.2 2 0 .4.1.9.3 1.3-.6.6-1 1.3-1 2.1" data-reactid="58">
</path>
</svg>
</button>
</div>
</div>
</div>
</li>
When I run the following lines, the output is "None".
date = content.find('span' , class_="Fz(12px) C(#828c93)")
print(date)
Output
None
Can someone explain why this is happening, or help with code that reads the time of the post (21 hours ago)?
Thank you.

You get "None" because requests only fetches the HTML returned by the initial request, and that HTML does not contain the post times. Those values are rendered in the browser after the initial request. To capture values like these you can use a browser-automation library such as Selenium: wait until the page has loaded, then pass the rendered HTML source to BeautifulSoup. The sample code below gives a basic idea:
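from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()
driver.get('https://finance.yahoo.com/quote/MSFT/community')
try:
    # By.CLASS_NAME takes a single class name, so match on the "comment" class with a CSS selector
    myElem = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.CSS_SELECTOR, 'li.comment')))
    # Read the page source only after the wait has succeeded
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    # Rest of the code
except TimeoutException:
    print("Load failed")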
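Once the rendered source is available (for example from driver.page_source), the time can be read with the same kind of lookup the question attempted. The following is a minimal sketch, not the answerer's code: the markup is a trimmed, simplified copy of the comment HTML from the question, and the function name extract_post_times is made up for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed, simplified copy of the rendered comment markup from the question.
rendered_html = """
<ul class="comments-list">
  <li class="comment Pend(2px) Mt(5px)">
    <button class="D(ib) Fw(b)">Chef</button>
    <span class="Fz(12px) C(#828c93)"><span>21 hours ago</span></span>
    <div class="Wow(bw)">Comment body</div>
  </li>
</ul>
"""


def extract_post_times(html):
    """Return the timestamp text of every comment found in html."""
    soup = BeautifulSoup(html, "html.parser")
    times = []
    # class_="comment" matches any <li> whose class list contains "comment".
    for li in soup.find_all("li", class_="comment"):
        # A class_ string with spaces is compared against the whole class value.
        span = li.find("span", class_="Fz(12px) C(#828c93)")
        if span is not None:
            times.append(span.get_text(strip=True))
    return times


print(extract_post_times(rendered_html))
```

If the span is missing from the HTML, as it is in the plain requests response, the function simply returns an empty list instead of failing.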

import requests
from bs4 import BeautifulSoup
response = requests.get("https://finance.yahoo.com/quote/MSFT/community")
soup = BeautifulSoup(response.content, 'html.parser')
print(soup)

Related

Selenium: find child elements of a parent element

First of all, respect and greetings to everyone.
I am building an automation application with Python and Selenium.
In general, I collect elements that repeat in the DOM in one batch with the find_elements method and save them in a variable.
Example:
The HTML element that I call GENERAL:
<div class="_a9zr">
<h3 class="_a9zc">
<div class="_ab8w _ab94 _ab99 _ab9f _ab9m _ab9p _abbh _abcm"><span class="_aap6 _aap7 _aap8"><a
class="x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz _acan _acao _acat _acaw _a6hd"
href="/zelihaozbey6/" role="link" tabindex="0">zelihaozbey6</a></span></div>
</h3>
<div class="_a9zs"><span
class="_aacl _aaco _aacu _aacx _aad7 _aade">Emeğine sağlık çok güzel olmuşlar maşallah</span></div>
<div class="_ab8w _ab94 _ab99 _ab9f _ab9m _ab9p _abaj _abb- _abcm">
<div class="_aacl _aacn _aacu _aacy _aad6"><a
class="x1i10hfl xjbqb8w x6umtig x1b1mbwd xaqea5y xav7gou x9f619 x1ypdohk xt0psk2 xe8uvvx xdj266r x11i5rnm xat24cr x1mh8g0r xexx8yu x4uap5 x18d9i69 xkhd6sd x16tdsg8 x1hl2dhg xggy1nq x1a2a7pz _a9zg _a6hd"
href="/p/CKddDXShtW_/c/17935291282568260/" role="link" tabindex="0">
<time class="_a9ze _a9zf" datetime="2021-07-08T20:33:35.000Z" title="Tem 8, 2021">64h</time>
</a>
<button class="_a9ze">
<div class="_aacl _aacn _aacw _aacy _aad6 _aade">Yanıtla</div>
</button>
<div class=" _ab8y _ab94 _ab99 _ab9f _ab9m _ab9p _abbi _abcm"></div>
<div class="_a9ze">
<div class=" _a9zi">
<button class="_abl-" type="button">
<div class="_abm0">
<div class="_ab8w _ab94 _ab97 _ab9h _ab9m _ab9p _abcm" style="height: 24px; width: 24px;">
<svg aria-label="Yorum Seçenekleri" class="_ab6-" color="#8e8e8e" fill="#8e8e8e"
height="24" role="img" viewBox="0 0 24 24" width="24">
<circle cx="12" cy="12" r="1.5"></circle>
<circle cx="6" cy="12" r="1.5"></circle>
<circle cx="18" cy="12" r="1.5"></circle>
</svg>
</div>
</div>
</button>
</div>
</div>
</div>
</div>
consts.py
GENERAL_COMMENT_SECTION = "//ul//ul/div/li/div"
COMMENTS_BY_XPATH = "//ul//ul/div/li/div/div/div/div/span"
USERNAME_BY_XPATH = "//ul//ul/div/li/div/div/div/h3"
COMMENT_TIME = "//ul//ul/div/li/div/div/div/div/div/a/time"
Example:
comments = self.wait.until(
    EC.presence_of_all_elements_located((By.XPATH, GENERAL_COMMENT_SECTION)))
with open('src/comment.txt', 'a+', encoding="utf-8") as f:
    for comment in comments:
        username = comment.find_element(By.XPATH, USERNAME_BY_XPATH).text
        comment_text = comment.find_element(By.XPATH, COMMENTS_BY_XPATH).text
        comment_time = comment.find_element(By.XPATH, COMMENT_TIME).get_attribute("title")
        f.write(f"{username},{comment_text},{comment_time}\n")
OUTPUT comments.txt
armaganfadik,Tosbikk ya,Oca 2, 2022
armaganfadik,Tosbikk ya,Oca 2, 2022
armaganfadik,Tosbikk ya,Oca 2, 2022
armaganfadik,Tosbikk ya,Oca 2, 2022
Inside the loop I then try to reach each element and read some text and the title attribute from within it, but only the information from the first element comes back on every iteration. How can I solve this?
Because the XPaths start with //, they are evaluated against the whole document rather than against the parent element, so find_element always returns the first match on the page.
Try this version, which references the children relative to the parent element:
GENERAL_COMMENT_SECTION = "//ul//ul/div/li/div"
COMMENTS_BY_XPATH = "./div/div/div/span"
USERNAME_BY_XPATH = "./div/div/h3"
COMMENT_TIME = "./div/div/div/div/a/time"
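The difference between searching from the document root and searching relative to each item can be reproduced without a browser. This is only a sketch: the standard library's ElementTree stands in for Selenium, and the markup and the names alice and bob are invented for illustration. In Selenium the same effect comes from passing an XPath that starts with // to element.find_element.

```python
import xml.etree.ElementTree as ET

# Invented, well-formed markup with two repeating comment blocks.
markup = """
<ul>
  <li><div><h3>alice</h3><span>first comment</span></div></li>
  <li><div><h3>bob</h3><span>second comment</span></div></li>
</ul>
"""

root = ET.fromstring(markup)
items = root.findall("./li/div")

# Mistake: the lookup ignores the current item and starts at the root,
# so every iteration returns the first <h3> in the document.
from_root = [root.find(".//h3").text for _ in items]

# Fix: the lookup starts at the current item ("./" is relative to it).
from_item = [item.find("./h3").text for item in items]

print(from_root)  # ['alice', 'alice']
print(from_item)  # ['alice', 'bob']
```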

Beautiful Soup and extracting a div work for one site but not for the other

TOPIC CLOSED
I have this part of the code
containers = page_soup.findAll('div',{'class' : 'product-count d-flex align-items-center'})
output = ''
for container in containers:
    price = container.find('span',{'class':'lang'}).text.replace(",", "") if container.find('span',{'class':'lang'}) else ""
that extracts the value I need from this HTML page:
<div class="product-count d-flex align-items-center">
<span class="icon-military_tech" style="color: #FFCF57; font-size: 16px"></span>
<span class="lang">bought 24 times</span>
</div>
The result is "bought 24 times".
But for another site, where the HTML code is
<div data-v-fd0de2e2=""><div data-v-fd0de2e2="" class="product-features"><!----> <div data-v-fd0de2e2=""><span data-v-fd0de2e2="" class="sold_products_count">bought 53 times</span></div></div> <!----> <!----> <div data-v-fd0de2e2="" class="product-meta"><div data-v-fd0de2e2="" class="product-sku"><strong data-v-fd0de2e2="">product code: </strong> <span data-v-fd0de2e2="">1200100045</span></div> <br data-v-fd0de2e2=""> <div data-v-fd0de2e2="" class="product-sku"><strong data-v-fd0de2e2="">weight: </strong> <span data-v-fd0de2e2="" style="direction: ltr; display: inline-block;">
0 kg
</span></div> <br data-v-fd0de2e2=""> <!----></div></div>
this modified Python code produces an empty result file:
containers = page_soup.findAll('div',{'class' : 'product-features'})
output = ''
for container in containers:
    price = container.find('span',{'class':'sold_products_count'}).text.replace(",", "") if container.find('span',{'class':'sold_products_count'}) else ""
The needed result for the last site is bought 53 times
The code loops over all the containers before exiting and overwrites price each time, so the eventual value of price is the one from the last container, whether or not that one contains the data you were looking for.
You can break out of the loop once you've found the value you need, like this:
containers = page_soup.findAll('div',{'class' : 'product-features'})
output = ''
for container in containers:
    price = container.find('span',{'class':'sold_products_count'}).text.replace(",", "") if container.find('span',{'class':'sold_products_count'}) else ""
    if price:
        break
print(price)
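When only the first matching value is needed, the loop can be dropped, because find already returns the first match in the document. A minimal sketch under that assumption; the sample markup is a shortened version of the HTML above with an extra empty container added, and the function name first_sold_count is hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened sample: the second container has no sold_products_count span,
# which is the case that overwrote price in the original loop.
sample_html = """
<div class="product-features">
  <div><span class="sold_products_count">bought 53 times</span></div>
</div>
<div class="product-features"></div>
"""


def first_sold_count(markup):
    """Return the text of the first sold_products_count span, or ""."""
    soup = BeautifulSoup(markup, "html.parser")
    span = soup.find("span", class_="sold_products_count")
    if span is None:
        return ""
    return span.get_text(strip=True).replace(",", "")


print(first_sold_count(sample_html))
```

If the page builds that span in the browser, it will not be in the downloaded HTML and the function returns an empty string.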

How to scrape two nested elements with Python

Hi, I would like to get the information inside the <del> and <ins> tags shown below, but I could not find a solution. Does anyone have an idea how to scrape this, and is there a way to get that information?
This is my Python code:
import requests
import json
from bs4 import BeautifulSoup
header = {'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'}
base_url = "https://www.n11.com/super-firsatlar"
r = requests.get(base_url,headers=header)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')
    books = soup.find_all('li',attrs={"class":"column"})
    result=[]
    for book in books:
        title=book.find('h3').text
        link=base_url +book.find('a')['href']
        picture = base_url + book.find('img')['src']
        price=book.find('p', {'class': 'del'})
        single ={'title':title,'link':link,'picture':picture,'price':price}
        result.append(single)
    with open('book.json','w', encoding='utf-8') as f:
        json.dump(result,f,indent=4,ensure_ascii=False)
else:
    print(r.status_code)
and this is my HTML page:
<li class="column">
<script type="text/javascript">
var customTextOptionMap = {};
</script>
<div id="p-457010862" class="columnContent ">
<div class="pro">
<a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="plink" data-id="457010862">
<img data-original="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
width="215" height="215"
src="https://n11scdn1.akamaized.net/a1/215/elektronik/cep-telefonu/oppo-a73-128-gb-oppo-turkiye-garantili__1298508275589871.jpg"
alt="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="lazy" style="">
<h3 class="productName ">
Oppo A73 128 GB (Oppo Türkiye Garantili)</h3>
<span class="loading"></span>
</a>
</div>
<div class="proDetail">
<a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
class="oldPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">
<del>2.999, 00 TL</del>
</a> <a href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"
class="newPrice" title="Oppo A73 128 GB (Oppo Türkiye Garantili)">
<ins>2.899, 00<span content="TRY">TL</span></ins>
</a>
<div class="discount discountS">
<div>
<span class="percent">%</span>
<span class="ratio">3</span>
</div>
</div>
<span class="textImg freeShipping"></span>
<p class="catalogView-hover-separate"></p>
<div class="moreOpt">
<a title="Oppo A73 128 GB (Oppo Türkiye Garantili)" class="textImg moreOptBtn"
href="https://www.n11.com/urun/oppo-a73-128-gb-oppo-turkiye-garantili-1599155?magaza=gelecekbizde"></a>
</div>
</div>
</div>
</li>
Unless I am not understanding your question, it should be as simple as doing:
del_data = soup.find_all("del")
ins_data = soup.find_all("ins")
Is this not what you're trying to achieve? If not, please clarify your question.
del and ins are not class names but tags. You can simply find them with soup.find_all('del'):
price = book.find_all('del')
for p in price:
    print(p.text)
gives
2.999,00 TL
189,90 TL
8.308,44 TL
499,90 TL
6.999,00 TL
99,00 TL
18,00 TL
499,00 TL
169,99 TL
1.499,90 TL
3.010,00 TL
2.099,90 TL
......
which I guess is what you want. You have to access the text attribute here. So the element is located; how you want to serialize it is a different question.
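On the serialization side, one option is to store a number, or its string form, instead of the raw tag text. The sketch below is an assumption about what a useful format might be, not something the question asked for. The helper name parse_tl_price is hypothetical, and it assumes prices always use "." for thousands and "," for decimals, as in the output above.

```python
import json
from decimal import Decimal


def parse_tl_price(text):
    """Convert text such as '2.999, 00 TL' to Decimal('2999.00')."""
    cleaned = text.replace("TL", "")
    cleaned = "".join(cleaned.split())  # drop every whitespace character
    cleaned = cleaned.replace(".", "")  # remove thousands separators
    cleaned = cleaned.replace(",", ".")  # decimal comma to decimal point
    return Decimal(cleaned)


old_price = parse_tl_price("2.999, 00 TL")
new_price = parse_tl_price("2.899, 00TL")

# Decimal is not JSON serializable, so it is written as a string here.
record = {"old_price": str(old_price), "new_price": str(new_price)}
print(json.dumps(record))
```

In the question's loop, the input would be the text of the tag, for example book.find('del').text, after checking that find did not return None.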

Python selenium find checkbox

I'm trying to find a checkbox and click it. I've tried using:
driver.find_element_by_xpath(("//*[contains(text(), '2+ interchanges')]")).click()
or
driver.find_element_by_class_name('BpkCheckbox_bpk-checkbox__label__1vrLS BpkCheckbox_bpk-checkbox__label--small__16XBT')
or
driver.find_element_by_name('2+ interchanges').click()
but no luck; I always get the error:
selenium.common.exceptions.NoSuchElementException: Message: Unable to locate element:
Could you please give me some advice on how I should find this checkbox?
Thanks!
Best
The code looks like:
<div class="Filter_verticallySpaced__1Sm6F Filter_horizontallySpaced__RwfGp FiltersCheckboxes_colourOnHover__erAw_">
<label class="BpkCheckbox_bpk-checkbox__2yTOQ FiltersCheckboxes_checkboxLine__7IQLY">
<input aria-invalid="false" aria-label="2+ interchanges" checked="" class="BpkCheckbox_bpk-checkbox__input__2qMb7" name="twoPlusStops" type="checkbox"/>
<svg class="BpkCheckbox_bpk-checkbox__icon__1KuPa" height="18" style="width: 0.75rem; height: 0.75rem;" viewbox="0 0 24 24" width="18" xmlns="http://www.w3.org/2000/svg">
<path d="M9.4 20c-.5 0-.9-.2-1.3-.5l-5.8-5.1c-.4-.4-.5-1-.1-1.4l1.3-1.5c.4-.4 1-.5 1.4-.1l4 3.5c.2.1.4.1.6 0l9.2-10.5c.4-.4 1-.5 1.4-.1l1.5 1.3c.4.4.5 1 .1 1.4L10.9 19.3c-.4.5-.9.7-1.5.7z">
</path>
</svg>
<span aria-hidden="true" class="BpkCheckbox_bpk-checkbox__label__1vrLS BpkCheckbox_bpk-checkbox__label--small__16XBT">
2+ interchanges
</span>
</label>
<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--sm__345aT FiltersCheckboxes_subText__3aG9- FiltersCheckboxes_block__1ZTpQ FiltersCheckboxes_checkboxIndent__2c49O">
2 735 $
</span>
</div>
<div class="Filter_verticallySpaced__1Sm6F Filter_horizontallySpaced__RwfGp FiltersCheckboxes_colourOnHover__erAw_">
<label class="BpkCheckbox_bpk-checkbox__2yTOQ FiltersCheckboxes_checkboxLine__7IQLY">
<input aria-invalid="false" aria-label="1 interchange" checked="" class="BpkCheckbox_bpk-checkbox__input__2qMb7" name="oneStop" type="checkbox"/>
<svg class="BpkCheckbox_bpk-checkbox__icon__1KuPa" height="18" style="width: 0.75rem; height: 0.75rem;" viewbox="0 0 24 24" width="18" xmlns="http://www.w3.org/2000/svg">
<path d="M9.4 20c-.5 0-.9-.2-1.3-.5l-5.8-5.1c-.4-.4-.5-1-.1-1.4l1.3-1.5c.4-.4 1-.5 1.4-.1l4 3.5c.2.1.4.1.6 0l9.2-10.5c.4-.4 1-.5 1.4-.1l1.5 1.3c.4.4.5 1 .1 1.4L10.9 19.3c-.4.5-.9.7-1.5.7z">
</path>
</svg>
<span aria-hidden="true" class="BpkCheckbox_bpk-checkbox__label__1vrLS BpkCheckbox_bpk-checkbox__label--small__16XBT">
1 interchange
</span>
</label>
<span class="BpkText_bpk-text__2NHsO BpkText_bpk-text--sm__345aT FiltersCheckboxes_subText__3aG9- FiltersCheckboxes_block__1ZTpQ FiltersCheckboxes_checkboxIndent__2c49O">
2 787 $
</span>
</div>
With this
driver.find_element_by_class_name('BpkCheckbox_bpk-checkbox__label__1vrLS BpkCheckbox_bpk-checkbox__label--small__16XBT')
you passed two class names as one (class names are separated by spaces), and find_element_by_class_name accepts only a single class name.
Try using this:
driver.find_element_by_class_name('BpkCheckbox_bpk-checkbox__label__1vrLS')
By the way, this
driver.find_element_by_name('2+ interchanges').click()
searches for an element whose name attribute is '2+ interchanges', not for the text inside an element. (The <span> that holds the label text has no name attribute.)
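When an element should be matched on several classes at once, the class list can be turned into a CSS selector instead of being passed as a class name. The helper below is hypothetical and is not part of Selenium. It assumes the class names contain only letters, digits, hyphens and underscores, and it rejects anything else rather than attempting CSS escaping.

```python
import re

# Class names that can be used in a CSS selector without escaping.
_SIMPLE_CLASS = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def css_selector_from_classes(class_attr, tag=""):
    """Build 'tag.class1.class2' from a space-separated class attribute."""
    tokens = class_attr.split()
    if not tokens:
        raise ValueError("class attribute is empty")
    for token in tokens:
        if not _SIMPLE_CLASS.fullmatch(token):
            raise ValueError(f"class name needs CSS escaping: {token!r}")
    return tag + "".join("." + token for token in tokens)


selector = css_selector_from_classes(
    "BpkCheckbox_bpk-checkbox__label__1vrLS "
    "BpkCheckbox_bpk-checkbox__label--small__16XBT",
    tag="span",
)
print(selector)
```

The resulting string could then be used with a CSS locator, for example driver.find_element(By.CSS_SELECTOR, selector) in Selenium 4.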
All right, I found it :)
Solution: I just added a 10-second wait before the click and changed the XPath.
time.sleep(10)
driver.find_element_by_xpath('//*[@id="stops_content"]/div/div/div[3]/label').click()

How do I scrape nested data using Selenium and Python?

I basically want to scrape "Feb 2016 - Present", which sits under <span class="visually-hidden">, but I can't seem to get to it. Here's the HTML code:
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
And here is what I've been doing at the moment with selenium in my code:
date = browser.find_element_by_xpath('.//div[@class = "pv-entity__duration de Sans-15px-black-55% ml0"]').text
print(date)
But this gives no results. How would I go about pulling the date?
There is no div with class="pv-entity__duration de Sans-15px-black-55% ml0"; that class is on an h4. If you want the text of the enclosing div, try:
date = browser.find_element_by_xpath('.//div[@class = "pv-entity__position-info detail-facet m0"]').text
print(date)
If you want to get "Feb 2016 - Present", try:
date = browser.find_element_by_xpath('//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span[2]').text
print(date)
You can rewrite your XPath code something like this:
# -*- coding: utf-8 -*-
from lxml import html
import unicodedata
html_str = """
<div class="pv-entity__summary-info">
<h3 class="Sans-17px-black-85%-semibold">Litigation Paralegal</h3>
<h4>
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title Sans-15px-black-55%">Olswang</span>
</h4>
<div class="pv-entity__position-info detail-facet m0"><h4 class="pv-entity__date-range Sans-15px-black-55%">
<span class="visually-hidden">Dates Employed</span>
<span>Feb 2016 – Present</span>
</h4><h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item">1 yr 2 mos</span>
</h4><h4 class="pv-entity__location detail-facet Sans-15px-black-55% inline-block">
<span class="visually-hidden">Location</span>
<span class="pv-entity__bullet-item">London, United Kingdom</span>
</h4></div>
</div>
"""
root = html.fromstring(html_str)
# For fetching Feb 2016 – Present:
txt = root.xpath('//h4[@class="pv-entity__date-range Sans-15px-black-55%"]/span/text()')[1]
# For fetching 1 yr 2 mos:
txt1 = root.xpath('//h4[@class="pv-entity__duration de Sans-15px-black-55% ml0"]/span/text()')[1]
print(txt)
print(txt1)
This will result in:
Feb 2016 – Present
1 yr 2 mos
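A different approach, shown here only as a sketch, is to key on the hidden label text ("Dates Employed") rather than on the long class strings, which tend to change. This swaps the XPath class match for a label lookup and uses the standard library's ElementTree on a shortened copy of the snippet. The function name value_for_label is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the HTML from the question (well-formed, so it parses as XML).
snippet = """
<div class="pv-entity__position-info detail-facet m0">
  <h4 class="pv-entity__date-range Sans-15px-black-55%">
    <span class="visually-hidden">Dates Employed</span>
    <span>Feb 2016 – Present</span>
  </h4>
  <h4 class="pv-entity__duration de Sans-15px-black-55% ml0">
    <span class="visually-hidden">Employment Duration</span>
    <span class="pv-entity__bullet-item">1 yr 2 mos</span>
  </h4>
</div>
"""


def value_for_label(root, label):
    """Return the text of the span that follows the span holding label."""
    for h4 in root.iter("h4"):
        spans = h4.findall("span")
        if len(spans) >= 2 and (spans[0].text or "").strip() == label:
            return (spans[1].text or "").strip()
    return None


root = ET.fromstring(snippet)
print(value_for_label(root, "Dates Employed"))
print(value_for_label(root, "Employment Duration"))
```

A label that is not present yields None, so the caller can tell a missing field from an empty one.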
