I am new to web development and scraping in general, and I am trying to challenge myself by scraping websites like LinkedIn.
Since LinkedIn uses Ember and dynamically changing ids, it is harder to scrape properly.
I am trying to scrape the "experience section" of a LinkedIn profile using the following code:
experience = driver.find_element_by_xpath('//section[@id = "experience-section"]/ul/li[@class="position"]')
The driver has loaded the entire LinkedIn profile page. I would like to get all the positions under "experience-section". The error message is:
Unable to locate element: {"method":"xpath","selector":"//section[@id = "experience-section"]/ul/li/div[@class="position"]"}
I am able to scrape other parts of LinkedIn, but the experience section is a big struggle for me. Is the XPath wrong? If so, what should I change?
Thank you
<section id="experience-section" class="pv-profile-section experience-section ember-view"><header class="pv-profile-section__card-header">
<h2 class="pv-profile-section__card-heading t-20 t-black t-normal">
Experience
</h2>
<!----></header>
<ul id="ember1620" class="pv-profile-section__section-info section-info pv-profile-section__section-info--has-no-more ember-view"><li id="ember1622" class="pv-profile-section__sortable-item pv-profile-section__section-info-item relative pv-profile-section__list-item sortable-item ember-view"><div id="ember1623" class="pv-entity__position-group-pager ember-view"> <li id="392598211" class="pv-profile-section__sortable-card-item pv-profile-section pv-position-entity ember-view"><!----><a data-control-name="background_details_company" href="/company/8736/" id="ember1626" class="ember-view"> <div class="pv-entity__logo company-logo">
<img class="lazy-image pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-5 loaded" alt="Bill & Melinda Gates Foundation" src="https://media.licdn.com/dms/image/C560BAQHvFIyUvuKtQA/company-logo_400_400/0?e=1556755200&v=beta&t=Qhh8_KnrE-OiuXAutFyeI69tgUF3c1ptC9N12siDO4o">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
<h3 class="t-16 t-black t-bold">Co-chair</h3>
<h4 class="t-16 t-black t-normal">
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title">Bill & Melinda Gates Foundation</span>
</h4>
<div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>2000 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">19 yrs</span>
</h4>
</div>
<!---->
</div>
</a>
<!---->
</li>
</div>
</li><li id="ember1630" class="pv-profile-section__sortable-item pv-profile-section__section-info-item relative pv-profile-section__list-item sortable-item ember-view"><div id="ember1631" class="pv-entity__position-group-pager ember-view"> <li id="392599749" class="pv-profile-section__sortable-card-item pv-profile-section pv-position-entity ember-view"><!----><a data-control-name="background_details_company" href="/company/1035/" id="ember1634" class="ember-view"> <div class="pv-entity__logo company-logo">
<img class="lazy-image pv-entity__logo-img pv-entity__logo-img EntityPhoto-square-5 loaded" alt="Microsoft" src="https://media.licdn.com/dms/image/C4D0BAQEko6uLz7XylA/company-logo_400_400/0?e=1556755200&v=beta&t=XQhwV5ruWfGBfjgQylV9gkeXD8VnQRBHGd1bOfTs2tw">
</div>
<div class="pv-entity__summary-info pv-entity__summary-info--background-section ">
<h3 class="t-16 t-black t-bold">Co-founder</h3>
<h4 class="t-16 t-black t-normal">
<span class="visually-hidden">Company Name</span>
<span class="pv-entity__secondary-title">Microsoft</span>
</h4>
<div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>1975 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">44 yrs</span>
</h4>
</div>
<!---->
</div>
</a>
<!---->
</li>
</div>
</li>
</ul>
<!----></section>
---- Update:
I used the solution provided by Sers
driver.get('https://www.linkedin.com/in/williamhgates/')
experience = driver.find_elements_by_xpath('//section[@id = "experience-section"]/ul//li')
for item in experience:
    print(item.text)
    print("")
and I somehow get the results twice:
Co-chair
Company Name
Bill & Melinda Gates Foundation
Dates Employed
2000 – Present
Employment Duration
19 yrs
Co-chair
Company Name
Bill & Melinda Gates Foundation
Dates Employed
2000 – Present
Employment Duration
19 yrs
Co-founder
Company Name
Microsoft
Dates Employed
1975 – Present
Employment Duration
44 yrs
Co-founder
Company Name
Microsoft
Dates Employed
1975 – Present
Employment Duration
44 yrs
The problem in your XPath is that the li is not directly under the ul. Try the XPath below:
//section[@id = "experience-section"]/ul//li
Update
driver.get('https://www.linkedin.com/in/williamhgates/')
experience = driver.find_elements_by_css_selector('#experience-section .pv-profile-section')
for item in experience:
    print(item.text)
    print("")
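The duplicated output in the question's update is consistent with the posted HTML: each position is an li nested inside a wrapper li, so ul//li matches both, and the wrapper's text includes the nested item's text. The following is a minimal sketch of that effect using only the standard library. The markup is a simplified, hypothetical stand-in for the real profile page, and filtering on the pv-profile-section class token mirrors the CSS selector above.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the profile markup in the question.
SAMPLE = """
<body>
  <section id="experience-section">
    <ul>
      <li class="pv-profile-section__sortable-item">
        <div><li class="pv-profile-section pv-position-entity"><h3>Co-chair</h3></li></div>
      </li>
      <li class="pv-profile-section__sortable-item">
        <div><li class="pv-profile-section pv-position-entity"><h3>Co-founder</h3></li></div>
      </li>
    </ul>
  </section>
</body>
"""

root = ET.fromstring(SAMPLE)


def text_of(element):
    """Return all text below an element, with whitespace collapsed."""
    return " ".join("".join(element.itertext()).split())


# Same shape as //section[@id="experience-section"]/ul//li:
# this matches the wrapper items AND the items nested inside them.
all_items = root.findall(".//section[@id='experience-section']/ul//li")
all_texts = [text_of(li) for li in all_items]

# Keep only the items that carry the bare "pv-profile-section" class token.
positions = [li for li in all_items
             if "pv-profile-section" in li.get("class", "").split()]
position_texts = [text_of(li) for li in positions]

print(all_texts)
print(position_texts)
```

Selecting by a class that only the inner items carry returns each position once.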
This code is working:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
from datetime import datetime
import time
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.apple.com/br/shop/product/MV7N2BE/A/airpods-com-estojo-de-recarga")
    html = page.content()
    soup = BeautifulSoup(html,'html.parser')
    valorAppleStore = soup.select("span.as-price-installments")[-2].get_text().replace(" à vista (10% de desconto)", '')
    print(valorAppleStore)
    browser.close()
But if I change to headless=True, the code raises an error:
Traceback (most recent call last):
  File "c:/Users/ANDERSONCARVALHODELI/Documents/py/AirpodsPW.py", line 19, in <module>
    valorAppleStore = soup.select("span.as-price-installments")[-2].get_text().replace(" à vista (10% de desconto)", '')
IndexError: list index out of range
I fixed this using:
from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup
from datetime import datetime
import time
with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://www.apple.com/br/shop/product/MV7N2BE/A/airpods-com-estojo-de-recarga")
    time.sleep(1)
    browser.close()
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://www.apple.com/br/shop/product/MV7N2BE/A/airpods-com-estojo-de-recarga")
    html = page.content()
    soup = BeautifulSoup(html,'html.parser')
    valorAppleStore = soup.select("span.as-price-installments")[-2].get_text().replace(" à vista (10% de desconto)", '')
    print(valorAppleStore)
But I think this is not the best choice. How can I fix this while sticking to headless=True, without first opening the browser with headless=False?
When I print(html) before soup=..., I see:
<!DOCTYPE html><html><head> <title>Page Not Found - Apple</title> <link rel="stylesheet" href="https://www.apple.com/wss/fonts?families=SF+Pro,v1|SF+Pro+Icons,v1"> <link rel="stylesheet" href="https://www.apple.com/v/errors/c/built/styles/main.built.css" type="text/css"> <link rel="stylesheet" href="https://www.apple.com/v/errors/c/built/styles/overview.built.css" type="text/css"> <link rel="stylesheet" href="https://store.storeimages.cdn-apple.com/4982/store.apple.com/shop/rs-external/rel/us/external.css"> <link rel="stylesheet" href="https://store.storeimages.cdn-apple.com/4982/store.apple.com/shop/rs-globalelements/dist/us/globalelements.css"> <style>.more::after{content: "";}a.pointer, a.more, a.block span.more, button.unbutton.more{padding-right: .7em; background-image: url(https://store.storeimages.cdn-apple.com/4982/store.apple.com/shop/rs-web/2/dist/assets/as-legacy/base/link/res/more.svg); background-repeat: no-repeat; background-position: 100% 50%; background-size: 5px 9px; zoom: 1;}.as-globalfooter-directory-column-section-list a{margin-bottom: .8em; display: block}.as-globalfooter-directory-column-section-list a:last-child{margin-bottom: 0;}.as-globalfooter-mini .as-globalfooter-mini-shop a{color: #06c;}.as-globalfooter .as-globalfooter-mini-legal-copyright, .as-footnotes .as-globalfooter-mini-legal-copyright, .as-globalfooter .as-globalfooter-mini-legal-link, .as-footnotes .as-globalfooter-mini-legal-link{top: -3px; position: relative; z-index: 1;}.as-globalfooter .as-globalfooter-directory+.as-globalfooter-mini, .as-footnotes .as-globalfooter-directory+.as-globalfooter-mini{padding-bottom: 26px;}.container{position: relative;}hr{display: inline-block; border: 0px; border-top: 0.1em solid #CCD2D9; width: 100%}</style></head><body class="page-overview"> <nav data-store-api="/shop/bag/status" id="ac-globalnav"> <div class="ac-gn-content"> <ul class="ac-gn-list"> <p class="ac-gn-link-text">Apple</p> <p class="ac-gn-link-text">Store</p> <p 
class="ac-gn-link-text">Mac</p> <p class="ac-gn-link-text">iPad</p> <p class="ac-gn-link-text">iPhone</p> <p class="ac-gn-link-text">Watch</p> <p class="ac-gn-link-text">AirPods</p> <p class="ac-gn-link-text">TV & Home</p>
<p class="ac-gn-link-text">Only on Apple</p> <p class="ac-gn-link-text">Accessories</p> <p class="ac-gn-link-text">Support</p> <li class="ac-gn-item ac-gn-item-menu ac-gn-search"> <a id="ac-gn-link-search" class="ac-gn-link ac-gn-link-search" href="/us/search" data-analytics-title="search" data-analytics-intrapage-link="" aria-label="Search apple.com" role="button" aria-haspopup="true"></a> </li> <p class="ac-gn-link-text">Shopping Bag</p> </ul> </div></nav> <div id="ac-gn-placeholder"> </div><main id="main" class="main" role="main" data-page-type="overview"> <h1 class="section-headline typography-headline">The page you’re looking for can’t be found.</h1> <aside id="search-wrapper" role="search" data-analytics-region="search" aria-hidden="false"> <form id="searchform-form" class="searchform" action="/us/search" method="get" data-suggestions-url="/search-services/suggestions/"><input id="searchform-input" type="text" class="form-textbox form-textbox-text form-icon-left" aria-labelledby="textbox_label" required="" aria-required="true" data-placeholder-long="Search for Products, Stores, and Help" autocorrect="off" autocapitalize="off" autocomplete="off"><span class="form-label" id="textbox_label" aria-hidden="true">Search apple.com</span> <div id="searchform-submit" class="form-icons-wrapper form-icons-wrapper-left form-icons-focusable" type="submit" aria-label="Submit"><button class="form-icons form-icons-search15"></button></div><div id="searchform-reset" class="button-reset form-icons-wrapper form-icons-focusable" type="reset" disabled="" aria-label="Clear Search"><button class="form-icons form-icons-small form-icons-clearsolid15 form-icon-reset"></button></div></form> </aside> <div class="cta-sitemap"> <div class="cta-sitemap"> Or see our site map </div></div></main> <footer class="as-globalfooter as-globalfooter-contained"> <div class="as-globalfooter-content"> <div class="as-globalfooter-breadcrumbs"> <p class="as-globalfooter-breadcrumbs-home-icon"></p><p 
class="as-globalfooter-breadcrumbs-home-label">Apple</p> <div class="as-globalfooter-breadcrumbs-path"> <ol class="as-globalfooter-breadcrumbs-list"> <li class="as-globalfooter-breadcrumbs-item breadcrumbs-title"> Page Not Found</li></ol> </div></div><nav class="as-globalfooter-directory with-5-columns"> <div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Shop and Learn</h3> <ul class="as-globalfooter-directory-column-section-list"> Store Mac iPad iPhone Watch AirPods TV & Home iPod touch AirTag Accessories Gift Cards </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Services</h3> <ul class="as-globalfooter-directory-column-section-list"> Apple Music Apple TV+ Apple Fitness+ Apple News+ Apple Arcade iCloud Apple One Apple Card Apple Books Apple Podcasts App Store </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Account</h3> <ul class="as-globalfooter-directory-column-section-list"> Manage Your Apple ID Apple Store Account iCloud.com </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Apple Store</h3> <ul class="as-globalfooter-directory-column-section-list"> Find a Store Genius Bar Today at Apple Apple Camp Apple Store App Refurbished and Clearance Financing Apple Trade In Order Status Shopping Help </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">For Business</h3> <ul class="as-globalfooter-directory-column-section-list"> Apple and Business Shop for Business </ul> </div><div 
class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">For Education</h3> <ul class="as-globalfooter-directory-column-section-list"> Apple and Education Shop for K-12 Shop for College </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">For Healthcare</h3> <ul class="as-globalfooter-directory-column-section-list"> Apple in Healthcare Health on Apple Watch Health Records on iPhone </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">For Government</h3> <ul class="as-globalfooter-directory-column-section-list"> Shop for Government Shop for Veterans and Military </ul> </div></div><div class="as-globalfooter-directory-column"> <div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">Apple Values</h3> <ul class="as-globalfooter-directory-column-section-list"> Accessibility Education Environment Inclusion and Diversity Privacy <a href="/racial-equity-justice-initiative/">Racial Equity
and Justice</a> Supplier Responsibility </ul> </div><div class="as-globalfooter-directory-column-section"> <h3 class="as-globalfooter-directory-column-section-title">About Apple</h3> <ul class="as-globalfooter-directory-column-section-list"> Newsroom Apple Leadership Career Opportunities Investors Ethics & Compliance Events Contact Apple </ul> </div></div></nav> <div class="as-globalfooter-mini"> <div class="as-globalfooter-mini-shop">More ways to shop:
Find an Apple Store or other retailer near you. <span>Or call 1-800-MY-APPLE.</span> </div><div class="as-globalfooter-mini-locale"> <a class="as-globalfooter-mini-locale-link" href="/choose-country-region/" title="Choose your country or region" aria-label="United States. Choose your country or region" data-analytics-title="choose your country">United States</a> </div><p class="as-globalfooter-mini-legal-copyright">Copyright © 2022 Apple Inc. All rights reserved. </p><a class="as-globalfooter-mini-legal-link" href="/legal/privacy/">Privacy Policy </a> <a class="as-globalfooter-mini-legal-link" href="/legal/internet-services/terms/site.html">Terms of Use </a> <a class="as-globalfooter-mini-legal-link" href="/us/shop/goto/help/sales_refunds">Sales
and Refunds </a> <a class="as-globalfooter-mini-legal-link" href="/legal/">Legal </a> <a class="as-globalfooter-mini-legal-link" href="/sitemap/">Site Map </a> </div></div></footer> <script src="https://www.apple.com/v/errors/c/built/scripts/main.built.js" type="text/javascript" charset="utf-8"></script></body></html>
First of all, Playwright already has a full suite of selectors that work on the live page. Skipping BS eliminates a dependency, speeds up your scrape, uses less code, and avoids weird errors when the static HTML snapshot gets out of sync with the live page.
On to the main problem: you did well to print the HTML to see what sort of response you're dealing with. The 404 page indicates you've been detected as a bot when running headlessly. This kind of detection can also manifest as a captcha, a Cloudflare browser check page, or another "are you a robot?" notice.
As with everything in scraping, there's no one-size-fits-all solution, but one typical approach is to set a custom user agent string:
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    ua = (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/69.0.3497.100 Safari/537.36"
    )
    url = (
        "https://www.apple.com/br/shop/product/MV7N2BE/A/airpods-com-estojo-de-recarga"
    )
    page = browser.new_page(user_agent=ua)
    page.goto(url, wait_until="domcontentloaded")
    sel = "span.as-price-installments:last-child"
    text = (
        page.wait_for_selector(sel)
        .text_content()
        .replace("à vista (10% de desconto)", "")
        .strip()
    )
    print(text) # => R$ 1.399,50
    browser.close()
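Since a blocked request can come back as an ordinary-looking page, it may help to check what was actually received before indexing into selector results. The following is a rough sketch using only the standard library. The helper names and the marker strings are hypothetical and would need tuning for each site.

```python
import re

# Hypothetical title fragments that suggest an error or bot-check page.
BLOCK_MARKERS = ("page not found", "captcha", "are you a robot")


def page_title(html: str) -> str:
    """Return the text of the first <title> element, or "" if there is none."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else ""


def looks_blocked(html: str) -> bool:
    """Heuristic: does the page title contain one of the block markers?"""
    title = page_title(html).lower()
    return any(marker in title for marker in BLOCK_MARKERS)


blocked_html = "<!DOCTYPE html><html><head> <title>Page Not Found - Apple</title></head></html>"
normal_html = "<html><head><title>AirPods</title></head><body></body></html>"

print(looks_blocked(blocked_html))
print(looks_blocked(normal_html))
```

Calling a check like this on page.content() before parsing turns a confusing IndexError into an explicit "blocked" signal.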
The html code is :
<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell@fbi.gov</div>
<div class="small">256-313-8835</div>
</div>
I want to get the text inside the first <div> tag, i.e. Steven Cantrell.
I need a way to get the contents of the tag that follows a given tag. In this case, the given tag is 'span',{'class':'small text-muted'}.
What I tried is :
rfq_name = soup.find('span',{'class':'small text-muted'})
print(rfq_name.next)
But this printed Contact instead of the name.
You're nearly there, just change your print to: print(rfq_name.find_next('div').text)
Find the element that has the text "Contact". Then use .find_next() to get the next <div> tag.
from bs4 import BeautifulSoup
html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell@fbi.gov</div>
<div class="small">256-313-8835</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find(text='Contact').find_next('div').text
Output:
print(contact)
Steven Cantrell
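For comparison, the same "first div after the label" idea can be sketched without BeautifulSoup, using the standard library's html.parser. The class name NextDivText is hypothetical, and the sketch assumes the target div contains no nested div.

```python
from html.parser import HTMLParser


class NextDivText(HTMLParser):
    """Collect the text of the first <div> that opens after a given label text."""

    def __init__(self, label):
        super().__init__()
        self.label = label
        self.seen_label = False   # True once the label text has been seen
        self.in_target = False    # True while inside the div being captured
        self.result = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # Start capturing at the first div that opens after the label.
        if tag == "div" and self.seen_label and self.result is None:
            self.in_target = True

    def handle_data(self, data):
        if self.in_target:
            self._chunks.append(data)
        elif data.strip() == self.label:
            self.seen_label = True

    def handle_endtag(self, tag):
        # Assumes no nested div inside the target div.
        if tag == "div" and self.in_target:
            self.result = "".join(self._chunks).strip()
            self.in_target = False


html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
</div>'''

parser = NextDivText("Contact")
parser.feed(html)
print(parser.result)
```

In practice find_next() is shorter. This version only illustrates what "next div in document order" means.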
I am working on a project to scrape data. I have a for loop that runs through 50 URLs (all the same page layout with different information) and extracts different fields to add to a CSV. The problem is that when I try to extract 'job_title', many of the entries come back as None even though the value exists on the page. The HTML seems to be the same for each URL, but 10 of the 50 URLs yield None for the following lines of code. For the example below, I need the code to set job_title = 'Founder'.
This is the code I am currently using:
sel = Selector(text=driver.page_source)
job_title = sel.xpath('//*[starts-with(@class, "t-16 t-black t-bold")]/text()').extract_first()
Here is the HTML from one of the URLs where I was unable to extract job_title, which is 'Founder' in this case. It is on the second line of the snippet.
<div class="pv-entity__summary-info pv-entity__summary-info--background-section mb2">
<h3 class="t-16 t-black t-bold">Founder</h3>
<p class="visually-hidden">Company Name</p>
<p class="pv-entity__secondary-title t-14 t-black t-normal">
Genamint
<span class="pv-entity__secondary-title separator">Full-time</span>
</p>
<div class="display-flex">
<h4 class="pv-entity__date-range t-14 t-black--light t-normal">
<span class="visually-hidden">Dates Employed</span>
<span>Mar 2020 – Present</span>
</h4>
<h4 class="t-14 t-black--light t-normal">
<span class="visually-hidden">Employment Duration</span>
<span class="pv-entity__bullet-item-v2">5 mos</span>
</h4>
</div>
<h4 class="pv-entity__location t-14 t-black--light t-normal block">
<span class="visually-hidden">Location</span>
<span>New York, United States</span>
</h4>
<!---->
</div>
Any help would be appreciated.
Both those lines grab this HTML. <h3 class="nav-settings__member-name t-16 t-black t-bold"> Ethan Roberti </h3>
There's no nav-settings__member-name in your sample data. Since you're using extract_first(), you get the first result in document order. One way to fix it would be:
(//div[contains(@class,"pv-entity__summary")])[1]//h3/text()
Output : Founder
Assuming you're trying to scrape LinkedIn, to get the current or last job of a person, use the following XPath:
(//section[@class="experience pp-section" or @id="experience-section"]//h3)[1]/text()
For example, for https://www.linkedin.com/in/ethan-roberti-322694174, you'll get :
Output : Summer Analyst
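The effect of scoping the search to the summary container can be sketched with the standard library. The markup below is a hypothetical, well-formed stand-in that assumes the navigation h3 appears before the experience entry in the page. ElementTree has no contains(), so the class check is done in Python.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page: a nav heading followed by one experience entry.
SAMPLE = """
<body>
  <h3 class="nav-settings__member-name t-16 t-black t-bold">Ethan Roberti</h3>
  <div class="pv-entity__summary-info pv-entity__summary-info--background-section mb2">
    <h3 class="t-16 t-black t-bold">Founder</h3>
  </div>
</body>
"""

root = ET.fromstring(SAMPLE)

# Unscoped: the first h3 in document order is the nav heading.
first_h3 = root.find(".//h3").text.strip()

# Scoped: take the first div whose class contains "pv-entity__summary",
# then the first h3 inside it.
summary_divs = [d for d in root.iter("div")
                if "pv-entity__summary" in d.get("class", "")]
job_title = summary_divs[0].find(".//h3").text.strip()

print(first_h3)
print(job_title)
```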
I am trying to use find_all to get a rather simple list of elements. No matter which parser I use, it always returns only a limited number of elements containing anything useful, and at some point all the following elements have no content although they clearly should. I have seen multiple posts where people had issues with it, but in those cases the result was always an empty list. I thought it might be because part of the HTML is generated while scrolling down, but that is not the case.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.pracuj.pl/praca/analityk%20danych;kw/warszawa;wp?rd=0'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find(id='results')
job_elems = results.find_all('li', class_='results__list-container-item')
for job_elem in job_elems:
    #title_elem = job_elem.find('a', class_='offer-details__title-link')
    #company_elem = job_elem.find('a', class_='offer-company__name')
    #location_elem = job_elem.find('li' ,class_='offer-labels__item offer-labels__item--location')
    #if title_elem is None:
    #    continue
    #print(title_elem.text.strip())
    #print(company_elem.text.strip())
    #print(location_elem.text.strip())
    print()
    print(job_elem)
EDIT: Sorry for being unclear. As @TanmayaMeher noted, I did not paste any HTML because the link is available in the code and I assumed the page would be easier to inspect directly.
The picture I provided was supposed to show the part of the output where the problem starts. Please see part of the output below as text. The first element is the last one that is parsed as I expect, and the following lines are 'li' elements that contain nothing, while I expect them to look like the correct one.
<li class="results__list-container-item">
<div class="offer offer--border offer--remoterecruitment">
<div class="offer__click">
<a class="offer__click-area" href="https://www.pracuj.pl/praca/chapter-lead-data-standardization-risk-area-warszawa,oferta,1000235456"></a>
</div>
<div class="offer__info">
<div class="offer-details">
<div class="offer-logo">
<img alt="logo" class="offer-logo__image" src="https://i.gpcdn.pl/oferty-loga-firm/wyniki-wyszukiwania/44864.png"/>
</div>
<div class="offer-details__text">
<h3 class="offer-details__title">
<a class="offer-details__title-link" href="https://www.pracuj.pl/praca/chapter-lead-data-standardization-risk-area-warszawa,oferta,1000235456">Chapter lead – Data Standardization (Risk Area)</a>
</h3>
<p class="offer-company">
<span class="offer-company__link-wrapper"></span>
<span class="offer-company__wrapper">
<a class="offer-company__name" href="https://pracodawcy.pracuj.pl/company/20058995/profile">ING Tech Poland</a>
</span>
</p>
</div>
<div class="offer-details__badge-wrap offer-details__badge-wrap--remoterecruitment">
<i class="mdi mdi-cellphone-message offer-details__badge-icon"></i>
<span class="offer-details__badge-name offer-details__badge-name--remoterecruitment">Rekrutacja zdalna</span>
</div>
</div>
<div class="offer-labels__wrapper">
<ul class="offer-labels">
<li class="offer-labels__item offer-labels__item--location">
<i class="mdi mdi-map-marker offer-labels__item-icon"></i>Warszawa </li>
</ul>
<ul class="offer-labels">
<li class="offer-labels__item">
</li>
</ul>
</div>
<div class="offer-description">
<input class="offer-description__toggler" id="offer-description---cid-23435479" type="checkbox"/>
<label class="offer-description__toggler-label" for="offer-description---cid-23435479">
<i class="mdi mdi-chevron-down offer-description__toggler-icon"></i>
</label>
<div class="offer-description__content-wrap">
<span class="offer-description__content">
Must have You are open for other people and eager to take on new challenges, You are passionate about working with people and developing talents of others make you fulfilled, You prefer to concentrate on quality, innovation of created products...
</span>
</div>
</div>
</div>
<div class="offer-regions__port"></div>
<div class="offer-actions">
<span class="offer-actions__date">
<span class="offer-actions__date-long">opublikowana: </span>13 cze<span class="offer-actions__date-long">rwca</span> 2020
</span>
<div class="offer-actions__favs"></div>
</div>
</div>
</li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
<li class="results__list-container-item"></li>
The data is embedded in the page in the form of JavaScript window.__INITIAL_STATE__ variable. You can parse it with re/json module.
For example:
import re
import json
import requests
url = 'https://www.pracuj.pl/praca/analityk%20danych;kw/warszawa;wp?rd=0'
html_text = requests.get(url).text
data = json.loads( re.search(r'window\.__INITIAL_STATE__ = (.*?\});', html_text).group(1) )
# uncomment this to print all data:
# print(json.dumps(data, indent=4))
# print some data to screen:
for offer in data['offers']:
    print('{:<10}{:<80}{}'.format(offer['commonOfferId'], offer['jobTitle'], offer['employer']))
Prints:
23433957 Senior Business Intelligence Analyst w Zespole Data Intelligence Solutions KPMG
23436174 Quantitative Associate (Model Validation) ING Tech Poland
23436175 Data Analyst (Risk Modelling) (Risk Hub) ING Tech Poland
23436664 Reports Developer (VBA) Randstad Polska Sp. z o.o.
23440135 Firmwide Data Management - Data Integration - Change Project Lead – Associate J.P. Morgan Poland Services sp. z o.o.
23440182 Treasury – Product Control (P&L and Risk) Analyst J.P. Morgan Poland Services sp. z o.o.
23441295 Brand Reporting Manager JTI Polska sp. z o.o.
... and so on.
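The non-greedy regex stops at the first "};" it meets, which could cut the JSON short if that sequence ever appears inside a string value. One alternative, sketched here, is to let json.JSONDecoder.raw_decode find the end of the object. The embedded page text below is made-up sample data, not the real response.

```python
import json

MARKER = "window.__INITIAL_STATE__ ="


def extract_initial_state(html_text: str) -> dict:
    """Parse the JSON object assigned to window.__INITIAL_STATE__ in html_text."""
    start = html_text.find(MARKER)
    if start == -1:
        raise ValueError("window.__INITIAL_STATE__ not found in page")
    start += len(MARKER)
    # raw_decode does not skip leading whitespace, so do that here.
    while start < len(html_text) and html_text[start].isspace():
        start += 1
    data, _end = json.JSONDecoder().raw_decode(html_text, start)
    return data


# Made-up sample; the "};" inside the job title would truncate a non-greedy regex match.
sample_html = (
    '<script>window.__INITIAL_STATE__ = {"offers": ['
    '{"commonOfferId": 1, "jobTitle": "Analyst };", "employer": "Example Sp. z o.o."}'
    ']};</script>'
)

state = extract_initial_state(sample_html)
titles = [offer["jobTitle"] for offer in state["offers"]]
print(titles)
```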
I've been building a web scraper in BS4 and have gotten stuck. I am using Trip Advisor as a test for other data I will be going after, but am not able to isolate the tag of the 'entire' reviews. Here is an example:
https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html
Notice in the first review, there is an icon below "the wine list is...". I am able to easily isolate the partial reviews, but have not been able to figure out a way to get BS4 to pull the reviews after a simulated 'More' click. I'm trying to figure out what tool(s) are needed for this? Do I need to use selenium instead?
The original element looks like this:
<span class="partnerRvw">
<span class="taLnk hvrIE6 tr475091998 moreLink ulBlueLinks" onclick=" ta.util.cookie.setPIDCookie(4444); ta.call('ta.servlet.Reviews.expandReviews', {type: 'dummy'}, ta.id('review_475091998'), 'review_475091998', '1', 4444);
">
More </span>
<span class="ui_icon caret-down"></span>
</span>
Looking at the HTML after you click on the More link, you would find a new dynamically added element whose class contains a <p> with the information I need (see below):
<div class="review dyn_full_review inlineReviewUpdate provider0 first newFlag" style="display: block;">
<a name="UR475091998" class=""></a>
<div id="UR475091998" class="extended provider0 first newFlag">
<div class="col1of2">
<div class="member_info">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-SRC_475091998" class="memberOverlayLink" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorwidth="90">
<div class="avatar profile_6875524F623CC948F4F9CA95BB4A9567 ">
<a onclick="">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/0d/97/43/bf/joannecarpenter.jpg" class="avatar potentialFacebookAvatar avatarGUID:6875524F623CC948F4F9CA95BB4A9567" width="74" height="74">
</a>
</div>
<div class="username mo">
<span class="expand_inline scrname mbrName_6875524F623CC948F4F9CA95BB4A9567" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">joannecarpenter</span>
</div>
</div>
<div class="location">
Humble, Texas
</div>
</div>
<div class="memberBadging g10n">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-CONT" class="no_cpu" onclick="ta.util.cookie.setPIDCookie('15984'); requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'review_count');" data-anchorwidth="90">
<div class="levelBadge badge lvl_02">
Level <span><img src="https://static.tacdn.com/img2/badges/20px/lvl_02.png" alt="" class="icon" width="20" height="20/"></span> Contributor </div>
<div class="reviewerBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/rev_03.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 reviews</span> </div>
<div class="contributionReviewBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/Foodie.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 restaurant reviews</span>
</div>
</div>
</div>
</div>
<div class="col2of2">
<div class="innerBubble">
<div class="quote">“<span class="noQuotes">Dinner</span>”</div>
<div class="rating reviewItemInline">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s50" width="70" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="ratingDate relativeDate" title="April 12, 2017">Reviewed 3 days ago
<span class="new redesigned">NEW</span> </span>
<a class="viaMobile" href="/apps" target="_blank" onclick="ta.util.cookie.setPIDCookie(24687)">
<span class="ui_icon mobile-phone"></span>
via mobile
</a>
</div>
<div class="entry">
<p>
Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!
</p>
</div>
<div class="rating-list">
<div class="recommend">
<span class="recommend-titleInline noRatings">Visited April 2017</span>
</div>
</div>
<div class="expanded lessLink">
<span class="taLnk collapse ulBlueLinks no_cpu ">
Less
</span>
<span class="textArrow_more ui_icon caret-up"></span>
</div>
<div id="helpfulq475091998_expanded" class="helpful redesigned white_btn_container ">
<span class="isHelpful">Helpful?</span> <div class="tgt_helpfulq475091998 rnd_white_thank_btn" onclick="ta.call('ta.servlet.Reviews.helpfulVoteHandlerOb', event, this, 'LeJIVqd4EVIpECri1GII2t6mbqgqguuuxizSxiniaqgeVtIJpEJCIQQoqnQQeVsSVuqHyo3KUKqHMdkKUdvqHxfqHfGVzCQQoqnQQZiptqH5paHcVQQoqnQQrVxEJtxiGIac6XoXmqoTpcdkoKAUAAv0tEn1dkoKAUAAv0zH1o3KUK0pSM13vkooXdqn3XmffAdvqndqnAfbAo77dbAo3k0npEEeJIV1K0EJIVqiJcpV1U0Ii9VC1rZlU3XozxbZZxE2crHN2TDUJiqnkiuzsVEOxdkXqi7TxXpUgyR2xXvOfROwaqILkrzz9MvzCxMva7xEkq8xXNq8ymxbAq8AzzrhhzCxbx2vdNvEn2fnwEfq8alzCeqi53ZrgnMrHhshTtowGpNSmq89IwiVb7crUJxdevaCnJEqI33qiE5JGErJExXKx5ooItGCy5wnCTx2VA7RvxEsO3'); ta.trackEventOnPage('HELPFUL_VOTE_TEST', 'helpfulvotegiven_v2');">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_white.png" class="helpful_thumbs_up white">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_green.png" class="helpful_thumbs_up green">
<span class="helpful_text">Thank joannecarpenter</span> </div>
</div>
<div class="tooltips vertically_centered">
<div class="reportProblem">
<span id="ReportIAP_475091998" class="problem collapsed taLnk" onclick="ta.trackEventOnPage('Report_IAP', 'Report_Button_Clicked', 'member'); ta.call('ta.servlet.Reviews.iapFlyout', event, this, '475091998')" onmouseover="if (!this.getAttribute('data-first')) {ta.trackEventOnPage('Reviews', 'report_problem', 'hover_over_flag'); this.setAttribute('data-first', 1)} uiOverlay(event, this)" data-tooltip="" data-position="above" data-content="Problem with this review?">
<img src="https://static.tacdn.com/img2/icons/gray_flag.png" width="13" height="14" alt="">
<span class="reportTxt">Report</span> </span>
</div>
</div>
<div class="userLinks">
<div class="sameGeoActivity">
<a href="/members-citypage/joannecarpenter/g56010" target="_blank" onclick="ta.setEvtCookie('Reviews','more_reviews_by_user','',0,this.href); ta.util.cookie.setPIDCookie(19160)">
See all 5 reviews by joannecarpenter for Humble </a>
</div>
<div class="askQuestion">
<span class="taLnk ulBlueLinks" onclick="ta.trackEventOnPage('answers_review','ask_user_intercept_click' ); ta.load('ta-answers', (function() {require('answers/misc').askReviewerIntercept(this, '470148', 'joannecarpenter', '6875524F623CC948F4F9CA95BB4A9567', 'en', '475091998','Chez Nous', 39151)}).bind(this), true);">Ask joannecarpenter about Chez Nous</span>
</div>
</div>
<div class="note">
This review is the subjective opinion of a TripAdvisor member and not of TripAdvisor LLC. </div>
<div class="duplicateReviewsInline">
<div class="previous">joannecarpenter has 1 more review of Chez Nous</div> <ul class="dupReviews">
<li class="dupReviewItem">
<div class="reviewTitle">
“Joanne Carpenter”
</div>
<div class="rating">
<span class="rate sprite-rating_ss rating_ss"> <img class="sprite-rating_ss_fill rating_ss_fill ss50" width="50" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="date">Reviewed January 18, 2017</span>
</div>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="large">
</div>
<div class="ad iab_inlineBanner">
<div id="gpt-ad-468x60" class="adInner gptAd"></div>
</div>
</div>
Is there a way for BS4 to handle this for me?
Here's a simple example to get you started:
import selenium
from selenium import webdriver
driver = webdriver.PhantomJS()
url = "https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html"
driver.get(url)
elem = driver.find_element_by_class_name("taLnk")
...
You could find more info about the methods here:
http://selenium-python.readthedocs.io/
In all likelihood you will need to examine a few more of these pages, to identify variations in the HTML code. For the sample you have offered, and given that you are able to obtain it by simulating a press, the following code works to select the paragraph that you seem to want.
from bs4 import BeautifulSoup
HTML = open('temp.htm').read()
soup = BeautifulSoup(HTML, 'lxml')
para = soup.select('.entry > p')
print(para[0].text)
Result:
Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!
Note that there are newlines before and after the paragraph.