I am learning Scrapy and building a simple crawler to reinforce what I am learning. I am trying to get the next-page link but am having trouble. Can anyone point me in the right direction? The link is located in the a of the final li.
The pagination div is as follows:
<div class="pagination pagination-small hidden-phone">
<ul>
<li><a href="./viewforum.php?f=399&start=40" data-original-title="" title=""><i
class="icon-chevron-left"></i></a></li>
<li>1</li>
<span class="page-sep">, </span>
<li>2</li>
<span class="page-sep">, </span>
<li class="active"><a data-original-title="" title="">3</a></li>
<span class="page-sep">, </span>
<li>4</li>
<span class="page-sep">, </span>
<li>5</li>
<span class="page-sep">, </span>
<li>6</li>
<li class="active"><a class="pointer-fix" href="#" onclick="jumpto(); return false;" title=""
data-original-title="Jump to page"> ... </a></li>
<li>10012</li>
<li><a href="./viewforum.php?f=399&start=120" data-original-title="" title=""><i
class="icon-chevron-right"></i></a></li>
</ul>
</div>
I have tried different variations of the following, but the wrong li is returned: I still get the class="active" li even though I used li:not([class="page-sep, active"]):
response.css('div.pagination.pagination-small.hidden-phone').css('li:not([class="page-sep, active"])').get()
example:
>>> response.css('div.pagination.pagination-small.hidden-phone').css('li:not([class="active, page-sep"])').get()
'<li class="active"><a>1</a></li>'
Thanks
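The attribute selector [class="page-sep, active"] only matches an element whose class attribute is exactly that string, so :not(...) excludes nothing. Since the link is in the last li of the list, we can use that to our advantage.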
css:
In [1]: response.css('div.pagination li:last-child a::attr(href)').get()
Out[1]: './viewforum.php?f=399&start=120'
xpath:
In [2]: response.xpath('//div[contains(@class, "pagination")]//li[last()]/a/@href').get()
Out[2]: './viewforum.php?f=399&start=120'
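The href that comes back is relative, so it has to be resolved against the URL of the current page before it can be requested. Inside a spider, Scrapy's response.urljoin(...) or response.follow(...) take care of this. The sketch below shows the same resolution with the standard library only; the forum host name is a made-up placeholder, since the question does not give the real URL.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page currently being parsed (not given in the question).
page_url = "https://forum.example.com/viewforum.php?f=399&start=80"

# Relative href extracted from the last <li> of the pagination block.
next_href = "./viewforum.php?f=399&start=120"

# Resolve the relative link against the page URL.
next_url = urljoin(page_url, next_href)
print(next_url)
```

On the last page there may be no right-chevron link, in which case .get() returns None, so it is worth checking for that before building the next request.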
I'm writing a web scraper in Python using Beautiful Soup to get the Box Office amount, $64.3M, but I'm unable to do so.
<ul class="content-meta info">
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Box Office (Gross USA):</div>
<div class="meta-value" data-qa="movie-info-item-value">$64.3M</div>
</li>
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Runtime:</div>
<div class="meta-value" data-qa="movie-info-item-value">
<time datetime="P2h 4mM">
2h 4m
</time>
</div>
</li>
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Distributor:</div>
<div class="meta-value" data-qa="movie-info-item-value">
Universal Pictures
</div>
</li>
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Production Co:</div>
<div class="meta-value" data-qa="movie-info-item-value">
Universal Pictures,
Blumhouse Productions,
Dark Universe,
Goalpost Pictures
</div>
</li>
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Sound Mix:</div>
<div class="meta-value" data-qa="movie-info-item-value">
Dolby Atmos
</div>
</li>
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Aspect Ratio:</div>
<div class="meta-value" data-qa="movie-info-item-value">
Scope (2.35:1)
</div>
</li>
</ul>
I tried multiple syntaxes but nothing worked.
z = soup.find("ul").get("movie-info-item-value")
for tag in soup.find_all("ul"): print("{0}: {1}".format(tag.name, tag.text))
x = soup.select('movie-info-item-value')
x = soup.select('class').get('movie-info-item-value')
I'm new to python and webscraping. Any help will be deeply appreciated. TIA!!
Compare to XPath syntax (using lxml):
from lxml import html
....
tree = html.fromstring(content) # content here is a HTML content of your page
box_office = tree.xpath('string(//div[@data-qa="movie-info-item-label"][contains(., "Box Office")]/following-sibling::div[1]/text())')
A single expression extracts the information you need in a simple, human-readable way: find a div tag that has the given data-qa attribute and contains the given text, then extract the text of the following div. IMHO, this is much more readable than CSS selectors.
In find() and find_all() you can match on attributes, for example a <ul> element whose class attribute is "content-meta info", using either the class_ shortcut or a dictionary of attributes to match.
Try this:
from bs4 import BeautifulSoup
html = '''
<ul class="content-meta info">
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Box Office (Gross USA):</div>
<div class="meta-value" data-qa="movie-info-item-value">$64.3M</div>
</li>
...
</ul>'''
soup = BeautifulSoup(html, "html.parser")
elt = soup.find("ul", class_="content-meta info")\
.find('li', {'data-qa': 'movie-info-item'})\
.find('div', class_="meta-value")
print(elt.text)
If the HTML has multiple "ul" elements with the same class, try this instead to find the Box Office element first.
elt = (soup
.find(text="Box Office (Gross USA):")
.parent
.parent
.find('div', class_="meta-value")
)
print(elt.text)
Output:
$64.3M
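If more than one field is needed, the label/value pairs in this markup can be collected into a dict in one pass. Below is a minimal sketch that uses only the standard library's html.parser, so it runs without extra packages; the MovieInfoParser name is invented for illustration, and the sketch assumes the value divs never contain nested divs, as in the sample.

```python
from html.parser import HTMLParser


class MovieInfoParser(HTMLParser):
    """Collect label -> value pairs from the data-qa annotated divs."""

    def __init__(self):
        super().__init__()
        self.info = {}
        self._field = None   # "label" or "value" while inside such a div
        self._label = ""
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        qa = dict(attrs).get("data-qa", "")
        if tag == "div" and qa == "movie-info-item-label":
            self._field, self._buffer = "label", []
        elif tag == "div" and qa == "movie-info-item-value":
            self._field, self._buffer = "value", []

    def handle_data(self, data):
        if self._field:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != "div" or not self._field:
            return
        # Collapse newlines and repeated spaces into single spaces.
        text = " ".join("".join(self._buffer).split())
        if self._field == "label":
            self._label = text.rstrip(":")
        else:
            self.info[self._label] = text
        self._field = None


html = '''
<ul class="content-meta info">
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Box Office (Gross USA):</div>
<div class="meta-value" data-qa="movie-info-item-value">$64.3M</div>
</li>
<li class="meta-row clearfix" data-qa="movie-info-item">
<div class="meta-label subtle" data-qa="movie-info-item-label">Runtime:</div>
<div class="meta-value" data-qa="movie-info-item-value">
<time datetime="P2h 4mM">
2h 4m
</time>
</div>
</li>
</ul>'''

parser = MovieInfoParser()
parser.feed(html)
print(parser.info["Box Office (Gross USA)"])
```

Keying on the data-qa attributes rather than on element position means the lookup keeps working if the order of the rows changes.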
I have a script that collects all the images I want on a webpage; then I need the link that encloses each image.
Currently I click on every image, read the current page URL, then go back and continue with the work. This is slow. Each image is wrapped in an a tag, but I don't know how to retrieve that tag. With the tag it would be easier and faster. I attach the HTML code and my Python code!
HTML code
<div class="col-xl col-lg col-md-4 col-sm-6 col-6">
<a href="URL I WANT TO GET ">
<article>
<span class="year">2017</span>
<span class="quality">4K</span>
<span class="imdb">6.7</span>
<img width="190" height="279" src="THE IMAGE URL" class="img-full wp-post-image" alt="" loading="lazy"> <h2>TITLE</h2>
</article>
</a></div>
<div class="col-xl col-lg col-md-4 col-sm-6 col-6">
<a href="URL I WANT TO GET 2">
<article>
<span class="year">2019</span>
<span class="quality">4K</span>
<span class="imdb">8.0</span>
<img width="190" height="279" src="THE IMAGE URL 2" class="img-full wp-post-image" alt="" loading="lazy"> <h2>TITLE</h2>
</article>
</a></div>
Python code
self.driver.get(category_url)
WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.archivePaging'))) # a div to see if page is loaded
movies_buttons = self.driver.find_elements_by_css_selector('img.img-full.wp-post-image')
print("Getting all the links!")
for movie in movies_buttons:
self.driver.execute_script("arguments[0].scrollIntoView();", movie)
movie.click()
WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.CLASS_NAME, 'infoFilmSingle')))
print(self.driver.current_url)
self.driver.back()
WebDriverWait(self.driver, 10).until(EC.visibility_of_element_located((By.CSS_SELECTOR, 'div.archivePaging')))
Note that this code does not work as-is, because after going back I am calling a movie object from an old page. That's not a problem, though: if I could just read the link, I wouldn't need to change pages, so the session wouldn't change.
An example based on what I understand you want to do: you want to get the href of the parent a tag.
Example
from selenium import webdriver
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
html_content = """
<div class="col-xl col-lg col-md-4 col-sm-6 col-6">
<a href="https://www.link1.de">
<article>
<span class="year">2017</span>
<span class="quality">4K</span>
<span class="imdb">6.7</span>
<img width="190" height="279" src="THE IMAGE URL" class="img-full wp-post-image" alt="" loading="lazy"> <h2>TITLE</h2>
</article>
</a>
</div>
<div class="col-xl col-lg col-md-4 col-sm-6 col-6">
<a href="https://www.link2.de">
<article>
<span class="year">2019</span>
<span class="quality">4K</span>
<span class="imdb">8.0</span>
<img width="190" height="279" src="THE IMAGE URL 2" class="img-full wp-post-image" alt="" loading="lazy"> <h2>TITLE</h2>
</article>
</a>
</div>
"""
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
Locate the image elements by their class and walk up the element structure with .. (in this case /../..).
driver.get("data:text/html;charset=utf-8,{html_content}".format(html_content=html_content))
aTags = driver.find_elements_by_xpath("//img[contains(@class,'img-full wp-post-image')]/../..")
for ele in aTags:
x=ele.get_attribute('href')
print(x)
driver.close()
Output
https://www.link1.de/
https://www.link2.de/
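If the rendered HTML is already in hand (for example from driver.page_source), the enclosing links can also be read straight from the markup without clicking anything. This is a minimal standard-library sketch, assuming each target img sits inside an a tag as in the sample; the ImageLinkParser name is made up for illustration.

```python
from html.parser import HTMLParser


class ImageLinkParser(HTMLParser):
    """Record the href of the <a> that encloses each matching <img>."""

    def __init__(self, img_class):
        super().__init__()
        self.img_class = img_class
        self.links = []
        self._open_hrefs = []   # hrefs of the <a> tags that are currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a":
            self._open_hrefs.append(attrs.get("href"))
        elif tag == "img" and self._open_hrefs:
            classes = (attrs.get("class") or "").split()
            if self.img_class in classes:
                # The innermost open <a> is the one wrapping this image.
                self.links.append(self._open_hrefs[-1])

    def handle_endtag(self, tag):
        if tag == "a" and self._open_hrefs:
            self._open_hrefs.pop()


html_content = """
<div class="col-xl col-lg col-md-4 col-sm-6 col-6">
<a href="https://www.link1.de">
<article>
<span class="year">2017</span>
<img width="190" height="279" src="THE IMAGE URL" class="img-full wp-post-image" alt="" loading="lazy"> <h2>TITLE</h2>
</article>
</a>
</div>
<div class="col-xl col-lg col-md-4 col-sm-6 col-6">
<a href="https://www.link2.de">
<article>
<span class="year">2019</span>
<img width="190" height="279" src="THE IMAGE URL 2" class="img-full wp-post-image" alt="" loading="lazy"> <h2>TITLE</h2>
</article>
</a>
</div>
"""

parser = ImageLinkParser("wp-post-image")
parser.feed(html_content)
for link in parser.links:
    print(link)
```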
<li id="button1" class="on">
<div class="supply1">
<div class="buildingimg">
<a class="fastBuild tooltip js_hideTipOnMobile" title="Metallmine auf Stufe 4 ausbauen" href="javascript:void(0);" onclick="sendBuildRequest('https://s159-de.ogame.gameforge.com/game/index.php?page=resources&modus=1&type=1&menge=1&token=0c86d8a8bf9a5c559538b0e13cb462b4', null, 1);">
<img src="https://gf2.geo.gfsrv.net/cdndf/3e567d6f16d040326c7a0ea29a4f41.gif" width="22" height="14">
</a>
<a class="detail_button tooltip js_hideTipOnMobile slideIn" title="" ref="1" id="details" href="javascript:void(0);">
<span class="ecke">
<span class="level">
<span class="textlabel">
**Metallmine**
</span>
**3** </span>
</span>
</a>
</div>
</div>
</li>
<li id="button2" class="on">
<div class="supply2">
<div class="buildingimg">
<a class="fastBuild tooltip js_hideTipOnMobile" title="" href="javascript:void(0);" onclick="sendBuildRequest('https://s159-de.ogame.gameforge.com/game/index.php?page=resources&modus=1&type=2&menge=1&token=0c86d8a8bf9a5c559538b0e13cb462b4', null, 1);">
<img src="https://gf2.geo.gfsrv.net/cdndf/3e567d6f16d040326c7a0ea29a4f41.gif" width="22" height="14">
</a>
<a class="detail_button tooltip js_hideTipOnMobile slideIn" title="" ref="2" id="details" href="javascript:void(0);">
<span class="ecke">
<span class="level">
<span class="textlabel">
**Kristallmine**
</span>
**1** </span>
</span>
</a>
</div>
</div>
</li>
Dear Community,
I want to create a bot for a browser game (just for learning purposes, of course). In the game you can build and level up metal and crystal mines to get more resources. For the best resource proportions, the metal mine should always be 2 levels higher than the crystal mine. Writing the code to compare the levels is no problem, but I'm having trouble accessing the actual "level" values of the mines, since they have no unique attribute.
Above in the code you can see the "Metallmine" and "Kristallmine" and the corresponding levels. I would like to write a code similar to:
if LevelOfKristallmine - LevelOfMetallmine < -2:
    driver.find_element_by_whatever('upgradebutton').click()
How can I get the values of LevelOfKristallmine and LevelOfMetallmine?
Thanks a lot for your answers!
I assume you are trying to use the ID to get the values? Instead, copy and paste the element's XPath and use something like:
driver.find_element_by_xpath('//*[@id="example-xpath"]/div/nav/ol').click()
To copy the XPath: press F12, find the element to click, right-click it, and choose Copy > XPath. Then paste it inside the parentheses. Follow this other link and you should be able to figure it out.
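Once the text of a level element has been read, the number still has to be separated from the mine's name. The sketch below covers only that step and the comparison from the question. It assumes the text comes from something like the .text of the span with class "level" inside #button1 or #button2 (an assumption based on the posted HTML); the helper names parse_level and crystal_needs_upgrade are made up for illustration.

```python
import re


def parse_level(level_text):
    """Return the trailing integer from text such as 'Metallmine 3'."""
    match = re.search(r"(\d+)\s*$", level_text)
    if match is None:
        raise ValueError(f"no level found in {level_text!r}")
    return int(match.group(1))


def crystal_needs_upgrade(metal_level, crystal_level, gap=2):
    """Mirror the question's check: crystal - metal < -gap."""
    return crystal_level - metal_level < -gap


# Text as it might be read from the two span.level elements (assumed shape).
metal = parse_level("Metallmine\n3")
crystal = parse_level("Kristallmine\n1")
print(metal, crystal, crystal_needs_upgrade(metal, crystal))
```

Keeping the parsing and the comparison in plain functions makes them easy to test without a running browser.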
I've been building a web scraper in BS4 and have gotten stuck. I am using Trip Advisor as a test for other data I will be going after, but am not able to isolate the tag of the 'entire' reviews. Here is an example:
https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html
Notice in the first review, there is an icon below "the wine list is...". I am able to easily isolate the partial reviews, but have not been able to figure out a way to get BS4 to pull the reviews after a simulated 'More' click. I'm trying to figure out what tool(s) are needed for this? Do I need to use selenium instead?
The original element looks like this:
<span class="partnerRvw">
<span class="taLnk hvrIE6 tr475091998 moreLink ulBlueLinks" onclick=" ta.util.cookie.setPIDCookie(4444); ta.call('ta.servlet.Reviews.expandReviews', {type: 'dummy'}, ta.id('review_475091998'), 'review_475091998', '1', 4444);
">
More </span>
<span class="ui_icon caret-down"></span>
</span>
Looking at the HTML after you click on the More link, you will find a new, dynamically added element that holds the information I need (see below):
<div class="review dyn_full_review inlineReviewUpdate provider0 first newFlag" style="display: block;">
<a name="UR475091998" class=""></a>
<div id="UR475091998" class="extended provider0 first newFlag">
<div class="col1of2">
<div class="member_info">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-SRC_475091998" class="memberOverlayLink" onmouseover="requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'user_name_photo');" data-anchorwidth="90">
<div class="avatar profile_6875524F623CC948F4F9CA95BB4A9567 ">
<a onclick="">
<img src="https://media-cdn.tripadvisor.com/media/photo-l/0d/97/43/bf/joannecarpenter.jpg" class="avatar potentialFacebookAvatar avatarGUID:6875524F623CC948F4F9CA95BB4A9567" width="74" height="74">
</a>
</div>
<div class="username mo">
<span class="expand_inline scrname mbrName_6875524F623CC948F4F9CA95BB4A9567" onclick="ta.trackEventOnPage('Reviews', 'show_reviewer_info_window', 'user_name_name_click')">joannecarpenter</span>
</div>
</div>
<div class="location">
Humble, Texas
</div>
</div>
<div class="memberBadging g10n">
<div id="UID_6875524F623CC948F4F9CA95BB4A9567-CONT" class="no_cpu" onclick="ta.util.cookie.setPIDCookie('15984'); requireCallIfReady('members/memberOverlay', 'initMemberOverlay', event, this, this.id, 'Reviews', 'review_count');" data-anchorwidth="90">
<div class="levelBadge badge lvl_02">
Level <span><img src="https://static.tacdn.com/img2/badges/20px/lvl_02.png" alt="" class="icon" width="20" height="20/"></span> Contributor </div>
<div class="reviewerBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/rev_03.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 reviews</span> </div>
<div class="contributionReviewBadge badge">
<img src="https://static.tacdn.com/img2/badges/20px/Foodie.png" alt="" class="icon" width="20" height="20">
<span class="badgeText">6 restaurant reviews</span>
</div>
</div>
</div>
</div>
<div class="col2of2">
<div class="innerBubble">
<div class="quote">“<span class="noQuotes">Dinner</span>”</div>
<div class="rating reviewItemInline">
<span class="rate sprite-rating_s rating_s"> <img class="sprite-rating_s_fill rating_s_fill s50" width="70" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="ratingDate relativeDate" title="April 12, 2017">Reviewed 3 days ago
<span class="new redesigned">NEW</span> </span>
<a class="viaMobile" href="/apps" target="_blank" onclick="ta.util.cookie.setPIDCookie(24687)">
<span class="ui_icon mobile-phone"></span>
via mobile
</a>
</div>
<div class="entry">
<p>
Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!
</p>
</div>
<div class="rating-list">
<div class="recommend">
<span class="recommend-titleInline noRatings">Visited April 2017</span>
</div>
</div>
<div class="expanded lessLink">
<span class="taLnk collapse ulBlueLinks no_cpu ">
Less
</span>
<span class="textArrow_more ui_icon caret-up"></span>
</div>
<div id="helpfulq475091998_expanded" class="helpful redesigned white_btn_container ">
<span class="isHelpful">Helpful?</span> <div class="tgt_helpfulq475091998 rnd_white_thank_btn" onclick="ta.call('ta.servlet.Reviews.helpfulVoteHandlerOb', event, this, 'LeJIVqd4EVIpECri1GII2t6mbqgqguuuxizSxiniaqgeVtIJpEJCIQQoqnQQeVsSVuqHyo3KUKqHMdkKUdvqHxfqHfGVzCQQoqnQQZiptqH5paHcVQQoqnQQrVxEJtxiGIac6XoXmqoTpcdkoKAUAAv0tEn1dkoKAUAAv0zH1o3KUK0pSM13vkooXdqn3XmffAdvqndqnAfbAo77dbAo3k0npEEeJIV1K0EJIVqiJcpV1U0Ii9VC1rZlU3XozxbZZxE2crHN2TDUJiqnkiuzsVEOxdkXqi7TxXpUgyR2xXvOfROwaqILkrzz9MvzCxMva7xEkq8xXNq8ymxbAq8AzzrhhzCxbx2vdNvEn2fnwEfq8alzCeqi53ZrgnMrHhshTtowGpNSmq89IwiVb7crUJxdevaCnJEqI33qiE5JGErJExXKx5ooItGCy5wnCTx2VA7RvxEsO3'); ta.trackEventOnPage('HELPFUL_VOTE_TEST', 'helpfulvotegiven_v2');">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_white.png" class="helpful_thumbs_up white">
<img src="https://static.tacdn.com/img2/icons/icon_thumb_green.png" class="helpful_thumbs_up green">
<span class="helpful_text">Thank joannecarpenter</span> </div>
</div>
<div class="tooltips vertically_centered">
<div class="reportProblem">
<span id="ReportIAP_475091998" class="problem collapsed taLnk" onclick="ta.trackEventOnPage('Report_IAP', 'Report_Button_Clicked', 'member'); ta.call('ta.servlet.Reviews.iapFlyout', event, this, '475091998')" onmouseover="if (!this.getAttribute('data-first')) {ta.trackEventOnPage('Reviews', 'report_problem', 'hover_over_flag'); this.setAttribute('data-first', 1)} uiOverlay(event, this)" data-tooltip="" data-position="above" data-content="Problem with this review?">
<img src="https://static.tacdn.com/img2/icons/gray_flag.png" width="13" height="14" alt="">
<span class="reportTxt">Report</span> </span>
</div>
</div>
<div class="userLinks">
<div class="sameGeoActivity">
<a href="/members-citypage/joannecarpenter/g56010" target="_blank" onclick="ta.setEvtCookie('Reviews','more_reviews_by_user','',0,this.href); ta.util.cookie.setPIDCookie(19160)">
See all 5 reviews by joannecarpenter for Humble </a>
</div>
<div class="askQuestion">
<span class="taLnk ulBlueLinks" onclick="ta.trackEventOnPage('answers_review','ask_user_intercept_click' ); ta.load('ta-answers', (function() {require('answers/misc').askReviewerIntercept(this, '470148', 'joannecarpenter', '6875524F623CC948F4F9CA95BB4A9567', 'en', '475091998','Chez Nous', 39151)}).bind(this), true);">Ask joannecarpenter about Chez Nous</span>
</div>
</div>
<div class="note">
This review is the subjective opinion of a TripAdvisor member and not of TripAdvisor LLC. </div>
<div class="duplicateReviewsInline">
<div class="previous">joannecarpenter has 1 more review of Chez Nous</div> <ul class="dupReviews">
<li class="dupReviewItem">
<div class="reviewTitle">
“Joanne Carpenter”
</div>
<div class="rating">
<span class="rate sprite-rating_ss rating_ss"> <img class="sprite-rating_ss_fill rating_ss_fill ss50" width="50" src="https://static.tacdn.com/img2/x.gif" alt="5 of 5 bubbles">
</span>
<span class="date">Reviewed January 18, 2017</span>
</div>
</li>
</ul>
</div>
</div>
</div>
</div>
<div class="large">
</div>
<div class="ad iab_inlineBanner">
<div id="gpt-ad-468x60" class="adInner gptAd"></div>
</div>
</div>
Is there a way for BS4 to handle this for me?
Here's a simple example to get you started:
import selenium
from selenium import webdriver
driver = webdriver.PhantomJS()
url = "https://www.tripadvisor.com/Restaurant_Review-g56010-d470148-Reviews-Chez_Nous-Humble_Texas.html"
driver.get(url)
elem = driver.find_element_by_class_name("taLnk")
...
You could find more info about the methods here:
http://selenium-python.readthedocs.io/
In all likelihood you will need to examine a few more of these pages, to identify variations in the HTML code. For the sample you have offered, and given that you are able to obtain it by simulating a press, the following code works to select the paragraph that you seem to want.
from bs4 import BeautifulSoup
HTML = open('temp.htm').read()
soup = BeautifulSoup(HTML, 'lxml')
para = soup.select('.entry > p')
print (para[0].text)
Result:
Our favorite restaurant in Houston. Definitely the best and friendliest service! The food is not only served with a flair, it is absolutely delicious. My favorite is the Lamb. It is the best! Also the duck moose, fois gras, the crispy salad and the French onion soup are all spectacular! This is a must try restaurant! The wine list is fantastic. Just ask Daniel for suggestions. He not only knows his wines; he loves what he does! We Love this place!
Note that there are newlines before and after the paragraph.
I'm trying to collect text using BS4, Selenium and Python. I want to get the text "Lisa Staprans" using:
name = str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").div.get_text().encode("utf-8"))[2:-1]
Here is the code:
<div class="profile-about-right">
<div class="text-bold">
SF Peninsula Interior Design Firm
<br/>
Best of Houzz 2015
</div>
<br/>
<div class="page-tags" style="display:none">
page_type: pro_plus_profile
</div>
<div class="pro-info-horizontal-list text-m text-dt-s">
<div class="info-list-label">
<i class="hzi-font hzi-Ruler">
</i>
<div class="info-list-text">
<span class="hide" itemscope="" itemtype="http://data-vocabulary.org/Breadcr
umb">
<a href="http://www.houzz.com/professionals/c/Menlo-Park--CA" itemprop="url
">
<span itemprop="title">
Professionals
</span>
</a>
</span>
<span itemprop="child" itemscope="" itemtype="http://data-vocabulary.org/Bre
adcrumb">
<a href="http://www.houzz.com/professionals/interior-designer/c/Menlo-Park-
-CA" itemprop="url">
<span itemprop="title">
Interior Designers & Decorators
</span>
</a>
</span>
</div>
</div>
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
</div>
</div>
Please let me know how this can be done.
I assume you are using BeautifulSoup, since you are passing the class_ keyword.
If there is only one element with the class name hzi-font hzi-Man-Outline, then try:
str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").findNext('div').get_text().split(":")[-1]).strip()
This extracts 'Lisa Staprans'.
Here findNext navigates to the next div, and the text is extracted from it.
I can't test it right now, but I would do:
profilePageSource.find_element_by_class_name("info-list-text").get_attribute('innerHTML')
Then you will have to split the result on the : (if it is always present).
For more information: https://selenium-python.readthedocs.org/en/latest/navigating.html
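The split step mentioned above could look like the following minimal sketch. It assumes the name always follows the first colon in the element's text; contact_name is a hypothetical helper, not part of Selenium or BeautifulSoup.

```python
def contact_name(info_text):
    """Return the text after the first ':' with surrounding whitespace removed."""
    _, sep, name = info_text.partition(":")
    if not sep:
        raise ValueError("no ':' separator found")
    return name.strip()


# Plain text of the div, and the same content as innerHTML.
plain = "\nContact\n: Lisa Staprans\n"
inner_html = "<b>\nContact\n</b>\n: Lisa Staprans\n"

print(contact_name(plain))
print(contact_name(inner_html))
```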
Maybe something is wrong with this part:
find(class_="hzi-font hzi-Man-Outline")
An easy way to get the right information is to inspect the page in Google Chrome, right-click the element you need, copy its XPath, and then use:
profilePageSource.find_element_by_xpath(<xpath copied from Chrome>).text
Hope it helps.