I'd like to extract the text inside the element with class "spc spc-nowrap" using Scrapy, with Docker as the container software, to scrape dynamically loaded content.
<div id="tooltipdiv" style="position: absolute; z-index: 100; left: 637.188px; top: 625.609px; display: none;">
<span class="help">
<span class="help-box2 y-h wider">
<span class="wrap-help">
<span class="spc spc-nowrap" id="tooltiptext">
text to extract
<br>
text to extract
<strong>text to extract</strong>
<br>
</span>
</span>
</span>
</span>
</div>
Which XPath or CSS expression returns this data?
response.css("span#tooltiptext.spc.spc-nowrap").extract()
returns an empty list.
This should extract all of the text, including the text in the <strong> tag:
response.xpath('//span[@id="tooltiptext"]//text()').getall()
The result is a list. For your example, once surrounding whitespace is stripped, the output would be: ["text to extract", "text to extract", "text to extract"]
<div class="breadcrumbs">
<div class="container">
Home
<span class="divider"> </span>
Special Occasion Dresses
<span class="divider"> </span>
Evening Dresses
<span class="divider"> </span>
Formal Evening Dresses
<span class="divider"> </span>
<strong>Deep V-neck Yellow Long Prom Dress Sleeveless Satin Evening Dress</strong>
</div>
I want to scrape the third anchor inside the container class, but I am unable to. I used response.css('.breadcrumbs div.container a').getall() to select all anchors, but I only get the first one. I am a beginner and need help scraping all of these anchors.
Pretty simple using XPath expressions.
If you want to get anchor by position:
third_url = response.xpath('//div[@class="container"]/a[3]/@href').get()
If you want to get anchor by the text of the link:
evening_dresses_url = response.xpath('//div[@class="container"]/a[.="Evening Dresses"]/@href').get()
Here's the HTML I have webscraped. How do I extract the text called "Code I want to Extract" and then save this as a string "author"? Thanks in advance!
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>
You can try this:
from bs4 import BeautifulSoup

html_doc = """
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>
"""
soup = BeautifulSoup(html_doc, 'lxml')
author = soup.find('a').text
print(author)
Output will be:
Code I want to Extract
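On a full page, soup.find('a') returns the first anchor in the document, which may not be the author link. A slightly more defensive sketch selects the anchor by its class instead. The extra "Home" link is a hypothetical stand-in for other anchors on the page, and the built-in "html.parser" is used here only to avoid depending on lxml.

```python
from bs4 import BeautifulSoup

# The first anchor is a made-up example of unrelated links on the real page.
html_doc = """
<a class="nav" href="#">Home</a>
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# Match on the class so unrelated anchors are ignored.
link = soup.select_one("a.lead-author-profile-link")

# Guard against a missing element instead of raising AttributeError.
author = link.get_text(strip=True) if link else None
print(author)
```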
I'm trying to parse the following HTML code in Python using Beautiful Soup. I would like to search for the text inside a tag, for example "Color", and return the text of the next tag, "Slate, mykonos", and do the same for the other tags, so that for a given category I can return its corresponding information.
However, I'm finding it very difficult to find the right code to do this.
<h2>Details</h2>
<div class="section-inner">
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Recommended Use</span>
<span class="_KDu">For cell phone</span>
</div>
<div class="_JDu">
<span class="_IDu">Protection</span>
<span class="_KDu">Impact protection</span>
</div>
<div class="_JDu">
<span class="_IDu">Cover Type</span>
<span class="_KDu">Back cover</span>
</div>
<div class="_JDu">
<span class="_IDu">Features</span>
<span class="_KDu">Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges</span>
</div>
</div>
I use the following code to retrieve my div tag
soup.find_all("div", "_JDu")
Once I have retrieved the tag I can navigate inside it but I can't find the right code that will enable me to find the text inside one tag and return the text in the tag after it.
Any help would be really really appreciated as I'm new to python and I have hit a dead end.
You can define a function to return the value for the key you enter:
def get_txt(soup, key):
    key_tag = soup.find('span', text=key).parent
    return key_tag.find_all('span')[1].text

color = get_txt(soup, 'Color')
print('Color: ' + color)
features = get_txt(soup, 'Features')
print('Features: ' + features)
Output:
Color: Slate, mykonos
Features: Camera lens cutout, hard shell, rubberized, port cut-outs, raised edges
I hope this is what you are looking for.
Explanation:
soup.find('span', text=key) returns the <span> tag whose text=key.
.parent returns the parent tag of the current <span> tag.
Example:
When key='Color', soup.find('span', text=key).parent will return
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
Now we've stored this in key_tag. The only thing left is getting the text of the second <span>, which is what the line key_tag.find_all('span')[1].text does.
Give this a go. It also returns the corresponding values. To see how it works, first assign the HTML to a variable named content, wrapped in triple quotes (content = """ ... """).
from bs4 import BeautifulSoup

soup = BeautifulSoup(content, "lxml")
for elem in soup.select("._JDu"):
    item = elem.select_one("span")
    if "Features" in item.text:  # try to see if it misses the corresponding values
        val = item.find_next("span").text
        print(val)
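If several categories are needed, it may be simpler to collect every label/value pair into a dictionary once and look keys up afterwards. The sketch below assumes each div with class _JDu holds exactly one label span followed by one value span, as in the sample. The name details is illustrative, and only part of the sample markup is reproduced.

```python
from bs4 import BeautifulSoup

# A shortened copy of the markup from the question.
content = """
<div class="_UCu">
<h3 class="_mEu">General</h3>
<div class="_JDu">
<span class="_IDu">Color</span>
<span class="_KDu">Slate, mykonos</span>
</div>
</div>
<div class="_UCu">
<h3 class="_mEu">Carrying Case</h3>
<div class="_JDu">
<span class="_IDu">Type</span>
<span class="_KDu">Protective cover</span>
</div>
</div>
"""

soup = BeautifulSoup(content, "html.parser")

details = {}
for row in soup.select("div._JDu"):
    spans = row.find_all("span")
    # Skip rows that do not have both a label and a value.
    if len(spans) >= 2:
        label = spans[0].get_text(strip=True)
        details[label] = spans[1].get_text(strip=True)

print(details["Color"])
```

A lookup for a label that is not on the page can then use details.get("Some Label") and receive None rather than an exception.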
I have this piece of HTML with the prices of a product (the full price and an installment offer), and I am trying to scrape it with Python to get only the full price (649).
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
I tried using re.findall like this
match = re.findall('\"money-int\"\>(\d*)\<\/span\>\<sup class=\"money-decimal\"\>(\d*)',content)
The problem is that I get a list with both prices, 649 and 37, and I need only 649.
re.findall(r"<span[^>]*class=\"money-int\"[^>]*>([^<]*)</span>[^<]*<sup[^>]*class=\"money-decimal\"[^>]*>([^<]*)</sup>", YOUR_STRING)
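Note that re.findall with this pattern still returns every price on the page. If only the first one is wanted, one option is re.search, which stops at the first match. The sketch below assumes the main price always appears before the installment price in the markup, as it does in the sample.

```python
import re

# Markup copied from the question.
content = """
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
"""

pattern = re.compile(
    r'<span[^>]*class="money-int"[^>]*>([^<]*)</span>[^<]*'
    r'<sup[^>]*class="money-decimal"[^>]*>([^<]*)</sup>'
)

# search() returns only the first match, or None if nothing matches.
match = pattern.search(content)
price_int, price_decimal = match.groups() if match else (None, None)
print(price_int, price_decimal)
```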
Consider using an HTML parser such as BeautifulSoup for this job to avoid future headaches:
#!/usr/bin/python
from bs4 import BeautifulSoup as BS
html = '''
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
'''
soup = BS(html, 'lxml')
# Take only the first "money-int" span, which holds the main price
print(soup.find_all("span", attrs={"class": "money-int"})[0].get_text())
I'm trying to collect text using bs4, Selenium and Python. I want to get the text "Lisa Staprans" using:
name = str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").div.get_text().encode("utf-8"))[2:-1]
Here is the HTML:
<div class="profile-about-right">
<div class="text-bold">
SF Peninsula Interior Design Firm
<br/>
Best of Houzz 2015
</div>
<br/>
<div class="page-tags" style="display:none">
page_type: pro_plus_profile
</div>
<div class="pro-info-horizontal-list text-m text-dt-s">
<div class="info-list-label">
<i class="hzi-font hzi-Ruler">
</i>
<div class="info-list-text">
<span class="hide" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Professionals
</span>
</a>
</span>
<span itemprop="child" itemscope="" itemtype="http://data-vocabulary.org/Breadcrumb">
<a href="http://www.houzz.com/professionals/interior-designer/c/Menlo-Park--CA" itemprop="url">
<span itemprop="title">
Interior Designers & Decorators
</span>
</a>
</span>
</div>
</div>
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
</div>
</div>
Please let me know how this can be done.
I assume you are using BeautifulSoup, since you are passing the class_ keyword argument.
If there is only one element with the class name hzi-font hzi-Man-Outline, then try:
str(profilePageSource.find(class_="hzi-font hzi-Man-Outline").findNext('div').get_text().split(":")[-1]).strip()
This extracts 'Lisa Staprans'.
The original attempt fails because the <i> tag has no <div> child; the <div> is its next sibling. Here findNext navigates to that next div and extracts its text.
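The same idea as a self-contained sketch, using find_next (the current spelling of findNext) on a trimmed copy of the markup. It assumes the name always follows a colon inside the div that comes right after the icon. The built-in "html.parser" is used so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the relevant part of the markup from the question.
html = """
<div class="info-list-label">
<i class="hzi-font hzi-Man-Outline">
</i>
<div class="info-list-text">
<b>
Contact
</b>
: Lisa Staprans
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Matching on one class is enough; the tag also carries "hzi-font".
icon = soup.find("i", class_="hzi-Man-Outline")

# The <div> is not a child of <i>; it is the next div in document order.
text = icon.find_next("div").get_text()

# Keep what follows the colon and trim the surrounding whitespace.
name = text.split(":", 1)[-1].strip()
print(name)
```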
I can't test it right now, but I would try:
profilePageSource.find_element_by_class_name("info-list-text").get_attribute('innerHTML')
Then you will have to split the result on the ":" (if it is always present).
For more information, see https://selenium-python.readthedocs.org/en/latest/navigating.html
Maybe something is wrong with this part:
find(class_="hzi-font hzi-Man-Outline")
An easy way to get the right information is to inspect the page in Google Chrome, right-click the element you need, copy its XPath, and then use:
profilePageSource.find_element_by_xpath(<xpath copied from Chrome>).text
Hope it helps.