How do I extract certain elements from this webscraped HTML - python

Here's the HTML I have webscraped. How do I extract the text called "Code I want to Extract" and then save this as a string "author"? Thanks in advance!
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>

You can try this:
from bs4 import BeautifulSoup
html_doc="""
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>
"""
soup = BeautifulSoup(html_doc, 'lxml')  # needs the lxml package; 'html.parser' also works here
author = soup.find('a').text
print(author)
Output will be:
Code I want to Extract
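If the page contains more than one link, `soup.find('a')` may pick up the wrong one. A minimal sketch, assuming the class name `lead-author-profile-link` from the snippet above identifies the author link, targets the inner `<span>` directly and falls back to `None` when nothing matches:

```python
from bs4 import BeautifulSoup

html_doc = """
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>
"""

# html.parser ships with Python, so no extra parser package is needed
soup = BeautifulSoup(html_doc, "html.parser")

# Select the <span> that is a direct child of the author link
span = soup.select_one("a.lead-author-profile-link > span")

# Guard against a missing element instead of raising AttributeError
author = span.get_text(strip=True) if span is not None else None
print(author)
```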


How to scrape dynamic content displayed in tooltips using scrapy

I'd like to extract the text inside the class "spc spc-nowrap" using scrapy and the container software docker to scrape dynamically loaded content.
<div id="tooltipdiv" style="position: absolute; z-index: 100; left: 637.188px; top: 625.609px; display: none;">
<span class="help">
<span class="help-box2 y-h wider">
<span class="wrap-help">
<span class="spc spc-nowrap" id="tooltiptext">
text to extract
<br>
text to extract
<strong>text to extract</strong>
<br>
</span>
</span>
</span>
</span>
</div>
Which xpath or css syntax returns these data?
response.css("span#tooltiptext.spc.spc-nowrap").extract()
yields empty list
This should extract all of the text, including the text inside the <strong> tag.
The result is a list of text nodes. For your example, once whitespace-only entries are dropped, the output would be: ["text to extract", "text to extract", "text to extract"]
response.xpath('//span[@id="tooltiptext"]//text()').getall()

How to select div including a span with specific id via beautifulsoup?

I want to scrape a text of some div that includes a span with specific id or class.
for example:
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span2"></span>
text to scrape
</div>
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span3"></span>
text to scrape
</div>
I want to get the text in the div (class1), but only from the divs that include the span with id span1.
thanks
Thank you, I have solved my problem using the code below:
soup = BeautifulSoup(html, "html.parser")
for x in soup.find_all(class_='class1'):
    # the span in the sample HTML is identified by id, not by class
    x_span = x.find('span', id='span1')
    if x_span is not None:
        print(x_span.parent.text.strip())
this should do
soup.select_one('.class1:has(#span1)').text
As Beso mentioned, there is a typo in your HTML that should be fixed.
How to select?
The simplest approach, in my opinion, is to use a CSS selector to select all <div> elements that have a <span> with the id "span1" (because there is more than one in your example HTML):
soup.select('div:has(> #span1)')
or even more specific as mentioned by diggusbickus:
soup.select('.class1:has(> #span1)')
To get the text of all of them you have to iterate the result set:
[x.get_text(strip=True) for x in soup.select('div:has(> #span1)')]
That will give you a list of texts:
['text to scrape', 'text to scrape']
Example
from bs4 import BeautifulSoup
html='''
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span2"></span>
text to scrape
</div>
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span3"></span>
text to scrape
</div>
'''
soup = BeautifulSoup(html, "html.parser")
[x.get_text(strip=True) for x in soup.select('div:has(> #span1)')]
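If the `:has()` selector is not available in your setup, roughly the same result can be had by finding the spans first and walking up to their parents. A minimal sketch, assuming the same sample HTML as above:

```python
from bs4 import BeautifulSoup

html = '''
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span2"></span>
text to scrape
</div>
<div class="class1" >
<span id="span1"></span>
text to scrape
</div>
<div class="class1" >
<span id="span3"></span>
text to scrape
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# Find every <span id="span1">, then read the text of its parent <div>
texts = [
    span.parent.get_text(strip=True)
    for span in soup.find_all("span", id="span1")
    if span.parent.name == "div"
]
print(texts)
```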

How can I crawl web data that is not in tags (class name is the same)

Sorry, I have asked a question like this before
(How can I crawl web data that is not in tags),
but I still have a problem with data that is not inside a tag.
This case is slightly different from my earlier question.
<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>
I only need
I am here
The string is a child of the main div of type NavigableString, so you can loop through div.children and filter based on the type of the node:
from bs4 import BeautifulSoup, NavigableString
[x.strip() for x in soup.find("div", {'id': 'main-content'}).children if isinstance(x, NavigableString) and x.strip()]
# ['I am here']  (shown as [u'I am here'] on Python 2)
Data:
soup = BeautifulSoup("""<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>""", "html.parser")
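The same filter can also be written with `find_all`. A sketch, assuming a Beautiful Soup version that accepts the `string=` argument (older releases call it `text=`): `recursive=False` limits the search to direct children, so only the loose text of the main div is returned.

```python
from bs4 import BeautifulSoup

html = """<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>"""

soup = BeautifulSoup(html, "html.parser")
main_div = soup.find("div", id="main-content")

# Only strings that are direct children of the main div, not nested ones
direct_strings = main_div.find_all(string=True, recursive=False)

# Drop the whitespace-only strings that sit between the child tags
loose_text = [s.strip() for s in direct_strings if s.strip()]
print(loose_text)
```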
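soup = BeautifulSoup(that_html, "html.parser")
div_tag = soup.div
# .string is None for a tag with several children, so read the trailing text node instead
required_string = div_tag.contents[-1].strip()
Go through the Beautiful Soup documentation for details.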

Python Regex Scrape

I have this piece of HTML with prices for a product (the price and an installment offer), and I am trying to scrape it with Python to get only the price (649).
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
I tried using re.findall like this
match = re.findall('\"money-int\"\>(\d*)\<\/span\>\<sup class=\"money-decimal\"\>(\d*)',content)
The problem is that I get a list with both prices, 649 and 37, and I need only 649.
re.findall(r"<span[^>]*class=\"money-int\"[^>]*>([^<]*)</span>[^<]*<sup[^>]*class=\"money-decimal\"[^>]*>([^<]*)</sup>", YOUR_STRING)
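If a regex is still preferred, one way to get only the first price is `re.search`, which stops at the first match instead of collecting all of them. A sketch, assuming the main price always appears before the installment price; the pattern below is written for the sample markup and allows whitespace between the two tags:

```python
import re

content = '''
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
'''

# \s* absorbs the newline between </span> and <sup ...>
pattern = re.compile(
    r'class="money-int">(\d+)</span>\s*<sup class="money-decimal">(\d+)</sup>'
)

# search() returns only the first match in the string, or None
first = pattern.search(content)
price_int, price_dec = first.groups() if first else (None, None)
print(price_int)
```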
Consider using an HTML parser to do this job to avoid future headaches:
#!/usr/bin/python
from bs4 import BeautifulSoup as BS
html = '''
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
'''
soup = BS(html, 'lxml')
print(soup.find_all("span", attrs={"class": "money-int"})[0].get_text())

Python RegEx with Beautifulsoup 4 not working

I want to find all div tags which have a certain pattern in their class name but my code is not working as desired.
This is the code snippet
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':re.compile(r'common text .*')})
where html_doc is the string with the following html
<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>
But all_findings is coming out as an empty list while it should have found one item.
It's working in the case of exact match
all_findings = soup.findAll('div',attrs={'class':re.compile(r'hide-c')})
I am using bs4.
Instead of using a regular expression, put the classes you are looking for in a list:
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
Example code:
from bs4 import BeautifulSoup
html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
print(all_findings)
This outputs:
[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]
To extend @Andy's answer, you can make a list of class names and compiled regular expressions:
soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})
Note that, in this case, you'll get the div elements with one of the specified classes/patterns - in other words, it's common or text or sighting_ followed by five digits.
If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml":
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print(all_findings)
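Another way to get the "and" behaviour without switching parsers is a CSS selector. A minimal sketch, assuming the class names from the question: chained class selectors require every listed class to be present, and the attribute selector additionally requires `sighting_` to appear somewhere in the class attribute.

```python
from bs4 import BeautifulSoup

html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""

soup = BeautifulSoup(html_doc, "html.parser")

# div must carry both "common" and "text", plus a class containing "sighting_"
matches = soup.select('div.common.text[class*="sighting_"]')

print(len(matches))
print(matches[0]["class"])
```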
