Python Regex Scrape - python

I have this piece of markup with the prices for a product (the full price and an installment offer), and I am trying to scrape it with Python to get only the price (649).
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
I tried using re.findall like this:
match = re.findall('\"money-int\"\>(\d*)\<\/span\>\<sup class=\"money-decimal\"\>(\d*)',content)
The problem is that I get a list with both prices, 649 and 37, and I need only 649.

re.findall(r"<span[^>]*class=\"money-int\"[^>]*>([^<]*)</span>[^<]*<sup[^>]*class=\"money-decimal\"[^>]*>([^<]*)</sup>", YOUR_STRING)
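Note that re.findall still returns every match, so the pattern above yields both the full price and the installment price. A minimal sketch of taking only the first one with re.search, assuming the full price always comes before the installment offer in the markup (the names `content` and `PRICE_RE` are illustrative):

```python
import re

# Sample markup copied from the question.
content = '''
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
'''

# Same pattern as in the answer, compiled once for reuse.
PRICE_RE = re.compile(
    r'<span[^>]*class="money-int"[^>]*>([^<]*)</span>[^<]*'
    r'<sup[^>]*class="money-decimal"[^>]*>([^<]*)</sup>'
)

# re.search stops at the first match; this is the full price only
# under the assumption that it precedes the installment offer.
first = PRICE_RE.search(content)
price_int, price_decimal = first.groups() if first else (None, None)
print(price_int)
```

If the order of the two prices is not guaranteed, a regex alone cannot tell them apart reliably, and a parser that can look at the surrounding tags is the safer choice.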

Consider using an HTML parser to do this job to avoid future headaches:
#!/usr/bin/python
from bs4 import BeautifulSoup as BS
html = '''
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
'''
soup = BS(html, 'lxml')
# find_all returns matches in document order, so index 0 is the full price.
print(soup.find_all("span", attrs={"class": "money-int"})[0].get_text())

Related

How to scrape dynamic content displayed in tooltips using scrapy

I'd like to extract the text inside the class "spc spc-nowrap" using scrapy and the container software docker to scrape dynamically loaded content.
<div id="tooltipdiv" style="position: absolute; z-index: 100; left: 637.188px; top: 625.609px; display: none;">
<span class="help">
<span class="help-box2 y-h wider">
<span class="wrap-help">
<span class="spc spc-nowrap" id="tooltiptext">
text to extract
<br>
text to extract
<strong>text to extract</strong>
<br>
</span>
</span>
</span>
</span>
</div>
Which XPath or CSS expression returns this data?
response.css("span#tooltiptext.spc.spc-nowrap").extract()
yields an empty list.
This should extract all of the text, including the text in the <strong> tag.
It returns a list; for your example, ignoring surrounding whitespace, the output would be: ["text to extract", "text to extract", "text to extract"]
response.xpath('//span[@id="tooltiptext"]//text()').getall()
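In practice //text() returns every text node, including the newlines between tags, so the raw list usually needs a cleanup pass. A minimal sketch, assuming `raw_nodes` is what getall() returned for the snippet above (the values are typed in by hand for illustration, not captured from a real response):

```python
# Hypothetical result of .getall() for the tooltip markup; written out by hand.
raw_nodes = [
    "\ntext to extract\n",
    "\ntext to extract\n",
    "text to extract",
    "\n",
    "\n",
]

# Strip each node and drop the ones that contained only whitespace.
texts = [node.strip() for node in raw_nodes if node.strip()]
print(texts)
```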

How do I extract certain elements from this webscraped HTML

Here's the HTML I have webscraped. How do I extract the text called "Code I want to Extract" and then save this as a string "author"? Thanks in advance!
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>
You can try this:
from bs4 import BeautifulSoup
html_doc = """
<a class="lead-author-profile-link" href="https://papers.ssrn.com/sol3/cf_dev/AbsByAuth.cfm?per_id=2994282" target="_blank" title="View other papers by this author"><span>Code I want to Extract</span><i aria-hidden="true" class="icon icon-gizmo-navigate-right"></i></a>
"""
soup = BeautifulSoup(html_doc, 'lxml')
author = soup.find('a').text
print(author)
Output will be:
Code I want to Extract

How to extract from multiple <span> tags and group the data together using BS4?

I have extracted data between span tags based on its class, from a webpage. But at times, the webpage splits a line into multiple fragments and stores it in consecutive tags. All the children span tags have the same class name.
Following is the HTML snippet:
<p class="Paragraph SCX">
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
This week
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
(12/
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
11
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
- 12/1
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
7
</span>
</span>
<span class="TextRun SCX">
<span class="NormalTextRun SCX">
):
</span>
</span>
<span class="EOP SCX">
</span>
</p>
From the above HTML snippet, I need to extract only the innermost span data.
Python code to extract data using BS4:
for data in elem.find_all('span', class_="TextRun"):
    a = data.find('span').contents[0]
    a = a.string.replace(u'\xa0', '')
    print(a)
    events_parsed_thisweek.append(a)
This code prints each fragment as a separate item.
Required Output:
This week (12/11 - 12/17):
Any idea how to combine these span tag data together? Thanks!
Give this a go. Make sure to put the whole HTML in the content variable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(content,'lxml')
data = ''.join([' '.join(item.text.split()) for item in soup.select(".NormalTextRun")])
print(data)
Output:
This week(12/11- 12/17):
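The ' '.join(item.text.split()) step is what removes the newlines around each fragment. A small sketch on plain strings, assuming the fragments arrive with the surrounding whitespace shown in the snippet (the `fragments` list and the `squeeze` helper are illustrative and typed in by hand, not parsed from the page):

```python
# Text of each inner span as it appears in the snippet, newlines included.
fragments = ["\nThis week\n", "\n(12/\n", "\n11\n", "\n- 12/1\n", "\n7\n", "\n):\n"]


def squeeze(text):
    """Collapse every run of whitespace to one space and trim the ends."""
    return " ".join(text.split())


# Join the cleaned fragments without a separator, as the answer does.
line = "".join(squeeze(fragment) for fragment in fragments)
print(line)  # This week(12/11- 12/17):
```

Spaces that are not present in the source fragments, such as the one before the opening parenthesis, cannot be recovered this way.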
You could collect the fragments and combine them into one string with the join method.
fragments = []
for data in elem.find_all('span', class_='TextRun'):
    fragments.append(data.text.strip())
# str.join returns a new string, so assign the result.
dates = ''.join(fragments)

BeautifulSoup missing/skipping tags

I would appreciate it if you could point me in the right direction. Is there a better way of doing this that captures all the data inside the tag with the class "DocumentText"?
If I do it like this, I am missing some tags at the end. The original HTML string is about 20K in size, so it is a lot of data.
soup = BeautifulSoup(r.content, 'html5lib')
c.case_html = str(soup.find('div', class_='DocumentText'))
print(self.case_html)
The following scraping code works fine for now, but it breaks as soon as a new kind of tag is added.
soup = BeautifulSoup(r.content, 'html5lib')
c.case_html = str(soup.find('div', class_='DocumentText').find_all(['p','center','small']))
print(self.case_html)
Sample HTML is as follows; the original is a string of around 20K characters.
<form name="form1" id="form1">
<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p>...</p>
<p> <center> Second Center </center> </p>
<p>....</p>
</div>
</form>
Expected output to be this
<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p>...</p>
<p> <center> Second Center </center> </p>
<p>....</p>
</div>
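You can try this. I based my answer on the HTML you posted. If you need clarification, just let me know. Thanks!
soup = BeautifulSoup(r.content, 'html5lib')
# select() returns a list, so use select_one() to get the single div.
case_html = soup.select_one('div.DocumentText')
# str() keeps the markup; use case_html.get_text() for the text only.
print(str(case_html))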

How can i crawl web data that not in tags(class name is same)

Sorry, I have asked a similar question before
(How can i crawl web data that not in tags),
but I still have a problem with data that is not inside a tag.
This case is slightly different from the one I asked about.
<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>
I only need
I am here
The string is a child of the main div of type NavigableString, so you can loop through div.children and filter based on the type of the node:
from bs4 import BeautifulSoup, NavigableString
[x.strip() for x in soup.find("div", {'id': 'main-content'}).children if isinstance(x, NavigableString) and x.strip()]
# [u'I am here']
Data:
soup = BeautifulSoup("""<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>""", "html.parser")
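soup = BeautifulSoup(that_html, 'html.parser')
div_tag = soup.div
# div_tag.string is None here because the div has several children,
# so collect only the text nodes that are direct children of the div.
required_string = ''.join(div_tag.find_all(string=True, recursive=False)).strip()
Go through the Beautiful Soup documentation for more details.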
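For comparison, the same "direct child text only" idea can be sketched with the standard library's html.parser instead of Beautiful Soup. This is an alternative technique, not what the answers above use; the class name `DirectTextCollector` is made up for illustration, and the sketch assumes the markup has no unclosed void tags such as a bare <br>, which would throw off the depth counter.

```python
from html.parser import HTMLParser

# Sample markup copied from the question.
HTML = """<div class="bbs" id="main-content">
<div class="metaline">
<span class="article-meta-tag">
author
</span>
<span class="article-meta-value">
Jorden
</span>
</div>
<div class="metaline">
<span class="article-meta-tag">
board
</span>
<span class="article-meta-value">
NBA
</span>
</div>
I am here
</div>"""


class DirectTextCollector(HTMLParser):
    """Collects text that sits directly inside the element with a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0  # 0 = outside target, 1 = directly inside, >1 = nested
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            self.depth += 1  # entering a nested element
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1  # entering the target element itself

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text whose parent is the target element.
        if self.depth == 1 and data.strip():
            self.chunks.append(data.strip())


collector = DirectTextCollector("main-content")
collector.feed(HTML)
print(collector.chunks)
```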
