BeautifulSoup missing/skipping tags - python

I would appreciate it if you could point me in the right direction. Is there a better way of doing this that captures all the data inside the div with class "DocumentText"?
If I do it like this, I am missing some tags at the end. The original HTML string is around 20K characters, so it is a lot of data.
soup = BeautifulSoup(r.content, 'html5lib')
c.case_html = str(soup.find('div', class_='DocumentText'))
print(c.case_html)
The following is the scraping code, which works fine for now, but it breaks as soon as a new tag type is added:
soup = BeautifulSoup(r.content, 'html5lib')
c.case_html = str(soup.find('div', class_='DocumentText').find_all(['p', 'center', 'small']))
print(c.case_html)
Sample HTML is as follows (the original is around 20K characters):
<form name="form1" id="form1">
<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p>...</p>
<p> <center> Second Center </center> </p>
<p>....</p>
</div>
</form>
The expected output is this:
<div id="theDocument" class="DocumentText" style="position: relative; float: left; overflow: scroll; height: 739px;">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p>...</p>
<p> <center> Second Center </center> </p>
<p>....</p>
</div>

You can try this. I based my answer on the HTML you provided. If you need clarification, just let me know. Thanks!
soup = BeautifulSoup(r.content, 'html5lib')
case_html = soup.select_one('div.DocumentText')  # select() returns a list; select_one() returns the tag itself
print(case_html)
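As a sanity check, here is a runnable sketch of the same idea (using html.parser here so nothing extra needs to be installed): grabbing the whole div keeps every child tag, so tag types added later are still captured.

```python
from bs4 import BeautifulSoup

html = """
<form name="form1" id="form1">
<div id="theDocument" class="DocumentText">
<p>PTag</p>
<p> <center> First center </center> </p>
<small> this is small</small>
<p>...</p>
<p> <center> Second Center </center> </p>
<p>....</p>
</div>
</form>
"""

soup = BeautifulSoup(html, "html.parser")
# Take the whole div rather than a fixed list of tag names,
# so any tag type added later is still captured.
div = soup.select_one("div.DocumentText")
print(str(div))        # full markup, tags included
print(div.get_text())  # plain text only
```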

Related

Referencing a div title class with xpath selenium python

So I have started messing around with Selenium in Python, and I just cannot figure this XPath out.
<div title class="popupBox" style="left: 0px; top: 490px;">
<svg class="cp Drag" viewbox="0 0 24 24" style="height: 10px; width: 10px;">...</svg>
<div title class="eqBox">
<div title class="ib cp">
<img title src="src.svg">
<div title class="bred"></div>
<div title class="ds2">+1</div>
<div title class="ds3 IQ">TOP</div>
</div>
</div>
I need to be able to locate the TOP element and get its text.
Hopefully that makes sense and I have provided what is needed.
Try:
driver.find_element(By.XPATH, '//*[@class="ds3 IQ"]').text.splitlines()[-1]
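As an aside, the XPath itself can be sanity-checked offline against the posted snippet with lxml (assuming lxml is installed) before pointing Selenium at the live page; note the @ for attribute selection:

```python
from lxml import html as lhtml

# Trimmed copy of the snippet from the question.
snippet = '''
<div title class="eqBox">
  <div title class="ib cp">
    <div title class="ds2">+1</div>
    <div title class="ds3 IQ">TOP</div>
  </div>
</div>
'''

tree = lhtml.fromstring(snippet)
# @class selects by attribute value; # is not valid XPath syntax.
result = tree.xpath('//*[@class="ds3 IQ"]/text()')
print(result)  # ['TOP']
```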

get all soup above a certain div

I have a soup of this format:
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
I want to scrape all the paragraphs between the table and the bar div. The challenge is that the number of paragraphs between them is not constant, so I can't just take the first three (it could be anywhere from 1 to 5).
How do I go about dividing this soup to get those paragraphs? Regex seemed decent at first, but it didn't work for me, since I still need a soup object afterwards for further extraction.
Thanks a ton
You could select your element, iterate over its siblings and break if there is no p:
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
or other way around and closer to your initial question - select the <div class = 'bar'> and find_previous_siblings('p'):
for t in soup.select_one('.bar').find_previous_siblings('p'):
print(t)
Example
from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
for t in soup.div.table.find_next_siblings():
    if t.name != 'p':
        break
    print(t)
Output
<p> </p>
<p> </p>
<p> </p>
If the HTML is as shown, then just use :not to filter out the later sibling p tags:
from bs4 import BeautifulSoup
html='''
<div class = 'foo'>
<table> </table>
<p> </p>
<p> </p>
<p> </p>
<div class = 'bar'>
<p> </p>
.
.
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('.foo > table ~ p:not(.bar ~ p)'))
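The find_previous_siblings variant can be checked on a self-contained snippet as well; note that it walks backwards from .bar, so the paragraphs come out in reverse document order:

```python
from bs4 import BeautifulSoup

html = '''
<div class="foo">
<table> </table>
<p>a</p>
<p>b</p>
<p>c</p>
<div class="bar">
<p>d</p>
</div>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
# Walk backwards from the bar div, collecting only <p> siblings.
paras = soup.select_one('.bar').find_previous_siblings('p')
print([p.get_text() for p in paras])  # ['c', 'b', 'a']
```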

Get href links from a tag

This is just one part of the HTML; there are multiple products on the page with the same HTML structure.
I want all the hrefs for all the products on the page.
<div class="row product-layout-category product-layout-list">
<div class="product-col wow fadeIn animated" style="visibility: visible;">
<a href="the link I want" class="product-item">
<div class="product-item-image">
<img data-src="link to an image" alt="name of the product" title="name of the product" class="img-responsive lazy" src="link to an image">
</div>
<div class="product-item-desc">
<p><span><strong>brand</strong></span></p>
<p><span class="font-size-16">name of the product</span></p>
<p class="product-item-price">
<span>product price</span></p>
</div>
</a>
</div>
.
.
.
With this code I wrote I only get None printed a bunch of times
from bs4 import BeautifulSoup
import requests
url = 'link to the site'
response = requests.get(url)
page = response.content
soup = BeautifulSoup(page, 'html.parser')
##this includes the part that I gave you
items = soup.find('div', {'class': 'product-layout-category'})
allItems = items.find_all('a')
for n in allItems:
    print(n.href)
How can I get it to print all the href's in there?
Looking at your HTML code, you can use CSS selector a.product-item. This will select all <a> tags with class="product-item":
from bs4 import BeautifulSoup
html_text = """
<div class="row product-layout-category product-layout-list">
<div class="product-col wow fadeIn animated" style="visibility: visible;">
<a href="the link I want" class="product-item">
<div class="product-item-image">
<img data-src="link to an image" alt="name of the product" title="name of the product" class="img-responsive lazy" src="link to an image">
</div>
<div class="product-item-desc">
<p><span><strong>brand</strong></span></p>
<p><span class="font-size-16">name of the product</span></p>
<p class="product-item-price">
<span>product price</span></p>
</div>
</a>
</div>
"""
soup = BeautifulSoup(html_text, "html.parser")
for link in soup.select("a.product-item"):
    print(link.get("href"))  # or link["href"]
Prints:
the link I want
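For completeness, the None in the question comes from the dot access: on a bs4 Tag, n.href looks for a child <href> tag, not the href attribute. A minimal sketch of the difference:

```python
from bs4 import BeautifulSoup

html = '<a href="the link I want" class="product-item">x</a>'
a = BeautifulSoup(html, "html.parser").a

print(a.href)         # None -> dot access searches for a child <href> tag
print(a["href"])      # the link I want
print(a.get("href"))  # the link I want (returns None instead of raising if the attribute is missing)
```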

Python Regex Scrape

I have this piece of HTML with prices for a product (the price and an installment offer), and I am trying to scrape it with Python to get only the price (649).
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
I tried using re.findall like this:
match = re.findall(r'"money-int">(\d*)</span><sup class="money-decimal">(\d*)', content)
The problem is I get list with both prices, 649 and 37 and I need only 649.
re.findall(r"<span[^>]*class=\"money-int\"[^>]*>([^<]*)</span>[^<]*<sup[^>]*class=\"money-decimal\"[^>]*>([^<]*)</sup>", YOUR_STRING)
Consider using an HTML parser for this job to avoid future headaches:
#!/usr/bin/python
from bs4 import BeautifulSoup as BS
html = '''
<span style="color: #404040; font-size: 12px;"> from </span>
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<span class="money-currency">$</span>
<br />
<span style="color: #404040; font-size: 12px;">from
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
<span class="money-currency">$</span>/month
</b>
</span>
'''
soup = BS(html, 'lxml')
print(soup.find_all("span", attrs={"class": "money-int"})[0].get_text())
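If only the first (main) price is needed, find/select_one stops at the first match in document order, which sidesteps the regex problem entirely; a minimal sketch of that idea:

```python
from bs4 import BeautifulSoup

html = '''
<span class="money-int">649</span>
<sup class="money-decimal">99</sup>
<b>
<span class="money-int">37</span>
<sup class="money-decimal">35</sup>
</b>
'''

soup = BeautifulSoup(html, 'html.parser')
# select_one returns only the first match, i.e. the main price.
price = soup.select_one('span.money-int').get_text()
cents = soup.select_one('sup.money-decimal').get_text()
print(price, cents)  # 649 99
```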

Python RegEx with Beautifulsoup 4 not working

I want to find all div tags that have a certain pattern in their class name, but my code is not working as desired.
This is the code snippet:
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':re.compile(r'common text .*')})
where html_doc is the string with the following html
<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>
But all_findings comes out as an empty list, while it should have found one item.
It works in the case of an exact match:
all_findings = soup.findAll('div', attrs={'class': re.compile(r'hide-c')})
I am using bs4.
Instead of using a regular expression, put the classes you are looking for in a list:
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
Example code:
from bs4 import BeautifulSoup
html_doc = """<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>"""
soup = BeautifulSoup(html_doc, 'html.parser')
all_findings = soup.findAll('div',attrs={'class':['common', 'text']})
print(all_findings)
This outputs:
[<div class="common text sighting_4619012">
<div class="hide-c">
<div class="icon location"></div>
<p class="reason"></p>
<p class="small">These will not appear</p>
<span class="button secondary ">wait</span>
</div>
<div class="show-c">
</div>
</div>]
To extend @Andy's answer, you can make a list of class names and compiled regular expressions:
soup.find_all('div', {'class': ["common", "text", re.compile(r'sighting_\d{5}')]})
Note that, in this case, you'll get the div elements with one of the specified classes/patterns - in other words, it's common or text or sighting_ followed by five digits.
If you want to have them joined with "and", one option would be to turn off the special treatment for "class" attributes by having the document parsed as "xml":
soup = BeautifulSoup(html_doc, 'xml')
all_findings = soup.find_all('div', class_=re.compile(r'common text sighting_\d{5}'))
print(all_findings)
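Another option, without switching parsers, is a compound CSS selector, which requires the element to carry every listed class (an AND rather than the OR above); a sketch against the question's HTML:

```python
import re
from bs4 import BeautifulSoup

html_doc = '''<div class="common text sighting_4619012">
<div class="hide-c"></div>
</div>'''

soup = BeautifulSoup(html_doc, 'html.parser')

# div.common.text matches only elements that have BOTH classes.
both = soup.select('div.common.text')
print(len(both))  # 1

# With html.parser the class attribute is split into tokens, so a regex
# is matched against each token separately; that is why 'common text .*'
# in the question never matched, while a single-token pattern does.
sightings = soup.find_all('div', class_=re.compile(r'^sighting_\d+$'))
print(len(sightings))  # 1
```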
