Python BeautifulSoup Same name DIV, ignore first

I'm playing around with Python's bs4 and trying to work out how to skip the first DIV of a given class name so that I collect data only from the second one.
Below is an example of the markup; I am trying to extract the part marked ##Wanted data##.
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>" , "html.parser")
div = soup.find("div")
div.find_all("div", {"class":"PowerDetails"})
PowerDetails[1].find_all("p", "class":"RunningCost")
PowerDetails[1].find_all("p", "class":"Time")

find_all() returns a list. Use slicing or an index to access just the elements you want.
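As a minimal sketch of that idea (the markup is a shortened version of the question's, and the variable names `blocks`, `second`, `rest` and `costs` are only illustrative), indexing yields a single tag while slicing yields a list of tags:

```python
from bs4 import BeautifulSoup

# Shortened version of the markup from the question
html_doc = """
<div class="PowerDetails"><div class="Company">
<p class="RunningCost">$4.44</p><p class="Time">peek</p>
</div></div>
<div class="PowerDetails"><div class="Company">
<p class="RunningCost">$8.88</p><p class="Time">peek</p>
</div></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")
blocks = soup.find_all("div", class_="PowerDetails")  # list-like result

second = blocks[1]   # index: one Tag (the second PowerDetails div)
rest = blocks[1:]    # slice: a list holding every div after the first

costs = [p.get_text(strip=True)
         for p in second.find_all("p", class_="RunningCost")]
print(costs)  # ['$8.88']
```

Indexing raises IndexError when the page has fewer matching divs than expected, so checking len(blocks) first may be worthwhile.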

You can slice the resulting list to get the elements from index 1 onwards. But first, note that your code is not finding the right tags.
from bs4 import BeautifulSoup
html_doc = """
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc , "html.parser")
# You can get the divs with one line of code
powerDetails = soup.find_all(class_="PowerDetails")
print(len(powerDetails)) # Outputs 2
Now, you can slice the list to ignore the first div
powerDetails = powerDetails[1:] # Get elements from 2nd element onwards (ignoring the first one)
print(len(powerDetails)) # Outputs 1
Now, you will have a list with one element only
print(powerDetails)
Output:
[<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>]
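Going one step further than the answer above, here is a hedged sketch of how the prices and time labels might be pulled out of the remaining div. Pairing each RunningCost with the Time that follows it via zip is an assumption about the page layout, and the names `rows`, `costs` and `times` are illustrative:

```python
from bs4 import BeautifulSoup

html_doc = """
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
</div>
</div>
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

rows = []
# Skip the first PowerDetails div, then walk each Company inside the rest
for block in soup.find_all("div", class_="PowerDetails")[1:]:
    for company in block.find_all("div", class_="Company"):
        costs = [p.get_text(strip=True)
                 for p in company.find_all("p", class_="RunningCost")]
        times = [p.get_text(strip=True)
                 for p in company.find_all("p", class_="Time")]
        # Assumes costs and times appear in matching order
        rows.append(list(zip(times, costs)))

print(rows)
```

Each entry in rows corresponds to one Company div and holds (time, cost) tuples.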

Another method.
from simplified_scrapy import SimplifiedDoc
html = '''
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
'''
doc = SimplifiedDoc(html)
# First method, get all, use index.
PowerDetails = doc.selects('div.PowerDetails')[1].selects(
    'div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
# Second method, skip the first with parameter start
PowerDetails = doc.getElement(
    'div', value='PowerDetails',
    start='class="PowerDetails"').selects('div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
Result:
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]

Related

Unable to scrape h1 class with python/beautiful soup

I am trying to scrape a title from an h1 class, but I keep getting "None"
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('h1', {'class': 'prod-name'})
print(title)
I've also tried using this way:
name_div = soup.find_all('div', {'class': 'col-md-12 col-sm-12 col-xs-12'})[0]
name = name_div.find('h1').text
print(name)
in which case I get: "IndexError: list index out of range"
Can anybody help me out?
This is the source code:
<div class="row attachDetails __web-inspector-hidebefore-shortcut__">
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
<div class="panel-group" id="accordion">
<div class="borders-overview">
<div class="panel-heading">
<h4 class="panel-title">
<label class="overview-label collapsed" data-angle="overview-label" data-toggle="collapse" data-parent="#accordion" href="#collapse1">
<a class="fa fa-angle-up pull-right"></a>
<a class="over-view">OVERVIEW</a>
<span class="color-disp over-view">COLOR: FAWN GREY HEATHER</span>
<span class="style-num over-view">MATERIAL# : 710766783002
</span></label>
</h4>
</div>
<div id="collapse1" class="panel-collapse collapse">
<div class="short-desc-section"></div>
</div>
</div>
<div class="border-details">
<div class="panel-heading">
<h4 class="panel-title">
<label class="prod-details collapsed" data-angle="prod-details" data-toggle="collapse" data-parent="#accordion" href="#collapse2">
<a class="detail-link">Details</a>
<a class="fa fa-angle-up pull-right"></a>
</label>
</h4>
</div>
<div id="collapse2" class="long-desc panel-collapse collapse">
<div><ol><li>STANDARD FIT</li><li>COTTON</li></ol></div>
<ol>
<div><li><b>Board:</b> S196SC23</li></div>
<!--***********************************************************************************************************-->
</ol>
</div>
</div>
</div>
</div>
</div>
</div>
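One way to narrow this down is to run the same two lookups against the markup as posted. The sketch below uses a trimmed copy of that source (the variable names are illustrative). Both lookups succeed here, which suggests, without proving it, that the HTML returned by requests is not the same as what the browser inspector shows, for example because the page fills in the product details with JavaScript. That explanation is an assumption; printing page.text and searching it for "prod-name" would confirm or rule it out.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the source posted in the question
snippet = """
<div class="row">
<div class="col-md-12 col-sm-12 col-xs-12">
<div class="brand-desc">POLO RALPH LAUREN</div>
<h1 class="prod-name">ARAN CREWNECK SWEATER</h1>
</div>
</div>
"""

soup = BeautifulSoup(snippet, "html.parser")

# First attempt from the question: look up the h1 by class
title = soup.find("h1", {"class": "prod-name"})
print(title.get_text(strip=True))

# Second attempt: find the wrapping div, then its h1
name_divs = soup.find_all("div", {"class": "col-md-12 col-sm-12 col-xs-12"})
name = name_divs[0].find("h1").text
print(name)
```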

Beautifulsoup: Get a range of divs

I just found out about how to process webpages in python using BeautifulSoup.
There's a list of divs from which I want to get those in a specific range. The range is delimited by two divs that each have an h2 child.
How would I do that? Thank you for your support!
EDIT: I added an actual representation of my html code below instead of a previous "simplified" version that was missing tags.
The new code shows a root div with class foo-bar-details.
Nested inside it are 9 div tags, two of which have a nested h2 tag. All 9 div tags contain img elements nested deep within. What I need is every img element from the divs that sit between the two divs containing an h2 element.
An expected outcome if applied to the html code below would be:
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
This is the html code:
<div class="foo-bar-details">
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>JHFDFD </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223234_thumb.JPG" alt="Image 223234" title="Image 223234 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>sdfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/223823_thumb.JPG" alt="Image 223823" title="Image 223823 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">
<div class="row">
<div class="col-se-6 element-info">
<div class="col-se-12">
<div class="row">
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/123456_thumb.jpg" alt="Image 123456" title="Image 123456">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="sec-feat-4-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Foo strin: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Barbar</strong><span class="icon-help"></span>
</p>
</div>
</div>
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Mine: </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
TEST<span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/67890_thumb.JPG" alt="Image 67890 " title="Image 67890">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/39826_thumb.JPG" alt="Image 39826" title="Image 39826 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
<div class="padding-y-10 padding-x-40 gray-sand-bg" id="sec-feat-3-1">
<div class="row">
<div class="col-sm-6 info-panel">
<div class="row">
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>fsuhfsdf </strong>
</p>
</div>
<div class="col-sm-6 margin-bottom-10">
<p class="margin-0">
<strong>Feat</strong><span class="icon-help"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-6 foo-images">
<div class="row">
<img src="../../images/209876_thumb.JPG" alt="Image 209876" title="Image 209876 ">
<div class="img-description">
</div>
</div>
</div>
</div>
</div>
</div>

Here is a solution involving lxml.html:
We extract all divs between the first and last divs which contain an h2 tag:
import lxml.html
# HTML file saved as "file.html"
file_name = "file.html"
with open(file_name, 'r') as f:
    tree = lxml.html.fromstring(f.read())
# all_div = tree.findall('div')
all_div = tree.find_class('foo-bar-details')[0].findall('div')
start, stop = None, None
for k, div in enumerate(all_div):
    if div.findall('h2') and start is None:
        print("Range starts at %d" % k)
        start = k
        continue
    if div.findall('h2') and start is not None:
        print("Range stops at %d" % k)
        stop = k + 1  # add one as range stops at k - 1
        continue
# div_list = all_div[start:stop]
img_list = [_.xpath('.//img') for _ in all_div[start:stop]]
print(img_list)
# [[], [<Element img at 0x20b58d73f40>], [<Element img at 0x20b58d73f90>], []]
# Or
img_list = [_.xpath('.//img/@src') for _ in all_div[start:stop]]
print(img_list)
# [[], ['../../images/123456_thumb.jpg'], ['../../images/67890_thumb.JPG'], []]
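The slice in the answer above includes the two divs that hold the h2 tags, which is why the result has empty lists at both ends. As a variation, here is a small sketch that keeps only the divs strictly between the markers and flattens the result. The markup is a made-up miniature of the question's structure, and the names `children`, `marker_positions` and `between` are illustrative:

```python
import lxml.html

# Made-up miniature of the structure in the question
html_doc = """
<div class="foo-bar-details">
<div id="a"><img src="a.jpg"></div>
<div id="elem-4"><h2>Foo</h2></div>
<div id="b"><span><img src="b.jpg"></span></div>
<div id="c"><img src="c.jpg"></div>
<div id="elem-5"><h2>Bar</h2></div>
<div id="d"><img src="d.jpg"></div>
</div>
"""

root = lxml.html.fromstring(html_doc)
children = root.find_class("foo-bar-details")[0].findall("div")

# Positions of the child divs that have a direct h2 child
marker_positions = [k for k, div in enumerate(children)
                    if div.find("h2") is not None]
start, stop = marker_positions[0], marker_positions[-1]

# Exclude both marker divs, then collect every nested img src
between = children[start + 1:stop]
srcs = [src for div in between for src in div.xpath(".//img/@src")]
print(srcs)  # ['b.jpg', 'c.jpg']
```

This assumes at least two marker divs exist; with fewer, marker_positions[0] raises IndexError or the slice comes back empty.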

Another solution involving SimplifiedDoc:
from simplified_scrapy.simplified_doc import SimplifiedDoc
html ='''
<div class="foo-bar-details">
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-4">
<h2 class="h3 margin-bottom-5">
Foo
</h2>
<ul class="list-inline margin-0">
<li> Foo feature </li>
...
</ul>
</div>
<div id="info-panel-header" class="padding-y-10 padding-x-40">Test 1</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-1">Test 2</div>
<div class="padding-y-10 padding-x-40 " id="foo-feat-4-2">Test 3</div>
<div class="padding-y-10 padding-x-40 gray-wild-sand-bg" id="foo-feat-4-3">Test 4</div>
<div class="element-header mystic-bg padding-y-10 padding-x-20" id="elem-5">
<h2 class="h3 margin-bottom-5">
Bar
</h2>
<ul class="list-inline margin-0">
<li> Bar feature </li>
...
</ul>
</div>
</div>
'''
doc = SimplifiedDoc(html)
divs = doc.select('div.foo-bar-details').divs.contains('<h2')
print ([div.id for div in divs])
divs = doc.select('div.foo-bar-details').divs.notContains('<h2')
print ([div.id for div in divs])
Result:
['elem-4', 'elem-5']
['info-panel-header', 'foo-feat-4-1', 'foo-feat-4-2', 'foo-feat-4-3']
The SimplifiedDoc library does not depend on any third-party libraries; it is lightweight and fast, which makes it a good fit for beginners.
More examples are available here

If I understand you correctly, you want to find the <img> tags together with the <h2> that each image belongs to.
This example (txt variable contains the HTML snippet from your question):
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
out = {}
for img in soup.select('div:has(h2) ~ div img'):
    out.setdefault(img.find_previous('h2').get_text(strip=True), []).append(img['src'])
from pprint import pprint
pprint(out)
Prints:
{'Bar': ['../../images/39826_thumb.JPG', '../../images/209876_thumb.JPG'],
'Foo': ['../../images/123456_thumb.jpg', '../../images/67890_thumb.JPG']}

Iterate div tags using lxml and retrieve text for a dictionary in python

I just came across lxml in Python and need some help, as I have no experience with XPath.
I want to get text data from a webpage into a dictionary.
I'm referring to the html snippet I posted below. Within the original html page there's a div element of the class general-info that I retrieve using the following line:
general_info = document_tree.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]")
From here on I want to iterate over the nested divs and get the 2 <p> tags as key and value. The text inside the <strong> being the key.
There can also be empty div tags and there can be a special case where the key and the value for the dictionary can be within the same div (see the last element).
EDIT:
The number of elements can change, so it would be best to use the <strong> tags as starting point and then search for the next <p> tag.
This is code that I was able to write using BeautifulSoup:
generalinfo = documentSoup.findAll("div", {"class": "general-info"})
if generalinfo:
    strongs = generalinfo[0].find_all('strong')
    for descr in strongs:
        p = descr.find_next_sibling("p")
        if p:
            key = descr.text.strip().rstrip(':')
            details_dict[key] = p.text.strip()
        else:
            nextdiv = descr.parent.parent.find_next_sibling("div")
            if nextdiv:
                child = nextdiv.findChild()
                if child:
                    key = descr.text.strip()[:-1]
                    details_dict[key] = child.text.strip()
I am going for the following output:
{'Title:' : 'This is a title',
'Owner:' : 'This is an owner',
'Category:' : 'This is a category',
'Type:' : 'This is a type',
'Special case:' : 'This is a special case'}
If anyone can help me out here I will appreciate this!
html code:
<body>
<main>
<div>
...
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
...

I believe this is about as generalized as I can get given the html provided:
general_info = doc.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]//p[@class='margin-0']")
for i in general_info:
    # A <p> holding a <strong> supplies the key for the entry that follows
    if len(i.xpath('./strong/text()')) > 0:
        topic = i.xpath('./strong/text()')[0]
    # A <p> with text of its own supplies the value
    if len((i.text or '').strip()) > 0:
        entry = i.text.replace('\n', '').strip()
        print(topic + ' ' + entry)
# The special case keeps <strong> and <p> side by side in one div
special = general_info[0].xpath('./ancestor::div[@class="general-info margin-bottom-20 margin-top-20"]//div/div/strong')[0]
print(special.text + " ", special.xpath('./following-sibling::p/text()')[0])
Output:
('Title: This is a title',
'Owner: This is an owner',
'Category: This is a category',
'Type: This is a type',
'Special case: This is a special case')
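Since the question asks for a dictionary built by starting at each <strong> and taking the next <p>, here is a hedged sketch of that approach using the XPath `following::p[1]` step. The markup is a shortened version of the question's, `details` is an illustrative name, and the sketch assumes every <strong> is followed by the <p> that holds its value:

```python
import lxml.html

# Shortened version of the markup from the question
html_doc = """
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4"><p class="margin-0"><strong>Title:</strong></p></div>
<div class="col-sm-8"><p class="margin-0">This is a title</p></div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-40">
<div><strong>Special case:</strong><p>This is a special case</p></div>
</div>
</div>
"""

tree = lxml.html.fromstring(html_doc)
strong_path = ("//div[contains(concat(' ', normalize-space(@class), ' '),"
               " ' general-info ')]//strong")

details = {}
for strong in tree.xpath(strong_path):
    # following::p[1] is the first <p> after the <strong> in document
    # order; ancestors (such as the <p> wrapping the key) are skipped
    following = strong.xpath("following::p[1]")
    if following:
        key = strong.text_content().strip()
        details[key] = following[0].text_content().strip()

print(details)
# {'Title:': 'This is a title', 'Special case:': 'This is a special case'}
```

Because the `following` axis ignores nesting depth, the same step covers both the two-column rows and the special case where key and value share a div.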

I recommend another solution, which is very suitable for extracting data from XML.
from simplified_scrapy.spider import SimplifiedDoc
html='''
<body>
<main>
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
'''
data={}
doc = SimplifiedDoc(html) # create doc
divs = doc.selects('div.general-info')
# First way
for div in divs:
    strongs = div.strongs
    for strong in strongs:
        p = strong.next
        if not p:
            p = strong.parent.next
        data[strong.text] = p.text
print(data)
data={}
# Second way
for div in divs:
    ds = div.selects('strong|p>text()')
    for i in range(0, len(ds), 2):
        data[ds[i]] = ds[i+1]
print(data)
Result:
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
Here are more examples:https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples/

How can one replace an element in lxml?

I receive text (data entered by CRM users) from a web service, and it comes back in a terrible format. I filter it with Python before using the data, but when I remove the line breaks (br), the text that follows them is removed as well. The code is as follows:
description = '''
<div id="highlight" class="section">
<p>
text...............
</p>
<br>
<h1>TITLE</h1>
<p>Multiple text
<br>
</p>
<ul>
<li>bad layer....</li>
</ul>
<p>
<br>subTitle
</p>
<p> </p>
<p style="text-align: center;">
<br>Text1
<br>Text2
<br>Text3
<br>Text4
<br>Text5
<br>Text6
</p>
<p style="text-align: center;">
<strong>small title</strong>
<br>Text small</p>
<p style="text-align: center;">
<strong>highlighted text</strong>
<br>
<br><strong>Text1</strong>
<br>Text2
<br>Text3
<br>Text4
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>Text1
<br>Text2
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>description
</p>
<p style="text-align: center;">
<br> </p>
<p><strong>description two</strong></p>
<p>
<br> </p>
</div>
'''
from lxml import html

tree = html.fragment_fromstring(description)
for element in tree.xpath('//br'):
    #element.getparent().remove(element)
    print(element.text)
    print(element.getparent().getchildren())
    #print(element)
    #print(element.getparent())
    #print(element.getchildren())
    #print(element.getnext())
    #print('--------------------------------')
I have tried to remove the br with element.getparent().remove(element), but that also deletes the text. I ran tests to see whether the text belongs to any node, but it does not seem to.
I've thought about replacing each br with an li and turning each styled p into a ul, but I can't work out how to do it. The result would look something like this (the earlier markup stays the same):
..........
..........
<ul>
<li>Text1</li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
<li>Text5</li>
<li>Text6</li>
</ul>
<ul>
<li><strong>small title</strong></li>
<li>Text small</li></ul>
<ul>
<li><strong>highlighted text</strong></li>
<li><strong>Text1</strong></li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>Text1</li>
<li>Text2</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>description</li>
</ul>
<ul>
<li> </li></ul>
........
I can't work out how to get at the text. My idea was to select the styled p nodes via XPath, take their contents, create li children under a new ul parent, and then remove the p.
Is this possible? Thanks
Regards

You can use lxml.etree.strip_elements, like so:
import lxml.etree
import lxml.html
tree = lxml.html.fragment_fromstring(description)
lxml.etree.strip_elements(tree, 'br', with_tail=False)
print(lxml.etree.tostring(tree, pretty_print=True))
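To illustrate why this helps, here is a small sketch on one paragraph modelled on the question's markup (the variable names are illustrative). In lxml the text after a <br> is stored as that element's tail, so remove() discards it together with the element, whereas strip_elements with with_tail=False keeps it:

```python
import lxml.etree
import lxml.html

markup = ('<p style="text-align: center;">'
          '<strong>small text</strong><br>Text1<br>Text2</p>')

# Approach from the question: remove() drops each <br> and its tail text
lossy = lxml.html.fragment_fromstring(markup)
for br in lossy.xpath('.//br'):
    br.getparent().remove(br)
lossy_text = lossy.text_content()

# Approach from the answer: strip the <br> tags but keep their tail text
kept = lxml.html.fragment_fromstring(markup)
lxml.etree.strip_elements(kept, 'br', with_tail=False)
kept_text = kept.text_content()

print(repr(lossy_text))  # the Text1/Text2 lines are gone
print(repr(kept_text))   # the Text1/Text2 lines survive
```

Note that once the <br> tags are stripped, the surviving text runs together with no separator; if the line structure matters, it may be better to replace each <br> with a newline or a list item instead.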

Using SPLIT to create a list of HTML

I have a return value from a search I'm doing which returns a lot of HTML.
for i in deal_list:
    regex2 = '(?s)'+'<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="'+ i +'"'+'.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2, htmltext)
    html_captured = info2.group(0).split('</figure>')
    print(html_captured)
Here is an example what is being returned:
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
<p class="merchant-name truncation ">1742 Wine Bar</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Upper East Side</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$178.90</s>
<s class="discount-price">$49</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
<p class="merchant-name truncation ">Statler Grill</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Midtown</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$213</s>
<s class="discount-price">$89</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
I want to use html_captured = info2.group(0).split('</figure>') so that the HTML between each pair of figure tags becomes one element of a list, in this case html_captured.
It kind of works, except that each match becomes its own list with a '' at the end. For example: ['<figure .... </figure>','']['<figure .... </figure>','']
But what I want is ['<figure .... </figure>','<figure .... </figure>','<figure .... </figure>'...etc]

There are special tools for parsing HTML - HTML parsers.
Example using BeautifulSoup:
from bs4 import BeautifulSoup
data = """
your html here
"""
soup = BeautifulSoup(data, "html.parser")
print(soup.find_all('figure'))
Also see why you should not use regex for parsing HTML:
RegEx match open tags except XHTML self-contained tags
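To get the flat list of HTML strings the question asks for, each matched tag can be converted back to text with str(). This is a sketch on made-up, shortened figure markup; `html_captured` mirrors the question's variable name, while `deal_ids` is illustrative:

```python
from bs4 import BeautifulSoup

# Made-up, shortened stand-in for the page from the question
htmltext = """
<figure class="deal-card" data-bhc="deal:one"><a href="//www"><img></a></figure>
<figure class="deal-card" data-bhc="deal:two"><a href="//www"><img></a></figure>
"""

soup = BeautifulSoup(htmltext, "html.parser")
figures = soup.find_all("figure")

# One string per <figure>, all in a single list
html_captured = [str(figure) for figure in figures]

# Attributes are available directly, with no regex needed
deal_ids = [figure["data-bhc"] for figure in figures]

print(len(html_captured))  # 2
print(deal_ids)            # ['deal:one', 'deal:two']
```

The serialized strings may differ cosmetically from the input (for example, the parser writes the empty img tag in self-closing form), so they should not be compared byte for byte with the original page.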
