I have text (data entered by users of a CRM) that I get from a web service, which returns it in a terrifying format. I am filtering it with Python before using the data, but when I try to remove the line breaks (br), the texts get removed along with them. The code is as follows:
description = '''
<div id="highlight" class="section">
<p>
text...............
</p>
<br>
<h1>TITLE</h1>
<p>Multiple text
<br>
</p>
<ul>
<li>bad layer....</li>
</ul>
<p>
<br>subTitle
</p>
<p> </p>
<p style="text-align: center;">
<br>Text1
<br>Text2
<br>Text3
<br>Text4
<br>Text5
<br>Text6
</p>
<p style="text-align: center;">
<strong>small title</strong>
<br>Text small</p>
<p style="text-align: center;">
<strong>highlighted text</strong>
<br>
<br><strong>Text1</strong>
<br>Text2
<br>Text3
<br>Text4
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>Text1
<br>Text2
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>description
</p>
<p style="text-align: center;">
<br> </p>
<p><strong>description two</strong></p>
<p>
<br> </p>
</div>
'''
from lxml import html

tree = html.fragment_fromstring(description)
for element in tree.xpath('//br'):
    # element.getparent().remove(element)
    print(element.text)
    print(element.getparent().getchildren())
    # print(element)
    # print(element.getparent())
    # print(element.getchildren())
    # print(element.getnext())
    # print('--------------------------------')
I have tried to remove the br with element.getparent().remove(element), but that also deletes the text. I ran tests to see whether the texts belong to any node, but they don't appear to.
I've thought about replacing each br with an li and turning the styled p into a ul, but I can't figure out how to do it. Something like this (the preceding text omitted):
..........
..........
<ul>
<li>Text1</li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
<li>Text5</li>
<li>Text6</li>
</ul>
<ul>
<li><strong>small title</strong></li>
<li>Text small</li></ul>
<ul>
<li><strong>highlighted text</strong></li>
<li><strong>Text1</strong></li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>Text1</li>
<li>Text2</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>description</li>
</ul>
<ul>
<li> </li></ul>
........
I can't figure out how to get at the texts, because I thought I could simply select the styled p nodes with XPath, take their values, create li child nodes under a new ul parent, and then remove the p.
Is that possible? Thanks.
Regards
You can use lxml.etree.strip_elements, like so:
import lxml.etree
import lxml.html

tree = lxml.html.fragment_fromstring(description)
# with_tail=False keeps the text that follows each <br> (its .tail)
lxml.etree.strip_elements(tree, 'br', with_tail=False)
print(lxml.etree.tostring(tree, pretty_print=True))
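If you also want the <p>-to-<ul> conversion sketched in the question, something along these lines may work (a minimal sketch, assuming every <p> carrying a style attribute should become one <ul>, with each <strong> and each <br>-separated run of tail text as its own <li>):

import lxml.etree
import lxml.html

tree = lxml.html.fragment_fromstring(description)
for p in tree.xpath('//p[@style]'):
    ul = lxml.etree.Element('ul')
    if p.text and p.text.strip():
        lxml.etree.SubElement(ul, 'li').text = p.text.strip()
    for child in list(p):  # copy the child list, since nodes are moved below
        if child.tag == 'strong':
            tail, child.tail = child.tail, None
            lxml.etree.SubElement(ul, 'li').append(child)
            if tail and tail.strip():  # text that followed the </strong>
                lxml.etree.SubElement(ul, 'li').text = tail.strip()
        elif child.tag == 'br' and child.tail and child.tail.strip():
            lxml.etree.SubElement(ul, 'li').text = child.tail.strip()
    p.getparent().replace(p, ul)
print(lxml.etree.tostring(tree, pretty_print=True))

This works because lxml stores the text that follows a tag in that element's .tail, so the <br> tails carry exactly the texts you want as list items.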
Playing around with Python's bs4, I'm trying to work out how to skip the first div of a given class so that I can collect the data from the second one.
Below is an example of the HTML I am working with; I want to extract the ##Wanted data## block.
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_text, "html.parser")
div = soup.find("div")
PowerDetails = div.find_all("div", {"class": "PowerDetails"})
PowerDetails[1].find_all("p", {"class": "RunningCost"})
PowerDetails[1].find_all("p", {"class": "Time"})
find_all() returns a list; use slicing or an index to access just the elements you want.
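For example (a minimal sketch, reusing the soup object from the question; the variable names are illustrative):

power_details = soup.find_all("div", {"class": "PowerDetails"})
second = power_details[1]  # index 1 skips the first, unwanted div
running_costs = second.find_all("p", {"class": "RunningCost"})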
You can slice the resulting list to get the elements from the 1st index onwards. But first, note that you are not finding the right tags in your code.
from bs4 import BeautifulSoup
html_doc = """
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# You can get the divs with one line of code
powerDetails = soup.find_all(class_="PowerDetails")
print(len(powerDetails))  # Outputs 2

Now you can slice the list to ignore the first div:

powerDetails = powerDetails[1:]  # Get elements from the 2nd element onwards (ignoring the first one)
print(len(powerDetails))  # Outputs 1

Now you have a list with only one element:

print(powerDetails)
Output:
[<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>]
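From there, a possible next step (a sketch, using the class names from the HTML in the question) is to pull the price/time pairs out of each Company block:

for company in powerDetails[0].find_all("div", class_="Company"):
    costs = [p.get_text(strip=True) for p in company.find_all("p", class_="RunningCost")]
    times = [p.get_text(strip=True) for p in company.find_all("p", class_="Time")]
    print(list(zip(times, costs)))
# [('peek', '$8.88'), ('Off-peek', '$9.99')]
# [('peek', '$8.88'), ('Off-peek', '$7.77')]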
Another method.
from simplified_scrapy import SimplifiedDoc
html = '''
##Pointless Data###
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
</div>
##Wanted data##
<div class="PowerDetails">
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $9.99</p>
<p class="Time">Off-peek</p>
</div>
<div class="Company">
<p class="RunningCost">$8.88</p>
<p class="Time">peek</p>
<p class="RunningCost"> $7.77</p>
<p class="Time">Off-peek</p>
</div>
</div>
'''
doc = SimplifiedDoc(html)
# First method: get all, then use the index.
PowerDetails = doc.selects('div.PowerDetails')[1].selects('div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])

# Second method: skip the first div with the start parameter.
PowerDetails = doc.getElement('div', value='PowerDetails',
                              start='class="PowerDetails"').selects('div.Company').selects('p')
for ps in PowerDetails:
    print([(p['class'], p.text) for p in ps])
Result:
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$9.99'), ('Time', 'Off-peek')]
[('RunningCost', '$8.88'), ('Time', 'peek'), ('RunningCost', '$7.77'), ('Time', 'Off-peek')]
I just came across lxml in Python, and I need some help as I have no experience with XPath.
I want to get text data from a webpage into a dictionary.
I'm referring to the html snippet I posted below. Within the original html page there's a div element of the class general-info that I retrieve using the following line:
general_info = document_tree.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]")
From here on I want to iterate over the nested divs and get the two <p> tags as key and value, with the text inside the <strong> being the key.
There can also be empty div tags and there can be a special case where the key and the value for the dictionary can be within the same div (see the last element).
EDIT:
The number of elements can change, so it would be best to use the <strong> tags as starting point and then search for the next <p> tag.
This is code that I was able to write using BeautifulSoup:
generalinfo = documentSoup.findAll("div", {"class": "general-info"})
if generalinfo:
    strongs = generalinfo[0].find_all('strong')
    for descr in strongs:
        p = descr.find_next_sibling("p")
        if p:
            key = descr.text.strip().rstrip(':')
            details_dict[key] = p.text.strip()
        else:
            nextdiv = descr.parent.parent.find_next_sibling("div")
            if nextdiv:
                child = nextdiv.findChild()
                if child:
                    key = descr.text.strip()[:-1]
                    details_dict[key] = child.text.strip()
I am going for the following output:
{'Title:': 'This is a title',
 'Owner:': 'This is an owner',
 'Category:': 'This is a category',
 'Type:': 'This is a type',
 'Special case:': 'This is a special case'}
If anyone can help me out here, I would appreciate it!
The HTML code:
<body>
<main>
<div>
...
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
...
I believe this is about as generalized as I can get given the HTML provided:

general_info = doc.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]//p[@class='margin-0']")
for i in general_info:
    if len(i.xpath('./strong/text()')) > 0:
        topic = i.xpath('./strong/text()')[0]
    if len(i.text.strip()) > 0:
        print(topic + ' ' + i.text.replace('\n', '').strip())
special = general_info[0].xpath('./ancestor::div[@class="general-info margin-bottom-20 margin-top-20"]//div/div/strong')[0]
print(special.text + ' ' + special.xpath('./following-sibling::p/text()')[0])
Output:

Title: This is a title
Owner: This is an owner
Category: This is a category
Type: This is a type
Special case: This is a special case
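If you want the dictionary from your expected output directly, a variant of the same idea (a sketch; it assumes every <strong> label is followed, in document order, by a non-empty <p> holding its value):

details = {}
labels = doc.xpath("//div[contains(concat(' ', normalize-space(@class), ' '), 'general-info')]//strong")
for label in labels:
    value = label.xpath('(./following::p[normalize-space()])[1]')
    if value:
        details[label.text.strip()] = value[0].text_content().strip()
print(details)
# {'Title:': 'This is a title', 'Owner:': 'This is an owner', ...}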
I recommend another solution, which is well suited to extracting data from XML and HTML.
from simplified_scrapy.spider import SimplifiedDoc
html='''
<body>
<main>
<div class="general-info margin-bottom-20 margin-top-20">
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Title:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a title</p>
</div>
</div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Owner:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is an owner</p>
</div>
</div>
<h2 class="h3 margin-top-10 margin-bottom-10 padding-x-20">Validity</h2>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Category:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a category</p>
</div>
</div>
<div class="row padding-x-40"></div>
<div class="row padding-x-20">
<div class="col-sm-4">
<p class="margin-0">
<strong>Type:</strong>
</p>
</div>
<div class="col-sm-8">
<p class="margin-0">This is a type</p>
</div>
</div>
<div class="row padding-x-40">
<div>
<strong>Special case:</strong>
<p>This is a special case</p>
</div>
</div>
</div>
'''
data={}
doc = SimplifiedDoc(html) # create doc
divs = doc.selects('div.general-info')
# First way
for div in divs:
    strongs = div.strongs
    for strong in strongs:
        p = strong.next
        if not p:
            p = strong.parent.next
        data[strong.text] = p.text
print(data)

data = {}
# Second way
for div in divs:
    ds = div.selects('strong|p>text()')
    for i in range(0, len(ds), 2):
        data[ds[i]] = ds[i + 1]
print(data)
Result:
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
{'Title:': 'This is a title', 'Owner:': 'This is an owner', 'Category:': 'This is a category', 'Type:': 'This is a type', 'Special case:': 'This is a special case'}
Here are more examples: https://github.com/yiyedata/simplified-scrapy-demo/blob/master/doc_examples/
How can I find all the text on a page that matches these criteria using Beautiful Soup?
<tr>
<td class="d_g_l_e" style="border-right:none;">
<img src="/d2l/img/LP/pixel.gif" width="20" height="20" alt="">
</td>
<th scope="row" class="d_gt d_ich" style="border-left:none;">
<div class="dco">
<div class="dco_c">
<div class="dco">
<div class="dco_c">
<strong> **EXTRACT THIS (NAME)** </strong>
</div>
</div>
</div>
</div>
</th>
<td class="d_gn d_gr d_gt">
<div class="dco">
<div class="dco_c">
<div class="dco">
<div class="dco_c" style="text-align:right;">
<div style="text-align:center;display:inline;">
<label id="z_c"> **EXTRACT THIS (GRADE)** </label>
</div>
</div>
</div>
</div>
</div>
</td>
<td class="d_gn d_gr d_gt"> </td>
</tr>
I want the program to scan the whole HTML page and collect all of the values that appear in this form. If a tr tag (the main tag I'm looking for) has both a NAME and a GRADE underneath it, add the name to one list (List1) and the grade to a separate list (List2). If either of the two is missing underneath the tr tag, skip it and record nothing. By the time the script is done scanning the page, the lists would look something like:
List1 = [Grade 1, Grade 2, Grade 3, Grade 4]
List2 = [10/20, 20/40, 50/50, 33/44]
Also, the "z" label id for the grade text changes from grade to grade, e.g. z_a, z_b, z_c.
For each tr on the page, find the strong tag inside the th and the label tag inside the td:

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
for row in soup.find_all('tr'):
    name = row.select('th strong')
    grade = row.select('td label')
    if name and grade:
        print(name[0].text, grade[0].text)
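To build the two lists described in the question (a sketch along the same lines; rows missing either piece are skipped):

names, grades = [], []
for row in soup.find_all('tr'):
    name = row.select_one('th strong')
    grade = row.select_one('td label')
    if name and grade:
        names.append(name.get_text(strip=True))
        grades.append(grade.get_text(strip=True))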
I'm trying to find all elements that are h3 class="threadtitle" and, within each such element, if the text "NSW" is present, return the value of the <a> element.
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
This is what I have so far:
I can find individual elements like this:
from lxml import html

response = '''
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
'''
tree = html.fromstring(response)
test = tree.xpath("//font[text()='NSW']")
# or
test2 = tree.xpath("//h3[@class='threadtitle']")
for i in test:
    print(i.text)  # prints: NSW
But I don't know how to combine these.
The above example should return 'Banana man'
Try this XPath:

//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()
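Plugged into the code above, it would look something like this (a sketch):

titles = tree.xpath("//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()")
print(titles)  # ['Banana man']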
I have a return value from a search I'm doing which returns a lot of HTML.
import re

for i in deal_list:
    regex2 = ('(?s)<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" '
              'data-bhc="' + i + '".+?</figure>')
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2, htmltext)
    html_captured = info2.group(0).split('</figure>')
    print(html_captured)
Here is an example of what is being returned:
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
<p class="merchant-name truncation ">1742 Wine Bar</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Upper East Side</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$178.90</s>
<s class="discount-price">$49</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
<p class="merchant-name truncation ">Statler Grill</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Midtown</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$213</s>
<s class="discount-price">$89</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
I want to use html_captured = info2.group(0).split('</figure>') so that the HTML between each set of figure tags becomes an element of a list, in this case html_captured.
It kind of works, except that each match becomes its own list with an empty string at the end. For example: ['<figure .... </figure>', '']['<figure .... </figure>', '']
But what I want is ['<figure .... </figure>', '<figure .... </figure>', '<figure .... </figure>', ...]
There are special tools for parsing HTML: HTML parsers.
Example using BeautifulSoup:
from bs4 import BeautifulSoup

data = """
your html here
"""

soup = BeautifulSoup(data, "html.parser")
print(soup.find_all('figure'))
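From there, each figure can be queried directly instead of splitting strings (a sketch; the class names come from the HTML in the question):

for figure in soup.find_all('figure'):
    title = figure.find('p', class_='deal-title').get_text(strip=True)
    price = figure.find('s', class_='discount-price').get_text(strip=True)
    print(title, price)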
Also see why you should not use regex for parsing HTML:
RegEx match open tags except XHTML self-contained tags