Python xpath and conditionals - python

I'm trying to find all h3 elements with class="threadtitle" and, for each one that contains the text "NSW", return the text of its <a> element.
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
This is what I have so far. I can find the individual elements like this:
from lxml import html
response = '''
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
'''
# response is already a string here, so it is parsed directly
tree = html.fromstring(response)
test = tree.xpath("//font[text()='NSW']")
# or
test2 = tree.xpath("//h3[@class='threadtitle']")
for i in test:
    print(i.text)
NSW
But I don't know how to combine these.
The above example should return 'Banana man'

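Try this XPath. The first predicate selects the h3 by class, the second keeps only those h3 elements that have a descendant font whose text is 'NSW', and /a/text() returns the link text:
//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()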
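For reference, here is a minimal sketch of that expression run through lxml. The second thread ("VIC" / "Apple man") is made up for illustration, only to show that non-matching h3 elements are skipped.

```python
from lxml import html

# Sample markup: the first <h3> is from the question, the second one
# (VIC / "Apple man") is a hypothetical non-matching thread.
page = '''
<div>
<h3 class="threadtitle">
<span class="prefix understate"><b><font size="2" face="arial" color="#0000FF">NSW</font></b></span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
<h3 class="threadtitle">
<span class="prefix understate"><b><font size="2" face="arial" color="#0000FF">VIC</font></b></span>
<a id="thread_title_5678" class="title" href="showthread.php?t=5678">Apple man</a>
</h3>
</div>
'''

tree = html.fromstring(page)

# Keep only the <h3 class="threadtitle"> elements that contain a <font>
# whose text is exactly 'NSW', then take the text of their <a> child.
titles = tree.xpath(
    "//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()"
)
print(titles)  # ['Banana man']
```

Note that @class='threadtitle' is an exact string match, so it would not match an h3 that carries additional classes.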

Related

How to get value of html tags with beautifulsoup in python?

I scraped multiple pages. Some of them contain a div with the class red and some do not. For the pages that have it, I want to store the values in an array; for the pages that do not, I want an empty entry in the array. I wrote the code below, and now I want to get the text values out of the tags. Can you help me?
for i in soup:
    search = i.find_all('div', {'class': "red"})
    if len(search) > 0:
        whoFollowThisDr.append(i.find_all('div', {'class': "info"}, 'span'))
        i = i.text
    else:
        whoFollowThisDr.append(' ')
whoFollowThisDr
output:
[[<div class="info"> <strong> a</strong> <span>b</span> </div>,
<div class="info"> <strong> c</strong> <span>d</span> </div>,
<div class="info"> <strong style="font-size: 15px !important;"> e</strong> <span style="font-size: 12px !important;">f</span> </div>,
<div class="info"> <strong style="font-size: 15px !important;"> g</strong> <span style="font-size: 12px !important;">h</span> </div>],
[<div class="info"> <strong> i</strong> <span>j</span> </div>]]
What I want:
[[a,c,e],[i]]
i = i.text has no effect, since you never use i after the assignment. You should use .text when you're appending to the list. Use a list comprehension to call it on each element.
whoFollowThisDr.append([div.text for div in i.find_all('div', {'class':"info"},'span')])
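If only the text of the strong tag is wanted (the a, c, e in the desired output), something along these lines might work. This is a sketch under assumptions: the page contents below are invented stand-ins, who_follow is a renamed version of whoFollowThisDr, and pages without a red div get an empty list rather than ' '.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-ins for the scraped pages: the first one contains a
# "red" div, the second one does not.
pages_html = [
    '<div class="red"></div>'
    '<div class="info"><strong> a</strong><span>b</span></div>'
    '<div class="info"><strong> c</strong><span>d</span></div>',
    '<div class="info"><strong> x</strong><span>y</span></div>',
]
pages = [BeautifulSoup(p, "html.parser") for p in pages_html]

who_follow = []
for page in pages:
    if page.find("div", class_="red"):
        # Take the stripped text of the <strong> inside each "info" div.
        who_follow.append(
            [div.strong.get_text(strip=True)
             for div in page.find_all("div", class_="info")]
        )
    else:
        # No "red" div on this page: store an empty list.
        who_follow.append([])

print(who_follow)  # [['a', 'c'], []]
```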

Python: how to find the right value with soup

I am trying to get the price of an item from the following HTML.
This is the source:
<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
This is the code that I tried
html = """\
HTML Code here from the top.
"""
soup = Soup(html)
soup.find("span", {"a-size-base a-color-price": ""}).text
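There are a number of issues in your code: Soup is never imported in the snippet (the bs4 class is BeautifulSoup), no parser is specified, and the attribute dictionary uses the class value as its key instead of "class". See below: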
html = """<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find("span", {"class":"a-size-base a-color-price"}).text.strip())
output
1.930,99 €
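As an alternative sketch, a CSS selector matches on a single class, so it does not depend on the exact order of the class attribute. The conversion to Decimal below assumes the German number format shown in the question ("." for thousands, "," for decimals); the variable names are my own.

```python
from decimal import Decimal

from bs4 import BeautifulSoup

html_doc = (
    '<a class="a-link-normal" href="#">'
    '<span>Neu (3) ab </span>'
    '<span class="a-size-base a-color-price">1.930,99 €</span>'
    '</a>'
)
soup = BeautifulSoup(html_doc, "html.parser")

# select_one returns the first element matching the CSS selector, or None.
price_tag = soup.select_one("span.a-color-price")
price_text = price_tag.get_text(strip=True)
print(price_text)  # 1.930,99 €

# Assumption: German formatting, so drop the currency sign and the
# thousands separator, then turn the decimal comma into a point.
number = price_text.replace("€", "").strip().replace(".", "").replace(",", ".")
price = Decimal(number)
print(price)  # 1930.99
```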

Not getting desired text from BeautifulSoup

<h3 class="jd_header3 text" style="font-size: 12px;">
Shift Pattern:
</h3>
<ul class="jd_NoBulletinRight">
<li style="font-size:11px;">
<span class="text">
No Shift
</span>
</li>
</ul>
<h3 class="jd_header3 text" style="align:left;font-size:12px;">
Salary:
</h3>
<ul class="jd_NoBulletinRight">
<li>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td align="left" style="word-wrap: break-word;font-size: 11px;" valign="top">
<span class="text">
S$3,500.00
<span class="text">
-
</span>
S$5,400.00
</span>
</td>
</tr>
</tbody>
</table>
</li>
</ul>
This is a part of my BeautifulSoup tree. I wish to get the salary range S$3500 - S$5400. Following the suggestion here I use the following code:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td").get_text()
print(salary)
I get the error:
AttributeError: 'int' object has no attribute 'get_text'
But when I simply print out the integer:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td")
print(salary)
I get:
-1
Which is not what I want. I have used Selenium to obtain the page, so any javascript is already loaded.
Any ideas?
Try the following code. I don't think your original code can produce the output you expect:
>>> bsObj.find('td', {'align': "left"}).text
'\n\n S$3,500.00\n \n -\n \n S$5,400.00\n \n'
>>> ' '.join(bsObj.find('td', {'align': "left"}).text.split())
'S$3,500.00 - S$5,400.00'
Not sure about this "get_text" attribute, but with BeautifulSoup, I rely heavily on .text as shown below. Is this what you're looking for?
s = '''<html here>'''
soup = BeautifulSoup(s, 'html.parser')
bsObj = soup.findAll('td')
for i in bsObj:
    print(i.text)
>>>
S$3,500.00
-
S$5,400.00
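On the -1 itself: the most likely cause is that nextSibling is not the ul but the whitespace text node between </h3> and <ul>. That node is a string, so .find("td") ends up being the plain string method, which returns -1 when the substring is missing. A sketch of a way around it, assuming a trimmed-down copy of the markup from the question:

```python
import re

from bs4 import BeautifulSoup

# Trimmed-down version of the markup in the question.
html_doc = """
<h3 class="jd_header3 text">Salary:</h3>
<ul class="jd_NoBulletinRight">
<li><table><tbody><tr>
<td align="left"><span class="text"> S$3,500.00 <span class="text"> - </span> S$5,400.00 </span></td>
</tr></tbody></table></li>
</ul>
"""
soup = BeautifulSoup(html_doc, "html.parser")

header = soup.find("h3", string=re.compile("Salary"))

# The direct next sibling is the newline between </h3> and <ul>,
# which is a string and not a tag.
print(repr(header.next_sibling))

# find_next_sibling("ul") skips text nodes and returns the next <ul> tag.
td = header.find_next_sibling("ul").find("td")

# Collapse all runs of whitespace into single spaces.
salary = " ".join(td.get_text().split())
print(salary)  # S$3,500.00 - S$5,400.00
```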

How can one replace an element in lxml?

I get text (data entered by CRM users) from a web service, which returns it in a "terrifying format". I filter it with Python before using the data, but when I remove the line breaks (br), the text that follows them is removed as well. The code is as follows:
description = '''
<div id="highlight" class="section">
<p>
text...............
</p>
<br>
<h1>TITLE</h1>
<p>Multiple text
<br>
</p>
<ul>
<li>bad layer....</li>
</ul>
<p>
<br>subTitle
</p>
<p> </p>
<p style="text-align: center;">
<br>Text1
<br>Text2
<br>Text3
<br>Text4
<br>Text5
<br>Text6
</p>
<p style="text-align: center;">
<strong>small title</strong>
<br>Text small</p>
<p style="text-align: center;">
<strong>highlighted text</strong>
<br>
<br><strong>Text1</strong>
<br>Text2
<br>Text3
<br>Text4
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>Text1
<br>Text2
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>description
</p>
<p style="text-align: center;">
<br> </p>
<p><strong>description two</strong></p>
<p>
<br> </p>
</div>
'''
from lxml import html
tree = html.fragment_fromstring(description)
for element in tree.xpath('//br'):
    # element.getparent().remove(element)
    print(element.text)
    print(element.getparent().getchildren())
    # print(element)
    # print(element.getparent())
    # print(element.getchildren())
    # print(element.getnext())
    # print('--------------------------------')
I have tried removing the br with element.getparent().remove(element), but that also deletes the text. I ran tests to see whether the text belongs to any node, but it does not seem to.
I've thought about replacing each br with an li and turning each styled p into a ul, but I can't work out how to do it. Something like this (for the text above):
..........
..........
<ul>
<li>Text1</li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
<li>Text5</li>
<li>Text6</li>
</ul>
<ul>
<li><strong>small title</strong></li>
<li>Text small</li></ul>
<ul>
<li><strong>highlighted text</strong></li>
<li><strong>Text1</strong></li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>Text1</li>
<li>Text2</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>description</li>
</ul>
<ul>
<li> </li></ul>
........
I can't work out how to get hold of the text. My idea was to select each styled p node and its contents with XPath, create li child nodes under a parent ul, and remove the p.
Is this possible? Thanks.
Regards
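You can use lxml.etree.strip_elements with with_tail=False, which removes the br elements but keeps the text that follows them (their tail), like so:
import lxml.etree
import lxml.html
tree = lxml.html.fragment_fromstring(description)
lxml.etree.strip_elements(tree, 'br', with_tail=False)
print(lxml.etree.tostring(tree, pretty_print=True, encoding="unicode"))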
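To see why the text disappeared in the first place: in lxml, the text that follows a tag is stored on that element as its tail, so removing the br with remove() takes the tail along with it. A small sketch on a shortened paragraph (taken from the description above, not the full document):

```python
import lxml.etree
import lxml.html

# One short paragraph in the same shape as the ones in the description.
p = lxml.html.fragment_fromstring(
    "<p><strong>small text</strong><br>Text1<br>Text2</p>"
)

# The text after each <br> lives in that element's .tail attribute.
tails = [br.tail for br in p.iter("br")]
print(tails)  # ['Text1', 'Text2']

# with_tail=False drops the <br> elements but keeps their tail text.
lxml.etree.strip_elements(p, "br", with_tail=False)

remaining_br = p.findall(".//br")
text = p.text_content()
print(remaining_br)  # []
print(text)
```

The same tail values could be used as the text of new li elements if the goal is to build the ul structure sketched in the question.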

Using SPLIT to create a list of HTML

I have a return value from a search I'm doing which returns a lot of HTML.
import re
for i in deal_list:
    regex2 = '(?s)' + '<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="' + i + '"' + '.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2, htmltext)
    html_captured = info2.group(0).split('</figure>')
    print(html_captured)
Here is an example what is being returned:
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
<p class="merchant-name truncation ">1742 Wine Bar</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Upper East Side</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$178.90</s>
<s class="discount-price">$49</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{&quot;accessType&quot;:&quot;extended&quot;}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
<p class="merchant-name truncation ">Statler Grill</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Midtown</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$213</s>
<s class="discount-price">$89</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
I want to use html_captured = info2.group(0).split('</figure>') so that the HTML of each figure element becomes one element of a single list, in this case html_captured.
It kind of works, except that each match becomes its own list with a '' at the end. For example: ['<figure .... </figure>','']['<figure .... </figure>','']
But what I want is ['<figure .... </figure>','<figure .... </figure>','<figure .... </figure>'...etc]
There are special tools for parsing HTML - HTML parsers.
Example using BeautifulSoup:
from bs4 import BeautifulSoup
data = """
your html here
"""
soup = BeautifulSoup(data, 'html.parser')
print([figure for figure in soup.find_all('figure')])
Also see why you should not use regex for parsing HTML:
RegEx match open tags except XHTML self-contained tags
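A slightly fuller sketch of the same idea, producing the single flat list of strings the question asks for. The two figure elements are shortened, made-up stand-ins for the real markup, and the data-bhc values are hypothetical.

```python
from bs4 import BeautifulSoup

# Shortened stand-ins for the real <figure> blocks.
data = """
<figure class="deal-card" data-bhc="deal:one"><a href="//www"><img></a></figure>
<figure class="deal-card" data-bhc="deal:two"><a href="//www"><img></a></figure>
"""
soup = BeautifulSoup(data, "html.parser")

# One string per <figure> element, all in a single list.
html_captured = [str(figure) for figure in soup.find_all("figure")]
print(len(html_captured))  # 2

# A single deal can be looked up by its data-bhc attribute instead of
# building a regex around it.
one_deal = soup.find("figure", attrs={"data-bhc": "deal:two"})
print(one_deal["data-bhc"])  # deal:two
```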
