Python xpath and conditionals


I'm trying to find all h3 elements with class="threadtitle" and, if such an element contains the text "NSW", return the text of its <a> element.
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
This is what I have so far. I can find the individual elements like this:
import requests
from lxml import etree, html
response = '''
<h3 class="threadtitle">
<img border="0" alt="MARKET PLACE/AUCTIONS" src="vbcover/ibid/images/auction_open.png" title="MARKET PLACE/AUCTIONS">
<span class="prefix understate">
<b>
<font size="2" face="arial" color="#0000FF">NSW</font>
</b>
</span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
'''
tree = html.fromstring(response)
test = tree.xpath("//font[text()='NSW']")
# or
test2 = tree.xpath("//h3[@class='threadtitle']")
for i in test:
    print(i.text)
NSW
But I don't know how to combine these.
The above example should return 'Banana man'

Try this XPath:
//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()
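A minimal sketch of that expression in use, assuming lxml is installed. The second h3 (prefix "VIC", title "Apple man") is made up here only to show that threads without "NSW" are skipped:

```python
from lxml import html

# Sample markup: the first <h3> comes from the question,
# the second one is a hypothetical non-matching thread.
page = """
<div>
<h3 class="threadtitle">
<span class="prefix understate"><b><font size="2" face="arial" color="#0000FF">NSW</font></b></span>
<a id="thread_title_1234" class="title" href="showthread.php?t=1234">Banana man</a>
</h3>
<h3 class="threadtitle">
<span class="prefix understate"><b><font size="2" face="arial" color="#0000FF">VIC</font></b></span>
<a id="thread_title_5678" class="title" href="showthread.php?t=5678">Apple man</a>
</h3>
</div>
"""

tree = html.fromstring(page)

# First predicate: the class must be "threadtitle".
# Second predicate: some descendant <font> must have the text "NSW".
# Then step to the <a> child and take its text node.
titles = tree.xpath(
    "//h3[@class='threadtitle'][descendant::font/text() = 'NSW']/a/text()"
)
print(titles)
```

Chaining two predicates on the same step acts as a logical AND, which is what combines the two separate queries from the question.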

Related

How to get value of html tags with beautifulsoup in python?

I scraped multiple pages. Some of them contain a div with the class "red" and some do not. For the pages that have it, I want to store the values in an array; for the pages that do not, I want an empty array. That is why I wrote the code below, and now I want to get the text values out of the elements. Can you help me?
for i in soup:
    search = i.find_all('div', {'class':"red"})
    if len(search)>0:
        whoFollowThisDr.append(i.find_all('div', {'class':"info"},'span'))
        i = i.text
    else:
        whoFollowThisDr.append(' ')
whoFollowThisDr
output:
[[<div class="info"> <strong> a</strong> <span>b</span> </div>,
<div class="info"> <strong> c</strong> <span>d</span> </div>,
<div class="info"> <strong style="font-size: 15px !important;"> e</strong> <span style="font-size: 12px !important;">f</span> </div>,
<div class="info"> <strong style="font-size: 15px !important;"> g</strong> <span style="font-size: 12px !important;">h</span> </div>],
[<div class="info"> <strong> i</strong> <span>j</span> </div>]]
What I want:
[[a,c,e],[i]]

i = i.text has no effect, since you never use i after the assignment. You should use .text when you're appending to the list. Use a list comprehension to call it on each element.
whoFollowThisDr.append([div.text for div in i.find_all('div', {'class':"info"},'span')])
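A self-contained sketch of the same idea, under a few assumptions: the wanted values are the texts of the strong tags (as the desired output suggests), pages without a "red" div get an empty list rather than ' ', and the two sample pages below are invented for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical sample pages: the first has a "red" div, the second does not.
pages_html = [
    '<div class="red"></div>'
    '<div class="info"> <strong> a</strong> <span>b</span> </div>'
    '<div class="info"> <strong> c</strong> <span>d</span> </div>',
    '<div class="info"> <strong> x</strong> <span>y</span> </div>',
]

who_follow_this_dr = []
for page in pages_html:
    soup = BeautifulSoup(page, "html.parser")
    if soup.find("div", class_="red"):
        # Take the stripped <strong> text of every "info" div on this page.
        who_follow_this_dr.append(
            [div.strong.get_text(strip=True)
             for div in soup.find_all("div", class_="info")]
        )
    else:
        # No "red" div on this page: store an empty list.
        who_follow_this_dr.append([])

print(who_follow_this_dr)  # [['a', 'c'], []]
```

Using div.text instead of div.strong.get_text(strip=True) would return the whole text of each div, including the span and the surrounding whitespace.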

Python: how to find the right value with soup

I am trying to get the price of an item from the following HTML.
This is the source:
<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
This is the code that I tried
html = """\
HTML Code here from the top.
"""
soup = Soup(html)
soup.find("span", {"a-size-base a-color-price": ""}).text

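There are a number of issues in your code. The attribute filter passed to find() uses the class value as the key; it has to map the attribute name "class" to that value. It also helps to import BeautifulSoup explicitly and to name the parser. See below: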
html = """<div class="a-section a-spacing-small a-spacing-top-small">
<span class="a-declarative" data-action="show-all-offers-display" data-show-all-offers-display="{}">
<a class="a-link-normal" href="/gp/offer-listing/B08HLZXHZY/ref=dp_olp_NEW_mbc?ie=UTF8&condition=NEW">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
</span>
<span class="a-size-base a-color-base">& <b>Kostenlose Lieferung</b></span>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
print(soup.find("span", {"class":"a-size-base a-color-price"}).text.strip())
output
1.930,99 €
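As a possible alternative, here is a sketch that matches on a single class or on a CSS selector, so it does not depend on the exact order of the names in the class attribute. The markup is shortened from the question and the href is replaced with a placeholder:

```python
from bs4 import BeautifulSoup

# Shortened sample; href="#" is a placeholder, not the original link.
html_doc = """
<a class="a-link-normal" href="#">
<span>Neu (3) ab </span><span class="a-size-base a-color-price">1.930,99 €</span>
</a>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# class_ matches if any one of the element's classes equals the given name.
by_class = soup.find("span", class_="a-color-price")

# A CSS selector requiring both classes, in any order.
by_css = soup.select_one("span.a-size-base.a-color-price")

price = by_css.get_text(strip=True)
print(price)  # 1.930,99 €
```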

Not getting desired text from BeautifulSoup

<h3 class="jd_header3 text" style="font-size: 12px;">
Shift Pattern:
</h3>
<ul class="jd_NoBulletinRight">
<li style="font-size:11px;">
<span class="text">
No Shift
</span>
</li>
</ul>
<h3 class="jd_header3 text" style="align:left;font-size:12px;">
Salary:
</h3>
<ul class="jd_NoBulletinRight">
<li>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td align="left" style="word-wrap: break-word;font-size: 11px;" valign="top">
<span class="text">
S$3,500.00
<span class="text">
-
</span>
S$5,400.00
</span>
</td>
</tr>
</tbody>
</table>
</li>
</ul>
This is a part of my BeautifulSoup tree. I wish to get the salary range S$3500 - S$5400. Following the suggestion here I use the following code:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td").get_text()
print(salary)
I get the error:
AttributeError: 'int' object has no attribute 'get_text'
But when I simply print out the integer:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td")
print(salary)
I get:
-1
Which is not what I want. I have used Selenium to obtain the page, so any javascript is already loaded.
Any ideas?

Try the following code. I don't think your code can produce the output you expect:
>>> bsObj.find('td', {'align': "left"}).text
'\n\n S$3,500.00\n \n -\n \n S$5,400.00\n \n'
>>> ' '.join(bsObj.find('td', {'align': "left"}).text.split())
'S$3,500.00 - S$5,400.00'

Not sure about this "get_text" attribute, but with BeautifulSoup, I rely heavily on .text as shown below. Is this what you're looking for?
s = '''<html here>'''
soup = BeautifulSoup(s, 'html.parser')
bsObj = soup.findAll('td')
for i in bsObj:
print(i.text)
>>>
S$3,500.00
-
S$5,400.00
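For what it's worth, the -1 in the question most likely comes from nextSibling returning the whitespace string between </h3> and <ul>. That string is a NavigableString, which is also a str, so .find("td") is the ordinary string search and returns -1. A sketch that skips the whitespace with find_next_sibling, using a trimmed copy of the markup from the question:

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (styles and attributes dropped).
html_doc = """
<h3 class="jd_header3 text">
 Salary:
</h3>
<ul class="jd_NoBulletinRight">
 <li><table><tbody><tr>
  <td align="left">
   <span class="text">
    S$3,500.00
    <span class="text">
     -
    </span>
    S$5,400.00
   </span>
  </td>
 </tr></tbody></table></li>
</ul>
"""

soup = BeautifulSoup(html_doc, "html.parser")
header = soup.find("h3", string=re.compile("Salary:"))

# The direct next sibling is a whitespace text node, not the <ul>.
sibling = header.next_sibling
print(isinstance(sibling, str))  # True

# find_next_sibling("ul") skips text nodes and returns the next <ul> tag.
td = header.find_next_sibling("ul").find("td")

# Collapse the runs of whitespace left over from the nested spans.
salary = " ".join(td.get_text().split())
print(salary)  # S$3,500.00 - S$5,400.00
```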

How can one replace an element in lxml?

I receive text (data entered by CRM users) from a web service, which returns it in a "terrifying format". I filter it with Python before using the data, but when I remove the line breaks (br), the text that follows them is removed as well. The code is as follows:
description = '''
<div id="highlight" class="section">
<p>
text...............
</p>
<br>
<h1>TITLE</h1>
<p>Multiple text
<br>
</p>
<ul>
<li>bad layer....</li>
</ul>
<p>
<br>subTitle
</p>
<p> </p>
<p style="text-align: center;">
<br>Text1
<br>Text2
<br>Text3
<br>Text4
<br>Text5
<br>Text6
</p>
<p style="text-align: center;">
<strong>small title</strong>
<br>Text small</p>
<p style="text-align: center;">
<strong>highlighted text</strong>
<br>
<br><strong>Text1</strong>
<br>Text2
<br>Text3
<br>Text4
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>Text1
<br>Text2
</p>
<p style="text-align: center;">
<strong>small text</strong>
<br>description
</p>
<p style="text-align: center;">
<br> </p>
<p><strong>description two</strong></p>
<p>
<br> </p>
</div>
'''
from lxml import html
tree = html.fragment_fromstring( description )
for element in tree.xpath('//br'):
    #element.getparent().remove(element)
    print(element.text)
    print(element.getparent().getchildren())
    #print element
    #print element.getparent()
    #print element.getchildren()
    #print element.getnext()
    #print '--------------------------------'
I tried removing each br with element.getparent().remove(element), but that also deletes the text that follows it. I ran tests to see whether that text belongs to a node of its own, but it does not.
I have thought about replacing each br with an li and turning each styled p into a ul, but I can't work out how to do it. I want something like this (the preceding text unchanged):
..........
..........
<ul>
<li>Text1</li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
<li>Text5</li>
<li>Text6</li>
</ul>
<ul>
<li><strong>small title</strong></li>
<li>Text small</li></ul>
<ul>
<li><strong>highlighted text</strong></li>
<li><strong>Text1</strong></li>
<li>Text2</li>
<li>Text3</li>
<li>Text4</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>Text1</li>
<li>Text2</li>
</ul>
<ul>
<li><strong>small text</strong></li>
<li>description</li>
</ul>
<ul>
<li> </li></ul>
........
I can't work out how to get hold of the texts. My idea was to select each styled p node with XPath, take its contents, create li child nodes under a new parent ul, and then remove the p.
Is this possible? Thanks
Regards

You can use lxml.etree.strip_elements, like so:
import lxml.etree
import lxml.html
tree = lxml.html.fragment_fromstring(description)
lxml.etree.strip_elements(tree, 'br', with_tail=False)
print(lxml.etree.tostring(tree, pretty_print=True))
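A small sketch of why with_tail=False matters, using one paragraph from the question as input. In lxml the text that follows an element is stored as that element's tail, so removing a br together with its tail also removes the text after it:

```python
import lxml.etree
import lxml.html

# One paragraph taken from the question's sample.
fragment = (
    '<p style="text-align: center;">'
    '<strong>small title</strong><br>Text small</p>'
)
tree = lxml.html.fragment_fromstring(fragment)

# "Text small" is the tail of <br>, not a node of its own.
print(repr(tree.find('br').tail))

# Drop every <br> but keep the text stored in its tail.
lxml.etree.strip_elements(tree, 'br', with_tail=False)

cleaned = lxml.html.tostring(tree, encoding='unicode')
print(cleaned)
```

This only removes the br elements. Turning each br-separated run into li items would need extra code that is not shown here.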

Using SPLIT to create a list of HTML

I have a return value from a search I'm doing which returns a lot of HTML.
for i in deal_list:
    regex2 = '(?s)'+'<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="'+ i +'"'+'.+?</figure>'
    pattern2 = re.compile(regex2)
    info2 = re.search(pattern2,htmltext)
    html_captured = info2.group(0).split('</figure>')
    print(html_captured)
Here is an example what is being returned:
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:giorgios-brick-oven-pizza-wine-bar" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 73% Off Wine-Tasting Dinner at 1742 Wine Bar</p>
<p class="merchant-name truncation ">1742 Wine Bar</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Upper East Side</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Wine tasting includes three reds and three whites; dinner consists of one appetizer, two entrees, and a bottle of wine</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$178.90</s>
<s class="discount-price">$49</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
<figure class="deal-card deal-list-tile deal-tile deal-tile-standard" data-bhc="deal:statler-grill-4" data-bhd="{"accessType":"extended"}" data-bh-viewport="respect">
<a href="//www" class="deal-tile-inner">
<img>
<figcaption>
<div class="deal-tile-content">
<p class="deal-title should-truncate">Up to 59% Off Four-Course Dinner at Statler Grill</p>
<p class="merchant-name truncation ">Statler Grill</p>
<p class="deal-location truncate-others ">
<span class="deal-location-name">Midtown</span>
</p>
<div class="description should-truncate deal-tile-description"><p>Chefs sear marbled new york prime sirloin and dice fresh sashimi-grade tuna to satisfy appetites amid white tablecloths and chandeliers</p></div>
</div>
<div class="purchase-info clearfix ">
<p class="deal-price">
<s class="original-price">$213</s>
<s class="discount-price">$89</s>
</p>
<div class="hide show-in-list-view">
<p class="deal-tile-actions">
<button class="btn-small btn-buy" data-bhw="ViewDealButton">
View Deal
</button>
</p>
</div>
</div>
</figcaption>
</a>
</figure>
I want to use html_captured = info2.group(0).split('</figure>') so that the HTML between each set of figure tags becomes one element of a list, in this case html_captured.
It kind of works, except that each match becomes its own list with a '' at the end. For example: ['<figure .... </figure>','']['<figure .... </figure>','']
But what I want is ['<figure .... </figure>','<figure .... </figure>','<figure .... </figure>'...etc]

There are special tools for parsing HTML - HTML parsers.
Example using BeautifulSoup:
from bs4 import BeautifulSoup
data = """
your html here
"""
soup = BeautifulSoup(data, 'html.parser')
print([figure for figure in soup.find_all('figure')])
Also see why you should not use regex for parsing HTML:
RegEx match open tags except XHTML self-contained tags
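A runnable sketch of that approach which produces one flat list of strings. The two figure elements are heavily shortened stand-ins for the markup in the question, and the data-bhc values are made up:

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical stand-ins for the deal tiles in the question.
htmltext = """
<figure class="deal-card" data-bhc="deal:one"><a href="//www"><img></a></figure>
<figure class="deal-card" data-bhc="deal:two"><a href="//www"><img></a></figure>
"""

soup = BeautifulSoup(htmltext, "html.parser")

# One string per <figure>, all collected in a single list.
html_captured = [str(figure) for figure in soup.find_all("figure")]
print(len(html_captured))  # 2

# Why the original code produced a trailing '': split() returns the
# empty remainder that follows the final separator.
leftover = "<figure>x</figure>".split("</figure>")
print(leftover)  # ['<figure>x', '']
```

To select only certain deals, find_all also accepts attribute filters, for example soup.find_all("figure", attrs={"data-bhc": "deal:one"}).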
