Parse the date string from html in lxml

s = """
<tbody>
<tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr>
</tbody>
"""
I need to extract the date string from this HTML string.
This is what I tried:
import lxml.html
doc = lxml.html.fromstring(s)
doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
But this is not working. I need only the date string.

Your query selects the span element itself; you need to grab the text from it:
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
Most queries return a sequence, so I normally use a helper function that gets the first item.
from lxml import etree
s = """
<tbody>
<tr>
<td style="border-bottom: none">
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</td>
</tr>
</tbody>
"""
doc = etree.HTML(s)
def first(sequence, default=None):
    # Return the first item of the sequence, or the default if it is empty
    for item in sequence:
        return item
    return default
Then:
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]')
[<Element span at 1c9d4c8>]
>>> doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')
['\n 05/13/09 2:02am\n ']
>>> first(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()'),'').strip()
'05/13/09 2:02am'

Try the following instead of the last line:
print(doc.xpath('//span[@class="graytext" and @style="font-size: 11px"]/text()')[0])
The first part of the XPath expression is correct: //span[@class="graytext" and @style="font-size: 11px"] selects all matching span nodes. You then need to specify what to select from each node; text() used here selects the text contents of the node.
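As a rough sketch of the same idea, XPath's normalize-space() can return the trimmed text directly as a string, which can then be turned into a datetime. The simplified HTML fragment and the strptime format string are assumptions based on the sample in the question, not something the original answers used.

```python
from datetime import datetime
import lxml.html

# Hypothetical, simplified fragment modeled on the question's markup
html = """
<div>
<span class="graytext" style="font-weight: bold;"> Reply #3 - </span>
<span class="graytext" style="font-size: 11px">
05/13/09 2:02am
<br>
</span>
</div>
"""
doc = lxml.html.fromstring(html)

# normalize-space() returns a plain string with surrounding whitespace trimmed
date_text = doc.xpath(
    'normalize-space(//span[@class="graytext" and @style="font-size: 11px"])'
)

# Assumed format: month/day/two-digit year, 12-hour clock with am/pm
parsed = datetime.strptime(date_text, "%m/%d/%y %I:%M%p")
print(date_text, parsed)
```

If the span is missing, normalize-space() yields an empty string, so strptime would raise ValueError; check for that before parsing if the markup can vary.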

Related

How to use BeautifulSoup to search for a list of tags, with one item in the list having an attribute?

Does anyone know how to use bs4 in python to search for multiple tags, one of which will need an attribute?
For example, to search for all occurrences of one tag with an attribute, I know I can do this:
tr_list = soup_object.find_all('tr', id=True)
And I know I can also do this:
tag_list = soup_object.find_all(['a', 'b', 'p', 'li'])
But I cannot figure out how to combine the two statements, which in theory would give me a list, in order of occurrence, of all of those HTML tags, with each 'tr' tag having an id.
The HTML snippet would be something like this:
<tr id="uniqueID">
<td nowrap="" valign="baseline" width="8%">
<b>
A_time_as_text
</b>
</td>
<td class="storyTitle">
<a href="a_link.com" target="_new">
some_text
</a>
<b>
a_headline_as_text
</b>
a_number_as_text
</td>
</tr>
<tr>
<td>
<br/>
</td>
<td class="st-Art">
<ul>
<li>
more_text_text_text
<strong>
more_text_text_text
<font color="228822">
more_text_text_text
</font>
</strong>
more_text_text_text
</li>
<li>
more_text_text_text
<ul>
<li>
more_text_text_text
</li>
</ul>
</li>
</ul>
</td>
</tr>
<tr>
</tr>
Thanks for all help in advance!

I would suggest you add tr to the required list of tags and then check for the presence of the id attribute within the loop:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(['a', 'b', 'p', 'li', 'tr']):
    # Keep every non-tr tag, and keep a tr only if it has an id
    if tag.name != 'tr' or tag.get('id'):
        print(tag.name)
For your html, this would display:
tr
b
a
b
li
li
li
Note: if you are actually trying to get the a, b, p and li tags that are inside a tr with an id present, then the following approach would be more suitable:
for tr in soup.find_all('tr', id=True):
    for tag in tr.find_all(['a', 'b', 'p', 'li']):
        print(tag.name, tag.get_text(strip=True))
This would give you:
b A_time_as_text
a some_text
b a_headline_as_text
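Another way to combine the two conditions, sketched here on a shortened, made-up version of the question's HTML, is to pass a filter function to find_all. The function name wanted is only illustrative.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the HTML from the question
html = """
<table>
<tr id="uniqueID"><td><b>A_time_as_text</b><a href="a_link.com">some_text</a></td></tr>
<tr><td><ul><li>more_text</li></ul></td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

def wanted(tag):
    """Match a, b, p and li unconditionally, and tr only when it has an id."""
    if tag.name == "tr":
        return tag.has_attr("id")
    return tag.name in ("a", "b", "p", "li")

# find_all calls the function on every tag and keeps matches in document order
names = [tag.name for tag in soup.find_all(wanted)]
print(names)
```

This keeps the whole rule in one place, so the loop body does not need a second check.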

Not getting desired text from BeautifulSoup

<h3 class="jd_header3 text" style="font-size: 12px;">
Shift Pattern:
</h3>
<ul class="jd_NoBulletinRight">
<li style="font-size:11px;">
<span class="text">
No Shift
</span>
</li>
</ul>
<h3 class="jd_header3 text" style="align:left;font-size:12px;">
Salary:
</h3>
<ul class="jd_NoBulletinRight">
<li>
<table border="0" cellpadding="0" cellspacing="0">
<tbody>
<tr>
<td align="left" style="word-wrap: break-word;font-size: 11px;" valign="top">
<span class="text">
S$3,500.00
<span class="text">
-
</span>
S$5,400.00
</span>
</td>
</tr>
</tbody>
</table>
</li>
</ul>
This is a part of my BeautifulSoup tree. I wish to get the salary range S$3500 - S$5400. Following the suggestion here I use the following code:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td").get_text()
print(salary)
I get the error:
AttributeError: 'int' object has no attribute 'get_text'
But when I simply print out the integer:
salary = bsObj.find(text="Salary:").parent.nextSibling.find("td")
print(salary)
I get:
-1
Which is not what I want. I have used Selenium to obtain the page, so any javascript is already loaded.
Any ideas?

Try the following code. I don't think your code can produce the output you expect:
>>> bsObj.find('td', {'align': "left"}).text
'\n\n S$3,500.00\n \n -\n \n S$5,400.00\n \n'
>>> ' '.join(bsObj.find('td', {'align': "left"}).text.split())
'S$3,500.00 - S$5,400.00'

Not sure about this "get_text" attribute, but with BeautifulSoup, I rely heavily on .text as shown below. Is this what you're looking for?
s = '''<html here>'''
soup = BeautifulSoup(s, 'html.parser')
bsObj = soup.findAll('td')
for i in bsObj:
    print(i.text)
>>>
S$3,500.00
-
S$5,400.00
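A likely explanation for the -1 in the question: nextSibling of the h3 is probably the whitespace text node between </h3> and <ul>, not the <ul> tag. Text nodes in BeautifulSoup are string subclasses, so .find("td") on one is the ordinary string search, which returns -1 when nothing is found. A minimal sketch that skips the text node with find_next_sibling, using a condensed, made-up version of the question's HTML:

```python
from bs4 import BeautifulSoup

# Condensed, hypothetical version of the HTML from the question
html = """
<h3 class="jd_header3 text">Salary:</h3>
<ul class="jd_NoBulletinRight"><li><table><tr>
<td align="left"><span class="text">S$3,500.00 <span class="text">-</span> S$5,400.00</span></td>
</tr></table></li></ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Locate the heading by its text, tolerating surrounding whitespace
heading = soup.find("h3", string=lambda s: s and "Salary" in s)

# find_next_sibling("ul") skips text nodes and returns the next <ul> tag
cell = heading.find_next_sibling("ul").find("td")

# Collapse all runs of whitespace into single spaces
salary = " ".join(cell.get_text().split())
print(salary)
```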

Python, Beautiful Soup: how to get the desired element

I am trying to reach a certain element while parsing the source code of a site.
This is a snippet of the part I'm trying to parse (shown here up to Friday); the structure is the same for all the days of the week.
<div id="intForecast">
<h2>Forecast for Rome</h2>
<table cellspacing="0" cellpadding="0" id="nonCA">
<tr>
<td onclick="showDetails('1');return false" id="day1" class="on">
<span>Thursday</span>
<div class="intIcon"><img src="http://icons.wunderground.com/graphics/conds/2005/sunny.gif" alt="sunny" /></div>
<div>Clear</div>
<div><span class="hi">H <span>22</span>°</span> / <span class="lo">L <span>11</span>°</span></div>
</td>
<td onclick="showDetails('2');return false" id="day2" class="off">
<span>Friday</span>
<div class="intIcon"><img src="http://icons.wunderground.com/graphics/conds/2005/partlycloudy.gif" alt="partlycloudy" /></div>
<div>Partly Cloudy</div>
<div><span class="hi">H <span>21</span>°</span> / <span class="lo">L <span>15</span>°</span></div>
</td>
</tr>
</table>
</div>
....and so on for all the days
I actually got my result, but in what I think is an ugly way:
forecastFriday = soup.find('div', text='Friday').findNext('div').findNext('div').string
As you can see, I go down through the elements by repeating .findNext('div') and finally arrive at .string.
I want to get the "Partly Cloudy" information for Friday.
Is there a more Pythonic way to do this?
Thanks!

Simply find all of the <td>s and iterate over them:
soup = BeautifulSoup(your_html, 'html.parser')
div = soup('div', {'id': 'intForecast'})[0]
tds = div.find('table').findAll('td')
for td in tds:
    # The first span holds the day name, the second div holds the forecast
    day = td('span')[0].text
    forecast = td('div')[1].text
    print(day, forecast)
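To look up a single day such as Friday without chaining findNext calls, the same loop can fill a dictionary keyed by day name. This is only a sketch on a trimmed, made-up version of the question's HTML; the name forecasts is illustrative.

```python
from bs4 import BeautifulSoup

# Trimmed, hypothetical version of the HTML from the question
html = """
<div id="intForecast">
<table id="nonCA"><tr>
<td id="day1"><span>Thursday</span><div class="intIcon"></div><div>Clear</div></td>
<td id="day2"><span>Friday</span><div class="intIcon"></div><div>Partly Cloudy</div></td>
</tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

forecasts = {}
for td in soup.find(id="intForecast").find_all("td"):
    # First span is the day name; the second div is the condition text
    day = td.find("span").get_text(strip=True)
    condition = td.find_all("div")[1].get_text(strip=True)
    forecasts[day] = condition

print(forecasts["Friday"])
```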

Find a List of Tags Based on Text Value of Children in Beautiful Soup

I have a question about selecting a list of tags (or single tags) using a condition on one of the attributes of its children. Specifically, given the HTML code:
<tbody>
<tr class="" data-row="0">
<tr class="" data-row="1">
<tr class="" data-row="2">
<td align="right" csk="13">13</td>
<td align="left" csk="Jones,Andre">Andre Jones
</td>
<tr class="" data-row="3">
<td align="right" csk="7">7</td>
<td align="left" csk="Jones,DeAndre">DeAndre Jones
</td>
<tr class="" data-row="4">
<tr class="" data-row="5">
I have a unicode variable, Player, coming from an external loop, and I am trying to look through each row in the table to extract the <tr> tags with Player==Table.tr.a.text and to identify duplicate player names in Table. So, for instance, if there is more than one player with Player=Andre Jones, the MyRow object should return all <tr> tags that contain that player's name, while if there is only one row with Player=Andre Jones, then MyRow should contain just the single <tr> element whose anchor text equals Andre Jones. I've been trying things like
Table = soup.find('tbody')
MyRow = Table.find_all(lambda X: X.name=='tr' and Player == X.text)
But this returns [] for MyRow. If I use
MyRow = Table.find_all(lambda X: X.name=='tr' and Player in X.text)
This picks any <tr> that has Player as a substring of X.text. In the example code above, it extracts both <tr> tags, the one with Table.tr.td.a.text=='Andre Jones' and the one with Table.tr.td.a.text=='DeAndre Jones'. Any help would be appreciated.

You could do this easily with XPath and lxml:
import lxml.html
root = lxml.html.fromstring('''...''')
td = root.xpath('//tr[.//a[text() = "FooName"]]')
The BeautifulSoup "equivalent" would be something like:
rows = soup.find('tbody').find_all('tr')
td = next(row for row in rows if row.find('a', text='FooName'))
Or if you think about it backwards:
td = soup.find('a', text='FooName').find_parent('tr')

Whatever you desire. :)
Solution 1
Logic: find the first tag whose tag name is tr and which contains 'FooName' in its text, including the text of its children.
# Exact match: compare each cell's stripped text, because tag.text for a <tr> also includes the text of its other cells
print(Table.find(lambda tag: tag.name == 'tr' and any(td.get_text(strip=True) == 'FooName' for td in tag.find_all('td'))))
# Fuzzy match
# print(Table.find(lambda tag: tag.name == 'tr' and 'FooName' in tag.text))
Output:
<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
FooName
</td>
</tr>
Solution 2
Logic: find the element whose text contains FooName (the anchor tag's text in this case), then go up the tree to the nearest ancestor whose tag name is tr.
# Exact match
print(Table.find(text='FooName').find_parent('tr'))
# Fuzzy match
# import re
# print(Table.find(text=re.compile('FooName')).find_parent('tr'))
Output
<tr class="" data-row="2">
<td align="right" csk="3">3</td>
<td align="left" csk="Wentz,Parker">
FooName
</td>
</tr>
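To collect every row whose player cell matches exactly (so that 'Andre Jones' does not also pick up 'DeAndre Jones'), one option is to compare the stripped text of each cell. This is a sketch on a made-up table modeled on the question; has_player is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical table modeled on the question's rows
html = """
<table><tbody>
<tr data-row="2"><td align="right">13</td><td align="left">Andre Jones
</td></tr>
<tr data-row="3"><td align="right">7</td><td align="left">DeAndre Jones
</td></tr>
</tbody></table>
"""
soup = BeautifulSoup(html, "html.parser")
player = "Andre Jones"

def has_player(row, name):
    """True if any cell in the row has exactly this text once stripped."""
    return any(td.get_text(strip=True) == name for td in row.find_all("td"))

# A list comprehension returns all matching rows, so duplicates show up too
rows = [tr for tr in soup.find("tbody").find_all("tr") if has_player(tr, player)]
print([tr["data-row"] for tr in rows])
```

If len(rows) is greater than 1, the name is duplicated in the table.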

Error with lxml syntax to parse table elements while validating column contents

I have the following code to parse an HTML table. How do I check for specific text in a <td> element? This doesn't work: val=doc.xpath('//tr/td[child::*[text()="Street :"]/span/text()'). I am trying to extract the <span> text only when the <td> text matches 'Street :'. Any feedback is much appreciated!
import lxml.html as lh
html='''<tr>
<td>
Street : <span> High St. </span>
</td>
</tr>
<tr>
<td>
City : <span> Hightstown </span>
</td>
</tr>'''
doc=lh.fromstring(html)
#val=doc.xpath('//tr/td[child::*[text()="Street :"]/span/text()')
#street=doc.xpath('//tr/td/text()')
val=doc.xpath('//tr/td/span/text()')
#print(street)
print(val)

>>> doc.xpath('//tr/td[contains(text(),"Street :")]/span/text()')
[' High St. ']
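The same query can be wrapped in a small helper that takes the label as an XPath variable and strips the result. This is only a sketch: the field function is a hypothetical name, and the rows are wrapped in a <table> here as an assumption to keep the fragment well-formed. Note that in XPath 1.0, contains(text(), ...) tests only the first text node of the <td>, which is enough for this markup.

```python
import lxml.html as lh

# The question's rows, wrapped in a <table> for this sketch
html = """<table>
<tr><td>
Street : <span> High St. </span>
</td></tr>
<tr><td>
City : <span> Hightstown </span>
</td></tr>
</table>"""
doc = lh.fromstring(html)

def field(doc, label):
    """Return the stripped <span> text of the <td> whose text contains label."""
    # $label is an XPath variable; lxml binds it from the keyword argument
    values = doc.xpath('//td[contains(text(), $label)]/span/text()', label=label)
    return values[0].strip() if values else None

print(field(doc, "Street :"))
print(field(doc, "City :"))
```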
