Xpath get only first string of a node - python

I am scraping a website and I only want the first string of each node. I have already tried the child and contains functions.
The HTML that I have:
<div id="m0" style="visibility:visible; display:block;">
<table class="fl">
<tr bgcolor="white"><td class="v px3"></td>
<td class="ch">
<a title="Id: NetViet" class="A3">NetViet</a></td>
<td class="cr" ">Clear</td>
<tr bgcolor="white"><td class="v px3"></td>
<td class="ch">
<a title="Id: Vozrojdenie.tv" class="A3">VOTV</a></td>
<tr bgcolor="white"><td class="v px3"></td>
<td class="ch">
<A class="A3" title="Id: Suryoyo Sat" HREF="http://www.suryoyosat.com/" TARGET="_blank">Suryoyo Sat</A></td>
<td class="cr" ">Clear</td>
<div id="m1" style="visibility:visible; display:block;">
<table class="fl">
<tr bgcolor="#DDD0B8"><td class="v px3"></td>
<td class="ch">
<a title="Sporadic or full 16/9 transmission"></td>
<td class="cr" ">Conax<br />Irdeto 2<br />Mediaguard 3<br />Nagravision 3<br />Viaccess 3.0</td>
<tr bgcolor="#DDD0B8"><td class="v px3"></td>
<td class="ch">
<a title="Id: Sportklub HD" class="A3">Sport Klub HD Poland</a></td>
<td class="cr" ">Conax<br />Mediaguard 3<br />Nagravision 3<br />Viaccess 3.0</td>
<tr bgcolor="#DDD0B8"><td class="v px3"></td>
<td class="ch">
<a title="Id: Animal Planet HD" class="A3">Animal Planet HD</a></td>
<td class="cr" ">Conax<br />Irdeto 2<br />Mediaguard 3<br />Nagravision 3<br />Viaccess 3.0</td>
I am using this XPath query:
encrypted=tree.xpath('//div[@id="%s"]/table[@class="fl"]/tr/td[@class="cr"]/text()'%div)
It returns:
[['Clear','Clear','Clear'],['Conax','Irdeto 2','Mediaguard 3','Nagravision 3','Conax','Mediaguard 3','Nagravision 3', 'Viaccess 3.0',...]]
and I want it to return:
[['Clear','Clear','Clear'],['Conax','Conax','Conax',...]]
I tried this query, but it returns nothing:
encrypted=tree.xpath('substring-before(//div[@id="%s"]/table[@class="fl"]/tr/td[@class="cr"]/text(),"C")'%div)
Any ideas? (I am using lxml and requests with Python, XPath 1.0.)
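One possible approach, sketched here rather than taken from the thread, is to add a positional predicate so that only the first text node of each td is selected: text()[1]. The substring-before attempt returns an empty string because XPath 1.0 string functions applied to a node-set use only its first node, and nothing precedes the "C" in "Clear". The sample markup below is a cleaned-up, shortened version of the question's HTML, and the helper name first_strings is hypothetical.

```python
from lxml import html

# Cleaned-up, shortened version of the markup in the question (assumption:
# the real page parses into the same div/table/tr/td nesting).
SAMPLE = """
<html><body>
<div id="m0"><table class="fl">
<tr><td class="ch">NetViet</td><td class="cr">Clear</td></tr>
<tr><td class="ch">VOTV</td><td class="cr">Clear</td></tr>
</table></div>
<div id="m1"><table class="fl">
<tr><td class="cr">Conax<br />Irdeto 2<br />Mediaguard 3</td></tr>
<tr><td class="cr">Conax<br />Viaccess 3.0</td></tr>
</table></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)


def first_strings(tree, div_id):
    """Return the first text node of every td.cr inside the given div."""
    query = ('//div[@id="%s"]/table[@class="fl"]/tr/td[@class="cr"]/text()[1]'
             % div_id)
    return [s.strip() for s in tree.xpath(query)]


encrypted = [first_strings(tree, div_id) for div_id in ("m0", "m1")]
print(encrypted)
```

The predicate [1] is evaluated per parent td, so each cell contributes at most one string regardless of how many <br /> separated values it holds.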

Related

Xpath how to get all text in the tag

I have this html code:
<div id="m0" style="visibility:visible; display:block;">
<table class="fl">
<tr bgcolor="white"><td class="v px3"></td>
<td class="ch">
<a title="Id: NetViet" class="A3">NetViet</a></td>
</tr>
<div id="m1" style="visibility:visible; display:block;">
<table class="fl">
<td class="ch">
<A class="A3" title="Id: Kino Polska Muzyka" HREF="http://www.kinopolskamuzyka.pl/" TARGET="_blank">Kino Polska Muzyka</A>
</tr>
<td class="ch">
<i>HBO3 HD</i></td>
</tr>
<td class="ch"> Faktura</td>
</tr>
My XPath is: tree.xpath('//div[@id="%s"]/table[@class="fl"]/tr/td[@class="ch"]/a/text()'%div)
but it does not give me all the channels. I want to get all the text in <td class="ch">. The result that I want is:
[['NetViet'],['Kino Polska Muzyka','HBO3 HD','Faktura']]
Any ideas? Thanks in advance.
Apart from the malformed HTML structure, remove the 'tr' and 'a' steps from your XPath, because not every 'td' is wrapped in a 'tr' or contains an 'a'.
Why not use CSS selectors to target the td elements with that class? For this type of selection it is likely faster than XPath.
from bs4 import BeautifulSoup as bs
html = '''
<div id="m0" style="visibility:visible; display:block;">
<table class="fl">
<tr bgcolor="white"><td class="v px3"></td>
<td class="ch">
<a title="Id: NetViet" class="A3">NetViet</a></td>
</tr>
<div id="m1" style="visibility:visible; display:block;">
<table class="fl">
<td class="ch">
<A class="A3" title="Id: Kino Polska Muzyka" HREF="http://www.kinopolskamuzyka.pl/" TARGET="_blank">Kino Polska Muzyka</A>
</tr>
<td class="ch">
<i>HBO3 HD</i></td>
</tr>
<td class="ch"> Faktura</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('td.ch')]
print(items)
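The selector above returns one flat list. If the per-div grouping from the question is wanted, a rough lxml sketch could look like the following. It assumes the markup has been repaired so that each div closes before the next one opens (the SAMPLE string is a hypothetical cleaned-up version), and it uses text_content() so that text inside <a>, <i>, or directly in the td is all picked up.

```python
from lxml import html

# Hypothetical well-formed version of the question's markup.
SAMPLE = """
<html><body>
<div id="m0"><table class="fl">
<tr><td class="v px3"></td><td class="ch"><a class="A3">NetViet</a></td></tr>
</table></div>
<div id="m1"><table class="fl">
<tr><td class="ch"><a class="A3">Kino Polska Muzyka</a></td></tr>
<tr><td class="ch"><i>HBO3 HD</i></td></tr>
<tr><td class="ch"> Faktura</td></tr>
</table></div>
</body></html>
"""

tree = html.fromstring(SAMPLE)

channels = []
for div_id in ("m0", "m1"):
    # '//' below the div skips the tr step, so cells match at any depth
    cells = tree.xpath('//div[@id="%s"]//td[@class="ch"]' % div_id)
    channels.append([cell.text_content().strip() for cell in cells])

print(channels)
```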

How to select several dynamic span elements in a table

I am trying to extract the text located in tables. Dynamic means that there is sometimes one table and sometimes more than one. So my question is: how do I pick the text from these tables?
This is what I tried:
from selenium import webdriver
# webdriver
browser = webdriver.Chrome("C:/Chrome/chromedriver.exe")
browser.get("http://homepage")
pick = browser.find_elements_by_xpath("//*[@id=\"xpath\"]/table[11]/tbody/tr/td/table[2]/tbody/tr/td[1]/span[2]")
pick.get_attribute("innerHTML")
and this is the XPath of each element:
//*[@id="xpath"]/table[11]/tbody/tr/td/table[2]/tbody/tr/td[1]/span[2]
//*[@id="xpath"]/table[11]/tbody/tr/td/table[3]/tbody/tr/td[1]/span[2]
//*[@id="xpath"]/table[11]/tbody/tr/td/table[4]/tbody/tr/td[1]/span[2]
and this is the HTML code:
<table style="width:700px; " border="0" cellpadding="0" cellspacing="0" width="700px">
<tbody>
<tr>
<td style="border:1px; border-style:solid; ">
<table style="width:700px; " border="0" cellpadding="0" cellspacing="0" width="700px">
<tbody>
<tr>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:250px; " valign="top" width="250px"><span> </span><span style="font-size:10pt; font-weight:bold; "> </span></td>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:450px; " valign="middle" width="450px"><span style="font-size:10pt; "> </span><br></td>
</tr>
</tbody>
</table>
<table style="width:700px; " border="0" cellpadding="0" cellspacing="0" width="700px">
<tbody>
<tr>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:250px; " valign="top" width="250px"><span> </span><span style="font-size:10pt; ">3</span></td>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:450px; " valign="middle" width="450px"><span style="font-size:10pt; ">Bleaching preparations and other substances for laundry use; cleaning, polishing, scouring and abrasive preparations; soaps; perfumery, essential oils, cosmetics, hair lotions; dentifrices (all the goods listed alphabetically in the Nice Classification, included in this class).</span><br></td>
</tr>
</tbody>
</table>
<table style="width:700px; " border="0" cellpadding="0" cellspacing="0" width="700px">
<tbody>
<tr>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:250px; " valign="top" width="250px"><span> </span><span style="font-size:10pt; ">4</span></td>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:450px; " valign="middle" width="450px"><span style="font-size:10pt; ">Industrial oils and greases; lubricants; dust absorbing, wetting and binding compositions; fuels (including motor spirit) and illuminants; candles, wicks (all goods of this class included in the alphabetical list of Nice Classification).</span><br></td>
</tr>
</tbody>
</table>
<table style="width:700px; " border="0" cellpadding="0" cellspacing="0" width="700px">
<tbody>
<tr>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:250px; " valign="top" width="250px"><span> </span><span style="font-size:10pt; ">5</span></td>
<td style="border:1px; border-bottom-color:silver; border-bottom-style:solid; width:450px; " valign="middle" width="450px"><span style="font-size:10pt; ">Pharmaceutical and veterinary preparations; sanitary preparations for medical purposes; dietetic foods and substances adapted for medical and veterinary use; food for babies; dietary supplements for humans and animals;plasters, materials for dressings; material for stopping teeth and dental wax; disinfectants; preparations for destroying vermin; fungicides, herbicides;(all goods of this class included in the alphabetical list of Nice Classification).</span><br></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
thank you for any help!
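A possible way to handle a varying number of inner tables is to drop the index on the inner table step, so the expression matches every table instead of table[2], table[3], and so on. The sketch below checks that idea with lxml on a reduced, hypothetical copy of the markup (the wrapping element with id="xpath" and the single outer table are assumptions). In Selenium, find_elements returns a list, so get_attribute has to be called on each element rather than on the list itself.

```python
from lxml import html

# Reduced stand-in for the page: one outer table holding several inner tables.
SAMPLE = """
<html><body><div id="xpath">
<table><tbody><tr><td>
  <table><tbody><tr>
    <td><span> </span><span> </span></td><td><span> </span></td>
  </tr></tbody></table>
  <table><tbody><tr>
    <td><span> </span><span>3</span></td><td><span>Bleaching preparations</span></td>
  </tr></tbody></table>
  <table><tbody><tr>
    <td><span> </span><span>4</span></td><td><span>Industrial oils and greases</span></td>
  </tr></tbody></table>
</td></tr></tbody></table>
</div></body></html>
"""

# No index on the inner 'table' step: it matches however many tables exist.
QUERY = '//*[@id="xpath"]/table[1]/tbody/tr/td/table/tbody/tr/td[1]/span[2]'

tree = html.fromstring(SAMPLE)
spans = tree.xpath(QUERY)

# Keep only the spans that actually contain text (the header table is blank).
numbers = [text for text in (s.text_content().strip() for s in spans) if text]
print(numbers)
```

On the real page the outer step would presumably stay table[11], as in the question.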

Beautifulsoup results to pandas dataframe

The code below returns a table with the following results:
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, 'lxml')
mylist = soup.find(attrs={'class': 'table_grey_border'})
print(mylist)
Results (the table stretches on for 1700 rows):
<table cellpadding="0" cellspacing="2" class="table_grey_border" width="100%">
<tr valign="top">
<td class="verd_black12" width="18%"><b>STOCK CODE</b></td>
<td class="verd_black12" width="42%"><b>NAME OF LISTED SECURITIES</b></td>
<td class="verd_black12" width="19%"><b>BOARD LOT</b></td>
<td class="verd_black12" colspan="4" width="12%"><b>REMARK</b></td>
</tr>
<tr class="tr_normal">
<td class="verd_black12" width="18%">00001</td>
<td class="verd_black12" width="42%">CKH HOLDINGS</td>
<td class="verd_black12" width="19%">500</td>
<td align="center" class="verd_black12" width="3%">#</td>
<td align="center" class="verd_black12" width="3%">H</td>
<td align="center" class="verd_black12" width="3%">O</td>
<td align="center" class="verd_black12" width="3%">F</td>
</tr>
<tr class="tr_normal">
<td class="verd_black12" width="18%">00002</td>
<td class="verd_black12" width="42%">CLP HOLDINGS</td>
<td class="verd_black12" width="19%">500</td>
<td align="center" class="verd_black12" width="3%">#</td>
<td align="center" class="verd_black12" width="3%">H</td>
<td align="center" class="verd_black12" width="3%">O</td>
<td align="center" class="verd_black12" width="3%">F</td>
</tr>
...
My question is: how do I put each of these rows into a pandas DataFrame? I tried the code below, but it returns an error:
a = pandas.read_html(mylist)
print(a)
Error:
TypeError: 'NoneType' object is not callable
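pandas.read_html expects a URL, a file-like object, or an HTML string rather than a BeautifulSoup Tag. See its documentation; you can pass the URL and the attrs filter directly: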
pandas.read_html(url, attrs={'class': 'table_grey_border'})
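If the table has already been located with BeautifulSoup, another option is to build the DataFrame from the cell texts by hand. This is only a sketch on a two-row excerpt of the table. The column names REMARK_1 to REMARK_4 are made up here because the REMARK header spans four columns, and reading the cells as plain strings keeps stock codes such as 00001 intact, whereas a parser that infers types may turn them into numbers.

```python
import bs4
import pandas as pd

# Two-row excerpt of the table from the question.
HTML = """
<table class="table_grey_border">
<tr valign="top">
<td><b>STOCK CODE</b></td><td><b>NAME OF LISTED SECURITIES</b></td>
<td><b>BOARD LOT</b></td><td colspan="4"><b>REMARK</b></td>
</tr>
<tr class="tr_normal">
<td>00001</td><td>CKH HOLDINGS</td><td>500</td>
<td>#</td><td>H</td><td>O</td><td>F</td>
</tr>
<tr class="tr_normal">
<td>00002</td><td>CLP HOLDINGS</td><td>500</td>
<td>#</td><td>H</td><td>O</td><td>F</td>
</tr>
</table>
"""

soup = bs4.BeautifulSoup(HTML, "html.parser")
table = soup.find(attrs={"class": "table_grey_border"})

# One list of cell strings per table row
rows = [[td.get_text(strip=True) for td in tr.find_all("td")]
        for tr in table.find_all("tr")]
header, body = rows[0], rows[1:]

# The REMARK header covers four cells, so expand it (names are hypothetical)
columns = header[:3] + ["REMARK_%d" % i for i in range(1, 5)]

df = pd.DataFrame(body, columns=columns)
print(df)
```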

Navigating html table lxml

I have some html which looks like:
<html>
<body>
<table cellpadding="0" cellspacing="0" border="0" width="100%">
<tr>
<td align="left" colspan="4">
<!-- BEGIN NEXT PREV LINKS -->
<table cellspacing="2" cellpadding="0" border="0">
<tr>
<td align="left"><font style="color:gray">Previous</font> </td>
<td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td>
<td align="right"> Next</td>
</tr>
<tr>
<td align="left" colspan="2"><font style="color:gray">First Page</font></td>
<td align="right" colspan="2"> Last Page</td>
</tr>
</table>
<!-- END NEXT PREV LINKS -->
</td>
<td colspan="9" align="right">
Add Checked to Favorites
<br>
Add Checked to Excluded
</td>
</tr>
<tr>
<td rowspan="2"></td><td rowspan="2"></td> <td rowspan="2" valign="bottom" style="padding-right:5px;"><b><a href=""/></td>
<td rowspan="2" valign="bottom" style="padding-right:5px;"><b>Position</b></td>
<td colspan="2" align="center" valign="bottom" height="16"><b>Ratings</b><br><img src="/images/shim_333333.gif" width="130" height="1" alt="" hspace="5"></td> <td rowspan="2"> </td> <td rowspan="2" valign="bottom" style="padding-right:5px;"><b>Birth Date</b></td>
<td rowspan="2" valign="bottom" style="padding-right:5px;"><b>States</b></td>
<td rowspan="2"> </td><td rowspan="2"></td> <td rowspan="2" colspan="3" align="right" valign="bottom">Clear All </td> </tr>
<tr>
<td align="center"><b>In-State<br>Rating</b></td>
<td align="center"><b>Out of State<br>Rating</b></td>
</tr>
<tr>
<td colspan="13" valign="bottom"><img src="/images/shim.gif" width="100%" height="1" alt=""></td>
</tr> <tr>
<td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td>
</tr> <tr >
<td></td><td><b style="">X</b></td>
<td nowrap><p>Cruise, Tom </p></td>
<td nowrap>Actor </td>
<td align="center"><img src="/images/stars_2_sm_green.gif" alt="instate
Recommendation
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/stars_4_sm.gif" alt="Summary
Estimate
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>1948 </td>
<td nowrap>CA</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="198720" style="height:15px"></td>
</tr> <tr>
<td align="right" colspan=13><img src="/images/shim_dddddd.gif" width="100%" height="1" border="0" alt=""></td>
</tr> <tr >
<td><b style="">X</b></td><td></td>
<td nowrap><p>Schwarzenegger, Arnold </p></td>
<td nowrap>Governor </td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="instate
Recommendation
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="Summary
Estimate
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>No Current Date </td>
<td nowrap>-</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="61184" style="height:15px"></td>
</tr> <tr >
<td><b style="">X</b></td><td></td>
<td nowrap><p>Obama, Barack </p></td>
<td nowrap>President </td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="instate
Recommendation
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td align="center"><img src="/images/ohuohausd.jpg" alt="Summary
Estimate
Rating" height="11" width="55" align="middle" hspace="0" vspace="0"></td>
<td> </td>
<td nowrap>No Current Date </td>
<td nowrap>-</td>
<td></td><td></td>
<td> </td>
<td align="right"><input type="checkbox" name="employee_cb" value="225747" style="height:15px"></td>
</tr>
<tr height="15">
<td align="right" colspan="14">
<!-- BEGIN NEXT PREV LINKS -->
<table cellspacing="2" cellpadding="0" border="0">
<tr>
<td align="left"><font style="color:gray">Previous</font> </td>
<td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td>
<td align="right"> Next</td>
</tr>
<tr>
<td align="left" colspan="2"><font style="color:gray">First Page</font></td>
<td align="right" colspan="2"> Last Page</td>
</tr>
</table>
<!-- END NEXT PREV LINKS -->
</td>
</tr> <tr>
<td colspan="12" valign="bottom" nowrap><br>
<b style="">X</bfdgdfgb style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
<b style="c">X</b>dfgfdg<b style="">X</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br> <b style="">F</b>: A dsd "<b style="">F</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
dfgdfg"<b style="">F</b>"Lorem ipsum dolor sit amet, consectetur adipiscing elit<br>
<b style="">E</b>gfhbgdfg"<b style="">E</b>Lorem ipsum dolor sit amet, consectetur adipiscing elit
</td>
</tr><tr><td colspan="20">
<table cellpadding="0" cellspacing="0" border="0" width="100%" align="center">
<tr>
<td colspan="2"><img src="/images/shim.gif" width="100%" height="5" alt=""></td>
</tr>
<tr>
<td valign="top">States: </td>
<td>CA=California; ND=North Dakota</td>
</tr>
</table>
</td></tr>
</table></body>
</html>
Looking for similar questions, I was able to construct (noting that the table is always 17th in the full html code):
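import lxml.html as lh
# Read the saved page and parse it
with open("employeetest.htm", "r") as f:
    root = lh.fromstring(f.read())
# The employee table sits at index 17 among all tables in the full page
rows = root.xpath("//table")[17].findall("tr")
data = list()
for row in rows:
    data.append([c.text_content() for c in row.getchildren()])
print(data)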
This produces a very messy list. My end goal is just to get
[['Cruise, Tom', 'Actor', '1948', 'CA'], ['Schwarzenegger, Arnold', 'Governor', 'No Current Date', '-'], ...]
However, all this information contained in the table produces a lot of strange elements. I know I can clean the resultant \xa0 by replacing with a single space. I'm not really sure how to navigate this further. Thanks!
I am not sure what the ... should be in your expected output, but to get the data in the first three sublists you can narrow the search to trs containing a td that has text and whose only attribute is nowrap:
from lxml import html
root = html.fromstring(h)  # h is the HTML source as a string
rows = root.xpath("//tr[td[@nowrap and text() and count(@*)=1]]")
for row in rows:
    print(row.xpath(".//td[@nowrap]//text()"))
Output:
['Cruise, Tom', u'\xa0', u'Actor\xa0', u'1948\xa0', 'CA']
['Schwarzenegger, Arnold', u'\xa0', u'Governor\xa0', u'No Current Date\xa0', '-']
['Obama, Barack', u'\xa0', u'President\xa0', u'No Current Date\xa0', '-']
You will have to traverse the HTML document with a more refined XPath. In addition, the related data sits in different elements, so two XPath expressions are needed, and their results have to be stitched back together afterwards:
import lxml.etree as et
with open("employeetest.htm", 'r') as f:
    # Drop non-breaking space entities before parsing
    text = f.read().replace('&nbsp;', '')
root = et.HTML(text)
# XPATH LISTS (W/ RELATED ITEMS)
items1 = root.xpath("//td/p/a/text()")
items2 = root.xpath("//td[p/a/text()]/following-sibling::td/text()")
# NUMBER OF ITEMS RELATED BETWEEN EACH
r = len(items2) // len(items1)
# ITERATE THROUGH WITH LIST SLICE AND APPEND
data = []
for i in range(len(items1)):  # ONE PASS PER NAME
    inner = []
    inner.append(items1[i])
    for j in items2[i*r:i*r+2]:  # FIRST TWO RELATED ITEMS OF EACH GROUP
        inner.append(j)
    data.append(inner)
print(data)
# [['Cruise, Tom', 'Actor', '1948'],
# ['Schwarzenegger, Arnold', 'Governor', 'No Current Date'],
# ['Obama, Barack', 'President', 'No Current Date']]
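For the exact shape asked for in the question (name, position, birth date, state, without the \xa0 debris), a small variation on the first answer may be enough: select the same rows, then clean each nowrap cell with text_content() and strip(). The sketch runs on a reduced, hypothetical excerpt of the table, including one pagination row that must not match.

```python
from lxml import html

# Reduced excerpt: one pagination row (must be skipped) and two data rows.
SAMPLE = """
<html><body><table>
<tr><td align="center" colspan="2" nowrap><b>1-100 of 273 employees</b></td></tr>
<tr>
<td></td><td><b>X</b></td>
<td nowrap><p>Cruise, Tom&nbsp;</p></td>
<td nowrap>Actor&nbsp;</td>
<td align="center"><img src="/images/stars_2_sm_green.gif"></td>
<td>&nbsp;</td>
<td nowrap>1948&nbsp;</td>
<td nowrap>CA</td>
</tr>
<tr>
<td><b>X</b></td><td></td>
<td nowrap><p>Schwarzenegger, Arnold&nbsp;</p></td>
<td nowrap>Governor&nbsp;</td>
<td>&nbsp;</td>
<td nowrap>No Current Date&nbsp;</td>
<td nowrap>-</td>
</tr>
</table></body></html>
"""

root = html.fromstring(SAMPLE)

data = []
# Rows that contain at least one td whose only attribute is nowrap
for row in root.xpath("//tr[td[@nowrap and count(@*)=1]]"):
    cells = row.xpath("./td[@nowrap and count(@*)=1]")
    # text_content() flattens nested tags such as <p>; strip() removes \xa0
    data.append([c.text_content().replace("\xa0", " ").strip() for c in cells])

print(data)
```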

beautiful soup get children that are Tags (not Navigable Strings) from a Tag

The Beautiful Soup documentation describes the attributes .contents and .children for accessing the children of a given tag (a list and an iterator respectively). Both include NavigableStrings as well as Tags. I want only the children of type Tag.
I'm currently accomplishing this with a list comprehension:
rows=[x for x in table.tbody.children if type(x)==bs4.element.Tag]
but I'm wondering if there is a better, more Pythonic, or built-in way to get just the Tag children.
Thanks to J.F. Sebastian, the following works:
rows=table.tbody.find_all(True, recursive=False)
Documentation here: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#true
In my case, I needed actual rows in the table, so I ended up using the following, which is more precise and I think more readable:
rows=table.tbody.find_all('tr')
Again, docs: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#navigating-using-tag-names
I believe this is a better way than iterating through all the children of a Tag.
Worked with the following input:
<table cellspacing="0" cellpadding="0">
<thead>
<tr class="title-row">
<th class="title" colspan="100">
<div style="position:relative;">
President
<span class="pct-rpt">
99% reporting
</span>
</div>
</th>
</tr>
<tr class="header-row">
<th class="photo first">
</th>
<th class="candidate ">
Candidate
</th>
<th class="party ">
Party
</th>
<th class="votes ">
Votes
</th>
<th class="pct ">
Pct.
</th>
<th class="change ">
Change from ‘08
</th>
<th class="evotes last">
Electoral Votes
</th>
</tr>
</thead>
<tbody>
<tr class="">
<td class="photo first">
<div class="photo_wrap"><img alt="P-barack-obama" height="48" src="http://i1.nyt.com/projects/assets/election_2012/images/candidate_photos/election_night/p-barack-obama.jpg?1352320690" width="68" /></div>
</td>
<td class="candidate ">
<div class="winner dem"><img alt="Hp-checkmark@2x" height="9" src="http://i1.nyt.com/projects/assets/election_2012/images/swatches/hp-checkmark@2x.png?1352320690" width="10" />Barack Obama</div>
</td>
<td class="party ">
Dem.
</td>
<td class="votes ">
2,916,811
</td>
<td class="pct ">
57.3%
</td>
<td class="change ">
-4.6%
</td>
<td class="evotes last">
20
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Mitt Romney</div>
</td>
<td class="party ">
Rep.
</td>
<td class="votes ">
2,090,116
</td>
<td class="pct ">
41.1%
</td>
<td class="change ">
+4.3%
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Gary Johnson</div>
</td>
<td class="party ">
Lib.
</td>
<td class="votes ">
54,798
</td>
<td class="pct ">
1.1%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr class="last-row">
<td class="photo first">
</td>
<td class="candidate ">
<div class="not-winner">Jill Stein</div>
</td>
<td class="party ">
Green
</td>
<td class="votes ">
29,336
</td>
<td class="pct ">
0.6%
</td>
<td class="change ">
–
</td>
<td class="evotes last">
0
</td>
</tr>
<tr>
<td class="footer" colspan="100">
President Map |
President Big Board |
Exit Polls
</td>
</tr>
</tbody>
</table>
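To see the difference between the approaches above, here is a small sketch on a trimmed-down, hypothetical version of that table. It compares .children with find_all(True, recursive=False), and also passes recursive=False to find_all('tr'), which keeps the result to direct children if a cell ever contains a nested table.

```python
import bs4

# Trimmed-down stand-in for the table above.
HTML = """<table><tbody>
<tr class=""><td class="candidate">Barack Obama</td><td class="party">Dem.</td></tr>
<tr class=""><td class="candidate">Mitt Romney</td><td class="party">Rep.</td></tr>
</tbody></table>"""

soup = bs4.BeautifulSoup(HTML, "html.parser")
tbody = soup.table.tbody

# .children yields Tags and the NavigableStrings holding the newlines between rows
all_children = list(tbody.children)

# True matches every tag; recursive=False limits the search to direct children
tag_children = tbody.find_all(True, recursive=False)

# Same idea, restricted to rows that are direct children of tbody
rows = tbody.find_all("tr", recursive=False)

print(len(all_children), len(tag_children), len(rows))
```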
