Hi all, I am having some problems that I think can be attributed to XPath issues. I am using the html module from the lxml package to try to get at some data. I am providing the most simplified situation below, but keep in mind the HTML I am working with is much uglier.
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
What I really want is the deeply nested table, because it has the header text "Header1".
I am trying like so:
from lxml import html
page = '...'
tree = html.fromstring(page)
print tree.xpath('//table[//*[contains(text(), "Header1")]]')
but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on, but I am having a hard time figuring out how to do this without breaking out some nasty regex.
Any thoughts?
Use:
//td[. = 'Header1']/ancestor::table[1]
Find the header you are interested in and then pull out its table. (The predicate compares the td's string value, since the text itself sits inside the u and b elements, where text() on the td would miss it.)
//u[b = 'Header1']/ancestor::table[1]
or
//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]
Note that // always starts at the document root (!). You can't do:
//table[//*[contains(text(), "Header1")]]
and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:
//table[.//*[contains(text(), "Header1")]]
won't work since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.
Also, don't test the condition on every node with .//* when it can only ever be true for a handful of them. It's more efficient to be specific.
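As a quick sanity check of the not() variant (a minimal sketch; page holds the markup from the question):
from lxml import html

tree = html.fromstring(page)
tables = tree.xpath("//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]")
print len(tables)  # 1 -- only the innermost table is matched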
Perhaps this would work for you:
tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")
The not(descendant::table) bit ensures that you're getting the innermost table.
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
//*[.="Header1"] selects any element anywhere in the document whose string value is Header1.
ancestor::table[1] then selects the nearest table ancestor of that element.
Complete example
#!/usr/bin/env python
from lxml import html
page = """
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
"""
tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print html.tostring(table)
Related
I'm using XPath to extract different web elements on a webpage, but have hit a roadblock on one particular object that sits between two objects but doesn't have a closing tag behind it for a while.
I've been able to successfully extract other elements from the webpage, but don't know how to proceed at this point.
Here is a copy of what the HTML looks like from the Inspector:
<body>
<table>
<tbody>
<tr>
<td id="left_column">
<div id="top">
<h1></h1>
#SOME TEXT
<div>
<table>
.......
</table>
</div>
</div>
</td>
</tr>
Any suggestions would be greatly appreciated! Thank you!
Here is a thought that I hope will help, but without seeing the entire HTML I can't give more than just an idea. I have more experience with Selenium in Java, so I am not 100% sure that Python will have the same functionality, but I imagine it does.
You should be able to get the text from any WebElement. In Java it would look something like this, but I imagine it shouldn't be too hard to change it to Python:
WebElement top = driver.findElement(By.xpath("//div[@id='top']"));
String topString = top.getText();
If in your case you're getting more than just the "#SOME TEXT", you would need to remove the text from the other elements that you don't want. Something like:
WebElement topH1 = top.findElement(By.xpath("./h1"));
WebElement topInsideDiv = top.findElement(By.xpath("./div"));
String topHString = topH1.getText();
String topInsideDivString = topInsideDiv.getText();
// since you know that the h1 string comes first and the inside div
// comes after, you can take the substring of topString
String result = topString.substring(topHString.length(),
    topString.length() - topInsideDivString.length());
This is really just an idea of how you could do it. The way you determine the part of the string you're interested in might need to be more complex; it could be that you just cycle through the strings to determine where to break apart the combined string. If there is text before the h1 tag, you would need a more elaborate solution, perhaps searching for the text and discounting anything found before it, but without that information I can't really help out more than this.
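For what it's worth, a rough Python translation of the same idea might look like this (an untested sketch; driver is your existing WebDriver, and the slicing mirrors the Java substring arithmetic):
top = driver.find_element_by_xpath("//div[@id='top']")
top_string = top.text

top_h1 = top.find_element_by_xpath("./h1")
top_inside_div = top.find_element_by_xpath("./div")

# strip the leading h1 text and the trailing inner-div text,
# leaving the loose "#SOME TEXT" in the middle
result = top_string[len(top_h1.text):len(top_string) - len(top_inside_div.text)]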
I have an HTML document with nested tables. I wish to find the text between an outer table and its inner tables. I thought this was a classic question, but so far I haven't found the answer. What I have come up with is
tree.xpath("//p[not(ancestor-or-self::table)]"). But this isn't working, because all text descends from the outer table. Also, just using preceding::table isn't enough, because the text can surround the inner tables.
For a conceptual example, if a table looks like this: [...text1...[inside table No.1]...text2...[inside table No.2]...text3...], how can I get text1/2/3 only, without contamination from the inside tables No. 1 and 2? Here is my thought: is it possible to build a concept of table layers via XPath, so I can tell lxml or another library, "give me all text between layer 0 and 1"?
Below is a simplified sample HTML file. In reality, the outer table may contain many nested tables, but I just want the text between the outermost table and its first-level nested tables. Thanks folks!
<table>
<tr><td>
<p> text I want </p>
<div> they can be in different types of nodes </div>
<table>
<tr><td><p> unwanted text </p></td></tr>
<tr><td>
<table>
<tr><td><u> unwanted text</u></td></tr>
</table>
</td></tr>
</table>
<p> text I also want </p>
<div> as long as they're inside the root table and outside the first-level inside tables </div>
</td></tr>
<tr><td>
<u> they can be between the first-level inside tables </u>
<table>
</table>
</td></tr>
</table>
And it returns ["text I want", "they can be in different types of nodes", "text I also want", "as long as they're inside the root table and outside the first-level inside tables", "they can be between the first-level inside tables"].
One of the XPaths that could do this, if the outermost table is the root element:
/table/descendant::table[1]/preceding::p
Here, you traverse to the first descendant table of the outermost table and then select all of its preceding p elements.
If not, you will have to take a different approach to accessing the p elements in between the tables, perhaps using the generate-id() function.
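If you are in lxml anyway, another option is to filter the text nodes in Python, keeping only those whose nearest table ancestor is the outermost table; that gives you the "layer" concept directly. A minimal sketch (assuming page holds the HTML from the question):
from lxml import html

tree = html.fromstring(page)
outer = tree.xpath('//table')[0]

def nearest_table(text_node):
    # tail text hangs off the element it follows, so step up one level first
    el = text_node.getparent()
    if text_node.is_tail:
        el = el.getparent()
    while el is not None and el.tag != 'table':
        el = el.getparent()
    return el

texts = [t.strip() for t in outer.xpath('.//text()')
         if t.strip() and nearest_table(t) is outer]
print texts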
I'm trying to find the text after a remarks field in a form. However, the table has multiple remarks fields. I want to be able to grab the text in the td that follows the td of the second remarks field. I have the following html:
<table>
<tbody>
<tr>...</tr>
<tr>...</tr>
<tr>
<td>Remarks:</td>
<td>this is the first remarks field</td>
</tr>
<tr>
<td>AnotherField:</td>
<td>Content of that field</td>
</tr>
<tr>
<td>Remarks:</td>
<td>this is the second remarks field</td>
</tr>
<tr>...</tr>
</tbody>
</table>
To grab the text out of the first remarks field, I can do the following:
ret = driver.find_element_by_xpath("//td[contains(text(),'Remarks')]/following::td")
print ret.text
However, I need to grab the content out of the second remarks field. This has to be done based on the index of the occurrences of 'Remarks', not on the row's position in the table. I've wanted to try things like this:
ret = self.driver.find_element_by_xpath("//td[contains(text(),'Remarks')][1]/following::td")
or:
rets = self.driver.find_elements_by_xpath("//td[contains(text(),'Remarks')]")[1]
ret = rets.find_element_by_xpath("/following::td")
Understandably, these do not work. Is there a way of doing this? A command along the lines of 'the field after the nth occurrence of Remarks' is what I'm looking for.
P.S. This will have to be done using XPath. Reason being, I'm trying to convert a coworker's code to Selenium from another application that has everything revolving around XPath.
I'm using Selenium-2.44.0 and Python 2.7.
Indexing starts from 1 in XPath. Parenthesize the expression so the index applies to the whole set of matches:
(//td[contains(., 'Remarks')]/following-sibling::td)[2]
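With your Selenium setup that would be the same call as in your first snippet, just with the parenthesized expression:
ret = self.driver.find_element_by_xpath("(//td[contains(., 'Remarks')]/following-sibling::td)[2]")
print ret.text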
Or, you can use find_elements_by_xpath() and get the second item:
elements = self.driver.find_elements_by_xpath("//td[contains(., 'Remarks')]/following-sibling::td")
print elements[1].text
I'm a first-time poster here trying to pick up some Python skills; please be kind to me :-)
While I'm not a complete stranger to programming concepts (I've been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this mostly has to do with the fact that I lack most, if not all, basic understanding of common "design patterns" and such.
Having said that, this is the problem. Part of my current project involves writing a simple scraper utilizing Beautiful Soup. The data to be processed has a structure somewhat similar to the one laid out below.
<table>
<tr>
<td class="date">2011-01-01</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
<tr>
<td class="date">2011-01-02</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
</table>
The main issue is that I simply can't get my head around how to 1) keep track of the current date (tr -> td class="date") while 2) looping over the items in the subsequent tr elements (tr class="item" -> td class="headline" and tr class="item" -> td class="link"), and 3) store the processed data in an array.
Additionally, all data will be inserted into a database where each entry must contain the following information;
date
headline
link
Note that CRUDing the database is not part of the problem; I only mentioned this in order to better illustrate what I'm trying to accomplish here :-)
Now, there are many different ways to skin a cat. So while a solution to the issue at hand is indeed very welcome, I'd be extremely grateful if someone would care to elaborate on the actual logic and strategy you would make use of in order to "attack" this kind of problem :-)
Last but not least, sorry for such a noobish question.
The basic problem is that this table is marked up for looks, not for semantic structure. Properly done, each date and its related items should share a parent. Unfortunately, they don't, so we'll have to make do.
The basic strategy is to iterate through each row in the table:
if the first table data cell has class 'date', we get the date value and update last_seen_date
otherwise, we extract a headline and a link, then save (last_seen_date, headline, link) to the database
import BeautifulSoup

fname = r'c:\mydir\beautifulSoup.html'
soup = BeautifulSoup.BeautifulSoup(open(fname, 'r'))

items = []
last_seen_date = None
for el in soup.findAll('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:  # not a date row - get the headline and link
        headline = el.find('td', {'class': 'headline'}).text
        # the sample has plain text here; on the real page this is
        # probably an anchor, in which case use el.find('a').get('href')
        link = el.find('td', {'class': 'link'}).text
        items.append((last_seen_date, headline, link))
    else:  # a date row - remember the new date
        last_seen_date = daterow.text
You can use ElementTree, which is included in the Python standard library.
http://docs.python.org/library/xml.etree.elementtree.html
from xml.etree.ElementTree import ElementTree

tree = ElementTree()
tree.parse('page.xhtml')  # this is the XHTML provided in the OP
root = tree.getroot()     # returns the top-level "table" element
print(root.tag)           # "table"

for eachTableRow in root.getchildren():
    # root.getchildren() is a list of all of the <tr> elements,
    # so we're going to loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # good to go - now we know to look for the headline and link
        pass
    else:
        # okay, so look for the date
        pass
That should be enough to get you on your way to parsing this.
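If it helps, here is one possible way to fill in those two branches (just a sketch; it assumes the headline sits in the first td and the link text in the second, as in the sample above):
items = []
last_seen_date = None
for row in root.getchildren():
    tds = row.findall('td')
    if 'class' in row.attrib:  # an item row
        items.append((last_seen_date, tds[0].text, tds[1].text))
    else:  # a date row
        last_seen_date = tds[0].text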
I've got an HTML table that I'm trying to parse the information from. However, some of the cells span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists, so I would turn something like
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
into
[['1,1', '1,2'],
['2,1', '2,2']]
Which I (think) should be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/cols. Plus there's a lot of completely unnecessary information:
<td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&style=L&positioning=A&adddirect=yes&accessid=CreateNewEdit&filterblock=N&popeditform=yes&returncalendar=student_center/sc_all_rooms')"
class="listdefaultmonthbg"
style="cursor:crosshair;"
width="5%"
nowrap="1"
rowspan="1">
<a class="listdatelink"
href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&display=W&positioning=A&filterblock=N&adddirect=yes&accessid=CreateNewEdit">Sep 5</a>
</td>
And what the code really looks like is even worse. All I really need out of there is:
<td rowspan="1">Sep 5</td>
Two rows later, there is a <td> with a rowspan of 17. For multi-row spans I was thinking something like this:
<tr>
<td rowspan="2">Sep 5</td>
<td>Some event</td>
</tr>
<tr>
<td>Some other event</td>
</tr>
would end out like this:
[["Sep 5", "Some event"],
[None, "Some other event"]]
There are multiple tables on the page, and I can find the one I want already; I'm just not sure how to parse out the information I need. I know I can use BeautifulSoup's renderContents, but in some cases there are link tags that I need to get rid of (while keeping the text).
I was thinking of a process something like this:
Find table
Count rows in the table (len(table.findAll('tr'))?)
Create list
Parse table into list (BeautifulSoup syntax???)
???
Profit! (Well, it's a purely internal program, so not really... )
There was a recent discussion in the Python group on LinkedIn about a similar issue, and apparently lxml is the most recommended Pythonic parser for HTML pages.
http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827
You'll probably need to identify the table with some attrs, id or name.
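For example, if the target table carries an id (the value here is made up):
table = soup.find('table', {'id': 'calendar_table'})  # hypothetical id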
from BeautifulSoup import BeautifulSoup
data = """
<table>
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
</table>
"""
soup = BeautifulSoup(data)
for t in soup.findAll('table'):
    for tr in t.findAll('tr'):
        print [td.contents for td in tr.findAll('td')]
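The question also asks for rowspans to expand into None placeholders. A rough extension of the same approach (a sketch; it handles rowspan only, not colspan, and uses td.text to drop link tags while keeping their text):
def table_to_grid(table):
    grid = []
    pending = {}  # column index -> rows still covered by a rowspan from above
    for tr in table.findAll('tr'):
        row, col = [], 0
        for td in tr.findAll('td'):
            while pending.get(col, 0) > 0:  # a cell above spans into this slot
                pending[col] -= 1
                row.append(None)
                col += 1
            span = int(td.get('rowspan', 1))
            if span > 1:
                pending[col] = span - 1
            row.append(td.text)
            col += 1
        while pending.get(col, 0) > 0:  # spanned slots at the end of the row
            pending[col] -= 1
            row.append(None)
            col += 1
        grid.append(row)
    return grid
For the "Sep 5" example above, table_to_grid would give [['Sep 5', 'Some event'], [None, 'Some other event']].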
Edit: what should the program do if there are multiple links?
Ex:
<td>A B C</td>