lxml xpath get text between two nested tables

I have an HTML document that has nested tables. I want to find the text between an outer table and its inner tables. I thought this was a classic question, but so far I haven't found the answer. What I have come up with is
tree.xpath('//p[not(ancestor-or-self::table)]'). But this isn't working, because all of the text descends from the outer table. Using just preceding::table isn't enough either, because the text can surround the inner tables.
As a conceptual example, if a table looks like this [...text1...[inside table No.1]...text2...[inside table No.2]...text3...], how can I get text1/2/3 only, without it being contaminated by text from inside tables No.1 and No.2? Maybe this is just my way of thinking, but is it possible to build a concept of table layers via XPath, so I can tell lxml or other libraries "Give me all text between layer 0 and 1"?
Below is a simplified sample HTML file. In reality, the outer table may contain many nested tables, but I just want the text between the outermost table and its first-level nested tables. Thanks folks!
<table>
<tr><td>
<p> text I want </p>
<div> they can be in different types of nodes </div>
<table>
<tr><td><p> unwanted text </p></td></tr>
<tr><td>
<table>
<tr><td><u> unwanted text</u></td></tr>
</table>
</td></tr>
</table>
<p> text I also want </p>
<div> as long as they're inside the root table and outside the first-level inside tables </div>
</td></tr>
<tr><td>
<u> they can be between the first-level inside tables </u>
<table>
</table>
</td></tr>
</table>
And it should return ["text I want", "they can be in different types of nodes", "text I also want", "as long as they're inside the root table and outside the first-level inside tables", "they can be between the first-level inside tables"].

One XPath that could do this, if the outermost table is the root element:
/table/descendant::table[1]/preceding::p
Here, you traverse to the first descendant table of the outermost table, and then select all the p elements that precede it.
If not, you will have to take a different approach to accessing the p elements between the tables, maybe using the generate-id() function.
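The "table layer" idea from the question can also be expressed by counting table ancestors: a text node that belongs to the outermost table but not to any nested table has exactly one table ancestor. The following is a minimal sketch of that idea, not the answerer's method; it assumes the sample HTML from the question and that lxml is installed, and the variable names are only illustrative.

```python
from lxml import html

# Sample document from the question.
page = """<table>
<tr><td>
<p> text I want </p>
<div> they can be in different types of nodes </div>
<table>
<tr><td><p> unwanted text </p></td></tr>
<tr><td>
<table>
<tr><td><u> unwanted text</u></td></tr>
</table>
</td></tr>
</table>
<p> text I also want </p>
<div> as long as they're inside the root table and outside the first-level inside tables </div>
</td></tr>
<tr><td>
<u> they can be between the first-level inside tables </u>
<table>
</table>
</td></tr>
</table>"""

tree = html.fromstring(page)

# Keep non-blank text nodes that have exactly one <table> ancestor,
# i.e. text in "layer 0" that is not inside any nested table.
nodes = tree.xpath("//text()[count(ancestor::table) = 1][normalize-space()]")
wanted = [t.strip() for t in nodes]
print(wanted)
```

Changing the count to 2 would select the text of the first-level nested tables instead. If the outermost table is itself wrapped in other tables in the real page, the count would need to be adjusted accordingly.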

Related

Cannot grab text between span tags

I am having a heck of a time grabbing the text from the span tag (568,789,073,292). The text is constantly changing. See the HTML below.
I thought this should work, but it doesn't:
xpath=(//div[@id='RxValidPackets']/span)[3]
The elements below are from row 3 of a table.
When I try to grab row 6, I try this:
xpath=(//div[@id='RxValidPackets']/span)[6]
There are 6 rows of data with the same elements. The only difference is the span tags.
Can you help?
Here is the HTML:
<td class="field-undefined-container">
<div id="RxValidPackets" class="field-container-value">
<span>568,789,073,292</span>
</div>
</td>

Make sure you don't use the same ID in every row. In an HTML document, IDs are unique.
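One way to avoid relying on a repeated ID is to pick the row first and then look up the value inside it by class. The sketch below only illustrates that XPath pattern with lxml; the three-row table, the values 111 and 222, and the assumption that each record lives in its own tr are invented for the example.

```python
from lxml import html

# Hypothetical table: three rows with the same structure and no repeated IDs.
rows_html = """<table>
<tr><td class="field-undefined-container">
<div class="field-container-value"><span>111</span></div>
</td></tr>
<tr><td class="field-undefined-container">
<div class="field-container-value"><span>222</span></div>
</td></tr>
<tr><td class="field-undefined-container">
<div class="field-container-value"><span>568,789,073,292</span></div>
</td></tr>
</table>"""

doc = html.fromstring(rows_html)

# Select the third row, then the span inside its value container.
value = doc.xpath("(//tr)[3]//div[@class='field-container-value']/span/text()")
print(value)
```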

HTML div with padding

I apologise in advance, because I'm sure this is pretty straightforward, but frontend isn't my field!
I have an HTML document I'm having issues with. The general structure is as follows:
<div style="padding-bottom: 1cm;">
<h4>Digital</h4>
<h5 style="text-align: left;">Data Table:</h5>
<table>
<!-- the rest of the table here -->
</table>
</div>
I have several of these div elements being generated by Jinja (Python). My intention with the padding is to add space between the div elements on the page. This works as I'd expect if the table is empty (or has very few rows), but as soon as the table grows the padding no longer works as I'd expect.
The HTML above gives me an output like the one below (note how the empty space between 'Second table' and 'Third table' is smaller than the space between 'First table' and 'Second table'):

Use margin-bottom instead, and allow overflow on your parent element.
For a better result, style the parent element that holds them:
use flexbox or grid.

extract text from two tables by Beautiful Soup

I have a lot of pages where the structure is the following:
<table class='CERTAIN_CLASS'> ... </table>
A lot of stuff here (divs, ps, brs, images)
<table class='CERTAIN_CLASS'> ... </table>
What is the most efficient way to extract the text (text only!) from everything between two tables of a certain class? I've found a lot of similar questions on SO, but nothing on this task specifically.

Assuming that you have already loaded the content of the HTML page into text:
For a specific class:
text.find(class_="CERTAIN_CLASS").text.strip()
To get the text of every element with this class, iterate over all the matches:
for element in text.find_all(class_="CERTAIN_CLASS"):
    print(element.text.strip())
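The calls above return the text inside the tables themselves. For the text between the two tables, one possible approach is to walk the siblings that follow the first table until the second one is reached. This is a rough sketch rather than a definitive solution: it assumes both tables share the same parent, uses made-up page content, and relies on Beautiful Soup with the built-in html.parser.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical page content between two tables of the same class.
html_doc = """
<table class="CERTAIN_CLASS"><tr><td>first table</td></tr></table>
<div>some <b>stuff</b> here</div>
<p>more text</p>
<br><img src="x.png">
<table class="CERTAIN_CLASS"><tr><td>second table</td></tr></table>
"""

soup = BeautifulSoup(html_doc, "html.parser")
first, second = soup.find_all("table", class_="CERTAIN_CLASS")[:2]

parts = []
for node in first.next_siblings:
    if node is second:
        break  # stop once the second table is reached
    if isinstance(node, Tag):
        text = node.get_text(" ", strip=True)  # text of an element and its children
    else:
        text = str(node).strip()  # bare string between elements
    if text:
        parts.append(text)

print(parts)
```

If the two tables are not siblings, the loop would never meet the second table, so a different traversal (for example over next_elements) would be needed.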

How do I extract text between two objects using XPath?

I'm using XPath to extract different web elements on a webpage, but have hit a roadblock on one particular object that is sitting between two objects, but doesn't have a closing object behind it for a while.
I've been able to successfully extract other elements from the webpage, but don't know how to proceed at this point.
Here is a copy of what the HTML looks like from the Inspector:
<body>
<table>
<tbody>
<tr>
<td id="left_column">
<div id="top">
<h1></h1>
#SOME TEXT
<div>
<table>
.......
</table>
</div>
</div>
</td>
</tr>
Any suggestions would be greatly appreciated! Thank you!

Here is a thought that I hope will help, but without seeing the entire HTML I can't give more than just an idea. I have more experience with Selenium in Java, so I am not 100% sure that Python has the same functionality, but I imagine it does.
You should be able to get the text from any WebElement. In Java it would look something like this, and I imagine it shouldn't be too hard to change it to Python:
WebElement top = driver.findElement(By.xpath("//div[@id='top']"));
String topString = top.getText();
If in your case you're getting more than just the "#SomeText", you would need to remove the text of the other elements that you don't want. Something like:
WebElement topH1 = top.findElement(By.xpath("./h1"));
WebElement topInsideDiv = top.findElement(By.xpath("./div"));
String topHString = topH1.getText();
String topInsideDivString = topInsideDiv.getText();
//since you know that the H1 string would come first and the inside div
//would come after you could take the substring of the topString
String result = topString.substring(topHString.length(),
topString.length() - topInsideDivString.length());
This is really just an idea of how you could do it. The way that you determine the part of the string that you are interested in might need to be more complex. It could be that you just cycle through the strings to determine where you need to break apart the entire string to get what you want. If there is text before the tag, you would need a more complex solution, perhaps searching for the text and discounting anything you find before it, but without that information I can't really help out more than this.
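As an alternative to trimming substrings, XPath can select only the text nodes that are direct children of the div, which skips the h1 and the nested div altogether. The sketch below shows this with lxml instead of Selenium; the snippet is a cut-down, hypothetical version of the question's HTML, and the "Title" and "cell" contents are invented for illustration.

```python
from lxml import html

# Cut-down, hypothetical version of the markup in the question.
snippet = """<div id="top"><h1>Title</h1>
SOME TEXT
<div><table><tr><td>cell</td></tr></table></div>
</div>"""

doc = html.fromstring(snippet)

# text() selects only the text nodes that are direct children of div#top,
# so text inside <h1> and inside the nested <div> is not included.
direct_text = [t.strip() for t in doc.xpath("//div[@id='top']/text()") if t.strip()]
print(direct_text)
```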

extract specific element from nested elements using lxml html

Hi all, I am having some problems that I think can be attributed to XPath. I am using the html module from the lxml package to try and get at some data. I am providing the most simplified situation below, but keep in mind the HTML I am working with is much uglier.
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
What I really want is the deeply nested table, because it has the header text "Header1".
I am trying like so:
from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))
but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex.
Any thoughts?

Use:
//td[. = 'Header1']/ancestor::table[1]

Find the header you are interested in and then pull out its table.
//u[b = 'Header1']/ancestor::table[1]
or
//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]
Note that an expression beginning with // always starts at the document root (!), even inside a predicate. You can't do:
//table[//*[contains(text(), "Header1")]]
and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:
//table[.//*[contains(text(), "Header1")]]
won't work since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.
Also, don't test the condition on every node .//*, since it can't be true for every node to begin with. It's more efficient to be specific.
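To see the difference between these expressions in practice, the following sketch runs them against the sample HTML from the question and counts the matches. It assumes lxml is installed; the variable names are only illustrative.

```python
from lxml import html

# Sample document from the question: three nested tables.
page = """<table>
<tr><td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>"""

tree = html.fromstring(page)

# Inner // restarts at the document root, so the predicate is true for every table.
absolute = tree.xpath('//table[//*[contains(text(), "Header1")]]')

# .// is relative to each table, but every table still contains Header1 somewhere.
relative = tree.xpath('//table[.//*[contains(text(), "Header1")]]')

# Anchor on the cell that holds the header and has no nested table of its own.
innermost = tree.xpath('//td[not(.//table) and .//b = "Header1"]/ancestor::table[1]')

print(len(absolute), len(relative), len(innermost))
```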

Perhaps this would work for you:
tree.xpath("//table[not(descendant::table)]/*[contains(., 'Header1')]")
The not(descendant::table) bit ensures that you're getting the innermost table.

table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
//*[.="Header1"] selects the elements anywhere in the document whose string value is Header1.
ancestor::table[1] selects the first ancestor of the element that is table.
Complete example
#!/usr/bin/env python
from lxml import html
page = """
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
"""
tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table, encoding="unicode"))
