How do I extract text between two objects using XPath? - python

I'm using XPath to extract different elements from a webpage, but I have hit a roadblock with one piece of text that sits between two elements and is not wrapped in a tag of its own; the next closing tag only comes much later.
I've been able to extract the other elements from the page successfully, but I don't know how to proceed with this one.
Here is a copy of what the HTML looks like from the Inspector:
<body>
<table>
<tbody>
<tr>
<td id="left_column">
<div id="top">
<h1></h1>
#SOME TEXT
<div>
<table>
.......
</table>
</div>
</div>
</td>
</tr>
Any suggestions would be greatly appreciated! Thank you!

Here is a thought that I hope will help, but without seeing the entire HTML I can't give more than an idea. I have more experience with Selenium in Java, so I am not 100% sure that Python offers the same functionality, but I imagine it does.
You should be able to get the text from any WebElement. In Java it would look something like this, and it shouldn't be too hard to translate to Python:
WebElement top = driver.findElement(By.xpath("//div[@id='top']"));
String topString = top.getText();
If in your case you're getting more than just the "#SOME TEXT", you would need to remove the text of the other elements that you don't want. Something like:
WebElement topH1 = top.findElement(By.xpath("./h1"));
WebElement topInsideDiv = top.findElement(By.xpath("./div"));
String topHString = topH1.getText();
String topInsideDivString = topInsideDiv.getText();
//since you know that the H1 string would come first and the inside div
//would come after you could take the substring of the topString
String result = topString.substring(topHString.length(),
topString.length() - topInsideDivString.length());
This is really just an idea of how you could do it. The way you determine which part of the string you are interested in might need to be more complex. You may have to cycle through the strings to work out where to split the full text. If there is text before the h1 tag, you would need a more elaborate solution, perhaps searching for that text and discarding anything found before it, but without that information I can't really help more than this.
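For Python, here is a minimal sketch of the same goal without string arithmetic. In the ElementTree model (which lxml also follows), text that comes right after an element's closing tag is stored on that element as .tail, so the loose text after the h1 can be read directly. The snippet uses the standard library on a trimmed, well-formed copy of the markup from the question; the SAMPLE string and the text_after_h1 helper are assumptions made for illustration, and a real page would normally need a lenient HTML parser such as lxml.html. With lxml, an XPath along the lines of //div[@id='top']/h1/following-sibling::text()[1] should reach the same text node.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed stand-in for the markup in the question (assumption).
SAMPLE = """
<td id="left_column">
  <div id="top">
    <h1></h1>
    #SOME TEXT
    <div><table></table></div>
  </div>
</td>
"""


def text_after_h1(markup):
    """Return the loose text that sits directly after the <h1> inside div#top."""
    root = ET.fromstring(markup)
    h1 = root.find(".//div[@id='top']/h1")
    if h1 is None or h1.tail is None:
        return None
    # .tail holds the text between </h1> and the next tag.
    return h1.tail.strip()


result = text_after_h1(SAMPLE)
print(result)
```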

Related

extract text from two tables by Beautiful Soup

I have a lot of pages where the structure is the following:
<table class='CERTAIN_CLASS'> ... </table>
A lot of stuff here (divs, ps, brs, images)
<table class='CERTAIN_CLASS'> ... </table>
What is the most efficient way to extract the text (text only!) from everything between two tables of a certain class? I've found a lot of similar questions on SO, but nothing specifically on this task.
Assuming that you have already loaded the content of the HTML page into a Beautiful Soup object called text:
For a specific class:
text.find(class_="CERTAIN_CLASS").text.strip()
To get the text from every element of this class, you could iterate through them:
for element in text.find_all(class_="CERTAIN_CLASS"):
    print(element.text.strip())
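The calls above return the text inside the tables themselves. To collect what lies between the two tables, one option is to walk the siblings from the first table to the second. The sketch below shows that idea with the standard-library ElementTree instead of Beautiful Soup, so it assumes well-formed markup in which both tables share one parent; the helper name text_between_tables and the sample string are made up for illustration.

```python
import xml.etree.ElementTree as ET


def text_between_tables(parent, cls):
    """Collect text found between the first two <table class=cls> children of parent."""
    parts = []
    inside = False
    for child in parent:
        is_marker = child.tag == "table" and child.get("class") == cls
        if is_marker and inside:
            break  # reached the second table: stop collecting
        if is_marker:
            inside = True
            parts.append(child.tail or "")  # text right after the first table
            continue
        if inside:
            parts.append("".join(child.itertext()))  # text inside the element
            parts.append(child.tail or "")  # text right after the element
    # Normalise runs of whitespace into single spaces.
    return " ".join(" ".join(parts).split())


# Hypothetical sample page fragment.
SAMPLE = (
    '<div>'
    '<table class="CERTAIN_CLASS"><tr><td>skip</td></tr></table>'
    'intro <p>Hello <b>world</b></p><br/>tail'
    '<table class="CERTAIN_CLASS"><tr><td>skip too</td></tr></table>'
    'after'
    '</div>'
)

result = text_between_tables(ET.fromstring(SAMPLE), "CERTAIN_CLASS")
print(result)
```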

Only extract information from the div class if it contains a certain word using xpath

I am trying to scrape information from the following website: https://www.rawson.co.za
However, the information sometimes changes its position.
I am struggling to pick out only the 'Building Size' and store that as the size, since the div looks like this:
<div class="features__item">
<div class="features__icon icon-house" aria-hidden="true"></div>
<div class="features__label">Building Size 130m²</div>
</div>
I am able to extract it, but sometimes I get other information instead, either because the property has no building size or because something else sits in that position.
This is what I have for size now (I am reading the information from the child/property pages):
size = response.xpath("//div[@class='features']/div[@class='features__list']/div[@class='row']/div[@class='col col--1-2'][2]/div[@class='features__item'][1]/div[@class='features__label']/text()").re(r'\d+')[0]
What I would like is to take the building size (numbers only) if it exists, and store None if no building size is available. I am struggling with matching on the text inside the div. I have tried to construct a for loop that checks whether it contains 'Building Size', but nothing has worked yet. Any help would be very much appreciated! Thank you!
Simple:
size = response.xpath("//div[@class='features__label'][contains(., 'Building Size')]/text()").re_first(r'\d+')
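The one-liner works because the label is selected by its text rather than its position, and because re_first returns None when nothing matches. The sketch below reproduces that select-by-text-then-regex logic with only the standard library, so it can be tried without Scrapy; the building_size helper and the two sample fragments are assumptions made for illustration, not the site's real markup.

```python
import re
import xml.etree.ElementTree as ET


def building_size(markup):
    """Return the digits of the 'Building Size' label, or None if there is no such label."""
    root = ET.fromstring(markup)
    for label in root.iter("div"):
        if label.get("class") != "features__label":
            continue
        text = "".join(label.itertext())
        if "Building Size" in text:
            match = re.search(r"\d+", text)
            return match.group(0) if match else None
    return None


# Hypothetical fragments: one property with a building size, one without.
WITH_SIZE = (
    '<div class="features">'
    '<div class="features__item"><div class="features__label">Bedrooms 3</div></div>'
    '<div class="features__item"><div class="features__label">Building Size 130m²</div></div>'
    '</div>'
)
WITHOUT_SIZE = (
    '<div class="features">'
    '<div class="features__item"><div class="features__label">Bedrooms 3</div></div>'
    '</div>'
)

with_size = building_size(WITH_SIZE)
without_size = building_size(WITHOUT_SIZE)
print(with_size, without_size)
```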

python how to identify block html contain text?

I have raw HTML files from which I have removed the script tags.
I want to identify the block elements in the DOM (like <h1> <p> <div> etc, not <a> <em> <b> etc) and enclose them in <div> tags.
Is there an easy way to do it in Python?
Is there a Python library that can identify block elements?
Thanks
UPDATE
What I actually want is to extract content from the HTML document. I have to identify the blocks which contain text. For each text element I have to find its closest parent element that is displayed as a block. After that, for each block I will extract features such as the size and position of the block.
You should use something like Beautiful Soup or HTMLParser.
Have a look at their docs: Beautiful Soup or HTMLParser.
You should find what you are looking for there. If you cannot get it to work, consider asking a more specific question.
Here is a simple example of how you could go about this. Say 'data' is the raw content of a site, then you could:
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, "html.parser")  # you may need to add from_encoding="utf-8" or similar
Then you might want to walk through the tree looking for a specific node and do something with it. You could use a function like this:
def walker(soup):
    if soup.name is not None:
        for child in soup.children:
            # do stuff with the node
            print(':'.join([str(child.name), str(type(child))]))
            walker(child)
Note: the code is from this great tutorial.
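For the UPDATE part of the question (finding the closest block-level parent of each piece of text), here is a minimal sketch using the standard-library HTMLParser mentioned above. It keeps a stack of open tags and, for every non-empty text node, reports the nearest enclosing tag from a block-tag set. The BLOCK_TAGS and VOID_TAGS sets are partial and illustrative (an assumption, extend them as needed), and the class name is made up. Size and position are layout properties, so a parser alone cannot provide them; that part would need a rendering engine.

```python
from html.parser import HTMLParser

# Partial, illustrative tag sets (assumption; extend as needed).
BLOCK_TAGS = {"body", "div", "p", "h1", "h2", "h3", "h4", "h5", "h6",
              "ul", "ol", "li", "table", "tr", "td", "th", "blockquote", "pre"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class BlockTextCollector(HTMLParser):
    """Record (nearest block tag, text) for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags, outermost first
        self.found = []  # (block tag or None, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        for tag in reversed(self.stack):
            if tag in BLOCK_TAGS:
                self.found.append((tag, text))
                return
        self.found.append((None, text))


collector = BlockTextCollector()
collector.feed("<div><p>Hello <b>bold</b> world</p><h1>Title</h1><a>loose</a></div>")
collector.close()
print(collector.found)
```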

PyQuery: Get only text of element, not text of child elements

I have the following HTML:
<h1 class="price">
<span class="strike">$325.00</span>$295.00
</h1>
I'd like to get the $295 out. However, if I simply use PyQuery as follows:
price = pq('h1').text()
I get both prices.
Extracting only direct child text for an element in jQuery looks reasonably complicated - is there a way to do it at all in PyQuery?
Currently I'm extracting the first price separately, then using replace to remove it from the text, which is a bit fiddly.
Thanks for your help.
I don't think there is a clean way to do that. The best I've found is this solution:
>>> print(doc('h1').html(doc('h1')('span').outerHtml()))
<h1 class="price"><span class="strike">$325.00</span></h1>
You can use .text() instead of .outerHtml() if you don't want to keep the span tag.
Removing the first one is much easier:
>>> print(doc('h1').remove('span'))
<h1 class="price">
$295.00
</h1>
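If modifying the document is undesirable, another option is to read only the element's own text nodes. PyQuery is built on lxml, whose elements follow the ElementTree model: an element's leading text is in .text and the text after each child is in that child's .tail. The sketch below shows the idea with the standard-library ElementTree on the markup from the question; own_text is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Join the element's direct text nodes, skipping text inside child elements."""
    parts = [element.text or ""]
    parts.extend(child.tail or "" for child in element)
    return "".join(parts).strip()


h1 = ET.fromstring(
    '<h1 class="price">\n<span class="strike">$325.00</span>$295.00\n</h1>'
)
price = own_text(h1)
print(price)
```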

extract specific element from nested elements using lxml html

Hi all, I am having some problems that I think come down to my XPath. I am using the html module from the lxml package to try to get at some data. I am providing the most simplified situation below, but keep in mind the HTML I am working with is much uglier.
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
What I really want is the deeply nested table, because it has the header text "Header1".
I am trying like so:
from lxml import html
page = '...'
tree = html.fromstring(page)
print(tree.xpath('//table[//*[contains(text(), "Header1")]]'))
but that gives me all of the table elements. I just want the one table that contains this text. I understand what is going on but am having a hard time figuring out how to do this besides breaking out some nasty regex.
Any thoughts?
Use:
//td[. = 'Header1']/ancestor::table[1]
Find the header you are interested in and then pull out its table.
//u[b = 'Header1']/ancestor::table[1]
or
//td[not(.//table) and .//b = 'Header1']/ancestor::table[1]
Note that // always starts at the document root (!). You can't do:
//table[//*[contains(text(), "Header1")]]
and expect the inner predicate (//*…) to magically start at the right context. Use .// to start at the context node. Even then, this:
//table[.//*[contains(text(), "Header1")]]
won't work since even the outermost table contains the text 'Header1' somewhere deep down, so the predicate evaluates to true for every table in your example. Use not() like I did to make sure no other tables are nested.
Also, don't test the condition on every node .//*, since it can't be true for every node to begin with. It's more efficient to be specific.
Perhaps this would work for you:
tree.xpath("//table[not(descendant::table)][contains(., 'Header1')]")
The not(descendant::table) bit ensures that you're getting the innermost table.
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
//*[.="Header1"] selects every element in the document whose string value is exactly Header1.
ancestor::table[1] selects the nearest ancestor of that element that is a table.
Complete example
#!/usr/bin/env python
from lxml import html
page = """
<table>
<tr>
<td>
<table>
<tr><td></td></tr>
<tr><td>
<table>
<tr><td><u><b>Header1</b></u></td></tr>
<tr><td>Data</td></tr>
</table>
</td></tr>
</table>
</td></tr>
</table>
"""
tree = html.fromstring(page)
table, = tree.xpath('//*[.="Header1"]/ancestor::table[1]')
print(html.tostring(table))
