PyQuery: Get only text of element, not text of child elements


I have the following HTML:
<h1 class="price">
<span class="strike">$325.00</span>$295.00
</h1>
I'd like to get the $295 out. However, if I simply use PyQuery as follows:
price = pq('h1').text()
I get both prices.
Extracting only direct child text for an element in jQuery looks reasonably complicated - is there a way to do it at all in PyQuery?
Currently I'm extracting the first price separately, then using replace to remove it from the text, which is a bit fiddly.
Thanks for your help.

I don't think there is a clean way to do that. At least, I've found this solution:
>>> print(doc('h1').html(doc('h1')('span').outerHtml()))
<h1 class="price"><span class="strike">$325.00</span></h1>
You can use .text() instead of .outerHtml() if you don't want to keep the span tag.
Removing the first one is much easier:
>>> print(doc('h1').remove('span'))
<h1 class="price">
$295.00
</h1>
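For what it's worth, PyQuery sits on top of lxml, which follows the ElementTree model: text before the first child is stored on the element's .text, and text that follows a child is stored on that child's .tail. Below is a minimal sketch of that idea using only the standard library's xml.etree.ElementTree rather than PyQuery itself. The helper name own_text is made up for illustration, and the sketch assumes the snippet is well-formed enough for an XML parser.

```python
import xml.etree.ElementTree as ET


def own_text(element):
    """Return the text that belongs directly to `element`, skipping child text.

    Hypothetical helper for illustration; it is not part of PyQuery or lxml.
    """
    parts = [element.text or ""]        # text before the first child
    for child in element:
        parts.append(child.tail or "")  # text that follows each child
    return "".join(parts).strip()


# Assumption: the markup is well-formed XML, which holds for this snippet.
html = '<h1 class="price">\n<span class="strike">$325.00</span>$295.00\n</h1>'
h1 = ET.fromstring(html)
price = own_text(h1)
print(price)
```

Unlike remove('span'), this leaves the parsed document untouched, so the struck-through price can still be read afterwards if needed.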

Related

Not able to extract data using scrapy with class names containing spaces and hyphens

I am new to scrapy and I have to extract text from a tag with multiple class names, where the class names contain spaces and hyphens.
Example:
<div class="info">
<span class="price sale">text1</span>
<span class="title ng-binding">some text</span>
</div>
When I use the code:
response.xpath("//span[contains(@class,'price sale')]/text()").extract()
I am able to get text1, but when I use:
response.xpath("//span[contains(@class,'title ng-binding')]/text()").extract()
I get an empty list. Why is this happening, and how can I handle it?

The expression you're looking for is:
//span[contains(@class, 'title') and contains(@class, 'ng-binding')]
I highly suggest XPath visualizer, which can help you debug xpath expressions easily. It can be found here:
http://xpathvisualizer.codeplex.com/
Or with CSS try
response.css("span.title.ng-binding")
Alternatively, the element with ng-binding may be loaded via JavaScript/Ajax and therefore missing from the initial server response.

You can replace the spaces with "." in your code when using response.css().
In your case you can try:
response.css("span.title.ng-binding::text").extract()
This code should return the text you are looking for.
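To make the difference between the two styles concrete: a class attribute is a whitespace-separated list of tokens, and a CSS selector such as span.title.ng-binding checks each token independently of order, whereas contains(@class, '...') is a plain substring test on the whole attribute string. The sketch below mimics the token check in plain Python with the standard library instead of Scrapy; has_classes is a hypothetical helper, and the snippet is assumed to be well-formed.

```python
import xml.etree.ElementTree as ET


def has_classes(element, *wanted):
    """True if every name in `wanted` is one of the element's class tokens.

    Hypothetical helper; it only imitates what a CSS class selector checks.
    """
    tokens = set((element.get("class") or "").split())
    return set(wanted) <= tokens


snippet = """<div class="info">
<span class="price sale">text1</span>
<span class="title ng-binding">some text</span>
</div>"""

root = ET.fromstring(snippet)
# The order of the requested classes does not matter for a token match.
matches = [s.text for s in root.iter("span") if has_classes(s, "ng-binding", "title")]
print(matches)
```

If a check like this finds the element in the page source you saved from the browser but not in response.text, that points to the JavaScript explanation rather than to the selector.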

How do I extract text between two objects using XPath?

I'm using XPath to extract different web elements on a webpage, but have hit a roadblock on one particular piece of text that sits between two elements and isn't wrapped in a tag of its own; the next closing tag doesn't come for a while.
I've been able to successfully extract other elements from the webpage, but don't know how to proceed at this point.
Here is a copy of what the HTML looks like from the Inspector:
<body>
<table>
<tbody>
<tr>
<td id="left_column">
<div id="top">
<h1></h1>
#SOME TEXT
<div>
<table>
.......
</table>
</div>
</div>
</td>
</tr>
Any suggestions would be greatly appreciated! Thank you!

Here is a thought that I hope will help, but without seeing the entire HTML I can't give more than an idea. I have more experience with Selenium in Java, so I am not 100% sure that Python has the same functionality, but I imagine it does.
You should be able to get the text from any WebElement. In Java it would look something like this, and I imagine it shouldn't be too hard to change it to Python:
WebElement top = driver.findElement(By.xpath("//div[@id='top']"));
String topString = top.getText();
If in your case you're getting more than just the "#SOME TEXT", you would need to remove the text from the other elements that you don't want. Something like:
WebElement topH1 = top.findElement(By.xpath("./h1"));
WebElement topInsideDiv = top.findElement(By.xpath("./div"));
String topHString = topH1.getText();
String topInsideDivString = topInsideDiv.getText();
// Since the h1 text comes first and the inner div's text comes last,
// take the substring of topString that lies between them.
String result = topString.substring(topHString.length(),
        topString.length() - topInsideDivString.length());
This is really just an idea of how you could do it. The way you determine the part of the string you are interested in might need to be more complex. It could be that you just cycle through the strings to determine where you need to break apart the entire string to get what you want. If there is text before the tag, your solution would need to be more complex, perhaps by searching for the text and discarding anything you find before it, but without that information I can't really help more than this.
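Another angle, sketched in Python: in the ElementTree model (which lxml also follows), text that comes right after an element's closing tag is stored as that element's .tail, so "#SOME TEXT" is simply the tail of the h1. The snippet below is a trimmed, well-formed stand-in for the markup in the question, not the real page, and it uses the standard library rather than Selenium. In a full XPath engine the equivalent expression would be along the lines of //div[@id='top']/h1/following-sibling::text()[1].

```python
import xml.etree.ElementTree as ET

# Assumption: a cleaned-up, well-formed excerpt of the page from the question.
snippet = """<td id="left_column">
<div id="top">
<h1></h1>
#SOME TEXT
<div>
<table></table>
</div>
</div>
</td>"""

root = ET.fromstring(snippet)
top = root.find(".//div[@id='top']")
# Text between </h1> and the next tag lives in the h1's tail.
between = (top.find("h1").tail or "").strip()
print(between)
```

This avoids the substring arithmetic entirely, at the cost of needing a parser that exposes text nodes, which Selenium's WebElement API does not do directly.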

In BeautifulSoup, Ignore Children Elements While Getting Parent Element Data

I have html as follows:
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>
I am trying to get only the text found in the maindiv element, without getting the text data found in the somename element. In most cases, in my experience anyway, text data is contained within some child element. However, I have run into this particular case where the data seems to be placed somewhat willy-nilly and is a bit harder to filter.
My approach is as follows:
textdata= soup.find('div', class_='maindiv').get_text()
This gets all the text data found within the maindiv element, as well as the text data found in the somename div element.
The logic I'd like to use is more along the lines of:
textdata = soup.find('div', class_='maindiv').get_text(recursive=False)
which would omit any text data found within the somename element.
I know the recursive=False argument works for locating only parent-level elements when searching the DOM structure using BeautifulSoup, but it can't be used with the .get_text() method.
I've considered finding all the text and then subtracting the string data found in the somename element from the string data found in the maindiv element, but I'm looking for something a little more efficient.

Not that far from your subtracting method, but one way to do it (at least in Python 3) is to discard all child divs.
s = soup.find('div', class_='maindiv')
for child in s.find_all("div"):
    child.decompose()
print(s.get_text())
Would print something like:
text data here
continued text data
That might be a bit more efficient and flexible than subtracting the strings, though it still needs to go through the children first.
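If modifying the tree is undesirable, the same result can be reached by collecting only the text nodes that are direct children of the target element. In BeautifulSoup I believe find_all(string=True, recursive=False) on the tag does this. The sketch below shows the underlying idea with the standard library's html.parser instead of BeautifulSoup, tracking nesting depth by hand; the class name DirectTextParser and the VOID_TAGS set are assumptions made up for this example.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag; an illustrative, incomplete list.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class DirectTextParser(HTMLParser):
    """Collect text sitting directly inside elements that carry a given class."""

    def __init__(self, target_class):
        super().__init__()
        self.target_class = target_class
        self.depth = 0    # 0 = outside target, 1 = directly inside, >1 = in a child
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.depth > 0:
            self.depth += 1
        elif self.target_class in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text at depth 1, i.e. not inside a child element.
        if self.depth == 1 and data.strip():
            self.chunks.append(data.strip())


html = """<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>"""

parser = DirectTextParser("maindiv")
parser.feed(html)
parser.close()
print(parser.chunks)
```

The trade-off compared with decompose() is that nothing is removed from the document, so the somename text stays available for later use.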

from bs4 import BeautifulSoup
html ='''
<html>
<div class="maindiv">
text data here
<br>
continued text data
<br>
<div class="somename">
text & data I want to omit
</div>
</div>
</html>'''
soup = BeautifulSoup(html, 'lxml')
soup.find('div', class_="maindiv").next_element
out:
'\n text data here \n '

how to use find all method from BS4 to scrape certain strings

<li class="sre" data-tn-component="asdf-search-result" id="85e08291696a3726" itemscope="" itemtype="http://schema.org/puppies">
<div class="sre-entry">
<div class="sre-side-bar">
</div>
<div class="sre-content">
<div class="clickable_asdf_card" onclick="window.open('/r/85e08291696a3726?sp=0', '_blank')" style="cursor: pointer;" target="_blank">
I need to grab the string '/r/85e08291696a3726?sp=0' which occurs throughout a page. I'm not sure how to use the soup.find_all method to do this. The strings that I need always occur next to '
This is what I was thinking (below) but obviously I am getting the parameters wrong. How would I format the find_all method to return the '/r/85e08291696a3726?sp=0' strings throughout the page?
for divsec in soup.find_all('div', class_='clickable_asdf_card'):
    print('got links')
    x=x+1
I read the documentation for bs4 and I was thinking about using find_all('clickable_asdf_card') to find all occurrences of the string I need but then what? Is there a way to adjust the parameters to return the string I need?

Use BeautifulSoup's built-in regular expression search to find and extract the desired substring from an onclick attribute value:
import re
pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
for item in soup.find_all(onclick=pattern):
    print(pattern.search(item["onclick"]).group(1))
If there is just a single element you want to find, use find() instead of find_all().
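To check what the pattern captures before wiring it into find_all, it can be run against a raw string with the standard library alone. This is only a quick sketch: the html value is a single opening tag copied from the question, not the whole page.

```python
import re

# One opening tag from the question, used as sample input.
html = (
    '<div class="clickable_asdf_card" '
    "onclick=\"window.open('/r/85e08291696a3726?sp=0', '_blank')\" "
    'style="cursor: pointer;" target="_blank">'
)

# Same pattern as above: group 1 is the first argument of window.open().
pattern = re.compile(r"window\.open\('(.*?)', '_blank'\)")
links = pattern.findall(html)
print(links)
```

Because findall returns every match, the same call on a full page source would list all such links, though going through BeautifulSoup is more robust if other attributes happen to contain similar text.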

How do I parse with LXML recursively in an elegant way?

For example, consider the following HTML:
<div class="class1">
<div id="element1">
text1
</div>
<div id="element2">
text2
</div>
<div id="element3">
text3
</div>
</div>
What I am trying to achieve is to parse different elements whose attributes are already known.
The way that I am doing it now:
index = len(tree.xpath('//div[@class="class1"]'))
for i in range(0, index):
    print(tree.xpath('//div[@class="class1"][i]/text()'))
But it becomes kinda messy when it comes to longer xpaths.
Is there another way to do this?
edit-
for example,
first_elem = tree.xpath('//div[@class="class1"]')[0]
is it possible to do something like:
first_elem.xpath() which searches in <div id="element1"> ?
edit-
found a weird way to do this in lxml:
for i in tree.xpath('//div[@class="class1"]'):
    str1 = html.tostring(i)
    tree = html.fromstring(str1)
    < do things here >

Your XPath seems to be wrong. When you do
tree.xpath('//div[@class="class1"][i]/text()')
i does not get substituted into the string automatically. In any case, you do not need to do that: tree.xpath returns a list of all matching elements, so you can simply use the XPath you want (even if it matches more than one element), then iterate over the result and print it. For example, for what you are trying to do:
for i in tree.xpath('//div[@class="class1"]/div/text()'):
    print(i)
This should print the text from inside each div in the main div whose class attribute is class1.
You do not even need that if you know a way to uniquely identify the element (using attributes, indexing, etc.); you can use that directly. For example, to get the text for element1, use:
for i in tree.xpath('//div[@id="element1"]/text()'):
    print(i)
Also, it seems your XML has lots of unneeded newlines and whitespace; you can strip them by calling i.strip().
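On the question of searching from an element that was already found: lxml elements have their own .xpath() method, so a relative expression starting with ./ is evaluated from that element and no tostring/fromstring round trip is needed. The sketch below shows the same pattern with the standard library's ElementTree, using findall with relative paths in place of lxml's xpath; the body wrapper around the snippet is an assumption added so there is a root to search from.

```python
import xml.etree.ElementTree as ET

# Assumption: the question's markup wrapped in a <body> root element.
snippet = """<body><div class="class1">
<div id="element1">
text1
</div>
<div id="element2">
text2
</div>
<div id="element3">
text3
</div>
</div></body>"""

tree = ET.fromstring(snippet)
texts = []
for outer in tree.findall(".//div[@class='class1']"):
    # "./div" is relative to `outer`, so only its direct div children match.
    for inner in outer.findall("./div"):
        texts.append((inner.get("id"), (inner.text or "").strip()))
print(texts)
```

Keeping the outer expression short and doing the rest relative to each match tends to stay readable as the paths get longer.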

You can use starts-with to select every div whose id starts with "element":
for i in tree.xpath("//div[starts-with(@id, 'element')]/text()"):
    print(i.strip())
and this yields
text1
text2
text3

If you want to get all children of an element, I recommend using iter():
for element in tree.iter():
    if element.text and element.text.strip():
        print(element.text.strip())
output:
text1
text2
text3
You can also restrict the iteration to a tag name: tree.iter(tag="div")
