How do I parse with LXML recursively in an elegant way?

For example, consider the following HTML:
<div class="class1">
<div id="element1">
text1
</div>
<div id="element2">
text2
</div>
<div id="element3">
text3
</div>
</div>
What I am trying to achieve is to parse several elements whose attributes are already known.
The way that I am doing it now:
index = len(tree.xpath('//div[@class="class1"]'))
for i in range(0, index):
    print(tree.xpath('//div[@class="class1"][i]/text()'))
But it becomes kinda messy when it comes to longer xpaths.
Is there another way to do this?
Edit:
For example, given
first_elem = tree.xpath('//div[@class="class1"]')[0]
is it possible to do something like:
first_elem.xpath(), which searches in <div id="element1">?
Edit:
I found a weird way to do this in lxml:
for i in tree.xpath('//div[@class="class1"]'):
    str1 = html.tostring(i)
    tree = html.fromstring(str1)
    < do things here >

Your XPath seems to be wrong. When you do
tree.xpath('//div[@class="class1"][i]/text()')
i is not substituted into the string automatically. In any case, you do not need to do it this way: tree.xpath returns a list of all matching elements, so you can simply use the XPath you want (even if it matches more than one element), then iterate over the result and print it. For example, for what you are trying to do:
for i in tree.xpath('//div[@class="class1"]/div/text()'):
    print(i)
This should print the text inside each div within the main div whose class attribute is class1.
You do not even need that. If you know a way to uniquely identify the element (using attributes, indexing, etc.), you can use that directly. For example, to get the text for element1:
for i in tree.xpath('//div[@id="element1"]/text()'):
    print(i)
Also, your XML seems to have a lot of unneeded newlines and whitespace; you can strip them by calling i.strip().
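On the follow-up in the question (searching inside an element that was already selected), here is a minimal sketch. It uses the standard library's xml.etree.ElementTree as a stand-in so that it runs without extra installs; lxml elements expose the same find/findall calls and additionally accept relative expressions through .xpath(). The <body> wrapper and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The markup from the question, wrapped in a hypothetical <body> for illustration.
DOC = """<body>
<div class="class1">
<div id="element1">
text1
</div>
<div id="element2">
text2
</div>
<div id="element3">
text3
</div>
</div>
</body>"""

root = ET.fromstring(DOC)

# Select the container once ...
container = root.find(".//div[@class='class1']")

# ... then run further queries relative to it instead of re-parsing.
texts = [child.text.strip() for child in container.findall("./div")]
first = container.find("./div[@id='element1']").text.strip()

print(texts)
print(first)
```

With lxml the relative query would look like container.xpath('./div/text()'); the leading dot keeps the search inside container, whereas a leading // searches the whole document again.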

You can use starts-with to get the divs whose id starts with "element":
for i in tree.xpath("//div[starts-with(@id, 'element')]/text()"):
    print(i.strip())
and this yields
text1
text2
text3

If you want to visit all descendants of an element, I recommend using iter():
for element in tree.iter():
    print((element.text or "").strip())
output (blank lines omitted):
text1
text2
text3
You can also restrict the iteration to a tag name: tree.iter(tag="div")
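One caveat with iter(): it also yields the element it is called on, and .text is None for elements that have no leading text, so calling .strip() on it directly can raise AttributeError. A hedged sketch with the standard library's xml.etree.ElementTree (lxml's iter() behaves the same way in these two respects); the empty element3 is an assumption added to show the guard.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<div class="class1">'
    '<div id="element1"> text1 </div>'
    '<div id="element2"> text2 </div>'
    '<div id="element3"/>'  # hypothetical empty element: its .text is None
    '</div>'
)

texts = []
for element in root.iter("div"):
    # iter() starts with root itself, which has no id, so it is skipped here.
    if not element.get("id", "").startswith("element"):
        continue
    text = (element.text or "").strip()  # guard against .text being None
    if text:
        texts.append(text)

print(texts)
```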

Related

Python: exclude outer wrapping element when getting content via css/xpath selector

I tried this code to get the HTML content of element div.entry-content:
response.css('div.entry-content').get()
However, it returns the wrapping element too:
<div class="entry-content">
<p>**my content**</p>
<p>more content</p>
</div>
But I want just the contents, so in my case: <p>**my content**</p><p>more content</p>
I also tried an XPath selector, response.xpath('//div[@class="entry-content"]').get(), but with the same result as above.
Based on F.Hoque's answer below I tried:
response.xpath('//article/div[@class="entry-content"]//p/text()').getall() and response.xpath('//article/div[@class="entry-content"]//p').getall()
These, however, return arrays of, respectively, the text content of each p element and all p elements. What I want is the HTML contents (in a single value) of the div.entry-content element, without the wrapping element itself.
I've tried Googling, but can't find anything.

As you said, your main div contains multiple p tags and you want to extract the text node value from those p tags. //p will select all the p tags.
response.xpath('//div[@class="entry-content"]//p').getall()
The following expression joins the results into a single string instead of a list:
p_tags = ''.join([x.get() for x in response.xpath('//article/div[@class="entry-content"]//p')])

Your content is in the <p> tags, not the <div>:
response.css('div.entry-content p').get()
or
response.xpath('//div[@class="entry-content"]/p').get()
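To get the div's inner HTML as one value, the idea is to serialize the container's children rather than the container itself. A minimal sketch, with the standard library's xml.etree.ElementTree standing in for Scrapy's selectors (an assumption made so the example runs without Scrapy); inner_html is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def inner_html(element):
    """Serialize an element's contents without its own start and end tags."""
    # Text that appears before the first child belongs to the parent.
    parts = [element.text or ""]
    # tostring() on a child also includes that child's tail text.
    parts.extend(ET.tostring(child, encoding="unicode") for child in element)
    return "".join(parts)


div = ET.fromstring(
    '<div class="entry-content"><p>**my content**</p><p>more content</p></div>'
)
result = inner_html(div)
print(result)
```

In Scrapy itself, selecting the child nodes and joining them, for example ''.join(response.xpath('//div[@class="entry-content"]/node()').getall()), should give the same kind of result, though that is untested here.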

Not able to extract data using scrapy with class names containing spaces and hyphens

I am new to scrapy and I have to extract text from a tag with multiple class names, where the class names contain spaces and hyphens.
Example:
<div class="info">
<span class="price sale">text1</span>
<span class="title ng-binding">some text</span>
</div>
When I use the code:
response.xpath("//span[contains(@class,'price sale')]/text()").extract()
I am able to get text1, but when I use:
response.xpath("//span[contains(@class,'title ng-binding')]/text()").extract()
I get an empty list. Why is this happening and how do I handle it?

The expression you're looking for is:
//span[contains(@class, 'title') and contains(@class, 'ng-binding')]
I highly suggest XPath Visualizer, which can help you debug XPath expressions easily. It can be found here:
http://xpathvisualizer.codeplex.com/
Or with CSS try
response.css("span.title.ng-binding")
There is also a chance that the element with ng-binding is loaded via JavaScript/Ajax and is therefore not included in the initial server response.

You can replace the spaces with "." in your code when using response.css().
In your case you can try:
response.css("span.title.ng-binding::text").extract()
This code should return the text you are looking for.
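A note on why the CSS form is more robust: contains(@class, 'title ng-binding') is a substring test on the raw attribute value, so it depends on class order and spacing, and contains(@class, 'title') would also match a class such as subtitle. CSS class selectors compare whole whitespace-separated tokens instead. A sketch of that token logic in plain Python, with xml.etree.ElementTree as a stand-in parser; has_classes and the two extra spans are hypothetical and only there to show the difference.

```python
import xml.etree.ElementTree as ET


def has_classes(element, *wanted):
    """Return True if every wanted class is a whole token of the class attribute."""
    tokens = set(element.get("class", "").split())
    return set(wanted) <= tokens


info = ET.fromstring(
    '<div class="info">'
    '<span class="price sale">text1</span>'
    '<span class="ng-binding  title">some text</span>'   # reordered, extra space
    '<span class="subtitle ng-binding">other</span>'     # substring trap
    '</div>'
)

matches = [
    span.text
    for span in info.iter("span")
    if has_classes(span, "title", "ng-binding")
]
print(matches)
```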

Python Selenium xPath select from div class a rel

<div class='into'>
<div class="state " rel="AA" style="width:80px;">AA (1028)</div>
<div class="state " rel="BB" style="width:80px;">BB (307)</div>
</div>
I'd like to select one of the elements, rel="AA" or rel="BB", and click on it. I tried several ways. The most usable idea was:
browser.find_element_by_xpath("//div[@class='into']/[text()='AA']").click()
However, the text is followed by a number that varies.
browser.find_element_by_xpath("//div[@class='into']/[rel='AA']").click()
And this does not work.

Use the following XPath:
browser.find_element_by_xpath(".//div[@class='into']/div[@rel='AA']").click()
You can also use the normalize-space function to ignore the extra spaces in your class name, like below:
browser.find_element_by_xpath(".//div[normalize-space(@class)='state'][@rel='AA']").click()

If you need your XPath to match one of the elements with attribute rel="AA" or rel="BB" (in case one of them might not be present on the page), then try the following:
browser.find_element_by_xpath("//div[@class='into']/div[@rel='AA' or @rel='BB']").click()

If you want to use your example with text(), then you could use either of the following:
browser.find_element_by_xpath("//div[@class='into']/div[contains(text(), 'AA')]").click()
or
browser.find_element_by_xpath("//div[@class='into']/div[starts-with(text(), 'AA')]").click()
Otherwise, use the answer given by @lauda and write @rel to mark it as an attribute.
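The selectors can be checked without a browser. A small sketch that runs the attribute-based path against the markup from the question using xml.etree.ElementTree, which supports this subset of XPath. It does not support contains() or starts-with(), so the text prefix is checked in Python here. The <body> wrapper and the variable names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<body>"
    "<div class='into'>"
    '<div class="state " rel="AA" style="width:80px;">AA (1028)</div>'
    '<div class="state " rel="BB" style="width:80px;">BB (307)</div>'
    "</div>"
    "</body>"
)

# Attribute match: note the @ in front of rel.
by_rel = page.find(".//div[@class='into']/div[@rel='AA']")

# Text-prefix match, done in Python because the count in parentheses varies.
by_text = [
    d for d in page.findall(".//div[@class='into']/div")
    if (d.text or "").startswith("BB")
]

print(by_rel.text)
print([d.get("rel") for d in by_text])
```

Note that recent Selenium 4 releases no longer provide find_element_by_xpath; the equivalent call there is browser.find_element(By.XPATH, "...") with By imported from selenium.webdriver.common.by.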

Extract h1 text from div class with scrapy or selenium

I am using Python along with Scrapy and Selenium. I want to extract the text from the h1 tag which is inside a div with a class.
For example:
<div class = "example">
<h1>
This is an example
</h1>
</div>
This is the code I tried:
for single_event in range(1,length_of_alllinks):
    source_link.append(alllinks[single_event])
    driver.get(alllinks[single_event])
    s = Selector(response)
    temp = s.xpath('//div[@class="example"]//@h1').extract()
    print(temp)
    title.append(temp)
print(title)
Every time I tried a different method I got an empty list.
Now, I want to extract "This is an example", i.e. the h1 text, and store it or append it to a list, i.e. title in my example.
Like:
temp = ['This is an example']

Try the following to extract the intended text:
s.xpath('//div[@class="example"]/h1/text()').extract()

First, it seems that in your HTML the class attribute of the div is "example", but in your code you're looking for other class values. At least for XPath queries, keep in mind that you search by exact attribute value. You can use something like:
s.xpath('//div[contains(@class, "example")]')
to find an element that has the "example" class but may have additional classes. I'm not sure if this is a mistake or if this is your actual code. In addition, the spaces around the '=' sign of the class attribute in your HTML may not be helping some parsers either.
Second, the query you pass to s.xpath seems wrong. Try something like this:
temp = s.xpath('//div[@class="example"]/h1').extract()
It's not clear from your code what s is, so I'm assuming the extract() method does what you think it does. A cleaner code sample would help us help you.
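On the whitespace concern: XML syntax allows spaces around the '=' in an attribute, so a conforming parser should read class = "example" without trouble. A quick check with the standard library's xml.etree.ElementTree (a stand-in for the Scrapy Selector, which is an assumption of this sketch; the <body> wrapper is also added for illustration). Note that the h1 text still carries its surrounding newlines until it is stripped.

```python
import xml.etree.ElementTree as ET

snippet = """<body>
<div class = "example">
<h1>
This is an example
</h1>
</div>
</body>"""

root = ET.fromstring(snippet)

# /h1 selects the element; its text is read from .text and then stripped.
h1 = root.find(".//div[@class='example']/h1")

title = []
title.append(h1.text.strip())
print(title)
```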

PyQuery: Get only text of element, not text of child elements

I have the following HTML:
<h1 class="price">
<span class="strike">$325.00</span>$295.00
</h1>
I'd like to get the $295 out. However, if I simply use PyQuery as follows:
price = pq('h1').text()
I get both prices.
Extracting only direct child text for an element in jQuery looks reasonably complicated - is there a way to do it at all in PyQuery?
Currently I'm extracting the first price separately, then using replace to remove it from the text, which is a bit fiddly.
Thanks for your help.

I don't think there is a clean way to do that. At least I've found this solution:
>>> print doc('h1').html(doc('h1')('span').outerHtml())
<h1 class="price"><span class="strike">$325.00</span></h1>
You can use .text() instead of .outerHtml() if you don't want to keep the span tag.
Removing the first one is much easier:
>>> print doc('h1').remove('span')
<h1 class="price">
$295.00
</h1>
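Another option relies on how lxml, which PyQuery is built on, stores text: text that follows a child element is kept as that child's .tail, not as part of the parent's .text. A minimal sketch using xml.etree.ElementTree, which follows the same text/tail model (used here as a stand-in so the example needs no third-party packages); direct_text is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def direct_text(element):
    """Return only the text that belongs directly to element, skipping child content."""
    # Text before the first child lives in .text ...
    pieces = [element.text or ""]
    # ... and text after each child lives in that child's .tail.
    pieces.extend(child.tail or "" for child in element)
    return "".join(pieces).strip()


h1 = ET.fromstring(
    '<h1 class="price">\n<span class="strike">$325.00</span>$295.00\n</h1>'
)
price = direct_text(h1)
print(price)
```

With PyQuery, the underlying lxml element should be reachable by indexing, for example pq('h1')[0], so the same helper would likely apply there without modifying the document.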
