XPath not returning expected result with lxml - Python

I'm sorry if my question isn't formatted right; English isn't my native language.
I'm trying to get the table from the following URL: Bulbapedia, Bulbasaur. But lxml gives me very weird results when I use XPath.
I've tried the following:
for elem in tree.xpath('//*[@id="mw-content-text"]//table[14]//tr[3]//td//table//tr//td'):
    print(etree.tostring(elem, pretty_print=True))
This doesn't give me the data I need; it returns values from a different table, seemingly at random.
I'm at a loss as to what to try now. cssselect isn't an option either, since the selector seems to change depending on which Pokemon I'm searching for.
I'm trying to get the following results:

Other than the first step, *[@id="mw-content-text"], all the remaining elements in your XPath should be immediate children of the ones before them. By using // you're selecting elements at any depth within the parent, which is not what you want.
Change all but the first // to / and it should work as intended:
for elem in tree.xpath('//*[@id="mw-content-text"]/table[14]/tr[3]/td/table/tr/td'):
    print(etree.tostring(elem, pretty_print=True))
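To make the difference concrete, here is a minimal sketch using a made-up fragment (the id and structure are illustrative, not Bulbapedia's actual markup):
from lxml import html

doc = html.fromstring(
    '<div id="c">'
    '<table><tr><td>direct child</td></tr></table>'
    '<div><table><tr><td>nested deeper</td></tr></table></div>'
    '</div>')

# // matches descendants at any depth; / matches only immediate children
print(len(doc.xpath('//*[@id="c"]//table')))  # 2 - both tables
print(len(doc.xpath('//*[@id="c"]/table')))   # 1 - only the direct child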

Related

XPath doesn't seem to return the expected span

I am trying to use XPath to download all images from a webpage.
I have managed to find the specific element, which has several spans; the full XPaths look as below:
/html/body/div[2]/div[3]/main/ul/li[4]/article/div[1]/a/span/span[1]
/html/body/div[2]/div[3]/main/ul/li[4]/article/div[1]/a/span/span[2]
/html/body/div[2]/div[3]/main/ul/li[4]/article/div[1]/a/span/span[3]
etc.
Currently I've got the whole element down to the "li[4]" level and tried the code below to find all of the leaf elements of the tree, but the returned value is empty:
->node.xpath('./article/div[@class="flex-box"]/a/span[starts-with(@class,"grid-box")]/span')
->[]
And the parent node count is only 1 instead of the number of leaves, which I expected to be at least 4-5 here:
->len(node.xpath('./article/div[@class="flex-box"]/a/span[starts-with(@class,"grid-box")]'))
->1
->node.xpath('./article/div[@class="flex-box"]/a/span[starts-with(@class,"grid-box")]')[0]
-><Element span at 0x1ac51134040>
Could anyone help me figure out what is going on here?

BeautifulSoup, getting more returns than expected with regex

Using BeautifulSoup, I have the following line:
dimensions = SOUP.select(".specs__title > h4", text=re.compile(r'Dimensions'))
However, it's returning more than just the tags whose text is 'Dimensions', as shown in these results:
[<h4>Dimensions</h4>, <h4>Details</h4>, <h4>Warranty / Certifications</h4>]
Am I using the regex incorrectly with the way SOUP works?
The select() interface doesn't have a text keyword. Before we go further, the following assumes you are using BeautifulSoup 4.7+.
If you'd like to filter by text, you might be able to do something like this:
dimensions = SOUP.select(".specs__title > h4:contains(Dimensions)")
More information on the :contains() pseudo-class implementation is available here: https://facelessuser.github.io/soupsieve/selectors/#:contains.
EDIT: To clarify, there is no way to incorporate regex directly into a select call currently. You would have to filter the elements after the fact to use regex. In the future there may be a way to use regex via some custom pseudo-class, but currently there is no such feature available in Soup Sieve (Beautiful Soup's select implementation in 4.7+).
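As a rough sketch of that after-the-fact filtering (assuming SOUP is the parsed document from the question):
import re

# Select broadly first, then filter the matched tags' text with regex
candidates = SOUP.select(".specs__title > h4")
dimensions = [h4 for h4 in candidates if re.search(r'Dimensions', h4.get_text())]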

In Selenium, how do I include a specific node [1] using find_elements_by_css_selector()

In the case that I want the first element with a given class, so I don't have to guess at find_elements_by_xpath(), what are my options? The goal is to write less code, ensuring any changes to the source I am scraping can be fixed easily. Is it possible to essentially write
find_elements_by_css_selector('source[1]')
This code does not work as is though.
I am using Selenium with Python and will likely be using PhantomJS as the webdriver (Firefox for testing).
In CSS selectors, square brackets select attributes, so your sample code is trying to select the 'source' type element with an attribute named 1, e.g.
<source 1="your_element" />
Whereas I gather you're trying to find the first in a list that looks like this:
<source>Blah</source>
<source>Rah</source>
If you just want the first matching element, you can use the singular form:
element = driver.find_element_by_css_selector("source")
The form you were using returns a list, so you can take element n-1 to get the nth instance on the page (lists index from 0):
element = driver.find_elements_by_css_selector("source")[0]
Finally, if you want your CSS selectors to be completely explicit in which element they're finding, you can use the nth-of-type selector:
element = driver.find_element_by_css_selector("source:nth-of-type(1)")
You might find some other helpful information at this blog post from Sauce Labs to help you write flexible selectors to replace your XPath.
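Putting those forms together, a minimal sketch (the URL and the <source> markup are hypothetical, and the method names match the Selenium 3-era API used above):
from selenium import webdriver

driver = webdriver.Firefox()
driver.get("http://example.com/page-with-sources")  # hypothetical page

# Singular form: returns the first match (raises NoSuchElementException if none)
first = driver.find_element_by_css_selector("source")

# Plural form: returns a list; index 0 is the same first match
also_first = driver.find_elements_by_css_selector("source")[0]

# Explicit positional selector
explicit = driver.find_element_by_css_selector("source:nth-of-type(1)")

driver.quit()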

How do I find all rows in one table with BeautifulSoup

I'm trying to write my first parser with BeautifulSoup (BS4) and hitting a conceptual issue, I think. I haven't done much with Python -- I'm much better at PHP.
I can get BeautifulSoup to find the table I want, but when I try to step into the table and find all the rows, I get some variation on:
AttributeError: 'ResultSet' object has no attribute 'attr'
I tried walking through the sample code at How do I draw out specific data from an opened url in Python using urllib2? and got more or less the same error (note: if you want to try it you'll need a working URL.)
Some of what I'm reading says that the issue is that the ResultSet is a list. How would I know that? If I do print type(table) it just tells me <class 'bs4.element.ResultSet'>
I can find text in the table with:
for row in table:
    text = ''.join(row.findAll(text=True))
    print text
but if I try to search for HTML with:
for row in table:
    text = ''.join(row.find_all('tr'))
    print text
It complains: expected string, Tag found. So how do I wrangle this string (which is full of HTML) back into a BeautifulSoup object that I can parse?
BeautifulSoup data types are bizarre, to say the least. A lot of the time they don't give enough information to easily piece together the puzzle. I know your pain! Anyway, on to my answer...
It's hard to provide a completely accurate example without seeing more of your code or knowing the actual site you're attempting to scrape, but I'll do my best.
The problem is your ''.join(). .findAll('tr') returns a list of elements of the BeautifulSoup datatype Tag; that list is how BS hands back the trs it finds. Because of this, you're passing the wrong datatype to ''.join(), which expects strings.
You should code one more iteration. (I'm assuming there are td tags within the trs.)
text_list = []
for row in table:
    table_row = row('tr')            # calling a Tag is shorthand for find_all('tr')
    for table_data in table_row:
        td = table_data('td')
        for td_contents in td:
            content = td_contents.contents[0]
            text_list.append(content)
text = ' '.join(str(x) for x in text_list)
This returns the entire table content as a single string. You can refine the value of text by changing where text_list is initialized and where text = is assembled.
This probably looks like more code than is required, and that might be true, but I've found my scrapes to be much more thorough and accurate when I go about it this way.
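For comparison, a more compact sketch of the same extraction (this assumes table is a single Tag, e.g. the result of soup.find('table'), rather than the ResultSet in the question):
# On a single Tag, find_all('tr') returns the row Tags directly
for row in table.find_all('tr'):
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    print(' '.join(cells))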

Need Python lxml syntax help for parsing HTML

I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with:
The HTML file is fairly well formed (but not perfect). It has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each result row contains a link for the search result detail.
I need to find the middle table with the search result rows (this one I was able to figure out):
self.mySearchTables = self.mySearchTree.findall(".//table")
self.myResultRows = self.mySearchTables[1].findall(".//tr")
I need to find the links contained in this table (this is where I'm getting stuck):
for searchRow in self.myResultRows:
    searchLink = searchRow.findall(".//a")
It doesn't seem to actually locate the link elements.
I need the plain text of the link. I imagine it would be something like searchLink.text if I actually got the link elements in the first place.
Finally, in the actual API reference for lxml, I wasn't able to find information on the find and findall calls. I gleaned these from bits of code I found on Google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?
Okay, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of BeautifulSoup included with lxml. That way you will also reap the benefit of a nice XPath or CSS selector interface.
However, I personally prefer Ian Bicking's HTML parser included in lxml.
Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.
Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend using either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.
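For a quick sense of the difference (tree here is any parsed lxml document; the tag and attribute are only illustrative):
# ElementTree-style calls: simple paths only
first_link = tree.find('.//a')    # first match, or None
all_links = tree.findall('.//a')  # list of all matches

# The full xpath() method also supports predicates and functions
external = tree.xpath('.//a[starts-with(@href, "http")]')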
Here are some examples, with an HTML string parsed like this:
from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)
Using the CSS selector method, your program would look roughly like this:
# Find all 'a' elements inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
The equivalent using the xpath() method would be:
# Find all 'a' elements inside 'tr' table rows with XPath
for a in mySearchTree.xpath('.//tr//a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.
