Reading text in elements using lxml.etree - python

I am using the Python version of the lxml library. I am currently trying to parse the text from a table, but I am running into a problem: some of the text is inside links.
For example, one of the cells may look something like this:
<td>
Can I kick it, <a>to all the people</a> who can quest like a <a>tribe</a> does
</td>
Say that after parsing the HTML, the td element is stored as foo. Then foo.text will not give the whole text, only the parts that aren't links. Moreover, if I collect the link text using [i.text for i in foo.getchildren()], I no longer know the order in which to interleave the non-link text and the link text.
Is there an easy way to get around this?

Well, after searching for an hour, I found the solution within 2 minutes of posting this question.
Use the method foo.text_content() and it will return the full text, links included.
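For illustration, a minimal sketch built on the example cell above (note that text_content() is provided by lxml.html elements; with plain lxml.etree elements, ''.join(foo.itertext()) gives the same result):
import lxml.html

doc = lxml.html.fromstring(
    '<table><tr><td>Can I kick it, <a>to all the people</a> '
    'who can quest like a <a>tribe</a> does</td></tr></table>'
)
foo = doc.xpath('//td')[0]

# text_content() concatenates all text nodes in document order, links included
print(foo.text_content())
# Can I kick it, to all the people who can quest like a tribe does

# equivalent with the plain etree API
print(''.join(foo.itertext()))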

Related

Python Mammoth Strange <a> elements within HTML headings

I just found the Mammoth Python package a couple of days ago and it's a great tool which really creates clean HTML code from a Word doc. It's nearly perfect. There is just one artifact I don't understand: the heading elements (h1-h6) it creates from the Word headings contain several <a> elements with strange TOC ids. It looks like this:
<h1><a id="_Toc48228035"></a><a id="_Toc48288791"></a><a id="_Toc48303673"></a><a id="_Toc48306159"></a><a id="_Toc48308644"></a><a id="_Toc48311128"></a><a id="_Toc48313611"></a>Arteriosklerose</h1>
Does anybody know how to get rid of these?
Thanks in advance
Cheers,
Peter
This is just a guess, but I hope it helps:
TOC most probably stands for "Table of Contents". When you want to jump to an element in a page (like a certain chapter), you give the element an ID and append #ID to your URL; that way the browser scrolls directly to that point.
I guess your document has a table of contents with links in it somehow, and if you inspect those links you will find something like a link with the text Arteriosklerose pointing at one of these ids.
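If the goal is simply to strip those anchors after conversion, a small post-processing pass should work. A minimal sketch with BeautifulSoup, assuming the bookmarks always carry the _Toc prefix shown in the question:
from bs4 import BeautifulSoup

html = '<h1><a id="_Toc48228035"></a><a id="_Toc48288791"></a>Arteriosklerose</h1>'
soup = BeautifulSoup(html, 'html.parser')

# remove every empty <a> whose id looks like a Word TOC bookmark
for a in soup.find_all('a', id=lambda v: v and v.startswith('_Toc')):
    if not a.get_text(strip=True):
        a.decompose()

print(soup)
# <h1>Arteriosklerose</h1>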

Webscraping with selenium, beautiful soup, python - trouble finding specific text

I am quite new to Python and web scraping, and I am trying to pull the following text ($1.74), and all the other relevant odds on the page, from a website:
[Screenshot: HTML text that I am trying to pull]
For similar situations previously I have been successful by using a for loop inside another for loop, but on those occasions I was searching by 'class'. I cannot search by class here, as there are a lot of other td elements that have the same class and are not the odds that I want. Here I would like to (and I am not sure if it is possible) search via 'data-bettype'. The reason I am trying to search via that attribute alone, and not 'data-compid data-bettype', is that when I print out the full HTML in Python, it looks like so:
[Screenshot: HTML printed to Python]
The relevant part of my code here is:
soup_playup = BeautifulSoup(source_playup, 'lxml')
#print(soup_playup.prettify())
for odds_a in soup_playup.find_all('td', {'data-bettype', 'Awin'}):
    for odds in odds_a.find_all('div'):
        print(odds.text)
I am not receiving any errors when I run this code, but it seems as though it just will not find the text.
The correct format for looking up attributes is a dictionary of key-value pairs like so:
soup_playup.find_all('td',attrs={'data-bettype':'Awin'})
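A short, self-contained sketch of the difference (the markup below is a hypothetical stand-in for the real page): the {'data-bettype','Awin'} in the question is a Python set, which BeautifulSoup does not treat as an attribute/value pair, so it matches nothing, while the attrs dict matches as intended:
from bs4 import BeautifulSoup

html = '<table><tr><td data-bettype="Awin"><div>$1.74</div></td></tr></table>'
soup = BeautifulSoup(html, 'lxml')

# attrs maps attribute name to expected value
for odds_a in soup.find_all('td', attrs={'data-bettype': 'Awin'}):
    for odds in odds_a.find_all('div'):
        print(odds.text)  # $1.74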

parsing chat logs in python, currently using BeautifulSoup

I am having some issues parsing an IM chat log using Python 2.7. I am currently using BeautifulSoup.get_text. This generally works, but sometimes masks interesting stuff. For instance:
<font color="#A82F2F"><font size="2">(3/11/2016 3:11:57 PM)</font> <b>user name:</b></font> <html xmlns='http://jabber.org/protocol/xhtml-im'><body xmlns='http://www.w3.org/1999/xhtml'><p>Have you posted the key to https://___.edu/sshkeys/?</p></body></html><br/>
In this case, I get the "Have you posted the key to" part, but it strips out the https:________ part.
Most, but not all, of the lines are formatted the same, i.e. date, time, user, interesting stuff.
Is there a better way to parse this to get the text AND all the interesting stuff?
You can utilize find_all:
for anchor in soup.find_all('a', href=True):
    print("The anchor url={} text={}".format(anchor['href'], anchor.get_text()))
Depending on how you want to output this information, you'd have to get more or less clever.
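As a sketch combining plain text and links (built on the sample line from the question, with the nested <html> wrapper omitted for brevity; how that wrapper is handled varies by parser):
from bs4 import BeautifulSoup

record = ('<font color="#A82F2F"><font size="2">(3/11/2016 3:11:57 PM)</font> '
          '<b>user name:</b></font> '
          '<body><p>Have you posted the key to https://___.edu/sshkeys/?</p></body><br/>')

soup = BeautifulSoup(record, 'html.parser')

# get_text() keeps URLs that appear as plain text inside <p>
print(soup.get_text())

# and find_all('a') catches anything that is a real link
for anchor in soup.find_all('a', href=True):
    print('url={} text={}'.format(anchor['href'], anchor.get_text()))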

Python lxml XPath with deep nesting with specific search

The xpath for text I wish to extract is reliably located deep in the tree at
...table/tbody/tr[4]/td[2]
Specifically, td[2] is structured like so
<td class="val">xyz</td>
I am trying to extract the text "xyz", but a broad search returns multiple results. For example the following path returns 10 elements.
xpath('//td[@class="val"]')
... while a specific search doesn't return any elements. I am unsure why the following returns nothing.
xpath('//tbody/tr/td[@class="val"]')
One solution involves:
table = root.xpath('//table[@class="123"]')
# going down the tree
xyz = table[0][3][1]
print xyz.text
However, I am pretty sure this is extremely brittle. I would appreciate it if someone could tell me how to construct an XPath search that is both robust and relatively cheap on resources.
You haven't mentioned it explicitly, but if your target table and td tag classes are reliable then you could do something like:
//table[@class="123"]/descendant::td[@class="val"]
That way you sidestep the issue of whether tbody is there or not.
However, there's no substitute for actually seeing the material you are trying to parse for recommending XPATH queries...
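For what it's worth, here is a self-contained check of that query with lxml, using the class names from the question; lxml's HTML parser, unlike a browser, does not insert <tbody> on its own, which is exactly why the descendant axis is safer here:
from lxml import etree

root = etree.HTML('<table class="123"><tr><td>bad</td>'
                  '<td class="val">xyz</td></tr></table>')

# matches whether or not a <tbody> layer sits between table and td
print(root.xpath('//table[@class="123"]/descendant::td[@class="val"]')[0].text)
# xyz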
...table/tbody/tr[4]/td[2]
I guess you found this XPath via a tool like Firebug. One thing to note about tools like Firebug (or other inspect tools within browsers) is that they use the DOM tree generated by the browser itself, and most (if not all) HTML parsers in browsers try hard to make the parsed HTML valid. This often requires adding various tags the standard dictates.
<tbody> is one of these tags. <tr> tags are only allowed as children of <thead>, <tbody> or <tfoot> tags. Unfortunately, in my experience, you will rarely see one of these tags inside a <table> in the actual source, but a browser will add the necessary tags while parsing to make the HTML valid, since the standard requires it.
To cut this story short, there is probably no <tbody> tag in your actual source. That is why your XPath returns nothing.
As for generating XPath queries, this depends highly on the particular page/XML. In general, positional queries such as td[4] should be the last resort, since they tend to break as soon as something is added before them. You should inspect the markup carefully and try to come up with queries that use attributes like id or class, since they add specificity more reliably than positional ones. But in the end, it all boils down to the specifics of the page in question.
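To make the tbody mismatch concrete, a minimal sketch (assuming, as above, that the real source lacks the tag):
from lxml import etree

# typical real-world source: no <tbody>; browsers add it only in their own DOM
root = etree.HTML('<table><tr><td class="val">xyz</td></tr></table>')

print(root.xpath('//tbody/tr/td[@class="val"]'))       # [] - no tbody in source
print(root.xpath('//table//td[@class="val"]/text()'))  # ['xyz']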
This seems to be working:
from lxml import etree
doc = etree.HTML('<html><body><table><tbody><tr><td>bad</td><td class="val">xyz</td></tr></tbody></table></body></html>')
print doc.xpath('//tbody/tr/td[@class="val"]')[0].text
output:
xyz
So what is your problem?

Parsing HTML with XPath, Python and Scrapy

I am writing a Scrapy program to extract the data.
This is the url, and I want to scrape 20111028013117 (code) information. I have taken XPath from FireFox add-on XPather. This is the path:
/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]
While I am trying to execute this
try:
    temp_list = hxs.select("/html/body/p/table/tbody/tr/td/table[2]/tbody/tr[1]/td/table[3]/tbody/tr/td[2]/table[1]/tbody/tr/td/table/tbody/tr/td[2]/table[3]/tbody/tr/td/table/tbody/tr[2]/td[2]").extract()
    print "temp_list:" + str(temp_list)
except:
    print "error"
It returns an empty list, and I have been struggling to find an answer for the last 4 hours. I am a newbie to Scrapy; even though I have handled issues well in other projects, this one seems to be a bit difficult.
The reason your XPath doesn't work is the tbody elements. You have to remove them and check whether you get the result that you want.
You can read about this in the Scrapy documentation: http://doc.scrapy.org/en/0.14/topics/firefox.html
Firefox, in particular, is known for adding <tbody> elements to tables. Scrapy, on the other hand, does not modify the original page HTML, so you won't be able to extract any data if you use <tbody> in your XPath expressions.
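Under that assumption, the question's select call with every /tbody hop removed would look like this (hxs being the selector from the question, untested against the live page):
temp_list = hxs.select(
    "/html/body/p/table/tr/td/table[2]/tr[1]/td/table[3]/tr/td[2]"
    "/table[1]/tr/td/table/tr/td[2]/table[3]/tr/td/table/tr[2]/td[2]"
).extract()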
I see that the element you are hunting for is inside a <table>.
Firefox adds a tbody tag to every table, even if it does not exist in the source HTML code.
That might be the reason your XPath query works in the browser but fails in Scrapy.
As suggested, use other anchors in your xpath query.
You can extract data with more ease using more robust XPaths instead of taking the direct output from XPather.
For the data you are matching, this XPath would do a lot better:
//font[contains(text(),'Code')]/parent::td/following-sibling::td/font/text()
This will match the <font> tag containing "Code", then go up to its parent td and select the next td -> font, which contains the code you are looking for.
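Dropped into the question's code, that would be something like (hxs being the selector from the question; untested against the live page):
code_list = hxs.select(
    "//font[contains(text(),'Code')]/parent::td"
    "/following-sibling::td/font/text()"
).extract()
print "code_list:" + str(code_list)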
Have you tried removing a few node tags from the end of the query and re-running until you get a result? Do this until you get something, then add the pieces back in cautiously until the query works.
Also, check that your target page validates as XHTML - an invalid page would probably upset the parser.
