How to best extract the following content from an HTML string in Python?

Assuming I have the following string with line breaks:
<table>
<tr>
<td valign="top">House Exterior:</td><td>Round</td>
</tr>
<tr>
<td>EF</td><td>House AB</td></tr>
<tr>
<td valign="top">Settlement Date:</td>
<td valign="top">2/3/2013</td>
</tr>
</table>
What is the best way to create a simple Python dictionary from this?
I want to extract the Settlement Date into a dict or some kind of regex match. What is the best way to do this?
NOTE: A sample using some utility is fine, but I am looking for a better way than having a variable that contains text like this and going through a lot of .next.next.next.next.next until I finally get to the settlement date, which is why I posted this question in the first place.

If the data is highly regular, then a regex isn't a bad choice. Here's a straightforward approach:
import re
# data holds the HTML string from the question
regex = re.compile(r'>Settlement Date:</td>[^>]*>([^<]*)')
match = regex.search(data)
print(match.group(1))  # 2/3/2013
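If a dictionary of every label/value pair is wanted rather than a single match, the same regex idea can be extended. The following is a minimal Python 3 sketch, assuming each row holds exactly two plain-text <td> cells as in the sample above. The names data, pair_re and fields are illustrative, and anything less regular is probably better handled by a real HTML parser.

```python
import re

# The sample table from the question.
data = """<table>
<tr>
<td valign="top">House Exterior:</td><td>Round</td>
</tr>
<tr>
<td>EF</td><td>House AB</td></tr>
<tr>
<td valign="top">Settlement Date:</td>
<td valign="top">2/3/2013</td>
</tr>
</table>"""

# Two adjacent <td> cells: the first is the label, the second the value.
pair_re = re.compile(r'<td[^>]*>([^<]*)</td>\s*<td[^>]*>([^<]*)</td>')

# Strip surrounding whitespace and the trailing colon from each label.
fields = {label.strip().rstrip(':'): value.strip()
          for label, value in pair_re.findall(data)}

print(fields['Settlement Date'])
```

Looking a value up by label (fields['Settlement Date']) avoids depending on where the row sits in the table.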

Related

Scrape a table with dynamically changing rows using Python, Selenium and XPath

I am trying to scrape the "SIRET" row from a table using Selenium and XPath in Python.
I have tried different XPaths, but none of them worked.
One problem is that the position of the class="reportRow" elements changes dynamically, so the row can't be scraped by its position number.
Can the "SIRET" row and the values of its <td> child elements be scraped by the "SIRET" text, or in some other way?
These are the manual steps I follow when I access the site:
The site contains only the root domain.
After I log in to the site, I enter a search criterion, which opens a page where I have to click a link that opens a popup window with a table.
The table contains 4 rows and 8 columns; the first row contains the column names, and the other 3 rows contain data like the "SIRET" one.
The position of those 3 rows changes regularly, depending on the data received from a specific server.
That is why I want to scrape that row and its values by the "SIRET" text.
My final scraped data should look like this: SIRET 646 90 0.2% $2.94 1.03 0.07 4.52.
Thank you very much for your input.
<div class="table_container">
<table>
<tbody>
<tr class="reportHead">.....</tr></tbody>
<tbody>
<tr class="reportRow ">....</tr>
<tr class="reportRow ">....</tr>
<tr class="reportRow ">
<td data-actual="SIRET" class="reportKeyword">SIRET</td>
<td class="td2">646</td>
<td class="td1">90</td>
<td class="rcr">0.2%</td>
<td class="td1">$2.94</td>
<td class="td1">1.03</td>
<td class="td1">0.07</td>
<td class="td1 rctl">4.52</td>
</tr>
</tbody>
<tfoot style="display: none;">....</tfoot>
</table>
You can use an XPath like this:
SIRET = driver.find_element_by_xpath("//td[@data-actual='SIRET']")
Then you can read its .text attribute to get the text.
If the data changes dynamically, then you have to use:
SIRET = driver.find_element_by_xpath("//td[@class='reportKeyword']")
If I have understood the question correctly, you are trying to get the string "SIRET" from the <td> node, which changes dynamically. To do that you can use the following line of code:
print(driver.find_element_by_xpath("//td[@class='reportKeyword']").get_attribute("innerHTML"))
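Strange. As a matter of fact, the solution is not that intricate. Select every cell of the row that contains the SIRET cell:
cells = driver.find_elements_by_xpath("//td[@data-actual='SIRET']/../td")
print(" ".join(cell.text for cell in cells))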
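The row-by-keyword idea can be tried offline without a browser. The sketch below uses the standard library's ElementTree rather than Selenium, on a trimmed, well-formed copy of the table; the first data row (FOO) is made-up placeholder data. It finds the cell whose data-actual attribute is SIRET, steps up to its parent row with "..", and joins the text of every cell in that row.

```python
import xml.etree.ElementTree as ET

# Trimmed, well-formed copy of the table; the FOO row is placeholder data.
html = """<table>
<tbody>
<tr class="reportRow">
<td data-actual="FOO" class="reportKeyword">FOO</td>
<td class="td2">1</td>
</tr>
<tr class="reportRow">
<td data-actual="SIRET" class="reportKeyword">SIRET</td>
<td class="td2">646</td>
<td class="td1">90</td>
<td class="rcr">0.2%</td>
<td class="td1">$2.94</td>
<td class="td1">1.03</td>
<td class="td1">0.07</td>
<td class="td1 rctl">4.52</td>
</tr>
</tbody>
</table>"""

root = ET.fromstring(html)

# Locate the keyword cell by attribute, then step up to its parent <tr>.
row = root.find(".//td[@data-actual='SIRET']/..")

# Collect the text of every cell in that row, wherever the row sits.
values = [td.text for td in row.findall("td")]
line = " ".join(values)
print(line)
```

Because the lookup keys on the data-actual attribute, the result does not depend on the row's position in the table.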

YQL xpath not robust enough

Previously I was using XPath in Python, and it was robust enough to extract data from a webpage. Now I need to use YQL for the same webpage, but it is not robust enough.
What I want to get is:
1. Last (AUD)
2. Close
3. Close (%)
4. Cumulative Volume
from https://www.shareinvestor.com/fundamental/factsheet.html?counter=TPM.AX
The XPaths I use in Python are as below:
xpath('//td[contains(., "Last")]/strong/text()')
xpath('//td[contains(., "Change")]/strong/text()')[0]
xpath('//td[contains(., "Change (%)")]/strong/text()')
xpath('//td[contains(., "Cumulative Volume")]/following-sibling::td[1]/text()')
Part of the HTML is here:
<tr>
<td rowspan="2" class="sic_lastdone">Last (AUD): <strong>6.750</strong></td>
<td class="sic_change">Change: <strong>-0.080</strong></td>
<td>High: <strong>6.920</strong></td>
<td rowspan="2" class="sic_remarks">
Remarks: <strong>-</strong>
</td>
</tr>
<tr>
<td class="sic_change">Change (%): <strong>-1.17</strong></td>
<td>Low: <strong>6.700</strong></td>
</tr>
<tr>
<tr>
<td>Cumulative Volume (share)</td>
<td class='sic_volume'>3,100,209</td>
<td>Cumulative Value</td>
<td class='sic_value'></td>
</tr>
But when I tried to apply them in YQL, they did not work. The only query that works is:
select * from html where
url="https://www.shareinvestor.com/fundamental/factsheet.html?counter=TPM.AX"
and xpath="//td/strong"
It gets a lot of data. I want specific data, and the query needs to be robust, so that it keeps working when the webpage changes. How do I write a YQL XPath that is robust?
You should probably avoid building your XPaths from visible text.
I always build XPaths from tag attributes, since they usually do not change. That makes the XPath result unique and immune to changes in the visible text of the HTML.
For example, the XPath for the "Last (AUD):" value:
//td[@class="sic_lastdone"]/strong/text()
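To see how attribute-based selection behaves on the fragment from the question, here is a small offline Python sketch using the standard library's ElementTree (not YQL, whose XPath support may differ) on a cleaned, well-formed copy of the HTML. One thing it shows is that class="sic_change" appears on both the Change and the Change (%) cells, so that class alone is not unique and the two values have to be told apart by document order.

```python
import xml.etree.ElementTree as ET

# Cleaned, well-formed copy of the fragment from the question.
html = """<table>
<tr>
<td rowspan="2" class="sic_lastdone">Last (AUD): <strong>6.750</strong></td>
<td class="sic_change">Change: <strong>-0.080</strong></td>
<td>High: <strong>6.920</strong></td>
</tr>
<tr>
<td class="sic_change">Change (%): <strong>-1.17</strong></td>
<td>Low: <strong>6.700</strong></td>
</tr>
<tr>
<td>Cumulative Volume (share)</td>
<td class="sic_volume">3,100,209</td>
</tr>
</table>"""

root = ET.fromstring(html)

# A class that occurs once identifies its cell directly.
last = root.find(".//td[@class='sic_lastdone']/strong").text
volume = root.find(".//td[@class='sic_volume']").text

# sic_change occurs twice, so rely on document order: Change, then Change (%).
change, change_pct = [s.text for s in
                      root.findall(".//td[@class='sic_change']/strong")]

print(last, change, change_pct, volume)
```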

Get text of td following the second occurrence of an element in Selenium using Python

I'm trying to find the text after a remarks field in a form. However, the table has multiple remarks fields. I want to be able to grab the text in the td that follows the td of the second remarks field. I have the following html:
<table>
<tbody>
<tr>...</tr>
<tr>...</tr>
<tr>
<td>Remarks:</td>
<td>this is the first remarks field</td>
</tr>
<tr>
<td>AnotherField:</td>
<td>Content of that field</td>
</tr>
<tr>
<td>Remarks:</td>
<td>this is the second remarks field</td>
</tr>
<tr>...</tr>
</tbody>
</table>
To grab the text out of the first remarks field, I can do the following:
ret = driver.find_element_by_xpath("//td[contains(text(),'Remarks')]/following::td")
print ret.text
However, I need to grab the content of the second remarks field. This has to be done based on the index of the occurrence of 'Remarks', not based on the position of the row in the table. I've wanted to try things like this:
ret = self.driver.find_element_by_xpath("//td[contains(text(),'Remarks')][1]/following::td")
or:
rets = self.driver.find_elements_by_xpath("//td[contains(text(),'Remarks')]")[1]
ret = elements.find_element_by_xpath("/following::td")
Understandably, these do not work. Is there a way of doing this? Using a command along the lines of 'the field after the nth occurrence of Remarks' is what I'm looking for.
P.S. This will have to be done using XPath. The reason is that I'm trying to convert a coworker's code to Selenium from another application that has everything built around XPath.
I'm using Selenium-2.44.0 and Python 2.7.
Indexing starts from 1 in XPath:
(//td[contains(., 'Remarks')]/following-sibling::td)[2]
Or, you can use find_elements_by_xpath() and get the second item:
elements = self.driver.find_elements_by_xpath("//td[contains(., 'Remarks')]/following-sibling::td")
print elements[1].text
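The off-by-one between the two approaches is easy to trip over: XPath positions start at 1, while Python list indexes start at 0. The sketch below reproduces the "value after the nth Remarks label" logic offline with the standard library's ElementTree instead of Selenium, on a well-formed copy of the table. The helper name value_after_nth_label is hypothetical.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the table from the question.
html = """<table>
<tbody>
<tr><td>Remarks:</td><td>this is the first remarks field</td></tr>
<tr><td>AnotherField:</td><td>Content of that field</td></tr>
<tr><td>Remarks:</td><td>this is the second remarks field</td></tr>
</tbody>
</table>"""


def value_after_nth_label(root, label, n):
    """Return the text of the cell after the nth (1-based) cell containing label."""
    matches = []
    for tr in root.iter("tr"):
        cells = tr.findall("td")
        # Pair each cell with its right-hand neighbour in the same row.
        for cell, following in zip(cells, cells[1:]):
            if label in (cell.text or ""):
                matches.append(following.text)
    # Convert the 1-based position to a 0-based list index.
    return matches[n - 1]


root = ET.fromstring(html)
first = value_after_nth_label(root, "Remarks", 1)
second = value_after_nth_label(root, "Remarks", 2)
print(second)
```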

How to loop through a html-table-dataset in Python

I'm a first-time poster here trying to pick up some Python skills; please be kind to me :-)
While I'm not a complete stranger to programming concepts (I've been messing around with PHP before), the transition to Python has turned out to be somewhat difficult for me. I guess this mostly has to do with the fact that I lack most - if not all - basic understanding of common "design patterns" (?) and such.
That said, this is the problem. Part of my current project involves writing a simple scraper using Beautiful Soup. The data to be processed has a structure somewhat similar to the one laid out below.
<table>
<tr>
<td class="date">2011-01-01</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
<tr>
<td class="date">2011-01-02</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
<tr class="item">
<td class="headline">Headline</td>
<td class="link">Link</td>
</tr>
</table>
The main issue is that I simply can't get my head around how to 1) keep track of the current date (tr->td class="date") while 2) looping over the items in the subsequent tr:s (tr class="item"->td class="headline" and tr class="item"->td class="link") and 3) store the processed data in an array.
Additionally, all data will be inserted into a database where each entry must contain the following information:
date
headline
link
Note that crud:ing the database is not part of the problem, I only mentioned this in order to better illustrate what I'm trying to accomplish here :-)
Now, there are many different ways to skin a cat. So while a solution to the issue at hand is indeed very welcome, I'd be extremely grateful if someone would care to elaborate on the actual logic and strategy you would make use of in order to "attack" this kind of problem :-)
Last but not least, sorry for such a noobish question.
The basic problem is that this table is marked up for looks, not for semantic structure. Properly done, each date and its related items should share a parent. Unfortunately, they don't, so we'll have to make do.
The basic strategy is to iterate through each row in the table:
If the first table cell has class 'date', we get the date value and update last_seen_date.
Otherwise, we extract a headline and a link, then save (last_seen_date, headline, link) to the database.
import BeautifulSoup  # BeautifulSoup 3
fname = r'c:\mydir\beautifulSoup.html'
soup = BeautifulSoup.BeautifulSoup(open(fname, 'r'))
items = []
last_seen_date = None
for el in soup.findAll('tr'):
    daterow = el.find('td', {'class': 'date'})
    if daterow is None:  # not a date - get headline and link
        headline = el.find('td', {'class': 'headline'}).text
        link = el.find('a').get('href')  # assumes the link cell wraps an <a> tag
        items.append((last_seen_date, headline, link))
    else:  # get new date
        last_seen_date = daterow.text
You can use ElementTree, which is included in the Python standard library.
http://docs.python.org/library/xml.etree.elementtree.html
from xml.etree.ElementTree import ElementTree
tree = ElementTree()
tree.parse('page.xhtml')  # This is the XHTML provided in the OP
root = tree.getroot()  # Returns the top-level "table" element
print(root.tag)  # "table"
for eachTableRow in root:
    # Iterating over root yields each of its <tr> children,
    # so we loop over them and check their attributes
    if 'class' in eachTableRow.attrib:
        # Good to go. Now we know to look for the headline and link
        pass
    else:
        # Okay, so look for the date
        pass
That should be enough to get you on your way to parsing this.
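Filling in the two branches, a complete pass over the table might look like the following sketch. It parses a shortened copy of the question's table from a string rather than from page.xhtml, with numbered placeholder headlines and links so the output is easier to follow; the variable names are illustrative. It assumes the markup is well-formed XML, which real-world HTML often is not.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the table; headline and link texts are placeholders.
xhtml = """<table>
<tr><td class="date">2011-01-01</td></tr>
<tr class="item"><td class="headline">Headline 1</td><td class="link">Link 1</td></tr>
<tr class="item"><td class="headline">Headline 2</td><td class="link">Link 2</td></tr>
<tr><td class="date">2011-01-02</td></tr>
<tr class="item"><td class="headline">Headline 3</td><td class="link">Link 3</td></tr>
</table>"""

root = ET.fromstring(xhtml)

items = []
last_seen_date = None
for row in root:
    if row.get("class") == "item":
        # Item row: attach it to the most recently seen date.
        headline = row.find("td[@class='headline']").text
        link = row.find("td[@class='link']").text
        items.append((last_seen_date, headline, link))
    else:
        # Date row: remember the date for the items that follow.
        last_seen_date = row.find("td[@class='date']").text

for item in items:
    print(item)
```

Each tuple in items carries the date, headline and link that one database entry needs.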

BeautifulSoup or regex HTML table to data structure?

I've got an HTML table that I'm trying to parse the information from. However, some of the tables span multiple rows/columns, so what I would like to do is use something like BeautifulSoup to parse the table into some type of Python structure. I'm thinking of just using a list of lists so I would turn something like
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
into
[['1,1', '1,2'],
['2,1', '2,2']]
Which I (think) should be fairly straightforward. However, there are some slight complications because some of the cells span multiple rows/cols. Plus there's a lot of completely unnecessary information:
<td ondblclick="DoAdd('/student_center/sc_all_rooms/d05/09/2010/editformnew?display=W&style=L&positioning=A&adddirect=yes&accessid=CreateNewEdit&filterblock=N&popeditform=yes&returncalendar=student_center/sc_all_rooms')"
class="listdefaultmonthbg"
style="cursor:crosshair;"
width="5%"
nowrap="1"
rowspan="1">
<a class="listdatelink"
href="/student_center/sc_all_rooms/d05/09/2010/edit?style=L&display=W&positioning=A&filterblock=N&adddirect=yes&accessid=CreateNewEdit">Sep 5</a>
</td>
And what the code really looks like is even worse. All I really need out of there is:
<td rowspan="1">Sep 5</td>
Two rows later, there is a <td> with a rowspan of 17. For multi-row spans I was thinking something like this:
<tr>
<td rowspan="2">Sep 5</td>
<td>Some event</td>
</tr>
<tr>
<td>Some other event</td>
</tr>
would end out like this:
[["Sep 5", "Some event"],
[None, "Some other event"]]
There are multiple tables on the page, and I can find the one I want already; I'm just not sure how to parse out the information I need. I know I can use BeautifulSoup to "RenderContents", but in some cases there are link tags that I need to get rid of (while keeping the text).
I was thinking of a process something like this:
Find table
Count rows in tables (len(table.findAll('tr'))?)
Create list
Parse table into list (BeautifulSoup syntax???)
???
Profit! (Well, it's a purely internal program, so not really... )
There was a recent discussion in the Python group on LinkedIn about a similar issue, and apparently lxml is the most recommended Pythonic parser for HTML pages.
http://www.linkedin.com/groupItem?view=&gid=25827&type=member&item=27735259&qid=d2948a0e-6c0c-4256-851b-5e7007859553&goback=.gmp_25827
You'll probably need to identify the table with some attrs, id or name.
from BeautifulSoup import BeautifulSoup  # BeautifulSoup 3
data = """
<table>
<tr>
<td>1,1</td>
<td>1,2</td>
</tr>
<tr>
<td>2,1</td>
<td>2,2</td>
</tr>
</table>
"""
soup = BeautifulSoup(data)
for t in soup.findAll('table'):
    for tr in t.findAll('tr'):
        # One list per row, holding the contents of each cell
        print([td.contents for td in tr.findAll('td')])
Edit: What should the program do if there are multiple links?
Ex:
<td>A B C</td>
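For the rowspan case raised in the question, one possible approach is to remember, per column, how many further rows an earlier cell still covers, and to emit None for those positions. This is only a sketch under assumptions: it uses the standard library's ElementTree, so the input must be well-formed (the real page, with unescaped & characters in attributes, would need an HTML parser such as BeautifulSoup or lxml first); it ignores colspan; and table_to_rows is a hypothetical helper name. Joining itertext() keeps the text inside link tags while dropping the tags themselves.

```python
import xml.etree.ElementTree as ET

# Well-formed test table; the href value is a placeholder.
html = """<table>
<tr>
<td rowspan="2"><a class="listdatelink" href="/edit">Sep 5</a></td>
<td>Some event</td>
</tr>
<tr>
<td>Some other event</td>
</tr>
</table>"""


def table_to_rows(table):
    """Convert a <table> element to a list of lists, padding rowspans with None."""
    rows = []
    pending = {}  # column index -> rows still covered by an earlier rowspan
    for tr in table.iter("tr"):
        row = []
        cells = iter(tr.findall("td"))
        col = 0
        while True:
            if pending.get(col, 0) > 0:
                # This column is occupied by a cell from an earlier row.
                row.append(None)
                pending[col] -= 1
                col += 1
                continue
            td = next(cells, None)
            if td is None:
                break
            # itertext() yields the text of nested tags such as <a>.
            row.append("".join(td.itertext()).strip())
            span = int(td.get("rowspan", 1))
            if span > 1:
                pending[col] = span - 1
            col += 1
        rows.append(row)
    return rows


rows = table_to_rows(ET.fromstring(html))
print(rows)
```

With several links in one cell, the same join would concatenate their texts in document order; whether that is the desired result depends on the answer to the question in the edit above.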
