HTML Parsing: Get elements between two elements? - python

I'm parsing with lxml on Python 2.7
I have some html that looks like this:
<tr height="45" valign="bottom">
<td colspan="2" class="DATE">Wednesday, Aug 5 2015 </td>
</tr>
<tr>
<td/>
</tr>
<tr>
<td> </td>
<td/>
</tr>
<tr>
<td/>
<td> - No Calendar Matters Currently Set<br/></td>
</tr>
<tr height="45" valign="bottom">
<td colspan="2" class="DATE">Thursday, Aug 6 2015 </td>
</tr>
Is there any way for me to get a list of all td element objects in between the two elements of class="DATE"?
Basically, I need all the info associated with, say Aug 5, but since the other elements before the next date aren't children I'm struggling to figure out how to get them.

Write it the way you described it: select every td that has a td[@class="DATE"] both after it and before it in the document.
//td[following::td[@class="DATE"] and preceding::td[@class="DATE"]]
This set will not contain the td elements with @class="DATE" themselves.
To include them, use this XPath:
//td[(following::td[@class="DATE"] and preceding::td[@class="DATE"]) or @class="DATE"]
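When the page lists more than two dates, the XPath above returns everything between the first and the last DATE cell as one flat list. A minimal sketch of an alternative is to walk the td elements in document order and start a new group at each DATE cell. It uses the standard library's xml.etree.ElementTree only so that it runs without extra installs; lxml elements expose the same iter() method, so the function should work on an lxml tree as well. The names SAMPLE and group_cells_by_date are made up for this illustration, and the sample markup is a trimmed copy of the question's HTML wrapped in a table element.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the question's markup, wrapped in <table> so it has one root.
SAMPLE = """<table>
<tr height="45" valign="bottom"><td colspan="2" class="DATE">Wednesday, Aug 5 2015 </td></tr>
<tr><td/></tr>
<tr><td/><td> - No Calendar Matters Currently Set<br/></td></tr>
<tr height="45" valign="bottom"><td colspan="2" class="DATE">Thursday, Aug 6 2015 </td></tr>
</table>"""


def group_cells_by_date(root):
    """Map each DATE cell's text to the td elements that follow it."""
    groups = {}
    current = None
    for td in root.iter("td"):  # iter() yields elements in document order
        if td.get("class") == "DATE":
            current = (td.text or "").strip()
            groups[current] = []
        elif current is not None:
            groups[current].append(td)
    return groups


groups = group_cells_by_date(ET.fromstring(SAMPLE))
# Flatten each cell to its text so the grouping is easy to inspect.
texts = {
    date: ["".join(td.itertext()).strip() for td in cells]
    for date, cells in groups.items()
}
print(texts)
```

Each key is a date heading and each value holds the cells up to the next heading, so "all the info associated with Aug 5" is one dictionary lookup.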

Related

Python Beautifulsoup traverse a table with particular text content in innerHTML then get contents until before a particular element

I have an html with a lots of table to traverse to like below:
<html>
.. omitted parts since I am interested on the HTML table..
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="labeltitle">
<tbody>
<tr>
<td class="labeltitle">
<font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
</td>
<td class="labelplain"> </td>
</tr>
</tbody>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
... omitted just to show the td that I am interested to scrape ...
<td class="labelplain"> Senator(s)</td>
<td class="labelplain">
<table>
<tbody>
<tr>
<td class="labelplain">VILLAR JR., MANUEL B.<br></td>
</tr>
</tbody>
</table>
</td>
...
<table>
<table>
... More tables like the table above (the one with VILLAR Jr.)
</table>
<table>
<tbody>
<tr>
<td class="labeltitle">
<table>
<tbody>
<tr>
<td class="labeltitle"> <font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
<td class="labelplain"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... more tables
</html>
The table I want to start from is the one containing a td with class "labeltitle" whose child "font" element has the text "Floor Activity". For every table below it, I want to get the HTML code, stopping before the table that has a td with class="labeltitle" whose child "font" has the text "Vote(s)". I am trying XPath like so:
table = dom.xpath("//table[8]/tbody/tr/td")
print (table)
but to no avail: I am getting empty lists. Any approach would do (with or without XPath).
I also tried the following:
rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')
With this I can reach the table containing "Floor Activity", but the code above only gives me the content of that particular parent table. The exact output I am getting is below:
<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor
Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%">
</td>
</tr>
</table></td>
</tr>
I am also trying the approach from "Find next siblings until a certain one using beautifulsoup" because it seems to fit my use case, but I get the error "'NoneType' object has no attribute 'next_sibling'". That is expected, since the update2 script from that question does not include the other tables, so the update2 code is ruled out.
My expected output for this is a json file (special characters are escaped) like:
{"title":' + '"' + str(var) + '"' + ',"body":" + flooract + ' + "`}
*where flooract is the html code of the tables with special characters escaped. Sample snippet:
<table>\n<tbody>\n<tr>\n<td class=\"labelplain\"> Status Date<\/td><td class=\"labelplain\"> 10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\"> Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
Link to sample file here: https://issuances-library.senate.gov.ph/54629.html
I have attached an image of the site:
Screenshot 3, I have encircled in red lines what I only wanted to get from the HTML file:

How to use beautiful soup to get elements that contain/do not contain specific classes

I want to get a table and save it to Excel with a Python script. Here is the response:
<body>
<table id="need">
<tr height="30" align="center">
<td>need</td>
<td id="td1">need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">need</td>
...
</tr>
<tr height="30" align="center" cid="2" class="txt">
<td>not need</td>
<td id="td1">not need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">not need</td>
...
</tr>
...
</table>
<table>
...
</table>
</body>
I need the contents of every <tr> except those with class="txt", and every <td> except those with type="wholeLast". In short, I need all the "need" values in the above response.
I tried this: trs = soup.find_all("tr", attrs={"height": "30", "align": "center"}). But I don't know how to remove the <td> elements with type="wholeLast". Maybe I need a different approach.
Any suggestions are appreciated.
With CSS selectors and the :not() pseudo-class you could do this:
tds=soup.select('tr:not(.txt) td:not([type="wholeLast"])')
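The selector above returns one flat list of cells. For saving to Excel it may help to keep the cells grouped by row, so here is a rough sketch that applies the same two filters explicitly. It is not the BeautifulSoup approach from the answer: it uses the standard library's xml.etree.ElementTree as a stand-in and assumes the markup is well-formed. SAMPLE and wanted_rows are hypothetical names, and the "..." placeholders from the question are left out of the sample.

```python
import xml.etree.ElementTree as ET

# The table from the question, without the elided "..." parts.
SAMPLE = """<table id="need">
<tr height="30" align="center">
<td>need</td>
<td id="td1">need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">need</td>
</tr>
<tr height="30" align="center" cid="2" class="txt">
<td>not need</td>
<td id="td1">not need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">not need</td>
</tr>
</table>"""


def wanted_rows(table):
    """Return cell texts per row, skipping tr.txt rows and wholeLast cells."""
    rows = []
    for tr in table.iter("tr"):
        # Equivalent of tr:not(.txt): skip rows whose class list contains "txt".
        if "txt" in tr.get("class", "").split():
            continue
        # Equivalent of td:not([type="wholeLast"]).
        cells = [
            (td.text or "").strip()
            for td in tr.findall("td")
            if td.get("type") != "wholeLast"
        ]
        rows.append(cells)
    return rows


rows = wanted_rows(ET.fromstring(SAMPLE))
print(rows)
```

Each inner list is one spreadsheet row, so it can be handed to whatever writer you use for the Excel file.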

Need a dynamic python selenium way of picking an element by xpath

This is the HTML it needs to pick from:
<tbody class="datepickerDays">
<tr>
<th class="datepickerWeek"><span>40</span></th>
<td class="datepickerNotInMonth"><span>28</span></td>
<td class="datepickerNotInMonth"><span>29</span></td>
<td class="datepickerNotInMonth"><span>30</span></td>
<td class=""><span>1</span></td>
<td class=""><span>2</span></td>
<td class="datepickerSaturday"><span>3</span></td>
<td class="datepickerSunday"><span>4</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>41</span></th>
<td class=""><span>5</span></td>
<td class=""><span>6</span></td>
<td class=""><span>7</span></td>
<td class="datepickerSelected"><span>8</span></td>
<td class=""><span>9</span></td>
<td class="datepickerSaturday"><span>10</span></td>
<td class="datepickerSunday"><span>11</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>42</span></th>
<td class=""><span>12</span></td>
<td class=""><span>13</span></td>
<td class=""><span>14</span></td>
<td class=""><span>15</span></td>
<td class=""><span>16</span></td>
<td class="datepickerSaturday"><span>17</span></td>
<td class="datepickerSunday"><span>18</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>43</span></th>
<td class=""><span>19</span></td>
<td class=""><span>20</span></td>
<td class=""><span>21</span></td>
<td class=""><span>22</span></td>
<td class=""><span>23</span></td>
<td class="datepickerSaturday"><span>24</span></td>
<td class="datepickerSunday"><span>25</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>44</span></th>
<td class=""><span>26</span></td>
<td class=""><span>27</span></td>
<td class=""><span>28</span></td>
<td class=""><span>29</span></td>
<td class=""><span>30</span></td>
<td class="datepickerSaturday"><span>31</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>1</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>45</span></th>
<td class="datepickerNotInMonth"><span>2</span></td>
<td class="datepickerNotInMonth"><span>3</span></td>
<td class="datepickerNotInMonth"><span>4</span></td>
<td class="datepickerNotInMonth"><span>5</span></td>
<td class="datepickerNotInMonth"><span>6</span></td>
<td class="datepickerNotInMonth datepickerSaturday"><span>7</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>8</span></td>
</tr>
</tbody>
The code should determine what date it is today and click on that day. I think that there is no need for month/year because the only view the program will see is the current month anyway. If your solution can provide a month-picker also, it would be great.
So we need the current date (for example: 8th, while the previous date was 5), the current day name, and the program needs to pick according to that.
Current efforts:
driver.find_element_by_xpath('//td[@class="datepickerSelected"]/a[text()="8"]').click()
But Selenium doesn't click on it.
I can't show you the entire code, or the website we are using it on because it is inside a login environment.
Use the following XPath to find the element.
driver.find_element_by_xpath('//td[@class="datepickerSelected"]/a[./span[text()="8"]]').click()
To get today's date, you can use datetime. See the docs for more info. Once you have it, you can insert the day into the locator and click the element.
There are a couple of problems with your locator vs the HTML that you posted.
//td[@class="datepickerSelected"]/a[text()="8"]
This is looking for a TD that has a class "datepickerSelected" but it doesn't exist in the HTML you posted. I'm assuming that class only appears after you've selected a date but when you first enter the page, this won't be true so we can't use that class to locate the day we want.
text() matches text nodes that are direct children of the specified element, in this case an A tag. If you look at the HTML, the text is actually inside the SPAN child of the A tag. There are a couple of ways to deal with this. You can change that part of the locator to /a/span[text()="8"], or use . which "flattens" the text from all descendant elements, e.g. /a[.="8"]. Either way will work.
Another problem you will have to deal with is if the day is late or early in the month, then it shows up twice in the HTML, e.g. 2 or 28. To get the right one, you need to specify the day in the SPAN under a TD with an empty class. The wrong ones have a TD with the class datepickerNotInMonth.
Taking all this into account, here's the code I would use.
import datetime
today = datetime.datetime.now().day
driver.find_element_by_xpath(f'//td[@class=""]/a[.="{today}"]').click()
The locator finds a TD that contains an empty class that has a child A that contains (the flattened) text corresponding to today's day.
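Newer Selenium 4 releases no longer provide the find_element_by_xpath helper, so here is a hedged sketch of the same locator using find_element(By.XPATH, ...). The function names day_locator and click_today are made up for this example, and driver is assumed to be an already logged-in WebDriver session showing the current month.

```python
import datetime


def day_locator(day):
    """Build the XPath for a day cell that belongs to the current month.

    Cells with an empty class are in the month being shown; cells from the
    neighbouring months carry the datepickerNotInMonth class instead.
    """
    return f'//td[@class=""]/a[.="{day}"]'


def click_today(driver):
    """Click today's day in the date picker of the given WebDriver."""
    # Imported here so day_locator can be used without Selenium installed.
    from selenium.webdriver.common.by import By

    today = datetime.date.today().day
    driver.find_element(By.XPATH, day_locator(today)).click()
```

Keeping the XPath in its own function makes it easy to check the generated string without opening a browser.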

Scrape Table Using Scrapy

apologies for the long post -
I have a table that I am trying to dig into using scrapy, but can't quite figure out how to dig into this table deep enough.
This is the table:
<table class="detail-table" border="0" cellspacing="0">
<tbody>
<tr id="trAnimalID">
...
</tr>
<tr id="trSpecies">
...
</tr>
<tr id="trBreed">
...
</tr>
<tr id="trAge">
...
<tr id="trSex">
...
</tr>
<tr id="trSize">
...
</tr>
<tr id="trColor">
...
</tr>
<tr id="trDeclawed">
...
</tr>
<tr id="trHousetrained">
...
</tr>
<tr id="trLocation">
...
</tr>
<tr id="trIntakeDate">
<td class="detail-label" align="right">
<b>Intake Date</b>
</td>
<td class="detail-value">
<span id="lblIntakeDate">3/31/2020</span>
</td>
</tr>
<tr id="trStage">
<td class="detail-label" align="right">
<b>Stage</b>
</td>
<td class="detail-value">
<span id="lblStage">Reserved</span>
</td>
</tr>
</tbody></table>
I can dig into it using the scrapy shell command:
text = response.xpath('//*[@class="detail-table"]//tr')[10].extract()
I am getting back this:
'<tr id="trIntakeDate">\r\n\t
<td class="detail-label" align="right">\r\n
<b>Intake Date</b>\r\n
</td>\r\n\t
<td class="detail-value">\r\n
<span id="lblIntakeDate">3/31/2020</span>\xa0\r\n
</td>\r\n
</tr>'
I can't quite figure out how to just get the value for lblIntakeDate. I just need 3/31/2020. Additionally, I'd like to run this as a lambda, and can't quite figure out how to get the execute function to dump out a json file like I can using command line. Any ideas?
Try this XPath:
//table[@class='detail-table']/tbody//tr/td/span[@id='lblIntakeDate']/text()
You can test it at https://www.online-toolz.com/tools/xpath-tester-online.php
And please remove redundant characters such as
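Try:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = ''  # left blank in the original; put the page URL here
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')
# Print the text of every link on the page
for i in bs.find_all('a'):
    print(i.get_text())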
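For the lblIntakeDate question above, here is a small sketch of picking the span by its id, trimming the stray whitespace (including the \xa0 that follows the span), and serialising the result as JSON. It uses the standard library's xml.etree.ElementTree as a stand-in for the Scrapy selector so it runs on its own; ROW is a cleaned-up copy of the row from the question, and intake_date and the "intake_date" key are names invented for this example.

```python
import json
import xml.etree.ElementTree as ET

# The row returned by the scrapy shell, with the \r\n\t noise removed.
ROW = """<tr id="trIntakeDate">
<td class="detail-label" align="right"><b>Intake Date</b></td>
<td class="detail-value"><span id="lblIntakeDate">3/31/2020</span>\u00a0</td>
</tr>"""


def intake_date(row):
    """Return the stripped text of the lblIntakeDate span, or None."""
    span = row.find(".//span[@id='lblIntakeDate']")
    if span is None:
        return None
    return (span.text or "").strip()


record = {"intake_date": intake_date(ET.fromstring(ROW))}
payload = json.dumps(record)  # a JSON string a handler could return or write
print(payload)

# In a Scrapy callback the lookup would presumably be:
#   response.xpath('//span[@id="lblIntakeDate"]/text()').get()
```

Selecting the span by id avoids depending on the row's position ([10]) in the table.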

Python: robust xpath for table in tr and td tags, eliminate unwanted data

I need a robust XPath to get data from this URL: "http://www.screener.com/v2/stocks/view/5131"
However, there is too much blank space before the desired data, and my current approach is not robust.
The part I need is 11.48,9.05,11.53 from the html below:
<div class="table-responsive">
<table class="table table-hover">
<tr>
<th>Financial Year</th>
<th class="number">Revenue ('000)</th>
<th class="number">Net ('000)</th>
<th class="number">EPS</th>
<th></th>
</tr>
<tr>
<td>30 Nov, 2017</td>
<td class="number">205,686</td>
<td class="number">52,812</td>
<td class="number">11.48</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2016</td>
<td class="number">191,301</td>
<td class="number">41,598</td>
<td class="number">9.05</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2015</td>
<td class="number">225,910</td>
<td class="number">51,082</td>
<td class="number">11.53</td>
<td></td>
</tr>
My code is below:
from lxml import html
import requests
page = requests.get('http://www.screener.com/v2/stocks/view/5131')
output = html.fromstring(page.content)
output.xpath('//tr/td/following-sibling::td/text()')
How should the code be changed so that it robustly gets the three numbers from the table shown above?
I just want the output 11.48,9.05,11.53 but I am unable to get rid of the rest of the data inside the table.
Try the XPath below to get the desired output:
//div[@id="annual"]//tr/td[position() = last() - 1]/text()
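The XPath above depends on EPS being the second-to-last column. If the column order might change, one possible alternative is to look the column up by its header text. The sketch below does that with the standard library's xml.etree.ElementTree on a copy of the posted markup (closing tags added so it parses); SAMPLE and column_values are names invented for this illustration, and with lxml the same logic should apply to the tree returned by html.fromstring.

```python
import xml.etree.ElementTree as ET

# Copy of the posted table, with the missing closing tags added.
SAMPLE = """<div class="table-responsive">
<table class="table table-hover">
<tr><th>Financial Year</th><th class="number">Revenue ('000)</th>
<th class="number">Net ('000)</th><th class="number">EPS</th><th></th></tr>
<tr><td>30 Nov, 2017</td><td class="number">205,686</td>
<td class="number">52,812</td><td class="number">11.48</td><td></td></tr>
<tr><td>30 Nov, 2016</td><td class="number">191,301</td>
<td class="number">41,598</td><td class="number">9.05</td><td></td></tr>
<tr><td>30 Nov, 2015</td><td class="number">225,910</td>
<td class="number">51,082</td><td class="number">11.53</td><td></td></tr>
</table>
</div>"""


def column_values(table, header):
    """Return the stripped cell texts of the column with the given header."""
    rows = table.findall(".//tr")
    headers = [(th.text or "").strip() for th in rows[0].findall("th")]
    index = headers.index(header)  # raises ValueError if the header is missing
    values = []
    for row in rows[1:]:
        cells = row.findall("td")
        if len(cells) > index:  # skip rows that are too short
            values.append((cells[index].text or "").strip())
    return values


table = ET.fromstring(SAMPLE).find("table")
eps = column_values(table, "EPS")
print(",".join(eps))
```

Stripping each cell's text also takes care of the blank space around the values.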
