Scrape Table Using Scrapy - python

Apologies for the long post.
I have a table that I am trying to scrape with Scrapy, but I can't figure out how to dig into it deeply enough.
This is the table:
<table class="detail-table" border="0" cellspacing="0">
<tbody>
<tr id="trAnimalID">
...
</tr>
<tr id="trSpecies">
...
</tr>
<tr id="trBreed">
...
</tr>
<tr id="trAge">
...
</tr>
<tr id="trSex">
...
</tr>
<tr id="trSize">
...
</tr>
<tr id="trColor">
...
</tr>
<tr id="trDeclawed">
...
</tr>
<tr id="trHousetrained">
...
</tr>
<tr id="trLocation">
...
</tr>
<tr id="trIntakeDate">
<td class="detail-label" align="right">
<b>Intake Date</b>
</td>
<td class="detail-value">
<span id="lblIntakeDate">3/31/2020</span>
</td>
</tr>
<tr id="trStage">
<td class="detail-label" align="right">
<b>Stage</b>
</td>
<td class="detail-value">
<span id="lblStage">Reserved</span>
</td>
</tr>
</tbody></table>
I can dig into it using the scrapy shell command:
text = response.xpath('//*[@class="detail-table"]//tr')[10].extract()
I am getting back this:
'<tr id="trIntakeDate">\r\n\t
<td class="detail-label" align="right">\r\n
<b>Intake Date</b>\r\n
</td>\r\n\t
<td class="detail-value">\r\n
<span id="lblIntakeDate">3/31/2020</span>\xa0\r\n
</td>\r\n
</tr>'
I can't figure out how to get just the value of lblIntakeDate. I only need 3/31/2020. Additionally, I'd like to run this as a Lambda, and I can't figure out how to get the execute function to dump out a JSON file the way I can from the command line. Any ideas?

Try this XPath:
//table[@class='detail-table']/tbody//tr/td/span[@id='lblIntakeDate']/text()
You can test it at https://www.online-toolz.com/tools/xpath-tester-online.php
And please remove redundant characters such as
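
To see the idea in isolation, here is a minimal sketch that pulls the span text out of the intake-date row. It uses the standard library's xml.etree.ElementTree, which is an assumption made only so the snippet runs without Scrapy installed, and the sample markup is a trimmed, well-formed copy of the row from the question. In a Scrapy shell the equivalent call would be response.xpath('//span[@id="lblIntakeDate"]/text()').get().

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the row from the question, including the trailing
# non-breaking space (\xa0) that showed up in the extracted output.
ROW = (
    '<table class="detail-table"><tbody>'
    '<tr id="trIntakeDate">\r\n\t'
    '<td class="detail-label" align="right"><b>Intake Date</b></td>\r\n\t'
    '<td class="detail-value">\r\n'
    '<span id="lblIntakeDate">3/31/2020</span>\xa0\r\n'
    '</td></tr></tbody></table>'
)

root = ET.fromstring(ROW)

# Target the span by its id instead of indexing into the list of rows.
span = root.find(".//span[@id='lblIntakeDate']")
intake_date = span.text.strip()

print(intake_date)
```

Selecting the span directly also sidesteps the \r\n\t and \xa0 noise, because that whitespace sits outside the span element.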

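Try:
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = ''  # URL of the page to scrape
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')
# Print the text of every link on the page
for i in bs.find_all('a'):
    print(i.get_text())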

Related

Python Beautifulsoup traverse a table with particular text content in innerHTML then get contents until before a particular element

I have an html with a lots of table to traverse to like below:
<html>
.. omitted parts since I am interested on the HTML table..
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="labeltitle">
<tbody>
<tr>
<td class="labeltitle">
<font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
</td>
<td class="labelplain"> </td>
</tr>
</tbody>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
... omitted just to show the td that I am interested to scrape ...
<td class="labelplain"> Senator(s)</td>
<td class="labelplain">
<table>
<tbody>
<tr>
<td class="labelplain">VILLAR JR., MANUEL B.<br></td>
</tr>
</tbody>
</table>
</td>
...
<table>
<table>
... More tables like the table above (the one with VILLAR Jr.)
</table>
<table>
<tbody>
<tr>
<td class="labeltitle">
<table>
<tbody>
<tr>
<td class="labeltitle"> <font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
<td class="labelplain"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... more tables
</html>
The table I want to start from is the one whose td with class "labeltitle" has a child "font" element containing the text "Floor Activity". I want the HTML of every table after it, up to (but not including) the table whose td with class "labeltitle" has a child "font" element containing the text "Vote(s)". I tried XPath like so:
table = dom.xpath("//table[8]/tbody/tr/td")
print (table)
but to no avail: I get an empty list. Any approach would do (with or without XPath).
I also tried the following:
rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')
This reaches the table containing "Floor Activity", but it only gives me the content of that particular parent table. The exact output is below:
<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor
Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%">
</td>
</tr>
</table></td>
</tr>
I tried the approach from "Find next siblings until a certain one using beautifulsoup" because it seems to fit my use case, but I get the error "'NoneType' object has no attribute 'next_sibling'". That is expected, since the update2 script there does not include the other tables, so the update2 code is out of the equation.
My expected output for this is a json file (special characters are escaped) like:
{"title":' + '"' + str(var) + '"' + ',"body":" + flooract + ' + "`}
*where flooract is the html code of the tables with special characters escaped. Sample snippet:
<table>\n<tbody>\n<tr>\n<td class=\"labelplain\"> Status Date<\/td><td class=\"labelplain\"> 10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\"> Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
Link to sample file here: https://issuances-library.senate.gov.ph/54629.html
I have attached a screenshot of the site (Screenshot 3), with the part I want to get from the HTML file circled in red.
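
One possible approach, sketched under stated assumptions: walk the top-level tables in order, start collecting after the table that contains the #jump_fa anchor, and stop at the table that contains the #jump_vote anchor. The sample page below is a simplified, well-formed stand-in (not the real file), the wanted tables are assumed to be siblings of the marker tables, and the "title" value is a hypothetical placeholder. The standard library's ElementTree is used only so the sketch runs without extra installs; on the real, less tidy HTML the same loop could be written over BeautifulSoup's find_next_siblings('table').

```python
import json
import xml.etree.ElementTree as ET

# Simplified, well-formed stand-in for the page: marker tables and the
# tables between them are assumed to be siblings under <body>.
PAGE = """
<html><body>
<table><tr><td>header, not wanted</td></tr></table>
<table><tr><td class="labeltitle"><font color="FFD700">Floor Activity<a name="#jump_fa"></a></font></td></tr></table>
<table><tr><td class="labelplain">Status Date</td><td class="labelplain">10/12/2005</td></tr></table>
<table><tr><td class="labelplain">Senator(s)</td><td class="labelplain">VILLAR JR., MANUEL B.</td></tr></table>
<table><tr><td class="labeltitle"><font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td></tr></table>
<table><tr><td>votes, not wanted</td></tr></table>
</body></html>
"""


def tables_between(body, start_name, stop_name):
    """Return the serialized tables after the start marker and before the stop marker."""
    collected = []
    collecting = False
    for table in body.findall("table"):
        if table.find(f".//a[@name='{start_name}']") is not None:
            collecting = True  # start marker found; skip the marker table itself
            continue
        if table.find(f".//a[@name='{stop_name}']") is not None:
            break  # stop marker found; nothing after it is wanted
        if collecting:
            collected.append(ET.tostring(table, encoding="unicode").strip())
    return collected


body = ET.fromstring(PAGE.strip()).find("body")
floor_tables = tables_between(body, "#jump_fa", "#jump_vote")

# json.dumps takes care of escaping quotes and newlines in the HTML string.
record = json.dumps({"title": "Floor Activity", "body": "\n".join(floor_tables)})
print(record)
```

Building the record with json.dumps, rather than concatenating strings by hand, avoids having to escape the special characters manually.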

How to use beautiful soup to get elements that contain/do not contain specific classes

I want to get a table and save it to Excel with a Python script. Here is the response:
<body>
<table id="need">
<tr height="30" align="center">
<td>need</td>
<td id="td1">need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">need</td>
...
</tr>
<tr height="30" align="center" cid="2" class="txt">
<td>not need</td>
<td id="td1">not need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">not need</td>
...
</tr>
...
</table>
<table>
...
</table>
</body>
I need the contents of every <tr> except those with class="txt", and every <td> except those with type="wholeLast". In short, I need all the "need" cells in the response above.
I tried this: trs = soup.find_all("tr", attrs={"height": "30", "align": "center"}), but I don't know how to exclude the <td> elements with type="wholeLast". Maybe I need a different approach.
Any suggestions are appreciated.
With CSS selectors and the :not() pseudo-class, you could do this:
tds = soup.select('tr:not(.txt) td:not([type="wholeLast"])')
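
For comparison, here is the same filter written out as plain loops. This is a sketch, not the selector-based method above: it uses the standard library's ElementTree on a well-formed copy of the sample table (an assumption, so it runs without BeautifulSoup), and it keeps the cells grouped by row, which may be convenient when writing rows to a spreadsheet.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the sample table from the question.
TABLE = """
<table id="need">
  <tr height="30" align="center">
    <td>need</td>
    <td id="td1">need</td>
    <td id="td2" type="wholeLast">not need</td>
    <td id="td3" type="whole">need</td>
  </tr>
  <tr height="30" align="center" cid="2" class="txt">
    <td>not need</td>
    <td id="td1">not need</td>
    <td id="td2" type="wholeLast">not need</td>
    <td id="td3" type="whole">not need</td>
  </tr>
</table>
"""

table = ET.fromstring(TABLE.strip())

rows = []
for tr in table.findall("tr"):
    # Skip rows carrying the "txt" class (mirrors tr:not(.txt)).
    if "txt" in (tr.get("class") or "").split():
        continue
    # Drop cells with type="wholeLast" (mirrors td:not([type="wholeLast"])).
    cells = [td.text for td in tr.findall("td") if td.get("type") != "wholeLast"]
    rows.append(cells)

print(rows)
```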

Python: robust xpath for table in tr and td tags, eliminate unwanted data

I need a robust XPath for this URL: "http://www.screener.com/v2/stocks/view/5131"
However, there is a lot of whitespace before the data I want, and my current expression is not robust.
The part I need is 11.48, 9.05, 11.53 from the HTML below:
<div class="table-responsive">
<table class="table table-hover">
<tr>
<th>Financial Year</th>
<th class="number">Revenue ('000)</th>
<th class="number">Net ('000)</th>
<th class="number">EPS</th>
<th></th>
</tr>
<tr>
<td>30 Nov, 2017</td>
<td class="number">205,686</td>
<td class="number">52,812</td>
<td class="number">11.48</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2016</td>
<td class="number">191,301</td>
<td class="number">41,598</td>
<td class="number">9.05</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2015</td>
<td class="number">225,910</td>
<td class="number">51,082</td>
<td class="number">11.53</td>
<td></td>
</tr>
My code is below:
from lxml import html
import requests
page = requests.get('http://www.screener.com/v2/stocks/view/5131')
output = html.fromstring(page.content)
output.xpath('//tr/td/following-sibling::td/text()')
How should the code be changed so that it robustly gets the three numbers from the table shown above?
I just want the output 11.48,9.05,11.53 but I am unable to get rid of the rest of the data in the table.
Try the XPath below to get the desired output:
//div[@id="annual"]//tr/td[position() = last() - 1]/text()
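
A small sketch of the "second-to-last cell" idea follows. The div with id="annual" comes from the XPath in the answer, not from the snippet in the question, so wrapping the sample table in it is an assumption. The sketch uses the standard library's ElementTree, whose limited XPath subset accepts last()-1 (written without spaces) but not text(), so the text is read from each element afterwards.

```python
import xml.etree.ElementTree as ET

# Sample rows from the question, wrapped in the div the answer's XPath expects.
PAGE = """
<div id="annual">
<table class="table table-hover">
<tr><th>Financial Year</th><th class="number">Revenue ('000)</th><th class="number">Net ('000)</th><th class="number">EPS</th><th></th></tr>
<tr><td>30 Nov, 2017</td><td class="number">205,686</td><td class="number">52,812</td><td class="number">11.48</td><td></td></tr>
<tr><td>30 Nov, 2016</td><td class="number">191,301</td><td class="number">41,598</td><td class="number">9.05</td><td></td></tr>
<tr><td>30 Nov, 2015</td><td class="number">225,910</td><td class="number">51,082</td><td class="number">11.53</td><td></td></tr>
</table>
</div>
"""

root = ET.fromstring(PAGE.strip())

# td[last()-1] picks the second-to-last <td> of each row; the header row
# only has <th> cells, so it contributes nothing.
cells = root.findall(".//tr/td[last()-1]")
eps = [cell.text.strip() for cell in cells]

print(",".join(eps))
```

Selecting by position relative to the last cell keeps working if columns are added on the left, and calling strip() on each value removes any surrounding whitespace.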

Extract specific value from HTML using xpath and scrapy

I have following html Code:
<tr data-live="COumykPG" data-dt="10,11,2017,19,00" data-def="1">
<td class="table-matches__tt"><span class="table-matches__time" data-live-cell="time">19:00</span><span>Oberneuland</span> - <span>Habenhauser</span></td>
<td class="livebet" data-live-cell="livebet"> </td>
<td class="table-matches__streams" data-live-cell="score">
</td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6ev9v">1.10</td>
<td class="table-matches__odds" data-oid="2p2k5xv498x0x0">7.44</td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6eva0">12.40</td>
</tr>
I am trying to scrape the three float values from this code: 1.10, 7.44, 12.40
The expression I tried to use to get the values was:
response.xpath('//a/@target').extract()
The output I get is 'mySelections'.
I want to get the value next to it. What is the right expression?
Thank you in advance.
What's wrong
response.xpath('//a/@target').extract()
Why?
If you format your HTML, the error is obvious.
You want to extract the text of the <a> tag, not its target attribute.
<tr data-live="COumykPG" data-dt="10,11,2017,19,00" data-def="1">
<td class="table-matches__tt">
<span class="table-matches__time" data-live-cell="time">19:00</span>
<a href="/soccer/germany/oberliga-bremen/oberneuland-habenhauser/COumykPG/" data-live-cell="matchlink">
<span>Oberneuland</span> - <span>Habenhauser</span>
</a>
</td>
<td class="livebet" data-live-cell="livebet"> </td>
<td class="table-matches__streams" data-live-cell="score"></td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6ev9v">
<a href="/myselections.php?action=3&matchid=COumykPG&outcomeid=2p2k5xv464x0x6ev9v&otheroutcomes=2p2k5xv498x0x0,2p2k5xv464x0x6eva0"
onclick="return my_selections_click('1x2', 'soccer');"
title="Add to My Selections"
target="mySelections">1.10</a>
</td>
<td class="table-matches__odds" data-oid="2p2k5xv498x0x0">
<a href="/myselections.php?action=3&matchid=COumykPG&outcomeid=2p2k5xv498x0x0&otheroutcomes=2p2k5xv464x0x6ev9v,2p2k5xv464x0x6eva0"
onclick="return my_selections_click('1x2', 'soccer');"
title="Add to My Selections"
target="mySelections">7.44</a>
</td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6eva0">
<a href="/myselections.php?action=3&matchid=COumykPG&outcomeid=2p2k5xv464x0x6eva0&otheroutcomes=2p2k5xv464x0x6ev9v,2p2k5xv498x0x0"
onclick="return my_selections_click('1x2', 'soccer');"
title="Add to My Selections"
target="mySelections">12.40</a>
</td>
</tr>
How to fix it
Use one of the following:
response.xpath('//a/text()').extract()
Some developers report problems with response.xpath; if you run into them, you can build a Scrapy Selector explicitly instead:
from scrapy.selector import Selector
result_array = Selector(text=response.text).xpath('//a/text()').extract()
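
Note that //a/text() also matches the text inside the match link (the " - " between the team names), so it can help to restrict the query to the odds cells. The sketch below illustrates this with the standard library's ElementTree on a shortened, well-formed copy of the row; the href values are hypothetical placeholders. In Scrapy the equivalent expression would be response.xpath('//td[@class="table-matches__odds"]/a/text()').getall().

```python
import xml.etree.ElementTree as ET

# Shortened copy of the formatted row; href values are placeholders.
ROW = """
<table><tr data-live="COumykPG">
<td class="table-matches__tt"><span class="table-matches__time">19:00</span>
<a href="/match"><span>Oberneuland</span> - <span>Habenhauser</span></a></td>
<td class="table-matches__odds"><a href="/myselections" target="mySelections">1.10</a></td>
<td class="table-matches__odds"><a href="/myselections" target="mySelections">7.44</a></td>
<td class="table-matches__odds"><a href="/myselections" target="mySelections">12.40</a></td>
</tr></table>
"""

root = ET.fromstring(ROW.strip())

# Only links inside odds cells are selected, so the match link is ignored.
links = root.findall(".//td[@class='table-matches__odds']/a")
odds = [float(a.text) for a in links]

print(odds)
```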

scrapy xpath return empty list while response is not empty

I'm using Scrapy to scrape information from two tables on a website.
I first select the tables. It turns out that staffs and students are empty even though the response is not. I can also find the table tags in the page source. Can anyone spot the problem?
import scrapy
from universities.items import UniversitiesItem
class UniversityOfSouthCarolinaColumbia(scrapy.Spider):
    name = 'uscc'
    allowed_domains = ['sc.edu']
    start_urls = ['http://www.sc.edu/about/directory/?name=']
    def parse(self, response):
        for ln in ['Zhao']:
            query = response.url + ln
            yield scrapy.Request(query, callback=self.parse_item)
    @staticmethod
    def parse_item(response):
        staffs = response.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')
        students = response.xpath('//table[@id="directorystudent"]/tbody/tr[@role="row"]')
        print('--------------------------')
        print('staffs', staffs)
        print('==========================')
        print('students', students)
That's a really cool question. I investigated it and concluded that the response does not contain that attribute on the tags. I think the browser runs a script that modifies the page source and adds the attribute to the tags.
In the response, the tr tags do not have the 'role' attribute.
See for yourself:
<table class="display" id="directorystaff" width="100%">
<thead>
<tr>
<th style="text-align: left">Name</th>
<th style="text-align: left">Email</th>
<th style="text-align: left">Phone</th>
<th style="text-align: left">Department</th>
<th style="text-align: left">Office Address</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Zhao, Xia </td>
<td style="text-align: left"> </td>
<td style="text-align: left">(803) 777-8436 </td>
<td style="text-align: left">Chemistry </td>
<td style="text-align: left">537 </td>
</tr>
<tr>
<td style="text-align: left">Zhao, Xing </td>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: left">Mechanical Engineering </td>
<td style="text-align: left"> </td>
</tr>
[Two screenshots omitted: the raw response page, and the same page rendered in the browser.]
So, if you want the list of staff, I recommend this XPath:
//table[@id="directorystaff"]/tbody/tr/td
And for students, I recommend this XPath:
//table[@id="directorystudent"]/tbody/tr/td
If you want something else, you can modify the XPath query.
Here is an example:
import requests
from lxml import html
x = requests.get("https://www.sc.edu/about/directory/?name=Zhao")
ht = html.fromstring(x.text)
# Select every cell of the staff table; no role attribute is needed
element = ht.xpath('//table[@id="directorystaff"]/tbody/tr/td')
for el in element:
    print(el.text)
And output:
>>Zhao, Xia  
>> 
>>(803) 777-8436 
>>Chemistry 
>>537 
>>Zhao, Xing  
>> 
>> 
>>Mechanical Engineering 
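
As a follow-up sketch, the snippet below reproduces the diagnosis offline and groups the cells by row. It assumes a shortened, well-formed copy of the raw response shown above and uses the standard library's ElementTree instead of Scrapy or lxml so it runs without a network call. The tr[@role='row'] query returns nothing, while the plain tr query returns both rows.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the raw response: note that <tr> has no role attribute.
PAGE = """<body>
<table class="display" id="directorystaff" width="100%">
<thead><tr><th>Name</th><th>Email</th><th>Phone</th><th>Department</th><th>Office Address</th></tr></thead>
<tbody>
<tr><td>Zhao, Xia </td><td> </td><td>(803) 777-8436 </td><td>Chemistry </td><td>537 </td></tr>
<tr><td>Zhao, Xing </td><td> </td><td> </td><td>Mechanical Engineering </td><td> </td></tr>
</tbody></table></body>"""

root = ET.fromstring(PAGE)
table = root.find(".//table[@id='directorystaff']")

# The original query: empty, because the attribute is added later by a script.
with_role = table.findall("./tbody/tr[@role='row']")

# Without the role filter, each row can be turned into a dict keyed by header.
headers = [th.text.strip() for th in table.findall("./thead/tr/th")]
staff = [
    dict(zip(headers, ((td.text or "").strip() for td in tr.findall("td"))))
    for tr in table.findall("./tbody/tr")
]

print(with_role)
for person in staff:
    print(person)
```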
