Extract specific value from HTML using xpath and scrapy


I have following html Code:
<tr data-live="COumykPG" data-dt="10,11,2017,19,00" data-def="1">
<td class="table-matches__tt"><span class="table-matches__time" data-live-cell="time">19:00</span><span>Oberneuland</span> - <span>Habenhauser</span></td>
<td class="livebet" data-live-cell="livebet"> </td>
<td class="table-matches__streams" data-live-cell="score">
</td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6ev9v">1.10</td>
<td class="table-matches__odds" data-oid="2p2k5xv498x0x0">7.44</td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6eva0">12.40</td>
</tr>
I am trying to scrape the three float values from this code: 1.10, 7.44 and 12.40.
The expression I tried in order to get the values was:
response.xpath('//a/@target').extract()
The output I get is 'mySelections'.
I want to get the value next to it. What is the right expression for that?
Thank you in advance.

What's wrong
response.xpath('//a/@target').extract()
Why?
If you format your HTML, the error becomes obvious.
This expression selects the target attribute of each <a> tag, but you want the text inside the tag.
<tr data-live="COumykPG" data-dt="10,11,2017,19,00" data-def="1">
<td class="table-matches__tt">
<span class="table-matches__time" data-live-cell="time">19:00</span>
<a href="/soccer/germany/oberliga-bremen/oberneuland-habenhauser/COumykPG/" data-live-cell="matchlink">
<span>Oberneuland</span> - <span>Habenhauser</span>
</a>
</td>
<td class="livebet" data-live-cell="livebet"> </td>
<td class="table-matches__streams" data-live-cell="score"></td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6ev9v">
<a href="/myselections.php?action=3&matchid=COumykPG&outcomeid=2p2k5xv464x0x6ev9v&otheroutcomes=2p2k5xv498x0x0,2p2k5xv464x0x6eva0"
onclick="return my_selections_click('1x2', 'soccer');"
title="Add to My Selections"
target="mySelections">1.10</a>
</td>
<td class="table-matches__odds" data-oid="2p2k5xv498x0x0">
<a href="/myselections.php?action=3&matchid=COumykPG&outcomeid=2p2k5xv498x0x0&otheroutcomes=2p2k5xv464x0x6ev9v,2p2k5xv464x0x6eva0"
onclick="return my_selections_click('1x2', 'soccer');"
title="Add to My Selections"
target="mySelections">7.44</a>
</td>
<td class="table-matches__odds" data-oid="2p2k5xv464x0x6eva0">
<a href="/myselections.php?action=3&matchid=COumykPG&outcomeid=2p2k5xv464x0x6eva0&otheroutcomes=2p2k5xv464x0x6ev9v,2p2k5xv498x0x0"
onclick="return my_selections_click('1x2', 'soccer');"
title="Add to My Selections"
target="mySelections">12.40</a>
</td>
</tr>
How to fix it
Use one of the following:
response.xpath('//a/text()').extract()
Some developers report that response.xpath occasionally misbehaves; in that case you can build a Scrapy Selector from the response text instead.
from scrapy.selector import Selector
result_array = Selector(text=response.text).xpath('//a/text()').extract()

Related

How to use beautiful soup to get elements that contain/do not contain specific classes

I want to get a table and save it to Excel with a Python script. Here is the response:
<body>
<table id="need">
<tr height="30" align="center">
<td>need</td>
<td id="td1">need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">need</td>
...
</tr>
<tr height="30" align="center" cid="2" class="txt">
<td>not need</td>
<td id="td1">not need</td>
<td id="td2" type="wholeLast">not need</td>
<td id="td3" type="whole">not need</td>
...
</tr>
...
</table>
<table>
...
</table>
</body>
I need the contents of every <tr> except those with class="txt", and every <td> except those with type="wholeLast". In short, I need all the "need" cells in the response above.
I tried this: trs = soup.find_all("tr", attrs={"height": "30", "align": "center"}). But I don't know how to remove the <td> elements whose type is "wholeLast". Maybe I need another approach.
Any suggestion is appreciated.

With CSS selectors and the :not() pseudo-class, you could do this:
tds=soup.select('tr:not(.txt) td:not([type="wholeLast"])')
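If you want to keep the cells grouped by row (handy when the next step is writing rows to a spreadsheet), the same :not() selectors can be applied in two steps. This is only a sketch on a shortened copy of the table from the question; SAMPLE_HTML and rows are names chosen for the example.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question.
SAMPLE_HTML = """
<table id="need">
  <tr height="30" align="center">
    <td>need</td>
    <td id="td1">need</td>
    <td id="td2" type="wholeLast">not need</td>
    <td id="td3" type="whole">need</td>
  </tr>
  <tr height="30" align="center" cid="2" class="txt">
    <td>not need</td>
    <td id="td1">not need</td>
    <td id="td2" type="wholeLast">not need</td>
    <td id="td3" type="whole">not need</td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
# Step 1: rows of the wanted table that do not carry class "txt".
for tr in soup.select("table#need tr:not(.txt)"):
    # Step 2: cells of that row whose type attribute is not "wholeLast".
    cells = tr.select('td:not([type="wholeLast"])')
    rows.append([td.get_text(strip=True) for td in cells])

print(rows)
```

Each inner list is one table row, so it can be handed to whatever Excel writer you prefer.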

Scrape Table Using Scrapy

apologies for the long post -
I have a table that I am trying to dig into using scrapy, but can't quite figure out how to dig into this table deep enough.
This is the table:
<table class="detail-table" border="0" cellspacing="0">
<tbody>
<tr id="trAnimalID">
...
</tr>
<tr id="trSpecies">
...
</tr>
<tr id="trBreed">
...
</tr>
<tr id="trAge">
...
<tr id="trSex">
...
</tr>
<tr id="trSize">
...
</tr>
<tr id="trColor">
...
</tr>
<tr id="trDeclawed">
...
</tr>
<tr id="trHousetrained">
...
</tr>
<tr id="trLocation">
...
</tr>
<tr id="trIntakeDate">
<td class="detail-label" align="right">
<b>Intake Date</b>
</td>
<td class="detail-value">
<span id="lblIntakeDate">3/31/2020</span>
</td>
</tr>
<tr id="trStage">
<td class="detail-label" align="right">
<b>Stage</b>
</td>
<td class="detail-value">
<span id="lblStage">Reserved</span>
</td>
</tr>
</tbody></table>
I can dig into it using the scrapy shell command:
text = response.xpath('//*[@class="detail-table"]//tr')[10].extract()
I am getting back this:
'<tr id="trIntakeDate">\r\n\t
<td class="detail-label" align="right">\r\n
<b>Intake Date</b>\r\n
</td>\r\n\t
<td class="detail-value">\r\n
<span id="lblIntakeDate">3/31/2020</span>\xa0\r\n
</td>\r\n
</tr>'
I can't quite figure out how to just get the value for lblIntakeDate. I just need 3/31/2020. Additionally, I'd like to run this as a lambda, and can't quite figure out how to get the execute function to dump out a json file like I can using command line. Any ideas?

Try this:
//table[@class='detail-table']/tbody//tr/td/span[@id='lblIntakeDate']/text()
You can test it at https://www.online-toolz.com/tools/xpath-tester-online.php
And please remove redundant characters such as

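Try:
from urllib.request import urlopen
from bs4 import BeautifulSoup

url = ''  # left empty in the original answer; put the page URL here
html = urlopen(url)
bs = BeautifulSoup(html.read(), 'html.parser')
# Print the text of every link on the page
for i in bs.find_all('a'):
    print(i.get_text())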

Is it possible that BeautifulSoup can not parse a table in html documents?

Here is an example of the code I used to scrape the table:
with open('text.txt', 'w') as algroo:
    for row in RoOtbody.find_all('tr'):
        for cell in row.find_all('td'):
            algroo.write(cell.text)
        algroo.write('\n')
I already used Selenium and requests to extract the outer html from the webpage. I also tried to use html.parser and lxml.
The html looks like this:
<tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr>
The problem is that when I open the txt file, all the cell elements are in a single column like the one below, literally:
HS heading
Desccription of product
Working or processing, carried out on non-originating materials, which confers originating status
In all the tutorials I watched and read, they should be in the same row, like this:
HS headingDesccription of productWorking or processing, carried out on non-originating materials, which confers originating status
Can anyone help me, please?

I don't know if this will help you
from simplified_scrapy.simplified_doc import SimplifiedDoc
html = '''<tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr>'''
doc = SimplifiedDoc(html)
tr = doc.tr # get first tr
print (tr.text)
print (tr.getText(' '))
tds = tr.tds # get all td
for td in tds:
    print (td.text)
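For what it's worth, BeautifulSoup does parse this table. The cells most likely end up on separate lines because cell.text keeps the line breaks that surround each <p> in the source. Below is a sketch that stays with BeautifulSoup and collapses that whitespace before joining the cells of a row; row_lines and the " | " separator are choices made for this example.

```python
from bs4 import BeautifulSoup

SAMPLE_ROW = """<tr class="table">
<td class="table" valign="top">
<p class="tbl-hdr">HS heading</p>
</td>
<td class="table" valign="top">
<p class="tbl-hdr">Desccription of product</p>
</td>
<td class="table" colspan="2" valign="top">
<p class="tbl-hdr">Working or processing, carried out on non-originating
materials, which confers originating status</p>
</td>
</tr>"""


def row_lines(html_text, separator=" | "):
    """Return one string per <tr>, with the cell texts joined by separator."""
    soup = BeautifulSoup(html_text, "html.parser")
    lines = []
    for row in soup.find_all("tr"):
        # split() + join() removes the newlines around and inside each cell's text.
        cells = [" ".join(cell.get_text().split()) for cell in row.find_all("td")]
        lines.append(separator.join(cells))
    return lines


lines = row_lines(SAMPLE_ROW)
print("\n".join(lines))
```

Writing "\n".join(lines) to the file then gives one line per table row.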

How can I click this option with Python+selenium?

How can I click the "Actions" option with Python+selenium? I have tried a lot of methods; please help me with some suggestions, thank you.
The following methods do not work:
driver.find_element_by_xpath('//*[@id="tabGroup_tabtable"]/tbody/tr/td[2]').click()
driver.find_element_by_css_selector("#tabGroup_tabtable > tbody > tr > td:nth-child(2)").click()
<table id="tabGroup_tabtable" class="tabGroup_tabtable">
<tbody>
<tr>
<td onclick="setFullHelpID(HelpLinks.EDITOR_COMPUTEROVERVIEW);tabGroupSetSelected(0);resize();" tabindex="0" onkeydown="if (event.keyCode==13||event.keyCode==32) {tabGroupSetSelected(0);resize();}" class="tab_selected">
<div class="tab_name">General</div>
</td>
<td onclick="setFullHelpID(HelpLinks.EDITOR_COMPUTEROVERVIEW_ACTIONS);tabGroupSetSelected(1);resize();" tabindex="0" onkeydown="if (event.keyCode==13||event.keyCode==32) {tabGroupSetSelected(1);resize();}" class="tab" onmouseover="this.className='tab_over';"
onmouseout="this.className='tab';">
<div class="tab_name">Actions</div>
</td>
<td onclick="setFullHelpID();tabGroupSetSelected(2);loadEvents();" tabindex="0" onkeydown="if (event.keyCode==13||event.keyCode==32) {tabGroupSetSelected(2);loadEvents();}" class="tab" onmouseover="this.className='tab_over';" onmouseout="this.className='tab';">
<div class="tab_name">System Events</div>
</td>
</tr>
</tbody>
</table>

Thank you, I have already solved it.
Since my page is inside a frame, I need to switch to the frame first using driver.switch_to.frame("input your frame name").

scrapy xpath return empty list while response is not empty

I'm using scrapy to scrape information from 2 tables on the website.
First I select the tables. It turns out that staffs and students are empty even though the response is not empty. I can also find the table tag in the page source. Can anyone see what the problem is?
import scrapy
from universities.items import UniversitiesItem


class UniversityOfSouthCarolinaColumbia(scrapy.Spider):
    name = 'uscc'
    allowed_domains = ['sc.edu']
    start_urls = ['http://www.sc.edu/about/directory/?name=']

    def parse(self, response):
        for ln in ['Zhao']:
            query = response.url + ln
            yield scrapy.Request(query, callback=self.parse_item)

    @staticmethod
    def parse_item(response):
        staffs = response.xpath('//table[@id="directorystaff"]/tbody/tr[@role="row"]')
        students = response.xpath('//table[@id="directorystudent"]/tbody/tr[@role="row"]')
        print('--------------------------')
        print('staffs', staffs)
        print('==========================')
        print('students', students)

It's a really interesting question. I investigated it and concluded that the response does not contain that attribute on the tags. I think the browser modifies the page source with a script that adds the attribute to the tags.
In the response, the tr tags do not have the 'role' attribute.
Please see:
<table class="display" id="directorystaff" width="100%">
<thead>
<tr>
<th style="text-align: left">Name</th>
<th style="text-align: left">Email</th>
<th style="text-align: left">Phone</th>
<th style="text-align: left">Department</th>
<th style="text-align: left">Office Address</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Zhao, Xia </td>
<td style="text-align: left"> </td>
<td style="text-align: left">(803) 777-8436 </td>
<td style="text-align: left">Chemistry </td>
<td style="text-align: left">537 </td>
</tr>
<tr>
<td style="text-align: left">Zhao, Xing </td>
<td style="text-align: left"> </td>
<td style="text-align: left"> </td>
<td style="text-align: left">Mechanical Engineering </td>
<td style="text-align: left"> </td>
</tr>
(Two screenshots followed here: one of the raw response and one of the page as rendered in the browser.)
So, if you want to get the list of staff, I recommend this XPath:
//table[@id="directorystaff"]/tbody/tr/td
And for students, I recommend this XPath:
//table[@id="directorystudent"]/tbody/tr/td
If you want something else, you can modify this XPath query.
Here is an example:
import requests
from lxml import html

x = requests.get("https://www.sc.edu/about/directory/?name=Zhao")
ht = html.fromstring(x.text)
# Select every cell of the staff table
element = ht.xpath('//table[@id="directorystaff"]/tbody/tr/td')
for el in element:
    print(el.text)
And output:
>>Zhao, Xia  
>> 
>>(803) 777-8436 
>>Chemistry 
>>537 
>>Zhao, Xing  
>> 
>> 
>>Mechanical Engineering
