I am trying to write a small python script that scrape tracking details for an internal system. The details are presented in a html table below. I am looking to turn it into python tuples:
(processed, unit b door 3, 30-MAY-16 12:19)
(created, unit b door 2, 30-MAY-16 06:17)
for example. I am using Splinter.
<table class="resultView" cellspacing="0" rules="all" border="1" style="width:540px;border-collapse:collapse;">
<tr class="clearHeader">
<th align="left" scope="col">Activity</th><th scope="col"> </th><th align="center" scope="col">Date</th>
</tr>
<tr class="statusRow">
<td style="width:30%;">Processed</td>
<td align="center"> Unit B<br /> Door 3 </td>
<td align="center" style="width:20%;">30-May-16<br/>12:19</td>
</tr>
<tr class="statusAlternate">
<td style="width:30%;">Created</td>
<td align="center"> Unit B <br /> Door 2</td>
<td align="center" style="width:20%;">30-May-16<br/>06:17</td>
</tr>
</table>
If I run:
for update in browser.find_by_css('tr'):
print update.find_by_css('td')
it displays:
[<splinter.driver.webdriver.WebDriverElement object at 0x103085e90>,
<splinter.driver.webdriver.WebDriverElement object at 0x103085ed0>,
<splinter.driver.webdriver.WebDriverElement object at 0x1030b4050>]
Which is what I would have expected. However, I cannot access the value from it. Changing the line to:
print update.find_by_css('td').value
gives the error:
AttributeError: 'ElementList' object has no attribute 'value'
This is a list so I try to access the first element on the list with
print update.find_by_css('td').first.value
I then get this error:
splinter.exceptions.ElementDoesNotExist: no elements could be found with css "td"
I cannot work out what I am doing wrong?
I think that your problem is that you are looking for "tr" or "td" into your table with css 'tr' or 'td' and any of the "tr" and/or "td" in your hable don't have this class
I suggest you to this case, use xpath to look for elements that you want to find
Related
I have an html with a lots of table to traverse to like below:
<html>
.. omitted parts since I am interested on the HTML table..
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="labeltitle">
<tbody>
<tr>
<td class="labeltitle">
<font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
</td>
<td class="labelplain"> </td>
</tr>
</tbody>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
... omitted just to show the td that I am interested to scrape ...
<td class="labelplain"> Senator(s)</td>
<td class="labelplain">
<table>
<tbody>
<tr>
<td class="labelplain">VILLAR JR., MANUEL B.<br></td>
</tr>
</tbody>
</table>
</td>
...
<table>
<table>
... More tables like the table above (the one with VILLAR Jr.)
</table>
<table>
<tbody>
<tr>
<td class="labeltitle">
<table>
<tbody>
<tr>
<td class="labeltitle"> <font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
<td class="labelplain"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... more tables
</html>
The table I want to traverse is the td with class "labeltitle" and a child element "font" that has text "Floor Activity". Every table below it, I want to get the html code until before the table that has a td class="labeltitle" with child "font" and text content is "Vote(s)". I am trying with xpath like so:
table = dom.xpath("//table[8]/tbody/tr/td")
print (table)
but to no avail, I am getting empty arrays. Anything would do (e.g. with or without xpath).
I also tried the following:
rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')
I am able to traverse the table with content "Floor Activity". The abovementioned code only gives me the content of the table for that particular parent, exact output I am getting below:
<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor
Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%">
</td>
</tr>
</table></td>
</tr>
I am trying out this one Find next siblings until a certain one using beautifulsoup because it seems it fits my use case but the problem is I am getting error "'NoneType' object has no attribute 'next_sibling'" which should be the case since update2 script does not include the other tables, so update2 code is out of the equation.
My expected output for this is a json file (special characters are escaped) like:
{"title":' + '"' + str(var) + '"' + ',"body":" + flooract + ' + "`}
*where flooract is the html code of the tables with special characters escaped. Sample snippet:
<table>\n<tbody>\n<tr>\n<td class=\"labelplain\"> Status Date<\/td><td class=\"labelplain\"> 10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\"> Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
Link to sample file here: https://issuances-library.senate.gov.ph/54629.html
I have attached an image of the site:
Screenshot 3, I have encircled in red lines what I only wanted to get from the HTML file:
This is the HTML it needs to pick from:
<tbody class="datepickerDays">
<tr>
<th class="datepickerWeek"><span>40</span></th>
<td class="datepickerNotInMonth"><span>28</span></td>
<td class="datepickerNotInMonth"><span>29</span></td>
<td class="datepickerNotInMonth"><span>30</span></td>
<td class=""><span>1</span></td>
<td class=""><span>2</span></td>
<td class="datepickerSaturday"><span>3</span></td>
<td class="datepickerSunday"><span>4</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>41</span></th>
<td class=""><span>5</span></td>
<td class=""><span>6</span></td>
<td class=""><span>7</span></td>
<td class="datepickerSelected"><span>8</span></td>
<td class=""><span>9</span></td>
<td class="datepickerSaturday"><span>10</span></td>
<td class="datepickerSunday"><span>11</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>42</span></th>
<td class=""><span>12</span></td>
<td class=""><span>13</span></td>
<td class=""><span>14</span></td>
<td class=""><span>15</span></td>
<td class=""><span>16</span></td>
<td class="datepickerSaturday"><span>17</span></td>
<td class="datepickerSunday"><span>18</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>43</span></th>
<td class=""><span>19</span></td>
<td class=""><span>20</span></td>
<td class=""><span>21</span></td>
<td class=""><span>22</span></td>
<td class=""><span>23</span></td>
<td class="datepickerSaturday"><span>24</span></td>
<td class="datepickerSunday"><span>25</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>44</span></th>
<td class=""><span>26</span></td>
<td class=""><span>27</span></td>
<td class=""><span>28</span></td>
<td class=""><span>29</span></td>
<td class=""><span>30</span></td>
<td class="datepickerSaturday"><span>31</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>1</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>45</span></th>
<td class="datepickerNotInMonth"><span>2</span></td>
<td class="datepickerNotInMonth"><span>3</span></td>
<td class="datepickerNotInMonth"><span>4</span></td>
<td class="datepickerNotInMonth"><span>5</span></td>
<td class="datepickerNotInMonth"><span>6</span></td>
<td class="datepickerNotInMonth datepickerSaturday"><span>7</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>8</span></td>
</tr>
</tbody>
The code should determine what date it is today and click on that day. I think that there is no need for month/year because the only view the program will see is the current month anyway. If your solution can provide a month-picker also, it would be great.
So we need the current date (for example: 8th, while the previous date was 5), the current day name, and the program needs to pick according to that.
Current efforts:
driver.find_element_by_xpath('//td[#class="datepickerSelected"]/a[text()="8"]').click()
But Selenium doesn't click on it.
I can't show you the entire code, or the website we are using it on because it is inside a login environment.
Use the following xpath to find the element.
driver.find_element_by_xpath('//td[#class="datepickerSelected"]/a[./span[text()="8"]]').click()
To get today's date, you can use datetime. See the docs for more info. Once you have it, you can insert the day into the locator and click the element.
There are a couple problems with your locator vs the HTML that you posted.
//td[#class="datepickerSelected"]/a[text()="8"]
This is looking for a TD that has a class "datepickerSelected" but it doesn't exist in the HTML you posted. I'm assuming that class only appears after you've selected a date but when you first enter the page, this won't be true so we can't use that class to locate the day we want.
The text() method finds text inside of the element specified, in this case an A tag. If you look at the HTML, the text is actually inside the SPAN child of the A tag. There are a couple ways to deal with this. You can change that part of the locator to be /a/span[text()="8"] or use . which "flattens" the text from all child elements, e.g. /a[.="8"]. Either way will work.
Another problem you will have to deal with is if the day is late or early in the month, then it shows up twice in the HTML, e.g. 2 or 28. To get the right one, you need to specify the day in the SPAN under a TD with an empty class. The wrong ones have a TD with the class datepickerNotInMonth.
Taking all this into account, here's the code I would use.
import datetime
today = datetime.datetime.now().day
driver.find_element_by_xpath(f'//td[#class=""]/a[.="{today}"]').click()
The locator finds a TD that contains an empty class that has a child A that contains (the flattened) text corresponding to today's day.
I am using selenium in python. I have come across this table webelement. I need to check if a string is present in the webelement and return a corresponding string in case its present.
<table width="700px" class="tableListGrid">
<thead>
<tr class="tableInfoTrBox">
<th>Date</th>
<th>Task Code</th>
<!-- th>Phone Number</th -->
<th>Fota Job</th>
<th colspan="2" class="thLineEnd">Task Description</th>
</tr>
</thead>
<tbody>
<tr class="tableTr_r">
<td>2018-04-06 05:48:29</td>
<td>FU</td>
<!-- td></td -->
<td>
57220180406-JSA69596727
</td>
<td style="text-align:left;">
updated from [A730FXXU1ARAB/A730FOJM1ARAB/A730FXXU1ARAB] to [A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9]
</td>
<td>
<table class="btnTypeE">
<tr>
<td>
View
</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
I need to search for "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9" in this element and return "57220180406-JSA69596727" which is present in same row at a different place in the web page. Is it possible to do in selenium ?
EDIT: Cleaned the code to only contain useful data.
It can be achieved by finding the element using the following Xpath:
//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a
Xpath can be read as
find td which contains "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9". Then find the td
preceding the found td and move to the a tag
After this you can get text using selenium
driver.find_element(By.XPATH, '//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a').text
To look out for a text e.g.A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9 and find out an associated text e.g. 57220180406-JSA69596727, you can write a function as follows :
def test_me(myString):
myText = driver.find_element_by_xpath("//table[#class='tableListGrid']//tbody/tr[#class='tableTr_r']//td[.='" + myString + "']//preceding::td[1]/a").get_attribute("innerHTML")
Now, from your main()/#Test you can call the function with the desired text as follows :
test_me("A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9")
i try get href from this table
<div class="squad-container">
<table class="table squad sortable" id="page_team_1_block_team_squad_8-table">
<thead>
<tr class="group-head">
<th colspan="4">Goalkeepers </th>
</tr>
</thead>
<tbody>
<tr>
<td style="width:50px;">Reda Sayed</td>
<td style="vertical-align: top;">
<div><a href="/474798/" >Reda Sayed</a></div>
<div style="padding-left: 27px;">25 years old</div>
</td>
</tr>
</tbody>
i use
response.xpath('//table[#class="table squad sortable"]//tr//td//a/#href').extract_first()
and didnt work with i need know what is the problem in code and what is different if i use double // or single slash
I don't think there is any problem with your xpath from we human's perspective. However, the xpath or css can be different from your spider's perspective, i.e. your spider may 'see' page differently.
Try using 'scrapy shell' to test your xpath or css and see if any data can be extracted. Here is the link to the doc in case you need: https://doc.scrapy.org/en/latest/topics/shell.html
To sum up: modify the xpath you wrote, 'cause your spider won't find any data with that xpath, and scrapy shell can help you.:)
How can I use beautiful soup and selectorgadget to scrape a website. For example I have a website - (a newegg product) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS) by this I mean - Intel, Desktop, ......, 2.4GHz, 1066Mhz, ...... , 3 years limited.
After using selectorgadget I get the string-
.desc
How do I use this?
Thanks :)
Inspecting the page, I can see that the specifications are placed in a div with the ID pcraSpecs:
<div id="pcraSpecs">
<script type="text/javascript">...</script>
<TABLE cellpadding="0" cellspacing="0" class="specification">
<TR>
<TD colspan="2" class="title">Model</TD>
</TR>
<TR>
<TD class="name">Brand</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Intel'));</script></TD>
</TR>
<TR>
<TD class="name">Processors Type</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Desktop'));</script></TD>
</TR>
...
</TABLE>
</div>
desc is the class of the table cells.
What you want to do is to extract the contents of this table.
soup.find(id="pcraSpecs").findAll("td") should get you started.
Have you tried using Feedity - http://feedity.com for creating a custom RSS feed from any webpage.