Parsing an HTML file with selectorgadget.com - python

How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a website (a Newegg product page) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS); by this I mean: Intel, Desktop, ......, 2.4GHz, 1066Mhz, ......, 3 years limited.
After using SelectorGadget I get the string:
.desc
How do I use this?
Thanks :)

Inspecting the page, I can see that the specifications are placed in a div with the ID pcraSpecs:
<div id="pcraSpecs">
<script type="text/javascript">...</script>
<TABLE cellpadding="0" cellspacing="0" class="specification">
<TR>
<TD colspan="2" class="title">Model</TD>
</TR>
<TR>
<TD class="name">Brand</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Intel'));</script></TD>
</TR>
<TR>
<TD class="name">Processors Type</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Desktop'));</script></TD>
</TR>
...
</TABLE>
</div>
desc is the class of the table cells.
What you want to do is to extract the contents of this table.
soup.find(id="pcraSpecs").findAll("td") should get you started.

Have you tried using Feedity (http://feedity.com) for creating a custom RSS feed from any webpage?

Related

Python Beautifulsoup traverse a table with particular text content in innerHTML then get contents until before a particular element

I have an HTML document with a lot of tables to traverse, like below:
<html>
.. omitted parts since I am interested on the HTML table..
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="labeltitle">
<tbody>
<tr>
<td class="labeltitle">
<font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
</td>
<td class="labelplain"> </td>
</tr>
</tbody>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
... omitted just to show the td that I am interested to scrape ...
<td class="labelplain"> Senator(s)</td>
<td class="labelplain">
<table>
<tbody>
<tr>
<td class="labelplain">VILLAR JR., MANUEL B.<br></td>
</tr>
</tbody>
</table>
</td>
...
<table>
<table>
... More tables like the table above (the one with VILLAR Jr.)
</table>
<table>
<tbody>
<tr>
<td class="labeltitle">
<table>
<tbody>
<tr>
<td class="labeltitle"> <font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
<td class="labelplain"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... more tables
</html>
The table I want to traverse is the one with a td of class "labeltitle" and a child font element whose text is "Floor Activity". For every table below it, I want to get the HTML code, stopping just before the table that has a td of class "labeltitle" with a child font whose text is "Vote(s)". I am trying with XPath like so:
table = dom.xpath("//table[8]/tbody/tr/td")
print (table)
but to no avail; I am getting empty arrays. Anything would do (with or without XPath).
I also tried the following:
rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')
This lets me reach the table with the content "Floor Activity", but the above code only gives me the content of that particular parent table; the exact output I am getting is below:
<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor
Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%">
</td>
</tr>
</table></td>
</tr>
I am also trying out Find next siblings until a certain one using beautifulsoup, because it seems to fit my use case, but I am getting the error "'NoneType' object has no attribute 'next_sibling'", which makes sense since the update2 script does not include the other tables, so the update2 code is out of the equation.
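For reference, here is a minimal sketch of the sibling-walking approach I am aiming for, assuming the outer table that wraps "Floor Activity" and the outer table that wraps "Vote(s)" are siblings at the same level (variable names are mine):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html = the page source

# outermost table that wraps the "Floor Activity" heading
start = soup.find("a", attrs={"name": "#jump_fa"}).find_parent("table").find_parent("table")

collected = []
for sibling in start.find_next_siblings("table"):
    # stop once we reach the table that wraps the "Vote(s)" heading
    if sibling.find("a", attrs={"name": "#jump_vote"}):
        break
    collected.append(str(sibling))

flooract = "\n".join(collected)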
My expected output for this is a JSON file (special characters escaped) like:
{"title": "<var>", "body": "<flooract>"}
where flooract is the HTML code of the tables, with special characters escaped. Sample snippet:
<table>\n<tbody>\n<tr>\n<td class=\"labelplain\"> Status Date<\/td><td class=\"labelplain\"> 10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\"> Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
Link to sample file here: https://issuances-library.senate.gov.ph/54629.html
I have attached an image of the site (Screenshot 3); I have circled in red the parts I want to get from the HTML file:

Use beautifulSoup to find a table after a header?

I am trying to scrape some data off a website. The data I want is listed in a table, but there are multiple tables and no IDs. I then had the idea of finding the header just above the table I am searching for and using that as an indicator.
This has really troubled me, so as a last resort I wanted to ask if anyone knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided beneath, thanks in advance :)
The table I am interested in, is the table right beneath <h2>Mine neaste vagter</h2>
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() with index slicing [-1] to select the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by its text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
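For completeness, here is a minimal runnable sketch of both approaches together (assuming html holds the page source shown above):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")  # html = the page source shown above

# Approach 1: the wanted table is the last <table class="list"> in the document
last_table = soup.find_all("table", class_="list")[-1]

# Approach 2: find the <h2> by its text, then jump forward to the next <table>
header = soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text)
table_after_header = header.find_next("table")

print(last_table is table_after_header)  # both should point at the same tag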
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.

How to extract the innerText of a <td> element with respect to the innerText of another <td> element

I am using Selenium in Python. I have come across this table WebElement. I need to check if a string is present in the element and return a corresponding string in case it is present.
<table width="700px" class="tableListGrid">
<thead>
<tr class="tableInfoTrBox">
<th>Date</th>
<th>Task Code</th>
<!-- th>Phone Number</th -->
<th>Fota Job</th>
<th colspan="2" class="thLineEnd">Task Description</th>
</tr>
</thead>
<tbody>
<tr class="tableTr_r">
<td>2018-04-06 05:48:29</td>
<td>FU</td>
<!-- td></td -->
<td>
57220180406-JSA69596727
</td>
<td style="text-align:left;">
updated from [A730FXXU1ARAB/A730FOJM1ARAB/A730FXXU1ARAB] to [A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9]
</td>
<td>
<table class="btnTypeE">
<tr>
<td>
View
</td>
</tr>
</table>
</td>
</tr>
</tbody>
</table>
I need to search for "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9" in this element and return "57220180406-JSA69596727", which is present in the same row at a different place on the web page. Is it possible to do this in Selenium?
EDIT: Cleaned the code to only contain useful data.
It can be achieved by finding the element using the following Xpath:
//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a
The XPath can be read as: find the td which contains "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9", then find the td preceding it and move to the a tag.
After this you can get the text using Selenium:
driver.find_element(By.XPATH, "//td[contains(., 'A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9')]/preceding-sibling::td[1]/a").text
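For illustration, a minimal runnable sketch of that lookup (the URL is a placeholder; since the Fota Job cell in the snippet above holds plain text rather than an a tag, this sketch targets the td itself):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://example.com/task-list")  # placeholder URL for the real page

target = "A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9"
# find the cell that contains the target text, then step back to the previous cell in the row
xpath = "//td[contains(., '" + target + "')]/preceding-sibling::td[1]"
job_id = driver.find_element(By.XPATH, xpath).text.strip()
print(job_id)  # expected: 57220180406-JSA69596727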
To look for a text, e.g. A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9, and find an associated text, e.g. 57220180406-JSA69596727, you can write a function as follows:
def test_me(myString):
    myText = driver.find_element_by_xpath("//table[@class='tableListGrid']//tbody/tr[@class='tableTr_r']//td[.='" + myString + "']//preceding::td[1]/a").get_attribute("innerHTML")
    return myText
Now, from your main()/@Test you can call the function with the desired text as follows:
test_me("A730FXXU2ARC9/A730FOJM2ARC1/A730FXXU2ARC9")

scrapy xpath return empty data from table

I am trying to get the href from this table:
<div class="squad-container">
<table class="table squad sortable" id="page_team_1_block_team_squad_8-table">
<thead>
<tr class="group-head">
<th colspan="4">Goalkeepers </th>
</tr>
</thead>
<tbody>
<tr>
<td style="width:50px;">Reda Sayed</td>
<td style="vertical-align: top;">
<div><a href="/474798/" >Reda Sayed</a></div>
<div style="padding-left: 27px;">25 years old</div>
</td>
</tr>
</tbody>
I use:
response.xpath('//table[@class="table squad sortable"]//tr//td//a/@href').extract_first()
and it didn't work. I need to know what the problem in my code is, and what the difference is between using a double slash (//) and a single slash (/).
I don't think there is any problem with your XPath from a human's perspective. However, the XPath or CSS can look different from your spider's perspective, i.e. your spider may 'see' the page differently.
Try using scrapy shell to test your XPath or CSS and see if any data can be extracted. Here is the link to the docs in case you need it: https://doc.scrapy.org/en/latest/topics/shell.html
To sum up: modify the XPath you wrote, because your spider won't find any data with that XPath, and scrapy shell can help you. :)
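For example, a quick way to test this interactively (the URL below is a placeholder for the page you are scraping):
scrapy shell "https://example.com/team-page"   # placeholder URL; run this in your terminal
# then, at the shell prompt, try the selector and inspect what the spider actually downloaded:
response.xpath('//table[@class="table squad sortable"]//tr//td//a/@href').get()
view(response)  # opens the downloaded page in a browser so you can compare it with the live site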

Getting exact HTML of an xpath from selenium python driver

Basically I am trying to get an HTML table from a page to save in a database. I am getting the element with:
data = browser.find_element_by_xpath('/html/body/table[2]')
Here is the HTML that I am looking for:
<table class="general" cellspacing="2" cellpadding="3" border="0" width="550">
<tbody>
<tr class="generaltitle">
<td colspan="3">I am a basic title</td>
</tr>
<tr valign="top"><td colspan="3"><hr></td></tr>
</tbody></table>
But when I call this it gives me a raw result like this:
"I am a basic title\n \n\n "
<br>
<hr>
I am calling it like this: data.get_attribute('innerHTML')
How to get the exact HTML?
Perhaps you were looking for data.get_attribute('outerHTML')
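A minimal sketch of the difference (assuming browser is the WebDriver instance from the question):
data = browser.find_element_by_xpath('/html/body/table[2]')
inner = data.get_attribute('innerHTML')  # markup inside the <table> element only
outer = data.get_attribute('outerHTML')  # markup including the <table> element itself
print(outer)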
