Structuring a table using Scrapy Data - python

I have a website that contains tables (trs and tds). I want to create a structured CSV file from the table data. I'm trying to create field names from the scraped table as those field names can change depending upon the month or selections.
While I have been successful at iterating through the table and scraping the data I want to use as my field names, I have yet to figure out how to yield that data into the CSV file.
Right now I have it scraped into an Item named "h1header", and when yielded to a CSV file the values appear as rows under that one item key "h1header", like so:
Project Owning Org
Project Date Range
Fee Factor
Project Organization
Project Manager
Fee Calculation Method
Project Code
Project Lead
Status
Project Title
Total Project Value
Condition
External System Code
Funded Value
Billing Type
What I would ultimately like is the following:
Project Owning Org, Project Date Range, Fee Factor, Project Organization ...etc
so that instead of rows they become columns, and I can then populate them, as field values under those columns, with the data from the multiple tables on the page that share the same h1header.
Below is an example of the html that I'm scraping. This particular tbody.h1 repeats multiple times on the page depending on the results.
<table class="report">
<tbody class="h1"><tr><td colspan="22">
<table class="report" >
<tbody class="h1">
<tr>
<td class="label">Project Owning Organization:</td><td>1.02.10</td>
<td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
<td class="label">Fee Factor:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Organization:</td><td>1.2.26.1</td>
<td class="label">Project Manager:</td><td>Smith, John</td>
<td class="label">Fee Calculation Method:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green"
title="Green"></td>
</tr>
<tr>
<td class="label">External System Code:</td><td>—</td>
<td class="label">Funded Value:</td><td>1,438.00</td>
<td class="label">Billing Type:</td><td>FP</td>
</tr>
</tbody>
There are other tables within this html (tbody.h1 and tbody.detail) where I will then need to append columns to the above.
I've done this in Java with Beautiful Soup by creating and writing to arrays, then exporting those arrays as CSV files. Getting the data with Scrapy in Python is FAR easier than it was in Java. I'm sure I'm overcomplicating this, but I'm stuck trying to figure it out, so any guidance would be appreciated!

Try this.
from simplified_scrapy import SimplifiedDoc, req, utils
html = '''
<table class="report">
<tbody class="h1"><tr><td colspan="22">
<table class="report" >
<tbody class="h1">
<tr>
<td class="label">Project Owning Organization:</td><td>1.02.10</td>
<td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
<td class="label">Fee Factor:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Organization:</td><td>1.2.26.1</td>
<td class="label">Project Manager:</td><td>Smith, John</td>
<td class="label">Fee Calculation Method:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green"
title="Green"></td>
</tr>
<tr>
<td class="label">External System Code:</td><td>—</td>
<td class="label">Funded Value:</td><td>1,438.00</td>
<td class="label">Billing Type:</td><td>FP</td>
</tr>
</tbody>
</table>
</tbody>
</table>
'''
# html = req.get('your url')
# html = utils.getFileContent('your file path')
# header = []
rows = []
doc = SimplifiedDoc(html)
tds = doc.selects('table.report>table.report>td')
row = []
for i in range(0, len(tds), 2):
    # header.append(tds[i].text.strip(':'))
    row.append(tds[i + 1].text)
# rows.append(header)
rows.append(row)
utils.save2csv('test.csv', rows, mode='a')
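If you would rather stay inside a regular Scrapy spider, the same pair-up-and-transpose idea looks roughly like this (a sketch, untested against the live page; it assumes response is the page's Response, that the label and value cells strictly alternate as in the snippet, and it uses the stdlib csv module):
import csv

def parse(self, response):
    # Every td inside the inner report table, in document order.
    cells = response.css('table.report table.report td')
    headers, values = [], []
    # Labels sit at even positions, their values immediately after.
    for label, value in zip(cells[::2], cells[1::2]):
        headers.append(label.css('::text').get('').strip().rstrip(':'))
        values.append(' '.join(value.css('::text').getall()).strip())
    with open('test.csv', 'w', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(headers)  # one header row...
        writer.writerow(values)   # ...and the values as columns beneath it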

dabingsou, thank you for the inspiration. While your solution didn't work with my code, your idea of using a CSV utility other than the one bundled with Scrapy was exactly what I was looking for!
My code was very similar to yours, apart from your MUCH simpler approach of looping through the headers only once. Below is the code I used; I then simply added the csv package to write the file perfectly. This code uses Scrapy rather than simplified_scrapy and lets me scrape the page using scrapy-splash.
import csv

def h1header_scrape(self, response):
    td_labels = response.css('tbody.h1 td.label')
    h1headers = []
    # Index of the last label, subtracting the duplicate "Project
    # Organization" label picked up from the first tr in the css.
    lastitemnum = len(td_labels) - 1
    # The last header name in h1, used to stop the for loop below.
    lastheader = response.css('tbody.h1 td.label::text')[lastitemnum].get()
    for td_label in td_labels:
        columns = td_label.css('td.label')
        header = columns.css('.label::text').get()
        h1header = columns.css('.label::text').get().replace(":", "")
        h1headers.append(h1header)
        if header == lastheader:
            break
    print(h1headers)
    with open('testfile.csv', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(h1headers)
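For the value cells, a similar pass works. Here is a rough companion sketch (h1values_scrape is a hypothetical name; it assumes the values are the tds in the same tbody without the "label" class, and that the not(@colspan) guard is enough to skip the colspan="22" wrapper cell; if the h1 block repeats, you would chunk the list by the header count before writing):
def h1values_scrape(self, response):
    values = [
        td.xpath('normalize-space(.)').get()
        for td in response.xpath('//tbody[@class="h1"]//td[not(@class="label") and not(@colspan)]')
    ]
    with open('testfile.csv', 'a', newline='') as file:  # append under the header row
        csv.writer(file).writerow(values)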

Related

Python Beautifulsoup traverse a table with particular text content in innerHTML then get contents until before a particular element

I have an HTML document with lots of tables to traverse, like the one below:
<html>
.. parts omitted, since I am interested in the HTML table ..
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="labeltitle">
<tbody>
<tr>
<td class="labeltitle">
<font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
</td>
<td class="labelplain"> </td>
</tr>
</tbody>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
... omitted, just to show the td that I am interested in scraping ...
<td class="labelplain"> Senator(s)</td>
<td class="labelplain">
<table>
<tbody>
<tr>
<td class="labelplain">VILLAR JR., MANUEL B.<br></td>
</tr>
</tbody>
</table>
</td>
...
<table>
<table>
... More tables like the table above (the one with VILLAR Jr.)
</table>
<table>
<tbody>
<tr>
<td class="labeltitle">
<table>
<tbody>
<tr>
<td class="labeltitle"> <font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
<td class="labelplain"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... more tables
</html>
The table I want to start from is the one whose td with class "labeltitle" has a font child containing the text "Floor Activity". From every table below it, I want to get the HTML, stopping just before the table whose td class="labeltitle" has a font child whose text content is "Vote(s)". I am trying with xpath like so:
table = dom.xpath("//table[8]/tbody/tr/td")
print (table)
but to no avail; I am getting empty lists. Anything would do (e.g. with or without xpath).
I also tried the following:
rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')
With it I am able to reach the table containing "Floor Activity", but it only gives me the content of that particular parent table; the exact output I am getting is below:
<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor
Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%">
</td>
</tr>
</table></td>
</tr>
I am also trying this one, Find next siblings until a certain one using beautifulsoup, because it seems to fit my use case, but I am getting the error "'NoneType' object has no attribute 'next_sibling'". That is to be expected, since the update2 script there does not cover the other tables, so the update2 code is out of the equation.
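For what it's worth, the sibling-walk idea can be written defensively. A sketch with BeautifulSoup, assuming the nesting shown in the snippet above (each jump anchor sits two table levels deep):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')

# Outermost table holding the "Floor Activity" header...
start = soup.find('a', attrs={'name': '#jump_fa'}).find_parent('table').find_parent('table')
# ...and the one holding "Vote(s)", where we stop.
stop = soup.find('a', attrs={'name': '#jump_vote'}).find_parent('table').find_parent('table')

parts = []
node = start.find_next_sibling('table')
while node is not None and node is not stop:
    parts.append(str(node))
    node = node.find_next_sibling('table')

flooract = ''.join(parts)  # HTML of every table between the two markers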
My expected output for this is a json file (special characters are escaped) like:
{"title":' + '"' + str(var) + '"' + ',"body":" + flooract + ' + "`}
*where flooract is the html code of the tables with special characters escaped. Sample snippet:
<table>\n<tbody>\n<tr>\n<td class=\"labelplain\"> Status Date<\/td><td class=\"labelplain\"> 10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\"> Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
Link to sample file here: https://issuances-library.senate.gov.ph/54629.html

Need a dynamic python selenium way of picking an element by xpath

This is the HTML it needs to pick from:
<tbody class="datepickerDays">
<tr>
<th class="datepickerWeek"><span>40</span></th>
<td class="datepickerNotInMonth"><span>28</span></td>
<td class="datepickerNotInMonth"><span>29</span></td>
<td class="datepickerNotInMonth"><span>30</span></td>
<td class=""><span>1</span></td>
<td class=""><span>2</span></td>
<td class="datepickerSaturday"><span>3</span></td>
<td class="datepickerSunday"><span>4</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>41</span></th>
<td class=""><span>5</span></td>
<td class=""><span>6</span></td>
<td class=""><span>7</span></td>
<td class="datepickerSelected"><span>8</span></td>
<td class=""><span>9</span></td>
<td class="datepickerSaturday"><span>10</span></td>
<td class="datepickerSunday"><span>11</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>42</span></th>
<td class=""><span>12</span></td>
<td class=""><span>13</span></td>
<td class=""><span>14</span></td>
<td class=""><span>15</span></td>
<td class=""><span>16</span></td>
<td class="datepickerSaturday"><span>17</span></td>
<td class="datepickerSunday"><span>18</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>43</span></th>
<td class=""><span>19</span></td>
<td class=""><span>20</span></td>
<td class=""><span>21</span></td>
<td class=""><span>22</span></td>
<td class=""><span>23</span></td>
<td class="datepickerSaturday"><span>24</span></td>
<td class="datepickerSunday"><span>25</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>44</span></th>
<td class=""><span>26</span></td>
<td class=""><span>27</span></td>
<td class=""><span>28</span></td>
<td class=""><span>29</span></td>
<td class=""><span>30</span></td>
<td class="datepickerSaturday"><span>31</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>1</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>45</span></th>
<td class="datepickerNotInMonth"><span>2</span></td>
<td class="datepickerNotInMonth"><span>3</span></td>
<td class="datepickerNotInMonth"><span>4</span></td>
<td class="datepickerNotInMonth"><span>5</span></td>
<td class="datepickerNotInMonth"><span>6</span></td>
<td class="datepickerNotInMonth datepickerSaturday"><span>7</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>8</span></td>
</tr>
</tbody>
The code should determine today's date and click on that day. I think there is no need for month/year handling because the only view the program will see is the current month anyway. If your solution can provide a month picker as well, that would be great.
So we need today's day of the month (for example 8, where the previously selected day was 5) and the current day name, and the program needs to pick accordingly.
Current efforts:
driver.find_element_by_xpath('//td[@class="datepickerSelected"]/a[text()="8"]').click()
But Selenium doesn't click on it.
I can't show you the entire code, or the website we are using it on because it is inside a login environment.
Use the following xpath to find the element.
driver.find_element_by_xpath('//td[@class="datepickerSelected"]/a[./span[text()="8"]]').click()
To get today's date, you can use datetime. See the docs for more info. Once you have it, you can insert the day into the locator and click the element.
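Putting those two pieces together, a minimal sketch (the same locator as above, with today's day substituted in):
import datetime

day = datetime.datetime.now().day
driver.find_element_by_xpath(f'//td[@class="datepickerSelected"]/a[./span[text()="{day}"]]').click()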
There are a couple problems with your locator vs the HTML that you posted.
//td[@class="datepickerSelected"]/a[text()="8"]
This is looking for a TD that has the class "datepickerSelected". That class only marks the day that is already selected (the 8 in your HTML); when you first enter the page it won't necessarily correspond to today, so we can't use that class to locate the day we want.
The text() method finds text inside of the element specified, in this case an A tag. If you look at the HTML, the text is actually inside the SPAN child of the A tag. There are a couple ways to deal with this. You can change that part of the locator to be /a/span[text()="8"] or use . which "flattens" the text from all child elements, e.g. /a[.="8"]. Either way will work.
Another problem you will have to deal with is if the day is late or early in the month, then it shows up twice in the HTML, e.g. 2 or 28. To get the right one, you need to specify the day in the SPAN under a TD with an empty class. The wrong ones have a TD with the class datepickerNotInMonth.
Taking all this into account, here's the code I would use.
import datetime

today = datetime.datetime.now().day
driver.find_element_by_xpath(f'//td[@class=""]/a[.="{today}"]').click()
The locator finds a TD that contains an empty class that has a child A that contains (the flattened) text corresponding to today's day.
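One caveat: the find_element_by_* helpers were removed in Selenium 4, so on a current version the same click would look like this (same locator, newer API):
import datetime

from selenium.webdriver.common.by import By

today = datetime.datetime.now().day
driver.find_element(By.XPATH, f'//td[@class=""]/a[.="{today}"]').click()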

Scrapy: how to make a request to scrape a link inside a page

I already asked a question about this website, but I think I got ahead of myself and now I'm stuck.
The structure of the website is something like this:
<table>
<tr>
<td class="header" colspan="2">something</td>
</tr>
</table>
<br/>
<table>
<tr>
<td class="header" colspan="2">something2</td>
</tr>
</table>
<br/>
<table>
<tr>
<td class="header" colspan="2">something3</td>
</tr>
</table>
But inside one of those tables there is a list of members, and I need to extract the profile information of each member. Each profile is variable, so the table with its information changes depending on the privacy settings.
The table I need to scrape is something like this, but with many members:
<table>
<tr>
<td colspan="4" class="header">members</td>
</tr>
<tr>
<td class="title">Name</td>
<td class="title">position</td>
<td class="title">hours</td>
<td class="title">observ</td>
</tr>
<tr>
<td class="c1">
1.- Homer Simpson
</td>
<td class="c1">
safety inspector
</td>
<td class="c1">
10
</td>
<td class="c1">
Neglect his duties
</td>
</tr>
</table>
I already have most of the code to extract the information from the tables, but I do not understand how to write the function that extracts the information from each member's profile.
My spider is defined this way:
class Scraper(scrapy.Spider):
    name = 'scraper'
    start_urls = ['somesite.com']
    rules = {
        # Rule to extract profile info
        Rule(LinkExtractor(allow=(), restrict_xpaths=('/table[6]//tr/td[1]')),
             callback='parse_member', follow=False)
    }

    def parse(self, response):
        # logic to scrape each table

    def parse_member(self, response):
        # logic to scrape each profile for every member
But when I run the spider, I only get the results of scraping each table on the main page; I do not get the data for each user profile.
How can I follow the link for each user profile and scrape the data inside, without breaking the code that scrapes the tables on the main page?
I think you don't need a Rule at all, you could do something like:
def parse(self, response):
    tables = response.xpath('//table[./tr/td[contains(text(), "members")]]')
    for table in tables:
        for href in table.css('tr td a::attr(href)').getall():
            yield response.follow(href, callback=self.parse_member_profile)

def parse_member_profile(self, response):
    ...
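Note that response.follow resolves relative URLs against the page URL (and can even take the ::attr(href) selectors directly), which is why it is a safer choice here than constructing Request objects by hand; if you do build them yourself, wrap each URL in response.urljoin(href) first.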

How to extract elements from html with BeautifulSoup

I am beginning to learn Python and would like to try using BeautifulSoup to extract the elements in the HTML below.
This HTML is taken from a voice recording system that logs the time and date in local time and UTC, the call duration, the called number and name, the calling number and name, etc.
There are usually hundreds of these entries.
What I am attempting to do is extract the elements and print them on one line in a comma-delimited format, in order to compare them with the call detail records from the call manager. This will help verify that all calls were recorded and none were missed.
I believe BeautifulSoup is the right tool for this.
Could someone point me in the right direction?
<tbody>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight"> </td>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight"> </td>
</tr>
</tbody>
pandas.read_html() would make things much easier: it converts tabular data from an HTML table into a dataframe which, if needed, you can later dump into CSV.
Here is a sample code to get you started:
import pandas as pd
data = """
<table>
<thead>
<tr>
<th>Date</th>
<th>Name</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">Joe Smith</td>
<td class="formRowLight">1432875648934</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">Billy Bob</td>
<td class="formRowLight">9934959586849</td>
</tr>
</tbody>
</table>"""
df = pd.read_html(data)[0]
print(df.to_csv(index=False))
Prints:
Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
FYI, read_html() actually uses BeautifulSoup to parse the HTML under the hood.
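And if you want the CSV on disk instead of printed, to_csv also takes a path (the filename here is just an example):
df.to_csv('calls.csv', index=False)  # writes the same CSV to a file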
# Python 2 / BeautifulSoup 3 style:
import BeautifulSoup
import urllib2

request = urllib2.Request(your_url)  # your_url: the page to fetch
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
mylist = []
rows = soup.findAll('tr', {"class": "formRowLight"})
for row in rows:
    text = row.findNext('td', {"class": "formRowLight"}).text
    mylist.append(text)
print mylist
But you need to edit this code a little to prevent any duplicated content.
Yes, BeautifulSoup is a good tool to reach for in this problem. Something to get you started would be as follows:
from bs4 import BeautifulSoup

with open("my_log.html") as log_file:
    html = log_file.read()

# Normally you specify a parser too, e.g. BeautifulSoup(html, 'lxml');
# without one, it will warn you and select a parser automatically.
soup = BeautifulSoup(html)

table_rows = soup.find_all("tr")  # list of all <tr> tags
for row in table_rows:
    table_cells = row.find_all("td")  # list of all <td> tags in this row
    joined_text = ",".join(cell.get_text() for cell in table_cells)
    print(joined_text)
However, pandas's read_html may make this a bit more seamless, as mentioned in another answer to this question. Arguably pandas may be a better hammer to hit this nail with, but learning to use BeautifulSoup for this will also give you the skills to scrape all kinds of HTML in the future.
First, get a list of the HTML elements; to do that, follow Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag element.
Then run the following over that list; it will fetch the text of every element you want:
for element in html_list:
    output = soup.select(element)[0].text
    print("%s," % output)
Hope that helps!

Parsing an HTML file with selectorgadget.com

How can I use Beautiful Soup and SelectorGadget to scrape a website? For example, I have a website (a Newegg product page) and I would like my script to return all of the specifications of that product (click on SPECIFICATIONS); by this I mean: Intel, Desktop, ......, 2.4GHz, 1066Mhz, ......, 3 years limited.
After using SelectorGadget I get the string:
.desc
How do I use this?
Thanks :)
Inspecting the page, I can see that the specifications are placed in a div with the ID pcraSpecs:
<div id="pcraSpecs">
<script type="text/javascript">...</script>
<TABLE cellpadding="0" cellspacing="0" class="specification">
<TR>
<TD colspan="2" class="title">Model</TD>
</TR>
<TR>
<TD class="name">Brand</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Intel'));</script></TD>
</TR>
<TR>
<TD class="name">Processors Type</TD>
<TD class="desc"><script type="text/javascript">document.write(neg_specification_newline('Desktop'));</script></TD>
</TR>
...
</TABLE>
</div>
desc is the class of the table cells.
What you want to do is to extract the contents of this table.
soup.find(id="pcraSpecs").findAll("td") should get you started.
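Building on that, a rough sketch (it assumes the HTML above; since the visible values are emitted by document.write(...) calls, it pulls them out of the script text with a regex, where neg_specification_newline comes from the snippet, not from me):
import re

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for row in soup.find(id="pcraSpecs").findAll("tr"):
    name = row.find("td", class_="name")
    desc = row.find("td", class_="desc")
    if name and desc:
        # Value is wrapped as document.write(neg_specification_newline('...')).
        m = re.search(r"neg_specification_newline\('([^']*)'\)", desc.get_text())
        value = m.group(1) if m else desc.get_text(strip=True)
        print(name.get_text(strip=True), value)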
Have you tried using Feedity (http://feedity.com) for creating a custom RSS feed from any webpage?
