This is the HTML it needs to pick from:
<tbody class="datepickerDays">
<tr>
<th class="datepickerWeek"><span>40</span></th>
<td class="datepickerNotInMonth"><span>28</span></td>
<td class="datepickerNotInMonth"><span>29</span></td>
<td class="datepickerNotInMonth"><span>30</span></td>
<td class=""><span>1</span></td>
<td class=""><span>2</span></td>
<td class="datepickerSaturday"><span>3</span></td>
<td class="datepickerSunday"><span>4</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>41</span></th>
<td class=""><span>5</span></td>
<td class=""><span>6</span></td>
<td class=""><span>7</span></td>
<td class="datepickerSelected"><span>8</span></td>
<td class=""><span>9</span></td>
<td class="datepickerSaturday"><span>10</span></td>
<td class="datepickerSunday"><span>11</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>42</span></th>
<td class=""><span>12</span></td>
<td class=""><span>13</span></td>
<td class=""><span>14</span></td>
<td class=""><span>15</span></td>
<td class=""><span>16</span></td>
<td class="datepickerSaturday"><span>17</span></td>
<td class="datepickerSunday"><span>18</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>43</span></th>
<td class=""><span>19</span></td>
<td class=""><span>20</span></td>
<td class=""><span>21</span></td>
<td class=""><span>22</span></td>
<td class=""><span>23</span></td>
<td class="datepickerSaturday"><span>24</span></td>
<td class="datepickerSunday"><span>25</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>44</span></th>
<td class=""><span>26</span></td>
<td class=""><span>27</span></td>
<td class=""><span>28</span></td>
<td class=""><span>29</span></td>
<td class=""><span>30</span></td>
<td class="datepickerSaturday"><span>31</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>1</span></td>
</tr>
<tr>
<th class="datepickerWeek"><span>45</span></th>
<td class="datepickerNotInMonth"><span>2</span></td>
<td class="datepickerNotInMonth"><span>3</span></td>
<td class="datepickerNotInMonth"><span>4</span></td>
<td class="datepickerNotInMonth"><span>5</span></td>
<td class="datepickerNotInMonth"><span>6</span></td>
<td class="datepickerNotInMonth datepickerSaturday"><span>7</span></td>
<td class="datepickerNotInMonth datepickerSunday"><span>8</span></td>
</tr>
</tbody>
The code should determine what date it is today and click on that day. I think that there is no need for month/year because the only view the program will see is the current month anyway. If your solution can provide a month-picker also, it would be great.
So we need the current day of the month (for example 8, where the previously selected day was 5) and possibly the day name, and the program needs to click the matching cell.
Current efforts:
driver.find_element_by_xpath('//td[@class="datepickerSelected"]/a[text()="8"]').click()
But Selenium doesn't click on it.
I can't show you the entire code, or the website we are using it on because it is inside a login environment.
Use the following xpath to find the element.
driver.find_element_by_xpath('//td[@class="datepickerSelected"]/a[./span[text()="8"]]').click()
To get today's date, you can use datetime. See the docs for more info. Once you have it, you can insert the day into the locator and click the element.
There are a couple problems with your locator vs the HTML that you posted.
//td[@class="datepickerSelected"]/a[text()="8"]
This is looking for a TD that has a class "datepickerSelected" but it doesn't exist in the HTML you posted. I'm assuming that class only appears after you've selected a date but when you first enter the page, this won't be true so we can't use that class to locate the day we want.
The text() method finds text inside of the element specified, in this case an A tag. If you look at the HTML, the text is actually inside the SPAN child of the A tag. There are a couple ways to deal with this. You can change that part of the locator to be /a/span[text()="8"] or use . which "flattens" the text from all child elements, e.g. /a[.="8"]. Either way will work.
Another problem you will have to deal with is if the day is late or early in the month, then it shows up twice in the HTML, e.g. 2 or 28. To get the right one, you need to specify the day in the SPAN under a TD with an empty class. The wrong ones have a TD with the class datepickerNotInMonth.
Taking all this into account, here's the code I would use.
import datetime
today = datetime.datetime.now().day
driver.find_element_by_xpath(f'//td[@class=""]/a[.="{today}"]').click()
The locator finds a TD that contains an empty class that has a child A that contains (the flattened) text corresponding to today's day.
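A caveat with the empty-class locator: in the posted calendar, weekend cells carry datepickerSaturday or datepickerSunday and the chosen day carries datepickerSelected, so @class="" would miss today whenever it lands on one of those cells. Below is a minimal sketch of a looser rule that only excludes datepickerNotInMonth. The helper names and the trimmed sample markup (including the A wrapper this answer assumes) are illustrative assumptions. The offline check uses the standard library's ElementTree and does the class test in Python, because its XPath subset has no contains().

```python
import datetime
import xml.etree.ElementTree as ET


def in_month_day_xpath(day):
    """Build a locator for the in-month cell showing `day` (hypothetical helper)."""
    return (
        '//td[not(contains(@class, "datepickerNotInMonth"))]'
        f'/a[.="{day}"]'
    )


# With Selenium 4 the string could be passed to
# driver.find_element(By.XPATH, locator); no driver is created here.
locator = in_month_day_xpath(datetime.date.today().day)

# Trimmed, made-up copy of the calendar: days 28, 2 and 3 each appear twice.
SAMPLE = """<tbody>
<tr>
<td class="datepickerNotInMonth"><a><span>28</span></a></td>
<td class=""><a><span>2</span></a></td>
<td class="datepickerSaturday"><a><span>3</span></a></td>
</tr>
<tr>
<td class=""><a><span>28</span></a></td>
<td class="datepickerNotInMonth"><a><span>2</span></a></td>
<td class="datepickerNotInMonth"><a><span>3</span></a></td>
</tr>
</tbody>"""


def in_month_cells(root, day):
    """Apply the same rule as the locator, but in plain Python."""
    matches = []
    for td in root.iter("td"):
        classes = td.get("class", "").split()
        if "datepickerNotInMonth" in classes:
            continue  # cell belongs to the previous or next month
        if "".join(td.itertext()).strip() == str(day):
            matches.append(td)
    return matches


root = ET.fromstring(SAMPLE)
saturday_cells = in_month_cells(root, 3)
plain_cells = in_month_cells(root, 28)
```

The Saturday cell for day 3 is found even though its class is not empty, which is the case the empty-class locator would skip.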
I have an html with a lots of table to traverse to like below:
<html>
.. omitted parts since I am interested on the HTML table..
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td class="labeltitle">
<tbody>
<tr>
<td class="labeltitle">
<font color="FFD700">Floor Activity<a name="#jump_fa"></a></font>
</td>
<td class="labelplain"> </td>
</tr>
</tbody>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
<table>
... omitted just to show the td that I am interested to scrape ...
<td class="labelplain"> Senator(s)</td>
<td class="labelplain">
<table>
<tbody>
<tr>
<td class="labelplain">VILLAR JR., MANUEL B.<br></td>
</tr>
</tbody>
</table>
</td>
...
<table>
<table>
... More tables like the table above (the one with VILLAR Jr.)
</table>
<table>
<tbody>
<tr>
<td class="labeltitle">
<table>
<tbody>
<tr>
<td class="labeltitle"> <font color="FFD700">Vote(s)<a name="#jump_vote"></a></font></td>
<td class="labelplain"> </td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
... more tables
</html>
The table I want to start from is the one containing a td with class "labeltitle" whose child "font" element has the text "Floor Activity". I want the HTML of every table below it, stopping before the table that has a td with class "labeltitle" whose child "font" has the text "Vote(s)". I am trying XPath like so:
table = dom.xpath("//table[8]/tbody/tr/td")
print (table)
but to no avail: I am getting empty lists. Any approach would do (with or without XPath).
I also tried the following:
rows = soup.find('a', attrs={'name' :'#jump_fa'}).find_parent('table').find_parent('table')
I am able to traverse the table with content "Floor Activity". The abovementioned code only gives me the content of the table for that particular parent, exact output I am getting below:
<tr>
<td class="labeltitle" height="22"><table border="0" cellpadding="0" cellspacing="0" width="100%">
<tr>
<td class="labeltitle" width="50%"> <font color="FFD700">Floor
Activity<a name="#jump_fa"></a></font></td>
<td align="right" class="labelplain" width="50%">
</td>
</tr>
</table></td>
</tr>
I am also trying the approach from "Find next siblings until a certain one using beautifulsoup" because it seems to fit my use case, but I get the error "'NoneType' object has no attribute 'next_sibling'". That is expected, since the update2 script does not include the other tables, so the update2 code is out of the equation.
My expected output for this is a json file (special characters are escaped) like:
{"title":' + '"' + str(var) + '"' + ',"body":" + flooract + ' + "`}
*where flooract is the html code of the tables with special characters escaped. Sample snippet:
<table>\n<tbody>\n<tr>\n<td class=\"labelplain\"> Status Date<\/td><td class=\"labelplain\"> 10/12/2005<\/td>\n<\/tr>\n<tr><td class=\"labelplain\"> Parliamentary Status<\/td>\n<td class=\"labelplain\"><table>\n<tbody><tr>\n<td class="labelplain">SPONSORSHIP SPEECH<br>...Until Period of Committee Amendments
Link to sample file here: https://issuances-library.senate.gov.ph/54629.html
I have attached an image of the site:
Screenshot 3, I have encircled in red lines what I only wanted to get from the HTML file:
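One possible way to frame the "everything between two marker tables" step, sketched on a simplified, well-formed sample with the standard library. The sample markup, the function name tables_between and the "Floor Activity" title are assumptions for illustration. The real page is not well-formed XML and nests the marker inside extra tables, so a lenient parser (BeautifulSoup or lxml) and the right parent element would be needed there, but the same sibling loop applies.

```python
import json
import xml.etree.ElementTree as ET

# Simplified stand-in for the page: sibling tables, two of them markers.
SAMPLE = """<body>
<table><tr><td class="labeltitle"><font>Floor Activity<a name="#jump_fa"/></font></td></tr></table>
<table><tr><td class="labelplain">Status Date</td><td class="labelplain">10/12/2005</td></tr></table>
<table><tr><td class="labelplain">Senator(s)</td><td class="labelplain">VILLAR JR., MANUEL B.</td></tr></table>
<table><tr><td class="labeltitle"><font>Vote(s)<a name="#jump_vote"/></font></td></tr></table>
<table><tr><td class="labelplain">after votes</td></tr></table>
</body>"""


def tables_between(parent, start_name, stop_name):
    """Collect sibling tables after the start marker and before the stop marker."""
    collected = []
    inside = False
    for table in parent.findall("table"):
        anchor_names = {a.get("name") for a in table.iter("a")}
        if stop_name in anchor_names:
            break  # reached the "Vote(s)" table
        if inside:
            collected.append(table)
        if start_name in anchor_names:
            inside = True  # start collecting from the next table
    return collected


root = ET.fromstring(SAMPLE)
tables = tables_between(root, "#jump_fa", "#jump_vote")
flooract = "".join(ET.tostring(t, encoding="unicode") for t in tables)

# json.dumps escapes quotes and newlines, so no manual string building is needed.
record = json.dumps({"title": "Floor Activity", "body": flooract})
```

Letting json.dumps build the record avoids assembling the JSON by hand with string concatenation.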
I have a website that contains tables (trs and tds). I want to create a structured CSV file from the table data. I'm trying to create field names from the scraped table as those field names can change depending upon the month or selections.
While I have been successful at iterating through the table and actually scraping the data I want to use as my field names I have yet to figure out how to yield that data into the CSV file.
Right now I have them scraped into an Item named "h1header" and when yielded to a CSV file they appear as rows under that item key "h1header" so:
Project Owning Org
Project Date Range
Fee Factor
Project Organization
Project Manager
Fee Calculation Method
Project Code
Project Lead
Status
Project Title
Total Project Value
Condition
External System Code
Funded Value
Billing Type
What I would ultimately like is the following:
Project Owning Org, Project Date Range, Fee Factor, Project Organization ...etc
so instead of rows they are columns and then I can populate the multiple tables on the page that are formatted with the same h1header with the data as field values of those columns.
Below is an example of the html that I'm scraping. This particular tbody.h1 repeats multiple times on the page depending on the results.
<table class="report">
<tbody class="h1"><tr><td colspan="22">
<table class="report" >
<tbody class="h1">
<tr>
<td class="label">Project Owning Organization:</td><td>1.02.10</td>
<td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
<td class="label">Fee Factor:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Organization:</td><td>1.2.26.1</td>
<td class="label">Project Manager:</td><td>Smith, John</td>
<td class="label">Fee Calculation Method:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green"
title="Green"></td>
</tr>
<tr>
<td class="label">External System Code:</td><td>—</td>
<td class="label">Funded Value:</td><td>1,438.00</td>
<td class="label">Billing Type:</td><td>FP</td>
</tr>
</tbody>
There are other tables within this html (tbody.h1 and tbody.detail) where I will then need to append columns to the above.
I've done this in Java using Beautiful Soup by creating and writing to arrays then ultimately exporting those built arrays as csv files. Python Scrapy is FAR easier to get the data than Java was and I'm sure I'm over complicating this but am stuck trying to figure it out so any guidance would be appreciated!
Try this.
from simplified_scrapy import SimplifiedDoc, req, utils
html = '''
<table class="report">
<tbody class="h1"><tr><td colspan="22">
<table class="report" >
<tbody class="h1">
<tr>
<td class="label">Project Owning Organization:</td><td>1.02.10</td>
<td class="label">Project Date Range:</td><td>8/12/2020 - 8/11/2021</td>
<td class="label">Fee Factor:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Organization:</td><td>1.2.26.1</td>
<td class="label">Project Manager:</td><td>Smith, John</td>
<td class="label">Fee Calculation Method:</td><td>—</td>
</tr>
<tr>
<td class="label">Project Code:</td><td>PROJECT.001</td>
<td class="label">Project Lead:</td><td>Doe, Jane</td>
<td class="label">Status:</td><td>Backlog</td>
</tr>
<tr>
<td class="label">Project Title:</td><td>Scrapy Project</td>
<td class="label">Total Project Value:</td><td>1,438.00</td>
<td class="label">Condition:</td><td>Green<img src="/images/status_green.png" alt="Green"
title="Green"></td>
</tr>
<tr>
<td class="label">External System Code:</td><td>—</td>
<td class="label">Funded Value:</td><td>1,438.00</td>
<td class="label">Billing Type:</td><td>FP</td>
</tr>
</tbody>
</table>
</tbody>
</table>
'''
# html = req.get('your url')
# html = utils.getFileContent('your file path')
# header = []
rows = []
doc = SimplifiedDoc(html)
tds = doc.selects('table.report>table.report>td')
row = []
for i in range(0,len(tds),2):
    # header.append(tds[i].text.strip(':'))
    row.append(tds[i+1].text)
# rows.append(header)
rows.append(row)
utils.save2csv('test.csv', rows, mode='a')
dabingsou, thank you for the inspiration. While your solution didn't work for my code your idea to use a csv utility other than what was bundled with Scrapy was the solution I was looking for!
My code was very similar to what you wrote with the exception of your MUCH more simplistic way of only looping once through the headers! Below is my code that I utilized and then simply added the csv package to write the file perfectly! This code utilizes Scrapy vs simple-scrapy and allows me to scrape the page using scrapy-splash.
import csv

def h1header_scrape(self, response):
    td_labels = response.css('tbody.h1 td.label')
    h1headers = []
    lastitemnum = len(td_labels)-1 #This provides the last item number and subtracts the duplicate "Project Organization" label in the first tr in the css
    lastheader= response.css('tbody.h1 td.label::text')[lastitemnum].get() #Using the count function defined above this gets the last header name in h1 and uses it to stop the for loop below
    for td_label in td_labels:
        columns = td_label.css('td.label')
        header = columns.css('.label::text').get()
        h1header = header.replace(":","")  # strip the trailing colon from the label
        h1headers.append(h1header)
        if header == lastheader:
            break
    print(h1headers)
    # Write the collected labels as a single header row
    with open('testfile.csv','w',newline='') as file:
        writer = csv.writer(file)
        writer.writerow(h1headers)
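To go from a header row to full records, one option is to pair each label with the value cell next to it and let csv.DictWriter emit both the header and the data rows. This is a minimal sketch using only the standard library. The pairs list stands in for whatever the spider extracts from one tbody.h1 block, and to_record is a hypothetical helper name.

```python
import csv
import io

# Label/value pairs as they might come out of one tbody.h1 block.
pairs = [
    ("Project Code:", "PROJECT.001"),
    ("Project Lead:", "Doe, Jane"),
    ("Status:", "Backlog"),
]


def to_record(pairs):
    """Turn (label, value) pairs into a dict keyed by the cleaned label."""
    return {label.rstrip(":").strip(): value.strip() for label, value in pairs}


records = [to_record(pairs)]  # one dict per tbody.h1 block on the page

buffer = io.StringIO()  # an open file would work the same way
writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

Because the field names come from the scraped labels, the columns follow whatever labels the page shows for a given month or selection.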
I need a robust way to get the XPath for this URL: "http://www.screener.com/v2/stocks/view/5131"
However, there is too much whitespace before the data I want, so my current attempt is not robust.
The part I need is 11.48,9.05,11.53 from the html below:
<div class="table-responsive">
<table class="table table-hover">
<tr>
<th>Financial Year</th>
<th class="number">Revenue ('000)</th>
<th class="number">Net ('000)</th>
<th class="number">EPS</th>
<th></th>
</tr>
<tr>
<td>30 Nov, 2017</td>
<td class="number">205,686</td>
<td class="number">52,812</td>
<td class="number">11.48</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2016</td>
<td class="number">191,301</td>
<td class="number">41,598</td>
<td class="number">9.05</td>
<td></td>
</tr>
<tr>
<td>30 Nov, 2015</td>
<td class="number">225,910</td>
<td class="number">51,082</td>
<td class="number">11.53</td>
<td></td>
</tr>
My code is below:
from lxml import html
import requests
page = requests.get('http://www.screener.com/v2/stocks/view/5131')
output = html.fromstring(page.content)
output.xpath('//tr/td/following-sibling::td/text()')
How should the code be changed so that it robustly gets the three numbers from the table shown above?
I just want the output 11.48,9.05,11.53 but I am unable to get rid of all the other data inside the table.
Try below XPath to get desired output:
//div[@id="annual"]//tr/td[position() = last() - 1]/text()
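The key part of that expression is the positional predicate: each data row ends with an empty td, so the EPS value is the second-to-last cell. The sketch below checks the idea on a well-formed copy of the posted rows using the standard library's ElementTree, which writes the same predicate as [last()-1]. The id="annual" wrapper is taken from the answer and is not visible in the posted HTML, so treat it as an assumption. With lxml the full expression above can be passed to xpath() directly.

```python
import xml.etree.ElementTree as ET

# Well-formed copy of the posted table, wrapped in the div the answer assumes.
SAMPLE = """<div id="annual"><table>
<tr><th>Financial Year</th><th>Revenue</th><th>Net</th><th>EPS</th><th></th></tr>
<tr><td>30 Nov, 2017</td><td>205,686</td><td>52,812</td><td>11.48</td><td></td></tr>
<tr><td>30 Nov, 2016</td><td>191,301</td><td>41,598</td><td>9.05</td><td></td></tr>
<tr><td>30 Nov, 2015</td><td>225,910</td><td>51,082</td><td>11.53</td><td></td></tr>
</table></div>"""

root = ET.fromstring(SAMPLE)

# td[last()-1] picks the second-to-last td of each row; the header row
# contains only th cells, so it contributes nothing.
eps_values = [td.text.strip() for td in root.findall(".//tr/td[last()-1]")]
eps_line = ",".join(eps_values)
```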
I am beginning to learn python and would like to try to use BeautifulSoup to extract the elements in the below html.
This html is taken from a voice recording system that logs the time and date in local time, UTC, call duration, called number, name, calling number, name, etc
There are usually hundreds of these entries.
What I am attempting to do is extract the elements and print them in one line to a comma delimited format in order to compare with call detail records from call manager. This will help to verify that all calls were recorded and not missed.
I believe BeautifulSoup is the right tool to do this.
Could someone point me in the right direction?
<tbody>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>16:24:47</td>
<td class="formRowLight" >24/10/16 07:24:47</td>
<td class="formRowLight" >00:45</td>
<td class="formRowLight" >31301</td>
<td class="formRowLight" >Joe Smith</td>
<td class="formRowLight" >31111</td>
<td class="formRowLight" >Jane Doe</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >1432875648934</td>
<td align="center" class"formRowLight"> </td>
<tr class="formRowLight">
<td class="formRowLight" >24/10/16<br>17:33:02</td>
<td class="formRowLight" >24/10/16 08:33:02</td>
<td class="formRowLight" >00:58</td>
<td class="formRowLight" >35664</td>
<td class="formRowLight" >Billy Bob</td>
<td class="formRowLight" >227045665</td>
<td class="formRowLight" >James Dean</td>
<td class="formRowLight" >N/A</td>
<td class="formRowLight" >9934959586849</td>
<td align="center" class"formRowLight"> </td>
</tr>
</tbody>
The pandas.read_html() would make things much easier - it would convert your tabular data from the HTML table into a dataframe which, if needed, you can later dump into CSV.
Here is a sample code to get you started:
import pandas as pd
data = """
<table>
<thead>
<tr>
<th>Date</th>
<th>Name</th>
<th>ID</th>
</tr>
</thead>
<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">Joe Smith</td>
<td class="formRowLight">1432875648934</td>
</tr>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>17:33:02</td>
<td class="formRowLight">Billy Bob</td>
<td class="formRowLight">9934959586849</td>
</tr>
</tbody>
</table>"""
df = pd.read_html(data)[0]
print(df.to_csv(index=False))
Prints:
Date,Name,ID
24/10/1616:24:47,Joe Smith,1432875648934
24/10/1617:33:02,Billy Bob,9934959586849
FYI, read_html() relies on a parser library under the hood: by default it tries lxml first and falls back to BeautifulSoup with html5lib.
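Note that in the output above the date and time run together ("24/10/1616:24:47") because the br between them is dropped. As a minimal sketch of keeping them apart, the following uses only the standard library's html.parser and treats each br as a space. The class name RowCollector and the trimmed sample rows are assumptions for illustration, not part of pandas or BeautifulSoup.

```python
import csv
import io
from html.parser import HTMLParser


class RowCollector(HTMLParser):
    """Collect the text of each <td>, one list per <tr>; <br> becomes a space."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text pieces of the <td> being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._cell = []
        elif tag == "br" and self._cell is not None:
            self._cell.append(" ")

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None:
            self.rows[-1].append("".join(self._cell).strip())
            self._cell = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


SAMPLE = """<tbody>
<tr class="formRowLight">
<td class="formRowLight">24/10/16<br>16:24:47</td>
<td class="formRowLight">00:45</td>
<td class="formRowLight">Joe Smith</td>
</tr>
</tbody>"""

parser = RowCollector()
parser.feed(SAMPLE)

buffer = io.StringIO()  # an open file would work the same way
csv.writer(buffer).writerows(parser.rows)
csv_text = buffer.getvalue()
```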
# Python 2 with BeautifulSoup 3
import BeautifulSoup
import urllib2

url = 'your url'  # placeholder: replace with the page address
request = urllib2.Request(url)
response = urllib2.urlopen(request)
soup = BeautifulSoup.BeautifulSoup(response)
mylist = []
rows = soup.findAll('tr', {"class":"formRowLight"})
for row in rows:
    # first formRowLight cell after the start of this row
    text = row.findNext('td',{"class":"formRowLight"}).text
    mylist.append(text)
print mylist
But you need to edit this code a little to prevent any duplicated content.
Yes, BeautifulSoup is a good tool to reach for in this problem. Something to get you started would be as follows:
from bs4 import BeautifulSoup
with open("my_log.html") as log_file:
    html = log_file.read()
soup = BeautifulSoup(html)
#normally you specify a parser too `(html, 'lxml')` for example
#without specifying a parser, it will warn you and select one automatically
table_rows = soup.find_all("tr") #get list of all <tr> tags
for row in table_rows:
    table_cells = row.find_all("td") #get list all <td> tags in row
    joined_text = ",".join(cell.get_text() for cell in table_cells)
    print(joined_text)
However, pandas's read_html may make this a bit more seamless, as mentioned in another answer to this question. Arguably pandas may be a better hammer to hit this nail with, but learning to use BeautifulSoup for this will also give you the skills to scrape all kinds of HTML in the future.
First get a list of HTML strings. To do that, follow "Convert BeautifulSoup4 HTML Table to a list of lists, iterating over each Tag elements".
Then perform the following operation on that list.
This will fetch the values of all the elements you want:
for element in html_list:
    output = soup.select(element)[0].text
    print("%s ," % output)
This will give you what you want.
Hope that helps!
I'm parsing with lxml on Python 2.7
I have some html that looks like this:
<tr height="45" valign="bottom">
<td colspan="2" class="DATE">Wednesday, Aug 5 2015 </td>
</tr>
<tr>
<td/>
</tr>
<tr>
<td> </td>
<td/>
</tr>
<tr>
<td/>
<td> - No Calendar Matters Currently Set<br/></td>
</tr>
<tr height="45" valign="bottom">
<td colspan="2" class="DATE">Thursday, Aug 6 2015 </td>
</tr>
Is there any way for me to get a list of all td element objects in between the two elements of class="DATE"?
Basically, I need all the info associated with, say Aug 5, but since the other elements before the next date aren't children I'm struggling to figure out how to get them.
Write it the way you describe it: all td elements that have a td[@class="DATE"] both after and before them:
//td[following::td[@class="DATE"] and preceding::td[@class="DATE"]]
This set will not contain the td tags with @class="DATE" themselves.
To include them, use this XPath:
//td[(following::td[@class="DATE"] and preceding::td[@class="DATE"]) or @class="DATE"]
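When the page holds more than two dates, those expressions return every cell between the first and the last DATE row in one flat list. If the cells are needed per date, one alternative is to walk the td elements in document order and start a new group at each DATE cell. This is a minimal sketch with the standard library's ElementTree. The function name group_by_date, the table wrapper and the "Hearing A" row are made up for illustration. lxml elements offer the same iter() and get() calls, so the loop should carry over.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<table>
<tr><td colspan="2" class="DATE">Wednesday, Aug 5 2015</td></tr>
<tr><td/></tr>
<tr><td/><td>- No Calendar Matters Currently Set<br/></td></tr>
<tr><td colspan="2" class="DATE">Thursday, Aug 6 2015</td></tr>
<tr><td>Hearing A</td></tr>
</table>"""


def group_by_date(table):
    """Map each DATE cell's text to the td elements that follow it."""
    groups = {}
    current = None
    for td in table.iter("td"):  # iter() yields elements in document order
        if td.get("class") == "DATE":
            current = td.text.strip()
            groups[current] = []
        elif current is not None:
            groups[current].append(td)
    return groups


groups = group_by_date(ET.fromstring(SAMPLE))
aug5_text = ["".join(td.itertext()).strip()
             for td in groups["Wednesday, Aug 5 2015"]]
```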