grabbing child from html table - python

Trying to grab some table data from a website.
Here's a sample of the html that can be found here https://www.madeinalabama.com/warn-list/:
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>
I'm trying to grab the data corresponding to the 6th td for each table row.
I tried this:
url = 'https://www.madeinalabama.com/warn-list/'
browser = webdriver.Chrome()
browser.get(url)
elements = browser.find_elements_by_xpath("//table/tbody/tr/td[6]").text
and elements comes back as this:
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="7d2f8991-d30b-4bc0-bfa5-4b7e909fb56c")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="ba0cd72d-d105-4f8c-842f-6f20b3c2a9de")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="1ec14439-0732-4417-ac4f-be118d8d1f85")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="d8226534-4fc7-406c-935a-d43d6d777bfb")>]
Desired output is a simple df like this:
Planned # Affected Employees
146
72
.
.
.
Could someone please explain how to do this using Selenium's find_elements_by_xpath? There are already ample BeautifulSoup explanations.

You can use pd.read_html() function:
txt = '''<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>'''
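import pandas as pd
from io import StringIO

# Recent pandas versions expect a file-like object rather than a literal HTML string
df = pd.read_html(StringIO(txt))[0]
print(df)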
Prints:
   Closing or Layoff  Initial Report Date  Planned Starting Date             Company     City  Planned # Affected Employees
0          Closing *           09/17/2019             11/14/2019  FLOWERS BAKING CO.  Opelika                           146
1          Closing *           08/05/2019             10/01/2019  INFORM DIAGNOSTICS   Daphne                            72
Then:
print(df['Planned # Affected Employees'])
Prints:
0    146
1     72
Name: Planned # Affected Employees, dtype: int64
EDIT: Solution with BeautifulSoup:
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')
all_data = []
for tr in soup.select('.warn-data tr:has(td)'):
    # Unpack the row's cells and keep only the last one
    *_, last_column = tr.select('td')
    all_data.append(last_column.get_text(strip=True))
df = pd.DataFrame({'Planned': all_data})
print(df)
Prints:
  Planned
0     146
1      72
Or:
soup = BeautifulSoup(txt, 'html.parser')
all_data = [td.get_text(strip=True) for td in soup.select('.warn-data tr > td:nth-child(6)')]
df = pd.DataFrame({'Planned': all_data})
print(df)

You could also use td:nth-last-child(1), assuming it's the last child:
soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
Example:
from bs4 import BeautifulSoup
html = """
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing *</td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO.</td>
<td>Opelika</td>
<td> 146 </td>
</tr>
<tr>
<td>Closing *</td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS</td>
<td>Daphne</td>
<td> 72 </td>
</tr>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
for index, item in enumerate(elements):
    print(index, item.text)

Related

Python : Scrape each info in table without class using beautifulsoup4

I'm new to Python and I'm having trouble using beautifulsoup4 to scrape a table containing information about a book, because the tr and td elements of the table have no class names.
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Here is the table on the website:
<table class="table table-striped">
<tr>
<th>
UPC
</th>
<td>
a897fe39b1053632
</td>
</tr>
<tr>
<th>
Product Type
</th>
<td>
Books
</td>
</tr>
<tr>
<th>
Price (excl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Price (incl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Tax
</th>
<td>
£0.00
</td>
</tr>
<tr>
<th>
Availability
</th>
<td>
In stock (22 available)
</td>
</tr>
<tr>
<th>
Number of reviews
</th>
<td>
0
</td>
</tr>
</table>
The only approach I have learned relies on class names, for example: book_price = soup.find('td', class_='book-price').
But in this situation I am stuck...
Is there a way to find and pair the first th tag with the first td, the second th tag with the second td, and so on?
I imagine something like this:
import requests
from bs4 import BeautifulSoup
book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(book_url)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find('table').prettify()
table_infos = soup.find('table')
for info in table_infos.findAll('tr'):
    upc = ...
    price = ...
    tax = ...
Thank you!
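A minimal sketch of the th/td pairing asked about above. It assumes every row of the table holds exactly one th (the label) and one td (the value), as in the HTML shown; the inline html string is a trimmed copy of that table and book_info is a hypothetical name used only for illustration.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the product table from the question (assumed representative).
html = """
<table class="table table-striped">
  <tr><th>UPC</th><td>a897fe39b1053632</td></tr>
  <tr><th>Price (excl. tax)</th><td>£51.77</td></tr>
  <tr><th>Tax</th><td>£0.00</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Each row carries one <th> label and one <td> value, so pair them row by row.
book_info = {}
for row in table.find_all("tr"):
    label = row.find("th").get_text(strip=True)
    value = row.find("td").get_text(strip=True)
    book_info[label] = value

print(book_info["UPC"])
print(book_info["Price (excl. tax)"])
```

Looking values up by label (book_info["Tax"]) avoids depending on row order, which may matter if the site adds or reorders rows.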

BeautifulSoup4 extract multiple data from TD tags within TR

I'm using beautifulsoup4 to scrape an HTML table.
I want to display the values from one of the table rows and remove any empty td fields.
The cells in the source being scraped share the same class values across rows,
so is there any way to pull the data from just one row, using
data-name="Georgia" in the HTML source below?
Using: beautifulsoup4
Current code
import bs4 as bs
from urllib.request import FancyURLopener

class MyOpener(FancyURLopener):
    version = 'My new User-Agent'  # Set this to a string you want for your user agent

myopener = MyOpener()
sauce = myopener.open('')
soup = bs.BeautifulSoup(sauce, 'lxml')
#table = soupe.table
table = soup.find('table')
table_rows = table.find_all_next('tr')
for tr in table_rows:
    td = tr.find_all('td')
    row = [i.text for i in td]
    print(row)
HTML SOURCE
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Kazakhstan">★</span>
Kazakhstan
</td>
<td class="text--green">
81
</td>
<td class="text--green">
9
</td>
<td class="text--green">
12.5
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
0
</td>
<td class="text--yellow">
0
</td>
</tr>
<tr>
<td class="text--gray">
<span class="save-button" data-status="unselected" data-type="country" data-name="Georgia">★</span>
Georgia
</td>
<td class="text--green">
75
</td>
<td class="text--green">
0
</td>
<td class="text--green">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--red">
0
</td>
<td class="text--blue">
10
</td>
<td class="text--yellow">
1
</td>
</tr>
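Are you talking about something like:
tr.find_all('td', {'data-name': True})
That finds any td that itself has a data-name attribute. In the HTML above, though, data-name is set on the span inside the first td, so the span is what needs to be matched. I could be reading your question all wrong though.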
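A minimal sketch of selecting a single row by its data-name value. It assumes the attribute sits on a span inside the first td, as in the HTML source above; the inline html string is a shortened stand-in for the real page, and the empty td cells in it are an assumption added to illustrate the "remove any empty td fields" requirement.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the page; the empty <td> cells are assumed for illustration.
html = """
<table>
<tr>
<td class="text--gray"><span class="save-button" data-name="Kazakhstan">★</span> Kazakhstan</td>
<td class="text--green">81</td>
<td class="text--green"></td>
<td class="text--blue">0</td>
</tr>
<tr>
<td class="text--gray"><span class="save-button" data-name="Georgia">★</span> Georgia</td>
<td class="text--green">75</td>
<td class="text--green"></td>
<td class="text--blue">10</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# data-name is on the <span>, not the <td>, so locate the span first...
marker = soup.find("span", attrs={"data-name": "Georgia"})
# ...then climb up to the row that contains it.
row = marker.find_parent("tr")

# Skip the first cell (star + name) and drop cells that are empty after stripping.
values = [td.get_text(strip=True) for td in row.find_all("td")[1:]]
result = [marker["data-name"]] + [v for v in values if v]
print(result)
```

Taking the name from the attribute rather than the cell text keeps the star character out of the result.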

How to extract items line by line

I need some help here using beautifulsoup4 to extract data from my inventory webpage.
The webpage was written in the following format: name of the item, followed by a table listing the multiple rows of details for that particular inventory.
I am interested in getting the item name, actual qty and expiry date.
How do I go about doing it given such HTML structure (see appended)?
<div style="font-weight: bold">Item X</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>350.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>2</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>15200.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>3</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>210.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 15760.00</td>
<td> </td>
</tr>
</tbody>
</table>
<div style="font-weight: bold">Item Y</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>271.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>2</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>500.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>3</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>69.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>4</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>475.00</td>
<td>01-01-2020</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 1315.00</td>
<td> </td>
</tr>
</tbody>
</table>
Here is one way to do it. The idea is to iterate over the items - div elements with bold substring inside the style attribute. Then, for every item, get the next table sibling using find_next_sibling() and parse the row data into a dictionary for convenient access by a header name:
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "lxml")
for item in soup.select("div[style*=bold]"):
    item_name = item.get_text()
    table = item.find_next_sibling("table")
    headers = [th.get_text(strip=True) for th in table('th')]
    # Skip the header row and the trailing "Total Qty" row
    for row in table('tr')[1:-1]:
        row_data = dict(zip(headers, [td.get_text(strip=True) for td in row('td')]))
        print(item_name, row_data['Actual Qty'], row_data['Expiry Date'])
    print("-----")
Prints:
Item X 350.00 31-05-2019
Item X 15200.00 31-05-2019
Item X 210.00 31-05-2019
-----
Item Y 271.00 31-01-2020
Item Y 500.00 31-01-2020
Item Y 69.00 31-01-2020
Item Y 475.00 01-01-2020
-----
One solution is to iterate through each row tag i.e. <tr> and then just figure out what the column cell at each index represents and access columns that way. To do so, you can use the find_all method in BeautifulSoup, which will return a list of all elements with the tag given.
Example:
from bs4 import BeautifulSoup
html_doc = """YOUR HTML HERE"""
soup = BeautifulSoup(html_doc, 'html.parser')
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 0:
        # This is the header row
        continue
    elif len(cells) > 3:
        # If you want to access the text of the Default Qty column, for example.
        # The length check skips the shorter "Total Qty" row.
        default_qty = cells[3].text
Note that when the tr tag is the header row, it contains no td tags (only th tags), so in that case len(cells) == 0.
You can select all divs and walk through to find the next table.
If you go over the rows of the table with the exception of the last row, you can extract text from specific cells and build your inventory list.
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html5lib")
inventory = []
for itemdiv in soup.select('div'):
    table = itemdiv.find_next('table')
    for supply_row in table.tbody.select('tr')[:-1]:
        # Columns: S/N, Supplier, Packing, Default Qty, Expensive, Reorder Point, Actual Qty, Expiry Date
        sn, supplier, _, _, _, _, actual_qty, exp = supply_row.select('td')
        # Build a list (not a map object) so the item name can be spliced in
        item = [node.text.strip() for node in (sn, supplier, actual_qty, exp)]
        item[1:1] = [itemdiv.text]
        inventory.append(item)
print(inventory)
You can use the csv library to write the inventory like so:
import csv
# In Python 3, open the file in text mode with newline='' for the csv module
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(('S/N', 'Item', 'Supplier', 'Actual Qty', 'Expiry Date'))
    writer.writerows(inventory)

Extracting data from a table using Python Beautiful soup

I'm trying to parse rows within a table (the departure board times) from the following:
buscms_widget_departureboard_ui_displayStop_Callback("
<div class='\"livetimes\"'>
<table class='\"busexpress-clientwidgets-departures-departureboard\"'>
<thead>
<tr class='\"rowStopName\"'>
<th colspan='\"3\"' data-bearing='\"SW\"' data-lat='\"51.7505683898926\"' data-lng='\"-1.225102186203\"' title='\"oxfajmwg\"'>
Divinity Road
</th>
<tr>
<tr class='\"textHeader\"'>
<th colspan='\"3\"'>
text 69325694 to 84637 for live times
</th>
<tr>
<tr class='\"rowHeaders\"'>
<th>
service
</th>
<th>
destination
</th>
<th>
time
</th>
<tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</tr>
</thead>
<tbody>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 21:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"5'>
5 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:11:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"27'>
27 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4 (OBC)
</td>
<td class='\"colDestination\"' title='\"Abingdon\"'>
Abingdon
</td>
<td 22:29:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"22:29\"'>
22:29
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 22:49:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' mins\"="" title='\"65'>
65 mins
</td>
</tr>
<tr class='\"rowServiceDeparture\"'>
<td class='\"colServiceName\"'>
4A (OBC)
</td>
<td class='\"colDestination\"' rise\"="" title='\"Elms'>
Elms Rise
</td>
<td 23:09:00\"="" class='\"colDepartureTime\"' data-departuretime='\"20/02/2017' title='\"23:09\"'>
23:09
</td>
</tr>
</tbody>
</table>
</div>
<div class='\"scrollmessage_container\"'>
<div class='\"scrollmessage\"'>
</div>
</div>
<div class='\"services\"'>
<a class='\"service' href='\"#\"' onclick="\"serviceNameClick('');\"" selected\"="">
all
</a>
<a class='\"service\"' href='\"#\"' onclick="\"serviceNameClick('4');\"">
4
</a>
</div>
<div class="dptime">
<span>
times generated at:
</span>
<span>
21:43
</span>
</div>
");
In particular, I'm trying to extract all the departure times - so I'd like to capture the minutes from departure - for example 12 minutes away.
I have the following code:
# import libraries
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
print(soup.prettify())
I'm not sure how to find the minutes from departure from the above? Is it something like:
minutes_from_depart = soup.find("tbody", attrs={'td': 'mins'})
Could you try this?
import urllib.request
from bs4 import BeautifulSoup
import re
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
page = urllib.request.urlopen(quote_page).read()
soup = BeautifulSoup(page, 'lxml')
print(soup.prettify())
minutes = soup.find_all("td", class_=re.compile(r"colDepartureTime"))
for element in minutes:
    print(element.getText())
So I got to my answer with the following code - which was actually quite easy once I had played around with the soup.find_all function:
import urllib.request
from bs4 import BeautifulSoup
# specify the url
quote_page = 'http://www.buscms.com/api/REST/html/departureboard.aspx?callback=buscms_widget_departureboard_ui_displayStop_Callback&clientid=Nimbus&stopcode=69325694&format=jsonp&servicenamefilder=&cachebust=123&sourcetype=siri&requestor=Netescape&includeTimestamp=true&_=1487625719723'
# query the website and return the html to the variable 'page'
page = urllib.request.urlopen(quote_page)
# parse the html using Beautiful Soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
for link in soup.find_all('td',class_='\\"colDepartureTime\\"'):
    print(link.get_text())
I get the following output:
10:40
10 mins
21 mins
30 mins
40 mins
50 mins
60 mins
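The backslash-quoted class names above suggest that the response is a JSONP call wrapping an escaped string, which is why the class has to be matched as \"colDepartureTime\". A minimal sketch of an alternative follows: strip the callback wrapper and decode the string before parsing. It assumes the callback argument is a valid JSON string literal, which the question does not confirm; raw is a hypothetical, shortened response body built only for illustration.

```python
import json
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened response body in the same shape as the one in the question.
raw = (
    'buscms_widget_departureboard_ui_displayStop_Callback("'
    '<table><tbody>'
    '<tr class=\\"rowServiceDeparture\\">'
    '<td class=\\"colServiceName\\">4A (OBC)</td>'
    '<td class=\\"colDepartureTime\\" title=\\"5 mins\\">5 mins</td>'
    '</tr>'
    '<tr class=\\"rowServiceDeparture\\">'
    '<td class=\\"colServiceName\\">4 (OBC)</td>'
    '<td class=\\"colDepartureTime\\" title=\\"22:29\\">22:29</td>'
    '</tr>'
    '</tbody></table>");'
)

# Capture the quoted argument between the callback's parentheses.
match = re.search(r'\((".*")\)\s*;?\s*$', raw, flags=re.DOTALL)

# Decoding it as a JSON string turns \" back into plain double quotes.
html = json.loads(match.group(1))

soup = BeautifulSoup(html, "html.parser")
times = [td.get_text(strip=True)
         for td in soup.find_all("td", class_="colDepartureTime")]
print(times)
```

Once the string is decoded, the class names are ordinary, so no escaped quotes or regular expressions are needed in the find_all call.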

scraping tables with beautifulsoup

I seem to be stuck. If I had the following table:
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">
40 </td>
<td align="center">
44 </td>
<td align="center">
<font color="green"><b>+4</b></font>
</td>
<td align="center">
1,000</td>
<td align="center">
15,000 </td>
<td align="center">
44,000 </td>
<td align="center">
<font color="green"><b><nobr>+193.33%</nobr></b></font>
</td>
</tr>
What would be the ideal way to use find_all to pull the 44,000 td from the table?
If it is a recurring position in the table that you would like to scrape, I would use Beautiful Soup to extract all the td elements in the table and then pick out that one. See the pseudocode below (soup is the parsed BeautifulSoup object).
known_position = 5
tds = soup.find_all('td')
number = tds[known_position].text
On the other hand, if you're specifically searching for a given number, I would just iterate over the list.
tds = soup.find_all('td')
for td in tds:
    if td.text.strip() == 'number here':
        # do your stuff
        pass
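A runnable sketch of the known-position approach, using a tidied copy of the table from the question. It assumes the 44,000 cell is always the sixth td (index 5); as_int is a hypothetical name for the parsed value.

```python
from bs4 import BeautifulSoup

# Tidied copy of the table from the question.
html = """
<table align=center cellpadding=3 cellspacing=0 border=1>
<tr bgcolor="#EEEEFF">
<td align="center">40 </td>
<td align="center">44 </td>
<td align="center"><font color="green"><b>+4</b></font></td>
<td align="center">1,000</td>
<td align="center">15,000 </td>
<td align="center">44,000 </td>
<td align="center"><font color="green"><b><nobr>+193.33%</nobr></b></font></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

known_position = 5  # zero-based index of the wanted cell (assumed stable)
tds = soup.find_all("td")
number = tds[known_position].get_text(strip=True)
print(number)

# Remove the thousands separator before converting to an integer.
as_int = int(number.replace(",", ""))
print(as_int)
```

get_text(strip=True) is used instead of .text so the surrounding whitespace in the cell does not end up in the result.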
