How to extract items line by line - python

I need some help using beautifulsoup4 to extract data from my inventory webpage.
The webpage is written in the following format: the name of the item, followed by a table listing multiple rows of details for that particular item.
I am interested in getting the item name, actual qty and expiry date.
How do I go about doing it, given such an HTML structure (see appended)?
<div style="font-weight: bold">Item X</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>350.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>2</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>15200.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>3</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>210.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 15760.00</td>
<td> </td>
</tr>
</tbody>
</table>
<div style="font-weight: bold">Item Y</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>271.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>2</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>500.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>3</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>69.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>4</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>475.00</td>
<td>01-01-2020</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 1315.00</td>
<td> </td>
</tr>
</tbody>
</table>

Here is one way to do it. The idea is to iterate over the items - the div elements with a bold substring inside the style attribute. Then, for every item, get the next table sibling using find_next_sibling() and parse the row data into a dictionary for convenient access by header name:
from bs4 import BeautifulSoup

data = """your HTML here"""
soup = BeautifulSoup(data, "lxml")

for item in soup.select("div[style*=bold]"):
    item_name = item.get_text()
    table = item.find_next_sibling("table")
    headers = [th.get_text(strip=True) for th in table('th')]
    for row in table('tr')[1:-1]:
        row_data = dict(zip(headers, [td.get_text(strip=True) for td in row('td')]))
        print(item_name, row_data['Actual Qty'], row_data['Expiry Date'])
    print("-----")
Prints:
Item X 350.00 31-05-2019
Item X 15200.00 31-05-2019
Item X 210.00 31-05-2019
-----
Item Y 271.00 31-01-2020
Item Y 500.00 31-01-2020
Item Y 69.00 31-01-2020
Item Y 475.00 01-01-2020
-----
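If you would rather collect the rows than print them, the same loop can build a list of records instead; a small variation of the sketch above, assuming pandas is available:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "lxml")

records = []
for item in soup.select("div[style*=bold]"):
    item_name = item.get_text(strip=True)
    table = item.find_next_sibling("table")
    headers = [th.get_text(strip=True) for th in table('th')]
    for row in table('tr')[1:-1]:  # skip the header row and the "Total Qty" row
        row_data = dict(zip(headers, [td.get_text(strip=True) for td in row('td')]))
        records.append({'Item': item_name,
                        'Actual Qty': row_data['Actual Qty'],
                        'Expiry Date': row_data['Expiry Date']})

df = pd.DataFrame(records)
print(df)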

One solution is to iterate through each row tag, i.e. <tr>, figure out what the cell at each index represents, and access the columns that way. To do so, you can use the find_all method in BeautifulSoup, which returns a list of all elements with the given tag.
Example:
from bs4 import BeautifulSoup

html_doc = """your HTML here"""
soup = BeautifulSoup(html_doc, 'html.parser')

for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 0:
        # This is the header row, so skip it
        continue
    # To access the text of the Default Qty column, for example:
    default_qty = cells[3].text
Note that when a tr tag is the header row, it contains only th tags and no td tags, so len(cells) == 0 in that case.

You can select all divs and walk through to find the next table.
If you go over the rows of the table with the exception of the last row, you can extract text from specific cells and build your inventory list.
from bs4 import BeautifulSoup

soup = BeautifulSoup(markup, "html5lib")  # markup holds the HTML shown above

inventory = []
for itemdiv in soup.select('div'):
    table = itemdiv.find_next('table')
    # Skip the last row, which only holds the "Total Qty" summary
    for supply_row in table.tbody.select('tr')[:-1]:
        sn, supplier, _, _, _, _, actual_qty, exp = supply_row.select('td')
        item = [node.text.strip() for node in (sn, supplier, actual_qty, exp)]
        item[1:1] = [itemdiv.text.strip()]  # insert the item name after the serial number
        inventory.append(item)

print(inventory)
You can use the csv library to write the inventory like so:
import csv

with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(('S/N', 'Item', 'Supplier', 'Actual Qty', 'Expiry Date'))
    writer.writerows(inventory)

Related

Python : Scrape each info in table without class using beautifulsoup4

I'm new to Python and I have a problem scraping a table containing information about a book with beautifulsoup4, because the tr and td elements of the table don't have class names.
https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html
Here is the table on the website:
<table class="table table-striped">
<tr>
<th>
UPC
</th>
<td>
a897fe39b1053632
</td>
</tr>
<tr>
<th>
Product Type
</th>
<td>
Books
</td>
</tr>
<tr>
<th>
Price (excl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Price (incl. tax)
</th>
<td>
£51.77
</td>
</tr>
<tr>
<th>
Tax
</th>
<td>
£0.00
</td>
</tr>
<tr>
<th>
Availability
</th>
<td>
In stock (22 available)
</td>
</tr>
<tr>
<th>
Number of reviews
</th>
<td>
0
</td>
</tr>
</table>
The only thing I have learned is selecting by class name, for example: book_price = soup.find('td', class_='book-price').
But in this situation I am stuck...
Is there a way to find and pair the first th tag with the first td, the second th tag with the second td, and so on?
I am imagining something like this:
import requests
from bs4 import BeautifulSoup

book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(book_url)
soup = BeautifulSoup(page.content, 'html.parser')

title = soup.find('table').prettify()
table_infos = soup.find('table')
for info in table_infos.findAll('tr'):
    upc = ...
    price = ...
    tax = ...
Thank you!
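One way to do exactly that pairing, since every row in this table holds one th (the label) and one td (the value), is sketched below; it assumes the product-information table is the only table with the table-striped class on the page:
import requests
from bs4 import BeautifulSoup

book_url = "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
page = requests.get(book_url)
soup = BeautifulSoup(page.content, 'html.parser')

# Each row pairs one th (label) with one td (value),
# so collect them into a dictionary keyed by the label text.
info = {}
for row in soup.find('table', class_='table-striped').find_all('tr'):
    label = row.find('th').get_text(strip=True)
    value = row.find('td').get_text(strip=True)
    info[label] = value

print(info['UPC'])
print(info['Price (incl. tax)'])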

grabbing child from html table

Trying to grab some table data from a website.
Here's a sample of the html that can be found here https://www.madeinalabama.com/warn-list/:
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>
I'm trying to grab the data corresponding to the 6th td for each table row.
I tried this:
url = 'https://www.madeinalabama.com/warn-list/'
browser = webdriver.Chrome()
browser.get(url)
elements = browser.find_elements_by_xpath("//table/tbody/tr/td[6]").text
and elements comes back as this:
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="7d2f8991-d30b-4bc0-bfa5-4b7e909fb56c")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="ba0cd72d-d105-4f8c-842f-6f20b3c2a9de")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="1ec14439-0732-4417-ac4f-be118d8d1f85")>,
<selenium.webdriver.remote.webelement.WebElement (session="8199967d541da323f5d5c72623a5e607", element="d8226534-4fc7-406c-935a-d43d6d777bfb")>]
Desired output is a simple df like this:
Planned # Affected Employees
146
72
.
.
.
Could someone please explain how to do this using selenium's find_elements_by_xpath? We have ample beautifulsoup explanations.
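Note that find_elements_by_xpath returns a list of WebElement objects, so .text has to be read from each element rather than from the list itself. A minimal sketch of that approach (assuming the selenium 3 API used in the question and that pandas is available):
from selenium import webdriver
import pandas as pd

url = 'https://www.madeinalabama.com/warn-list/'
browser = webdriver.Chrome()
browser.get(url)

# One WebElement per row; take the visible text of each 6th cell
cells = browser.find_elements_by_xpath("//table/tbody/tr/td[6]")
values = [cell.text.strip() for cell in cells]

df = pd.DataFrame({'Planned # Affected Employees': values})
print(df)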
You can use the pd.read_html() function:
import pandas as pd

txt = '''<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing * </td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO. </td>
<td>Opelika </td>
<td> 146 </td>
</tr>
<tr>
<td>Closing * </td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS </td>
<td>Daphne </td>
<td> 72 </td>
</tr>'''
df = pd.read_html(txt)[0]
print(df)
Prints:
Closing or Layoff Initial Report Date Planned Starting Date Company City Planned # Affected Employees
0 Closing * 09/17/2019 11/14/2019 FLOWERS BAKING CO. Opelika 146
1 Closing * 08/05/2019 10/01/2019 INFORM DIAGNOSTICS Daphne 72
Then:
print(df['Planned # Affected Employees'])
Prints:
0 146
1 72
Name: Planned # Affected Employees, dtype: int64
EDIT: Solution with BeautifulSoup:
soup = BeautifulSoup(txt, 'html.parser')

all_data = []
for tr in soup.select('.warn-data tr:has(td)'):
    *_, last_column = tr.select('td')
    all_data.append(last_column.get_text(strip=True))

df = pd.DataFrame({'Planned': all_data})
print(df)
Prints:
Planned
0 146
1 72
Or:
soup = BeautifulSoup(txt, 'html.parser')
all_data = [td.get_text(strip=True) for td in soup.select('.warn-data tr > td:nth-child(6)')]
df = pd.DataFrame({'Planned': all_data})
print(df)
You could also do td:nth-last-child(1), assuming it's the last child:
soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
Example
from bs4 import BeautifulSoup
html = """
<div class="warn-data">
<table>
<thead>
<tr>
<th>Closing or Layoff</th>
<th>Initial Report Date</th>
<th>Planned Starting Date</th>
<th>Company</th>
<th>City</th>
<th>Planned # Affected Employees</th>
</tr>
</thead>
<tbody>
<tr>
<td>Closing *</td>
<td>09/17/2019</td>
<td>11/14/2019</td>
<td>FLOWERS BAKING CO.</td>
<td>Opelika</td>
<td> 146 </td>
</tr>
<tr>
<td>Closing *</td>
<td>08/05/2019</td>
<td>10/01/2019</td>
<td>INFORM DIAGNOSTICS</td>
<td>Daphne</td>
<td> 72 </td>
</tr>
"""
soup = BeautifulSoup(html, features='html.parser')
elements = soup.select('div.warn-data > table > tbody > tr > td:nth-last-child(1)')
for index, item in enumerate(elements):
    print(index, item.text)

Is it possible to add a new <td> instance to a <tr> row with bs4?

I want to edit a table of an .htm file, which roughly looks like this:
<table>
<tr>
<td>
parameter A
</td>
<td>
value A
</td>
</tr>
<tr>
<td>
parameter B
</td>
<td>
value B
</td>
</tr>
...
</table>
I made a preformatted template in Word, which has nicely formatted style="" attributes. I insert parameter values into the appropriate tds from a poorly formatted .html file (this is the output of a scientific program). My job is to automate the creation of HTML tables so that they can be used in a paper, basically.
This works fine as long as the template has empty td instances in a tr. But when I try to create additional tds inside a tr (over which I iterate), I get stuck. The .append and .insert_after methods on the rows just overwrite the existing td instances. I need to create new tds, since I want to set the number of columns dynamically and I have to iterate over up to 5 unformatted input .html files.
from bs4 import BeautifulSoup

with open('template.htm') as template:
    template = BeautifulSoup(template)

template = template.find('table')
lines_template = template.findAll('tr')
for line in lines_template:
    newtd = line.findAll('td')[-1]
    newtd['control_string'] = 'this_is_new'
    line.append(newtd)
=> No new tds. The last one is just overwritten. No new column was created.
I want to copy and paste the last td in a row, because it will have the correct style="" for that row. Is it possible to just copy a bs4.element with all the formatting and add it as the last td in a tr? If not, what module/approach should I use?
Thanks in advance.
You can copy the attributes by passing the existing td's attrs to new_tag:
data = '''<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
</tr>
</table>'''
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
for i, tr in enumerate(soup.select('tr'), 1):
    tds = tr.select('td')
    new_td = soup.new_tag('td', attrs=tds[-1].attrs)
    new_td.append('This is data for row {}'.format(i))
    tr.append(new_td)

print(soup.table.prettify())
Prints:
<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
<td style="color:blue;">
This is data for row 1
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
<td style="color:blue;">
This is data for row 2
</td>
</tr>
</table>
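If the template td contains nested markup you want to carry over as well, another option is to duplicate the whole tag with the standard-library copy module rather than only its attrs; a sketch, assuming a bs4 version that supports copying tags (4.4+):
import copy
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
for i, tr in enumerate(soup.select('tr'), 1):
    last_td = tr.select('td')[-1]
    new_td = copy.copy(last_td)  # independent copy, keeps the style attribute
    new_td.clear()               # drop the copied text but keep the formatting
    new_td.append('This is data for row {}'.format(i))
    tr.append(new_td)

print(soup.table.prettify())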

How to read the data from HTML file and write the data to CSV file using python?

I have an .html report that contains data in tables along with pass/fail criteria, and I want this data written to a .csv file using Python 3.
Please suggest how to proceed.
For example, the data looks like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
Assuming you know the header and really only need the associated percentage, with bs4 4.7.1+ you can use :contains to target the header cell and then take the adjacent td. You would read your HTML from the file into the html variable shown below.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )
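If you want to dump every table in the report rather than a single value, pandas can also do the whole round trip; a sketch, where the file names are placeholders:
import pandas as pd

# read_html returns one DataFrame per <table> found in the file
tables = pd.read_html('report.html')
for i, table in enumerate(tables):
    table.to_csv('table_{}.csv'.format(i), index=False)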
import csv
from bs4 import BeautifulSoup

out = open('out.csv', 'w', encoding='utf-8')
path = "my.html"  # add the path of your local file here
soup = BeautifulSoup(open(path), 'html.parser')
for link in soup.find_all('p'):  # add the tag you want to extract
    a = link.get_text()
    out.write(a)
    out.write('\n')
out.close()
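Since the goal is a .csv file, the same loop could feed csv.writer instead of writing raw lines; a sketch, keeping the placeholder path and tag from the snippet above:
import csv
from bs4 import BeautifulSoup

path = "my.html"  # placeholder path, as above
soup = BeautifulSoup(open(path), 'html.parser')

with open('out.csv', 'w', encoding='utf-8', newline='') as f:
    writer = csv.writer(f)
    for link in soup.find_all('p'):  # placeholder tag, as above
        writer.writerow([link.get_text(strip=True)])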

Parsing webpage with robobrowser and beautifulsoup

I'm new to web scraping and am trying to parse a website after doing a form submission with robobrowser. I get the correct data back (I can view it when I do print(browser.parsed)) but am having trouble parsing it. The relevant part of the source code of the webpage looks like this:
<div id="ii">
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
If I do:
in: browser.select('#t1b')
I get:
out: [<td id="t1b" scope="row">Participant Name</td>]
instead of JONES, JOHN.
The only way I've been able to get the relevant data is by doing:
browser.select('tr')
This returns a list of the 29 'tr' elements, each of which I can convert to text and search for the relevant info.
I've also tried creating a BeautifulSoup object:
x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")
but it loses all tags/ids and so I can't figure out how to search within it.
Is there an easy way to loop through each 'tr' element and get the actual data rather than the label, as opposed to repeatedly converting to a string variable and searching through it?
Thanks
Get all the "label" td elements, take the value from the next td sibling, and collect the results into a dict:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
</table>
"""
soup = BeautifulSoup(data, 'html5lib')

data = {
    label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
    for label in soup.select("tr > td[scope=row]")
}
pprint(data)
Prints:
{'Date of Birth': '11/25/2016',
'ID (ID Number)': '1234567',
'Participant Name': 'JONES, JOHN',
'Race / Ethnicity': 'White',
'Sex': 'MALE'}
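Individual fields are then direct lookups on the resulting dict, for example:
print(data['Participant Name'])  # JONES, JOHN
print(data['Date of Birth'])     # 11/25/2016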
