Python BeautifulSoup4 get range of table rows

Python BeautifulSoup4 get range of table rows - python

I have a HTML table containing 6 table rows:
<table>
<tr>
<th>1</th>
<td><p>1</p></td>
</tr>
<tr>
<th>2</th>
<td><p>2</p></td>
</tr>
<tr>
<th>3</th>
<td><p>3</p></td>
</tr>
<tr>
<th>4</th>
<td><p>4</p></td>
</tr>
<tr>
<th>5</th>
<td><p>5</p></td>
</tr>
<tr>
<th>6</th>
<td><p>6</p></td>
</tr>
</table>
My goal here is to extract only the first 5 rows.
How can i code it in python such that BeautifulSoup breaks after getting the first 5 rows?

You can use the limit kwarg in findAll to grab only the first n elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
trs = soup.find('table').findAll('tr', limit=5)

Related

Parse HTML table for specific content in one column and print resulting table to file with python

I have a file test_input.htm with a table:
<table>
<thead>
<tr>
<th>Acronym</th>
<th>Full Term</th>
<th>Definition</th>
<th>Product </th>
</tr>
</thead>
<tbody>
<tr>
<td>a1</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a2</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
</td>
</tr>
<tr>
<td>a3</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a4</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: SD-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
</tbody>
</table>
I would like to write only table rows to file test_output.htm that contain the keyword PRISMA in column 4 (Product).
The follwing script gives me all table rows that contain the keyword PRISMA in any of the 4 columns:
from bs4 import BeautifulSoup
file_input = open('test_input.htm')
results = BeautifulSoup(file_input.read(), 'html.parser')
inhalte = results.find_all('tr')
with open('test_output.htm', 'a') as f:
data = [[td.findChildren(text=True) for td in inhalte]]
for line in inhalte: #if you see a line in the table
if line.get_text().find('PRISMA') > -1 : #and you find the specific string
f.write("%s\n" % str(line))
I really tried hard but could not figure out how to restict the search to column 4.
The following did not work:
data = [[td.findChildren(text=True) for td in tr.findAll('td')[4]] for tr in inhalte]
I would really appreciate if someone could help me find the solution.

Select more specific to get the elements you expect - For example use css selectors to achieve your task. Following line will only select tr from table thats fourth td contains PRISMA:
soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))')
Example
from bs4 import BeautifulSoup
file_input = open('test_input.htm')
soup = BeautifulSoup(file_input.read(), 'html.parser')
with open('test_output.htm', 'a') as f:
for line in soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))'):
f.write("%s\n" % str(line))

How to read the data from HTML file and write the data to CSV file using python?

I have a .html file report which consists of the data in terms of tables and pass-fail criteria. so I want this data to be written to .csv file Using Python3.
Please suggest me how to proceed?
For example, the data will be like this:
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>

Assuming you know header and really only need the associated percentage, with bs4 4.7.1 you can use :contains to target header and then take next td. You would be reading your HTML from file into html variable shown.
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''
<h2>Sequence Evaluation of Entire Project <em class="contentlink">[Contents]</em> </h2>
<table width="100%" class="coverage">
<tr class="nohover">
<td colspan="8" class="tableabove">Test Sequence State</td>
</tr>
<tr>
<th colspan="2" style="white-space:nowrap;">Metric</th>
<th colspan="2">Percentage</th>
<th>Target</th>
<th>Total</th>
<th>Reached</th>
<th>Unreached</th>
</tr>
<tr>
<td colspan="2">Test Sequence Work Progress</td>
<td>100.0%</td>
<td>
<table class="metricbar">
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
<tr>
<td class="covreached" width="99%"></td>
<td class="target" width="1%"></td>
<td class="covreached" width="0%"></td>
<td class="covnotreached" width="0%"></td>
</tr>
<tr class="borderX">
<td class="white"></td>
<td class="target"></td>
<td class="white" colspan="2"></td>
</tr>
</table>
</td>
<td>100%</td>
<td>24</td>
<td>-</td>
<td>-</td>
</tr>
<tr>
'''
soup = bs(html, 'lxml') # 'html.parser' if lxml not installed
header = 'Test Sequence Work Progress'
result = soup.select_one('td:contains("' + header + '") + td').text
df = pd.DataFrame([result], columns = [header])
print(df)
df.to_csv(r'C:\Users\User\Desktop\data.csv', sep=',', encoding='utf-8-sig',index = False )

import csv
from bs4 import BeautifulSoup
out = open('out.csv', 'w', encoding='utf-8')
path="my.html" #add the path of your local file here
soup = BeautifulSoup(open(path), 'html.parser')
for link in soup.find_all('p'): #add tag whichyou want to extract
a=link.get_text()
out.write(a)
out.write('\n')
out.close()

How to extract items line by line

I need some help here using beautifulsoup4 to extract data from my inventory webpage.
The webpage was written in the following format: name of the item, followed by a table listing the multiple rows of details for that particular inventory.
I am interested in getting the item name, actual qty and expiry date.
How do I go about doing it given such HTML structure (see appended)?
<div style="font-weight: bold">Item X</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>350.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>2</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>15200.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>3</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>210.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 15760.00</td>
<td> </td>
</tr>
</tbody>
</table>
<div style="font-weight: bold">Item Y</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>271.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>2</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>500.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>3</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>69.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>4</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>475.00</td>
<td>01-01-2020</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 1315.00</td>
<td> </td>
</tr>
</tbody>
</table>

Here is one way to do it. The idea is to iterate over the items - div elements with bold substring inside the style attribute. Then, for every item, get the next table sibling using find_next_sibling() and parse the row data into a dictionary for convenient access by a header name:
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "lxml")
for item in soup.select("div[style*=bold]"):
item_name = item.get_text()
table = item.find_next_sibling("table")
headers = [th.get_text(strip=True) for th in table('th')]
for row in table('tr')[1:-1]:
row_data = dict(zip(headers, [td.get_text(strip=True) for td in row('td')]))
print(item_name, row_data['Actual Qty'], row_data['Expiry Date'])
print("-----")
Prints:
Item X 350.00 31-05-2019
Item X 15200.00 31-05-2019
Item X 210.00 31-05-2019
-----
Item Y 271.00 31-01-2020
Item Y 500.00 31-01-2020
Item Y 69.00 31-01-2020
Item Y 475.00 01-01-2020
-----

One solution is to iterate through each row tag i.e. <tr> and then just figure out what the column cell at each index represents and access columns that way. To do so, you can use the find_all method in BeautifulSoup, which will return a list of all elements with the tag given.
Example:
from bs4 import BeautifulSoup
html_doc = YOUR HTML HERE
soup = BeautifulSoup(html_doc, 'html.parser')
for row in soup.find_all("tr"):
cells = row.find_all("td")
if len(cells) == 0:
#This is the header row
else:
#If you want to access the text of the Default Quantity column for example
default_qty = cells[3].text
Note that in the case that the tr tag is actually the header row, then there will not be td tags (there will only be th tags), so in this case len(cells)==0

You can select all divs and walk through to find the next table.
If you go over the rows of the table with the exception of the last row, you can extract text from specific cells and build your inventory list.
soup = BeautifulSoup(markup, "html5lib")
inventory = []
for itemdiv in soup.select('div'):
table = itemdiv.find_next('table')
for supply_row in table.tbody.select('tr')[:-1]:
sn, supplier, _, actual_qty, _, _, _, exp = supply_row.select('td')
item = map(lambda node: node.text.strip(), [sn, supplier, actual_qty, exp])
item[1:1] = [itemdiv.text]
inventory.append(item)
print(inventory)
You can use the csv library to write the inventory like so:
import csv
with open('some.csv', 'wb') as f:
writer = csv.writer(f, delimiter="|")
writer.writerow(('S/N', 'Item', 'Supplier', 'Actual Qty', 'Expiry Date'))
writer.writerows(inventory)

Parsing webpage with robobrowser and beautifulsoup

I'm new to webscraping trying to parse a website after doing a form submission with robobrowser. I get the correct data back (I can view it when I do: print(browser.parsed)) but am having trouble parsing it. The relevant part of the source code of the webpage looks like this:
<div id="ii">
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
if I do
in: browser.select('#t1b")
I get:
out: [<td id="t1b" scope="row">Inmate Name</td>]
instead of JONES, JOHN.
The only way I've been able to get the relevant data is doing:
browser.select('tr')
This returns a list of each of the 29 with each 'tr' that I can convert to text and search for the relevant info.
I've also tried creating a BeautifulSoup object:
x = browser.select('#ii')
soup = BeautifulSoup(x[0].text, "html.parser")
but it loses all tags/ids and so I can't figure out how to search within it.
Is there an easy way to have it loop through each element with 'tr' and get the actual data and not the label as oppose to repeatedly converting to a string variable and searching through it?
Thanks

Get all the "label" td elements and get the next td sibling value collecting results into a dict:
from pprint import pprint
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td scope="row" id="t1a"> ID (ID Number)</a></td>
<td headers="t1a">1234567 </td>
</tr>
<tr>
<td scope="row" id="t1b">Participant Name</td>
<td headers="t1b">JONES, JOHN </td>
</tr>
<tr>
<td scope="row" id="t1c">Sex</td>
<td headers="t1c">MALE </td>
</tr>
<tr>
<td scope="row" id="t1d">Date of Birth</td>
<td headers="t1d">11/25/2016 </td>
</tr>
<tr>
<td scope="row" id="t1e">Race / Ethnicity</a></td>
<td headers="t1e">White </td>
</tr>
</table>
"""
soup = BeautifulSoup(data, 'html5lib')
data = {
label.get_text(strip=True): label.find_next_sibling("td").get_text(strip=True)
for label in soup.select("tr > td[scope=row]")
}
pprint(data)
Prints:
{'Date of Birth': '11/25/2016',
'ID (ID Number)': '1234567',
'Participant Name': 'JONES, JOHN',
'Race / Ethnicity': 'White',
'Sex': 'MALE'}

How to get text content of multiple <td> tags inside a table using PyQuery?

How to select attribute's text field from given book-details table field where values are in text or in text field?
<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>

You can use following code snippet to get Publisher and ISBN-13 data:
from pyquery import PyQuery
html = """<table cellspacing="0" class="fk-specs-type2">
<tr>
<th class="group-head" colspan="2">Book Details</th>
</tr>
<tr>
<td class="specs-key">Publisher</td>
<td class="specs-value fk-data">HARPER COLLINS INDIA</td>
</tr>
<tr>
<td class="specs-key">ISBN-13</td>
<td class="specs-value fk-data">9789350291924</td>
</tr>
</table>
"""
doc = PyQuery(html)
for td in doc("table.fk-specs-type2").find("td.specs-key"):
print td.text, td.getnext().text
It should print following two lines
Publisher HARPER COLLINS INDIA
ISBN-13 9789350291924

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python BeautifulSoup4 get range of table rows - python

You can use the limit kwarg in findAll to grab only the first n elements from bs4 import BeautifulSoup soup = BeautifulSoup(html) trs = soup.find('table').findAll('tr', limit=5)

Related

Parse HTML table for specific content in one column and print resulting table to file with python

How to read the data from HTML file and write the data to CSV file using python?

How to extract items line by line

Parsing webpage with robobrowser and beautifulsoup

How to get text content of multiple <td> tags inside a table using PyQuery?

Categories

Resources