Beautifulsoup HTML table parsing--only able to get the last row? - python

I have a simple HTML table to parse but somehow Beautifulsoup is only able to get me results from the last row. I'm wondering if anyone would take a look at that and see what's wrong. So I already created the rows object from the HTML table:
<table class='participants-table'>
<thead>
<tr>
<th data-field="name" class="sort-direction-toggle name">Name</th>
<th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
<th data-field="sector" class="sort-direction-toggle sector">Sector</th>
<th data-field="country" class="sort-direction-toggle country">Country</th>
<th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
</tr>
</thead>
<tbody>
<tr>
<th class='name'>Grontmij</th>
<td class='type'>Company</td>
<td class='sector'>General Industrials</td>
<td class='country'>Netherlands</td>
<td class='joined-on'>2000-09-20</td>
</tr>
<tr>
<th class='name'>Groupe Bial</th>
<td class='type'>Company</td>
<td class='sector'>Pharmaceuticals & Biotechnology</td>
<td class='country'>Portugal</td>
<td class='joined-on'>2004-02-19</td>
</tr>
</tbody>
</table>
I use the following code to get the rows:
table=soup.find_all("table", class_="participants-table")
table1=table[0]
rows=table1.find_all('tr')
rows=rows[1:]
This gets:
rows=[<tr>
<th class="name">Grontmij</th>
<td class="type">Company</td>
<td class="sector">General Industrials</td>
<td class="country">Netherlands</td>
<td class="joined-on">2000-09-20</td>
</tr>, <tr>
<th class="name">Groupe Bial</th>
<td class="type">Company</td>
<td class="sector">Pharmaceuticals & Biotechnology</td>
<td class="country">Portugal</td>
<td class="joined-on">2004-02-19</td>
</tr>]
That looks as expected. However, if I continue:
for row in rows:
    cells = row.find_all('th')
I'm only able to get the last entry!
cells=[<th class="name">Groupe Bial</th>]
What is going on? This is my first time using beautifulsoup, and what I'd like to do is to export this table into CSV. Any help is greatly appreciated! Thanks

Each pass through the loop reassigns cells = row.find_all('th'), so when you print cells after the loop you only see what it was last assigned to, i.e. the th from the last tr. If you want all the th tags in a single list, extend a list instead:
cells = []
for row in rows:
    cells.extend(row.find_all('th'))
Also since there is only one table you can just use find:
soup = BeautifulSoup(html)
table = soup.find("table", class_="participants-table")
If you want to skip the thead row you can use a CSS selector that only matches rows inside tbody:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tbody tr")
cells = [tr.th for tr in rows]
print(cells)
cells will give you:
[<th class="name">Grontmij</th>, <th class="name">Groupe Bial</th>]
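To write the whole table to CSV (the URL column assumes each name cell wraps a link, as the href lookup below requires; the sample as posted shows none):
import csv
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")
# newline="" stops the csv module from writing blank lines on Windows
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])
    for row in rows[1:]:
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.th.a["href"]])
which for your page will give you: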
Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
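For a version that runs against the table exactly as posted (no links in the name cells), here is a minimal sketch. It assumes a shortened copy of the sample HTML and writes to an in-memory buffer rather than a file; the variable names are only illustrative.

```python
import csv
import io

from bs4 import BeautifulSoup

# Shortened copy of the question's table (no links inside the name cells).
html = """
<table class='participants-table'>
<thead><tr><th>Name</th><th>Type</th><th>Country</th></tr></thead>
<tbody>
<tr><th class='name'>Grontmij</th><td>Company</td><td>Netherlands</td></tr>
<tr><th class='name'>Groupe Bial</th><td>Company</td><td>Portugal</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")

# In-memory buffer; use open("data.csv", "w", newline="") to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
for row in rows:
    # The name column uses <th> even in body rows, so collect both cell types.
    cells = row.find_all(["th", "td"])
    writer.writerow([cell.get_text(strip=True) for cell in cells])

csv_text = buffer.getvalue()
print(csv_text)
```

Collecting th and td together keeps the name column, which would be lost if only td cells were read.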

Related

Parse HTML table for specific content in one column and print resulting table to file with python

I have a file test_input.htm with a table:
<table>
<thead>
<tr>
<th>Acronym</th>
<th>Full Term</th>
<th>Definition</th>
<th>Product </th>
</tr>
</thead>
<tbody>
<tr>
<td>a1</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a2</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
</td>
</tr>
<tr>
<td>a3</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a4</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: SD-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
</tbody>
</table>
I would like to write only table rows to file test_output.htm that contain the keyword PRISMA in column 4 (Product).
The following script gives me all table rows that contain the keyword PRISMA in any of the 4 columns:
from bs4 import BeautifulSoup
file_input = open('test_input.htm')
results = BeautifulSoup(file_input.read(), 'html.parser')
inhalte = results.find_all('tr')
with open('test_output.htm', 'a') as f:
    data = [[td.findChildren(text=True) for td in inhalte]]
    for line in inhalte: #if you see a line in the table
        if line.get_text().find('PRISMA') > -1 : #and you find the specific string
            f.write("%s\n" % str(line))
I really tried hard but could not figure out how to restrict the search to column 4.
The following did not work:
data = [[td.findChildren(text=True) for td in tr.findAll('td')[4]] for tr in inhalte]
I would really appreciate if someone could help me find the solution.
Select more specifically to get the elements you expect, for example with CSS selectors. The following line selects only the tr elements whose fourth td contains PRISMA:
soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))')
Example
from bs4 import BeautifulSoup
file_input = open('test_input.htm')
soup = BeautifulSoup(file_input.read(), 'html.parser')
with open('test_output.htm', 'a') as f:
    for line in soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))'):
        f.write("%s\n" % str(line))
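If the installed selector engine does not support :has() or :-soup-contains(), the same filter can be written in plain Python. This is only a sketch on a shortened copy of the table; note that column 4 is index 3, which is also why indexing with [4] in the question's attempt pointed past the last cell.

```python
from bs4 import BeautifulSoup

# Two rows from the question: a1 has PRISMA in column 4, a3 only in column 3.
html = """
<table>
<thead><tr><th>Acronym</th><th>Full Term</th><th>Definition</th><th>Product</th></tr></thead>
<tbody>
<tr><td>a1</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>PRISMA</p><p>SDDS-NG</p></td></tr>
<tr><td>a3</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>SDDS-NG</p></td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

matching_rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    # The header row has no <td>; column 4 is index 3 (zero-based).
    if len(cells) >= 4 and "PRISMA" in cells[3].get_text():
        matching_rows.append(tr)

acronyms = [tr.td.get_text(strip=True) for tr in matching_rows]
print(acronyms)
```

Each element of matching_rows can then be written out with str(tr), as in the example above.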

Is it possible to add a new <td> instance to a <tr> row with bs4?

I want to edit a table of an .htm file, which roughly looks like this:
<table>
<tr>
<td>
parameter A
</td>
<td>
value A
</td>
<tr/>
<tr>
<td>
parameter B
</td>
<td>
value B
</td>
<tr/>
...
</table>
I made a preformatted template in Word, which has nicely formatted style="" attributes. I insert parameter values into the appropriate tds from a poorly formatted .html file (this is the output from a scientific program). My job is basically to automate the creation of HTML tables so that they can be used in a paper.
This works fine as long as the template has empty td instances in a tr. But when I try to create additional tds inside a tr (over which I iterate), I get stuck. The .append and .append_after methods for the rows just overwrite existing td instances. I need to create new tds, since I want to create the number of columns dynamically and I need to iterate over up to 5 unformatted input .html files.
from bs4 import BeautifulSoup
with open('template.htm') as template:
    template = BeautifulSoup(template)
template = template.find('table')
lines_template = template.findAll('tr')
for line in lines_template:
    newtd = line.findAll('td')[-1]
    newtd['control_string'] = 'this_is_new'
    line.append(newtd)
=> No new tds. The last one is just overwritten. No new column was created.
I want to copy and paste the last td in a row, because it will have the correct style="" for that row. Is it possible to just copy a bs4.element with all the formatting and add it as the last td in a tr? If not, what module/approach should I use?
Thanks in advance.
You can create a new tag and copy the attributes by passing them as attrs:
data = '''<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
</tr>
</table>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for i, tr in enumerate(soup.select('tr'), 1):
    tds = tr.select('td')
    new_td = soup.new_tag('td', attrs=tds[-1].attrs)
    new_td.append('This is data for row {}'.format(i))
    tr.append(new_td)
print(soup.table.prettify())
Prints:
<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
<td style="color:blue;">
This is data for row 1
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
<td style="color:blue;">
This is data for row 2
</td>
</tr>
</table>
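As for why the question's code added nothing: a tag can only live in one place in the tree, so appending a td that is already the last child of its row just moves it to where it already is. Another option, sketched below on a one-row table, is to duplicate the cell with copy.copy(), which bs4 supports for tags, and append the detached copy. The replacement text is a made-up placeholder.

```python
import copy

from bs4 import BeautifulSoup

html = """<table>
<tr><td style="color:red;">parameter A</td><td style="color:blue;">value A</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr")

last_td = row.find_all("td")[-1]
new_td = copy.copy(last_td)          # detached duplicate: same attributes and contents
new_td.string = "value from file 2"  # placeholder text replacing the copied contents
row.append(new_td)                   # the copy has no parent yet, so a new cell is added

cell_count = len(row.find_all("td"))
print(row)
```

The original cell keeps its text, and the new cell carries the same style attribute.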

BeautifulSoup find() returns odd data

I am using BeautifulSoup to get data off a website. I can find the data I want, but when I print it, it comes out as "-1". The value in the field is 32.27. Here is the code I'm using:
import requests
from BeautifulSoup import BeautifulSoup
import csv
symbols = {'451020'}
with open('industry_pe.csv', "ab") as csv_file:
    writer = csv.writer(csv_file, delimiter=',')
    writer.writerow(['Industry','PE'])
    for s in symbols:
        try:
            url = 'https://eresearch.fidelity.com/eresearch/markets_sectors/sectors/industries.jhtml?tab=learn&industry='
            full = url + s
            response = requests.get(full)
            html = response.content
            soup = BeautifulSoup(html)
            for PE in soup.find("div", {"class": "sec-fundamentals"}):
                print PE
                #IndPE = PE.find("td")
                #print IndPE
When I print PE it returns this...
<h2>
Industry Fundamentals
<span>AS OF 03/08/2018</span>
</h2>
<table summary="" class="data-tbl">
<colgroup>
<col class="col1" />
<col class="col2" />
</colgroup>
<thead>
<tr>
<th scope="col"></th>
<th scope="col"></th>
</tr>
</thead>
<tbody>
<tr>
<th scope="row" class="align-left"><a href="javascript:void(0);" onclick="javasc
ript:openPopup('https://www.fidelity.com//webcontent/ap010098-etf-content/18.01/
help/research/learn_er_glossary_3.shtml#priceearningsratio',420,450);return fals
e;">P/E (Last Year GAAP Actual)</a></th>
<td>
32.27
</td>
</tr>
<tr>
<th scope="row" class="align-left"><a href="javascript:void(0);" onclick="javasc
ript:openPopup('https://www.fidelity.com//webcontent/ap010098-etf-content/18.01/
help/research/learn_er_glossary_3.shtml#priceearningsratio',420,450);return fals
e;">P/E (This Year's Estimate)</a>.....
I want to get the value 32.27 from 'td', but when I use the code I have commented out to get and print 'td', it gives me this:
-1
None
-1
<td>
32.27
</td>
-1
Any ideas?
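The find() method returns the first matching tag. Iterating over that tag, as your for loop does, yields its direct children one by one, including the text nodes (whitespace) between the tags. Those text nodes are strings, so calling find("td") on them runs the ordinary string search and returns -1; the h2 tag contains no td, which gives the None.
So, to get the <td> tags in the table, first find the container and store it in a variable, then iterate over all the td tags using find_all('td').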
table = soup.find("div", {"class": "sec-fundamentals"})
for row in table.find_all('td'):
    print(row.text.strip())
Partial Output:
32.27
34.80
$122.24B
$3.41
14.14%
15.88%
If you want only the first value, you can use this:
table = soup.find("div", {"class": "sec-fundamentals"})
value = table.find('td').text.strip()
print(value)
# 32.27
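If you do want to loop over the direct children of the div, a sketch like the following skips the text nodes first. It assumes bs4 and a trimmed-down copy of the markup from the question.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Trimmed-down copy of the printed markup from the question.
html = """
<div class="sec-fundamentals">
<h2>Industry Fundamentals</h2>
<table class="data-tbl">
<tr><th>P/E (Last Year GAAP Actual)</th><td>
32.27
</td></tr>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
section = soup.find("div", {"class": "sec-fundamentals"})

values = []
for child in section:            # iterating a tag yields its direct children
    if not isinstance(child, Tag):
        continue                 # skip the whitespace text nodes between tags
    cell = child.find("td")
    if cell is not None:         # <h2> has no <td>, so find() returns None there
        values.append(cell.get_text(strip=True))

print(values)
```

In most cases calling find_all('td') on the container, as shown above, is simpler than filtering children by hand.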

How to extract items line by line

I need some help here using beautifulsoup4 to extract data from my inventory webpage.
The webpage was written in the following format: name of the item, followed by a table listing the multiple rows of details for that particular inventory.
I am interested in getting the item name, actual qty and expiry date.
How do I go about doing it given such HTML structure (see appended)?
<div style="font-weight: bold">Item X</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>350.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>2</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>15200.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td>3</td>
<td>Company 1</td>
<td>3.8 L</td>
<td>
4
</td>
<td>
No
</td>
<td>2130.00</td>
<td>210.00</td>
<td>31-05-2019</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 15760.00</td>
<td> </td>
</tr>
</tbody>
</table>
<div style="font-weight: bold">Item Y</div>
<table cellspacing="0" cellpadding="0" class="table table-striped report-table" style="width: 800px">
<thead>
<tr>
<th> </th>
<th>Supplier</th>
<th>Packing</th>
<th>Default Qty</th>
<th>Expensive</th>
<th>Reorder Point</th>
<th>Actual Qty</th>
<th>Expiry Date</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>271.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>2</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>500.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>3</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>69.00</td>
<td>31-01-2020</td>
</tr>
<tr>
<td>4</td>
<td>Company 2</td>
<td>50X10's</td>
<td>
10
</td>
<td>
Yes
</td>
<td>1090.00</td>
<td>475.00</td>
<td>01-01-2020</td>
</tr>
<tr>
<td colspan="5"> </td>
<td>Total Qty 1315.00</td>
<td> </td>
</tr>
</tbody>
</table>
Here is one way to do it. The idea is to iterate over the items - div elements with bold substring inside the style attribute. Then, for every item, get the next table sibling using find_next_sibling() and parse the row data into a dictionary for convenient access by a header name:
from bs4 import BeautifulSoup
data = """your HTML here"""
soup = BeautifulSoup(data, "lxml")
for item in soup.select("div[style*=bold]"):
    item_name = item.get_text()
    table = item.find_next_sibling("table")
    headers = [th.get_text(strip=True) for th in table('th')]
    for row in table('tr')[1:-1]:
        row_data = dict(zip(headers, [td.get_text(strip=True) for td in row('td')]))
        print(item_name, row_data['Actual Qty'], row_data['Expiry Date'])
    print("-----")
Prints:
Item X 350.00 31-05-2019
Item X 15200.00 31-05-2019
Item X 210.00 31-05-2019
-----
Item Y 271.00 31-01-2020
Item Y 500.00 31-01-2020
Item Y 69.00 31-01-2020
Item Y 475.00 01-01-2020
-----
One solution is to iterate through each row tag i.e. <tr> and then just figure out what the column cell at each index represents and access columns that way. To do so, you can use the find_all method in BeautifulSoup, which will return a list of all elements with the tag given.
Example:
from bs4 import BeautifulSoup
html_doc = """YOUR HTML HERE"""
soup = BeautifulSoup(html_doc, 'html.parser')
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if len(cells) == 0:
        # This is the header row (it only has th cells)
        continue
    if len(cells) < 8:
        # This is the "Total Qty" summary row, which has fewer cells
        continue
    # For example, to access the text of the Default Qty column
    default_qty = cells[3].text.strip()
Note that in the case that the tr tag is actually the header row, then there will not be td tags (there will only be th tags), so in this case len(cells)==0
You can select all divs and walk through to find the next table.
If you go over the rows of the table with the exception of the last row, you can extract text from specific cells and build your inventory list.
from bs4 import BeautifulSoup
soup = BeautifulSoup(markup, "html5lib")
inventory = []
for itemdiv in soup.select('div'):
    table = itemdiv.find_next('table')
    for supply_row in table.tbody.select('tr')[:-1]:
        # columns: #, Supplier, Packing, Default Qty, Expensive, Reorder Point, Actual Qty, Expiry Date
        sn, supplier, _, _, _, _, actual_qty, exp = supply_row.select('td')
        item = [node.text.strip() for node in (sn, supplier, actual_qty, exp)]
        item[1:1] = [itemdiv.text]
        inventory.append(item)
print(inventory)
You can use the csv library to write the inventory like so:
import csv
with open('some.csv', 'w', newline='') as f:
    writer = csv.writer(f, delimiter="|")
    writer.writerow(('S/N', 'Item', 'Supplier', 'Actual Qty', 'Expiry Date'))
    writer.writerows(inventory)

Python BeautifulSoup4 get range of table rows

I have a HTML table containing 6 table rows:
<table>
<tr>
<th>1</th>
<td><p>1</p></td>
</tr>
<tr>
<th>2</th>
<td><p>2</p></td>
</tr>
<tr>
<th>3</th>
<td><p>3</p></td>
</tr>
<tr>
<th>4</th>
<td><p>4</p></td>
</tr>
<tr>
<th>5</th>
<td><p>5</p></td>
</tr>
<tr>
<th>6</th>
<td><p>6</p></td>
</tr>
</table>
My goal here is to extract only the first 5 rows.
How can I code it in Python such that BeautifulSoup stops after getting the first 5 rows?
You can use the limit keyword argument of find_all to grab only the first n elements:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
trs = soup.find('table').find_all('tr', limit=5)
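A quick way to check this is to compare limit with plain list slicing. The sketch below builds a six-row table like the one in the question; both approaches return the same five rows, the difference being that limit stops searching once five matches are found.

```python
from bs4 import BeautifulSoup

# Build a six-row table like the one in the question.
html = "<table>" + "".join(
    f"<tr><th>{n}</th><td><p>{n}</p></td></tr>" for n in range(1, 7)
) + "</table>"

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

limited = table.find_all("tr", limit=5)  # stops searching after five matches
sliced = table.find_all("tr")[:5]        # finds all six rows, then keeps five

first_cells = [tr.th.get_text() for tr in limited]
print(first_cells)
```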
