Can anyone tell me how I can get the table in an HTML page that has the most rows? I'm using BeautifulSoup.
There is one small problem, though: sometimes one table is nested inside another.
<table>
<tr>
<td>
<table>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
</tr>
</table>
</td>
</tr>
</table>
When table.findAll('tr') executes, it counts both the table's own rows and the rows of any table nested inside it. Here the parent table has just one row but the nested table has three, so I would consider the nested table to be the largest. Below is the code I currently use to dig out the largest table, but it doesn't take this scenario into account.
soup = BeautifulSoup(html)
# Get the largest table
largest_table = None
max_rows = 0
for table in soup.findAll('table'):
    number_of_rows = len(table.findAll('tr'))
    if number_of_rows > max_rows:
        largest_table = table
        max_rows = number_of_rows
I'm really lost with this. Any help guys?
Thanks in advance
Calculate number_of_rows like this, so that only rows whose nearest enclosing table is the current one are counted:
number_of_rows = len(table.findAll(lambda tag: tag.name == 'tr' and tag.findParent('table') == table))
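Put together, the whole search might look like the sketch below. It is only an illustration: the sample markup, the id attributes and the helper name own_rows are made up for the example. It also compares with `is` rather than `==`, on the assumption that two distinct tables with identical markup should not be treated as the same table.

```python
from bs4 import BeautifulSoup

# Sample markup with hypothetical ids so the result is easy to check.
html = """
<table id="outer">
  <tr><td>
    <table id="inner">
      <tr><td></td></tr>
      <tr><td></td></tr>
      <tr><td></td></tr>
    </table>
  </td></tr>
</table>
"""

def own_rows(table):
    """Return the <tr> tags whose nearest enclosing <table> is `table`."""
    return [tr for tr in table.find_all("tr")
            if tr.find_parent("table") is table]

soup = BeautifulSoup(html, "html.parser")

# Pick the table with the most rows of its own; None if the page has no table.
largest_table = max(soup.find_all("table"),
                    key=lambda t: len(own_rows(t)),
                    default=None)
```

With this input the outer table has one row of its own and the inner table has three, so largest_table is the inner one.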
I have a file test_input.htm with a table:
<table>
<thead>
<tr>
<th>Acronym</th>
<th>Full Term</th>
<th>Definition</th>
<th>Product </th>
</tr>
</thead>
<tbody>
<tr>
<td>a1</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a2</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>PRISMA</p>
</td>
</tr>
<tr>
<td>a3</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: PRISMA-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
<tr>
<td>a4</td>
<td>term</td>
<td>
<p>texttext.</p>
<p>Source: SD-GLO</p>
</td>
<td>
<p>SDDS-NG</p>
</td>
</tr>
</tbody>
</table>
I would like to write to test_output.htm only those table rows that contain the keyword PRISMA in column 4 (Product).
The following script gives me all table rows that contain the keyword PRISMA in any of the 4 columns:
from bs4 import BeautifulSoup
file_input = open('test_input.htm')
results = BeautifulSoup(file_input.read(), 'html.parser')
inhalte = results.find_all('tr')
with open('test_output.htm', 'a') as f:
    data = [[td.findChildren(text=True) for td in inhalte]]
    for line in inhalte:  # if you see a line in the table
        if line.get_text().find('PRISMA') > -1:  # and you find the specific string
            f.write("%s\n" % str(line))
I really tried hard but could not figure out how to restrict the search to column 4.
The following did not work:
data = [[td.findChildren(text=True) for td in tr.findAll('td')[4]] for tr in inhalte]
I would really appreciate if someone could help me find the solution.
Select more specifically to get only the elements you expect, for example with CSS selectors. The following line selects only those tr elements whose fourth td contains PRISMA:
soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))')
Example
from bs4 import BeautifulSoup
file_input = open('test_input.htm')
soup = BeautifulSoup(file_input.read(), 'html.parser')
with open('test_output.htm', 'a') as f:
    for line in soup.select('table tr:has(td:nth-of-type(4):-soup-contains("PRISMA"))'):
        f.write("%s\n" % str(line))
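If you would rather stay with plain Python indexing, note that the attempt in the question used index 4, which asks for a fifth cell; the fourth column is index 3. A minimal sketch of that approach follows, using a shortened copy of the table and a made-up constant name PRODUCT_COLUMN:

```python
from bs4 import BeautifulSoup

# Shortened copy of the table from the question (two data rows only).
html = """
<table>
  <thead><tr><th>Acronym</th><th>Full Term</th><th>Definition</th><th>Product</th></tr></thead>
  <tbody>
    <tr><td>a1</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>PRISMA</p><p>SDDS-NG</p></td></tr>
    <tr><td>a3</td><td>term</td><td><p>Source: PRISMA-GLO</p></td><td><p>SDDS-NG</p></td></tr>
  </tbody>
</table>
"""

PRODUCT_COLUMN = 3  # fourth column; Python indexes from 0

soup = BeautifulSoup(html, "html.parser")
matching_rows = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    # The header row has no <td> cells; rows that are too short are skipped
    # instead of raising IndexError.
    if len(cells) > PRODUCT_COLUMN and "PRISMA" in cells[PRODUCT_COLUMN].get_text():
        matching_rows.append(tr)
```

Row a3 mentions PRISMA only in its Definition column, so it is left out; each entry of matching_rows can then be written to the output file with str(row) as before.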
So I have a table:
<table>
<tr>
<th>Firstname</th>
<th>Lastname</th>
<th>Age</th>
</tr>
<tr>
<td>Jill</td>
<td>Smith</td>
<td>50</td>
</tr>
<tr>
<td>Eve</td>
<td>Jackson</td>
<td>94</td>
</tr>
</table>
Here's my code to find a random element inside it:
table = driver.find_element_by_xpath("//*[text()='Smith']")
I think I can get the parent which is the <tr> that <td> Smith is in by doing this
parent = table.parent
But my goal is to find the <table> and print out all the values in them such as...
Firstname, Lastname, Age, Jill, Smith......
I'm not really sure how to go about doing this since the table has no Class and Id.
You can try with the below snippet.
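from selenium.webdriver.common.by import By
# Selenium 4 removed find_elements_by_xpath; find_elements(By.XPATH, ...) replaces it
nodes = driver.find_elements(By.XPATH, "//table//tr/*")
for node in nodes:
    print(node.text)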
I want to edit a table of an .htm file, which roughly looks like this:
<table>
<tr>
<td>
parameter A
</td>
<td>
value A
</td>
</tr>
<tr>
<td>
parameter B
</td>
<td>
value B
</td>
</tr>
...
</table>
I made a preformatted template in Word, which has nicely formatted style="" attributes. I insert parameter values into the appropriate tds from a poorly formatted .html file (the output of a scientific program). Basically, my job is to automate the creation of HTML tables so that they can be used in a paper.
This works fine as long as the template has empty td instances in a tr. But when I try to create additional tds inside a tr (over which I iterate), I get stuck. The .append and .append_after methods for the rows just overwrite existing td instances. I need to create new tds, since I want to create the number of columns dynamically and I need to iterate over up to 5 unformatted input .html files.
from bs4 import BeautifulSoup
with open('template.htm') as template:
    template = BeautifulSoup(template)
template = template.find('table')
lines_template = template.findAll('tr')
for line in lines_template:
    newtd = line.findAll('td')[-1]
    newtd['control_string'] = 'this_is_new'
    line.append(newtd)
=> No new tds. The last one is just overwritten. No new column was created.
I want to copy and paste the last td in a row, because it will have the correct style="" for that row. Is it possible to just copy a bs4.element with all the formatting and add it as the last td in a tr? If not, what module/approach should I use?
Thanks in advance.
You can copy the attributes by assigning to the attrs:
data = '''<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
</tr>
</table>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
for i, tr in enumerate(soup.select('tr'), 1):
    tds = tr.select('td')
    new_td = soup.new_tag('td', attrs=tds[-1].attrs)
    new_td.append('This is data for row {}'.format(i))
    tr.append(new_td)
print(soup.table.prettify())
Prints:
<table>
<tr>
<td style="color:red;">
parameter A
</td>
<td style="color:blue;">
value A
</td>
<td style="color:blue;">
This is data for row 1
</td>
</tr>
<tr>
<td style="color:red;">
parameter B
</td>
<td style="color:blue;">
value B
</td>
<td style="color:blue;">
This is data for row 2
</td>
</tr>
</table>
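As for why the original attempt added nothing: appending a tag that is already in the tree moves it rather than duplicating it, so the last td was simply put back where it already was. If you want a true copy of the whole cell, Beautiful Soup tags can be duplicated with copy.copy(). A minimal sketch, with a one-row table and placeholder text invented for the example:

```python
import copy
from bs4 import BeautifulSoup

# One-row stand-in for the template; the styles are placeholders.
html = ('<table><tr><td style="color:red;">parameter A</td>'
        '<td style="color:blue;">value A</td></tr></table>')
soup = BeautifulSoup(html, "html.parser")

for tr in soup.find_all("tr"):
    last_td = tr.find_all("td")[-1]
    new_td = copy.copy(last_td)   # detached duplicate: same name, attributes and contents
    new_td.string = "new value"   # replace the copied text with the new cell's text
    tr.append(new_td)             # the copy has no parent yet, so this adds a column

cells = soup.find_all("td")
```

After the loop the row has three cells, and the new one carries the same style attribute as the cell it was copied from.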
I am scraping from an HTML table in this format:
<table>
<tr>
<th>Name</th>
<th>Date</th>
<th>Number</th>
<th>Address</th>
</tr>
<tr> 1
<td> Name-1 </td>
<td> Date-1 </td>
<td> Number-1 </td>
<td> Address-1 </td>
</tr>
<tr> 2
<td> Name-2 </td>
<td> Date-2 </td>
<td> Number-2 </td>
<td> Address-2 </td>
</tr>
</table>
It is the only table on that page. I want to store each TD tag with its corresponding TH tag info to make a list, then eventually save it to a CSV. The actual info doesn't have a -number suffix; that's just for illustration. The data has hundreds of table rows, all with the same set of fields formatted this way.
Basically, I'd want to make the 'name' be the 1st TD cell in each TR row, the date be the 2nd, and so on.
I can't seem to find a way to do this with Python 3 and BeautifulSoup4. I know there's a way; I'm just too new.
Thank you all for your help, I am learning a lot as I go.
Assuming the data is uniform, the following basic example should work:
table_rows = soup.find_all("tr")  # list of all <tr> tags
for row in table_rows:
    cells = row.find_all("td")  # list of all <td> tags within a row
    if not cells:  # skip rows without td elements
        continue
    name, date, number, address = cells  # unpack list of <td> tags into separate variables
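To pair each cell with its header and end up with CSV, something along these lines may work. It is a sketch that assumes every data row has as many td cells as there are th headers; the variable names and the in-memory buffer are choices made for the example, not part of the question.

```python
import csv
import io
from bs4 import BeautifulSoup

html = """
<table>
  <tr><th>Name</th><th>Date</th><th>Number</th><th>Address</th></tr>
  <tr><td> Name-1 </td><td> Date-1 </td><td> Number-1 </td><td> Address-1 </td></tr>
  <tr><td> Name-2 </td><td> Date-2 </td><td> Number-2 </td><td> Address-2 </td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
headers = [th.get_text(strip=True) for th in soup.find_all("th")]

records = []
for row in soup.find_all("tr"):
    cells = row.find_all("td")
    if not cells:  # the header row has <th> cells only
        continue
    values = [td.get_text(strip=True) for td in cells]
    records.append(dict(zip(headers, values)))  # header -> cell text

# Written to an in-memory buffer here; for a real file, open it with
# open("output.csv", "w", newline="") instead (the file name is hypothetical).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()
writer.writerows(records)
csv_text = buffer.getvalue()
```

Each entry in records is a dict such as {"Name": "Name-1", "Date": "Date-1", ...}, so the column order in the CSV follows the header row rather than the position of each variable.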
I have a similar issue. The script from sytech works. However, with a table of 100 rows, for instance, my code first shows row 15 instead of the first row that appears in the HTML, then row 16, row 17 ... row 100, row 1, row 2. Using Clive's code above, I get the following:
[<td> Name-15 </td>, <td> Date-15 </td>,<td> Number-15 </td>, <td> Address-15 </td>]
[<td> Name-16 </td>, <td> Date-16 </td>,<td> Number-16 </td>, <td> Address-16 </td>]
[<td> Name-16 </td>, <td> Date-16 </td>,<td> Number-16 </td>, <td> Address-16 </td>]
etc...
[<td> Name-100 </td>, <td> Date-100 </td>,<td> Number-100 </td>, <td> Address-100 </td>]
[<td> Name-1 </td>, <td> Date-1 </td>,<td> Number-1 </td>, <td> Address-1 </td>]
Any idea why it wouldn't start with the first row?
Apologies if this is formatted badly, and thank you for the help!
I have a simple HTML table to parse, but somehow BeautifulSoup only gets me results from the last row. Could anyone take a look and see what's wrong? Here is the HTML table from which I created the rows object:
<table class='participants-table'>
<thead>
<tr>
<th data-field="name" class="sort-direction-toggle name">Name</th>
<th data-field="type" class="sort-direction-toggle type active-sort asc">Type</th>
<th data-field="sector" class="sort-direction-toggle sector">Sector</th>
<th data-field="country" class="sort-direction-toggle country">Country</th>
<th data-field="joined_on" class="sort-direction-toggle joined-on">Joined On</th>
</tr>
</thead>
<tbody>
<tr>
<th class='name'>Grontmij</th>
<td class='type'>Company</td>
<td class='sector'>General Industrials</td>
<td class='country'>Netherlands</td>
<td class='joined-on'>2000-09-20</td>
</tr>
<tr>
<th class='name'>Groupe Bial</th>
<td class='type'>Company</td>
<td class='sector'>Pharmaceuticals & Biotechnology</td>
<td class='country'>Portugal</td>
<td class='joined-on'>2004-02-19</td>
</tr>
</tbody>
</table>
I use the following code to get the rows:
table=soup.find_all("table", class_="participants-table")
table1=table[0]
rows=table1.find_all('tr')
rows=rows[1:]
This gets:
rows=[<tr>
<th class="name">Grontmij</th>
<td class="type">Company</td>
<td class="sector">General Industrials</td>
<td class="country">Netherlands</td>
<td class="joined-on">2000-09-20</td>
</tr>, <tr>
<th class="name">Groupe Bial</th>
<td class="type">Company</td>
<td class="sector">Pharmaceuticals & Biotechnology</td>
<td class="country">Portugal</td>
<td class="joined-on">2004-02-19</td>
</tr>]
That looks as expected. However, if I continue:
for row in rows:
    cells = row.find_all('th')
I'm only able to get the last entry!
cells=[<th class="name">Groupe Bial</th>]
What is going on? This is my first time using BeautifulSoup, and what I'd like to do is export this table to CSV. Any help is greatly appreciated! Thanks
If you want all the th tags in a single list, you need to extend a list. As written, you reassign cells = row.find_all('th') on every iteration, so when you print cells outside the loop you only see what was assigned last, i.e. the th from the last tr:
cells = []
for row in rows:
    cells.extend(row.find_all('th'))
Also since there is only one table you can just use find:
soup = BeautifulSoup(html)
table = soup.find("table", class_="participants-table")
If you want to skip the thead row, you can use a CSS selector that matches only the rows inside tbody:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tbody tr")
cells = [tr.th for tr in rows]
print(cells)
cells will give you:
[<th class="name">Grontmij</th>, <th class="name">Groupe Bial</th>]
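To write the whole table to CSV (the URL column assumes each name th wraps an a link with an href, which the output below implies but the trimmed sample above does not show):
import csv
soup = BeautifulSoup(html, "html.parser")
rows = soup.select("table.participants-table tr")
with open("data.csv", "w", newline="") as out:
    wr = csv.writer(out)
    wr.writerow([th.text for th in rows[0].find_all("th")] + ["URL"])
    for row in rows[1:]:
        # th and td cells only, so a nested a tag is not counted as an extra column
        wr.writerow([tag.text for tag in row.find_all(["th", "td"])] + [row.th.a["href"]])
which for your data will give you: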
Name,Type,Sector,Country,Joined On,URL
Grontmij,Company,General Industrials,Netherlands,2000-09-20,/what-is-gc/participants/4479-Grontmij
Groupe Bial,Company,Pharmaceuticals & Biotechnology,Portugal,2004-02-19,/what-is-gc/participants/4492-Groupe-Bial
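If some rows might not contain a link at all, looking the link up defensively avoids an error on those rows. The sketch below is an assumption-laden illustration: the two-column table and the href value are invented, and only the class name participants-table comes from the question.

```python
from bs4 import BeautifulSoup

# Reduced table: the first row has a (hypothetical) link, the second has none.
html = """<table class="participants-table">
<thead><tr><th>Name</th><th>Type</th></tr></thead>
<tbody>
<tr><th class="name"><a href="/participants/1">Grontmij</a></th><td class="type">Company</td></tr>
<tr><th class="name">Groupe Bial</th><td class="type">Company</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
table_rows = []
for row in soup.select("table.participants-table tbody tr"):
    # One value per th/td cell, in document order.
    values = [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
    link = row.find("a", href=True)   # None when the row has no link
    values.append(link["href"] if link else "")
    table_rows.append(values)
```

Each list in table_rows can be passed straight to csv.writer's writerow; rows without a link get an empty URL field instead of raising an exception.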