I'm trying to get only the first two columns of a webpage table using BeautifulSoup in Python. The problem is that this table sometimes contains nested tables in the third column. The structure of the HTML is similar to this:
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
<\td>
<td>
<\td>
<td>
<\td>
<\tr>
<tr>
<td>
<\td>
<td>
<\td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
<\table>
<\div>
<\td>
<\tr>
<\tbody>
<\table>
The main problem is that I don't know how to ignore every third td so that I don't read the nested tables inside the main one. I just want one list with the first column of the main table and another list with the second column, but the nested table ruins everything when I'm reading.
I have tried with this code:
import requests
from bs4 import BeautifulSoup
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
links = soup.select("table.relative-table tbody tr td.confluenceTd")
anchorList = []
for anchor in links:
    anchorList.append(anchor.text)

# delete every third entry: the column that may contain a nested table
del anchorList[2:len(anchorList):3]

for anchorItem in anchorList:
    print(anchorItem)
    print('-------------------')
This works really well until I reach the nested table, and then it starts deleting cells from the wrong columns.
I have also tried this other code:
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
for row in soup.findAll('table')[0].tbody.findAll('tr'):
    firstColumn = row.findAll('td')[0].contents
    secondColumn = row.findAll('td')[1].contents
    print(firstColumn, secondColumn)
But I get an IndexError because it's reading the nested table, and the nested table only has one td.
Does anyone know how I could read the first two columns and ignore the rest?
Thank you.
It may need some improved examples/details to clarify, but as I understand it, you are selecting the first <table> and trying to iterate over its rows:
soup.table.select('tr:not(:has(table))')
The selector above excludes all the rows that include an additional <table>.
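For the stated goal of one list per column, a minimal sketch built on that selector (assuming soup is the parsed page and, as in the question, the nested table's own rows have only one <td>, so a length guard skips them):

firstColumn, secondColumn = [], []
for row in soup.table.select('tr:not(:has(table))'):
    cells = row.select('td')
    if len(cells) >= 2:  # skips the nested table's single-cell rows
        firstColumn.append(cells[0].get_text(strip=True))
        secondColumn.append(cells[1].get_text(strip=True))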
An alternative would be to get rid of these last/third <td> elements:
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
    # or by index: row.select_one('td:nth-of-type(3)').decompose()
Now you could perform your selections on a <table> with two columns.
Example
from bs4 import BeautifulSoup
html ='''
<table class:"relative-table wrapped">
<tbody>
<tr>
<td>
</td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
<td>
<div class="table-wrap">
<table class="relative-table wrapped">
...
...
</table>
</div>
</td>
</tr>
</tbody>
</table>
'''
soup = BeautifulSoup(html, 'html.parser')
for row in soup.table.select('tr'):
    row.select_one('td:last-of-type').decompose()
soup
New soup
<table class:"relative-table="" wrapped"="">
<tbody>
<tr>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td>
</td>
<td>
</td>
</tr>
</tbody>
</table>
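With the third column gone, collecting the two lists is straightforward. A short sketch, under the same assumption that each remaining row has exactly two <td>s:

firstColumn, secondColumn = [], []
for row in soup.table.select('tr'):
    first, second = row.select('td')[:2]
    firstColumn.append(first.get_text(strip=True))
    secondColumn.append(second.get_text(strip=True))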
I am trying to scrape some data off a website. The data I want is listed in a table, but there are multiple tables and no IDs. I then had the idea to find the header just above the table I am searching for and use that as an indicator.
This has really troubled me, so as a last resort I wanted to ask if there is someone who knows how to use BeautifulSoup to find the table.
A snippet of the HTML code is provided beneath; thanks in advance :)
The table I am interested in is the table right beneath <h2>Mine neaste vagter</h2>.
<h2>Min aktuelle vagt</h2>
<div>
<a href='/shifts/detail/595212/'>Flere detaljer</a>
<p>Vagt starter: <b>11/06 2021 - 07:00</b></p>
<p>Vagt slutter: <b>11/06 2021 - 11:00</b></p>
<h2>Masker</h2>
<table class='list'>
<tr><th>Type</th><th>Fra</th><th> </th><th>Til</th></tr>
<tr>
<td>Fri egen regningD</td>
<td>07:00</td>
<td> - </td>
<td>11:00</td>
</tr>
</table>
</div>
<hr>
<h2>Mine neaste vagter</h2>
<table class='list'>
<tr>
<th class="alignleft">Dato</th>
<th class="alignleft">Rolle</th>
<th class="alignleft">Tidsrum</th>
<th></th>
<th class="alignleft">Bytte</th>
<th class="alignleft" colspan='2'></th>
</tr>
<tr class="rowA separator">
<td>
<h3>12/6</h3>
</td>
<td>Kundeservice</td>
<td>18:00 → 21:30 (3.5 t)</td>
<td style="max-width: 20em;"></td>
<td>
<a href="/shifts/ajax/popup/595390/" class="swap shiftpop">
Byt denne vagt
</a>
</td>
<td><a href="/shifts/detail/595390/">Detaljer</td>
<td>
</td>
</tr>
Here are two approaches to find the correct <table>:
Since the table you want is the last one in the HTML, you can use find_all() and take index [-1] to get the last table:
print(soup.find_all("table", class_="list")[-1])
Find the h2 element by its text, and then use the find_next() method to find the table:
print(soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text).find_next("table"))
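For completeness, a self-contained sketch of both approaches (html stands in for the snippet from the question):

from bs4 import BeautifulSoup

html = '''your html'''
soup = BeautifulSoup(html, 'html.parser')

# Approach 1: the wanted table is the last <table class="list"> in the document
table = soup.find_all("table", class_="list")[-1]

# Approach 2: locate the <h2> by its text, then jump to the next <table>
heading = soup.find(lambda tag: tag.name == "h2" and "Mine neaste vagter" in tag.text)
table = heading.find_next("table")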
You can use :-soup-contains (or just :contains) to target the <h2> by its text and then use find_next to move to the table:
from bs4 import BeautifulSoup as bs
html = '''your html'''
soup = bs(html, 'lxml')
soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
This is assuming the HTML, as shown, is returned by whatever access method you are using.
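From the located table, pulling the rows out as lists of cell texts could look like this sketch (get_text(' ', strip=True) collapses the whitespace inside each cell):

table = soup.select_one('h2:-soup-contains("Mine neaste vagter")').find_next('table')
for tr in table.select('tr'):
    print([cell.get_text(' ', strip=True) for cell in tr.select('th, td')])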
Long story short: I just discovered BeautifulSoup yesterday, I haven't done scripting or coding of any sort for many years, I'm under a time crunch, and I'm begging for help. :)
My end goal is scraping a series of web pages with vertical-style data tables and dumping them to CSV. With ye olde Google, along with my first post on Stack Overflow earlier today (at least my first time in a decade or more), I got the basics down. I can input a text file with the list of URLs, identify the DIV that contains the table I need, scrape the table so that the first column becomes my header and the second becomes the data row, and repeat for the next URLs (without repeating the header). The snag I've hit is that the code of these pages is far worse than I thought: a ton of extra lines, extra spaces, and, as I'm now finding, nested tags inside the tags, most of which are empty. Between the spans and the extra lines, the script I have so far ignores some of the data inside the TD. An example of the hideous page code:
<div id="One" class="collapse show" aria-labelledby="headingOne" data-parent="#accordionExample">
<div class="card-body">
<table class="table table-borderless">
<tbody>
<tr>
<td>ID:</td>
<td>
096626 180012
</td>
</tr>
<tr>
<td>Address:</td>
<td>
1234 Main St
</td>
</tr>
<tr>
<td>Addr City:</td>
<td>
City
</td>
</tr>
<tr>
<td> Name :</td>
<td>
Last name, first name<span> </span>
</td>
</tr>
<tr>
<td>In Care Of Address:</td>
<td>
1234<span> </span>
<span> </span>
Main<span> </span>
St <span> </span>
<span> </span>
<span> </span>
</td>
</tr>
<tr>
<td>City/State/Zip:</td>
<td>
City<span> </span>
ST<span> </span>
Zip<span>-</span>
Zip+4
</td>
</tr>
</tbody>
</table>
</div>
</div>
The code I have so far is below (right now the URL text file holds the name of a locally stored HTML file like the one above, but I have tested with the actual URLs to verify that part works):
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
contents = []
headers = []
values = []
rows = []
num = 0
with open('sampleurls.txt', 'r') as csvf:
    urls = csv.reader(csvf)
    for url in urls:
        contents.append(url)

for url in contents:
    html = open(url[0]).read()
    soup = BeautifulSoup(html, 'html.parser')
    trs = soup.select('div#One tr')
    for t in trs:
        for header, value in zip(t.select('td')[0], t.select('td')[1]):
            if num == 0:
                headers.append(' '.join(header.split()))
            values.append(' '.join(value.split()))
    rows.append(values)
    values = []
    num += 1
df = pd.DataFrame(rows, columns=headers)
print(df.head())
df.to_csv('output5.csv')
When executed, the script seems to ignore anything that comes after a newline or a span; I'm not sure which. The output I get is:
,ID:,Address:,Addr City:,Name :,In Care Of Address:,City/State/Zip:
0,096626 180012,1234 Main St,City,"Last name, first name",1234,City
In the "In Care Of Address:" column, instead of getting "1234 Main St", I just get "1234". I also tried this without the join/split function, and the remaining part of the address is still ignored. Is there a way around this? In theory, I don't need any data inside the spans as the only one populated is the hypen in the zip+4, which I don't care about.
Side note, I'm assuming the first column in the output is part of the CSV writing function, but if there's a way to get rid of it I'd like to. Not huge as I can ignore that when I import the CSV into my database, but the cleaner the better.
It is easier with correct info in the post in the first place. :)
try:
from bs4 import BeautifulSoup
import pandas as pd
html = '''
<html>
<body>
<div aria-labelledby="headingOne" class="collapse show" data-parent="#accordionExample" id="One">
<div class="card-body">
<table class="table table-borderless">
<tbody>
<tr>
<td>
ID:
</td>
<td>
096626 180012
</td>
</tr>
<tr>
<td>
Address:
</td>
<td>
1234 Main St
</td>
</tr>
<tr>
<td>
Addr City:
</td>
<td>
City
</td>
</tr>
<tr>
<td>
Name :
</td>
<td>
Last name, first name
<span>
</span>
</td>
</tr>
<tr>
<td>
In Care Of Address:
</td>
<td>
1234
<span>
</span>
<span>
</span>
Main
<span>
</span>
St
<span>
</span>
<span>
</span>
<span>
</span>
</td>
</tr>
<tr>
<td>
City/State/Zip:
</td>
<td>
City
<span>
</span>
ST
<span>
</span>
Zip
<span>
-
</span>
Zip+4
</td>
</tr>
</tbody>
</table>
</div>
</div>
</body>
</html>
'''
contents = []
headers = []
values = []
rows = []
num = 0
soup = BeautifulSoup(html, 'html.parser')
trs = soup.select('div#One tr')
for t in trs:
    for header, value in zip(t.select('td')[0], t.select('td:nth-child(2)')):
        if num == 0:
            headers.append(' '.join(header.split()))
        # get_text(' ', strip=True) joins every text node in the cell,
        # so text after <span>s and newlines is no longer dropped
        values.append(value.get_text(' ', strip=True))
rows.append(values)
df = pd.DataFrame(rows, columns=headers)
print(df.head())
df.to_csv('output5.csv', index=False)  # index=False drops the unnamed first column
Hope it works now with more relevant info.
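The key change is value.get_text(' ', strip=True): the original code paired the header with only the first text node of the value cell, so everything after the first <span> or newline was lost, while get_text gathers every text descendant of the cell and joins them with single spaces. A tiny demonstration:

from bs4 import BeautifulSoup

cell = BeautifulSoup('<td>1234<span> </span>Main<span> </span>St</td>', 'html.parser').td
print(list(cell.children)[0])          # '1234'  (first text node only)
print(cell.get_text(' ', strip=True))  # '1234 Main St'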
I'm trying to scrape a saved HTML page of results, copy the entries for each result, and iterate through the document. However, I can't figure out how to narrow down the element to start from. The data I want to grab is in the "td" tags below each of the following "tr" tags:
<tr bgcolor="#d7d7d7">
<td valign="top" nowrap="">
Submittal<br>20190919-5000
<!-- ParentAccession= -->
<br>
</td>
<td valign="top">
09/18/2019<br>
09/19/2019
</td>
<td valign="top" nowrap="">
ER19-2760-000<br>ER19-2762-000<br>ER19-2763-000<br>ER19-2764-000<br>ER1 9-2765-000<br>ER19-2766-000<br>ER19-2768-000<br><br>
</td>
<td valign="top">
(doc-less) Motion to Intervene of Snohomish County Public Utility District No. 1 under ER19-2760, et. al..<br>Availability: Public<br>
</td>
<td valign="top">
<classtype>Intervention /<br> Motion/Notice of Intervention</classtype>
</td>
<td valign="top">
<table valign="top">
<input type="HIDDEN" name="ext" value="TXT"><tbody><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904817:15359058:TXT"></td><td> Text</td><td> & nbsp; 0K</td></tr><input type="HIDDEN" name="ext" value="PDF"><tr><td valign="top"> <input type="checkbox" name="subcheck" value="V:14800341:12904822:15359063:PDF"></td><td> FERC Generated PDF</td><td> 11K</td></tr>
</tbody></table>
</td>
The next tag is <tr bgcolor="White">, with the same structure as the one above. These alternate so the results appear in different colors on the results page.
I need to go through all of the subsequent td tags and grab the data, but they aren't differentiated by a class or anything I can zero in on. The code I wrote grabs the entire contents of the td tags' text and appends it, but I need to treat each td tag as a separate item, and then do the same for the next entry, and so on.
By setting the td[0] value I start at the first td tag, but I don't think this is the correct approach.
from bs4 import BeautifulSoup
import urllib
import re
soup = BeautifulSoup(open("/Users/Desktop/FERC/uploads/ferris_9-19-2019-9-19-2019.electric.submittal.html"), "html.parser")
data = []
for td in soup.findAll(bgcolor=["#d7d7d7", "White"]):
    values = [td[0].text.strip() for td in td.findAll('td')]
    data.append(values)
print(data)
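Since each result row is a <tr> carrying a bgcolor attribute, one possible direction (a sketch, not a definitive fix) is to loop over those rows and take only their direct <td> children, so the nested checkbox table is not flattened into the row and each cell stays a separate list item:

data = []
for tr in soup.findAll('tr', bgcolor=["#d7d7d7", "White"]):
    # recursive=False keeps only the row's own cells, not the nested table's
    fields = [td.get_text(' ', strip=True) for td in tr.findAll('td', recursive=False)]
    data.append(fields)
print(data)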
I have some HTML code (for display in a browser) in a string, which contains any number of SVG images, such as:
<table>
<tr>
<td><img src="http://localhost/images/Store.Tools.svg"></td>
<td><img src="http://localhost/images/Store.Diapers.svg"></td>
</tr>
</table>
I would like to find all those image tags and replace them with the following (in order to attach the images to an email):
<table>
<tr>
<td><cid:image1></td><td><cid:image2></td>
</tr>
</table>
SVG filenames can contain any number of dots, characters, and digits.
What's the best way to do this in Python?
I'd use an HTML parser to find all img tags and replace them.
Example using BeautifulSoup and its replace_with():
from bs4 import BeautifulSoup
data = """
<table><tr>
<td><img src="http://localhost/images/Store.Tools.svg"></td>
<td><img src="http://localhost/images/Store.Diapers.svg"></td>
</tr></table>
"""
soup = BeautifulSoup(data, 'html.parser')
for index, image in enumerate(soup.find_all('img'), start=1):
    # build a replacement <img> pointing at the cid: reference and swap it in
    tag = soup.new_tag('img', src='cid:image{}'.format(index))
    image.replace_with(tag)

print(soup.prettify())
Prints:
<table>
<tr>
<td>
<img src="cid:image1"/>
</td>
<td>
<img src="cid:image2"/>
</td>
</tr>
</table>
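Since the goal is emailing, the modified tree serializes back to a string with str(soup); a one-line sketch:

html_for_email = str(soup)  # HTML body with the cid: references in place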