I want to extract information from this table into a CSV file, but only the grade and age numbers, without the "Grade:" and "Age:" labels:
<table>
<tbody>
<tr>
<td><b>Grade:</b></td>
<td>11</td>
</tr>
<tr>
<td><b>Age:</b></td>
<td>15</td>
</tr>
</tbody>
</table>
Most of the tutorials I've found only show how to parse whole tables into a CSV file, rather than grabbing the cell that follows a located word:
import csv
from bs4 import BeautifulSoup as bs

with open("1.html") as fp:
    soup = bs(fp, 'html.parser')
tables = soup.find_all('table')
filename = "input.csv"
csv_writer = csv.writer(open(filename, 'w'))
for tr in soup.find_all("tr"):
    data = []
    for th in tr.find_all("th"):
        data.append(th.text)
    if data:
        csv_writer.writerow(data)
        continue
    for td in tr.find_all("td"):
        if td.a:
            data.append(td.a.text.strip())
        else:
            data.append(td.text.strip())
    if data:
        csv_writer.writerow(data)
How should I do it? Thanks!
You can use the find_next() method to search for a <td> following a <b>:
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select("table tr > td > b"):
    print(tag.find_next("td").text)
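Putting it together with the csv module, here is a minimal sketch that writes just the two numbers as one CSV row (the sample table is inlined for a self-contained demo, and "output.csv" is a placeholder filename):

```python
import csv
from bs4 import BeautifulSoup

# the sample table from the question, inlined
html = """<table><tbody>
<tr><td><b>Grade:</b></td><td>11</td></tr>
<tr><td><b>Age:</b></td><td>15</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
# for each bold label cell, grab the text of the <td> that follows it
values = [b.find_next("td").text for b in soup.select("table tr > td > b")]

with open("output.csv", "w", newline="") as f:  # "output.csv" is a placeholder name
    csv.writer(f).writerow(values)
```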
I am extracting a table from my customer's website and need to parse the HTML into a Pandas dataframe. I also want to store all the hrefs from the table in my dataframe.
My HTML has the following schema:
<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>
I have this code:
soup = BeautifulSoup(page.content, "html.parser")
html_table = soup.find('table')
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
I am just trying to create a column with all the links for each row (if a row has more than one, group them into a list). But when I run this code I get:
Length of values (1102) does not match length of index (435)
What am I doing wrong?
Thanks!
You don't need read_html; the DataFrame can be defined like this:
html_table = soup.find('table')
hyperlinks = soup.find_all("a")
l = []
for a in hyperlinks:
    l.append([a.text, a.get("href")])
pd.DataFrame(l, columns=["Names", "Links"])
Update:
# here we get the headers:
headers = []
html_table = soup.find('table')
trs = html_table.find_all("tr")
headers = [th.text for th in trs[0].find_all("th")]
# an empty dataframe with all headers as columns and one row index:
df = pd.DataFrame(columns=headers, index=[0])
# here we get the contents:
body_td = trs[1].find_all("td")
i = 0
for td in body_td:
    HyperLinks = td.find_all("a")
    cell = [a.get("href") for a in HyperLinks]
    df.iloc[0, i] = cell
    i += 1
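Alternatively, if you'd rather keep read_html, the length mismatch disappears once the hrefs are grouped per row instead of collected across the whole table. A minimal sketch against a small inline table (the column names and paths here are made up for illustration):

```python
from io import StringIO

from bs4 import BeautifulSoup
import pandas as pd

html = '''<table>
<tr><th>Col_1</th><th>Col_2</th></tr>
<tr><td><a href="/a">A</a></td><td><a href="/b">B</a><br><a href="/c">C</a></td></tr>
<tr><td>plain</td><td><a href="/d">D</a></td></tr>
</table>'''

soup = BeautifulSoup(html, 'html.parser')
html_table = soup.find('table')
df = pd.read_html(StringIO(str(html_table)), header=0)[0]

# build one list of hrefs per data row, so its length matches the index
rows = html_table.find_all('tr')[1:]  # skip the header row
df['Link'] = [[a.get('href') for a in tr.find_all('a')] for tr in rows]
```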
You could grab the links before looping over the tds, using a list comprehension to collect all hrefs for a given row. Then gather the td text into a list and extend that list with a nested one-item list containing the hrefs you collected:
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
results = []
headers = [i.text for i in soup.select('table th')]
headers.append('Links')

for _row in soup.select('table tr')[1:]:
    row = []
    links = [i['href'] for i in _row.select('a')]
    for _td in _row.select('td'):
        row.append(_td.text)
    row.extend([links])
    results.append(row)

df = pd.DataFrame(results, columns=headers)
df
I am scraping a table on a website where I am only trying to return any row where the class is blank (Row 1 and 4)
<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>
(Note there is a trailing space at the end of the is-oos class.)
When I do soup.findAll('tr', class_=None) it matches all the rows. This is because Row 2 has the class ['is-oos', ''] due to the trailing space. Is there a simple way to do a soup.findAll() or soup.select() to match these rows?
Try class_="":
from bs4 import BeautifulSoup
html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.find_all('tr', class_=""))
# Or to only get the text
print('\n'.join(t.text for t in soup.find_all('tr', class_="")))
Outputs:
<tr class="">Row 1</tr> <tr class="">Row 4</tr>
Row 1
Row 4
Edit: To only get what's in stock, we can check the attributes of the tag:
import requests
from bs4 import BeautifulSoup
URL = "https://gun.deals/search/apachesolr_search/736676037018"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for tag in soup.find_all('tr'):
    if tag.attrs.get('class') == ['price-compare-table__oos-breaker', 'js-oos-breaker']:
        break
    print(tag.text.strip())
I have a sample HTML in a variable html_doc like this :
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
Using JavaScript it's pretty straightforward to parse the DOM. But if I want to grab ONLY the URL (https://test.com) and the Time (01/01/1970, 00:00:00) into 2 different variables from the <td> tags above, how can I do it when there is no class name associated with them?
My test.py file
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
print(test)
You already have all the td elements. You can iterate through them:
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        print(td, td.text)
        # <td>https://test.com</td> https://test.com
If you want, you can be a bit less explicit by searching for the td element with the "highlight" class and finding its next sibling, but this is more error-prone in case the DOM changes:
for td in soup.find_all('td', {'class': 'highlight'}):
    print(td.find_next_sibling())
    # <td>https://test.com</td>
You can try using a regular expression to get the URL:
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
for tag in test:
    urls = re.match(r'https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', tag.text)
    time = re.match(r'[0-9/:, ]+', tag.text)
    if urls is not None:
        print(urls.group(0))
    if time is not None:
        print(time.group(0))
Output
01/01/1970, 00:00:00
https://test.com
This is a very specific solution. If you need a general approach, Hari Krishnan's solution with a few tweaks might be more suitable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
tds = []
for td in soup.find_all('td', {'class': ['highlight', 'light']}):
    tds.append(td.find_next_sibling().string)
time, link = tds
Building on @DeepSpace's answer:
import re
from bs4 import BeautifulSoup
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
datepattern = re.compile(r"\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}")
soup = BeautifulSoup(html_doc, 'html.parser')
for td in soup.find_all('td'):
    if td.text.startswith('http'):
        link = td.text
    elif datepattern.search(td.text):
        time = td.text
print(link, time)
html = '''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
Given this HTML, I would like to specifically parse the country name and the capital city name and put them into a dictionary so that I can get
dict["Afghanistan"] = 'Kabul'
I've started by doing
soup = BeautifulSoup(open(filename), 'lxml')
countries = {}
# YOUR CODE HERE
table = soup.find_all('table')
for each in table:
    if each.find('tr'):
        continue
    else:
        print(each.prettify())
return countries
But it's confusing since it's my first time using it.
You can select the "tr" elements; if one has two "td" children, you have your data:
from bs4 import BeautifulSoup
html = """
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
countries = {}
trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    if len(tds) == 2:
        countries[tds[0].text] = tds[1].text
print(countries)
Outputs:
{'Afghanistan': 'Kabul', 'Albania': 'Tirana'}
This solution is for the given HTML example:
from bs4 import BeautifulSoup  # assuming you did pip install bs4

soup = BeautifulSoup(html, "html.parser")  # the html you mentioned
table_data = soup.find('table')
data = {}  # {'country': 'capital'} dict
for row in table_data.find_all('tr'):
    row_data = row.find_all('td')
    if row_data:
        data[row_data[0].text] = row_data[1].text
I've skipped the try/except block for any erroneous case. I suggest going through the BeautifulSoup documentation; it covers everything.
How about this:
from bs4 import BeautifulSoup
element ='''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead>
<tr><th>Country</th><th>Capital city</th></tr>
</thead>
<tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
soup = BeautifulSoup(element, 'lxml')
countries = {}
for data in soup.select("tr"):
    elem = [item.text for item in data.select("th,td")]
    countries[elem[0]] = elem[1]
print(countries)
Output:
{'Afghanistan': 'Kabul', 'Country': 'Capital city', 'Albania': 'Tirana'}
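If the 'Country': 'Capital city' header pair is unwanted, a small variant restricts the selector to rows inside tbody (a sketch against the same markup, using html.parser so no extra parser is needed):

```python
from bs4 import BeautifulSoup

element = '''
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead>
<tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
'''

soup = BeautifulSoup(element, 'html.parser')
countries = {}
# "tbody tr" skips the header row, which lives inside <thead>
for row in soup.select("tbody tr"):
    cells = [td.text for td in row.select("td")]
    countries[cells[0]] = cells[1]
```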
So I'm trying to scrape data from the table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0 and I don't know how to format it properly.
I have the code below written to get the <tr> and <td> information from the website, but I'm at a loss as to how to format it so that it has the same appearance as the table on the website when I print it or save it as a .txt/.csv file. I've looked around here and on a bunch of other websites for an answer, but I'm not sure how to go forward. I'm very much a beginner, so any help would be appreciated.
My code just prints a long list of either the table rows or table data:
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for tr in rows:
    tds = tr.find_all('td')
    print tds
The HTML that I'm looking at is below as well:
<table border=0 cellpadding=3 cellspacing=0 width=640 align="center">
<thead style="display: table-header-group;">
<tr height=18 align="center">
<th height=35 align="left" colspan="2">County</th>
<th height="35" align="right">
2005
</th>
That part shows the years as headers and continues until 2015; the state and county data is further down:
<tr height="40" >
<th class="LeftAligned" colspan="2">Michigan</th>
<td>
127,518
</td>
and so on for the rest of the counties.
Again, any help is greatly appreciated.
You need to store your table in a list
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
table_contents = []  # store your table here
for tr in rows:
    if rows.index(tr) == 0:
        row_cells = [th.getText().strip() for th in tr.find_all('th') if th.getText().strip() != '']
    else:
        row_cells = ([tr.find('th').getText()] if tr.find('th') else []) + [td.getText().strip() for td in tr.find_all('td') if td.getText().strip() != '']
    if len(row_cells) > 1:
        table_contents += [row_cells]
Now table_contents has the same structure and data as the table on the page.
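Since the goal is a .csv file, the collected rows can be handed straight to csv.writerows. A Python 3 sketch with a tiny stand-in list (only the header and the Michigan figure shown in the question's HTML; 'births.csv' is a placeholder name):

```python
import csv

# stand-in for the table_contents built above
table_contents = [
    ['County', '2005'],
    ['Michigan', '127,518'],
]

with open('births.csv', 'w', newline='') as f:  # 'births.csv' is a placeholder
    # csv.writer quotes values that contain commas, e.g. "127,518"
    csv.writer(f).writerows(table_contents)
```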