I am extracting a table from my customer's website, and I need to parse the HTML into a pandas DataFrame. In addition to the cell text, I want to store all of the table's hrefs in my DataFrame.
My HTML has the following schema:
<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>
I have this code:
import pandas as pd
from bs4 import BeautifulSoup

soup = BeautifulSoup(page.content, "html.parser")
html_table = soup.find('table')
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
I am just trying to create a column with all the links for each row (if a row has more than one, group them together). But when I run this code I get:
Length of values (1102) does not match length of index (435)
What am I doing wrong?
Thanks!
You don't need read_html; the DataFrame can be defined like this:
html_table = soup.find('table')
hyperlinks = soup.find_all("a")
l = []
for a in hyperlinks:
    l.append([a.text, a.get("href")])
pd.DataFrame(l, columns=["Names", "Links"])
Update:
# here we get the headers:
html_table = soup.find('table')
trs = html_table.find_all("tr")
headers = [th.text for th in trs[0].find_all("th")]
# an empty DataFrame with all headers as columns and one row index:
df = pd.DataFrame(columns=headers, index=[0])
# here we get the contents:
body_td = trs[1].find_all("td")
i = 0
for td in body_td:
    hyperlinks = td.find_all("a")
    cell = [a.get("href") for a in hyperlinks]
    df.iloc[0, i] = cell
    i += 1
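If you want each cell to hold a single string rather than a list, you can join the lists afterwards. A minimal sketch, using a small hypothetical frame standing in for the one built above (the column names and hrefs here are made up):

```python
import pandas as pd

# Hypothetical single-row frame where each cell holds a list of hrefs,
# mimicking the shape the loop above produces.
df = pd.DataFrame({"Col_5": [["a.jpg", "b.jpg"]], "Col_6": [[]]})

# Join each list into one comma-separated string per cell:
for col in df.columns:
    df[col] = df[col].apply(", ".join)

print(df.iloc[0]["Col_5"])  # a.jpg, b.jpg
```

Cells with no links simply become empty strings.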
You could grab the links before looping over the tds, using a list comprehension to collect all hrefs for a given row; then gather all the td text into a list and extend that list with a nested one-item list, which is the list of hrefs you previously collected:
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
results = []
headers = [i.text for i in soup.select('table th')]
headers.append('Links')
for _row in soup.select('table tr')[1:]:
    row = []
    links = [i['href'] for i in _row.select('a')]
    for _td in _row.select('td'):
        row.append(_td.text)
    row.extend([links])
    results.append(row)
df = pd.DataFrame(results, columns=headers)
df
Related
I want to extract information from this table into a CSV file, but only the grade and age numbers, without the "Grade:" and "Age:" labels:
<table>
<tbody>
<tr>
<td><b>Grade:</b></td>
<td>11</td>
</tr>
<tr>
<td><b>Age:</b></td>
<td>15</td>
</tr>
</tbody>
</table>
Most of the tutorials I've found only show how to parse entire tables into a CSV file, rather than extracting the cell that follows a located word:
import csv
from bs4 import BeautifulSoup as bs
with open("1.html") as fp:
    soup = bs(fp, 'html.parser')

tables = soup.find_all('table')
filename = "input.csv"
csv_writer = csv.writer(open(filename, 'w'))

for tr in soup.find_all("tr"):
    data = []
    for th in tr.find_all("th"):
        data.append(th.text)
    if data:
        csv_writer.writerow(data)
        continue
    for td in tr.find_all("td"):
        if td.a:
            data.append(td.a.text.strip())
        else:
            data.append(td.text.strip())
    if data:
        csv_writer.writerow(data)
How should I do it? Thanks!
You can use the find_next() method to search for a <td> following a <b>:
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

for tag in soup.select("table tr > td > b"):
    print(tag.find_next("td").text)
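For completeness, here is a self-contained sketch of the same approach using the question's markup inlined as a string, collecting the values and writing them as one CSV row (into a buffer here; use open(...) for a real file):

```python
import csv
import io
from bs4 import BeautifulSoup

# The question's table, inlined for a self-contained example.
html = """<table><tbody>
<tr><td><b>Grade:</b></td><td>11</td></tr>
<tr><td><b>Age:</b></td><td>15</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
# For each <b> label cell, grab the text of the <td> that follows it:
values = [tag.find_next("td").text for tag in soup.select("table tr > td > b")]
print(values)  # ['11', '15']

buf = io.StringIO()
csv.writer(buf).writerow(values)
```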
I am scraping a table on a website where I am only trying to return any row whose class is blank (rows 1 and 4):
<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>
(Note that there is a trailing space at the end of the is-oos class.)
When I do soup.findAll('tr', class_=None) it matches all the rows. This is because Row 2 has the class ['is-oos', ''] due to the trailing space. Is there a simple way to do a soup.findAll() or soup.select() to match these rows?
Try class_="":
from bs4 import BeautifulSoup
html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.find_all('tr', class_=""))
# Or to only get the text
print( '\n'.join(t.text for t in soup.find_all('tr', class_="")) )
Outputs:
<tr class="">Row 1</tr> <tr class="">Row 4</tr>
Row 1
Row 4
Edit: To only get what's in stock, we can check the attributes of the tag:
import requests
from bs4 import BeautifulSoup
URL = "https://gun.deals/search/apachesolr_search/736676037018"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for tag in soup.find_all('tr'):
    if tag.attrs.get('class') == ['price-compare-table__oos-breaker', 'js-oos-breaker']:
        break
    print(tag.text.strip())
html = '''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
Given this HTML, I would like to specifically parse the country name and the capital city name and put them into a dictionary so that I can get
dict["Afghanistan"] = 'Kabul'
I've started by doing
soup = BeautifulSoup(open(filename), 'lxml')
countries = {}
# YOUR CODE HERE
table = soup.find_all('table')
for each in table:
    if each.find('tr'):
        continue
    else:
        print(each.prettify())
return countries
But I find it confusing, since this is my first time using BeautifulSoup.
You can select the "tr" elements; if a row has two "td" children, you have your data:
from bs4 import BeautifulSoup
html = """
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
countries = {}
trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    if len(tds) == 2:
        countries[tds[0].text] = tds[1].text
print(countries)
Outputs:
{'Afghanistan': 'Kabul', 'Albania': 'Tirana'}
This solution is for the given HTML example:
from bs4 import BeautifulSoup  # assuming you did pip install bs4

soup = BeautifulSoup(html, "html.parser")  # the html you mentioned
table_data = soup.find('table')

data = {}  # {'country': 'capital'} dict
for row in table_data.find_all('tr'):
    row_data = row.find_all('td')
    if row_data:
        data[row_data[0].text] = row_data[1].text
I've skipped the try/except block for error handling. I suggest going through the BeautifulSoup documentation; it covers everything.
How about this:
from bs4 import BeautifulSoup
element ='''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead>
<tr><th>Country</th><th>Capital city</th></tr>
</thead>
<tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
soup = BeautifulSoup(element, 'lxml')
countries = {}
for data in soup.select("tr"):
    elem = [item.text for item in data.select("th,td")]
    countries[elem[0]] = elem[1]
print(countries)
Output:
{'Afghanistan': 'Kabul', 'Country': 'Capital city', 'Albania': 'Tirana'}
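Note that this output also includes the header pair ('Country': 'Capital city'). If that is unwanted, one option is to select rows from the tbody only, so the thead row is skipped. A minimal sketch on the same markup:

```python
from bs4 import BeautifulSoup

element = """
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
"""

soup = BeautifulSoup(element, "html.parser")
countries = {}
for row in soup.select("tbody tr"):  # the header row lives in <thead>, so it is skipped
    cells = [td.text for td in row.select("td")]
    countries[cells[0]] = cells[1]
print(countries)  # {'Afghanistan': 'Kabul', 'Albania': 'Tirana'}
```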
So I'm trying to scrape data from the table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0 and I don't know how to format it properly.
I have the code below to get the <tr> and <td> information from the website, but I'm at a loss as to how to format it so that it has the same appearance as the table on the website when I print it or save it as a .txt/.csv file. I've looked around here and on a bunch of other websites for an answer, but I'm not sure how to go forward. I'm very much a beginner, so any help would be appreciated.
My code just prints a long list of either the table rows or table data:
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for tr in rows:
    tds = tr.find_all('td')
    print tds
The HTML that I'm looking at is below as well:
<table border=0 cellpadding=3 cellspacing=0 width=640 align="center">
<thead style="display: table-header-group;">
<tr height=18 align="center">
<th height=35 align="left" colspan="2">County</th>
<th height="35" align="right">
2005
</th>
That part shows the years as headers and continues until 2015; the state and county data is further down:
<tr height="40" >
<th class="LeftAligned" colspan="2">Michigan</th>
<td>
127,518
</td>
and so on for the rest of the counties.
Again, any help is greatly appreciated.
You need to store your table in a list:
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
table_contents = []  # store your table here
for tr in rows:
    if rows.index(tr) == 0:
        row_cells = [th.getText().strip() for th in tr.find_all('th') if th.getText().strip() != '']
    else:
        row_cells = ([tr.find('th').getText()] if tr.find('th') else []) + [td.getText().strip() for td in tr.find_all('td') if td.getText().strip() != '']
    if len(row_cells) > 1:
        table_contents += [row_cells]
Now table_contents has the same structure and data as the table on the page.
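From there, writing table_contents out as the .csv file the question mentions is straightforward with the csv module. A minimal sketch (with a small stand-in list, since the live page's rows aren't reproduced here; it writes to a buffer, but open('births.csv', 'w') works the same way):

```python
import csv
import io

# Stand-in for the structure built above: one header row, then data rows.
table_contents = [["County", "2005"], ["Michigan", "127,518"]]

buf = io.StringIO()
csv.writer(buf).writerows(table_contents)
print(buf.getvalue())
# County,2005
# Michigan,"127,518"
```

Fields containing commas (like the thousands-separated counts) are quoted automatically.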
I am trying to extract a table from this website: personal.vanguard.com
I'm trying to get the "Holdings" and "Market value" columns. I've tried this code, but with no luck:
from bs4 import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0').read())
print(soup.prettify())
print soup('tbody')
table = soup.find("tbody", { "class" : "Holding" })
print table
for row in table.findAll("tr"):
    cells = row.findAll("td")
You could select all rows using this expression:
soup.select('tbody tr')
Then, for each row you could extract all columns:
[tr('td') for tr in soup.select('tbody tr')]
# Example output (note the first empty row):
[[],
[<td align="left">zulily Inc. Class A</td>,
<td>965,202</td>,
<td class="nr">$12,750,318</td>],
[<td align="left">xG Technology Inc.</td>,
<td>34,385</td>,
<td class="nr">$57,767</td>],
[<td align="left">vTv Therapeutics Inc. Class A</td>,
<td>80,223</td>,
<td class="nr">$802,230</td>],
[<td align="left">salesforce.com inc</td>,
<td>11,014,606</td>,
<td class="nr">$807,370,620</td>],
[<td align="left">pSivida Corp.</td>,
<td>447,326</td>,
<td class="nr">$1,816,144</td>],
[<td align="left">lululemon athletica Inc.</td>,
<td>1,737,050</td>,
<td class="nr">$109,190,963</td>]]
All you need to do is filter the required columns.
from bs4 import BeautifulSoup
import urllib2
url = 'https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0'
soup = BeautifulSoup(urllib2.urlopen(url))
table = soup.find("tbody", { "class" : "right" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) > 0:  # skip first row
        holding = cells[0]
        mv = cells[2]
        print holding, mv