Python - Web Scraping - Parsing an HTML Table - Concatenating multiple hrefs into one column

I am extracting a table from my customer's website and need to parse the HTML into a Pandas DataFrame. In addition to the cell text, I want to store all the hrefs from the table in my DataFrame.
My HTML has the following schema:
<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>
I have this code:
soup = BeautifulSoup(page.content, "html.parser")
html_table = soup.find('table')
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
I am just trying to create a column with all the links for each row (if a row has more than one, group them together). But when I run this code I get:
Length of values (1102) does not match length of index (435)
What am I doing wrong?
Thanks!

You don't need read_html; the DataFrame should be defined like this:
html_table = soup.find('table')
hyperlinks = soup.find_all("a")
l = []
for a in hyperlinks:
    l.append([a.text, a.get("href")])
pd.DataFrame(l, columns=["Names", "Links"])
Update:
# here we get the headers:
html_table = soup.find('table')
trs = html_table.find_all("tr")
headers = [th.text for th in trs[0].find_all("th")]
# an empty DataFrame with all headers as columns and one row index:
df = pd.DataFrame(columns=headers, index=[0])
# here we get the contents:
body_td = trs[1].find_all("td")
for i, td in enumerate(body_td):
    HyperLinks = td.find_all("a")
    cell = [a.get("href") for a in HyperLinks]
    df.iloc[0, i] = cell
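Putting the two ideas together, here is a self-contained sketch that generalizes to multiple body rows (the two-row table and its hrefs are hypothetical stand-ins for the question's schema): collect one list of hrefs per cell, so a cell with several links keeps them grouped.

```python
from bs4 import BeautifulSoup
import pandas as pd

# hypothetical table in the same shape as the question's schema
html = """<table>
<tr><th>Col_1</th><th>Col_2</th></tr>
<tr><td><a href="/a1">x</a></td><td><a href="/b1">y</a> <a href="/b2">z</a></td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
trs = soup.find("table").find_all("tr")
headers = [th.text for th in trs[0].find_all("th")]
rows = []
for tr in trs[1:]:
    # one list of hrefs per cell, so multiple links in a cell stay grouped
    rows.append([[a.get("href") for a in td.find_all("a")]
                 for td in tr.find_all("td")])
df = pd.DataFrame(rows, columns=headers)
```

Because each row contributes exactly one list per column, the lengths always match and the original error cannot occur.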

You could grab the links before looping over the tds, using a list comprehension to collect all hrefs for a given row. Then gather all the td text into a list and extend that list with a nested single-item list: the list of hrefs you collected earlier.
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
results = []
headers = [i.text for i in soup.select('table th')]
headers.append('Links')
for _row in soup.select('table tr')[1:]:
    row = []
    links = [i['href'] for i in _row.select('a')]
    for _td in _row.select('td'):
        row.append(_td.text)
    row.extend([links])
    results.append(row)
df = pd.DataFrame(results, columns=headers)
df
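If you later need one link per row instead of a grouped list, pandas' explode can flatten that column (a small sketch with hypothetical values; the column names follow the example above):

```python
import pandas as pd

# hypothetical frame with a grouped-links column, as built above
df = pd.DataFrame({"Col_1": ["Office"], "Links": [["/a", "/b"]]})
flat = df.explode("Links", ignore_index=True)
# each list element becomes its own row; other columns are repeated
```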


Beautifulsoup get text from table grid under located words' grid

I want to extract information from this table to a CSV file, but only the grade and age values, without the "Grade:" and "Age:" labels:
<table>
<tbody>
<tr>
<td><b>Grade:</b></td>
<td>11</td>
</tr>
<tr>
<td><b>Age:</b></td>
<td>15</td>
</tr>
</tbody>
</table>
Most of the tutorials I've found only show how to parse a whole table into a CSV file, rather than how to grab the cell that follows a located word:
import csv
from bs4 import BeautifulSoup as bs
with open("1.html") as fp:
    soup = bs(fp, 'html.parser')
tables = soup.find_all('table')
filename = "input.csv"
csv_writer = csv.writer(open(filename, 'w'))
for tr in soup.find_all("tr"):
    data = []
    for th in tr.find_all("th"):
        data.append(th.text)
    if data:
        csv_writer.writerow(data)
        continue
    for td in tr.find_all("td"):
        if td.a:
            data.append(td.a.text.strip())
        else:
            data.append(td.text.strip())
    if data:
        csv_writer.writerow(data)
How should I do it? Thanks!
You can use the find_next() method to search for a <td> following a <b>:
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select("table tr > td > b"):
    print(tag.find_next("td").text)
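To write those values out to a CSV row, as the question asks, the same find_next() idea extends naturally (a sketch against the two-row table from the question; the in-memory buffer just stands in for a real file):

```python
import csv
import io
from bs4 import BeautifulSoup

html = """<table><tbody>
<tr><td><b>Grade:</b></td><td>11</td></tr>
<tr><td><b>Age:</b></td><td>15</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
# the <td> following each <b> label holds the bare value
values = [tag.find_next("td").text for tag in soup.select("table tr > td > b")]
buf = io.StringIO()
csv.writer(buf).writerow(values)
```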

Beautifulsoup Match Empty Class

I am scraping a table on a website where I am only trying to return any row where the class is blank (Row 1 and 4)
<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>
(Note there is a trailing space at the end of the is-oos class.)
When I do soup.findAll('tr', class_=None) it matches all the rows. This is because Row 2 has the class ['is-oos', ''] due to the trailing space. Is there a simple way to do a soup.findAll() or soup.select() to match these rows?
Try class_="":
from bs4 import BeautifulSoup
html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.find_all('tr', class_=""))
# Or to only get the text
print( '\n'.join(t.text for t in soup.find_all('tr', class_="")) )
Outputs:
<tr class="">Row 1</tr> <tr class="">Row 4</tr>
Row 1
Row 4
Edit To only get what's in stock, we can check the attributes of the tag:
import requests
from bs4 import BeautifulSoup
URL = "https://gun.deals/search/apachesolr_search/736676037018"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for tag in soup.find_all('tr'):
    if tag.attrs.get('class') == ['price-compare-table__oos-breaker', 'js-oos-breaker']:
        break
    print(tag.text.strip())
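A self-contained version of the same idea, stopping at a marker row (the HTML and the oos-breaker class name here are hypothetical, since the live page's markup may change):

```python
from bs4 import BeautifulSoup

# hypothetical snippet: rows before the breaker are in stock
html = """<table>
<tr class="">Row A</tr>
<tr class="">Row B</tr>
<tr class="oos-breaker">out of stock below</tr>
<tr class="">Row C</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")
in_stock = []
for tag in soup.find_all("tr"):
    if tag.attrs.get("class") == ["oos-breaker"]:
        break  # everything after the breaker row is out of stock
    in_stock.append(tag.text.strip())
```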

How can I parse two strings in a table row by using beautifulsoup?

html = '''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
Given this HTML, I would like to specifically parse the country name and the capital city name and put them into a dictionary so that I can get
dict["Afghanistan"] = 'Kabul'
I've started by doing
soup = BeautifulSoup(open(filename), 'lxml')
countries = {}
# YOUR CODE HERE
table = soup.find_all('table')
for each in table:
    if each.find('tr'):
        continue
    else:
        print(each.prettify())
return countries
But it's confusing since it's the first time using it.
You can select the "tr" elements; if a row has two "td" children, you have your data:
from bs4 import BeautifulSoup
html = """
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
countries = {}
trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    if len(tds) == 2:
        countries[tds[0].text] = tds[1].text
print(countries)
Outputs:
{'Afghanistan': 'Kabul', 'Albania': 'Tirana'}
This solution works for the given HTML example:
from bs4 import BeautifulSoup  # assuming you did pip install bs4
soup = BeautifulSoup(html, "html.parser")  # the html you mentioned
table_data = soup.find('table')
data = {}  # {'country': 'capital'} dict
for row in table_data.find_all('tr'):
    row_data = row.find_all('td')
    if row_data:
        data[row_data[0].text] = row_data[1].text
I've skipped the try/except block for any erroneous case. I suggest going through the BeautifulSoup documentation; it covers everything.
How about this:
from bs4 import BeautifulSoup
element ='''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead>
<tr><th>Country</th><th>Capital city</th></tr>
</thead>
<tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
soup = BeautifulSoup(element, 'lxml')
countries = {}
for data in soup.select("tr"):
    elem = [item.text for item in data.select("th,td")]
    countries[elem[0]] = elem[1]
print(countries)
Output:
{'Afghanistan': 'Kabul', 'Country': 'Capital city', 'Albania': 'Tirana'}
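Note that the output above also picks up the header pair ('Country': 'Capital city'). If you only want the data rows, restricting the selection to tbody keeps the header out (a sketch against the same HTML structure):

```python
from bs4 import BeautifulSoup

html = """<table>
<thead><tr><th>Country</th><th>Capital city</th></tr></thead>
<tbody>
<tr><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>"""

soup = BeautifulSoup(html, "html.parser")
# selecting only tbody rows keeps the header pair out of the dict
countries = {tds[0].text: tds[1].text
             for tds in (tr.find_all("td") for tr in soup.select("tbody tr"))}
```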

Extracting and Printing Table Headers and Data with Beautiful Soup with Python 2.7

So I'm trying to scrape data from the table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0 and I don't know how to format it properly.
I have the code below written to get the header and cell information from the website, but I'm at a loss as to how to format it so that it has the same appearance as the table on the website when I print it or save it as a .txt/.csv file. I've looked around here and on a bunch of other websites for an answer, but I'm not sure how to go forward with this. I'm very much a beginner, so any help would be appreciated.
My code just prints a long list of either the table rows or table data:
import urllib2
import bs4
from bs4 import BeautifulSoup

url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for tr in rows:
    tds = tr.find_all('td')
    print tds
The HTML that I'm looking at is below as well:
<table border=0 cellpadding=3 cellspacing=0 width=640 align="center">
<thead style="display: table-header-group;">
<tr height=18 align="center">
<th height=35 align="left" colspan="2">County</th>
<th height="35" align="right">
2005
</th>
That part shows the years as headers and continues until 2015; the state and county data appears further down:
<tr height="40" >
<th class="LeftAligned" colspan="2">Michigan</th>
<td>
127,518
</td>
and so on for the rest of the counties.
Again, any help is greatly appreciated.
You need to store your table in a list:
import urllib2
import bs4
from bs4 import BeautifulSoup

url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
table_contents = []  # store your table here
for tr in rows:
    if rows.index(tr) == 0:
        row_cells = [th.getText().strip() for th in tr.find_all('th')
                     if th.getText().strip() != '']
    else:
        row_cells = ([tr.find('th').getText()] if tr.find('th') else []) + \
                    [td.getText().strip() for td in tr.find_all('td')
                     if td.getText().strip() != '']
    if len(row_cells) > 1:
        table_contents += [row_cells]
Now table_contents has the same structure and data as the table on the page.
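Once table_contents is a list of row lists, writing it out as CSV is one call to writerows (a sketch with hypothetical sample rows taken from the question's figures; the in-memory buffer stands in for a real file):

```python
import csv
import io

# hypothetical rows in the shape produced by the loop above
table_contents = [["County", "2005"], ["Michigan", "127,518"]]
buf = io.StringIO()
csv.writer(buf).writerows(table_contents)
# the comma inside "127,518" is quoted automatically by the csv module
```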

python : parse table using beautifulsoup

I am trying to extract a table from this website: personal.vanguard.com
I'm trying to get "Holdings" and "Market values" columns.
I've tried this query, but with no luck:
from bs4 import BeautifulSoup
import urllib2

soup = BeautifulSoup(urllib2.urlopen('https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0').read())
print(soup.prettify())
print soup('tbody')
table = soup.find("tbody", {"class": "Holding"})
print table
for row in table.findAll("tr"):
    cells = row.findAll("td")
You could select all rows using this expression:
soup.select('tbody tr')
Then, for each row you could extract all columns:
[tr('td') for tr in soup.select('tbody tr')]
# Example output (note the first empty row):
[[],
[<td align="left">zulily Inc. Class A</td>,
<td>965,202</td>,
<td class="nr">$12,750,318</td>],
[<td align="left">xG Technology Inc.</td>,
<td>34,385</td>,
<td class="nr">$57,767</td>],
[<td align="left">vTv Therapeutics Inc. Class A</td>,
<td>80,223</td>,
<td class="nr">$802,230</td>],
[<td align="left">salesforce.com inc</td>,
<td>11,014,606</td>,
<td class="nr">$807,370,620</td>],
[<td align="left">pSivida Corp.</td>,
<td>447,326</td>,
<td class="nr">$1,816,144</td>],
[<td align="left">lululemon athletica Inc.</td>,
<td>1,737,050</td>,
<td class="nr">$109,190,963</td>]]
All you need is to filter required columns.
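For example, keeping just the holding name and market value from each non-empty row could look like this (a sketch against a static snippet with the same structure as the output above, since the live page requires a network call):

```python
from bs4 import BeautifulSoup

# static snippet mirroring the tbody structure shown above
html = """<table><tbody>
<tr></tr>
<tr><td align="left">zulily Inc. Class A</td><td>965,202</td><td class="nr">$12,750,318</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
rows = [(tds[0].text, tds[2].text)
        for tds in (tr("td") for tr in soup.select("tbody tr"))
        if tds]  # the first row has no td cells, so it is skipped
```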
from bs4 import BeautifulSoup
import urllib2

url = 'https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0'
soup = BeautifulSoup(urllib2.urlopen(url))
table = soup.find("tbody", {"class": "right"})
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) > 0:  # skip first row
        holding = cells[0]
        mv = cells[2]
        print holding, mv
