Python: parse table using BeautifulSoup

I am trying to extract a table from this website: personal.vanguard.com
I'm trying to get "Holdings" and "Market values" columns.
I've tried the following code, but with no luck:
from bs4 import BeautifulSoup
import urllib2
soup = BeautifulSoup(urllib2.urlopen('https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0').read())
print(soup.prettify())
print soup('tbody')
table = soup.find("tbody", { "class" : "Holding" })
print table
for row in table.findAll("tr"):
    cells = row.findAll("td")

You could select all rows using this expression:
soup.select('tbody tr')
Then, for each row you could extract all columns:
[tr('td') for tr in soup.select('tbody tr')]
# Example output (note the first empty row):
[[],
[<td align="left">zulily Inc. Class A</td>,
<td>965,202</td>,
<td class="nr">$12,750,318</td>],
[<td align="left">xG Technology Inc.</td>,
<td>34,385</td>,
<td class="nr">$57,767</td>],
[<td align="left">vTv Therapeutics Inc. Class A</td>,
<td>80,223</td>,
<td class="nr">$802,230</td>],
[<td align="left">salesforce.com inc</td>,
<td>11,014,606</td>,
<td class="nr">$807,370,620</td>],
[<td align="left">pSivida Corp.</td>,
<td>447,326</td>,
<td class="nr">$1,816,144</td>],
[<td align="left">lululemon athletica Inc.</td>,
<td>1,737,050</td>,
<td class="nr">$109,190,963</td>]]
All you need to do now is filter out the required columns.

from bs4 import BeautifulSoup
import urllib2
url = 'https://personal.vanguard.com/us/FundsAllHoldings?FundId=0970&FundIntExt=INT&tableName=Equity&tableIndex=0'
soup = BeautifulSoup(urllib2.urlopen(url))
table = soup.find("tbody", { "class" : "right" })
for row in table.findAll("tr"):
    cells = row.findAll("td")
    if len(cells) > 0: # skip first row
        holding = cells[0]
        mv = cells[2]
        print holding, mv
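For reference, here is a self-contained Python 3 sketch of the same filtering, run against a small sample of the output shown above (the live page's markup may differ):

```python
from bs4 import BeautifulSoup

# A small sample of the table body shown above
html = """<tbody class="right">
<tr><td align="left">zulily Inc. Class A</td><td>965,202</td><td class="nr">$12,750,318</td></tr>
<tr><td align="left">xG Technology Inc.</td><td>34,385</td><td class="nr">$57,767</td></tr>
</tbody>"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for tr in soup.select("tbody tr"):
    cells = tr.find_all("td")
    if cells:  # skip any empty rows
        # column 0 is the holding, column 2 the market value
        rows.append((cells[0].get_text(strip=True), cells[2].get_text(strip=True)))
print(rows)
```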

Related

Python - Web-Scraping - Parsing HTML Table - Concat multiple href into one column

I am extracting a table from my customer website and I need to parse this HTML into a Pandas dataframe. However, on the table I want to store all the HREFs into my dataframe.
My HTML has the following schema:
<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>
I have this code:
soup = BeautifulSoup(page.content, "html.parser")
html_table = soup.find('table')
df = pd.read_html(str(html_table), header=0)[0]
df['Link'] = [link.get('href') for link in html_table.find_all('a')]
I am just trying to create a column with all the links from each row (if a row has more than one, group them into a list). But when I run this code I get:
Length of values (1102) does not match length of index (435)
What am I doing wrong?
Thanks!
You don't need read_html; the DataFrame can be built like this:
html_table = soup.find('table')
hyperlinks=soup.find_all("a")
l=[]
for a in hyperlinks:
    l.append([a.text,a.get("href")])
pd.DataFrame(l,columns=["Names","Links"])
Update:
#here we get headers:
headers=[]
html_table = soup.find('table')
trs=html_table.find_all("tr")
headers=[th.text for th in trs[0].find_all("th")]
#an empty dataframe with all headers as columns and one row index:
df=pd.DataFrame(columns=headers,index=[0])
#here we get contents:
body_td=trs[1].find_all("td")
i=0
for td in body_td:
    HyperLinks=td.find_all("a")
    cell=[a.get("href") for a in HyperLinks]
    df.iloc[0,i]=cell
    i+=1
You could grab the links before looping over the tds, using a list comprehension to collect all hrefs for a given row; then gather each td's text into a list and extend that list with a nested single-item list containing the hrefs you collected:
from bs4 import BeautifulSoup as bs
import pandas as pd
html = '''<table>
<tr>
<th>Col_1</th>
<th>Col_2</th>
<th>Col_3</th>
<th>Col_4</th>
<th>Col_5</th>
<th>Col_6</th>
<th>Col_7</th>
<th>Col_8</th>
<th>Col_9</th>
</tr>
<tr>
<td>Office</td>
<td>Office2</td>
<td>Customer</td>
<td></td>
<td>New Doc<br>my_work.jpg</td>
<td>Person_2<br>Person_3<br>Person 3</td>
<td>Person_1<br>Person_1<br>Person_1<br>Person_1</td>
<td>STATUS</td>
<td>9030303</td>
</tr>
</table>'''
soup = bs(html, 'lxml')
results = []
headers = [i.text for i in soup.select('table th')]
headers.append('Links')
for _row in soup.select('table tr')[1:]:
    row = []
    links = [i['href'] for i in _row.select('a')]
    for _td in _row.select('td'):
        row.append(_td.text)
    row.extend([links])
    results.append(row)
df = pd.DataFrame(results, columns = headers)
df
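If you later need one row per link rather than a list per cell, pandas' DataFrame.explode (available since pandas 0.25) can unnest the Links column. A hypothetical follow-up with made-up hrefs:

```python
import pandas as pd

# Hypothetical result of the scrape above: one list of hrefs per row
df = pd.DataFrame({
    "Col_1": ["Office"],
    "Links": [["a.html", "b.html"]],
})

out = df.explode("Links")  # one row per href; other columns are repeated
print(out)
```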

Beautifulsoup Match Empty Class

I am scraping a table on a website where I am only trying to return the rows where the class is blank (Rows 1 and 4):
<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>
(Note there is a trailing space at the end of the is-oos class.)
When I do soup.findAll('tr', class_=None) it matches all the rows. This is because Row 2 has the class ['is-oos', ''] due to the trailing space. Is there a simple way to do a soup.findAll() or soup.select() to match these rows?
Try class_="":
from bs4 import BeautifulSoup
html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""
soup = BeautifulSoup(html_doc, "html.parser")
print(*soup.find_all('tr', class_=""))
# Or to only get the text
print( '\n'.join(t.text for t in soup.find_all('tr', class_="")) )
Outputs:
<tr class="">Row 1</tr> <tr class="">Row 4</tr>
Row 1
Row 4
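If you'd rather not rely on the exact class-matching semantics, a plain filter over the parsed class list works too. A sketch against the same snippet (tr.get("class") may be None, an empty list, or a list containing empty strings):

```python
from bs4 import BeautifulSoup

html_doc = """<tr class>Row 1</tr>
<tr class="is-oos ">Row 2</tr>
<tr class="somethingelse">Row 3</tr>
<tr class>Row 4</tr>"""

soup = BeautifulSoup(html_doc, "html.parser")
# Keep rows whose class list contains no non-empty class name
blank = [tr for tr in soup.find_all("tr") if not any(tr.get("class") or [])]
print([tr.text for tr in blank])  # ['Row 1', 'Row 4']
```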
Edit: To only get what's in stock, we can check the attributes of the tag:
import requests
from bs4 import BeautifulSoup
URL = "https://gun.deals/search/apachesolr_search/736676037018"
soup = BeautifulSoup(requests.get(URL).text, "html.parser")
for tag in soup.find_all('tr'):
    if tag.attrs.get('class') == ['price-compare-table__oos-breaker', 'js-oos-breaker']:
        break
    print(tag.text.strip())

Find data within HTML tags using Python

I have the following HTML code I am trying to scrape from a website:
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
What I am trying to accomplish is to search the page for the text "Net Taxes Due" within a <td> tag, find the siblings of that tag, and send the results into a Pandas data frame.
I have the following code:
soup = BeautifulSoup(url, "html.parser")
table = soup.select('#Net Taxes Due')
cells = table.find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
df = pd.DataFrame(np.array(cells))
print(df)
I've been all over the web looking for a solution and can't come up with something. Appreciate any help.
Thanks!
In the following I expected to use sibling indices 1 and 2, but 2 and 3 work when using lxml.html and XPath: the unclosed <td> after "Net Taxes Due" in the snippet is parsed as an extra empty sibling, which shifts the indices.
import requests
from lxml.html import fromstring
# url = ''
# tree = html.fromstring( requests.get(url).content)
h = '''
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
'''
tree = fromstring(h)
links = [link.text for link in tree.xpath('//td[text() = "Net Taxes Due"]/following-sibling::td[2] | //td[text() = "Net Taxes Due"]/following-sibling::td[3]' )]
print(links)
Make sure to add the tag name along with your search string. This is how you can do that:
from bs4 import BeautifulSoup
htmldoc = """
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
"""
soup = BeautifulSoup(htmldoc, "html.parser")
item = soup.find('td',text='Net Taxes Due').find_next_sibling("td")
print(item)
Your .select() call is not correct. # in a selector is used to match an element's ID, not its text contents, so #Net means to look for an element with id="Net". Spaces in a selector mean to look for descendants that match each successive selector. So #Net Taxes Due searches for something like:
<div id="Net">
<taxes>
<due>...</due>
</taxes>
</div>
To search for an element containing a specific string, use .find() with the string keyword:
table = soup.find(string="Net Taxes Due")
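Note that find(string=...) returns the matching NavigableString, not the td tag, so you still need .parent before looking at siblings. A sketch on the (corrected) snippet:

```python
from bs4 import BeautifulSoup

html = """<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>"""

soup = BeautifulSoup(html, "html.parser")
label = soup.find(string="Net Taxes Due")      # a NavigableString, not a Tag
cells = label.parent.find_next_siblings("td")  # .parent is the enclosing <td>
print([td.get_text(strip=True) for td in cells])
```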
Assuming that there's an actual HTML table involved:
<html>
<table>
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
</table>
</html>
soup = BeautifulSoup(html, "html.parser")  # html is the page source, not the URL string
table = soup.find('tr')
df = [x.text for x in table.findAll('td', {'class':'value-column'})]
Both of these should work. If you are using bs4 4.7.0+, you can use select; if you are on an older version, or just prefer the find interface, you can use that instead. As stated earlier, you cannot reference text content with #; that syntax matches an ID.
import bs4
markup = """
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
"""
# Version 4.7.0
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.select('td:contains("Net Taxes Due") ~ td.value-column')
cells = [ele.text.strip() for ele in cells]
print(cells)
# Version < 4.7.0 or if you prefer find
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.find('td', text="Net Taxes Due").find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
print(cells)
You would get this
['$2,370.00', '$2,408.00']
['$2,370.00', '$2,408.00']

Extracting and Printing Table Headers and Data with Beautiful Soup with Python 2.7

So I'm trying to scrape data from the table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0 and I don't know how to format it properly.
I have the code below written to get the <th> and <td> information from the website, but I'm at a loss as to how to format it so that it has the same appearance as the table on the website when I print it or save it as a .txt/.csv file. I've looked around here and on a bunch of other websites for an answer, but I'm not sure how to go forward. I'm very much a beginner, so any help would be appreciated.
My code just prints a long list of either the table rows or table data:
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for tr in rows:
    tds = tr.find_all('td')
    print tds
The HTML that I'm looking at is below as well:
<table border=0 cellpadding=3 cellspacing=0 width=640 align="center">
<thead style="display: table-header-group;">
<tr height=18 align="center">
<th height=35 align="left" colspan="2">County</th>
<th height="35" align="right">
2005
</th>
That part shows the years as headers (continuing through 2015); the state and county data is further down:
<tr height="40" >
<th class="LeftAligned" colspan="2">Michigan</th>
<td>
127,518
</td>
and so on for the rest of the counties.
Again, any help is greatly appreciated.
You need to store your table in a list:
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
table_contents = [] # store your table here
for tr in rows:
    if rows.index(tr) == 0 :
        row_cells = [ th.getText().strip() for th in tr.find_all('th') if th.getText().strip() != '' ]
    else :
        row_cells = ([ tr.find('th').getText() ] if tr.find('th') else [] ) + [ td.getText().strip() for td in tr.find_all('td') if td.getText().strip() != '' ]
    if len(row_cells) > 1 :
        table_contents += [ row_cells ]
Now table_contents has the same structure and data as the table on the page.
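To save that structure as a .csv (which the question also asks about), the csv module can write the nested list directly. A Python 3 sketch with stand-in data (the real table_contents comes from the scrape above; the values and filename here are made up):

```python
import csv

# Stand-in for the table_contents built above
table_contents = [
    ["County", "2005", "2006"],
    ["Michigan", "127,518", "127,483"],
]

with open("births.csv", "w", newline="") as f:
    csv.writer(f).writerows(table_contents)  # commas in values are quoted automatically
```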

Extracting a row from a table from a url

I want to download EPS value for all years (Under Annual Trends) from the below link.
http://www.bseindia.com/stock-share-price/stockreach_financials.aspx?scripcode=500180&expandable=0
I tried using Beautiful Soup as mentioned in the below answer.
Extracting table contents from html with python and BeautifulSoup
But I couldn't proceed beyond the code below. I feel I am very close to the answer. Any help will be greatly appreciated.
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.bseindia.com/stock-share-price/stockreach_financials.aspx?scripcode=500180&expandable=0").read()
soup=BeautifulSoup(html)
table = soup.find('table',{'id' :'acr'})
# the line below wasn't working as I expected:
tr = table.find('tr', text='EPS')
I am open to using any other language to get this done
The text is in the td, not the tr, so find the td using the text and then call .parent to get the tr:
In [12]: table = soup.find('table',{'id' :'acr'})
In [13]: tr = table.find('td', text='EPS').parent
In [14]: print(tr)
<tr><td class="TTRow_left" style="padding-left: 30px;">EPS</td><td class="TTRow_right">48.80</td>
<td class="TTRow_right">42.10</td>
<td class="TTRow_right">35.50</td>
<td class="TTRow_right">28.50</td>
<td class="TTRow_right">22.10</td>
</tr>
In [15]: [td.text for td in tr.select("td + td")]
Out[15]: [u'48.80', u'42.10', u'35.50', u'28.50', u'22.10']
Which you will see exactly matches what is on the page.
Another approach would be to call find_next_siblings:
In [17]: tds = table.find('td', text='EPS').find_next_siblings("td")
In [18]: tds
Out[18]:
[<td class="TTRow_right">48.80</td>,
<td class="TTRow_right">42.10</td>,
<td class="TTRow_right">35.50</td>,
<td class="TTRow_right">28.50</td>,
<td class="TTRow_right">22.10</td>]
In [20]: [td.text for td in tds]
Out[20]: [u'48.80', u'42.10', u'35.50', u'28.50', u'22.10']
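To work with those values numerically, the same sibling lookup can feed float(). A self-contained sketch on a minimal version of the table (using string=, the current name of the text= argument):

```python
from bs4 import BeautifulSoup

# Minimal stand-in for the "acr" table on the page
html = """<table id="acr"><tr>
<td class="TTRow_left">EPS</td>
<td class="TTRow_right">48.80</td>
<td class="TTRow_right">42.10</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")
tds = soup.find("td", string="EPS").find_next_siblings("td")
eps = [float(td.text) for td in tds]
print(eps)  # [48.8, 42.1]
```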
