Extracting a row from a table from a url

Extracting a row from a table from a url - python

I want to download EPS value for all years (Under Annual Trends) from the below link.
http://www.bseindia.com/stock-share-price/stockreach_financials.aspx?scripcode=500180&expandable=0
I tried using Beautiful Soup as mentioned in the below answer.
Extracting table contents from html with python and BeautifulSoup
But couldn't proceed after the below code. I feel I am very close to my answer. Any help will be greatly appreciated.
from bs4 import BeautifulSoup
import urllib2
html = urllib2.urlopen("http://www.bseindia.com/stock-share-price/stockreach_financials.aspx?scripcode=500180&expandable=0").read()
soup=BeautifulSoup(html)
table = soup.find('table',{'id' :'acr'})
#the below code wasn't working as I expected it to be
tr = table.find('tr', text='EPS')
I am open to using any other language to get this done

The text is in the td not the tr so get the td using the text and then call .parent to get the tr:
In [12]: table = soup.find('table',{'id' :'acr'})
In [13]: tr = table.find('td', text='EPS').parent
In [14]: print(tr)
<tr><td class="TTRow_left" style="padding-left: 30px;">EPS</td><td class="TTRow_right">48.80</td>
<td class="TTRow_right">42.10</td>
<td class="TTRow_right">35.50</td>
<td class="TTRow_right">28.50</td>
<td class="TTRow_right">22.10</td>
</tr>
In [15]: [td.text for td in tr.select("td + td")]
Out[15]: [u'48.80', u'42.10', u'35.50', u'28.50', u'22.10']
Which you will see exactly matches what is on the page.
Another approach would be to call find_next_siblings:
In [17]: tds = table.find('td', text='EPS').find_next_siblings("td")
In [18]: tds
Out[19]:
[<td class="TTRow_right">48.80</td>,
<td class="TTRow_right">42.10</td>,
<td class="TTRow_right">35.50</td>,
<td class="TTRow_right">28.50</td>,
<td class="TTRow_right">22.10</td>]
In [20]: [td.text for td in tds]
Out[20]: [u'48.80', u'42.10', u'35.50', u'28.50', u'22.10']

Related

Find data within HTML tags using Python

I have the following HTML code I am trying to scrape from a website:
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
What I am trying to accomplish is to search the page to find the text "Net Taxes Due" within the tag, find the siblings of the tag, and send the results into a Pandas data frame.
I have the following code:
soup = BeautifulSoup(url, "html.parser")
table = soup.select('#Net Taxes Due')
cells = table.find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
df = pd.DataFrame(np.array(cells))
print(df)
I've been all over the web looking for a solution and can't come up with something. Appreciate any help.
Thanks!

In the following I expected to use indices 1 and 2 but 2 and 3 seems to work when using lxml.html and xpath
import requests
from lxml.html import fromstring
# url = ''
# tree = html.fromstring( requests.get(url).content)
h = '''
<td>Net Taxes Due<td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
'''
tree = fromstring(h)
links = [link.text for link in tree.xpath('//td[text() = "Net Taxes Due"]/following-sibling::td[2] | //td[text() = "Net Taxes Due"]/following-sibling::td[3]' )]
print(links)

Make sure to add the tag name along with your search string. This is how you can do that:
from bs4 import BeautifulSoup
htmldoc = """
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
"""
soup = BeautifulSoup(htmldoc, "html.parser")
item = soup.find('td',text='Net Taxes Due').find_next_sibling("td")
print(item)

Your .select() call is not correct. # in a selector is used to match an element's ID, not its text contents, so #Net means to look for an element with id="Net". Spaces in a selector mean to look for descendants that match each successive selector. So #Net Taxes Due searches for something like:
<div id="Net">
<taxes>
<due>...</due>
</taxes>
</div>
To search for an element containing a specific string, use .find() with the string keyword:
table = soup.find(string="Net Taxes Due")

Assuming that there's an actual HTML table involved:
<html>
<table>
<tr>
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
</tr>
</table>
</html>
soup = BeautifulSoup(url, "html.parser")
table = soup.find('tr')
df = [x.text for x in table.findAll('td', {'class':'value-column'})]

These should work. If you are using bs4 4.7.0, you "could" use select. But if you are on an older version, or just prefer the find interface, you can use that. Basically as stated earlier, you cannot reference content with #, that is an ID.
import bs4
markup = """
<td>Net Taxes Due</td>
<td class="value-column">$2,370.00</td>
<td class="value-column">$2,408.00</td>
"""
# Version 4.7.0
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.select('td:contains("Net Taxes Due") ~ td.value-column')
cells = [ele.text.strip() for ele in cells]
print(cells)
# Version < 4.7.0 or if you prefer find
soup = bs4.BeautifulSoup(markup, "html.parser")
cells = soup.find('td', text="Net Taxes Due").find_next_siblings('td')
cells = [ele.text.strip() for ele in cells]
print(cells)
You would get this
['$2,370.00', '$2,408.00']
['$2,370.00', '$2,408.00']

Parse the DOM like Javascript using BeautifulSoup

I have a sample HTML in a variable html_doc like this :
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
Using Javascript its pretty straightforward if I want to parse the DOM. But if I want to grab ONLY the URL (https://test.com) and Time (01/01/1970, 00:00:00) in 2 different variables from the <td> tag above, how can I do it if there is no class name associated with it.
My test.py file
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
print(test)

You already got all td elements. You can iterate through all of them:
for td in soup.find_all('td'):
if td.text.startswith('http'):
print(td, td.text)
# <td>https://test.com</td> https://test.com
If you want, you can be a bit less explicit by searching for the td element with "highlight" class and find the next sibling, but this is more error prone in case the DOM will change:
for td in soup.find_all('td', {'class': 'highlight'}):
print(td.find_next_sibling())
# <td>https://test.com</td>

You can try using regular expression to get the url
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html_doc,'html.parser')
test = soup.find_all("td")
for tag in test:
urls = re.match('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', tag.text)
time = re.match('[0-9/:, ]+',tag.text)
if urls!= None:
print(urls.group(0))
if time!= None:
print(time.group(0))
Output
01/01/1970, 00:00:00
https://test.com

This is a very specific solution. If you need a general approach, Hari Krishnan's solution with a few tweaks might be more suitable.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'html.parser')
tds = []
for td in soup.find_all('td', {'class': ['highlight', 'light']}):
tds.append(td.find_next_sibling().string)
time, link = tds

With the reference of #DeepSpace
import bs4, re
from bs4 import BeautifulSoup
html_doc = """<table class="sample">
<tbody>
<tr class="title"><td colspan="2">Info</td></tr>
<tr>
<td class="light">Time</td>
<td>01/01/1970, 00:00:00</td>
</tr>
<td class="highlight">URL</td>
<td>https://test.com</td>
</tr>
</tbody>
</table>"""
datepattern = re.compile("\d{2}/\d{2}/\d{4}, \d{2}:\d{2}:\d{2}")
soup = BeautifulSoup(html_doc,'html.parser')
for td in soup.find_all('td'):
if td.text.startswith('http'):
link = td.text
elif datepattern.search(td.text):
time = td.text
print(link, time)

Extracting and Printing Table Headers and Data with Beautiful Soup with Python 2.7

So I'm trying to scrape data from the table on the Michigan Department of Health and Human Services website using BeautifulSoup 4.0 and I don't know how to format it properly.
I have the code below written to get the and information from the website but I'm at a loss as how to format it so that it has the same appearance as the table on the website when I print it or save it as a .txt/ .csv file. I've looked around here and on a bunch of other websites for an answer but I'm not sure how to go forward with this. I'm very much a beginner so any help would be appreciated.
My code just prints a long list of either the table rows or table data:
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
for tr in rows:
tds = tr.find_all('td')
print tds
The HTML that I'm looking at is below as well:
<table border=0 cellpadding=3 cellspacing=0 width=640 align="center">
<thead style="display: table-header-group;">
<tr height=18 align="center">
<th height=35 align="left" colspan="2">County</th>
<th height="35" align="right">
2005
</th>
that part shows the years as headers and goes until 2015 and then the state and county data is further down:
<tr height="40" >
<th class="LeftAligned" colspan="2">Michigan</th>
<td>
127,518
</td>
and so on for the rest of the counties.
Again, any help is greatly appreciated.

You need to store your table in a list
import urllib2
import bs4
from bs4 import BeautifulSoup
url = "https://www.mdch.state.mi.us/osr/natality/BirthsTrends.asp"
page = urllib2.urlopen(url)
soup = BeautifulSoup((page), "html.parser")
table = soup.find("table")
rows = table.find_all("tr")
table_contents = [] # store your table here
for tr in rows:
if rows.index(tr) == 0 :
row_cells = [ th.getText().strip() for th in tr.find_all('th') if th.getText().strip() != '' ]
else :
row_cells = ([ tr.find('th').getText() ] if tr.find('th') else [] ) + [ td.getText().strip() for td in tr.find_all('td') if td.getText().strip() != '' ]
if len(row_cells) > 1 :
table_contents += [ row_cells ]
Now table_contents has the same structure and data as the table on the page.

Extract data from an HTML tag which has specific attributes

<tr bgcolor="#FFFFFF">
<td class="tablecontent" scope="row" rowspan="1">
ZURICH AMERICAN INSURANCE COMPANY
</td>
<td class="tablecontent" scope="row" rowspan="1">
FARMERS GROUP INC (14523)
</td>
<td class="tablecontent" scope="row">
znaf
</td>
<td class="tablecontent" scope="row">
anhm
</td>
</tr>
I have an HTML document which contains multiple tr tags. I want to extract the href link from the first td and data from third td tag onwards under every tr tag. How can this be achieved?

You can find all tr elements, iterate over them, then do the context-specific searches for the inner td elements and get the first and the third:
for tr in soup.find_all('tr'):
cells = tr.find_all('td')
if len(cells) < 3:
continue # safety pillow
link = cells[0].a['href'] # assuming every first td has an "a" element
data = cells[2].get_text()
print(link, data)
As a side note and depending what you are trying to accomplish in the HTML parsing, I usually find pandas.read_html() a great and convenient way to parse HTML tables into dataframes and work with the dataframes after, which are quite convenient data structures to work with.

You can use css selector nth-of-type to navigate through the tds
Here's a sample"
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('td:nth-of-type(1) a')[0]
href = a['href']
td = soup.select("td:nth-of-type(3)")[0]
text = td.get_text(strip=True)

Deleting all content between brackets from a string using python

I am using beautiful soup to grab data from an html page, and when I grab the data, I am left with this:
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<img alt="" src="http://i.forbesimg.com/media/lists/colleges/williams-college_50x50.jpg">
<h3>Williams College</h3></img></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
This is the beautifulsoup command I am using to get this:
html = open('collegelist.html')
test = BeautifulSoup(html)
soup = test.find_all('tr')
I now want to manipulate this text so that it outputs
1
Williams College
Massachusetts
$62,850
2,214
and I having difficulty doing so for the entire document, where I have about 700 of these entries. Any advice would be appreciated.

Just get the .text (or use get_text()) for every tr in the loop:
soup = BeautifulSoup(open('collegelist.html'))
for tr in soup.find_all('tr'):
print tr.text # or tr.get_text()
For the HTML you've provided it prints:
1
Williams College
Massachusetts
$61,850
2,124

use get_text()
soup = BeautifulSoup(html)
"".join([x.get_text() for x in soup.find_all('tr')])

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting a row from a table from a url - python

Related

Find data within HTML tags using Python

Parse the DOM like Javascript using BeautifulSoup

Extracting and Printing Table Headers and Data with Beautiful Soup with Python 2.7

Extract data from an HTML tag which has specific attributes

Deleting all content between brackets from a string using python

Categories

Resources