web scraping issue, trying to acquire info into csv and then charts - python

Here's what's up with my code: it gives me fairly complete information. I am scraping the stock prices of my top 10 favorite space tech companies. I want to get the stock prices over the course of 10 hours, or I might just run the code ten different times. I can't use APIs; this is for a school project. I then want to combine all the data into one big chart using matplotlib that would show these stock prices, or ten charts, one per stock. I want to use this type of chart.
Any advice would be awesome. Here is my current code:
#import libraries
import pandas as pd

#scraping my top ten favorite space companies; attempted to pick companies with pure-play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

def parsePrice(r):
    df = pd.read_html(r)[0].T
    cols = list(df.iloc[0,:])
    temp_df = pd.DataFrame([list(df.iloc[1,:])], columns=cols)
    temp_df['url'] = r
    return temp_df

df = pd.DataFrame()
for r in urls:
    df = df.append(parsePrice(r), sort=True).reset_index(drop=True)

df.to_csv('C:/Users/n_gor/Desktop/webscape/Nicholas Final Projects/spacestocklisting.csv', index=False)
print(df.to_string())
CSV File output:
52 Week Range Ask Avg. Volume Bid Day's Range Open Previous Close Volume url
0 7.32 - 9.87 8.09 x 800 23415 8.06 x 800 8.01 - 8.11 8.10 8.01 6337 https://finance.yahoo.com/quote/GILT/
1 32.14 - 42.77 32.74 x 1100 41759 32.59 x 1000 32.28 - 32.75 32.32 32.28 14685 https://finance.yahoo.com/quote/LORL?p=LORL&.t...
2 5.55 - 27.29 6.64 x 800 5746553 6.63 x 2900 6.51 - 6.68 6.64 6.65 995245 https://finance.yahoo.com/quote/I?p=I&.tsrc=fi...
3 55.93 - 97.31 72.21 x 800 281600 72.16 x 1000 71.51 - 72.80 72.26 72.32 74758 https://finance.yahoo.com/quote/VSAT?p=VSAT&.t...
4 144.27 - 220.03 215.54 x 1000 1560562 215.37 x 800 214.87 - 217.45 215.85 214.86 203957 https://finance.yahoo.com/quote/RTN?p=RTN&.tsr...
5 100.48 - 149.81 145.03 x 800 2749725 144.96 x 800 144.41 - 145.56 145.49 144.52 489169 https://finance.yahoo.com/quote/UTX?ltr=1
6 189.35 - 351.53 343.34 x 800 280325 342.80 x 800 342.84 - 346.29 344.16 343.58 42326 https://finance.yahoo.com/quote/TDY?ltr=1
7 3.5800 - 9.7900 4.1400 x 1300 778343 4.1300 x 800 4.1200 - 4.2000 4.1700 4.1500 62335 https://finance.yahoo.com/quote/ORBC?ltr=1
8 6.90 - 12.09 7.37 x 900 2280333 7.38 x 800 7.24 - 7.48 7.30 7.22 539082 https://finance.yahoo.com/quote/SPCE?p=SPCE&.t...
9 292.47 - 446.01 348.73 x 800 4420225 348.79 x 800 345.70 - 350.42 350.22 348.84 1258813 https://finance.yahoo.com/quote/BA?p=BA&.tsrc=...
Can I add the stock names to this? Any advice on how to complete this project? I'm a bit lost.

Just need to parse the title header:
#import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

#scraping my top ten favorite space companies; attempted to pick companies with pure-play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

def parsePrice(r):
    response = requests.get(r)
    soup = BeautifulSoup(response.text, 'html.parser')
    titleHeader = soup.find('div', {'id':'quote-header-info'})
    title = titleHeader.find('h1').text
    comp = title.split('-')[-1].strip()
    abr = title.split('-')[0].strip()
    print(title)
    df = pd.read_html(response.text)[0].T
    cols = list(df.iloc[0,:])
    temp_df = pd.DataFrame([list(df.iloc[1,:])], columns=cols)
    temp_df['url'] = r
    temp_df['company name'] = comp
    temp_df['stock name'] = abr
    return temp_df

df = pd.DataFrame()
for r in urls:
    df = df.append(parsePrice(r), sort=True).reset_index(drop=True)

df.to_csv('C:/Users/n_gor/Desktop/webscape/Nicholas Final Projects/spacestocklisting.csv', index=False)
print(df.to_string())
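Since the end goal is one chart built from repeated runs, here is a minimal sketch of that last step (my own additions, not part of the answer above: a Timestamp column stamped at scrape time, appending each run to one CSV, and a hypothetical csv_path):
# Minimal sketch: stamp each run, append to one CSV, then draw one line per stock.
import os
import pandas as pd
import matplotlib.pyplot as plt

csv_path = 'spacestocklisting.csv'   # hypothetical path for this example

# --- at the end of each scraping run ---
df['Timestamp'] = pd.Timestamp.now()   # when this run happened
df.to_csv(csv_path, mode='a', header=not os.path.exists(csv_path), index=False)

# --- after the last run, chart all accumulated rows ---
hist = pd.read_csv(csv_path, parse_dates=['Timestamp'])
hist['Previous Close'] = pd.to_numeric(hist['Previous Close'], errors='coerce')
for ticker, grp in hist.groupby('stock name'):
    plt.plot(grp['Timestamp'], grp['Previous Close'], label=ticker)
plt.legend()
plt.xlabel('Time')
plt.ylabel('Previous close (USD)')
plt.show()
Run the script once an hour for ten hours and the CSV accumulates ten snapshots per ticker, enough for a simple line chart; swap the loop for one figure per stock if you'd rather have ten separate charts.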

You can use pandas.DataFrame.insert.
If you have all the stock names in a list:
stock_names = ['GILT', 'LORL', 'I', 'VSAT', 'RTN', 'UTX', 'TDY', 'ORBC', 'SPCE', 'BA']
# insert at the beginning (column index 0) of the DataFrame
df.insert(0, "column_heading", stock_names)
Or you can get all the stock names from the URLs using a regular expression and add them to your df:
import re
stock_names = [re.findall('[A-Z]+', x)[0] for x in urls]
# insert at the beginning (column index 0) of the DataFrame
df.insert(0, "column_heading", stock_names)
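If you'd rather not rely on the ticker being the first run of uppercase letters in each URL, here is an alternative sketch (my own variant, not from the answer above) that takes the last path segment of each quote URL instead:
# the ticker is the last path segment of each Yahoo quote URL
from urllib.parse import urlparse
stock_names = [urlparse(u).path.rstrip('/').split('/')[-1] for u in urls]
df.insert(0, "column_heading", stock_names)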

Related

How to remove whitespace/tab from an entry when scraping a web table? (python)

I've cobbled together the following code that scrapes a website table using Beautiful Soup.
The script is working as intended except for the first two entries.
Q1: The first entry consists of two empty brackets... how do I omit them?
Q2: The second entry has a hidden tab creating whitespace in the second element that I can't get rid of. How do I remove it?
Code:
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
testlink = "https://www.crutchfield.com/p_13692194/JL-Audio-12TW3-D8.html?tp=64077"
r = requests.get(testlink, headers=headers)
soup = BeautifulSoup(r.content, 'lxml')
table = soup.find('table', class_='table table-striped')
df = pd.DataFrame(columns=['col1', 'col2'])
rows = []
for i, row in enumerate(table.find_all('tr')):
    rows.append([el.text.strip() for el in row.find_all('td')])
for row in rows:
    print(row)
Results:
[]
['Size', '12 -inch']
['Impedance (Ohms)', '4, 16']
['Cone Material', 'Mica-Filled IMPP']
['Surround Material', 'Rubber']
['Ideal Sealed Box Volume (cubic feet)', '1']
['Ideal Ported Box Volume (cubic feet)', '1.3']
['Port diameter (inches)', 'N/A']
['Port length (inches)', 'N/A']
['Free-Air', 'No']
['Dual Voice Coil', 'Yes']
['Sensitivity', '84.23 dB at 1 watt']
['Frequency Response', '24 - 200 Hz']
['Max RMS Power Handling', '400']
['Peak Power Handling (Watts)', '800']
['Top Mount Depth (inches)', '3 1/2']
['Bottom Mount Depth (inches)', 'N/A']
['Cutout Diameter or Length (inches)', '11 5/8']
['Vas (liters)', '34.12']
['Fs (Hz)', '32.66']
['Qts', '0.668']
['Xmax (millimeters)', '15.2']
['Parts Warranty', '1 Year']
['Labor Warranty', '1 Year']
Let's simplify, shall we?
import pandas as pd
df = pd.read_html('https://www.crutchfield.com/S-f7IbEJ40aHd/p_13692194/JL-Audio-12TW3-D8.html?tp=64077')[0]
df.columns = ['Property', 'Value', 'Not Needed']
print(df[['Property', 'Value']])
Result in terminal:
Property Value
0 Size 12 -inch
1 Impedance (Ohms) 4, 16
2 Cone Material Mica-Filled IMPP
3 Surround Material Rubber
4 Ideal Sealed Box Volume (cubic feet) 1
5 Ideal Ported Box Volume (cubic feet) 1.3
6 Port diameter (inches) NaN
7 Port length (inches) NaN
8 Free-Air No
9 Dual Voice Coil Yes
10 Sensitivity 84.23 dB at 1 watt
11 Frequency Response 24 - 200 Hz
12 Max RMS Power Handling 400
13 Peak Power Handling (Watts) 800
14 Top Mount Depth (inches) 3 1/2
15 Bottom Mount Depth (inches) NaN
16 Cutout Diameter or Length (inches) 11 5/8
17 Vas (liters) 34.12
18 Fs (Hz) 32.66
19 Qts 0.668
20 Xmax (millimeters) 15.2
21 Parts Warranty 1 Year
22 Labor Warranty 1 Year
The pandas read_html documentation has the details.
You can clean the results like this if you want.
rows = []
for i, row in enumerate(table.find_all('tr')):
    cells = [
        el.text.strip().replace("\t", "")  ## remove tabs
        for el
        in row.find_all('td')
    ]
    ## don't add a row with no tds
    if cells:
        rows.append(cells)
I think you can further simplify this with a walrus operator (:=):
rows = [
    [cell.text.strip().replace("\t", "") for cell in cells]
    for row in table.find_all('tr')
    if (cells := row.find_all('td'))
]
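Either way, the cleaned rows drop straight into the two-column frame the question already set up:
# build the DataFrame from the cleaned rows; the column names
# come from the question's empty frame
df = pd.DataFrame(rows, columns=['col1', 'col2'])
print(df.head())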
Let pandas do it all
No need for anything else; pandas can read tables inside HTML:
url = 'https://www.crutchfield.com/p_13692194/JL-Audio-12TW3-D8.html?tp=64077'
df = pd.read_html(url, attrs={'class':'table table-striped'})[0]
df.columns = ['Features', 'Specs', 'Blank']
df.drop('Blank', axis=1, inplace=True)  # get rid of the hidden column
That's it.
The results look clean to me, with no stray spaces. If you still feel there are spaces left in some column:
df['Features'] = df['Features'].apply(lambda x: x.strip())  # not needed here
If you need to pass headers in the request, you can pass the requests response to pd.read_html (PS: it works without headers for the given URL):
df = pd.read_html(requests.get(url, headers=headers).content,
                  attrs={'class':'table table-striped'})[0]

Converting ordinary data into time series dataframes with pandas in Python

I have a small problem concerning conversion of data to time series. Here are the steps that I carried out.
I have the output data as follows :
Beautiful Soup is a library that makes it easy to scrape information from web pages. It sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree.
import json
import re
import requests
from bs4 import BeautifulSoup

url1 = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?&callback=xxx&t=BBRI'
url2 = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?&callback=xxx&t=BBRI'

soup1 = BeautifulSoup(json.loads(re.findall(r'xxx\((.*)\)', requests.get(url1).text)[0])['componentData'], 'lxml')
soup2 = BeautifulSoup(json.loads(re.findall(r'xxx\((.*)\)', requests.get(url2).text)[0])['componentData'], 'lxml')

def print_table(soup):
    for i, tr in enumerate(soup.select('tr')):
        row_data = [td.text for td in tr.select('th, td') if td.text]
        if not row_data:
            continue
        if len(row_data) < 12:
            row_data = ['X'] + row_data
        for j, td in enumerate(row_data):
            if j==0:
                print('{: >30}'.format(td))
            else:
                print('{: ^12}'.format(td))
        print()

print_table(soup1)
produces this output:
X
2010-12
2011-12
2012-12
2013-12
2014-12
2015-12
2016-12
2017-12
2018-12
2019-12
TTM
Revenue IDR Mil
30,552,600
40,203,051
43,104,711
51,133,344
59,556,636
69,813,152
82,504,537
90,844,308
99,067,098
108,468,320
105,847,159
I need to convert it to a pandas dataframe like this:
data
X Revenue IDR Mil
2010-12 30,552,600
2011-12 40,203,051
2012-12 43,104,711
2013-12 51,133,344
2014-12 59,556,636
2015-12 69,813,152
2016-12 82,504,537
2017-12 90,844,308
2018-12 99,067,098
2019-12 108,468,320
2020-12 105,847,159
This is a bit simplified from what you are doing, but I think it gets you where you need. It's mostly borrowed from Bitto Bennichan:
import json
import requests
import pandas as pd

url1 = 'http://financials.morningstar.com/finan/financials/getFinancePart.html?t=BBRI'
url2 = 'http://financials.morningstar.com/finan/financials/getKeyStatPart.html?t=BBRI'

lm_json = requests.get(url1).json()
df_list = pd.read_html(lm_json["componentData"])
df_list[0].transpose()
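To finish the reshape into the question's target layout, here is a sketch under my own assumption that read_html leaves the metric labels (e.g. 'Revenue IDR Mil', as in the printed output) in the first column:
# index by metric name, transpose so periods become rows,
# then keep just the revenue column
raw = df_list[0]
df = raw.set_index(raw.columns[0]).transpose()
df.index.name = 'X'
print(df[['Revenue IDR Mil']])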

web scraping error, it says list index out of range, what's up

I'm trying to make an app that scrapes my top ten favorite space-related stock prices, but I have some trouble with my code and I'm new to scraping. Once I get this to work, I want to put it into a CSV file and make a bar graph with it. I would love some help and suggestions. Also, I'm doing this in Anaconda:
#import libraries
import bs4
from bs4 import BeautifulSoup
#grequests is a unique library that allows you to use many urls with ease
#must install grequests in anaconda using: conda install -c conda-forge grequests
#if you know a better way to do this, please let me know
import grequests

#scraping my top ten favorite space companies; attempted to pick companies with pure-play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

unsent_request = (grequests.get(url) for url in urls)
results = grequests.map(unsent_request)

def parsePrice(r):
    soup = bs4.BeautifulSoup(r.text,"html")
    price=soup.find_all('div',{'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
    return price

for r in results:
    parsePrice(r)
So what in this code is causing this error:
IndexError Traceback (most recent call last)
<ipython-input-6-9ac8cb94b6fb> in <module>
5
6 for r in results:
----> 7 parsePrice(r)
<ipython-input-6-9ac8cb94b6fb> in parsePrice(r)
1 def parsePrice(r):
2 soup = bs4.BeautifulSoup(r.text,"html")
----> 3 price=soup.find_all('div',{'class':'Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(ib)" data-reactid="52">4.1500'})[0].find('span').text
4 return price
5
IndexError: list index out of range
What's up?
The data on the page is within <table> tags. Use pandas' .read_html(), as it uses BeautifulSoup under the hood; that way you can grab more than just the price.
That data is also available through an API/XHR, but I won't get into that, as it's slightly more complex.
import pandas as pd

#scraping my top ten favorite space companies; attempted to pick companies with pure-play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

def parsePrice(r):
    df = pd.read_html(r)[0].T
    cols = list(df.iloc[0,:])
    temp_df = pd.DataFrame([list(df.iloc[1,:])], columns=cols)
    temp_df['url'] = r
    return temp_df

df = pd.DataFrame()
for r in urls:
    df = df.append(parsePrice(r), sort=True).reset_index(drop=True)

df.to_csv('path/filename.csv', index=False)
Output:
print (df.to_string())
52 Week Range Ask Avg. Volume Bid Day's Range Open Previous Close Volume url
0 7.32 - 9.87 8.09 x 800 23415 8.06 x 800 8.01 - 8.11 8.10 8.01 6337 https://finance.yahoo.com/quote/GILT/
1 32.14 - 42.77 32.74 x 1100 41759 32.59 x 1000 32.28 - 32.75 32.32 32.28 14685 https://finance.yahoo.com/quote/LORL?p=LORL&.t...
2 5.55 - 27.29 6.64 x 800 5746553 6.63 x 2900 6.51 - 6.68 6.64 6.65 995245 https://finance.yahoo.com/quote/I?p=I&.tsrc=fi...
3 55.93 - 97.31 72.21 x 800 281600 72.16 x 1000 71.51 - 72.80 72.26 72.32 74758 https://finance.yahoo.com/quote/VSAT?p=VSAT&.t...
4 144.27 - 220.03 215.54 x 1000 1560562 215.37 x 800 214.87 - 217.45 215.85 214.86 203957 https://finance.yahoo.com/quote/RTN?p=RTN&.tsr...
5 100.48 - 149.81 145.03 x 800 2749725 144.96 x 800 144.41 - 145.56 145.49 144.52 489169 https://finance.yahoo.com/quote/UTX?ltr=1
6 189.35 - 351.53 343.34 x 800 280325 342.80 x 800 342.84 - 346.29 344.16 343.58 42326 https://finance.yahoo.com/quote/TDY?ltr=1
7 3.5800 - 9.7900 4.1400 x 1300 778343 4.1300 x 800 4.1200 - 4.2000 4.1700 4.1500 62335 https://finance.yahoo.com/quote/ORBC?ltr=1
8 6.90 - 12.09 7.37 x 900 2280333 7.38 x 800 7.24 - 7.48 7.30 7.22 539082 https://finance.yahoo.com/quote/SPCE?p=SPCE&.t...
9 292.47 - 446.01 348.73 x 800 4420225 348.79 x 800 345.70 - 350.42 350.22 348.84 1258813 https://finance.yahoo.com/quote/BA?p=BA&.tsrc=...
But IF you must go the route of BeautifulSoup, your find_all() is incorrect. First, the class is strictly the text between the double quotes after class=; you've included other attributes of the element, such as data-reactid, and the actual content/text that you are wanting to pull. Secondly, that class belongs to a <span> tag, not the div tag. If you pull the div tag, that's fine, but then you'd still need to go inside THAT element to get the text.
Give this a try:
import bs4
import requests

#scraping my top ten favorite space companies; attempted to pick companies with pure-play interest in space
urls = ['https://finance.yahoo.com/quote/GILT/', 'https://finance.yahoo.com/quote/LORL?p=LORL&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/I?p=I&.tsrc=fin-srch' , 'https://finance.yahoo.com/quote/VSAT?p=VSAT&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/RTN?p=RTN&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/UTX?ltr=1', 'https://finance.yahoo.com/quote/TDY?ltr=1', 'https://finance.yahoo.com/quote/ORBC?ltr=1', 'https://finance.yahoo.com/quote/SPCE?p=SPCE&.tsrc=fin-srch', 'https://finance.yahoo.com/quote/BA?p=BA&.tsrc=fin-srch',]

def parsePrice(r):
    resp = requests.get(r)
    soup = bs4.BeautifulSoup(resp.text,"html")
    price = soup.find_all('span',{'class':'Trsdu(0.3s) Trsdu(0.3s) Fw(b) Fz(36px) Mb(-4px) D(b)'})[0].text
    return price

for r in urls:
    print(parsePrice(r))
Output:
8.06
32.76
6.60
72.22
215.54
145.14
343.28
4.1550
7.43
348.32
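For the bar-graph part of the question, here is a minimal sketch using the df built by the read_html approach above (my own additions, not part of the answer: a ticker column pulled out of the url column, and Previous Close coerced to numeric):
import matplotlib.pyplot as plt

# one bar per ticker, using the last closing price as scraped
df['ticker'] = df['url'].str.extract(r'/quote/([A-Z]+)', expand=False)
df['Previous Close'] = pd.to_numeric(df['Previous Close'], errors='coerce')
df.plot.bar(x='ticker', y='Previous Close', legend=False)
plt.ylabel('Previous close (USD)')
plt.tight_layout()
plt.show()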

Summarizing data from last 12 months in dataframe by category using python

I am trying to create a summary of data by category for the last 12 months (excluding the current month). I have summarized the previous 3 months with the following code, but doing so for 12 months seems cumbersome. I am wondering if there is a more efficient and effective way of dynamically slicing data for the last 12 months. df1 is the complete data set, which I load from a DB connection using a SQL query. I use .drop() to slice out the unwanted columns of data and leave only the count.
import pandas as pd
import datetime
df1.Start_Date = pd.DatetimeIndex(df1.Start_Date)
today = datetime.date.today()
currentfirst = today.replace(day=1)
thirdMonth = currentfirst - pd.offsets.MonthBegin(3)
secondMonth = currentfirst - pd.offsets.MonthBegin(2)
firstMonth = currentfirst - pd.offsets.MonthBegin(1)
fst_label = firstMonth.strftime('%B')
snd_label = secondMonth.strftime('%B')
thd_label = thirdMonth.strftime('%B')
def monthly_vol(df, label, start_date, end_date):
    """Slices df1 into previous months and sums the volume of each change class."""
    if start_date is not None:
        df = df1[df1.Start_Date >= start_date]
    if end_date is not None:
        df = df[df.Start_Date < end_date]
    df_count = df.groupby('Change Class').count().drop(['Start_Date', 'Risk Level', 'Change Coordinator', 'Change Coordinator Group'], axis=1)
    return df_count

fst_month = monthly_vol(df1, fst_label, firstMonth, currentfirst)
snd_month = monthly_vol(df1, snd_label, secondMonth, firstMonth)
thd_month = monthly_vol(df1, thd_label, thirdMonth, secondMonth)

def month_merge(df1, df2, df3):
    """Merges monthly dataframes together."""
    new_df = pd.merge(df1, df2, left_index=True, right_index=True).merge(df3, left_index=True, right_index=True)
    new_df.columns = [fst_label, snd_label, thd_label]
    print(new_df)
    return new_df

monthly_vol = month_merge(fst_month, snd_month, thd_month)
This will give the output:
May April March
Change Class
Emergency 36 36 32
Expedited 17 24 35
Normal 182 146 134
Standard 256 210 267
Bonus question:
It would be nice to get the average of the total volume for each category in the same dataframe. Somewhat like this:
May MayAVG April AprilAVG March MarchAVG
Change Class
Emergency 36 7.33 36 8.65 32 6.84
Expedited 17 3.46 24 5.77 35 7.48
Normal 182 37.07 146 35.10 134 28.63
Standard 256 52.14 210 50.48 267 57.05
Any help would be much appreciated!
Why don't you try using a dictionary? A dictionary is a key-value pair of data.
For example: {"3": "March", "4": "April"}.
So wherever you are maintaining a pair, you can instead use a dictionary.
Populate those dictionaries inside a loop (a fuller sketch of that loop follows the snippet below).
See below.
month_dict = {"3": "March", "2": "April", "1": "May"}
thirdMonth = currentfirst - pd.offsets.MonthBegin(3)
secondMonth = currentfirst - pd.offsets.MonthBegin(2)
firstMonth = currentfirst - pd.offsets.MonthBegin(1)
label_dict = {}
fst_label = firstMonth.strftime('%B')
snd_label = secondMonth.strftime('%B')
thd_label = thirdMonth.strftime('%B')
vol_month = {}
fst_month = monthly_vol(df1, fst_label, firstMonth, currentfirst)
snd_month = monthly_vol(df1, snd_label, secondMonth, firstMonth)
thd_month = monthly_vol(df1, thd_label, thirdMonth, secondMonth)
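Concretely, here is a sketch of that loop, assuming df1 and the monthly_vol() function from the question (note that the question's last line rebinds the name monthly_vol to the merge result, which clobbers the function; keep the merged frame under a different name):
import datetime
import pandas as pd

today = datetime.date.today()
currentfirst = today.replace(day=1)

pieces, labels = [], []
for i in range(12, 0, -1):                 # oldest month first
    start = currentfirst - pd.offsets.MonthBegin(i)
    end = start + pd.offsets.MonthBegin(1)
    labels.append(start.strftime('%B'))
    pieces.append(monthly_vol(df1, labels[-1], start, end))

# the monthly slices share the 'Change Class' index, so they align column-wise
summary = pd.concat(pieces, axis=1)
summary.columns = labels
print(summary)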

Python code for fetching data from website

I am using the following code and getting [].
Please help me find my mistake.
from urllib import urlopen
optionsUrl = 'http://www.moneycontrol.com/commodity/'
optionsPage = urlopen(optionsUrl)
from bs4 import BeautifulSoup
soup = BeautifulSoup(optionsPage)
print soup.findAll(text='MCX')
This will grab that list of commodities for you (tested on Python 2.7). You need to isolate the commodity table, then work your way down the rows, reading each row and extracting the data from each column:
import urllib2
import bs4

# Page with commodities
URL = "http://www.moneycontrol.com/commodity/"

# Download the page data and create a BeautifulSoup object
commodityPage = urllib2.urlopen(URL)
commodityPageText = commodityPage.read()
commodityPageSoup = bs4.BeautifulSoup(commodityPageText)

# Extract the div with the commodities table and find all the table rows
commodityTable = commodityPageSoup.find_all("div", "equity com_ne")
commodittTableRows = commodityTable[0].find_all("tr")

# Trim off the table header row
commodittTableRows = commodittTableRows[1:]

# Iterate over the table rows and print out the commodity name and price
for commodity in commodittTableRows:
    # Separate all the table columns
    columns = commodity.find_all("td")

    # -------------- Get the values from each column
    # COLUMN 1: Name and date
    nameAndDate = columns[0].text
    nameAndDate = nameAndDate.split('-')
    name = nameAndDate[0].strip()
    date = nameAndDate[1].strip()

    # COLUMN 2: Price
    price = float(columns[1].text)

    # COLUMN 3: Change
    change = columns[2].text.replace(',', '')  # Remove commas from change value
    change = float(change)

    # COLUMN 4: Percentage change
    percentageChange = columns[3].text.replace('%', '')  # Remove the percentage symbol
    percentageChange = float(percentageChange)

    # Print out the data
    print "%s on %s was %.2f - a change of %.2f (%.2f%%)" % (name, date, price, change, percentageChange)
Which gives these results:
Gold on 5 Oct was 30068.00 - a change of 497.00 (1.68%)
Silver on 5 Dec was 50525.00 - a change of 1115.00 (2.26%)
Crudeoil on 19 Sep was 6924.00 - a change of 93.00 (1.36%)
Naturalgas on 25 Sep was 233.80 - a change of 0.30 (0.13%)
Aluminium on 30 Sep was 112.25 - a change of 0.55 (0.49%)
Copper on 29 Nov was 459.80 - a change of 3.40 (0.74%)
Nickel on 30 Sep was 882.20 - a change of 5.90 (0.67%)
Lead on 30 Sep was 131.80 - a change of 0.70 (0.53%)
Zinc on 30 Sep was 117.85 - a change of 0.75 (0.64%)
Menthaoil on 30 Sep was 871.90 - a change of 1.80 (0.21%)
Cotton on 31 Oct was 21350.00 - a change of 160.00 (0.76%)
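If you are on Python 3, here is a rough port of the same approach, assuming the page structure (the "equity com_ne" div wrapping the table) is unchanged:
import requests
import bs4

URL = "http://www.moneycontrol.com/commodity/"
soup = bs4.BeautifulSoup(requests.get(URL).text, "html.parser")

# same structure as above: take the first matching div, skip the header row
table = soup.find_all("div", "equity com_ne")[0]
for row in table.find_all("tr")[1:]:
    columns = row.find_all("td")
    nameAndDate = columns[0].text.split('-')
    name = nameAndDate[0].strip()
    date = nameAndDate[1].strip()
    price = float(columns[1].text)
    change = float(columns[2].text.replace(',', ''))
    percentageChange = float(columns[3].text.replace('%', ''))
    print("%s on %s was %.2f - a change of %.2f (%.2f%%)" % (name, date, price, change, percentageChange))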
