I'm trying to download data from Google Finance for a list of stock symbols stored in a .csv file.
This is the class I'm trying to adapt from this site:
import urllib, time, datetime
import csv

class Quote(object):
    DATE_FMT = '%Y-%m-%d'
    TIME_FMT = '%H:%M:%S'

    def __init__(self):
        self.symbol = ''
        self.date, self.time, self.open_, self.high, self.low, self.close, self.volume = ([] for _ in range(7))

    def append(self, dt, open_, high, low, close, volume):
        self.date.append(dt.date())
        self.time.append(dt.time())
        self.open_.append(float(open_))
        self.high.append(float(high))
        self.low.append(float(low))
        self.close.append(float(close))
        self.volume.append(int(volume))

    def append_csv(self, filename):
        with open(filename, 'a') as f:
            f.write(self.to_csv())

    def __repr__(self):
        return self.to_csv()

    def get_symbols(self, filename):
        for line in open(filename, 'r'):
            if line != 'codigo':
                print line
                q = GoogleQuote(line, '2014-01-01', '2014-06-20')
                q.append_csv('data.csv')

class GoogleQuote(Quote):
    ''' Daily quotes from Google. Date format='yyyy-mm-dd' '''
    def __init__(self, symbol, start_date, end_date=datetime.date.today().isoformat()):
        super(GoogleQuote, self).__init__()
        self.symbol = symbol.upper()
        start = datetime.date(int(start_date[0:4]), int(start_date[5:7]), int(start_date[8:10]))
        end = datetime.date(int(end_date[0:4]), int(end_date[5:7]), int(end_date[8:10]))
        url_string = "http://www.google.com/finance/historical?q={0}".format(self.symbol)
        url_string += "&startdate={0}&enddate={1}&output=csv".format(
            start.strftime('%b %d, %Y'), end.strftime('%b %d, %Y'))
        csv = urllib.urlopen(url_string).readlines()
        csv.reverse()
        for bar in xrange(0, len(csv) - 1):
            try:
                # ds,open_,high,low,close,volume = csv[bar].rstrip().split(',')
                # open_,high,low,close = [float(x) for x in [open_,high,low,close]]
                # dt = datetime.datetime.strptime(ds,'%d-%b-%y')
                # self.append(dt,open_,high,low,close,volume)
                data = csv[bar].rstrip().split(',')
                dt = datetime.datetime.strptime(data[0], '%d-%b-%y')  # strptime (parse a string), not strftime
                close = data[4]
                self.append(dt, close)  # note: Quote.append expects six values; this adapted call only passes two
            except:
                print "error " + str(len(csv) - 1)
                print "error " + csv[bar]

if __name__ == '__main__':
    q = Quote()  # create a generic quote object
    q.get_symbols('list.csv')
But for some quotes (e.g. BIOM3) the code doesn't return all the data; some fields come back as '-'. How can I handle the split in these cases?
Lastly, at some point the script simply stops downloading data without printing any message. How can I handle this problem?
It should work, but notice that the ticker should be: BVMF:ABRE11
In [250]:
import pandas.io.data as web
import datetime
start = datetime.datetime(2010, 1, 1)
end = datetime.datetime(2013, 1, 27)
df=web.DataReader("BVMF:ABRE11", 'google', start, end)
print df.head(10)
Open High Low Close Volume
Date
2011-07-26 19.79 19.79 18.30 18.50 1843700
2011-07-27 18.45 18.60 17.65 17.89 1475100
2011-07-28 18.00 18.50 18.00 18.30 441700
2011-07-29 18.30 18.84 18.20 18.70 392800
2011-08-01 18.29 19.50 18.29 18.86 217800
2011-08-02 18.86 18.86 18.60 18.80 154600
2011-08-03 18.90 18.90 18.00 18.00 168700
2011-08-04 17.50 17.85 16.50 16.90 238700
2011-08-05 17.00 17.00 15.63 16.00 253000
2011-08-08 15.50 15.96 14.35 14.50 224300
[10 rows x 5 columns]
In [251]:
df=web.DataReader("BVMF:BIOM3", 'google', start, end)
print df.head(10)
Open High Low Close Volume
Date
2010-01-04 2.90 2.90 2.90 2.90 0
2010-01-05 3.00 3.00 3.00 3.00 0
2010-01-06 3.01 3.01 3.01 3.01 0
2010-01-07 3.01 3.09 3.01 3.09 2000
2010-01-08 3.01 3.01 3.01 3.01 0
2010-01-11 3.00 3.00 3.00 3.00 0
2010-01-12 3.00 3.00 3.00 3.00 0
2010-01-13 3.00 3.10 3.00 3.00 7000
2010-01-14 3.00 3.00 3.00 3.00 0
2010-01-15 3.00 3.00 3.00 3.00 1000
[10 rows x 5 columns]
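Building on that answer, here is a minimal sketch (my own, not from the original post) of looping over the symbols in list.csv with DataReader, prefixing each with BVMF: and skipping tickers that fail instead of stopping silently; the 'codigo' column name is taken from the question:
import datetime
import pandas as pd
import pandas.io.data as web  # same interface as used in the answer above

start = datetime.datetime(2014, 1, 1)
end = datetime.datetime(2014, 6, 20)

symbols = pd.read_csv('list.csv')['codigo']  # assumes the file has a 'codigo' header, as in the question
frames = {}
for sym in symbols:
    ticker = 'BVMF:' + sym.strip().upper()
    try:
        frames[ticker] = web.DataReader(ticker, 'google', start, end)
    except Exception as exc:
        # some tickers have gaps or no data at all; skip them instead of aborting the whole run
        print "skipping %s: %s" % (ticker, exc)

if frames:
    pd.concat(frames).to_csv('data.csv')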
This is what I have so far to retrieve data from polygon.io. For each symbol in sym I would like to add its result to df:
import pandas as pd
import requests

# fromdate and to are assumed to be defined elsewhere as 'YYYY-MM-DD' strings
sym = ['OCGN', 'TKAT', 'MMAT', 'MDIA', 'PHUN']
df = pd.DataFrame()
for stock in sym:
    url = (f'https://api.polygon.io/v2/aggs/ticker/{stock}/range/1/day/{fromdate}/{to}'
           f'?adjusted=true&sort=asc&limit=50000&apiKey=Demo')
    tick = requests.get(url)
    tick = pd.json_normalize(tick.json()["results"])
    daa = tick.iloc[[-1]]  # keep only the most recent bar
    data = pd.DataFrame(daa)
    df = df.append(data, ignore_index=True)
print(df)
Output:
v vw o c h l t
0 15426806.0 6.2736 6.31 6.03 6.66 6.030 1638334800000
1 464144.0 4.9949 5.16 4.73 5.28 4.640 1638334800000
2 8101699.0 3.5164 3.75 3.36 3.82 3.300 1638334800000
3 109407.0 5.0286 4.90 4.77 5.28 4.654 1638334800000
4 45679175.0 3.7679 3.01 3.25 3.26 2.780 1638334800000
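One possible refinement (my own sketch, not from the original post): collect the per-symbol frames in a list, tag each row with its ticker, and concatenate once at the end, which also avoids the now-deprecated DataFrame.append; fromdate and to are again assumed to be defined:
import pandas as pd
import requests

frames = []
for stock in sym:
    url = (f'https://api.polygon.io/v2/aggs/ticker/{stock}/range/1/day/{fromdate}/{to}'
           f'?adjusted=true&sort=asc&limit=50000&apiKey=Demo')
    results = requests.get(url).json().get('results', [])
    if not results:
        continue  # skip symbols with no data
    last_bar = pd.json_normalize(results).iloc[[-1]]
    last_bar.insert(0, 'ticker', stock)  # keep track of which symbol the row belongs to
    frames.append(last_bar)

df = pd.concat(frames, ignore_index=True)
print(df)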
First off, I think it would be helpful to offer some background about what I want to do. I have a time-series dataset that describes air quality in a region, with hourly resolution. Each row is an observation and each column is a different parameter (e.g. temperature, pressure, particulate matter, etc.). I want to take an average of observations for each hour of the day, across the entire five-year dataset. However, I first need to distinguish between summer and winter observations. Here are a few rows for reference:
Date Time WSA WSV WDV WSM SGT T2M T10M DELTA_T PBAR SRAD RH PM25 AQI
0 2015-01-01 00:00:00 0.9 0.2 334 3.2 70.9 29.2 29.1 -0.1 740.4 8 102.5 69.0 157.970495
1 2015-01-01 01:00:00 1.5 0.7 129 4.0 58.8 29.6 29.2 -0.4 740.2 8 102.5 23.5 74.974249
2 2015-01-01 02:00:00 0.8 0.8 70 2.7 18.0 28.7 28.3 -0.4 740.3 7 102.2 40.1 112.326633
3 2015-01-01 03:00:00 1.1 1.0 82 3.4 21.8 28.2 27.8 -0.4 740.1 6 102.0 31.1 90.957082
4 2015-01-01 04:00:00 1.0 0.8 65 4.7 34.3 27.3 27.2 -0.2 739.7 6 101.7 13.7 54.364807
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
43175 2016-12-30 19:00:00 1.7 0.7 268 4.1 63.6 33.8 34.1 0.3 738.8 8 100.7 38.4 108.140704
43176 2016-12-30 20:00:00 1.5 0.1 169 3.3 77.5 33.2 33.7 0.5 738.7 9 101.0 27.2 82.755365
43177 2016-12-30 21:00:00 1.4 0.5 278 4.0 65.7 32.5 32.8 0.3 738.6 9 101.4 42.5 118.236181
43178 2016-12-30 22:00:00 2.8 2.7 277 6.5 16.7 33.2 33.3 0.1 738.6 9 101.6 25.2 78.549356
43179 2016-12-30 23:00:00 1.9 0.3 241 4.2 74.2 31.0 31.6 0.6 738.4 9 100.4 18.7 64.879828
[43180 rows x 15 columns]
I have tried splitting the dataset into two based on season, and plotting each separately. This works, but I cannot manage to make the plot display a legend.
mask = (df['Date'].dt.month > 3) & (df['Date'].dt.month < 10)
summer = df[mask]
winter = df[~mask]
summer = summer.groupby(summer['Time'].dt.hour).mean().reset_index()
winter = winter.groupby(winter['Time'].dt.hour).mean().reset_index()
p = (
    ggplot(mapping=aes(x='Time', y='PM25')) +
    geom_point(data=summer, color='red') +
    geom_point(data=winter, color='blue')
)
print(p)
Plotting with separate dataframes: https://i.stack.imgur.com/W75kk.png
I did some more research, and learned that plotnine/ggplot can color-code data points based on one of their attributes. This approach requires the data to be a single dataset, so I added a parameter specifying the season. However, when I group by hour, this 'Season' attribute is removed. I assume it is because you cannot take the mean of non-numeric data. As such, I find myself in a bit of a paradox.
Here is my attempt at keeping the data together and adding a 'Season' column:
df.insert(0,'Season', 0)
summer = (df['Date'].dt.month > 3) & (df['Date'].dt.month < 10)
df['Season'] = df.where(summer, other='w')
df['Season'] = df.where(~summer, other='s')
df = df.groupby(df['Time'].dt.hour).mean()
print(df)
p = (
    ggplot(data=df, mapping=aes(x='Time', y='PM25', color='Season')) +
    geom_point()
)
print(p)
When I try to run this, it raises the following error, and if I inspect the dataframe all non-numeric parameters have been removed:
plotnine.exceptions.PlotnineError: "Could not evaluate the 'color' mapping: 'Season' (original error: name 'Season' is not defined)"
Any suggestions would be hugely appreciated.
The data provided has been saved to airq.csv. Besides the Season column, an Hour column has been added. The code provided in the question has been reused where possible. Both 'Hour' and 'Season' have to be passed to the groupby function. Two plotnine.ggplot possibilities are shown: the first uses geom_point, and the second adds facet_wrap. Theme customization is included for each case.
from plotnine import *
import pandas as pd

df = pd.read_csv('airq.csv', parse_dates=[0, 1])
df.insert(0, 'Season', 0)
summer = (df['Date'].dt.month > 3) & (df['Date'].dt.month < 9)
df['Season'] = df['Season'].where(summer, other='Winter')   # non-summer rows become 'Winter'
df['Season'] = df['Season'].where(~summer, other='Summer')  # summer rows become 'Summer'
df['Hour'] = df['Time'].dt.hour
df = df.groupby(['Hour', 'Season']).mean().reset_index()
custom_axis = theme(axis_text_x = element_text(color="grey", size=6, angle=90, hjust=.3),
axis_text_y = element_text(color="grey", size=6),
plot_title = element_text(size = 25, face = "bold"),
axis_title = element_text(size = 10)
)
(
ggplot(data = df, mapping = aes(x='Hour', y='PM25',
color='Season')) + geom_point() +
custom_axis + ylab("Particulate matter 2.5 micrometres") + xlab("Hour") + labs(title="PM air quality report")
)
custom_axis = theme(axis_text_x = element_text(color="grey", size=6, angle=90, hjust=.3),
axis_text_y = element_text(color="grey", size=6),
plot_title = element_text(size = 25, face = "bold"),
axis_title = element_text(size = 10),
panel_spacing_y=.4,
figure_size=(8, 4)
)
(
ggplot(data = df, mapping = aes(x='Hour', y='PM25')) + geom_point(alpha=1) + facet_wrap('Season') +
custom_axis + ylab("Particulate matter 2.5 micrometres") + xlab("Hour") + labs(title="PM air quality report")
)
Here is an example of the data we want to process:
df_size = 1000000
df_random = pd.DataFrame({'boat_id' : np.random.choice(range(300),df_size),
'X' :np.random.random_integers(0,1000,df_size),
'target_Y' :np.random.random_integers(0,10,df_size)})
X boat_id target_Y
0 482 275 6
1 705 245 4
2 328 102 6
3 631 227 6
4 234 236 8
...
I want to obtain an output like this:
X0 X1 X2 X3 X4 X5 X6 X7 X8 X9 target_Y boat_id
40055 684.0 692.0 950.0 572.0 442.0 850.0 75.0 140.0 382.0 576.0 0.0 1
40056 178.0 949.0 490.0 777.0 335.0 559.0 397.0 729.0 701.0 44.0 4.0 1
40057 21.0 818.0 341.0 577.0 612.0 57.0 303.0 183.0 519.0 357.0 0.0 1
40058 501.0 1000.0 999.0 532.0 765.0 913.0 964.0 922.0 772.0 534.0 1.0 2
40059 305.0 906.0 724.0 996.0 237.0 197.0 414.0 171.0 369.0 299.0 8.0 2
40060 408.0 796.0 815.0 638.0 691.0 598.0 913.0 579.0 650.0 955.0 2.0 3
40061 298.0 512.0 247.0 824.0 764.0 414.0 71.0 440.0 135.0 707.0 9.0 4
40062 535.0 687.0 945.0 859.0 718.0 580.0 427.0 284.0 122.0 777.0 2.0 4
40063 352.0 115.0 228.0 69.0 497.0 387.0 552.0 473.0 574.0 759.0 3.0 4
40064 179.0 870.0 862.0 186.0 25.0 125.0 925.0 310.0 335.0 739.0 7.0 4
...
I wrote the following code, but it is way too slow.
It groups by boat_id, cuts with enumerate, transposes, and then merges the results into one pandas DataFrame:
start_time = time.time()
N = 10
col_names = map(lambda x: 'X' + str(x), range(N))
compil = pd.DataFrame(columns=col_names)
i = 0
# I group by boat ID
for boat_id, df_boat in df_random.groupby('boat_id'):
    # then I cut every 5 lines
    for (line_number, (index, row)) in enumerate(df_boat.iterrows()):
        if line_number % 5 == 0:
            compil_new_line_X = list(df_boat.iloc[line_number - N:line_number, :]["X"])
            # filter to avoid issues at the start and end of the columns
            if len(compil_new_line_X) == N:
                compil.loc[i, col_names] = compil_new_line_X
                compil.loc[i, 'target_Y'] = row['target_Y']
                compil.loc[i, 'boat_id'] = row['boat_id']
                i += 1
print("Total %s seconds" % (time.time() - start_time))
Total 232.947000027 seconds
My questions are:
How do I do something every "x number of lines" and then merge the results?
Does a way exist to vectorize that kind of operation?
Here is a solution that improves the calculation time by about 35%.
It uses a groupby on 'boat_id', then groupby.apply to divide each group into small chunks,
and finally another apply to create the new line. It can probably still be improved.
df_size = 1000000
df_random = pd.DataFrame({'boat_id': np.random.choice(range(300), df_size),
                          'X': np.random.random_integers(0, 1000, df_size),
                          'target_Y': np.random.random_integers(0, 10, df_size)})

start_time = time.time()
len_of_chunks = 10
col_names = map(lambda x: 'X' + str(x), range(len_of_chunks)) + ['boat_id', 'target_Y']

def prepare_data(group):
    # this function creates the new line we will put in 'compil'
    info_we_want_to_keep = ['boat_id', 'target_Y']
    info_and_target = group.tail(1)[info_we_want_to_keep].values
    k = group["X"]
    return np.hstack([k.values, info_and_target[0]])

# we group by ID (boat)
# we divide each group into chunks of length "len_of_chunks"
# we apply prepare_data to each chunk
groups = df_random.groupby('boat_id').apply(lambda x: x.groupby(np.arange(len(x)) // len_of_chunks).apply(prepare_data))

# we reset the index
# we take the '0' column containing the valuable info
# we put the info in a new 'compil' dataframe
# we drop incomplete lines (generated by chunks shorter than len_of_chunks)
compil = pd.DataFrame(groups.reset_index()[0].values.tolist(), columns=col_names).dropna()
print("Total %s seconds" % (time.time() - start_time))
Total 153.781999826 seconds
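A further possible speed-up (my own sketch, not part of the original answer) is to avoid apply entirely and reshape each boat's X values with numpy; like the dropna() above, it assumes rows that do not fill a complete chunk can simply be dropped:
import numpy as np
import pandas as pd

n = 10  # chunk length, matching len_of_chunks above

def chunk_boat(group, n=n):
    usable = (len(group) // n) * n  # drop the tail that does not fill a full chunk
    if usable == 0:
        return None
    x = group['X'].values[:usable].reshape(-1, n)
    out = pd.DataFrame(x, columns=['X%d' % i for i in range(n)])
    tail = np.arange(n - 1, usable, n)  # index of the last row of each chunk
    out['target_Y'] = group['target_Y'].values[tail]
    out['boat_id'] = group['boat_id'].values[tail]
    return out

compil = pd.concat([chunk_boat(g) for _, g in df_random.groupby('boat_id')],
                   ignore_index=True)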
Say I run:
>>> import tushare as ts
>>> df=ts.get_stock_basics()
>>> print(df)
name industry area pe ... profit gpr npr holders
code ...
000629 攀钢钒钛 小金属 四川 15.46 ... 176.65 25.39 16.99 328000.0
002113 天润数娱 互联网 湖南 122.93 ... 75.16 52.10 10.44 63566.0
603029 天鹅股份 农用机械 山东 0.00 ... -132.89 35.05 -8.76 13965.0
600721 百花村 生物制药 新疆 23.82 ... 21.22 42.37 24.30 20891.0
300493 润欣科技 通信设备 上海 53.90 ... 2.11 10.69 3.45 23308.0
600532 宏达矿业 普钢 上海 0.00 ... -64.67 2.00 -4.47 16451.0
300749 顶固集创 家居用品 广东 50.01 ... 0.00 38.01 7.66 56044.0
300748 金力永磁 元器件 江西 40.92 ... 0.00 20.94 8.47 79240.0
002931 锋龙股份 汽车配件 浙江 70.47 ... 0.00 32.10 13.05 15734.0
600101 明星电力 水力发电 四川 23.21 ... 26.58 13.71 7.13 36654.0
002219 恒康医疗 中成药 甘肃 51.07 ... -56.30 31.11 3.88 24831.0
000593 大通燃气 供气供热 四川 161.89 ... 3.12 22.64 3.14 29631.0
002937 兴瑞科技 元器件 浙江 29.97 ... 0.00 26.30 10.13 87519.0
600568 中珠医疗 区域地产 湖北 51.79 ... -66.89 51.11 13.11 21593.0
603701 德宏股份 汽车配件 浙江 21.60 ... -3.29 31.54 16.26 13549.0
600603 广汇物流 仓储物流 四川 17.85 ... 76.97 55.25 24.74 26642.0
300005 探路者 服饰 北京 71.50 ... -69.47 30.23 2.75 42941.0
002568 百润股份 红黄药酒 上海 40.70 ... 28.41 68.29 14.17 21333.0
000697 炼石有色 航空 陕西 0.00 ... 10.23 18.64 -19.91 33614.0
002007 华兰生物 生物制药 河南 38.68 ... 5.05 60.09 37.71 44000.0
000782 美达股份 化纤 广东 60.01 ... 142.72 8.78 1.31 42060.0
603538 美诺华 化学制药 浙江 25.74 ... 37.21 27.65 12.94 11346.0
002627 宜昌交运 公共交通 湖北 33.06 ... 41.84 12.09 4.05 9891.0
002864 盘龙药业 中成药 陕西 57.00 ... 71.07 70.89 14.40 22538.0
300649 杭州园林 建筑施工 浙江 63.27 ... 90.70 18.98 7.17 16732.0
300168 万达信息 软件服务 上海 138.01 ... 101.63 38.12 7.64 42770.0
002299 圣农发展 农业综合 福建 30.52 ... 204.28 13.87 6.61 22637.0
600290 华仪电气 电气设备 浙江 144.15 ... 13.64 25.08 1.48 14832.0
002496 ST辉丰 农药化肥 江苏 17.21 ... -58.07 35.44 5.96 63487.0
002437 誉衡药业 化学制药 黑龙江 16.93 ... 3.09 73.33 8.91 45373.0
I want it to output code (the first column) one by one, so I can feed each code into another function.
You can use pandas.Index.map:
import pandas as pd
from math import sqrt
df = pd.DataFrame({'col1': (1,2,3), 'col2': (3,4,6),}, index=[1,4,9])
df
Out:
col1 col2
1 1 3
4 2 4
9 3 6
mapped_index = df.index.map(sqrt)
mapped_index
Out:
Float64Index([1.0, 2.0, 3.0], dtype='float64')
Then, if you need, you can just iterate through the result:
for i in df.index.map(sqrt):
....
A simple demonstration of traversing the DataFrame index:
import pandas as pd
import tushare as ts
df = ts.get_stock_basics()
print(df)
for i in df.index:
    print(i, type(i))
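If the goal is to feed each code into another routine, a sketch along these lines should work; process_code is a hypothetical placeholder for whatever function consumes the code, and df comes from ts.get_stock_basics() as above:
def process_code(code):
    # hypothetical stand-in for the function each code should be passed to
    print('processing', code)

for code in df.index:
    process_code(code)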
I have a SQL table like this:
Ticker Return Shares
AGJ 2.20 1265
ATA 1.78 698
ARS 9.78 10939
ARE -7.51 -26389
AIM 0.91 1758
ABT 10.02 -5893
AC -5.73 -2548
ATD 6.51 7850
AP 1.98 256
ALA -9.58 8524
So essentially, a table of stocks I've longed/shorted.
I want to find the top 4 best performers in this table, so the shorts (shares < 0) who have the lowest return, and the longs (shares > 0) who have the highest return.
Essentially, returning this:
Ticker Return Shares
ARS 9.78 10939
ARE -7.51 -26389
AC -5.73 -2548
ATD 6.51 7850
How would I be able to write the query that lets me do this?
Or, if it's easier, if there are any pandas functions that would do the same thing if I turned this table into a pandas dataframe.
Something like this (using SQL Server's TOP syntax):
select top (4) t.*
from t
order by (case when shares < 0 then - [return] else [return] end) desc;
Pandas solution:
In [134]: df.loc[(np.sign(df.Shares)*df.Return).nlargest(4).index]
Out[134]:
Ticker Return Shares
2 ARS 9.78 10939
3 ARE -7.51 -26389
7 ATD 6.51 7850
6 AC -5.73 -2548
Explanation:
In [137]: (np.sign(df.Shares)*df.Return)
Out[137]:
0 2.20
1 1.78
2 9.78
3 7.51
4 0.91
5 -10.02
6 5.73
7 6.51
8 1.98
9 -9.58
dtype: float64
In [138]: (np.sign(df.Shares)*df.Return).nlargest(4)
Out[138]:
2 9.78
3 7.51
7 6.51
6 5.73
dtype: float64
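For completeness, here is a self-contained sketch of the same idea, rebuilding the table above as a DataFrame (this part is my own addition, not from the original answer):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Ticker': ['AGJ', 'ATA', 'ARS', 'ARE', 'AIM', 'ABT', 'AC', 'ATD', 'AP', 'ALA'],
    'Return': [2.20, 1.78, 9.78, -7.51, 0.91, 10.02, -5.73, 6.51, 1.98, -9.58],
    'Shares': [1265, 698, 10939, -26389, 1758, -5893, -2548, 7850, 256, 8524],
})

# signed performance: longs keep their return, shorts count the negated return
performance = np.sign(df.Shares) * df.Return
print(df.loc[performance.nlargest(4).index])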