Concatenate several stock price dataframes - python

I am getting monthly price data from yahoo with pandas_datareader like this:
import pandas_datareader.data as web
fb = web.get_data_yahoo('FB', '06/01/2012', interval='m')
amzn = web.get_data_yahoo('AMZN', '06/01/2012', interval='m')
nflx = web.get_data_yahoo('NFLX', '06/01/2012', interval='m')
goog = web.get_data_yahoo('GOOG', '06/01/2012', interval='m')
I then clean each frame up to keep the adjusted closing price, like this:
import pandas as pd
amzn = amzn.rename(columns={'Adj Close': 'AMZN'})
amzn = pd.DataFrame(amzn['AMZN'], columns=['AMZN'])
I repeat the clean-up for all four dataframes. With this done, I want to merge the four frames together. To do this I am using:
data = pd.concat([fb, amzn, nflx, goog])
However, this results in a dataframe where, in every row, three of the four columns are NaN. I have verified that the dates match up. Why is this happening? Any insight is appreciated.
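The immediate problem is that pd.concat stacks frames along the row axis by default (axis=0), so each row carries values from only one of the four frames and the other three columns come out NaN. Concatenating along the column axis aligns the series on their shared DatetimeIndex instead; a minimal sketch, assuming the four cleaned single-column frames from above:
import pandas as pd

# axis=1 joins on the index (the dates) instead of stacking vertically
data = pd.concat([fb, amzn, nflx, goog], axis=1)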

There is a better approach - use a pandas Panel:
In [20]: p = web.get_data_yahoo(['FB','AMZN','NFLX','GOOG'], '06/01/2012', interval='m')
In [21]: p.loc['Adj Close']
Out[21]:
AMZN FB GOOG NFLX
Date
2012-06-01 228.350006 31.100000 289.745758 9.784286
2012-07-02 233.300003 21.709999 316.169373 8.121428
2012-08-01 248.270004 18.059999 342.203369 8.531428
2012-09-04 254.320007 21.660000 376.873779 7.777143
2012-10-01 232.889999 21.110001 339.810760 11.320000
2012-11-01 252.050003 28.000000 348.836761 11.672857
2012-12-03 250.869995 26.620001 353.337280 13.227143
2013-01-02 265.500000 30.980000 377.468170 23.605715
2013-02-01 264.269989 27.250000 400.200500 26.868572
2013-03-01 266.489990 25.580000 396.698975 27.040001
2013-04-01 253.809998 27.770000 411.873840 30.867144
2013-05-01 269.200012 24.350000 435.175598 32.321430
2013-06-03 277.690002 24.879999 439.746002 30.155714
2013-07-01 301.220001 36.799999 443.432343 34.925713
2013-08-01 280.980011 41.290001 423.027710 40.558571
2013-09-03 312.640015 50.230000 437.518250 44.172855
2013-10-01 364.029999 50.209999 514.776123 46.068573
2013-11-01 393.619995 47.009998 529.266602 52.257141
2013-12-02 398.790009 54.650002 559.796204 52.595715
2014-01-02 358.690002 62.570000 589.896118 58.475716
2014-02-03 362.100006 68.459999 607.218811 63.661430
2014-03-03 336.369995 60.240002 556.972473 50.290001
2014-04-01 304.130005 59.779999 526.662415 46.005714
2014-05-01 312.549988 63.299999 559.892578 59.689999
2014-06-02 324.779999 67.290001 575.282593 62.942856
... ... ... ... ...
2015-02-02 380.160004 78.970001 558.402527 67.844284
2015-03-02 372.100006 82.220001 548.002441 59.527142
2015-04-01 421.779999 78.769997 537.340027 79.500000
2015-05-01 429.230011 79.190002 532.109985 89.151428
2015-06-01 434.089996 85.769997 520.510010 93.848572
2015-07-01 536.150024 94.010002 625.609985 114.309998
2015-08-03 512.890015 89.430000 618.250000 115.029999
2015-09-01 511.890015 89.900002 608.419983 103.260002
2015-10-01 625.900024 101.970001 710.809998 108.379997
2015-11-02 664.799988 104.239998 742.599976 123.330002
2015-12-01 675.890015 104.660004 758.880005 114.379997
2016-01-04 587.000000 112.209999 742.950012 91.839996
2016-02-01 552.520020 106.919998 697.770020 93.410004
2016-03-01 593.640015 114.099998 744.950012 102.230003
2016-04-01 659.590027 117.580002 693.010010 90.029999
2016-05-02 722.789978 118.809998 735.719971 102.570000
2016-06-01 715.619995 114.279999 692.099976 91.480003
2016-07-01 758.809998 123.940002 768.789978 91.250000
2016-08-01 769.159973 126.120003 767.049988 97.449997
2016-09-01 837.309998 128.270004 777.289978 98.550003
2016-10-03 789.820007 130.990005 784.539978 124.870003
2016-11-01 750.570007 118.419998 758.039978 117.000000
2016-12-01 749.869995 115.050003 771.820007 123.800003
2017-01-03 823.479980 130.320007 796.789978 140.710007
2017-02-01 807.640015 132.059998 801.340027 140.970001
[57 rows x 4 columns]
Panel axes:
In [22]: p.axes
Out[22]:
[Index(['Open', 'High', 'Low', 'Close', 'Volume', 'Adj Close'], dtype='object'),
DatetimeIndex(['2012-06-01', '2012-07-02', '2012-08-01', '2012-09-04',
               '2012-10-01', '2012-11-01', '2012-12-03', '2013-01-02',
               '2013-02-01', '2013-03-01', '2013-04-01', '2013-05-01',
               '2013-06-03', '2013-07-01', '2013-08-01', '2013-09-03',
               '2013-10-01', '2013-11-01', '2013-12-02', '2014-01-02',
               '2014-02-03', '2014-03-03', '2014-04-01', '2014-05-01',
               '2014-06-02', '2014-07-01', '2014-08-01', '2014-09-02',
               '2014-10-01', '2014-11-03', '2014-12-01', '2015-01-02',
               '2015-02-02', '2015-03-02', '2015-04-01', '2015-05-01',
               '2015-06-01', '2015-07-01', '2015-08-03', '2015-09-01',
               '2015-10-01', '2015-11-02', '2015-12-01', '2016-01-04',
               '2016-02-01', '2016-03-01', '2016-04-01', '2016-05-02',
               '2016-06-01', '2016-07-01', '2016-08-01', '2016-09-01',
               '2016-10-03', '2016-11-01', '2016-12-01', '2017-01-03',
               '2017-02-01'],
              dtype='datetime64[ns]', name='Date', freq=None),
Index(['AMZN', 'FB', 'GOOG', 'NFLX'], dtype='object')]
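Note that pd.Panel was deprecated in pandas 0.20 and removed in 0.25. On recent versions of pandas_datareader, a multi-ticker request comes back as a DataFrame with a column MultiIndex instead, so the equivalent selection looks like this (a sketch; the exact column layout depends on the installed versions):
import pandas_datareader.data as web

# columns are (Attribute, Symbol) pairs rather than Panel items
df = web.get_data_yahoo(['FB', 'AMZN', 'NFLX', 'GOOG'], '06/01/2012', interval='m')
adj_close = df['Adj Close']  # wide frame: one column per ticker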

Related

Finding max day of the month and year in a list of pandas DataFrame

I'm trying to get the last day of each month, but it's returning just one year. I need the records from all years. Any suggestions?
import pandas as pd
import numpy as np
df = pd.read_csv('PETR4_BOV_D_cor.csv', engine='c', skiprows=1, parse_dates=['date'], names=['ticker', 'date', 'trades', 'close', 'low', 'high', 'open', 'vol', 'qty', 'avg'])
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
Result
ticker trades close low high open vol qty avg
date
2015-05-29 PETR4 44895 11.577403 11.577403 11.999936 11.934209 901139500.0 72120400 11.732238
2015-06-01 PETR4 31861 11.614961 11.502286 11.877871 11.671299 489916746.0 39483500 11.650736
2015-06-02 PETR4 47249 12.056274 11.708858 12.159559 11.783975 582467511.0 45754100 11.953363
2015-06-03 PETR4 37454 12.046884 11.943598 12.300404 12.168949 629815443.0 48703400 12.142376
2015-06-05 PETR4 34917 11.793364 11.661910 11.999936 11.812143 452516624.0 36024200 11.794773
2016-12-23 PETR4 23100 13.370821 13.154859 13.474106 13.192418 309168316.0 21776900 13.330539
2016-12-26 PETR4 4840 13.539834 13.398989 13.568003 13.445938 82501537.0 5734300 13.509224
2016-12-27 PETR4 13617 13.530444 13.389600 13.661899 13.614951 215534672.0 14949200 13.537768
2016-12-28 PETR4 20265 13.877860 13.483496 13.906029 13.549223 277762881.0 18979900 13.741335
2016-12-29 PETR4 19721 13.962367 13.633730 13.971756 13.943587 266439891.0 18090600 13.829128
395 rows × 9 columns
Groupby
df.groupby(df.index.month).apply(pd.Series.tail, 1).reset_index(level=0, drop=True)
Return
ticker trades close low high open vol qty avg
date
2016-01-29 PETR4 64685 4.544577 4.244109 4.563356 4.413122 4.398262e+08 93013900 4.439976
2016-02-29 PETR4 36334 4.826265 4.676031 4.910772 4.769928 4.312967e+08 84165000 4.811617
2016-03-31 PETR4 44127 7.840334 7.690100 8.103243 7.849723 5.834259e+08 69529900 7.878831
2016-04-29 PETR4 39767 9.605582 9.399011 9.849713 9.774596 5.482716e+08 53536700 9.615911
2016-05-31 PETR4 56676 7.549255 7.549255 8.046905 7.849723 4.804290e+08 58131400 7.760052
2016-06-30 PETR4 19998 8.845023 8.676010 8.910751 8.845023 4.090867e+08 43553100 8.819483
2016-07-29 PETR4 44681 11.145480 10.901350 11.248766 11.042195 7.142205e+08 60478800 11.088579
2016-08-31 PETR4 45622 12.065663 11.934209 12.413079 12.328573 7.848222e+08 60716500 12.137024
2016-09-30 PETR4 28869 12.741716 12.619651 12.929508 12.704157 5.284275e+08 38771900 12.797209
2016-10-31 PETR4 48694 16.610240 16.535123 17.060942 16.929487 7.480264e+08 42059100 16.699535
2016-11-30 PETR4 57759 15.023394 14.657199 15.201797 14.929498 1.316175e+09 82547300 14.971282
2016-12-29 PETR4 19721 13.962367 13.633730 13.971756 13.943587 2.664399e+08 18090600 13.829128
Thanks @Ch3steR and David.
Grouping by month alone merged all the years into the same buckets, so I changed the code following David's sample and separated the month and year into new columns to sort the data.
import pandas as pd
import numpy as np
df = pd.read_csv('PETR4_BOV_D_cor.csv', engine='c', skiprows=1, parse_dates=['date'], names=['ticker', 'date', 'trades', 'close', 'low', 'high', 'open', 'vol', 'qty', 'avg'])
df['year'] = pd.DatetimeIndex(df.date).year
df['month'] = pd.DatetimeIndex(df.date).month
df.date = pd.to_datetime(df.date)
df = df.set_index('date')
result = df.groupby([df.index.month, df.index.year]).apply(pd.Series.tail, 1).reset_index(level=0, drop=True)
result = result.sort_values(['year', 'month'])  # avoid shadowing the built-in sorted()
Result
ticker trades close low high open vol qty avg year month
date date
2015 2015-05-29 PETR4 44895 11.577403 11.577403 11.999936 11.934209 9.011395e+08 72120400 11.732238 2015 5
2015-06-30 PETR4 41060 11.934209 11.877871 12.197118 12.112611 4.078387e+08 31922400 11.996086 2015 6
2015-07-31 PETR4 38481 9.859102 9.690089 10.046895 9.840323 3.838792e+08 36465800 9.884548 2015 7
2015-08-31 PETR4 64015 8.629062 7.999957 8.722958 8.178360 6.765446e+08 75379100 8.427373 2015 8
2015-09-30 PETR4 77202 6.798086 6.544566 6.816865 6.807475 7.660222e+08 107245600 6.706725 2015 9
... ... ... ... ... ... ... ... ... ... ... ... ...
2020 2020-02-28 PETR4 109660 25.340000 24.620000 25.560000 25.160000 2.230362e+09 89095300 25.033400 2020 2
2020-03-31 PETR4 169315 13.990000 13.600000 14.540000 13.600000 2.180450e+09 155314800 14.038900 2020 3
2020-04-30 PETR4 80537 18.050000 17.700000 18.420000 17.980000 1.433551e+09 79395500 18.055800 2020 4
2020-05-29 PETR4 85912 20.340000 19.300000 20.340000 19.550000 2.528730e+09 127224200 19.876200 2020 5
2020-06-02 PETR4 99499 21.400000 20.600000 21.400000 20.750000 1.592479e+09 76091600 20.928500 2020 6
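For what it's worth, the helper year/month columns can be skipped entirely: pd.Grouper buckets the rows by calendar month in a year-aware way, so the years are never merged and the output is already in chronological order. A minimal sketch, assuming df indexed by date as above:
import pandas as pd

# one bucket per (year, month); tail(1) keeps the last actual trading row in each
monthly_last = df.groupby(pd.Grouper(freq='M')).tail(1)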

How to select rows of unique dates in DateTimeIndex

Suppose i have a DataFrame with DateTimeIndex like this:
Date_Time            Open   High    Low    Close  Volume
2018-01-22 11:05:00 948.00 948.10 947.95 948.10 9820.0
2018-01-22 11:06:00 948.10 949.60 948.05 949.30 33302.0
2018-01-22 11:07:00 949.25 949.85 949.20 949.85 20522.0
2018-03-27 09:15:00 907.20 908.80 905.00 908.15 126343.0
2018-03-27 09:16:00 908.20 909.20 906.55 906.60 38151.0
2018-03-29 09:30:00 908.90 910.45 908.80 910.15 46429.0
I want to select only the first row of each Unique Date (discard Time) so that i get such output as below:
Date_Time Open High Low Close Volume
2018-01-22 11:05:00 948.00 948.10 947.95 948.10 9820.0
2018-03-27 09:15:00 907.20 908.80 905.00 908.15 126343.0
2018-03-29 09:30:00 908.90 910.45 908.80 910.15 46429.0
I tried with loc and iloc but it didn't help.
Any help will be greatly appreciated.
You need to group by date and get the first element of each group:
import pandas as pd
data = [['2018-01-22 11:05:00', 948.00, 948.10, 947.95, 948.10, 9820.0],
['2018-01-22 11:06:00', 948.10, 949.60, 948.05, 949.30, 33302.0],
['2018-01-22 11:07:00', 949.25, 949.85, 949.20, 949.85, 20522.0],
['2018-03-27 09:15:00', 907.20, 908.80, 905.00, 908.15, 126343.0],
['2018-03-27 09:16:00', 908.20, 909.20, 906.55, 906.60, 38151.0],
['2018-03-29 09:30:00', 908.90, 910.45, 908.80, 910.15, 46429.0]]
df = pd.DataFrame(data=data)
df = df.set_index([0])  # the first column (the timestamp strings) becomes the index
df.columns = ['Open', 'High', 'Low', 'Close', 'Volume']
# group by calendar date (time of day discarded) and keep each group's first row
result = df.groupby(pd.to_datetime(df.index).date).head(1)
print(result)
Output
Open High Low Close Volume
0
2018-01-22 11:05:00 948.0 948.10 947.95 948.10 9820.0
2018-03-27 09:15:00 907.2 908.80 905.00 908.15 126343.0
2018-03-29 09:30:00 908.9 910.45 908.80 910.15 46429.0
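If the index has already been converted to a DatetimeIndex, the groupby can be skipped (a sketch, assuming the df built above): normalize() zeroes out the time of day, and duplicated() flags every row after the first on each calendar day.
df.index = pd.to_datetime(df.index)
result = df[~df.index.normalize().duplicated()]  # first row per calendar day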

Pandas resample offset from the most recent year end date?

Can someone explain what is going on with my resampling?
For example,
In [53]: daily_3mo_treasury.resample('5Y').mean()
Out[53]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.534476
where the last date in my time series is 2018-08-23 with value 2.04.
I really want my resample anchored to the most recent year-end instead, so for example from 2017-12-31 back to 2012-12-31, and so on.
I tried,
end = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
daily_3mo_treasury.iloc[:end].resample('5Y').mean()
In [66]: daily_3mo_treasury.iloc[:end].resample('5Y').mean()
Out[66]:
1993-12-31 2.997120
1998-12-31 4.917730
2003-12-31 3.297176
2008-12-31 2.997204
2013-12-31 0.097330
2018-12-31 0.333467
dtype: float64
where the last entry in daily_3mo_treasury.iloc[:end] is 2017-12-29 with value 1.37.
How come my second 5-year resample does not end at 2017-12-31?
Edit: My index is sorted.
From @ALollz: when you resample, the bins are based on the first date in your index.
sistart = daily_3mo_treasury.index.searchsorted(date(1992,12,31))
siend = daily_3mo_treasury.index.searchsorted(date(2017,12,31))
In [95]: daily_3mo_treasury.iloc[sistart:siend].resample('5Y').mean()
Out[95]:
1992-12-31 3.080000
1997-12-31 4.562246
2002-12-31 4.050696
2007-12-31 2.925971
2012-12-31 0.360775
2017-12-31 0.278233
dtype: float64
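A minimal illustration of that binning rule on synthetic data (not the treasury series): two identical business-day series that differ only in their first date produce differently anchored 5-year bins.
import pandas as pd

a = pd.Series(1.0, index=pd.date_range('1993-01-04', '2017-12-29', freq='B'))
b = pd.Series(1.0, index=pd.date_range('1992-12-31', '2017-12-29', freq='B'))

print(a.resample('5Y').mean().index.max())  # 2018-12-31: bins anchored on 1993
print(b.resample('5Y').mean().index.max())  # 2017-12-31: bins anchored on 1992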

Find difference between dates pandas using loc function

I have this dataframe
open high low close volume
TimeStamp
2017-12-22 13:15:00 12935.00 13200.00 12508.71 12514.91 244.728611
2017-12-22 13:30:00 12514.91 12999.99 12508.71 12666.34 150.457869
2017-12-22 13:45:00 12666.33 12899.97 12094.00 12094.00 198.680014
2017-12-22 14:00:00 12094.01 12354.99 11150.00 11150.00 256.812634
2017-12-22 14:15:00 11150.01 12510.00 10400.00 12276.33 262.217127
I want to know if every row is exactly 15 minutes apart in time.
So I built a new column holding the next row's timestamp:
open high low close volume \
TimeStamp
2017-12-20 13:30:00 17503.98 17600.00 17100.57 17119.89 312.773644
2017-12-20 13:45:00 17119.89 17372.98 17049.00 17170.00 322.953671
2017-12-20 14:00:00 17170.00 17573.00 17170.00 17395.74 236.085829
2017-12-20 14:15:00 17395.74 17398.00 17200.01 17280.00 220.467382
2017-12-20 14:30:00 17280.00 17313.94 17150.00 17256.05 222.760598
new_time
TimeStamp
2017-12-20 13:30:00 2017-12-20 13:45:00
2017-12-20 13:45:00 2017-12-20 14:00:00
2017-12-20 14:00:00 2017-12-20 14:15:00
2017-12-20 14:15:00 2017-12-20 14:30:00
2017-12-20 14:30:00 2017-12-20 14:45:00
Now I want to locate every row that doesn't respect the 15-minute difference rule, so I did:
dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]
I get this error,
Traceback (most recent call last):
File "<pyshell#252>", line 1, in <module>
dfh.loc[(dfh['new_time'].to_pydatetime()-dfh.index.to_pydatetime())>datetime.timedelta(0, 900)]
File "C:\Users\Araujo\AppData\Local\Programs\Python\Python36\lib\site-packages\pandas\core\generic.py", line 3614, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'Series' object has no attribute 'to_pydatetime'
Is there any way to do this?
EDIT:
shift only works with periodic data; is there any way to do this with a non-periodic index?
This would work:
import pandas as pd
import numpy as np
import datetime as dt
data = [
['2017-12-22 13:15:00', 12935.00, 13200.00, 12508.71, 12514.91, 244.728611],
['2017-12-22 13:30:00', 12514.91, 12999.99, 12508.71, 12666.34, 150.457869],
['2017-12-22 13:45:00', 12666.33, 12899.97, 12094.00, 12094.00, 198.680014],
['2017-12-22 14:00:00', 12094.01, 12354.99, 11150.00, 11150.00, 256.812634],
['2017-12-22 14:15:00', 11150.01, 12510.00, 10400.00, 12276.33, 262.217127]
]
df = pd.DataFrame(data, columns = ['Timestamp', 'open', 'high', 'low', 'close', 'volume'])
df['Timestamp'] = pd.to_datetime(df['Timestamp'])
df['plus_15'] = df['Timestamp'].shift(1) + dt.timedelta(minutes = 15)
df['valid_time'] = np.where((df['Timestamp'] == df['plus_15']) | (df.index == 0), 1, 0)
print(df[['Timestamp', 'valid_time']])
#output
Timestamp valid_time
0 2017-12-22 13:15:00 1
1 2017-12-22 13:30:00 1
2 2017-12-22 13:45:00 1
3 2017-12-22 14:00:00 1
4 2017-12-22 14:15:00 1
So create a new column, plus_15, that takes the previous row's timestamp and adds 15 minutes to it. Then create another column, valid_time, which compares the Timestamp column to the plus_15 column and marks 1 where they are equal and 0 where they are not.
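As an aside, the original AttributeError has a direct fix: for a Series, to_pydatetime lives under the .dt accessor, but no conversion is needed at all because timestamp arithmetic is vectorized in pandas. A sketch, assuming dfh and new_time as in the question:
import pandas as pd

# subtracting the DatetimeIndex from the shifted column yields a timedelta Series
bad_rows = dfh.loc[(dfh['new_time'] - dfh.index) != pd.Timedelta(minutes=15)]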
Can we do something like this?
import pandas as pd
import numpy as np
from io import StringIO  # pd.compat.StringIO was removed in newer pandas
data = '''\
TimeStamp open high low close volume
2017-12-22T13:15:00 12935.00 13200.00 12508.71 12514.91 244.728611
2017-12-22T13:30:00 12514.91 12999.99 12508.71 12666.34 150.457869
2017-12-22T13:45:00 12666.33 12899.97 12094.00 12094.00 198.680014
2017-12-22T14:00:00 12094.01 12354.99 11150.00 11150.00 256.812634
2017-12-22T14:15:00 11150.01 12510.00 10400.00 12276.33 262.217127'''
df = pd.read_csv(StringIO(data),
                 sep='\s+', parse_dates=['TimeStamp'], index_col=['TimeStamp'])
df['new_time'] = df.index[1:].tolist()+[np.NaN]
# df['new_time'] = np.roll(df.index, -1) # if last is not first+15min
# use boolean indexing to filter away unwanted rows
df[[(dt2-dt1)/np.timedelta64(1, 's') == 900
for dt1,dt2 in zip(df.index.values,df.new_time.values)]]
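A vectorized variant of the same idea (a sketch, assuming the df built above): diff() on the index gives each row's gap to the previous row directly, so no shifted helper column is needed.
gaps = df.index.to_series().diff()
# the first row has no predecessor (NaT); every other gap should be 15 minutes
irregular = df[gaps.notna() & (gaps != pd.Timedelta(minutes=15))]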

Dataframe not printing properly

I saved a dataframe to CSV, made some changes, and then tried to read it back. For some reason the date column is all mixed up.
Can someone please help and tell me why this is happening?
Before saving to CSV my df looked like this:
aapl = web.DataReader("AAPL", "yahoo", start, end)
bbry = web.DataReader("BBRY", "yahoo", start, end)
lulu = web.DataReader("LULU", "yahoo", start, end)
amzn = web.DataReader("AMZN", "yahoo", start, end)
# Below I create a DataFrame of the adjusted closing prices of these stocks, reindexed to business month-end dates
stocks = pd.DataFrame({"AAPL": aapl["Adj Close"],
"BBRY": bbry["Adj Close"],
"LULU": lulu["Adj Close"],
"AMZN":amzn["Adj Close"]}, pd.date_range(start, end, freq='BM'))
stocks.head()
Out[60]:
AAPL AMZN BBRY LULU
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-30 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-30 78.414750 202.509995 14.700000 74.730003
In [74]:
stocks.to_csv('A5.csv', encoding='utf-8')
After reading the CSV back in, it now looks like this:
In [81]:
stocks1.head()
Out[81]:
Unnamed: 0 AAPL AMZN BBRY LULU
0 2011-11-30 00:00:00 49.987684 192.289993 17.860001 49.700001
1 2011-12-30 00:00:00 52.969683 173.100006 14.500000 46.660000
2 2012-01-31 00:00:00 59.702715 194.440002 16.629999 63.130001
3 2012-02-29 00:00:00 70.945373 179.690002 14.170000 67.019997
4 2012-03-30 00:00:00 78.414750 202.509995 14.700000 74.730003
Why is it not recognizing the date column as dates?
Thanks for your help.
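For the immediate question: CSV stores everything as text, so the dates come back as plain strings unless read_csv is told to parse them. A minimal fix for the file saved above:
import pandas as pd

# index_col=0 restores the date column as the index; parse_dates converts it
# back to a DatetimeIndex
stocks1 = pd.read_csv('A5.csv', index_col=0, parse_dates=True)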
I would suggest using an HDF store instead of CSV - it's much faster, it preserves your dtypes, you can conditionally select subsets of your data sets, it supports fast compression, etc.
import pandas as pd
import pandas_datareader.data as web
stocklist = ['AAPL','BBRY','LULU','AMZN']
p = web.DataReader(stocklist, 'yahoo', '2011-11-01', '2012-04-01')
df = p['Adj Close'].resample('M').last()
print(df)
# saving DF to HDF file
store = pd.HDFStore(r'd:/temp/stocks.h5')
store.append('stocks', df, data_columns=True, complib='blosc', complevel=5)
store.close()
Output:
AAPL AMZN BBRY LULU
Date
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-31 78.414750 202.509995 14.700000 74.730003
let's read our data back from the HDF file:
In [9]: store = pd.HDFStore(r'd:/temp/stocks.h5')
In [10]: x = store.select('stocks')
In [11]: x
Out[11]:
AAPL AMZN BBRY LULU
Date
2011-11-30 49.987684 192.289993 17.860001 49.700001
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
2012-02-29 70.945373 179.690002 14.170000 67.019997
2012-03-31 78.414750 202.509995 14.700000 74.730003
you can select your data conditionally:
In [12]: x = store.select('stocks', where="AAPL >= 50 and AAPL <= 70")
In [13]: x
Out[13]:
AAPL AMZN BBRY LULU
Date
2011-12-31 52.969683 173.100006 14.500000 46.660000
2012-01-31 59.702715 194.440002 16.629999 63.130001
check index dtype:
In [14]: x.index.dtype
Out[14]: dtype('<M8[ns]')
In [15]: x.index.dtype_str
Out[15]: 'datetime64[ns]'
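Since the index dtype survives the round trip, the where clause can also filter on the dates themselves (continuing with the store opened above; these are the 2012 rows from the data):
In [16]: store.select('stocks', where="index >= '2012-01-01'")
Out[16]:
                 AAPL        AMZN       BBRY       LULU
Date
2012-01-31  59.702715  194.440002  16.629999  63.130001
2012-02-29  70.945373  179.690002  14.170000  67.019997
2012-03-31  78.414750  202.509995  14.700000  74.730003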
