How do I compute the returns for the following dataframe? Let the name of the dataframe be refined_df.
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
and we also have another dataframe, storage_openprices
AAL AAPL ADBE AEYE AMZN GOOG GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-01-14 27.910000 79.175003 347.010010 5.300000 1885.880005 1439.010010 4.230 163.389999 28.010000 4.850000 NaN 104.849998 108.851997
2020-01-15 27.450001 77.962502 346.420013 5.020000 1872.250000 1430.209961 4.160 162.619995 26.510000 4.800000 NaN 108.550003 105.952003
2020-01-16 27.790001 78.397499 345.980011 5.060000 1882.989990 1447.439941 4.280 164.350006 25.530001 4.930000 NaN 107.330002 98.750000
2020-01-17 28.299999 79.067497 349.000000 4.840000 1885.890015 1462.910034 4.360 167.419998 24.740000 5.030000 NaN 108.410004 101.522003
2020-01-21 27.969999 79.297501 346.369995 4.880000 1865.000000 1479.119995 4.280 166.679993 26.190001 4.950000 NaN 108.379997 106.050003
What I want is a new dataframe containing the log return of each listed stock over its holding period.
For example, the (0,0) entry of the returned dataframe is the log return for holding TSLA from 2020-02-03 till 2020-02-19. The open prices for TSLA are looked up in storage_openprices.
Similarly, for the (1,0) entry we return the log return for holding AMZN from 2020-02-19 till 2020-03-05.
I am unsure whether I should be using apply with a lambda function. My difficulty is referring to the next row when computing the log returns.
EDIT:
The output, a dataframe should look like
0 1
Date
2020-02-03 0.14 0.21
2020-02-19 0.18 0.19
2020-03-05 XXXX XXXX
2020-03-20 XXXX XXXX
2020-04-06 XXXX XXXX
2020-04-22 XXXX XXXX
2020-05-07 XXXX XXXX
where 0.14 (a made-up number) is the log return of TSLA from 2020-02-03 to 2020-02-19, i.e. log(TSLA open price on 2020-02-19) - log(TSLA open price on 2020-02-03)
Thanks!
You can reshape both DataFrames to long format with DataFrame.stack and then use merge_asof with the direction='forward' parameter, which picks the first available price on or after each date:
print (refined_df)
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
#changed datetimes so that they match
print (storage_openprices)
AAL AAPL ADBE AEYE AMZN GOOG \
Date
2020-02-14 27.910000 79.175003 347.010010 5.30 1885.880005 1439.010010
2020-02-15 27.450001 77.962502 346.420013 5.02 1872.250000 1430.209961
2020-02-16 27.790001 78.397499 345.980011 5.06 1882.989990 1447.439941
2020-02-17 28.299999 79.067497 349.000000 4.84 1885.890015 1462.910034
2020-02-21 27.969999 79.297501 346.369995 4.88 1865.000000 1479.119995
GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-02-14 4.23 163.389999 28.010000 4.85 NaN 104.849998 108.851997
2020-02-15 4.16 162.619995 26.510000 4.80 NaN 108.550003 105.952003
2020-02-16 4.28 164.350006 25.530001 4.93 NaN 107.330002 98.750000
2020-02-17 4.36 167.419998 24.740000 5.03 NaN 108.410004 101.522003
2020-02-21 4.28 166.679993 26.190001 4.95 NaN 108.379997 106.050003
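import pandas as pd
# reshape both frames to long format: one row per (Date, ticker)
df1 = storage_openprices.stack().rename_axis(['Date','type']).reset_index(name='new')
df2 = refined_df.stack().rename_axis(['Date','cols']).reset_index(name='type')
# for each (Date, ticker) take the first price on or after that Date
new = (pd.merge_asof(df2, df1, on='Date', by='type', direction='forward')
.pivot(index='Date', columns='cols', values='new'))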
print (new)
cols 0 1
Date
2020-02-03 108.851997 163.389999
2020-02-19 1865.000000 346.369995
2020-03-05 NaN NaN
2020-03-20 NaN NaN
2020-04-06 NaN NaN
2020-04-22 NaN NaN
2020-05-07 NaN NaN
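The output above only holds the entry prices, so here is a rough sketch of how the same merge_asof idea could be extended to the log returns the question asks for. It looks up each ticker twice, once at its own date and once at the date of the next row, and subtracts the logs. The toy frames, the End column and the helper names (picks_long, prices_long, log_returns) are made up for illustration, and melt is used instead of stack only to keep the reshaping explicit.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with the same layout as refined_df / storage_openprices.
dates = pd.to_datetime(["2020-02-03", "2020-02-19", "2020-03-05"])
idx = pd.Index(dates, name="Date")
refined_df = pd.DataFrame(
    {0: ["TSLA", "AMZN", "TSLA"], 1: ["MSFT", "MSFT", "AMZN"]}, index=idx
)
storage_openprices = pd.DataFrame(
    {"TSLA": [100.0, 110.0, 121.0],
     "MSFT": [50.0, 55.0, 55.0],
     "AMZN": [200.0, 220.0, 242.0]},
    index=idx,
)

# Long format: one row per (Date, column) pick, plus the end of the holding period.
picks = refined_df.reset_index()
picks["End"] = picks["Date"].shift(-1)  # date of the next row, NaT for the last row
picks_long = (picks.melt(id_vars=["Date", "End"], var_name="cols", value_name="type")
                   .astype({"type": str})
                   .sort_values("Date"))

# Long format prices: one row per (Date, ticker).
prices_long = (storage_openprices.reset_index()
               .melt(id_vars="Date", var_name="type", value_name="price")
               .dropna(subset=["price"])
               .astype({"type": str})
               .sort_values("Date"))

# Entry price: first available price on or after the pick date.
start = pd.merge_asof(picks_long, prices_long, on="Date", by="type", direction="forward")

# Exit price: first available price on or after the next pick date.
ends = picks_long.dropna(subset=["End"]).sort_values("End")
end_prices = prices_long.rename(columns={"Date": "End", "price": "end_price"})
end = pd.merge_asof(ends, end_prices, on="End", by="type", direction="forward")

both = start.merge(end[["Date", "cols", "end_price"]], on=["Date", "cols"], how="left")
both["log_return"] = np.log(both["end_price"]) - np.log(both["price"])
log_returns = both.pivot(index="Date", columns="cols", values="log_return")
print(log_returns)
```

The last row stays NaN because there is no later date to close the holding period against.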
I have a dataframe with OHLC data. I need to get the close price into the pandas series, using the timestamp column as the index.
I am reading from a sqlite db into my df:
conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])
print(price)
Which returns:
timestamp open high low close volume trade_count vwap symbol volume_10_day
0 2022-09-16 08:00:00+00:00 3.19 3.570 3.19 3.350 66475 458 3.404240 AAOI NaN
1 2022-09-16 08:05:00+00:00 3.35 3.440 3.33 3.430 28925 298 3.381131 AAOI NaN
2 2022-09-16 08:10:00+00:00 3.44 3.520 3.35 3.400 62901 643 3.445096 AAOI NaN
3 2022-09-16 08:15:00+00:00 3.37 3.390 3.31 3.360 17943 184 3.339721 AAOI NaN
4 2022-09-16 08:20:00+00:00 3.36 3.410 3.34 3.400 29123 204 3.383370 AAOI NaN
... ... ... ... ... ... ... ... ... ... ...
8759 2022-09-08 23:35:00+00:00 1.35 1.360 1.35 1.355 3835 10 1.350613 RUBY 515994.5
8760 2022-09-08 23:40:00+00:00 1.36 1.360 1.35 1.350 2780 7 1.353687 RUBY 515994.5
8761 2022-09-08 23:45:00+00:00 1.35 1.355 1.35 1.355 7080 11 1.350424 RUBY 515994.5
8762 2022-09-08 23:50:00+00:00 1.35 1.360 1.33 1.360 11664 30 1.351104 RUBY 515994.5
8763 2022-09-08 23:55:00+00:00 1.36 1.360 1.33 1.340 21394 32 1.348223 RUBY 515994.5
[8764 rows x 10 columns]
When I try to get the close into a series with the timestamp:
price = pd.Series(price['close'], index=price['timestamp'])
It returns a bunch of NaNs:
2022-09-16 08:00:00+00:00 NaN
2022-09-16 08:05:00+00:00 NaN
2022-09-16 08:10:00+00:00 NaN
2022-09-16 08:15:00+00:00 NaN
2022-09-16 08:20:00+00:00 NaN
..
2022-09-08 23:35:00+00:00 NaN
2022-09-08 23:40:00+00:00 NaN
2022-09-08 23:45:00+00:00 NaN
2022-09-08 23:50:00+00:00 NaN
2022-09-08 23:55:00+00:00 NaN
Name: close, Length: 8764, dtype: float64
If I remove the index:
price = pd.Series(price['close'])
The close is returned normally:
0 3.350
1 3.430
2 3.400
3 3.360
4 3.400
...
8759 1.355
8760 1.350
8761 1.355
8762 1.360
8763 1.340
Name: close, Length: 8764, dtype: float64
How can I return the close column as a pandas series, using my timestamp column as the index?
It's because price['close'] has its own integer index, and the Series constructor realigns on it. None of those labels match the timestamps, so every value becomes NaN. Try using .values instead:
price = pd.Series(price['close'].values, index=price['timestamp'])
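To make the alignment behaviour concrete, here is a small sketch on a made-up two-row frame (the data is hypothetical, the column names follow the question). It compares passing the Series itself, passing its raw values, and using set_index.

```python
import pandas as pd

# Hypothetical two-row stand-in for the table read from sqlite.
price = pd.DataFrame({
    "timestamp": pd.to_datetime(["2022-09-16 08:00", "2022-09-16 08:05"], utc=True),
    "close": [3.35, 3.43],
})

# A Series plus a new index is reindexed: labels 0 and 1 are looked up
# among the timestamps, nothing matches, so everything is NaN.
aligned = pd.Series(price["close"], index=price["timestamp"])

# Raw values carry no index, so they are assigned position by position.
by_values = pd.Series(price["close"].to_numpy(), index=price["timestamp"], name="close")

# Equivalent and usually the most readable form.
by_set_index = price.set_index("timestamp")["close"]

print(aligned)
print(by_values)
```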
I needed to set the timestamp as the index before getting the close as a series:
conn = sql.connect('allStockData.db')
price = pd.read_sql_query("SELECT * from ohlc_minutes", conn)
price['timestamp'] = pd.to_datetime(price['timestamp'])
price = price.set_index('timestamp')
print(price)
price = pd.Series(price['close'])
print(price)
Gives:
2022-09-16 08:00:00+00:00 3.350
2022-09-16 08:05:00+00:00 3.430
2022-09-16 08:10:00+00:00 3.400
2022-09-16 08:15:00+00:00 3.360
2022-09-16 08:20:00+00:00 3.400
...
2022-09-08 23:35:00+00:00 1.355
2022-09-08 23:40:00+00:00 1.350
2022-09-08 23:45:00+00:00 1.355
2022-09-08 23:50:00+00:00 1.360
2022-09-08 23:55:00+00:00 1.340
Name: close, Length: 8764, dtype: float64
I would like to create a new DataFrame and add a batch of stock data to it for each date.
Declaring a DataFrame with a multi-index - date and stock ticker.
Adding data for 2020-06-07
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
Adding data for 2020-06-08
date stock open high low close
2020-06-07 AAPL 33.50 34.20 32.1 33.30
2020-06-07 MSFT 53.50 54.20 32.1 53.30
2020-06-08 AAPL 32.50 34.20 31.1 32.30
2020-06-08 MSFT 58.50 59.20 52.1 53.30
What would be the best and most efficient solution?
Here's my current version that doesn't work as I expect.
df = pd.DataFrame()
for date in dates:
universe500 = get_universe(date) #returns stocks on a specific date
for security in universe500:
prices = data.get_prices(security, ['open','high','low','close'], 1, '1d') # returns pd.DataFrame
df.iloc[(date, security),:] = prices
If prices is a DataFrame formatted in the same manner as the original df, you can use concat:
In[0]:
#constructing a fake entry
arrays = [['2020-06-09'], ['ABCD']]
multi = pd.MultiIndex.from_arrays(arrays, names=('date', 'stock'))
to_add = pd.DataFrame({'open':1, 'high':2, 'low':3, 'close':4},index=multi)
print(to_add)
Out[0]:
open high low close
date stock
2020-06-09 ABCD 1 2 3 4
In[1]:
#now adding to your data
df = pd.concat([df, to_add])
print(df)
Out[1]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
If the data (prices) were just an array of 4 numbers (the open, high, low, and close values), then loc would work in the place where you used iloc:
In[2]:
df.loc[('2020-06-10','WXYZ'),:] = [10,20,30,40]
Out[2]:
open high low close
date stock
2020-06-07 AAPL 33.5 34.2 32.1 33.3
MSFT 53.5 54.2 32.1 53.3
2020-06-08 AAPL 32.5 34.2 31.1 32.3
MSFT 58.5 59.2 52.1 53.3
2020-06-09 ABCD 1.0 2.0 3.0 4.0
2020-06-10 WXYZ 10.0 20.0 30.0 40.0
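Since the question also asks about efficiency: growing a DataFrame row by row inside a loop tends to be slow, so an alternative is to collect plain records first and build the frame once at the end. This is only a sketch. FAKE_DATA, get_universe and get_prices below are hypothetical stand-ins for the asker's data source, and get_prices is assumed to return a dict here rather than a DataFrame.

```python
import pandas as pd

# Hypothetical stand-in data and helpers, not the asker's real API.
FIELDS = ["open", "high", "low", "close"]
FAKE_DATA = {
    "2020-06-07": {"AAPL": (33.5, 34.2, 32.1, 33.3), "MSFT": (53.5, 54.2, 32.1, 53.3)},
    "2020-06-08": {"AAPL": (32.5, 34.2, 31.1, 32.3), "MSFT": (58.5, 59.2, 52.1, 53.3)},
}

def get_universe(date):
    """Return the tickers available on a given date."""
    return list(FAKE_DATA[date])

def get_prices(security, date):
    """Return one OHLC record as a dict."""
    return dict(zip(FIELDS, FAKE_DATA[date][security]))

# Collect one flat record per (date, stock) pair.
records = []
for date in FAKE_DATA:
    for security in get_universe(date):
        records.append({"date": pd.Timestamp(date), "stock": security,
                        **get_prices(security, date)})

# Build the frame once and move date/stock into a MultiIndex.
df = pd.DataFrame(records).set_index(["date", "stock"]).sort_index()
print(df)
```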
I have an issue similar to "ValueError: cannot reindex from a duplicate axis". The solution isn't provided there.
I have an Excel file containing multiple rows and columns of weather data. Data is missing at certain intervals, although this is not shown in the sample below. I want to reindex the time column at 5 minute intervals so that I can interpolate the missing values. Data Sample:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:30 a 30.7 51 19.4 2.2
04/01/18 12:40 a 30.9 51 19.6 0.9
Here's what I have tried.
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)
I just want to have my index at 5 min frequency so that I can interpolate the NaN later. Expected output:
Date Time Temp Hum Dewpnt WindSpd
04/01/18 12:05 a 30.6 49 18.7 2.7
04/01/18 12:10 a NaN 51 19.3 1.3
04/01/18 12:15 a NaN NaN NaN NaN
04/01/18 12:20 a 30.7 NaN 19.1 2.2
04/01/18 12:25 a NaN NaN NaN NaN
04/01/18 12:30 a 30.7 51 19.4 2.2
One more approach.
df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df
output
Time Date Temp Hum Dewpnt WindSpd
0 00:05:00 4/1/2018 30.6 49.0 18.7 2.7
1 00:10:00 4/1/2018 NaN 51.0 19.3 1.3
2 00:15:00 NaN NaN NaN NaN NaN
3 00:20:00 4/1/2018 30.7 NaN 19.1 2.2
4 00:25:00 NaN NaN NaN NaN NaN
5 00:30:00 4/1/2018 30.7 51.0 19.4 2.2
6 00:35:00 NaN NaN NaN NaN NaN
7 00:40:00 4/1/2018 30.9 51.0 19.6 0.9
If times from multiple dates have to be resampled, you can use the code below.
However, you will have to separate the 'Date' and 'Time' columns again afterwards.
df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1
Output
DateTime Date Time Temp Hum Dewpnt WindSpd
0 2018-01-04 00:05:00 4/1/2018 12:05 AM 30.6 49.0 18.7 2.7
1 2018-01-04 00:10:00 4/1/2018 12:10 AM NaN 51.0 19.3 1.3
2 2018-01-04 00:15:00 NaN NaN NaN NaN NaN NaN
3 2018-01-04 00:20:00 4/1/2018 12:20 AM 30.7 NaN 19.1 2.2
4 2018-01-04 00:25:00 NaN NaN NaN NaN NaN NaN
5 2018-01-04 00:30:00 4/1/2018 12:30 AM 30.7 51.0 19.4 2.2
6 2018-01-04 00:35:00 NaN NaN NaN NaN NaN NaN
7 2018-01-04 00:40:00 4/1/2018 12:40 AM 30.9 51.0 19.6 0.9
You can try this for example:
import pandas as pd
ts = pd.read_excel(r'E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5min').mean()
More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html
Set the Time column as the index, making sure it has a datetime dtype, then try
ts.asfreq('5min')
or use
ts.asfreq('5min', method='ffill')
to pull previous values forward.
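A small sketch of what asfreq does, on a made-up three-row series (the values are hypothetical, the Temp column name follows the question). As far as I know asfreq is essentially a reindex onto a regular grid, so it needs a unique index. If the index was parsed from times without their dates, duplicates are a likely cause of the "duplicate axis" error.

```python
import pandas as pd

# Hypothetical sample: the 00:15 reading is missing.
ts = pd.DataFrame(
    {"Temp": [30.6, 30.8, 30.7]},
    index=pd.to_datetime(["2018-04-01 00:05", "2018-04-01 00:10", "2018-04-01 00:20"]),
)

# Conform to a regular 5 minute grid; the missing slot becomes NaN.
regular = ts.asfreq("5min")

# Same grid, but each gap is filled with the previous observation.
filled = ts.asfreq("5min", method="ffill")

print(regular)
print(filled)
```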
I would take the approach of creating a blank table and filling it in with the data as it comes from your data source. In this example three observations are read in as NaN, and the rows for 1:15 and 1:25 are missing.
import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
Date Time Col1 Col2
0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0
Now create a dataframe targpd with the ideal structure.
time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)
Col1 Col2
2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN
Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.
print(rawpd.Date,rawpd.Time)
0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01
Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see above the catch in all this: the Date column was parsed as datetime, but the Time column has dtype object. Below, a proper index is created by combining the two with a lambda function.
from datetime import datetime
rawidx = rawpd.apply(lambda r: datetime.combine(r['Date'], r['Time']), axis=1)
print(rawidx)
This can be used as the index of a new dataframe built from rawpd.
rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)
Once this is in place the update command can get you what you want.
targpd.update(rawpd2,overwrite=True)
print(targpd)
Col1 Col2
2018-04-01 01:00:00 1.0 10.0
2018-04-01 01:05:00 2.0 NaN
2018-04-01 01:10:00 NaN 10.0
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN 10.0
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 5.0 10.0
You now have a dataframe ready for interpolation.
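For completeness, the interpolation step itself could look like the sketch below. It rebuilds the filled-in table from this answer by hand and uses interpolate(method='time'), which weights by the actual timestamps. Whether linear interpolation is appropriate for your weather variables is an assumption you would need to check.

```python
import numpy as np
import pandas as pd

# The regular 5 minute grid from above, with the known observations filled in.
idx = pd.date_range("2018-04-01 01:00", periods=7, freq="5min")
targpd = pd.DataFrame(
    {"Col1": [1.0, 2.0, np.nan, np.nan, np.nan, np.nan, 5.0],
     "Col2": [10.0, np.nan, 10.0, np.nan, 10.0, np.nan, 10.0]},
    index=idx,
)

# Linear interpolation along the time axis; requires a DatetimeIndex.
filled = targpd.interpolate(method="time")
print(filled)
```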
I have got it to work. Thank you, everyone, for your time. I am providing the working code.
import pandas as pd
df = pd.read_excel(r'E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)
Following is my dataframe. I am trying to calculate rolling 5 period percent rank of ATR. RollingPercentRank is my desired output.
symbol Day time ATR RollingPercentRank
316356 SPY 11/29/2018 10:35:00 0.377880 NaN
316357 SPY 11/29/2018 10:40:00 0.391092 NaN
316358 SPY 11/29/2018 10:45:00 0.392983 NaN
316359 SPY 11/29/2018 10:50:00 0.399685 NaN
316360 SPY 11/29/2018 10:55:00 0.392716 0.2
316361 SPY 11/29/2018 11:00:00 0.381445 0.2
316362 AAPL 11/29/2018 11:05:00 0.387300 NaN
316363 AAPL 11/29/2018 11:10:00 0.390570 NaN
316364 AAPL 11/29/2018 11:15:00 0.381313 NaN
316365 AAPL 11/29/2018 11:20:00 0.398182 NaN
316366 AAPL 11/29/2018 11:25:00 0.377364 0.6
316367 AAPL 11/29/2018 11:30:00 0.373627 0.2
As of the 5th row, I want to apply the percent rank function to the 5 most recent values (1st row to 5th row) of ATR within a group. And as of the 6th row, I want to again apply the rank function to the 5 most recent values (2nd row to 6th row) of ATR.
I have tried the following which gives a "'numpy.ndarray' object has no attribute 'rank' " error.
df['RollingPercentRank'] = df.groupby(['symbol'])['ATR'].rolling(window=5,min_periods=5,center=False).apply(lambda x: x.rank(pct=True)).reset_index(drop=True)
If I understand correctly (I don't get the expected output you showed): to use rank you need a pd.Series, and you only want the last value of the ranked 5-element window, so it would be:
print (df.groupby(['symbol'])['ATR']
.rolling(window=5,min_periods=5,center=False)
.apply(lambda x: pd.Series(x).rank(pct=True).iloc[-1]))
symbol i
AAPL 316362 NaN
316363 NaN
316364 NaN
316365 NaN
316366 0.2
316367 0.2
SPY 316356 NaN
316357 NaN
316358 NaN
316359 NaN
316360 0.6
316361 0.2
If x is passed as a NumPy array (raw=True), the same result can be obtained by applying argsort twice. A reset_index at the end then lets you create the column:
win_val = 5
df['RollingPercentRank'] = (df.groupby(['symbol'])['ATR']
.rolling(window=win_val,min_periods=5,center=False)
.apply(lambda x: x.argsort().argsort()[-1]+1, raw=True)
.reset_index(level=0,drop=True)/win_val)
print (df)
symbol Day time ATR RollingPercentRank
316356 SPY 11/29/2018 10:35:00 0.377880 NaN
316357 SPY 11/29/2018 10:40:00 0.391092 NaN
316358 SPY 11/29/2018 10:45:00 0.392983 NaN
316359 SPY 11/29/2018 10:50:00 0.399685 NaN
316360 SPY 11/29/2018 10:55:00 0.392716 0.6
316361 SPY 11/29/2018 11:00:00 0.381445 0.2
316362 AAPL 11/29/2018 11:05:00 0.387300 NaN
316363 AAPL 11/29/2018 11:10:00 0.390570 NaN
316364 AAPL 11/29/2018 11:15:00 0.381313 NaN
316365 AAPL 11/29/2018 11:20:00 0.398182 NaN
316366 AAPL 11/29/2018 11:25:00 0.377364 0.2
316367 AAPL 11/29/2018 11:30:00 0.373627 0.2
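As a cross-check, here is a self-contained sketch of the same idea using the ATR values from the question. The helper pct_rank_of_last is a made-up name. It counts the share of window values that are less than or equal to the last one, which matches rank(pct=True) when there are no ties (with ties it behaves like method='max', which is an assumption you may want to change).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "symbol": ["SPY"] * 6 + ["AAPL"] * 6,
    "ATR": [0.377880, 0.391092, 0.392983, 0.399685, 0.392716, 0.381445,
            0.387300, 0.390570, 0.381313, 0.398182, 0.377364, 0.373627],
})

def pct_rank_of_last(window: np.ndarray) -> float:
    """Share of values in the window that are <= the most recent value."""
    return float((window <= window[-1]).mean())

df["RollingPercentRank"] = (
    df.groupby("symbol")["ATR"]
      .rolling(window=5, min_periods=5)
      .apply(pct_rank_of_last, raw=True)   # raw=True passes a NumPy array
      .reset_index(level=0, drop=True)     # drop the symbol level to realign
)
print(df)
```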
I'm having an issue changing a pandas DataFrame index to a datetime from an integer. I want to do it so that I can call reindex and fill in the dates between those listed in the table. Note that I have to use pandas 0.7.3 at the moment because I'm also using qstk, and qstk relies on pandas 0.7.3
First, here's my layout:
(Pdb) df
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
(Pdb) type(df['date'])
<class 'pandas.core.series.Series'>
(Pdb) df2 = DataFrame(index=df['date'])
(Pdb) df2
Empty DataFrame
Columns: array([], dtype=object)
Index: array([2011-01-13 16:00:00, 2011-01-26 16:00:00, 2011-02-02 16:00:00,
2011-02-10 16:00:00, 2011-03-03 16:00:00, 2011-06-03 16:00:00,
2011-05-03 16:00:00, 2011-06-10 16:00:00, 2011-08-01 16:00:00,
2011-12-20 16:00:00], dtype=object)
(Pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
I have tried multiple things to get a datetime index:
1.) Using the reindex() method with a list of datetime values. This creates a datetime index, but then fills in NaNs for the data in the DataFrame. I'm guessing that this is because the original values are tied to the integer index and reindexing to datetime tries to fill the new indices with default values (NaNs if no fill method is indicated). Thusly:
(Pdb) df.reindex(index=df['date'])
AAPL GOOG IBM XOM date
date
2011-01-13 16:00:00 NaN NaN NaN NaN NaN
2011-01-26 16:00:00 NaN NaN NaN NaN NaN
2011-02-02 16:00:00 NaN NaN NaN NaN NaN
2011-02-10 16:00:00 NaN NaN NaN NaN NaN
2011-03-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-03 16:00:00 NaN NaN NaN NaN NaN
2011-05-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-10 16:00:00 NaN NaN NaN NaN NaN
2011-08-01 16:00:00 NaN NaN NaN NaN NaN
2011-12-20 16:00:00 NaN NaN NaN NaN NaN
2.) Using DataFrame.merge with my original df and a second dataframe, df2, that is basically just a datetime index with nothing else. So I end up doing something like:
(pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
(and vice-versa). But I always end up with this kind of thing, with integer indices.
3.) Starting with an empty DataFrame with a datetime index (created from the 'date' field of df) and a bunch of empty columns. Then I attempt to assign each column by setting the columns with the same names to be equal to the columns from df:
(Pdb) df2['GOOG']=0
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 0
2011-01-26 16:00:00 0
2011-02-02 16:00:00 0
2011-02-10 16:00:00 0
2011-03-03 16:00:00 0
2011-06-03 16:00:00 0
2011-05-03 16:00:00 0
2011-06-10 16:00:00 0
2011-08-01 16:00:00 0
2011-12-20 16:00:00 0
(Pdb) df2['GOOG'] = df['GOOG']
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 NaN
2011-01-26 16:00:00 NaN
2011-02-02 16:00:00 NaN
2011-02-10 16:00:00 NaN
2011-03-03 16:00:00 NaN
2011-06-03 16:00:00 NaN
2011-05-03 16:00:00 NaN
2011-06-10 16:00:00 NaN
2011-08-01 16:00:00 NaN
2011-12-20 16:00:00 NaN
So, how in pandas 0.7.3 do I get df to be re-created with a datetime index instead of the integer index? What am I missing?
I think you are looking for set_index:
In [11]: df.set_index('date')
Out[11]:
AAPL GOOG IBM XOM
date
2011-01-13 16:00:00 0 0 4000 0
2011-01-26 16:00:00 0 1000 4000 0
2011-02-02 16:00:00 0 1000 4000 0
2011-02-10 16:00:00 0 1000 4000 4000
2011-03-03 16:00:00 0 0 1800 4000
2011-06-03 16:00:00 0 0 3300 4000
2011-05-03 16:00:00 0 0 0 4000
2011-06-10 16:00:00 1200 0 0 4000
2011-08-01 16:00:00 1200 0 0 4000
2011-12-20 16:00:00 0 0 0 4000
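Once the dates are the index, the follow-up step mentioned in the question (reindexing to fill in the dates between those listed) could look roughly like this. Note that this sketch is written against a recent pandas and has not been checked on 0.7.3. The three-row frame and the choice of forward filling are assumptions for illustration.

```python
import pandas as pd

# Hypothetical three-row excerpt with the same layout as the question.
df = pd.DataFrame({
    "IBM": [4000, 1800, 3300],
    "date": pd.to_datetime(["2011-01-13 16:00", "2011-01-17 16:00", "2011-01-18 16:00"]),
})

# Move the dates into the index and make sure they are sorted.
indexed = df.set_index("date").sort_index()

# Build a daily grid spanning the data and carry the last known row forward.
full_range = pd.date_range(indexed.index.min(), indexed.index.max(), freq="D")
daily = indexed.reindex(full_range, method="ffill")
print(daily)
```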