I have two pandas dataframes: one called orders and another called daily_prices.
daily_prices is as follows:
AAPL GOOG IBM XOM
2011-01-10 339.44 614.21 142.78 71.57
2011-01-13 342.64 616.69 143.92 73.08
2011-01-26 340.82 616.50 155.74 75.89
2011-02-02 341.29 612.00 157.93 79.46
2011-02-10 351.42 616.44 159.32 79.68
2011-03-03 356.40 609.56 158.73 82.19
2011-05-03 345.14 533.89 167.84 82.00
2011-06-03 340.42 523.08 160.97 78.19
2011-06-10 323.03 509.51 159.14 76.84
2011-08-01 393.26 606.77 176.28 76.67
2011-12-20 392.46 630.37 184.14 79.97
orders is as follows:
direction size ticker prices
2011-01-10 Buy 1500 AAPL 339.44
2011-01-13 Sell 1500 AAPL 342.64
2011-01-13 Buy 4000 IBM 143.92
2011-01-26 Buy 1000 GOOG 616.50
2011-02-02 Sell 4000 XOM 79.46
2011-02-10 Buy 4000 XOM 79.68
2011-03-03 Sell 1000 GOOG 609.56
2011-03-03 Sell 2200 IBM 158.73
2011-06-03 Sell 3300 IBM 160.97
2011-05-03 Buy 1500 IBM 167.84
2011-06-10 Buy 1200 AAPL 323.03
2011-08-01 Buy 55 GOOG 606.77
2011-08-01 Sell 55 GOOG 606.77
2011-12-20 Sell 1200 AAPL 392.46
The index of both dataframes is datetime.date.
The prices column in the orders dataframe was added by using a list comprehension to loop through all the orders, look up the specific ticker for the specific date in the daily_prices dataframe, and then add that list as a column to orders. I would like to do this using an array operation rather than something that loops. Can it be done? I tried to use:
daily_prices.ix[dates, tickers]
but this returns a matrix, the cartesian product of the two lists. I want it to return a column vector holding only the price of the specified ticker on the specified date.
Use our friend lookup, designed precisely for this purpose:
In [17]: prices
Out[17]:
AAPL GOOG IBM XOM
2011-01-10 339.44 614.21 142.78 71.57
2011-01-13 342.64 616.69 143.92 73.08
2011-01-26 340.82 616.50 155.74 75.89
2011-02-02 341.29 612.00 157.93 79.46
2011-02-10 351.42 616.44 159.32 79.68
2011-03-03 356.40 609.56 158.73 82.19
2011-05-03 345.14 533.89 167.84 82.00
2011-06-03 340.42 523.08 160.97 78.19
2011-06-10 323.03 509.51 159.14 76.84
2011-08-01 393.26 606.77 176.28 76.67
2011-12-20 392.46 630.37 184.14 79.97
In [18]: orders
Out[18]:
Date direction size ticker prices
0 2011-01-10 00:00:00 Buy 1500 AAPL 339.44
1 2011-01-13 00:00:00 Sell 1500 AAPL 342.64
2 2011-01-13 00:00:00 Buy 4000 IBM 143.92
3 2011-01-26 00:00:00 Buy 1000 GOOG 616.50
4 2011-02-02 00:00:00 Sell 4000 XOM 79.46
5 2011-02-10 00:00:00 Buy 4000 XOM 79.68
6 2011-03-03 00:00:00 Sell 1000 GOOG 609.56
7 2011-03-03 00:00:00 Sell 2200 IBM 158.73
8 2011-06-03 00:00:00 Sell 3300 IBM 160.97
9 2011-05-03 00:00:00 Buy 1500 IBM 167.84
10 2011-06-10 00:00:00 Buy 1200 AAPL 323.03
11 2011-08-01 00:00:00 Buy 55 GOOG 606.77
12 2011-08-01 00:00:00 Sell 55 GOOG 606.77
13 2011-12-20 00:00:00 Sell 1200 AAPL 392.46
In [19]: prices.lookup(orders.Date, orders.ticker)
Out[19]:
array([ 339.44, 342.64, 143.92, 616.5 , 79.46, 79.68, 609.56,
158.73, 160.97, 167.84, 323.03, 606.77, 606.77, 392.46])
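Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. On recent versions, the same one-(row, col)-pair-per-order lookup can be sketched with get_indexer plus numpy fancy indexing; the frame below is a toy stand-in using the question's first few rows:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the price table and the order lists (values copied
# from the question's first rows).
prices = pd.DataFrame(
    {"AAPL": [339.44, 342.64], "IBM": [142.78, 143.92]},
    index=pd.to_datetime(["2011-01-10", "2011-01-13"]),
)
dates = pd.to_datetime(["2011-01-10", "2011-01-13", "2011-01-13"])
tickers = ["AAPL", "AAPL", "IBM"]

# Translate labels to integer positions, then use numpy fancy indexing:
# one (row, col) pair per order instead of a cartesian product.
rows = prices.index.get_indexer(dates)
cols = prices.columns.get_indexer(tickers)
looked_up = prices.to_numpy()[rows, cols]
```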
How do I compute the returns for the following dataframe? Let the name of the dataframe be refined_df:
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
and we also have another dataframe, storage_openprices
AAL AAPL ADBE AEYE AMZN GOOG GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-01-14 27.910000 79.175003 347.010010 5.300000 1885.880005 1439.010010 4.230 163.389999 28.010000 4.850000 NaN 104.849998 108.851997
2020-01-15 27.450001 77.962502 346.420013 5.020000 1872.250000 1430.209961 4.160 162.619995 26.510000 4.800000 NaN 108.550003 105.952003
2020-01-16 27.790001 78.397499 345.980011 5.060000 1882.989990 1447.439941 4.280 164.350006 25.530001 4.930000 NaN 107.330002 98.750000
2020-01-17 28.299999 79.067497 349.000000 4.840000 1885.890015 1462.910034 4.360 167.419998 24.740000 5.030000 NaN 108.410004 101.522003
2020-01-21 27.969999 79.297501 346.369995 4.880000 1865.000000 1479.119995 4.280 166.679993 26.190001 4.950000 NaN 108.379997 106.050003
What I want is to return a new dataframe with the log return of the stock held over each holding period.
For example, the (0,0) entry of the returned dataframe is the log return for holding TSLA from 2020-02-03 till 2020-02-19, using the open prices for TSLA from storage_openprices.
Similarly, for the (1,0) entry we return the log return for holding AMZN from 2020-02-19 till 2020-03-05.
I'm unsure if I should be using apply with a lambda. My issue is referencing the next row when computing the log returns.
EDIT:
The output, a dataframe should look like
0 1
Date
2020-02-03 0.14 0.21
2020-02-19 0.18 0.19
2020-03-05 XXXX XXXX
2020-03-20 XXXX XXXX
2020-04-06 XXXX XXXX
2020-04-22 XXXX XXXX
2020-05-07 XXXX XXXX
where 0.14 (a made-up number) is the log return of TSLA from 2020-02-03 to 2020-02-19, i.e. log(TSLA open price on 2020-02-19) - log(TSLA open price on 2020-02-03)
Thanks!
You can use merge_asof with the direction='forward' parameter, with both DataFrames reshaped by DataFrame.stack:
print (refined_df)
0 1
Date
2020-02-03 TSLA MSFT
2020-02-19 AMZN ADBE
2020-03-05 OYST GPRO
2020-03-20 AMZN OYST
2020-04-06 SGEN AEYE
2020-04-22 AEYE TSLA
2020-05-07 AAPL SGEN
#changed datetimes for match
print (storage_openprices)
AAL AAPL ADBE AEYE AMZN GOOG \
Date
2020-02-14 27.910000 79.175003 347.010010 5.30 1885.880005 1439.010010
2020-02-15 27.450001 77.962502 346.420013 5.02 1872.250000 1430.209961
2020-02-16 27.790001 78.397499 345.980011 5.06 1882.989990 1447.439941
2020-02-17 28.299999 79.067497 349.000000 4.84 1885.890015 1462.910034
2020-02-21 27.969999 79.297501 346.369995 4.88 1865.000000 1479.119995
GPRO MSFT OYST PACB RADI SGEN TSLA
Date
2020-02-14 4.23 163.389999 28.010000 4.85 NaN 104.849998 108.851997
2020-02-15 4.16 162.619995 26.510000 4.80 NaN 108.550003 105.952003
2020-02-16 4.28 164.350006 25.530001 4.93 NaN 107.330002 98.750000
2020-02-17 4.36 167.419998 24.740000 5.03 NaN 108.410004 101.522003
2020-02-21 4.28 166.679993 26.190001 4.95 NaN 108.379997 106.050003
df1 = storage_openprices.stack().rename_axis(['Date','type']).reset_index(name='new')
df2 = refined_df.stack().rename_axis(['Date','cols']).reset_index(name='type')
new = (pd.merge_asof(df2, df1, on='Date', by='type', direction='forward')
         .pivot(index='Date', columns='cols', values='new'))
print (new)
cols 0 1
Date
2020-02-03 108.851997 163.389999
2020-02-19 1865.000000 346.369995
2020-03-05 NaN NaN
2020-03-20 NaN NaN
2020-04-06 NaN NaN
2020-04-22 NaN NaN
2020-05-07 NaN NaN
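The heart of this is merge_asof with direction='forward': for each left-hand row it takes the first right-hand row at or after that date for the same ticker. A minimal, self-contained sketch with made-up prices (all numbers here are toy values, not the question's data):

```python
import pandas as pd

# Toy holdings: which ticker is held from each rebalance date.
holdings = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-02-03", "2020-02-19"]),
     "type": ["TSLA", "AMZN"]}
)
# Toy long-format open prices; note the dates need not match exactly.
prices = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-02-04", "2020-02-20",
                             "2020-02-04", "2020-02-20"]),
     "type": ["TSLA", "TSLA", "AMZN", "AMZN"],
     "new": [100.0, 110.0, 2000.0, 2100.0]}
).sort_values("Date")

# For each holding row, take the first price row on or after that date
# for the same ticker -- this is what direction='forward' does.
matched = pd.merge_asof(holdings.sort_values("Date"), prices,
                        on="Date", by="type", direction="forward")
```

Both frames must be sorted on the 'Date' key, which is why the answer above calls stack/reset_index before merging.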
I am working on a trading algo, but I have some issues when trying to combine the buy_orders and sell_orders dataframes into a single dataframe, orders.
The issue shows itself on the buy order dated 2021-01-21, where my algo recommended a buy, but there is no sell order yet because the signal hasn't been spotted; these fields should therefore be NaN after the merge.
If I were to join on the index, the sell order would be the sell order for a different stock from the sell_orders dataframe.
buy_orders dataframe
Date_buy Name Stock_Price_buy Order
26 2020-07-30 AAPL 96.19 buy
27 2020-09-30 AAPL 115.81 buy
28 2020-11-05 AAPL 119.03 buy
29 2020-11-30 AAPL 119.05 buy
30 2021-01-21 AAPL 136.87 buy
31 2020-10-11 ABBV 21.21 buy
sell_orders dataframe
Date_sell Name Stock_Price_sell Order
25 2020-07-20 AAPL 98.36 sell
26 2020-09-02 AAPL 131.40 sell
27 2020-10-20 AAPL 117.51 sell
28 2020-11-20 AAPL 117.34 sell
29 2021-01-04 AAPL 129.41 sell
30 2020-10-15 ABBV 24.23 sell
Ideal result would be the orders dataframe as demonstrated below.
Index Buy_date Name_x Stock_Price_buy Order_x Sell_date Name_y Stock_Price_sell Order_y
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL 131.40 sell
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL 117.51 sell
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy NaN NaN NaN NaN
Here's how the orders dataframe looks now, when buy_orders.Name_x and sell_orders.Name_y differ for the first time. The ABBV sell order should have been NaNs:
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL 117.34 sell
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL 129.41 sell
30 2021-01-21 AAPL 136.87 buy 2018-05-24 ABBV 24.23 sell
Have you thought of join, declaring suffixes as follows?
buy_orders.join(sell_orders,lsuffix='_buy', rsuffix='_sell')
Date_buy Name_buy Stock_Price_buy Order_buy Date_sell Name_sell \
26 2020-07-30 AAPL 96.19 buy 2020-09-02 AAPL
27 2020-09-30 AAPL 115.81 buy 2020-10-20 AAPL
28 2020-11-05 AAPL 119.03 buy 2020-11-20 AAPL
29 2020-11-30 AAPL 119.05 buy 2021-01-04 AAPL
30 2021-01-21 AAPL 136.87 buy NaN NaN
Stock_Price_sell Order_sell
26 131.40 sell
27 117.51 sell
28 117.34 sell
29 129.41 sell
30 NaN NaN
Assuming your data always starts with a buy and alternates with sell orders, has only one transaction per day, and is always one lot per transaction... you can use pd.concat. I made a simple dataframe that is kind of like yours (and in the future, it makes it easier if you include code to build a sample dataframe as part of your question):
buy_orders = pd.DataFrame.from_dict({'Date_buy': [ pd.to_datetime('2020-11-01'), pd.to_datetime('2020-11-03'),
pd.to_datetime('2020-11-05'), pd.to_datetime('2020-11-08'),
pd.to_datetime('2020-11-10')],
'Order' : ['B','B','B','B','B'],
'Name' : ['AAPL','AAPL','AAPL','AAPL','ABBV'],
'Stock_Price_buy' : [1,2,3,4,5.0]})
sell_orders = pd.DataFrame.from_dict({'Date_sell': [ pd.to_datetime('2020-11-02'), pd.to_datetime('2020-11-04'),
pd.to_datetime('2020-11-06'), pd.to_datetime('2020-11-12'),
pd.to_datetime('2020-11-22')],
'Order' : ['S','S','S','S','S'],
'Name' : ['AAPL','AAPL','AAPL','ABBV','ABBV'],
'Stock_Price_sell' : [23,24,25,26,5.0]})
You can first stack the two dataframes and sort them by date and ticker (after normalizing the column names):
buy_orders = buy_orders.rename(columns={'Date_buy' : "Date", "Stock_Price_buy" : "Price"})
sell_orders = sell_orders.rename(columns={'Date_sell' : "Date", "Stock_Price_sell" : "Price"})
df = pd.concat([buy_orders, sell_orders])
df = df.sort_values(['Date','Order']).reset_index(drop=True)
...then making a copy of the dataframe (changing the column names to keep them distinct in the later concat step):
df2 = df.copy()
df2.columns = [f"{c}_sell" for c in df.columns]
You then concatenate the two dataframes next to each other, but with a .shift(-1) on the second one so that they're offset:
df3 = pd.concat([df, df2.shift(-1)], axis=1)
Finally, you can clean up the junk rows:
import numpy as np

# rows where the shifted "sell" half belongs to a different ticker
cut = (df3.Name != df3.Name_sell)
df3.loc[cut, 'Date_sell'] = np.nan
df3.loc[cut, 'Order_sell'] = np.nan
df3.loc[cut, 'Price_sell'] = np.nan
df3 = df3.drop(columns='Name_sell')
df3 = df3[df3.Order != "S"].reset_index(drop=True).copy()
That gives you something like
Date Order Name Price Date_sell Order_sell Price_sell
0 2020-11-01 B AAPL 1.0 2020-11-02 S 23.0
1 2020-11-03 B AAPL 2.0 2020-11-04 S 24.0
2 2020-11-05 B AAPL 3.0 2020-11-06 S 25.0
3 2020-11-08 B AAPL 4.0 NaT NaN NaN
4 2020-11-10 B ABBV 5.0 2020-11-12 S 26.0
You don't have to make all the intermediate dataframes, etc, but I left the code that way here so that if you paste things in a notebook you can look at the steps.
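As a variation worth considering (a sketch with toy values, not the exact frames above): pd.merge_asof with by='Name' and direction='forward' pairs each buy with the first sell on or after its date for the same ticker, and leaves missing values when no later sell exists. One caveat: two buys before a single sell would both match that sell, so this fits the alternating buy/sell pattern assumed above:

```python
import pandas as pd

# Toy buy/sell frames (values borrowed loosely from the question).
buys = pd.DataFrame({
    "Date": pd.to_datetime(["2020-07-30", "2020-09-30", "2021-01-21"]),
    "Name": ["AAPL", "AAPL", "AAPL"],
    "Price_buy": [96.19, 115.81, 136.87],
})
sells = pd.DataFrame({
    "Date": pd.to_datetime(["2020-09-02", "2020-10-20"]),
    "Name": ["AAPL", "AAPL"],
    "Price_sell": [131.40, 117.51],
})
# Keep the sell date as an ordinary column so it survives the merge.
sells = sells.assign(Date_sell=sells["Date"])

# Each buy gets the first sell at or after its date for the same name;
# the 2021 buy has no later sell, so its sell columns come back missing.
paired = pd.merge_asof(buys.sort_values("Date"), sells.sort_values("Date"),
                       on="Date", by="Name", direction="forward")
```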
I have orders.csv as a dataframe called orders_df:
Symbol Order Shares
Date
2011-01-10 AAPL BUY 100
2011-01-13 AAPL SELL 200
2011-01-13 IBM BUY 100
2011-01-26 GOOG SELL 200
I end up sorting the data frame with orders_df = orders_df.sort_index().
Then I create a symbols like so:
symbols = np.append(orders_df.loc[:, 'Symbol'].unique(), 'SPY')
Here comes my second DataFrame df_prices.
df_prices = get_data(symbols, orders_df.index, addSPY=False)
df_prices.loc[:, 'CASH'] = 1.0
which prints out:
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 150 100 50 400 100 1.0
2011-01-13 250 200 500 100 100 1.0
2011-01-13 250 200 500 100 100 1.0
2011-01-26 100 150 100 300 50 1.0
Now, I initialize a third dataframe:
df_trades = pd.DataFrame(0, df_prices.index, columns=list(df_prices))
I need to fill this data frame with the correct values using the two previous data frames. If I BUY AAPL, I want to multiply Shares from orders_df by the price of AAPL and by -1; if it were SELL I wouldn't multiply by -1. I put that value in the CASH column. For the other columns, I simply copy over the Shares of each stock on the days they traded.
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 100 0 0 0 0 -15000
2011-01-13 -200 0 0 0 0 50000
2011-01-13 0 100 0 0 0 -20000
2011-01-26 0 0 -200 0 0 20000
How do I achieve df_trades using vectorized operations?
UPDATE
What if I did:
df_prices = get_data(symbols, orders_df.index, addSPY=False)
df_prices.loc[:, 'CASH'] = 1.0
which prints out
AAPL IBM GOOG XOM SPY CASH
2011-01-10 340.99 143.41 614.21 72.02 123.19 1.0
2011-01-11 340.18 143.06 616.01 72.56 123.63 1.0
2011-01-12 342.95 144.82 616.87 73.41 124.74 1.0
2011-01-13 344.20 144.55 616.69 73.54 124.54 1.0
2011-01-14 346.99 145.70 624.18 74.62 125.44 1.0
2011-01-18 339.19 146.33 639.63 75.45 125.65 1.0
2011-01-19 337.39 151.22 631.75 75.00 124.42 1.0
How would I produce the df_trades then?
The example values aren't valid anymore, FYI.
Vectorized Solution
j = np.array([df_trades.columns.get_loc(c) for c in orders_df.Symbol])
i = np.arange(len(df_trades))
o = np.where(orders_df.Order.values == 'BUY', -1, 1)
v = orders_df.Shares.values * o
t = df_trades.values
t[i, j] = v
df_trades.loc[:, 'CASH'] = \
    df_trades.drop(columns='CASH', errors='ignore').mul(df_prices).sum(axis=1)
df_trades
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 -100 0 0 0 0 -15000.0
2011-01-13 200 0 0 0 0 50000.0
2011-01-13 0 -100 0 0 0 -30000.0
2011-01-26 0 0 200 0 0 20000.0
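For a runnable version of the same positional-indexing idea, here is a self-contained sketch with toy prices. It follows the question's sign convention (positive shares for a BUY, cash debited), which is flipped relative to the output above:

```python
import numpy as np
import pandas as pd

# Toy stand-ins for orders_df and df_prices (values are illustrative).
orders_df = pd.DataFrame(
    {"Symbol": ["AAPL", "IBM"], "Order": ["BUY", "SELL"], "Shares": [100, 100]},
    index=pd.to_datetime(["2011-01-10", "2011-01-13"]),
)
df_prices = pd.DataFrame(
    {"AAPL": [150.0, 250.0], "IBM": [100.0, 200.0], "CASH": [1.0, 1.0]},
    index=orders_df.index,
)

# One order per row: place signed share counts by (row, column) position.
t = np.zeros((len(orders_df), len(df_prices.columns)))
i = np.arange(len(orders_df))
j = df_prices.columns.get_indexer(orders_df["Symbol"])
t[i, j] = orders_df["Shares"].to_numpy() * np.where(orders_df["Order"] == "BUY", 1, -1)

df_trades = pd.DataFrame(t, index=df_prices.index, columns=df_prices.columns)
# Cash flow is the negative of the traded value on each row.
df_trades["CASH"] = -(df_trades.drop(columns="CASH")
                      * df_prices.drop(columns="CASH")).sum(axis=1)
```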
I have close prices of multiple stocks over multiple days in a dataframe like this.
In [67]: df
Out[67]:
Date Symbol Close
0 12/30/2016 AMZN 749.87
1 12/29/2016 AMZN 765.15
2 12/28/2016 AMZN 772.13
3 12/27/2016 AMZN 771.40
4 12/30/2016 GOOGL 792.45
5 12/29/2016 GOOGL 802.88
6 12/28/2016 GOOGL 804.57
7 12/27/2016 GOOGL 805.80
8 12/30/2016 NFLX 123.80
9 12/29/2016 NFLX 125.33
10 12/28/2016 NFLX 125.89
11 12/27/2016 NFLX 128.35
I would like to compute daily returns of these stocks using pandas. The output should looks like this:
Date Symbol Return
0 12/27/2016 AMZN NaN
1 12/28/2016 AMZN 0.000946
2 12/29/2016 AMZN -0.009040
3 12/30/2016 AMZN -0.019970
4 12/27/2016 GOOGL NaN
5 12/28/2016 GOOGL -0.001526
6 12/29/2016 GOOGL -0.002101
7 12/30/2016 GOOGL -0.012991
8 12/27/2016 NFLX NaN
9 12/28/2016 NFLX -0.019166
10 12/29/2016 NFLX -0.004448
11 12/30/2016 NFLX -0.012208
I got the above output using the following code, but I feel this can be simplified further.
In [70]: rtn = df.pivot("Date", "Symbol", "Close").pct_change().reset_index()
In [73]: pd.melt(rtn, id_vars='Date', value_vars=list(rtn.columns[1:]),var_name='Symbol',value_name='Return')
You can use sort_values first and then groupby with DataFrameGroupBy.pct_change:
df = df.sort_values(['Symbol','Date']).reset_index(drop=True)
df['Return'] = df.groupby('Symbol')['Close'].pct_change()
print (df)
Date Symbol Close Return
0 12/27/2016 AMZN 771.40 NaN
1 12/28/2016 AMZN 772.13 0.000946
2 12/29/2016 AMZN 765.15 -0.009040
3 12/30/2016 AMZN 749.87 -0.019970
4 12/27/2016 GOOGL 805.80 NaN
5 12/28/2016 GOOGL 804.57 -0.001526
6 12/29/2016 GOOGL 802.88 -0.002101
7 12/30/2016 GOOGL 792.45 -0.012991
8 12/27/2016 NFLX 128.35 NaN
9 12/28/2016 NFLX 125.89 -0.019166
10 12/29/2016 NFLX 125.33 -0.004448
11 12/30/2016 NFLX 123.80 -0.012208
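If log returns are wanted instead of simple returns, the same groupby pattern works on log prices (a sketch with a toy frame built from a few of the rows above):

```python
import numpy as np
import pandas as pd

# Toy close prices in the same long format as the question.
df = pd.DataFrame({
    "Date": ["12/27/2016", "12/28/2016", "12/27/2016", "12/28/2016"],
    "Symbol": ["AMZN", "AMZN", "NFLX", "NFLX"],
    "Close": [771.40, 772.13, 128.35, 125.89],
})
df = df.sort_values(["Symbol", "Date"]).reset_index(drop=True)

# Log return per symbol: the within-group difference of log prices.
df["LogReturn"] = np.log(df["Close"]).groupby(df["Symbol"]).diff()
```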
You can set_index and unstack, which will sort your index for you, then pct_change and stack back.
print(
df.set_index(['Date', 'Symbol'])
.Close.unstack().pct_change()
.stack(dropna=False).reset_index(name='Return')
.sort_values(['Symbol', 'Date'])
.reset_index(drop=True)
)
Date Symbol Return
0 2016-12-27 AMZN NaN
1 2016-12-28 AMZN 0.000946
2 2016-12-29 AMZN -0.009040
3 2016-12-30 AMZN -0.019970
4 2016-12-27 GOOGL NaN
5 2016-12-28 GOOGL -0.001526
6 2016-12-29 GOOGL -0.002101
7 2016-12-30 GOOGL -0.012991
8 2016-12-27 NFLX NaN
9 2016-12-28 NFLX -0.019166
10 2016-12-29 NFLX -0.004448
11 2016-12-30 NFLX -0.012208
I'm having an issue changing a pandas DataFrame index to a datetime from an integer. I want to do it so that I can call reindex and fill in the dates between those listed in the table. Note that I have to use pandas 0.7.3 at the moment because I'm also using qstk, and qstk relies on pandas 0.7.3
First, here's my layout:
(Pdb) df
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
(Pdb) type(df['date'])
<class 'pandas.core.series.Series'>
(Pdb) df2 = DataFrame(index=df['date'])
(Pdb) df2
Empty DataFrame
Columns: array([], dtype=object)
Index: array([2011-01-13 16:00:00, 2011-01-26 16:00:00, 2011-02-02 16:00:00,
2011-02-10 16:00:00, 2011-03-03 16:00:00, 2011-06-03 16:00:00,
2011-05-03 16:00:00, 2011-06-10 16:00:00, 2011-08-01 16:00:00,
2011-12-20 16:00:00], dtype=object)
(Pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
I have tried multiple things to get a datetime index:
1.) Using the reindex() method with a list of datetime values. This creates a datetime index, but then fills in NaNs for the data in the DataFrame. I'm guessing that this is because the original values are tied to the integer index and reindexing to datetime tries to fill the new indices with default values (NaNs if no fill method is indicated). Thusly:
(Pdb) df.reindex(index=df['date'])
AAPL GOOG IBM XOM date
date
2011-01-13 16:00:00 NaN NaN NaN NaN NaN
2011-01-26 16:00:00 NaN NaN NaN NaN NaN
2011-02-02 16:00:00 NaN NaN NaN NaN NaN
2011-02-10 16:00:00 NaN NaN NaN NaN NaN
2011-03-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-03 16:00:00 NaN NaN NaN NaN NaN
2011-05-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-10 16:00:00 NaN NaN NaN NaN NaN
2011-08-01 16:00:00 NaN NaN NaN NaN NaN
2011-12-20 16:00:00 NaN NaN NaN NaN NaN
2.) Using DataFrame.merge with my original df and a second dataframe, df2, that is basically just a datetime index with nothing else. So I end up doing something like:
(pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
(and vice-versa). But I always end up with this kind of thing, with integer indices.
3.) Starting with an empty DataFrame with a datetime index (created from the 'date' field of df) and a bunch of empty columns. Then I attempt to assign each column by setting the columns with the same
names to be equal to the columns from df:
(Pdb) df2['GOOG']=0
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 0
2011-01-26 16:00:00 0
2011-02-02 16:00:00 0
2011-02-10 16:00:00 0
2011-03-03 16:00:00 0
2011-06-03 16:00:00 0
2011-05-03 16:00:00 0
2011-06-10 16:00:00 0
2011-08-01 16:00:00 0
2011-12-20 16:00:00 0
(Pdb) df2['GOOG'] = df['GOOG']
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 NaN
2011-01-26 16:00:00 NaN
2011-02-02 16:00:00 NaN
2011-02-10 16:00:00 NaN
2011-03-03 16:00:00 NaN
2011-06-03 16:00:00 NaN
2011-05-03 16:00:00 NaN
2011-06-10 16:00:00 NaN
2011-08-01 16:00:00 NaN
2011-12-20 16:00:00 NaN
So, how in pandas 0.7.3 do I get df to be re-created with a datetime index instead of the integer index? What am I missing?
I think you are looking for set_index:
In [11]: df.set_index('date')
Out[11]:
AAPL GOOG IBM XOM
date
2011-01-13 16:00:00 0 0 4000 0
2011-01-26 16:00:00 0 1000 4000 0
2011-02-02 16:00:00 0 1000 4000 0
2011-02-10 16:00:00 0 1000 4000 4000
2011-03-03 16:00:00 0 0 1800 4000
2011-06-03 16:00:00 0 0 3300 4000
2011-05-03 16:00:00 0 0 0 4000
2011-06-10 16:00:00 1200 0 0 4000
2011-08-01 16:00:00 1200 0 0 4000
2011-12-20 16:00:00 0 0 0 4000
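Once the index is datetime-like, the "fill in the dates between those listed" step the question mentions can be done with reindex over a date_range. A sketch in modern pandas (the 0.7.3 API differed), assuming forward-fill is the intended fill rule:

```python
import pandas as pd

# Toy positions on two trade dates.
df = pd.DataFrame(
    {"GOOG": [0, 1000], "IBM": [4000, 4000]},
    index=pd.to_datetime(["2011-01-13", "2011-01-26"]),
)

# Expand to every calendar day in the range, carrying the last known
# row forward so positions persist between trade dates.
full = df.reindex(pd.date_range(df.index.min(), df.index.max(), freq="D"),
                  method="ffill")
```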