I have orders.csv as a dataframe called orders_df:
Symbol Order Shares
Date
2011-01-10 AAPL BUY 100
2011-01-13 AAPL SELL 200
2011-01-13 IBM BUY 100
2011-01-26 GOOG SELL 200
I end up sorting the data frame with orders_df = orders_df.sort_index().
Then I create a symbols array like so:
symbols = np.append(orders_df.loc[:, 'Symbol'].unique(), 'SPY')
Here comes my second DataFrame df_prices.
df_prices = get_data(symbols, orders_df.index, addSPY=False)
df_prices.loc[:, 'CASH'] = 1.0
which prints out:
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 150 100 50 400 100 1.0
2011-01-13 250 200 500 100 100 1.0
2011-01-13 250 200 500 100 100 1.0
2011-01-26 100 150 100 300 50 1.0
Now, I initialize a third data frame:
df_trades = pd.DataFrame(0, df_prices.index, columns=list(df_prices))
I need to fill this data frame with the correct values using the two previous data frames. If I BUY AAPL, I want to multiply the Shares from orders_df by the price of AAPL, times -1; if it were SELL, I wouldn't multiply by -1. I put that value in the CASH column. For the other columns, I simply copy over the Shares of each stock on the days they traded.
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 100 0 0 0 0 -15000
2011-01-13 -200 0 0 0 0 50000
2011-01-13 0 100 0 0 0 -20000
2011-01-26 0 0 -200 0 0 20000
How do I achieve df_trades using vectorized operations?
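For concreteness, here is a minimal, self-contained sketch of one vectorized approach, reconstructing the frames from the tables above (the numbers are the example values shown; the sign convention follows the desired output, i.e. positive shares on BUY and negative CASH):

```python
import numpy as np
import pandas as pd

# Reconstruction of the example frames from the tables above.
orders_df = pd.DataFrame(
    {'Symbol': ['AAPL', 'AAPL', 'IBM', 'GOOG'],
     'Order': ['BUY', 'SELL', 'BUY', 'SELL'],
     'Shares': [100, 200, 100, 200]},
    index=pd.to_datetime(['2011-01-10', '2011-01-13', '2011-01-13', '2011-01-26']))
df_prices = pd.DataFrame(
    {'AAPL': [150, 250, 250, 100], 'IBM': [100, 200, 200, 150],
     'GOOG': [50, 500, 500, 100], 'XOM': [400, 100, 100, 300],
     'SPY': [100, 100, 100, 50], 'CASH': 1.0},
    index=orders_df.index)

# One order per row: put signed share counts into the ticker columns
# (positive on BUY, negative on SELL, as in the desired output) ...
sign = np.where(orders_df['Order'].to_numpy() == 'BUY', 1, -1)
shares = orders_df['Shares'].to_numpy() * sign
trades = np.zeros(df_prices.shape)
rows = np.arange(len(orders_df))
cols = df_prices.columns.get_indexer(orders_df['Symbol'])
trades[rows, cols] = shares
df_trades = pd.DataFrame(trades, index=df_prices.index, columns=df_prices.columns)
# ... and CASH is minus the traded value (cash goes down when you buy).
df_trades['CASH'] = -(df_trades.drop(columns='CASH')
                      * df_prices.drop(columns='CASH')).sum(axis=1)
```

With the example values this reproduces the desired table exactly, e.g. CASH of -15000 on 2011-01-10 and +50000 for the first 2011-01-13 row.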
UPDATE
What if I did:
df_prices = get_data(symbols, orders_df.index, addSPY=False)
df_prices.loc[:, 'CASH'] = 1.0
which prints out
AAPL IBM GOOG XOM SPY CASH
2011-01-10 340.99 143.41 614.21 72.02 123.19 1.0
2011-01-11 340.18 143.06 616.01 72.56 123.63 1.0
2011-01-12 342.95 144.82 616.87 73.41 124.74 1.0
2011-01-13 344.20 144.55 616.69 73.54 124.54 1.0
2011-01-14 346.99 145.70 624.18 74.62 125.44 1.0
2011-01-18 339.19 146.33 639.63 75.45 125.65 1.0
2011-01-19 337.39 151.22 631.75 75.00 124.42 1.0
How would I produce the df_trades then?
The example values above aren't valid anymore, FYI.
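For the updated case, where df_prices spans every trading day rather than just the order dates, the order rows can be mapped into the larger grid with get_indexer. A sketch with a cut-down, hypothetical price frame; np.add.at is used so that several orders on the same day accumulate instead of overwriting each other:

```python
import numpy as np
import pandas as pd

# Hypothetical cut-down frames: prices for every trading day, orders sparse.
dates = pd.to_datetime(['2011-01-10', '2011-01-11', '2011-01-12', '2011-01-13'])
df_prices = pd.DataFrame({'AAPL': [340.99, 340.18, 342.95, 344.20],
                          'IBM': [143.41, 143.06, 144.82, 144.55],
                          'CASH': 1.0}, index=dates)
orders_df = pd.DataFrame({'Symbol': ['AAPL', 'AAPL', 'IBM'],
                          'Order': ['BUY', 'SELL', 'BUY'],
                          'Shares': [100, 200, 100]},
                         index=pd.to_datetime(['2011-01-10', '2011-01-13',
                                               '2011-01-13']))

sign = np.where(orders_df['Order'].to_numpy() == 'BUY', 1, -1)
shares = orders_df['Shares'].to_numpy() * sign

# Map every order to its (date row, symbol column) in the big grid; np.add.at
# accumulates, so multiple orders on one date don't clobber one another.
rows = df_prices.index.get_indexer(orders_df.index)
cols = df_prices.columns.get_indexer(orders_df['Symbol'])
trades = np.zeros(df_prices.shape)
np.add.at(trades, (rows, cols), shares)
df_trades = pd.DataFrame(trades, index=df_prices.index, columns=df_prices.columns)
df_trades['CASH'] = -(df_trades.drop(columns='CASH')
                      * df_prices.drop(columns='CASH')).sum(axis=1)
```

Non-trading days simply stay at zero shares and zero cash flow, so df_trades has one row per price date, as in the updated df_prices.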
Vectorized Solution
j = np.array([df_trades.columns.get_loc(c) for c in orders_df.Symbol])
i = np.arange(len(df_trades))
o = np.where(orders_df.Order.values == 'BUY', -1, 1)
v = orders_df.Shares.values * o
t = df_trades.values
t[i, j] = v
df_trades.loc[:, 'CASH'] = \
    df_trades.drop(columns='CASH', errors='ignore').mul(df_prices).sum(axis=1)
df_trades
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 -100 0 0 0 0 -15000.0
2011-01-13 200 0 0 0 0 50000.0
2011-01-13 0 -100 0 0 0 -20000.0
2011-01-26 0 0 200 0 0 20000.0
Related
I was looking for similar topics, but I only found month-over-month change overall. I would like the month-over-month change in a value (e.g. UPL), but per user, as in the example below:
user_id  month                UPL
1        2022-01-01 00:00:00  100
1        2022-02-01 00:00:00  200
2        2022-01-01 00:00:00  100
2        2022-02-01 00:00:00   50
1        2022-03-01 00:00:00  150
And I would like an additional column named "UPL_change_by_month":
user_id  month                UPL  UPL_change_by_month
1        2022-01-01 00:00:00  100    0
1        2022-02-01 00:00:00  200  100
2        2022-01-01 00:00:00  100    0
2        2022-02-01 00:00:00   50  -50
1        2022-03-01 00:00:00  150  -50
Is it possible using aggfunc or shift function using Pandas?
IIUC, you can use groupby with diff:
df['UPL_change_by_month'] = df.sort_values('month').groupby('user_id')['UPL'].diff().fillna(0)
print(df)
# Output
user_id month UPL UPL_change_by_month
0 1 2022-01-01 100 0.0
1 1 2022-02-01 200 100.0
2 2 2022-01-01 100 0.0
3 2 2022-02-01 50 -50.0
4 1 2022-03-01 150 -50.0
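For reference, a self-contained version of the snippet above, built from the question's data:

```python
import pandas as pd

df = pd.DataFrame({'user_id': [1, 1, 2, 2, 1],
                   'month': pd.to_datetime(['2022-01-01', '2022-02-01',
                                            '2022-01-01', '2022-02-01',
                                            '2022-03-01']),
                   'UPL': [100, 200, 100, 50, 150]})
# Sort by month so diff() compares consecutive months within each user,
# then let pandas align the result back to the original row order by index.
df['UPL_change_by_month'] = (df.sort_values('month')
                               .groupby('user_id')['UPL']
                               .diff()
                               .fillna(0))
```

The sort matters: without it, diff() would compare rows in their stored order, and user 1's March row would be diffed against whatever happened to precede it.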
Given the following df,
Account contract_date type item_id quantity price tax net_amount
ABC123 2020-06-17 P 1409 1000 0.355 10 400
ABC123 2020-06-17 S 1409 2000 0.053 15 150
ABC123 2020-06-17 C 1409 500 0.25 5 180
ABC123 2020-06-17 S 1370 5000 0.17 30 900
DEF456 2020-06-18 P 7214 3000 0.1793 20 600
I would like to group df by Account, contract_date and item_id, and then split the values of the different types into separate columns. I can do this with a for loop/apply, but I'm looking for a groupby, pivot, or other vectorized/pythonic solution. Intended results are as follows:
Account contract_date item_id quantity_P quantity_S quantity_C price_P price_S price_C tax_P tax_S tax_C net_amount_P net_amount_S net_amount_C
ABC123 2020-06-17 1409 1000 2000 500 0.355 0.053 0.25 10 15 5 400 150 180
ABC123 2020-06-17 1370 0 5000 0 0 0.17 0 0 30 0 0 900 0
DEF456 2020-06-18 7214 3000 0 0 0.1793 0 0 20 0 0 600 0 0
*Although the alignment looks a bit off, you may copy the df and use df = pd.read_clipboard() to read the table. Appreciate your help. Thank you.
Edit: The error I am getting using df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])
Use df.pivot:
In [1660]: df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])
Out[1660]:
quantity price tax net_amount
type C P S C P S C P S C P S
Account contract_date item_id
ABC123 2020-06-17 1370 NaN NaN 5000.0 NaN NaN 0.170 NaN NaN 30.0 NaN NaN 900.0
1409 500.0 1000.0 2000.0 0.25 0.3550 0.053 5.0 10.0 15.0 180.0 400.0 150.0
DEF456 2020-06-18 7214 NaN 3000.0 NaN NaN 0.1793 NaN NaN 20.0 NaN NaN 600.0 NaN
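To match the intended layout exactly (flat column names like quantity_P, and 0 instead of NaN), the resulting column MultiIndex can be flattened afterwards. A sketch with the question's data, assuming pandas >= 1.1 for the list-valued index argument:

```python
import pandas as pd

# Data transcribed from the question's table.
df = pd.DataFrame({
    'Account': ['ABC123'] * 4 + ['DEF456'],
    'contract_date': ['2020-06-17'] * 4 + ['2020-06-18'],
    'type': ['P', 'S', 'C', 'S', 'P'],
    'item_id': [1409, 1409, 1409, 1370, 7214],
    'quantity': [1000, 2000, 500, 5000, 3000],
    'price': [0.355, 0.053, 0.25, 0.17, 0.1793],
    'tax': [10, 15, 5, 30, 20],
    'net_amount': [400, 150, 180, 900, 600]})

out = df.pivot(index=['Account', 'contract_date', 'item_id'],
               columns='type').fillna(0)
# Collapse the (value, type) column MultiIndex into flat names like quantity_P.
out.columns = [f'{value}_{typ}' for value, typ in out.columns]
out = out.reset_index()
```

fillna(0) reproduces the zeros in the intended output for type/item combinations that never traded.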
I have the following dataframe:
symbol PSAR
0 AAPL [nan,100,200]
1 PYPL [nan,300,400]
2 SPY [nan,500,600]
I am trying to turn the PSAR list values into rows like the following:
symbol PSAR
AAPL nan
AAPL 100
AAPL 200
PYPL nan
PYPL 300
... ...
SPY 600
I have been trying to solve it by following the answers in this post (one key difference being that that post has a list of lists), but I can't get there:
How to convert column with list of values into rows in Pandas DataFrame.
df['PSAR'].stack().reset_index(level=1, drop=True).to_frame('PSAR')
.join(df[['symbol']], how='left')
Not a slick one but this does the job:
list_of_lists = []
df_as_dict = dict(df.values)
for key, values in df_as_dict.items():
    list_of_lists += [[key, value] for value in values]
pd.DataFrame(list_of_lists)
returns:
0 1
0 AAPL NaN
1 AAPL 100.0
2 AAPL 200.0
3 PYPL NaN
4 PYPL 300.0
5 PYPL 400.0
6 SPY NaN
7 SPY 500.0
8 SPY 600.0
Pandas >= 0.25:
df1 = pd.DataFrame({'symbol':['AAPL', 'PYPL', 'SPY'],
'PSAR':[[None,100,200], [None,300,400], [None,500,600]]})
print(df1)
symbol PSAR
0 AAPL [None, 100, 200]
1 PYPL [None, 300, 400]
2 SPY [None, 500, 600]
df1.explode('PSAR')
symbol PSAR
0 AAPL None
0 AAPL 100
0 AAPL 200
1 PYPL None
1 PYPL 300
1 PYPL 400
2 SPY None
2 SPY 500
2 SPY 600
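If you want the fresh 0..n-1 index from the intended output rather than the repeated source-row labels, explode accepts ignore_index (pandas >= 1.1; on older versions, chain reset_index(drop=True) instead):

```python
import pandas as pd

df1 = pd.DataFrame({'symbol': ['AAPL', 'PYPL', 'SPY'],
                    'PSAR': [[None, 100, 200], [None, 300, 400], [None, 500, 600]]})
# ignore_index=True renumbers the result 0..n-1 instead of repeating
# each source row's label for every exploded element.
out = df1.explode('PSAR', ignore_index=True)
```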
I'm having an issue changing a pandas DataFrame index to a datetime from an integer. I want to do it so that I can call reindex and fill in the dates between those listed in the table. Note that I have to use pandas 0.7.3 at the moment because I'm also using qstk, and qstk relies on pandas 0.7.3
First, here's my layout:
(Pdb) df
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
(Pdb) type(df['date'])
<class 'pandas.core.series.Series'>
(Pdb) df2 = DataFrame(index=df['date'])
(Pdb) df2
Empty DataFrame
Columns: array([], dtype=object)
Index: array([2011-01-13 16:00:00, 2011-01-26 16:00:00, 2011-02-02 16:00:00,
2011-02-10 16:00:00, 2011-03-03 16:00:00, 2011-06-03 16:00:00,
2011-05-03 16:00:00, 2011-06-10 16:00:00, 2011-08-01 16:00:00,
2011-12-20 16:00:00], dtype=object)
(Pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
12 0 0 0 4000 2011-12-20 16:00:00
I have tried multiple things to get a datetime index:
1.) Using the reindex() method with a list of datetime values. This creates a datetime index, but then fills in NaNs for the data in the DataFrame. I'm guessing that this is because the original values are tied to the integer index, and reindexing to datetime tries to fill the new indices with default values (NaNs if no fill method is indicated). For example:
(Pdb) df.reindex(index=df['date'])
AAPL GOOG IBM XOM date
date
2011-01-13 16:00:00 NaN NaN NaN NaN NaN
2011-01-26 16:00:00 NaN NaN NaN NaN NaN
2011-02-02 16:00:00 NaN NaN NaN NaN NaN
2011-02-10 16:00:00 NaN NaN NaN NaN NaN
2011-03-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-03 16:00:00 NaN NaN NaN NaN NaN
2011-05-03 16:00:00 NaN NaN NaN NaN NaN
2011-06-10 16:00:00 NaN NaN NaN NaN NaN
2011-08-01 16:00:00 NaN NaN NaN NaN NaN
2011-12-20 16:00:00 NaN NaN NaN NaN NaN
2.) Using DataFrame.merge with my original df and a second dataframe, df2, that is basically just a datetime index with nothing else. So I end up doing something like:
(pdb) df2.merge(df,left_index=True,right_on='date')
AAPL GOOG IBM XOM date
1 0 0 4000 0 2011-01-13 16:00:00
2 0 1000 4000 0 2011-01-26 16:00:00
3 0 1000 4000 0 2011-02-02 16:00:00
4 0 1000 4000 4000 2011-02-10 16:00:00
6 0 0 1800 4000 2011-03-03 16:00:00
8 0 0 0 4000 2011-05-03 16:00:00
7 0 0 3300 4000 2011-06-03 16:00:00
9 1200 0 0 4000 2011-06-10 16:00:00
11 1200 0 0 4000 2011-08-01 16:00:00
(and vice-versa). But I always end up with this kind of thing, with integer indices.
3.) Starting with an empty DataFrame with a datetime index (created from the 'date' field of df) and a bunch of empty columns. Then I attempt to assign each column by setting the columns with the same
names to be equal to the columns from df:
(Pdb) df2['GOOG']=0
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 0
2011-01-26 16:00:00 0
2011-02-02 16:00:00 0
2011-02-10 16:00:00 0
2011-03-03 16:00:00 0
2011-06-03 16:00:00 0
2011-05-03 16:00:00 0
2011-06-10 16:00:00 0
2011-08-01 16:00:00 0
2011-12-20 16:00:00 0
(Pdb) df2['GOOG'] = df['GOOG']
(Pdb) df2
GOOG
date
2011-01-13 16:00:00 NaN
2011-01-26 16:00:00 NaN
2011-02-02 16:00:00 NaN
2011-02-10 16:00:00 NaN
2011-03-03 16:00:00 NaN
2011-06-03 16:00:00 NaN
2011-05-03 16:00:00 NaN
2011-06-10 16:00:00 NaN
2011-08-01 16:00:00 NaN
2011-12-20 16:00:00 NaN
So, how in pandas 0.7.3 do I get df to be re-created with a datetime index instead of the integer index? What am I missing?
I think you are looking for set_index:
In [11]: df.set_index('date')
Out[11]:
AAPL GOOG IBM XOM
date
2011-01-13 16:00:00 0 0 4000 0
2011-01-26 16:00:00 0 1000 4000 0
2011-02-02 16:00:00 0 1000 4000 0
2011-02-10 16:00:00 0 1000 4000 4000
2011-03-03 16:00:00 0 0 1800 4000
2011-06-03 16:00:00 0 0 3300 4000
2011-05-03 16:00:00 0 0 0 4000
2011-06-10 16:00:00 1200 0 0 4000
2011-08-01 16:00:00 1200 0 0 4000
2011-12-20 16:00:00 0 0 0 4000
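Once the dates are the index, the stated goal (filling in the dates in between) is a reindex over a full date_range. This sketch uses a modern pandas API (date_range did not exist under that name in 0.7.3) and a cut-down, made-up frame:

```python
import pandas as pd

# Made-up two-row frame standing in for the question's df.
df = pd.DataFrame({'GOOG': [0, 1000],
                   'date': pd.to_datetime(['2011-01-13', '2011-01-18'])})
df2 = df.set_index('date')
# Reindex onto every calendar day in the range, forward-filling the holdings.
filled = df2.reindex(pd.date_range(df2.index.min(), df2.index.max(), freq='D'),
                     method='ffill')
```

method='ffill' carries the last known position forward, so the in-between days repeat the preceding row instead of becoming NaN.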
I have two pandas dataframes one called orders and another one called daily_prices.
daily_prices is as follows:
AAPL GOOG IBM XOM
2011-01-10 339.44 614.21 142.78 71.57
2011-01-13 342.64 616.69 143.92 73.08
2011-01-26 340.82 616.50 155.74 75.89
2011-02-02 341.29 612.00 157.93 79.46
2011-02-10 351.42 616.44 159.32 79.68
2011-03-03 356.40 609.56 158.73 82.19
2011-05-03 345.14 533.89 167.84 82.00
2011-06-03 340.42 523.08 160.97 78.19
2011-06-10 323.03 509.51 159.14 76.84
2011-08-01 393.26 606.77 176.28 76.67
2011-12-20 392.46 630.37 184.14 79.97
orders is as follows:
direction size ticker prices
2011-01-10 Buy 1500 AAPL 339.44
2011-01-13 Sell 1500 AAPL 342.64
2011-01-13 Buy 4000 IBM 143.92
2011-01-26 Buy 1000 GOOG 616.50
2011-02-02 Sell 4000 XOM 79.46
2011-02-10 Buy 4000 XOM 79.68
2011-03-03 Sell 1000 GOOG 609.56
2011-03-03 Sell 2200 IBM 158.73
2011-06-03 Sell 3300 IBM 160.97
2011-05-03 Buy 1500 IBM 167.84
2011-06-10 Buy 1200 AAPL 323.03
2011-08-01 Buy 55 GOOG 606.77
2011-08-01 Sell 55 GOOG 606.77
2011-12-20 Sell 1200 AAPL 392.46
Index of both dataframes is datetime.date.
The prices column in the orders dataframe was added by using a list comprehension to loop through all the orders, look up the specific ticker for the specific date in the daily_prices data frame, and then add that list as a column to the orders dataframe. I would like to do this using an array operation rather than something that loops. Can it be done? I tried to use:
daily_prices.ix[dates, tickers]
but this returns a matrix of the Cartesian product of the two lists. I want it to return a column vector of only the price of a specified ticker for a specified date.
Use our friend lookup, designed precisely for this purpose:
In [17]: prices
Out[17]:
AAPL GOOG IBM XOM
2011-01-10 339.44 614.21 142.78 71.57
2011-01-13 342.64 616.69 143.92 73.08
2011-01-26 340.82 616.50 155.74 75.89
2011-02-02 341.29 612.00 157.93 79.46
2011-02-10 351.42 616.44 159.32 79.68
2011-03-03 356.40 609.56 158.73 82.19
2011-05-03 345.14 533.89 167.84 82.00
2011-06-03 340.42 523.08 160.97 78.19
2011-06-10 323.03 509.51 159.14 76.84
2011-08-01 393.26 606.77 176.28 76.67
2011-12-20 392.46 630.37 184.14 79.97
In [18]: orders
Out[18]:
Date direction size ticker prices
0 2011-01-10 00:00:00 Buy 1500 AAPL 339.44
1 2011-01-13 00:00:00 Sell 1500 AAPL 342.64
2 2011-01-13 00:00:00 Buy 4000 IBM 143.92
3 2011-01-26 00:00:00 Buy 1000 GOOG 616.50
4 2011-02-02 00:00:00 Sell 4000 XOM 79.46
5 2011-02-10 00:00:00 Buy 4000 XOM 79.68
6 2011-03-03 00:00:00 Sell 1000 GOOG 609.56
7 2011-03-03 00:00:00 Sell 2200 IBM 158.73
8 2011-06-03 00:00:00 Sell 3300 IBM 160.97
9 2011-05-03 00:00:00 Buy 1500 IBM 167.84
10 2011-06-10 00:00:00 Buy 1200 AAPL 323.03
11 2011-08-01 00:00:00 Buy 55 GOOG 606.77
12 2011-08-01 00:00:00 Sell 55 GOOG 606.77
13 2011-12-20 00:00:00 Sell 1200 AAPL 392.46
In [19]: prices.lookup(orders.Date, orders.ticker)
Out[19]:
array([ 339.44, 342.64, 143.92, 616.5 , 79.46, 79.68, 609.56,
158.73, 160.97, 167.84, 323.03, 606.77, 606.77, 392.46])
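A caveat for current readers: DataFrame.lookup was deprecated in pandas 1.2 and removed in 2.0. The same row/column fancy indexing can be done directly on the underlying array; a sketch with a trimmed-down stand-in for the price frame above:

```python
import pandas as pd

# Trimmed-down stand-ins for the prices/orders data above.
prices = pd.DataFrame({'AAPL': [339.44, 342.64], 'IBM': [142.78, 143.92]},
                      index=pd.to_datetime(['2011-01-10', '2011-01-13']))
dates = pd.to_datetime(['2011-01-10', '2011-01-13', '2011-01-13'])
tickers = ['AAPL', 'AAPL', 'IBM']

# Translate labels to positions, then fancy-index the NumPy array once.
rows = prices.index.get_indexer(dates)
cols = prices.columns.get_indexer(tickers)
vals = prices.to_numpy()[rows, cols]
```

Like lookup, this returns one value per (date, ticker) pair rather than the Cartesian product.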