Convert pandas column with single list of values into rows - python

I have the following dataframe:
symbol PSAR
0 AAPL [nan,100,200]
1 PYPL [nan,300,400]
2 SPY [nan,500,600]
I am trying to turn the PSAR list values into rows like the following:
symbol PSAR
AAPL nan
AAPL 100
AAPL 200
PYPL nan
PYPL 300
... ...
SPY 600
I have been trying to solve it by following the answers in this post (one key difference being that that post has a list of lists), but I can't get there:
How to convert column with list of values into rows in Pandas DataFrame.
df['PSAR'].stack().reset_index(level=1, drop=True).to_frame('PSAR')
.join(df[['symbol']], how='left')

Not a slick one, but this does the job:
list_of_lists = []
df_as_dict = dict(df.values)
for key, values in df_as_dict.items():
    list_of_lists += [[key, value] for value in values]
pd.DataFrame(list_of_lists)
returns:
0 1
0 AAPL NaN
1 AAPL 100.0
2 AAPL 200.0
3 PYPL NaN
4 PYPL 300.0
5 PYPL 400.0
6 SPY NaN
7 SPY 500.0
8 SPY 600.0

Pandas >= 0.25:
df1 = pd.DataFrame({'symbol': ['AAPL', 'PYPL', 'SPY'],
                    'PSAR': [[None, 100, 200], [None, 300, 400], [None, 500, 600]]})
print(df1)
symbol PSAR
0 AAPL [None, 100, 200]
1 PYPL [None, 300, 400]
2 SPY [None, 500, 600]
df1.explode('PSAR')
symbol PSAR
0 AAPL None
0 AAPL 100
0 AAPL 200
1 PYPL None
1 PYPL 300
1 PYPL 400
2 SPY None
2 SPY 500
2 SPY 600
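Note that explode keeps the index of the originating row (0, 0, 0, 1, ...) and leaves the exploded values as object dtype. If a flat 0..n index and a numeric column are wanted, as in the desired output, one possible follow-up (a sketch on the same toy frame) is to chain reset_index and astype:

```python
import pandas as pd

df1 = pd.DataFrame({'symbol': ['AAPL', 'PYPL', 'SPY'],
                    'PSAR': [[None, 100, 200], [None, 300, 400], [None, 500, 600]]})

out = (df1.explode('PSAR')
          .reset_index(drop=True)          # drop the repeated 0,0,0,1,... index
          .astype({'PSAR': 'float64'}))    # object -> float; None becomes NaN
print(out)
```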

Related

Reshape Pandas DataFrame with TimeSeries in rows instead of columns

I have a DataFrame df that contains price data (Open, Close, High, Low) for every day in the time from January 2010 to December 2021:
Name       ISIN          Data         02.01.2010  05.01.2010  06.01.2010  ...  31.12.2021
Apple      US9835635986  Price Open        12.45       13.45       12.48  ...       54.12
Apple      US9835635986  Price Close       12.58       15.35       12.38  ...       54.43
Apple      US9835635986  Price High        12.78       15.85       12.83  ...       54.91
Apple      US9835635986  Price Low         12.18       13.35       12.21  ...       53.98
Microsoft  US1223928384  Price Open        12.45       13.45       12.48  ...       43.56
...        ...           ...                 ...         ...         ...  ...         ...
I am trying to reshape the table into the format below:
Date        Name       ISIN          Price Open  Price Close  Price High  Price Low
02.01.2010  Apple      US9835635986       12.45        12.58       12.78      12.18
05.01.2010  Apple      US9835635986       13.45        15.35       15.85      13.35
...         ...        ...                  ...          ...         ...        ...
02.01.2010  Microsoft  US1223928384       12.45        13.67       13.74      12.35
Simply transposing the DataFrame did not work. I also tried pivot, which raised an error saying the operands could not be broadcast together with different shapes.
dates = ['NAME','ISIN']
dates.append(df.columns.tolist()[3:]) # appends all columns names starting with 02.01.2010
df.pivot(index = dates, columns = 'Data', Values = 'Data')
How can I get this DataFrame in the desired format?
Use DataFrame.melt before pivoting, converting the dates to datetimes, and finally sort the MultiIndex:
df = (df.melt(['Name','ISIN','Data'], var_name='Date')
        .assign(Date=lambda x: pd.to_datetime(x['Date'], format='%d.%m.%Y'))
        .pivot(index=['Date','Name','ISIN'], columns='Data', values='value')
        .sort_index(level=[1,2,0])
        .reset_index()
      )
print (df)
Data Date Name ISIN Price Close Price High Price Low \
0 2010-01-02 Apple US9835635986 12.58 12.78 12.18
1 2010-01-05 Apple US9835635986 15.35 15.85 13.35
2 2010-01-06 Apple US9835635986 12.38 12.83 12.21
3 2021-12-31 Apple US9835635986 54.43 54.91 53.98
4 2010-01-02 Microsoft US1223928384 NaN NaN NaN
5 2010-01-05 Microsoft US1223928384 NaN NaN NaN
6 2010-01-06 Microsoft US1223928384 NaN NaN NaN
7 2021-12-31 Microsoft US1223928384 NaN NaN NaN
Data Price Open
0 12.45
1 13.45
2 12.48
3 54.12
4 12.45
5 13.45
6 12.48
7 43.56
Another idea is to first convert the column names to datetimes and then reshape by DataFrame.stack and Series.unstack:
L = df.columns.tolist()
df = (df.set_axis(L[:3] + pd.to_datetime(L[3:], format='%d.%m.%Y').tolist(), axis=1)
        .rename_axis('Date', axis=1)
        .set_index(L[:3])
        .stack()
        .unstack(2)
        .reorder_levels([2,0,1])
        .reset_index())
print (df)
Data Date Name ISIN Price Close Price High Price Low \
0 2010-01-02 Apple US9835635986 12.58 12.78 12.18
1 2010-01-05 Apple US9835635986 15.35 15.85 13.35
2 2010-01-06 Apple US9835635986 12.38 12.83 12.21
3 2021-12-31 Apple US9835635986 54.43 54.91 53.98
4 2010-01-02 Microsoft US1223928384 NaN NaN NaN
5 2010-01-05 Microsoft US1223928384 NaN NaN NaN
6 2010-01-06 Microsoft US1223928384 NaN NaN NaN
7 2021-12-31 Microsoft US1223928384 NaN NaN NaN
Data Price Open
0 12.45
1 13.45
2 12.48
3 54.12
4 12.45
5 13.45
6 12.48
7 43.56

How to create a new column with the last value of the previous year

I have this data frame
import pandas as pd
df = pd.DataFrame({'COTA': ['A','A','A','A','A','B','B','B','B'],
                   'Date': ['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020',
                            '20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
                   'Mark': [1,2,3,4,5,1,2,3,3]
                   })
print(df)
Based on this data frame, I want the Mark from the previous year. I managed to acquire the maximum with .max(), but I want the last one; I thought I could get it with .last(), but it didn't work.
Here is an example of my code:
df['Date'] = pd.to_datetime(df['Date'])
df['LastYear'] = df['Date'] - pd.offsets.YearEnd(0)

s1 = df.groupby(['COTA', 'LastYear'])['Mark'].max()
s2 = s1.rename(index=lambda x: x + pd.offsets.DateOffset(years=1), level=1)
df = df.join(s2.rename('Max_MarkLastYear'), on=['COTA', 'LastYear'])
print (df)
  COTA       Date  Mark   LastYear  Max_MarkLastYear
0    A 2021-10-14     1 2021-12-31               5.0
1    A 2020-10-19     2 2020-12-31               3.0
2    A 2019-10-29     3 2019-12-31               NaN
3    A 2021-09-30     4 2021-12-31               5.0
4    A 2020-09-20     5 2020-12-31               3.0
5    B 2021-10-20     1 2021-12-31               3.0
6    B 2020-10-29     2 2020-12-31               3.0
7    B 2019-10-15     3 2019-12-31               NaN
8    B 2020-10-09     3 2020-12-31               3.0
How do I create a new column with the last value of the previous year?
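One possible way to get the last Mark instead of the maximum is a sketch along these lines: sort by date so that groupby(...).last() picks the chronologically last value per (COTA, year), then shift the year index forward by one before joining back. The column names Year and Last_MarkLastYear are made up for the example, and dayfirst=True is assumed for the dd/mm/yyyy dates:

```python
import pandas as pd

df = pd.DataFrame({'COTA': ['A','A','A','A','A','B','B','B','B'],
                   'Date': ['14/10/2021','19/10/2020','29/10/2019','30/09/2021','20/09/2020',
                            '20/10/2021','29/10/2020','15/10/2019','10/09/2020'],
                   'Mark': [1,2,3,4,5,1,2,3,3]})
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
df['Year'] = df['Date'].dt.year

# chronologically last Mark per (COTA, year)
s1 = df.sort_values('Date').groupby(['COTA', 'Year'])['Mark'].last()
# shift the year level forward so each row looks up the *previous* year
s2 = s1.rename(index=lambda y: y + 1, level=1)
df = df.join(s2.rename('Last_MarkLastYear'), on=['COTA', 'Year'])
print(df)
```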

Pandas Turning multiple rows with different types into 1 row with multiple columns for each type

Given the following df,
Account contract_date type item_id quantity price tax net_amount
ABC123 2020-06-17 P 1409 1000 0.355 10 400
ABC123 2020-06-17 S 1409 2000 0.053 15 150
ABC123 2020-06-17 C 1409 500 0.25 5 180
ABC123 2020-06-17 S 1370 5000 0.17 30 900
DEF456 2020-06-18 P 7214 3000 0.1793 20 600
I would like to group df by Account, contract_date and item_id, then split the values of the different types into separate columns. I can do this with a for loop/apply, but I am looking for a groupby/pivot or other vectorized/pythonic solution. Intended results are as follows:
Account contract_date item_id quantity_P quantity_S quantity_C price_P price_S price_C tax_P tax_S tax_C net_amount_P net_amount_S net_amount_C
ABC123 2020-06-17 1409 1000 2000 500 0.355 0.053 0.25 10 15 5 400 150 180
ABC123 2020-06-17 1370 0 5000 0 0 0.17 0 0 30 0 0 900 0
DEF456 2020-06-18 7214 3000 0 0 0.1793 0 0 20 0 0 600 0 0
*Although it looks a bit off for the alignment, you may copy the df and use df = pd.read_clipboard() to read the table. Appreciate your help. Thank you.
Edit: the error I am getting comes from df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])
Use df.pivot:
In [1660]: df.pivot(index=['Account', 'contract_date', 'item_id'], columns=['type'])
Out[1660]:
quantity price tax net_amount
type C P S C P S C P S C P S
Account contract_date item_id
ABC123 2020-06-17 1370 NaN NaN 5000.0 NaN NaN 0.170 NaN NaN 30.0 NaN NaN 900.0
1409 500.0 1000.0 2000.0 0.25 0.3550 0.053 5.0 10.0 15.0 180.0 400.0 150.0
DEF456 2020-06-18 7214 NaN 3000.0 NaN NaN 0.1793 NaN NaN 20.0 NaN NaN 600.0 NaN
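The pivot above leaves a two-level column index and NaN where a type is missing. To get the flat quantity_P / price_S style names and zeros of the intended output, one possible follow-up is sketched below on a reconstruction of the sample data (note that pivot with a list-like index needs pandas >= 1.1):

```python
import pandas as pd

df = pd.DataFrame({
    'Account': ['ABC123', 'ABC123', 'ABC123', 'ABC123', 'DEF456'],
    'contract_date': ['2020-06-17'] * 4 + ['2020-06-18'],
    'type': ['P', 'S', 'C', 'S', 'P'],
    'item_id': [1409, 1409, 1409, 1370, 7214],
    'quantity': [1000, 2000, 500, 5000, 3000],
    'price': [0.355, 0.053, 0.25, 0.17, 0.1793],
    'tax': [10, 15, 5, 30, 20],
    'net_amount': [400, 150, 180, 900, 600],
})

out = df.pivot(index=['Account', 'contract_date', 'item_id'], columns='type')
out.columns = [f'{value}_{typ}' for value, typ in out.columns]  # ('quantity','P') -> 'quantity_P'
out = out.fillna(0).reset_index()
print(out)
```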

Leading and Trailing Padding Dates in Pandas DataFrame

This is my dataframe:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
# date field a datetime.datetime values
account_id amount
date
2018-01-01 1 100.0
2018-01-01 1 50.0
2018-06-01 1 200.0
2018-07-01 2 100.0
2018-10-01 2 200.0
Problem description
How can I "pad" my dataframe with leading and trailing "empty dates"? I have tried to reindex on a date_range and a period_range, and I have tried to merge another index. I have tried all sorts of things all day, and I have read a lot of the docs.
I have a simple dataframe with columns transaction_date, transaction_amount, and transaction_account. I want to group this dataframe so that it is grouped by account at the first level, and then by year, and then by month. Then I want a column for each month, with the sum of that month's transaction amount value.
This seems like it should be something that is easy to do.
Expected Output
This is the closest I have gotten:
df = pd.DataFrame.from_records(data=data, coerce_float=False, index=['date'])
df = df.groupby(['account_id', df.index.year, df.index.month])
df = df.resample('M').sum().fillna(0)
print(df)
account_id amount
account_id date date date
1 2018 1 2018-01-31 2 150.0
6 2018-06-30 1 200.0
2 2018 7 2018-07-31 2 100.0
10 2018-10-31 2 200.0
And this is what I want to achieve (basically reindexing the data by date_range(start='2018-01-01', periods=12, freq='M')).
(Ideally I would want the months transposed by year across the top as columns.)
amount
account_id Year Month
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
....
12 200.0
2 2018 1 NaN
....
7 100.0
....
10 200.0
....
12 NaN
One way is to reindex:
s = df.groupby([df['account_id'], df.index.year, df.index.month]).sum()
idx = pd.MultiIndex.from_product([s.index.levels[0], s.index.levels[1], list(range(1, 13))])
s = s.reindex(idx)
s
Out[287]:
amount
1 2018 1 150.0
2 NaN
3 NaN
4 NaN
5 NaN
6 200.0
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
2 2018 1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 100.0
8 NaN
9 NaN
10 200.0
11 NaN
12 NaN
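To also get the months transposed across the top as columns (the "ideally" part of the question), the reindexed result can be unstacked on the month level. A self-contained sketch on a reconstruction of the sample data, with made-up level names year/month:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'account_id': [1, 1, 1, 2, 2],
                   'amount': [100.0, 50.0, 200.0, 100.0, 200.0]},
                  index=pd.to_datetime(['2018-01-01', '2018-01-01', '2018-06-01',
                                        '2018-07-01', '2018-10-01']))

s = df.groupby([df['account_id'], df.index.year, df.index.month])['amount'].sum()
idx = pd.MultiIndex.from_product([s.index.levels[0], s.index.levels[1], range(1, 13)],
                                 names=['account_id', 'year', 'month'])
s = s.reindex(idx)

wide = s.unstack('month')  # one column per month, NaN where no transactions
print(wide)
```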

Vectorized Operations on two Pandas DataFrame to create a new DataFrame

I have orders.csv as a dataframe called orders_df:
Symbol Order Shares
Date
2011-01-10 AAPL BUY 100
2011-01-13 AAPL SELL 200
2011-01-13 IBM BUY 100
2011-01-26 GOOG SELL 200
I end up sorting the data frame with orders_df = orders_df.sort_index().
Then I create a symbols array like so:
symbols = np.append(orders_df.loc[:, 'Symbol'].unique(), 'SPY')
Here comes my second DataFrame df_prices.
df_prices = get_data(symbols, orders_df.index, addSPY=False)
df_prices.loc[:, 'CASH'] = 1.0
which prints out:
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 150 100 50 400 100 1.0
2011-01-13 250 200 500 100 100 1.0
2011-01-13 250 200 500 100 100 1.0
2011-01-26 100 150 100 300 50 1.0
Now, I initialize a third data frame:
df_trades = pd.DataFrame(0, df_prices.index, columns=list(df_prices))
I need to fill this data frame with the correct values using the two previous data frames. If I BUY AAPL, I want to multiply the Shares from orders_df by the price of AAPL times -1; if it were SELL, I wouldn't multiply by -1. I put that value in the CASH column. For the other columns, I simply copy over the Shares of each stock on the days they traded.
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 100 0 0 0 0 -15000
2011-01-13 -200 0 0 0 0 50000
2011-01-13 0 100 0 0 0 -20000
2011-01-26 0 0 -200 0 0 20000
How do I achieve df_trades using vectorized operations?
UPDATE
What if I did:
df_prices = get_data(symbols, orders_df.index, addSPY=False)
df_prices.loc[:, 'CASH'] = 1.0
which prints out
AAPL IBM GOOG XOM SPY CASH
2011-01-10 340.99 143.41 614.21 72.02 123.19 1.0
2011-01-11 340.18 143.06 616.01 72.56 123.63 1.0
2011-01-12 342.95 144.82 616.87 73.41 124.74 1.0
2011-01-13 344.20 144.55 616.69 73.54 124.54 1.0
2011-01-14 346.99 145.70 624.18 74.62 125.44 1.0
2011-01-18 339.19 146.33 639.63 75.45 125.65 1.0
2011-01-19 337.39 151.22 631.75 75.00 124.42 1.0
How would I produce the df_trades then?
FYI, the example values above are no longer valid.
Vectorized Solution
j = np.array([df_trades.columns.get_loc(c) for c in orders_df.Symbol])
i = np.arange(len(df_trades))
o = np.where(orders_df.Order.values == 'BUY', -1, 1)
v = orders_df.Shares.values * o
t = df_trades.values
t[i, j] = v

df_trades.loc[:, 'CASH'] = \
    df_trades.drop('CASH', axis=1, errors='ignore').mul(df_prices).sum(axis=1)

df_trades
AAPL IBM GOOG XOM SPY CASH
Date
2011-01-10 -100 0 0 0 0 -15000.0
2011-01-13 200 0 0 0 0 50000.0
2011-01-13 0 -100 0 0 0 -30000.0
2011-01-26 0 0 200 0 0 20000.0
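Note that the output above flips the sign of the share counts relative to the question's expected output (BUY shown as negative). A minimal self-contained sketch of the same indexing idea, following the question's convention instead (BUY = positive shares, cash outflow negative), with the sample data reconstructed:

```python
import numpy as np
import pandas as pd

dates = pd.to_datetime(['2011-01-10', '2011-01-13', '2011-01-13', '2011-01-26'])
orders_df = pd.DataFrame({'Symbol': ['AAPL', 'AAPL', 'IBM', 'GOOG'],
                          'Order': ['BUY', 'SELL', 'BUY', 'SELL'],
                          'Shares': [100, 200, 100, 200]}, index=dates)
df_prices = pd.DataFrame({'AAPL': [150, 250, 250, 100], 'IBM': [100, 200, 200, 150],
                          'GOOG': [50, 500, 500, 100], 'XOM': [400, 100, 100, 300],
                          'SPY': [100, 100, 100, 50], 'CASH': [1.0] * 4}, index=dates)

# one order per row, so row k of orders_df maps to row k of df_trades
t = np.zeros((len(orders_df), len(df_prices.columns)))
i = np.arange(len(orders_df))
j = np.array([df_prices.columns.get_loc(c) for c in orders_df['Symbol']])
sign = np.where(orders_df['Order'].values == 'BUY', 1, -1)  # BUY = positive share count
t[i, j] = orders_df['Shares'].values * sign

df_trades = pd.DataFrame(t, index=dates, columns=df_prices.columns)
# cash moves opposite to the traded value: buying costs cash, selling raises it
df_trades['CASH'] = -(df_trades.drop(columns='CASH') * df_prices.drop(columns='CASH')).sum(axis=1)
print(df_trades)
```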
