How to find cumulative product for data frame? - python

I have a triangular DataFrame, i.e. DF1:
      12-24  24-36  36-48
2017  1.554  3.532  8.657
2018  2.978  1.114    NaN
2019  4.366    NaN    NaN
I need to find the cumprod for this DataFrame.
I tried this code:
df2= df1.iloc[:, ::-1].cumprod(axis=1).iloc[:, ::-1]
But the result is the same as df1.
The result should look like:
      12-24  24-36  36-48
2017  8.898  4.646  8.657
2018  7.344  1.114    NaN
2019  4.366    NaN    NaN
Thank you for your time :)

Try this instead (note that your expected output is a reverse cumulative sum down the rows, so cumsum rather than cumprod is what produces it):
>>> df1.iloc[::-1].cumsum().iloc[::-1]
      12-24  24-36  36-48
2017  8.898  4.646  8.657
2018  7.344  1.114    NaN
2019  4.366    NaN    NaN
>>>
You don't need axis=1 or the extra colons.
Doing df1.iloc[:, ::-1] would reverse the columns instead of the rows.
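For reference, a minimal self-contained sketch that rebuilds the triangular frame from the question and applies the row-reversed cumulative sum:
import numpy as np
import pandas as pd

# Rebuild the triangular frame from the question
df1 = pd.DataFrame(
    {"12-24": [1.554, 2.978, 4.366],
     "24-36": [3.532, 1.114, np.nan],
     "36-48": [8.657, np.nan, np.nan]},
    index=[2017, 2018, 2019],
)

# Reverse the rows, accumulate down each column, then restore the original order
df2 = df1.iloc[::-1].cumsum().iloc[::-1]
print(df2)
# The 12-24 column becomes 8.898, 7.344, 4.366, matching the expected output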

Related

MultiIndex column-wise from existing pandas dataframe columns

I am trying to reindex my pandas DataFrame to a column-wise MultiIndex. Most answers I've explored seem to cover only the row-wise case. My current df looks like this:
  ticker calendardate     eps    price      ps  revenue
0   ABNB   2019-12-31   -2.59      NaN     NaN     4.80
1   ABNB   2020-12-31  -16.12   146.80  25.962     3.37
2   AMZN   2019-12-31   23.46  1847.84   3.266     2.80
3   AMZN   2020-12-31   42.64  3256.93   4.233     3.86
I want a MultiIndex based upon calendardate so that my output looks like this:
  ticker     eps            price               ps           revenue
            2019    2020     2019     2020     2019    2020    2019  2020
0   ABNB   -2.59  -16.12      NaN   146.80      NaN  25.962    4.80  3.37
1   AMZN   23.46   42.64  1847.84  3256.93    3.266   4.233    2.80  3.86
Any help would be appreciated. Thanks
We can use str.split to split the column calendardate around the delimiter, then use str[0] to select the year portion of the split column. Now set the index of the DataFrame to the column ticker along with the extracted year, followed by unstack to reshape.
y = df['calendardate'].str.split('-', n=1).str[0]
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
If the dtype of column calendardate is datetime then we can instead use:
y = df['calendardate'].dt.year
df.drop(columns='calendardate').set_index(['ticker', y]).unstack()
                 eps            price               ps           revenue
calendardate    2019    2020     2019     2020     2019    2020      2019  2020
ticker
ABNB           -2.59  -16.12      NaN   146.80      NaN  25.962       4.8  3.37
AMZN           23.46   42.64  1847.84  3256.93    3.266   4.233       2.8  3.86
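As a runnable sketch, here are the same steps applied to the sample frame rebuilt from the question (None stands in for the NaN entries shown above):
import pandas as pd

df = pd.DataFrame({
    "ticker": ["ABNB", "ABNB", "AMZN", "AMZN"],
    "calendardate": ["2019-12-31", "2020-12-31", "2019-12-31", "2020-12-31"],
    "eps": [-2.59, -16.12, 23.46, 42.64],
    "price": [None, 146.80, 1847.84, 3256.93],
    "ps": [None, 25.962, 3.266, 4.233],
    "revenue": [4.80, 3.37, 2.80, 3.86],
})

# Extract the year part of the string date
y = df["calendardate"].str.split("-", n=1).str[0]

# Index by ticker and year, then pivot the year level into the columns
out = df.drop(columns="calendardate").set_index(["ticker", y]).unstack()
print(out)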

get element in column that preceded new column

I'm trying to collect the items from column 'data' that immediately preceded the values I collected in column 'min', and put them in a new column. See below.
Here is the data (importing with pd.read_csv):
time,data
12/15/18 01:10 AM,130352.146180556
12/16/18 01:45 AM,130355.219097222
12/17/18 01:47 AM,130358.223263889
12/18/18 02:15 AM,130361.281701389
12/19/18 03:15 AM,130364.406597222
12/20/18 03:25 AM,130352.427430556
12/21/18 03:27 AM,130355.431597222
12/22/18 05:18 AM,130358.663541667
12/23/18 06:44 AM,130361.842430556
12/24/18 07:19 AM,130364.915243056
12/25/18 07:33 AM,130352.944409722
12/26/18 07:50 AM,130355.979826389
12/27/18 09:13 AM,130359.153472222
12/28/18 11:53 AM,130362.4871875
12/29/18 01:23 PM,130365.673263889
12/30/18 02:17 PM,130353.785763889
12/31/18 02:23 PM,130356.798263889
01/01/19 04:41 PM,130360.085763889
01/02/19 05:01 PM,130363.128125
and my code:
import pandas as pd
import numpy as np
from scipy import signal
from scipy.signal import argrelextrema
import datetime
diff=pd.DataFrame()
df=pd.read_csv('saw_data2.csv')
df['time']=pd.to_datetime(df['time'])
print(df.head())
n=2 # number of points to be checked before and after
# Find local peaks
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
If you plot the data, you'll see it is similar to a sawtooth. For each value captured in 'min', the element that comes just before it in 'data' is what I want to put in a new column df['new_col'].
I've tried many things like,
df['new_col']=df.index.get_loc(df['min'].df['data'])
and,
df['new_col']=df['min'].shift() #obviously wrong
IIUC, you can do the shift before selecting the rows with a value in min:
df['new_col'] = df.shift().loc[df['min'].notna(), 'data']
print (df)
time data min new_col
0 12/15/18 01:10 AM 130352.146181 130352.146181 NaN
1 12/16/18 01:45 AM 130355.219097 NaN NaN
2 12/17/18 01:47 AM 130358.223264 NaN NaN
3 12/18/18 02:15 AM 130361.281701 NaN NaN
4 12/19/18 03:15 AM 130364.406597 NaN NaN
5 12/20/18 03:25 AM 130352.427431 130352.427431 130364.406597
6 12/21/18 03:27 AM 130355.431597 NaN NaN
7 12/22/18 05:18 AM 130358.663542 NaN NaN
8 12/23/18 06:44 AM 130361.842431 NaN NaN
9 12/24/18 07:19 AM 130364.915243 NaN NaN
10 12/25/18 07:33 AM 130352.944410 130352.944410 130364.915243
11 12/26/18 07:50 AM 130355.979826 NaN NaN
12 12/27/18 09:13 AM 130359.153472 NaN NaN
13 12/28/18 11:53 AM 130362.487187 NaN NaN
14 12/29/18 01:23 PM 130365.673264 NaN NaN
15 12/30/18 02:17 PM 130353.785764 130353.785764 130365.673264
16 12/31/18 02:23 PM 130356.798264 NaN NaN
17 01/01/19 04:41 PM 130360.085764 NaN NaN
18 01/02/19 05:01 PM 130363.128125 NaN NaN
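To see why shifting first matters, here is a tiny toy sketch (the sawtooth values and the where() line are made up for illustration; where() only stands in for the argrelextrema call above). Shifting the whole data column and then selecting the rows where min is set picks up the value that preceded each minimum, whereas shifting min itself only moves the minima around.
import pandas as pd

# Toy sawtooth: values rise, then drop; the drop points play the role of the minima
df = pd.DataFrame({"data": [10.0, 12.0, 14.0, 9.0, 11.0, 13.0, 8.0]})
df["min"] = df["data"].where(df["data"] < df["data"].shift())  # stand-in for argrelextrema

# Shift first, then select: each minimum row picks up the value just before it
df["new_col"] = df.shift().loc[df["min"].notna(), "data"]
print(df)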

How to remove all duplicate occurrences or get unique values in a pandas dataframe?

I have a pandas DataFrame with multiple occurrences of particular values. I want to either remove all the values that are duplicates or replace them with NaN, and finally get the name of the column that contains the unique values. Pandas' drop_duplicates function only removes rows that have duplicate values, but I want to remove the individual values/cells in the DataFrame. Is there a solution for this?
Based on the input DataFrame below, every value except the one in the first row of column "02" occurs more than once in the DataFrame, so column "02" is what I want. If the question is not clear, please let me know. Thanks.
DF:
     02  03:10  03:02  03:02:09
0  6716  45355  45355     45355
1  4047   4047   7411      7411
2   945   2478   2478       945
Expected output:
col_with_unique_val = "02"
or
Expected output DF:
02 03:10 03:02 03:02:09
0 6716 NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
or
Expected output DF:
02
0 6716
Here is one way:
df.mask(df.apply(pd.Series.duplicated, keep=False, axis=1))
02 03:10 03:02 03:02:09
0 6716.0 NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
df.mask(df.apply(pd.Series.duplicated, keep=False, axis=1)).stack().index.get_level_values(1)
Index(['02'], dtype='object')
Another way: stack, then check duplicated, and use where to set all non-unique values to NaN:
df1 = df.stack()
uniques = df1[~df1.duplicated(keep=False)].tolist()
df.where(df.isin(uniques))
# 02 03:10 03:02 03:02:09
#0 6716.0 NaN NaN NaN
#1 NaN NaN NaN NaN
#2 NaN NaN NaN NaN
df.isin(uniques).any().loc[lambda x: x].index
#Index(['02'], dtype='object')
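As a self-contained sketch, here is the second approach run end to end on the sample frame rebuilt from the question:
import pandas as pd

df = pd.DataFrame({
    "02":       [6716, 4047, 945],
    "03:10":    [45355, 4047, 2478],
    "03:02":    [45355, 7411, 2478],
    "03:02:09": [45355, 7411, 945],
})

# Flatten to one long Series and keep only the values that appear exactly once
flat = df.stack()
uniques = flat[~flat.duplicated(keep=False)].tolist()

masked = df.where(df.isin(uniques))           # non-unique cells become NaN
cols = df.isin(uniques).any().loc[lambda x: x].index
print(masked)
print(list(cols))                             # ['02']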

Dataframe shift moving data into random columns?

I'm using code to shift time series data that looks somewhat similar to this:
Year Player PTSN AVGN
2018 Aaron Donald 280.60 17.538
2018 J.J. Watt 259.80 16.238
2018 Danielle Hunter 237.60 14.850
2017 Aaron Donald 181.0 12.929
2016 Danielle Hunter 204.6 12.788
with the intent of getting it into something like this:
                       AVGN   PTSN  AVGN_prev  PTSN_prev
Player          Year
Aaron Donald    2016    NaN    NaN        NaN        NaN
                2017 12.929  181.0        NaN        NaN
                2018 17.538  280.6     12.929      181.0
Danielle Hunter 2016 12.788  204.6        NaN        NaN
                2017  8.325  133.2     12.788      204.6
                2018 14.850  237.6      8.325      133.2
J.J. Watt       2016    NaN    NaN        NaN        NaN
                2017    NaN    NaN        NaN        NaN
                2018 16.238  259.8        NaN        NaN
I'm using this code to make that happen:
res = df.set_index(['player', 'Year'])
idx = pd.MultiIndex.from_product([df['player'].unique(),
                                  df['Year'].unique()],
                                 names=['Player', 'Year'])
res = res.groupby(['player', 'Year']).apply(sum)
res = res.reindex(idx).sort_index()
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)
with the addition of a groupby.sum() because some players in the dataframe moved from one team to another within the same season and I want to combine those numbers. However, the data I have is actually coming out extremely wrong. The data has too many columns to post, but it seems like the data from the previous year (_prev) is placed into random columns. It doesn't change and will always place it into the same wrong columns. Is this an issue caused by the groupby.sum()? Is it because I'm using a columns variable (containing all the same names as res.columns with a '_prev' suffix attached to them) and a list(res.columns)? And regardless of which it is, how do I solve this?
Here are the contents of columns and res.columns:
columns:
['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']
res.columns:
['player_id', 'position', 'player_game_count', 'team_name',
'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
'targets', 'receptions', 'yards', 'yards_per_reception',
'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
'pass_break_ups', 'qb_rating_against', 'penalties',
'declined_penalties']
Both are length 35 when tested.
I suggest using:
# first aggregate for a unique MultiIndex
res = df.groupby(['Player', 'Year']).sum()
# build the full MultiIndex of all Player/Year combinations
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])
# add the missing years
res = res.reindex(idx).sort_index()
# shift all columns, add suffix and join to the original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print (res)
                       PTSN    AVGN  PTSN_prev  AVGN_prev
Player          Year
Aaron Donald    2016    NaN     NaN        NaN        NaN
                2017  181.0  12.929        NaN        NaN
                2018  280.6  17.538      181.0     12.929
Danielle Hunter 2016  204.6  12.788        NaN        NaN
                2017    NaN     NaN      204.6     12.788
                2018  237.6  14.850        NaN        NaN
J.J. Watt       2016    NaN     NaN        NaN        NaN
                2017    NaN     NaN        NaN        NaN
                2018  259.8  16.238        NaN        NaN
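As a usage example, here is the same chain run on the four-column sample from the question (the real frame has the 35 columns listed above, but the steps are identical):
import pandas as pd

df = pd.DataFrame({
    "Year":   [2018, 2018, 2018, 2017, 2016],
    "Player": ["Aaron Donald", "J.J. Watt", "Danielle Hunter",
               "Aaron Donald", "Danielle Hunter"],
    "PTSN":   [280.60, 259.80, 237.60, 181.0, 204.6],
    "AVGN":   [17.538, 16.238, 14.850, 12.929, 12.788],
})

# Aggregate duplicate (Player, Year) rows, fill in the missing years,
# then join the one-year-shifted copy with a _prev suffix
res = df.groupby(["Player", "Year"]).sum()
idx = pd.MultiIndex.from_product(res.index.levels, names=["Player", "Year"])
res = res.reindex(idx).sort_index()
res = res.join(res.groupby("Player").shift().add_suffix("_prev"))
print(res)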

Extract minimum and maximum year from string in Pandas DataFrame

I have a CSV file that I read into a Pandas DataFrame that contains a column with multiple year values separated by a semicolon.
I need to extract the minimum and maximum value from the string and save each in a new column.
I am able to print the minimum and maximum but I can't seem to get the correct values from each row saved into a new column.
Any help is much appreciated.
Sample DataFrame:
import pandas as pd
import numpy as np
raw_data = {'id': ['1473-2262', '2327-9214', '1949-8349', '2375-6314',
                   '0095-6562'],
            'years': ['2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005',
                      '2003; 2004; 2005', '2015', np.nan, '2012; 2014']}
df = pd.DataFrame(raw_data, columns=['id', 'years'])
This is the DataFrame that I need:
id years minyear maxyear
0 1473-2262 2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005 2000.0 2005.0
1 2327-9214 2003; 2004; 2005 2003.0 2005.0
2 1949-8349 2015 2015.0 2015.0
3 2375-6314 NaN NaN NaN
4 0095-6562 2012; 2014 2012.0 2014.0
I can print the minimum and maximum:
x = df['years'].notnull()
for row in df['years'][x].str.split(pat=';'):
    lst = list()
    for item in row:
        lst.append(int(item))
    print('Min=', min(lst), 'Max=', max(lst))
Min= 2000 Max= 2005
Min= 2003 Max= 2005
Min= 2015 Max= 2015
Min= 2012 Max= 2014
Here's how I've tried to capture the values to new columns:
x = df['years'].notnull()
for row in df['years'][x].str.split(pat=';'):
    lst = list()
    for item in row:
        lst.append(int(item))
    df['minyear'] = min(lst)
    df['maxyear'] = max(lst)
Only the values from the last row are saved to the new columns.
id years minyear maxyear
0 1473-2262 2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005 2012 2014
1 2327-9214 2003; 2004; 2005 2012 2014
2 1949-8349 2015 2012 2014
3 2375-6314 NaN 2012 2014
4 0095-6562 2012; 2014 2012 2014
I think you need str.split with expand=True to get a new DataFrame, then cast it to float.
The index values are the same, so assign the new columns:
df1 = df['years'].str.split('; ', expand=True).astype(float)
df = df.assign(maxyear=df1.max(axis=1), minyear=df1.min(axis=1))
# same as
# df['maxyear'], df['minyear'] = df1.max(axis=1), df1.min(axis=1)
print (df)
          id                                           years  maxyear  minyear
0  1473-2262  2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005   2005.0   2000.0
1  2327-9214                                2003; 2004; 2005   2005.0   2003.0
2  1949-8349                                            2015   2015.0   2015.0
3  2375-6314                                             NaN      NaN      NaN
4  0095-6562                                      2012; 2014   2014.0   2012.0
A solution similar to the one proposed by jezrael, but using a conversion to a Series. Warning: This solution does not scale well.
years = df.years.str.split(";").apply(pd.Series).astype(float)
#0 1 2 3 4 5 6 7
#0 2000.0 2001.0 2002.0 2003.0 2004.0 2004.0 2004.0 2005.0
#1 2003.0 2004.0 2005.0 NaN NaN NaN NaN NaN
#2 2015.0 NaN NaN NaN NaN NaN NaN NaN
#3 NaN NaN NaN NaN NaN NaN NaN NaN
#4 2012.0 2014.0 NaN NaN NaN NaN NaN NaN
df['minyear'], df['maxyear'] = years.min(axis=1), years.max(axis=1)
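Tying it together as a runnable sketch with the sample frame from the question (note the assignment order, which is easy to get backwards):
import numpy as np
import pandas as pd

raw_data = {
    "id": ["1473-2262", "2327-9214", "1949-8349", "2375-6314", "0095-6562"],
    "years": ["2000; 2001; 2002; 2003; 2004; 2004; 2004; 2005",
              "2003; 2004; 2005", "2015", np.nan, "2012; 2014"],
}
df = pd.DataFrame(raw_data, columns=["id", "years"])

# One column per year, then take the row-wise extremes
wide = df["years"].str.split("; ", expand=True).astype(float)
df["minyear"] = wide.min(axis=1)
df["maxyear"] = wide.max(axis=1)
print(df)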
