I'm using code to shift time series data that looks somewhat similar to this:
Year Player PTSN AVGN
2018 Aaron Donald 280.60 17.538
2018 J.J. Watt 259.80 16.238
2018 Danielle Hunter 237.60 14.850
2017 Aaron Donald 181.0 12.929
2016 Danielle Hunter 204.6 12.788
with the intent of getting it into something like this:
AVGN PTSN AVGN_prev PTSN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 12.929 181.0 NaN NaN
2018 17.538 280.6 12.929 181.0
Danielle Hunter 2016 12.788 204.6 NaN NaN
2017 8.325 133.2 12.788 204.6
2018 14.850 237.6 8.325 133.2
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 16.238 259.8 NaN NaN
I'm using this code to make that happen:
res = df.set_index(['player', 'Year'])
idx = pd.MultiIndex.from_product([df['player'].unique(),
                                  df['Year'].unique()],
                                 names=['Player', 'Year'])
res = res.groupby(['player', 'Year']).apply(sum)
res = res.reindex(idx).sort_index()
res[columns] = res.groupby('Player')[list(res.columns)].shift(1)
with the addition of a groupby.sum() because some players in the dataframe moved from one team to another within the same season and I want to combine those numbers. However, the data I get is actually coming out extremely wrong. The dataframe has too many columns to post, but it looks like the data from the previous year (_prev) is placed into random columns. The misplacement doesn't change between runs; the values always land in the same wrong columns. Is this an issue caused by the groupby.sum()? Is it because I'm using a columns variable (containing all the same names as res.columns with a '_prev' suffix attached to them) together with list(res.columns)? And regardless of which it is, how do I solve this?
Here are the outputs of columns and res.columns:
columns:
['player_id_prev', 'position_prev', 'player_game_count_prev', 'team_name_prev', 'snap_counts_total_prev', 'snap_counts_pass_rush_prev', 'snap_counts_run_defense_prev', 'snap_counts_coverage_prev', 'grades_defense_prev', 'grades_run_defense_prev', 'grades_tackle_prev', 'grades_pass_rush_defense_prev', 'grades_coverage_defense_prev', 'total_pressures_prev', 'sacks_prev', 'hits_prev', 'hurries_prev', 'batted_passes_prev', 'tackles_prev', 'assists_prev', 'missed_tackles_prev', 'stops_prev', 'forced_fumbles_prev', 'targets_prev', 'receptions_prev', 'yards_prev', 'yards_per_reception_prev', 'yards_after_catch_prev', 'longest_prev', 'touchdowns_prev', 'interceptions_prev', 'pass_break_ups_prev', 'qb_rating_against_prev', 'penalties_prev', 'declined_penalties_prev']
res.columns:
['player_id', 'position', 'player_game_count', 'team_name',
'snap_counts_total', 'snap_counts_pass_rush', 'snap_counts_run_defense',
'snap_counts_coverage', 'grades_defense', 'grades_run_defense',
'grades_tackle', 'grades_pass_rush_defense', 'grades_coverage_defense',
'total_pressures', 'sacks', 'hits', 'hurries', 'batted_passes',
'tackles', 'assists', 'missed_tackles', 'stops', 'forced_fumbles',
'targets', 'receptions', 'yards', 'yards_per_reception',
'yards_after_catch', 'longest', 'touchdowns', 'interceptions',
'pass_break_ups', 'qb_rating_against', 'penalties',
'declined_penalties']
both are length 35 when tested.
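For reference, a quick sanity check (a sketch, using the res and columns objects above): the _prev values ending up under the same wrong names every time usually points to an ordering mismatch between the columns list and res.columns.
# Sketch: the target names should be res.columns in the same order, just with '_prev' appended.
expected = [c + '_prev' for c in res.columns]
print(expected == columns)  # False would mean each _prev column is paired with the wrong source column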
I suggest using:
# first aggregate to get a unique MultiIndex
res = df.groupby(['Player', 'Year']).sum()
# full MultiIndex of all Player/Year combinations
idx = pd.MultiIndex.from_product(res.index.levels,
                                 names=['Player', 'Year'])
# add the missing years
res = res.reindex(idx).sort_index()
# shift all columns, add a suffix and join to the original
res = res.join(res.groupby('Player').shift().add_suffix('_prev'))
print (res)
PTSN AVGN PTSN_prev AVGN_prev
Player Year
Aaron Donald 2016 NaN NaN NaN NaN
2017 181.0 12.929 NaN NaN
2018 280.6 17.538 181.0 12.929
Danielle Hunter 2016 204.6 12.788 NaN NaN
2017 NaN NaN 204.6 12.788
2018 237.6 14.850 NaN NaN
J.J. Watt 2016 NaN NaN NaN NaN
2017 NaN NaN NaN NaN
2018 259.8 16.238 NaN NaN
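For the wide dataframe in the question, which also has string columns like team_name and position, it may be safer to aggregate only the numeric stat columns so groupby.sum() does not try to sum text; a sketch under that assumption (column names follow the answer above):
# Sketch: restrict the aggregation to numeric columns, excluding the Year grouping key.
numeric_cols = df.select_dtypes(include='number').columns.difference(['Year'])
res = df.groupby(['Player', 'Year'])[list(numeric_cols)].sum()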
Related
I have a (triangular) dataframe, i.e. DF1:
12-24 24-36 36-48
2017 1.554 3.532 8.657
2018 2.978 1.114 NaN
2019 4.366 NaN NaN
I have to find the cumprod for this dataframe.
I tried this code:
df2= df1.iloc[:, ::-1].cumprod(axis=1).iloc[:, ::-1]
But the result is the same as df1.
The result should look like:
12-24 24-36 36-48
2017 8.898 4.646 8.657
2018 7.344 1.114 NaN
2019 4.366 NaN NaN
Thank you for your time :)
Try this instead:
>>> df1.iloc[::-1].cumsum().iloc[::-1]
12-24 24-36 36-48
2017 8.898 4.646 8.657
2018 7.344 1.114 NaN
2019 4.366 NaN NaN
>>>
You don't need axis=1 or the extra colons: df1.iloc[:, ::-1] would reverse the columns instead of the rows, and here the accumulation should run down the rows.
Note that your expected output corresponds to a reversed cumulative sum down the rows rather than a cumulative product.
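If a cumulative product is what you actually need, the same reverse-accumulate-reverse pattern applies; a minimal sketch, assuming df1 as shown above:
# Sketch: reversed cumulative product down the rows instead of a sum.
df2 = df1.iloc[::-1].cumprod().iloc[::-1]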
I am facing a problem while dealing with the NaN values in the Temperature column, with respect to the City column, using interpolate().
The df is:
import pandas as pd

data = {
    'City': ['Greenville', 'Charlotte', 'Los Gatos', 'Greenville', 'Carson City', 'Greenville', 'Greenville', 'Charlotte', 'Carson City',
             'Greenville', 'Charlotte', 'Fort Lauderdale', 'Rifle', 'Los Gatos', 'Fort Lauderdale'],
    'Rec_times': ['2019-05-21 08:29:55', '2019-01-27 17:43:09', '2020-12-13 21:53:00', '2019-07-17 11:43:09', '2018-04-17 16:51:23',
                  '2019-10-07 13:28:09', '2020-01-07 11:38:10', '2019-11-03 07:13:09', '2020-11-19 10:45:23', '2020-10-07 15:48:19', '2020-10-07 10:53:09',
                  '2017-08-31 17:40:49', '2016-08-31 17:40:49', '2021-11-13 20:13:10', '2016-08-31 19:43:29'],
    'Temperature': [30, 45, 26, 33, 50, None, 29, None, 48, 32, 47, 33, None, None, 28],
    'Pressure': [30, None, 26, 43, 50, 36, 29, None, 48, 32, None, 35, 23, 49, None]
}
df = pd.DataFrame(data)
df
Output:
City Rec_times Temperature Pressure
0 Greenville 2019-05-21 08:29:55 30.0 30.0
1 Charlotte 2019-01-27 17:43:09 45.0 NaN
2 Los Gatos 2020-12-13 21:53:00 26.0 26.0
3 Greenville 2019-07-17 11:43:09 33.0 43.0
4 Carson City 2018-04-17 16:51:23 50.0 50.0
5 Greenville 2019-10-07 13:28:09 NaN 36.0
6 Greenville 2020-01-07 11:38:10 29.0 29.0
7 Charlotte 2019-11-03 07:13:09 NaN NaN
8 Carson City 2020-11-19 10:45:23 48.0 48.0
9 Greenville 2020-10-07 15:48:19 32.0 32.0
10 Charlotte 2020-10-07 10:53:09 47.0 NaN
11 Fort Lauderdale 2017-08-31 17:40:49 33.0 35.0
12 Rifle 2016-08-31 17:40:49 NaN 23.0
13 Los Gatos 2021-11-13 20:13:10 NaN 49.0
14 Fort Lauderdale 2016-08-31 19:43:29 28.0 NaN
I want to deal with the NaN values in the Temperature column by grouping the records by City and using interpolate(method='time').
For example:
Consider the city 'Greenville': it has 5 temperatures (30, 33, NaN, 29 and 32) recorded at different times. The NaN in Temperature should be replaced by a value obtained by grouping the records by City and interpolating with interpolate(method='time').
Note: if you know another, better method to replace the NaNs in Temperature, feel free to post it as an 'Other solution'.
Use a lambda function with a DatetimeIndex created by DataFrame.set_index, together with GroupBy.transform:
df["Rec_times"] = pd.to_datetime(df["Rec_times"])
df['Temperature'] = (df.set_index('Rec_times')
                       .groupby("City")['Temperature']
                       .transform(lambda x: x.interpolate(method='time'))
                       .to_numpy())
One possible idea for replacing values that are still missing after interpolate is to fill them with the mean of all values, e.g.:
df.Temperature = df.Temperature.fillna(df.Temperature.mean())
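A per-city variant of that fallback (a sketch, reusing the df from the question): fill what interpolate could not resolve with that city's own mean instead of the global one.
# Sketch: fall back to each city's mean; cities whose only reading is NaN (like Rifle) stay NaN.
df['Temperature'] = df['Temperature'].fillna(
    df.groupby('City')['Temperature'].transform('mean'))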
My understanding is that you want to replace the NaNs in the Temperature column by an interpolation of the temperatures in that specific city.
I would have to think about a more sophisticated solution. But here is a simple hack:
df["Rec_times"] = pd.to_datetime(df["Rec_times"]) # .interpolate requires datetime
df["idx"] = df.index # to restore original ordering
df_new = pd.DataFrame() # will hold new data
for (city,group) in df.groupby("City"):
group = group.set_index("Rec_times", drop=False)
df_new = pd.concat((df_new, group.interpolate(method='time')))
df_new = df_new.set_index("idx").sort_index() # Restore original ordering
df_new
Note that interpolation for Rifle will yield NaN given there is only one data point which is NaN.
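A quick check (a sketch) to see which readings are still unresolved after the per-city interpolation:
# Sketch: list the rows whose Temperature is still NaN (expected: only the single Rifle reading).
print(df_new.loc[df_new['Temperature'].isna(), ['City', 'Rec_times', 'Temperature']])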
Hello everyone, I have the following problem:
I have panel data for 400,000 objects and I want to drop an object if it contains more than 40% NaNs.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
... ... ... ... ... ... ...
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
... ... ... ... ...
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
I have this dataframe, and for each sub-dataframe grouped by inn I need to drop it if more than 40% of the values in the columns (revenue1, balans1, equity1, opprofit1, grossprofit1, netprofit1, currentassets1, stliabilities1) are NaN.
I have an idea of how to do it in a loop, but that takes a lot of time.
For example:
inn time_reg revenue1 balans1 equity1 opprofit1 \
4081809 9909403758 2002 6078000.0 2270000.0 2195000.0 -32000.0
4081810 9909403758 2003 6943000.0 2185000.0 2136000.0 -97000.0
4081811 9909403758 2004 6504000.0 2245000.0 2196000.0 -34000.0
4081812 9909403758 2005 NaN NaN NaN NaN
4081813 9909403758 2006 NaN NaN NaN NaN
4081814 9909403758 2007 NaN NaN NaN NaN
grossprofit1 netprofit1 currentassets1 stliabilities1
4081809 1324000.0 NaN 234000.0 75000.0
4081810 2365000.0 -59000.0 253000.0 49000.0
4081811 2278000.0 60000.0 425000.0 49000.0
4081812 NaN NaN NaN NaN
4081813 NaN NaN NaN NaN
4081814 NaN NaN NaN NaN
This sub-dataframe should be dropped, because it contains more than 40% NaNs.
inn time_reg revenue1 balans1 equity1 opprofit1 \
0 0101000021 2006 457000.0 115000.0 28000.0 29000.0
1 0101000021 2007 1943000.0 186000.0 104000.0 99000.0
2 0101000021 2008 2812000.0 318000.0 223000.0 127000.0
3 0101000021 2009 2673000.0 370000.0 242000.0 39000.0
4 0101000021 2010 3240000.0 435000.0 45000.0 NaN
5 0101000021 2011 3480000.0 610000.0 71000.0 NaN
6 0101000021 2012 4820000.0 710000.0 139000.0 149000.0
7 0101000021 2013 5200000.0 790000.0 148000.0 170000.0
8 0101000021 2014 5450000.0 830000.0 155000.0 180000.0
9 0101000021 2015 5620000.0 860000.0 164000.0 189000.0
10 0101000021 2016 5860000.0 885000.0 175000.0 200000.0
11 0101000021 2017 15112000.0 1275000.0 298000.0 323000.0
grossprofit1 netprofit1 currentassets1 stliabilities1
0 92000.0 18000.0 105000.0 87000.0
1 189000.0 76000.0 176000.0 82000.0
2 472000.0 119000.0 308000.0 95000.0
3 483000.0 29000.0 360000.0 128000.0
4 NaN 35000.0 NaN NaN
5 NaN 61000.0 NaN NaN
6 869000.0 129000.0 700000.0 571000.0
7 1040000.0 138000.0 780000.0 642000.0
8 1090000.0 145000.0 820000.0 675000.0
9 1124000.0 154000.0 850000.0 696000.0
10 1172000.0 165000.0 875000.0 710000.0
11 3023000.0 288000.0 1265000.0 977000.0
This sub-dataframe contains less than 40% NaNs and must stay in the final dataframe.
Would a loop still be too slow if you used a numpy/pandas function for the counting? You could use someDataFrame.isnull().sum().sum().
That is probably a lot faster than writing your own loop over all the values in a dataframe, since those libraries tend to have very efficient implementations of these kinds of functions.
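A loop-free sketch along those lines, assuming df and the column names from the question: compute each inn's NaN fraction over the value columns and keep only the groups at or below 40%.
# Sketch: per-inn fraction of NaN cells across the value columns.
value_cols = ['revenue1', 'balans1', 'equity1', 'opprofit1',
              'grossprofit1', 'netprofit1', 'currentassets1', 'stliabilities1']
nan_frac = df[value_cols].isnull().groupby(df['inn']).mean().mean(axis=1)
# Keep the inns whose NaN share does not exceed 40%.
keep = nan_frac[nan_frac <= 0.4].index
result = df[df['inn'].isin(keep)]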
You can use the filter method of pd.DataFrame.groupby.
This lets you pass a function that decides whether each sub-frame is kept (here: kept only if at most 40% of the cells in the relevant columns are NaN). To get that fraction, you can use numpy to count the NaNs, as in getNanFraction:
import numpy as np

def getNanFraction(df):
    # fraction of NaN cells over the value columns (drop the identifier columns first)
    values = df.drop(["inn", "time_reg"], axis=1)
    return np.isnan(values.values).sum() / values.size

df.groupby("inn").filter(lambda x: getNanFraction(x) <= 0.4)
I'm trying to collect the items from column 'data' that immediately precede the values I collected in column 'min', and put them in a new column. See below.
Here is the data (importing with pd.read_csv):
time,data
12/15/18 01:10 AM,130352.146180556
12/16/18 01:45 AM,130355.219097222
12/17/18 01:47 AM,130358.223263889
12/18/18 02:15 AM,130361.281701389
12/19/18 03:15 AM,130364.406597222
12/20/18 03:25 AM,130352.427430556
12/21/18 03:27 AM,130355.431597222
12/22/18 05:18 AM,130358.663541667
12/23/18 06:44 AM,130361.842430556
12/24/18 07:19 AM,130364.915243056
12/25/18 07:33 AM,130352.944409722
12/26/18 07:50 AM,130355.979826389
12/27/18 09:13 AM,130359.153472222
12/28/18 11:53 AM,130362.4871875
12/29/18 01:23 PM,130365.673263889
12/30/18 02:17 PM,130353.785763889
12/31/18 02:23 PM,130356.798263889
01/01/19 04:41 PM,130360.085763889
01/02/19 05:01 PM,130363.128125
and my code:
import pandas as pd
import numpy as np
from scipy import signal
from scipy.signal import argrelextrema
import datetime

diff = pd.DataFrame()
df = pd.read_csv('saw_data2.csv')
df['time'] = pd.to_datetime(df['time'])
print(df.head())

n = 2  # number of points to be checked before and after

# Find local minima
df['min'] = df.iloc[argrelextrema(df.data.values, np.less_equal, order=n)[0]]['data']
If you plot the data, you'll see it is similar to a sawtooth. The element of 'data' just before each value I get in 'min' is the element I want to put in a new column, df['new_col'].
I've tried many things, like:
df['new_col']=df.index.get_loc(df['min'].df['data'])
and,
df['new_col']=df['min'].shift() #obviously wrong
IIUC, you can do the shift before selecting the rows with a value in min:
df['new_col'] = df.shift().loc[df['min'].notna(), 'data']
print (df)
time data min new_col
0 12/15/18 01:10 AM 130352.146181 130352.146181 NaN
1 12/16/18 01:45 AM 130355.219097 NaN NaN
2 12/17/18 01:47 AM 130358.223264 NaN NaN
3 12/18/18 02:15 AM 130361.281701 NaN NaN
4 12/19/18 03:15 AM 130364.406597 NaN NaN
5 12/20/18 03:25 AM 130352.427431 130352.427431 130364.406597
6 12/21/18 03:27 AM 130355.431597 NaN NaN
7 12/22/18 05:18 AM 130358.663542 NaN NaN
8 12/23/18 06:44 AM 130361.842431 NaN NaN
9 12/24/18 07:19 AM 130364.915243 NaN NaN
10 12/25/18 07:33 AM 130352.944410 130352.944410 130364.915243
11 12/26/18 07:50 AM 130355.979826 NaN NaN
12 12/27/18 09:13 AM 130359.153472 NaN NaN
13 12/28/18 11:53 AM 130362.487187 NaN NaN
14 12/29/18 01:23 PM 130365.673264 NaN NaN
15 12/30/18 02:17 PM 130353.785764 130353.785764 130365.673264
16 12/31/18 02:23 PM 130356.798264 NaN NaN
17 01/01/19 04:41 PM 130360.085764 NaN NaN
18 01/02/19 05:01 PM 130363.128125 NaN NaN
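An equivalent formulation under the same assumption (a sketch): shift the whole data column and mask the rows where min is NaN.
# Sketch: same result as above, written with where() instead of a loc selection.
df['new_col'] = df['data'].shift().where(df['min'].notna())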
Given the pandas DataFrame:
name hobby since
paul A 1995
john A 2005
paul B 2015
mary G 2013
chris E 2005
chris D 2001
paul C 1986
I would like to get:
name hobby1 since1 hobby2 since2 hobby3 since3
paul A 1995 B 2015 C 1986
john A 2005 NaN NaN NaN NaN
mary G 2013 NaN NaN NaN NaN
chris E 2005 D 2001 NaN NaN
I.e. I would like to have one row per name. The maximum number of hobbies a person can have, say 3 in this case, is something I know in advance. What would be the most elegant/short way to do this?
You can first melt, then use groupby.cumcount() to number each name/variable pair, and then pivot using pivot_table():
m = df.melt('name')
(m.assign(variable=m.variable + (m.groupby(['name', 'variable']).cumcount() + 1).astype(str))
  .pivot_table(index='name', columns='variable', values='value', aggfunc='first')
  .rename_axis(None, axis=1))
hobby1 hobby2 hobby3 since1 since2 since3
name
chris E D NaN 2005 2001 NaN
john A NaN NaN 2005 NaN NaN
mary G NaN NaN 2013 NaN NaN
paul A B C 1995 2015 1986
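If the interleaved hobby1, since1, hobby2, since2, ... layout from the question matters, a small follow-up sketch (assuming the result above was saved to a variable, say out, a hypothetical name):
# Sketch: reorder the columns so each hobbyN sits next to its sinceN.
cols = [f'{name}{i}' for i in (1, 2, 3) for name in ('hobby', 'since')]
out = out[cols]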
Use cumcount and unstack. Finally, use MultiIndex.map to flatten the two-level columns into a single level:
df1 = (df.set_index(['name', df.groupby('name').cumcount().add(1)])
         .unstack()
         .sort_index(axis=1, level=1))
df1.columns = df1.columns.map('{0[0]}{0[1]}'.format)
Out[812]:
hobby1 since1 hobby2 since2 hobby3 since3
name
chris E 2005.0 D 2001.0 NaN NaN
john A 2005.0 NaN NaN NaN NaN
mary G 2013.0 NaN NaN NaN NaN
paul A 1995.0 B 2015.0 C 1986.0
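Because the NaNs force the since* columns to float (2005.0 instead of 2005), one option is pandas' nullable integer dtype; a sketch, assuming pandas >= 0.24:
# Sketch: restore integer display for the year columns while keeping the missing values.
since_cols = [c for c in df1.columns if c.startswith('since')]
df1[since_cols] = df1[since_cols].astype('Int64')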
Maybe something like this? But you would need to rename the columns afterwards with this solution.
df["combined"] = ["{}_{}".format(x, y) for x, y in zip(df.hobby, df.since)]
(df.groupby("name")["combined"]
   .agg(lambda x: "_".join(x))
   .str.split("_", expand=True))
The result is:
0 1 2 3 4 5
name
chris E 2005 D 2001 None None
john A 2005 None None None None
mary G 2013 None None None None
paul A 1995 B 2015 C 1986
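A sketch of the renaming step mentioned above (assuming the split result was assigned to a variable, say res, a hypothetical name); note that the underscore join would also break if a hobby value itself contained an underscore:
# Sketch: rename the positional 0..5 columns to hobby1, since1, hobby2, since2, hobby3, since3.
res.columns = [f'{name}{i // 2 + 1}' for i, name in enumerate(['hobby', 'since'] * 3)]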