Pandas - Rebasing values based on a specific column - python

I have the following dataframe:
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
ID
11 ABC 110 109 108 100 95 90
22 DEF 120 119 118 100 85 80
33 GHI 130 129 128 100 75 70
I would like to obtain the table below, where the resulting data reflects the % chg of each row's values relative to a particular column, in this case the 2017-11-30 values.
Then, create a row at the bottom of the dataframe that provides the average.
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
ID
11 ABC 10.0% 9.0% 8.0% 0.0% -5.0% -10.0%
22 DEF 20.0% 19.0% 18.0% 0.0% -15.0% -20.0%
33 GHI 30.0% 29.0% 28.0% 0.0% -25.0% -30.0%
Average 20.0% 19.0% 18.0% 0.0% -15.0% -20.0%
My actual dataframe has about 50 columns and 50 rows, and the column used as the "base" value when calculating the % chg is the one from 1 year ago (i.e. column 14). A solution as generic as possible would be really appreciated!

I couldn't resist posting a continuation of jpp's solution, cleaned up using a MultiIndex. First we recreate the data set from a string.
import pandas as pd
import numpy as np
from io import StringIO

data = '''\
ID Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
11 ABC 110 109 108 100 95 90
22 DEF 120 119 118 100 85 80
33 GHI 130 129 128 100 75 70'''
df = pd.read_csv(StringIO(data), sep=r'\s+').set_index('ID')
Alternative single-index:
# Pop away the Name column and add an 'Average' label for the new row
names = df.pop('Name').tolist() + ['Average']
# Rebase as % change relative to column index 3 (2017-11-30); dividing by 100
# works here because the base values are all 100
df.loc[:] = (df.values.T - df.iloc[:, 3].values).T / 100
# Get the mean and append it as a new row
s = df.mean()
s.name = '99'  # the Series name becomes the new row's index label (the ID)
df = pd.concat([df, s.to_frame().T])  # DataFrame.append was removed in pandas 2.0
# Insert the names back
df.insert(0, 'Name', names)
print(df)
Returns
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 \
ID
11 ABC 0.1 0.09 0.08 0.0 -0.05
22 DEF 0.2 0.19 0.18 0.0 -0.15
33 GHI 0.3 0.29 0.28 0.0 -0.25
99 Average 0.2 0.19 0.18 0.0 -0.15
2017-09-30
ID
11 -0.1
22 -0.2
33 -0.3
99 -0.2
Alternative with multi-index (starting again from the freshly read df):
# Set a dual index of ID and Name
df = df.set_index([df.index, 'Name'])
# Rebase as % change relative to column index 3 (the 4th column, 2017-11-30)
df.loc[:] = (df.values.T - df.iloc[:, 3].values).T / 100
# Get the mean and append it as an 'Average' row
s = df.mean()
s.name = 'Average'
df = pd.concat([df, s.to_frame().T])  # DataFrame.append was removed in pandas 2.0
print(df)
df output:
2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 2017-09-30
(11, ABC) 0.1 0.09 0.08 0.0 -0.05 -0.1
(22, DEF) 0.2 0.19 0.18 0.0 -0.15 -0.2
(33, GHI) 0.3 0.29 0.28 0.0 -0.25 -0.3
Average 0.2 0.19 0.18 0.0 -0.15 -0.2
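If the percentage-string display from the question is wanted (e.g. 10.0% rather than 0.1), here is a small follow-up sketch of my own, assuming the decimal-valued frame produced by the MultiIndex variant above:
# Format every cell as a percentage string purely for display
# (DataFrame.map is the preferred spelling in pandas >= 2.1)
pct = df.applymap(lambda v: f"{v * 100:.1f}%")
print(pct)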

You can use NumPy for this. The output below is in decimals; multiply by 100 if you need percentages.
# % change of all value columns relative to column 4 (2017-11-30)
df.iloc[:, 1:] = (df.iloc[:, 1:].values / df.iloc[:, 4].values[:, None]) - 1
# Append an 'Average' row holding the column means (np is numpy)
df.loc[len(df)+1] = ['Average'] + np.mean(df.iloc[:, 1:].values, axis=0).tolist()
Result
Name 2018-02-28 2018-01-31 2018-12-31 2017-11-30 2017-10-31 \
ID
11 ABC 0.1 0.09 0.08 0.0 -0.05
22 DEF 0.2 0.19 0.18 0.0 -0.15
33 GHI 0.3 0.29 0.28 0.0 -0.25
4 Average 0.2 0.19 0.18 0.0 -0.15
2017-09-30
ID
11 -0.1
22 -0.2
33 -0.3
4 -0.2
Explanation
df.iloc[:, 1:] selects the 2nd column onwards; .values returns the underlying NumPy array.
[:, None] adds an axis so that the base column broadcasts correctly across the value columns during the division.
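For the generic case the question asks about (a base column at an arbitrary position, e.g. the 1-year-ago column in the real data), here is a minimal sketch of my own built on the same broadcasting idea; the function name rebase and its parameters are illustrative, not part of either answer:
import pandas as pd

def rebase(df, base_pos, label_cols=1):
    """Illustrative helper: % change of the numeric columns relative to the
    column at integer position base_pos, with an 'Average' row appended.
    Assumes the first label_cols columns hold labels (e.g. 'Name')."""
    labels = df.iloc[:, :label_cols]
    nums = df.iloc[:, label_cols:].astype(float)
    base = df.iloc[:, base_pos].astype(float)
    rebased = nums.div(base, axis=0) - 1          # row-wise division by the base column
    out = pd.concat([labels, rebased], axis=1)
    out.loc['Average'] = ['Average'] * label_cols + rebased.mean().tolist()
    return out

# Usage on the freshly read sample frame (2017-11-30 sits at position 4):
# print(rebase(df, base_pos=4))
# For the real 50-column frame, pass the integer position of the
# 1-year-ago column (the OP's "column 14").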

Related

Sum rows where the index step is not bigger than 1 - Pandas Python

I have foot sensor data and I want to calculate the standard deviation of the swing times.
The dataframe looks like this:
Time Force
83 0.83 80
84 0.84 60
85 0.85 40
86 0.86 20
87 0.87 0
88 0.88 0
89 0.89 20
90 0.90 40
91 0.91 60
92 0.92 40
93 0.93 0
94 0.94 0
95 0.95 0
96 0.96 20
So to get the times for when the force ==0, I did:
df[(df['Force']==0)]
Resulting in:
Time Force
87 0.87 0
88 0.88 0
93 0.93 0
94 0.94 0
95 0.95 0
Now I want to sum the Time per swing.
swing 1 = index 87 + 88, swing 2 = index 93 + 94 + 95
How can I achieve this? How can I sum the rows where the index step is not bigger than 1?
(Imagine I have thousands of rows to sum)
I tried complicated loops like:
swing_durations = []
start = []
start.append(0)
swings_left = swing_times_left.reset_index(drop=True)
for subject in swings_left[['filename']]:
    i = 1
    for time in swings_left['Time'][1:-1]:
        j = i - 1
        k = swings_left.where(swings_left['Time'].loc[i] - swings_left['Time'].loc[j] > 0.01)
        if k == True:
            start.append(time)
            swing_durations.append(swings_left[['Time']].loc[j] - start[j])
        i = i + 1
    totalswingtime_l['filename' == subject]['Variance'] = swing_durations.std()
resulting in an error
Thanks for the help!
A solution is to create an ID for each run of consecutive equal values; this is what (df.Force.shift() != df.Force).cumsum() does.
Afterwards you keep only the rows where Force is 0, using np.where.
In [83]: df["swing_id"] = np.where(df.Force==0, (df.Force.shift()!=(df.Force)).cumsum(),np.nan)
...: df
Out[83]:
Time Force swing_id
0 0.83 80 NaN
1 0.84 60 NaN
2 0.85 40 NaN
3 0.86 20 NaN
4 0.87 0 5.0
5 0.88 0 5.0
6 0.89 20 NaN
7 0.90 40 NaN
8 0.91 60 NaN
9 0.92 40 NaN
10 0.93 0 10.0
11 0.94 0 10.0
12 0.95 0 10.0
13 0.96 20 NaN
In [84]: df.groupby("swing_id")["Time"].sum()
Out[84]:
swing_id
5.0 1.75
10.0 2.82
Name: Time, dtype: float64
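From there, the standard deviation the question ultimately asks for is one more step; a short follow-up of my own, assuming each group's summed Time is what counts as a swing time:
swing_times = df.groupby("swing_id")["Time"].sum()   # one summed time per swing
print(swing_times.std())                              # std of the swing times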

Efficient way to merge two pandas groupby objects

I have 2 groupby objects as follows:
df1.last() => returns a pandas dataframe with stock_id and date as the index:
close
stock_id, date
a1 2005-12-31 1.1
2006-12-31 1.2
...
2017-12-31 1.3
2018-12-31 1.3
2019-12-31 1.4
a2 2008-12-31 2.1
2009-12-31 2.4
....
2018-12-31 3.4
2019-12-31 3.4
df2 => returns a groupby object with id as the index:
stock_id, date, eps, dps,
id
1 a1 2017-12-01 0.2 0.03
2 a1 2018-12-01 0.3 0.02
3 a1 2019-06-01 0.4 0.01
4 a2 2018-12-01 0.5 0.03
5 a2 2019-06-01 0.3 0.04
df2 is supposed to be used as the reference to merge with df1, matching on stock_id and year (df2 covers fewer years). The expected result is as follows:
df2 merge with df1:
stock_id, date, eps, dps close ratio_eps, ratio_dps
id a1 2017 0.2 0.03 1.3 0.2/1.3 0.03/1.3
a1 2018 0.3 0.02 1.3 0.3/1.3 ...
a1 2019 0.4 0.01 1.4 0.4/1.4 ...
a2 2018 0.5 0.03 3.4 ...
a2 2019 0.3 0.04 3.4 ...
The above can be done with a for loop, but it would be inefficient. Is there any pythonic way to achieve it?
How do I remove the day and month from both dataframes and use the year as a key to match and join both tables together efficiently?
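No answer is recorded for this one, but here is a minimal sketch of one possible approach, assuming df1.last() yields the frame shown and treating df2 as a plain DataFrame with stock_id/date/eps/dps columns (if it is a GroupBy object, take its underlying frame first); the names closes, fins and merged are illustrative:
import pandas as pd

closes = df1.last().reset_index()            # columns: stock_id, date, close
fins = df2.copy()                            # columns: stock_id, date, eps, dps

# Drop day and month by keeping only the year as the join key
closes['year'] = pd.to_datetime(closes['date']).dt.year
fins['year'] = pd.to_datetime(fins['date']).dt.year

merged = fins.merge(closes[['stock_id', 'year', 'close']],
                    on=['stock_id', 'year'], how='left')
merged['ratio_eps'] = merged['eps'] / merged['close']
merged['ratio_dps'] = merged['dps'] / merged['close']
print(merged)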

Calculate weighted average based on 2 columns using a pandas/dataframe

I have the following dataframe df. I want to calculate a weighted average grouped by each date and Sector:
date Equity value Sector Weight
2000-01-31 TLRA 20 RG Index 0.20
2000-02-28 TLRA 30 RG Index 0.20
2000-03-31 TLRA 40 RG Index 0.20
2000-01-31 RA 50 RG Index 0.30
2000-02-28 RA 60 RG Index 0.30
2000-03-31 RA 70 RG Index 0.30
2000-01-31 AAPL 80 SA Index 0.50
2000-02-28 AAPL 90 SA Index 0.50
2000-03-31 AAPL 100 SA Index 0.50
2000-01-31 SPL 110 SA Index 0.60
2000-02-28 SPL 120 SA Index 0.60
2000-03-31 SPL 130 SA Index 0.60
There can be many equities under a Sector. I want a sector-level weighted average based on the Weight column.
Expected Output:
date RG Index SA Index
2000-01-31 19 106
2000-02-28 24 117
2000-03-31 29 128
I tried the code below, but I am not getting the expected output. Please help!
g = df.groupby('Sector')
df['wa'] = df.value / g.value.transform("sum") * df.Weight
df.pivot(index='Sector', values='wa')
This is more like a pivot problem: first assign a new column as the product of value and Weight, then pivot:
df.assign(V=df.value*df.Weight).pivot_table(index='date',columns='Sector',values='V',aggfunc='sum')
Out[328]:
Sector RG Index SA Index
date
2000-01-31 19.0 106.0
2000-02-28 24.0 117.0
2000-03-31 29.0 128.0
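Equivalently (a small variant of my own, not from the answer), the same table can be produced with groupby plus unstack:
out = (df.assign(V=df.value * df.Weight)
         .groupby(['date', 'Sector'])['V'].sum()
         .unstack('Sector'))
print(out)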

Python Pandas - retrieving values in one column while they are less than the value of a second column

Suppose I have a df that looks like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 31.0 0.90
2 30 0.03 41.0 0.70
3 40 0.72 51.0 0.08
4 50 0.09 81.0 0.78
5 60 0.09 NaN NaN
6 70 0.01 NaN NaN
7 80 0.09 NaN NaN
8 90 0.08 NaN NaN
9 100 0.02 NaN NaN
In the posR column, we see that it jumps from 11 to 31, and there is not a value in the "20's". I want to insert a value to fill that space, which would essentially just be the posF value, and NA, so my resulting df would look like this:
posF ffreq posR rfreq
0 10 0.50 11.0 0.08
1 20 0.20 20 NaN
2 30 0.03 31.0 0.90
3 40 0.72 41.0 0.70
4 50 0.09 50 NaN
5 60 0.09 60 NaN
6 70 0.01 70 NaN
7 80 0.09 80 NaN
8 90 0.08 81.0 0.78
9 100 0.02 100 NaN
So I want to fill the NaN values in the position with the values from posF that are in between the values in posR.
What I have tried to do is just make a dummy list and add values to the list based on whether they were less than a (I see the flaw here but I don't know how to fix it).
insert_rows = []
for x in df['posF']:
    for a, b in zip(df['posR'], df['rfreq']):
        if x < a:
            insert_rows.append([x, 'NA'])
print(len(insert_rows))  # 21, should be 5
I realize that it is appending x several times until it reaches the condition of being >a.
After this I will just create a new df and add these values to the original 2 columns so they are the same length.
If you can think of a better title, feel free to edit.
My first thought was to retrieve the new indices for the entries in posR by interpolating with posF and then put the values at their new positions. But as you want 81 one row later than that would give, I'm afraid this is not exactly what you're searching for, and I still don't fully get the logic behind your task.
However, perhaps this is a starting point, let's see...
This approach would work like the following:
Retrieve the new index positions of the values in posR according to their order in posF:
import numpy as np
idx = np.interp(df.posR, df.posF, df.index).round()
Get rid of nan entries and cast to int:
idx = idx[np.isfinite(idx)].astype(int)
Create new columns: newposR as a copy of posF, and newrfreq filled with NaN:
df['newposR'] = df.posF
df['newrfreq'] = np.nan
Then overwrite with the values from posR and rfreq, but now at the updated positions:
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
Result:
posF ffreq posR rfreq newposR newrfreq
0 10 0.50 11.0 0.08 11.0 0.08
1 20 0.20 31.0 0.90 20.0 NaN
2 30 0.03 41.0 0.70 31.0 0.90
3 40 0.72 51.0 0.08 41.0 0.70
4 50 0.09 81.0 0.78 51.0 0.08
5 60 0.09 NaN NaN 60.0 NaN
6 70 0.01 NaN NaN 70.0 NaN
7 80 0.09 NaN NaN 81.0 0.78
8 90 0.08 NaN NaN 90.0 NaN
9 100 0.02 NaN NaN 100.0 NaN
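For reference, here are the steps above combined into one runnable sketch of my own on the question's sample data (values copied from the question):
import numpy as np
import pandas as pd

# Sample frame from the question
df = pd.DataFrame({
    'posF':  [10, 20, 30, 40, 50, 60, 70, 80, 90, 100],
    'ffreq': [0.50, 0.20, 0.03, 0.72, 0.09, 0.09, 0.01, 0.09, 0.08, 0.02],
    'posR':  [11.0, 31.0, 41.0, 51.0, 81.0] + [np.nan] * 5,
    'rfreq': [0.08, 0.90, 0.70, 0.08, 0.78] + [np.nan] * 5,
})

# New index positions for posR, interpolated against posF
idx = np.interp(df.posR, df.posF, df.index).round()
idx = idx[np.isfinite(idx)].astype(int)

# Start from posF / NaN, then overwrite at the interpolated positions
df['newposR'] = df.posF.astype(float)
df['newrfreq'] = np.nan
df.loc[idx, 'newposR'] = df.posR[:len(idx)].values
df.loc[idx, 'newrfreq'] = df.rfreq[:len(idx)].values
print(df)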

Replicating CASE in numerator and denominator of weighted average calculation in pandas

Pandas newbie trying to replicate SQL in Python.
Referencing the below post, I could use a simple function to calculate the weighted average of a column in a pandas dataframe.
Calculate weighted average using a pandas/dataframe
Date ID wt value
01/01/2012 100 0.50 60
01/01/2012 101 0.75
01/01/2012 102 1.00 100
01/02/2012 201 0.50
01/02/2012 202 1.00 80
However, if I had conditions in both the numerator and the denominator and wanted an aggregate weighted average, I would do the below in SQL:
SELECT
date
, id
, SUM(CASE WHEN value IS NOT NULL THEN value * wt ELSE 0 END) /
NULLIF(SUM(CASE WHEN value > 0 THEN wt ELSE 0 END), 0)
AS wt_avg
FROM table
GROUP BY date, id
How would we replicate this in Pandas?
Thanks in advance.
Consider using calculated helper columns following the specified logic, with np.where() replacing the CASE statements and Series.fillna() as the counterpart to NULLIF.
import numpy as np
import pandas as pd

df['numer'] = np.where(pd.notnull(df['value']), df['value'] * df['wt'], 0)
df['denom'] = pd.Series(np.where(df['value'] > 0, df['wt'], 0)).fillna(0)
df['wt_avg'] = (df.groupby(['Date', 'ID'])['numer'].transform('sum') /
                df.groupby(['Date', 'ID'])['denom'].transform('sum'))
print(df)
# print(df.drop(columns=['numer', 'denom'])) # DROP HELPER COLUMNS
# Date ID wt value numer denom wt_avg
# 0 01/01/2012 100 0.50 60.0 30.0 0.5 60.0
# 1 01/01/2012 101 0.75 NaN 0.0 0.0 NaN
# 2 01/01/2012 102 1.00 100.0 100.0 1.0 100.0
# 3 01/02/2012 201 0.50 NaN 0.0 0.0 NaN
# 4 01/02/2012 202 1.00 80.0 80.0 1.0 80.0
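One difference from the SQL: GROUP BY collapses the result to one row per (Date, ID), whereas transform keeps every input row. If the collapsed shape is wanted, a small variant of my own using the same helper columns:
grp = df.groupby(['Date', 'ID'])[['numer', 'denom']].sum()
grp['wt_avg'] = grp['numer'] / grp['denom'].replace(0, np.nan)   # counterpart of NULLIF(..., 0)
print(grp.reset_index())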
