Unwanted mean in pivot_table when values are identical - python

When there are identical index values, pivot_table takes the mean
(because aggfunc='mean' by default).
For instance:
d = pd.DataFrame(data={
    'x_values': [13.4, 13.08, 12.73, 12., 33., 23., 12.],
    'y_values': [1.54, 1.47, 1., 2., 4., 4., 3.],
    'experiment': ['e', 'e', 'e', 'f', 'f', 'f', 'f']})
print(pd.pivot_table(d, index='x_values',
                     columns='experiment', values='y_values', sort=False))
returns:
experiment e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.5
33.00 NaN 4.0
23.00 NaN 4.0
As you can see, a new value appears in f (2.5, which is the mean of 2.0 and 3.0).
But I want to keep the values as they are in my DataFrame:
experiment e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.0
33.00 NaN 4.0
23.00 NaN 4.0
12.00 NaN 3.0
How can I do it?
I have tried playing with aggfunc=list followed by an explode, but in that case the order is lost...
Thanks

Here's my solution. You don't really want to pivot on x_values (because there are dupes). So add a new unique column (id_col) and pivot on both x_values and id_col. Then you will have to do some cleanup:
(d
.assign(id_col=range(len(d)))
.pivot(index=['x_values', 'id_col'], columns='experiment')
.reset_index()
.drop(columns='id_col')
.set_index('x_values')
)
Here's the output:
y_values
experiment e f
x_values
12.00 NaN 2.0
12.00 NaN 3.0
12.73 1.00 NaN
13.08 1.47 NaN
13.40 1.54 NaN
23.00 NaN 4.0
33.00 NaN 4.0
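Just as a hedged variant of the same idea: a per-x_values counter from groupby().cumcount() can play the role of the unique key instead of a global id_col (the dup column name below is my own), so pivot again never has to aggregate the duplicates:
(d
 .assign(dup=d.groupby('x_values').cumcount())
 .pivot(index=['x_values', 'dup'], columns='experiment', values='y_values')
 .reset_index()
 .drop(columns='dup')
 .set_index('x_values')
)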

A workaround would be to select the data for each unique experiment value and then concatenate the pieces:
pd.concat([d.loc[d.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c})
           for c in d.experiment.unique()])
Result:
x_values e f
0 13.40 1.54 NaN
1 13.08 1.47 NaN
2 12.73 1.00 NaN
3 12.00 NaN 2.0
4 33.00 NaN 4.0
5 23.00 NaN 4.0
6 12.00 NaN 3.0

You could also just assign new variables and fill them according to boolean masks:
df = pd.DataFrame(
    data={
        'x_values': [13.4, 13.08, 12.73, 12., 33., 23., 12.],
        'y_values': [1.54, 1.47, 1., 2., 4., 4., 3.],
        'experiment': ['e', 'e', 'e', 'f', 'f', 'f', 'f']
    }
)
df['e'] = df.loc[df['experiment'] == 'e', 'y_values']
df['f'] = df.loc[df['experiment'] == 'f', 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final
-------------------------------------------------
e f
x_values
13.40 1.54 NaN
13.08 1.47 NaN
12.73 1.00 NaN
12.00 NaN 2.0
33.00 NaN 4.0
23.00 NaN 4.0
12.00 NaN 3.0
-------------------------------------------------
If the experiment column has more than these two unique values, you can iterate over all of them:
for experiment in df['experiment'].unique():
    df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
df_final = df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
df_final
which results in the desired output.
This approach appears to be more efficient than the one provided by @Stef, though at the cost of a few more lines of code.
from time import time

first_approach = []
for i in range(1000):
    start = time()
    pd.concat([df.loc[df.experiment.eq(c), ['x_values', 'y_values']].rename(columns={'y_values': c})
               for c in df.experiment.unique()]).set_index(['x_values'])
    first_approach.append(time() - start)

second_approach = []
for i in range(1000):
    start = time()
    for experiment in df['experiment'].unique():
        df[experiment] = df.loc[df['experiment'] == experiment, 'y_values']
    df.drop(columns=['y_values', 'experiment']).set_index(['x_values'])
    second_approach.append(time() - start)

print(f'Average Time First Approach:\t{sum(first_approach)/len(first_approach):.5f}')
print(f'Average Time Second Approach:\t{sum(second_approach)/len(second_approach):.5f}')
--------------------------------------------
Average Time First Approach: 0.00403
Average Time Second Approach: 0.00205
--------------------------------------------
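As a side note, the standard timeit module can express the same measurement a bit more compactly; a minimal sketch for the first approach (my own variation, not part of the original benchmark):
import timeit

t_first = timeit.timeit(
    lambda: pd.concat([df.loc[df.experiment.eq(c), ['x_values', 'y_values']]
                         .rename(columns={'y_values': c})
                       for c in df.experiment.unique()]).set_index('x_values'),
    number=1000,
)
print(f'First approach, total for 1000 runs: {t_first:.3f} s')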

Related

Pandas dynamically replace nan values

I have a DataFrame that looks like this:
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
a b
0 1.0 4.0
1 2.0 2.0
2 NaN 3.0
3 1.0 NaN
4 NaN NaN
5 NaN 1.0
6 4.0 5.0
7 2.0 NaN
8 3.0 5.0
9 NaN 8.0
I want to dynamically replace the NaN values. I have tried doing (df.ffill()+df.bfill())/2 but that does not yield the desired output, as it computes the fill values for the whole column at once rather than dynamically. I have tried interpolate, but it doesn't work well for non-linear data.
I have seen this answer but did not fully understand it and not sure if it would work.
Update on the computation of the values
I want every NaN value to be the mean of the previous and next non-NaN value. In case there is more than one NaN value in sequence, I want to replace them one at a time and then compute the mean; e.g., given 1, np.nan, np.nan, 4, I first want the mean of 1 and 4 (2.5) for the first NaN value, obtaining 1, 2.5, np.nan, 4, and then the second NaN will be the mean of 2.5 and 4, getting to 1, 2.5, 3.25, 4.
The desired output is
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
Inspired by the @ye olde noobe answer (thanks to him!):
I've optimized it to make it ≃ 100x faster (times comparison below):
def custom_fillna(s: pd.Series):
    for i in range(len(s)):
        if pd.isna(s[i]):
            last_valid_number = s[s[:i].last_valid_index()] if s[:i].last_valid_index() is not None else 0
            next_valid_number = s[s[i:].first_valid_index()] if s[i:].first_valid_index() is not None else 0
            s[i] = (last_valid_number + next_valid_number) / 2
custom_fillna(df['a'])
df
Times comparison:
Maybe not the most optimized, but it works (note: from your example, I assume that if there is no valid value before or after a NaN, like the last row on column a, 0 is used as a replacement):
import pandas as pd

def fill_dynamically(s: pd.Series):
    for i in range(len(s)):
        s[i] = (
            (0 if s[i:].first_valid_index() is None else s[i:][s[i:].first_valid_index()]) +
            (0 if s[:i+1].last_valid_index() is None else s[:i+1][s[:i+1].last_valid_index()])
        ) / 2
Use like this for the full dataframe:
import numpy as np

df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
df.apply(fill_dynamically)
df after applying:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 2.0
4 2.50 1.5
5 3.25 1.0
6 4.00 5.0
7 2.00 5.0
8 3.00 5.0
9 1.50 8.0
In case you have other columns and don't want to apply that on the whole dataframe, you can of course use it on a single column, like this:
df = pd.DataFrame({'a': [1, 2, np.nan, 1, np.nan, np.nan, 4, 2, 3, np.nan],
                   'b': [4, 2, 3, np.nan, np.nan, 1, 5, np.nan, 5, 8]})
fill_dynamically(df['a'])
In this case, df looks like this:
a b
0 1.00 4.0
1 2.00 2.0
2 1.50 3.0
3 1.00 NaN
4 2.50 NaN
5 3.25 1.0
6 4.00 5.0
7 2.00 NaN
8 3.00 5.0
9 1.50 8.0

Filling missing data using a custom condition in a Pandas time series dataframe

Below is a portion of my dataframe, which has many missing values.
A B
S a b c d e a b c d e
date
2020-10-15 1.0 2.0 NaN NaN NaN 10.0 11.0 NaN NaN NaN
2020-10-16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-17 NaN NaN NaN 4.0 NaN NaN NaN NaN 13.0 NaN
2020-10-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-20 4.0 6.0 4.0 1.0 9.0 10.0 2.0 13.0 4.0 13.0
I would like to replace the NaNs in each column using a specific backward-fill condition.
For example, in column (A,a) missing values appear for the dates 16th, 17th, 18th and 19th. The next value is 4, against the 20th. I want this value (the next non-missing value in the column) to be distributed among all these dates, including the 20th, at a progressively increasing value of 10%. That is, column (A,a) gets values of approximately .655, .720, .793, .872 & .960 for the dates 16th, 17th, 18th, 19th & 20th. This shall be the approach for all columns for all missing values across rows.
I tried using the bfill() function but am unable to fathom how to incorporate the required formula as an option.
I have checked the link Pandas: filling missing values in time series forward using a formula and a few other links on stackoverflow. This is somewhat similar, but in my case the number of NaNs in a given column is variable and spans multiple rows. Compare column (A,a) with column (A,d) or column (B,d). Given this, I am finding it difficult to adapt that solution to my problem.
Appreciate any inputs.
Here is a completely vectorized way to do this. It is very efficient and fast: 130 ms on a 1000 x 1000 matrix. This is a good opportunity to expose some interesting techniques using numpy.
First, let's dig a bit into the requirements, specifically what exactly the value for each cell needs to be.
The example given is [nan, nan, nan, nan, 4.0] --> [.66, .72, .79, .87, .96], which is explained to be a "progressively increasing value of 10%" (in such a way that the total is the "value to spread": 4.0).
This is a geometric series with rate r = 1 + 0.1: [r^1, r^2, r^3, ...] and then normalized to sum to 1. For example:
r = 1.1
a = 4.0
n = 5
q = np.cumprod(np.repeat(r, n))
a * q / q.sum()
# array([0.65518992, 0.72070892, 0.79277981, 0.87205779, 0.95926357])
We'd like to do a direct calculation (to avoid calling Python functions and explicit loops, which would be much slower), so we need to express that normalizing factor q.sum() in closed form. It is the usual geometric-series sum: for q = [r^1, ..., r^n], q.sum() = r * (r**n - 1) / (r - 1).
To generalize, we need 3 quantities to calculate the value of each cell:
a: value to distribute
i: index of run (0 .. n-1)
n: run length
then, the value is v = a * r**i * (r - 1) / (r**n - 1).
To illustrate with the first column in the OP's example, where the input is: [1, nan, nan, nan, nan, 4], we would like:
a = [1, 4, 4, 4, 4, 4]
i = [0, 0, 1, 2, 3, 4]
n = [1, 5, 5, 5, 5, 5]
then, the value v would be (rounded at 2 decimals): [1. , 0.66, 0.72, 0.79, 0.87, 0.96].
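As a quick check of that formula (my own illustration, not part of the original answer), plugging these three arrays into v reproduces the expected column:
import numpy as np

r = 1.1
a = np.array([1, 4, 4, 4, 4, 4], dtype=float)   # value to distribute (backfilled)
i = np.array([0, 0, 1, 2, 3, 4])                # index within each run
n = np.array([1, 5, 5, 5, 5, 5])                # run length
v = a * r**i * (r - 1) / (r**n - 1)
v.round(2)
# array([1.  , 0.66, 0.72, 0.79, 0.87, 0.96])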
Now comes the part where we go about getting these three quantities as numpy arrays.
a is the easiest and is simply df.bfill().values. But for i and n, we do have to do a little bit of work, starting by assigning the values to a numpy array:
z = df.values
nrows, ncols = z.shape
For i, we start with the cumulative count of NaNs, with reset when values are not NaN. This is strongly inspired by this SO answer for "Cumulative counts in NumPy without iteration". But we do it for a 2D array, and we also want to add a first row of 0, and discard the last row to satisfy exactly our needs:
def rcount(z):
    na = np.isnan(z)
    without_reset = na.cumsum(axis=0)
    reset_at = ~na
    overcount = np.maximum.accumulate(without_reset * reset_at)
    result = without_reset - overcount
    return result

i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
For n, we need to do some dancing on our own, using first principles of numpy (I'll break down the steps if I have time):
runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
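As a possible breakdown of those two lines (my own illustration), here is how the pieces behave on a tiny 3 x 2 example:
import numpy as np

z = np.array([[1.0, np.nan],
              [np.nan, 2.0],
              [3.0, np.nan]])
nrows, ncols = z.shape
# append a sentinel row of non-NaN values so every run of NaNs has an end,
# then transpose so the flattened view walks each column in order
ends = np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T))   # [0, 2, 3, 5, 7]
runlen = np.diff(np.hstack((-1, ends)))                              # [1, 2, 1, 2, 2]
# repeat each run length over its own run, pour the values back column by
# column (order='F'), and drop the sentinel row
n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
# n == [[1, 2],
#       [2, 2],
#       [2, 2]]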
So, putting it all together:
def spread_bfill(df, r=1.1):
    z = df.values
    nrows, ncols = z.shape
    a = df.bfill().values
    i = np.vstack((np.zeros(ncols, dtype=bool), rcount(z)))[:-1]
    runlen = np.diff(np.hstack((-1, np.flatnonzero(~np.isnan(np.vstack((z, np.ones(ncols))).T)))))
    n = np.reshape(np.repeat(runlen, runlen), (nrows + 1, ncols), order='F')[:-1]
    v = a * r**i * (r - 1) / (r**n - 1)
    return pd.DataFrame(v, columns=df.columns, index=df.index)
On your example data, we get:
>>> spread_bfill(df).round(2) # round(2) for printing purposes
A B
a b c d e a b c d e
S
2020-10-15 1.00 2.00 0.52 1.21 1.17 10.00 11.00 1.68 3.93 1.68
2020-10-16 0.66 0.98 0.57 1.33 1.28 1.64 0.33 1.85 4.32 1.85
2020-10-17 0.72 1.08 0.63 1.46 1.41 1.80 0.36 2.04 4.75 2.04
2020-10-18 0.79 1.19 0.69 0.30 1.55 1.98 0.40 2.24 1.21 2.24
2020-10-19 0.87 1.31 0.76 0.33 1.71 2.18 0.44 2.47 1.33 2.47
2020-10-20 0.96 1.44 0.83 0.37 1.88 2.40 0.48 2.71 1.46 2.71
For inspection, let's look at each of the 3 quantities in that example:
>>> a
[[ 1 2 4 4 9 10 11 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 4 9 10 2 13 13 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]
[ 4 6 4 1 9 10 2 13 4 13]]
>>> i
[[0 0 0 0 0 0 0 0 0 0]
[0 0 1 1 1 0 0 1 1 1]
[1 1 2 2 2 1 1 2 2 2]
[2 2 3 0 3 2 2 3 0 3]
[3 3 4 1 4 3 3 4 1 4]
[4 4 5 2 5 4 4 5 2 5]]
>>> n
[[1 1 6 3 6 1 1 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]
[5 5 6 3 6 5 5 6 3 6]]
And here is a final example, to illustrate what happens if a column ends with 1 or several NaNs (they remain NaN):
np.random.seed(10)
a = np.random.randint(0, 10, (6, 6)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
>>> df
0 1 2 3 4 5
0 NaN NaN NaN NaN NaN 0.0
1 NaN NaN 9.0 NaN 8.0 NaN
2 NaN NaN NaN NaN NaN NaN
3 NaN 8.0 4.0 NaN NaN NaN
4 NaN NaN NaN 6.0 9.0 NaN
5 NaN NaN 2.0 NaN 7.0 8.0
Then:
>>> spread_bfill(df).round(2) # round(2) for printing
0 1 2 3 4 5
0 NaN 1.72 4.29 0.98 3.81 0.00
1 NaN 1.90 4.71 1.08 4.19 1.31
2 NaN 2.09 1.90 1.19 2.72 1.44
3 NaN 2.29 2.10 1.31 2.99 1.59
4 NaN NaN 0.95 1.44 3.29 1.74
5 NaN NaN 1.05 NaN 7.00 1.92
Speed
a = np.random.randint(0, 10, (1000, 1000)).astype(float)
a *= np.random.choice([1.0, np.nan], a.shape, p=[.3, .7])
df = pd.DataFrame(a)
%timeit spread_bfill(df)
# 130 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Initial data:
>>> df
A B
a b c d e a b c d e
date
2020-10-15 1.0 2.0 NaN NaN NaN 10.0 11.0 NaN NaN NaN
2020-10-16 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-17 NaN NaN NaN 4.0 NaN NaN NaN NaN 13.0 NaN
2020-10-18 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-19 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2020-10-20 4.0 6.0 4.0 1.0 9.0 10.0 2.0 13.0 4.0 13.0
Define your geometric sequence:
def geomseq(seq):
    q = 1.1
    n = len(seq)
    S = seq.max()
    Uo = S * (1-q) / (1-q**n)
    Un = [Uo * q**i for i in range(0, n)]
    return Un
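A quick sanity check of geomseq on the (A,a) run from the example (my own illustration): the terms grow by 10% each and sum back to the value being spread.
import numpy as np
import pandas as pd

geomseq(pd.Series([np.nan, np.nan, np.nan, np.nan, 4.0]))
# [0.655, 0.721, 0.793, 0.872, 0.959]  (rounded), which sums to 4.0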
TL;DR
>>> df.unstack().groupby(df.unstack().sort_index(ascending=False).notna().cumsum().sort_index()).transform(geomseq).unstack(level=[0, 1])
A B
a b c d e a b c d e
date
2020-10-15 1.000000 2.000000 0.518430 1.208459 1.166466 10.000000 11.000000 1.684896 3.927492 1.684896
2020-10-16 0.655190 0.982785 0.570272 1.329305 1.283113 1.637975 0.327595 1.853386 4.320242 1.853386
2020-10-17 0.720709 1.081063 0.627300 1.462236 1.411424 1.801772 0.360354 2.038724 4.752266 2.038724
2020-10-18 0.792780 1.189170 0.690030 0.302115 1.552567 1.981950 0.396390 2.242597 1.208459 2.242597
2020-10-19 0.872058 1.308087 0.759033 0.332326 1.707823 2.180144 0.436029 2.466856 1.329305 2.466856
2020-10-20 0.959264 1.438895 0.834936 0.365559 1.878606 2.398159 0.479632 2.713542 1.462236 2.713542
Details
Convert your dataframe to series:
>>> sr = df.unstack()
>>> sr.head(10)
date
A a 2020-10-15 1.0
2020-10-16 NaN # <= group X (final value: .655)
2020-10-17 NaN # <= group X (final value: .720)
2020-10-18 NaN # <= group X (final value: .793)
2020-10-19 NaN # <= group X (final value: .872)
2020-10-20 4.0 # <= group X (final value: .960)
b 2020-10-15 2.0
2020-10-16 NaN
2020-10-17 NaN
2020-10-18 NaN
dtype: float64
Now you can build groups:
>>> groups = sr.sort_index(ascending=False).notna().cumsum().sort_index()
>>> groups.head(10)
date
A a 2020-10-15 16
2020-10-16 15 # <= group X15
2020-10-17 15 # <= group X15
2020-10-18 15 # <= group X15
2020-10-19 15 # <= group X15
2020-10-20 15 # <= group X15
b 2020-10-15 14
2020-10-16 13
2020-10-17 13
2020-10-18 13
dtype: int64
Apply your geometric progression:
>>> sr = sr.groupby(groups).transform(geomseq)
>>> sr.head(10)
date
A a 2020-10-15 1.000000
2020-10-16 0.655190 # <= group X15
2020-10-17 0.720709 # <= group X15
2020-10-18 0.792780 # <= group X15
2020-10-19 0.872058 # <= group X15
2020-10-20 0.959264 # <= group X15
b 2020-10-15 2.000000
2020-10-16 0.982785
2020-10-17 1.081063
2020-10-18 1.189170
dtype: float64
And finally, reshape series according to your initial dataframe:
>>> df = sr.unstack(level=[0, 1])
>>> df
A B
a b c d e a b c d e
date
2020-10-15 1.000000 2.000000 0.518430 1.208459 1.166466 10.000000 11.000000 1.684896 3.927492 1.684896
2020-10-16 0.655190 0.982785 0.570272 1.329305 1.283113 1.637975 0.327595 1.853386 4.320242 1.853386
2020-10-17 0.720709 1.081063 0.627300 1.462236 1.411424 1.801772 0.360354 2.038724 4.752266 2.038724
2020-10-18 0.792780 1.189170 0.690030 0.302115 1.552567 1.981950 0.396390 2.242597 1.208459 2.242597
2020-10-19 0.872058 1.308087 0.759033 0.332326 1.707823 2.180144 0.436029 2.466856 1.329305 2.466856
2020-10-20 0.959264 1.438895 0.834936 0.365559 1.878606 2.398159 0.479632 2.713542 1.462236 2.713542

Converting text (json object?) in pandas cell to columns

I am trying to pull data from text values in a pandas DataFrame.
df = pd.DataFrame(['{58={1=4.5}, 50={1=4.0}, 42={1=3.5}, 62={1=4.75}, 54={1=4.25}, 46={1=3.75}}',
                   '{a={1=15.0}, b={1=14.0}, c={1=13.0}, d={1=15.5}, e={1=14.5}, f={1=13.5}}',
                   '{58={1=15.5}, 50={1=14.5}, 42={1=13.5}, 62={1=16.0}, 54={1=15.0}, 46={1=14.0}}'])
I have tried
df.apply(pd.Series)
pd.DataFrame(df.tolist(), index=df.index)
json_normalize(df)
but with no success.
I want to have new columns 50, 52, a, b, c, etc., holding the values without the '1='; I don't mind the NaNs. How can I do that? And what is this format?
Really appreciate your help.
With a specific replacement to prepare a valid JSON string:
In [184]: new_df = pd.DataFrame(
     ...:     df.apply(lambda s: s.str.replace(r'(\w+)=\{1=([^}]+)\}', '"\\1":\\2'))[0]
     ...:       .apply(pd.io.json.loads)
     ...:       .tolist()
     ...: )
In [185]: new_df
Out[185]:
42 46 50 54 58 62 a b c d e f
0 3.5 3.75 4.0 4.25 4.5 4.75 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN 15.0 14.0 13.0 15.5 14.5 13.5
2 13.5 14.00 14.5 15.00 15.5 16.00 NaN NaN NaN NaN NaN NaN
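Note that on recent pandas versions Series.str.replace no longer treats the pattern as a regular expression by default, so regex=True has to be passed explicitly; a sketch of the same idea adapted accordingly (my adaptation, using the standard-library json module in place of pd.io.json):
import json

new_df = pd.DataFrame(
    df[0].str.replace(r'(\w+)=\{1=([^}]+)\}', r'"\1":\2', regex=True)
         .apply(json.loads)
         .tolist()
)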
There is a way you can do it by changing the strings so that your data looks like a dictionary. There is probably a smarter way using regex (a sketch of that idea follows after the output below), but that will depend on the assumptions you can make about the rest of your data.
My steps below are:
Change strings to transform your data into a dict-like structure
Use literal_eval to transform the str on a dict
Unfold the df into a new dataframe
from ast import literal_eval

df[0] = (df[0].str.replace('={1=', "':")   # remove the 1= and the left inner dict sign {
              .str.replace('}, ', ",'")    # remove the right inner dict sign }
              .str.replace('}}', '}')      # remove the outermost extra }
              .str.replace('{', "{'")      # add the opening quote for the first key
              .apply(literal_eval))        # read the string as a dict
pd.DataFrame(df[0].values.tolist())        # unfold as a new dataframe
Out[1]:
58 50 42 62 54 46 a b c d e f
0 4.5 4.0 3.5 4.75 4.25 3.75 NaN NaN NaN NaN NaN NaN
1 NaN NaN NaN NaN NaN NaN 15.0 14.0 13.0 15.5 14.5 13.5
2 15.5 14.5 13.5 16.00 15.00 14.00 NaN NaN NaN NaN NaN NaN
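As hinted above, a regex can also parse these strings directly; a minimal sketch of that idea, applied to the original string column and assuming every entry follows the key={1=value} pattern (my own addition):
import re

pd.DataFrame([dict(re.findall(r'(\w+)=\{1=([^}]+)\}', s)) for s in df[0]]).astype(float)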

pandas dataframe row proportions

I have a dataframe with multiple columns and rows.
For all columns I need the row value to be equal to 0.5 of the current row's value plus 0.5 of the previous row's value.
I currently set up a loop which is working, but I feel there is a better way without using a loop. Does anyone have any thoughts?
dataframe = df_input
df_output = df_input.copy()
for i in range(1, df_input.shape[0]):
    try:
        df_output.iloc[[i]] = (df_input.iloc[[i-1]] * (1/2)).values + (df_input.iloc[[i]] * (1/2)).values
    except:
        pass
Do you mean something like this?
First creating test data:
np.random.seed(42)
df = pd.DataFrame(np.random.randint(0, 20, [5, 3]), columns=['A', 'B', 'C'])
A B C
0 6 19 14
1 10 7 6
2 18 10 10
3 3 7 2
4 1 11 5
Your requested function:
(df*.5).rolling(2).sum()
A B C
0 NaN NaN NaN
1 8.0 13.0 10.0
2 14.0 8.5 8.0
3 10.5 8.5 6.0
4 2.0 9.0 3.5
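As a hedged aside (my own note, not part of the original answer), the same result can be written without rolling by shifting the frame one row, which states the definition directly:
0.5 * df + 0.5 * df.shift()   # row 0 becomes NaN, matching the rolling version above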
EDIT:
for an unbalanced sum you can define an auxiliary function:
def weighted_mean(arr):
    return sum(arr * [.25, .75])

df.rolling(2).apply(weighted_mean, raw=True)
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
EDIT2:
...and if the weights are to be set at runtime:
def weighted_mean(arr, weights=[.5, .5]):
    return sum(arr * weights / sum(weights))
Called with no additional argument, it defaults to the balanced mean:
df.rolling(2).apply(weighted_mean, raw=True)
A B C
0 NaN NaN NaN
1 8.0 13.0 10.0
2 14.0 8.5 8.0
3 10.5 8.5 6.0
4 2.0 9.0 3.5
An unbalanced mean:
df.rolling(2).apply(weighted_mean, raw=True, args=[[.25, .75]])
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
The division by sum(weights) means the weights are not restricted to fractions summing to one; any ratio works:
df.rolling(2).apply(weighted_mean, raw=True, args=[[1, 3]])
A B C
0 NaN NaN NaN
1 9.00 10.00 8.00
2 16.00 9.25 9.00
3 6.75 7.75 4.00
4 1.50 10.00 4.25
df.rolling(window=2, min_periods=1).apply(lambda x: x[0]*0.5 + x[1]*0.5 if len(x) > 1 else x[0], raw=True)
This will do the same operation for all columns.
Explanation: for each rolling window the lambda receives one column's values as an array structured like [this_col[i-1], this_col[i]], so doing custom arithmetic on them is straightforward.
Something like below?
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(low=0, high=10, size=(5, 1)), columns=['a'])
df["cumsum_a"] = 0.5*df["a"].cumsum() + 0.5*df["a"]

pandas groupby: *full* join result of groupwise operation on original index

Consider this df:
import pandas as pd, numpy as np

df = pd.DataFrame.from_dict({'id': ['A', 'B', 'A', 'C', 'D', 'B', 'C'],
                             'val': [1, 2, -3, 1, 5, 6, -2],
                             'stuff': ['12', '23232', '13', '1234', '3235', '3236', '732323']})
Question: how to produce a table with as many columns as unique id ({A, B, C}) and
as many rows as df where, for example, for the column corresponding to id==A, the values are:
1, np.nan, -2, np.nan, np.nan, np.nan, np.nan
(that is, the result of df.groupby('id')['val'].cumsum() joined back on the index of df).
UMMM, pivot:
pd.pivot(df.index, df.id, df.val).cumsum()
Out[33]:
id A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
One way via a dictionary comprehension and pd.DataFrame.where:
res = pd.DataFrame({i: df['val'].where(df['id'].eq(i)).cumsum() for i in df['id'].unique()})
print(res)
A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
For a small number of groups, you may find this method efficient:
df = pd.concat([df]*1000, ignore_index=True)
def piv_transform(df):
    return pd.pivot(df.index, df.id, df.val).cumsum()

def dict_transform(df):
    return pd.DataFrame({i: df['val'].where(df['id'].eq(i)).cumsum() for i in df['id'].unique()})
%timeit piv_transform(df) # 17.5 ms
%timeit dict_transform(df) # 8.1 ms
Certainly cleaner answers have been supplied - see pivot.
df1 = pd.DataFrame(data=[df.id == x for x in df.id.unique()]).T.mul(df.groupby(['id']).cumsum().squeeze(), axis=0)
df1.columns = df.id.unique()
df1.applymap(lambda x: np.nan if x == 0 else x)
A B C D
0 1.0 NaN NaN NaN
1 NaN 2.0 NaN NaN
2 -2.0 NaN NaN NaN
3 NaN NaN 1.0 NaN
4 NaN NaN NaN 5.0
5 NaN 8.0 NaN NaN
6 NaN NaN -1.0 NaN
Short and simple:
df.pivot(columns='id', values='val').cumsum()
