Normalize DataFrame by group - python

Let's say that I have some data generated as follows:
N = 20
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3
and then I create some categorization variable:
indx = np.random.randint(0,3,size=N).astype(np.int32)
and generate a DataFrame:
import pandas as pd
df = pd.DataFrame(np.hstack((data, indx[:,None])),
columns=['a%s' % k for k in range(m)] + [ 'indx'])
I can get the mean value, per group as:
df.groubpy('indx').mean()
What I'm unsure of how to do is to then subtract the mean off of each group, per-column in the original data, so that the data in each column is normalized by the mean within group. Any suggestions would be appreciated.

In [10]: df.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
should do it.

If the data contains many groups (thousands or more), the accepted answer using a lambda may take a very long time to compute. A fast solution would be:
groups = df.groupby("indx")
mean, std = groups.transform("mean"), groups.transform("std")
normalized = (df[mean.columns] - mean) / std
Explanation and benchmarking
The accepted answer suffers from a performance problem using apply with a lambda. Even though groupby.transform itself is fast, as are the already vectorized calls in the lambda function (.mean(), .std() and the subtraction), the call to the pure Python lambda function itself for each group creates a considerable overhead.
This can be avoided by using pure vectorized Pandas/Numpy calls and not writing any Python method, as shown in ErnestScribbler's answer.
We can get around the headache of merging and naming the columns by leveraging the broadcasting abilities of .transform. Let's put the solution from above into a method for benchmarking:
def normalize_by_group(df, by):
groups = df.groupby(by)
# computes group-wise mean/std,
# then auto broadcasts to size of group chunk
mean = groups.transform("mean")
std = groups.transform("std")
normalized = (df[mean.columns] - mean) / std
return normalized
I changed the data generation from the original question to allow for more groups:
def gen_data(N, num_groups):
m = 3
data = np.random.normal(size=(N,m)) + np.random.normal(size=(N,m))**3
indx = np.random.randint(0,num_groups,size=N).astype(np.int32)
df = pd.DataFrame(np.hstack((data, indx[:,None])),
columns=['a%s' % k for k in range(m)] + [ 'indx'])
return df
With only two groups (thus only two Python function calls), the lambda version is only about 1.8x slower than the numpy code:
In: df2g = gen_data(10000, 2) # 3 cols, 10000 rows, 2 groups
In: %timeit normalize_by_group(df2g, "indx")
6.61 ms ± 72.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit df2g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
12.3 ms ± 130 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Increasing the number of groups to 1000, and the runtime issue becomes apparent. The lambda version is 370x slower than the numpy code:
In: df1000g = gen_data(10000, 1000) # 3 cols, 10000 rows, 1000 groups
In: %timeit normalize_by_group(df1000g, "indx")
7.5 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In: %timeit df1000g.groupby('indx').transform(lambda x: (x - x.mean()) / x.std())
2.78 s ± 13.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The accepted answer works and is elegant.
Unfortunately, for large datasets I think performance-wise using .transform() is much much slower than doing the less elegant following (illustrated with a single column 'a0'):
means_stds = df.groupby('indx')['a0'].agg(['mean','std']).reset_index()
df = df.merge(means_stds,on='indx')
df['a0_normalized'] = (df['a0'] - df['mean']) / df['std']
To do it for multiple columns you'll have to figure out the merge. My suggestion would be to flatten the multiindex columns from aggregation as in this answer and then merge and normalize for each column separately:
means_stds = df.groupby('indx')[['a0','a1']].agg(['mean','std']).reset_index()
means_stds.columns = ['%s%s' % (a, '|%s' % b if b else '') for a, b in means_stds.columns]
df = df.merge(means_stds,on='indx')
for col in ['a0','a1']:
df[col+'_normalized'] = ( df[col] - df[col+'|mean'] ) / df[col+'|std']

Although this is not the prettiest solution, you could do something like this:
indx = df['indx'].copy()
for indices in df.groupby('indx').groups.values():
df.loc[indices] -= df.loc[indices].mean()
df['indx'] = indx

Related

Compare with another column value

train.loc[:,'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > train['q_5']
I get Automatic reindexing on DataFrame vs Series comparisons is deprecated and will raise ValueError in a future version. Do left, right = left.align(right, axis=1, copy=False)before e.g.left == right` and something strange output with a lot of columns, but I did expect cell values masked with True or False for calculate sum on next step.
Comparing each columns separately works just fine
train['nd_mean_2021-04-15'] > train['q_5']
But works slowly and messy code.
I've tested your original solution, and two additional ways of performing this comparison you want to make.
To cut to the chase, the following option had the smallest execution time:
%%timeit
sliced_df = df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27']
comparisson_df = pd.DataFrame({col: df['q_5'] for col in sliced_df.columns})
(sliced_df > comparisson_df)
# 1.46 ms ± 610 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Drawback: it's little bit messy and requires you to create 2 new objects (sliced_df and comparisson_df)
Option 2: Using DataFrame.apply (slower but more readable)
The second option although slower than your original and the above implementations, in my opinion is the cleanest and easiest to read of them all. If you're not trying to process large amounts of data (I assume not, since you're using pandas instead of Dask or Spark that are tools more suitable for processing large volumes of data) then it's worth bringing it to the discussion table:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'].apply(lambda col: col > df['q_5'])
# 5.66 ms ± 897 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Original Solution
I've also tested the performance of your original implementation and here's what I got:
%%timeit
df.loc[:, 'nd_mean_2021-04-15':'nd_mean_2021-08-27'] > df['q_5']
# 2.02 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Side-Note: If the FutureWarning message is bothering you, there's always the option to ignore them, adding the following code after your script imports:
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
DataFrame Used for Testing
All of the above implementations used the same dataframe, that I created using the following code:
import pandas as pd
import numpy as np
columns = list(
map(
lambda value: f'nd_mean_{value}',
pd.date_range('2021-04-15', '2021-08-27', freq='W').to_series().dt.strftime('%Y-%m-%d').to_list()
)
)
df = pd.DataFrame(
{col: np.random.randint(0, 100, 10) for col in [*columns, 'q_5']}
)
Screenshots

Pandas transform method performing slow

I have a canonical Pandas transform example in which performance seems inexplicably slow. I have read the Q&A on the apply method, which is related but, in my humble opinion, offers an incomplete and potentially misleading answer to my question as I explain below.
The first five lines of my dataframe are
id date xvar
0 1004 1992-05-31 4.151628
1 1004 1993-05-31 2.868015
2 1004 1994-05-31 3.043287
3 1004 1995-05-31 3.189541
4 1004 1996-05-31 4.008760
There are 24,693 rows in the dataframe.
There are 2,992 unique id values.
I want to center xvar by id.
Approach 1 takes 861 ms:
df_r['xvar_center'] = (
df_r
.groupby('id')['xvar']
.transform(lambda x: x - x.mean())
)
Approach 2 takes 9 ms:
# Group means
df_r_mean = (
df_r
.groupby('id', as_index=False)['xvar']
.mean()
.rename(columns={'xvar':'xvar_avg'})
)
# Merge group means onto dataframe and center
df_w = (
pd
.merge(df_r, df_r_mean, on='id', how='left')
.assign(xvar_center=lambda x: x.xvar - x.xvar_avg)
)
The Q&A on the apply method recommends relying on vectorized functions whenever possible, much like #sammywemmy's comment implies. This I see as overlap. However, the Q&A on the apply method also sates:
"...here are some common situations where you will want to get rid of any calls to apply...Numeric Data"
#sammywemmy's comment does not "get rid of any calls to" the transform method in their answer to my question. On the contrary, the answer relies on the transform method. Therefore, unless #sammywemmy's suggestion is strictly dominated by an alternative approach that does not rely on the transform method, I think my question and its answer are sufficiently distinct from the discussion in Q&A on the apply method. (Thank you for your patience and help.)
This answer is due to the insightful comment from #sammywemmy, who deserves all credit and no blame for any inaccuracy here. Because a similar usage of transform is illustrated in the Pandas User's Guide, I thought elaborating may be useful for others.
My hypothesis is that the problem rests with a combination of using a non-vectorized function and a large number of groups. When I change the groupby variable from id (2,992 unique values) to year (constructed from the date variable and containing 28 unique values), the performance difference between my original approach and #sammywemmy's narrows substantially but is still significant.
%%timeit
df_r['xvar_center_y'] = (
df_r
.groupby('year')['xvar']
.transform(lambda x: x - x.mean())
)
11.4 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
vs.
%timeit df_r['xvar_center_y'] = df_r.xvar - df_r.groupby('year')['xvar'].transform('mean')
1.69 ms ± 5.11 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The beauty of #sammywemmy's insight is that it is easy to apply to other common transformations for potentially significant performance improvements and at a modest cost in terms of additional code. For example, consider standardizing a variable:
%%timeit
df_r['xvar_z'] = (
df_r
.groupby('id')['xvar']
.transform(lambda x: (x - x.mean()) / x.std())
)
1.34 s ± 38 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
vs.
%%timeit
df_r['xvar_z'] = (
(df_r.xvar - df_r.groupby('id')['xvar'].transform('mean'))
/ df_r.groupby('id')['xvar'].transform('std')
)
3.96 ms ± 297 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Speeding up operations on large arrays & datasets (Pandas slow, Numpy better, further improvements?)

I have a large dataset comprising millions of rows and around 6 columns. The data is currently in a Pandas dataframe and I'm looking for the fastest way to operate on it. For example, let's say I want to drop all the rows where the value in one column is "1".
Here's my minimal working example:
# Create dummy data arrays and pandas dataframe
array_size = int(5e6)
array1 = np.random.rand(array_size)
array2 = np.random.rand(array_size)
array3 = np.random.rand(array_size)
array_condition = np.random.randint(0, 3, size=array_size)
df = pd.DataFrame({'array_condition': array_condition, 'array1': array1, 'array2': array2, 'array3': array3})
def method1():
df_new = df.drop(df[df.array_condition == 1].index)
EDIT: As Henry Yik pointed out in the comments, a faster Pandas approach is this:
def method1b():
df_new = df[df.array_condition != 1]
I believe that Pandas can be quite slow at this sort of thing, so I also implemented a method using numpy, processing each column as a separate array:
def method2():
masking = array_condition != 1
array1_new = array1[masking]
array2_new = array2[masking]
array3_new = array3[masking]
array_condition_new = array_condition[masking]
And the results:
%timeit method1()
625 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit methodb()
158 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method2()
138 ms ± 3.8 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
So we do see a slight significant performance boost using numpy. However, this is at the cost of much less readable code (i.e. having to create a mask and apply it to each array). This method doesn't seem as scalable either as if I have, say, 30 columns of data, I'll need a lot of lines of code that apply the mask to every array! Additionally, it would be useful to allow optional columns, so this method may fail trying to operate on arrays which are empty.
Therefore, I have 2 questions:
1) Is there a cleaner / more flexible way to implement this in numpy?
2) Or better, is there any higher performance method I could use here? e.g. JIT (numba?), Cython or something else?
PS, in practice, in-place operations can be used, replacing the old array with the new one once data is dropped
Part 1: Pandas and (maybe) Numpy
Compare your method1b and method2:
method1b generates a DataFrame, which is probably what you want,
method2 generates a Numpy array, so to get fully comparable result,
you should subsequently generate a DataFrame from it.
So I changed your method2 to:
def method2():
masking = array_condition != 1
array1_new = array1[masking]
array2_new = array2[masking]
array3_new = array3[masking]
array_condition_new = array_condition[masking]
df_new = pd.DataFrame({ 'array_condition': array_condition[masking],
'array1': array1_new, 'array2': array2_new, 'array3': array3_new})
and then compared execution times (using %timeit).
The result was that my (expanded) version of method2 executed about 5% longer
than method1b (check on your own).
So my opinion is that as long as a single operation is concerned,
it is probably better to stay with Pandas.
But if you want to perform on your source DataFrame a couple of operations
in sequence and / or you are satisfied with the result as a Numpy array,
it is worth to:
Call arr = df.values to get the underlying Numpy array.
Perform all required operations on it using Numpy methods.
(Optionally) create a DataFrame from the final reslut.
I tried Numpy version of method1b:
def method3():
a = df.values
arr = a[a[:,0] != 1]
but the execution time was about 40 % longer.
The reason is probably that Numpy array has all elements of the
same type, so array_condition column is coerced to float and then
the whole Numpy array is created, what takes some time.
Part 2: Numpy and Numba
An alternative to consider is to use Numba package - a Just-In-Time
Python compiler.
I made such test:
Created a Numpy array (as a preliminary step):
a = df.values
The reason is that JIT compiled methods are able to use Numpy methods and types,
but not those of Pandas.
To perform the test, I used almost the same method as above,
but with #njit annotation (requires from numba import njit):
#njit
def method4():
arr = a[a[:,0] != 1]
This time:
The execution time was about 45 % of the time for method1b.
But since a = df.values has been executed before the test loop,
there are doubts whether this result is comparable with earlier tests.
Anyway, try Numba on your own, maybe it will be an interesting option for you.
You may find using numpy.where useful here. It converts a Boolean mask to array indices, making life much cheaper. Combining this with numpy.vstack allows for some memory-cheap operations:
def method3():
wh = np.where(array_condition == 1)
return np.vstack(tuple(col[wh] for col in (array1, array2, array3)))
This gives the following timeits:
>>> %timeit method2()
180 ms ± 6.66 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
>>> %timeit method3()
96.9 ms ± 2.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Tuple unpacking allows the operation to be fairly light on memory, as when the object is vstack-ed back together, it is smaller. If you need to get your columns out of a DataFrame directly, the following code snippet may be useful:
def method3b():
wh = np.where(array_condition == 1)
col_names = ['array1','array2','array3']
return np.vstack(tuple(col[wh] for col in tuple(df[col_name].to_numpy()
for col_name in col_names)))
This allows one to grab columns by name from the DataFrame, which are then tuple unpacked on the fly. The speed is about the same:
>>> %timeit method3b()
96.6 ms ± 3.09 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
Enjoy!

How to optimize a numpy loop that sums values from an array which is indexed by another array where values equal the loop index

I have this piece of code that is called multiple times during the run of the application.
It takes an array of numbers which represent values (value_array).
These should be summed up in zones, which are defined in the zone_array.
zone_ids represents a list of all the possible zones in zone_array.
Its basically something in the lines of: i got a population raster map and i want to know how many people live in each zone of the zone map.
the code:
values = np.zeros(len(zone_ids))
for i in zone_ids:
values[i] = round(np.nansum(value_array[zone_array == i]), 2)
return values
The culprit seems to be the for loop, but i have not found a way to eliminate it and have the same results.
I tried it with bincount but i did not succeed.
Using numba jit also has no effect.
I would like to stay away from cython as this code will be used in a Qgis plugin which has no cython support.
test code:
import numpy as np
def fill_values(zone_array, value_array, zone_ids):
values = np.zeros(len(zone_ids))
for i in zone_ids:
values[i] = round(np.nansum(value_array[zone_array == i]), 2)
return values
def run():
# 300 different zones
zone_ids = range(300)
# zone map with 300 zones
zone_array = (np.random.rand(2000, 2000) * 300).astype(int)
# value map from which we want the sum of values per zone (real map can have NaN values)
value_array = (np.random.rand(2000, 2000) * 10.)
value_array[5, 5] = np.NAN
fill_values(zone_array, value_array, zone_ids)
if __name__ == '__main__':
run()
1.92 s ± 17.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With the implementation of bincount as suggested by Divakar :
203 ms ± 15.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
With a direct usage of bincount, you would have NaNs in the summations. So, you can simply replace the NaNs with zeros and use bincount. This should be much faster, being a vectorized solution.
Hence, the implementation would be -
val_nonan = np.where(np.isnan(value_array), 0, value_array)
out = np.round(np.bincount(zone_array.ravel(), val_nonan.ravel()),2)

What's the fastest way to acces a Pandas DataFrame?

I have a DataFrame df with 541 columns, and I need to save all unique pairs of its column names into the rows of a separate DataFrame, repeated 8 times each.
I thought I would create an empty DataFrame fp, double loop through df's column names, insert into every 8th row, and fill in the blanks with the last available value.
When I tried to do this though I was baffled by how long it's taking. With 541 columns I only have to write 146,611 times yet it's taking well over 20 minutes. This seems egregious for just data access. Where is the problem and how can I solve it? It takes less time than that for Pandas to produce a correlation matrix with the columns so I must me doing something wrong.
Here's a reproducible example of what I mean:
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx
# 1 loop, best of 3: 22.3 s per loop
Don't do iloc/loc/chained-indexing. Using the NumPy interface alone increases speed by ~180x. If you further remove element access, we can bump this to 180,000x.
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
# this confirms how slow data access is on my computer
%timeit for idx in range(0, len(fp)): fp.iloc[idx, 0] = idx
1 loops, best of 3: 3min 9s per loop
# this accesses the underlying NumPy array, so you can directly set the data
%timeit for idx in range(0, len(fp)): fp.values[idx, 0] = idx
1 loops, best of 3: 1.19 s per loop
This is because there's extensive code that goes in the Python layer for this fancing indexing, taking ~10µs per loop. Using Pandas indexing should be done to retrieve entire subsets of data, which you then use to do vectorized operations on the entire dataframe. Individual element access is glacial: using Python dictionaries will give you a > 180 fold increase in performance.
Things get a lot better when you access columns or rows instead of individual elements: 3 orders of magnitude better.
# set all items in 1 go.
%timeit fp[0] = np.arange(146611)
1000 loops, best of 3: 814 µs per loop
Moral
Don't try to access individual elements via chained indexing, loc, or iloc. Generate a NumPy array in a single allocation, from a Python list (or a C-interface if performance is absolutely critical), and then perform operations on entire columns or dataframes.
Using NumPy arrays and performing operations directly on columns rather than individual elements, we got a whopping 180,000+ fold increase in performance. Not too shabby.
Edit
Comments from #kushy suggest Pandas may have optimized indexing in certain cases since I originally wrote this answer. Always profile your own code, and your mileage may vary.
Alexander's answer was the fastest for me as of 2020-01-06 when using .is_numpy() instead of .values. Tested in Jupyter Notebook on Windows 10. Pandas version = 0.24.2
import numpy as np
import pandas as pd
fp = np.empty(shape = (146611, 10))
fp.fill(np.nan)
fp = pd.DataFrame(fp)
pd.__version__ # '0.24.2'
def func1():
# Asker badmax solution
for idx in range(0, len(fp)):
fp.iloc[idx, 0] = idx
def func2():
# Alexander Huszagh solution 1
for idx in range(0, len(fp)):
fp.to_numpy()[idx, 0] = idx
def func3():
# user4322543 answer to
# https://stackoverflow.com/questions/34855859/is-there-a-way-in-pandas-to-use-previous-row-value-in-dataframe-apply-when-previ
new = []
for idx in range(0, len(fp)):
new.append(idx)
fp[0] = new
def func4():
# Alexander Huszagh solution 2
fp[0] = np.arange(146611)
%timeit func1
19.7 ns ± 1.08 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func2
19.1 ns ± 0.465 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func3
21.1 ns ± 3.26 ns per loop (mean ± std. dev. of 7 runs, 500000000 loops each)
%timeit func4
24.7 ns ± 0.889 ns per loop (mean ± std. dev. of 7 runs, 50000000 loops each)

Categories