Pandas groupby and change/reassign one element - python

I want to groupby given dataframe, and then, for each group, for given column p overwrite the value of its last element (of each group) to 1 - sum(p[:-1]) (with sum being the sum of all the elements apart from the last one).
Note that after performing the operation, the sum of all values in p for each group is equal to 1.
For example, for this input dataframe (grouping by c1 and c2):
c1 c2 p
0 x a 0.4
1 y a 0.2
2 x a 0.3
3 y b 0.6
the expected output would be:
c1 c2 p
0 x a 0.4
1 y a 1.0
2 x a 0.6
3 y b 1.0
I managed to perform the operation using for loop:
for _, g in df.groupby(['c1', 'c2']):
df.loc[g.tail(1).index, 'p'] = 1 - g['p'][:-1].sum()
but I am looking for more elegant way of doing this, without explicitly looping through each group.
I tried this:
>>> df.loc[df.groupby(['c1', 'c2']).tail(1).index, 'p']
1 0.2
2 0.3
3 0.6
>>> 1 - df.groupby(['c1', 'c2']).apply(lambda x: x.iloc[:-1].sum())['p']
c1 c2
x a 0.6
y a 1.0
b 1.0
But I don't really know how to assemble those outputs given that their indices differ.

Here's a possible one-line solution:
df.groupby(['c1', 'c2']).apply(
lambda x: x.assign(p=x['p'][:-1].tolist()+[1 - x.iloc[:-1].sum()['p']])
).reset_index(level=[0,1], drop=True)
To make the above code more readable, here is a nearly equivalent version of my one-line solution:
def func(row):
result = 1 - row.iloc[:-1].sum()['p']
row['p'].iloc[-1] = result
return row
df.groupby(['c1', 'c2']).apply(func)
With those solutions in mind, I am not entirely sure why you don't want to use the .groupby call in an explicit python for-loop. My hunch is that an explicit python for-loop would be more than adequate, but I don't know your specific use case/data. I would highly recommend doing some speed comparisons using %timeit with your specific data, as I think that will help shed light on which approach you ultimately end up using.

Related

Vector arithmetic by conditional selection from multiple columns in a dataframe

I'm trying to do arithmetic among different cells in my dataframe and can't figure out how to operate on each of my groups. I'm trying to find the difference in energy_use between a baseline building (in this example upgrade_name == b is the baseline case) and each upgrade, for each building. I have an arbitrary number of building_id's and arbitrary number of upgrade_names.
I can do this successfully for a single building_id. Now I need to expand this out to a full dataset and am stuck. I will have 10's of thousands of buildings and dozens of upgrades for each building.
The answer to this question Iterating within groups in Pandas may be related, but I'm not sure how to apply it to my problem.
I have a dataframe like this:
df = pd.DataFrame({'building_id': [1,2,1,2,1], 'upgrade_name': ['a', 'a', 'b', 'b', 'c'], 'energy_use': [100.4, 150.8, 145.1, 136.7, 120.3]})
In [4]: df
Out[4]:
building_id upgrade_name energy_use
0 1 a 100.4
1 2 a 150.8
2 1 b 145.1
3 2 b 136.7
4 1 c 120.3
For a single building_id I have the following code:
upgrades = df.loc[df.building_id == 1, ['upgrade_name', 'energy_use']]
starting_point = upgrades.loc[upgrades.upgrade_name == 'b', 'energy_use']
upgrades['diff'] = upgrades.energy_use - starting_point.values[0]
In [8]: upgrades
Out[8]:
upgrade_name energy_use diff
0 a 100.4 -44.7
2 b 145.1 0.0
4 c 120.3 -24.8
How do I write this for arbitrary numbers of building_id's, instead of my hard-coded building_id == 1?
The ideal solution looks like this (doesn't matter if the baseline differences are 0 or NaN):
In [17]: df
Out[17]:
building_id upgrade_name energy_use ideal
0 1 a 100.4 -44.7
1 2 a 150.8 14.1
2 1 b 145.1 0.0
3 2 b 136.7 0.0
4 1 c 120.3 -24.8
Define the function counting the difference in energy usage (for
a group of rows for the current building) as follows:
def euDiff(grp):
euBase = grp[grp.upgrade_name == 'b'].energy_use.values[0]
return grp.energy_use - euBase
Then compute the difference (for all buildings), applying it to each group:
df['ideal'] = df.groupby('building_id').apply(euDiff)\
.reset_index(level=0, drop=True)
The result is just as you expected.
thanks for sharing that example data! Made things a lot easier.
I suggest solving this in two parts:
1. Make a dictionary from your dataframe that contains that baseline energy use for each building
2. Apply a lambda function to your dataframe to subtract each energy use value from the baseline value associated with that building.
# set index to building_id, turn into dictionary, filter out energy use
building_baseline = df[df['upgrade_name'] == 'b'].set_index('building_id').to_dict()['energy_use']
# apply lambda to dataframe, use axis=1 to access rows
df['diff'] = df.apply(lambda row: row['energy_use'] - building_baseline[row['building_id']])
You could also write a function to do this. You also don't necessarily need the dictionary, it just makes things easier. If you're curious about these alternative solutions let me know and I can add them for you.

Conditionally replace values in pandas.DataFrame with previous value

I need to filter outliers in a dataset. Replacing the outlier with the previous value in the column makes the most sense in my application.
I was having considerable difficulty doing this with the pandas tools available (mostly to do with copies on slices, or type conversions occurring when setting to NaN).
Is there a fast and/or memory efficient way to do this? (Please see my answer below for the solution I am currently using, which also has limitations.)
A simple example:
>>> import pandas as pd
>>> df = pd.DataFrame({'A':[1,2,3,4,1000,6,7,8],'B':list('abcdefgh')})
>>> df
A B
0 1 a
1 2 b
2 3 c
3 4 d
4 1000 e # '1000 e' --> '4 e'
5 6 f
6 7 g
7 8 h
You can simply mask values over your threshold and use ffill:
df.assign(A=df.A.mask(df.A.gt(10)).ffill())
A B
0 1.0 a
1 2.0 b
2 3.0 c
3 4.0 d
4 4.0 e
5 6.0 f
6 7.0 g
7 8.0 h
Using mask is necessary rather than something like shift, because it guarantees non-outlier output in the case that the previous value is also above a threshold.
I circumvented some of the issues with pandas copies and slices by converting to a numpy array first, doing the operations there, and then re-inserting the column. I'm not certain, but as far as I can tell, the datatype is the same once it is put back into the pandas.DataFrame.
def df_replace_with_previous(df,col,maskfunc,inplace=False):
arr = np.array(df[col])
mask = maskfunc(arr)
arr[ mask ] = arr[ list(mask)[1:]+[False] ]
if inplace:
df[col] = arr
return
else:
df2 = df.copy()
df2[col] = arr
return df2
This creates a mask, shifts it down by one so that the True values point at the previous entry, and updates the array. Of course, this will need to run recursively if there are multiple adjacent outliers (N times if there are N consecutive outliers), which is not ideal.
Usage in the case given in OP:
df_replace_with_previous(df,'A',lambda x:x>10,False)

Applying Operation to Pandas column if other column meets criteria

I'm relatively new to Python and totally new to Pandas, so my apologies if this is really simple. I have a dataframe, and I want to operate over all elements in a particular column, but only if a different column with the same index meets a certain criteria.
float_col int_col str_col
0 0.1 1 a
1 0.2 2 b
2 0.2 6 None
3 10.1 8 c
4 NaN -1 a
For example, if the value in float_col is greater than 5, I want to multiply the value in in_col (in the same row) by 2. I'm guessing I'm supposed to use one of the map apply or applymap functions, but I'm not sure which, or how.
There might be more elegant ways to do this, but once you understand how to use things like loc to get at a particular subset of your dataset, you can do it like this:
df.loc[df['float_col'] > 5, 'int_col'] = df.loc[df['float_col'] > 5, 'int_col'] * 2
You can also do it a bit more succinctly like this, since pandas is smart enough to match up the results based on the index of your dataframe and only use the relevant data from the df['int_col'] * 2 expression:
df.loc[df['float_col'] > 5, 'int_col'] = df['int_col'] * 2

Subtract subgroup averages from individuals without resorting to for loop

I have a dataframe with a number of columns, two of which are grouping variables.
>>> df2
Groupvar1 Groupvar2 x y z
0 A 1 0.726317 0.574514 0.700475
1 A 2 0.422089 0.798931 0.191157
2 A 3 0.888318 0.658061 0.686496
....
13 B 2 0.978920 0.764266 0.673941
14 B 3 0.759589 0.162488 0.698958
and I want to make a new dataframe which holds the diffrence between each datapoint in the origianl df and the mean corresponding to its subgroup.
So to begin with a make the new df with the grouped averages:
>>> grp_vars = ['Groupvar1','Groupvar2']
>>> df2_grp = df2.groupby(grp_vars)
>>> df2_grp_avg = df2_grp.mean()
>>> df2_grp_avg
x y z
Groupvar1 Groupvar2
A 1 0.364533 0.645237 0.886286
2 0.325533 0.500077 0.246287
3 0.796326 0.496950 0.510085
4 0.774854 0.688732 0.487547
B 1 0.743783 0.452482 0.612006
2 0.575687 0.396902 0.446126
3 0.473152 0.476379 0.508060
4 0.434320 0.406458 0.382187
and in the new dtaframe I want to keep the deltas, defined as:
delta = individual value - average value of the subgroup this individual is a member of
Now, it's clear to me how to do this the hard way (for loop) but I supose there must be a more elegant solution. Apprecaite any advice on finding that more elegant solution. TIA.
Use .groupby(...).transform function:
>>> demean = lambda df: df - df.mean()
>>> df.groupby(['Groupvar1', 'Groupvar2']).transform(demean)
ant then pd.concat the result with the original data-frame.

Normalize data in pandas

Suppose I have a pandas data frame df:
I want to calculate the column wise mean of a data frame.
This is easy:
df.apply(average)
then the column wise range max(col) - min(col). This is easy again:
df.apply(max) - df.apply(min)
Now for each element I want to subtract its column's mean and divide by its column's range. I am not sure how to do that
Any help/pointers are much appreciated.
In [92]: df
Out[92]:
a b c d
A -0.488816 0.863769 4.325608 -4.721202
B -11.937097 2.993993 -12.916784 -1.086236
C -5.569493 4.672679 -2.168464 -9.315900
D 8.892368 0.932785 4.535396 0.598124
In [93]: df_norm = (df - df.mean()) / (df.max() - df.min())
In [94]: df_norm
Out[94]:
a b c d
A 0.085789 -0.394348 0.337016 -0.109935
B -0.463830 0.164926 -0.650963 0.256714
C -0.158129 0.605652 -0.035090 -0.573389
D 0.536170 -0.376229 0.349037 0.426611
In [95]: df_norm.mean()
Out[95]:
a -2.081668e-17
b 4.857226e-17
c 1.734723e-17
d -1.040834e-17
In [96]: df_norm.max() - df_norm.min()
Out[96]:
a 1
b 1
c 1
d 1
If you don't mind importing the sklearn library, I would recommend the method talked on this blog.
import pandas as pd
from sklearn import preprocessing
data = {'score': [234,24,14,27,-74,46,73,-18,59,160]}
cols = data.columns
df = pd.DataFrame(data)
df
min_max_scaler = preprocessing.MinMaxScaler()
np_scaled = min_max_scaler.fit_transform(df)
df_normalized = pd.DataFrame(np_scaled, columns = cols)
df_normalized
You can use apply for this, and it's a bit neater:
import numpy as np
import pandas as pd
np.random.seed(1)
df = pd.DataFrame(np.random.randn(4,4)* 4 + 3)
0 1 2 3
0 9.497381 0.552974 0.887313 -1.291874
1 6.461631 -6.206155 9.979247 -0.044828
2 4.276156 2.002518 8.848432 -5.240563
3 1.710331 1.463783 7.535078 -1.399565
df.apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.515087 0.133967 -0.651699 0.135175
1 0.125241 -0.689446 0.348301 0.375188
2 -0.155414 0.310554 0.223925 -0.624812
3 -0.484913 0.244924 0.079473 0.114448
Also, it works nicely with groupby, if you select the relevant columns:
df['grp'] = ['A', 'A', 'B', 'B']
0 1 2 3 grp
0 9.497381 0.552974 0.887313 -1.291874 A
1 6.461631 -6.206155 9.979247 -0.044828 A
2 4.276156 2.002518 8.848432 -5.240563 B
3 1.710331 1.463783 7.535078 -1.399565 B
df.groupby(['grp'])[[0,1,2,3]].apply(lambda x: (x - np.mean(x)) / (np.max(x) - np.min(x)))
0 1 2 3
0 0.5 0.5 -0.5 -0.5
1 -0.5 -0.5 0.5 0.5
2 0.5 0.5 0.5 -0.5
3 -0.5 -0.5 -0.5 0.5
Slightly modified from: Python Pandas Dataframe: Normalize data between 0.01 and 0.99? but from some of the comments thought it was relevant (sorry if considered a repost though...)
I wanted customized normalization in that regular percentile of datum or z-score was not adequate. Sometimes I knew what the feasible max and min of the population were, and therefore wanted to define it other than my sample, or a different midpoint, or whatever! This can often be useful for rescaling and normalizing data for neural nets where you may want all inputs between 0 and 1, but some of your data may need to be scaled in a more customized way... because percentiles and stdevs assumes your sample covers the population, but sometimes we know this isn't true. It was also very useful for me when visualizing data in heatmaps. So i built a custom function (used extra steps in the code here to make it as readable as possible):
def NormData(s,low='min',center='mid',hi='max',insideout=False,shrinkfactor=0.):
if low=='min':
low=min(s)
elif low=='abs':
low=max(abs(min(s)),abs(max(s)))*-1.#sign(min(s))
if hi=='max':
hi=max(s)
elif hi=='abs':
hi=max(abs(min(s)),abs(max(s)))*1.#sign(max(s))
if center=='mid':
center=(max(s)+min(s))/2
elif center=='avg':
center=mean(s)
elif center=='median':
center=median(s)
s2=[x-center for x in s]
hi=hi-center
low=low-center
center=0.
r=[]
for x in s2:
if x<low:
r.append(0.)
elif x>hi:
r.append(1.)
else:
if x>=center:
r.append((x-center)/(hi-center)*0.5+0.5)
else:
r.append((x-low)/(center-low)*0.5+0.)
if insideout==True:
ir=[(1.-abs(z-0.5)*2.) for z in r]
r=ir
rr =[x-(x-0.5)*shrinkfactor for x in r]
return rr
This will take in a pandas series, or even just a list and normalize it to your specified low, center, and high points. also there is a shrink factor! to allow you to scale down the data away from endpoints 0 and 1 (I had to do this when combining colormaps in matplotlib:Single pcolormesh with more than one colormap using Matplotlib) So you can likely see how the code works, but basically say you have values [-5,1,10] in a sample, but want to normalize based on a range of -7 to 7 (so anything above 7, our "10" is treated as a 7 effectively) with a midpoint of 2, but shrink it to fit a 256 RGB colormap:
#In[1]
NormData([-5,2,10],low=-7,center=1,hi=7,shrinkfactor=2./256)
#Out[1]
[0.1279296875, 0.5826822916666667, 0.99609375]
It can also turn your data inside out... this may seem odd, but I found it useful for heatmapping. Say you want a darker color for values closer to 0 rather than hi/low. You could heatmap based on normalized data where insideout=True:
#In[2]
NormData([-5,2,10],low=-7,center=1,hi=7,insideout=True,shrinkfactor=2./256)
#Out[2]
[0.251953125, 0.8307291666666666, 0.00390625]
So now "2" which is closest to the center, defined as "1" is the highest value.
Anyways, I thought my application was relevant if you're looking to rescale data in other ways that could have useful applications to you.
This is how you do it column-wise:
[df[col].update((df[col] - df[col].min()) / (df[col].max() - df[col].min())) for col in df.columns]

Categories