I have a situation where I am creating a pivot table in pandas where it makes more sense to calculate the fields separately and just use .pivot_table() for the pivot step. However, I am running into some difficulty trying to calculate the denominator for my percentages. Essentially, due to the data format, I appear to need something like a "groupby transform unique sum" on the second line below (which is where I am stuck):
df['numerator'] = df.groupby(['category1','category2'])['customer_id'].transform('nunique')
df['denominator'] = df.groupby(['category2'])['numerator'].nunique().transform('sum')
df['percentage'] = (df['numerator'] / df['denominator'])
df_pivot = df.pivot_table(index='category1',
                          columns=['category2'],
                          values=['numerator','percentage']) \
             .swaplevel(0,1,axis=1)
df_pivot.loc['total', :] = df_pivot.sum().values
My apologies for not being able to provide any dummy data, but I hope I have provided enough detail to reason about; I would appreciate any tips.
I believe you need a lambda function with unique and sum:
import pandas as pd

df = pd.DataFrame({'numerator':[3,1,1,9,2,2],
                   'category2':list('aaabbb')})

df['denominator'] = df.groupby(['category2'])['numerator'].transform(lambda x: x.unique().sum())
An alternative solution with set and sum:
df['denominator'] = df.groupby(['category2'])['numerator'].transform(lambda x: sum(set(x)))
print(df)
  category2  numerator  denominator
0         a          3            4
1         a          1            4
2         a          1            4
3         b          9           11
4         b          2           11
5         b          2           11
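To plug that denominator back into the original pipeline, a minimal sketch (the category1/category2/customer_id names come from the question; the sample values here are made up):
import pandas as pd

df = pd.DataFrame({'category1': list('xxyyxy'),
                   'category2': list('aaabbb'),
                   'customer_id': [1, 2, 2, 3, 4, 5]})

# unique customers per (category1, category2) pair, broadcast to every row
df['numerator'] = df.groupby(['category1', 'category2'])['customer_id'].transform('nunique')

# sum of the distinct per-pair counts within each category2
df['denominator'] = df.groupby('category2')['numerator'].transform(lambda x: x.unique().sum())

df['percentage'] = df['numerator'] / df['denominator']
One caveat: if two category1 groups inside the same category2 happen to share the same numerator, unique() collapses them into one value; summing over df.drop_duplicates(['category1', 'category2']) instead avoids that edge case.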
I have a DF where I'd like to create a new column with the difference of 2 other column values.
name  rate  avg_rate
A       10         3
B        6         5
C        4         3
I wrote this code to calculate the difference:
result = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate)
df['rate_diff'] = result.reset_index(drop=True)
df.tail(3)
But I notice that some of the values calculated are NaNs. What is the best way to handle this?
Output I am getting:
name  rate  avg_rate  rate_diff
A       10         3       NaN
B        6         5       NaN
C        4         3       NaN
If you want to use groupby and apply, then the following should work:
# align the per-group result back to the original row index, then merge on index
res = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate).reset_index().set_index('level_1')
df = pd.merge(df, res.drop(columns='name'), left_index=True, right_index=True).rename({0: 'rate_diff'}, axis=1)
However, as @sacuL suggested in the comments, you don't need groupby to calculate the difference: simply subtracting the columns element-wise gives the same result, and groupby/apply is overkill for this simple task.
df["rate_diff"] = df.rate - df.avg_rate
I'm trying to do arithmetic among different cells in my dataframe and can't figure out how to operate on each of my groups. I'm trying to find the difference in energy_use between a baseline building (in this example upgrade_name == b is the baseline case) and each upgrade, for each building. I have an arbitrary number of building_id's and arbitrary number of upgrade_names.
I can do this successfully for a single building_id. Now I need to expand this out to a full dataset and am stuck. I will have 10's of thousands of buildings and dozens of upgrades for each building.
The answer to this question Iterating within groups in Pandas may be related, but I'm not sure how to apply it to my problem.
I have a dataframe like this:
df = pd.DataFrame({'building_id': [1,2,1,2,1], 'upgrade_name': ['a', 'a', 'b', 'b', 'c'], 'energy_use': [100.4, 150.8, 145.1, 136.7, 120.3]})
In [4]: df
Out[4]:
   building_id upgrade_name  energy_use
0            1            a       100.4
1            2            a       150.8
2            1            b       145.1
3            2            b       136.7
4            1            c       120.3
For a single building_id I have the following code:
upgrades = df.loc[df.building_id == 1, ['upgrade_name', 'energy_use']]
starting_point = upgrades.loc[upgrades.upgrade_name == 'b', 'energy_use']
upgrades['diff'] = upgrades.energy_use - starting_point.values[0]
In [8]: upgrades
Out[8]:
  upgrade_name  energy_use  diff
0            a       100.4 -44.7
2            b       145.1   0.0
4            c       120.3 -24.8
How do I write this for arbitrary numbers of building_id's, instead of my hard-coded building_id == 1?
The ideal solution looks like this (doesn't matter if the baseline differences are 0 or NaN):
In [17]: df
Out[17]:
   building_id upgrade_name  energy_use  ideal
0            1            a       100.4  -44.7
1            2            a       150.8   14.1
2            1            b       145.1    0.0
3            2            b       136.7    0.0
4            1            c       120.3  -24.8
Define the function computing the difference in energy usage (for a group of rows for the current building) as follows:
def euDiff(grp):
    # baseline consumption: the row whose upgrade_name is 'b'
    euBase = grp[grp.upgrade_name == 'b'].energy_use.values[0]
    return grp.energy_use - euBase
Then compute the difference (for all buildings), applying it to each group:
df['ideal'] = df.groupby('building_id').apply(euDiff)\
.reset_index(level=0, drop=True)
The result is just as you expected.
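For what it's worth, a fully vectorized sketch of the same idea, no apply needed (assuming each building has exactly one baseline row with upgrade_name == 'b'):
# keep energy_use only on baseline rows, NaN elsewhere
baseline = df['energy_use'].where(df['upgrade_name'] == 'b')

# 'first' skips NaN, so every row receives its building's baseline value
df['ideal'] = df['energy_use'] - baseline.groupby(df['building_id']).transform('first')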
thanks for sharing that example data! Made things a lot easier.
I suggest solving this in two parts:
1. Make a dictionary from your dataframe that contains the baseline energy use for each building
2. Apply a lambda function to your dataframe to subtract each energy use value from the baseline value associated with that building.
# set index to building_id, turn into dictionary, filter out energy use
building_baseline = df[df['upgrade_name'] == 'b'].set_index('building_id').to_dict()['energy_use']
# apply lambda to dataframe, use axis=1 to access rows
df['diff'] = df.apply(lambda row: row['energy_use'] - building_baseline[row['building_id']], axis=1)
You could also write a function to do this. You also don't necessarily need the dictionary, it just makes things easier. If you're curious about these alternative solutions let me know and I can add them for you.
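If you're curious, the row-wise apply can also be replaced by a vectorized .map lookup once the dictionary exists, which tends to be faster on large frames (a sketch reusing the building_baseline dict above):
# map each building_id to its baseline, then subtract column-wise
df['diff'] = df['energy_use'] - df['building_id'].map(building_baseline)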
I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine:
df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
(Before and after screenshots from the original post are omitted here: "Before" is what I have now, "After" is what I'm trying to achieve.)
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The following will retain both columns and count how many times each (task_id, duration) pair occurs:
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd

X = np.random.choice([0,1,2], 20)
Y = np.random.uniform(2,10,20)
df = pd.DataFrame({'task_id':X, 'duration':Y})
Calculate pct:
# merge each row with its task's summed duration (the _x/_y suffixes appear here)
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y'])*100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to keep it
Result:
    duration_x  task_id        pct
0     8.751517        0  58.017921
1     6.332645        0  41.982079
2     8.828693        1   9.865355
3     2.611285        1   2.917901
4     5.806709        1   6.488531
5     8.045490        1   8.990189
6     6.285593        1   7.023645
7     7.932952        1   8.864436
8     7.440938        1   8.314650
9     7.272948        1   8.126935
10    9.162262        1  10.238092
11    7.834692        1   8.754639
12    7.989057        1   8.927129
13    3.795571        1   4.241246
14    6.485703        1   7.247252
15    5.858985        2  21.396850
16    9.024650        2  32.957771
17    3.885288        2  14.188966
18    5.794491        2  21.161322
19    2.819049        2  10.295091
Disclaimer: all data is randomly generated in the setup; however, the calculations are straightforward and should be correct for any case.
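The same per-row percentage can also be computed without the merge via transform, a sketch under the same setup:
# per-row share of its task's total duration; no merge or renamed columns
df['pct'] = df['duration'] / df.groupby('task_id')['duration'].transform('sum') * 100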
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
# find each task_id as a relative percentage of the whole
df1['pct'] = df1['duration'] / df1['duration'].sum()
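Two side notes on the original error, sketched below assuming df is the five-field frame: taking an explicit .copy() of the slice silences the SettingWithCopyWarning, and transform('sum') gives the per-task total on every row if you'd rather not collapse the frame:
# explicit copy: adding columns later no longer triggers SettingWithCopyWarning
df1 = df[['task_id', 'duration']].copy()

# broadcast each task's total back onto its rows instead of aggregating
df1['total'] = df1.groupby('task_id')['duration'].transform('sum')
df1['pct'] = df1['total'] / df1['duration'].sum()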
I want to add an aggregate, grouped, nunique column to my pandas dataframe but not aggregate the entire dataframe. I'm trying to do this in one line and avoid creating a new aggregated object and merging that, etc.
My df has track, type, and id. I want the number of unique ids for each track/type combination as a new column in the table (but not collapse track/type combos in the resulting df). Same number of rows, one more column.
Something like this isn't working:
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].nunique()
nor is
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(nunique)
This last one works with some aggregating functions but not others. The following works (but is meaningless on my dataset):
df['n_unique_id'] = df.groupby(['track', 'type'])['id'].transform(sum)
In R this is easily done in data.table with
df[, n_unique_id := uniqueN(id), by = c('track', 'type')]
thanks!
df.groupby(['track', 'type'])['id'].transform(nunique)
This implies that there is a name nunique in the namespace that performs some function. transform will take a function, or a string that it knows a function for, and 'nunique' is definitely one of those strings.
As pointed out by @root, the methods pandas uses to perform the transformations indicated by these strings are often optimized and should generally be preferred over passing your own functions. This is true even for passing numpy functions in some cases.
For example, transform('sum') should be preferred over transform(sum).
Try this instead
df.groupby(['track', 'type'])['id'].transform('nunique')
demo
import pandas as pd

df = pd.DataFrame(dict(
    track=list('11112222'), type=list('AAAABBBB'), id=list('XXYZWWWW')))
print(df)
  id track type
0  X     1    A
1  X     1    A
2  Y     1    A
3  Z     1    A
4  W     2    B
5  W     2    B
6  W     2    B
7  W     2    B
df.groupby(['track', 'type'])['id'].transform('nunique')
0 3
1 3
2 3
3 3
4 1
5 1
6 1
7 1
Name: id, dtype: int64
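To see the namespace point concretely: passing the unbound Series method gives the same result as the string, though the string form lets pandas dispatch to its optimized path (a sketch reusing the demo frame above):
s_str = df.groupby(['track', 'type'])['id'].transform('nunique')
s_fun = df.groupby(['track', 'type'])['id'].transform(pd.Series.nunique)
print(s_str.equals(s_fun))  # True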
If I have a pandas dataframe such as:
timestamp  label  value  new
...        a      1      3.5
...        b      2      5
...        a      5      ...
...        b      6      ...
...        a      2      ...
...        b      4      ...
I want the new column to be the average of the last two a's and the last two b's... so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm kinda new to Python and coding, so this might not be possible.
Edit: I should also mention this is not for a class or anything; it's just for something I'm doing on my own, and it will be on a very large dataset. I'm just using this as an example. Also, I would want each a and each b to have its own value for the last-2 average, so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the dataset.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
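If you want that value repeated on every row (the same dimensionality the edit asks for), a one-line transform sketch, assuming the label/value frame from the question:
# each row receives the mean of the last two values for its label
df['new'] = df.groupby('label')['value'].transform(lambda s: s.tail(2).mean())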
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a','b','a','b','a','b'], 'value': [1,2,5,6,2,4]}
df = pd.DataFrame(data)

grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}

for item, grp in grouped:
    # take the mean of the last two values in this group
    subset_mean = grp['value'].tail(2).mean()
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)

res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
  label  tail_mean
0     a        3.5
1     b        5.0
>> df
  label  value  tail_mean
0     a      1        3.5
1     b      2        5.0
2     a      5        3.5
3     b      6        5.0
4     a      2        3.5
5     b      4        5.0
Now you have a dataframe of just the results, if you need it, plus the column merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; there is probably no reason to do it the longer way shown here unless you also need to perform more operations like this that could go inside the same loop.