pandas aggregate by functions - python

I have data like below:
id movie details value
5 cane1 good 6
5 wind2 ok 30.3
5 wind1 ok 18
5 cane1 good 2
5 cane22 ok 4
5 cane34 good 7
5 wind2 ok 2
I want the output to meet the below criteria:
If the movie name starts with 'cane' - sum the value.
If the movie name starts with 'wind' - count the occurrences.
So - the final output will be:
id movie value
5 cane1 8
5 cane22 4
5 cane34 7
5 wind1 1
5 wind2 2
I tried to use:
movie_df.groupby(['id']).apply(aggr)

def aggr(x):
    if x['movie'].str.startswith('cane'):
        y = x.groupby(['value']).sum()
    else:
        y = x.groupby(['movie']).count()
    return y
But it's not working. Can anyone please help?

Your if fails because x['movie'].str.startswith('cane') returns a boolean Series, whose truth value is ambiguous, so pandas raises an error. You should aim for vectorised operations where possible: calculate the two results separately and then concatenate them.
import pandas as pd

mask = df['movie'].str.startswith('cane')
df1 = df[mask].groupby('movie')['value'].sum()   # sum of value for 'cane' movies
df2 = df[~mask].groupby('movie').size()          # occurrence count for 'wind' movies
res = pd.concat([df1, df2], ignore_index=False)\
         .rename('value').reset_index()
print(res)
movie value
0 cane1 8.0
1 cane22 4.0
2 cane34 7.0
3 wind1 1.0
4 wind2 2.0
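The desired output in the question also carries the id column; if you need it, one option (a sketch of the same approach) is to group on both keys:
df1 = df[mask].groupby(['id', 'movie'])['value'].sum()
df2 = df[~mask].groupby(['id', 'movie']).size()
res = pd.concat([df1, df2]).rename('value').reset_index()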

There might be multiple ways of doing this. One way would be to filter by the start of the movie name first, then aggregate each subset and concatenate afterwards.
cane = movie_df[movie_df['movie'].str.startswith('cane')]
wind = movie_df[movie_df['movie'].str.startswith('wind')]
cane_sum = cane.groupby(['id', 'movie']).agg({'value': 'sum'}).reset_index()
wind_count = wind.groupby(['id', 'movie']).agg({'value': 'count'}).reset_index()
pd.concat([cane_sum, wind_count])
With the sample data this reproduces the desired output from the question.

First of all, you need a string operation. I guess in your case you don't want the digits in the movie name, so use the solution discussed in "pandas applying regex to replace values" and then call groupby() on the new series.
FYI: some movie names may consist of digits only; in that case you need the update function: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.update.html
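A minimal sketch of the string step described above, assuming the digits should simply be stripped from the movie name before grouping (the column names are taken from the question's data):
# Strip the digits so 'cane1' and 'cane22' both map to 'cane'.
base = movie_df['movie'].str.replace(r'\d+', '', regex=True)
movie_df.groupby(base)['value'].sum()  # aggregate each group as needed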

I would start by creating a column which defines the required groups. For the example at hand this can be done with
df['group'] = df.movie.transform(lambda x: x[:4])  # take the first four characters: 'cane' or 'wind'
The next step would be to group by this column
df.groupby('group').apply(agg_fun)
using the following aggregation function
def agg_fun(grp):
    if grp.name == "cane":
        value = grp.value.sum()
    else:
        value = grp.value.count()
    return value
The output of this code is
group
cane 19.0
wind 3.0
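Note that this yields one row per prefix, while the desired output in the question keeps one row per movie. If you need that, a possible variant (a sketch, checked only against the sample data) is to group by both columns; the group key then becomes a tuple:
def agg_fun(grp):
    # grp.name is a (group, movie) tuple when grouping on two columns
    if grp.name[0] == "cane":
        return grp.value.sum()
    return grp.value.count()

df.groupby(['group', 'movie']).apply(agg_fun)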

Related

Pandas Groupby with lambda gives some NANs

I have a DF where I'd like to create a new column with the difference of 2 other column values.
name rate avg_rate
A 10 3
B 6 5
C 4 3
I wrote this code to calculate the difference :
result = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate)
df['rate_diff'] = result.reset_index(drop=True)
df.tail(3)
But I notice that some of the values calculated are NANs. What is the best way to handle this?
Output I am getting:
name rate avg_rate rate_diff
A 10 3 NaN
B 6 5 NaN
C 4 3 NaN
If you want to use groupby and apply, the following should work. (The NaNs in your version come from index misalignment: the groupby result carries its own index, which does not line up with df's index when you assign it back.)
res = df.groupby(['name']).apply(lambda g: g.rate - g.avg_rate).reset_index().set_index('level_1')
df = pd.merge(df, res[[0]], left_index=True, right_index=True).rename({0: 'rate_diff'}, axis=1)
However, as @sacuL suggested in the comments, you don't need groupby to calculate the difference: you get it by simply subtracting the columns side by side, and a groupby-apply is overkill for this simple task.
df["rate_diff"] = df.rate - df.avg_rate
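For the sample frame above, a quick check (a minimal sketch):
import pandas as pd

df = pd.DataFrame({'name': ['A', 'B', 'C'],
                   'rate': [10, 6, 4],
                   'avg_rate': [3, 5, 3]})
df['rate_diff'] = df.rate - df.avg_rate
# rate_diff comes out as 7, 1, 1 for A, B, C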

Moving pandas series value by switching column name?

I have a DF, however the last value of some series should be placed in a different one. This happened because column names were not standardized - i.e., some are "Wx_y_x_PRED" and some are "Wx_x_y_PRED". I'm having difficulty writing a function that will simply find the columns with >= 225 NaN's and change the column the last value is assigned to.
I've written a function that for some reason will sometimes work and sometimes won't. When it does, it further creates approx 850 columns in its wake (the OG dataframe is around 420 with the duplicate columns). I'm hoping to have something that just reassigns the value. If it automatically deletes the incorrect column, that's awesome too, but I just used .dropna(thresh = 2) when my function worked originally.
Here's what it looks like originally:
in: df = pd.DataFrame(data = {'W10_IND_JAC_PRED': ['NaN','NaN','NaN','NaN','NaN',2],
'W10_JAC_IND_PRED': [1,2,1,2,1,'NAN']})
out:df
W10_IND_JAC_PRED W10_JAC_IND_PRED
0 NaN 1
1 NaN 2
2 NaN 1
3 NaN 2
4 NaN 1
5 2 NaN
I wrote this, which occasionally works but most of the time doesn't and i'm not sure why.
def switch_cols(x):
    """Takes mismatched columns (where only the last value != NaN) and changes the order of team names in the column label"""
    if x.isna().sum() == 5:
        col_string = x.name.split('_')
        col_to_switch = '_'.join([col_string[0], col_string[2], col_string[1], 'PRED'])
        df[col_to_switch]['row_name'] = x[-1]  # chained indexing assigns to a copy, which is likely why this only sometimes works
    else:
        pass
    return x
Most of the time it just returns to me the exact same DF, but this is the desired outcome.
W10_IND_JAC_PRED W10_JAC_IND_PRED
0 NaN 1
1 NaN 2
2 NaN 1
3 NaN 2
4 NaN 1
5 2 2
Anyone have any tips or could share why my function works maybe 10% of the time?
Edit:
so this is an ugly "for" loop I wrote that works. I know there has to be a much more pythonic way of doing this while preserving original column names, though.
for i in range(df.shape[1]):
    if df.iloc[:, i].isna().sum() == 5:
        split_nan_col = df.columns[i].split('_')
        correct_col_name = '_'.join([split_nan_col[0], split_nan_col[2], split_nan_col[1], split_nan_col[3]])
        df.loc[5, correct_col_name] = df.loc[5, df.columns[i]]
    else:
        pass
Split the column names on '_' and map them to frozenset (which ignores word order), then join them back: columns whose names differ only in the order of the words collapse to the same label. Notice this solution can be applied to more columns.
df.columns = df.columns.str.split('_').map(frozenset).map('_'.join)
df.mask(df=='NaN').groupby(level=0, axis=1).first()  # first returns the first non-null value per row across the duplicate columns
PRED_JAC_W10_IND
0 1
1 2
2 1
3 2
4 1
5 2

How can I keep all columns in a dataframe, plus add groupby, and sum?

I have a data frame with 5 fields. I want to copy 2 fields from this into a new data frame. This works fine:
df1 = df[['task_id','duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The following code will retain the columns and aggregate per group (note it counts occurrences rather than summing):
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import numpy as np
import pandas as pd

X = np.random.choice([0, 1, 2], 20)
Y = np.random.uniform(2, 10, 20)
df = pd.DataFrame({'task_id': X, 'duration': Y})
Calculate pct:
# merging on task_id adds the per-task sum as a second duration column (suffix _y)
df = pd.merge(df, df.groupby('task_id').agg('sum').reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y']) * 100
df = df.drop('duration_y', axis=1)  # drops the sum duration; remove this line if you want to keep it
Result:
duration_x task_id pct
0 8.751517 0 58.017921
1 6.332645 0 41.982079
2 8.828693 1 9.865355
3 2.611285 1 2.917901
4 5.806709 1 6.488531
5 8.045490 1 8.990189
6 6.285593 1 7.023645
7 7.932952 1 8.864436
8 7.440938 1 8.314650
9 7.272948 1 8.126935
10 9.162262 1 10.238092
11 7.834692 1 8.754639
12 7.989057 1 8.927129
13 3.795571 1 4.241246
14 6.485703 1 7.247252
15 5.858985 2 21.396850
16 9.024650 2 32.957771
17 3.885288 2 14.188966
18 5.794491 2 21.161322
19 2.819049 2 10.295091
Disclaimer: all data is randomly generated in the setup; however, the calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})
list(df1)  # check the remaining column names
# find each task_id's duration as a relative percentage of the whole
df1['pct'] = df1['duration'] / (df1['duration'].sum())
df1 = pd.DataFrame(df1)  # redundant, as df1 is already a DataFrame
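If the goal is to keep every original row while adding the group total, pandas transform is also worth knowing (a sketch of the same calculation; it broadcasts the per-group sum back onto each row instead of collapsing the frame):
df['total'] = df.groupby('task_id')['duration'].transform('sum')
df['pct'] = df['duration'] / df['total'] * 100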

Extrapolate a row by quantity in a pandas DataFrame

I've been wondering how to solve the following problem. Say I have a dataframe df, which looks like this:
Name quantity price
A 1 10.0
A 3 26.0
B 1 15.0
B 3 30.0
...
Now, say I wanted to extrapolate the price by quantity and, for each Name, create a row for quantity = 1, 2, 3, where the price is some function of the list of available quantities and respective prices. I.e., say I have a function extrapolate(qts, prices, n) that computes a price for quantity = n based on the known qts and prices; then the result would look like:
Name quantity price
A 1 10.0
A 2 extrapolate([1, 3], [10.0, 26.0], 2)
A 3 26.0
B 1 15.0
B 2 extrapolate([1, 3], [15.0, 30.0], 2)
B 3 30.0
...
I would appreciate some insight on how to achieve this, or a pointer to learn more about how groupby can be used for this case.
Thank you in advance.
What you want is called missing data imputation. There are many approaches to it.
You may want to check the package called fancyimpute. It offers imputation using MICE, which seems to do what you want.
Other than that, if your case is as simple in structure as the example, you can always groupby('Name').mean() and you will get the middle value for each subgroup.
The following should do what you described:
def get_extrapolate_val(group, qts, prices, n):
    # do your actual calculations here; for now it returns just a dummy value
    some_value = (group[qts] * group[prices]).sum() / n
    return some_value
# some definitions
n = 2
quan_col = 'quantity'
price_col = 'price'
First we group by Name and then apply the function get_extrapolate_val to each group, passing the additional column names and n as arguments. As this returns a Series object, we need an additional reset_index and rename, which will make the concatenation easier.
new_stuff = df.groupby('Name').apply(get_extrapolate_val, quan_col, price_col, n).reset_index().rename(columns={0: price_col})
Add n as additional column
new_stuff[quan_col] = n
We concatenate the two dataframes and are done
final_df = pd.concat([df, new_stuff]).sort_values(['Name', quan_col]).reset_index(drop=True)
Name price quantity
0 A 10.0 1
1 A 44.0 2
2 A 26.0 3
3 B 15.0 1
4 B 52.5 2
5 B 30.0 3
The values I now added in are of course meaningless but are just there to illustrate the method.
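If a simple linear fill is acceptable, one possible body for get_extrapolate_val is numpy's interp (an assumption about what extrapolate should do; note that np.interp clamps at the endpoints, so it interpolates within the observed quantities rather than truly extrapolating beyond them):
import numpy as np

def get_extrapolate_val(group, qts, prices, n):
    # linear interpolation of price at quantity n from the known (quantity, price) pairs;
    # assumes the quantities within each group are sorted in ascending order
    return np.interp(n, group[qts], group[prices])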
OLD version
Assuming that there are always only the quantities 1 and 3 in your quantity column, the following should work:
new_stuff = df.groupby('Name', as_index=False)['price'].mean()
This gives
Name price
0 A 18.0
1 B 22.5
That - as written - assumes that it is always only 1 and 3, so we can simply calculate the mean.
Then we add the 2
new_stuff['quantity'] = 2
and concatenate the two dataframes with an additional sorting
pd.concat([df, new_stuff]).sort_values(['Name', 'quantity']).reset_index(drop=True)
which gives the desired outcome
Name price quantity
0 A 10.0 1
1 A 18.0 2
2 A 26.0 3
3 B 15.0 1
4 B 22.5 2
5 B 30.0 3
There are probably far more elegant ways to do this though...

Slice column in pandas dataframe and average results

If I have a pandas dataframe such as:
timestamp label value new
etc. a 1 3.5
b 2 5
a 5 ...
b 6 ...
a 2 ...
b 4 ...
I want the new column to be the average of the last two a's and the last two b's... so for the first row it would be the average of 5 and 2, giving 3.5. It will be sorted by the timestamp. I know I could use a groupby to get the average of all the a's or all the b's, but I'm not sure how to get an average of just the last two. I'm kinda new to Python and coding, so this might not be possible, idk.
Edit: I should also mention this is not for a class or anything this is just for something I'm doing on my own and that this will be on a very large dataset. I'm just using this as an example. Also I would want each A and each B to have its own value for the last 2 average so the dimension of the new column will be the same as the others. So for the third line it would be the average of 2 and whatever the next a would be in the data set.
IIUC one way (among many) to do that:
In [139]: df.groupby('label').tail(2).groupby('label').mean().reset_index()
Out[139]:
label value
0 a 3.5
1 b 5.0
Edited to reflect a change in the question specifying the last two, not the ones following the first, and that you wanted the same dimensionality with values repeated.
import pandas as pd

data = {'label': ['a', 'b', 'a', 'b', 'a', 'b'], 'value': [1, 2, 5, 6, 2, 4]}
df = pd.DataFrame(data)
grouped = df.groupby('label')
results = {'label': [], 'tail_mean': []}
for item, grp in grouped:
    subset_mean = grp['value'].tail(2).mean()  # mean of the last two rows in this group
    results['label'].append(item)
    results['tail_mean'].append(subset_mean)
res_df = pd.DataFrame(results)
df = df.merge(res_df, on='label', how='left')
Outputs:
>> res_df
label tail_mean
0 a 3.5
1 b 5.0
>> df
label value tail_mean
0 a 1 3.5
1 b 2 5.0
2 a 5 3.5
3 b 6 5.0
4 a 2 3.5
5 b 4 5.0
Now you have a dataframe of just the results, if you need it, plus a column merged back into the main dataframe. Someone else posted a more succinct way to get to the results dataframe; there is probably no reason to do it the longer way shown here unless you also need to perform more operations of this kind inside the same loop.
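For reference, a more succinct route to the same merged column (a sketch using the same sample data) computes the tail-2 mean per label and maps it back onto every row:
tail_means = df.groupby('label')['value'].apply(lambda s: s.tail(2).mean())
df['tail_mean'] = df['label'].map(tail_means)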
