In the following dataframe I would like to subtract, for each user, the value of u in the row where t is at its maximum from the maximum value of u. So it should be 21 (max of u) - 18 (u at max of t).
The dataframe is grouped by ['user','t']
user  t     u
1     0.00  -1.14
      2.30   2.80
      2.37   9.20
      2.40  21
      2.45  18
2     ...    ...
If t weren't part of the index, I would have used something like df.groupby().agg({'u': 'max'}) and df.groupby().agg({'t': 'max'}), but since it is, I don't know how I could use agg() on t.
Edit:
I found out that I can use df.reset_index(level=['t'], inplace=True) to turn t into a column, but now I realise that if I used df.groupby(['user']).agg({'t': 'max'}), the corresponding u values would be missing.
The goal is to create a new dataframe that contains the values like this:
user  (U_max - U_tmax)
1     3
2     ...
Let's start by re-creating a dataframe similar to yours, with the below code:
import pandas as pd
import numpy as np
cols = ['user', 't', 'u']
df = pd.DataFrame(columns=cols)
size = 10
df['user'] = np.random.randint(1,3, size=size)
df['t'] = np.random.uniform(0.0,3.0, size=size)
df = df.groupby(['user','t']).sum()
df['u'] = np.random.randint(-30,30, size=len(df))
print(df)
The output is something like:
               u
user t
1    0.545562  19
     0.627296  23
     0.945533 -13
     1.697278 -18
     1.904453 -10
     2.008375   5
     2.296342  -2
2    0.282291  14
     1.461548  -6
     2.594966 -19
The first thing we'll need to do in order to work on this df is to reset the index, so:
df = df.reset_index()
Now we have all our columns back and we can use them to apply our final groupby() function.
We can start by grouping by user, which is what we need, selecting u and t as columns so that we can access them in a lambda function.
In this lambda function, we will subtract the u value corresponding to the max of t from the max value of u.
So, the max value of u must be something like:
x['u'].max()
And the u value at the max of t should look like:
x['u'][x['t'].idxmax()]
So, as you can see, we've found the index of the max value of t and used it to slice x['u'].
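For anyone unfamiliar with idxmax(), here is a tiny standalone demonstration on a made-up frame: it returns the index label of the maximum, which can then be used to look up another column.

```python
import pandas as pd

# hypothetical toy frame, just to illustrate idxmax()
df = pd.DataFrame({'t': [0.1, 0.9, 0.5], 'u': [10, -2, 7]})

i = df['t'].idxmax()  # index label of the row where t is largest -> 1
print(df['u'][i])     # u at that row -> -2
```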
Here is the final code:
df = df.reset_index()
df = df.groupby('user')[['u', 't']].apply(lambda x: x['u'].max() - x['u'][x['t'].idxmax()])
print(df)
Final output:
user
1    25
2    33
Gross Error Check:
max of u for user 1 is 23
max of t for user 1 is 2.296342, and the corresponding u is -2
23 - (-2) = 25
max of u for user 2 is 14
max of t for user 2 is 2.594966, and the corresponding u is -19
14 - (-19) = 33
Bonus tip: If you'd like to rename the returned column from groupby, use reset_index() along with set_index() after the groupby operation:
df = df.reset_index(name='(U_max - U_tmax)').set_index('user')
It will yield:
      (U_max - U_tmax)
user
1                   25
2                   33
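As a cross-check, the same result can be computed without a lambda by taking the row index of the per-user max of t directly (a sketch on a small hand-made frame matching the gross error check above):

```python
import pandas as pd

df = pd.DataFrame({'user': [1, 1, 1, 2, 2],
                   't':    [0.5, 2.0, 2.4, 1.0, 2.6],
                   'u':    [23, 5, -2, 14, -19]})

idx = df.groupby('user')['t'].idxmax()           # row index of the max t per user
u_at_tmax = df.loc[idx].set_index('user')['u']   # u at that row, indexed by user
result = df.groupby('user')['u'].max() - u_at_tmax
print(result)  # user 1 -> 25, user 2 -> 33
```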
Below is a dataframe showing coordinate values from and to, each row having a corresponding value column.
I want to find the range of coordinates where the value column doesn't exceed 5. Below is the dataframe input.
import pandas as pd
From = [10, 20, 30, 40, 50, 60, 70]
to = [20, 30, 40, 50, 60, 70, 80]
value = [2, 3, 5, 6, 1, 3, 1]
df = pd.DataFrame({'from': From, 'to': to, 'value': value})
print(df)
Hence I want to convert the table above (the output of print(df)):
   from  to  value
0    10  20      2
1    20  30      3
2    30  40      5
3    40  50      6
4    50  60      1
5    60  70      3
6    70  80      1
to the following outcome:
   from  to  value
0    10  30      5
1    30  40      5
2    40  50      6
3    50  80      5
Further explanation:
Coordinates from 10 to 30 are joined, and the value column becomes 5,
the sum of the values from 10 to 30 (not exceeding 5)
Coordinates 30 to 40 equal 5
Coordinates 40 to 50 equal 6 (more than 5; however, it is included as it cannot be divided further)
The remaining coordinates sum up to a value of 5
What code is required to achieve the above?
We can do a groupby on cumsum:
s = df['value'].ge(5)
(df.groupby([~s, s.cumsum()], as_index=False, sort=False)
   .agg({'from': 'min', 'to': 'max', 'value': 'sum'}))
Output:
   from  to  value
0    10  30      5
1    30  40      5
2    40  50      6
3    50  80      5
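To see why this grouping works, it helps to print the two grouping keys for the question's data (a quick sketch): `~s` marks rows that may be merged with neighbours, while `s.cumsum()` increments at each standalone row, splitting the runs apart.

```python
import pandas as pd

From = [10, 20, 30, 40, 50, 60, 70]
to = [20, 30, 40, 50, 60, 70, 80]
value = [2, 3, 5, 6, 1, 3, 1]
df = pd.DataFrame({'from': From, 'to': to, 'value': value})

s = df['value'].ge(5)        # True where a row's value is already >= 5
print((~s).tolist())         # [True, True, False, False, True, True, True]
print(s.cumsum().tolist())   # [0, 0, 1, 2, 2, 2, 2]
```

Rows sharing the same (~s, cumsum) pair land in the same group, which yields exactly the four output rows above.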
Update: It looks like you want to accumulate the values so the new groups do not exceed 5. There are several threads on SO saying that this can only be done with a for loop. So we can do something like this:
thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:
        curr_grp += 1
        partial = v
    else:
        partial += v
    groups.append(curr_grp)

df.groupby(groups).agg({'from': 'min', 'to': 'max', 'value': 'sum'})
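Running the loop on the question's data (a quick sanity check, assuming the same df as above) produces the group labels [1, 1, 2, 3, 4, 4, 4] and reproduces the desired output:

```python
import pandas as pd

df = pd.DataFrame({'from': [10, 20, 30, 40, 50, 60, 70],
                   'to':   [20, 30, 40, 50, 60, 70, 80],
                   'value': [2, 3, 5, 6, 1, 3, 1]})

thresh = 5
groups, partial, curr_grp = [], thresh, 0
for v in df['value']:
    if partial + v > thresh:   # adding v would push the running sum past the threshold
        curr_grp += 1          # so start a new group
        partial = v
    else:
        partial += v
    groups.append(curr_grp)

print(groups)  # [1, 1, 2, 3, 4, 4, 4]
out = df.groupby(groups).agg({'from': 'min', 'to': 'max', 'value': 'sum'})
print(out)
```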
I have two dataframes and I want to find the difference between them based on a condition. What I mean is the following:
df_ref_well:
zone  depth
a     34
b     23
c     11
d     35
e     -9999
df_well:
zone  depth
a     17
c     15
d     25
f     11
What I want is to generate df3 with the zone name and the difference between the depths of the same zones in df_well and df_ref_well:
df3 = well - ref_well (for the same zones)
zone  depth
a     17
b     -9999
c     -4
d     10
e     -9999
I have tried to iterate through the dfs separately to identify the matching zones and, where they are equal, find the difference:
ref_well_zone_count = len(df_ref_well.iloc[:, 0])
well_zone_count = len(df_well.iloc[:, 0])
delta_depth = []
for ref_zone in range(ref_well_zone_count):
    for well_zone in range(well_zone_count):
        if df_ref_well.iloc[ref_zone, 0] == df_well.iloc[well_zone, 0]:
            delta_depth.append(df_well.iloc[well_zone, 1] - df_ref_well.iloc[ref_zone, 1])
The problem is that I can't insert the results into a new column; when I try adding delta_depth as a column, it raises:
ValueError: Length of values does not match length of index
But if I print out the results, the values are calculated perfectly.
You didn't specify what you want to do if there is no match, so I will assume no match means depth = 0.
Link the 2 dataframes together using merge; rows without a match are then filled with 0:
df3 = pd.merge(ref_well,df_well, on=['zone'], how='outer').fillna(0)
Calculate the difference and put it back
df3['diff'] = df3.depth_x - df3.depth_y
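Putting it together with the question's data (a sketch; note that, per the assumption above, unmatched zones such as b and e come out as differences against 0 rather than as -9999):

```python
import pandas as pd

df_ref_well = pd.DataFrame({'zone': ['a', 'b', 'c', 'd', 'e'],
                            'depth': [34, 23, 11, 35, -9999]})
df_well = pd.DataFrame({'zone': ['a', 'c', 'd', 'f'],
                        'depth': [17, 15, 25, 11]})

# outer merge keeps zones from both frames; missing depths become 0
df3 = pd.merge(df_ref_well, df_well, on=['zone'], how='outer').fillna(0)
df3['diff'] = df3.depth_x - df3.depth_y
print(df3[['zone', 'diff']])
```

For the matched zones this reproduces the question's expected values: a -> 17, c -> -4, d -> 10.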
I have a data frame with 5 fields. I want to copy 2 of them into a new data frame, which works fine: df1 = df[['task_id', 'duration']]
Now in this df1, when I try to group by task_id and sum duration, the task_id field drops off.
Before (what I have now).
After (what I'm trying to achieve).
So, for instance, I'm trying this:
df1['total'] = df1.groupby(['task_id'])['duration'].sum()
The result is:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
I don't know why I can't just sum the values in a column and group by unique IDs in another column. Basically, all I want to do is preserve the original two columns (['task_id', 'duration']), sum duration, and calculate a percentage of duration in a new column named pct. This seems like a very simple thing but I can't get anything working. How can I get this straightened out?
The code below will take care of retaining the columns and getting the per-group counts:
df[['task_id', 'duration']].groupby(['task_id', 'duration']).size().reset_index(name='counts')
Setup:
import pandas as pd
import numpy as np

X = np.random.choice([0, 1, 2], 20)
Y = np.random.uniform(2, 10, 20)
df = pd.DataFrame({'task_id': X, 'duration': Y})
Calculate pct:
df = pd.merge(df, df.groupby('task_id').agg({'duration': 'sum'}).reset_index(), on='task_id')
df['pct'] = df['duration_x'].divide(df['duration_y']) * 100
df = df.drop('duration_y', axis=1)  # drops the summed duration; remove this line if you want to see it
Result:
    duration_x  task_id        pct
0     8.751517        0  58.017921
1     6.332645        0  41.982079
2     8.828693        1   9.865355
3     2.611285        1   2.917901
4     5.806709        1   6.488531
5     8.045490        1   8.990189
6     6.285593        1   7.023645
7     7.932952        1   8.864436
8     7.440938        1   8.314650
9     7.272948        1   8.126935
10    9.162262        1  10.238092
11    7.834692        1   8.754639
12    7.989057        1   8.927129
13    3.795571        1   4.241246
14    6.485703        1   7.247252
15    5.858985        2  21.396850
16    9.024650        2  32.957771
17    3.885288        2  14.188966
18    5.794491        2  21.161322
19    2.819049        2  10.295091
Disclaimer: all data is randomly generated in the setup; however, the calculations are straightforward and should be correct for any case.
I finally got everything working in the following way.
# group by and sum durations
df1 = df1.groupby('task_id', as_index=False).agg({'duration': 'sum'})

# find each task_id as a relative percentage of the whole
df1['pct'] = df1['duration'] / df1['duration'].sum()
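As a side note, the per-group totals can also be attached without a merge by using groupby().transform('sum'), which returns sums aligned to the original rows; taking an explicit .copy() of the slice also avoids the SettingWithCopy warning. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'task_id': [1, 1, 2],
                   'duration': [10.0, 30.0, 60.0],
                   'other': ['x', 'y', 'z']})

df1 = df[['task_id', 'duration']].copy()  # explicit copy avoids the warning
df1['total'] = df1.groupby('task_id')['duration'].transform('sum')  # sum aligned per row
df1['pct'] = df1['duration'] / df1['total']
print(df1)
```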
import numpy as np
import pandas as pd

my_df = pd.DataFrame(np.arange(1, 13).reshape(4, 3), columns=list('abc'))
my_df.sum(axis="rows")
The output is:
a    22
b    26
c    30
I expect it to sum by rows, thereby giving:
0     6
1    15
2    24
3    33
my_df.sum(axis="columns") achieves this.
Why does it work counterintuitively?
In a similar context, the drop method works as I expect, i.e. when I write my_df.drop(['a'], axis="columns"), it drops column "a".
Am I missing something? Please enlighten me.
Short version
It is a naming convention. Summing along axis='columns' gives a row-wise sum. You are looking for axis='columns'.
Long version
Ok, that was interesting. In pandas, 0 normally refers to the index (rows) and 1 to the columns.
However looking in the docs we find that the allowed params are:
axis : {index (0), columns (1)}
The "rows" you passed is treated as axis 0, the index, so you get the default behaviour. The naming can thus be read as: summing over the index returns the column sums, and summing over the columns returns the row sums. What you want is axis=1 or axis='columns', which gives your desired output:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.arange(1,13).reshape(4,3), columns=list('abc'))
print(df.sum(axis=1))
Returns:
0     6
1    15
2    24
3    33
dtype: int64
I have a requirement to sort a table by date, starting from the oldest. The total field is created by grouping on the name and kind fields and applying a sum. Now for each row I need to calculate the remaining time within the same name-kind group.
The csv looks like that:
date      name  kind  duration  total  remaining
1-1-2017  a     1     10        100    ? should be 90
2-1-2017  b     1     5         35     ? should be 30
3-1-2017  a     2     3         50     ? should be 47
4-1-2017  b     2     1         25     ? should be 24
5-1-2017  a     1     8         100    ? should be 82
6-1-2017  b     1     2         35     ? should be 33
7-1-2017  a     2     3         50     ? should be 44
8-1-2017  b     2     6         25     ? should be 18
...
My question is how do I calculate the remaining value while having the DataFrame grouped by name and kind?
My initial approach was to shift the column and add the shifted values to duration, like this:
df['temp'] = df.groupby(['name', 'kind'])['duration'].apply(lambda x: x.shift() + x)
and then:
df['duration'] = df.apply(lambda x: x['total'] - x['temp'], axis=1)
But it did not work as expected.
Is there a clean way to do it, or is using iloc, ix, or loc somehow the way to go?
Thanks.
You could do something like:
df["cumsum"] = df.groupby(['name', 'kind'])["duration"].cumsum()
df["remaining"] = df["total"] - df["cumsum"]
Just be careful to reset the index first if name and kind are part of it.