How to assign a different row's value to a new column - Python

I'm trying to add a column, 'C_End', to a DataFrame in Pandas that looks something like this:
import pandas as pd

df = pd.DataFrame({'ID': [123, 123, 123, 456, 456, 789],
                   'C_ID': [8, 10, 35, 36, 40, 7],
                   'C_Type': ['New', 'Renew', 'Renew', 'New', 'Term', 'New'],
                   'Rank': [1, 2, 3, 1, 2, 1]})
The new column needs to be the next 'C_Type' for each ID based on 'Rank', resulting in a DataFrame that looks like this:
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 None
3 456 36 New 1 Term
4 456 40 Term 2 None
5 789 7 New 1 None
Essentially, I want to find the row where ID = ID and Rank = Rank + 1 and assign its C_Type to the new column C_End. I've tried creating a function and using apply (below), but that took forever and eventually gave me an error. I'm still new to pandas and Python in general, but I feel like there has to be an easy solution that I'm not seeing.
def get_next_c_type(row):
    return df.loc[(df['ID'] == row['ID']) & (df['Rank'] == row['Rank'] + 1), 'C_Type']

df['c_end'] = df.apply(get_next_c_type, axis=1)

Try:
df['C_End'] = df.sort_values('Rank').groupby('ID')['C_Type'].transform('shift', -1)
Or, as @W-B suggests:
df['C_End'] = df.sort_values('Rank').groupby('ID')['C_Type'].shift(-1)
Output:
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 NaN
3 456 36 New 1 Term
4 456 40 Term 2 NaN
5 789 7 New 1 NaN

Here's one way using np.where:
import numpy as np

# assumes df is already sorted by ID and Rank, so the "next" row is the row below
dfs = df.shift(-1)
m1 = df.ID == dfs.ID
m2 = df.Rank + 1 == dfs.Rank
df.loc[:, 'C_End'] = np.where(m1 & m2, dfs.C_Type, None)
ID C_ID C_Type Rank C_End
0 123 8 New 1 Renew
1 123 10 Renew 2 Renew
2 123 35 Renew 3 None
3 456 36 New 1 Term
4 456 40 Term 2 None
5 789 7 New 1 None
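The question's "ID = ID and Rank = Rank + 1" lookup can also be expressed as a self-merge. A minimal sketch, assuming the original df from the question (the helper name nxt is just for illustration):
import pandas as pd

# Take each row's C_Type, relabel it C_End, and shift its Rank down by one,
# so merging on (ID, Rank) attaches the next rank's C_Type to each row.
nxt = (df[['ID', 'Rank', 'C_Type']]
       .assign(Rank=lambda d: d['Rank'] - 1)
       .rename(columns={'C_Type': 'C_End'}))
df = df.merge(nxt, on=['ID', 'Rank'], how='left')
Rows with no matching next rank come out as NaN rather than None, which is equivalent for this purpose.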

Related

Perform operations on a dataframe from groupings by ID

I have the following dataframe in Python:
ID  maths  value
 0    add     12
 1    sub     30
 0    add     10
 2   mult      3
 0    sub     10
 1    add     11
 3    sub     40
 2    add     21
My idea is to perform the following operations to get the result I want:
First step: group the rows of the dataframe by ID. The order of the groups should follow the order in which each ID first appears in the original dataframe.
ID  maths  value
 0    add     12
 0    add     10
 0    sub     10
 1    sub     30
 1    add     11
 2   mult      3
 2    add     21
 3    sub     40
Second step: for each group created, fill a new column 'result' by applying the mathematical operation from the previous row's 'maths' column to the previous row's 'value' and the current row's 'value'. If there is no previous row in the group, this column takes the value NaN.
ID  maths  value  result
 0    add     12     NaN
 0    add     10      22
 0    sub     10      20
 1    sub     30     NaN
 1    add     11      19
 2   mult      3     NaN
 2    add     21      63
 3    sub     40     NaN
Third step: Return the resulting dataframe.
I have tried to implement this using the pandas groupby method, but I have trouble iterating with conditions over each row and each group, and I don't know how to create the new 'result' column on a groupby object.
grouped_df = testing.groupby('ID')
for key, item in grouped_df:
    print(grouped_df.get_group(key))
I don't know whether to use orderby or groupby or some other method that works for what I want to do. If you can help me with a better idea, I'd appreciate it.
import pandas as pd

ID = list("00011223")
maths = ["add", "add", "sub", "sub", "add", "mult", "add", "sub"]
value = [12, 10, 10, 30, 11, 3, 21, 40]
df = pd.DataFrame(list(zip(ID, maths, value)), columns=["ID", "Maths", "Value"])

# Shift Maths and Value down by one within each ID group, so every row carries
# the previous row's operation and value (the first row of each group gets NaN).
df["Maths"] = df.groupby(["ID"]).pipe(lambda g: g.Maths.shift(1)).fillna("add")
df["Value1"] = df.groupby(["ID"]).pipe(lambda g: g.Value.shift(1))

# Apply each operation to its own sub-group and stitch the pieces back together
# in the original row order. (Series.append was removed in pandas 2.0;
# pd.concat is the modern replacement.)
df["result"] = df.groupby(["Maths"]).pipe(lambda x: (x.get_group("add")["Value1"] + x.get_group("add")["Value"]).append(
    x.get_group("sub")["Value1"] - x.get_group("sub")["Value"]).append(
    x.get_group("mult")["Value1"] * x.get_group("mult")["Value"])).sort_index()
Here is the Output:
df
Out[168]:
ID Maths Value Value1 result
0 0 add 12 NaN NaN
1 0 add 10 12.0 22.0
2 0 add 10 10.0 20.0
3 1 add 30 NaN NaN
4 1 sub 11 30.0 19.0
5 2 add 3 NaN NaN
6 2 mult 21 3.0 63.0
7 3 add 40 NaN NaN
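A more vectorized alternative, as a hedged sketch (it assumes the original, unshifted Maths column, so it is shown on a fresh copy of the data): shift the previous value and operation within each ID group, then dispatch the arithmetic with np.select.
import numpy as np
import pandas as pd

df2 = pd.DataFrame({"ID": list("00011223"),
                    "Maths": ["add", "add", "sub", "sub", "add", "mult", "add", "sub"],
                    "Value": [12, 10, 10, 30, 11, 3, 21, 40]})

# Previous row's value and operation within each ID group.
prev_val = df2.groupby("ID")["Value"].shift(1)
prev_op = df2.groupby("ID")["Maths"].shift(1)

# Apply the previous row's operation to (previous value, current value);
# rows with no previous row keep NaN.
df2["result"] = np.select(
    [prev_op.eq("add"), prev_op.eq("sub"), prev_op.eq("mult")],
    [prev_val + df2["Value"], prev_val - df2["Value"], prev_val * df2["Value"]],
    default=np.nan,
)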

pandas aggregate and join dataframe during group by

I have a data frame:
id parentid score body
1 10 10 abc
2 10 0 xyz
3 10 1 efg
4 23 3 afd
5 23 2 asfagr
6 34 1 wrqqw
I need to groupby(parentid), then aggregate score by mean and join body. The id field is not relevant; it can be aggregated with min or max.
The result should be:
id parentid score body
1 10 3 abc xyz efg
4 23 2 afd asfagr
6 34 1 wrqqw
I have tried:
def f(x):
    x['Id'] = x['Id']
    x['ParentId'] = x['ParentId']
    x['Score'] = x['Score'].min()  # change to max/min/mean to get different results!
    x['Body'] = " ".join(x['Body'])
    return x

temp = temp.groupby("ParentId").apply(f)
temp = temp.reset_index()
It gives the correct result, but since the dataset size is > 1.8 GB, the system becomes unresponsive.
I have tried it in Google Colab too; it has crashed 3 times.
Please suggest a faster method, such as lambda functions or anything else.
Try this using groupby with agg and a dictionary to define aggregations for each column:
df.groupby('parentid', as_index=False)[['score', 'body']]\
.agg({'score':'mean', 'body':' '.join})
Output:
parentid score body
0 10 3.666667 abc xyz efg
1 23 2.500000 afd asfagr
2 34 1.000000 wrqqw
Or try (column names capitalized to match the attempted code above):
import numpy as np

temp.groupby("ParentId").agg({"Score": np.mean, "Body": lambda x: " ".join(x)})
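If you also want to keep an id per group, as the question's "it can be changed to min or max" note suggests, one more entry in the aggregation dictionary should do it; a small sketch using the lowercase column names from the first snippet:
out = (df.groupby('parentid', as_index=False)
         .agg({'id': 'min', 'score': 'mean', 'body': ' '.join}))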

Trying to group by, then sort a dataframe based on multiple values [duplicate]

Suppose I have pandas DataFrame like this:
df = pd.DataFrame({'id':[1,1,1,2,2,2,2,3,4], 'value':[1,2,3,1,2,3,4,1,1]})
which looks like:
id value
0 1 1
1 1 2
2 1 3
3 2 1
4 2 2
5 2 3
6 2 4
7 3 1
8 4 1
I want to get a new DataFrame with top 2 records for each id, like this:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
I can do it by numbering the records within each group after groupby:
dfN = df.groupby('id').apply(lambda x:x['value'].reset_index()).reset_index()
which looks like:
id level_1 index value
0 1 0 0 1
1 1 1 1 2
2 1 2 2 3
3 2 0 3 1
4 2 1 4 2
5 2 2 5 3
6 2 3 6 4
7 3 0 7 1
8 4 0 8 1
then for the desired output:
dfN[dfN['level_1'] <= 1][['id', 'value']]
Output:
id value
0 1 1
1 1 2
3 2 1
4 2 2
7 3 1
8 4 1
But is there a more efficient/elegant approach to do this? And is there also a more elegant way to number records within each group (like the SQL window function row_number())?
Did you try
df.groupby('id').head(2)
Output generated:
id value
id
1 0 1 1
1 1 2
2 3 2 1
4 2 2
3 7 3 1
4 8 4 1
(Keep in mind that you might need to order/sort before, depending on your data)
EDIT: As mentioned by the questioner, use
df.groupby('id').head(2).reset_index(drop=True)
to remove the MultiIndex and flatten the results:
id value
0 1 1
1 1 2
2 2 1
3 2 2
4 3 1
5 4 1
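For the row_number()-style numbering the question also asks about, groupby().cumcount() is the usual pandas equivalent. A minimal sketch, assuming df is already ordered the way you want within each id:
df['row_number'] = df.groupby('id').cumcount() + 1
top2 = df.loc[df['row_number'] <= 2, ['id', 'value']]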
Since 0.14.1, you can now do nlargest and nsmallest on a groupby object:
In [23]: df.groupby('id')['value'].nlargest(2)
Out[23]:
id
1 2 3
1 2
2 6 4
5 3
3 7 1
4 8 1
dtype: int64
There's a slight weirdness that you get the original index in there as well, but this might be really useful depending on what your original index was.
If you're not interested in it, you can do .reset_index(level=1, drop=True) to get rid of it altogether.
(Note: From 0.17.1 you'll be able to do this on a DataFrameGroupBy too but for now it only works with Series and SeriesGroupBy.)
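For example, a small sketch of turning that back into a plain two-column frame like the desired output:
out = (df.groupby('id')['value'].nlargest(2)
         .reset_index(level=1, drop=True)
         .reset_index())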
Sometimes sorting the whole dataset ahead of time is very time consuming.
We can group first and take the top k rows for each group:
topk = 2  # number of rows to keep per group
g = df.groupby(['id']).apply(lambda x: x.nlargest(topk, ['value'])).reset_index(drop=True)
df.groupby('id').apply(lambda x: x.sort_values(by='value', ascending=False).head(2).reset_index(drop=True))
Here, sort_values with ascending=False behaves like nlargest, and ascending=True behaves like nsmallest.
The value passed to head is the same as the value passed to nlargest: the number of rows to return for each group.
reset_index is optional and not necessary.
This works for duplicated values
If you have duplicated values among the top-n values and want only unique values, you can do it like this:
import pandas as pd
ifile = "https://raw.githubusercontent.com/bhishanpdl/Shared/master/data/twitter_employee.tsv"
df = pd.read_csv(ifile,delimiter='\t')
print(df.query("department == 'Audit'")[['id','first_name','last_name','department','salary']])
id first_name last_name department salary
24 12 Shandler Bing Audit 110000
25 14 Jason Tom Audit 100000
26 16 Celine Anston Audit 100000
27 15 Michale Jackson Audit 70000
If we do not remove duplicates, the top 3 salaries for the Audit department are 110k, 100k and 100k.
If we want non-duplicated salaries for each department, we can do this:
(df.groupby('department')['salary']
.apply(lambda ser: ser.drop_duplicates().nlargest(3))
.droplevel(level=1)
.sort_index()
.reset_index()
)
This gives
department salary
0 Audit 110000
1 Audit 100000
2 Audit 70000
3 Management 250000
4 Management 200000
5 Management 150000
6 Sales 220000
7 Sales 200000
8 Sales 150000
To get the first N rows of each group, another way is via groupby().nth[:N]. The outcome of this call is the same as groupby().head(N). For example, for the top-2 rows for each id, call:
N = 2
df1 = df.groupby('id', as_index=False).nth[:N]
To get the largest N values of each group, I suggest two approaches.
First sort by "id" and "value" (make sure to sort "id" in ascending order and "value" in descending order by using the ascending parameter appropriately) and then call groupby().nth[].
N = 2
df1 = df.sort_values(by=['id', 'value'], ascending=[True, False])
df1 = df1.groupby('id', as_index=False).nth[:N]
Another approach is to rank the values of each group and filter using these ranks.
# for the entire rows
N = 2
msk = df.groupby('id')['value'].rank(method='first', ascending=False) <= N
df1 = df[msk]
# for specific column rows
df1 = df.loc[msk, 'value']
Both of these are much faster than the groupby().apply() and groupby().nlargest() calls suggested in the other answers here (1, 2, 3). On a sample with 100k rows and 8000 groups, a %timeit test showed they were 24-150 times faster than those solutions.
Instead of slicing, you can also pass a list/tuple/range to .nth():
df.groupby('id', as_index=False).nth([0,1])
# doesn't even have to be consecutive
# the following returns 1st and 3rd row of each id
df.groupby('id', as_index=False).nth([0,2])

How to drop rows by a threshold on a column's occurrence frequency in pandas

I have a dataframe like this:
userid itemid timestamp
1 1 50
1 2 50
1 3 50
1 4 60
2 1 40
2 2 50
I want to drop all rows whose userid occurs more than 2 times and get a new dataframe as follows. Can someone help me? Thanks.
userid itemid timestamp
2 1 40
2 2 50
You can use pd.Series.value_counts to find the userid values that exceed your threshold, then use isin to filter them out of your original dataframe.
c = df['userid'].value_counts()
idx = c[c > 2].index
res = df[~df['userid'].isin(idx)]
print(res)
userid itemid timestamp
4 2 1 40
5 2 2 50
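An equivalent one-liner, as a sketch that filters on the size of each group instead of value_counts:
res = df[df.groupby('userid')['userid'].transform('size') <= 2]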

Grouping values

I have a dataframe that contains 3 columns: ID, Stage, Status. I would like to change the Status value based on this condition: if, for the same ID, the Stage changed from the previous occurrence, set Status to 1; if another occurrence of the same ID has the same Stage, set Status back to 0.
Thanks !!
To calculate the Period column, you can calculate the result with two (nested) groupbys:
df["Period"] = (df.groupby("ID", group_keys=False)
# use the common diff.cumsum pattern to calculate the group variable here
.apply(lambda g: g.groupby(by = (g.Stage.diff() != 0).cumsum())
.cumcount() * 30))
df
The status column can be obtained this way:
df.groupby('ID').diff().Stage.fillna(0).ne(0)
Out[86]:
4 False
10 True
0 False
2 True
3 True
5 True
7 False
8 False
9 True
1 False
6 False
Name: Stage, dtype: bool
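To write that back into the Status column as 0/1, a small follow-up sketch (assuming the same frame and ID ordering as above):
df['Status'] = df.groupby('ID')['Stage'].diff().fillna(0).ne(0).astype(int)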
You need to sort on column ID and then use np.where() and df.shift() to find the right Status.
import numpy as np

df = df.sort_values('ID')
df['Status'] = np.where((df.ID.shift() == df.ID) & (df.Stage.shift() != df.Stage), 1, 0)
output
ID Stage Status
4 45 2 0
10 45 3 1
0 50 4 0
2 50 5 1
3 50 6 1
5 50 4 1
7 50 4 0
8 50 4 0
9 50 5 1
1 55 3 0
6 55 3 0
