How to recognize changed values when applying transform() in pandas - python

I have a DF like below:
df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 0, 10, 30, 30, 0, 0, 10],
                   'price': [5, 0, 2, 10, 10, 0, 0, 3],
                   'flag': [0, 0, 0, 0, 0, 0, 0, 0]})
I would like to change the 0s in the ['size'] column into the max value of the category, so:
df['size'] = np.where(df['size'].eq(0), df.groupby('category')['size'].transform('max'), df['size'])
df['price'] = np.where(df['price'].eq(0), df.groupby('category')['price'].transform('max'), df['price'])
And the output would be like:
df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 20, 10, 30, 30, 30, 10, 10],
                   'price': [5, 5, 2, 10, 10, 10, 3, 3],
                   'flag': [0, 0, 0, 0, 0, 0, 0, 0]})
(The process so far is confirmed to work.)
But now I would like to know which rows have been changed, so I added a ['flag'] column and would like to change its value from 0 to 1 whenever any other value in the same row has been changed.
So the desired output would be like below:
df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 20, 10, 30, 30, 30, 10, 10],
                   'price': [5, 5, 2, 10, 10, 10, 3, 3],
                   'flag': [0, 1, 0, 0, 0, 1, 1, 0]})
Is there any way I can do this in one line with the transform statement? Or any other good way?

Can you just label what will be changed before applying your operations? i.e. find the places where size == 0:
df['flag'] = (df['size'] == 0).astype(int)
# then do
df['size'] = np.where(df['size'].eq(0), df.groupby('category')['size'].transform('max'), df['size'])
df['price'] = np.where(df['price'].eq(0), df.groupby('category')['price'].transform('max'), df['price'])
Or for either price or size:
df['flag'] = ((df['size'] == 0) | (df['price'] == 0)).astype(int)

Check here: I add an additional condition, because all of the price and size values in a group may be 0:
cond1 = df.groupby('category')['size'].transform('max')
cond2 = df.groupby('category')['price'].transform('max')
df['changed'] = ((df['size'].ne(cond1) & df['size'].eq(0)) |
                 (df['price'].ne(cond2) & df['price'].eq(0))).astype(int)
df['size'] = np.where(df['size'].eq(0), cond1, df['size'])
df['price'] = np.where(df['price'].eq(0), cond2, df['price'])
Out[406]:
  category  size  price  flag  changed
0        A    20      5     0        0
1        A    20      5     0        1
2        A    10      2     0        0
3        B    30     10     0        0
4        B    30     10     0        0
5        B    30     10     0        1
6        C    10      3     0        1
7        C    10      3     0        0
Or, if the max value is always greater than 0:
df['changed'] = df[['size', 'price']].eq(0).any(axis=1).astype(int)
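For completeness, here is a minimal one-pass sketch that both flags and fills with transform, assuming the group max is always the intended fill value:
import pandas as pd

df = pd.DataFrame({'category': ['A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
                   'size': [20, 0, 10, 30, 30, 0, 0, 10],
                   'price': [5, 0, 2, 10, 10, 0, 0, 3]})

# flag rows where either column is 0 ...
df['flag'] = df[['size', 'price']].eq(0).any(axis=1).astype(int)
# ... then replace the 0s with the per-group max, column by column
df[['size', 'price']] = (df.groupby('category')[['size', 'price']]
                           .transform(lambda s: s.mask(s.eq(0), s.max())))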


Find highest two numbers on every row in pandas dataframe and extract the column names

I have a dataframe with multiple columns and I would like to add two more: one for the highest number in each row, and another for the second highest. However, instead of the numbers, I would like to show the names of the columns where they are found.
Assume the following data frame:
import pandas as pd
df = pd.DataFrame({'A': [1, 5, 10], 'B': [2, 6, 11], 'C': [3, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
To extract the highest number on every row, I can just apply max(axis=1) like this:
df['max1'] = df[['A', 'B', 'C', 'D', 'E']].max(axis = 1)
This gets me the max number, but not the column name itself.
How can this be applied to the second max number as well?
You can sort the values and assign the top-2 values:
cols = ['A', 'B', 'C', 'D', 'E']
df[['max2','max1']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:]
print (df)
    A   B   C   D   E  max2  max1
0   1   2   3   4   5     4     5
1   5   6   7   8   9     8     9
2  10  11  12  13  14    13    14
Or, to get them in max1, max2 order instead:
df[['max1','max2']] = np.sort(df[cols].to_numpy(), axis=1)[:, -2:][:, ::-1]
EDIT: To get the top-2 column names and the top-2 values, use:
df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']
#values in numpy array
vals = df[cols].to_numpy()
#columns names in array
cols = np.array(cols)
#get indices that would sort an array in descending order
arr = np.argsort(-vals, axis=1)
#top 2 columns names
df[['top1','top2']] = cols[arr[:, :2]]
#top 2 values
df[['max1','max2']] = vals[np.arange(arr.shape[0])[:, None], arr[:, :2]]
print (df)
    A   B   C   D   E top1 top2  max1  max2
0   1   2   3  40   5    D    E    40     5
1  50   6   7   8   9    A    E    50     9
2  10  11  12  13  14    E    D    14    13
Another approach: you can get the first max, then remove it and take the max again to get the second max.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 15, 10], 'B': [2, 89, 11], 'C': [80, 7, 12], 'D': [4, 8, 13], 'E': [5, 9, 14]})
# first max and the column it comes from
max1 = df.max(axis=1)
maxcolumn1 = df.idxmax(axis=1)
# replace the row maxima with 0 (wherever those values appear), then take the max again
max2 = df.replace(np.array(df.max(axis=1)), 0).max(axis=1)
maxcolumn2 = df.replace(np.array(df.max(axis=1)), 0).idxmax(axis=1)
df2 = pd.DataFrame({'max1': max1, 'max2': max2, 'maxcol1': maxcolumn1, 'maxcol2': maxcolumn2})
df.join(df2)
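As one more option, here is a readable (though slower) sketch using Series.nlargest, which keeps the column labels in its index, so one pass per row yields both the top-2 values and their column names:
import pandas as pd

df = pd.DataFrame({'A': [1, 50, 10], 'B': [2, 6, 11],
                   'C': [3, 7, 12], 'D': [40, 8, 13], 'E': [5, 9, 14]})
cols = ['A', 'B', 'C', 'D', 'E']

def top2(row):
    # nlargest(2) returns the two largest values, indexed by column name
    s = row.nlargest(2)
    return pd.Series({'top1': s.index[0], 'top2': s.index[1],
                      'max1': s.iloc[0], 'max2': s.iloc[1]})

df = df.join(df[cols].apply(top2, axis=1))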

More efficient way to search through Pandas groups

I want to get the list of values from col2 that belong to the same groupId, given a corresponding value in col1. Col1 values can belong to multiple groups, and in that case only the top-most group should be considered (group 2 but not group 3 in my example). Col1 values are always identical within the same groupId.
groupId  col1  col2
2        a     10
1        b     20
2        a     30
1        b     40
3        a     50
3        a     60
1        b     70
My current solution takes over 30s for a df with 2000 rows and 32 values to search for in col1 ('a' in this case):
group_id_groups = df.groupby('groupId')
for group_id, group in group_id_groups:
    col2_values = list(group[group['col1'] == 'a']['col2'])
    if col2_values:
        print(col2_values)
        break
result: [10, 30]
The sort parameter of groupby defaults to True, which means the first group will be the top-most one (smallest groupId) by default. You can change col_to_search to 'b' and get the other answer.
import pandas as pd

df = pd.DataFrame({'groupId': [2, 1, 2, 1, 3, 3, 1],
                   'col1': ['a', 'b', 'a', 'b', 'a', 'a', 'b'],
                   'col2': [10, 20, 30, 40, 50, 60, 70]})
col_to_search = 'a'
(
    df.loc[df['col1'].eq(col_to_search)]
      .groupby('groupId')['col2']
      .apply(list)
      .iloc[0]
)
Output
[10, 30]
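If "top-most" should instead mean the group that appears first in the frame rather than the smallest groupId, a sketch with sort=False keeps the groups in order of appearance (for this data both readings give the same result):
(
    df.loc[df['col1'].eq(col_to_search)]
      .groupby('groupId', sort=False)['col2']
      .apply(list)
      .iloc[0]
)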
I am still not sure what you want. Does this help you? I am sure that pandas.DataFrame.groupby() is your friend here.
Full code
#!/usr/bin/env python3
import pandas as pd

# initial data
df = pd.DataFrame({
    'groupId': [2, 1, 2, 1, 3, 3, 1],
    'col1': list('ababaab'),
    'col2': range(10, 80, 10)
})
print(df)
g = df.groupby(['groupId', 'col1']).agg(list)
print(g)
result = g.loc[(2, 'a')]
print(result)
Step by step
Your initial data in df looks like this:
   groupId col1  col2
0        2    a    10
1        1    b    20
2        2    a    30
3        1    b    40
4        3    a    50
5        3    a    60
6        1    b    70
Then you simply group your data by your two "search columns". The result per group is stored as a list.
g = df.groupby(['groupId', 'col1']).agg(list)
The result:
                      col2
groupId col1
1       b     [20, 40, 70]
2       a         [10, 30]
3       a         [50, 60]
Now you can do your search:
result = g.loc[(2, 'a')]
That gives you
col2 [10, 30]
Name: (2, a), dtype: object
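Since the groupby produced a MultiIndex, you can also pull out the bare list directly by naming the column in the same .loc call:
result = g.loc[(2, 'a'), 'col2']
# [10, 30]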
It seems to me that you mostly need to create a mask without using a groupby.
import pandas as pd

# data
data = {'groupId': {0: '2', 1: '1', 2: '2', 3: '1', 4: '3', 5: '3', 6: '1'},
        'col1': {0: 'a', 1: 'b', 2: 'a', 3: 'b', 4: 'a', 5: 'a', 6: 'b'},
        'col2': {0: 10, 1: 20, 2: 30, 3: 40, 4: 50, 5: 60, 6: 70}}
df = pd.DataFrame(data)

# first group where the condition is satisfied
first_group = df[df["col1"].eq("a")].iloc[0]["groupId"]
# output
df[df["col1"].eq("a") &
   df["groupId"].eq(first_group)]["col2"].to_list()
And the output is [10, 30] as expected.
You can use pandas groupby with agg(list), then search for what you want with .loc and return the first match.
>>> grp = df.groupby(['groupId', 'col1']).agg(list).reset_index()
>>> grp.loc[grp['col1'].eq('a'), 'col2'].to_list()[0]
[10, 30]
>>> grp.loc[grp['col1'].eq('a'), 'col2']
1 [10, 30]
2 [50, 60]
Name: col2, dtype: object

How to update original array with groupby in python

I have a dataset and I am trying to iterate over each group and, based on the group, update the original DataFrame:
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for group_name, b in df.groupby("grain"):
    if group_name == "A":
        ...  # do some processing
    if group_name == "B":
        ...  # do another processing
I expect to see the original df updated. Is there any way to do it?
Here is a way to change the original data; this example requires a non-duplicated index. I am not sure what the benefit of this approach would be compared to using classical pandas operations.
import pandas as pd
import numpy as np
arr = np.array([1, 2, 4, 7, 11, 16, 22, 29, 37, 46])
df = pd.DataFrame({'grain': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']})
df["target"] = arr
for g_name, g_df in df.groupby("grain"):
    if g_name == "A":
        df.loc[g_df.index, 'target'] *= 10
    if g_name == "B":
        df.loc[g_df.index, 'target'] *= -1
Output:
>>> df
  grain  target
0     A      10
1     B      -2
2     A      40
3     B      -7
4     A     110
5     B     -16
6     A     220
7     B     -29
8     A     370
9     B     -46
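For comparison, a minimal sketch of the classical vectorized form of the same per-group rules (x10 for group A, x-1 for group B), with no loop at all:
# map each grain to its multiplier, then apply it in one vectorized step
factors = df['grain'].map({'A': 10, 'B': -1})
df['target'] = df['target'] * factors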

Python appending a list to dataframe column

I have a dataframe from a Stata file and I would like to add a new column to it which has a numeric list as an entry for each row. How can one accomplish this? I have been trying assignment, but it complains about index size.
I tried initiating a new column of strings (I also tried integers) and tried something like this, but it didn't work:
testdf['new_col'] = '0'
testdf['new_col'] = testdf['new_col'].map(lambda x : list(range(100)))
Here is a toy example resembling what I have:
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
testdf = pd.DataFrame.from_dict(data)
This is what I would like to have:
data2 = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15], 'list' : [[1,2,3],[7,8,9,10,11],[9,10,11,12],[10,11,12,13,14,15]]}
testdf2 = pd.DataFrame.from_dict(data2)
My final goal is to use explode on that "list" column to duplicate the rows appropriately.
Try this bit of code:
testdf['list'] = pd.Series(np.arange(i, j) for i, j in zip(testdf['start_val'],
                                                           testdf['end_val'] + 1))
testdf
Output:
   col_1 col_2  start_val  end_val                      list
0      3     a          1        3                 [1, 2, 3]
1      2     b          7       11         [7, 8, 9, 10, 11]
2      1     c          9       12           [9, 10, 11, 12]
3      0     d         10       15  [10, 11, 12, 13, 14, 15]
Let's use a generator expression and zip with the pd.Series constructor and np.arange to create the lists.
If you'd rather stick to using the apply function:
import pandas as pd
import numpy as np
data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd'], 'start_val': [1,7,9,10], 'end_val' : [3,11, 12,15]}
df = pd.DataFrame.from_dict(data)
df['range'] = df.apply(lambda row: np.arange(row['start_val'], row['end_val']+1), axis=1)
print(df)
Output:
   col_1 col_2  start_val  end_val                     range
0      3     a          1        3                 [1, 2, 3]
1      2     b          7       11         [7, 8, 9, 10, 11]
2      1     c          9       12           [9, 10, 11, 12]
3      0     d         10       15  [10, 11, 12, 13, 14, 15]
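Since the stated end goal is to explode that column, here is a short sketch of the final step (assuming the df with the 'range' column from the answer above):
# each list element becomes its own row; ignore_index renumbers from 0
exploded = df.explode('range', ignore_index=True)
print(exploded)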

Replicating rows in pandas dataframe by column value and adding a new column with repetition index

My question is similar to one asked here. I have a dataframe and I want to repeat each row of the dataframe k times. Along with it, I also want to create a column with values 0 to k-1. So:
import pandas as pd
df = pd.DataFrame(data={
    'id': ['A', 'B', 'C'],
    'n' : [ 1, 2, 3],
    'v' : [ 10, 13, 8]
})
what_i_want = pd.DataFrame(data={
    'id': ['A', 'B', 'B', 'C', 'C', 'C'],
    'n' : [ 1, 2, 2, 3, 3, 3],
    'v' : [ 10, 13, 13, 8, 8, 8],
    'repeat_id': [0, 0, 1, 0, 1, 2]
})
The command below does half of the job. I am looking for a pandas way of adding the repeat_id column.
df.loc[df.index.repeat(df.n)]
Use GroupBy.cumcount, and copy to avoid SettingWithCopyWarning:
If you modify values in df1 later, you will find that the modifications do not propagate back to the original data (df), and that pandas warns about it.
df1 = df.loc[df.index.repeat(df.n)].copy()
df1['repeat_id'] = df1.groupby(level=0).cumcount()
df1 = df1.reset_index(drop=True)
print (df1)
  id  n   v  repeat_id
0  A  1  10          0
1  B  2  13          0
2  B  2  13          1
3  C  3   8          0
4  C  3   8          1
5  C  3   8          2
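A NumPy-based alternative for the counter column is also possible (a sketch, assuming the same df as above):
import numpy as np

df1 = df.loc[df.index.repeat(df.n)].reset_index(drop=True)
# build 0..k-1 for each row's repeat count and concatenate the pieces
df1['repeat_id'] = np.concatenate([np.arange(k) for k in df.n])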
