I need to assign values to a column of df based on conditions:
if df.condition > 0, df.result = df.data1; if df.condition < 0, df.result = df.data2.
My code is shown below:
def main():
    import pandas as pd
    import numpy as np
    condition = {"condition": np.random.randn(200)}
    df = pd.DataFrame(condition)
    df['data1'] = np.random.randint(1, 100, len(df))
    df['data2'] = np.random.randint(1, 100, len(df))
    df['result'] = 0
    df['result'].loc[df['condition'] > 0] = df[df['condition'] > 0]['data1']
    df['result'].loc[df['condition'] < 0] = df[df['condition'] < 0]['data2']
    print(df.head(10))

main()
My method raises SettingWithCopyWarning: "A value is trying to be set on a copy of a slice from a DataFrame", and it is not optimized.
It turned out I had a wrong understanding of pd.Series.where.
The modified code is as follows:
def main():
    condition = {"condition": np.random.randn(200)}
    df = pd.DataFrame(condition)
    df['data1'] = np.random.randint(1, 100, len(df))
    df['data2'] = np.random.randint(1, 100, len(df))
    df['result'] = 0
    gt = df.condition > 0
    lt = df.condition < 0
    df.result.where(gt, df.data2, inplace=True)
    df.result.where(lt, df.data1, inplace=True)
    print(df.head(10))

main()
The result is :
condition data1 data2 result
0 -1.580927 63 23 23
1 -1.549005 94 20 20
2 2.153873 18 83 18
3 -0.115974 31 8 8
4 -0.726009 61 38 38
5 2.039930 96 63 96
6 -1.523605 94 96 96
7 -0.157509 8 4 4
8 -0.166163 11 21 21
9 -0.540077 14 64 64
I just figured out the usage of np.where:
import pandas as pd
import numpy as np

def main():
    condition = {"condition": np.random.randn(200)}
    df = pd.DataFrame(condition)
    df['data1'] = np.random.randint(1, 100, len(df))
    df['data2'] = np.random.randint(1, 100, len(df))
    df['result'] = np.where(df['condition'] > 0, df['data1'], df['data2'])
    print(df.head(10))

main()
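As a side note: if there are more than two branches (here there are really three cases, condition > 0, condition < 0, and exactly 0), np.select generalizes np.where. A minimal sketch reusing the same column names:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"condition": np.random.randn(200)})
df["data1"] = np.random.randint(1, 100, len(df))
df["data2"] = np.random.randint(1, 100, len(df))

# Conditions are checked in order; the first match wins,
# and rows matching nothing get the default.
conditions = [df["condition"] > 0, df["condition"] < 0]
choices = [df["data1"], df["data2"]]
df["result"] = np.select(conditions, choices, default=0)
```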
Create boolean masks for your conditions and use them with DataFrame.loc to select the rows on the left-hand side and the right-hand side of the assignment.
Boolean Indexing
>>> df.head(15)
data a b data2
0 1.864896 81 30 0
1 -0.059083 81 93 0
2 -0.953324 89 1 0
3 0.367495 2 68 0
4 -1.537818 70 88 0
5 -1.118238 76 35 0
6 -0.017608 46 68 0
7 1.571796 12 95 0
8 0.683234 44 7 0
9 -1.320751 50 42 0
10 -0.463197 19 66 0
11 0.786541 44 32 0
12 -0.171833 28 26 0
13 1.668763 75 7 0
14 0.846662 42 56 0
>>> gt = df.data > 0
>>> lt = df.data < 0
>>> df.loc[gt,'a'] = df.loc[gt,'data2']
>>> df.loc[lt,'b'] = df.loc[lt,'data2']
>>> df.head(15)
data a b data2
0 1.864896 0 30 0
1 -0.059083 81 0 0
2 -0.953324 89 0 0
3 0.367495 0 68 0
4 -1.537818 70 0 0
5 -1.118238 76 0 0
6 -0.017608 46 0 0
7 1.571796 0 95 0
8 0.683234 0 7 0
9 -1.320751 50 0 0
10 -0.463197 19 0 0
11 0.786541 0 32 0
12 -0.171833 28 0 0
13 1.668763 0 7 0
14 0.846662 0 56 0
Using Series.where, you have to reverse the logic, as it only changes the values where the condition is NOT met.
>>> df.head(10)
data a b data2
0 1.046114 41 66 0
1 0.156532 65 46 0
2 -0.768515 56 36 0
3 0.640834 36 89 0
4 0.008113 39 26 0
5 -0.528028 63 49 0
6 -1.343293 87 94 0
7 1.076804 5 26 0
8 0.172443 9 57 0
9 -0.375729 84 47 0
>>> gt = df.data > 0
>>> lt = df.data < 0
>>> df.b.where(gt,df.data2,inplace=True)
>>> df.a.where(lt,df.data2,inplace=True)
>>> df.head(10)
data a b data2
0 1.046114 0 66 0
1 0.156532 0 46 0
2 -0.768515 56 0 0
3 0.640834 0 89 0
4 0.008113 0 26 0
5 -0.528028 63 0 0
6 -1.343293 87 0 0
7 1.076804 0 26 0
8 0.172443 0 57 0
9 -0.375729 84 0 0
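A related option, not used above, is Series.mask: it is the complement of where and replaces values where the condition IS met, so no logic flipping is needed. A small sketch with made-up data (assigning back instead of using inplace on a chained attribute):

```python
import pandas as pd

df = pd.DataFrame({"data": [1.5, -0.3, 0.7, -2.1],
                   "a": [10, 20, 30, 40],
                   "data2": [0, 0, 0, 0]})

gt = df["data"] > 0
# mask replaces values where the condition IS met (the opposite of where)
df["a"] = df["a"].mask(gt, df["data2"])
print(df["a"].tolist())  # [0, 20, 0, 40]
```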
I have this dataframe:
ID X1 X2 Y
A 11 47 0
A 11 87 0
A 56 33 0
A 92 72 1
A 83 34 0
A 34 31 0
B 88 62 1
B 28 71 0
B 95 28 0
B 92 87 1
B 91 45 0
C 46 59 0
C 60 68 1
C 67 78 0
C 26 26 0
C 13 77 0
D 40 95 0
D 25 26 1
D 93 31 0
D 71 67 0
D 91 24 1
D 80 19 0
D 44 49 0
D 41 84 1
E 38 10 0
F 23 75 1
G 46 58 1
G 44 52 0
I want to assign a value of 1 to the rows that come after the last time Y was equal to 1, and 0 otherwise.
Note: it should be applied to each ID separately.
Expected result:
ID X1 X2 Y after
A 11 47 0 0
A 11 87 0 0
A 56 33 0 0
A 92 72 1 0
A 83 34 0 1
A 34 31 0 1
B 88 62 1 0
B 28 71 0 0
B 95 28 0 0
B 92 87 1 0
B 91 45 0 1
C 46 59 0 0
C 60 68 1 0
C 67 78 0 1
C 26 26 0 1
C 13 77 0 1
D 40 95 0 0
D 25 26 1 0
D 93 31 0 0
D 71 67 0 0
D 91 24 1 0
D 80 19 0 0
D 44 49 0 0
D 41 84 1 0
E 38 10 0 0
F 23 75 1 0
G 46 58 1 0
G 44 52 0 1
This might help:
Assign a value of 1 before another variable was equal 1, only for the first time
Let us try idxmax with transform: first find the index of the last 1 within each group, then compare the original index with that output.
df['before'] = (df.iloc[::-1].groupby('ID').Y.transform('idxmax').sort_index() < df.index).astype(int)
df
Out[70]:
ID X1 X2 Y before
0 A 11 47 0 0
1 A 11 87 0 0
2 A 56 33 0 0
3 A 92 72 1 0
4 A 83 34 0 1
5 A 34 31 0 1
6 B 88 62 1 0
7 B 28 71 0 0
8 B 95 28 0 0
9 B 92 87 1 0
10 B 91 45 0 1
11 C 46 59 0 0
12 C 60 68 1 0
13 C 67 78 0 1
14 C 26 26 0 1
15 C 13 77 0 1
16 D 40 95 0 0
17 D 25 26 1 0
18 D 93 31 0 0
19 D 71 67 0 0
20 D 91 24 1 0
21 D 80 19 0 0
22 D 44 49 0 0
23 D 41 84 1 0
24 E 38 10 0 0
25 F 23 75 1 0
26 G 46 58 1 0
27 G 44 52 0 1
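For what it's worth, the same "strictly after the last 1" flag can also be expressed with a reversed cummax per group. This is a sketch, assuming every ID contains at least one Y == 1 (a group with no 1 at all would be flagged entirely):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": list("AAAAAA"),
    "Y":  [0, 0, 0, 1, 0, 0],
})

# Reversed cummax is 1 at and before the last 1, and 0 strictly after it
at_or_before_last = df.groupby("ID")["Y"].transform(
    lambda s: s[::-1].cummax()[::-1]
)
df["after"] = 1 - at_or_before_last
print(df["after"].tolist())  # [0, 0, 0, 0, 1, 1]
```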
I have this data frame:
ID Date X1 X2 Y
A 16-07-19 58 50 0
A 17-07-19 61 83 1
A 18-07-19 97 38 0
A 19-07-19 29 77 0
A 20-07-19 66 71 1
A 21-07-19 28 74 0
B 19-07-19 54 65 1
B 20-07-19 55 32 1
B 21-07-19 50 30 0
B 22-07-19 51 38 0
B 23-07-19 81 61 0
C 24-07-19 55 29 0
C 25-07-19 97 69 1
C 26-07-19 92 44 1
C 27-07-19 55 97 0
C 28-07-19 13 48 1
D 29-07-19 77 27 1
D 30-07-19 68 50 1
D 31-07-19 71 32 1
D 01-08-19 89 57 1
D 02-08-19 46 70 0
D 03-08-19 14 68 1
D 04-08-19 12 87 1
D 05-08-19 56 13 0
E 06-08-19 47 35 1
I want to create a variable that equals 1 when Y was equal to 1 for the last time (for each ID), and 0 otherwise.
I also want to exclude all the rows that come after the last time Y was equal to 1.
Expected result:
ID Date X1 X2 Y Last
A 16-07-19 58 50 0 0
A 17-07-19 61 83 1 0
A 18-07-19 97 38 0 0
A 19-07-19 29 77 0 0
A 20-07-19 66 71 1 1
B 19-07-19 54 65 1 0
B 20-07-19 55 32 1 1
C 24-07-19 55 29 0 0
C 25-07-19 97 69 1 0
C 26-07-19 92 44 1 0
C 27-07-19 55 97 0 0
C 28-07-19 13 48 1 1
D 29-07-19 77 27 1 0
D 30-07-19 68 50 1 0
D 31-07-19 71 32 1 0
D 01-08-19 89 57 1 0
D 02-08-19 46 70 0 0
D 03-08-19 14 68 1 0
D 04-08-19 12 87 1 1
E 06-08-19 47 35 1 1
First remove all rows after the last 1 in Y: compare Y with 1, reverse the order, and use GroupBy.cumsum; then keep all rows where the cumulative sum is not 0 with boolean indexing. Last, use
numpy.where for the new column:
df = df[df['Y'].eq(1).iloc[::-1].groupby(df['ID']).cumsum().ne(0).sort_index()]
df['Last'] = np.where(df['ID'].duplicated(keep='last'), 0, 1)
print (df)
ID Date X1 X2 Y Last
0 A 16-07-19 58 50 0 0
1 A 17-07-19 61 83 1 0
2 A 18-07-19 97 38 0 0
3 A 19-07-19 29 77 0 0
4 A 20-07-19 66 71 1 1
6 B 19-07-19 54 65 1 0
7 B 20-07-19 55 32 1 1
11 C 24-07-19 55 29 0 0
12 C 25-07-19 97 69 1 0
13 C 26-07-19 92 44 1 0
14 C 27-07-19 55 97 0 0
15 C 28-07-19 13 48 1 1
16 D 29-07-19 77 27 1 0
17 D 30-07-19 68 50 1 0
18 D 31-07-19 71 32 1 0
19 D 01-08-19 89 57 1 0
20 D 02-08-19 46 70 0 0
21 D 03-08-19 14 68 1 0
22 D 04-08-19 12 87 1 1
24 E 06-08-19 47 35 1 1
EDIT:
m = df['Y'].eq(1).iloc[::-1].groupby(df['ID']).cumsum().ne(0).sort_index()
df['Last'] = np.where(m.ne(m.groupby(df['ID']).shift(-1)) & m,1,0)
print (df)
ID Date X1 X2 Y Last
0 A 16-07-19 58 50 0 0
1 A 17-07-19 61 83 1 0
2 A 18-07-19 97 38 0 0
3 A 19-07-19 29 77 0 0
4 A 20-07-19 66 71 1 1
5 A 21-07-19 28 74 0 0
6 B 19-07-19 54 65 1 0
7 B 20-07-19 55 32 1 1
8 B 21-07-19 50 30 0 0
9 B 22-07-19 51 38 0 0
10 B 23-07-19 81 61 0 0
11 C 24-07-19 55 29 0 0
12 C 25-07-19 97 69 1 0
13 C 26-07-19 92 44 1 0
14 C 27-07-19 55 97 0 0
15 C 28-07-19 13 48 1 1
16 D 29-07-19 77 27 1 0
17 D 30-07-19 68 50 1 0
18 D 31-07-19 71 32 1 0
19 D 01-08-19 89 57 1 0
20 D 02-08-19 46 70 0 0
21 D 03-08-19 14 68 1 0
22 D 04-08-19 12 87 1 1
23 D 05-08-19 56 13 0 0
24 E 06-08-19 47 35 1 1
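If you only need to flag the last Y == 1 row per ID (the EDIT case, keeping all rows), a more direct alternative, not the original answer's method, is to take the index of the last matching row per group with tail(1):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B"],
    "Y":  [1, 0, 1, 1, 0],
})

# Index of the last Y == 1 row within each ID
last_one = df[df["Y"].eq(1)].groupby("ID").tail(1).index
df["Last"] = np.where(df.index.isin(last_one), 1, 0)
print(df["Last"].tolist())  # [0, 0, 1, 1, 0]
```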
I have a dataframe df1:
Time Delta_time
0 0 NaN
1 15 15
2 18 3
3 30 12
4 45 15
5 64 19
6 80 16
7 82 2
8 100 18
9 120 20
where Delta_time is the difference between adjacent values in the Time column. I have another dataframe df2 that has time values numbering from 0 to 120 (121 rows) and another column called 'Short_gap'.
How do I set the value of Short_gap to 1 for all Time values that lie inside a gap whose Delta_time is smaller than 5? For example, the Short_gap column should have a value of 1 for Time = 15, 16, 17, 18, since Delta_time = 3 < 5.
Edit: Currently, df2 looks like this.
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
... ... ...
118 118 0
119 119 0
120 120 0
The expected output for df2 is
Time Short_gap
0 0 0
1 1 0
2 2 0
... ... ...
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 1
19 19 0
20 20 0
... ... ...
78 78 0
79 79 0
80 80 1
81 81 1
82 82 1
83 83 0
84 84 0
... ... ...
119 119 0
120 120 0
Try:
t = df['Delta_time'].shift(-1)
df2 = ((t < 5).repeat(t.fillna(1)).astype(int).reset_index(drop=True)
.to_frame(name='Short_gap').rename_axis('Time').reset_index())
print(df2.head(20))
print('...')
print(df2.loc[78:84])
Output:
Time Short_gap
0 0 0
1 1 0
2 2 0
3 3 0
4 4 0
5 5 0
6 6 0
7 7 0
8 8 0
9 9 0
10 10 0
11 11 0
12 12 0
13 13 0
14 14 0
15 15 1
16 16 1
17 17 1
18 18 0
19 19 0
...
Time Short_gap
78 78 0
79 79 0
80 80 1
81 81 1
82 82 0
83 83 0
84 84 0
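For readability, the same idea can also be written as an explicit loop over consecutive Time pairs in df1. This sketch treats both endpoints as inclusive, which matches the asker's expected output (column names assumed from the question; a shortened Time range keeps the example small):

```python
import pandas as pd

df1 = pd.DataFrame({"Time": [0, 15, 18, 30]})
df2 = pd.DataFrame({"Time": range(31), "Short_gap": 0})

# Flag every Time that falls inside a gap shorter than 5 units
for start, end in zip(df1["Time"], df1["Time"].shift(-1)):
    if pd.notna(end) and end - start < 5:
        df2.loc[df2["Time"].between(start, end), "Short_gap"] = 1
```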
I want to add a rating based on conditions in a few columns:
+1 if A > 30, +1 if B > 50, and +1 if C > 80; D doesn't matter.
For example, I have this dataframe:
A B C D
0 21 32 84 43 # 0 + 0 + 1
1 79 29 42 63 # 1 + 0 + 0
2 31 38 6 52 # 1 + 0 + 0
3 92 54 79 75 # 1 + 1 + 0
4 9 14 87 85 # 0 + 0 + 1
What I tried:
In [1]: import numpy as np
In [2]: import pandas as pd
In [36]: df = pd.DataFrame(
np.random.randint(0,100,size=(5, 4)),
columns=list('ABCD')
)
In [36]: df
Out[36]:
A B C D
0 21 32 84 43
1 79 29 42 63
2 31 38 6 52
3 92 54 79 75
4 9 14 87 85
Create a boolean series for each condition, e.g. (df['A'] > 30), concatenate them into a frame, and sum the rows:
In [37]: df['R'] = pd.concat(
[(df['A'] > 30), (df['B'] > 50), (df['C'] > 80)], axis=1
).sum(axis=1)
In [38]: df
Out[38]:
A B C D R
0 21 32 84 43 1
1 79 29 42 63 1
2 31 38 6 52 1
3 92 54 79 75 2
4 9 14 87 85 1
The result is as I expected, but maybe there is a simpler way?
You can just do this:
df['R'] = (df.iloc[:,:3]>[30, 50, 80]).sum(axis=1)
the same solution using column names
df['R'] = (df[['A','B','C']]>[30, 50, 80]).sum(axis=1)
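A variation on the same broadcast trick: putting the thresholds in a Series keyed by column name makes the pairing explicit, since DataFrame.gt aligns on labels rather than position. A small sketch with two of the example rows:

```python
import pandas as pd

df = pd.DataFrame({"A": [21, 92], "B": [32, 54], "C": [84, 79], "D": [43, 75]})

thresholds = pd.Series({"A": 30, "B": 50, "C": 80})
# gt aligns on column names, so column order no longer matters
df["R"] = df[["A", "B", "C"]].gt(thresholds).sum(axis=1)
print(df["R"].tolist())  # [1, 2]
```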
How about
df["R"] = (
(df["A"] > 30).astype(int) +
(df["B"] > 50).astype(int) +
(df["C"] > 80).astype(int)
)
You can also try this. Not sure if it is any better.
>>> df
A B C D
0 8 47 95 52
1 90 84 39 80
2 15 52 37 79
3 99 24 76 5
4 93 4 97 0
>>> df.apply(lambda x: int(x['A'] > 30) + int(x['B'] > 50) + int(x['C'] > 80), axis=1)
0 1
1 2
2 1
3 1
4 2
dtype: int64
>>> df.agg(lambda x: int(x['A'] > 30) + int(x['B'] > 50) + int(x['C'] > 80), axis=1)
0 1
1 2
2 1
3 1
4 2
dtype: int64
I'm trying to figure out how to retrieve values from future rows using an offset stored in a separate column. For instance, I have the dataframe df below, and I'd like to find a way to produce Column C:
Orig A Orig B Desired Column C
54 1 76
76 4 46
14 3 46
35 1 -3
-3 0 -3
46 0 46
64 0 64
93 0 93
72 0 72
Any help is much appreciated, thank you!
You can use NumPy for a vectorised solution:
import numpy as np
idx = np.arange(df.shape[0]) + df['OrigB'].values
df['C'] = df['OrigA'].iloc[idx].values
print(df)
OrigA OrigB C
0 54 1 76
1 76 4 46
2 14 3 46
3 35 1 -3
4 -3 0 -3
5 46 0 46
6 64 0 64
7 93 0 93
8 72 0 72
import pandas as pd

data = {"Orig A": [54, 76, 14, 35, -3, 46, 64, 93, 72],
        "Orig B": [1, 4, 3, 1, 0, 0, 0, 0, 0],
        "Desired Column C": [76, 46, 46, -3, -3, 46, 64, 93, 72]}
df = pd.DataFrame(data)
df["desired_test"] = [df["Orig A"].values[i + j] for i, j in enumerate(df["Orig B"].values)]
df
Orig A Orig B Desired Column C desired_test
0 54 1 76 76
1 76 4 46 46
2 14 3 46 46
3 35 1 -3 -3
4 -3 0 -3 -3
5 46 0 46 46
6 64 0 64 64
7 93 0 93 93
8 72 0 72 72
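One caveat with both offset lookups above: an offset that points past the last row would raise an IndexError. A defensive sketch that clips the computed positions into range (column names taken from the question):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Orig A": [54, 76, 14], "Orig B": [1, 4, 0]})

# Clip so that row position + offset never exceeds the last valid position
idx = np.clip(np.arange(len(df)) + df["Orig B"].to_numpy(), 0, len(df) - 1)
df["C"] = df["Orig A"].to_numpy()[idx]
print(df["C"].tolist())  # [76, 14, 14]
```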