I want to add a rating based on conditions in a few columns:
if A > 30 add 1, if B > 50 add 1, and if C > 80 add 1; D doesn't matter.
For example, I have this DataFrame:
A B C D
0 21 32 84 43 # 0 + 0 + 1
1 79 29 42 63 # 1 + 0 + 0
2 31 38 6 52 # 1 + 0 + 0
3 92 54 79 75 # 1 + 1 + 0
4 9 14 87 85 # 0 + 0 + 1
What I tried:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: df = pd.DataFrame(
    np.random.randint(0, 100, size=(5, 4)),
    columns=list('ABCD')
)
In [4]: df
Out[4]:
A B C D
0 21 32 84 43
1 79 29 42 63
2 31 38 6 52
3 92 54 79 75
4 9 14 87 85
Create a boolean Series like (df['A'] > 30) for each condition,
concatenate them to a frame,
and sum across rows:
In [5]: df['R'] = pd.concat(
[(df['A'] > 30), (df['B'] > 50), (df['C'] > 80)], axis=1
).sum(axis=1)
In [6]: df
Out[6]:
A B C D R
0 21 32 84 43 1
1 79 29 42 63 1
2 31 38 6 52 1
3 92 54 79 75 2
4 9 14 87 85 1
The result is as I expected, but maybe there is a simpler way?
You can just do this:
df['R'] = (df.iloc[:,:3]>[30, 50, 80]).sum(axis=1)
The same solution using column names:
df['R'] = (df[['A','B','C']]>[30, 50, 80]).sum(axis=1)
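If the thresholds are likely to change, a variant that names them explicitly can read better. A minimal sketch building on the df above (my addition; the thresholds Series is illustrative):
thresholds = pd.Series({'A': 30, 'B': 50, 'C': 80})
# DataFrame.gt aligns the Series on the columns, so each column is compared to its own threshold
df['R'] = df[thresholds.index].gt(thresholds).sum(axis=1)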
How about
df["R"] = (
(df["A"] > 30).astype(int) +
(df["B"] > 50).astype(int) +
(df["C"] > 80).astype(int)
)
You can also try this. Not sure if it is any better.
>>> df
A B C D
0 8 47 95 52
1 90 84 39 80
2 15 52 37 79
3 99 24 76 5
4 93 4 97 0
>>> df.apply(lambda x: int(x.iloc[0] > 30) + int(x.iloc[1] > 50) + int(x.iloc[2] > 80), axis=1)
0 1
1 2
2 1
3 1
4 2
dtype: int64
>>> df.agg(lambda x: int(x.iloc[0] > 30) + int(x.iloc[1] > 50) + int(x.iloc[2] > 80), axis=1)
0 1
1 2
2 1
3 1
4 2
dtype: int64
I need to assign values to a column of df based on conditions.
If df.condition > 0, df.result = df.data1; if df.condition < 0, df.result = df.data2.
My code is shown below:
def main():
    import pandas as pd
    import numpy as np
    condition = {"condition": np.random.randn(200)}
    df = pd.DataFrame(condition)
    df['data1'] = np.random.randint(1, 100, len(df))
    df['data2'] = np.random.randint(1, 100, len(df))
    df['result'] = 0
    # chained indexing below is what triggers the SettingWithCopyWarning
    df['result'].loc[df['condition'] > 0] = df[df['condition'] > 0]['data1']
    df['result'].loc[df['condition'] < 0] = df[df['condition'] < 0]['data2']
    print(df.head(10))

main()
My method triggers SettingWithCopyWarning: "A value is trying to be set on a copy of a slice from a DataFrame." And it is not optimized.
It looks like I had a wrong understanding of pd.Series.where.
The modified code is as follows:
def main():
    condition = {"condition": np.random.randn(200)}
    df = pd.DataFrame(condition)
    df['data1'] = np.random.randint(1, 100, len(df))
    df['data2'] = np.random.randint(1, 100, len(df))
    df['result'] = 0
    gt = df.condition > 0
    lt = df.condition < 0
    # where() keeps values where the condition holds and replaces the rest
    df.result.where(gt, df.data2, inplace=True)
    df.result.where(lt, df.data1, inplace=True)
    print(df.head(10))
    return

main()
The result is :
condition data1 data2 result
0 -1.580927 63 23 23
1 -1.549005 94 20 20
2 2.153873 18 83 18
3 -0.115974 31 8 8
4 -0.726009 61 38 38
5 2.039930 96 63 96
6 -1.523605 94 96 96
7 -0.157509 8 4 4
8 -0.166163 11 21 21
9 -0.540077 14 64 64
I just figured out the usage of np.where:
import pandas as pd
import numpy as np

def main():
    condition = {"condition": np.random.randn(200)}
    df = pd.DataFrame(condition)
    df['data1'] = np.random.randint(1, 100, len(df))
    df['data2'] = np.random.randint(1, 100, len(df))
    df['result'] = np.where(df['condition'] > 0, df['data1'], df['data2'])
    print(df.head(10))

main()
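If a third branch is ever needed (for example, treating condition == 0 as its own case), numpy.select generalizes the same idea. A minimal sketch along those lines (my addition, reusing the df above):
# conditions are checked in order; rows matching none get the default
df['result'] = np.select(
    [df['condition'] > 0, df['condition'] < 0],
    [df['data1'], df['data2']],
    default=0,
)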
Create boolean masks for your conditions and use them with DataFrame.loc to select the rows on the left-hand-side and the right-hand-side of the assignment.
Boolean Indexing
>>> df.head(15)
data a b data2
0 1.864896 81 30 0
1 -0.059083 81 93 0
2 -0.953324 89 1 0
3 0.367495 2 68 0
4 -1.537818 70 88 0
5 -1.118238 76 35 0
6 -0.017608 46 68 0
7 1.571796 12 95 0
8 0.683234 44 7 0
9 -1.320751 50 42 0
10 -0.463197 19 66 0
11 0.786541 44 32 0
12 -0.171833 28 26 0
13 1.668763 75 7 0
14 0.846662 42 56 0
>>> gt = df.data > 0
>>> lt = df.data < 0
>>> df.loc[gt,'a'] = df.loc[gt,'data2']
>>> df.loc[lt,'b'] = df.loc[lt,'data2']
>>> df.head(15)
data a b data2
0 1.864896 0 30 0
1 -0.059083 81 0 0
2 -0.953324 89 0 0
3 0.367495 0 68 0
4 -1.537818 70 0 0
5 -1.118238 76 0 0
6 -0.017608 46 0 0
7 1.571796 0 95 0
8 0.683234 0 7 0
9 -1.320751 50 0 0
10 -0.463197 19 0 0
11 0.786541 0 32 0
12 -0.171833 28 0 0
13 1.668763 0 7 0
14 0.846662 0 56 0
Using Series.where, you have to reverse the logic, as it only changes the values where the condition is NOT met.
>>> df.head(10)
data a b data2
0 1.046114 41 66 0
1 0.156532 65 46 0
2 -0.768515 56 36 0
3 0.640834 36 89 0
4 0.008113 39 26 0
5 -0.528028 63 49 0
6 -1.343293 87 94 0
7 1.076804 5 26 0
8 0.172443 9 57 0
9 -0.375729 84 47 0
>>> gt = df.data > 0
>>> lt = df.data < 0
>>> df.b.where(gt, df.data2, inplace=True)
>>> df.a.where(lt, df.data2, inplace=True)
>>> df.head(10)
data a b data2
0 1.046114 0 66 0
1 0.156532 0 46 0
2 -0.768515 56 0 0
3 0.640834 0 89 0
4 0.008113 0 26 0
5 -0.528028 63 0 0
6 -1.343293 87 0 0
7 1.076804 0 26 0
8 0.172443 0 57 0
9 -0.375729 84 0 0
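If reversing the logic feels error-prone, Series.mask is the mirror image of Series.where: it replaces values where the condition IS met, so the original conditions can be used directly. A sketch with the same columns (my addition, avoiding inplace):
gt = df.data > 0
lt = df.data < 0
# mask() replaces values where the condition is True
df['a'] = df.a.mask(gt, df.data2)
df['b'] = df.b.mask(lt, df.data2)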
I have this dataframe:
ID X1 X2 Y
A 11 47 0
A 11 87 0
A 56 33 0
A 92 72 1
A 83 34 0
A 34 31 0
B 88 62 1
B 28 71 0
B 95 28 0
B 92 87 1
B 91 45 0
C 46 59 0
C 60 68 1
C 67 78 0
C 26 26 0
C 13 77 0
D 40 95 0
D 25 26 1
D 93 31 0
D 71 67 0
D 91 24 1
D 80 19 0
D 44 49 0
D 41 84 1
E 38 10 0
F 23 75 1
G 46 58 1
G 44 52 0
I want to assign a value of 1 to every row after the last time Y was equal to 1, and 0 otherwise.
Note: it should be applied for each ID separately.
Expected result:
ID X1 X2 Y after
A 11 47 0 0
A 11 87 0 0
A 56 33 0 0
A 92 72 1 0
A 83 34 0 1
A 34 31 0 1
B 88 62 1 0
B 28 71 0 0
B 95 28 0 0
B 92 87 1 0
B 91 45 0 1
C 46 59 0 0
C 60 68 1 0
C 67 78 0 1
C 26 26 0 1
C 13 77 0 1
D 40 95 0 0
D 25 26 1 0
D 93 31 0 0
D 71 67 0 0
D 91 24 1 0
D 80 19 0 0
D 44 49 0 0
D 41 84 1 0
E 38 10 0 0
F 23 75 1 0
G 46 58 1 0
G 44 52 0 1
This might help:
Assign a value of 1 before another variable was equal 1, only for the first time
Let us try idxmax with transform: first find the index of the last 1 in each group by scanning the frame in reverse, then compare the original index against that output.
df['after'] = (df.iloc[::-1].groupby('ID').Y.transform('idxmax').sort_index() < df.index).astype(int)
df
Out[70]:
ID X1 X2 Y after
0 A 11 47 0 0
1 A 11 87 0 0
2 A 56 33 0 0
3 A 92 72 1 0
4 A 83 34 0 1
5 A 34 31 0 1
6 B 88 62 1 0
7 B 28 71 0 0
8 B 95 28 0 0
9 B 92 87 1 0
10 B 91 45 0 1
11 C 46 59 0 0
12 C 60 68 1 0
13 C 67 78 0 1
14 C 26 26 0 1
15 C 13 77 0 1
16 D 40 95 0 0
17 D 25 26 1 0
18 D 93 31 0 0
19 D 71 67 0 0
20 D 91 24 1 0
21 D 80 19 0 0
22 D 44 49 0 0
23 D 41 84 1 0
24 E 38 10 0 0
25 F 23 75 1 0
26 G 46 58 1 0
27 G 44 52 0 1
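An equivalent sketch using a reversed cumulative sum instead of idxmax (my addition, not from the answer above): the reversed cumsum of Y is 0 exactly on the rows after the last 1, but IDs that never contain a 1 have to be excluded explicitly:
m = df.iloc[::-1].groupby('ID')['Y'].cumsum().sort_index().eq(0)  # True after the last 1 (and everywhere in all-zero IDs)
has_one = df.groupby('ID')['Y'].transform('max').eq(1)            # IDs that contain at least one 1
df['after'] = (m & has_one).astype(int)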
I have this data frame:
ID Date X1 X2 Y
A 16-07-19 58 50 0
A 17-07-19 61 83 1
A 18-07-19 97 38 0
A 19-07-19 29 77 0
A 20-07-19 66 71 1
A 21-07-19 28 74 0
B 19-07-19 54 65 1
B 20-07-19 55 32 1
B 21-07-19 50 30 0
B 22-07-19 51 38 0
B 23-07-19 81 61 0
C 24-07-19 55 29 0
C 25-07-19 97 69 1
C 26-07-19 92 44 1
C 27-07-19 55 97 0
C 28-07-19 13 48 1
D 29-07-19 77 27 1
D 30-07-19 68 50 1
D 31-07-19 71 32 1
D 01-08-19 89 57 1
D 02-08-19 46 70 0
D 03-08-19 14 68 1
D 04-08-19 12 87 1
D 05-08-19 56 13 0
E 06-08-19 47 35 1
I want to create a variable that equals 1 on the row where Y was last equal to 1 (for each ID), and 0 otherwise.
I also want to exclude all the rows that come after the last time Y was equal to 1.
Expected result:
ID Date X1 X2 Y Last
A 16-07-19 58 50 0 0
A 17-07-19 61 83 1 0
A 18-07-19 97 38 0 0
A 19-07-19 29 77 0 0
A 20-07-19 66 71 1 1
B 19-07-19 54 65 1 0
B 20-07-19 55 32 1 1
C 24-07-19 55 29 0 0
C 25-07-19 97 69 1 0
C 26-07-19 92 44 1 0
C 27-07-19 55 97 0 0
C 28-07-19 13 48 1 1
D 29-07-19 77 27 1 0
D 30-07-19 68 50 1 0
D 31-07-19 71 32 1 0
D 01-08-19 89 57 1 0
D 02-08-19 46 70 0 0
D 03-08-19 14 68 1 0
D 04-08-19 12 87 1 1
E 06-08-19 47 35 1 1
First remove all rows after the last 1 in Y: compare Y to 1, reverse the order, take a GroupBy.cumsum per ID, keep the rows where it is not equal to 0, and filter with boolean indexing. Last, use numpy.where for the new column:
mask = df['Y'].eq(1).iloc[::-1].groupby(df['ID']).cumsum().ne(0).sort_index()  # True up to and including the last 1 per ID
df = df[mask]
df['Last'] = np.where(df['ID'].duplicated(keep='last'), 0, 1)  # after filtering, each ID's final row is its last 1
print (df)
ID Date X1 X2 Y Last
0 A 16-07-19 58 50 0 0
1 A 17-07-19 61 83 1 0
2 A 18-07-19 97 38 0 0
3 A 19-07-19 29 77 0 0
4 A 20-07-19 66 71 1 1
6 B 19-07-19 54 65 1 0
7 B 20-07-19 55 32 1 1
11 C 24-07-19 55 29 0 0
12 C 25-07-19 97 69 1 0
13 C 26-07-19 92 44 1 0
14 C 27-07-19 55 97 0 0
15 C 28-07-19 13 48 1 1
16 D 29-07-19 77 27 1 0
17 D 30-07-19 68 50 1 0
18 D 31-07-19 71 32 1 0
19 D 01-08-19 89 57 1 0
20 D 02-08-19 46 70 0 0
21 D 03-08-19 14 68 1 0
22 D 04-08-19 12 87 1 1
24 E 06-08-19 47 35 1 1
EDIT:
m = df['Y'].eq(1).iloc[::-1].groupby(df['ID']).cumsum().ne(0).sort_index()  # True up to and including the last 1 per ID
df['Last'] = np.where(m.ne(m.groupby(df['ID']).shift(-1)) & m, 1, 0)  # flag the last True row within each ID
print (df)
ID Date X1 X2 Y Last
0 A 16-07-19 58 50 0 0
1 A 17-07-19 61 83 1 0
2 A 18-07-19 97 38 0 0
3 A 19-07-19 29 77 0 0
4 A 20-07-19 66 71 1 1
5 A 21-07-19 28 74 0 0
6 B 19-07-19 54 65 1 0
7 B 20-07-19 55 32 1 1
8 B 21-07-19 50 30 0 0
9 B 22-07-19 51 38 0 0
10 B 23-07-19 81 61 0 0
11 C 24-07-19 55 29 0 0
12 C 25-07-19 97 69 1 0
13 C 26-07-19 92 44 1 0
14 C 27-07-19 55 97 0 0
15 C 28-07-19 13 48 1 1
16 D 29-07-19 77 27 1 0
17 D 30-07-19 68 50 1 0
18 D 31-07-19 71 32 1 0
19 D 01-08-19 89 57 1 0
20 D 02-08-19 46 70 0 0
21 D 03-08-19 14 68 1 0
22 D 04-08-19 12 87 1 1
23 D 05-08-19 56 13 0 0
24 E 06-08-19 47 35 1 1
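A simpler sketch for the flagging step (my addition): locate the last Y == 1 row of each ID directly and mark it:
last_ones = df[df['Y'].eq(1)].groupby('ID').tail(1).index  # index label of each ID's last 1
df['Last'] = df.index.isin(last_ones).astype(int)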
I have a dataset in the following format. It has 48 columns and about 200,000 rows.
slot1,slot2,slot3,slot4,slot5,slot6...,slot45,slot46,slot47,slot48
1,2,3,4,5,6,7,......,45,46,47,48
3.5,5.2,2,5.6,...............
I want to reshape this dataset to something like below, where N is less than 48 (maybe 24 or 12, etc.). The column headers don't matter.
when N = 4
slotNew1,slotNew2,slotNew3,slotNew4
1,2,3,4
5,6,7,8
......
45,46,47,48
3.5,5.2,2,5.6
............
I could read row by row, split each row, and append to a new DataFrame, but that is very inefficient. Is there a more efficient and faster way to do this?
You may try this:
N = 4
df_new = pd.DataFrame(df_original.values.reshape(-1, N))
df_new.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
The code extracts the data into a numpy.ndarray, reshapes it, and creates a new DataFrame of the desired dimension.
Example:
import numpy as np
import pandas as pd
df0 = pd.DataFrame(np.arange(48 * 3).reshape(-1, 48))
df0.columns = ['slot{:}'.format(i + 1) for i in range(48)]
print(df0)
# slot1 slot2 slot3 slot4 ... slot45 slot46 slot47 slot48
# 0 0 1 2 3 ... 44 45 46 47
# 1 48 49 50 51 ... 92 93 94 95
# 2 96 97 98 99 ... 140 141 142 143
#
# [3 rows x 48 columns]
N = 4
df = pd.DataFrame(df0.values.reshape(-1, N))
df.columns = ['slotNew{:}'.format(i + 1) for i in range(N)]
print(df.head())
# slotNew1 slotNew2 slotNew3 slotNew4
# 0 0 1 2 3
# 1 4 5 6 7
# 2 8 9 10 11
# 3 12 13 14 15
# 4 16 17 18 19
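One caveat (my note, using df0 and N from the example above): reshape only works when N divides the total number of values, so a guard like the following may be worth adding:
# reshape requires the total element count to be divisible by N
if df0.size % N != 0:
    raise ValueError('N must divide the total number of values')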
Another approach, going long-to-wide via stack and pivot:
N = 4
df1 = df0.stack().reset_index()  # columns: level_0 (row), level_1 (slot name), 0 (value)
k = df1['level_1'].str.replace('slot', '').astype(int)  # original slot number, 1..48
df1['i'] = k // N
df1['j'] = k % N
# shift so slots 1..N land on j = 1..N, and offset each original row by 48/N new rows
df1['i'] -= (df1['j'] == 0) - df1['level_0'] * 48 / N
df1['j'] += (df1['j'] == 0) * N
df1['j'] = 'slotNew' + df1['j'].astype(str)
df1 = df1[['i', 'j', 0]]
df = df1.pivot(index='i', columns='j', values=0)
Use Series.explode after making chunks. Given df:
import numpy as np
import pandas as pd

df = pd.DataFrame([np.arange(1, 49)], columns=['slot%s' % i for i in range(1, 49)])
print(df)
slot1 slot2 slot3 slot4 slot5 slot6 slot7 slot8 slot9 slot10 ... \
0 1 2 3 4 5 6 7 8 9 10 ...
slot39 slot40 slot41 slot42 slot43 slot44 slot45 slot46 slot47 \
0 39 40 41 42 43 44 45 46 47
slot48
0 48
Using chunks to divide:
def chunks(l, n):
    """Yield successive n-sized chunks from l.
    Source: https://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks
    """
    n_items = len(l)
    if n_items % n:
        n_pads = n - n_items % n
    else:
        n_pads = 0
    l = l + [np.nan for _ in range(n_pads)]
    for i in range(0, len(l), n):
        yield l[i:i + n]
N = 4
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), axis=1).explode()))
print(new_df)
Output:
0 1 2 3
0 1 2 3 4
1 5 6 7 8
2 9 10 11 12
3 13 14 15 16
4 17 18 19 20
...
An advantage of this approach over numpy.reshape is that it can handle the case when N is not a factor of the column count:
N = 7
new_df = pd.DataFrame(list(df.apply(lambda x: list(chunks(list(x), N)), axis=1).explode()))
print(new_df)
Output:
0 1 2 3 4 5 6
0 1 2 3 4 5 6 7.0
1 8 9 10 11 12 13 14.0
2 15 16 17 18 19 20 21.0
3 22 23 24 25 26 27 28.0
4 29 30 31 32 33 34 35.0
5 36 37 38 39 40 41 42.0
6 43 44 45 46 47 48 NaN
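A padded reshape can also handle the non-factor case; a sketch (my addition, building on the df and N above):
vals = df.values.ravel().astype(float)   # flatten to 1-D; float so NaN padding works
pad = (-vals.size) % N                   # elements needed to reach a multiple of N
padded = np.pad(vals, (0, pad), constant_values=np.nan)
new_df = pd.DataFrame(padded.reshape(-1, N))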
df=pd.DataFrame({'c1':[12,45,21,49],'c2':[67,86,28,55]})
I'd like to convert the index into columns
c1 c2
0 1 2 3 0 1 2 3
12 45 21 49 67 86 28 55
I tried combining stack and unstack, but so far without success.
Use unstack + to_frame + T:
df=pd.DataFrame({'c1':[12,45,21,49],'c2':[67,86,28,55]})
print (df.unstack().to_frame().T)
c1 c2
0 1 2 3 0 1 2 3
0 12 45 21 49 67 86 28 55
Or DataFrame + numpy.ravel + numpy.reshape with MultiIndex.from_product; note the values must be raveled column-wise (order='F') so they line up with the (column, index) product order:
mux = pd.MultiIndex.from_product([df.columns, df.index])
print (pd.DataFrame(df.values.ravel(order='F').reshape(1, -1), columns=mux))
c1 c2
0 1 2 3 0 1 2 3
0 12 45 21 49 67 86 28 55
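A slightly shorter sketch of the first idea (my variation, which should give the same one-row frame): pass the unstacked Series to the DataFrame constructor as a single row:
print (pd.DataFrame([df.unstack()]))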