I want to replace values that fall outside a given range with NaN, for multiple columns, where each column has its own [min, max] bounds.
E.g., suppose I had [col1_min = 5, col1_max = 15] and [col2_min = 2, col2_max = 20], and the columns looked like this:
import pandas as pd

df = pd.DataFrame({'col1': [1, 50, 15, 10, 4], 'col2': [12, 10, 100, 11, 56]})
col1 col2
1 12
50 10
15 100
10 11
4 56
The desired output would be:
df_filtered
col1 col2
nan 12
nan 10
15 nan
10 11
nan nan
In pseudo code I could group by each column's boundary with df.groupby('col1' or 'col2'), filter each column separately, and then merge the pieces back into the original, but I'd like to keep the memory cost to a minimum.
Is there any way to do this easily?
Use Series.where:
df['col1'] = df['col1'].where(df['col1'].between(5, 15))
df['col2'] = df['col2'].where(df['col2'].between(2, 20))
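If there are more than a couple of columns, the same idea extends naturally; a small sketch, where the bounds dict is my own naming:

import pandas as pd

df = pd.DataFrame({'col1': [1, 50, 15, 10, 4], 'col2': [12, 10, 100, 11, 56]})

bounds = {'col1': (5, 15), 'col2': (2, 20)}
for col, (lo, hi) in bounds.items():
    # keep values inside [lo, hi]; everything outside becomes NaN
    df[col] = df[col].where(df[col].between(lo, hi))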
I would do it with a dict of conditions:
condition = {'col1': [5, 15], 'col2': [2, 20]}
pd.concat([df.loc[df[x].between(*y), x] for x, y in condition.items()], axis=1)
Out[313]:
col1 col2
0 NaN 12.0
1 NaN 10.0
2 15.0 NaN
3 10.0 11.0
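Note that any row failing every condition (index 4 here) drops out of the concat result entirely, since neither selected Series contains that index. If the original shape matters, reindexing restores those rows as all-NaN (my addition, not part of the original answer):

out = pd.concat([df.loc[df[x].between(*y), x] for x, y in condition.items()], axis=1)
out = out.reindex(df.index)  # index 4 comes back as an all-NaN row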
I need to filter a data frame with different groups. The data frame looks as follows:
df = pd.DataFrame({"group":[1,1,1,
2,2,2,2,
3,3,3,
4,4],
"percentage":[70,70,70,
45,80,60,70,
71,85,90,
np.nan, np.nan]})
My goal is to return a data frame containing only groups that satisfy one of the two following conditions:
All observations of the group have percentage > 70
All observations of the group are np.nan
I know that I have to group the data frame first and then apply the conditions. This could easily be done with a for loop over the groups, but such a solution might be very slow. Any help would be appreciated.
You can try groupby with filter:
df = df.groupby('group').filter(lambda x: x['percentage'].gt(70).all() | x['percentage'].isna().all())
Out[25]:
group percentage
7 3 71.0
8 3 85.0
9 3 90.0
10 4 NaN
11 4 NaN
Use:
df[(df.percentage > 70) | df['percentage'].isnull()]
Output:
group percentage
4 2 80.0
7 3 71.0
8 3 85.0
9 3 90.0
10 4 NaN
11 4 NaN
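Note that this one-liner filters individual rows rather than whole groups, which is why group 2's single row above 70 survives; it does not satisfy the all-observations requirement. For group-level semantics without a per-group Python lambda, a fully vectorized mask built from transform is one option (a sketch of mine; it assumes no group mixes NaN and numeric values):

g = df.groupby('group')['percentage']
# all values above 70 is equivalent to group-min > 70 (an all-NaN group gives a NaN min);
# all-NaN is equivalent to a non-NaN count of 0
mask = g.transform('min').gt(70) | g.transform('count').eq(0)
df[mask]  # same rows as the filter answer above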
Looking for the fastest and most pandas-centric way of doing the following:
support =
Values Confidence R/S
10 3 S
20 6 S
40 10 S
35 12 S
df =
name strike
xyz 12
dfg 6
ghf 40
Aim: for each row of df, attach the support row whose Values is closest to, and strictly below, the strike (i.e., the nearest row where strike - Values > 0).
Expected output:
df =
name strike support
xyz 12 [10, 3, S, 2]
dfg 6 [0, 0, S, 0] # no support Values lies strictly below this strike, so fill with zeros
ghf 40 [35, 12, S, 5]
Bonus: expand the columns into the relevant columns.
I can do this by looping through the strikes, but I'm wondering if there is a better/faster way to achieve this.
Use merge_asof:
pd.merge_asof(df.sort_values('strike'),      # both frames must be sorted by key
              support.sort_values('Values'),
              left_on='strike',
              right_on='Values',
              direction='backward',          # default, so `Values <= strike`
              allow_exact_matches=False)     # make it strict: `Values != strike`
Output:
name strike Values Confidence R/S
0 dfg 6 NaN NaN NaN
1 xyz 12 10.0 3.0 S
2 ghf 40 35.0 12.0 S
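For the bonus, merge_asof already expands the columns; computing the distance and zero-filling the no-match row gets close to the expected output (a sketch; the diff column name is mine, and since the [0, 0, S, 0] placeholder for R/S is ambiguous, I leave that column as NaN):

out = pd.merge_asof(df.sort_values('strike'),
                    support.sort_values('Values'),
                    left_on='strike', right_on='Values',
                    allow_exact_matches=False)
out['diff'] = out['strike'] - out['Values']  # distance to the support level
out[['Values', 'Confidence', 'diff']] = out[['Values', 'Confidence', 'diff']].fillna(0)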
For a dataframe I would like to perform a lookup for every column and place the results in the neighbouring column. id_df contains the IDs and looks as follows:
Col1 Col2 ... Col160 Col161
0 4328.0 4561.0 ... NaN 5828.0
1 3587.0 4328.0 ... NaN 20572.0
2 4454.0 1702.0 ... NaN 683.0
lookup_df also contains the ID and a value that I'm interested in. lookup_df looks as follows:
ID Value
0 3587 3.0650
1 4454 2.9000
2 5 2.8450
3 8 2.8750
4 11 3.1000
5 13 3.1600
6 16 2.4450
7 18 3.0700
8 20 2.7950
9 23 3.0500
10 25 3.2250
I would like to get the following DataFrame df3:
Col1 ID Col1 Value ... Col161 ID Col161 Value
0 4328.0 2.4450 ... 5828.0 3.1600
1 3587.0 3.2250 ... 20572.0 3.0650
2 4454.0 3.0500 ... 683.0 3.1600
Because I'm an Excel user I thought of using the function 'merge', but I don't see how this can be done with multiple columns.
Thank you!
Use map:
m = lookup_df.set_index('ID')['Value']
result = pd.DataFrame()
for col in id_df.columns:
    result[col + '_ID'] = id_df[col]
    result[col + '_Value'] = id_df[col].map(m)
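A loop-free variant of the same idea (my sketch, not part of the original answer): map all columns at once, then interleave the ID and Value columns:

ids = id_df.add_suffix('_ID')
values = id_df.apply(lambda s: s.map(m)).add_suffix('_Value')
result = pd.concat([ids, values], axis=1)
# reorder so each ID column is immediately followed by its Value column
order = [c for col in id_df.columns for c in (col + '_ID', col + '_Value')]
result = result[order]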
I have a similar question to this one.
I have a dataframe with several rows, which looks like this:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 10 5 NaN 5
2 NaN 2 NaN NaN 20 NaN 10
I want to divide all columns ending in "value" by the column "Divider". How can I do so? One trick would be to sort the columns and reuse the answer from above, but is there a direct way that does not require sorting the dataframe?
The outcome would be:
Name TypA TypB ... TypF TypA_value TypB_value ... TypF_value Divider
1 1 1 NaN 2 1 0 5
2 NaN 2 NaN 0 2 0 10
So a NaN will lead to a 0.
Use DataFrame.filter to select the columns whose names contain '_value', use DataFrame.div along axis=0 to divide them by the column Divider, fill the NaNs with 0, and finally use DataFrame.update to write the values back into the dataframe. Note that fillna(0) must happen before update, because update skips NaN values in the frame you pass in:
d = df.filter(like='_value').div(df['Divider'], axis=0).fillna(0)
df.update(d)
Result:
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 0.0 5
1 2 NaN 2 NaN 0.0 2.0 0.0 10
You could select the columns of interest using DataFrame.filter, and divide as:
value_cols = df.filter(regex=r'_value$').columns
# a plain Series would align on column labels here, so use an (n, 1) NumPy
# column vector to broadcast the division down the rows instead
df[value_cols] /= df['Divider'].to_numpy()[:, None]
# df[value_cols] = df[value_cols].fillna(0)  # uncomment to turn the NaNs into 0
print(df)
Name TypA TypB TypF TypA_value TypB_value TypF_value Divider
0 1 1.0 1 NaN 2.0 1.0 NaN 5
1 2 NaN 2 NaN NaN 2.0 NaN 10
Taking two sample columns, A and B:
import pandas as pd
import numpy as np

a = {'Name': [1, 2],
     'TypA': [1, np.nan],
     'TypB': [1, 2],
     'TypA_value': [10, np.nan],
     'TypB_value': [5, 20],
     'Divider': [5, 10]}
df = pd.DataFrame(a)
cols_all = df.columns
Find the columns on which calculations are to be done, assuming they all contain '_value':
cols_to_calc = [c for c in cols_all if '_value' in c]
For these columns: first divide by the Divider column, then replace NaN with 0 in those columns.
for c in cols_to_calc:
    df[c] = df[c] / df.Divider
    df[c] = df[c].fillna(0)
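The loop can also be collapsed into a single vectorized statement, equivalent to the loop above (a sketch using the names defined in this answer):

df[cols_to_calc] = df[cols_to_calc].div(df.Divider, axis=0).fillna(0)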
I am trying to do the following: on a dataframe X, I want to select all rows where X['a'] > 0, but I want to preserve the dimensions of X, so that any other row appears as containing NaN. Is there a fast way to do it? If one does X[X['a'] > 0], the dimensions of X are not preserved.
Use double subscript [[]]:
In [42]:
df = pd.DataFrame({'a':np.random.randn(10)})
df
Out[42]:
a
0 1.042971
1 0.978914
2 0.764374
3 -0.338405
4 0.974011
5 -0.995945
6 -1.649612
7 0.965838
8 -0.142608
9 -0.804508
In [48]:
df[df[['a']] > 1]
Out[48]:
a
0 1.042971
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
The key semantic difference is that the double subscript returns a DataFrame, so the comparison masks the DataFrame itself rather than producing a boolean Series used to index the rows.
Note, though, that if the DataFrame has other columns, the boolean frame only covers 'a', so those columns will be masked to NaN entirely.
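To keep the other columns intact and only blank out whole rows, as the question asks, DataFrame.where with a boolean Series is the usual idiom; a minimal sketch:

import pandas as pd

df = pd.DataFrame({'a': [1.0, -0.3, 0.9], 'b': [10, 20, 30]})
df.where(df['a'] > 0)   # rows failing the condition become all-NaN, shape preserved
# equivalently: df[df['a'] > 0].reindex(df.index)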