Python Groupby with Boolean Mask
I have a pandas dataframe with the following general format:
id,atr1,atr2,orig_date,fix_date
1,bolt,l,2000-01-01,nan
1,screw,l,2000-01-01,nan
1,stem,l,2000-01-01,nan
2,stem,l,2000-01-01,nan
2,screw,l,2000-01-01,nan
2,stem,l,2001-01-01,2001-01-01
3,bolt,r,2000-01-01,nan
3,stem,r,2000-01-01,nan
3,bolt,r,2001-01-01,2001-01-01
3,stem,r,2001-01-01,2001-01-01
I want to add a failed_part_ind column that flags records whose part was later fixed within the same id/atr1/atr2 group. The result would be the following:
id,atr1,atr2,orig_date,fix_date,failed_part_ind
1,bolt,l,2000-01-01,nan,0
1,screw,l,2000-01-01,nan,0
1,stem,l,2000-01-01,nan,0
2,stem,l,2000-01-01,nan,1
2,screw,l,2000-01-01,nan,0
2,stem,l,2001-01-01,2001-01-01,0
3,bolt,r,2000-01-01,nan,1
3,stem,r,2000-01-01,nan,1
3,bolt,r,2001-01-01,2001-01-01,0
3,stem,r,2001-01-01,2001-01-01,0
Any tips or tricks most welcome!
Update 2:
A better way to describe what I need: within each .groupby(['id', 'atr1', 'atr2']) group, create a new indicator column that is 1 for records meeting the following criterion:
(df['orig_date'] < df['fix_date'])
I thought something like this might work, but the row == row comparisons are always true, so it collapses to a per-row orig_date < fix_date check and never looks across the group:
df['failed_part_ind'] = df.apply(
    lambda row: 1 if ((row['id'] == row['id']) &
                      (row['atr1'] == row['atr1']) &
                      (row['atr2'] == row['atr2']) &
                      (row['orig_date'] < row['fix_date']))
    else 0, axis=1)
Update: I think this is what you want:
import pandas as pd

def f(g):
    # Earliest fix_date in the group; NaN/NaT if nothing was ever fixed.
    min_fix_date = g['fix_date'].min()
    if pd.isna(min_fix_date):
        g['failed_part_ind'] = 0
    else:
        g['failed_part_ind'] = g['orig_date'].apply(
            lambda d: 1 if d < min_fix_date else 0)
    return g

df = df.groupby(['id', 'atr1', 'atr2'], group_keys=False).apply(f)
Note that pd.isna handles both NaN and NaT (np.isnan raises on datetime values), and the groupby result has to be assigned back to df.
Related
How to slice a pandas data frame with a for loop over another list?
I have a dataframe like this, but it's not necessarily always 3 sites:

data = [[501620, 501441, 501549], [501832, 501441, 501549],
        [528595, 501662, 501549], [501905, 501441, 501956],
        [501913, 501441, 501549]]
df = pd.DataFrame(data, columns=["site_0", "site_1", "site_2"])

I want to slice the dataframe with conditions taken dynamically from the elements of li (a list), in any combination. I have tried the code below, which is static:

li = [1, 2]
random_com = (501620, 501441, 501549)
df_ = df[(df["site_" + str(li[0])] == random_com[li[0]]) &
         (df["site_" + str(li[1])] == random_com[li[1]])]

How can I make the above code dynamic? I also tried the following, but it gives me one dataframe per condition, while I need a single dataframe with all conditions combined (AND):

[df[(df["site_" + str(j)] == random_com[j])] for j in li]
You can iterate over the conditions and build an & of all of them:

li = [1, 2]
main_mask = True
for i in li:
    main_mask = main_mask & (df["site_" + str(i)] == random_com[i])
df_ = df[main_mask]
If you prefer a one-liner, you could use functools.reduce():

from functools import reduce
df_ = df[reduce(lambda x, y: x & y,
                [(df["site_" + str(j)] == random_com[j]) for j in li])]
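Both answers can be run end-to-end on the posted data; here is a sketch showing that the loop and the reduce() version produce the same mask:

```python
from functools import reduce

import pandas as pd

data = [[501620, 501441, 501549], [501832, 501441, 501549],
        [528595, 501662, 501549], [501905, 501441, 501956],
        [501913, 501441, 501549]]
df = pd.DataFrame(data, columns=["site_0", "site_1", "site_2"])

li = [1, 2]
random_com = (501620, 501441, 501549)

# Loop version: start from an all-True mask and AND in one condition per index.
mask = pd.Series(True, index=df.index)
for i in li:
    mask &= df["site_" + str(i)] == random_com[i]

# reduce() version: fold the per-index conditions into one boolean Series.
mask2 = reduce(lambda x, y: x & y,
               [df["site_" + str(j)] == random_com[j] for j in li])

df_ = df[mask]
print(list(df_.index))  # → [0, 1, 4]
```

Starting from a pandas Series of True (rather than the bare Python True in the answer) keeps the intermediate mask a proper boolean Series at every step.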
How to use df.apply to switch between columns?
Consider the following code:

import numpy as np
import pandas as pd

np.random.seed(0)
df_H = pd.DataFrame(
    {'L0': np.random.randn(100),
     'OneAndZero': np.random.randn(100),
     'OneAndTwo': np.random.randn(100),
     'GTwo': np.random.randn(100),
     'Decide': np.random.randn(100)})

I would like to create a new column named Result whose value depends on the column Decide: if Decide is less than 0, Result should take the row's value from L0; between 0 and 1, from OneAndZero; between 1 and 2, from OneAndTwo; and greater than 2, from GTwo. How would one do this with df.apply, since I have only seen examples with fixed values rather than values taken from other columns?
Just because it is Good Friday, we can try the following; otherwise it is a commonly asked question:

c1 = df_H['Decide'].le(0)
c2 = df_H['Decide'].between(0, 1)
c3 = df_H['Decide'].between(1, 2)
c4 = df_H['Decide'].gt(2)
cond = [c1, c2, c3, c4]
choices = [df_H['L0'], df_H['OneAndZero'], df_H['OneAndTwo'], df_H['GTwo']]
df_H['Result'] = np.select(cond, choices)
df_H
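For reference, the np.select approach runs end-to-end like this on the posted frame. Note that between() is inclusive on both ends by default, so adjacent conditions share their boundary values; np.select resolves that by taking the first condition that matches:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
df_H = pd.DataFrame(
    {'L0': np.random.randn(100),
     'OneAndZero': np.random.randn(100),
     'OneAndTwo': np.random.randn(100),
     'GTwo': np.random.randn(100),
     'Decide': np.random.randn(100)})

conds = [df_H['Decide'].le(0),
         df_H['Decide'].between(0, 1),
         df_H['Decide'].between(1, 2),
         df_H['Decide'].gt(2)]
choices = [df_H['L0'], df_H['OneAndZero'],
           df_H['OneAndTwo'], df_H['GTwo']]

# np.select evaluates all conditions up front and picks, per row,
# the choice belonging to the first True condition.
df_H['Result'] = np.select(conds, choices)
```

Because Decide is continuous random data, the shared boundaries (exactly 0, 1, or 2) essentially never occur, so the overlap is harmless here.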
If you really want to use apply:

def choose_res(x):
    if x['Decide'] <= 0:
        return x['L0']
    if 0 < x['Decide'] <= 1:
        return x['OneAndZero']
    if 1 < x['Decide'] <= 2:
        return x['OneAndTwo']
    if x['Decide'] > 2:
        return x['GTwo']

df_H['Result'] = df_H.apply(axis=1, func=choose_res, result_type='expand')
df.iloc:

df_H.reset_index(drop=True, inplace=True)
for i in range(len(df_H)):
    a = df_H['Decide'].iloc[i]
    if 0 <= a <= 1:
        df_H.loc[i, 'Result'] = df_H['OneAndZero'].iloc[i]
    elif 1 < a <= 2:
        df_H.loc[i, 'Result'] = df_H['OneAndTwo'].iloc[i]

Maybe you can try it this way. For df.apply, create a function that holds the conditions and their outputs, then use:

df_H['Result'] = df_H.apply(your_function_name, axis=1)
pandas: calculate/show dataframe cumsum() only for positive values and another condition
How can I make this cumsum() calculate and show values in a new column only for rows where df.col_2 == 'closed' and df.col_values > 0?

df['new_col'] = df.groupby('col_1')['col_values'].cumsum()
Here is a solution (but there might be a more elegant one):

indexes = (df.col_2 == 'closed') & (df.col_values > 0)
df.loc[indexes, 'new_col'] = df.loc[indexes].groupby('col_1')['col_values'].cumsum()
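A minimal sketch of this solution on hypothetical sample data (the column contents are made up, since the question did not post any). Rows excluded by the condition keep NaN in new_col, and the cumulative sum runs only over the rows that qualify:

```python
import pandas as pd

# Hypothetical data: two groups in col_1, a status column, and values.
df = pd.DataFrame({
    'col_1':      ['a', 'a', 'a', 'b', 'b'],
    'col_2':      ['closed', 'open', 'closed', 'closed', 'closed'],
    'col_values': [10, 5, -3, 7, 2],
})

# Select only qualifying rows, cumsum within each col_1 group,
# and write the result back to those rows only.
idx = (df.col_2 == 'closed') & (df.col_values > 0)
df.loc[idx, 'new_col'] = df.loc[idx].groupby('col_1')['col_values'].cumsum()
```

Row 0 gets 10, rows 3 and 4 get 7 and 9 (the running sum within group 'b'), and the 'open' and negative rows stay NaN.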
Creating a Pandas dataframe column which is conditional on a function
Say I have some dataframe like the one below, and I create a new column (track_len) giving the length of the track_no string:

import pandas as pd
df = pd.DataFrame({'item_id': [1, 2, 3],
                   'track_no': ['qwerty23', 'poiu2', 'poiuyt5']})
df['track_len'] = df['track_no'].str.len()
df.head()

My question is: how do I now create a new column (new_col) that selects a specific slice of the track_no string depending on the track number's length (track_len)? I tried writing a function that returns the relevant slice of track_no for each track_len and applying it to create the column, but it doesn't work:

def f(row):
    if row['track_len'] == 8:
        val = row['track_no'].str[0:3]
    elif row['track_len'] == 5:
        val = row['track_no'].str[0:1]
    elif row['track_len'] == 7:
        val = row['track_no'].str[0:2]
    return val

df['new_col'] = df.apply(f, axis=1)
df.head()

The desired output (based on the string slicing in f) is:

{new_col: ['qwe', 'p', 'po']}

Alternative, better solutions to this problem would also be appreciated.
Your function works; you just need to remove the .str part in your if blocks, since the row values are already plain strings:

def f(row):
    if row['track_len'] == 8:
        val = row['track_no'][:3]
    elif row['track_len'] == 5:
        val = row['track_no'][:1]
    elif row['track_len'] == 7:
        val = row['track_no'][:2]
    return val

df['new_col'] = df.apply(f, axis=1)
df.head()

Output:

   item_id  track_no  track_len new_col
0        1  qwerty23          8     qwe
1        2     poiu2          5       p
2        3   poiuyt5          7      po
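As an alternative to the row-wise apply, the length-to-slice-width rule can be expressed as a mapping, then applied in one pass. A sketch assuming only lengths 8, 5, and 7 occur (any other length would map to NaN and need handling):

```python
import pandas as pd

df = pd.DataFrame({'item_id': [1, 2, 3],
                   'track_no': ['qwerty23', 'poiu2', 'poiuyt5']})
df['track_len'] = df['track_no'].str.len()

# Map each track length to the number of leading characters to keep.
width = df['track_len'].map({8: 3, 5: 1, 7: 2})

# Variable-length slicing isn't a single vectorized .str call,
# so pair each string with its width in a comprehension.
df['new_col'] = [s[:w] for s, w in zip(df['track_no'], width)]

print(df['new_col'].tolist())  # → ['qwe', 'p', 'po']
```

Keeping the rule in a dict makes it easy to extend with new lengths without touching the slicing code.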
Replacing value based on conditional pandas
How do you replace a value in a dataframe based on a condition over the entire data frame, not just one column? I have tried df.where, but this doesn't work as planned:

import operator
df = df.where(operator.and_(df > (-1 * .2), df < 0), 0)
df = df.where(df > 0, df * 1.2)

Basically, what I'm trying to do here is replace all values between -0.2 and 0 with zero across all columns in my dataframe, and multiply all values greater than zero by 1.2.
You've misunderstood the way pandas where works: it keeps the values of the original object where the condition is true and replaces them otherwise, so you can reverse your logic:

df = df.where((df <= (-1 * .2)) | (df >= 0), 0)
df = df.where(df <= 0, df * 1.2)
where allows you to have a one-line solution, which is great. I prefer to use a mask like so:

idx = (df < 0) & (df >= -0.2)
df[idx] = 0

I prefer breaking this into two lines because, this way, it is easier to read. You could force this onto a single line as well:

df[(df < 0) & (df >= -0.2)] = 0

Just another option.
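A runnable sketch of the full transformation on a small made-up frame, using DataFrame.mask (the complement of where: it replaces where the condition *is* true), with the boundary treated as in the original attempt (strictly greater than -0.2):

```python
import pandas as pd

# Hypothetical frame to illustrate both transformations.
df = pd.DataFrame({'a': [-0.1, -0.5, 0.5],
                   'b': [-0.15, 1.0, -0.3]})

# Zero out values in the open interval (-0.2, 0),
# then scale all positive values by 1.2.
df = df.mask((df > -0.2) & (df < 0), 0)
df = df.mask(df > 0, df * 1.2)
```

Here -0.1 and -0.15 become 0, the positives 0.5 and 1.0 become 0.6 and 1.2, and -0.5 and -0.3 are left untouched. mask reads more directly when your mental model is "replace where the condition holds", while where reads as "keep where the condition holds".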