I have the following dataset that contains information on whether a consumer gave a recommendation or not:
import pandas as pd

data = {'customer_id': [1, 2, 1, 3], 'recommend': [0, 1, 1, 0]}
df = pd.DataFrame.from_dict(data)
I would like to know if a customer gave a 0 recommendation in the past. The desired output would be:
data = {'customer_id': [1, 2, 1, 3], 'recommend': [0, 1, 1, 0], 'past': [0, 0, 1, 0]}
df = pd.DataFrame.from_dict(data)
How can I do it please?
You can do it by first adding a temporary column with your condition (recommend == 0) and then using groupby with shift and cummax inside a transform, so the cumulative max is taken per customer. Finally, drop the temporary column.
Code:
df['equal_zero'] = (df['recommend'] == 0).astype(int)
df['past'] = (df.groupby('customer_id')['equal_zero']
                .transform(lambda s: s.shift(1).cummax())
                .fillna(0))
df = df.drop(columns=['equal_zero'])
Result:
customer_id recommend past
0 1 0 0.0
1 2 1 0.0
2 1 1 1.0
3 3 0 0.0
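The fillna step leaves a float column (hence the 0.0 values above); if integer flags are preferred, one extra line converts it:
df['past'] = df['past'].astype(int)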
Use a custom function per group, with shifting and cumulative max, in GroupBy.transform:
df['past'] = (df['recommend'].eq(0)
                 .groupby(df['customer_id'])
                 .transform(lambda x: x.shift(fill_value=False).cummax())
                 .astype(int))

print(df)
customer_id recommend past
0 1 0 0
1 2 1 0
2 1 1 1
3 3 0 0
Assuming 'past' is a boolean flag (1 if the customer gave a zero in the past, 0 otherwise), here is a one-line solution:
df['past'] = df.apply(lambda x: 1 if len(df[(df.customer_id == x.customer_id) & (df.index < x.name) & (df.recommend == 0)]) > 0 else 0, axis=1)
If 'past' is a count value:
df['past'] = df.apply(lambda x: len(df[(df.customer_id == x.customer_id) & (df.index < x.name) & (df.recommend == 0)]), axis=1)
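The row-wise apply above rescans the frame for every row, so it gets slow on large inputs. A vectorized sketch of the count variant, mirroring the groupby/transform approach from the earlier answer:
# Count of past zero-recommendations per customer.
# shift(fill_value=False) excludes the current row; cumsum counts prior zeros.
df['past'] = (df['recommend'].eq(0)
              .groupby(df['customer_id'])
              .transform(lambda s: s.shift(fill_value=False).cumsum())
              .astype(int))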
I think the code below is OK, but it seems too clumsy. Basically, I want to go from the haves dataframe defined below to the wants output it prints: for each dummy column, sum the Result values over the rows where that dummy is 1. Hope this makes sense?
data = {'Dummy1': [0, 0, 1, 1],
        'Dummy2': [1, 1, 0, 0],
        'Result': [1, 1, 2, 2]}
haves = pd.DataFrame(data)
print(haves)
melted = pd.melt(haves, id_vars=['Result'])
melted = melted.loc[melted["value"] > 0]
print(melted)
wants = melted.groupby(["variable"])["Result"].sum()
print(wants)
No need to melt; perform a simple multiplication and sum:
wants = haves.drop('Result', axis=1).mul(haves['Result'], axis=0).sum()
Output:
Dummy1 4
Dummy2 2
dtype: int64
Intermediate:
>>> haves.drop('Result', axis=1).mul(haves['Result'], axis=0)
Dummy1 Dummy2
0 0 1
1 0 1
2 2 0
3 2 0
Shorter variant
Warning: this mutates the original dataframe, which will lose the 'Result' column.
wants = haves.mul(haves.pop('Result'), axis=0).sum()
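If you need to keep haves intact, a minimal non-mutating sketch multiplies first and drops 'Result' afterwards instead of popping it:
# Multiply every column (including 'Result') by 'Result', then discard it.
wants = haves.mul(haves['Result'], axis=0).drop(columns='Result').sum()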
I would like to add multiple columns programmatically to a dataframe using pre-defined rules. As an example, I would like to add 3 columns to the below dataframe, based on whether or not they satisfy the three rules indicated in code below:
# define dataframe
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})

# define "rules" for adding subsequent columns
rule_1 = (df1["time1"] == 1)
rule_2 = (df1["time2"] == 1)
rule_3 = (df1["time3"] == 1)

# add new columns based on whether or not the above rules are satisfied
df1["rule_1"] = np.where(rule_1, 1, 0)
df1["rule_2"] = np.where(rule_2, 1, 0)
df1["rule_3"] = np.where(rule_3, 1, 0)
As you can see, my approach gets tedious when I need to add tens of columns to a test dataframe, each based on a different "rule". Is there a way to do this more easily, without manually defining each column along with its individual np.where clause? I tried something like the following, but pandas does not accept it:
rules = [rule_1, rule_2, rule_3]

for rule in rules:
    df1[rule] = np.where(rule, 1, 0)
Any ideas on how to make my approach more programmatically efficient?
The solution you provided doesn't work because you are using the rule Series itself as the new column name. I would solve it like this:
rules = [rule_1, rule_2, rule_3]

for i, rule in enumerate(rules):
    df1[f'rule_{i+1}'] = np.where(rule, 1, 0)
Leverage Python's f-strings in a for loop; they are good at this:
# Create a list by filtering the time columns
cols = list(df1.filter(regex='time', axis=1).columns)

# Iterate through the columns, imposing your condition with np.where
for col in cols:
    df1[f'{col}_new'] = np.where(df1[col] == 1, 1, 0)
I might be oversimplifying your rules, but something like:
rules = [
    ('time1', 1),
    ('time2', 1),
    ('time3', 1),
]

for i, (col, val) in enumerate(rules):
    df1[f"rule_{i + 1}"] = np.where(df1[col] == val, 1, 0)
If all of your rules check the same thing, maybe this could be helpful: unstack the relevant columns, check the condition on the resulting Series, and convert back to a DataFrame with unstack:
df1[['rule1','rule2','rule3']] = df1[['time1','time2','time3']].unstack().eq(1).astype(int).swaplevel().unstack()
Output:
time1 time2 time3 outcome rule1 rule2 rule3
0 0 1 0 1 0 1 0
1 1 0 0 0 1 0 0
2 1 0 0 0 1 0 0
3 0 0 1 1 0 0 1
4 0 1 0 0 0 1 0
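If the rules are heterogeneous (different columns, operators, or thresholds), a further sketch is to store each rule as a named callable and loop over the mapping; the rule names and lambdas below are illustrative, not from the original post:
# Each entry maps a new column name to a callable returning a boolean Series.
rules = {
    'rule_1': lambda d: d['time1'] == 1,
    'rule_2': lambda d: d['time2'] == 1,
    'rule_3': lambda d: d['time3'] == 1,
}

for name, rule in rules.items():
    df1[name] = rule(df1).astype(int)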
I have the following dataframe:
df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
I'm trying to replace every value where 1 is present with an increasing number. I'm looking for something like:
df.replace(1, value=3)
which works great, but instead of the fixed number 3 I need the number to keep increasing (I want to use it as an ID), something like:
number += 1
If I try to join those together it doesn't work (or at least I'm not able to find the correct syntax). I'd like to obtain the following result:
df = pd.DataFrame({'A': [0, 2, 0],
                   'B': [1, 3, 4]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
Note: I cannot use any command that relies on specific column or row names, because the table has 2600 columns and 5000 rows.
Element-wise assignment on a copy of df.values can work. More specifically, a range from 1 to the number of 1s (inclusive) is assigned onto the locations of the 1 elements in the value array, and the array is then put back into the original dataframe.
Code
(Data as given)
1. Row-first ordering (what the OP wants)
arr = df.values
mask = (arr > 0)
arr[mask] = range(1, mask.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr[:, i]
# Result
print(df)
A B
2020-01-01 0 1
2020-02-01 2 3
2020-03-01 0 4
2. Column-first ordering (another possibility)
arr_tr = df.values.transpose()
mask_tr = (arr_tr > 0)
arr_tr[mask_tr] = range(1, mask_tr.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr_tr[i, :]
# Result
print(df)
A B
2020-01-01 0 2
2020-02-01 1 3
2020-03-01 0 4
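Since the frame contains only 0s and 1s, a shorter vectorized sketch of the row-first ordering multiplies a running cumulative sum by the original mask (this assumes strictly 0/1 data and starts again from the original df):
flat = df.to_numpy().ravel()  # row-first (C order) flattening
# cumsum gives the running count of 1s; multiplying by the 0/1 mask
# keeps that count only at the positions of the 1s.
df[:] = (flat.cumsum() * flat).reshape(df.shape)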
[DataFrame image omitted: timestamped rows with 'column 1' values and a 'column 3' flag]
The operation that I intend to perform: whenever there is a '2' in column 3, take the column 1 value of that entry, subtract the column 1 value of the previous entry, and multiply the result by a constant integer (say 5).
For example: from the image, there is a '2' in column 3 at 6:00, where the column 1 value is 0.011333 and the previous column 1 entry is 0.008583, so the computation is:
(0.011333 - 0.008583) * 5
I want to perform this every time a '2' appears in column 3 of the dataframe. Please help; I am not able to get the right code for the above operation.
Hope this helps: you can use df.shift(1) to get the previous row and np.where to select the rows satisfying your condition.
df = pd.DataFrame([['ABC', 1, 0, 0],
                   ['DEF', 2, 0, 0],
                   ['GHI', 3, 0, 0],
                   ['JKL', 4, 0, 2],
                   ['MNO', 5, 0, 2],
                   ['PQR', 6, 0, 2],
                   ['STU', 7, 0, 0]],
                  columns=['Date & Time', 'column 1', 'column 2', 'column 3'])
df['new'] = np.where(df['column 3'] == 2, (df['column 1'] - df['column 1'].shift(1)) * 5, 0)
print(df)
Output:
  Date & Time  column 1  column 2  column 3  new
0         ABC         1         0         0  0.0
1         DEF         2         0         0  0.0
2         GHI         3         0         0  0.0
3         JKL         4         0         2  5.0
4         MNO         5         0         2  5.0
5         PQR         6         0         2  5.0
6         STU         7         0         0  0.0
You can change the calculation as you like; in the else branch you can put np.nan or any other calculation instead.
Would something like this do the job? (Note it operates on a plain list of rows rather than a pandas DataFrame.)
dataframe = [
    [1, 3, 6, 6, 7],
    [4, 3, 5, 6, 7],
    [12, 3, 2, 6, 7],
    [2, 3, 7, 6, 7],
    [9, 3, 5, 6, 7],
    [13, 3, 2, 6, 7],
]
constant = 5
list_of_outputs = []
prev_entry = None

for row in dataframe:
    if row[2] == 2:
        if prev_entry is None:
            print("No previous entry!")
        else:
            output = (row[0] - prev_entry) * constant
            list_of_outputs.append(output)
    prev_entry = row[0]
Perhaps this question will help you
I think in a SQL way: basically, you make a new column filled with the value from the row above it.
df['column1_lagged'] = df['column 1'].shift(1)
Then you create another column that does the calculation:
constant = 5
df['calculation'] = (df['column 1'] - df['column1_lagged'])*constant
After that, you just slice the dataframe on your condition (the rows where column 3 equals 2):
condition = df['column 3'] == 2
df[condition]
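A slightly more compact sketch of the same idea, assuming the same column names: Series.diff() computes the difference with the previous row directly, so the lagged helper column can be skipped.
constant = 5

# diff() is equivalent to df['column 1'] - df['column 1'].shift(1)
df['calculation'] = df['column 1'].diff() * constant
print(df[df['column 3'] == 2])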
I'm working in Python. I have two dataframes df1 and df2:
d1 = {'timestamp1': [88148 , 5617900, 5622548, 5645748, 6603950, 6666502], 'col01': [1, 2, 3, 4, 5, 6]}
df1 = pd.DataFrame(d1)
d2 = {'timestamp2': [5629500, 5643050, 6578800, 6583150, 6611350], 'col02': [7, 8, 9, 10, 11], 'col03': [0, 1, 0, 0, 1]}
df2 = pd.DataFrame(d2)
I want to create a new column in df1 with the value of the minimum timestamp of df2 greater than the current df1 timestamp, where df2['col03'] is zero. This is the way I did it:
df1['colnew'] = np.nan
TSs = df1['timestamp1']

for TS in TSs:
    values = df2['timestamp2'][(df2['timestamp2'] > TS) & (df2['col03'] == 0)]
    if not values.empty:
        df1.loc[df1['timestamp1'] == TS, 'colnew'] = values.iloc[0]
It works, but I'd prefer not to use a for loop. Is there a better way to do this?
Use pandas.merge_asof with a forward direction:
pd.merge_asof(
    df1, df2.loc[df2.col03 == 0, ['timestamp2']],
    left_on='timestamp1', right_on='timestamp2', direction='forward'
).rename(columns=dict(timestamp2='colnew'))
col01 timestamp1 colnew
0 1 88148 5629500.0
1 2 5617900 5629500.0
2 3 5622548 5629500.0
3 4 5645748 6578800.0
4 5 6603950 NaN
5 6 6666502 NaN
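One caveat: the original loop uses a strict >, while merge_asof's forward direction also accepts exact timestamp matches by default. To reproduce the strict inequality, pass allow_exact_matches=False (a sketch of the same call):
pd.merge_asof(
    df1, df2.loc[df2.col03 == 0, ['timestamp2']],
    left_on='timestamp1', right_on='timestamp2',
    direction='forward', allow_exact_matches=False
).rename(columns=dict(timestamp2='colnew'))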
Give the apply method a try.
def func(x):
    values = df2['timestamp2'][(df2['timestamp2'] > x) & (df2['col03'] == 0)]
    if not values.empty:
        return values.iloc[0]
    else:
        return np.nan

df1["timestamp1"].apply(func)
You can create a separate function to do what has to be done.
The output is your new column:
0 5629500.0
1 5629500.0
2 5629500.0
3 6578800.0
4 NaN
5 NaN
Name: timestamp1, dtype: float64
It is not a one-line solution, but it helps keep things organised.
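To attach the result to df1 under the desired name, a small usage sketch:
df1['colnew'] = df1['timestamp1'].apply(func)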