Apply conditional groupby - python

I have a dataframe with two month columns, 'month1' and 'month2'. If the value in the 'month1' column is not 'NA', sum the corresponding 'amount' values grouped by 'month1'. If the value in 'month1' is 'NA', sum the corresponding 'amount' values grouped by 'month2' instead.
import pandas as pd

df = pd.DataFrame({'month1': [1, 2, 'NA', 1, 4, 'NA', 'NA'],
                   'month2': ['NA', 5, 1, 2, 'NA', 1, 3],
                   'amount': [10, 20, 40, 50, 60, 70, 100]})
The input and output dataframes are as follows:
Input dataframe
month1 month2 amount
0 1.0 NaN 10
1 2.0 5.0 20
2 NaN 1.0 40
3 1.0 2.0 50
4 4.0 NaN 60
5 NaN 1.0 70
6 NaN 3.0 100
Output dataframe

Since your 'NA' values are strings, you can simply groupby on the two columns:
# blank out month2 wherever month1 is usable (not 'NA'), so month2 doesn't split those groups
df.loc[df.month1.ne('NA'), 'month2'] = 'NA'
# groupby and sum
df.groupby(['month1', 'month2']).amount.transform('sum')
If you don't want to alter your data, you can build the grouping key separately (note np.where needs numpy imported as np):
s = np.where(df.month1.ne('NA'), 'NA', df['month2'])
df.groupby(['month1', s]).amount.transform('sum')
Output:
0 60
1 20
2 110
3 60
4 60
5 110
6 100
Name: amount, dtype: int64

You can use:
c = df.month1.eq('NA')
np.select([c, ~c],
          [df.groupby('month2')['amount'].transform('sum'),
           df.groupby('month1')['amount'].transform('sum')],
          default='NA')  # assign to a new column if needed
array(['60', '20', '110', '60', '60', '110', '100'], dtype='<U21')
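Note the result above is an array of strings ('<U21'): default='NA' forces NumPy to promote the whole array to a string dtype. A minimal sketch, assuming you would rather keep the sums numeric, is to use np.nan as the default instead (the 'total' column name here is just for illustration):

import numpy as np
import pandas as pd

df = pd.DataFrame({'month1': [1, 2, 'NA', 1, 4, 'NA', 'NA'],
                   'month2': ['NA', 5, 1, 2, 'NA', 1, 3],
                   'amount': [10, 20, 40, 50, 60, 70, 100]})

c = df.month1.eq('NA')
# with a float default the result stays float64 instead of '<U21' strings
df['total'] = np.select([c, ~c],
                        [df.groupby('month2')['amount'].transform('sum'),
                         df.groupby('month1')['amount'].transform('sum')],
                        default=np.nan)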

Edit: as @rafael pointed out, your data may be a mix of numbers and strings, so convert them all to numeric before processing.
A simple way is to groupby and transform on month1 and month2 separately, then fill the NaNs in the month1 result with the month2 result:
df = df.apply(pd.to_numeric, errors='coerce')
m1 = df.groupby('month1').amount.transform('sum')
m2 = df.groupby('month2').amount.transform('sum')
m1.fillna(m2)
Out[406]:
0 60.0
1 20.0
2 110.0
3 60.0
4 60.0
5 110.0
6 100.0
Name: amount, dtype: float64
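To attach the result back to the dataframe, assign the filled series to a new column (a one-line sketch using m1 and m2 from above; 'total' is just an illustrative name):

df['total'] = m1.fillna(m2)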

Related

Pandas: How to find the average length of days for a local outbreak to peak in a COVID-19 dataframe?

Let's say I have this dataframe containing the difference in the number of active cases from the previous value for each country:
[in]
import pandas as pd
import numpy as np
active_cases = {'Day(s) since outbreak': ['0', '1', '2', '3', '4', '5'],
                'Australia': [np.NaN, 10, 10, -10, -20, -20],
                'Albania': [np.NaN, 20, 0, 15, 0, -20],
                'Algeria': [np.NaN, 25, 10, -10, 20, -20]}
df = pd.DataFrame(active_cases)
df
[out]
Day(s) since outbreak Australia Albania Algeria
0 0 NaN NaN NaN
1 1 10.0 20.0 25.0
2 2 10.0 0.0 10.0
3 3 -10.0 15.0 -10.0
4 4 -20.0 0.0 20.0
5 5 -20.0 -20.0 -20.0
I need to find the average length of days for a local outbreak to peak in this COVID-19 dataframe.
My solution is to find the nth row with the first negative value in each column (e.g., nth row of first negative value in 'Australia': 3, nth row of first negative value in 'Albania': 5) and average it.
However, I have no idea how to do this in Panda/Python.
Are there any ways to perform this task with simple lines of Python/Panda code?
You can set_index on the column Day(s) since outbreak, then use iloc to select all rows except the first, and check where the values are less than zero (lt). idxmax then gives the first row label where each column is negative, and you take the mean. With your input, it gives:
print(df.set_index('Day(s) since outbreak')
        .iloc[1:, :].lt(0).idxmax().astype(float).mean())
3.6666666666666665
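If you prefer to see the intermediate steps, the same logic unrolls into a few lines (a sketch using the df defined above):

tmp = df.set_index('Day(s) since outbreak').iloc[1:]  # drop the all-NaN day-0 row
first_negative = tmp.lt(0).idxmax()  # first index label where each country goes negative
print(first_negative.astype(float).mean())  # 3.6666666666666665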
IIUC (if I understand correctly):
Using df.where, keep the negative values, replace everything else with np.NaN, and then calculate the mean:
cols = ['Australia', 'Albania', 'Algeria']
df.set_index('Day(s) since outbreak', inplace=True)
m = df < 0
df2 = df.where(m, np.NaN)
#df2 = df2.replace(0, np.NaN)
df2.mean()
Result:
Australia   -16.666667
Albania     -20.000000
Algeria     -15.000000
dtype: float64

check if each user has consecutive dates in a python 3 pandas dataframe

Imagine there is a dataframe:
id date balance_total transaction_total
0 1 01/01/2019 102.0 -1.0
1 1 01/02/2019 100.0 -2.0
2 1 01/03/2019 100.0 NaN
3 1 01/04/2019 100.0 NaN
4 1 01/05/2019 96.0 -4.0
5 2 01/01/2019 200.0 -2.0
6 2 01/02/2019 100.0 -2.0
7 2 01/04/2019 100.0 NaN
8 2 01/05/2019 96.0 -4.0
Here is the command to create the dataframe:
import pandas as pd
import numpy as np

users = pd.DataFrame(
    [
        {'id': 1, 'date': '01/01/2019', 'transaction_total': -1, 'balance_total': 102},
        {'id': 1, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 1, 'date': '01/03/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 1, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': np.nan},
        {'id': 2, 'date': '01/01/2019', 'transaction_total': -2, 'balance_total': 200},
        {'id': 2, 'date': '01/02/2019', 'transaction_total': -2, 'balance_total': 100},
        {'id': 2, 'date': '01/04/2019', 'transaction_total': np.nan, 'balance_total': 100},
        {'id': 2, 'date': '01/05/2019', 'transaction_total': -4, 'balance_total': 96}
    ]
)
How could I check whether each id has consecutive dates or not? I used the
"shift" idea from here, but it doesn't seem to work:
Calculating time difference between two rows
df['index_col'] = df.index
for id in df['id'].unique():
    # create an empty QA dataframe
    column_names = ["Delta"]
    df_qa = pd.DataFrame(columns=column_names)
    df_qa['Delta'] = (df['index_col'] - df['index_col'].shift(1))
    if (df_qa['Delta'].iloc[1:] != 1).any() is True:
        print('id ' + id + ' might have non-consecutive dates')
        # doesn't print any account => each customer's daily balance has consecutive dates
        break
Ideal output:
it should print id 2 might have non-consecutive dates
Thank you!
Use groupby and diff:
df["date"] = pd.to_datetime(df["date"],format="%m/%d/%Y")
df["difference"] = df.groupby("id")["date"].diff()
print (df.loc[df["difference"]>pd.Timedelta(1, unit="d")])
#
id date transaction_total balance_total difference
7 2 2019-01-04 NaN 100.0 2 days
Use DataFrameGroupBy.diff with Series.dt.days, compare for greater than 1 with Series.gt, and filter only the id column with DataFrame.loc:
users['date'] = pd.to_datetime(users['date'])
i = users.loc[users.groupby('id')['date'].diff().dt.days.gt(1), 'id'].tolist()
print(i)
[2]
for val in i:
    print(f'id {val} might have non-consecutive dates')
id 2 might have non-consecutive dates
The first step is to parse the dates:
users['date'] = pd.to_datetime(users.date)
Then add shifted copies of the id and date columns:
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
The difference between the date and date_shifted columns is what we are interested in:
>>> users.date - users.date_shifted
0 NaT
1 1 days
2 1 days
3 1 days
4 1 days
5 -4 days
6 1 days
7 2 days
8 1 days
dtype: timedelta64[ns]
You can now query the DataFrame for what you want:
users[(users.id_shifted == users.id) & (users.date - users.date_shifted != np.timedelta64(1, 'D'))]
That is, consecutive lines of the same user with a date difference != 1 day.
This solution does assume the data is sorted by (id, date); a sketch that sorts first follows.
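If the frame might not already be sorted, a minimal sketch that sorts before shifting (reusing the users frame and column names from the question):

users['date'] = pd.to_datetime(users.date)
users = users.sort_values(['id', 'date']).reset_index(drop=True)
users['id_shifted'] = users.id.shift(1)
users['date_shifted'] = users.date.shift(1)
# consecutive rows of the same user whose date gap is not exactly one day
print(users[(users.id_shifted == users.id) &
            (users.date - users.date_shifted != np.timedelta64(1, 'D'))])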

How to add conditions when calculating using Python?

I have a dataframe with two numeric columns. I want to add a third column that holds the difference between them. But the condition is: if the value in the first column is blank or NaN, the difference should be the value in the second column...
Can anyone help me with this problem?
Any suggestions and clues will be appreciated!
Thank you.
You should use vectorised operations where possible. Here you can use numpy.where:
df['Difference'] = np.where(df['July Sales'].isnull(), df['August Sales'],
df['August Sales'] - df['July Sales'])
However, this is precisely the same as treating NaN values in df['July Sales'] as zero, so you can use pd.Series.fillna:
df['Difference'] = df['August Sales'] - df['July Sales'].fillna(0)
This isn't really a situation with conditions; it is just a math operation. Consider your df using the .sub() method with its fill_value argument:
df['Diff'] = df['August Sales'].sub(df['July Sales'], fill_value=0)
returns output:
July Sales August Sales Diff
0 459.0 477 18.0
1 422.0 125 -297.0
2 348.0 483 135.0
3 397.0 271 -126.0
4 NaN 563 563.0
5 191.0 325 134.0
6 435.0 463 28.0
7 NaN 479 479.0
8 475.0 473 -2.0
9 284.0 496 212.0
Used a sample dataframe, but it shouldn't be hard to comprehend:
df = pd.DataFrame({'A': [1, 2, np.nan, 3], 'B': [10, 20, 30, 40]})

def diff(row):
    return row['B'] if pd.isnull(row['A']) else row['B'] - row['A']

df['C'] = df.apply(diff, axis=1)
ORIGINAL DATAFRAME:
A B
0 1.0 10
1 2.0 20
2 NaN 30
3 3.0 40
AFTER apply:
A B C
0 1.0 10 9.0
1 2.0 20 18.0
2 NaN 30 30.0
3 3.0 40 37.0
try this:
def diff(row):
    # pd.isnull is needed here: `not row['col1']` would miss NaN, since NaN is truthy
    if pd.isnull(row['col1']):
        return row['col2']
    else:
        return row['col1'] - row['col2']

df['col3'] = df.apply(diff, axis=1)
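The same logic can also be vectorised to avoid the row-wise apply (a sketch keeping the generic col1/col2 names from the snippet above):

df['col3'] = (df['col1'] - df['col2']).where(df['col1'].notna(), df['col2'])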

Pandas DF Multiple Conditionals using np.where

I am trying to combine a few relatively simple conditions into an np.where clause, but am having trouble getting the syntax down for the logic.
My current dataframe looks like the df below, with four columns. I would like to add two columns, named the below, with the following conditions:
The desired output is below - the df df_so_v2
Days since activity
* Find the most recent prior row with the same ID, then subtract the dates
* If there is no most recent prior value, return NA
Chg. Avg. Value
Condition 1: If Count = 0, NA
Condition 2: If Count != 0, find the most recent prior row with BOTH the same ID and Count != 0, then take the difference of the Avg. Value column.
However, I am building off simple np.where queries like the below and do not know how to combine the multiple conditions needed in this case.
df['CASH'] = np.where(df['CASH'] != 0, df['CASH'] + commission , df['CASH'])
Thank you very much for your help on this.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                  '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)

df_dict_v2 = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                         '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                         '2017-08-01', '2017-08-01', '2017-08-01'],
              'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                     '553', '559', '914', '553', '559', '914'],
              'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
              'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0],
              'Days_since_activity': [4, 3, 1, 1, 1, 2, 1, 2, 1, 1, 1, 1, 'NA', 'NA', 'NA'],
              'Chg. Avg Value': ['NA', -0.7, -1.1, 'NA', -0.8, 1.3, 2.3, -1.4, 'NA', -1.4, 'NA', 'NA', 'NA', 'NA', 'NA']}
df_so_v2 = pd.DataFrame(df_dict_v2)
Here is the answer to part 1 of the question; I need more clarification on the conditions of part 2.
1) Days since activity: find the most recent prior row with the same ID, then subtract the dates; if there is no prior row, return NA.
First you need to convert strings to datetime, then sort the dates in ascending order. Finally use .transform to find the difference.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                  '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df_so = pd.DataFrame(df_dict)

df_so['DateOf'] = pd.to_datetime(df_so['DateOf'])
df_so.sort_values('DateOf', inplace=True)
df_so['Days_since_activity'] = df_so.groupby(['ID'])['DateOf'].transform(pd.Series.diff)
df_so.sort_index()  # restore the original row order for display
Edited based on your comment:
Find the most recent previous day that does not have a count of Zero and calculate the difference.
df_dict = {'DateOf': ['2017-08-07', '2017-08-07', '2017-08-07', '2017-08-04', '2017-08-04', '2017-08-04',
                      '2017-08-03', '2017-08-03', '2017-08-03', '2017-08-02', '2017-08-02', '2017-08-02',
                      '2017-08-01', '2017-08-01', '2017-08-01'],
           'ID': ['553', '559', '914', '553', '559', '914', '553', '559', '914',
                  '553', '559', '914', '553', '559', '914'],
           'Count': [0, 4, 5, 0, 11, 10, 3, 9, 0, 1, 0, 2, 4, 4, 0],
           'Avg. Value': [0, 3.5, 2.2, 0, 4.2, 3.3, 5.3, 5, 0, 3, 0, 2, 4.4, 6.4, 0]}
df = pd.DataFrame(df_dict)
df['DateOf'] = pd.to_datetime(df['DateOf'], format='%Y-%m-%d')
df.sort_values(['ID', 'DateOf'], inplace=True)
df['Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff()
mask = df.ID != df.ID.shift(1)                # first row of each ID
mask2 = df.groupby('ID').Count.shift(1) == 0  # previous row of the same ID had Count == 0
# use .loc instead of chained indexing (df['col'][mask] = ...), which raises
# SettingWithCopyWarning and can silently fail in newer pandas
df.loc[mask, 'Days_since_activity'] = np.nan
df.loc[mask2, 'Days_since_activity'] = df.groupby(['ID'])['DateOf'].diff(2)
df['Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff()
df.loc[mask2, 'Chg. Avg Value'] = df.groupby(['ID'])['Avg. Value'].diff(2)
conditions = [df['Count'] == 0]
choices = [np.nan]
df['Chg. Avg Value'] = np.select(conditions, choices, default=df['Chg. Avg Value'])
# df = df.sort_index()
df
New unsorted Output for easy comparison:
DateOf ID Count Avg. Value Days_since_activity Chg. Avg Value
12 2017-08-01 553 4 4.4 NaT NaN
9 2017-08-02 553 1 3.0 1 days -1.4
6 2017-08-03 553 3 5.3 1 days 2.3
3 2017-08-04 553 0 0.0 1 days NaN
0 2017-08-07 553 0 0.0 4 days NaN
13 2017-08-01 559 4 6.4 NaT NaN
10 2017-08-02 559 0 0.0 1 days NaN
7 2017-08-03 559 9 5.0 2 days -1.4
4 2017-08-04 559 11 4.2 1 days -0.8
1 2017-08-07 559 4 3.5 3 days -0.7
14 2017-08-01 914 0 0.0 NaT NaN
11 2017-08-02 914 2 2.0 NaT NaN
8 2017-08-03 914 0 0.0 1 days NaN
5 2017-08-04 914 10 3.3 2 days 1.3
2 2017-08-07 914 5 2.2 3 days -1.1
Index 11 is NaT because the most recent prior row has a Count of zero and there is nothing earlier to compare it to.

Pandas: how to identify the values in a column of a dataframe and do some math operations

I want to do operations, such that I produce something like this:
In other words, if the values in Name are in the 'first_list', I want to multiply the 'Values' by two. If they are in the 'second_list', I want to multiply them by 0.5. If they are not in either (for Nick and Nicky), do not do anything.
This is what I have:
first_list = ['John', 'James', 'Julius', 'Alex']
second_list = ['Lilly', 'Alexis', 'Becly']
if df['Name'].isin(first_list).any():
    df['New Values'] = df['Values'] * 2
elif df['Name'].isin(second_list).any():
    df['New Values'] = df['Values'] * 0.5
But it's not doing the multiplication as I want. Instead, it gives me:
Let's use np.where and isin:
df['New Value'] = np.where(df.Name.isin(first_list),
                           df.Values * 2,
                           np.where(df.Name.isin(second_list),
                                    df.Values * .5,
                                    df.Values))
Setup:
df = pd.DataFrame({'Name': ['John', 'Lily', 'Alexis', 'Becky', 'James', 'Julian', 'Alex', 'Nick', 'Nicky'],
                   'Values': [50, 100, 30, 60, 40, 20, 80, 25, 46]})
first_list = ['John', 'James', 'Julius', 'Alex']
second_list = ['Lily', 'Alexis', 'Becky']
Output:
Name Values New Value
0 John 50 100.0
1 Lily 100 50.0
2 Alexis 30 15.0
3 Becky 60 30.0
4 James 40 80.0
5 Julian 20 20.0
6 Alex 80 160.0
7 Nick 25 25.0
8 Nicky 46 46.0
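If more lists get added later, the nested np.where calls become hard to read; np.select expresses the same logic as a flat list of conditions (a sketch reusing the setup above):

import numpy as np

conditions = [df.Name.isin(first_list), df.Name.isin(second_list)]
choices = [df['Values'] * 2, df['Values'] * 0.5]
df['New Value'] = np.select(conditions, choices, default=df['Values'])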
