I'm a recent convert from Excel to Python. I think what I'm trying to do here would traditionally be done with a VLOOKUP of sorts, but I may be struggling with the terminology and so can't find the Python solution. I have been using the pandas library for most of my data analysis.
I have two different data frames: one with the weight changes (DF1) and the other with the weights (DF2). I want to go line by line (changes are chronological) and:
1. create a new column in DF1 with the weight before the change (basically extracted from DF2);
2. update the results in DF2, where Weight = Weight + WeightChange.
Note: The data frames do not have the same dimensions; an individual has several weight changes (DF1) but only one weight (DF2).
DF1:
Name WeightChange
1 John 5
2 Peter 10
3 John 7
4 Mary -20
5 Gary -3
DF2:
Name Weight
1 John 180
2 Peter 160
3 Mary 120
4 Gary 150
Firstly I'd merge df1 and df2 on the 'Name' column to add the weight column to df1.
Then I'd group df1 by 'Name' and apply a transform to calculate the total weight change for each person. transform returns a Series aligned to the original df, so you can add an aggregated column back to the df.
Then I'd merge this column into df2, after which it's a simple case of adding the total weight change to the existing weight column:
In [242]:
df1 = df1.merge(df2, on='Name', how='left')
df1['WeightChangeTotal'] = df1.groupby('Name')['WeightChange'].transform('sum')
df1
Out[242]:
Name WeightChange Weight WeightChangeTotal
0 John 5 180 12
1 Peter 10 160 10
2 John 7 180 12
3 Mary -20 120 -20
4 Gary -3 150 -3
In [243]:
df2 = df2.merge(df1[['Name','WeightChangeTotal']], on='Name')
df2
Out[243]:
Name Weight WeightChangeTotal
0 John 180 12
1 John 180 12
2 Peter 160 10
3 Mary 120 -20
4 Gary 150 -3
In [244]:
df2['Weight'] = df2['Weight'] + df2['WeightChangeTotal']
df2
Out[244]:
Name Weight WeightChangeTotal
0 John 192 12
1 John 192 12
2 Peter 170 10
3 Mary 100 -20
4 Gary 147 -3
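Note that the merge in In [243] duplicates John's row in df2 because df1 has two John entries. If you want df2 back to one row per name afterwards, a small follow-up sketch (my addition, not part of the original answer):
df2 = df2.drop_duplicates(subset='Name').reset_index(drop=True)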
EDIT
To address your desired behaviour for the 'WeightBefore' column:
In [267]:
df1['WeightBefore'] = df1['Weight'] + df1.groupby('Name')['WeightChange'].shift().cumsum().fillna(0)
df1
Out[267]:
Name WeightChange Weight WeightBefore
0 John 5 180 180
1 Peter 10 160 160
2 John 7 180 185
3 Mary -20 120 120
4 Gary -3 150 150
So the above groups on 'Name', applies a shift to the column, and then takes the cumulative sum so we accumulate the incremental differences; we have to call fillna because the shift produces NaN for the first weight change of each Name.
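Note that the cumsum above runs over the whole shifted column. With at most two changes per name, as here, that is fine, but if one person's changes are interleaved with another's you may want the cumulative sum kept inside each group too. A hedged sketch (my variant, not part of the original answer):
# keep the shift + cumsum inside each Name group
df1['WeightBefore'] = df1['Weight'] + (
    df1.groupby('Name')['WeightChange']
       .transform(lambda s: s.shift().cumsum())
       .fillna(0)
)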
Here is a simplified version of my dataframe (the number of persons in my dataframe is way more than 3):
import pandas as pd

df = pd.DataFrame({'Person': ['John', 'David', 'Mary', 'John', 'David', 'Mary'],
                   'Sales': [10, 15, 20, 11, 12, 18]})
Person Sales
0 John 10
1 David 15
2 Mary 20
3 John 11
4 David 12
5 Mary 18
I would like to add a column "Total" to this data frame, which is the sum of total sales per person
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
What would be the easiest way to achieve this?
I have tried
df.groupby('Person').sum()
but the shape of the output is not congruent with the shape of df.
Sales
Person
David 27
John 21
Mary 38
What you want is the transform method which can apply a function on each group:
df['Total'] = df.groupby('Person')['Sales'].transform(sum)
It gives as expected:
Person Sales Total
0 John 10 21
1 David 15 27
2 Mary 20 38
3 John 11 21
4 David 12 27
5 Mary 18 38
Another way is to use the pandas groupby and sum functions and then map the per-person totals back onto each row (assigning the grouped result directly won't align with the dataframe's index):
df['Total'] = df['Person'].map(df.groupby('Person')['Sales'].sum())
This will add a column to the dataframe with the total sales per person.
Your 'Person' column in the dataframe contains repeated values, so the result of a plain groupby sum does not line up with the original rows. I would suggest making a new dataframe based on the sales sum. The code below will help you with that:
newDf = pd.DataFrame(df.groupby('Person')['Sales'].sum()).reset_index()
This will create a new dataframe with 'Person' and 'Sales' as columns.
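If you then want that per-person total back on every original row (the output shown in the question), one option is to merge the summary back in. A small sketch using the newDf from above; the rename to 'Total' is my addition:
df = df.merge(newDf.rename(columns={'Sales': 'Total'}), on='Person', how='left')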
I have the following DataFrame
import pandas as pd
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500]}
df = pd.DataFrame(data=d)
Client  Salesperson  Amount  Salesperson 2  Amount2
1       John         1000    Bob            400
2       John         1000    Richard        200
3       Bob          0       John           300
4       Richard      500     Tom            500
And I just need to create some sort of "sumif" statement (the one from Excel) that will add up the amount each salesperson is due. I don't know how to iterate over each row, but I want to add the values in "Amount" and "Amount2" for each of the salespersons.
Then I need to be able to see the amount per salesperson.
Expected Output (Ideally in a DataFrame as well)
Sales Person  Total Amount
John          2300
Bob           400
Richard       700
Tom           500
There can be multiple ways of solving this. One option is to use pandas concat to stack the salesperson/amount column pairs and then use groupby:
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson',
                                                     'Amount2': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
you get
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
Edit: If you have another pair of salesperson/amount, you can add that to the concat
d = {'Client':[1,2,3,4],'Salesperson':['John','John','Bob','Richard'],
'Amount':[1000,1000,0,500],'Salesperson 2':['Bob','Richard','John','Tom'],
'Amount2':[400,200,300,500], 'Salesperson 3':['Nick','Richard','Sam','Bob'],
'Amount3':[400,800,100,400]}
df = pd.DataFrame(data=d)
merged_df = pd.concat([
    df[['Salesperson', 'Amount']],
    df[['Salesperson 2', 'Amount2']].rename(columns={'Salesperson 2': 'Salesperson',
                                                     'Amount2': 'Amount'}),
    df[['Salesperson 3', 'Amount3']].rename(columns={'Salesperson 3': 'Salesperson',
                                                     'Amount3': 'Amount'})
])
merged_df.groupby('Salesperson', as_index=False)['Amount'].sum()
Salesperson Amount
0 Bob 800
1 John 2300
2 Nick 400
3 Richard 1500
4 Sam 100
5 Tom 500
Edit 2: Another solution using pandas wide_to_long
df = df.rename({'Salesperson':'Salesperson 1','Amount':'Amount1'}, axis='columns')
reshaped_df = pd.wide_to_long(df, stubnames=['Salesperson', 'Amount'],
                              i='Client', j='num', suffix=r'\s?\d+').reset_index(drop=True)
The above will reshape df,
Salesperson Amount
0 John 1000
1 John 1000
2 Bob 0
3 Richard 500
4 Bob 400
5 Richard 200
6 John 300
7 Tom 500
8 Nick 400
9 Richard 800
10 Sam 100
11 Bob 400
A simple groupby on reshaped_df will give you required output
reshaped_df.groupby('Salesperson', as_index = False)['Amount'].sum()
One option is to tidy the dataframe into long form, where all the salespersons are in one column and the amounts in another; then you can group by and get the aggregate.
Let's use pivot_longer from pyjanitor to transform to long form:
# pip install pyjanitor
import pandas as pd
import janitor
(df
 .pivot_longer(
     index="Client",
     names_to=".value",
     names_pattern=r"([a-zA-Z]+).*",
 )
 .groupby("Salesperson", as_index=False)
 .Amount
 .sum()
)
Salesperson Amount
0 Bob 400
1 John 2300
2 Richard 700
3 Tom 500
The .value tells the function to keep only the parts of the column names that match it as headers. The column names follow a pattern: they start with text (either Salesperson or Amount) and may or may not end with a number. This pattern is captured in names_pattern; .value is paired with the group in brackets, and anything outside the group is discarded.
Once transformed into long form, it is easy to group by and aggregate. The as_index parameter lets us keep the output as a dataframe.
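If pyjanitor isn't available, a rough plain-pandas sketch of the same tidy-then-aggregate idea using pandas.lreshape (my alternative, not part of the original answer; it assumes the column names of the example df above):
# stack the two Salesperson/Amount column pairs into long form, then aggregate
long_df = pd.lreshape(df, {'Salesperson': ['Salesperson', 'Salesperson 2'],
                           'Amount': ['Amount', 'Amount2']})
long_df.groupby('Salesperson', as_index=False)['Amount'].sum()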
Say I have this simple dataframe-
import numpy as np
import pandas as pd

dic = {'firstname': ['Steve', 'Steve', 'Steve', 'Steve', 'Steve', 'Steve'],
       'lastname': ['Johnson', 'Johnson', 'Johnson', 'Johnson', 'Johnson', 'Johnson'],
       'company': ['CHP', 'CHP', 'CHP', 'CHP', 'CHP', 'CHP'],
       'faveday': ['2020-07-13', '2020-07-20', '2020-07-16', '2020-10-14',
                   '2020-10-28', '2020-10-21'],
       'paid': [200, 300, 550, 100, 900, 650]}
df = pd.DataFrame(dic)
df['faveday'] = pd.to_datetime(df['faveday'])
print(df)
with output-
firstname lastname company faveday paid
0 Steve Johnson CHP 2020-07-13 200
1 Steve Johnson CHP 2020-07-20 300
2 Steve Johnson CHP 2020-07-16 550
3 Steve Johnson CHP 2020-10-14 100
4 Steve Johnson CHP 2020-10-28 900
5 Steve Johnson CHP 2020-10-21 650
I want to be able to keep the rows that have a faveday within 7 days of another, but only if their paid values also sum to more than 1000.
If I wanted to apply just the 7-day condition on its own, I would use:
def sefd(x):
    return np.sum((np.abs(x.values - x.values[:, None]) / np.timedelta64(1, 'D')) <= 7, axis=1) >= 2

s = df.groupby(['firstname', 'lastname', 'company'])['faveday'].transform(sefd)
df['seven_days'] = s
df = df[s]
del df['seven_days']
This would keep all of the entries (All of these are within 7 days of another faveday grouped by firstname, lastname, and company).
If I wanted to apply a function that keeps rows for the same person with the same company and a summed paid amount > 1000, I would use-
df = df[df.groupby(['lastname', 'firstname','company'])['paid'].transform(sum) > 1000]
Just a simple transform(sum) function
This would also keep all of the entries (since all are under the same name and company and sum to greater than 1000).
However, if we were to combine these two functions at the same time, one row actually would not be included.
My desired output is-
firstname lastname company faveday paid
0 Steve Johnson CHP 2020-07-13 200
1 Steve Johnson CHP 2020-07-20 300
2 Steve Johnson CHP 2020-07-16 550
4 Steve Johnson CHP 2020-10-28 900
5 Steve Johnson CHP 2020-10-21 650
Notice how index 3 is no longer valid because it's only within 7 days of index 5, but if you were to sum index 3 paid and index 5 paid, it would only be 750 (<1000).
It is also important to note that since indexes 0, 1, and 2 are all within 7 days of each other, that counts as one summed group (200 + 300 + 550 > 1000).
The logic is that I would want to first see (based on a group of firstname, lastname, and company name) whether or not a faveday is within 7 days of another. Then after confirming this, see if the paid column for these favedays sums to over 1000. If so, keep those indexes in the dataframe. Otherwise, do not.
A suggested answer given to me was-
df = df.sort_values(["firstname", "lastname", "company", "faveday"])

def date_difference_from(x, df):
    return abs((df.faveday - x).dt.days)

def grouped_dates(grouped_df):
    keep = []
    for idx, row in grouped_df.iterrows():
        within_7 = date_difference_from(row.faveday, grouped_df) <= 7
        keep.append(within_7.sum() > 1 and grouped_df[within_7].paid.sum() > 1000)
    msk = np.array(keep)
    return grouped_df[msk]

df = df.groupby(["firstname", "lastname", "company"]).apply(grouped_dates).reset_index(drop=True)
print(df)
This works perfectly for small data sets like this one, but when I apply it to a bigger dataset (10,000+ rows), some inconsistencies appear.
Is there any way to improve this code?
I found a solution that avoids looping over idx to compare whether other rows are within 7 days, but it involves unstack and reindex, so it will increase memory usage (I tried tapping into the _get_window_bounds method of rolling, but that proved above my expertise). It should be fine for the scale you describe. Although this solution's runtime is on par with yours on the toy df you provided, it is orders of magnitude faster on larger datasets.
Edit: allow multiple deposits on one date.
Take this data (with replace=True by default in random.choice)
import string

import numpy as np
import pandas as pd

np.random.seed(123)
n = 40
df = pd.DataFrame([[a, b, b, faveday, paid]
                   for a in string.ascii_lowercase
                   for b in string.ascii_lowercase
                   for faveday, paid in zip(
                       np.random.choice(pd.date_range('2020-01-01', '2020-12-31'), n),
                       np.random.randint(100, 1200, n))
                   ], columns=['firstname', 'lastname', 'company', 'faveday', 'paid'])
df['faveday'] = pd.to_datetime(df['faveday'])
df = df.sort_values(["firstname", "lastname", "company", "faveday"]).reset_index(drop=True)
>>>print(df)
firstname lastname company faveday paid
0 a a a 2020-01-03 1180
1 a a a 2020-01-18 206
2 a a a 2020-02-02 490
3 a a a 2020-02-09 615
4 a a a 2020-02-17 471
... ... ... ... ... ...
27035 z z z 2020-11-22 173
27036 z z z 2020-12-22 863
27037 z z z 2020-12-23 675
27038 z z z 2020-12-26 1165
27039 z z z 2020-12-30 683
[27040 rows x 5 columns]
And the code
def get_valid(df, window_size=7, paid_gt=1000,
              groupbycols=['firstname', 'lastname', 'company']):
    # df_clean = df.set_index(['faveday'] + groupbycols).unstack(groupbycols)
    # # unstack names to bypass groupby
    df_clean = df.groupby(['faveday'] + groupbycols).paid.agg(['size', 'sum'])
    df_clean.columns = ['ct', 'paid']
    df_clean = df_clean.unstack(groupbycols)
    df_clean = df_clean.reindex(pd.date_range(df_clean.index.min(),
                                              df_clean.index.max())).sort_index()
    # include all dates, so the index can be treated as a regular daily range
    window = df_clean.fillna(0).rolling(window_size + 1).sum()
    # notice fillna to prevent false NaNs while summing
    df_clean = df_clean.paid * (  # multiply by a mask for both conditions
        (window.ct > 1) & (window.paid > paid_gt)
    ).replace(False, np.nan).bfill(limit=7)
    # replacing with np.nan so we can backfill to include all dates in the window
    df_clean = df_clean.rename_axis('faveday').stack(groupbycols)\
        .reset_index(level='faveday').sort_index().reset_index()
    # reshaping to the original format
    return df_clean

df1 = get_valid(df, window_size=7, paid_gt=1000,
                groupbycols=['firstname', 'lastname', 'company'])
This runs in about 1.5 seconds (vs 143 seconds for your current code) and returns
firstname lastname company faveday 0
0 a a a 2020-02-02 490.0
1 a a a 2020-02-09 615.0
2 a a a 2020-02-17 1232.0
3 a a a 2020-03-09 630.0
4 a a a 2020-03-14 820.0
... ... ... ... ... ...
17561 z z z 2020-11-12 204.0
17562 z z z 2020-12-22 863.0
17563 z z z 2020-12-23 675.0
17564 z z z 2020-12-26 1165.0
17565 z z z 2020-12-30 683.0
[17566 rows x 5 columns]
I have a pandas dataframe that contains budget data, but my sales data is located in another dataframe that is not the same size. How can I get my sales data updated in my budget data? How can I write the conditions that make these updates?
DF budget:
cust type loc rev sales spend
0 abc new north 500 0 250
1 def new south 700 0 150
2 hij old south 700 0 150
DF sales:
cust type loc sales
0 abc new north 15
1 hij old south 18
DF budget outcome:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
Any thoughts?
Assuming the 'cust' column is unique in your other df, you can call map on the budget df's 'cust' column, passing the sales df with its index set to 'cust'. This maps each 'cust' in the budget df to its sales value; you will get NaN where values are missing, so you call fillna(0) to fill those:
In [76]:
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0)
df
Out[76]:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
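An alternative sketch, not from the answer above, using DataFrame.update; it assumes 'cust' uniquely identifies rows in both frames and that the two frames are named budget and sales (hypothetical names for the frames shown above):
# align both frames on 'cust' so update matches rows by customer
budget = budget.set_index('cust')
budget.update(sales.set_index('cust')[['sales']])  # overwrites only where sales has a value
budget = budget.reset_index()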
Suppose we take a pandas dataframe...
name age family
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
Then do a groupby() ...
group_df = df.groupby('family')
group_df = group_df.aggregate({'name': name_join, 'age': 'mean'})
Then do some aggregate/summarize operation (in my example, my function name_join aggregates the names):
def name_join(list_names, concat='-'):
    return concat.join(list_names)
The grouped summarized output is thus:
age name
family
1 23 john-jason-jane
2 28 jack-james
Question:
Is there a quick, efficient way to get to the following from the aggregated table?
name age family
0 john 23 1
1 jason 23 1
2 jane 23 1
3 jack 28 2
4 james 28 2
(Note: the age column values are just examples; I don't care about the information I am losing by averaging in this specific example.)
The way I thought I could do it does not look too efficient:
create empty dataframe
from every line in group_df, separate the names
return a dataframe with as many rows as there are names in the starting row
append the output to the empty dataframe
The rough equivalent is .reset_index(), but it may not be helpful to think of it as the "opposite" of groupby().
You are splitting a string into pieces, and maintaining each piece's association with 'family'. This old answer of mine does the job.
Just set 'family' as the index column first, refer to the link above, and then reset_index() at the end to get your desired result.
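For reference, a rough sketch of the split-and-expand step that answer describes, assuming group_df is the aggregated frame shown above (index 'family', names joined with '-'); the variable names here are my own:
# split the joined names back into lists, expand to one row per name,
# and bring 'family' back as a column
ungrouped = (group_df
             .assign(name=group_df['name'].str.split('-'))
             .explode('name')
             .reset_index())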
It turns out that DataFrame.groupby() returns an object with the original data stored in .obj, so ungrouping is just pulling out the original data.
group_df = df.groupby('family')
group_df.obj
Example
>>> dat_1 = df.groupby("category_2")
>>> dat_1
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fce78b3dd00>
>>> dat_1.obj
order_date category_2 value
1 2011-02-01 Cross Country Race 324400.0
2 2011-03-01 Cross Country Race 142000.0
3 2011-04-01 Cross Country Race 498580.0
4 2011-05-01 Cross Country Race 220310.0
5 2011-06-01 Cross Country Race 364420.0
.. ... ... ...
535 2015-08-01 Triathalon 39200.0
536 2015-09-01 Triathalon 75600.0
537 2015-10-01 Triathalon 58600.0
538 2015-11-01 Triathalon 70050.0
539 2015-12-01 Triathalon 38600.0
[531 rows x 3 columns]
Here's a complete example that recovers the original dataframe from the grouped object
import pandas

def name_join(list_names, concat='-'):
    return concat.join(list_names)

print('create dataframe\n')
df = pandas.DataFrame({'name': ['john', 'jason', 'jane', 'jack', 'james'],
                       'age': [1, 36, 32, 26, 30],
                       'family': [1, 1, 1, 2, 2]})
df.index.name = 'indexer'
print(df)

print('create group_by object')
group_obj_df = df.groupby('family')
print(group_obj_df)

print('\nrecover grouped df')
group_joined_df = group_obj_df.aggregate({'name': name_join, 'age': 'mean'})
group_joined_df
create dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
create group_by object
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fbfdd9dd048>
recover grouped df
name age
family
1 john-jason-jane 23
2 jack-james 28
print('\nRecover the original dataframe')
print(pandas.concat([group_obj_df.get_group(key) for key in group_obj_df.groups]))
Recover the original dataframe
name age family
indexer
0 john 1 1
1 jason 36 1
2 jane 32 1
3 jack 26 2
4 james 30 2
There are a few ways to undo DataFrame.groupby; one is to call DataFrame.groupby(...).filter(lambda x: True), which gets you back the original DataFrame.
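A quick check using the name/age/family frame from the example above (a sketch; the trivial lambda lets every group through):
# every group passes, so all rows come back in their original order
restored = df.groupby('family').filter(lambda x: True)
print(restored.equals(df))  # should print True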