YY_MM_CD  customerid  pol_no   type    WE   WP
2019-07   15680       1313145  new      3   89
2020-01   14672       1418080  renwd   -8  223
2019-01   15681       1213143  new      4    8
2019-01   15683       1213344  new     -6    3
2019-03   14678       1418280  renwd  -66   -7
Now, I have some x years of data and I am taking a snapshot of 2019-20. Suppose a customer in this snapshot paid a premium on 01/11/2019 but didn't pay a premium on 01/11/2020; that record will not be there in the data. I need to create a dummy record for that customer, like customer id 15681 here, keeping their WE and WP as 0 0 since the customer didn't pay:
YY_MM_CD  customerid  pol_no   type    WE   WP
2019-07   15680       1313145  new      3   89
2020-01   14672       1418080  renwd   -8  223
2019-01   15681       1213143  new      4    8
2020-01   15681       1213143  new      0    0
2019-03   14678       1418280  renwd  -66   -7
Don't create a dummy datapoint. Write the expiration date next to each customer id; then, when accessing the data, just check whether the current date is before the expiration date.
Simpler and cleaner.
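For example, here is a minimal sketch of that idea, assuming a payment keeps the policy active for one year (the one-year term and the as-of date are illustrative assumptions, not from the question):

import pandas as pd

df["YY_MM_CD"] = pd.to_datetime(df["YY_MM_CD"], format="%Y-%m")

# expiration date per customer: last payment month plus one year (assumed term)
expiry = df.groupby("customerid")["YY_MM_CD"].max() + pd.DateOffset(years=1)

# when accessing the data, compare the date you care about to the expiration date
as_of = pd.Timestamp("2020-11-01")  # illustrative as-of date
is_active = df["customerid"].map(expiry) > as_of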
If you would like to do what you asked (add a row or column based on a condition):
You need to group the customers
Then apply your condition to each group
For example:
import pandas as pd

df = YOURDATA  # your original dataframe
dummy_rows = []  # collect new rows in a list (DataFrame.append was removed in pandas 2.0)
for customer_id, group in df.groupby("customerid"):  # iterating yields (key, group) pairs
    if len(group) < 2:  # your condition
        df2 = pd.DataFrame()  # ADD YOUR DATA HERE
        dummy_rows.append(df2)
new_df = pd.concat(dummy_rows, ignore_index=True)
At the end you can combine new_df and df with concat: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
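Putting it together for this particular data, here is a sketch (assuming YY_MM_CD parses as year-month and the dummy row should land exactly one year after the unmatched 2019 payment):

import pandas as pd

df["YY_MM_CD"] = pd.to_datetime(df["YY_MM_CD"], format="%Y-%m")

paid_2019 = df[df["YY_MM_CD"].dt.year == 2019]
ids_2020 = set(df.loc[df["YY_MM_CD"].dt.year == 2020, "customerid"])

# customers with a 2019 record but no 2020 record get a zero-valued dummy row
dummies = paid_2019[~paid_2019["customerid"].isin(ids_2020)].copy()
dummies["YY_MM_CD"] = dummies["YY_MM_CD"] + pd.DateOffset(years=1)
dummies[["WE", "WP"]] = 0

result = pd.concat([df, dummies], ignore_index=True)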
Related
I have a dataset like this:
Customer ID  Date        Profit
1            4/13/2018   10.00
1            4/26/2018   13.27
1            10/23/2018  15.00
2            1/1/2017    7.39
2            7/5/2017    9.99
2            7/7/2017    10.01
3            5/4/2019    30.30
I'd like to group by and sum profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID  Date        Profit
1            4/13/2018   23.27
1            10/13/2018  15.00
2            1/1/2017    7.39
2            7/1/2017    20.00
3            5/4/2019    30.30
The closest I've been able to get on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to start the sums on each user's first transaction day.
If changing the dates is not possible (e.g. showing customer 2's window as 7/1/2017 rather than 7/5/2017), then at least summing the profit based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month until you find a more perfect solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
    df
    .set_index("Date")
    .groupby(["Customer ID"])
    .Profit
    .resample("6MS")
    .sum()
    .reset_index(name="Profit")
)
print(df)
Customer ID Date Profit
0 1 2018-04-01 23.27
1 1 2018-10-01 15.00
2 2 2017-01-01 7.39
3 2 2017-07-01 20.00
4 3 2019-05-01 30.30
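If you really need the windows anchored at each customer's actual first transaction date rather than the month start, here is a sketch that bins each purchase by the number of calendar months since that customer's first purchase (day-of-month is ignored when counting months, which is an assumption, though it reproduces the desired output above):

import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

# each customer's first transaction date, broadcast to every row
first = df.groupby("Customer ID")["Date"].transform("min")

# whole calendar months elapsed since the first transaction,
# rounded down to the nearest multiple of 6
months = (df["Date"].dt.year - first.dt.year) * 12 + (df["Date"].dt.month - first.dt.month)
window = (months // 6) * 6

df["WindowStart"] = [s + pd.DateOffset(months=int(m)) for s, m in zip(first, window)]
out = df.groupby(["Customer ID", "WindowStart"])["Profit"].sum().reset_index()

On the sample data this yields windows starting 4/13/2018 and 10/13/2018 for customer 1 and 1/1/2017 and 7/1/2017 for customer 2, matching the desired output.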
I want to split a string into multiple rows. This is my attempt:
df.assign(MODEL_ABC = df['MODEL_ABC'].str.split('_').explode('MODEL_ABC'))
my output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
If I run it on the column individually I get the values below, but not the entire dataframe:
A
B
This is my dataframe df:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A_B 75.0 25.0
expected output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
1 2018 First B 75.0 25.0
You can do the following: start by splitting the column into lists, then explode it to create multiple rows:
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC')
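For what it's worth, the same thing works as a one-liner. The key point is that explode has to be called on the DataFrame so the other columns get duplicated along with the split values; calling it on the Series inside assign, as in the original attempt, cannot do that. The ignore_index flag (pandas >= 1.1) renumbers the rows 0, 1, ... as in the expected output:

df = df.assign(MODEL_ABC=df["MODEL_ABC"].str.split("_")).explode("MODEL_ABC", ignore_index=True)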
I have a Pandas dataframe as follows
df = pd.DataFrame([['John', '1/1/2017','10'],
['John', '2/2/2017','15'],
['John', '2/2/2017','20'],
['John', '3/3/2017','30'],
['Sue', '1/1/2017','10'],
['Sue', '2/2/2017','15'],
['Sue', '3/2/2017','20'],
['Sue', '3/3/2017','7'],
['Sue', '4/4/2017','20']
],
columns=['Customer', 'Deposit_Date','DPD'])
What is the best way to calculate the PreviousMean column shown in the answer output below?
The column is the year-to-date average of DPD for that customer, i.e. it includes all DPDs up to, but not including, rows that match the current deposit date. If no previous records exist it is null or 0.
Notes:
the data is grouped by Customer Name and expanding over Deposit Dates
within each group, the expanding mean is calculated using only values from the previous rows.
at the start of each new customer the mean is 0 or alternatively null as there are no previous records on which to form the mean
the data frame is ordered by Customer Name and Deposit_Date
Instead of grouping and expanding the mean, filter the dataframe on these conditions and calculate the mean of DPD:
Customer == current row's Customer
Deposit_Date < current row's Deposit_Date
Use df.apply to perform this operation for all rows in the dataframe:
# parse dates and make DPD numeric first (both are strings in the sample df)
df['Deposit_Date'] = pd.to_datetime(df['Deposit_Date'])
df['DPD'] = df['DPD'].astype(int)
df['PreviousMean'] = df.apply(
    lambda x: df[(df.Customer == x.Customer) & (df.Deposit_Date < x.Deposit_Date)].DPD.mean(),
    axis=1)
Output:
Customer Deposit_Date DPD PreviousMean
0 John 2017-01-01 10 NaN
1 John 2017-02-02 15 10.0
2 John 2017-02-02 20 10.0
3 John 2017-03-03 30 15.0
4 Sue 2017-01-01 10 NaN
5 Sue 2017-02-02 15 10.0
6 Sue 2017-03-02 20 12.5
7 Sue 2017-03-03 7 15.0
8 Sue 2017-04-04 20 13.0
Here's one way to exclude repeated days from the mean calculation:
import numpy as np
# create helper column which is NaN for repeated days, DPD otherwise
s = df.groupby(['Customer Name', 'Deposit_Date']).cumcount() > 0
df['DPD2'] = np.where(s, np.nan, df['DPD'])
# expanding mean per customer (pd.expanding_mean was removed from pandas long ago)
df['CumMean'] = df.groupby('Customer Name')['DPD2'].transform(lambda x: x.expanding().mean())
# drop helper column
df = df.drop(columns='DPD2')
print(df)
Customer Name Deposit_Date DPD CumMean
0 John 01/01/2017 10 10.0
1 John 01/01/2017 10 10.0
2 John 02/02/2017 20 15.0
3 John 03/03/2017 30 20.0
4 Sue 01/01/2017 10 10.0
5 Sue 01/01/2017 10 10.0
6 Sue 02/02/2017 20 15.0
7 Sue 03/03/2017 30 20.0
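For reference, a hypothetical frame that reproduces the printed output above (reconstructed from that output, since this answer's variant of the data, with the 'Customer Name' column and duplicated days, is not shown in the question):

import pandas as pd

df = pd.DataFrame({
    'Customer Name': ['John'] * 4 + ['Sue'] * 4,
    'Deposit_Date': ['01/01/2017', '01/01/2017', '02/02/2017', '03/03/2017'] * 2,
    'DPD': [10, 10, 20, 30] * 2,
})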
OK, here is the best solution I've come up with thus far.
The trick is to first create an aggregated table at the customer and deposit-date level containing a shifted mean. To calculate this mean you have to calculate the sum and the count first.
s = df.groupby(['Customer Name', 'Deposit_Date'])[['DPD']].agg(['count', 'sum'])
s.columns = [' '.join(col) for col in s.columns]  # flatten to 'DPD count', 'DPD sum'
s.reset_index(inplace=True)
# running totals per customer, then the cumulative mean
s['DPD_CumSum'] = s.groupby('Customer Name')['DPD sum'].cumsum()
s['DPD_CumCount'] = s.groupby('Customer Name')['DPD count'].cumsum()
s['DPD_CumMean'] = s['DPD_CumSum'] / s['DPD_CumCount']
# shift by one so each date only sees the mean of *previous* dates
s['DPD_PrevMean'] = s.groupby('Customer Name')['DPD_CumMean'].shift(1)
df = df.merge(s[['Customer Name', 'Deposit_Date', 'DPD_PrevMean']], how='left', on=['Customer Name', 'Deposit_Date'])
I need some help iterating over a groupby object in Python. I have people nested under a single ID variable, and under each of those they have balances for anywhere from 3 to 6 months. Printing the groupby object looks, for example, like this:
(1, Primary BP Product Rpt Month Closing Balance
0 1 CHECK 201708 10.04
1 1 CHECK 201709 11.1
2 1 CHECK 201710 11.16
3 1 CHECK 201711 11.22
4 1 CHECK 201712 11.28
5 1 CHECK 201801 11.34)
(2, Primary BP Product Rpt Month Closing Balance
79 2 CHECK 201711 52.42
85 2 CHECK 201712 31.56
136 2 CHECK 201801 99.91)
I want to create another column that standardizes the closing balance based on their first amount. So the ideal output would then look like this:
(1, Primary BP Product Rpt Month Closing Balance standardized
0 1 CHECK 201708 10.04 0
1 1 CHECK 201709 11.1 1.1
2 1 CHECK 201710 11.16 1.16
3 1 CHECK 201711 11.22 1.22
4 1 CHECK 201712 11.28 1.28
5 1 CHECK 201801 11.34 1.34)
(2, Primary BP Product Rpt Month Closing Balance standardized
79 2 CHECK 201711 52.42 0
85 2 CHECK 201712 31.56 -20.86
136 2 CHECK 201801 99.91 47.79)
I just can't quite figure out how to write a nice for loop, or any other approach, that iterates within the groups of a groupby object, taking the first closing balance and subtracting it from each subsequent closing balance to create a difference score.
I solved it! Only two weeks later. Did it without the use of a groupby object. Here is how:
bpid = []
diffs = []
# These two lines were just a bit of cleaning needed to make the vals numeric
data['Closing Balance'] = data['Closing Balance'].str.replace(",", "")
data['Closing Balance'] = pd.to_numeric(data['Closing Balance'])
# Compute the increase in closing balance relative to each person's first month,
# setting the first month to 0. (Assigning into `row` inside iterrows never
# modifies the dataframe, so the diffs are collected in a list and attached below.)
for index, row in data.iterrows():
    bp = row['Primary BP']
    if bp not in bpid:  # first time we see this person
        bpid.append(bp)
        first = row['Closing Balance']
    bal = row['Closing Balance']
    diff = round(bal - first, 2)
    diffs.append(diff)
# Just checking to make sure there are the right number of values. Same as data, so good to go
print(len(diffs))
# Convert my list of differences in closing balance to a series object, and merge with the monthly_data
se = pd.Series(diffs)
data['balance increase'] = se.values
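For what it's worth, a groupby transform can do the same thing in one step; a sketch assuming the ID column is named 'Primary BP' as in the printed groups above:

data['balance increase'] = (
    data['Closing Balance']
    - data.groupby('Primary BP')['Closing Balance'].transform('first')
).round(2)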
I have a pandas dataframe that contains budget data, but my sales data is located in another dataframe that is not the same size. How can I get my sales data updated in my budget data, and how can I write the conditions that make these updates?
DF budget:
cust type loc rev sales spend
0 abc new north 500 0 250
1 def new south 700 0 150
2 hij old south 700 0 150
DF sales:
cust type loc sales
0 abc new north 15
1 hij old south 18
DF budget outcome:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
Any thoughts?
Assuming that the 'cust' column is unique in your other df, you can set the sales df's index to 'cust' and call map on the budget df's 'cust' column; this maps each 'cust' in the budget df to its sales value. You will get NaN where there are missing values, so call fillna(0) to fill those:
In [76]:
df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0)
df
Out[76]:
cust type loc rev sales spend
0 abc new north 500 15 250
1 def new south 700 0 150
2 hij old south 700 18 150
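One caveat: map introduces NaN for unmatched customers, which upcasts the column to float even after fillna(0). If you want whole numbers back, cast at the end:

df['sales'] = df['cust'].map(df1.set_index('cust')['sales']).fillna(0).astype(int)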