standardizing a value, iterating over a groupby object - python

I need some help iterating over a groupby object in Python. I have people nested under a single ID variable, and then under each one of those, they have balances for anywhere from 3 to 6 months. So the printed groupby object looks, for example, like this:
(1, Primary BP Product Rpt Month Closing Balance
0 1 CHECK 201708 10.04
1 1 CHECK 201709 11.1
2 1 CHECK 201710 11.16
3 1 CHECK 201711 11.22
4 1 CHECK 201712 11.28
5 1 CHECK 201801 11.34)
(2, Primary BP Product Rpt Month Closing Balance
79 2 CHECK 201711 52.42
85 2 CHECK 201712 31.56
136 2 CHECK 201801 99.91)
I want to create another column that standardizes the closing balance based on their first amount. So the ideal output would then look like this:
(1, Primary BP Product Rpt Month Closing Balance standardized
0 1 CHECK 201708 10.04 0
1 1 CHECK 201709 11.1 1.1
2 1 CHECK 201710 11.16 1.16
3 1 CHECK 201711 11.22 1.22
4 1 CHECK 201712 11.28 1.28
5 1 CHECK 201801 11.34 1.34)
(2, Primary BP Product Rpt Month Closing Balance standardized
79 2 CHECK 201711 52.42 0
85 2 CHECK 201712 31.56 -20.86
136 2 CHECK 201801 99.91 47.79)
I just can't quite figure out how to write a for loop (or find another way) that iterates within the groups of a groupby object, taking the first closing balance and subtracting it from each closing balance to create a difference score.

I solved it! Only two weeks later. Did it without the use of a groupby object. Here is how:
bpid = []
diffs = []
# These two lines were just a bit of cleaning needed to make the vals numeric
data['Closing Balance'] = data['Closing Balance'].str.replace(",", "")
data['Closing Balance'] = pd.to_numeric(data['Closing Balance'])
# Create a new variable in monthly_data that simply shows the increase in closing balance for each month,
# setting the first month to 0
for index, row in data.iterrows():
    bp = row['Primary BP']
    if bp not in bpid:
        # First record for this person: remember their starting balance
        bpid.append(bp)
        first = row['Closing Balance']
    bal = row['Closing Balance']
    diff = round(bal - first, 2)
    diffs.append(diff)
# Just checking to make sure there are the right number of values. Same as data, so good to go
print(len(diffs))
# Convert my list of differences in closing balance to a series object, and merge with the monthly_data
se = pd.Series(diffs)
data['balance increase'] = se.values
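For reference, the same difference score can also be computed without an explicit loop, using groupby and transform (a sketch; the grouping column name 'Primary BP' is taken from the printed output above):
# Subtract each person's first closing balance from every one of their balances
data['balance increase'] = (
    data['Closing Balance']
    - data.groupby('Primary BP')['Closing Balance'].transform('first')
).round(2)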

Related

Pandas/Python add a row based on condition

YY_MM_CD customerid pol_no type WE WP
2019-07 15680 1313145 new 3 89
2020-01 14672 1418080 renwd -8 223
2019-01 15681 1213143 new 4 8
2019-01 15683 1213344 new -6 3
2019-03 14678 1418280 renwd -66 -7
Now, I have some x years of data and I am taking a snapshot of 2019-20. Suppose a customer in this snapshot paid a premium on 01/11/2019 but didn't pay a premium on 01/11/2020, so that record will not be in the data. I need to create a dummy record for that customer, like this for customer id 15681, keeping their WE and WP as 0 0 since the customer didn't pay:
YY_MM_CD customerid pol_no type WE WP
2019-07 15680 1313145 new 3 89
2020-01 14672 1418080 renwd -8 223
2019-01 15681 1213143 new 4 8
2020-01 15681 1213143 new 0 0
2019-03 14678 1418280 renwd -66 -7
Don't create a dummy data point. Write the expiration date next to each customer id. Then, when accessing the data, just check whether the current date is before the expiration date.
Simpler and cleaner
If you would like to do what you asked (add a row or column based on a condition):
You need to group the customers
Use a lambda function to add your condition
For example.
new_df = pd.DataFrame()
df = YOURDATA
groups = df.groupby("customerid")
for customer_id, group in groups:  # iterating a groupby yields (key, sub-DataFrame) pairs
    if len(group) < 2:  # your condition
        df2 = pd.DataFrame( ADD YOUR DATA HERE )
        new_df = new_df.append(df2, ignore_index=True)  # assign the result back (newer pandas: use pd.concat instead)
At the end you can combine new_df and df with concat: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
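For example, that final step could be (a minimal sketch, assuming new_df was filled in the loop above):
# Stack the generated dummy rows under the original data
combined = pd.concat([df, new_df], ignore_index=True)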

Rolling sum based on dates, adding in conditions that actively update values in Pandas Dataframe if met?

I am calculating rolling last 180 day sales totals by ID in Python using Pandas and need to be able to update the last 180 day cumulative sales column if a user hits a certain threshold. For example, if someone reaches $100 spent cumulatively in the last 180 days, their cumulative spend for that day should reflect them reaching that level and effectively "redeeming" that $100, leaving them only with the excess from the last visit as progress towards their next $100 hit. (See the example below)
I also need to create a separate data frame during this process containing only the dates & user_ids for when the $100 is met to keep track of how many times the threshold has been met across all users.
I was thinking somehow I could use apply with conditional statements, but was not sure exactly how it would work as the data frame needs to be updated on the fly to have the rolling sums for later dates be calculated taking into account this updated total. In other words, the cumulative sums for dates after they hit the threshold need to be adjusted for the fact that they "redeemed" the $100.
This is what I have so far that gets the rolling cumulative sum by user. I don't know if it's possible to chain conditional methods with apply to this, or what the best way forward is.
order_data['rolling_sales_180'] = order_data.groupby('user_id').rolling(window='180D', on='day')['sales'].sum().reset_index(drop=True)
See the below example of expected results. In row 6, the user reaches $120, crossing the $100 threshold, but the $100 is subtracted from his cumulative sum as of that date and he is left with $20 as of that date because that was the amount in excess of the $100 threshold that he spent on that day. He then continues to earn cumulatively on this $20 for his subsequent visit within 180 days. A user can go through this process many times, earning many rewards over different 180 day periods.
print(order_data)
day user_id sales \
0 2017-08-10 1 10
1 2017-08-22 1 10
2 2017-08-31 1 10
3 2017-09-06 1 10
4 2017-09-19 1 10
5 2017-10-16 1 30
6 2017-11-28 1 40
7 2018-01-22 1 10
8 2018-03-19 1 10
9 2018-07-25 1 10
rolling_sales_180
0 10
1 20
2 30
3 40
4 50
5 80
6 20
7 30
8 40
9 20
Additionally, as mentioned above, I need a separate data frame to be created throughout this process with the day, user_id, sales, and rolling_sales_180 that only includes all the days during which the $100 threshold was met in order to count the number of times this goal is reached. See below:
print(threshold_reached)
day user_id sales rolling_sales_180
0 2017-11-28 1 40 120
.
.
.
If I understand your question correctly, the following should work for you:
def groupby_rolling(grp_df):
    df = grp_df.set_index("day")
    # 180-day rolling sum of sales for this user
    cum_sales = df.rolling("180D")["sales"].sum()
    # number of full $100 milestones reached, and the leftover progress toward the next one
    hundreds = (cum_sales // 100).astype(int)
    progress = cum_sales % 100
    df["rolling_sales_180"] = cum_sales
    df["progress"] = progress
    df["milestones"] = hundreds
    return df

result = df.groupby("user_id").apply(groupby_rolling)
Output of this is (for your provided sample):
user_id sales rolling_sales_180 progress milestones
user_id day
1 2017-08-10 1 10 10.0 10.0 0
2017-08-22 1 10 20.0 20.0 0
2017-08-31 1 10 30.0 30.0 0
2017-09-06 1 10 40.0 40.0 0
2017-09-19 1 10 50.0 50.0 0
2017-10-16 1 30 80.0 80.0 0
2017-11-28 1 40 120.0 20.0 1
2018-01-22 1 10 130.0 30.0 1
2018-03-19 1 10 90.0 90.0 0
2018-07-25 1 10 20.0 20.0 0
What groupby(...).apply(...) does is apply the provided function to each group of the original df. In this case, I've encapsulated your complex logic, which isn't currently possible with a straightforward groupby-rolling operation, in a simple-to-parse function.
The function should hopefully be self-documenting by how I named variables, but I'd be happy to add comments if you'd like.
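If you also need the separate threshold_reached frame, one possible follow-up sketch (building on the result frame above; this filter is an assumption, not part of the original answer) is to flag rows where the running milestone count increases for a user:
# Rows where the milestone count went up compared with the previous row for that
# user are the days a new $100 threshold was reached
milestone_jump = result.groupby(level="user_id")["milestones"].diff()
crossed = milestone_jump.fillna(result["milestones"]) > 0
# drop the duplicated user_id column before pulling user_id/day out of the index
threshold_reached = result[crossed].drop(columns="user_id").reset_index()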

Separate Pandas DataFrame into sections between rows that satisfy a condition

I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split my trip into two trips at this point. That got me stuck though. I can't seem to find a good way of doing this.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
    # split trip at too long delta_t(s)
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
IIUC, there is no need for a loop. The following creates a new column called new_TripID based on 2 conditions: that the original TripID changes from one row to the next, or that the difference in your time column is greater than one.
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking about, it's unnecessary.
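If you then want to restore the NaN delta_t at the end of each new trip, a small sketch building on the new_TripID column (this line is an assumption, not part of the answer above):
# Time of the next record within the same (new) trip minus the current time;
# the last record of each new trip has no successor and becomes NaN
df['delta_t'] = df.groupby('new_TripID')['time'].shift(-1) - df['time']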

Aggregation within an index hierarchy

Currently, I have a dataframe with an index hierarchy for monthly cohorts. Here is how I grouped them:
grouped = dfsort.groupby(['Cohort','Lifetime_Revenue'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
Which outputs:
Cohort Lifetime_Revenue Customer_ID
2014-01 149.9 1
2014-02 299.9 1
2014-03 269.91 1
329.89 1
899.88 1
2014-04 299.9 1
674.91 2
2014-05 899.88 1
2014-06 824.89 1
And so on.
I was looking to get the total sum of the lifetime revenue for each cohort, as well as the total number of users in each cohort.
Basically, I want to turn it back into a regular, flat DataFrame.
Anyone got any thoughts on this?
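Not a complete answer, but one possible sketch, assuming dfsort has one row per customer with the Cohort, Lifetime_Revenue and Customer_ID columns used above:
# Total revenue and distinct customers per cohort, returned as a flat frame
cohort_totals = dfsort.groupby('Cohort').agg(
    total_revenue=('Lifetime_Revenue', 'sum'),
    total_customers=('Customer_ID', 'nunique'),
).reset_index()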

Pandas: groupby and get median value by month?

I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
the median value for each organisation in each month of the year
X for each organisation, where X is the percentage difference between the lowest median monthly value, and the highest median value.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to @EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?
For your first error, you can pass a list of the column names or the columns themselves; you passed a list of names, and 'date.month' doesn't exist as a column, so you want:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max for a specific index level so:
((ave.max(level=0) - ave.min(level=0))/ave.max(level=0)) * 100
should give you what you want.
This calculates the difference between the min and max value for each organisation, divides by the max at that level, and creates the percentage by multiplying by 100.
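Note that newer pandas versions no longer accept the level= argument on max/min, so the equivalent calculation (same logic, just a sketch) goes through a groupby on the first index level:
# Same percentage range per organisation using groupby on index level 0
grp = ave.groupby(level=0)
pct_diff = (grp.max() - grp.min()) / grp.max() * 100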
