I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split the trip into two trips at this point. That's where I got stuck, though: I can't seem to find a good way of doing this.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
    # split the trip at any delta_t that is too long
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
IIUC, there is no need for a loop. The following creates a new column called new_TripID based on two conditions: that the original TripID changes from one row to the next, or that the difference in your time column is greater than one:
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking about, it's unnecessary.
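Putting the pieces together, a minimal runnable sketch (the sample frame is trimmed to the TripID and time columns from the question; the final delta_t recomputation is my own addition, not part of the answer above):

```python
import pandas as pd

df = pd.DataFrame({
    "TripID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "time":   [74, 75, 76, 81, 82, 83, 64, 65, 66],
})

# New trip whenever the TripID changes or the time gap exceeds one unit
df["new_TripID"] = (
    (df["TripID"] != df["TripID"].shift()) | (df["time"].diff() > 1)
).cumsum()

# Recompute delta_t per new trip; the last row of each trip becomes NaN
df["delta_t"] = -df.groupby("new_TripID")["time"].diff(-1)
```

The per-group `diff(-1)` also takes care of the trailing NaNs, so the separate `df.loc[...] = np.nan` step from the question is no longer needed.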
I am currently working with a data stream that updates every 30 seconds with highway probe data. The database needs to aggregate the incoming data and provide a 15-minute total. The issue I am encountering is trying to sum specific columns while matching keys.
Current_DataFrame:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count laneClass2Count
1 1 GOOD 10 55 5 5
1 2 GOOD 5 57 3 2
2 1 GOOD 7 45 4 3
New_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count laneClass2Count
1 1 BAD 7 59 6 1
1 2 GOOD 4 64 2 2
2 1 BAD 5 63 3 2
Goal_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count laneClass2Count
1 1 BAD 17 59 11 6
1 2 GOOD 9 64 5 4
2 1 BAD 12 63 7 5
The goal is to match the dataframes on uuid and lane-Number, then take the New_Dataframe values for lane-Status and lane-Speed, and sum the lane-Volume, lane-Class1Count and laneClass2Count columns together. I want to keep all the new incoming data, unless it is aggregative (e.g. the number of cars passing the road probe), in which case I want to sum it together.
I found a solution after some more digging.
df = pd.concat([new_dataframe, current_dataframe], ignore_index=True)
df = df.groupby(["uuid", "lane-Number"]).agg(
    {
        "lane-Status": "first",
        "lane-Volume": "sum",
        "lane-Speed": "first",
        "lane-Class1Count": "sum",
        "laneClass2Count": "sum",
    })
By concatenating the current_dataframe onto the back of the new_dataframe I can use the first aggregation option to get the newest data, and then sum the necessary rows.
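For reference, a self-contained sketch with the sample data from the question (the variable names new and current are mine):

```python
import pandas as pd

current = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["GOOD", "GOOD", "GOOD"],
    "lane-Volume": [10, 5, 7], "lane-Speed": [55, 57, 45],
    "lane-Class1Count": [5, 3, 4], "laneClass2Count": [5, 2, 3],
})
new = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["BAD", "GOOD", "BAD"],
    "lane-Volume": [7, 4, 5], "lane-Speed": [59, 64, 63],
    "lane-Class1Count": [6, 2, 3], "laneClass2Count": [1, 2, 2],
})

# New data first, so that "first" picks the freshest status/speed;
# groupby preserves row order within each group
combined = pd.concat([new, current], ignore_index=True)
goal = combined.groupby(["uuid", "lane-Number"], as_index=False).agg({
    "lane-Status": "first",
    "lane-Volume": "sum",
    "lane-Speed": "first",
    "lane-Class1Count": "sum",
    "laneClass2Count": "sum",
})
```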
I am facing a weird scenario.
I have a data frame containing the 3 largest scores for each unique id, like this:
id rid code score
1 9 67 43
1 8 87 22
1 4 32 20
2 3 56 43
3 10 22 100
3 5 67 50
Here the id column can repeat across rows, while each row is distinct.
I want to make my data frame like this:
id first_code second_code third_code
1 67 87 32
2 56 none none
3 22 67 none
So I have built a dataframe showing the top 3 scores per id. If there are not three values, I take the top 2, or the single value that exists. Depending on the score, I want to rearrange the code column into three different columns: first_code holds the code with the highest score, second_code the second highest, and third_code the third highest. Where no value exists, I leave the cell blank.
Kindly help me to solve this.
Use GroupBy.cumcount as a counter, create a MultiIndex, and reshape with Series.unstack:
df = df.set_index(['id',df.groupby('id').cumcount()])['code'].unstack()
df.columns=['first_code', 'second_code', 'third_code']
df = df.reset_index()
print (df)
id first_code second_code third_code
0 1.0 67.0 87.0 32.0
1 2.0 56.0 NaN NaN
2 3.0 22.0 67.0 NaN
Btw, cumcount should also be used in the earlier step that filters the top 3 values.
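Putting both steps together on the question's data (a sketch; the sort/cumcount filter implements the top-3 selection mentioned above):

```python
import pandas as pd

df = pd.DataFrame({
    "id":    [1, 1, 1, 2, 3, 3],
    "rid":   [9, 8, 4, 3, 10, 5],
    "code":  [67, 87, 32, 56, 22, 67],
    "score": [43, 22, 20, 43, 100, 50],
})

# Order rows by score within each id, then keep at most the top 3
df = df.sort_values(["id", "score"], ascending=[True, False])
df = df[df.groupby("id").cumcount() < 3]

# The per-id counter becomes the column position after unstacking
out = df.set_index(["id", df.groupby("id").cumcount()])["code"].unstack()
out.columns = ["first_code", "second_code", "third_code"]
out = out.reset_index()
```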
I am calculating rolling last 180 day sales totals by ID in Python using Pandas and need to be able to update the last 180 day cumulative sales column if a user hits a certain threshold. For example, if someone reaches $100 spent cumulatively in the last 180 days, their cumulative spend for that day should reflect them reaching that level and effectively "redeeming" that $100, leaving them only with the excess from the last visit as progress towards their next $100 hit. (See the example below)
I also need to create a separate data frame during this process containing only the dates & user_ids for when the $100 is met to keep track of how many times the threshold has been met across all users.
I was thinking somehow I could use apply with conditional statements, but was not sure exactly how it would work as the data frame needs to be updated on the fly to have the rolling sums for later dates be calculated taking into account this updated total. In other words, the cumulative sums for dates after they hit the threshold need to be adjusted for the fact that they "redeemed" the $100.
This is what I have so far that gets the rolling cumulative sum by user. I don't know if its possible to chain conditional methods with apply to this or what the best way forward is.
order_data['rolling_sales_180'] = order_data.groupby('user_id').rolling(window='180D', on='day')['sales'].sum().reset_index(drop=True)
See the below example of expected results. In row 6, the user reaches $120, crossing the $100 threshold, but the $100 is subtracted from his cumulative sum as of that date and he is left with $20 as of that date because that was the amount in excess of the $100 threshold that he spent on that day. He then continues to earn cumulatively on this $20 for his subsequent visit within 180 days. A user can go through this process many times, earning many rewards over different 180 day periods.
print(order_data)
day user_id sales \
0 2017-08-10 1 10
1 2017-08-22 1 10
2 2017-08-31 1 10
3 2017-09-06 1 10
4 2017-09-19 1 10
5 2017-10-16 1 30
6 2017-11-28 1 40
7 2018-01-22 1 10
8 2018-03-19 1 10
9 2018-07-25 1 10
rolling_sales_180
0 10
1 20
2 30
3 40
4 50
5 80
6 20
7 30
8 40
9 20
Additionally, as mentioned above, I need a separate data frame to be created throughout this process with the day, user_id, sales, and rolling_sales_180 that only includes all the days during which the $100 threshold was met in order to count the number of times this goal is reached. See below:
print(threshold_reached)
day user_id sales rolling_sales_180
0 2017-11-28 1 40 120
.
.
.
If I understand your question correctly, the following should work for you:
def groupby_rolling(grp_df):
    df = grp_df.set_index("day")
    cum_sales = df.rolling("180D")["sales"].sum()
    hundreds = (cum_sales // 100).astype(int)
    progress = cum_sales % 100
    df["rolling_sales_180"] = cum_sales
    df["progress"] = progress
    df["milestones"] = hundreds
    return df
result = df.groupby("user_id").apply(groupby_rolling)
Output of this is (for your provided sample):
user_id sales rolling_sales_180 progress milestones
user_id day
1 2017-08-10 1 10 10.0 10.0 0
2017-08-22 1 10 20.0 20.0 0
2017-08-31 1 10 30.0 30.0 0
2017-09-06 1 10 40.0 40.0 0
2017-09-19 1 10 50.0 50.0 0
2017-10-16 1 30 80.0 80.0 0
2017-11-28 1 40 120.0 20.0 1
2018-01-22 1 10 130.0 30.0 1
2018-03-19 1 10 90.0 90.0 0
2018-07-25 1 10 20.0 20.0 0
What groupby(...).apply(...) does is apply the provided function to each group of the original df. In this case, I've encapsulated your complex logic, which is currently not possible to express as a straightforward groupby-rolling operation, in a simple-to-parse basic function.
The function should hopefully be self-documenting by how I named variables, but I'd be happy to add comments if you'd like.
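For reference, a runnable version with the sample data (I drop the user_id column inside the function so reset_index doesn't clash with the group key; note that progress is the raw rolling sum modulo 100, which matches the sample output but is not a full simulation of the question's redemption logic):

```python
import pandas as pd

order_data = pd.DataFrame({
    "day": pd.to_datetime([
        "2017-08-10", "2017-08-22", "2017-08-31", "2017-09-06", "2017-09-19",
        "2017-10-16", "2017-11-28", "2018-01-22", "2018-03-19", "2018-07-25",
    ]),
    "user_id": [1] * 10,
    "sales": [10, 10, 10, 10, 10, 30, 40, 10, 10, 10],
})

def groupby_rolling(grp_df):
    # Index by day so the time-based rolling window works
    df = grp_df.set_index("day").drop(columns="user_id")
    cum_sales = df.rolling("180D")["sales"].sum()
    df["rolling_sales_180"] = cum_sales
    df["progress"] = cum_sales % 100            # remainder toward the next $100
    df["milestones"] = (cum_sales // 100).astype(int)
    return df

result = order_data.groupby("user_id").apply(groupby_rolling)

# Rows on which at least one $100 milestone is active in the window
threshold_reached = result[result["milestones"] > 0].reset_index()
```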
I have a dataframe, with recordings of statistics in multiple columns.
I have a list of the column names: stat_columns = ['Height', 'Speed'].
I want to combine the data to get one row per id.
The data comes sorted with the newest records on the top. I want the most recent data, so I must use the first value of each column, by id.
My dataframe looks like this:
Index      id  Height  Speed
0      100007            8.3
1      100007      54
2      100007            8.6
3      100007      52
4      100035      39
5      100014      44
6      100035            5.6
And I want it to look like this:
Index id Height Speed
0 100007 54 8.3
1 100014 44
2 100035 39 5.6
I have tried a simple groupby myself:
df_stats = df_path.groupby(['id'], as_index=False).first()
But this seems to only give me a row with the first statistic found.
Your solution works for me; perhaps it is necessary to replace empty values with NaNs first:
df_stats = df_path.replace('',np.nan).groupby('id', as_index=False).first()
print (df_stats)
id Index Height Speed
0 100007 0 54.0 8.3
1 100014 5 44.0 NaN
2 100035 4 39.0 5.6
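For reference, a self-contained version, assuming the blanks in the question's table are empty strings and omitting the Index column (if the blanks are already NaN, the replace step is unnecessary):

```python
import numpy as np
import pandas as pd

df_path = pd.DataFrame({
    "id":     [100007, 100007, 100007, 100007, 100035, 100014, 100035],
    "Height": ["", 54, "", 52, 39, 44, ""],
    "Speed":  [8.3, "", 8.6, "", "", "", 5.6],
})

# first() skips NaNs, so empty strings must become NaN
# for it to look further down each group
df_stats = df_path.replace("", np.nan).groupby("id", as_index=False).first()
```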
I have a dataframe that goes like this:
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show number of unique entries in the committer_id column. Columns id and rev are actually not used for the moment.
I would like to have a result as the following
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by number of entries in each time period and not by unique entries. I am using the following :
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
I can't figure out how to use np.unique here. Any ideas?
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
Should work for you. Or df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your try with np.unique didn't work because it returns an array of unique items, while the result for agg must be a scalar. So .aggregate(lambda x: len(np.unique(x))) would probably work too.
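A minimal runnable sketch (the index values are invented to cover two quarters):

```python
import pandas as pd

df = pd.DataFrame(
    {"committer_id": [1, 2, 3, 2, 2, 2]},
    index=pd.to_datetime([
        "1996-07-03", "1996-07-03", "1996-11-18",
        "1996-11-21", "1996-11-21", "1996-11-21",
    ]),
)
df.index.name = "date"

# Distinct committers per calendar quarter
out = df.groupby(pd.Grouper(freq="Q-DEC"))["committer_id"].nunique()
```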