How to create a rank from a df with Pandas - python

I have a table that is chronologically sorted, with a state and an amount for each date. The table looks as follows:
Date        State  Amount
01/01/2022  1      1233.11
02/01/2022  1      16.11
03/01/2022  2      144.58
04/01/2022  1      298.22
05/01/2022  2      152.34
06/01/2022  2      552.01
07/01/2022  3      897.25
To generate the dataset:
pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})
I want to add a column called "Rank" that counts, for each state, how many separate runs of consecutive rows with that state have appeared so far. If state 1 appears twenty times in a row, all of those rows get Rank 1. If another state then appears and state 1 comes back later, the rows of that new run of state 1 get Rank 2. That is, if State is 1 for two days in a row, Rank is 1; then another state appears; when State 1 appears again, Rank increments to 2. An example would be as follows:
Date        State  Amount   Rank
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58   1
04/01/2022  1      298.22   2
05/01/2022  2      152.34   2
06/01/2022  2      552.01   2
07/01/2022  3      897.25   1
This could also be understood as follows:
Date        State  Amount   Rank_State1  Rank_State2  Rank_State3
01/01/2022  1      1233.11  1
02/01/2022  1      16.11    1
03/01/2022  2      144.58                1
04/01/2022  1      298.22   2
05/01/2022  2      152.34                2
06/01/2022  2      552.01                2
07/01/2022  3      897.25                             1
Does anyone know how to build that Rank column starting from the previous table?

Your problem is in the general category of state change accumulation, which suggests an approach using cumulative sums and booleans.
Here's one way you can do it - maybe not the most elegant, but I think it does what you need
import pandas as pd
someDF = pd.DataFrame({'date': ["01/08/2022","02/08/2022","03/08/2022","04/08/2022","05/08/2022","06/08/2022","07/08/2022","08/08/2022","09/08/2022","10/08/2022","11/08/2022"], 'state' : [1,1,2,2,3,1,1,2,2,2,1],'amount': [144,142,166,144,142,166,144,142,166,142,166]})
# Concatenate all states seen so far into one string per row
someDF["StateAccumulator"] = someDF["state"].apply(str).cumsum()
def groupOccurrence(someRow):
    sa = someRow["StateAccumulator"]
    s = str(someRow["state"])
    # Count the runs of the current state inside the accumulated string
    stateRank = len("".join([i if i != '' else " " for i in sa.split(s)]).split())\
        + int((sa.split(s)[0] == '') or (int(sa.split(s)[-1] == '')) and sa[-1] != s)
    return stateRank
someDF["Rank"] = someDF.apply(lambda x: groupOccurrence(x), axis=1)
If I understand correctly, this is the result you want - "Rank" is intended to represent the number of times a given set of contiguous states have appeared:
date state amount StateAccumulator Rank
0 01/08/2022 1 144 1 1
1 02/08/2022 1 142 11 1
2 03/08/2022 2 166 112 1
3 04/08/2022 2 144 1122 1
4 05/08/2022 3 142 11223 1
5 06/08/2022 1 166 112231 2
6 07/08/2022 1 144 1122311 2
7 08/08/2022 2 142 11223112 2
8 09/08/2022 2 166 112231122 2
9 10/08/2022 2 142 1122311222 2
10 11/08/2022 1 166 11223112221 3
Notes:
Instead of the somewhat hacky string cumsum method I'm using here, you could probably use a list accumulation function and then use a pandas split-apply-combine method to do the counting in the lambda function.
You would then build a state change boolean and do a cumsum on that boolean, filtered/grouped on the state value (so, how many state changes do we have for any given state).
The state change boolean is done like this:
someDF["StateChange"] = someDF["state"] != someDF["state"].shift()
So for a given state at a given row, you'd count how many state changes into that state had occurred up to and including that row.
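The idea in the notes above can be sketched without the string accumulator. This is a minimal sketch rather than the code from the answer: it assumes the same example frame, and the grouped cumulative sum is one reading of "cumsum on the state change boolean, grouped on the state value". It also sidesteps the string concatenation, which would presumably become ambiguous for state values with more than one digit.

```python
import pandas as pd

someDF = pd.DataFrame({
    "date": ["01/08/2022", "02/08/2022", "03/08/2022", "04/08/2022",
             "05/08/2022", "06/08/2022", "07/08/2022", "08/08/2022",
             "09/08/2022", "10/08/2022", "11/08/2022"],
    "state": [1, 1, 2, 2, 3, 1, 1, 2, 2, 2, 1],
    "amount": [144, 142, 166, 144, 142, 166, 144, 142, 166, 142, 166],
})

# 1 on the first row of each run of identical consecutive states, else 0
state_change = (someDF["state"] != someDF["state"].shift()).astype(int)

# Within each state value, count how many runs have started so far
someDF["Rank"] = state_change.groupby(someDF["state"]).cumsum()

print(someDF)
```

On the example data this gives the same Rank column as the table above (1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3).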

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

I have a dataset with a few columns. I would like to slice the data frame by finding the string "M22" in the column "Run number". I am able to do so. However, I would also like to count the number of unique run numbers that contain the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
Here are the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the distinct run numbers that contain "M22", so I need to get 3 as output.
Use the following approach with pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a faster alternative using the numpy.char.find function (with numpy imported as np):
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3
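For reference, a self-contained sketch of the first approach. It rebuilds only the RUN_NUMBER column from the example table in the question; `regex=False` and `nunique()` are substitutions made here for illustration (the answer uses `unique().size`), and the variable names are hypothetical.

```python
import pandas as pd

# Only the RUN_NUMBER column from the question's example table
df = pd.DataFrame({
    "RUN_NUMBER": ["335991M", "M220621", "M220678", "510091M", "M220500",
                   "335991M", "M220621", "M220678", "335991M", "M220621",
                   "M220678"],
})

# Boolean mask of rows whose run number contains the literal text "M22"
mask = df["RUN_NUMBER"].str.contains("M22", regex=False)

# Number of distinct run numbers among the matching rows
n_unique = df.loc[mask, "RUN_NUMBER"].nunique()
print(n_unique)
```

Here the distinct matching values are M220621, M220678 and M220500, so the count is 3.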

Separate Pandas DataFrame into sections between rows that satisfy a condition

I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split my trip into two trips at this point. That got me stuck, though. I can't seem to find a good way of doing this.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
    # split trip at too long delta_t(s)
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
IIUC, there is no need for a loop. The following creates a new column called new_TripID based on two conditions: either the original TripID changes from one row to the next, or the difference in your time column is greater than one. A cumulative sum over that boolean then numbers the resulting segments:
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking about, it's unnecessary.
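A runnable sketch of the approach above, using the example data from the question. The last step, recomputing delta_t inside the new trips with a grouped `shift(-1)`, is an addition made here and not part of the answer; it assumes delta_t is simply the time until the next record of the same (new) trip.

```python
import pandas as pd

df = pd.DataFrame({
    "TripID": [1, 1, 1, 1, 1, 1, 2, 2, 2],
    "Lat": [53.55, 53.58, 53.60, 53.60, 53.58, 53.59, 52.01, 52.34, 52.33],
    "Lon": [9.99, 9.99, 9.98, 9.98, 9.99, 9.97, 10.04, 10.05, 10.07],
    "time": [74, 75, 76, 81, 82, 83, 64, 65, 66],
})

# A new segment starts when the TripID changes or the time gap exceeds 1
starts_new_trip = (df["TripID"] != df["TripID"].shift()) | (df["time"].diff() > 1)
df["new_TripID"] = starts_new_trip.cumsum()

# Time until the next record of the same new trip; the last record of
# each new trip has no successor and therefore gets NaN
df["delta_t"] = df.groupby("new_TripID")["time"].shift(-1) - df["time"]

print(df)
```

With this, the record that previously carried the long delta_t becomes the last record of new trip 1, so its delta_t turns into NaN.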

Unique values python

I am trying to look through a column and, if a value in that column is unique, enter 1; if it isn't, the entry should just become NaN. My dataframe looks like this:
Street Number
0 1312 Oak Avenue 1
1 14212 central Ave 2
2 981 franklin way 1
the code I am using to put the number 1 next to unique values is as follows:
df.loc[(df['Street'].unique()), 'Unique'] = '1'
However, when I run this I get the error KeyError: "not in index", and I don't know why. I tried running this on the Number column and I get my desired result, which is:
Street Number Unique
0 1312 Oak Avenue 1 NaN
1 14212 central Ave 2 1
2 981 franklin way 1 1
So the column that marks the unique rows is called Unique: it puts a 1 next to the rows that are unique and NaN next to the ones that are duplicates. In this case there are two 1s in Number, so the first becomes NaN and the second gets a 1, and since there is only one 2, that row gets a 1 as well. I just don't know why I am getting that error for the Street column.
That's not really producing your desired result. The output of df['Number'].unique(), array([1, 2], dtype=int64), just happened to be in the index. You'd encounter the same issue on that column if Number instead was [3, 4, 3], say.
For what you're looking for, selecting where not duplicated, or where you have left after dropping duplicates, might be better than unique:
df.loc[~(df['Number'].duplicated()), 'Unique'] = 1
df
Out[51]:
Street Number Unique
0 1312 Oak Avenue 1 1.0
1 14212 central Ave 2 1.0
2 981 franklin way 1 NaN
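Or, keeping the last occurrence instead of the first (note the .index, so that .loc receives row labels rather than the column's values):
df.loc[df['Number'].drop_duplicates(keep='last').index, 'Unique'] = 1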
df
Out[63]:
Street Number Unique
0 1312 Oak Avenue 1 NaN
1 14212 central Ave 2 1.0
2 981 franklin way 1 1.0
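To make the difference concrete, here is a small sketch on the question's data. The column names `Unique_first`, `Unique_last` and `Unique_street` are hypothetical and used only to show the variants side by side. Because `duplicated()` returns a boolean mask aligned with the rows, `.loc` never has to interpret column values as index labels, which is presumably what caused the KeyError on the Street column.

```python
import pandas as pd

df = pd.DataFrame({
    "Street": ["1312 Oak Avenue", "14212 central Ave", "981 franklin way"],
    "Number": [1, 2, 1],
})

# Mark the first occurrence of each Number
df.loc[~df["Number"].duplicated(keep="first"), "Unique_first"] = 1

# Mark the last occurrence of each Number
df.loc[~df["Number"].duplicated(keep="last"), "Unique_last"] = 1

# The same pattern works on string columns such as Street
df.loc[~df["Street"].duplicated(), "Unique_street"] = 1

print(df)
```

Rows that are not selected by the mask are left as NaN in the new column.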

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer and learning python (+pandas) and hope I can explain this well enough. I have a large time series pd dataframe of over 3 million rows and initially 12 columns spanning a number of years. This covers people taking a ticket from different locations denoted by Id numbers(350 of them). Each row is one instance (one ticket taken).
I have searched many questions like counting records per hour per day and getting average per hour over several years. However, I run into the trouble of including the 'Id' variable.
I'm looking to get the mean value of people taking a ticket for each hour, for each day of the week (mon-fri) and per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue in a clear way. I am looking for the mean per hour per day per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model on these groups.
Any help would be appreciated and, if possible, an explanation of what I am doing wrong, either code-wise or in my approach.
Thanks in advance.
I have edited this to try make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per hour. And want to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years with 3 million rows, contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df = df.reset_index()
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations (select 'Count' so that only it is aggregated)
daily = df.groupby(['Id', 'date', 'dow', 'hour'])['Count'].sum()
# take the mean over all dates
mean_count = daily.groupby(level=['Id', 'dow', 'hour']).mean()
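Alternatively, you can group by the 'Id' column and then use resample with a sum aggregation. Note that resample no longer accepts how='sum' in current pandas versions; call .sum() on the resampler instead.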
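The two-step aggregation can be checked against the toy dataset from the question. This is a minimal sketch that assumes the toy columns (Date, Id, Dow, Hour, Count) exactly as posted; the names `per_day` and `mean_count` are hypothetical.

```python
import pandas as pd

toy = pd.DataFrame({
    "Date": ["12/12/2014"] * 5 + ["19/12/2014"] * 3 + ["26/12/2014"]
            + ["27/12/2014"] * 4 + ["04/01/2015"],
    "Id": [1234] * 14,
    "Dow": [0] * 9 + [1] * 5,
    "Hour": [9] * 8 + [10] + [11] * 5,
    "Count": [1] * 14,
})

# Step 1: total tickets per Id, calendar date, day of week and hour
per_day = toy.groupby(["Id", "Date", "Dow", "Hour"], as_index=False)["Count"].sum()

# Step 2: average those daily totals per Id, day of week and hour
mean_count = (
    per_day.groupby(["Id", "Dow", "Hour"])["Count"]
    .mean()
    .reset_index(name="Mean")
)

print(mean_count)
```

This reproduces the means from the question's expected table: 4 for (Dow 0, Hour 9), 1 for (Dow 0, Hour 10) and 2.5 for (Dow 1, Hour 11). Taking the mean directly on the per-row counts would always return 1, which is why the intermediate sum per date is needed.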

Column operations on pandas groupby object

I have a dataframe df that looks like this:
id Category Time
1 176 12 00:00:00
2 4956 2 00:00:00
3 583 4 00:00:04
4 9395 2 00:00:24
5 176 12 00:03:23
which is basically a set of ids and the category of item they used at a particular Time. I use df.groupby('id') and then I want to see whether they switched to a different category or used the same one, and assign True or False respectively (or NaN if that was the first item for that particular id). I also filtered the data to remove all the ids with only one Time.
For example one of the groups may look like
id Category Time
1 176 12 00:00:00
2 176 12 00:03:23
3 176 2 00:04:34
4 176 2 00:04:54
5 176 2 00:05:23
and I want to perform an operation to get
id Category Time Transition
1 176 12 00:00:00 NaN
2 176 12 00:03:23 False
3 176 2 00:04:34 True
4 176 2 00:04:54 False
5 176 2 00:05:23 False
I thought about doing an apply of some sorts to the Category column after groupby but I am having trouble figuring out the right function.
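You don't need a groupby here; you just need sort and shift.
import numpy as np
df = df.sort_values(['id', 'Time'])  # df.sort() was removed from pandas; sort_values replaces it
# object dtype so the column can hold True/False as well as NaN
df['Transition'] = (df.Category != df.Category.shift(1)).astype(object)
df.loc[df.id != df.id.shift(1), 'Transition'] = np.nan
I haven't tested this, but it should do the trick.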
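A small runnable check of the sort-and-shift idea. The rows for id 176 come from the example group in the question; the two rows for id 583 are hypothetical and are there only to show that the first row of each id is reset to NaN.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id": [176, 176, 176, 176, 176, 583, 583],
    "Category": [12, 12, 2, 2, 2, 4, 4],
    "Time": ["00:00:00", "00:03:23", "00:04:34", "00:04:54", "00:05:23",
             "00:00:04", "00:01:10"],  # id 583 rows are made up for illustration
})

df = df.sort_values(["id", "Time"]).reset_index(drop=True)

# True where the category differs from the previous row
changed = df["Category"] != df["Category"].shift()

# True on the first row of each id, which has no previous item to compare with
first_of_id = df["id"] != df["id"].shift()

# object dtype lets the column hold True/False and NaN side by side
df["Transition"] = changed.astype(object)
df.loc[first_of_id, "Transition"] = np.nan

print(df)
```

For id 176 this yields NaN, False, True, False, False, matching the desired output in the question.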
