Grouping by unique values in a Python pandas DataFrame

I have a dataframe that looks like this:
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show the number of unique entries in the committer_id column. Columns id and rev are not actually used for the moment.
I would like to have a result like the following:
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by the number of entries in each time period and not by unique entries. I am using the following:
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
Can't figure out how to use np.unique.
Any ideas, please.

df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
Should work for you. Or df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your try with np.unique didn't work because that returns an array of unique items. The result for agg must be a scalar. So .aggregate(lambda x: len(np.unique(x))) probably would work too.
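For completeness, a runnable sketch of that approach; the sample data below is made up and only mirrors the question's layout (a DatetimeIndex named 'date'):
import pandas as pd

df = pd.DataFrame(
    {"committer_id": [1, 2, 3, 2, 2]},
    index=pd.to_datetime(
        ["1996-07-03", "1996-07-03", "1996-11-18", "1996-11-21", "1996-11-21"]
    ),
)
df.index.name = "date"

# Number of distinct committers per calendar quarter
quarterly = df.groupby(pd.Grouper(freq="Q-DEC"))["committer_id"].nunique()
print(quarterly)
# date
# 1996-09-30    2
# 1996-12-31    2
# Freq: Q-DEC, Name: committer_id, dtype: int64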

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

I have a dataset with a few columns. I would like to slice the data frame by finding the string "M22" in the column "RUN_NUMBER". I am able to do so. However, I would like to count the number of unique rows that contained the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
Here are the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the strings/rows that contain "M22": so I need to get 3 as output.
Use the following approach with the pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a faster alternative using the numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3
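As a quick check, a self-contained sketch with toy data mirroring the question's RUN_NUMBER column:
import numpy as np
import pandas as pd

df = pd.DataFrame({"RUN_NUMBER": ["335991M", "M220621", "M220678", "510091M",
                                  "M220500", "335991M", "M220621", "M220678"]})

# Filter rows containing "M22", then count the distinct values
print(df[df["RUN_NUMBER"].str.contains("M22")]["RUN_NUMBER"].unique().size)       # 3

# NumPy alternative: look for the substring in the unique values only
print((np.char.find(df["RUN_NUMBER"].unique().astype(str), "M22") != -1).sum())   # 3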

Python Pandas Sum specific columns while matching keys

I am currently working with a data stream that updates every 30 seconds with highway probe data. The database needs to aggregate the incoming data and provide a 15-minute total. The issue I am encountering is trying to sum specific columns while matching keys.
Current_DataFrame:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count lane-Class2Count
1 1 GOOD 10 55 5 5
1 2 GOOD 5 57 3 2
2 1 GOOD 7 45 4 3
New_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count lane-Class2Count
1 1 BAD 7 59 6 1
1 2 GOOD 4 64 2 2
2 1 BAD 5 63 3 2
Goal_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count lane-Class2Count
1 1 BAD 17 59 11 6
1 2 GOOD 9 64 5 4
2 1 BAD 12 63 7 5
The goal is to match the dataframes on uuid and lane-Number, take the New_Dataframe values for lane-Status and lane-Speed, and sum the lane-Volume, lane-Class1Count and lane-Class2Count columns together. I want to keep all the new incoming data, unless it is additive (e.g. the number of cars passing the road probe), in which case I want to sum it with the existing value.
I found a solution after some more digging.
df = pd.concat([new_dataframe, current_dataframe], ignore_index=True)
df = df.groupby(["uuid", "lane-Number"]).agg(
    {
        "lane-Status": "first",
        "lane-Volume": "sum",
        "lane-Speed": "first",
        "lane-Class1Count": "sum",
        "lane-Class2Count": "sum",
    }
)
By concatenating the current_dataframe onto the back of the new_dataframe I can use the "first" aggregation option to get the newest data, and then sum the necessary rows.
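A self-contained sketch of that recipe with toy frames mirroring the question's tables (as_index=False is only there to keep uuid and lane-Number as regular columns):
import pandas as pd

current_df = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["GOOD", "GOOD", "GOOD"], "lane-Volume": [10, 5, 7],
    "lane-Speed": [55, 57, 45], "lane-Class1Count": [5, 3, 4], "lane-Class2Count": [5, 2, 3],
})
new_df = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["BAD", "GOOD", "BAD"], "lane-Volume": [7, 4, 5],
    "lane-Speed": [59, 64, 63], "lane-Class1Count": [6, 2, 3], "lane-Class2Count": [1, 2, 2],
})

# New data first, so "first" picks the latest status/speed; the counts are summed
combined = pd.concat([new_df, current_df], ignore_index=True)
result = combined.groupby(["uuid", "lane-Number"], as_index=False).agg({
    "lane-Status": "first",
    "lane-Volume": "sum",
    "lane-Speed": "first",
    "lane-Class1Count": "sum",
    "lane-Class2Count": "sum",
})
print(result)   # matches the Goal_Dataframe above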

Separate Pandas DataFrame into sections between rows that satisfy a condition

I have a DataFrame of several trips that looks kind of like this:
TripID Lat Lon time delta_t
0 1 53.55 9.99 74 1
1 1 53.58 9.99 75 1
2 1 53.60 9.98 76 5
3 1 53.60 9.98 81 1
4 1 53.58 9.99 82 1
5 1 53.59 9.97 83 NaN
6 2 52.01 10.04 64 1
7 2 52.34 10.05 65 1
8 2 52.33 10.07 66 NaN
As you can see, I have records of location and time, which all belong to some trip, identified by a trip ID. I have also computed delta_t as the time that passes until the entry that follows in the trip. The last entry of each trip is assigned NaN as its delta_t.
Now I need to make sure that the time step of my records is the same value across all my data. I've gone with one time unit for this example. For the most part the trips do fulfill this condition, but every now and then I have a single record, such as record no. 2, within an otherwise fine trip, that doesn't.
That's why I want to simply split the trip into two trips at this point. That got me stuck, though. I can't seem to find a good way of doing this.
To consider each trip by itself, I was thinking of something like this:
for key, grp in df.groupby('TripID'):
# split trip at too long delta_t(s)
However, the actual splitting within the loop is what I don't know how to do. Basically, I need to assign a new trip ID to every entry from one large delta_t to the next (or the end of the trip), or have some sort of grouping operation that can group between those large delta_t.
I know this is quite a specific problem. I hope someone has an idea how to do this.
I think the new NaNs, which would then be needed, can be neglected at first and easily added later with this line (which I know only works for ascending trip IDs):
df.loc[df['TripID'].diff().shift(-1) > 0, 'delta_t'] = np.nan
IIUC, there is no need for a loop. The following creates a new column called new_TripID based on two conditions: that the original TripID changes from one row to the next, or that the difference in your time column is greater than one:
df['new_TripID'] = ((df['TripID'] != df['TripID'].shift()) | (df.time.diff() > 1)).cumsum()
>>> df
TripID Lat Lon time delta_t new_TripID
0 1 53.55 9.99 74 1.0 1
1 1 53.58 9.99 75 1.0 1
2 1 53.60 9.98 76 5.0 1
3 1 53.60 9.98 81 1.0 2
4 1 53.58 9.99 82 1.0 2
5 1 53.59 9.97 83 NaN 2
6 2 52.01 10.04 64 1.0 3
7 2 52.34 10.05 65 1.0 3
8 2 52.33 10.07 66 NaN 3
Note that from your description and your data, it looks like you could really use groupby, and you should probably look into it for other manipulations. However, in the particular case you're asking for, it's unnecessary.
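If, as the question suggests, delta_t should then be recomputed so that the last row of every new trip is NaN again, one possible follow-up (a sketch that assumes df already carries the new_TripID column from above):
# Within each new trip: time difference to the next row, NaN on the trip's last row
df["delta_t"] = df.groupby("new_TripID")["time"].transform(lambda s: s.diff().shift(-1))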

Matching 'Date' dataframes in Pandas to enable joins/merging

I have two CSV files loaded into pandas DataFrames, each with a 'Date' column, which is my desired key for joining the two tables (my goal is to join the two CSVs by date and sum the matching values).
The issue is that despite sharing the same month-year format, my first CSV abbreviated the years, whereas my desired output would be mm-yyyy (for example, Aug-2012 as opposed to Aug-12).
csv1:
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
...
has 41 rows, i.e. 41 months' worth of data between Oct. 2012 and Feb. 2016
csv2:
0 Jan-2009 943690
1 Feb-2009 1062565
2 Mar-2009 210079
3 Apr-2009 -735286
4 May-2009 842933
5 Jun-2009 358691
6 Jul-2009 914953
7 Aug-2009 723427
8 Sep-2009 -837468
...
has 86 rows, i.e. 86 months' worth of data between Jan. 2009 and Feb. 2016
I tried initially to do something akin to a 'find and replace' function as one would in Excel.
I tried:
findlist = ['12','13','14','15','16']
replacelist = ['2012','2013','2014','2015','2016']

def findReplace(find, replace):
    s = csv1_df.read()
    s = s.replace(Date, replacement)
    csv1_dfc.write(s)

for item, replacement in zip(findlist, replacelist):
    s = s.replace(Date, replacement)
But I am getting a
NameError: name 's' is not defined
You can use to_datetime to transform to datetime format, and then strftime to adjust your format:
df['col_date'] = pd.to_datetime(df['col_date'], format="%b-%y").dt.strftime('%b-%Y')
Input:
col_date val
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
Output:
col_date val
0 Oct-2012 1154293
1 Nov-2012 885773
2 Dec-2012 -448704
3 Jan-2013 563679
4 Feb-2013 555394
5 Mar-2013 631974
6 Apr-2013 957395
7 May-2013 1104047
8 Jun-2013 693464
Note the lowercase y for a 2-digit year and the uppercase Y for a 4-digit year.
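To then complete the original goal (join on the now-identical date strings and sum the values), a sketch under the assumption that both frames are already loaded as csv1_df and csv2_df and use the hypothetical column names col_date and val:
import pandas as pd

# Normalise csv1's abbreviated years first (as above), then merge and sum
csv1_df["col_date"] = pd.to_datetime(csv1_df["col_date"], format="%b-%y").dt.strftime("%b-%Y")

merged = csv1_df.merge(csv2_df, on="col_date", suffixes=("_1", "_2"))
merged["val"] = merged["val_1"] + merged["val_2"]
merged = merged[["col_date", "val"]]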

Python replace all values in dataframe with values from other dataframe

I'm quite new to Python (and pandas) and have a replace task for a large DataFrame that I couldn't find a solution for.
So I have two DataFrames; one (df1) looks something like this:
Id Id Id
4954733 3929949 515674
2950086 1863885 4269069
1241018 3711213 4507609
3806276 2035233 4968071
4437138 1248817 1167192
5468160 4726010 2851685
1211786 2604463 5172095
2914539 5235788 4130808
4730974 5835757 1536235
2201352 5779683 5771612
3864854 4784259 2928288
The other DataFrame (df2) contains all the 'old' ids and the corresponding new ones (from 1 to 20,000) in the next column; it looks something like this:
Id Id_new
5774290 1
761000 2
3489755 3
1084156 4
2188433 5
3456900 6
4364416 7
3518181 8
3926684 9
5797492 10
4435820 11
What I would like to do is replace all the ids (in all columns) of df1 with the corresponding Id_new from df2, ideally without having to do a merge or join for each column, given the size of the dataset.
The result should look like this (df_new):
Id_new Id_new Id_new
8 12 22
16 9 8
21 25 10
10 15 13
29 6 4
22 7 22
30 3 3
11 31 29
32 29 27
12 3 4
14 6 24
Any tips would be great, thanks in advance!
I think you need replace with a Series created by set_index:
print (df1)
Id Id.1 Id.2
0 5774290 3929949 515674 <- first value changed so it matches data in df2
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288
df = df1.replace(df2.set_index('Id')['Id_new'])
print (df)
Id Id.1 Id.2
0 1 3929949 515674
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288
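A tiny self-contained version of that answer (made-up ids, just to show the mechanics):
import pandas as pd

df1 = pd.DataFrame({"Id": [5774290, 761000], "Id.1": [3489755, 1084156]})
df2 = pd.DataFrame({"Id": [5774290, 761000, 3489755, 1084156],
                    "Id_new": [1, 2, 3, 4]})

mapping = df2.set_index("Id")["Id_new"]   # old id -> new id
print(df1.replace(mapping))
#    Id  Id.1
# 0   1     3
# 1   2     4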
