Python Pandas Sum specific columns while matching keys - python

I am currently working with a data-stream that updates every 30 seconds with highway probe data. The data in the database needs to aggregate the incoming data and provide a 15 minute total. The issue I am encountering is trying to sum specific columns while matching keys.
Current_DataFrame:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count laneClass2Count
1 1 GOOD 10 55 5 5
1 2 GOOD 5 57 3 2
2 1 GOOD 7 45 4 3
New_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count laneClass2Count
1 1 BAD 7 59 6 1
1 2 GOOD 4 64 2 2
2 1 BAD 5 63 3 2
Goal_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count laneClass2Count
1 1 BAD 17 59 11 6
1 2 GOOD 9 64 5 4
2 1 BAD 12 63 7 5
The goal is to match the dataframes on the uuid and lane-Number, and then to take the New_Dataframe values for lane-Status and lane-Speed, and then sum the lane-Volume, lane-Class1Count and laneClass2Count together. I want to keep all the new incoming data, unless it is aggregative (i.e. Number of cars passing the road probe) in which case I want to sum it together.

I found a solution after some more digging.
df = pd.concat(["new_dataframe", "current_dataframe"], ignore_index=True)
df = df.groupby(["uuid", "lane-Number"]).agg(
{
"lane-Status": "first",
"lane-Volume": "sum",
"lane-Speed": "first",
"lane-Class1Count": "sum",
"lane-Class2Count": "sum"
})
By concatenating the current_dataframe onto the back of the new_dataframe I can use the first aggregation option to get the newest data, and then sum the necessary rows.

Related

group rows based on a string in a column in pandas and count the number of occurrence of unique rows that contained the string

I have a dataset with a few columns. I would like to slice the data frame with finding a string "M22" in the column "Run number". I am able to do so. However, I would like to count the number of unique rows that contained the string "M22".
Here is what I have done for the below table (example):
RUN_NUMBER DATE_TIME CULTURE_DAY AGE_HRS AGE_DAYS
335991M 6/30/2022 0 0 0
M220621 7/1/2022 1 24 1
M220678 7/2/2022 2 48 2
510091M 7/3/2022 3 72 3
M220500 7/4/2022 4 96 4
335991M 7/5/2022 5 120 5
M220621 7/6/2022 6 144 6
M220678 7/7/2022 7 168 7
335991M 7/8/2022 8 192 8
M220621 7/9/2022 9 216 9
M220678 7/10/2022 10 240 10
here is the results I got:
RUN_NUMBER
335991M 0
510091M 0
335992M 0
M220621 3
M220678 3
M220500 1
Now I need to count the strings/rows that contained "M22" : so I need to get 3 as output.
Use the following approach with pd.Series.unique function:
df[df['RUN_NUMBER'].str.contains("M22")]['RUN_NUMBER'].unique().size
Or a more faster alternative using numpy.char.find function:
(np.char.find(df['RUN_NUMBER'].unique().astype(str), 'M22') != -1).sum()
3

Matching 'Date' dataframes in Pandas to enable joins/merging

I have two csv files with pandas dataframes with a 'Date' column, which is my desired target to join the two tables (my goal is to join my two csvs by dates and merge matching dataframes by summing them).
The issue is that despite sharing the same month-year format, my first csv abbreviated the years, whereas my desired output would be mm-yyyy (for example, Aug-2012 as opposed to Aug-12).
csv1:
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
...
has 41 rows; i.e. 41 months worth of data between Oct. 12 - Feb. 16
csv2:
0 Jan-2009 943690
1 Feb-2009 1062565
2 Mar-2009 210079
3 Apr-2009 -735286
4 May-2009 842933
5 Jun-2009 358691
6 Jul-2009 914953
7 Aug-2009 723427
8 Sep-2009 -837468
...
has 86 rows; i.e. 41 months worth of data between Jan. 2009 - Feb. 2016
I tried initially to do something akin to a 'find and replace' function as one would in Excel.
I tried :
findlist = ['12','13','14','15','16']
replacelist = ['2012','2013','2014','2015','2016']
def findReplace(find, replace):
s = csv1_df.read()
s = s.replace(Date, replacement)
csv1_dfc.write(s)
for item, replacement in zip(findlist, replacelist):
s = s.replace(Date, replacement)
But I am getting a
NameError: name 's' is not defined
You can use to_datetime to transform to datetime format, and then strftime to adjust your format:
df['col_date'] = pd.to_datetime(df['col_date'], format="%b-%y").dt.strftime('%b-%Y')
Input:
col_date val
0 Oct-12 1154293
1 Nov-12 885773
2 Dec-12 -448704
3 Jan-13 563679
4 Feb-13 555394
5 Mar-13 631974
6 Apr-13 957395
7 May-13 1104047
8 Jun-13 693464
Output:
col_date val
0 Oct-2012 1154293
1 Nov-2012 885773
2 Dec-2012 -448704
3 Jan-2013 563679
4 Feb-2013 555394
5 Mar-2013 631974
6 Apr-2013 957395
7 May-2013 1104047
8 Jun-2013 693464
Note the lower case y for 2 digits year and upper case Y for 4 digits year.

Python replace all values in dataframe with values from other dataframe

I'm quite new to python (and pandas) and a have a replace task for a large dataframe i couldn't find a solution for.
So i have two dataframes, one (df1) which looks something like this:
Id Id Id
4954733 3929949 515674
2950086 1863885 4269069
1241018 3711213 4507609
3806276 2035233 4968071
4437138 1248817 1167192
5468160 4726010 2851685
1211786 2604463 5172095
2914539 5235788 4130808
4730974 5835757 1536235
2201352 5779683 5771612
3864854 4784259 2928288
the other dataframe (df2) containing all the 'old' id's and the corresponding new ones in the next column (from 1 to 20,000), which looks something like this:
Id Id_new
5774290 1
761000 2
3489755 3
1084156 4
2188433 5
3456900 6
4364416 7
3518181 8
3926684 9
5797492 10
4435820 11
what i would like to do is replace all the id's (all columns) in df1 with the corresponding Id_new from df2. I guess ideally without having to do a merge or join for each column, given the size of the dataset?
The result should be like this: df_new
Id_new Id_new Id_new
8 12 22
16 9 8
21 25 10
10 15 13
29 6 4
22 7 22
30 3 3
11 31 29
32 29 27
12 3 4
14 6 24
Any tips would be great, thanks in advance!
I think you need replace by Series created by set_index:
print (df1)
Id Id.1 Id.2
0 4954733 3929949 515674 <-first value changed for match data
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288
df = df1.replace(df2.set_index('Id')['Id_new'])
print (df)
Id Id.1 Id.2
0 1 3929949 515674
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288

Grouping by unique values in python pandas dataframe

I have a datafame that goes like this
id rev committer_id
date
1996-07-03 08:18:15 1 76620 1
1996-07-03 08:18:15 2 76621 2
1996-11-18 20:51:08 3 76987 3
1996-11-21 09:12:53 4 76995 2
1996-11-21 09:16:33 5 76997 2
1996-11-21 09:39:27 6 76999 2
1996-11-21 09:53:37 7 77003 2
1996-11-21 10:11:35 8 77006 2
1996-11-21 10:17:50 9 77008 2
1996-11-21 10:23:58 10 77010 2
1996-11-21 10:32:58 11 77012 2
1996-11-21 10:55:51 12 77014 2
I would like to group by quarterly periods and then show number of unique entries in the committer_id column. Columns id and rev are actually not used for the moment.
I would like to have a result as the following
committer_id
date
1996-09-30 2
1996-12-31 91
1997-03-31 56
1997-06-30 154
1997-09-30 84
The actual results are aggregated by number of entries in each time period and not by unique entries. I am using the following :
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(np.size)
Can't figure how to use np.unique.
Any ideas, please.
Best,
--
df[['committer_id']].groupby(pd.Grouper(freq='Q-DEC')).aggregate(pd.Series.nunique)
Should work for you. Or df.groupby(pd.Grouper(freq='Q-DEC'))['committer_id'].nunique()
Your try with np.unique didn't work because that returns an array of unique items. The result for agg must be a scalar. So .aggregate(lambda x: len(np.unique(x)) probably would work too.

Performing calculations on subset of data frame subset in Python

user_id char_id rating
100 33 3
100 44 2
100 33 1
100 44 4
111 55 5
111 44 4
111 55 5
I have a data frame formatted similarly to this one and am trying to perform calculations on the ratings after they have been grouped by user_id and char_id.
It doesn't work but I need to do something like data.groupby('user_id', 'char_id') and then calculate the moving average for each char_id for each user_id. Any help? I have several thousand user_id so I can't go through and select one at a time for the calculations.
I need to somehow iterate over the user_id column and group all the same user_ids together, and save that format so that user_ids are separate. Then I need to do the same thing, iterating over char_id for each user_id subset and saving that format so that I can finally perform calculations on the subsets of subsets of ratings. So far all my attempts have been unsuccessful. The closest I came was:
def divide_by_user(data):
for user in data['user_id']:
user_data = data.where(data['user_id'] == user)
return user_data
There's no need to do this manually, creating and summarizing subsets like this is exactly what DataFrame.groupby() is for. Create your groupby:
grouped = df.groupby(['user_id', 'char_id'])
Then you can apply a function to each subset. It sounds like you want either rolling_mean or expanding_mean, both of which are already available in pandas:
df['cum_average'] = grouped['rating'].apply(pd.expanding_mean)
# New column now contains the average rating for each subset,
# including all values that have been seen so far.
df
Out[43]:
user_id char_id rating cum_average
0 100 33 3 3
1 100 44 2 2
2 100 33 1 2
3 100 44 4 3
4 111 55 5 5
5 111 44 4 4
6 111 55 5 5
Using a larger randomly-generated dataset to demonstrate rolling_window():
df = pd.DataFrame({
'user_id': [random.choice([100, 111, 112]) for n in range(n_rows)],
'char_id': [random.choice([33, 44, 55]) for n in range(n_rows)],
'rating': [random.choice([1, 2, 3, 4, 5]) for n in range(n_rows)]
})
grouped = df.groupby(['user_id', 'char_id'])
df['cum_average'] = grouped['rating'].apply(pd.rolling_mean, window=7)
# Output. The rolling average will be NaN until enough values have been
# observed for that subset, you can change this using the
# min_periods argument to rolling_window
df.sort(columns=['user_id', 'char_id'])
char_id rating user_id cum_average
3 33 1 100 NaN
19 33 2 100 NaN
22 33 5 100 NaN
34 33 1 100 NaN
47 33 1 100 NaN
48 33 1 100 NaN
49 33 1 100 1.714286
51 33 4 100 2.142857
55 33 2 100 2.142857
60 33 2 100 1.714286
66 33 2 100 1.857143
...
etc.
Try this:
"df" is the dataFrame
mean=pd.rolling_mean(df.rating, 7)

Categories