Duplicate IDs with different values in rows - python

I'm having an issue with a dataset I am using for my thesis. The dataset contains customer purchase information and I want to figure out how many times a customer has purchased, what the total purchase amount is and what their average spending is. The data I currently have looks something like this:
   id     date       total_purchase_amount  purchase_amount
0  84288  2020-1-1   100                    50
1  84288  2020-1-1                          50
2  84288  2020-3-7   80                     20
3  84288  2020-3-7                          60
4  84289  2020-8-16  200                    10
5  84289  2020-8-16                         50
6  84289  2020-8-16                         10
7  84289  2020-8-16                         80
8  84290  2020-4-2   10                     10
9  84290  2020-4-8   30                     30
10 84291  2020-5-23  45                     45
Some customers have made purchases more than once, causing their customer ID to appear multiple times in the dataset. What I want to achieve is a dataset which looks like this:
   id     total_purchase_amount  average_spending  times_purchased
0  84288  180                    45                2
1  84289  200                    37.5              1
2  84290  40                     20                2
3  84291  45                     45                1
Does anyone have a suggestion for how I can achieve this? The dataset I am working with is very large, so this problem cannot be solved manually.
Here is the code to get the first dataframe:
import pandas as pd
data = [[84288, "2020-1-1", 100, 50],[84288, "2020-1-1", "", 50],[84288, "2020-3-7", 80, 20], [84288, "2020-3-7", "", 60],[84289, "2020-8-16", 200, 10],[84289, "2020-8-16", "", 50],[84289, "2020-8-16", "", 10], [84289, "2020-8-16", "", 80],[84290, "2020-4-2", 10, 10],[84290, "2020-4-8", 30, 30],[84291, "2020-5-23", 45, 45]]
df = pd.DataFrame(data, columns=['id','date','total_purchase_amount','purchase_amount'])

Replace the blank values with NaN and do the math in the grouping:
import numpy as np

df.replace('', np.nan, inplace=True)
df.groupby('id')[['total_purchase_amount', 'purchase_amount']].agg(
    average_spending=('purchase_amount', 'mean'),
    times_purchased=('total_purchase_amount', 'count'))
average_spending times_purchased
id
84288 45.0 2
84289 37.5 1
84290 20.0 2
84291 45.0 1
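If the total purchase amount per customer is needed as well (as in the desired output above), a minimal sketch along the same lines, assuming df is built exactly as in the question, would be:

import numpy as np
import pandas as pd

df = df.replace('', np.nan)
# the blanks make this column object-typed, so coerce it to numeric before summing
df['total_purchase_amount'] = pd.to_numeric(df['total_purchase_amount'])

result = (df.groupby('id')
            .agg(total_purchase_amount=('total_purchase_amount', 'sum'),
                 average_spending=('purchase_amount', 'mean'),
                 times_purchased=('total_purchase_amount', 'count'))
            .reset_index())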

Related

Python Pandas Sum specific columns while matching keys

I am currently working with a data stream that updates every 30 seconds with highway probe data. The database needs to aggregate the incoming data and provide a 15-minute total. The issue I am encountering is how to sum specific columns while matching keys.
Current_DataFrame:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count lane-Class2Count
1 1 GOOD 10 55 5 5
1 2 GOOD 5 57 3 2
2 1 GOOD 7 45 4 3
New_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count lane-Class2Count
1 1 BAD 7 59 6 1
1 2 GOOD 4 64 2 2
2 1 BAD 5 63 3 2
Goal_Dataframe:
uuid lane-Number lane-Status lane-Volume lane-Speed lane-Class1Count lane-Class2Count
1 1 BAD 17 59 11 6
1 2 GOOD 9 64 5 4
2 1 BAD 12 63 7 5
The goal is to match the dataframes on uuid and lane-Number, take the New_Dataframe values for lane-Status and lane-Speed, and sum the lane-Volume, lane-Class1Count and lane-Class2Count values together. I want to keep all the new incoming data, unless a column is aggregative (e.g. the number of cars passing the road probe), in which case I want to sum it with the existing data.
I found a solution after some more digging.
df = pd.concat([new_dataframe, current_dataframe], ignore_index=True)
df = df.groupby(["uuid", "lane-Number"]).agg(
{
"lane-Status": "first",
"lane-Volume": "sum",
"lane-Speed": "first",
"lane-Class1Count": "sum",
"lane-Class2Count": "sum"
})
By concatenating the current_dataframe onto the back of the new_dataframe, I can use the "first" aggregation option to keep the newest data and sum the columns that need to be accumulated.
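For reference, here is a self-contained sketch of that approach built on the toy frames from the question (the variable names current_dataframe and new_dataframe are assumed):

import pandas as pd

current_dataframe = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["GOOD", "GOOD", "GOOD"], "lane-Volume": [10, 5, 7],
    "lane-Speed": [55, 57, 45], "lane-Class1Count": [5, 3, 4],
    "lane-Class2Count": [5, 2, 3]})
new_dataframe = pd.DataFrame({
    "uuid": [1, 1, 2], "lane-Number": [1, 2, 1],
    "lane-Status": ["BAD", "GOOD", "BAD"], "lane-Volume": [7, 4, 5],
    "lane-Speed": [59, 64, 63], "lane-Class1Count": [6, 2, 3],
    "lane-Class2Count": [1, 2, 2]})

# the new rows go first so that "first" picks the incoming status and speed
df = pd.concat([new_dataframe, current_dataframe], ignore_index=True)
goal = df.groupby(["uuid", "lane-Number"]).agg(
    {"lane-Status": "first", "lane-Volume": "sum", "lane-Speed": "first",
     "lane-Class1Count": "sum", "lane-Class2Count": "sum"}).reset_index()
print(goal)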

Python Pandas: Create New Column With Calculations Based on Categorical Values in A Different Column

I have the following sample data frame:
id category time
43 S 8
22 I 10
15 T 350
18 L 46
I want to apply the following logic:
1) If the category value equals "T", create a new column called "time_2" where the "time" value is divided by 24.
2) If the category value equals "L", create a new column called "time_2" where the "time" value is divided by 3.5.
3) Otherwise, keep the existing "time" value (categories S and I).
Below is my desired output table:
id category time time_2
43 S 8 8
22 I 10 10
15 T 350 14.58333333
18 L 46 13.14285714
I've tried using pd.np.where to get the above to work but am confused about the syntax.
You can use map for the rules:
In [1066]: df['time_2'] = df.time / df.category.map({'T': 24, 'L': 3.5}).fillna(1)
In [1067]: df
Out[1067]:
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
You can use np.select. This is a good alternative to nested np.where logic.
import numpy as np

conditions = [df['category'] == 'T', df['category'] == 'L']
values = [df['time'] / 24, df['time'] / 3.5]
df['time_2'] = np.select(conditions, values, df['time'])
print(df)
id category time time_2
0 43 S 8 8.000000
1 22 I 10 10.000000
2 15 T 350 14.583333
3 18 L 46 13.142857
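Since the question mentions pd.np.where, for completeness a nested np.where version (just a sketch, equivalent to the np.select answer above) would look like:

import numpy as np

df['time_2'] = np.where(df['category'] == 'T', df['time'] / 24,
               np.where(df['category'] == 'L', df['time'] / 3.5, df['time']))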

Python replace all values in dataframe with values from other dataframe

I'm quite new to Python (and pandas) and have a replace task for a large dataframe that I couldn't find a solution for.
I have two dataframes. One (df1) looks something like this:
Id Id Id
4954733 3929949 515674
2950086 1863885 4269069
1241018 3711213 4507609
3806276 2035233 4968071
4437138 1248817 1167192
5468160 4726010 2851685
1211786 2604463 5172095
2914539 5235788 4130808
4730974 5835757 1536235
2201352 5779683 5771612
3864854 4784259 2928288
The other dataframe (df2) contains all the 'old' ids and the corresponding new ones in the next column (numbered 1 to 20,000). It looks something like this:
Id Id_new
5774290 1
761000 2
3489755 3
1084156 4
2188433 5
3456900 6
4364416 7
3518181 8
3926684 9
5797492 10
4435820 11
What I would like to do is replace all the ids (in all columns) of df1 with the corresponding Id_new from df2, ideally without having to do a merge or join for each column, given the size of the dataset.
The result should be like this: df_new
Id_new Id_new Id_new
8 12 22
16 9 8
21 25 10
10 15 13
29 6 4
22 7 22
30 3 3
11 31 29
32 29 27
12 3 4
14 6 24
Any tips would be great, thanks in advance!
I think you need replace with a Series created by set_index:
print (df1)
Id Id.1 Id.2
0 5774290 3929949 515674 <- first value changed so that it matches df2
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288
df = df1.replace(df2.set_index('Id')['Id_new'])
print (df)
Id Id.1 Id.2
0 1 3929949 515674
1 2950086 1863885 4269069
2 1241018 3711213 4507609
3 3806276 2035233 4968071
4 4437138 1248817 1167192
5 5468160 4726010 2851685
6 1211786 2604463 5172095
7 2914539 5235788 4130808
8 4730974 5835757 1536235
9 2201352 5779683 5771612
10 3864854 4784259 2928288
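If replace turns out to be slow on the full 20,000-id dataset, a possible alternative (a sketch, assuming every id in df1 appears in df2, as stated in the question) is to build a plain dict once and map each column:

mapping = df2.set_index('Id')['Id_new'].to_dict()
# map every column; any id without a new value falls back to the original
df_new = df1.apply(lambda col: col.map(mapping).fillna(col).astype(col.dtype))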

Time series: Mean per hour per day per Id number

I am a somewhat beginner programmer learning Python (+pandas) and hope I can explain this well enough. I have a large time-series pandas dataframe of over 3 million rows and initially 12 columns, spanning a number of years. It covers people taking a ticket from different locations denoted by Id numbers (350 of them). Each row is one instance (one ticket taken).
I have searched many similar questions, like counting records per hour per day and getting an average per hour over several years. However, I run into trouble when including the 'Id' variable.
I'm looking to get the mean number of people taking a ticket for each hour, for each day of the week (Mon-Fri), per station.
I have the following, setting datetime to index:
Id Start_date Count Day_name_no
149 2011-12-31 21:30:00 1 5
150 2011-12-31 20:51:00 1 0
259 2011-12-31 20:48:00 1 1
3015 2011-12-31 19:38:00 1 4
28 2011-12-31 19:37:00 1 4
Using groupby and Start_date.index.hour, I can't seem to include the 'Id'.
My alternative approach is to split the hour out of the date and have the following:
Id Count Day_name_no Trip_hour
149 1 2 5
150 1 4 10
153 1 2 15
1867 1 4 11
2387 1 2 7
I then get the count first with:
Count_Item = TestFreq.groupby([TestFreq['Id'], TestFreq['Day_name_no'], TestFreq['Trip_hour']]).count().reset_index()
Id Day_name_no Trip_hour Count
1 0 7 24
1 0 8 48
1 0 9 31
1 0 10 28
1 0 11 26
1 0 12 25
Then use groupby and mean:
Mean_Count = Count_Item.groupby([Count_Item['Id'], Count_Item['Day_name_no'], Count_Item['Trip_hour']]).mean().reset_index()
However, this does not give the desired result as the mean values are incorrect.
I hope I have explained this issue clearly. I'm looking for the mean per hour, per day, per Id, as I plan to do clustering to separate my dataset into groups before applying a predictive model to those groups.
Any help would be appreciated, and if possible an explanation of what I am doing wrong, either in my code or in my approach.
Thanks in advance.
I have edited this to try to make it a little clearer. Writing a question with a lack of sleep is probably not advisable.
A toy dataset that I start with:
Date Id Dow Hour Count
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
12/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
19/12/2014 1234 0 9 1
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
27/12/2014 1234 1 11 1
04/01/2015 1234 1 11 1
I now realise I would have to use the date first and get something like:
Date Id Dow Hour Count
12/12/2014 1234 0 9 5
19/12/2014 1234 0 9 3
26/12/2014 1234 0 10 1
27/12/2014 1234 1 11 4
04/01/2015 1234 1 11 1
And then calculate the mean per Id, per Dow, per Hour, to get this:
Id Dow Hour Mean
1234 0 9 4
1234 0 10 1
1234 1 11 2.5
I hope this makes it a bit clearer. My real dataset spans 3 years, has 3 million rows and contains 350 Id numbers.
Your question is not very clear, but I hope this helps:
df.reset_index(inplace=True)
# helper columns with date, hour and dow
df['date'] = df['Start_date'].dt.date
df['hour'] = df['Start_date'].dt.hour
df['dow'] = df['Start_date'].dt.dayofweek
# sum of counts for all combinations
df = df.groupby(['Id', 'date', 'dow', 'hour']).sum()
# take the mean over all dates
df = df.reset_index().groupby(['Id', 'dow', 'hour']).mean()
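Applied to the toy dataset from the edited question (where Date, Dow, Hour and Count already exist as columns), the same two-step idea might look like this sketch:

# step 1: total tickets per station, per calendar date, per day-of-week/hour
daily = df.groupby(['Id', 'Date', 'Dow', 'Hour'], as_index=False)['Count'].sum()
# step 2: average those daily totals per station, per day-of-week, per hour
mean_count = daily.groupby(['Id', 'Dow', 'Hour'], as_index=False)['Count'].mean()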
You can also use the groupby function on the 'Id' column and then resample the datetime index with .sum() (the how='sum' argument has been removed in recent pandas versions).

Performing calculations on subset of data frame subset in Python

user_id char_id rating
100 33 3
100 44 2
100 33 1
100 44 4
111 55 5
111 44 4
111 55 5
I have a data frame formatted similarly to this one and am trying to perform calculations on the ratings after they have been grouped by user_id and char_id.
It doesn't work but I need to do something like data.groupby('user_id', 'char_id') and then calculate the moving average for each char_id for each user_id. Any help? I have several thousand user_id so I can't go through and select one at a time for the calculations.
I need to somehow iterate over the user_id column and group all the same user_ids together, and save that format so that user_ids are separate. Then I need to do the same thing, iterating over char_id for each user_id subset and saving that format so that I can finally perform calculations on the subsets of subsets of ratings. So far all my attempts have been unsuccessful. The closest I came was:
def divide_by_user(data):
    for user in data['user_id']:
        user_data = data.where(data['user_id'] == user)
    return user_data
There's no need to do this manually; creating and summarizing subsets like this is exactly what DataFrame.groupby() is for. Create your groupby:
grouped = df.groupby(['user_id', 'char_id'])
Then you can apply a function to each subset. It sounds like you want either a rolling or an expanding mean, both of which are available in pandas (the old pd.expanding_mean helper has since been replaced by the .expanding() method):
df['cum_average'] = grouped['rating'].transform(lambda s: s.expanding().mean())
# New column now contains the average rating for each subset,
# including all values that have been seen so far.
df
Out[43]:
user_id char_id rating cum_average
0 100 33 3 3
1 100 44 2 2
2 100 33 1 2
3 100 44 4 3
4 111 55 5 5
5 111 44 4 4
6 111 55 5 5
Using a larger randomly generated dataset to demonstrate a rolling window:
import random

n_rows = 100  # any number of demo rows
df = pd.DataFrame({
    'user_id': [random.choice([100, 111, 112]) for n in range(n_rows)],
    'char_id': [random.choice([33, 44, 55]) for n in range(n_rows)],
    'rating': [random.choice([1, 2, 3, 4, 5]) for n in range(n_rows)]
})
grouped = df.groupby(['user_id', 'char_id'])
df['cum_average'] = grouped['rating'].transform(lambda s: s.rolling(7).mean())
# Output. The rolling average will be NaN until enough values have been
# observed for that subset; you can change this using the
# min_periods argument to rolling().
df.sort_values(['user_id', 'char_id'])
char_id rating user_id cum_average
3 33 1 100 NaN
19 33 2 100 NaN
22 33 5 100 NaN
34 33 1 100 NaN
47 33 1 100 NaN
48 33 1 100 NaN
49 33 1 100 1.714286
51 33 4 100 2.142857
55 33 2 100 2.142857
60 33 2 100 1.714286
66 33 2 100 1.857143
...
etc.
Try this ("df" is the DataFrame):
mean = df.rating.rolling(7).mean()
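If the rolling mean should respect the user_id/char_id groups from the question, one way (a sketch, not part of the original answer) is to combine it with groupby and transform:

df['rolling_mean'] = (df.groupby(['user_id', 'char_id'])['rating']
                        .transform(lambda s: s.rolling(7, min_periods=1).mean()))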
