Python Pandas: how to take only the earliest date in each group

Here's a sample dataset I've created for this question:
import pandas as pd

data1 = pd.DataFrame([['1', '303', '3/7/2016'],
                      ['4', '404', '6/23/2011'],
                      ['7', '101', '3/7/2016'],
                      ['1', '303', '5/6/2017']],
                     columns=["code", "ticket #", "CB date"])
data1['CB date'] = pd.to_datetime(data1['CB date'])

data2 = pd.DataFrame([['1', '303', '2/5/2016'],
                      ['4', '404', '6/23/2011'],
                      ['7', '101', '3/17/2016'],
                      ['1', '303', '4/6/2017']],
                     columns=["code", "ticket #", "audit date"])
data2['audit date'] = pd.to_datetime(data2['audit date'])
print(data1)
print(data2)
  code ticket #    CB date
0    1      303 2016-03-07
1    4      404 2011-06-23
2    7      101 2016-03-07
3    1      303 2017-05-06
  code ticket # audit date
0    1      303 2016-02-05
1    4      404 2011-06-23
2    7      101 2016-03-17
3    1      303 2017-04-06
I want to merge the two df's and make sure that the CB dates are always on or after the audit dates:
data_all = data1.merge(data2, how='inner', on=['code', 'ticket #'])
data_all = data_all[data_all['audit date'] <= data_all['CB date']]
print(data_all)
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23
However, I only want to keep the row with the earliest CB date for each audit date. So in the output above, the row with index 2 shouldn't be there: it has the same audit date (2016-02-05) as the row with index 0, but I only want to keep index 0, since its CB date is much closer to 2016-02-05.
Desired output:
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
3    1      303 2017-05-06 2017-04-06
4    4      404 2011-06-23 2011-06-23
I know in SQL I'd have to group by code, ticket # and audit date first, then order CB date ascending, and take the row with rank = 1 in each group; but how can I do this in Python/pandas?
I read other posts here but I am still not getting it, so would really appreciate some advice here.
A few of the posts I read include:
Pandas Groupy take only the first N Groups
Pandas: select the first couple of rows in each group

I'd do this with an optional sort_values call and a drop_duplicates call.
data_all.sort_values(data_all.columns.tolist())\
        .drop_duplicates(subset=['CB date'], keep='first')
  code ticket #    CB date audit date
0    1      303 2016-03-07 2016-02-05
2    1      303 2017-05-06 2016-02-05
4    4      404 2011-06-23 2011-06-23
I say the sort_values call is optional here, since your data already appears to be sorted. If it isn't, make sure sorting is part of your solution.
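Note that deduplicating on 'CB date' alone still keeps index 2 (the second row with audit date 2016-02-05) rather than producing the desired output. A hedged variant that does match it exactly: sort by 'CB date' and drop duplicates on the grouping keys instead, which is the pandas analogue of the SQL rank-1-per-group approach mentioned in the question. A sketch against the data_all frame built above:

# Sketch: keep the earliest CB date per (code, ticket #, audit date) group.
# Sorting puts the earliest CB date first within each group, so
# drop_duplicates(keep='first') acts like ROW_NUMBER() = 1 in SQL.
result = (data_all
          .sort_values('CB date')
          .drop_duplicates(subset=['code', 'ticket #', 'audit date'],
                           keep='first')
          .sort_index())
print(result)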

Related

Pandas: Groupby and sum customer profit, for every 6 months, starting from users first transaction

I have a dataset like this:
Customer ID        Date  Profit
          1   4/13/2018   10.00
          1   4/26/2018   13.27
          1  10/23/2018   15.00
          2    1/1/2017    7.39
          2    7/5/2017    9.99
          2    7/7/2017   10.01
          3    5/4/2019   30.30
I'd like to group by and sum profit for every 6 months, starting at each user's first transaction.
The output ideally should look like this:
Customer ID        Date  Profit
          1   4/13/2018   23.27
          1  10/13/2018   15.00
          2    1/1/2017    7.39
          2    7/1/2017   20.00
          3    5/4/2019   30.30
The closest I seem to have gotten on this problem is by using:
df.groupby(['Customer ID',pd.Grouper(key='Date', freq='6M', closed='left')])['Profit'].sum().reset_index()
But that doesn't seem to start summing on a user's first transaction day.
If changing the dates is not possible (e.g. customer 2's date is 7/1/2017 and not 7/5/2017), then at least summing the profit so that it's based on each user's own 6-month purchase journey would be extremely helpful. Thank you!
I can get you the first of the month until you find a more perfect solution.
df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")
df = (
df
.set_index("Date")
.groupby(["Customer ID"])
.Profit
.resample("6MS")
.sum()
.reset_index(name="Profit")
)
print(df)
   Customer ID       Date  Profit
0            1 2018-04-01   23.27
1            1 2018-10-01   15.00
2            2 2017-01-01    7.39
3            2 2017-07-01   20.00
4            3 2019-05-01   30.30
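The resample("6MS") buckets snap to month starts, so they don't begin on each user's actual first transaction date (customer 1's second bucket becomes 2018-10-01 rather than the desired 10/13/2018). Here is a hedged sketch of one way to anchor the windows there instead, starting again from the raw df; the label_buckets helper and its day-of-month correction are my own construction, only checked against the sample above, and recent pandas may warn about applying over the grouping column:

import pandas as pd

df["Date"] = pd.to_datetime(df["Date"], format="%m/%d/%Y")

def label_buckets(g):
    # Whole months elapsed since this customer's first purchase,
    # minus one when the day-of-month hasn't come around yet.
    start = g["Date"].min()
    months = ((g["Date"].dt.year - start.year) * 12
              + (g["Date"].dt.month - start.month)
              - (g["Date"].dt.day < start.day))
    # Relabel each row with the start date of its 6-month bucket.
    g = g.copy()
    g["Date"] = (months // 6).map(
        lambda n: start + pd.DateOffset(months=6 * int(n)))
    return g

out = (df.groupby("Customer ID", group_keys=False)
         .apply(label_buckets)
         .groupby(["Customer ID", "Date"], as_index=False)["Profit"]
         .sum())
print(out)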

How to get weekly averages for column values and week number for the corresponding year based on daily data records with pandas

I'm still learning python and would like to ask your help with the following problem:
I have a csv file with daily data and I'm looking for a solution to sum it per calendar week. So for the mockup data below I have rows stretched over two weeks (week 14, the current week, and week 13, the past week). Now I need to find a way to group rows per calendar week, recognize which year they belong to, and calculate the week sum and week average. In the input example there are only two different IDs; in the actual data file I expect many more.
input.csv
id date activeMembers
1 2020-03-30 10
2 2020-03-30 1
1 2020-03-29 5
2 2020-03-29 6
1 2020-03-28 0
2 2020-03-28 15
1 2020-03-27 32
2 2020-03-27 10
1 2020-03-26 9
2 2020-03-26 3
1 2020-03-25 0
2 2020-03-25 0
1 2020-03-24 0
2 2020-03-24 65
1 2020-03-23 22
2 2020-03-23 12
...
desired output.csv
id week WeeklyActiveMembersSum WeeklyAverageActiveMembers
1 202014 10 1.4
2 202014 1 0.1
1 202013 68 9.7
2 202013 111 15.9
My goal is to:
import pandas as pd
df = pd.read_csv('path/to/my/input.csv')
Here I'd need to group by the 'id' and 'date' columns (per calendar week, though I'm not sure if this is possible) and create a 'week' column with the week number, then sum the 'activeMembers' values for that week, save them as a 'WeeklyActiveMembersSum' column in my output file, and finally calculate 'WeeklyAverageActiveMembers' for that week. I was experimenting with groupby and isin but no luck so far. Would I have to go with something similar to this:
df.groupby('id', as_index=False).agg({'date': 'max',
                                      'activeMembers': 'sum'})
and finally save all as output.csv:
df.to_csv('path/to/my/output.csv', index=False)
Thanks in advance!
It seems I'm getting a different week setting than you do:
# should convert datetime column to datetime type
df['date'] = pd.to_datetime(df['date'])
(df.groupby(['id', df.date.dt.strftime('%Y%W')], sort=False)
   .activeMembers.agg([('Sum', 'sum'), ('Average', 'mean')])
   .add_prefix('activeMembers')
   .reset_index()
)
Output:
  id    date  activeMembersSum  activeMembersAverage
0  1  202013                10             10.000000
1  2  202013                 1              1.000000
2  1  202012                68              9.714286
3  2  202012               111             15.857143
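If you want labels that match the expected 202014/202013, a hedged sketch using isocalendar() (pandas 1.1 or later), which uses ISO week numbering; dividing the sum by 7 calendar days rather than by the record count is my reading of the expected 1.4 for a one-day week:

import pandas as pd

df['date'] = pd.to_datetime(df['date'])

# ISO year/week labels: 2020-03-30 -> '202014', 2020-03-23..29 -> '202013'.
iso = df['date'].dt.isocalendar()
df['week'] = iso.year.astype(str) + iso.week.astype(str).str.zfill(2)

out = (df.groupby(['id', 'week'], as_index=False)['activeMembers'].sum()
         .rename(columns={'activeMembers': 'WeeklyActiveMembersSum'}))

# The desired output divides by 7 days, not by the number of records:
# 10 / 7 -> 1.4 for the single-day week 202014.
out['WeeklyAverageActiveMembers'] = (out['WeeklyActiveMembersSum'] / 7).round(1)
print(out)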

Calculating total monthly cumulative number of Order

I need to find the total monthly cumulative number of orders. I have two columns, OrderDate and OrderId. I can't use a list to find the cumulative numbers since the data is so large, and the result should be in year-month format along with the cumulative order total for each month.
orderDate OrderId
2011-11-18 06:41:16 23
2011-11-18 04:41:16 2
2011-12-18 06:41:16 69
2012-03-12 07:32:15 235
2012-03-12 08:32:15 234
2012-03-12 09:32:15 235
2012-05-12 07:32:15 233
desired Result
Date CumulativeOrder
2011-11 2
2011-12 3
2012-03 6
2012-05 7
I have imported my Excel file into PyCharm and used pandas to read it.
I have tried splitting the datetime column into year and month and then grouping, but I'm not getting the correct result.
df1 = df1[['OrderId','orderDate']]
df1['year'] = pd.DatetimeIndex(df1['orderDate']).year
df1['month'] = pd.DatetimeIndex(df1['orderDate']).month
df1.groupby(['year','month']).sum().groupby('year','month').cumsum()
print (df1)
Convert the column to datetimes, then to monthly periods with to_period, add a counter column with numpy.arange, and finally remove duplicates, keeping the last row per Date, with DataFrame.drop_duplicates:
import numpy as np
df1['orderDate'] = pd.to_datetime(df1['orderDate'])
df1['Date'] = df1['orderDate'].dt.to_period('m')
#use if not sorted datetimes
#df1 = df1.sort_values('Date')
df1['CumulativeOrder'] = np.arange(1, len(df1) + 1)
print (df1)
            orderDate  OrderId    Date  CumulativeOrder
0 2011-11-18 06:41:16       23 2011-11                1
1 2011-11-18 04:41:16        2 2011-11                2
2 2011-12-18 06:41:16       69 2011-12                3
3 2012-03-12 07:32:15      235 2012-03                4
df2 = df1.drop_duplicates('Date', keep='last')[['Date','CumulativeOrder']]
print (df2)
      Date  CumulativeOrder
1  2011-11                2
2  2011-12                3
3  2012-03                4
Another solution:
df2 = (df1.groupby(df1['orderDate'].dt.to_period('m')).size()
          .cumsum()
          .rename_axis('Date')
          .reset_index(name='CumulativeOrder'))
print (df2)
      Date  CumulativeOrder
0  2011-11                2
1  2011-12                3
2  2012-03                6
3  2012-05                7

Need to calculate timedifference between timestamps and store it in a variable

I am a beginner with Python, so my questions could come across as trivial. I would appreciate your support or any leads on my problem.
Problem:
There are about 10 different states; an order moves across different states, and a time stamp is generated when a state ends. For example, below there are four states A, B, C, D.
A 10 AM
B 1 PM
C 4 Pm
D 5 PM
Time spent in B = 1 PM - 10 AM = 3 hours.
Sometimes the same state can occur multiple times, hence we need a variable to store the time difference value for a single state.
Attached are the raw data csv and my code so far. There are multiple orders for which this calculation needs to be performed; however, for simplicity, I have data for just one order now.
sample data:
Order States modified_at
1 Resolved 2018-06-18T15:05:52.2460000
1 Edited 2018-05-24T21:44:07.9030000
1 Pending PO Creation 2018-06-06T19:52:51.5990000
1 Assigned 2018-05-24T17:46:03.2090000
1 Edited 2018-06-04T15:02:57.5130000
1 Draft 2018-05-24T17:45:07.9960000
1 PO Placed 2018-06-06T20:49:37.6540000
1 Edited 2018-06-04T11:18:13.9830000
1 Edited 2018-05-24T17:45:39.4680000
1 Pending Approval 2018-05-24T21:48:23.9180000
1 Edited 2018-06-06T21:00:19.6350000
1 Submitted 2018-05-24T21:44:37.8830000
1 Edited 2018-05-30T11:19:36.5460000
1 Edited 2018-05-25T11:16:07.9690000
1 Edited 2018-05-24T21:43:35.0770000
1 Assigned 2018-06-07T18:39:00.2580000
1 Pending Review 2018-05-24T17:45:10.5980000
1 Pending PO Submission 2018-06-06T14:16:26.6580000
Code I tried:
import pandas as pd
import datetime as datetime
from dateutil.relativedelta import relativedelta
fileName = "SamplePR.csv"
df = pd.read_csv(fileName, delimiter=',')
df['modified_at'] = pd.to_datetime(df.modified_at)
df = df.sort_values(by='modified_at')
df = df.reset_index(drop=True)
df1 = df[:-1]
df2 = df[1:]
dfm1 = df1['modified_at']
dfm2 = df2['modified_at']
dfm1 = dfm1.reset_index(drop=True)
dfm2 = dfm2.reset_index(drop=True)
for i in range(len(df) - 1):
    start = datetime.datetime.strptime(str(dfm1[i]), '%Y-%m-%d %H:%M:%S')
    ends = datetime.datetime.strptime(str(dfm2[i]), '%Y-%m-%d %H:%M:%S')
    diff = relativedelta(ends, start)
    print(diff)
So far, I have tried to sort the list by time and then calculate the difference between two states. I would really appreciate it if someone could help with the logic or point me in the right direction.
You can use diff from pandas to get the difference between two rows.
Here is some sample code.
In [1]: import pandas as pd
In [2]: from io import StringIO
In [3]: data = StringIO('''Order,States,modified_at
...: 1,Resolved,2018-06-18T15:05:52.2460000
...: 1,Edited,2018-05-24T21:44:07.9030000
...: 1,Pending PO Creation,2018-06-06T19:52:51.5990000
...: ''')
In [4]: df = pd.read_csv(data, sep=',')
In [5]: df['modified_at'] = pd.to_datetime(df['modified_at']) #convert the type to datetime
In [6]: df
Out[6]:
   Order               States             modified_at
0      1             Resolved 2018-06-18 15:05:52.246
1      1               Edited 2018-05-24 21:44:07.903
2      1  Pending PO Creation 2018-06-06 19:52:51.599
In [7]: df['diff'] = df['modified_at'].diff() #get the diff and add to a new column
In [8]: df
Out[8]:
   Order               States             modified_at                      diff
0      1             Resolved 2018-06-18 15:05:52.246                       NaT
1      1               Edited 2018-05-24 21:44:07.903 -25 days +06:38:15.657000
2      1  Pending PO Creation 2018-06-06 19:52:51.599   12 days 22:08:43.696000
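Since the real data has multiple orders, a hedged extension of the same idea sorts within each order and diffs per group, so differences never cross order boundaries (the groupby pattern here is my addition, not part of the answer above):

# Sort by order and time, then diff within each order only;
# the first row of every order gets NaT instead of a cross-order delta.
df = df.sort_values(['Order', 'modified_at'])
df['diff'] = df.groupby('Order')['modified_at'].diff()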
Welcome visal. If your intention is just to check the time difference between timestamps, use to_datetime to convert the column and take the difference by shifting:
   index  Order     States             modified_at
0      0      1   Resolved 2018-06-18 15:05:52.246
1      1      1     Edited 2018-05-24 21:44:07.903
2      0      1     Edited 2018-06-06 21:00:19.635
3      1      1  Submitted 2018-05-24 21:44:37.883
4      2      1     Edited 2018-05-30 11:19:36.546
5      3      1     Edited 2018-05-25 11:16:07.969
6      4      1     Edited 2018-05-24 21:43:35.077
7      5      1   Assigned 2018-06-07 18:39:00.258
df.modified_at = pd.to_datetime(df.modified_at)
df['time_spent'] = df.modified_at - df.modified_at.shift()
Out:
0 NaT
1 -25 days +06:38:15.657000
2 12 days 23:16:11.732000
3 -13 days +00:44:18.248000
4 5 days 13:34:58.663000
5 -6 days +23:56:31.423000
6 -1 days +10:27:27.108000
7 13 days 20:55:25.181000
Name: modified_at, dtype: timedelta64[ns]
You can use a pivot table for this requirement:
import numpy as np

df.time_spent = df.time_spent.dt.seconds
pd.pivot_table(df, values='time_spent', index=['Order'], columns=['States'], aggfunc=np.sum)
Out:
States  Assigned   Edited  Resolved  Submitted
Order
0            NaN  83771.0       0.0        NaN
1            NaN  23895.0       NaN     2658.0
2            NaN  48898.0       NaN        NaN
3            NaN  86191.0       NaN        NaN
4            NaN  37647.0       NaN        NaN
5        75325.0      NaN       NaN        NaN
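One caveat on the df.time_spent.dt.seconds step: .dt.seconds returns only the within-day seconds component and drops the sign and the days, which is misleading for the negative, multi-day deltas shown earlier. A hedged alternative using .dt.total_seconds() instead:

# Instead of df.time_spent.dt.seconds, keep sign and days:
# Timedelta('-25 days +06:38:15.657000').total_seconds() is about
# -2136104, whereas .dt.seconds reports 23895 (the value that
# shows up in the pivot above).
df['time_spent_s'] = df['time_spent'].dt.total_seconds()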

Python selecting row from second dataframe based on complex criteria

I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.
df1:
purchased_at product_id cost
01-01-2017 1 £10
01-01-2017 2 £8
09-01-2017 1 £10
18-01-2017 3 £12
df2:
week_no week_start week_end
1 31-12-2016 06-01-2017
2 07-01-2017 13-01-2017
3 14-01-2017 20-01-2017
I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.
df1:
purchased_at product_id cost week_no
01-01-2017 1 £10 1
01-01-2017 2 £8 1
09-01-2017 1 £10 2
18-01-2017 3 £12 3
I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.
df1.loc[(df1['purchased_at'] < df2['week_end']) &
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]
was unsuccessful, failing with the ValueError 'can only compare identically-labeled Series objects'.
Could anyone help with this problem? I'm also open to suggestions if there is a better way to achieve the same outcome.
edit to add further detail of df1
df1 full dataframe headers
purchased_at purchase_id product_id product_name transaction_id account_number cost
01-01-2017 1 1 A 1 AA001 £10
01-01-2017 2 2 B 1 AA001 £8
02-01-2017 3 1 A 2 AA008 £10
03-01-2017 4 3 C 3 AB040 £12
...
09-01-2017 12 1 A 10 AB102 £10
09-01-2017 13 2 B 11 AB102 £8
...
18-01-2017 20 3 C 15 AA001 £12
So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.
If your dataframes are not too big, you can use this trick.
Do a full Cartesian product join of all records to all records:
df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')
Next, filter out the records that do not match the criteria, in this case where purchased_at is not between week_start and week_end:
(df_out.query('week_start < purchased_at < week_end')
       .drop(['key', 'week_start', 'week_end'], axis=1))
Output:
   purchased_at product_id cost  week_no
0    2017-01-01          1  £10        1
3    2017-01-01          2   £8        1
7    2017-01-09          1  £10        2
11   2017-01-18          3  £12        3
If you do have large dataframes, then you can use this numpy method, as proposed by piRSquared.
import numpy as np

a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start', 'week_end'], axis=1)
Output:
          purchased_at product_id cost week_no
0  2017-01-01 00:00:00          1  £10       1
1  2017-01-01 00:00:00          2   £8       1
2  2017-01-09 00:00:00          1  £10       2
3  2017-01-18 00:00:00          3  £12       3
You could just use strftime() to extract the week number from the date. If you want to keep counting the weeks upwards, you need to define a "zero year" as the start of your time series and offset the week_no accordingly:
import pandas as pd

data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'],
        'product_id': [1, 2, 1, 3],
        'cost': ['£10', '£8', '£10', '£12']}
df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])

def getWeekNo(date, year0):
    datetime = pd.to_datetime(date, dayfirst=True)
    year = int(datetime.strftime('%Y'))
    weekNo = int(datetime.strftime('%U'))
    return weekNo + 52*(year - year0)

df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))
Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
This way, you don't need to maintain a separate DataFrame only for week numbers.
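For completeness, a hedged sketch of one more option: pd.merge_asof does this kind of interval lookup directly, assuming both frames are sorted on the join keys; the week_end guard line is my addition for the case where df2 has gaps between weeks:

import pandas as pd

df1['purchased_at'] = pd.to_datetime(df1['purchased_at'], dayfirst=True)
df2[['week_start', 'week_end']] = df2[['week_start', 'week_end']].apply(
    pd.to_datetime, dayfirst=True)

# For each purchase, pick the latest week whose week_start is <= purchased_at.
out = pd.merge_asof(df1.sort_values('purchased_at'),
                    df2.sort_values('week_start'),
                    left_on='purchased_at', right_on='week_start',
                    direction='backward')

# Guard against purchases that fall after a week's end (gaps in df2).
out = out[out['purchased_at'] <= out['week_end']]
out = out.drop(columns=['week_start', 'week_end'])
print(out)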
