I have two dataframes. The first one looks like this:
id  orderdate   orderid  amount  camp1  camp2  camp3
1   2020-01-01  100      100     1      0      0
2   2020-02-01  120      200     1      0      1
3   2019-12-01  130      500     0      1      0
4   2019-11-01  150      750     0      1      0
5   2020-01-01  160      1000    1      1      1
The camp1, camp2, camp3 columns show whether the customer attended the corresponding campaign,
and the campaigns have a period dictionary such that
from datetime import datetime

camp_periods = {
    'camp1': [datetime.strptime('2019-04-08', '%Y-%m-%d'), datetime.strptime('2019-06-06', '%Y-%m-%d')],
    'camp2': [datetime.strptime('2019-09-15', '%Y-%m-%d'), datetime.strptime('2019-09-28', '%Y-%m-%d')],
    'camp3': [datetime.strptime('2019-11-15', '%Y-%m-%d'), datetime.strptime('2019-12-28', '%Y-%m-%d')],
}
I would like to create a table giving the number of orders and the total order amount per customer, counting an order only if its orderdate falls within a campaign's period from the camp_periods dictionary and the customer attended that campaign.
I'm not sure I understood your question completely. I'm guessing that by "the number of orders and total of order amounts" you mean the first n orders whose cumulative amount stays at or under a given total. Here is my approach:
Imports and parameters:
from operator import or_
from functools import reduce
number_orders = 2
total_order_amounts = 3000
Filter the rows where the orderdate is between the campaign periods in the camp_periods dictionary and the customer attended that campaign:
# one boolean mask per campaign: customer attended it and ordered inside its window
cond = [(df[k].astype('bool') & df['orderdate'].between(*v)) for k, v in camp_periods.items()]
# keep a row if it matches any campaign
cond = reduce(or_, cond)
df_cond = df[cond]
# first `number_orders` rows whose running amount stays within the limit
df_final = df_cond[df_cond['amount'].cumsum() <= total_order_amounts].head(number_orders)
df_final
output:
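For the original ask (the number of orders and the total order amount per customer and campaign), here is a minimal sketch along the same lines, assuming df['orderdate'] is already a datetime column and df has the columns shown at the top (the names n_orders, total_amount, frames and per_customer_per_campaign are mine):

import pandas as pd

frames = []
for camp, (start, end) in camp_periods.items():
    # rows where the customer attended this campaign and ordered inside its window
    mask = df[camp].astype(bool) & df['orderdate'].between(start, end)
    agg = (df[mask]
           .groupby('id')
           .agg(n_orders=('orderid', 'count'), total_amount=('amount', 'sum'))
           .assign(campaign=camp))
    frames.append(agg)

per_customer_per_campaign = pd.concat(frames).reset_index()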
Related
I want to compute the sum for each 'Group' that has at least one 'Customer' with an 'Active' Bail.
Sample Input :
Customer ID Group Bail Amount
0 23453 NAFNAF Active 200
1 23849 LINDT Active 350
2 23847 NAFNAF Inactive 100
3 84759 CARROUF Inactive 20
For example 'NAFNAF' has 2 customers, including one with an active bail.
Output expected :
NAFNAF : 300
LINDT : 350
TOTAL ACTIVE: 650
I don't want to change the original dataframe.
You can use:
(df.assign(Bail=df.Bail.eq('Active'))        # Bail becomes True/False; the original df is untouched
   .groupby('Group')[['Bail', 'Amount']].agg('sum')
   .loc[lambda d: d['Bail'].ge(1), ['Amount']]  # keep groups with at least one active Bail
)
output:
Amount
Group
LINDT 350
NAFNAF 300
Full output with total:
df2 = (
    df.assign(Bail=df.Bail.eq('Active'))
      .groupby('Group')[['Bail', 'Amount']].agg('sum')
      .loc[lambda d: d['Bail'].ge(1), ['Amount']]
)
df2 = pd.concat([df2, df2.sum().to_frame('TOTAL').T])
output:
Amount
LINDT 350
NAFNAF 300
TOTAL 650
Create a boolean mask of Group with at least one active lease:
m = df['Group'].isin(df.loc[df['Bail'].eq('Active'), 'Group'])
out = df[m]
At this point, your filtered dataframe looks like:
>>> out
Customer ID Group Bail Amount
0 23453 NAFNAF Active 200
1 23849 LINDT Active 350
2 23847 NAFNAF Inactive 100
Now you can use groupby and sum:
out = df[m].groupby('Group')['Amount'].sum()
out = pd.concat([out, pd.Series(out.sum(), index=['TOTAL ACTIVE'])])
# Output
LINDT 350
NAFNAF 300
TOTAL ACTIVE 650
dtype: int64
I have the following dataframe:
print(dd)
dt_op quantity product_code
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
I am trying to get the sales over the next "n" days, but the following code does not break them down by product_code:
dd["Final_Quantity"] = [dd.loc[dd['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in dd['dt_op']]
I would like to define dd["Final_Quantity"] as the sum of dd["quantity"] sold in the next "n" days, computed separately for every product in stock; ultimately, for every combination of dt_op and product_code.
print(final_dd)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...
Regardless of how you want to present the output, you can try the following code to get the total sales for every product over every n days, say every 7 days:
dd.groupby([pd.Grouper(key='dt_op', freq='7D'), 'product_code']).sum()['quantity']
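A rough usage sketch (assuming dd is the frame from the question, with dt_op given day-first as strings; weekly is my own name):

import pandas as pd

# pd.Grouper needs a datetime column, so parse dt_op first
dd['dt_op'] = pd.to_datetime(dd['dt_op'], dayfirst=True)

weekly = (dd.groupby([pd.Grouper(key='dt_op', freq='7D'), 'product_code'])['quantity']
            .sum()
            .reset_index())

Note that this bins the data into fixed 7-day buckets starting at the first date, which is not quite the same as a rolling "next n days" window per row.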
I have 3 dataframes, already sorted by date and p_id, with no null values:
First DataFrame
df1 = pd.DataFrame([['2018-07-05',8.0,1],
['2018-07-15',1.0,1],
['2018-08-05',2.0,1],
['2018-08-05',2.0,2]],
columns=["purchase_date", "qty", "p_id"])
Second DataFrame
df2 = pd.DataFrame([['2018-07-15',2.0,1],
['2018-08-04',7.0,1],
['2018-08-15',1.0,2]],
columns=["sell_date", "qty", "p_id"])
Third DataFrame
df3 = pd.DataFrame([['2018-07-25',1.0,1],
['2018-08-15',1.0,1]],
columns=["expired_date", "qty", "p_id"])
The dataframes look like:
1st: (Holds Purchase details)
purchase_date qty p_id
0 2018-07-05 8.0 1
1 2018-07-15 1.0 1
2 2018-08-05 2.0 1
3 2018-08-05 2.0 2
2nd: (Holds Sales Details)
sell_date qty p_id
0 2018-07-15 2.0 1
1 2018-08-04 7.0 1
2 2018-08-15 1.0 2
3rd: (Holds Expiry Details)
expired_date qty p_id
0 2018-07-25 1.0 1
1 2018-08-15 1.0 1
Now what I want to do is find when each expired product was bought, following FIFO (the product purchased first expires first).
Explanation: Consider product with id 1
By date 2018-07-15
We had 8+1 purchased quantity and -2 sold quantity, i.e. a total of 8+1-2 = 7 in stock (the -ve sign signifies a quantity deduction).
By date 2018-07-25
1 quantity expired, so the first entry of our new when_product_expired dataframe will be:
purchase_date expired_date p_id
2018-07-05 2018-07-25 1
And then, for the next expiry entry:
By date 2018-08-04
7 quantity were sold, so the current quantity will be 8+1-2-7 = 0
By date 2018-08-05
2 quantity were bought, so the current quantity is 0+2
By date 2018-08-15
1 quantity expired
So a new and final entry will be:
purchase_date expired_date p_id
2018-07-05 2018-07-25 1
2018-08-05 2018-08-15 1
This time the product that expired was one that was purchased on 2018-08-05.
Actually I have full datetimes, so you may assume purchase and sell times will never be equal; also, before any sale or expiry there will always be some quantity of the product in stock, i.e. the data is consistent.
And Thank you in advance :-)
Updated
What I am thinking now is to rename all the date fields to the same name and concatenate the purchase, sell and expired dataframes with the sell/expired quantities negated, but that alone won't solve it:
df2.qty = df2.qty*-1
df3.qty = df3.qty*-1
new = pd.concat([df1, df2, df3], sort=False) \
    .sort_values(by=["purchase_date"], ascending=True) \
    .reset_index(drop=True)
What you essentially want is this FIFO list of items in stock. In my experience pandas is not the right tool for relating different rows to each other. The workflow should be split-apply-combine. If you split it and don't really see a way to puzzle it back together, it may be an ill-formulated problem. You can still get a lot done with groupby, but this is something I would not try to solve with some clever trick in pandas. Even if you make it work, it will be hell to maintain.
I don't know how performance-critical your problem is (i.e. how large your DataFrames are). If it's just a few tens of thousands of entries, you can explicitly loop over the pandas rows (warning: this is slow) and build the FIFO list by hand.
I hacked together some code for this. The DataFrame you proposed is in there. I loop over all rows and keep track of how many items are in stock. This is done in a queue q which contains one element per item, and that element conveniently is the purchase_date.
import queue
import pandas as pd
from pandas import Series, DataFrame
# modified (see text)
df1 = pd.DataFrame([['2018-07-05',8.0,1],
['2018-07-15',3.0,1],
['2018-08-05',2.0,1],
['2018-08-05',2.0,2]],
columns=["purchase_date", "qty", "p_id"])
df2 = pd.DataFrame([['2018-07-15',2.0,1],
['2018-08-04',7.0,1],
['2018-08-15',1.0,2]],
columns=["sell_date", "qty", "p_id"])
df3 = pd.DataFrame([['2018-07-25',1.0,1],
['2018-08-15',1.0,1]],
columns=["expired_date", "qty", "p_id"])
df1 = df1.rename(columns={'purchase_date':'date'})
df2 = df2.rename(columns={'sell_date':'date'})
df3 = df3.rename(columns={'expired_date' : 'date'})
df3['qty'] *= -1
df2['qty'] *= -1
df = pd.concat([df1,df2])\
.sort_values(by=["date"],ascending=True)\
.reset_index(drop=True)
# Necessary to distinguish between sold and expired items while looping
df['expired'] = False
df3['expired'] = True
df = pd.concat([df,df3])\
.sort_values(by=["date"],ascending=True)\
.reset_index(drop=True)
#date qty p_id expired
#7-05 8.0 1 False
#7-15 1.0 1 False
#7-15 -2.0 1 False
#7-25 -1.0 1 True
#8-04 -7.0 1 False
#8-05 2.0 1 False
#8-05 2.0 2 False
#8-15 -1.0 2 False
#8-15 -1.0 1 True
# Iteratively build up when_product_expired
when_product_expired = []
# p_id hardcoded here
p_id = 1
# q contains purchase dates for all individual items 'currently' in stock
q = queue.Queue()
for index, row in df[df['p_id'] == p_id].iterrows():
    # if items are bought, put as many as 'qty' into q
    if row['qty'] > 0:
        for tmp in range(int(round(row['qty']))):
            date = row['date']
            q.put(date)
    # if items are sold or expired, remove as many from q.
    # if expired, additionally save purchase and expiration date into when_product_expired
    elif row['qty'] < 0:
        for tmp in range(int(round(-row['qty']))):
            purchase_date = q.get()
            if row['expired']:
                print('item p_id 1 was bought on', purchase_date)
                when_product_expired.append([purchase_date, row['date'], p_id])

when_product_expired = DataFrame(when_product_expired,
                                 columns=['purchase_date', 'expired_date', 'p_id'])
A few remarks:
I relied on your guarantee that
before any sale or expiry, there will always be some quantity of the product in stock
This is not given for your example DataFrames: by the sale on 2018-08-04, only 8+1 = 9 items of p_id 1 have been bought, while 2 have been sold and 1 has expired, leaving 6 in stock rather than the 7 that sale needs. I modified df1 so that 11 pieces are bought by then.
If this assumption is violated, the queue tries to get an item that is not there; with the default blocking get() that hangs forever (on my machine it looks like an endless loop). You might want to use get_nowait() and catch the queue.Empty exception, as the sketch below does.
The queue is by no means efficiently implemented. If many items are in stock, there will be a lot of data duplication.
You can generalize this to more p_ids by either putting everything into a function and using .groupby('p_id').apply(function), or by looping over df['p_id'].unique(); a sketch of the latter follows below.
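Here is a hedged sketch of that generalisation (the wrapper name match_expiries is mine; it assumes the combined df built above and uses get_nowait() so that inconsistent data fails loudly instead of hanging):

import queue
import pandas as pd

def match_expiries(group, p_id):
    # FIFO bookkeeping for a single p_id, same logic as the loop above
    q = queue.Queue()
    rows = []
    for _, row in group.iterrows():
        if row['qty'] > 0:
            # purchase: enqueue one purchase_date per item bought
            for _ in range(int(round(row['qty']))):
                q.put(row['date'])
        elif row['qty'] < 0:
            # sale or expiry: dequeue the oldest items
            for _ in range(int(round(-row['qty']))):
                try:
                    purchase_date = q.get_nowait()
                except queue.Empty:
                    raise ValueError(f"no stock for p_id {p_id} on {row['date']}")
                if row['expired']:
                    rows.append([purchase_date, row['date'], p_id])
    return rows

records = []
for p_id in df['p_id'].unique():
    records.extend(match_expiries(df[df['p_id'] == p_id], p_id))

when_product_expired = pd.DataFrame(
    records, columns=['purchase_date', 'expired_date', 'p_id'])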
So while this is not a scalable solution, I hope it helps you a bit. Good luck.
Currently, I have a dataframe with an index hierarchy for monthly cohorts. Here is how I grouped them:
grouped = dfsort.groupby(['Cohort','Lifetime_Revenue'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
Which outputs:
Cohort Lifetime_Revenue Customer_ID
2014-01 149.9 1
2014-02 299.9 1
2014-03 269.91 1
329.89 1
899.88 1
2014-04 299.9 1
674.91 2
2014-05 899.88 1
2014-06 824.89 1
And so on.
I was looking to get the total sum of the Lifetime_Revenue for each cohort, as well as the total number of users per cohort.
Basically, I want to turn it into a regular, flat table.
Anyone got any thoughts on this?
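A hedged sketch of one way to get those totals (assuming dfsort is the ungrouped frame used to build cohorts, with the Cohort, Lifetime_Revenue and Customer_ID columns used above; summary is my own name) is to aggregate straight from the original data rather than from the grouped table:

import pandas as pd

summary = dfsort.groupby('Cohort').agg(
    total_revenue=('Lifetime_Revenue', 'sum'),     # total lifetime revenue per cohort
    total_customers=('Customer_ID', 'nunique'),    # distinct customers per cohort
).reset_index()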
I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.
df1:
purchased_at product_id cost
01-01-2017 1 £10
01-01-2017 2 £8
09-01-2017 1 £10
18-01-2017 3 £12
df2:
week_no week_start week_end
1 31-12-2016 06-01-2017
2 07-01-2017 13-01-2017
3 14-01-2017 20-01-2017
I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.
df1:
purchased_at product_id cost week_no
01-01-2017 1 £10 1
01-01-2017 2 £8 1
09-01-2017 1 £10 2
18-01-2017 3 £12 3
I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.
df1.loc[(df1['purchased_at'] < df2['week_end']) &
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]
was unsuccessful, with the ValueError 'can only compare identically-labeled Series objects'
Could anyone help with this problem, or I'm open to suggestions if there is a better way to achieve the same outcome.
edit to add further detail of df1
df1 full dataframe headers
purchased_at purchase_id product_id product_name transaction_id account_number cost
01-01-2017 1 1 A 1 AA001 £10
01-01-2017 2 2 B 1 AA001 £8
02-01-2017 3 1 A 2 AA008 £10
03-01-2017 4 3 C 3 AB040 £12
...
09-01-2017 12 1 A 10 AB102 £10
09-01-2017 13 2 B 11 AB102 £8
...
18-01-2017 20 3 C 15 AA001 £12
So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.
If your dataframes are not too big, you can use this trick.
Do a full Cartesian product join of all records to all records:
df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')
Next, filter out the records that do not match the criteria, in this case where purchased_at is not between week_start and week_end:
(df_out.query('week_start < purchased_at < week_end')
.drop(['key','week_start','week_end'], axis=1))
Output:
purchased_at product_id cost week_no
0 2017-01-01 1 £10 1
3 2017-01-01 2 £8 1
7 2017-01-09 1 £10 2
11 2017-01-18 3 £12 3
If you do have large dataframes then you can use this numpy method as proposed by PiRSquared.
import numpy as np

a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start', 'week_end'], axis=1)
Output:
purchased_at product_id cost week_no
0 2017-01-01 00:00:00 1 £10 1
1 2017-01-01 00:00:00 2 £8 1
2 2017-01-09 00:00:00 1 £10 2
3 2017-01-18 00:00:00 3 £12 3
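A further option worth noting (not part of either method above; a sketch assuming purchased_at and week_start are datetime columns and the weeks in df2 are contiguous and non-overlapping) is pd.merge_asof, which attaches to each purchase the latest week_start that is not after it:

import pandas as pd

out = pd.merge_asof(
    df1.sort_values('purchased_at'),
    df2.sort_values('week_start'),
    left_on='purchased_at',
    right_on='week_start',
    direction='backward',          # match the most recent week_start <= purchased_at
).drop(columns=['week_start', 'week_end'])

Unlike the filters above, this does not check week_end, so it only gives the same result when the weekly calendar has no gaps.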
You could just use strftime() to extract the week number from the date. If you want to keep counting the weeks upwards across years, you need to define a "zero year" as the start of your time series and offset week_no accordingly:
import pandas as pd
data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], 'product_id': [1,2,1,3], 'cost':['£10', '£8', '£10', '£12']}
df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])
def getWeekNo(date, year0):
    datetime = pd.to_datetime(date, dayfirst=True)
    year = int(datetime.strftime('%Y'))
    weekNo = int(datetime.strftime('%U'))
    return weekNo + 52*(year-year0)
df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))
Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
This way, you don't need to maintain a separate DataFrame just for week numbers.
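As a side note (assuming a reasonably recent pandas, 1.1 or later), the week number can also come straight from pandas via Series.dt.isocalendar(), which uses ISO weeks (Monday-based, so it differs slightly from '%U'):

import pandas as pd

dates = pd.to_datetime(df['purchased_at'], dayfirst=True)
iso = dates.dt.isocalendar()                              # columns: year, week, day
df['week_no'] = iso['week'] + 52 * (iso['year'] - 2017)   # same 52-week offset idea as above

The 52-week offset mirrors the approximation in the answer above; ISO years occasionally have 53 weeks.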