Next "n" Days Sales for every product - python

I have the following dataframe:
print(dd)
dt_op quantity product_code
20/01/18 1 613
21/01/18 8 611
21/01/18 1 613
...
I am trying to get the sales over the next "n" days for each row of the dataframe, but the following code does not also compute them per product_code:
dd["Final_Quantity"] = [dd.loc[dd['dt_op'].between(d, d + pd.Timedelta(days = 7)), 'quantity'].sum() \
for d in dd['dt_op']]
I would like to define dd["Final_Quantity"] as the sum of dd["quantity"] sold in the next "n" days, for every different product in stock;
ultimately, for every combination of dt_op and product_code.
print(final_dd)
n = 7
dt_op quantity product_code Final_Quantity
20/01/18 1 613 2
21/01/18 8 611 8
25/01/18 1 613 1
...

Regardless of how you want to present the output, you can try the following code to get the total sales for every product over every n days, say every 7 days:
dd.groupby([pd.Grouper(key='dt_op', freq='7D'), 'product_code']).sum()['quantity']
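If the goal is instead a rolling "next n days" total per product (rather than fixed 7-day bins), here is a minimal sketch that applies the original between-based idea within each product_code group. It assumes dt_op is in day-first format and can be parsed to datetime; everything else uses the columns from the question.
import pandas as pd

n = 7
# assumption: dates like 20/01/18 are day/month/two-digit-year
dd['dt_op'] = pd.to_datetime(dd['dt_op'], format='%d/%m/%y')

def next_n_days_sum(g, n=7):
    # for each row, sum this product's quantity from that date to n days later
    return pd.Series(
        [g.loc[g['dt_op'].between(d, d + pd.Timedelta(days=n)), 'quantity'].sum()
         for d in g['dt_op']],
        index=g.index,
    )

dd['Final_Quantity'] = (
    dd.groupby('product_code', group_keys=False)
      .apply(next_n_days_sum, n=n)
)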


pandas calculated field by comparing other fields and dictionary values

I have two dataframes. The first one looks like this:
id  orderdate   orderid  amount  camp1  camp2  camp3
1   2020-01-01  100      100     1      0      0
2   2020-02-01  120      200     1      0      1
3   2019-12-01  130      500     0      1      0
4   2019-11-01  150      750     0      1      0
5   2020-01-01  160      1000    1      1      1
The camp1, camp2 and camp3 columns show whether the customer attended the corresponding campaign,
and the campaigns have a period dictionary such that
camp_periods = {'camp1':
[datetime.strptime('2019-04-08', '%Y-%m-%d'), datetime.strptime('2019-06-06', '%Y-%m-%d')],
'camp2':
[datetime.strptime('2019-09-15', '%Y-%m-%d'), datetime.strptime('2019-09-28', '%Y-%m-%d')],
'camp3':
[datetime.strptime('2019-11-15', '%Y-%m-%d'), datetime.strptime('2019-12-28', '%Y-%m-%d')]
}
I would like to create a table giving the number of orders and the total order amount per customer, where the orderdate falls within the campaign periods in the camp_periods dictionary and the customer attended that campaign.
I'm not sure I understood your question completely; I guess that by "the number of orders and total of order amounts" you mean the first n orders whose amounts stay under or equal to a given total. Here is my approach:
data example:
from operator import or_
from functools import reduce

number_orders = 2
total_order_amounts = 3000

# condition: orderdate is between the campaign periods in the camp_periods
# dictionary and the customer attended that campaign
cond = [(df[k].astype('bool') & df['orderdate'].between(*v)) for k, v in camp_periods.items()]
cond = reduce(or_, cond)
df_cond = df[cond]
df_final = df_cond[df_cond['amount'].cumsum() <= total_order_amounts].head(number_orders)
df_final
output:
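For the aggregation the question originally asks about (number of orders and total order amount per customer within the campaign windows they attended), a minimal sketch could look like the following. It assumes df holds the first table with orderdate already parsed as datetime and camp_periods as defined above; the column names id, orderdate, orderid and amount come from the sample data, and the per-campaign breakdown is my own reading of the question.
import pandas as pd

rows = []
for camp, (start, end) in camp_periods.items():
    # orders where the customer attended this campaign and ordered within its window
    mask = df[camp].astype(bool) & df['orderdate'].between(start, end)
    sub = (df.loc[mask]
             .groupby('id')
             .agg(n_orders=('orderid', 'count'),
                  total_amount=('amount', 'sum'))
             .assign(campaign=camp))
    rows.append(sub)

summary = pd.concat(rows).reset_index()
print(summary)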

How to follow changes in a dataframe, but only in one direction

I am trying to simulate a trailing stop, used in trading.
Some data:
(input) (output)
price peg
1000 0995 set a price - 5
1001 0996 following price up
1002 0997 following price up
1001 0997 not following price down
1010 1005 following price up
1012 1007 following price up
1010 1007 not following price down
1006 STOP the price went below the last peg
The logic is the following:
I start by setting a peg at -5, so it takes price - 5 and makes 995.
Each time the price goes up, the peg follows up, always keeping a -5 gap
If the price goes down, the peg does NOT go down
If the price goes below or equal to the peg, I need to know the index, and the process RESTARTS
Is there a Pandas idiomatic way to do this? I've implemented it as a loop, but it is very slow.
This is some code I've done for the loop:
# i is the index at which we take a trade in
# and I want to go through the rest of the dataframe to see if it would
# hit a trailing stop
if direction == +1:  # only long trades in this example
    peg_price = entry_price - 5
    for j in range(i + 1, len(df)):
        low = df['low'][j]
        if low <= peg_price:
            date = df['date'][i]
            trade_date.append(df['date'][i])
            trade_exit_date.append(df['date'][j])
            trade_price.append(entry_price)
            trade_exit.append(peg_price)
            trade_profit.append(peg_price - entry_price)
            skip_to = j + 1
            break
        else:
            high = df['high'][j]
            peg_price = max(high - 5, peg_price)
The example is a bit more complex because I need to compare the peg with the 'low' price but update it with the 'high' price; but the idea is there.
IIUC:
data = {"price":[1000,1001,1002,1001,1010,1012,1010,1006]}
df = pd.DataFrame(data)
# first make a column of price-5
df["peg"] = df["price"]-5
# use np.where to check whether price dropped or increased
df["peg"] = np.where(df["price"].shift()>df["price"],df["peg"].shift(),df["peg"])
print (df)
price peg
0 1000 995.0
1 1001 996.0
2 1002 997.0
3 1001 997.0
4 1010 1005.0
5 1012 1007.0
6 1010 1007.0
7 1006 1005.0
# Get the index of STOP
print (df[df["peg"].shift()>df["peg"]])
price peg
7 1006 1005.0
Here is one way.
The idea is to turn each of your logical conditions into a True/False boolean mask; we can then step through the assignments and apply them one by one. Once that is done, we can find the rows where the peg is greater than the price and assign STOP; if you have data that you need to set to NA after that point, you can easily do a logical .loc and assign any values after the stop to NA.
For this example I've used your peg column, named counter, so we can compare.
import pandas as pd
peg1 = df['price'].sub(df['price'].shift(1)) == 1 # rolling cumcount
peg2 = df['price'].sub(df['price'].shift(1)) > 1 # -5 these vals
peg3 = df['price'].sub(df['price'].shift(1)) <= -1 # keep as row above.
#assignments
df.loc[peg1,'counter'] = df['counter'].ffill() + peg1.cumsum()
df.loc[peg2,'counter'] = df['price'] - 5
df.loc[peg3,'counter'] = df['counter'].ffill()
df.loc[df['counter'] > df['price'], 'counter'] = 'STOP'
print(df)
price peg counter
0 1000 0995 995
1 1001 0996 996
2 1002 0997 997
3 1001 0997 997
4 1010 1005 1005
5 1012 1007 1007
6 1010 1007 1007
7 1006 STOP STOP
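For the restart-free part of the logic, a cummax-based sketch may be closer to idiomatic pandas: a peg that only ever moves up is just the running maximum of price - 5, and the stop is the first row where the price falls to or below the previous peg. This is only a sketch on the simplified price-only data above; handling the RESTART (and the separate low/high columns from the question) would still need a loop over the segments between stops.
import pandas as pd

data = {"price": [1000, 1001, 1002, 1001, 1010, 1012, 1010, 1006]}
df = pd.DataFrame(data)

# the peg only ever moves up: running maximum of price - 5
df["peg"] = (df["price"] - 5).cummax()

# stop at the first row where the price is at or below the previous peg
stops = df.index[df["price"] <= df["peg"].shift()]
first_stop = stops[0] if len(stops) else None
print(first_stop)  # 7 for this data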

Calculating total monthly cumulative number of Order

I need to find the total monthly cumulative number of orders. I have 2 columns, OrderDate and OrderId. I can't use a list to find the cumulative numbers since the data is so large, and the result should be in year_month format along with the cumulative order total for each month.
orderDate OrderId
2011-11-18 06:41:16 23
2011-11-18 04:41:16 2
2011-12-18 06:41:16 69
2012-03-12 07:32:15 235
2012-03-12 08:32:15 234
2012-03-12 09:32:15 235
2012-05-12 07:32:15 233
desired Result
Date CumulativeOrder
2011-11 2
2011-12 3
2012-03 6
2012-05 7
I have imported my Excel file into PyCharm and used pandas to read it.
I have tried splitting the datetime column into year and month and then grouping, but I am not getting the correct result.
df1 = df1[['OrderId','orderDate']]
df1['year'] = pd.DatetimeIndex(df1['orderDate']).year
df1['month'] = pd.DatetimeIndex(df1['orderDate']).month
df1.groupby(['year','month']).sum().groupby('year','month').cumsum()
print (df1)
Convert the column to datetimes, then to monthly periods with to_period, add a new counter column with numpy.arange, and finally remove duplicates, keeping the last row per Date, with DataFrame.drop_duplicates:
import pandas as pd
import numpy as np
df1['orderDate'] = pd.to_datetime(df1['orderDate'])
df1['Date'] = df1['orderDate'].dt.to_period('m')
#use if not sorted datetimes
#df1 = df1.sort_values('Date')
df1['CumulativeOrder'] = np.arange(1, len(df1) + 1)
print (df1)
orderDate OrderId Date CumulativeOrder
0 2011-11-18 06:41:16 23 2011-11 1
1 2011-11-18 04:41:16 2 2011-11 2
2 2011-12-18 06:41:16 69 2011-12 3
3 2012-03-12 07:32:15 235 2012-03 4
df2 = df1.drop_duplicates('Date', keep='last')[['Date','CumulativeOrder']]
print (df2)
Date CumulativeOrder
1 2011-11 2
2 2011-12 3
3 2012-03 4
Another solution:
df2 = (df1.groupby(df1['orderDate'].dt.to_period('m')).size()
.cumsum()
.rename_axis('Date')
.reset_index(name='CumulativeOrder'))
print (df2)
Date CumulativeOrder
0 2011-11 2
1 2011-12 3
2 2012-03 6
3 2012-05 7

Aggregation within an index hierarchy

Currently, I have a dataframe with an index hierarchy for monthly cohorts. Here is how I grouped them:
grouped = dfsort.groupby(['Cohort','Lifetime_Revenue'])
cohorts = grouped.agg({'Customer_ID': pd.Series.nunique})
Which outputs:
Cohort Lifetime_Revenue Customer_ID
2014-01 149.9 1
2014-02 299.9 1
2014-03 269.91 1
329.89 1
899.88 1
2014-04 299.9 1
674.91 2
2014-05 899.88 1
2014-06 824.89 1
And so on.
I was looking to get the total sum of the Lifetime_Revenue for each cohort, as well as the total number of users in each cohort.
Basically, I want to turn it back into a regular dataframe.
Anyone got any thoughts on this?
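One possible sketch, assuming dfsort (the frame grouped above) has one row per customer with that customer's Lifetime_Revenue: aggregate straight from dfsort instead of the MultiIndex result, summing revenue and counting distinct customers per cohort.
import pandas as pd

# flat table: one row per cohort, with total revenue and distinct customers
cohort_summary = (
    dfsort.groupby('Cohort')
          .agg(Total_Revenue=('Lifetime_Revenue', 'sum'),
               Total_Customers=('Customer_ID', 'nunique'))
          .reset_index()
)
print(cohort_summary)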

Python selecting row from second dataframe based on complex criteria

I have two dataframes, one with some purchasing data, and one with a weekly calendar, e.g.
df1:
purchased_at product_id cost
01-01-2017 1 £10
01-01-2017 2 £8
09-01-2017 1 £10
18-01-2017 3 £12
df2:
week_no week_start week_end
1 31-12-2016 06-01-2017
2 07-01-2017 13-01-2017
3 14-01-2017 20-01-2017
I want to use data from the two to add a 'week_no' column to df1, which is selected from df2 based on where the 'purchased_at' date in df1 falls between the 'week_start' and 'week_end' dates in df2, i.e.
df1:
purchased_at product_id cost week_no
01-01-2017 1 £10 1
01-01-2017 2 £8 1
09-01-2017 1 £10 2
18-01-2017 3 £12 3
I've searched but I've not been able to find an example where the data is being pulled from a second dataframe using comparisons between the two, and I've been unable to correctly apply any examples I've found, e.g.
df1.loc[(df1['purchased_at'] < df2['week_end']) &
        (df1['purchased_at'] > df2['week_start']), df2['week_no']]
was unsuccessful, with the ValueError 'can only compare identically-labeled Series objects'
Could anyone help with this problem? I'm open to suggestions if there is a better way to achieve the same outcome.
edit to add further detail of df1
df1 full dataframe headers
purchased_at purchase_id product_id product_name transaction_id account_number cost
01-01-2017 1 1 A 1 AA001 £10
01-01-2017 2 2 B 1 AA001 £8
02-01-2017 3 1 A 2 AA008 £10
03-01-2017 4 3 C 3 AB040 £12
...
09-01-2017 12 1 A 10 AB102 £10
09-01-2017 13 2 B 11 AB102 £8
...
18-01-2017 20 3 C 15 AA001 £12
So the purchase_id increases incrementally with each row, the product_id and product_name have a 1:1 relationship, the transaction_id also increases incrementally, but there can be multiple purchases within a transaction.
If your dataframes are not too big, you can use this trick.
Do a full cartesian product join of all records to all records:
df_out = pd.merge(df1.assign(key=1),df2.assign(key=1),on='key')
Next, filter out the records that do not match the criteria, in this case where purchased_at is not between week_start and week_end:
(df_out.query('week_start < purchased_at < week_end')
.drop(['key','week_start','week_end'], axis=1))
Output:
purchased_at product_id cost week_no
0 2017-01-01 1 £10 1
3 2017-01-01 2 £8 1
7 2017-01-09 1 £10 2
11 2017-01-18 3 £12 3
If you do have large dataframes then you can use this numpy method as proposed by PiRSquared.
import numpy as np
import pandas as pd

a = df1.purchased_at.values
bh = df2.week_end.values
bl = df2.week_start.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.DataFrame(
    np.column_stack([df1.values[i], df2.values[j]]),
    columns=df1.columns.append(df2.columns)
).drop(['week_start','week_end'], axis=1)
Output:
purchased_at product_id cost week_no
0 2017-01-01 00:00:00 1 £10 1
1 2017-01-01 00:00:00 2 £8 1
2 2017-01-09 00:00:00 1 £10 2
3 2017-01-18 00:00:00 3 £12 3
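Another possible sketch, assuming the weeks in df2 are contiguous and both frames are sorted by their date columns, is to match each purchase to the most recent week_start with pd.merge_asof:
import pandas as pd

# assumes purchased_at, week_start and week_end are already datetimes
df1 = df1.sort_values('purchased_at')
df2 = df2.sort_values('week_start')

# each purchase is matched to the last week whose week_start <= purchased_at
out = pd.merge_asof(df1, df2, left_on='purchased_at', right_on='week_start')
out = out.drop(['week_start', 'week_end'], axis=1)
print(out)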
You could just use strftime() to extract the week number from the date. If you want to keep counting the weeks upwards across years, you need to define a "zero year" as the start of your time series and offset the week_no accordingly:
import pandas as pd
data = {'purchased_at': ['01-01-2017', '01-01-2017', '09-01-2017', '18-01-2017'], 'product_id': [1,2,1,3], 'cost':['£10', '£8', '£10', '£12']}
df = pd.DataFrame(data, columns=['purchased_at', 'product_id', 'cost'])
def getWeekNo(date, year0):
    datetime = pd.to_datetime(date, dayfirst=True)
    year = int(datetime.strftime('%Y'))
    weekNo = int(datetime.strftime('%U'))
    return weekNo + 52*(year - year0)
df['week_no'] = df.purchased_at.apply(lambda x: getWeekNo(x, 2017))
Here, I use pd.to_datetime() to convert the date string from df into a datetime object. strftime('%Y') returns the year and strftime('%U') the week (with the first week of a year starting on its first Sunday; if weeks should start on Monday, use '%W' instead).
This way, you don't need to maintain a separate DataFrame only for week numbers.
