Pandas dataframe groupby using cyclical data - python

I have some pricing data that looks like this:
import pandas as pd
df = pd.DataFrame([['A', '1', '2015-02-01', 20.00, 20.00, 5],
                   ['A', '1', '2015-02-06', 16.00, 20.00, 8],
                   ['A', '1', '2015-02-14', 14.00, 20.00, 34],
                   ['A', '1', '2015-03-20', 20.00, 20.00, 5],
                   ['A', '1', '2015-03-25', 15.00, 20.00, 15],
                   ['A', '2', '2015-02-01', 75.99, 100.00, 22],
                   ['A', '2', '2015-02-23', 100.00, 100.00, 30],
                   ['A', '2', '2015-03-25', 65.00, 100.00, 64],
                   ['B', '3', '2015-04-01', 45.00, 45.00, 15],
                   ['B', '3', '2015-04-16', 40.00, 45.00, 2],
                   ['B', '3', '2015-04-18', 45.00, 45.00, 30],
                   ['B', '4', '2015-07-25', 5.00, 10.00, 55]],
                  columns=['dept', 'sku', 'date', 'price', 'orig_price', 'days_at_price'])
print(df)
dept sku date price orig_price days_at_price
0 A 1 2015-02-01 20.00 20.00 5
1 A 1 2015-02-06 16.00 20.00 8
2 A 1 2015-02-14 14.00 20.00 34
3 A 1 2015-03-20 20.00 20.00 5
4 A 1 2015-03-25 15.00 20.00 15
5 A 2 2015-02-01 75.99 100.00 22
6 A 2 2015-02-23 100.00 100.00 30
7 A 2 2015-03-25 65.00 100.00 64
8 B 3 2015-04-01 45.00 45.00 15
9 B 3 2015-04-16 40.00 45.00 2
10 B 3 2015-04-18 45.00 45.00 30
11 B 4 2015-07-25 5.00 10.00 55
I want to describe the pricing cycles, which can be defined as the period when a sku goes from original price to promotional price (or multiple promotional prices) and returns to original. A cycle must start with the original price. It is okay to include cycles which never change in price, as well as those that are reduced and never return. But an initial price that is less than orig_price would not be counted as a cycle. For the above df, the result I am looking for is:
dept sku cycle orig_price_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 1 30 64
3 B 3 1 15 2
4 B 3 2 30 0
I played around with groupby and sum, but can't quite figure out how to define a cycle and total the rows accordingly. Any help would be greatly appreciated.

I got very close to producing the desired end result...
import numpy as np

# add a column to track whether price is above/below/equal to orig
df.loc[:, 'reg'] = np.sign(df.price - df.orig_price)
# remove rows where the first known price for a sku is promotional
df_gp = df.groupby(['dept', 'sku'])
df = df[~((df_gp.cumcount() == 0) & (df.reg == -1))]
# enumerate all the individual pricing cycles
df.loc[:, 'cycle'] = (df.reg == 0).cumsum()
# group/aggregate to get days at orig vs. promo pricing
cycles = df.groupby(['dept', 'sku', 'cycle'])['days_at_price'].agg(
    reg_days=lambda x: x[:1].sum(),
    promo_days=lambda x: x[1:].sum())
print(cycles.reset_index())
dept sku cycle reg_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 3 30 64
3 B 3 4 15 2
4 B 3 5 30 0
The only part that I couldn't quite crack was how to restart the cycle number for each sku before the groupby.
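For what it's worth, the restart can be done by grouping the flag itself before the cumulative sum; a minimal sketch building on the reg column above (not from the original attempt):
# cumsum the "price equals original" flag *within* each (dept, sku)
# group instead of across the whole frame; after the leading promo
# rows are dropped, every group's first row has reg == 0, so each
# sku starts again at cycle 1
df['cycle'] = (df.reg == 0).groupby([df['dept'], df['sku']]).cumsum()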

Try using loc instead of groupby - you want chunks of skus over time periods, not aggregated groups. A for-loop, used in moderation, can also help here and won't be particularly un-pandas-like. (At least if, like me, you consider looping over unique array slices to be fine.)
from datetime import timedelta

df['date'] = pd.to_datetime(df['date'])  # ensure real datetimes for the comparisons below
df['cycle'] = -1             # create a column for the cycle
skus = df.sku.unique()       # get unique skus for iteration
for sku in skus:
    # Get the start date for each cycle for this sku.
    # NOTE that we define cycles as beginning when the price
    # equals the original price. This avoids the mentioned issue
    # that a cycle should not start if the initial price is less
    # than the original.
    cycle_start_dates = df.loc[(df.sku == sku) &
                               (df.price == df.orig_price),
                               'date'].tolist()
    # append a terminal date
    cycle_start_dates.append(df.date.max() + timedelta(1))
    # Assign the cycle values
    for i in range(len(cycle_start_dates) - 1):
        df.loc[(df.sku == sku) &
               (cycle_start_dates[i] <= df.date) &
               (df.date < cycle_start_dates[i + 1]), 'cycle'] = i + 1
This should give you a column with all of the cycles for each sku:
dept sku date price orig_price days_at_price cycle
0 A 1 2015-02-01 20.00 20.0 5 1
1 A 1 2015-02-06 16.00 20.0 8 1
2 A 1 2015-02-14 14.00 20.0 34 1
3 A 1 2015-03-20 20.00 20.0 5 2
4 A 1 2015-03-25 15.00 20.0 15 2
5 A 2 2015-02-01 75.99 100.0 22 -1
6 A 2 2015-02-23 100.00 100.0 30 1
7 A 2 2015-03-25 65.00 100.0 64 1
8 B 3 2015-04-01 45.00 45.0 15 1
9 B 3 2015-04-16 40.00 45.0 2 1
10 B 3 2015-04-18 45.00 45.0 30 2
11 B 4 2015-07-25 5.00 10.0 55 -1
Once you have the cycle column, aggregation becomes relatively straightforward. This multiple aggregation:
df.groupby(['dept', 'sku', 'cycle'])['days_at_price']\
  .agg(orig_price_days=lambda x: x[:1].sum(),
       promo_days=lambda x: x[1:].sum())\
  .reset_index()
will give you the desired result:
dept sku cycle orig_price_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 -1 22 0
3 A 2 1 30 64
4 B 3 1 15 2
5 B 3 2 30 0
6 B 4 -1 55 0
Note the additional -1 cycle values: they mark pre-cycle rows where the sku started below its original price.
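If those pre-cycle rows are unwanted, they can be filtered out after the aggregation; a small sketch, assuming the aggregated frame above was saved as result (a name introduced here for illustration):
# keep only rows that belong to a real cycle
result = result[result['cycle'] != -1].reset_index(drop=True)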

Related

Pandas: Group by and conditional sum based on value of current row

My dataframe looks like this:
customer_nr  order_value  year_ordered  payment_successful
1            50           1980          1
1            75           2017          0
1            10           2020          1
2            55           2000          1
2            300          2007          1
2            15           2010          0
I want to know the total amount a customer has successfully paid in the years before, for a specific order.
The expected output is as follows:
customer_nr  order_value  year_ordered  payment_successful  total_successfully_previously_paid
1            50           1980          1                   0
1            75           2017          0                   50
1            10           2020          1                   50
2            55           2000          1                   0
2            300          2007          1                   55
2            15           2010          0                   355
The closest I've gotten is this:
df.groupby(['customer_nr', 'payment_successful'], as_index=False)['order_value'].sum()
That just gives me the summed amount successfully and unsuccessfully paid all time per customer. It doesn't account for selecting only previous orders to participate in the sum.
Try:
df["total_successfully_previously_paid"] = (df["payment_successful"].mul(df["order_value"])
.groupby(df["customer_nr"])
.transform(lambda x: x.cumsum().shift().fillna(0))
)
>>> df
customer_nr ... total_successfully_previously_paid
0 1 ... 0.0
1 1 ... 50.0
2 1 ... 50.0
3 2 ... 0.0
4 2 ... 55.0
5 2 ... 355.0
[6 rows x 5 columns]
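An equivalent, more step-by-step version in case the chained form is hard to read; a sketch assuming the same column names and rows already sorted by year within each customer:
# zero out failed payments, take the running total per customer,
# then subtract the current order so each row only counts earlier ones
paid = df["order_value"].where(df["payment_successful"].eq(1), 0)
df["total_successfully_previously_paid"] = (
    paid.groupby(df["customer_nr"]).cumsum() - paid
)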

python pandas how to get data every n and every nth rows?

This question is not the same as "pandas every nth row or every n rows", so please don't delete it.
Following are some rows of my table:
open high low close volume datetime
277.14 277.51 276.71 276.8799 968908 2020-04-13 08:31:00.000
245.3 246.06 245.2 246.01 1094537 2020-04-13 08:32:00.000
285.12 285.27 284.81 285.22 534427 2020-04-13 08:33:00.000
246.08 246.08 245.27 245.46 1333257 2020-04-13 08:34:00.000
291.71 291.73 291.08 291.28 1439183 2020-04-13 08:35:00.000
245.89 246.63 245.64 246.25 960411 2020-04-13 08:36:00.000
285.18 285.4 285 285.36 188531 2020-04-13 08:30:37.000
285.79 285.79 285.65 285.68 6251 2020-04-13 08:38:00.000
246.25 246.56 246.12 246.515 956339 2020-04-13 08:39:00.000
I want to get every 3 consecutive rows, for example:
the 1st time get: 1st, 2nd, 3rd rows,
the 2nd time get: 2nd, 3rd, 4th rows,
the 3rd time get: 3rd, 4th, 5th rows,
the 4th time get: 4th, 5th, 6th rows.
Is there a good way to do this with pandas or Python? Thanks.
Use a generator with iloc to select the desired rows:
def rows_generator(df):
    i = 0
    while (i + 3) <= df.shape[0]:
        yield df.iloc[i:i + 3, :]
        i += 1

i = 1
for window in rows_generator(df):
    print(f'Time #{i}')
    print(window)
    i += 1
Example output:
Time #1
Group Cat Value
0 Group1 Cat1 1230
1 Group2 Cat2 4019
2 Group3 Cat3 9491
Time #2
Group Cat Value
1 Group2 Cat2 4019
2 Group3 Cat3 9491
3 Group4 Cat4 9588
Time #3
Group Cat Value
2 Group3 Cat3 9491
3 Group4 Cat4 9588
4 Group5 Cat5 6402
Time #4
Group Cat Value
3 Group4 Cat4 9588
4 Group5 Cat5 6402
5 Group6 Cat 1923
Time #5
Group Cat Value
4 Group5 Cat5 6402
5 Group6 Cat 1923
6 Group7 Cat7 492
Time #6
Group Cat Value
5 Group6 Cat 1923
6 Group7 Cat7 492
7 Group8 Cat8 8589
Time #7
Group Cat Value
6 Group7 Cat7 492
7 Group8 Cat8 8589
8 Group9 Cat9 8582
Does .shift() do what you want?
import pandas as pd
df = pd.DataFrame({'w': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]})
df['x'] = df['w'].shift( 0)
df['y'] = df['w'].shift(-1)
df['z'] = df['w'].shift(-2)
print(df)
w x y z
0 10 10 20.0 30.0
1 20 20 30.0 40.0
2 30 30 40.0 50.0
3 40 40 50.0 60.0
4 50 50 60.0 70.0
5 60 60 70.0 80.0
6 70 70 80.0 90.0
7 80 80 90.0 100.0
8 90 90 100.0 NaN
9 100 100 NaN NaN
The following should work:
for i in range(len(df) - 2):
    result = df.iloc[i:i + 3, :]
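If you prefer to materialize all the windows at once rather than looping, a list comprehension over iloc slices does the same thing (fine for small frames; the generator above stays lazy):
# every overlapping window of 3 consecutive rows
windows = [df.iloc[i:i + 3] for i in range(len(df) - 2)]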

Pandas - Adjust stock prices to stock splits

I have a DataFrame with price data and stock split data.
I want to put all the prices on the same scale, so for example if there was a stock split of 0.11111 (1/9),
from then on all the stock prices would be multiplied by 9.
So for example if this is my initial dataframe:
df= Price Stock_Splits
0 100 0
1 99 0
2 10 0.1111111
3 8 0
4 8.5 0
5 4 0.5
The "Price" column will become:
df= Price Stock_Splits
0 100 0
1 99 0
2 90 0.1111111
3 72 0
4 76.5 0
5 72 0.5
Here is one example:
import numpy as np
df['New_Price'] = (1 / df.Stock_Splits).replace(np.inf, 1).cumprod() * df.Price
Price Stock_Splits New_Price
0 100.0 0.000000 100.000000
1 99.0 0.000000 99.000000
2 10.0 0.111111 90.000009
3 8.0 0.000000 72.000007
4 8.5 0.000000 76.500008
5 4.0 0.500000 72.000007
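The same idea spelled out step by step may be clearer; a sketch, assuming the df from the question (New_Price is just an illustrative column name):
import numpy as np
import pandas as pd

# each split row contributes a factor of 1/split, non-split rows
# contribute 1; the running product is the total adjustment
# applied from that row onward
factor = np.where(df["Stock_Splits"] > 0, 1 / df["Stock_Splits"], 1.0)
df["New_Price"] = df["Price"] * pd.Series(factor, index=df.index).cumprod()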

creating daily price change for a product on a pandas dataframe

I am working on a data set with the following columns:
order_id
order_item_id
product_mrp
units
sale_date
I want to create a new column which shows how much the mrp changed since the last time this product was sold. Is there a way I can do this with a pandas dataframe?
Sorry if this question is very basic, but I am pretty new to pandas.
Sample data and expected output were posted as images; for each row of the data, I want the amount of the price change since the last time the product was sold.
You can do this as follows:
# define a function that applies a rolling-window calculation,
# taking the difference between the current value and the previous one
def calc_mrp(ser):
    # in case you want the relative change, just divide
    # by x.iloc[1] or x.iloc[0] in the lambda function
    return ser.rolling(window=2).apply(lambda x: x.iloc[1] - x.iloc[0])

# apply this to the grouped 'product_mrp' column, aligning the
# result back to the original index, and store it in a new column
df['mrp_change'] = df.groupby('product_id')['product_mrp'].transform(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows for which there is no previous order with the same product_id.
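In recent pandas versions, the same result can be had more directly with groupby plus diff, which subtracts each product's previous price within its group:
# change since the previous sale of the same product
df['mrp_change'] = df.groupby('product_id')['product_mrp'].diff()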

Pandas: Inventory recalculation given a final value

I'm coding a Python script to recalculate the inventory of a specific SKU from today back over the past 365 days, given the actual stock. For that I'm using a pandas DataFrame, as shown below:
Index DATE SUM_IN SUM_OUT
0 5/12/18 500 0
1 5/13/18 0 -403
2 5/14/18 0 -58
3 5/15/18 0 -39
4 5/16/18 100 0
5 5/17/18 0 -98
6 5/18/18 276 0
7 5/19/18 0 -139
8 5/20/18 0 -59
9 5/21/18 0 -70
The dataframe presents the sum of quantities IN and OUT of the warehouse, grouped by date. My intention is to add a column named "STOCK" that presents the stock level of the SKU of the represented day. For that, what I have is the actual stock level (index 9). So what I need is to recalculate all the levels day by day through all the dates series (From index 9 until index 0).
In Excel it's easy: I can put the actual level in the last row and just extend the calculation until I reach the row at index 0, as presented in a screenshot (column E is the formula, column G is the desired output).
Can someone help me achieve this result?
I already have the stock level of the last day (i.e. 5/21/2018 equals 10). What I need is to place the number 10 at index 9 and calculate the stock levels of the past days, from index 8 down to 0.
The desired output should be:
Index DATE TRANSACTION_IN TRANSACTION_OUT SUM_IN SUM_OUT STOCK
0 5/12/18 1 0 500 0 500
1 5/13/18 0 90 0 -403 97
2 5/14/18 0 11 0 -58 39
3 5/15/18 0 11 0 -39 0
4 5/16/18 1 0 100 0 100
5 5/17/18 0 17 0 -98 2
6 5/18/18 1 0 276 0 278
7 5/19/18 0 12 0 -139 139
8 5/20/18 0 4 0 -59 80
9 5/21/18 0 7 0 -70 10
(Updated)
last_stock = 10 # You should try another value
a = (df.SUM_IN + df.SUM_OUT).cumsum()
df["STOCK"] = a - (a.iloc[-1] - last_stock)
By using cumsum to create the key for groupby, then using cumsum again:
import numpy as np

df['SUM_IN'].replace(0, np.nan).ffill() + df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
Out[292]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 276.0
7 137.0
8 78.0
9 8.0
dtype: float64
Update
s = (df['SUM_IN'].replace(0, np.nan).ffill()
     + df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
     - df.STOCK)
(df['SUM_IN'].replace(0, np.nan).ffill()
 + df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
 - s.groupby(df['SUM_IN'].gt(0).cumsum()).bfill().fillna(0))
Out[318]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 278.0
7 139.0
8 80.0
9 10.0
dtype: float64
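For clarity, the grouping key used in both expressions simply increments at every replenishment row, so each delivery starts a new group; a sketch with the sample data:
# True where stock came in; cumsum turns each delivery into a new group id
key = df['SUM_IN'].gt(0).cumsum()
print(key.tolist())  # [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]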
