Pandas - Adjust stock prices to stock splits - python

I have a DataFrame, with prices data and stock splits data.
I want to put all the prices on the same scale, so for example if there was a stock split of 0.11111 (1/9),
from then on all the stock prices would be multiplied by 9.
So for example if this is my initial dataframe:
df= Price Stock_Splits
0 100 0
1 99 0
2 10 0.1111111
3 8 0
4 8.5 0
5 4 0.5
The "Price" column will become:
df= Price Stock_Splits
0 100 0
1 99 0
2 90 0.1111111
3 72 0
4 76.5 0
5 72 0.5

Here is one example:
import numpy as np
# 1/split is the adjustment factor; rows with no split (0 -> inf) contribute a factor of 1
df['New_Price'] = (1 / df.Stock_Splits).replace(np.inf, 1).cumprod() * df.Price
Price Stock_Splits New_Price
0 100.0 0.000000 100.000000
1 99.0 0.000000 99.000000
2 10.0 0.111111 90.000009
3 8.0 0.000000 72.000007
4 8.5 0.000000 76.500008
5 4.0 0.500000 72.000007
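For reference, a self-contained sketch of the same approach, rebuilding the question's frame (the name factor is just illustrative):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Price':        [100, 99, 10, 8, 8.5, 4],
    'Stock_Splits': [0, 0, 0.1111111, 0, 0, 0.5],
})

# inverse split ratio, cumulated forward; "no split" rows contribute a factor of 1
factor = (1 / df.Stock_Splits).replace(np.inf, 1).cumprod()
df['New_Price'] = factor * df.Price
print(df)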

Related

Groupby and get value offset by one year in pandas

My goal is to follow each ID that belongs to Category==1 on a given date, one year later. So I have a dataframe like this:
Period ID Amount Category
20130101 1 100 1
20130101 2 150 1
20130101 3 100 1
20130201 1 90 1
20130201 2 140 1
20130201 3 95 1
20130201 5 250 0
. . .
20140101 1 40 1
20140101 2 70 1
20140101 5 160 0
20140201 1 35 1
20140201 2 65 1
20140201 5 150 0
For example, in 20130201 I have 3 IDs that belong to Category 1: 1, 2, 3, but just 2 of them are present in 20140201: 1, 2. So I need to get the value of Amount, only for those IDs, one year later, like this:
Period ID Amount Category Amount_t1
20130101 1 100 1 40
20130101 2 150 1 70
20130101 3 100 1 nan
20130201 1 90 1 35
20130201 2 140 1 65
20130201 3 95 1 nan
20130201 5 250 0 nan
. . .
20140101 1 40 1 nan
20140101 2 70 1 nan
20140101 5 160 0 nan
20140201 1 35 1 nan
20140201 2 65 1 nan
20140201 5 150 0 nan
So, if the ID doesn't appear the next year, or it belongs to Category 0, I'll get a nan. My first approach was to get the list of unique IDs in each Period and then try to map that to the next year, using some combination of groupby() and isin(), like this:
aux = df[df.Category==1].groupby('Period').ID.unique()
aux.index = aux.index + pd.DateOffset(years=1)
But I didn't know how to keep going. I'm thinking some kind of groupby('ID') might be more efficient too. If it were a simple shift() that would be easy, but I'm not sure about how to get the value offset by a year by group.
You can create lagged features with an exact merge after you manually lag one of the join keys.
import pandas as pd
# Datetime so we can do calendar year subtraction
df['Period'] = pd.to_datetime(df.Period, format='%Y%m%d')
# Create a copy with the lagged features. Here I'll split the steps out.
df2 = df.copy()
df2['Period'] = df2.Period - pd.offsets.DateOffset(years=1)  # 1 year lag
df2 = df2.rename(columns={'Amount': 'Amount_t1'})
# Keep only values you want to merge
df2 = df2[df2.Category.eq(1)]
# Bring lagged features
df.merge(df2, on=['Period', 'ID', 'Category'], how='left')
Period ID Amount Category Amount_t1
0 2013-01-01 1 100 1 40.0
1 2013-01-01 2 150 1 70.0
2 2013-01-01 3 100 1 NaN
3 2013-02-01 1 90 1 35.0
4 2013-02-01 2 140 1 65.0
5 2013-02-01 3 95 1 NaN
6 2013-02-01 5 250 0 NaN
7 2014-01-01 1 40 1 NaN
8 2014-01-01 2 70 1 NaN
9 2014-01-01 5 160 0 NaN
10 2014-02-01 1 35 1 NaN
11 2014-02-01 2 65 1 NaN
12 2014-02-01 5 150 0 NaN
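If you prefer, the same steps can be collapsed into one chained expression (a sketch; it assumes Period has already been converted to datetime as above, and out is just an illustrative name):
out = df.merge(
    df[df.Category.eq(1)]
      .assign(Period=lambda d: d.Period - pd.offsets.DateOffset(years=1))
      .rename(columns={'Amount': 'Amount_t1'}),
    on=['Period', 'ID', 'Category'],
    how='left',
)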

How to add a mean column for the groupby movieId?

I have a df like below
userId movieId rating
0 1 31 2.0
1 2 10 4.0
2 2 17 5.0
3 2 39 5.0
4 2 47 4.0
5 3 31 3.0
6 3 10 2.0
I need to add two columns: one is the mean for each movie, the other is diff, which is the difference between rating and mean.
Please note that movieId can be repeated because different users may rate the same movie. Here rows 0 and 5 are for movieId 31, and rows 1 and 6 are for movieId 10:
userId movieId rating mean diff
0 1 31 2.0 2.5 -0.5
1 2 10 4.0 3 1
2 2 17 5.0 5 0
3 2 39 5.0 5 0
4 2 47 4.0 4 0
5 3 31 3.0 2.5 0.5
6 3 10 2.0 3 -1
Here is some of my code, which calculates the mean:
df = df.groupby('movieId')['rating'].agg(['count','mean']).reset_index()
You can use transform to keep the same number of rows when calculating mean with groupby. Calculating the difference is straightforward from that:
df['mean'] = df.groupby('movieId')['rating'].transform('mean')
df['diff'] = df['rating'] - df['mean']
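If you also want to keep the count/mean aggregate from your groupby/agg attempt, another option is to compute it separately and merge it back instead of using transform (a sketch; stats is just an intermediate name):
stats = df.groupby('movieId')['rating'].agg(['count', 'mean']).reset_index()
df = df.merge(stats, on='movieId', how='left')
df['diff'] = df['rating'] - df['mean']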

How to find rate of change across successive rows using time and data columns after grouping by a different column using pandas?

I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is, find the rate of change of data_col by using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[i+1] - time_in_hours[i])
where i is a given row, and the rate_of_change is calculated separately for different IDs.
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
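If you want this as the rate_of_change column of new_df, the index lines up with df once the group level is dropped, so you can assign it directly:
df['rate_of_change'] = s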
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use diff inside a groupby.apply:
df.groupby('ID_col').apply(
    lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64

Pandas: Inventory recalculation given a final value

I'm coding a Python script that recalculates the inventory of a specific SKU over the past 365 days, starting from today's actual stock. For that I'm using a pandas DataFrame, as shown below:
Index DATE SUM_IN SUM_OUT
0 5/12/18 500 0
1 5/13/18 0 -403
2 5/14/18 0 -58
3 5/15/18 0 -39
4 5/16/18 100 0
5 5/17/18 0 -98
6 5/18/18 276 0
7 5/19/18 0 -139
8 5/20/18 0 -59
9 5/21/18 0 -70
The dataframe presents the sum of quantities IN and OUT of the warehouse, grouped by date. My intention is to add a column named "STOCK" that shows the stock level of the SKU on each day. What I have is the actual stock level of the last day (index 9). So what I need is to recalculate all the levels day by day, back through the date series (from index 9 until index 0).
In Excel it's easy: I can put the actual level in the last row and just extend the calculation until I reach the row of index 0 (in the original screenshot, Column E is the formula and Column G is the desired output).
Can someone help me achieve this result?
I already have the stock level of the last day (i.e. 5/21/2018 is equal to 10). What I need is to place the number 10 at index 9 and calculate the stock levels of the other past days, from index 8 down to 0.
The desired output should be:
Index DATE TRANSACTION_IN TRANSACTION_OUT SUM_IN SUM_OUT STOCK
0 5/12/18 1 0 500 0 500
1 5/13/18 0 90 0 -403 97
2 5/14/18 0 11 0 -58 39
3 5/15/18 0 11 0 -39 0
4 5/16/18 1 0 100 0 100
5 5/17/18 0 17 0 -98 2
6 5/18/18 1 0 276 0 278
7 5/19/18 0 12 0 -139 139
8 5/20/18 0 4 0 -59 80
9 5/21/18 0 7 0 -70 10
(Updated)
last_stock = 10  # the known stock level on the last day (use your actual value)
a = (df.SUM_IN + df.SUM_OUT).cumsum()
df["STOCK"] = a - (a.iloc[-1] - last_stock)
Use cumsum to create the key for groupby, then use cumsum again within each group:
df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
Out[292]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 276.0
7 137.0
8 78.0
9 8.0
dtype: float64
Update
s=df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()-df.STOCK
df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()-s.groupby(df['SUM_IN'].gt(0).cumsum()).bfill().fillna(0)
Out[318]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 278.0
7 139.0
8 80.0
9 10.0
dtype: float64

Pandas dataframe groupby using cyclical data

I have some pricing data that looks like this:
import pandas as pd
df = pd.DataFrame([['A', '1', '2015-02-01', 20.00, 20.00, 5],
                   ['A', '1', '2015-02-06', 16.00, 20.00, 8],
                   ['A', '1', '2015-02-14', 14.00, 20.00, 34],
                   ['A', '1', '2015-03-20', 20.00, 20.00, 5],
                   ['A', '1', '2015-03-25', 15.00, 20.00, 15],
                   ['A', '2', '2015-02-01', 75.99, 100.00, 22],
                   ['A', '2', '2015-02-23', 100.00, 100.00, 30],
                   ['A', '2', '2015-03-25', 65.00, 100.00, 64],
                   ['B', '3', '2015-04-01', 45.00, 45.00, 15],
                   ['B', '3', '2015-04-16', 40.00, 45.00, 2],
                   ['B', '3', '2015-04-18', 45.00, 45.00, 30],
                   ['B', '4', '2015-07-25', 5.00, 10.00, 55]],
                  columns=['dept', 'sku', 'date', 'price', 'orig_price', 'days_at_price'])
print(df)
dept sku date price orig_price days_at_price
0 A 1 2015-02-01 20.00 20.00 5
1 A 1 2015-02-06 16.00 20.00 8
2 A 1 2015-02-14 14.00 20.00 34
3 A 1 2015-03-20 20.00 20.00 5
4 A 1 2015-03-25 15.00 20.00 15
5 A 2 2015-02-01 75.99 100.00 22
6 A 2 2015-02-23 100.00 100.00 30
7 A 2 2015-03-25 65.00 100.00 64
8 B 3 2015-04-01 45.00 45.00 15
9 B 3 2015-04-16 40.00 45.00 2
10 B 3 2015-04-18 45.00 45.00 30
11 B 4 2015-07-25 5.00 10.00 55
I want to describe the pricing cycles, which can be defined as the period when a sku goes from original price to promotional price (or multiple promotional prices) and returns to original. A cycle must start with the original price. It is okay to include cycles which never change in price, as well as those that are reduced and never return. But an initial price that is less than orig_price would not be counted as a cycle. For the above df, the result I am looking for is:
dept sku cycle orig_price_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 1 30 64
3 B 3 1 15 2
4 B 3 2 30 0
I played around with groupby and sum, but can't quite figure out how to define a cycle and total the rows accordingly. Any help would be greatly appreciated.
I got very close to producing the desired end result...
import numpy as np

# add a column to track whether price is above/below/equal to orig
df.loc[:, 'reg'] = np.sign(df.price - df.orig_price)
# remove rows where the first known price for a sku is a promo
df_gp = df.groupby(['dept', 'sku'])
df = df[~((df_gp.cumcount() == 0) & (df.reg == -1))]
# enumerate all the individual pricing cycles
df.loc[:, 'cycle'] = (df.reg == 0).cumsum()
# group/aggregate to get days at orig vs. promo pricing
cycles = df.groupby(['dept', 'sku', 'cycle'])['days_at_price'].agg(
    {'promo_days': lambda x: x[1:].sum(), 'reg_days': lambda x: x[:1].sum()})
print(cycles.reset_index())
dept sku cycle reg_days promo_days
0 A 1 1 5 42
1 A 1 2 5 15
2 A 2 3 30 64
3 B 3 4 15 2
4 B 3 5 30 0
The only part that I couldn't quite crack was how to restart the cycle number for each sku before the groupby.
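One way to restart the counter per sku would be to take the cumulative sum within each dept/sku group instead of over the whole frame (a sketch that reuses the reg column defined above):
# number cycle starts (reg == 0) within each dept/sku group rather than globally
df['cycle'] = df.reg.eq(0).astype(int).groupby([df.dept, df.sku]).cumsum()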
Try using loc instead of groupby - you want chunks of skus over time periods, not aggregated groups. A for-loop, used in moderation, can also help here and won't be particularly un-pandas like. (At least if, like me, you consider looping over unique array slices to be fine.)
from datetime import timedelta

# ensure the dates are datetimes so we can add a timedelta below
df['date'] = pd.to_datetime(df['date'])

df['cycle'] = -1            # create a column for the cycle
skus = df.sku.unique()      # get unique skus for iteration
for sku in skus:
    # Get the start date for each cycle for this sku.
    # NOTE that we define cycles as beginning when the price equals
    # the original price. This avoids the mentioned issue that a cycle
    # should not start if the initial price is less than the original.
    cycle_start_dates = df.loc[(df.sku == sku) &
                               (df.price == df.orig_price),
                               'date'].tolist()
    # append a terminal date
    cycle_start_dates.append(df.date.max() + timedelta(1))
    # Assign the cycle values
    for i in range(len(cycle_start_dates) - 1):
        df.loc[(df.sku == sku) &
               (cycle_start_dates[i] <= df.date) &
               (df.date < cycle_start_dates[i + 1]), 'cycle'] = i + 1
This should give you a column with all of the cycles for each sku:
dept sku date price orig_price days_at_price cycle
0 A 1 2015-02-01 20.00 20.0 5 1
1 A 1 2015-02-06 16.00 20.0 8 1
2 A 1 2015-02-14 14.00 20.0 34 1
3 A 1 2015-03-20 20.00 20.0 5 2
4 A 1 2015-03-25 15.00 20.0 15 2
5 A 2 2015-02-01 75.99 100.0 22 -1
6 A 2 2015-02-23 100.00 100.0 30 1
7 A 2 2015-03-25 65.00 100.0 64 1
8 B 3 2015-04-01 45.00 45.0 15 1
9 B 3 2015-04-16 40.00 45.0 2 1
10 B 3 2015-04-18 45.00 45.0 30 2
11 B 4 2015-07-25 5.00 10.0 55 -1
Once you have the cycle column, aggregation becomes relatively straightforward. This multiple aggregation:
df.groupby(['dept', 'sku','cycle'])['days_at_price']\
.agg({'orig_price_days': lambda x: x[:1].sum(),
'promo_days': lambda x: x[1:].sum()
})\
.reset_index()
will give you the desired result:
dept sku cycle promo_days orig_price_days
0 A 1 1 42 5
1 A 1 2 15 5
2 A 2 -1 0 22
3 A 2 1 64 30
4 B 3 1 2 15
5 B 3 2 0 30
6 B 4 -1 0 55
Note that this also produces rows with cycle == -1 for the pre-cycle periods where the price was below the original.
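A caveat for newer pandas: passing a dict of output-name -> function to .agg on a single column, as above, was deprecated and later removed. On recent versions the same result can be written with keyword (named) aggregation, e.g. this sketch:
(df.groupby(['dept', 'sku', 'cycle'])['days_at_price']
   .agg(orig_price_days=lambda x: x[:1].sum(),
        promo_days=lambda x: x[1:].sum())
   .reset_index())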
