calculate dataframe column based on two rows in another column

calculate dataframe column based on two rows in another column - python

I have a dataframe, containing one column with prices.
whats the best way to create a column calculating the rate of return between two rows (leaving the first or last Null).
For example the data frame looks as follows:
Date Price
2008-11-21 23.400000
2008-11-24 26.990000
2008-11-25 28.000000
2008-11-26 25.830000
Trying to add a column as follows:
Date Price Return
2008-11-21 23.400000 0.1534
2008-11-24 26.990000 0.0374
2008-11-25 28.000000 -0.0775
2008-11-26 25.830000 NaN
Where the calculation of return column as follows:
Return Row 0 = Price Row 1 / Price Row 0 - 1
Should i for loop, or is there a better way?

You can use shift to shift the rows and then div to divide the Series against itself shifted:
In [44]:
df['Return'] = (df['Price'].shift(-1).div(df['Price']) - 1)
df
Out[44]:
Date Price Return
0 2008-11-21 23.40 0.153419
1 2008-11-24 26.99 0.037421
2 2008-11-25 28.00 -0.077500
3 2008-11-26 25.83 NaN

Related

Python dataframe returning closest value above specified input in one row (pivot_table)

I have the following DataFrame, output_excel, containing inventory data and sales data for different products. See the DataFrame below:
Product 2022-04-01 2022-05-01 2022-06-01 2022-07-01 2022-08-01 2022-09-01 AvgMonthlySales Current Inventory
1 BE37908 1500 1400 1200 1134 1110 1004 150.208333 1500
2 BE37907 2000 1800 1800 1540 1300 1038 189.562500 2000
3 DE37907 5467 5355 5138 4926 4735 4734 114.729167 5467
Please note that that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while the AvgMonthlySales are the mean of actual, past sales for that specific product. The current inventory just displays today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((leadtime in weeks / 4) + 1) * AvgMonthlySales:
Product AvgMonthlySales Lead time in weeks Security Stock
1 BE37908 250.208333 16 1251.04166
2 BE37907 189.562500 24 1326.9375
3 DE37907 114.729167 10 401.552084
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?

You could try as follows:
# create list with columns with dates
cols = [col for col in df.columns if col.startswith('20')]
# select cols, apply df.gt row-wise, sum and subtract 1
idx = df.loc[:,cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
# get the correct dates from the cols
# if the value == len(cols)-1, *all* values will have been greater so: np.nan
idx = [cols[i] if i != len(cols)-1 else np.nan for i in idx]
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN

how to expand a string into multiple rows in dataframe?

i want to split a string into multiple rows.
df.assign(MODEL_ABC = df['MODEL_ABC'].str.split('_').explode('MODEL_ABC'))
my output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
if i run individually for column i'm getting like below but not entire dataframe
A
B
this is my dataframe df
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A_B 75.0 25.0
expected output:
YEAR PERIOD MODEL_ABC Price Qty
0 2018 First A 75.0 25.0
1 2018 First B 75.0 25.0

You can do the following, start by transforming the column into a list, so then you can explode it to create multiple rows:
df['MODEL_ABC'] = df['MODEL_ABC'].str.split('_')
df = df.explode('MODEL_ABC')

Dividing rows for specific columns by date+n in Pandas

I want to divide rows in my dataframe via specific columns.
That is, I have a column named 'ticker' which has a attributes 'date' and 'price'.
I want to divide date[i+2] by date[i] where i and i+2 just mean the DAY and the DAY +2 for the price of that ticker. The date is also in proper datetime format for operations using Pandas.
The data looks like:
date | ticker | price |
2002-01-30 A 20
2002-01-31 A 21
2002-02-01 A 21.4
2002-02-02 A 21.3
.
.
That means I want to select the price based off the ticker and the DAY and the DAY + 2 specifically for each ticker to calculate the ratio date[i+2]/date[i].
I've considered using iloc but I'm not sure how to select for specific tickers only to do the math on.

use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64

Pandas: Conditional replace on consecutive rows within a group

I am trying to build "episodes" from a list of transactions organized by group (patient). I used to do this with Stata, but I'm not sure how to do it in Python. In Stata, I would say something like:
by patient: replace startDate = startDate[_n-1] if startDate-endDate[_n-1]<10
In English, that meant to start with the first row of a group and check if the number of days between the startDate of that group and the endDate of the prior group was less than 10. Then, move to the next row and perform the same thing, then the next row... until you'd exhausted all rows.
I have been trying to figure out how to do the same thing in Python/Pandas and running into a wall. I could sort the dataframe by patient and date, then iterate over the entire data frame. It seems like there should be a better way to do this.
It's important that the script first compare row 2 to row 1 because, when I get to row 3, if the script has replaced the value in row 2, when I get to row 3, I want to use the replaced value, not the original value.
Sample input:
Patient startDate endDate
1 1/1/2016 1/2/2016
1 1/11/2016 1/12/2016
1 1/28/2016 1/28/2016
1 6/15/2016 6/16/2016
2 3/1/2016 3/1/2016
Sample output:
Patient startDate endDate
1 1/1/2016 1/2/2016
1 1/1/2016 1/12/2016
1 1/1/2016 1/28/2016
1 6/15/2016 6/16/2016
2 3/1/2016 3/1/2016

I think we need shift + groupby , and bfill + mask is the key
df.startDate=pd.to_datetime(df.startDate)
df.endDate=pd.to_datetime(df.endDate)
df.startDate=df.groupby('Patient').apply(lambda x : x.startDate.mask((x.startDate-x.endDate.shift(1)).fillna(0).astype('timedelta64[D]')<10).bfill()).reset_index(level=0,drop=True).fillna(df.startDate)
df
Out[495]:
Patient startDate endDate
0 1 2016-01-28 2016-01-02
1 1 2016-01-28 2016-01-12
2 1 2016-01-28 2016-01-28
3 1 2016-06-15 2016-06-16
4 2 2016-03-01 2016-03-01

Pandas: groupby and get median value by month?

I have a data frame that looks like this:
org date value
0 00C 2013-04-01 0.092535
1 00D 2013-04-01 0.114941
2 00F 2013-04-01 0.102794
3 00G 2013-04-01 0.099421
4 00H 2013-04-01 0.114983
Now I want to figure out:
the median value for each organisation in each month of the year
X for each organisation, where X is the percentage difference between the lowest median monthly value, and the highest median value.
What's the best way to approach this in Pandas?
I am trying to generate the medians by month as follows, but it's failing:
df['date'] = pd.to_datetime(df['date'])
ave = df.groupby(['row_id', 'date.month']).median()
This fails with KeyError: 'date.month'.
UPDATE: Thanks to #EdChum I'm now doing:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
which works great and gives me:
99P 1 0.106975
2 0.091344
3 0.098958
4 0.092400
5 0.087996
6 0.081632
7 0.083592
8 0.075258
9 0.080393
10 0.089634
11 0.085679
12 0.108039
99Q 1 0.110889
2 0.094837
3 0.100658
4 0.091641
5 0.088971
6 0.083329
7 0.086465
8 0.078368
9 0.082947
10 0.090943
11 0.086343
12 0.109408
Now I guess, for each item in the index, I need to find the min and max calculated values, then the difference between them. What is the best way to do that?

For your first error you have a syntax error, you can pass a list of the column names or the columns themselves, you passed a list of names and date.month doesn't exist so you want:
ave = df.groupby([df['row_id'], df['date'].dt.month]).median()
After that you can calculate the min and max for a specific index level so:
((ave.max(level=0) - ave.min(level=0))/ave.max(level=0)) * 100
should give you what you want.
This calculates the difference between the min and max value for each organisation, divides by the max at that level and creates the percentage by multiplying by 100

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.