I want to divide rows in my dataframe via specific columns.
That is, I have a 'ticker' column, and each row also has 'date' and 'price' attributes.
I want to divide the price on date[i+2] by the price on date[i], where i and i+2 just mean the DAY and the DAY + 2 for that ticker. The date column is already in proper datetime format for Pandas operations.
The data looks like:
date        ticker  price
2002-01-30  A       20
2002-01-31  A       21
2002-02-01  A       21.4
2002-02-02  A       21.3
...
That means I want to select the price based on the ticker and on the DAY and the DAY + 2, specifically for each ticker, to calculate the ratio date[i+2]/date[i].
I've considered using iloc, but I'm not sure how to restrict the calculation to a specific ticker.
Use groupby:
df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
0 NaN
1 NaN
2 1.070000
3 1.014286
Name: price, dtype: float64
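For completeness, a minimal, self-contained sketch of this approach using the sample rows from the question (it assumes one row per ticker per day, sorted by date, so that shifting two rows corresponds to DAY + 2):

import pandas as pd

df = pd.DataFrame({
    'date': pd.to_datetime(['2002-01-30', '2002-01-31', '2002-02-01', '2002-02-02']),
    'ticker': ['A', 'A', 'A', 'A'],
    'price': [20, 21, 21.4, 21.3],
})

# within each ticker, divide each price by the price two rows earlier
df['ratio'] = df.groupby('ticker')['price'].transform(lambda x: x / x.shift(2))
print(df)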
I have the following DataFrame, output_excel, containing inventory data and sales data for different products. See the DataFrame below:
   Product  2022-04-01  2022-05-01  2022-06-01  2022-07-01  2022-08-01  2022-09-01  AvgMonthlySales  Current Inventory
1  BE37908        1500        1400        1200        1134        1110        1004       150.208333               1500
2  BE37907        2000        1800        1800        1540        1300        1038       189.562500               2000
3  DE37907        5467        5355        5138        4926        4735        4734       114.729167               5467
Please note that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while AvgMonthlySales is the mean of actual, past sales for that specific product. The Current Inventory column just displays today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((leadtime in weeks / 4) + 1) * AvgMonthlySales:
   Product  AvgMonthlySales  Lead time in weeks  Security Stock
1  BE37908       250.208333                  16      1251.04166
2  BE37907       189.562500                  24       1326.9375
3  DE37907       114.729167                  10      401.552084
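For example, for BE37908 with a 16-week lead time, the formula gives ((16 / 4) + 1) * 250.208333 = 5 * 250.208333 ≈ 1251.04, which matches the Security Stock column.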
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
   Product Last Date Above Security Stock
1  BE37908                     2022-05-01
2  BE37907                     2022-07-01
3  DE37907                            NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?
You could try as follows:
import numpy as np

# create a list of the columns that hold dates
cols = [col for col in df.columns if col.startswith('20')]
# select those cols, apply df.gt row-wise against the security stock, sum and subtract 1
idx = df.loc[:, cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)
# get the correct dates from cols
# if the value == len(cols)-1, *all* values will have been greater, so: np.nan
idx = [cols[i] if i != len(cols)-1 else np.nan for i in idx]
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
   Product Last Date Above Security Stock
1  BE37908                     2022-05-01
2  BE37907                     2022-07-01
3  DE37907                            NaN
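Note that the row-wise gt comparison aligns on the index, so this assumes df and df2 share the same index, with one row per product in the same order (as in the sample frames above).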
I am trying to calculate a rolling median as an aggregated function on a pandas dataframe. Here is some sample data:
import pandas as pd
import numpy as np
d = {'date': ['2020-01-01','2020-02-01','2020-03-01','2020-01-01','2020-02-01','2020-02-01','2020-03-01','2020-02-01','2020-03-01','2020-03-01','2020-03-01','2020-03-01','2020-03-01'],
     'count': [1,1,1,2,2,3,3,3,4,3,3,3,1],
     'type': ['type1','type2','type3','type1','type3','type1','type2','type2','type2','type3','type1','type2','type1'],
     'salary': [1000,2000,3000,10000,15000,30000,100000,50000,25000,10000,25000,30000,40000]}
df: pd.DataFrame = pd.DataFrame(data=d)
df_pvt: pd.DataFrame = df.pivot_table(index='date',
                                      columns='type',
                                      aggfunc={'salary': np.median})
df_pvt.head(5)
I would like to perform a rolling median on the salaries using the pandas rolling(2).median() function.
How can I go about inserting this type of window function into the aggregate function for a pivot table?
My goal is to aggregate a large amount of numeric data by date and take the rolling median of variable lengths and report that in my resulting pivot table. I am not entirely sure how to insert this function into aggfunc or the like.
The expected output orders by date ascending, takes all observations associated with the two months in each window, and finds their median.
For type1 we have:
date count type salary
0 2020-01-01 1 type1 1000
3 2020-01-01 2 type1 10000
5 2020-02-01 3 type1 30000
10 2020-03-01 3 type1 25000
12 2020-03-01 1 type1 40000
Thus, for type1 the expected output with rolling(2) would be:
salary
type type1
date
2020-01-01 NaN
2020-02-01 10000.0
2020-03-01 30000.0
The logic follows that for the first 2-month rolling window we would have the data points 1000, 10000 and 30000, producing a median of 10000.
For 2020-03-01, the rolling window of 2 includes 30000, 25000 and 40000, so the median result should be 30000.
I'm not sure it can be done directly with the aggfunc parameter, so a workaround could be to duplicate the data with the date column shifted by one month. Note that this method does not really scale to bigger rolling windows; it works, but you may end up with too much data.
# first convert to datetime
df['date'] = pd.to_datetime(df['date'])
# concatenate the data shifted by one month onto df, then pivot
# (pd.concat is used here, since DataFrame.append is deprecated in recent pandas)
res = (
    pd.concat([df, df.assign(date=lambda x: x['date'] + pd.DateOffset(months=1))])
    .pivot_table(index='date', columns='type',
                 aggfunc={'salary': np.median})
    .reindex(df['date'].unique())  # to avoid an extra month
)
print(res)
print(res)
salary
type type1 type2 type3
date
2020-01-01 5500.0 NaN NaN
2020-02-01 10000.0 26000.0 15000.0
2020-03-01 30000.0 30000.0 10000.0
For the first date, if you want to get NaN as a rolling window would, you can do res.loc[res.index.min()] = np.nan afterwards.
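For reference, the effect of that on the printed result above:

res.loc[res.index.min()] = np.nan
# the type1 column is now NaN, 10000.0, 30000.0 -- matching the expected output shown earlier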
I have a dataframe containing materials, dates of purchase and purchase prices. I want to filter my dataframe such that I only keep one row containing each material, and that row contains the material at the latest purchase date and corresponding price.
How could I achieve this? I have racked my brains trying to work out how to apply aggregation functions to this but I just can't work out how.
Do a multisort and then use drop_duplicates, keeping the first occurrence.
import pandas as pd
df.sort_values(by=['materials', 'purchase_date'], ascending=[True, False], inplace=True)
df.drop_duplicates(subset=['materials'], keep='first', inplace=True)
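A minimal runnable sketch of that approach (the column names materials, purchase_date and price, and the sample rows, are assumptions since the question doesn't show the actual frame):

import pandas as pd

# hypothetical example data
df = pd.DataFrame({
    'materials': ['steel', 'steel', 'copper', 'copper'],
    'purchase_date': pd.to_datetime(['2021-01-05', '2021-03-10', '2021-02-01', '2021-01-20']),
    'price': [100, 110, 200, 195],
})

# latest purchase first within each material, then keep only that first occurrence
df.sort_values(by=['materials', 'purchase_date'], ascending=[True, False], inplace=True)
df.drop_duplicates(subset=['materials'], keep='first', inplace=True)
print(df)  # one row per material: the latest purchase date and its price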
Two steps:
1. sort_values() by material and purchaseDate
2. groupby() material and take the first row
import pandas as pd
import numpy as np

d = pd.date_range("1-apr-2020", "30-oct-2020", freq="W")
df = pd.DataFrame({"material": np.random.choice(list("abcd"), len(d)),
                   "purchaseDate": d,
                   "purchasePrice": np.random.randint(1, 100, len(d))})
df.sort_values(["material", "purchaseDate"], ascending=[1, 0]).groupby("material", as_index=False).first()
Output:
  material         purchaseDate  purchasePrice
0        a  2020-09-27 00:00:00             85
1        b  2020-10-25 00:00:00             54
2        c  2020-10-11 00:00:00             21
3        d  2020-10-18 00:00:00             45
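Both answers boil down to the same idea: on a frame sorted by material and descending purchaseDate, df.sort_values(["material", "purchaseDate"], ascending=[True, False]).drop_duplicates("material") keeps the same rows as the groupby("material", as_index=False).first() call above, just with the original index preserved.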
I am a Java developer finding it a bit tricky to switch to Python and Pandas. I'm trying to iterate over dates of a Pandas DataFrame which looks like the one below,
   sender_user_id                     created
0               1  2016-12-19 07:36:07.816676
1              33  2016-12-19 07:56:07.816676
2               1  2016-12-19 08:14:07.816676
3              15  2016-12-19 08:34:07.816676
What I am trying to get is a dataframe which gives me a count of the total number of transactions that have occurred per week. From the forums I have only been able to find syntax for 'for loops' which iterate over indexes. Basically I need a result dataframe which looks like this: the value field contains the count of sender_user_id, and the date needs to be modified to show the starting date of each week.
         date  value
0  2016-12-09     20
1  2016-12-16     36
2  2016-12-23     56
3  2016-12-30     32
Thanks in advance for the help.
I think you need to resample by week and aggregate with size:
#cast to datetime if necessary
df.created = pd.to_datetime(df.created)
print (df.resample('W', on='created').size().reset_index(name='value'))
created value
0 2016-12-25 4
If you need a different offset:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created').size().reset_index(name='value'))
created value
0 2016-12-23 4
If you need the number of unique values per week, aggregate with nunique:
df.created = pd.to_datetime(df.created)
print (df.resample('W-FRI', on='created')['sender_user_id'].nunique()
.reset_index(name='value'))
created value
0 2016-12-23 3
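The question also asked for the starting date of each week, while resample labels each weekly bin with its end date. One way to get the start instead (a suggestion, not part of the answer above) is to subtract the bin length from the labels afterwards:

out = df.resample('W-FRI', on='created').size().reset_index(name='value')
out['created'] = out['created'] - pd.Timedelta(days=6)  # first day covered by each weekly bin
print(out)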
I have a dataframe containing one column with prices.
What's the best way to create a column calculating the rate of return between two rows (leaving the first or last row null)?
For example the data frame looks as follows:
Date Price
2008-11-21 23.400000
2008-11-24 26.990000
2008-11-25 28.000000
2008-11-26 25.830000
Trying to add a column as follows:
Date Price Return
2008-11-21 23.400000 0.1534
2008-11-24 26.990000 0.0374
2008-11-25 28.000000 -0.0775
2008-11-26 25.830000 NaN
Where the Return column is calculated as follows:
Return Row 0 = Price Row 1 / Price Row 0 - 1
Should I use a for loop, or is there a better way?
You can use shift to shift the rows and then div to divide the Series against itself shifted:
In [44]:
df['Return'] = (df['Price'].shift(-1).div(df['Price']) - 1)
df
Out[44]:
Date Price Return
0 2008-11-21 23.40 0.153419
1 2008-11-24 26.99 0.037421
2 2008-11-25 28.00 -0.077500
3 2008-11-26 25.83 NaN
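Equivalently (an alternative, not part of the answer above), pct_change followed by a shift gives the same numbers:

df['Return'] = df['Price'].pct_change().shift(-1)
# row i gets Price[i+1] / Price[i] - 1, with NaN in the last row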