merging quarterly and monthly data while doing ffill on multiindex - python

I am trying to merge a quarterly series and a monthly series, essentially "downsampling" the quarterly series in the process. Both dataframes contain a DATE column and a BANK column, and the remaining columns are various values in either a monthly or quarterly format. The complication is that it is a multiindex, so if I try:
merged_data=df1.join(df2).reset_index(['DATE', 'BANK_CODE']).ffill()
the forward fill of the quarterly data up to the last monthly datapoint is not done for each respective bank as I intended. Could anyone help with this please? Note: I have also tried to resample the quarterly dataframe separately, however I do not know of a way to downsample it to a monthly level only up to a certain date (which should be the latest date in the monthly data):
df2 = df2.set_index(['DATE']).groupby(['BANK']).resample('M')['VALUE'].ffill()
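A minimal sketch of the group-wise fill idea the question is circling, assuming the joined frame keeps a BANK index level (the exact level name may differ, e.g. BANK_CODE); grouping on that level before filling keeps the forward fill from leaking across banks:
# hypothetical sketch: confine the forward fill to each bank by grouping
# on the bank index level before calling ffill
merged_data = (df1.join(df2)
                  .groupby(level='BANK')
                  .ffill())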
Here is a mini example:

df1:
Date        Bank    Value1  Value2
2021-06-30  bank 1    2000    7000
2021-07-31  bank 1    3000    2000
2021-06-30  bank 2    6000    9000

df2:
Date        Bank    Value1  Value2
2021-06-30  bank 1    2000    5000
2021-09-30  bank 1    5000    4000
2021-06-30  bank 2    9000   10000

Using the data provided, assuming df1 is monthly and df2 is quarterly.
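For reference, the example frames above can be reconstructed like this:
import pandas as pd

df1 = pd.DataFrame({
    'Date': pd.to_datetime(['2021-06-30', '2021-07-31', '2021-06-30']),
    'Bank': ['bank 1', 'bank 1', 'bank 2'],
    'Value1': [2000, 3000, 6000],
    'Value2': [7000, 2000, 9000]})

df2 = pd.DataFrame({
    'Date': pd.to_datetime(['2021-06-30', '2021-09-30', '2021-06-30']),
    'Bank': ['bank 1', 'bank 1', 'bank 2'],
    'Value1': [2000, 5000, 9000],
    'Value2': [5000, 4000, 10000]})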
Set index and resample your quarterly data to monthly:
# monthly data
x1 = df1.set_index(['Bank', 'Date'])

# quarterly data, resampled back to monthly; the group key becomes an
# index level, so the now-redundant Bank column is dropped
x2 = (df2.set_index('Date')
         .groupby('Bank')
         .resample('M')
         .ffill()
         .drop(columns='Bank'))
Merge both; with how='outer' you keep the union of the (Bank, Date) keys rather than just the intersection:
x1.join(x2, lsuffix='_m', rsuffix='_q', how='outer').fillna(0)
Value1_m Value2_m Value1_q Value2_q
Bank Date
bank 1 2021-06-30 2000.0 7000.0 2000 5000
2021-07-31 3000.0 2000.0 2000 5000
2021-08-31 0.0 0.0 2000 5000
2021-09-30 0.0 0.0 5000 4000
bank 2 2021-06-30 6000.0 9000.0 9000 10000
The _m suffixes are the values from df1, _q are from df2. I'm assuming you'll know how to explain or deal with the differences between monthly and quarterly values on the same dates.
As you can see, there is no need to specify how far to fill: resample automatically extends each bank's quarterly data to its last available date.

Related

Create a pivot table where the value is held flat until a certain date

I want to create a pivot table with months as the columns and items as rows. Currently the data is in a table that looks like this:
Item  Balance  Maturity
A     100      1/31/23
B     150      2/28/23
C     200      3/31/23
But I want the data to look like this:
     1/31/23  2/28/23  3/31/23
A    100
B    150      150
C    200      200      200
In python I have created a date range with frequency 'M'. The idea I am trying to accomplish is if the date is less than the Maturity date, repeat balance.
pivot and bfill:
(df.pivot(index='Item', columns='Maturity', values='Balance')
   .sort_index(axis=1, key=lambda x: pd.to_datetime(x, dayfirst=False))
   .bfill(axis=1)
)
NB: sort_index converts the column labels with to_datetime to ensure the correct chronological order of the columns.
Output:
Maturity 1/31/23 2/28/23 3/31/23
Item
A 100.0 NaN NaN
B 150.0 150.0 NaN
C 200.0 200.0 200.0
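If the columns should follow the user-created monthly date range from the question rather than only the maturities present in the data, one variant (a sketch, assuming US-style month/day/year labels and pandas' month-end frequency 'M') is to convert Maturity to datetime, reindex the columns to the full range, then back-fill:
import pandas as pd

months = pd.date_range('2023-01-31', '2023-03-31', freq='M')  # user-chosen range

out = (df.assign(Maturity=pd.to_datetime(df['Maturity']))
         .pivot(index='Item', columns='Maturity', values='Balance')
         .reindex(columns=months)   # impose the full monthly range
         .bfill(axis=1))            # hold each balance flat up to maturity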

Stacking column indices on top of one another using Pandas

I'm looking to stack the indices of some columns on top of one another, this is what I currently have:
Buy Buy Currency Sell Sell Currency
Date
2013-12-31 100 CAD 100 USD
2014-01-02 200 USD 200 CAD
2014-01-03 300 CAD 300 USD
2014-01-06 400 USD 400 CAD
This is what I'm looking to achieve:
Buy/Sell Buy/Sell Currency
100 USD
100 CAD
200 CAD
200 USD
300 USD
300 CAD
And so on. Basically I want to take the values in "Buy" and "Buy Currency" and stack them onto the values in the "Sell" and "Sell Currency" columns, one after the other. I should mention that my data frame has 10 columns in total, so using
df_pl.stack(level=0)
doesn't seem to work.
One option is with pivot_longer from pyjanitor, where for this particular use case, you pass a list of regular expressions (to names_pattern) to aggregate the desired column labels into new groups (in names_to):
# pip install pyjanitor
import pandas as pd
import janitor

df.pivot_longer(
    index=None,
    names_to=['Buy/Sell', 'Buy/Sell Currency'],
    # the first regex collects the amount columns, the second the currency columns
    names_pattern=[r"Buy$|Sell$", r".+Currency$"],
    ignore_index=False,
    sort_by_appearance=True,
)
Buy/Sell Buy/Sell Currency
Date
2013-12-31 100 CAD
2013-12-31 100 USD
2014-01-02 200 USD
2014-01-02 200 CAD
2014-01-03 300 CAD
2014-01-03 300 USD
2014-01-06 400 USD
2014-01-06 400 CAD
using concat
import pandas as pd

# stack the two amount columns row by row
print(pd.concat([df['Buy'], df['Sell']], axis=1)
        .stack()
        .reset_index(1, drop=True)
        .rename(index='buy/sell'))
output:
0 100
0 100
1 200
1 200
2 300
2 300
3 400
3 400
# if Date is a regular column rather than the index, move it there first
df = df.set_index('Date')

# map the original column names onto the two target names
d = {'Buy Currency': 'Buy/Sell Currency',
     'Sell Currency': 'Buy/Sell Currency',
     'Buy': 'Buy/Sell',
     'Sell': 'Buy/Sell'}
df.columns = df.columns.map(d)

# stack the first two columns on top of the next two
out = pd.concat([df.iloc[:, :2],
                 df.iloc[:, 2:]],
                ignore_index=True)
out
Buy/Sell Buy/Sell Currency
0 100 CAD
1 200 USD
2 300 CAD
3 400 USD
4 100 USD
5 200 CAD
6 300 USD
7 400 CAD
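A pure-pandas alternative, sketched here under the assumption that every relevant column name contains either Buy or Sell as above (for the real 10-column frame the target label list would need to be extended): split the frame into its Buy half and its Sell half, align the column names, and concatenate.
import pandas as pd

# split into the Buy half and the Sell half, give both halves the same
# target labels, then stack them and restore date order
buy = df.filter(like='Buy').set_axis(['Buy/Sell', 'Buy/Sell Currency'], axis=1)
sell = df.filter(like='Sell').set_axis(['Buy/Sell', 'Buy/Sell Currency'], axis=1)
out = pd.concat([buy, sell]).sort_index()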

Python dataframe returning closest value above specified input in one row (pivot_table)

I have the following DataFrame, output_excel, containing inventory data and sales data for different products. See the DataFrame below:
Product 2022-04-01 2022-05-01 2022-06-01 2022-07-01 2022-08-01 2022-09-01 AvgMonthlySales Current Inventory
1 BE37908 1500 1400 1200 1134 1110 1004 150.208333 1500
2 BE37907 2000 1800 1800 1540 1300 1038 189.562500 2000
3 DE37907 5467 5355 5138 4926 4735 4734 114.729167 5467
Please note that in my example, today's date is 2022-04-01, so all inventory numbers for the months May through September are predicted values, while AvgMonthlySales is the mean of actual, past sales for that specific product. The Current Inventory column just displays today's value.
I also have another dataframe, df2, containing the lead time, the same sales data, and the calculated security stock for the same products. The formula for the security stock is ((leadtime in weeks / 4) + 1) * AvgMonthlySales:
Product AvgMonthlySales Lead time in weeks Security Stock
1 BE37908 250.208333 16 1251.04166
2 BE37907 189.562500 24 1326.9375
3 DE37907 114.729167 10 401.552084
What I am trying to achieve:
I want to create a new dataframe, which tells me how many months are left until our inventory drops below the security stock. For example, for the first product, BE37908, the security stock is ~1251 units, and by 2022-06-01 our inventory will drop below that number. So I want to return 2022-05-01, as this is the last month where our inventories are projected to be above the security stock. The whole output should look something like this:
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
Please also note that the timeframe for the projections (the columns) can be set by the user, so we couldn't just select columns 2 through 7. However, the Product column will always be the first one, and the AvgMonthlySales and the Current Inventory columns will always be the last two.
To recap, I want to return the column with the smallest value above the security stock for each product. I have an idea on how to do that by column using argsort, but not by row. What is the best way to achieve this? Any tips?
You could try as follows:
import numpy as np

# create a list of the date columns
cols = [col for col in df.columns if col.startswith('20')]

# compare the date columns to each product's security stock row-wise,
# count the months above it, and subtract 1 to get the positional index
# of the last month above (this relies on the projected inventories
# decreasing month over month)
idx = df.loc[:, cols].gt(df2['Security Stock'], axis=0).sum(axis=1).sub(1)

# translate the positions into dates; if the value equals len(cols)-1,
# *all* values were greater, so inventory never drops below: np.nan
idx = [cols[i] if i != len(cols) - 1 else np.nan for i in idx]

out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = idx
print(out)
Product Last Date Above Security Stock
1 BE37908 2022-05-01
2 BE37907 2022-07-01
3 DE37907 NaN
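A row-wise variant with the same NaN convention, sketched under the assumption of the df/df2 layout from the question, avoids the positional arithmetic:
import numpy as np

# boolean mask of the months above each product's security stock
above = df.loc[:, cols].gt(df2['Security Stock'], axis=0)

# take the last date above per row; NaN when inventory is always above
# (it never drops below) or is never above at all
out = df['Product'].to_frame()
out['Last Date Above Security Stock'] = above.apply(
    lambda r: np.nan if r.all() or not r.any() else r[r].index[-1],
    axis=1)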

Handling a column with dates and missing dates

I have the following code to estimate profit from buy and sell price of crypto token.
import pandas as pd
# Read text file into pandas DataFrame
# --------------------------------------
df = pd.read_csv("crypto.txt", comment="#", skip_blank_lines=True, delim_whitespace=True).dropna()
# Display DataFrame
# -----------------
print(df)
print()
# Replace commas in number
# --------------------------------------
df['BuyPrice'] = df['BuyPrice'].str.replace(',', '').astype(float)
df['SellPrice'] = df['SellPrice'].str.replace(',', '').astype(float)
df['Size'] = df['Size'].str.replace(',', '').astype(float)
df['Profit'] = df.SellPrice - df.BuyPrice
# Sort BuyPrice column in ascending way
# --------------------------------------
df = df.sort_values('BuyPrice', ignore_index=True)
#df = df.sort_values('BuyPrice').reset_index(drop=True)
print()
# Sum all the numerical values and create a 'Total' row
# -----------------------------------------------------
df.loc['Total'] = df.sum(numeric_only=True)
# Replace NaN by empty space
# ---------------------------
df = df.fillna('')
df = df.rename({'BuyPrice': 'Buy Price', 'SellPrice': 'Sell Price'}, axis=1)
# Display Final DataFrame
# -----------------
print(df)
Now the output only shows the rows with sensible entries in the Date column. I get:
Coin BuyPrice SellPrice Size Date
1 1INCH 2,520 3180 10 23-10-2021
3 SHIB 500 450 200,000 27-10-2021
4 DOT 1650 2500 1 June 01, 2021
Coin Buy Price Sell Price Size Date Profit
0 SHIB 500.0 450.0 200000.0 27-10-2021 -50.0
1 DOT 1650.0 2500.0 1.0 June 01, 2021 850.0
2 1INCH 2520.0 3180.0 10.0 23-10-2021 660.0
Total 4670.0 6130.0 200011.0 1460.0
Clearly, the rows without dates have been ignored: the dropna() after read_csv removes them, because their Date entry is read in as NaN. How could one tackle this issue? How can pandas understand that they are dates?
crypto.txt file contains:
Coin BuyPrice SellPrice Size Date
#--- --------- ---------- ---- -----------
ADA 1,580 1,600 1 NA
1INCH 2,520 3180 10 23-10-2021
SHIB 261.6 450 200,000 NA
SHIB 500 450 200,000 27-10-2021
DOT 1650 2500 1 "June 01, 2021"
It seems I couldn't write the last row's Date entry without wrapping it in quotes. Is it possible to convert all the dates into one single format?
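No answer is attached to this question here, but a minimal sketch of one way to handle it, assuming only the two formats in crypto.txt (day-first dates and "June 01, 2021"-style dates): remove the .dropna() after read_csv so the NA rows survive, then parse each format in turn with errors='coerce':
# parse the day-first entries first, then fall back to the long
# month-name format; anything unparseable (including NA) stays NaT
parsed = pd.to_datetime(df['Date'], format='%d-%m-%Y', errors='coerce')
parsed = parsed.fillna(pd.to_datetime(df['Date'], format='%B %d, %Y', errors='coerce'))
df['Date'] = parsed
After this, every date displays in the single YYYY-MM-DD format.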

Rolling Median in a Pandas Pivot Table

I am trying to calculate a rolling median as an aggregated function on a pandas dataframe. Here is some sample data:
import pandas as pd
import numpy as np

d = {'date': ['2020-01-01', '2020-02-01', '2020-03-01', '2020-01-01',
              '2020-02-01', '2020-02-01', '2020-03-01', '2020-02-01',
              '2020-03-01', '2020-03-01', '2020-03-01', '2020-03-01',
              '2020-03-01'],
     'count': [1, 1, 1, 2, 2, 3, 3, 3, 4, 3, 3, 3, 1],
     'type': ['type1', 'type2', 'type3', 'type1', 'type3', 'type1',
              'type2', 'type2', 'type2', 'type3', 'type1', 'type2',
              'type1'],
     'salary': [1000, 2000, 3000, 10000, 15000, 30000, 100000, 50000,
                25000, 10000, 25000, 30000, 40000]}

df: pd.DataFrame = pd.DataFrame(data=d)
df_pvt: pd.DataFrame = df.pivot_table(index='date',
                                      columns='type',
                                      aggfunc={'salary': np.median})
df_pvt.head(5)
I would like to perform a rolling median on the salaries using pandas rolling(2).median() function.
How can I go about inserting this type of window function into the aggregate function for a pivot table?
My goal is to aggregate a large amount of numeric data by date and take the rolling median of variable lengths and report that in my resulting pivot table. I am not entirely sure how to insert this function into aggfunc or the like.
The expected output orders the dates ascending, pools all observations that fall within the two-month window, and takes their median.
For type1 we have:
date count type salary
0 2020-01-01 1 type1 1000
3 2020-01-01 2 type1 10000
5 2020-02-01 3 type1 30000
10 2020-03-01 3 type1 25000
12 2020-03-01 1 type1 40000
Thus, for type1 the expected output with rolling(2) would be:
salary
type type1
date
2020-01-01 NaN
2020-02-01 10000.0
2020-03-01 30000.0
The logic: for the first full 2-month window (January plus February), the type1 data points are 1000, 10000 and 30000, which produces a median of 10000.
For 2020-03-01, the rolling 2-month window includes 30000, 25000 and 40000, so the median result should be 30000.
I'm not sure it can be done directly with the aggfunc parameter, so a workaround is to create a double of the data with the date column shifted by one month. Note that this method does not really scale to bigger rolling windows: it works, but you may end up with too much data.
# first convert to datetime
df['date'] = pd.to_datetime(df['date'])

# append a copy of the data shifted by one month, then pivot
# (pd.concat stands in for the removed DataFrame.append)
res = (
    pd.concat([df,
               df.assign(date=lambda x: x['date'] + pd.DateOffset(months=1))])
      .pivot_table(index='date', columns='type',
                   aggfunc={'salary': np.median})
      .reindex(df['date'].unique())  # to avoid an extra trailing month
)
print(res)
print(res)
salary
type type1 type2 type3
date
2020-01-01 5500.0 NaN NaN
2020-02-01 10000.0 26000.0 15000.0
2020-03-01 30000.0 30000.0 10000.0
For the first date, if you want NaN (as a true rolling window would produce), you can set res.loc[res.index.min()] = np.nan afterwards.
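The same trick generalizes to a window of k months by concatenating k shifted copies; a sketch, where k is a made-up parameter (k=2 reproduces the result above) and the data-volume caveat grows with k:
k = 2  # rolling window length in months

# pool k shifted copies so each month's pivot cell sees the raw
# observations of the current and the previous k-1 months
shifted = [df.assign(date=df['date'] + pd.DateOffset(months=i)) for i in range(k)]
res = (pd.concat(shifted)
         .pivot_table(index='date', columns='type',
                      aggfunc={'salary': np.median})
         .reindex(df['date'].unique()))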
