I have an input dataframe for daily fruit spend which looks like this:
spend_df
Date Apples Pears Grapes
01/01/22 10 47 0
02/01/22 0 22 3
03/01/22 11 0 3
...
For each fruit, I need to apply a function using its respective parameters and input spends. The function uses the previous day's and the current day's spend, as follows:
y = beta * (1 - exp(-(theta*previous + current)/alpha))
parameters_df
Parameter Apples Pears Grapes
alpha 132 323 56
beta 424 31 33
theta 13 244 323
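For example (assuming I have applied my own formula correctly), the Apples value for 02/01/22 uses the previous day's spend of 10 and the current day's spend of 0:
y = 424 * (1 - exp(-(13*10 + 0)/132)) ≈ 265.6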
My output data frame should look like this (may contain errors):
profit_df
Date Apples Pears Grapes
01/01/22 30.93 4.19 0
02/01/22 265.63 31.00 1.72
03/01/22 33.90 30.99 32.99
...
This is what I attempted:
# First map parameters_df to spend_df
merged_df = input_df.merge(parameters_df, on=['Apples','Pears','Grapes'])
# Apply function to each row
profit_df = merged_df.apply(lambda x: beta(1 - exp(-(theta*x[-1] + x)/alpha))
It might be easier to read if you extract the necessary variables from parameters_df and spend_df first. Then a simple application of the formula will produce the expected output.
# extract alpha, beta, theta from parameters df
alpha, beta, theta = parameters_df.iloc[:, 1:].values
# select fruit columns
current = spend_df[['Apples', 'Pears', 'Grapes']]
# find previous values of fruit columns
previous = current.shift(fill_value=0)
# calculate profit using formula
y = beta*(1 - np.exp(-(theta*previous + current) / alpha))
profit_df = spend_df[['Date']].join(y)
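On the sample data this reproduces the expected table above (about 30.9 for Apples on 01/01/22 and 265.6 on 02/01/22), with shift(fill_value=0) treating the day before the first row as zero spend.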
Another approach uses the Pandas rolling function (this version generalizes to as many fruits as necessary):
import pandas as pd
import numpy as np
sdf = pd.DataFrame({
    "Date": ['01/01/22', '02/01/22', '03/01/22'],
    "Apples": [10, 0, 11],
    "Pears": [47, 22, 0],
    "Grapes": [0, 3, 3],
}).set_index("Date")
pdf = pd.DataFrame({
    "Parameter": ['alpha', 'beta', 'theta'],
    "Apples": [132, 424, 13],
    "Pears": [323, 31, 244],
    "Grapes": [56, 33, 323],
}).set_index("Parameter")
def func(r):
    # look up this fruit's parameters by column name
    t = (pdf.loc['alpha', r.name], pdf.loc['beta', r.name], pdf.loc['theta', r.name])
    # rolling window of 2: x[0] is the previous day's spend, x[1] the current day's
    return r.rolling(2).apply(lambda x: t[1]*(1 - np.exp(-(t[2]*x[0] + x[1])/t[0])))

# rolling(2) leaves the first row as NaN, so recompute it separately with a
# zero previous-day spend and patch it back in
r1 = sdf.iloc[0:2, :].shift(fill_value=0).apply(lambda r: func(r), axis=0)
r = sdf.apply(lambda r: func(r), axis=0)
r.iloc[0] = r1.shift(-1).iloc[0]
print(r)
Result
Apples Pears Grapes
Date
01/01/22 30.934651 4.198004 0.000000
02/01/22 265.637775 31.000000 1.721338
03/01/22 33.901168 30.999998 32.999999
Imagine the following dataframe is given.
import pandas as pd
products = ['Apple', 'Apple', 'Carrot', 'Egg', 'Egg', 'Egg']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07', '2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 0]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
'customer_demand_date': customer_demand_date,
'02_2023': col_02_2023,
'03_2023': col_03_2023,
'04_2023': col_04_2023,
'05_2023': col_05_2023}
df = pd.DataFrame(data)
print(df)
Products customer_demand_date 02_2023 03_2023 04_2023 05_2023
0 Apple 2023-01-01 0 20 10 40
1 Apple 2023-01-07 20 30 40 40
2 Carrot 2023-01-01 0 10 50 60
3 Egg 2023-01-01 0 0 30 50
4 Egg 2023-01-07 0 10 40 60
5 Egg 2023-01-14 0 0 10 20
I have the columns Products, customer_demand_date (every week there is a new customer demand for the products for the upcoming months) and the month columns with the demanded quantities.
How can I determine which product has experienced the most frequent changes in customer demand over the months, and sort the products in descending order of frequency of change?
I have tried grouping by product and accumulating the demand quantity, but none of my attempts analyze the data both horizontally (per customer demand date) and vertically (per month).
Desired output:
Sorted products Ranking(or %, or count of changes)
Egg 1 (or 70% or 13)
Apple 2 (or 52% or 8)
Carrot 3 (22% or 3)
Either ranking or % of change frequency or count of changes.
Note: percentages in desired output are random numbers
I'd really appreciate it if you have any clever approach to solving this problem.
Thanks
One way is to define a function that counts horizontal and vertical changes, which you can then apply to each group individually.
import pandas as pd
from io import StringIO
def change_freq(x, months):
    # count horizontal changes
    chngs_horizontal = x[months].diff(axis=1).fillna(0).astype(bool).sum().sum()
    # count vertical changes
    chngs_vertical = x[months].diff(axis=0).fillna(0).astype(bool).sum().sum()
    return chngs_horizontal + chngs_vertical
# sample data
data = StringIO("""
Products,customer_demand_date,02_2023,03_2023,04_2023,05_2023
Apple,2023-01-01,0,20,10,40
Apple,2023-01-07,20,30,40,40
Carrot,2023-01-01,0,10,50,60
Egg,2023-01-01,0,0,30,50
Egg,2023-01-07,0,10,40,60
Egg,2023-01-14,0,0,10,20
""")
df = pd.read_csv(data, sep=",")
# count horizontal and vertical changes by product
result = df.groupby('Products').apply(change_freq, ['02_2023','03_2023','04_2023','05_2023'])
result = result.sort_values(ascending=False).to_frame('count_changes')
result['rank'] = result['count_changes'].rank(ascending=False)
This returns
count_changes rank
Products
Egg 13 1.0
Apple 8 2.0
Carrot 3 3.0
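If you prefer the percentage flavour of the desired output, one possible definition (just a sketch, since the question leaves the denominator open) is each product's share of all counted changes:
result['pct_changes'] = (result['count_changes'] / result['count_changes'].sum() * 100).round(1)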
To count the changes in the y direction (down the rows), a groupby("Products") with a lambda function can be used:
var_y = (df.loc[:, ~df.columns.isin(['customer_demand_date'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
To count the changes in the x direction (across the month columns), pct_change interpreted as True or False via astype(bool) can be used:
var_x=pd.concat([df[["Products"]], df.iloc[:,2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
Adding/grouping both together finally looks like this:
Products sum_x sum_y sum_xy
0 Apple 5 3 8
1 Carrot 3 0 3
2 Eggplant 7 6 13
Below is the complete code:
import pandas as pd
import numpy as np
products = ['Apple', 'Apple', 'Carrot', 'Eggplant', 'Eggplant', 'Eggplant']
customer_demand_date = ['2023-01-01', '2023-01-07', '2023-01-01', '2023-01-01', '2023-01-07','2023-01-14']
col_02_2023 = [0, 20, 0, 0, 0, 0]
col_03_2023 = [20, 30, 10, 0, 10, 0]
col_04_2023 = [10, 40, 50, 30, 40, 10]
col_05_2023 = [40, 40, 60, 50, 60, 20]
data = {'Products': products,
'customer_demand_date': customer_demand_date,
'02_2023': col_02_2023,
'03_2023': col_03_2023,
'04_2023': col_04_2023,
'05_2023': col_05_2023}
df = pd.DataFrame(data)
# count changes down the rows (per product) for each month column
var_y = (df.loc[:, ~df.columns.isin(['customer_demand_date'])].groupby("Products").agg(lambda x: x.pct_change().fillna(0).astype(bool).sum())).reset_index(level=0)
var_y["sum_y"] = var_y.iloc[:, 1:].sum(axis="columns")
# count changes across the month columns for each row, then sum per product
var_x = pd.concat([df[["Products"]], df.iloc[:, 2:].pct_change(axis='columns').replace(np.inf, 1).fillna(0).astype(bool).sum(axis=1).rename('sum_x')], axis=1)
var_x_sum = var_x.groupby("Products", as_index=False).agg(sum_x=('sum_x', 'sum'))
# combine both directions and add the total
var_total = pd.concat([var_x_sum, var_y["sum_y"]], axis=1)
var_total["sum_xy"] = var_total.iloc[:, 1:].sum(axis="columns")
print(var_total)
I have a dataframe with six columns that are coded 1 for yes and 0 for no, plus a column for year. The output I need is the conditional probability of each pair of columns being coded 1, broken down by year. I tried incorporating some suggestions from this post: Pandas - Conditional Probability of a given specific b, but with no luck. Other things I came up with are inefficient. I am really struggling to find the best way to go about this.
Current dataframe:
Output I am seeking:
To get your wide-formatted data into the long format of the linked post, consider running melt and then a self merge by year for all pairwise combinations (avoiding same keys and reverse duplicates). Then calculate as the linked post shows:
long_df = current_df.melt(
    id_vars = "Year",
    var_name = "Key",
    value_name = "Value"
)

pairwise_df = (
    long_df.merge(
        long_df,
        on = "Year",
        suffixes = ["1", "2"]
    ).query("Key1 < Key2")
    .assign(
        Both_Occur = lambda x: np.where(
            (x["Value1"] == 1) & (x["Value2"] == 1),
            1,
            0
        )
    )
)

prob_df = (
    (pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].value_counts() /
     pairwise_df.groupby(["Year", "Key1", "Key2"])["Both_Occur"].count()
    ).to_frame(name = "Prob")
    .reset_index()
    .query("Both_Occur == 1")
    .drop(["Both_Occur"], axis = "columns")
)
To demonstrate with reproducible data:
import numpy as np
import pandas as pd
np.random.seed(112621)
random_df = pd.DataFrame({
    'At least one tree': np.random.randint(0, 2, 100),
    'At least two trees': np.random.randint(0, 2, 100),
    'Clouds': np.random.randint(0, 2, 100),
    'Grass': np.random.randint(0, 2, 100),
    'At least one mounain': np.random.randint(0, 2, 100),
    'Lake': np.random.randint(0, 2, 100),
    'Year': np.random.randint(1983, 1995, 100)
})
# ...same code as above...
prob_df
Year Key1 Key2 Prob
0 1983 At least one mounain At least one tree 0.555556
2 1983 At least one mounain At least two trees 0.555556
5 1983 At least one mounain Clouds 0.416667
6 1983 At least one mounain Grass 0.555556
8 1983 At least one mounain Lake 0.555556
.. ... ... ... ...
351 1994 At least two trees Grass 0.490000
353 1994 At least two trees Lake 0.420000
355 1994 Clouds Grass 0.280000
357 1994 Clouds Lake 0.240000
359 1994 Grass Lake 0.420000
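If what you ultimately need is a true conditional probability P(Key1 = 1 | Key2 = 1) rather than the joint proportion above, a small sketch (reusing pairwise_df from above) is to restrict to the rows where the conditioning key is 1 and take the mean of the other flag:
cond_df = (
    pairwise_df[pairwise_df["Value2"] == 1]
    .groupby(["Year", "Key1", "Key2"])["Value1"]
    .mean()  # mean of a 0/1 column = P(Value1 = 1 | Value2 = 1)
    .rename("Cond_Prob")
    .reset_index()
)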
I am trying to write code that loops the following block over the columns of a dataframe, four times for four different arrays:
median_alcohol = df.alcohol.median()
for i, alcohol in enumerate(df.alcohol):
    if alcohol >= median_alcohol:
        df.loc[i, 'alcohol'] = 'high'
    else:
        df.loc[i, 'alcohol'] = 'low'
df.groupby('alcohol').quality.mean()
The columns in the dataframe are:
alcohol
pH
residual_sugar
citric_acid
I am trying to come up with a method to capture the four different arrays. Any ideas on how I should go about this?
I'm not sure what you're actually trying to do, but from what I understood, you could try something like this:
import pandas as pd
from statistics import mean
df = pd.DataFrame({'alcohol': [45, 88, 56, 15, 71], 'pH': [12, 83, 56, 25, 71], 'residual_sugar': [14, 25, 55, 8, 21]})
print(df)
#Output
>>> alcohol pH residual_sugar
0 45 12 14
1 88 83 25
2 56 56 55
3 15 25 8
4 71 71 21
def func(colum):
    dftemp = df.copy()
    median_colum = eval('df.' + colum).median()
    for i, item in enumerate(eval('df.' + colum)):
        dftemp.loc[i, colum] = 'high' if item >= median_colum else 'low'
    return dftemp.groupby(colum).agg(list).applymap(mean)
diferrentarrays = [func(i) for i in df.columns]
for array in diferrentarrays:
print(array)
Output:
pH residual_sugar
alcohol
high 70.0 33.666667
low 18.5 11.000000
alcohol residual_sugar
pH
high 71.666667 33.666667
low 30.000000 11.000000
alcohol pH
residual_sugar
high 71.666667 70.0
low 30.000000 18.5
def numeric_to_buckets(df, column_name):
    # replace each numeric value with 'high' or 'low' relative to the column median
    median = df[column_name].median()
    for i, val in enumerate(df[column_name]):
        if val >= median:
            df.loc[i, column_name] = 'high'
        else:
            df.loc[i, column_name] = 'low'

# bucket every feature column (all but the last), then report mean quality per bucket
for feature in df.columns[:-1]:
    numeric_to_buckets(df, feature)
    print(df.groupby(feature).quality.mean(), '\n')
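For context, a minimal, hypothetical dataframe this could be run on (the column names follow the question; the numbers are invented):
import pandas as pd

df = pd.DataFrame({
    'alcohol': [9.4, 12.8, 10.5, 11.2],
    'pH': [3.51, 3.20, 3.40, 3.30],
    'residual_sugar': [1.9, 2.6, 2.2, 2.0],
    'citric_acid': [0.00, 0.30, 0.20, 0.40],
    'quality': [5, 7, 6, 6],
})
The loop above then buckets each feature in place and prints the mean quality for the 'high' and 'low' groups of that feature.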
I would like to move from a dataset like this
AAPL GOOGL FB PG
AAPL 50 5 6 30
GOOGL 120 10 12 30
FB 5 6 1 4
PG 30 30 4 40
to a list of edges, adding an edge only if the values in the matrix satisfy a particular condition.
But my dataset is quite big, with a matrix of approximately 3500x3500, so the code becomes very memory-consuming.
import pandas as pd
import networkx as nx
from scipy.spatial.distance import cosine

r = [[50, 5, 6, 0], [120, 10, 12, 30], [5, 6, 1, 4], [0, 30, 4, 40]]
ind = ['AAPL', 'GOOGL', 'FB', 'PG']
df1 = pd.DataFrame(r, columns=ind)
df1.index = ind

G = nx.Graph()
l = list(df1)
for item_v in l:
    for item_h in l:
        if (l.index(item_v) < l.index(item_h)) and (1 - cosine(df1[item_v], df1[item_h]) > 0.01):
            G.add_edge(item_h, item_v)
Maybe it can be made a bit more efficient (e.g. if I calculate the cosine similarity manually, etc.)?
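One direction that might help (a vectorized sketch reusing df1 and l from above, with the same 0.01 threshold; it assumes no all-zero columns and is not tested at the 3500x3500 scale): compute the whole cosine-similarity matrix between columns in one shot with NumPy and keep only the upper-triangle pairs above the threshold.
import numpy as np
import networkx as nx

X = df1.values.astype(float)               # columns are the graph nodes
norms = np.linalg.norm(X, axis=0)
sim = (X.T @ X) / np.outer(norms, norms)   # cosine similarity between every pair of columns

rows, cols = np.triu_indices(len(l), k=1)  # all i < j column pairs
mask = sim[rows, cols] > 0.01
G = nx.Graph()
G.add_edges_from((l[i], l[j]) for i, j in zip(rows[mask], cols[mask]))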
I have a dataframe with 4 sections
Section 1: Product details
Section 2: 6 Potential product values based on a range of simulations
Section 3: Upper and lower bound for the input parameter to the simulations
Section 4: Randomly generated values for the input parameters
Section 2 is generated by pricing the product at equal intervals between the upper and lower bound.
I need to take the values in Section 4 and figure out the corresponding product value. Here is a possible setup for this dataframe:
table2 = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2
0 40 50 60 1.000 0.0 1.0
1 41 51 61 2.000 0.0 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.0 9.0
I will run through a couple examples of this calculation to make it clear what my question is.
Product A - sim_2
The input here is 1.0. This is equal to the upper bound for this product. Therefore the simulation value is equivalent to the state_6 value, 60.
Product B - sim_2
The input here is 1.5. The LB to UB range is (1,2), therefore the 6 states are {1, 1.2, 1.4, 1.6, 1.8, 2}. 1.5 is exactly halfway between state_3 (value 31) and state_4 (value 41). Therefore the simulation value is 36.
Product C - sim_1
The input here is .61. The LB to UB range is (.5,.625), therefore the 6 states are {.5,.525,.55,.575,.6,.625}. .61 is between state 5 and 6. Specifically the bucket it would fall under would be 5*(.61-.5)/(.625-.5)+1 = 5.4 (it is multiplied by 5 as that is the number of intervals - you can calculate it other ways and get the same result). Then to calculate the value we use that bucket in a weighing of the values for state 5 and state 6: (62-52)*(5.4-5)+52 = 56.
Product B - sim_1
The input here is 0, which is below the lower bound of 1. Therefore we need to extrapolate the value. We use the same formula as above, just with the values of state 1 and state 2. The bucket would be 5*(0-1)/(2-1)+1 = -4. The two values used are 11 and 21, so the value is (21-11)*(-4-1)+11 = -39.
I've also simplified the problem to make the solution easier to visualize; my final code needs to run on 500 values and 10,000 simulations, and the dataframe will have about 200 rows.
Here are the formulas I've used for the interpolation although I'm not committed to them specifically.
Bucket = N*(sim_value-LB)/(UB-LB) + 1
where N is the number of intervals
then nLower is the state value directly below the bucket, and nHigher is the state value directly above the bucket. If the bucket is outside the UB/LB, then force nLower and nHigher to be either the first two or last two values.
Final_value = (nHigher - nLower)*(Bucket - bucket_number_of_nLower) + nLower
To summarize, my question is how I can generate the final results based on the combination of input data provided. The most challenging part to me is how to make the connection from the Bucket number to the nLower and nHigher values.
I was able to generate the result using the following code. I'm not sure of the memory implications on a large dataframe, so still interested in better answers or improvements.
Edit: Ran this code on the full dataset, 141 rows, 500 intervals, 10,000 simulations, and it took slightly over 1.5 hours. So not quite as useless as I assumed, but there is probably a smarter way of doing this in a tiny fraction of that time.
for i in range(1, 3):
    table2['Bucket%s'%i] = 5 * (table2['sim_%s'%i] - table2['Lower_Bound']) / (table2['Upper_Bound'] - table2['Lower_Bound']) + 1
    table2['lv'] = table2['Bucket%s'%i].map(int)
    table2['hv'] = table2['Bucket%s'%i].map(int) + 1
    table2.loc[table2['lv'] < 1, 'lv'] = 1
    table2.loc[table2['lv'] > 5, 'lv'] = 5
    table2.loc[table2['hv'] > 6, 'hv'] = 6
    table2.loc[table2['hv'] < 2, 'hv'] = 2
    table2['nLower'] = table2.apply(lambda row: row['State_%s_Value'%row['lv']], axis=1)
    table2['nHigher'] = table2.apply(lambda row: row['State_%s_Value'%row['hv']], axis=1)
    table2['Final_value_%s'%i] = (table2['nHigher'] - table2['nLower'])*(table2['Bucket%s'%i] - table2['lv']) + table2['nLower']
Output:
>>> table2
Lower_Bound Product Type State_1_Value State_2_Value State_3_Value \
0 -1.0 A 10 20 30
1 1.0 B 11 21 31
2 0.5 C 12 22 32
3 5.0 D 13 23 33
State_4_Value State_5_Value State_6_Value Upper_Bound sim_1 sim_2 \
0 40 50 60 1.000 0.00 1.0
1 41 51 61 2.000 0.00 1.5
2 42 52 62 0.625 0.61 0.7
3 43 53 63 15.000 7.00 9.0
Bucket1 lv hv nLower nHigher Final_value_1 Bucket2 Final_value_2
0 3.5 5 6 50 60 35.0 6.0 60.0
1 -4.0 3 4 31 41 -39.0 3.5 36.0
2 5.4 5 6 52 62 56.0 9.0 92.0
3 2.0 3 4 33 43 23.0 3.0 33.0
I posted a superior solution with no loops here:
Alternate method to avoid loop in pandas dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Product Type': ['A', 'B', 'C', 'D'],
    'State_1_Value': [10, 11, 12, 13],
    'State_2_Value': [20, 21, 22, 23],
    'State_3_Value': [30, 31, 32, 33],
    'State_4_Value': [40, 41, 42, 43],
    'State_5_Value': [50, 51, 52, 53],
    'State_6_Value': [60, 61, 62, 63],
    'Lower_Bound': [-1, 1, .5, 5],
    'Upper_Bound': [1, 2, .625, 15],
    'sim_1': [0, 0, .61, 7],
    'sim_2': [1, 1.5, .7, 9],
})
buckets = df.iloc[:, -2:].sub(df['Lower_Bound'], axis=0).div(df['Upper_Bound'].sub(df['Lower_Bound'], axis=0), axis=0) * 5 + 1
low = buckets.applymap(int)
high = buckets.applymap(int) + 1
low = low.applymap(lambda x: 1 if x < 1 else x)
low = low.applymap(lambda x: 5 if x > 5 else x)
high = high.applymap(lambda x: 6 if x > 6 else x)
high = high.applymap(lambda x: 2 if x < 2 else x)
# the "State|Type" filter keeps 'Product Type' in position 0, so that positional
# index k lines up with State_k_Value when indexing with low / high
low_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(low.shape[0])[:, None], low])
high_value = pd.DataFrame(df.filter(regex="State|Type").values[np.arange(high.shape[0])[:, None], high])
df1 = (high_value - low_value).mul((buckets - low).values) + low_value
df1['Product Type'] = df['Product Type']
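A small follow-up sketch (the original answer stops here): the value columns of df1 come back as 0 and 1, one per sim column, so you may want to rename them to line up with the Final_value columns of the loop version:
df1 = df1.rename(columns={0: 'Final_value_1', 1: 'Final_value_2'})
print(df1)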