How do I modify this function to accept multiple Dataframes?


I wrote this function and I would like it to accept more than one DataFrame, so that the final plot has one prediction line per DataFrame and coef_df is filled in with the coefficients from each fit.
The function extracts the needed feature and target from a much larger dataset, fits a linear regression model, plots the fitted line over the data, and returns a DataFrame with all the coefficients.
(This is just an exercise.)
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

def prep_model_and_predict(feature, target, dataset, degree):
    # Part 1: build a DataFrame with the target and the polynomial features (degree >= 1)
    poly_df = pd.DataFrame()
    poly_df[str(target)] = dataset[str(target)]
    poly_df['power_1'] = dataset[str(feature)]
    # Check if degree > 1
    if degree > 1:
        for power in range(2, degree + 1):  # loop over the remaining degrees
            name = 'power_' + str(power)
            poly_df[name] = poly_df['power_1'].apply(lambda x: x**power)
    # Part 2: fit the model and make predictions
    features = list(poly_df.columns[1:])
    X = poly_df[features]
    y = poly_df[str(target)]
    model = LinearRegression().fit(X, y)
    predictions = model.predict(X)
    # Part 3: put the weights in a DataFrame
    # (rows are collected in a list because DataFrame.append was removed in pandas 2.0)
    rows = [{'Name': 'Intercept', 'Value': model.intercept_},
            {'Name': 'Power_1', 'Value': model.coef_[0]}]
    for power in range(2, degree + 1):
        rows.append({'Name': 'Power_' + str(power),
                     'Value': '{:.3e}'.format(model.coef_[power - 1])})
    coef_df = pd.DataFrame(rows)
    # Part 4: plot it
    fig, ax = plt.subplots()
    ax.plot(poly_df['power_1'], poly_df[str(target)], '.',
            poly_df['power_1'], predictions, '-')
    ax.set_xlabel('Square footage, living area')
    ax.set_ylabel('Price per Sqft')
    ax.ticklabel_format(axis='y', style='sci', scilimits=(-2, 2))
    return coef_df, ax
and this is the result:
Name Value
0 Intercept 506738
1 Power_1 2.71336e-77
2 Power_2 7.335e-39
3 Power_3 -1.850e-44
4 Power_4 8.437e-50
5 Power_5 0.000e+00
6 Power_6 0.000e+00
7 Power_7 3.645e-55
8 Power_8 1.504e-51
9 Power_9 5.760e-48
10 Power_10 1.958e-44
11 Power_11 5.394e-41
12 Power_12 9.404e-38
13 Power_13 -3.635e-41
14 Power_14 4.655e-45
15 Power_15 -1.972e-49
much appreciated!

I am not sure exactly what you are asking. Next time, try to ask a question that other people here on SO can easily reproduce and run.
I have tried to answer your questions. Correct me if I have misunderstood them.
Pass an arbitrary number of DataFrames to your function and plot them:
I have created three random DataFrames to use:
df1 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
df2 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
df3 = pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=list('AB'))
The function that plots them:
def plot_me(*dfs):
    # *dfs collects any number of positional arguments into a tuple
    plt.figure(figsize=(13, 9))
    for lab_ind, df in enumerate(dfs):
        plt.plot(df['A'], df['B'], label=lab_ind)
    plt.legend()
    plt.show()
The result plot you get:
Put the results of your model into a DataFrame
Regarding your second question, I am not going to concentrate too much on your exact details - for example the name of the columns of your dataframe, etc.
For this particular example I have generated two random arrays:
X = np.random.randint(0,50 ,size=(50, 2))
y = np.random.randint(0,2 ,size=(50, 1))
Then fit a LinearRegression model on this data.
model=LinearRegression().fit(X,y)
predictions=model.predict(X)
And then add it to a DataFrame:
res_df = pd.DataFrame(predictions,columns = ['Value'])
And if you print res_df
Value
0 0.420395
1 0.459389
2 0.369648
3 0.416058
4 0.644088
5 0.362072
6 0.363157
7 0.468943
. .
. .
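The two parts above can be combined into one function. The following is only a minimal sketch under stated assumptions: the name fit_and_plot_many and the extra "Dataset" column are hypothetical, and np.polyfit is used as a stand-in for the LinearRegression fit in the question so that the example depends only on numpy, pandas and matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def fit_and_plot_many(feature, target, degree, *datasets):
    """Fit one polynomial per DataFrame, draw every fit on one Axes, collect coefficients."""
    fig, ax = plt.subplots()
    coef_frames = []
    for idx, dataset in enumerate(datasets):
        data = dataset.sort_values(feature)
        x = data[feature].to_numpy(dtype=float)
        y = data[target].to_numpy(dtype=float)
        # np.polyfit returns the highest power first; reverse so position == power
        coefs = np.polyfit(x, y, degree)[::-1]
        predictions = np.polyval(coefs[::-1], x)
        ax.plot(x, y, ".", label=f"data {idx}")
        ax.plot(x, predictions, "-", label=f"fit {idx}")
        names = ["Intercept"] + [f"Power_{p}" for p in range(1, degree + 1)]
        coef_frames.append(pd.DataFrame({"Dataset": idx, "Name": names, "Value": coefs}))
    ax.legend()
    # One long table: a block of coefficient rows per input DataFrame
    coef_df = pd.concat(coef_frames, ignore_index=True)
    return coef_df, ax


# Hypothetical demo data: three noisy straight lines with slopes 1, 2 and 3
rng = np.random.default_rng(0)
frames = []
for slope in (1.0, 2.0, 3.0):
    x = np.arange(10, dtype=float)
    frames.append(pd.DataFrame({"A": x, "B": slope * x + rng.normal(0, 0.1, 10)}))

coef_df, ax = fit_and_plot_many("A", "B", 1, *frames)
print(coef_df)
```

Keeping the DataFrames as the trailing *datasets argument means the feature, target and degree stay fixed for all of them; if each DataFrame needs its own settings, a list of tuples would probably be a better fit.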

Related

Grouped Column Operations in Python using Pandas

I have a data frame consisting of a .csv import that contains n number of trials. Trials are arranged by column with a header (wavelength1 for trial 1, wavelength2 for trial 2 etc.) We're tracking the absorption of a solution over time during a chemical reaction. You can see a SS of the excel file in the link. Trials are grouped in to threes (with g of sugar being the IDV and the absorbance in nm being the DV). For each trial:
I need to determine what the maximum and minimum values are. This can of course be done using max() and min() but when we are sampling every 0.25 seconds, the data can be noisy, meaning that I have to smooth it out. I have already built a function to do that. We're also probably just going to sample every one second as it's much smoother anyway.
Each group of three trials needs to be plotted on the same graph for comparison. n number of trials will create n/3 graphs.
I'm coming from an intermediate background in MATLAB. This is not something I was ever able to figure out in there, either.
What have I done so far?
I have attempted to make a list out of the header for each trial, and then use a for loop to move through the data using the df.column_name syntax:
data = pd.read_csv('data.csv')
col_name = data.columns.values
print(col_name)
for i in col_name:
    print(data.col_name[i])
The code works up to the 4th line, where it returns the error: AttributeError: 'DataFrame' object has no attribute 'col_name'. Here is where I would like to make a series or set (whatever it's called here) with all of the values from the wavelength1 trial to plot/manipulate/etc. It's worth noting that I have gotten the multiple plots and multiple lines to work manually: but I want to automate it as that's ofc the point of coding. Here's one out of four graphs of the 'manual' version:
import pandas as pd
import matplotlib.pyplot as plt
#import matplotlib as matplotlib
data = pd.read_csv('data.csv')
plt.close("all")
n_rows = 2
n_columns = 2
#initialize figure
figure_size = (30,15)
font_size = 13
f, ([plt1, plt2], [plt3, plt4]) = plt.subplots(n_rows,n_columns, figsize = figure_size)
#plot first three runs
x=data.time1
y=data.wavelength1
plt1.plot(x,y, label='Trial 1')
x=data.time2
y=data.wavelength2
plt1.plot(x,y,label='Trial 2')
plt1.set_title('0.3g Glucose', fontweight="bold", size=font_size)
x=data.time3
y=data.wavelength3
plt1.plot(x,y,label='Trial 3')
plt1.set_ylabel('Wavelength (nm)', fontsize = font_size)
plt1.set_xlabel('Time (s)', fontsize = font_size)
plt1.legend(fontsize=font_size)
My first thought was just to do:
for i in range (0,num_col):
    plot(time,data.wavelength(i))
But this does not work. I'm sure it's something quite simple but it is escaping me.
Example data:
https://ufile.io/ac226vma
Thanks in advance!
[1]: https://i.stack.imgur.com/gMtBN.png

Analysis
I need to determine what the maximum and minimum values are.
Since you want the smallest and largest values within each trial, and each trial is represented by one column, you can use DataFrame.min() to get the smallest value in each column. If you also want the index of the smallest value, use idxmin(). The same idea applies to max() and idxmax().
df = pd.read_csv("data.csv")
# Get max and min values
print("ANALYSIS OF MIN AND MAX VALUES")
analysis_df = pd.DataFrame()
analysis_df["min"] = df.min()
analysis_df["min_idx"] = df.idxmin()
analysis_df["max"] = df.max()
analysis_df["max_idx"] = df.idxmax()
print(analysis_df)
produces:
ANALYSIS OF MIN AND MAX VALUES
min min_idx max max_idx
wavelength1 801.0 120 888.0 4
wavelength2 809.0 85 888.0 1
wavelength3 728.0 96 837.0 1
wavelength4 762.0 114 864.0 3
wavelength5 785.0 115 878.0 2
wavelength6 747.0 118 866.0 1
wavelength7 748.0 119 851.0 3
wavelength8 776.0 113 880.0 0
wavelength9 812.0 112 900.0 0
wavelength10 770.0 110 863.0 1
wavelength11 759.0 100 858.0 0
wavelength12 787.0 91 876.0 0
wavelength13 756.0 66 862.0 2
wavelength14 809.0 70 877.0 1
wavelength15 828.0 62 866.0 0
Plotting
Each group of three trials needs to be plotted on the same graph for comparison. n number of trials will create n/3 graphs.
This is easier if you break it up into a few smaller subproblems.
First, you want to take a list of all of your columns and break them up into groups of three. I copied the code to do this from here.
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)
Now, once we have a group of three column names, we need to get the values within the dataframe associated with those columns. Also, since your datafile contains unequal numbers of observations per trial, we need to get rid of the NaN's at the end of the file.
def get_trials(df, column_group_names):
    """Get columns from dataframe, dropping missing values."""
    column_group = df[list(column_group_names)]
    column_group = column_group.dropna(how='all')
    return column_group
Now, let's combine those two functions:
col_iterator = grouper(3, df.columns)
[...]
for column_group_names in col_iterator:
    column_group = get_trials(df, column_group_names)
[...]
This will let us loop over the columns in groups of three, and plot them individually. Since we've filtered it down to the data we're interested in, we can use DataFrame.plot to plot it to the matplotlib plot.
Next, we need to loop over the subplots. This is a little annoying to do while also looping over groups, so I like to define an iterator.
def subplot_axes_iterator(n_rows, n_columns):
    for i in range(n_rows):
        for j in range(n_columns):
            yield i, j
Example of it in use:
>>> list(subplot_axes_iterator(2, 2))
[(0, 0), (0, 1), (1, 0), (1, 1)]
Now, combine those pieces:
# Plot data
n_rows = 2
n_columns = 3
figure_size = (15, 10)
font_size = 13
fig, axes = plt.subplots(n_rows, n_columns, figsize=figure_size)
col_iterator = grouper(3, df.columns)
axes_iterator = subplot_axes_iterator(n_rows, n_columns)
plot_names = [
    "Group 1",
    "Group 2",
    "Group 3",
    "Group 4",
    "Group 5",
]
for column_group_names, axes_position, plot_name in \
        zip(col_iterator, axes_iterator, plot_names):
    print(f"plotting {column_group_names} at {axes_position}")
    column_group = get_trials(df, column_group_names)
    column_group.plot(ax=axes[axes_position])
    axes[axes_position].set_title(plot_name, fontweight="bold", size=font_size)
    axes[axes_position].set_xlabel("Time (s)", fontsize=font_size)
    axes[axes_position].set_ylabel("Wavelength (nm)", fontsize=font_size)
plt.tight_layout()
plt.show()
(By the way, you said that you want 4 graphs, but the dataset posted has fifteen trials, so I made 5 graphs.)
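Since data.csv is not included here, the grouping step can be tried on synthetic data. This is a small sketch, not the answer's exact code: the demo DataFrame is a hypothetical stand-in for the real file, and the list comprehension that drops None is an addition for the case where the number of trials is not a multiple of three. It also shows the bracket indexing that avoids the AttributeError from the question.

```python
import itertools

import numpy as np
import pandas as pd


def grouper(n, iterable, fillvalue=None):
    """Collect items into fixed-length chunks, padding the last one with fillvalue."""
    args = [iter(iterable)] * n
    return itertools.zip_longest(*args, fillvalue=fillvalue)


# Hypothetical stand-in for data.csv: six trials, with trial 6 two samples shorter
demo = pd.DataFrame(
    {f"wavelength{k}": np.linspace(800 + k, 900 + k, 5) for k in range(1, 7)}
)
demo.loc[3:, "wavelength6"] = np.nan

group_summaries = []
for names in grouper(3, demo.columns):
    # Drop the None padding that appears when the last group is short
    names = [name for name in names if name is not None]
    trials = demo[names].dropna(how="all")
    # Bracket indexing works with a column name held in a variable;
    # attribute access such as demo.name would look for a column literally called "name"
    summary = {name: (trials[name].min(), trials[name].max()) for name in names}
    group_summaries.append(summary)
    print(summary)
```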
Final script
(Included for easy copy/paste.)
import pandas as pd
import matplotlib.pyplot as plt
import itertools
def grouper(n, iterable, fillvalue=None):
    "grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx"
    args = [iter(iterable)] * n
    return itertools.zip_longest(fillvalue=fillvalue, *args)
def get_trials(df, column_group_names):
    """Get columns from dataframe, dropping missing values."""
    column_group = df[list(column_group_names)]
    column_group = column_group.dropna(how='all')
    return column_group
def subplot_axes_iterator(n_rows, n_columns):
    for i in range(n_rows):
        for j in range(n_columns):
            yield i, j
df = pd.read_csv("data.csv")
# Get max and min values
print("ANALYSIS OF MIN AND MAX VALUES")
analysis_df = pd.DataFrame()
analysis_df["min"] = df.min()
analysis_df["min_idx"] = df.idxmin()
analysis_df["max"] = df.max()
analysis_df["max_idx"] = df.idxmax()
print(analysis_df)
# Plot data
n_rows = 2
n_columns = 3
figure_size = (15, 10)
font_size = 13
fig, axes = plt.subplots(n_rows, n_columns, figsize=figure_size)
col_iterator = grouper(3, df.columns)
axes_iterator = subplot_axes_iterator(n_rows, n_columns)
plot_names = [
    "Group 1",
    "Group 2",
    "Group 3",
    "Group 4",
    "Group 5",
]
for column_group_names, axes_position, plot_name in \
        zip(col_iterator, axes_iterator, plot_names):
    print(f"plotting {column_group_names} at {axes_position}")
    column_group = get_trials(df, column_group_names)
    column_group.plot(ax=axes[axes_position])
    axes[axes_position].set_title(plot_name, fontweight="bold", size=font_size)
    axes[axes_position].set_xlabel("Time (s)", fontsize=font_size)
    axes[axes_position].set_ylabel("Wavelength (nm)", fontsize=font_size)
plt.tight_layout()
plt.show()

Calculate gap between two datasets (pandas, matplotlib, fill_between already used)

I'd like to ask for suggestions on how to calculate the length of the gap between two datasets plotted with matplotlib from a pandas dataframe. Ideally, I would like to have these gap values written in the plot and also, if possible, include them in the dataframe.
Here is my simplified example of dataframe:
import pandas as pd
d = {'Mean-1': [0.195842, 0.295069, 0.321345, 0.773725], 'SEM-1': [0.001216, 0.002687, 0.005267, 0.029974], 'Mean-2': [0.143103, 0.250505, 0.305767, 0.960804],'SEM-2': [0.000959, 0.001368, 0.003722, 0.150025], 'Atom Number': [1, 3, 5, 7]}
df=pd.DataFrame(d)
df
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number
0 0.195842 0.001216 0.143103 0.000959 1
1 0.295069 0.002687 0.250505 0.001368 3
2 0.321345 0.005267 0.305767 0.003722 5
3 0.773725 0.029974 0.960804 0.150025 7
Then I made a plot with two lines representing Mean-1 and Mean-2, and a shaded area around each line representing the standard error of the mean. This is done for the selected atom numbers.
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'])
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
plt.xticks(x)
What I would like to do further is to calculate the gap for each residue. The gap is the white space only, thus space where the lines as well as the shaded areas (SEMs) don't overlap.
And also would like to know if I can somehow print the gap values from the plot? And save them into column. Thank You for suggestions.

It's not a compact solution but you could try something like this (Check the order of things). Calculate all the position (y_i and upper and lower limits).
import numpy as np
df['y1_upper'] = y_1+error_1
df['y1_lower'] = y_1-error_1
df['y2_upper'] = y_2+error_2
df['y2_lower'] = y_2-error_2
which gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower
0 0.144319 0.141887
1 0.253192 0.247818
2 0.311034 0.300500
3 0.990778 0.930830
The distances (gaps) are calculated differently depending on whether y_1 is over y_2 or vice versa. So use conditions on the upper and lower limits and subtract the facing band edges row by row. (np.linalg.norm applied to a whole column returns a single number for the entire column, not one gap per row, so plain subtraction is used here.)
conditions = [
    (df['y1_lower'] >= df['y2_upper']),
    (df['y1_lower'] < df['y2_upper'])]
choices = [df['y1_lower'] - df['y2_upper'], df['y2_lower'] - df['y1_upper']]
df['dist'] = np.select(conditions, choices)
This gives
Mean-1 SEM-1 Mean-2 SEM-2 Atom Number y1_upper y1_lower \
0 0.195842 0.001216 0.143103 0.000959 1 0.197058 0.194626
1 0.295069 0.002687 0.250505 0.001368 3 0.297756 0.292382
2 0.321345 0.005267 0.305767 0.003722 5 0.326612 0.316078
3 0.773725 0.029974 0.960804 0.150025 7 0.803699 0.743751
y2_upper y2_lower dist
0 0.144319 0.141887 0.050307
1 0.253192 0.247818 0.039190
2 0.311034 0.300500 0.005044
3 0.990778 0.930830 0.127131
As I said, check the order, but this is a possible solution.
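A per-row gap can also be written without branching on which line is on top. The sketch below makes two assumptions that are not in the original: the second band uses SEM-2 (the question's code assigns SEM-1 to both error variables), and overlapping bands are reported as a gap of 0. Note that np.linalg.norm applied to a whole column returns one scalar, so element-wise subtraction is used instead.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Mean-1": [0.195842, 0.295069, 0.321345, 0.773725],
    "SEM-1": [0.001216, 0.002687, 0.005267, 0.029974],
    "Mean-2": [0.143103, 0.250505, 0.305767, 0.960804],
    "SEM-2": [0.000959, 0.001368, 0.003722, 0.150025],
    "Atom Number": [1, 3, 5, 7],
})

# Edges of the two shaded bands (assumption: band 2 uses SEM-2)
band1_lower = df["Mean-1"] - df["SEM-1"]
band1_upper = df["Mean-1"] + df["SEM-1"]
band2_lower = df["Mean-2"] - df["SEM-2"]
band2_upper = df["Mean-2"] + df["SEM-2"]

# White space = lower edge of the higher band minus upper edge of the lower band.
# A negative value means the bands overlap, so it is clipped to 0.
gap = np.maximum(band1_lower, band2_lower) - np.minimum(band1_upper, band2_upper)
df["gap"] = gap.clip(lower=0)
print(df[["Atom Number", "gap"]])
```

Because the result is an ordinary column, it can be saved with the rest of the dataframe or passed to ax.annotate to write the values on the plot.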

IIUC, do you want something like this:
import matplotlib.pyplot as plt
ax = df.plot(x='Atom Number', y=['Mean-1','Mean-2'], figsize=(15,8))
y_1 = df['Mean-1']
y_2 = df['Mean-2']
x = df['Atom Number']
error_1 = df['SEM-1']
error_2 = df['SEM-1']
ax.fill_between(df['Atom Number'], y_1-error_1, y_1+error_1, alpha=0.2, edgecolor='#CC4F1B', facecolor='#FF9848')
ax.fill_between(df['Atom Number'], y_2-error_2, y_2+error_2, alpha=0.2, edgecolor='#3F7F4C', facecolor='#7EFF99')
ax.fill_between(df['Atom Number'], y_1+error_1, y_2-error_2, alpha=.2, edgecolor='k', facecolor='blue')
for i in range(len(x)):
    gap = y_1[i]+error_1[i] - y_2[i]-error_2[i]
    ylabel = min(y_1[i], y_2[i]) + abs(gap) / 2
    _ = ax.annotate(f'{gap:0.4f}', xy=(x[i],ylabel), xytext=(x[i]-.14,y_1[i]+gap/abs(gap)*.2), arrowprops=dict(arrowstyle="-"))
plt.xticks(x);
Output:

Creating a bootstrap sample by group in python

I have a dataframe looking something like that:
y X1 X2 X3
ID year
1 2010 1 2 3 4
1 2011 3 4 5 6
2 2010 1 2 3 4
2 2011 3 4 5 6
2 2012 7 8 9 10
...
I'd like to create several bootstrap samples from the original df, run a fixed-effects panel regression on each bootstrap sample, and then store the corresponding beta coefficients. The approach I found for "normal" linear regression is the following:
betas = pd.DataFrame()
for i in range(10):
    # Creating a bootstrap sample with replacement
    bootstrap = df.sample(n=df.shape[0], replace=True)
    # Fit the regression and save beta coefficients
    DV_bs = bootstrap.y
    IV_bs = sm2.add_constant(bootstrap[['X1', 'X2', 'X3']])
    fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True ).fit(cov_type='clustered', cluster_entity=True)
    b = pd.DataFrame(fe_mod_bs.params)
    print(b.head())
    betas = pd.concat([betas, b], axis = 1, join = 'outer')
Unfortunately the bootstrap samples need to be selected by group for the panel regression, so that a complete ID is picked instead of just one row. I could not figure out how to extend the function to create a sample that way. So I basically have two questions:
Does the overall approach make sense for panel regression at all?
How do I adjust the bootstrapping so that the multilevel / panel structure is taken into account and complete IDs instead of single rows are "picked" during the bootstrapping?

I solved my problem with the following code:
companies = pd.DataFrame(df.reset_index().Company.unique())
betas_summary = pd.DataFrame()
for i in tqdm(range(1, 10001)):
    # Creating a bootstrap sample with replacement
    bootstrap = companies.sample(n=companies.shape[0], replace=True)
    bootstrap.rename(columns={bootstrap.columns[0]: "Company"}, inplace=True)
    Period = list(range(1, 25))
    list_of_bs_comp = bootstrap.Company.to_list()
    multiindex = [list_of_bs_comp, np.array(Period)]
    bs_df = pd.MultiIndex.from_product(multiindex, names=['Company', 'Period'])
    bs_result = df.loc[bs_df, :]
    # Fit the regression and save beta coefficients
    DV_bs = bs_result.y
    IV_bs = sm2.add_constant(bs_result[['X1', 'X2', 'X3']])
    fe_mod_bs = PanelOLS(DV_bs, IV_bs, entity_effects=True ).fit(cov_type='clustered', cluster_entity=True)
    b = pd.DataFrame(fe_mod_bs.params)
    b.rename(columns={'parameter':"b"}, inplace=True)
    # Accumulate across iterations; resetting the frame inside the loop would keep only the last fit
    betas_summary = pd.concat([betas_summary, b], axis = 1, join = 'outer')
where Company is my entity variable and Period is my time variable.
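The resampling step on its own can be sketched with pandas only. This is an illustration under assumptions, not the code above: the helper name bootstrap_by_entity is hypothetical, it uses the ID/year index from the question rather than Company/Period, and it relabels each drawn entity so that an ID picked twice is treated as two separate entities. It does not assume a balanced panel, and the regression step is left out.

```python
import numpy as np
import pandas as pd


def bootstrap_by_entity(panel, entity_level="ID", seed=None):
    """Resample whole entities with replacement from an (entity, time) MultiIndex panel."""
    rng = np.random.default_rng(seed)
    entities = panel.index.get_level_values(entity_level).unique()
    drawn = rng.choice(entities, size=len(entities), replace=True)
    time_level = [name for name in panel.index.names if name != entity_level][0]
    pieces = []
    for new_id, entity in enumerate(drawn):
        # Take every row of the drawn entity, keeping its time index
        block = panel.xs(entity, level=entity_level, drop_level=False).reset_index()
        block[entity_level] = new_id  # relabel so repeated draws stay distinct
        pieces.append(block)
    return pd.concat(pieces, ignore_index=True).set_index([entity_level, time_level])


# The small unbalanced panel from the question
index = pd.MultiIndex.from_tuples(
    [(1, 2010), (1, 2011), (2, 2010), (2, 2011), (2, 2012)], names=["ID", "year"]
)
df = pd.DataFrame(
    {"y": [1, 3, 1, 3, 7], "X1": [2, 4, 2, 4, 8], "X2": [3, 5, 3, 5, 9], "X3": [4, 6, 4, 6, 10]},
    index=index,
)

sample = bootstrap_by_entity(df, seed=0)
print(sample)
```

Each call returns a frame with the same index layout as the original, so it could be passed to the model-fitting step inside the bootstrap loop.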

Visualize multidimensional datasets with MDS

I am trying to visualize the 3 features of my dataframe using MDS to scale them to 2 dimensions.
So, I performed MDS in 2 dimensions to plot the new data, giving each point a different color according to the target variable. My target variable is 'Type'.
In: df
Sales hours month Type
243 13 5 A
111 4 3 B
250 7 7 C
101 12 1 A
X = df
X = pd.get_dummies(X)
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
# Apply the MDS
mds = MDS(2,random_state=0)
X_2d = mds.fit_transform(X_scaled)
# Plot the new dataset.
colors = ['red','green','blue']
plt.rcParams['figure.figsize'] = [7, 7]
plt.rc('font', size=14)
for i in np.unique(df.Type):
subset = X_2d[df.Type == i]
x = [row[0] for row in subset]
y = [row[1] for row in subset]
plt.scatter(x,y,c=colors[i],label= df.target_names[i])
plt.legend()
plt.show()
When I applied the MDS, it works well and the new dataset is generated.
But my problem is in the plotting.
TypeError: list indices must be integers or slices, not str
----> plt.scatter(x,y,c=colors[i],label=all_outliers_type.target_names[i])

Seems like your indentation is off: you're calling colors[i] outside of your for loop, and i seems to be one of "A", "B", "C".
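One way to avoid indexing a list with a string is to map each type to a color up front. The sketch below is only an illustration: X_2d is a made-up array standing in for the MDS output, and the name color_by_type is hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "Sales": [243, 111, 250, 101],
    "hours": [13, 4, 7, 12],
    "month": [5, 3, 7, 1],
    "Type": ["A", "B", "C", "A"],
})
# Made-up 2-D coordinates standing in for mds.fit_transform(X_scaled)
X_2d = np.array([[0.1, 0.2], [0.5, -0.3], [-0.4, 0.6], [0.0, 0.1]])

# Look colors up by type name instead of by integer position
color_by_type = dict(zip(sorted(df["Type"].unique()), ["red", "green", "blue"]))

fig, ax = plt.subplots(figsize=(7, 7))
for type_name, color in color_by_type.items():
    subset = X_2d[(df["Type"] == type_name).to_numpy()]
    # The scatter call stays inside the loop so each type gets its own color and label
    ax.scatter(subset[:, 0], subset[:, 1], c=color, label=type_name)
ax.legend()
```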

How to perform time series analysis that contains multiple groups in Python using fbProphet or other models?

All,
My dataset looks like the following. I am trying to predict 'Amount' for the next 6 months using either fbProphet or another model. My issue is that I would like a separate prediction for each group (A, B, C, D) for the next 6 months, and I am not sure how to do that in Python using fbProphet or another model. I looked at the official fbprophet page, but the only information I found is that Prophet takes just two columns: one is "Date" and the other is "amount".
I am new to python, so any help with code explanation is greatly appreciated!
import pandas as pd
data = {'Date':['2017-01-01', '2017-02-01', '2017-03-01', '2017-04-01','2017-05-01','2017-06-01','2017-07-01'],'Group':['A','B','C','D','C','A','B'],
'Amount':['12.1','13','15','10','12','9.0','5.6']}
df = pd.DataFrame(data)
print (df)
output:
Date Group Amount
0 2017-01-01 A 12.1
1 2017-02-01 B 13
2 2017-03-01 C 15
3 2017-04-01 D 10
4 2017-05-01 C 12
5 2017-06-01 A 9.0
6 2017-07-01 B 5.6

fbprophet requires two columns ds and y, so you need to first rename the two columns
df = df.rename(columns={'Date': 'ds', 'Amount':'y'})
Assuming that your groups are independent from each other and you want to get one prediction for each group, you can group the dataframe by "Group" column and run forecast for each group
from fbprophet import Prophet
grouped = df.groupby('Group')
for g in grouped.groups:
    group = grouped.get_group(g)
    m = Prophet()
    m.fit(group)
    future = m.make_future_dataframe(periods=365)
    forecast = m.predict(future)
    print(forecast.tail())
Take note that the input dataframe supplied in the question is not sufficient for the model, because group D has only a single data point. fbprophet's forecast needs at least 2 non-NaN rows.
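The data preparation can be checked before any model is fitted. This is a minimal pandas-only sketch (no Prophet call is made); the names usable and per_group are hypothetical, and the threshold of 2 rows follows the note above.

```python
import pandas as pd

data = {
    "Date": ["2017-01-01", "2017-02-01", "2017-03-01", "2017-04-01",
             "2017-05-01", "2017-06-01", "2017-07-01"],
    "Group": ["A", "B", "C", "D", "C", "A", "B"],
    "Amount": ["12.1", "13", "15", "10", "12", "9.0", "5.6"],
}
df = pd.DataFrame(data).rename(columns={"Date": "ds", "Amount": "y"})
df["ds"] = pd.to_datetime(df["ds"])
df["y"] = pd.to_numeric(df["y"])  # the sample data stores amounts as strings

# Keep only groups with at least 2 non-NaN observations
usable = df.groupby("Group").filter(lambda g: g["y"].notna().sum() >= 2)

# One two-column frame per group, in the ds/y layout the answer describes
per_group = {name: g[["ds", "y"]] for name, g in usable.groupby("Group")}
print(sorted(per_group))
```

With the sample data, group D is dropped and the remaining groups each keep two rows.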
EDIT: if you want to merge all predictions into one dataframe, the idea is to name the yhat for each observations differently, do pd.merge() in the loop, and then cherry-pick the columns that you need at the end:
final = pd.DataFrame()
for g in grouped.groups:
    group = grouped.get_group(g)
    m = Prophet()
    m.fit(group)
    future = m.make_future_dataframe(periods=365)
    forecast = m.predict(future)
    forecast = forecast.rename(columns={'yhat': 'yhat_'+g})
    final = pd.merge(final, forecast.set_index('ds'), how='outer', left_index=True, right_index=True)
final = final[['yhat_' + g for g in grouped.groups.keys()]]

import pandas as pd
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX
# The old statsmodels.tsa.arima_model.ARIMA was removed in newer statsmodels releases
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from matplotlib import pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_log_error
# Before doing any modeling using ARIMA, SARIMAX, etc., confirm that
# your time series is stationary by using the Augmented Dickey-Fuller test
# or other tests.
# Create a list of all groups or get from Data using np.unique or other methods
groups_iter = ['A', 'B', 'C', 'D']
dict_org = {}
dict_pred = {}
group_accuracy = {}
# Iterate over all groups and get data
# from the DataFrame (df from the question) by filtering for the specific group
for i in range(len(groups_iter)):
    X = df[df['Group'] == groups_iter[i]]['Amount'].astype(float).values
    size = int(len(X) * 0.70)
    train, test = X[0:size], X[size:len(X)]
    history = [x for x in train]
    predictions = []
    # Using ARIMA model here you can also do grid search for best parameters
    for t in range(len(test)):
        model = ARIMA(history, order=(5, 1, 0))
        model_fit = model.fit()
        output = model_fit.forecast()
        yhat = output[0]
        predictions.append(yhat)
        obs = test[t]
        history.append(obs)
        print("Predicted:%f, expected:%f" % (yhat, obs))
    error = mean_squared_log_error(test, predictions)
    dict_org.update({groups_iter[i]: test})
    dict_pred.update({groups_iter[i]: predictions})
    print("Group: ", groups_iter[i], "Test MSLE:%f" % error)
    group_accuracy.update({groups_iter[i]: error})
    plt.plot(test)
    plt.plot(predictions, color='red')
    plt.show()

I know this is old, but I was trying to predict outcomes for different clients. I tried to use Aditya Santoso's solution above but ran into some errors, so I made a couple of modifications, and this finally worked for me:
df = pd.read_csv('file.csv')
df = pd.DataFrame(df)
df = df.rename(columns={'date': 'ds', 'amount': 'y', 'client_id': 'client_id'})
#I had to filter first clients with less than 3 records to avoid errors as prophet only works for 2+ records by group
df = df.groupby('client_id').filter(lambda x: len(x) > 2)
df.client_id = df.client_id.astype(str)
final = pd.DataFrame(columns=['client','ds','yhat'])
grouped = df.groupby('client_id')
for g in grouped.groups:
    group = grouped.get_group(g)
    m = Prophet()
    m.fit(group)
    future = m.make_future_dataframe(periods=365)
    forecast = m.predict(future)
    #I added a column with client id
    forecast['client'] = g
    #I used concat instead of merge
    final = pd.concat([final, forecast], ignore_index=True)
final.head(10)
