How to dynamically name a dataframe within this for loop - python

I have numerous dataframes and each dataframe has about 100 different chemical compounds and a categorical variable listing the type of material. For example, a smaller version of my datasets would look something like this:
Decane  Octanal  Material
1       20       Water
2       1        Glass
10      5        Glass
9       4        Water
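For reference, a minimal snippet to rebuild this sample data (column names taken from the table above):
import pandas as pd
df = pd.DataFrame({
    'Decane': [1, 2, 10, 9],
    'Octanal': [20, 1, 5, 4],
    'Material': ['Water', 'Glass', 'Glass', 'Water'],
})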
I am using a linear regression model to regress the chemicals onto the material type. I want to be able to dynamically rename the results dataframe based on which dataset I am using. My code looks like this (where 'feature_cols' are the names of the chemicals):
count = 0
dataframe = []
# loop through the three datasets (in reality I have many more than three)
for dataset in [first, second, third]:
    count += 1
    for feature in feature_cols:
        # define the model and fit it
        mod = smf.ols(formula=f'Q("{feature}") ~ material', data=dataset)
        res = mod.fit()
        # create a dataframe of the p-values
        # I would like to be able to dynamically name pvalues so that when looping through
        # the chemicals of the first dataframe it is called 'pvalues_first' and so on.
        pvalues = pd.DataFrame(res.pvalues)

You can use a dictionary (here with dummy values):
names = ['first', 'second', 'third', 'fourth', 'fifth', 'sixth']
pvalues = {}
for i in range(len(names)):
    pvalues["pvalues_" + names[i]] = i + 1
print(pvalues)
Output:
{'pvalues_first': 1, 'pvalues_second': 2, 'pvalues_third': 3, 'pvalues_fourth': 4, 'pvalues_fifth': 5, 'pvalues_sixth': 6}
To access and update pvalues_third, for example:
pvalues["pvalues_third"] = 20
print(pvalues)
Output:
{'pvalues_first': 1, 'pvalues_second': 2, 'pvalues_third': 20, 'pvalues_fourth': 4, 'pvalues_fifth': 5, 'pvalues_sixth': 6}

pvalues = {}
names = ["first", "second", "third"]
# loop through the three datasets (in reality there are many more than three)
for count, dataset in enumerate([first, second, third]):
    for feature in feature_cols:
        # define the model and fit it
        mod = smf.ols(formula=f'Q("{feature}") ~ material', data=dataset)
        res = mod.fit()
        # build the key dynamically: 'pvalues_first', 'pvalues_second', ...
        name_str = "pvalues_" + names[count]
        # note: each feature overwrites the previous entry; include the
        # feature in the key if you need one table per chemical
        pvalues[name_str] = pd.DataFrame(
            {'Intercept': [res.pvalues.iloc[0]], 'material': [res.pvalues.iloc[1]]})

Related

How to show top 10 ranking with different magnitude

I want to create an overall ranking (but in my real data the features do not have the same magnitude at all).
For example, if the top 10 of feature 6 looks like 10^6, 9^6, ..., 2^6, the values in feature 1 are like 10^2, 9^2, ..., 2^2.
Hence the overall ranking would simply be the ranking of feature 6, as it is dominated by the magnitude, and the given weight is insignificant for influencing the ranking.
I want to create a new column (or a new dataframe) for the overall ranking.
A column that takes into account the ranking of each feature (hence eliminating the raw values).
In a second step, rank the countries with the given weight for each feature, in order to plot the overall ranking of the 10 features.
Also, it would be great if I could visualize the result with matplotlib, even though each column holds a dictionary.
This is the dataframe I have:
import pandas as pd
import numpy as np
data = np.random.randint(100,size=(12,10))
countries = [
    'Country1',
    'Country2',
    'Country3',
    'Country4',
    'Country5',
    'Country6',
    'Country7',
    'Country8',
    'Country9',
    'Country10',
    'Country11',
    'Country12',
]
feature_names_weights = {
    'feature1': 1.0,
    'feature2': 4.0,
    'feature3': 1.0,
    'feature4': 7.0,
    'feature5': 1.0,
    'feature6': 1.0,
    'feature7': 8.0,
    'feature8': 1.0,
    'feature9': 9.0,
    'feature10': 1.0,
}
feature_names = list(feature_names_weights.keys())
df = pd.DataFrame(data=data, index=countries, columns=feature_names)
data_etude_copy = df
data_sorted_by_feature = {}
country_scores = (pd.DataFrame(data=np.zeros(len(countries)),index=countries))[0]
for feature in feature_names:
    # Adds to each country's score, multiplied by the weight factor for the feature
    for country in countries:
        country_scores[country] += data_etude_copy[feature][country] * feature_names_weights[feature]
    # Sorts the countries by feature (your code in loop form)
    data_sorted_by_feature[feature] = data_etude_copy.sort_values(by=[feature], ascending=False).head(10)
    data_sorted_by_feature[feature].drop(data_sorted_by_feature[feature].loc[:, data_sorted_by_feature[feature].columns != feature], inplace=True, axis=1)
#sort country total scores
ranked_countries = country_scores.sort_values(ascending=False).head(10)
##Put everything into one DataFrame
#Create empty DataFrame
empty_data = np.empty((10, 10), str)
outputDF = pd.DataFrame(data=empty_data, columns=feature_names)
#Add entries for all features
for feature in feature_names:
    for index in range(10):
        country = list(data_sorted_by_feature[feature].index)[index]
        outputDF[feature][index] = f'{country}: {data_sorted_by_feature[feature][feature][country]}'
#Add column for overall country score
#Print DataFrame
outputDF
The data in my dataframe's features are not normalized, just "ranked".
The expected output would be something like a sum of the normalized rankings with their corresponding weights:
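To make the goal concrete, here is a rough sketch (an assumption about the desired aggregation, not my actual code) using the df and feature_names_weights defined above: rank each feature column first, which removes the magnitude differences, then take a weighted sum of the ranks.
# Rank within each feature column (1 = best), eliminating the raw magnitudes
ranks = df.rank(ascending=False)
# Weighted sum of the per-feature ranks; a smaller total means a better overall position
weights = pd.Series(feature_names_weights)
overall_score = ranks.mul(weights, axis=1).sum(axis=1)
# Overall top 10
overall_ranking = overall_score.sort_values().head(10)
print(overall_ranking)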

Grouped Time Series forecasting with scikit-hts

I am trying to forecast sales for multiple time series I took from Kaggle's Store Item Demand Forecasting challenge. It consists of long-format time series for 10 stores and 50 items, resulting in 500 time series stacked on top of each other. For each store and each item, I have 5 years of daily records with weekly and annual seasonalities.
In total there are: 365.2 days * 5 years * 10 stores * 50 items = 913,000 records.
From my understanding based on what I've read so far on Hierarchical and Grouped time series, the whole dataframe could be structured as a Grouped Time Series and not simply as a strict Hierarchical Time Series as aggregation could be done at the store or item levels interchangeably.
I want to find a way to forecast all 500 time series (for store1_item1, store1_item2,..., store10_item50) for the next year (from 01-jan-2015 to 31-dec-2015) using the scikit-hts library and its AutoArimaModel function which is a wrapper function of pmdarima's AutoArima function.
To handle the two levels of seasonality, I added Fourier terms as exogenous features to deal with the annual seasonality while auto_arima deals with the weekly seasonality.
My problem is that I got an error during the prediction step.
Here's the error message :
ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).
I assume something is wrong with the exogenous dictionary but I do not know how to solve the issue as I'm using scikit-hts for the first time. To do this, I followed the official documentation of scikit-hts here.
EDIT: ________________________________________________________________
I had not seen that a similar bug was already reported on GitHub. After implementing the proposed fix locally, I can get some results. However, even though the code now runs without errors, some of the forecasts are negative, as raised in the comments below this post, and we even get disproportionate values for the positive ones.
Here are the plots for all the combinations of store and item. You can see that this seems to work for only one combination.
df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()
df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()
df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()
df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()
_____________________________________________________________________
Complete code:
# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor
# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)
# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']
# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]
# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]
# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index='date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)
# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
store_item_ts.columns = col_names
# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(axis=1)
# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)
# Build fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)
# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns = [f'store_{i}_item_{j}_' + x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)
# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis=1)
# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()
for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns = [f'store_{i}_item_{j}_' + x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)
# Build the hierarchy of the Grouped Time Series
stores = list(stores_ts.columns)
items = list(items_ts.columns)
store_items = col_names
# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}
# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}
# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}
# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)
# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)
# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)
# Make predictions
# Set the exogenous_df param
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
Other approaches I thought of, and that I already implemented successfully for one series (for store 1 and item 1, for example):
TBATS applied to each series independently inside a loop across all 500 time series
auto_arima (SARIMAX) with exogenous features (=Fourier terms to deal with the weekly and annual seasonalities) for each series independently + a loop across all 500 time series
What do you think of these approaches? Do you have other suggestions on how to scale ARIMA to multiple time series?
I also want to try LSTM but I'm new to data science and deep learning and do not know how to prepare the data. Should I keep the data in their original form (long format) and apply one hot encoding to train_data['store'] and train_data['item'] columns or should I start with the df I ended up with here?
I hope this helped you fix the issue with the exogenous regressors. To handle negative forecasts, I would suggest you try a square-root transformation.
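A rough sketch of that suggestion, reusing the names from your code above (untested with scikit-hts on my side; note the Fourier columns can be negative, so only the series columns are transformed):
import numpy as np

# square-root-transform the series columns only
series_cols = ['total'] + stores + items + store_items
df_t = df.copy()
df_t[series_cols] = np.sqrt(df_t[series_cols])
model = autoarima.fit(df_t, hierarchy, exogenous=exogenous)

# square the forecasts to return to the original scale;
# squaring also guarantees non-negative values
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365) ** 2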

Creating column names from multiple lists using for loop

Say I have multiple lists:
names1 = ['name11', 'name12', etc]
names2 = ['name21', 'name22', etc]
names3 = ['name31', 'name32', etc]
How do I create a for loop that combines the components of the lists in order ('name11name21name31', 'name11name21name32' and so on)?
I want to use this to name columns as I add them to a data frame. I tried like this:
Results['{}' .format(model_names[j]) + '{}' .format(Data_names[i])] = proba.tolist()
I am trying to take some results that I obtain as an array, introduce them one by one into a data frame, and give the columns names as I go. It is for a machine learning model I am trying to build.
This is the whole code, I am sure it is messy because I am a beginner.
Train = [X_train_F, X_train_M, X_train_R, X_train_SM]
Test = [X_test_F, X_test_M, X_test_R, X_test_SM]
models_to_run = [knn, svc, forest, dtc]
model_names = ['knn', 'svc', 'forest', 'dtc']
Data_names = ['F', 'M', 'R', 'SM']
Results = pd.DataFrame()
for T, t in zip(Train, Test):
    for j, model in enumerate(models_to_run):
        model.fit(T, y_train.values.ravel())
        proba = model.predict_proba(t)
        proba = pd.DataFrame(proba.max(axis=1))
        proba = proba.to_numpy()
        proba = proba.flatten()
        Results['{}'.format(model_names[j]) + '{}'.format(Data_names[i])] = proba.tolist()
I don't know how to integrate 'i' into the loop so I can use it to step through the list Data_names and append the data name to the column name. I am sure there is a cleaner way to do this. Please be gentle.
Edit: It currently gives me a data frame with 4 columns instead of the 16 it should have, and it just adds the whole Data_names list to each column name.
How about:
Results = {}
for T, t, dname in zip(Train, Test, Data_names):
    for mname, model in zip(model_names, models_to_run):
        ...
        Results[(dname, mname)] = proba.tolist()
Results = pd.DataFrame(Results.values(), index=Results.keys()).T
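The tuple keys end up as tuple column labels after the transpose; if you want the 16 flat names you described ('knnF', 'svcF', ...), you could rename them afterwards:
Results.columns = [f'{mname}{dname}' for dname, mname in Results.columns]
As for the literal question in the title (combining the components of several lists in order), a minimal sketch with itertools.product:
from itertools import product

names1 = ['name11', 'name12']
names2 = ['name21', 'name22']
names3 = ['name31', 'name32']

combined = [''.join(combo) for combo in product(names1, names2, names3)]
# ['name11name21name31', 'name11name21name32', 'name11name22name31', ...]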

dataframe generating own column names

For a project, I want to create a script that allows the user to enter values (like a value in centimetres) multiple times. I had a While-loop in mind for this.
The values need to be stored in a dataframe, which will be used to generate a graph of the values.
Also, there is no maximum number of entries the user can make, so the names of the variables that hold the values have to be generated with each entry (such as M1, M2, M3…Mn). However, the dataframe will only consist of one row (only for the specific case the user is entering values for).
So, my question boils down to this:
How do I create a dataframe (with pandas) where the script generates its own column name for a measurement, like M1, M2, M3, …Mn, so that all the values are stored.
I can't access my code right now, but I have created a while loop that allows the user to enter values; I'm stuck on the dataframe and columns part.
Any help would be greatly appreciated!
I agree with @mischi, without additional context, pandas seems overkill, but here is an alternate method to create what you describe...
This code proposes a method to collect the values using a while loop and input() (your while loop is probably similar).
from pandas import DataFrame

colnames = []
inputs = []
counter = 0
while True:
    value = input('Add a value: ')
    if value == 'q':  # provides a way to leave the loop
        break
    else:
        key = 'M' + str(counter)
        counter += 1
        colnames.append(key)
        inputs.append(value)

df = DataFrame(inputs, colnames)  # this creates a DataFrame with a single
                                  # column and an index using the colnames
df = df.T                         # this transposes the DataFrame so the
                                  # indexes become the colnames
df.index = ['values']             # sets the name of your row
print(df)
The output of this script looks like this...
Add a value: 1
Add a value: 2
Add a value: 3
Add a value: 4
Add a value: q
       M0 M1 M2 M3
values  1  2  3  4
pandas seems a bit of an overkill, but to answer your question:
Assuming you collect numerical values from your users and store them in a list:
import numpy as np
import pandas as pd

# np.random.random_integers is deprecated; with randint the upper bound is exclusive
values = np.random.randint(0, 11, 10)
print(values)
array([1, 5, 0, 1, 1, 1, 4, 1, 9, 6])
columns = {}
column_base_name = 'Column'
for i, value in enumerate(values):
    columns['{:s}{:d}'.format(column_base_name, i)] = value
print(columns)
{'Column0': 1,
'Column1': 5,
'Column2': 0,
'Column3': 1,
'Column4': 1,
'Column5': 1,
'Column6': 4,
'Column7': 1,
'Column8': 9,
'Column9': 6}
df = pd.DataFrame(data=columns, index=[0])
print(df)
   Column0  Column1  Column2  Column3  Column4  Column5  Column6  Column7  \
0        1        5        0        1        1        1        4        1

   Column8  Column9
0        9        6
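For what it's worth, the same single-row frame can also be built in one step once the values are collected (a sketch, using the M1...Mn naming from the question):
import pandas as pd

values = [1, 2, 3, 4]  # whatever the user entered
df = pd.DataFrame([values],
                  columns=[f'M{i + 1}' for i in range(len(values))],
                  index=['values'])
print(df)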

Trouble with pandas iterrows and loop counter

I have a dataset containing the US treasury curve for each day over a few years. Rows = dates, columns = tenor of a specific treasury bond (3 mo, 1 yr, 10 yr, etc.).
I have python code that loops through each day and calibrates parameters for an interest rate model. I am having trouble looping through each row via iterrows and with my loop counter. The goal is to go row by row and calibrate the model to that daily curve, store the calibrated parameters in a dataframe, and then move onto the next row and repeat.
def do_calibration_model1():
    global i
    for index, row in curves.iterrows():
        day = np.array(row)  # the subsequent error_fxn uses this daily curve
        calibration()
        i += 1

def calibration():
    i = 0
    param = scipy.brute(error_fxn, bounds...., etc.)
    opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
    calibration.loc[i] = np.array(opt)  # store result of minimization (parameters for that day)
The code works correctly for the first iteration, but then it keeps repeating the calibration for the first row of the curves dataframe. Furthermore, it does not store the parameters in the next row of the calibration dataframe. I view the first issue as relating to iterrows and the second as a loop counter issue.
Any thoughts on what is going wrong? I have a Matlab background and find the pandas setup to be very frustrating.
For reference I have consulted the links below to no avail.
https://www.python.org/dev/peps/pep-0212/
http://nipunbatra.github.io/2015/06/pandas-iteration/
Per Jason's comment below I have updated the code to:
def do_calibration_model1():
    global i
    for index, row in curves.iterrows():
        for i in range(0, len(curves)):
            day = np.array(row)  # the subsequent error_fxn uses this daily curve
            param = scipy.brute(error_fxn, bounds...., etc.)
            opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
            calibration.loc[i] = np.array(opt)  # store result of minimization (parameters for that day)
            i += 1
The revised code now places the appropriate parameters in each row of the calibration dataframe based on the loop counter.
However, it still does not move on to the second (or subsequent) rows of the curves dataframe in the pandas iterrows loop.
Each time calibration is called, you set i = 0. As a result, when you call calibration.loc[i] = np.array(opt), what is being written is item 0 of calibration. The variable i is never actually anything except 0 in this function.
In the function do_calibration_model1(), you declare global i and then increment i by one at the end of each loop iteration. I'm not sure what this i counter is meant to accomplish. Perhaps you think that the i in do_calibration_model1() is updating the value of the i variable in the calibration() function, but this is not the case. Given that there is no global i statement in calibration(), the i in that function is a local variable.
Regarding iterrows, I don't think you need the embedded for loop that cycles through the length of curves. Here's a quick example to show you how iterrows works:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
new = pd.DataFrame({'sum': [],
                    'mean': []})
for index, row in df.iterrows():
    temp = {'sum': sum(row), 'mean': np.mean(row)}
    new = new.append(temp, ignore_index=True)
In the above, df looks like this:
          A         B         C         D
0 -2.197018  1.905543  0.773851 -0.006683
1  0.675442  0.818040 -0.561957  0.002737
2 -0.833482  0.248135 -1.159698 -0.302912
3  0.784216 -0.156225 -0.043505 -2.539486
4 -0.637248  0.034303 -1.405159 -1.590045
5  0.289257 -0.085030 -0.619899 -0.211158
6  0.804702 -0.838365  0.199911  0.210378
7 -0.031306  0.166793 -0.200867  1.343865
And the new dataframe populated through the iterrows loop looks like this:
       mean       sum
0  0.118923  0.475693
1  0.233566  0.934262
2 -0.511989 -2.047958
3 -0.488750 -1.954999
4 -0.899537 -3.598148
5 -0.156707 -0.626830
6  0.094157  0.376626
7  0.319621  1.278485
Note that using append here makes an i counter unnecessary and simplifies the code.
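(As an aside: DataFrame.append was removed in pandas 2.0, so on current pandas the same counter-free pattern is usually written by collecting the rows in a list first. A minimal sketch:)
rows = []
for index, row in df.iterrows():
    rows.append({'sum': sum(row), 'mean': np.mean(row)})
new = pd.DataFrame(rows)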
Returning to your code, I suggest something like the following:
def do_calibration_model1():
    calibration = pd.DataFrame({'a': [],
                                'b': []})
    for index, row in curves.iterrows():
        day = np.array(row)
        param = scipy.brute(error_fxn, bounds...., etc.)
        opt = scipy.fmin(error_fxn, param, xtol..., ftol...)
        temp = {'a': ..., 'b': ...}  # put opt values into dict
        calibration = calibration.append(temp, ignore_index=True)
    return calibration
In the step calibration = pd.DataFrame({'a': [], 'b': []}) you will need to set up the dataframe to ingest opt. Previously, you transformed opt into a numpy array, but you will need to arrange the values of opt so they fit your calibration dataframe, in the same way that I did for temp here: temp = {'sum': sum(row), 'mean': np.mean(row)}.
