Possible to assign variables based on element in for loop? - python

I have the following code:
for asset in port_close.columns.drop('P&L'):
    forecasts = {}
    am = arch_model(port_close[asset])
    for i in range(20):
        res = am.fit(first_obs=i, last_obs=i + end_loc, disp='off')
        temp = res.forecast(horizon=1).variance
        fcast = temp.iloc[i + end_loc - 1]
        forecasts[fcast.name] = fcast
    asset+'_vol' = pd.DataFrame(forecasts).T
I understand why that last line doesn't work, but in this for loop I want each name "asset" to be assigned to a separate dataframe. The "asset" in the first for loop references the columns ['AAPL', 'GOOGL', 'IBM']. So I want a dataframe called AAPL_vol and so on. Anyone know how to achieve this?

It sounds like you are trying to create variable names dynamically, which is awkward in Python and rarely a good idea. Instead, I suggest building a dictionary of vols, using something like AAPL_vol as the key and the corresponding dataframe as the value.
Something like this:
vols = {}
for asset in port_close.columns.drop('P&L'):
    forecasts = {}
    am = arch_model(port_close[asset])
    for i in range(20):
        res = am.fit(first_obs=i, last_obs=i + end_loc, disp='off')
        temp = res.forecast(horizon=1).variance
        fcast = temp.iloc[i + end_loc - 1]
        forecasts[fcast.name] = fcast
    vols[asset+'_vol'] = pd.DataFrame(forecasts).T
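Once the frames live in a dictionary, they can be looked up by name or looped over. The following is a minimal sketch of that usage, not the original poster's data: the asset list is taken from the question, while the small frames and the 'h.1' column name are made-up placeholders.

```python
import pandas as pd

# Hypothetical stand-ins for the per-asset forecast frames built in the loop above.
assets = ['AAPL', 'GOOGL', 'IBM']
vols = {}
for k, asset in enumerate(assets):
    vols[asset + '_vol'] = pd.DataFrame({'h.1': [k + 1, 2 * (k + 1)]})

# Look up one frame by name instead of relying on a variable called AAPL_vol.
aapl_vol = vols['AAPL_vol']

# Iterate over every frame without knowing the names in advance.
for name, frame in vols.items():
    print(name, frame.shape)

# Optionally stack everything into one frame; the dict keys become
# the outer level of the resulting index.
combined = pd.concat(vols)
print(combined)
```

Keeping the frames in one container also makes it easy to pass them all to another function at once.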

Related

Creating column names from multiple lists using for loop

Say I have multiple lists:
names1 = [name11, name12, etc]
names2 = [name21, name22, etc]
names3 = [name31, name32, etc]
How do I create a for loop that combines the components of the lists in order ('name11name21name31', 'name11name21name32' and so on)?
I want to use this to name columns as I add them to a data frame. I tried like this:
Results['{}' .format(model_names[j]) + '{}' .format(Data_names[i])] = proba.tolist()
I am trying to take some results that I obtain as an array and add them one by one to a data frame, naming the columns as I go. It is for a machine learning model I am trying to build.
This is the whole code. I am sure it is messy because I am a beginner.
Train = [X_train_F, X_train_M, X_train_R, X_train_SM]
Test = [X_test_F, X_test_M, X_test_R, X_test_SM]
models_to_run = [knn, svc, forest, dtc]
model_names = ['knn', 'svc' ,'forest', 'dtc']
Data_names = ['F', 'M', 'R', 'SM']
Results = pd.DataFrame()
for T, t in zip(Train, Test):
    for j, model in enumerate(models_to_run):
        model.fit(T, y_train.values.ravel())
        proba = model.predict_proba(t)
        proba = pd.DataFrame(proba.max(axis=1))
        proba = proba.to_numpy()
        proba = proba.flatten()
        Results['{}' .format(model_names[j]) + '{}' .format(Data_names[i])] = proba.tolist()
I don't know how to integrate 'i' into the loop so that it goes through the list Data_names and adds each entry to the column name. I am sure there is a cleaner way to do this. Please be gentle.
Edit: It currently gives me a data frame with 4 columns instead of the expected 16, and it just adds the whole Data_names list to the column name.
How about:
Results = {}
for T, t, dname in zip(Train, Test, Data_names):
    for mname, model in zip(model_names, models_to_run):
        ...
        Results[(dname, mname)] = proba.tolist()
Results = pd.DataFrame(Results.values(), index=Results.keys()).T
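If only the combined names are needed, itertools.product from the standard library generates every combination in order. This is a small sketch rather than part of the answer above; the name strings in the three-list case are placeholders for the question's name11, name21 and so on.

```python
from itertools import product

model_names = ['knn', 'svc', 'forest', 'dtc']
data_names = ['F', 'M', 'R', 'SM']

# product() yields every (model, dataset) pair; the last list varies fastest.
column_names = [m + d for m, d in product(model_names, data_names)]
print(len(column_names))   # 16
print(column_names[:5])    # ['knnF', 'knnM', 'knnR', 'knnSM', 'svcF']

# The same idea works for three (or more) lists of placeholder names.
names1 = ['name11', 'name12']
names2 = ['name21', 'name22']
names3 = ['name31', 'name32']
combined = [''.join(parts) for parts in product(names1, names2, names3)]
print(combined[:2])        # ['name11name21name31', 'name11name21name32']
```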

Change Column values in pandas applying another function

I have a data frame in pandas, one of the columns contains time intervals presented as strings like 'P1Y4M1D'.
The example of the whole CSV:
oci,citing,cited,creation,timespan,journal_sc,author_sc
0200100000236252421370109080537010700020300040001-020010000073609070863016304060103630305070563074902,"10.1002/pol.1985.170230401","10.1007/978-1-4613-3575-7_2",1985-04,P2Y,no,no
...
I created a parsing function that takes a string such as 'P1Y4M1D' and returns an integer.
How can I replace all the column values with the parsed values using that function?
def do_process_citation_data(f_path):
    global my_ocan
    my_ocan = pd.read_csv("citations.csv",
        names=['oci', 'citing', 'cited', 'creation', 'timespan', 'journal_sc', 'author_sc'],
        parse_dates=['creation', 'timespan'])
    my_ocan = my_ocan.iloc[1:] # to remove the first row iloc - to select data by row numbers
    my_ocan['creation'] = pd.to_datetime(my_ocan['creation'], format="%Y-%m-%d", yearfirst=True)
    return my_ocan
def parse():
    mydict = dict()
    mydict2 = dict()
    i = 1
    r = 1
    for x in my_ocan['oci']:
        mydict[x] = str(my_ocan['timespan'][i])
        i +=1
    print(mydict)
    for key, value in mydict.items():
        is_negative = value.startswith('-')
        if is_negative:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value[1:])
        else:
            date_info = re.findall(r"P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$", value)
        year, month, day = [int(num) if num else 0 for num in date_info[0]] if date_info else [0,0,0]
        daystotal = (year * 365) + (month * 30) + day
        if not is_negative:
            #mydict2[key] = daystotal
            return daystotal
        else:
            #mydict2[key] = -daystotal
            return -daystotal
    #print(mydict2)
    #return mydict2
Probably I do not even need to replace the whole column with parsed values. The final goal is to write a new function that returns the average ['timespan'] of docs created in a particular year. Since I need parsed values, I thought it would be easier to change the whole column and work with a new data frame.
Also, I am curious how to apply the parsing function to each ['timespan'] row without modifying the data frame. I can only assume it could be something like this, but I don't fully understand how to do that:
for x in my_ocan['timespan']:
    x = parse(str(my_ocan['timespan']))
How can I get a column with new values? Thank you! Peace :)
A df['timespan'].apply(parse) (as mentioned by @Dan) should work. You only need to modify the parse function so that it receives the string as an argument and returns the parsed value at the end. Something like this:
import pandas as pd
def parse_postal_code(postal_code):
    # Splitting postal code and getting first letters
    letters = postal_code.split('_')[0]
    return letters
# Example dataframe with three columns and three rows
df = pd.DataFrame({'Age': [20, 21, 22], 'Name': ['John', 'Joe', 'Carla'], 'Postal Code': ['FF_222', 'AA_555', 'BB_111']})
# This returns a new pd.Series
print(df['Postal Code'].apply(parse_postal_code))
# Can also be assigned to another column
df['Postal Code Letter'] = df['Postal Code'].apply(parse_postal_code)
print(df['Postal Code Letter'])
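Applied to the timespan question, the same pattern might look like the sketch below. It reuses the regular expression and the 365/30-day approximation from the question; the function name parse_timespan, the sample rows and the choice to return 0 for unparseable values are assumptions made for illustration.

```python
import re
import pandas as pd

# Same pattern as in the question, extended with an optional leading minus sign.
DURATION_RE = re.compile(r"^(-)?P(?:(\d+)Y)?(?:(\d+)M)?(?:(\d+)D)?$")

def parse_timespan(value):
    """Convert a string such as 'P1Y4M1D' to an approximate number of days."""
    match = DURATION_RE.match(str(value))
    if match is None:
        return 0  # assumption: unparseable values count as 0, as in the question
    sign, years, months, days = match.groups()
    total = int(years or 0) * 365 + int(months or 0) * 30 + int(days or 0)
    return -total if sign else total

# Hypothetical sample rows; the real data would come from citations.csv.
df = pd.DataFrame({
    'creation': ['1985-04', '1985-07', '1990-01'],
    'timespan': ['P2Y', 'P1Y4M1D', '-P3M'],
})

# apply() calls parse_timespan once per value and returns a new Series.
df['timespan_days'] = df['timespan'].apply(parse_timespan)

# Average timespan (in days) of documents created in each year.
df['year'] = df['creation'].str[:4].astype(int)
mean_days_by_year = df.groupby('year')['timespan_days'].mean()
print(mean_days_by_year)
```

Because the function takes a single string and returns a single number, it can be tested on its own before being applied to the whole column.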

Accessing a data frame that is generated inside for loop out side of the loop in python

I have a data frame that I generate inside a for loop. I am trying to save this data frame so that I can access it outside of the loop. A snippet of my code is below.
my_excel_sample = pd.read_excel(r"mypath\mydata.xlsx",sheet_name=None)
for tabs in my_excel_sample.keys():
    actualData = pd.DataFrame(removeEmptyColumns(my_excel_sample[tabs],0))
    data = replaceNanValues(actualData,0)
    data = renameColumns(data,0)
    data = removeFooters(data,0)
    data.reset_index(drop=True, inplace=True)
    data = pd.DataFrame(RowMerger(data,0))
Now I want to use data outside of the loop. Can anyone help me to solve this?
You are creating several dataframes iteratively inside the for loop and storing each one in the variable data, so every iteration overwrites the previous one.
You can append each dataframe (data) to a list and then access them whenever you want.
Try this:
my_excel_sample = pd.read_excel(r"mypath\mydata.xlsx",sheet_name=None)
final_df_list = []
for tabs in my_excel_sample.keys():
    actualData = pd.DataFrame(removeEmptyColumns(my_excel_sample[tabs],0))
    data = replaceNanValues(actualData,0)
    data = renameColumns(data,0)
    data = removeFooters(data,0)
    data.reset_index(drop=True, inplace=True)
    data = pd.DataFrame(RowMerger(data,0))
    final_df_list.append(data)
print(final_df_list)
If you have any kind of identifier that you can use to recognize the dataframes later, I suggest using a dictionary, with the identifier as the key and the dataframe (data) as the value.
Here is an example that uses a serial number as the key:
my_excel_sample = pd.read_excel(r"mypath\mydata.xlsx",sheet_name=None)
final_df_dict = dict()
ind = 0
for tabs in my_excel_sample.keys():
    actualData = pd.DataFrame(removeEmptyColumns(my_excel_sample[tabs],0))
    data = replaceNanValues(actualData,0)
    data = renameColumns(data,0)
    data = removeFooters(data,0)
    data.reset_index(drop=True, inplace=True)
    data = pd.DataFrame(RowMerger(data,0))
    final_df_dict[ind] = data
    ind += 1
print(final_df_dict)
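Since pd.read_excel with sheet_name=None already returns a dict keyed by sheet name, the sheet name itself can serve as the identifier. The sketch below assumes this is acceptable for your use case; the sample sheets and the clean helper are hypothetical stand-ins for the real workbook and for the removeEmptyColumns / renameColumns / RowMerger steps.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_excel(..., sheet_name=None), which returns
# a dict mapping each sheet name to its DataFrame.
my_excel_sample = {
    'Sheet1': pd.DataFrame({'a': [1, 2]}),
    'Sheet2': pd.DataFrame({'a': [3, 4, 5]}),
}

def clean(frame):
    # Placeholder for the cleaning steps applied to each sheet in the question.
    return frame.reset_index(drop=True)

# Keep every cleaned frame, keyed by the sheet it came from.
cleaned = {tab: clean(frame) for tab, frame in my_excel_sample.items()}

# The frames remain available after the loop has finished.
print(cleaned['Sheet2'])
```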

Avoid having to repeat the same dataframe column names when modifying them

I have a dataframe with over 30 columns. I am making various modifications to specific columns and would like to avoid having to list those columns every time. Is there a shortcut?
For example:
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]] = matrix_bus_filled[matrix_bus_filled['FNR'] == 'AB1120'][["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]].values
Could I simply once define the term "SpecificColumns" and then paste it here?
matrix_bus_filled.loc[matrix_bus_filled['FNR'] == 'AB1122', ["SpecificColumns"]] = matrix_bus_filled[matrix_bus_filled['Flight Number'] == 'AB1120'][["SpecificColumns"]].values
And here
matrix_bus_filled [["SpecificColumns"]] = matrix_bus_filled [["SpecificColumns"]].apply(scale, axis=1)
Just define a list and use that to call the columns.
specific_columns = ["Ice", "Tartlet", "Pain","Fruit","Club","Focaccia","SW of Month","Salad + Dressing","Planchette + bread","Muffin"]
matrix_bus_filled[specific_columns] = matrix_bus_filled[specific_columns].apply(scale, axis=1)
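The same list also works in the .loc assignment from the question. Here is a minimal sketch on a made-up miniature frame (the values and the 'Other' column are placeholders, and only two of the ten columns are listed to keep it short):

```python
import pandas as pd

# Hypothetical miniature version of matrix_bus_filled.
df = pd.DataFrame({
    'FNR': ['AB1120', 'AB1122'],
    'Ice': [5, 0],
    'Fruit': [7, 0],
    'Other': [1, 2],
})

specific_columns = ['Ice', 'Fruit']

# Copy the values of flight AB1120 into flight AB1122, for the listed columns only.
source = df.loc[df['FNR'] == 'AB1120', specific_columns].values
df.loc[df['FNR'] == 'AB1122', specific_columns] = source
print(df)
```

Note that the list is passed directly (specific_columns), not wrapped in quotes or in an extra pair of brackets.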

how can I make a loop for importing data and saving in sequence

I have an xls file whose first column contains many rows, for example:
MN
TN
RMON
BNE
RMGS
HUDGD
YINT
Then I want to pass the value of each cell to a function:
mystruc1 = make_structure("MN")
mystruc2 = make_structure("TN")
mystruc3 = make_structure("RMON")
mystruc4 = make_structure("BNE")
mystruc5 = make_structure("RMGS")
mystruc6 = make_structure("HUDGD")
mystruc7 = make_structure("YINT")
So each time, the value of one cell goes to the function.
Then I want to pass its output to another function:
out = Bio.PDB.PDBIO()
out.set_structure(mystruc1)
out.save( "MN001.pdb" )
out.set_structure(mystruc2)
out.save( "MN002.pdb" )
out.set_structure(mystruc3)
out.save( "MN003.pdb" )
out.set_structure(mystruc4)
out.save( "MN004.pdb" )
out.set_structure(mystruc5)
out.save( "MN005.pdb" )
out.set_structure(mystruc6)
out.save( "MN006.pdb" )
out.set_structure(mystruc7)
out.save( "MN007.pdb" )
This is how I would do it manually, which is what I want to avoid.
You can construct the filename using str.format (see Format String Syntax in the Python documentation).
>>> filename = '{}{:04}.pdb'
>>> filename.format('MN', 1)
'MN0001.pdb'
>>> filename.format('MN', 352)
'MN0352.pdb'
>>>
You can use enumerate while iterating over the sheet's rows to help construct the filename.
import xlrd
filename = '{}{:04}.pdb'
workbook = xlrd.open_workbook('test.xls')
for sheet in workbook.sheets():
    for n, row in enumerate(sheet.get_rows()):
        col_0 = row[0].value
        print(filename.format(col_0, n))
If you only want to iterate over the first column:
for sheet in workbook.sheets():
    for n, value in enumerate(sheet.col_values(0, start_rowx=0, end_rowx=None)):
        print(filename.format(value, n))
Or you can access the cell values directly:
for sheet in workbook.sheets():
    for i in range(sheet.nrows):
        rowi_col0 = sheet.cell_value(i, 0)
        print(filename.format(rowi_col0, i))
Once you have extracted a cell's value, you can pass it to any function or method, just as it was passed to str.format above.
mystruc = make_structure(value)
To automate processing the cell values, add your processing to the loop.
for sheet in workbook.sheets():
    for i in range(sheet.nrows):
        rowi_col0 = sheet.cell_value(i, 0)
        # print(filename.format(rowi_col0, i))
        my_structure = make_structure(rowi_col0)
        out = Bio.PDB.PDBIO()
        out.set_structure(my_structure)
        out.save(filename.format(rowi_col0, i))
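To reproduce the exact names from the question (MN001.pdb, MN002.pdb, ...), one option is a fixed prefix, a three-digit field and a counter that starts at 1. This sketch only builds the names; the list of values is a placeholder for the sheet's first column, and using a constant 'MN' prefix is an assumption based on the question's examples.

```python
# Hypothetical cell values standing in for the first column of the sheet.
values = ['MN', 'TN', 'RMON']

prefix = 'MN'  # assumption: every file name starts with the same prefix
filenames = []
for n, value in enumerate(values, start=1):
    # {:03} pads the counter with zeros to three digits: 1 -> 001
    filenames.append('{}{:03}.pdb'.format(prefix, n))

print(filenames)  # ['MN001.pdb', 'MN002.pdb', 'MN003.pdb']
```

Inside the real loop, each value would go to make_structure and the matching name to out.save.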
I don't have comment privileges to ask for clarification, so I'm going to answer this as best I can, and hopefully you can clarify if I'm going in the wrong direction.
From what you wrote, I'm assuming that you have some column, 'MN', and you want to name a bunch of files starting from 'MN001.pdb' all the way to 'MN0xx.pdb' (where xx is the last row you're working with).
One way to achieve this is to keep a counter that is incremented on each iteration of the inner for loop.
colname = "MN"
for sheet in workbook.sheets():
    counter = 0
    for row in range(sheet.nrows):
        # pass your code here
        counter += 1
        # zero-pad the counter to three digits (1 -> '001', 12 -> '012')
        s = colname + str(counter).zfill(3)
        ...
        out.save(s + '.pdb')
