ARIMA.from_formula with pandas dataframe - python

Currently, I am trying to fit a seasonal ARIMA model with a 2nd-order autoregressive component, a 60-day lag, and a non-stationary model. When I input my formula, dataframe, and time index into the function, it returns an error. I am confused by this function because I know that I need to pass the order (p, d, q) into it somehow, but its parameters don't specify it. Here is my code below:
wue_formula = ' WUE ~ 1 + SFO3 + PAR + Ta + VPD'
model = tsa.arima_model.ARIMA.from_formula(wue_formula,gs_residual_df,subset = gs_residual_df.index)
File "<ipython-input-67-d518e1f9e7cc>", line 1, in <module>
tsa.arima_model.ARIMA.from_formula(wue_formula,gs_residual_df,subset = gs_residual_df.index)
File "/Users/JasonDucker/anaconda/lib/python2.7/site-packages/statsmodels/base/model.py", line 99, in from_formula
mod = cls(endog, exog, *args, **kwargs)
File "/Users/JasonDucker/anaconda/lib/python2.7/site-packages/statsmodels/tsa/arima_model.py", line 872, in __new__
p, d, q = order
ValueError: too many values to unpack
Statsmodels' website has been down for the past few days and I can't read the documentation to understand my problem fully. Some help with this code would be greatly appreciated!
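Judging from the traceback, the exog design matrix built from the formula ends up where ARIMA expects its order tuple. A possible workaround (a sketch only, with placeholder order values, using the question's gs_residual_df and column names): construct the model directly and pass order=(p, d, q) yourself, with the formula's regressors as exog:

import statsmodels.api as sm

# Column names come from the formula in the question; (2, 1, 0) is a placeholder
# meaning AR(2) with one difference and no MA term -- adjust to your model.
endog = gs_residual_df['WUE']
exog = gs_residual_df[['SFO3', 'PAR', 'Ta', 'VPD']]

model = sm.tsa.ARIMA(endog, order=(2, 1, 0), exog=exog)
results = model.fit()
print(results.summary())

For the seasonal/60-day component, sm.tsa.statespace.SARIMAX accepts a separate seasonal_order=(P, D, Q, s) argument, which may be closer to what a seasonal ARIMA needs.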

Related

Python - Pandas, csv row iteration to divide by half based on criteria met

I work with and collect data from the meters on the side of people's houses in our service area. I wrote a Python script to send high/low voltage alerts to my email whenever they occur, but the voltage originally came in as twice what it actually was (so instead of 125V it showed 250V), so I used pandas to divide the entire column by half. Well, it turns out a small handful of meters were programmed to send back the actual voltage of 125, so I can no longer halve the whole column and now must iterate and divide individually. I'm a bit new to scripting, so my problem might be simple.
df = pd.read_csv(csvPath)
if "timestamp" not in df:
    df.insert(0, 'timestamp', currenttimestamp)
for i in df['VoltageA']:
    if df['VoltageA'][i] > 200:
        df['VoltageA'][i] = df['VoltageA'][i]/2
        df['VoltageAMax'][i] = df['VoltageAMax'][i]/2
        df['VoltageAMin'][i] = df['VoltageAMin'][i]/2
df.to_csv(csvPath, index=False)
Timestamp is there just as a 'key' to avoid duplicate errors later in the same day.
Here is the error I am getting:
Traceback (most recent call last):
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\range.py", line 385, in get_loc
return self._range.index(new_key)
ValueError: 250 is not in range
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:\Users\admin\Documents\script.py", line 139, in <module>
updateTable(csvPath, tableName, truncate, email)
File "C:\Users\admin\Documents\script.py", line 50, in updateTable
if df['VoltageA'][i]>200.0:
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 958, in __getitem__
return self._get_value(key)
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\series.py", line 1069, in _get_value
loc = self.index.get_loc(label)
File "C:\Users\admin\AppData\Local\Programs\Python\Python310\lib\site-packages\pandas\core\indexes\range.py", line 387, in get_loc
raise KeyError(key) from err
KeyError: 250.0
If this isn't enough and you actually need to see a csv snippet, let me know. Just trying not to put unnecessary info here. Note, the first VoltageA value is 250.0
The example code below shows how to use loc to conditionally change the values in multiple columns:
import pandas as pd
df = pd.DataFrame({
'voltageA': [190,230,250,100],
'voltageMax': [200,200,200,200],
'voltageMin': [100,100,100,100]
})
df.loc[df['voltageA'] > 200, ['voltageA', 'voltageMax', 'voltageMin']] = df.loc[df['voltageA'] > 200, ['voltageA', 'voltageMax', 'voltageMin']]/2
df
Output

   voltageA  voltageMax  voltageMin
0       190         200         100
1       115         100          50
2       125         100          50
3       100         200         100
The data in the 2nd and 3rd rows were divided by 2 because the original voltageA values in those rows (230 and 250) exceed 200.
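Applied to the column names from the original question (assuming VoltageA, VoltageAMax and VoltageAMin all exist in the CSV, and df/csvPath are as in the posted script), the same pattern would look roughly like this:

cols = ['VoltageA', 'VoltageAMax', 'VoltageAMin']
over = df['VoltageA'] > 200   # rows whose meters report doubled voltage
df.loc[over, cols] = df.loc[over, cols] / 2
df.to_csv(csvPath, index=False)

This also avoids the chained indexing (df['VoltageA'][i] = ...) that pandas warns about.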

Dictionary - Multiplying a Daily Matrix (Date, 4 Entities) by a Constant Works ... but Results Yield an Empty Dictionary?

I am trying to do a very simple task but cannot seem to get my code right. I've spent time troubleshooting other, bigger pieces of the code and getting my systems to work properly, and now that some wheels are turning I'm terribly stuck on this tiny, seemingly easy bit that I just can't crack.
In a nutshell: I have daily data for four entities (3 pumps and leakage values) for a calendar year. I need to multiply each value by 24 and save these results (preferably in a dictionary), which will be the input data for the next step in the larger problem I am trying to solve (where this resultant data set and another data set will be divided by one another for each pump for every day of the year). The data is read in from a CSV file. I am able to get the pump units to multiply properly inside the for loop, but my final print statement gives me a blank dictionary.
Input Data (365 lines of data):
Date, FT1003, FT2003, FT3003, LEAK
1/1/2021, 93, 0, 3, 10
1/2/2021, 0, 94, 2, 10
1/3/2021, 70, 54, 94, 10
1/4/2021, 70, 85, 87, 10
This is what my current printout looks like working with abbreviated data:
['1/1/21,93,0,3,10\n', '1/2/21,0,94,2,10']
2232.0
0.0
result={'1/1/21': {}, '1/2/21': {}}
print('Trying to solve a simple problem')
fhand2 = open('WSE_Daily_CY21_short.csv', 'r')
daily = fhand2.readlines()
print(daily)
result = {}
for line in daily:
    date, unit1, unit2, unit3, leak = line.split(",")
    unit1 = (float(unit1))*24
    unit2 = (float(unit2))*24
    unit3 = (float(unit3))*24
    leak = (float(leak))*24
    print(unit1)  # this print has the multiplication correct for unit 1 but isn't in the dictionary
    result[date] = dict()
print(f"result={result}")

Boolean index with Numba with strings and datetime64

I am trying to convert a function that generates a Boolean index based on a date and a name so that it works with Numba, but I get an error.
My project starts with a DataFrame TS_Flujos with the following structure:
Fund name   Date     Var Commitment   Cash flow
Fund 1      Date 1   100              -20
Fund 1      Date 2   10               -2
Fund 1      Date 3   0                +10
Fund 2      Date 3   100              0
Fund 2      Date 4   0                -10
Fund 3      Date 2   100              -20
Fund 3      Date 3   20               30
Each line is a cashflow of a specific fund. For each line I need to calculate the cumulated commitment to date and subtract the amount funded to date, the "unfunded". For that, I iterate over the DataFrame TS_Flujos, identify the fund and date, and use a Boolean index to identify the other "relevant rows" in the DataFrame, the ones for the same fund with dates prior to the current one, with the following function:
def date_and_fund(date, fund, c_dates, c_funds):
    i1 = (c_dates <= date)
    i2 = (c_funds == fund)
    result = i1 & i2
    return result
And I run the following loop:
n_flujos = TS_Flujos.to_numpy()
for index in range(len(n_flujos)):
    f = n_dates[index]
    p = n_funds[index]
    date_fund = date_and_fund(f, p, n_dates, n_funds)
    TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
This is a simplification; I also have to segregate the cashflows by type and calculate many other indicators for each row. For now I have 44,000 rows, but this number should increase a lot in the future, and this loop already takes 1 to 2 minutes depending on the computer. I am worried about the speed when the cashflow database grows 10x, and this is a small part of the total project. I have tried to understand how to use your previous answer to optimize it, but I can't find a way to vectorize or use a list comprehension here.
Because there is no dependency between the calculations, I tried to parallelize the code with Numba.
@njit(parallel=True)
def cashflow(array_cashflows):
    for index in prange(len(array_cashflows)):
        f = n_dates[index]
        p = n_funds[index]
        date_fund = date_and_fund(f, p, n_dates, n_funds)
        TS_Flujos['Cumulated commitment'].values[index] = n_varCommitment[date_fund].sum()
    return
flujos(n_dates)
But I get the following error:
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 197, in runfile
pydev_imports.execfile(filename, global_vars, local_vars) # execute the script
File "C:\Program Files\JetBrains\PyCharm 2020.1.3\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/ferna/OneDrive/Python/Datalts/Dataltsweb.py", line 347, in <module>
flujos(n_fecha)
File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 415, in _compile_for_args
error_rewrite(e, 'typing')
File "C:\Users\ferna\venv\lib\site-packages\numba\core\dispatcher.py", line 358, in error_rewrite
reraise(type(e), e, None)
File "C:\Users\ferna\venv\lib\site-packages\numba\core\utils.py", line 80, in reraise
raise value.with_traceback(tb)
numba.core.errors.TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Untyped global name 'date_and_pos': cannot determine Numba type of <class 'function'>
File "Dataltsweb.py", line 324:
def flujos(array_flujos):
<source elided>
p = n_idpos[index]
fecha_pos = date_and_pos(f, p, n_fecha, n_idpos)
^
Given the way that you have structured your code, you won't gain any performance by using Numba. You're using the decorator on a function that is already vectorized and will perform fast. What would make sense is to try to speed up the main loop, not just CapComp_MO.
In relation to the error, it seems to have to do with the types. Try adding explicit typing and see if that solves the issue; here are Numba's datatypes for datetime objects.
I'd also recommend avoiding .iterrows() for performance reasons; see this post for an explanation.
As a side note, t1[:]: this takes a full slice, and is the same as t1.
Also, if you add a minimal example (code and dataframes), it might help in improving your current approach. It looks like you're just indexing in each iteration, so you might not need to loop at all if you use numpy.
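As a concrete illustration of that last point, here is a minimal sketch (column names assumed from the sample table above) that computes the cumulative commitment per fund with a grouped cumulative sum instead of a per-row Boolean index; if several rows can share the same fund and date, the tie handling may need a small adjustment:

import pandas as pd

# Small stand-in for TS_Flujos; column names follow the sample table above.
TS_Flujos = pd.DataFrame({
    'Fund name': ['Fund 1', 'Fund 1', 'Fund 1', 'Fund 2', 'Fund 2'],
    'Date': pd.to_datetime(['2021-01-01', '2021-02-01', '2021-03-01',
                            '2021-03-01', '2021-04-01']),
    'Var Commitment': [100, 10, 0, 100, 0],
})

# Sort by fund and date, then take a per-fund cumulative sum: each row gets the
# commitment accumulated up to and including its own date, with no Python loop.
TS_Flujos = TS_Flujos.sort_values(['Fund name', 'Date'])
TS_Flujos['Cumulated commitment'] = (
    TS_Flujos.groupby('Fund name')['Var Commitment'].cumsum()
)
print(TS_Flujos)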

Dealing with enormous DataFrames - in local environments versus HPC environments

I want to turn a recursive big-data filtering program running in a local environment into a program that can run its most expensive processes in parallel.
This problem is complex and requires a bit of background information, so please bear with me :) I wrote this in Python 3.7.
I'm using the VIF filtering method from statsmodels.stats.outliers_influence to prune an enormous dataset. The data is tab-delimited and I used pandas to load it into a DataFrame.
However, the VIF operation gets extremely expensive as the number of columns and rows in the dataset increase. For math reasons, you can get away with looking at a subset of rows, but all columns need to be considered.
My solution for this was to subset the huge DataFrame into a list of smaller ones and perform the VIF calculation on each, but there is a trade-off: smaller DataFrames are processed much more quickly, yet this reduces the accuracy of the overall VIF process (each column only sees what is in its own subset).
To counteract this, I built a recursive process with a user-defined number of iterations and a control flow that went: fragment data --> VIF calculation on fragments --> recombine fragments --> randomize columns --> repeat.
It worked okay ... for a test dataset: about 20 minutes for a single iteration where the original dataset was 600k columns. But even in that test case the number of columns in each fragment was quite small, and I imagine I'd want many, many iterations for a higher degree of accuracy.
Therefore, I figured: why not run the VIF process in parallel? So I set out to do just that in an HPC environment. I'm a relatively junior programmer, and I understand that HPC environments can be highly specific in their syntax. But the jump to an HPC has complicated my control flow and code to a great extent: writing to multiple files in the same directory, calling my script thousands of times (running just one or two methods per submitted job), and writing, appending to, and sometimes overwriting files each iteration ... it's just insanity.
I'm having no issues updating the sbatch files required to keep the control flow between the project directory, HPC cores, and my script running smoothly; the real rub is getting my script to pull in information, alter it, and write it to a file in a highly specific order based on user input.
If you have experience running scripts in parallel, please lend your advice!
Here are the methods I used to fragment and filter the data in a local environment:
def frag_df(df_snps, og_chunk_delim=None):
    """
    FUNCTION:
    This method takes a DataFrame, evaluates its length, determines a
    fragmentation size, and separates the DataFrame into chunks. It returns a
    list of fragmented DataFrames.
    INPUTS:
    df_snps --> A DataFrame object, ideally one pre-processed by df_snps() [an
    earlier method]
    RETURNS:
    df_list --> a list of fragment DataFrames
    """
    # Default fragment size: one fifteenth of the columns (computed here rather
    # than in the signature so it can depend on the DataFrame that is passed in).
    if og_chunk_delim is None:
        og_chunk_delim = len(df_snps.columns) // 15
    df_list = []
    # Subset df by all SNP predictor columns and find the total number of SNPs in the infile.
    snp_count = len(df_snps.columns)
    # Create counters to be used by an iterative loop (for local applications).
    snp_counter = 0
    fragment_counter = 0
    chunk_delim = og_chunk_delim
    num_chunks = snp_count / chunk_delim
    num_chunks = int(math.ceil(num_chunks))
    print(snp_counter, chunk_delim, num_chunks)
    # Iterate through the snp_count DataFrame and split it into chunks.
    print('\n' 'SNPs fragmenting into list of smaller DataFrames ...' '\n')
    while fragment_counter < num_chunks:
        print('Building Fragment #', fragment_counter + 1, 'from position',
              snp_counter, 'to', chunk_delim)
        df_list.append(df_snps.iloc[:, snp_counter:chunk_delim])
        # Move snp_counter up by specified chunk_delim (Defaults to 50 SNPs).
        snp_counter += og_chunk_delim
        chunk_delim += og_chunk_delim
        fragment_counter += 1
    df_list.append(df_snps.iloc[:, snp_counter:])
    print('\n', 'SNP fragmentation complete. Proceeding to VIF analysis.')
    return df_list

df_list = frag_df(df_snps)
def vif_calc(df_list, threshold=3.0):
    """
    FUNCTION: This method takes a list of DataFrame objects and conducts VIF
    analysis on each of them, dropping columns based on some VIF threshold.
    INPUTS:
    df_list --> A list of DataFrame objects processed by frag_df.
    threshold --> The VIF threshold by which columns are to be evaluated.
    RETURNS:
    vif_list --> A list of DataFrames without multicolinear predictor columns.
    """
    df_index = 0
    drop_counter = 0
    filtered_list = df_list
    for df in filtered_list:
        if df.empty:
            del df
        print('\n Iterating through all DataFrames in the passed list.')
        print('\n Dropping columns with a VIF threshold greater than', threshold, '... \n')
        # Create a list of indices corresponding to each column in a given chunk.
        variables = list(range(df.shape[1]))
        df_index += 1
        dropped = True
        try:
            while dropped:
                vif = [variance_inflation_factor(df.iloc[:, variables].values, var)
                       for var in variables]
                if max(vif) < threshold:
                    dropped = False
                    print('\n' 'Fragment #', df_index, 'has been VIF filtered.',
                          'Checking list for next DataFrame ...' '\n')
                    break
                else:
                    max_loc = vif.index(max(vif))
                    if max(vif) > threshold:
                        g = (float("{0:.2f}".format(max(vif))))
                        print('Dropping', df.iloc[:, variables].columns[max_loc],
                              'at index', str(max_loc + 1), 'within Chunk #',
                              df_index, 'due to VIF of', g)
                        df.drop(df.columns[variables[max_loc]], 1, inplace=True)
                        variables = list(range(df.shape[1]))
                        dropped = True
                        drop_counter += 1
        except ValueError:
            max_loc = 0
    return filtered_list

filtered_list = vif_calc(df_list, 2.5)
Here's what I used to run my script recursively:
def recursion(remaining, df_shuffled):
    """
    FUNCTION: This method specifies a number of times to call the other
    methods in the program.
    INPUTS:
    remaining --> The number of iterations that the user would like to run.
    More iterations = more chances for columns to see other columns.
    df_shuffled --> A DataFrame with randomized columns to be used in the
    next iteration of the program.
    RETURNS:
    df_final --> A DataFrame ready for downstream machine learning analysis.
    """
    if remaining == 1:
        print("Recursive VIF filtering complete!", '\n'
              "Here is a preview of your data:", '\n')
        df_final = df_shuffled
        print('\n' 'In this iteration, a total of', col_loss,
              'columns were trimmed from the data file.')
        print(df_final.head())
        return df_final
    else:
        df_list = frag_df(df_shuffled)
        vif_list = vif_calc(df_list, 2.0)
        print('\n' "All done filtering this iteration! There are",
              remaining - 2, "iterations left.", '\n')
        print('Reconstituting and running next iteration ...')
        df_recon = reconstitute_df(vif_list)
        recursion(remaining - 1, df_recon)
# All of the above is in working order. Here is the type of output I'd get:
SNPs fragmenting into list of smaller DataFrames ...
Building Fragment # 1 from position 0 to 31
Building Fragment # 2 from position 31 to 62
Building Fragment # 3 from position 62 to 93
Building Fragment # 4 from position 93 to 124
Building Fragment # 5 from position 124 to 155
Building Fragment # 6 from position 155 to 186
Building Fragment # 7 from position 186 to 217
Building Fragment # 8 from position 217 to 248
Building Fragment # 9 from position 248 to 279
Building Fragment # 10 from position 279 to 310
Building Fragment # 11 from position 310 to 341
Building Fragment # 12 from position 341 to 372
Building Fragment # 13 from position 372 to 403
Building Fragment # 14 from position 403 to 434
Building Fragment # 15 from position 434 to 465
Building Fragment # 16 from position 465 to 496
Iterating through all DataFrames in the passed list.
Dropping columns with a VIF threshold greater than 2.5 ...
Dropping AGE at index 2 within Chunk # 1 due to VIF of 16.32
Dropping UPSIT at index 2 within Chunk # 1 due to VIF of 7.07
Dropping snp164_C at index 5 within Chunk # 1 due to VIF of 2.74
Dropping snp139_T at index 19 within Chunk # 1 due to VIF of 2.52
Fragment # 1 has been VIF filtered. Checking list for next DataFrame ...
Dropping snp499_C at index 9 within Chunk # 2 due to VIF of 2.81
Dropping snp30_C at index 4 within Chunk # 2 due to VIF of 2.78
Dropping snp424_A at index 29 within Chunk # 2 due to VIF of 2.73
Dropping snp32_C at index 10 within Chunk # 2 due to VIF of 2.53
Fragment # 2 has been VIF filtered. Checking list for next DataFrame ...
Dropping snp483_T at index 31 within Chunk # 3 due to VIF of 2.84
Dropping snp350_T at index 26 within Chunk # 3 due to VIF of 2.6
Dropping snp150_A at index 28 within Chunk # 3 due to VIF of 2.55
# When I tried to move to the HPC environment, things got crazy.
I only wanted vif_calc() to run in parallel; everything else needed only one core. So I designed a sort of 'flag' system and had my methods run only if the flag was set to a certain value. The dilemma was that I had to write the files that call my script using THIS SAME SCRIPT! And even worse, vif_calc() could no longer rely on the script to supply the DataFrame to filter; instead I had to dump the fragments into the working directory. Here's how I set it up:
def init_pipeline():
    parser = argparse.ArgumentParser()
    parser.add_argument("datafile", type=str, default='discrete.dataforml',
                        help="Please enter the name of a CSV file to be processed.")
    parser.add_argument("fragment", type=int, choices=range(50, 300), default=50,
                        help="DataFrame fragment size. Smaller fragments processed "
                             "faster, but require more iterations.")
    parser.add_argument("-fg", "--flag", type=int, choices=range(1, 5), default=1,
                        help="Specify which part of the script to run this iteration. "
                             "Defaults to pre-processing.")
    parser.add_argument("-th", "--threshold", type=float, choices=range(2, 6), default=4.0,
                        help="Specify VIF filtering threshold. Columns exceeding it "
                             "will be dropped from the DataFrame.")
    parser.add_argument("-re", "--recursion", type=int, choices=range(0, 50), default=0,
                        help="Recursive filtering. Choose # of iterations from 0 to 50. "
                             "Default is no recursion.")
    args = parser.parse_args()
    data = args.datafile
    thresh = args.threshold
    recur = args.recursion
    frag = args.fragment
    flag = args.flag
    # I passed the arguments into a dictionary so every single time the script was
    # called (once per submitted job), it'd have what it needed.
    arguments = dict([('file', data), ('frag', frag), ('flag', flag),
                      ('vif_threshold', thresh), ('iters', recur)])
    return arguments
# Then I ran into many complicated problems. How could I get the fragmented DataFrames to and from files, and how could different calls to this same script communicate with one another effectively? The solution was basically to write to other files (master.sh and swarm_iter.txt) to handle the control flow in the HPC environment.
def frag_df(df_snps, flag, og_chunk_delim=50):
    if flag in (1, 2):
        df_list = []
        # Subset df by all SNP predictor columns and find the total number of SNPs in the infile.
        snp_count = len(df_snps)
        # Create counters to be used by an iterative loop (for local applications).
        snp_counter = 0
        num_chunks = 1
        chunk_delim = og_chunk_delim
        swarm_counter = 0
        # Iterate through the snp_count DataFrame and split it into chunks.
        print('\n' 'SNPs fragmenting into list of smaller DataFrames ...')
        while chunk_delim + og_chunk_delim <= snp_count:
            df_list.append(df_snps.iloc[:, snp_counter:chunk_delim])
            # print(cov_snps.iloc[:, snp_counter:chunk_delim])
            # Move snp_counter up by specified chunk_delim (Defaults to 50 SNPs).
            snp_counter += og_chunk_delim
            chunk_delim += og_chunk_delim
            num_chunks += 1
        print('\n', 'SNP fragmentation complete. Proceeding to VIF analysis.')
        # Now use the fragments in df_list to write/overwrite the .swarm file in the directory.
        # Use the .swarm file to create a series of temporary txt files corresponding to each fragment.
        # These files can be deleted or overwritten after the VIF filtering process.
        swarm_writer = open('swarm_iter.txt', "w")
        df_list_inc = len(df_list)
        while swarm_counter < df_list_inc:
            if len(df_list) == 0:
                return df_list
            else:
                fragment_list = []  # A list containing names of file objects (each is a txt DataFrame).
                frag_writer_inc = swarm_counter
                df_frag_writer = open('df_' + str(frag_writer_inc + 1) + '.txt', "w")
                df_frag_writer.write(str(df_list[swarm_counter]))
                fragment_list.append(str('df_' + str(frag_writer_inc + 1) + '.txt'))
                df_frag_writer.close()
                # Write a line to the swarm file - ensure that flag == 3 so only vif_calc() is called.
                for DataFrame in fragment_list:
                    swarm_line = 'python filter_final_swarm.py', DataFrame, '50 3 -re 2' '\n'
                    swarm_writer.write(str(swarm_line))
                swarm_counter += 1
        swarm_writer.close()
        # Finally, append new lines to master.sh - the swarm command and bash commands used downstream!
        # Ensure that it's dependent on the previous job's completion (in which the flag == 1 or 2).
        job_counter = 1
        master_writer = open('master.sh', "a+")
        swarm_command = str('jobid' + str(job_counter + 1) + '=$(swarm -f swarm_iter.txt '
                            '--dependancy afterany:$jobid' + str(job_counter) +
                            '--module python -g 3.0 -b 30)' '\n')
        nxt_sbash = str('jobid' + str(job_counter + 2) + '=$(sbatch --dependancy=afterany:$jobid' +
                        str(job_counter + 1) + 'python filter_final_swarm.py 50 4' '\n')
        master_writer.write(swarm_command)
        master_writer.write(nxt_sbash)
        master_writer.close()
# So in this version, a list of file names was being passed to vif_calc() instead of a list of DataFrames.
def vif_calc(comp_df_lists_dict, flag, threshold=5.0):
    if flag == 3:
        df_dir_files = len(fnmatch.filter(os.listdir(r'C:\Users\thomasauw\VIF-Filter-master'), 'df_*'))
        df_index = 0
        drop_counter = 0
        filtered_list = comp_df_lists_dict.get('df_file_list')
        print('\n Iterating through all DataFrames in the passed list.')
        print('\n Dropping columns with a VIF threshold greater than', threshold, '.''\n')
        for file in filtered_list:
            active_df = open(file, 'r')
            df = pd.read_csv(active_df, sep="\t")
            # Create a list of indices corresponding to each column in a given chunk.
            variables = list(range(df.shape[1]))
            df_index += 1
            dropped = True
            try:
                while dropped:
                    vif = [variance_inflation_factor(df.iloc[:, variables].values, var)
                           for var in variables]
                    if max(vif) < threshold:
                        dropped = False
                        # Now the method must overwrite the DataFrames it took in with
                        # FILTERED DataFrames. In this version, the 'list' has just a
                        # single DataFrame element, and the script is taking in a
                        # different file each time (being called many times).
                        filtered_df_frag_writer = open('df_' + str(df_index), "w")
                        filtered_df_frag_writer.write(df)
                        filtered_df_frag_writer.close()
                        print('\n' 'Fragment #', df_index, 'has been VIF filtered. '
                              'Checking list for next DataFrame ...' '\n')
                        break
                    else:
                        max_loc = vif.index(max(vif))
                        if max(vif) > threshold:
                            g = (float("{0:.2f}".format(max(vif))))
                            print('Dropping', df.iloc[:, variables].columns[max_loc],
                                  'at index', str(max_loc + 1), 'within Chunk #',
                                  df_index, 'due to VIF of', g)
                            df.drop(df.columns[variables[max_loc]], 1, inplace=True)
                            variables = list(range(df.shape[1]))
                            dropped = True
                            drop_counter += 1
            except ValueError:
                max_loc = 0
        return filtered_list
    # If the flag requirement isn't met, simply return what was passed as the list
    # of DataFrames when the argument was called in the first place.
    else:
        return comp_df_lists_dict

vif_list = vif_calc(comp_df_lists_dict, arg_holder.get('flag'), arg_holder.get('vif_threshold'))
What I'm really looking for, first and foremost, is advice on how to approach this problem. For this specific case, though, the error I've run into seems to come from the statsmodels VIF method itself. I've succeeded in getting the script to write a master.sh file, a swarm_iter.txt file, and the DataFrame files that vif_calc() needs. All of those files are in the working directory when this command is run:
python filter_final_swarm.py discrete.dataForML 50 -fg 3 -re 2
# Then, here is the result (note that with flag == 3, the earlier methods responsible for fragmenting the data have already done their job; assume that the HPC environment submitted that job successfully):
SNPs fragmenting into list of smaller DataFrames ...
SNP fragmentation complete. Proceeding to VIF analysis.
Iterating through all DataFrames in the passed list.
Dropping columns with a VIF threshold greater than 4.0 .
Traceback (most recent call last):
File "filter_final_swarm.py", line 315, in <module>
vif_list = vif_calc(comp_df_lists_dict, arg_holder.get('flag'), arg_holder.get('vif_threshold'))
File "filter_final_swarm.py", line 270, in vif_calc
vif = [variance_inflation_factor(df.iloc[:, variables].values, var) for var in variables]
File "filter_final_swarm.py", line 270, in <listcomp>
vif = [variance_inflation_factor(df.iloc[:, variables].values, var) for var in variables]
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\stats\outliers_influence.py", line 184, in variance_inflation_factor
r_squared_i = OLS(x_i, x_noti).fit().rsquared
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\regression\linear_model.py", line 838, in __init__
hasconst=hasconst, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\regression\linear_model.py", line 684, in __init__
weights=weights, hasconst=hasconst, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\regression\linear_model.py", line 196, in __init__
super(RegressionModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\model.py", line 216, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\model.py", line 68, in __init__
**kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\model.py", line 91, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\data.py", line 635, in handle_data
**kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\data.py", line 80, in __init__
self._handle_constant(hasconst)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\data.py", line 125, in _handle_constant
if not np.isfinite(ptp_).all():
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
# I've confirmed that 'df' is a DataFrame object, read in by looking at each 'active_df' file in the working directory in succession, and reading it into 'df'. If I were to continue down this path of insanity (please tell me that isn't the right thing to do here), how would I solve this problem? I'd expect the VIF filter to work normally and overwrite each file in succession (for later recombination/randomization/refragmentation).
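One likely culprit, though this is an assumption rather than something confirmed above: frag_df writes str(df_list[swarm_counter]) to disk, i.e. the pretty-printed repr of each fragment, so pd.read_csv(..., sep='\t') later yields object (string) columns, and np.isfinite then fails inside OLS on the non-numeric data. A sketch of round-tripping the fragments as real tab-separated values instead:

import pandas as pd

def write_fragments(df_list):
    # Serialize each fragment as genuine tab-separated values, not its repr.
    names = []
    for i, frag in enumerate(df_list):
        name = 'df_' + str(i + 1) + '.txt'
        frag.to_csv(name, sep='\t', index=False)
        names.append(name)
    return names

def read_fragment(name):
    # Round-tripping through to_csv/read_csv preserves numeric dtypes, which
    # variance_inflation_factor / OLS expect.
    return pd.read_csv(name, sep='\t')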

Python .split() function

I am using split to separate the M/D/Y values from one field and put them into their own respective fields. My script is bombing out on the NULL values in the original date field when calculating the Day field.
10/27/1990 ----> M:10 D:27 Y:1990
# Process: Calculate Field Month
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Month",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[0]""",expression_type="PYTHON_9.3",code_block="#")
# Process: Calculate Field Day
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Day",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[1]""",expression_type="PYTHON_9.3",code_block="#")
# Process: Calculate Field Year
arcpy.CalculateField_management(in_table="Assess_Template",field="Assess_Template.Year",expression="""!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!.split("/")[-1]""",expression_type="PYTHON_9.3",code_block="#")
I am unsure how I should fix this issue; any suggestions would be greatly appreciated!
Something like this should work (to calculate the year where possible):
in_table = "Assess_Template"
field = "Assess_Template.Year"
expression = "get_year(!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!)"
codeblock = """def get_year(date):
try:
return date.split("/")[-1]
except:
return date"""
arcpy.CalculateField_management(in_table, field, expression, "PYTHON_9.3", codeblock)
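If the Month and Day fields need the same guard, a null-safe variant of the code block (a sketch reusing the field and table names from the question; get_part is a hypothetical helper, not tested against the original data) could look like:

codeblock = """def get_part(date, index):
    if date in (None, ''):
        return None
    parts = date.split('/')
    return parts[index] if len(parts) > index else None"""

arcpy.CalculateField_management(
    in_table="Assess_Template",
    field="Assess_Template.Day",
    expression="get_part(!Middleboro_xlsx_Sheet2.Legal_Reference_Sale_Date!, 1)",
    expression_type="PYTHON_9.3",
    code_block=codeblock)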
Good luck!
Tom
