Dealing with enormous DataFrames in local environments versus HPC environments - Python

I want to turn a recursive big-data filtering program running in a local environment into a program that can run its most expensive processes in parallel.
This problem is complex and requires a bit of background information - so please bear with me :) I wrote this in Python 3.7.
I'm using the VIF filtering method from statsmodels.stats.outliers_influence to prune an enormous dataset. The data is tab-delimited, and I used pandas to load it into a DataFrame.
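For reference, here is a minimal, self-contained example of the per-column VIF call I'm relying on (the toy data and column names are made up purely for illustration):

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Toy illustration of the VIF call used throughout this post.
toy = pd.DataFrame({
    'snp1': [0, 1, 2, 1, 0, 2, 1, 0],
    'snp2': [1, 1, 2, 2, 0, 2, 1, 1],  # loosely correlated with snp1
    'snp3': [2, 0, 1, 0, 2, 1, 0, 2],
})
# variance_inflation_factor takes the full design matrix plus the index of the
# column being tested, so each column's VIF regresses it on every other column.
vifs = [variance_inflation_factor(toy.values, i) for i in range(toy.shape[1])]
print(dict(zip(toy.columns, vifs)))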
However, the VIF operation gets extremely expensive as the number of columns and rows in the dataset increases. For mathematical reasons, you can get away with looking at a subset of rows, but all columns need to be considered.
My solution for this was to subset the huge DataFrame into a list of smaller ones and perform the VIF calculation on each, but there is a trade-off: smaller DataFrames are processed much more quickly, but this reduces the accuracy of the overall VIF process (each column only sees what is in its own subset).
To counteract this, I built a recursive process, involving user-defined values for number of iterations, and had a control flow that went: fragment data --> VIF calculation on fragments --> recombine fragments --> randomize columns --> repeat.
It worked okay ... for a test dataset. About 20 minutes for a single iteration where the original dataset was 600k columns. But even in that test case the # of columns in each fragment was quite small, and I'd imagine I'd want many, many iterations for a higher degree of accuracy.
Therefore, I figured: why not run the VIF process in parallel? So I set out to do just that in an HPC environment. I'm a relatively junior programmer, and I understand that HPC environments can be highly specific/exclusive in their syntax. But the jump to an HPC has complicated my control flow and code to a great extent, involving writing to multiple files in the same directory, calling my script thousands of times (running just one or two methods each time per submitted job), writing, appending to, and sometimes overwriting files each iteration ... it's just insanity.
I'm having no issues updating the sbatch files required to keep the control flow between the project directory, HPC cores, and my script running smoothly; the real rub is getting my script to pull in information, alter it, and write it to a file in a highly specific order based on user input.
If you have experience running scripts in parallel, please lend your advice!
Here are the methods I used to fragment and filter the data in a local environment:
# Imports used by the methods below.
import math

from statsmodels.stats.outliers_influence import variance_inflation_factor


def frag_df(df_snps, og_chunk_delim=None):
    """
    FUNCTION:
    This method takes a DataFrame, evaluates its length, determines a
    fragmentation size, and separates the DataFrame into chunks. It returns a
    list of fragmented DataFrames.
    INPUTS:
    df_snps --> A DataFrame object, ideally one pre-processed by df_snps() [an
    earlier method]
    RETURNS:
    df_list --> a list of fragment DataFrames
    """
    # A default argument can't reference another parameter, so the default
    # chunk size (one fifteenth of the columns) is computed here instead.
    if og_chunk_delim is None:
        og_chunk_delim = len(df_snps.columns) // 15
    df_list = []
    # Subset df by all SNP predictor columns and find the total number of SNPs
    # in the infile.
    snp_count = len(df_snps.columns)
    # Create counters to be used by an iterative loop (for local applications).
    snp_counter = 0
    fragment_counter = 0
    chunk_delim = og_chunk_delim
    num_chunks = int(math.ceil(snp_count / chunk_delim))
    print(snp_counter, chunk_delim, num_chunks)
    # Iterate through the snp_count DataFrame and split it into chunks.
    print('\n' 'SNPs fragmenting into list of smaller DataFrames ...' '\n')
    while fragment_counter < num_chunks:
        print('Building Fragment #', fragment_counter + 1, 'from position',
              snp_counter, 'to', chunk_delim)
        df_list.append(df_snps.iloc[:, snp_counter:chunk_delim])
        # Move snp_counter up by the specified chunk_delim.
        snp_counter += og_chunk_delim
        chunk_delim += og_chunk_delim
        fragment_counter += 1
    # Only append a trailing fragment if any columns remain; otherwise an empty
    # DataFrame ends up in the list.
    if snp_counter < snp_count:
        df_list.append(df_snps.iloc[:, snp_counter:])
    print('\n', 'SNP fragmentation complete. Proceeding to VIF analysis.')
    return df_list


df_list = frag_df(df_snps)
def vif_calc(df_list, threshold=3.0):
    """
    FUNCTION: This method takes a list of DataFrame objects and conducts VIF
    analysis on each of them, dropping columns based on some VIF threshold.
    INPUTS:
    df_list --> A list of DataFrame objects processed by frag_df.
    threshold --> The VIF threshold by which columns are to be evaluated.
    RETURNS:
    vif_list --> A list of DataFrames without multicollinear predictor columns.
    """
    df_index = 0
    drop_counter = 0
    filtered_list = df_list
    print('\n Iterating through all DataFrames in the passed list.')
    print('\n Dropping columns with a VIF threshold greater than', threshold,
          '... \n')
    for df in filtered_list:
        # Skip the empty fragment that frag_df may have appended.
        if df.empty:
            continue
        # Create a list of indices corresponding to each column in a given chunk.
        variables = list(range(df.shape[1]))
        df_index += 1
        dropped = True
        try:
            while dropped:
                vif = [variance_inflation_factor(df.iloc[:, variables].values, var)
                       for var in variables]
                if max(vif) < threshold:
                    dropped = False
                    print('\n' 'Fragment #', df_index, 'has been VIF filtered.',
                          'Checking list for next DataFrame ...' '\n')
                    break
                else:
                    max_loc = vif.index(max(vif))
                    if max(vif) > threshold:
                        g = float("{0:.2f}".format(max(vif)))
                        print('Dropping', df.iloc[:, variables].columns[max_loc],
                              'at index', str(max_loc + 1), 'within Chunk #',
                              df_index, 'due to VIF of', g)
                        df.drop(df.columns[variables[max_loc]], axis=1,
                                inplace=True)
                        variables = list(range(df.shape[1]))
                        dropped = True
                        drop_counter += 1
        except ValueError:
            max_loc = 0
    return filtered_list


filtered_list = vif_calc(df_list, 2.5)
Here's what I used to run my script recursively:
def recursion(remaining, df_shuffled):
    """
    FUNCTION: This method specifies a number of times to call the other
    methods in the program.
    INPUTS:
    remaining --> The number of iterations that the user would like to run.
    More iterations = more chances for columns to see other columns.
    df_shuffled --> A DataFrame with randomized columns to be used in the
    next iteration of the program.
    RETURNS:
    df_final --> A DataFrame ready for downstream machine learning analysis.
    """
    if remaining == 1:
        print("Recursive VIF filtering complete!", '\n'
              "Here is a preview of your data:", '\n')
        df_final = df_shuffled
        # col_loss (the running count of dropped columns) is assumed to be
        # tracked elsewhere in the script.
        print('\n' 'In this iteration, a total of', col_loss,
              'columns were trimmed from the data file.')
        print(df_final.head())
        return df_final
    else:
        df_list = frag_df(df_shuffled)
        vif_list = vif_calc(df_list, 2.0)
        print('\n' "All done filtering this iteration! There are",
              remaining - 2, "iterations left.", '\n')
        print('Reconstituting and running next iteration ...')
        df_recon = reconstitute_df(vif_list)
        # Return the recursive call so df_final propagates back to the caller.
        return recursion(remaining - 1, df_recon)
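reconstitute_df() isn't shown above; a minimal sketch consistent with the recombine-and-shuffle step described earlier (assuming every filtered fragment still shares the same row index) would be:

import pandas as pd

def reconstitute_df(vif_list, seed=None):
    # Sketch only: a column-wise concat restores one wide DataFrame because
    # the fragments were column slices of the same rows.
    df_recon = pd.concat(vif_list, axis=1)
    # Shuffle the column order so the next fragmentation groups columns
    # differently.
    return df_recon.sample(frac=1, axis=1, random_state=seed)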
# All of the above is in working order. Here is the type of output I'd get:
SNPs fragmenting into list of smaller DataFrames ...
Building Fragment # 1 from position 0 to 31
Building Fragment # 2 from position 31 to 62
Building Fragment # 3 from position 62 to 93
Building Fragment # 4 from position 93 to 124
Building Fragment # 5 from position 124 to 155
Building Fragment # 6 from position 155 to 186
Building Fragment # 7 from position 186 to 217
Building Fragment # 8 from position 217 to 248
Building Fragment # 9 from position 248 to 279
Building Fragment # 10 from position 279 to 310
Building Fragment # 11 from position 310 to 341
Building Fragment # 12 from position 341 to 372
Building Fragment # 13 from position 372 to 403
Building Fragment # 14 from position 403 to 434
Building Fragment # 15 from position 434 to 465
Building Fragment # 16 from position 465 to 496
Iterating through all DataFrames in the passed list.
Dropping columns with a VIF threshold greater than 2.5 ...
Dropping AGE at index 2 within Chunk # 1 due to VIF of 16.32
Dropping UPSIT at index 2 within Chunk # 1 due to VIF of 7.07
Dropping snp164_C at index 5 within Chunk # 1 due to VIF of 2.74
Dropping snp139_T at index 19 within Chunk # 1 due to VIF of 2.52
Fragment # 1 has been VIF filtered. Checking list for next DataFrame ...
Dropping snp499_C at index 9 within Chunk # 2 due to VIF of 2.81
Dropping snp30_C at index 4 within Chunk # 2 due to VIF of 2.78
Dropping snp424_A at index 29 within Chunk # 2 due to VIF of 2.73
Dropping snp32_C at index 10 within Chunk # 2 due to VIF of 2.53
Fragment # 2 has been VIF filtered. Checking list for next DataFrame ...
Dropping snp483_T at index 31 within Chunk # 3 due to VIF of 2.84
Dropping snp350_T at index 26 within Chunk # 3 due to VIF of 2.6
Dropping snp150_A at index 28 within Chunk # 3 due to VIF of 2.55
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````
# When I tried to move to the HPC environment, things got crazy.
I only wanted vif_calc() to run in parallel; everything else only needed one core. So I designed a sort of 'flag' system and had my methods run only if the flag was set to a certain value. The dilemma was that I had to write the files that called my script using THIS SAME SCRIPT! And even worse, vif_calc() couldn't rely on the script anymore to supply the DataFrame to filter; instead I had to dump the fragments into the working directory as files. Here's how I set it up (a stripped-down sketch of the flag dispatch itself follows the argument parser below):
import argparse

def init_pipeline():
    parser = argparse.ArgumentParser()
    parser.add_argument("datafile", type=str, default='discrete.dataforml',
                        help="Please enter the name of a CSV file to be processed.")
    parser.add_argument("fragment", type=int, choices=range(50, 300), default=50,
                        help="DataFrame fragment size. Smaller fragments are "
                             "processed faster, but require more iterations.")
    parser.add_argument("-fg", "--flag", type=int, choices=range(1, 5), default=1,
                        help="Specify which part of the script to run this "
                             "iteration. Defaults to pre-processing.")
    parser.add_argument("-th", "--threshold", type=float, choices=range(2, 6),
                        default=4.0,
                        help="Specify VIF filtering threshold. Columns exceeding "
                             "it will be dropped from the DataFrame.")
    parser.add_argument("-re", "--recursion", type=int, choices=range(0, 50),
                        default=0,
                        help="Recursive filtering. Choose # of iterations from 0 "
                             "to 50. Default is no recursion.")
    args = parser.parse_args()
    data = args.datafile
    thresh = args.threshold
    recur = args.recursion
    frag = args.fragment
    flag = args.flag
    # I passed the arguments into a dictionary so every single time the script
    # was called (once per submitted job), it'd have what it needed.
    arguments = dict([('file', data), ('frag', frag), ('flag', flag),
                      ('vif_threshold', thresh), ('iters', recur)])
    return arguments
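To make the 'flag' idea concrete, here is a stripped-down, self-contained sketch of the dispatch pattern only (the stage functions are placeholders, not the real pipeline methods):

import argparse

def preprocess_stage():
    print('flags 1/2: fragment the data, write df_*.txt, swarm_iter.txt and master.sh')

def vif_stage():
    print('flag 3: VIF-filter a single fragment file and overwrite it in place')

def recombine_stage():
    print('flag 4: recombine the filtered fragments, reshuffle, start the next iteration')

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('-fg', '--flag', type=int, choices=range(1, 5), default=1)
    flag = parser.parse_args().flag
    if flag in (1, 2):
        preprocess_stage()
    elif flag == 3:
        vif_stage()
    else:
        recombine_stage()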
`````````````````````````````````````````````````````````````````````````````
# Then I ran into many complicated problems. How could I get the fragmented DataFrames to and from files, and get different calls to this same script to communicate with one another effectively? The solution was basically: write to other files (master.sh and swarm_iter.txt) that handle the control flow in the HPC environment.
def frag_df(df_snps, flag, og_chunk_delim=50):
    # 'if flag == 1 or 2' is always True, so test membership instead.
    if flag in (1, 2):
        df_list = []
        # Subset df by all SNP predictor columns and find the total number of
        # SNPs in the infile (columns, as in the local version).
        snp_count = len(df_snps.columns)
        # Create counters to be used by an iterative loop (for local
        # applications).
        snp_counter = 0
        num_chunks = 1
        chunk_delim = og_chunk_delim
        swarm_counter = 0
        # Iterate through the snp_count DataFrame and split it into chunks.
        print('\n' 'SNPs fragmenting into list of smaller DataFrames ...')
        while chunk_delim + og_chunk_delim <= snp_count:
            df_list.append(df_snps.iloc[:, snp_counter:chunk_delim])
            # Move snp_counter up by the specified chunk_delim (defaults to 50
            # SNPs).
            snp_counter += og_chunk_delim
            chunk_delim += og_chunk_delim
            num_chunks += 1
        print('\n', 'SNP fragmentation complete. Proceeding to VIF analysis.')
        # Now use the fragments in df_list to write/overwrite the .swarm file
        # in the directory, and dump each fragment to its own tab-delimited
        # txt file. These files can be deleted or overwritten after the VIF
        # filtering process.
        swarm_writer = open('swarm_iter.txt', "w")
        df_list_inc = len(df_list)
        while swarm_counter < df_list_inc:
            fragment_name = 'df_' + str(swarm_counter + 1) + '.txt'
            # Write with to_csv (not str(df)) so the fragment can be re-read
            # by pandas later.
            df_list[swarm_counter].to_csv(fragment_name, sep='\t', index=False)
            # Write a line to the swarm file - pass -fg 3 so only vif_calc()
            # is called in each swarm sub-job.
            swarm_writer.write('python filter_final_swarm.py ' + fragment_name +
                               ' 50 -fg 3 -re 2' + '\n')
            swarm_counter += 1
        swarm_writer.close()
        # Finally, append new lines to master.sh - the swarm command and sbatch
        # command used downstream! Ensure each is dependent on the previous
        # job's completion (in which the flag == 1 or 2).
        job_counter = 1
        master_writer = open('master.sh', "a+")
        swarm_command = ('jobid' + str(job_counter + 1) + '=$(swarm -f '
                         'swarm_iter.txt --dependency afterany:$jobid' +
                         str(job_counter) + ' --module python -g 3.0 -b 30)' '\n')
        nxt_sbatch = ('jobid' + str(job_counter + 2) + '=$(sbatch '
                      '--dependency=afterany:$jobid' + str(job_counter + 1) +
                      ' python filter_final_swarm.py 50 4)' '\n')
        master_writer.write(swarm_command)
        master_writer.write(nxt_sbatch)
        master_writer.close()
        return df_list
``````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````````
# So in this version, a list of file names is passed to vif_calc() instead of a list of DataFrames.
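The construction of comp_df_lists_dict isn't shown here; a plausible sketch, assuming its 'df_file_list' entry is simply a glob of the fragment files written by frag_df(), would be:

import glob

# Plausible sketch only: collect every fragment file in the working directory
# into the dict that vif_calc() below expects.
comp_df_lists_dict = {'df_file_list': sorted(glob.glob('df_*.txt'))}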
import fnmatch
import os

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor


def vif_calc(comp_df_lists_dict, flag, threshold=5.0):
    if flag == 3:
        df_dir_files = len(fnmatch.filter(
            os.listdir(r'C:\Users\thomasauw\VIF-Filter-master'), 'df_*'))
        df_index = 0
        drop_counter = 0
        filtered_list = comp_df_lists_dict.get('df_file_list')
        print('\n Iterating through all DataFrames in the passed list.')
        print('\n Dropping columns with a VIF threshold greater than', threshold,
              '.''\n')
        for file in filtered_list:
            df = pd.read_csv(file, sep="\t")
            # Create a list of indices corresponding to each column in a given
            # chunk.
            variables = list(range(df.shape[1]))
            df_index += 1
            dropped = True
            try:
                while dropped:
                    vif = [variance_inflation_factor(df.iloc[:, variables].values, var)
                           for var in variables]
                    if max(vif) < threshold:
                        dropped = False
                        # Now the method must overwrite the DataFrames it took
                        # in with FILTERED DataFrames. In this version, the
                        # 'list' has just a single DataFrame element, and the
                        # script takes in a different file each time (being
                        # called many times). to_csv keeps the file parseable.
                        df.to_csv('df_' + str(df_index) + '.txt', sep='\t',
                                  index=False)
                        print('\n' 'Fragment #', df_index, 'has been VIF filtered.',
                              'Checking list for next DataFrame ...' '\n')
                        break
                    else:
                        max_loc = vif.index(max(vif))
                        if max(vif) > threshold:
                            g = float("{0:.2f}".format(max(vif)))
                            print('Dropping', df.iloc[:, variables].columns[max_loc],
                                  'at index', str(max_loc + 1), 'within Chunk #',
                                  df_index, 'due to VIF of', g)
                            df.drop(df.columns[variables[max_loc]], axis=1,
                                    inplace=True)
                            variables = list(range(df.shape[1]))
                            dropped = True
                            drop_counter += 1
            except ValueError:
                max_loc = 0
        return filtered_list
    # If the flag requirement isn't met, simply return what was passed in as
    # the list of DataFrames in the first place.
    else:
        return comp_df_lists_dict


vif_list = vif_calc(comp_df_lists_dict, arg_holder.get('flag'),
                    arg_holder.get('vif_threshold'))
`````````````````````````````````````````````````````````````````````````````
What I'm really looking for, first and foremost, is advice on how to approach this problem. For this specific case, though, the error I've run into seems to come from the statsmodels VIF function itself. I've succeeded in getting the script to write a master.sh file, a swarm_iter.txt file, and the DataFrame files that vif_calc() needs. All of those files are in the working directory when this command is run:
`````````````````````````````````````````````````````````````````````````````
python filter_final_swarm.py discrete.dataForML 50 -fg 3 -re 2
`````````````````````````````````````````````````````````````````````````````
# Then, here is the result (note that with flag == 3, the earlier methods responsible for fragmenting the data have already done their job; assume that the HPC environment submitted that job successfully).
`````````````````````````````````````````````````````````````````````````````
SNPs fragmenting into list of smaller DataFrames ...
SNP fragmentation complete. Proceeding to VIF analysis.
Iterating through all DataFrames in the passed list.
Dropping columns with a VIF threshold greater than 4.0 .
Traceback (most recent call last):
File "filter_final_swarm.py", line 315, in <module>
vif_list = vif_calc(comp_df_lists_dict, arg_holder.get('flag'), arg_holder.get('vif_threshold'))
File "filter_final_swarm.py", line 270, in vif_calc
vif = [variance_inflation_factor(df.iloc[:, variables].values, var) for var in variables]
File "filter_final_swarm.py", line 270, in <listcomp>
vif = [variance_inflation_factor(df.iloc[:, variables].values, var) for var in variables]
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\stats\outliers_influence.py", line 184, in variance_inflation_factor
r_squared_i = OLS(x_i, x_noti).fit().rsquared
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\regression\linear_model.py", line 838, in __init__
hasconst=hasconst, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\regression\linear_model.py", line 684, in __init__
weights=weights, hasconst=hasconst, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\regression\linear_model.py", line 196, in __init__
super(RegressionModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\model.py", line 216, in __init__
super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\model.py", line 68, in __init__
**kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\model.py", line 91, in _handle_data
data = handle_data(endog, exog, missing, hasconst, **kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\data.py", line 635, in handle_data
**kwargs)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\data.py", line 80, in __init__
self._handle_constant(hasconst)
File "C:\Users\thomasauw\AppData\Local\Continuum\anaconda3\envs\Py37\lib\site-packages\statsmodels\base\data.py", line 125, in _handle_constant
if not np.isfinite(ptp_).all():
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
# I've confirmed that 'df' is a DataFrame object, read in by opening each 'active_df' file in the working directory in succession and reading it into 'df'. If I were to continue down this path of insanity (please tell me that isn't the right thing to do here), how would I solve this problem? I'd expect the VIF filter to work normally and overwrite each file in succession (for later recombination/randomization/refragmentation).
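A likely culprit is the fragment round trip itself: writing str(df_list[i]) to disk stores the printed repr of the DataFrame (row index, alignment whitespace, '...' truncation markers), so read_csv hands back object-dtype columns and np.isfinite fails on them inside statsmodels. A minimal sketch of a round trip that stays numeric (the file name and toy values are illustrative) would be:

import pandas as pd

# Hypothetical fragment round trip: write with to_csv instead of str(df), so
# the file on disk is real tab-delimited data rather than a printed repr.
frag = pd.DataFrame({'snp1': [0, 1, 2], 'snp2': [1, 0, 2]})
frag.to_csv('df_1.txt', sep='\t', index=False)

df = pd.read_csv('df_1.txt', sep='\t')
# Sanity check before handing the matrix to variance_inflation_factor: every
# column must be numeric, otherwise np.isfinite raises the TypeError above.
print(df.dtypes)
assert all(pd.api.types.is_numeric_dtype(t) for t in df.dtypes)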

Related

Iterate elements of pandas column over elements of another column from a different data frame of unequal length

I have two Pandas data frames of unequal length. The first one contains data about predicted protein modifications; the second one holds data for experimentally verified protein modifications.
The first data frame contains the following columns:
protein_id
position_predicted
modification_predicted
… and looks something like this:
protein_id    position_predicted    modification_predicted
prot1         135                   +
prot1         267                   +
prot1         360                   -
prot2         59                    ++
prot2         135                   +++
prot3         308                   -
…             …                     …
The second data frame contains columns with experimentally verified protein modification positions:
protein_id
position_experimental
… and looks like so:
protein_id    position
prot1         135
prot3         300
prot4         55
…             …
protein_id in both data frames refers to the same protein, using the standard UniProt identifier.
modification_predicted in the first data frame corresponds to information about the predicted presence of the modification at the position:
‘+’ modification predicted to be present
‘-’ modification predicted to be absent
By contrast, the second data frame holds the positions that are experimentally (truly) modified.
Now my global aim is to somehow compare the accuracy of the predictions from data frame one with the experimentally verified modifications from data frame two.
There are 5 cases that I have to count separately:
A) experimental data frame and predictions data frame both have same position for same protein and the prediction says the position is truly modified (‘+’ in the modification_predicted) - true positive cases
B) position in both data frames is same, but the same prediction says there’s no modification (‘-‘ in the modification_predicted) for the same corresponding protein - false negative cases
C) prediction says there’s a modification for the position (‘+’ in the modification_predicted), but the experimental data frame has no corresponding position for this same protein - false positive cases
D) prediction says there’s no modification for the position (‘-’ in the modification_predicted) and the experimental data frame has no corresponding position for this same protein - true negative cases
E) the experimental data frame positions that do not correspond to any position for the same protein in the prediction data frame - miscellaneous cases
Now I understand that I need to somehow iterate over each position of each protein in the prediction data frame and compare it with each position for the corresponding protein in the experimental data frame.
In pseudo-code, the way I see the solution for this problem is something like this:
TP = 0
FN = 0
TN = 0
FP = 0
Misc = 0
for protein in df1$protein_id:
    for position in protein[from df1]:
        if {condition for TP}:
            TP += 1
        if {condition for FN}:
            FN += 1
        if {condition for TN}:
            TN += 1
        if {condition for FP}:
            FP += 1
        if {condition for misc}:
            Misc += 1
There are two major problems that I face with such a solution.
(1) How do I specify for each condition that I need to compare only same-protein positions between the two frames, in other words restrict the comparison to within-single-protein positions, without allowing inter-protein comparisons?
(2) The length of the two frames is unequal
Any ideas how to approach these problems?
You can use merging. Reference: Pandas Merging 101
I assume the index numbers (of both dataframes) are unique. If not, use: df.reset_index()
# Inner merge:
intersection = df_pred.merge(
df_real,
left_on=['protein_id', 'position_predicted'],
right_on=['protein_id', 'position']
)
TP = intersection['modification_predicted'].str.contains('+', regex=False).sum()
FN = intersection['modification_predicted'].eq('-').sum()
# FN = len(intersection) - TP # alternative
And here select elements of both dataframes which are not present in the other one:
unique_pred = df_pred.loc[df_pred.index.difference(intersection.index)]
unique_real = df_real.loc[df_real.index.difference(intersection.index)]
TN = unique_pred['modification_predicted'].eq('-').sum()
FP = unique_pred['modification_predicted'].str.contains('+', regex=False).sum()
# FP = len(unique_pred) - TN # alternative
Misc = len(unique_real)
Result:
>>> TP, FN, TN, FP, Misc
(1, 0, 2, 3, 2)

first attempt at python, error ("IndexError: index 8 is out of bounds for axis 0 with size 8") and efficiency question

Learning Python, just began last week; I haven't otherwise coded for about 20 years and was never that advanced to begin with. I got the hello world thing down. Now I'm trying to back-test FX pairs. Any help up the learning curve is appreciated, and of course I'm scouring this site while on my Lynda vids.
I'm getting a funky error, and also wondering if there are blatantly more efficient ways to loop through columns of Excel data the way I am.
The spreadsheet being read is simple ... 56 FX pairs down column A, and 8 columns over where the column headers are dates, and the cells in each column are the respective FX pair's closing price on that date. The strategy starts at the top of the 2nd column (so that there is a return % that can be calc'd vs the prior period) and calcs out period/period % returns for each pair, identifying which is the 'maximum value', and then "goes long" that highest performer ... whose performance in the subsequent period is recorded as PnL to the portfolio ("p" in the code). It loops through that until the current, most recent column is read.
The error relates to using 8 columns instead of 7 ... it works when I limit the loop to 7 columns but not 8. When I use 8, I get a wall of text concluding with "IndexError: index 8 is out of bounds for axis 0 with size 8". I get a similar error when I use too many rows, 56 instead of 55; I think I'm missing the bottom row.
Here's my code:
#set up imports
import pandas as pd
#import spreadsheet
x1 = pd.ExcelFile(r"C:\Users\Gamblor\Desktop\Python\test2020.xlsx")
df = pd.read_excel(x1, "Sheet1", header=1)
#define counters for loops
o = 1 # observation counter
c = 3 # column counter
r = 0 # active row counter for sorting through for max
#define identifiers for the portfolio
rpos = 0 # static row, for identifying which currency pair is in column 0 of that row
p = 100 # portfolio size starts at $100
#define the stuff we are evaluating for
pair = df.iat[r,0] # starting pair at 0,0 where each loop will begin
pair_pct_rtn = 0 # starts out at zero, becomes something at first evaluation, then gets compared to each subsequent eval
pair_pct_rtn_calc = 0 # a second version of above, for comparison to prior return
#runs a loop starting at the top to find the max period/period % return in a specific column
while (c < 8): # manually limiting this to 5 columns left to right
    while (r < 55): # i am manually limiting this to 55 data rows per the spreadsheet ... would be better if automatic
        pair_pct_rtn_calc = ((df.iat[r,c])/(df.iat[r,c-1]) - 1)
        if pair_pct_rtn_calc > pair_pct_rtn: # if its a higher return, it must be the "max" to that point
            pair = df.iat[r,0] # identifies the max pair for this column observation, so far
            pair_pct_rtn = pair_pct_rtn_calc # sets pair_pct_rtn as the new max
            rpos = r # identifies the max pair's ROW for this column observation, so far
        r = r + 1 # adds to r in order to jump down and calc the next row
    print('in obs #', o ,', ', pair ,'did best at' ,pair_pct_rtn ,'.')
    o = o + 1
    # now adjust the portfolio by however well USDMXN did in the subsequent week
    p = p * ( 1 + ((df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1))
    print('then the subsequent period it did: ',(df.iat[rpos,c+1])/(df.iat[rpos,c]) - 1)
    print('resulting in portfolio value of', p)
    rpos = 0
    r = 0
    pair_pct_rtn = 0
    c = c + 1 # adds to c in order to move to the next period to the right
print(p)
Since indices are labelled from 0 onwards, the 8th element you are looking for will have index 7. Likewise, row index 55 (the 56th row) will be your last row.
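For example, you can derive the loop bounds from df.shape instead of hard-coding 8 and 55 (a sketch with a toy frame standing in for the spreadsheet; note that the df.iat[rpos, c+1] look-ahead means the column loop has to stop one column early):

import pandas as pd

# Toy frame standing in for the spreadsheet: 5 pairs, 8 price columns.
df = pd.DataFrame([[10 + i + j for j in range(8)] for i in range(5)])

n_rows, n_cols = df.shape          # valid positional indices run 0 .. n - 1
c = 1
while c < n_cols - 1:              # stop one early because the code also reads column c + 1
    r = 0
    while r < n_rows:              # last valid row index is n_rows - 1
        ret = df.iat[r, c] / df.iat[r, c - 1] - 1   # period-over-period return
        r += 1
    c += 1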

Using Multiprocessing, how to Apply() within each CPU

I have to run a process on about 2 million IDs, for which I am trying to use multiprocessing.
My sample data, stored in dataframe df looks like (just presenting 3 rows):
c_id
0 ID1
1 ID2
2 ID3
My parallelize code is as follows:
import numpy as np
import pandas as pd
from multiprocessing import Pool, cpu_count

def parallelize(data, func, parts=cpu_count()):
    if data.shape[0] < parts:
        parts = data.shape[0]
    data_split = np.array_split(data, parts)
    pool = Pool(parts)
    parallel_out = pd.concat(pool.map(func, data_split))
    pool.close()
    pool.join()
    return parallel_out
A sample process that I want to run on all the IDs is to append my first name to each ID.
There are two pieces of code that I tested.
First: Using a for loop and then calling the parallelize function, as follows:
def pooltype1(df_id):
    dfi = []
    for item in df_id['c_id']:
        dfi.append({'string': str(item) + '_ravi'})
    dfi = pd.DataFrame(dfi)
    return dfi

p = parallelize(df, pooltype1, parts=cpu_count())
The output is as expected and the index of each is 0, confirming that each ID went into a different cpu (cpu_count() for my system > 3):
string
0 ID1_ravi
0 ID2_ravi
0 ID3_ravi
and the runtime is 0.12 secs.
However, to further speed it up on my actual (2 million) data, I tried to replace the for-loop in the pooltype1 function with an apply command and then call the parallelize function as below:
# New function
def add_string(x):
    return x + '_ravi'

def pooltype2(df_id):
    dfi = df_id.apply(add_string)
    return dfi

p = parallelize(df, pooltype2, parts=cpu_count())
Now the index of the output was not all zero:
string
0 ID1_ravi
1 ID2_ravi
2 ID3_ravi
and to my surprise the runtime jumped to 5.5 sec. It seems like apply was executed on the whole original dataframe and not at a CPU level.
So, when doing pool.map, do I have to use a for-loop (as in the pooltype1 function), or is there a way apply can be applied within each CPU (hoping that it will further reduce run time)? If one can do the apply at a CPU level, please do help me with the code.
Thank you.
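One way to keep apply inside each worker is to apply over the 'c_id' column of the chunk that pool.map hands to that worker, rather than over the whole DataFrame. A sketch reusing the question's names (whether it actually beats the for-loop will depend on chunk size):

import numpy as np
import pandas as pd
from multiprocessing import Pool, cpu_count

def add_string(x):
    return str(x) + '_ravi'

def pooltype3(df_id):
    # Runs inside one worker on one chunk: apply over the 'c_id' column only.
    return pd.DataFrame({'string': df_id['c_id'].apply(add_string)})

def parallelize(data, func, parts=cpu_count()):
    parts = min(parts, data.shape[0])
    data_split = np.array_split(data, parts)
    with Pool(parts) as pool:
        return pd.concat(pool.map(func, data_split))

if __name__ == '__main__':
    df = pd.DataFrame({'c_id': ['ID1', 'ID2', 'ID3']})
    print(parallelize(df, pooltype3))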

Create a data frame out of input_t (which is actually three numbers and acts as features) and output_t as output

This is my code
# np, biosppy, tools, fd, td and the `values` DataFrame are assumed to be
# imported/defined elsewhere.
output_data = []
out = ''
i = 0
P = 500
X = 40000
while i < 600:
    subVals = values[i:i+X]
    signal = subVals.val1
    signal, rpeaks = biosppy.signals.ecg.ecg(signal, show=False)[1:3]
    rpeaks = rpeaks.tolist()
    nni = tools.nn_intervals(rpeaks)
    fre = fd.welch_psd(nni)
    tm = td.nni_parameters(nni)
    f1 = fre['fft_peak']
    t1 = tm['nni_min']
    f11 = np.asarray(f1)
    t11 = np.asarray(t1)
    input_t = np.append(f11, t11)
    output_t = subVals.BLEEDING
    output_t = int(round(np.mean(output_t)))
    i += P
As you see, we are in a loop, and the goal here is to create a data frame or a CSV file from input_t and output_t. Here is an example of them in one loop iteration:
input_t
array([2.83203125e-02, 1.21093750e-01, 3.33984375e-01, 8.17000000e+02])
output_t
0
I am trying to create a matrix where, for every row, the first columns are one iteration of input_t and the last column is output_t. Based on the code, since i needs to be less than 600, the initial value of i is 0, and the step is 500, we have two loops, which makes 2 rows in total and 5 columns (4 values from input_t and 1 value from output_t). I tried append, and I tried something like out += ",", but I am not sure why that is not working.
Initialize a variable as a list before the loop and append results to it:
out = []
while i < 600:
    ....
    input_t = np.append(f11, t11)
    output_t = subVals.BLEEDING
    output_t = int(round(np.mean(output_t)))
    # input_t is a NumPy array, so convert it to a list before adding the label.
    out.append(list(input_t) + [output_t])
Now out is a list of lists, which you can load into a DataFrame as sketched below.
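A minimal sketch (the values stand in for two loop iterations, and the column names are illustrative):

import pandas as pd

out = [
    [2.83e-02, 1.21e-01, 3.34e-01, 817.0, 0],
    [3.10e-02, 1.18e-01, 3.29e-01, 803.0, 1],
]
df_out = pd.DataFrame(out, columns=['fft_peak_1', 'fft_peak_2', 'fft_peak_3',
                                    'nni_min', 'bleeding'])
df_out.to_csv('features.csv', index=False)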

Matching cells in CSV to return calculation

I am trying to create a program that will take the most recent 30 CSV files of data within a folder and calculate totals of certain columns. There are 4 columns of data, with the first column being the identifier and the rest being the data related to the identifier. Here's an example:
file1
Asset X Y Z
12345 250 100 150
23456 225 150 200
34567 300 175 225
file2
Asset X Y Z
12345 270 130 100
23456 235 190 270
34567 390 115 265
I want to be able to match the asset# in both CSVs to return each columns value and then perform calculations on each column. Once I have completed those calculations I intend on graphing various data as well. So far the only thing I have been able to complete is extracting ALL the data from the CSV file using the following code:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\FDR*.csv')
listData = []
for files in csvfile:
    df = pd.read_csv(files, index_col=0)
    listData.append(df)

concatenated_data = pd.concat(listData, sort=False)
group = concatenated_data.groupby('ASSET')['Slip Expense ($)', 'Net Win ($)'].sum()
group.to_csv("C:\\Users\\tdjones\\Desktop\\Python Work Files\\Test\\NewFDRConcat.csv", header=('Slip Expense', 'Net WIn'))
I am very new to Python so any and all direction is welcome. Thank you!
I'd probably also set the asset number as the index while you're reading the data, since this can help with sifting through data. So
rd = pd.read_csv(files, index_col=0)
Then you can do as Alex Yu suggested and just pick all the data from a specific asset number out when you're done using
asset_data = rd.loc[asset_number, column_name]
You'll generally need to format the data in the DataFrame before you append it to the list if you only want specific inputs. Exactly how to do that naturally depends specifically on what you want i.e. what kind of calculations you perform.
If you want a function that just returns all the data for one specific asset, you could do something along the lines of
def get_asset(asset_number):
    csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
    asset_data = []
    for file in csvfile:
        data = [line for line in open(file, 'r').read().splitlines()
                if line.split(',')[0] == str(asset_number)]
        for line in data:
            asset_data.append(line.split(','))
    return pd.DataFrame(asset_data, columns=['Asset', 'X', 'Y', 'Z'], dtype=float)
How well the above performs, though, is going to depend on how large the dataset you're going through is. Something like the above method needs to search through every line and perform several high-level functions on each line, so it could potentially be problematic if you have millions of lines of data in each file.
Also, the above assumes that all data elements are strings of numbers (so they can be cast to integers or floats). If that's not the case, leave the dtype argument out of the DataFrame definition, but keep in mind that everything returned is then stored as a string.
I suppose that you need to add pandas.concat of your listData to your code.
So it will become:
csvfile = glob.glob('C:\\Users\\tdjones\\Desktop\\Python Work Files\\*.csv')
listData = []
for files in csvfile:
    rd = pd.read_csv(files)
    listData.append(rd)

concatenated_data = pd.concat(listData)
After that you can use aggregate functions on this concatenated_data DataFrame, such as concatenated_data['A'].max(), concatenated_data['A'].count(), groupby, etc. For example:
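For the asset-level totals described in the question, that could look like the sketch below (built from the question's own example tables):

import pandas as pd

file1 = pd.DataFrame({'Asset': [12345, 23456, 34567],
                      'X': [250, 225, 300],
                      'Y': [100, 150, 175],
                      'Z': [150, 200, 225]})
file2 = pd.DataFrame({'Asset': [12345, 23456, 34567],
                      'X': [270, 235, 390],
                      'Y': [130, 190, 115],
                      'Z': [100, 270, 265]})

concatenated_data = pd.concat([file1, file2])
totals = concatenated_data.groupby('Asset')[['X', 'Y', 'Z']].sum()
print(totals)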
