I've written a Python script that takes a large file (a matrix of ~50k rows x ~500 cols) and uses it as a dataset to train a random forest model.
My script has two functions, one to load the dataset and the other to train the random forest model on that data. Both work fine, but loading the file takes ~45 seconds and it's a pain to do this every time I want to train a subtly different model (I'm testing many models on the same dataset). Here is the file-loading code:
import io

import numpy as np


def load_train_data(train_file):
    # Read in training file
    train_f = io.open(train_file)
    train_id_list = []
    train_val_list = []
    for line in train_f:
        list_line = line.strip().split("\t")
        if list_line[0] != "Domain":  # skip the header row
            train_identifier = list_line[9]
            train_values = list_line[12:]
            train_id_list.append(train_identifier)
            train_val_float = [float(x) for x in train_values]
            train_val_list.append(train_val_float)
    train_f.close()
    train_val_array = np.asarray(train_val_list)
    return train_id_list, train_val_array
This returns the identifiers from column 9 (the labels) and a numpy array of columns 12 onward (the values used to train the random forest).
I am going to train many different forms of my model with the same data, so I just want to load the file once and have it available to feed into my random forest function. I want the data to be held in an object, I think (I am fairly new to Python).
If I understand you correctly, the dataset does not change but the model parameters do, and you change the parameters after each run.
I would put the file-loading code in one file and run it in the Python interpreter. The data will then be loaded and held in memory under whatever variable name you use.
You can then import another file with your model code and run it with the training data as an argument.
If all your model changes can be expressed as parameters in a function call, all you need to do is import your model and then call the training function with different parameter settings.
If you need to change the model code between runs, save it under a new filename, import that one, run it again, and pass the source data to it.
If you don't want to save each model modification under a new filename, you may be able to use the reload functionality, depending on your Python version, but it is not recommended (see Proper way to reload a python module from the console).
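For example, the workflow could look like this in an interactive session (a minimal sketch; the module names load_data and train_model and the function train_rf are hypothetical stand-ins for your own files and training function):
# Run inside the Python/IPython interpreter
from load_data import load_train_data   # hypothetical module holding the loader above
import train_model                       # hypothetical module holding your RF training code

train_ids, train_vals = load_train_data("train.tsv")  # pay the ~45 s load cost once

# The data now stays in memory, so retraining with different settings is cheap
model_a = train_model.train_rf(train_ids, train_vals, n_estimators=100)
model_b = train_model.train_rf(train_ids, train_vals, n_estimators=500, max_depth=10)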
The simplest way would be to cache the results, like so:
_train_data_cache = {}

def load_cached_train_data(train_file):
    if train_file not in _train_data_cache:
        _train_data_cache[train_file] = load_train_data(train_file)
    return _train_data_cache[train_file]
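Usage is then simply the following (the file name is a placeholder); only the first call pays the ~45 s parsing cost, later calls come straight from the cache:
ids, vals = load_cached_train_data("train.tsv")   # first call: parses the file
ids, vals = load_cached_train_data("train.tsv")   # later calls: returned from the in-memory cache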
Try to learn about Python data serialization. You would basically store the large file as a Python-specific, serialized binary object using Python's marshal module. This would drastically speed up the file I/O. See these benchmarks for performance variations. However, if these random forest models are all trained at the same time, you could just train them against the dataset you already have in memory and then release the training data after completion.
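As a sketch of that idea, here is a version using NumPy's own binary format (np.save/np.load) instead of marshal, since marshal only handles built-in Python types; the cache file names are assumptions:
import os

import numpy as np

def load_train_data_cached(train_file):
    """Parse the TSV once, then reuse a fast binary cache on later runs."""
    cache_ids = train_file + ".ids.npy"
    cache_vals = train_file + ".vals.npy"
    if os.path.exists(cache_ids) and os.path.exists(cache_vals):
        # Binary load is much faster than re-parsing the text file
        return list(np.load(cache_ids)), np.load(cache_vals)
    train_id_list, train_val_array = load_train_data(train_file)  # the loader from the question
    np.save(cache_ids, np.asarray(train_id_list))
    np.save(cache_vals, train_val_array)
    return train_id_list, train_val_array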
Load your data in ipython.
my_data = open("data.txt")
Write your code in a Python script, say example.py, that uses this data. At the top of example.py add these lines:
import sys

args = sys.argv
data = args[1]  # the value substituted in from the IPython session
...
Now run the python script in ipython:
%run example.py $my_data
Now, when running your python script, you don't need to load data multiple times.
I am trying to make predictions from my custom model on Vertex AI but am getting errors.
I have deployed the model to an endpoint and I request prediction using this line
gcloud beta ai endpoints predict 1234 --region=europe-west4 --json-request=request2.json
and I get this response
Using endpoint [https://europe-west4-prediction-aiplatform.googleapis.com/]ERROR: (gcloud.beta.ai.endpoints.predict) FAILED_PRECONDITION: "Prediction failed: Exception during xgboost prediction: feature_names mismatch:
I have created the data with this code (and later renamed it to request2.json):
test3 = {}
test3["instances"] = test_set2.head(20).values.tolist()

with open('05_GCP/test_v3.jsonl', 'w') as outfile:
    json.dump(test3, outfile, indent=2)
This generates a file which looks like this.
The error tells me that it's expecting column names for each value rather than nothing, in which case they are interpreted as f0, f1, etc.
My challenge is that I don't know how to generate data that looks like this (also from the help file).
Though the result with the mismatched column names suggests I need a different format.
I tried:
import json

test4 = X_test.head(20).to_json(orient='records', lines=True)

with open('05_GCP/test_v4.json', 'w') as outfile:
    json.dump(test4, outfile, indent=2)
which gives me data that looks like:
i.e. with many line breaks in it, and Cloud Shell tells me this isn't JSON.
I also replicated this format and was informed it is not a JSON file.
So, two questions:
How do I create a JSON file that has the appropriate format for a live prediction?
How do I create a JSONL file so I can run batch jobs? This is actually what I am trying to get to. I have used CSV, but it returns errors:
('Post request fails. Cannot get predictions. Error: Predictions are not in the response. Got: {"error": "Prediction failed: Could not initialize DMatrix from inputs: ('Expecting 2 dimensional numpy.ndarray, got: ', (38,))"}.', 2)
This CSV data is exactly the same data I use to measure the model error while training (I know that's not good practice, but this is just a test run).
UPDATE
Following Raj's suggestion, I tried creating two extra models: one where I changed my training code to use X_train.values, and another where I replaced all the column names with F0:F426, since the response to the JSON file on the endpoint said the column names didn't match.
Both of these SUCCEEDED with the test3 JSON file above when the endpoints were deployed.
However, I want this to return batch predictions, and there it returns the same errors. This is therefore clearly a formatting error, but I have no clue how to get there.
It should be pointed out that batch predictions need to be a JSONL file, not a JSON file; trying to pass a JSON file doesn't work. All I have done is change the extension on the JSON file when I uploaded it to make it appear as a JSONL file. I have not found anything that helps me create one properly. Tips are welcome.
I noticed that my data was in double brackets, so I created a version with only single brackets and ran it on one of the models, but it also returned errors. That was also only one prediction, per the other comment.
Appreciate the help,
James
This worked for me:
If you train the model with df.values, as per Raj's answer, you can then pass the instances you want batch predictions for in a JSONL file ("input.jsonl", for example) with the following format for each instance/row:
[3.0,1.0,30.0,1.0,0.0,16.1]
File would look something like this for 5 rows to predict:
[3.0,1.0,30.0,1.0,0.0,16.1]
[3.0,0.0,22.0,0.0,0.0,9.8375]
[2.0,0.0,45.0,1.0,1.0,26.25]
[1.0,0.0,21.0,0.0,0.0,26.55]
[3.0,1.0,16.0,4.0,1.0,39.6875]
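A minimal sketch of producing such a file from a pandas DataFrame (assuming X_test from the question holds the rows you want to score; the output file name is a placeholder):
import json

# One JSON array per line -- the JSONL layout expected for batch prediction
with open("input.jsonl", "w") as f:
    for row in X_test.head(20).values.tolist():
        f.write(json.dumps(row) + "\n")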
I am new to using HDF5 files and I am trying to read files with shapes of (20670, 224, 224, 3). Whenever I try to store the results from the HDF5 file into a list or another data structure, it either takes so long that I abort the execution or it crashes my computer. I need to be able to read 3 sets of HDF5 files, use their data, manipulate it, use it to train a CNN model, and make predictions.
Any help for reading and using these large HDF5 files would be greatly appreciated.
Currently this is how I am reading the hdf5 file:
db = h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5")
training_db = list(db['data'])
Crashes probably mean you are running out of memory. As Vignesh Pillay suggested, I would try chunking the data and working on a small piece of it at a time. If you are using the pandas method read_hdf, you can use the iterator and chunksize parameters to control the chunking:
import pandas as pd

data_iter = pd.read_hdf('/tmp/test.hdf', key='test_key', iterator=True, chunksize=100)

for chunk in data_iter:
    # train cnn on chunk here
    print(chunk.shape)
Note this requires the HDF file to be in table format.
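If your file was not written that way and your data fits a DataFrame, it can be re-saved in table format first (a sketch; the paths and key are placeholders, and the one-time full read must fit in memory):
import pandas as pd

# Re-save an existing frame in the queryable 'table' format so chunked reads work
df = pd.read_hdf('/tmp/original.hdf', key='test_key')
df.to_hdf('/tmp/test.hdf', key='test_key', format='table')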
My answer was updated 2020-08-03 to reflect the code you added to your question.
As @Tober noted, you are running out of memory. Reading a dataset of shape (20670, 224, 224, 3) will become a list of 3.1G entities. If you read 3 image sets, it will require even more RAM.
I assume this is image data (maybe 20670 images of shape (224, 224, 3))?
If so, you can read the data in slices with both h5py and tables (Pytables).
This will return the data as a NumPy array, which you can use directly (no need to manipulate into a different data structure).
Basic process would look like this:
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    # loop to get images 1 by 1
    for icnt in range(20670):
        image_arr = training_db[icnt, :, :, :]
        # then do something with the image
You could also read multiple images by setting the first index to a range (say icnt:icnt+100) and then handling the looping appropriately.
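For example, a sketch that reads the images in batches of 100 (the batch size is an assumption; tune it to your RAM):
import os

import h5py

batch_size = 100  # assumption: adjust to available memory
with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    training_db = db['data']
    n_images = training_db.shape[0]
    for start in range(0, n_images, batch_size):
        batch = training_db[start:start + batch_size, :, :, :]  # NumPy array of up to 100 images
        # train / preprocess on `batch` here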
Your problem arises because you are running out of memory, so Virtual Datasets come in handy when dealing with large datasets like yours. Virtual datasets allow a number of real datasets to be mapped together into a single, sliceable dataset via an interface layer. You can read more about them here: https://docs.h5py.org/en/stable/vds.html
I would recommend you start with one file at a time. First, create a virtual dataset file of your existing data, like so:
import os
import h5py
import numpy as np

with h5py.File(os.getcwd() + "/Results/Training_Dataset.hdf5", 'r') as db:
    data_shape = db['data'].shape
    layout = h5py.VirtualLayout(shape=data_shape, dtype=np.uint8)
    vsource = h5py.VirtualSource(db['data'])
    layout[...] = vsource  # map the source dataset into the layout

with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'w', libver='latest') as file:
    file.create_virtual_dataset('data', layout, fillvalue=0)
This will create a virtual dataset of your existing training data. Now, if you want to manipulate your data, you should open the file in r+ mode, like so:
with h5py.File(os.getcwd() + "/virtual_training_dataset.hdf5", 'r+', libver='latest') as file:
    # do whatever manipulation you want to do here
One more piece of advice: make sure the indices you use while slicing are of int datatype, otherwise you will get an error.
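If you eventually want all three of your HDF5 files visible as one sliceable dataset, a sketch could look like this (the file names are assumptions; each file is assumed to hold a 'data' dataset of images stacked along the first axis):
import h5py
import numpy as np

# Hypothetical file names -- adjust to your three datasets
files = ["Training_Dataset.hdf5", "Validation_Dataset.hdf5", "Test_Dataset.hdf5"]

# Collect the shapes so the combined layout can be sized correctly
sources = []
total_rows = 0
for fname in files:
    with h5py.File(fname, 'r') as f:
        shape = f['data'].shape
    sources.append((fname, shape))
    total_rows += shape[0]

# One layout covering all files, stacked along the first axis
layout = h5py.VirtualLayout(shape=(total_rows,) + sources[0][1][1:], dtype=np.uint8)
offset = 0
for fname, shape in sources:
    vsource = h5py.VirtualSource(fname, 'data', shape=shape)
    layout[offset:offset + shape[0], ...] = vsource
    offset += shape[0]

with h5py.File("combined_virtual.hdf5", 'w', libver='latest') as f:
    f.create_virtual_dataset('data', layout, fillvalue=0)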
I'm a student who is quite new to coding in Python.
I've been using Dymola for several years and now I'm using the Dymola/Python interface, with which you can operate Dymola from inside Python (useful for building-stock simulations, global sensitivity analysis, etc.).
Now, Dymola always generates .mat files in an efficient but unreadable data structure. I was wondering how to export the variables I'm interested in from that .mat file to .csv using a Python script? (I don't want the whole file converted to .csv because it is simply way too large.)
I'm aware of the DyMat package for Python that should do the job, but either I don't understand the code or the code is not doing what it should do. Does anybody have experience with this?
I'm probably missing some code that defines which .mat file to read from, which variables I want, and in which directory the resulting .csv file should be stored:
import csv, numpy

def export(dm, varList, fileName=None, formatOptions={}):
    """Export DyMat data to a CSV file"""
    if not fileName:
        fileName = dm.fileName + '.csv'
    oFile = open(fileName, 'w')
    csvWriter = csv.writer(oFile)
    vDict = dm.sortByBlocks(varList)
    for vList in vDict.values():
        vData = dm.getVarArray(vList)
        vList.insert(0, dm._absc[0])
        csvWriter.writerow(vList)
        csvWriter.writerows(numpy.transpose(vData))
    oFile.close()
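Is the missing glue roughly something like the following (guessing at the DyMat API here; the file path and variable names are just placeholders)?
import DyMat

# Open the Dymola result file and export only the variables of interest
dm = DyMat.DyMatFile("path/to/result.mat")
export(dm, ["Component.variable1", "Component.variable2"], fileName="result.csv")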
Thanks!
In the Dymola distribution there is a utility called alist.exe that allows you to export a number of variables in CSV format.
Another possibility is to convert the MAT file to SDF format, which is a very simple HDF5 interpretation. The HDF5 file is not as compact as the MAT-file, but you can compress the file using ZIP/GZIP/7ZIP to reduce archival storage. There are both MATLAB and Python scripts for reading the SDF format in the Dymola distribution.
Since this was tagged openmodelica, I am proposing a solution using it:
filterSimulationResults("file.mat", "file.csv", {"x","y","z"}) creates a CSV file containing only the variables x, y, and z (if you think it's still too large, it is possible to resample the file).
For small files (<2GB) Buildingspy (or other Python packages) covers pretty much all needs: https://simulationresearch.lbl.gov/modelica/buildingspy/
However, since one will run into issues when the files are above 2GB (e.g. for full years of simulations), "alist.exe" from Dymola may be employed (filterSimulationResults from OpenModelica also fails then).
"alist.exe" seems to accept only up to approximately 100 variables to be exported at once, and a single execution per variable slows things down drastically (translating 1 or 100 rows takes almost the same time). One may call alist.exe from Python as follows to facilitate automation and speed things up.
import os

import pandas as pd

var_list = ['Component.Name1', 'Component.Name3', 'Component.Name2', '...']  # list of variables to be extracted
N_batch = 100          # number of variables to extract from the .mat file at once (max. approx. 110)
inFile = 'result.mat'  # path to the Dymola .mat result file (placeholder)

# Build the alist.exe commands batch-wise, one '-e <variable>' flag per variable
cmds = []  # list of commands to be executed batch-wise
cmd = ''
for i, var in enumerate(var_list):
    if (i % N_batch == 0) and (i > 0):
        cmds.append(cmd)
        cmd = ''
    cmd += f' -e {var}'  # build command
cmds.append(cmd)

lst_df = []  # list of pandas DataFrames, one per extracted batch
for i, cmd in enumerate(cmds):
    os.system(f'"C:/Program Files/Dymola 2021/bin64/alist.exe" {cmd} {inFile} tmp.csv')
    lst_df.append(pd.read_csv('tmp.csv', index_col=[0]).squeeze("columns"))

df_overall = pd.concat(lst_df, axis=1)
df_overall.to_csv('CompleteCSVFile.csv')  # or use .pkl for more efficient writing and reading
It is still not a fast solution, but it enables processing of the data in the first instance. Dymola's variable selection should always be exploited first, before trying to wrangle such amounts of data on a local machine.
Hope this helps someone someday!
I am trying to run a series of operations on a JSON file using Dask and read_text, but I find that when I check the Linux system monitor, only one core is ever used, at 100%. How do I know whether the operations I am performing on a Dask Bag can be parallelized? Here is the basic layout of what I am doing:
import dask.bag as db
import json
js = db.read_text('path/to/json').map(json.loads).filter(lambda d: d['field'] == 'value')
result = js.pluck('field')
result = result.map(cleantext, tbl=tbl).str.lower().remove(exclusion).str.split()
result.map(stopwords,stop=stop).compute()
The basic premise is to extract the text entries from the JSON file and then perform some cleaning operations. This seems like something that can be parallelized, since each piece of text could be handed off to a processor, and each text and its cleaning are independent of all the others. Is this an incorrect thought? Is there something I should be doing differently?
Thanks.
The read_text function breaks up a file into chunks based on byte ranges. My guess is that your file is small enough to fit into one chunk. You can check this by looking at the .npartitions attribute.
>>> js.npartitions
1
If so, then you might consider reducing the blocksize to increase the number of partitions:
>>> js = db.read_text(..., blocksize=1e6)  # 1MB chunks
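A short sketch putting that together (the path is a placeholder and the exact blocksize value is an assumption):
import json

import dask.bag as db

js = db.read_text('path/to/json', blocksize=int(1e6)).map(json.loads)
print(js.npartitions)  # should now be > 1, letting the map/filter steps run in parallel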
I have the following question:
I am loading a JModelica model and simulate it easily by doing:
from pymodelica import compile_fmu
from pyfmi import load_fmu

model = load_fmu(SOME_FMU)
res = model.simulate()
Everything works fine, and it even saves a resulting .txt file. The problem with this .txt file is that I did not find any functionality within the JModelica Python packages to load such a result file back later into a result object (like the one returned by simulate()) so that I can easily extract the previously saved data.
Implementing that by hand is of course possible, but I find it quite nasty and just wanted to ask if anyone knows of a method that does the job of loading that JModelica-format result file into a result object for me.
Thanks!!!!
The functionality that you need is located in the io module:
from pyfmi.common.io import ResultDymolaTextual

res = ResultDymolaTextual("MyResult.txt")
var = res.get_variable_data("MyVariable")
var.x  # trajectory
var.t  # corresponding time vector