My deep learning model will be trained with supervised learning data.
The data is stored in .npz files (NumPy arrays); each file contains a list of 30 arrays.
I have to train a model using these datasets.
A file is loaded when I call a Python function, and the model then starts training on that data.
The correlation between these datasets is important; that is the subject of this question.
I will load every file in the folder, and the model should keep learning until all of the data has been loaded.
So, can I use these dataset files for training? And how should I train my model?
I'm not sure exactly what you mean, but you should be able to convert certain data structures, such as NumPy arrays, to tensors:
import numpy as np
import torch

a = np.ones(5)
b = torch.from_numpy(a)  # NumPy array to tensor
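Since the question is about .npz files, here is a minimal sketch (my own, with a hypothetical folder layout) of loading every .npz file in a folder and converting its arrays to tensors in the same way:
import glob
import numpy as np
import torch

tensors = []
for path in glob.glob("data/*.npz"):        # hypothetical folder of .npz files
    with np.load(path) as archive:          # each archive holds several named arrays
        for key in archive.files:           # e.g. the 30 arrays stored per file
            tensors.append(torch.from_numpy(archive[key]))
# `tensors` can then be stacked or wrapped in a torch.utils.data.Dataset for training.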
This is a simple article for beginners if that helps with training:
https://medium.com/coinmonks/create-a-neural-network-in-pytorch-and-make-your-life-simpler-ec5367895199
Related
I would like to transform my data using a preprocessing pipeline in PyTorch in order to train a model. My dataset consists of many ~GB size files. Each file is effectively a series of 3D images (so 4D total). As my model is using 3D convolutions, it's a bit infeasible to keep the original images intact and so crucially I need to split up each image into many different dataset examples. (Effectively many series of 3D patches). In addition to this I need to shuffle the 4th dimension using a custom shuffling function, and split that dimension into different dataset examples too.
To achieve this in TensorFlow I would:
Save the data to the .tfrecord format
Load each large image as a tf.data.Dataset
Apply a series of mapping functions using tf.data.Dataset.map
Split the dataset into many sub examples using tf.data.Dataset.from_tensor_slices
My question is: how can I achieve the same thing in PyTorch? The splitting-into-patches step could be precomputed and saved to disk rather than done on the fly, but this would be disadvantageous in terms of data flexibility. Critically, the shuffled-dimension step needs to be applied at each epoch and therefore cannot be saved to disk.
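No answer is recorded in this thread, but as a rough illustration (my own assumptions, not the thread's solution: files stored as .npy, shaped (T, D, H, W), with np.random.permutation standing in for the custom shuffle), a torch IterableDataset can stream patches and reshuffle the 4th dimension at every epoch:
import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class PatchDataset(IterableDataset):
    def __init__(self, file_paths, patch_size):
        self.file_paths = file_paths
        self.patch_size = patch_size

    def __iter__(self):
        for path in self.file_paths:
            volume = np.load(path, mmap_mode="r")           # 4D array kept on disk
            order = np.random.permutation(volume.shape[0])  # stand-in for the custom shuffle,
            for t in order:                                 # re-drawn every epoch
                image = np.array(volume[t])                 # one 3D image in memory
                yield from self._patches(image)

    def _patches(self, image):
        p = self.patch_size
        d, h, w = image.shape
        for i in range(0, d - p + 1, p):
            for j in range(0, h - p + 1, p):
                for k in range(0, w - p + 1, p):
                    yield torch.from_numpy(image[i:i+p, j:j+p, k:k+p].copy())

# loader = DataLoader(PatchDataset(paths, patch_size=32), batch_size=8)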
I am trying to create a large numpy array, say
import numpy as np

S = 0.5
a = np.random.normal(size=(100000, 10000))
x = np.maximum(S - a, 1)
# This is just an example; the real calculation is more complicated than this.
But it is too large to fit in memory. After creating this array, I also need to manipulate it and use it as training data in machine learning (e.g., xgboost or CART).
So my questions are: 1. How can I create such a big array without getting a memory error, while still being able to do calculations on it? Could you recommend some packages or links to learn how to do this? 2. Suppose this array is already saved in a file. How do I load it and then train my model without causing a memory error?
I have read
https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/
but it doesn't explain how to write data to disk.
Could anyone help, please? Thanks a lot.
Dask can help with large NumPy arrays, but it does not support every function of the NumPy API.
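For example, a small sketch (assuming Dask is installed; the chunk sizes and output path are arbitrary) that mirrors the NumPy code from the question but computes lazily and writes the result to disk chunk by chunk:
import dask.array as da

S = 0.5
a = da.random.normal(size=(100000, 10000), chunks=(10000, 10000))  # never held fully in RAM
x = da.maximum(S - a, 1)
da.to_npy_stack("x_chunks/", x)  # materialize the result on disk, one .npy file per chunk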
Since you have mentioned in your question that your final target is to use the data for training a machine learning model, let's look at the problem from the other end.
Assuming that you somehow manage to load the data into memory, how do you plan to pass it to the underlying ML model? Most of the available classical ML models work on NumPy arrays, so even if you manage to load the data in some other format, you cannot pass it to the ML model for training unless it is a NumPy array.
In case your data is sparse, you can store it in sparse matrices, and some classical models can handle sparse input directly.
This is the general situation when the data is too large to fit into memory: ideally, you should look at ML models that can be trained one batch at a time. That way you can load a batch of data, train on it, and move on to the next batch. Any ML model that can be trained with a gradient descent algorithm can be trained one batch at a time. Deep learning models are trained with gradient descent, so they all work on a batch of data at a time.
So if you decide to use deep learning models, you will normally end up writing a data loader that loads one batch of data at a time, as sketched below.
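As an illustration (a sketch only: the file names x.npy and y.npy are hypothetical, and scikit-learn's SGDRegressor stands in for whatever batch-trainable model you pick), you could write the big array to a memory-mapped .npy file chunk by chunk and then train on it batch by batch:
import numpy as np
from sklearn.linear_model import SGDRegressor

n_rows, n_cols, chunk = 100000, 10000, 1000
S = 0.5

# 1) Build the big array on disk, one chunk of rows at a time.
x = np.lib.format.open_memmap("x.npy", mode="w+", dtype="float64",
                              shape=(n_rows, n_cols))
for start in range(0, n_rows, chunk):
    a = np.random.normal(size=(chunk, n_cols))
    x[start:start + chunk] = np.maximum(S - a, 1)
x.flush()

# 2) Train batch by batch (y.npy is hypothetical target data prepared the same way).
x = np.load("x.npy", mmap_mode="r")
y = np.load("y.npy", mmap_mode="r")
model = SGDRegressor()
for start in range(0, n_rows, chunk):
    model.partial_fit(x[start:start + chunk], y[start:start + chunk])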
If you do not want to use batch-based training models, the bottom line is this: since your final target is to train an ML model, first find out the data representation required by the ML model you want to use, and then try to solve the problem of fitting your data into that format. It would be a waste of your time and effort to figure out how to fit the data into memory, only to realize that your ML model cannot work with that representation.
Fitting convnets such as ResNet and VGG benefits from the ImageRecordIter Python class, which allows efficient loading of batches from large collections of RGB images stored in RecordIO .rec files.
Does anybody know about equivalent facilities for large arbitrary input 2D or 3D matrices (for 2D, rows = items and cols = features, + channels in 3D)?
NDArrayIter requires loading the whole dataset into memory, which I want to avoid in my case (a >40 GB data file). CSVIter does not allow straightforward shuffling and works only for 2D matrices.
You can keep each data sample in a separate file and implement a very simple custom Dataset class, following the reference code of ImageFolderDataset; a sketch is shown below.
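A minimal sketch of that idea (my own layout assumptions: one sample_*.npy file per example plus a labels.npy array), modelled on ImageFolderDataset:
import os
import numpy as np
from mxnet import nd
from mxnet.gluon.data import Dataset, DataLoader

class NpyFolderDataset(Dataset):
    def __init__(self, root):
        self.root = root
        self.files = sorted(f for f in os.listdir(root)
                            if f.startswith("sample_") and f.endswith(".npy"))
        self.labels = np.load(os.path.join(root, "labels.npy"))

    def __getitem__(self, idx):
        x = np.load(os.path.join(self.root, self.files[idx]))  # only one sample is read
        return nd.array(x), self.labels[idx]

    def __len__(self):
        return len(self.files)

# loader = DataLoader(NpyFolderDataset("data/"), batch_size=64, shuffle=True)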
I have saved my large array of images and their labels in HDF5 format using the approach from this link:
Saving and loading a large number of images (data) into a single HDF5 file
which gives me the following keys:
list of datasets:
['test_img', 'test_labels', 'train_img', 'train_labels', 'train_mean', 'val_img', 'val_labels']
Now I want to feed the training data and training labels to a convolutional neural network (VGG-16 or ResNet) for training, and I also want to validate and test the results with the CNN. How can I feed my data from the HDF5 file into the CNN?
Have a look at this open-source project tftables.
Otherwise, this other thread has a great, detailed answer by mikkola explaining how to use your own generator with the new TensorFlow Dataset API.
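To give a flavour of the generator approach (a sketch only: the file name, dtypes, and batch size are assumptions, and the HDF5 keys are the ones listed in the question):
import h5py
import tensorflow as tf

def hdf5_generator(path, img_key, label_key):
    with h5py.File(path, "r") as f:
        for i in range(f[img_key].shape[0]):
            yield f[img_key][i], f[label_key][i]   # one sample read from disk at a time

dataset = tf.data.Dataset.from_generator(
    lambda: hdf5_generator("images.h5", "train_img", "train_labels"),
    output_types=(tf.float32, tf.int64),
)
dataset = dataset.shuffle(1024).batch(32)
# model.fit(dataset, ...)  # feed directly to a Keras VGG-16 / ResNet model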
I am looking for a way to load vectors I generated previously using scikit-learn's TfidfVectorizer. In general, what I want is to get a better understanding of TfidfVectorizer's data persistence.
For instance, here is what I have done so far:
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words=stop)   # `stop` is my list of stop words
vect_train = vectorizer.fit_transform(corpus)   # `corpus` is my list of documents
Then I wrote 2 functions in order to be able to save and load my vectorizer:
import joblib

def save_model(model, name):
    '''
    Function that enables us to save a trained model.
    '''
    joblib.dump(model, '{}.pkl'.format(name))

def load_model(name):
    '''
    Function that enables us to load a saved model.
    '''
    return joblib.load('{}.pkl'.format(name))
I checked posts like the one below, but I still didn't manage to make much sense of it.
How do I store a TfidfVectorizer for future use in scikit-learn?
What I ultimately want is to run a training session, then later load the set of produced vectors, transform some new text input against those vectors, and compute cosine_similarity between the old vectors and the new ones generated from them.
One of the reasons I want to do this is that vectorizing such a large dataset takes approximately 10 minutes, and I want to do it once rather than every time a new query comes in.
I guess what I should be saving is vect_train, right? But then what is the correct way to first save it and then load it into a newly created instance of TfidfVectorizer?
The first time I tried to save vect_train with joblib, as the kind people at scikit-learn advise, I got four files: tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy. It would be great to know what exactly those are and how I could load them into a new instance of
vectorizer = TfidfVectorizer(stop_words=stop)
created in a different script.
Thank you in advance.
The result of your vect_train = vectorizer.fit_transform(corpus) is twofold: (i) the vectorizer is fitted to your data, that is, it learns the corpus vocabulary and the idf of each term, and
(ii) vect_train is populated with the vectors of your corpus.
The save_model and load_model functions you propose persist and load the vectorizer, that is, the internal parameters it has learned, such as the vocabulary and the idfs. Having loaded the vectorizer, all you need to do to get vectors is transform a list of data. It can be unseen data or the raw data you used during fit_transform. Therefore, all you need is:
vectorizer = load_model(name)
vect_train = vectorizer.transform(corpus) # (1) or any unseen data
At this point, you have everything you had before saving, but the transformation call (1) will take some time depending on your corpus. In case you want to skip this, you also need to save the content of vect_train, as you correctly wonder in your question. This is a sparse matrix and can be saved/loaded using SciPy; you can find information in this question, for example. Copying from that question, to actually save the CSR matrices you also need:
import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])
In conclusion, the functions above can be used for saving/loading your vect_train, whereas the ones you provided are for saving/loading the vectorizer itself in order to vectorize new data.
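Putting the two pairs of functions together, a short usage sketch (the query string is just a placeholder) of what the question ultimately asks for: save once, reload in another script, and compare new text against the old vectors with cosine_similarity:
from sklearn.metrics.pairwise import cosine_similarity

save_model(vectorizer, "tfidf")                  # fitted vocabulary + idf weights
save_sparse_csr("vect_train.npz", vect_train)    # the precomputed training vectors

# later, in a different script
vectorizer = load_model("tfidf")
vect_train = load_sparse_csr("vect_train.npz")
query_vec = vectorizer.transform(["some new query text"])   # placeholder query
scores = cosine_similarity(query_vec, vect_train)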