I am trying to create a large numpy array, say
import numpy as np

S = 0.5
a = np.random.normal(size=(100000, 10000))  # 100000 x 10000 float64 values, roughly 8 GB
x = np.maximum(S - a, 1)
#This is just an example. The calculation is more complicated than this.
But it is too large to fit in memory. After creating this array, I also need to manipulate it and use it as training data in machine learning (e.g. xgboost or CART).
So my questions are: 1. How would I create such a big array without getting a memory error, in a way that still lets me do calculations on it? Could you recommend some packages or links I could learn from? 2. Suppose this array is already saved in a file. How do I load it and then train my model without causing a memory error?
I have read
<https://pythonspeed.com/articles/mmap-vs-zarr-hdf5/>
but it didn't say how to write the data onto disk.
Could anyone help please? Thanks a lot.
Dask can help with large NumPy arrays, but it does not support the full NumPy API.
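For reference, a minimal sketch of the example calculation above written with Dask (this assumes dask and zarr are installed; the chunk size is only an illustration):

import dask.array as da

S = 0.5
a = da.random.normal(size=(100000, 10000), chunks=(1000, 10000))  # lazy, chunked array (~80 MB per chunk)
x = da.maximum(S - a, 1)  # still lazy; nothing is materialised in memory yet
x.to_zarr("x.zarr")       # computes chunk by chunk and writes the result to disk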
Since you have mentioned in your question that your final target is to use the data to train a machine learning model, let's look at the problem from the other end.
Assuming that you somehow managed to load the data into memory, how do you plan to pass it to the underlying ML models? Most of the available classical ML models work on NumPy arrays, so even if you manage to load the data in some other format, you cannot pass it to the ML model for training unless that representation is a NumPy array.
In case your data is sparse, you can store it in scipy sparse matrices, and some classical models can handle sparse input directly.
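For illustration, a sparse matrix of the same shape as the example above (scipy.sparse, which most scikit-learn estimators accept directly; the density here is arbitrary):

from scipy import sparse

X_sparse = sparse.random(100000, 10000, density=0.001, format="csr")  # only ~0.1% of entries are stored
print(X_sparse.data.nbytes)  # memory used by the non-zero values alone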
This is the general case when the data is too large to fit into memory. Ideally you should look for ML models that can be trained one batch at a time. This way you can load a batch of data, train on it, and move on to the next batch. Any ML model that is trained with a gradient descent algorithm can be trained batch by batch. Deep learning models are trained with gradient descent, so they all work on a batch of data at a time.
So if you decide to use deep learning models, you will normally end up writing a data loader which loads one batch of data at a time, as in the sketch below.
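A minimal sketch of such a loader in PyTorch, reading samples out of a memory-mapped .npy file (the file name "x.npy" is a placeholder and labels are omitted for brevity):

import numpy as np
import torch
from torch.utils.data import Dataset, DataLoader

class MmapDataset(Dataset):
    def __init__(self, path):
        self.data = np.load(path, mmap_mode="r")  # the array stays on disk
    def __len__(self):
        return len(self.data)
    def __getitem__(self, idx):
        row = np.asarray(self.data[idx], dtype=np.float32)  # only this row is read into memory
        return torch.from_numpy(row)

loader = DataLoader(MmapDataset("x.npy"), batch_size=256, shuffle=True)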
If you do not want to use batch-based training models, the bottom line is this: since your final target is to train an ML model, first find out the data representation of the ML model you want to use, then try to solve the problem of fitting your data into that format. It would be a waste of your time and effort to figure out how to fit the data into memory, only to realise that your ML model cannot work on that representation.
Related
I have trained a Keras model which uses TFRecords because the training data and the validation data do not fit into memory. Now I would like to compute a confusion matrix, but the classic method, which uses NumPy arrays or in-memory tensors (with tf.math.confusion_matrix, for example), will not work.
Is there any way I can build a confusion matrix directly from TFRecords? Even if the amount of data is huge, making a confusion matrix should not eat too much memory. (Answers saying "No, because..." will also be very helpful to me.)
Or will I just have to work with smaller validation data?
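One approach that should stay within memory is to accumulate the confusion matrix one batch at a time. A sketch, in which `model`, `parse_fn`, the file name and NUM_CLASSES are all assumptions:

import tensorflow as tf

NUM_CLASSES = 10  # hypothetical
val_ds = (tf.data.TFRecordDataset(["val.tfrecord"])  # hypothetical file name
          .map(parse_fn)                             # parse_fn yields (features, label)
          .batch(256))

cm = tf.zeros((NUM_CLASSES, NUM_CLASSES), dtype=tf.int32)
for features, labels in val_ds:
    preds = tf.argmax(model(features, training=False), axis=-1)
    cm += tf.math.confusion_matrix(labels, preds, num_classes=NUM_CLASSES)
print(cm.numpy())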
My deep learning model will use some supervised learning data.
The data will be loaded from .npz (NumPy array) files, and each file will contain 30 arrays.
I have to train a model using these data sets.
They will be loaded when I call them from a Python function; then the model will start training on these data sets.
And the correlation between these data sets is important.
This is the subject of this question.
I will load every file in the folder, and the model will continue learning until data loading is complete.
So, can I use these data set files in training?
And how do I train my model?
I'm not sure what you mean, but you should be able to convert certain data structures, such as NumPy arrays, to tensors.

import numpy as np
import torch

a = np.ones(5)
b = torch.from_numpy(a)  # NumPy array to tensor
This is a simple article for beginners if that helps with training:
https://medium.com/coinmonks/create-a-neural-network-in-pytorch-and-make-your-life-simpler-ec5367895199
I have two sets of data, let's say A and B.
I want to apply PCA and t-SNE to A and fine-tune the algorithms.
Once I am satisfied with my tuning, I want to save the learnt state to a pickle file.
Now I want to apply the same learnt PCA and t-SNE to set B.
I want t-SNE to produce the same results on B every time. I am hoping for this because I am assuming we can save the state of the learnt t-SNE parameters as well. If the parameters are the same, and I load the same file every time, the result of applying t-SNE to set B should be the same every time.
How can I do this in Sklearn and python?
I am sorry, I am new to ML and python, this may be a very basic question.
Fine-tuning t-SNE amounts to tuning a heuristic algorithm (it's ill-conditioned after all; a higher-dimension to lower-dimension mapping is lossy) for your data.
Applying this tuned & learned mapping to other data is done by sklearn's transform.
But: you will see that there is no transform-method for T-SNE and the reason is given here (including further discussion):
It is a transductive learner, like many clustering algorithms: the model is not really applicable beyond the data points it is fed as training.
So whatever you tuned for dataset A, does not really apply to dataset B (including parameters)!
For PCA this is trivial. Use the methods described in docs: model_persistence and use PCA's transform-method (assuming compatible datasets; dimensions!).
To be able to save the models, you should use the joblib library:
from joblib import dump, load
after establishing and fitting the model, as below for PCA:

from sklearn.decomposition import PCA
pca_model = PCA(n_components=n).fit(A)  # fit on dataset A
you can then save the fitted model in joblib format to the current directory:
dump(pca_model, 'pca_model.joblib')
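Later, to apply the same learnt PCA to set B, load the file back and call transform (a sketch; B is your second dataset with matching dimensions):

from joblib import load

pca_model = load('pca_model.joblib')
B_reduced = pca_model.transform(B)  # applies the components learnt on A to B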
I have a real-time data feed of patient health data that I connect to with Python. I want to run some sklearn algorithms over this data feed so that I can predict in real time if someone is going to get sick. Is there a standard way to connect real-time data to sklearn? I have traditionally had static datasets and never an incoming stream, so this is quite new to me. If anyone has some general rules/processes/tools for this, that would be great.
With most algorithms training is slow and predicting is fast. Therefore it is better to train offline using training data; and then use the trained model to predict each new case in real time.
Obviously you might decide to train again later if you acquire more/better data. However there is little benefit in retraining after every case.
It is feasible to train the model on a static dataset and use it to predict classifications for incoming data. Retraining the model with each new set of patient data, not so much; it also breaks the train/test methodology for evaluating an ML model.
Trained models can be saved to file and imported in the code used for real time prediction.
In Python's scikit-learn, this is done via the pickle package.
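A minimal sketch of that split (the names trained_model and incoming_features are placeholders):

import pickle

# offline: after training on the static dataset
with open("model.pkl", "wb") as f:
    pickle.dump(trained_model, f)

# real-time process: load the model once, then predict for each incoming record
with open("model.pkl", "rb") as f:
    model = pickle.load(f)
prediction = model.predict([incoming_features])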
In R, you can save a model to an .rds object with saveRDS and load it back with readRDS.
Yay... my first time answering an ML question!
Suppose I have a (possibly) large corpus, about 2.5M documents with 500 features each (after running LSI on the original data with gensim). I need the corpus to train my classifiers using scikit-learn. However, I need to convert the corpus into a NumPy array first. The corpus creation and the classifier training are done in two different scripts.
The problem is that my collection is expected to grow, and at this stage I already don't have enough memory (32GB on the machine) to convert everything at once (with gensim.matutils.corpus2dense). To work around the problem I am converting one vector at a time, but it is very slow.
I have considered dumping the corpus into svmlight format, and have scikit-learn to load it with sklearn.datasets.load_svmlight_file. But then it would probably mean I will need to load everything into memory at once?
Is there anyway I can efficiently convert from gensim corpus to numpy array (or scipy sparse matrix)?
I'm not very knowledgeable about Gensim, so I hesitate to answer, but here goes:
Your data does not fit in memory so you will have to either stream it (basically what you are doing now) or chunk it out. It looks to me like gensim.utils.chunkize chunks it out for you, and you should be able to get the dense numpy array that you need with as_numpy=True. You will have to use the sklearn models that support partial_fit. These are trained iteratively, one batch at a time. The good ones are the SGD classifier and the Passive-Aggressive Classifier. Make sure to pass the classes argument at least the first time you call partial_fit. I recommend reading the docs on out-of-core scaling.
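A sketch of that loop (here each chunk is densified with corpus2dense from the question; `corpus`, `labels`, and `all_classes` are assumptions — the LSI-transformed gensim corpus, an array of targets aligned with it, and the full set of class labels):

from gensim import matutils, utils
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
start = 0
for chunk in utils.chunkize(corpus, chunksize=10000):
    X = matutils.corpus2dense(chunk, num_terms=500).T  # dense block of shape (n_docs, 500)
    y = labels[start:start + len(chunk)]
    clf.partial_fit(X, y, classes=all_classes if start == 0 else None)
    start += len(chunk)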