The problem I have is the following:
I have a csv file with roughly 10 million rows. With it, I want to run a linear regression with many interaction terms; in the end there will be 3000 such interactions. Creating them by hand would give a dataset of shape (10 million, 3000), which no longer fits into memory. Furthermore, I need to center these interaction terms prior to fitting.
Fixed effects are not possible, as the interactions contain continuous variables rather than true dummies (they will be mostly 0, some 1, and a few 0.5).
The plan I have for now is the following:
Use dask (http://docs.dask.org/en/latest/dataframe.html) to read in the csv file, then create the interactions and save them out of core so that I don't run into memory problems where pandas would fail. How can I create the interactions with dask efficiently?
Center the created interaction terms with sklearn's StandardScaler (https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler). Should I just loop over a reasonable number of columns (say, 100), center those 100, and store the centered variables on disk again? From the dask documentation I gather that combining dask with sklearn should be straightforward?
Fit the model using Stochastic Gradient Descent (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html#sklearn.linear_model.SGDRegressor), using the partial_fit() method to fit it part by part in a loop. I am also aware of other update schemes (https://dahtah.wordpress.com/2011/11/29/rank-one-updates-for-faster-matrix-inversion/) but have not found Python implementations for them.
Predict part by part in a loop.
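Here is a rough sketch of how I picture these four steps fitting together ('data.csv', the columns x1 and x2, the target y, and the use of dask-ml's StandardScaler instead of plain sklearn are just placeholders/assumptions for illustration):

import dask.dataframe as dd
from dask_ml.preprocessing import StandardScaler
from sklearn.linear_model import SGDRegressor

# step 1: read the csv lazily and build the interaction terms out of core
df = dd.read_csv('data.csv')
df['x1_x2'] = df['x1'] * df['x2']              # repeat / generate for each of the 3000 pairs
df.to_parquet('interactions.parquet')          # persist to disk instead of holding in memory
df = dd.read_parquet('interactions.parquet')
interaction_cols = ['x1_x2']                   # in reality the full list of interaction names

# step 2: center the interactions (with_std=False does the centering only)
scaler = StandardScaler(with_std=False)
X = scaler.fit_transform(df[interaction_cols])

# step 3: fit the SGDRegressor partition by partition with partial_fit
model = SGDRegressor()
for x_part, y_part in zip(X.to_delayed(), df['y'].to_delayed()):
    model.partial_fit(x_part.compute(), y_part.compute())

# step 4: predict partition by partition
preds = [model.predict(p.compute()) for p in X.to_delayed()]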
Do you think this is a reasonable plan?
Related
Can someone help me understand why my PCA is getting different results each run?
I'm working in PySpark on Databricks.
My current implementation is as below:
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors  # ml, not mllib, to match the DataFrame-based PCA
pca = PCA(k=35, inputCol="scaled_features", outputCol="pcaFeatures")
model = pca.fit(df.select('scaled_features'))
result = model.transform(df.select('scaled_features'))
print(model.explainedVariance)
If I run this code multiple times, I get different results for the explained variance.
The difference is quite small, but when I try to perform K-Means clustering afterwards, the difference changes the result a lot.
PySpark is a distributed computation system and relies on distributed versions of the k-Means and PCA algorithms. These distributed versions can be non-deterministic and have non-zero error bounds (see the links at the bottom), owing to data locality and the lack of a universal view of the dataset being necessary design constraints.
Each algorithm is designed so that no single machine has access to all of the data at one time, to allow for datasets that are too big for that. At each step of the calculation, local segments of the data are used to generate intermediate results, which are then shuffled across nodes with available capacity. Which entry is grouped with which other entry can change the results for PCA, and the order in which machines become available versus which elements are ready for the next iteration is not easy to synchronize.
k-Means (even on a single machine) can also be very non-deterministic - it is very sensitive to the initial seed centroids of the clusters. Making sure you always start with the same centroids can help (but if it is performed on PCA features that are changing, that will not help on its own). Carefully setting random seeds on each machine so that they do not collide is also something to consider, and ensuring that the same data is assigned to the same partitions at the beginning of the run, with sorting/indexes, can help as well. All of these things together may reduce the variance between runs, but there are a lot of moving parts that all play a role.
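For example, something along these lines may reduce the run-to-run differences - the partition count, k=8 and the seed value are arbitrary placeholders, and note that Spark ML's PCA itself has no seed parameter:

from pyspark.ml.feature import PCA
from pyspark.ml.clustering import KMeans

# cache the input so PCA and k-Means see the same partitioning of the data each run
features = df.select('scaled_features').repartition(200).cache()

pca_model = PCA(k=35, inputCol='scaled_features', outputCol='pcaFeatures').fit(features)
pca_features = pca_model.transform(features).cache()

# fix the seed (and init mode) so k-Means starts from the same centroids every time
kmeans = KMeans(k=8, featuresCol='pcaFeatures', seed=42, initMode='k-means||')
kmeans_model = kmeans.fit(pca_features)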
https://www.cs.cmu.edu/~ninamf/papers/distributedPCAandCoresets.pdf
https://www.researchgate.net/publication/232063041_Principal_Component_Analysis_for_Dimension_Reduction_in_Massive_Distributed_Data_Sets
Hi, I am currently facing an issue when trying to run a large number of Lasso regressions on the same data.
I have a dataset with several hundred thousand rows and 365 columns (data for one year).
What I need to do is fit and predict one Lasso regression per row, based on all other rows, so that I end up with as many individual regressions as rows.
I struggle to find a way other than a loop to execute this in a reasonably performant way. I have experimented with Python's joblib package, which improved the performance, but I am still looking for a faster way if there is one. I also tried to find a way to vectorize this problem, but did not find a solution, and I glanced at Keras to implement the Lasso regression as an ANN.
I can hardly imagine that it is uncommon to fit a large number of simple models for specific tasks. So my question is: what is the best approach to reduce the execution time for this problem? From my understanding, it cannot be sped up much with GPUs, as lots of individual models need to be fitted and the CPU will therefore be the bottleneck regardless.
Btw: I can get access to a GPU and I have a CPU with 16 cores.
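For reference, this is roughly what my current joblib loop looks like (the alpha value, the random stand-in data and the exact per-row design are simplified here - in this sketch each row is regressed on the transposed remaining rows):

import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import Lasso

data = np.random.rand(1000, 365)   # stand-in for my real dataset

def fit_one(i, data):
    # regress the 365 values of row i on all the other rows (transposed), then predict
    y = data[i]
    X = np.delete(data, i, axis=0).T
    model = Lasso(alpha=0.1)
    model.fit(X, y)
    return model.predict(X)

# n_jobs=16 matches my 16 cores
results = Parallel(n_jobs=16)(delayed(fit_one)(i, data) for i in range(data.shape[0]))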
I am trying to cluster retail data in order to extract groupings of customers based on 6 input features. The data has a shape of (1712594, 6) in the following format:
I've split the 'Department' categorical variable into a binary n-dimensional array using Pandas get_dummies(). I'm aware this is not optimal, but I just wanted to test it out before trying Gower distances.
The Elbow method gives the following output:
USING:
I'm using Python and scikit-learn's KMeans because the dataset is so large and more complex models are too computationally demanding for Google Colab.
OBSERVATIONS:
I'm aware that columns 1-5 are extremely correlated, but the data is limited sales data and little to no data is captured about customers. KMeans is very sensitive to its inputs, and this may affect the WCSS in the Elbow method and cause the near-straight line, but this is just a hunch and I don't have any quantitative backing to support the argument. I'm a junior data scientist, so my knowledge of the technical foundations of clustering models and algorithms is still developing - forgive me if I'm missing something.
WHAT I'VE DONE:
There were massive outliers skewing the data. This is a building goods company, so most of their sale prices and quantities fall within a certain range, but ~5% of the data contained massive quantity entries (e.g. a company buying 300000 bricks at R3/brick) or massive price entries (e.g. a company buying an expensive piece of equipment).
I've removed them and kept ~94% of the data. I've also removed the returns made by customers (i.e. negative quantities and prices), with the idea that I may later create a binary 'Returned' variable to capture this feature.
These are some metrics before removing the outliers:
and these are the metrics after outlier removal:
KMeans uses Euclidean distances. I've used both scikit-learn's StandardScaler and RobustScaler when scaling, without any significant change in either case. Here are some distribution plots and scatter plots for the 3 numeric variables:
Does anybody have any practical/intuitive reasoning as to why this may be happening? I'm open to alternative methods as well, and any help would be much appreciated! Thanks
I am not an expert, but in my experience with scikit-learn cluster analysis, when the features are all very similar in magnitude k-means clustering usually does not do the job well. I would first try a StandardScaler to see whether normalizing the data makes the clustering more effective. Your elbow plot shows the within-cluster sum of squares still dropping steadily as the number of clusters grows, and judging by that and the plots you provide, I would think the data is too similar, making it hard to separate into groups (clusters). Adding an additional feature made up from your data can do the trick.
I would try normalizing the data first with StandardScaler.
If the groups are still not very clear from a simple plot of the data, I would create another column made up of a combination of the other columns.
I would not suggest using DBSCAN, since the eps (distance) parameter would have to be tuned very finely and, as you mention, it is more computationally expensive.
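A minimal sketch of what I mean (the stand-in dataframe, column names and range of k are placeholders for your own data):

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

numeric_cols = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6']   # your 6 input features
df = pd.DataFrame(np.random.rand(1000, 6), columns=numeric_cols)  # stand-in for your data

# scale first, then recompute the elbow curve on the scaled data
X = StandardScaler().fit_transform(df[numeric_cols])

# optionally add a derived feature combining existing columns, e.g.:
# df['price_x_quantity'] = df['price'] * df['quantity']

wcss = []
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wcss.append(km.inertia_)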
I am very sorry if this question violates SO's question guidelines, but I am stuck and cannot find anywhere else to ask this type of question. Suppose I have a dataset containing experimental data obtained under three different conditions (hot, cold, comfortable). The data is arranged in a pandas dataframe with 4 columns (time, cold, comfortable and hot), one column per condition plus the time.
When I plot the data, I can visually see the separation of the three experiments, but I would like to do it automatically with machine learning.
The x-axis represents time and the y-axis represents the magnitude of the data. I have read about different machine learning classification techniques, but I do not understand how to set up my data so that I can 'feed' it into a classification algorithm. Namely, my questions are:
Is this programmatically feasible?
How can I set up (arrange) my data so that it can easily be fed into a classification algorithm? From what I have read so far, it seems the data has to be in a certain order for the algorithm to work (see for example the iris dataset, where the data is nicely labeled). How can I customize the algorithms to fit my needs?
NOTE: Ideally, I would like a program that, given a magnitude value, classifies it as hot, comfortable or cold. The time series is not of much relevance in my case.
Of course this is feasible.
It's not entirely clear from the original post exactly what variables/features you have available for your model, but here is a bit of general guidance. All of these machine learning problems, from classification to regression, rely on the same core assumption that you are trying to predict some outcome based on a bunch of inputs. Usually this relationship is modeled like this: y ~ X1 + X2 + X3 ..., where y is your outcome ("dependent") variable, and X1, X2, etc. are features ("explanatory" variables). More simply, we can say that using our entire feature-set matrix X (i.e. the matrix containing all of our x-variables), we can predict some outcome variable y using a variety of ML techniques.
So in your case, you'd try to predict whether it's Cold, Comfortable, or Hot based on time. This is really more of a forecasting problem than it is a ML problem, since you have a time component that looks to be one of the most important (if not the only) features in your dataset. You may want to look at some simpler time-series forecasting methods (e.g. ARIMA) instead of ML algorithms, as some of the time-series ML approaches may not be well-suited for a beginner.
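That said, since your note says you mainly want to map a magnitude value to a label, here is a minimal sketch of the data arrangement, using the column names from your dataframe; the tiny stand-in data and the choice of classifier are just examples:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# stand-in for your 4-column dataframe: time, cold, comfortable, hot
df = pd.DataFrame({'time': range(5),
                   'cold': [2.1, 2.3, 2.0, 2.2, 2.4],
                   'comfortable': [5.0, 5.2, 4.9, 5.1, 5.3],
                   'hot': [8.8, 9.0, 8.7, 9.1, 8.9]})

# reshape the wide dataframe into long form: one row per (magnitude, label) pair
long_df = df.melt(id_vars='time', value_vars=['cold', 'comfortable', 'hot'],
                  var_name='label', value_name='magnitude')

clf = DecisionTreeClassifier()
clf.fit(long_df[['magnitude']], long_df['label'])

clf.predict(pd.DataFrame({'magnitude': [5.0]}))   # classify a new magnitude value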
In any case, this should get you started, I think.
I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.
I use sklearn a lot, but for much smaller datasets.
In this situation the classical approach would be something like:
read only part of the data -> partially train your estimator -> discard that part -> read the next part of the data -> continue training your estimator.
I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator with various subsamples of the data.
Now I am wondering: is there an easy way to do that in sklearn?
I am looking for something like
r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # read the next 20 lines
    m.partial_fit(x)
m.predict(new_x)
Maybe sklearn is not the right tool for this kind of thing?
Let me know.
I've used several scikit-learn classifiers with out-of-core capabilities to train linear models - Stochastic Gradient, Perceptron and Passive Aggressive - and also Multinomial Naive Bayes on a Kaggle dataset of over 30 GB. All of these classifiers share the partial_fit method which you mention, although some behave better than others.
You can find the methodology, the case study and some good resources in this post:
http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
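The basic pattern with any of them looks roughly like this (the file name, chunk size, 'label' column and class list are placeholders; note that for classifiers the full set of classes has to be supplied on the first partial_fit call):

import pandas as pd
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
all_classes = [0, 1]                              # every label that can ever appear
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    X = chunk.drop(columns='label')
    y = chunk['label']
    clf.partial_fit(X, y, classes=all_classes)    # trains incrementally, chunk by chunk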
I think sklearn is fine for larger data. If your chosen algorithms support partial_fit or an online learning approach then you're on track. One thing to be aware of is that your chunk size may influence your success.
This link may be useful...
Working with big data in python and numpy, not enough ram, how to save partial results on disc?
I agree that h5py is useful but you may wish to use tools that are already in your quiver.
Another thing you can do is randomly pick whether or not to keep each row in your csv file, and save the result to a .npy file so it loads more quickly. That way you get a sample of your data that will let you start playing with it with all of the algorithms, and deal with the bigger data issue along the way (or not at all - sometimes a sample with a good approach is good enough, depending on what you want).
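For example, something like this keeps roughly 10% of the rows (the sampling rate, chunk size and file names are just illustrative):

import numpy as np
import pandas as pd

# stream through the csv in chunks and keep each row with probability 0.1
kept = []
for chunk in pd.read_csv('data.csv', chunksize=100_000):
    mask = np.random.rand(len(chunk)) < 0.1
    kept.append(chunk[mask].to_numpy())

sample = np.vstack(kept)
np.save('sample.npy', sample)   # later: sample = np.load('sample.npy'), which loads quickly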
You may want to take a look at Dask or Graphlab
http://dask.pydata.org/en/latest/
https://turi.com/products/create/
They are similar to pandas but work on large-scale data (using out-of-core dataframes). The problem with pandas is that all the data has to fit into memory.
Both frameworks can be used with scikit learn. You can load 22 GB of data into Dask or SFrame, then use with sklearn.
I find it interesting that you have chosen to use Python for statistical analysis rather than R. However, I would start by putting my data into a format that can handle such large datasets. The Python h5py package is fantastic for this kind of storage, allowing very fast access to your data. You will need to chunk up your data into reasonable sizes, say 1 million element chunks (e.g. 20 columns x 50,000 rows), writing each chunk to the H5 file. Next you need to think about what kind of model you are running, which you haven't really specified.
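A rough illustration of that storage step (the column count, chunk size, file names and the assumption of all-numeric columns are mine):

import h5py
import pandas as pd

# stream the csv into an HDF5 file in 50,000-row x 20-column chunks
with h5py.File('data.h5', 'w') as f:
    dset = f.create_dataset('data', shape=(0, 20), maxshape=(None, 20),
                            chunks=(50_000, 20), dtype='f8')
    start = 0
    for chunk in pd.read_csv('data.csv', chunksize=50_000):
        stop = start + len(chunk)
        dset.resize(stop, axis=0)          # grow the dataset as chunks arrive
        dset[start:stop] = chunk.to_numpy()
        start = stop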
The fact is that you will probably have to write the algorithm for the model and the machine learning cross validation yourself, because the data is large. Start by writing an algorithm to summarize the data, so that you know what you are looking at. Then, once you decide what model you want to run, you will need to think about what the cross validation will be. Put a "column" into each chunk of the data set that denotes which validation set each row belongs to; you may choose to assign each chunk to a particular validation set.
Next you will need to write a map-reduce-style algorithm to run your model on the validation subsets. The alternative is simply to run models on each chunk of each validation set and average the results (consider the theoretical validity of this approach).
Consider using Spark, or R and rhdf5, or something similar. I haven't written out the modelling code because this is a project rather than just a simple coding question.