Need something similar to biglm but for mixed effect models - python

I am currently working with a dataset of approximately 4 million data points. I am using R in RStudio on a MacBook Pro (32 GB RAM, 2.2 GHz Intel Core i7, 155 GB available on the hard drive).
My goal is to perform a non-linear mixed-effects regression on the data. The data has two nested random effects, and the model needs varying slopes as well as intercepts.
My code for this model is:
model <- lmer(DV ~ I(IVr^2) + IV + (IV | group/episode), data = data, REML=FALSE)
However, when running the model, the process uses up ~42 GB of RAM before crashing. The error is:
Vector memory exhausted (limit reached?)
I want to configure R in some way so that it runs slower but can use my available hard drive space as extra memory, so that it can handle this run. The closest solution I have found is the biglm package in R; however, I cannot find an equivalent for lmer(). I also can't find much on manipulating swap space on a Mac. Any solutions to the problem are welcome.
Alternatively, I speak Python, so a solution using Python instead would be great (i.e. a module which can handle polynomial mixed-effects models and a module to sort out my memory issue).
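On the Python side, statsmodels fits linear mixed-effects models (including polynomial fixed effects) through its MixedLM formula interface. A minimal sketch, assuming column names DV, IV, group and episode mirroring the lmer call above, with the nested episode-within-group term expressed as a variance component; note that statsmodels keeps everything in RAM, so this alone does not solve the memory problem:
import pandas as pd
import statsmodels.formula.api as smf
# hypothetical file and column names mirroring the lmer formula
data = pd.read_csv("data.csv")
model = smf.mixedlm(
    "DV ~ IV + I(IV ** 2)",                    # fixed effects: IV and its square
    data,
    groups="group",                             # top-level random effect
    re_formula="~IV",                           # random intercept and IV slope per group
    vc_formula={"episode": "0 + C(episode)"},   # nested random intercepts for episode within group
)
fit = model.fit(reml=False)
print(fit.summary())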

Related

Efficiently executing complex Python code in R

I'm looking for a fast and flexible way to execute Python code from R.
The problem is that I want to execute a particular method from topological data analysis, called persistent homology, which you don't really need to know in detail to understand the problem I'm dealing with.
The thing is that computing persistent homology is not the easiest task when working with large data sets, as it requires both a lot of memory and computational time.
Executing the method implemented in the R package TDA is, however, very inconvenient: it crashes starting at around 800 data points and doesn't really allow me to use approximation algorithms.
In contrast, the Python package Ripser lets me do the computation on a few thousand points with ease. Furthermore, it allows one to provide a parameter that approximates the result for even larger data sets, and it can also store the output I want. In summary, it is much more convenient to compute persistent homology with this package.
However, since everything else I'm working on is in R, it would also be a lot more convenient for me to execute the code from the Ripser package in R as well. An illustrative example is as follows:
# Import R libraries
library("reticulate") # Conduct Python code in R
library("ggplot2") # plotting
# Import Python packages
ripser <- import("ripser") # Python package for computing persistent homology
persim <- import("persim") # Python package for many tools used in analyzing Persistence Diagrams
# Generate data on circle
set.seed(42)
npoints <- 200
theta <- 2 * pi * runif(npoints)
noise <- cbind(rnorm(npoints, sd=0.1), rnorm(npoints, sd=0.1))
X <- data.frame(x=cos(theta) + noise[,1], y=sin(theta) + noise[,2])
# Compute persistent homology on full data
PHfull <- ripser$ripser(X, do_cocycles=TRUE) # slow if X is large
# Plot diagrams
persim$plot_diagrams(PHfull$dgms) # crashes
Now I have two problems when using this code:
The persistent homology computation performed by the ripser function works perfectly fine in this example. However, when I increase the number of data points in X to, say, npoints ~ 2000, the computation takes ages compared to around 30 seconds when I perform it directly in Python. I don't really know what's happening behind the scenes that causes this huge difference in computational time. Is it that, instead of converting my arguments to the corresponding Python types and executing the code in Python, the Python code is somehow converted to R code? I am looking for an approach that combines this flexibility and efficient type conversion with the speed of Python.
In Python, the analog to the last line would plot an image based on the result of my persistent homology computation. However, executing this line causes R to crash. Is there a way to show in R the image that Python would normally produce?
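For reference, the pure-Python analogue of the two reticulate calls above looks roughly like this (a sketch, assuming the ripser and persim packages are installed):
import numpy as np
import matplotlib.pyplot as plt
from ripser import ripser
from persim import plot_diagrams
# same noisy circle as in the R snippet
rng = np.random.default_rng(42)
npoints = 200
theta = 2 * np.pi * rng.uniform(size=npoints)
X = np.column_stack([np.cos(theta), np.sin(theta)]) + rng.normal(scale=0.1, size=(npoints, 2))
# compute persistent homology and plot the diagrams
result = ripser(X, do_cocycles=True)
plot_diagrams(result["dgms"])
plt.show()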

Is it feasible to run a Support Vector Machine Kernel on a device with <= 1 MB RAM and <= 10 MB ROM?

Some preliminary testing shows that a project I'm working on could potentially benefit from using a support vector machine to solve a tricky problem. The concern I have is that there will be major memory constraints. Prototyping and testing are being done in Python with scikit-learn. The final version will be custom-written in C. The model would be pre-trained, and only the decision function would be stored on the final product. There would be <= 10 training features and <= 5000 training data points. I've been reading mixed things regarding SVM memory, and I know the default sklearn kernel cache is 200 MB (much larger than what I have available). Is this feasible? I know there are multiple types of SVM kernel and that kernels can also be custom-written. What kernel types could this potentially work with, if any?
If you're that strapped for space, you'll probably want to skip scikit and simply implement the math yourself. That way, you can cycle through the data in structures of your own choosing. Memory requirements depend on the class of SVM you're using; a two-class linear SVM can be done with a single pass through the data, considering only one observation at a time as you accumulate sum-of-products, so your command logic would take far more space than the data requirements.
If you need to keep the entire data set in memory for multiple passes, that's "only" 5000*10*8 bytes for floats, or 400 KB of your 1 MB, which might leave enough room for your manipulations. Also consider a slower training process that re-reads the data on each pass; this reduces the 400 KB to a triviality at the cost of wall-clock time.
All of this is under your control if you look up a usable SVM implementation and alter the I/O portions as needed.
Does that help?
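To make the footprint concrete: for a linear SVM the final product only needs the weight vector and the intercept, i.e. at most 10 floats plus one. A hedged sketch of extracting those from a scikit-learn prototype and evaluating the decision function by hand (the C port would just re-implement this dot product):
import numpy as np
from sklearn.svm import LinearSVC
# toy stand-in for the real training set: <= 5000 points, <= 10 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 10))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
clf = LinearSVC(C=1.0, max_iter=10000).fit(X, y)
# everything the device must store: 10 weights + 1 intercept (~88 bytes as doubles)
w = clf.coef_.ravel()
b = clf.intercept_[0]
def decide(x):
    # manual decision function, exactly what the C code would compute
    return 1 if np.dot(w, x) + b > 0 else 0
print(decide(X[0]), clf.predict(X[:1])[0])   # should agree
With a non-linear kernel (e.g. RBF), the stored model instead grows to all support vectors plus their dual coefficients, so the memory needed depends on how many support vectors training produces.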

Python/PyCharm memory and CPU allocation for faster runtime?

I am trying to run a very memory-intensive Python program which processes text with NLP methods for different classification tasks.
The program takes several days to run, so I am trying to allocate more resources to it. However, I don't really understand whether I did the right thing, because with my new allocation the Python code is not significantly faster.
Here are some information about my notebook:
I have a notebook running Windows 10 with an Intel Core i7 with 4 cores (8 logical processors) @ 2.5 GHz and 32 GB of physical memory.
What I did:
I changed some parameters in the vmoptions file, so that it looks like this now:
-Xms30g
-Xmx30g
-Xmn30g
-Xss128k
-XX:MaxPermSize=30g
-XX:ParallelGCThreads=20
-XX:ReservedCodeCacheSize=500m
-XX:+UseConcMarkSweepGC
-XX:SoftRefLRUPolicyMSPerMB=50
-ea
-Dsun.io.useCanonCaches=false
-Djava.net.preferIPv4Stack=true
-XX:+HeapDumpOnOutOfMemoryError
-XX:-OmitStackTraceInFastThrow
My problem:
However, as I said, my code is not running significantly faster. On top of that, if I open the Task Manager I can see that PyCharm uses nearly 80% of the memory but 0% of the CPU, and Python uses 20% of the CPU and 0% of the memory.
My question:
What do I need to do so that my Python code runs faster?
Is it possible that I need to allocate more CPU to PyCharm or Python?
What is the connection between the memory allocated to PyCharm and the runtime of the Python interpreter?
Thank you very much =)
You cannot increase CPU usage manually. Try one of these solutions:
Try to rewrite your algorithm to be multi-threaded; then you can use more of your CPU. Note that not all programs can profit from multiple cores: when a calculation is done in steps and the next step depends on the results of the previous one, it will not get faster with more cores. Problems that can be vectorized (applying the same calculation to large arrays of data) are relatively easy to spread over multiple cores because the individual calculations are independent (see the sketch after this list).
Use numpy. It is an extension written in C that can use optimized linear algebra libraries like ATLAS. It can speed up numerical calculations significantly compared to standard Python.
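A minimal sketch of the first suggestion using the standard-library multiprocessing module; classify_document is a placeholder for the real per-document NLP work. (In CPython, CPU-bound work usually needs multiple processes rather than threads because of the global interpreter lock.)
from multiprocessing import Pool, cpu_count
def classify_document(text):
    # placeholder for the real per-document NLP/classification work
    return len(text.split())
if __name__ == "__main__":
    documents = ["first document ...", "second document ...", "third document ..."]
    # one worker process per logical CPU; map distributes the documents across them
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(classify_document, documents)
    print(results)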
You can adjust the number of CPU cores to be used by the IDE when running the active tasks (for example, indexing header files, updating symbols, and so on) in order to keep the performance properly balanced between AppCode and other applications running on your machine.
Use this link.

OPEN CL, Python and parallelisation

As a starter in OpenCL, I have a simple conceptual question about optimizing GPU computing.
As far as I understand, I can take e.g. a 1000 x 1000 matrix and run one piece of code on each element at the same time using a GPU. What about the following options:
I have 100 different 100 x 100 matrices that each need to be calculated differently. Do I have to process them serially, or can I start 100 instances, i.e. launch 100 Python multiprocesses that each shoot a matrix calculation to the GPU (assuming there are enough resources)?
The other way round: I have one 1000 x 1000 matrix and 100 different calculations to run on it. Can I do this at the same time, or is it serial processing?
Any advice on the fastest way to solve this is appreciated.
Thanks
Adrian
The OpenCL execution model revolves around kernels, which are just functions that execute for each point in your problem domain. When you launch a kernel for execution on your OpenCL device, you define a 1, 2 or 3-dimensional index space for this domain (aka the NDRange or global work size). It's entirely up to you how you map the NDRange onto your actual problem.
For example, you could launch an NDRange that is 100x100x100, in order to process 100 sets of 100x100 matrices (assuming they are all independent). Your kernel then defines the computation for a single element of one of these matrices. Alternatively, you could launch 100 kernels, each with a 100x100 NDRange to achieve the same thing. The former is probably faster, since it avoids the overhead of launching multiple kernels.
I strongly recommend taking a look at the OpenCL specification for more information about the OpenCL execution model. Specifically, section 3.2 has a great description of the core concepts surrounding kernel execution.
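A minimal PyOpenCL sketch of the first option, a single 100x100x100 NDRange over 100 independent 100x100 matrices (assuming PyOpenCL is installed and an OpenCL device is available; the kernel body is a trivial placeholder computation):
import numpy as np
import pyopencl as cl
ctx = cl.create_some_context()
queue = cl.CommandQueue(ctx)
# 100 independent 100x100 matrices in one contiguous buffer
batch = np.random.rand(100, 100, 100).astype(np.float32)
mf = cl.mem_flags
buf = cl.Buffer(ctx, mf.READ_WRITE | mf.COPY_HOST_PTR, hostbuf=batch)
src = """
__kernel void scale(__global float *data) {
    size_t m = get_global_id(0);   // which matrix
    size_t i = get_global_id(1);   // row
    size_t j = get_global_id(2);   // column
    size_t idx = m * 100 * 100 + i * 100 + j;
    data[idx] *= 2.0f;             // placeholder per-element computation
}
"""
prg = cl.Program(ctx, src).build()
prg.scale(queue, (100, 100, 100), None, buf)   # one 3-D NDRange launch covers all matrices
cl.enqueue_copy(queue, batch, buf)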

General techniques to work with huge amounts of data on a non-super computer

I'm taking some AI classes and have learned about some basic algorithms that I want to experiment with. I have gotten access to several data sets containing lots of great real-world data through Kaggle, which hosts data analysis competitions.
I have tried entering several competitions to improve my machine learning skills, but have been unable to find a good way to access the data in my code. Kaggle provides one large data file, 50-200 MB, per competition in CSV format.
What is the best way to load and use these tables in my code? My first instinct was to use a database, so I tried loading the CSV into a single SQLite database, but this put a tremendous load on my computer, and during the commits it was common for my computer to crash. Next, I tried using a MySQL server on a shared host, but queries on it took forever, and it made my analysis code really slow. Plus, I am afraid I will exceed my bandwidth.
In my classes so far, my instructors usually clean up the data and give us manageable datasets that can be loaded completely into RAM. Obviously this is not possible for my current interests. Please suggest how I should proceed. I am currently using a four-year-old MacBook with 4 GB of RAM and a dual-core 2.1 GHz CPU.
By the way, I am hoping to do the bulk of my analysis in Python, as I know this language the best. I'd like a solution that allows me to do all or nearly all coding in this language.
Prototype--that's the most important thing when working with big data. Sensibly carve it up so that you can load it in memory to access it with an interpreter--e.g., python, R. That's the best way to create and refine your analytics process flow at scale.
In other words, trim your multi-GB-sized data files so that they are small enough to perform command-line analytics.
Here's the workflow I use to do that--surely not the best way to do it, but it is one way, and it works:
I. Use lazy-loading methods (hopefully) available in your language of choice to read in large data files, particularly those exceeding about 1 GB. I would then recommend processing this data stream according to the techniques I discuss below, and finally storing this fully pre-processed data in a Data Mart, or intermediate staging container.
One example using Python to lazily load a large data file:
# 'some_filename' is the full path of a data file whose size
# exceeds the memory of the box it resides on.
import tokenize
data_reader = open(some_filename, 'r')
tokens = tokenize.generate_tokens(data_reader.readline)
next(tokens)  # lazily pulls the next token from the large data file
II. Whiten and Recast:
Recast the columns storing categorical variables (e.g., Male/Female) as integers (e.g., -1, 1). Maintain a look-up table (the same hash you used for this conversion, except with the keys and values swapped) to convert these integers back to human-readable string labels as the last step in your analytic workflow. Whiten your data--i.e., "normalize" the columns that hold continuous data. Both of these steps will substantially reduce the size of your data set--without introducing any noise. A concomitant benefit of whitening is prevention of analytics error caused by over-weighting.
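For example, a minimal sketch of this recast-and-whiten step using pandas (hypothetical column names):
import pandas as pd
df = pd.DataFrame({"sex": ["M", "F", "F", "M"],
                   "age": [23.0, 31.0, 54.0, 41.0]})
# recast the categorical column as integers and keep the reverse look-up table
# for converting back to labels at the end of the workflow
codes = {"M": -1, "F": 1}
labels = {v: k for k, v in codes.items()}
df["sex"] = df["sex"].map(codes)
# whiten the continuous column: zero mean, unit variance
df["age"] = (df["age"] - df["age"].mean()) / df["age"].std()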
III. Sampling: Trim your data length-wise.
IV. Dimension Reduction: the orthogonal analogue to sampling. Identify the variables (columns/fields/features) that have no influence or de minimis influence on the dependent variable (a.k.a., the 'outcomes' or response variable) and eliminate them from your working data cube.
Principal Component Analysis (PCA) is a simple and reliable technique to do this:
import numpy as NP
from scipy import linalg as LA
D = NP.random.randn(8, 5)   # a simulated data set: 8 observations, 5 features
# calculate the correlation matrix of the 5 columns: #
R = NP.corrcoef(D, rowvar=False)
# calculate the eigenvalues of this (symmetric) matrix: #
egval, egvec = LA.eigh(R)
# sort them in descending order: #
egval = NP.sort(egval)[::-1]
# make a value-proportion table #
cs = NP.cumsum(egval)/NP.sum(egval)
print("{0}\t{1}".format('eigenvalue', 'var proportion'))
for i in range(len(egval)):
    print("{0:.2f}\t\t{1:.2f}".format(egval[i], cs[i]))
eigenvalue var proportion
2.22 0.44
1.81 0.81
0.67 0.94
0.23 0.99
0.06 1.00
So as you can see, the first three eigenvalues account for 94% of the variance observed in original data. Depending on your purpose, you can often trim the original data matrix, D, by removing the last two columns:
D = D[:,:-2]
V. Data Mart Storage: insert a layer between your permanent storage (Data Warehouse) and your analytics process flow. In other words, rely heavily on data marts/data cubes--a 'staging area' that sits between your Data Warehouse and your analytics app layer. This data mart is a much better IO layer for your analytics apps. R's 'data frame' or 'data table' (from the CRAN package of the same name) are good candidates. I also strongly recommend redis--blazing-fast reads, terse semantics, and zero configuration make it an excellent choice for this use case. redis will easily handle datasets of the size you mentioned in your question. Using the hash data structure in redis, for instance, you can have the same structure and the same relational flexibility as MySQL or SQLite without the tedious configuration. Another advantage: unlike SQLite, redis is in fact a database server. I am actually a big fan of SQLite, but I believe redis just works better here for the reasons I just gave.
from redis import Redis
r0 = Redis(db=0)
# store one user record as a redis hash keyed by user id
r0.hmset("user:100143321", {"sex": "M", "status": "registered_user",
                            "traffic_source": "affiliate",
                            "page_views_per_session": 17,
                            "total_purchases": 28.15})
200 megabytes isn't a particularly large file for loading into a database. You might want to try to split the input file into smaller files.
split -l 50000 your-input-filename
The split utility will split your input file into multiple files of whatever size you like. I used 50000 lines per file above. It's a common Unix and Linux command-line utility; I don't know whether it ships with Macs, though.
Local installs of PostgreSQL or even MySQL might be a better choice than SQLite for what you're doing.
If you don't want to load the data into a database, you can take subsets of it using command-line utilities like grep, awk, and sed. (Or scripting languages like python, ruby, and perl.) Pipe the subsets into your program.
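For instance, a sketch of the receiving end of such a pipe, where process is a hypothetical per-row analysis function:
import csv
import sys
# reads whatever subset was piped in, e.g.:
#   grep ',California,' train.csv | python analyze.py
def process(row):
    pass   # hypothetical per-row analysis
for row in csv.reader(sys.stdin):
    process(row)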
You need pandas for this. I think you must have found it by now, but if someone else is facing the same issue they may benefit from this answer.
You don't have to load the data into any RDBMS: keep it in the file and work with it through a simple pandas load into dataframes. Here is the link for the pandas library:
http://pandas.pydata.org/
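If the CSV still feels too big to load in one go on 4 GB of RAM, pandas can also stream it in chunks; a minimal sketch (the file name is a placeholder):
import pandas as pd
total_rows = 0
# "train.csv" stands in for the Kaggle competition file
for chunk in pd.read_csv("train.csv", chunksize=100_000):
    total_rows += len(chunk)   # replace with your per-chunk analysis
print(total_rows)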
If the data is too large, you need a cluster for it: Apache Spark or Mahout, which can run on the Amazon EC2 cloud. Buy some capacity there and it will be easy to use. Spark also has a Python API.
I loaded a 2 GB Kaggle dataset in R using H2O.
H2O is built on Java and creates a virtual environment; the data is held in memory, and you get much faster access since H2O runs on the JVM.
You will just have to get used to the syntax of H2O.
It has many well-built ML algorithms and provides good support for distributed computing.
You can also use all your CPU cores easily. Check out h2o.ai for how to use it. It can handle 200 MB easily, given that you only have 4 GB of RAM. You should upgrade to 8 GB or 16 GB.
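H2O also has a Python API; a minimal sketch (assuming the h2o package is installed; the file path is a placeholder):
import h2o
# start the local H2O JVM with a heap that fits under 4 GB of physical RAM
h2o.init(max_mem_size="3G")
frame = h2o.import_file("train.csv")   # placeholder path to the Kaggle CSV
print(frame.dim)                       # [rows, columns]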
The general technique is to "divide and conquer". If you can split your data into parts and process them separately, then it can be handled by one machine. Some tasks can be solved that way (PageRank, Naive Bayes, HMMs, etc.) and some cannot (those requiring global optimisation), such as logistic regression, CRFs, and many dimension-reduction techniques.
