Keep variable in memory between runs with python and spyder - python

I'm trying to use Python to compute and experiment on some data from a set of files.
Parsing and processing those files can take up to 20 minutes and always produces exactly the same result; it is that result I want to experiment on.
Is there a way (programmatically or with Spyder) to compute this data only once per Python console and keep it in memory, so the script doesn't have to recompute it every time I run my code?
Am I clear? ^^'

You can use pickle. With it you can store the parsed and processed data on disk and then load it back whenever you need it.
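For example, a minimal sketch of that pattern (the cache file name and the compute_fn callable are placeholders, not from the question):

import os
import pickle

CACHE_PATH = "parsed_data.pkl"  # hypothetical cache file

def load_or_compute(compute_fn):
    # Return the cached result if present; otherwise run compute_fn once and cache it.
    if os.path.exists(CACHE_PATH):
        with open(CACHE_PATH, "rb") as f:
            return pickle.load(f)
    data = compute_fn()  # the ~20 minute parsing/computation step
    with open(CACHE_PATH, "wb") as f:
        pickle.dump(data, f)
    return data

# usage: data = load_or_compute(parse_my_files)

The first run still pays the full 20 minutes; every run after that only pays the cost of loading the pickle from disk.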

Related

Jupyter Lab freezes the computer when out of RAM - how to prevent it?

I have recently started using Jupyter Lab and my problem is that I work with quite large datasets (usually the dataset itself is approximately 1/4 of my computer's RAM). After a few transformations, saved as new Python objects, I tend to run out of memory. The issue is that when I approach the available RAM limit and perform any operation that needs more RAM, my computer freezes and the only way to fix it is to restart it. Is this the default behaviour in Jupyter Lab/Notebook, or is it some setting I should change? Normally I would expect the program to crash out (as in RStudio, for example), not the whole computer.
Absolutely the most robust solution to this problem would be to use Docker containers. You can specify how much memory to allocate to Jupyter, and if the container runs out of memory it's simply not a big deal (just remember to save frequently, but that goes without saying).
This blog post will get you most of the way there; it also has some decent instructions for setting up Jupyter Lab from one of the freely available, officially maintained Jupyter images:
https://medium.com/fundbox-engineering/overview-d3759e83969c
You can then modify the docker run command described in the tutorial to cap the container's memory (e.g. for 3 GB):
docker run --memory 3g <other docker run args from tutorial here>
For syntax on the docker memory options, see this question:
What unit does the docker run "--memory" option expect?
If you are using a Linux-based OS, check out OOM killers; you can get information about them here. I don't know the details for Windows.
You can use earlyoom. It can be configured as you wish; e.g. earlyoom -s 90 -m 15 will start earlyoom, and when free swap drops below 90% and free memory below 15%, it will kill the process causing the out-of-memory condition and prevent the whole system from freezing. You can also configure the priority of the processes.
I also work with very large datasets (3 GB) in Jupyter Lab and have been experiencing the same issue.
It's unclear whether you need to keep access to the pre-transformed data; if not, I've started using del on large dataframe variables I no longer need. del removes the reference from your namespace so the memory can be reclaimed. Edit**: there are multiple possibilities for the issue I'm encountering. I hit it more often when I'm using a remote Jupyter instance, and also in Spyder when I'm performing large transformations.
e.g.
import pandas as pd

df = pd.read_csv('some_giant_file.csv')  # or whatever your import is
new_df = my_transform(df)
del df  # if the original is no longer needed
Jakes, you may also find this thread on large data workflows helpful. I've been looking into Dask to help with handling data that doesn't fit in memory.
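As a rough illustration of the Dask approach (a sketch only; the file and column names are placeholders): a dask.dataframe is partitioned and evaluated lazily, so the full dataset never has to sit in RAM at once.

import dask.dataframe as dd

# Lazily reads the CSV in partitions instead of loading it all at once
ddf = dd.read_csv('some_giant_file.csv')

# Operations build a task graph; nothing is computed until .compute()
result = ddf.groupby('some_column')['some_value'].mean().compute()
print(result)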
I've noticed in Spyder and Jupyter that the freeze-up usually happens when working in another console while a console with heavy memory use is running. As to why it just freezes instead of crashing out, I think this has something to do with the kernel. There are a couple of memory issues open on the IPython GitHub; #10082 and #10117 seem most relevant. One user here suggests disabling tab completion in jedi or updating jedi.
In #10117 they propose checking the output of get_ipython().history_manager.db_log_output. I have the same issues and my setting is correct, but it's worth checking.
You can also use notebooks in the cloud, such as Google Colab. It comes with a reasonable amount of RAM, and support for Jupyter notebooks is there by default.
I am going to summarize the answers from the following question.
You can limit the memory usage of your program. In the example below, the memory-hungry work happens in ram_intense_foo(); before calling it, you need to call memory_limit(percent_of_free), e.g. memory_limit(95).
import resource
import sys

import numpy as np


def memory_limit(percent_of_free):
    # Cap this process's address space to a percentage of the currently available memory.
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS,
                       (int(get_memory() * 1024 * percent_of_free / 100), hard))


def get_memory():
    # Return the available memory in kB (Linux only; reads /proc/meminfo).
    free_memory = 0
    with open('/proc/meminfo', 'r') as mem:
        for line in mem:
            sline = line.split()
            if sline[0] == 'MemAvailable:':
                free_memory = int(sline[1])
                break
    return free_memory


def ram_intense_foo(a, b):
    A = np.random.rand(a, b)
    return A.T @ A


if __name__ == '__main__':
    memory_limit(95)
    try:
        temp = ram_intense_foo(4000, 10000)
        print(temp.shape)
    except MemoryError:
        sys.stderr.write('\n\nERROR: Memory Exception\n')
        sys.exit(1)
There is no reason to view the entire output of a large dataframe. Viewing or manipulating large dataframes unnecessarily consumes large amounts of your computer's resources.
Whatever you are doing can be done in miniature. It's far easier to write and debug code for manipulating data when the data frame is small. The best way to work with big data is to create a new data frame that takes only a small portion, or a small sample, of the large data frame. Then you can explore the data and develop your code on the smaller data frame. Once you have explored the data and got your code working, just use that code on the larger data frame.
The easiest way is simply to take the first n rows of the data frame using the head() function, which returns only the first n rows. You can create a mini data frame by using head() on the large data frame. Below I chose to select the first 50 rows and assign them to small_df. This assumes BigData is a data file that comes from a library you opened for this project.
library(namedPackage)
df <- data.frame(BigData) # Assign big data to df
small_df <- head(df, 50) # Assign the first 50 rows to small_df
This will work most of the time, but sometimes the big data frame comes with pre-sorted variables or with variables already grouped. If the big data is like that, you need to take a random sample of its rows instead, using the code that follows:
df <- data.frame(BigData)
set.seed(1016) # set your own seed
df_small <- df[sample(nrow(df),replace=F,size=.03*nrow(df)),] # samples 3% rows
df_small # much smaller df
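For what it's worth, the same idea in pandas (a sketch; the CSV file name and the 3% fraction just mirror the R example above):

import pandas as pd

df = pd.read_csv('BigData.csv')                     # the full data set
small_df = df.head(50)                              # first 50 rows
df_small = df.sample(frac=0.03, random_state=1016)  # random 3% of the rows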

Reading multiple (CERN) ROOT files into NumPy array - using n nodes and say, 2n GPUs

I am reading many (say 1k) CERN ROOT files in a loop and storing some data in a nested NumPy array. The loop makes it a serial task, and each file takes quite some time to process. Since I am working on a deep learning model, I must create a large enough dataset, but the reading itself takes a very long time (reading 835 events takes about 21 minutes). Can anyone suggest whether it is possible to use multiple GPUs to read the data, so that less time is required? If so, how?
Adding some more details: I pushed the program to GitHub so that it can be seen (please let me know if posting a GitHub link is not allowed; in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of files with their paths. I was opening the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed gets worse as we read more and more data.
Meanwhile I tried to use the slurmpy package: https://github.com/brentp/slurmpy
With this, I can distribute the job to, say, N worker nodes. In that case, each reading program reads the file assigned to it and returns a corresponding list. It is just that in the end I need to combine the lists, and I couldn't figure out a way to do this.
Any help is highly appreciated.
Regards,
Kolahal
You're looping over all the events sequentially from Python; that's probably the bottleneck.
You can look into root_numpy to load the data you need from the root file into numpy arrays:
root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
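For instance, a minimal sketch (the file name, the 'events' tree name, and the branch names here are placeholders, not taken from the question):

from root_numpy import root2array

# Read selected branches of a TTree straight into a structured NumPy array
arr = root2array('myfile.root', treename='events', branches=['x', 'y', 'z'])
print(arr.shape, arr.dtype.names)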
I'm also currently looking at root_pandas which seems similar.
While this solution does not precisely answer the request for parallelization, it may make parallelization unnecessary. And if it is still too slow, it can still be run in parallel using slurm or something else.

How does one deal with refresh delay when calling python from VBA using .txt as input and output?

I am using VBA macros to automate several data processing steps in Excel, such as data reduction and visualization. But since Excel has no appropriate fit for my purposes, I use a Python script built around SciPy's least-squares cubic B-spline function. The input and output are handled via .txt files, since I adapted the script from a manual script I got from a friend.
VBA calls Python
Call common.callLSQCBSpline(targetrng, ThisWorkbook) ' calls Python, which works
Call common.callLoadFitTxt(targetrng, ThisWorkbook)  ' loads Python's output
Now the funny business:
This works in debug mode but does not when running at full speed. The problem is that I have to wait for the directory the .txt is written to to refresh, so that the current output file is loaded rather than the previous one. My current workaround looks like this:
Call common.callLSQCBSpline(targetrng, ThisWorkbook)
Application.Wait (Now + 0.00005)
Call common.callLoadFitTxt(targetrng, ThisWorkbook)
This is slow and annoying but works. Is there a way to speed this up? The Python script works fine and writes the output.txt file properly; VBA just needs a second or two before it can load it. The .txt files are very small, under 1 kB.
Thanks in advance!

Debugging a python script which first needs to read large files. Do I have to load them every time anew?

I have a Python script which starts by reading a few large files and then does something else. Since I want to run this script multiple times and change some of the code until I am happy with the result, it would be nice if the script did not have to read the files anew every time, because they will not change. So I mainly want to use this for debugging.
It happens too often that I run scripts with bugs in them, but I only see the error message after minutes, because the reading took so long.
Are there any tricks to do something like this?
(If it is feasible, I create smaller test files)
I'm not good at Python, but it seems to be able to dynamically reload code from a changed module: How to re import an updated package while in Python Interpreter?
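As a small illustration of that approach (a sketch; mymodule and its functions are placeholders for your own code): keep the slow file reading in the interactive session and reload only the module you are editing.

import importlib

import mymodule                     # the module containing the code you keep changing

data = mymodule.read_large_files()  # slow step, done once per session

# ... edit mymodule.py, then:
importlib.reload(mymodule)          # pick up the changes without re-reading the files
mymodule.experiment(data)           # re-run the fast part on the data kept in memory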
Some other suggestions not directly related to Python.
Firstly, try to create a smaller test file. Is the whole file required to reproduce the bug you are observing? Most probably only a small part of your input file is relevant.
Secondly, are these particular files required, or would the problem show up with any large amount of data? If it shows up only with particular files, then most probably it is related to some feature of those files and will also show up with a smaller file sharing that feature. If the main reason is just the amount of data, you might be able to avoid reading it by generating some random data directly in the script.
Thirdly, what is the bottleneck when reading the file? Is it just hard drive performance, or do you do some heavy processing of the read data before actually reaching the part that causes problems? In the latter case, you might be able to do that processing once, write the results to a new file, and then modify your script to load the processed data instead of redoing the processing each time (a sketch of that caching pattern follows after these suggestions).
If the hard drive performance is the issue, consider a faster filesystem. On Linux, for example, you might be able to use /dev/shm.
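A rough sketch of the caching idea from the third point, for numeric data (the file names and heavy_processing() are placeholders, and it assumes the result fits in a NumPy array):

import os
import numpy as np

CACHE = 'processed.npy'                   # hypothetical cache file

def heavy_processing(raw):
    # placeholder for the expensive processing you only want to do once
    return raw * 2.0

if os.path.exists(CACHE):
    data = np.load(CACHE)                 # fast path: reuse the saved result
else:
    raw = np.loadtxt('big_input.txt')     # slow path: read the large input file
    data = heavy_processing(raw)
    np.save(CACHE, data)                  # cache it for the next run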

Accessing data stored in the memory on a running Python program

I have a Python program that collects data. I have tested it many times before, but today it decided that it will not save the data. Also, unfortunately, I decided to run my program using pythonw.exe, so there is no terminal in which to see the errors.
I can see that it still has the data saved to the memory because it is displayed on a plot and I can still manipulate the data using my program.
I want to know if there is a way to access the data my program collected externally or some way to read it.
I know that it is unlikely I will be able to recover my data, but it is worth a shot.
(Also, I am using Python 2.7 with PyQT4 as a GUI interface.)
You should be able to attach to your running process and examine variables using http://winpdb.org/
