I'm running the following code in Google Colab (and in a Kaggle Notebook).
When running it without pdb.set_trace(), everything works fine.
However, when using pdb.set_trace() and then issuing continue or exit, it seems that the list is still stored in memory (memory consumption remains high, by roughly the size of the list).
from pdb import set_trace  # also tried ipdb, IPython.core.debugger

def ccc():
    aaa = list(range(50000000))
    set_trace()

ccc()
Any ideas?
Thanks in advance.
EDIT
This also occurs when stopping the code execution manually (i.e., KeyboardInterrupt).
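One way to probe whether a debugger frame (or the interactive shell caching the last traceback) is what keeps the list alive is to check, after the call returns, whether any huge list object is still reachable; a minimal diagnostic sketch along the lines of the code above, not a confirmed fix:

import gc
from pdb import set_trace

def ccc():
    aaa = list(range(50000000))
    set_trace()   # type 'continue' at the (Pdb) prompt

ccc()

# Diagnostic probe: if a huge list is still reachable here, something other
# than ccc()'s local scope (e.g. a cached frame or traceback) is holding it.
gc.collect()
big_lists = [o for o in gc.get_objects() if isinstance(o, list) and len(o) > 10000000]
print(len(big_lists), "large list(s) still alive")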
I'm running the following code both remotely on a linux machine via ssh, and on the same linux machine as a Jupyter notebook accessed through a browser.
import cv2
import pdf2image

def minimalFun(pdf_filepath, make_me_suffer = False):
    print("Now I start.")
    images = pdf2image.convert_from_path(pdf_filepath)
    print("Pdf read.")
    if make_me_suffer:
        cv2.namedWindow('test', 0)
    print("I finished!")

minimalFun('Test.pdf', make_me_suffer = True)
I'm confused by the difference in the behaviour of the Python interpreter in Jupyter versus on the command line.
In a Jupyter notebook
With the make_me_suffer = False setting the code will just print
Now I start.
Pdf read.
I finished!
meaning in particular that the function pdf2image.convert_from_path ran successfully. However, with make_me_suffer set to True, the code prints just
Now I start.
and then reports that the kernel has died and will be restarting. In particular, the kernel apparently died already inside the pdf2image.convert_from_path call.
On the command line
As expected, with the make_me_suffer = False setting the code will just print
Now I start.
Pdf read.
I finished!
but now when the flag is set to make_me_suffer = True, we get
Now I start.
Pdf read.
: cannot connect to X server
meaning that here the function pdf2image.convert_from_path again finished successfully.
The question:
Does the Jupyter interpreter 'look ahead' to see whether a later command will require an X window system, and alter the interpretation of the current code based on that information? If so, why? Is this common? Does it happen with functions loaded from other files? What is going on?
The reason I'm asking is that this took me a lot of time to troubleshoot and pinpoint in a more complex function. It concerns me because I have no idea how to avoid this in the future, other than developing a phobia of anything graphical from now on.
Does the Jupyter interpreter 'look ahead' to see whether a later command will require an X window system, and alter the interpretation of the current code based on that information?
No, it does not.
As you know, you can run cells in any arbitrary order or modify them after you've run them once. This makes notebooks very brittle unless used properly.
You could, however, move your common code (e.g. anything that initializes a window you know you'll need) into a regular .py module in the notebook directory and import and use it from there.
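A minimal sketch of that suggestion, with a hypothetical window_utils.py placed next to the notebook (note that moving the call into a module does not remove the need for a reachable X display):

# window_utils.py -- hypothetical helper module kept next to the notebook
import cv2

def open_preview_window(name='test'):
    # Same call as in the question; it still needs an X display to succeed.
    # Keeping it in one module just makes the notebook cells less brittle.
    cv2.namedWindow(name, 0)

# In a notebook cell:
# from window_utils import open_preview_window
# open_preview_window()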
I have defined a Python function (in a .py file) that fits some scientific data, in an iterative way, for a few dozen files. Now I am trying to import this function in a Jupyter notebook, to use it as part of another script that processes the obtained data. It is basically something like:
from python_file import defined_function
filename = 'name of the file'
results = defined_function(filename)
This script naturally takes a few minutes to finish on my machine. However, before it finishes I get an error message related to the time limit:
RuntimeError: Execution exceeded time limit, max runtime is 30s
How do I change this time limit in my notebook? If it helps, I'm using IPython version 6.1.0.
Thanks
Overriding NotebookApp.iopub_data_rate_limit = 10000000 in jupyter_notebook_config.py does the trick. Please note that before you will even see a file named jupyter_notebook_config.py and can proceed with this fix, you must first run jupyter notebook --generate-config (for Linux users).
If overriding this in the config file doesn't work for you (i.e. you get the same error regardless of what you set NotebookApp.iopub_data_rate_limit to), the config file may not be in the correct place. In that case, try putting the NotebookApp.iopub_data_rate_limit setting in ~/.jupyter/jupyter_notebook_config.py.
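For reference, lines in the generated config file go through the c config object that Jupyter exposes there; a minimal sketch of the relevant line in ~/.jupyter/jupyter_notebook_config.py:

# ~/.jupyter/jupyter_notebook_config.py
# (generate the file first with: jupyter notebook --generate-config)
c.NotebookApp.iopub_data_rate_limit = 10000000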
We have been running a script on my partner's computer for 18 hours. We underestimated how long it would take, and now need to turn in the results. Is it possible to stop the script from running, but still have access to all the lists we are building?
We need to add additional code to the one we are currently running that will use the lists being populated right now. Is there a way to stop the process, but still use (what has been generated of) the lists in the next portion of code?
My partner was using python interactively.
Update
We were able to successfully print the results and copy and paste them after interrupting the program with Ctrl-C.
Well, the OP doesn't seem to need an answer anymore, but I'll answer anyway for anyone else coming across this.
While it is true that stopping the program will delete all data from memory, you can still save it: you can inject a debug session and save whatever you need before you kill the process.
Both PyCharm and PyDev support attaching their debugger to a running python application.
See here for an explanation how it works in PyCharm.
Once you've attached the debugger, you can set a breakpoint in your code and the program will stop when it hits that line the next time. Then you can inspect all variables and run some code via the 'Evaluate' feature. This code may save whatever variable you need.
I've tested this with PyCharm 2018.1.1 Community Edition and Python 3.6.4.
In order to do so, I ran this code, which I saved as test.py,
import collections
import time

data = collections.deque(maxlen=100)
i = 0
while True:
    data.append(i % 1000)
    i += 1
    time.sleep(0.001)
via the command python3 test.py from an external Windows PowerShell instance.
Then I opened that file in PyCharm and attached the debugger. I set a breakpoint at the line i += 1 and it halted right there. Then I evaluated the following code fragment:
import json

with open('data.json', 'w') as ofile:
    json.dump(list(data), ofile)
And I found all entries from data in the JSON file data.json.
Follow-up:
This even works in an interactive session! I ran the very same code in a jupyter notebook cell and then attached the debugger to the kernel. Still having test.py open, I set the breakpoint again on the same line as before and the kernel halted. Then I could see all variables from the interactive notebook session.
I don't think so. Stopping the program should also release all of the memory it was using.
edit: See Swenzel's comment for one way of doing it.
I am encountering a strange error that occurs once every few days. I have several virtual machines running on Google Cloud running a Python script. The Python file is very large, but the part that gets stuck is the following:
try:
    f = urlopen('https://resources.lendingclub.com/SecondaryMarketAllNotes.csv')
    df = pd.read_csv(f)
except:
    print('error')
The first line of code always works, but the second line will occasionally hang the program. By that I mean that the program does not continue execution, but it also does not throw any kind of error. I have a logger running in my code in debug mode, and it does not record anything.
Again, this happens very rarely, but when it does happen my virtual machines will stop. When I look at the processes in top, I see Python running with 0% CPU, and there is still plenty of system memory available. It will continue to sit there for hours without moving on to the next line of code or returning an error.
My application is very time-sensitive, and using urlopen is faster than having pd.read_csv open the URL directly.
I notice that when this rare error occurs, it happens at the same time on all of my virtual machines, which suggests that something about the file being downloaded is triggering this issue. Why it doesn't cause an error is beyond me.
I would greatly appreciate any ideas on what might be causing this and what workarounds might be available.
I am using Python 3.5.3 and pandas 0.19.2
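One thing worth probing, sketched below under the assumption that the hang is a stalled socket read, is the timeout parameter of urllib.request.urlopen, which makes blocking socket operations raise an exception instead of hanging indefinitely (the 60-second value is arbitrary):

from urllib.request import urlopen
import pandas as pd

URL = 'https://resources.lendingclub.com/SecondaryMarketAllNotes.csv'

try:
    # The timeout applies to the underlying blocking socket operations, so a
    # stalled download raises (socket.timeout / URLError) instead of hanging.
    f = urlopen(URL, timeout=60)
    df = pd.read_csv(f)
except Exception as e:
    print('error:', e)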
I tried to change the device used in a Theano-based program.
from theano import config
config.device = "gpu1"
However, I got this error:
Exception: Can't change the value of this config parameter after initialization!
I wonder what the best way is to change the device from gpu to gpu1 in code?
Thanks
Another possibility which worked for me was setting the environment variable in the process, before importing theano:
import os
os.environ['THEANO_FLAGS'] = "device=gpu1"
import theano
There is no way to change this value in code running in the same process. The best you could do is to have a "parent" process that alters, for example, the THEANO_FLAGS environment variable and spawns children. However, the method of spawning will determine which environment the children operate in.
Note also that there is no way to do this in a way that maintains a process's memory through the change. You can't start running on CPU, do some work with values stored in memory, then switch to running on GPU and continue using the values still in memory from the earlier (CPU) stage of work. The process must be shut down and restarted for a change of device to take effect.
As soon as you import theano the device is fixed and cannot be changed within the process that did the import.
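A minimal sketch of the parent/child approach described above, assuming a hypothetical worker.py that imports theano and does the actual work:

import os
import subprocess

# Launch one child per device; each child sees its own THEANO_FLAGS before
# it imports theano, so the device is fixed correctly at import time.
for device in ("gpu0", "gpu1"):
    env = os.environ.copy()
    env["THEANO_FLAGS"] = "device=" + device
    subprocess.Popen(["python", "worker.py"], env=env)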
Remove the "device" config in .theanorc, then in your code:
import theano.sandbox.cuda
theano.sandbox.cuda.use("gpu0")
It works for me.
https://groups.google.com/forum/#!msg/theano-users/woPgxXCEMB4/l654PPpd5joJ