Python script crashes when executed from server

I am executing a Python script locally through PHP's exec function and everything works fine.
Now it's time to move the project to the server, and I am facing a problem when it comes to executing the same Python script.
In both cases I have the same version of Python (2.7.3) and all the required libs are installed.
I have spotted where the issue occurs but I cannot figure out why.
It's this line in my Python script:
import networkx as nx
CG1_nodes=nx.connected_components(G1)[:]
It runs successfully locally but it crashes on the server. I have found out that if I remove the:
[:]
then it works. I have also checked the content of G1 and it's populated.
Any idea what I am missing here?

You're consuming a generator. It's possible that it has billions of items. If that's the case, Python may be running out of resources. Make sure you're not overloading the system by checking how big the resulting list will be.
I'd also take a look at problems with slicing in the libraries used by networkx (NumPy? SciPy?). To avoid slicing, perhaps try:
CG1_nodes=list(nx.connected_components(G1))

You should check whether you have the same version of networkx in both cases.
In older networkx versions nx.connected_components(G1) returned a list. In newer versions (1.9.1) it is a generator. If X is a generator, then X[:] doesn't work, but if X is a list it does. So if your machine and the server have different versions, the slice would be allowed in one case but not the other.
You "fixed" this by removing the [:], so CG1_nodes is now a generator rather than a list. As long as the rest of your code uses it in a way that is consistent with a generator, the results would (probably) be the same, so both versions of the code would work. Obviously making it a list explicitly would work too, but might be memory intensive.
More details are documented here. In particular note:
To recover the earlier behavior, use list(connected_components(G)).
I believe the previous version returned the list sorted by decreasing component size. The new version is not sorted. If you need it sorted you'll need to do a bit more:
sorted(list(nx.connected_components(G)), key = len, reverse=True)
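As a small sketch, the following works under either networkx version (assuming G1 is already built; the explicit sort only matters if later code relies on the old size ordering):

import networkx as nx

# A list in older networkx releases, a generator from 1.9 onwards.
components = nx.connected_components(G1)

# list() accepts either, unlike slicing with [:], which fails on a generator.
CG1_nodes = list(components)

# Older releases returned the components sorted by decreasing size;
# restore that ordering explicitly if the rest of the code depends on it.
CG1_nodes.sort(key=len, reverse=True)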

Related

ctypes.c_int.from_address does not work in RStudio

I am trying to count references to an object in Python from within RStudio. I use the following function:
ctypes.c_int.from_address(id(an_object)).value
This works perfectly in PyCharm and Jupyter, as shown below:
and the result in RStudio is:
The question is: why is the result not correct in RStudio, and how can I fix it?
I also tried to use the
sys.getrefcount
function, but it does not work in RStudio either.
I also did it without using the id function, as shown below:
But the result in RStudio is still not correct. It may occasionally happen in PyCharm too (I did not see it, so perhaps there is no guarantee), but in RStudio something is completely wrong.
Why is this important, and why do I care about it?
Consider the following example:
Sometimes it is important to know about "a" before changing "b".
The big problem in RStudio is that the result increases randomly. In PyCharm and other Python tools I did not see that happen.
I am not an expert on this, so please correct me if I am wrong.
The problem seems to be just that RStudio creates a lot of extra references to the objects you create. With the same Python version you would otherwise see no divergence there.
That said, you are definitely taking the wrong approach here: either you alias some data structure expecting it to be mutable "under your feet", or you should refrain from making changes at all to data that "may" be in use in other parts of your code (notebook). That is the reasoning behind pandas, for example, changing all default DataFrame operations from "inplace" to returning a new copy, despite the memory cost of doing that.
Other than that, as stated in the comments, the reason this is so hidden in Python (to the point that you have to use ctypes; you could also use the higher-level gc module, actually) is that it should not matter.
There are other things you could do, like creating objects that use locks so that they don't allow changes while they are in use in some other critical place; that would be higher level and reproducible.
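For reference, a small sketch of inspecting references with the standard library instead of ctypes (an_object here is just a throwaway example; sys.getrefcount itself adds one temporary reference, and gc.get_referrers shows which objects are holding the extras that an IDE such as RStudio may create):

import gc
import sys

an_object = [1, 2, 3]

# Reports one more than you might expect, because passing the object to
# getrefcount creates a temporary reference of its own.
print(sys.getrefcount(an_object))

# List the objects that currently refer to an_object; in an IDE console you
# may see frames, namespaces or history machinery of the environment here.
for referrer in gc.get_referrers(an_object):
    print(type(referrer))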

How to represent imported module or function as string?

I have a program that runs through a database of objects, runs each object through a long-running function, and then takes the output of that function to update the attributes of said object. Of course, if the function is going to return the same result (because the function's code has not changed), then we don't need to run it through the function again.
One way to solve this problem is to give the function a version number and check it against a version number stored in the object, for example object.lastchecked=='VER2'. But this requires manually updating versions and remembering to do so.
Another, more creative way to do this would be to convert that whole function, or even the imported module/library, into a string, hash the string, and then use that as an "automatic" version number of sorts. So if object.lastchecked!=hash(functionconvertedtostring), we would run function(object).
What I can't figure out is how to convert an existing Python module from an import line into a string. I could hardcode the path of this module and do a file read, but it's already been read once into memory. How do I get access to it?
Example:
from hugelibrary import littlefunction

for id, object in hugeobjectdb.items():
    if object.lastchecked != hash(littlefunction):
        object.attributes = littlefunction(object)
There are at least two different ways of doing this.
First, if you use some kind of version control for your code, you might be able to store the module's version information in the module file, in an object that is accessible to Python and updated automatically. Or you can use the version control tools to get the current version.
It might be hard to do that at the function level, though, rather than the file level. If a single function has this requirement, you might consider moving it to a separate file.
Probably preferable would be to get the text of the function. For that, this answer seems to do the trick: Can Python print a function definition?
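A minimal sketch of that idea, using inspect.getsource to get the function's text and hashlib for a stable digest (note that the built-in hash(littlefunction) hashes the function object, not its code, so it is not a reliable version stamp across runs):

import hashlib
import inspect

from hugelibrary import littlefunction

# Changes only when the source code of littlefunction itself changes.
source = inspect.getsource(littlefunction)
version = hashlib.sha1(source.encode('utf-8')).hexdigest()

for obj_id, obj in hugeobjectdb.items():
    if obj.lastchecked != version:
        obj.attributes = littlefunction(obj)
        obj.lastchecked = version

Note that this only covers the text of littlefunction itself; anything it calls would need the same treatment.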

Logging progress of `enumerate` function in Python

I am working on optimizing a Python (version 2.7; constrained by old modules I need to use) script that runs on AWS, and I am trying to identify exactly how many resources I need in the environment I am building. In order to do so, I am logging quite a bit of information to the console when the script runs and benchmarking different resource configurations.
One of the bottlenecks is a list with 32,000 items that runs through the Python enumerate function. This takes quite a while to run and my script is blind to its progress until it finishes. I assume enumerate is looping through the items in some fashion, so I thought I could inject logging or printing within that loop. I know this is 'hacky' but it will work fine for me for now since it is temporary while I run tests.
I can't find where the function runs from (I did find an Enumerate class in the numba module; I tried printing in there and it did not work). I know it is part of the __builtin__ module and am having trouble tracking that down as well; I tried several techniques to find its exact location, such as printing __builtin__.__file__, but to no avail.
My question is: a) can you help me identify where the function lives, and is this a good approach? Or b) is there a better approach for this?
Thanks for your help!
I suggest you use the tqdm module.
The tqdm module is used to visualize the progress of an iteration. It works with iterables whether their length is known or unknown.
By extending the tqdm class you can log the required information after any given number of iterations. You can find more information in the tqdm class docstring.
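A minimal sketch, assuming items is the 32,000-item list and process stands in for whatever the loop body actually does:

from tqdm import tqdm

# tqdm wraps any iterable; passing total lets it show an accurate progress bar
# even though enumerate itself reports no length.
for index, item in tqdm(enumerate(items), total=len(items)):
    process(item)  # placeholder for the real per-item work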

Equivalent to python's -R option that affects the hash of ints

We have a large collection of python code that takes some input and produces some output.
We would like to guarantee that, given the identical input, we produce identical output regardless of python version or local environment. (e.g. whether the code is run on Windows, Mac, or Linux, in 32-bit or 64-bit)
We have been enforcing this in an automated test suite by running our program both with and without the -R option to Python and comparing the output, assuming that would shake out any spots where our output accidentally wound up dependent on iteration over a dict (the most common source of non-determinism in our code).
However, as we recently adjusted our code to also support Python 3, we discovered a place where our output depended in part on iteration over a dict that used ints as keys. This iteration order changed in Python 3 as compared to Python 2 and was making our output different. Our existing tests (all on Python 2.7) didn't notice this, because -R doesn't affect the hash of ints. Once found, it was easy to fix, but we would like to have found it earlier.
Is there any way to further stress-test our code and give us confidence that we've ferreted out all places where we end up implicitly depending on something that will possibly be different across python versions/environments? I think that something like -R or PYTHONHASHSEED that applied to numbers as well as to str, bytes, and datetime objects could work, but I'm open to other approaches. I would however like our automated test machine to need only a single python version installed, if possible.
Another acceptable alternative would be some way to run our code with pypy tweaked so as to use a different order when iterating items out of a dict; I think our code runs on pypy, though it's not something we've ever explicitly supported. However, if some pypy expert gives us a way to tweak dictionary iteration order on different runs, it's something we'll work towards.
Using PyPy is not the best choice here, given that it always retains insertion order in its dicts (with a method that also makes dicts use less memory). We could of course make it change the order in which dicts are enumerated, but that defeats the point.
Instead, I'd suggest hacking the CPython source code to change the way the hash is used inside dictobject.c. For example, after each hash = PyObject_Hash(key); if (hash == -1) { ..error.. } you could add hash ^= HASH_TWEAK; and compile different versions of CPython with different values for HASH_TWEAK. (I did such a thing at one point, but I can't find it any more. You need to be a bit careful about where the hash values are the original ones and where they are the modified ones.)
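For the comparison itself, a rough sketch of a harness in the spirit of the existing -R test, assuming run.py is the program's entry point and the listed interpreter paths are builds compiled with different HASH_TWEAK values (all paths and filenames here are placeholders):

import subprocess

# Placeholder paths: CPython builds compiled with different HASH_TWEAK values.
interpreters = ['/opt/cpython-tweak-a/bin/python', '/opt/cpython-tweak-b/bin/python']

outputs = set()
for exe in interpreters:
    outputs.add(subprocess.check_output([exe, 'run.py', 'input.dat']))

if len(outputs) != 1:
    raise SystemExit('Output differs between builds: hidden ordering dependence?')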

How to tell whether a Python function with dependencies has changed?

tl;dr:
How can I cache the results of a Python function to disk and in a later session use the cached value if and only if the function code and all of its dependencies are unchanged since I last ran it?
In other words, I want to make a Python caching system that automatically watches out for changed code.
Background
I am trying to build a tool for automatic memoization of computational results from Python. I want the memoization to persist between Python sessions (i.e. be reusable at a later time in another Python instance, preferably even on another machine with the same Python version).
Assume I have a Python module mymodule with some function mymodule.func(). Let's say I already solved the problem of serializing/identifying the function arguments, so we can assume that mymodule.func() takes no arguments if it simplifies anything.
Also assume that I guarantee that the function mymodule.func() and all its dependencies are deterministic, so mymodule.func() == mymodule.func().
The task
I want to run the function mymodule.func() today and save its results (and any other information necessary to solve this task). When I want the same result at a later time, I would like to load the cached result instead of running mymodule.func() again, but only if the code in mymodule.func() and its dependencies are unchanged.
To simplify things, we can assume that the function is always run in a freshly started Python interpreter with a minimal script like this:
import some_save_function
import mymodule
result = mymodule.func()
some_save_function(result, 'filename')
Also, note that I don't want to be overly conservative. It is probably not too hard to use the modulefinder module to find all modules involved when running the first time, and then not use the cache if any module has changed at all. But this defeats my purpose, because in my use case it is very likely that some unrelated function in an imported module has changed.
Previous work and tools I have looked at
joblib memoizes results tied to the function name, and also saves the source code so we can check if it is unchanged. However, as far as I understand it does not check upstream functions (called by mymodule.func()).
The ast module gives me the Abstract Syntax Tree of any Python code, so I guess I can (in principle) figure it all out that way. How hard would this be? I am not very familiar with the AST.
Can I use any of the black magic that's going on inside dill?
More trivia than a solution: IncPy, a finished/deceased research project, implemented a Python interpreter doing this by default, always. Nice idea, but never made it outside the lab.
Grateful for any input!
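A rough sketch of the ast-based idea, under strong simplifying assumptions (it only follows plain-name calls that resolve to module-level functions in the same module as the target, and ignores methods, closures, and functions imported from elsewhere; code_fingerprint is a hypothetical name):

import ast
import hashlib
import inspect

def code_fingerprint(func, _seen=None):
    # Hash the function's source plus, recursively, the source of module-level
    # functions it calls by plain name. A sketch, not a complete dependency walk.
    if _seen is None:
        _seen = set()
    if func in _seen:
        return ''
    _seen.add(func)

    source = inspect.getsource(func)
    parts = [source]
    module = inspect.getmodule(func)

    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            callee = getattr(module, node.func.id, None)
            if inspect.isfunction(callee):
                parts.append(code_fingerprint(callee, _seen))

    return hashlib.sha1(''.join(parts).encode('utf-8')).hexdigest()

A cached result for mymodule.func would then be stored alongside code_fingerprint(mymodule.func) and reused only when the fingerprint matches.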
