Background
I am using Python 2.7.6 to parse chunks of very large files (20+ GB) in parallel with the multiprocessing module. I have the worker processes extract information from the input file and put the results in a shelved dictionary for later processing. To prevent simultaneous writes to the pseudo-database, I am using a managed lock. I have also implemented a context manager for the database access to ensure it is always closed, because the shelve module doesn't natively support context manager functionality until Python 3.4.
The Problem
I would like to measure overall run time with the Linux time command. However, when I run the script with the time command, I get a SyntaxError exception that I don't get if I run it normally. Example code:
import multiprocessing
import shelve
from contextlib import contextmanager
DB_NAME = 'temp_db'
# manually implemented context manager - not natively implemented until Python 3.4
# I could use contextlib.closing, but this method makes the "with" statements cleaner
@contextmanager
def open_db(db_name, flag='c'):
    db = shelve.open(db_name, flag=flag)
    try:
        yield db
    finally:
        db.close()
db_lock = multiprocessing.Manager().Lock()
with db_lock, open_db(DB_NAME) as db:
    db['1'] = 'test_value1'
    db['2'] = 1.5

with db_lock, open_db(DB_NAME) as db:
    for key, val in db.iteritems():
        print("{0} : {1}\n".format(key, val))
Running python test_script.py produces the expected output:
2 : 1.5
1 : test_value1
On the other hand, running time python test_script.py causes an exception:
File "test_script.py", line 21
with db_lock, open_db(DB_NAME) as db:
^
SyntaxError: invalid syntax
0.005u 0.002s 0:00.01 0.0% 0+0k 0+0io 0pf+0w
The Question
Why would the time command affect what the interpreter considers valid syntax?
Other Notes
I assume the time command is being invoked correctly because it does produce the timing information, and the presence of the exception shows that the interpreter is finding the correct script.
If I eliminate either the acquisition of the lock or the database opening, the exception disappears, so the problem appears to be caused by the comma in the with statement.
Something is causing the python executable (and therefore the Python version) to change. In particular, a with statement with multiple context managers (with a, b:) is only valid syntax in Python 2.7 and later, so an older interpreter such as 2.6 would raise exactly this SyntaxError. Try these commands and compare the results:
which python
python -V
time which python
time python -V
For the overall project, consider having each worker simply return its data to the parent, which then stores the information in a file or database. That simplifies the code because you no longer need locking: only the parent touches the store. A sketch of that layout follows.
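Here is a minimal, hedged sketch of that design, assuming a Pool of workers and the same temp_db shelve file; process_chunk and the chunk IDs are placeholders rather than anything from the original script:

import multiprocessing
import shelve

DB_NAME = 'temp_db'

def process_chunk(chunk_id):
    # ... parse one chunk of the input file here ...
    return str(chunk_id), 'result for chunk {0}'.format(chunk_id)

if __name__ == '__main__':
    pool = multiprocessing.Pool()
    try:
        results = pool.map(process_chunk, range(4))
    finally:
        pool.close()
        pool.join()

    # only the parent writes to the shelve, so no lock is needed
    db = shelve.open(DB_NAME)
    try:
        for key, value in results:
            db[key] = value
    finally:
        db.close()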
Related
I have two Python files (main.py and main_test.py). The file main_test.py is executed within main.py. When I do not use a log file this is what gets printed out:
Main file: 17:41:18
Executed file: 17:41:18
Executed file: 17:41:19
Executed file: 17:41:20
When I use a log file and execute main.py>log, then I get the following:
Executed file: 17:41:18
Executed file: 17:41:19
Executed file: 17:41:20
Main file: 17:41:18
Also, when I use python3 main.py | tee log to print and log the output at the same time, it waits and only prints everything after the script finishes. In addition, the reversed ordering problem remains.
Questions
How can I fix the reversed print out?
How can I print out results simultaneously in terminal and log them in a correct order?
Python files for replication
main.py
import os
import time
import datetime
import pytz
python_file_name = 'main_test'+'.py'
time_zone = pytz.timezone('US/Eastern') # Eastern-Time-Zone
curr_time = datetime.datetime.now().replace(microsecond=0).astimezone(time_zone).time()
print(f'Main file: {curr_time}')
cwd = os.path.join(os.getcwd(), python_file_name)
os.system(f'python3 {cwd}')
main_test.py
import pytz
import datetime
import time
time_zone = pytz.timezone('US/Eastern') # Eastern-Time-Zone
for i in range(3):
    curr_time = datetime.datetime.now().replace(microsecond=0).astimezone(time_zone).time()
    print(f'Executed file: {curr_time}')
    time.sleep(1)
When you run a script like this:
python main.py>log
The shell redirects output from the script to a file called log. However, if the script launches other scripts in their own subshell (which is what os.system() does), the output of that does not get captured.
What is surprising about your example is that you see any output at all when redirecting, since it should have been redirected to the file and no longer echoed to the terminal - so perhaps there's something you're leaving out here.
Also, tee waits for EOF on standard input, or for some error to occur, so the behaviour you're seeing there makes sense; it is intended behaviour.
Why bother with shells at all, though? Why not write a few functions and import the other Python module to call them (a sketch of that is shown just below)? Or, if you need things to run in parallel (which they didn't in your example), look at multiprocessing.
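For illustration, a hedged sketch of that refactor; the run() function and the import-based main.py are assumptions, not part of the original code:

# main_test.py, restructured as an importable module (hypothetical)
import datetime
import time
import pytz

time_zone = pytz.timezone('US/Eastern')

def run(iterations=3):
    for _ in range(iterations):
        curr_time = datetime.datetime.now().replace(microsecond=0).astimezone(time_zone).time()
        print(f'Executed file: {curr_time}')
        time.sleep(1)

# main.py (hypothetical)
import datetime
import pytz
import main_test

time_zone = pytz.timezone('US/Eastern')
curr_time = datetime.datetime.now().replace(microsecond=0).astimezone(time_zone).time()
print(f'Main file: {curr_time}')
main_test.run()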
In direct response to your questions:
"How can I fix the reversed print out?"
Don't use redirection and instead write to a file directly from the script; or make sure you apply the same redirection when calling other scripts from the first (that will get messy); or capture the output of the subprocesses and pipe it to the standard output of your main script.
"How can I print out results simultaneously in terminal and log them in a correct order?"
You should probably just do it in the script; otherwise this is not really a Python question, and you should try Super User or similar sites to see whether there is some way to have tee or similar tools write through live.
In general, though, unless you have really strong reasons to run the other functionality in separate shells, you should look at solving your problems within the Python script. And if you can't, you can use something like Popen or its derivatives to capture the subscript's output and do what you need with it, instead of relying on tools that may or may not be available on the host OS running your script. A sketch of that capture-and-tee approach follows.
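A hedged sketch of that approach, assuming main.py is rewritten to launch main_test.py via Popen and to write each line both to the terminal and to a log file (the 'log' filename is an assumption):

# hypothetical main.py replacement using Popen to capture and tee the output
import subprocess
import sys

with open('log', 'w') as logfile:
    proc = subprocess.Popen(
        [sys.executable, 'main_test.py'],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,   # decode bytes to str (Python 3.7+)
        bufsize=1,   # line-buffered in text mode
    )
    for line in proc.stdout:
        sys.stdout.write(line)   # echo to the terminal immediately
        sys.stdout.flush()
        logfile.write(line)      # and log it in the same order
    proc.wait()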
I have two source files that I am running in Python 3.9. (The files are big...)
File one (fileOne.py)
# ...
sessionID = uuid.uuid4().hex
# ...
File two (fileTwo.py)
# ...
from fileOne import sessionID
# ...
File two is executed using the multiprocessing module.
When I run on my local machine and print the UUID in file two, it is always unique.
When I run the script on CentOS, it somehow remains the same.
If I restart the service, the UUID will change once.
My question: Why does this work locally (Windows OS) as expected, but not on a CentOS VM?
UPDATE 1.0:
To make it clear.
For each separate process, I need the UUID to be the same across fileOne and fileTwo. Which means:
processOne = UUID in file one and in file two will be 1q2w3e
processTwo = UUID in file one and in file two will be r4t5y6 (a different one)
Your riddle is likely caused by the way multiprocessing works on different operating systems. You don't mention it, but your "run locally" is certainly Windows or macOS, not Linux or another Unix flavor.
The thing is that multiprocessing on Linux (and, until a while ago, on macOS; that changed in Python 3.8) uses the fork system call: the current process is duplicated "as is" with all its defined variables and classes. Since your sessionID is defined at import time, it stays the same in all subprocesses.
Windows lacks the fork call, so multiprocessing resorts to starting a new Python interpreter which re-imports all modules from the current process (this leads to another, more common, cause of confusion, where any code not guarded by if __name__ == "__main__": in the entry-point file is re-executed). In your case, the value of sessionID is regenerated on each of those imports.
Check the docs at: https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods
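As a hedged illustration of the difference (this demo is not from the original code and assumes a Unix-like machine where the 'fork' start method is available):

# demo: module-level state under the 'fork' vs 'spawn' start methods
import multiprocessing
import os
import uuid

SESSION_ID = uuid.uuid4().hex  # module-level, like sessionID in fileOne

def show(method):
    print(method, os.getpid(), SESSION_ID)

if __name__ == '__main__':
    for method in ('fork', 'spawn'):
        ctx = multiprocessing.get_context(method)
        p = ctx.Process(target=show, args=(method,))
        p.start()
        p.join()
    # with 'fork' the child prints the parent's SESSION_ID;
    # with 'spawn' the module is re-imported, so a new value is printed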
So, if you want the variable to behave reliably and have the same value across all processes when using multiprocessing, you should either pass it as a parameter to the target functions in the other processes (a minimal example of this follows below), or use a proper structure meant to share values across processes, as documented here:
https://docs.python.org/3/library/multiprocessing.html#sharing-state-between-processes
(you can also check this recent question about the same topic: why is a string printing 3 times instead of 1 when using time.sleep with multiprocessing imported?)
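A hedged sketch of the first option, passing the value explicitly so every worker sees the same ID (the worker function is an assumption, not part of the original files):

# hypothetical: the parent generates the ID once and hands it to each process
import multiprocessing
import uuid

def worker(session_id):
    print('worker sees', session_id)

if __name__ == '__main__':
    session_id = uuid.uuid4().hex
    procs = [multiprocessing.Process(target=worker, args=(session_id,))
             for _ in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()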
If you need a unique ID across files for each different process:
(as is clearer from the edit and comments)
Have a global (plain) dictionary that works as a per-process registry for the IDs, and use a function to retrieve the ID; the function can use os.getpid() as the key into the registry.
file 1:
import os
import uuid
...
_id_registry = {}

def get_session_id():
    return _id_registry.setdefault(os.getpid(), uuid.uuid4())
file2:
from file1 import get_session_id
sessionID = get_session_id()
(the setdefault dict method takes care of providing a new ID value if none was set)
NB: a registry set up this way will hold at most the master process's ID (if multiprocessing is using fork mode) and its own; there is no data on the sibling processes, as each process holds its own copy of the registry. If you need a working inter-process dictionary (which could hold a live registry of all processes, for example), you will probably be better off using Redis for it (https://redis.io; at least one of the Python bindings offers a transparent Python-mapping-over-Redis, so you don't have to worry about its semantics).
When you run your script directly it generates a new value of the UUID, but when you run it inside some service your code behaves the same as:
sessionID = 123 # simple constant
so to fix the issue you can try wrapping the code in a function, for example:
def get_uuid():
    return uuid.uuid4().hex
in your second file:
from fileOne import get_uuid
get_uuid()
EDIT: I had another random error pop up, which I successfully caught in the command prompt, this time pointing to line 69: a segmentation fault when checking whether the length of a tuple in a different dictionary is equal to a number...
I have a long-running (up to a week) script that I designed to test SQLite3 insert times for different structures. Unfortunately, the script intermittently crashes Python without outputting an error message to the Python GUI. Below is the error message that Windows gives in the 'python has stopped working' window:
Full error message:
Problem Event Name: APPCRASH
Application Name: pythonw.exe
Application Version: 3.5.150.1013
Application Timestamp: 55f4dccb
Fault Module Name: python35.dll
Fault Module Version: 3.5.150.1013
Fault Module Timestamp: 55f4dcbb
Exception Code: c0000005
Exception Offset: 000e800e
OS Version: 6.1.7601.2.1.0.768.3
Locale ID: 2057
Additional Information 1: 0a9e
Additional Information 2: 0a9e372d3b4ad19135b953a78882e789
Additional Information 3: 0a9e
Additional Information 4: 0a9e372d3b4ad19135b953a78882e789
Script I was running (warning, 1.5k lines...)
From observing what had and had not been printed, I know that the crash was caused by (or at least happened coincidentally at the same time as) the following piece of code (starting from line 1450 in the link):
with open(r"C:\BMRA\LOG\less tbl log.csv", 'a') as log:
    log.write(my_date.strftime("%Y-%m-%d"))
    log.write(", ")
    seconds = sql_time.seconds
    log.write(str(seconds))
    log.write("\n")
item_collector = []
The log CSV file appears to have been written fine, so my assumption is that the error must lie with the last line.
item_collector is a large (~700 MB) dictionary of lists of tuples that had just been written to an SQLite3 database (the tuples containing only str, int, or float values).
As I understand it, the error means an application wrote to memory it shouldn't have, and Windows consequently shut everything down to stop it from messing things up. However, I don't see how replacing a normal vanilla Python object full of other vanilla Python objects should create such an error.
Does anyone have any ideas about what could underlie this, or alternatively ways to figure that out, given that Python doesn't produce an error message pointing to the specific issue? After a previous issue I did implement the logging-module wrapper shown below around my script, but it did not catch anything.
Some initial research suggested that I get a minidump from the Task Manager before closing the process. I have it, but debugging hasn't been successful; apparently I need something called python35.pdb, which as far as I can make out isn't available (for 3.5).
The script recently had a similar problem, which gave a similar error message.
The advice I received was to wrap the logging module around my script like so:
import logging

logging.basicConfig(filename=r'C:\BMRA\ERROR_LOG.log', level=logging.DEBUG)

try:
    main()
except BaseException:
    logging.getLogger(__name__).exception("Program terminated")
    raise
and:
def logging_decorator(func):
    def wrapper_function(self, *args, **kwargs):
        logging.getLogger(__name__).debug(
            "Calling %s: %r %r", func.__name__, args, kwargs)
        ret = func(self, *args, **kwargs)
        logging.getLogger(__name__).debug(
            "%s returned %r", func.__name__, ret)
        return ret
    return wrapper_function

class MyConnect(sqlite3.Connection):
    def cursor(self):
        return super(MyConnect, self).cursor(MyCursor)
    commit = logging_decorator(sqlite3.Connection.commit)

class MyCursor(sqlite3.Cursor):
    execute = logging_decorator(sqlite3.Cursor.execute)
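For reference, subclasses like these only take effect if the connection is actually created with them; a hedged one-liner showing the usual wiring via the factory argument (the database path is hypothetical, and whether the original 1.5k-line script already connects this way is not shown):

import sqlite3
conn = sqlite3.connect(r'C:\BMRA\bmra.db', factory=MyConnect)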
However, this does not appear to have caught the error; the script still crashes without sending any information to the designated file.
Apologies if I've not included something necessary.
After the script crashed inexplicably in a random place for the 50th time, I ran the Windows Memory Diagnostic tool.
Unfortunately, it appears that my system has RAM/hardware errors, which I understand would cause issues like this.
I'm trying to do some simple IPC in Python as follows: One Python process launches another with subprocess. The child process sends some data into a pipe and the parent process receives it.
Here's my current implementation:
# parent.py
import pickle
import os
import subprocess
import sys
read_fd, write_fd = os.pipe()
if hasattr(os, 'set_inheritable'):
    os.set_inheritable(write_fd, True)
child = subprocess.Popen((sys.executable, 'child.py', str(write_fd)), close_fds=False)
try:
    with os.fdopen(read_fd, 'rb') as reader:
        data = pickle.load(reader)
finally:
    child.wait()
assert data == 'This is the data.'
# child.py
import pickle
import os
import sys
with os.fdopen(int(sys.argv[1]), 'wb') as writer:
    pickle.dump('This is the data.', writer)
On Unix this works as expected, but if I run this code on Windows, I get the following error, after which the program hangs until interrupted:
Traceback (most recent call last):
  File "child.py", line 4, in <module>
    with os.fdopen(int(sys.argv[1]), 'wb') as writer:
  File "C:\Python34\lib\os.py", line 978, in fdopen
    return io.open(fd, *args, **kwargs)
OSError: [Errno 9] Bad file descriptor
I suspect the problem is that the child process isn't inheriting the write_fd file descriptor. How can I fix this?
The code needs to be compatible with Python 2.7, 3.2, and all subsequent versions. This means that the solution can't depend on either the presence or the absence of the changes to file descriptor inheritance specified in PEP 446. As implied above, it also needs to run on both Unix and Windows.
(To answer a couple of obvious questions: The reason I'm not using multiprocessing is because, in my real-life non-simplified code, the two Python programs are part of Django projects with different settings modules. This means they can't share any global state. Also, the child process's standard streams are being used for other purposes and are not available for this.)
UPDATE: After setting the close_fds parameter, the code now works in all versions of Python on Unix. However, it still fails on Windows.
subprocess.PIPE is implemented for all platforms. Why don't you just use this?
If you want to manually create and use an os.pipe(), you need to account for the fact that Windows does not support fork(). It uses CreateProcess() instead, which by default does not make the child inherit open files. But there is a way: each single file descriptor can be made explicitly inheritable. This requires calling the Win32 API. I have implemented this in gipc, see the _pre/post_createprocess_windows() methods here.
As #Jan-Philip Gehrcke suggested, you could use subprocess.PIPE instead of os.pipe():
#!/usr/bin/env python
# parent.py
import sys
from subprocess import check_output
data = check_output([sys.executable or 'python', 'child.py'])
assert data.decode().strip() == 'This is the data.'
check_output() uses stdout=subprocess.PIPE internally.
You could use obj = pickle.loads(data) if child.py uses data = pickle.dumps(obj).
And the child.py could be simplified:
#!/usr/bin/env python
# child.py
print('This is the data.')
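If the payload is a Python object rather than plain text, a hedged sketch of the pickle-over-stdout variant mentioned above could look like this (it assumes Python 3, where sys.stdout.buffer exposes the binary stream; the dict payload is just an illustration):

# child.py (hypothetical pickle variant)
import pickle
import sys

# write the pickled bytes to stdout's binary buffer so nothing mangles them
pickle.dump({'key': 'This is the data.'}, sys.stdout.buffer)

# parent.py (hypothetical pickle variant)
import pickle
import sys
from subprocess import check_output

obj = pickle.loads(check_output([sys.executable, 'child.py']))
assert obj == {'key': 'This is the data.'}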
If the child process is written in Python then for greater flexibility you could import the child script as a module and call its function instead of using subprocess. You could use multiprocessing, concurrent.futures modules if you need to run some Python code in a different process.
If you can't use standard streams then your django applications could use sockets to talk to one another.
The reason I'm not using multiprocessing is because, in my real-life non-simplified code, the two Python programs are part of Django projects with different settings modules. This means they can't share any global state.
This seems bogus. multiprocessing may also use the subprocess module under the hood. If you don't want to share global state, then don't share it; that is the default for multiple processes. You should probably ask a more specific question about how to organize the communication between the various parts of your project in your particular case.
I'm working on a wrapper script for invocations of the Ninja C/C++ build system. The script is in Python, and one thing it should do is log the output from Ninja and from the underlying compiler, but without suppressing standard output.
The part that gives me trouble is that Ninja seems to detect whether or not it is writing to a terminal, so simply catching the output and sending it to standard output ends up changing that output (most notably, when attached to a terminal Ninja does not fill the screen with a list of warning- and error-free build files, but instead replaces the line of the last successfully built translation unit as each new one comes in). Is there any way to let Ninja write to the terminal while still capturing its output? The writing to the terminal should happen as the Ninja subprocess runs, but the capturing of said output may wait until the subprocess has completed.
pty.spawn() allows you to log output to a file while hoodwinking the Ninja subprocess into thinking that it is working with a terminal (tty):
import os
import pty
logfile = open('logfile', 'wb')
def read(fd):
    data = os.read(fd, 1024)
    logfile.write(data)
    return data

pty.spawn("ninja", read)
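Two hedged usage notes: pty.spawn() also accepts an argv list if arguments need to be forwarded (for example ["ninja", "-C", "build"], where the build directory is hypothetical), and since pty.spawn() only returns once the spawned process exits, logfile can simply be closed right after the call.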