The destination path /tmp/abc is the same in both Process 1 and Process 2.
Say there are N processes running; we need to retain the file generated by the latest one.
Process 1
import shutil
shutil.move(src_path, destination_path)
Process 2
import os
os.remove(destination_path)
Solution
1. Handle the failure in each process when the move fails with [Errno 2] No such file or directory.
Is this the correct solution? Is there a better way to handle this?
Useful link: A safe, atomic file-copy operation
You can catch the FileNotFoundError exception:
import shutil

try:
    shutil.move(src_path, destination_path)
except FileNotFoundError:
    print('File Not Found')
    # Add whatever logic you want to execute
except Exception:
    print('Some other error')
The primary solutions that come to mind are either
Keep partial information in a per-process staging file
Rely on the OS for atomic moves
or
Keep partial information in per-process memory
Rely on interprocess communication / locks for atomic writes
For the first, e.g.:
import tempfile
import os

FINAL = '/tmp/something'

def do_stuff():
    fd, name = tempfile.mkstemp(suffix="-%s" % os.getpid())
    while keep_doing_stuff():
        os.write(fd, get_output())
    os.close(fd)
    os.rename(name, FINAL)

if __name__ == '__main__':
    do_stuff()
You can choose to invoke individually from a shell (as shown above) or with some process wrappers (subprocess or multiprocessing would be fine), and either way will work.
For the interprocess approach, you would probably want to spawn everything from a parent process:
from multiprocessing import Process, Lock
from io import StringIO

FINAL = '/tmp/something'

def do_stuff(lock):
    output = StringIO()
    while keep_doing_stuff():
        output.write(get_output())
    with lock:
        with open(FINAL, 'w') as f:
            f.write(output.getvalue())
    output.close()

if __name__ == '__main__':
    lock = Lock()
    for num in range(2):
        Process(target=do_stuff, args=(lock,)).start()
Related
In my application, I have the following requirements:
1. One thread regularly records logs in a file. The log file is rolled over at certain intervals, to keep the log files small.
2. Another thread regularly processes these log files, e.g. moves the log files to another place and parses the log content to generate log reports.
However, there is one condition: the second thread must not process a log file that is currently being used to record logs. In code, the pseudocode looks like this:
# code in the second thread to process the log files
for logFile in os.listdir(logFolder):
    if not (file_is_open(logFile) or file_is_in_use(logFile)):
        ProcessLogFile(logFile)  # move log file to other place, and generate log report...
So, how do I check whether a file is already open or is being used by another process?
I did some research on the internet and found the following:
try:
    myfile = open(filename, "r+")  # or "a+", whatever you need
except IOError:
    print("Could not open file! Please close Excel!")
I tried this code, but it doesn't work, no matter whether I use the "r+" or "a+" flag.
try:
    os.remove(filename)  # try to remove it directly
except OSError as e:
    if e.errno == errno.ENOENT:  # file doesn't exist
        break
This code works, but it does not meet my requirement, since I don't want to delete the file just to check whether it is open.
An issue with trying to find out if a file is being used by another process is the possibility of a race condition. You could check a file, decide that it is not in use, then just before you open it another process (or thread) leaps in and grabs it (or even deletes it).
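If you would rather sidestep the check entirely, the usual alternative is to just attempt the operation and handle the failure. A rough sketch of that idea, assuming you are on Windows, where a file that is still open for writing usually cannot be moved; the destination and handler shown here are hypothetical:
import shutil

def try_to_process(log_path, archive_path):
    # Attempt the move and treat failure as "file busy or gone",
    # instead of checking first and racing with the writer.
    try:
        shutil.move(log_path, archive_path)
    except (PermissionError, FileNotFoundError) as exc:
        print("skipping", log_path, "->", exc)
        return False
    return True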
OK, let's say you decide to live with that possibility and hope it does not occur. Checking which files are in use by other processes is operating-system dependent.
On Linux it is fairly easy: just iterate through the PIDs in /proc. Here is a generator that iterates over the files in use for a specific PID:
import os
import re

def iterate_fds(pid):
    dir = '/proc/' + str(pid) + '/fd'
    if not os.access(dir, os.R_OK | os.X_OK):
        return
    for fd in os.listdir(dir):
        full_name = os.path.join(dir, fd)
        try:
            file = os.readlink(full_name)
            if file == '/dev/null' or \
               re.match(r'pipe:\[\d+\]', file) or \
               re.match(r'socket:\[\d+\]', file):
                file = None
        except OSError as err:
            if err.errno == 2:
                file = None
            else:
                raise(err)
        yield (fd, file)
On Windows it is not quite so straightforward; the APIs are not published. There is a Sysinternals tool (handle.exe) that can be used, but I recommend the PyPI module psutil, which is portable (i.e., it runs on Linux as well, and probably on other OSes):
import psutil

for proc in psutil.process_iter():
    try:
        # this returns the list of opened files by the current process
        flist = proc.open_files()
        if flist:
            print(proc.pid, proc.name())
            for nt in flist:
                print("\t", nt.path)
    # This catches a race condition where a process ends before we can
    # examine its files, as well as processes we are not allowed to inspect
    except (psutil.NoSuchProcess, psutil.AccessDenied) as err:
        print("****", err)
I like Daniel's answer, but for Windows users, I realized that it's safer and simpler to rename the file to the name it already has. That solves the problems brought up in the comments to his answer. Here's the code:
import os

f = 'C:/test.xlsx'
if os.path.exists(f):
    try:
        os.rename(f, f)
        print('Access on file "' + f + '" is available!')
    except OSError as e:
        print('Access-error on file "' + f + '"! \n' + str(e))
You can check whether a file has a handle on it using the following function (remember to pass the full path to that file):
import psutil

def has_handle(fpath):
    for proc in psutil.process_iter():
        try:
            for item in proc.open_files():
                if fpath == item.path:
                    return True
        except Exception:
            pass
    return False
I know I'm late to the party, but I also had this problem and I used the lsof command to solve it (which I think is a new approach compared to those mentioned above). With lsof we can basically check which processes are using this particular file.
Here is how I did it:
from subprocess import check_output, Popen, PIPE

try:
    lsout = Popen(['lsof', filename], stdout=PIPE, shell=False)
    check_output(["grep", filename], stdin=lsout.stdout, shell=False)
except Exception:
    # check_output will throw an exception here if it doesn't find any process using that file
    pass
Just write your log-processing code in the except part and you are good to go.
Instead of using os.remove(), you may use the following workaround on Windows:
import os

file = "D:\\temp\\test.pdf"
if os.path.exists(file):
    try:
        os.rename(file, file + "_")
        print("Access on file \"" + str(file) + "\" is available!")
        os.rename(file + "_", file)
    except OSError as e:
        message = "Access-error on file \"" + str(file) + "\"!!! \n" + str(e)
        print(message)
You can use inotify to watch for activity in the file system. You can watch for file-close events, which indicate that a roll-over has happened. You should also add an additional condition on file size. Make sure you filter out file-close events coming from the second thread.
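A rough sketch of that idea, using the same inotify package that a later answer installs; process_log_file is a hypothetical handler for the processing thread:
import os
import inotify.adapters

def watch_for_rollover(log_folder):
    i = inotify.adapters.Inotify()
    i.add_watch(log_folder)
    for _, type_names, path, filename in i.event_gen(yield_nones=False):
        # a writer finished with the file; treat this as "roll-over done"
        if 'IN_CLOSE_WRITE' not in type_names:
            continue
        full_path = os.path.join(path, filename)
        # extra condition on file size, plus a place to filter out files
        # the processing thread itself touches (e.g. by naming convention)
        if os.path.getsize(full_path) == 0:
            continue
        process_log_file(full_path)  # hypothetical handler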
A slightly more polished version of one of the answers above.
from pathlib import Path

def is_file_in_use(file_path):
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError
    try:
        path.rename(path)
    except PermissionError:
        return True
    else:
        return False
On Windows, you can also retrieve the information directly by leveraging the NTDLL/KERNEL32 Windows APIs. The following code returns a list of PIDs, in case the file is still opened/used by a process (including your own, if you have an open handle on the file):
import ctypes
from ctypes import wintypes

path = r"C:\temp\test.txt"

# -----------------------------------------------------------------------------
# generic strings and constants
# -----------------------------------------------------------------------------
ntdll = ctypes.WinDLL('ntdll')
kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)

NTSTATUS = wintypes.LONG
INVALID_HANDLE_VALUE = wintypes.HANDLE(-1).value
FILE_READ_ATTRIBUTES = 0x80
FILE_SHARE_READ = 1
OPEN_EXISTING = 3
FILE_FLAG_BACKUP_SEMANTICS = 0x02000000
FILE_INFORMATION_CLASS = wintypes.ULONG
FileProcessIdsUsingFileInformation = 47
LPSECURITY_ATTRIBUTES = wintypes.LPVOID
ULONG_PTR = wintypes.WPARAM

# -----------------------------------------------------------------------------
# create handle on concerned file with dwDesiredAccess == FILE_READ_ATTRIBUTES
# -----------------------------------------------------------------------------
kernel32.CreateFileW.restype = wintypes.HANDLE
kernel32.CreateFileW.argtypes = (
    wintypes.LPCWSTR,       # In     lpFileName
    wintypes.DWORD,         # In     dwDesiredAccess
    wintypes.DWORD,         # In     dwShareMode
    LPSECURITY_ATTRIBUTES,  # In_opt lpSecurityAttributes
    wintypes.DWORD,         # In     dwCreationDisposition
    wintypes.DWORD,         # In     dwFlagsAndAttributes
    wintypes.HANDLE)        # In_opt hTemplateFile
hFile = kernel32.CreateFileW(
    path, FILE_READ_ATTRIBUTES, FILE_SHARE_READ, None, OPEN_EXISTING,
    FILE_FLAG_BACKUP_SEMANTICS, None)
if hFile == INVALID_HANDLE_VALUE:
    raise ctypes.WinError(ctypes.get_last_error())

# -----------------------------------------------------------------------------
# prepare data types for system call
# -----------------------------------------------------------------------------
class IO_STATUS_BLOCK(ctypes.Structure):
    class _STATUS(ctypes.Union):
        _fields_ = (('Status', NTSTATUS),
                    ('Pointer', wintypes.LPVOID))
    _anonymous_ = '_Status',
    _fields_ = (('_Status', _STATUS),
                ('Information', ULONG_PTR))

iosb = IO_STATUS_BLOCK()

class FILE_PROCESS_IDS_USING_FILE_INFORMATION(ctypes.Structure):
    _fields_ = (('NumberOfProcessIdsInList', wintypes.LARGE_INTEGER),
                ('ProcessIdList', wintypes.LARGE_INTEGER * 64))

info = FILE_PROCESS_IDS_USING_FILE_INFORMATION()

PIO_STATUS_BLOCK = ctypes.POINTER(IO_STATUS_BLOCK)
ntdll.NtQueryInformationFile.restype = NTSTATUS
ntdll.NtQueryInformationFile.argtypes = (
    wintypes.HANDLE,        # In  FileHandle
    PIO_STATUS_BLOCK,       # Out IoStatusBlock
    wintypes.LPVOID,        # Out FileInformation
    wintypes.ULONG,         # In  Length
    FILE_INFORMATION_CLASS) # In  FileInformationClass

# -----------------------------------------------------------------------------
# system call to retrieve list of PIDs currently using the file
# -----------------------------------------------------------------------------
status = ntdll.NtQueryInformationFile(hFile, ctypes.byref(iosb),
                                      ctypes.byref(info),
                                      ctypes.sizeof(info),
                                      FileProcessIdsUsingFileInformation)
pidList = info.ProcessIdList[0:info.NumberOfProcessIdsInList]
print(pidList)
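One small follow-up step, not shown in the snippet above: close the handle once you are done with the query, otherwise your own process keeps the file open:
# release our own handle so we do not keep the file open ourselves
kernel32.CloseHandle.argtypes = (wintypes.HANDLE,)
kernel32.CloseHandle(hFile)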
I provided one solution; please see the following code.
import glob
import os

def isFileinUsed(ifile):
    wildcard = "/proc/*/fd/*"
    lfds = glob.glob(wildcard)
    for fds in lfds:
        try:
            file = os.readlink(fds)
            if file == ifile:
                return True
        except OSError as err:
            if err.errno == 2:
                file = None
            else:
                raise(err)
    return False
You can use this function to check whether a file is in use.
Note:
This solution can only be used on Linux systems.
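A small usage note: since the function compares the readlink target with the exact string you pass in, call it with an absolute path, e.g. (hypothetical path):
# the path must be absolute to match the /proc/*/fd symlink targets
if isFileinUsed("/var/log/myapp/current.log"):
    print("file is still open by some process")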
I wrote a kernel module that writes to /proc/mydev to notify a Python program in userspace. I want to trigger a function in the Python program whenever there is an update of data in /proc/mydev from the kernel module. What is the best way to listen for an update here? I am thinking about using "watchdog" (https://pythonhosted.org/watchdog/). Is there a better way to do this?
This is an easy and efficient way:
import os
from time import sleep
from datetime import datetime

def my_function(_time):
    print("file modified, time: " + datetime.fromtimestamp(_time).strftime("%H:%M:%S"))

if __name__ == "__main__":
    _time = 0
    while True:
        last_modified_time = os.stat("/proc/mydev").st_mtime
        if last_modified_time > _time:
            my_function(last_modified_time)
            _time = last_modified_time
        sleep(1)  # prevent high cpu usage
result:
file modified, time: 11:44:09
file modified, time: 11:46:15
file modified, time: 11:46:24
The while loop ensures that the program keeps listening for changes forever.
You can set the interval by changing the sleep time; a lower sleep time causes higher CPU usage.
import os
import select

# get the file descriptor for the proc file
fd = os.open("/proc/mydev", os.O_RDONLY)

# create a polling object to monitor the file for updates
poller = select.poll()
poller.register(fd, select.POLLIN)

# create a loop to monitor the file for updates
while True:
    events = poller.poll(10000)
    if len(events) > 0:
        # rewind, then read the contents of the file if updated
        os.lseek(fd, 0, os.SEEK_SET)
        print(os.read(fd, 1024))
sudo pip install inotify
Example
Code for monitoring a simple, flat path (see “Recursive Watching” for watching a hierarchical structure):
import inotify.adapters

def _main():
    i = inotify.adapters.Inotify()
    i.add_watch('/tmp')
    with open('/tmp/test_file', 'w'):
        pass
    for event in i.event_gen(yield_nones=False):
        (_, type_names, path, filename) = event
        print("PATH=[{}] FILENAME=[{}] EVENT_TYPES={}".format(
            path, filename, type_names))

if __name__ == '__main__':
    _main()
Expected output:
PATH=[/tmp] FILENAME=[test_file] EVENT_TYPES=['IN_MODIFY']
PATH=[/tmp] FILENAME=[test_file] EVENT_TYPES=['IN_OPEN']
PATH=[/tmp] FILENAME=[test_file] EVENT_TYPES=['IN_CLOSE_WRITE']
I'm not sure if this will work for your situation, since it seems that you want to watch a folder, but this program watches one file at a time until the main() loop repeats:
import os
import time

def main():
    contents = os.listdir("/proc/mydev")
    for file in contents:
        with open("/proc/mydev/" + file, "r") as f:
            init = f.read()
        different = False
        while not different:
            with open("/proc/mydev/" + file, "r") as f:
                check = f.read()
            if init != check:
                different = True
            else:
                time.sleep(1)
    # Write what you would want to happen if a change occurred here...
    main()  # repeat

main()
You could then write what you would want to happen right before the last usage of main(), as it would then repeat.
Also, this may contain errors, since I rushed this.
Hope this at least helps!
You can't do this efficiently without modifying your kernel driver.
Instead of using procfs, have it register a new character device under /dev, and write that driver to make new content available to read from that device only when new content has in fact come in from the underlying hardware, such that the application layer can issue a blocking read and have it return only when new content exists.
A good example to work from (which also has plenty of native Python clients) is the evdev devices in the input core.
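On the user-space side, the consumer then boils down to a plain blocking-read loop; a minimal sketch, assuming the driver is exposed as a hypothetical /dev/mydev and handle_update is your own callback:
import os

DEVICE = "/dev/mydev"  # hypothetical character device exposed by the driver

def listen(handle_update):
    fd = os.open(DEVICE, os.O_RDONLY)
    try:
        while True:
            # blocks until the driver signals that new content is available
            data = os.read(fd, 4096)
            if data:
                handle_update(data)
    finally:
        os.close(fd)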
I have a data file saved using the shelve module in python 2.7 which is somehow corrupt. I can load it with db = shelve.open('file.db') but when I call len(db) or even bool(db) it hangs, and I have to kill the process.
However, I am able to loop through the entire thing and create a new non-corrupt file:
db = shelve.open('orig.db')
db2 = shelve.open('copy.db')
for k, v in db.items():
db2[k] = v
db2.close() # copy.db will now be a fully working copy
The question is, how can I test the dict and avoid the hang?
BTW, I still have the original file, and it exhibits the same behaviour when copied to other machines, in case someone also wants to help me get to the bottom of what's actually wrong with the file in the first place!
I'm unaware of any inspection methods other than dbm.whichdb(). For debugging a possible pickle protocol mismatch in a way that lets you time out long-running tests, maybe try:
import shelve
import pickle
import dbm
import multiprocessing
import time
import psutil

def protocol_check():
    print('orig.db is', dbm.whichdb('orig.db'))
    print('copy.db is', dbm.whichdb('copy.db'))
    for p in range(pickle.HIGHEST_PROTOCOL + 1):
        print('trying protocol', p)
        db = shelve.open('orig.db', protocol=p)
        db2 = shelve.open('copy.db')
        try:
            for k, v in db.items():
                db2[k] = v
        finally:
            db2.close()
            db.close()
        print('great success on', p)

def terminate(grace_period=2):
    procs = psutil.Process().children()
    for p in procs:
        p.terminate()
    gone, still_alive = psutil.wait_procs(procs, timeout=grace_period)
    for p in still_alive:
        p.kill()

process = multiprocessing.Process(target=protocol_check)
process.start()
time.sleep(10)
terminate()
How can I apply multiprocessing to my function? I tried it like this, but it did not work.
def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist < 2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")

mix = PDB().read("clash_1.pdb")
system = System()
system.add(mix)

from multiprocessing import Pool
p = Pool(4)
p.map(steric_clashes_parallel, system)
I have thousands of pdb or system files to test with this function. It took 2 hours for one file on a single core without the multiprocessing module. Any suggestion would be a great help.
My traceback looks something like this:
self.run()
  File "/home/sajid/sire.app/bundled/lib/python3.3/threading.py", line 858, in run
    self._target(*self._args, **self._kwargs)
  File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/pool.py", line 351, in _handle_tasks
    put(task)
  File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/connection.py", line 206, in send
    ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled
(boost.org/libs/python/doc/v2/pickle.html)
The problem is that Sire.System._System.System can't be serialized, so it can't be sent to the child process. Multiprocessing uses the pickle module for serialization, and you can frequently do a sanity check in the main program with pickle.dumps(my_mp_object) to verify.
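For example, a quick check along those lines might look like this (a sketch; my_mp_object stands in for whatever you intend to hand to map):
import pickle

def is_picklable(my_mp_object):
    # anything that fails here will also fail when Pool.map tries to
    # serialize it for the worker processes
    try:
        pickle.dumps(my_mp_object)
    except Exception as exc:
        print("not picklable:", exc)
        return False
    return True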
You have another problem, though (or I think you do, based on your variable names). The map method takes an iterable and fans its iterated objects out to pool members, but it appears that you want to process system itself, not something it iterates over.
One trick with multiprocessing is to keep the payload that you send from the parent to the child simple, and let the child do the heavy lifting of creating its objects. Here, you might be better off just sending down filenames and letting the children do most of the work.
def steric_clashes_from_file(filename):
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    steric_clashes_parallel(system)

def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist < 2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")

filenames = ["clash_1.pdb",]

from multiprocessing import Pool
p = Pool(4)
# chunksize belongs to map(), not the Pool constructor
p.map(steric_clashes_from_file, filenames, chunksize=1)
# martineau:
I tested the pickle command and it gave me:
----> 1 pickle.dumps(clash_1.pdb)
RuntimeError: Pickling of "Sire.Mol._Mol.MoleculeGroup" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
----> 1 pickle.dumps(system)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
With your script it took the same time and used only a single core. The dist line is iterable though; can I run this single line over multiple cores? I modified the line as:
for i in rna_st.atoms(AtomIdx()):
    icent = i.evaluate().center()
    for j in peg_st.atoms(AtomIdx()):
        dist = Vector.distance(icent, j.evaluate().center())
There is one trick you can do to get a faster computation for each file: process each file sequentially, but process the contents of the file in parallel. This relies on a couple of caveats:
You are running on a system that can fork processes (such as Linux).
The computations you are doing do not have side effects that affect the result of future computations.
It seems like this is the case in your situation, but I can't be 100% sure.
When a process is forked, all the memory in the child process is duplicated from the parent process (what's more, it is duplicated in an efficient manner: bits of memory that are only read from aren't duplicated). This makes it easy to share big, complex initial states between processes. However, once the child processes have started, they will not see any changes to objects made in the parent process (and vice versa).
Sample code:
import multiprocessing

system = None
rna_st = None

class StericClash(Exception):
    """Exception used to halt processing of a file. Could be modified to
    include information about what caused the clash if this is useful."""
    pass

def steric_clashes_parallel(system_index):
    peg_st = system[system_index].molecule()
    if rna_st != peg_st:
        for i in rna_st.atoms(AtomIdx()):
            for j in peg_st.atoms(AtomIdx()):
                dist = Vector.distance(i.evaluate().center(),
                                       j.evaluate().center())
                if dist < 2:
                    raise StericClash()

def process_file(filename):
    global system, rna_st
    # initialise global values before creating pool
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    rna_st = system[MolWithResID("G")].molecule()
    with multiprocessing.Pool() as pool:
        # contents of file processed in parallel
        try:
            pool.map(steric_clashes_parallel, range(system.molNums()))
        except StericClash:
            # terminate called to halt current jobs and further processing
            # of file
            pool.terminate()
            # wait for pool processes to terminate before returning
            pool.join()
            return False
        else:
            pool.close()
            pool.join()
            return True
        finally:
            # reset globals
            system = rna_st = None

if __name__ == "__main__":
    for filename in get_files_to_be_processed():
        # files are being processed in serial
        result = process_file(filename)
        save_result_to_disk(filename, result)
I'm trying to understand how to use the new AsyncIO functionality in Python 3.4 and I'm struggling with how to use event_loop.add_reader(). From the limited discussions I've found, it looks like it's for reading the standard output of a separate process as opposed to the contents of an open file. Is that true? If so, it appears that there's no AsyncIO-specific way to integrate standard file IO; is this also true?
I've been playing with the following code. Running it gives the exception PermissionError: [Errno 1] Operation not permitted from line 399 of /python3.4/selectors.py (self._epoll.register(key.fd, epoll_events)), which is triggered by the add_reader() line below:
import asyncio
import urllib.parse
import sys
import pdb
import os

def fileCallback(*args):
    pdb.set_trace()

path = sys.argv[1]
loop = asyncio.get_event_loop()
#fd = os.open(path, os.O_RDONLY)
fd = open(path, 'r')
#data = fd.read()
#print(data)
#fd.close()
pdb.set_trace()
task = loop.add_reader(fd, fileCallback, fd)
loop.run_until_complete(task)
loop.close()
EDIT
For those looking for an example of how to use AsyncIO to read more than one file at a time, as I was, here's an example of how it can be accomplished. The secret is in the line yield from asyncio.sleep(0). This essentially pauses the current function, putting it back in the event loop queue to be called after all other ready functions have executed. Functions are determined to be ready based on how they were scheduled.
import asyncio

@asyncio.coroutine
def read_section(file, length):
    yield from asyncio.sleep(0)
    return file.read(length)

@asyncio.coroutine
def read_file(path):
    fd = open(path, 'r')
    retVal = []
    cnt = 0
    while True:
        cnt = cnt + 1
        data = yield from read_section(fd, 102400)
        print(path + ': ' + str(cnt) + ' - ' + str(len(data)))
        if len(data) == 0:
            break
    fd.close()

paths = ["loadme.txt", "loadme also.txt"]
loop = asyncio.get_event_loop()
tasks = []
for path in paths:
    tasks.append(asyncio.async(read_file(path)))
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
These functions expect a file descriptor, that is, the underlying integer the operating system uses, not Python's file objects. File objects that are based on file descriptors return that descriptor from the fileno() method, so for example:
>>> sys.stderr.fileno()
2
In Unix, file descriptors can be attached to files or a lot of other things, including other processes.
Edit for the OP's edit:
As Max says in the comments, you cannot use epoll on local files (and asyncio uses epoll). Yes, that's kind of weird. You can use it on pipes, though, for example:
import asyncio
import sys

def fileCallback(*args):
    print("Received: " + sys.stdin.readline())

loop = asyncio.get_event_loop()
loop.add_reader(sys.stdin.fileno(), fileCallback)
loop.run_forever()
This will echo stuff you write on stdin.
you cannot use add_reader on local files, because:
It cannot be done using select/poll/epoll
It depends on the operating system
It cannot be fully asynchronous because of OS limitations (Linux does not support async reads/writes of filesystem metadata)
But, technically, yes, you should be able to do async filesystem reads/writes: (almost) all systems have a DMA mechanism for doing I/O "in the background". And no, local I/O is not so fast that nobody would want it asynchronous; the CPU is on the order of millions of times faster than disk I/O.
Look into aiofile or aiofiles if you want to try async file I/O.
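For instance, with the aiofiles package the file-reading coroutine above could look roughly like this (a sketch assuming aiofiles is installed and a Python version with asyncio.run, i.e. 3.7+; under the hood aiofiles delegates the blocking reads to a thread pool rather than doing kernel-level async I/O):
import asyncio
import aiofiles

async def read_file(path):
    async with aiofiles.open(path, 'r') as fd:
        cnt = 0
        while True:
            cnt += 1
            data = await fd.read(102400)
            print('{}: {} - {}'.format(path, cnt, len(data)))
            if not data:
                break

async def main(paths):
    # read the files "concurrently"; the loop switches tasks at each await
    await asyncio.gather(*(read_file(p) for p in paths))

asyncio.run(main(["loadme.txt", "loadme also.txt"]))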