Is it possible to detect corrupt Python dictionaries - python

I have a data file saved using the shelve module in python 2.7 which is somehow corrupt. I can load it with db = shelve.open('file.db') but when I call len(db) or even bool(db) it hangs, and I have to kill the process.
However, I am able to loop through the entire thing and create a new non-corrupt file:
db = shelve.open('orig.db')
db2 = shelve.open('copy.db')
for k, v in db.items():
    db2[k] = v
db2.close()  # copy.db will now be a fully working copy
The question is, how can I test the dict and avoid the hang?
BTW, I still have the original file, and it exhibits the same behaviour when copied to other machines, in case someone also wants to help me get to the bottom of what's actually wrong with the file in the first place!

I'm unaware of any inspection methods other than dbm.whichdb(). To debug a possible pickle protocol mismatch in a way that lets you time out long-running tests, maybe try:
import shelve
import pickle
import dbm
import multiprocessing
import time
import psutil

def protocol_check():
    print('orig.db is', dbm.whichdb('orig.db'))
    print('copy.db is', dbm.whichdb('copy.db'))
    for p in range(pickle.HIGHEST_PROTOCOL + 1):
        print('trying protocol', p)
        db = shelve.open('orig.db', protocol=p)
        db2 = shelve.open('copy.db')
        try:
            for k, v in db.items():
                db2[k] = v
        finally:
            db2.close()
            db.close()
        print('great success on', p)

def terminate(grace_period=2):
    procs = psutil.Process().children()
    for p in procs:
        p.terminate()
    gone, still_alive = psutil.wait_procs(procs, timeout=grace_period)
    for p in still_alive:
        p.kill()

process = multiprocessing.Process(target=protocol_check)
process.start()
time.sleep(10)
terminate()

Related

python: monitor updates in /proc/mydev file

I wrote a kernel module that writes to /proc/mydev to notify a Python program in userspace. I want to trigger a function in the Python program whenever there is an update of data in /proc/mydev from the kernel module. What is the best way to listen for an update here? I am thinking about using "watchdog" (https://pythonhosted.org/watchdog/). Is there a better way to do this?
This is an easy and efficient way:
import os
from time import sleep
from datetime import datetime

def my_function(_time):
    print("file modified, time: " + datetime.fromtimestamp(_time).strftime("%H:%M:%S"))

if __name__ == "__main__":
    _time = 0
    while True:
        last_modified_time = os.stat("/proc/mydev").st_mtime
        if last_modified_time > _time:
            my_function(last_modified_time)
            _time = last_modified_time
        sleep(1)  # prevent high CPU usage
result:
file modified, time: 11:44:09
file modified, time: 11:46:15
file modified, time: 11:46:24
The while loop guarantees that the program keeps listening for changes forever.
You can set the interval by changing the sleep time; a low sleep time causes high CPU usage.
Alternatively, you can poll the proc file's descriptor with select:
import os
import select

# get the file descriptor for the proc file
fd = os.open("/proc/mydev", os.O_RDONLY)

# create a polling object to monitor the file for updates
poller = select.poll()
poller.register(fd, select.POLLIN)

# loop to monitor the file for updates
while True:
    events = poller.poll(10000)
    if len(events) > 0:
        # read the contents of the file if updated
        print(os.read(fd, 1024))
sudo pip install inotify
Example
Code for monitoring a simple, flat path (see “Recursive Watching” for watching a hierarchical structure):
import inotify.adapters

def _main():
    i = inotify.adapters.Inotify()
    i.add_watch('/tmp')

    with open('/tmp/test_file', 'w'):
        pass

    for event in i.event_gen(yield_nones=False):
        (_, type_names, path, filename) = event
        print("PATH=[{}] FILENAME=[{}] EVENT_TYPES={}".format(
            path, filename, type_names))

if __name__ == '__main__':
    _main()
Expected output:
PATH=[/tmp] FILENAME=[test_file] EVENT_TYPES=['IN_MODIFY']
PATH=[/tmp] FILENAME=[test_file] EVENT_TYPES=['IN_OPEN']
PATH=[/tmp] FILENAME=[test_file] EVENT_TYPES=['IN_CLOSE_WRITE']
I'm not sure if this would work for your situation, since it seems that you want to watch a folder, but this program watches one file at a time until the main() loop repeats:
import os
import time

def main():
    contents = os.listdir("/proc/mydev")
    for file in contents:
        with open("/proc/mydev/" + file, "r") as f:
            init = f.read()

        different = False
        while not different:
            with open("/proc/mydev/" + file, "r") as f:
                check = f.read()
            if init != check:
                different = True
            time.sleep(1)

        # Write what you would want to happen if a change occurred here...
        main()

main()
You could then write what you would want to happen right before the recursive call to main(), as it would then repeat.
Also, this may contain errors, since I rushed this.
Hope this at least helps!
You can't do this efficiently without modifying your kernel driver.
Instead of using procfs, have it register a new character device under /dev, and write that driver to make new content available to read from that device only when new content has in fact come in from the underlying hardware, such that the application layer can issue a blocking read and have it return only when new content exists.
A good example to work from (which also has plenty of native Python clients) is the evdev devices in the input core.
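As a rough illustration, here is a minimal sketch of the userspace side, assuming the driver exposes a hypothetical character device at /dev/mydev whose read() blocks until new data is available (handle_update is a hypothetical callback, not part of the question):
def handle_update(chunk):
    # hypothetical callback: react to the new data here
    print("new data:", chunk)

# open unbuffered so each read() goes straight to the driver's read handler
with open("/dev/mydev", "rb", buffering=0) as dev:
    while True:
        chunk = dev.read(4096)  # blocks until the driver has new content
        if chunk:
            handle_update(chunk)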

Python shutil module move exception handling

The destination path /tmp/abc is the same in both Process 1 and Process 2.
Say there are N processes running; we need to retain the file generated by the latest one.
Process 1
import shutil
shutil.move(src_path, destination_path)
Process 2
import os
os.remove(destination_path)
Solution
1. Handle it in the process by catching the case where the move fails with [Errno 2] No such file or directory.
Is this the correct solution? Is there a better way to handle this?
Useful link: A safe, atomic file-copy operation
You can catch the FileNotFoundError exception:
try:
    shutil.move(src_path, destination_path)
except FileNotFoundError:
    print('File Not Found')
    # Add whatever logic you want to execute
except Exception:
    print('Some Other error')
The primary solutions that come to mind are either:
1. Keep partial information in a per-process staging file and rely on the OS for atomic moves, or
2. Keep partial information in per-process memory and rely on interprocess communication / locks for atomic writes.
For the first, e.g.:
import tempfile
import os

FINAL = '/tmp/something'

def do_stuff():
    fd, name = tempfile.mkstemp(suffix="-%s" % os.getpid())
    while keep_doing_stuff():
        os.write(fd, get_output())
    os.close(fd)
    os.rename(name, FINAL)

if __name__ == '__main__':
    do_stuff()
You can choose to invoke it individually from a shell (as shown above) or with some process wrappers (subprocess or multiprocessing would be fine), and either way will work; see the sketch below.
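For instance, a minimal sketch of driving do_stuff() from multiprocessing instead of a shell, reusing do_stuff() from the snippet above:
from multiprocessing import Process

if __name__ == '__main__':
    workers = [Process(target=do_stuff) for _ in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()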
For the interprocess approach you would probably want to spawn everything from a parent process:
from multiprocessing import Process, Lock
from cStringIO import StringIO

def do_stuff(lock):
    output = StringIO()
    while keep_doing_stuff():
        output.write(get_output())
    with lock:
        with open(FINAL, 'w') as f:
            f.write(output.getvalue())
    output.close()

if __name__ == '__main__':
    lock = Lock()
    for num in range(2):
        Process(target=do_stuff, args=(lock,)).start()

Executing program via pyvmomi creates a process, but nothing happens after that

I'm studying vCenter 6.5 and community samples help a lot, but in this particular situation I can't figure out what's going on. The script:
from __future__ import with_statement
import atexit
from tools import cli
from pyVim import connect
from pyVmomi import vim, vmodl

def get_args():
    # *Boring args parsing works*
    return args

def main():
    args = get_args()
    try:
        service_instance = connect.SmartConnectNoSSL(host=args.host,
                                                     user=args.user,
                                                     pwd=args.password,
                                                     port=int(args.port))
        atexit.register(connect.Disconnect, service_instance)
        content = service_instance.RetrieveContent()
        vm = content.searchIndex.FindByUuid(None, args.vm_uuid, True)
        creds = vim.vm.guest.NamePasswordAuthentication(
            username=args.vm_user, password=args.vm_pwd
        )
        try:
            pm = content.guestOperationsManager.processManager
            ps = vim.vm.guest.ProcessManager.ProgramSpec(
                programPath=args.path_to_program,
                arguments=args.program_arguments
            )
            res = pm.StartProgramInGuest(vm, creds, ps)
            if res > 0:
                print "Program executed, PID is %d" % res
        except IOError, e:
            print e
    except vmodl.MethodFault as error:
        print "Caught vmodl fault : " + error.msg
        return -1
    return 0

# Start program
if __name__ == "__main__":
    main()
When I execute it in console, it successfully connects to the target virtual machine and prints
Program executed, PID is 2036
In Task Manager I can see a process with the mentioned PID; it was created by the correct user, but there is no GUI for the process (calc.exe). Right-clicking does not offer an option to "Expand" the process.
I suppose that this process was created with special parameters, maybe in a different session.
In addition, I tried to run a batch file to check whether it actually executes, but it does not.
Any help, advice, or clues would be awesome.
P.S. I tried other scripts and successfully transferred a file to the VM.
P.P.S. Sorry for my English.
Update: All such processes start in session 0.
Have you tried interactiveSession ?
https://github.com/vmware/pyvmomi/blob/master/docs/vim/vm/guest/GuestAuthentication.rst
This boolean argument is passed to NamePasswordAuthentication and means the following:
This is set to true if the client wants an interactive session in the guest.
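A sketch of how the flag could be passed, reusing the credential setup from the question (the surrounding variables come from the original script):
creds = vim.vm.guest.NamePasswordAuthentication(
    username=args.vm_user,
    password=args.vm_pwd,
    interactiveSession=True  # ask for an interactive session in the guest
)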

multiprocessing to a python function

How can I apply multiprocessing to my function? I tried it like this, but it did not work.
def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist < 2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")

mix = PDB().read("clash_1.pdb")
system = System()
system.add(mix)

from multiprocessing import Pool
p = Pool(4)
p.map(steric_clashes_parallel, system)
I have thousands of pdb or system files to test through this function. It took 2 hours for one file on a single core without the multiprocessing module. Any suggestion would be a great help.
My traceback looks something like this:
self.run()
File "/home/sajid/sire.app/bundled/lib/python3.3/threading.py", line 858,
in run self._target(*self._args, **self._kwargs)
File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/pool.py", line 351,
in _handle_tasks put(task)
File "/home/sajid/sire.app/bundled/lib/python3.3/multiprocessing/connection.py", line 206,
in send ForkingPickler(buf, pickle.HIGHEST_PROTOCOL).dump(obj)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled
(boost.org/libs/python/doc/v2/pickle.html)
The problem is that Sire.System._System.System can't be serialized so it can't be sent to the child process. Multiprocessing uses the pickle module for serialization and you can frequently do a sanity check in the main program with pickle.dumps(my_mp_object) to verify.
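For instance, a quick check of that kind using the system object from the question (a sketch; any exception here means the object cannot cross a process boundary):
import pickle

try:
    pickle.dumps(system)
    print("picklable")
except Exception as exc:
    print("not picklable:", exc)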
You have another problem, though (or I think you do, based on variable names). The map method takes an iterable and fans its iterated objects out to pool members, but it appears that you want to process system itself, not something it iterates over.
One trick to multiprocessing is to keep the payload that you send from the parent to the child simple and let the child do the heavy lifting of creating its objects. Here, you might be better off just sending down filenames and letting the children do most of the work.
def steric_clashes_from_file(filename):
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    steric_clashes_parallel(system)

def steric_clashes_parallel(system):
    rna_st = system[MolWithResID("G")].molecule()
    for i in system.molNums():
        peg_st = system[i].molecule()
        if rna_st != peg_st:
            print(peg_st)
            for i in rna_st.atoms(AtomIdx()):
                for j in peg_st.atoms(AtomIdx()):
                    # print(Vector.distance(i.evaluate().center(), j.evaluate().center()))
                    dist = Vector.distance(i.evaluate().center(), j.evaluate().center())
                    if dist < 2:
                        return print("there is a steric clash")
    return print("there is no steric clashes")

filenames = ["clash_1.pdb",]
from multiprocessing import Pool
p = Pool(4)
p.map(steric_clashes_from_file, filenames, chunksize=1)
@martineau:
I tested the pickle command and it gave me:
----> 1 pickle.dumps(clash_1.pdb)
RuntimeError: Pickling of "Sire.Mol._Mol.MoleculeGroup" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
----> 1 pickle.dumps(system)
RuntimeError: Pickling of "Sire.System._System.System" instances is not enabled (http://www.boost.org/libs/python/doc/v2/pickle.html)
With your script it took the same time, using a single core only. The dist line is iterable, though. Can I run this single line over multiple cores? I modified the line as:
for i in rna_st.atoms(AtomIdx()):
    icent = i.evaluate().center()
    for j in peg_st.atoms(AtomIdx()):
        dist = Vector.distance(icent, j.evaluate().center())
There is one trick you can do to get a faster computation for each file -- processing each file sequentially, but processing the contents of the file in parallel. This relies on a number of caveats:
You are running on a system that can fork processes (such as Linux).
The computations you are doing do not have side effects that affect the result of future computations.
It seems like this is the case in your situation, but I can't be 100% sure.
When a process is forked, all the memory in the child process is duplicated from the parent process (what's more, it is duplicated in an efficient manner -- bits of memory that are only read from aren't duplicated). This makes it easy to share a big, complex initial state between processes. However, once the child processes have started, they will not see any changes made to objects in the parent process (and vice versa).
Sample code:
import multiprocessing

system = None
rna_st = None

class StericClash(Exception):
    """Exception used to halt processing of a file. Could be modified to
    include information about what caused the clash if this is useful."""
    pass

def steric_clashes_parallel(system_index):
    peg_st = system[system_index].molecule()
    if rna_st != peg_st:
        for i in rna_st.atoms(AtomIdx()):
            for j in peg_st.atoms(AtomIdx()):
                dist = Vector.distance(i.evaluate().center(),
                                       j.evaluate().center())
                if dist < 2:
                    raise StericClash()

def process_file(filename):
    global system, rna_st

    # initialise global values before creating pool
    mix = PDB().read(filename)
    system = System()
    system.add(mix)
    rna_st = system[MolWithResID("G")].molecule()

    with multiprocessing.Pool() as pool:
        # contents of file processed in parallel
        try:
            pool.map(steric_clashes_parallel, system.molNums())
        except StericClash:
            # terminate called to halt current jobs and further processing
            # of file
            pool.terminate()
            # wait for pool processes to terminate before returning
            pool.join()
            return False
        else:
            pool.close()
            pool.join()
            return True
        finally:
            # reset globals
            system = rna_st = None

if __name__ == "__main__":
    for filename in get_files_to_be_processed():
        # files are being processed in serial
        result = process_file(filename)
        save_result_to_disk(filename, result)

What File Descriptor object does Python AsyncIO's loop.add_reader() expect?

I'm trying to understand how to use the new AsyncIO functionality in Python 3.4 and I'm struggling with how to use event_loop.add_reader(). From the limited discussions I've found, it looks like it's for reading the standard output of a separate process as opposed to the contents of an open file. Is that true? If so, it appears that there's no AsyncIO-specific way to integrate standard file IO; is this also true?
I've been playing with the following code. Running it gives the exception PermissionError: [Errno 1] Operation not permitted from line 399 of /python3.4/selectors.py (self._epoll.register(key.fd, epoll_events)), which is triggered by the add_reader() line below:
import asyncio
import urllib.parse
import sys
import pdb
import os

def fileCallback(*args):
    pdb.set_trace()

path = sys.argv[1]
loop = asyncio.get_event_loop()
#fd = os.open(path, os.O_RDONLY)
fd = open(path, 'r')
#data = fd.read()
#print(data)
#fd.close()
pdb.set_trace()
task = loop.add_reader(fd, fileCallback, fd)
loop.run_until_complete(task)
loop.close()
EDIT
For those looking for an example of how to use AsyncIO to read more than one file at a time like I was curious about, here's an example of how it can be accomplished. The secret is in the line yield from asyncio.sleep(0). This essentially pauses the current function, putting it back in the event loop queue, to be called after all other ready functions are executed. Functions are determined to be ready based on how they were scheduled.
import asyncio

@asyncio.coroutine
def read_section(file, length):
    yield from asyncio.sleep(0)
    return file.read(length)

@asyncio.coroutine
def read_file(path):
    fd = open(path, 'r')
    retVal = []
    cnt = 0
    while True:
        cnt = cnt + 1
        data = yield from read_section(fd, 102400)
        print(path + ': ' + str(cnt) + ' - ' + str(len(data)))
        if len(data) == 0:
            break
    fd.close()

paths = ["loadme.txt", "loadme also.txt"]

loop = asyncio.get_event_loop()
tasks = []
for path in paths:
    tasks.append(asyncio.async(read_file(path)))
loop.run_until_complete(asyncio.wait(tasks))
loop.close()
These functions expect a file descriptor, that is, the underlying integer the operating system uses, not Python's file objects. File objects that are based on file descriptors return that descriptor from the fileno() method, so for example:
>>> sys.stderr.fileno()
2
In Unix, file descriptors can be attached to files or a lot of other things, including other processes.
Edit for the OP's edit:
As Max in the comments says, you cannot use epoll on local files (and asyncio uses epoll). Yes, that's kind of weird. You can use it on pipes, though, for example:
import asyncio
import sys

def fileCallback(*args):
    print("Received: " + sys.stdin.readline())

loop = asyncio.get_event_loop()
task = loop.add_reader(sys.stdin.fileno(), fileCallback)
loop.run_forever()
This will echo stuff you write on stdin.
you cannot use add_reader on local files, because:
It cannot be done using select/poll/epoll
It depends on the operating system
It cannot be fully asynchronous because of os limitations (linux does not support async fs metadata read/write)
But technically, yes, you should be able to do async filesystem reads/writes; (almost) all systems have a DMA mechanism for doing I/O "in the background". And no, local I/O is not so fast that nobody would want it asynchronous: the CPU is on the order of millions of times faster than disk I/O.
Look at aiofile or aiofiles if you want to try async file I/O.
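For example, a minimal sketch using the third-party aiofiles package (pip install aiofiles; it needs the async/await syntax of Python 3.5+, and the filename is just the one from the example above):
import asyncio
import aiofiles

async def read_file(path):
    # the read happens in a worker thread so the event loop is not blocked
    async with aiofiles.open(path, 'r') as f:
        return await f.read()

if __name__ == '__main__':
    data = asyncio.get_event_loop().run_until_complete(read_file('loadme.txt'))
    print(len(data))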
