I have a Python HTTP server; on a certain GET request a file is created (or an existing file is updated) and then returned as the response. The creation or update of the file might take a second.
Hence, I cannot return the file as the response immediately. How do I approach such a problem? Currently I have a solution like this:
while not os.path.isfile('myfile'):
    time.sleep(0.1)
return myfile
This seems rather clumsy; is there a better way?
A simple notification would do, but I don't have control over the process which creates/updates the files.
You could use Watchdog for a nicer way to watch the file system.
Something like this will remove the os call:
while updating:
    time.sleep(0.1)
return myfile
...
def updateFile():
    global updating  # without this, the assignment below would only create a local variable
    # ... update the file here ...
    updating = False
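If you go the Watchdog route mentioned above, a rough sketch could look like this (the watched directory '.' and the file name 'myfile' are placeholder assumptions):

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class CreatedHandler(FileSystemEventHandler):
    def __init__(self):
        self.created = False

    def on_created(self, event):
        # fires for any new entry in the watched directory
        if event.src_path.endswith('myfile'):
            self.created = True

handler = CreatedHandler()
observer = Observer()
observer.schedule(handler, '.', recursive=False)  # watch the current directory
observer.start()
try:
    while not handler.created:
        time.sleep(0.1)  # still waits on a flag, but no repeated filesystem stat() calls
finally:
    observer.stop()
    observer.join()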
Implementing blocking I/O operations in synchronous HTTP request handlers is a bad approach. If many users run the same procedure simultaneously, you may soon run out of threads (if there is a limited thread pool). I'd do the following:
A client requests the file-creation URI. A file-generating procedure is started in a background process (some asynchronous task system), and the client gets a file id/name in the HTTP response. Next, the client makes AJAX calls every once in a while (polling) to check whether the file has been created/modified (a separate file-serve/check-if-exists URI). When the file is finally created, the client is redirected (js window.location) to the file-serving URI.
This approach will require a bit more work, but eventually it will pay off.
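A minimal sketch of that flow, assuming Flask and a hypothetical start_file_generation() background task (neither is part of the original question):

import os
import uuid
from flask import Flask, jsonify, send_file

app = Flask(__name__)
OUTPUT_DIR = '/tmp/generated'  # assumed location for the generated files

@app.route('/create')
def create_file():
    file_id = uuid.uuid4().hex
    start_file_generation(file_id)        # hypothetical background task, not shown here
    return jsonify({'id': file_id}), 202  # client polls with this id

@app.route('/status/<file_id>')
def file_status(file_id):
    ready = os.path.isfile(os.path.join(OUTPUT_DIR, file_id))
    return jsonify({'ready': ready})

@app.route('/files/<file_id>')
def serve_file(file_id):
    return send_file(os.path.join(OUTPUT_DIR, file_id))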
You can try using os.path.getmtime; this checks the modification time of the file, and you can return the file if it was modified less than a second ago. I also suggest you only make a limited number of tries, or you will be stuck in an infinite loop if the file never gets created/modified. And as @Krzysztof Rosiński pointed out, you should probably think about doing it in a non-blocking way.
import os
import time
from datetime import datetime

for i in range(10):
    try:
        dif = datetime.now() - datetime.fromtimestamp(os.path.getmtime(file_path))
        if dif.total_seconds() < 1:
            return file
    except OSError:
        time.sleep(0.1)
I have a Python script that pulls from various internal network sources. With how our systems are set up, we initiate a urllib pull from a network location and it gets hung up waiting forever for a response on certain parts of the network. I would like the script to check whether the pull has finished within, let's say, 5 minutes; if it hasn't, it should skip that function, attempt to pull from the next address, and record the failure to a bad-directory log (so we can go check which systems get hung up). There are over 20,000 IP addresses we check, some running older scripts that no longer work but will still try to run when requested, and they never stop trying.
I'm familiar with having a script pause at a certain point:
import time
time.sleep(300)
What I'm thinking, from a pseudocode perspective (not proper Python, just illustrating the idea):
import time
import urllib2

url_dict = ['http://1', 'http://2', 'http://3', ...]
fail_log_path = 'C:/Temp/fail_log.txt'

for addresses in url_dict:
    clock_value = time.start()
    while clock_value <= 300:
        print str(clock_value)
        res = urllib2.retrieve(url)
    if res != []:
        pass
    else:
        fail_log = open(fail_log_path, 'a')
        fail_log.write("Failed to pull from site location: " + str(url) + "\n")
        fail_log.close()
Update: a specific option for dealing with this: timeout for urllib2.urlopen() in pre Python 2.6 versions
Found this answer which is more in line with the overall problem of my question:
kill a function after a certain time in windows
Your code as-is doesn't quite do what you describe. It seems you want the if/else check inside your while loop. On top of that, you want to loop over the IP addresses, not over a time period as your code is currently written (otherwise you will keep requesting the same IP address every iteration). Instead of keeping track of time yourself, I would suggest reading up on urllib.request.urlopen, specifically the timeout parameter. Once set, that function call throws a socket.timeout exception when the time limit is reached. Surround it with a try/except block catching that error and then handle it appropriately.
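A rough sketch of that idea using the asker's Python 2 urllib2 (where urlopen also accepts a timeout); the URL list and log path simply mirror the pseudocode above:

import socket
import urllib2

urls = ['http://1', 'http://2', 'http://3']
fail_log_path = 'C:/Temp/fail_log.txt'

for url in urls:
    try:
        res = urllib2.urlopen(url, timeout=300)  # give up after 5 minutes
        data = res.read()
    except (socket.timeout, urllib2.URLError) as err:
        with open(fail_log_path, 'a') as fail_log:
            fail_log.write("Failed to pull from site location: %s (%s)\n" % (url, err))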
So, I've been able to use multiprocessing to upload multiple files at once to a given server with the following two functions:
import ftplib, multiprocessing, subprocess

def upload(t):
    # These all just return strings representing the various fields I will need.
    server, user, password, service = locker.server, locker.user, locker.password, locker.service
    ftp = ftplib.FTP(server)
    ftp.login(user=user, passwd=password, acct="")
    ftp.storbinary("STOR " + t.split('/')[-1], open(t, "rb"))
    ftp.close()  # Doesn't seem to be necessary, same thing happens whether I close this or not

def ftp_upload(t=files, server=locker.server, user=locker.user, password=locker.password, service=locker.service):
    parsed_targets = parse_it(t)
    ftp = ftplib.FTP(server)
    ftp.login(user=user, passwd=password, acct="")
    remote_files = ftp.nlst(".")
    ftp.close()
    files_already_on_server = [f for f in t if f.split("/")[-1] in remote_files]
    files_to_upload = [f for f in t if f not in files_already_on_server]
    connections_to_make = 3  # The maximum number of connections allowed by the server is 5, and this error pops up even if I use 1
    pool = multiprocessing.Pool(processes=connections_to_make)
    pool.map(upload, files_to_upload)
My problem is that I (very regularly) end up getting errors such as:
File "/usr/lib/python2.7/multiprocessing/pool.py", line 227, in map
return self.map_async(func, iterable, chunksize).get()
File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
raise self._value
ftplib.error_temp: 421 Too many connections (5) from this IP
Note: there's also a timeout error that occasionally occurs, but I'm waiting for it to rear its ugly head again, at which point I'll post it.
I don't get this error when I use the command line (i.e. "ftp -inv", "open SERVER", "user USERNAME PASSWORD", "mput *.rar"), even when I have (for example) 3 instances of this running at once.
I've read through the ftplib and multiprocessing documentation, and I can't figure out what it is that is causing these errors. This is somewhat of a problem because I'm regularly backing up a large amount of data and a large number of files.
1. Is there some way I can avoid these errors, or is there a different way of having the script do this?
2. Is there a way I can tell the script that if it hits this error, it should wait a second and then resume its work?
3. Is there a way I can have the script upload the files in the same order they appear in the list? (Of course, speed differences would mean they wouldn't always be 4 consecutive files, but at the moment the order seems basically random.)
4. Can someone explain why/how more connections are being made to this server simultaneously than the script is calling for?
So, just handling the exceptions seems to be working (except for the occasional recursion error... I still have no idea what the hell is going on there).
As per #3, I wasn't looking for the uploads to be 100% in order, only for the script to pick the next file in the list to upload (differences in process speeds could still cause the order not to be completely sequential, but there would be less variability than in the current system, which seems almost unordered).
You could try to use a single ftp instance per process:
import os
import ftplib
import multiprocessing

def init(*credentials):
    global ftp
    server, user, password, acct = credentials
    ftp = ftplib.FTP(server)
    ftp.login(user=user, passwd=password, acct=acct)

def upload(path):
    with open(path, 'rb') as file:
        try:
            ftp.storbinary("STOR " + os.path.basename(path), file)
        except ftplib.error_temp as error:  # handle temporary error
            return path, error
        else:
            return path, None

def main():
    # ...
    pool = multiprocessing.Pool(processes=connections_to_make,
                                initializer=init, initargs=credentials)
    for path, error in pool.imap_unordered(upload, files_to_upload):
        if error is not None:
            print("failed to upload %s" % (path,))
Specifically answering (2): "Is there a way I can tell the script that if it hits this error, it should wait a second and then resume its work?"
Yes.
ftplib.error_temp: 421 Too many connections (5) from this IP
This is an exception. You can catch it and handle it. Python doesn't optimize tail calls, so recursion here is terrible form, but it can be as simple as this:
from time import sleep

def upload(t):
    # These all just return strings representing the various fields I will need.
    server, user, password, service = locker.server, locker.user, locker.password, locker.service
    try:
        ftp = ftplib.FTP(server)
        ftp.login(user=user, passwd=password, acct="")
        ftp.storbinary("STOR " + t.split('/')[-1], open(t, "rb"))
        ftp.close()  # Doesn't seem to be necessary, same thing happens whether I close this or not
    except ftplib.error_temp:
        ftp.close()
        sleep(2)
        upload(t)
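If you'd rather avoid the recursion, an equivalent iterative retry loop (a sketch along the same lines, assuming the same locker object and a small fixed number of attempts) would be:

from time import sleep
import ftplib

def upload(t, retries=5):
    server, user, password = locker.server, locker.user, locker.password
    for attempt in range(retries):
        try:
            ftp = ftplib.FTP(server)
            ftp.login(user=user, passwd=password, acct="")
            with open(t, "rb") as fh:
                ftp.storbinary("STOR " + t.split('/')[-1], fh)
            ftp.close()
            return
        except ftplib.error_temp:
            sleep(2)  # back off, then try again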
As for your question (3): if that is what you want, do the uploads serially, not in parallel.
I look forward to your posting an update with an answer to (4). The only thing that comes to my mind is some other process holding an FTP connection to this IP.
I have a function like this in Django:
def uploaded_files(request):
    global source
    global password
    global destination
    username = request.user.username
    log_id = request.user.id
    b = File.objects.filter(users_id=log_id, flag='F')  # Get the user id from session; .delete() to use delete
    source = 'sachet.adhikari#69.43.202.97:/home/sachet/my_files'
    password = 'password'
    destination = '/home/zurelsoft/my_files/'
    a = Host.objects.all()  # Lists hosts
    command = subprocess.Popen(['sshpass', '-p', password, 'rsync', '--recursive', source],
                               stdout=subprocess.PIPE)
    command = command.communicate()[0]
    lines = (x.strip() for x in command.split('\n'))
    remote = [x.split(None, 4)[-1] for x in lines if x]
    base_name = [os.path.basename(ok) for ok in remote]
    files_in_server = base_name[1:]
    total_files = len(files_in_server)
    info = subprocess.Popen(['sshpass', '-p', password, 'rsync', source, '--dry-run'],
                            stdout=subprocess.PIPE)
    information = info.communicate()[0]
    command = information.split()
    filesize = command[1]
    #st = int(os.path.getsize(filesize))
    #filesize = size(filesize, system=alternative)
    date = command[2]
    users_b = User.objects.all()
    return render_to_response('uploaded_files.html',
                              {'files': b, 'username': username, 'host': a,
                               'files_server': files_in_server, 'file_size': filesize,
                               'date': date, 'total_files': total_files,
                               'list_users': users_b},
                              context_instance=RequestContext(request))
The main purpose of the function is to transfer the file from the server to the local machine and write the data into the database. What I want is this: there is a single file of 10GB which will take a long time to copy. Since the copying happens using rsync on the command line, I want to let the user play with the other menus while the file is being transferred. How can I achieve that? For example, when the user presses OK, the file is transferred on the command line; I want to show the user a "The file is being transferred" message and stop the cursor from spinning, or something like that. Is multiprocessing or threading appropriate in this case? Thanks
Assuming that function works inside of a view, your browser will timeout before the 10GB file has finished transferring over. Maybe you should re-think your architecture for this?
There are probably several ways to do this, but here are some that come to my mind right now:
One solution is to have an intermediary storing the status of the file transfer. Before you begin the process that transfers the file, set a flag somewhere like a database saying the process has begun. Then if you make your subprocess call blocking, wait for it to complete, check the output of the command if possible and update the flag you set earlier.
Then have whatever front end you have poll the status of the file transfer.
Another solution: if you make the subprocess call non-blocking, as in your example, you should use a thread which sits there reading its stdout and updating an intermediary store which your front end can query to get a more 'real time' update of the transfer process.
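A rough sketch of the first suggestion (blocking subprocess call in a background thread plus a status flag), assuming a hypothetical Transfer model with a status field; none of this is the asker's actual code:

import subprocess
import threading
from django.shortcuts import render_to_response

def run_transfer(transfer_id, password, source, destination):
    proc = subprocess.Popen(['sshpass', '-p', password, 'rsync',
                             '--recursive', source, destination])
    proc.wait()  # blocking is fine here; we are off the request thread
    Transfer.objects.filter(pk=transfer_id).update(
        status='done' if proc.returncode == 0 else 'failed')

def start_transfer(request):
    source = 'user@host:/home/sachet/my_files'   # placeholder values in the style of the question
    destination = '/home/zurelsoft/my_files/'
    password = 'password'
    transfer = Transfer.objects.create(status='running')
    threading.Thread(target=run_transfer,
                     args=(transfer.pk, password, source, destination)).start()
    # return right away; the front end polls a status view keyed on transfer.pk
    return render_to_response('transfer_started.html', {'transfer_id': transfer.pk})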
What you need is Celery.
It lets you spawn the job as a parallel task and return the HTTP response immediately.
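For example, a minimal Celery sketch of that idea; the broker URL, task name, and hard-coded arguments are assumptions, not part of the question:

import subprocess
from celery import Celery

app = Celery('tasks', broker='redis://localhost:6379/0')  # assumed broker

@app.task
def transfer_file(password, source, destination):
    subprocess.check_call(['sshpass', '-p', password, 'rsync',
                           '--recursive', source, destination])

# In the Django view: fire and forget, then return the HTTP response right away.
# transfer_file.delay(password, source, destination)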
RaviU's solutions would certainly work.
Another option is to call a blocking subprocess in its own Thread. This thread could be responsible for setting a flag or information (in memcache, a db, or just a file on the hard drive) as well as clearing it when it's complete. Personally, there is no love lost between rsync's stdout and me, so I usually just ask the OS for the file size.
Also, if you don't need the file absolutely ASAP, adding "-c" to do a checksum can be good for those giant files. Source: personal experience trying to transfer giant video files over a spotty campus network.
I will say the one problem with all of the solutions so far is that they don't work for "N" files. Eventually, even if you make sure each file can only be transferred once at a time, if you have a lot of different files it will bog down the system. You might be better off just using some sort of task queue unless you know it will only ever be one file at a time. I haven't used one recently, but a quick google search yielded Celery, which doesn't look too bad.
Every web server has a facility for uploading files, and what it does for large files is divide the file into chunks and merge them after every chunk is received. What you can do here is have a hidden tag in your HTML page with a value attribute; whenever your upload web service returns an OK message, change the hidden value to something relevant, and also write a function that keeps reading the value of that hidden element to check whether the file upload has finished.
What I want to do is to check if a file exists, and if it doesn't, perform an action, then check again, until the file exists and then the code continues on with other operations.
For simplicity, I would implement a small polling function, with a timeout for safety:
import os
from time import sleep

def open_file(path_to_file, attempts=0, timeout=5, sleep_int=5):
    if attempts < timeout:
        if os.path.isfile(path_to_file):
            try:
                return open(path_to_file)
            except IOError:
                pass  # perform an action here if needed
        sleep(sleep_int)
        return open_file(path_to_file, attempts + 1, timeout, sleep_int)
I would also look into using Python built-in polling, as this will track/report I/O events for your file descriptor
Assuming that you're on Linux:
If you really want to avoid any kind of looping to find if the file exists AND you're sure that it will be created at some point and you know the directory where it will be created, you can track changes to the parent directory using pynotify. It will notify you when something changes and you can detect if it's the file that you need being created.
Depending on your needs it might be more trouble than it's worth, though, in which case I suggest a small polling function like Kyle's solution.
I'm trying to use a unix named pipe to output statistics of a running service. I intend to provide a similar interface as /proc where one can see live stats by catting a file.
I'm using code similar to this in my Python program:
while True:
    f = open('/tmp/readstatshere', 'w')
    f.write('some interesting stats\n')
    f.close()
/tmp/readstatshere is a named pipe created by mknod.
I then cat it to see the stats:
$ cat /tmp/readstatshere
some interesting stats
It works fine most of the time. However, if I cat the entry several times in quick succession, sometimes I get multiple lines of some interesting stats instead of one. Once or twice it has even gone into an infinite loop, printing that line forever until I killed it. The only fix I have so far is to put a delay of, say, 500 ms after f.close() to prevent this issue.
I'd like to know why exactly this happens and if there is a better way of dealing with it.
Thanks in advance
A pipe is simply the wrong solution here. If you want to present a consistent snapshot of the internal state of your process, write that to a temporary file and then rename it to the "public" name. This will prevent all issues that can arise from other processes reading the state while you're updating it. Also, do NOT do that in a busy loop, but ideally in a thread that sleeps for at least one second between updates.
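A minimal sketch of that temp-file-and-rename idea (the path and stats string are placeholders; in the real service this loop would live in its own thread):

import os
import time

STATS_PATH = '/tmp/readstatshere'

while True:
    tmp_path = STATS_PATH + '.tmp'
    with open(tmp_path, 'w') as f:
        f.write('some interesting stats\n')
    os.rename(tmp_path, STATS_PATH)  # atomic on POSIX: readers see the old or the new file, never a partial one
    time.sleep(1)  # update once a second instead of busy-looping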
What about a UNIX socket instead of a pipe?
In this case, you can react on each connect by providing fresh data just in time.
The only downside is that you cannot cat the data; you'll have to create a new socket handle and connect() to the socket file.
MYSOCKETFILE = '/tmp/mysocket'

import socket
import os

try:
    os.unlink(MYSOCKETFILE)
except OSError:
    pass

s = socket.socket(socket.AF_UNIX)
s.bind(MYSOCKETFILE)
s.listen(10)

while True:
    s2, peeraddr = s.accept()
    s2.send('These are my actual data')
    s2.close()
Program querying this socket:
MYSOCKETFILE = '/tmp/mysocket'

import socket
import os

s = socket.socket(socket.AF_UNIX)
s.connect(MYSOCKETFILE)

while True:
    d = s.recv(100)
    if not d:
        break
    print d
s.close()
I think you should use FUSE.
It has Python bindings; see http://pypi.python.org/pypi/fuse-python/
This allows you to compose answers to questions formulated as POSIX filesystem system calls.
Don't write to an actual file. That's not what /proc does. Procfs presents a virtual (non-disk-backed) filesystem which produces the information you want on demand. You can do the same thing, but it'll be easier if it's not tied to the filesystem. Instead, just run a web service inside your Python program, and keep your statistics in memory. When a request comes in for the stats, formulate them into a nice string and return them. Most of the time you won't need to waste cycles updating a file which may not even be read before the next update.
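For example, a minimal sketch with the Python 2 standard library (the port number and stats contents are placeholders; in the real service you'd run this in a background thread and update the dict from your main code):

from BaseHTTPServer import BaseHTTPRequestHandler, HTTPServer

stats = {'requests_served': 0}  # kept in memory and updated by the running service

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = 'some interesting stats: %r\n' % (stats,)
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(body)

HTTPServer(('127.0.0.1', 8000), StatsHandler).serve_forever()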
You need to unlink the pipe after you issue the close. I think this is because there is a race condition where the pipe can be opened for reading again before cat finishes; it then sees more data and reads it out, leading to multiple copies of "some interesting stats."
Basically you want something like:
while True:
    os.mkfifo(the_pipe)
    f = open(the_pipe, 'w')
    f.write('some interesting stats')
    f.close()
    os.unlink(the_pipe)
Update 1: call to mkfifo
Update 2: as noted in the comments, there is a race condition in this code as well with multiple consumers.