FTP: how to deal with hangs and resume downloads? - python

I download a ~300 Mb file from an FTP server every 6 hours or so. Most downloads go well, but sometimes the process hangs and I have to kill and restart it manually. So I want a more robust download system, preferably meeting the following criteria:
1. It avoids timeouts or hangs as much as possible, and can deal with them if they happen.
2. If the download is killed, it tries resuming a few times until the download completes (or sends an error message if none of the attempts worked).
For (1), I read in this question that it would be good to use Python threading with keep-alive calls until all blocks have been downloaded.
def downloadFile(…):
    ftp = FTP(…)
    sock = ftp.transfercmd('RETR ' + filename)
    def background():
        f = open(…)
        while True:
            block = sock.recv(1024*1024)
            if not block:
                break
            f.write(block)
        sock.close()
    t = threading.Thread(target=background)
    t.start()
    while t.is_alive():
        t.join(60)
        ftp.voidcmd('NOOP')
For (2), there could be a loop that checks whether the file has been completely downloaded and, if not, resumes from where it left off. This is based on this question.
for i in range(3):
    if "Check if file has been completely downloaded":
        if os.path.exists(filename):
            restarg = {'rest': str(os.path.getsize(filename))}
        else:
            restarg = {}
        ftp.transfercmd("RETR " + filename, **restarg)
But how can (1) and (2) be combined? Can you resume a threaded download, with many blocks whose download order we don't even know?
If these two methods cannot be combined, do you have any other ideas?
Also, I am not sure how to tell whether the FTP download has completed. Should I check the file size for this? File sizes might change from one download to another.
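One possible way to combine the two ideas is sketched below. It rests on a few assumptions: the server supports the SIZE and REST commands, the partial local file belongs to the same version of the remote file, and `connect` is a hypothetical factory that returns a logged-in ftplib.FTP object created with a timeout. A single RETR delivers its bytes in order over one data connection, so the size of the partial local file can serve as the resume offset. The socket timeout makes a stalled transfer raise an exception instead of hanging, which is used here in place of the keep-alive thread.

```python
import ftplib
import os
import time


def _local_size(path):
    """Size of the partial download, or 0 if nothing has been saved yet."""
    return os.path.getsize(path) if os.path.exists(path) else 0


def download_with_resume(connect, remote_name, local_path, attempts=3, pause=1.0):
    """Try up to `attempts` times; return True once the local file is complete.

    `connect` is a hypothetical zero-argument factory that returns a
    logged-in ftplib.FTP object, for example:
        lambda: ftplib.FTP(host, user, password, timeout=60)
    """
    for _ in range(attempts):
        ftp = None
        try:
            ftp = connect()
            ftp.voidcmd('TYPE I')  # binary mode, so SIZE reports a byte count
            remote_size = ftp.size(remote_name)
            have = _local_size(local_path)
            if remote_size is not None and have < remote_size:
                with open(local_path, 'ab') as f:
                    # rest=<offset> sends REST so the server skips bytes we have
                    ftp.retrbinary('RETR ' + remote_name, f.write,
                                   blocksize=1024 * 1024, rest=have or None)
            if remote_size is not None and _local_size(local_path) == remote_size:
                return True
        except ftplib.all_errors:
            pass  # timeout, dropped connection or FTP error: retry below
        finally:
            if ftp is not None:
                ftp.close()
        time.sleep(pause)
    return False
```

Because the remote file changes between the 6-hourly runs, the finished file would need to be moved or deleted before the next run; otherwise a stale local file would be mistaken for a partial download. A False return value is the point at which an error message could be sent.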

Related

is it possible to pass data from one python program to another python program? [duplicate]

Is it possible -- other than by using something like a .txt/dummy file -- to pass a value from one program to another?
I have a program that uses a .txt file to pass a starting value to another program. I update the value in the file in between starting the program each time I run it (ten times, essentially simultaneously). Doing this is fine, but I would like to have the 'child' program report back to the 'mother' program when it is finished, and also report back what files it found to download.
Is it possible to do this without using eleven files to do it (that's one for each instance of the 'child' to 'mother' reporting, and one file for the 'mother' to 'child')? I am talking about completely separate programs, not classes or functions or anything like that.
To operate efficiently, and not wait around for hours for everything to complete, I need the 'child' program to run ten times and get things done MUCH faster. Thus I run the child program ten times and give each instance a separate range to check through.
Both programs run fine, but I would like them to report back and forth to each other, hopefully without using file 'transmission' to accomplish the task, especially on the child-to-mother side of the data transfer.
'Mother' program...currently
import os
import sys
import subprocess
import time
os.chdir ('/media/')
#find highest download video
Hival = open("Highest.txt", "r")
Histr = Hival.read()
Hival.close()
HiNext = str(int(Histr)+1)
#setup download #1
NextVal = open("NextVal.txt","w")
NextVal.write(HiNext)
NextVal.close()
#call download #1
procs=[]
proc=subprocess.Popen(['python','test.py'])
procs.append(proc)
time.sleep(2)
#setup download #2-11
Histr2 = int(Histr)/10000
Histr2 = Histr2 + 1
for i in range(10):
    Hiint = str(Histr2)+"0000"
    NextVal = open("NextVal.txt","w")
    NextVal.write(Hiint)
    NextVal.close()
    proc=subprocess.Popen(['python','test.py'])
    procs.append(proc)
    time.sleep(2)
    Histr2 = Histr2 + 1
for proc in procs:
    proc.wait()
'Child' program
import urllib
import os
from Tkinter import *
import time
root = Tk()
root.title("Audiodownloader")
root.geometry("200x200")
app = Frame(root)
app.grid()
os.chdir('/media/')
Fileval = open('NextVal.txt','r')
Fileupdate = Fileval.read()
Fileval.close()
Fileupdate = int(Fileupdate)
Filect = Fileupdate/10000
Filect2 = str(Filect)+"0009"
Filecount = int(Filect2)
while Fileupdate <= Filecount:
    root.title(Fileupdate)
    url = 'http://www.yourfavoritewebsite.com/audio/encoded/'+str(Fileupdate)+'.mp3'
    urllib.urlretrieve(url,str(Fileupdate)+'.mp3')
    statinfo = os.stat(str(Fileupdate)+'.mp3')
    if statinfo.st_size<10000L:
        os.remove(str(Fileupdate)+'.mp3')
    time.sleep(.01)
    Fileupdate = Fileupdate+1
    root.update_idletasks()
I'm trying to convert the original VB6 program over to Linux and make it much easier to use at the same time, hence the missing .mainloop. This was my first real attempt at anything in Python, hence the lack of def or classes. I'm coming back to finish this after 1.5 months of doing nothing with it, mostly because I didn't know how to. From a little research a while ago I found this is WAY over my head. I have never done anything with threads/sockets/client/server interaction, so I'm purely an idiot in this case. Whenever I Google anything on it I just get brought right back here to Stack Overflow.
Yes, I want 10 copies of the program running at the same time, to save time. I could do without the GUI if the program could report back to 'mother' so that the mother could print the current value being searched. It would also help if the child could report back when it has finished and whether it downloaded any files successfully (versus downloaded and then erased for being too small). I would use the successful download information to update Highest.txt for the next time the program is run.
I think this may clarify things MUCH better... that or I don't understand the nature of using server/client interaction :) The only reason time.sleep is in the program is to make sure that the files get written before the next instance of the child program is run. I didn't know for sure what kind of timing issues I might run into, so I included those lines for safety.
This can be implemented using a simple client/server topology using the multiprocessing library. Using your mother/child terminology:
server.py
from multiprocessing.connection import Listener
# client
def child(conn):
    while True:
        try:
            msg = conn.recv()
        except EOFError:
            # the client closed its end of the connection
            break
        # this just echoes the value back, replace with your custom logic
        conn.send(msg)
# server
def mother(address):
    serv = Listener(address)
    while True:
        client = serv.accept()
        child(client)
mother(('', 5000))
client.py
from multiprocessing.connection import Client
c = Client(('localhost', 5000))
c.send('hello')
print('Got:', c.recv())
c.send({'a': 123})
print('Got:', c.recv())
Run with
$ python server.py
$ python client.py
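Applied to the mother/child setup from the question, the same library could let each child send a single report when it finishes, so no report files are needed. The following is a minimal sketch, not a drop-in replacement: the function names and the report fields (`start`, `downloaded`) are hypothetical, and each child would need to be told the mother's address, for example as a command-line argument.

```python
from multiprocessing.connection import Client, Listener


def collect_reports(listener, expected):
    """Mother side: accept `expected` connections, read one report from each."""
    reports = []
    for _ in range(expected):
        conn = listener.accept()
        try:
            reports.append(conn.recv())
        finally:
            conn.close()
    return reports


def send_report(address, start, downloaded):
    """Child side: connect to the mother and send a single report dict."""
    conn = Client(address)
    try:
        conn.send({'start': start, 'downloaded': downloaded})
    finally:
        conn.close()
```

The mother would create something like `Listener(('localhost', 5000), backlog=16)` before launching the children and then call `collect_reports(listener, 10)`; the highest successfully downloaded number in the reports could then be written to Highest.txt. The connection objects use pickle, so this should only be used between programs you trust.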
When you talk about using txt files to pass information between programs, we first need to know what language you're using.
As far as I know, in Java and Python this is viable, although it can be laborious depending on the amount of information you want to handle.
In Python, you can use the built-in file functions for reading and writing txt files, and to schedule execution you can use APScheduler.

how to check whether a program using requests module is dead or not

I am trying to download a batch of files using Python, and I use the requests module with stream turned on; in other words, I retrieve each file in 200K blocks.
However, sometimes the download gets stuck (no response) and there is no error. I guess this is because the connection between my computer and the server is not stable enough. My question is: how can I detect this kind of stall and make a new connection?
You probably don't want to detect this from outside, when you can just use timeouts to make requests fail instead of hanging if the server stops sending bytes.
Since you didn't show us your code, it's hard to show you how to change it… but I'll show you how to change some other code:
# hanging
text = requests.get(url).text
# not hanging
try:
    text = requests.get(url, timeout=10.0).text
except requests.exceptions.Timeout:
    # failed, do something else
    pass
# trying until success
while True:
    try:
        text = requests.get(url, timeout=10.0).text
        break
    except requests.exceptions.Timeout:
        pass
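A loop that retries forever can spin indefinitely if the server stays down. A bounded variant with a growing pause between attempts might look like the sketch below; the helper name `retry` and its parameters are made up for this illustration and are not part of requests.

```python
import time


def retry(func, attempts=5, delay=1.0, backoff=2.0, exceptions=(Exception,)):
    """Call func() until it succeeds or `attempts` calls have failed."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise          # out of attempts: let the caller see the error
            time.sleep(delay)
            delay *= backoff   # wait longer before each further attempt
```

With requests it could be called as `retry(lambda: requests.get(url, timeout=10.0).text, exceptions=(requests.exceptions.Timeout,))`.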
If you do want to detect it from outside for some reason, you'll need to use multiprocessing or similar to move the requests-driven code to a child process. Ideally you'll want it to post updates on some Queue, or set and notify some Condition-protected shared flag variable, every 200KB; then the main process can block on the Queue or Condition and kill the child process if it times out. For example (a sketch):
import functools
import multiprocessing
import queue
import requests
def _download(url, q):
    resp = requests.get(url, stream=True)
    for buf in resp.iter_content(chunk_size=200 * 1024):
        q.put(buf)
    q.put(None)  # sentinel: download finished
def download(url):
    q = multiprocessing.Queue()
    proc = multiprocessing.Process(target=_download, args=(url, q))
    proc.start()
    try:
        # q.get raises queue.Empty if no block arrives within 10 seconds
        return b''.join(iter(functools.partial(q.get, timeout=10.0), None))
    except queue.Empty:
        proc.kill()
        # failed, do something else
    finally:
        proc.join()

Multiprocessing FTP Uploading With A Precise Number of Connections

So, I've been able to use multiprocessing to upload multiple files at once to a given server with the following two functions:
import ftplib,multiprocessing,subprocess
def upload(t):
    server,user,password,service=locker.server,locker.user,locker.password,locker.service #These all just return strings representing the various fields I will need.
    ftp=ftplib.FTP(server)
    ftp.login(user=user,passwd=password,acct="")
    ftp.storbinary("STOR "+t.split('/')[-1], open(t,"rb"))
    ftp.close() # Doesn't seem to be necessary, same thing happens whether I close this or not
def ftp_upload(t=files,server=locker.server,user=locker.user,password=locker.password,service=locker.service):
    parsed_targets=parse_it(t)
    ftp=ftplib.FTP(server)
    ftp.login(user=user,passwd=password,acct="")
    remote_files=ftp.nlst(".")
    ftp.close()
    files_already_on_server=[f for f in t if f.split("/")[-1] in remote_files]
    files_to_upload=[f for f in t if not f in files_already_on_server]
    connections_to_make=3 #The maximum connections allowed to the server is 5, and this error will pop up even if I use 1
    pool=multiprocessing.Pool(processes=connections_to_make)
    pool.map(upload,files_to_upload)
My problem is that I (very regularly) end up getting errors such as:
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 227, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 528, in get
    raise self._value
ftplib.error_temp: 421 Too many connections (5) from this IP
Note: There's also a timeout error that occasionally occurs, but I'm waiting for it to rear its ugly head again, at which point I'll post it.
I don't get this error when I use the command line (i.e. "ftp -inv", "open SERVER", "user USERNAME PASSWORD", "mput *.rar"), even when I have (for example) 3 instances of this running at once.
I've read through the ftplib and multiprocessing documentation, and I can't figure out what it is that is causing these errors. This is somewhat of a problem because I'm regularly backing up a large amount of data and a large number of files.
1. Is there some way I can avoid these errors, or is there a different way of having the/a script do this?
2. Is there a way I can tell the script that if it hits this error, it should wait for a second and then resume its work?
3. Is there a way I can have the script upload the files in the same order they are in the list (of course speed differences would mean they wouldn't always be 4 consecutive files, but at the moment the order seems basically random)?
4. Can someone explain why/how more connections are being made simultaneously to this server than the script is calling for?
So, just handling the exceptions seems to be working (except for the occasional recursion error...still have no fucking idea what the hell is going on there).
As per #3, I wasn't looking for that to be 100% in order, only that the script would pick the next file in the list to upload (so differences in processes speeds could/would still cause the order not to be completely sequential, there would be less variability than in the current system, which seems to be almost unordered).
You could try to use a single ftp instance per process:
import ftplib
import multiprocessing
import os
def init(*credentials):
    global ftp
    server, user, password, acct = credentials
    ftp = ftplib.FTP(server)
    ftp.login(user=user, passwd=password, acct=acct)
def upload(path):
    with open(path, 'rb') as file:
        try:
            ftp.storbinary("STOR " + os.path.basename(path), file)
        except ftplib.error_temp as error: # handle temporary error
            return path, error
        else:
            return path, None
def main():
    # ...
    pool = multiprocessing.Pool(processes=connections_to_make,
                                initializer=init, initargs=credentials)
    for path, error in pool.imap_unordered(upload, files_to_upload):
        if error is not None:
            print("failed to upload %s" % (path,))
Specifically answering (2): is there a way I can tell the script that if it hits this error, it should wait for a second and then resume its work?
Yes.
ftplib.error_temp: 421 Too many connections (5) from this IP
This is an exception. You can catch it and handle it. Python doesn't optimize tail calls, so retrying through recursion is poor form, but it can be as simple as this:
from time import sleep
def upload(t):
    server,user,password,service=locker.server,locker.user,locker.password,locker.service #These all just return strings representing the various fields I will need.
    ftp=None
    try:
        ftp=ftplib.FTP(server)
        ftp.login(user=user,passwd=password,acct="")
        ftp.storbinary("STOR "+t.split('/')[-1], open(t,"rb"))
        ftp.close() # Doesn't seem to be necessary, same thing happens whether I close this or not
    except ftplib.error_temp:
        if ftp is not None: # FTP() itself can raise the 421, leaving ftp unset
            ftp.close()
        sleep(2)
        upload(t)
As for your question (3) if that is what you want, do the upload serially, not in parallel.
I look forward to you posting with an update with an answer to (4). The only thing which comes to my mind is some other process with ftp connection to this IP.
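For (3), a middle ground between fully serial and unordered parallel uploads may be Pool.imap. Pool.map splits the input into chunks and hands each worker a whole chunk, which is one plausible reason the order looks random; imap with chunksize=1 hands out one item at a time from the front of the list. The sketch below uses a thread pool to stay short (the same imap call exists on multiprocessing.Pool); `run_in_list_order` and `task` are hypothetical names, where `task` stands for an upload function that manages its own FTP connection.

```python
from multiprocessing.pool import ThreadPool


def run_in_list_order(task, items, workers=3):
    """Start tasks in list order, at most `workers` at a time.

    Returns the results in the same order as `items`.
    """
    with ThreadPool(processes=workers) as pool:
        # imap hands out one item at a time (chunksize=1), so an idle worker
        # always takes the next unstarted item from the front of the list.
        return list(pool.imap(task, items, chunksize=1))
```

Uploads can still finish out of order when their durations differ, but each one starts in list order.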

Python: Check for new files and wait until they have finished transferring

I have a script to scan a directory to see when new files are added, and then process their contents. They're video files, so they're often very large, and they're being transferred over a network and often take a long time to transfer. So I need to make sure they have finished copying before going on.
At the moment, once I've found a new file in the directory I'm using os.path.getmtime to check the modification date, and comparing that to the last time the file was scanned, to see if it is still being modified. The theory is that if it's no longer being modified then it should have finished transferring.
if getmtime(path.join(self.rootFolder, thefile)) < self.lastchecktime:
    newfiles.append(thefile)
but that doesn't seem to work - the script gets triggered too early and the processing fails because the file is not fully loaded. Could it be that there is not enough of a pause between scans that the mtime stays the same…? I give it 10 seconds between scans - that should be enough, surely.
Is there an easy / more pythonic way of doing this? The files are on a windows server running on a VM.
Do you have any control over the adding of the files? If so, you could create an empty file with a name like videoname-complete once a video has finished uploading, and watch for those files.
Wouldn't your check be "is my modified time greater than last checked?".
if os.path.getmtime(path) > self.lastAccessedTime:
    # do something as modified time is greater than last time I checked
    pass
I'm not a Windows guy, but I'm sure there is some library equivalent to inotify for Windows. It is a really nice way to listen for file or directory changes at the file system level. I'm leaving some sample code that works on Linux with pyinotify; it may be helpful for someone on Linux.
import os
import pyinotify
class PTmp(pyinotify.ProcessEvent):
    def process_IN_CLOSE_WRITE(self, event):
        print("Changed: %s " % os.path.join(event.path, event.name))
wm = pyinotify.WatchManager()
mask = pyinotify.IN_CLOSE_WRITE
notifier = pyinotify.Notifier(wm, PTmp())
wdd = wm.add_watch(FILE_LOCATION, mask, rec=True)
while True:
    try:
        notifier.process_events()
        if notifier.check_events():
            notifier.read_events()
    except KeyboardInterrupt:
        notifier.stop()
        break
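When neither a marker file nor a file-system notification is available, a common heuristic is to poll the file size and treat the file as complete once the size has stopped changing for several polls in a row. This is only a sketch of that heuristic, with a made-up function name and parameters; a transfer that stalls for longer than the polling window would be mistaken for a finished one.

```python
import os
import time


def wait_until_stable(path, interval=10.0, stable_checks=3, timeout=None):
    """Return True once the size of `path` is unchanged for `stable_checks`
    consecutive polls, or False if `timeout` seconds pass first."""
    deadline = None if timeout is None else time.monotonic() + timeout
    last_size = -1
    unchanged = 0
    while unchanged < stable_checks:
        if deadline is not None and time.monotonic() > deadline:
            return False
        try:
            size = os.path.getsize(path)
        except OSError:
            size = -1          # file missing or not readable yet
        if size >= 0 and size == last_size:
            unchanged += 1
        else:
            unchanged = 0
        last_size = size
        time.sleep(interval)
    return True
```

With the default arguments the file must keep the same size across roughly 30 seconds of polling before it is handed on for processing.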

UNIX named PIPE end of file

I'm trying to use a unix named pipe to output statistics of a running service. I intend to provide a similar interface as /proc where one can see live stats by catting a file.
I'm using code similar to this in my Python program:
while True:
    f = open('/tmp/readstatshere', 'w')
    f.write('some interesting stats\n')
    f.close()
/tmp/readstatshere is a named pipe created by mknod.
I then cat it to see the stats:
$ cat /tmp/readstatshere
some interesting stats
It works fine most of the time. However, if I cat the entry several times in quick succession, sometimes I get multiple lines of some interesting stats instead of one. Once or twice, it has even gone into an infinite loop, printing that line forever until I killed it. The only fix I have so far is to put a delay of, say, 500ms after f.close() to prevent this issue.
I'd like to know why exactly this happens and if there is a better way of dealing with it.
Thanks in advance
A pipe is simply the wrong solution here. If you want to present a consistent snapshot of the internal state of your process, write that to a temporary file and then rename it to the "public" name. This will prevent all issues that can arise from other processes reading the state while you're updating it. Also, do NOT do that in a busy loop, but ideally in a thread that sleeps for at least one second between updates.
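The write-then-rename approach can be sketched as follows. The function name, the temp-file prefix and the file mode are assumptions made for this illustration; os.replace is atomic on POSIX when both paths are on the same filesystem, so a reader sees either the old snapshot or the new one, never a partial file.

```python
import os
import tempfile


def publish_stats(text, public_path):
    """Write `text` to a temporary file, then rename it to `public_path`."""
    directory = os.path.dirname(os.path.abspath(public_path))
    # Same directory as the target, so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix='.stats-')
    try:
        with os.fdopen(fd, 'w') as f:
            f.write(text)
        os.chmod(tmp_path, 0o644)  # mkstemp creates the file owner-only
        os.replace(tmp_path, public_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)    # do not leave a stray temp file behind
        raise
```

A background thread could call `publish_stats(...)` once per second, and `cat` on the public path would then always print exactly one consistent snapshot.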
What about a UNIX socket instead of a pipe?
In this case, you can react on each connect by providing fresh data just in time.
The only downside is that you cannot cat the data; you'll have to create a new socket handle and connect() to the socket file.
MYSOCKETFILE = '/tmp/mysocket'
import socket
import os
try:
    os.unlink(MYSOCKETFILE)
except OSError: pass
s = socket.socket(socket.AF_UNIX)
s.bind(MYSOCKETFILE)
s.listen(10)
while True:
    s2, peeraddr = s.accept()
    s2.send(b'These are my actual data')
    s2.close()
Program querying this socket:
MYSOCKETFILE = '/tmp/mysocket'
import socket
import os
s = socket.socket(socket.AF_UNIX)
s.connect(MYSOCKETFILE)
while True:
    d = s.recv(100)
    if not d: break
    print(d)
s.close()
I think you should use FUSE.
It has Python bindings; see http://pypi.python.org/pypi/fuse-python/
This allows you to compose answers to questions formulated as POSIX filesystem system calls.
Don't write to an actual file. That's not what /proc does. Procfs presents a virtual (non-disk-backed) filesystem which produces the information you want on demand. You can do the same thing, but it'll be easier if it's not tied to the filesystem. Instead, just run a web service inside your Python program, and keep your statistics in memory. When a request comes in for the stats, formulate them into a nice string and return them. Most of the time you won't need to waste cycles updating a file which may not even be read before the next update.
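As a rough illustration of the in-process web service idea, the standard library's http.server can serve an in-memory dict on demand. The names `STATS`, `StatsHandler` and `start_stats_server`, as well as the JSON format, are assumptions made for this sketch; the service would update `STATS` under the lock as it runs.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

STATS = {'requests_served': 0}   # hypothetical in-memory statistics
STATS_LOCK = threading.Lock()


class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Build the response from the current in-memory state on each request
        with STATS_LOCK:
            body = json.dumps(STATS).encode('utf-8')
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.send_header('Content-Length', str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the service quiet


def start_stats_server(host='127.0.0.1', port=0):
    """Serve the stats from a daemon thread; port 0 picks a free port."""
    server = HTTPServer((host, port), StatsHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

With a fixed port, the stats could then be read with something like `curl http://127.0.0.1:8000/` instead of `cat`.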
You need to unlink the pipe after you issue the close. I think this is because there is a race condition where the pipe can be opened for reading again before cat finishes and it thus sees more data and reads it out, leading to multiples of "some interesting stats."
Basically you want something like:
while True:
    os.mkfifo(the_pipe)
    f = open(the_pipe, 'w')
    f.write('some interesting stats')
    f.close()
    os.unlink(the_pipe)
Update 1: call to mkfifo
Update 2: as noted in the comments, there is a race condition in this code as well with multiple consumers.
