Python read file into memory for repeated FTP copy

I need to read a local file and copy it to a remote location with FTP. I copy the same file, file.txt, to the remote location repeatedly, hundreds of times, with different names like f1.txt, f2.txt, ... f1000.txt. Is it necessary to open, read, and close my local file.txt for every single FTP copy, or is there a way to store it in a variable and reuse that every time, avoiding the file open/close calls? file.txt is a small file of 6 KB. Below is the code I am using:
for i in range(1, 101):
    fname = 'file' + str(i) + '.txt'
    fp = open('file.txt', 'rb')
    ftp.storbinary('STOR ' + fname, fp)
    fp.close()
I tried reading the file into a string variable and passing that instead of fp, but ftp.storbinary requires its second argument to have a read() method. Please suggest a better way to avoid the open/close calls, or let me know if it makes no performance difference at all. I am using Python 2.7.10 on Windows 7.

Simply open it before the loop, and close it after:
fp = open('file.txt', 'rb')
for i in range(1, 101):
    fname = 'file' + str(i) + '.txt'
    fp.seek(0)
    ftp.storbinary('STOR ' + fname, fp)
fp.close()
Update: Make sure you add fp.seek(0) before the call to ftp.storbinary, otherwise the read call will exhaust the file in the first iteration, as noted by @eryksun.
Update 2: Depending on the size of the file, it will probably be faster to use BytesIO. This way the file content is kept in memory but will still be a file-like object (i.e. it will have a read method).
from io import BytesIO

with open('file.txt', 'rb') as f:
    output = BytesIO()
    output.write(f.read())

for i in range(1, 101):
    fname = 'file' + str(i) + '.txt'
    output.seek(0)
    ftp.storbinary('STOR ' + fname, output)
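A slightly shorter variant of the same idea (a sketch, assuming the same ftp object as above): read the bytes once and wrap them in a fresh BytesIO per upload, so no seek is needed:
from io import BytesIO

with open('file.txt', 'rb') as f:
    data = f.read()  # the whole 6 KB file, read once

for i in range(1, 101):
    fname = 'file' + str(i) + '.txt'
    # a fresh in-memory file-like object per upload, already positioned at 0
    ftp.storbinary('STOR ' + fname, BytesIO(data))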


os.path.getsize() returns "0"

Getting "0" output, when I am trying to use os.path.getsize()
Not sure what's wrong, using PyCharm, I see that the file was created and the "comments" were added to the file. But PyCharm shows the output "0" :(
Here is the code:
import os

def create_python_script(filename):
    comments = "# Start of a new Python program"
    with open(filename, "w") as file:
        file.write(comments)
        filesize = os.path.getsize(filename)
    return filesize

print(create_python_script("program.py"))
Please point out the error I don't see.
You're getting size 0 due to the buffering behaviour of the write function.
When you call write, it writes the content to an internal buffer. The buffer is kept for performance reasons (to limit overly frequent I/O calls).
So in this case, you can't be sure that the content has actually been written to the file on disk by the time you call getsize.
with open(filename, "w") as file:
file.write(comments)
filesize = os.path.getsize(filename)
In order to ensure that the content has been written to the file before calling getsize, you can call the flush method.
flush clears the internal buffer and writes all pending content to the file on disk.
with open(filename, "w") as file:
file.write(comments)
file.flush()
filesize = os.path.getsize(filename)
Or, a better way would be to first close the file and then call the getsize method.
with open(filename, "w") as file:
file.write(comments)
filesize = os.path.getsize(filename)
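Putting it together, a corrected sketch of the original function (the only change is moving getsize after the with block):
import os

def create_python_script(filename):
    comments = "# Start of a new Python program"
    with open(filename, "w") as file:
        file.write(comments)
    # the with block has closed the file here, so the content is on disk
    filesize = os.path.getsize(filename)
    return filesize

print(create_python_script("program.py"))  # prints 31, the length of the comment string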

race-condition: reading/writing file (windows)

I have the following situation:
- different users (all on Windows) run a Python script that can either read or write a pickle file located in a shared folder
- the "system" is designed so that only one user at a time will be writing to the file (therefore no race condition of multiple processes trying to WRITE to the file at the same time)
- the basic code to write would be this:
with open(path + r'\final_db.p', 'wb') as f:
    pickle.dump((x, y), f)
- while the code to read would be:
with open(path + r'\final_db.p', 'rb') as f:
    x, y = pickle.load(f)
- x is a list of 5K or more elements, where each element is a class instance containing many attributes and functions; y is a date
QUESTION:
Am I correct in assuming that there is a race condition when a reading and a writing process overlap? And that the reading one can end up with a corrupt file?
PROPOSED SOLUTIONS:
1. A possible solution I thought of is using filelock:
code to write:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'wb') as f:
        pickle.dump((x, y), f)
code to read:
file_path = path + r'\final_db.p'
lock_path = file_path + '.lock'
lock = filelock.FileLock(lock_path, timeout=-1)
with lock:
    with open(file_path, 'rb') as f:
        x, y = pickle.load(f)
This solution should work (??), but if a process crashes, the file remains locked until the "file_path + '.lock'" file is deleted.
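A mitigation I can think of (a sketch, reusing the names above; filelock raises its Timeout exception when the lock cannot be acquired in time):
import filelock

lock = filelock.FileLock(lock_path, timeout=60)  # give up after 60 s instead of waiting forever
try:
    with lock:
        with open(file_path, 'rb') as f:
            x, y = pickle.load(f)
except filelock.Timeout:
    # the lock is probably stale (e.g. the writer crashed); handle or re-raise here
    pass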
2. Another solution could be to use portalocker:
code to write:
with open(path + r'\final_db.p', 'wb') as f:
    portalocker.lock(f, portalocker.LOCK_EX)
    pickle.dump((x, y), f)
code to read:
segnale = True
while segnale:
    try:
        with open(path + r'\final_db.p', 'rb') as f:
            x, y = pickle.load(f)
        segnale = False
    except:
        pass
The reading process, if another process started writing before it, will keep looping until the file is unlocked (the bare except catches the PermissionError).
If the writing process started after the reading process, the read should keep looping while the file is corrupt.
What I am not sure about is whether the reading process could end up reading a partially written file.
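One variant I considered instead of the busy loop (a sketch, assuming portalocker's shared-lock flag LOCK_SH; the lock call blocks until the writer releases its exclusive lock):
import pickle
import portalocker

with open(path + r'\final_db.p', 'rb') as f:
    portalocker.lock(f, portalocker.LOCK_SH)  # blocks while a writer holds LOCK_EX
    x, y = pickle.load(f)  # the writer has finished, so no partially written data
    portalocker.unlock(f)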
Any advice? Better solutions?

Python shutil copyfile - missing last few lines

I am routinely missing the last few KB of a file I am trying to copy using shutil's copyfile.
I did some research and do see someone asking about something similar here:
python shutil copy function missing last few lines
But I am using copyfile, which DOES seem to use a with statement...
with open(src, 'rb') as fsrc:
    with open(dst, 'wb') as fdst:
        copyfileobj(fsrc, fdst)
So I am perplexed that more users aren't having this issue, if indeed it is some sort of buffering issue - I would think it would be better known.
I am calling copyfile very simply; I don't think I could possibly be doing something wrong, essentially doing it the standard way, I think:
copyfile(target_file_name, dest_file_name)
Yet I am missing the last 4 KB or so of the file each time.
I have also not touched the copyfileobj function that copyfile calls in shutil, which is...
def copyfileobj(fsrc, fdst, length=16*1024):
    """copy data from file-like object fsrc to file-like object fdst"""
    while 1:
        buf = fsrc.read(length)
        if not buf:
            break
        fdst.write(buf)
So I am at a loss, but I suppose I am about to learn something about flushing, buffering, or the with statement, or... Help! Thanks.
To Anand:
Anand, I avoided mentioning that stuff because it's my sense that it's not the problem, but since you asked... the executive summary is that I am grabbing a file from an FTP server, checking if the file is different from the last copy I saved, and if so, downloading the file and saving a copy. It's circuitous spaghetti code, written when I was a truly pure utilitarian novice of a coder, I guess. It looks like:
for filename in ftp.nlst(filematch):
    target_file_name = os.path.basename(filename)
    with open(target_file_name, 'wb') as fhandle:
        ftp.retrbinary('RETR %s' % filename, fhandle.write)
        the_files.append(target_file_name)
        mtime = modification_date(target_file_name)
        mtime_str_for_file = str(mtime)[0:10] + str(mtime)[11:13] + str(mtime)[14:16] + str(mtime)[17:19] + str(mtime)[20:28]  # 2014-12-11 15:08:00.338415
        sorted_xml_files = [file for file in glob.glob(os.path.join('\\\\Storage\\shared\\', '*.xml'))]
        sorted_xml_files.sort(key=os.path.getmtime)
        last_file = sorted_xml_files[-1]
        file_is_the_same = filecmp.cmp(target_file_name, last_file)
        if not file_is_the_same:
            print 'File changed!'
            copyfile(target_file_name, '\\\\Storage\\shared\\' + 'datebreaks' + mtime_str_for_file + '.xml')
        else:
            print 'File ' + last_file + ' hasn\'t changed, doin nothin'
            continue
The issue here would most probably be that, when executing the line
ftp.retrbinary('RETR %s' % filename, fhandle.write)
you are using the fhandle.write() function to write the data from the FTP server to the file (with name target_file_name), but by the time you call shutil.copyfile, the buffer for fhandle has not been completely flushed, so you are missing some data when copying the file.
To make sure that this does not occur, you can either move the copyfile logic out of the with block for fhandle,
or call fhandle.flush() to flush the buffer before copying the file.
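A minimal sketch of the flush variant, using the same names as the question's code:
with open(target_file_name, 'wb') as fhandle:
    ftp.retrbinary('RETR %s' % filename, fhandle.write)
    fhandle.flush()  # push Python's buffer to the OS before the file is read back
    # ... the comparison and copyfile logic can now run inside the with block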
I believe it would be better to close the file (move the logic out of the with block). Example:
for filename in ftp.nlst(filematch):
    target_file_name = os.path.basename(filename)
    with open(target_file_name, 'wb') as fhandle:
        ftp.retrbinary('RETR %s' % filename, fhandle.write)
    the_files.append(target_file_name)
    mtime = modification_date(target_file_name)
    mtime_str_for_file = str(mtime)[0:10] + str(mtime)[11:13] + str(mtime)[14:16] + str(mtime)[17:19] + str(mtime)[20:28]  # 2014-12-11 15:08:00.338415
    sorted_xml_files = [file for file in glob.glob(os.path.join('\\\\Storage\\shared\\', '*.xml'))]
    sorted_xml_files.sort(key=os.path.getmtime)
    last_file = sorted_xml_files[-1]
    file_is_the_same = filecmp.cmp(target_file_name, last_file)
    if not file_is_the_same:
        print 'File changed!'
        copyfile(target_file_name, '\\\\Storage\\shared\\' + 'datebreaks' + mtime_str_for_file + '.xml')
    else:
        print 'File ' + last_file + ' hasn\'t changed, doin nothin'
        continue
You are trying to copy a file that was not closed, so its buffers were not flushed. Move the copyfile call out of the with block to allow fhandle to be closed.
Do:
with open(target_file_name, 'wb') as fhandle:
    ftp.retrbinary('RETR %s' % filename, fhandle.write)
# and here the rest of your code,
# so fhandle is closed, and the file is stored completely on the disk
It looks like there is a better way to do the nested withs:
with open(src, 'rb') as fsrc, open(dst, 'wb') as fdst:
    copyfileobj(fsrc, fdst)
I'd try something more like this. I'm far from an expert; hopefully someone more knowledgeable can lend some insight. My best thought is that the inner with closes before the outer one.

FTP upload files Python

I am trying to upload a file from a Windows server to a Unix server (basically trying to do FTP). I have used the code below:
#!/usr/bin/python
import ftplib
import os
filename = "MyFile.py"
ftp = ftplib.FTP("xx.xx.xx.xx")
ftp.login("UID", "PSW")
ftp.cwd("/Unix/Folder/where/I/want/to/put/file")
os.chdir(r"\\windows\folder\which\has\file")
ftp.storbinary('RETR %s' % filename, open(filename, 'w').write)
I am getting the following error:
Traceback (most recent call last):
  File "Windows\folder\which\has\file\MyFile.py", line 11, in <module>
    ftp.storbinary('RETR %s' % filename, open(filename, 'w').write)
  File "windows\folder\Python\lib\ftplib.py", line 466, in storbinary
    buf = fp.read(blocksize)
AttributeError: 'builtin_function_or_method' object has no attribute 'read'
Also, all the contents of MyFile.py got deleted.
Can anyone advise what is going wrong? I have read that ftp.storbinary is used for uploading files via FTP.
If you are trying to store a non-binary file (like a text file), try setting it to read mode instead of write mode:
ftp.storlines("STOR " + filename, open(filename, 'rb'))
For a binary file (anything that cannot be opened in a text editor), open your file in read-binary mode:
ftp.storbinary("STOR " + filename, open(filename, 'rb'))
Also, if you plan on using ftplib you should probably go through a tutorial; I'd recommend this article from effbot.
I combined both suggestions. The final answer being:
#!/usr/bin/python
import ftplib
import os
filename = "MyFile.py"
ftp = ftplib.FTP("xx.xx.xx.xx")
ftp.login("UID", "PSW")
ftp.cwd("/Unix/Folder/where/I/want/to/put/file")
os.chdir(r"\\windows\folder\which\has\file")
myfile = open(filename, 'r')
ftp.storlines('STOR ' + filename, myfile)
myfile.close()
Try making the file an object, so you can close it at the end of the operation:
myfile = open(filename, 'w')
ftp.storbinary('RETR %s' % filename, myfile.write)
and at the end of the transfer
myfile.close()
This might not solve the problem, but it may help.
ftplib supports the use of context managers, so you can make it even simpler as such:
with ftplib.FTP('ftp_address', 'user', 'pwd') as ftp, open(file_path, 'rb') as file:
    ftp.storbinary(f'STOR {file_path.name}', file)
    ...
This way you are robust against both file and FTP issues without having to insert try/except/finally blocks. And well, it's Pythonic.
PS: since it uses f-strings it is Python >= 3.6 only, but it can easily be modified to use the old .format() syntax.
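For example, the same sketch with .format() (file_path is assumed to be a pathlib.Path, as the .name attribute suggests):
with ftplib.FTP('ftp_address', 'user', 'pwd') as ftp, open(file_path, 'rb') as file:
    ftp.storbinary('STOR {}'.format(file_path.name), file)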

Slow Python file I/O; Ruby runs better than this; got the wrong language?

Please advise - I'm going to use this as a learning point. I'm a beginner.
I'm splitting a 25 MB file into several smaller files.
A kindly guru here gave me a Ruby script. It works beautifully fast. So, in order to learn, I mimicked it with a Python script. This one runs like a three-legged cat (slow). I wonder if anyone can tell me why?
My Python script:
## split a file into smaller files
###########################################
import re

def splitlines(file):
    fileNo = 0001
    outFile = open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a')  ## open file to append
    fh = open(file, "r")  ## open the file for reading
    mylines = fh.readlines()  ### read in lines
    for line in mylines:  ## for each line
        if re.search("Copyright ", line):  # if the line matches the regex
            outFile.close()  ## close the file
            fileNo += 1  # and add one to the filename, starting to read lines in again
        else:  # otherwise
            outFile = open("C:\\Users\\dunner7\\Desktop\\Textomics\\Media\\LexisNexus\\ele\\newdocs\\%s.txt" % fileNo, 'a')  ## open file to append
            outFile.write(line)  ## then append it to the open outFile
    fh.close()
The guru's Ruby 1.9 script
g = 0001
f = File.open(g.to_s + ".txt", "w")
open("corpus1.txt").each do |line|
  if line[/\d+ of \d+ DOCUMENTS/]
    f.close
    f = File.open(g.to_s + ".txt", "w")
    g += 1
  end
  f.print line
end
There are many reasons why your script is slow -- the main one being that you reopen the output file for almost every line you write. Since the old file object gets implicitly closed when you open the new one (due to Python garbage collection), the write buffer is flushed for every single line you write, which is quite expensive.
A cleaned-up and corrected version of your script would be:
def file_generator():
    file_no = 1
    while True:
        f = open(r"C:\Users\dunner7\Desktop\Textomics\Media"
                 r"\LexisNexus\ele\newdocs\%s.txt" % file_no, 'a')
        yield f
        f.close()
        file_no += 1

def splitlines(filename):
    files = file_generator()
    out_file = next(files)
    with open(filename) as in_file:
        for line in in_file:
            if "Copyright " in line:
                out_file = next(files)
            out_file.write(line)
    out_file.close()
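A usage sketch (the call site is an assumption; the corpus path mirrors the directories in your original script and the Ruby version's corpus1.txt):
splitlines(r"C:\Users\dunner7\Desktop\Textomics\Media\LexisNexus\ele\corpus1.txt")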
I guess the reason your script is so slow is that you open a new file descriptor for each line. If you look at your guru's Ruby script, it closes and opens the output file only when the separator matches.
In contrast, your Python script opens a new file descriptor for every line you read (and, by the way, does not close them). Opening a file requires talking to the kernel, so this is relatively slow.
Another change I would suggest is to change
fh = open(file, "r") ## open the file for reading
mylines = fh.readlines() ### read in lines
for line in mylines: ## for each line
to
fh = open(file, "r")
for line in fh:
With this change, you do not read the whole file into memory, but only block after block. Although it should not matter with a 25 MiB file, it will hurt you with big files and is good practice (and less code ;)).
The Python code might be slow due to the regex and not I/O. Try:
import re

def splitlines(file):
    fileNo = 0001
    outFile = open("newdocs/%s.txt" % fileNo, 'a')  ## open file to append
    reg = re.compile("Copyright ")
    for line in open(file, "r"):
        if reg.search(line):  # if the regex matches the line
            outFile.close()  ## close the file
            fileNo += 1  # add one to the filename first...
            outFile = open("newdocs/%s.txt" % fileNo, 'a')  ## ...then open the next file to append
        outFile.write(line)  ## then append the line to the open outFile
Several notes:
Always use / instead of \ in path names
If a regex is used repeatedly, compile it
Do you need re.search, or re.match? (see the sketch below)
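To illustrate the last note: re.search scans the whole line, while re.match only matches at its beginning, so match can be cheaper when the marker is always at the start (a small sketch; the sample line is hypothetical):
import re

line = "Copyright 2011 LexisNexus"  # hypothetical input line
re.search("Copyright ", line)  # finds the pattern anywhere in the line
re.match("Copyright ", line)   # matches only if the line starts with the pattern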
UPDATE:
@Ed. S: point taken
@Winston Ewert: code updated to be closer to the original Ruby code
rosser,
Don't use names of built-in objects as identifiers in your code (file, splitlines).
The following code respects the behaviour of your own code: an out_file is closed without the line containing 'Copyright ' being written to it; that line constitutes the signal to close.
The writelines() function is used to obtain faster execution than a repetition of out_file.write(line).
The if li: block is there to trigger the closing of out_file in case the last line of the read file doesn't contain 'Copyright '.
def splitfile(filename, wordstop, destrep, file_no=1, li=None):
    li = [] if li is None else li  # avoid the mutable-default-argument pitfall
    with open(filename) as in_file:
        for line in in_file:
            if wordstop in line:
                with open(destrep + str(file_no) + '.txt', 'w') as f:
                    f.writelines(li)
                file_no += 1
                li = []
            else:
                li.append(line)
    if li:
        with open(destrep + str(file_no) + '.txt', 'w') as f:
            f.writelines(li)
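A usage sketch (the arguments are assumptions mapped from the original script: the stop word, and a destination prefix producing newdocs/1.txt, newdocs/2.txt, ...):
splitfile('corpus1.txt', 'Copyright ', 'newdocs/')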
