Downloading the most recent file from FTP with Python - python

I have been trying to write a Python function that downloads the most recently added file from an FTP server, judging by the timestamp within the filename.
You can see that the filename format includes a long timestamp.
What I have so far, with the help of forums, is the following code.
In it, I tried to sort using the date field (the real date the file was added to the FTP server). However,
I want to adjust this code so that it sorts the files by the timestamp within the filename.
EDIT (Tried to clean the code a bit):
import ftplib
import os
import socket
import time

def DownloadFileFromFTPServer2(server, username, password, directory_to_file, file_to_write):
    try:
        f = ftplib.FTP(server)
    except (socket.error, socket.gaierror) as e:
        print('cannot reach to %s' % server)
        return
    print("Connected to FTP server")
    try:
        f.login(username, password)
    except ftplib.error_perm:
        print("cannot login anonymously")
        f.quit()
        return
    print("Logged on to the FTP server")
    try:
        f.cwd(directory_to_file)
        print("Directory has been set")
    except Exception as inst:
        print(inst)
    data = []
    f.dir(data.append)
    datelist = []
    filelist = []
    for line in data:
        print(line)
        col = line.split()
        datestr = ' '.join(line.split()[5:8])
        date = time.strptime(datestr, '%b %d %H:%M')
        datelist.append(date)
        filelist.append(col[8])
    combo = zip(datelist, filelist)
    who = dict(combo)
    # Sort by dates and get the latest file by date....
    for key in sorted(iter(who.keys()), reverse=True):
        filename = who[key]
        print("File to download is %s" % filename)
        try:
            f.retrbinary('RETR %s' % filename, open(filename, 'wb').write)
        except ftplib.error_perm:
            print("Error: cannot read file %s" % filename)
            os.unlink(filename)
        else:
            print("***Downloaded*** %s " % filename)
            print("Retrieving FTP server data ......... DONE")
        # VERY IMPORTANT RETURN
        return
    f.quit()
    return 1
Any help is greatly appreciated. Thanks.
EDIT [SOLVED]:
The line
date = time.strptime(datestr, '%b %d %H:%M')
should be replaced with:
try:
    date = datetime.datetime.strptime(str(col[8]), 'S01375T-%Y-%m-%d-%H-%M-%S.csv')
except Exception as inst:
    continue
The try/continue is important, since the first two listing entries, '.' and '..', would otherwise raise a ValueError.

Once you have the list of filenames you can simply sort on the filename: since the naming convention is S01375T-YYYY-MM-DD-hh-mm-ss.csv, the names naturally sort into date/time order. Note that if the S01375T- part varies, you could sort on the name split at a fixed position or at the first -.
If this were not the case, you could use the datetime.datetime.strptime method to parse the filenames into datetime instances.
Of course, if you wished to really simplify things, you could use the PyFileSystem FTPFS and its various methods, which let you treat the FTP server as if it were a slow local file system.
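A minimal sketch of the name-based approach, assuming every filename uses the zero-padded timestamp layout from the question. The sample names and the helper timestamp_part are made up for illustration:

```python
# Hypothetical listing; on a real server the names could come from FTP.nlst().
names = [
    "S01375T-2016-03-01-12-00-00.csv",
    "S01375T-2016-01-01-13-00-00.csv",
    "S01375T-2016-04-01-13-01-00.csv",
]

def timestamp_part(name):
    # Compare on the text after the first '-', so a varying prefix does not matter.
    return name.split("-", 1)[1]

# Zero-padded year-to-second stamps sort chronologically as plain strings.
newest = max(names, key=timestamp_part)
print(newest)
```

Using max avoids sorting the whole list when only the newest file is needed.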

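Try the -t option with ftp.dir. On servers that honour it, the listing is ordered by modification time, newest first, so take the first entry in the list:
data = []
ftp.dir('-t', data.append)
filename = data[0].split()[-1]  # each entry is a full listing line; the file name is its last column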

You need to extract the timestamp from the filename properly.
You could split the filename at the first '-' and remove the file extension '.csv' (f.split('-', 1)[1][:-4]).
Then you just need to construct the datetime object for sorting.
from datetime import datetime
def sortByTimeStampInFile(fList):
    fileDict = {datetime.strptime(f.split('-', 1)[1][:-4], '%Y-%m-%d-%H-%M-%S'): f for f in fList if f.endswith('.csv')}
    return [fileDict[k] for k in sorted(fileDict.keys())]
files = ['S01375T-2016-03-01-12-00-00.csv', 'S01375T-2016-01-01-13-00-00.csv', 'S01375T-2016-04-01-13-01-00.csv']
print(sortByTimeStampInFile(files))
Returns:
['S01375T-2016-01-01-13-00-00.csv', 'S01375T-2016-03-01-12-00-00.csv', 'S01375T-2016-04-01-13-01-00.csv']
By the way, as long as your time format is 'year-month-day-hour-min-sec', a simple string sort would do it:
sorted([f.split('-', 1)[1][:-4] for f in fList if f.endswith('.csv')])
>>> ['2016-01-01-13-00-00', '2016-03-01-12-00-00', '2016-04-01-13-01-00']
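Putting the pieces together, here is a rough sketch of how the parsing could be combined with the download step. It assumes the ftp argument is an already connected and logged-in ftplib.FTP object; the helper names parse_stamp, newest_by_name and download_newest are invented for this example and are not part of ftplib:

```python
from datetime import datetime

STAMP_FORMAT = "%Y-%m-%d-%H-%M-%S"  # layout taken from the question's filenames

def parse_stamp(name):
    """Return the datetime embedded in name, or None if the name does not match."""
    if not name.endswith(".csv") or "-" not in name:
        return None
    try:
        return datetime.strptime(name.split("-", 1)[1][:-4], STAMP_FORMAT)
    except ValueError:
        return None

def newest_by_name(names):
    """Pick the name carrying the latest embedded timestamp, or None."""
    dated = [(parse_stamp(n), n) for n in names]
    dated = [(stamp, n) for stamp, n in dated if stamp is not None]
    return max(dated)[1] if dated else None

def download_newest(ftp, directory):
    """Download the newest file in directory; ftp is assumed to be a logged-in ftplib.FTP."""
    ftp.cwd(directory)
    name = newest_by_name(ftp.nlst())  # nlst() returns bare names, not listing lines
    if name is None:
        return None
    with open(name, "wb") as out:
        ftp.retrbinary("RETR %s" % name, out.write)
    return name
```

Entries such as '.' and '..' simply fail to parse and are skipped, so no try/continue is needed in the calling loop.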

Related

Using a variable as a Filename Python

I cannot get using a variable in the filename to work without throwing an error.
def logData(name, info):
    currentTime = datetime.datetime.now()
    date = currentTime.strftime("%x")
    filename = "{}_{}.txt".format(name, date)
    f = open(filename, "a")
    f.write(info)+ '\n')
    f.close()
I have tried formatting the string as shown above as well as concatenating the variables.
Is there anything I can do to solve this?
One issue was the unbalanced closing parenthesis in the f.write call. Also, the "%x" date format typically contains slashes (e.g. 07/07/21), so the resulting filename contains slashes; that is not a valid filename on Windows, because the slashes are treated as path separators.
Solution with the least change to your logic:
import datetime
def logData(name, info):
    currentTime = datetime.datetime.now()
    date = currentTime.strftime("%Y-%m-%d")
    filename = "{}_{}.txt".format(name, date)
    with open(filename, "a+") as f:
        f.write(f'{info}\n')
logData('my_file', "test")
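If the name part can also contain awkward characters, a small helper along these lines might be useful. This is only a sketch: safe_log_name and its character whitelist are assumptions for illustration, not something from the question:

```python
import datetime
import re

def safe_log_name(name, when=None):
    """Build '<name>_<YYYY-MM-DD>.txt', replacing characters that are risky in filenames."""
    when = when or datetime.datetime.now()
    # Keep letters, digits, dot, dash and underscore; replace everything else.
    cleaned = re.sub(r"[^A-Za-z0-9._-]", "_", name)
    return "{}_{}.txt".format(cleaned, when.strftime("%Y-%m-%d"))

print(safe_log_name("my/file", datetime.datetime(2021, 7, 7)))
```

The explicit "%Y-%m-%d" format also keeps the result independent of the machine's locale, unlike "%x".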

best way to check if files have been processed

E: My initial Title was very misleading.
I have a SQL server with a database, and around 10,000 Excel files in a directory. The files contain values I need to copy into the DB, and new Excel files are added daily. Additionally, each file contains a field "finished" with a boolean value that indicates whether the file is ready to be copied to the DB. However, the filename is not connected to its content. Only the content of the file contains primary keys and field names corresponding to the DB's keys and field names.
Checking whether a file's content is already in the DB by comparing the primary key over and over is not feasible, since opening the files is far too slow. I could, however, check once which files are already in the DB and write the result to a file (say copied.txt), so that it simply holds the filenames of all already copied files. The real service could then load this file's content into a dictionary (dict1) with the filename as the key and no value (I think hash tables are the fastest for comparison operations), store the filenames of all existing Excel files in the directory in a second dictionary (dict2), compare the two dictionaries, and create a list of all files that are in dict2 but not in dict1. I would then iterate through the list (it should usually contain only around 10-20 files), check whether each file is flagged as "ready to be copied", and copy its values to the database. Finally, I would add the file's name to dict1 and store it back to the copied.txt file.
My idea is to run this Python script as a service that loops as long as there are files to work with. When it can't find files to copy, it should wait for x seconds (maybe 45), then do it all over again.
This is my best concept so far. Is there a faster or more efficient way to do it?
It just came back to my mind that sets only contain unique elements and thus are the best data type for a comparison like this. It is a data type that I hardly know but now I can see how useful it can be.
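A minimal sketch of that set comparison, assuming the list of processed files holds one full path per line. The function names and the "*.xlsm" default are placeholders chosen for this example:

```python
from pathlib import Path

def load_done(done_file):
    """Read already-processed paths (one per line) into a set."""
    done_file = Path(done_file)
    if not done_file.exists():
        return set()
    with done_file.open() as handle:
        return {Path(line.strip()) for line in handle if line.strip()}

def find_candidates(root, pattern="*.xlsm"):
    """Collect matching files under root, including subdirectories."""
    return set(Path(root).rglob(pattern))

def pending_files(root, done_file):
    """Files on disk that are not yet recorded as processed."""
    return sorted(find_candidates(root) - load_done(done_file))
```

The set difference is a hash lookup per file, so no Excel file has to be opened just to decide whether it is new.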
The part of the code that is related to my original question is in parts 1-3.
The program:
1. loads file names from a file into a set
2. loads file names from the filesystem (a certain dir plus subdirs) into a set
3. creates a list of the difference of the two sets
4. iterates through all remaining files,
   checks whether they have been flagged as "finalized",
   then for each row:
   creates a new record in the database
   and adds values to the given record (one by one)
5. adds the processed file's name to the file of filenames.
It does so every 5 minutes. This is completely fine for my purpose.
I am very new to coding so sorry for my dilettantish approach. At least it works so far.
#modules
import pandas as pd
import pyodbc as db
import xlwings as xw
import glob
import os
from datetime import datetime, date
from pathlib import Path
import time
import sys
#constants
tick_time_seconds = 300
line = ("################################################################################### \n")
pathTodo = "c:\\myXlFiles\\**\\*"
pathDone = ("c:\\Done\\")
pathError = ("c:\\Error\\")
sqlServer = "MyMachine\\MySQLServer"
sqlDriver = "{SQL Server}"
sqlDatabase="master"
sqlUID="SA"
sqlPWD="PWD"
#functions
def get_list_of_files_by_extension(path:str, extension:str) -> list:
    """Receives string path and extension;
    gets list of files with corresponding extension in path;
    returns list of files with full path."""
    fileList = glob.glob(path+extension, recursive=True)
    if not fileList:
        print("no found files")
    else:
        print("found files")
    return fileList
def write_error_to_log(description:str, errorString:str, optDetails=""):
    """Receives strings description, errorString and opt(ional)Details;
    writes the error with date and time in a logfile named after the current date;
    returns nothing."""
    logFileName = str(date.today())+".txt"
    optDetails = optDetails+"\n"
    dateTimeNow = datetime.now()
    newError = "{0}\n{1}\n{2}{3}\n".format(line, str(dateTimeNow), optDetails, errorString)
    print(newError)
    with open(Path(pathError, logFileName), "a") as logFile:
        logFile.write(newError)
def sql_connector():
    """sql_connector: Receives nothing;
    creates a connection to the sql server (connection details should be constants);
    returns a connection."""
    return db.connect("DRIVER="+sqlDriver+";"
                      + "SERVER="+sqlServer+";"
                      + "DATABASE="+sqlDatabase+";"
                      + "UID="+sqlUID+";"
                      + "PWD="+sqlPWD+";")
def sql_update_builder(dbField:str, dbValue:str, dbKey:str) -> str:
    """ sql_update_builder: takes strings dbField, dbValue and dbKey;
    creates a sql syntax command with the purpose to update the value of the
    corresponding field with the corresponding key;
    returns a string with a sql command."""
    return ("UPDATE [tbl_Main] "
            + "SET ["+dbField+"]='"+dbValue+"' "
            + "WHERE ((([tbl_Main].MyKey)="+dbKey+"));")
def sql_insert_builder(dbKey: str) -> str:
    """ sql_insert_builder: takes string dbKey;
    creates a sql syntax command with the purpose to create a new record;
    returns a string with a sql command."""
    return ("INSERT INTO [tbl_Main] ([MyKey]) "
            + "VALUES ("+dbKey+")")
def append_filename_to_fileNameFile(xlFilename):
    """receives anything as xlFilename;
    converts it to string and writes the filename (full path) to a file;
    returns nothing."""
    with open(Path(pathDone, "filesDone.txt"), "a") as logFile:
        logFile.write(str(xlFilename)+"\n")
###################################################################################
###################################################################################
# main loop
while __name__ == "__main__":
    ###################################################################################
    """ 1. load filesDone.txt into set"""
    listDone = []
    print(line+"reading filesDone.txt in "+pathDone)
    try:
        with open(Path(pathDone, "filesDone.txt"), "r") as filesDoneFile:
            if filesDoneFile:
                print("file contains entries")
                for filePath in filesDoneFile:
                    filePath = filePath.replace("\n","")
                    listDone.append(Path(filePath))
    except Exception as err:
        errorDescription = "failed to read filesDone.txt from {0}".format(pathDone)
        write_error_to_log(description=errorDescription, errorString=str(err))
        continue
    else: setDone = set(listDone)
    ###################################################################################
    """ 2. load filenames of all .xlsm files into set"""
    print(line+"trying to get list of files in filesystem...")
    try:
        listFileSystem = get_list_of_files_by_extension(path=pathTodo, extension=".xlsm")
    except Exception as err:
        errorDescription = "failed to read file system "
        write_error_to_log(description=errorDescription, errorString=str(err))
        continue
    else:
        listFiles = []
        for filename in listFileSystem:
            listFiles.append(Path(filename))
        setFiles = set(listFiles)
    ###################################################################################
    """ 3. create list of difference of setFiles and setDone"""
    print(line+"trying to compare done files and files in filesystem...")
    setDifference = setFiles.difference(setDone)
    ###################################################################################
    """ 4. iterate through list of files """
    for filename in setDifference:
        """ 4.1 try: look if file is marked as "finalized=True";
        if the xlfile does not have sheet 7 (old ones)
        just add the xlfilename to the xlfilenameFile"""
        try:
            print("{0}trying to read finalized state ... of {1}".format(line, filename))
            filenameClean = str(filename).replace("\n","")
            xlFile = pd.ExcelFile(filenameClean)
        except Exception as err:
            errorDescription = "failed to read finalized-state from {0} to dataframe".format(filename)
            write_error_to_log(description=errorDescription, errorString=str(err))
            continue
        else:
            if "finalized" in xlFile.sheet_names:
                dataframe = xlFile.parse("finalized")
                print("finalized state ="+str(dataframe.iloc[0]["finalized"]))
                if dataframe.iloc[0]["finalized"] == False:
                    continue
            else:
                append_filename_to_fileNameFile(filename) #add the xlfilename to the xlfilenameFile
                continue
        ###################################################################################
        """ 4.2 try: read values to dataframe"""
        try:
            dataframe = pd.read_excel(Path(filename), sheet_name=4)
        except Exception as err:
            errorDescription = "Failed to read values from {0} to dataframe".format(filename)
            write_error_to_log(description=errorDescription, errorString=str(err))
            continue
        ###################################################################################
        """ 4.2 try: open connection to database"""
        print("{0}Trying to open connection to database {1} on {2}".format(line, sqlDatabase, sqlServer))
        try:
            sql_connection = sql_connector() #create connection to server
            stuff = sql_connection.cursor()
        except Exception as err:
            write_error_to_log(description="Failed to open connection:", errorString=str(err))
            continue
        ###################################################################################
        """ 4.3 try: write to database"""
        headers = list(dataframe) #copy header from dataframe to list; easier to iterate
        values = dataframe.values.tolist() #copy values from dataframe to list of lists [[row1][row2]...]; easier to iterate
        for row in range(len(values)): #iterate over lines
            dbKey = str(values[row][0]) #first col is key
            sqlCommandString = sql_insert_builder(dbKey=dbKey)
            """ 4.3.1 first trying to create (aka insert) new record in db ..."""
            try:
                print("{0}Trying insert new record with the id {1}".format(line, dbKey))
                stuff.execute(sqlCommandString)
                sql_connection.commit()
                print(sqlCommandString)
            except Exception as err:
                sql_log_string = " ".join(sqlCommandString.split()) #get rid of whitespace in sql command
                write_error_to_log(description="Failed to create new record in DB:", errorString=str(err), optDetails=sql_log_string)
            else: #if record was created add the values one by one:
                print("{0}Trying to add values to record with the ID {1}".format(line, dbKey))
                """ 4.3.2 ... then trying to add the values one by one"""
                for col in range(1, len(headers)): #skip col 0 (the key)
                    dbField = str(headers[col]) #field in db is header in the excel sheet
                    dbValue = str(values[row][col]) #get the corresponding value
                    dbValue = (dbValue.replace("\"","")).replace("\'","") #getting rid of ' and " to prevent trouble with the sql command
                    sqlCommandString = sql_update_builder(dbField, dbValue, dbKey) # calling function to create a sql update command string
                    try: #try to commit the sql command
                        stuff.execute(sqlCommandString)
                        sql_connection.commit()
                        print(sqlCommandString)
                    except Exception as err:
                        sql_log_string = " ".join(sqlCommandString.split()) #get rid of whitespace in sql command
                        write_error_to_log(description="Failed to add values in DB:", errorString=str(err), optDetails=sql_log_string)
        append_filename_to_fileNameFile(filename)
        print(line)
    # wait for a certain amount of time
    for i in range(tick_time_seconds, 0, -1):
        sys.stdout.write("\r" + str(i))
        sys.stdout.flush()
        time.sleep(1)
    sys.stdout.flush()
    print(line)
    #break # this is for debugging
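The script builds its SQL by string concatenation and strips quotes from the values to avoid trouble. A possible alternative is to pass the values as bound parameters. The sketch below uses the standard-library sqlite3 module as a stand-in so that it runs anywhere; pyodbc accepts the same ? parameter markers in cursor.execute. The column names in ALLOWED_FIELDS and the helper insert_row are hypothetical:

```python
import sqlite3

ALLOWED_FIELDS = {"Name", "Amount"}  # hypothetical column names, used as a whitelist

def insert_row(cursor, key, values):
    """Insert one record, passing every value as a bound parameter."""
    # Column names cannot be bound as parameters, so only whitelisted ones are used.
    fields = [f for f in values if f in ALLOWED_FIELDS]
    columns = ", ".join(["[MyKey]"] + ["[%s]" % f for f in fields])
    marks = ", ".join("?" for _ in range(len(fields) + 1))
    sql = "INSERT INTO [tbl_Main] (%s) VALUES (%s)" % (columns, marks)
    cursor.execute(sql, [key] + [values[f] for f in fields])

connection = sqlite3.connect(":memory:")  # stand-in for the pyodbc connection
cursor = connection.cursor()
cursor.execute("CREATE TABLE tbl_Main (MyKey INTEGER PRIMARY KEY, Name TEXT, Amount REAL)")
insert_row(cursor, 1, {"Name": "O'Brien", "Amount": 12.5})
connection.commit()
rows = cursor.execute("SELECT * FROM tbl_Main").fetchall()
print(rows)
```

With bound parameters the quote in O'Brien is stored unchanged, and a whole row goes in with one statement instead of one INSERT followed by an UPDATE per column.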

Class for file creation and directory validation

After reading some texts about creating files in Python, I decided to write this class, which creates a new file in a directory and, if the file already exists (and is older than x hours), creates a backup in another directory.
The main reason I opened this question is to find out whether this is a correct way to write a class using try/except, because I'm getting a little confused about when to prefer try/except over if/else.
Below is the working example:
import os
import datetime
class CreateXML():
    def __init__(self, path, filename):
        self.path = path
        self.bkp_path = "%s\\backup" % path
        self.filename = filename
        self.bkp_file = "%s.previous" % filename
        # cutoff 8 hours in the past (hours=-8 would place it in the future)
        self.create_check = datetime.datetime.now()-datetime.timedelta(hours=8)
    @staticmethod
    def create_dir(path):
        try:
            os.makedirs(path)
            return True
        except:
            return False
    @staticmethod
    def file_check(file):
        try:
            open(file)
            return True
        except:
            return False
    def create_file(self, target_dir, target_file):
        try:
            target = "%s\\%s" % (target_dir, target_file)
            open(target, 'w')
        except:
            return False
    def start_creation(self):
        try:
            # Check if file exists
            if self.file_check("%s\\%s" % (self.path, self.filename)):
                self.create_dir(self.bkp_path)
                creation = os.path.getmtime("%s\\%s" % (self.path, self.filename))
                fcdata = datetime.datetime.fromtimestamp(creation)
                # File exists and it's older than 8 hours
                if fcdata < self.create_check:
                    # no trailing space inside the path strings
                    bkp_file_path = "%s\\%s" % (self.bkp_path, self.bkp_file)
                    new_file_path = "%s\\%s" % (self.path, self.filename)
                    # If backup file exists, erase current backup file
                    # Move existing file to backup and create new file.
                    if self.file_check("%s\\%s" % (self.bkp_path, self.bkp_file)):
                        os.remove(bkp_file_path)
                        os.rename(new_file_path, bkp_file_path)
                        self.create_file(self.bkp_path, self.bkp_file)
                    # No backup file, create new one.
                    else:
                        self.create_file(self.bkp_path, self.bkp_file)
            else:
                # Fresh creation
                self.create_dir(self.path)
                self.create_file(self.path, self.filename)
        except OSError, e:
            print e
if __name__ == '__main__':
    path = 'c:\\tempdata'
    filename = 'somefile.txt'
    cx = CreateXML(path, filename)
    cx.start_creation()
So, basically, the real questions here are:
-In the example above, is the usage of try/except correct?
-Is it correct to use try/except to check whether a file or directory already exists, instead of using a simplified version like this one:
import os
# Simple method of doing it
path = 'c:\\tempdata'
filename = 'somefile.txt'
bkp_path = 'c:\\tempdata\\backup'
bkp_file = 'somefile.txt.bkp'
new_file_path = "%s\\%s" % (path, filename)
bkp_file_path = "%s\\%s" % (bkp_path, bkp_file)
if not os.path.exists(path):
    print "create path"
    os.makedirs(bkp_path)
if not os.path.isfile(new_file_path):
    print "create new file"
    open(new_file_path, 'w')
else:
    print "file exists, moving to backup folder"
    #check if backup file exists
    if not os.path.isfile(bkp_file_path):
        print "New backup file created"
        open(bkp_file_path, 'w')
    else:
        print "backup exists, removing backup, backup the current, and creating newfile"
        os.remove(bkp_file_path)
        os.rename(new_file_path, bkp_file_path)
        open(bkp_file_path, 'w')
-If the usage of try/except is correct, is it recommended to write a big class to create a file when it's possible to write a short version of it?
Please do not close this thread, since I'm really confused about what the "most correct Pythonic way to do it" is.
Thanks in advance.
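For comparison, one way the same behaviour might be written in Python 3 in a try/except style, catching only the specific exception that is expected. This is a sketch under assumptions, not a definitive answer: rotate_and_create and its parameters are invented here, and the backup naming follows the ".previous" suffix from the class above:

```python
import os
import time

def rotate_and_create(path, filename, max_age_hours=8):
    """Create path/filename; an existing copy older than max_age_hours is moved to path/backup first."""
    target = os.path.join(path, filename)
    backup_dir = os.path.join(path, "backup")
    os.makedirs(backup_dir, exist_ok=True)  # no error if the directories already exist
    try:
        age_seconds = time.time() - os.path.getmtime(target)
    except FileNotFoundError:
        age_seconds = None  # nothing to back up; fall through to a fresh creation
    if age_seconds is not None:
        if age_seconds < max_age_hours * 3600:
            return False  # the existing file is recent enough; leave it alone
        # os.replace overwrites an existing backup in a single step
        os.replace(target, os.path.join(backup_dir, filename + ".previous"))
    with open(target, "w"):
        pass  # create an empty file and close it immediately
    return True
```

Catching FileNotFoundError instead of using a bare except keeps unrelated problems, such as permission errors, visible to the caller.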

Naming multiple files in python and scrapy

I'm trying to save files to a directory after scraping them from the web using scrapy. I'm extracting a date from the file and using that as the file name. The problem I'm running into, however, is that some files have the same date, i.e. there are two files that would take the name "June 2, 2009". So, what I'm looking to do is somehow check whether there is already a file with the same name, and if so, name it something like "June 2, 2009.1" or some such.
The code I'm using is as follows:
def parse_item(self, response):
    self.log('Hi, this is an item page! %s' % response.url)
    response = response.replace(body=response.body.replace('<br />', '\n'))
    hxs = HtmlXPathSelector(response)
    date = hxs.select("//div[@id='content']").extract()[0]
    dateStrip = re.search(r"([A-Z]*|[A-z][a-z]+)\s\d*\d,\s[0-9]+", date)
    newDate = dateStrip.group()
    content = hxs.select("//div[@id='content']")
    content = content.select('string()').extract()[0]
    filename = ("/path/to/a/folder/ %s.txt") % (newDate)
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(content)
You can use os.listdir to get a list of existing files and allocate a filename that will not cause conflict.
import os
def get_file_store_name(path, fname):
    count = 0
    for f in os.listdir(path):
        if fname in f:
            count += 1
    return os.path.join(path, fname+str(count))
# This is example to use
print get_file_store_name(".", "README")+".txt"
The usual way to check for existence of a file in the C library is with a function called stat(). Python offers a thin wrapper around this function in the form of os.stat(). I suggest you use that.
http://docs.python.org/library/stat.html
import os
import stat
def file_exists(fname):
    try:
        stat_info = os.stat(fname)
        if stat.S_ISREG(stat_info.st_mode): # true for regular file
            return True
    except Exception:
        pass
    return False
Another solution is to append the current time to the date when naming the file, like this:
from datetime import datetime
filename = ("/path/to/a/folder/ %s_%s.txt") % (newDate,datetime.now().strftime("%H%M%S"))
The other answer pointed me in the correct direction by checking into the os tools in Python, but I think the way I found is perhaps more straightforward. See "How do I check whether a file exists using Python?" for more.
The following is the code I came up with:
existence = os.path.isfile(filename)
if existence == False:
    with codecs.open(filename, 'w', encoding='utf-8') as output:
        output.write(content)
else:
    newFilename = ("/path/.../.../- " + '%s' ".1.txt") % (newDate)
    with codecs.open(newFilename, 'w', encoding='utf-8') as output:
        output.write(content)
Edited to Add:
I didn't like this solution too much, and thought the other answer's solution was probably better but didn't quite work. The main part I didn't like about my solution was that it only worked with 2 files of the same name; if three or four files had the same name the initial problem would occur. The following is what I came up with:
filename = ("/Users/path/" + " " + "title " + '%s' + " " + "-1.txt") % (date)
filename = str(filename)
while True:
    os.path.isfile(filename)
    newName = filename.replace(".txt", "")  # str.replace takes (old, new); a third argument must be an integer count
    newName = str.split(newName)
    newName[-1] = str(int(newName[-1]) + 1)
    filename = " ".join(newName) + ".txt"
    if os.path.isfile(filename) == False:
        with codecs.open(filename, 'w', encoding='utf-8') as output:
            output.write(texts)
        break
It probably isn't the most elegant and might be kind of a hackish approach, but it has worked so far and seems to have addressed my problem.
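A more compact way to express the same idea, as a sketch in Python 3: keep trying numbered candidates until one is free. The function unique_path and its naming scheme ("June 2, 2009.1.txt" and so on) are assumptions modelled on the question, not existing library code:

```python
import itertools
import os

def unique_path(folder, stem, ext=".txt"):
    """Return folder/stem.ext, or folder/stem.N.ext for the first free N."""
    candidate = os.path.join(folder, stem + ext)
    for n in itertools.count(1):
        if not os.path.exists(candidate):
            return candidate
        # The current candidate is taken; try the next numbered name.
        candidate = os.path.join(folder, "%s.%d%s" % (stem, n, ext))
```

There is still a small window between the existence check and the write. If several processes write to the same folder, opening the returned path with mode 'x' raises FileExistsError instead of overwriting.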

Write file to directory based on variable in Python

The script will generate multiple files using the year and id variable. These files need to be placed into a folder matching year and id. How do I write them to the correct folders?
file_root_name = row["file_root_name"]
year = row["year"]
id = row["id"]
path = year+'-'+id
try:
    os.makedirs(path)
except:
    pass
output = open(row['file_root_name']+'.smil', 'w')
output.write(prettify(doctype, root))
If I understand your question correctly, you want to do this:
import os.path
file_name = row['file_root_name']+'.smil'
full_path = os.path.join(path, file_name)
output = open(full_path, 'w')
Please note that string formatting is often preferred over the + operator for building strings in Python. It makes no difference in your case, but repeated concatenation of large strings can be slow.
I'd prefer:
file_name = '%s.smil' % row['file_root_name']
and (using %s, since year and id are strings in your code):
path = '%s-%s' % (year, id)
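The folder creation and the path building can be combined in one small helper. This is a sketch in Python 3; output_path and its base_dir parameter are made-up names for illustration:

```python
import os

def output_path(year, id_, file_root_name, base_dir="."):
    """Return base_dir/<year>-<id>/<file_root_name>.smil, creating the folder if needed."""
    folder = os.path.join(base_dir, "%s-%s" % (year, id_))
    os.makedirs(folder, exist_ok=True)  # replaces the try/except around makedirs
    return os.path.join(folder, file_root_name + ".smil")
```

It could then be used as `with open(output_path(year, id, file_root_name), 'w') as output:`, which also closes the file once the block ends.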
