I have connected to an FTP server and the connection is successful.
import ftplib
ftp = ftplib.FTP('***', '****','****')
ftp.dir()  # dir() prints the listing itself and returns None, so don't assign it
I have a few CSV files on this FTP server, plus a few folders that contain some more CSVs.
I need to identify the folders at this location (home) and navigate into them; I think the cwd command should work.
I also need to read the CSVs stored on this FTP server. How can I do that? Is there a way to load the CSVs here directly into pandas?
Based on the answer here (Python write create file directly in FTP) and my own knowledge of ftplib, you can do the following:
from ftplib import FTP
import io
import pandas

session = FTP('***', '****', '****')

# get the filenames in the ftp home/root directory
remoteFilenames = session.nlst()
if ".." in remoteFilenames:
    remoteFilenames.remove("..")
if "." in remoteFilenames:
    remoteFilenames.remove(".")

# iterate over the names and check which ones are folders
for filename in remoteFilenames:
    dirTest = session.nlst(filename)
    # this directory test does not work on certain servers
    if dirTest and len(dirTest) > 1:
        # it's a directory => go into the directory
        session.cwd(filename)
        # get the filenames on the ftp one level deeper
        remoteFilenames2 = session.nlst()
        if ".." in remoteFilenames2:
            remoteFilenames2.remove("..")
        if "." in remoteFilenames2:
            remoteFilenames2.remove(".")
        # use a fresh loop variable so the outer loop's filename is not clobbered
        for filename2 in remoteFilenames2:
            # check again whether the entry is a directory, and this time skip it
            dirTest = session.nlst(filename2)
            if dirTest and len(dirTest) > 1:
                continue
            # download the file, but first create a virtual (in-memory) file object for it
            download_file = io.BytesIO()
            session.retrbinary("RETR {}".format(filename2), download_file.write)
            download_file.seek(0)  # after writing, go back to the start of the virtual file
            df = pandas.read_csv(download_file)  # read the virtual file into pandas
            ##########
            # do your thing with pandas here
            ##########
            download_file.close()  # close the virtual file
        session.cwd("..")  # go back up so the next outer iteration starts from the root

session.quit()  # close the ftp session
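Side note: the nlst()-based directory test above is only a heuristic. On servers that support the MLSD command (exposed since Python 3.3 as FTP.mlsd()), you can read the entry type directly instead; a minimal sketch, assuming your server implements MLSD:

for name, facts in session.mlsd():
    if facts.get("type") == "dir":  # entry is a subdirectory
        print("directory:", name)
    elif name.lower().endswith(".csv"):  # entry is a regular CSV file
        print("csv file:", name)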
Alternatively, if you know the structure of the FTP server, you can loop over a dictionary describing the folder/file structure and download the files via ftplib or urllib, as in this example (note the .items() call, which yields both the folder name and its file list):
for folder, files in {"folder1": ["file1", "file2"], "folder2": ["file1"]}.items():
    for file in files:
        path = "/{}/{}".format(folder, file)
        ##########
        # specific ftp file download stuff here
        ##########
        ##########
        # do your thing with pandas here
        ##########
Both solutions can be generalized by making them recursive, or by supporting more than one level of folders in some other way; a minimal recursive sketch follows.
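Just a sketch, untested: it reuses the same nlst() directory heuristic (which, again, does not work on every server) and assumes the names nlst() returns can be passed straight back to it:

def read_csvs_recursively(session, path="."):
    for entry in session.nlst(path):
        if entry in (".", ".."):
            continue
        children = session.nlst(entry)
        if children and len(children) > 1:  # crude "is this a directory?" test
            read_csvs_recursively(session, entry)
        elif entry.lower().endswith(".csv"):
            download_file = io.BytesIO()
            session.retrbinary("RETR {}".format(entry), download_file.write)
            download_file.seek(0)
            df = pandas.read_csv(download_file)
            # do your thing with pandas here

read_csvs_recursively(session)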
Better late than never... I was able to read directly into pandas. Not sure if this works for anyone.
import pandas as pd
from ftplib import FTP
ftp = FTP('ftp.[domain].com') # you need to put in your correct ftp domain
ftp.login() # i don't need login info for my ftp
ftp.cwd('[Directory]') # change directory to where the file is
df = pd.read_csv("[file.csv]", delimiter = "|", encoding='latin1') # i needed to specify delimiter and encoding
df.head()
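For what it's worth, pandas can also fetch the file itself if you hand read_csv an ftp:// URL (pandas supports http, ftp, s3, and file URL schemes), so for a single file you can skip ftplib entirely. The credentials, host, and path below are placeholders:

import pandas as pd

# placeholders: substitute your real user, password, host, and file path
df = pd.read_csv("ftp://user:password@ftp.example.com/folder/file.csv",
                 delimiter="|", encoding="latin1")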
Attempting to retrieve a file from FTP and save it to an S3 bucket within a Lambda function.
I can confirm the first part of the code works, as I can see the list of files printed to the CloudWatch logs.
from ftplib import FTP
import zipfile
import boto3
s3 = boto3.client('s3')
S3_OUTPUT_BUCKETNAME = 'my-s3bucket'
ftp = FTP('ftp.godaddy.com')
ftp.login(user='auctions', passwd='')
ftp.retrlines('LIST')
The next part resulted in the following error:
module initialization error: [Errno 30] Read-only file system: 'tdnam_all_listings.csv.zip'
However, I managed to overcome this by writing to the /tmp directory, as in the following code:
fileName = 'all_expiring_auctions.json.zip'
with open(f'/tmp/{fileName}', 'wb') as file:  # f-string: '/tmp/fileName' would be a literal path
    ftp.retrbinary('RETR ' + fileName, file.write)
Next, I am attempting to unzip the file from the temporary location:
with zipfile.ZipFile(f'/tmp/{fileName}', 'r') as zip_ref:
    zip_ref.extractall('')
Finally, I am attempting to save the file to a particular 'folder' in the S3 bucket, as follows:
data = open('/tmp/all_expiring_auctions.json')
s3.Bucket('brnddmn-s3').upload_fileobj('data','my-s3bucket/folder/')
The code produces no errors that I can see in the log, but despite my efforts the unzipped file never reaches the destination.
Any help greatly appreciated.
Firstly, you have to use the /tmp directory when working with files in Lambda. However, ZipFile's extractall('') will create the extracted file in your current working directory (assuming the zip content is a plain text file with no relative path). To create the extract in /tmp, use:
zip_ref.extractall('/tmp')
I'm not sure why no errors are logged; data = open(...) should throw an error if the file is not found. If required, you can explicitly check whether the file exists:
import os
print(os.path.exists('/tmp/all_expiring_auctions.json'))  # True/False
Finally, once you have ensured the file exists: the argument to Bucket() should be the bucket name (it is not clear whether yours is 'brnddmn-s3' or 'my-s3bucket'). Also, the first argument to upload_fileobj() should be a file object, i.e. data rather than the string 'data', and the second argument should be the object key (the filename in S3), not a folder name. Note too that Bucket() exists on a boto3 resource, not on the client created with boto3.client('s3').
Putting it together, the last lines should look like this:
s3 = boto3.resource('s3')  # Bucket() is available on the resource interface, not the client
S3_OUTPUT_BUCKETNAME = 'my-s3bucket'  # replace with your S3 bucket name
data = open('/tmp/all_expiring_auctions.json', 'rb')  # open in binary mode for upload
s3.Bucket(S3_OUTPUT_BUCKETNAME).upload_fileobj(data, 'folder/all_expiring_auctions.json')
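If you keep s3 = boto3.client('s3') from the question instead of switching to a resource, the equivalent upload uses the client's upload_fileobj(fileobj, bucket, key) signature:

with open('/tmp/all_expiring_auctions.json', 'rb') as data:
    s3.upload_fileobj(data, S3_OUTPUT_BUCKETNAME, 'folder/all_expiring_auctions.json')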
I would like to save a CSV file in a specific folder, but I can't find anywhere how to do it...
This is the code:
# Writing to a CSV file
fileToWrite = open(f"{userfinder}-{month}-{year}.csv", "a")
fileToWrite.write('Subject,Start Date,Start Time,End Date,End Time,All Day Event,Description\n')
fileToWrite.write(f'{string1}{tagesinfo2},{soup_datum},{soup_dienstbegin},{soup_datum},{soup_dienstende},,Kommentar: {soup_kommentar} Schiff: {b} Funktion: {soup_funktion} Schichtdauer: {soup_schichtdauer} Bezahlte Zeit: {soup_bezahltezeit} Mannschaft: {crew_list2}\n')
fileToWrite.close()
print(f'Datum: {soup_datum} Dienst: {string1}{tagesinfo2} --> Mannschaft: {crew_list2} --> OK')
You just have to change the working directory with os.chdir(path):
import os
path = '/Users/user/folder/example sub folder'
os.chdir(path)
#your code here
or, as mentioned in the comments, you can include the folder in the path you pass to open():
myfolder = "c:/foo/bar"
fileToWrite = open(f"{myfolder}/{userfinder}-{month}-{year}.csv", "a")
# in this case the full path is f"{myfolder}/{userfinder}-{month}-{year}.csv"
This option includes the path at open() time and only affects that one file, whereas os.chdir() changes the directory for everything (which is what I personally use for all of my projects, which are small).
If you don't want to change the folder for every file you create and read, use the second option; but when you want a Python script to affect every file in a distant location, I would use os.chdir().
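A third option, just a sketch assuming Python 3.6+, is pathlib, which keeps the destination folder explicit without changing the working directory (userfinder, month, and year are the variables from the question; the folder path is a placeholder):

from pathlib import Path

folder = Path("c:/foo/bar")  # destination folder (placeholder)
folder.mkdir(parents=True, exist_ok=True)  # create it if it does not exist yet
with open(folder / f"{userfinder}-{month}-{year}.csv", "a") as fileToWrite:
    fileToWrite.write('Subject,Start Date,Start Time,End Date,End Time,All Day Event,Description\n')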
I am trying to add FTP functionality into my Python 3 script using only ftplib or other libraries that are included with Python. The script needs to delete a directory from an FTP server in order to remove an active web page from our website.
The problem is that I cannot find a way to delete the .htaccess file using ftplib and I can't delete the directory because it is not empty.
Some people have said that this is a hidden file, and have explained how to list hidden files, but I need to delete the file, not list it. My .htaccess file also has full permissions and it can be successfully deleted using most other FTP clients.
Sample code:
files = list(ftp.nlst(myDirectory))
for f in files:
    ftp.delete(f)
ftp.rmd(myDirectory)
Update: I was able to get everything working correctly. Here is the complete code:
ftp.cwd(myDirectory)  # move to the dir to be deleted
# upload a placeholder .htaccess in case the dir has none, then delete it
files01 = "c:\\files\\.htaccess"
with open(files01, 'rb') as f:
    ftp.storlines('STOR %s' % '.htaccess', f)
ftp.delete(".htaccess")
print("Successfully deleted .htaccess file in " + myDirectory)
files = list(ftp.nlst(myDirectory))  # delete the files in the dir
for f in files:
    ftp.delete(f)
print("Successfully deleted visible files in " + myDirectory)
ftp.rmd(myDirectory)  # remove the now-empty directory
print("Successfully deleted the following directory: " + myDirectory)
New to coding, reading some books, and trying to practice. I wrote a program in Python 3.7 that searches through a directory, finds all the PDF files, and moves them to a new folder called 'Reading Materials'.
How could I improve on this code, e.g. make it a shorter, more concise, and/or more efficient Python script?
import os, re, shutil
os.chdir(r'C:\Users\Luke\Documents\coding\python')  # set cwd to where I want the program to run
#create regex to identify pdf files
PDFregex = re.compile(r'''^(.*?) # all text before the file extension
\.{1} #start of file extension
(pdf)$ #ending in pdf''', re.VERBOSE)
Newdir = os.mkdir('Reading Material') #make new directory for files
NewdirPath = os.path.abspath('Reading Material')
print('new directory made at : '+NewdirPath)
#search through directory for files that contain .pdf extension using regex object
for pdf in os.listdir('.'):
    mo = PDFregex.search(pdf)
    if mo == None:  # not a pdf, the regex found no match
        continue  # bypass loop
    else:
        originalLoc = os.path.join(os.path.abspath('.'), pdf)  # original file location
        newLoc = shutil.move(originalLoc, os.path.join(NewdirPath, pdf))  # move pdf to new folder
        print('Moving file "%s" to "%s"...' % (pdf, newLoc))  # say what's moving
os.listdir(NewdirPath)
A regex is overkill here; the os module has various methods that help you extract information about files.
You can use the splitext function from os.path to find the extension.
Something like this should work:
import os
import shutil

old_dir = 'C:\\Users\\Luke\\Documents\\coding\\python\\'
new_dir = 'Reading Material'

# You should always use underscore_notation to name variables instead of CamelCase
# (which is used for ClassNames); see https://www.python.org/dev/peps/pep-0008/
os.makedirs(new_dir, exist_ok=True)

for file_name in os.listdir(old_dir):
    if os.path.splitext(file_name)[1] == '.pdf':
        # join with old_dir: listdir() returns bare names, not full paths
        shutil.move(os.path.join(old_dir, file_name), os.path.join(new_dir, file_name))
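For comparison, a pathlib-based sketch (same folders as above) does the extension test and the move in a couple of lines, since glob() matches the extension for us:

from pathlib import Path
import shutil

old_dir = Path('C:/Users/Luke/Documents/coding/python')
new_dir = old_dir / 'Reading Material'
new_dir.mkdir(exist_ok=True)

for pdf in old_dir.glob('*.pdf'):  # only .pdf files are yielded
    shutil.move(str(pdf), str(new_dir / pdf.name))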
I have to compress a lot of XML files and split them by the date in the file name. For clarification's sake: a parser collects information from each XML file and then moves it to a backup folder. My code needs to gzip the files according to the date in the filename and group those files into a compressed .gz file.
Please find the code below:
import os
import re
import gzip
import shutil
import sys
import time

timestr = time.strftime("%Y%m%d%H%M")
logfile = 'D:\\Coleta\\log_compactador_xml_tar' + timestr + '.log'
log = open(logfile, 'a')  # print(..., file=...) needs a file object, not a path string
ptm_dir = "D:\\PTM\\monitored_programs\\"
count_files_mdc = 0
count_files_3gpp = 0
count_tar = 0

for subdir, dirs, files in os.walk(ptm_dir):  # "dirs", not "dir", to avoid shadowing the builtin
    for file in files:
        path = os.path.join(subdir, file)
        try:
            backup_files_dir = path.split(sep='\\')[4]
            parser_id = path.split(sep='\\')[3]
            if re.match('backup_files_*', backup_files_dir):
                if file.endswith('xml'):
                    # print(time.strftime("%Y-%m-%d %H:%M:%S"), path)
                    data_arq = file[1:14]
                    if parser_id in ('parser-924',):
                        gzip_filename_mdc = os.path.join(subdir, 'E4G_PM_MDC_IP51_' + timestr + '_' + data_arq)
                        with open(path, 'r') as f_in, gzip.open(gzip_filename_mdc + ".gz", 'at') as f_out_mdc:
                            shutil.copyfileobj(f_in, f_out_mdc)  # the with block closes both files
                        count_files_mdc += 1
                        print(time.strftime("%Y-%m-%d %H:%M:%S"), "Compressing file MDC: ", path)
                        os.remove(path)
        except PermissionError:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "Permission error on file:", path, file=log)
        except IndexError:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "IndexError: ", path, file=log)
As far as I can tell, it creates a stream of data, compresses it, and writes it to a new file with the specified filename. However, instead of grouping each XML file independently inside the ".gz" file, it creates inside the gzip file one big file (one big stream of data?) with the same name as the output gzip file but without any extension. Once compression finishes, it is not possible to decompress that big inner file. Does anyone know where the problem in my code is?
PS: I have edited the code for readability purposes.
Not sure whether the solution is still needed, but I will leave it here for anyone who faces the same issue.
The underlying problem is that gzip compresses a single byte stream and stores no file structure, so appending several files to one .gz simply concatenates them into one blob; to group multiple files you need a container format such as tar. There is a way to create a gzipped tar archive in Python using tarfile, and the code is quite simple:
import tarfile

with tarfile.open(filename, mode="w:gz") as archive:
    archive.add(name=name_of_file_to_add, recursive=True)
In this case name_of_file_to_add can also be a directory, in which case tarfile will add it recursively with all its contents.
If you need to add individual files without a directory, a simple for loop with calls to add() will do (the recursive flag is not required in that case); see the sketch below.
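A minimal sketch of that loop applied to the original problem, grouping files by the date taken from the filename (the mapping and the file names are hypothetical placeholders):

import os
import tarfile

# hypothetical mapping: date string -> list of XML paths carrying that date
files_by_date = {"20190101": ["a_20190101_1.xml", "a_20190101_2.xml"]}

for date, xml_paths in files_by_date.items():
    with tarfile.open("E4G_PM_MDC_IP51_{}.tar.gz".format(date), mode="w:gz") as archive:
        for xml_path in xml_paths:
            archive.add(xml_path, arcname=os.path.basename(xml_path))  # store without folder prefixes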