I am getting a "[Errno 2] No such file or directory" error when trying to download files from an FTP server to a Pandas dataframe. The files are in the root directory of the FTP.
I am guessing that the pd.read_csv() function is looking at my local file system, i.e. at the local path where the script resides. But I do not understand how to change this.
import ftplib
import pandas as pd

def fetch_files(site, username, password, directory: str = '/', filematch: str = '*.csv'):
    # pass the url without protocol
    with ftplib.FTP(site) as ftp:
        # pass credentials if anonymous access is not allowed
        ftp.login(username, password)
        ftp.cwd(directory)
        list_ = []
        for file_ in ftp.nlst(filematch):
            print(file_)  # This works
            df = pd.read_csv(file_, index_col=None, header=0)  # This fails
            list_.append(df)
Or would I have to use the ftp.retrlines() method? If so, what is the difference between the LIST and MLSD commands?
On a side note: the CSVs contain HTML entities like &, which break the SQL bulk insert. That is the reason I am reading them into a dataframe: to change the encoding and merge the individual files. Is there a faster way to do this directly via the Python csv module? I am guessing it would be.
Thank you in advance
Use FTP.retrbinary and BytesIO to download the file to memory and then pass the in-memory file-like object to read_csv:
flo = BytesIO()
ftp.retrbinary('RETR ' + file_, flo.write)
flo.seek(0)
pd.read_csv(flo, ...)
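Integrated into the fetch_files loop from the question, a minimal sketch could look like this (untested; it reuses the same arguments from your function and adds the pd.concat merge you mentioned as an assumption):

import ftplib
from io import BytesIO

import pandas as pd

def fetch_files(site, username, password, directory='/', filematch='*.csv'):
    frames = []
    with ftplib.FTP(site) as ftp:
        ftp.login(username, password)
        ftp.cwd(directory)
        for file_ in ftp.nlst(filematch):
            flo = BytesIO()
            # download the whole remote file into the in-memory buffer
            ftp.retrbinary('RETR ' + file_, flo.write)
            flo.seek(0)
            frames.append(pd.read_csv(flo, index_col=None, header=0))
    # merge the individual files into one dataframe
    return pd.concat(frames, ignore_index=True)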
Similar question: Reading files from FTP server to DataFrame in Python
The above loads the whole CSV file into memory and only then parses it. If you wanted to parse the file as it downloads, that would probably require implementing a smart custom file-like object, which is not easy.
For a question that does something similar, see my answer to:
Get files names inside a zip file on FTP server without downloading whole archive.
I want to read parquet files from an AWS S3 bucket in a for loop.
Here's my code (that doesn't work):
import boto3
import pandas as pd

session = boto3.Session(
    aws_access_key_id=key,
    aws_secret_access_key=secret,
    region_name=region_name)
s3 = session.resource('s3')
bucket = s3.Bucket(bucket_name)

for obj in bucket.objects.filter(Prefix=folder_path):
    response = obj.get()
    df = pd.read_parquet(response['Body'])
    # some data processing
It prints the following errors:
ValueError: I/O operation on closed file and ArrowInvalid: Called Open() on an uninitialized FileSource.
What should I fix here?
pandas.read_parquet() expects a reference to the file to read, not the file contents themselves, which is what you are providing.
From the documentation:
path : str, path object or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.
As you can see, you can provide an S3 URL as the path, so the least intrusive change to make it work would probably be this:
for obj in bucket.objects.filter(Prefix=folder_path):
    obj_url = f"s3://{obj.bucket_name}/{obj.key}"
    df = pd.read_parquet(obj_url)
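Note that when pandas is given an s3:// URL it reads it through s3fs/fsspec, so the s3fs package must be installed and credentials available. If they are not picked up from the environment, one option (a sketch, assuming pandas >= 1.2 and reusing the key, secret and region_name variables from your snippet) is to pass them via storage_options:

df = pd.read_parquet(
    obj_url,
    storage_options={
        "key": key,          # same credentials as the boto3.Session
        "secret": secret,
        "client_kwargs": {"region_name": region_name},
    },
)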
Alternatively "How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?" lists several other solutions.
Using an AWS Lambda function, I download an S3 zipped file and unzip it.
For now I do it using extractall. Upon unzipping, all files are saved in the tmp/ folder.
s3.download_file('test', '10000838.zip', '/tmp/10000838.zip')

with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    lstNEW = list(filter(lambda x: not x.startswith("__MACOSX/"), zip_ref.namelist()))
    zip_ref.extractall('/tmp/', members=lstNEW)
After unzipping, I want to gzip files and place them in another S3 bucket.
Now, how can I read all the files from the /tmp folder again and gzip each file (named like $item.csv.gz)?
I have seen this (https://docs.python.org/3/library/gzip.html) but I am not sure which function to use.
If it is the compress function, how exactly do I use it? I read in this answer, gzip a file in Python, that I can use gzip.open('', 'wb') to gzip a file, but I could not figure out how to apply it to my case. In the open function, do I specify the target location or the source location? And where do I save the gzipped files so that I can later upload them to S3?
Alternative Option:
Instead of loading everything into the tmp folder, I read that I can also open an output stream, wrap the output stream in a gzip wrapper, and then copy from one stream to the other
with zipfile.ZipFile('/tmp/10000838.zip', 'r') as zip_ref:
    testList = []
    for i in zip_ref.namelist():
        if not i.startswith("__MACOSX/"):
            testList.append(i)
    for i in testList:
        zip_ref.open(i, 'r')
but then I am not sure how to continue inside the for loop: how to open the output stream and convert the files there.
Depending on the sizes of the files, I would skip writing the .gz file(s) to disk. Perhaps something based on s3fs | boto and gzip.
import contextlib
import gzip

import s3fs

AWS_S3 = s3fs.S3FileSystem(anon=False)  # AWS env must be set up correctly

source_file_path = "/tmp/your_file.txt"
s3_file_path = "my-bucket/your_file.txt.gz"

with contextlib.ExitStack() as stack:
    source_file = stack.enter_context(open(source_file_path, mode="rb"))
    destination_file = stack.enter_context(AWS_S3.open(s3_file_path, mode="wb"))
    destination_file_gz = stack.enter_context(gzip.GzipFile(fileobj=destination_file))
    while True:
        chunk = source_file.read(1024)
        if not chunk:
            break
        destination_file_gz.write(chunk)
Note: I have not tested this so if it does not work, let me know.
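To cover every file that was extracted to /tmp, as the question asks, the same streaming approach can be wrapped in a loop. A rough, untested sketch, reusing the AWS_S3 filesystem from above and assuming a hypothetical my-target-bucket destination:

import os

extracted_dir = "/tmp"
target_bucket = "my-target-bucket"  # assumption: replace with your destination bucket

for name in os.listdir(extracted_dir):
    local_path = os.path.join(extracted_dir, name)
    # skip sub-directories and the original archive
    if not os.path.isfile(local_path) or name.endswith(".zip"):
        continue
    s3_path = f"{target_bucket}/{name}.gz"
    with contextlib.ExitStack() as stack:
        src = stack.enter_context(open(local_path, mode="rb"))
        dst = stack.enter_context(AWS_S3.open(s3_path, mode="wb"))
        dst_gz = stack.enter_context(gzip.GzipFile(fileobj=dst, mode="wb"))
        for chunk in iter(lambda: src.read(1024), b""):
            dst_gz.write(chunk)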
I have a direct connection to an SFTP server. The connection works without any problem and I can display files from the selected directory without any major problem. There are different files on the server and I have several functions to read them; below is a piece of code concerning .pdf files, which I read with pdfplumber:
# SSH.connect configuration
sftp = ssh.open_sftp()
path = "/server_path/.."
for filename in sftp.listdir(path):
    fullpath = path + "/" + filename
    if filename.endswith('.pdf'):
        # fullpath - full server path with filename - like /server_path/../file.pdf
        # filename - filename without path - like file.pdf
        with sftp.open(fullpath, 'rb') as fl:
            pdf = pdfplumber.open(fl)
In this for loop I want to read all the .pdf files in the chosen directory, and it works for me on localhost without any problem.
I tried to solve it this way, with sftp.open(path, 'rb') as fl:, but this solution doesn't work and the following error appears:
Traceback (most recent call last):
pdf = pdfplumber.open(fl)
return cls(open(path, "rb"), **kwargs)
TypeError: expected str, bytes or os.PathLike object, not SFTPFile
pdfplumber.open takes as an argument the exact path to the file with its name – in this case fullpath. How can I solve this problem so that it works directly from the server? How to manage the memory in this case – because I understand that these files are somehow pulled into memory. Please give me some hints.
Paramiko SFTPClient.open returns a file-like object.
To use a file-like object with pdfplumber, it seems that you can use the load function:
pdf = pdfplumber.load(fl)
You will also want to read this:
Reading file opened with Python Paramiko SFTPClient.open method is slow
As the Paramiko file-like object seems to perform suboptimally when combined with the pdfplumber.load function, as a workaround you can download the file to memory instead:
flo = BytesIO()
sftp.getfo(fullpath, flo)
flo.seek(0)
pdfplumber.load(flo)
See How to use Paramiko getfo to download file from SFTP server to memory to process it
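Put together with the loop from the question, a rough sketch of that workaround could look like this (untested; it assumes the sftp connection from the question is already open and keeps the getfo-into-BytesIO approach from above):

from io import BytesIO

import pdfplumber

path = "/server_path/.."
for filename in sftp.listdir(path):
    if not filename.endswith('.pdf'):
        continue
    fullpath = path + "/" + filename
    flo = BytesIO()
    # pull the whole remote PDF into memory first
    sftp.getfo(fullpath, flo)
    flo.seek(0)
    pdf = pdfplumber.load(flo)
    # ... process pdf.pages here ...
    pdf.close()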
Really new to Python, I was attempting to download a CSV through FTP.
I've made the connection go to the right folder, but I also want to print the tables.
import pandas as pd
from ftplib import FTP

ftp = FTP('f20-preview.xxx.com')
ftp.login(user='xxx_test', passwd='xxxxxxx')
ftp.cwd('/testfolder/')

def grabFile():
    filename = 'MOCK_DATA.csv'
    localfile = open(filename, 'wb')
    ftp.retrbinary('RETR ' + filename, localfile.write, 1024)

data = pd.read_csv(filename)
data.head()
This causes a NameError: filename is not defined. I might be confusing myself, so clarification would help.
In your code you are defining a function, never calling it, and afterwards you are expecting to find a variable that is defined inside that function.
One way to fix things would be to eliminate the line with def completely.
A possibly better solution would be something like this
import pandas as pd
from ftplib import FTP

# reusable method to retrieve a file
def grabFile(ftp_obj, filename):
    with open(filename, 'wb') as localfile:
        ftp_obj.retrbinary('RETR ' + filename, localfile.write, 1024)

# connect to the ftp server
ftp = FTP('f20-preview.xxx.com')
ftp.login(user='xxx_test', passwd='xxxxxxx')
ftp.cwd('/testfolder/')

# then get files and work them
# having a "target file"
filename = 'MOCK_DATA.csv'
# grab the file
grabFile(ftp, filename)
# work the file
data = pd.read_csv(filename)
data.head()

# now you could still use the same connected ftp object and grab another file, and so on
You did not call your grabFile function. But it appears the other answers helped alleviate that issue, so I will merely share some quality-of-life code for working with data sets.
I often store my data files in a separate folder from the python code, so this can help you keep things straight and organized if you'd prefer to have the input data in another folder.
import os
import pandas as pd
original_dir = os.getcwd()
os.chdir('/home/user/RestOfPath/')
data = pd.read_csv('Filename')
os.chdir(original_dir)
data.head()
Could you possibly use the absolute/full path instead of just the name for the CSV file? My guess is that it's looking in the wrong folder.
The working directory of your Python script and the location where the CSV is stored need to be the same, given the function you provided.
However, you do not call the function.
If you call the function and still get the same error, it is likely that MOCK_DATA.csv is not at the location /testfolder/MOCK_DATA.csv and you will run into issues.
One way to handle this would be to delete the def grabFile line, as sketched below.
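For example, with the def removed, the code from the question runs top to bottom as one script (a minimal sketch; the host, credentials and folder are the placeholders from the question):

import pandas as pd
from ftplib import FTP

ftp = FTP('f20-preview.xxx.com')
ftp.login(user='xxx_test', passwd='xxxxxxx')
ftp.cwd('/testfolder/')

filename = 'MOCK_DATA.csv'
# download the remote CSV into a local file of the same name
with open(filename, 'wb') as localfile:
    ftp.retrbinary('RETR ' + filename, localfile.write, 1024)

# the local copy now exists, so read_csv can find it
data = pd.read_csv(filename)
print(data.head())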
I am trying to upload an in-memory zip file to an S3-bucket (in order to avoid temporary files on my server).
Here's my code:
import io
import zipfile

import boto3

zip_bytes_io = io.BytesIO()
zip_archive = zipfile.ZipFile(zip_bytes_io, mode='a',
                              compression=zipfile.ZIP_DEFLATED)
zip_archive.writestr('test.txt', b'My string')

s3_client = boto3.client('s3')
zip_bytes_io.seek(0)  # So that bytes are read from beginning
s3_client.upload_fileobj(zip_bytes_io, test_bucket, 'test.zip')
When I run the above code, it does upload the file to an S3 bucket. However, when I download the file, I am unable to unzip it. I get the error: "Error 1 - operation not permitted."
I assume this is because I'm not uploading the file correctly in the first place. Any help appreciated :)
Searching for the "Error 1 - operation not permitted" error online, it looks like it might be a macOS issue with opening the zip file?
Try to open the file differently?
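One way to check whether the archive itself is valid, rather than this being a quirk of the Mac unzip tool, would be to download it and open it with Python's own zipfile module. A small sketch, assuming the same test_bucket value as in the question:

import io
import zipfile

import boto3

s3_client = boto3.client('s3')
buf = io.BytesIO()
s3_client.download_fileobj(test_bucket, 'test.zip', buf)
buf.seek(0)

# testzip() returns None when every entry in the archive is intact
with zipfile.ZipFile(buf) as zf:
    print(zf.namelist())
    print(zf.testzip())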