Read parquet files from S3 bucket in a for loop - python

I want to read parquet files from an AWS S3 bucket in a for loop.
Here's my code (that doesn't work):
session = boto3.Session(
    aws_access_key_id=key,
    aws_secret_access_key=secret,
    region_name=region_name)
s3 = session.resource('s3')
bucket = s3.Bucket(bucket_name)

for obj in bucket.objects.filter(Prefix=folder_path):
    response = obj.get()
    df = pd.read_parquet(response['Body'])
    # some data processing
It prints the following errors:
ValueError: I/O operation on closed file
ArrowInvalid: Called Open() on an uninitialized FileSource
What should I fix here?

pandas.read_parquet() expects a reference to the file to read, not the file contents themselves, which is what you are passing.
From the documentation:
path : str, path object or file-like object
String, path object (implementing os.PathLike[str]), or file-like object implementing a binary read() function. The string could be a URL. Valid URL schemes include http, ftp, s3, gs, and file. For file URLs, a host is expected. A local file could be: file://localhost/path/to/table.parquet. A file URL can also be a path to a directory that contains multiple partitioned parquet files. Both pyarrow and fastparquet support paths to directories as well as file URLs. A directory path could be: file://localhost/path/to/tables or s3://bucket/partition_dir.
As you can see you can provide an S3-url as path, so the least intrusive change to make it work would probably be this:
for obj in bucket.objects.filter(Prefix=folder_path):
    obj_url = f"s3://{obj.bucket_name}/{obj.key}"
    df = pd.read_parquet(obj_url)
Alternatively, "How to read a list of parquet files from S3 as a pandas dataframe using pyarrow?" lists several other solutions.
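If you would rather keep using the boto3 objects you already have (and avoid the s3fs package that pandas typically needs to read s3:// URLs directly), a minimal sketch of the same loop, assuming the session/bucket setup from the question, could wrap the object body in an in-memory buffer instead:
import io

import pandas as pd

for obj in bucket.objects.filter(Prefix=folder_path):
    # the StreamingBody returned by obj.get() is not seekable,
    # so read it fully into a seekable in-memory buffer first
    body = obj.get()['Body'].read()
    df = pd.read_parquet(io.BytesIO(body))
    # some data processing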

Related

Read CSV files to Pandas dataframe from FTP with Python ftplib

I am getting a "[Errno 2] No such file or directory" error when trying to download files from an FTP server to a Pandas dataframe. The files are in the root directory of the FTP.
I am guessing that the pd.read_csv() function is looking at my local file system, i.e. at the local path where the script resides. But I do not understand how to change this.
def fetch_files(site, username, password, directory: str = '/', filematch: str = '*.csv'):
    with ftplib.FTP(site) as ftp:
        # pass the url without protocol
        ftp = ftplib.FTP(site)
        # pass credentials if anonymous access is not allowed
        ftp.login(username, password)
        ftp.cwd(directory)
        list_ = []
        for file_ in ftp.nlst(filematch):
            print(file_)  # This works
            df = pd.read_csv(file_, index_col=None, header=0)  # This fails
            list_.append(df)
Or would I have to use the ftp.retrlines() method? If so, what is the difference between the LIST and MLSD parameters?
On a side note: the files in the CSVs have HTML code in them like & which screws up the SQL bulk insert. That's the reason I am reading them into a dataframe: to change the encoding and merge the individual files. Is there a faster way to do this directly via the Python csv module?
Thank you in advance
Use FTP.retrbinary and BytesIO to download the file to memory and then pass the in-memory file-like object to read_csv:
flo = BytesIO()
ftp.retrbinary('RETR ' + file_, flo.write)
flo.seek(0)
pd.read_csv(flo, ...)
Similar question: Reading files from FTP server to DataFrame in Python
The above loads the whole CSV file into memory and only then parses it. If you want to parse the file as it downloads, that would probably require implementing a smart custom file-like object, which is not easy.
For a question that does something similar, see my answer to:
Get files names inside a zip file on FTP server without downloading whole archive.
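Folded back into the fetch_files function from the question (keeping its parameter names), a sketch of the whole loop could look like this; the pd.concat at the end is an assumption about how you want to merge the individual files:
import ftplib
from io import BytesIO

import pandas as pd

def fetch_files(site, username, password, directory='/', filematch='*.csv'):
    frames = []
    with ftplib.FTP(site) as ftp:
        ftp.login(username, password)
        ftp.cwd(directory)
        for file_ in ftp.nlst(filematch):
            # download each file into an in-memory buffer, then parse it
            flo = BytesIO()
            ftp.retrbinary('RETR ' + file_, flo.write)
            flo.seek(0)
            frames.append(pd.read_csv(flo, index_col=None, header=0))
    # merge the individual files into one dataframe
    return pd.concat(frames, ignore_index=True)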

zipfile.write doesn't find files in gcloud

I'm trying to zip a few files from Google Cloud Storage.
Python's zipfile doesn't find the files in gcloud, only the ones in the project.
How can I make my code find the files in gcloud?
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
    for revenue in revenues:
        # queryset with a lot of files, so add each file to the zip
        t = tempfile.NamedTemporaryFile()
        t.write(revenue.revenue.name)
        if revenue.revenue.name:
            t.seek(0)
            with default_storage.open(revenue.revenue.name, "r") as file_data:
                zip_file.write(file_data.name, compress_type=zipfile.ZIP_DEFLATED)
                # the code doesn't get past this part
        t.close()
response = HttpResponse(content_type='application/x-zip-compressed')
response['Content-Disposition'] = 'attachment; filename=my_zip.zip'
response.write(zip_buffer.getvalue())
return response
In this part, I write the file that I opened from gcloud, but it stops inside this function:
def write(self, filename, arcname=None, compress_type=None):
    """Put the bytes from filename into the archive under the name
    arcname."""
    if not self.fp:
        raise RuntimeError(
            "Attempt to write to ZIP archive that was already closed")
    st = os.stat(filename)
    # when I try to find the file, os.stat searches in the project, not in gcloud
the "os.stat(filename)" search for a file in project, how can I do for find in the gcloud?
I will post my findings as an answer, since I would like to comment on a few things.
This is what I have understood:
You have a Python library zipfile that is used to work with ZIP files.
You are looking for files locally and add one by one into the ZIP file.
You would like to do this as well for files located in Google Cloud Storage bucket. But it is failing to find the files.
If I have misunderstood the use-case scenario, please elaborate further in a comment.
However, if this is exactly what you are trying to do, then this is not supported. In the StackOverflow Question - Compress files saved in Google cloud storage, it is stated that compressing files that are already in the Google Cloud Storage is not possible. The solution in that question is to subscribe to newly created files and then download them locally, compress them and overwrite them in GCS. As you can see, you can list the files, or iterate through the files stored in GCS, but you first need to download them to be able to process them.
Work around
Therefore, in your use-case scenario, I would recommend the following workaround using the Python client API (a minimal sketch follows the list):
You can use Listing objects Python API, to get all the objects from GCS.
Then you can use Downloading objects Python API, to download the objects locally.
As soon as the objects are located in a local directory, you can use the zipfile Python library to ZIP them together, as you are already doing.
Once the objects are zipped, if you no longer need the downloaded objects, you can delete them with os.remove("downloaded_file.txt").
In case you need to have the compressed ZIP file in the Google Cloud Storage bucket, you can use the Uploading objects Python API to upload the ZIP file to the GCS bucket.
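A minimal sketch of these four steps, assuming the google-cloud-storage client library and illustrative bucket/path names (not taken from the question), could look like this:
import os
import zipfile

from google.cloud import storage

BUCKET_NAME = 'my-bucket'        # illustrative, replace with your bucket
LOCAL_DIR = '/tmp/gcs_download'  # illustrative local working directory
ZIP_PATH = '/tmp/archive.zip'

client = storage.Client()
bucket = client.bucket(BUCKET_NAME)
os.makedirs(LOCAL_DIR, exist_ok=True)

# 1. list the objects and download them locally
local_files = []
for blob in client.list_blobs(BUCKET_NAME):
    local_path = os.path.join(LOCAL_DIR, os.path.basename(blob.name))
    blob.download_to_filename(local_path)
    local_files.append(local_path)

# 2. zip the downloaded files
with zipfile.ZipFile(ZIP_PATH, 'w', zipfile.ZIP_DEFLATED) as zf:
    for path in local_files:
        zf.write(path, arcname=os.path.basename(path))

# 3. delete the downloaded copies once they are zipped
for path in local_files:
    os.remove(path)

# 4. optionally upload the ZIP back to the bucket
bucket.blob('archive.zip').upload_from_filename(ZIP_PATH)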
As I have mentioned above, processing files (e.g. adding them to a ZIP file) directly in a Google Cloud Storage bucket is not supported. You first need to download them locally in order to do so. I hope my workaround is helpful to you.
UPDATE
As I have mentioned above, zipping files while they are in the GCS bucket is not supported. Therefore I have prepared a working example in Python showing how to use the workaround.
NOTE: As I am not an expert in operating-system commands with Python and I am not familiar with the zipfile library, there is probably a better and more efficient way of achieving this. However, the code that can be found in this GitHub link does the following:
Under the #Public variables: section, change BUCKET_NAME to your corresponding bucket name and execute the Python script in Google Cloud Shell.
Now my bucket structure is as follows:
gs://my-bucket/test.txt
gs://my-bucket/test1.txt
gs://my-bucket/test2.txt
gs://my-bucket/directory/test4.txt
When executing the command, what the app does is the following:
It will get the path from where the script is executed, e.g. /home/username/myapp.
It will create a temporary directory within this directory e.g. /home/username/myapp/temp
It will iterate through all the files located in the bucket that you have specified and will download them locally inside that temp directory.
NOTE: If a file in the bucket is under a sub-directory, it will simply download the file instead of recreating that sub-directory. You can modify the code later to make it work as you desire.
So the new downloaded files will look like this:
/home/username/myapp/temp/test.txt
/home/username/myapp/temp/test1.txt
/home/username/myapp/temp/test2.txt
/home/username/myapp/temp/test4.txt
After that, the code will zip all those files into a new zipedFile.zip located in the same directory as the main.py script that you have executed.
When this step is done as well, the script will delete the directory /home/username/myapp/temp/ with all of its contents.
As I have mentioned above, after executing the script locally, you should be able to see the main.py and a zipedFile.zip file with all the zipped files from the GCS bucket. Now you can take the idea of the implementation and modify it according to your project's needs.
The final code:
zip_buffer = io.BytesIO()
base_path = '/home/everton/compressedfiles/'
fiscal_compentecy_month = datetime.date(int(year), int(month), 1)
revenues = CompanyRevenue.objects.filter(company__pk=company_id, fiscal_compentecy_month=fiscal_compentecy_month)
if revenues.count() > 0:
    path = base_path + str(revenues.first().company.user.pk) + "/"
    zip_name = "{}-{}-{}-{}".format(revenues.first().company.external_id, revenues.first().company.external_name, month, year)
    for revenue in revenues:
        filename = revenue.revenue.name.split('revenues/')[1]
        if not os.path.exists(path):
            os.makedirs(path)
        with open(path + filename, 'wb+') as file:
            file.write(revenue.revenue.read())
            file.close()
    with zipfile.ZipFile(zip_buffer, 'w') as zip_file:
        for file in os.listdir(path):
            zip_file.write(path + file, compress_type=zipfile.ZIP_DEFLATED)
        zip_file.close()
    response = HttpResponse(content_type='application/x-zip-compressed')
    response['Content-Disposition'] = 'attachment; filename={}.zip'.format(zip_name)
    response.write(zip_buffer.getvalue())
    shutil.rmtree(path)
    return response

How to iterate through S3 files and write all txt files to a CSV file separated by their run date (year)

Basically, I want to iterate through the bucket and use the folder structure to classify each file by its run date (year).
So, I have an S3 bucket whose paths essentially look like:
file/archive/run=2017-10-07-06-13-21/folder_paths/version=1-0-0/part-00000-b.txt
file/archive/run=2018-11-07-06-13-21/folder_paths/version=1-0-0/part-00000-c.txt
The archive folder has the run dates in its path.
Ultimately, I want to be able to iterate over the files and write the part-000....txt files to a CSV file by date (year). So I want all the .txt files whose runs are in 2018 in one CSV file, all the 2017 .txt files in another, and all the 2019 .txt files in a third.
I am new to boto3 and S3, so I am pretty confused about how to go about doing this.
Here is my code so far:
# Import boto3 module
import boto3
import logging
from botocore.exceptions import ClientError

# This is to list existing buckets for the AWS account
PREFIX = 'shredded/'

# Create a session to your AWS account
s3client = boto3.client(
    's3',
    aws_access_key_id=ACCESS_KEY,
    aws_secret_access_key=SECRET_KEY,
    region_name=REGION_NAME,
)

bucket = 'mybucket'
startAfter = '2020-00-00-00-00-00'

s3objects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for object in s3objects['Contents']:
    print(object['Key'])
Any suggestions or ideas would help.
One way to approach this is something like the following:
files_2017 = [object for object in s3objects['Contents'] if 'run=2017' in object['Key']]
files_2018 = [object for object in s3objects['Contents'] if 'run=2018' in object['Key']]
files_2019 = [object for object in s3objects['Contents'] if 'run=2019' in object['Key']]
This checks, for each item in s3objects['Contents'], whether its key matches the string condition run={year}.
Each of the variables, i.e. files_2017, files_2018 and files_2019, will then contain all the relevant keys.
From there, you could split the key by / and take the last part, which would be part-00000-b.txt, for example.
To write to a .csv, check out Python's csv library (https://docs.python.org/3/library/csv.html); it's pretty solid.
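A sketch of that idea, reusing the s3client and bucket names from the question and using hypothetical output file names, could group the keys by year with a paginator and write one CSV per year:
import csv

# group the keys by run year; paginate in case there are more than 1000 objects
years = ('2017', '2018', '2019')
files_by_year = {year: [] for year in years}

paginator = s3client.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=bucket, Prefix='file/archive/'):
    for obj in page.get('Contents', []):
        key = obj['Key']
        for year in years:
            if 'run={}'.format(year) in key and key.endswith('.txt'):
                files_by_year[year].append(key)

# write one CSV per year, listing the file name and its full key
for year, keys in files_by_year.items():
    with open('files_{}.csv'.format(year), 'w', newline='') as fh:
        writer = csv.writer(fh)
        for key in keys:
            writer.writerow([key.split('/')[-1], key])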
Post back with how you go!

Sync AWS S3 bucket (along with folder structure) locally and run Python code only on newly added local folders

I'm reading logs (which are compressed .gz files) from a source S3 bucket. The .gz files are in folders inside the S3 bucket that look like the following (folders are created based on Unix epoch time, 1 folder per day):
1540684800000 (1 folder/day contains multiple .gz file)
1541462400000
1542240000000
1543104000000
Now I have to run the Python code below to put these .gz files into proper JSON array format. The data and structure inside a single .gz file are as below. The JSON array is missing the outer brackets [ ] and the comma (,) between two records (after "version":2}). A newline (\n) character separates two records.
{"context":"abc","event":"login","timestamp":"2018-09-27T17:35:55.835Z","version":2}
{"context":"def","timestamp":"2018-09-27T17:35:55.835Z","Id":"faf91826-ebc9-4242-996f-d52969bec2d5","version":2}
{"context":"xyz","event":"LandingPage","timestamp":"2019-01-21T05:14:22.998Z","Id":"88016b33-72d7-458e-8de8-f76241f4b681","version":2}
For now I've manually decompressed the .gz files and then renamed them to .json (through Unix commands), and will be running the Python code below to put these files into proper JSON format:
import json
import pandas as pd
import glob
import os
import re

for subdir, dirs, files in os.walk('/user/home/src/'):
    for file in files:
        path = subdir + os.sep + file
        if not path.endswith('.DS_Store'):
            print(file)
            jsonFile = open(path, "r")
            text = jsonFile.read()
            jsonFile.close()
            # working with buffered content
            # text = "["+text+"]"
            pattern = re.sub(r'}(\s*){', r'},\1{', text)
            ## Save changes to JSON file
            jsonFile = open(path, "w+")
            data = "[" + json.loads(json.dumps(pattern)) + "]"
            jsonFile.write(data)
            jsonFile.close()
I need to copy the source .gz files locally, run the above Python code to put them into proper JSON array format, and upload them to the target S3 bucket. I'm a beginner in Python and need help with:
a) How do I sync the source S3 bucket locally every day and run the Python code only on the newly added folder (inside the local copy), then upload the new folder to the target S3 bucket daily?
b) How can I drop/remove JSON records with a missing "event" attribute (in this case the second record) while I convert the file into a proper JSON array?
c) Can I append JSON records from multiple .gz files and produce one JSON array (with a .json extension), so that I have a single .json file for each day?
Expected Output in json file:
[{"context":"abc","event":"login","timestamp":"2018-09-27T17:35:55.835Z","version":2},
{"context":"xyz","event":"LandingPage","timestamp":"2019-01-21T05:14:22.998Z","Id":"88016b33-72d7-458e-8de8-f76241f4b681","version":2}]

Create directories in Amazon S3 using python, boto3

I know S3 buckets don't really have directories because the storage is flat, but it is possible to create directories programmatically with python/boto3; I just don't know how. I saw this in some documentation:
"Although S3 storage is flat: buckets contain keys, S3 lets you impose a directory tree structure on your bucket by using a delimiter in your keys.
For example, if you name a key ‘a/b/f’, and use ‘/’ as the delimiter, then S3 will consider that ‘a’ is a directory, ‘b’ is a sub-directory of ‘a’, and ‘f’ is a file in ‘b’."
I can only create files in an S3 bucket with:
self.client.put_object(Bucket=bucketname,Key=filename)
but I don't know how to create a directory.
Just a little modification to the key name is required:
self.client.put_object(Bucket=bucketname, Key=filename)
should be changed to
self.client.put_object(Bucket=bucketname, Key=directoryname + "/" + filename)
That's all.
If you read the API documentation, you should be able to do this.
import boto3

s3 = boto3.client("s3")
BucketName = "mybucket"
myfilename = "myfile.dat"
KeyFileName = "/a/b/c/d/{fname}".format(fname=myfilename)
with open(myfilename, "rb") as f:
    object_data = f.read()
    s3.put_object(Body=object_data, Bucket=BucketName, Key=KeyFileName)
Honestly, it is not a "real directory", just a preformatted key-string structure used for organisation.
Adding a forward slash / to the end of the key name to create a directory didn't work for me:
client.put_object(Bucket="foo-bucket", Key="test-folder/")
You have to supply the Body parameter in order to create the directory:
client.put_object(Bucket='foo-bucket',Body='', Key='test-folder/')
Source: ryantuck in boto3 issue
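To see the delimiter convention quoted in the question in action, a small sketch (bucket and key names here are illustrative) can create the marker, put a file "inside" it, and list the folder back as a common prefix:
import boto3

client = boto3.client('s3')
bucket = 'foo-bucket'  # illustrative name

# create a zero-byte "directory" marker and an object "inside" it
client.put_object(Bucket=bucket, Body=b'', Key='test-folder/')
client.put_object(Bucket=bucket, Body=b'hello', Key='test-folder/hello.txt')

# listing with a delimiter returns test-folder/ as a common prefix,
# which is what the S3 console renders as a folder
resp = client.list_objects_v2(Bucket=bucket, Delimiter='/')
for prefix in resp.get('CommonPrefixes', []):
    print(prefix['Prefix'])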
