Combine CSV files into XLS or XLSX on AWS S3 - python

I have code running as an AWS Lambda function that queries an internal database and generates files in different formats. The files are generated in parts and uploaded to S3 using multipart upload:
self.mpu = self.s3_client.create_multipart_upload(
    Bucket=self.bucket_name,
    ContentType=self.get_content_type(),
    Expires=self.expire_daytime,
    Key=self.filename,
)
and
response = self.s3_client.upload_part(
    Bucket=self.bucket_name,
    Key=self.filename,
    PartNumber=self.part_number,
    UploadId=self.upload_id,
    Body=data
)
self.current_part_info.update({
    'PartNumber': self.part_number,
    'ETag': response['ETag']
})
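Once all parts are uploaded, the collected part numbers and ETags are eventually passed to complete_multipart_upload, roughly like this (the parts list name is illustrative, not my exact code):
self.s3_client.complete_multipart_upload(
    Bucket=self.bucket_name,
    Key=self.filename,
    UploadId=self.upload_id,
    # parts: the list of {'PartNumber': ..., 'ETag': ...} dicts collected above
    MultipartUpload={'Parts': parts},
)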
One of the formats I need to support is XLS or XLSX. It's fairly easy to create multiple CSV files on S3. But is it possible to combine them directly on S3 into XLS/XLSX without downloading them?
My current code generates an XLSX file in memory, creates a local file, and then uploads it to S3:
import xlsxwriter

self.workbook = xlsxwriter.Workbook(self.filename)
# download CSV files...
for sheet_name, sheet_info in sheets.items():
    sheet = self.workbook.add_worksheet(name=sheet_name)
    # code that does formatting
    for ...:  # loop through rows
        for ...:  # loop through columns
            sheet.write(row, col, col_str)
self.workbook.close()
This works fine for small queries, but users will want to use it for large amounts of data.
When I run it with large queries, it runs out of memory. AWS Lambda has limited memory and limited disk space, and I'm hitting those limits.
Is it possible to combine CSV files into XLS or XLSX somehow without holding the entire file in local space (both memory and disk space are a problem)?
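For what it's worth, I know xlsxwriter has a constant_memory mode that flushes each row to a temporary file as it is written, so only the current row is held in memory. A minimal sketch (placeholder filename and a placeholder rows iterable), though it still relies on local disk, which is also limited on Lambda:
import xlsxwriter

workbook = xlsxwriter.Workbook('/tmp/report.xlsx', {'constant_memory': True})
sheet = workbook.add_worksheet('data')
for row, row_data in enumerate(rows):       # rows must be written in order in this mode
    for col, value in enumerate(row_data):
        sheet.write(row, col, value)
workbook.close()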

Related

How to read a parquet file in Azure Databricks?

I have a few parquet files stored in my storage account, which I am trying to read using the code below. However, it fails with a syntax error. Can someone suggest the correct way to read parquet files using Azure Databricks?
val data = spark.read.parquet("abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet")
display(data)
abfss://containername#storagename.dfs.core.windows.net/TestFolder/XYZ/part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet
Based on the abfss URL above, the data in the storage account can be in either delta or parquet format.
Note: if you create a delta table, the part files are created automatically with names like part-00000-1cf0cf7b-6c9f-41-a268-be-c000.snappy.parquet. With the code above it is not possible to read a file written in delta format as plain parquet.
I wrote the dataframe df1 to a storage account in parquet format with overwrite mode:
df1.coalesce(1).write.format('parquet').mode("overwrite").save("abfss://<container>@<storage_account>.dfs.core.windows.net/<folder>/<sub_folder>")
Scala
val df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
Python
df11 = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/part-00000-tid-2397072542034942773-def47888-c000.snappy.parquet")
display(df11)
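As a side note, if the goal is simply to load everything that was written under that folder, you can usually point spark.read at the directory instead of a single part file (a sketch, reusing the placeholder path above):
df_all = spark.read.format("parquet").load("abfss://<container>@<storage_account>.dfs.core.windows.net/demo/d121/")
display(df_all)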

How to compress a json lines file and upload it to an Azure container?

I am working in Databricks and have a PySpark dataframe that I am converting to pandas, and then to a JSON lines file that I want to upload to an Azure container (ADLS Gen2). The file is large, and I want to compress it prior to uploading.
I am first converting the pyspark dataframe to pandas.
pandas_df = df.select("*").toPandas()
Then converting it to a newline delimited json:
json_lines_data = pandas_df.to_json(orient='records', lines=True)
Then writing to blob storage with the following function:
def upload_blob(json_lines_data, connection_string, container_name, blob_name):
    blob_service_client = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    try:
        blob_client.get_blob_properties()
        blob_client.delete_blob()
    except:
        # nothing to delete if the blob does not exist yet
        pass
    blob_client.upload_blob(json_lines_data)
This is working fine, but each file is around 3 GB and takes a long time to download, so I would rather compress the files. Can anyone help with how to compress the JSON lines file and upload it to the Azure container? I have tried a lot of different things, and nothing is working.
If there is a better way to do this in Databricks, I can change it. I did not write the file with Databricks directly because I need to output a single file and control the filename.
There is a way to compress the JSON data before uploading it to blob storage. The code below converts the data to JSON, encodes it as bytes (UTF-8), and lastly compresses it with gzip.
I would suggest adding this code before the upload function.
import json
import gzip

def compress_data(data):
    # Convert to JSON
    json_data = json.dumps(data, indent=2)
    # Convert to bytes
    encoded = json_data.encode('utf-8')
    # Compress
    compressed = gzip.compress(encoded)
    return compressed
Reference: https://gist.github.com/LouisAmon/4bd79b8ab80d3851601f3f9016300ac4#file-json_to_gzip-py
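Since you already have the newline-delimited JSON as a string, one option (a sketch, not tested against your setup) is to gzip that string directly and upload the compressed bytes under a .json.gz blob name using your existing upload function:
import gzip

# compress the JSON-lines string produced by pandas_df.to_json(...)
compressed = gzip.compress(json_lines_data.encode('utf-8'))

# blob_name is a placeholder, e.g. 'output.json.gz'
upload_blob(compressed, connection_string, container_name, blob_name)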

Unable to read parquet file using StreamingBody from S3 without holding in memory

I'm trying to read a parquet file from S3, and dump the contents of it on to a Kafka topic.
This isn't too difficult when you hold the entire file in memory, but for large files this isn't feasible.
# using .read() pulls the entire file into memory - not ideal
df = pd.read_parquet(io.BytesIO(s3_response['Body'].read()), columns=columns)
Instead, I'm trying to take advantage of file-like-objects in order to stream the parquet file.
My issue is that it seems that it's impossible to do this with Parquet, because Parquet encodes data in the footer of the file as well as the header.
Here's an example of my code:
session = boto3.session.Session()
s3_client = session.client(
    service_name='s3',
    endpoint_url=s3_url,
)
obj = s3_client.get_object(Bucket=s3_bucket, Key=key)

for line in obj['Body'].iter_lines():
    pq_file = io.BytesIO(line)
    df = pd.read_parquet(pq_file, columns=columns)
    # At this point I'd want to iterate over the DF rows
    # and send them to kafka
    print(df)
This results in the following error:
OSError: Could not open parquet input source '<Buffer>': Invalid: Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.
Is it at all possible to do what I'm trying to do? Or due to the nature of parquet files is this impossible?
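One approach that is often suggested (a sketch, assuming pyarrow and s3fs are available; names reuse the variables from the snippet above) is to open the object as a seekable file-like object and let pyarrow read it one batch at a time, so only a batch of rows is in memory:
import pyarrow.parquet as pq
import s3fs

# s3fs provides the seekable file-like object parquet needs to reach its footer
fs = s3fs.S3FileSystem(client_kwargs={'endpoint_url': s3_url})
with fs.open(f'{s3_bucket}/{key}', 'rb') as f:
    parquet_file = pq.ParquetFile(f)
    for batch in parquet_file.iter_batches(batch_size=10000, columns=columns):
        df = batch.to_pandas()
        # iterate over the DF rows and send them to Kafka here
        print(df)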

Merging multiple .txt files in Google Colab

I have around 500 .txt files on my local system and would like to merge them into a dataframe in Google Colab. I have already uploaded them via the Upload option, where I uploaded the zipped folder containing the .txt files and later unzipped them in Google Colab. Each .txt file has one row of data, e.g. 0 12 34.3 423
I tried to upload directly from my local system with code, but it did not work.
Colab cannot access your local files through the typical built-ins as far as I know. You have to use Colab-specific modules. The guide is here.
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
    print('User uploaded file "{name}" with length {length} bytes'.format(
        name=fn, length=len(uploaded[fn])))
This will prompt you to select the files to upload.
EDIT: As you need the file names, you can just use the loop above and then concatenate as you mentioned correctly.
import pandas as pd

# create a list of file names (avoid reusing the name 'files', which is the Colab module)
file_names = []
for fn in uploaded.keys():
    file_names.append(fn)

# create a list of dataframes
frames = []
for file_name in file_names:
    frames.append(pd.read_csv(file_name))

# concat all of your frames at once
df = pd.concat(frames)
Alternatively, depending on the size of your files, you could also combine the two loops: load one file at a time and concatenate it directly onto the existing frame, so that less data has to be held at once, as sketched below.
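A sketch of that combined version (assuming the values are whitespace-separated with no header row, as in the 0 12 34.3 423 example):
import pandas as pd
from google.colab import files

uploaded = files.upload()

# read each uploaded file as it arrives and append it straight away
df = pd.DataFrame()
for fn in uploaded.keys():
    new = pd.read_csv(fn, sep=r'\s+', header=None)
    df = pd.concat([df, new], ignore_index=True)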

Writing a big Spark Dataframe into a csv file

I'm using Spark 2.3 and I need to save a Spark DataFrame to a csv file, and I'm looking for a better way to do it. Looking over related/similar questions, I found this one, but I need something more specific:
If the DataFrame is too big, how can I avoid using Pandas? I used the toCSV() function (code below) and it produced an Out Of Memory error (could not allocate memory).
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
Is using spark write and then hadoop getmerge better than using coalesce from a performance point of view?
def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
    """get spark_df from hadoop and save to a csv file

    Parameters
    ----------
    spark_df: incoming dataframe
    n: number of rows to get
    save_csv=None: filename for exported csv

    Returns
    -------
    """
    # use the more robust method
    # set temp names
    tmpfilename = save_csv or (wfu.random_filename() + '.csv')
    tmpfoldername = wfu.random_filename()
    print(n)
    # write sparkdf to hadoop, get n rows if specified
    if n:
        spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    else:
        spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
    # get merged file from hadoop
    HDFSUtil.getmerge(tmpfoldername, tmpfilename)
    HDFSUtil.rmdir(tmpfoldername)
    # read into pandas df, remove tmp csv file
    pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
    os.remove(tmpfilename)
    # re-write the csv file with header!
    if save_csv is not None:
        pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)
If the DataFrame is too big, how can I avoid using Pandas?
You can just save the file to HDFS or S3 or whichever distributed storage you have.
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
If you mean saving the file to local storage, it will still cause an OOM exception, since you would need to move all of the data into memory on the local machine to do it.
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
In this case you will have only one file (since you do coalesce(1)), so you don't need to care about headers. Instead, you should care about memory on the executors: you might get an OOM on the executor that all of the data gets moved to.
Is using spark write and then hadoop getmerge better than using coalesce from a performance point of view?
Definitely better (but don't use coalesce()). Spark will write the data to storage efficiently in parallel, HDFS will replicate it, and getmerge can then read the part files from the nodes and merge them efficiently.
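A rough sketch of that write-then-merge flow (paths are placeholders; the getmerge step runs on the edge node, outside Spark):
# write in parallel, one part file per partition, no coalesce and no header
spark_df.write.csv('/tmp/my_export', sep=',', quote='"')
# then merge the part files outside Spark:
#   hadoop fs -getmerge /tmp/my_export my_export.csv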
We used the Databricks spark-csv library. It works fine:
df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))
Library:
<!-- spark df to csv -->
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-csv_2.10</artifactId>
    <version>1.3.0</version>
</dependency>
