Merge millions of S3 files generated hourly - python

I have millions of files being created each hour. Each file has one line of data. These files need to be merged into a single file.
I have tried doing this in the following way:
Use aws s3 cp to download the files for the hour.
Use a bash command to merge the files, OR
Use a python script to merge the files.
This hourly job runs in Airflow on Kubernetes (EKS). It takes more than one hour to complete and is creating a backlog. Another problem is that it often causes the EC2 node to stop responding due to high CPU and memory usage. What is the most efficient way of running this job?
The python script, for reference:
from os import listdir
import sys
# from tqdm import tqdm

files = listdir('./temp/')
dest = sys.argv[1]
data = []
tot_len = len(files)
percent = tot_len // 100
for i, file in enumerate(files):
    if i % percent == 0:
        print(f'{i/percent}% complete.')
    with open('./temp/' + file, 'r') as f:
        d = f.read()
        data.append(d)
result = '\n'.join(data)
with open(dest, 'w') as f:
    f.write(result)

A scalable and reliable method would be:
Configure the Amazon S3 bucket to trigger an AWS Lambda function whenever a new file arrives
Within the AWS Lambda function, read the contents of the file and send it to an Amazon Kinesis Firehose stream. Then, delete the input file.
Configure the Amazon Kinesis Firehose stream to buffer the input data and output a new file either based on time period (up to 15 minutes) or data size (up to 128MB)
See: Amazon Kinesis Data Firehose Data Delivery - Amazon Kinesis Data Firehose
This will not produce one file per hour -- the number of files will depend upon the size of the incoming data.
If you need to create hourly files, you could consider using Amazon Athena on the output files from Firehose. Athena allows you to run SQL queries on files stored in Amazon S3. So, if the input data contains a date column, Athena can select only the data for a specific hour. (You could code the Lambda function to add a date column for this purpose.)
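A minimal, untested sketch of such a Lambda handler, assuming the standard S3 put-event payload and a placeholder delivery stream name:
import boto3
import urllib.parse

s3 = boto3.client('s3')
firehose = boto3.client('firehose')

# Placeholder name -- use your own Firehose delivery stream here.
STREAM_NAME = 'my-firehose-stream'

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])

        # Each input file holds a single line of data.
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read()
        if not body.endswith(b'\n'):
            body += b'\n'

        # Firehose buffers these records and emits merged output files.
        firehose.put_record(
            DeliveryStreamName=STREAM_NAME,
            Record={'Data': body},
        )

        # Remove the input file once it has been handed to Firehose.
        s3.delete_object(Bucket=bucket, Key=key)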

I expect you should very seriously consider following the ideas in the AWS-specific answer you got. I'd add this response as a comment, except there's no way to show indented code coherently in a comment.
With respect to your Python script, you're building a giant string with a number of characters equal to the total number of characters across all your input files. So of course the memory use will grow at least that large.
Much less memory-intensive is to write out each file's content immediately upon reading it (note this code is untested - it may have a typo I'm blind to):
with open(dest, 'w') as fout:
    for i, file in enumerate(files):
        if i % percent == 0:
            print(f'{i/percent}% complete.')
        with open('./temp/' + file, 'r') as fin:
            fout.write(fin.read())
And one more thing to try if you pursue this: open the files in binary mode instead ('wb' and 'rb'). That may save useless layers of text-mode character decoding/encoding. I assume you just want to paste the raw bytes together.
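For illustration, here is a binary-mode variant of the loop above (an untested sketch using the same files and percent variables), with shutil.copyfileobj streaming each file so no content is ever held in memory in full:
import shutil

with open(dest, 'wb') as fout:
    for i, file in enumerate(files):
        if i % percent == 0:
            print(f'{i/percent}% complete.')
        with open('./temp/' + file, 'rb') as fin:
            # Stream the raw bytes straight into the output file.
            shutil.copyfileobj(fin, fout)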

Putting this out there in case someone else needs it.
I optimized the merging code to the best of my ability, but the bottleneck was still reading/downloading the S3 files, which is pretty slow even with the official AWS CLI.
I found a tool, s5cmd, which is much faster because it makes full use of multiprocessing and multithreading, and it solved my problem.
Link: https://github.com/peak/s5cmd
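For illustration, the hourly download step could look something like this sketch (the bucket and prefix are placeholders; s5cmd expands the wildcard itself and copies the objects in parallel):
import subprocess

# Placeholder bucket/prefix layout -- adjust to your own.
hour_prefix = 's3://my-bucket/input/2021-01-01-00/*'

# s5cmd copies many small objects in parallel, which is exactly where
# the official CLI was the bottleneck.
subprocess.run(['s5cmd', 'cp', hour_prefix, './temp/'], check=True)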

Related

Python read .json files from GCS into pandas DF in parallel

TL;DR: asyncio vs. multiprocessing vs. threading vs. some other solution to parallelize a for loop that reads files from GCS, then appends this data together into a pandas dataframe, then writes it to BigQuery...
I'd like to make parallel a python function that reads hundreds of thousands of small .json files from a GCS directory, then converts those .jsons into pandas dataframes, and then writes the pandas dataframes to a BigQuery table.
Here is a not-parallel version of the function:
import json

import gcsfs
import pandas as pd

from my.helpers import get_gcs_file_list

def load_gcs_to_bq(gcs_directory, bq_table):
    # my own function to get list of filenames from GCS directory
    files = get_gcs_file_list(directory=gcs_directory)

    # Create new table
    output_df = pd.DataFrame()
    fs = gcsfs.GCSFileSystem()  # Google Cloud Storage (GCS) File System (FS)
    counter = 0
    for file in files:
        # read files from GCS
        with fs.open(file, 'r') as f:
            gcs_data = json.loads(f.read())
        data = [gcs_data] if isinstance(gcs_data, dict) else gcs_data
        this_df = pd.DataFrame(data)
        output_df = output_df.append(this_df)

        # Write to BigQuery for every 5K rows of data
        counter += 1
        if counter % 5000 == 0:
            pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
            output_df = pd.DataFrame()  # and reset the dataframe

    # Write remaining rows to BigQuery
    pd.DataFrame.to_gbq(output_df, bq_table, project_id=my_id, if_exists='append')
This function is straightforward:
grab ['gcs_dir/file1.json', 'gcs_dir/file2.json', ...], the list of file names in GCS
loop over each file name, and:
read the file from GCS
convert the data into a pandas DF
append it to a main pandas DF
every 5K loops, write to BigQuery (since the appends get much slower as the DF gets larger)
I have to run this function on a few GCS directories each with ~500K files. Due to the bottleneck of reading/writing this many small files, this process will take ~24 hours for a single directory... It would be great if I could make this more parallel to speed things up, as it seems like a task that lends itself to parallelization.
Edit: The solutions below are helpful, but I am particularly interested in running in parallel from within the python script. Pandas is handling some data cleaning, and using bq load will throw errors. There is asyncio and this gcloud-aio-storage that both seem potentially useful for this task, maybe as better options than threading or multiprocessing...
Rather than add parallel processing to your python code, consider invoking your python program multiple times in parallel. This is a trick that lends itself more easily to a program that takes a list of files on the command line. So, for the sake of this post, let's consider changing one line in your program:
Your line:
# my own function to get list of filenames from GCS directory
files = get_gcs_file_list(directory=gcs_directory) #
New line:
files = sys.argv[1:] # ok, import sys, too
Now, you can invoke your program this way:
PROCESSES=100
get_gcs_file_list.py | xargs -P $PROCESSES your_program
xargs will now take the file names output by get_gcs_file_list.py and invoke your_program up to 100 times in parallel, fitting as many file names as it can on each line. I believe the number of file names is limited to the maximum command size allowed by the shell. If 100 processes is not enough to process all your files, xargs will invoke your_program again (and again) until all file names it reads from stdin are processed. xargs ensures that no more than 100 invocations of your_program are running simultaneously. You can vary the number of processes based on the resources available to your host.
Instead of doing this, you can directly use the bq command.
The bq command-line tool is a Python-based command-line tool for BigQuery.
When you use this command, loading takes place within Google's network, which is much faster than creating a dataframe and loading it into the table.
bq load \
--autodetect \
--source_format=NEWLINE_DELIMITED_JSON \
mydataset.mytable \
gs://mybucket/my_json_folder/*.json
For more information - https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#loading_json_data_into_a_new_table

how to use external links in pytables

I have several TB of image data, that are currently stored in many hdf-files with pytables, with one file for each frame. One file contains two groups, "LabelData" and "SensorData".
I have created a single (small) file that has all the file names and some meta data, and with the help of that file, I can call and open any needed hdf-data in a python generator.
This gives me a lot of flexibility; however, it seems quite slow, as every single file has to be opened and closed.
Now I want to create a single hdf-file with external links to the other files. Would that speed up the process?
As I understand it, creating external links requires creating a node for each link. However, I get the following performance warning:
PerformanceWarning: group / is exceeding the recommended maximum
number of children (16384); be ready to see PyTables asking for lots
of memory and possibly slow I/O. PerformanceWarning)
This is how I have created the file:
import tables as tb

def createLinkFile(linkfile, filenames, linknames):
    # Create a new file
    f1 = tb.open_file(linkfile, 'w')
    for filepath, linkname in zip(filenames, linknames):
        data = f1.create_group('/', linkname)
        # create an external link
        f1.create_external_link(data, 'LabelData', filepath + ':/LabelData')
        f1.create_external_link(data, 'SensorData', filepath + ':/SensorData')
    f1.close()
Is there a better way?

How to prevent multiple python scripts from overwriting the same file?

I use multiple python scripts that collect data and write it into one single json data file.
It is not possible to combine the scripts.
The writing process is fast, and errors often occur (e.g. some chars at the end are duplicated), which is fatal, especially since I am using json format.
Is there a way to prevent a python script from writing into a file while other scripts are currently writing into it? (It would be absolutely ok if the data that the python script tries to write into the file gets lost, but it is important that the file syntax does not somehow get 'injured'.)
Code snippet:
This opens the file and retrieves the data:
data = json.loads(open("data.json").read())
This appends a new dictionary:
data.append(new_dict)
And the old file is overwritten:
open("data.json","w").write( json.dumps(data) )
Info: data is a list which contains dicts.
Operating system: The whole process takes place on a Linux server.
On Windows, you could try to create the file, and bail out if an exception occurs (because file is locked by another script). But on Linux, your approach is bound to fail.
Instead, I would
write one file per new dictionary, suffixing the filename with the process ID and a counter
consuming process(es) don't read a single file, but read the files sorted by modification time and build the data from them
So in each script:
filename = "data_{}_{}.json".format(os.getpid(),counter)
counter+=1
open(filename ,"w").write( json.dumps(new_dict) )
and in the consumers (reading each dict of sorted files in a protected loop):
files = sorted(glob.glob("*.json"), key=os.path.getmtime)
data = []
for f in files:
    try:
        with open(f) as fh:
            data.append(json.load(fh))
    except Exception:
        # IO error, malformed json file: ignore
        pass
I will post my own solution, since it works for me:
Every python script checks (before opening and writing the data file) whether a file called data_check exists. If so, the python script does not try to read or write the file and discards the data that was supposed to be written into it. If not, the python script creates the file data_check and then starts to read and write the data file. After the writing process is done, the file data_check is removed.
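A minimal sketch of that check (untested; the data_check and data.json names follow the description above, and note that the exists-then-create step is not atomic, so a small race window remains):
import json
import os

LOCK_FILE = "data_check"
DATA_FILE = "data.json"

def try_write(new_dict):
    # If another script is currently writing, discard this record.
    if os.path.exists(LOCK_FILE):
        return False

    # Mark the data file as in use.
    open(LOCK_FILE, "w").close()
    try:
        data = json.loads(open(DATA_FILE).read())
        data.append(new_dict)
        open(DATA_FILE, "w").write(json.dumps(data))
    finally:
        # Release the "lock" even if writing fails.
        os.remove(LOCK_FILE)
    return True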

Using wholeTextFiles in pyspark but getting an out-of-memory error

I have some files (part-00000.gz, part-00001.gz, part-00002.gz, ...) and each part is rather large. I need to use the filename of each part because it contains timestamp information. As far as I know, it seems that in pyspark only wholeTextFiles can read input as (filename, content). However, I get an out-of-memory error when using wholeTextFiles. So my guess is that wholeTextFiles reads a whole part as content in the mapper without any partition operation. I also found this answer (How does the number of partitions affect `wholeTextFiles` and `textFiles`?). If so, how can I get the filename of a rather large part file? Thanks
You get the error because wholeTextFiles tries to read the entire file into a single RDD. You're better off reading the file line-by-line, which you can do simply by writing your own generator and using the flatMap function. Here's an example of doing that to read a gzip file:
import glob
import gzip

def read_fun_generator(filename):
    with gzip.open(filename, 'rb') as f:
        for line in f:
            yield line.strip()

gz_filelist = glob.glob("/path/to/files/*.gz")
rdd_from_bz2 = sc.parallelize(gz_filelist).flatMap(read_fun_generator)

How to split large wikipedia dump .xml.bz2 files in Python?

I am trying to build an offline wiktionary from the wikimedia dump files (.xml.bz2) using Python. I started with this article as the guide. It involves a number of languages, and I wanted to combine all the steps into a single python project. I have found almost all the libraries required for the process. The only hump now is to effectively split the large .xml.bz2 file into a number of smaller files for quicker parsing during search operations.
I know that the bz2 library exists in python, but it provides only compress and decompress operations. I need something that does what bz2recover does from the command line, which splits large files into a number of smaller chunks.
One more important point: the splitting shouldn't break up page contents, which start with <page> and end with </page> in the compressed xml document.
Is there an existing library that can handle this situation, or does the code have to be written from scratch? (Any outline/pseudo-code would be greatly helpful.)
Note: I would like to make the resulting package cross-platform compatible, hence I can't use OS-specific commands.
In the end I wrote a Python script myself:
import os
import bz2

def split_xml(filename):
    ''' The function gets the filename of a wiktionary.xml.bz2 file as input and creates
    smaller chunks of it in the directory chunks
    '''
    # Check and create chunk directory
    if not os.path.exists("chunks"):
        os.mkdir("chunks")
    # Counters
    pagecount = 0
    filecount = 1
    # open chunkfile in write mode
    chunkname = lambda filecount: os.path.join("chunks", "chunk-" + str(filecount) + ".xml.bz2")
    chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    # Read line by line
    bzfile = bz2.BZ2File(filename)
    for line in bzfile:
        chunkfile.write(line)
        # the </page> tag marks the end of a wiki page
        if '</page>' in line:
            pagecount += 1
            if pagecount > 1999:
                # print chunkname(filecount)  # For debugging
                chunkfile.close()
                pagecount = 0  # RESET pagecount
                filecount += 1  # increment filename
                chunkfile = bz2.BZ2File(chunkname(filecount), 'w')
    try:
        chunkfile.close()
    except:
        print 'File already closed'

if __name__ == '__main__':
    # When the script is run directly
    split_xml('wiki-files/tawiktionary-20110518-pages-articles.xml.bz2')
Well, if you have a command-line tool that offers the functionality you are after, you can always wrap it in a call using the subprocess module.
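For example, something along these lines (a sketch only; the recovery tool from the bzip2 package is typically installed as bzip2recover, and the path below reuses the dump file from the question):
import subprocess

# Sketch: wrap an external splitter instead of reimplementing it in Python.
# bzip2recover extracts the individual bz2 blocks into separate rec*.bz2 files.
subprocess.check_call(
    ['bzip2recover', 'wiki-files/tawiktionary-20110518-pages-articles.xml.bz2']
)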
The method you are referencing is quite a dirty hack :)
I wrote an offline Wikipedia tool and just SAX-parsed the dump completely. The throughput is usable if you pipe the uncompressed xml into stdin from a proper bzip2 decompressor, especially if it's only the wiktionary.
As a simple way of testing, I just compressed every page, wrote it into one big file, and saved the offset and length in a cdb (small key-value store). This may be a valid solution for you.
Keep in mind, the mediawiki markup is the most horrible piece of sh*t I've come across in a long time. But in the case of the wiktionary it might be feasible to handle.
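A minimal sketch of that offset/length idea (not the author's actual code; zlib compression and a plain JSON index stand in for the cdb, and the file names are placeholders):
import json
import zlib

def build_page_store(pages, store_path='pages.bin', index_path='index.json'):
    """Compress each (title, text) page into one big file and record (offset, length)."""
    index = {}
    with open(store_path, 'wb') as store:
        for title, text in pages:
            blob = zlib.compress(text.encode('utf-8'))
            offset = store.tell()
            store.write(blob)
            index[title] = (offset, len(blob))
    with open(index_path, 'w') as f:
        json.dump(index, f)

def read_page(title, store_path='pages.bin', index_path='index.json'):
    # Look up the offset/length, seek, and decompress just that page.
    with open(index_path) as f:
        offset, length = json.load(f)[title]
    with open(store_path, 'rb') as store:
        store.seek(offset)
        return zlib.decompress(store.read(length)).decode('utf-8')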
