How to effectively merge Spark output files on EMR? - python

Spark jobs (I think) create a file for each partition so that they can handle failures, etc., so at the end of the job you are left with a folder that can contain a lot of files. These are being automatically loaded to S3, so is there a way to merge them into a single compressed file that is ready for loading into Redshift?

Instead of the following, which will write one uncompressed file per partition in "my_rdd"...
my_rdd.saveAsTextFile(destination)
One could do...
my_rdd.repartition(1).saveAsTextFile(destination, compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec")
This sends the data in all partitions to one particular worker node in the cluster to be combined into one massive partition, which will then be written out into a single gzip compressed file.
However, I don't believe this is a desirable solution to the problem. Just one thread writes out and compresses the single result file; if that file is huge, that could take "forever", while every other core in the cluster sits idle. Redshift doesn't need everything to be in a single file: it can easily handle loading a set of files using COPY with a "manifest file" or a "prefix" (see Using the COPY Command to Load from S3 in the Redshift documentation).
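As a rough sketch of that approach (the bucket, prefix, table name and partition count below are hypothetical, and the COPY statement is only shown in a comment as a reminder of the Redshift side):
# Keep several compressed part files so that multiple cores still write and
# gzip in parallel; 16 output partitions is an arbitrary choice.
my_rdd.coalesce(16).saveAsTextFile(
    "s3://my-bucket/redshift_load/my_table/",
    compressionCodecClass="org.apache.hadoop.io.compress.GzipCodec",
)

# Redshift can then load every part file under the prefix in a single COPY, e.g.:
#   COPY my_table FROM 's3://my-bucket/redshift_load/my_table/'
#   IAM_ROLE '...' GZIP DELIMITER '\t';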

Related

Moving files of very different size from one place to another - optimization in Airflow

I'm implementing a DAG in Airflow moving files from an on-prem location to Azure Blob Storage. Reading files from the source and sending them to Azure is realized via a Data Access Layer outside of Airflow.
The thing is that files in the source can be very small (kilobytes) but potentially also very big (gigabytes). The goal is not to delay the movement of small files while the big ones are being processed.
I currently have a DAG which has two tasks:
list_files - list files in the source.
move_file[] - download the file to a temporary location, upload it to Azure and clean up (delete it from the temporary location and from the source).
Task 1) returns a list of locations in the source, whereas task 2) is run in parallel for each path returned by task 1), using the dynamic task mapping introduced in Airflow 2.3.0.
The DAG is set with max_active_runs=1 so that another DAGRun is not created while the big files are still being processed by the previous DAGRun. The problem, however, is that new files can arrive in the source between two scheduled DAGRuns, and they cannot be moved right away because the previous DAGRun is still processing the big files. Setting max_active_runs to 2 does not seem like an option, because the second DAGRun would attempt to process the files that are already being processed by the previous one (the big files that did not finish moving between the two scheduled DAGRuns).
What is the best approach in Airflow to address such an issue? Basically I want to make file transfer from one place to another as smooth as possible taking into account that I might have both very small and very big files.
EDIT: I am now thinking that maybe I could use some .lock files. The move_file task would put a .lock file with the name of the file being moved in a location that Airflow has access to. list_files would then read this location and only return those files which do not have locks. Of course the move_file task would clean up after successfully moving the file and release the lock. That would work, but is it good practice? Maybe instead of .lock files I should somehow use the metadata database of Airflow?
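For what it's worth, here is a minimal TaskFlow-style sketch of the .lock-file idea, assuming Airflow 2.3+ dynamic task mapping; SOURCE_DIR, LOCK_DIR and upload_to_azure() are hypothetical stand-ins for the on-prem source and the Data Access Layer.
import os
from pathlib import Path

import pendulum
from airflow.decorators import dag, task

SOURCE_DIR = Path("/mnt/onprem/source")  # hypothetical on-prem mount
LOCK_DIR = Path("/opt/airflow/locks")    # hypothetical shared lock directory


def upload_to_azure(path):
    """Hypothetical stand-in for the Data Access Layer upload call."""


@dag(
    schedule_interval="*/10 * * * *",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    max_active_runs=2,  # overlapping runs become acceptable once locking is in place
)
def move_files():
    @task
    def list_files():
        # Only return files that are not locked by a still-running move_file task.
        return [
            str(p)
            for p in SOURCE_DIR.iterdir()
            if p.is_file() and not (LOCK_DIR / (p.name + ".lock")).exists()
        ]

    @task
    def move_file(path):
        lock = LOCK_DIR / (Path(path).name + ".lock")
        lock.touch()  # claim the file before the (possibly long) transfer
        try:
            upload_to_azure(path)  # temp download + Azure upload happens in the DAL
            os.remove(path)        # clean up the source
        finally:
            lock.unlink(missing_ok=True)  # always release the lock

    move_file.expand(path=list_files())


move_files()
Note that there is still a small window between listing and locking in which two overlapping runs could pick up the same file; closing it would indeed be a reason to move the bookkeeping into something transactional, such as the Airflow metadata database.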

python process creating file with inflated size

I have a Python process which takes a file containing streamed data and converts it into a format ready to load into a database. I have just migrated this process from one Linux GCP VM to another running exactly the same code, but the final output file is nearly 4 times as big: 500 MB vs 2 GB.
When I download the files and manually inspect them, they look exactly the same to the eye.
Any ideas what could be causing this?
Edit: Thanks for the feedback. I traced it back to the input file, which is slightly different (as my stream-recording process has also been migrated).
I am now trying to work out why a marginally different input file creates such a different output file once it's been processed.

How to detect files in a directory if the files have finished copying/adding? [duplicate]

Files are being pushed to my server via FTP. I process them with PHP code in a Drupal module. O/S is Ubuntu and the FTP server is vsftp.
At regular intervals I will check for new files, process them with SimpleXML and move them to a "Done" folder. How do I avoid processing a partially uploaded file?
vsftp has lock_upload_files defaulted to yes. I thought of attempting to move the files first, expecting the move to fail on a currently uploading file. That doesn't seem to happen, at least on the command line. If I start uploading a large file and move, it just keeps growing in the new location. I guess the directory entry is not locked.
Should I try fopen with mode 'a' or 'r+' just to see if it succeeds before attempting to load into SimpleXML or is there a better way to do this? I guess I could just detect SimpleXML load failing but... that seems messy.
I don't have control of the sender. They won't do an upload and rename.
Thanks
Using the lock_upload_files configuration option of vsftpd leads to locking files with the fcntl() function. This places advisory locks on uploads that are still in progress. Other programs are not obliged to honour advisory locks, and mv, for example, does not; advisory locks are in general just advice for programs that care about them.
You need another command line tool, like lockrun, which respects advisory locks.
Note: lockrun must be compiled with the WAIT_AND_LOCK(fd) macro using the lockf() function rather than flock() in order to work with locks set by fcntl() under Linux. When lockrun is compiled to use lockf(), it will cooperate with the locks set by vsftpd.
With these pieces (lockrun, mv, lock_upload_files) you can build a shell script or similar that moves files one by one, checking beforehand whether each file is locked and holding an advisory lock on it for as long as the file is being moved. If a file is locked by vsftpd, lockrun can skip the call to mv, so in-progress uploads are left alone.
If locking doesn't work, I don't know of a solution as clean/simple as you'd like. You could make an educated guess by not processing files whose last modified time (which you can get with filemtime()) is within the past x minutes.
If you want a higher degree of confidence than that, you could check and store each file's size (using filesize()) in a simple database, and every x minutes check new size against its old size. If the size hasn't changed in x minutes, you can assume nothing more is being sent.
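As a rough illustration of that size-stability check (sketched here in Python; the PHP version would use filesize() and whatever storage you prefer), with a hypothetical incoming directory and state file, run every x minutes from cron:
import json
from pathlib import Path

INCOMING = Path("/var/ftp/incoming")        # hypothetical upload directory
STATE = Path("/var/tmp/upload_sizes.json")  # sizes recorded on the previous run

old_sizes = json.loads(STATE.read_text()) if STATE.exists() else {}
new_sizes = {}

for f in INCOMING.iterdir():
    if not f.is_file():
        continue
    new_sizes[f.name] = f.stat().st_size
    # Unchanged since the last run (x minutes ago) -> assume the upload finished.
    if old_sizes.get(f.name) == new_sizes[f.name]:
        print("ready to process:", f)

STATE.write_text(json.dumps(new_sizes))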
The lsof Linux command lists open files on your system. I suggest executing it with shell_exec() from PHP and parsing the output to see which files are still being held open by your FTP server.
Picking up on the previous answer, you could copy the file over and then compare the sizes of the copy and the original at a fixed interval.
If the sizes match, the upload is done: delete the copy and work with the file.
If the sizes do not match, copy the file again and repeat.
Here's another idea: create a super (but hopefully not root) FTP user that can access some or all of the upload directories. Instead of your PHP code reading uploaded files right off the disk, make it connect to the local FTP server and download files. This way vsftpd handles the locking for you (assuming you leave lock_upload_files enabled). You'll only be able to download a file once vsftp releases the exclusive/write lock (once writing is complete).
You mentioned trying flock in your comment (and how it fails). It does indeed seem painful to try to match whatever locking vsftpd is doing, but dio_fcntl might be worth a shot.
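A rough sketch of the download-through-FTP idea above using Python's ftplib (the original context is PHP, and the host, credentials and directories here are hypothetical); per that answer, the RETR of a file should only complete once vsftpd has released its write lock on it:
from ftplib import FTP

with FTP("localhost") as ftp:
    ftp.login("reader", "secret")  # the dedicated, non-root FTP user
    ftp.cwd("/upload")
    for name in ftp.nlst():
        local = "/var/spool/done/" + name
        with open(local, "wb") as out:
            # RETR goes through vsftpd, so in-progress uploads stay protected
            # by the lock_upload_files locking.
            ftp.retrbinary("RETR " + name, out.write)
        print("downloaded:", local)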
I guess you've solved your problem years ago but still.
If you use some pattern to find the files you need, you could ask the party uploading the files to use a different name and rename each file only once its upload has completed.
You should also check out the HiddenStores directive in ProFTPD; more info here:
http://www.proftpd.org/docs/directives/linked/config_ref_HiddenStores.html

Azure ML File Dataset mount() is slow & downloads data twice

I have created a File Dataset using the Azure ML Python API. The data in question is a bunch of parquet files (~10K files, each about 330 KB) residing in Azure Data Lake Gen 2, spread across multiple partitions. I then tried to mount the dataset on an AML compute instance. During the mounting process, I observed that each parquet file was downloaded twice under the /tmp directory of the compute instance, with the following message printed in the console logs:
Downloaded path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<blob_path>/20211203.parquet is different from target path: /tmp/tmp_3qwqu9u/c2c69fd1-9ded-4d69-b75a-c19e1694b7aa/<container_name>/<blob_path>/20211203.parquet
This log message gets printed for each parquet file which is part of the dataset.
Also, the process of mounting the dataset is very slow: 44 minutes for ~10K parquet files, each about 330 KB in size.
The %%time magic in Jupyter Lab shows that most of the time was spent on I/O:
CPU times: user 4min 22s, sys: 51.5 s, total: 5min 13s
Wall time: 44min 15s
Note: Both the Data Lake Gen 2 and Azure ML compute instance are under the same virtual network.
Here are my questions:
How to avoid downloading the parquet file twice?
How to make the mounting process faster?
I have gone through this thread, but the discussion there didn't reach a conclusion.
The Python code I have used is as follows:
data = Dataset.File.from_files(path=list_of_blobs, validate=True)
dataset = data.register(workspace=ws, name=dataset_name, create_new_version=create_new_version)
mount_context = None
try:
    mount_context = dataset.mount(path_to_mount)
    # Mount the file stream
    mount_context.start()
except Exception as ex:
    raise ex
df = pd.read_parquet(path_to_mount)
The robust option is to download directly from the AzureBlobDatastore. You need to know the datastore name and the relative path, which you can get by printing the dataset description. Namely:
from azureml.core import Workspace, Dataset
import pandas as pd
import tempfile

# dstore_name and dstore_path come from the printed dataset description
ws = Workspace.from_config()
dstore = ws.datastores.get(dstore_name)
target = (dstore, dstore_path)

with tempfile.TemporaryDirectory() as tmpdir:
    ds = Dataset.File.from_files(target)
    ds.download(tmpdir)
    df = pd.read_parquet(tmpdir)
The convenient option is to stream tabular datasets. Note that you don't control how the file is read (Microsoft converters may occasionally not work as you expect). Here is the template:
ds = Dataset.Tabular.from_parquet_files(target)
df = ds.to_pandas_dataframe()
I have executed a bunch of tests to compare the performance of FileDataset.mount() and FileDataset.download(). In my environment, download() is much faster than mount().
download() works well when the disk size of the compute is large enough to fit all the files. However, in a multi-node environment, the same data (in my case parquet files) gets downloaded to each of the nodes (multiple copies). As per the documentation:
If your script processes all the files in your dataset and the disk on your compute resource is large enough for the dataset, the download access mode is the better choice. The download access mode will avoid the overhead of streaming the data at runtime. If your script accesses a subset of the dataset or it's too large for your compute, use the mount access mode.
Downloading data in a multi-node environment could trigger performance issues (link). In such a case, mount() might be preferred.
I have tried with TabularDataset as well. As Maciej S has mentioned, in the case of TabularDataset the user doesn't need to decide how data is read from the datastore (i.e. there is no choice between mount and download as the access mode). But with the current implementation (azureml-core 1.38.0), TabularDataset requires more memory (RAM) on the compute than FileDataset.download() for an identical set of parquet files. It looks like the current implementation first reads each individual parquet file into a pandas DataFrame (held in memory/RAM) and then appends them into the single DataFrame handed to the API user. The higher memory requirement comes from this "eager" behaviour of the API.

Best way to process line at a time data from hdfs file from within CPython (without using stdin)?

I would like to use CPython in a hadoop streaming job that needs access to supplementary information from a line-oriented file kept in a hadoop file system. By "supplementary" I mean that this file is in addition to the information delivered via stdin. The supplementary file is large enough that I can't just slurp it into memory and parse out the end-of-line characters. Is there a particularly elegant way (or library) to process this file one line at a time?
Thanks,
SetJmp
Check out the Hadoop Streaming documentation on using the Hadoop Distributed Cache with streaming jobs. You first upload the file to HDFS, then tell Hadoop to replicate it to every node before running the job, and it conveniently places a symlink in the working directory of the job. You can then just use Python's open() to read the file with for line in f or whatever.
The distributed cache is the most efficient way (out of the box) to push files around for a job to use as a resource. You do not want to simply open the HDFS file from your process, as each task would then try to stream the file over the network... With the distributed cache, only one copy is downloaded per node, even if several tasks are running on that node.
First, add -files hdfs://NN:9000/user/sup.txt#sup.txt to your command-line arguments when you run the job.
Then:
for line in open('sup.txt'):
    pass  # do stuff
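For context, here is a hedged sketch of what a complete streaming mapper might look like once sup.txt sits in the working directory; the tab-separated layout and the selection rule are just assumptions for illustration:
import sys

# One pass over the supplementary file, line at a time (nothing is slurped);
# only a small set of keys is kept, not the whole file.
wanted = set()
with open("sup.txt") as sup:
    for line in sup:
        key = line.rstrip("\n").split("\t", 1)[0]
        if key.startswith("A"):  # hypothetical selection rule
            wanted.add(key)

# The usual streaming loop over the records arriving on stdin.
for record in sys.stdin:
    key = record.split("\t", 1)[0]
    if key in wanted:
        sys.stdout.write(record)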
Are you looking for this?
http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#module-pydoop.hdfs
import pydoop.hdfs

with pydoop.hdfs.open("supplementary", "r") as supplementary:
    for line in supplementary:
        pass  # process line
