How to enable parallel reading of files in Dataflow? - python

I'm working on a Dataflow pipeline that reads 1000 files (50 MB each) from GCS and performs some computations on the rows across all files. Each file is a CSV with the same structure, just with different numbers in it, and I'm computing the average of each cell over all files.
The pipeline looks like this (python):
additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p
 | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
 | 'Read files' >> ReadMatches()
 | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
 | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))
The FileToRowsFn class looks like this (see below, some details omitted). The row_id is the 1st column and is a unique key of each row; it exists exactly once in each file, so that I can compute the average over all the files. There's some additional value provided as side inputs to the transformer, which is not shown inside the method body below, but is still used by the real implementation. This value is a dictionary that is created outside of the pipeline. I mention it here in case this might be a reason for the lack of parallelization.
import csv
from io import TextIOWrapper

import apache_beam as beam

class FileToRowsFn(beam.DoFn):
    def process(self, file_element, additional_side_inputs):
        with file_element.open() as csv_file:
            for row_id, *values in csv.reader(TextIOWrapper(csv_file, encoding='utf-8')):
                yield row_id, values
The AverageCalculatorFn is a typical beam.CombineFn with an accumulator, that performs the average of each cell of a given row over all rows with the same row_id across all files.
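For reference, a minimal sketch of what such a CombineFn could look like (an illustration, not the omitted original; it assumes the values parse as floats and that every file contributes a value list of the same length per row_id):

import apache_beam as beam

class AverageCalculatorFn(beam.CombineFn):
    """Element-wise average of the value lists seen for one row_id."""

    def create_accumulator(self):
        return (None, 0)  # (running element-wise sums, file count)

    def add_input(self, acc, values):
        sums, count = acc
        values = [float(v) for v in values]  # CSV values arrive as strings
        if sums is None:
            return (values, 1)
        return ([s + v for s, v in zip(sums, values)], count + 1)

    def merge_accumulators(self, accumulators):
        merged_sums, merged_count = None, 0
        for sums, count in accumulators:
            if sums is None:
                continue
            if merged_sums is None:
                merged_sums, merged_count = list(sums), count
            else:
                merged_sums = [a + b for a, b in zip(merged_sums, sums)]
                merged_count += count
        return (merged_sums, merged_count)

    def extract_output(self, acc):
        sums, count = acc
        return [s / count for s in sums] if count else []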
All this works fine, but there's a problem with performance and throughput: it takes more than 60 hours to execute this pipeline. From the monitoring console, I notice that the files are read sequentially (1 file every 2 minutes). I understand that reading a file may take 2 minutes (each file is 50 MB), but I don't understand why Dataflow doesn't assign more workers to read multiple files in parallel. The CPU remains at ~2-3% because most of the time is spent in file IO, and the number of workers doesn't exceed 2 (although no limit is set).
The output of ReadMatches is 1000 file records, so why doesn't dataflow create lots of FileToRowsFn instances and dispatch them to new workers, each one handling a single file?
Is there a way to enforce such a behavior?

This is probably because all your steps get fused into a single step by the Dataflow runner.
For such a fused bundle to parallelize, the first step needs to be parallelizable. In your case this is a glob expansion which is not parallelizable.
To make your pipeline parallelizable, you can try to break fusion. This can be done by adding a Reshuffle transform as the consumer of one of the steps that produce many elements.
For example,
from apache_beam import Reshuffle

additional_side_inputs = {'key1': 'value1', 'key2': 'value2'}  # etc.

(p
 | 'Collect CSV files' >> MatchFiles(input_dir + "*.csv")
 | 'Read files' >> ReadMatches()
 | 'Reshuffle' >> Reshuffle()
 | 'Parse contents' >> beam.ParDo(FileToRowsFn(), additional_side_inputs)
 | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))
You should not have to do this if you use one of the standard sources available in Beam, such as textio.ReadFromText(), to read data. (Unfortunately we do not have a CSV source, but ReadFromText supports skipping header lines.)
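For illustration, a minimal sketch using the standard text source (the skip_header_lines value and the naive comma split are assumptions; a real CSV parser would be needed for quoted fields):

import apache_beam as beam

rows = (p
        | 'Read CSV lines' >> beam.io.ReadFromText(input_dir + "*.csv", skip_header_lines=1)
        | 'To (row_id, values)' >> beam.Map(lambda line: (line.split(',')[0], line.split(',')[1:]))
        | 'Compute average' >> beam.CombinePerKey(AverageCalculatorFn()))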
See the Dataflow documentation for more information regarding the fusion optimization and preventing fusion.

Related

How to parallelize properly dataflow job over 11M of files stored on GCS using fileio.MatchFiles(file_pattern = 'gs://bucket/**/*')

I have a beginner question. I have ~11 million files stored on a GCS bucket with the following structure:
yyyy/mm/dd
I have 6 years of data with ~1500 files per day.
lines = (p
         | "ReadInputData" >> fileio.MatchFiles(file_pattern='gs://bucket/**/*')
         | "FileToBytes" >> fileio.ReadMatches()
         | "Reshuffle" >> beam.Reshuffle()
         | "GetMetadata" >> beam.Map(lambda file: post_process(file))
         | "WriteTableToBQ" >> beam.io.WriteToBigQuery(...)
         )
For each file, I check the metadata (extension, size, ...); for PDF files I open the file to check the number of pages, and for images I compute some hash. This is the post_process() Python function.
The current Dataflow job has been running for 3 days (which is quite a lot), but it is not parallelized properly, since most of the time only 1 vCPU is used:
It seems that the job searches for files for a very long time with 1 worker (vCPU), then scales up to ~25 workers for ~10 min for the processing, and continues like that for days. I must be doing something wrong.
Earlier I did some tests for one month (144,000 files) with file_pattern = 'gs://bucket/2022/04/**/*'.
I went from 4 h (only one worker was used) to 16 min (scaling up to 57 workers) by adding "Reshuffle" >> beam.Reshuffle(), which was missing.
I am using Python with the Apache Beam Python 3.9 SDK 2.40.0, and I use a Dataflow template with a container (--sdk_container_image option).
My question is how to use fileio.MatchFiles(file_pattern='gs://bucket/**/*') properly so that it uses multiple workers all the time. It seems that what I did works for file_pattern='gs://bucket/2022/04/**/*' but not for file_pattern='gs://bucket/**/*'. What is the best practice? I didn't find anything in the documentation that answers my question. If I do a loop over year and month, I don't know whether this will be properly parallelized or whether it is the right thing to do. Any suggestion or recommendation? Thanks
edit 1:
I found that looping over the years gives better results:
list_yyyy_mm = []
for year in range(2017, 2023):
    list_yyyy_mm.append(
        p | f"MatchFile {year:04d}" >> fileio.MatchFiles(file_pattern=f"gs://bucket/{year:04d}/**/*"))

files = (
    list_yyyy_mm
    | "Flatten" >> beam.Flatten()
    ...
)
Now I can process the 6 years of data in ~2 hours which is much better.
This is still not well parallelized, as it takes ~30 min with 5 workers just to list the 12 million files. After that, reading the 12 million files and extracting the metadata works as expected. (The last part is limited to ~230 workers because I don't have enough IPs, but this is an issue on my side.)
I tried to loop over years and months but it is even worse.
Not sure if this is expected for fileio.MatchFiles or if I should write my own version and optimize it for my use case. Maybe this is documented, but I didn't find it.

Merging large TSV files on select columns

Is there a memory-efficient way to merge many large TSV files, only keeping select columns of each TSV file?
Background:
I have many (~100) large TSV files (~800MB) of the form:
filename: 1.txt
ID    Char1     Char2     ...    P
1     Green     Red       .      0.1
2     Red       Orange    .      0.2
3     Purple    Blue      .      0.13
(Format A)
The ID column is the same between all TSV files. I only care about the ID and P columns for each and would like to condense them into the following form to make them take up less memory (column names are pulled from the Format A file name):
ID    1       2       ...    N
1     P_11    P_12    .      P_1N
2     P_21    P_22    .      P_2N
3     P_31    P_32    .      P_3N
(Format B)
I need to do this in a memory-efficient manner as I will be processing as many of these blocks of 100 files in parallel as possible. Without parallelization, the runtime would be extremely long as there are 170 blocks of 100 files to process. My workflow is as follows:
Step 1: Generate M groups of these 100 files in format A
Step 2: Process these 100 files into the one file of format B
Step 3: Repeat the above two steps until all 170 groups of 100 files are processed
I've tried many different approaches, and none is both time and memory efficient.
My first approach involved storing everything in memory. I kept one file's ID and P columns in a list of lists. Then, I iterated through every other file and appended its P value to the end of the appropriate array. This approach had a satisfactory runtime (~30 minutes per block of 100 files). But, this failed as my VM does not have the memory to parallelize storing all of this data at once.
I've gone through a number of iterations over the last few days. In my current iteration, I use one file and only keep its ID and P columns. Then, I loop through each other file and use fileinput to append that file's P column in place to the end of each line. After testing, this method passes on the memory/parallelization side but is too slow in terms of runtime. I've also thought of using pandas to merge, but this runs into both time and memory constraints.
I have scoured every corner of my brain and the internet, but have not produced a satisfactory solution.
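Since the ID column is stated to be identical across all TSV files, one memory-efficient possibility (a sketch, not the approach from the post; file paths, the tab delimiter, and the positional column headers in the output are assumptions, and rows are assumed to appear in the same ID order in every file) is to open all ~100 files at once and stream them line by line in lockstep, writing each Format B row as it is produced:

import csv
from contextlib import ExitStack

def merge_p_columns(input_paths, output_path):
    """Stream-merge the P column of each TSV into one Format B file."""
    with ExitStack() as stack, open(output_path, "w", newline="") as out:
        readers = [csv.DictReader(stack.enter_context(open(p, newline="")), delimiter="\t")
                   for p in input_paths]
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["ID"] + [str(i + 1) for i in range(len(readers))])
        # Pull one row from every file at a time, so only ~100 rows are in memory.
        for rows in zip(*readers):
            ids = {r["ID"] for r in rows}
            assert len(ids) == 1, "ID columns are out of sync"
            writer.writerow([rows[0]["ID"]] + [r["P"] for r in rows])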

Why partitioned parquet files consume larger disk space?

I am learning about Parquet files using Python and pyarrow. Parquet is great at compression and minimizing disk space. My dataset is a 190 MB CSV file which ends up as a single 3 MB file when saved as a snappy-compressed Parquet file.
However, when I save my dataset as partitioned files, they result in a much larger combined size (61 MB).
Here is example dataset that I am trying to save:
listing_id | date       | gender | price
-----------|------------|--------|------
a          | 2019-01-01 | M      | 100
b          | 2019-01-02 | M      | 100
c          | 2019-01-03 | F      | 200
d          | 2019-01-04 | F      | 200
When I partition by date (300+ unique values), the partitioned files total 61 MB. Each file is 168.2 kB in size.
When I partition by gender (2 unique values), the partitioned files result in just 3MB combined.
I am wondering if there is any minimum file size for parquet such that many small files combined consume greater disk space?
My env:
- OS: Ubuntu 18.04
- Language: Python
- Library: pyarrow, pandas
My dataset source:
https://www.kaggle.com/brittabettendorf/berlin-airbnb-data
# I am using calendar_summary.csv as my data from a group of datasets in that link above
My code to save as parquet file:
# write to dataset using parquet
import os

import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_table(table=table, where='./calendar_summary_write_table.parquet')

# parquet filesize
parquet_method1_filesize = os.path.getsize('./calendar_summary_write_table.parquet') / 1000
print('parquet_method1_filesize: %i kB' % parquet_method1_filesize)
My code to save as partitioned parquet file:
# write to dataset using parquet (partitioned)
df = pd.read_csv('./calendar_summary.csv')
table = pyarrow.Table.from_pandas(df)
pyarrow.parquet.write_to_dataset(
    table=table,
    root_path='./calendar_summary/',
    partition_cols=['date'])

# parquet filesize
import os
print(os.popen('du -sh ./calendar_summary/').read())
There is no minimum file size, but there is an overhead for storing the footer and there is a wasted opportunity for optimizations via encodings and compressions. The various encodings and compressions build on the idea that the data has some amount of self-similarity which can be exploited by referencing back to earlier similar occurrences. When you split the data into multiple files, each of them will need a separate "initial data point" that the successive ones can refer back to, so disk usage goes up. (Please note that there are huge oversimplifications in this wording to avoid having to specifically go through the various techniques employed to save space, but see this answer for a few examples.)
Another thing that can have a huge impact on the size of Parquet files is the order in which data is inserted. A sorted column can be stored a lot more efficiently than a randomly ordered one. It is possible that by partitioning the data you inadvertently alter its sort order. Another possibility is that you partition the data by the very attribute that it was ordered by and which allowed a huge space saving when storing in a single file and this opportunity gets lost by splitting the data into multiple files. Finally, you have to keep in mind that Parquet is not optimized for storing a few kilobytes of data but for several megabytes or gigabytes (in a single file) or several petabytes (in multiple files).
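To see the effect of ordering, a small experiment could be to write the same table once as-is and once sorted by the partitioning column, and compare file sizes (a sketch only; the 'date' column and the output file names are taken from the question for illustration):

import os

import pandas as pd
import pyarrow
import pyarrow.parquet

df = pd.read_csv('./calendar_summary.csv')

# same data, written unsorted vs. sorted by the 'date' column
pyarrow.parquet.write_table(pyarrow.Table.from_pandas(df), './unsorted.parquet')
pyarrow.parquet.write_table(pyarrow.Table.from_pandas(df.sort_values('date')), './sorted_by_date.parquet')

for path in ('./unsorted.parquet', './sorted_by_date.parquet'):
    print(path, os.path.getsize(path) / 1000, 'kB')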
If you would like to inspect how your data is stored in your Parquet files, the Java implementation of Parquet includes the parquet-tools utility, providing several commands. See its documentation page for building and getting started. The more detailed descriptions of the individual commands are printed by parquet-tools itself. The commands most interesting to you are probably meta and dump.
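If you prefer staying in Python, pyarrow itself can show similar information; a minimal sketch (the file name is assumed from the question's code):

import pyarrow.parquet as pq

pf = pq.ParquetFile('./calendar_summary_write_table.parquet')
print(pf.metadata)                          # row groups, row count, created_by, ...
print(pf.schema)                            # column names and physical types
print(pf.metadata.row_group(0).column(0))   # per-column stats of the first row group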

Pyspark: How to parallelize multi gz file processing in HDFS

I have many gz files stored in a 20-node HDFS cluster that need to be aggregated by columns. The gz files are very large (1 GB each, 200 files in total).
The data format is key-value with 2 value columns: ['key','value1','value2'], and the data needs to be grouped by key with per-column aggregation: sum(value1), count(value2).
The data is already sorted by key, and each gz file has exclusive key values.
For example:
File 1:
k1,v1,u1
k1,v2,u1
k2,v2,u2
k3,v3,u3
k3,v4,u4
File 2:
k4,v5,u6
k4,v7,u8
k5,v9,v10
File 3:
k6,...
...
...
File 200:
k200,v200,u200
k201,v201,u201
I first parse the data and convert it into a (key, list of values) structure. The parser output will look like this:
parser output:
(k1,[v1,u1])
(k1,[v2,u1])
(k2,[v2,u2])
(k3,[v3,u3])
(k3,[v4,u4])
Then I group by key using the reduceByKey function, which is more efficient than groupByKey.
reducer output:
(k1,[[v1,u1],[v2,u1]])
(k2,[[v2,u2]])
(k3,[[v3,u3],[v4,u4]])
Then I aggregate the columns using the process function:
process output:
(k1, sum([v1,v2]), len([u1,u1]))
(k2, sum([v2]), len([u2]))
(k3, sum([v3,v4]), len([u3,u4]))
Here is the sample code for the process
import pyspark

def parser(line):
    try:
        key, val1, val2 = line.split('\t')
        # wrap the pair in a list so reduceByKey can concatenate the per-key pairs
        return (key, [[float(val1), val2]])
    except Exception:
        return None

def process(item):
    key, gr = item[0], item[1]
    vals = list(zip(*gr))      # ([v1, v2, ...], [u1, u2, ...])
    val1 = sum(vals[0])        # sum of the first value column
    val2 = len(vals[1])        # count of the second value column
    return '\t'.join([key, str(val1), str(val2)])

sc = pyspark.SparkContext(appName="parse")
logs = sc.textFile("hdfs:///home/user1/*.gz")
proc = logs.map(parser).filter(bool).reduceByKey(lambda acc, x: acc + x).map(process)
proc.saveAsTextFile('hdfs:///home/user1/output1')
I think this code does not fully utilize the Spark cluster. I'd like to optimize the code to fully utilize the cluster, considering:
1. What is the best way to handle gz files in HDFS and PySpark? How can the gz file processing be fully distributed across the entire cluster?
2. How can all the CPUs in each node be fully utilized for the parsing and aggregation?
There are at least a couple of things you should consider:
If you are using YARN, the number of executors and the cores per executor that you assign to your spark app. They can be controlled by --num-executors and --executor-cores. If you are not using YARN, your scheduler will probably have a similar mechanism to control parallelism, try looking for it.
The number of partitions in your DataFrame, which directly impacts the parallelism in your job. You can control that with repartition and/or coalesce.
Both can limit the cores used by the job and hence the use of the cluster. Also, take into account that more CPUs employed will not necessarily mean better performance (or execution times). That will depend on the size of your cluster and the size of the problem, and I don't know of any easy rule for deciding that. For me it usually comes down to experimenting with different configurations and seeing which one has better performance.
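For instance, a minimal sketch of the partition-count idea applied to the question's RDD pipeline (it reuses the parser and process functions from the question; the value 800 is only an illustrative guess):

import pyspark

sc = pyspark.SparkContext(appName="parse")

# Each .gz file is read as one partition (gzip is not splittable),
# so the parsing stage can use at most ~200 cores here, one per file.
logs = sc.textFile("hdfs:///home/user1/*.gz")

# Give the shuffle/aggregation stage its own, larger partition count
# so the reduce side is not limited to the 200 input partitions.
proc = (logs.map(parser)
            .filter(bool)
            .reduceByKey(lambda acc, x: acc + x, numPartitions=800)
            .map(process))
proc.saveAsTextFile('hdfs:///home/user1/output1')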

Spark coalesce vs collect, which one is faster?

I am using PySpark to process 50 GB of data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want to save my result in one CSV file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
To my knowledge, I have to perform the following to force the pyspark.sql.DataFrame object to save to 1 CSV file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It seems it took about 70 min to finish this process, and I am wondering if I can make it faster by using the collect() method, like this:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough, then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
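A self-contained sketch of that single aggregation (df is assumed to be the question's DataFrame with a timestamp column "Time"; the imports, the session setup, and the shuffle-partition count are assumptions):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, hour, mean, sum as sum_

spark = SparkSession.builder.appName("daily-aggregation").getOrCreate()
# Far fewer shuffle partitions than the default 200 can be enough for a ~1 MB output.
spark.conf.set("spark.sql.shuffle.partitions", "24")

daily_df = (df.groupby(hour(col("Time")).alias("hour"), col("Animal"))
              .agg(mean("weights"), sum_("is_male")))
daily_df.coalesce(1).write.csv("some_local.csv", header=True)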
To save as single file these are options
Option 1 :
coalesce(1) (minimum shuffling of data over the network), repartition(1), or collect may work for small datasets, but for large datasets it may not perform as expected, since all the data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available for use than the driver.
Option 2 :
Another option would be FileUtil.copyMerge(), which merges the outputs into a single file as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3 :
After getting the part files, you can use the hdfs getmerge command like this...
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide based on your requirements... which one is safer/faster
Also, have a look at Dataframe save after join is creating numerous part files.
