Spark coalesce vs collect, which one is faster? - python

I am using pyspark to process 50Gb data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result to save in one csv file.
# daily_df is a empty pyspark DataFrame
for hour in range(24):
hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
daily_df = daily_df.union(hourly_df)
As of my knowledge, I have to perform the following to force the pyspark.sql.Dataframe object to save to 1 csv files (approx 1Mb) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It seems it took about 70min to finish this progress, and I am wondering if I can make it faster by using collect() method like?
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")

Both coalesce(1) and collect are pretty bad in general but with expected output size around 1MB it doesn't really matter. It simply shouldn't be a bottleneck here.
One simple improvement is to drop loop -> filter -> union and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough then most likely the issue here is configuration (the good place to start could be adjusting spark.sql.shuffle.partitions if you don't do that already).

To save as single file these are options
Option 1 :
coalesce(1) (minimum shuffle data over network) or repartition(1) or collect may work for small data-sets, but large data-sets it may not perform, as expected.since all data will be moved to one partition on one node
option 1 would be fine if a single executor has more RAM for use than the driver.
Option 2 :
Other option would be FileUtil.copyMerge() - to merge the outputs into a single file like below code snippet.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
val hadoopConfig = new Configuration()
val hdfs = FileSystem.get(hadoopConfig)
FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3 :
after getting part files you can use hdfs getMerge command like this...
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide based on your requirements... which one is safer/faster
also, can have look at Dataframe save after join is creating numerous part files

Related

Pyspark read and combine many parquet files efficiently

I have ~ 4000 parquet files that are each 3mb. I would like to read all of the files from an S3 bucket, do some aggregations, combine the files into one dataframe, and do some more aggregations. The parquet dataframes all have the same schema. The files are not all in the same folder in the S3 bucket but rather are spread across 5 different folders. I need some files in each folder but not all. I am working on my local machine which has 16gb RAM and 8 processors that I hope to leverage to make this faster. Here is my code:
ts_dfs = []
# For loop to read parquet files and append to empty list
for id in ids:
# Make the path to the timeseries data
timeseries_path = f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
ts_data_df = spark.read.parquet(timeseries_path).select('bldg_id',
'`out.natural_gas.heating.energy_consumption_intensity`',
'`out.electricity.heating.energy_consumption_intensity`',
'timestamp')
# Aggregate
ts_data_df = ts_data_df \
.groupBy(f.month('timestamp').alias('month'),'bldg_id') \
.agg(f.sum('`out.electricity.heating.energy_consumption_intensity`').alias('eui_elec'),
f.sum('`out.natural_gas.heating.energy_consumption_intensity`').alias('eui_gas'))
# Append
ts_dfs.append(ts_data_df)
# Combine all of the dfs into one
ts_sdf = reduce(DataFrame.unionAll, ts_dfs)
# Merge with ids_df
ts = ts_sdf.join(ids_df, on = ['bldg_id'])
# Mean and Standard Deviation by month
stats_df = ts_sdf.groupBy('month', '`in.hvac_heating_type_and_fuel`') \
.agg(f.mean('eui_elec').alias('mean_eui_elec'),
f.stddev('eui_elec').alias('std_eui_elec'),
f.mean('eui_gas'). alias('mean_eui_gas'),
f.stddev('eui_gas').alias('std_eui_gas'))
Reading all of the files through a forloop does not leverage the multiple cores, defeating the purpose of using Spark. As such this process takes 90 minutes on my own (though that may be more a function of my internet connection). Additionally, the next step: ts_sdf = reduce(DataFrame.unionAll, ts_dfs) which combines the dataframes using unionAll is taking even longer. I think my questions are:
How do I avoid having to read the files in a forloop?
Why is the unionAll taking so long? Does PySpark not parallelize this process somehow?
In general, how do I do this better? Perhaps PySpark is not the ideal tool for this?

Python and Dask - reading and concatenating multiple files

I have some parquet files, all coming from the same domain but with some differences in structure. I need to concatenate all of them. Below some example of these files:
file 1:
A,B
True,False
False,False
file 2:
A,C
True,False
False,True
True,True
What I am looking to do is to read and concatenate these files in the fastest way possible obtaining the following result:
A,B,C
True,False,NaN
False,False,NaN
True,NaN,False
False,NaN,True
True,NaN,True
To do that I am using the following code, extracted using (Reading multiple files with Dask, Dask dataframes: reading multiple files & storing filename in column):
import glob
import dask.dataframe as dd
from dask.distributed import Client
import dask
def read_parquet(path):
return pd.read_parquet(path)
if __name__=='__main__':
files = glob.glob('test/*/file.parquet')
print('Start dask client...')
client = Client()
results = [dd.from_delayed(dask.delayed(read_parquet)(diag)) for diag in diag_files]
results = dd.concat(results).compute()
client.close()
This code works, and it is already the fastest version I could come up with (I tried sequential pandas and multiprocessing.Pool). My idea was that Dask could ideally start part of the concatenation while still reading some of the files, however, from the task graph I see some sequential reading of the metadata of each parquet file, see the screenshot below:
The first part of the task graph is a mixture of read_parquet followed by read_metadata. The first part always shows only 1 task executed (in the task processing tab). The second part is a combination of from_delayed and concat and it is using all of my workers.
Any suggestion on how to speed up the file reading and reduce the execution time of the first part of the graph?
The problem with your code is that you use Pandas version of
read_parquet.
Instead use:
dask version of read_parquet,
map and gather methods offered by Client,
dask version of concat,
Something like:
def read_parquet(path):
return dd.read_parquet(path)
def myRead():
L = client.map(read_parquet, glob.glob('file_*.parquet'))
lst = client.gather(L)
return dd.concat(lst)
result = myRead().compute()
Before that I created a client, once only.
The reason was that during my earlier experiments I got an error
message when I attempted to create it again (in a function), even
though the first instance has been closed before.

Pyspark: How to parallelize multi gz file processing in HDFS

I have many gz files stored in 20 nodes HDFS cluster that need to be aggregated by columns. The gz files are very large (1GByte each and 200 files in total).
The data format is key value with 2 column values: ['key','value1','value2'], and needs to be grouped by key with aggregation by column: sum(value1), count(value2).
The data is already sorted by key and each gz files have exclusive key values.
For example:
File 1:
k1,v1,u1
k1,v2,u1
k2,v2,u2
k3,v3,u3
k3,v4,u4
File 2:
k4,v5,u6
k4,v7,u8
k5,v9,v10
File 3:
k6,...
...
...
File 200:
k200,v200,u200
k201,v201,u201
I firstly parse the date and convert the data into (key, list of (values)) structure. The parser output will be like this:
parser output
(k1,[v1,u1])
(k1,[v2,u1])
(k2,[v2,u2])
(k3,[v3,u3])
(k3,[v4,u4])
Then group by key values using reduceByKey function, which is more efficient than groupByKey function.
reducer output:
(k1,[[v1,u1],[v2,u1])
(k2,[[v2,u2]])
(k3,[[v3,u3],[v4,u4]])
Then aggregate the columns using process function:
process
(k1, sum([v1,v2], len([u1,u3])))
(k2, sum([v2], len([u2])))
(k3, sum([v3,v4], len([u3,u4])))
Here is the sample code for the process
import pyspark
from pyspark import SparkFiles
def parser(line):
try:
key,val=line.split('\t)
return (key,[val1,val2])
except:
return None
def process(line):
key,gr= line[0],line[1]
vals=zip(*gr)
val1=sum(vals[0])
val2=len(vals[1])
return ('\t'.join([key,val1,val2]))
sc = pyspark.SparkContext(appName="parse")
logs=sc.textFile("hdfs:///home/user1/*.gz")
proc=logs.map(parser).filter(bool).reduceByKey(lambda acc,x: acc+x).map(process)
proc.saveAsTextFile('hdfs:///home/user1/output1')
I think this code does not fully utilize the spark cluster. I like to optimize the code to fully utilize the processing considering.
1. What is the best way to handle gz files in HDFS and Pyspark? -- how to fully distribute the gz file processing to the entire cluster?
2. How to fully utilize all the CPUs in each node? for aggregation and parsing process
There are at least a couple of things you should consider:
If you are using YARN, the number of executors and the cores per executor that you assign to your spark app. They can be controlled by --num-executors and --executor-cores. If you are not using YARN, your scheduler will probably have a similar mechanism to control parallelism, try looking for it.
The number of partitions in your DataFrame, which directly impacts the parallelism in your job. You can control that with repartition and/or coalesce.
Both can limit the cores used by the job and hence the use of the cluster. Also, take into account that more CPUs employed will not necessarily mean better performance (or execution times). That will depend on the size of your cluster and the size of the problem, and I don't know about any easy rule to decide that. For me it usually comes down to experiment with different configurations and see which one has better performance.

Writing a big Spark Dataframe into a csv file

I'm using Spark 2.3 and I need to save a Spark Dataframe into a csv file and I'm looking for a better way to do it.. looking over related/similar questions, I found this one, but I need a more specific:
If the DataFrame is too big, how can I avoid using Pandas? Because I used toCSV() function (code below) and it produced:
Out Of Memory error (could not allocate memory).
Is directly writing to a csv using file I/O a better way? Can it preserve the separators?
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv') will cause the header to be written in each file and when the files are merged, it will have headers in the middle. Am I wrong?
Using spark write and then hadoop getmerge is better than using coalesce from the point of performance?
def toCSV(spark_df, n=None, save_csv=None, csv_sep=',', csv_quote='"'):
"""get spark_df from hadoop and save to a csv file
Parameters
----------
spark_df: incoming dataframe
n: number of rows to get
save_csv=None: filename for exported csv
Returns
-------
"""
# use the more robust method
# set temp names
tmpfilename = save_csv or (wfu.random_filename() + '.csv')
tmpfoldername = wfu.random_filename()
print n
# write sparkdf to hadoop, get n rows if specified
if n:
spark_df.limit(n).write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
else:
spark_df.write.csv(tmpfoldername, sep=csv_sep, quote=csv_quote)
# get merge file from hadoop
HDFSUtil.getmerge(tmpfoldername, tmpfilename)
HDFSUtil.rmdir(tmpfoldername)
# read into pandas df, remove tmp csv file
pd_df = pd.read_csv(tmpfilename, names=spark_df.columns, sep=csv_sep, quotechar=csv_quote)
os.remove(tmpfilename)
# re-write the csv file with header!
if save_csv is not None:
pd_df.to_csv(save_csv, sep=csv_sep, quotechar=csv_quote)
If the DataFrame is too big, how can I avoid using Pandas?
You can just save the file to HDFS or S3 or whichever distributed storage you have.
Is directly writing to a csv using file I/O a better way? Can it
preserve the separators?
If you mean by that to save file to local storage - it will still cause OOM exception, since you will need to move all data in memory on local machine to do it.
Using df.coalesce(1).write.option("header", "true").csv('mycsv.csv')
will cause the header to be written in each file and when the files
are merged, it will have headers in the middle. Am I wrong?
In this case you will have only 1 file (since you do coalesce(1)). So you don't need to care about headers. Instead - you should care about memory on the executors - you might get OOM on the executor since all the data will be moved to that executor.
Using spark write and then hadoop getmerge is better than using
coalesce from the point of performance?
Definitely better (but don't use coalesce()). Spark will efficiently write data to storage, then HDFS will duplicate data and after that getmerge will be able to efficiently read data from the nodes and merge it.
We used databricks library . It works fine
df.save("com.databricks.spark.csv", SaveMode.Overwrite, Map("delimiter" -> delim, "nullValue" -> "-", "path" -> tempFPath))
Library :
<!-- spark df to csv -->
<dependency>
<groupId>com.databricks</groupId>
<artifactId>spark-csv_2.10</artifactId>
<version>1.3.0</version>
</dependency>

Ingesting 150 csv's into one data source

Hello i am completely new to handling big data and comfortable in python
I have 150 csv's each of size 70MB which i have to integrate in one source to remove basic stats like unique counts, unique names and all.
Any one could suggest how should i go about it?
I came across a package 'pyelastic search' in python how feasible it is for me to use in enthaught canopy.
Suggestion needed!
Try to use pandas package.
reading a single csv would be:
import pandas as pd
df = pd.read_csv('filelocation.csv')
in case of multiple files, just concat them. let's say ls is a list of file locations, then:
df = pd.concat([pd.read_csv(f) for f in ls])
and then to write them as a single file, do:
df.to_csv('output.csv')
of course all this is valid for in-memory operations (70x150 = ~ 10.5 GB RAM). If that's not possible - consider building an incremental process or using dask dataframes.
One option if you are in AWS
Step1 - move data to S3 (AWS native File storage)
Step2 - create table for each data structure in redshift
Step3 - run COPY command to move data from S3 to Redshift (AWS native DW)
COPY command loads data in bulk, detects file name pattern

Categories