Pyspark read and combine many parquet files efficiently - python

I have ~4,000 parquet files that are each 3 MB. I would like to read all of the files from an S3 bucket, do some aggregations, combine the files into one dataframe, and do some more aggregations. The parquet dataframes all have the same schema. The files are not all in the same folder in the S3 bucket but rather are spread across 5 different folders. I need some files in each folder but not all. I am working on my local machine, which has 16 GB of RAM and 8 processors that I hope to leverage to make this faster. Here is my code:
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as f

ts_dfs = []
# For loop to read parquet files and append to empty list
for id in ids:
    # Make the path to the timeseries data
    timeseries_path = f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    ts_data_df = spark.read.parquet(timeseries_path).select(
        'bldg_id',
        '`out.natural_gas.heating.energy_consumption_intensity`',
        '`out.electricity.heating.energy_consumption_intensity`',
        'timestamp')
    # Aggregate
    ts_data_df = ts_data_df \
        .groupBy(f.month('timestamp').alias('month'), 'bldg_id') \
        .agg(f.sum('`out.electricity.heating.energy_consumption_intensity`').alias('eui_elec'),
             f.sum('`out.natural_gas.heating.energy_consumption_intensity`').alias('eui_gas'))
    # Append
    ts_dfs.append(ts_data_df)

# Combine all of the dfs into one
ts_sdf = reduce(DataFrame.unionAll, ts_dfs)
# Merge with ids_df
ts = ts_sdf.join(ids_df, on=['bldg_id'])
# Mean and standard deviation by month
stats_df = ts.groupBy('month', '`in.hvac_heating_type_and_fuel`') \
    .agg(f.mean('eui_elec').alias('mean_eui_elec'),
         f.stddev('eui_elec').alias('std_eui_elec'),
         f.mean('eui_gas').alias('mean_eui_gas'),
         f.stddev('eui_gas').alias('std_eui_gas'))
Reading all of the files through a for loop does not leverage the multiple cores, defeating the purpose of using Spark. As such, this process takes 90 minutes on its own (though that may be more a function of my internet connection). Additionally, the next step, ts_sdf = reduce(DataFrame.unionAll, ts_dfs), which combines the dataframes using unionAll, is taking even longer. I think my questions are:
How do I avoid having to read the files in a for loop?
Why is the unionAll taking so long? Does PySpark not parallelize this process somehow?
In general, how do I do this better? Perhaps PySpark is not the ideal tool for this?
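A sketch of one way to avoid the loop (assuming the county paths can all be listed up front, exactly as in the code above; this is only a sketch, not a tested answer): build the full list of paths, hand them to a single spark.read.parquet call, and aggregate the combined dataframe once, so Spark plans one distributed read instead of thousands of sequential ones.

import pyspark.sql.functions as f

# Build every county path first, then read them in one call so Spark can
# parallelize the scan instead of looping in the driver.
paths = [
    f'{dataset_path}/timeseries_individual_buildings/by_county/upgrade=0/county={id[0]}'
    for id in ids
]
ts_sdf = (spark.read.parquet(*paths)
          .select('bldg_id',
                  '`out.natural_gas.heating.energy_consumption_intensity`',
                  '`out.electricity.heating.energy_consumption_intensity`',
                  'timestamp')
          .groupBy(f.month('timestamp').alias('month'), 'bldg_id')
          .agg(f.sum('`out.electricity.heating.energy_consumption_intensity`').alias('eui_elec'),
               f.sum('`out.natural_gas.heating.energy_consumption_intensity`').alias('eui_gas')))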

Related

how to efficiently append 200k+ json files to dataframe

I have over 200,000 JSON files stored across 600 directories on my drive and a copy in Google Cloud Storage.
I need to merge them and transform them into a CSV file,
so I thought of looping the following logic:
open each file with pandas read_json
apply some transformations (drop some columns, add columns with values based on the filename, change the order of columns)
append this df to a master_df
and finally export the master_df to a CSV.
Each file is around 5,000 rows when converted to a DF, which would result in a DF of around 300 000 000 rows.
The files are grouped and reside in 3 directories (and then, depending on the file, in a couple more; so at the top level there are 3 dirs, but overall about 600), so to speed it up a bit I decided to run the script on one of those 3 directories at a time.
I am running this script on my local machine (32 GB RAM, sufficient disk space on an SSD drive), yet it's painfully slow, and the speed decreases as it runs.
At the beginning one df was appended within 1 second; after executing the loop 4,000 times, the time to append grew to 2.7 s.
I am looking for a way to speed it up to something that could reasonably be done within (hopefully) a couple of hours.
So, bottom line: should I try to optimize my code and run it locally, or not even bother, keep the script 'as is' and run it in e.g. Google Cloud Run?
The JSON files contain keys A, B, C, D, E. I drop A and B as not important, and rename C.
The code I have so far:
import pandas as pd
import os

def listfiles(track, footprint):
    """
    List files under a filepath (track) whose filename contains the given footprint (extension).
    Returns a list of file paths.
    """
    locations = []
    for r, d, f in os.walk(track):
        for file in f:
            if footprint in file:
                locations.append(os.path.join(r, file))
    return locations

def create_master_df(track, footprint):
    master_df = pd.DataFrame(columns=['date', 'domain', 'property', 'lang', 'kw', 'clicks', 'impressions'])
    all_json_files = listfiles(track, footprint)
    prop = []  # this indicates the starting directory, and to which output file master_df should be saved
    for i, file in enumerate(all_json_files):
        # here starts the logic by which I identify some properties of the file,
        # to use them as values in columns of local_df
        elements = file.split('\\')
        elements[-1] = elements[-1].replace('.json', '')
        elements[-1] = elements[-1].replace('-Copy-Multiplicated', '')
        date = elements[-1][-10:]
        domain = elements[7]
        property = elements[6]
        language = elements[8]
        json_to_rows = pd.read_json(file)
        local_df = pd.DataFrame(json_to_rows.rows.values.tolist())
        local_df = local_df.drop(columns=['A', 'B'])
        local_df = local_df.rename(columns={'C': 'new_name'})
        # here I add the earlier extracted data, to be able to distinguish rows in the master DF
        local_df['date'] = date
        local_df['domain'] = domain
        local_df['property'] = property
        local_df['lang'] = language
        local_df = local_df[['date', 'domain', 'property', 'lang', 'new_name', 'D', 'E']]
        master_df = master_df.append(local_df, ignore_index=True)
        prop = property
        print(f'appended {i} times')
    out_file_path = f'{track}\\{prop}-outfile.csv'
    master_df.to_csv(out_file_path, sep=";", index=False)

# run the script on the first directory
track = 'C:\\xyz\\a'
footprint = '.json'
create_master_df(track, footprint)

# run the script on the second directory
track = 'C:\\xyz\\b'
create_master_df(track, footprint)
Any feedback is welcome!
Edit:
I tried and timed a couple of ways of approaching it; below is an explanation of what I did:
Clean run - at first I ran my code as it was written.
1st try - I moved all changes to local_df to just before the final master_df was saved to CSV. This was an attempt to see if working on the files 'as they are', without manipulating them, would be faster. I removed dropping columns, renaming columns, and reordering them from the for loop; all of this was applied to master_df before exporting to CSV.
2nd try - I brought dropping the unnecessary columns back into the for loop, as performance was clearly impacted by the additional columns.
3rd try - so far the most efficient - instead of appending local_df to master_df, I transformed it into a local_list and appended that to a master_list. master_list was then passed to pd.concat to build master_df, which was exported to CSV.
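A minimal sketch of that third pattern (keep the per-file frames in a plain list and concatenate once at the end instead of calling DataFrame.append inside the loop; the read/rename logic is taken from the code above and the output name is just a placeholder):

import pandas as pd

frames = []
for file in all_json_files:
    local_df = pd.DataFrame(pd.read_json(file).rows.values.tolist())
    local_df = local_df.drop(columns=['A', 'B']).rename(columns={'C': 'new_name'})
    frames.append(local_df)  # cheap list append instead of DataFrame.append

# single concatenation at the end, far cheaper than growing master_df each iteration
master_df = pd.concat(frames, ignore_index=True)
master_df.to_csv('outfile.csv', sep=';', index=False)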
Here is a comparison of speed - how fast the script iterated the for loop in each version, when tested on ~800 JSON files:
[image: comparison of timed script executions]
It's surprising to see that I actually wrote solid code in the first place - as I don't have much experience coding, that was super nice to see :)
Overall, when the job was run on the test set of files it finished as follows (in seconds):

Pyspark: How to parallelize multi gz file processing in HDFS

I have many gz files stored in a 20-node HDFS cluster that need to be aggregated by columns. The gz files are very large (1 GB each, 200 files in total).
The data format is key-value with 2 column values: ['key','value1','value2'], and needs to be grouped by key with aggregation by column: sum(value1), count(value2).
The data is already sorted by key and each gz file has exclusive key values.
For example:
File 1:
k1,v1,u1
k1,v2,u1
k2,v2,u2
k3,v3,u3
k3,v4,u4
File 2:
k4,v5,u6
k4,v7,u8
k5,v9,v10
File 3:
k6,...
...
...
File 200:
k200,v200,u200
k201,v201,u201
I first parse the data and convert it into a (key, list of values) structure. The parser output will look like this:
parser output
(k1,[v1,u1])
(k1,[v2,u1])
(k2,[v2,u2])
(k3,[v3,u3])
(k3,[v4,u4])
Then I group by key using the reduceByKey function, which is more efficient than the groupByKey function.
reducer output:
(k1,[[v1,u1],[v2,u1]])
(k2,[[v2,u2]])
(k3,[[v3,u3],[v4,u4]])
Then I aggregate the columns using a process function:
process
(k1, sum([v1,v2]), len([u1,u1]))
(k2, sum([v2]), len([u2]))
(k3, sum([v3,v4]), len([u3,u4]))
Here is the sample code for the process:
import pyspark

def parser(line):
    # Split a tab-separated line into (key, [[value1, value2]]); the inner list
    # lets reduceByKey build up a list of value pairs per key.
    try:
        key, val1, val2 = line.split('\t')
        return (key, [[val1, val2]])
    except Exception:
        return None

def process(kv):
    key, gr = kv[0], kv[1]
    vals = list(zip(*gr))
    val1 = sum(float(v) for v in vals[0])  # sum(value1)
    val2 = len(vals[1])                    # count(value2)
    return '\t'.join([key, str(val1), str(val2)])

sc = pyspark.SparkContext(appName="parse")
logs = sc.textFile("hdfs:///home/user1/*.gz")
proc = logs.map(parser).filter(bool).reduceByKey(lambda acc, x: acc + x).map(process)
proc.saveAsTextFile('hdfs:///home/user1/output1')
I think this code does not fully utilize the Spark cluster. I would like to optimize the code to make full use of the available processing, considering:
1. What is the best way to handle gz files in HDFS with PySpark? How do I fully distribute the gz file processing across the entire cluster?
2. How do I fully utilize all the CPUs in each node for the aggregation and parsing process?
There are at least a couple of things you should consider:
If you are using YARN, the number of executors and the cores per executor that you assign to your Spark app. They can be controlled by --num-executors and --executor-cores. If you are not using YARN, your scheduler will probably have a similar mechanism to control parallelism; try looking for it.
The number of partitions in your DataFrame, which directly impacts the parallelism of your job. You can control that with repartition and/or coalesce.
Both can limit the cores used by the job and hence the use of the cluster. Also take into account that more CPUs will not necessarily mean better performance (or execution times); that will depend on the size of your cluster and the size of the problem, and I don't know of any easy rule to decide that. For me it usually comes down to experimenting with different configurations and seeing which one performs better.
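As a rough illustration of those two knobs (the numbers here are only placeholders, not a recommendation for this particular cluster; parser and process are the functions from the question):

# Submit with explicit executor settings (values are illustrative):
#   spark-submit --num-executors 20 --executor-cores 4 parse_job.py

import pyspark

sc = pyspark.SparkContext(appName="parse")

# gzip files are not splittable, so each .gz file becomes a single partition;
# repartitioning after the initial read spreads the records over more tasks.
logs = sc.textFile("hdfs:///home/user1/*.gz")
parsed = logs.map(parser).filter(bool).repartition(200)
result = parsed.reduceByKey(lambda acc, x: acc + x).map(process)
result.saveAsTextFile('hdfs:///home/user1/output1')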

Reading dataframe from multiple input paths and adding columns simultaneously

I am trying to read multiple input paths and, based on the dates in the paths, add two columns to the data frame. The files were stored as ORC partitioned by these dates using Hive, so they have a structure like
s3n://bucket_name/folder_name/partition1=value1/partition2=value2
where partition2 = mg_load_date. So here I am trying to fetch multiple directories from multiple paths, and based on the partitions I have to create two columns, namely mg_load_date and event_date, for each Spark dataframe. I read these as input and combine them after I add the two columns, finding the dates for each file respectively.
Is there any other way, since I have many reads for each file, to read all the files at once while adding the two columns for their specific rows? Or any other way to make the read operation fast, since I have many reads?
I guess reading all the files like this, sqlContext.read.format('orc').load(inputpaths), is faster than reading them individually and then merging them.
Any help is appreciated.
import re
from functools import reduce
from pyspark.sql import DataFrame
import pyspark.sql.functions as F

dfs = []
for i in input_paths:
    df = sqlContext.read.format('orc').load(i)
    date = re.search('mg_load_date=([^/]*)/$', i).group(1)
    df = df.withColumn('event_date', F.lit(date)).withColumn('mg_load_date', F.lit(date))
    dfs += [df]
df = reduce(DataFrame.unionAll, dfs)
As @user8371915 says, you should load your data from the root path instead of passing a list of subdirectories:
sqlContext.read.format('orc').load("s3n://bucket_name/folder_name/")
Then you'll have access to your partitioning columns partition1 and partition2.
If for some reason you can't load from the root path, you can try using pyspark.sql.functions.input_file_name to get the name of the file for each row of your dataframe.
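A rough sketch of that fallback (assuming every path contains an mg_load_date=... segment, as in the layout above):

import pyspark.sql.functions as F

# One read over all paths; both date columns are derived from each row's file path.
df = sqlContext.read.format('orc').load(input_paths)
date_from_path = F.regexp_extract(F.input_file_name(), 'mg_load_date=([^/]*)', 1)
df = df.withColumn('mg_load_date', date_from_path).withColumn('event_date', date_from_path)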
Spark 2.2.0+
To read from multiple folders using the ORC format:
df = spark.read.orc([path1, path2])
Ref: https://issues.apache.org/jira/browse/SPARK-12334

Ingesting 150 csv's into one data source

Hello, I am completely new to handling big data and comfortable in Python.
I have 150 CSVs, each of size 70 MB, which I have to integrate into one source to derive basic stats like unique counts, unique names and so on.
Could anyone suggest how I should go about it?
I came across a package 'pyelastic search' in Python; how feasible is it for me to use in Enthought Canopy?
Suggestions needed!
Try to use the pandas package.
Reading a single CSV would be:
import pandas as pd
df = pd.read_csv('filelocation.csv')
In case of multiple files, just concat them. Let's say ls is a list of file locations; then:
df = pd.concat([pd.read_csv(f) for f in ls])
and then to write them out as a single file, do:
df.to_csv('output.csv')
Of course all this is valid only for in-memory operations (70 MB x 150 = ~10.5 GB of RAM). If that's not possible, consider building an incremental process or using dask dataframes.
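A minimal dask sketch of that last suggestion (assuming the CSVs share a schema and sit in one folder; the folder and column name here are only illustrative):

import dask.dataframe as dd

# Lazily read all CSVs as one out-of-core dataframe; nothing is loaded yet.
df = dd.read_csv('data/*.csv')

# Stats are computed chunk by chunk when .compute() is called.
unique_names = df['name'].nunique().compute()
print(unique_names)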
One option if you are in AWS:
Step 1 - move the data to S3 (AWS native file storage)
Step 2 - create a table for each data structure in Redshift
Step 3 - run the COPY command to move the data from S3 to Redshift (AWS native DW)
The COPY command loads data in bulk and detects file name patterns.
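A hedged sketch of step 3 issued from Python (the connection details, table name, bucket prefix, and IAM role below are all placeholders):

import psycopg2

conn = psycopg2.connect(host='my-cluster.example.redshift.amazonaws.com',
                        port=5439, dbname='dev', user='user', password='...')
with conn, conn.cursor() as cur:
    # COPY bulk-loads every CSV under the prefix into the target table.
    cur.execute("""
        COPY my_table
        FROM 's3://my-bucket/csv-prefix/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
        FORMAT AS CSV
        IGNOREHEADER 1;
    """)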

Spark coalesce vs collect, which one is faster?

I am using pyspark to process 50 GB of data using AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result to be saved in one csv file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
As far as I know, I have to perform the following to force the pyspark.sql.DataFrame object to save to 1 csv file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It seems it took about 70 min to finish this process, and I am wondering if I can make it faster by using the collect() method, like:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general, but with an expected output size of around 1 MB it doesn't really matter. It simply shouldn't be the bottleneck here.
One simple improvement is to drop the loop -> filter -> union pattern and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough, then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
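Putting those two suggestions together (a sketch only: it assumes a SparkSession named spark, the hour() bucketing stands in for the question's undefined hourFilter, and the shuffle-partition value is just an example):

from pyspark.sql.functions import hour, col, mean, sum as sum_

# Fewer shuffle partitions for a small aggregated result (value is illustrative).
spark.conf.set("spark.sql.shuffle.partitions", 24)

daily_df = (df.groupby(hour("Time").alias("hour"), col("Animal"))
              .agg(mean("weights"), sum_("is_male")))

# The output is ~1 MB, so a single partition is fine for writing one CSV file.
daily_df.coalesce(1).write.csv("some_local.csv", header=True)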
To save as a single file, these are the options:
Option 1:
coalesce(1) (minimum shuffling of data over the network) or repartition(1) or collect may work for small datasets, but for large datasets it may not perform as expected, since all the data will be moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available for use than the driver.
Option 2:
Another option would be FileUtil.copyMerge() - to merge the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._

def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3:
After getting the part files, you can use the hdfs getmerge command like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide based on your requirements which one is safer/faster.
Also, you can have a look at Dataframe save after join is creating numerous part files.
