Does PySpark cache a DataFrame by default or not?

If I read a file in PySpark:
data = spark.read.csv("file.csv")
then, for the life of the Spark session, 'data' is available in memory, correct? So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct? If yes, why do I need:
data.cache()

If I read a file in PySpark:
data = spark.read.csv("file.csv")
then, for the life of the Spark session, 'data' is available in memory, correct?
No. Nothing happens at this point, because of Spark's lazy evaluation; the file is only read when an action runs, which in your case is the first call to show().
So if I call data.show() 5 times, it will not read from disk 5 times. Is that correct?
No. The DataFrame will be re-evaluated for each call to show(). Caching the DataFrame prevents that re-evaluation, forcing the data to be read from the cache instead.
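A minimal sketch of how caching changes this (the file name and options are illustrative):
data = spark.read.csv("file.csv", header=True)  # lazy: nothing is read yet
data.cache()   # marks the DataFrame for caching; still nothing is read
data.show()    # first action: reads from disk and populates the cache
data.show()    # later actions reuse the cached data instead of re-reading the file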


pyarrow write_dataset limit per partition files

Let's say I've got some data in an existing dataset that has no column partitioning but still spans 200+ files. I want to rewrite that data into a Hive-style partitioning. The dataset is too big to open in its entirety in memory. In this case the dataset came from a CETAS in Azure Synapse, but I don't think that matters.
I do:
dataset = ds.dataset(datadir, format="parquet", filesystem=abfs)
new_part = ds.partitioning(pa.schema([("nodename", pa.string())]), flavor="hive")
scanner = dataset.scanner()
ds.write_dataset(scanner, newdatarootdir,
                 format="parquet", partitioning=new_part,
                 existing_data_behavior="overwrite_or_ignore",
                 max_partitions=2467)
In this case there are 2467 unique nodenames and so I want there to be 2467 directories each with 1 file. However what I get is 2467 directories each with 100+ files of about 10KB each. How do I get just 1 file per partition?
I could do a second step
for node in nodes:
    fullpath = f"{datadir}\\{node}"
    curnodeds = ds.dataset(fullpath, format="parquet")
    curtable = curnodeds.to_table()
    os.makedirs(f"{newbase}\\{node}")
    pq.write_table(curtable, f"{newbase}\\{node}\\part-0.parquet",
                   version="2.6", flavor="spark", data_page_version="2.0")
Is there a way to incorporate the second step into the first step?
Edit:
This is with pyarrow 7.0.0
Edit(2):
Using max_open_files=3000 did result in one file per partition. Comparing the metadata of the two approaches, the two-step approach has (for one partition)...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A40>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 1
format_version: 2.6
serialized_size: 1673
size on disk: 278kb
and the one step...
<pyarrow._parquet.FileMetaData object at 0x000001C7FA414A90>
created_by: parquet-cpp-arrow version 7.0.0
num_columns: 8
num_rows: 24840
num_row_groups: 188
format_version: 1.0
serialized_size: 148313
size on disk: 1.04MB
Obviously in the two-step version I'm explicitly setting the version to 2.6, so that explains that difference. The total size of my data after the two-step approach is 655 MB vs 2.6 GB for the one-step. The time difference was pretty significant too: the two-step approach took something like 20 minutes for the first step and 40 minutes for the second, while the one-step approach took about 5 minutes for the whole thing.
The remaining question is, how do I set version="2.6" and data_page_version="2.0" in write_dataset? I still wonder why the number of row groups is so different, if it is still different once those parameters are set, but I'll defer that question.
The dataset writer was changed significantly in 7.0.0. Previously, it would always create one file per partition. Now there are a couple of settings that can cause it to write multiple files. Furthermore, it looks like you are ending up with a lot of small row groups, which is not ideal and is probably the reason the one-step process is both slower and larger.
The first significant setting is max_open_files. Some systems limit how many file descriptors can be open at one time. Linux defaults to 1024, so pyarrow defaults to ~900 (with the assumption that some file descriptors will be open for scanning, etc.). When this limit is exceeded, pyarrow will close the least recently used file. For some datasets this works well. However, if each batch has data for every file, it doesn't work well at all. In that case you probably want to increase max_open_files to be greater than your number of partitions (with some wiggle room, because you will also have some files open for reading). You may need to adjust OS-specific settings to allow this (generally these OS limits are pretty conservative and raising them is fairly harmless).
I still wonder why the number of row groups is so different, if it is still different once those parameters are set, but I'll defer that question.
In addition, the 7.0.0 release adds min_rows_per_group, max_rows_per_group and max_rows_per_file parameters to the write_dataset call. Setting min_rows_per_group to something like 1 million will cause the writer to buffer rows in memory until it has enough to write. This will allow you to create files with 1 row group instead of 188 row groups. This should bring down your file size and fix your performance issues.
However, there is a memory cost associated with this which is going to be min_rows_per_group * num_files * size_of_row_in_bytes.
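Putting these pieces together, a hedged sketch of the one-step call with those settings (the values are illustrative, and dataset, newdatarootdir and new_part are from the question above):
import pyarrow.dataset as ds

ds.write_dataset(dataset.scanner(), newdatarootdir,
                 format="parquet", partitioning=new_part,
                 existing_data_behavior="overwrite_or_ignore",
                 max_partitions=2467,
                 max_open_files=3000,            # greater than the 2467 partitions, so writers stay open
                 min_rows_per_group=1_000_000,   # buffer rows so each file gets a few large row groups
                 max_rows_per_group=1_000_000)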
The remaining question is, how to set version="2.6" and data_page_version="2.0" in write_dataset?
The write_dataset call works on several different formats (e.g. csv, ipc, orc) and so the call only has generic options that apply regardless of format.
Format-specific settings can instead be set using the file_options parameter:
# It does not appear to be documented but make_write_options
# should accept most of the kwargs that write_table does
file_options = ds.ParquetFileFormat().make_write_options(version='2.6', data_page_version='2.0')
ds.write_dataset(..., file_options=file_options)

Saving to the same parquet file in parallel using dask leading to ArrowInvalid

I am running a simulation where I compute some quantities for several time steps. For each time step I want to save a parquet file where each line corresponds to one simulation. It looks like this:
def simulation():
    nsim = 3
    timesteps = [1, 2]
    data = {}  # initialization not shown here
    for i in range(nsim):
        compute_stuff()
        for j in timesteps:
            data[str(j)] = compute_some_other_stuff()
    return data
Once I have my dict data containing the results of my simulation (as numpy arrays), I transform it into dask.DataFrame objects and then save them to file using the .to_parquet() method as follows:
def save(data):
    for i in data.keys():
        data[i] = pd.DataFrame(data[i], bins=...)
        df = from_pandas(data[i], npartitions=2)
        df.to_parquet(datafolder + i + "/", engine="pyarrow",
                      append=True, ignore_divisions=True)
When I use this code only once it works perfectly; the struggle arises when I try to run it in parallel. Using dask I do:
client = Client(n_workers=10, processes=True)

def f(n):
    data = simulation()
    save(data)

to_compute = [delayed(f)(n) for n in range(20)]
compute(to_compute)
The behaviour of this last portion of code is quite random. At some point this happens:
distributed.worker - WARNING - Compute Failed
Function: f
args: (4)
kwargs: {}
Exception: "ArrowInvalid('Parquet file size is 0 bytes')"
....
distributed.worker - WARNING - Compute Failed
Function: f
args: (12)
kwargs: {}
Exception: "ArrowInvalid('Parquet magic bytes not found in footer. Either the file is corrupted or this is not a parquet file.')"
I think these errors are due to two processes trying to write to the same parquet file at the same time, which is not handled well (as it can be for a txt file). I already tried switching to PySpark / Koalas without much success. Is there a better way to save the results as the simulation runs (in case of a crash / wall-time limit on a cluster)?
You are making a classic dask mistake of invoking the dask API from within functions that are themselves delayed. The error indicates that things are happening in parallel (which is what dask does!) that are not expected to change during processing. Specifically, a file is clearly being edited by one task while another one is reading it (not sure which).
What you probably want to do is use concat on the dataframe pieces and then make a single call to to_parquet.
Note that it seems all of your data is actually held in the client and you are using from_pandas. This seems like a bad idea, since you are missing out on one of dask's biggest features: only loading data when it is needed. You should instead load your data inside delayed functions or dask dataframe API calls.
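A minimal sketch of that pattern, assuming each simulation can return a plain pandas DataFrame (the function names, columns and output path are illustrative):
import pandas as pd
import dask.dataframe as dd
from dask import delayed

@delayed
def run_one(n):
    # run one simulation and return its results as a pandas DataFrame
    return pd.DataFrame({"sim": [n], "result": [n ** 2]})

parts = [run_one(n) for n in range(20)]
ddf = dd.from_delayed(parts)                  # lazily combine the pieces
ddf.to_parquet("results/", engine="pyarrow")  # one coordinated parquet write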

PySpark groupby applyInPandas save objects as files problem

I run PySpark 3.1 on a Windows computer in local mode on Jupyter Notebook. I call "applyInPandas" on a Spark DataFrame.
The function below applies a few data transformations to the input pandas DataFrame and trains an SGBT model. It then serializes the trained model into binary and saves it to an S3 bucket as an object. Finally it returns the DataFrame. I call this function from a Spark DataFrame grouped by two columns in the last line. I receive no error, and the returned DataFrame is the same length as the input. Data for each group is returned.
The problem is the saved model objects. There are objects saved in S3 for only 2 groups, when there were supposed to be models for each group. There is no missing/wrong data point that would cause model training to fail (I'd receive an error or warning anyway). What I have tried so far:
Replace S3 and save to local file system: The same result.
Replace "pickle" with "joblib" and "BytesIO": The same result.
Repartition before calling the function: Now I had more objects saved for different groups, but not all. [I did this by calling "val_large_df.coalesce(1).groupby('la..." in the last line.]
So I suspect this is about parallelism and distribution, but I could not figure it out. Thank you already.
def train_sgbt(pdf):
    ## Some data transformations here ##
    # Train the model
    sgbt_mdl = GradientBoostingRegressor(--Params.--).fit(--Params.--)
    sgbt_mdl_b = pickle.dumps(sgbt_mdl)  # Serialize
    # Initiate s3_client
    s3_client = boto3.client(--Params.--)
    # Put file in S3
    s3_client.put_object(Body=sgbt_mdl_b, Bucket='my-bucket-name',
                         Key="models/BT_" + str(pdf.latGroup_m[0]) + "_" + str(pdf.lonGroup_m[0]) + ".mdl")
    return pdf

dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(
    train_sgbt, schema="fcast_error double")
dummy_df.show()
Spark evaluates dummy_df lazily, so train_sgbt will only be called for the groups that are required to complete the Spark action.
The Spark action here is show(). This action prints only the first 20 rows, so train_sgbt is only called for the groups that have at least one element in those first 20 rows. Spark may evaluate more groups, but there is no guarantee of it.
One way to solve the problem is to call another action that consumes every row, for example writing the result out as csv.
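A hedged sketch of that suggestion (the output path is illustrative):
dummy_df = val_large_df.groupby("latGroup_m", "lonGroup_m").applyInPandas(
    train_sgbt, schema="fcast_error double")

# Writing is a full action: every group is processed, so every model gets
# trained and uploaded, unlike show(), which may stop after 20 rows.
dummy_df.write.mode("overwrite").csv("output/dummy_df_out")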

Python Pandas memory management

2 questions:
When we delete/drop a DataFrame and call gc.collect(), does Python free the RAM it occupied?
If we do df.to_csv(filename), will df refer to the file after that, or will the data still be held in memory (RAM)?
When we delete/drop a DataFrame and call gc.collect(), does Python free the RAM it occupied?
Yes, that space in RAM is freed.
If we do df.to_csv(filename), will df refer to the file after that, or will the data still be held in memory (RAM)?
No, it will not refer to the file; the data is still held in memory and referenced through the df variable.
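A minimal sketch of the pattern being discussed (the file name is illustrative):
import gc
import pandas as pd

df = pd.DataFrame({"a": range(1_000_000)})
df.to_csv("out.csv")   # writes a copy to disk; df still lives in RAM

del df                 # drop the reference to the DataFrame
gc.collect()           # prompt the garbage collector to reclaim the memory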

Spark coalesce vs collect, which one is faster?

I am using pyspark to process 50 GB of data on AWS EMR with ~15 m4.large cores.
Each row of the data contains some information at a specific time on a day. I am using the following for loop to extract and aggregate information for every hour. Finally I union the data, as I want my result saved in one csv file.
# daily_df is an empty pyspark DataFrame
for hour in range(24):
    hourly_df = df.filter(hourFilter("Time")).groupby("Animal").agg(mean("weights"), sum("is_male"))
    daily_df = daily_df.union(hourly_df)
To the best of my knowledge, I have to do the following to force the pyspark.sql.DataFrame object to save to one csv file (approx 1 MB) instead of 100+ files:
daily_df.coalesce(1).write.csv("some_local.csv")
It took about 70 minutes to finish this process, and I am wondering if I can make it faster by using the collect() method, like this:
daily_df_pandas = daily_df.collect()
daily_df_pandas.to_csv("some_local.csv")
Both coalesce(1) and collect are pretty bad in general, but with an expected output size of around 1 MB it doesn't really matter. It simply shouldn't be the bottleneck here.
One simple improvement is to drop the loop -> filter -> union pattern and perform a single aggregation:
df.groupby(hour("Time"), col("Animal")).agg(mean("weights"), sum("is_male"))
If that's not enough, then most likely the issue here is configuration (a good place to start could be adjusting spark.sql.shuffle.partitions, if you don't do that already).
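A hedged sketch of that single-aggregation approach, assuming "Time" is a timestamp column (column names follow the question; the aliases and output path are illustrative):
from pyspark.sql.functions import col, hour, mean, sum as sum_

daily_df = (df
    .groupby(hour("Time").alias("hour"), col("Animal"))
    .agg(mean("weights").alias("mean_weight"),
         sum_("is_male").alias("n_male")))

# One shuffle, then a single small output file
daily_df.coalesce(1).write.csv("some_local.csv", header=True)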
To save as a single file, these are the options:
Option 1:
coalesce(1) (which minimizes the data shuffled over the network), repartition(1), or collect may work for small datasets, but for large datasets it may not perform as expected, since all the data is moved to one partition on one node.
Option 1 would be fine if a single executor has more RAM available for use than the driver.
Option 2:
Another option is FileUtil.copyMerge(), which merges the outputs into a single file, as in the code snippet below.
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs._
def merge(srcPath: String, dstPath: String): Unit = {
  val hadoopConfig = new Configuration()
  val hdfs = FileSystem.get(hadoopConfig)
  FileUtil.copyMerge(hdfs, new Path(srcPath), hdfs, new Path(dstPath), false, hadoopConfig, null)
}
Option 3:
After getting the part files, you can use the hdfs getmerge command like this:
hadoop fs -getmerge /tmp/demo.csv /localmachine/tmp/demo.csv
Now you have to decide, based on your requirements, which one is safer/faster.
Also, have a look at "Dataframe save after join is creating numerous part files".
