I'm new to Koalas (PySpark). I tried to use Koalas for a parallel apply, but it seemed to use a single core for the whole operation (correct me if I'm wrong), so I ended up using Dask for the parallel apply (with map_partitions), which worked pretty well.
However, I would like to know if there's a way to utilize Koalas for parallel apply.
I used basic code like the following.
import pandas as pd
import databricks.koalas as ks
my_big_data = ks.read_parquet('my_big_file')  # the file is a single-partition parquet file
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep)  # my_prep does string operations
my_big_data.to_parquet('my_big_file_modified')  # triggers the computation, because Koalas evaluates lazily
I found a link that discusses this problem: https://github.com/databricks/koalas/issues/1280
If the number of rows the function is applied to is below compute.shortcut_limit (1,000 by default), Koalas takes a shortcut and runs the operation with a plain pandas dataframe.
The user-defined function my_prep above is applied row by row, so single-core pandas was being used.
To force it to run in the PySpark (parallel) manner, change the configuration as below.
import databricks.koalas as ks
ks.set_option('compute.default_index_type','distributed') # when .head() call is too slow
ks.set_option('compute.shortcut_limit', 1)  # Koalas will use PySpark instead of the pandas shortcut
Also, explicitly specifying the return type (a type hint) in the user-defined function makes Koalas skip the shortcut path and run in parallel.
def my_prep(row) -> str:
    return row
kdf['my_column'].apply(my_prep)
I'm trying to understand how to leverage cache() to improve my performance. Since cache retains a DataFrame in memory "for reuse", it seems I need to understand the conditions that eject the DataFrame from memory to better understand how to leverage it.
After defining transformations, I call an action. Once the action completes, is the dataframe gone from memory? This would imply that if I execute an action on a dataframe but then continue to do other things with the data, all the previous parts of the DAG, from the read to the action, will need to be redone.
Is this accurate?
The fact is that each action creates a new DAG. You can check this in the Spark UI.
In code, you can identify where one DAG ends and a new one starts by looking for actions.
When looking at code you can use this simple rule:
When a function transforms one df into another df, it is not an action but only a lazily evaluated transformation, even if it is a join or something else that requires shuffling.
When a function returns a value other than a df, you are using an action (for example count, which returns a Long).
Let's take a look at this code (it's Scala, but the API is similar). I created this example just to show you the mechanism; this could be done better, of course :)
import org.apache.spark.sql.functions.{col, lit, format_number}
import org.apache.spark.sql.{DataFrame, Row}
val input = spark.read.format("csv")
  .option("header", "true")
  .load("dbfs:/FileStore/shared_uploads/***#gmail.com/city_temperature.csv")
val dataWithTempInCelcius: DataFrame = input
  .withColumn("avg_temp_celcius", format_number((col("AvgTemperature").cast("integer") - lit(32)) / lit(1.8), 2).cast("Double"))
val dfAfterGrouping: DataFrame = dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .groupBy(col("Year"), col("Month"))
  .max("avg_temp_celcius") // Not an action: we are still transforming one df into another df
val maxTemp: Row = dfAfterGrouping
  .orderBy(col("Year"))
  .first() // <--- first is our first action; its output is a Row, not a df
dataWithTempInCelcius
  .filter(col("City") === "Warsaw")
  .count() // <--- count is our second action
Here you can see the problem. Between the first and the second action I repeat a transformation that was already done in the first DAG. This intermediate result was not cached, so in the second DAG Spark cannot get the filtered dataframe from memory, which leads to recomputation: Spark fetches the data again, applies the filter and then calculates the count.
In the Spark UI you will find two separate DAGs, and both of them read the source csv.
If you cache the intermediate result after the first .filter(col("City") === "Warsaw") and then use this cached DF for both the grouping and the count, you will still find two separate DAGs (the number of actions has not changed), but this time the plan for the second DAG shows "In memory table scan" instead of a read of the csv file, which means that Spark is reading the data from the cache.
The plan now contains an in-memory relation. There is still a read-csv node in the DAG, but for the second action it is skipped (0 bytes read).
** I am using a Databricks cluster with Spark 3.2; the Spark UI may look different in your environment.
Quote...
Using cache() and persist() methods, Spark provides an optimization mechanism to store the intermediate computation of a Spark DataFrame so they can be reused in subsequent actions.
See https://sparkbyexamples.com/spark/spark-dataframe-cache-and-persist-explained/
I want to use Dask for operations of the form
df.groupby(some_columns).apply(some_function)
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.
Dask documentation states (and several other StackOverflow answers cite) that groupby-apply is not appropriate for aggregations:
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
It is not clear whether Aggregation supports operations on multiple columns. However, this DataFrames tutorial seems to do exactly what I'm suggesting, with roughly some_function = lambda x: LinearRegression().fit(...). The example seems to work as intended, and I've similarly had no problems so far with e.g. some_function = lambda x: x.to_csv(...).
Under what conditions can I expect that some_function will be passed all rows of the group? If this is never guaranteed, is there a way to break the LinearRegression example? And most importantly, what is the best way to handle these use cases?
It appears that the current version of documentation and the source code are not in sync. Specifically, in the source code for dask.groupby, there is this message:
Dask groupby supports reductions, i.e., mean, sum and alike, and apply. The former do not shuffle the data and are efficiently implemented as tree reductions. The latter is implemented by shuffling the underlying partitions such that all items of a group can be found in the same partition.
This is not consistent with the warning in the docs about partition-group. The snippet below and task graph visualization also show that there is shuffling of data to ensure that partitions contain all members of the same group:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'group': [0,0,1,1,2,0,1,2], 'npeople': [1,2,3,4,5,6,7,8]})
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    return df['npeople'].sum()
results_pandas = df.groupby('group').apply(myfunc).sort_index()
results_dask = ddf.groupby('group').apply(myfunc).compute().sort_index()
print(sum(results_dask != results_pandas))
# will print 0, so there are no differences
# between dask and pandas groupby
This is speculative, but maybe one way to work around the scenario where partition-group leads to rows from a single group split across partitions is to explicitly re-partition the data in a way that ensures each group is associated with a unique partition.
One way to achieve that is by creating an index that is the same as the group identifier column. This in general is not a cheap operation, but it can be helped by pre-processing the data in a way that the group identifier is already sorted.
I'm trying to rewrite my Python script (Pandas) with PySpark, but I can't find a way to apply my Pandas functions efficiently using PySpark functions.
My functions are the following:
def decompose_id(id_flight):
    my_id = id_flight.split("_")
    Esn = my_id[0]
    Year = my_id[3][0:4]
    Month = my_id[3][4:6]
    return Esn, Year, Month
def reverse_string(string):
    stringlength = len(string)  # calculate the length of the string
    slicedString = string[stringlength::-1]  # slice backwards from the end
    return slicedString
I would like to apply the first function to a column of a dataframe (in Pandas I get a row of three elements).
The second function is used when a condition on a column of the DataFrame is met.
Is there a way to apply them using PySpark dataframes?
You could apply these functions as UDFs to a Spark column, but that is not very efficient.
Here are the built-in functions you need to perform your task:
reverse : use it to replace your function reverse_string
split : use it to replace my_id=id_flight.split("_")
getItem : use it to get an item from the split list, e.g. my_id[3]
substr : use it to replace the Python slicing [0:4]
Just combine these Spark functions to recreate the same behavior.
If you want to leverage pandas functionality, one way would be to use the Pandas API along with groupBy.
It lets you treat each of your groupBy groups as a pandas dataframe on which you can apply your functions.
However, since it's Spark, schema enforcement is pretty much necessary, as you will see in the examples provided in the link.
An implementation example can be found here
For trivial tasks like reversing a string, opt for built-in Spark functions; otherwise use UDFs.
I have a working lambda function in Python that computes the highest similarity between each string in dataset1 and the strings in dataset2. During an iteration, it writes the string, the best match and the similarity together with some other information to bigquery. There is no return value, as the purpose of the function is to insert a row into a bigquery dataset. This process takes rather long which is why I wanted to use Pyspark and Dataproc to speed up the process.
Converting the pandas dataframes to spark was easy. I am having trouble registering my udf, because it has no return value and pyspark requires one. In addition, I don't understand how to map the 'apply' function in Python to its pyspark equivalent. So basically my question is how to transform the Python code below to work on a spark dataframe.
The following code works in a regular Python environment:
def embargomatch(name, code, embargo_names):
    # find best match
    # insert best match and additional information to bigquery
    pass
customer_names.apply(lambda x: embargomatch(x['name'], x['customer_code'], embargo_names), axis=1)
Because pyspark requires a return type, I added 'return 1' to the udf and tried the following:
customer_names = spark.createDataFrame(customer_names)
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
embargo_match_udf = udf(lambda x: embargomatch(x['name'], x['customer_code'], embargo_names), IntegerType())
Now I'm stuck trying to apply the select function, as I don't know what parameters to give.
I suspect you're stuck on how to pass multiple columns to the udf -- here's a good answer to that question: Pyspark: Pass multiple columns in UDF.
Rather than creating a udf based on a lambda that wraps your function, consider simplifying by creating a udf based on embargomatch directly.
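from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
embargo_names = ...
# The parameters here are the columns passed into the udf
def embargomatch(name, customer_code):
    pass
embargo_match_udf = udf(embargomatch, IntegerType())
# Pass each column as its own argument so that it matches the two parameters above
customer_names.select(embargo_match_udf('name', 'customer_code').alias('column_name'))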
That being said, it's suspect that your udf doesn't return anything -- I generally see udfs as a way to add columns to the dataframe, but not to have side effects. If you want to insert records into bigquery, consider doing something like this:
import os
customer_names.select('column_name').write.parquet('gs://some/path')
os.system("bq load --source_format=PARQUET [DATASET].[TABLE] gs://some/path")
I have a Spark application with several points where I would like to persist the current state. This is usually after a large step, or when caching a state that I would like to use multiple times. It appears that when I call cache on my dataframe a second time, a new copy is cached in memory. In my application, this leads to memory issues when scaling up. Even though a given dataframe is at most about 100 MB in my current tests, the cumulative size of the intermediate results grows beyond the allotted memory on the executor. See below for a small example that shows this behavior.
cache_test.py:
from pyspark import SparkContext, HiveContext
spark_context = SparkContext(appName='cache_test')
hive_context = HiveContext(spark_context)
df = (hive_context.read
      .format('com.databricks.spark.csv')
      .load('simple_data.csv')
      )
df.cache()
df.show()
df = df.withColumn('C1+C2', df['C1'] + df['C2'])
df.cache()
df.show()
spark_context.stop()
simple_data.csv:
1,2,3
4,5,6
7,8,9
Looking at the application UI, there is a copy of the original dataframe in addition to the one with the new column. I can remove the original copy by calling df.unpersist() before the withColumn line. Is this the recommended way to remove a cached intermediate result (i.e. call unpersist before every cache())?
Also, is it possible to purge all cached objects? In my application, there are natural breakpoints where I can simply purge all memory and move on to the next file. I would like to do this without creating a new Spark application for each input file.
Thank you in advance!
Spark 2.x
You can use Catalog.clearCache:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
...
spark.catalog.clearCache()
Spark 1.x
You can use SQLContext.clearCache method which
Removes all cached tables from the in-memory cache.
from pyspark.sql import SQLContext
from pyspark import SparkContext
sqlContext = SQLContext.getOrCreate(SparkContext.getOrCreate())
...
sqlContext.clearCache()
We use this quite often
for (id, rdd) in sc._jsc.getPersistentRDDs().items():
    rdd.unpersist()
    print("Unpersisted {} rdd".format(id))
where sc is a sparkContext variable.
When you call cache on a dataframe, it behaves like a transformation: it is evaluated lazily, when you perform an action on the dataframe such as count() or show().
In your case, you call show() after the first cache, which is why the dataframe is cached in memory. You then apply another transformation to add a column, cache the new dataframe and call show() again, which caches the second dataframe in memory as well. If the available memory is only large enough to hold one dataframe, caching the second dataframe evicts the first one from memory, because there is not enough space to hold both.
Thing to keep in mind: you should not cache a dataframe unless you use it in multiple actions; otherwise it is pure overhead in terms of performance, as caching itself is a costly operation.