I have a dataFrame of approx. 4 million rows and 35 columns as input.
All I do to this dataFrame is the following steps:
For each column in a given list, I calculate its sum over a given list of group features and join it to my input dataFrame as a new column.
I drop each new sum column right after joining it to the dataFrame.
Therefore we end up with the same dataFrame as we started from (in theory).
However, I noticed that if my list of given columns gets too big (from more than 6 columns), the output dataFrame becomes impossible to manipulate. Even a simple display takes 10 minutes.
Here is an example of my code (df is my input dataFrame):
for c in list_columns:
    df = df.join(df.groupby(list_group_features).agg(sum(c).alias('sum_' + c)), list_group_features)
    df = df.drop('sum_' + c)
This happens due to the inner workings of Spark and its lazy evaluation.
When you call groupby, join, or agg, Spark attaches these calls to the plan of the df object. So even though nothing is executed on the data yet, you are building a large execution plan that is stored internally in the Spark DataFrame object. In the loop above, every iteration joins df with an aggregate of itself, so the previous plan is included twice and the plan grows very quickly with the number of columns.
Only when you call an action (show, count, write, etc.) does Spark optimize the plan and execute it. If the plan is too large, the optimization step can take a while. Also remember that plan optimization happens on the driver, not on the executors, so a busy or overloaded driver delays the optimization step as well.
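To make the plan growth concrete, here is a toy model in plain Python. The Plan class and the join_with_own_aggregate helper are hypothetical names invented for this sketch; they are not Spark's internals, and Spark's optimizer may simplify real plans. The sketch only assumes that a plan is a tree and that joining a DataFrame with an aggregate of itself references the previous plan twice.

```python
# Toy model of a lazily built query plan. These classes are illustrative
# only and are NOT Spark's real internals.
class Plan:
    def __init__(self, op, children=()):
        self.op = op
        self.children = tuple(children)

    def size(self):
        # Count nodes in the plan tree; a sub-plan referenced twice is counted twice.
        return 1 + sum(child.size() for child in self.children)


def join_with_own_aggregate(plan):
    # Mirrors one loop iteration: df.join(df.groupby(...).agg(...), ...).drop(...)
    aggregated = Plan("aggregate", [plan])
    joined = Plan("join", [plan, aggregated])
    return Plan("drop", [joined])


plan = Plan("scan")
sizes = [plan.size()]
for _ in range(6):
    plan = join_with_own_aggregate(plan)  # no data is touched, only the plan grows
    sizes.append(plan.size())

print(sizes)  # [1, 5, 13, 29, 61, 125, 253]
```

In this model the plan roughly doubles with every column, which would be consistent with the slowdown appearing after only a handful of iterations.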
It is useful to remember that joins are expensive operations in Spark, both for optimization and execution. If you can, you should always avoid joins when operating on a single DataFrame and utilise the window functionality instead. Joins should only be used if you are joining different dataframes from different sources (different tables).
A way to optimize your code would be:
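from pyspark.sql import Window
import pyspark.sql.functions as f
# One window per group; each sum is computed over it without any join
w = Window.partitionBy(list_group_features)
# Apply over(w) first and alias last, so each output column is named sum_<column>
agg_sum_exprs = [f.sum(f.col(c)).over(w).alias("sum_" + c) for c in list_columns]
res_df = df.select(df.columns + agg_sum_exprs)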
This should be scalable and fast for large list_group_features and list_columns lists.
I found that initializing a pandas Series object from a list of DataFrames is extremely slow, e.g. with the following code:
import pandas as pd
import numpy as np
# creating a large (~8GB) list of DataFrames.
l = [pd.DataFrame(np.zeros((1000, 1000))) for i in range(1000)]
# This line executes extremely slow and takes almost extra ~10GB memory. Why?
# It is even much, much slower than the original list `l` construction.
s = pd.Series(l)
Initially I thought the Series initialization accidentally deep-copied the DataFrames, which would make it slow, but it turned out that it just copies by reference, as the usual = in Python does.
On the other hand, if I just create a series and manually shallow copy elements over (in a for loop), it will be fast:
# This for loop is faster. Why?
s1 = pd.Series(data=None, index=range(1000), dtype=object)
for i in range(1000):
    s1[i] = l[i]
What is happening here?
Real-life usage: I have a table loader which reads something on disk and returns a pandas DataFrame (a table). To speed up the reading, I use a parallel tool (from this answer) to execute multiple reads (each read is for one date, for example), and it returns a list of tables. Now I want to transform this list into a pandas Series object with a proper index (e.g. the date or file location used in the read), but the Series construction takes a ridiculous amount of time (as the sample code above shows). I can of course write it as a for loop to work around the issue, but that would be ugly. Besides, I want to know what is really taking the time here. Any insights?
This is not a direct answer to the OP's question (what's causing the slow-down when constructing a series from a list of dataframes):
I might be missing an important advantage of using pd.Series to store a list of dataframes, however if that's not critical for downstream processes, then a better option might be to either store this as a dictionary of dataframes or to concatenate into a single dataframe.
For the dictionary of dataframes, one could use something like:
d = {n: df for n, df in enumerate(l)}
# can change the key to something more useful in downstream processes
For concatenation:
w = pd.concat(l, axis=1)
# note that when using with the snippet in this question
# the column names will be duplicated (because they have
# the same names) but if your actual list of dataframes
# contains unique column names, then the concatenated
# dataframe will act as a normal dataframe with unique
# column names
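As a rough sketch of both options with a meaningful key instead of a running number: the dates and the tiny tables below are hypothetical stand-ins for the tables returned by the loader. Passing keys= to pd.concat stacks the tables row-wise and records which key each row came from, which also sidesteps the duplicated column names mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the per-date tables returned by the loader.
dates = ["2021-01-01", "2021-01-02", "2021-01-03"]
tables = [pd.DataFrame(np.full((2, 2), i), columns=["x", "y"]) for i in range(len(dates))]

# Option 1: a dictionary keyed by date.
by_date = dict(zip(dates, tables))

# Option 2: one DataFrame with the date as the outer index level.
stacked = pd.concat(tables, keys=dates, names=["date", "row"])

# Selecting one date returns the rows of that table.
second_day = stacked.loc["2021-01-02"]
print(stacked.shape)
print(second_day)
```

Whether a dictionary or a single stacked DataFrame is more convenient depends on what the downstream code does with the tables.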
I want to use Dask for operations of the form
df.groupby(some_columns).apply(some_function)
where some_function() may compute some summary statistics, perform timeseries forecasting, or even just save the group to a single file in AWS S3.
Dask documentation states (and several other StackOverflow answers cite) that groupby-apply is not appropriate for aggregations:
Pandas’ groupby-apply can be used to apply arbitrary functions, including aggregations that result in one row per group. Dask’s groupby-apply will apply func once to each partition-group pair, so when func is a reduction you’ll end up with one row per partition-group pair. To apply a custom aggregation with Dask, use dask.dataframe.groupby.Aggregation.
It is not clear whether Aggregation supports operations on multiple columns. However, this DataFrames tutorial seems to do exactly what I'm suggesting, with roughly some_function = lambda x: LinearRegression().fit(...). The example seems to work as intended, and I've similarly had no problems so far with e.g. some_function = lambda x: x.to_csv(...).
Under what conditions can I expect that some_function will be passed all rows of the group? If this is never guaranteed, is there a way to break the LinearRegression example? And most importantly, what is the best way to handle these use cases?
It appears that the current version of documentation and the source code are not in sync. Specifically, in the source code for dask.groupby, there is this message:
Dask groupby supports reductions, i.e., mean, sum and alike, and apply. The former do not shuffle the data and are efficiently implemented as tree reductions. The latter is implemented by shuffling the underlying partitions such that all items of a group can be found in the same partition.
This is not consistent with the warning in the docs about partition-group. The snippet below and task graph visualization also show that there is shuffling of data to ensure that partitions contain all members of the same group:
import dask.dataframe as dd
import pandas as pd
df = pd.DataFrame({'group': [0,0,1,1,2,0,1,2], 'npeople': [1,2,3,4,5,6,7,8]})
ddf = dd.from_pandas(df, npartitions=2)
def myfunc(df):
    return df['npeople'].sum()
results_pandas = df.groupby('group').apply(myfunc).sort_index()
results_dask = ddf.groupby('group').apply(myfunc).compute().sort_index()
print(sum(results_dask != results_pandas))
# will print 0, so there are no differences
# between dask and pandas groupby
This is speculative, but maybe one way to work around the scenario where partition-group leads to rows from a single group split across partitions is to explicitly re-partition the data in a way that ensures each group is associated with a unique partition.
One way to achieve that is by creating an index that is the same as the group identifier column. This in general is not a cheap operation, but it can be helped by pre-processing the data in a way that the group identifier is already sorted.
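The difference between "one result per partition-group pair" and "one result per group after a shuffle" can be illustrated with a toy model in plain Python. No Dask is involved here; the partitions are ordinary lists, the data is the same as in the snippet above, and apply_per_partition is a hypothetical helper written only for this sketch.

```python
from collections import defaultdict

# Same data as the snippet above, as (group, npeople) rows.
rows = list(zip([0, 0, 1, 1, 2, 0, 1, 2], [1, 2, 3, 4, 5, 6, 7, 8]))


def apply_per_partition(partitions):
    # Apply a sum to every (partition, group) pair and collect all results.
    out = []
    for part in partitions:
        totals = defaultdict(int)
        for group, npeople in part:
            totals[group] += npeople
        out.extend(sorted(totals.items()))
    return out


# Partitioning by row position: a group can be split across partitions.
by_position = [rows[:4], rows[4:]]

# Partitioning by group key (a "shuffle"): each group lives in exactly one partition.
n_partitions = 2
by_group = [[] for _ in range(n_partitions)]
for group, npeople in rows:
    by_group[group % n_partitions].append((group, npeople))

naive_result = apply_per_partition(by_position)
shuffled_result = sorted(apply_per_partition(by_group))

print(naive_result)     # [(0, 3), (1, 7), (0, 6), (1, 7), (2, 13)]
print(shuffled_result)  # [(0, 9), (1, 14), (2, 13)]
```

Only the key-based partitioning yields one correct total per group, which is the property that setting the index to the group identifier is meant to provide.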
Let's say I have this pyspark Dataframe:
from pyspark.sql import functions as F
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('BE',), ('France',), ('Latvia',)])
And let's say I want to collect various statistics about this data. For example, I might want to know how many rows use a 2-character country code and how many use longer country names:
count_short = data.where(F.length(F.col('Country')) == 2).count()
count_long = data.where(F.length(F.col('Country')) > 2).count()
This works, but when I want to collect many different counts based on different conditions, it becomes very slow even for tiny datasets. In Azure Synapse Studio, where I am working, every count takes 1-2 seconds to compute.
I need to do 100+ counts, and it takes multiple minutes to compute for a dataset of 10 rows. And before somebody asks, the conditions for those counts are more complex than in my example. I cannot group by length or do other tricks like that.
I am looking for a general way to do multiple counts on arbitrary conditions, fast.
I am guessing that the reason for the slow performance is that for every count call, my pyspark notebook starts some Spark processes that have significant overhead. So I assume that if there was some way to collect these counts in a single query, my performance problems would be solved.
One possible solution I thought of is to build a temporary column that indicates which of my conditions have been matched, and then call countDistinct on it. But then I would have individual counts for all combinations of condition matches. I also noticed that depending on the situation, the performance is a bit better when I do data = data.localCheckpoint() before computing my statistics, but the general problem still persists.
Is there a better way?
Each "count" can be replaced by a conditional "sum", so that all conditions are evaluated in a single select (Scala):
data.select(
sum(
when(length(col("Country")) === 2, 1).otherwise(0)
).alias("two_characters"),
sum(
when(length(col("Country")) > 2, 1).otherwise(0)
).alias("more_than_two_characters")
)
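The idea behind this query is to evaluate every condition per row during a single pass, instead of running one pass per condition. A minimal sketch of that idea in plain Python, not Spark: the rows mirror the sample data from the question, and the conditions dictionary is a hypothetical structure introduced only for illustration.

```python
# Rows mirroring the sample DataFrame from the question.
rows = [("AT",), ("BE",), ("France",), ("Latvia",)]

# Arbitrary named conditions; each one maps a row to True or False.
conditions = {
    "two_characters": lambda row: len(row[0]) == 2,
    "more_than_two_characters": lambda row: len(row[0]) > 2,
}

# Single pass over the data: every condition adds 1 or 0 per row,
# which is what sum(when(cond, 1).otherwise(0)) expresses in the query.
counts = {name: 0 for name in conditions}
for row in rows:
    for name, predicate in conditions.items():
        counts[name] += 1 if predicate(row) else 0

print(counts)  # {'two_characters': 2, 'more_than_two_characters': 2}
```

Adding a new statistic then means adding one entry to the conditions, not one more pass over the data.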
While one way is to combine multiple queries into one, another is to cache the dataframe that is queried again and again.
By caching the dataframe, we avoid re-evaluating it each time count() is invoked.
data.cache()
A few things to keep in mind: if you apply multiple actions to a dataframe that is built from a lot of transformations and reads its data from an external source, you should definitely cache that dataframe before you apply the first action.
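The effect of caching can be sketched with a plain Python analogy, not Spark: load_and_transform is a hypothetical stand-in for reading an external source and applying transformations, and the reads counter shows how often that work is repeated.

```python
reads = {"source": 0}


def load_and_transform():
    # Hypothetical stand-in for reading an external source and transforming it.
    reads["source"] += 1
    return [name.strip() for name in [" AT", "BE ", "France", " Latvia "]]


# Without caching: every "action" re-runs the whole pipeline.
count_short = sum(1 for c in load_and_transform() if len(c) == 2)
count_long = sum(1 for c in load_and_transform() if len(c) > 2)
reads_without_cache = reads["source"]

# With caching: materialize once, then run every "action" on the stored result.
reads["source"] = 0
cached = load_and_transform()
cached_short = sum(1 for c in cached if len(c) == 2)
cached_long = sum(1 for c in cached if len(c) > 2)
reads_with_cache = reads["source"]

print(reads_without_cache, reads_with_cache)  # 2 1
```

Note that in Spark, cache() is itself lazy: the data is only stored when the first action runs, and later actions reuse it.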
The answer provided by @pasha701 works, but you will have to keep adding columns for every country code length you want to analyse.
You can use the code below to get the counts for all country code lengths in a single dataframe.
# Import statements
from pyspark.sql.functions import col, length
# Sample DataFrame
data = spark.createDataFrame(schema=['Country'], data=[('AT',), ('ACE',), ('BE',), ('France',), ('Latvia',)])
# Add a column that holds the length of each country code
data1 = data.withColumn("CountryLength", length(col('Country')))
# Column names for the final output
outputcolumns = ["CountryLength", "RecordsCount"]
# Convert the CountryLength column to an RDD and count the occurrences of each length with a map/reduce
countrieslength = data1.select("CountryLength").rdd.map(lambda row: (row["CountryLength"], 1)).reduceByKey(lambda a, b: a + b).toDF(outputcolumns)
# Use display (in notebook environments) or show on the dataframe to see the output
display(countrieslength)
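For the sample data above, the output has one row per length: CountryLength 2 with RecordsCount 2, CountryLength 3 with RecordsCount 1, and CountryLength 6 with RecordsCount 2 (the row order may vary).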
If you want to apply multiple filter conditions to this dataframe, you can cache it and then get the counts for different combinations of records based on the country code length.
I am new to Spark dataframes and a little confused about how they work. I have 2 similar pieces of code that take very different amounts of time to complete. Any explanation would be really helpful:
selectQuery = "SELECT * FROM items limit 1"
items_df = spark_session.sql(selectQuery) # completes in ~4 seconds
data_collect = items_df.collect() # completes in ~50 seconds
selectQuery = "SELECT * FROM items ORDER BY ingest_date limit 1"
items_df = spark_session.sql(selectQuery) # completes in ~4 seconds
data_collect = items_df.collect() # completes in ~20 minutes
My thought process over here was that, spark_session.sql(selectQuery) is the actual code that pulls data from Source and puts it in the application memory in the form of dataframe. Then collect() simply converts that dataframe into a python list.
But clearly, I see that collect() depends on the query as well.
PS: I saw an answer in this thread where a person mentions that collect() activates the actual query execution. I do not quite understand this.
Pyspark performance: dataframe.collect() is very slow
When you create a dataframe, you can apply two kinds of methods: transformations and actions.
Transformations are lazy, which means they are not executed right away. Examples of transformations are withColumn, where, and select. A transformation always returns a dataframe. When you call one, Spark only checks that it is valid. For example, withColumn is a transformation: when you apply it to your dataframe, the validity of the request is checked immediately, but nothing is executed:
from pyspark.sql import functions as F
df = spark.range(10)
df.withColumn("id_2", F.col("id") * 2) # This one is OK
df.withColumn("id_2", F.col("foo") * 2) # This one will fail because foo does not exist
Actions are the methods that execute all the transformations that have been stacked up. Actions are, for example, count, collect, show.
At the moment you apply an action, Spark retrieves the data from where it is stored, applies all the transformations you asked for previously, and returns a result that depends on the action (a list for collect, a number for count).
In your case, the creation of the dataframe is a transformation. It is lazy; only its validity is checked. That is why both queries take approximately the same amount of time.
But collecting is the moment when Spark actually retrieves all the data. So depending on the amount of data in the tables (and the number of transformations you asked for), it will take more or less time to complete.
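A plain Python analogy, not Spark, may help to see both effects: defining the pipeline reads nothing, and an ORDER BY before LIMIT 1 forces every row to be read. scan_items and order_by are hypothetical helpers, and the real timings in the question also depend on the source and the cluster.

```python
from itertools import islice

stats = {"rows_read": 0}


def scan_items(n):
    # Hypothetical stand-in for reading the `items` table one row at a time.
    for i in range(n):
        stats["rows_read"] += 1
        yield {"id": i, "ingest_date": n - i}


def order_by(rows, key):
    # Sorting cannot emit its first row before it has seen every input row.
    yield from sorted(rows, key=key)


# Like spark_session.sql(...): the pipeline is only described, nothing is read.
limited = islice(scan_items(1000), 1)
reads_after_definition = stats["rows_read"]

# Like collect(): execution happens here, and LIMIT 1 can stop after one row.
first_row = list(limited)
reads_limit_only = stats["rows_read"]

# Same pipeline with an ORDER BY in front of the LIMIT 1.
stats["rows_read"] = 0
ordered = islice(order_by(scan_items(1000), key=lambda r: r["ingest_date"]), 1)
oldest_row = list(ordered)
reads_with_order_by = stats["rows_read"]

print(reads_after_definition, reads_limit_only, reads_with_order_by)  # 0 1 1000
```

In this sketch, the query without ORDER BY stops after a single row, while the ordered one has to consume the whole table before it can return its first row.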
I have a very large (150M rows - 30GB RAM) dataframe. I do a groupby (around 40 groups) and apply a function on each group. Takes about 30 minutes to perform everything. Here was the code I used:
df = df.groupby(by='date').apply(func=my_func)
Since the operations are not interdependent, I figured I'd use Dask to parallelize the processing of each group separately.
So I use this code:
from dask import dataframe as dd
df_dask = dd.from_pandas(df_pandas, npartitions=40)
template = pd.DataFrame(columns=['A','B','C','D','E'])
df_dask = df_dask.groupby(by='date').apply(func=my_func, meta=template)
df_dask = df_dask.compute()
However, when I run this, I get different results depending on the value of npartitions I give it. If I give a value of 1, it gives me the same (and correct) results, but then it takes the same amount of time as with pandas. If I give it a higher number, it performs faster, but returns way fewer rows. I don't understand the relationship between npartitions and the groupby.
Also, if I try with a slightly larger DataFrame (40GB), Dask runs out of memory, even though I have 64GB on my machine, while pandas is fine.
Any ideas?
Dask's DataFrameGroupBy.apply applies the user-provided function to each partition: https://docs.dask.org/en/latest/dataframe-api.html#dask.dataframe.groupby.DataFrameGroupBy.apply.
If you need a custom reduction, use Aggregation (described in the Aggregate section of the docs): https://docs.dask.org/en/latest/dataframe-groupby.html#aggregate
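As far as I understand it, a custom Dask aggregation is described by a per-partition chunk step, a combining agg step, and an optional finalize step. The sketch below shows that pattern in plain Python for a per-group mean. It does not use Dask, the data is made up, and the function signatures are simplified for illustration (in Dask these steps operate on grouped pandas objects, not on lists).

```python
from collections import defaultdict

# Hypothetical data: two partitions of (group, value) rows.
partitions = [
    [("a", 1.0), ("a", 3.0), ("b", 10.0)],
    [("a", 5.0), ("b", 20.0)],
]


def chunk(partition):
    # Per-partition step: partial (sum, count) for every group seen in this partition.
    partial = defaultdict(lambda: [0.0, 0])
    for group, value in partition:
        partial[group][0] += value
        partial[group][1] += 1
    return partial


def agg(partials):
    # Combine step: merge the partial results of all partitions per group.
    combined = defaultdict(lambda: [0.0, 0])
    for partial in partials:
        for group, (total, count) in partial.items():
            combined[group][0] += total
            combined[group][1] += count
    return combined


def finalize(combined):
    # Final step: turn the combined (sum, count) into one mean per group.
    return {group: total / count for group, (total, count) in combined.items()}


means = finalize(agg(chunk(p) for p in partitions))
print(means)  # {'a': 3.0, 'b': 15.0}
```

Because each partition only contributes a small partial result, this pattern yields one row per group regardless of how the groups are spread across partitions.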