I'm new to PySpark and Pandas UDFs. I'm running the following Pandas UDF function to jumble a column containing strings (for example, the input 'Luke' might produce 'ulek'):
import random
import pandas as pd
from pyspark.sql.functions import pandas_udf

@pandas_udf("string")
def jumble_string(column: pd.Series) -> pd.Series:
    # Shuffle each string's characters; pass None through unchanged
    return column.apply(lambda x: None if x is None else ''.join(random.sample(x, len(x))).lower())

spark_df = spark_df.withColumn("names", jumble_string("names"))
On running the above function on a large dataset, I've noticed that the execution takes unusually long.
I'm guessing the .apply function has something to do with this issue.
Is there any way I can rewrite this function so it executes efficiently on a big dataset?
Please advise.
As the .apply method is not vectorized, the operation is performed by looping through the elements, which slows down execution as the data size grows.
For small data the time difference is usually negligible. However, as the size increases, the difference starts to become noticeable; since we are likely to deal with vast amounts of data, time should always be taken into consideration.
You can read more about apply vs. vectorized operations here.
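To illustrate the gap on a toy example (not the jumbling operation itself, which has no vectorized equivalent in pandas), compare an element-wise apply with the equivalent vectorized expression:

import numpy as np
import pandas as pd

s = pd.Series(np.arange(1_000_000))

slow = s.apply(lambda x: x * 2)  # Python-level loop over every element
fast = s * 2                     # one vectorized NumPy operation over the whole array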
Therefore I decided to use a list comprehension, which did improve performance marginally.
@pandas_udf("string")
def jumble_string(column: pd.Series) -> pd.Series:
    return pd.Series([None if x is None else ''.join(random.sample(x, len(x))).lower() for x in column])
There is no built-in Spark function that jumbles the strings in a column, so we are forced to resort to UDFs or Pandas UDFs.
Your solution is actually quite nice; perhaps we can improve it by removing the .apply method and using only base Python on the string in each row.
@pandas_udf("string")
def jumble_string_new(column: pd.Series) -> pd.Series:
    x = column.iloc[0]  # each pd.Series is made of only one element
    if x is None:
        return pd.Series([None])
    else:
        return pd.Series([''.join(random.sample(x, len(x))).lower()])
The results are the same as with your function; however, I've not been able to test it on a very large dataframe. Try it yourself and see whether it's computationally more efficient.
I have a pandas dataframe that is 13.4 GB in size. Since it is a multi-index dataframe whose first level varies in length, I would like to convert it to a ragged tensor. Here is how I do it:
import tensorflow as tf

def convert(df):
    # Turn each first-level group into a tensor, then stack the groups raggedly
    X = df.groupby(level=0).apply(tf.convert_to_tensor)
    X = X.tolist()
    X = tf.ragged.stack(X)
    return X
However, due to the enormous size of the data, this process has become absolutely intractable.
I have found out the following:
By far most of the time is being spent in the line "X = tf.ragged.stack(X)"
Neither tf.ragged.constant nor tf.RaggedTensor.from_row_lengths is faster.
Any advice would be much appreciated!
What I thought I had found was wrong: using factory functions such as tf.RaggedTensor.from_row_lengths is tremendously faster, especially on large datasets. I think that's because these functions require you to encode the data manually, whereas tf.ragged.constant forces TensorFlow to calculate the encoding itself. Also, they are implemented in C instead of Python. However, be sure to set the argument validate to False, because the difference in time is enormous.
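A minimal sketch of that manual encoding, on hypothetical toy groups standing in for the groupby output:

import numpy as np
import tensorflow as tf

# Hypothetical per-group arrays standing in for the groupby output
groups = [np.arange(3), np.arange(5), np.arange(2)]

# Encode the ragged structure manually: flat values plus one length per group
values = np.concatenate(groups)
row_lengths = [len(g) for g in groups]

# validate=False skips the consistency checks, which dominate on large inputs
rt = tf.RaggedTensor.from_row_lengths(values, row_lengths, validate=False)
print(rt)  # <tf.RaggedTensor [[0, 1, 2], [0, 1, 2, 3, 4], [0, 1]]>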
Introduction
I am not very certain the title is clear. I am not a native English speaker, so if someone has a better summary for what this post is about, please edit!
Environment
python 3.5.2
pyspark 2.3.0
The context
I have a Spark dataframe. This data, before being written to Elasticsearch, gets transformed.
In my case, I have two transformations. They are map functions on the dataframe's RDD.
However, instead of hard-coding them, I want my function (the one that handles the data transformation) to accept X functions that will be applied one by one: the first to the dataframe, and each subsequent one to the result of the previous transformation.
Initial work
This is the previous, hard-coded state, which is not what I want:
df.rdd.map(transfo1) \
    .map(transfo2) \
    .saveAsNewAPIHadoopFile
What I have so far
def write_to_index(self, transformation_functions: list, dataframe):
    # stuff
    for transfo in transformation_functions:
        dataframe = dataframe.rdd.map(transfo)
    dataframe.saveAsNewAPIHadoopFile
However, this has a problem: if the return value of the first transformation is not a dataframe, it fails on the second iteration of the loop because the resulting object does not have an rdd property.
Working solution
object_to_process = dataframe.rdd
for transfo in transformation_functions:
    object_to_process = object_to_process.map(transfo)
object_to_process.saveAsNewAPIHadoopFile
The solution above seems to work (it doesn't throw any errors, at least). But I would like to know if there is a more elegant solution or a built-in Python solution for this.
You can use this one-liner:
from functools import reduce

def write_to_index(self, transformation_functions: list, dataframe):
    reduce(lambda x, y: x.map(y), transformation_functions, dataframe.rdd).saveAsNewAPIHadoopFile
which, if written verbosely, should be identical to
dataframe.rdd.map(transformation_functions[0]) \
    .map(transformation_functions[1]) \
    .map(...) \
    .saveAsNewAPIHadoopFile
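To see why the reduce call composes the maps correctly, here is a minimal sketch of the same pattern on a plain Python list instead of an RDD (the toy transformations are made up, not from the original post):

from functools import reduce

transformations = [lambda x: x + 1, lambda x: x * 2]

# Each step maps one function over the result of the previous step
result = reduce(lambda data, f: list(map(f, data)), transformations, [1, 2, 3])
print(result)  # [4, 6, 8]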
I am wondering about the most efficient way of creating a column in a pandas dataframe that returns 1 if id_row exists in a given list and 0 otherwise.
I'm currently using a lambda function with apply. My problem is that it takes a long time, as my dataframe is around 2M rows and the list it checks against has between 100k and 200k items. If I'm not mistaken this is quadratic time (I'm really not sure, though), which in this case runs really slowly given the size of the objects.
The worst part is that I have to repeat this bit of code for over 100 other (different) dataframes.
Here is the function:
lst_to_add = [1, 2, 3, ..., n]
df_table['TEST'] = df_table['id_row'].apply(lambda x: 1 if x in lst_to_add else 0)
I'm wondering how I could make this bit of code (way) more efficient.
I thought of a 'divide and conquer' solution using a recursive function, perhaps, but I'm really open to any suggestions.
One last thing: I also have memory constraints, so I'd prefer a method that takes a little more time but less memory over the alternative (if I had a choice).
You could do
df_table['TEST'] = (df_table['id_row'].isin(lst_to_add)).astype(int)
This code checks whether each id_row value is in lst_to_add, returning True or False, which astype(int) converts to 1s and 0s. Since this approach is vectorized (it acts on the entire series), it should be faster than using apply.
As far as time complexity goes, your list should be a set: that turns your O(M*N) solution into O(N), since set membership tests are constant time instead of linear time (as they are for lists). Then use the built-in method .isin:
lst_to_add = set(lst_to_add)
df_table['TEST'] = df_table['id_row'].isin(lst_to_add)
If memory is an issue and you only want 0 and 1, you should stick with the boolean type.
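A minimal sketch combining both suggestions on toy data (the names mirror the question; the values are made up):

import pandas as pd

df_table = pd.DataFrame({'id_row': [1, 5, 7, 2]})
lst_to_add = {1, 2, 3}  # a set makes each membership test O(1)

df_table['TEST'] = df_table['id_row'].isin(lst_to_add).astype(int)
print(df_table['TEST'].tolist())  # [1, 0, 0, 1]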
We can think of applying two types of functions to a pandas Series: transformations and aggregations. The documentation makes this distinction: transformations map individual values in the Series, while aggregations somehow summarize the entire Series (e.g. mean).
It is clear how to apply transformations using apply, but I have not been successful in implementing a custom aggregation. Note that groupby is not involved; aggregation does not require a groupby.
I am working with the following case: I have a Series in which each row is a list of strings. One way I could aggregate this data is to count the number of appearances of each string and return the 5 most common terms.
def top_five_strings(series):
    # Count occurrences of each string across all rows
    counter = {}
    for row in series:
        for s in row:
            if s in counter:
                counter[s] += 1
            else:
                counter[s] = 1
    # Sort by count, descending, and keep the five most common
    return sorted(counter.items(), key=lambda x: x[1], reverse=True)[:5]
If I call this function as top_five_strings(series), it works fine, analogous to calling np.mean(series) on a numeric series. The difference, however, is that I can also do series.agg(np.mean) and get the same result. If I do series.agg(top_five_strings), I instead get the top five letters in each row of the Series (which makes sense if a single row is the argument of the function).
I think the critical difference is that np.mean is a NumPy ufunc, but I haven't been able to work out how the _aggregate helper function in the pandas source works.
I'm left with two questions:
1) Can I implement this by making my Python function a ufunc (and if so, how)?
2) Is this a stupid thing to do? I haven't found anyone else trying to do something like this. It seems to me that it would be quite nice to be able to implement custom aggregations as well as custom transformations within the pandas framework (e.g. getting a Series as a result, as one might with df.describe).
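Not from the original post, but one workaround worth noting: Series.pipe always passes the whole Series to the function, sidestepping agg's element-wise fallback. A minimal sketch using the top_five_strings function defined above:

import pandas as pd

series = pd.Series([['a', 'b', 'a'], ['b', 'c'], ['a']])

# pipe calls top_five_strings(series) on the whole Series, never per element
print(series.pipe(top_five_strings))  # [('a', 3), ('b', 2), ('c', 1)]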
I'm trying to optimize a piece of software written in Python using pandas DFs. The algorithm takes a pandas DF as input, can't be distributed, and outputs a metric for each client.
Maybe it's not the best solution, but my time-efficient approach is to load all files in parallel and then build a DF for each client.
This works fine, BUT a few clients have a really HUGE amount of data, so I need to save memory when creating their DFs.
To do this, I perform a groupBy() (actually a combineByKey, but logically it's a groupBy), and then for each group (now a single row of an RDD) I build a list and, from it, a pandas DF.
However, this makes many copies of the data (RDD rows, list, and pandas DF...) in a single task/node and crashes, and I would like to eliminate those extra copies on a single node.
I was thinking of a "special" combineByKey with the following pseudo-code:
def createCombiner(val):
    return [val]

def mergeCombinerVal(x, val):
    x.append(val)
    return x

def mergeCombiners(x, y):
    # Not checking whether y is already a pandas DF, but we could do that too
    if isinstance(x, list):
        pandasDF = pd.DataFrame(data=x, columns=myCols)
        return pandasDF.append(y)
    else:
        return x.append(y)
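For reference, a sketch of how these three functions would plug into combineByKey, assuming a hypothetical rdd of (client_id, row) pairs:

# Hypothetical usage: rdd contains (client_id, row) pairs
per_client = rdd.combineByKey(createCombiner, mergeCombinerVal, mergeCombiners)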
My question: the docs say nothing about this, but does anyone know whether it's safe to assume this will work? (The data type returned by merging two combiners is not the same as the combiner's.) I could control the data type in mergeCombinerVal too, if the number of "bad" calls is marginal, but appending to a pandas DF row by row would be very inefficient.
Any better idea for accomplishing what I want to do?
Thanks!
PS: Right now I'm packing Spark Rows; would switching from Spark Rows to Python lists without column names help reduce memory usage?
Just writing my comment as an answer.
In the end I used a regular combineByKey. It's faster than groupByKey (I don't know the exact reason; I guess it helps to pack the rows, because my rows are small but there are many of them), and it also lets me group them into a "real" Python list (groupByKey groups into some kind of Iterable that pandas doesn't support, forcing me to create another copy of the structure, which doubles memory usage and crashes), which helps with memory management when packing them into pandas/C data types.
Now I can use those lists to build a dataframe directly, without any extra transformation (I don't know what structure Spark's groupByKey "list" is, but pandas won't accept it in the constructor).
Still, my original idea should have used a little less memory (at most 1x DF + 0.5x list, while now I have 1x DF + 1x list), but as user8371915 said it's not guaranteed by the API/docs... better not to put that into production :)
For now, my biggest clients fit into a reasonable amount of memory. I process most of my clients in a very parallel, low-memory-per-executor job and the biggest ones in a not-so-parallel, high-memory-per-executor job. I decide based on a pre-count I perform.
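A hypothetical sketch of that pre-count routing (THRESHOLD and the job split are assumptions, not from the original post):

# Hypothetical: count rows per client, then route big and small clients
# to differently-provisioned jobs. THRESHOLD is an assumed tuning knob.
counts = rdd.countByKey()
big_clients = {k for k, v in counts.items() if v > THRESHOLD}
small_rdd = rdd.filter(lambda kv: kv[0] not in big_clients)
big_rdd = rdd.filter(lambda kv: kv[0] in big_clients)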