Using pandas functions with Pyspark - python

I'm trying to rewrite my Python (pandas) script with PySpark, but I can't find a way to replace my pandas functions with more efficient PySpark equivalents.
My functions are the following:
def decompose_id(id_flight):
    my_id = id_flight.split("_")
    Esn = my_id[0]
    Year = my_id[3][0:4]
    Month = my_id[3][4:6]
    return Esn, Year, Month

def reverse_string(string):
    stringlength = len(string)  # length of the string
    slicedString = string[stringlength::-1]  # reverse by slicing
    return slicedString
I would like to apply the first function to a column of a DataFrame (in pandas I get three elements per row).
The second function is used when a condition on a column of the DataFrame is met.
Is there a way to apply them using PySpark DataFrames?

You could apply these functions as a UDF on a Spark column, but that is not very efficient.
Here are the built-in functions you need to perform your task:
reverse: use it to replace your reverse_string function
split: use it to replace my_id = id_flight.split("_")
getItem: use it to get an item from the split list, e.g. my_id[3]
substr: use it to replace the Python slicing [0:4]
Just combine these Spark functions to recreate the same behavior.
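A minimal sketch of that combination, assuming the ids live in a column named id_flight and the condition is a hypothetical boolean flag column needs_reverse (both names are assumptions):

from pyspark.sql import functions as F

parts = F.split(F.col("id_flight"), "_")
df = (
    df.withColumn("Esn", parts.getItem(0))
      .withColumn("Year", parts.getItem(3).substr(1, 4))   # substr is 1-based
      .withColumn("Month", parts.getItem(3).substr(5, 2))
)

# reverse the id only where the (hypothetical) flag column is set
df = df.withColumn(
    "id_reversed",
    F.when(F.col("needs_reverse"), F.reverse(F.col("id_flight"))).otherwise(F.col("id_flight")),
)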

If you want to leverage pandas functionality, one way would be to use the Pandas API along with groupBy.
It provides a way to treat each of your groupBy sets as a pandas DataFrame on which you can implement your functions.
However, since it is Spark, schema enforcement is pretty much necessary, as you will see when you go through the examples provided in the link as well.
An implementation example can be found here.
For trivial tasks, like reversing a string, opt for the built-in Spark functions; otherwise, use UDFs.
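For illustration (this is not the linked example), a minimal groupBy().applyInPandas sketch could look like the following; the grouping key Esn, the column names, and the schema are assumptions carried over from the question:

def reverse_ids(pdf):
    # pdf is a plain pandas DataFrame holding one group
    pdf["id_reversed"] = pdf["id_flight"].str[::-1]
    return pdf

result = (
    df.groupBy("Esn")
      .applyInPandas(reverse_ids, schema="id_flight string, Esn string, id_reversed string")
)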

Related

Using apply on a pandas DataFrame with a function that returns another DataFrame

I have a function that takes a Series as input and returns a DataFrame. I want to run this function on every row in a DataFrame and collect all the rows from all the returned DataFrames into a single DataFrame. I know that whenever you want to do something over all rows, apply is the go-to function, but because the function returns a DataFrame, I'm not sure how to use apply to produce the result I need. What is the best way to do this? Is this where the itertuples function that I always see "don't use this, use apply instead" should actually be used?
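One hedged way to do this, shown with a made-up row_to_df function and toy data, is to skip apply and concatenate the per-row DataFrames with pd.concat:

import pandas as pd

def row_to_df(row):
    # made-up example: expand each input row into two output rows
    return pd.DataFrame({"id": [row["id"], row["id"]], "value": [row["value"], row["value"] * 2]})

df = pd.DataFrame({"id": [1, 2], "value": [10, 20]})
result = pd.concat([row_to_df(row) for _, row in df.iterrows()], ignore_index=True)

itertuples works the same way inside the list comprehension (with attribute access instead of row["id"]) and is usually faster than iterrows.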

Pandas parallel apply with koalas (pyspark)

I'm new to Koalas (PySpark), and I was trying to use Koalas for a parallel apply, but it seemed like it was using a single core for the whole operation (correct me if I'm wrong). I ended up using Dask for the parallel apply (using map_partitions), which worked pretty well.
However, I would like to know if there's a way to use Koalas for a parallel apply.
I used basic code like the following:
import pandas as pd
import databricks.koalas as ks

my_big_data = ks.read_parquet('my_big_file')  # file is a single-partition parquet file
my_big_data['new_column'] = my_big_data['string_column'].apply(my_prep)  # my_prep does string operations
my_big_data.to_parquet('my_big_file_modified')  # write the result out, since Koalas evaluates lazily
I found a link that discusses this problem: https://github.com/databricks/koalas/issues/1280
If the number of rows the function is applied to is less than 1,000 (the default value), a pandas DataFrame is used to do the operation.
The user-defined function above, my_prep, is applied to each row, so single-core pandas was being used.
In order to force it to work in a PySpark (parallel) manner, the user should modify the configuration as below.
import databricks.koalas as ks

ks.set_option('compute.default_index_type', 'distributed')  # when the .head() call is too slow
ks.set_option('compute.shortcut_limit', 1)  # so Koalas uses PySpark instead of the pandas shortcut
Also, explicitly specifying the return type (a type hint) on the user-defined function keeps Koalas from taking the shortcut path and makes the apply run in parallel.
def my_prep(row) -> str:
    return row

kdf['my_column'].apply(my_prep)

Unable to access data using pandas composite index

I'm trying to organise data using a pandas dataframe.
Given the structure of the data, it seems logical to use a composite index: 'league_id' and 'fixture_id'. I believe I have implemented this according to the examples in the docs, but I am unable to access the data using the index.
My code can be found here;
https://repl.it/repls/OldCorruptRadius
** I am very new to Pandas and programming in general, so any advice would be much appreciated! Thanks! **
For multi-indexing, you would need to use the pandas MultiIndex API, which brings its own learning curve; thus, I would not recommend it for beginners. Link: https://pandas.pydata.org/pandas-docs/stable/user_guide/advanced.html
The way that I use multi-indexing is only to display a final product to others (i.e. making it easy/pretty to view). Before the multi-indexing, you filter the fixture_id and league_id as columns first:
df = pd.DataFrame(fixture, columns=features)
df[(df['fixture_id'] == 592) & (df['league_id'] == 524)]
This way, you are still effectively targeting the same rows you would have targeted if you had gone through with multi-indexing the two columns.
If you have to use multi-indexing, try transposing the pandas DataFrame (df.T). This turns the index into columns and vice versa. For example, you can do something like this:
df = pd.DataFrame(fixture, columns=features).set_index(['league_id', 'fixture_id'])
df.T[524][592].loc['event_date'] # gets you the row of `event_dates`
df.T[524][592].loc['event_date'].iloc[0] # gets you the first instance of event_dates
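For comparison, direct label-based access on the composite index itself would look roughly like this (a minimal sketch with made-up data; the column names follow the question):

import pandas as pd

fixture = [
    {"league_id": 524, "fixture_id": 592, "event_date": "2019-08-09", "status": "Finished"},
    {"league_id": 524, "fixture_id": 593, "event_date": "2019-08-10", "status": "Finished"},
]
df = pd.DataFrame(fixture).set_index(["league_id", "fixture_id"])

row = df.loc[(524, 592)]                  # one row, as a Series
date = df.loc[(524, 592), "event_date"]   # a single cell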

Transforming Python Lambda function without return value to Pyspark

I have a working lambda function in Python that computes the highest similarity between each string in dataset1 and the strings in dataset2. During an iteration, it writes the string, the best match and the similarity together with some other information to bigquery. There is no return value, as the purpose of the function is to insert a row into a bigquery dataset. This process takes rather long which is why I wanted to use Pyspark and Dataproc to speed up the process.
Converting the pandas DataFrames to Spark was easy. I am having trouble registering my UDF, because it has no return value and PySpark requires one. In addition, I don't understand how to map the pandas 'apply' function to the PySpark variant. So basically my question is how to transform the Python code below to work on a Spark DataFrame.
The following code works in a regular Python environment:
def embargomatch(name, code, embargo_names):
    # find best match
    # insert best match and additional information to bigquery
    ...

customer_names.apply(lambda x: embargomatch(x['name'], x['customer_code'], embargo_names), axis=1)
Because pyspark requires a return type, I added 'return 1' to the udf and tried the following:
customer_names = spark.createDataFrame(customer_names)

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

embargo_match_udf = udf(lambda x: embargomatch(x['name'], x['customer_code'], embargo_names), IntegerType())
Now I'm stuck trying to apply the select function, as I don't know what parameters to give it.
I suspect you're stuck on how to pass multiple columns to the udf -- here's a good answer to that question: Pyspark: Pass multiple columns in UDF.
Rather than creating a udf based on a lambda that wraps your function, consider simplifying by creating a udf based on embargomatch directly.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

embargo_names = ...

# The parameters here are the columns passed into the udf
def embargomatch(name, customer_code):
    pass

embargo_match_udf = udf(embargomatch, IntegerType())
customer_names.select(embargo_match_udf('name', 'customer_code').alias('column_name'))
That being said, it's suspect that your udf doesn't return anything -- I generally see udfs as a way to add columns to the dataframe, but not to have side effects. If you want to insert records into bigquery, consider doing something like this:
customer_names.select('column_name').write.parquet('gs://some/path')
os.system("bq load --source_format=PARQUET [DATASET].[TABLE] gs://some/path")

basic groupby operations in Dask

I am attempting to use Dask to handle a large file (50 gb). Typically, I would load it in memory and use Pandas. I want to groupby two columns "A", and "B", and whenever column "C" starts with a value, I want to repeat that value in that column for that particular group.
In pandas, I would do the following:
df['C'] = df.groupby(['A','B'])['C'].fillna(method = 'ffill')
What would be the equivalent in Dask?
Also, I am a little bit lost as to how to structure problems in Dask as opposed to pandas.
Thank you!
My progress so far:
First set index:
df1 = df.set_index(['A','B'])
Then groupby:
df1.groupby(['A','B']).apply(lambda x: x.fillna(method='ffill')).compute()
It appears dask does not currently implement the fillna method for GroupBy objects. I tried PRing it some time ago and gave up quite quickly.
Also, dask doesn't support the method parameter (as it isn't always trivial to implement with delayed algorithms).
A workaround for this could be using fillna before grouping, like so:
df['C'] = df.fillna(0).groupby(['A','B'])['C']
Although this wasn't tested.
You can find my (failed) attempt here: https://github.com/nirizr/dask/tree/groupy_fillna
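For reference, a groupby-apply workaround (which the question's own attempt is already close to) would look roughly like this; the toy data, the meta dtypes, and the reliance on groupby.apply are assumptions, and apply shuffles data, so it can be slow:

import pandas as pd
import dask.dataframe as dd

pdf = pd.DataFrame({"A": [1, 1, 2, 2], "B": ["x", "x", "y", "y"], "C": [1.0, None, 3.0, None]})
ddf = dd.from_pandas(pdf, npartitions=2)

# forward-fill C within each (A, B) group using plain pandas inside each group
filled = ddf.groupby(["A", "B"]).apply(
    lambda g: g.ffill(),
    meta={"A": "int64", "B": "object", "C": "float64"},
)
print(filled.compute())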
