Define partition for window operation using Pyspark.pandas - python

I am trying to learn how to use pyspark.pandas and I am coming across an issue that I don't know how to solve. I have a df of about 700k rows and 7 columns. Here is a sample of my data:
import pyspark.pandas as ps
import pandas as pd
data = {'Region': ['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Asia', 'Asia'],
        'Country': ['South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'Japan', 'Japan', 'Japan'],
        'Product': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'DEF', 'DEF', 'DEF'],
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}
df = ps.DataFrame(data)
Even when I run the simplest of operations like df.head(), I get the following warning and I'm not sure how to fix it:
WARN window.WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.
I know how to work around this with PySpark DataFrames, but I'm not sure how to define a partition for the window operation using the pandas API on Spark.
Does anyone have any suggestions?

For Koalas, repartition seems to take only a number of partitions: https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.spark.repartition.html
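On the pyspark.pandas side itself, two things that are often suggested for this warning are switching to a distributed default index (the warning is typically triggered by the default sequential index) and repartitioning through the spark accessor. A minimal sketch, assuming Spark 3.2+; the column values here are just placeholders:
import pyspark.pandas as ps

# Avoid the sequential default index, which moves data to a single partition
ps.set_option("compute.default_index_type", "distributed")

df = ps.DataFrame({"Region": ["Africa", "Asia"], "Value": [600000, 74300]})

# The spark accessor exposes repartition, which (like Koalas) takes only a
# number of partitions, not partitioning columns
df = df.spark.repartition(8)
print(df.head())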
I think the goal here is to run Pandas functions on a Spark DataFrame. One option you can use is Fugue. Fugue can take a Python function and apply it on Spark per partition. Example code below.
from typing import List, Dict, Any
import pandas as pd
df = pd.DataFrame({"date":["2021-01-01", "2021-01-02", "2021-01-03"] * 3,
"id": (["A"]*3 + ["B"]*3 + ["C"]*3),
"value": [3, 4, 2, 1, 2, 5, 3, 2, 3]})
def count(df: pd.DataFrame) -> pd.DataFrame:
# this assumes the data is already partitioned
id = df.iloc[0]["id"]
count = df.shape[0]
return pd.DataFrame({"id": [id], "count": [count]})
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(df)
from fugue import transform
# Pandas
pdf = transform(df.copy(),
                count,
                schema="id:str, count:int",
                partition={"by": "id"})
print(pdf.head())

# Spark
transform(sdf,
          count,
          schema="id:str, count:int",
          partition={"by": "id"},
          engine=spark).show()
You just need to annotate your function with input and output types, and then you can use it with the Fugue transform function. Schema is a requirement for Spark, so you need to pass it. If you supply spark as the engine, the execution will happen on Spark; otherwise it runs on pandas by default.

Related

Error in pyspark.pandas when trying to reindex columns

I have a dataframe that is missing certain entries, and I want to generate records for those and fill their values with 0.
My df looks like this:
import pyspark.pandas as ps
import databricks.koalas as ks
import pandas as pd
data = {'Region': ['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Asia', 'Asia'],
        'Country': ['South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'Japan', 'Japan', 'Japan'],
        'Product': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'DEF', 'DEF', 'DEF'],
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}
df = ps.DataFrame(data)
Some entries in this df are missing, such as the year 2017 for South Africa and the year 2018 for Japan.
I want to generate those entries and add 0 in the columns Quantity, Price and Value.
I managed to do this on a smaller dataset using pandas, however, when I try to implement this using pyspark.pandas, I get an error.
This is the code I have so far:
(df.set_index(['Region', 'Country', 'Product', 'Year'])
   .reindex(ps.MultiIndex.from_product([df['Region'].unique(),
                                        df['Country'].unique(),
                                        df['Product'].unique(),
                                        df['Year'].unique()],
                                       names=['Region', 'Country', 'Product', 'Year']),
            fill_value=0)
   .reset_index())
Whenever I run it, I get the following issue:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Any ideas why this might happen and how to fix it?
Ok, so looking closer at the syntax for ps.MultiIndex, the variables have to be passed as lists, so I had to add .tolist() after each column's unique values. Code below:
(df.set_index(['Region', 'Country', 'Product', 'Year'])
   .reindex(ps.MultiIndex.from_product([df['Region'].unique().tolist(),
                                        df['Country'].unique().tolist(),
                                        df['Product'].unique().tolist(),
                                        df['Year'].unique().tolist()],
                                       names=['Region', 'Country', 'Product', 'Year']),
            fill_value=0)
   .reset_index())

Why is my PySpark dataframe join operation writing an empty result?

I have two PySpark dataframes that I'm trying to join into a new dataframe. The join operation appears to produce an empty dataframe.
I'm using Jupyter notebooks to evaluate the code, on a PySpark kernel, on a cluster with a single master, 4 workers, and YARN for resource allocation.
from pyspark.sql.functions import monotonically_increasing_id,udf
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import DenseVector
firstelement = udf(lambda v: float(v[1]), FloatType())
a = [{'c_id': 'a', 'cv_id': 'b', 'id': 1}, {'c_id': 'c', 'cv_id': 'd', 'id': 2}]
ip = spark.createDataFrame(a)
b = [{'probability': DenseVector([0.99,0.01]), 'id': 1}, {'probability': DenseVector([0.6,0.4]), 'id': 2}]
op = spark.createDataFrame(b)
op.show() #shows the df
#probability, id
#[0.99, 0.01], 1
##probability is a dense vector, id is bigint
ip.show() #shows the df
#c_id, cv_id, id
#a,b,1
##c_id and cv_id are strings, id is bigint
op_final = (op.join(ip, ip.id == op.id)
              .select('c_id', 'cv_id', firstelement('probability'))
              .withColumnRenamed('<lambda>(probability)', 'probability'))
op_final.show() #gives a null df
#but the below seems to work, however, quite slow
ip.collect()
op.collect()
op_final.collect()
op_final.show() #shows the joined df
Perhaps it's my lack of expertise with Spark, but could someone please explain why I'm able to see the first two dataframes, but not the joined dataframe unless I use collect()?

How to aggregate results fetched using itertools groupby

I have a list like this:
a = [{'name': 'xyz', 'inv_name': 'asd', 'quant': 300, 'amt': 20000, 'current': 30000},
     {'name': 'xyz', 'inv_name': 'asd', 'quant': 200, 'amt': 2000, 'current': 3000}]
I fetched this list using itertools groupby.
I want to add up the quant, amt and current fields for entries with the same name and inv_name, producing a list like: [{'name': 'xyz', 'inv_name': 'asd', 'quant': 500, 'amt': 22000, 'current': 33000}]
Any suggestions on how to achieve this?
If you are happy using a 3rd party library, pandas accepts a list of dictionaries:
import pandas as pd
a = [{'name': 'xyz', 'inv_name': 'asd', 'quant': 300, 'amt': 20000, 'current': 30000},
     {'name': 'xyz', 'inv_name': 'asd', 'quant': 200, 'amt': 2000, 'current': 3000}]

df = pd.DataFrame(a)
res = df.groupby(['name', 'inv_name'], as_index=False).sum().to_dict(orient='records')
# [{'amt': 22000,
#   'current': 33000,
#   'inv_name': 'asd',
#   'name': 'xyz',
#   'quant': 500}]
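If you'd rather stay in the standard library (since the list came from itertools groupby in the first place), here is a minimal sketch with itertools.groupby, assuming the list is sorted by the grouping keys:
from itertools import groupby
from operator import itemgetter

a = [{'name': 'xyz', 'inv_name': 'asd', 'quant': 300, 'amt': 20000, 'current': 30000},
     {'name': 'xyz', 'inv_name': 'asd', 'quant': 200, 'amt': 2000, 'current': 3000}]

key = itemgetter('name', 'inv_name')
result = []
for (name, inv_name), group in groupby(sorted(a, key=key), key=key):
    rows = list(group)
    result.append({'name': name,
                   'inv_name': inv_name,
                   'quant': sum(r['quant'] for r in rows),
                   'amt': sum(r['amt'] for r in rows),
                   'current': sum(r['current'] for r in rows)})
print(result)
# [{'name': 'xyz', 'inv_name': 'asd', 'quant': 500, 'amt': 22000, 'current': 33000}]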

Converting a set to a list with Pandas groupby agg function causes 'ValueError: Function does not reduce'

Sometimes, it seems that the more I use Python (and Pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList

# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])

# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
date \
gender age
female old set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
young set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male old set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
young set([02/04/2015 02:34, 10/05/2015 08:52])
id
gender age
female old set([2, 3, 6])
young set([18, 11, 13, 14])
male old set([4, 5, 9, 12, 15, 17])
young set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it produces 2 duplicated rows containing identical lists but then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
    tempList = list(set(tempVar.dropna()))  # This is the only difference
    print(tempList)
    return tempList

tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
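For reference, this is roughly what the working tuple variant looks like (a sketch reusing tempGroupby from the code above):
# Returning a tuple instead of a list reduces each group without the error
dfAgg = tempGroupby.agg(lambda x: tuple(set(x.dropna())))
print(dfAgg)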
Lists can sometimes cause weird problems in pandas. You can either:
Use tuples (as you've already noticed), or
If you really need lists, do the conversion in a second operation, like this:
dfAgg = dfAgg.applymap(list)
Full example:
import numpy as np
import pandas as pd

def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList

# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})

# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])

# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))

# Transform the sets into lists (applymap returns a new dataframe, so reassign)
dfAgg = dfAgg.applymap(list)
print(dfAgg)
There are many such bizarre behaviours in pandas; it is generally better to go with a workaround like this than to hunt for a perfect solution.

python blaze calculate mean of multiple columns

I have Python Blaze data like this:
import blaze as bz

bdata = bz.Data([(1, 'Alice', 100.9, 100),
                 (2, 'Bob', 200.6, 200),
                 (3, 'Charlie', 300.45, 300),
                 (5, 'Edith', 400, 400)],
                fields=['id', 'name', 'revenue', 'profit'])
I would like to calculate the mean for the numeric columns. I tried something like this:
print {col: bdata[col].mean() for col in ['revenue', 'profit']}
and I get
{'profit': 250.0, 'revenue': 250.4875}
But I would like to calculate it in a single shot, like data.mean() in pandas.
Any thoughts or suggestions?
That Pandas aggregation is kind of magical, and I don't think you'll be able to skip the non-numerical columns without some kind of logic.
If you have the option to add a dummy column, you could use by to do an aggregation across the entire table.
That would look like this:
bdata = bz.Data([('fnord', 1, 'Alice', 100.9, 100),
                 ('fnord', 2, 'Bob', 200.6, 200),
                 ('fnord', 3, 'Charlie', 300.45, 300),
                 ('fnord', 5, 'Edith', 400, 400)],
                fields=['dummy', 'id', 'name', 'revenue', 'profit'])

bz.by(bdata.dummy, avg_profit=bdata.profit.mean(), avg_revenue=bdata.revenue.mean())
dummy avg_profit avg_revenue
0 fnord 250 250.4875
Though that's not particularly concise either, and it requires modifying your data.
You could use odo to get quick access to that concise Pandas syntax:
from odo import odo
import pandas as pd

odo(bdata, pd.DataFrame).mean()
I think you might have better luck using the summary reduction:
from blaze import *
import pandas as pd

resume = summary(bdata, avg_profit=bdata.profit.mean(), avg_revenue=bdata.revenue.mean())
SummaryStats = pd.DataFrame(pd.Series(dict((k, v) for k, v in zip(resume.fields, compute(resume))))).T
The last line can be reduced to compute(resume) if you don't care about the result being a pd.DataFrame.
