I have a dataframe that is missing certain rows, and I want to generate records for those and fill the value columns with 0.
My df looks like this:
import pyspark.pandas as ps
import pandas as pd
data = {'Region': ['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Asia', 'Asia', 'Asia'],
        'Country': ['South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'South Africa', 'Japan', 'Japan', 'Japan'],
        'Product': ['ABC', 'ABC', 'ABC', 'XYZ', 'XYZ', 'XYZ', 'XYZ', 'DEF', 'DEF', 'DEF'],
        'Year': [2016, 2018, 2019, 2016, 2017, 2018, 2019, 2016, 2017, 2019],
        'Price': [500, 0, 450, 750, 0, 0, 890, 19, 120, 3],
        'Quantity': [1200, 0, 330, 500, 190, 70, 120, 300, 50, 80],
        'Value': [600000, 0, 148500, 350000, 0, 29100, 106800, 74300, 5500, 20750]}
df = ps.DataFrame(data)
Some entries in this df are missing, such as the year 2017 for South Africa and the year 2018 for Japan.
I want to generate those entries and add 0 in the columns Quantity, Price and Value.
I managed to do this on a smaller dataset using pandas, however, when I try to implement this using pyspark.pandas, I get an error.
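For reference, a sketch of the pandas version that worked for me on the smaller dataset (assuming the same data in a plain pandas DataFrame called pdf):
pdf = pd.DataFrame(data)
(pdf.set_index(['Region', 'Country', 'Product', 'Year'])
    .reindex(pd.MultiIndex.from_product([pdf['Region'].unique(),
                                         pdf['Country'].unique(),
                                         pdf['Product'].unique(),
                                         pdf['Year'].unique()],
                                        names=['Region', 'Country', 'Product', 'Year']),
             fill_value=0)
    .reset_index())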
This is the pyspark.pandas code I have so far:
(df.set_index(['Region', 'Country', 'Product', 'Year'])
   .reindex(ps.MultiIndex.from_product([df['Region'].unique(),
                                        df['Country'].unique(),
                                        df['Product'].unique(),
                                        df['Year'].unique()],
                                       names=['Region', 'Country', 'Product', 'Year']),
            fill_value=0)
   .reset_index())
Whenever I run it, I get the following error:
PandasNotImplementedError: The method `pd.Series.__iter__()` is not implemented. If you want to collect your data as an NumPy array, use 'to_numpy()' instead.
Any ideas why this might happen and how to fix it?
Ok, so looking closer at the syntax for ps.MultiIndex, the variables have to be passed as lists, so I had to add .tolist() after each column's unique values. Code below:
(df.set_index(['Region', 'Country', 'Product', 'Year'])
   .reindex(ps.MultiIndex.from_product([df['Region'].unique().tolist(),
                                        df['Country'].unique().tolist(),
                                        df['Product'].unique().tolist(),
                                        df['Year'].unique().tolist()],
                                       names=['Region', 'Country', 'Product', 'Year']),
            fill_value=0)
   .reset_index())
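For anyone hitting the same error: in pyspark.pandas, unique() returns another (distributed) Series rather than a NumPy array, and ps.MultiIndex.from_product needs to iterate over its inputs, which a distributed Series does not support. A minimal sketch of the difference (assuming an active Spark session):
years = df['Year'].unique()  # a pyspark.pandas Series, not a NumPy array
# list(years)                # raises PandasNotImplementedError: __iter__ is not implemented
years.tolist()               # collects the distinct values to the driver as a plain Python list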
Sometimes, it seems that the more I use Python (and pandas), the less I understand. So I apologise if I'm just not seeing the wood for the trees here, but I've been going round in circles and just can't see what I'm doing wrong.
Basically, I have an example script (that I'd like to implement on a much larger dataframe) but I can't get it to work to my satisfaction.
The dataframe consists of columns of various datatypes. I'd like to group the dataframe on 2 columns and then produce a new dataframe that contains lists of all the unique values for each variable in each group. (Ultimately, I'd like to concatenate the list items into a single string – but that's a different question.)
The initial script I used was:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
The output from this script is as expected: a series of lines containing the sets of values and a dataframe containing the returned sets:
{'09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34'}
{'01/06/2015 11:09', '12/05/2015 14:19', '27/05/2015 22:31', '19/06/2015 05:37'}
{'15/04/2015 07:12', '19/05/2015 19:22', '06/05/2015 11:12', '04/06/2015 12:57', '15/06/2015 03:23', '12/04/2015 01:00'}
{'02/04/2015 02:34', '10/05/2015 08:52'}
{2, 3, 6}
{18, 11, 13, 14}
{4, 5, 9, 12, 15, 17}
{1, 10}
                                                         date \
gender age
female old    set([09/04/2015 23:03, 21/04/2015 12:59, 06/04...
       young  set([01/06/2015 11:09, 12/05/2015 14:19, 27/05...
male   old    set([15/04/2015 07:12, 19/05/2015 19:22, 06/05...
       young          set([02/04/2015 02:34, 10/05/2015 08:52])

                                   id
gender age
female old             set([2, 3, 6])
       young    set([18, 11, 13, 14])
male   old    set([4, 5, 9, 12, 15, 17])
       young           set([1, 10])
The problem occurs when I try to convert the sets to lists. Bizarrely, it prints the same list twice (rather than one list per group) and then fails with a 'ValueError: Function does not reduce' error.
def tempFuncAgg(tempVar):
    tempList = list(set(tempVar.dropna()))  # This is the only difference
    print(tempList)
    return tempList
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
tempGroupby = tempDF.groupby(['gender','age'])
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
print(dfAgg)
But now the output is:
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
['09/04/2015 23:03', '21/04/2015 12:59', '06/04/2015 12:34']
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
...
ValueError: Function does not reduce
Any help to troubleshoot this problem would be appreciated and I apologise in advance if it's something obvious that I'm just not seeing.
EDIT
Incidentally, converting the set to a tuple rather than a list works with no problem.
Lists can trip up pandas aggregations: roughly speaking, agg treats a list returned by the aggregating function as array-like rather than as a single value, so it concludes the function 'does not reduce'; tuples, by contrast, are treated as scalars. You can either:
Use tuples (as you've already noticed; see the one-line sketch below)
If you really need lists, just do it in a second operation, like this:
dfAgg = dfAgg.applymap(list)
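For the tuple route, a minimal sketch (reusing tempGroupby from the question; the only change is returning a tuple instead of a set or list):
dfAgg = tempGroupby.agg(lambda x: tuple(set(x.dropna())))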
Full example:
import numpy as np
import pandas as pd
def tempFuncAgg(tempVar):
    tempList = set(tempVar.dropna())  # Drop NaNs and create set of unique values
    print(tempList)
    return tempList
# Define dataframe
tempDF = pd.DataFrame({'id': [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20],
                       'date': ["02/04/2015 02:34","06/04/2015 12:34","09/04/2015 23:03","12/04/2015 01:00","15/04/2015 07:12","21/04/2015 12:59","29/04/2015 17:33","04/05/2015 10:44","06/05/2015 11:12","10/05/2015 08:52","12/05/2015 14:19","19/05/2015 19:22","27/05/2015 22:31","01/06/2015 11:09","04/06/2015 12:57","10/06/2015 04:00","15/06/2015 03:23","19/06/2015 05:37","23/06/2015 13:41","27/06/2015 15:43"],
                       'gender': ["male","female","female","male","male","female","female",np.nan,"male","male","female","male","female","female","male","female","male","female",np.nan,"male"],
                       'age': ["young","old","old","old","old","old",np.nan,"old","old","young","young","old","young","young","old",np.nan,"old","young",np.nan,np.nan]})
# Groupby based on 2 categorical variables
tempGroupby = tempDF.groupby(['gender','age'])
# Aggregate for each variable in each group using function defined above
dfAgg = tempGroupby.agg(lambda x: tempFuncAgg(x))
# Transform into lists (applymap returns a new DataFrame, so reassign)
dfAgg = dfAgg.applymap(list)
print(dfAgg)
There are many such bizarre behaviours in pandas; it is generally better to go with a workaround like this than to hunt for a perfect solution.
I have a python blaze data like this
import blaze as bz
bdata = bz.Data([(1, 'Alice', 100.9, 100),
                 (2, 'Bob', 200.6, 200),
                 (3, 'Charlie', 300.45, 300),
                 (5, 'Edith', 400, 400)],
                fields=['id', 'name', 'revenue', 'profit'])
I would like to calculate mean for the numeric columns. I tried something like this
print({col: bdata[col].mean() for col in ['revenue', 'profit']})
and I get
{'profit': 250.0, 'revenue': 250.4875}
But I would like to calculate it in a single shot, like data.mean() in pandas.
Any thoughts or suggestions?
That Pandas aggregation is kind of magical, and I don't think you'll be able to skip the non-numerical columns without some kind of logic.
If you have the option to add a dummy column, you could use by to do an aggregation across the entire table.
That would look like this:
bdata = bz.Data([('fnord', 1, 'Alice', 100.9, 100),
                 ('fnord', 2, 'Bob', 200.6, 200),
                 ('fnord', 3, 'Charlie', 300.45, 300),
                 ('fnord', 5, 'Edith', 400, 400)],
                fields=['dummy', 'id', 'name', 'revenue', 'profit'])
bz.by(bdata.dummy, avg_profit=bdata.profit.mean(), avg_revenue=bdata.revenue.mean())
   dummy  avg_profit  avg_revenue
0  fnord         250     250.4875
Though that's not particularly concise either, and it requires modifying your data.
You could use odo to get quick access to that concise Pandas syntax:
from odo import odo
import pandas as pd
odo(bdata, pd.DataFrame).mean()
I think you might have better luck using the summary reduction:
from blaze import *
import pandas as pd

resume = summary(bdata, avg_profit=bdata.profit.mean(), avg_revenue=bdata.revenue.mean())
SummaryStats = pd.DataFrame(pd.Series(dict((k, v) for k, v in zip(resume.fields, compute(resume))))).T
The last line can be reduced to compute(resume) if you don't care about the result being a pd.DataFrame.
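Incidentally, compute(resume) by itself returns the computed values in the order of resume.fields, so a condensed sketch (reusing resume from above) would be:
stats = dict(zip(resume.fields, compute(resume)))
print(stats)  # {'avg_profit': 250.0, 'avg_revenue': 250.4875}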