I have a dataframe that looks like below
accuracy
--------
91.0
92.0
73.0
72.0
88.0
I am using aggregate, count and collect to get the column sum, which is taking too much time. Below is my code:
total_count = df.count()
total_sum=df.agg({'accuracy': 'sum'}).collect()
total_sum_val = [i[0] for i in total_sum]
acc_top_k = (total_sum_val[0]/total_count)*100
Is there any alternative method to get the mean accuracy in PySpark?
First, aggregate the column to calculate the average. Then extract the result into a variable.
df = df.agg(F.avg('accuracy'))
acc_top_k = df.head()[0] * 100
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([(91.0,), (92.0,), (73.0,), (72.0,), (88.0,)], ['accuracy'])
df = df.agg(F.avg('accuracy'))
acc_top_k = df.head()[0] * 100
print(acc_top_k)
# 8320.0
If you prefer, you can use your method too:
df = df.agg({'accuracy': 'avg'})
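A minimal sketch of extracting the value from the dict-style aggregation, assuming df is the original (un-aggregated) frame from the question:
avg_df = df.agg({'accuracy': 'avg'})  # one-row DataFrame with column 'avg(accuracy)'
acc_top_k = avg_df.head()[0] * 100    # pull the scalar out of the first (and only) row
print(acc_top_k)
# 8320.0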
I have two data frames to which I want to apply two separate validation-check functions independently; any differences that arise are then concatenated into one transformed list.
The issue I am facing is that the first validation check should run ONLY on rows where ALL of the numeric columns have values. If a row contains ANY NaN values in the numeric columns, the first validation check should skip that row.
The second validation check does not need that specification.
Here are the data frames, functions, and transformations:
import pandas as pd
import numpy as np
df1 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.5,3,4,np.nan,3.5,1.5,2],'Amount':[40,19,np.nan,np.nan,60,70,80,np.nan,45,102],
'Quantity Frozen':[3,4,np.nan,15,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[37,12,np.nan,45,np.nan,61,np.nan,24,14,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,21,40]}
df1 = pd.DataFrame(df1, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
df2 = {'Fruits': ["Banana","Blueberry","Apple","Cherry","Mango","Pineapple","Watermelon","Papaya","Pear","Coconut"],
'Price': [2,1.5,np.nan,2.6,3,4,np.nan,3.5,1.5,2],'Amount':[40,16,np.nan,np.nan,60,72,80,np.nan,45,100],
'Quantity Frozen':[3,4,np.nan,np.nan,np.nan,9,12,8,np.nan,80],
'Quantity Fresh':[np.nan,12,np.nan,45,np.nan,61,np.nan,24,15,20],
'Multiple':[74,17,np.nan,112.5,np.nan,244,np.nan,84,20,40]}
df2 = pd.DataFrame(df2, columns = ['Fruits', 'Price','Amount','Quantity Frozen','Quantity Fresh','Multiple'])
#Validation Check 1:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
    dataset['dif_Stock on Hand'] = dataset['Quantity Fresh']+dataset['Quantity Frozen']
    for varname,var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen':'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
#Validation Check 2:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
    dataset['dif_Multiple'] = dataset['Price'] * dataset['Quantity Fresh']
    for varname,var in {'Multiple vs. Price x Quantity Fresh':'dif_Multiple'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
# #Wrangling internal inconsistency data frames to be in correct format
inconsistency_vars = ['dif_Stock on Hand','dif_Multiple']
inconsistency_var_betternames = {'dif_Stock on Hand':'Stock on Hand = Quantity Fresh + Quantity Frozen','dif_Multiple':'Multiple = Price x Quantity on Hand'}
# #Rollup1
idvars1=['Fruits']
df1 = df1[idvars1 + inconsistency_vars]
df2 = df2[idvars1 + inconsistency_vars]
df1 = df1.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df2 = df2.melt(id_vars = idvars1, value_vars = inconsistency_vars, value_name = 'Difference Magnitude')
df1['dataset'] = 'Fruit Dataset1'
df2['dataset'] = 'Fruit Dataset2'
# #First table in Internal Inconsistencies Sheet (Table 5)
inconsistent = pd.concat([df1,df2])
inconsistent = inconsistent[['variable','Difference Magnitude','dataset','Fruits']]
inconsistent['variable'] = inconsistent['variable'].map(inconsistency_var_betternames)
inconsistent = inconsistent[inconsistent['Difference Magnitude'] != 0]
Here is the desired output, which for the first validation check skips rows in either data frame that have ANY NaN values in the numeric columns (every column but 'Fruits'):
#Desired output
inconsistent_true = {'variable': ["Stock on Hand = Quantity Fresh + Quantity Frozen","Stock on Hand = Quantity Fresh + Quantity Frozen","Multiple = Price x Quantity on Hand",
"Multiple = Price x Quantity on Hand","Multiple = Price x Quantity on Hand"],
'Difference Magnitude': [1,2,1,4.5,2.5],
'dataset':["Fruit Dataset1","Fruit Dataset1","Fruit Dataset2","Fruit Dataset2","Fruit Dataset2"],
'Fruits':["Blueberry","Coconut","Blueberry","Cherry","Pear"]}
inconsistent_true = pd.DataFrame(inconsistent_true, columns = ['variable', 'Difference Magnitude','dataset','Fruits'])
A pandas function that may come in handy is pd.isnull(), which returns True for np.nan values.
For example, take df1:
pd.isnull(df1['Amount'][2])
True
This check can be applied to all of your numeric columns, and then you keep only the rows whose 'numeric_check' value is 1:
df1['numeric_check'] = df1.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis =1)
Refer to the modified Validation Check 1:
#Validation Check 1:
for name, dataset in {'Fruit Dataset1':df1,'Fruit Dataset2':df2}.items():
    if '1' in name: # check to implement condition for only df1
        # Adding the 'numeric_check' column to dataset df
        dataset['numeric_check'] = dataset.apply(lambda x: 0 if (pd.isnull(x['Amount']) or
            pd.isnull(x['Price']) or pd.isnull(x['Quantity Frozen']) or
            pd.isnull(x['Quantity Fresh']) or pd.isnull(x['Multiple'])) else 1, axis =1)
        # filter out NaN rows, they will not be considered for this check
        dataset = dataset.loc[dataset['numeric_check']==1]
    dataset['dif_Stock on Hand'] = dataset['Quantity Fresh']+dataset['Quantity Frozen']
    for varname,var in {'Stock on Hand vs. Quantity Fresh + Quantity Frozen':'dif_Stock on Hand'}.items():
        print('{} differences in {}:'.format(name, varname))
        print(dataset[var].value_counts())
        print('\n')
I hope I got your intention.
# make boolean mask, True if all numeric values are not NaN
mask = df1.select_dtypes('number').notna().all(axis=1)
print(df1[mask])
Fruits Price Amount Quantity Frozen Quantity Fresh Multiple
0 Banana 2.0 40.0 3.0 37.0 74.0
1 Blueberry 1.5 19.0 4.0 12.0 17.0
5 Pineapple 4.0 70.0 9.0 61.0 244.0
9 Coconut 2.0 102.0 80.0 20.0 40.0
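A minimal sketch of plugging that mask into Validation Check 1 so NaN rows are skipped in both frames (column names are the ones from the question; the dif column simply stays NaN for skipped rows):
for name, dataset in {'Fruit Dataset1': df1, 'Fruit Dataset2': df2}.items():
    # keep only rows where every numeric column has a value
    mask = dataset.select_dtypes('number').notna().all(axis=1)
    dataset.loc[mask, 'dif_Stock on Hand'] = (
        dataset.loc[mask, 'Quantity Fresh'] + dataset.loc[mask, 'Quantity Frozen']
    )
    print('{} differences in Stock on Hand vs. Quantity Fresh + Quantity Frozen:'.format(name))
    print(dataset.loc[mask, 'dif_Stock on Hand'].value_counts())
    print('\n')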
Formatting output from pandas
I'm trying to automate getting output from pandas in a format that I can use with the minimum of messing about in a word processor. I'm using descriptive statistics as a practice case and so I'm trying to use the output from df[variable].describe(). My problem is that .describe() responds differently depending on the dtype of the column (if I'm understanding it properly).
In the case of a numerical column describe() produces this output:
count 306.000000
mean 36.823529
std 6.308587
min 10.000000
25% 33.000000
50% 37.000000
75% 41.000000
max 50.000000
Name: gses_tot, dtype: float64
However, for categorical columns, it produces:
count 306
unique 3
top Female
freq 166
Name: gender, dtype: object
Because of this difference, I need different code to capture the information I need, however, I can't seem to get my code to work on the categorical variables.
What I've tried
I've tried a few different versions of:
for v in df.columns:
    if df[v].dtype.name == 'category': #i've also tried 'object' here
        c, u, t, f, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'Largest category = {t}')
        print(f'Percentage = {(f/c)*100}%')
    else:
        c, m, std, mi, tf, f, sf, ma, = df[v].describe()
        print(f'******{str(v)}******')
        print(f'M = {m}')
        print(f'SD = {std}')
        print(f'Range = {float(ma) - float(mi)}')
    print(f'\n')
The code in the else block works fine, but when I come to a categorical column I get the error below
******age****** #this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-24-f077cc105185> in <module>
6 print(f'Percentage = {(f/c)*100}')
7 else:
----> 8 c, m, std, mi, tf, f, sf, ma, = df[v].describe()
9 print(f'******{str(v)}******')
10 print(f'M = {m}')
ValueError: not enough values to unpack (expected 8, got 4)
What I want to happen is something like
******age****** #this is the output I want for a numerical column
M = 34.21568627450981
SD = 11.983015946197659
Range = 53.0
******gender******
Largest category = female
Percentage = 52.2%
I believe the issue is how I'm setting up the if statement with the dtype; I've rooted around to try to find out how to access the dtype properly, but I can't seem to make it work.
Advice would be much appreciated.
You can check what fields are included in the output of describe and print the corresponding sections:
import pandas as pd
df = pd.DataFrame({'categorical': pd.Categorical(['d','e','f']), 'numeric': [1, 2, 3], 'object': ['a', 'b', 'c']})
for v in df.columns:
    desc = df[v].describe()
    print(f'******{str(v)}******')
    if 'top' in desc:
        print(f'Largest category = {desc["top"]}')
        print(f'Percentage = {(desc["freq"]/desc["count"])*100:.1f}%')
    else:
        print(f'M = {desc["mean"]}')
        print(f'SD = {desc["std"]}')
        print(f'Range = {float(desc["max"]) - float(desc["min"])}')
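If you would rather keep the dtype-based branching from your original loop, pandas' type-checking helpers avoid having to guess the dtype name. A minimal sketch, assuming the same sample df as above:
from pandas.api.types import is_numeric_dtype

for v in df.columns:
    desc = df[v].describe()
    print(f'******{str(v)}******')
    if is_numeric_dtype(df[v]):
        print(f'M = {desc["mean"]}')
        print(f'SD = {desc["std"]}')
        print(f'Range = {float(desc["max"]) - float(desc["min"])}')
    else:
        # categorical and object columns both land here
        print(f'Largest category = {desc["top"]}')
        print(f'Percentage = {(desc["freq"]/desc["count"])*100:.1f}%')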
I am trying to calculate x percent of a column's value using pd.Series, but I am getting a NoneType error. I'm not really sure what I'm doing wrong.
def PctMA(self, df, pct):
    pctName = 'PCT' + str(pct)
    pctDecimal = float(pct/100)
    pctCalulated = pd.Series((df['value'] - (df['value'] * pctDecimal)), name = pctName)
    df = df.join(pctCalulated)
    return df
I get the below error when I execute the code.
pctCalulated = pd.Series((df['value'] - (df['value'] * pctDecimal)), name = pctName)
TypeError: 'NoneType' object has no attribute '__getitem__'
Below is the expected result, with pct1 generated as the value column reduced by 1%.
index value pct1
1 2476 2451.24
2 2475 2450.25
3 2486 2461.14
4 2536 2510.64
5 2453 2428.47
6 2486 2461.14
7 2648 2621.52
8 2563 2537.37
9 2756 2728.44
Maybe the error is not in the function but in the input, as I just used your code and it is working:
def PctMA(df, pct):
    pctName = 'PCT' + str(pct)
    pctDecimal = float(pct/100)
    pctCalulated = pd.Series((df['value'] - (df['value'] * pctDecimal)), name = pctName)
    df = df.join(pctCalulated)
    return df
df = pd.DataFrame({'value': [2476,2476,2486,2536]})
df_1 = PctMA(df, 1)
df_1
It would help to give more details.
I executed your code with minimal changes (just removing self and calling it with pct = 1); the code above works absolutely fine, so the issue is with your input.
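As a side note, since the new column is just value minus pct percent of value, you can also assign it directly instead of building a separate Series and joining. A minimal sketch (pct_ma is a hypothetical rewrite of your PctMA, using pct = 1 as in your expected output):
import pandas as pd

def pct_ma(df, pct):
    # same idea as PctMA, but assigns the column in place
    pct_name = 'PCT' + str(pct)
    df[pct_name] = df['value'] * (1 - pct / 100)  # e.g. 2476 -> 2451.24 for pct = 1
    return df

df = pd.DataFrame({'value': [2476, 2475, 2486, 2536, 2453]})
print(pct_ma(df, 1))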
I have a dataframe with many timestamps. What I'm trying to do is get the lower of two dates, but only if both columns are not null. For example:
Internal Review Imported Date Lower Date
1 2/9/2018 19:44
2 2/15/2018 1:20 2/13/2018 2:18 2/13/2018 2:18
3 2/7/2018 23:17 2/12/2018 9:34 2/7/2018 23:17
4 2/12/2018 9:25
5 2/1/2018 20:57 2/12/2018 9:24 2/1/2018 20:57
If I wanted the lower of Internal Review and Imported Date, rows one and four would not return any value, while the other rows would return the lower date because both columns contain dates. I know .min(axis=1) will return a date, but the columns can be null, which is the problem.
I tried copying something similar to here:
def business_days(start, end):
    mask = pd.notnull(start) & pd.notnull(end)
    start = start.values.astype('datetime64[D]')[mask]
    end = end.values.astype('datetime64[D]')[mask]
    result = np.empty(len(mask), dtype=float)
    result[mask] = np.busday_count(start, end)
    result[~mask] = np.nan
    return result
and tried
def GetLowestDays(col1, col2, df):
    df = df.copy()
    Start = col1.copy().notnull()
    End = col2.copy().notnull()
    Col3 = [Start, End].min(axis=1)
    return col3
But I simply get an "AttributeError: 'list' object has no attribute 'min'".
The following code should do the trick:
df['Lower Date'] = df[( df['Internal Review'].notnull() ) & ( df['Imported Date'].notnull() )][['Internal Review','Imported Date']].min(axis=1)
The new column will be filled by the minimum if both are not null.
Nicolas
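A small end-to-end sketch of the same idea, using a few of the dates from the question (NaT stands in for the blank cells; which column holds the lone dates in rows 1 and 4 is an assumption):
import pandas as pd

df = pd.DataFrame({
    'Internal Review': pd.to_datetime(['2018-02-09 19:44', '2018-02-15 01:20',
                                       '2018-02-07 23:17', None, '2018-02-01 20:57']),
    'Imported Date': pd.to_datetime([None, '2018-02-13 02:18', '2018-02-12 09:34',
                                     '2018-02-12 09:25', '2018-02-12 09:24']),
})

# only rows where both dates are present get a 'Lower Date'; the rest stay NaT
both = df['Internal Review'].notnull() & df['Imported Date'].notnull()
df.loc[both, 'Lower Date'] = df.loc[both, ['Internal Review', 'Imported Date']].min(axis=1)
print(df)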