Mean of column in PySpark without using collect - python

I have a dataframe that looks like below
accuracy
--------
91.0
92.0
73.0
72.0
88.0
I am using aggregate, count, and collect to get the column sum, which is taking too much time. Below is my code:
total_count = df.count()
total_sum=df.agg({'accuracy': 'sum'}).collect()
total_sum_val = [i[0] for i in total_sum]
acc_top_k = (total_sum_val[0]/total_count)*100
Is there any alternative method to get the mean accuracy in PySpark?

First, you can aggregate the column values and calculate the average. Then, extract it into a variable.
df = df.agg(F.avg('accuracy'))
acc_top_k = df.head()[0] * 100
Full test:
from pyspark.sql import functions as F
df = spark.createDataFrame([(91.0,), (92.0,), (73.0,), (72.0,), (88.0,)], ['accuracy'])
df = df.agg(F.avg('accuracy'))
acc_top_k = df.head()[0] * 100
print(acc_top_k)
# 8320.0
If you prefer, you can use your method too:
df = df.agg({'accuracy': 'avg'})
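If a one-liner is preferred, the aggregation and the extraction can also be chained; a minimal sketch, assuming df is the original dataframe with the raw accuracy column:
from pyspark.sql import functions as F

# aggregate and pull the scalar back in a single round trip
acc_top_k = df.select(F.mean('accuracy')).first()[0] * 100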

Related

Applying custom functions to groupby objects pandas

I have the following pandas dataframe.
import pandas as pd
import numpy as np
df = pd.DataFrame(
    {
        "bird_type": ["falcon", "crane", "crane", "falcon"],
        "avg_speed": [np.random.randint(50, 200) for _ in range(4)],
        "no_of_birds_observed": [np.random.randint(3, 10) for _ in range(4)],
        "reliability_of_data": [np.random.rand() for _ in range(4)],
    }
)
# The dataframe looks like this.
bird_type avg_speed no_of_birds_observed reliability_of_data
0 falcon 66 3 0.553841
1 crane 159 8 0.472359
2 crane 158 7 0.493193
3 falcon 161 7 0.585865
Now, I would like to have the weighted average (weighted by no_of_birds_observed) of the avg_speed and reliability_of_data columns. For that I have a simple function as follows, which calculates the weighted average.
def func(data, numbers):
    ans = 0
    for a, b in zip(data, numbers):
        ans = ans + a * b
    ans = ans / sum(numbers)
    return ans
How can I apply func to both the average speed and reliability columns?
I expect the answer to be a dataframe like follows
bird_type avg_speed no_of_birds_observed reliability_of_data
0 falcon 132.5 10 0.5762578
# how: (66*3 + 161*7)/(3+7)    (3+7)    (0.553841*3 + 0.585865*7)/(3+7)
1 crane 158.53 15 0.4820815
# how: (159*8 + 158*7)/(8+7)    (8+7)    (0.472359*8 + 0.493193*7)/(8+7)
I saw this question, but could not generalize the solution or understand it completely. I thought of not asking the question, but according to this blog post by SO and this meta question (with a different example), I think this question can be considered a "borderline duplicate". An answer will benefit me, and others will probably also find it useful, so I finally decided to ask.
Don't use a custom function with apply; instead, perform a classical aggregation:
cols = ['avg_speed', 'reliability_of_data']
# multiply relevant columns by no_of_birds_observed
# aggregate everything as sum
out = (df[cols].mul(df['no_of_birds_observed'], axis=0)
       .combine_first(df)
       .groupby('bird_type').sum()
       )
# divide the relevant columns by the sum of no_of_birds_observed
out[cols] = out[cols].div(out['no_of_birds_observed'], axis=0)
Output:
avg_speed no_of_birds_observed reliability_of_data
bird_type
crane 158.533333 15 0.482082
falcon 132.500000 10 0.576258
If you want to aggregate with GroupBy.agg, np.average's weights parameter can pull no_of_birds_observed via DataFrame.loc:
# for correct output a default (or unique-valued) index is needed
df = df.reset_index(drop=True)
f = lambda x: np.average(x, weights=df.loc[x.index, 'no_of_birds_observed'])
df1 = (df.groupby('bird_type', sort=False, as_index=False)
         .agg(avg=('avg_speed', f),
              no_of_birds=('no_of_birds_observed', 'sum'),
              reliability_of_data=('reliability_of_data', f)))
print (df1)
bird_type avg no_of_birds reliability_of_data
0 falcon 132.500000 10 0.576258
1 crane 158.533333 15 0.482082
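As a quick sanity check, the falcon row can be recomputed directly with np.average; a small sketch using the sample numbers shown above:
import numpy as np

falcon = df[df['bird_type'] == 'falcon']
print(np.average(falcon['avg_speed'], weights=falcon['no_of_birds_observed']))
# (66*3 + 161*7) / (3+7) = 132.5 with the values displayed above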

How to perform sliding window correlation operation on pandas dataframe with datetime index?

I am working with stock data coming from Yahoo Finance.
import pandas as pd
import yfinance as yf

def load_y_finance_data(y_finance_tickers: list):
    df = pd.DataFrame()
    print("Loading Y-Finance data ...")
    for ticker in y_finance_tickers:
        df[ticker.replace("^", "")] = yf.download(
            ticker,
            auto_adjust=True,  # only download adjusted data
            progress=False,
        )["Close"]
    print("Done loading Y-Finance data!")
    return df
x = load_y_finance_data(["^VIX", "^GSPC"])
x
VIX GSPC
Date
1990-01-02 17.240000 359.690002
1990-01-03 18.190001 358.760010
1990-01-04 19.219999 355.670013
1990-01-05 20.110001 352.200012
1990-01-08 20.260000 353.790009
DataSize=(8301, 2)
Here I want to perform a sliding window operation over every 50-day period: compute the correlation (using the corr() function) for the 50-day slice (day_1 to day_50), then move the window by one day (day_2 to day_51), and so on.
I tried the naive approach of using a for loop, which works, but it takes too much time. Code below:
data_size = len(x)
period = 50
df = pd.DataFrame()
for i in range(data_size - period):
    df.loc[i, "GSPC_VIX_corr"] = x[["GSPC", "VIX"]][i:i+period].corr().loc["GSPC", "VIX"]
df
GSPC_VIX_corr
0 -0.703156
1 -0.651513
2 -0.602876
3 -0.583256
4 -0.589086
How can I do this more efficiently? Is there any built-in way I can use?
Thanks :)
You can use the rolling window functionality of Pandas with many different aggregations, including corr(). Instead of your for loop, do this:
x["VIX"].rolling(window=period).corr(x["GSPC"])

How to find the average of data samples at random intervals in python?

I have temperature data stored in a csv file; when plotted it looks like the image below. How do I find the average temperature during each interval in which the temperature goes above 12? The result should be T1, T2, T3, the average temperature during each interval where the value is above 12.
Could you please suggest how to achieve this in python?
Highlighted (approximately) are the areas over which I need to calculate the average:
Please find below sample data:
R3,R4
1,11
2,11
3,11
4,11
5,11
6,15.05938512
7,15.12975992
8,15.05850141
18,15.1677708
19,15.00921862
20,15.00686921
21,15.01168888
22,11
23,11
24,11
25,11
26,11
27,15.05938512
28,15.12975992
29,15.05850141
30,15.00328706
31,15.12622611
32,15.01479819
33,15.17778891
34,15.01411488
35,9
36,9
37,9
38,9
39,16.16042435
40,16.00091253
41,16.00419677
42,16.15381827
43,16.0471766
44,16.03725301
45,16.13925003
46,16.00072279
47,11
48,1
In pandas, an idea would be to group the data based on the condition T > 12 and use mean as agg func. Ex:
import pandas as pd
# a dummy df:
df = pd.DataFrame({'T': [11, 13, 13, 10, 14]})
# set the condition
m = df['T'] > 12
# define groups
grouper = (~m).cumsum().where(m)
# ...looks like
# 0 NaN
# 1 1.0
# 2 1.0
# 3 NaN
# 4 2.0
# Name: T, dtype: float64
# now we can easily calculate the mean for each group:
grp_mean = df.groupby(grouper)['T'].mean()
# T
# 1.0    13.0
# 2.0    14.0
# Name: T, dtype: float64
Note: if you have noisy data (T jumps up and down), it might be clever to apply a filter first (savgol, median etc. - whatever is appropriate) so you don't end up with groups caused by the noise.
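A minimal sketch of that idea, assuming a small rolling median is an acceptable smoother here (the window size of 3 is an arbitrary choice):
# smooth first, then build the mask and grouper on the smoothed series
smoothed = df['T'].rolling(3, center=True, min_periods=1).median()
m = smoothed > 12
grouper = (~m).cumsum().where(m)
grp_mean = df.groupby(grouper)['T'].mean()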
I couldn't find a good pattern for this - here's a clunky bit of code that does what you want, though.
In general, use .shift() to find transition points, and use groupby with transform to get your means.
import numpy as np
import pandas as pd

# if you had a csv with Dates and Temps, do this:
# tempsDF = pd.read_csv("temps.csv", usecols=["Date", "Temp"])
# tempsDF.set_index("Date", inplace=True)
# Using fake data since I don't have your csv
tempsDF = pd.DataFrame({'Temp': [0, 13, 14, 13, 8, 7, 5, 0, 14, 16, 16, 0, 0, 0]})
#This is a bit clunky - I bet there's a more elegant way to do it
tempsDF["CumulativeFlag"] = 0
tempsDF.loc[tempsDF["Temp"]>12, "CumulativeFlag"]=1
tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift(), "HighTempGroup"] = list(range(1,len(tempsDF.loc[tempsDF["CumulativeFlag"] > tempsDF["CumulativeFlag"].shift()])+1))
tempsDF["HighTempGroup"].fillna(method='ffill', inplace=True)
tempsDF.loc[tempsDF["Temp"]<=12, "HighTempGroup"]= None
tempsDF["HighTempMean"] = tempsDF.groupby("HighTempGroup").transform(np.mean)["Temp"]

Optimizing cartesian product between two Pandas Dataframe

I have two dataframes with the same columns:
Dataframe 1:
attr_1 attr_77 ... attr_8
userID
John 1.2501 2.4196 ... 1.7610
Charles 0.0000 1.0618 ... 1.4813
Genarito 2.7037 4.6707 ... 5.3583
Mark 9.2775 6.7638 ... 6.0071
Dataframe 2:
attr_1 attr_77 ... attr_8
petID
Firulais 1.2501 2.4196 ... 1.7610
Connie 0.0000 1.0618 ... 1.4813
PopCorn 2.7037 4.6707 ... 5.3583
I want to generate a dataframe with the correlation and p-value of all possible combinations; this would be the result:
userId petID Correlation p-value
0 John Firulais 0.091447 1.222927e-02
1 John Connie 0.101687 5.313359e-03
2 John PopCorn 0.178965 8.103919e-07
3 Charles Firulais -0.078460 3.167896e-02
The problem is that the cartesian product generates more than 3 million tuples and takes minutes to finish. This is my code; I've written two alternatives:
First of all, initial DataFrames:
df1 = pd.DataFrame({
    'userID': ['John', 'Charles', 'Genarito', 'Mark'],
    'attr_1': [1.2501, 0.0, 2.7037, 9.2775],
    'attr_77': [2.4196, 1.0618, 4.6707, 6.7638],
    'attr_8': [1.7610, 1.4813, 5.3583, 6.0071]
}).set_index('userID')

df2 = pd.DataFrame({
    'petID': ['Firulais', 'Connie', 'PopCorn'],
    'attr_1': [1.2501, 0.0, 2.7037],
    'attr_77': [2.4196, 1.0618, 4.6707],
    'attr_8': [1.7610, 1.4813, 5.3583]
}).set_index('petID')
Option 1:
# Pre-allocate space
df1_keys = df1.index
res_row_count = len(df1_keys) * df2.values.shape[0]
users = np.empty(res_row_count, dtype='object')
pets = np.empty(res_row_count, dtype='object')
coff = np.empty(res_row_count)
p_value = np.empty(res_row_count)
i = 0
for df1_key in df1_keys:
    df1_values = df1.loc[df1_key, :].values
    for df2_key in df2.index:
        df2_values = df2.loc[df2_key, :]
        pearson_res = pearsonr(df1_values, df2_values)
        users[i] = df1_key
        pets[i] = df2_key
        coff[i] = pearson_res[0]
        p_value[i] = pearson_res[1]
        i += 1

# After the loop, build the resulting DataFrame
# (this block lives inside a function in my code, hence the return)
return pd.DataFrame(data={
    'userID': users,
    'petID': pets,
    'Correlation': coff,
    'p-value': p_value
})
Option 2 (slower), from here:
# Makes a merge between all the tuples
def df_crossjoin(df1_file_path, df2_file_path):
    df1, df2 = prepare_df(df1_file_path, df2_file_path)
    df1['_tmpkey'] = 1
    df2['_tmpkey'] = 1
    res = pd.merge(df1, df2, on='_tmpkey').drop('_tmpkey', axis=1)
    res.index = pd.MultiIndex.from_product((df1.index, df2.index))
    df1.drop('_tmpkey', axis=1, inplace=True)
    df2.drop('_tmpkey', axis=1, inplace=True)
    return res

# Computes the Pearson coefficient for all the tuples
def compute_pearson(row):
    values = np.split(row.values, 2)
    return pearsonr(values[0], values[1])
result = df_crossjoin(mrna_file, mirna_file).apply(compute_pearson, axis=1)
Is there a faster way to solve this problem with Pandas, or will I have no option but to parallelize the iterations?
Edit:
As the size of the dataframes increases, the second option results in a better runtime, but it still takes seconds to finish.
Thanks in advance
Of all the alternatives tested, the one that gave me the best results was the following:
The iteration product was built with itertools.product().
All the iterations over both iterrows() were performed in a Pool of parallel processes (using map).
To squeeze out a little more performance, the function compute_row_cython was compiled with Cython, as advised in this section of the Pandas documentation:
In the cython_modules.pyx file:
from scipy.stats import pearsonr
import numpy as np
def compute_row_cython(row):
    (df1_key, df1_values), (df2_key, df2_values) = row
    cdef (double, double) pearsonr_res = pearsonr(df1_values.values, df2_values.values)
    return df1_key, df2_key, pearsonr_res[0], pearsonr_res[1]
Then I set up the setup.py:
from distutils.core import setup
from Cython.Build import cythonize
setup(name='Compiled Pearson',
      ext_modules=cythonize("cython_modules.pyx"))
Finally I compiled it with: python setup.py build_ext --inplace
The final code, then, looked like this:
import itertools
import multiprocessing
from cython_modules import compute_row_cython
NUM_CORES = multiprocessing.cpu_count() - 1
pool = multiprocessing.Pool(NUM_CORES)
# Calls to Cython function defined in cython_modules.pyx
res = zip(*pool.map(compute_row_cython, itertools.product(df1.iterrows(), df2.iterrows())))
pool.close()
end_values = list(res)
pool.join()
Neither Dask nor the merge-with-apply approach gave me better results, not even after optimizing the apply with Cython. In fact, those two alternatives gave me memory errors; when implementing the solution with Dask I had to generate several partitions, which degraded performance because of the many I/O operations involved.
The solution with Dask can be found in my other question.
Here's another method using the same cross join but with the built-in pandas method DataFrame.corrwith and scipy.stats.ttest_ind. Since this uses a less "loopy" implementation, it should perform better.
from scipy.stats import ttest_ind
# reset_index so userID / petID become regular columns before the cross join
mrg = df1.reset_index().assign(key=1).merge(df2.reset_index().assign(key=1), on='key').drop(columns='key')
x = mrg.filter(like='_x').rename(columns=lambda x: x.rsplit('_', 1)[0])
y = mrg.filter(like='_y').rename(columns=lambda x: x.rsplit('_', 1)[0])
df = mrg[['userID', 'petID']].join(x.corrwith(y, axis=1).rename('Correlation'))
df['p_value'] = ttest_ind(x, y, axis=1)[1]
userID petID Correlation p_value
0 John Firulais 1.000000 1.000000
1 John Connie 0.641240 0.158341
2 John PopCorn 0.661040 0.048041
3 Charles Firulais 0.641240 0.158341
4 Charles Connie 1.000000 1.000000
5 Charles PopCorn 0.999660 0.020211
6 Genarito Firulais 0.661040 0.048041
7 Genarito Connie 0.999660 0.020211
8 Genarito PopCorn 1.000000 1.000000
9 Mark Firulais -0.682794 0.006080
10 Mark Connie -0.998462 0.003865
11 Mark PopCorn -0.999569 0.070639
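Note that ttest_ind tests for a difference in means between the two attribute vectors, not the significance of the correlation itself. If the two-sided p-value of the Pearson correlation is what is wanted (as scipy.stats.pearsonr would return), here is a vectorized sketch under that assumption, with n being the number of attributes per pair (3 in this toy example):
import numpy as np
from scipy import stats

r = x.corrwith(y, axis=1)                     # Pearson r per row
n = x.shape[1]                                # samples per correlation
t_stat = r * np.sqrt((n - 2) / (1 - r ** 2))  # t statistic with n-2 degrees of freedom
df['p_value_pearson'] = 2 * stats.t.sf(np.abs(t_stat), df=n - 2)
# rows with |r| == 1 give an infinite t statistic and a p-value of 0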

Pandas find min date within lookback window from first order for each user

For every user, I'd like to find the date of their earliest visit that falls within a 90 day lookback window from their first order date.
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
from datetime import timedelta
import pandas as pd

test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
lookback = test[test['orderNumber']==1]['date'].apply(lambda x: x - timedelta(days=90))
lookback.name = 'window_min'
ids = test['fullVisitorId']
ids = ids.reset_index()
ids = ids.set_index('index')
lookback = lookback.reset_index()
lookback['fullVisitorId'] = lookback['index'].map(ids['fullVisitorId'])
lookback = lookback.set_index('fullVisitorId')
test['window'] = test['fullVisitorId'].map(lookback['window_min'])
test = test[test['window']<test['date']]
test.loc[test.groupby('fullVisitorId')['date'].idxmin()]
This works, but I feel like there ought to be a cleaner way...
How about this? Basically we assign a new column (firstorder_90, the first order date minus 90 days) to help us filter out visits that fall outside the window.
Then we apply groupby and pick the first (0th) element per visitor.
import pandas as pd
data = {"date":{"145586":"2016-08-02","247940":"2016-10-04","74687":"2017-01-05","261739":"2016-10-05","121154":"2016-10-07","82658":"2016-12-01","196680":"2016-12-06","141277":"2016-12-15","189763":"2016-12-18","201564":"2016-12-20","108930":"2016-12-23"},"fullVisitorId":{"145586":643786734868244401,"247940":7634897085866546110,"74687":7634897085866546110,"261739":7634897085866546110,"121154":7634897085866546110,"82658":7634897085866546110,"196680":7634897085866546110,"141277":7634897085866546110,"189763":643786734868244401,"201564":643786734868244401,"108930":7634897085866546110},"sessionId":{"145586":"0643786734868244401_1470168779","247940":"7634897085866546110_1475590935","74687":"7634897085866546110_1483641292","261739":"7634897085866546110_1475682997","121154":"7634897085866546110_1475846055","82658":"7634897085866546110_1480614683","196680":"7634897085866546110_1481057822","141277":"7634897085866546110_1481833373","189763":"0643786734868244401_1482120932","201564":"0643786734868244401_1482246921","108930":"7634897085866546110_1482521314"},"orderNumber":{"145586":0.0,"247940":0.0,"74687":1.0,"261739":0.0,"121154":0.0,"82658":0.0,"196680":0.0,"141277":0.0,"189763":1.0,"201564":0.0,"108930":0.0}}
test = pd.DataFrame(data=data)
test.date = pd.to_datetime(test.date)
test.sort_values(by='date', inplace=True)
firstorder = test[test.orderNumber > 0].set_index('fullVisitorId').date
test['firstorder_90'] = test.fullVisitorId.map(firstorder - pd.Timedelta(days=90))
test.query('date >= firstorder_90').groupby('fullVisitorId', as_index=False).nth(0)
We get:
date fullVisitorId sessionId \
121154 2016-10-07 7634897085866546110 7634897085866546110_1475846055
189763 2016-12-18 643786734868244401 0643786734868244401_1482120932
orderNumber firstorder_90
121154 0.0 2016-10-07
189763 1.0 2016-09-19
