Computing standard deviation using RDD vs. Spark SQL in Python

I am pretty new to the world of Spark (and to an extent even Python, but less so). I am trying to compute the standard deviation and have used the following code. The first approach uses Spark SQL:
sqlsd = spark.sql("SELECT STDDEV(temperature) as stdtemp from washing").first().stdtemp
print(sqlsd)
The above works fine (I think) and gives the result as 6.070.
Now I try to do the same thing using an RDD with the following code:
from math import sqrt

def sdTemperature(df, spark):
    n = float(df.count())
    m = meanTemperature(df, spark)  # helper defined elsewhere; returns the mean of the temperature column
    # replace nulls with 0 in every column
    df = df.fillna({'_id': 0, '_rev': 0, 'count': 0, 'flowrate': 0, 'fluidlevel': 0,
                    'frequency': 0, 'hardness': 0, 'speed': 0, 'temperature': 0, 'ts': 0,
                    'voltage': 0})
    rddT = df.rdd.map(lambda r: r.temperature)
    c = rddT.count()
    s = rddT.map(lambda x: pow(x - m, 2)).sum()
    print(n, c, s)
    sd = sqrt(s / c)
    return sd
When I run the above code I get a different result: the value I get is 53.195. What am I doing wrong? All I am trying to do is compute the standard deviation of the Spark dataframe column temperature using a lambda.
Thanks in advance for the help.

Thanks to Zero323, who gave me the clue: I needed to skip the null values rather than fill them with 0. The modified code is as follows:
# inside sdTemperature, after computing the mean m (requires import math)
df2 = df.na.drop(subset=["temperature"])  # drop rows with a null temperature
rddT = df2.rdd.map(lambda r: r.temperature)
c = rddT.count()
s = rddT.map(lambda x: pow(x - m, 2)).sum()
sd = math.sqrt(s / c)
return sd

There are two types of standard deviation - please refer to this: https://math.stackexchange.com/questions/15098/sample-standard-deviation-vs-population-standard-deviation
Similar question: Calculate the standard deviation of grouped data in a Spark DataFrame
The stddev() in Hive is a pointer to stddev_samp(); stddev_pop() is what you are looking for (inferred from the second part of your code). So your SQL query should be:
select stddev_pop(temperature) as stdtemp from washing
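To make the difference concrete, here is a minimal sketch (assuming a running SparkSession named spark and the washing table from the question) comparing the two aggregates; note that both ignore null rows, which the original fillna(0) approach did not:

from pyspark.sql import functions as F

washing = spark.table("washing")

# sample standard deviation (divides by n - 1); in Spark SQL, STDDEV is an alias for this
samp = washing.select(F.stddev_samp("temperature")).first()[0]

# population standard deviation (divides by n), matching the RDD formula sqrt(s / c)
pop = washing.select(F.stddev_pop("temperature")).first()[0]

print(samp, pop)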

Related

Getting a Different Sort Order Between "SPSS sort cases by" and Python "sort_values(by=[])"

I am attempting to convert some code from SPSS into Python. In the code, the SPSS "sort cases by" command results in a different sorted order than the pandas "df.sort_values(by=[])" command. For reference, here is the code in the two programs:
SPSS
GET FILE='C:\Data\sorttest.sav'.
sort cases by variable1.
dataset name sorttest.
execute.
Python
import pandas as pd
df_sorttest = pd.read_spss('C:\\Data\\sorttest.sav')
df_sorttest = df_sorttest.sort_values(by=['variable1'])
I assume this is because they are using different sorting algorithms, but I'm not sure how to fix it so I can get the same results in Python.
Thanks to It_is_Chris for the recommendation to specify the sorting algorithm. I set it to kind='mergesort' and got the correct order.
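A minimal sketch of that fix, reusing the path and column name from the question: 'mergesort' is pandas' stable sorting option, so rows that compare equal keep their original relative order (the default 'quicksort' makes no such guarantee):

import pandas as pd

df_sorttest = pd.read_spss('C:\\Data\\sorttest.sav')

# kind='mergesort' selects a stable sort, which keeps ties in their original order
df_sorttest = df_sorttest.sort_values(by=['variable1'], kind='mergesort')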

Accumulated Distribution Line

I'm trying to learn some Python and am currently doing a few stock market examples. However, I ran across something called an Accumulated Distribution Line (a technical indicator) and tried to follow the mathematical expression for it until I reached the following line:
ADL[i] = ADL[i-1] + money flow volume[i]
Now, I have the money flow volume at column index 8 and an empty column for the ADL at index 9 (positional indices into a dataframe read from a CSV file). How would I actually compute the mathematical expression above in Python? (Currently using Python with pandas.)
Currently tried using the range function such as:
for i in range(1, len(stock["Money flow volume"])):
    stock.iloc[0,9] = stock.iloc[(i-1),9] + stock.iloc[i,8]
But I think I'm doing something wrong.
That just looks like a cumulative sum with an unspecified base case, so I'd just use the built-in cumsum functionality.
import pandas as pd
df = pd.DataFrame(dict(mfv=range(10)))
df['adl'] = df['mfv'].cumsum()
should do what you want relatively efficiently
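Applied to the question's own frame (assuming the CSV has been read into stock and the column is literally named 'Money flow volume' as in the loop above), that would be something like:

stock['ADL'] = stock['Money flow volume'].cumsum()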

Scipy minimize like Excel Solver

I have a big table of data that I read from Excel in Python, where I perform some calculations. My dataframe looks like the following (my true table is bigger and more complex, but the logic stays the same), with My_cal_spread = set1 + set2 and Errors = abs(My_cal_spread - Spread).
My goal is to use scipy.optimize.minimize to find the single combination of (set1, set2) that can be used in every row so that My_cal_spread is as close as possible to Spread, by minimizing the sum of errors.
This is the solution I get when I use the Excel Solver; I'm looking to implement the same thing with SciPy. Thanks.
My code looks like this :
lnt = len(df['Spread'])
df['my_cal_Spread'] = ''
i = 0
while i < lnt:
    df['my_cal_Spread'].iloc[i] = df['set2'].iloc[i] + df['set1'].iloc[i]
    df['errors'].iloc[i] = abs(df['my_cal_Spread'].iloc[i] - df['Spread'].iloc[i])
    i = i + 1
errors_sum = sum(df['errors'])
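No accepted answer is shown here, but a minimal sketch of what the question describes might look like the following. It treats set1 and set2 as two scalar decision variables shared by every row and minimizes the sum of absolute errors; the sample Spread values are made up for illustration:

import numpy as np
import pandas as pd
from scipy.optimize import minimize

# hypothetical data standing in for the Excel table in the question
df = pd.DataFrame({'Spread': [10.2, 9.8, 10.5, 10.1]})

def total_error(params):
    set1, set2 = params
    my_cal_spread = set1 + set2                      # same combination applied to every row
    return np.abs(my_cal_spread - df['Spread']).sum()

# Nelder-Mead copes with the non-smooth abs(); x0 is an arbitrary starting guess
result = minimize(total_error, x0=[0.0, 0.0], method='Nelder-Mead')
print(result.x, result.fun)

Note that with My_cal_spread = set1 + set2 only the sum of the two variables matters, so many (set1, set2) pairs reach the same minimum; the real spreadsheet presumably adds more structure that breaks that tie.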

Python equivalent to Excel Solver: get max value knowing constraints

I'm quite new to Python and trying to find a Python equivalent to the Excel Solver function.
Let's say I have the following inputs:
import math
totalvolflow=150585.6894
gaspercentvol=0.1
prodmod=1224
blpower=562.57
powercons=6
gasvolflow=totalvolflow*gaspercentvol
quantity=math.ceil(gasvolflow/prodmod)
maxpercentvol=powercons*totalvolflow*prodmod/blpower
I want to find the maximum value of maxpercentvol by changing gaspercentvol, subject to the following constraint:
quantity * powercons < blpower
Any help would be appreciated.
According to:
maxpercentvol=powercons*totalvolflow*prodmod/blpower
maxpercentvol is 1,965,802.12765274 regardless of the value of gaspercentvol.
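A quick check of that observation, just re-running the question's formulas with two different gaspercentvol values:

import math

totalvolflow = 150585.6894
prodmod = 1224
blpower = 562.57
powercons = 6

for gaspercentvol in (0.1, 0.9):
    gasvolflow = totalvolflow * gaspercentvol
    quantity = math.ceil(gasvolflow / prodmod)                     # this changes with gaspercentvol
    maxpercentvol = powercons * totalvolflow * prodmod / blpower   # this does not reference it at all
    print(gaspercentvol, quantity, maxpercentvol)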

How to do a calculation on each line of a pandas dataframe in python?

I am brand new to Python. I am attempting to convert a function I made in R to Python; the R function is described here:
How to optimize this process?
From my reading it looks like the best way to do this in Python would be to use a for loop of the following form:
for line 1 in probe test
find user in U_lookup
find movie in M_lookup
take the value found in U_lookup and retrieve that line number from knn_text
take the values found in that row of knn_text, and retrieve the line numbers from dfm
for those line numbers in dfm, retrieve column=U_lookup
take the average of the non zero values found
save value into pandas datafame in new column for that line
Is this the most efficient (in terms of speed of calculation) way to complete an operation like this? Coming from R, I wasn't sure whether there was better functionality for something like this within the pandas package.
As a followup, is there an equivalent in python to the function dput() in R? dput essentially provides code to easily share a subset of data for questions like this.
You can use df.apply(my_func, axis=1) to apply the function/calculation to each row of a dataframe, where my_func contains the required calculations.
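A minimal sketch of that pattern; the column names and the calculation inside my_func are made up for illustration:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [10, 20, 30]})

def my_func(row):
    # whatever per-row calculation you need; here just the mean of two columns
    return (row['a'] + row['b']) / 2

# axis=1 passes each row to my_func as a Series
df['result'] = df.apply(my_func, axis=1)
print(df)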
