I have a big table of data that I read from Excel into Python, where I perform some calculations. My dataframe looks like this (my real table is bigger and more complex, but the logic stays the same):
with: My_cal_Spread = set1 + set2 and errors = abs(My_cal_Spread - Spread)
My goal is to use SciPy's minimize to find the single combination of (set1, set2), the same for every row, such that My_cal_Spread is as close as possible to Spread, by finding the minimum possible sum of errors.
This is the solution I get when I use the Excel Solver; I'm looking to implement the same solution using SciPy. Thanks.
My code looks like this:
lnt = len(df['Spread'])
df['my_cal_Spread'] = 0.0
df['errors'] = 0.0          # initialise the errors column before filling it
i = 0
while i < lnt:
    df['my_cal_Spread'].iloc[i] = df['set2'].iloc[i] + df['set1'].iloc[i]
    df['errors'].iloc[i] = abs(df['my_cal_Spread'].iloc[i] - df['Spread'].iloc[i])
    i = i + 1
errors_sum = df['errors'].sum()
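In case it helps, here is a minimal sketch of the direction I'm considering (assuming the df above; note that in this simplified form only the sum set1 + set2 is actually identified, while my real table constrains the two values separately):

import numpy as np
from scipy.optimize import minimize

def total_error(params, spread):
    set1, set2 = params                      # one shared pair used for every row
    return np.sum(np.abs(set1 + set2 - spread))

spread = df['Spread'].to_numpy(dtype=float)
# Nelder-Mead, since the absolute-value objective is not smooth
result = minimize(total_error, x0=[0.0, 0.0], args=(spread,), method='Nelder-Mead')
print(result.x, result.fun)                  # fitted (set1, set2) and the sum of errors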
I've computed the eigenvalues and eigenstates of a Hamiltonian in Python. I have a matrix, psi, containing all the wavefunctions in discrete space. I'd like to normalise the total wavefunction (the 'ket', i.e. the matrix of vectors) such that its modulus squared integrates to 1.
I've tried the following:
A= np.linalg.norm(abs(psi.T)**2)
normed_psi=psi.T/np.sqrt(A)
print(np.linalg.norm(normed_psi))
The matrix is transposed so I can access each state using psi[n].
However, the output of the print statement is:
20.44795885105457
when it should be 1. I feel like I'm not using linalg.norm correctly. I've also tried using my own integration function based on the trapezium rule, with no success.
I'm not really sure as to what to do at this point. Any help would be great.
It seems you're confusing np.linalg.norm and np.sum; up to the usual floating-point issues, these two snippets should be identical:
normed_psi = psi.T / np.sqrt(np.sum(psi.T**2))
normed_psi = psi.T / np.linalg.norm(psi.T)
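As a quick sanity check (toy data, assuming real-valued psi), dividing by the norm makes the full matrix of amplitudes have unit Euclidean norm:

import numpy as np

psi = np.random.rand(5, 100)               # 5 eigenstates on a 100-point grid
normed_psi = psi.T / np.linalg.norm(psi.T)
print(np.linalg.norm(normed_psi))          # ~1.0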
I'm trying to learn some Python and am currently doing a few stock market examples. However, I ran across something called an Accumulated Distribution Line (a technical indicator) and tried to follow the mathematical expression for it until I reached the following line:
ADL[i] = ADL[i-1] + money flow volume[i]
Now, I have the money flow volume at column index 8 and an empty column for the ADL at column index 9 (indices into the columns of a csv file). How would I actually compute the mathematical expression above in Python? (I'm currently using Python with Pandas.)
Currently I've tried using the range function, as in:
for i in range(1, len(stock["Money flow volume"])):
    stock.iloc[i, 9] = stock.iloc[i - 1, 9] + stock.iloc[i, 8]
But I think I'm doing something wrong.
That just looks like a cumulative sum with an unspecified base case, so I'd just use the built-in cumsum functionality.
import pandas as pd
df = pd.DataFrame(dict(mfv=range(10)))
df['adl'] = df['mfv'].cumsum()
should do what you want relatively efficiently
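As a quick check that cumsum really matches your recursion (taking ADL[0] = money flow volume[0] as the base case):

import pandas as pd

df = pd.DataFrame(dict(mfv=range(10)))
adl = df['mfv'].cumsum()

# replay ADL[i] = ADL[i-1] + mfv[i] by hand
manual = df['mfv'].copy()
for i in range(1, len(manual)):
    manual.iloc[i] = manual.iloc[i - 1] + df['mfv'].iloc[i]

print(adl.equals(manual))   # True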
I am brand new to Python. I am attempting to convert a function I made in R; the R function is described here:
How to optimize this process?
From my reading, it looks like the best way to do this in Python would be to use a for loop of the following form:
for line 1 in probe test
find user in U_lookup
find movie in M_lookup
take the value found in U_lookup and retrieve that line number from knn_text
take the values found in that row of knn_text, and retrieve the line numbers from dfm
for those line numbers in dfm, retrieve column=U_lookup
take the average of the non zero values found
save value into pandas datafame in new column for that line
Is this the most efficient way (in terms of calculation speed) to complete an operation like this? Coming from R, I wasn't sure if there was better functionality for something like this within the pandas package.
As a follow-up, is there a Python equivalent to R's dput() function? dput essentially provides code to easily share a subset of data for questions like this.
You can use df.apply(my_func, axis=1) to apply a function/calculation to each row of a dataframe, where my_func contains the required calculations.
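For illustration, here is a minimal sketch of that pattern; the column names and the body of my_func are placeholders, since the original lookup tables (U_lookup, M_lookup, knn_text, dfm) aren't shown:

import pandas as pd

probe_test = pd.DataFrame({'user': [1, 2, 3], 'movie': [10, 20, 30]})

def my_func(row):
    # Placeholder: here you would look up row['user'] and row['movie'] in
    # U_lookup / M_lookup, pull the matching rows of knn_text and dfm,
    # and average the non-zero values found.
    return row['user'] + row['movie']

probe_test['result'] = probe_test.apply(my_func, axis=1)
print(probe_test)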
I'm currently writing some code involving financial calculations, in particular some exponential moving averages. To do the job I have tried Pandas and talib:
talib_ex=pd.Series(talib.EMA(self.PriceAdjusted.values,timeperiod=200),self.PriceAdjusted.index)
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=True,min_periods=200-1).mean()
They both work fine, but they provide different results at the beginning of the array:
So is there some parameter to change in pandas's EWMA, or is it a bug I should worry about?
Thanks in advance
Luca
For the talib EMA, the formula is the standard recursion EMA[i] = alpha*price[i] + (1-alpha)*EMA[i-1], with alpha = 2/(timeperiod+1).
So if you want to make the pandas EMA the same as talib's, you should use it as:
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=False,min_periods=200-1).mean()
Set adjust to False, according to the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html), if you want to use the same formula as talib:
When adjust is True (the default), weighted averages are calculated using weights (1-alpha)^(n-1), (1-alpha)^(n-2), ..., 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha)*weighted_average[i-1] + alpha*arg[i].
You can also reference here:
https://en.wikipedia.org/wiki/Moving_average
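As a small check on synthetic data (not the poster's series), the adjust=False result reproduces that recursion exactly:

import numpy as np
import pandas as pd

prices = pd.Series(np.random.rand(300))
span = 200
alpha = 2.0 / (span + 1)

ewm = prices.ewm(span=span, adjust=False).mean()

# replay the recursion by hand: y[0] = x[0], y[i] = (1-alpha)*y[i-1] + alpha*x[i]
manual = prices.copy()
for i in range(1, len(prices)):
    manual.iloc[i] = (1 - alpha) * manual.iloc[i - 1] + alpha * prices.iloc[i]

print(np.allclose(ewm, manual))   # True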
PS: however, in my project I still find some small differences between talib and pandas.ewm, and I don't know why yet...
I am pretty new to the world of Spark (and, to an extent, even Python, though I'm getting better). I am trying to compute the standard deviation and have used the following code. The first uses SparkSQL, and the code is as follows:
sqlsd = spark.sql("SELECT STDDEV(temperature) as stdtemp from washing").first().stdtemp
print(sqlsd)
The above works fine (I think), and it gives the result as 6.070.
Now, when I try to do this using an RDD with the following code:
def sdTemperature(df, spark):
    n = float(df.count())
    m = meanTemperature(df, spark)
    df = df.fillna({'_id': 0, '_rev': 0, 'count': 0, 'flowrate': 0, 'fluidlevel': 0,
                    'frequency': 0, 'hardness': 0, 'speed': 0, 'temperature': 0,
                    'ts': 0, 'voltage': 0})
    rddT = df.rdd.map(lambda r: r.temperature)
    c = rddT.count()
    s = rddT.map(lambda x: pow(x - m, 2)).sum()
    print(n, c, s)
    sd = sqrt(s / c)
    return sd
When I run the above code, I get a different result: 53.195. What am I doing wrong? All I am trying to do is compute the standard deviation of the temperature column of a Spark dataframe, using a lambda.
Thanks in advance for the help.
Thanks to zero323, who gave me the clue: I should skip the null values rather than fill them with 0. The modified code is as follows:
df2 = df.na.drop(subset=["temperature"])   # drop rows with null temperature instead of filling with 0
rddT = df2.rdd.map(lambda r: r.temperature)
c = rddT.count()
s = rddT.map(lambda x: pow(x - m, 2)).sum()
sd = math.sqrt(s / c)
return sd
There are two types of standard deviation (sample and population); please refer to this: https://math.stackexchange.com/questions/15098/sample-standard-deviation-vs-population-standard-deviation
Similar question: Calculate the standard deviation of grouped data in a Spark DataFrame
The stddev() in Hive is a pointer to stddev_samp(). stddev_pop() is what you are looking for (inferred from the second part of your code). So your SQL query should be: select stddev_pop(temperature) as stdtemp from washing
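For illustration, a minimal PySpark example showing the two aggregates side by side (toy data standing in for the washing table):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(t,) for t in [10.0, 12.0, 23.0, 23.0]], ["temperature"])
df.createOrReplaceTempView("washing")

# stddev() is an alias for stddev_samp() (divides by n-1); stddev_pop() divides by n
spark.sql("SELECT stddev_samp(temperature) AS samp, stddev_pop(temperature) AS pop "
          "FROM washing").show()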