I am new to Python and pandas. My question is related to this question:
Advanced Describe Pandas
Is it possible to add some functions to the reply by noobie, such as:
geometric mean, weighted mean, harmonic mean, geometric standard deviation, etc.
import pandas as pd
def describex(data):
    data = pd.DataFrame(data)
    stats = data.describe()
    skewness = data.skew()
    kurtosis = data.kurtosis()
    skewness_df = pd.DataFrame({'skewness':skewness}).T
    kurtosis_df = pd.DataFrame({'kurtosis':kurtosis}).T
    return stats.append([kurtosis_df,skewness_df])
So basically I am interested in adding statistics from, for example, scipy.stats, unlike the functions above, which all come from pandas. I want much more information from descriptive statistics than the standard describe offers. So far I have managed to add more functions from pandas, and with that I am OK, but I wasn't able to attach functions that come from outside pandas.
How do I do it, please?
There are a couple of things you could do.
One suggestion is to use the pandas-profiling library, which can generate a comprehensive report on the data including basic statistics, correlation analysis, data type analysis, missing values analysis, and more. This can be a very useful tool for quickly getting an overview of the dataset.
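A minimal sketch of how that typically looks (assuming pandas-profiling is installed; the file names are just placeholders, and the package has since been renamed ydata-profiling with essentially the same API):
import pandas as pd
from pandas_profiling import ProfileReport

df = pd.read_csv("data.csv")  # placeholder input file

# builds an HTML report with statistics, correlations, missing-value analysis, etc.
profile = ProfileReport(df, title="Descriptive report")
profile.to_file("report.html")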
Another suggestion is to use the scipy.stats library to add any advanced statistics to your custom function. The scipy.stats library probably has a function to compute any statistic you're looking for.
For example,
import pandas as pd
import numpy as np
from scipy.stats import gmean
df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
def describex(data):
    data = pd.DataFrame(data)
    stats = data.describe()
    skewness = data.skew()
    kurtosis = data.kurtosis()
    skewness_df = pd.DataFrame({'skewness':skewness}).T
    kurtosis_df = pd.DataFrame({'kurtosis':kurtosis}).T
    # use the function argument (not the global df) and apply gmean column-wise
    gmean_df = pd.DataFrame(data.apply(gmean, axis=0), columns=['gmean']).T
    # DataFrame.append was removed in pandas 2.0, so concatenate instead
    return pd.concat([stats, kurtosis_df, skewness_df, gmean_df])
print(describex(df))
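The question also asks about the harmonic mean; scipy.stats.hmean can be bolted on the same way. A quick sketch continuing the example above, on strictly positive sample data (hmean is undefined for zeros, which randint(0, 100, ...) can produce):
from scipy.stats import hmean

df_pos = pd.DataFrame(np.random.randint(1, 100, size=(100, 4)), columns=list('ABCD'))
hmean_df = pd.DataFrame(df_pos.apply(hmean, axis=0), columns=['hmean']).T
print(hmean_df)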
Hope this helps!
How can I get the same result I'm getting with pandas in Dask?
The objective is to have a uniform time interval for each group, replicating the last value until we have a new one.
import pandas as pd
import numpy as np
import datetime
data=pd.DataFrame([["AAAA","2020-01-15",2],
["AAAA","2020-02-15",9],
["AAAA","2020-02-20",2],
["AAAA","2020-02-25",9],
["AAAA","2020-04-18",2],
["BBBB","2020-01-01",5],
["BBBB","2020-02-15",5],
["BBBB","2020-02-20",4],
["BBBB","2020-02-25",4],
["BBBB","2020-04-15",2],
["CCCC","2020-01-01",9],
["CCCC","2020-02-15",5],
["CCCC","2020-03-20",7],
["CCCC","2020-04-25",4],
["CCCC","2020-05-15",2]])
data.columns=['Asset','Date','P']
data['Date']=pd.to_datetime(data['Date'])
data.index=data['Date'].values
temp=data.groupby('Asset').resample('2D').pad()
temp
** This is just an example; the real-world application is really big.
Thanks!
.resample() functionality is not fully replicated in the current version of dask. My suggestion would be to either look into xarray (if you want to have grid-like structure) or use dask.delayed wrapped around pandas.
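For the dask.delayed route, a rough sketch built on the example frame from the question (the resampling itself stays in plain pandas; only the per-asset work is deferred, and .ffill() is used, which is equivalent to .pad()):
import pandas as pd
import dask

@dask.delayed
def fill_asset(group):
    # plain pandas resample/forward-fill on one asset's slice
    return group.resample('2D').ffill()

# one lazy task per asset, computed in parallel, then stitched back together
parts = [fill_asset(g) for _, g in data.groupby('Asset')]
temp = pd.concat(dask.compute(*parts))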
I am working on a K-Means Clustering task and I am wondering if there is some way to do some kind of ranking of clusters, or maybe assign specific weights to some specific clusters. Is there a way to do this? Here is my code.
from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
import numpy as np
from scipy.cluster.vq import kmeans,vq
import pandas as pd
import pandas_datareader as dr
from math import sqrt
from sklearn.cluster import KMeans
from matplotlib import pyplot as plt
dataset = pd.read_csv('C:\\my_path\\analytics.csv')  # read into 'dataset', which the rest of the script references
data = np.asarray([np.asarray(dataset['Rating']),np.asarray(dataset['Maturity']),np.asarray(dataset['Score']),np.asarray(dataset['Bin']),np.asarray(dataset['Price1']),np.asarray(dataset['Price2']),np.asarray(dataset['Price3'])]).T
centroids,_ = kmeans(data,1000)
idx,_ = vq(data,centroids)
details = [(name,cluster) for name, cluster in zip(dataset.Cusip,idx)]
So, I get my 'details', I look at it, and everything seems fine at this point. I end up with around 700 clusters. I'm just wondering if there is a way to rank-order these clusters, assuming 'Rating' is the most important feature. Or, perhaps there is a way to assign a higher weight to 'Rating'. I'm not sure this makes 100% sense. I'm just thinking about the concept and wondering if there is some obvious solution or maybe this is just nonsense. I can easily count the records in each cluster, but I don't think that has any significance whatsoever. I Googled this and didn't find anything useful.
One "cheat" trick would be to use the feature 'Rating' twice or three times, so it automatically gets more weight:
data = np.asarray([np.asarray(dataset['Rating']), np.asarray(dataset['Rating']), np.asarray(dataset['Maturity']),np.asarray(dataset['Score']),np.asarray(dataset['Bin']),np.asarray(dataset['Price1']),np.asarray(dataset['Price2']),np.asarray(dataset['Price3'])]).T
There are also adjusted variants of k-means around, but they are not implemented in Python.
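Another workaround is to standardize the features and then scale the 'Rating' column by a weight before clustering, which boosts its influence on the Euclidean distance. A sketch with scikit-learn (the weight of 3 and the cluster count are arbitrary; column names are taken from the question):
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

features = ['Rating', 'Maturity', 'Score', 'Bin', 'Price1', 'Price2', 'Price3']
X = StandardScaler().fit_transform(dataset[features])

# multiply the 'Rating' column so it counts more in the distance calculation
weights = np.array([3.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
labels = KMeans(n_clusters=700, random_state=0).fit_predict(X * weights)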
The title outlines my problem for the following script (please run it first and then read my final question):
Now the whole code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import pandas_datareader as pdr
from sklearn.linear_model import LinearRegression
from pandas.plotting import register_matplotlib_converters
register_matplotlib_converters()
import datetime
tickers=['EXO.MI','LDO.MI']
end=datetime.date.today()
gap=datetime.timedelta(days=650)
start=end- gap
Bank=pdr.get_data_yahoo(tickers,start=start,end=end)
bank_matrix=Bank['Adj Close']
bank_matrix=bank_matrix.dropna()
exor=bank_matrix['EXO.MI']
leonardo=bank_matrix['LDO.MI']
Regressione=pd.DataFrame(data=np.zeros((len(exor),3)),columns=['Intercetta','Hedge','Residuals'],index=bank_matrix['EXO.MI'].index)
lookback=20
Hedge=[]
Intercetta=[]
Residuals=[]
for i in range(lookback,len(exor)):
    reg=LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],bank_matrix[['EXO.MI']][i-lookback+1:i])
    # Regressione.iloc[Regressione[i,'Hedge']]=reg.coef_[0]
    Hedge.append(reg.coef_[0])
    Intercetta.append(reg.intercept_)
    y_pred=reg.predict(bank_matrix[['LDO.MI']][lookback:])
    Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred)
Regressione=pd.DataFrame(list(zip(Intercetta,Hedge,Residuals)),columns=['Intercetta','Hedge','Residuals'])
Regressione.set_index(bank_matrix[['EXO.MI']].index[lookback:],inplace=True)
The code works however I have 2 questions:
Is 'reg._residues' the real residual between the Y (the real value of 'EXO.MI') and the predicted y? I ask because the plot of the residuals was anything but normally distributed or stationary.
I'm going crazy: how can I compute the residuals for every day inside a 'for' loop?
I mean, I tried to:
take the difference between the real y values and reg.predict
do the manual computation: y_predicted = Intercetta + Hedge*bank_matrix[['LDO.MI']]
But Python always reports errors, and I honestly find it very hard to understand how Python works for this.
Thanks
It's still not 100% clear to me what you want to do here, but I hope this will get you somewhere.
First of all, your code runs fine if you just add import datetime at the beginning and drop the .to_numpy() call, i.e. replace Residuals.append(bank_matrix[['EXO.MI']][lookback:].to_numpy()-y_pred) with Residuals.append(bank_matrix[['EXO.MI']][lookback:]-y_pred), so that each entry in Residuals stays a DataFrame.
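Put together, the loop then looks roughly like this (same variable names as in your script):
Hedge, Intercetta, Residuals = [], [], []
for i in range(lookback, len(exor)):
    reg = LinearRegression().fit(bank_matrix[['LDO.MI']][i-lookback+1:i],
                                 bank_matrix[['EXO.MI']][i-lookback+1:i])
    Hedge.append(reg.coef_[0])
    Intercetta.append(reg.intercept_)
    # keep the residuals as a DataFrame so each entry can be plotted later
    y_pred = reg.predict(bank_matrix[['LDO.MI']][lookback:])
    Residuals.append(bank_matrix[['EXO.MI']][lookback:] - y_pred)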
Then you can visually check your residuals for each sub-period using:
for df in Residuals:
    df.plot.hist()
Using Residuals[-3:] instead will plot only the last three residual series of your calculations.
You can also easily run a Shapiro-Wilk test for normality for each of your residual series and append the results in a dataframe:
from scipy import stats
shapiro=[]
for df in Residuals[-3:]:
    shapiro.append(stats.shapiro(df[df.columns[0]].values))
df_shapiro = pd.DataFrame(shapiro)
df_shapiro[0] returns the W-statistic and df_shapiro[1] returns the p-values.
Take a closer look at the p-values using:
df_pVal=df_shapiro[1].to_frame()
df_pVal['alpha']=0.05
df_pVal.plot()
Take a look at the scipy documentation for more information on how to use the test.
The question still remains what you're aiming to do here. A detailed explanation would be great. Until then, I hope my effort gets you a few steps further.
Is there an equivalent of R's summary() function in numpy?
numpy has std, mean, average functions separately, but does it have a function that sums up everything, like summary does in R?
I found this question, which relates to pandas, and this article with R-to-NumPy equivalents, but neither has what I am looking for.
1. Load pandas in the console and read the CSV data file
import pandas as pd
data = pd.read_csv("data.csv", sep = ",")
2. Examine the first few rows of the data
data.head()
3. Calculate summary statistics
summary = data.describe()
4. Transpose the statistics to get a format similar to R's summary() function
summary = summary.transpose()
5. View the summary statistics in the console
summary.head()
No. You'll need to use pandas.
R is a language for statistics, so much of the basic functionality you need, like summary() and lm(), is loaded when you boot it up. Python has many uses, so you need to install and import the appropriate statistical packages. numpy isn't a statistics package - it's for numerical computation more generally, so you need to use packages like pandas, scipy and statsmodels to allow Python to do what R can do out of the box.
If you are looking for details like summary() in R, i.e.:
5 point summary for numeric variables
Frequency of occurrence of each class for categorical variables
To achieve the above in Python, you can use df.describe(include='all').
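A minimal sketch on a small mixed-type frame (column names and values are made up):
import pandas as pd

df = pd.DataFrame({
    'price': [10.5, 12.0, 9.8, 11.1],
    'grade': ['A', 'B', 'A', 'C'],
})

# numeric columns get count/mean/std and the quartiles;
# object columns get count, unique, top and freq
print(df.describe(include='all'))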
Good day
This is my maiden Stack Overflow question so I hope I get it right and don't break any rules.
I work as a Fund Manager so do not have computer science background. I am however learning python at the moment.
I am trying to fit historical data which includes multiple time series. I think I have managed to do this. The thing I need to do next is to use this data to predict values into the future for these time series. I have looked at the StatsModels documentation but can't quite make heads or tails of it.
I am using xlwings and linking to excel. My code is as follows:
import numpy as np
from xlwings import Workbook, Range
import statsmodels.api as sm
import statsmodels
import pandas
def Fit_the_AR():
    dataRange = Range('Sheet1','rDataToFit').value
    dateRange = Range('Sheet1', 'rDates').value
    titleRange = Range('Sheet1', 'rTitles').value
    ARModel = statsmodels.tsa.vector_ar.var_model.VAR(dataRange,dateRange,titleRange,freq='m')
    statsmodels.tsa.vector_ar.var_model.VAR.fit(ARModel,1, 'ols', None, 'c', True)
    Range('Sheet2','B2').value = ARModel.endog_names
    Range('Sheet2','B3').value = ARModel.endog
I thought I would have to use the predict method, but I am not sure how to get all the parameters required for it.
Any help or pointing in the right direction would be much appreciated. I can provide an excel file of the data if need be. Thank you.
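In case it is useful, here is a minimal, self-contained sketch of the fit-then-forecast pattern with statsmodels' VAR class; the synthetic monthly data, the single lag and the 12-step horizon are placeholders, not a drop-in replacement for the xlwings ranges above:
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

# synthetic stand-in for the historical data (two monthly series)
idx = pd.date_range('2015-01-31', periods=60, freq='M')
df = pd.DataFrame(np.random.randn(60, 2).cumsum(axis=0),
                  index=idx, columns=['series_a', 'series_b'])

model = VAR(df)
results = model.fit(1)  # fit a VAR with one lag

# forecast 12 periods ahead from the last observed lag values
lag_order = results.k_ar
forecast = results.forecast(df.values[-lag_order:], steps=12)
forecast_df = pd.DataFrame(forecast, columns=df.columns)
print(forecast_df)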