Is there an equivalent of R's summary() function in numpy?
numpy has std, mean, average, etc. as separate functions, but does it have a single function that summarizes everything, like summary() does in R?
I found this question, which relates to pandas, and this article with R-to-numpy equivalents, but neither has what I'm looking for.
1. Load pandas in the console and read the CSV data file
import pandas as pd
data = pd.read_csv("data.csv", sep=",")
2. Examine first few rows of data
data.head()
3. Calculate summary statistics
summary = data.describe()
4. Transpose the statistics to get a format similar to R's summary() function
summary = summary.transpose()
5. Visualize summary statistics in console
summary.head()
No. You'll need to use pandas.
R is a language for statistics, so much of the basic functionality you need, like summary() and lm(), is loaded when you boot it up. Python has many uses, so you need to install and import the appropriate statistical packages. numpy isn't a statistics package - it's for numerical computation more generally - so you need packages like pandas, scipy and statsmodels to let Python do what R can do out of the box.
If you are looking for details like summary() in R, i.e.
5 point summary for numeric variables
Frequency of occurrence of each class for categorical variable
To achieve the above in Python, you can use df.describe(include='all').
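For example, a quick sketch on a made-up frame with one numeric and one categorical column:

import pandas as pd

df = pd.DataFrame({
    "age": [23, 35, 41, 29, 35, 50],          # numeric -> five-number summary, mean, std
    "group": ["a", "b", "a", "a", "c", "b"],  # categorical -> count, unique, top, freq
})
print(df.describe(include="all"))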
Related
I am new to python and pandas. My question is related to that question:
Advanced Describe Pandas
Is it possible to add some functions to the reply by noobie, like:
geometric mean, weighted mean, harmonic mean, geometric standard deviation, etc.
import pandas as pd

def describex(data):
    data = pd.DataFrame(data)
    stats = data.describe()   # the standard describe() table
    skewness = data.skew()
    kurtosis = data.kurtosis()
    skewness_df = pd.DataFrame({'skewness': skewness}).T
    kurtosis_df = pd.DataFrame({'kurtosis': kurtosis}).T
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    return pd.concat([stats, kurtosis_df, skewness_df])
So basically I am interested in adding something, for example from scipy.stats, that does not originate from pandas like the functions above. I want much more information from descriptive statistics than the standard describe() offers. What I tried so far was adding more functions from pandas, and with that I am OK, but I wasn't able to attach functions from outside of pandas.
How do I do it, please?
There are a couple of things you could do.
One suggestion is to use the pandas-profiling library, which can generate a comprehensive report on the data including basic statistics, correlation analysis, data type analysis, missing values analysis, and more. This can be a very useful tool for quickly getting a comprehensive overview of the dataset.
Another suggestion is to use the scipy.stats library to add any advanced statistics to your custom function. The scipy.stats library probably has a function to compute any statistic you're looking for.
For example,
import pandas as pd
import numpy as np
from scipy.stats import gmean

# low=1 rather than 0: a zero in a column would drag its geometric mean to 0
df = pd.DataFrame(np.random.randint(1, 100, size=(100, 4)), columns=list('ABCD'))

def describex(data):
    data = pd.DataFrame(data)
    stats = data.describe()
    skewness = data.skew()
    kurtosis = data.kurtosis()
    skewness_df = pd.DataFrame({'skewness': skewness}).T
    kurtosis_df = pd.DataFrame({'kurtosis': kurtosis}).T
    # apply gmean column-wise to the function argument (not the global df)
    gmean_df = pd.DataFrame(data.apply(gmean, axis=0), columns=['gmean']).T
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    return pd.concat([stats, kurtosis_df, skewness_df, gmean_df])

print(describex(df))
Hope this helps!
I'm using rpy2 in order to be able to use the package FMradio in Python. This package implements a specific pipeline for exploratory factor analysis, so I'm using the output of each function as the input of the next one. However, the package heavily depends on the column names of matrices to do its calculations. The automatic conversions from numpy2ri and pandas2ri erase the column and row names of matrices, making it impossible to use this package.
I thought the simplest way to bypass this problem would be to not convert the R matrices into Python arrays and just use R objects until I no longer need them. Is there any way to stop the automatic conversion from happening and just deal with R objects in Python?
This is how I'm trying to use it. X_filt ends up as an empty vector because the conversion from R matrix to numpy.array erases the column names from correlation. X must be a matrix for the function subSet to work, so converting it to a pandas dataframe is not an option.
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()
from rpy2.robjects.packages import importr
FMradio = importr("FMradio")
stats = importr("stats")
correlation = stats.cor(X, method = "pearson", use = "pairwise.complete.obs")
correlation_filt = FMradio.RF(correlation, t = 0.9)
X_filt = FMradio.subSet(X, correlation_filt)
regular_correlation = FMradio.regcor(X_filt, 10, verbose = False)
Thanks a lot!
Calling activate() is actually asking rpy2 to convert everything.
The documentation about conversion outlines how conversion is working:
https://rpy2.github.io/doc/v3.3.x/html/robjects_convert.html#conversion
If conversion from and to pandas is the only thing you need, the relevant section in the doc is probably enough:
https://rpy2.github.io/doc/v3.3.x/html/pandas.html
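As a sketch of what that looks like for this case (assuming X is already an R matrix, e.g. built or read on the R side, so its dimnames are intact):

import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
from rpy2.robjects.conversion import localconverter
from rpy2.robjects.packages import importr

FMradio = importr("FMradio")
stats = importr("stats")

# No pandas2ri.activate(): R functions return rpy2 objects, so the
# dimnames survive from one FMradio call to the next.
correlation = stats.cor(X, method="pearson", use="pairwise.complete.obs")
correlation_filt = FMradio.RF(correlation, t=0.9)
X_filt = FMradio.subSet(X, correlation_filt)
regular_correlation = FMradio.regcor(X_filt, 10, verbose=False)

# Convert to pandas only at the very end, inside a local converter.
with localconverter(ro.default_converter + pandas2ri.converter):
    result = ro.conversion.rpy2py(X_filt)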
I was using statsmodels to train some time series models and found that the data type of some of the outputs differs depending on the input type, whereas I expected the output type to be independent of the input type.
My question is: is this normal in statsmodels (and in other packages, e.g. sklearn)? If not, what is the conventional way to handle such a situation?
I have an example below. If I use a pandas.Series as input, then the output bse from statsmodels will be a pandas.Series. If the input is a list, then the output will be a np.array.
import pandas as pd
from statsmodels.tsa.arima_model import ARIMA
x1 = pd.Series([1.5302615469999998,1.130221162,1.059648341,1.246757738,0.98096523,1.173285138,
1.630229825,1.6447988169999999,1.753422,1.7624994719999998,1.60655743,1.7999185709999999,
1.7284643419999999,1.74167109,1.606315199,1.510957898,1.38138611,1.4421003190000001,1.172060761,
0.978149498,0.878831354,0.802660206])
x2 = [s for s in x1]
model1 = ARIMA(x1, order=(1,1,0))
model2 = ARIMA(x2, order=(1,1,0))
model_fit1 = model1.fit(disp=False)
model_fit2 = model2.fit(disp=False)
model_fit1.bse #outputs pandas series
model_fit2.bse #outputs numpy array
This is true for all or most models and many functions in statsmodels. It is part of the pandas support.
Pandas Series or DataFrames provide an index and other information like column names for the design matrix, and models and many functions try to preserve it and return a Series or DataFrame with the appropriate index.
Any other type will be converted to a numpy array (via np.asarray), if that is possible, and any additional information that those data structures carry will be ignored.
So the rule is, if the user uses pandas, then the code assumes that the user wants matching pandas data structures back.
This could be extended to other data structures besides pandas, but there are currently no plans for supporting datastructures from other packages.
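A minimal sketch of that convention (not statsmodels' actual code): remember whether the input was pandas, compute on a plain ndarray, and wrap the result back with the original index if it was.

import numpy as np
import pandas as pd

def demeaned(data):
    is_pandas = isinstance(data, pd.Series)
    arr = np.asarray(data)        # lists, tuples, arrays all become ndarrays
    result = arr - arr.mean()     # stand-in for the real computation
    if is_pandas:
        return pd.Series(result, index=data.index)  # pandas in -> pandas out
    return result

print(type(demeaned(pd.Series([1.0, 2.0, 3.0]))))  # <class 'pandas.core.series.Series'>
print(type(demeaned([1.0, 2.0, 3.0])))             # <class 'numpy.ndarray'>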
I am trying to do some exploratory data analysis and I have a data frame with an integer age column and a "category" column. Making a histogram of the age is easy enough. What I want to do is maintain this age histogram but color the bars based on the categorical variables.
import numpy as np
import pandas as pd

# ageSeries is the integer "Age" column of the data frame described above
ageSeries.hist(bins=np.arange(-0.5, 116.5, 1))
I was able to do what I wanted easily in one line with ggplot2 in R
ggplot(data, aes(x=Age, fill=Category)) + geom_histogram(binwidth = 1)
I wasn't able to find a good solution in Python, but then I realized there was a ggplot2 library for Python and installed it. I tried to do the same ggplot command...
ggplot(data, aes(x="Age", fill="Category")) + geom_histogram(binwidth = 1)
Looking at these results, we can see that the different categories are treated as different series and overlaid rather than stacked. I don't want to mess around with transparencies, and I still want to maintain the overall distribution of the population.
Is this something I can fix with a parameter in the ggplot call, or is there a straightforward way to do this in Python at all without doing a bunch of extra dataframe manipulations?
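For what it's worth, one plain-matplotlib way to get the stacked version (a sketch assuming data has "Age" and "Category" columns as in the question): pass one array of ages per category to plt.hist with stacked=True.

import matplotlib.pyplot as plt

labels, groups = [], []
for name, grp in data.groupby("Category"):
    labels.append(name)
    groups.append(grp["Age"].values)

# One array per category, stacked so the overall age distribution is preserved
plt.hist(groups, bins=range(0, 117), stacked=True, label=labels)
plt.xlabel("Age")
plt.legend()
plt.show()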
I'm currently writing some code involving financial calculations, in particular some exponential moving averages. To do the job I have tried Pandas and Talib:
talib_ex=pd.Series(talib.EMA(self.PriceAdjusted.values,timeperiod=200),self.PriceAdjusted.index)
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=True,min_periods=200-1).mean()
They both work fine, but they give different results at the beginning of the array.
So is there some parameter to change in pandas's EWMA, or is it a bug I should worry about?
Thanks in advance
Luca
For the talib EMA, the formula is the recursive one:
ema[0] = price[0]; ema[i] = alpha * price[i] + (1 - alpha) * ema[i-1], where alpha = 2 / (timeperiod + 1)
So when using pandas, if you want the pandas EMA to match talib's, you should call it as:
pandas_ex=self.PriceAdjusted.ewm(span=200,adjust=False,min_periods=200-1).mean()
Set adjust to False, according to the documentation (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.ewm.html), if you want to use the same formula as talib:
When adjust is True (default), weighted averages are calculated using weights (1-alpha)^(n-1), (1-alpha)^(n-2), ..., 1-alpha, 1.
When adjust is False, weighted averages are calculated recursively as:
weighted_average[0] = arg[0]; weighted_average[i] = (1-alpha) * weighted_average[i-1] + alpha * arg[i].
You can also reference here:
https://en.wikipedia.org/wiki/Moving_average
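A quick numerical check of the recursive (adjust=False) formula above, on made-up data:

import numpy as np
import pandas as pd

span = 200
alpha = 2.0 / (span + 1)
prices = pd.Series(np.random.random(1000))

# ema[0] = x[0]; ema[i] = (1 - alpha) * ema[i-1] + alpha * x[i]
ema = np.empty(len(prices))
ema[0] = prices.iloc[0]
for i in range(1, len(prices)):
    ema[i] = (1 - alpha) * ema[i - 1] + alpha * prices.iloc[i]

print(np.allclose(ema, prices.ewm(span=span, adjust=False).mean().values))  # True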
PS: however, in my project I still find some small differences between talib and pandas.ewm, and I don't know why yet...