I'm using rpy2 in order to use the package FMradio in Python. This package implements a specific pipeline for exploratory factor analysis, so I'm using the output of one function as the input of the next. However, the package depends heavily on the column names of matrices to do its calculations. The automatic conversions from numpy2ri and pandas2ri erase the column and row names of matrices, making it impossible to use this package.
I thought that the simplest way around this would be to not convert the R matrices into Python arrays and just keep working with R objects until I no longer need them. Is there any way to stop the automatic conversion from happening and just deal with R objects in Python?
This is how I'm trying to use it. X_filt ends up as an empty vector because the conversion from R matrix to numpy.array erases the column names from correlation. X must be a matrix for the function subSet to work, so converting it to a pandas DataFrame is not an option.
import rpy2.robjects as ro
from rpy2.robjects import pandas2ri
pandas2ri.activate()
from rpy2.robjects.packages import importr
FMradio = importr("FMradio")
stats = importr("stats")
correlation = stats.cor(X, method = "pearson", use = "pairwise.complete.obs")
correlation_filt = FMradio.RF(correlation, t = 0.9)
X_filt = FMradio.subSet(X, correlation_filt)
regular_correlation = FMradio.regcor(X_filt, 10, verbose = False)
Thanks a lot!
Calling activate() is actually asking rpy2 to convert everything.
The documentation about conversion explains how it works:
https://rpy2.github.io/doc/v3.3.x/html/robjects_convert.html#conversion
If conversion from and to pandas is the only thing you need, the relevant section in the doc is probably enough:
https://rpy2.github.io/doc/v3.3.x/html/pandas.html
Related
I was using statsmodels to train some time series models and found that the data type of some outputs differs depending on the input type, while I expected the output type to be independent of the input type.
My question is: is this normal in statsmodels (and in other packages, e.g., sklearn)? If not, what's the standard/conventional way to handle such a situation?
I have an example below. If I use pandas.Series as input, then the output bse from statsmodels will be a pandas.Series. If the input is a list, then the output will be np.array.
import pandas as pd
# note: statsmodels.tsa.arima_model is deprecated in newer statsmodels;
# use statsmodels.tsa.arima.model.ARIMA instead
from statsmodels.tsa.arima_model import ARIMA
x1 = pd.Series([1.5302615469999998,1.130221162,1.059648341,1.246757738,0.98096523,1.173285138,
1.630229825,1.6447988169999999,1.753422,1.7624994719999998,1.60655743,1.7999185709999999,
1.7284643419999999,1.74167109,1.606315199,1.510957898,1.38138611,1.4421003190000001,1.172060761,
0.978149498,0.878831354,0.802660206])
x2 = [s for s in x1]
model1 = ARIMA(x1, order=(1,1,0))
model2 = ARIMA(x2, order=(1,1,0))
model_fit1 = model1.fit(disp=False)
model_fit2 = model2.fit(disp=False)
model_fit1.bse #outputs pandas series
model_fit2.bse #outputs numpy array
This is true for all or most models and many functions in statsmodels. It is part of the pandas support.
Pandas Series or DataFrames provide an index and other information like column names for the design matrix, and models and many functions try to preserve it and return a Series or DataFrame with the appropriate index.
Any other types will be converted to numpy arrays (np.asarray), if that is possible, and any additional information that those data structures have will be ignored.
So the rule is, if the user uses pandas, then the code assumes that the user wants matching pandas data structures back.
This could be extended to other data structures besides pandas, but there are currently no plans for supporting data structures from other packages.
I am stuck with an issue when it comes to taking slices of my data in python (I come from using Matlab).
So here is the code I'm using,
import scipy.io as sc
import math as m
import numpy as np
from scipy.linalg import expm, sinm, cosm
import matplotlib.pyplot as plt
import pandas as pd
import sys
data = pd.read_excel('DataDMD.xlsx')
print(data.shape)
print(data)
The output looks like this:
[Output screenshot]
So I wish to take only certain rows (or, as I understand it, slices in Python) of this data matrix. The other problem I have is that the top row of my matrix becomes the column titles instead of actual data points. So I have a few questions:
1) I don't need the top of the matrix to have any 'titles' or anything of that sort, because it's all numeric and all of it is data.
2) I only need to take the 6th row of the whole matrix as a new data matrix.
3) I plan on using matrix multiplication later, so is pandas suitable or do I need numpy?
So this is what I've tried,
data.iloc[0::6,:]
this gives me something like this,
[Output2 screenshot]
which is wrong because I don't want the value 24.8 to become the 'title'; it should be the first row of the new matrix.
I've also tried using np.array for this, but when I try to use iloc it says (which makes sense):
'numpy.ndarray' object has no attribute 'iloc'
If anyone has any ideas, please let me know! Thanks!
To avoid loading the first record as the header, try using the following:
pd.read_excel('DataDMD.xlsx', header=None)
The read_excel function has a header argument; its value indicates which row of the data should be used as the header. It defaults to 0. Pass None as the value for the header argument if none of the rows in your data function as a header.
There are many useful arguments, all described in the documentation of the function.
This should also help with number 2.
Hope this helps.
Good luck!
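Putting the pieces together, here is a small sketch (with an in-memory frame standing in for the spreadsheet, since DataDMD.xlsx isn't available here) showing header-less numeric data, picking out the 6th row, and converting to NumPy for matrix multiplication:

```python
import numpy as np
import pandas as pd

# Stand-in for pd.read_excel('DataDMD.xlsx', header=None): a purely numeric
# frame whose first row is data, not column titles.
data = pd.DataFrame(np.arange(40.0).reshape(8, 5))

# The 6th row (0-based position 5), kept 2-D as a 1xN matrix.
row6 = data.iloc[[5], :]
print(row6.shape)

# For matrix multiplication, convert to a NumPy array.
A = row6.to_numpy()
print(A @ A.T)  # 1x1 result
```

Note that `data.iloc[0::6, :]` means "every 6th row starting from row 0"; a single row at position 5 is `data.iloc[[5], :]`.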
I've been experimenting with Seaborn's lmplot() and Statsmodels .ols() functions for simple linear regression plots and their associated p-values, r-squared, etc.
I've noticed that when I specify which columns I want to use for lmplot, I can specify a column even if its name contains multiple words:
import seaborn as sns
import pandas as pd
input_csv = pd.read_csv('./test.csv',index_col = 0,header = 0)
input_csv
sns.lmplot(x='Age',y='Count of Specific Strands',data = input_csv)
<seaborn.axisgrid.FacetGrid at 0x2800985b710>
However, if I try to use ols, I get an error when passing "Count of Specific Strands" as my dependent variable (I've only listed the last couple of lines of the error):
import statsmodels.formula.api as smf
test_results = smf.ols('Count of Specific Strands ~ Age',data = input_csv).fit()
File "<unknown>", line 1
Count of Specific Strands
^
SyntaxError: invalid syntax
Conversely, if I select the column by position, as shown below, the regression works:
test_results = smf.ols('input_csv.iloc[:,1] ~ Age',data = input_csv).fit()
test_results.summary()
Does anyone know why this is? Is it just because of how Statsmodels was written? Is there an alternative to specify the dependent variable for regression analysis that doesn't involve iloc or loc?
This is due to the way the formula parser patsy is written: see this link for more information
The authors of patsy have, however, thought of this problem: (quoted from here)
This flexibility does create problems in one case, though – because we
interpret whatever you write in-between the + signs as Python code,
you do in fact have to write valid Python code. And this can be tricky
if your variable names have funny characters in them, like whitespace
or punctuation. Fortunately, patsy has a builtin “transformation”
called Q() that lets you “quote” such variables
Therefore, in your case, you should be able to write:
smf.ols('Q("Count of Specific Strands") ~ Age',data = input_csv).fit()
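A self-contained sketch of the Q() fix (the DataFrame here is made-up data standing in for test.csv):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical stand-in for test.csv: a column whose name contains spaces.
df = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50],
    "Count of Specific Strands": [11.0, 13.5, 14.9, 17.2, 19.1, 20.8],
})

# Q("...") quotes the name so patsy doesn't try to parse it as Python code.
fit = smf.ols('Q("Count of Specific Strands") ~ Age', data=df).fit()
print(fit.params)  # Intercept and Age coefficients
```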
I am trying to do some exploratory data analysis and I have a data frame with an integer age column and a "category" column. Making a histogram of the age is easy enough. What I want to do is maintain this age histogram but color the bars based on the categorical variables.
import numpy as np
import pandas as pd
# ageSeries is the integer age column of the data frame
ageSeries.hist(bins=np.arange(-0.5, 116.5, 1))
I was able to do what I wanted easily in one line with ggplot2 in R
ggplot(data, aes(x=Age, fill=Category)) + geom_histogram(binwidth = 1)
I wasn't able to find a good solution in Python, but then I realized there was a ggplot2 library for Python and installed it. I tried to do the same ggplot command...
ggplot(data, aes(x="Age", fill="Category")) + geom_histogram(binwidth = 1)
Looking at these results, we can see that the different categories are treated as different series and overlaid rather than stacked. I don't want to mess around with transparencies, and I still want to maintain the overall distribution of the population.
Is this something I can fix with a parameter in the ggplot call, or is there a straightforward way to do this in Python at all without doing a bunch of extra dataframe manipulations?
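For reference, the stacked look can be reproduced with plain matplotlib by passing one array of ages per category to plt.hist with stacked=True (a sketch with made-up data standing in for the real frame):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Made-up data standing in for the age/category frame.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Age": rng.integers(0, 100, size=500),
    "Category": rng.choice(["A", "B", "C"], size=500),
})

# One array of ages per category; stacked=True stacks them in each bin,
# so the overall outline is the distribution of the whole population.
groups = [g["Age"].to_numpy() for _, g in data.groupby("Category")]
labels = sorted(data["Category"].unique())
counts, bins, _ = plt.hist(groups, bins=np.arange(-0.5, 101.5, 1),
                           stacked=True, label=labels)
plt.legend()
plt.savefig("stacked_ages.png")
```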
Is there an equivalent of R's summary() function in numpy?
numpy has std, mean, average functions separately, but does it have a function that sums up everything, like summary does in R?
I found this question, which relates to pandas, and this article with R-to-numpy equivalents, but neither has what I'm looking for.
1. Load Pandas in console and load csv data file
import pandas as pd
data = pd.read_csv("data.csv", sep = ",")
2. Examine first few rows of data
data.head()
3. Calculate summary statistics
summary = data.describe()
4. Transpose statistics to get similar format as R summary() function
summary = summary.transpose()
5. Visualize summary statistics in console
summary.head()
No. You'll need to use pandas.
R is a language for statistics, so much of the basic functionality you need, like summary() and lm(), is loaded when you boot it up. Python has many uses, so you need to install and import the appropriate statistical packages. numpy isn't a statistics package - it's for numerical computation more generally - so you need packages like pandas, scipy and statsmodels to let Python do what R can do out of the box.
If you are looking for details like summary() in R, i.e.:
a 5-point summary for numeric variables
the frequency of occurrence of each class for categorical variables
then to achieve the above in Python you can use df.describe(include='all').
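A sketch with a made-up frame mixing a numeric and a categorical column:

```python
import pandas as pd

# Hypothetical frame with one numeric and one categorical column.
df = pd.DataFrame({
    "age": [23, 35, 35, 41, 58],
    "group": ["a", "b", "b", "b", "c"],
})

summary = df.describe(include="all")
print(summary)
# Numeric columns get count/mean/std/min/quartiles/max;
# object columns get count/unique/top/freq.
```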