Python equivalent of the R table(iris$Species)

I want to convert R code to Python. I loaded the iris dataset with pandas in Python, but I don't know how to get the same result. How can I write the equivalent code in Python?
R Code
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50

I found something similar.
from sklearn.model_selection import train_test_split
import statsmodels.api as sm

# load the classic iris dataset from R's datasets package
iris = sm.datasets.get_rdataset('iris', 'datasets')
df = iris.data  # already a pandas DataFrame with a "Species" column

# stratified 70/30 split so each species keeps the same proportion
iris_train, iris_test = train_test_split(df, train_size=0.7, stratify=df["Species"])

# count rows per species, similar to R's table()
print(iris_train.groupby("Species").size())
Species
setosa 35
versicolor 35
virginica 35
dtype: int64
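If you only want the equivalent of table(iris$Species) on the full dataset, without any split, a Series' value_counts() (or groupby(...).size()) gives it directly; a minimal sketch:

import statsmodels.api as sm

iris = sm.datasets.get_rdataset('iris', 'datasets').data

# counts per category, sorted alphabetically like R's table()
print(iris["Species"].value_counts().sort_index())
# setosa        50
# versicolor    50
# virginica     50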

Related

Having problems with splitting a dataframe into train and test

I am splitting a data frame into test and train dataframes but have a problem with the output. I want to get the dataframes for train and test. The code is as follows.
train_prices=int(len(prices)*0.65)
test_prices=len(prices)-train_prices
train_data,test_data=prices.iloc[0:train_prices,:].values,prices.iloc[train_prices:len(prices),:1].values
However, I only get single values rather than dataframes. The output is something of the sort:
Code
training_size, test_size
Output
11156746, 813724
I expect train and test dataframes that will enable me to go on with ML models.
Please assist
Since you didn't provide a reproducible example, I'll demonstrate on the iris dataset.
from sklearn.datasets import load_iris
data = load_iris(as_frame=True)
data = data["data"]
data.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
First, remove the .values from your code: .values gives you the underlying NumPy array, but you want the output to be a DataFrame.
Second, in test_data you took only the first column, since you used prices.iloc[train_prices:len(prices), :1] instead of
prices.iloc[train_prices:len(prices), :] as you did in train_data.
So, in order to get two DataFrame outputs for train and test:
train_prices = int(len(data) * 0.65)
test_prices = len(data) - train_prices
train_data, test_data = data.iloc[0:train_prices, :], data.iloc[train_prices:len(data), :]
By the way, if you want to do some ML, check out sklearn's train_test_split method.
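A minimal sketch of that approach on the same data (shuffle=False keeps the sequential row order, mimicking the manual 65/35 slice above):

from sklearn.model_selection import train_test_split

# shuffle=False preserves row order like the iloc split; drop it for a random split
train_data, test_data = train_test_split(data, train_size=0.65, shuffle=False)
print(len(train_data), len(test_data))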

What is the pandas equivalent of the R function %in%?

What is the pandas equivalent of the R function %in% ?
When we have a dataframe in R, we can check in which rows a column contains strings from a list using the operator %in%, which gives a Boolean output.
Concrete example: if we want to check in which rows the strings "setosa" and "virginica" appear in the column Species of the iris dataset, we can simply use the following code:
iris[, 'Species'] %in% c('setosa', 'virginica')
How can we do the same thing in python for a pandas DataFrame?
The reason I want to do this is I want to filter the dataset and only keep rows with the species "setosa" or "virginica".
%in% in R is actually is.element:
r$> 1 %in% 1:2
[1] TRUE
r$> is.element(1, 1:2)
[1] TRUE
datar has ported some R functions to Python:
>>> from datar.all import c, f, is_element, filter
>>> from datar.datasets import iris
>>>
>>> iris >> filter(is_element(f.Species, c('setosa', 'virginica')))
Sepal_Length Sepal_Width Petal_Length Petal_Width Species
<float64> <float64> <float64> <float64> <object>
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
95 6.7 3.0 5.2 2.3 virginica
96 6.3 2.5 5.0 1.9 virginica
97 6.5 3.0 5.2 2.0 virginica
98 6.2 3.4 5.4 2.3 virginica
99 5.9 3.0 5.1 1.8 virginica
[100 rows x 5 columns]
I am the author of the datar package. Feel free to submit issues if you have any questions.
In pandas, a Series has an .isin() method, which is the equivalent of the %in% operator in R. Further, as pointed out by @rhug123, .isin can be directly applied on a series. I have made the corresponding change to the code below.
Your R code above can be implemented in python using pandas as follows - assuming that iris is a pandas DataFrame:
iris.species.isin(['setosa', 'virginica'])
You can then filter your DataFrame and only keep the rows with species 'setosa' or 'virginica' as follows:
iris[iris.species.isin(['setosa', 'virginica'])]
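And if you want the complement of that filter (R's !x %in% y), negate the Boolean mask with ~:

# keep only the rows whose species is NOT setosa or virginica
iris[~iris.species.isin(['setosa', 'virginica'])]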

Linear regression and plots through each numerical independent variable and target variable

I would like to know whether there is a way to get a one-on-one (one independent variable vs. the target variable) linear regression analysis, with its p-value, R2 value, and a plot showing how linearly related the two are (or not), and I want this to run on each independent variable separately. As far as I know, it is possible to get an OLS regression analysis from the Python statsmodels library, but it runs on the whole dataset and gives one result, and there are no plots to understand it visually.
To very quickly visualize the regressions you can try the below using seaborn:
from sklearn.datasets import load_iris
import pandas as pd
import seaborn as sns

data = load_iris()
df = pd.DataFrame(data.data, columns=['sepal.length', 'sepal.width', 'petal.length', 'petal.width'])
# melt to long format: one row per (sepal.length, other-variable) pair
df = pd.melt(df, id_vars='sepal.length')
df[:5]
sepal.length variable value
0 5.1 sepal.width 3.5
1 4.9 sepal.width 3.0
2 4.7 sepal.width 3.2
3 4.6 sepal.width 3.1
4 5.0 sepal.width 3.6
sns.lmplot(x='sepal.length', y='value', data=df, col='variable',
           col_wrap=2, aspect=0.6, height=4, palette='coolwarm')
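The plots alone don't report the p-value or R2 of each fit. A minimal sketch of a per-variable OLS loop with statsmodels, assuming sepal.length is the target as above:

import statsmodels.api as sm
import pandas as pd
from sklearn.datasets import load_iris

data = load_iris()
wide = pd.DataFrame(data.data, columns=['sepal.length', 'sepal.width',
                                        'petal.length', 'petal.width'])
target = wide['sepal.length']

# one simple regression per independent variable
for col in ['sepal.width', 'petal.length', 'petal.width']:
    X = sm.add_constant(wide[col])   # add the intercept term
    model = sm.OLS(target, X).fit()
    print(col, 'R2 = %.3f' % model.rsquared, 'p-value = %.3g' % model.pvalues[col])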

Python - How to do prediction and testing over multiple files using sklearn

I want to train a model and finally predict a truth value using a random forest model in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as in the following:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) using the last (for example: 5, 10, 100, 300, 1000, etc.) data points of X, using sklearn's random forest model in Python. That means taking [0, 0, 1, 2, 3] of the X column as the input for the first window, and predicting the 5th row's value of Y, trained on the previous values of Y.
Let's say we have 5 traces of the dataset (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) in the current directory. For a single trace (for example, a1.csv), I can do the prediction with a window of 5 as follows:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from math import sqrt

df = pd.read_csv('a1.csv')

# add lagged copies of X as extra feature columns
for i in range(1, 5):
    df['X_t' + str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)

X = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values

reg = RandomForestRegressor()  # criterion='mse' was renamed to 'squared_error' in recent sklearn
reg.fit(X, y)
modelPred = reg.predict(X)
print(modelPred)
print("Number of predictions:", len(modelPred))
modelPred.tofile('predictedValues1.txt', sep="\n", format="%s")

meanSquaredError = mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this problem with random forest, which yields the following df:
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv), using 60% of the files whose names start with a for training and the remaining 40% for testing (meaning 3 traces used for training and 2 files for testing), with sklearn in Python. How can I do that?
PS: All the files have the same structure, but they have different lengths since they were generated with different parameters.
import glob, os
import pandas as pd
from sklearn.model_selection import train_test_split  # formerly sklearn.cross_validation

# concatenate every a*.csv in the current directory into one frame
df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# get your X and Y DataFrames, then:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' % i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use sklearn's train_test_split to segment your data (it lives in sklearn.model_selection in current versions; it was in sklearn.cross_validation in older ones):
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.
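Note that train_test_split splits individual rows, not whole traces. If you literally want 3 of the 5 files for training and 2 for testing, split the file list itself; a minimal sketch using the file names from the question:

import glob
import pandas as pd

files = sorted(glob.glob('a*.csv'))   # ['a1.csv', ..., 'a5.csv']
n_train = int(len(files) * 0.6)       # 60% of the files -> 3 traces

train_df = pd.concat(map(pd.read_csv, files[:n_train]))
test_df = pd.concat(map(pd.read_csv, files[n_train:]))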

How do I generate (train, test) folds where each test fold contains features generated by the corresponding train fold?

For example, consider the iris dataset and suppose my goal is to predict sepal length.
import pandas as pd
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
train = pd.DataFrame(iris.data, columns=iris.feature_names)
train['Species'] = iris.target
train
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
.. ... ... ... ... ...
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
I'd like to split the data into 5 folds (train1, test1), (train2, test2), ... Now consider the fold (train1, test1). Using train1, I want to measure the average sepal width, petal length, and petal width per Species, and then map those averages to test1 (based on Species). And then I want to do this for the remaining folds.
Ultimately I'd like to do this in a way that allows me to use GridSearchCV to train a RandomForestRegressor using those same folds. Is there a convenient method of doing this with scikit-learn?
I realize this seems silly for this example but I think it makes sense for my real dataset.
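One pattern that works here is a custom transformer: its fit computes the per-Species means on the training fold, and its transform maps those means onto rows by Species, so wrapping it in a Pipeline lets GridSearchCV re-fit it on each fold's training data automatically. A minimal sketch (here the transformer replaces the raw features with the group means; concatenate the originals if you want to keep them):

import pandas as pd
from sklearn import datasets
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline

class SpeciesMeans(BaseEstimator, TransformerMixin):
    # learn per-Species feature means on the training fold only
    def fit(self, X, y=None):
        self.means_ = X.groupby('Species').mean()
        return self

    # map the learned means onto each row by its Species
    def transform(self, X):
        out = X[['Species']].join(self.means_, on='Species')
        return out.drop(columns='Species')

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['Species'] = iris.target
y = df.pop('sepal length (cm)')  # the target
X = df

pipe = Pipeline([('means', SpeciesMeans()),
                 ('rf', RandomForestRegressor())])
grid = GridSearchCV(pipe,
                    param_grid={'rf__n_estimators': [50, 100]},
                    cv=KFold(n_splits=5, shuffle=True, random_state=0))
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)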
