I am splitting a data frame into test and train DataFrames but have a problem with the output. I want to get the DataFrames for train and test. The code is as follows:
train_prices=int(len(prices)*0.65)
test_prices=len(prices)-train_prices
train_data,test_data=prices.iloc[0:train_prices,:].values,prices.iloc[train_prices:len(prices),:1].values
However, I only get single values rather than DataFrames. The output is something of the sort:
Code
training_size, test_size
Output
11156746, 813724
I expect train and test DataFrames that will enable me to go on with ML models. Please assist.
Since you didn't provide a reproducible example, I'll demonstrate on the iris dataset.
from sklearn.datasets import load_iris
data = load_iris(as_frame=True)
data = data["data"]
data.head()
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm)
0 5.1 3.5 1.4 0.2
1 4.9 3.0 1.4 0.2
2 4.7 3.2 1.3 0.2
3 4.6 3.1 1.5 0.2
4 5.0 3.6 1.4 0.2
First, remove the .values from your code: .values returns the underlying NumPy array, but you want the output to be a DataFrame.
Second, in test_data you took only the first column, since you used prices.iloc[train_prices:len(prices),:1] instead of
prices.iloc[train_prices:len(prices),:] as you did in train_data.
So, in order to get two DataFrame outputs for train and test:
train_prices = int(len(data) * 0.65)
test_prices = len(data) - train_prices
train_data, test_data = data.iloc[0:train_prices, :], data.iloc[train_prices:len(data), :]
By the way, if you want to do some ML, check out sklearn's train_test_split method.
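For example, a minimal sketch using train_test_split on the same iris frame (shuffle=False keeps the split sequential, like the iloc slicing above; drop it for a random split):

from sklearn.model_selection import train_test_split

train_data, test_data = train_test_split(data, train_size=0.65, shuffle=False)
print(len(train_data), len(test_data))  # 97 53 for the 150-row iris frame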
I want to convert R code to Python. I loaded the iris dataset with pandas in Python, but I don't know how to get the same result. How can I write the equivalent code in Python?
R Code
> table(iris$Species)
setosa versicolor virginica
50 50 50
I found something similar:
from sklearn.model_selection import train_test_split
import statsmodels.api as sm
import pandas as pd
iris = sm.datasets.get_rdataset('iris', 'datasets')
df = pd.DataFrame(iris.data)
iris_train, iris_test = train_test_split(df, train_size = 0.7, stratify=df["Species"])
# the column is already named "Species" here, so no rename is needed
data = iris_train.groupby("Species")
print(data.size())
Species
setosa 35
versicolor 35
virginica 35
dtype: int64
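If you just want the direct equivalent of R's table(iris$Species), without any splitting, value_counts() on the column gives the same counts:

print(df["Species"].value_counts())
# each of setosa, versicolor and virginica appears 50 times in the full data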
I recently noticed that numpy.var() and pandas.DataFrame.var() / pandas.Series.var() give different values. Is there any difference between them, or not?
Here is my dataset.
Country GDP Area Continent
0 India 2.79 3.287 Asia
1 USA 20.54 9.840 North America
2 China 13.61 9.590 Asia
Here is my code:
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
catDf.iloc[:,1:-1] = ss.fit_transform(catDf.iloc[:,1:-1])
Now checking the pandas variance:
# Pandas Variance
print(catDf.var())
print(catDf.iloc[:,1:-1].var())
print(catDf.iloc[:,1].var())
print(catDf.iloc[:,2].var())
The output is
GDP 1.5
Area 1.5
dtype: float64
GDP 1.5
Area 1.5
dtype: float64
1.5000000000000002
1.5000000000000002
Whereas it should be 1, since I have used StandardScaler on it.
And for the numpy variance:
print(catDf.iloc[:,1:-1].values.var())
print(catDf.iloc[:,1].values.var())
print(catDf.iloc[:,2].values.var())
The output is
1.0000000000000002
1.0000000000000002
1.0000000000000002
Which seems correct.
Pandas' var uses ddof=1 (sample variance) by default, while NumPy's defaults to ddof=0 (population variance).
To get the same variance in pandas as you're getting in NumPy, do
catDf.iloc[:,1:-1].var(ddof=0)
This comes down to the difference between population variance and sample variance: the sum of squared deviations is divided by n - ddof, so with only n = 3 rows the pandas result is the NumPy result inflated by n / (n - 1) = 3/2, which is exactly why you see 1.5 instead of 1.0.
Note that the sklearn StandardScaler docs explicitly mention that it uses ddof=0 and that, as this is unlikely to affect model performance (it is just for scaling), it has not been exposed as a configurable parameter.
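A minimal sketch reproducing the discrepancy, using the GDP column from the frame above (results are exact up to float rounding):

import pandas as pd

s = pd.Series([2.79, 20.54, 13.61])
scaled = (s - s.mean()) / s.std(ddof=0)  # per-column standardization, as StandardScaler does
print(scaled.var())         # 1.5 -- pandas default, ddof=1
print(scaled.values.var())  # 1.0 -- numpy default, ddof=0
print(scaled.var(ddof=0))   # 1.0 -- pandas matching numpy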
I want to train a model and finally predict a true value using a random forest model (sklearn) in Python on a three-column dataset (click the link to download the full CSV dataset), formatted as in the following:
t_stamp,X,Y
0.000543,0,10
0.000575,0,10
0.041324,1,10
0.041331,2,10
0.041336,3,10
0.04134,4,10
0.041345,5,10
0.04135,6,10
0.041354,7,10
I want to predict the current value of Y (the true value) using the last N (for example: 5, 10, 100, 300, 1000, etc.) data points of X, using sklearn's random forest model in Python. That means taking [0, 0, 1, 2, 3] of the X column as the input for the first window, and predicting the 5th row's value of Y, trained on the previous values of Y.
Let's say we have 5 traces of the dataset (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) in the current directory. For a single trace (for example, a1.csv) I can do the prediction with a window of 5 as follows:
import pandas as pd
import numpy as np
from math import sqrt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
df = pd.read_csv('a1.csv')

# add the previous 4 values of X as lagged feature columns
for i in range(1, 5):
    df['X_t' + str(i)] = df['X'].shift(i)
print(df)
df.dropna(inplace=True)

# feature matrix: current X plus its 4 previous values, NaNs replaced by 0
X = pd.DataFrame({'X_%d' % i: df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
reg = RandomForestRegressor(criterion='squared_error')  # spelled criterion='mse' in older sklearn
reg.fit(X, y)
modelPred = reg.predict(X)  # note: in-sample prediction on the training data itself
print(modelPred)
print("Number of predictions:",len(modelPred))
modelPred.tofile('predictedValues1.txt',sep="\n",format="%s")
meanSquaredError = mean_squared_error(y, modelPred)
print("Mean Square Error (MSE):", meanSquaredError)
rootMeanSquaredError = sqrt(meanSquaredError)
print("Root-Mean-Square Error (RMSE):", rootMeanSquaredError)
I have solved this for a single trace with random forest, which yields the following df:
time X Y X_t1 X_t2 X_t3 X_t4
0 0.000543 0 10 NaN NaN NaN NaN
1 0.000575 0 10 0.0 NaN NaN NaN
2 0.041324 1 10 0.0 0.0 NaN NaN
3 0.041331 2 10 1.0 0.0 0.0 NaN
4 0.041336 3 10 2.0 1.0 0.0 0.0
5 0.041340 4 10 3.0 2.0 1.0 0.0
6 0.041345 5 10 4.0 3.0 2.0 1.0
7 0.041350 6 10 5.0 4.0 3.0 2.0
.........................................................
[2845 rows x 7 columns]
[ 10. 10. 10. ..., 20. 20. 20.]
RMSE: 0.5136564734333562
However, now I want to do the prediction over all of the files (a1.csv, a2.csv, a3.csv, a4.csv and a5.csv) by using 60% of the a*.csv datasets for training and the remaining 40% for testing (meaning 3 traces for training and 2 files for testing), using sklearn in Python. How can I do that?
PS: All the files have the same structure, but they have different lengths, since they were generated with different parameters.
import glob, os
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in very old versions

df = pd.concat(map(pd.read_csv, glob.glob(os.path.join('', "a*.csv"))))
# build your X and Y arrays from df, then:
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.40)
To read in multiple files, you'll need a slight extension. Aggregate data from each csv, then call pd.concat to join them:
df_list = []
for i in range(1, 6):
    df_list.append(pd.read_csv('a%d.csv' % i))
df = pd.concat(df_list)
This will read in all your csvs, and you can carry on as usual. Get X and y:
X = pd.DataFrame({ 'X_%d'%i : df['X'].shift(i) for i in range(5)}).apply(np.nan_to_num, axis=0).values
y = df['Y'].values
Use train_test_split (in sklearn.model_selection; older sklearn versions had it in sklearn.cross_validation) to segment your data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.4)
You can also look at StratifiedKFold.
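Note that a row-level train_test_split mixes rows from all five traces. If you specifically want whole files held out (3 traces for training, 2 for testing, as described in the question), here is a minimal sketch of a file-level split, assuming the a1..a5 file names:

import pandas as pd

# 60/40 split at the file level: 3 whole traces for training, 2 for testing
train_df = pd.concat(map(pd.read_csv, ['a1.csv', 'a2.csv', 'a3.csv']), ignore_index=True)
test_df = pd.concat(map(pd.read_csv, ['a4.csv', 'a5.csv']), ignore_index=True)

Then build the lagged X and y from train_df and test_df separately, ideally shifting within each trace so that no window straddles a file boundary.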
For example, consider the iris dataset and suppose my goal is to predict sepal length.
import pandas as pd
import numpy as np
from sklearn import datasets
iris = datasets.load_iris()
train = pd.DataFrame(iris.data, columns=iris.feature_names)
train['Species'] = iris.target
train
sepal length (cm) sepal width (cm) petal length (cm) petal width (cm) Species
0 5.1 3.5 1.4 0.2 0
1 4.9 3.0 1.4 0.2 0
.. ... ... ... ... ...
148 6.2 3.4 5.4 2.3 2
149 5.9 3.0 5.1 1.8 2
I'd like to split the data into 5 folds (train1, test1), (train2, test2), ... Now consider the fold (train1, test1). Using train1, I want to measure the average sepal width, petal length, and petal width per Species, and then map those averages to test1 (based on Species). And then I want to do this for the remaining folds.
Ultimately I'd like to do this in a way that allows me to use GridSearchCV to train a RandomForestRegressor using those same folds. Is there a convenient method of doing this with scikit-learn?
I realize this seems silly for this example but I think it makes sense for my real dataset.
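As far as I know there is no single built-in helper for this exact mapping, but here is a minimal sketch of the per-fold computation described above, assuming the train frame built earlier; the same KFold object (or its list of index pairs) can then be passed as cv= to GridSearchCV so the folds match:

import pandas as pd
from sklearn.model_selection import KFold

feature_cols = ['sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
kf = KFold(n_splits=5, shuffle=True, random_state=0)

for train_idx, test_idx in kf.split(train):
    tr, te = train.iloc[train_idx], train.iloc[test_idx]
    # per-Species averages measured on the training fold only
    means = tr.groupby('Species')[feature_cols].mean()
    # map those averages onto the test fold via its Species labels
    te_mapped = te[['Species']].join(means, on='Species')
    # ...fit/score a RandomForestRegressor on tr and te_mapped here

To have GridSearchCV recompute the mapping inside each fold, the usual route is a custom transformer in a Pipeline; then pass cv=kf (or cv=list(kf.split(train))) to GridSearchCV.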