I'm performing a linear regression on a dataset (an Excel file) which consists of a Date column, a Score column, and an additional column called Prediction, filled with NaN values, which will be used to store the predicted values.
I have found that my independent variable, X, contains timestamps, which I wasn't actually expecting. Perhaps I'm doing something wrong, or missing something?
Top of the original dataset:
Date Score
0 2019-05-01 4.607744
1 2019-05-02 4.709202
2 2019-05-03 4.132390
3 2019-05-05 4.747308
4 2019-05-07 4.745926
# Create the independent data set (X)
# Convert the dataframe to a numpy array
X = np.array(df.drop(['Prediction'], axis=1))
# Remove the last '30' rows
X = X[:-forecast_out]
print(X)
Example of output:
[[Timestamp('2019-05-01 00:00:00') 4.607744342064972]
[Timestamp('2019-05-02 00:00:00') 4.709201914086133]
[Timestamp('2019-05-03 00:00:00') 4.132389742485806]
[Timestamp('2019-05-05 00:00:00') 4.74730802483691]
[Timestamp('2019-05-07 00:00:00') 4.7459264970444615]
[Timestamp('2019-05-08 00:00:00') 4.595303054619376]
# Create the dependent data set (y)
# Convert the dataframe to a numpy array
y = np.array(df['Prediction'])
# Get all of the y values except the last '30' rows
y = y[:-forecast_out]
print(y)
Some of the output:
[4.63738251 4.34354486 5.12284464 4.2751933 4.53362196 4.32665058
4.77433793 4.37496465 4.31239161 4.90445026 4.81738271 3.99114536
5.21672369 4.4932632 4.46858993 3.93271862 4.55618508 4.11493084
4.02430584 4.11672606 4.19725244 4.3088558 4.98277563 4.97960989
# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create and train the Linear Regression model
lr = LinearRegression()
# Train the model
lr.fit(x_train, y_train)
The error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
Clearly the dataset X doesn't like having the timestamp and, as I say, I wasn't really expecting it.
Any help on removing it (or perhaps I need it!?) would be great. As you can see, I'm simply looking to perform a simple regression analysis.
Do not include the Timestamps (Date) in your creation of 'X'.
The data set is already ordered, so do you really need the timestamps? Another option is to reassign the index. In either case, do not pass raw Timestamps to the model as feature data.
Implement the change at this step:
X = np.array(df.drop(['Prediction'], axis=1))
Do something like:
X = np.array(df.drop(['Date', 'Prediction'], axis=1))
I think the problem could also be solved by using the date timestamp as the index field instead of a feature column (e.g. df.set_index('Date')); reset_index can re-assign a plain integer index later if you need the dates back as a column.
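For reference, here is a minimal sketch of both options, assuming df, forecast_out, and the column names from the question:
import numpy as np
import pandas as pd

# Option 1: drop the Date column entirely -- the rows are already ordered
X = np.array(df.drop(['Date', 'Prediction'], axis=1))

# Option 2: keep the dates as a feature by converting them to numbers first,
# e.g. ordinal day numbers (one possible encoding; any numeric encoding works)
# df['Date'] = pd.to_datetime(df['Date']).map(pd.Timestamp.toordinal)

# Either way, X now contains only numbers, so LinearRegression can fit it
X = X[:-forecast_out]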
Related
I have a data frame "df" with columns "bedrooms", "bathrooms", "sqft_living", and "sqft_lot".
I want to fill in the missing values in a column by training a regression model on the other columns and predicting the missing entries from them.
As an example, the sqft_living value is missing in row 12. To fill it in, the bedrooms, bathrooms, and sqft_lot values would be used to predict the missing value.
Is there any way to do this? Any help is appreciated. Thanks!
import pandas as pd
from sklearn.linear_model import LinearRegression
# setup
dictionary = {'bedrooms': [3,3,2,4,3,4,3,3,3,3,3,2,3,3],
'bathrooms': [1,2.25,1,3,2,4.5,2.25,1.5,1,2.5,2.5,1,1,1.75],
'sqft_living': [1180, 2570,770,1960,1680,5420,1715,1060,1780,1890,'',1160,'',1370],
'sqft_lot': [5650,7242,10000,5000,8080,101930,6819,9711,7470,6560,9796,6000,19901,9680]}
df = pd.DataFrame(dictionary)
# setup x and y for training
# drop data with empty row
clean_df = df[df['sqft_living'] != '']
# separate variables into my x and y
x = clean_df.iloc[:, [0,1,3]].values
y = clean_df['sqft_living'].values
# fit my model
lm = LinearRegression()
lm.fit(x, y)
# get the rows I am trying to do my prediction on
predict_x = df[df['sqft_living'] == ''].iloc[:, [0,1,3]].values
# perform my prediction
lm.predict(predict_x)
# I get the values 1964.983 for row 10 and 1567.068 for row 12
It should be noted that you're asking about imputation. I suggest reading up on the other methods, their trade-offs, and when to use them.
Edit: putting the predicted values back into the DataFrame:
# Get index of missing data
missing_index = df[df['sqft_living'] == ''].index
# Replace
df.loc[missing_index, 'sqft_living'] = lm.predict(predict_x)
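As one of those alternative methods (my suggestion, not part of the answer above), scikit-learn ships a model-based imputer that automates this column-by-column regression. A sketch, assuming the missing entries are first converted from '' to NaN:
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Convert the empty strings to NaN so the imputer can see them
df['sqft_living'] = pd.to_numeric(df['sqft_living'], errors='coerce')

# IterativeImputer regresses each column with missing values on the others,
# which is the same idea as the manual LinearRegression approach above
imputer = IterativeImputer(random_state=0)
df[df.columns] = imputer.fit_transform(df)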
I have a dataframe like this sample:
priceUsd,time,date
38492.2698958105979245,1627948800000,2021-08-03T00:00:00.000Z
39573.1543437718690816,1628035200000,2021-08-04T00:00:00.000Z
40090.5174131427618446,1628121600000,2021-08-05T00:00:00.000Z
41356.0360622010701055,1628208000000,2021-08-06T00:00:00.000Z
43535.9969201307711635,1628294400000,2021-08-07T00:00:00.000Z
I want to split off the last 10 rows as a test dataset for TensorFlow, and take everything from the first row up to the last 10 rows for training.
train = df.loc[:-10 , ['priceUsd']]
test = df.loc[-10: , ['priceUsd']]
When I run this code, it shows this error:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [-10] of type int
How do I fix it?
Try this instead:
train = df[['priceUsd']].head(len(df) - 10)
test = df[['priceUsd']].tail(10)
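If you prefer slicing, a minimal equivalent sketch: iloc indexes by position, so negative indexers work even on a DatetimeIndex (unlike label-based loc):
# Positional slicing sidesteps the DatetimeIndex entirely
train = df.iloc[:-10][['priceUsd']]
test = df.iloc[-10:][['priceUsd']]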
I converted my dataset features into integers using the following code:
car_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 3, 4, 5],
                       'Categorical Feature': ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety']})
This worked. Now, I am trying to create a decision tree and used the following code:
from sklearn.tree import DecisionTreeClassifier
dtree = DecisionTreeClassifier()
dtree.fit(car_df, y)
However, I get an error stating: ValueError: could not convert string to float: 'buying'
'Buying' is the first categorical feature in the dataset. There are six categorical features.
I thought that would not have been an issue since I converted the features to integers. Does anyone have an idea of how to fix this?
I just pulled this cars dataset so I have a better idea of its contents. Based on the documentation, here are the columns with possible values:
buying:   v-high, high, med, low
maint:    v-high, high, med, low
doors:    2, 3, 4, 5-more
persons:  2, 4, more
lug_boot: small, med, big
safety:   low, med, high
So all of these columns can contain strings and they all need to be converted to numeric type before you can pass the dataset to your model's fit() method.
See the pandas documentation for the get_dummies() method: https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html
Once you have your original dataset in a dataframe (call it df), you can pass it to the .get_dummies() method like this:
import pandas as pd
df_with_dummies = pd.get_dummies(df)
This code will convert every column with object or category dtype into indicator (dummy) columns, naming each new column with the {original column name}_{original value} convention.
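A small sketch of what that looks like, using a couple of the documented values (the rows here are illustrative, not the real dataset):
import pandas as pd

df = pd.DataFrame({'buying': ['v-high', 'low'],
                   'safety': ['low', 'high']})
df_with_dummies = pd.get_dummies(df)
print(df_with_dummies.columns.tolist())
# ['buying_low', 'buying_v-high', 'safety_high', 'safety_low']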
I want to evaluate categorical data in Python with a decision tree, using binning to create categorical labels from the quantitative labels. Do I have to?
The problem is that get_dummies returns a dataframe with a different length than the values that were given: it is two rows shorter than the original data.
Previously I tried to use LabelEncoder but didn't get it working. I then tried get_dummies from pandas, which seemed easier to me.
I checked the reference for the get_dummies function and searched for the problem, but could not find why the length is shorter.
Doing the binning:
from sklearn.preprocessing import KBinsDiscretizer

est = KBinsDiscretizer(n_bins=50, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(np.array(quant_labels).reshape(-1, 1))
Extract the categorical data (do I have to?):
category = rd.select_dtypes(exclude=['number']).astype("category")
category = category.replace(math.nan, "None")
category = category.replace(0, "None")
Prepare the split:
one_hot_features = pd.get_dummies(category[1:-1])
X_train, X_test, y_train, y_test = train_test_split(one_hot_features, cat_labels, test_size = 0.6, random_state = None)
The Error is:
ValueError: Found input variables with inconsistent number of samples: [1458, 1460]
The correct number of samples is 1460. The one-hot encoded data is two samples short. Why is that?
When you encode your data you use category[1:-1]. This encodes only the elements from the second element through the second-to-last element.
Explanation:
1) Indexes are zero-based, so 1 is the index of the second item.
2) An index of -1 refers to the last element, and a slice's stop is exclusive, so [1:-1] stops just before the last element.
Solution:
Change your line to one_hot_features = pd.get_dummies(category) so that no rows are sliced off.
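A tiny sketch of the off-by-two effect with dummy data:
import pandas as pd

s = pd.Series([10, 20, 30, 40])
print(len(s[1:-1]))  # 2 -- the first and last rows were sliced off
print(len(s))        # 4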
Hi there, I am working with a scikit-learn dataset (digits), and I split the data.
So I have x_train and y_train arrays.
The arrays are related in such a way that index x[0] belongs to y[0].
print(x_train.shape)
(1347, 64)
print(y_train.shape)
(1347,)
print(set(y_train))
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
I would like to extract a random sample from x_train given set(y), i.e. to resample my data by extracting just one random observation for a label in set(y). However, I don't know if I can do this with numpy or pandas; does anyone have an idea of how to deal with this?
Thank you very much.
It is not clear what you want to do.
The set(y) contains all the available labels of your dataset X.
In general (until you specify what you need), use np.random.choice:
You have this:
print(set(y))
{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
Convert it first to a list:
index_all = list(set(y))
Now, randomly sample the set(y):
# this is a random index (class/label) from 0 to 9.
random_index = np.random.choice(index_all, 1)
Now, I see two possibilities (I believe you want Case 2):
1) Directly resample x based on this random index (random based on the set(y))
Finally, if x is a numpy array:
x[random_index, :]
This returns a random observation of x based on the set(y)
2) Resample x to get the observations that have a given label y, where the label is the one chosen randomly above (random_index):
x[y == random_index]
This returns all the observations of x associated with the randomly chosen label.
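Putting Case 2 together, a minimal end-to-end sketch with dummy data that also picks a single random row out of the matching observations:
import numpy as np

x = np.random.rand(1347, 64)
y = np.random.randint(0, 10, size=1347)

random_label = np.random.choice(list(set(y)))            # one label at random
candidates = x[y == random_label]                        # all rows with that label
one_row = candidates[np.random.choice(len(candidates))]  # one random row among them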
This is the approach I generally use for constructing a dataframe and extracting data from it.
import numpy as np
import pandas as pd
#Dummy arrays for x and y
x_train = np.zeros((1347,64))
y_train = np.ones((1347))
#First we pair up the arrays according to their index using zip. Only use this
#method if both arrays are of equal length.
training_dataset = list(zip(x_train,y_train))
#Next we load the dataset as a dataframe using Pandas
df = pd.DataFrame(data=training_dataset)
#Check that the dataframe is what you want
df.head()
#If you would like to extract a random row, you may use
df.sample(n=1)
#Alternatively, if you would like to extract a specific row (e.g. the 10th row, aka index 9)
df.iloc[9]
I hope I've understood what you wanted to achieve but if not, feel free to let me know so I can amend my answer!
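If the goal is one random observation per label, a groupby-based sketch may also work (assuming pandas >= 1.1 for GroupBy.sample; column 1 holds the labels in the dataframe built above):
# One random row for each distinct label in column 1
one_per_label = df.groupby(1).sample(n=1)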
Sources:
Pandas Docs
Selecting Rows and Columns in Pandas Dataframes