I have a dataframe like this sample:
priceUsd,time,date
38492.2698958105979245,1627948800000,2021-08-03T00:00:00.000Z
39573.1543437718690816,1628035200000,2021-08-04T00:00:00.000Z
40090.5174131427618446,1628121600000,2021-08-05T00:00:00.000Z
41356.0360622010701055,1628208000000,2021-08-06T00:00:00.000Z
43535.9969201307711635,1628294400000,2021-08-07T00:00:00.000Z
I want to split off the last 10 rows as a test dataset for TensorFlow and use everything from the first row up to the last 10 rows for training.
train = df.loc[:-10 , ['priceUsd']]
test = df.loc[-10: , ['priceUsd']]
When I run this code it shows this error:
TypeError: cannot do slice indexing on DatetimeIndex with these indexers [-10] of type int
How can I fix it?
Try this instead:
train = df[['priceUsd']].head(len(df) - 10)
test = df[['priceUsd']].tail(10)
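If you prefer to keep positional slicing, iloc works here as well, since iloc is purely positional and accepts negative indexers regardless of the DatetimeIndex. A small sketch of that alternative:
train = df.iloc[:-10][['priceUsd']]
test = df.iloc[-10:][['priceUsd']]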
I have a dataframe with a date column that contains the month, and I need to create two dataframes from it. The first should contain all the listings whose month values are from 1 to 6, and the second all the listings whose month values are from 7 to 12. How can I do this? I have tried this:
train_set = data.loc(data['DateTime'] <= 6)
test_set = data.loc(data['DateTime'] > 6)
But I am getting the following error:
TypeError: unhashable type: 'Series'
Why might I be getting this error, and how can I achieve what I am trying to do? The 'DateTime' column contains only the month value, which I extracted from the original data that was in Python datetime format.
Use the correct syntax for loc[]: it takes square brackets, not parentheses.
train_set = data.loc[data['DateTime'] <= 6]
test_set = data.loc[data['DateTime'] > 6]
Alternatively to loc, you can achieve the same result with boolean indexing directly:
train_set = data[data['DateTime'] <= 6]
test_set = data[data['DateTime'] > 6]
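For illustration, here is a minimal, self-contained sketch with made-up month values (the 'price' column is just an assumed example column) showing that both forms produce the same split:
import pandas as pd

data = pd.DataFrame({'DateTime': [1, 3, 6, 7, 9, 12],
                     'price':    [10, 11, 12, 13, 14, 15]})
train_set = data.loc[data['DateTime'] <= 6]   # months 1 to 6
test_set = data[data['DateTime'] > 6]         # months 7 to 12
print(train_set)
print(test_set)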
I have an 8 by 7 dataframe 'selected_parameters'.
Its ar_params and ma_params columns correspond to the evaluated parameters of an ARMA model fitted on a time series.
I then randomly select one parameter from ar_params and one from ma_params:
ar_sample = selected_parameters['ar_params'].sample(1)
ma_sample = selected_parameters['ma_params'].sample(1)
I then modify them as follows so they can be used to generate a time series with an ARMA process, following the explanation at the end of this page:
https://www.statsmodels.org/stable/generated/statsmodels.tsa.arima_process.arma_generate_sample.html
ar_sample_array = np.r_[1, -ar_sample]
ma_sample_array = np.r_[1, ma_sample]
y = arma_generate_sample(ar_sample_array, ma_sample_array, nsample=100, scale=0.1)
plt.plot(y)
Everything works well IF the selected set of ar_params and ma_params contains only ONE value.
If the random selection picks a set with two or more values, I receive the following error message:
ValueError: setting an array element with a sequence.
When printing the values of ar_sample_array and ma_sample_array
print(ar_sample_array)
print(ma_sample_array)
I get the following output
[1 array([-1.01, 0.01])]
[1 array([-0.76, 0.03])]
Thank you
I think the params must be a single flat array, not one array with another array nested inside. Since ar_sample is a pandas Series whose single element is itself an array, pull that inner value out before building the parameter vector. I think this would work:
ar_sample_array = np.r_[1, -np.ravel(ar_sample.iloc[0])]
ma_sample_array = np.r_[1, np.ravel(ma_sample.iloc[0])]
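For completeness, here is a minimal, self-contained sketch using the values printed in the question; the two inner arrays are assumed stand-ins for whatever is stored in ar_sample.iloc[0] and ma_sample.iloc[0]:
import numpy as np
from statsmodels.tsa.arima_process import arma_generate_sample

ar_inner = np.array([1.01, -0.01])   # assumed: the values inside ar_sample.iloc[0]
ma_inner = np.array([-0.76, 0.03])   # assumed: the values inside ma_sample.iloc[0]

ar_sample_array = np.r_[1, -ar_inner]   # flat 1-D array: [ 1.   -1.01  0.01]
ma_sample_array = np.r_[1, ma_inner]    # flat 1-D array: [ 1.   -0.76  0.03]

y = arma_generate_sample(ar_sample_array, ma_sample_array, nsample=100, scale=0.1)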
I'm performing a linear regression on a dataset (an Excel file) which consists of a Date column, a scores column, and an additional column called 'Prediction' with NaN values, which will be used to store the predicted values.
I have found that my independent variable, X, contains timestamps, which I wasn't actually expecting. Perhaps I'm doing something wrong, or missing something?
Top of the original dataset:
Date Score
0 2019-05-01 4.607744
1 2019-05-02 4.709202
2 2019-05-03 4.132390
3 2019-05-05 4.747308
4 2019-05-07 4.745926
# Create the independent data set (X)
# Convert the dataframe to a numpy array
X = np.array(df.drop(['Prediction'],1))
# Remove the last '30' rows
X = X[:-forecast_out]
print(X)
Example of output:
[[Timestamp('2019-05-01 00:00:00') 4.607744342064972]
[Timestamp('2019-05-02 00:00:00') 4.709201914086133]
[Timestamp('2019-05-03 00:00:00') 4.132389742485806]
[Timestamp('2019-05-05 00:00:00') 4.74730802483691]
[Timestamp('2019-05-07 00:00:00') 4.7459264970444615]
[Timestamp('2019-05-08 00:00:00') 4.595303054619376]
# Create the dependent data set (y)
# Convert the dataframe to a numpy array
y = np.array(df['Prediction'])
# Get all of the y values except the last '30' rows
y = y[:-forecast_out]
print(y)
Some of the output:
[4.63738251 4.34354486 5.12284464 4.2751933 4.53362196 4.32665058
4.77433793 4.37496465 4.31239161 4.90445026 4.81738271 3.99114536
5.21672369 4.4932632 4.46858993 3.93271862 4.55618508 4.11493084
4.02430584 4.11672606 4.19725244 4.3088558 4.98277563 4.97960989
# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create and train the Linear Regression Model
lr = LinearRegression()
# Train the model
lr.fit(x_train, y_train)
The error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
Clearly the dataset X doesn't like having the timestamp, and like I say, I wasn't really expecting it.
Any help on removing it (or perhaps I need it!?) would be great. As you can see, I'm simply looking to perform a simple regression analysis.
Do not include the Timestamps (Date) in your creation of 'X'.
The data set is already ordered, so do you really need the timestamps? Another option is to move them into the index. In either case, I don't think you should pass Timestamps as feature data.
Implement changes at this step:
X = np.array(df.drop(['Prediction'],1))
Do something like:
X = np.array(df.drop(['Date', 'Prediction'], axis=1))
I think the problem could also be solved by using the date timestamp as the index instead. You can try set_index('Date') so that the dates are no longer a data column.
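A rough sketch of that index-based approach, reusing the names from the question (df and forecast_out are assumed to be defined as in the original code):
df = df.set_index('Date')                       # dates become the index, not a feature
X = np.array(df.drop(['Prediction'], axis=1))   # only numeric columns remain
X = X[:-forecast_out]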
My data looks like this. The values are floats, stored in a big numpy array of shape [700000, 3]. There are no empty fields.
Label | Values1 | Values2
1. | 0.01 | 0.01
1. | ... | ...
1. |
2. |
2. |
3. |
...
The idea is to feed in the set of values1 and values2 and have it identify the label using classification.
But I don't want to feed the data row by row; I want to input all Values1/Values2 that belong to one label as a set (e.g. inputting the first 3 rows should return [1,0,...], and inputting the next 2 rows as a set should return [0,1,...]).
Is there a non-complex way of feeding the data like this (i.e. feeding a batch where the label column equals 1)?
I am currently sorting the data and thinking about keeping a pointer to the start of each set, with a loop that checks whether the next row's label equals the current one to find the end of the set and the number of rows in that batch. But this more or less prevents randomizing the input order.
Since you have your data in a numpy array (let's call it data), you can use
single_digit = data[(data[:,0] == 1.)][: , 1:]
which compares the zeroth element of each row with the digit (1. in this case) and selects only the rows having the label 1. From those rows, it keeps only the Values1 and Values2 columns. A working example is below. You can use a for loop to iterate over all labels contained in the data set and construct a numpy array for each label with
single_digit = data[(data[:,0] == label_of_this_iteration)][: , 1:]
and then feed these arrays to the network. Within TensorFlow you can easily feed batches of different length, if you do not specify the first dimension of the corresponding placeholders.
import numpy as np
# Generate some data with three columns (label, Values1, Values2)
n = 20
ints = np.random.randint(1,6,(n, 1))
dous = np.random.uniform(size=(n,2))
data = np.hstack((ints, dous))
print(data)
# Extract the second and third columns of all rows having the label 1.0
ones = data[(data[:,0] == 1.)][: , 1:]
print(ones)
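Building on that, here is a rough sketch of the per-label feeding loop described above; the placeholder and op names x, y_, sess and train_op are assumptions, not part of the original post:
labels = np.unique(data[:, 0])
for label in labels:
    batch = data[data[:, 0] == label][:, 1:]        # all Values1/Values2 rows for this label
    target = (labels == label).astype(np.float32)   # one-hot target, e.g. [1, 0, ...] for label 1.
    # sess.run(train_op, feed_dict={x: batch, y_: target[np.newaxis, :]})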
Ideally, use the TFRecords format.
This approach makes it easier to mix and match data sets and network architectures.
Here is a link with details on what this JSON-like structure looks like: example.proto
I have a csv as follows:
[screenshot of the CSV data in Excel]
I then converted it to a data frame:
f4 = open(r'C:\Users\cost9\OneDrive\Documents\PYTHON\TEST-ASSURANCE FILES\ADBE1.CSV')
ADBE = pd.read_csv(f4)
I want to slice off all but the last 30 rows for the new Data Frame 'ADBE_Last_30_Periods':
ADBE_Last_30_Periods = ADBE[-30:]
As shown in the CSV file above, the 'Date/Time' Column (column B in the spreadsheet) needs to be converted to a date:
ADBE_Last_30_Periods.rename(columns={'Date/Time': 'Date'}, inplace=True)
ADBE_Last_30_Periods['Date'] = ADBE_Last_30_Periods['Date'].reset_index()
ADBE_Last_30_Periods.Date.values.astype('M8[D]')
Finally, the purpose is to do a linear regression on the last 30 periods (which I sliced out above):
x = ADBE_Last_30_Periods['Date']
y = ADBE_Last_30_Periods['Close']
x = sm.add_constant(x)
ols3 = pd.ols(y = ADBE_Last_30_Periods['Close'], x = ADBE_Last_30_Periods['Date'])
I then run the script and get the following error:
ValueError: Could not convert object to NumPy datetime
Note that the value error is referring to 'ADBE_Last_30_Periods.Date.values.astype('M8[D]')' shown above.
Please note that when I just run the original file ('ADBE') I don't get this error; the script runs fine and the output looks good. For some reason, slicing out only the last 30 periods causes the date conversion to break. Can anyone assist?
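One hedged guess, not from the original thread: parse the column with pd.to_datetime on the full frame before slicing and renaming, so the later astype('M8[D]') conversion is no longer needed on the sliced copy. A sketch along those lines:
ADBE['Date/Time'] = pd.to_datetime(ADBE['Date/Time'])   # parse dates once, up front
ADBE_Last_30_Periods = ADBE[-30:].rename(columns={'Date/Time': 'Date'})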