I have an input training dataset of dimension (1500, 5) and an output training dataset of dimension (1499, 1). Please suggest what I can do to eliminate the error below.
Error: Y = dataframe1[:,0]
IndexError: too many indices for array
Code:
dataframe = np.genfromtxt('DataInput.csv', delimiter=",")
#pandas.read_csv("DataInput.csv", delim_whitespace=True, header=None)
#dataset = dataframe.values
dataframe1 = np.genfromtxt("OptimizedSpeed.csv", delimiter=",")
#dataset1 = dataframe1.values
# split into input (X) and output (Y) variables
X = dataframe[:,0:4]
Y = dataframe1[:,0]   # <-- this line shows the error
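A likely explanation (an assumption, not stated in the post): np.genfromtxt returns a 1-D array when a CSV has only one column, so two-dimensional indexing such as dataframe1[:,0] raises IndexError. A minimal sketch of how one might check for and work around this:

import numpy as np

dataframe1 = np.genfromtxt("OptimizedSpeed.csv", delimiter=",")
print(dataframe1.ndim, dataframe1.shape)   # a single-column CSV yields ndim == 1

if dataframe1.ndim == 1:
    Y = dataframe1          # the array already is the single output column
else:
    Y = dataframe1[:, 0]    # first column of a genuinely 2-D array

The 1500-row input versus 1499-row output mismatch would still need to be resolved separately, e.g. a stray header or trailing blank line in one of the files.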
I am new to Python and apologize in advance if this is too simple. I cannot find anything, and this question did not help.
My code is
# Split data
y = starbucks_smote.iloc[:, -1]
X = starbucks_smote.drop('label', axis = 1)
# Count labels by type
counter = Counter(y)
print(counter)
Counter({0: 9634, 1: 2895})
# Transform the dataset
oversample = SMOTE()
X, y = oversample.fit_resample(X, y)
# Print the oversampled dataset
counter = Counter(y)
print(counter)
Counter({0: 9634, 1: 9634})
How to save the oversampled dataset for future work?
I tried
data_res = np.concatenate((X, y), axis = 1)
data_res.to_csv('sample_smote.csv')
Got an error
ValueError: all the input arrays must have same number of dimensions,
but the array at index 0 has 2 dimension(s) and the array at index 1 has 1 dimension(s)
Appreciate any tips!
You may create a DataFrame:
data_res = pd.DataFrame(X)
data_res['y'] = y
and then save data_res to CSV.
A solution based on concatenating NumPy arrays is also possible, but np.vstack is needed to make the dimensions compatible:
data_res = np.concatenate((X, np.vstack(y)), axis = 1)
data_res = pd.DataFrame(data_res)
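Either way, a complete minimal sketch of the DataFrame route (assuming X and y come straight from fit_resample as above; the filename is the one from the question):

import pandas as pd

# wrap the resampled features and target in a single DataFrame
data_res = pd.DataFrame(X)
data_res['y'] = y

# persist the oversampled data for future work
data_res.to_csv('sample_smote.csv', index=False)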
I get a ValueError: Found input variables with inconsistent numbers of samples: [20000, 1] when I run the following, even though the row counts of x and y are correct. I load the RCV1 dataset, get the indices of the categories with the top x documents, create a list of tuples with an equal number of randomly selected positives and negatives for each category, and finally attempt to run a logistic regression on one of the categories.
import numpy as np
import sklearn.datasets
from sklearn import model_selection, preprocessing
from sklearn.linear_model import LogisticRegression
from matplotlib import pyplot as plt
from scipy import sparse

rcv1 = sklearn.datasets.fetch_rcv1()
def get_top_cat_indices(target_matrix, num_cats):
    cat_counts = target_matrix.sum(axis=0)
    #cat_counts = cat_counts.reshape((1,103)).tolist()[0]
    cat_counts = cat_counts.reshape((103,))
    #b = sorted(cat_counts, reverse=True)
    ind_temp = np.argsort(cat_counts)[::-1].tolist()[0]
    ind = [ind_temp[i] for i in range(5)]
    return ind
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        cat_present = x.tocsr()[np.where(temp.sum(axis=1)>0)[0],:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[np.where(temp.sum(axis=1)==0)[0],:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[idx_cat,:]
        sampled_y_neg = temp.tocsr()[idx_nocat,:]
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)
x, y = test_res[0]
print(x.shape)
print(y.shape)
LogisticRegression().fit(x, y)
Could it be an issue with the sparse matrices, or a problem with dimensionality (there are 20K samples and 47K features)?
When I run your code, I get the following error:
AttributeError: 'bool' object has no attribute 'any'
That's because the y passed to LogisticRegression needs to be a NumPy array. So I changed the last line to:
LogisticRegression().fit(x, y.A.flatten())
Then I get the following error:
ValueError: This solver needs samples of at least 2 classes in the data, but the data contains only one class: 0
This is because your sampling code has a bug. You need to subset the y array to the rows that actually contain the category before applying the sampling indices. See the code below:
def prepare_data(x, y, top_cat_indices, sample_size):
    res_lst = []
    for i in top_cat_indices:
        # get column of indices with relevant cat
        temp = y.tocsc()[:, i]
        # all docs with labeled category
        c1 = np.where(temp.sum(axis=1)>0)[0]
        c2 = np.where(temp.sum(axis=1)==0)[0]
        cat_present = x.tocsr()[c1,:]
        # all docs other than labelled category
        cat_notpresent = x.tocsr()[c2,:]
        # get indices equal to 1/2 of sample size
        idx_cat = np.random.randint(cat_present.shape[0], size=int(sample_size/2))
        idx_nocat = np.random.randint(cat_notpresent.shape[0], size=int(sample_size/2))
        # concatenate the ids
        sampled_x_pos = cat_present.tocsr()[idx_cat,:]
        sampled_x_neg = cat_notpresent.tocsr()[idx_nocat,:]
        sampled_x = sparse.vstack((sampled_x_pos, sampled_x_neg))
        sampled_y_pos = temp.tocsr()[c1][idx_cat,:]
        print(sampled_y_pos.nnz)
        sampled_y_neg = temp.tocsr()[c2][idx_nocat,:]
        print(sampled_y_neg.nnz)
        sampled_y = sparse.vstack((sampled_y_pos, sampled_y_neg))
        res_lst.append((sampled_x, sampled_y))
    return res_lst
Now everything works like a charm.
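For completeness, a sketch of how the corrected pieces might be tied together. The train_x/train_y names follow the question; treating them as the full rcv1.data/rcv1.target is an assumption made here for illustration:

train_x, train_y = rcv1.data, rcv1.target   # assumption: no prior train/test split

ind = get_top_cat_indices(rcv1.target, 5)
test_res = prepare_data(train_x, train_y, ind, 20000)

x, y = test_res[0]
# LogisticRegression expects a dense 1-D label array, hence .A.flatten()
LogisticRegression().fit(x, y.A.flatten())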
For the dataset that I am working with, the categorical variables are ordinal, ranging from 1 to 5 for three columns. I am going to be feeding this into XGBoost.
Would I be okay to just run this command and skip creating dummy variables:
ser = pd.Series([1, 2, 3], dtype='category')
ser = ser.to_frame()
ser = ser.T
Conceptually, since the categorical data is ordinal, would simply converting it to type category be adequate for the model? I tried creating dummy variables, but all the values became 1.
As for the code above, it runs, but the following command returns 'numpy.int64':
type(ser[0][0])
Am I going about this correctly? Any help would be great!
Edit: updated code
Edit2: Normalizing the numerical data values. Is this logic correct?
r = [1, 2, 3, 100 ,200]
scaler = preprocessing.StandardScaler()
r = preprocessing.scale(r)
r = pd.Series(r)
r = r.to_frame()
r = r.T
Edit3: This is the dataset.
Just setting categorical variables as dtype="category" is not sufficient and won't work.
You need to convert categorical values to true categorical values with pd.factorize(), where each category is assigned a numerical label.
Let's say df is your pandas dataframe. Then in general you could use this boilerplate code:
df_numeric = df.select_dtypes(exclude=['object'])
df_obj = df.select_dtypes(include=['object']).copy()
# factorize categoricals columnwise
for c in df_obj:
    df_obj[c] = pd.factorize(df_obj[c])[0]
# if you want to one hot encode then add this line:
df_obj = pd.get_dummies(df_obj, prefix_sep='_', drop_first = True)
# merge dataframes back to one dataframe
df_final = pd.concat([df_numeric, df_obj], axis=1)
Since your categorical variables are already factorized (as far as I understand), you can skip the factorization and just try one-hot encoding, as sketched below.
See also this post on stats.stackexchange.
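A minimal sketch of that shortcut, using placeholder column names (ord_a, ord_b, ord_c are not from the post); get_dummies only picks up object or category columns, hence the astype call:

import pandas as pd

# hypothetical ordinal columns, already coded 1-5 as described in the question
df = pd.DataFrame({'ord_a': [1, 2, 5], 'ord_b': [3, 3, 1], 'ord_c': [2, 4, 4]})

# one-hot encode directly, skipping pd.factorize since the codes already exist
df_ohe = pd.get_dummies(df.astype('category'), prefix_sep='_', drop_first=True)
print(df_ohe.head())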
If you want to standardize/normalize your numerical data (not the categorical) use this function:
from sklearn import preprocessing
def scale_data(data, scale="robust"):
    x = data.values
    if scale == "minmax":
        scaler = preprocessing.MinMaxScaler()
        x_scaled = scaler.fit_transform(x)
    elif scale == "standard":
        scaler = preprocessing.StandardScaler()
        x_scaled = scaler.fit_transform(x)
    elif scale == "quantile":
        scaler = preprocessing.QuantileTransformer()
        x_scaled = scaler.fit_transform(x)
    elif scale == "robust":
        scaler = preprocessing.RobustScaler()
        x_scaled = scaler.fit_transform(x)
    data = pd.DataFrame(x_scaled, columns = data.columns)
    return data
scaled_df = scale_data(df_numeric, "robust")
Putting it all together for your dataset:
from sklearn import preprocessing
df = pd.read_excel("default of credit card clients.xls", skiprows=1)
y = df['default payment next month'] #target variable
del df['default payment next month']
c = [2,3,4] # index of categorical data columns
r = list(range(0,24))
r = [x for x in r if x not in c] # get list of all other columns
df_cat = df.iloc[:, [2,3,4]].copy()
df_con = df.iloc[:, r].copy()
# factorize categorical data
for c in df_cat:
    df_cat[c] = pd.factorize(df_cat[c])[0]
# scale continuous data
scaler = preprocessing.MinMaxScaler()
df_scaled = scaler.fit_transform(df_con)
df_scaled = pd.DataFrame(df_scaled, columns=df_con.columns)
df_final = pd.concat([df_cat, df_scaled], axis=1)
#reorder columns back to original order
cols = df.columns
df_final = df_final[cols]
To further improve the code, do the train/test split before normalization: call fit_transform() on the training data and only transform() on the test data. Otherwise you will have a data leak.
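A sketch of that order of operations, reusing df_con and y from above (the split ratio and random_state are arbitrary choices):

from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df_con, y, test_size=0.2, random_state=42)

scaler = preprocessing.MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # fit the scaler on the training data only
X_test_scaled = scaler.transform(X_test)         # reuse that fit; nothing leaks from the test set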
I am trying to replicate Chevalier's LSTM Human Activity Recognition algorithm and ran into a problem when I realized that my methods did not match those of the original. As a follow-up to this question, I was able to produce a result for load_X with this method:
In[0]:
def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        with open(signal_type_path, 'r') as csvfile:
            reader = csv.reader(csvfile)
            next(reader)
            for serie in [row[1:2] for row in reader]:
                #X_signals.append([np.array([row[1:2] for row in reader],dtype=np.float32) for row in reader])
                X_signals.append(np.array(serie, dtype=np.int32))
        file.close()
    return (np.transpose(np.transpose(X_signals), (1, 0)))
X_train_signals_paths = [
DATASET_PATH + TRAIN + signal + "_train.csv" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
DATASET_PATH + TEST + signal + "_test.csv" for signal in INPUT_SIGNAL_TYPES
]
X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
print(X_train)
Out[0]:
[[ 6]
[ 6]
...,
[13]
[13]
[13]]
However, I looked over Chevalier's methods a little more and observed something interesting when I ran len(X_train[0]) and len(X_train[0][0]). It seems the way I formatted my x-values is very different from how Chevalier's x-values are formatted. My original CSV file can be found here and the original txt file for Chevalier's X_train can be found here. The following is Chevalier's code, for comparison with mine:
def load_X(X_signals_paths):
    X_signals = []
    for signal_type_path in X_signals_paths:
        file = open(signal_type_path, 'r')
        # Read dataset from disk, dealing with text files' syntax
        X_signals.append(
            [np.array(serie, dtype=np.float32) for serie in [
                row.replace('  ', ' ').strip().split(' ') for row in file
            ]]
        )
        file.close()
    return np.transpose(np.array(X_signals), (1, 2, 0))
X_train_signals_paths = [
DATASET_PATH + TRAIN + "Inertial Signals/" + signal + "train.txt" for signal in INPUT_SIGNAL_TYPES
]
X_test_signals_paths = [
DATASET_PATH + TEST + "Inertial Signals/" + signal + "test.txt" for signal in INPUT_SIGNAL_TYPES
]
X_train = load_X(X_train_signals_paths)
X_test = load_X(X_test_signals_paths)
The following is from Chevalier's "Additional Parameters" section and is the main reason for my confusion:
training_data_count = len(X_train) # 7352 training series (with 50% overlap between each serie)
test_data_count = len(X_test) # 2947 testing series
n_steps = len(X_train[0]) # 128 timesteps per series
n_input = len(X_train[0][0]) # 9 input parameters per timestep
What I observe is that this 50% overlap means the separately evaluated time intervals overlap, like 0-64, 32-96, 64-128, and so on. One fact I do know is that 7352 is the number of rows in X_train.txt. The [0] and [0][0] mean that it is selecting the 0th column of the X_train array, and the 0th column and 0th row of X_train, respectively. What my code is currently doing is transposing each of my data points individually. That is why, when I evaluated len(X_train[0]), I received 1, and with len(X_train[0][0]) I received an error:
TypeError Traceback (most recent call last)
<ipython-input-255-14523e544e49> in <module>()
2 test_data_count = len(list(X_test))
3 n_steps = len(X_train[0])
----> 4 n_input = len(list(X_train)[0][0])
5 print(training_data_count, test_data_count, n_steps, n_input)
TypeError: object of type 'numpy.int32' has no len()
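For comparison, a toy sketch of the 3-D shape Chevalier's loader produces (random numbers standing in for the real signals; 9 signals, 7352 windows of 128 timesteps, mirroring the parameters quoted above):

import numpy as np

# stand-in for the nine Inertial Signals files, each 7352 rows x 128 values
X_signals = [np.random.rand(7352, 128).astype(np.float32) for _ in range(9)]

X_train = np.transpose(np.array(X_signals), (1, 2, 0))
print(X_train.shape)        # (7352, 128, 9)
print(len(X_train[0]))      # 128 timesteps per series
print(len(X_train[0][0]))   # 9 input parameters per timestep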
I am wondering what I should do to reformat my data to match Chevalier's intended formatting in the txt file. What do the numbers in the "Additional Parameters" section of Chevalier's git mean, and how can I tailor them to my current model?
I want to shift my time series data, but I am getting the following error:
AttributeError: 'numpy.ndarray' object has no attribute 'values'
This is my code:
def create_dataset(datasets):
    #series = dataset
    temps = DataFrame(datasets.values)
    dataframes = concat(
        [temps, temps.shift(-1), temps.shift(-2), temps.shift(-3)], axis=1)
    lala = numpy.array(dataframes)
    return lala
# Load
dataframe = pandas.read_csv('zahlenreihe.csv', index_col=False,
engine='python', header=None)
dataset = dataframe.values
dataset = dataset.astype('float32')
# Split
train_size = int(len(dataset) * 0.70)
test_size = len(dataset) - train_size
train, test = dataset[0:train_size,:], dataset[train_size:len(dataset),:]
# Create
trainX = create_dataset(train)
I think the following line is wrong:
temps = DataFrame(datasets.values)
My zahlenreihe.csv file (number sequence) just has integers ordered like:
1
2
3
4
5
n
How should I handle it?
The solution:
The given dataset was already a NumPy array, so I didn’t need to call .values on it.
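A minimal sketch of the corrected function under that fix (identical to the code above, just passing the array straight to DataFrame):

import numpy
from pandas import DataFrame, concat

def create_dataset(datasets):
    # datasets is already a numpy array, so wrap it directly
    temps = DataFrame(datasets)
    dataframes = concat(
        [temps, temps.shift(-1), temps.shift(-2), temps.shift(-3)], axis=1)
    return numpy.array(dataframes)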
The problem lies in the following line:
df = StandardScaler().fit_transform(df)
It returns a NumPy array (see the documentation), which does not have a drop function. You would have to convert it into a pd.DataFrame first!
new_df = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)
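A short usage sketch of that conversion (the column names here are placeholders, not from the original code):

import pandas as pd
from sklearn.preprocessing import StandardScaler

# toy frame standing in for the question's df
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

new_df = pd.DataFrame(StandardScaler().fit_transform(df),
                      columns=df.columns, index=df.index)

new_df = new_df.drop('b', axis=1)   # .drop works again because new_df is a DataFrame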