Let's consider the house prices dataset from this example.
I have the entire dataset stored in the housing variable:
housing.shape
(20640, 10)
I have also applied a OneHotEncoder encoding to one of the dimensions and obtained housing_cat_1hot, so
housing_cat_1hot.toarray().shape
(20640, 5)
My goal is to join the two variables and store everything in a single dataset.
I have tried the Join with index tutorial, but the problem is that the second matrix doesn't have an index.
How can I do a JOIN between housing and housing_cat_1hot?
>>> left=housing
>>> right=housing_cat_1hot.toarray()
>>> result = left.join(right)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    result = left.join(right)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5293, in join
    rsuffix=rsuffix, sort=sort)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5323, in _join_compat
    can_concat = all(df.index.is_unique for df in frames)
  File "/usr/local/Cellar/python3/3.6.3/Frameworks/Python.framework/Versions/3.6/lib/python3.6/pandas/core/frame.py", line 5323, in <genexpr>
    can_concat = all(df.index.is_unique for df in frames)
AttributeError: 'numpy.ndarray' object has no attribute 'index'
Well, it depends on how you created the one-hot vector.
But if it's sorted the same as your original DataFrame, and is itself a DataFrame, you can add the same index before joining:
housing_cat_1hot.index = range(len(housing_cat_1hot))
And if it's not a DataFrame, convert it to one.
This is simple, as long as both objects are sorted the same way.
Edit: if it's not a DataFrame, then
housing_cat_1hot = pd.DataFrame(housing_cat_1hot)
already creates the proper index for you.
If you wish to join the two arrays (assuming both housing_cat_1hot and housing are arrays), you can use
housing = np.hstack((housing, housing_cat_1hot))
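For illustration, a toy sketch with dummy arrays of the shapes from the question (the contents are made up):
import numpy as np

a = np.zeros((20640, 10))   # stands in for housing as a plain array
b = np.zeros((20640, 5))    # stands in for housing_cat_1hot.toarray()

joined = np.hstack((a, b))
print(joined.shape)         # (20640, 15)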
Though the best way to one-hot encode a variable is to select that variable within the array and encode it in place; that saves you the trouble of joining the two later.
Say the index of the variable you wish to encode in your array is 1,
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Note: categorical_features only exists in older scikit-learn releases;
# it has been removed in newer versions (see the question further down).
le = LabelEncoder()
X[:, 1] = le.fit_transform(X[:, 1])   # encode column 1 as integers first
onehotencoder = OneHotEncoder(categorical_features=[1])
X = onehotencoder.fit_transform(X).toarray()
Thanks to @Elez-Shenhar's answer I got the following working code:
OneHot=housing_cat_1hot.toarray()
OneHot= pd.DataFrame(OneHot)
result = housing.join(OneHot)
result.shape
(20640, 15)
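As a hedged aside (not part of the answers above), pd.concat along the columns works as well, as long as both frames end up sharing the same default RangeIndex:
import pandas as pd

OneHot = pd.DataFrame(housing_cat_1hot.toarray())
# reset_index(drop=True) gives housing a plain RangeIndex that matches OneHot's
result = pd.concat([housing.reset_index(drop=True), OneHot], axis=1)
print(result.shape)  # expected (20640, 15), as with join above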
Related
I have a big dataset saved in parquet format with 40 files in total. The index is a datetime index.
In my script, I read in the dataframe and want to add a new column to this dataset. I don't have a dataset available for you to test with, but the (pseudo, untested) code looks something like this:
import dask.dataframe as dd
import dask.delayed
from dask.dataframe import read_parquet
import random
from datetime import datetime  # needed for the date parser below
from numpy import array

CUSTOM_DATE_PARSER = lambda x: datetime.strptime(x, "%Y-%m-%d %H:%M:%S")
IMPORTED_DF = read_parquet("./data/regression_df.parquet", parse_dates=['time'], date_parser=CUSTOM_DATE_PARSER, index_col='time')
new_column = array([random.randint(1,30) for _ in range(len(IMPORTED_DF))])
Now, I want to add that column to the dataset. I try using assign but am running into this error:
IMPORTED_DF.assign(new_col=dd.from_array(new_column))
ValueError: Not all divisions are known, can't align partitions. Please use `set_index` to set the index.
I saw in a GitHub post that a hacky workaround is to just reset and re-set the index column, so the new code and error look like:
df = df.reset_index().set_index('time') # 'time' is the name of my index column
df.assign(label=dd.from_array(new_column))
File "C:\Users\chalu\AppData\Local\Programs\Python\Python310\lib\site-packages\dask\dataframe\core.py", line 4935, in assign
data = elemwise(methods.assign, data, *pairs, meta=df2)
File "C:\Users\chalu\AppData\Local\Programs\Python\Python310\lib\site-packages\dask\dataframe\core.py", line 6020, in elemwise
args = _maybe_align_partitions(args)
File "C:\Users\chalu\AppData\Local\Programs\Python\Python310\lib\site-packages\dask\dataframe\multi.py", line 173, in _maybe_align_partitions
dfs2 = iter(align_partitions(*dfs)[0])
File "C:\Users\chalu\AppData\Local\Programs\Python\Python310\lib\site-packages\dask\dataframe\multi.py", line 133, in align_partitions
divisions = list(unique(merge_sorted(*[df.divisions for df in dfs1])))
File "C:\Users\chalu\AppData\Local\Programs\Python\Python310\lib\site-packages\toolz\itertoolz.py", line 264, in unique
for item in seq:
File "C:\Users\chalu\AppData\Local\Programs\Python\Python310\lib\site-packages\toolz\itertoolz.py", line 156, in _merge_sorted_binary
if val2 < val1:
TypeError: '<' not supported between instances of 'int' and 'Timestamp'
Another option I thought of was to assign the array to a Series instead and use dd.from_pandas() in the assign, but that asks for a chunksize or npartitions, and I don't know either of them. When I saved the dataset initially I used a row_chunk_size of 10000, but I drop some NaNs before exporting, so I don't think that's the true chunk size, which leaves npartitions. How could I find that number out?
Or better still, how can I just add this darn column to the dataset?
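A minimal sketch of the npartitions part, assuming IMPORTED_DF and new_column from the code above (whether the subsequent assign aligns cleanly still depends on the divisions being known):
import pandas as pd
import dask.dataframe as dd

# The partition count is available directly on the Dask DataFrame:
n_parts = IMPORTED_DF.npartitions

# Hypothetical: wrap the NumPy array in a pandas Series and convert it
# with the same number of partitions as the existing frame.
new_series = dd.from_pandas(pd.Series(new_column, name="new_col"),
                            npartitions=n_parts)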
I tried to use categories instead of categorical_features, but it did not help.
Please help with the error:
Traceback (most recent call last): File "test.py", line 28, in <module> onehotencoder = OneHotEncoder(categorical_features=[0]) TypeError: __init__() got an unexpected keyword argument 'categorical_features'
from sklearn.preprocessing import LabelEncoder,OneHotEncoder
labelencoder_X=LabelEncoder()
X[:,0] = labelencoder_X.fit_transform(X[:,0]) #Encoding the values of column Country
onehotencoder=OneHotEncoder(categorical_features=[0])
X=onehotencoder.fit_transform(X).toarray()
print(X)
According to the documentation, there is no 'categorical_features' attribute any more:
categories : 'auto' or a list of array-like, default='auto'
Categories (unique values) per feature:
'auto' : Determine categories automatically from the training data.
list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values within a single feature, and should be sorted in case of numeric values.
The used categories can be found in the categories_ attribute.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
...
X = ...
# Encoding the values of column Country
onehotencoder = OneHotEncoder(sparse=False)
X = np.concatenate(
    (onehotencoder.fit_transform(X[:, 0:1]), X[:, 1:]),  # arrays must be passed as one tuple
    axis=1
)
print(X)
# Do this to show what categories are collected and encoded.
print(onehotencoder.categories_)
Older versions of scikit-learn did all the projection, encoding and re-merging for you, sinking the non-categorical columns to the right. Newer versions no longer support this, so instead we extract the Country column manually, pass it through the encoder, and concatenate the results together.
OneHotEncoder returns sparse matrices by default; specifying sparse=False saves us from having to call .toarray() on the result.
Note that LabelEncoder is redundant since OneHotEncoder can automatically fit to string values anyway (at least in recent versions).
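For reference, a minimal sketch of how the old project-encode-re-merge behaviour can be approximated with ColumnTransformer; the Country column index 0 is taken from the question, and the sample values are made up:
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: Country strings in column 0, numeric columns after it
X = np.array([["France", 44.0, 72000.0],
              ["Spain", 27.0, 48000.0],
              ["Germany", 30.0, 54000.0]], dtype=object)

ct = ColumnTransformer(
    [("country", OneHotEncoder(sparse=False), [0])],  # one-hot encode column 0
    remainder="passthrough",                          # keep the other columns as-is
)
X = ct.fit_transform(X)   # encoded columns first, remaining columns to the right
print(X)
Note that the sparse parameter used above is called sparse_output in the newest scikit-learn releases.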
I want to evaluate categorical data with a decision tree in Python. I want to use the categorical data and apply binning to create categorical labels. Do I have to?
The problem is that get_dummies returns a dataframe with a different length than the values that were given. It is two rows shorter than the original data.
Previously I tried to use LabelEncoder, but didn't get it working. I tried get_dummies from pandas, which seemed easier to me.
I checked the reference for the get_dummies function and searched for the problem but could not find why the length is shorter.
Doing the binning:
# 'bine' is presumably sklearn.preprocessing.KBinsDiscretizer imported under that name
est = bine(n_bins=50, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(np.array(quant_labels).reshape(-1, 1))
Extract the categorical data (do I have to?):
category = rd.select_dtypes(exclude=['number']).astype("category")
category = category.replace(math.nan, "None")
category = category.replace(0, "None")
Prepare the split:
one_hot_features = pd.get_dummies(category[1:-1])
X_train, X_test, y_train, y_test = train_test_split(one_hot_features, cat_labels, test_size = 0.6, random_state = None)
The error is:
ValueError: Found input variables with inconsistent number of samples: [1458, 1460]
The correct number of samples is 1460. The one-hot encoded data is two samples short. Why is that?
When you are encoding your data, you use category[1:-1]. This will encode all the elements from the second one through the second-to-last one.
Explanation:
1) Indexes are zero-based, so 1 is the index of the second item.
2) A slice end of -1 stops before the last element, so the last item included is the second-to-last one.
Solution:
Change your line to one_hot_features = pd.get_dummies(category[:]) (or simply pd.get_dummies(category)).
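A tiny self-contained sketch (with made-up data) showing why the slice loses two rows:
import pandas as pd

category = pd.DataFrame({"col": list("abcde")})   # 5 rows of toy data
print(len(pd.get_dummies(category[1:-1])))        # 3 -> first and last rows are dropped
print(len(pd.get_dummies(category)))              # 5 -> the full frame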
I have a CSV file containing data: (just the first ten rows of data are listed)
0,11,31,65,67
1,31,33,67
2,33,43,67
3,31,33,67
4,24,31,33,65,67,68,71,75,76,93,97
5,31,33,67
6,65,93
7,2,33,34,51,66,67,84
8,44,55,66
9,2,33,51,54,67,84
10,33,51,66,67,84
The first column indicates the row number (e.g. the first value in the first row is 0). When I try to use
import pandas as pd
df0 = pd.read_csv('df0.txt', header=None, sep=',')
Error occurs as below:
pandas.errors.ParserError: Error tokenizing data. C error: Expected 5 fields in line 5, saw 12
I guess pandas infers the number of columns when it reads the first row (5 columns). How can I declare the number of columns myself? It is known that there are 120 class labels in total, so I guess 121 columns should be enough.
Further, how can I transform it into one-hot encoded format? I want to use a neural network model to process the data.
For your first problem, you can pass a names=... parameter to read_csv:
df = pd.read_csv('df0.txt', header=None, names=range(121), sep=',')
As for your second problem, there's an existing solution here that uses sklearn.OneHotEncoder. If you are looking to convert each column to a one hot encoding, you may use it.
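As an alternative sketch (not the linked solution, and using MultiLabelBinarizer rather than OneHotEncoder), one way to turn each line's label list into a 0/1 matrix, assuming the first value on every line is the row number and the file is df0.txt:
from sklearn.preprocessing import MultiLabelBinarizer

rows = []
with open('df0.txt') as f:
    for line in f:
        if not line.strip():
            continue
        values = [int(v) for v in line.strip().split(',')]
        rows.append(values[1:])        # drop the leading row number

# Columns default to the labels actually seen; pass classes=list(range(120))
# (or whatever the real label range is) to fix the matrix width at 120.
mlb = MultiLabelBinarizer()
one_hot = mlb.fit_transform(rows)      # one row per line, one 0/1 column per label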
I gave this my best shot, but I don't think it's too good. I do think it gets at what you're asking; based on my own ML knowledge and your question, I took you to be asking the following:
1.) You have a csv of numbers
2.) This is for a problem with 120 classes
3.) You want a matrix with 1s and 0s for each class
4.) For example, a csv such as:
1, 3
2, 3, 6
would become the following feature matrix, with one column per class:
Columns: 1, 2, 3, 6
1, 0, 1, 0
0, 1, 1, 1
Thus this code achieves that, but it is surely not optimized:
import pandas as pd

df = pd.read_csv(file, header=None, names=range(121), sep=',')  # 'file' is the csv path from the question

def func(df1, df2):
    # We can't join if columns overlap. Use set operations to identify them.
    non_overlapping_columns = list(set(df2.columns) - set(df1.columns))
    overlapping_columns = list(set(df2.columns) - set(non_overlapping_columns))
    # Join where possible
    df2_join = df2[non_overlapping_columns]
    df3 = df1.join(df2_join)
    # Manually add columns for the overlaps
    for k in overlapping_columns:
        df3[k] = df3[k] + df2[k]
    return df3

# One dummy frame per column, then fold them together with func
one_hot = []
for k in df.columns:
    one_hot.append(pd.get_dummies(df[k]))

for n, l in enumerate(one_hot):
    if n == 0:
        df = one_hot[n]
    else:
        df = func(df1=df, df2=one_hot[n])
From here you could feed it into sklearn's OneHotEncoder, as @cᴏʟᴅsᴘᴇᴇᴅ noted.
That would look like this:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder().fit_transform(df)  # returns a sparse matrix

import sys
sys.getsizeof(onehot)  # smaller than the dense pandas DataFrame
sys.getsizeof(df)
I guess I'm unsure whether the assumptions I noted above are what you want done with your data; it seems perhaps they aren't.
I thought that a given line in your csv was indicating which classes exist. I'm still a little unclear on it.
I am learning neural networks and I am trying to automate some of the processes.
Right now, I have code to randomly split the dataset, a 284807x31 array. Then I need to separate inputs and outputs, meaning I need to select the entire array except the last column and, separately, only the last column. For some reason I can't figure out how to do this properly, and I am stuck at splitting and separating the set as explained above. Here's my code so far (the part that refers to this specific problem):
train, test, cv = np.vsplit(data[np.random.permutation(data.shape[0])], (6,8))
# Should select entire array except the last column
train_inputs = np.resize(train, len(train[:,1]), -1)
test_inputs = np.resize(test, len(test[:,1]), -1)
cv_inputs = np.resize(cv, len(cv[:,1]), -1)
# Should select **only** the last column.
train_outs = train[:, 30]
test_outs = test[:, 30]
cv_outs = test[:, 30]
The idea is that I'd like the code to find the number of columns of the corresponding dataset and do the intended resizes. The second part should select only the last column, and I am not sure if that works because the script stops before that. The error is, by the way:
Traceback (most recent call last):
  File "src/model.py", line 43, in <module>
    train_inputs = np.resize(train, len(train[:,1]), -1)
TypeError: resize() takes exactly 2 arguments (3 given)
PS: Now that I am looking at the documentation, I can see I am very far from the solution but I really can't figure it out. It's the first time I am using NumPy.
Thanks in advance.
Some slicing should help:
Should select the entire array except the last column
train_inputs = train[:,:-1]
test_inputs = test[:,:-1]
cv_inputs = cv[:,:-1]
and:
Should select only the last column.
train_outs = train[:,-1]
test_outs = test[:, -1]
cv_outs = test[:, -1]
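As a quick sanity check, a self-contained sketch with random data of the shape from the question (284807 x 31); the 60/20/20 split points are just an illustrative assumption:
import numpy as np

data = np.random.rand(284807, 31)          # dummy data with the question's shape

shuffled = data[np.random.permutation(data.shape[0])]
n = shuffled.shape[0]
train, test, cv = np.vsplit(shuffled, [int(0.6 * n), int(0.8 * n)])  # 60/20/20 split

train_inputs, train_outs = train[:, :-1], train[:, -1]
print(train_inputs.shape, train_outs.shape)  # (170884, 30) (170884,)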