I want to evaluate categorical data in Python with a decision tree. I want to use the categorical data and use binning to create categorical labels. Do I have to?
The problem is that get_dummies returns a dataframe with a different length than the values that were given. It is two rows shorter than the original data.
Previously I tried to use LabelEncoder, but didn't get it working. Then I tried get_dummies from pandas, which seemed easier to me.
I checked the reference for the get_dummies function and searched for the problem, but could not find why the length is shorter.
Doing the binning:
from sklearn.preprocessing import KBinsDiscretizer
est = KBinsDiscretizer(n_bins=50, encode='ordinal', strategy='kmeans')
cat_labels = est.fit_transform(np.array(quant_labels).reshape(-1, 1))
Extract the categorical data (do I have to?):
category = rd.select_dtypes(exclude=['number']).astype("category")
category = category.replace(math.nan, "None")
category = category.replace(0, "None")
Prepare the split:
one_hot_features = pd.get_dummies(category[1:-1])
X_train, X_test, y_train, y_test = train_test_split(one_hot_features, cat_labels, test_size = 0.6, random_state = None)
The Error is:
ValueError: Found input variables with inconsistent number of samples: [1458, 1460]
The correct number of samples is 1460. The one-hot encoded data is two samples short. Why is that?
When you are encoding your data you use category[1:-1]. This encodes everything from the second element up to the second-to-last element, dropping the first and last rows.
Explanation:
1) Indexes are zero based, so 1 is the index of the second item.
2) A slice end of -1 is exclusive, so slicing stops just before the last element.
Solution:
Change your line to one_hot_features = pd.get_dummies(category[:]) (or simply pd.get_dummies(category)) so that every row is encoded.
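A minimal sketch of the difference, using a small hypothetical frame rather than your data:
import pandas as pd

demo = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
print(len(pd.get_dummies(demo[1:-1])))  # 2 -- the slice drops the first and last rows
print(len(pd.get_dummies(demo)))        # 4 -- all rows are kept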
Related
I have tried the .append method. The code runs, but it's not doing anything.
My .csv is too large to open, so I can't physically add the row there. Please, if anyone can fix my problem, answer:
Code:
import pandas as pd
ARP_MitM_dataset = pd.read_csv('/content/drive/MyDrive/ARP MitM_dataset-002.csv');
label = pd.read_csv('/content/drive/MyDrive/ARP MitM_labels.csv');
t = iter(range(1, 401))
ARP_MitM_dataset.columns = ['Column'+str(i).format(next(t)) if 1 <= i <= 499 else x for i, x in enumerate(ARP_MitM_dataset.columns, 1)]
dataArr = ARP_MitM_dataset
labelArr = label
dataArr.append({' ':2504267}, ignore_index = True) <------ Check
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataArr,labelArr, test_size = 0.40, random_state = 42) <--- Error
print(X_train.shape,y_train.shape)
print(X_test.shape,y_test.shape)
Error showing:
ValueError: Found input variables with inconsistent numbers of samples: [2504266, 2504267]
You should NEVER grow a DataFrame row by row. Always append your data to a list and convert it to a DataFrame once at the end, because:
1.) It is always cheaper/faster to append to a list and create the DataFrame in one go.
2.) Lists take up less memory and are a much lighter data structure to work with, append to, and remove from.
3.) dtypes are automatically inferred for your data. On the flip side, creating an empty frame of NaNs will automatically make the columns object dtype, which is bad.
4.) An index is automatically created for you, instead of you having to take care to assign the correct index to each row you append.
You could try something like this:
data = []
for a, b, c in some_function_that_yields_data():
    data.append([a, b, c])

df = pd.DataFrame(data, columns=['A', 'B', 'C'])
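As a side note on the line marked "Check" in the question: DataFrame.append does not modify the frame in place, it returns a new DataFrame (and it was removed in pandas 2.0), so an unassigned call has no effect. A minimal sketch of adding that single placeholder row with concat instead, assuming the same dataArr and key from your snippet:
import pandas as pd

# concat also returns a new frame, so the result must be reassigned
dataArr = pd.concat([dataArr, pd.DataFrame([{' ': 2504267}])], ignore_index=True)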
Suppose I have a dataframe with several numerical variables and 1 categorical variable with 10000 categories. I use a neural network with Keras to get a matrix of embeddings for the categorical variable. The embedding size is 50 so the matrix that Keras returns has dimension 10002 x 50.
One of the extra 2 rows is for unknown categories; what the other one is for, I don't know exactly - it's simply the only way Keras would work, i.e.,
model_i = keras.layers.Embedding(input_dim=num_categories + 2, output_dim=embedding_size,
                                 input_length=1, name=f'embedding_{cat_feature}')(input_i)
without the +2 it would not work.
So, I have a training set with ~12M rows and validation set with ~1M rows.
Now, the way I thought of reconstructing the embeddings was:
1) Having a reversed dictionary with the numerical values (which were encoded beforehand to represent the categories) as keys and the category names as values.
2) Adding 50 NaN columns to the data frame.
3) For i in range(10002) (the number of categories + 2), looking up key i in the reversed dictionary and, if it is there, using pandas .loc to fill those 50 NaN columns for every row whose categorical variable equals the category name that i encodes, with the corresponding row vector from the 10002 x 50 matrix.
The problem with this solution is that it's highly inefficient.
A friend told me about another solution which consists of converting the categorical variable to a one-hot sparse matrix with dimensions 12M x 10000 (for the training set), and then use matrix multiplication with the embeddings matrix which should have dimensions 10000 x 50 thus getting a 12M x 50 matrix which I can then concatenate to my original data frame. The problems here are:
It won't work on the validation set, because the number of categories appearing there may be different than in training, so the dimensions do not match.
Even when used on the training set, I have 10002 (=num_categories + 2) rows in the matrix Keras gives me, instead of 10000. And so again, the dimensions do not match.
Does anyone know a better way of doing this or can address the problems in this second approach?
My ultimate goal is having a data frame with all my variables minus the categorical variable and instead, having another 50 columns with the row vectors that represent the embeddings for that categorical variable.
So eventually I found a solution for the second method mentioned in my post. Using sparse matrices avoids the memory issues that might occur when attempting multiplication of matrices with large data (categories and/or observations).
I wrote this function which returns the original data frame with all the desired categorical variables' embedded vectors appended.
from typing import Dict, List

import keras
import numpy as np
import pandas as pd
import tqdm
from scipy import sparse
from sklearn.preprocessing import OneHotEncoder


def get_embeddings(model: keras.models.Model, cat_vars: List[str], df: pd.DataFrame,
                   dict: Dict[str, Dict[str, int]]) -> pd.DataFrame:
    df_list: List[pd.DataFrame] = [df]
    for var_name in cat_vars:
        df_1vec: pd.DataFrame = df.loc[:, var_name]
        # one-hot encode the column as a sparse matrix to avoid memory blow-up
        enc = OneHotEncoder()
        sparse_mat = enc.fit_transform(df_1vec.values.reshape(-1, 1))
        sparse_mat = sparse.csr_matrix(sparse_mat, dtype='uint8')
        orig_dict = dict[var_name]
        # build a lookup array with one embedding vector per one-hot column
        match_to_arr = np.empty(
            (sparse_mat.shape[1], model.get_layer(f'embedding_{var_name}').get_weights()[0].shape[1]))
        match_to_arr[:] = np.nan
        # the last row of the embedding matrix is reserved for unknown categories
        unknown_cat = model.get_layer(f'embedding_{var_name}').get_weights()[0].shape[0] - 1
        for i, col in enumerate(tqdm.tqdm(enc.categories_[0])):
            if col in orig_dict.keys():
                val = orig_dict[col]
                match_to_arr[i, :] = model.get_layer(f'embedding_{var_name}').get_weights()[0][val, :]
            else:
                match_to_arr[i, :] = (model.get_layer(f'embedding_{var_name}')
                                      .get_weights()[0][unknown_cat, :])
        # sparse one-hot (n_rows x n_categories) times lookup (n_categories x emb_size)
        a = sparse_mat.dot(match_to_arr)
        a = pd.DataFrame(a, columns=[f'{var_name}_{i}' for i in range(1, match_to_arr.shape[1] + 1)])
        df_list.append(a)
    df_final = pd.concat(df_list, axis=1)
    return df_final
dict is a dictionary of dictionaries, i.e., it holds one dictionary for each categorical variable, which I encoded beforehand, with the category names as keys and integers as values. Note that each variable was encoded with num_values + 1 codes, the last one being reserved for unknown categories.
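For illustration, a hypothetical shape of that dictionary (the variable and category names are made up):
encodings = {
    'city':   {'london': 0, 'paris': 1, 'berlin': 2},  # code 3 is reserved for unknown
    'device': {'ios': 0, 'android': 1},                # code 2 is reserved for unknown
}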
Basically, for each category value I am asking whether it is in the dictionary. If it is, I assign to the corresponding row of a temporary array (so if this is the first category, then the first row) the row of the embedding matrix whose index is the integer that the category name was encoded to.
If it is not in the dictionary, I assign to this row (the ith row) the last row of the embedding matrix, which corresponds to unknown categories.
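A hypothetical call, assuming a trained model, a training frame train_df, and the encoding dictionaries sketched above:
train_with_embeddings = get_embeddings(model, cat_vars=['city', 'device'], df=train_df, dict=encodings)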
This is what I suggested in the comments:
import numpy as np
import pandas as pd
from keras.layers import Dense, Embedding, Flatten, Input
from keras.models import Model

df = pd.DataFrame({'int': np.random.uniform(0, 1, 10), 'cat': np.random.randint(0, 333, 10)})  # cat is already encoded

## define the embedding model; you can also use multiple input sources
inp = Input((1,))
emb = Embedding(input_dim=10000+2, output_dim=50, name='embedding')(inp)
out = Dense(10)(emb)
model = Model(inp, out)
# model.compile(...)
# model.fit(...)

## get the cat embeddings
extractor = Model(model.input, Flatten()(model.get_layer('embedding').output))

## concat the embeddings to the original df
df = pd.concat([df, pd.DataFrame(extractor.predict(df.cat.values))], axis=1)
df
I'm performing a linear regression on a dataset (an Excel file) which consists of a Date column, a Score column, and an additional column called Prediction with NaN values, which will be used to store the predicted values.
I have found that my independent variable, X, contains timestamps, which I wasn't actually expecting...? Perhaps I'm doing something wrong, or actually missing something out?
Top of the original dataset:
Date Score
0 2019-05-01 4.607744
1 2019-05-02 4.709202
2 2019-05-03 4.132390
3 2019-05-05 4.747308
4 2019-05-07 4.745926
# Create the independent data set (X)
# Convert the dataframe to a numpy array
X = np.array(df.drop(['Prediction'], 1))
# Remove the last '30' rows
X = X[:-forecast_out]
print(X)
Example of output:
[[Timestamp('2019-05-01 00:00:00') 4.607744342064972]
[Timestamp('2019-05-02 00:00:00') 4.709201914086133]
[Timestamp('2019-05-03 00:00:00') 4.132389742485806]
[Timestamp('2019-05-05 00:00:00') 4.74730802483691]
[Timestamp('2019-05-07 00:00:00') 4.7459264970444615]
[Timestamp('2019-05-08 00:00:00') 4.595303054619376]
# Create the dependent data set (y)
# Convert the dataframe to a numpy array
y = np.array(df['Prediction'])
# Get all of the y values except the last '30' rows
y = y[:-forecast_out]
print(y)
Some of the output:
[4.63738251 4.34354486 5.12284464 4.2751933 4.53362196 4.32665058
4.77433793 4.37496465 4.31239161 4.90445026 4.81738271 3.99114536
5.21672369 4.4932632 4.46858993 3.93271862 4.55618508 4.11493084
4.02430584 4.11672606 4.19725244 4.3088558 4.98277563 4.97960989
# Split the data into 80% training and 20% testing
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# Create and train the Linear Regression Model
lr = LinearRegression()
# Train the model
lr.fit(x_train, y_train)
The error:
TypeError: float() argument must be a string or a number, not 'Timestamp'
Clearly the dataset X doesn't like having the timestamp, and like I say, I wasn't really expecting it.
Any help on removing it (or perhaps I need it!?) would be great. As you can see, I'm simply looking to perform a simple regression analysis.
Do not include the Timestamps (Date) in your creation of 'X'.
The data set is already ordered, so do you really need the time stamps? Another option is to reassign the index. In either case, I think, do not pass Timestamps as feature data.
Implement changes at this step:
X = np.array(df.drop(['Prediction'],1))
Do something like:
X = np.array(df.drop(['Date', 'Prediction'],1))
I think the problem could be solved by using the date timestamp as the index field instead. You can try set_index('Date') to re-assign the index.
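A minimal sketch of both options, assuming the df, 'Date' and 'Prediction' columns from the question:
import numpy as np

# Option 1: drop the Date column so X contains only numeric features
X = np.array(df.drop(['Date', 'Prediction'], axis=1))

# Option 2: keep the dates, but move them into the index rather than the feature matrix
df = df.set_index('Date')
X = np.array(df.drop(['Prediction'], axis=1))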
Hi there, I am working with a scikit-learn data set (digits) and I split the data,
so I have x_train and y_train arrays.
The arrays are related in such a way that the index x[0] belongs to y[0]:
print x_train.shape
(1347, 64)
print y_train.shape
(1347)
print set(y_train)
(0,1,2,3,4,5,6,7,8,9)
I would like to extract a random sample from x_train given set(y), i.e., to resample my data by extracting just one random observation of the set(y). However, I don't know if I can do this with numpy or pandas. Does anyone have an idea of how to deal with this?
Thank you very much.
It is not clear what you want to do.
The set(y) contains all the available labels of your dataset X.
In general (until you specify what you need), use np.random.choice:
You have this:
print set(y)
(0,1,2,3,4,5,6,7,8,9)
Convert it first to a list:
index_all = list(set(y))
Now, randomly sample the set(y):
# this is a random index (class/label) from 0 to 9.
random_index = np.random.choice(index_all, 1)
Now, I see 2 possibilities (I believe you want Case 2):
1) Directly resample x based on this random index (random based on the set(y))
Finally, if x is a numpy array:
x[random_index, :]
This returns a random observation of x based on the set(y)
2) Resample the x but get a random observation that has a label y. Label 'y' is defined randomly above (random_index)
x[y==random_index]
This returns a random observation of x that is associated with a label y.
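A minimal sketch of Case 2, assuming the x_train and y_train arrays from the question, taking one random observation for a randomly chosen label:
import numpy as np

random_label = np.random.choice(list(set(y_train)))    # one label sampled from set(y)
candidates = np.where(y_train == random_label)[0]      # row indices carrying that label
random_row = x_train[np.random.choice(candidates)]     # one random observation for it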
This is the approach I generally use for constructing a dataframe and extracting data from it.
import numpy as np
import pandas as pd
#Dummy arrays for x and y
x_train = np.zeros((1347,64))
y_train = np.ones((1347))
#First we pair up the arrays according to their index using zip. Only use this
#method if both arrays are of equal length.
training_dataset = list(zip(x_train,y_train))
#Next we load the dataset as a dataframe using Pandas
df = pd.DataFrame(data=training_dataset)
#Check that the dataframe is what you want
df.head()
#If you would like to extract a random row, you may use
df.sample(n=1)
#Alternatively if you would like to extract a specific row (eg. the 10th row, aka index 9)
df.iloc[9]
I hope I've understood what you wanted to achieve but if not, feel free to let me know so I can amend my answer!
Sources:
Pandas Docs
Selecting Rows and Columns in Pandas Dataframes
I have a dataset which has a numeric feature column with a large number of unique values (of the order of 10,000). I know that when we generate the model for the Random Forest regression algorithm in PySpark, we pass a parameter maxBins which should be at least equal to maximum unique value in all features. So if I pass 10,000 as the maxBins value, the algorithm will not be able to take the load and it will either fail or go on forever. How can I pass such a feature to the model? I read in a few places about binning the values into buckets and then passing those buckets to the model, but I have no idea how to do that in PySpark. Can anyone show sample code to do that? My current code is this:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.mllib.regression import LabeledPoint
from pyspark.sql.functions import col

def parse(line):
    # line[6] and line[8] are feature columns with many unique values; line[12] is the numeric label
    return (line[1], line[3], line[4], line[5], line[6], line[8], line[11], line[12])

input = sc.textFile('file1.csv').zipWithIndex().filter(lambda pair: pair[1] >= 0).map(lambda pair: pair[0])
parsed_data = (input
               .map(lambda line: line.split(","))
               .filter(lambda line: len(line) > 1)
               .map(parse))

# Divide the input data into training and test sets with a 70%-30% ratio
(train_data, test_data) = parsed_data.randomSplit([0.7, 0.3])

label_col = "x7"

# Converting the RDD to a dataframe. x4 and x5 are the columns with many unique values
train_data_df = train_data.toDF(("x0", "x1", "x2", "x3", "x4", "x5", "x6", "x7"))

# Indexers encode strings with doubles
string_indexers = [
    StringIndexer(inputCol=x, outputCol="idx_{0}".format(x))
    for x in train_data_df.columns if x != label_col
]

# Assembles multiple columns into a single vector
assembler = VectorAssembler(
    inputCols=["idx_{0}".format(x) for x in train_data_df.columns if x != label_col],
    outputCol="features"
)

pipeline = Pipeline(stages=string_indexers + [assembler])
model = pipeline.fit(train_data_df)
indexed = model.transform(train_data_df)

label_points = (indexed
                .select(col(label_col).cast("float").alias("label"), col("features"))
                .map(lambda row: LabeledPoint(row.label, row.features)))
If anyone can provide sample code showing how I can modify my code above to bin the two feature columns with many unique values, it would be helpful.
we pass a parameter maxBins which should be at least equal to maximum unique value in all features.
It is not true. It should be greater than or equal to the maximum number of categories for categorical features. You still have to tune this parameter to obtain the desired performance, but otherwise there is nothing else to do here.
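A minimal sketch of what this looks like in practice, assuming the RDD-based mllib API and the label_points RDD from the question (the parameter values are made up):
from pyspark.mllib.tree import RandomForest

# With an empty categoricalFeaturesInfo, all features are treated as continuous,
# so maxBins does not need to cover the 10,000 unique values; it is only a tuning knob.
model = RandomForest.trainRegressor(
    label_points,
    categoricalFeaturesInfo={},
    numTrees=50,
    maxDepth=10,
    maxBins=32,
)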