Reversing Sci-Kit LabelEncoder, but have a 2D array dataset - python

I'm trying to create an automated data pre-processing library, and I want to transform the string data into numerical data so it can be run through ML algorithms. But I can't seem to reverse it back to its original state, which should be relatively simple given that scikit-learn has a built-in inverse_transform() method.
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

def transformCatagorical(data):
    catagorical_data = data.select_dtypes(include=['object']).columns.tolist()
    for cat in catagorical_data:
        transform = le.fit_transform(data[cat].astype(str))
        data[cat] = transform
This is our transformation function which yields good results as shown here:
Transformed Data
But when we try to reverse it using this function:
def reverse(orig, data):
    cols = get_categorical_columns(orig)
    for col in cols:
        data[col] = le.inverse_transform(data[col])
It transforms the data into a completely random, coordinate-like structure. I'm not sure how to explain it without a picture:
Picture of wrongly transformed data
I've been trying to figure out how/why it's doing this but honestly I'm completely lost. Any help would be appreciated! Thank you!
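The likely cause: a single shared LabelEncoder is re-fit once per column, so after transformCatagorical finishes it only remembers the classes of the last column it saw; inverse_transform then decodes every column through that last fit, which scrambles the values. A minimal sketch that keeps one fitted encoder per column (function names here are illustrative):

encoders = {}  # one fitted LabelEncoder per categorical column

def transform_categorical(data):
    for col in data.select_dtypes(include=['object']).columns:
        encoders[col] = LabelEncoder()
        data[col] = encoders[col].fit_transform(data[col].astype(str))

def reverse_categorical(data):
    for col, enc in encoders.items():
        data[col] = enc.inverse_transform(data[col])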

Related

Non-overlapping data in train/test/validation split in Python

I'm trying to create a function for some deep learning tasks involving satellite image classification. I have searched through a lot of libraries and haven't found what I need. I tried scikit-learn, but I feel that it is not what I need.
Any hint for a specialised function that I may not see?
The sklearn train_test_split seems to fit all your needs.
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
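A minimal sketch, assuming a feature array X and a label array y (placeholder names): calling train_test_split twice gives non-overlapping train/validation/test subsets.

from sklearn.model_selection import train_test_split

# First carve out the 30% test set, then split the remaining 70%
# so that 20% of the original data becomes the validation set.
X_rest, X_te, y_rest, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=0.2/0.7, random_state=0)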
This should do the trick. You can use the permutation array on the X and y data separately if you like.
import numpy as np

# 50% train, 20% validation, remaining 30% test
num_tr, num_va = int(len(data)*0.5), int(len(data)*0.2)
perm = np.random.permutation(len(data))  # shuffled row indices
tr_data = data[perm[:num_tr]]
va_data = data[perm[num_tr:num_tr+num_va]]
te_data = data[perm[num_tr+num_va:]]
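For example, to use the permutation on separate X and y arrays (assumed names) while keeping rows aligned:

X_tr, y_tr = X[perm[:num_tr]], y[perm[:num_tr]]
X_va, y_va = X[perm[num_tr:num_tr+num_va]], y[perm[num_tr:num_tr+num_va]]
X_te, y_te = X[perm[num_tr+num_va:]], y[perm[num_tr+num_va:]]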

SyntaxError while trying to perform RobustScaler on Pandas Dataframe

I am working with the House Prices Kaggle dataset. I am trying to use RobustScaler from sklearn only on the numerical features in the dataset (LotFrontage, LotArea, etc.). First, I fit the scaler to the numerical values of my dataframe, selected by calling select_dtypes(exclude=['object']). Once the transformer has been fit to those values, I call transform, trying to write the transformed values back by assigning to that same select_dtypes(...) call. When I attempt that, I get the following error message:
SyntaxError: can't assign to function call
The data has already been cleared of null values. What has worked: when I assign the transform result to a separate variable, I get the result back as a numpy.ndarray.
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(df_train.select_dtypes(exclude=['object']))
df_train.select_dtypes(exclude=['object']) = transformer.transform(df_train.select_dtypes(exclude=['object']))  # SyntaxError: the left-hand side is a function call, not an assignable name
test = transformer.transform(df_train.select_dtypes(exclude=['object']))  # This DOES work, but returns an ndarray instead of updating the dataframe
All I want is for the transformed attributes to go back into the original pandas data frame at their corresponding locations. Is there some workaround I can implement if I can't convert the original dataframe results directly?
I managed to get it to work. Not sure how Pythonic this solution is, but it got me back on track:
num_cols = df_train.select_dtypes(exclude=['object']).columns
df_train[num_cols] = RobustScaler().fit_transform(df_train[num_cols])
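If the same scaling has to be applied to held-out data later, a variant that keeps the fitted scaler around may be safer (df_test is an assumed name):

num_cols = df_train.select_dtypes(exclude=['object']).columns
scaler = RobustScaler().fit(df_train[num_cols])
df_train[num_cols] = scaler.transform(df_train[num_cols])
# Later, reuse the same statistics on the test set:
# df_test[num_cols] = scaler.transform(df_test[num_cols])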

Feed composite inputs to a model

I need to feed an image and a vector sampled from a normal distribution simultaneously. As the image dataset I'm using is too large, I create an ImageDeserializer for that part. But I also need to add a random vector (sampled from numpy's normal distribution) to the input map before feeding it to the network. Is there any way to achieve this?
I also tried:
mb_data = reader_train.next_minibatch(mb_size, input_map=input_map)
mb_data[random_input_node] = np.random.normal((mb_size, 100))
but get the following error:
TypeError: cannot convert value of dictionary to N4CNTK13MinibatchDataE
The problem was solved with the following snippet to feed data to the trainer:
mb_data = reader_train.next_minibatch(mb_size, input_map=input_map)
# size= must be passed explicitly: np.random.normal(mb_size) would treat
# mb_size as the mean and return a single scalar, not a batch of vectors.
z = np.random.normal(size=(mb_size, 100)).astype(np.float32)  # CNTK expects float32 inputs
my_trainer.train_minibatch({feature_image: mb_data[image].data, feature_z: z})
Also thanks to @mewahl. Defining a new reader is another suitable way to solve the problem, and I suspect it would be faster than what I have done.

Using a trained classifier on a new DataFrame

I have built a classifier and trained and tested it on labeled data. Now I want to test it further by making predictions on a dataset without the labels. I already know the labels myself, but I want to remove them for the purpose of testing and have it print out the rows with a 0 prediction so I can compare the accuracy myself. I'm using the following code to iterate through my dataset and make a prediction for each row in the DataFrame:
malware = set()
for index, row in dataset.iterrows():
    res = clf.predict([row])  # one call per row: slow
    if res == 0:
        malware.add(index)
print(malware)
f.write(str(malware) + "\n")  # f is an already-open output file
It seems to be working, but it's not a quick process. Is there a better way, or anything I can do to speed it up?
Using a for loop to iterate through the rows of a dataset is slow in general. What you want instead is to hand the whole frame to the classifier in one call and wrap the result in a series of labels. (Assuming you're using Pandas for the dataframe, by the way.)
labels = pd.Series(clf.predict(dataset), index=dataset.index)
You can then just scan through this series with a for loop. That should be close to instant.
After a bit of work I have turned the comment from Ding into a workable answer that is much quicker. My new code is:
from collections import OrderedDict

# One vectorized predict over the whole frame; OrderedDict.fromkeys
# de-duplicates the matching indices while preserving their order.
malware = list(OrderedDict.fromkeys(dataset.index[clf.predict(dataset) == 0]))
print(malware)
Thanks very much Ding!
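For reference, the same result without the OrderedDict detour, assuming the index has no duplicates:

preds = clf.predict(dataset)                  # one vectorized call
malware = dataset.index[preds == 0].tolist()  # rows predicted as class 0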

How to combine tf-idf features with self-made features

For a simple web page classification system I am trying to combine some self-made features (frequency of HTML tags, frequency of certain word collocations) with the features obtained after applying tf-idf. I am facing the following problem, however, and I don't really know how to proceed from here.
Right now I am trying to put all of these together in one dataframe, mainly by following the code from the following link:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
vectorizer = TfidfVectorizer(stop_words="english")
X_train_counts = vectorizer.fit_transform(train_data['text_no_punkt'])
feature_names = vectorizer.get_feature_names()
dense = X_train_counts.todense()
denselist = dense.tolist()
tfidf_df = pd.DataFrame(denselist, columns=feature_names, index=train_data['text_no_punkt'])
But this doesn't preserve the index (from 0 to 2464) I had in my original dataframe with the other features, nor does it produce readable column names; instead of using the different words as titles, it uses numbers.
Furthermore I am not sure if this is the right way to combine features as this will result in an extremely high-dimensional dataframe which will probably not benefit the classifiers.
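On the index issue specifically: the snippet above passes the text column itself as index=. A sketch of a likely fix, reusing the question's variable names, is to keep the original frame's index instead:

tfidf_df = pd.DataFrame(X_train_counts.toarray(),
                        columns=vectorizer.get_feature_names(),
                        index=train_data.index)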
You can use hstack to merge the two sparse matrices, without having to convert to dense format.
from scipy.sparse import hstack
hstack([X_train_counts, X_train_custom])
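A fuller sketch, assuming the hand-made features sit in a 2-D numpy array custom_features (a hypothetical name):

import numpy as np
from scipy.sparse import csr_matrix, hstack

X_train_custom = csr_matrix(np.asarray(custom_features, dtype=float))
X_combined = hstack([X_train_counts, X_train_custom]).tocsr()
# X_combined stays sparse, so there is no need to call todense(), and it can
# be fed directly to most scikit-learn estimators.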
