How to convert new non-numerical input data into numerical data in Python

OUTLINE:
I want to know how to ensure that newly arriving non-numerical input data is encoded as unique values.
DESCRIBE:
Namely, once both the training and test phases are finished, a lot of new data that mixes numerical and non-numerical values is fed into my model.
ISSUE(1):
The first important matter is therefore how to convert this new mixed data into a DataFrame in an all-numerical format.
TRIED METHOD:
How can we do that? I used LabelEncoder to transform every non-numerical value into a float, but a scikit-learn maintainer replied that LabelEncoder is only meant for the label [y] and that I should use OneHotEncoder to convert features.
[He answered me at this URL: https://github.com/scikit-learn/scikit-learn/issues/8674, although he misunderstood my point.]
Unfortunately, according to the official scikit-learn documentation (p. 1829), OneHotEncoder only accepts integer features.
I know there is a section titled 'Feature Union with Heterogeneous Data Sources', but it is not as convenient as LabelEncoder.
ISSUE(2):
The second reason this issue confuses me is how to ensure that new non-numerical input data is transformed into a unique value that differs from the previously transformed training data or test data. Namely, although we can use LabelEncoder to transform new input data into numerical values, there is a risk that a value produced for the new data ends up equal to a value already produced for previously transformed data.
Thanks for your help in advance.

OK, I have received an answer from the official maintainers. The following URL may be helpful for anyone who is confused about this issue.
https://github.com/amueller/introduction_to_ml_with_python/blob/master/04-representing-data-feature-engineering.ipynb
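As a hedged sketch of the approach that notebook covers (the 'city' and 'age' columns and their values below are invented for illustration): pandas.get_dummies can encode string features for training, and reindexing the dummy columns of new data against the training columns keeps the encoding consistent, with unseen categories simply becoming all-zero rows in the known columns. In scikit-learn 0.20 and later, OneHotEncoder also accepts string features directly, and handle_unknown='ignore' addresses ISSUE(2) in the same spirit.

import pandas as pd

# Hypothetical training and new data with a mixed (string + numeric) feature set
train = pd.DataFrame({"city": ["paris", "tokyo", "paris"], "age": [31, 45, 27]})
new = pd.DataFrame({"city": ["tokyo", "london"], "age": [52, 38]})

# One-hot encode the training data and remember its column layout
train_enc = pd.get_dummies(train, columns=["city"])

# Encode the new data, then align it to the training columns:
# columns for unseen categories ("london") are dropped, missing ones are filled with 0
new_enc = pd.get_dummies(new, columns=["city"]).reindex(
    columns=train_enc.columns, fill_value=0
)

print(train_enc)
print(new_enc)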

Related

python upset plot data type unclear

I am trying to make an UpSet plot using gene-disease association lists. I assume that I simply do not understand which data type is required as input, since most examples use artificially created datasets of type "int64".
Upsetplot: https://buildmedia.readthedocs.org/media/pdf/upsetplot/latest/upsetplot.pdf and https://pydigger.com/pypi/UpSetPlot
I copied the examples given in the links above and they work just fine. When I try my own dataset I get the error message: AttributeError: 'Index' object has no attribute 'levels'
The data I use as input is a data frame with boolean information (see the attachment "mydata.png"). So I have the diseases as columns, the genes as rows, and then boolean statements about whether the specific gene is associated with that disease or not (I can make this sound more computational if required).
An example dataset that works can be found in the documentation or in the screenshot "upsetplot_data_example.png". The documentation says something about "category membership", but I do not quite understand what data type that is.
I assume it is a basic issue of not understanding what "format" is required. If anyone has an idea of what I need to do, please let me know. I welcome all feedback. I do not expect anyone to actually do the coding for me, however some pointers would be so helpful.
Thanks everyone!
The recently released Data Format Guide might prove helpful. Perhaps you need to set those boolean columns as the index of your data frame before passing it in, although ultimately, it may be easier to use from_contents or from_memberships to describe your data.
However, upsetplot will hopefully make the input format easier in a future version.
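As a hedged sketch of the answer's suggestion (the disease and gene names below are invented): upsetplot expects a Series whose boolean MultiIndex levels encode category membership, so counting rows per membership pattern with groupby().size() produces that shape from an indicator data frame like the one described.

import pandas as pd
import matplotlib.pyplot as plt
from upsetplot import plot

# Hypothetical gene x disease indicator frame (True = association reported)
df = pd.DataFrame(
    {"diabetes": [True, False, True, True],
     "asthma":   [False, True, True, False],
     "anemia":   [True, True, False, False]},
    index=["GENE1", "GENE2", "GENE3", "GENE4"],
)

# Count how many genes fall into each membership pattern; the result is a
# Series with a boolean MultiIndex, which is the shape upsetplot expects
counts = df.groupby(list(df.columns)).size()

plot(counts)
plt.show()

The answer's from_memberships alternative takes, for each gene, the list of diseases it belongs to; the groupby approach above is simply the shortest route when the data already sits in boolean indicator columns.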

How to substitute for null values if it is a categorical variable?

I was trying to get dummy values for my data, when I noticed some values are having '?' as their value.
As many rows in my data have these values, I simply cannot drop them.
In such case what should I replace them with?
Will just taking the mode of the category help?
Also, I tried to replace the ? values with the mode.
df1 = df1[df1.workclass == '?'].replace('?',"Private")
But I get an empty table now.
It depends on the dataset. There are different methods that apply to different features. Some may require just replacing with the mode. In some cases, different ML algorithms and models are also used such as Random Forest, KNN, etc. So it completely depends on the type of data you are handling. Explore the field of data exploration. Maybe this can help you.
You will have to manually check your different variables and decide what to do with the missing values for each parameter.
For example, you can drop variables with more than 50 percent missing values unless they carry a very high weight of evidence.
Some variables can be substituted with central tendencies or can be predicted as well.
Categoricals can be replaced with UNK (unknown), and so on.
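As a hedged sketch of the mode-replacement idea (the small df1 below is invented; only the 'workclass' column name and the 'Private' value come from the question): the posted line first filtered df1 down to the rows where workclass == '?' and assigned that subset back to df1, and if the raw values carry leading whitespace (' ?'), that filter matches nothing at all, which would explain the empty result. Replacing within the column instead keeps the whole table.

import pandas as pd

# Hypothetical frame in the spirit of the question's df1
df1 = pd.DataFrame({"workclass": ["Private", "?", "State-gov", "?", "Private"],
                    "age": [39, 50, 38, 53, 28]})

# Compute the mode over the known values, then replace '?' in place
mode_value = df1.loc[df1["workclass"] != "?", "workclass"].mode()[0]
df1["workclass"] = df1["workclass"].replace("?", mode_value)

print(df1)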

Encoding huge categorical features

I am currently working on a big categorical dataset using both Pandas and sklearn. As I now want to do feature extraction, or at least be able to build a model, I need to encode the categorical features, since they are not handled by sklearn models.
Here is the issue: for one of the categories, I have more than 110,000 different values. I cannot possibly one-hot encode this, as there would be (and there is) a memory error.
Also, this feature cannot be removed from the DataSet so this option is out of the question.
So I have multiple ways to deal with that :
Use FeatureHasher from sklearn (as mentioned in several related topics). But, while it is easy to encode with it, there is no mention of how to link the hashed feature back to the DataFrame and thus complete the feature extraction.
Use fuzzy matching to reduce the number of values in the feature and then use one-hot encoding. The issue here is that it will still create many dummy variables, and I will lose the meaning of the features that get encoded.
So I have multiple questions: first, do you know how to use FeatureHasher to link a Pandas DataFrame to an sklearn model through hashing? If so, how am I supposed to do it?
Second, can you think of any other way to do it that might be easier, or would work best with my problem?
Here are some screenshots of the dataset I have, so that you can understand the issue more fully.
Here is the output of the number/percentage of different values per feature.
Also, the biggest feature to encode, called 'commentaire' (French for 'comment'), contains strings that are sometimes long sentences and sometimes just short words.
As requested, here is the current code to test feature hashing:
from sklearn.feature_extraction import FeatureHasher

fh = FeatureHasher(input_type='string')
hashedCommentaire = fh.transform(dataFrame['commentaire'])
This outputs a MemoryError, so I can reduce the number of features; let's put it at 100:
fh = FeatureHasher(input_type='string', n_features=100)
hashedCommentaire = fh.transform(dataFrame['commentaire'])
print(hashedCommentaire.toarray())
print(hashedCommentaire.shape)
This runs without an error and outputs the following: feature hashing output
Can I then directly use the resulting hashing in my DataFrame? The issue is that we totally lose track of the values of 'commentaire'. If we then want to predict on new data, will the hashing output follow the previous one? And also, how do we know the hashing is "sufficient"? Here I used 100 features, but there were more than 110,000 distinct values to begin with.
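Regarding linking the hashed output back to the DataFrame and reusing it on new data, here is a hedged sketch (only the 'commentaire' name comes from the question; the frame and the 'note' column are invented). Feature hashing is stateless, so the same text always maps to the same columns and new data can be transformed later without refitting. Note that FeatureHasher with input_type='string' treats each sample as an iterable of tokens, so for raw sentences HashingVectorizer, which tokenizes internally, is usually the simpler choice.

import pandas as pd
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import HashingVectorizer

# Hypothetical stand-in for the real DataFrame
dataFrame = pd.DataFrame({
    "commentaire": ["tres bon produit", "livraison lente", "bon produit"],
    "note": [5, 2, 4],
})

# HashingVectorizer tokenizes each sentence and hashes the tokens into a fixed
# number of columns; being stateless, its output is reproducible on new data
hv = HashingVectorizer(n_features=100)
hashed_commentaire = hv.transform(dataFrame["commentaire"])

# "Linking" the hashed block back to the rest of the DataFrame: stack it next
# to the remaining numeric columns and feed the result to an sklearn model
X = hstack([hashed_commentaire, csr_matrix(dataFrame[["note"]].values)])
print(X.shape)  # (3, 101)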
Following another observation, I began to explore the use of tf-idf. Here we can see a sample of 'commentaire' values: 'commentaire' feature sample
As we can see, the different values are strings, sometimes whole sentences. The thing is, while exploring this data, I noticed that some values are very close to each other; that is why I wanted to explore the idea of fuzzy matching at first, and now tf-idf.
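To illustrate the tf-idf direction in a hedged way (the example texts below are invented; only the 'commentaire' name comes from the question): unlike hashing, TfidfVectorizer is stateful, so it has to be fitted on the training text and the same fitted object reused to transform new data, which keeps both in the same feature space.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training and new text for the 'commentaire' column
train_text = pd.Series(["tres bon produit", "livraison lente", "bon rapport qualite prix"])
new_text = pd.Series(["produit correct", "tres lente"])

# Fit the vocabulary and idf weights on the training text only;
# max_features caps the size of the resulting feature space
vectorizer = TfidfVectorizer(max_features=1000)
X_train = vectorizer.fit_transform(train_text)

# Transform new data with the SAME fitted vectorizer so it lands in the same
# feature space; words never seen during fitting are simply ignored
X_new = vectorizer.transform(new_text)

print(X_train.shape, X_new.shape)  # identical number of columns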

Encoded categorical features in h2o in python

Is there a way to see how the categorical features are encoded when we allow h2o to automatically create categorical data by casting a column to enum type?
I am implementing holdout stacking where my underlying training data differs for each model. I have a common feature that I want to make sure is encoded the same way across both sets. The feature contains names (str). It is guaranteed that all names that appear in one data set will appear in the other.
The best way to see inside a model is to export the pojo, and look at the java source code. You should see how it is processing enums.
But, if I understand the rest of your question correctly, it should be fine. As long as the training data contains all possible values of a category, it will work as you expect. If a categorical value not seen in training is presented in production, it will be treated as an NA.
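As a hedged sketch of inspecting the encoding (the frame below is invented; asfactor() and levels() are standard H2OFrame methods, and h2o.init() assumes a local cluster can be started): casting the column to enum and printing its levels shows the level order h2o assigned, which is the encoding shared as long as both data sets contain the same names.

import h2o

h2o.init()

# Hypothetical frame with a string "name" column shared across data sets
hf = h2o.H2OFrame({"name": ["alice", "bob", "alice", "carol"],
                   "y": [1, 0, 1, 0]})

# Cast the column to a categorical (enum) type
hf["name"] = hf["name"].asfactor()

# Inspect the level order h2o assigned to the category
print(hf["name"].levels())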

rpy2 preserve metadata in FactorVector

I have a script in Python that loads an .RData file, reads it, and then writes it out to an Excel file. Unfortunately, one table contains 11 variables and 144 objects with mixed types (IntVector, FactorVector, FloatVector, etc.).
When the table is written to Excel, the column names and data are preserved, except for the column that is a four-level FactorVector. Instead of returning the labels (a, a, a, a, b, b, b, b, c, c, c, c, d, d, d, d, etc.) associated with the four levels, it returns the integer codes associated with each level (1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4, etc.).
I found this on the rpy2 sourceforge website, which pretty much explains my problem.
Since a FactorVector is an IntVector with attached metadata (the levels), getting items Python-style was not changed from what happens when getting items from an IntVector. A consequence of that is that information about the levels is then lost.
It goes on below this to explain using levels, at which point I get lost as to what exactly I should do or use to keep the metadata levels intact for the FactorVector variable in question.
I presume there is some sort of rpy2.robjects "switch" that will preserve this metadata when it gets translated into Python? What would be the most efficient way to apply this? Thanks!
The conversion layer customized for pandas DataFrames in rpy2 2.6.0 should take care of converting R factors to pandas factors.
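A hedged sketch of recovering the labels (the example factor is built inline; levels and iter_labels() belong to rpy2's FactorVector, though details vary across rpy2 versions): iterating a FactorVector yields the 1-based integer codes, and the levels attribute maps those codes back to the original labels. Activating the pandas conversion layer mentioned in the answer goes further and converts whole R data frames with factors into pandas Categorical columns.

from rpy2 import robjects
from rpy2.robjects import pandas2ri

# A FactorVector built directly in R for illustration
fv = robjects.r('factor(c("a", "a", "b", "c", "d"))')

# Indexing a FactorVector yields the underlying integer codes (1-based);
# the labels can be recovered through the levels attribute or iter_labels()
labels = [fv.levels[code - 1] for code in fv]
print(labels)                  # ['a', 'a', 'b', 'c', 'd']
print(list(fv.iter_labels()))  # same thing, using the built-in helper

# Alternatively, switch on the pandas conversion layer so R data frames
# convert to pandas DataFrames with factors preserved as Categorical columns
pandas2ri.activate()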
