Creating many feature columns in Tensorflow - python

I'm getting started on a TensorFlow project and am in the middle of defining and creating my feature columns. However, I have hundreds and hundreds of features; it's a pretty extensive dataset. Even after preprocessing and scrubbing, I have a lot of columns.
The traditional way of creating a feature_column is shown in the TensorFlow tutorial and in this StackOverflow post: you essentially declare and initialize a TensorFlow object for each feature column:
gender = tf.feature_column.categorical_column_with_vocabulary_list(
    "gender", ["Female", "Male"])
This is all well and good if your dataset has only a few columns, but in my case I certainly don't want hundreds of lines of code initializing different feature_column objects.
What's the best way to resolve this issue? I notice that in the tutorial, all the columns are collected as a list:
base_columns = [
    gender, native_country, education, occupation, workclass, relationship,
    age_buckets,
]
Which is ultimately passed into your estimator:
m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=base_columns)
So would the ideal way of handling feature_column creation for hundreds of columns be to append them directly into a list? Something like this?
my_columns = []
for col in df.columns:
    if is_string_dtype(df[col]):    # is_string_dtype is a pandas function
        my_column.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif is_numeric_dtype(df[col]):    # is_numeric_dtype is a pandas function
        my_column.append(tf.feature_column.numeric_column(col))
Is this the best way of creating these feature columns? Or am I missing some functionality to Tensorflow that allows me to work around this step?

What you have posted in the question makes sense. Small extension based on your own code:
import pandas.api.types as ptypes

my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):
        my_columns.append(tf.feature_column.numeric_column(col))
    elif ptypes.is_categorical_dtype(df[col]):
        # use the existing pandas categories as the vocabulary
        my_columns.append(tf.feature_column.categorical_column_with_vocabulary_list(
            col, list(df[col].cat.categories)))
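The resulting list can then be passed to the estimator exactly like base_columns in the question. A minimal sketch, reusing the question's model_dir (note that a DNNClassifier, unlike the linear model, would first need the categorical columns wrapped in tf.feature_column.indicator_column or embedding_column):
m = tf.estimator.LinearClassifier(
    model_dir=model_dir, feature_columns=my_columns)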

I used your own answer and just edited it a little bit (it should be my_columns, not my_column, inside the for loop); posting it the way it worked for me.
import pandas.api.types as ptypes

my_columns = []
for col in df.columns:
    if ptypes.is_string_dtype(df[col]):    # is_string_dtype is a pandas function
        my_columns.append(tf.feature_column.categorical_column_with_hash_bucket(
            col, hash_bucket_size=len(df[col].unique())))
    elif ptypes.is_numeric_dtype(df[col]):    # is_numeric_dtype is a pandas function
        my_columns.append(tf.feature_column.numeric_column(col))

The above two methods work only if the data is provided as a pandas DataFrame where each column has a name. But if all your columns are numeric and you don't want to name them, e.g. when reading several numerical columns from a NumPy array, you can use something like this:
feature_column = [tf.feature_column.numeric_column(key='image', shape=(784,))]
input_fn = tf.estimator.inputs.numpy_input_fn(x={'image': x_train}, shuffle=False)
where x_train is your NumPy array with 784 columns. You can check this post by Vikas Sangwan for more details.
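For completeness, a minimal end-to-end sketch of this approach (TF 1.x estimator API; the x_train and y_train arrays below are hypothetical placeholders):
import numpy as np
import tensorflow as tf

x_train = np.random.rand(100, 784).astype(np.float32)    # hypothetical 784-column array
y_train = np.random.randint(0, 10, size=100)              # hypothetical integer labels

feature_columns = [tf.feature_column.numeric_column(key='image', shape=(784,))]
input_fn = tf.estimator.inputs.numpy_input_fn(
    x={'image': x_train}, y=y_train, batch_size=32, num_epochs=1, shuffle=True)

estimator = tf.estimator.LinearClassifier(feature_columns=feature_columns, n_classes=10)
estimator.train(input_fn)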

Related

I'd like to make the pandas categorical class faster... how?

I am concerned with creating pandas dataframes with billions of rows. These dataframes are instantiated from a numpy array. The trick is that I need to make some columns into a categorical data type. I would like to do this as fast as possible. Currently, the creation of these categoricals is my bottleneck.
I am currently attempting to create the categoricals with fastpath=True.
Inside the __init__ of Categorical there is a function call codes = coerce_indexer_dtype(values, dtype.categories) (see: https://github.com/pandas-dev/pandas/blob/main/pandas/core/arrays/categorical.py, line 378)
I have data that I can format so I can skip this call (it is one of the primary offenders here).
The super().__init__(codes, dtype) call at the end of the fastpath block seems to prevent me from making an easy subclass of the Categorical class to override the behavior. Perhaps I'm missing something, though. I'm wary of subclassing a pandas class and screwing things up.
It would be very helpful if anyone had any feedback.
Here is a small code with the basics of what I'm doing:
import pandas as pd
import numpy as np
df = pd.DataFrame([(i, j) for i in range(1000) for j in range(1000)])
cats = list(range(1000))
dtype = pd.CategoricalDtype(categories=cats, ordered=True)
df[0] = pd.Categorical(
    values=df[0].to_numpy(dtype=np.int16), dtype=dtype, fastpath=True
)
df[1] = pd.Categorical(
    values=df[1].to_numpy(dtype=np.int16), dtype=dtype, fastpath=True
)
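If the integer values can be prepared as valid codes into dtype.categories up front, one possible shortcut is pd.Categorical.from_codes, which skips the value-to-code coercion entirely. A sketch, assuming the codes are already correct integer positions (not tested against the fastpath internals):
import numpy as np
import pandas as pd

cats = list(range(1000))
dtype = pd.CategoricalDtype(categories=cats, ordered=True)
codes = np.repeat(np.arange(1000, dtype=np.int16), 1000)  # hypothetical pre-computed codes
col = pd.Categorical.from_codes(codes, dtype=dtype)       # no coercion of raw values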

How to fix "DeprecationWarning: DataFrames with non-bool types result in worse computational performance..."

I have been trying to implement the Apriori algorithm in Python. There are several examples online; they all use similar methods and mostly the same example dataset. The reference link: https://www.kaggle.com/code/rockystats/apriori-algorithm-or-market-basket-analysis/notebook
(starting from line [26])
I have a different dataset that has the same structure as the example datasets online. I keep getting the
"DeprecationWarning: DataFrames with non-bool types result in worse
computationalperformance and their support might be discontinued in
the future.Please use a DataFrame with bool type"
error.
Here is my code:
import pandas as pd
import numpy as np
from mlxtend.frequent_patterns import apriori, association_rules
df1 = pd.read_csv(r'C:\Users\USER\dataset', sep=';')
df=df1.fillna(0)
basket = pd.pivot_table(data=df, index='cust_id', columns='Product', values='quantity', aggfunc='count',fill_value=0.0)
def convert_into_binary(x):
    if x > 0:
        return 1
    else:
        return 0
basket_sets = basket.applymap(convert_into_binary)
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)
print(frequent_itemsets)
# association rule
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1)
print(rules)
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Empty DataFrame
Columns: [antecedents, consequents, antecedent support, consequent support, support, confidence, lift, leverage, conviction]
Index: []
I am not sure if this issue is related to this error that I am having. I am new to python and I would really appreciate assistance and support on this issue.
I ran into the same issue even after converting my dataframe fields to 0 and 1.
The fix was just making sure the apriori module knows the dataframe is of boolean type, so in your case you should run this :
frequent_itemsets = apriori(basket_sets.astype('bool'), min_support=0.07, use_colnames=True)
In addition, in the last step of my code, I get an empty dataframe; I can see the column headings of the dataset but the output is empty.
Try using a smaller min_support
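A related simplification (a sketch based on the question's code, not part of the original answers): the pivot table can be turned into a boolean basket in one step, which removes the applymap pass and avoids the DeprecationWarning at the same time:
basket_sets = basket > 0    # boolean DataFrame directly
frequent_itemsets = apriori(basket_sets, min_support=0.07, use_colnames=True)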

How to implement a function through scikit FunctionTransformer() that refers to two columns of a data frame ('kw_args' argument?)

While working on my submission for the famous Kaggle Titanic dataset (890 rows / 11 columns), I would like to execute all of my 'Feature Engineering' steps within one scikit pipeline. However, I could barely find any online examples that demonstrate how to use the scikit FunctionTransformer() to execute slightly more complex custom functions, especially functions that refer to more than one column of the dataset.
In my concrete example, I would like to replace NaN values in the column 'Age' depending on the passenger class (column 'Pclass'). Possible passengers classes are 1, 2 or 3 and the corresponding ages that should replace the NaN values are 38, 30 and 25. My current code looks like this:
def impute_age_class(df, column_1, column_2):
    for i in range(len(df)):
        if np.isnan(df[column_1].iloc[i]):
            if df[column_2].iloc[i] == 1:
                df[column_1].iloc[i] = 38
            elif df[column_2].iloc[i] == 2:
                df[column_1].iloc[i] = 30
            else:
                df[column_1].iloc[i] = 25
    return df

age_transformers = [("impute_age_class",
                     FunctionTransformer(impute_age_class, validate=False,
                                         kw_args={'column_1': 'Age', 'column_2': 'Pclass'}),
                     ["Age", "Pclass"])]
It seems like the code gets executed and I receive a slightly better accuracy score with my logreg model, but I also get the warnings shown in this screenshot:
(screenshot of the warning messages)
I would be very thankful if you could give me any hints on whether the syntax of my code could be improved in order to avoid these warnings and ensure correct execution.
That warning is very common, and worth reading up on. But it's also not great to be looping over the rows of a dataframe. You can use pandas's own fillna for this:
def impute_age_class(df, fillme, groupby):
    df = df.copy()
    df.loc[:, fillme] = df[fillme].fillna(
        value=df[groupby].map({1: 38, 2: 30, 3: 25})
    )
    return df

tfmr = FunctionTransformer(
    impute_age_class,
    validate=False,
    kw_args={'fillme': 'age', 'groupby': 'pclass'}
)
It's a little unusual to have the parameters for the two column names when you are hard-coding the mapping inside the function. And if you didn't have the mapping already in mind, it'd be better to learn it at fit time and then transform train and test sets with that mapping: see SimpleImputer with groupby and https://datascience.stackexchange.com/q/71856/55122.
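A sketch of that fit-time approach, using a hypothetical custom transformer (GroupMedianImputer is not an sklearn class or the linked answers' code, just an illustration; it assumes the pipeline passes a DataFrame through):
from sklearn.base import BaseEstimator, TransformerMixin

class GroupMedianImputer(BaseEstimator, TransformerMixin):
    def __init__(self, fillme='Age', groupby='Pclass'):
        self.fillme = fillme
        self.groupby = groupby

    def fit(self, X, y=None):
        # learn one median per group from the training data only
        self.medians_ = X.groupby(self.groupby)[self.fillme].median().to_dict()
        return self

    def transform(self, X):
        X = X.copy()
        X[self.fillme] = X[self.fillme].fillna(X[self.groupby].map(self.medians_))
        return X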

Extract a list of categorical names based on unique levels

I'm practising my Python skills by solving online exercises. I would like some help with this question because I have never heard of levels in Python (I know they are available in R).
Complete the function to extract a list of categorical feature names
from a dataframe using a threshold to infer categorical variables
based on the number of unique levels. Exclude the target variable and
sort the resulting list.
import pandas as pd
def extract_categorical_features(df, n_levels, target):
    # Write your code here...
    solution = None
    return solution
# Click 'Run' to execute test case
test_case = extract_categorical_features(churn_df, 6, 'Churn')
I think that's what they are asking:
cat_feats = sorted(
    col for col in df.columns
    if col != target and df[col].nunique() <= n_levels
)
return cat_feats
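Dropped into the exercise's skeleton, that becomes (a sketch; churn_df is supplied by the grader):
def extract_categorical_features(df, n_levels, target):
    solution = sorted(
        col for col in df.columns
        if col != target and df[col].nunique() <= n_levels
    )
    return solution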

Python sklearn-pandas Transform Multiple Columns at the same time error

I am using python with pandas and sklearn and trying to use the new and very convenient sklearn-pandas.
I have a big data frame and need to transform multiple columns in a similar way.
I have multiple column names in the variable other
The source code documentation here states explicitly that there is a possibility of transforming multiple columns with the same transformation, but the following code does not behave as expected:
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([[other[0], other[1]], LabelEncoder()])
mapper.fit_transform(df.copy())
I get the following error:
raise ValueError("bad input shape {0}".format(shape))
ValueError: ['EFW', 'BPD']: bad input shape (154, 2)
When I use the following code, it works great:
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)]
mapper = DataFrameMapper(cols)
mapper.fit_transform(df.copy())
To my understanding, both should work well and yield same results.
What am I doing wrong here?
Thanks!
The problem you encounter here is that the two snippets of code are completely different in terms of data structure.
cols = [(other[i], LabelEncoder()) for i,col in enumerate(other)] builds a list of tuples. Do note that you can shorten this line of code to:
cols = [(col, LabelEncoder()) for col in other]
Anyway, the first snippet, [[other[0],other[1]],LabelEncoder()] results in a list containing two elements: a list and a LabelEncoder instance. Now, it is documented that you can transform multiple columns through specifying:
Transformations may require multiple input columns. In these cases, the column names can be specified in a list:
mapper2 = DataFrameMapper([
    (['children', 'salary'], sklearn.decomposition.PCA(1))
])
This is a list containing tuple(list, object) structured elements, not list[list, object] structured elements.
If we take a look at the source code itself,
class DataFrameMapper(BaseEstimator, TransformerMixin):
    """
    Map Pandas data frame column subsets to their own
    sklearn transformation.
    """

    def __init__(self, features, default=False, sparse=False, df_out=False,
                 input_df=False):
        """
        Params:

        features    a list of tuples with features definitions.
                    The first element is the pandas column selector. This can
                    be a string (for one column) or a list of strings.
                    The second element is an object that supports
                    sklearn's transform interface, or a list of such objects.
                    The third element is optional and, if present, must be
                    a dictionary with the options to apply to the
                    transformation. Example: {'alias': 'day_of_week'}
It is also clearly stated in the class definition that the features argument to DataFrameMapper is required to be a list of tuples, where the elements of the tuple may be lists.
As a last note, as to why you actually get your error message: the LabelEncoder transformer in sklearn is meant for labeling purposes on 1D arrays. As such, it is fundamentally unable to handle 2 columns at once and will raise an exception. So, if you want to use the LabelEncoder, you will have to build N tuples, each with one column name and the transformer, where N is the number of columns you wish to transform.
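If, on the other hand, you really do want a single transformation applied across both columns at once, it has to be a transformer that accepts 2D input. A sketch using the MinMaxScaler already imported in the question (note the tuple-with-a-list structure described above):
from sklearn.preprocessing import MinMaxScaler
from sklearn_pandas import DataFrameMapper

mapper = DataFrameMapper([
    ([other[0], other[1]], MinMaxScaler())
])
mapper.fit_transform(df.copy())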
