A function for onehotencoding and labelencoding in a dataframe - python

I keep getting AttributeError: 'DataFrame' object has no attribute 'column' when I run the function on a column in a dataframe
def reform(column, dataframe):
    if dataframe.column.nunique() > 2 and dataframe.column.dtypes == object:
        enc.fit(dataframe[['column']])
        enc.categories_
        onehot = enc.transform(dataframe[[column]]).toarray()
        dataframe[enc.categories_] = onehot
    elif dataframe.column.nunique() == 2 and dataframe.column.dtypes == object:
        le.fit_transform(dataframe[['column']])
    else:
        print('Column cannot be reformed')
    return dataframe

Try changing:
dataframe.column to dataframe.loc[:,column], and
dataframe[['column']] to dataframe.loc[:,[column]].
For more help, please provide more information, such as: what is enc (show your imports)? What does dataframe look like (show a small example, perhaps with dataframe.head(5))?
Details:
Since column is an input (probably a string), you need to use it correctly when asking for that column from the dataframe object. If you just use dataframe.column it will try to find the column actually named 'column', but if you ask for it dataframe.loc[:,column], it will use the string that is represented by the input parameter named column.
With dataframe.loc[:,column], you get a Pandas Series, and with dataframe.loc[:,[column]] you get a Pandas DataFrame.
The pandas attribute 'columns', used as dataframe.columns (note the 's' at the end) just returns a list of the names of all columns in your dataframe, probably not what you want here.
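A quick illustration of that Series-vs-DataFrame difference, using a tiny made-up dataframe (the column name 'color' is just for demonstration):

```python
import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})
col = 'color'  # the column name held in a variable, as in the question

s = df.loc[:, col]      # single label -> pandas Series
sub = df.loc[:, [col]]  # list of labels -> pandas DataFrame

print(type(s).__name__)    # Series
print(type(sub).__name__)  # DataFrame
```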
TIPS:
Try to name input parameters so that you know what they are.
When developing a function, try setting the input to something static, and iterate the code until you get the desired output. E.g.
input_df = my_df
column_name = 'some_test_column'
if input_df.loc[:, column_name].nunique() > 2 and input_df.loc[:, column_name].dtypes == object:
    enc.fit(input_df.loc[:, [column_name]])
    onehot = enc.transform(input_df.loc[:, [column_name]]).toarray()
    input_df.loc[:, enc.categories_] = onehot
elif input_df.loc[:, column_name].nunique() == 2 and input_df.loc[:, column_name].dtypes == object:
    le.fit_transform(input_df.loc[:, [column_name]])
else:
    print('Column cannot be transformed')
Look up how to use scikit-learn Pipelines, with ColumnTransformer. It will help make the workflow easier (https://scikit-learn.org/stable/modules/compose.html).

Weird behavior on String vs Categorical Dtypes

I've been facing some very weird behavior, or could I say a bug, when handling categorical vs string dtypes. Take a look at this simple example dataframe:
import pandas as pd
import numpy as np
data = pd.DataFrame({
    'status': ['pending', 'pending', 'pending', 'canceled', 'canceled', 'canceled', 'confirmed', 'confirmed', 'confirmed'],
    'partner': ['A', np.nan, 'C', 'A', np.nan, 'C', 'A', np.nan, 'C'],
    'product': ['afiliates', 'pre-paid', 'giftcard', 'afiliates', 'pre-paid', 'giftcard', 'afiliates', 'pre-paid', 'giftcard'],
    'brand': ['brand_1', 'brand_2', 'brand_3', 'brand_1', 'brand_2', 'brand_3', 'brand_1', 'brand_2', 'brand_3'],
    'gmv': [100, 100, 100, 100, 100, 100, 100, 100, 100]})
data = data.astype({'partner':'category','status':'category','product':'category', 'brand':'category'})
When I execute a single .loc selection
test = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]
and this is the output
Now, just move my categorical columns to string (I am moving them back to string due to a bug related to groupby issues with categoricals, as described here)
data = data.astype({'partner':'string','status':'string','product':'string', 'brand':'string'})
And let's run the same loc command.
test2 = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]
but take a look at the output!
I am really lost as to why it does not work. I've figured out that it is something related to the categorical NaN values being converted back to strings, but I don't see why that would be a problem.
The problem is exactly the difference between the NaN value in the category dtype and in the string dtype:
With category, the missing value is a plain float NaN, which behaves normally in comparisons, so:
data.partner != 'A' is True for all the rows with NaN.
When converting to the string dtype, the missing value becomes pandas._libs.missing.NAType (pd.NA), which propagates through comparisons instead of evaluating to True or False, so now:
data.partner != 'A' returns <NA> for those rows, which is not True, and the result differs.
Basically, a NaN in a category dtype is not a category in itself, so it is handled differently. This is also why you can't use fillna on categoricals as-is; you first have to define a category value for it.
you can use something like this:
data['partner'] = data.partner.cat.add_categories('Not_available').fillna('Not_available')
to add a custom NA category and replace the missing values. Now if you convert to string and run the same conditions, you should get the same result.
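A minimal sketch of the comparison difference described above, on toy data rather than the original dataframe:

```python
import numpy as np
import pandas as pd

s_cat = pd.Series(['A', np.nan], dtype='category')
s_str = pd.Series(['A', np.nan], dtype='string')

# Category dtype: the missing value is a plain float NaN, so != compares normally
mask_cat = s_cat != 'A'
print(mask_cat.tolist())  # [False, True]

# String dtype: the missing value is pd.NA, which propagates through comparisons
mask_str = s_str != 'A'
print(pd.isna(mask_str[1]))  # True -- the comparison result is <NA>, not True
```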

ValueError while trying to check for a "W" in dataset

Dataset
I'm trying to check for a win from the WINorLOSS column, but I'm getting the following error:
Code and Error Message
The variable combined.WINorLOSS is a Series object; comparing it to a string gives a boolean Series, and using that result in an if statement raises the ambiguous-truth-value ValueError. I think you meant to do:
for i in combined.WINorLOSS:
    if i == 'W':
        hteamw += 1
    else:
        ateamw += 1
Comparing your whole WINorLOSS column (a Series) to a single string won't give you a single True/False. However, you can use the following to count the 'L' and 'W' values in your column:
hteamw = combined['WINorLOSS'].value_counts()['W']
hteaml = combined['WINorLOSS'].value_counts()['L']
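For instance, with a toy WINorLOSS column (made-up data, not the asker's dataset), both counts come straight out of value_counts:

```python
import pandas as pd

combined = pd.DataFrame({'WINorLOSS': ['W', 'L', 'W', 'W', 'L']})

counts = combined['WINorLOSS'].value_counts()
hteamw = counts['W']  # 3
hteaml = counts['L']  # 2
print(hteamw, hteaml)
```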

Are intermediate variables necessary for a pandas boolean index? Syntax error comes up when I don't use one

I'm learning how to use pandas and trying to boolean index so that the dataframe is made up of rows where the value of the 'sector' column is 'Technology' and the 'country' column is not 'USA'. It works fine when I use an intermediate variable, like in the following:
t_nu = (f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')
tech_outside_usa = f500[t_nu].head()
When I try to run without the intermediate variable, like this:
tech_outside_usa = f500[(f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')].head()
I get an invalid syntax error. Can anyone tell me what the difference is?
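For what it's worth, the two forms should be equivalent; here is a minimal sketch with made-up f500-style data (column names taken from the question, values invented), where both versions run and select the same rows:

```python
import pandas as pd

f500 = pd.DataFrame({
    'company': ['Apple', 'Sinopec', 'Samsung'],
    'sector': ['Technology', 'Energy', 'Technology'],
    'country': ['USA', 'China', 'South Korea'],
})

# With an intermediate variable
t_nu = (f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')
a = f500[t_nu]

# Inline, without the intermediate variable
b = f500[(f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')]

print(a.equals(b))  # True
```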

How to search for a specific feature name in a large csv dataframe?

I want to modify a large dataframe so that the remaining columns are features that contain only 2 unique values (e.g., True and False), with the exception of the feature class (which contains more than 2 unique values).
I want to remove irrelevant features to simplify/clean the data set. But I need to keep the feature class which is called 'pattern' as this will be needed for predictions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Threat_Prediction_Clean.csv')
print (df.nunique())
if df.nunique() < 3:
    dff = df[df.columns[df.nunique()<3]
elif df[df.columns == 'Pattern']:
    dff.append(df[df.columns == 'Pattern'])
Expected result:
To have a new dataframe (called 'dff') which contains features of only 2 unique data values AND the 'pattern' feature
Actual result:
File "<ipython-input-33-ccbaf00f5866>", line 29
elif df[df.columns == 'Pattern']:
^
SyntaxError: invalid syntax
A few quick comments:
To reference a specific column of a dataframe you use df["col_name"] or df.col_name. So instead of your last elif statement you can just append df["Pattern"]. The reason you get your error is because your elif statement never checks for a truth condition.
You are missing a closing bracket in your if statement. (See ForceBru's comment above.)
I don't understand what you are testing for in the if statement when you write df.nunique() < 3. From what you wrote, you want to preserve columns which have 2 unique values, but what you have tests the entire dataframe. Try something like:
for col in df.columns:
    if df[col].nunique() < 3:
        # append column
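Putting those pieces together, one possible sketch of the whole filter (toy data; the column names below are made up except 'Pattern'):

```python
import pandas as pd

df = pd.DataFrame({
    'flag_a': [True, False, True],
    'flag_b': [False, False, True],
    'score': [1, 2, 3],          # more than 2 unique values -> dropped
    'Pattern': ['x', 'y', 'z'],  # the class feature, kept regardless of unique count
})

# Keep columns with fewer than 3 unique values, plus the 'Pattern' feature
keep = [col for col in df.columns if df[col].nunique() < 3 or col == 'Pattern']
dff = df[keep]
print(list(dff.columns))  # ['flag_a', 'flag_b', 'Pattern']
```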

Pandas: get the most count label

My dataframe has a column that contains various type values, and I want to get the most frequent one:
In this case, I want to get the label FM-15, so later on I can query data labeled only by this.
How can I do that?
Now I can get away with:
most_count = df['type'].value_counts().max()
s = df['type'].value_counts()
s[s == most_count].index
This returns
Index([u'FM-15'], dtype='object')
But I feel this is too ugly, and I don't know how to use this Index() object to query df. I only know something like df = df[(df['type'] == 'FM-15')].
Use idxmax, which returns the index label of the maximum count:
lbl = df['type'].value_counts().idxmax()
To query (note the @ prefix, which references a Python variable inside a query string):
df.query("type == @lbl")
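A runnable end-to-end sketch with a small made-up 'type' column (idxmax returns the label of the maximum count, and @lbl references the Python variable inside the query string):

```python
import pandas as pd

df = pd.DataFrame({'type': ['FM-15', 'FM-15', 'FM-16', 'SAO', 'FM-15']})

lbl = df['type'].value_counts().idxmax()  # 'FM-15'
subset = df.query("type == @lbl")         # rows labeled with the most common type
print(lbl, len(subset))  # FM-15 3
```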
