I've been facing a very weird behavior, or should I say a bug, when handling categorical vs. string dtypes. Take a look at this simple example dataframe:
import pandas as pd
import numpy as np
data = pd.DataFrame({
'status' : ['pending', 'pending','pending', 'canceled','canceled','canceled', 'confirmed', 'confirmed','confirmed'],
'partner' : ['A', np.nan,'C', 'A',np.nan,'C', 'A', np.nan,'C'],
'product' : ['afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard','afiliates', 'pre-paid', 'giftcard'],
'brand' : ['brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3','brand_1', 'brand_2', 'brand_3'],
'gmv' : [100,100,100,100,100,100,100,100,100]})
data = data.astype({'partner':'category','status':'category','product':'category', 'brand':'category'})
When I execute a single .loc selection
test = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]
and this is the output
Now, just convert my categorical columns to string (I am moving them back to string due to a bug related to groupby issues regarding categoricals, as described here)
data = data.astype({'partner':'string','status':'string','product':'string', 'brand':'string'})
And let's run the same .loc command.
test2 = data.loc[(data.partner !='A') | ((data.brand == 'A') & (data.status == 'confirmed'))]
but take a look at the output!
I am really lost as to why it does not work. I've figured out that it is something related to the categorical NaN being converted back to string, but I don't see why that would be a problem.
The problem is exactly the difference between the NaN value in the category dtype and in the string dtype:
With category, the missing value is a plain float NaN (a typical NaN), and it participates in comparisons, so:
data.partner != 'A' will be True for all the rows with NaN.
When converting to string, the missing value becomes pandas._libs.missing.NAType (pd.NA), which propagates through comparisons instead of evaluating to a boolean, so now:
data.partner != 'A' returns <NA> for those rows, which is not True, and the result differs.
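The difference is easy to reproduce on a minimal stand-in for the partner column (a sketch, not the full example dataframe):

```python
import pandas as pd
import numpy as np

# Minimal stand-in for the 'partner' column from the question
s = pd.Series(['A', np.nan, 'C'], dtype='category')

mask_cat = s != 'A'        # the NaN row compares as True here
print(mask_cat.tolist())   # [False, True, True]

s_str = s.astype('string')
mask_str = s_str != 'A'    # the NaN row becomes <NA>, which propagates
print(s_str[mask_str].tolist())  # ['C'] -- the <NA> row is dropped when masking
```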
Basically, NaN in the category dtype is not a category in itself, so it is handled differently. This is also why you can't use fillna on categories as-is: you have to define a category value for it first.
you can use something like this:
data['partner'] = data.partner.cat.add_categories('Not_available').fillna('Not_available')
to add a custom NA category and replace the missing values. Now if you convert to string and run the same condition, you should get the same result.
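A minimal sketch of that fix, using only a made-up partner column based on the question's data:

```python
import pandas as pd
import numpy as np

# Sketch on a minimal 'partner' column (names taken from the question)
data = pd.DataFrame({'partner': pd.Categorical(['A', np.nan, 'C'])})
data['partner'] = data.partner.cat.add_categories('Not_available').fillna('Not_available')
data = data.astype({'partner': 'string'})

# The formerly-missing row now survives the != comparison,
# matching the categorical behavior
test = data.loc[data.partner != 'A']
print(test.partner.tolist())  # ['Not_available', 'C']
```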
Dataset
I'm trying to check for a win from the WINorLOSS column, but I'm getting the following error:
Code and Error Message
The variable combined.WINorLOSS is a Series. Comparing a Series to a single string produces a boolean Series, not a single True/False, so it can't be used directly as an if condition (its truth value is ambiguous). I think you meant to iterate over the values:
for i in combined.WINorLOSS:
    if i == 'W':
        hteamw += 1
    else:
        ateamw += 1
You can't compare a whole Series of values (like your WINorLOSS column) to a single string inside an if statement. However, you can use the following to count the 'W' and 'L' values in your column:
hteamw = combined['WINorLOSS'].value_counts()['W']
hteaml = combined['WINorLOSS'].value_counts()['L']
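If one of the labels might never occur in the column, value_counts()['W'] would raise a KeyError; a hedged variant using .get (combined here is a hypothetical stand-in for the asker's dataframe):

```python
import pandas as pd

# Hypothetical stand-in for the 'combined' dataframe
combined = pd.DataFrame({'WINorLOSS': ['W', 'L', 'W', 'W']})

counts = combined['WINorLOSS'].value_counts()
hteamw = counts.get('W', 0)  # .get returns a default instead of raising KeyError
hteaml = counts.get('L', 0)
print(hteamw, hteaml)  # 3 1
```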
I'm learning how to use pandas and trying to boolean index so that the dataframe is made up of rows where the value of the 'sector' column is 'Technology' and the 'country' column is not 'USA'. It works fine when I use an intermediate variable, like in the following:
t_nu = (f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')
tech_outside_usa = f500[t_nu].head()
When I try to run without the intermediate variable, like this:
tech_outside_usa = f500[(f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')].head()
I get an invalid syntax error. Can anyone tell me what the difference is?
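For reference, the two forms are the same expression and should behave identically; a sketch with toy data (f500 and its column names are assumed from the question):

```python
import pandas as pd

# Toy stand-in for f500 with the column names from the question
f500 = pd.DataFrame({
    'company': ['Alpha', 'Beta', 'Gamma', 'Delta'],
    'sector': ['Technology', 'Technology', 'Energy', 'Technology'],
    'country': ['USA', 'China', 'USA', 'Japan'],
})

# With the intermediate variable
t_nu = (f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')
a = f500[t_nu].head()

# Inlined -- the same expression, bracket for bracket
b = f500[(f500['sector'] == 'Technology') & ~(f500['country'] == 'USA')].head()
print(a.equals(b))  # True
```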
I want to modify a large dataframe so that the remaining columns are features that contain only 2 unique values (e.g., True and False), with the exception of the feature class (which contains more than 2 unique values).
I want to remove irrelevant features to simplify/clean the data set. But I need to keep the feature class which is called 'pattern' as this will be needed for predictions.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv('Threat_Prediction_Clean.csv')
print (df.nunique())
if df.nunique() < 3:
    dff = df[df.columns[df.nunique()<3]
elif df[df.columns == 'Pattern']:
    dff.append(df[df.columns == 'Pattern'])
Expected result:
To have a new dataframe (called 'dff') which contains features of only 2 unique data values AND the 'pattern' feature
Actual result:
File "<ipython-input-33-ccbaf00f5866>", line 29
elif df[df.columns == 'Pattern']:
^
SyntaxError: invalid syntax
A few quick comments:
To reference a specific column of a dataframe you use df["col_name"] or df.col_name. So instead of your last elif statement you can just append df["Pattern"]. Note also that your elif condition is a whole dataframe, not a boolean, so it never evaluates to a clean truth condition.
You are missing a closing bracket in your if statement. (See ForceBru's comment above.)
I don't understand what you are testing for in the if statement when you write df.nunique() < 3. From what you wrote, you want to preserve columns which have 2 unique values, but what you have tests the entire dataframe at once. Try something like:
for col in df.columns:
    if df[col].nunique() < 3:
        # Append column
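Putting the loop idea together, one possible sketch (the toy frame, its columns, and the 'Pattern' name are assumptions based on the question):

```python
import pandas as pd

# Toy frame: one binary feature, one 3-valued feature, plus the 'Pattern' class
df = pd.DataFrame({
    'f1': [True, False, True],   # 2 unique values -> kept
    'f2': ['x', 'y', 'z'],       # 3 unique values -> dropped
    'Pattern': ['a', 'b', 'c'],  # class column, kept regardless
})

# Keep columns with fewer than 3 unique values, plus the class column
keep = [c for c in df.columns if df[c].nunique() < 3 or c == 'Pattern']
dff = df[keep]
print(dff.columns.tolist())  # ['f1', 'Pattern']
```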
My dataframe has a column contains various type values, I want to get the most counted one:
In this case, I want to get the label FM-15, so later on I can query data only labeled by this.
How can I do that?
Now I can get away with:
most_count = df['type'].value_counts().max()
s = df['type'].value_counts()
s[s == most_count].index
This returns
Index([u'FM-15'], dtype='object')
But I feel this is too ugly, and I don't know how to use this Index() object to query df. I only know something like df = df[(df['type'] == 'FM-15')].
Use idxmax (argmax is deprecated for this use and on current pandas returns the position rather than the label):
lbl = df['type'].value_counts().idxmax()
To query, reference the local variable with @:
df.query("type == @lbl")
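A self-contained sketch of the approach (the type labels are made up for illustration); note that idxmax returns the label of the most frequent value, and query references local variables with the @ prefix:

```python
import pandas as pd

# Toy 'type' column with made-up labels
df = pd.DataFrame({'type': ['FM-15', 'FM-15', 'FM-16', 'FM-15', 'SAOD']})

lbl = df['type'].value_counts().idxmax()  # label of the most frequent value
subset = df.query("type == @lbl")         # @ pulls in the local variable
print(lbl, len(subset))  # FM-15 3
```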