Dataset
I'm trying to check for a win from the WINorLOSS column, but I'm getting the following error:
Code and Error Message
The variable combined.WINorLOSS seems to be a Series object, and you can't compare an iterable (like a list, dict, Series, etc.) with a single string value. I think you meant to do:
for i in combined.WINorLOSS:
    if i == 'W':
        hteamw += 1
    else:
        ateamw += 1
You can't compare a Series of values (like your WINorLOSS dataframe column) to a single string value. However, you can use the following to count the 'W' and 'L' values in your column:
hteamw = combined['WINorLOSS'].value_counts()['W']
hteaml = combined['WINorLOSS'].value_counts()['L']
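As a quick sanity check (using a small made-up DataFrame, since the original data isn't shown):

```python
import pandas as pd

# Hypothetical stand-in for the 'combined' DataFrame from the question
combined = pd.DataFrame({'WINorLOSS': ['W', 'L', 'W', 'W', 'L']})

counts = combined['WINorLOSS'].value_counts()  # Series: W -> 3, L -> 2
hteamw = counts['W']
hteaml = counts['L']
print(hteamw, hteaml)  # 3 2
```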
Related
I keep getting AttributeError: 'DataFrame' object has no attribute 'column' when I run the function on a column in a dataframe
def reform(column, dataframe):
    if dataframe.column.nunique() > 2 and dataframe.column.dtypes == object:
        enc.fit(dataframe[['column']])
        enc.categories_
        onehot = enc.transform(dataframe[[column]]).toarray()
        dataframe[enc.categories_] = onehot
    elif dataframe.column.nunique() == 2 and dataframe.column.dtypes == object:
        le.fit_transform(dataframe[['column']])
    else:
        print('Column cannot be reformed')
    return dataframe
Try changing
dataframe.column to dataframe.loc[:, column], and
dataframe[['column']] to dataframe.loc[:, [column]].
For more help, please provide more information, such as: what is enc (show your imports)? What does dataframe look like (show a small example, perhaps with dataframe.head(5))?
Details:
Since column is an input parameter (probably a string), you need to use it correctly when asking for that column from the dataframe object. If you just write dataframe.column, pandas will look for a column literally named 'column', but if you ask for it with dataframe.loc[:, column], it will use the string held by the parameter named column.
With dataframe.loc[:,column], you get a Pandas Series, and with dataframe.loc[:,[column]] you get a Pandas DataFrame.
The pandas attribute columns, used as dataframe.columns (note the 's' at the end), just returns the names of all columns in your dataframe (as an Index), which is probably not what you want here.
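The Series-vs-DataFrame distinction is easy to verify on a toy frame (column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})  # toy example
column = 'a'  # the name arrives as a string parameter, as in the question

s = df.loc[:, column]    # single label  -> pandas Series
f = df.loc[:, [column]]  # list of labels -> pandas DataFrame (one column)

print(type(s).__name__)  # Series
print(type(f).__name__)  # DataFrame
```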
TIPS:
Try to name input parameters so that you know what they are.
When developing a function, try setting the inputs to something static and iterate on the code until you get the desired output. E.g.
input_df = my_df
column_name = 'some_test_column'

if input_df.loc[:, column_name].nunique() > 2 and input_df.loc[:, column_name].dtypes == object:
    enc.fit(input_df.loc[:, [column_name]])
    onehot = enc.transform(input_df.loc[:, [column_name]]).toarray()
    input_df.loc[:, enc.categories_] = onehot
elif input_df.loc[:, column_name].nunique() == 2 and input_df.loc[:, column_name].dtypes == object:
    le.fit_transform(input_df.loc[:, [column_name]])
else:
    print('Column cannot be transformed')
Look up on how to use SciKit Learn Pipelines, with ColumnTransformer. It will help make the workflow easier (https://scikit-learn.org/stable/modules/compose.html).
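A minimal sketch of that ColumnTransformer approach, assuming scikit-learn is installed; the column names and data here are entirely hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Made-up frame: one multi-category column, one binary column, one numeric column
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red'],  # 3 categories -> one-hot
    'flag': ['yes', 'no', 'yes', 'yes'],       # 2 categories
    'amount': [1.0, 2.0, 3.0, 4.0],
})

ct = ColumnTransformer(
    [('onehot', OneHotEncoder(), ['color', 'flag'])],
    remainder='passthrough',  # keep 'amount' untouched
)
out = ct.fit_transform(df)
print(out.shape)  # (4, 6): 3 color columns + 2 flag columns + amount
```

This replaces the hand-rolled if/elif branching: the transformer decides per-column what to do, and the whole thing drops into a Pipeline.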
I have data frame which looks like:
Now I am comparing whether two columns (i.e. Complaint and Compliment) have equal values or not. I have written a function:
def col_comp(x):
    return x['Complaint'].isin(x['Compliment'])
When I apply this function to dataframe i.e.
df.apply(col_comp,axis=1)
I get an error message
AttributeError: ("'float' object has no attribute 'isin'", 'occurred at index 0')
Any suggestion where I am making the mistake?
isin requires an iterable. With apply(..., axis=1), col_comp receives one row at a time, so x['Complaint'] is an individual data point (a float), not a Series. Use == in col_comp instead of isin. Even better, you can compare the columns in one vectorized call:
df['Complaint'] == df['Compliment']
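For example, on a toy frame standing in for the one in the question (the data shown is made up):

```python
import pandas as pd

df = pd.DataFrame({'Complaint': [1.0, 2.0, 3.0],
                   'Compliment': [1.0, 5.0, 3.0]})

# Element-wise comparison of the two columns -> boolean Series
equal = df['Complaint'] == df['Compliment']
print(equal.tolist())  # [True, False, True]
```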
I have gone through all posts on the website and am not able to find a solution to my problem.
I have a dataframe with 15 columns. Some of them come with None or NaN values. I need help in writing the if-else condition.
If the column value in the dataframe is not None or NaN, I need to format the datetime column. The current code is below:
for index, row in df_with_job_name.iterrows():
    start_time = df_with_job_name.loc[index, 'startTime']
    if not df_with_job_name.isna(df_with_job_name.loc[index, 'startTime']):
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
The error that I am getting is
if not df_with_job_name.isna(df_with_job_name.loc[index,'startTime']) :
TypeError: isna() takes exactly 1 argument (2 given)
A direct way to take care of missing/invalid values is probably:
def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True
and of course you'll have to import math.
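For example (the function is repeated here so the snippet is self-contained; the sample values are made up):

```python
import math

def is_valid(val):
    if val is None:
        return False
    try:
        return not math.isnan(val)
    except TypeError:
        return True  # non-numeric values (e.g. strings) are not NaN

print(is_valid(None))          # False
print(is_valid(float('nan')))  # False
print(is_valid(3.5))           # True
print(is_valid('2021-01-01'))  # True: isnan raises TypeError on strings
```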
Also, note that isna is not invoked with any argument: it returns a dataframe of Boolean values (see the docs). You can iterate through both dataframes to determine whether a value is valid.
isna takes your entire data frame as the instance argument (that's self, if you're already familiar with classes) and returns a data frame of Boolean values, True where that value is invalid. You tried to specify the individual value you're checking as a second input argument. isna doesn't work that way; it takes empty parentheses in the call.
You have a couple of options. One is to follow the individual checking tactics here. The other is to make the map of the entire data frame and use that:
null_map_df = df_with_job_name.isna()
for index, row in df_with_job_name.iterrows():
    if not null_map_df.loc[index, 'startTime']:
        start_time = df_with_job_name.loc[index, 'startTime']
        start_time_formatted = datetime(*map(int, re.split('[^\d]', start_time)[:-1]))
Note that the Boolean map is indexed with the column label ('startTime'), not with row, which is a whole Series. Also, you should be able to apply an any operation to an entire row at once.
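A vectorized sketch of the same idea (the column name 'startTime' comes from the question; the toy data and the pd.to_datetime parsing are illustrative assumptions):

```python
import pandas as pd

df = pd.DataFrame({'startTime': ['2021-01-02 03:04:05',
                                 None,
                                 '2021-06-07 08:09:10']})

mask = df['startTime'].notna()  # True where the value is usable
# Parse only the valid rows; None/NaN never reach the parser
valid_times = pd.to_datetime(df.loc[mask, 'startTime'])
print(len(valid_times))  # 2
```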
Having found the maximum value in a pandas data frame column, I am just trying to get the equivalent row name as a string.
Here's my code:
df[df['ColumnName'] == df['ColumnName'].max()].index
Which returns me an answer:
Index(['RowName'], dtype='object')
How do I just get RowName back?
(stretch question - why does .idmax() fry in the formulation df['Colname'].idmax? And, yes, I have tried it as .idmax() and also appended it to df.loc[:,'ColName'] etc.)
Just use integer indexing:
df[df['ColumnName'] == df['ColumnName'].max()].index[0]
Here [0] extracts the first element. Note that your criterion may match multiple rows, in which case the index will contain several labels.
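As for the stretch question: the method is spelled idxmax, not idmax, which is why the attribute lookup fails. It returns the index label of the first maximum directly (row names below are made up):

```python
import pandas as pd

df = pd.DataFrame({'ColumnName': [1, 5, 3]},
                  index=['RowA', 'RowName', 'RowC'])

print(df['ColumnName'].idxmax())  # 'RowName'

# Equivalent to the boolean-mask approach, which can return several labels:
print(df[df['ColumnName'] == df['ColumnName'].max()].index[0])  # 'RowName'
```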
I always get this error:
AnalysisException: u"cannot resolve 'substring(l,1,-1)' due to data type mismatch: argument 1 requires (string or binary) type, however, 'l' is of array type.;"
Quite confused, because l[0] is a string and should match argument 1.
The dataframe has only one column, named 'value', which holds a comma-separated string.
And I want to convert this original dataframe to another dataframe of object LabeledPoint, with the first element to be 'label' and the others to be 'features'.
from pyspark.mllib.regression import LabeledPoint

def parse_points(dataframe):
    df1 = df.select(split(dataframe.value, ',').alias('l'))
    u_label_point = udf(LabeledPoint)
    df2 = df1.select(u_label_point(col('l')[0], col('l')[1:-1]))
    return df2

parsed_points_df = parse_points(raw_data_df)
I think you want to create a LabeledPoint for each row of the dataframe. You can do:
from pyspark.sql.functions import split
from pyspark.mllib.regression import LabeledPoint

def parse_points(df):
    df1 = df.select(split(df.value, ',').alias('l'))
    # map applies the lambda to each Row; seq[0] is the split array.
    # In Spark 2.x+ map lives on the underlying RDD, hence .rdd
    df2 = df1.rdd.map(lambda seq: LabeledPoint(float(seq[0][0]), seq[0][1:]))
    return df2.toDF()  # convert the resulting RDD back to a dataframe

parsed_points_df = parse_points(raw_data_df)