Lambda mapping column to uppercase [duplicate] - python

I'm having trouble converting a column in my DataFrame to upper case.
The DataFrame is df.
1/2 ID is the column header whose values need to be upper-cased.
The problem is that the values are made up of three letters and three numbers. For example, rrr123 is one of the values.
df['1/2 ID'] = map(str.upper, df['1/2 ID'])
I got an error:
TypeError: descriptor 'upper' requires a 'str' object but received a 'unicode'
How can I apply upper case to the first three letters in the column of the DataFrame df?

If you are on a recent version of pandas, you can use the vectorised string method upper:
df['1/2 ID'] = df['1/2 ID'].str.upper()
This method does not work inplace, so the result must be assigned back.
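A minimal sketch of the vectorised approach. Only the column name comes from the question; the sample values are assumptions made up for illustration. Note that a missing value passes through as NaN instead of raising.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column name comes from the question.
df = pd.DataFrame({"1/2 ID": ["rrr123", "abc456", np.nan]})

# .str.upper() returns a new Series, so assign it back to the column.
df["1/2 ID"] = df["1/2 ID"].str.upper()

upper_ids = df["1/2 ID"].tolist()
```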

This should work (on Python 3, map returns an iterator, so wrap it in list()):
df['1/2 ID'] = list(map(lambda x: str(x).upper(), df['1/2 ID']))
And if you want all the column names in upper case:
df.columns = list(map(lambda x: str(x).upper(), df.columns))
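When every column label is already a string, the Index exposes the same .str accessor as a Series, which is one way to avoid map altogether. A short sketch with a made-up frame (the column names here are assumptions):

```python
import pandas as pd

# Hypothetical frame with lower-case string labels.
df = pd.DataFrame({"1/2 id": ["rrr123"], "name": ["x"]})

# Index.str.upper() returns a new Index, so assign it back to df.columns.
df.columns = df.columns.str.upper()

column_names = list(df.columns)
```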

str.upper() wants a plain old Python 2 string.
unicode.upper() wants a unicode object, not a str (otherwise you get TypeError: descriptor 'upper' requires a 'unicode' object but received a 'str').
So I'd suggest making use of duck typing and calling .upper() on each of your elements. Series.apply has no inplace option, so assign the result back, e.g.
df['1/2 ID'] = df['1/2 ID'].apply(lambda x: x.upper())

Related

Python pandas lower data AttributeError: 'Series' object has no attribute 'lower'

I want to lower-case the data taken from a pandas sheet, strip all spaces, and then test for equality.
df['ColumnA'].loc[lambda x: x.lower().replace(" ", "") == var_name]
The code is above.
It says a pandas Series has no lower method. But I need to search the data in column A with pandas while lower-casing all letters and removing whitespace.
Is there another way to achieve this in pandas?
In your lambda function, x is a Series, not a string, so you have to use the str accessor for both calls (Series.replace without .str only swaps whole values, not substrings):
df['ColumnA'].loc[lambda x: x.str.lower().str.replace(" ", "") == var_name]
Another way:
df.loc[df['ColumnA'].str.lower().str.replace(' ', '') == var_name, 'ColumnA']
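The normalise-then-compare idea can be sketched as follows. The sample values and the search key are assumptions for illustration; regex=False is passed so the space is treated as a literal string regardless of the pandas version's default.

```python
import pandas as pd

# Hypothetical data and search key (already lower-case, no spaces).
df = pd.DataFrame({"ColumnA": ["Foo Bar", "foobar", "Baz"]})
var_name = "foobar"

# Lower-case each value, then drop spaces inside each string.
normalized = df["ColumnA"].str.lower().str.replace(" ", "", regex=False)

# Select the original values whose normalised form equals the key.
matches = df.loc[normalized == var_name, "ColumnA"]
matched_values = matches.tolist()
```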

Apply function to two columns of a Pandas dataframe

I'm finding several answers to this question, but none that seem to address or solve the error that pops up when I apply them. Per e.g. this answer I have a dataframe df and a function my_func(string_1,string_2) and I'm attempting to create a new column with the following:
df['new_column'] = df.apply(lambda x: my_func(x['old_col_1'],x['old_col_2']),axis=1)
I'm getting an error originating inside my_func telling me that old_col_1 is type float and not a string as expected. In particular, the first line of my_func is old_col_1 = old_col_1.lower(), and the error is
AttributeError: 'float' object has no attribute 'lower'
By including debug statements using dataframe printouts I've verified old_col_1 and old_col_2 are indeed both strings. If I explicitly cast them to strings when passing as arguments, then my_func behaves as you would expect if it were being fed numeric data cast as strings, though the column values are decidedly not numeric.
Per this answer I've even explicitly ensured these columns are not being "intelligently" cast incorrectly when creating the dataframe:
df = pd.read_excel(file_name, sheetname,header=0,converters={'old_col_1':str,'old_col_2':str})
The function my_func works very well when it's called on its own. All this is making me suspect that the indices or some other numeric data from the dataframe is being passed, and not (exclusively) the column values.
Other implementations seem to give the same problem. For instance,
df['new_column'] = np.vectorize(my_func)(df['old_col_1'],df['old_col_2'])
produces the same error. Variations (e.g. using df['old_col_1'].to_numpy() or df['old_col_1'].values in place of df['old_col_1']) don't change this.
Is it possible that you have np.nan/None/null values in your columns? If so, you might be getting an error similar to the one this data produces:
import numpy as np
import pandas as pd
data = {
    'Column1': ['1', '2', np.nan, '3']
}
df = pd.DataFrame(data)
# np.nan is a float, so .lower() raises AttributeError on that row
df['Column1'] = df['Column1'].apply(lambda x: x.lower())
df
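One way to check this hypothesis is to tally the Python type of each cell before calling apply. The sketch below reuses the sample data from the answer; the variable names are illustrative only. It also shows that the .str accessor leaves the missing value as NaN rather than raising.

```python
from collections import Counter

import numpy as np
import pandas as pd

df = pd.DataFrame({"Column1": ["1", "2", np.nan, "3"]})

# Tally the Python type of each cell to see what apply() will receive.
type_counts = Counter(type(v).__name__ for v in df["Column1"])

# Calling a str method on the float NaN is what raises the AttributeError.
try:
    df["Column1"].apply(lambda x: x.lower())
    raised = False
except AttributeError:
    raised = True

# The .str accessor skips missing values and leaves them as NaN.
lowered = df["Column1"].str.lower()
```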

Pandas column dtype is object but python thinks it is float

I read in a csv like this
df = pd.read_csv(self.file_path, dtype=str)
then I try this:
df = df[df["MY_COLUMN"].apply(lambda x: x.isnumeric())]
I get an AttributeError:
AttributeError: 'float' object has no attribute 'isnumeric'
Why is this happening? The column contains mostly digits.
I want to filter out the ones where there are no digits.
This question is not how to achieve that or do it better but why do I get an AttributeError here?
Why is this happening?
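I think it is because NaN is not converted to a string when you use dtype=str; it is still a missing value, so its type is float.
Use Series.str.isnumeric, which, like all the text functions in pandas, works with missing values. Missing values can come back as NaN in the result, so fill them with False before filtering:
df[df["MY_COLUMN"].str.isnumeric().fillna(False)]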
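A small sketch of how missing values behave here, using made-up data and an explicit object dtype. Depending on the pandas version and column dtype, the result of .str.isnumeric() may hold NaN for missing cells, which boolean indexing can reject; filling with False is one way to get a usable mask.

```python
import numpy as np
import pandas as pd

# Hypothetical column of strings with one missing value.
df = pd.DataFrame({"MY_COLUMN": ["12", "ab", np.nan, "7"]}, dtype=object)

# True/False for strings; the missing cell is not a string and may stay NaN.
is_num = df["MY_COLUMN"].str.isnumeric()

# Treat missing cells as "not numeric" so the mask is purely boolean.
mask = is_num.fillna(False)

numeric_rows = df[mask]
kept = numeric_rows["MY_COLUMN"].tolist()
```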

Create a new column which is cast to a string in pandas

What would be the proper way to assign a stringified column to a dataframe? I would like to keep the original, so I don't want to use .astype({'deliveries': 'str'}). So far I have:
df = ( df.groupby('path')
.agg(agg_dict)
.assign(deliveries_str=df['deliveries'].str ??)
)
What would be the proper way to do this?
I also tried the following but I get an unhashable type error:
.assign(deliveries_str=lambda x: x.deliveries.str)
TypeError: unhashable type: 'list'
.str is an accessor for string methods, not a conversion, so use astype(str) instead:
.assign(deliveries_str=lambda x: x.deliveries.astype(str))
To keep missing values as NaN rather than the string 'nan', add a mask:
.assign(deliveries_str=lambda x: x['deliveries'].astype(str).mask(x['deliveries'].isnull()))
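A self-contained sketch of the masked variant, without the groupby step and with made-up numeric data (both are assumptions; the original column contents are not shown in the question). The original column is left untouched and the missing row stays missing in the new one.

```python
import numpy as np
import pandas as pd

# Hypothetical data; the NaN makes the column float.
df = pd.DataFrame({"deliveries": [3, np.nan, 10]})

# Add a string copy of the column, masking rows that were missing originally.
out = df.assign(
    deliveries_str=lambda x: x["deliveries"].astype(str).mask(x["deliveries"].isnull())
)

str_values = out["deliveries_str"].tolist()
```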

What is the right way to substitute column values in dataframe?

I want the following to happen:
for every column in df, check whether its type is numeric; if not, use a label encoder to map str/obj values to numeric classes (e.g. 0, 1, 2, 3...).
I am trying to do it in the following way:
for col in df:
    if not np.issubdtype(df[col].dtype, np.number):
        df[col] = LabelEncoder().fit_transform(df[col])
I see a few problems here.
First - column names can repeat and thus df[col] returns more than one column, which is not what I want.
Second - df[col].dtype throws error:
AttributeError: 'DataFrame' object has no attribute 'dtype'
which I assume arises from issue #1, i.e. multiple columns are returned. But I am not sure.
Third - would assigning df[col] = LabelEncoder().fit_transform(df[col]) lead to a column substitution in df or should I do some esoteric df partitioning and concatenation?
Thank you
Since LabelEncoder supports only one column at a time, iteration over columns is your only option. You can make this a little more concise using select_dtypes to select the columns, and then df.apply to apply the LabelEncoder to each column.
cols = df.select_dtypes(exclude=[np.number]).columns
df[cols] = df[cols].apply(lambda x: LabelEncoder().fit_transform(x))
Alternatively, you could build a mask by selecting object dtypes only (a little more flaky but easily extensible):
m = df.dtypes == object
# m = [not np.issubdtype(d, np.number) for d in df.dtypes]
df.loc[:, m] = df.loc[:, m].apply(lambda x: LabelEncoder().fit_transform(x))
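For readers without scikit-learn, a pandas-only alternative (a swapped-in technique, not the LabelEncoder from the answer) is to use category codes, which are assigned in sorted order of the distinct values. The sketch below uses made-up data and also checks for repeated column names first, since df[col] returns a DataFrame when a name is duplicated.

```python
import pandas as pd

# Hypothetical frame with one text column and one numeric column.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": [1, 2, 3]})

# Duplicate names would make df[col] a DataFrame, so check up front.
has_duplicate_names = bool(df.columns.duplicated().any())

# Select every non-numeric column by dtype.
cols = df.select_dtypes(exclude="number").columns

# Replace each selected column with its integer category codes.
encoded = df.copy()
for col in cols:
    encoded[col] = encoded[col].astype("category").cat.codes
```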
