Rename Columns in Pandas Using Lambda Function Rather Than a Function - python

I'm trying to rename column headings in my dataframe in pandas using .rename().
Basically, the headings are :
column 1: "Country name[9]"
column 2: "Official state name[5]"
#etc.
I need to remove [number].
I can do that with a function:
def column(string):
    for x, v in enumerate(string):
        if v == '[':
            return string[:x]
But I wanted to know how to convert this to a lambda function so that I can use
df.rename(columns = lambda x: do same as function)
I've never used lambda functions before so I'm not sure of the syntax to get it to work correctly.

First you need a function that always returns a name, either the new one or the unchanged old one, and never None.
def column(name):
    if '[' in name:
        return name[:name.index('[')]  # new - with change
    else:
        return name  # old - without change
and then you can use it as
df.rename(columns=lambda name: column(name))
or even simpler
df.rename(columns=column)
Or you can convert your function to real lambda
df.rename(columns=(lambda name: name[:name.index('[')] if '[' in name else name) )
but sometimes it is more readable to keep def column(name) and use columns=column. Also, not every construction can be used in a lambda, e.g. you can't use an assignment statement to set a variable (since Python 3.8 the := ("walrus") operator is allowed inside a lambda if it is parenthesized, but that rarely helps readability).
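As a small aside on the walrus remark, here is a sketch, assuming Python 3.8 or later, of a lambda that uses a parenthesized := so the string is searched only once. The name strip_bracket is made up for this illustration and is not part of the question or answer.

```python
# Requires Python 3.8+. Inside a lambda the assignment expression must be parenthesized.
# str.find returns -1 when '[' is absent, so the original name is returned unchanged.
strip_bracket = lambda name: name[:i] if (i := name.find('[')) != -1 else name

print(strip_bracket('Country name[9]'))
print(strip_bracket('Other'))
```

It can be passed as df.rename(columns=strip_bracket), although the plain def version is arguably easier to read.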
Minimal working code
import pandas as pd
data = {
    'Country name[9]': [1,2,3],
    'Official state name[5]': [4,5,6],
    'Other': [7,8,9],
}
df = pd.DataFrame(data)
def column(name):
    if '[' in name:
        return name[:name.index('[')]
    else:
        return name
print(df.columns)
df = df.rename(columns=column)
# or
df = df.rename(columns=(lambda name: name[:name.index('[')] if '[' in name else name) )
print(df.columns)
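An alternative that avoids a per-name function altogether is the vectorized string accessor on the column index. This is only a sketch; the regular expression assumes the marker is always digits in square brackets at the very end of the name.

```python
import pandas as pd

df = pd.DataFrame({
    'Country name[9]': [1, 2, 3],
    'Official state name[5]': [4, 5, 6],
    'Other': [7, 8, 9],
})

# Remove a trailing "[digits]" marker from every column name in one step.
# Names without such a marker are left as they are.
df.columns = df.columns.str.replace(r'\[\d+\]$', '', regex=True)
print(list(df.columns))
```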

Related

Replace a value found in a Df with a value found in a Dictionary

I am having troubles writing an operation that replaces values found in a df with values that are found in a dictionary. Here's what I have:
import pandas as pd
d=[['English','Alabama','bob','smith']]
df=pd.DataFrame(d,columns=["lang",'state','first','last'])
filename='temp'
error_dict={}
error_dict[filename]={}
replace_dict={"state":{"Alabama": "AL", "Alaska": "AK"},"lang":{"English":"ENG", "French":"FR"}}
def replace(df, error_dict,colname):
    # replace the values found in the df with the values in replace_dict
    return df, error_dict
parsing_map={
    "lang": [replace],
    "state":[replace]
}
for i in parsing_map.keys():
    for j in parsing_map[i]:
        df, error_dict = j(df, error_dict, i)
I am trying to run some operation in the 'replace' function that will replace the values found in df columns 'lang' and 'state' with their abbreviated value from the dictionary 'replace_dict'. It makes it a bit more confusing since, instead of trying to reference the column name, you just use colname.
I have tried to do something like:
def replace(df, error_dict,colname):
    for colname, value in replace_dict.items():
        df[colname] = df[colname].replace(value)
    return df, error_dict
This only replaces the value in the state column, not in both state and lang. I want it inside a function; I know that's not necessary for this exact application, but the end result is nicer.
The desired output is:
d = [['ENG','AL','bob','smith']]
df = pd.DataFrame(d,columns=["lang",'state','first','last'])
How would I create an operation to replace the values in the df with the values in the dictionary using the replace function and 'colname' to reference the name of the column aka "lang" and "state"?
Loop through the dict and replace each column one at a time:
for colname, value in replace_dict.items():
    df[colname] = df[colname].map(value)
EDIT full answer:
import pandas as pd
d=[['English','Alabama','bob','smith']]
df=pd.DataFrame(d,columns=["lang",'state','first','last'])
replace_dict={"state":{"Alabama": "AL", "Alaska": "AK"},"lang":{"English":"ENG", "French":"FR"}}
for colname, value in replace_dict.items():
    df[colname] = df[colname].map(value)
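If the replacement has to live inside the replace function and be driven by colname, as the question asks, one possible shape is sketched below. It uses Series.replace rather than Series.map, because map returns NaN for values that have no entry in the dictionary while replace leaves them untouched. The second data row is made up here to show that difference.

```python
import pandas as pd

replace_dict = {
    "state": {"Alabama": "AL", "Alaska": "AK"},
    "lang": {"English": "ENG", "French": "FR"},
}

def replace(df, error_dict, colname):
    # Only touch the one column named by colname; skip columns without a mapping.
    if colname in replace_dict:
        df[colname] = df[colname].replace(replace_dict[colname])
    return df, error_dict

# Second row is hypothetical: its values have no entry in replace_dict.
d = [['English', 'Alabama', 'bob', 'smith'],
     ['German', 'Texas', 'ann', 'lee']]
df = pd.DataFrame(d, columns=["lang", 'state', 'first', 'last'])
error_dict = {'temp': {}}

parsing_map = {"lang": [replace], "state": [replace]}
for colname, funcs in parsing_map.items():
    for func in funcs:
        df, error_dict = func(df, error_dict, colname)

print(df)
```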

Creating a python function to change sequence of columns

I am able to change the order of columns using the code below, which I found on Stack Overflow. Now I am trying to convert it into a function for regular use, but it doesn't seem to do anything. PyCharm says the local variable df_name value is not used in the last line of my function.
Working Code
columnsPosition = list(df.columns)
F, H = columnsPosition.index('F'), columnsPosition.index('H')
columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
df = df[columnsPosition]
My function (doesn't work, need to make this work)
def change_col_seq(df_name, old_col_position, new_col_position):
    columnsPosition = list(df_name.columns)
    F, H = columnsPosition.index(old_col_position), columnsPosition.index(new_col_position)
    columnsPosition[F], columnsPosition[H] = columnsPosition[H], columnsPosition[F]
    df_name = df_name[columnsPosition] # pycharm has issue on this line
I have tried adding return on last statement of function but I am unable to make it work.
To re-order the Columns
To change the position of 2 columns:
def change_col_seq(df_name:pd.DataFrame, old_col_position:str, new_col_position:str):
    df_name[new_col_position], df_name[old_col_position] = df_name[old_col_position].copy(), df_name[new_col_position].copy()
    df = df_name.rename(columns={old_col_position:new_col_position, new_col_position:old_col_position})
    return df
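The function above swaps the data of the two columns inside the frame that was passed in before renaming, so the caller's original frame is modified as a side effect. A sketch that stays closer to the asker's working code and leaves the input untouched could look like this; the only real change from the question is returning the reordered frame, and the sample data is made up.

```python
import pandas as pd

def change_col_seq(df, col_a, col_b):
    """Return a new DataFrame with the positions of col_a and col_b swapped."""
    order = list(df.columns)
    a, b = order.index(col_a), order.index(col_b)
    order[a], order[b] = order[b], order[a]
    # Selecting with a list of names builds a new frame; the input is not changed.
    return df[order]

df = pd.DataFrame({'E': [1], 'F': [2], 'G': [3], 'H': [4]})
swapped = change_col_seq(df, 'F', 'H')
print(list(swapped.columns))
```

The caller has to keep the result, for example df = change_col_seq(df, 'F', 'H'), because assigning to the parameter name inside the function never reaches the caller.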
To Rename the Columns
You can use the rename method (Documentation)
If you want to change the name of just one column:
def change_col_name(df_name, old_col_name:str, new_col_name:str):
    df = df_name.rename(columns={old_col_name: new_col_name})
    return df
If you want to change the names of multiple columns:
def change_col_name(df_name, old_col_name:list, new_col_name:list):
    df = df_name.rename(columns=dict(zip(old_col_name, new_col_name)))
    return df

How can I modify a Pandas DataFrame by reference?

I'm trying to write a Python function that does One-Hot encoding in-place but I'm having trouble finding a way to do a concat operation in-place at the end. It appears to make a copy of my DataFrame for the concat output and I am unable to assign this to my DataFrame that I passed by reference.
How can this be done?
def one_hot_encode(df, col: str):
    """One-Hot encode inplace. Includes NAN.
    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    df.drop(col, axis=1, inplace=True)
    df[:] = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1) # Doesn't take effect outside function
I don't think you can pass function arguments by reference in python (see: How do I pass a variable by reference? )
Instead what you can do is just return the modified df from your function, and assign result to the original df:
def one_hot_encode(df, col: str):
    ...
    return df
...
df=one_hot_encode(df, col)
To make the change take effect outside the function, we have to change the object that was passed in rather than rebind its name (inside the function) to a new object.
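The difference between rebinding the name and changing the object can be seen in a tiny sketch. The function names rebind and mutate and the sample frame are made up for this illustration.

```python
import pandas as pd

def rebind(df):
    # assign() builds a new DataFrame; binding it to the local name
    # has no effect on the object the caller holds.
    df = df.assign(b=0)

def mutate(df):
    # Column assignment changes the object that was passed in,
    # so the caller sees the new column.
    df['b'] = 0

frame = pd.DataFrame({'a': [1, 2]})

rebind(frame)
after_rebind = list(frame.columns)

mutate(frame)
after_mutate = list(frame.columns)

print(after_rebind, after_mutate)
```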
To assign the new columns, you can use
df[insert_data.columns] = insert_data
instead of the concat.
That doesn't take advantage of your careful insert order though.
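To retain your order, we can reindex the data frame. Note that reindex returns a new DataFrame rather than modifying df in place, so its result has to be assigned or returned to have any effect.
df.reindex(columns=cols)
where cols is the combined list of columns in order:
cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc:]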
Putting it all together,
import pandas as pd
def one_hot_encode(df, col: str):
    """One-Hot encode inplace. Includes NAN.
    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """
    cols = list(df.columns)
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    cols = cols[:insert_loc] + list(insert_data.columns) + cols[insert_loc:]
    df[insert_data.columns] = insert_data
    df.reindex(columns=cols)  # returns a new frame; the result is discarded, so df keeps its order
    df.drop(col, axis=1, inplace=True)
import seaborn
diamonds=seaborn.load_dataset("diamonds")
col="color"
one_hot_encode(diamonds, "color")
assert( "color" not in diamonds.columns )
assert( len([c for c in diamonds.columns if c.startswith("color")]) == 8 )
df.insert is inplace--but can only insert one column at a time. It might not be worth the reorder.
def one_hot_encode2(df, col: str):
    """One-Hot encode inplace. Includes NAN.
    Keyword arguments:
    df (DataFrame) -- the DataFrame object to modify
    col (str) -- the column name to encode
    """
    cols = list(df.columns)
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    for offset, newcol in enumerate(insert_data.columns):
        # insert one dummy column at a time, just before the original column
        df.insert(loc=insert_loc+offset, column=newcol, value=insert_data[newcol])
    df.drop(col, axis=1, inplace=True)
import seaborn
diamonds=seaborn.load_dataset("diamonds")
col="color"
one_hot_encode2(diamonds, "color")
assert( "color" not in diamonds.columns )
assert(len([c for c in diamonds.columns if c.startswith("color")]) == 8)
assert([(i) for i,c in enumerate(diamonds.columns) if c.startswith("color")][0] == 2)
The scope of a function's variables is limited to that function. Simply include a return statement at the end of the function to get your modified dataframe as output. Calling this function will now return your modified dataframe. Also, when assigning the new (dummy) columns, use df instead of df[:], as you are changing the dimensions of the original dataframe.
def one_hot_encode(df, col: str):
    insert_loc = df.columns.get_loc(col)
    insert_data = pd.get_dummies(df[col], prefix=col + '_', dummy_na=True)
    df.drop(col, axis=1, inplace=True)
    df = pd.concat([df.iloc[:, :insert_loc], insert_data, df.iloc[:, insert_loc:]], axis=1)
    return df
Now to see the modified dataframe, call the function and assign it to a new/existing dataframe as below
df=one_hot_encode(df,'<any column name>')
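Note that the version above still drops the column from the caller's frame in place before building the result. If that side effect is unwanted, a variant that never touches its input might look like the following sketch. The name one_hot_encode_copy and the three-column sample frame are made up for this illustration, and prefix=col is used instead of col + '_' to avoid a doubled underscore in the new names.

```python
import pandas as pd

def one_hot_encode_copy(df, col):
    """Return a new frame with `col` replaced, at the same position, by its dummy columns."""
    loc = df.columns.get_loc(col)
    dummies = pd.get_dummies(df[col], prefix=col, dummy_na=True)
    left = df.iloc[:, :loc]        # columns before `col`
    right = df.iloc[:, loc + 1:]   # columns after `col`, skipping `col` itself
    return pd.concat([left, dummies, right], axis=1)

df = pd.DataFrame({'a': [1, 2], 'color': ['D', 'E'], 'b': [3, 4]})
encoded = one_hot_encode_copy(df, 'color')
print(list(encoded.columns))
```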

Is there a better way to manipulate column names in a pandas dataframe?

I'm working with a large dataframe and need a way to dynamically rename column names.
Here's a slow method I'm working with:
# Create a sample dataframe
df = pd.DataFrame.from_records([
    {'Name':'Jay','Favorite Color (BLAH)':'Green'},
    {'Name':'Shay','Favorite Color (BLAH)':'Blue'},
    {'Name':'Ray','Favorite Color (BLAH)':'Yellow'},
])
# Current columns are: ['Name', 'Favorite Color (BLAH)']
# ------
# build two lambdas to clean the column names
f_clean = lambda x: x.split('(')[0] if ' (' in x else x
f_join = lambda x: '_'.join(x.split())
df.columns = df.columns.map(f_clean).map(f_join).str.lower()
# Columns are now: ['name', 'favorite_color']
Is there a better method for solving this?
You could define a clean function and just apply to all the columns using list comprehension.
def clean(name):
    name = name.split('(')[0] if ' (' in name else name
    name = '_'.join(name.split())
    return name.lower()  # lower-case to match the desired output
df.columns = [clean(col) for col in df.columns]
It's clear what's happening and not overly verbose.
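Another option is to chain the vectorized string methods of the column index. This is a sketch; the first regular expression assumes the part to drop is a single parenthesized suffix at the end of the name.

```python
import pandas as pd

df = pd.DataFrame.from_records([
    {'Name': 'Jay', 'Favorite Color (BLAH)': 'Green'},
    {'Name': 'Shay', 'Favorite Color (BLAH)': 'Blue'},
])

df.columns = (
    df.columns
    .str.replace(r'\s*\(.*\)$', '', regex=True)  # drop a trailing " (...)" suffix
    .str.replace(r'\s+', '_', regex=True)        # join the remaining words with underscores
    .str.lower()
)
print(list(df.columns))
```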

TypeError: string indices must be integers using pandas apply with lambda

I have a dataframe, one column is a URL, the other is a name. I'm simply trying to add a third column that takes the URL, and creates an HTML link.
The column newsSource has the Link name, and url has the URL. For each row in the dataframe, I want to create a column that has:
<a href="[url]">[newsSource name]</a>
Trying the below throws the error
File "C:\Users\AwesomeMan\Documents\Python\MISC\News Alerts\simple_news.py", line 254, in <module>
df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x[0]['newsSource']))
TypeError: string indices must be integers
df['sourceURL'] = df['url'].apply(lambda x: '<a href="{0}">{1}</a>'.format(x, x['source']))
But I've used x[colName] before? The below line works fine, it simply creates a column of the source's name:
df['newsSource'] = df['source'].apply(lambda x: x['name'])
Why suddenly ("suddenly" to me) is it saying I can't access the indices?
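pd.Series.apply has access only to a single series, i.e. the series on which you are calling the method. In other words, the function you supply, irrespective of whether it is named or an anonymous lambda, will only have access to the elements of that series, here df['url']. Each x is therefore a plain string, and indexing a string with 'source' raises the TypeError.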
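A quick way to see what the function actually receives in each case is to return the type of its argument. This sketch reuses the column names from the question with a made-up single row.

```python
import pandas as pd

df = pd.DataFrame({'source': ['BBC'], 'url': ['http://www.bbc.o.uk']})

# Series.apply: the function is called with each element (a plain str here).
element_types = df['url'].apply(lambda x: type(x).__name__)

# DataFrame.apply with axis=1: the function is called with each row as a Series,
# so the other columns can be looked up by label.
row_types = df.apply(lambda row: type(row).__name__, axis=1)

print(element_types.tolist())
print(row_types.tolist())
```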
To access multiple series by row, you need pd.DataFrame.apply along axis=1:
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])
df['sourceURL'] = df.apply(return_link, axis=1)
Note there is an overhead associated with passing an entire series in this way; pd.DataFrame.apply is just a thinly veiled, inefficient loop.
You may find a list comprehension more efficient:
df['sourceURL'] = ['<a href="{0}">{1}</a>'.format(i, j)
                   for i, j in zip(df['url'], df['source'])]
Here's a working demo:
df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])
def return_link(x):
    return '<a href="{0}">{1}</a>'.format(x['url'], x['source'])
df['sourceURL'] = df.apply(return_link, axis=1)
print(df)
   source                  url                              sourceURL
0     BBC  http://www.bbc.o.uk  <a href="http://www.bbc.o.uk">BBC</a>
With zip and old-school % string formatting:
df['sourceURL'] = ['<a href="%s">%s</a>' % (x, y) for x, y in zip(df['url'], df['source'])]
The same with an f-string:
df['sourceURL'] = [f'<a href="{x}">{y}</a>' for x, y in zip(df['url'], df['source'])]
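Because both columns hold strings, the link can also be built with plain string concatenation on whole columns, with no apply or explicit loop. This is a sketch that assumes the intended result is an HTML anchor of the form <a href="url">name</a>, which is inferred from the question's description.

```python
import pandas as pd

df = pd.DataFrame([['BBC', 'http://www.bbc.o.uk']],
                  columns=['source', 'url'])

# Adding a str to a string Series concatenates element by element.
df['sourceURL'] = '<a href="' + df['url'] + '">' + df['source'] + '</a>'
print(df.loc[0, 'sourceURL'])
```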
