Creating function to rename columns in pandas dataframe - python

I have dataframe as below:
df = pd.DataFrame({'$a':[1,2], '$b': [10,20]})
I tried creating a function that lets me change a column name dynamically, where I just pass the old column name and the new column name, as below:
def rename_column_name(df, old_column, new_column):
    df = df.rename({'{}'.format(old_column) : '{}'.format(new_column)}, axis=1)
    return df
This function only works if I pass a single column name, as below:
new_df = rename_column_name(df, '$a' , 'a')
which gives me this new_df:
new_df = pd.DataFrame({'a':[1,2], '$b': [10,20]})
However, I want a function that lets me rename one or several columns, depending on what I pass in, like this:
new_df = rename_column_name(df, ['$a','$b'] , ['a','b'])
and get this new_df:
new_df = pd.DataFrame({'a':[1,2], 'b': [10,20]})
So, how do I make my function flexible enough to accept one or several column names and rename them?

You don't need a function; you can do this by building a dict with zip:
In [265]: old_names = df.columns.tolist()
In [266]: new_names = ['a','b']
In [268]: df = df.rename(columns=dict(zip(old_names, new_names)))
In [269]: df
Out[269]:
a b
0 1 10
1 2 20
Function that OP needs:
In [274]: def rename_column_name(df, old_column_list, new_column_list):
     ...:     df = df.rename(columns=dict(zip(old_column_list, new_column_list)))
     ...:     return df
     ...:
In [275]: rename_column_name(df,old_names,new_names)
Out[275]:
a b
0 1 10
1 2 20
You need to pass a list of columns to this function. It can be multiple columns or a single column. This should do what you were looking for.

def rename_column_name(df, old_column, new_column):
    # wrap single names in a list so one or many columns work the same way
    if not isinstance(old_column, (list, tuple)):
        old_column = [old_column]
    if not isinstance(new_column, (list, tuple)):
        new_column = [new_column]
    df = df.rename({'{}'.format(old) : '{}'.format(new) for old, new in zip(old_column, new_column)}, axis=1)
    return df # dang, I should have used dict(zip(...)) like in the other solution :P
I guess ... although I don't understand how this is easier than just calling
df.rename(columns={'$a':'a','$b':'b'})

You can do that with the zip function, where
old_column_names and new_column_names should be lists.
def rename_column_name(df, old_column_names, new_column_names):
    # validate that a new name has been passed for every old name
    if len(old_column_names) == len(new_column_names):
        df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    return df
To handle both a single-column rename and lists of names, the function needs further conditions, for example:
def rename_column_name(df, old_column_names, new_column_names):
    # lists: validate that a new name has been passed for every old name
    if isinstance(old_column_names, list) and isinstance(new_column_names, list):
        if len(old_column_names) == len(new_column_names):
            df = df.rename(columns=dict(zip(old_column_names, new_column_names)))
    # strings: rename a single column
    elif isinstance(old_column_names, str) and isinstance(new_column_names, str):
        df = df.rename(columns={'{}'.format(old_column_names) : '{}'.format(new_column_names)})
    return df
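Pulling the answers above together, here is a minimal sketch of one possible version. It assumes you want an error on a length mismatch rather than a silently unchanged frame, which is my choice and not something the question asks for. It uses only DataFrame.rename, dict and zip.

```python
import pandas as pd

def rename_column_name(df, old_names, new_names):
    """Return a copy of df with old_names renamed to new_names.

    Both arguments may be a single string or a list/tuple of strings.
    """
    # Wrap single names so both cases go through the same code path.
    if isinstance(old_names, str):
        old_names = [old_names]
    if isinstance(new_names, str):
        new_names = [new_names]
    # zip() silently truncates to the shorter input, so check lengths first.
    if len(old_names) != len(new_names):
        raise ValueError("old_names and new_names must have the same length")
    return df.rename(columns=dict(zip(old_names, new_names)))

df = pd.DataFrame({'$a': [1, 2], '$b': [10, 20]})
one = rename_column_name(df, '$a', 'a')                   # single column
both = rename_column_name(df, ['$a', '$b'], ['a', 'b'])   # several columns
```

Because rename returns a new frame, the original df keeps its '$a' and '$b' columns.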

Related

Create a pandas DataFrame where each cell is a set of strings

I am trying to create a DataFrame like so:
col_a       col_b
{'soln_a'}  {'soln_b'}
In case it helps, here are some of my failed attempts:
import pandas as pd
my_dict_a = {"col_a": set(["soln_a"]), "col_b": set("soln_b")}
df_0 = pd.DataFrame.from_dict(my_dict_a) # ValueError: All arrays must be of the same length
df_1 = pd.DataFrame.from_dict(my_dict_a, orient="index").T # splits 'soln_b' into individual letters
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b).apply(set) # TypeError: 'set' type is unordered
df_3 = pd.DataFrame.from_dict(my_dict_b, orient="index").T # creates DataFrame of lists
df_3.apply(set, axis=1) # combines into single set of {soln_a, soln_b}
What's the best way to do this?
You just need to make sure your input data structure is formatted correctly.
The (default) dictionary-to-DataFrame constructor expects each value in the dictionary to be a collection with one entry per row. So you need a collection of set objects, rather than having the key map directly to a set.
If I change the input dictionary to hold a list of sets, it works as expected:
import pandas as pd
my_dict = {
"col_a": [{"soln_a"}, {"soln_c"}],
"col_b": [{"soln_b", "soln_d"}, {"soln_c"}]
}
df = pd.DataFrame.from_dict(my_dict)
print(df)
col_a col_b
0 {soln_a} {soln_d, soln_b}
1 {soln_c} {soln_c}
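To see why the wrapping list matters, here is a small sketch (the variable names letters, whole and cell are mine, just for illustration). It contrasts set("soln_b") from the question with a set literal, then reads one cell back to check that it really holds a set.

```python
import pandas as pd

# set() iterates over its argument, so a bare string is split into characters.
letters = set("soln_b")      # a set of single characters
whole = {"soln_b"}           # a set holding one string

# One list entry per row; each entry is the set stored in that cell.
df = pd.DataFrame({"col_a": [{"soln_a"}], "col_b": [whole]})
cell = df.loc[0, "col_b"]    # the Python set object stored in the cell
```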
You could apply a list comprehension on the columns:
my_dict_b = {"col_a": ["soln_a"], "col_b": ["soln_b"]}
df_2 = pd.DataFrame(my_dict_b)
df_2 = df_2.apply(lambda col: [set([x]) for x in col])
Output:
col_a col_b
0 {soln_a} {soln_b}
Why not something like this?
df = pd.DataFrame({
'col_a': [set(['soln_a'])],
'col_b': [set(['soln_b'])],
})
Output:
>>> df
col_a col_b
0 {soln_a} {soln_b}

How to drop all columns but not first that starts with pattern?

I have code that deletes all columns that are starting with spike:
import pandas as pd
data = {'spike_starts1': [1,2,3], 'spike_starts2': [4,5,6], 'spike_starts3': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
df2 = df.drop(df.columns[df.columns.str.contains(pat = '^spike')].tolist() , axis=1).copy()
Question: How can I modify the code above so that it keeps the first column that starts with spike but deletes all the others that start with spike? If the code above is hard to modify, suggest your own version.
This can be achieved just by changing .tolist() to .tolist()[1:], so the final code looks like:
import pandas as pd
data = {'spike_starts1': [1,2,3], 'spike_starts2': [4,5,6], 'spike_starts3': [7,8,9], 'not': [10,11,12]}
df = pd.DataFrame(data)
df2 = df.drop(df.columns[df.columns.str.contains(pat = '^spike')].tolist()[1:] , axis=1).copy()
You can create a spike flag and drop duplicates, which keeps only the first one.
(
    df.T.assign(flag=lambda x: x.index.str.slice(0,5))
    .drop_duplicates(subset='flag')
    .drop(columns='flag')
    .T
)
spike_starts1 not
0 1 10
1 2 11
2 3 12
Or you can build a dict with only the first spike column and the other non-spike columns.
first_spike = [e for e in df.columns if e.startswith('spike')][0]
pd.DataFrame({c: df[c] for c in df.columns if c == first_spike or not c.startswith('spike')})
Another solution:
(
pd.DataFrame(df.columns)
.assign(F=lambda x: x[0].str[:5])
.drop_duplicates(subset='F')
.pipe(lambda x: df.reindex(columns=x[0]))
)
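Yet another way to express the same idea is a boolean mask over the columns. This is only a sketch, and the names is_spike and first_spike are mine. It assumes "first" means first in column order, and it leaves the frame unchanged if no column matches.

```python
import numpy as np
import pandas as pd

data = {'spike_starts1': [1, 2, 3], 'spike_starts2': [4, 5, 6],
        'spike_starts3': [7, 8, 9], 'not': [10, 11, 12]}
df = pd.DataFrame(data)

# True for every column whose name starts with 'spike'.
is_spike = np.asarray(df.columns.str.startswith('spike'), dtype=bool)
# cumsum() counts the spike columns seen so far; combined with is_spike,
# a count of 1 marks only the first spike column.
first_spike = is_spike & (is_spike.cumsum() == 1)
# Keep non-spike columns plus the first spike column, in the original order.
df2 = df.loc[:, ~is_spike | first_spike].copy()
```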

Add a suffix to a dataframe called from a dictionary

I am trying to add a suffix to the dataframes called on by a dictionary.
Here is a sample code below:
import pandas as pd
import numpy as np
from collections import OrderedDict
from itertools import chain
# defining stuff
num_periods_1 = 11
num_periods_2 = 4
num_periods_3 = 5
# create sample time series
dates1 = pd.date_range('1/1/2000 00:00:00', periods=num_periods_1, freq='10min')
dates2 = pd.date_range('1/1/2000 01:30:00', periods=num_periods_2, freq='10min')
dates3 = pd.date_range('1/1/2000 02:00:00', periods=num_periods_3, freq='10min')
# column_names = ['WS Avg','WS Max','WS Min','WS Dev','WD Avg']
# column_names = ['A','B','C','D','E']
column_names_1 = ['C', 'B', 'A']
column_names_2 = ['B', 'C', 'D']
column_names_3 = ['E', 'B', 'C']
df1 = pd.DataFrame(np.random.randn(num_periods_1, len(column_names_1)), index=dates1, columns=column_names_1)
df2 = pd.DataFrame(np.random.randn(num_periods_2, len(column_names_2)), index=dates2, columns=column_names_2)
df3 = pd.DataFrame(np.random.randn(num_periods_3, len(column_names_3)), index=dates3, columns=column_names_3)
sep0 = '<~>'
suf1 = '_1'
suf2 = '_2'
suf3 = '_3'
ddict = {'df1': df1, 'df2': df2, 'df3': df3}
frames_to_concat = {'Sheets': ['df1', 'df3']}
Suffs = {'Suffixes': ['Suffix 1', 'Suffix 2', 'Suffix 3']}
Suff = {'Suffix 1': suf1, 'Suffix 2': suf2, 'Suffix 3': suf3}
## apply suffix to each data frame selected in order HERE
# Suffdict = [Suff[x] for x in Suffs['Suffixes']]
# print(Suffdict)
df4 = pd.concat([ddict[x] for x in frames_to_concat['Sheets']],
axis=1,
join='outer')
I want to add a suffix to each dataframe so that their columns can be distinguished after the dataframes are concatenated. I am having some trouble looking the suffixes up and applying them to each dataframe. Here I have selected df1 and df3 for concatenation, and I would like suffix 1 to be applied to df1 and suffix 2 to be applied to df3.
The suffixes follow the order of selection, not a particular dataframe: if df2 and df3 were selected, suffix 1 would be applied to df2 and suffix 2 to df3. The last suffix would simply not be used.
Before Python 3.7, the language does not guarantee dictionary order (CPython 3.6 preserves insertion order, but only as an implementation detail), so code that relies on it will not behave the same on older versions. If you need order, you should be using lists instead.
You can store your dataframes as well as your suffixes in a list, and then use zip to add a suffix to each df in turn.
dfs = [df1, df2, df3]
sufs = [suf1, suf2, suf3]
df_sufs = [x.add_suffix(y) for x, y in zip(dfs, sufs)]
Based on your code/answer, you can load your dataframes and suffixes into lists, call zip, add a suffix to each one, and call pd.concat.
dfs = [ddict[x] for x in frames_to_concat['Sheets']]
sufs = [Suff[x] for x in Suffs['Suffixes']]
df4 = pd.concat([x.add_suffix(sep0 + y)
                 for x, y in zip(dfs, sufs)], axis=1, join='outer')
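For anyone who wants to try this without the whole setup from the question, here is a self-contained sketch. The frames left and right are hypothetical stand-ins for the selected dataframes. It shows that zip pairs frames and suffixes by position and stops at the shorter list, so an extra suffix is simply not used.

```python
import pandas as pd

# Hypothetical small frames standing in for df1 and df3 from the question.
left = pd.DataFrame({'B': [1, 2], 'C': [3, 4]})
right = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})

frames = [left, right]
suffixes = ['_1', '_2', '_3']   # the unused third suffix is ignored by zip()
sep = '<~>'

# Pair each frame with its suffix by position, rename, then join side by side.
combined = pd.concat(
    [frame.add_suffix(sep + suffix) for frame, suffix in zip(frames, suffixes)],
    axis=1,
    join='outer',
)
```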
I ended up just making a simple loop for the problem. Here is my solution:
n = 0
for df in frames_to_concat['Sheets']:
    print(ddict[df])
    ddict[df] = ddict[df].add_suffix(sep0 + Suff[Suffs['Suffixes'][n]])
    n = n + 1
Does anyone have a better way to do this?

A value is trying to be set on a copy of a slice from a DataFrame. - pandas

I'm new to pandas, and, given a data frame, I was trying to drop some columns that don't meet a specific requirement. Researching how to do it, I arrived at this structure:
df = df.loc[df['DS_FAMILIA_PROD'].isin(['CARTOES', 'CARTÕES'])]
However, when processing the frame, I get this error:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
I'm not sure about what to do because I'm already using the .loc function.
What am I missing?
f = ['ID_manifest', 'issue_date', 'channel', 'product', 'ID_client', 'desc_manifest']
df = pd.DataFrame(columns=f)
for chunk in df2017_chunks:
    aux = preProcess(chunk, f)
    df = pd.concat([df, aux])
def preProcess(df, f):
    stops = list(stopwords.words("portuguese"))
    stops.extend(['reclama', 'cliente', 'santander', 'cartao', 'cartão'])
    df = df.loc[df['DS_FAMILIA_PROD'].isin(['CARTOES', 'CARTÕES'])]
    df.columns = f
    df.desc_manifest = df.desc_manifest.str.lower() # All lower case
    df.desc_manifest = df.desc_manifest.apply(lambda x: re.sub('[^A-zÀ-ÿ]', ' ', str(x))) # Just letters
    df.replace(['NaN', 'nan'], np.nan, inplace = True) # Remove nan
    df.dropna(subset=['desc_manifest'], inplace=True)
    df.desc_manifest = df.desc_manifest.apply(lambda x: [word for word in str(x).split() if word not in stops]) # Remove stop words
    return df
You need .copy(): the filtered result may be a copy of the original data, so if you modify values in it later, the modifications do not propagate back to the original DataFrame, and pandas warns you about that.
.loc can be omitted, but without .copy() you still get the warning.
df = pd.DataFrame({'DS_FAMILIA_PROD':['a','d','b'],
'desc_manifest':['F','rR', 'H'],
'C':[7,8,9]})
def preProcess(df):
    df = df[df['DS_FAMILIA_PROD'].isin([u'a', u'b'])].copy()
    df.desc_manifest = df.desc_manifest.str.lower() # All
    ...
    ...
    return df
print (preProcess(df))
C DS_FAMILIA_PROD desc_manifest
0 7 a f
2 9 b h
The purpose of the warning is to show users that they may be operating on a copy and not the original, but there can be false positives. As mentioned in the comments, this is not an issue for your use case.
On older pandas versions you can simply turn off the check for your dataframe (the is_copy attribute has since been deprecated and removed):
df.is_copy = False
or you can explicitly copy:
df = df.loc[df['DS_FAMILIA_PROD'].isin(['CARTOES', 'CARTÕES'])].copy()
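A minimal sketch of the explicit-copy route, using made-up data (raw and cards are hypothetical names, and the rows are invented for illustration). The point is that after .copy() the filtered frame is independent, so later assignments change only the copy.

```python
import pandas as pd

# Hypothetical miniature of the question's data.
raw = pd.DataFrame({'DS_FAMILIA_PROD': ['CARTOES', 'OUTROS', 'CARTÕES'],
                    'desc_manifest': ['Foo', 'Bar', 'Baz']})

# .copy() makes the filtered frame independent of raw.
cards = raw.loc[raw['DS_FAMILIA_PROD'].isin(['CARTOES', 'CARTÕES'])].copy()
# This assignment now clearly targets the copy, not raw.
cards['desc_manifest'] = cards['desc_manifest'].str.lower()
```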
If your program takes a copy of the df on purpose, you can silence the warning with the mode.chained_assignment option:
pd.set_option('mode.chained_assignment', None)    # turn the warning off
pd.set_option('mode.chained_assignment', 'warn')  # default: if you set a value on a copy, the warning will show
df = pd.DataFrame({'DS_FAMILIA_PROD' : [1, 2, 3], 'COL2' : [5, 6, 7]})
df = df[df.DS_FAMILIA_PROD.isin([1, 2])]
df
Out[29]:
COL2 DS_FAMILIA_PROD
0 5 1
1 6 2

Append rows to a dataframe

I am really struggling to make this work...
How can I take a Series, turn it into a dataframe, add a column to it, and concatenate the results in a loop?
The pseudocode is below, but the correct syntax is a mystery to me:
def func_B_Column(df):
    return 1
df_1 = (...) # columns=['a', 'etc1', 'etc2']
df_2 = pandas.DataFrame(columns=['a','b','c'])
listOfColumnC = ['c1','c2','c3']
for var in listOfColumnC:
    series = df_1.groupby('a').apply(func_B_Column) # series object should now have 'a' as index, and func_B_Column as value
    aux = series.to_frame('b')
    aux['c'] = aux.apply(lambda x: var, axis=1) # add another column 'c' to the series object
    df_2 = df_2.append(aux) # concatenate the results as rows, at the end
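Edited after the question's refinement:
# DataFrame.append was removed in pandas 2.0, so collect the pieces and concatenate once
frames = []
for var in listOfColumnC:
    frames.append(pd.DataFrame({'b': df_1.groupby('a').apply(func_B_Column), 'c': var}))
df_2 = pd.concat(frames)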
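Here is a runnable sketch of the same loop with placeholder data, since the question elides df_1 and the real aggregation. The contents of df_1, and func_B_Column returning the group size, are assumptions of mine. It builds each piece with to_frame and assign, then concatenates once with pd.concat.

```python
import pandas as pd

# Hypothetical input and aggregation; the question elides both.
df_1 = pd.DataFrame({'a': ['x', 'x', 'y'], 'etc1': [1, 2, 3]})

def func_B_Column(group):
    return len(group)   # placeholder: number of rows in the group

frames = []
for var in ['c1', 'c2', 'c3']:
    b = df_1.groupby('a')['etc1'].apply(func_B_Column)  # Series indexed by 'a'
    frames.append(b.to_frame('b').assign(c=var))        # add the constant column 'c'

# Concatenate once at the end instead of growing a frame inside the loop.
df_2 = pd.concat(frames)
```

Collecting the pieces in a list and calling pd.concat a single time avoids re-copying the accumulated rows on every iteration.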
