I have a DataFrame with columns like:
>>> df.columns
['A_ugly_column_name', 'B_ugly_column_name', ...]
and a Series, series_column_names, with nice column names like:
>>> series_column_names = pd.Series(
...     data=["A_ugly_column_name", "B_ugly_column_name"],
...     index=["A", "B"],
...     name="column_names",
... )
>>> print(series_column_names)
A A_ugly_column_name
B B_ugly_column_name
...
Name: column_names, dtype: object
Is there a nice way to rename the columns in df according to series_column_names? More specifically, I'd like to rename each column in df to the index label in series_column_names whose value is that column's old name.
Some context: I have several DataFrames with columns for the same things, but each names them slightly differently. I have a "name mapping" DataFrame where, as here, the index holds standardized names and the columns hold the names used by the various DataFrames. I want to use it to rename the columns in all of those DataFrames consistently.
A solution I have...
So far, the best solution I have is:
>>> df.rename(columns=lambda old: series_column_names.index[series_column_names == old][0])
which works but I'm wondering if there's a better, more pandas-native way to do this.
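For concreteness, here's a runnable version of the setup (toy data assumed):

import pandas as pd

df = pd.DataFrame({"A_ugly_column_name": [1], "B_ugly_column_name": [2]})
series_column_names = pd.Series(
    data=["A_ugly_column_name", "B_ugly_column_name"],
    index=["A", "B"],
    name="column_names",
)

# For each old name, find where the series value matches and take the
# corresponding index label as the new name.
renamed = df.rename(
    columns=lambda old: series_column_names.index[series_column_names == old][0]
)
print(renamed.columns.tolist())  # ['A', 'B']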
First create a dictionary out of your series by using .str.split (this assumes each value holds both names separated by whitespace):
cols = {y: x for x, y in series_column_names.str.split(r'\s+').tolist()}
print(cols)
Edit.
If your series has your target column names as the index and the old column names as the values, you can still create a dictionary by inverting the keys and values:
cols = {y: x for x, y in series_column_names.to_dict().items()}
or
cols = dict(zip(series_column_names.tolist(), series_column_names.index))
print(cols)
{'B_ugly_column_name': 'B_nice_column_name',
'C_ugly_column_name': 'C_nice_column_name',
'A_ugly_column_name': 'A_nice_column_name'}
Then assign your column names:
df.columns = df.columns.map(cols)
print(df)
A_nice_column_name B_nice_column_name
0 0 0
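One caveat worth knowing: Index.map relabels every column, and any label missing from the dict becomes NaN, so df.rename(columns=cols) is the safer choice when the mapping might be incomplete. A minimal sketch with made-up names:

import pandas as pd

df = pd.DataFrame({"A_ugly_column_name": [0], "B_ugly_column_name": [0]})
cols = {"A_ugly_column_name": "A_nice_column_name"}  # incomplete on purpose

# map() turns unmapped labels into NaN; rename() leaves them untouched
print(df.columns.map(cols).tolist())             # ['A_nice_column_name', nan]
print(df.rename(columns=cols).columns.tolist())  # ['A_nice_column_name', 'B_ugly_column_name']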
Just invert the index/values in series_column_names and use the result to rename. It doesn't matter if there are extra names.
series_column_names = pd.Series(
    data=["A_ugly", "B_ugly", "C_ugly"],
    index=["A", "B", "C"],
)
df.rename(columns=pd.Series(series_column_names.index.values, index=series_column_names))
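For example, with a toy df that has only the A/B columns, the extra "C_ugly" entry is harmless:

import pandas as pd

df = pd.DataFrame({"A_ugly": [0], "B_ugly": [0]})
series_column_names = pd.Series(
    data=["A_ugly", "B_ugly", "C_ugly"],
    index=["A", "B", "C"],
)

# Swap index and values so old names map to new names, then rename;
# "C_ugly" is ignored because df has no such column.
mapping = pd.Series(series_column_names.index.values, index=series_column_names)
print(df.rename(columns=mapping).columns.tolist())  # ['A', 'B']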
Wouldn't it be as simple as this?
series_column_names = pd.Series(['A_nice_column_name', 'B_nice_column_name'])
df.columns = series_column_names
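Note that this assignment is purely positional: the values are applied in order, and the length must match the number of columns exactly. A minimal sketch with assumed names:

import pandas as pd

df = pd.DataFrame({"A_ugly_column_name": [0], "B_ugly_column_name": [0]})

# Works only because the new names are listed in the same order as the
# existing columns; a length mismatch would raise a ValueError.
df.columns = pd.Series(["A_nice_column_name", "B_nice_column_name"])
print(df.columns.tolist())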
Related
I have to apply many functions to certain columns in my df, but there are many columns. Is there any way to give these groups of columns a name and then apply functions to them?
I am looking for something like
df[['One', 'Two',...'Sixty']] = values
values.apply(lambda x: x.astype(str).str.lower())
Probably you could first create a list of column names and then use Index.isin on df.columns as follows:
values = ['One', 'Two',...'Sixty']
df.loc[:, df.columns.isin(values)].apply(lambda x: x.astype(str).str.lower())
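Note that the expression above returns a new DataFrame; to keep the change, assign it back. A sketch with made-up columns:

import pandas as pd

df = pd.DataFrame({"One": ["A"], "Two": ["B"], "Other": ["C"]})
values = ["One", "Two"]

mask = df.columns.isin(values)
# .apply returns a copy, so write the result back into those columns
df.loc[:, mask] = df.loc[:, mask].apply(lambda x: x.astype(str).str.lower())
print(df)  # One and Two are lower-cased, Other is untouched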
If performance is not an issue, you can use pandas.DataFrame.applymap:
cols = ['One', 'Two',...'Sixty']
df[cols] = df[cols].astype(str).applymap(lambda x: x.lower())
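In pandas 2.1+, applymap is deprecated in favor of the element-wise DataFrame.map; an equivalent sketch, assuming a recent pandas version:

import pandas as pd

df = pd.DataFrame({"One": ["A"], "Two": ["B"]})
cols = ["One", "Two"]

# DataFrame.map (pandas >= 2.1) applies the function element-wise,
# exactly like the old applymap
df[cols] = df[cols].astype(str).map(str.lower)
print(df)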
You can group the columns and then use apply in pandas:
cols_to_group = ['col1', 'col2']
df.loc[:, cols_to_group] = df.loc[:, cols_to_group].apply(my_function)
my_function can contain the functionality you need to apply to the group.
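For example, my_function can be any callable that takes a column and returns a transformed column; a hypothetical one:

import pandas as pd

df = pd.DataFrame({"col1": ["A", "B"], "col2": ["C", "D"], "col3": [1, 2]})

def my_function(col):
    # hypothetical transform: lower-case the column's string values
    return col.astype(str).str.lower()

cols_to_group = ["col1", "col2"]
df.loc[:, cols_to_group] = df.loc[:, cols_to_group].apply(my_function)
print(df)  # col1 and col2 are lower-cased, col3 is untouched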
I have a list of dataframes, and each dataframe has columns with names such as "Unnamed 1", "Unnamed 2", etc.
I want to drop all columns that contain the name "Unnamed" from each dataframe in the list of dataframes.
df_all = [df1, df2, df3]
df_all2 = []
for df in df_all:
    df = df[df.columns.drop(list(df.filter(regex='Unnamed')))]
    df_all2.append(df)
df_all = df_all2
This works, however, is there a more succinct method?
There is a more succinct way using a regex column filter with a negative lookahead:
df_all = [df.filter(regex=r'^(?!Unnamed)') for df in df_all]
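For example, with a made-up frame:

import pandas as pd

df1 = pd.DataFrame({"kept": [1], "Unnamed 1": [2], "Unnamed 2": [3]})

# ^(?!Unnamed) matches only names that do NOT start with "Unnamed"
print(df1.filter(regex=r'^(?!Unnamed)').columns.tolist())  # ['kept']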
Or drop the matching columns in place:
for df in df_all:
    df.drop(df.columns[df.columns.str.contains('Unnamed:')], axis=1, inplace=True)
I have a dataset that has many columns. I want to extract the numeric columns, fill their missing values with the column mean, and then have these modified columns replace the originals in the dataframe.
df1 = df.select_dtypes(include=["number"]).apply(lambda x: x.fillna(x.mean()), axis=0)
df.loc[df.select_dtypes(include=["number"])] = df1
I managed to extract the numeric columns, but I couldn't replace them; the idea is not to have to manually list which columns are numeric.
It's probably easier to assign a new/changed DataFrame. This will only change the columns you altered.
new_df = df.assign(**df.select_dtypes('number').apply(lambda x: x.fillna(x.mean())))
If you want to modify the original DataFrame instead, you can do it in steps:
cols = df.select_dtypes('number').columns
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))
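A quick check with made-up data:

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", "y", "z"]})

cols = df.select_dtypes("number").columns
df[cols] = df[cols].apply(lambda x: x.fillna(x.mean()))
print(df)  # the NaN in "a" becomes 2.0; "b" is left alone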
Wondering what the best way to tackle this issue is. If I have a DF with the following columns
df1
type_of_fruit name_of_fruit price
..... ..... .....
and a list called
expected_cols = ['name_of_fruit','price']
what's the best way to automate the check of df1 against the expected_cols list? I was trying something like
df_cols=df1.columns.values.tolist()
if df_cols != expected_cols:
And then move any columns not in expected_cols to another df, but this doesn't seem like a great idea to me. Is there a way to save the "dropped" columns?
df2 = df1.drop(columns=expected_cols)
But this seems problematic depending on column ordering, and also when there are either more or fewer columns than expected. In cases where there are fewer columns than expected (i.e. df1 only contains the column name_of_fruit) I'm planning on using
df1.reindex(columns=expected_cols)
But I'm a bit iffy on how to do this programmatically, and on how to handle the case where there are more columns than expected.
You can use set difference with -:
Assuming df1 has these columns:
df1_cols = df1.columns  # ['type_of_fruit', 'name_of_fruit', 'price']
expected_cols = ['name_of_fruit', 'price']
unwanted_cols = list(set(df1_cols) - set(expected_cols))
df2 = df1[unwanted_cols]  # save the "dropped" columns
df1.drop(unwanted_cols, axis=1, inplace=True)
Use groupby along the columns axis to split the DataFrame succinctly. In this case, check whether the columns are in your list to form the grouper, and store the results in a dict: the True key gets the DataFrame with the columns in the list, and the False key gets the columns not in the list.
Sample Data
import pandas as pd
df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit','price']
Code
d = dict(tuple(df.groupby(df.columns.isin(expected_cols), axis=1)))
# If you need to ensure columns are always there then do
#d[True] = d[True].reindex(expected_cols)
d[True]
# name_of_fruit price
#0 2 3
d[False]
# type_of_fruit
#0 1
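Note that groupby(..., axis=1) is deprecated as of pandas 2.1. A boolean mask over df.columns produces the same split without it; a sketch using the same sample data:

import pandas as pd

df = pd.DataFrame(data=[[1, 2, 3]],
                  columns=['type_of_fruit', 'name_of_fruit', 'price'])
expected_cols = ['name_of_fruit', 'price']

# True bucket: columns in the list; False bucket: everything else
mask = df.columns.isin(expected_cols)
d = {True: df.loc[:, mask], False: df.loc[:, ~mask]}
print(d[True])   # name_of_fruit, price
print(d[False])  # type_of_fruit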
I have a dataframe ('df') containing several columns and would like to only keep those columns with a column header starting with the prefix 'x1' or 'x4'. That is, I want to 'drop' all columns except those with a column header starting with either 'x1' or 'x4'.
How can I do this in one step?
I know that I can drop all columns matching the x1 prefix by doing:
df = df[df.columns.drop(list(df.filter(regex='x1')))]
...but this loses the columns with the x1 prefix, which I want to keep.
Similarly, I can drop all columns matching the x4 prefix:
df = df[df.columns.drop(list(df.filter(regex='x4')))]
...but this loses the columns with the x4 prefix, which I also want to keep.
You can use df.loc with list comprehension:
df.loc[:, [x for x in df.columns if x.startswith(('x1', 'x4'))]]
It keeps all rows and only those columns whose names start with 'x1' or 'x4'.
You can choose the desired columns first and then just select those columns.
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if x.startswith("x1") or x.startswith("x4")]
df = df[desired_columns]
You can also use a function:
def is_valid(x):
    return x.startswith("x1") or x.startswith("x4")
data = [{"x1":"a", "x2":"a", "x4":"a"}]
df = pd.DataFrame(data)
desired_columns = [x for x in df.columns if is_valid(x)]
df = df[desired_columns]
You can also use the filter option:
df.filter(regex='^x1|^x4')
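With the toy frame from the previous answer, this keeps exactly the x1 and x4 columns:

import pandas as pd

df = pd.DataFrame([{"x1": "a", "x2": "a", "x4": "a"}])

# ^x1|^x4 matches column names that start with x1 or x4
print(df.filter(regex=r'^x1|^x4').columns.tolist())  # ['x1', 'x4']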