Reordering Pandas Columns based on Column name - python

I have columns with similar names but numeric suffixes that represent different occurrences of each column. For example, I have columns (company_1, job_title_1, location_1, company_2, job_title_2, location_2). I would like to order these columns grouped together by the prefix (before the underscore) and then sequentially by the suffix (after the underscore).
How I would like the columns to be: company_1, company_2, job_title_1, job_title_2, location_1, location_2.
Here's what I tried from this question:
df = df.reindex(sorted(df.columns), axis=1)
This resulted in the order: company_1, company_10, company_11 (skipping over 2-9)

This type of sorting is called natural sorting. (There are more details in Naturally sorting Pandas DataFrame which demonstrates how to sort rows using natsort)
Setup with natsort
import pandas as pd
from natsort import natsorted
df = pd.DataFrame(columns=[f'company_{i}' for i in [5, 2, 3, 4, 1, 10]])
# Before column sort
print(df)
df = df.reindex(natsorted(df.columns), axis=1)
# After column sort
print(df)
Before sort:
Empty DataFrame
Columns: [company_5, company_2, company_3, company_4, company_1, company_10]
Index: []
After sort:
Empty DataFrame
Columns: [company_1, company_2, company_3, company_4, company_5, company_10]
Index: []
Compared to lexicographic sorting with sorted:
df = df.reindex(sorted(df.columns), axis=1)
Empty DataFrame
Columns: [company_1, company_10, company_2, company_3, company_4, company_5]
Index: []

Just for the sake of completeness, you can also get the desired result by passing a function to sorted that splits the strings into name, index tuples.
def index_splitter(x):
"""Example input: 'job_title_1'
Output: ('job_title', 1)
"""
*name, index = x.split("_")
return '_'.join(name), int(index)
df = df.reindex(sorted(df.columns, key=index_splitter), axis=1)

Related

Set an idex from the first level of a hierarchical index column, PANDAS

I have a dataframe which is the result of a concatenation of dataframe. I use "keys= " option for the title of each blocks when I export in Excel.
And now I want define the ID2 as an index with ID. (For have a multindex)
I tried to use .resetindex, but it didn't work like I want.
I have:
I want:
You can extract your indexes to lists and to create a MultiIndex object, and then simply define the index of your DataFrame with this MultiIndex. This works on my side (pandas imported as pd):
Let's assume your initial DataFrame is this one (just a smaller version of what you have):
df = pd.DataFrame({'ID2': ['b','c','b'], 'name' : ['tomato', 'pizza', 'kebap']}, index = [1,2,4])
Then, we extract the final indices from the index and from the column of the dataframe in order to build a list of tuples, with which you create the multiindex with pandas.MuliIndex method:
ID2 = df.ID2.to_list()
ID1 = df.index.to_list()
indexes = [(id1, id2) for id1,id2 in zip(ID1,ID2)]
final_indices = pd.MultiIndex.from_tuples(indexes, names=["Id1", "Id2"])
Finally, you redefine your index and you can drop the 'ID2' column:
df.index = final_indices
df = df.drop('ID2', axis = 1)
This gives the following DataFrame:
Note: I also tried with the df.reindex method, but the values of the DataFrame became NaN, I do not know why.

Rename columns in a pandas.DataFrame via lookup from a Series

I have a DataFrame with columns like:
>>> df.columns
['A_ugly_column_name', 'B_ugly_column_name', ...]
and a Series, series_column_names, with nice column names like:
>>> series_column_names = pd.Series(
data=["A_ugly_column_name", "B_ugly_column_name"],
index=["A", "B"],
)
>>> print(series_column_names)
A A_ugly_column_name
B B_ugly_column_name
...
Name: column_names, dtype: object
Is there a nice way to rename the columns in df according to series_column_names? More specifically, I'd like to rename the columns in df to the index in column_names where value in the series is the old column name in df.
Some context - I have several DataFrames with columns for the same thing, but they're all named slightly differently. I have a DataFrame where, like here, the index is a standardized name and the columns contain the column names used by the various DataFrames. I want to use this "name mapping" DataFrame to rename the columns in the several DataFrames to the same thing.
a solution i have...
So far, the best solution I have is:
>>> df.rename(columns=lambda old: series_column_names.index[series_column_names == old][0])
which works but I'm wondering if there's a better, more pandas-native way to do this.
first create a dictionary out of your series by using .str.split
cols = {y : x for x,y in series_column_names.str.split('\s+').tolist()}
print(cols)
Edit.
If your series has your target column names as the index and the values as the series you can still create a dictionary by inversing the keys and values.
cols = {y : x for x,y in series_column_names.to_dict().items()}
or
cols = dict(zip(series_column_names.tolist(), series_column_names.index))
print(cols)
{'B_ugly_column_name': 'B_nice_column_name',
'C_ugly_column_name': 'C_nice_column_name',
'A_ugly_column_name': 'A_nice_column_name'}
then assign your column names.
df.columns = df.columns.map(cols)
print(df)
A_nice_column_name B_nice_column_name
0 0 0
Just inverse the index/values in series_column_names and use it to rename. It doesn't matter if there are extra names.
series_column_names = pd.Series(
data=["A_ugly", "B_ugly", "C_ugly"],
index=["A", "B", "C"],
)
df.rename(columns=pd.Series(series_column_names.index.values, index=series_column_names))
Wouldn't it be as simple as this?
series_column_names = pd.Series(['A_nice_column_name', 'B_nice_column_name'])
df.columns = series_column_names

Pandas: Sort column by list of values, similar to SELECT * FROM df WHERE ORDER BY FIELD(id,...)

What is the similar query to SELECT * FROM df WHERE id in (3,1,2) ORDER BY FIELD(id,3,1,2) in Pandas?
list_ids = [3, 1, 2]
df[df.id.isin(list_ids)]#.sort_by_field('id', list_ids)
afaik .sort_values() can only sort by columns that already in dataframe.
Note: I don't want to sort by multiple/list of columns. I want to sort ONE column by a specific list of values
You can specify a list of columns to sort by inside sort_values()
df[df.id.isin(list_ids)].sort_values(["col1","col3","col2"], ascending=True)
If you want to order in some specific order, you can maybe create an additional column to do this.
order_list = [3,1,2]
order_dict = dict(zip(order_list , range(len(order_list ))))
df["sort_col"] = df["id"].map(order_dict)
df[df.id.isin(list_ids)].sort_values(["sort_col"], ascending=False)

Loop over multiple columns to find strings in a numerical column?

The following code finds any strings for column B. Is it possible to loop over multiple columns of a dataframe outputting the cells containing strings for each column?
import pandas as pd
for i in df:
print(df[df['i'].str.contains(r'^[a-zA-Z]+$')])
Link to code above
https://stackoverflow.com/a/65410078/12801962
Here is how to loop through columns
import pandas as pd
colList = ['ColB', 'Some_other', 'ColC']
for col in colList:
subdf = df[df[col].str.contains(r'^[a-zA-Z]+$')]
#do something with sub DF
or do it in one long test and get all the problem rows in one dataframe
import pandas as pd
subdf = df[((df['ColB'].str.contains(r'^[a-zA-Z]+$')) |
(df['Some_other'].str.contains(r'^[a-zA-Z]+$')) |
(df['ColC'].str.contains(r'^[a-zA-Z]+$')))]
Not sure if it's what you are intending to do
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['ColA'] = ['ABC', 'DEF', 12345, 23456]
df['ColB'] = ['abc', 12345, 'def', 23456]
all_trues = pd.Series(np.ones(df.shape[0], dtype=np.bool))
for col in df:
all_trues &= df[col].str.contains(r'^[a-zA-Z]+$')
df[all_trues]
Which will give the result:
ColA ColB
0 ABC abc
Try:
for k, s in df.astype(str).items():
print(s.loc[s.str.contains(r'^[a-zA-Z]+$')])
Or, for the values only (no index nor column information):
for k, s in df.astype(str).items():
print(s.loc[s.str.contains(r'^[a-zA-Z]+$')].values)
Note, both of the above only work because you just want to print the matching values in the columns, not return a new structure with filtered entries.
If you tried to make a new DataFrame with cells filtered by the condition, then that would lead to ragged arrays, which are not implemented (you could replace these cells by a marker of your choice, but you cannot cut them away). Another possibility would be to select rows where any or all the cells present the condition you are testing for (that way, the result is an homogeneous array, not a ragged one).
Yet another option would be to return a list of Series, each representing a column, or a dict of colname: Series:
{k: s.loc[s.str.contains(r'^[a-zA-Z]+$')] for k, s in df.astype(str).items()}

Pandas merge/faltten dataframe based on specific column

i have following data sample i am trying to flatten it out using pandas, i wanna flatten this data over Candidate_Name.
This is my implementation,
df= df.merge(df,on=('Candidate_Name'))
but i am not getting desired result. My desired output is as follows. So basically have all the rows that match Candidate_Name in a single row, where duplicate column names may suffix with _x
I think you need GroupBy.cumcount with DataFrame.unstack and then flatten MultiIndex with same values for first groups and added numbers for another levels for avoid duplicated columns names:
df = df.set_index(['Candidate_Name', df.groupby('Candidate_Name').cumcount()]).unstack()
df.columns = [a if b == 0 else f'{a}_{b}' for a, b in df.columns]

Categories