I have some columns that share the same name. I would like to append a 1 to the repeated column names.
Data
Date Type hi hello stat hi hello
1/1/2022 a 0 0 1 1 0
Desired
Date Type hi hello stat hi1 hello1
1/1/2022 a 0 0 1 1 0
Doing
mask = df['col2'].duplicated(keep=False)
I believe I can utilize mask, but I'm not sure how to achieve this efficiently without naming the affected columns explicitly. I would like to pass in the full dataset and let the logic rename the duplicates.
Any suggestions are appreciated.
Use the built-in parser method _maybe_dedup_names():
df.columns = pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(df.columns)
# Date Type hi hello stat hi.1 hello.1
# 0 1/1/2022 a 0 0 1 1 0
This is what pandas uses to deduplicate column headers from read_csv().
Note that it scales to any number of duplicate names:
cols = ['hi'] * 3 + ['hello'] * 5
pd.io.parsers.base_parser.ParserBase({'usecols': None})._maybe_dedup_names(cols)
# ['hi', 'hi.1', 'hi.2', 'hello', 'hello.1', 'hello.2', 'hello.3', 'hello.4']
In pandas < 1.3:
df.columns = pd.io.parsers.ParserBase({})._maybe_dedup_names(df.columns)
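Note that this relies on a private pandas API that has moved between versions. In pandas 2.x the helper appears to live in pandas.io.common as dedup_names; treat the exact location as an assumption and verify it against your installed version:
# Assumes pandas >= 2.0, where the private dedup helper was moved to pandas.io.common
df.columns = pd.io.common.dedup_names(df.columns, is_potential_multiindex=False)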
You can apply the duplicated operation to the column names, map the resulting booleans to suffix strings, and concatenate those onto the original names:
df.columns = df.columns + [{False: '', True: '1'}[x] for x in df.columns.duplicated()]
Note that duplicated() marks every occurrence after the first with the same suffix '1', so this only disambiguates names that appear at most twice.
We can number the repeats with groupby and cumcount:
s = df.columns.to_series().groupby(df.columns).cumcount().replace({0:''}).astype(str).radd('.')
df.columns = (df.columns + s).str.strip('.')
df
Out[153]:
Date Type hi hello stat hi.1 hello.1
0 1/1/2022 a 0 0 1 1 0
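To see what this does, here is the intermediate suffix series s for the sample header, as a minimal standalone sketch (the column list is copied from the example above):
import pandas as pd

cols = pd.Index(['Date', 'Type', 'hi', 'hello', 'stat', 'hi', 'hello'])

# cumcount numbers each occurrence of a name: 0 for the first, 1 for the second, ...
s = cols.to_series().groupby(cols).cumcount().replace({0: ''}).astype(str).radd('.')
print(list(s))
# ['.', '.', '.', '.', '.', '.1', '.1']

# Concatenating and stripping the leftover '.' yields the deduplicated header
print(list((cols + s).str.strip('.')))
# ['Date', 'Type', 'hi', 'hello', 'stat', 'hi.1', 'hello.1']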
I have the following method-chaining code and want to create a new column, but I'm getting an error when doing the following:
(
pd.pivot(test, index = ['file_path'], columns = 'year', values = 'file')
.fillna(0)
.astype(int)
.reset_index()
.assign(hierarchy = file_path.str[1:-1].str.join(' > '))
)
Before the assign method the dataframe looks something like this:
file_path    2017  2018  2019  2020
S:\Test\A       0     0     1     2
S:\Test\A\B     1     0     1     3
S:\Test\A\C     3     1     1     0
S:\Test\B\A     1     0     0     1
S:\Test\B\B     1     0     0     1
The error is: NameError: name 'file_path' is not defined.
file_path exists in the dataframe, but I'm not referencing it correctly. What is the proper way to create a new column based on another using assign?
You can pass a callable to assign that receives the dataframe as it exists at that point in the chain:
.assign(hierarchy=lambda fr: fr["file_path"].str[1:-1].str.join(" > "))
Here fr is the dataframe as modified so far (pivoted, index reset, etc.), from which you can access the column "file_path".
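As a self-contained illustration of the pattern (a minimal sketch with made-up paths; the split step is added so it runs on plain strings, and forward slashes keep it simple):
import pandas as pd

df = pd.DataFrame({'file_path': ['S:/Test/A/B', 'S:/Test/A/C']})

# The callable receives the dataframe built up to this point in the chain,
# so 'file_path' is resolved from it rather than from the surrounding scope.
out = df.assign(
    hierarchy=lambda fr: fr['file_path'].str.split('/').str[1:-1].str.join(' > ')
)
print(out)
#      file_path hierarchy
# 0  S:/Test/A/B  Test > A
# 1  S:/Test/A/C  Test > A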
I'm new to Python.
I have a data frame (DF), for example:
id  type
1   A
1   B
2   C
2   B
I would like to add a column, e.g. A_flag, grouped by id.
In the end I have this data frame (DF):
id  type  A_flag
1   A     1
1   B     1
2   C     0
2   B     0
I can do this in two steps:
import numpy as np

DF['A_flag_tmp'] = [1 if x.type == 'A' else 0 for x in DF.itertuples()]
DF['A_flag'] = DF.groupby(['id'])['A_flag_tmp'].transform(np.max)
It works, but it's very slow for a big data frame.
Is there any way to optimize this case?
Thanks for the help.
Change your slow iterative code to fast vectorized code by replacing your first step with a boolean series generated by pandas built-in functions, e.g.
df['type'].eq('A')
Then you can feed it into the groupby statement for the second step, as follows:
df['A_flag'] = df['type'].eq('A').groupby(df['id']).transform('max').astype(int)
Result
print(df)
   id  type  A_flag
0   1     A       1
1   1     B       1
2   2     C       0
3   2     B       0
In general, if you have more complicated conditions, you can also define them in a vectorized way, e.g. define the boolean series m by:
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
Then, use it in step 2 as follows:
m.groupby(df['id']).transform('max').astype(int)
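Putting it together as a runnable sketch (type1 and type2 are hypothetical columns invented for illustration):
import pandas as pd

df = pd.DataFrame({
    'id':    [1, 1, 2, 2],
    'type':  ['A', 'B', 'C', 'B'],
    'type1': [2, 0, 0, 0],   # hypothetical column
    'type2': [0, 0, 0, 0],   # hypothetical column
})

# Row-level condition, then broadcast each group's maximum back to its rows
m = df['type'].eq('A') & df['type1'].gt(1) | (df['type2'] != 0)
df['flag'] = m.groupby(df['id']).transform('max').astype(int)
print(df['flag'].tolist())
# [1, 1, 0, 0]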
I want to replace the values of specific columns. I can change the values one by one, but I have hundreds of columns and need to change only the columns whose names start with a specific string. Here is an example: I want to replace the values when the column name starts with "Q14".
df.filter(regex = 'Q14').replace(1, 'Selected').replace(0, 'Not selected')
The above code works, but how can I apply the result back to my dataframe? Since filter returns a copy, I can't use inplace.
Consider below df:
In [439]: df = pd.DataFrame({'Q14_A':[ 1,0,0,2], 'Q14_B':[0,1,1,2], 'Q12_A':[1,0,0,0]})
In [440]: df
Out[440]:
   Q14_A  Q14_B  Q12_A
0      1      0      1
1      0      1      0
2      0      1      0
3      2      2      0
Filter the columns that start with Q14 and save them in a variable:
In [443]: cols = df.filter(regex='^Q14').columns
Now apply your replace commands to the selected columns:
In [446]: df[cols] = df[cols].replace(1, 'Selected').replace(0, 'Not selected')
Output:
In [447]: df
Out[447]:
          Q14_A         Q14_B  Q12_A
0      Selected  Not selected      1
1  Not selected      Selected      0
2  Not selected      Selected      0
3             2             2      0
You can also iterate over all columns and, where the name matches, transform the column with apply:
for column in df.columns:
    if column.startswith("Q14"):
        df[column] = df[column].apply(lambda x: "Selected" if x == 1 else "Not selected")
Note that unlike replace, the lambda maps every value other than 1 (including the 2s above) to "Not selected".
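If you want a vectorized variant that still leaves unlisted values such as the 2s untouched, one option is numpy.select (a sketch, not part of the original answers):
import numpy as np

cols = [c for c in df.columns if c.startswith("Q14")]

# Map 1 -> "Selected", 0 -> "Not selected", and keep everything else as-is
df[cols] = np.select(
    [df[cols].eq(1), df[cols].eq(0)],
    ["Selected", "Not selected"],
    default=df[cols].astype(object),   # object dtype keeps unmatched ints unchanged
)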
Using replace with a dict:
df = pd.DataFrame({'Q14_A':[ 1,0,0,2], 'Q14_B':[0,1,1,2], 'Q12_A':[1,0,0,0]})
cols = df.filter(regex='^Q14').columns
replace_map = {
    1: "Selected",
    0: "Not selected",
}
df[cols] = df[cols].replace(replace_map)
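Unlike the apply-based loop, the dict touches only the listed values, so the 2s survive:
print(df)
#           Q14_A         Q14_B  Q12_A
# 0      Selected  Not selected      1
# 1  Not selected      Selected      0
# 2  Not selected      Selected      0
# 3             2             2      0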
I have the following dataframe.
import pandas as pd
data = ['ABC1', 'ABC2', 'ABC3', 'ABC4']
data = pd.DataFrame(data, columns=["Column A"])
Column A
0 ABC1
1 ABC2
2 ABC3
3 ABC4
How can I insert a "-" after the "ABC" in Column A of data?
Expected output:
Column A
0 ABC-1
1 ABC-2
2 ABC-3
3 ABC-4
The simplest solution is to use the replace method with regex=True, and inplace=True to make the change permanent in the dataframe.
data['Column A'].replace(['ABC'], 'ABC-', regex=True, inplace=True)
print(data)
Column A
0 ABC-1
1 ABC-2
2 ABC-3
3 ABC-4
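One caveat: recent pandas versions warn about calling inplace=True on a column selection like this (chained assignment under copy-on-write), so assigning the result back is the more future-proof pattern:
# Safer equivalent without inplace
data['Column A'] = data['Column A'].replace('ABC', 'ABC-', regex=True)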
A possible solution, assuming the trailing number is always a single digit, is
data['Column A'] = data['Column A'].str[:-1] + '-' + data['Column A'].str[-1]
print (data)
# Column A
#0 ABC-1
#1 ABC-2
#2 ABC-3
#3 ABC-4
Here's a way that only assumes the numbers to be preceded by a dash are at the end (note the raw string and the [A-Za-z] class; the often-seen [A-z] also matches punctuation):
df['ColumnA'].str.split(r'([A-Za-z]+)(\d+)').str.join('-').str.strip('-')
0 ABC-1
1 ABC-2
2 ABC-3
3 ABC-4
Another example:
df = pd.DataFrame({'ColumnA':['asf1','Ads2','A34']})
Will give:
df['ColumnA'].str.split(r'([A-Za-z]+)(\d+)').str.join('-').str.strip('-')
0 asf-1
1 Ads-2
2 A-34
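An alternative that avoids the split/join round trip is a single regex substitution inserting the dash before the trailing digits (same assumption about the format):
# Insert '-' before the trailing run of digits
df['ColumnA'].str.replace(r'(\d+)$', r'-\1', regex=True)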
Is there a way to return the name/header of a column as a string in a pandas dataframe? I want to work with the columns of data that share the same prefix. The dataframe header looks like this:
col_00 | col_01 | ... | col_51 | bc_00 | cd_00 | cd_01 | ... | cd_90
I'd like to apply a function to each row, but only to col_00 through col_51, and separately to cd_00 through cd_90. To do this, I thought I'd collect the column names into a list, e.g. to_work_with would hold the columns starting with the prefix 'col', and I'd apply the function to df[to_work_with]. Then I'd rebuild to_work_with with the columns starting with the 'cd' prefix, and so on. But I don't know how to iterate through the column names.
So basically, the thing I'm looking for is this function:
to_work_with = column names in the df that start with "thisstring"
How can I do that? Thank you!
You can use boolean indexing with str.startswith:
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
Sample:
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 6 7 8 9
cols = df.columns[df.columns.str.startswith('cd')]
print (cols)
Index(['cd_00', 'cd_01', 'cd_02', 'cd_90'], dtype='object')
# if you want to apply some function to the filtered columns only
def f(x):
    return x + 1

df[cols] = df[cols].apply(f)
print (df)
col_00 col_01 col_02 col_51 bc_00 cd_00 cd_01 cd_02 cd_90
0 1 2 3 4 5 7 8 9 10
Another solution with list comprehension:
cols = [col for col in df.columns if col.startswith("cd")]
print (cols)
['cd_00', 'cd_01', 'cd_02', 'cd_90']
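If you only need the sub-frame rather than the names, df.filter can select the prefixed columns directly (the ^ anchors the regex to the start of the name):
# Sub-frame of columns whose names start with 'cd'
sub = df.filter(regex='^cd')

# Equivalent selection via a boolean mask over the column labels
sub = df.loc[:, df.columns.str.startswith('cd')]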