How to Remove a Substring of a String in a Dataframe Column? - python

I have this simplified dataframe:
ID  Date
1   8/24/1995
2   8/1/1899 :00
How can I use pandas to find any date in the dataframe that has a trailing :00 and remove it?
Any idea how to solve this?
I have tried this syntax but did not help:
df[df["Date"].str.replace(to_replace="\s:00", value="")]
The output should look like:
ID  Date
1   8/24/1995
2   8/1/1899

You need to assign the trimmed column back to the original column instead of subsetting. Also, the str.replace method doesn't have to_replace and value parameters; it takes pat and repl instead:
df["Date"] = df["Date"].str.replace(r"\s:00", "", regex=True)
df
# ID Date
#0 1 8/24/1995
#1 2 8/1/1899
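A note on pandas versions: since pandas 2.0, str.replace defaults to regex=False, so a whitespace pattern like \s is only treated as a regex when you say so explicitly. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"ID": [1, 2], "Date": ["8/24/1995", "8/1/1899 :00"]})

# In pandas >= 2.0, str.replace treats pat literally unless regex=True
df["Date"] = df["Date"].str.replace(r"\s:00", "", regex=True)
print(df["Date"].tolist())  # ['8/24/1995', '8/1/1899']
```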

To apply this to an entire dataframe, I'd stack then unstack
df.stack().str.replace(r'\s:00', '').unstack()
functionalized
def dfreplace(df, *args, **kwargs):
    s = pd.Series(df.values.flatten())
    s = s.str.replace(*args, **kwargs)
    return pd.DataFrame(s.values.reshape(df.shape), df.index, df.columns)
Examples
df = pd.DataFrame(['8/24/1995', '8/1/1899 :00'], pd.Index([1, 2], name='ID'), ['Date'])
dfreplace(df, r'\s:00', '')
rng = range(5)
df2 = pd.concat([pd.concat([df for _ in rng]) for _ in rng], axis=1)
df2
dfreplace(df2, r'\s:00', '')
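As an alternative to reshaping, DataFrame.replace with regex=True applies the substitution across every column at once. A sketch on the same example:

```python
import pandas as pd

df = pd.DataFrame(['8/24/1995', '8/1/1899 :00'],
                  pd.Index([1, 2], name='ID'), ['Date'])

# With regex=True, replace performs substring substitution frame-wide
cleaned = df.replace(r'\s:00', '', regex=True)
print(cleaned)
```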

Related

How to rename dataframe columns in specific way in Python

I have a dataframe (df) with column names as shown below, and I want to rename them according to a specific convention.
Renaming conditions:
Remove the underscore _ in the column name.
Change the first letter coming after the underscore from lowercase to uppercase.
Original Column Name
df.head(1)
risk_num start_date end_date
12 12-3-2022 25-3-2022
Expected Column Name
df.head(1)
riskNum startDate endDate
12 12-3-2022 25-3-2022
How can this be done in Python?
Use Index.map:
#https://stackoverflow.com/a/19053800/2901002
def to_camel_case(snake_str):
    components = snake_str.split('_')
    # We capitalize the first letter of each component except the first one
    # with the 'title' method and join them together.
    return components[0] + ''.join(x.title() for x in components[1:])
df.columns = df.columns.map(to_camel_case)
print (df)
riskNum startDate endDate
0 12 12-3-2022 25-3-2022
Or modify regex solution for pandas:
#https://stackoverflow.com/a/47253475/2901002
df.columns = df.columns.str.replace(r'_([a-zA-Z0-9])', lambda m: m.group(1).upper(), regex=True)
print (df)
riskNum startDate endDate
0 12 12-3-2022 25-3-2022
Use str.replace:
# Enhanced by #Ch3steR
df.columns = df.columns.str.replace('_(.)', lambda x: x.group(1).upper(), regex=True)
print(df)
# Output (original columns: risk_num start_date end_date very_long_column_name)
riskNum startDate endDate veryLongColumnName
0 12 12-3-2022 25-3-2022 0
The following code will do that for you
df.columns = [x[:x.find('_')]+x[x.find('_')+1].upper()+x[x.find('_')+2:] for x in df.columns]

Rename columns in dataframe using bespoke function python pandas

I've got a data frame with column names like 'AH_AP' and 'AH_AS'.
Essentially all I want to do is swap the part before the underscore and the part after the underscore, so that the column headers are 'AP_AH' and 'AS_AH'.
I can do that if the elements are in a list, but I've no idea how to get that to apply to column names.
My solution if it were a list goes like this:
columns = ['AH_AP', 'AH_AS']
def rejig_col_names():
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
I'm guessing I need to apply this to something like the below, but I've no idea how, or how to reference a single column within df.columns:
df.columns = df.columns.map()
Any help appreciated. Thanks :)
You can do it this way:
Input:
df = pd.DataFrame(data=[['1','2'], ['3','4']], columns=['AH_PH', 'AH_AS'])
print(df)
AH_PH AH_AS
0 1 2
1 3 4
Output:
df.columns = df.columns.str.split('_').str[::-1].str.join('_')
print(df)
PH_AH AS_AH
0 1 2
1 3 4
Explained:
Use string accessor and the split method on '_'
Then using the str accessor with index slicing reversing, [::-1], you
can reverse the order of the list
Lastly, using the string accessor and join, we can concatenate the
list back together again.
You were almost there: you can do
df.columns = df.columns.map(rejig_col_names)
except that the function gets called with a column name as argument, so change it like this:
def rejig_col_names(col_name):
    elements_of_header = col_name.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title
An alternative to the other answer. Using your function and DataFrame.rename
import pandas as pd
def rejig_col_names(columns):
    elements_of_header = columns.split('_')
    new_title = elements_of_header[-1] + "_" + elements_of_header[0]
    return new_title

data = {
    'A_B': [1, 2, 3],
    'C_D': [4, 5, 6],
}
df = pd.DataFrame(data)
df.rename(rejig_col_names, axis='columns', inplace=True)
print(df)
str.replace is also an option via swapping capture groups:
Sample input borrowed from ScottBoston
df = pd.DataFrame(data=[['1', '2'], ['3', '4']], columns=['AH_PH', 'AH_AS'])
Then capture everything before and after the '_' and swap capture groups 1 and 2:
df.columns = df.columns.str.replace(r'^(.*)_(.*)$', r'\2_\1', regex=True)
PH_AH AS_AH
0 1 2
1 3 4

Pythonic way to insert a DataFrame column and calculate its values from each column in a list

I have a DataFrame column (from my project here) that prints like this:
ticker 2021-02-11 21:04 2021-01-12_close 2020-02-11_close 2016-02-11_close
0 AAPL 134.94 128.607819 79.287888 21.787796
1 MSFT 244.20 214.929993 182.506607 45.343704
This gives a stock ticker and its current price followed by the close price on given dates. I am looking for a pythonic way to, after each X_close column, insert an X_return column and calculate the return between the current price and the X price. What is a good way to do this?
Thanks!
Edit: When I say "calculate the return", I mean, for example, to do:
((134.94 - 128.607819) / 128.607819) * 100
So, simply using div() or sub() isn't quite satisfactory.
Try:
df.filter to select the close columns,
then .sub to subtract the current-price column from them,
join the result back,
and sort the columns with sort_index. You may need to play with this.
All code:
df.join(df.filter(like='close').sub(df['2021-02-11 21:04'], axis=0)
.rename(columns=lambda x: x.replace('close','return'))
).sort_index(axis=1)
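Since the edit asks for a percentage return rather than a raw difference, the same filter/rename/join pattern can be adapted. A sketch, assuming the current-price column is named as in the question (only one close column shown for brevity):

```python
import pandas as pd

df = pd.DataFrame({
    'ticker': ['AAPL', 'MSFT'],
    '2021-02-11 21:04': [134.94, 244.20],
    '2021-01-12_close': [128.607819, 214.929993],
})

close = df.filter(like='close')
# percent return: (current - close) / close * 100
returns = (close.rsub(df['2021-02-11 21:04'], axis=0)
                .div(close)
                .mul(100)
                .rename(columns=lambda c: c.replace('close', 'return')))
out = df.join(returns).sort_index(axis=1)
print(out)
```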
Good question. The idea is to create the new columns first and concatenate them to the dataframe (here cols is the list of the _close columns):
df_returns = (df[cols].div(df["2021-02-11 21:04:00"], axis=0)).rename(columns = (lambda x: x.split('_')[0]+'_return'))
df_new = pd.concat([df, df_returns], axis=1).sort_index(axis=1)
Optionally, you could resort the indices for better graphic utility:
df_new[df_new.columns[:-3:-1].union(df_new.columns[:-2], sort=False)]
For a more customized approach, use the pandas apply method:
def foo(s: pd.Series):
    # Series-specific changes
    ans = pd.Series(index=s.index, dtype=float)
    for i in range(s.shape[0]):
        ans.iloc[i] = some_func(s.iloc[i])
    # Rename the series index for convenience
    return ans

df_returns = df[cols].apply(foo, axis=0)
Hope this helps! You can perform any ops you like in some_func().
Combining ideas from the answers given with my own, here is my solution:
def calculate_returns(df):
    print(df)
    print()
    # Get dataframe of return values
    returns_df = df.apply(calculate_return_row, axis=1)
    # Append returns df to close prices df
    df = pd.concat([df, returns_df], axis=1).sort_index(axis=1, ascending=False)
    # Rearrange columns so that close price precedes each respective return value
    return_cols = df.columns[2::2]
    close_cols = df.columns[3::2]
    reordered_cols = list(df.columns[0:2])
    reordered_cols = reordered_cols + [col for idx, _ in enumerate(return_cols) for col in [close_cols[idx], return_cols[idx]]]
    df = df[reordered_cols]
    print(df)
    return df

def calculate_return_row(row: pd.Series):
    current_price = row[1]
    close_prices = row[2:]
    returns = [calculate_return(current_price, close_price) for close_price in close_prices]
    index = [label.replace('close', 'return') for label in row.index[2:]]
    returns = pd.Series(returns, index=index)
    return returns

def calculate_return(current_val, initial_val):
    return (current_val - initial_val) / initial_val * 100
This avoids loops, and puts the return columns after the close columns:
ticker 2021-02-12 20:37 2021-01-13_close 2020-02-12_close 2016-02-12_close
0 AAPL 134.3500 130.694702 81.170799 21.855232
1 MSFT 243.9332 216.339996 182.773773 46.082863
ticker 2021-02-12 20:37 2021-01-13_close 2021-01-13_return 2020-02-12_close 2020-02-12_return 2016-02-12_close 2016-02-12_return
0 AAPL 134.3500 130.694702 2.796822 81.170799 65.515187 21.855232 514.726938
1 MSFT 243.9332 216.339996 12.754555 182.773773 33.461818 46.082863 429.336037
Thanks!

How to turn value in timestamp column into numbers

I have a dataframe:
id timestamp
1 "2025-08-02 19:08:59"
1 "2025-08-02 19:08:59"
1 "2025-08-02 19:09:59"
I need to turn timestamp into integer number to iterate over conditions. So it look like this:
id timestamp
1 20250802190859
1 20250802190859
1 20250802190959
You can convert the strings using the pandas str accessor:
df = pd.DataFrame({'id': [1, 1, 1],
                   'timestamp': ["2025-08-02 19:08:59",
                                 "2025-08-02 19:08:59",
                                 "2025-08-02 19:09:59"]})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df['timestamp'] = df['timestamp'].str.replace(r'[-\s:]', '', regex=True).astype('float64')
>>> df
id timestamp
0 1 20250802190859.000
1 1 20250802190859.000
2 1 20250802190959.000
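If you want actual integers rather than floats, one option is to parse the column with pd.to_datetime and format it back to a digits-only string before casting. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 1, 1],
                   'timestamp': ["2025-08-02 19:08:59",
                                 "2025-08-02 19:08:59",
                                 "2025-08-02 19:09:59"]})

# parse to datetime, re-emit as YYYYMMDDHHMMSS, then cast to int64
df['timestamp'] = (pd.to_datetime(df['timestamp'])
                     .dt.strftime('%Y%m%d%H%M%S')
                     .astype('int64'))
print(df['timestamp'].tolist())  # [20250802190859, 20250802190859, 20250802190959]
```

Going through to_datetime also validates that every value really is a timestamp, which a plain character strip does not.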
Have you tried opening the file, skipping the first line (or better: validating that it contains the expected header fields), and splitting each line at the first whitespace? The second part, e.g. "2025-08-02 19:08:59", can be parsed using datetime.fromisoformat(). You can then turn the datetime object back into a string using datetime.strftime(format) with e.g. format = '%Y%m%d%H%M%S'. Note that there is no "milliseconds" directive in strftime; you could use %f for microseconds.
Note: if datetime.fromisoformat() fails to parse the dates, try datetime.strptime(date_string, format) with a different format, e.g. format = '%Y-%m-%d %H:%M:%S'.
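The steps described above can be sketched in plain Python, without pandas:

```python
from datetime import datetime

def ts_to_int(ts: str) -> int:
    # parse the ISO-style timestamp, then re-emit it as digits only
    dt = datetime.fromisoformat(ts)
    return int(dt.strftime('%Y%m%d%H%M%S'))

print(ts_to_int("2025-08-02 19:08:59"))  # 20250802190859
```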
You can use the solutions provided in this post: How to turn timestamp into float number? and loop through the dataframe.
Let's say you have already imported pandas and have a dataframe df, see the additional code below:
import re
df = pd.DataFrame(l)
df1 = df.copy()
for x in range(len(df[0])):
    df1.loc[x, 0] = re.sub(r'\D', '', df[0][x])
This way you will not modify the original dataframe df and will get the desired output in a new dataframe df1.
Full code that I tried (including creation of the first dataframe); this might help remove any confusion:
import pandas as pd
import re
l = ["2025-08-02 19:08:59", "2025-08-02 19:08:59", "2025-08-02 19:09:59"]
df = pd.DataFrame(l)
df1 = df.copy()
for x in range(len(df[0])):
    df1.loc[x, 0] = re.sub(r'\D', '', df[0][x])

regexp match in pandas

I want to execute a regexp match on a dataframe column in order to modify its content.
For example, given this dataframe:
import pandas as pd
df = pd.DataFrame([['abra'], ['charmender'], ['goku']],
                  columns=['Name'])
print(df.head())
I want to execute the following regex match:
CASE
WHEN REGEXP_MATCH(Landing Page,'abra') THEN "kadabra"
WHEN REGEXP_MATCH(Landing Page,'charmender') THEN "charmaleon"
ELSE "Unknown" END
My solution is the following:
df.loc[df['Name'].str.contains("abra", na=False), 'Name'] = "kadabra"
df.loc[df['Name'].str.contains("charmender", na=False), 'Name'] = "charmeleon"
df.head()
It works but I do not know if there is a better way of doing it.
Moreover, I have to rewrite all the regex cases line by line in Python. Is there a way to execute the regex directly in Pandas?
Are you looking for map:
df['Name'] = df['Name'].map({'abra':'kadabra','charmender':'charmeleon'})
Output:
Name
0 kadabra
1 charmeleon
2 NaN
Update: For partial matches:
df = pd.DataFrame([['this abra'], ['charmender'], ['goku']],
columns=['Name'])
replaces = {'abra':'kadabra','charmender':'charmeleon'}
df['Name'] = df['Name'].str.extract(fr"\b({'|'.join(replaces.keys())})\b")[0].map(replaces)
And you get the same output (with different dataframe)
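If you want a closer analogue of the CASE expression itself, including the ELSE "Unknown" branch, numpy.select takes a list of conditions, a list of choices, and a default. A sketch on the original dataframe:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['abra'], ['charmender'], ['goku']], columns=['Name'])

conditions = [
    df['Name'].str.contains('abra', na=False),
    df['Name'].str.contains('charmender', na=False),
]
choices = ['kadabra', 'charmeleon']
# the default argument plays the role of the ELSE branch
df['Name'] = np.select(conditions, choices, default='Unknown')
print(df['Name'].tolist())  # ['kadabra', 'charmeleon', 'Unknown']
```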
