I have a dataframe (df) with the column names shown below, and I want to rename them according to a specific rule.
Renaming conditions:
Remove the underscore from the column name.
Change the first letter after the underscore from lowercase to uppercase.
Original Column Name
df.head(1)
risk_num start_date end_date
12 12-3-2022 25-3-2022
Expected Column Name
df.head(1)
riskNum startDate endDate
12 12-3-2022 25-3-2022
How can this be done in Python?
Use Index.map:
#https://stackoverflow.com/a/19053800/2901002
def to_camel_case(snake_str):
    components = snake_str.split('_')
    # We capitalize the first letter of each component except the first one
    # with the 'title' method and join them together.
    return components[0] + ''.join(x.title() for x in components[1:])

df.columns = df.columns.map(to_camel_case)
print (df)
riskNum startDate endDate
0 12 12-3-2022 25-3-2022
Or modify regex solution for pandas:
#https://stackoverflow.com/a/47253475/2901002
df.columns = df.columns.str.replace(r'_([a-zA-Z0-9])', lambda m: m.group(1).upper(), regex=True)
print (df)
riskNum startDate endDate
0 12 12-3-2022 25-3-2022
Use str.replace (pass regex=True, since the default changed in newer pandas):
# Enhanced by @Ch3steR
df.columns = df.columns.str.replace(r'_(.)', lambda x: x.group(1).upper(), regex=True)
print(df)
# Output
# risk_num start_date end_date very_long_column_name
riskNum startDate endDate veryLongColumnName
0 12 12-3-2022 25-3-2022 0
The following list comprehension will also do it (note it only converts the first underscore in each name):
df.columns = [x[:x.find('_')] + x[x.find('_')+1].upper() + x[x.find('_')+2:] for x in df.columns]
Related
I have a df where some of the records in a column contain a prefix and some do not. I would like to update the records without the prefix. Unfortunately, my script adds the desired prefix to every record in the df:
new_list = []
prefix = 'x'
for ids in df['ids']:
    if ids.find(prefix) < 1:
        new_list.append(prefix + ids)
How can I omit the records that already have the prefix?
I've tried df[df['ids'].str.contains(prefix)], but I'm getting an error.
Use Series.str.startswith to build the mask and prepend the values with numpy.where:
import numpy as np
import pandas as pd

df = pd.DataFrame({'ids':['aaa','ssx','xwe']})
prefix = 'x'
df['ids'] = np.where(df['ids'].str.startswith(prefix), '', prefix) + df['ids']
print (df)
ids
0 xaaa
1 xssx
2 xwe
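An alternative sketch without numpy: build the same mask with str.startswith, but use boolean indexing with .loc to prepend the prefix only where it is missing (the sample frame is rebuilt here so the snippet runs on its own):

```python
import pandas as pd

df = pd.DataFrame({'ids': ['aaa', 'ssx', 'xwe']})
prefix = 'x'

# Rows whose 'ids' value does NOT already start with the prefix.
mask = ~df['ids'].str.startswith(prefix)

# Prepend the prefix only on those rows; 'xwe' is left untouched,
# so 'ids' becomes ['xaaa', 'xssx', 'xwe'].
df.loc[mask, 'ids'] = prefix + df.loc[mask, 'ids']
```

This modifies the frame in place and avoids building a separate list as in the question's loop.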
In a df comprising the columns asset_id, event_start_date, event_end_date,
I wish to add a fourth column datediff that, for each asset_id, captures how many days passed between an end_date and the following start_date for the same asset_id; but when that following start_date is earlier than the current end_date, I would like to capture the difference between the two start_dates instead. The dataset is sorted by (asset_id, start_date asc).
In Excel it would look something like this (screenshot omitted):
I tried:
events['datediff'] = df.groupby('asset_id').apply(
    lambda x: x['event_start_date'].shift(-1) - x['event_end_date']
              if x['event_start_date'].shift(-1) > x['event_end_date']
              else x['event_start_date'].shift(-1) - x['event_start_date']
).fillna(pd.Timedelta(seconds=0)).reset_index(drop=True)
But this is:
not working, throwing ValueError: The truth value of a Series is ambiguous.
inelegant.
Thanks!
import numpy as np
import pandas as pd

df = pd.DataFrame({
'asset_id':[0,0,1,1],
'event_start_date':['2019-07-08','2019-07-11','2019-07-15','2019-07-25'],
'event_end_date':['2019-07-08','2019-07-23','2019-07-29','2019-07-25']
})
df['event_end_date'] = pd.to_datetime(df['event_end_date'])
df['event_start_date'] = pd.to_datetime(df['event_start_date'])
df['next_start']=df.groupby('asset_id')['event_start_date'].shift(-1)
df['date_diff'] = np.where(
df['next_start']>df['event_end_date'],
(df['next_start']-df['event_end_date']).dt.days,
(df['next_start']-df['event_start_date']).dt.days
)
df = df.drop(columns=['next_start']).fillna(0)
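The same branching can also be written with Series.where instead of numpy.where; a sketch that rebuilds the sample frame above so it runs on its own:

```python
import pandas as pd

df = pd.DataFrame({
    'asset_id': [0, 0, 1, 1],
    'event_start_date': pd.to_datetime(['2019-07-08', '2019-07-11',
                                        '2019-07-15', '2019-07-25']),
    'event_end_date': pd.to_datetime(['2019-07-08', '2019-07-23',
                                      '2019-07-29', '2019-07-25']),
})

next_start = df.groupby('asset_id')['event_start_date'].shift(-1)

# Subtract from end_date when the next start is later than it,
# otherwise fall back to start_date.
base = df['event_end_date'].where(next_start > df['event_end_date'],
                                  df['event_start_date'])

# The last row per asset has no next start, so the diff is NaN -> 0.
df['date_diff'] = (next_start - base).dt.days.fillna(0)
```

On this sample, date_diff comes out as [3.0, 0.0, 10.0, 0.0], matching the np.where version.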
In Python, I have a DataFrame with a column 'Date' (format e.g. 2020-06-26). This column is sorted in descending order: 2020-06-26, 2020-06-25, 2020-06-24...
The other column, 'Reviews', contains text reviews of a website. The data can have multiple reviews on a given date and no reviews on another. I want to find which dates are missing from column 'Date'. Then, for each missing date, add one row with the date in format='%Y-%m-%d' and an empty review in 'Reviews', to be able to plot them. How should I do this?
from datetime import date, timedelta
d = data['Date']
print(d[0])
print(d[-1])
date_set = set(d[-1] + timedelta(x) for x in range((d[0] - d[-1]).days))
missing = sorted(date_set - set(d))
missing = pd.to_datetime(missing, format='%Y-%m-%d')
idx = pd.date_range(start=min(data.Date), end=max(data.Date), freq='D')
#tried this
data = data.reindex(idx, fill_value=0)
data.head()
#Got TypeError: 'fill_value' ('0') is not in this Categorical's categories.
#also tried this
df2 = (pd.DataFrame(data.set_index('Date'), index=idx).fillna(0) + data.set_index('Date')).ffill().stack()
df2.head()
#Got ValueError: cannot reindex from a duplicate axis
This is my code:
for i in range(len(df)):
    if i > 0:
        prev = df.loc[i-1]["Date"]
        current = df.loc[i]["Date"]
        for a in range((prev - current).days):
            if a > 0:
                df.loc[df["Date"].count()] = [prev - timedelta(days=a), None]
df = df.sort_values("Date", ascending=False)
print(df)
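A vectorized alternative sketch: compute the missing dates with Index.difference (which, unlike reindex, is not bothered by duplicate dates) and append one empty-review row per gap. The sample values here are made up for illustration:

```python
import pandas as pd

# Hypothetical data: two reviews on 2020-06-26, none on 2020-06-25.
data = pd.DataFrame({
    'Date': pd.to_datetime(['2020-06-26', '2020-06-26', '2020-06-24']),
    'Reviews': ['great site', 'ok', 'slow'],
})

# Every calendar day between the earliest and latest review dates.
full = pd.date_range(data['Date'].min(), data['Date'].max(), freq='D')

# Days with no review at all; duplicates in 'Date' are harmless here.
missing = full.difference(data['Date'])

# One row per missing day with an empty review, then re-sort descending.
filler = pd.DataFrame({'Date': missing, 'Reviews': ''})
data = (pd.concat([data, filler], ignore_index=True)
          .sort_values('Date', ascending=False, ignore_index=True))
```

This sidesteps both the "fill_value is not in this Categorical's categories" and the "cannot reindex from a duplicate axis" errors, because the existing rows are never reindexed.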
I am cleaning data in a pandas dataframe and want to split one column by another column.
I want to split column 'id' by column 'eNBID', but I don't know how.
import pandas as pd
id_list = ['4600375067649','4600375077246','460037495681','460037495694']
eNBID_list = ['750676','750772','749568','749569']
df=pd.DataFrame({'id':id_list,'eNBID':eNBID_list})
df.head()
id eNBID
4600375067649 750676
4600375077246 750772
460037495681 749568
460037495694 749569
What I want:
df.head()
id eNBID
460-03-750676-49 750676
460-03-750772-46 750772
460-03-749568-1 749568
460-03-749569-4 749569
#column 'eNBID' is the third part of column 'id', the item length in column 'eNBID' is 6 or 7.
Assuming the 46003 part remains the same for all ids:
import re
df['id'] = df.apply(lambda x: '-'.join([i[:3]+'-'+i[3:] if '460' in i else i for i in list(re.findall(r'(\w*)'+'('+x.eNBID+')'+r'(\w*)',x.id)[0])]), axis=1)
Output
id eNBID
0 460-03-750676-49 750676
1 460-03-750772-46 750772
2 460-03-749568-1 749568
3 460-03-749569-4 749569
Inserting '-' after the 3rd, 5th and 11th characters (which assumes a 6-digit eNBID):
df['id'] = df['id'].apply(lambda s: s[:3] + '-' + s[3:5] + '-' + s[5:11] + '-' + s[11:])
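Since the eNBID can have 6 or 7 digits, fixed slice positions only fit the 6-digit case. A sketch that slices by the actual eNBID length instead, assuming the eNBID always starts at position 5 (right after the 46003 part):

```python
import pandas as pd

id_list = ['4600375067649', '4600375077246', '460037495681', '460037495694']
eNBID_list = ['750676', '750772', '749568', '749569']
df = pd.DataFrame({'id': id_list, 'eNBID': eNBID_list})

def split_id(row):
    s, n = row['id'], len(row['eNBID'])
    # First 3 digits, next 2 digits, the n-digit eNBID, then the remainder.
    return '-'.join([s[:3], s[3:5], s[5:5 + n], s[5 + n:]])

df['id'] = df.apply(split_id, axis=1)
```

On the sample data this yields the expected '460-03-750676-49' style ids, and it would keep working for 7-digit eNBIDs as well.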
I have this simplified dataframe:
ID, Date
1 8/24/1995
2 8/1/1899 :00
How can I use the power of pandas to recognize any date in the dataframe which has an extra :00 and remove it?
Any idea how to solve this problem?
I have tried this syntax but did not help:
df[df["Date"].str.replace(to_replace="\s:00", value="")]
The Output Should Be Like:
ID, Date
1 8/24/1995
2 8/1/1899
You need to assign the trimmed column back to the original column instead of subsetting. Also, the str.replace method doesn't have to_replace and value parameters; it has pat and repl instead (and newer pandas needs regex=True for a regex pattern):
df["Date"] = df["Date"].str.replace(r"\s:00", "", regex=True)
df
# ID Date
#0 1 8/24/1995
#1 2 8/1/1899
To apply this to an entire dataframe, I'd stack then unstack
df.stack().str.replace(r'\s:00', '', regex=True).unstack()
functionalized
def dfreplace(df, *args, **kwargs):
    s = pd.Series(df.values.flatten())
    s = s.str.replace(*args, **kwargs)
    return pd.DataFrame(s.values.reshape(df.shape), df.index, df.columns)
Examples
df = pd.DataFrame(['8/24/1995', '8/1/1899 :00'], pd.Index([1, 2], name='ID'), ['Date'])
dfreplace(df, r'\s:00', '', regex=True)
rng = range(5)
df2 = pd.concat([pd.concat([df for _ in rng]) for _ in rng], axis=1)
df2
dfreplace(df2, r'\s:00', '', regex=True)
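In current pandas, the whole-frame cleanup can also be done directly with DataFrame.replace and regex=True, which avoids both the stack/unstack round-trip and the flatten/reshape helper (a sketch on the same sample frame):

```python
import pandas as pd

df = pd.DataFrame(['8/24/1995', '8/1/1899 :00'],
                  pd.Index([1, 2], name='ID'), ['Date'])

# With regex=True, replace treats the pattern as a regular expression
# and applies it to every string cell in the frame.
cleaned = df.replace(r'\s:00', '', regex=True)
```

Non-string cells are left alone, so this is safe on mixed-type frames as well.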