String manipulation pandas - python

I have a column named Timestamp of type str, and I would like to change the values of the column to a more appropriate format, i.e. 353 to 3:53 pm.
How can I do this using pandas or appropriate string manipulation?
c = pd.DataFrame({"Timestamp":x,"Latitude":y,"Longitude":z})
c.head()

This will also work:
from datetime import datetime
c['Timestamp'].apply(lambda x: datetime.strptime(x.rjust(4, '0'), '%H%M').strftime('%H:%M'))

You can call apply on the column and pass a function that will split each string and insert a colon:
c['Timestamp'].apply(lambda x: x[0:-2] + ':' + x[-2:])

As @ChrisA mentioned in the comments, you can also do this:
c['Timestamp'] = pd.to_datetime(c['Timestamp'], format='%H%M').dt.time
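For reference, here is a minimal self-contained sketch of the three approaches above, assuming the Timestamp column holds strings such as "353" and "1415" (the sample data below is hypothetical, since x, y and z are not shown in the question):
import pandas as pd
from datetime import datetime

# Hypothetical sample data standing in for the question's x, y, z
c = pd.DataFrame({"Timestamp": ["353", "1415", "930"]})

# String slicing: insert a colon before the last two characters
c["sliced"] = c["Timestamp"].apply(lambda s: s[:-2] + ":" + s[-2:])

# datetime round-trip: left-pad to 4 digits, parse as HHMM, format back
c["parsed"] = c["Timestamp"].apply(
    lambda s: datetime.strptime(s.rjust(4, "0"), "%H%M").strftime("%H:%M"))

# pandas-native: to_datetime with an explicit format, keeping only the time
# (zero-padding first so that 3-digit values parse cleanly)
c["as_time"] = pd.to_datetime(c["Timestamp"].str.zfill(4), format="%H%M").dt.time

print(c)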

Related

How to fix a datetime column which has string datatype?

I have a column of dates, but not all of the rows are of DateTime type; some rows have a string datatype, such as Jan'11, Jan 11, Jan-11. My idea is to replace the separators "'" and " " with "-", and then convert the column to datetime format.
So this is the code that I tried:
replacers = {"'":"-"," ":"-"}
df['Periode1'] = df['Periode1'].replace(replacers)
but when I check df[df['Periode1']=="Jan'11"] it still has not changed to "Jan-11". How can I solve this?
The issue is that the .replace method looks at the entire contents of each object in the series, and since there is more than just the separator in each date, the method does nothing. A better approach would be to apply a lambda function to the series:
df['Periode1'] = df['Periode1'].apply(lambda x: str(x).replace("'", "-").replace(" ", "-"))
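If the end goal is an actual datetime column, a small follow-up sketch (assuming the normalized values all follow the Jan-11 pattern, i.e. %b-%y) could chain the cleanup with pd.to_datetime:
import pandas as pd

# Hypothetical sample; the question shows values like Jan'11, Jan 11 and Jan-11
df = pd.DataFrame({"Periode1": ["Jan'11", "Jan 11", "Jan-11"]})

# Normalize the separators, then parse as month-year (assumed format %b-%y)
cleaned = df["Periode1"].apply(lambda x: str(x).replace("'", "-").replace(" ", "-"))
df["Periode1"] = pd.to_datetime(cleaned, format="%b-%y")
print(df)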

Pandas apply multiple functions with a list

I have a df with a 'File_name' column which contains file-name strings that I would like to parse:
data = [['f1h3_13oct2021_gt1.csv', 2], ['p8-gfr-20dec2021-81.csv', 0.5]]
df = pd.DataFrame(data, columns=['File_name', 'Result'])
df.head()
Now I would like to create a new column where I split the file name on the '_' and '-' delimiters and then search the resulting list for the piece that can be converted to a datetime object. The naming convention is not always the same (the order differs, so I cannot rely on character positions), and the code should include a "try" conversion to datetime, as the piece of string that should be the date is often in the wrong format or missing.
I came up with the following, but it does not really look Pythonic to me:
# Solution #1
import datetime as dt

for i, value in df['File_name'].iteritems():
    chunks = value.split('-') + value.split('_')
    for chunk in chunks:
        try:
            df.loc[i, 'Date_Sol#1'] = dt.datetime.strptime(chunk, '%d%b%Y')
        except:
            pass
df.head()
Alternatively, I was trying to use the apply method with the two functions, but I cannot think of a way to chain the two functions together with the try/pass logic, and I did not manage to get it working.
# Solution #2
import re
splitme = lambda x: re.split('_|-', x)
calcdate = lambda x : dt.datetime.strptime(x, '%d%b%Y')
df['t1'] = df['File_name'].apply(splitme)
df['Date_Sol#2'] = df['t1'].apply(lambda x: calcdate(x) for x in df['t1'] if isinstance(calcdate(x), dt.datetime) else Pass)
df.head()
I thought a list comprehension might help?
Any idea how Solution #2 might look?
Thanks in advance
Assuming you want to extract and convert the possible chunks as date, you could split the string on delimiters, explode to multiple rows and attempt to convert to date with pandas.to_datetime:
df.join(pd
    .to_datetime(df['File_name']
                 .str.split(r'[_-]')
                 .explode(), errors='coerce')
    .dropna().rename('Date')
)
output:
                 File_name  Result       Date
0   f1h3_13oct2021_gt1.csv     2.0 2021-10-13
1  p8-gfr-20dec2021-81.csv     0.5 2021-12-20
NB. if you have potentially many dates per string, you need to add a further step to select the one you want. Please give more details if this is the case.
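For example, one possible further step (purely an assumption about which date you would want) is to keep the first parsed date per row:
# Sketch: keep only the first successfully parsed date per original row
dates = (pd.to_datetime(df['File_name'].str.split(r'[_-]').explode(), errors='coerce')
           .dropna()
           .groupby(level=0)
           .first()
           .rename('Date'))
df.join(dates)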
Python version for old pandas:
import re
s = pd.Series([next(iter(pd.to_datetime(re.split(r'[._-]', s), errors='coerce')
                         .dropna()), float('nan'))
               for s in df['File_name']], index=df.index, name='date')
df.join(s)

How to split a column twice using Pandas?

I have a column whose values look like {'duration': 0, 'is_incoming': False}
I want to fetch 0 and False out of this. How do I split it using Python (pandas)?
I tried - data["additional_data"] = data["additional_data"].apply(lambda x :",".join(x.split(":")[:-1]))
I want two columns Duration and Incoming_Time
How do I do this?
You can try converting those strings to actual dicts:
from ast import literal_eval
Finally:
out = pd.DataFrame(df['additional_data'].astype(str).map(literal_eval).tolist())
Now if you print out, you will get your expected output.
If needed, use the join() method:
df = df.join(out)
Now if you print df, you will get your expected result.
If your column additional_data contains real dict / json, you can directly use the string accessor .str[] to get the dict values by keys, as follows:
data['Duration'] = data['additional_data'].str['duration']
data['Incoming_Time'] = data['additional_data'].str['is_incoming']
If your column additional_data contains strings of dicts (the dict enclosed in a pair of single or double quotes), you need to convert the strings to dicts first:
from ast import literal_eval
data['Duration'] = data['additional_data'].map(literal_eval).str['duration']
data['Incoming_Time'] = data['additional_data'].map(literal_eval).str['is_incoming']
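Either way, a minimal end-to-end sketch, assuming the column holds string representations of dicts and using the column and key names from the question:
import pandas as pd
from ast import literal_eval

# Hypothetical data: strings that look like dicts, as described in the question
data = pd.DataFrame({"additional_data": ["{'duration': 0, 'is_incoming': False}",
                                         "{'duration': 42, 'is_incoming': True}"]})

parsed = data["additional_data"].map(literal_eval)   # str -> dict
data["Duration"] = parsed.str["duration"]            # .str[] indexes into each dict
data["Incoming_Time"] = parsed.str["is_incoming"]
print(data)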

Adding list of time delta to a list of datetime objects?

I have one list which has objects in the format "00:07:00", and I would like to add it to another list which has objects in the "2016-09-02 14:41:00" format.
When I checked the type of "00:07:00", it said "pandas._libs.tslibs.timedeltas.Timedelta".
And for "2016-09-02 14:41:00", it said "pandas._libs.tslibs.timestamps.Timestamp".
How can I add the two lists?
You can convert the timestamp column to time only, using the example below:
df['date_time'].dt.time
Then add the two columns having the same type, i.e. timedelta:
df['new_time'] = df['first_time'] + df['date_time']
I don't have pandas installed, but according to the documentation, pandas Timestamps and Timedeltas can be added. So something like this might work:
result = [t + d for t, d in zip(timestamps, deltas)]
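A quick runnable sketch of that idea, with hypothetical lists built from the formats shown in the question:
import pandas as pd

# Hypothetical lists matching the formats in the question
timestamps = [pd.Timestamp("2016-09-02 14:41:00"), pd.Timestamp("2016-09-02 15:10:00")]
deltas = [pd.Timedelta("00:07:00"), pd.Timedelta("00:12:00")]

# Timestamp + Timedelta yields a new Timestamp
result = [t + d for t, d in zip(timestamps, deltas)]
print(result)  # [Timestamp('2016-09-02 14:48:00'), Timestamp('2016-09-02 15:22:00')]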

Convert dataframe with whitespaces to numeric, obstacle - whitespaces (e.g. 3 014.0 i.e. '3\xa0014.0')

I have a dataframe where, instead of the expected numerical values, data of type "object" was stored. The values look like 3 014.0, i.e. '3\xa0014.0', instead of 3014.0; the whitespace characters (i.e. '\xa0') create a problem for conversion.
Question: Is there some way to convert it to numeric?
Strange thing: it appears that I can convert a single element:
float( df.iloc[0,0].replace('\xa0', '') )  # works
but the same does NOT work for the whole series:
df['p1'].astype('str').replace('\xa0','')  # does nothing
I also tried pd.to_numeric, which gives: Unable to parse string.
Data example:
df.iloc[0:3,0]
2017-10-10 11:32:49.895023    3 014.0
2017-10-10 11:33:11.612169    3 013.5
2017-10-10 11:33:22.488124    3 013.0
Name: p1, dtype: object
df.iloc[0,0]:
'3\xa0014.0'
Use this instead: df['p1'] = df['p1'].apply(lambda x: float(x.replace('\xa0','')))
df.iloc[0,0] is a string, while df['p1'] is a pandas Series. The replace method of a string and the replace method of a Series behave differently: when you call replace on a Series, pandas attempts to replace whole elements.
For example,
df = pd.DataFrame({'name': ['alexander']})
df['name'].replace('a', 'x')          # does nothing
df['name'].replace('alexander', 'x')  # replaces the value alexander with x
df['p1'].apply(lambda x: float(x.replace('\xa0',''))) applies the replace method to each element (which happens to be a string) in the column p1. You can read more about the method here.
Hope this makes things clearer :)
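A vectorized alternative (just a sketch, assuming the values only need the '\xa0' characters stripped) is Series.str.replace, which, unlike Series.replace, substitutes inside each string, followed by pd.to_numeric:
import pandas as pd

# Hypothetical data matching the question's '3\xa0014.0' examples
df = pd.DataFrame({"p1": ["3\xa0014.0", "3\xa0013.5", "3\xa0013.0"]})

# str.replace works within each string, so the non-breaking spaces are removed
df["p1"] = pd.to_numeric(df["p1"].str.replace("\xa0", "", regex=False))
print(df["p1"])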
