Changing format of date in pandas dataframe - python

I have a pandas dataframe in which a column is a string formatted as yyyymmdd, which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?

OK, so you want to select Monday-Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is lower than 5 (Monday-Friday --> 0-4):
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
    'date': [
        '20180101',
        '20180102',
        '20180103',
        '20180104',
        '20180105',
        '20180106',
        '20180107'
    ],
    'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
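Since the strings are in yyyymmdd form, you can also pass the format explicitly so the parsing is unambiguous (a small sketch, reusing the df from the example above; the weekday column is only added for illustration):
dates = pd.to_datetime(df['date'], format='%Y%m%d')
df['weekday'] = dates.dt.day_name()   # e.g. 'Monday', handy for sanity checks
df2 = df[dates.dt.dayofweek < 5]      # keep Monday-Friday only
print(df2)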

Pandas change values based on previous value in same column

I have the following dataframe:
import pandas as pd
import datetime
df = pd.DataFrame({'ID': [1, 2, 1, 1],
                   'Date': [datetime.date(year=2022, month=5, day=1),
                            datetime.date(year=2022, month=11, day=1),
                            datetime.date(year=2022, month=10, day=1),
                            datetime.date(year=2022, month=11, day=1)],
                   'Lifecycle ID': [5, 5, 5, 5]})
And I need to change the lifecycle based on the lifecycle 6 months ago: if it was 5, it should always become 6 (not +1).
I'm currently trying:
df.loc[(df["Date"] == (df["Date"] - pd.DateOffset(months=6))) & (df["Lifecycle ID"] == 5), "Lifecycle ID"] = 6
However, Pandas is not taking the ID into account, and I don't know how to include it.
The output should be this dataframe (with only the last Lifecycle ID changed to 6):
Could you please help me here?
The logic is not fully clear, but if I guess correctly:
# ensure datetime type
df['Date'] = pd.to_datetime(df['Date'])
# add the time delta to form a helper DataFrame
df2 = df.assign(Date=df['Date'].add(pd.DateOffset(months=6)))
# merge on ID/Date, retrieve "Lifecycle ID"
# and check if the value is 5
m = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID'].eq(5)
# if it is, update the value
df.loc[m, 'Lifecycle ID'] = 6
If you want to increment the value automatically from the value 6 months ago:
s = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID']
df.loc[s.notna(), 'Lifecycle ID'] = s.add(1)
Output:
ID Date Lifecycle ID
0 1 2022-05-01 5
1 2 2022-11-01 5
2 1 2022-10-01 5
3 1 2022-11-01 6
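For reference, a self-contained version of the first approach, using the dataframe from the question (the only assumption is that rows without a match 6 months earlier are left untouched):
import datetime
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 1],
                   'Date': [datetime.date(2022, 5, 1), datetime.date(2022, 11, 1),
                            datetime.date(2022, 10, 1), datetime.date(2022, 11, 1)],
                   'Lifecycle ID': [5, 5, 5, 5]})

df['Date'] = pd.to_datetime(df['Date'])

# helper frame with every Date shifted forward by 6 months
df2 = df.assign(Date=df['Date'].add(pd.DateOffset(months=6)))

# a row is updated only if the same ID had Lifecycle ID 5 exactly 6 months earlier
m = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID'].eq(5)
df.loc[m, 'Lifecycle ID'] = 6
print(df)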

How to check the dates in different columns?

I have a dataframe "expeditions" where there are 3 columns ("basecamp_date", "highpoint_date" and "termination_date"). I would like to check that the basecamp date is before the highpoint date and before the termination date because I noticed that there are rows where this is not the case (see picture)
Do you have any idea what I should do (a loop, a new dataframe...?)
Code
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
I would start by converting the three date columns to a datetime format to be extra sure:
for col in ['basecamp_date', 'highpoint_date', 'termination_date']:
    df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
And then follow it with the comparison using np.where():
import numpy as np

df['Check'] = np.where((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date']), True, False)
EDIT: Based on OP's follow-up question, I propose the following solution:
filtered = df[((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date'])) | (df.isna().sum(axis=1) != 0)]
Example:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaN 2008-05-03 2008-06-03
2 2008-01-04 2008-01-01 2009-01-01
Rows 0 and 1 should be kept: row 0 meets the date conditions, row 2 fails them, and row 1 has a null value, so the second part of the mask keeps it. Using the proposed code, the output is:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaT 2008-05-03 2008-06-03
You should convert your data to datetime format:
df['date_col'] = pd.to_datetime(df['date_col'])
and then do like this:
df[df['date_col1'] < df['date_col2']]
In your case, date_col would be your actual column names.
All the other answers are working solutions but I find this much easier:
df.query("basecamp_date < highpoint_date and basecamp_date
< termination_date")
Compare the columns with Series.ge, chain the masks with | (bitwise OR), and require all three dates to be non-null; this flags the rows where the basecamp date is not before the other two dates:
m = ((df['basecamp_date'].ge(df['highpoint_date']) |
      df['basecamp_date'].ge(df['termination_date'])) &
     df[['basecamp_date', 'termination_date', 'highpoint_date']].notna().all(1))
If you need to check the matched (problematic) rows:
df1 = df[m]
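A compact end-to-end sketch against the actual CSV (assuming only the three date columns named in the question matter):
import pandas as pd

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv"
cols = ["basecamp_date", "highpoint_date", "termination_date"]
expeditions = pd.read_csv(url, parse_dates=cols)

ok = (expeditions["basecamp_date"] < expeditions["highpoint_date"]) & \
     (expeditions["basecamp_date"] < expeditions["termination_date"])

# rows that fail the check (this also catches rows with missing dates,
# since comparisons against NaT evaluate to False)
print(expeditions.loc[~ok, cols].head())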

How to convert month number to datetime in pandas

I have followed the instructions from this thread, but have run into issues.
Converting month number to datetime in pandas
I think it may have to do with having an additional variable in my dataframe but I am not sure. Here is my dataframe:
   Month  Temp
0      0     2
1      1     4
2      2     3
What I want is:
     Month  Temp
0  1990-01     2
1  1990-02     4
2  1990-03     3
Here is what I have tried:
df= pd.to_datetime('1990-' + df.Month.astype(int).astype(str) + '-1', format = '%Y-%m')
And I get this error:
ValueError: time data 1990-0-1 doesn't match format specified
IIUC, the months are 0-based, so we can add 1 to every value, manually build a '1990-MM' string, then format it as your expected output:
m = df['Month'].add(1).astype(int).astype(str).str.zfill(2)

df['date'] = pd.to_datetime('1990-' + m, format='%Y-%m').dt.strftime('%Y-%m')
print(df)

   Month  Temp     date
0      0     2  1990-01
1      1     4  1990-02
2      2     3  1990-03
Use .dt.strftime() to control how the date is displayed, because datetime values are stored with full %Y-%m-%d 00:00:00 resolution by default. Note that with only %m supplied, the year defaults to 1900:
import pandas as pd
df= pd.DataFrame({'month':[1,2,3]})
df['date']=pd.to_datetime(df['month'], format="%m").dt.strftime('%Y-%m')
print(df)
You have to explicitly tell pandas to add 1 to the months, since in your case they are in the range 0-11 rather than 1-12.
df=pd.DataFrame({'month':[11,1,2,3,0]})
df['date']=pd.to_datetime(df['month']+1, format='%m').dt.strftime('1990-%m')
Here is my solution for you
import pandas as pd
Data = {
    'Month': [1, 2, 3],
    'Temp': [2, 4, 3]
}
data = pd.DataFrame(Data)
data['Month'] = pd.to_datetime('1990-' + data.Month.astype(int).astype(str) + '-1', format='%Y-%m-%d').dt.to_period('M')
Month Temp
0 1990-01 2
1 1990-02 4
2 1990-03 3
If Month 0 should mean January (as in your data), add 1 to the column before converting, as sketched below.
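A minimal sketch of that adjustment, assuming the 0-based Month column from the question:
data = pd.DataFrame({'Month': [0, 1, 2], 'Temp': [2, 4, 3]})
# shift the 0-based months to 1-12 before parsing
data['Month'] = pd.to_datetime(
    '1990-' + (data['Month'] + 1).astype(str) + '-1', format='%Y-%m-%d'
).dt.to_period('M')
print(data)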

Converting dataframe column of datetime data to DD/MM/YYYY string data

I have a dataframe column with datetime data in 1980-12-11T00:00:00 format.
I need to convert the whole column to DD/MM/YYYY string format.
Is there any easy code for this?
Creating a working example:
df = pd.DataFrame({'date':['1980-12-11T00:00:00', '1990-12-11T00:00:00', '2000-12-11T00:00:00']})
print(df)
date
0 1980-12-11T00:00:00
1 1990-12-11T00:00:00
2 2000-12-11T00:00:00
Convert the column to datetime with pd.to_datetime() and then call .dt.strftime():
df['date_new']=pd.to_datetime(df.date).dt.strftime('%d/%m/%Y')
print(df)
date date_new
0 1980-12-11T00:00:00 11/12/1980
1 1990-12-11T00:00:00 11/12/1990
2 2000-12-11T00:00:00 11/12/2000
You can use pd.to_datetime to convert string to datetime data
pd.to_datetime(df['col'])
You can also apply a specific output format:
pd.to_datetime(df['col']).dt.strftime('%d/%m/%Y')
When using pandas, try pandas.to_datetime:
import pandas as pd
df = pd.DataFrame({'date': ['1980-12-%sT00:00:00'%i for i in range(10,20)]})
df.date = pd.to_datetime(df.date).dt.strftime("%d/%m/%Y")
print(df)
date
0 10/12/1980
1 11/12/1980
2 12/12/1980
3 13/12/1980
4 14/12/1980
5 15/12/1980
6 16/12/1980
7 17/12/1980
8 18/12/1980
9 19/12/1980
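One caveat worth stating: .dt.strftime() returns plain strings, so the formatted column is no longer a datetime. A quick, self-contained check (re-creating a small df so the original ISO strings are untouched):
df = pd.DataFrame({'date': ['1980-12-11T00:00:00', '1990-12-11T00:00:00']})
df['date'] = pd.to_datetime(df['date'])                # datetime64[ns]
df['date_str'] = df['date'].dt.strftime('%d/%m/%Y')    # object (string) dtype
print(df.dtypes)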

shifting pandas series for only some entries

I've got a dataframe that has a Time Series (made up of strings) with some missing information:
# Generate a toy dataframe:
import pandas as pd
data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)
# df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 unknown
5 05:15:45
6 06:15:45
7 07:15:45
8 unknown
9 09:15:45
I would like the unknown entries to match the entry above, resulting in this dataframe:
# desired_df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
What is the best way to achieve this?
If you intend to work with this as time series data, I would recommend converting it to datetimes and then forward-filling the blanks:
import pandas as pd

data = {'Time': ['0' + str(i) + ':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)

df.Time = pd.to_datetime(df.Time, errors='coerce')   # 'unknown' becomes NaT
df = df.fillna(method='ffill')
However, if you are reading this data from a CSV file (or anything else loaded through a pandas.read_* function), use the na_values argument to treat 'unknown' as an NA value:
df = pd.read_csv('example.csv', na_values = 'unknown')
df = df.fillna(method='ffill')
You can also pass a list instead of a single string; the values you pass are added to pandas' existing list of NA markers.
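For example (a sketch, assuming the same example.csv as above, with 'missing' as a second hypothetical marker):
df = pd.read_csv('example.csv', na_values=['unknown', 'missing'])
df = df.fillna(method='ffill')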
However, if you want to keep the column as strings, I would recommend a conditional replacement with the previous row's value:
import numpy as np

df.Time = np.where(df.Time == 'unknown', df.Time.shift(), df.Time)
One way to do this would be using pandas' shift, creating a new column with the data in Time shifted by one, and dropping it. But there may be a cleaner way to achieve this:
# Create new column with the shifted time data
df['Time2'] = df['Time'].shift()
# Replace the data in Time with the data in your new column where necessary
df.loc[df['Time'] == 'unknown', 'Time'] = df.loc[df['Time'] == 'unknown', 'Time2']
# Drop your new column
df = df.drop('Time2', axis=1)
print(df)
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
EDIT: as pointed out by Zero, the new column step can be skipped altogether:
df.loc[df['Time'] == 'unknown', 'Time'] = df['Time'].shift()
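Another option, not in the answers above but worth a sketch: mask the 'unknown' markers as NaN and forward-fill, which keeps the column as strings and also handles consecutive unknowns:
df['Time'] = df['Time'].mask(df['Time'].eq('unknown')).ffill()
print(df)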
