Changing format of date in pandas dataframe - python

I have a pandas dataframe in which a column is a string formatted as yyyymmdd, which should be a date. Is there an easy way to convert it to a recognizable form of date?
And then what python libraries should I use to handle them?
Let's say, for example, that I would like to consider all the events (rows) whose date field is a working day (so mon-fri). What is the smoothest way to handle such a task?

OK, so you want to select Monday-Friday. Do that by converting your column to datetime and checking whether dt.dayofweek is lower than 5 (Monday-Friday --> 0-4):
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
Full example:
import pandas as pd
df = pd.DataFrame({
    'date': [
        '20180101',
        '20180102',
        '20180103',
        '20180104',
        '20180105',
        '20180106',
        '20180107'
    ],
    'value': range(7)
})
m = pd.to_datetime(df['date']).dt.dayofweek < 5
df2 = df[m]
print(df2)
Returns:
date value
0 20180101 0
1 20180102 1
2 20180103 2
3 20180104 3
4 20180105 4
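Since the strings are in yyyymmdd form, you can also pass the format explicitly so the parsing is unambiguous (a small sketch, reusing the df from the example above; the weekday column is only added for illustration):
dates = pd.to_datetime(df['date'], format='%Y%m%d')
df['weekday'] = dates.dt.day_name()   # e.g. 'Monday', handy for sanity checks
df2 = df[dates.dt.dayofweek < 5]      # keep Monday-Friday only
print(df2)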

Pandas change values based on previous value in same column

I have the following dataframe:
import pandas as pd
import datetime
df = pd.DataFrame({'ID': [1, 2, 1, 1],
                   'Date': [datetime.date(year=2022, month=5, day=1),
                            datetime.date(year=2022, month=11, day=1),
                            datetime.date(year=2022, month=10, day=1),
                            datetime.date(year=2022, month=11, day=1)],
                   'Lifecycle ID': [5, 5, 5, 5]})
And I need to change the lifecycle based on the lifecycle 6 months ago: if it was 5, it should always become 6 (not +1).
I'm currently trying:
df.loc[(df["Date"] == (df["Date"] - pd.DateOffset(months=6))) & (df["Lifecycle ID"] == 5), "Lifecycle ID"] = 6
However, Pandas is not taking the ID into account, and I don't know how to include it.
The output should be this dataframe (with only the last Lifecycle ID changed to 6):
Could you please help me here?
The logic is not fully clear, but if I guess correctly:
# ensure datetime type
df['Date'] = pd.to_datetime(df['Date'])
# add the time delta to form a helper DataFrame
df2 = df.assign(Date=df['Date'].add(pd.DateOffset(months=6)))
# merge on ID/Date, retrieve "Lifecycle ID"
# and check if the value is 5
m = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID'].eq(5)
# if it is, update the value
df.loc[m, 'Lifecycle ID'] = 6
If you want to increment the value automatically from the value 6 months ago:
s = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID']
df.loc[s.notna(), 'Lifecycle ID'] = s.add(1)
Output:
ID Date Lifecycle ID
0 1 2022-05-01 5
1 2 2022-11-01 5
2 1 2022-10-01 5
3 1 2022-11-01 6
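For reference, a self-contained version of the first approach, using the dataframe from the question (the only assumption is that rows without a match 6 months earlier are left untouched):
import datetime
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 1, 1],
                   'Date': [datetime.date(2022, 5, 1), datetime.date(2022, 11, 1),
                            datetime.date(2022, 10, 1), datetime.date(2022, 11, 1)],
                   'Lifecycle ID': [5, 5, 5, 5]})

df['Date'] = pd.to_datetime(df['Date'])

# helper frame with every Date shifted forward by 6 months
df2 = df.assign(Date=df['Date'].add(pd.DateOffset(months=6)))

# a row is updated only if the same ID had Lifecycle ID 5 exactly 6 months earlier
m = df[['ID', 'Date']].merge(df2, how='left')['Lifecycle ID'].eq(5)
df.loc[m, 'Lifecycle ID'] = 6
print(df)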

How to check the dates in different columns?

I have a dataframe "expeditions" where there are 3 columns ("basecamp_date", "highpoint_date" and "termination_date"). I would like to check that the basecamp date is before the highpoint date and before the termination date because I noticed that there are rows where this is not the case (see picture)
Do you have any idea what I should do (a loop, a new dataframe...?)
Code
import pandas as pd
expeditions = pd.read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv")
I would start by converting the three date columns to a datetime format to be extra sure:
for col in ['basecamp_date', 'highpoint_date', 'termination_date']:
    df[col] = pd.to_datetime(df[col], infer_datetime_format=True)
And then follow it with the comparison using np.where():
import numpy as np

df['Check'] = np.where((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date']), True, False)
EDIT: Based on OP's follow-up question, I propose the following solution:
filtered = df[((df['basecamp_date'] < df['highpoint_date']) & (df['basecamp_date'] < df['termination_date'])) | (df.isna().sum(axis=1) != 0)]
Example:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaN 2008-05-03 2008-06-03
2 2008-01-04 2008-01-01 2009-01-01
Rows 0 and 1 should be kept: row 0 meets the date conditions, row 2 fails them, and row 1 has a null value, so the second part of the mask keeps it. Using the proposed code, the output is:
basecamp_date highpoint_date termination_date
0 2008-01-04 2008-05-01 2008-04-05
1 NaT 2008-05-03 2008-06-03
You should convert your data to datetime format:
df['date_col'] = pd.to_datetime(df['date_col'])
and then do like this:
df[df['date_col1'] < df['date_col2']]
In your case, date_col would be your actual column names.
All the other answers are working solutions but I find this much easier:
df.query("basecamp_date < highpoint_date and basecamp_date
< termination_date")
Compare the columns with Series.ge, chain the masks with | (bitwise OR), and require all three dates to be non-null; this flags the rows where the basecamp date is not before the other two dates:
m = ((df['basecamp_date'].ge(df['highpoint_date']) |
      df['basecamp_date'].ge(df['termination_date'])) &
     df[['basecamp_date', 'termination_date', 'highpoint_date']].notna().all(1))
If you need to check the matched (problematic) rows:
df1 = df[m]
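A compact end-to-end sketch against the actual CSV (assuming only the three date columns named in the question matter):
import pandas as pd

url = "https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-09-22/expeditions.csv"
cols = ["basecamp_date", "highpoint_date", "termination_date"]
expeditions = pd.read_csv(url, parse_dates=cols)

ok = (expeditions["basecamp_date"] < expeditions["highpoint_date"]) & \
     (expeditions["basecamp_date"] < expeditions["termination_date"])

# rows that fail the check (this also catches rows with missing dates,
# since comparisons against NaT evaluate to False)
print(expeditions.loc[~ok, cols].head())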

How to convert month number to datetime in pandas

I have followed the instructions from this thread, but have run into issues.
Converting month number to datetime in pandas
I think it may have to do with having an additional variable in my dataframe but I am not sure. Here is my dataframe:
   Month  Temp
0      0     2
1      1     4
2      2     3
What I want is:
     Month  Temp
0  1990-01     2
1  1990-02     4
2  1990-03     3
Here is what I have tried:
df= pd.to_datetime('1990-' + df.Month.astype(int).astype(str) + '-1', format = '%Y-%m')
And I get this error:
ValueError: time data 1990-0-1 doesn't match format specified
IIUC, the months are 0-based, so we can add 1 to every value, manually build a '1990-MM' string, then format it as your expected output:
m = df['Month'].add(1).astype(int).astype(str).str.zfill(2)

df['date'] = pd.to_datetime('1990-' + m, format='%Y-%m').dt.strftime('%Y-%m')
print(df)

   Month  Temp     date
0      0     2  1990-01
1      1     4  1990-02
2      2     3  1990-03
Use .dt.strftime() to control how the date is displayed, because datetime values are stored with full %Y-%m-%d 00:00:00 resolution by default. Note that with only %m supplied, the year defaults to 1900:
import pandas as pd
df= pd.DataFrame({'month':[1,2,3]})
df['date']=pd.to_datetime(df['month'], format="%m").dt.strftime('%Y-%m')
print(df)
You have to explicitly tell pandas to add 1 to the months, since in your case they are in the range 0-11 rather than 1-12.
df=pd.DataFrame({'month':[11,1,2,3,0]})
df['date']=pd.to_datetime(df['month']+1, format='%m').dt.strftime('1990-%m')
Here is my solution for you
import pandas as pd
Data = {
    'Month': [1, 2, 3],
    'Temp': [2, 4, 3]
}
data = pd.DataFrame(Data)
data['Month'] = pd.to_datetime('1990-' + data.Month.astype(int).astype(str) + '-1', format='%Y-%m-%d').dt.to_period('M')
Month Temp
0 1990-01 2
1 1990-02 4
2 1990-03 3
If Month 0 should mean January (as in your data), add 1 to the column before converting, as sketched below.
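A minimal sketch of that adjustment, assuming the 0-based Month column from the question:
data = pd.DataFrame({'Month': [0, 1, 2], 'Temp': [2, 4, 3]})
# shift the 0-based months to 1-12 before parsing
data['Month'] = pd.to_datetime(
    '1990-' + (data['Month'] + 1).astype(str) + '-1', format='%Y-%m-%d'
).dt.to_period('M')
print(data)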

Converting dataframe column of datetime data to DD/MM/YYYY string data

I have a dataframe column with datetime data in 1980-12-11T00:00:00 format.
I need to convert the whole column to DD/MM/YYYY string format.
Is there any easy code for this?
Creating a working example:
df = pd.DataFrame({'date':['1980-12-11T00:00:00', '1990-12-11T00:00:00', '2000-12-11T00:00:00']})
print(df)
date
0 1980-12-11T00:00:00
1 1990-12-11T00:00:00
2 2000-12-11T00:00:00
Convert the column to datetime with pd.to_datetime() and then call .dt.strftime():
df['date_new']=pd.to_datetime(df.date).dt.strftime('%d/%m/%Y')
print(df)
date date_new
0 1980-12-11T00:00:00 11/12/1980
1 1990-12-11T00:00:00 11/12/1990
2 2000-12-11T00:00:00 11/12/2000
You can use pd.to_datetime to convert string to datetime data
pd.to_datetime(df['col'])
You can also apply a specific output format:
pd.to_datetime(df['col']).dt.strftime('%d/%m/%Y')
When using pandas, try pandas.to_datetime:
import pandas as pd
df = pd.DataFrame({'date': ['1980-12-%sT00:00:00'%i for i in range(10,20)]})
df.date = pd.to_datetime(df.date).dt.strftime("%d/%m/%Y")
print(df)
date
0 10/12/1980
1 11/12/1980
2 12/12/1980
3 13/12/1980
4 14/12/1980
5 15/12/1980
6 16/12/1980
7 17/12/1980
8 18/12/1980
9 19/12/1980
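One caveat worth stating: .dt.strftime() returns plain strings, so the formatted column is no longer a datetime. A quick, self-contained check (re-creating a small df so the original ISO strings are untouched):
df = pd.DataFrame({'date': ['1980-12-11T00:00:00', '1990-12-11T00:00:00']})
df['date'] = pd.to_datetime(df['date'])                # datetime64[ns]
df['date_str'] = df['date'].dt.strftime('%d/%m/%Y')    # object (string) dtype
print(df.dtypes)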

shifting pandas series for only some entries

I've got a dataframe that has a Time Series (made up of strings) with some missing information:
# Generate a toy dataframe:
import pandas as pd
data = {'Time': ['0'+str(i)+':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)
# df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 unknown
5 05:15:45
6 06:15:45
7 07:15:45
8 unknown
9 09:15:45
I would like the unknown entries to match the entry above, resulting in this dataframe:
# desired_df
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
What is the best way to achieve this?
If you intend to work with this as time series data, I would recommend converting it to datetimes and then forward-filling the blanks:
import pandas as pd

data = {'Time': ['0' + str(i) + ':15:45' for i in range(10)]}
data['Time'][4] = 'unknown'
data['Time'][8] = 'unknown'
df = pd.DataFrame(data)

df.Time = pd.to_datetime(df.Time, errors='coerce')   # 'unknown' becomes NaT
df = df.fillna(method='ffill')
However, if you are reading this data from a CSV file (or anything else loaded through a pandas.read_* function), use the na_values argument to treat 'unknown' as an NA value:
df = pd.read_csv('example.csv', na_values = 'unknown')
df = df.fillna(method='ffill')
You can also pass a list instead of a single string; the values you pass are added to pandas' existing list of NA markers.
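For example (a sketch, assuming the same example.csv as above, with 'missing' as a second hypothetical marker):
df = pd.read_csv('example.csv', na_values=['unknown', 'missing'])
df = df.fillna(method='ffill')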
However, if you want to keep the column as strings, I would recommend a conditional replacement with the previous row's value:
import numpy as np

df.Time = np.where(df.Time == 'unknown', df.Time.shift(), df.Time)
One way to do this would be using pandas' shift, creating a new column with the data in Time shifted by one, and dropping it. But there may be a cleaner way to achieve this:
# Create new column with the shifted time data
df['Time2'] = df['Time'].shift()
# Replace the data in Time with the data in your new column where necessary
df.loc[df['Time'] == 'unknown', 'Time'] = df.loc[df['Time'] == 'unknown', 'Time2']
# Drop your new column
df = df.drop('Time2', axis=1)
print(df)
Time
0 00:15:45
1 01:15:45
2 02:15:45
3 03:15:45
4 03:15:45
5 05:15:45
6 06:15:45
7 07:15:45
8 07:15:45
9 09:15:45
EDIT: as pointed out by Zero, the new column step can be skipped altogether:
df.loc[df['Time'] == 'unknown', 'Time'] = df['Time'].shift()
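Another option, not in the answers above but worth a sketch: mask the 'unknown' markers as NaN and forward-fill, which keeps the column as strings and also handles consecutive unknowns:
df['Time'] = df['Time'].mask(df['Time'].eq('unknown')).ffill()
print(df)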
