Converting DDMMMYYYY to date format in pandas - python

I have dates in a DataFrame's column like:
1 06AUG2010
2 07APR2011
I want to convert them to a type that lets me compute the differences between dates in days.
I've searched the internet for an answer but can't find one. I'm new to pandas.

You can use to_datetime with a custom format:
df = pd.DataFrame({'date':['06AUG2010','07APR2011']}, index=[1,2])
print (df)
date
1 06AUG2010
2 07APR2011
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
print (df)
date
1 2010-08-06
2 2011-04-07
Then use diff to get the differences between consecutive rows:
df['date'] = df['date'].diff()
print (df)
date
1 NaT
2 244 days
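If plain integer day counts are needed rather than Timedelta values, the .dt.days accessor can be applied to the result. The following is a minimal sketch under that assumption; the column name gap_days is hypothetical and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({'date': ['06AUG2010', '07APR2011']}, index=[1, 2])

# Parse day, abbreviated month name and four-digit year
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')

# diff() yields Timedelta values; .dt.days extracts whole days as numbers
# (the first row has no predecessor, so it is NaN)
df['gap_days'] = df['date'].diff().dt.days
print(df)
```

Unlike the answer above, this keeps the parsed dates in the date column and stores the differences separately.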

Related

How do I convert an object to a week number in datetime

I am a beginner in Python and I am trying to change column names that currently represent the week number, to something easier to digest. I wanted to change them to show the date of the week commencing but I am having issues with converting the types.
I have a table that looks similar to the following:
import pandas as pd
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
I tried to use df.melt to get a column with the week numbers, convert them to datetime, and obtain the week commencing date from that, but I have not been successful.
df = df.melt(id_vars=['Owner'])
df['variable'] = pd.to_datetime(df['variable'], format='%U')
This is as far as I have gotten as I have not been able to obtain the week number as a datetime type to then use it to get the week commencing.
After this, I was going to then transform the dataframe back to its original shape and have the newly obtained week commencing date times as the column headers again.
Can anyone advise me on what I am doing wrong, or alternatively is there a better way to do this?
Any help would be greatly appreciated!
First add the Index column to id_vars in melt so that variable contains only the week values. Then convert those values to floats, then integers, then strings, and append a weekday and year so they can be parsed as week numbers:
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
Index Owner 32.0 33.0 34.0
0 0 John 1 2 3
df = df.melt(id_vars=['Index','Owner'])
s = df['variable'].astype(float).astype(int).astype(str) + '-0-2021'
print (s)
0 32-0-2021
1 33-0-2021
2 34-0-2021
Name: variable, dtype: object
#https://stackoverflow.com/a/17087427/2901002
df['variable'] = pd.to_datetime(s, format = '%W-%w-%Y')
print (df)
Index Owner variable value
0 0 John 2021-08-15 1
1 0 John 2021-08-22 2
2 0 John 2021-08-29 3
EDIT:
To get back the original DataFrame shape (with integer week numbers as columns), use DataFrame.pivot:
df1 = (df.pivot(index=['Index','Owner'], columns='variable', values='value')
.rename_axis(None, axis=1))
df1.columns = df1.columns.strftime('%W')
df1 = df1.reset_index()
print (df1)
Index Owner 32 33 34
0 0 John 1 2 3
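If only the column labels need to change, the melt/pivot round trip can be skipped. The following is a minimal sketch, assuming the weeks belong to 2021 (as in the answer above) and that "week commencing" means the Monday of a %W-style week; the helper name week_commencing is hypothetical.

```python
from datetime import datetime

import pandas as pd

df = pd.DataFrame([[0, 'John', 1, 2, 3]],
                  columns=['Index', 'Owner', '32.0', '33.0', '34.0'])

YEAR = 2021  # assumption: the week numbers refer to this year


def week_commencing(label, year=YEAR):
    """Turn a label such as '32.0' into the Monday of that %W week."""
    week = int(float(label))
    # %W = week of year (Monday first), %w = weekday (1 is Monday)
    monday = datetime.strptime(f'{week}-1-{year}', '%W-%w-%Y')
    return monday.strftime('%Y-%m-%d')


week_cols = [c for c in df.columns if c not in ('Index', 'Owner')]
renamed = df.rename(columns={c: week_commencing(c) for c in week_cols})
print(renamed.columns.tolist())
```

The answer above uses weekday 0 (Sunday), which with %W is the last day of the week; weekday 1 is used here to get the Monday instead.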
Another way to convert a week number to a date is to use a timedelta. For example:
from datetime import timedelta, datetime
week_number = 5
# 4 January is the first Monday of 2021 (3 January is a Sunday)
first_monday_of_the_year = datetime(2021, 1, 4)
# the Monday that falls week_number weeks after the first Monday
week_date = first_monday_of_the_year + timedelta(weeks=week_number)
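The timedelta idea can be generalised so the first Monday is computed rather than hard-coded. This is a sketch only, assuming weeks are numbered so that week 1 begins on the first Monday of the year (the %W convention); the function name week_start is hypothetical.

```python
from datetime import date, timedelta


def week_start(year, week):
    """Return the Monday that begins the given week (week 1 = first Monday)."""
    jan1 = date(year, 1, 1)
    # date.weekday(): Monday is 0, so this is the number of days to the next Monday
    first_monday = jan1 + timedelta(days=(7 - jan1.weekday()) % 7)
    return first_monday + timedelta(weeks=week - 1)


print(week_start(2021, 32))
```

Note the week - 1 offset: adding the full week number to the first Monday would land one week too late under this numbering.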

Drop rows based on date condition

I have a Pandas DataFrame called new in which the YearMonth column has date in the format of YYYY-MM. I want to drop the rows based on the condition: if the date is beyond "2020-05". I tried using this:
new = new.drop(new[new.YearMonth>'2020-05'].index)
but it's not working; it raises a syntax error, "invalid token".
Here is a sample DataFrame:
>>> new = pd.DataFrame({
'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']
})
>>> print(new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
The expected DataFrame after the drop should be:
YearMonth
0 2014-09
1 2014-10
Just convert the column to datetime, convert that to monthly periods, and then subset it:
import pandas as pd
new['YearMonth']=pd.to_datetime(new['YearMonth']).dt.to_period('M')
new=new[~(new['YearMonth']>'2020-05')]
I think you want boolean indexing with > changed to <=, so that the rows to keep are selected directly. Comparing monthly periods works nicely:
new = pd.DataFrame({
'YearMonth': pd.to_datetime(['2014-09', '2014-10', '2020-09', '2021-09']).to_period('M')
})
print (new)
YearMonth
0 2014-09
1 2014-10
2 2020-09
3 2021-09
df = new[new.YearMonth <= pd.Period('2020-05', freq='M')]
print (df)
YearMonth
0 2014-09
1 2014-10
In the newest versions of pandas, comparing the periods against a string also works:
df = new[new.YearMonth <= '2020-05']
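The filter can also be applied without changing the stored strings, by building the periods only for the comparison. A minimal sketch, assuming the YearMonth column holds 'YYYY-MM' strings as in the question; the names periods and kept are hypothetical.

```python
import pandas as pd

new = pd.DataFrame({
    'YearMonth': ['2014-09', '2014-10', '2020-09', '2021-09']
})

# Build monthly periods for the comparison only; the column itself stays as strings
periods = pd.PeriodIndex(new['YearMonth'], freq='M')

# Keep rows that are not beyond 2020-05
kept = new[periods <= pd.Period('2020-05', freq='M')]
print(kept)
```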

Date formatting in pandas columns

I have two data frames.
Date thing
201712.0 1
201801.0 2
The Date column is float64 type and I am trying to convert it to date of 12/1/2017 and 1/1/2018 respectively.
Date thing2
12/16/2017 2
1/16/2018 3
The Date column here is object type and I hope to convert to 12/1/2017 and 1/1/2018 as well. The idea here is to do a pd.merge after.
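To move the first DataFrame's dates to the 16th of each month instead, so that they match the second DataFrame, you need:
df['Date'] = pd.to_datetime(df['Date'].astype(int).astype(str), format='%Y%m') + pd.Timedelta(days=15)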
Output:
Date thing
0 2017-12-16 1
1 2018-01-16 2
Using pandas.to_datetime to convert the 'Date' columns of your original dataframes:
df1 = pd.DataFrame([[201712.0, 1], [201801.0, 2]], columns=["Date", "thing"])
df2 = pd.DataFrame([["12/16/2017", 2], ["1/16/2018", 3]], columns=["Date", "thing2"])
df1['Date'] = pd.to_datetime(df1['Date'].astype(str), format='%Y%m.0')
df2['Date'] = pd.to_datetime(df2['Date']).apply(lambda x : x.replace(day=1))
In the first dataframe, the 'Date' column is converted to string type (the .astype(str) call) so that a format string can be used.
In the second dataframe, apply is used to reset the day of the month to the first, whatever it was originally.
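Since the stated goal is a pd.merge afterwards, the whole flow might look like the sketch below. It assumes both 'Date' columns should be normalised to the first day of the month and that an inner join on 'Date' is wanted; the name merged is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame([[201712.0, 1], [201801.0, 2]], columns=["Date", "thing"])
df2 = pd.DataFrame([["12/16/2017", 2], ["1/16/2018", 3]], columns=["Date", "thing2"])

# 201712.0 -> 201712 -> '201712' -> 2017-12-01
df1['Date'] = pd.to_datetime(df1['Date'].astype(int).astype(str), format='%Y%m')

# Parse month/day/year, then snap each date to the first day of its month
df2['Date'] = (pd.to_datetime(df2['Date'], format='%m/%d/%Y')
               .dt.to_period('M')
               .dt.to_timestamp())

# Inner join on the normalised month-start dates
merged = df1.merge(df2, on='Date')
print(merged)
```

Going through a monthly period is one alternative to the apply/replace approach above; both leave the day set to 1.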

Pandas: How to sort dataframe rows by date of one column

I have two different DataFrames and I concatenated them. All columns are the same; however, the date column has all sorts of different dates in the M/D/YR format.
dataframe dates get shuffled around later in the sequence
Is there a way to keep the whole DataFrame and just sort the rows by the dates in the date column? I also want to keep the format the date is in.
So basically:
date people
6/8/2015 1
7/10/2018 2
6/5/2015 0
gets converted into:
date people
6/5/2015 0
6/8/2015 1
7/10/2018 2
Thank you!
PS: I've tried the options in the other post on this, but they do not work.
To elaborate on what can be done:
Initialize/merge the DataFrame and convert the column to datetime type:
df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'], 'people': [1, 2, 0]})
df['date'] = pd.to_datetime(df['date'], format="%m/%d/%Y")
print(df)
Output:
date people
0 2015-06-08 1
1 2018-07-10 2
2 2015-06-05 0
Sort by date:
df = df.sort_values('date')
print(df)
Output:
date people
2 2015-06-05 0
0 2015-06-08 1
1 2018-07-10 2
Convert back to month/day/year strings (note that strftime zero-pads the month and day):
df['date'] = df['date'].dt.strftime('%m/%d/%Y')
print(df)
Output:
date people
2 06/05/2015 0
0 06/08/2015 1
1 07/10/2018 2
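To keep the date strings exactly as they were (without zero-padding), the rows can be sorted by a parsed key while the column itself is left untouched. A minimal sketch, assuming pandas 1.1 or later, where sort_values accepts a key argument; the name sorted_df is hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'date': ['6/8/2015', '7/10/2018', '6/5/2015'],
                   'people': [1, 2, 0]})

# The key function parses the strings only for ordering;
# the stored values are not modified
sorted_df = df.sort_values(
    'date',
    key=lambda s: pd.to_datetime(s, format='%m/%d/%Y'),
)
print(sorted_df)
```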
Try converting the 'date' column to pandas datetime and then sorting:
import pandas as pd
df = pd.DataFrame({'date': ['4/12/1961', '5/5/1961', '7/21/1961', '8/6/1961'],
'people': [1, 1, 1, 2]})
df['date'] = pd.to_datetime(df['date'])
df = df.sort_values(by='date')
Output:
date people
1961-04-12 1
1961-05-05 1
1961-07-21 1
1961-08-06 2
To convert back to month/day/year strings (this uses a two-digit year; use %Y for a four-digit year as in the original data):
df['date']=df['date'].dt.strftime('%m/%d/%y')
Output:
date people
04/12/61 1
05/05/61 1
07/21/61 1
08/06/61 2
Why not simply use the following?
dataset[SortBy["date"]]
Can you show what you tried, or how your data is structured?
If you need to sort in reverse order, use:
dataset[SortBy["date"]][Reverse]

Get data into monthly datetime index

I have a pandas DataFrame that looks like the one below:
Start Date End Date
1/1/1990 7/1/2014
7/1/2005 5/1/2013
8/1/1997 8/1/2004
9/1/2001
I'd like to show, in a DatetimeIndex, how many items had started but not yet ended in each month. What I want it to look like is illustrated below.
Date Count
4/1/2013 3
5/1/2013 2
6/1/2013 2
7/1/2013 2
So far I have created a series that creates a string combining the start and finish dates and sums up all items with the same start and end dates.
1/1/19907/1/2014 1
7/1/20055/1/2013 1
8/1/19978/1/2004 1
9/1/2001 1
And I have a dataframe with the datetimeindex looking as follows:
4/1/2013
5/1/2013
6/1/2013
7/1/2013
Now I'm struggling to combine the two to get what I'm looking for. I'm probably thinking about this all wrong and was looking for better ideas.
You can try:
print(df1)
Start Date End Date
0 1/1/1990 7/1/2014
1 7/1/2005 5/1/2013
2 8/1/1997 8/1/2004
3 9/1/2001 NaN
print(df2)
Index: [4/1/2013, 5/1/2013, 6/1/2013, 7/1/2013]
#drop rows with a missing value in Start Date or End Date
df1 = df1.dropna(subset=['Start Date','End Date'])
#convert columns to datetime and then to month period
df1['Start Date'] = pd.to_datetime(df1['Start Date']).dt.to_period('M')
df1['End Date'] = pd.to_datetime(df1['End Date']).dt.to_period('M')
#create new column from datetimeindex and convert it to month period
df2['Date'] = pd.DatetimeIndex(df2.index).to_period('M')
print(df1)
Start Date End Date
0 1990-01 2014-07
1 2005-07 2013-05
2 1997-08 2004-08
print(df2)
Date
Date
4/1/2013 2013-04
5/1/2013 2013-05
6/1/2013 2013-06
7/1/2013 2013-07
#stack data for resampling
df1 = df1.stack().reset_index(drop=True, level=1).reset_index(name='Date')
print(df1)
index Date
0 0 1990-01
1 0 2014-07
2 1 2005-07
3 1 2013-05
4 2 1997-08
5 2 2004-08
#resample by column index
#(the how= argument of resample was removed from pandas; call .first() on the resampler instead)
df = df1.groupby(df1['index']).apply(lambda x: x.set_index('Date').resample('M').first()).reset_index(level=1)
#remove unnecessary column index
df = df.drop('index', axis=1)
print(df.head())
Date
index
0 1990-01
0 1990-02
0 1990-03
0 1990-04
0 1990-05
#merge df and df2 by column Date, groupby by Date and count
print(pd.merge(df, df2, on='Date').groupby('Date')['Date'].count())
Date
2013-04 2
2013-05 2
2013-06 1
2013-07 1
Freq: M, Name: Date, dtype: int64
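A more direct way to get the counts is to test each month against the start and end dates. The sketch below is not the answer's method; it assumes that an item counts as active in every month from its start date up to, but not including, the month of its end date, and that a missing end date means the item is still running. Under those assumptions it reproduces the counts requested in the question. The names start, end, months and counts are hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Start Date': ['1/1/1990', '7/1/2005', '8/1/1997', '9/1/2001'],
    'End Date': ['7/1/2014', '5/1/2013', '8/1/2004', None],
})

start = pd.to_datetime(df1['Start Date'], format='%m/%d/%Y')
end = pd.to_datetime(df1['End Date'], format='%m/%d/%Y')  # None becomes NaT

# Month-start dates to report on
months = pd.date_range('2013-04-01', '2013-07-01', freq='MS')

# For each month, count items that have started and have not yet ended
counts = pd.Series(
    [int(((start <= m) & (end.isna() | (end > m))).sum()) for m in months],
    index=months,
    name='Count',
)
print(counts)
```

Changing end > m to end >= m would instead count an item as active during the month in which it ends.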
