Date formatting in pandas columns - python

I have 2 two data frames.
Date thing
201712.0 1
201801.0 2
The Date column is float64 type and I am trying to convert it to date of 12/1/2017 and 1/1/2018 respectively.
Date thing2
12/16/2017 2
1/16/2018 3
The Date column here is object type and I hope to convert to 12/1/2017 and 1/1/2018 as well. The idea here is to do a pd.merge after.

You need:
df['Date'] = pd.to_datetime(df['Date'], format='%Y%m') + pd.Timedelta(days=16)
Output:
Date thing
0 2017-12-16 1
1 2018-01-16 2

Using pandas.to_datetime to convert the 'Date' columns of your original dataframes:
df1 = pd.DataFrame([[201712.0, 1], [201801.0, 2]], columns=["Date", "thing"])
df2 = pd.DataFrame([["12/16/2017", 2], ["1/16/2018", 3]], columns=["Date", "thing2"])
df1['Date'] = pd.to_datetime(df1['Date'].astype(str), format='%Y%m.0')
df2['Date'] = pd.to_datetime(df2['Date']).apply(lambda x : x.replace(day=1))
In the first dataframe, 'Date' column is converted to string type (the .astype(str)) stuff) in order to use a format string.
In the second dataframe, apply function is used to reset the day of the month to the first from whatever it was in the beginning.

Related

How to populate date in a dataframe using pandas in python

I have a dataframe with two columns, Case and Date. Here Date is actually the starting date. I want to populate it as a time series, saying add three (month_num) more dates to each case and removing the original ones.
original dataframe:
Case Date
0 1 2010-01-01
1 2 2011-04-01
2 3 2012-08-01
after populating dates:
Case Date
0 1 2010-02-01
1 1 2010-03-01
2 1 2010-04-01
3 2 2011-05-01
4 2 2011-06-01
5 2 2011-07-01
6 3 2012-09-01
7 3 2012-10-01
8 3 2012-11-01
I tried to declare an empty dataframe with the same column names and data type, and used for loop to loop over Case and month_num, and add rows into the new dataframe.
import pandas as pd
data = [[1, '2010-01-01'], [2, '2011-04-01'], [3, '2012-08-01']]
df = pd.DataFrame(data, columns = ['Case', 'Date'])
df.Date = pd.to_datetime(df.Date)
df_new = pd.DataFrame(columns=df.columns)
df_new['Case'] = pd.to_numeric(df_new['Case'])
df_new['Date'] = pd.to_datetime(df_new['Date'])
month_num = 3
for c in df.Case:
for m in range(1, month_num+1):
temp = df.loc[df['Case']==c]
temp['Date'] = temp['Date'] + pd.DateOffset(months=m)
df_new = pd.concat([df_new, temp])
df_new.reset_index(inplace=True, drop=True)
My code can work, however, when the original dataframe and month_num become large, it took huge time to run. Are there any better ways to do what I need? Thanks a alot!!
Your performance issue is probably related to the use of pd.concat inside the inner for loop. This answer explains why.
As the answer suggests, you may want to use an external list to collect all the dataframes you create in the for loop, and then concatenate once the list.
Given your input data this is what worked on my notebook:
df2=pd.DataFrame()
df2['Date']=df['Date'].apply(lambda x: pd.date_range(start=x, periods=3,freq='M')).explode()
df3=pd.merge_asof(df2,df,on='Date')
df3['Date']=df3['Date']+ pd.DateOffset(days=1)
df3[['Case','Date']]
We create a df2 to which we populate 'Date' with the needed dates coming from the original df
Then df3 resulting of a merge_asof between df2 and df (to populate the 'Case' column)
Finally , we offset the resulting column off 1 day

How do I convert an object to a week number in datetime

I am a beginner in Python and I am trying to change column names that currently represent the week number, to something easier to digest. I wanted to change them to show the date of the week commencing but I am having issues with converting the types.
I have a table that looks similar to the following:
import pandas as pd
data = [[0,'John',1,2,3]
df = pd.dataframe(data, columns = ['Index','Owner','32.0','33.0','34.0']
print(df)
I tried to use df.melt to get a column with the week numbers and then convert them to datetime and obtain the week commencing from that but I have not been successfull.
df = df.melt(id_vars=['Owner'])
df['variable'] = pd.to_datetime(df['variable'], format = %U)
This is as far as I have gotten as I have not been able to obtain the week number as a datetime type to then use it to get the week commencing.
After this, I was going to then transform the dataframe back to its original shape and have the newly obtained week commencing date times as the column headers again.
Can anyone advise me on what I am doing wrong, or alternatively is there a better way to do this?
Any help would be greatly appreciated!
Add Index column to melt first for only week values in variable, then convert to floats, integers and strings, so possible match by weeks:
data = [[0,'John',1,2,3]]
df = pd.DataFrame(data, columns = ['Index','Owner','32.0','33.0','34.0'])
print(df)
Index Owner 32.0 33.0 34.0
0 0 John 1 2 3
df = df.melt(id_vars=['Index','Owner'])
s = df['variable'].astype(float).astype(int).astype(str) + '-0-2021'
print (s)
0 32-0-2021
1 33-0-2021
2 34-0-2021
Name: variable, dtype: object
#https://stackoverflow.com/a/17087427/2901002
df['variable'] = pd.to_datetime(s, format = '%W-%w-%Y')
print (df)
Index Owner variable value
0 0 John 2021-08-15 1
1 0 John 2021-08-22 2
2 0 John 2021-08-29 3
EDIT:
For get original DataFrame (integers columns for weeks) use DataFrame.pivot:
df1 = (df.pivot(index=['Index','Owner'], columns='variable', values='value')
.rename_axis(None, axis=1))
df1.columns = df1.columns.strftime('%W')
df1 = df1.reset_index()
print (df1)
Index Owner 32 33 34
0 0 John 1 2 3
One solution to convert a week number to a date is to use a timedelta. For example you may have
from datetime import timedelta, datetime
week_number = 5
first_monday_of_the_year = datetime(2021, 1, 3)
week_date = first_monday_of_the_year + timedelta(weeks=week_number)

How to load this kind of data in pandas

Background: I have logs which are generated during the testing of the devices after manufacture. Each device has a serial number and a corresponding csv log file with all the data. Something like this.
DATE,TESTSTEP,READING,LIMIT,RESULT
01/01/2019 07:37:17.432 AM,1,23,10,FAIL
01/01/2019 07:37:23.661 AM,2,3,3,PASS
So there are many such log files. Each with the test data.
I have the the serial number of devices which failed in field. I want to create a model using these log files. And then use it to predict if the given device has a chance of failing in field given its log file.
Till now as a part of learning, I have worked with data like housing price. Every row was complete. Depending on area, number of rooms etc, it was easy to define a model for expected selling price.
Here I wish to find a way to somehow flatten all the logs into a single row. I am thinking of having something like:
DATE_1,TESTSTEP_1,READING_1,LIMIT_1,RESULT_1,DATE_2,TESTSTEP_2,READING_2,LIMIT_2,RESULT_2
1/1/2019 07:37:17.432 AM,1,23,10,FAIL,01/01/2019 07:37:23.661 AM,2,3,3,PASS
Is this the right way to deal with this kind of data?
If so, then does Pandas has any inbuilt support for this?
I will be using scikit-learn to create models.
First convert columns to ordered CategoricalIndex for same order of columns in output, convert DATE column by to_datetime and convert datetimes to dates by Series.dt.date with cumcount for counter, create MultiIndex by set_index, reshape by unstack and sort second level of MultiIndex in columns by sort_index. Last flatten it by list comprehension with reset_index:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates, g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index(drop=True)
print (df)
DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS
If need also dates in separate first column:
df['DATE'] = pd.to_datetime(df['DATE'])
dates = df['DATE'].dt.date
df.columns = pd.CategoricalIndex(df.columns,categories=df.columns, ordered=True)
g = df.groupby(dates).cumcount().add(1)
df = df.set_index([dates.rename('DAT'), g]).unstack().sort_index(axis=1, level=1)
df.columns = [f'{a}_{b}' for a, b in df.columns]
df = df.reset_index()
print (df)
DAT DATE_1 TESTSTEP_1 READING_1 LIMIT_1 RESULT_1 \
0 2019-01-01 2019-01-01 07:37:17.432 1 23 10 FAIL
DATE_2 TESTSTEP_2 READING_2 LIMIT_2 RESULT_2
0 2019-01-01 07:37:23.661 2 3 3 PASS

Create new column based on another column for a multi-index Panda dataframe

I'm running Python 3.5 on Windows and writing code to study financial econometrics.
I have a multi-index panda dataframe where the level=0 index is a series of month-end dates and the level=1 index is a simple integer ID. I want to create a new column of values ('new_var') where for each month-end date, I look forward 1-month and get the values from another column ('some_var') and of course the IDs from the current month need to align with the IDs for the forward month. Here is a simple test case.
import pandas as pd
import numpy as np
# Create some time series data
id = np.arange(0,5)
date = [pd.datetime(2017,1,31)+pd.offsets.MonthEnd(i) for i in [0,1]]
my_data = []
for d in date:
for i in id:
my_data.append((d, i, np.random.random()))
df = pd.DataFrame(my_data, columns=['date', 'id', 'some_var'])
df['new_var'] = np.nan
df.set_index(['date', 'id'], inplace=True)
# Drop an observation to reflect my true data
df.drop(('2017-02-28',3), level=None, inplace=True)
df
# The desired output....
list1 = df.loc['2017-01-31'].index.labels[1].tolist()
list2 = df.loc['2017-02-28'].index.labels[1].tolist()
common = list(set(list1) & set(list2))
for i in common:
df.loc[('2017-01-31', i)]['new_var'] = df.loc[('2017-02-28', i)]['some_var']
df
I feel like there is a better way to get my desired output. Maybe I should just embrace the "for" loop? Maybe a better solution is to reset the index?
Thank you,
F
I would create a integer column representing the date, substrate one from it (to shift it by one month) and the merge the value left on back to the original dataframe.
Out[28]:
some_var
date id
2017-01-31 0 0.736003
1 0.248275
2 0.844170
3 0.671364
4 0.034331
2017-02-28 0 0.051586
1 0.894579
2 0.136740
4 0.902409
df = df.reset_index()
df['n_group'] = df.groupby('date').ngroup()
df_shifted = df[['n_group', 'some_var','id']].rename(columns={'some_var':'new_var'})
df_shifted['n_group'] = df_shifted['n_group']-1
df = df.merge(df_shifted, on=['n_group','id'], how='left')
df = df.set_index(['date','id']).drop('n_group', axis=1)
Out[31]:
some_var new_var
date id
2017-01-31 0 0.736003 0.051586
1 0.248275 0.894579
2 0.844170 0.136740
3 0.671364 NaN
4 0.034331 0.902409
2017-02-28 0 0.051586 NaN
1 0.894579 NaN
2 0.136740 NaN
4 0.902409 NaN

Converting DDMMMYYYY to date format in pandas

I have dates in a DataFrame's column like:
1 06AUG2010
2 07APR2011
I want to convert them to a type, where i can count diffrences between dates in days.
I'm searching the internet for the answer, but cant find it. New to pandas.
You can use to_datetime with custom format:
df = pd.DataFrame({'date':['06AUG2010','07APR2011']}, index=[1,2])
print (df)
date
1 06AUG2010
2 07APR2011
df['date'] = pd.to_datetime(df['date'], format='%d%b%Y')
print (df)
date
1 2010-08-06
2 2011-04-07
And then for differences add diff:
df['date'] = df['date'].diff()
print (df)
date
1 NaT
2 244 days

Categories