How to discretize a datetime column? - python

I have a dataset with a datetime column covering one month, and I need to divide it into two blocks (day and night, i.e. am/pm) and then discretize the time in each block into 10-minute bins. I could add another column of 0s and 1s to show whether it is am or pm, but I cannot work out the discretization. Can you please help me with it?
df['started_at'] = pd.to_datetime(df['started_at'])
df['start hour'] = df['started_at'].dt.hour.astype('int')
df['mor/aft'] = np.where(df['start hour'] < 12, 1, 0)
df['started_at']
0 16:05:36
2 06:22:40
3 16:08:10
4 12:28:57
6 15:47:30
...
3084526 15:24:24
3084527 16:33:07
3084532 14:08:12
3084535 09:43:46
3084536 17:02:26

If I understood correctly, you are trying to add a column for every ten-minute interval that indicates whether an observation falls in that interval.
You can use lambda expressions to loop through each observation of the series.
Integer-dividing the minutes by 10 gives the tens digit of the minutes, and from that you can build the indicator columns.
I also included how to extract the day indicator column with a lambda expression for comparison; it builds the same kind of flag as your np.where().
import pandas as pd

# make dataframe
df = pd.DataFrame({
    'started_at': ['14:20:56',
                   '00:13:24',
                   '16:01:33']
})

# convert column to datetime
df['started_at'] = pd.to_datetime(df['started_at'])

# make day indicator column (1 for pm, 0 for am)
df['day'] = df['started_at'].apply(lambda ts: 1 if ts.hour >= 12 else 0)

# make an indicator column for every ten minutes
for i in range(24):
    for j in range(6):
        col = 'hour_' + str(i) + '_min_' + str(j) + '0'
        df[col] = df['started_at'].apply(lambda ts: 1 if ts.minute // 10 == j and ts.hour == i else 0)

print(df)
Output first columns:
started_at day hour_0_min_00 hour_0_min_10 hour_0_min_20
0 2021-11-21 14:20:56 1 0 0 0
1 2021-11-21 00:13:24 0 0 1 0
2 2021-11-21 16:01:33 1 0 0 0
...
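For completeness, here is a minimal vectorized sketch of the same idea that produces a single bin label per row instead of one indicator column per bin; the am_pm, bin_10min and bin_start column names are illustrative, not from the original post.
import pandas as pd

df = pd.DataFrame({'started_at': ['14:20:56', '00:13:24', '16:01:33']})
df['started_at'] = pd.to_datetime(df['started_at'])

# 0/1 flag for am/pm (pm includes the 12 o'clock hour)
df['am_pm'] = (df['started_at'].dt.hour >= 12).astype(int)

# index of the 10-minute bin within the day: hour*6 + minute//10, i.e. 0..143
df['bin_10min'] = df['started_at'].dt.hour * 6 + df['started_at'].dt.minute // 10

# or keep the bin as a timestamp rounded down to its 10-minute boundary
df['bin_start'] = df['started_at'].dt.floor('10min')

print(df)
If the one-indicator-column-per-bin layout is still needed, pd.get_dummies(df['bin_10min']) can expand the single bin column.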

Related

Count valuable (more than n times) repetitions of a pandas time series

I want to count the duration of every phase in my series. By phase I mean a run of consecutive 1s or 0s, for example:
rng = pd.date_range('2015-02-24', periods=15, freq='T')
s = pd.Series([0,1,1,1,0,0,1,0,1,0,1,1,1,1,0],index=rng)
I would like as output:
phase0 -> zeros: 1 minute, ones: 3 minutes
phase1 -> zeros: 6 minutes, ones: 4 minutes
etc.
In this case 'valuable' means a run repeated >= 3 times.
I was able to remove the 1s with too few repetitions with this:
index_to_remove=s.groupby((s.shift() != s).cumsum()).filter(lambda x: len(x) < 3).index
And now I can set the elements at those indices to 0 in the original time series.
s[index_to_remove]=0
What is missing is counting the minutes of every phase.
Can someone help me? I'm interested in a smart way of doing it. I'm not proud of what I've used so far, so if you can give me a better way I will appreciate it.
Thank you all.
*** I know I should work with s.diff(): when this new time series goes from 1 to -1 it is a phase of ones, while when it goes from -1 to 1 it is a phase of zeros.
I think you need to aggregate min and max per group, take the difference, convert it to minutes (adding 1 minute), and reshape to a DataFrame:
# faster solution for setting 0 by run length per group
m = s.groupby((s.shift() != s).cumsum()).transform('size') < 3
s[m] = 0
# create groups for 0,1 pairs
res = (s.eq(0) & s.shift().eq(1)).cumsum()
print(res)
df = s.index.to_series().groupby([res, s]).agg(['min','max'])
df = (df['max'].sub(df['min'])
      .dt.total_seconds()
      .div(60)
      .add(1)
      .unstack(fill_value=0)
      .astype(int)
      .rename_axis('phase'))
print(df)
0 1
phase
0 1 3
1 6 4
2 1 0
*** This is the best solution I found:
from itertools import groupby
groups = groupby(s)
result = [(label, sum(1 for _ in group)) for label, group in groups]
but I can't handle grouping the 0 and 1 runs together into phases.
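On that last point, here is a minimal sketch of pairing the runs produced by itertools.groupby into zero/one phases (assuming, as in the expected output, that a phase is a run of zeros followed by the run of ones after it; the runs and phases names are illustrative):
from itertools import groupby

import pandas as pd

rng = pd.date_range('2015-02-24', periods=15, freq='min')
s = pd.Series([0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0], index=rng)

# zero out runs shorter than 3, as in the question
m = s.groupby((s.shift() != s).cumsum()).transform('size') < 3
s[m] = 0

# run-length encode: [(value, length in minutes), ...]
runs = [(label, sum(1 for _ in group)) for label, group in groupby(s)]

# pair each run of zeros with the run of ones that follows it
phases = []
i = 0
while i < len(runs):
    zeros = ones = 0
    if runs[i][0] == 0:
        zeros = runs[i][1]
        i += 1
    if i < len(runs) and runs[i][0] == 1:
        ones = runs[i][1]
        i += 1
    phases.append({'zeros': zeros, 'ones': ones})

print(phases)
# [{'zeros': 1, 'ones': 3}, {'zeros': 6, 'ones': 4}, {'zeros': 1, 'ones': 0}]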

How to add calculated rows below each row in a pandas DataFrame

I have a dataframe_1 as such:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 3.459 min begin test
3 7.009 min end of test
And I would like to add multiple new rows in between each of dataframe_1's rows, where the Time column for each new row would add an additional minute until reaching dataframe_1's next row's time (and corresponding Label). For example, the above table should ultimately look like this:
Index Time Label
0 0.000 ns Segment 1
1 2.749 sec baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 3.459 min begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 7.009 min end of test
Using Timedelta type via pd.to_timedelta() is perfectly fine.
I thought the best way to do this would be to break up each row of dataframe_1 into its own dataframe, add rows for each added minute, and then concat the dataframes back together. However, I am unsure of how to accomplish this.
Should I use a nested for-loop to [first] iterate over the rows of dataframe_1 and then [second] iterate over a counter so I can create new rows with added minutes?
I was previously not splitting up the individual rows into new dataframes, and I was doing the second iteration like this:
baseline_row = df_legend[df_legend['Label'] == 'baseline']
[baseline_index] = baseline_row.index
baseline_time = baseline_row['Time']
interval_mins = 1
new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
cutoff_time_np = df_legend.iloc[baseline_row.index + 1]['Time']
cutoff_time = pd.to_timedelta(cutoff_time_np)
while new_time.reset_index(drop=True).get(0) < cutoff_time.reset_index(drop=True).get(0):
    new_row = baseline_row.copy()
    new_row['Label'] = f'minute {interval_mins}'
    new_row['Time'] = baseline_time + pd.Timedelta(minutes=interval_mins)
    new_row.index = [baseline_index + interval_mins - 0.5]
    df_legend = df_legend.append(new_row, ignore_index=False)
    df_legend = df_legend.sort_index().reset_index(drop=True)
    pdb.set_trace()
    interval_mins += 1
    new_time = baseline_time + pd.Timedelta(minutes=interval_mins)
But since I want to do this for each row in the original dataframe_1, I was thinking of splitting it up into separate dataframes and putting it back together. I'm just not sure of the best way to do that, especially since pandas is apparently very slow when iterating over rows.
I would really appreciate some guidance.
This might be faster than your solution.
df.Time = pd.to_timedelta(df.Time)
df['counts'] = df.Time.diff().apply(lambda x: x.total_seconds()) / 60
df['counts'] = np.floor(df.counts.shift(-1)).fillna(0).astype(int)
df.drop(columns='Index', inplace=True)
df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
2 00:03:27.540000 begin test 3
3 00:07:00.540000 end of test 0
Then use iterrows to get your desired output.
new_df = []
for _, row in df.iterrows():
    val = row.counts
    if val == 0:
        new_df.append(row)
    else:
        new_df.append(row)
        new_row = row.copy()
        label = row.Label
        for i in range(val):
            new_row = new_row.copy()
            new_row.Time += pd.Timedelta('1 min')
            new_row.Label = f'{label} + {i+1}min'
            new_df.append(new_row)
new_df = pd.DataFrame(new_df)
new_df
Time Label counts
0 00:00:00 Segment 1 0
1 00:00:02.749000 baseline 3
1 00:01:02.749000 baseline + 1min 3
1 00:02:02.749000 baseline + 2min 3
1 00:03:02.749000 baseline + 3min 3
2 00:03:27.540000 begin test 3
2 00:04:27.540000 begin test + 1min 3
2 00:05:27.540000 begin test + 2min 3
2 00:06:27.540000 begin test + 3min 3
3 00:07:00.540000 end of test 0
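A vectorized alternative is also possible; the following is only a sketch, assuming Time is already a Timedelta column and counts has been computed as above (the offset_min helper column is mine, not from the answer):
import pandas as pd

# repeat each row counts + 1 times (the original row plus one row per extra minute)
expanded = df.loc[df.index.repeat(df['counts'] + 1)].copy()

# minute offset within each original row: 0, 1, 2, ...
expanded['offset_min'] = expanded.groupby(level=0).cumcount()
expanded['Time'] = expanded['Time'] + pd.to_timedelta(expanded['offset_min'], unit='min')

# relabel only the inserted rows
extra = expanded['offset_min'] > 0
expanded.loc[extra, 'Label'] = (expanded.loc[extra, 'Label']
                                + ' + ' + expanded.loc[extra, 'offset_min'].astype(str) + 'min')

print(expanded.drop(columns='offset_min').reset_index(drop=True))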
I assume that you converted Time column from "number unit" format to a string
representation of the time. Something like:
Time Label
Index
0 00:00:00.000 Segment 1
1 00:00:02.749 baseline
2 00:03:27.540 begin test
3 00:07:00.540 end of test
Then, to get your result:
Compute timNxt - the Time column shifted by 1 position and converted
to datetime:
timNxt = pd.to_datetime(df.Time.shift(-1))
Define the following "replication" function:
def myRepl(row):
    timCurr = pd.to_datetime(row.Time)
    timNext = timNxt[row.name]
    tbl = [[timCurr.strftime('%H:%M:%S.%f'), row.Label]]
    if pd.notna(timNext):
        n = (timNext - timCurr) // np.timedelta64(1, 'm') + 1
        tbl.extend([[(timCurr + np.timedelta64(i, 'm')).strftime('%H:%M:%S.%f'),
                     row.Label + f' + {i}min'] for i in range(1, n)])
    return pd.DataFrame(tbl, columns=row.index)
Apply it to each row of your df and concatenate results:
result = pd.concat(df.apply(myRepl, axis=1).tolist(), ignore_index=True)
The result is:
Time Label
0 00:00:00.000000 Segment 1
1 00:00:02.749000 baseline
2 00:01:02.749000 baseline + 1min
3 00:02:02.749000 baseline + 2min
4 00:03:02.749000 baseline + 3min
5 00:03:27.540000 begin test
6 00:04:27.540000 begin test + 1min
7 00:05:27.540000 begin test + 2min
8 00:06:27.540000 begin test + 3min
9 00:07:00.540000 end of test
The resulting DataFrame also has the Time column as strings, but at least the fractional part of the seconds has 6 digits everywhere.
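If a Timedelta column is preferred afterwards (the question explicitly allows pd.to_timedelta()), the string column converts back in one line; this is just a usage note:
result['Time'] = pd.to_timedelta(result['Time'])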

Repeating the pattern of Numbers thrice in a month

I want to distribute the numbers present in the list over the whole month.
a) Given a holiday list, I want to dynamically assign '1' to each holiday date and '0' to each working day.
e.g.:
Holiday_List = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
Start_date = datetime.datetime(year=2020, month =1 , day=1)
end_date = datetime.datetime(year =2020,month =1,day=28 )
Below is the output I am looking for as a dataframe, where 'Date' and 'Holiday' are columns.
Date Holiday
01-01-2020 1
02-01-2020 0
03-01-2020 0
04-01-2020 0
05-01-2020 1
06-01-2020 0
07-01-2020 0
08-01-2020 0
09-01-2020 0
10-01-2020 0
11-01-2020 0
12-01-2020 1
13-01-2020 0
14-01-2020 0
15-01-2020 0
16-01-2020 0
17-01-2020 0
18-01-2020 0
19-01-2020 1
20-01-2020 0
21-01-2020 0
22-01-2020 0
23-01-2020 0
24-01-2020 0
25-01-2020 0
26-01-2020 1
27-01-2020 0
28-01-2020 0
B) Given a list of numbers like [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], I want to break it into 3 equal parts and store them in 3 different lists:
a=[1,2,3,4,5,6], b=[7,8,9,10,11,12], c=[13,14,15,16,17,18]
The sequence should be preserved: the first 6 elements in 'a', the second 6 in 'b', and the third 6 in 'c'.
C) I want to distribute the above lists a, b, c over the whole month such that the gap between the first element of a, b and c is exactly 8 days (and similarly for the other numbers), with the constraint that no number can be assigned on a holiday.
Below is the final output I am looking for, where the list values are assigned in the column "Values" and where I have used the dummy value 'NW' to keep a gap of 8 days between every list.
Date Holiday Values
01-01-2020 1 Holiday
02-01-2020 0 1
03-01-2020 0 2
04-01-2020 0 3
05-01-2020 1 Holiday
06-01-2020 0 4
07-01-2020 0 5
08-01-2020 0 6
09-01-2020 0 NW
10-01-2020 0 NW
11-01-2020 0 7
12-01-2020 1 Holiday
13-01-2020 0 8
14-01-2020 0 9
15-01-2020 0 10
16-01-2020 0 11
17-01-2020 0 12
18-01-2020 0 NW
19-01-2020 1 Holiday
20-01-2020 0 13
21-01-2020 0 14
22-01-2020 0 15
23-01-2020 0 16
24-01-2020 0 17
25-01-2020 0 18
26-01-2020 1 Holiday
27-01-2020 0 NW
28-01-2020 0 NW
A) You can use date_range to create a column with dates
df = pd.DataFrame()
df['Date'] = pd.date_range(start_date, end_date)
Next you can create the column Holiday with zeros in all cells
df['Holiday'] = 0
And next you can replace some of the values
for item in holiday_list:
    item = datetime.datetime.strptime(item, '%Y-%m-%d')
    df['Holiday'][ df['Date'] == item ] = 1
but maybe this part could be simpler using isin()
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Holiday'][mask] = 1
or using numpy.where()
import numpy as np
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Holiday'] = np.where(mask, 1, 0)
or simply keep it as True/False instead of 1/0
df['Holiday'] = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
import pandas as pd
import datetime
holiday_list = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020,month=1, day=28)
df = pd.DataFrame()
df['Date'] = pd.date_range(start_date, end_date)
df['Holiday'] = 0
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Holiday'][mask] = 1
print(df)
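A slightly shorter sketch of part A that avoids both strftime() and the chained-assignment pattern, assuming the holiday strings are ISO dates as above:
import pandas as pd

holiday_list = ['2020-01-01', '2020-01-05', '2020-01-12', '2020-01-19', '2020-01-26']

df = pd.DataFrame({'Date': pd.date_range('2020-01-01', '2020-01-28')})

# compare datetimes directly instead of formatting them to strings
df['Holiday'] = df['Date'].isin(pd.to_datetime(holiday_list)).astype(int)

print(df)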
B) You could use [start:start+size] slicing to split the list
numbers = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
size = len(numbers)//3
print(numbers[size*0:size*1], numbers[size*1:size*2], numbers[size*2:size*3])
or
print(numbers[:size], numbers[size:size*2], numbers[size*2:])
In a similar way you can split the dataframe (after filtering out "Holiday") to work with 8-day chunks [start:start+8], but I will use that in (C). An alternative split with np.array_split is sketched below.
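For reference, a one-liner sketch that does the same split with numpy, assuming the list length is divisible by 3:
import numpy as np

numbers = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]

# np.array_split returns three arrays of equal length, preserving the order
a, b, c = (part.tolist() for part in np.array_split(numbers, 3))
print(a, b, c)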
C) You can create the column Values with NW in all cells
df['Values'] = 'NW'
Next you can use the previous mask to assign "Holiday"
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Values'][ mask ] = 'Holiday'
Using ~ you can negate the mask to reverse the selection - to select cells without "Holiday"
selected = df['Values'][ ~mask ]
and now I can try to assign
for a, b in zip(range(0, len(selected), 8), range(0, len(numbers), size)):
    selected[a:a+size] = numbers[b:b+size]
df['Values'][ ~mask ] = selected
but maybe it can be done in a simpler way, perhaps with groupby() or rolling()?
import pandas as pd
import datetime
holiday_list = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020,month=1, day=28)
df = pd.DataFrame()
# ---
df['Date'] = pd.date_range(start_date, end_date)
mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)
df['Holiday'] = 0
df['Holiday'][mask] = 1
# ---
df['Values'] = 'NW'
df['Values'][ mask ] = 'Holiday'
numbers = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
size = len(numbers)//3
selected = df['Values'][ ~mask ]
for a, b in zip(range(0, len(selected), 8), range(0, len(numbers), size)):
    selected[a:a+size] = numbers[b:b+size]
df['Values'][ ~mask ] = selected
print(df)
EDIT:
I created this code.
The main problem was that slicing sometimes creates a copy of the data and changes the values in that copy but not in the original dataframe - so I use masks instead of slices.
It may display a warning that it changes values in a copy of the data (not in the original dataframe), but it gives the correct result.
Maybe the warning could be removed using the information from "Returning a view versus a copy".
import pandas as pd
import datetime

holiday_list = [
    '2020-01-01', '2020-01-05',
    #'2020-01-10', '2020-01-11',  # add more to test what happens when there are fewer than 7 NW
    '2020-01-12', '2020-01-19', '2020-01-26'
]

start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020, month=1, day=28)

df = pd.DataFrame()

# ---

df['Date'] = pd.date_range(start_date, end_date)

mask = df['Date'].dt.strftime('%Y-%m-%d').isin(holiday_list)

df['Holiday'] = 0
df['Holiday'][mask] = 1

# ---

df['Values'] = 'NW'
df['Values'][mask] = 'Holiday'

numbers = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18]
size = len(numbers)//3

start = 0

for b in range(0, len(numbers), size):
    # find first and last NW to replace (needs `start` to keep a few NW at the end of the previous 8-day gap)
    mask = (df['Values'] == 'NW') & (df.index >= start)

    # shrink size if there are fewer than 7 `NW` left
    print('NW:', sum(mask))  # sum() counts all `True` in mask
    if sum(mask) <= size:
        left = size - sum(mask)
        size = sum(mask)
        print('shorter:', size, left)

    # first and last NW to replace
    start = df[mask].index[0]
    end = df[mask].index[size-1]
    print('start, end:', start, end)

    # use a new mask to select and replace values
    # (slicing like [0:6] doesn't work here because it creates a copy of the data
    #  and doesn't replace the values in the original dataframe)
    mask = mask & (df.index >= start) & (df.index <= end)
    df['Values'][mask] = numbers[b:b+size]

    # create the 8-day gap
    start += 8+1

print(df)
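Since the warning mentioned above comes from chained indexing such as df['Values'][mask] = ..., a hedged sketch of the .loc form, which assigns on the original dataframe directly:
# equivalent assignments without SettingWithCopyWarning
df.loc[mask, 'Holiday'] = 1
df.loc[mask, 'Values'] = 'Holiday'
# the same pattern works for the numbers inside the loop, e.g.
# df.loc[mask, 'Values'] = numbers[b:b+size]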
I hope you have solved it by now :) anyway, this is my approach to the problem.
First of all, there are certain assumptions that I made when writing the code:
The length of the given array of integers is <= 18, which makes the length of the a, b, c arrays <= 8.
First, we need to divide the given array into three equal parts,
and if the length of the split arrays is < 8 we need to fill them with NW dummy values so that the array length becomes 8.
To do that easily, we can use a numpy array; since the array needs to be split and also hold the string value NW, we use object as the dtype of the array. Here is how:
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], dtype=object)
then we need to split the array into three equal parts,
arr = np.split(arr,3)
Those arrays then need to be filled if their length is < 8, using np.insert:
for i in range(len(arr[0]), 8):
    arr = np.insert(arr, i, dummy, axis=1)  # fill remaining slots of the arrays with the dummy value (NW)
Then we need to consider:
Part A
We need to get the number of days between the two dates, delta (this calculation could also go inside the for statement).
We need to get the dates for that range of days with the help of the datetime module and iteration.
delta = end_date - Start_date
for i in range(delta.days + 1):
    day = Start_date + timedelta(days=i)
We can use .strftime() to define the time format we need.
day.strftime("%d-%m-%Y")
Finally, we need to check whether the current date from the iteration is in Holiday_List and, if so, print 1 and Holiday next to the date. If not, we print 0 and the next element from the arrays next to the date, making sure to keep a gap of 8 days between the lists and to fill empty day slots with the dummy value NW.
count = 0
for i in range(delta.days + 1):
    day = Start_date + timedelta(days=i)
    if day.strftime("%Y-%m-%d") in Holiday_List:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 1, hDay))
    else:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 0, arr[count//8][count%8]))
        count += 1
Here count//8 decides which array to use and count%8 chooses which element of it to print.
So the full program:
import datetime
import numpy as np
from datetime import timedelta
Holiday_List = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
Start_date = datetime.datetime(year=2020, month =1 , day=1)
end_date = datetime.datetime(year =2020,month =1,day=28 )
delta = end_date - Start_date
print(delta)
hDay = "Holiday"
dummy = "NW"
# --- numpy array ---
arr = np.array([1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18], dtype=object)  # assumed that the array length is divisible by 3 every time
arr = np.split(arr,3)  # splits the array into three equal parts
for i in range(len(arr[0]), 8):
    arr = np.insert(arr, i, dummy, axis=1)  # fill remaining slots with the dummy value (NW)
print("{}\t{}\t{}".format("Date", "Holiday", "Values"))
count = 0
for i in range(delta.days + 1):
    day = Start_date + timedelta(days=i)
    if day.strftime("%Y-%m-%d") in Holiday_List:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 1, hDay))
    else:
        print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 0, arr[count//8][count%8]))
        count += 1
EDIT:
The above code has an issue in the last part, which determines the gap and sets the dummy value NW:
"When there are no holidays then you would need 3 NW, so I would add 3 NW to every list ('a', 'b', 'c'), and then I would work with every list separately. I would use an external for-loop like for data in arr: instead of arr[count//8], and I would count the gap to skip the last element if the gap is 8 and the element is 'NW' (BTW: if you add more holidays then you have to create a gap bigger than 8). – #furas"
So with the help of #furas I was able to solve the issue (thanks to him!) :). The excess dummy NW values are skipped by iterating through the lists:
import datetime
import numpy as np
from datetime import timedelta
Holiday_List = ['2020-01-01','2020-01-05','2020-01-12','2020-01-19','2020-01-26']
Start_date = datetime.datetime(year=2020, month=1, day=1)
end_date = datetime.datetime(year=2020, month=1, day=28)
delta = end_date - Start_date
print(delta)
hDay = "Holiday"
dummy = "NW"
# --- numpy array ---
arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18], dtype=object)  # assumed that the array length is divisible by 3 every time
arr = np.split(arr, 3)  # splits the array into three equal parts
for i in range(len(arr[0]), 9):  # add 3 'NW' instead of 2 'NW'
    arr = np.insert(arr, i, dummy, axis=1)  # fill remaining slots with the dummy value (NW)
print("{}\t{}\t{}".format("Date", "Holiday", "Values"))
# ---
i = 0
for numbers in arr:
    gap = 0
    numbers_index = 0
    numbers_count = len(numbers) - 3  # count the numbers without the 3 `NW`
    while i < delta.days + 1:
        day = Start_date + timedelta(days=i)
        i += 1
        if day.strftime("%Y-%m-%d") in Holiday_List:
            print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 1, hDay))
            if numbers_index > 0:  # don't count the Holiday before the first number from `numbers` is displayed (ie. '2020-01-01')
                gap += 1
        else:
            value = numbers[numbers_index]
            # always put a number (!= 'NW'), or put 'NW' when the gap is too small (< 9)
            if value != 'NW' or gap < 9:
                print("{}\t{}\t{}".format(day.strftime("%d-%m-%Y"), 0, value))
                numbers_index += 1
                gap += 1
            # IDEA: maybe an `else:` here could put `NW` without adding `NW` to `arr`
        # exit the loop if all numbers are displayed and the gap is big enough
        if numbers_index >= numbers_count and gap >= 9:
            break
The answer provided by #furas is less messy, you should study that.
Cheers mate, learned a lot actually!

How to convert month number to datetime in pandas

I have followed the instructions from this thread, but have run into issues.
Converting month number to datetime in pandas
I think it may have to do with having an additional variable in my dataframe but I am not sure. Here is my dataframe:
0 Month Temp
1 0 2
2 1 4
3 2 3
What I want is:
0 Month Temp
1 1990-01 2
2 1990-02 4
3 1990-03 3
Here is what I have tried:
df= pd.to_datetime('1990-' + df.Month.astype(int).astype(str) + '-1', format = '%Y-%m')
And I get this error:
ValueError: time data 1990-0-1 doesn't match format specified
IIUC, we can manually create your datetime object and then format it as your expected output. Since the month numbers start at 0, add 1 to them first:
m = df['Month'].add(1).astype(int).astype(str)
df['date'] = pd.to_datetime(
    "1990" + "-" + m, format="%Y-%m"
).dt.strftime("%Y-%m")
print(df)
Month Temp date
0 0 2 1990-01
1 1 4 1990-02
2 2 3 1990-03
Try .dt.strftime() to control how the date is displayed, because datetime values are stored with full %Y-%m-%d 00:00:00 resolution by default.
import pandas as pd
df= pd.DataFrame({'month':[1,2,3]})
df['date']=pd.to_datetime(df['month'], format="%m").dt.strftime('%Y-%m')
print(df)
You have to explicitly tell pandas to add 1 to the months, as they are in the range 0-11 rather than 1-12 in your case.
df=pd.DataFrame({'month':[11,1,2,3,0]})
df['date']=pd.to_datetime(df['month']+1, format='%m').dt.strftime('1990-%m')
Here is my solution for you
import pandas as pd
Data = {
'Month' : [1,2,3],
'Temp' : [2,4,3]
}
data = pd.DataFrame(Data)
data['Month'] = pd.to_datetime('1990-' + data.Month.astype(int).astype(str), format='%Y-%m').dt.to_period('M')
Month Temp
0 1990-01 2
1 1990-02 4
2 1990-03 3
If you want the Month value 0 to mean month 1, you can conditionally add 1 first.
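As a further sketch for the original 0-based months, assuming the year is fixed at 1990, the date can also be assembled from a components DataFrame, which avoids string concatenation entirely (the parts name is mine):
import pandas as pd

df = pd.DataFrame({'Month': [0, 1, 2], 'Temp': [2, 4, 3]})

# build year/month/day components; Month is 0-based, so add 1
parts = pd.DataFrame({'year': 1990, 'month': df['Month'] + 1, 'day': 1})
df['date'] = pd.to_datetime(parts).dt.strftime('%Y-%m')

print(df)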

Aggregating unbalanced panel to time series using pandas

I have an unbalanced panel that I'm trying to aggregate up to a regular, weekly time series. The panel looks as follows:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
To give a better sense of what I'm looking for, I'm including an intermediate step, which I'd love to skip if possible. Basically some data needs to be filled in so that it can be aggregated. As you can see, missing weeks in between observations are interpolated. All other values are set equal to zero.
Group Date value
A 1/1/2000 5
A 1/8/2000 5
A 1/15/2000 10
A 1/22/2000 0
B 1/1/2000 0
B 1/8/2000 3
B 1/15/2000 3
B 1/22/2000 7
C 1/1/2000 0
C 1/8/2000 0
C 1/15/2000 0
C 1/22/2000 20
The final result that I'm looking for is as follows:
Date value
1/1/2000 5 = 5 + 0 + 0
1/8/2000 8 = 5 + 3 + 0
1/15/2000 13 = 10 + 3 + 0
1/22/2000 27 = 0 + 7 + 20
I haven't gotten very far, managed to create a panel:
panel = df.set_index(['Group','week']).to_panel()
Unfortunately, if I try to resample, I get an error
panel.resample('W')
TypeError: Only valid with DatetimeIndex or PeriodIndex
Assuming df is your second dataframe with weeks, you can try the following:
df.groupby('week').sum()['value']
See the documentation of groupby() and its application; it's similar to the GROUP BY clause in SQL.
To obtain the second dataframe from the first one, try the following:
Firstly, prepare a function to map the day of the month to a week:
def d2w_map(day):
    if day <= 7:
        return 1
    elif day <= 14:
        return 2
    elif day <= 21:
        return 3
    else:
        return 4
In the method above, days from 29 to 31 are considered in week 4. But you get the idea. You can modify it as needed.
Secondly, take the lists out from the first dataframe, and convert days to weeks
df['Week'] = df['Day'].apply(d2w_map)
del df['Day']
Thirdly, initialize your second dataframe with only the columns 'Group' and 'Week', leaving 'value' out. Assuming your initialized new dataframe is result, you can now do a join:
result = result.join(df, on=['Group', 'Week'])
Last, write a function to fill the NaN values in the 'value' column with the nearby element. The NaN values are what you need to interpolate. Since I am not sure how you want the interpolation to work, I will leave it to you.
Here is how you can change d2w_map to convert a date string into an integer weekday:
from datetime import datetime

def d2w_map(day_str):
    return datetime.strptime(day_str, '%m/%d/%Y').weekday()
A returned value of 0 means Monday, 1 means Tuesday, and so on.
If you have the package dateutil installed, the function can be made more robust:
from dateutil.parser import parse

def d2w_map(day_str):
    return parse(day_str).weekday()
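Alternatively (not what this answer proposes), current pandas can assign a weekly period directly, which may be enough if calendar weeks rather than weeks-of-month are acceptable; a small sketch:
import pandas as pd

df = pd.DataFrame({'Group': ['A', 'A', 'B'],
                   'Date': ['1/1/2000', '1/17/2000', '1/9/2000'],
                   'value': [5, 10, 3]})

# weekly period labels such as 1999-12-27/2000-01-02
df['Week'] = pd.to_datetime(df['Date'], format='%m/%d/%Y').dt.to_period('W')
print(df)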
Sometimes, things you want are already implemented by magic :)
Turns out the key is to resample a groupby object like so:
df_temp = (df.set_index('date')
           .groupby('Group')
           .resample('W', how='sum', fill_method='ffill'))
ts = (df_temp.reset_index()
      .groupby('date')
      .sum()['value'])
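The how= and fill_method= arguments of resample() have since been removed from pandas. A rough sketch of the same idea under current pandas (an adaptation, not the original answer), assuming columns named Group, Date and value as in the question; note the weekly bins are labelled by week end (W-SUN), so the labels differ from the 1/1, 1/8, ... bins shown above:
import pandas as pd

df = pd.DataFrame({
    'Group': ['A', 'A', 'B', 'B', 'C'],
    'Date': ['1/1/2000', '1/17/2000', '1/9/2000', '1/23/2000', '1/22/2000'],
    'value': [5, 10, 3, 7, 20],
})
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%Y')

# weekly sum per group, forward-filling weeks with no observation
weekly = (df.set_index('Date')
            .groupby('Group')['value']
            .resample('W')
            .sum(min_count=1)        # keep empty weeks as NaN instead of 0
            .groupby(level='Group')
            .ffill()
            .fillna(0))

# collapse the groups into a single weekly series
ts = weekly.groupby(level='Date').sum()
print(ts)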
Used this tab delimited test.txt:
Group Date value
A 1/1/2000 5
A 1/17/2000 10
B 1/9/2000 3
B 1/23/2000 7
C 1/22/2000 20
You can skip the intermediate dataframe as follows. I don't have time now, so just play around with it to get it right.
import pandas as pd
import datetime

time_format = '%m/%d/%Y'

Y = pd.read_csv('test.txt', sep="\t")
dates = Y['Date']
dates_right_format = [datetime.datetime.strptime(s, time_format) for s in dates]
values = Y['value']

X = pd.DataFrame(values)
X.index = dates_right_format
print(X)

X = X.sort_index()
print(X)

print(X.resample('W', closed='right', label='right').sum(min_count=1))
Last print
value
2000-01-02 5
2000-01-09 3
2000-01-16 NaN
2000-01-23 37
