How to repeat the same datetime data at different intervals - python

I have this dataset
Value1 Value2
2000-01-01 12:00:00 1 2
2000-01-02 12:00:00 3 4
2000-01-03 12:00:00 5 6
I want to repeat the same data at 4 different times per day, for example
Value1 Value2
2000-01-01 00:00:00 1 2
2000-01-01 06:00:00 1 2
2000-01-01 12:00:00 1 2
2000-01-01 18:00:00 1 2
2000-01-02 00:00:00 3 4
2000-01-02 06:00:00 3 4
2000-01-02 12:00:00 3 4
2000-01-02 18:00:00 3 4
and so on.

Your dates are contiguous, but this solution will also work for non-contiguous dates:
generate a series of the expanded times for each date, then join it back.
import io
import pandas as pd

df = pd.read_csv(io.StringIO("""Value1  Value2
2000-01-01 12:00:00  1  2
2000-01-02 12:00:00  3  4
2000-01-03 12:00:00  5  6"""), sep=r"\s\s+", engine="python")
df = df.set_index(pd.to_datetime(df.index))  # first column is inferred as the index

df = df.join(
    pd.Series(df.index, index=df.index)
    .rename("expanded")
    .dt.date.apply(lambda d: pd.date_range(d, freq="6H", periods=4))  # 00:00, 06:00, 12:00, 18:00
    .explode()
).set_index("expanded")
df
                     Value1  Value2
expanded
2000-01-01 00:00:00       1       2
2000-01-01 06:00:00       1       2
2000-01-01 12:00:00       1       2
2000-01-01 18:00:00       1       2
2000-01-02 00:00:00       3       4
2000-01-02 06:00:00       3       4
2000-01-02 12:00:00       3       4
2000-01-02 18:00:00       3       4
2000-01-03 00:00:00       5       6
2000-01-03 06:00:00       5       6
2000-01-03 12:00:00       5       6
2000-01-03 18:00:00       5       6
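For comparison, a shorter sketch (my addition, not part of the answer above) that assumes exactly one row per calendar day and starts again from the original daily-indexed df:
import pandas as pd

offsets = [pd.Timedelta(hours=h) for h in (0, 6, 12, 18)]
expanded = pd.DatetimeIndex([day + off for day in df.index.normalize() for off in offsets])
daily = df.set_axis(df.index.normalize())   # re-key each row by its midnight
out = daily.reindex(expanded.normalize())   # four lookups per day
out.index = expanded                        # restore the 6-hourly stamps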

How to use groupby() with between_time()?

I have a DataFrame and want to multiply all values in column a for a given day by the value of a at 06:00:00 of that day. If there is no 06:00:00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How can I correct this code, or replace it with a working solution?
import pandas as pd
import numpy as np

start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)   # 9 evenly spaced timestamps, 6 hours apart
datetime1 = pd.to_datetime(t)
df = pd.DataFrame({'a': [1, 3, 4, 5, 6, 7, 8, 9, 14]})
df['date'] = datetime1
print(df)

def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y

toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
....
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here since it implicitly passes all the columns of each group as a DataFrame to the custom function.
To read more about the difference between the two methods, consider this answer.
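A tiny illustration of that difference (a toy frame, just for demonstration):
import pandas as pd

toy = pd.DataFrame({'g': [1, 1, 2], 'a': [10, 20, 30]})

# apply hands the whole group to the function as a DataFrame...
print(toy.groupby('g').apply(lambda grp: type(grp).__name__))        # DataFrame
# ...while transform feeds it one column (a Series) at a time, which is
# exactly why myF failed: a Series has no set_index method.
print(toy.groupby('g')['a'].transform(lambda s: type(s).__name__))   # Series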
If you want to stick with the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a DataFrame with the same number of rows as the original (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp

df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
See that, for each day, each value with a time other than 06:00:00 is multiplied by the value at 06:00:00. It returns NaN for the 06:00:00 values themselves, as well as for groups without that time.
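If a transform-style result for the original multiplication goal is wanted (multiply each day's a by that day's 06:00:00 value, leaving days without one unchanged), here is a vectorized sketch; treating "unchanged" as multiplying by 1 is my assumption:
import datetime
import pandas as pd

# a where the clock reads exactly 06:00:00, NaN everywhere else
six = df['a'].where(df['date'].dt.time == datetime.time(6, 0))
# broadcast each day's 06:00 value (if any) to all rows of that day
factor = six.groupby(df['date'].dt.floor('D')).transform('first')
df['a'] = df['a'] * factor.fillna(1)   # days lacking a 06:00 row stay unchanged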

How can I efficiently half forward/backward fill a gap in a dataframe?

I have a dataframe with datetimes as index. There are some gaps in the index, so I upsample it to a fixed frequency (10 seconds in the example below). I want to fill the gaps by forward filling the first half of each gap (from its left edge) and backward filling the second half (from its right edge).
Input:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:10 4
Upsampled input, at 10-second frequency:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 NaN
2000-01-01 00:00:20 NaN
2000-01-01 00:00:30 NaN
2000-01-01 00:00:40 NaN
2000-01-01 00:00:50 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 NaN
2000-01-01 00:01:20 NaN
2000-01-01 00:01:30 NaN
2000-01-01 00:01:40 NaN
2000-01-01 00:01:50 NaN
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 NaN
2000-01-01 00:02:20 NaN
2000-01-01 00:02:30 NaN
2000-01-01 00:02:40 NaN
2000-01-01 00:02:50 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
Output I want:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
I managed to get the result I want by finding the edges of the gaps after upsampling, forward filling across each whole gap, and then overwriting just the right half with the value of the right edge. But since my data is so large (some of my files have ~1M gaps to fill), the for loop that goes through all the identified gaps takes forever to run.
Is there a way this could be done faster?
Thanks!
Edit:
I only want to upsample and fill gaps where the time difference is smaller than or equal to a given value (in the example, only those up to 1 minute), so the last 2 rows have no upsampling and filling between them.
If your data is 1 min apart, you can do:
df.set_index(0).asfreq('10S').ffill(limit=3).bfill(limit=2)  # 5 NaNs per gap: ffill 3, bfill 2
Output:
                       1
0
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
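The two limits generalize: for rows originally T apart upsampled to step f, each gap opens T/f - 1 NaNs; forward-fill the first half (rounding up) and back-fill the rest. A sketch under that assumption of regular spacing:
import math
import pandas as pd

T = pd.Timedelta('1min')         # original spacing
f = pd.Timedelta('10s')          # upsampling step
n = T // f - 1                   # NaNs opened per gap: 5
ffill_limit = math.ceil(n / 2)   # 3 filled from the left edge
bfill_limit = n - ffill_limit    # 2 filled from the right edge
out = df.set_index(0).asfreq('10S').ffill(limit=ffill_limit).bfill(limit=bfill_limit)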
Setup
ts = pd.Series([0, 1, 2, 3], pd.date_range('2000-01-01', periods=4, freq='min'))
merge_asof with direction='nearest'
pd.merge_asof(
    ts.asfreq('10s').to_frame('left'),
    ts.to_frame('right'),
    left_index=True,
    right_index=True,
    direction='nearest'
)
left right
2000-01-01 00:00:00 0.0 0
2000-01-01 00:00:10 NaN 0
2000-01-01 00:00:20 NaN 0
2000-01-01 00:00:30 NaN 0
2000-01-01 00:00:40 NaN 1
2000-01-01 00:00:50 NaN 1
2000-01-01 00:01:00 1.0 1
2000-01-01 00:01:10 NaN 1
2000-01-01 00:01:20 NaN 1
2000-01-01 00:01:30 NaN 1
2000-01-01 00:01:40 NaN 2
2000-01-01 00:01:50 NaN 2
2000-01-01 00:02:00 2.0 2
2000-01-01 00:02:10 NaN 2
2000-01-01 00:02:20 NaN 2
2000-01-01 00:02:30 NaN 2
2000-01-01 00:02:40 NaN 3
2000-01-01 00:02:50 NaN 3
2000-01-01 00:03:00 3.0 3
reindex with method='nearest'
ts.reindex(ts.asfreq('10s').index, method='nearest')
2000-01-01 00:00:00 0
2000-01-01 00:00:10 0
2000-01-01 00:00:20 0
2000-01-01 00:00:30 1
2000-01-01 00:00:40 1
2000-01-01 00:00:50 1
2000-01-01 00:01:00 1
2000-01-01 00:01:10 1
2000-01-01 00:01:20 1
2000-01-01 00:01:30 2
2000-01-01 00:01:40 2
2000-01-01 00:01:50 2
2000-01-01 00:02:00 2
2000-01-01 00:02:10 2
2000-01-01 00:02:20 2
2000-01-01 00:02:30 3
2000-01-01 00:02:40 3
2000-01-01 00:02:50 3
2000-01-01 00:03:00 3
Freq: 10S, dtype: int64
Note that the two solutions decide "nearest" slightly differently: at the exact midpoints (e.g. 00:00:30), merge_asof keeps the earlier row while reindex takes the later one, producing slightly different results.
pd.merge_asof(
    ts.asfreq('10s').to_frame('left'),
    ts.to_frame('merge_asof'),
    left_index=True,
    right_index=True,
    direction='nearest'
).assign(reindex=ts.reindex(ts.asfreq('10s').index, method='nearest'))
left merge_asof reindex
2000-01-01 00:00:00 0.0 0 0
2000-01-01 00:00:10 NaN 0 0
2000-01-01 00:00:20 NaN 0 0
2000-01-01 00:00:30 NaN 0 1 # This row is different
2000-01-01 00:00:40 NaN 1 1
2000-01-01 00:00:50 NaN 1 1
2000-01-01 00:01:00 1.0 1 1
2000-01-01 00:01:10 NaN 1 1
2000-01-01 00:01:20 NaN 1 1
2000-01-01 00:01:30 NaN 1 2 # This row is different
2000-01-01 00:01:40 NaN 2 2
2000-01-01 00:01:50 NaN 2 2
2000-01-01 00:02:00 2.0 2 2
2000-01-01 00:02:10 NaN 2 2
2000-01-01 00:02:20 NaN 2 2
2000-01-01 00:02:30 NaN 2 3 # This row is different
2000-01-01 00:02:40 NaN 3 3
2000-01-01 00:02:50 NaN 3 3
2000-01-01 00:03:00 3.0 3 3
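Regarding the question's edit (only fill where the neighbors are close enough): reindex also accepts a tolerance, so stamps farther than the cutoff from any original row stay NaN and can be dropped afterwards. The 30-second cutoff below is an assumption:
ts.reindex(ts.asfreq('10s').index, method='nearest', tolerance=pd.Timedelta('30s'))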

Sampling dataframe Considering NaN values+Pandas

I have a data frame like the one below, and I want to sample it with '3S'.
There are situations where NaN is present. What I expect is that the data frame is sampled with '3S', but whenever a 'NaN' is found in between, the sampling stops there and restarts from that index. I tried using dataframe.apply to achieve this, but it got very complex. Is there a shorter way to achieve it?
df.sample(n=3)
Code to generate the input:
import pandas as pd

index = pd.date_range('1/1/2000', periods=13, freq='T')
series = pd.DataFrame(range(13), index=index)
series.iloc[4] = 'NaN'   # note: this assigns the string 'NaN', not a real missing value
series.iloc[10] = 'NaN'
print(series)
I tried to do the sampling, but after that I had no clue how to proceed.
2015-01-01 00:00:00 0.0
2015-01-01 01:00:00 1.0
2015-01-01 02:00:00 2.0
2015-01-01 03:00:00 2.0
2015-01-01 04:00:00 NaN
2015-01-01 05:00:00 3.0
2015-01-01 06:00:00 4.0
2015-01-01 07:00:00 4.0
2015-01-01 08:00:00 4.0
2015-01-01 09:00:00 NaN
2015-01-01 10:00:00 3.0
2015-01-01 11:00:00 4.0
2015-01-01 12:00:00 4.0
The new data frame should be sampled with '3S', but also take any 'NaN' into account and restart the sampling where 'NaN' records are found.
Expected Output:
2015-01-01 02:00:00 2.0 -- Sampling after 3S
2015-01-01 03:00:00 2.0 -- Print because NaN has found in Next
2015-01-01 04:00:00 NaN -- print NaN record
2015-01-01 07:00:00 4.0 -- Sampling after 3S
2015-01-01 08:00:00 4.0 -- Print because NaN has found in Next
2015-01-01 09:00:00 NaN -- print NaN record
2015-01-01 12:00:00 4.0 -- Sampling after 3S
Use:
index = pd.date_range('1/1/2000', periods=13, freq='H')
df = pd.DataFrame({'col': range(13)}, index=index)
df.iloc[4, 0] = np.nan
df.iloc[9, 0] = np.nan
print(df)
col
2000-01-01 00:00:00 0.0
2000-01-01 01:00:00 1.0
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 05:00:00 5.0
2000-01-01 06:00:00 6.0
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 10:00:00 10.0
2000-01-01 11:00:00 11.0
2000-01-01 12:00:00 12.0
m = df['col'].isna()               # mask of missing values
s1 = m.ne(m.shift()).cumsum()      # ids for runs of consecutive values
t = pd.Timedelta(2, unit='H')
# keep rows at least 2 hours past the start of their group
mask = df.index >= df.groupby(s1)['col'].transform(lambda x: x.index[0]) + t
df1 = df[mask | m]                 # ...plus the NaN rows themselves
print(df1)
col
2000-01-01 02:00:00 2.0
2000-01-01 03:00:00 3.0
2000-01-01 04:00:00 NaN
2000-01-01 07:00:00 7.0
2000-01-01 08:00:00 8.0
2000-01-01 09:00:00 NaN
2000-01-01 12:00:00 12.0
Explanation:
Create a mask of the missing values with Series.isna.
Create group ids for runs of consecutive values by comparing the mask with its shifted self via Series.ne (!=) and taking the cumulative sum:
print(s1)
2000-01-01 00:00:00 1
2000-01-01 01:00:00 1
2000-01-01 02:00:00 1
2000-01-01 03:00:00 1
2000-01-01 04:00:00 2
2000-01-01 05:00:00 3
2000-01-01 06:00:00 3
2000-01-01 07:00:00 3
2000-01-01 08:00:00 3
2000-01-01 09:00:00 4
2000-01-01 10:00:00 5
2000-01-01 11:00:00 5
2000-01-01 12:00:00 5
Freq: H, Name: col, dtype: int32
Get the first index value per group, add the timedelta (2 hours here, matching the expected output) and compare with the DatetimeIndex.
Finally, filter by boolean indexing, combining the two masks with | (bitwise OR); the intermediate pieces are sketched below.
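The intermediate pieces of the mask, on the example frame:
# start time of each row's group, broadcast back to every row
first = df.groupby(s1)['col'].transform(lambda x: x.index[0])
# e.g. 2000-01-01 00:00:00 for the four rows of group 1, and
# 2000-01-01 04:00:00 for the lone NaN row that forms group 2
mask = df.index >= first + t   # True once 2 hours have elapsed within the group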
One way would be to fill the NaNs with 0:
df['Col_of_Interest'] = df['Col_of_Interest'].fillna(0)
And then do the resampling on the series (if datetime is your index):
series.resample('30S').asfreq()

Getting data for given day from pandas Dataframe

I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for a single day, i.e. 1st Jan 2000. The query below gives me the correct result, but is there a way it can be done just by passing "2000-01-01"?
result= df[(df['date1'] > '2000-01-01 00:00') & (df['date1'] < '2000-01-01 23:59')]
Use partial string indexing, but you need a DatetimeIndex first (in recent pandas, select rows with .loc rather than plain []):
df = df.set_index('date1').loc['2000-01-01']
print(df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
Another solution is to convert the datetimes to strings with strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print(df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
import io
import pandas as pd
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
df = pd.read_csv(io.StringIO(data), sep=r'\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or, comparing against a datetime.date:
import datetime
df[df.date1.dt.date == datetime.date(2000, 1, 1)]
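A further vectorized option (my addition, not from the answers above): dt.normalize() drops the time-of-day, so a whole day matches with a single comparison.
df[df.date1.dt.normalize() == pd.Timestamp('2000-01-01')]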

Python/Pandas: Speedup calculation of weekday of datetime-data

I'm loading data in pandas, where the column date contains the datetime values, e.g.:
date ; .....more stuff ......
2000-01-03 ;
2000-01-04 ;
2000-01-06 ;
...
2000-01-31 ;
2000-02-01 ;
2000-02-02 ;
2000-02-04 ;
I have a function to add a column containing the weekday-indices (0-6):
def genWeekdays(df, src='date', target='weekday'):
    """
    Add a column of weekday indices (0=Monday .. 6=Sunday) derived from src.
    """
    df[target] = df[src].apply(lambda x: x.weekday())
    return df
calling it via
df = genWeekdays(df)
df has about a million rows and this takes about 1.3 secs.
Is there any way to speed this up? I'm a little surprised at how long this takes on my i7-4770K :(
Thanks in advance
In [30]: df = DataFrame(dict(date = pd.date_range('20000101',periods=100000,freq='s'), value = np.random.randn(100000)))
In [31]: df['weekday'] = pd.DatetimeIndex(df['date']).weekday
In [32]: %timeit pd.DatetimeIndex(df['date']).weekday
10 loops, best of 3: 34.9 ms per loop
In [33]: df
Out[33]:
date value weekday
0 2000-01-01 00:00:00 -0.046604 5
1 2000-01-01 00:00:01 -1.691611 5
2 2000-01-01 00:00:02 0.416015 5
3 2000-01-01 00:00:03 0.054822 5
4 2000-01-01 00:00:04 -0.661163 5
5 2000-01-01 00:00:05 0.274402 5
6 2000-01-01 00:00:06 -0.426533 5
7 2000-01-01 00:00:07 0.028769 5
8 2000-01-01 00:00:08 0.248581 5
9 2000-01-01 00:00:09 1.302145 5
10 2000-01-01 00:00:10 -1.886830 5
11 2000-01-01 00:00:11 2.276506 5
12 2000-01-01 00:00:12 0.054104 5
13 2000-01-01 00:00:13 0.378990 5
14 2000-01-01 00:00:14 0.868879 5
15 2000-01-01 00:00:15 -0.046810 5
16 2000-01-01 00:00:16 -0.499447 5
17 2000-01-01 00:00:17 1.067412 5
18 2000-01-01 00:00:18 -1.625986 5
19 2000-01-01 00:00:19 0.515884 5
20 2000-01-01 00:00:20 -1.884882 5
21 2000-01-01 00:00:21 0.943775 5
22 2000-01-01 00:00:22 0.034501 5
23 2000-01-01 00:00:23 0.438170 5
24 2000-01-01 00:00:24 -1.211937 5
25 2000-01-01 00:00:25 -0.229930 5
26 2000-01-01 00:00:26 0.938805 5
27 2000-01-01 00:00:27 0.026815 5
28 2000-01-01 00:00:28 2.166740 5
29 2000-01-01 00:00:29 -0.096927 5
... ... ... ...
99970 2000-01-02 03:46:10 -0.310023 6
99971 2000-01-02 03:46:11 0.561321 6
99972 2000-01-02 03:46:12 2.207426 6
99973 2000-01-02 03:46:13 -0.253933 6
99974 2000-01-02 03:46:14 -0.711145 6
99975 2000-01-02 03:46:15 -0.477377 6
99976 2000-01-02 03:46:16 1.492970 6
99977 2000-01-02 03:46:17 0.308510 6
99978 2000-01-02 03:46:18 0.126579 6
99979 2000-01-02 03:46:19 -1.704093 6
99980 2000-01-02 03:46:20 -0.328285 6
99981 2000-01-02 03:46:21 1.685411 6
99982 2000-01-02 03:46:22 -0.368899 6
99983 2000-01-02 03:46:23 0.915786 6
99984 2000-01-02 03:46:24 -1.694855 6
99985 2000-01-02 03:46:25 -1.488130 6
99986 2000-01-02 03:46:26 -1.274004 6
99987 2000-01-02 03:46:27 -1.508376 6
99988 2000-01-02 03:46:28 0.551695 6
99989 2000-01-02 03:46:29 0.007957 6
99990 2000-01-02 03:46:30 -0.214852 6
99991 2000-01-02 03:46:31 -1.390088 6
99992 2000-01-02 03:46:32 -0.472137 6
99993 2000-01-02 03:46:33 -0.969515 6
99994 2000-01-02 03:46:34 1.129802 6
99995 2000-01-02 03:46:35 -0.291428 6
99996 2000-01-02 03:46:36 0.337134 6
99997 2000-01-02 03:46:37 0.989259 6
99998 2000-01-02 03:46:38 0.705592 6
99999 2000-01-02 03:46:39 -0.311884 6
[100000 rows x 3 columns]
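The same vectorized speedup is available through the .dt accessor in current pandas, which skips building a DatetimeIndex by hand (assuming df['date'] is already datetime64):
df['weekday'] = df['date'].dt.weekday   # vectorized; no per-row Python calls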
