I'm loading data in pandas, where the column date contains datetime values, e.g.:
date ; .....more stuff ......
2000-01-03 ;
2000-01-04 ;
2000-01-06 ;
...
2000-01-31 ;
2000-02-01 ;
2000-02-02 ;
2000-02-04 ;
I have a function to add a column containing the weekday-indices (0-6):
def genWeekdays(df, src='date', target='weekday'):
    """
    bla bla bla
    """
    df[target] = df[src].apply(lambda x: x.weekday())
    return df
calling it via
df = genWeekdays(df)
df has about a million rows and this takes about 1.3 seconds.
Any way to speed this up? I'm a little surprised at how long this takes on my i7-4770K :(
Thanks in advance
In [30]: df = DataFrame(dict(date = pd.date_range('20000101',periods=100000,freq='s'), value = np.random.randn(100000)))
In [31]: df['weekday'] = pd.DatetimeIndex(df['date']).weekday
In [32]: %timeit pd.DatetimeIndex(df['date']).weekday
10 loops, best of 3: 34.9 ms per loop
In [33]: df
Out[33]:
date value weekday
0 2000-01-01 00:00:00 -0.046604 5
1 2000-01-01 00:00:01 -1.691611 5
2 2000-01-01 00:00:02 0.416015 5
3 2000-01-01 00:00:03 0.054822 5
4 2000-01-01 00:00:04 -0.661163 5
5 2000-01-01 00:00:05 0.274402 5
6 2000-01-01 00:00:06 -0.426533 5
7 2000-01-01 00:00:07 0.028769 5
8 2000-01-01 00:00:08 0.248581 5
9 2000-01-01 00:00:09 1.302145 5
10 2000-01-01 00:00:10 -1.886830 5
11 2000-01-01 00:00:11 2.276506 5
12 2000-01-01 00:00:12 0.054104 5
13 2000-01-01 00:00:13 0.378990 5
14 2000-01-01 00:00:14 0.868879 5
15 2000-01-01 00:00:15 -0.046810 5
16 2000-01-01 00:00:16 -0.499447 5
17 2000-01-01 00:00:17 1.067412 5
18 2000-01-01 00:00:18 -1.625986 5
19 2000-01-01 00:00:19 0.515884 5
20 2000-01-01 00:00:20 -1.884882 5
21 2000-01-01 00:00:21 0.943775 5
22 2000-01-01 00:00:22 0.034501 5
23 2000-01-01 00:00:23 0.438170 5
24 2000-01-01 00:00:24 -1.211937 5
25 2000-01-01 00:00:25 -0.229930 5
26 2000-01-01 00:00:26 0.938805 5
27 2000-01-01 00:00:27 0.026815 5
28 2000-01-01 00:00:28 2.166740 5
29 2000-01-01 00:00:29 -0.096927 5
... ... ... ...
99970 2000-01-02 03:46:10 -0.310023 6
99971 2000-01-02 03:46:11 0.561321 6
99972 2000-01-02 03:46:12 2.207426 6
99973 2000-01-02 03:46:13 -0.253933 6
99974 2000-01-02 03:46:14 -0.711145 6
99975 2000-01-02 03:46:15 -0.477377 6
99976 2000-01-02 03:46:16 1.492970 6
99977 2000-01-02 03:46:17 0.308510 6
99978 2000-01-02 03:46:18 0.126579 6
99979 2000-01-02 03:46:19 -1.704093 6
99980 2000-01-02 03:46:20 -0.328285 6
99981 2000-01-02 03:46:21 1.685411 6
99982 2000-01-02 03:46:22 -0.368899 6
99983 2000-01-02 03:46:23 0.915786 6
99984 2000-01-02 03:46:24 -1.694855 6
99985 2000-01-02 03:46:25 -1.488130 6
99986 2000-01-02 03:46:26 -1.274004 6
99987 2000-01-02 03:46:27 -1.508376 6
99988 2000-01-02 03:46:28 0.551695 6
99989 2000-01-02 03:46:29 0.007957 6
99990 2000-01-02 03:46:30 -0.214852 6
99991 2000-01-02 03:46:31 -1.390088 6
99992 2000-01-02 03:46:32 -0.472137 6
99993 2000-01-02 03:46:33 -0.969515 6
99994 2000-01-02 03:46:34 1.129802 6
99995 2000-01-02 03:46:35 -0.291428 6
99996 2000-01-02 03:46:36 0.337134 6
99997 2000-01-02 03:46:37 0.989259 6
99998 2000-01-02 03:46:38 0.705592 6
99999 2000-01-02 03:46:39 -0.311884 6
[100000 rows x 3 columns]
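As a side note, on pandas versions that have the .dt accessor, the same vectorized result can be written without building a DatetimeIndex explicitly; a minimal equivalent, assuming the df above:
# same Monday=0 .. Sunday=6 codes, vectorized via the .dt accessor
df['weekday'] = df['date'].dt.weekday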
I have this dataset
Value1 Value2
2000-01-01 12:00:00 1 2
2000-01-02 12:00:00 3 4
2000-01-03 12:00:00 5 6
I want to repeat the same data but at 4 different time intervals, for example:
Value1 Value2
2000-01-01 00:00:00 1 2
2000-01-01 06:00:00 1 2
2000-01-01 12:00:00 1 2
2000-01-01 18:00:00 1 2
2000-01-02 00:00:00 3 4
2000-01-02 06:00:00 3 4
2000-01-02 12:00:00 3 4
2000-01-02 18:00:00 3 4
and so on.
Your dates are contiguous, but this solution will also work for non-contiguous dates: generate a series of the expanded times, then outer join.
import pandas as pd
import io
df = pd.read_csv(io.StringIO(""" Value1 Value2
2000-01-01 12:00:00 1 2
2000-01-02 12:00:00 3 4
2000-01-03 12:00:00 5 6"""), sep="\s\s+", engine="python")
df = df.set_index(pd.to_datetime(df.index))
df = df.join(
pd.Series(df.index, index=df.index)
.rename("expanded")
.dt.date.apply(lambda d: pd.date_range(d, freq="6H", periods=4))
.explode()
).set_index("expanded")
df
                     Value1  Value2
expanded
2000-01-01 00:00:00       1       2
2000-01-01 06:00:00       1       2
2000-01-01 12:00:00       1       2
2000-01-01 18:00:00       1       2
2000-01-02 00:00:00       3       4
2000-01-02 06:00:00       3       4
2000-01-02 12:00:00       3       4
2000-01-02 18:00:00       3       4
2000-01-03 00:00:00       5       6
2000-01-03 06:00:00       5       6
2000-01-03 12:00:00       5       6
2000-01-03 18:00:00       5       6
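For what it's worth, a rough alternative sketch (not the answer above; the frame is rebuilt inline from the question's data) that skips the join/explode: repeat each row four times and rebuild the index from each row's midnight plus 0/6/12/18-hour offsets.
import numpy as np
import pandas as pd
df = pd.DataFrame({"Value1": [1, 3, 5], "Value2": [2, 4, 6]},
                  index=pd.to_datetime(["2000-01-01 12:00:00",
                                        "2000-01-02 12:00:00",
                                        "2000-01-03 12:00:00"]))
rep = df.loc[df.index.repeat(4)]                                   # four copies of each row
offsets = pd.to_timedelta(np.tile([0, 6, 12, 18], len(df)), unit="h")
rep.index = rep.index.normalize() + offsets                        # midnight + 0/6/12/18 h
print(rep)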
I have a DataFrame and want to multiply all values in column a for a certain day by the value of a at 06:00:00 of that day. If there is no 06:00:00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How do I have to correct this code / replace it with any working solution?
import pandas as pd
import numpy as np
start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame( {'a':[1,3,4,5,6,7,8,9,14]})
df['date']= datetime1
print(df)
def myF(x):
y = x.set_index('date').between_time('05:59', '06:01').a
return y
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
The print(df) output is:
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
....
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here since it implicitly passes all the columns of each group as a DataFrame to the custom function.
To read more about the difference between the two methods, consider this answer.
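To make the difference concrete, a small sketch on toy data (not the question's frame): transform hands the function one column Series per group, while apply hands it the whole group DataFrame, which is why set_index blows up inside transform.
import pandas as pd
toy = pd.DataFrame({'g': ['x', 'x', 'y'], 'a': [1, 2, 3], 'b': [4, 5, 6]})
print(toy.groupby('g').apply(lambda grp: type(grp).__name__))           # each group arrives as a DataFrame
print(toy.groupby('g')['a'].transform(lambda col: type(col).__name__))  # each group arrives one Series at a time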
If you stick to the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a dataframe with the same number of rows as the original dataframe (simulating a transform):
def myF(grp):
time = grp.date.dt.strftime('%T')
target_idx = time == '06:00:00'
if target_idx.any():
grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
else:
grp.loc[~target_idx, 'a_sum'] = np.nan
return grp
df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
Note that, for each day, every value with a time other than 06:00:00 is multiplied by the value at 06:00:00. It returns NaN for the 06:00:00 values themselves, as well as for groups without that time.
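If the end goal is a single scaled column in which those NaN rows stay unchanged, one rough way to finish (a sketch that reuses the question's df and the myF defined just above; a_scaled is only an illustrative name) is to fall back to the original value wherever no product was computed:
out = df.groupby(df.date.dt.floor('D')).apply(myF)
# keep the product where it exists, otherwise keep the original value
out['a_scaled'] = out['a_sum'].fillna(out['a'])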
I have a dataframe with datetimes as the index. There are some gaps in the index, so I upsample it to have a 1-second gap only. I want to fill the gaps by doing half forward filling (from the left side of the gap) and half backward filling (from the right side of the gap).
Input:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:10 4
Upsampled input, with a 10-second step:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 NaN
2000-01-01 00:00:20 NaN
2000-01-01 00:00:30 NaN
2000-01-01 00:00:40 NaN
2000-01-01 00:00:50 NaN
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 NaN
2000-01-01 00:01:20 NaN
2000-01-01 00:01:30 NaN
2000-01-01 00:01:40 NaN
2000-01-01 00:01:50 NaN
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 NaN
2000-01-01 00:02:20 NaN
2000-01-01 00:02:30 NaN
2000-01-01 00:02:40 NaN
2000-01-01 00:02:50 NaN
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
Output I want:
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
2000-01-01 00:04:10 4.0
I managed to get the results I want by finding the edges of the gaps after upsampling, forward filling across the whole gap, and then overwriting just the right half with the value of the right edge. But since my data is so large it takes forever to run, as some of my files have 1M gaps to fill; I basically do this with a for loop over all the identified gaps.
Is there a way this could be done faster?
Thanks!
Edit:
I only want to upsample and fill gaps where the time difference is smaller than or equal to a given value; in the example, only those up to 1 minute, so the last two rows won't have any upsampling or filling between them.
If your data is 1 minute apart, you can do:
df.set_index(0).asfreq('10S').ffill(limit=3).bfill(limit=2)
output:
1
0
2000-01-01 00:00:00 0.0
2000-01-01 00:00:10 0.0
2000-01-01 00:00:20 0.0
2000-01-01 00:00:30 0.0
2000-01-01 00:00:40 1.0
2000-01-01 00:00:50 1.0
2000-01-01 00:01:00 1.0
2000-01-01 00:01:10 1.0
2000-01-01 00:01:20 1.0
2000-01-01 00:01:30 1.0
2000-01-01 00:01:40 2.0
2000-01-01 00:01:50 2.0
2000-01-01 00:02:00 2.0
2000-01-01 00:02:10 2.0
2000-01-01 00:02:20 2.0
2000-01-01 00:02:30 2.0
2000-01-01 00:02:40 3.0
2000-01-01 00:02:50 3.0
2000-01-01 00:03:00 3.0
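A small check of the limit arithmetic on a toy series (not the question's data): on a 10-second grid, two points one minute apart leave five NaN slots between them, so ffill(limit=3) covers the left three and bfill(limit=2) covers the right two.
import pandas as pd
s = pd.Series([0, 1], index=pd.to_datetime(['2000-01-01 00:00:00',
                                            '2000-01-01 00:01:00']))
print(s.asfreq('10S').ffill(limit=3).bfill(limit=2))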
Setup
ts = pd.Series([0, 1, 2, 3], pd.date_range('2000-01-01', periods=4, freq='min'))
merge_asof with direction='nearest'
pd.merge_asof(
ts.asfreq('10s').to_frame('left'),
ts.to_frame('right'),
left_index=True,
right_index=True,
direction='nearest'
)
left right
2000-01-01 00:00:00 0.0 0
2000-01-01 00:00:10 NaN 0
2000-01-01 00:00:20 NaN 0
2000-01-01 00:00:30 NaN 0
2000-01-01 00:00:40 NaN 1
2000-01-01 00:00:50 NaN 1
2000-01-01 00:01:00 1.0 1
2000-01-01 00:01:10 NaN 1
2000-01-01 00:01:20 NaN 1
2000-01-01 00:01:30 NaN 1
2000-01-01 00:01:40 NaN 2
2000-01-01 00:01:50 NaN 2
2000-01-01 00:02:00 2.0 2
2000-01-01 00:02:10 NaN 2
2000-01-01 00:02:20 NaN 2
2000-01-01 00:02:30 NaN 2
2000-01-01 00:02:40 NaN 3
2000-01-01 00:02:50 NaN 3
2000-01-01 00:03:00 3.0 3
reindex with method='nearest'
ts.reindex(ts.asfreq('10s').index, method='nearest')
2000-01-01 00:00:00 0
2000-01-01 00:00:10 0
2000-01-01 00:00:20 0
2000-01-01 00:00:30 1
2000-01-01 00:00:40 1
2000-01-01 00:00:50 1
2000-01-01 00:01:00 1
2000-01-01 00:01:10 1
2000-01-01 00:01:20 1
2000-01-01 00:01:30 2
2000-01-01 00:01:40 2
2000-01-01 00:01:50 2
2000-01-01 00:02:00 2
2000-01-01 00:02:10 2
2000-01-01 00:02:20 2
2000-01-01 00:02:30 3
2000-01-01 00:02:40 3
2000-01-01 00:02:50 3
2000-01-01 00:03:00 3
Freq: 10S, dtype: int64
Note that the two solutions decide "nearest" slightly differently, which produces slightly different results:
pd.merge_asof(
ts.asfreq('10s').to_frame('left'),
ts.to_frame('merge_asof'),
left_index=True,
right_index=True,
direction='nearest'
).assign(reindex=ts.reindex(ts.asfreq('10s').index, method='nearest'))
left merge_asof reindex
2000-01-01 00:00:00 0.0 0 0
2000-01-01 00:00:10 NaN 0 0
2000-01-01 00:00:20 NaN 0 0
2000-01-01 00:00:30 NaN 0 1 # This row is different
2000-01-01 00:00:40 NaN 1 1
2000-01-01 00:00:50 NaN 1 1
2000-01-01 00:01:00 1.0 1 1
2000-01-01 00:01:10 NaN 1 1
2000-01-01 00:01:20 NaN 1 1
2000-01-01 00:01:30 NaN 1 2 # This row is different
2000-01-01 00:01:40 NaN 2 2
2000-01-01 00:01:50 NaN 2 2
2000-01-01 00:02:00 2.0 2 2
2000-01-01 00:02:10 NaN 2 2
2000-01-01 00:02:20 NaN 2 2
2000-01-01 00:02:30 NaN 2 3 # This row is different
2000-01-01 00:02:40 NaN 3 3
2000-01-01 00:02:50 NaN 3 3
2000-01-01 00:03:00 3.0 3 3
I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for a single day, i.e. 1 Jan 2000. The query below gives me the correct result, but is there a way it can be done just by passing "2000-01-01"?
result= df[(df['date1'] > '2000-01-01 00:00') & (df['date1'] < '2000-01-01 23:59')]
Use partial string indexing, but you need a DatetimeIndex first:
df = df.set_index('date1')['2000-01-01']
print (df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
Another solution is to convert the datetimes to strings with strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print (df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
import pandas as pd
import io
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
df = pd.read_csv(io.StringIO(data), sep='\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or
import datetime
df[df.date1.dt.date == datetime.date(2000,1,1)]
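Two further spellings, sketched as illustrations rather than taken from the answers above; both stay on the datetime dtype, and the string on the right-hand side is parsed to a Timestamp for the comparison.
import pandas as pd
df = pd.DataFrame({'date1': pd.to_datetime(['2000-01-01 10:01:00',
                                            '2000-01-02 03:02:00']),
                   'item_id': [1, 2]})
print(df[df['date1'].dt.normalize() == '2000-01-01'])   # normalize() zeroes out the time part
print(df[df['date1'].dt.floor('D') == '2000-01-01'])    # floor('D') does the same here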
For each observation in my data, I'm trying to come up with the number of observations created in the previous 7 days.
obs date
A 1/1/2000
B 1/4/2000
C 1/5/2000
D 1/10/2000
E 1/20/2000
F 1/1/2000
Would become:
obs date births last week
A 1/1/2000 2
B 1/4/2000 3
C 1/5/2000 4
D 1/10/2000 3
E 1/20/2000 1
F 1/1/2000 2
Right now I'm using the following method, but it's very slow:
def past_week(x,df):
back = x['date'] - dt.timedelta(days=7)
return df[(df['date'] >= back) & (df['date'] < x['date'])].count()
df['births_last_week'] = df.apply(lambda x: past_week(x,df),axis=1)
Edit: Having difficulty with duplicate dates. Maybe I'm doing something wrong. I've edited the example above to include a repeated date:
df['births last week'] = df.groupby('date').cumcount() + 1
pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
gives:
date births last week
2000-01-01 1
2000-01-04 2
2000-01-05 3
2000-01-10 3
2000-01-20 1
2000-01-01 1
I've tried rolling_sum instead, but then all I get is NA values for births last week. I imagine there's something extremely obvious that I'm getting wrong, just not sure what.
Here's one approach:
df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
by_day = df.groupby("date").count().resample("D").fillna(0)
csum = by_day.cumsum()
last_week = csum - csum.shift(7).fillna(0)
final = last_week.loc[df.date]
producing
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
Step by step, first we get the DataFrame (you probably have this already):
>>> df = pd.read_csv("birth.csv", delim_whitespace=True, parse_dates=["date"])
>>> df
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
5 F 2000-01-01
Then we groupby on date, and count the number of observations:
>>> df.groupby("date").count()
obs
date
2000-01-01 2
2000-01-04 1
2000-01-05 1
2000-01-10 1
2000-01-20 1
We can resample this to days; it'll be a much longer timeseries, of course, but memory is cheap and I'm lazy:
>>> df.groupby("date").count().resample("D")
obs
date
2000-01-01 2
2000-01-02 NaN
2000-01-03 NaN
2000-01-04 1
2000-01-05 1
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 NaN
2000-01-10 1
2000-01-11 NaN
2000-01-12 NaN
2000-01-13 NaN
2000-01-14 NaN
2000-01-15 NaN
2000-01-16 NaN
2000-01-17 NaN
2000-01-18 NaN
2000-01-19 NaN
2000-01-20 1
Get rid of the nans:
>>> by_day = df.groupby("date").count().resample("D").fillna(0)
>>> by_day
obs
date
2000-01-01 2
2000-01-02 0
2000-01-03 0
2000-01-04 1
2000-01-05 1
2000-01-06 0
2000-01-07 0
2000-01-08 0
2000-01-09 0
2000-01-10 1
2000-01-11 0
2000-01-12 0
2000-01-13 0
2000-01-14 0
2000-01-15 0
2000-01-16 0
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And take the cumulative sum, as part of a manual rolling-sum process. The default rolling sum has the wrong alignment, so I'll just subtract with a difference of one week:
>>> csum = by_day.cumsum()
>>> last_week = csum - csum.shift(7).fillna(0)
>>> last_week
obs
date
2000-01-01 2
2000-01-02 2
2000-01-03 2
2000-01-04 3
2000-01-05 4
2000-01-06 4
2000-01-07 4
2000-01-08 2
2000-01-09 2
2000-01-10 3
2000-01-11 2
2000-01-12 1
2000-01-13 1
2000-01-14 1
2000-01-15 1
2000-01-16 1
2000-01-17 0
2000-01-18 0
2000-01-19 0
2000-01-20 1
And then select the dates we care about:
>>> final = last_week.loc[df.date]
>>> final
obs
date
2000-01-01 2
2000-01-04 3
2000-01-05 4
2000-01-10 3
2000-01-20 1
2000-01-01 2
In [57]: df
Out[57]:
obs date
0 A 2000-01-01
1 B 2000-01-04
2 C 2000-01-05
3 D 2000-01-10
4 E 2000-01-20
In [58]: df['births last week'] = 1
In [59]: pd.rolling_count(df.set_index('date'), 7 + 1, freq='D').loc[df.date] - 1
Out[59]:
births last week
2000-01-01 0
2000-01-04 1
2000-01-05 2
2000-01-10 2
2000-01-20 0
I subtract 1 because rolling_count includes the current entry, and you don't.
Edit: To handle duplicate dates, as discussed in comments on your question, group by date and sum the 'births last week' column between inputs 58 and 59 above.
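For readers on a newer pandas release where pd.rolling_count and pd.rolling_sum no longer exist, a rough sketch of the same idea in current syntax (not either answer verbatim): count births per calendar day, take a trailing 7-day rolling sum, and look the original dates back up; duplicate dates then share the same count.
import pandas as pd
df = pd.DataFrame({'obs': list('ABCDEF'),
                   'date': pd.to_datetime(['1/1/2000', '1/4/2000', '1/5/2000',
                                           '1/10/2000', '1/20/2000', '1/1/2000'])})
per_day = df.groupby('date').size().asfreq('D', fill_value=0)   # births per calendar day
last_week = per_day.rolling(7, min_periods=1).sum()             # current day plus the previous six days
df['births last week'] = last_week.loc[df['date']].to_numpy()
print(df)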