I have a dataframe df as below:
date1 item_id
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
2000-01-02 00:08:00 8
2000-01-02 00:00:00 0
2000-01-02 00:01:00 1
2000-01-02 03:02:00 2
2000-01-02 00:03:00 3
2000-01-02 00:04:00 4
2000-01-02 00:05:00 5
2000-01-02 04:06:00 6
2000-01-02 00:07:00 7
2000-01-02 00:08:00 8
I need the data for a single day, i.e. 1st Jan 2000. The query below gives me the correct result, but is there a way to do it just by passing "2000-01-01"?
result= df[(df['date1'] > '2000-01-01 00:00') & (df['date1'] < '2000-01-01 23:59')]
Use partial string indexing; this needs a DatetimeIndex first. On modern pandas, go through .loc, since df['2000-01-01'] on a DataFrame is treated as a column lookup:
df = df.set_index('date1').loc['2000-01-01']
print (df)
item_id
date1
2000-01-01 00:00:00 0
2000-01-01 10:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 12:07:00 7
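As a minimal, self-contained sketch of the same idea (the small sample frame here is illustrative, mirroring a few rows of the data above):

```python
import pandas as pd

# Hypothetical two-day sample mirroring the data above
df = pd.DataFrame({
    "date1": pd.to_datetime(["2000-01-01 00:00", "2000-01-01 12:07",
                             "2000-01-02 00:08", "2000-01-02 03:02"]),
    "item_id": [0, 7, 8, 2],
})

# Partial string indexing: one date string selects the whole day
res = df.set_index("date1").loc["2000-01-01"]
print(res)
```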
Another solution is to convert the datetimes to strings with strftime and filter by boolean indexing:
df = df[df['date1'].dt.strftime('%Y-%m-%d') == '2000-01-01']
print (df)
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
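Formatting a large column to strings is relatively slow. A hedged alternative (assuming date1 is already datetime64) keeps the comparison in datetime space with dt.normalize(); the sample frame below is illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "date1": pd.to_datetime(["2000-01-01 00:00", "2000-01-01 12:07",
                             "2000-01-02 00:08"]),
    "item_id": [0, 7, 8],
})

# Floor each timestamp to midnight, then compare against a Timestamp
res = df[df["date1"].dt.normalize() == pd.Timestamp("2000-01-01")]
print(res)
```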
The other alternative would be to create a mask:
df[df.date1.dt.date.astype(str) == '2000-01-01']
Full example:
import pandas as pd
data = '''\
date1 item_id
2000-01-01T00:00:00 0
2000-01-01T10:01:00 1
2000-01-01T00:02:00 2
2000-01-01T00:03:00 3
2000-01-01T00:04:00 4
2000-01-01T00:05:00 5
2000-01-01T00:06:00 6
2000-01-01T12:07:00 7
2000-01-02T00:08:00 8
2000-01-02T00:00:00 0
2000-01-02T00:01:00 1
2000-01-02T03:02:00 2'''
import io
df = pd.read_csv(io.StringIO(data), sep=r'\s+', parse_dates=['date1'])
res = df[df.date1.dt.date.astype(str) == '2000-01-01']
print(res)
Returns:
date1 item_id
0 2000-01-01 00:00:00 0
1 2000-01-01 10:01:00 1
2 2000-01-01 00:02:00 2
3 2000-01-01 00:03:00 3
4 2000-01-01 00:04:00 4
5 2000-01-01 00:05:00 5
6 2000-01-01 00:06:00 6
7 2000-01-01 12:07:00 7
Or
import datetime
df[df.date1.dt.date == datetime.date(2000,1,1)]
Related
I have this dataset
Value1 Value2
2000-01-01 12:00:00 1 2
2000-01-02 12:00:00 3 4
2000-01-03 12:00:00 5 6
I want to repeat the same data but at 4 different time intervals for example
Value1 Value2
2000-01-01 00:00:00 1 2
2000-01-01 06:00:00 1 2
2000-01-01 12:00:00 1 2
2000-01-01 18:00:00 1 2
2000-01-02 00:00:00 3 4
2000-01-02 06:00:00 3 4
2000-01-02 12:00:00 3 4
2000-01-02 18:00:00 3 4
and so on.
Although your dates are contiguous, this solution also works for non-contiguous dates: generate a Series of the expanded times, then join.
import pandas as pd
import io
df = pd.read_csv(io.StringIO("""Value1  Value2
2000-01-01 12:00:00  1  2
2000-01-02 12:00:00  3  4
2000-01-03 12:00:00  5  6"""), sep=r"\s\s+", engine="python")
df = df.set_index(pd.to_datetime(df.index))
df = df.join(
    pd.Series(df.index, index=df.index)
    .rename("expanded")
    .dt.date.apply(lambda d: pd.date_range(d, freq="6H", periods=4))
    .explode()
).set_index("expanded")
df
                     Value1  Value2
expanded
2000-01-01 00:00:00       1       2
2000-01-01 06:00:00       1       2
2000-01-01 12:00:00       1       2
2000-01-01 18:00:00       1       2
2000-01-02 00:00:00       3       4
2000-01-02 06:00:00       3       4
2000-01-02 12:00:00       3       4
2000-01-02 18:00:00       3       4
2000-01-03 00:00:00       5       6
2000-01-03 06:00:00       5       6
2000-01-03 12:00:00       5       6
2000-01-03 18:00:00       5       6
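If every row represents exactly one day, the same expansion can also be sketched without a join: repeat each row four times and add 0/6/12/18-hour offsets. This is an alternative under that assumption, not the answer's method:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"Value1": [1, 3, 5], "Value2": [2, 4, 6]},
    index=pd.to_datetime(["2000-01-01 12:00", "2000-01-02 12:00",
                          "2000-01-03 12:00"]),
)

# Repeat each row 4 times, then rebuild the index from the row's
# midnight plus a 0/6/12/18-hour offset
out = df.loc[df.index.repeat(4)]
offsets = pd.to_timedelta(np.tile([0, 6, 12, 18], len(df)), unit="h")
out.index = out.index.normalize() + offsets
print(out)
```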
I have a DataFrame and want to multiply all values in column a for a given day by the value of a at 06:00:00 of that day. If there is no 06:00:00 entry, that day should stay unchanged.
The code below unfortunately gives an error.
How do I have to correct this code / replace it with any working solution?
import pandas as pd
import numpy as np
start = pd.Timestamp('2000-01-01')
end = pd.Timestamp('2000-01-03')
t = np.linspace(start.value, end.value, 9)
datetime1 = pd.to_datetime(t)
df = pd.DataFrame( {'a':[1,3,4,5,6,7,8,9,14]})
df['date']= datetime1
print(df)
def myF(x):
    y = x.set_index('date').between_time('05:59', '06:01').a
    return y

toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
a date
0 1 2000-01-01 00:00:00
1 3 2000-01-01 06:00:00
2 4 2000-01-01 12:00:00
3 5 2000-01-01 18:00:00
4 6 2000-01-02 00:00:00
5 7 2000-01-02 06:00:00
6 8 2000-01-02 12:00:00
7 9 2000-01-02 18:00:00
8 14 2000-01-03 00:00:00
....
AttributeError: ("'Series' object has no attribute 'set_index'", 'occurred at index a')
You should change this line:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).transform(myF)
to this:
toMultiplyWith = df.groupby(df.date.dt.floor('D')).apply(myF)
Using .apply instead of .transform will give you the desired result.
apply is the right choice here, since it implicitly passes all the columns of each group as a DataFrame to the custom function.
To read more about the difference between the two methods, consider this answer.
If you want to stick with the between_time(...) function, this would be the way to do it:
df = df.set_index('date')
mask = df.between_time('05:59', '06:01').index
df.loc[mask, 'a'] = df.loc[mask, 'a'] ** 2 # the operation you want to perform
df.reset_index(inplace=True)
Outputs:
date a
0 2000-01-01 00:00:00 1
1 2000-01-01 06:00:00 9
2 2000-01-01 12:00:00 4
3 2000-01-01 18:00:00 5
4 2000-01-02 00:00:00 6
5 2000-01-02 06:00:00 49
6 2000-01-02 12:00:00 8
7 2000-01-02 18:00:00 9
8 2000-01-03 00:00:00 14
If I got your goal right, you can use apply to return a DataFrame with the same number of rows as the original (simulating a transform):
def myF(grp):
    time = grp.date.dt.strftime('%T')
    target_idx = time == '06:00:00'
    if target_idx.any():
        grp.loc[~target_idx, 'a_sum'] = grp.loc[~target_idx, 'a'].values * grp.loc[target_idx, 'a'].values
    else:
        grp.loc[~target_idx, 'a_sum'] = np.nan
    return grp
df.groupby(df.date.dt.floor('D')).apply(myF)
Output:
a date a_sum
0 1 2000-01-01 00:00:00 3.0
1 3 2000-01-01 06:00:00 NaN
2 4 2000-01-01 12:00:00 12.0
3 5 2000-01-01 18:00:00 15.0
4 6 2000-01-02 00:00:00 42.0
5 7 2000-01-02 06:00:00 NaN
6 8 2000-01-02 12:00:00 56.0
7 9 2000-01-02 18:00:00 63.0
8 14 2000-01-03 00:00:00 NaN
See that, for each day, each value with a time other than 06:00:00 is multiplied by the value at 06:00:00. It returns NaN for the 06:00:00 values themselves, as well as for groups without that time.
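The same day-wise multiplication can also be sketched without groupby.apply: build a day-to-06:00-value mapping and multiply through it. Here days without a 06:00 entry get factor 1 (i.e. stay unchanged, per the question), and the 06:00 rows are scaled by themselves; the sample data and column name a_scaled are illustrative:

```python
import datetime
import pandas as pd

df = pd.DataFrame({
    "a": [1, 3, 4, 6, 7, 14],
    "date": pd.to_datetime(["2000-01-01 00:00", "2000-01-01 06:00",
                            "2000-01-01 12:00", "2000-01-02 00:00",
                            "2000-01-02 06:00", "2000-01-03 00:00"]),
})

# Rows whose time-of-day is exactly 06:00
at_six = df[df["date"].dt.time == datetime.time(6, 0)]
# Map day (midnight timestamp) -> that day's 06:00 value
day_to_six = at_six.set_index(at_six["date"].dt.normalize())["a"]

# Days without a 06:00 entry get factor 1, i.e. stay unchanged
factor = df["date"].dt.normalize().map(day_to_six).fillna(1)
df["a_scaled"] = df["a"] * factor
print(df)
```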
Let's look at some one-minute data:
In [513]: rng = pd.date_range('1/1/2000', periods=12, freq='T')
In [514]: ts = pd.Series(np.arange(12), index=rng)
In [515]: ts
Out[515]:
2000-01-01 00:00:00 0
2000-01-01 00:01:00 1
2000-01-01 00:02:00 2
2000-01-01 00:03:00 3
2000-01-01 00:04:00 4
2000-01-01 00:05:00 5
2000-01-01 00:06:00 6
2000-01-01 00:07:00 7
2000-01-01 00:08:00 8
2000-01-01 00:09:00 9
2000-01-01 00:10:00 10
2000-01-01 00:11:00 11
Freq: T
Suppose you wanted to aggregate this data into five-minute chunks or bars by taking
the sum of each group:
In [516]: ts.resample('5min', how='sum')
Out[516]:
2000-01-01 00:00:00 0
2000-01-01 00:05:00 15
2000-01-01 00:10:00 40
2000-01-01 00:15:00 11
Freq: 5T
However, I don't want to use the resample method but still want the same input and output. How can I use groupby, reindex, or other such methods?
You can use a custom pd.Grouper this way:
In [78]: ts.groupby(pd.Grouper(freq='5min', closed='right')).sum()
Out [78]:
1999-12-31 23:55:00 0
2000-01-01 00:00:00 15
2000-01-01 00:05:00 40
2000-01-01 00:10:00 11
Freq: 5T, dtype: int64
The closed='right' reproduces the old resample binning, so the sums match; note the labels shown here are the left bin edges, though, so add label='right' to match the quoted labels exactly.
However, if your aim is to do more custom grouping, you can use .groupby with your own vector:
In [78]: buckets = (ts.index - ts.index[0]) / pd.Timedelta('5min')
In [79]: grp = ts.groupby(np.ceil(buckets.values))
In [80]: grp.sum()
Out[80]:
0 0
1 15
2 40
3 11
dtype: int64
The output is not exactly the same, but the method is more flexible (e.g. can create uneven buckets).
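A self-contained sketch of the Grouper approach with label='right' added, so that both the sums and the index match the quoted resample output (assuming modern pandas, where resample(..., how=...) no longer exists):

```python
import numpy as np
import pandas as pd

rng = pd.date_range("1/1/2000", periods=12, freq="min")
ts = pd.Series(np.arange(12), index=rng)

# closed='right' matches the old resample default binning;
# label='right' labels each bin by its right edge, matching the quote
out = ts.groupby(pd.Grouper(freq="5min", closed="right", label="right")).sum()
print(out)
```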
I have a csv-file with entries like this:
1,2014 1 1 0 1,5
2,2014 1 1 0 1,5
3,2014 1 1 0 1,5
4,2014 1 1 0 1,6
5,2014 1 1 0 1,6
6,2014 1 1 0 1,12
7,2014 1 1 0 1,17
8,2014 5 7 1 5,4
The first column is the ID, the second the arrival date (example for the last entry: May 07, 1:05 a.m.), and the last column is the duration of the work (in minutes).
Currently, I read in the data using pandas and the following function:
import pandas as pd
def convert_data(csv_path):
    store = pd.HDFStore(data_file)
    print('Loading CSV File')
    df = pd.read_csv(csv_path, parse_dates=True)
    print('CSV File Loaded, Converting Dates/Times')
    df['Arrival_time'] = map(convert_time, df['Arrival_time'])
    df['Rel_time'] = (df['Arrival_time'] - REF.timestamp)/60.0
    print('Conversion Complete')
    store['orders'] = df
My question is: how can I sort the entries by their duration, but taking the arrival date into account? That is, I'd like to sort the csv entries by "arrival date + duration". How is this possible?
Thanks for any hint! Best regards, Stan.
OK, the following shows how to convert the datetimes, and then how to add the minutes:
In [79]:
df['Arrival_Date'] = pd.to_datetime(df['Arrival_Date'], format='%Y %m %d %H %M')
df
Out[79]:
ID Arrival_Date Duration
0 1 2014-01-01 00:01:00 5
1 2 2014-01-01 00:01:00 5
2 3 2014-01-01 00:01:00 5
3 4 2014-01-01 00:01:00 6
4 5 2014-01-01 00:01:00 6
5 6 2014-01-01 00:01:00 12
6 7 2014-01-01 00:01:00 17
7 8 2014-05-07 01:05:00 4
In [80]:
import datetime as dt
df['Arrival_and_Duration'] = df['Arrival_Date'] + df['Duration'].apply(lambda x: dt.timedelta(minutes=int(x)))
df
Out[80]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
In [81]:
df.sort_values(by=['Arrival_and_Duration'])
Out[81]:
ID Arrival_Date Duration Arrival_and_Duration
0 1 2014-01-01 00:01:00 5 2014-01-01 00:06:00
1 2 2014-01-01 00:01:00 5 2014-01-01 00:06:00
2 3 2014-01-01 00:01:00 5 2014-01-01 00:06:00
3 4 2014-01-01 00:01:00 6 2014-01-01 00:07:00
4 5 2014-01-01 00:01:00 6 2014-01-01 00:07:00
5 6 2014-01-01 00:01:00 12 2014-01-01 00:13:00
6 7 2014-01-01 00:01:00 17 2014-01-01 00:18:00
7 8 2014-05-07 01:05:00 4 2014-05-07 01:09:00
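The per-row apply(lambda ...) above can also be replaced by a vectorized pd.to_timedelta over the whole column. A minimal sketch with a two-row illustrative sample (df.sort has since been removed in favor of sort_values):

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 8],
    "Arrival_Date": pd.to_datetime(["2014-01-01 00:01", "2014-05-07 01:05"]),
    "Duration": [5, 4],
})

# Vectorized: the whole Duration column is converted to minutes at once
df["Arrival_and_Duration"] = df["Arrival_Date"] + pd.to_timedelta(df["Duration"], unit="min")
res = df.sort_values("Arrival_and_Duration")
print(res)
```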
I'm loading data in pandas, where the column date contains datetime values, e.g.:
date ; .....more stuff ......
2000-01-03 ;
2000-01-04 ;
2000-01-06 ;
...
2000-01-31 ;
2000-02-01 ;
2000-02-02 ;
2000-02-04 ;
I have a function to add a column containing the weekday-indices (0-6):
def genWeekdays(df, src='date', target='weekday'):
    """
    bla bla bla
    """
    df[target] = df[src].apply(lambda x: x.weekday())
    return df
calling it via
df = genWeekdays(df)
df has about a million rows and this takes about 1.3 seconds.
Any way to speed this up? I'm a little surprised at how long this takes on my i7-4770K :(
Thanks in advance
In [30]: df = DataFrame(dict(date = pd.date_range('20000101',periods=100000,freq='s'), value = np.random.randn(100000)))
In [31]: df['weekday'] = pd.DatetimeIndex(df['date']).weekday
In [32]: %timeit pd.DatetimeIndex(df['date']).weekday
10 loops, best of 3: 34.9 ms per loop
In [33]: df
Out[33]:
date value weekday
0 2000-01-01 00:00:00 -0.046604 5
1 2000-01-01 00:00:01 -1.691611 5
2 2000-01-01 00:00:02 0.416015 5
3 2000-01-01 00:00:03 0.054822 5
4 2000-01-01 00:00:04 -0.661163 5
5 2000-01-01 00:00:05 0.274402 5
6 2000-01-01 00:00:06 -0.426533 5
7 2000-01-01 00:00:07 0.028769 5
8 2000-01-01 00:00:08 0.248581 5
9 2000-01-01 00:00:09 1.302145 5
10 2000-01-01 00:00:10 -1.886830 5
11 2000-01-01 00:00:11 2.276506 5
12 2000-01-01 00:00:12 0.054104 5
13 2000-01-01 00:00:13 0.378990 5
14 2000-01-01 00:00:14 0.868879 5
15 2000-01-01 00:00:15 -0.046810 5
16 2000-01-01 00:00:16 -0.499447 5
17 2000-01-01 00:00:17 1.067412 5
18 2000-01-01 00:00:18 -1.625986 5
19 2000-01-01 00:00:19 0.515884 5
20 2000-01-01 00:00:20 -1.884882 5
21 2000-01-01 00:00:21 0.943775 5
22 2000-01-01 00:00:22 0.034501 5
23 2000-01-01 00:00:23 0.438170 5
24 2000-01-01 00:00:24 -1.211937 5
25 2000-01-01 00:00:25 -0.229930 5
26 2000-01-01 00:00:26 0.938805 5
27 2000-01-01 00:00:27 0.026815 5
28 2000-01-01 00:00:28 2.166740 5
29 2000-01-01 00:00:29 -0.096927 5
... ... ... ...
99970 2000-01-02 03:46:10 -0.310023 6
99971 2000-01-02 03:46:11 0.561321 6
99972 2000-01-02 03:46:12 2.207426 6
99973 2000-01-02 03:46:13 -0.253933 6
99974 2000-01-02 03:46:14 -0.711145 6
99975 2000-01-02 03:46:15 -0.477377 6
99976 2000-01-02 03:46:16 1.492970 6
99977 2000-01-02 03:46:17 0.308510 6
99978 2000-01-02 03:46:18 0.126579 6
99979 2000-01-02 03:46:19 -1.704093 6
99980 2000-01-02 03:46:20 -0.328285 6
99981 2000-01-02 03:46:21 1.685411 6
99982 2000-01-02 03:46:22 -0.368899 6
99983 2000-01-02 03:46:23 0.915786 6
99984 2000-01-02 03:46:24 -1.694855 6
99985 2000-01-02 03:46:25 -1.488130 6
99986 2000-01-02 03:46:26 -1.274004 6
99987 2000-01-02 03:46:27 -1.508376 6
99988 2000-01-02 03:46:28 0.551695 6
99989 2000-01-02 03:46:29 0.007957 6
99990 2000-01-02 03:46:30 -0.214852 6
99991 2000-01-02 03:46:31 -1.390088 6
99992 2000-01-02 03:46:32 -0.472137 6
99993 2000-01-02 03:46:33 -0.969515 6
99994 2000-01-02 03:46:34 1.129802 6
99995 2000-01-02 03:46:35 -0.291428 6
99996 2000-01-02 03:46:36 0.337134 6
99997 2000-01-02 03:46:37 0.989259 6
99998 2000-01-02 03:46:38 0.705592 6
99999 2000-01-02 03:46:39 -0.311884 6
[100000 rows x 3 columns]
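On modern pandas the same vectorized access is available directly through the .dt accessor, avoiding both the row-wise apply and the explicit DatetimeIndex construction. A small self-contained sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "date": pd.date_range("2000-01-01", periods=5, freq="D"),
    "value": np.random.randn(5),
})

# Vectorized weekday (Monday=0 .. Sunday=6) straight from the column
df["weekday"] = df["date"].dt.weekday
print(df)
```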