With pandas.date_range one can easily create a time column for a dataframe, such as
df["Data"] = pd.date_range(start='2020-01-01 00:00', end='2020-01-05 02:00', freq='H')
which produces an hourly sequence of timestamps.
I wonder whether it is possible to use date_range to create, within a range like the one above, 10 entries for each time period. In other words, 10 cells with 2020-01-01 00:00:00, then 10 cells with 2020-01-01 01:00:00, and so on.
If not, how should one do that?
Try np.repeat:
import numpy as np
# repeat each hourly timestamp 10 times in a row
df['Data'] = np.repeat(pd.date_range(start='2020-01-01 00:00',
                                     end='2020-01-05 02:00', freq='H'),
                       10)
You can also use a list comprehension, as below (the inner range(10) repeats each timestamp 10 times in a row):
pd.DataFrame({"Data": [d
                       for d in pd.date_range(start='2020-01-01 00:00', end='2020-01-05 02:00', freq='H')
                       for i in range(10)]})
I have a range of dates in the date column of a dataframe. The dates are scattered, e.g. 1st Feb, 5th Feb, 11th Feb, etc.
I want to use pd.date_range with a frequency of one minute on every date in this column, so my start argument will be the date and the end argument will be date + datetime.timedelta(days=1).
I'm struggling to use the apply function for this. Can someone help me with it, or can I use some other function here?
I don't want to use a for loop because the length of my dates will be HUGE.
I tried this:
df.date.apply(lambda x: pd.date_range(start=df['date'], end=df['date'] + datetime.timedelta(days=1), freq="1min"), axis=1)
but I'm getting an error.
Thanks in advance
Use x inside the lambda function instead of df['date'], and remove axis=1:
df = pd.DataFrame({'date':pd.date_range('2021-11-26', periods=3)})
print (df)
date
0 2021-11-26
1 2021-11-27
2 2021-11-28
s = df['date'].apply(lambda x:pd.date_range(start=x,end=x+pd.Timedelta(days=1),freq="1min"))
print (s)
0 DatetimeIndex(['2021-11-26 00:00:00', '2021-11...
1 DatetimeIndex(['2021-11-27 00:00:00', '2021-11...
2 DatetimeIndex(['2021-11-28 00:00:00', '2021-11...
Name: date, dtype: object
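If you need one flat column of minute timestamps rather than a Series of DatetimeIndex objects, you could flatten the result with Series.explode; a minimal sketch, assuming pandas 0.25 or later (where explode was introduced) and reusing s from above:
# flatten the per-row DatetimeIndex objects into one long column of timestamps
flat = s.explode().reset_index(drop=True).astype('datetime64[ns]')
print(flat.head())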
I've spent a few hours searching Google to solve this problem, but I haven't been able to find anything on these parameters. I want to find out whether the value associated with a specific datetime falls within at least one of the datetime ranges in another dataframe of a different size. Below are the example dataframes:
import pandas as pd
df1 = pd.DataFrame({'Datetime': ['2020-01-01 01:01:01', '2020-01-01 10:10:10', '2020-01-01 12:10:01', '2020-01-02 03:16:24', '2020-12-01 04:34:21'], 'Value': [0.006, 0.002, 0.005, 0.034, 0.001]})
df2 = pd.DataFrame({'Start': ['2020-01-01 01:01:00', '2020-01-01 07:10:10', '2020-01-01 21:10:01', '2020-01-03 06:16:24', '2020-12-25 14:12:34'], 'End': ['2020-01-01 02:00:00', '2020-01-01 08:01:01', '2020-01-01 21:34:09', '2020-01-01 09:23:42', '2020-12-25 15:13:21']})
# convert columns to datetime format
df1.Datetime = pd.to_datetime(df1.Datetime)
df2[['Start', 'End']] = df2[['Start', 'End']].apply(pd.to_datetime)
df1
Datetime Value Check
2020-01-01 01:01:01 0.006
2020-01-01 10:10:10 0.002
2020-01-01 12:10:01 0.005
2020-01-02 03:16:24 0.034
2020-12-01 04:34:21 0.001
df2
Start End
2020-01-01 01:01:00 2020-01-01 02:00:00
2020-01-01 07:10:10 2020-01-01 08:01:01
2020-01-01 21:10:01 2020-01-01 21:34:09
2020-01-03 06:16:24 2020-01-01 09:23:42
2020-12-25 14:12:34 2020-12-25 15:13:21
If the df1['Datetime'] at a given index (say index 1) falls within any of the ranges in df2, the function should return True in df1['Check'] for that row, and False if the datetime is not within any of the ranges. This should continue so that every index in df1 is checked against all ranges in df2.
I've tried using pd.DataFrame.any, but this throws "ValueError: Can only compare identically-labeled Series objects". I was thinking a nested for loop would be the way to go, but I am not sure how to set one up for something like this.
Using numpy's array broadcasting:
dt = df1['Datetime'].to_numpy()
start = df2['Start'].to_numpy()[:, None]  # column vectors, so every range is
end = df2['End'].to_numpy()[:, None]      # compared against every datetime
mask = (start <= dt) & (dt <= end)        # shape (len(df2), len(df1))
df1['Check'] = mask.any(axis=0)           # True if any interval contains the datetime
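Note that broadcasting materialises the full len(df2) x len(df1) boolean matrix in memory, so it is very fast for moderately sized frames, but the memory cost grows with the product of the two lengths; for very large frames the merge_asof approach below scales better.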
You can use merge_asof to associate each Datetime value with the latest Start value it comes after, then filter on whether it falls within that Start-End range. Below I define a dataframe with random timestamps within a 24-hour period, and a second dataframe with one-hour intervals every two hours. If our dataframe is large, we expect half the rows to fall within an interval and half outside. Note this is for 1M rows, so you can see that this is pretty performant:
import numpy as np
# create random time data
data = 24*np.random.random(1000000)
time_values = pd.to_datetime("2020-01-01") + pd.to_timedelta(data, unit="hour")
df = pd.DataFrame(time_values, columns=["Datetime"])
df["Value"] = data
df = df.sort_values("Datetime")
# create one hour intervals every two hours:
start_times = pd.to_datetime("2020-01-01") + pd.to_timedelta(np.arange(0,24,2), unit="hour")
df2 = pd.DataFrame(start_times, columns=["Start"])
df2["End"] = df2["Start"] + pd.to_timedelta(1, unit="hour")
df2 = df2.sort_values("Start")
# associate "Datetime" with the latest "Start" that occurs before it
df_merged = pd.merge_asof(left=df, right=df2, left_on="Datetime", right_on="Start")
# If it's within the window, "End" will come after "Datetime":
print("number of points in intervals:")
print(np.sum(df_merged.Datetime < df_merged.End))
# 499763
As we expect, about half of the random points fall within the intervals.
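Back on the question's frames, the same idea can fill the Check column directly; a rough sketch (merge_asof needs both frames sorted on the merge keys, and rows with no preceding Start get NaT for End, which compares as False):
df1_sorted = df1.sort_values("Datetime")
merged = pd.merge_asof(df1_sorted, df2.sort_values("Start"),
                       left_on="Datetime", right_on="Start")
# inside the matched interval only if that interval has not ended yet
df1_sorted["Check"] = (merged["Datetime"] <= merged["End"]).to_numpy()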
df1['Check'] = df1.Datetime.apply(lambda x: ((df2.Start <= x) & (df2.End >= x)).any())
I have a feeling that the numpy broadcasting answer might in fact be faster, but apply is more Pythonic. :-)
slots = pd.DataFrame({'times': ['2020-02-01 18:40:00', '2020-02-01 08:40:00',
'2020-02-01 03:40:00', '2020-02-01 14:40:00',
'2010-05-05 22:00:00', '2018-03-08 23:00:00']})
print(slots)
slots['times'] = pd.to_datetime(slots.times)
from datetime import datetime
start = datetime.strptime('17:09:00', '%H:%M:%S').time()
print(start)
end = datetime.strptime('01:59:00', '%H:%M:%S').time()
print(end)
print(slots[slots['times'].dt.time.between(start, end)])
output: Empty DataFrame
Columns: [times]
Index: []
I am getting an empty dataframe. Can someone please advise, or is there another way to do it?
Pandas has the method DataFrame.between_time, which handles windows that cross midnight, so I suggest using it. DataFrame.set_index and DataFrame.reset_index are added because the method works on a DatetimeIndex:
df = (slots.set_index('times', drop=False)
.between_time('17:09:00', '01:59:00')
.reset_index(drop=True))
print (df)
times
0 2020-02-01 18:40:00
1 2010-05-05 22:00:00
2 2018-03-08 23:00:00
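The reason the original between call comes back empty is that the window crosses midnight, so start is later than end and Series.between sees an empty range. If you prefer not to touch the index, a boolean mask that splits the overnight window into two pieces also works; a small sketch reusing start and end from above:
t = slots['times'].dt.time
# keep rows after the evening start OR before the early-morning end
print(slots[(t >= start) | (t <= end)])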
What is the efficient way to convert column values into dates in "DD-MM-YYYY" format when the values are given like "Feb-15", which needs to become "01-02-2015"? If the value is "Dec-46" it must return "01-12-1946".
You can pass the format '%b-%y' to to_datetime:
In[42]:
df = pd.DataFrame({'date':["Feb-15","Dec-46"]})
df['new_date'] = pd.to_datetime(df['date'], format='%b-%y')
df
Out[42]:
date new_date
0 Feb-15 2015-02-01
1 Dec-46 2046-12-01
Note that the new dtype is datetime64, so you cannot control the display output. If you insist on DD-MM-YYYY then you would have to convert to a string using dt.strftime:
In[43]:
df['str_date'] = df['new_date'].dt.strftime('%d-%m-%Y')
df
Out[43]:
date new_date str_date
0 Feb-15 2015-02-01 01-02-2015
1 Dec-46 2046-12-01 01-12-2046
but then you have strings, which is not that useful if you need to perform arithmetic operations or filtering.
EDIT
Note that %y pivots two-digit years: values 00-68 are parsed as 2000-2068 and 69-99 as 1969-1999, which is why 'Dec-46' comes out as 2046-12-01 above rather than 1946-12-01. If these should all be 20th-century dates, you need to shift the parsed years yourself.
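One way to do that is to push any year at or beyond a cutoff back a century; a sketch, where the cutoff year is an assumption you would adjust to your data:
parsed = pd.to_datetime(df['date'], format='%b-%y')
# assumption: anything parsed at or beyond this year really belongs to the 1900s
cutoff = 2030
df['new_date'] = parsed.where(parsed.dt.year < cutoff,
                              parsed - pd.DateOffset(years=100))
# 'Dec-46' now becomes 1946-12-01 instead of 2046-12-01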
I have a large Pandas dataframe in which one column is (unordered) datetimes from a known period (the year 2013). I need an efficient way to convert these datetimes to indices, where each index = number of hours since start_time ('2013-01-01 00:00'). There are duplicate times, which should map to duplicate indices.
Obviously, this can be done one-at-a-time with a loop by using timedelta. It can also be done with a loop by using Pandas Series (see the following snippet, which generates the ordered series of all datetimes since start_time):
nhours = 365*24
time_series = pd.Series(range(nhours), index=pd.date_range('2013-01-01', periods=nhours, freq='H'))
After running this snippet, one can get indices using the .index or .get_loc methods in a loop.
However, is there a fast (non-loopy) way to take a column of arbitrary datetimes and find their respective indices?
For example, inputing the following column of datetimes:
2013-01-01 11:00:00
2013-01-01 11:00:00
2013-01-01 00:00:00
2013-12-30 18:00:00
should output the following indices: [11, 11, 0, 8730]
loc can take a list or array of labels to look up:
>>> print(time_series.loc[[pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101 11:00'), pd.Timestamp('20130101'), pd.Timestamp('20131230 18:00')]])
2013-01-01 11:00:00 11
2013-01-01 11:00:00 11
2013-01-01 00:00:00 0
2013-12-30 18:00:00 8730
dtype: int64
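If you only need the positions themselves, Index.get_indexer also accepts a list of timestamps and preserves both order and duplicates (returning -1 for anything not found); a small sketch against the same series:
pos = time_series.index.get_indexer([pd.Timestamp('20130101 11:00'),
                                     pd.Timestamp('20130101 11:00'),
                                     pd.Timestamp('20130101'),
                                     pd.Timestamp('20131230 18:00')])
# array([11, 11, 0, 8730])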
Thank you for the responses. I have a new, faster solution that takes advantage of the fact that pandas supports datetime and timedelta formats. It turns out that the following is roughly twice as fast as Colin's solution above (although not as flexible), and it avoids the overhead of building a Series of ordered datetimes:
from datetime import datetime
import numpy as np
all_indices = (df['mydatetimes'] - datetime(2013, 1, 1, 0)) / np.timedelta64(1, 'h')
where df is the pandas dataframe and 'mydatetimes' is the name of the column containing the datetimes.
Timing the code yields that this solution performs 30,000 indices in:
0:00:00.009909 --> this snippet
0:00:00.017800 --> Colin's solution with ts=Series(...) and ts.loc. I have excluded the one-time overhead of building a Series from this timing
Use isin:
time_series[time_series.index.isin(['2013-01-01 11:00:00',
'2013-01-01 00:00:00',
'2013-12-30 18:00:00'])].values
# Returns: array([ 0, 11, 8730])
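Note that isin filters the ordered series, so the result comes back in index order and duplicate lookups collapse into a single row; if you need the output in the same order (and multiplicity) as your inputs, the loc and get_indexer approaches above preserve both.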
between and between_time are also useful for this kind of lookup.