I have a pandas dataframe indexed by DateTime, from hour "00:00:00" until hour "23:59:00" (one-minute increments; seconds are not used).
in: df.index
out: DatetimeIndex(['2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
'2018-10-08 00:00:00', '2018-10-08 00:00:00',
...
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 23:59:00', '2018-10-08 23:59:00',
'2018-10-08 05:16:00', '2018-10-08 07:08:00',
'2018-10-08 13:58:00', '2018-10-08 09:30:00'],
dtype='datetime64[ns]', name='DateTime', length=91846, freq=None)
Now I want to choose a specific interval, say every 1 minute or every 1 hour, starting from "00:00:00", and retrieve all the rows that are that interval apart, consecutively.
I can grab entire intervals, say the first hour, with
df.between_time("00:00:00", "01:00:00")
But I want to be able to
(a) get only the times that are a specific interval apart, and
(b) get all the 1-hour intervals without having to ask for them manually 24 times. How do I increment the DatetimeIndex inside the between_time call? Is there a better way?
I would solve this problem with masking rather than making new dataframes. For example, you can add a column df['which_one'] and set a different number for each subset. Then you can access a subset by calling df[df['which_one']==x], where x is the subset you want to select. You can still use conditional statements and just about everything else that pandas has to offer by accessing the data this way; a sketch follows.
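A minimal sketch of the idea, assuming df is indexed by the DatetimeIndex shown in the question; using the hour of each timestamp as the subset label is my choice here, and it also covers part (b) without 24 manual calls:
# Label each row with the hour of its timestamp: one number per
# one-hour interval, 0 through 23.
df['which_one'] = df.index.hour

# All rows from the 05:00-05:59 interval:
hour_five = df[df['which_one'] == 5]

# Part (a): rows that are an exact multiple of some interval apart,
# e.g. every 15 minutes starting from "00:00:00":
every_15_min = df[df.index.minute % 15 == 0]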
P.S. There are other methods of accessing the data that might be faster; I just used what I'm most comfortable with. Another way would be df[df['which_one'].eq(x)].
If you are dead set on separate dataframes, I would suggest a dictionary of dataframes, such as:
import pandas as pd

dfdict = {}
for i in range(10):
    dfdict[i] = pd.DataFrame()
print(dfdict)
As you will see, they are indeed DataFrames:
out[1]
{0: Empty DataFrame
Columns: []
Index: [], 1: Empty DataFrame
Columns: []
Index: [], 2: Empty DataFrame
Columns: []
Index: [], 3: Empty DataFrame
Columns: []
Index: [], 4: Empty DataFrame
Columns: []
Index: [], 5: Empty DataFrame
Columns: []
Index: [], 6: Empty DataFrame
Columns: []
Index: [], 7: Empty DataFrame
Columns: []
Index: [], 8: Empty DataFrame
Columns: []
Index: [], 9: Empty DataFrame
Columns: []
Index: []}
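Applied to the question, the dictionary could hold one hourly subset per key; a minimal sketch, assuming df from the question:
# Hypothetical split of df into 24 hourly frames, keyed by hour.
dfdict = {}
for hour in range(24):
    dfdict[hour] = df[df.index.hour == hour]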
Although, as others have suggested, there might be a more practical approach to your problem (difficult to say without more specifics of the issue).
I have a range of dates in the date column of a dataframe. The dates are scattered, e.g. 1st Feb, 5th Feb, 11th Feb, etc.
I want to use pd.date_range with frequency one minute on every date in this column. So my start argument will be date and the end argument will be date + datetime.timedelta(days=1).
I'm struggling with using the apply function for this. Can someone help me with it, or can I use some other function here?
I don't want to use a for loop because the length of my dates will be HUGE.
I tried this :
df.date.apply(lamda x : pd.date_range(start=df['date'],end = df['date']+datetime.timedelta(days=1),freq="1min"),axis =1)
but I'm getting an error.
Thanks in advance
Use x in the lambda function instead of df['date'] and remove axis=1 (Series.apply has no axis argument); note also the lamda typo:
df = pd.DataFrame({'date':pd.date_range('2021-11-26', periods=3)})
print (df)
date
0 2021-11-26
1 2021-11-27
2 2021-11-28
s = df['date'].apply(lambda x:pd.date_range(start=x,end=x+pd.Timedelta(days=1),freq="1min"))
print (s)
0 DatetimeIndex(['2021-11-26 00:00:00', '2021-11...
1 DatetimeIndex(['2021-11-27 00:00:00', '2021-11...
2 DatetimeIndex(['2021-11-28 00:00:00', '2021-11...
Name: date, dtype: object
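Each element of s is itself a whole DatetimeIndex. If the goal is one row per minute instead, a follow-up with Series.explode (available since pandas 0.25) flattens it; a small sketch:
# Flatten the per-date DatetimeIndex objects into one long Series,
# one row per minute:
flat = s.explode()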
With pandas.date_range one can easily create time columns for a dataframe, such as
df["Data"] = pd.date_range(start='2020-01-01 00:00', end='2020-01-05 02:00', freq='H')
which produces a column of hourly timestamps.
I wonder if it is possible to use date_range to create, within the range defined above, 10 entries for each of the time periods. In other words, 10 cells with 2020-01-01 00:00:00, 10 cells with 2020-01-01 01:00:00, and so on.
If not, how should one do that?
Try np.repeat, which repeats each element of the range 10 times consecutively:
import numpy as np

df['Data'] = np.repeat(pd.date_range(start='2020-01-01 00:00',
                                     end='2020-01-05 02:00', freq='H'),
                       10)
You can also do it with a list comprehension, as below. Note the loop order: iterating over the date_range in the outer loop and the repeat count in the inner loop keeps the 10 copies of each timestamp consecutive:
pd.DataFrame({"Data": [d
                       for d in pd.date_range(start='2020-01-01 00:00', end='2020-01-05 02:00', freq='H')
                       for i in range(10)]})
I have a dataframe which includes two columns with a min_peak and a max_peak value.
I am attempting to filter the index ('Date') values, which are timestamps, between the two peaks.
I would like to allocate a value of 0 to all dates that are greater than the min_peak but less than the max_peak, and a value of 1 otherwise.
Date
2019-02-02 0.3985
2019-09-24 1.4612
2019-12-18 1.5996
2020-03-12 0.0001
Name: min_peak, dtype: float64
Date
2019-07-03 3.4769
2019-11-14 2.9666
2020-03-05 4.6239
2020-06-09 4.3605
Name: max_peak, dtype: float64
I have a list of the zipped dates for the min_peak and max_peak columns but am not sure how to filter my dataframe using the values.
[(Timestamp('2019-02-02 00:00:00'), Timestamp('2019-07-03 00:00:00')), (Timestamp('2019-09-24 00:00:00'), Timestamp('2019-11-14 00:00:00')), (Timestamp('2019-12-18 00:00:00'), Timestamp('2020-03-05 00:00:00')), (Timestamp('2020-03-12 00:00:00'), Timestamp('2020-06-09 00:00:00'))]
As an example, I would filter my dataframe based on the first two peaks, '2019-02-02 00:00:00' and '2019-07-03 00:00:00': all index values greater than '2019-02-02 00:00:00' but less than '2019-07-03 00:00:00' would equal 0.
All values after '2019-07-03 00:00:00' but less than '2019-09-24 00:00:00' would equal 1.
I have tried using the loc method and df.index.isin, but without success.
IIUC you want to set a new column (flag in my example) to 1 if the index (Date) is within any of the tuples from the list. You can use an IntervalIndex and get_indexer, which will return the position (>= 0) in the interval index, or -1 if the date isn't in any interval of the index.
Example:
import pandas as pd
from pandas import Timestamp
#make sample data
df = pd.DataFrame(index=pd.date_range('2019-01-01', '2020-06-15', freq='W'))
df['flag'] = 0
#make IntervalIndex
l = [(Timestamp('2019-02-02 00:00:00'), Timestamp('2019-07-03 00:00:00')), (Timestamp('2019-09-24 00:00:00'), Timestamp('2019-11-14 00:00:00')), (Timestamp('2019-12-18 00:00:00'), Timestamp('2020-03-05 00:00:00')), (Timestamp('2020-03-12 00:00:00'), Timestamp('2020-06-09 00:00:00'))]
idx = pd.IntervalIndex.from_tuples(l)
#set flag to 1 for all index values within given intervals
df.loc[idx.get_indexer(df.index)>=0, 'flag'] = 1
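To match the 0/1 convention from the question (0 inside a peak-to-peak range, 1 outside), simply flip the two values:
# Same idea with the values flipped: start at 1 everywhere and
# set 0 inside the intervals.
df['flag'] = 1
df.loc[idx.get_indexer(df.index) >= 0, 'flag'] = 0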
I need to generate df_Result_Sensor automatically.
I would like the dataframe (df_Result_Sensor) to receive only the df_Sensor rows whose 'TimeStamp' value is not contained in any of the ranges defined by df_Message['date init'] and df_Message['date end'].
#In the code example, I wrote a df_Result_Sensor manually, just to illustrate the desired output:
TimeStamp Sensor_one Sensor_two
0 2017-05-20 00:00:00 1 1
1 2017-04-13 00:00:00 1 1
2 2017-09-10 00:00:00 0 1
import pandas as pd
df_Sensor = pd.DataFrame({'TimeStamp' : ['2017-05-25 00:00:00','2017-05-20 00:00:00', '2017-04-13 00:00:00', '2017-08-29 01:15:12', '2017-08-15 02:15:12', '2017-09-10 00:00:00'], 'Sensor_one': [1,1,1,1,1,0], 'Sensor_two': [1,1,1,0,1,1]})
df_Message = pd.DataFrame({'date init': ['2017-05-22 00:00:00', '2017-08-14 00:00:10'], 'date end': ['2017-05-26 00:00:05', '2017-09-01 02:10:05'], 'Message': ['Cold', 'Cold']})
just to illustrate the desired output:
df_Result_Sensor = pd.DataFrame({'TimeStamp' : ['2017-05-20 00:00:00', '2017-04-13 00:00:00', '2017-09-10 00:00:00'], 'Sensor_one': [1,1,0], 'Sensor_two': [1,1,1]})
This will work; make sure your date columns are converted to datetime before doing date comparisons:
df_Message["date init"] = pd.to_datetime(df_Message["date init"])
df_Message['date end'] = pd.to_datetime(df_Message['date end'])
df_Sensor["TimeStamp"] = pd.to_datetime(df_Sensor["TimeStamp"])
df_Sensor_ = df_Sensor.copy()
for index, row in df_Message.iterrows():
    df_Sensor_ = df_Sensor_[~((df_Sensor_["TimeStamp"] > row['date init']) & (df_Sensor_["TimeStamp"] < row['date end']))]
df_Result_Sensor = df_Sensor_
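A loop-free alternative, assuming the same frames and reusing the IntervalIndex idea from the earlier answer (closed='neither' mirrors the strict inequalities of the loop):
# Build one interval per message row, then keep the sensor rows whose
# timestamp falls in none of them (get_indexer returns -1 for those).
intervals = pd.IntervalIndex.from_arrays(df_Message['date init'],
                                         df_Message['date end'],
                                         closed='neither')
df_Result_Sensor = df_Sensor[intervals.get_indexer(df_Sensor['TimeStamp']) == -1]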
Please consider the following reproducible dataframe as an example:
import pandas as pd
import numpy as np
from datetime import datetime
list_dates = ['2018-01-05',
'2019-01-01',
'2019-01-02',
'2019-01-05',
'2019-01-08',
'2019-01-22']
index = []
for i in list_dates:
    tp = datetime.strptime(i, "%Y-%m-%d")
    index.append(tp)
data = np.array([np.arange(6)]*3).T
columns = ['A','B', 'C']
df = pd.DataFrame(data, index = index, columns=columns)
df['D']= ['Loc1', 'Loc1', 'Loc2', 'Loc2', 'Loc4', 'Loc3']
df['E'] = [0.1, 1, 10, 100, 1000, 10000]
Then, I try to create a new dataframe df2 by resampling the above dataset, so I have all daily dates between 2018-01-05 (first date in list_dates) and 2019-01-22 (last date in list_dates). When doing the resampling, I basically create new rows in my dataframe for which I don't have any data.
These new rows should simply be copies of their last known value. So for example, in my example dataframe above, I have data for 2018-01-05, but not for 2018-01-06 until 2018-12-31. All these rows should be filled with the values of the previous / last known value (= the row of 2018-01-05).
I tried doing that using:
df2 = df.resample('D').last()
However, this doesn't work. Instead I get the full range of dates from 2018-01-05 until 2019-01-22, where all new rows (that were not in the original dataframe df) contain only NaN values.
What am I missing? Any suggestions on how I can fix this?
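A likely fix, assuming forward-filling is the goal: resample supports ffill directly, which propagates the last known row into every newly created date. A minimal sketch:
# Upsample to daily frequency and forward-fill the gaps:
df2 = df.resample('D').ffill()

# Equivalent here, and closer to the original attempt:
df2 = df.resample('D').last().ffill()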