I have two features of datetime type; a few records in each column have NaT values. These NaTs are meaningful, and I can fill them with 0.
What I want to do is find the difference between the two dates to create a new feature, "Time_Spent". Because of the NaT values, my code throws an error, which makes sense since I am trying to subtract NaT. I was wondering if there is an efficient way to do this in Python. Thanks in advance.
Example :
DateTime_Min          DateTime_Max          Process
2020-01-01 11:30:00   2020-01-01 11:30:30   A
2020-01-01 11:30:00   2020-01-01 11:30:20   B
NaT                   NaT                   C
2020-01-01 11:30:00   2020-01-01 11:30:30   D
What I want:
---
DateTime_Min          DateTime_Max          Process   Time_Spent (seconds)
2020-01-01 11:30:00   2020-01-01 11:30:30   A         30
2020-01-01 11:30:00   2020-01-01 11:30:20   B         20
NaT                   NaT                   C         0
2020-01-01 11:30:00   2020-01-01 11:30:30   D         30
Code
#calculating time spent on each process in seconds
def calculate_seconds(df):
    if (df['DateTime_Max'] == 0 | df['DateTime_Min'] == 0):
        df['Time_Spent'] = 0
    else:
        df['Time_Spent'] = (df['DateTime_Max'] - df['DateTime_Min']) / np.timedelta64(1, 's')
Here's a solution using fillna:
df.DateTime_Min = pd.to_datetime(df.DateTime_Min)
df.DateTime_Max = pd.to_datetime(df.DateTime_Max)
df["Time_Spent"] = (df.DateTime_Max - df.DateTime_Min).fillna(pd.Timedelta(seconds=0))
The result is:
DateTime_Min DateTime_Max Process Time_Spent
0 2020-01-01 11:30:00 2020-01-01 11:30:30 A 00:00:30
1 2020-01-01 11:30:00 2020-01-01 11:30:20 B 00:00:20
2 NaT NaT C 00:00:00
3 2020-01-01 11:30:00 2020-01-01 11:30:30 D 00:00:30
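If you want Time_Spent as a number of seconds, as in the desired output, here is a small follow-up sketch (assuming the same column names as above):
df["Time_Spent"] = df["Time_Spent"].dt.total_seconds().astype(int)
dt.total_seconds() converts the Timedelta column to seconds; the NaT rows were already turned into a zero Timedelta by the fillna above, so they end up as 0.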
Related
I have a DataFrame with relevant stock information that looks like this.
Screenshot of my dataframe
I need it so that if the 'close' from one row is different from the 'open' in the next row, a new dataframe is created storing the rows that fulfill this criterion. I would like all of the values from those rows to be saved in the new dataframe. To clarify, I would like both rows of each such pair to be stored in the new dataframe.
DataFrame as text as requested:
timestamp open high low close volume
0 2020-01-01 00:00:00 129.16 130.98 128.68 130.24 4.714333e+04
1 2020-01-01 08:00:00 130.24 132.40 129.87 132.08 5.183323e+04
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 4.579396e+04
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 6.606601e+04
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 4.849893e+04
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 9.919212e+04
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 1.276414e+05
This can be accomplished using Series.shift
>>> df['close'] != df['open'].shift(-1)
0    False
1    False
2     True
3     True
4     True
5    False
6     True
dtype: bool
This compares the close value in one row to the open value of the next row ("shifted" one row ahead).
You can then select the rows for which the condition is True.
>>> df[df['close'] != df['open'].shift(-1)]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
This only returns the first of the two rows in each pair; to also get the second, we can shift the condition down one row and combine the two conditions.
>>> row_condition = df['close'] != df['open'].shift(-1)
>>> row_after = row_condition.shift(1)
>>> df[row_condition | row_after]
timestamp open high low close volume
2 2020-01-01 16:00:00 132.08 133.05 129.74 130.77 45793.96
3 2020-01-02 00:00:00 130.72 130.78 128.69 129.26 66066.01
4 2020-01-02 08:00:00 129.23 130.28 128.90 129.59 48498.93
5 2020-01-02 16:00:00 129.58 129.78 126.38 127.19 99192.12
6 2020-01-03 00:00:00 127.19 130.15 125.88 128.86 127641.40
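A small variant on the same idea, in case the NaN introduced by shifting the boolean Series is a concern: shift accepts a fill_value argument (pandas 0.24+), which keeps the combined mask strictly boolean.
>>> row_condition = df['close'] != df['open'].shift(-1)
>>> row_after = row_condition.shift(1, fill_value=False)
>>> df[row_condition | row_after]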
Providing a textual sample of the DataFrame is useful because this can be copied directly into a Python session; I would have had to manually type the content of your screenshot otherwise.
I'm working on a timeseries dataframe which looks like this and has data from January to August 2020.
Timestamp Value
2020-01-01 00:00:00 -68.95370
2020-01-01 00:05:00 -67.90175
2020-01-01 00:10:00 -67.45966
2020-01-01 00:15:00 -67.07624
2020-01-01 00:20:00 -67.30549
.....
2020-07-01 00:00:00 -65.34212
I'm trying to apply a filter on the previous dataframe using the columns start_time and end_time in the dataframe below:
start_time end_time
2020-01-12 16:15:00 2020-01-13 16:00:00
2020-01-26 16:00:00 2020-01-26 16:10:00
2020-04-12 16:00:00 2020-04-13 16:00:00
2020-04-20 16:00:00 2020-04-21 16:00:00
2020-05-02 16:00:00 2020-05-03 16:00:00
The output should set all values that are not within any of the start/end time windows to zero and retain the values for the start and end times specified in the filter. I tried applying two simultaneous filters for start and end time, but it didn't work.
Any help would be appreciated.
The idea is to create all masks with Series.between in a list comprehension, join them with np.logical_or.reduce, and finally pass the combined mask to Series.where:
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.95370 <- changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
import numpy as np

L = [df1['Timestamp'].between(s, e) for s, e in df2[['start_time','end_time']].values]
m = np.logical_or.reduce(L)
df1['Value'] = df1['Value'].where(m, 0)
print (df1)
Timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
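Note that this assumes df1['Timestamp'] and the df2 columns already hold datetimes; if they were loaded as strings, a hedged preliminary step would be:
df1['Timestamp'] = pd.to_datetime(df1['Timestamp'])
df2[['start_time', 'end_time']] = df2[['start_time', 'end_time']].apply(pd.to_datetime)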
A solution using an outer merge (to build a cross join) and query:
print(df1)
timestamp Value <- changed Timestamp to timestamp to avoid name conflict in query
0 2020-01-13 00:00:00 -68.95370 <- also changed data for match
1 2020-01-01 00:05:00 -67.90175
2 2020-01-01 00:10:00 -67.45966
3 2020-01-01 00:15:00 -67.07624
4 2020-01-01 00:20:00 -67.30549
5 2020-07-01 00:00:00 -65.34212
# carry df1's original index through the cross join so it can be used with .loc afterwards
merged = df1.reset_index().assign(key=0).merge(df2.assign(key=0), on='key', how='outer')
matched = merged.query("timestamp >= start_time and timestamp < end_time")['index']
df1.loc[df1.index.difference(matched), "Value"] = 0
result:
timestamp Value
0 2020-01-13 00:00:00 -68.9537
1 2020-01-01 00:05:00 0.0000
2 2020-01-01 00:10:00 0.0000
3 2020-01-01 00:15:00 0.0000
4 2020-01-01 00:20:00 0.0000
5 2020-07-01 00:00:00 0.0000
The dummy key assign(key=0) is added to both dataframes to produce a Cartesian product; reset_index() keeps df1's original row labels available as the index column used above.
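As a side note, if your pandas version is 1.2 or newer (an assumption about your environment), merge supports how='cross' directly, so the dummy key is not needed:
merged = df1.reset_index().merge(df2, how='cross')
matched = merged.query("timestamp >= start_time and timestamp < end_time")['index']
df1.loc[df1.index.difference(matched), "Value"] = 0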
Is there a way to filter rows if column2 has all zeroes 10 minutes ahead of the current value in column1? How can I do this while keeping the datetime index?
2020-01-01 00:01:00 60 0
2020-01-01 00:02:00 70 0
2020-01-01 00:03:00 80 0
2020-01-01 00:04:00 70 0
2020-01-01 00:05:00 60 0
2020-01-01 00:06:00 60 0
2020-01-01 00:07:00 70 0
2020-01-01 00:08:00 80 0
2020-01-01 00:09:00 80 2
2020-01-01 00:10:00 80 0
2020-01-01 00:11:00 70 0
2020-01-01 00:12:00 70 0
2020-01-01 00:13:00 50 0
2020-01-01 00:14:00 50 0
2020-01-01 00:15:00 60 0
2020-01-01 00:16:00 60 0
2020-01-01 00:17:00 70 0
2020-01-01 00:18:00 70 0
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
2020-01-01 00:21:00 80 1
2020-01-01 00:22:00 90 2
Expected output
2020-01-01 00:19:00 80 0
2020-01-01 00:20:00 80 0
I figured it out. It's actually simple.
# sum col2 over the current row and the previous 9 rows (a 10-minute window at 1-minute frequency)
input['col3'] = input['col2'].rolling(10).sum()
output = input.loc[input['col3'] == 0]
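If "10 minutes ahead" literally means the current row plus the following 9 minutes rather than the trailing window, a hedged forward-looking variant (FixedForwardWindowIndexer requires pandas 1.1+; the col3_fwd/output_fwd names are just placeholders):
from pandas.api.indexers import FixedForwardWindowIndexer

# forward window of 10 rows (10 minutes at 1-minute frequency), starting at the current row
indexer = FixedForwardWindowIndexer(window_size=10)
input['col3_fwd'] = input['col2'].rolling(indexer, min_periods=10).sum()
output_fwd = input.loc[input['col3_fwd'] == 0]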
Just a guess, because I do not know pandas, but assuming it is a bit like SQL, LINQ, or linkable DataSets in C#: what about joining your table (A) with itself (B) over the 10-minute window, grouping by each row of A, summing column2 of B (assuming only non-negative values there), and filtering (SQL HAVING) to the rows whose sum is 0?
As a result, report A.column0, A.column1, and SUM(B.column2).
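A rough, hedged pandas translation of that self-join idea (the column names column1/column2, a DatetimeIndex at 1-minute frequency, and pandas >= 1.2 for how='cross' are all assumptions):
import pandas as pd

a = df[['column1', 'column2']].copy()
a['ts'] = df.index
pairs = a.merge(a, how='cross', suffixes=('', '_b'))            # self-join A x B
in_window = (pairs['ts_b'] >= pairs['ts'] - pd.Timedelta(minutes=9)) & (pairs['ts_b'] <= pairs['ts'])
g = pairs[in_window].groupby('ts')['column2_b']
sums = g.sum()[g.size() == 10]                                   # require a full 10-row window
result = df.loc[sums[sums == 0].index]                           # HAVING SUM(B.column2) = 0
This is quadratic in the number of rows, so the rolling answer above is preferable for anything but small frames.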
Using pandas.DataFrame.query (see the pandas.DataFrame.query documentation):
df.query(f'column_1 == {0} and column_2 == {value} or column_3 == {another_value}')
I am getting some strange behavior with np.floor and np.trunc when I am using pandas dataframes.
This is the behavior I expect, which is working:
np.floor(30.000)
Out[133]: 30.0
But you can see that when I apply it to a Series, it treats "floating integers" wrong and rounds them down by one. I used a temporary fix of adding 0.00001 to my entire dataframe, which in my case is okay because of the level I am rounding to, but I would like to know what is happening and how to do it correctly.
Edit: I specifically need the round down to zero feature.
sample_series
2020-01-01 00:00:00 26.750
2020-01-01 01:00:00 27.500
2020-01-01 02:00:00 28.250
2020-01-01 03:00:00 30.000
2020-01-01 04:00:00 30.625
2020-01-01 05:00:00 31.000
2020-01-01 06:00:00 33.375
2020-01-01 07:00:00 33.750
2020-01-01 08:00:00 34.000
In: sample_series.apply(np.floor)
Out:
2020-01-01 00:00:00 26.0
2020-01-01 01:00:00 27.0
2020-01-01 02:00:00 28.0
2020-01-01 03:00:00 29.0
2020-01-01 04:00:00 30.0
2020-01-01 05:00:00 30.0
2020-01-01 06:00:00 33.0
2020-01-01 07:00:00 33.0
2020-01-01 08:00:00 33.0
2020-01-01 09:00:00 33.0
2020-01-01 10:00:00 34.0
You may be able to use the around() function:
np.around([0.37, 1.64], decimals=1)
array([0.4, 1.6])
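What is most likely happening in the question is not floor misbehaving but floating-point representation: a value that prints as 30.000 can actually be stored as something slightly below 30 (for example after arithmetic or a float32 conversion), and np.floor then correctly drops it to 29.0. A hedged sketch of the symptom and of rounding before flooring (the tolerance of 6 decimals is an assumption):
import numpy as np
import pandas as pd

s = pd.Series([26.750, 29.999999999])   # the second value displays as 30.0 at low precision
print(np.floor(s))                      # -> 26.0, 29.0
print(np.floor(s.round(6)))             # -> 26.0, 30.0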
I have a df as follows:
id dates values tz
1 2020-01-01 00:15:00 87.8 +01
2 2020-01-01 00:30:00 88.3 +01
3 2020-01-01 00:45:00 89.0 +01
4 2020-01-01 01:00:00 90.1 +01
5 2020-01-01 01:15:00 91.3 +01
6 2020-01-01 01:30:00 92.4 +01
7 2020-01-01 01:45:00 92.9 +01
8 2020-01-01 02:00:00 92.5 +01
9 2020-01-01 02:15:00 91.0 +01
10 2020-01-01 02:30:00 88.7 +01
11 2020-01-01 02:45:00 86.4 +01
12 2020-01-01 03:00:00 84.7 +01
What I would like to do is club every 4 rows (based on the id column), add up the values in the values column, and assign the sum to the date whose timestamp has minutes equal to 00.
Example:
id dates values tz
1 2020-01-01 00:15:00 87.8 +01
2 2020-01-01 00:30:00 88.3 +01
3 2020-01-01 00:45:00 89.0 +01
4 2020-01-01 01:00:00 90.1 +01
When I club the first 4 values, the output should be as follows:
id dates values tz
1 2020-01-01 01:00:00 355.2 +01 <--- (87.8+88.3+89.0+90.1 = 355.2)
and similarly for the other rows as well.
The desired output:
id dates values tz
1 2020-01-01 01:00:00 355.2 +01 <--- (87.8+88.3+89.0+90.1 = 355.2)
2 2020-01-01 02:00:00 369.1 +01 <--- (91.3+92.4+92.9+92.5 = 369.1)
3 2020-01-01 03:00:00 350.8 +01 <--- (91.0+88.7+86.4+84.7 = 350.8)
How can this be done?
I think it is possible here to aggregate every 4 rows by grouping on np.arange over the length of the DataFrame, summing values and taking the last dates and tz per group with GroupBy.agg:
import numpy as np

df = df.groupby(np.arange(len(df)) // 4).agg({'dates':'last','values':'sum', 'tz':'last'})
print (df)
dates values tz
0 2020-01-01 01:00:00 355.2 1
1 2020-01-01 02:00:00 369.1 1
2 2020-01-01 03:00:00 350.8 1
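If the dates column is a regular 15-minute datetime series (an assumption beyond what the question shows), an alternative sketch uses resample, labelling each hourly bin by its right edge so the sum lands on the timestamp whose minutes are 00:
df['dates'] = pd.to_datetime(df['dates'])
out = (df.set_index('dates')
         .resample('H', closed='right', label='right')
         .agg({'values': 'sum', 'tz': 'last'})
         .reset_index())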