With the dataset df below, I plotted a graph that looks like the following:
df
Time Temperature
8:23:04 18.5
8:23:04 19
9:12:57 19
9:12:57 20
9:12:58 20
9:12:58 21
9:12:59 21
9:12:59 23
9:13:00 23
9:13:00 25
9:13:01 25
9:13:01 27
9:13:02 27
9:13:02 28
9:13:03 28
Graph (overall)
When zooming in on the data, we can see more detail:
Graph (zoomed in)
I would like to count the number of activations of this temperature measurement device; each activation causes the temperature to increase sharply. I have defined an activation as follows:
Let T0, T1, T2, T3 be the temperatures at times t0, t1, t2, t3, and let d0 = T1 - T0, d1 = T2 - T1, d2 = T3 - T2, ... be the differences between adjacent values.
If
1) d0 ≥ 0 and d1 ≥ 0 and d2 ≥ 0, and
2) T2 - T0 > max(d0, d1, d2), and
3) the elapsed time t2 - t0 is less than 30 seconds,
then it is considered an activation. I want to count how many activations there are in total. What's a good way to do this?
Thanks.
There could be a number of different, valid answers depending on how a spike is defined.
Assuming you just want the indices where the temperature increases significantly, one simple method is to look for large jumps in value above some threshold. The threshold can be calculated from the mean absolute difference of the data, which gives a rough approximation of where the significant variations in value occur. Here's a basic implementation:
import numpy as np
# Data
x = np.array([0, 1, 2, 50, 51, 52, 53, 100, 99, 98, 97, 96, 10, 9, 8, 80])
# Data diff
xdiff = x[1:] - x[0:-1]
# Mean absolute change
xdiff_mean = np.abs(xdiff).mean()
# Flag differences that exceed the mean absolute change (plus a small margin)
spikes = xdiff > xdiff_mean + 1
print(x[1:][spikes]) # prints 50, 100, 80
print(np.where(spikes)[0]+1) # prints 3, 7, 15
You could also use outlier rejection, which would be more robust than this basic comparison to the mean difference. There are lots of answers on how to do that:
Can scipy.stats identify and mask obvious outliers?
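For completeness, here is a rough sketch that counts activations using the literal definition from the question (hedged assumptions: the 30-second condition is read as elapsed time t2 - t0, overlapping windows are each counted, and df has the Time/Temperature columns shown in the question):
import pandas as pd
t = pd.to_timedelta(df['Time'])        # sample times as timedeltas
temp = df['Temperature'].to_numpy()
activations = 0
for i in range(len(temp) - 3):
    d0 = temp[i + 1] - temp[i]
    d1 = temp[i + 2] - temp[i + 1]
    d2 = temp[i + 3] - temp[i + 2]
    rising = d0 >= 0 and d1 >= 0 and d2 >= 0                       # condition 1
    big_jump = (temp[i + 2] - temp[i]) > max(d0, d1, d2)           # condition 2
    fast = (t.iloc[i + 2] - t.iloc[i]) < pd.Timedelta(seconds=30)  # condition 3 (read as elapsed time)
    if rising and big_jump and fast:
        activations += 1
print(activations)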
Related
I have this dataframe:
sales
0 22.000000
1 25.000000
2 22.000000
3 18.000000
4 23.000000
and want to 'forecast' the next value as a simple number that I can easily pass on to other code. I thought numpy.polyfit could do that, but I'm not sure how. I am just after something simple.
If numpy.polyfit can do a curve fit, the curve can be continued to the next x value. This would give a forecast like in Excel. That's good for me.
Unfortunately, numpy.org doesn't say how forecasting is possible, hence asking here :)
Forecasting can be done by using np.poly1d(z) as described in the documentation.
import numpy as np
import pandas as pd
n = 10   # number of values you would like to forecast
deg = 1  # degree of polynomial
df = pd.DataFrame([22, 25, 22, 18, 23], columns=['Sales'])
z = np.polyfit(df.index, df.Sales, deg)  # fit polynomial coefficients
p = np.poly1d(z)                         # polynomial that can be evaluated at new x values
# append the forecasted values to the dataframe
for row in range(len(df), len(df) + n):
    df.loc[row, 'Sales'] = p(row)
You can change n and deg as you like. But remember, you only have 5 observations so deg must be 4 or lower.
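As a quick sanity check (assuming deg=1 as above): a least-squares line through these five points is roughly Sales ≈ 23 - 0.5*index, so the first forecasted value p(5) should be about 20.5:
print(p(5))         # ~20.5, the first forecasted value
print(df.tail(n))   # the n forecasted rows appended above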
I have an example df like df = pd.DataFrame({'price': [100, 101, 99, 95, 97, 88], 'qty': [12, 5, 1, 3, 1, 3]}). For each row I want the quantity-weighted average price over the last 5 units of qty (i.e. a rolling sum(price * qty) / sum(qty) over the trailing 5 units), and the desired output is 100, 101, 100.6, 97, 96.2, 91.2.
Unfortunately I don't have a good way to calculate this at the moment. I have a slow approach that gets close: calculate the cumulative sum of qty and then df.qty_cumsum[(df.qty_cumsum <= x.qty_cumsum - 5)].argmax(), which returns the position of the largest qty_cumsum that is at least 5 below the current one; I can then use this to calculate the weighted average in a second step.
Thanks
One option is to repeat each price qty times, take a plain row-based rolling(5) mean over the repeated values, and then group by the original index, keeping the last value in each group:
np.repeat(df['price'], df['qty']).rolling(5).mean().groupby(level=0).last()
Output:
0 100.0
1 101.0
2 100.6
3 97.0
4 96.2
5 91.2
Name: price, dtype: float64
P.S. And if you have large qty values, it would also probably make sense to make it more efficient by clipping qty to 5 (since there is no difference if it's 5 or 12, for example):
(np.repeat(df['price'], np.clip(df['qty'], 0, 5))
   .rolling(5).mean()
   .groupby(level=0).last())
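For reference, here is a self-contained sketch of the above (combining the repeat and the clipping, assuming only pandas and numpy and the example frame from the question):
import numpy as np
import pandas as pd

df = pd.DataFrame({'price': [100, 101, 99, 95, 97, 88],
                   'qty':   [12, 5, 1, 3, 1, 3]})

# Expand each price into one row per unit of qty (capped at 5, since only the
# last 5 units matter); np.repeat keeps the original row label on every copy,
# so a plain rolling(5) mean is the qty-weighted average of the last 5 units.
expanded = np.repeat(df['price'], np.clip(df['qty'], 0, 5))
result = expanded.rolling(5).mean().groupby(level=0).last()
print(result)   # 100.0, 101.0, 100.6, 97.0, 96.2, 91.2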
I have dataset df as below:
Time Temperature
17:29:33 18
8:23:04 18.5
8:23:04 19
9:12:57 19
9:12:57 20
9:12:58 20
9:12:58 21
9:12:59 21
9:12:59 23
9:13:00 23
9:13:00 25
9:13:01 25
9:13:01 27
9:13:02 27
9:13:02 28
9:13:03 28
which continuously records temperature data: a new reading is logged whenever the temperature changes by more than 0.5°C.
I want to calculate the total time during which the temperature is between 25°C and 40°C (i.e. if a spike exceeds 40°C, the corresponding time is not counted). How can I do this in Python?
Edit:
Below is a plot for better illustration of the dataset.
Thanks.
Since the temperature can move into and out of the 25-40 range several times, we need to calculate the duration of each contiguous in-range interval separately, so I use DataFrame.groupby here:
l = 25
h = 40
measure_range = df['Temperature'].between(l, h)   # rows inside the l-h band
df_range = df.loc[measure_range]
groups = (~measure_range).cumsum()                # counter changes only on out-of-range rows,
                                                  # so each in-range run shares one label
intervals_df = (pd.to_datetime(df_range['Time'].astype(str))
                  .groupby(groups)
                  .agg(['first', 'last'])
                  .reset_index(drop=True)
                  .assign(Total_time=lambda x: x.diff(axis=1).iloc[:, -1],
                          first=lambda x: x['first'].dt.time,
                          last=lambda x: x['last'].dt.time))
print(intervals_df)
first last Total_time
0 09:13:00 09:13:03 00:00:03
This way, one row is generated in the result for each time interval during which the temperature stays between l and h continuously.
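If a single grand total across all in-range intervals is wanted (which is what the question asks for), the per-interval durations can simply be summed, assuming the intervals_df built above:
total_duration = intervals_df['Total_time'].sum()
print(total_duration)   # Timedelta('0 days 00:00:03') for the sample data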
Do it step by step; numpy.ptp ("peak to peak") computes the difference between the max and the min:
df.Time = pd.to_timedelta(df.Time)   # make sure Time is a timedelta, not a string
s = df.Temperature.between(25,40)
out = df[s].groupby((~s).cumsum()).Time.agg(['min', 'max', np.ptp])
min max ptp
Temperature
10 09:13:00 09:13:03 00:00:03
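And if a single total is needed rather than one row per interval (an assumption about what the question wants), the ptp column of the result above can be summed:
print(out['ptp'].sum())   # Timedelta('0 days 00:00:03') for the sample data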
Make sure the time column is in the right format.
df['time'] = pd.to_timedelta(df['time'])   # the values are "HH:MM:SS" strings, so no unit is needed
Get the time when the temp reaches 40 (tail(1) gives you the most recent such reading; you can use head(1) if you need the first one instead). Reset the index so the two values can be subtracted later.
temp_40 = df[df['temp'] == 40]['time'].tail(1)
temp_40 = temp_40.reset_index(drop = True)
Similarly, get the time when the temp reached 25.
temp_25 = df[df['temp'] == 25]['time'].tail(1)
temp_25 = temp_25.reset_index(drop = True)
Now get the diff
temp_40 - temp_25
What I hope to do is be able to divide a value in a 1 dimensional numpy array by the following value. For example, I have an array that looks like this.
[ 0 20 23 25 27 28 29 30 30 22 20 19 19 19 19 18 18 19 19 19 19 19 ]
I want to do this:
0/20 #0th value divided by 1st value
20/23 #1st value divided by 2nd value
23/25 #2nd value divided by 3rd value
25/27 #3rd value divided by 4th value
etc...
I can easily do it through a loop, however I was wondering if there is a more efficient way of doing this with numpy operations.
Get two slices - one from the start to the second-to-last element, another from the second element to the last - and perform element-wise division -
a[:-1]/a[1:]
To get floating point divisions -
np.true_divide(a[:-1],a[1:])
Or put from __future__ import division at the top (needed on Python 2 only; on Python 3, / already performs true division) and then use a[:-1]/a[1:].
Being views into the input array, these slices are accessed very efficiently for the element-wise division.
Sample run -
In [56]: a # Input array
Out[56]: array([96, 81, 48, 53, 18, 92, 79, 43, 13, 69])
In [57]: from __future__ import division
In [58]: a[:-1]/a[1:]
Out[58]:
array([ 1.18518519, 1.6875 , 0.90566038, 2.94444444, 0.19565217,
1.16455696, 1.8372093 , 3.30769231, 0.1884058 ])
In [59]: a[0]/a[1]
Out[59]: 1.1851851851851851
In [60]: a[1]/a[2]
Out[60]: 1.6875
In [61]: a[2]/a[3]
Out[61]: 0.90566037735849059
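Applied to the array from the question (a quick sketch, using Python 3's true division):
import numpy as np

a = np.array([0, 20, 23, 25, 27, 28, 29, 30, 30, 22, 20,
              19, 19, 19, 19, 18, 18, 19, 19, 19, 19, 19])
ratios = a[:-1] / a[1:]    # 0/20, 20/23, 23/25, 25/27, ...
print(ratios[:4])          # 0.0, ~0.8696, 0.92, ~0.9259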
I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger data set. What I am trying to do is create bins of this data set, e.g. quintiles:
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears fine, as expected, but it isn't.
In fact, I have 1569 rows. The nearest number divisible by 5 is 1565, which should give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect to have 4 bins with 314 observations and one with 313 observations. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
qcut is trying to compensate for repeated values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1 to 100 scale. Any reason not to just run qcut directly on the data instead of the percentiles, or to return percentiles that have greater precision?
Just try the code below:
pd.qcut(df.value_rank.rank(method='first'), nbins)
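For illustration (a sketch on the 10-value sample from the question): ranking with method='first' breaks the ties, so qcut can then form equal-sized bins:
import pandas as pd

test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53], name='value_rank')
ranked = test_series.rank(method='first')    # ties broken by order of appearance
print(pd.qcut(ranked, 5, labels=False).value_counts().sort_index())
# each of the 5 bins now holds exactly 2 observations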
If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So, now we have two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion because results can vary based on your data; for instance, how large your data set is and/or how the values are spaced. The above "trick" works well for integers because even though we are "salting" the test_list, it will still rank-order in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of your noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
This problem arises from duplicate values. A possible solution to force equal sized bins is to use the index as the input for pd.qcut after sorting the dataframe:
import random
import pandas as pd

df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']          # drop the old, now-shuffled index column
df = df.reset_index()    # add a clean 0..19 'index' column to cut on
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-
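Since the linked images are not reproduced here, a quick way to see the effect in text (with 20 rows and q=4, cutting on the unique index gives exactly 5 rows per bin):
print(df.groupby('qcut_v2').size())   # 5 rows in each of the 4 bins
print(df.groupby('qcut_v1').size())   # bin sizes can be unequal when 'A' has duplicates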