I want to divide each value in a 1-dimensional numpy array by the value that follows it. For example, I have an array that looks like this.
[ 0 20 23 25 27 28 29 30 30 22 20 19 19 19 19 18 18 19 19 19 19 19 ]
I want to do this:
0/20 #0th value divided by 1st value
20/23 #1st value divided by 2nd value
23/25 #2nd value divided by 3rd value
25/27 #3rd value divided by 4th value
etc...
I can easily do this with a loop, but I was wondering whether there is a more efficient way using numpy operations.
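For reference, the loop version I have in mind is roughly this (a minimal sketch; the array values are copied from above):
import numpy as np
a = np.array([0, 20, 23, 25, 27, 28, 29, 30, 30, 22, 20, 19, 19, 19, 19, 18, 18, 19, 19, 19, 19, 19])
# Divide each element by the element that follows it
ratios = [a[i] / a[i + 1] for i in range(len(a) - 1)]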
Take two slices, one from the start to the last-but-one element and another from the second element to the end, and perform element-wise division:
a[:-1]/a[1:]
To get floating point divisions -
np.true_divide(a[:-1],a[1:])
Or put from __future__ import division at the top (on Python 2) and then use a[:-1]/a[1:].
Because these slices are views into the input array, they are accessed very efficiently for the element-wise division.
Sample run -
In [56]: a # Input array
Out[56]: array([96, 81, 48, 53, 18, 92, 79, 43, 13, 69])
In [57]: from __future__ import division
In [58]: a[:-1]/a[1:]
Out[58]:
array([ 1.18518519, 1.6875 , 0.90566038, 2.94444444, 0.19565217,
1.16455696, 1.8372093 , 3.30769231, 0.1884058 ])
In [59]: a[0]/a[1]
Out[59]: 1.1851851851851851
In [60]: a[1]/a[2]
Out[60]: 1.6875
In [61]: a[2]/a[3]
Out[61]: 0.90566037735849059
With dataset df I plotted a graph looking like the following:
df
Time Temperature
8:23:04 18.5
8:23:04 19
9:12:57 19
9:12:57 20
9:12:58 20
9:12:58 21
9:12:59 21
9:12:59 23
9:13:00 23
9:13:00 25
9:13:01 25
9:13:01 27
9:13:02 27
9:13:02 28
9:13:03 28
Graph (overall)
When zooming in on the data, we can see more detail:
I would like to count the number of activations of this temperature measurement device; each activation makes the temperature increase drastically. I have defined an activation as follows:
Let T0, T1, T2, T3 be the temperatures at times t=0, t=1, t=2, t=3, and let d0 = T1-T0, d1 = T2-T1, d2 = T3-T2, ... be the differences of adjacent values.
If
1) d0 ≥ 0 and d1 ≥ 0 and d2 ≥ 0, and
2) T2- T0 > max(d0, d1, d2), and
3) the elapsed time between T0 and T2 is less than 30 seconds,
then it is considered an activation. I want to count how many activations there are in total. What's a good way to do this?
Thanks.
There could be a number of different, valid answers depending on how a spike is defined.
Assuming you just want the indices where the temperature increases significantly, one simple method is to look for large jumps in value above some threshold. The threshold can be calculated from the mean difference of the data, which gives a rough approximation of where the significant variations occur. Here's a basic implementation:
import numpy as np
# Data
x = np.array([0, 1, 2, 50, 51, 52, 53, 100, 99, 98, 97, 96, 10, 9, 8, 80])
# Data diff
xdiff = x[1:] - x[0:-1]
# Find mean change
xdiff_mean = np.abs(xdiff).mean()
# Flag differences well above the mean (mean + 1 used as a simple threshold)
spikes = xdiff > abs(xdiff_mean)+1
print(x[1:][spikes]) # prints 50, 100, 80
print(np.where(spikes)[0]+1) # prints 3, 7, 15
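If you then want to count activations rather than just flag individual spike points, a minimal sketch (assuming an activation corresponds to a run of consecutive spike indices) could be:
spike_idx = np.where(spikes)[0] + 1  # indices flagged above, e.g. array([3, 7, 15])
# Each gap larger than 1 between consecutive spike indices starts a new activation
n_activations = 0 if spike_idx.size == 0 else 1 + np.count_nonzero(np.diff(spike_idx) > 1)
print(n_activations)  # prints 3 for this toy data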
You could also look into outlier rejection, which would be much cleverer than this basic comparison to the mean difference. There are lots of answers on how to do that:
Can scipy.stats identify and mask obvious outliers?
Sorry if this is a duplicate post; I couldn't find a related post, though.
import numpy as np
import pandas as pd
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
What I'd like is to group P by the quartiles/quantiles/deciles/etc. of column A and then calculate an aggregate statistic (such as the mean) by group. I can get the deciles of the column as
P['A'].quantile(np.arange(10) / 10)
I'm not sure how to groupby the deciles of A. Thanks in advance!
If you want to group P e.g. by quartiles, run:
gr = P.groupby(pd.qcut(P.A, 4, labels=False))
Then you can perform any operations on these groups.
For presentation, below is a printout of the groups, with P limited to its first 20 rows:
for key, grp in gr:
    print(f'\nGroup: {key}\n{grp}')
which gives:
Group: 0
A B
0 8 24
3 10 94
10 9 93
15 4 91
17 7 49
Group: 1
A B
7 34 24
8 15 60
12 27 4
13 31 1
14 13 83
Group: 2
A B
4 52 98
5 53 66
9 58 16
16 59 67
18 47 65
Group: 3
A B
1 67 87
2 79 48
6 98 14
11 86 2
19 61 14
As you can see, each group (quartile) has 5 members, so the grouping is
correct.
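For the statistic asked about in the question, e.g. the mean of B within each quartile of A, you could then run (using the gr object from above):
gr['B'].mean()
# or several statistics at once
gr['B'].agg(['mean', 'count'])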
As a supplement
If you are interested in borders of each quartile, run:
pd.qcut(P.A, 4, labels=False, retbins=True)[1]
With retbins=True, qcut returns two results (a tuple). The first element (index 0) is the result returned before, but this time we are interested in the second element (index 1): the bin borders.
For your data they are:
array([ 4. , 12.25, 40.5 , 59.5 , 98. ])
So e.g. the first quartile covers values between 4 and 12.25.
You can use the quantile Series to make another column marking each row with its quantile label, and then group by that column. numpy's searchsorted is very useful for this:
import numpy as np
import pandas as pd
from random import seed
seed(100)
P = pd.DataFrame(np.random.randint(0, 100, size=(1000, 2)), columns=list('AB'))
q = P['A'].quantile(np.arange(10) / 10)
P['G'] = P['A'].apply(lambda x : q.index[np.searchsorted(q, x, side='right')-1])
Since the quantile Series stores the lower bound of each quantile interval, be sure to pass side='right' to np.searchsorted; otherwise the minimum value would map to position 0 instead of 1, and after subtracting 1 you would land one index too low.
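A tiny illustration of why side='right' matters (the edge values here are hypothetical, not taken from the data above):
edges = np.array([0, 10, 20, 30])  # lower bounds of four hypothetical bins
np.searchsorted(edges, 0, side='left')    # -> 0; minus 1 would give -1 (wrong bin)
np.searchsorted(edges, 0, side='right')   # -> 1; minus 1 gives 0 (the first bin)
np.searchsorted(edges, 10, side='right')  # -> 2; minus 1 gives 1 (the second bin)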
Now you can elaborate your statistics by doing, for example:
P.groupby('G').agg(['sum', 'mean'])  # add to the list any other statistics you wish
I have a pandas dataframe like below:
Coordinate
1 (1150.0,1760.0)
28 (1260.0,1910.0)
6 (1030.0,2070.0)
12 (1170.0,2300.0)
9 (790.0,2260.0)
5 (750.0,2030.0)
26 (490.0,2130.0)
29 (360.0,1980.0)
3 (40.0,2090.0)
2 (630.0,1660.0)
20 (590.0,1390.0)
Now I want to create a new column 'dotProduct' by applying the formula
np.dot((b-a), (b-c)), where b is the coordinate (1260.0, 1910.0) at index 28, a is the coordinate of the previous row (index 1), and c is the coordinate of the next row (index 6), i.e. (1030.0, 2070.0). That calculated product belongs to the second row. In other words, for each row I need the previous and the next row's value as well, and I have to do this for the entire 'Coordinate' column. I am quite new to pandas and still learning, so please guide me a bit.
Thanks a lot for the help.
I assume that your 'Coordinate' column elements are already tuples of float values.
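For a self-contained run, the example frame can be rebuilt from the question's data (indices and tuples copied from above):
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'Coordinate': [(1150.0, 1760.0), (1260.0, 1910.0), (1030.0, 2070.0),
                    (1170.0, 2300.0), (790.0, 2260.0), (750.0, 2030.0),
                    (490.0, 2130.0), (360.0, 1980.0), (40.0, 2090.0),
                    (630.0, 1660.0), (590.0, 1390.0)]},
    index=[1, 28, 6, 12, 9, 5, 26, 29, 3, 2, 20])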
# Convert elements of 'Coordinate' into numpy array
df.Coordinate = df.Coordinate.apply(np.array)
# Subtract +/- 1 shifted values from original 'Coordinate'
a = df.Coordinate - df.Coordinate.shift(1)
b = df.Coordinate - df.Coordinate.shift(-1)
# take row-wise dot product based on the arrays a, b
df['dotProduct'] = [np.dot(x, y) for x, y in zip(a, b)]
# make 'Coordinate' tuple again (if you want)
df.Coordinate = df.Coordinate.apply(tuple)
Now I get this as df:
Coordinate dotProduct
1 (1150.0, 1760.0) NaN
28 (1260.0, 1910.0) 1300.0
6 (1030.0, 2070.0) -4600.0
12 (1170.0, 2300.0) 62400.0
9 (790.0, 2260.0) -24400.0
5 (750.0, 2030.0) 12600.0
26 (490.0, 2130.0) -18800.0
29 (360.0, 1980.0) -25100.0
3 (40.0, 2090.0) 236100.0
2 (630.0, 1660.0) -92500.0
20 (590.0, 1390.0) NaN
Although I am not new to python, I am a very infrequent user and get out of my depth pretty quickly. I’m guessing there is a simple solution to this which I am totally missing.
I have an image which is stored as a 2D numpy array. Rather than 3 values as with an RGB image, each pixel contains 6 numbers. The pixels (indexed by their XY coordinates) are stored as the rows of the array, while the 6 columns correspond to the different wavelength values. Currently I can call up any single pixel and see what its wavelength values are, but I would like to add up the values over a range of pixels.
I know summing arrays is straightforward, and this is essentially what I'm looking to achieve over an arbitrary range of specified pixels in the image.
a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([1, 2, 3, 4, 5, 6])
sum = a + b  # gives [2, 4, 6, 8, 10, 12]
I'm guessing I need a loop for specifying the range of input pixels I would like summed. Note, as with the example above, I'm not trying to sum the 6 values of one array, but rather to sum all the first elements, all the second elements, etc. However, I think what is happening is that the loop runs over the specified pixel range but doesn't carry the values over to be added to the next pixel. I'm not sure how to overcome this.
What I have so far is:
fileinput = np.load('C:\\Users\\image_data.npy')  # this is the image; each pixel is 6 values
exampledata = np.arange(42).reshape((7, 6))
for x in range(1, 4):
    signal = exampledata[x]
    print(exampledata[x])  # prints each of the arrays I would like to sum, i.e. the list of 6 values for each pixel in the specified range
    sum(exampledata[x])  # sums up the values within each list, rather than all the first elements, all the second elements, etc.
exampledata.sum(axis=1)  # produces the error: AxisError: axis 1 is out of bounds for array of dimension 1
I suppose I could sum up a small range of pixels manually, though this only works for small ranges of pixels.
first=fileinput[1]
second=fileinput[2]
third=fileinput[3]
sum=first+second+third
This should work
exampledata = np.arange(42).reshape((7,6))
# Sum data along axis 0:
sum = exampledata[0:len(exampledata)].sum(0)
print(sum)
Initial 2D array:
[[ 0 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 22 23]
[24 25 26 27 28 29]
[30 31 32 33 34 35]
[36 37 38 39 40 41]]
Output:
[126 133 140 147 154 161]
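If you only want a sub-range of pixels, as in the first/second/third example in the question, the same idea applies to a slice (a minimal sketch, reusing exampledata from above):
partial = exampledata[1:4].sum(axis=0)  # rows 1, 2 and 3, summed column-wise
print(partial)  # [36 39 42 45 48 51]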
I have a data frame, from which I can select a column (series) as follows:
df:
value_rank
275488 90
275490 35
275491 60
275492 23
275493 23
275494 34
275495 75
275496 40
275497 69
275498 14
275499 83
... ...
value_rank is a previously created percentile rank from a larger data set. What I am trying to do is create bins of this data set, e.g. quintiles:
pd.qcut(df.value_rank, 5, labels=False)
275488 4
275490 1
275491 3
275492 1
275493 1
275494 1
275495 3
275496 2
... ...
This appears fine, as expected, but it isn't.
In fact, I have 1569 observations. The nearest number divisible by 5 bins is 1565, which would give 1565 / 5 = 313 observations in each bin. There are 4 extra records, so I would expect 4 bins with 314 observations and one with 313. Instead, I get this:
obs = pd.qcut(df.value_rank, 5, labels=False)
obs.value_counts()
0 329
3 314
1 313
4 311
2 302
I have no nans in df, and cannot think of any reason why this is happening. Literally beginning to tear my hair out!
Here is a small example:
df:
value_rank
286742 11
286835 53
286865 40
286930 31
286936 45
286955 27
287031 30
287111 36
287269 30
287310 18
pd.qcut gives this:
pd.qcut(df.value_rank, 5, labels = False).value_counts()
bin count
1 3
4 2
3 2
0 2
2 1
There should be 2 observations in each bin, not 3 in bin 1 and 1 in bin 2!
qcut is trying to compensate for repeating values. This is easier to visualize if you return the bin limits along with your qcut results:
In [42]: test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
In [43]: test_series = pd.Series(test_list, name='value_rank')
In [49]: pd.qcut(test_series, 5, retbins=True, labels=False)
Out[49]:
(array([0, 0, 1, 1, 1, 2, 3, 3, 4, 4]),
array([ 11. , 25.2, 30. , 33. , 41. , 53. ]))
You can see that there was no choice but to set the bin limit at 30, so qcut had to "steal" one of the expected values from the third bin and place it in the second. I'm thinking that this is just happening at a larger scale with your percentiles, since you're basically condensing their ranks into a 1-to-100 scale. Any reason not to just run qcut directly on the data instead of on the percentiles, or to return percentiles that have greater precision?
Just try the code below:
pd.qcut(df['value_rank'].rank(method='first'), nbins)
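Applied to the sample data from the answer above (with the bin count filled in as an assumption), this yields equal-sized bins:
import pandas as pd
test_series = pd.Series([11, 18, 27, 30, 30, 31, 36, 40, 45, 53], name='value_rank')
# rank(method='first') breaks ties by order of appearance, so qcut can split the values evenly
pd.qcut(test_series.rank(method='first'), 5, labels=False).value_counts()
# each of the 5 labels now appears exactly twice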
If you must get equal (or nearly equal) bins, then here's a trick you can use with qcut. Using the same data as the accepted answer, we can force these into equal bins by adding some random noise to the original test_list and binning according to those values.
test_list = [ 11, 18, 27, 30, 30, 31, 36, 40, 45, 53 ]
np.random.seed(42) #set this for reproducible results
test_list_rnd = np.array(test_list) + np.random.random(len(test_list)) #add noise to data
test_series = pd.Series(test_list_rnd, name='value_rank')
pd.qcut(test_series, 5, retbins=True, labels=False)
Output:
(0 0
1 0
2 1
3 2
4 1
5 2
6 3
7 3
8 4
9 4
Name: value_rank, dtype: int64,
array([ 11.37454012, 25.97573801, 30.42160255, 33.11683016,
41.81316392, 53.70807258]))
So now we have two of each label: two 0's, two 1's, two 2's, two 3's and two 4's!
Disclaimer
Obviously, use this at your discretion, because results can vary based on your data, for instance how large your data set is and/or how the values are spaced. The above "trick" works well for integers because, even though we are "salting" the test_list, it will still rank-order correctly, in the sense that there won't be a value in group 0 greater than a value in group 1 (maybe equal, but not greater). If, however, you have floats, this may be tricky and you may have to reduce the size of the noise accordingly. For instance, if you had floats like 2.1, 5.3, 5.3, 5.4, etc., you should reduce the noise by dividing by 10: np.random.random(len(test_list)) / 10. If you have arbitrarily long floats, however, you probably would not have this problem in the first place, given the noise already present in "real" data.
This problem arises from duplicate values. A possible solution to force equal-sized bins is to use the index as the input for pd.qcut after sorting the dataframe:
import random
import pandas as pd
df = pd.DataFrame({'A': [random.randint(3, 9) for x in range(20)]}).sort_values('A').reset_index()
del df['index']
df = df.reset_index()
df['A'].plot.hist(bins=30);
picture: https://i.stack.imgur.com/ztjzn.png
df.head()
df['qcut_v1'] = pd.qcut(df['A'], q=4)
df['qcut_v2'] = pd.qcut(df['index'], q=4)
df
picture: https://i.stack.imgur.com/RB4TN.png
df.groupby('qcut_v1').count().reset_index()
picture: https://i.stack.imgur.com/IKtsW.png
df.groupby('qcut_v2').count().reset_index()
picture: https://i.stack.imgur.com/4jrkU.png
sorry I cannot post images since I don't have at least 10 reputation on stackoverflow -.-