Background: I have a numpy array of float entries. This is basically a set of observations of something, say temperature measured during 24 hours. Imagine that the person who records the temperature is not available for the entire day; instead, he/she takes a few (say 5) readings during one hour and, again after a few hours, takes more readings (say 8 of them). He/she puts all the measurements into a single np.array and has handed it over to me!
Problem: I have no idea when the readings were taken. So I decide to cluster the observations in the following way: perhaps first recognize local peaks in the array, and group together all entries that are close enough (within a chosen tolerance, say 1 deg); that is, I want to split the array into a list of sub-arrays. Note that any entry should belong to exactly one group.
One possible approach: First sort the array, then split it into sub-arrays with two conditions: (1) the difference between the first and last entries of a sub-array is not more than 1 deg, and (2) the difference between the last entry of a sub-array and the first entry of the next sub-array is greater than 1 deg. How can I achieve this fast (the numpy way)?
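For illustration, here is a rough numpy sketch (on made-up readings) of the grouping I have in mind; it splits the sorted array wherever the gap between consecutive values exceeds the tolerance, which enforces condition (2) directly, though condition (1) could still be violated by a long chain of closely spaced readings:

import numpy as np

# made-up readings; the tolerance of 1 deg is the one chosen above
temps = np.array([20.1, 25.0, 20.4, 25.6, 20.3, 25.2, 30.0])
tol = 1.0

srt = np.sort(temps)
# split wherever the gap between consecutive sorted readings exceeds the tolerance
groups = np.split(srt, np.where(np.diff(srt) > tol)[0] + 1)
# -> [array([20.1, 20.3, 20.4]), array([25. , 25.2, 25.6]), array([30.])]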
I'm a complete newbie to Python, and I'm currently trying to work on a problem that requires me to take the average of each column, except that the number of columns is unknown.
I figured out how to do it when I know how many columns there are, doing each calculation separately. I'm supposed to do it by creating an empty list and looping the columns back into it.
import numpy as np

# average of all data, not including NaN
def average(dataset):
    return np.mean(dataset[np.isfinite(dataset)])

# this is how I did it for each column separately
dataset = np.genfromtxt("some file")
print(average(dataset[:, 0]))
print(average(dataset[:, 1]))

# what I'm trying to do with a loop
def avg(dataset):
    for column in dataset:
        lst = []
        column =   # I'm not sure how to define how many columns I have
        Avg = average(column)
    return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis indicates whether you are taking the average along columns or rows (axis=0 means you take the average of each column, which is what you are trying to do). The output will be a vector with one element per column (or per row, for axis=1), each element being the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this.
You CAN do this with a for loop, but it's not a good idea: looping over matrices in numpy is slow, whereas vectorized operations like np.mean() are very fast. So, in general, when using numpy, try to use these built-in operations rather than looping over everything, whenever possible.
Also, if you want the number of columns in your matrix: my_matrix.shape[1] returns the number of columns, and my_matrix.shape[0] the number of rows.
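For example, with a made-up 4 x 3 matrix:

import numpy as np

# hypothetical 2-D dataset: 4 rows, 3 columns (the number of columns need not be known)
my_data = np.array([[1.0,  2.0,  3.0],
                    [4.0,  5.0,  6.0],
                    [7.0,  8.0,  9.0],
                    [10.0, 11.0, 12.0]])

col_means = np.mean(my_data, axis=0)   # average of each column
print(col_means)                       # [5.5 6.5 7.5]
print(my_data.shape[1])                # number of columns: 3
print(my_data.shape[0])                # number of rows: 4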
I am trying to apply BucketedRandomProjectionLSH's function model.approxNearestNeighbors(df, key, n) to all the rows of a dataframe, in order to approximately find the top n most similar items for every item. My dataframe has 1 million rows.
My problem is that I have to compute this within a reasonable time (no more than 2 hours). I've read about the function approxSimilarityJoin(df, df, threshold), but it takes way too long and doesn't return the right number of rows: if my dataframe has 100,000 rows and I set the threshold VERY high/permissive, I get back not even 10% of that number of rows.
So I'm thinking about using approxNearestNeighbors on every row, so that the computation time is roughly linear.
How do you apply that function to every row of a dataframe? I can't use a UDF, since I need the model plus a dataframe as inputs.
Do you have any suggestions ?
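For context, here is a stripped-down sketch of my setup (the column names, the toy data and the LSH parameters are just placeholders):

from pyspark.sql import SparkSession
from pyspark.ml.feature import BucketedRandomProjectionLSH
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(0, Vectors.dense([1.0, 0.0])), (1, Vectors.dense([0.0, 2.0]))],
    ["id", "features"],
)

brp = BucketedRandomProjectionLSH(inputCol="features", outputCol="hashes",
                                  bucketLength=2.0, numHashTables=3)
model = brp.fit(df)

# works for a single key at a time ...
key = Vectors.dense([1.0, 0.0])
neighbours = model.approxNearestNeighbors(df, key, 5)

# ... whereas the join variant is too slow and returns too few pairs
pairs = model.approxSimilarityJoin(df, df, threshold=10.0)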
I have CSV files which are 1200 rows x 3 columns. The number of rows can vary from as low as 500 to as large as 5000, but the columns remain the same.
I want to create a feature vector from each of these files that keeps the cells/vector length consistent, and thus helps in finding the distance between these vectors.
FILE_1
A, B, C
(267.09669678867186, 6.3664069175720197, 1257325.5809999991),
(368.24070923984374, 9.0808353424072301, 49603.662999999884),
(324.21470826328124, 11.489830970764199, 244391.04699999979),
(514.33452027500005, 7.5162401199340803, 56322.424999999988),
(386.19673340976561, 9.4927110671997106, 175958.77100000033),
(240.09965330898439, 10.3463039398193, 457819.8519411764),
(242.17559998691405, 8.4401674270629901, 144891.51100000029),
(314.23066895664061, 7.4405002593994096, 58433.818999999959),
(933.3073596304688, 7.1564397811889604, 41977.960000000014),
(274.04136473476564, 4.8482465744018599, 48782.314891525479),
(584.2639294320312, 7.90128517150879, 49730.705000000096),
(202.13173096835936, 10.559995651245099, 20847.805144088608),
(324.98563963710939, 2.2546300888061501, 43767.774800000007),
(464.35059935390626, 11.573680877685501, 1701597.3915132943),
(776.28339964687495, 8.7755222320556605, 106882.2469999999),
(310.11652952968751, 10.3175926208496, 710341.19162800116),
(331.19962889492189, 10.7578010559082, 224621.80632433048),
(452.31337752387947, 7.3100395202636701, 820707.26700000139),
(430.16615111171876, 10.134071350097701, 18197.691999999963),
(498.24687010585939, 11.0102319717407, 45423.269964585743),
.....,
.....,
500th row
FILE_2
(363.02781861484374, 8.8369808197021502, 72898.479666666608),
(644.20353882968755, 8.6263589859008807, 22776.78799999999),
(259.25105469882811, 9.8575859069824201, 499615.64068339905),
(410.19474608242189, 9.8795070648193395, 316146.18800000293),
(288.12153809726561, 4.7451887130737296, 58615.577999999943),
(376.25868409335936, 10.508985519409199, 196522.12200000012),
(261.11118895351564, 8.5228433609008807, 32721.110000000026),
(319.98896605312501, 3.2100667953491202, 60587.077000000027),
(286.94926268398439, 4.7687568664550799, 47842.133999999867),
(121.00206177890625, 7.9372291564941397, 239813.20531182736),
(308.19895750820314, 6.0029039382934597, 26354.519000000011),
(677.17011839687495, 9.0299625396728498, 10391.757655172449),
(182.1304913216797, 8.0010566711425799, 145583.55700000061),
(187.06341736972655, 9.9460496902465803, 77488.229000000007),
(144.07867615878905, 3.6044106483459499, 104651.56499999999),
(288.92317015468751, 4.3750333786010698, 151872.1949999998),
(228.2089825326172, 4.4475774765014604, 658120.07628214348),
(496.18831055820311, 11.422966003418001, 2371155.6659999997),
(467.30134398281251, 11.0771179199219, 109702.48440899582),
(163.08418089687501, 5.7271881103515598, 38107.106791666629),
.....,
.....,
3400th row
You can see that there is no correspondence between the two files, i.e. if someone asked you to calculate the distance between these two vectors, it's not possible.
The aim is to be able to interpolate the rows of both files in such a manner that there is consistency across all such files, i.e. when I look up the first row, it should represent the same feature across all the files. Now let's look at FILE_1.
The range of values for the three columns is (considering only 20 rows for the time being):
A: 202.13173096835936 to 933.3073596304688
B: 2.2546300888061501 to 11.573680877685501
C: 18197.691999999963 to 1701597.3915132943
I want to put these points into a 3D array whose grid cells will be 0.1 x 0.1 x 0.1 (or, say, 10 x 10 x 10, or any arbitrary cell size).
But for that to work we need to normalize the data (mean normalization, etc.).
The data we have is 3D, and it needs to be normalized in order to interpolate it into this 3D array. The result need not be 3D; if it is a flat vector, that will also do.
When I said I need to average the points, I meant that if more than two points happen to fall in a cell (which will happen if the cell size is big, e.g. 100 x 100 x 100), then we take the average x, y, z coordinates of those points as the value of that cell.
These interpolated vectors will have the same length and correspondence, because a given position in one vector represents the same point as that position in every other such vector.
NOTE: The min and max ranges for the three coordinates across all files are 100 to 1000, 2 to 12, and 10000 to 2000000.
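A rough sketch of what I have in mind (the 10 x 10 x 10 grid and the global ranges are just the assumptions from the NOTE above):

import numpy as np

MINS = np.array([100.0, 2.0, 10000.0])        # global minima of A, B, C
MAXS = np.array([1000.0, 12.0, 2000000.0])    # global maxima of A, B, C
N_BINS = 10                                   # arbitrary grid size

def to_feature_vector(rows):
    """rows: (N, 3) array of (A, B, C) values from one file."""
    norm = (rows - MINS) / (MAXS - MINS)                   # scale each column to [0, 1]
    idx = np.clip((norm * N_BINS).astype(int), 0, N_BINS - 1)
    flat = np.ravel_multi_index(idx.T, (N_BINS,) * 3)      # one flat index per grid cell

    counts = np.bincount(flat, minlength=N_BINS**3)
    cell_means = np.zeros((N_BINS**3, 3))
    for c in range(3):                                     # average A, B, C of the points in each cell
        sums = np.bincount(flat, weights=rows[:, c], minlength=N_BINS**3)
        cell_means[counts > 0, c] = sums[counts > 0] / counts[counts > 0]
    return cell_means.ravel()                              # same length for every file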
Suppose that you have hundreds of numpy arrays and you want to calculate the correlation between each pair of them. I calculated it with nested for loops, but execution took a huge amount of time (20 minutes!). One way to make this calculation more efficient is to calculate only one half of the correlation table (one side of the diagonal), copy it to the other half, and set the diagonal to 1. What I mean is that correlation(x, y) = correlation(y, x), and correlation(x, x) is always equal to 1. However, even with these corrections, the code still takes a long time (approx. 7-8 minutes). Any other suggestions?
My code
for x in data_set:
    for y in data_set:
        correlation = np.corrcoef(x, y)[1][0]
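For reference, the half-table idea mentioned above would look roughly like this (it only halves the work, so it is still slow for hundreds of arrays):

import numpy as np

n = len(data_set)                 # data_set: sequence of equal-length 1-D arrays
corr = np.eye(n)                  # diagonal is 1 by definition
for i in range(n):
    for j in range(i + 1, n):     # compute one triangle only, then mirror it
        corr[i, j] = corr[j, i] = np.corrcoef(data_set[i], data_set[j])[0, 1]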
I am quite sure you can achieve much faster results by creating a 2-D array and calculating its correlation matrix (as opposed to calculating pairwise correlations one by one).
From numpy's corrcoef documentation, the input can be:
"A 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables."
https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html
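For example, assuming data_set is a sequence of equal-length 1-D arrays:

import numpy as np

stacked = np.vstack(data_set)        # shape: (n_arrays, n_observations)
corr_matrix = np.corrcoef(stacked)   # (n_arrays, n_arrays), all pairs in one call

# corr_matrix[i, j] is the correlation between data_set[i] and data_set[j];
# the matrix is symmetric and its diagonal is 1.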