Correlation table - python

Suppose you have hundreds of numpy arrays and you want to calculate the correlation between each pair of them. I calculated it with nested for loops, but execution took a huge amount of time (20 minutes!). One way to make the calculation more efficient is to compute only one half of the correlation table, copy it to the other half, and set the diagonal to 1, since correlation(x,y) = correlation(y,x) and correlation(x,x) is always equal to 1. Even with these shortcuts, however, the code still takes a long time (approx. 7-8 minutes). Any other suggestions?
My code
import numpy as np

for x in data_set:
    for y in data_set:
        correlation = np.corrcoef(x, y)[1][0]

I am quite sure you can achieve much faster results by creating a single 2-D array and calculating its correlation matrix in one call (as opposed to calculating pairwise correlations one by one).
From numpy's corrcoef documentation the input can be:
" 1-D or 2-D array containing multiple variables and observations. Each row of m represents a variable, and each column a single observation of all those variables."
https://docs.scipy.org/doc/numpy/reference/generated/numpy.corrcoef.html
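For example, a minimal sketch along those lines (the array sizes are made up; data_set stands in for the list of 1-D arrays from the question):

import numpy as np

# Hypothetical stand-in for the hundreds of arrays from the question.
data_set = [np.random.rand(50) for _ in range(200)]

# Stack the 1-D arrays into a single 2-D array: one row per variable.
stacked = np.vstack(data_set)

# corrcoef treats each row as a variable and each column as an observation,
# so this returns the full 200x200 correlation matrix in one vectorized call.
corr_matrix = np.corrcoef(stacked)

# corr_matrix[i, j] is the correlation between data_set[i] and data_set[j];
# the diagonal is 1 by construction, and the matrix is symmetric.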

Related

Easy way to cluster entries in a numpy array with a condition

Background: I have a numpy array of float entries. This is basically a set of observations of something; suppose it is temperature measured over 24 hours. Imagine that the person recording the temperature is not available for the entire day; instead he/she takes a few (say 5) readings during one hour, and again a few hours later takes more readings (say 8). All the measurements are put into a single np.array and handed over to me!
Problem: I have no idea when the readings were taken. So I decide to cluster the observations in the following way: first recognize local peaks in the array, and group together all entries that are close enough (within a chosen tolerance, say 1 deg). In other words, I want to split the array into a list of sub-arrays, where every entry belongs to exactly one group.
One possible approach: first sort the array, then split it into sub-arrays with two conditions: (1) within a sub-array, the difference between the first and last entries is no more than 1 deg, and (2) the difference between the last entry of a sub-array and the first entry of the next sub-array is greater than 1 deg. How can I achieve this fast (the numpy way)? One way to do the splitting is sketched below.
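A minimal sketch of the sort-and-split idea, assuming the simpler rule of splitting wherever the gap between consecutive sorted values exceeds the tolerance (the readings below are made up):

import numpy as np

readings = np.array([20.1, 25.3, 20.4, 24.9, 20.2, 25.1, 19.9])  # hypothetical temperatures
tol = 1.0                                                        # chosen tolerance in degrees

s = np.sort(readings)
# Split after every position where the jump to the next sorted value exceeds the tolerance.
split_points = np.where(np.diff(s) > tol)[0] + 1
groups = np.split(s, split_points)
print(groups)   # [array([19.9, 20.1, 20.2, 20.4]), array([24.9, 25.1, 25.3])]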

Optimization of a matrix where each column must sum to 1

I have a matrix where the values in each column need to be optimised such that the error with respect to a given equation is minimised.
Each number in the matrix must be a decimal between zero and one, inclusive.
Each column needs to sum to 1, so each number represents a proportion of something.
For this, I'm using scipy.optimize.minimize() with bounds on all the values so that they fit these restraints. The matrix is flattened so it can be used in the optimize function.
The resulting optimised matrix fits the bounds but fails to achieve the necessary constraint of each column summing to 1.
What can I do to make sure that each column sums to one with this function, or do you have a suggestion for a better optimizer? (One possible approach is sketched after the output below.)
[Output omitted: the finished minimisation, the optimised matrix, and its summed-up columns, which should each be 1 but are not.]
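One way to enforce the column sums directly is to pass equality constraints to minimize() alongside the bounds. A minimal sketch, assuming a hypothetical 3x2 matrix and a placeholder objective (the real error function from the question is not shown):

import numpy as np
from scipy.optimize import minimize

n_rows, n_cols = 3, 2                        # hypothetical shape
x0 = np.full(n_rows * n_cols, 1.0 / n_rows)  # flattened starting matrix

def objective(x):
    m = x.reshape(n_rows, n_cols)
    return np.sum((m - 0.5) ** 2)            # placeholder error term

# One equality constraint per column: its entries must sum to 1.
constraints = [
    {'type': 'eq',
     'fun': lambda x, j=j: x.reshape(n_rows, n_cols)[:, j].sum() - 1.0}
    for j in range(n_cols)
]

res = minimize(objective, x0, method='SLSQP',
               bounds=[(0.0, 1.0)] * (n_rows * n_cols),
               constraints=constraints)
print(res.x.reshape(n_rows, n_cols).sum(axis=0))  # each column sum should be ~1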

How to arrange a loop in order to loop over columns and then do something

I'm a complete newbie to Python, and I'm currently working on a problem that requires me to take the average of each column, except that the number of columns is unknown.
I figured out how to do it when I know how many columns there are, doing each calculation separately. I'm supposed to do it by creating an empty list and looping the column averages back into it.
import numpy as np

# average of all data, not including NaN
def average(dataset):
    return np.mean(dataset[np.isfinite(dataset)])

# this is how I did it with each column separate
dataset = np.genfromtxt("some file")
print(average(dataset[:, 0]))
print(average(dataset[:, 1]))

# what I'm trying to do with a loop
def avg(dataset):
    for column in dataset:
        lst = []
        column = ...  # I'm not sure how to define how many columns I have
        Avg = average(column)
        return Avg
You can use the numpy.mean() function:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.mean.html
with:
np.mean(my_data, axis=0)
The axis argument indicates whether you take the average along columns or rows (axis=0 means you take the average of each column, which is what you are trying to do). The output is a vector whose length equals the number of columns (or rows) you averaged over, and each element is the average of the corresponding column (or row). You do not need to know the shape of the matrix in advance to do this.
You CAN do this using a for loop, but it's not a good idea: looping over matrices in numpy is slow, whereas vectorized operations like np.mean() are very fast. So in general, when using numpy, try to use those built-in operations instead of looping over everything, whenever possible.
Also, if you want the number of columns in your matrix: my_matrix.shape[1] returns the number of columns, and my_matrix.shape[0] the number of rows.
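Putting the pieces together, a minimal sketch (np.nanmean is used here because the question excludes NaN entries; the small array is a made-up stand-in for the file data):

import numpy as np

# Hypothetical data with a NaN, standing in for np.genfromtxt("some file")
dataset = np.array([[1.0, 2.0],
                    [3.0, np.nan],
                    [5.0, 6.0]])

col_means = np.nanmean(dataset, axis=0)   # average of each column, ignoring NaN
print(col_means)                          # [3. 4.]
print(dataset.shape[1])                   # number of columns: 2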

physical dimensions and array dimensions

If I have a rainfall map which has three dimensions (latitude, longitude and rainfall value), and I put it in an array, do I need a 2D or 3D array? What would the array look like?
If I have a series of daily rainfall maps with four dimensions (lat, long, rainfall value and time), and I put them in an array, do I need a 3D or 4D array?
I am thinking that I would need 2D and 3D arrays, respectively, because the latitude and longitude can be represented by a 1D array alone (but reshaped so that it has more than one row and column). Enlighten me please.
I think the propositions from both #Genhis and #Bitzel are right, depending on what you want to do.
If you want to be effective, I would recommend putting both datasets in a 2D data structure, and I would specifically advise a pandas DataFrame, which stores your data in a matrix-like structure but lets you use multiple indexes if you need to "think" in 3D or 4D.
It will be especially helpful with the second kind of data you mention, (lat, long, rainfall value and time), since that is a time series. Pandas has many methods to help you average over some period of time, and you can also group your data by longitude, latitude or location if needed (a short sketch follows this answer).
On the other hand, if your objective is to learn how to compute these numbers in Python, then you can use 2D arrays for the first case and 2D or 3D for the second, as the previous answers recommend. You could use numpy arrays as the data structure instead of pure Python lists, but that's debatable.
One important point: choosing 3D arrays for the time series, as #Genhis proposes, would require you to convert times into indexes (through lookup tables or a hash function), which takes some extra work.
As I said, you could also read about tidy, wide and long formats if you want to learn more about these questions.
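A minimal sketch of that pandas approach, with made-up values (the column names and grid points are assumptions, not from the question):

import pandas as pd

# Long-format table: one row per (time, lat, lon) observation.
df = pd.DataFrame({
    'time': pd.to_datetime(['2020-01-01', '2020-01-01', '2020-01-02', '2020-01-02']),
    'lat':  [10.0, 10.5, 10.0, 10.5],
    'lon':  [20.0, 20.0, 20.0, 20.0],
    'rainfall': [1.2, 0.0, 3.4, 2.1],
})

# A multi-index lets you "think" in 3D while keeping a 2D table.
indexed = df.set_index(['time', 'lat', 'lon'])

# Average rainfall per day, and per location.
print(df.groupby('time')['rainfall'].mean())
print(df.groupby(['lat', 'lon'])['rainfall'].mean())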
For the rainfall map, the values you're describing are (latitude, longitude, rainfall value), so you need a 2D array (matrix), since all you need is 3 columns and one row per observation. [image: rainfall matrix]
For the values (lat, long, rainfall value, time) it's the same case: you need a 2D array with 4 columns and a number of rows. [image: rainfall matrix with a time column]
I believe that the rainfall value shouldn't be a dimension. Therefore, you could use 2D array[lat][lon] = rainfall_value or 3D array[time][lat][lon] = rainfall_value respectively.
If you want to reduce number of dimensions further, you can combine latitude and longitude into one dimension as you suggested, which would make arrays 1D/2D.
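A minimal sketch of that layout (the grid size and values are made up):

import numpy as np

n_lat, n_lon, n_time = 4, 5, 3             # hypothetical grid and time steps

# 2D: one rainfall value per (lat, lon) grid cell.
rain_2d = np.zeros((n_lat, n_lon))
rain_2d[1, 3] = 2.5                        # rainfall at lat index 1, lon index 3

# 3D: one such grid per time step.
rain_3d = np.zeros((n_time, n_lat, n_lon))
rain_3d[0, 1, 3] = 2.5                     # same cell at the first time step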

(Randomly?) find an amount by summing a 2D array

I have a 2D array with two columns: an index and a numerical value.
When I sum this 2D array I get an amount (let's say "a").
I am provided with another amount (let's say "b", where a != b, and b is the target), and the granularity isn't fine enough to single out one row from another.
The idea here is to try to find all the rows that compose b and discard the others.
What I am trying to do is build a script that (randomly?) selects rows and sums them until it approaches (reduces the distance to) the targeted sum.
I was first thinking about starting at a random point and from there trying to sum each combination of rows, adding them up until
-> I have something close enough, or
-> the set number of iterations is over (1 million?)
... but with the number of rows involved this won't fit in memory.
Any ideas? (One possible randomized approach is sketched below.)
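This is essentially a subset-sum style search. A minimal sketch of one possible randomized greedy approach (the values and the iteration budget are made up; nothing here is from the question):

import numpy as np

rng = np.random.default_rng(0)
values = rng.uniform(0, 100, size=10_000)   # hypothetical row values
target = 150_000.0                          # the amount "b"

best_rows, best_err = None, np.inf
for _ in range(2_000):                      # fixed iteration budget
    order = rng.permutation(len(values))    # random row order
    csum = np.cumsum(values[order])
    k = np.searchsorted(csum, target)       # rows order[:k] keep the running sum under the target
    total = csum[k - 1] if k > 0 else 0.0
    err = abs(target - total)
    if err < best_err:                      # keep the best selection seen so far
        best_err, best_rows = err, order[:k]
print(best_err, len(best_rows))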
