Python: Combining 2D arrays with 1 common column that has different values

I want to combine two arrays which represent a curve where the variable is column 1; however, the column 0 values do not always match:
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
I want to combine these into one array where column 0 is merged and both column 1s are stacked, with gaps where there is no value for the corresponding column 0 value, something like this:
arr3 = np.array([(10, None, 1020), (12, 1003, None), (17, 900, 902), (19, None, 870), (20, 810, None), (21, None, 750)])
The reason for this is that I want to get mean values of the second column for each array, but they are not at exactly the same column 0 values. The idea of creating this array is to then interpolate to replace all the None values, then create mean values from columns 1 and 2 and have an extra column to represent that.
I have used NumPy for everything else so far but have got stuck with the np.column_stack function, as it needs lists of the same length and is also blind to stacking based on values from column 0. Lastly, I do not want to create a fit for the data, as the actual data is non-linear and possibly not consistent, so a fit will not work and interpolation seems like the most accurate method.
There may be an answer already, but because I don't know how to describe it well I can't find it. Also, I am relatively new to Python, so please don't make any assumptions about my knowledge other than it is very little.
Thank you.

Will this help?
import pandas
import numpy as np
arr1= np.array([(12,1003),(17,900),(20,810)])
arr2= np.array([(10,1020),(17,902),(19,870),(21,750)])
d1 = pandas.DataFrame(arr1)
d2 = pandas.DataFrame(arr2)
d1.columns = d2.columns = ['t', 'v']
d3 = d1.merge(d2, on='t', how='outer')
print(d3.values)
# use d3.to_numpy() to convert to a NumPy array
output
[[  12. 1003.   nan]
 [  17.  900.  902.]
 [  20.  810.   nan]
 [  10.   nan 1020.]
 [  19.   nan  870.]
 [  21.   nan  750.]]
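Building on that merge, a rough sketch of the follow-up step the question describes (interpolate the gaps, then average the two columns) could look like this; the column names v1/v2 and the choice of method='index' are my own assumptions, not part of the answer above:
import numpy as np
import pandas as pd

arr1 = np.array([(12, 1003), (17, 900), (20, 810)])
arr2 = np.array([(10, 1020), (17, 902), (19, 870), (21, 750)])

d1 = pd.DataFrame(arr1, columns=['t', 'v1'])
d2 = pd.DataFrame(arr2, columns=['t', 'v2'])

# Outer merge on the shared column, sort by t and use it as the index
d3 = d1.merge(d2, on='t', how='outer').sort_values('t').set_index('t')

# Linear interpolation based on the actual t values (leading NaNs stay NaN)
d3 = d3.interpolate(method='index')

# Row-wise mean of the two curves, ignoring any remaining NaNs
d3['mean'] = d3[['v1', 'v2']].mean(axis=1)

print(d3.reset_index().to_numpy())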

Related

Change 0 values to nan values in numpy array changes everything to nan

I have a NumPy array with the shape (1212, 2117). The array contains pixels with value 0 or values greater than 0.
I want to give the 0 pixels a no-data value. I have tried to do it this way:
arr=arr.astype('float')
arr[arr==0]=np.nan
The result seems to be a chart that is all NaN, with one little square:
plt.imshow(test)
However, it seems like all the values were changed: if I check the max or min value of this array I get NaN:
test.max()
>>>nan
test.min()
>>>nan
I would like to understand why I get this result and how I can correctly give a no-data value to pixels with value 0.
You have the reason and solution in the docs (Notes section).
NaN values are propagated, that is if at least one item is NaN, the
corresponding max value will be NaN as well. To ignore NaN values
(MATLAB behavior), please use nanmax.
np.nanmax(arr)
# and
np.nanmin(arr)
Should give the expected result.
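A minimal self-contained demonstration of the difference, on a toy 2x2 array rather than the original raster:
import numpy as np

arr = np.array([[0, 5], [3, 0]], dtype=float)
arr[arr == 0] = np.nan        # mark the zero pixels as no-data

print(arr.max())              # nan -- NaN propagates through max/min
print(np.nanmax(arr))         # 5.0 -- ignores the NaN entries
print(np.nanmin(arr))         # 3.0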
Make a list: iterate through every pixel, check if it is NaN and append 0 if so, otherwise append the number. Then call np.array on the list.
Although it is unpythonic, it may get the job done.
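For reference, a quick sketch (toy data, my own variable names) of that list-based approach next to the vectorized equivalent:
import numpy as np

arr = np.array([[np.nan, 5.0], [3.0, np.nan]])

# Literal version of the suggestion above: flatten, replace NaN with 0, rebuild
flat = [0 if np.isnan(v) else v for v in arr.ravel()]
restored = np.array(flat).reshape(arr.shape)

# Vectorized equivalent, which avoids the Python-level loop entirely
restored_fast = np.where(np.isnan(arr), 0, arr)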

Element-by-element division in pandas dataframe with "/"?

It would be great to understand how this actually works. Perhaps there is something in Python/pandas that I don't quite understand.
I have a dataframe (price data) and would like to calculate the returns. Rows are the stocks while columns are the dates.
For simplicity, I have created the prices with some random numbers.
import pandas as pd
import numpy as np
df_price = pd.DataFrame(np.random.rand(10,10))
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1]-1
There are two things I find strange here:
1. My numerator and denominator are both 10 x 9. Why is the output 10 x 10, with the first column being NaNs?
2. Why are the results all 0, besides the first column being NaNs? I.e. why didn't the calculation perform?
Thanks.
When we do the division, pandas first matches the index and columns of df_price.iloc[:,1:] and df_price.iloc[:,:-1], so we need to add .values to one side to remove the index and column matching; then the output is what we expected.
df_ret = df_price.iloc[:,1:]/df_price.iloc[:,:-1].values-1
Example
s=pd.Series([2,4,6])
s.iloc[1:]/s.iloc[:-1]
Out[54]:
0    NaN   # index 0 appears only in s.iloc[:-1]
1    1.0
2    NaN   # index 2 appears only in s.iloc[1:]
dtype: float64
From the above we can see that pandas objects match on the index first, more like an outer join.
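A small reproducible check of the shapes involved (my own variable names), contrasting the label-aligned division with the positional one using .values:
import numpy as np
import pandas as pd

df_price = pd.DataFrame(np.random.rand(10, 10))

# Label-aligned division: pandas matches column labels, so the overlapping
# columns divide themselves (giving 0) and the non-overlapping ones give NaN.
aligned = df_price.iloc[:, 1:] / df_price.iloc[:, :-1] - 1

# Positional division: .values strips the labels from the denominator,
# so the two 10 x 9 blocks are divided element by element.
df_ret = df_price.iloc[:, 1:] / df_price.iloc[:, :-1].values - 1

print(aligned.shape)   # (10, 10) -- union of the two column sets
print(df_ret.shape)    # (10, 9)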

Numpy: multiplying with NaN values without using nan_to_num

I was able to optimise some operations in my program quite a bit using numpy. When I profile a run, I noticed that most of the time is spent in numpy.nan_to_num. I'd like to improve this even further.
The sort of calculations occurring are multiplication of two arrays for which one of the arrays could contain nan values. I want these to be treated as zeros, but I can't initialise the array with zeros, as nan has a meaning later on and can't be set to 0. Is there a way of doing multiplications (and additions) with nan being treated as zero?
From the nan_to_num docstring, I can see a new array is produced which may explain why it's taking so long.
Replace nan with zero and inf with finite numbers.
Returns an array or scalar replacing Not a Number (NaN) with zero,...
A function like nansum for arbitrary arithmetic operations would be great.
Here's some example data:
import numpy as np
a = np.random.rand(1000, 1000)
a[a < 0.1] = np.nan # set some random values to nan
b = np.ones_like(a)
One option is to use np.where to set the value of the result to 0 wherever one of your arrays is equal to NaN:
result = np.where(np.isnan(a), 0, a * b)
If you have to do several operations on an array that contains NaNs, you might consider using masked arrays, which provide a more general method for dealing with missing or invalid values:
masked_a = np.ma.masked_invalid(a)
result2 = masked_a * b
Here, result2 is another np.ma.masked_array whose .mask attribute is set according to where the NaN values were in a. To convert this back to a normal np.ndarray with the masked values replaced by 0s, you can use the .filled() method, passing in the fill value of your choice:
result_filled = result2.filled(0)
assert np.all(result_filled == result)
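If the temporary arrays from a * b or nan_to_num are the bottleneck, one further option (a sketch I'm adding, not part of the answer above) is the where= argument that NumPy ufuncs accept, which skips the NaN positions entirely:
import numpy as np

a = np.random.rand(1000, 1000)
a[a < 0.1] = np.nan
b = np.ones_like(a)

# Only compute the product where a is not NaN; the pre-filled zeros in `out`
# are left untouched at the NaN positions.
result = np.multiply(a, b, out=np.zeros_like(a), where=~np.isnan(a))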

setting null values in a numpy array

How do I null certain values in a NumPy array based on a condition?
I don't understand why I end up with 0 instead of null or empty values where the condition is not met... b is a numpy array populated with 0 and 1 values, c is another fully populated numpy array. All arrays are 71x71x166
a = np.empty((71, 71, 166))
d = np.empty((71, 71, 166))
for indexes, value in np.ndenumerate(b):
    i, j, k = indexes
    a[i, j, k] = np.where(b[i, j, k] == 1, c[i, j, k], d[i, j, k])
I want to end up with an array which only has values where the condition is met and is empty everywhere else, but without changing its shape.
FULL ISSUE FOR CLARIFICATION as asked for:
I start with a float populated array with shape (71,71,166)
I make an int array based on a cutoff applied to the float array, basically creating a number of bins, roughly marking out 10 areas within the array with 0 values in between
What I want to end up with is an array with shape (71,71,166) which has the average values in a particular array direction (assuming vertical direction, if you think of a 3D array as a 3D cube) of a certain "bin"...
so I was trying to loop through the "bins" b == 1, b == 2 etc, sampling the float where that condition is met but being null elsewhere so I can take the average, and then recombine into one array at the end of the loop....
Not sure if I'm making myself understood. I'm using np.where with explicit indexing because I keep getting errors when I try to do it without, although it feels very inefficient.
Consider this example:
import numpy as np
data = np.random.random((4,3))
mask = np.random.randint(0, 2, (4, 3))
data[mask == 0] = np.nan
The data will be set to nan wherever the mask is 0. You can use any kind of condition you want, of course, or do something different for different values in b.
To erase everything except a specific bin, try the following:
c[b != 1] = np.nan
So, to make a copy of everything in a specific bin:
a = np.copy(c)
a[b != 1] = np.nan
To get the average of everything in a bin:
np.mean(c[b==1])
So perhaps this might do what you want (where bins is a list of bin values):
a = np.empty(c.shape)
a[b == 0] = np.nan
for bin in bins:
    a[b == bin] = np.mean(c[b == bin])
np.empty sometimes fills the array with 0's; the contents of an empty() array are undefined, so 0 is perfectly valid. For example, try this instead:
d = np.nan * np.empty((71, 71, 166))
But consider using numpy's strength, and don't iterate over the array:
a = np.where(b, c, d)
(since b is 0 or 1, I've excluded the explicit comparison b == 1.)
You may even want to consider using a masked array instead:
a = np.ma.masked_where(b, c)
which seems to make more sense with respect to your question: "how do I null certain values in a numpy array based on a condition" (replace null with mask and you're done).
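To tie this together, here is a small runnable sketch of the binned-average idea from above, using toy shapes and bin labels rather than the real (71, 71, 166) data:
import numpy as np

rng = np.random.default_rng(0)
c = rng.random((4, 4, 6))                  # the float data
b = rng.integers(0, 3, size=(4, 4, 6))     # bin labels; 0 means "outside any bin"

a = np.full(c.shape, np.nan)               # NaN everywhere the condition is not met
for bin_value in (1, 2):
    a[b == bin_value] = np.mean(c[b == bin_value])

print(np.unique(a[~np.isnan(a)]))          # one mean value per bin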

Composite a numpy array/matrix based on column values and variables?

I'm playing with NumPy and Scipy and I'm having trouble finding a feature in the documentation. I was thus wondering if anyone could help.
Suppose I have an array in NumPy with two columns and k rows. One column serves as a numerical indicator (e.g. 2 = male, 1 = female, 0 = unknown) while the second column is perhaps a list of values or scores.
Let's say that I want to find the standard deviation (could be mean or whatever, I just want to apply a function) of the values for all rows with indicator 0, and then for 1, and finally, 2.
Is there a predefined function to composite this for me?
In R, the equivalent can be found in the plyr package. Does NumPy and/or Scipy have an equivalent, or am I stuck creating a mask for this array and then somehow filtering through this mask and then applying my function?
As always, thanks for your help!
If I understand your description, you have a dataset something like this:
In [79]: x=np.random.randint(0,3,size=100)
In [80]: y=np.random.randint(0,100,size=100)
In [81]: d=np.vstack([x,y]).T
In [88]: print(d[:5, :])
[[ 0 43]
 [ 1 60]
 [ 2 60]
 [ 1  4]
 [ 0 30]]
In this situation numpy.unique can be used to generate an array of unique "key" values:
In [82]: idx=np.unique(d[:,0])
In [83]: print(idx)
[0 1 2]
and those values can then be used to drive a generator expression like this:
In [113]: g = (d[np.where(d[:,0]==val), 1].std() for val in idx)
The generator g will emit the standard deviation of all the entries in d which match each entry in the index. numpy.fromiter can then be used to collect the results:
In [114]: print(np.vstack([idx, np.fromiter(g, dtype=float)]).T)
[[  0.          26.87376385]
 [  1.          29.41046084]
 [  2.          24.2477246 ]]
Note that there is a conversion of the keys to floating point in the last step during stacking; you might not want that depending on your data, but I did it just for illustrative purposes to have a "nice" looking final result to post.
You can use masked array operations for that.
http://docs.scipy.org/doc/numpy/reference/maskedarray.html#maskedarray
To create the mask, you can use the numpy.where function, like so:
male_mask = numpy.where(a[:,0]==2, False, True)
female_mask = numpy.where(a[:,0]==1, False, True)
Then, remember to use the special functions from numpy.ma:
http://docs.scipy.org/doc/numpy/reference/routines.ma.html
male_average = numpy.ma.average(numpy.ma.array(a[:,1], mask=male_mask))
EDIT: actually, this works just as well:
numpy.ma.average(numpy.ma.array(a[:,1], mask=a[:,0]!=value))
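For completeness, the same per-indicator statistic written as a compact Python 3 sketch over toy data (my own variable names), equivalent to the generator approach above:
import numpy as np

rng = np.random.default_rng(1)
a = np.column_stack([rng.integers(0, 3, 100),     # indicator column: 0, 1 or 2
                     rng.integers(0, 100, 100)])  # value/score column

keys = np.unique(a[:, 0])
stds = np.array([a[a[:, 0] == k, 1].std() for k in keys])
print(np.column_stack([keys, stds]))              # one row per indicator value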
